# Notation

- $n$ documents: vectors of dimension $D$, numbered from $0$ to $n-1$
- Document similarity matrix $S$ of dimension $n \times n$
- Cluster matrix $C$ of dimension $1 \times (n+1)$ (1 row, $n+1$ columns)
    - $C_{i,j} = 1$ iff document $j$ belongs to cluster $i$
    - Initially there is a single cluster
- Similarity threshold $\tau > 0$
- $\text{ones}_{a, b}$ is a matrix of ones of dimension $a \times b$
- $A_{\cdot, j}$ is the $j$-th column of $A$
- $A_{i, \cdot}$ is the $i$-th row of $A$
- $A_{i,j}$ is the entry at position $(i,j)$ of $A$

Initially:

- $S_{i,i} \leftarrow \tau$, for $i = 0, \dots, n-1$ (the diagonal of $S$ holds the similarity threshold)
- Add a new row and column to $S$ filled with $\tau$: $S_{n,\cdot} \leftarrow \tau$ and $S_{\cdot,n} \leftarrow \tau$
- $C_{0,0} \leftarrow 1$ (document 0 is in cluster 0)
- $C_{0, n} \leftarrow 1$ (the extra document $n$, which corresponds to the added row and column of $S$, is in cluster 0)
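
The initial state described above can be sketched in NumPy as follows. The toy similarity values and the name `S_aug` are assumptions made for illustration only; they are not part of the experiment.

```python
import numpy as np

tau = 0.7  # similarity threshold (same value the experiment uses)

# Hypothetical toy similarity matrix for n = 3 documents (made-up values).
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
n = S.shape[0]

# The diagonal of S holds the threshold.
np.fill_diagonal(S, tau)

# Append row n and column n, both filled with tau (the extra document n).
S_aug = np.full((n + 1, n + 1), tau)
S_aug[:n, :n] = S

# A single cluster (row 0) containing document 0 and the extra document n.
C = np.zeros((1, n + 1))
C[0, 0] = 1
C[0, n] = 1
```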


# Online clustering

- $i \leftarrow 0$
- for $j = 1, \dots, n-1$
    - Add a row to $C$ (row $i+1$)
    - $C_{(i+1), n} \leftarrow 1$
    - $T \leftarrow \frac{1}{C \cdot \text{ones}_{(n+1), 1}}$ (element-wise reciprocal; element $k$ of the vector $T$ holds (size of cluster $k$)$^{-1}$)
    - compute $v \leftarrow (C \cdot S_{\cdot, j}) \odot T$ (element-wise product, so $v_k$ is the mean similarity of document $j$ to cluster $k$)
    - set to $0$ the elements of $v$ whose value is $< \tau$ (the new cluster scores exactly $\tau$, so it remains a candidate)
    - $i^* \leftarrow \text{argmax}(v)$
    - $C_{i^*, j} \leftarrow 1$
    - $i \leftarrow i + 1$

The clusters are given by $C$, excluding the last column $n$
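
A minimal sketch of this procedure on a toy matrix may help check the bookkeeping. It assumes the strict comparison `v < tau` used by the experiment code further down, and it preallocates the rows of $C$ instead of appending them. The function name `online_clustering` and the toy values are hypothetical.

```python
import numpy as np

def online_clustering(S, tau):
    """Sketch of the steps above; S is the raw n x n similarity matrix."""
    n = S.shape[0]
    # Augment S with the extra document n and put tau on the diagonal.
    S_aug = np.full((n + 1, n + 1), float(tau))
    S_aug[:n, :n] = S
    np.fill_diagonal(S_aug, tau)

    # Assumption: preallocate one row per potential cluster; a row is
    # "active" once its entry in the extra column n has been set.
    C = np.zeros((n, n + 1))
    C[0, 0] = 1
    C[0, n] = 1

    for j in range(1, n):
        i = j - 1
        C[i + 1, n] = 1                 # candidate new cluster
        active = C[: i + 2]             # rows 0 .. i+1
        T = 1.0 / active.sum(axis=1)    # inverse cluster sizes
        v = (active @ S_aug[:, j]) * T  # mean similarity per cluster
        v[v < tau] = 0                  # drop clusters below the threshold
        C[np.argmax(v), j] = 1
    return C[:, :n]                     # drop the extra column n

# Toy example with made-up similarities: documents 0 and 1 are close,
# document 2 is far from both.
S_toy = np.array([[1.0, 0.9, 0.2],
                  [0.9, 1.0, 0.1],
                  [0.2, 0.1, 1.0]])
C_toy = online_clustering(S_toy, tau=0.7)
labels_toy = C_toy.argmax(axis=0)  # cluster index of each document
```

In this toy run documents 0 and 1 end up together and document 2 opens its own cluster, because its mean similarity to the first cluster falls below $\tau$.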

# Experimento
## Preparar datos

In [1]:
from theano import tensor
from theano import function
import numpy as np

from tqdm import tqdm, trange

tau = .7

Load the vectors and remove rows containing NaN values

In [2]:
documents = np.load('data/fasttext_vectors_event_hurricane_irma.npy')
documents.shape

(10746, 100)

In [3]:
# all indices
idx = range(len(documents))
# indices of rows in documents that contain NaN (only about 15)
remove_idx = np.where(np.isnan(documents).any(axis=1))[0]

docs = np.array([documents[i] for i in idx if i not in remove_idx])
n = docs.shape[0]
docs.shape

(10731, 100)
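
The same filtering can be expressed with a boolean mask, which avoids the Python-level loop over indices. This is only a sketch on a hypothetical toy array; the name `keep` is not part of the notebook.

```python
import numpy as np

# Hypothetical toy array standing in for `documents`; row 1 contains a NaN.
documents = np.array([[0.1, 0.2],
                      [np.nan, 0.5],
                      [0.3, 0.4]])

keep = ~np.isnan(documents).any(axis=1)  # True for rows without any NaN
docs = documents[keep]                   # rows 0 and 2 survive
```

A mask like `keep` could also be reused to filter any other array aligned row by row with the vectors.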

Compute the cosine similarity matrix and set its diagonal to $\tau$ (the extra row/column from the notation is not materialized in this experiment)

In [4]:
d = docs @ docs.T
norm = (docs * docs).sum(1, keepdims=True) ** .5
S = d / norm / norm.T
S.shape

(10731, 10731)
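
The matrix above is the cosine similarity between every pair of documents. An equivalent formulation, sketched here on hypothetical random vectors, normalizes each row to unit length first and then takes a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 4))  # hypothetical toy vectors, not the real data

# Scale every row to unit Euclidean norm, then take all pairwise dot products.
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
S = unit @ unit.T
```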

In [5]:
assert np.allclose(S.diagonal(), 1)

In [6]:
np.fill_diagonal(S, tau)
assert np.allclose(S.diagonal(), tau)

## Online clustering

In [7]:
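ONE = np.ones((n, 1))
i = 0

# Row k of C is the indicator vector of cluster k; unused rows stay all-zero.
C = np.zeros((n, n))
C[0, 0] = 1
#C[0, n] = 1


for j in trange(1, n):
    # Tentatively open a new cluster containing only document j;
    # since S[j, j] == tau, its score below is exactly tau.
    C[i + 1, j] = 1

    # Inverse cluster sizes; empty rows give 1/0 = inf, which is reset to 0.
    T = 1 / C.dot(ONE)
    T[T > 1] = 0

    # Sum of similarities between document j and the members of each cluster.
    v = C @ S[:, j]

    # Mean similarity per cluster (shape (1, n) after broadcasting).
    v = v * T.T

    # Discard clusters scoring below the threshold.
    v[v < tau] = 0

    # Best cluster; the tentative one wins only if no other reaches tau.
    k = np.argmax(v)

    C[i + 1, j] = 0
    C[k, j] = 1
    i += 1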

100%|██████████| 10730/10730 [08:18<00:00, 21.54it/s]
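
Most of the running time above goes into multiplying the full $n \times n$ matrix `C` on every iteration. A lighter variant, sketched below, keeps a list of member indices per cluster and computes the mean similarities directly. It is intended to produce the same grouping as the loop above up to cluster numbering, but that equivalence is an assumption that has not been checked on the real data. The function name `online_clustering_labels` and the toy matrix are hypothetical.

```python
import numpy as np

def online_clustering_labels(S, tau):
    """Return one cluster label per document (sketch, see assumptions above)."""
    n = S.shape[0]
    labels = np.empty(n, dtype=int)
    labels[0] = 0
    members = [[0]]  # members[k] lists the documents in cluster k

    for j in range(1, n):
        # Mean similarity between document j and each existing cluster.
        means = np.array([S[m, j].mean() for m in members])
        k = int(np.argmax(means))
        if means[k] >= tau:
            members[k].append(j)   # join the best existing cluster
        else:
            k = len(members)       # open a new cluster for document j
            members.append([j])
        labels[j] = k
    return labels

# Toy example with made-up similarities.
S_toy = np.array([[1.0, 0.9, 0.2],
                  [0.9, 1.0, 0.1],
                  [0.2, 0.1, 1.0]])
labels_toy = online_clustering_labels(S_toy, tau=0.7)
```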


In [8]:
C1 = C[~np.all(C == 0, axis=1)]
C1.shape

(71, 10731)

In [9]:
C1

array([[ 1.,  1.,  1., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
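
Since every column of the indicator matrix has exactly one nonzero entry, it can be collapsed into one label per document, and the cluster sizes are its row sums. The snippet below is a sketch on a hypothetical toy matrix with the same layout as `C1`.

```python
import numpy as np

# Hypothetical toy indicator matrix: 3 clusters (rows) x 5 documents (columns).
C1 = np.array([[1, 1, 0, 0, 1],
               [0, 0, 1, 0, 0],
               [0, 0, 0, 1, 0]], dtype=float)

labels = C1.argmax(axis=0)  # row index of the single 1 in each column
sizes = C1.sum(axis=1)      # number of documents per cluster
order = np.argsort(-sizes, kind="stable")  # clusters from largest to smallest
```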

## Explore clusters

In [33]:
from db.engines import engine_of215 as engine
from db.models_new import *
from db import events
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(format='%(asctime)s | %(name)s | %(levelname)s : %(message)s', level=logging.INFO)
# tokenizer = Tokenizer()
from sqlalchemy.orm import sessionmaker

# server, engine = connect_from_rafike(username='mquezada', password='100486')
Session = sessionmaker(engine, autocommit=True)
session = Session()

In [34]:
event_name = 'hurricane_irma'

In [42]:
documents = get_documents_from_event(event_name, session)
documents = np.array([documents[i] for i in idx if i not in remove_idx])

In [38]:
print("Cluster sizes (first 20)\n")

for c in sorted(C1, key=lambda x: sum(x), reverse=True)[:20]:
    print(sum(c))

Cluster sizes (first 20)

4281.0
3021.0
2886.0
116.0
72.0
54.0
51.0
39.0
27.0
25.0
19.0
18.0
18.0
5.0
5.0
5.0
4.0
4.0
3.0
3.0


In [52]:
for c in sorted(C1, key=lambda x: sum(x), reverse=True)[:20]:
    for d in documents[np.argwhere(c > 0)][:20][0][0]:
        print(d.tweet_id, ' '.join(d.text.split()))
    print()
    print()

906266219304685568 ***THIS IS AS REAL AS IT GETS*** ***NOWHERE IN THE FLORIDA KEYS WILL BE SAFE*** ***YOU STILL HAVE TIME TO EVACUAT… https://t.co/5AnaYguVaG
906266219304685568 ***THIS IS AS REAL AS IT GETS*** ***NOWHERE IN THE FLORIDA KEYS WILL BE SAFE*** ***YOU STILL HAVE TIME TO EVACUAT… https://t.co/5AnaYguVaG


906109913423806464 California Guardsmen sent to rescue mission in Florida as Hurricane Irma looms https://t.co/b3hGkJiGNo https://t.co/zqS2pXf68x
906109913423806464 California Guardsmen sent to rescue mission in Florida as Hurricane Irma looms https://t.co/b3hGkJiGNo https://t.co/zqS2pXf68x


906266219304685568 ***THIS IS AS REAL AS IT GETS*** ***NOWHERE IN THE FLORIDA KEYS WILL BE SAFE*** ***YOU STILL HAVE TIME TO EVACUAT… https://t.co/5AnaYguVaG
906266219304685568 ***THIS IS AS REAL AS IT GETS*** ***NOWHERE IN THE FLORIDA KEYS WILL BE SAFE*** ***YOU STILL HAVE TIME TO EVACUAT… https://t.co/5AnaYguVaG


906875281474035712 Hurricane Irma: Florida police urge residents not t

In [24]:
C1[0]

array([ 1.,  1.,  1., ...,  1.,  0.,  0.])