# Clustering

This notebook presents attempts to cluster news texts by topic.

## Data loading

In [1]:
import pandas as pd
from pathlib import Path
from collections import Counter

data_path = Path('./data')
# Map each file's stem (name without extension) to its path.
files_data = {
    file.stem: file for file in data_path.iterdir()
}

df_clustering = pd.read_json(files_data["clusters"])

Let's see what is inside the data.

In [2]:
df_clustering.head()

| | id | text | title | lang | date | cluster | cluster_name |
|---|---|---|---|---|---|---|---|
| 0 | https://www.fondsk.ru/news/2020/03/24/litva-ne... | The coronavirus epidemic in Lithuania has clea... | Lithuania: No Physicians, No Food Stock - Stra... | eng | 2020-03-24 00:00:00 | 0 | MS fails to respond |
| 1 | https://www.rubaltic.ru/article/politika-i-obs... | European experts say that the countries of Eas... | Coronavirus caused a catastrophe in the Baltic | eng | 2020-03-19 08:50:37 | 0 | MS fails to respond |
| 2 | http://pandemya.ru/shveciya-otkazalas-ot-borby... | Sweden refused to fight against coronavirus: “... | Sweden abandoned the fight against coronavirus... | eng | | 0 | MS fails to respond |
| 3 | https://southfront.org/coronavirus-hysteria-hi... | Donate\nDuring the past week, the center of th... | Coronavirus Hysteria Hits Russia As Europe Bec... | eng | 2020-03-14 12:27:02 | 0 | MS fails to respond |
| 4 | https://www.fondsk.ru/news/2020/03/24/shvedski... | The Swedish oligarchs in conjunction with the ... | Swedish capital slows down in an epidemic and ... | eng | 2020-03-24 00:00:00 | 0 | MS fails to respond |


Now, let's display how many elements there are in each cluster.

In [3]:
Counter(df_clustering["cluster_name"])

Counter({'MS fails to respond': 9,
         'Anti-Russia': 8,
         'Claims about China': 22,
         'Collapse': 8,
         'Coronavirus is not serious': 11,
         'Cure': 6,
         'EU fails to respond': 20,
         'Miscellaneous': 20,
         'Origins': 8,
         'Properties': 6,
         'Was predicted': 4,
         'Secret plan of the global elite': 16,
         'Ukraine fails to respond': 9,
         'USA created COVID-2019': 34})

Let's see if all the rows are populated with a title.

In [4]:
df_clustering[df_clustering["title"] == ""]

| | id | text | title | lang | date | cluster | cluster_name |
|---|---|---|---|---|---|---|---|


Now, we can shuffle them.

In [5]:
df = df_clustering.sample(frac=1).reset_index(drop=True)
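
Note that `sample(frac=1)` draws a different permutation on every run, so the outputs shown later in this notebook depend on the particular shuffle. As a minimal sketch (the toy frame and the seed value are assumptions for illustration, not part of the original analysis), passing `random_state` makes the shuffle repeatable:

```python
import pandas as pd

# Toy stand-in for df_clustering (hypothetical data).
toy = pd.DataFrame({"title": ["a", "b", "c", "d"], "cluster": [0, 0, 1, 1]})

# A fixed seed yields the same permutation every time the cell runs.
shuffled_once = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Check that both shuffles produced identical frames.
print(shuffled_once.equals(shuffled_again))
```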

We'll get the titles in a list and the labels (clusters) in another.

In [6]:
texts, labels = df["title"].to_list(), df["cluster"].to_list()

## Clustering

The idea is to cluster the documents by topic. To do that, we first preprocess the text: we remove stop words and normalize the words (lemmatization, in this case). This requires a language model. To install spaCy's English model, download it from the console with ```python -m spacy download en_core_web_sm```. Once it is available, we can load it.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from spacy.lang.en.stop_words import STOP_WORDS
import spacy
import numpy as np


nlp = spacy.load("en_core_web_sm")

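Next, we define a custom tokenizer to pass to the ``CountVectorizer``. This tokenizer first processes the text through a spaCy pipeline and then returns a list of lemmas (normalized words), keeping a lemma only when the token's original text is not in the `STOP_WORDS` set. Because the check is made on the token's text as written, capitalized stop words (e.g., "The") are not filtered out.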

In [21]:
def tokenizer(text):
    """Normalizes text and removes stop words."""
    doc = nlp(text)
    return [token.lemma_ for token in doc if token.text not in STOP_WORDS]

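`CountVectorizer` takes texts as input and returns a document-term matrix: one row per document and one column per term, with each cell holding the term's count in that document.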

In [26]:
count_vec = CountVectorizer(tokenizer=tokenizer)
X = count_vec.fit_transform(texts)

In [27]:
X.shape

(181, 591)
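
To make the shape concrete: each row of `X` is a document (181 titles) and each column is a term from the vocabulary (591 terms). The sketch below shows the same layout on two made-up titles. It uses `CountVectorizer`'s default tokenizer rather than the spaCy-based one above, purely to stay self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up titles (hypothetical data, not from the data set).
toy_texts = ["virus spreads fast", "virus cure found"]

toy_vec = CountVectorizer()  # default tokenizer, not the spaCy one
toy_X = toy_vec.fit_transform(toy_texts)

# vocabulary_ maps each term to its column index.
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(terms)
print(toy_X.toarray())  # one row per document, one column per term
```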

We can fit a KMeans estimator, asking for 14 clusters (the number of clusters in the original data).

In [28]:
estimator = KMeans(n_clusters=14, random_state=24)
estimator.fit(X)

KMeans(n_clusters=14, random_state=24)

These are the labels assigned by the clustering algorithm.

In [44]:
print(estimator.labels_)

[ 8  2  2  0  8  2  0  0  3  3  1  3  5  0  3  3  6  0  3  2  0  3  3 13
  3  3  0 12  5  1  1  0  7  2  2  6  2  0  6 10  3  5  2  0  2  2  2  2
  3  2  8  8 13 12 13  2  3  2  2  8  6  5  3  2  2  3  5  8  2  3  2  2
  2  0  3  3  3  3  3  2  2  3  3  2  3  2  0  0  1  2  3  2  6  2  8 13
  3  3  0  9  5  3  0  0  6  3  0  0  2  4  0  3  2  2  2  5  1  6  1  0
 13  0  2  3  3  1 13  5 10  0 10  3  3  2  0  3 12  6 11 12  2  3  3  3
  0  8  2  0  0  8  3  3  1  6  8  8  2  2  2 12  3  3 12  2  2  3  3  0
  3  2  2  3  2  3  3  0  5  0  8  0  3]
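
The label numbers themselves are arbitrary identifiers: only which documents share a label matters. A small sketch on made-up 2-D points (the data and the `n_init` value are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of made-up points.
points = np.array([[0.0, 0.0], [0.1, 0.2], [10.0, 10.0], [10.2, 9.9]])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=24)
toy_labels = toy_kmeans.fit_predict(points)

# Which group receives id 0 and which id 1 depends on the initialization,
# but the first two points share one id and the last two share the other.
print(toy_labels)
```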


We can compare them against the real labels, keeping in mind that the label numbers will not be the same. However, we can compare them element-wise to check whether documents assigned to the same cluster also belonged to the same cluster in the original data set.

In [34]:
print(labels)

[13, 6, 5, 4, 11, 13, 6, 13, 10, 2, 0, 6, 2, 4, 13, 0, 13, 13, 7, 11, 7, 5, 11, 13, 7, 12, 6, 0, 13, 6, 3, 2, 0, 2, 13, 13, 2, 3, 7, 1, 7, 13, 2, 3, 1, 13, 13, 2, 10, 13, 5, 11, 9, 11, 13, 13, 0, 3, 3, 11, 4, 11, 4, 3, 0, 11, 9, 8, 11, 1, 13, 2, 12, 7, 2, 2, 6, 7, 1, 7, 13, 13, 2, 7, 7, 8, 0, 2, 12, 9, 6, 13, 4, 4, 5, 13, 13, 1, 7, 8, 9, 6, 7, 6, 2, 13, 6, 8, 6, 6, 12, 12, 13, 2, 7, 6, 9, 2, 2, 1, 9, 11, 13, 7, 5, 2, 13, 7, 7, 2, 13, 12, 6, 6, 13, 8, 0, 7, 13, 6, 7, 8, 0, 2, 4, 11, 12, 6, 13, 11, 8, 10, 12, 2, 1, 11, 7, 6, 3, 11, 13, 12, 11, 6, 11, 4, 4, 6, 5, 13, 4, 13, 7, 1, 10, 3, 13, 2, 4, 8, 2]
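
One way to make this comparison without matching label numbers is to look at pairs of documents: a pair counts as an agreement when both labelings put the two documents together, or both keep them apart. The fraction of agreeing pairs is known as the Rand index. The helper below is a minimal sketch (`pair_agreement` is a hypothetical name, and the toy labels are made up); scikit-learn's `sklearn.metrics.adjusted_rand_score` offers a chance-corrected variant of the same idea.

```python
from itertools import combinations


def pair_agreement(true_labels, predicted_labels):
    """Fraction of document pairs on which two labelings agree (Rand index).

    Assumes both sequences have the same length and at least two elements.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        agree += same_true == same_pred  # both together, or both apart
        total += 1
    return agree / total


# Same grouping under different label numbers: full agreement.
print(pair_agreement([0, 0, 1, 1], [5, 5, 3, 3]))
# A grouping that cuts across the true clusters: low agreement.
print(pair_agreement([0, 0, 1, 1], [0, 1, 0, 1]))
```

On the notebook's data, the call would be `pair_agreement(labels, estimator.labels_)`.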


## Latent Dirichlet Allocation

Another approach is to use Latent Dirichlet Allocation (LDiA). This method treats each document as a bag of words and models it as a mixture of topics, where each topic assigns a probability to every word. We ask the algorithm to decompose the documents into 14 "topics" when we `fit` the estimator.

In [35]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=14, random_state=0)
lda.fit(X)

LatentDirichletAllocation(n_components=14, random_state=0)

Now we can ask the model for a document's probability distribution over the 14 topics. For example:

In [38]:
lda.transform(X[0])

array([[0.00793653, 0.00793655, 0.37639588, 0.00793653, 0.00793654,
        0.00793655, 0.00793651, 0.00793653, 0.00793653, 0.00793654,
        0.00793653, 0.00793653, 0.52836569, 0.00793656]])

Here we see that the highest probability (about 0.53) is assigned to the topic with index 12, followed by the topic with index 2 (about 0.38).
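
Each row returned by `transform` is a probability distribution over the topics, so its entries sum to 1, and `np.argmax` picks the most probable topic. A minimal sketch on made-up counts (the matrix and the choice of 2 topics are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up document-term counts: 4 documents, 6 terms.
toy_counts = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 3],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(toy_counts)

print(doc_topic.sum(axis=1))     # each row sums to 1 (up to rounding)
print(doc_topic.argmax(axis=1))  # most probable topic per document
```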

Once again, we can get the most probable topic for each document and compare the results against the original labels, keeping in mind that the topic indices and the original label numbers don't necessarily match.

In [43]:
print(np.argmax(lda.transform(X), axis=1))

[12  5  4  4  2 13  8 11  7  2  2  1  5  4  6  7  3  1  3 13  2  1  1 13
  4  1  5  0 13  2  4  8  2 10  7  0  5 11 13 12 12  6  3 13  5  7  0 13
  3  2  4  2  1  7  7 11  3 11 13  2  3 10  3 13  7  1 13 11 13 10 13 10
 11  8  3  1 11  1 12  6 11  1  7  3  3  8  3  8  4 13  4  9 13  9  4  7
  8  1  1  3 10 12  7 10  3  1 13  7 10  5  3  8  0 10  5  5  2  0  2  2
  6  0  7  0  2  3  0  0 12  9 12  6  9 10  8 11  8  5  2 10 13 13 13  3
  8  2 12  3 12  2 12 12  3  9  2  2  1  5 11  1  1  5 12  7  4  3 11 13
 11 13  1  3  1 13  7 11 13  1  2 10 13]


In [45]:
print(labels)

[13, 6, 5, 4, 11, 13, 6, 13, 10, 2, 0, 6, 2, 4, 13, 0, 13, 13, 7, 11, 7, 5, 11, 13, 7, 12, 6, 0, 13, 6, 3, 2, 0, 2, 13, 13, 2, 3, 7, 1, 7, 13, 2, 3, 1, 13, 13, 2, 10, 13, 5, 11, 9, 11, 13, 13, 0, 3, 3, 11, 4, 11, 4, 3, 0, 11, 9, 8, 11, 1, 13, 2, 12, 7, 2, 2, 6, 7, 1, 7, 13, 13, 2, 7, 7, 8, 0, 2, 12, 9, 6, 13, 4, 4, 5, 13, 13, 1, 7, 8, 9, 6, 7, 6, 2, 13, 6, 8, 6, 6, 12, 12, 13, 2, 7, 6, 9, 2, 2, 1, 9, 11, 13, 7, 5, 2, 13, 7, 7, 2, 13, 12, 6, 6, 13, 8, 0, 7, 13, 6, 7, 8, 0, 2, 4, 11, 12, 6, 13, 11, 8, 10, 12, 2, 1, 11, 7, 6, 3, 11, 13, 12, 11, 6, 11, 4, 4, 6, 5, 13, 4, 13, 7, 1, 10, 3, 13, 2, 4, 8, 2]
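
Since topic indices and original cluster numbers are unrelated, a simple summary is purity: group the documents by predicted topic, take the most common original label in each group, and report the share of documents that carry their group's majority label. A minimal sketch in plain Python (`purity` is a hypothetical helper, and the toy labels are made up):

```python
from collections import Counter, defaultdict


def purity(true_labels, predicted_labels):
    """Share of documents whose true label is the majority label of their predicted group."""
    groups = defaultdict(Counter)
    for true, pred in zip(true_labels, predicted_labels):
        groups[pred][true] += 1  # count true labels inside each predicted group
    majority_total = sum(max(counts.values()) for counts in groups.values())
    return majority_total / len(true_labels)


toy_true = [0, 0, 0, 1, 1, 2]
toy_topics = [7, 7, 4, 4, 4, 7]
# Group 7 holds labels {0: 2, 2: 1}; group 4 holds {0: 1, 1: 2}.
print(purity(toy_true, toy_topics))
```

With the notebook's variables, this would be called as `purity(labels, np.argmax(lda.transform(X), axis=1))`. Purity tends to rise as the number of groups grows, so it is best read alongside the number of topics requested.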
