## Clustering text documents

In [18]:
from sklearn.datasets import fetch_20newsgroups
from nltk.tokenize import word_tokenize #Used to extract words from documents
import nltk
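# nltk.download takes one package id; a second positional argument is the download directory
nltk.download('punkt')    # tokenizer models used by word_tokenize
nltk.download('wordnet')  # lexical database used by WordNetLemmatizer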
from nltk.stem import WordNetLemmatizer #Used to lemmatize words
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import KMeans

import sys
from time import time

import pandas as pd
import numpy as np

[nltk_data] Downloading package punkt to wordnet...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [19]:
# Selected 3 categories from the datasets
categories=['talk.politics.misc','sci.space','comp.graphics']
print("Loading 20 newsgroups dataset for categories:")
print(categories)



Loading 20 newsgroups dataset for categories:
['talk.politics.misc', 'sci.space', 'comp.graphics']


In [20]:
df = fetch_20newsgroups(subset='all', categories=categories, 
                             shuffle=False, remove=('headers', 'footers', 'quotes'))

In [21]:
labels=df.target
true_k=len(np.unique(labels))
print(true_k)

3


In [10]:

# Lemmatizing the documents
# Note: the tf-idf step below is fit on df.data rather than on lemmatized_data
lemmatizer = WordNetLemmatizer()
lemmatized_data = []

for doc in df.data:
    word_list = word_tokenize(doc)  # split the raw text into tokens
    lemmatized_doc = " ".join(lemmatizer.lemmatize(word) for word in word_list)
    lemmatized_data.append(lemmatized_doc)


In [27]:
print(df.data[1])


In regards to fractal commpression, I have seen 2 fractal compressed "movies".
They were both fairly impressive.  The first one was a 64 gray scale "movie" of
Casablanca, it was 1.3MB and had 11 minutes of 13 fps video.  It was a little
grainy but not bad at all.  The second one I saw was only 3 minutes but it
had 8 bit color with 10fps and measured in at 1.2MB.

I consider the fractal movies a practical thing to explore.  But unlike many 
other formats out there, you do end up losing resolution.  I don't know what
kind of software/hardware was used for creating the "movies" I saw but the guy
that showed them to me said it took 5-15 minutes per frame to generate.  But as
I said above playback was 10 or more frames per second.  And how else could you
put 11 minutes on one floppy disk?


We next convert our corpus into tf-idf vectors. In doing so, we remove common English stop words and terms with very low document frequency (many of them are numbers or misspellings), and we strip accents.

In [28]:
vectorizer=TfidfVectorizer(strip_accents='unicode',stop_words='english',min_df=2)
X=vectorizer.fit_transform(df.data)

In [29]:
print(X.shape)

(2735, 16134)


## Clustering using standard K-Means

We first cluster documents using the standard k-means algorithm (more precisely, k-means with the refined k-means++ initialization), without any further data preprocessing. The key parameter when performing k-means is $k$, the number of clusters. Alas, there really is no principled way to choose an initial value for $k$. Essentially we have two options:

1. We choose a value that reflects our knowledge about the data, as in this case.
2. We try several values, possibly in increasing order. We proceed this way as long as the quality of the resulting clustering (as measured by one or more quality indices) increases, and stop when it starts decreasing.

In this specific case, we set $k = 3$, the number of newsgroup categories we selected.
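The second option can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic 2-D data from `make_blobs` rather than the tf-idf matrix, it uses the silhouette coefficient as the single quality index, and for simplicity it scores every candidate in a fixed range and keeps the best one instead of stopping at the first decrease. The names `demo_X`, `scores` and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the document vectors: three well-separated blobs
demo_X, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0,
                       random_state=0)

# Fit k-means for each candidate k and record the silhouette coefficient
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10,
                   max_iter=100, random_state=0)
    scores[k] = silhouette_score(demo_X, model.fit_predict(demo_X))

# Keep the k with the highest score
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real text data the quality curve is usually much flatter than on such clean synthetic blobs, so the chosen $k$ should be read as a hint rather than a definitive answer.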

In [18]:
km=KMeans(n_clusters=true_k,init='k-means++',max_iter=100)
t0=time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))

done in 0.301s


### Standard measures of cluster quality

In [19]:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

Homogeneity: 0.372
Completeness: 0.409
V-measure: 0.389
Adjusted Rand-Index: 0.326
Silhouette Coefficient: 0.007
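As a rough guide to reading these numbers, the small sketch below applies the same label-based scores to hand-made labelings. The toy labels and variable names are hypothetical. Homogeneity is high when each cluster contains members of a single class, completeness is high when all members of a class land in the same cluster, and the V-measure is their harmonic mean. None of these scores depends on how cluster ids are numbered.

```python
from sklearn import metrics

truth = [0, 0, 1, 1]  # hypothetical ground-truth classes

# Same partition with swapped cluster ids: the scores ignore id numbering
swapped = [1, 1, 0, 0]
ari_swapped = metrics.adjusted_rand_score(truth, swapped)

# Class 1 split across two clusters: every cluster is pure, class 1 is scattered
oversplit = [0, 0, 1, 2]
h_over = metrics.homogeneity_score(truth, oversplit)
c_over = metrics.completeness_score(truth, oversplit)
v_over = metrics.v_measure_score(truth, oversplit)

# Everything in one cluster: classes stay together, but the cluster is mixed
merged = [0, 0, 0, 0]
h_merged = metrics.homogeneity_score(truth, merged)
c_merged = metrics.completeness_score(truth, merged)

print("swapped ids: ARI=%.3f" % ari_swapped)
print("oversplit:   homogeneity=%.3f completeness=%.3f V=%.3f"
      % (h_over, c_over, v_over))
print("merged:      homogeneity=%.3f completeness=%.3f" % (h_merged, c_merged))
```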


### Identify the 10 most relevant terms in each cluster

In [24]:
centroids = km.cluster_centers_.argsort()[:, ::-1] ## Indices of largest centroids' entries in descending order
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Cluster 0: graphics thanks image file files know format program does looking
Cluster 1: space like just think don shuttle know nasa orbit time
Cluster 2: people don government think just men like right make did


## Visualization



In [25]:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [30]:
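def frequency_dict(cluster_index):
    """Map each term to its weight in the centroid of the given cluster."""
    if cluster_index>true_k-1:
        return cluster_index
    term_frequencies=km.cluster_centers_[cluster_index]  # centroid weights, one per term
    sorted_terms=centroids[cluster_index]  # term indices, largest weight first
    frequencies = {terms[i]: term_frequencies[i] for i in sorted_terms}
    return frequencies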