<h3>
    Name: Babin Joshi <br/>
    Roll No: 19
</h3>    

<center>
    <div>
        <h2>Chapter 3: Clustering - Finding Related Posts</h2>
    </div>
</center>

In the previous chapter, you learned how to find the classes or categories of
individual datapoints. With a handful of training data items that were paired with
their respective classes, you learned a model, which we can now use to classify
future data items. We called this supervised learning because the learning was
guided by a teacher; in our case, the teacher had the form of correct classifications.<br/><br/>
Let's now imagine that we do not possess those labels by which we can learn the
classification model. Still, we could find
some pattern within the data itself. That is, let the data describe itself. This is what
we will do in this chapter, where we consider the challenge of a question and answer
website. When a user is browsing our site, perhaps because they were searching for
particular information, the search engine will most likely point them to a specific
answer. If the presented answers are not what they were looking for, the website
should present (at least) the related answers so that they can quickly see what other
answers are available and hopefully stay on our site.

We will achieve this goal in this chapter using clustering. This is a method of
arranging items so that similar items are in one cluster and dissimilar items are in
distinct ones. The tricky thing that we have to tackle first is how to turn text into
something on which we can calculate similarity. With such a similarity measurement,
we will then proceed to investigate how we can leverage that to quickly arrive at a
cluster that contains similar posts. Once there, we will only have to check out those
documents that also belong to that cluster. 

<h3>Measuring the relatedness of posts</h3><br/>
From the machine learning point of view, raw text is useless. Only if we manage to transform it into meaningful numbers can we feed it into our machine learning algorithms, such as clustering. The same is true for more mundane operations on text, such as similarity measurement.

<h1> How to do it </h1><br/>
More robust than edit distance is the so-called bag-of-words approach. It totally
ignores the order of words and simply uses word counts as its basis. For each
word in the post, its occurrence is counted and noted in a vector. 
Take, for instance, two example posts with the following word counts:

![alt text](bagofword.png)

The columns Occurrences in post 1 and Occurrences in post 2 can now be treated as
simple vectors. We could calculate the Euclidean distance between the vectors
of all posts and take the nearest one, but as we found out earlier, that is too slow.
Instead, we can use them later as our feature vectors in the clustering steps, according to
the following procedure:
1. Extract salient features from each post and store them as a vector per post.
2. Then compute clustering on the vectors.
3. Determine the cluster for the post in question.
4. From this cluster, fetch a handful of posts having a different similarity to the
post in question. This will increase diversity.
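
The bag-of-words idea and the naive nearest-post search described above can be sketched in plain Python. This is only an illustration: the helper names `bag_of_words` and `euclidean` and the query string are assumptions made for this sketch, while the two posts are the subject lines used later in this chapter.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"\w+", text.lower()))

def euclidean(counts_a, counts_b):
    """Euclidean distance between two word-count vectors."""
    vocabulary = set(counts_a) | set(counts_b)
    # A Counter returns 0 for words it has not seen.
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocabulary))

posts = ["How to format my hard disk", "Hard disk format problems"]
query = bag_of_words("format disk")  # hypothetical query post

vectors = [bag_of_words(p) for p in posts]
distances = [euclidean(query, v) for v in vectors]
nearest = min(range(len(posts)), key=distances.__getitem__)
print(nearest, distances)
```

Word order plays no role here: only the counts enter the distance, which is exactly what makes the approach robust but also blind to phrasing.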

<h1>Preprocessing - similarity measure as a similar number of common words</h1><br/>
The bag-of-words approach is both fast and robust. It is, though, not without challenges. 

<h3>Converting raw text into a bag of words</h3><br/>

<b>Scikit-learn's</b> <u>CountVectorizer</u> class counts the words and represents those counts as a vector, not only efficiently but also with a very convenient interface.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
vectorizer = CountVectorizer(min_df=1)

The <b>min_df</b> parameter determines how <b>CountVectorizer</b> treats rarely used words (minimum document frequency). If it is set to an integer, all words that occur in fewer documents than that value will be dropped. If it is a fraction, all words that occur in less than that fraction of the overall dataset will be dropped. The <b>max_df</b> parameter works in a similar manner for words that occur too often. Inspecting the instance's attributes shows some of the other parameters scikit-learn provides, together with their default values:

In [3]:
vectorizer.analyzer

'word'

In [4]:
vectorizer.token_pattern

'(?u)\\b\\w\\w+\\b'

Here, we can see that the counting is done at the word level and that the words are determined by the regular expression pattern. It will, for example, tokenize "cross-validated" into "cross" and "validated". 
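
To see what this pattern does, it can be applied directly with Python's `re` module. This is a minimal sketch that mimics only the lower-casing and tokenization; the `tokenize` helper is a name chosen for this example and is not part of scikit-learn.

```python
import re

# The default token pattern reported by vectorizer.token_pattern.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Lowercase the text, then keep runs of two or more word characters."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = tokenize("A cross-validated model")
print(tokens)
```

The hyphen is not a word character, so it splits the token, and the single-letter word "A" is dropped because the pattern requires at least two characters.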

In [5]:
content = ['How to format my hard disk', 'Hard disk format problems ']

We can now put this list of subject lines into the fit_transform() function of our vectorizer, which does all the hard vectorization work.

In [6]:
X = vectorizer.fit_transform(content)

In [7]:
vectorizer.get_feature_names()

['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']

In [8]:
print(X.toarray().transpose())

[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]


This means that the first sentence contains all the words except "problems", while the second contains all but "how", "my", and "to".

<h3>Counting Words</h3>

In [9]:
import os
paths = os.getcwd()

In [10]:
DIR = os.path.join(paths, 'toy')

In [11]:
DIR

'D:\\KU\\7th Sem\\Machine Learning\\ML_Practicals\\toy'

In [12]:
posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]

In [13]:
posts

['This is a toy post about machine learning. Actually, it contains not much interesting stuff.',
 'Imaging databases provide storage capabilities.',
 'Most imaging databases save images permanently.\n',
 'Imaging databases store data.',
 'Imaging databases store data. Imaging databases store data. Imaging databases store data.']

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

We have to notify the vectorizer about the full dataset so that it knows upfront what words are to be expected:

In [15]:
X_train = vectorizer.fit_transform(posts)

In [16]:
num_samples, num_features = X_train.shape

In [17]:
print(f"No. of Samples: {num_samples} & No. of Features: {num_features}")

No. of Samples: 5 & No. of Features: 25


In [18]:
print(vectorizer.get_feature_names())

['about', 'actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine', 'most', 'much', 'not', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'this', 'toy']


Now, we can vectorize our new post.

In [19]:
new_post = "imaging databases"

In [20]:
new_post_vec = vectorizer.transform([new_post])

Note that the count vectors returned by the transform method are sparse. That is,
each vector does not store one count value for each word, as most of those counts
will be zero (the post does not contain the word). Instead, the vectorizer returns a SciPy sparse matrix (csr_matrix in current scikit-learn versions), which is more memory-efficient because it stores only the non-zero entries.
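
The idea behind sparse storage can be sketched in plain Python: keep only (row, column, value) triples for the non-zero entries. This is a simplified illustration rather than SciPy's actual data structure, and `to_coordinates` is a hypothetical helper.

```python
def to_coordinates(dense_row, row_index=0):
    """Return (row, column, value) triples for the non-zero entries only."""
    return [(row_index, col, value)
            for col, value in enumerate(dense_row) if value != 0]

# A 25-dimensional count vector with two non-zero entries,
# mirroring the count vector of the new post "imaging databases".
dense = [0] * 25
dense[5] = 1  # column of 'databases'
dense[7] = 1  # column of 'imaging'

triples = to_coordinates(dense)
print(triples)
```

Two triples replace 25 stored numbers; with thousands of features per post, the saving becomes substantial.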

In [21]:
print(new_post_vec)

  (0, 5)	1
  (0, 7)	1


Via its <b>toarray()</b> method, we can once again access the full ndarray:

In [22]:
print(new_post_vec.toarray())

[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


We need to use the full array if we want to use it as a vector for similarity calculations. For the similarity measurement (the naive one), we calculate the Euclidean distance between the count vectors of the new post and all the old posts:

In [23]:
import scipy as sp

In [24]:
def dist_raw(v1, v2):
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())

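The <b>norm()</b> function calculates the Euclidean norm, that is, the length of a vector. Applied to the difference of two vectors, it gives the straight-line (shortest) distance between them.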

With <b>dist_raw</b> we just need to iterate over all the posts and remember the nearest one:

In [25]:
import sys
best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(num_samples):
    post=posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    print(f"=== Post {i} with dist={d:.2f}:{post}")
    if d<best_dist:
        best_dist = d
        best_i = i
print(f"Best post is {best_i} with dist = {best_dist:.2f}")

=== Post 0 with dist=4.00:This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=1.73:Imaging databases provide storage capabilities.
=== Post 2 with dist=2.00:Most imaging databases save images permanently.

=== Post 3 with dist=1.41:Imaging databases store data.
=== Post 4 with dist=5.10:Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist = 1.41


In [26]:
print(X_train.getrow(3).toarray())

[[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]


In [27]:
print(X_train.getrow(4).toarray())

[[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]
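
The two rows above explain why Post 4 ends up so much farther away than Post 3, although it is just Post 3 repeated three times. A small sketch restricted to the four columns that are non-zero in these posts (the reduced vectors and the `euclidean` helper are written for this illustration):

```python
import math

# Columns: 'data', 'databases', 'imaging', 'store' (all other counts are zero).
new_post = [0, 1, 1, 0]
post_3 = [1, 1, 1, 1]
post_4 = [3, 3, 3, 3]

def euclidean(v1, v2):
    """Euclidean distance between two equally long count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

d3 = euclidean(post_3, new_post)  # sqrt(1 + 0 + 0 + 1)
d4 = euclidean(post_4, new_post)  # sqrt(9 + 4 + 4 + 9)
print(f"{d3:.2f} {d4:.2f}")
```

Raw counts therefore penalize repetition: the content is the same, but the larger counts inflate the distance.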


<h3>Normalizing Word Count Vectors</h3>

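Raw counts make a post that repeats the same words appear far away from a post with the same content. We therefore extend <b>dist_raw</b> to calculate the distance not on the raw vectors but on the normalized ones, that is, on vectors scaled to unit length: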

In [28]:
def dist_norm(v1, v2):
    v1_normalized = v1/sp.linalg.norm(v1.toarray())
    v2_normalized = v2/sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

In [29]:
import sys
best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(num_samples):
    post=posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print("=== Post %i with dist=%.2f: %s"%(i, d, post))
    if d<best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f"%(best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.92: Most imaging databases save images permanently.

=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77


Here, Post 3 and Post 4 are calculated as being equally similar. One could argue whether that much repetition would be a delight to the reader, but from the point of view of counting the words in the posts, this seems to be right.
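
Why the two posts tie can be checked by hand: after scaling to unit length, their vectors are identical. The following sketch uses dense lists restricted to the four relevant columns; `normalize` and `dist_norm_dense` are illustrative names, not the notebook's sparse-matrix functions.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dist_norm_dense(v1, v2):
    """Euclidean distance between the unit-length versions of v1 and v2."""
    n1, n2 = normalize(v1), normalize(v2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(n1, n2)))

# Columns: 'data', 'databases', 'imaging', 'store'.
new_post = [0, 1, 1, 0]
post_3 = [1, 1, 1, 1]
post_4 = [3, 3, 3, 3]

d3 = dist_norm_dense(post_3, new_post)
d4 = dist_norm_dense(post_4, new_post)
print(normalize(post_3), normalize(post_4))
print(f"{d3:.2f} {d4:.2f}")
```

Normalization keeps only the direction of the count vector, so multiplying every count by three changes nothing.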

In [30]:
posts

['This is a toy post about machine learning. Actually, it contains not much interesting stuff.',
 'Imaging databases provide storage capabilities.',
 'Most imaging databases save images permanently.\n',
 'Imaging databases store data.',
 'Imaging databases store data. Imaging databases store data. Imaging databases store data.']

<h3>Removing less Important Words</h3>

Words such as "most" in Post 2 appear very often in all sorts of different contexts. They do not carry as much information and thus should not be weighed as much as words such as "images", which do not occur often in different contexts. The best option is to remove all the words that are so frequent that they do not help to distinguish between different texts. These words are called <b>stop words</b>.

As this is such a common step in text processing, there is a simple parameter in CountVectorizer to achieve that:

In [31]:
vectorizer = CountVectorizer(min_df=1, stop_words='english')

If we have our own list of words that we want to use as stop words, we can pass it as a list. Setting <b>stop_words</b> to 'english' instead uses a built-in set of 318 English stop words. 
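
What stop-word removal amounts to can be sketched without scikit-learn. The tiny `STOP_WORDS` set below is a hypothetical stand-in chosen for this example; it is not the built-in English list.

```python
import re

# Hypothetical miniature stop-word list, for illustration only.
STOP_WORDS = {"most", "this", "is", "a", "it", "not", "about", "much"}

def content_words(text, stop_words=STOP_WORDS):
    """Tokenize the lowercased text and drop every stop word."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

kept = content_words("Most imaging databases save images permanently.")
print(kept)
```

Only the words that help to tell posts apart remain as features.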

In [32]:
print(sorted(vectorizer.get_stop_words())[0:20])

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst']


In [33]:
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples,num_features))

#samples: 5, #features: 18


In [34]:
print(vectorizer.get_feature_names())

['actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'learning', 'machine', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'toy']


In [35]:
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])

In [36]:
print(new_post_vec)

  (0, 4)	1
  (0, 6)	1


In [37]:
print(new_post_vec.toarray())

[[0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0]]


In [38]:
import sys
best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(num_samples):
    post=posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print(f"=== Post {i} with dist = {d:.2f} : {post}")
    if d<best_dist:
        best_dist = d
        best_i = i
print(f"Best post is {best_i} with dist = {best_dist:.2f}")

=== Post 0 with dist = 1.41 : This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist = 0.86 : Imaging databases provide storage capabilities.
=== Post 2 with dist = 0.86 : Most imaging databases save images permanently.

=== Post 3 with dist = 0.77 : Imaging databases store data.
=== Post 4 with dist = 0.77 : Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist = 0.77


Without stop words, Post 2 is now on par with Post 1.

<h3>Stemming</h3>

We count similar words in different variants as different words. Post 2, for instance, contains "imaging" and "images". It will make sense to count them together. After all, it is the same concept they are referring to.

We need a function that reduces words to their specific word stem. The Natural Language Toolkit (NLTK) is a free software toolkit that we can download, and it provides a stemmer that we can easily plug into CountVectorizer.

In [39]:
import nltk

NLTK comes with different stemmers because every language has a different set of rules for stemming. For English, we can take SnowballStemmer.

In [40]:
s = nltk.stem.SnowballStemmer('english')

In [41]:
s.stem('graphics')

'graphic'

In [42]:
s.stem('imaging')

'imag'

In [43]:
s.stem('imagination')

'imagin'

In [44]:
s.stem('imagine')

'imagin'

In [45]:
s.stem('buys')

'buy'

In [46]:
s.stem('buying')

'buy'

In [47]:
s.stem('bought')

'bought'

<h3>Extending the vectorizer with NLTK's stemmer</h3>

We need to stem the posts before we feed them into CountVectorizer. The class provides several hooks with which we can customize the preprocessing and tokenization stages. The preprocessor and tokenizer can be set as parameters in the constructor. We do not want to place the stemmer into either of them, because we would then have to do the tokenization and normalization ourselves. Instead, we override the <b>build_analyzer</b> method:

In [48]:
import nltk.stem

In [49]:
english_stemmer = nltk.stem.SnowballStemmer('english')

In [50]:
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the parent's analyzer (lower-casing, tokenization, stop words) ...
        analyzer = super().build_analyzer()
        # ... and stem every token it yields.
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

In [51]:
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')

This performs the following steps for each post:
1. Lower-casing the raw post in the preprocessing step
(done in the parent class).
2. Extracting all individual words in the tokenization step (done in the
parent class).
3. Converting each word into its stemmed version.
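
The three steps above can be sketched as a plain function. Note the assumptions: `crude_stem` is a hypothetical toy stemmer that strips a single suffix and is not the Snowball algorithm, and stop-word removal is left out to keep the sketch short.

```python
import re

def crude_stem(word):
    """Toy stemmer for illustration: strip one common suffix, keep >= 3 letters."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(post):
    lowered = post.lower()                      # step 1: lower-casing
    tokens = re.findall(r"\b\w\w+\b", lowered)  # step 2: tokenization
    return [crude_stem(t) for t in tokens]      # step 3: stemming

stems = analyze("Most imaging databases save images permanently.")
print(stems)
```

Even this crude version maps "imaging" and "images" to the same feature, which is the effect the stemmed vectorizer relies on.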

In [52]:
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: %d, #features: %d" % (num_samples,num_features))

#samples: 5, #features: 17


In [53]:
print(vectorizer.get_feature_names())

['actual', 'capabl', 'contain', 'data', 'databas', 'imag', 'interest', 'learn', 'machin', 'perman', 'post', 'provid', 'save', 'storag', 'store', 'stuff', 'toy']


In [54]:
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])

In [55]:
print(new_post_vec)

  (0, 4)	1
  (0, 5)	1


In [56]:
print(new_post_vec.toarray())

[[0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0]]


In [57]:
import sys
best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(num_samples):
    post=posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print(f"=== Post {i} with dist = {d:.2f} : {post}")
    if d<best_dist:
        best_dist = d
        best_i = i
print(f"Best post is {best_i} with dist = {best_dist:.2f}")

=== Post 0 with dist = 1.41 : This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist = 0.86 : Imaging databases provide storage capabilities.
=== Post 2 with dist = 0.63 : Most imaging databases save images permanently.

=== Post 3 with dist = 0.77 : Imaging databases store data.
=== Post 4 with dist = 0.77 : Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist = 0.63


<h3>Stop words on steroids</h3>

We want a high value for a given term in a given post if that term occurs often in that particular post and very seldom anywhere else.

This is exactly what term frequency - inverse document frequency (TF-IDF) does. TF stands for the counting part, while IDF factors in the discounting. 

In [58]:
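import math

In [59]:
def tfidf(term, doc, corpus):
    # Term frequency: share of the document's tokens that are `term`.
    tf = doc.count(term) / len(doc)
    # Number of documents in the corpus that contain `term`.
    num_docs_with_term = len([d for d in corpus if term in d])
    # Inverse document frequency (natural log); assumes the term occurs in at least one document.
    idf = math.log(len(corpus) / num_docs_with_term)
    return tf * idf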

Here, we did not simply count the terms, but also normalized the counts by the document length. 

In [60]:
a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]

In [61]:
D = [a, abb, abc]

In [62]:
print(tfidf("a", a, D))

0.0


In [63]:
print(tfidf("a", abb, D))

0.0


In [64]:
print(tfidf("a", abc, D))

0.0


In [65]:
print(tfidf("b", abb, D))

0.27031007207210955


In [66]:
print(tfidf("a", abc, D))

0.0


In [67]:
print(tfidf("b", abc, D))

0.13515503603605478


In [68]:
print(tfidf("c", abc, D))

0.3662040962227032


We see that "a" carries no meaning for any document since it is contained everywhere. The term "b" is more important for the document abb than for abc, as it occurs there twice.
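
The printed values follow directly from the definition. Written as a formula (the notation is introduced here for illustration: $t$ is the term, $d$ the document as a list of tokens, $D$ the corpus, and the logarithm is the natural one):

$$
\mathrm{tfidf}(t, d, D) = \frac{\mathrm{count}(t, d)}{|d|} \cdot \ln\frac{|D|}{|\{d' \in D : t \in d'\}|}
$$

For the toy corpus this gives:

$$
\begin{aligned}
\mathrm{tfidf}(\text{a}, \cdot, D) &= \tfrac{\mathrm{count}}{|d|} \cdot \ln\tfrac{3}{3} = 0 \\
\mathrm{tfidf}(\text{b}, \text{abb}, D) &= \tfrac{2}{3} \cdot \ln\tfrac{3}{2} \approx 0.270 \\
\mathrm{tfidf}(\text{b}, \text{abc}, D) &= \tfrac{1}{3} \cdot \ln\tfrac{3}{2} \approx 0.135 \\
\mathrm{tfidf}(\text{c}, \text{abc}, D) &= \tfrac{1}{3} \cdot \ln\tfrac{3}{1} \approx 0.366
\end{aligned}
$$

A term that appears in every document gets an IDF of $\ln 1 = 0$, no matter how often it occurs.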

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [70]:
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Call the direct parent's analyzer, then stem each token.
        analyzer = super().build_analyzer()
        return lambda doc : (
            english_stemmer.stem(w) for w in analyzer(doc)
        )

In [71]:
vectorizer = StemmedTfidfVectorizer(min_df=1, 
                                    stop_words='english',
                                    decode_error='ignore')

In [72]:
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print(f"#samples: {num_samples}, #features: {num_features}")

#samples: 5, #features: 17


In [73]:
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec)

  (0, 5)	0.7071067811865476
  (0, 4)	0.7071067811865476


In [74]:
best_doc = None
best_dist = sys.maxsize
best_i = None

for i, post in enumerate(posts):
    if post == new_post:
        continue
        
    post_vec = X_train.getrow(i)
    
    d = dist_norm(post_vec, new_post_vec)
    
    print(f"=== Post {i} with dist = {d:.2f} : {post}")
    if d < best_dist:
        best_dist = d
        best_i = i 
        
print(f"Best post is {best_i} with dist = {best_dist:.2f}")

=== Post 0 with dist = 1.41 : This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist = 1.08 : Imaging databases provide storage capabilities.
=== Post 2 with dist = 0.86 : Most imaging databases save images permanently.

=== Post 3 with dist = 0.92 : Imaging databases store data.
=== Post 4 with dist = 0.92 : Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist = 0.86


<h3>Clustering</h3>

Finally, we have our vectors, which we believe capture the posts to a sufficient degree.
Not surprisingly, there are many ways to group them together. Most clustering
algorithms fall into one of the two methods: flat and hierarchical clustering.

<b>Flat Clustering: </b>Flat clustering divides the posts into a set of clusters without relating the clusters to
each other. The goal is simply to come up with a partitioning such that all posts in
one cluster are most similar to each other while being dissimilar from the posts in all
other clusters. Many flat clustering algorithms require the number of clusters to be
specified up front.

<b>Hierarchical Clustering: </b>In hierarchical clustering, the number of clusters does not have to be specified.
Instead, hierarchical clustering creates a hierarchy of clusters. While similar posts
are grouped into one cluster, similar clusters are again grouped into one uber-cluster.
This is done recursively, until only one cluster is left that contains everything. In
this hierarchy, one can then choose the desired number of clusters after the fact.
However, this comes at the cost of lower efficiency.

<h3>K-means Clustering</h3>

k-means is the most widely used flat clustering algorithm. After initializing it with
the desired number of clusters, num_clusters, it maintains that number of so-called
cluster centroids. Initially, it will pick any num_clusters posts and set the centroids
to their feature vector. Then it will go through all other posts and assign them the
nearest centroid as their current cluster. Following this, it will move each centroid
into the middle of all the vectors of that particular cluster. This changes, of course, the
cluster assignment. Some posts are now nearer to another cluster. So it will update
the assignments for those changed posts. This is done as long as the centroids move
considerably. After some iterations, the movements will fall below a threshold and
we consider clustering to be converged.
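
The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration on two-dimensional toy points, not scikit-learn's implementation; the function name `kmeans`, its parameters, and the sample points are assumptions made for this example (`math.dist` requires Python 3.8 or later).

```python
import math

def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Plain k-means on coordinate tuples, starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append([sum(dim) / len(members)
                                      for dim in zip(*members)])
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # centroids no longer move: converged
            break
    return labels, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
# Initialize the centroids with two of the points themselves.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Because the result depends on which points are picked as initial centroids, real implementations usually offer several restarts or smarter initialization schemes.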

<h3>Getting test data to evaluate our ideas on </h3>

In [75]:
import sklearn.datasets
all_data = sklearn.datasets.fetch_20newsgroups(subset='all')
len(all_data.filenames)

18846

In [76]:
print(all_data.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [77]:
train_data = sklearn.datasets.fetch_20newsgroups(subset='train')
print(f"{len(train_data.filenames)}")

11314


In [78]:
test_data = sklearn.datasets.fetch_20newsgroups(subset='test')
print(f"{len(test_data.filenames)}")

7532


In [79]:
groups = ['comp.graphics', 'comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware','comp.windows.x', 'sci.space']

In [80]:
train_data = sklearn.datasets.fetch_20newsgroups(subset='train', categories=groups)
print(len(train_data.filenames))

3529


In [81]:
test_data = sklearn.datasets.fetch_20newsgroups(subset='test', categories=groups)
print(len(test_data.filenames))

2349


<h2>Clustering Posts</h2>

The newsgroup dataset contains invalid characters that will result in a UnicodeDecodeError, so we have to tell the vectorizer to ignore them:

In [116]:
vectorizer = StemmedTfidfVectorizer(min_df=10, max_df=0.5,
                                   stop_words='english',
                                   decode_error='ignore')

In [117]:
vectorized = vectorizer.fit_transform(train_data.data)

In [118]:
num_samples, num_features = vectorized.shape

In [119]:
print(f"No. of samples: {num_samples} & No. of features: {num_features}")

No. of samples: 3529 & No. of features: 4712


So we have a pool of 3529 posts and have extracted for each of them a feature vector of 4712 dimensions. That is what K-means takes as input. We will fix the number of clusters to 50 for this chapter.

In [120]:
num_clusters = 50
from sklearn.cluster import KMeans
km = KMeans(n_clusters=num_clusters, init='random', n_init=1,
           verbose=1, random_state=3)
km.fit(vectorized)

Initialization complete
Iteration 0, inertia 5899.5595831471655
Iteration 1, inertia 3218.297747726279
Iteration 2, inertia 3184.3328334733214
Iteration 3, inertia 3164.867358130041
Iteration 4, inertia 3152.003949571175
Iteration 5, inertia 3143.1109963529184
Iteration 6, inertia 3136.2559774422048
Iteration 7, inertia 3129.3248717684405
Iteration 8, inertia 3124.5674798201394
Iteration 9, inertia 3121.9001105797406
Iteration 10, inertia 3120.209894571872
Iteration 11, inertia 3118.62745619288
Iteration 12, inertia 3117.3625259783616
Iteration 13, inertia 3116.8112664390364
Iteration 14, inertia 3116.587892365764
Iteration 15, inertia 3116.417048753848
Iteration 16, inertia 3115.760414808626
Iteration 17, inertia 3115.3736535034473
Iteration 18, inertia 3115.155454436256
Iteration 19, inertia 3114.949117560754
Iteration 20, inertia 3114.5149932662175
Iteration 21, inertia 3113.9369169464094
Iteration 22, inertia 3113.719999300366
Iteration 23, inertia 3113.547519005385
Iteration 24, i

KMeans(init='random', n_clusters=50, n_init=1, random_state=3, verbose=1)

We provided a random state just so that we can get the same results. In real-world applications, we will not do this. After fitting, we can get the clustering information out of the members of km. For every vectorized post that has been fit, there is a corresponding integer label in km.labels_:

In [121]:
print(km.labels_)

[38 17 47 ... 41 14 16]


In [122]:
print(km.labels_.shape)

(3529,)


The cluster centers can be accessed via <b>km.cluster_centers_</b>

<h3>Solving our initial challenge</h3>

In [123]:
new_post = "Disk drive problems. Hi, I have a problem with my hard \
disk. After 1 year it is working only sporadically now.\
I tried to format it, but now it doesn't boot any more.\
Any ideas? Thanks."
new_post

"Disk drive problems. Hi, I have a problem with my hard disk. After 1 year it is working only sporadically now.I tried to format it, but now it doesn't boot any more.Any ideas? Thanks."

In [124]:
new_post_vec = vectorizer.transform([new_post])
new_post_label = km.predict(new_post_vec)[0]
new_post_label

7

Now that we have the clustering, we do not need to compare new_post_vec to all post vectors. Instead, we can focus only on the posts of the same cluster.

In [126]:
similar_indices = (km.labels_==new_post_label).nonzero()[0]
len(similar_indices)

166

The comparison in the bracket results in a Boolean array, and nonzero converts that
array into a smaller array containing the indices of the True elements.
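
In plain Python, the same selection could be written as follows. The label list here is hypothetical and only serves to show the two steps (Boolean mask, then indices of the True entries) separately.

```python
labels = [7, 3, 7, 1, 7]  # hypothetical cluster labels for five posts
new_label = 7

# Step 1: Boolean mask, True where a post shares the new post's cluster.
mask = [label == new_label for label in labels]
# Step 2: indices of the True entries (what nonzero() returns for the mask).
similar_idx = [i for i, flag in enumerate(mask) if flag]
print(similar_idx)
```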

Using similar_indices, we then simply have to build a list of posts together with
their similarity scores:

In [127]:
import numpy as np
similar = []

for i in similar_indices:
    dist = np.linalg.norm((new_post_vec - vectorized[i]).toarray())
    similar.append((dist, train_data.data[i]))
                   
similar = sorted(similar)
len(similar)

166

In [128]:
show_at_1 = similar[0]
show_at_2 = similar[int(len(similar)/10)]
show_at_3 = similar[int(len(similar)/2)]

In [129]:
for dist, post in [show_at_1, show_at_2, show_at_3]:
    print(f"{dist} \t {post}")
    print("-------------------")

1.0378441731334072 	 From: Thomas Dachsel <GERTHD@mvs.sas.com>
Subject: BOOT PROBLEM with IDE controller
Nntp-Posting-Host: sdcmvs.mvs.sas.com
Organization: SAS Institute Inc.
Lines: 25

Hi,
I've got a Multi I/O card (IDE controller + serial/parallel
interface) and two floppy drives (5 1/4, 3 1/2) and a
Quantum ProDrive 80AT connected to it.
I was able to format the hard disk, but I could not boot from
it. I can boot from drive A: (which disk drive does not matter)
but if I remove the disk from drive A and press the reset switch,
the LED of drive A: continues to glow, and the hard disk is
not accessed at all.
I guess this must be a problem of either the Multi I/o card
or floppy disk drive settings (jumper configuration?)
Does someone have any hint what could be the reason for it.
Please reply by email to GERTHD@MVS.SAS.COM
Thanks,
Thomas
+-------------------------------------------------------------------+
| Thomas Dachsel                                                    |
| Internet

<h3>Another look at noise</h3>

In [130]:
post_group = zip(train_data.data, train_data.target)

In [131]:
all_posts = [(len(post[0]), post[0], train_data.target_names[post[1]]) for post in post_group]

In [132]:
graphics = sorted([post for post in all_posts if post[2] == 'comp.graphics'])

In [133]:
print(graphics[5])



In [134]:
noise_post = graphics[5][1]

In [135]:
analyzer = vectorizer.build_analyzer()

In [136]:
print(list(analyzer(noise_post)))

['situnaya', 'ibm3090', 'bham', 'ac', 'uk', 'subject', 'test', 'sorri', 'organ', 'univers', 'birmingham', 'unit', 'kingdom', 'line', 'nntp', 'post', 'host', 'ibm3090', 'bham', 'ac', 'uk']


In [137]:
useful = set(analyzer(noise_post)).intersection(vectorizer.get_feature_names())

In [138]:
print(sorted(useful))

['ac', 'birmingham', 'host', 'kingdom', 'nntp', 'sorri', 'test', 'uk', 'unit', 'univers']


In [139]:
for term in sorted(useful):
    idf = vectorizer.idf_[vectorizer.vocabulary_[term]]
    print(f"IDF({term}) = {idf:.2f}")

IDF(ac) = 3.51    
IDF(birmingham) = 6.77    
IDF(host) = 1.74    
IDF(kingdom) = 6.68    
IDF(nntp) = 1.77    
IDF(sorri) = 4.14    
IDF(test) = 3.83    
IDF(uk) = 3.70    
IDF(unit) = 4.42    
IDF(univers) = 1.91    
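
These values can be related to how many posts contain each term. The sketch below assumes scikit-learn's documented default of smoothed IDF (`smooth_idf=True`), that is, ln((1 + n) / (1 + df)) + 1. The helper `smoothed_idf` and the sample document frequencies are hypothetical; only the corpus size of 3529 comes from the output above.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3529  # number of training posts

# Hypothetical document frequencies, from rare to present in every post.
for df in (10, 100, 1000, n_docs):
    print(df, round(smoothed_idf(n_docs, df), 2))
```

Under this assumption, a term that just passes the min_df=10 threshold gets an IDF of about 6.77, the value reported for "birmingham", while terms found in a large share of the posts, such as "host" or "nntp", approach the minimum of 1 and contribute little to the distance.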
