# Assignment 10 - Vector Semantics  

Fanny Chow  
November 2, 2018  

I collaborated with Stephanie Rivera on this assignment.  

**Please write an iPython notebook that represents words and articles in the Reuters corpus as vectors and clusters the article vectors. Please use the nltk.corpus.reuters "training" documents (as shown in reuters.fileids()) in one (and only one) of the categories:**

`ship, trade, interest, money-fx, crude`

In [1]:
from nltk.corpus import reuters, stopwords
import pandas as pd
import collections
from sklearn.feature_extraction.text import CountVectorizer
import re
import nltk
import html
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 1. Create a term-document matrix containing a row for every word in the corpus vocabulary and a column for each document, where each entry is the tf-idf score of a word for a document

Many documents in the `reuters` corpus have multiple categories, but I only want the documents that have **one and only one** category. For example, an article can be tagged "ship", "finance", and "collagen", and we don't want those articles. We only want the training articles with exactly one category, say "ship", and nothing else.  

Of those documents with only one category tag, I obtain the documents with the tags for the 5 categories of interest listed below.  

In [2]:
categories = ['ship','trade', 'interest', 'money-fx', 'crude']

def get_category_file_names(category_names):
    """Return training file ids that carry exactly one category label."""
    fnames = []
    for category in category_names:
        for fileid in reuters.fileids(category):
            # keep single-label documents from the training split only
            if len(reuters.categories(fileid)) == 1 and 'train' in fileid:
                fnames.append(fileid)
    return fnames

training_file_names = get_category_file_names(categories)
training_file_names[:4]

['training/10302', 'training/10388', 'training/10391', 'training/10394']
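The same single-label filter can be tried on a toy mapping without loading the corpus. This is only a minimal sketch: the `doc_categories` dictionary is a hypothetical stand-in for `reuters.categories(fileid)`, and the file ids and labels in it are made up for illustration.

```python
# Hypothetical stand-in for reuters.categories(fileid); ids and labels are made up.
doc_categories = {
    "training/1": ["ship"],
    "training/2": ["ship", "crude"],   # two labels -> excluded
    "training/3": ["trade"],
    "test/4": ["crude"],               # not a training file -> excluded
    "training/5": ["grain"],           # single label, but not a category of interest
}
wanted = {"ship", "trade", "interest", "money-fx", "crude"}

def single_label_training_files(doc_categories, wanted):
    """Keep training files whose only category is one of the wanted ones."""
    return [
        fileid
        for fileid, cats in doc_categories.items()
        if fileid.startswith("training/") and len(cats) == 1 and cats[0] in wanted
    ]

selected = single_label_training_files(doc_categories, wanted)
print(selected)
```

Only the two documents with exactly one label from the list of interest survive the filter.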

I normalize the text by removing stop words, removing punctuation, removing numbers, and unescaping some HTML syntax that was also read in. It's not perfect but does the job sufficiently for now. After this process, we have 1,024 training files, which seems like a good size for this training set. I'd be wary if I had any fewer articles to train with.  

In [3]:
# clean one document and return its word counts
def clean_doc(training_file):
    # local name avoids shadowing the imported `stopwords` module
    stop_words = set(nltk.corpus.stopwords.words('english'))
    doc = reuters.raw(training_file)
    doc = doc.lower()
    doc = html.unescape(doc)               # e.g. "&lt;" -> "<"
    doc = re.sub(r'[^a-zA-Z ]+', "", doc)  # drop digits, punctuation, and newlines
    store = doc.split()
    resultwords = [word for word in store if word not in stop_words]
    wordCounts = collections.Counter(resultwords)
    return wordCounts
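To see what each cleaning step does, here is a small sketch that runs the same kind of pipeline on one made-up sentence. The `toy_stopwords` set and the `sample` string are assumptions for illustration; the notebook itself uses nltk's English stop-word list and the raw Reuters text.

```python
import collections
import html
import re

# Hypothetical mini stop-word list; the notebook uses nltk's English list instead.
toy_stopwords = {"the", "a", "of", "to", "in", "and"}

def clean_text(raw, stop_words):
    """Lowercase, unescape HTML entities, keep letters and spaces, drop stop words."""
    text = html.unescape(raw.lower())
    text = re.sub(r"[^a-z ]+", "", text)
    tokens = [word for word in text.split() if word not in stop_words]
    return collections.Counter(tokens)

sample = "The U.S. exported 3.5 mln barrels of OIL &amp; oil products in 1987"
counts = clean_text(sample, toy_stopwords)
print(counts)
```

Stripping punctuation turns "U.S." into "us", which is a likely reason "us" ranks so high in the word counts. Because the pattern keeps only letters and spaces, a line break with no space next to it would also glue two words together.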

In [4]:
term_doc_matrix = dict()

for fname in training_file_names:
    counts_dict = {fname: clean_doc(fname)}
    term_doc_matrix.update(counts_dict)

In [5]:
df = pd.DataFrame(term_doc_matrix).fillna(0)

In [6]:
df.head()

Unnamed: 0,training/10302,training/10388,training/10391,training/10394,training/1052,training/10559,training/10717,training/10748,training/11251,training/11271,...,training/9208,training/9253,training/9279,training/9293,training/930,training/9445,training/945,training/952,training/9527,training/9674
ab,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abalkhail,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abandon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abandoned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abbas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
df.tail()

Unnamed: 0,training/10302,training/10388,training/10391,training/10394,training/1052,training/10559,training/10717,training/10748,training/11251,training/11271,...,training/9208,training/9253,training/9279,training/9293,training/930,training/9445,training/945,training/952,training/9527,training/9674
ziyang,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zoete,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zone,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zones,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zulia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Note from this sample that our text cleaning is not perfect. We get a lot of fake-looking words, such as "mln" and "dlrs", and words that share the same stem, such as "year" and "years."

Let's get the counts of individual words from our list of documents and convert them into a dataframe.  

In [8]:
df['row_sum'] = df.sum(axis=1)

Interesting how these are the words with the highest total counts across all documents (`row_sum` adds up raw counts, so it is not the number of documents a word appears in). The top 5 seem to make sense, but I wonder if "mln" or "dlrs" were poorly parsed words or abbreviations. According to https://acronyms.thefreedictionary.com/MLN, "mln" can stand for millions, and "dlrs" probably means "dollars."

In [9]:
df.row_sum.nlargest(10)

said       3511.0
trade      1264.0
us         1185.0
pct        1136.0
oil        1100.0
mln         899.0
would       841.0
bank        786.0
billion     749.0
dlrs        681.0
Name: row_sum, dtype: float64

In [10]:
tf = df.copy()

In [11]:
def calculate_tf(x):
    if x > 0:
        return 1 + np.log10(x)
    else:
        return 0

In [12]:
tf = tf.loc[:, tf.columns != 'row_sum'].applymap(calculate_tf)
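A quick worked example of the log scaling used for the term frequency, written with plain `math` so it can be checked by hand. The function name `log_tf` is mine, not part of the notebook.

```python
import math

def log_tf(count):
    """Log-scaled term frequency: 1 + log10(count) for positive counts, else 0."""
    return 1 + math.log10(count) if count > 0 else 0.0

for count in (0, 1, 10, 100):
    print(count, log_tf(count))
```

A word that occurs 100 times in a document gets a weight of 3 rather than 100, so very frequent words are dampened instead of dominating the vector.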

Let's take a look at the vectors for some words most obviously associated with the Reuters categories. 

In [13]:
tf.head()

Unnamed: 0,training/10302,training/10388,training/10391,training/10394,training/1052,training/10559,training/10717,training/10748,training/11251,training/11271,...,training/9208,training/9253,training/9279,training/9293,training/930,training/9445,training/945,training/952,training/9527,training/9674
ab,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abalkhail,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abandon,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abandoned,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
abbas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
idf = np.log10(len(df)/df.row_sum)

In [15]:
idf.head()

ab           3.526942
abalkhail    4.004063
abandon      3.402003
abandoned    3.703033
abbas        3.703033
Name: row_sum, dtype: float64
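One caveat about the cell above: `len(df)` is the number of rows (the vocabulary size), and `row_sum` is the total number of occurrences of a word. The textbook idf instead divides the number of documents N by the document frequency, i.e. the number of documents that contain the word at least once. A minimal sketch of that definition on made-up counts (the `counts` dictionary and `textbook_idf` are hypothetical names, not part of the notebook):

```python
import math

# Toy term-document counts (made up): word -> {document: raw count}
counts = {
    "oil":   {"d1": 3, "d2": 1, "d3": 2},
    "trade": {"d1": 0, "d2": 4, "d3": 0},
    "said":  {"d1": 5, "d2": 2, "d3": 7},
}
n_docs = 3

def textbook_idf(doc_counts, n_docs):
    """idf = log10(N / df), where df is the number of documents containing the word."""
    doc_freq = sum(1 for c in doc_counts.values() if c > 0)
    return math.log10(n_docs / doc_freq)

idf = {word: textbook_idf(doc_counts, n_docs) for word, doc_counts in counts.items()}
print(idf)
```

With the notebook's dataframe, the document frequency could probably be computed as `(df.drop(columns='row_sum') > 0).sum(axis=1)` and N as the number of document columns. I have not rerun the later cells with that change, so the outputs below still reflect the original formula.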

In [16]:
tfidf = tf.mul(idf, axis=0)
tfidf.loc[['crude']]

Unnamed: 0,training/10302,training/10388,training/10391,training/10394,training/1052,training/10559,training/10717,training/10748,training/11251,training/11271,...,training/9208,training/9253,training/9279,training/9293,training/930,training/9445,training/945,training/952,training/9527,training/9674
crude,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.934459,0.0,0.0,1.486867,0.0,1.486867,0.0,1.934459,0.0


In [17]:
tfidf['max_tfidf'] = tfidf.max(axis=1)

In [18]:
tfidf.loc[["oil"]]

Unnamed: 0,training/10302,training/10388,training/10391,training/10394,training/1052,training/10559,training/10717,training/10748,training/11251,training/11271,...,training/9253,training/9279,training/9293,training/930,training/9445,training/945,training/952,training/9527,training/9674,max_tfidf
oil,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.711774,1.542256,1.776222,1.421981,1.776222,1.542256,1.252463,0.962671,1.711774,2.193689


In [19]:
tfidf.max_tfidf.nlargest(10)

darman          6.109550
polish          6.008127
thai            6.008127
poland          5.970833
countertrade    5.960089
eep             5.960089
potash          5.960089
terra           5.960089
eurosterling    5.901431
png             5.901431
Name: max_tfidf, dtype: float64

# 2. Reduce the size of the matrix. Compute the maximum tf-idf-score for each word and keep the 500 rows with the top 500 maxima. Did that remove maximum tf-idf score of any column?  

In [20]:
top500idf = tfidf.nlargest(500, "max_tfidf")

I'm astonished by the top words! What do "polish", "countertrade", and "thai" have to do with Reuters news articles?! I thought that Reuters covered more press about the West, so why would "Thai" have so much importance in this vector space representation? Also I was surprised to see "eep" and "potash" as top words. This makes me wonder if the pre-processing is insufficient because these technically aren't real words.  

In [21]:
top500idf.max_tfidf.head(15)

darman          6.109550
polish          6.008127
thai            6.008127
poland          5.970833
countertrade    5.960089
eep             5.960089
potash          5.960089
terra           5.960089
eurosterling    5.901431
png             5.901431
abu             5.849764
azzam           5.828601
dibona          5.828601
greek           5.828601
melamed         5.828601
Name: max_tfidf, dtype: float64

Note how sparse this matrix is! I bet that will hinder our results.  

In [22]:
top500idf.tail(10)

Unnamed: 0,training/10302,training/10388,training/10391,training/10394,training/1052,training/10559,training/10717,training/10748,training/11251,training/11271,...,training/9253,training/9279,training/9293,training/930,training/9445,training/945,training/952,training/9527,training/9674,max_tfidf
saskatchewan,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.882024
sheet,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.882024
shultz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.882024
shutdowns,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.882024
sticking,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.882024
targeted,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.882024
tool,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.882024
unrest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.882024
shell,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.53221,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.88154
nakasone,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.875827


The words "crude" and "oil" are no longer in the matrix after only keeping the top 500 tf-idf. I'm surprised. I thought those words would characterize a majority of the Reuters articles and make the cut. I hope this isn't due to a logic error.  

In [23]:
# Note: this returns an error because "crude" is not in the top 500 tf-idf Reuters documents.  
# top500idf.loc[["crude"]]
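The heading for this part also asks whether the reduction removed the maximum tf-idf score of any column. One way to check is to compare each document's largest score before and after dropping rows. Below is a sketch of that check on a made-up 3-word matrix; the scores and the names `tfidf_toy` and `column_maxima` are illustrative assumptions.

```python
# Toy tf-idf scores (made up): word -> {document: score}
tfidf_toy = {
    "darman": {"d1": 6.1, "d2": 0.0, "d3": 0.0},
    "thai":   {"d1": 0.0, "d2": 6.0, "d3": 0.0},
    "oil":    {"d1": 1.2, "d2": 0.9, "d3": 2.2},
}

def column_maxima(matrix, docs):
    """Largest score in each document column."""
    return {d: max(row[d] for row in matrix.values()) for d in docs}

docs = ["d1", "d2", "d3"]
# Keep the 2 words with the largest row maximum (the notebook keeps 500).
kept_words = sorted(tfidf_toy, key=lambda w: max(tfidf_toy[w].values()), reverse=True)[:2]
reduced = {w: tfidf_toy[w] for w in kept_words}

before = column_maxima(tfidf_toy, docs)
after = column_maxima(reduced, docs)
lost_max = [d for d in docs if after[d] < before[d]]
print(kept_words, lost_max)
```

In the toy data, document `d3` loses its maximum because its best word ("oil") has a low row maximum. On the real dataframes, the same comparison could be made between `tfidf.drop(columns="max_tfidf").max()` and `top500idf.drop(columns="max_tfidf").max()`. Since "crude" and "oil" were dropped, I would expect some documents to lose their column maximum, but I have not verified that here.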

# 3. Cluster the document vectors into five clusters using an unsupervised algorithm like k-means. Create a 5x5 matrix that compares each cluster to each of the 5 categories, using the Jaccard index. Comment. 

One reason tf-idf matters is that it gives us a way to represent document similarity. By knowing which documents are similar, you’re able to find related documents and automatically group documents into clusters.

For example: let’s cluster these documents using K-Means clustering (check out this gif below). K-means basically plots all the tf-idf vectors on a graph and grabs the points that group together.  

![](http://practicalcryptography.com/media/miscellaneous/files/k_mean_send.gif)  
Source: http://practicalcryptography.com/
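To make the assign-then-update loop concrete, here is a bare-bones version of the k-means idea (Lloyd's algorithm) in plain Python on six made-up 2-D points. It is only a sketch: scikit-learn's `KMeans` picks starting centroids with k-means++ and reruns several times, whereas the starting centroids here are fixed by hand so the result is reproducible.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: mean of the points in each cluster (empty clusters stay put).
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids)
```

The first three points end up in one cluster and the last three in the other, with each centroid sitting at the mean of its group.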

In [24]:
# drop the max_tfidf column before clustering (the transpose is applied when fitting)
top500idf = top500idf.drop(labels="max_tfidf", axis=1)
# top500idf = top500idf.transpose()

In [25]:
top500idf.T

Unnamed: 0,darman,polish,thai,poland,countertrade,eep,potash,terra,eurosterling,png,...,saskatchewan,sheet,shultz,shutdowns,sticking,targeted,tool,unrest,shell,nakasone
training/10302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/10388,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/10391,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/10394,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/1052,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/10559,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/10717,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/10748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/11251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0
training/11271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0


In [26]:
number_of_clusters = 5
km = KMeans(n_clusters=number_of_clusters)

# transpose to have 1024 documents as rows
km.fit(top500idf.T)
km.fit

<bound method KMeans.fit of KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)>

In [27]:
km.labels_

array([1, 1, 1, ..., 1, 1, 1], dtype=int32)

In [28]:
# sanity check: 1024 documents
len(km.labels_)

1024

In [29]:
results = pd.DataFrame()
results['file_name'] = training_file_names
results['cluster'] = km.labels_

In [30]:
# Sanity check -- 1024 documents clustered
np.shape(results)

(1024, 2)

I wonder what these documents have in common. Maybe shared words about finance (i.e., jobs and currency)?

In [31]:
results[results.cluster == 0].head()

Unnamed: 0,file_name,cluster
104,training/9479,0
117,training/10455,0
133,training/11198,0
179,training/13649,0
745,training/9170,0


In [32]:
results[results.cluster == 1].head()

Unnamed: 0,file_name,cluster
0,training/10302,1
1,training/10388,1
2,training/10391,1
3,training/10394,1
4,training/1052,1


In [33]:
results[results.cluster == 2]

Unnamed: 0,file_name,cluster
46,training/3217,2
112,training/10265,2
114,training/10352,2
115,training/10355,2
120,training/10623,2
125,training/10767,2
132,training/11175,2
144,training/11397,2
166,training/12472,2
174,training/13045,2


In [34]:
results[results.cluster == 3]

Unnamed: 0,file_name,cluster
614,training/12815,3
618,training/13244,3


I can see why this document didn't share commonalities with others. It's quite grim, and an account of a ferry disaster seems too specific. 

In [35]:
results[results.cluster == 4]

Unnamed: 0,file_name,cluster
645,training/2765,4


Wow, this clustering is terrifyingly unhelpful at first glance. I think this may be a byproduct of how sparse the tf-idf matrix ended up being after only keeping the top 500 words. Wait, *Stephanie just made a fantastic point*: many of these 1,024 training documents might not contain any of the top 500 tf-idf words that are left over, so their vectors would be all zeros and they would naturally land in the same cluster. So on second thought, maybe it's not as terrifyingly bad as I thought!! 

In [36]:
results.groupby('cluster').size()

cluster
0       7
1    1002
2      12
3       2
4       1
dtype: int64
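With 1,002 of the 1,024 documents in cluster 1, it is worth checking how many document vectors are entirely zero after the reduction. A minimal sketch of that count on made-up vectors (the `doc_vectors` dictionary is a hypothetical stand-in for the columns of the reduced matrix):

```python
# Toy reduced matrix (made up): document -> tf-idf scores over the kept words
doc_vectors = {
    "d1": [0.0, 0.0, 0.0],
    "d2": [5.9, 0.0, 0.0],
    "d3": [0.0, 0.0, 0.0],
    "d4": [0.0, 4.8, 0.0],
    "d5": [0.0, 0.0, 0.0],
}

# Documents containing none of the kept words have an all-zero vector.
empty_docs = [name for name, vec in doc_vectors.items() if not any(vec)]
share_empty = len(empty_docs) / len(doc_vectors)
print(empty_docs, share_empty)
```

On the notebook's data, the equivalent count would be something like `(top500idf == 0).all(axis=0).sum()`. All-zero vectors are identical to each other, so k-means has no way to tell them apart and will put them in the same cluster.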

Acknowledging the limitations stemming from using such a sparse matrix to begin with, I proceed with the steps to calculate the Jaccard Similarity below. The **Jaccard Index compares two sets A and B** using the formula:


$$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

1. Count the number of members which are **shared between both sets**.
2. Count the total number of **members in both sets (shared and un-shared)**.
3. **Divide** the number of **shared members** by the **total number of members**
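4. Optionally, multiply the number you found above by 100 to express it as a percentage. The function below skips this step and returns a fraction between 0 and 1.  

This number **tells you how similar the two sets are**.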

In [37]:
def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)
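A small example that can be checked by hand, using made-up file ids. The variant below also guards against two empty sets, which would otherwise raise a `ZeroDivisionError`; returning 0.0 in that case is my assumption, and some conventions use 1.0 instead.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two iterables; two empty sets are scored 0.0 here."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: avoid ZeroDivisionError; other conventions use 1.0
    return len(a & b) / len(union)

cluster_files = ["training/1", "training/2", "training/3"]
category_files = ["training/2", "training/3", "training/4"]
score = jaccard(cluster_files, category_files)
print(score)  # 2 shared / 4 total = 0.5
```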

In order to use the equation above, let's get the tfidf matrix and the actual categories organized. 

In [38]:
actual_cats = pd.DataFrame(top500idf.columns, columns = ['fname'])

In [39]:
actual_cats['cat'] = actual_cats.fname.apply(nltk.corpus.reuters.categories)

In [40]:
# https://stackoverflow.com/questions/38147447/how-to-remove-square-bracket-from-pandas-dataframe
actual_cats['cat'] = actual_cats['cat'].str.get(0)

In [41]:
actual_cats.head()

Unnamed: 0,fname,cat
0,training/10302,ship
1,training/10388,ship
2,training/10391,ship
3,training/10394,ship
4,training/1052,ship


In [42]:
# join the actual categories with the clusters df
cluster_df = results.join(actual_cats.set_index('fname'), on='file_name')

In [43]:
cluster_df[cluster_df.file_name == 'training/6086']

Unnamed: 0,file_name,cluster,cat
948,training/6086,1,crude


In [44]:
cluster_df[cluster_df.cluster == 4]

Unnamed: 0,file_name,cluster,cat
645,training/2765,4,money-fx


What's the Jaccard Similarity index between the "crude" category and cluster 0?  

In [45]:
jaccard_similarity(cluster_df[cluster_df.cat == 'crude'].file_name, cluster_df[cluster_df.cluster == 0].file_name)

0.007751937984496124

Sanity check -- of course the same set of documents will have a 100% Jaccard Index! Phew. 

In [47]:
jaccard_similarity(training_file_names, training_file_names)

1.0

In [48]:
a = []
clusters = range(0,5)
for i in clusters:
    row = []
    for j in categories:
        row.append(jaccard_similarity(cluster_df[cluster_df.cluster == i].file_name, cluster_df[cluster_df.cat == j].file_name))
    a.append(row)
a

[[0.008771929824561403,
  0.011811023622047244,
  0.0,
  0.0043859649122807015,
  0.007751937984496124],
 [0.10557768924302789,
  0.23228346456692914,
  0.1906187624750499,
  0.21669980119284293,
  0.25],
 [0.008403361344537815, 0.043824701195219126, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.009009009009009009, 0.0],
 [0.0, 0.0, 0.0, 0.0045045045045045045, 0.0]]

In [111]:
results_doc_vector = pd.DataFrame(a, index=list(range(0,5)),
         columns=categories)
results_doc_vector

cluster,ship,trade,interest,money-fx,crude
0,0.008772,0.011811,0.0,0.004386,0.007752
1,0.105578,0.232283,0.190619,0.2167,0.25
2,0.008403,0.043825,0.0,0.0,0.0
3,0.0,0.0,0.0,0.009009,0.0
4,0.0,0.0,0.0,0.004505,0.0


In [112]:
results_doc_vector.idxmax()

ship        1
trade       1
interest    1
money-fx    1
crude       1
dtype: int64

Oh wow, the Jaccard similarity is 0 for most of them and each category is just classified as cluster 1. This is pretty disappointing. I was worried that the Jaccard similarity function was just dropping significant digits and leaving it at 0.0. But I checked the original function and it was able to return a float with more than 1 significant digit.   
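Part of the reason the numbers are so small is structural: the Jaccard index can never exceed the size of the smaller set divided by the size of the larger one, because the intersection is at most the smaller set and the union is at least the larger set. A tiny cluster compared with a category of a couple hundred documents therefore scores close to 0 even when every one of its documents belongs to that category. A sketch of the bound, where the category size of 222 is inferred from the reported score 0.0045045... for the single-document cluster 4 rather than read from the corpus:

```python
import math

def jaccard_upper_bound(size_a, size_b):
    """Largest possible Jaccard index for sets of these sizes (smaller inside larger)."""
    if max(size_a, size_b) == 0:
        return 0.0
    return min(size_a, size_b) / max(size_a, size_b)

# Cluster 4 holds 1 document and cluster 3 holds 2 (see the group sizes above).
# The category size 222 is an inference from the reported scores, not a corpus lookup.
print(jaccard_upper_bound(1, 222))
print(jaccard_upper_bound(2, 222))
```

These two bounds match the money-fx entries for clusters 4 and 3 in the table, which suggests those clusters sit entirely inside money-fx and already score as high as their size allows.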

I think that this demonstrates the flaws in clustering based on the documents. Perhaps it's not granular enough, and I expect better results with clustering the words next.  

# 4. Try clustering the words and comparing those clusters to the categories, too.  Comment on the results.  

Do we cluster the top 500 words by tf-idf score into those categories? Think of how we can cluster the row vectors (words) of the `top500idf` dataframe instead of the column vectors (documents), which we did previously. Maybe the word vectors have similar representations to each other.  

In [50]:
# sanity check -- words are rows & documents are columns
np.shape(top500idf)

(500, 1024)

In [51]:
number_of_clusters = 5
km = KMeans(n_clusters=number_of_clusters)

km.fit(top500idf)
km.fit

<bound method KMeans.fit of KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)>

In [52]:
# sanity check -- 500 words
len(km.labels_)

500

In [53]:
top500idf.index

Index(['darman', 'polish', 'thai', 'poland', 'countertrade', 'eep', 'potash',
       'terra', 'eurosterling', 'png',
       ...
       'saskatchewan', 'sheet', 'shultz', 'shutdowns', 'sticking', 'targeted',
       'tool', 'unrest', 'shell', 'nakasone'],
      dtype='object', length=500)

In [54]:
word_results = pd.DataFrame()
word_results['word'] = top500idf.index
word_results['cluster'] = km.labels_

At first glance, the distribution of the clusters looks slightly better than before, although most of the vectors still end up in a single cluster.  

In [55]:
word_results.groupby('cluster').size()

cluster
0      7
1    469
2      6
3      1
4     17
dtype: int64

In [56]:
word_results.head()

Unnamed: 0,word,cluster
0,darman,1
1,polish,1
2,thai,1
3,poland,1
4,countertrade,1


In [57]:
word_results[word_results.cluster == 0]

Unnamed: 0,word,cluster
137,california,0
396,rig,0
434,rigs,0
461,count,0
471,hughes,0
472,inland,0
496,tool,0


In [58]:
# https://stackoverflow.com/questions/32768555/find-the-set-of-column-indices-for-non-zero-values-in-each-row-in-pandas-data-f
cols = top500idf.columns
bt = top500idf.apply(lambda x: x > 0)
word_files = bt.apply(lambda x: list(cols[x.values]), axis=1)

In [59]:
word_files.head()

darman                       [training/12806]
polish                        [training/1963]
thai                          [training/6352]
poland          [training/1963, training/925]
countertrade                 [training/10375]
dtype: object

In [60]:
# set the indices to match
# https://stackoverflow.com/questions/26221300/nan-values-when-new-column-added-to-pandas-dataframe/26221919
word_results = word_results.set_index(word_results.word).drop(['word'], axis=1)
word_results.head()

word,cluster
darman,1
polish,1
thai,1
poland,1
countertrade,1


In [61]:
word_results['word_file'] = word_files

In [62]:
word_results.head()

word,cluster,word_file
darman,1,[training/12806]
polish,1,[training/1963]
thai,1,[training/6352]
poland,1,"[training/1963, training/925]"
countertrade,1,[training/10375]


In [63]:
actual_cats = pd.DataFrame(top500idf.columns, columns = ['fname'])

In [64]:
actual_cats['cat'] = actual_cats.fname.apply(nltk.corpus.reuters.categories)
actual_cats['cat'] = actual_cats['cat'].str.get(0)
actual_cats.head()

Unnamed: 0,fname,cat
0,training/10302,ship
1,training/10388,ship
2,training/10391,ship
3,training/10394,ship
4,training/1052,ship


Phew! Now that we've gone through the process of identifying the files that contain the top 500 words and also the actual categories for each file, we can calculate the Jaccard Index.  

For example:  
What is the Jaccard Index between cluster 0 and category ship? 

In [70]:
word_results[word_results.cluster == 0]

word,cluster,word_file
california,0,"[training/1347, training/8029, training/8044, ..."
rig,0,"[training/3430, training/8553]"
rigs,0,"[training/10375, training/1616, training/3430,..."
count,0,"[training/3430, training/7684, training/8553]"
hughes,0,"[training/3563, training/8553]"
inland,0,"[training/1616, training/6086, training/8553]"
tool,0,"[training/3563, training/8553]"


In [None]:
def get_cluster_files(clst_num):
    cluster_fnames = list()
    for each_list in word_results[word_results.cluster == clst_num].word_file:
        for file in each_list:
            cluster_fnames.append(file)
    return set(cluster_fnames)

In [86]:
# unique files in cluster 0
get_cluster_files(0)

{'training/10375',
 'training/11403',
 'training/1347',
 'training/1616',
 'training/3430',
 'training/3563',
 'training/6086',
 'training/7684',
 'training/8015',
 'training/8029',
 'training/8044',
 'training/8553',
 'training/8835',
 'training/8914'}

In [87]:
# files in category ship
actual_cats[actual_cats.cat == 'ship'].fname.head()

0    training/10302
1    training/10388
2    training/10391
3    training/10394
4     training/1052
Name: fname, dtype: object

In [89]:
jaccard_similarity(get_cluster_files(0), actual_cats[actual_cats.cat == 'ship'].fname)

0.0

Great! Now that I crunched the Jaccard Index for one cluster and one category, I can do the same for every combination.  

In [95]:
b = []
clusters = range(0,5)
for clst in clusters:
    row = []
    for cat in categories:
        row.append(jaccard_similarity(get_cluster_files(clst), actual_cats[actual_cats.cat == cat].fname))
    b.append(row)

In [96]:
b

[[0.0, 0.015384615384615385, 0.0, 0.0, 0.038910505836575876],
 [0.16,
  0.21684867394695787,
  0.08748114630467571,
  0.12071535022354694,
  0.2649434571890145],
 [0.0, 0.003861003861003861, 0.0, 0.0, 0.03543307086614173],
 [0.0703125,
  0.02197802197802198,
  0.023255813953488372,
  0.02032520325203252,
  0.014388489208633094],
 [0.013422818791946308,
  0.01384083044982699,
  0.0,
  0.011450381679389313,
  0.1297709923664122]]

In [100]:
word_vector_results = pd.DataFrame(b, index=list(range(0,5)),
         columns=categories)
word_vector_results

cluster,ship,trade,interest,money-fx,crude
0,0.0,0.015385,0.0,0.0,0.038911
1,0.16,0.216849,0.087481,0.120715,0.264943
2,0.0,0.003861,0.0,0.0,0.035433
3,0.070312,0.021978,0.023256,0.020325,0.014388
4,0.013423,0.013841,0.0,0.01145,0.129771


Wow! This matrix looks markedly better than the one before. It looks like clustering document vectors is not as informative as clustering word vectors. Although it still looks sparse, it's definitely an improvement from before. I'll do some validation by hand to make sense of the results.  
On second thought, this isn't as promising as I'd thought it would be, since each category is still matched to cluster 1. This is a pretty disappointing result, but I learned a lot in the process. Perhaps the results could be improved by using a different technique such as LDA, or by improving the pre-processing of the text, for example with stemming.

In [110]:
word_vector_results.idxmax()

ship        1
trade       1
interest    1
money-fx    1
crude       1
dtype: int64