For individuals or pairs

Please write an IPython notebook that represents words and articles in the Reuters corpus as vectors and clusters the article vectors. Use the nltk.corpus.reuters "training" documents (as listed by reuters.fileids()) that belong to one (and only one) of the categories

`ship, trade, interest, money-fx, crude`
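One way to read the selection rule is: keep a training file only if exactly one of its labels is among the five target categories. A minimal sketch of that filter follows. The helper name `select_single_target_docs` and the toy mapping are made up for illustration; with NLTK the mapping could be built as `{f: reuters.categories(f) for f in reuters.fileids()}`.

```python
TARGET_CATEGORIES = {"ship", "trade", "interest", "money-fx", "crude"}


def select_single_target_docs(doc_categories, targets=TARGET_CATEGORIES):
    """Map each selected training file id to its single target category.

    doc_categories: dict of file id -> list of category labels.
    A file is kept only if its id starts with "training/" and exactly one
    of its labels is in `targets`. Labels outside `targets` are ignored,
    which is an assumption about how to read the assignment.
    """
    selected = {}
    for fileid, categories in doc_categories.items():
        if not fileid.startswith("training/"):
            continue
        hits = targets.intersection(categories)
        if len(hits) == 1:
            selected[fileid] = next(iter(hits))
    return selected


# Hypothetical toy mapping (not real corpus file ids).
toy = {
    "training/1": ["money-fx"],
    "training/2": ["money-fx", "interest"],  # two target labels: dropped
    "training/3": ["crude", "gas"],          # one target label: kept
    "test/4": ["ship"],                      # not a training file: dropped
    "training/5": ["grain"],                 # no target label: dropped
}
labels = select_single_target_docs(toy)
print(labels)
```

The returned dictionary doubles as the ground-truth category assignment needed later when clusters are compared to categories.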

In [95]:
import nltk
import os
#nltk.download('reuters')
from nltk.corpus import reuters
from collections import Counter
import pandas as pd
import numpy as np

In [2]:
# NOTE: fileids("money-fx") returns both training/ and test/ files
money = reuters.fileids("money-fx")
# # separate train from test
# positive_train = [f for f in money_files if "training/" in f]
# positive_test = [f for f in money_files if "test/" in f]

# all_files = reuters.fileids()
# all_train = [f for f in all_files if "training/" in f]
# all_test = [f for f in all_files if "test/" in f]

# # take difference of sets to get negative classes
# negative_train = list(set(all_train).difference(set(positive_train)))
# negative_test = list(set(all_test).difference(set(positive_test)))

1. Create a term-document matrix containing a row for every word in the corpus vocabulary and a column for each document, where each entry is the tf-idf score of a word for a document.
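As a small worked illustration of the weighting used in this notebook (tf = 1 + log10(count) for a nonzero count, idf = log10(N / document frequency)), here is a sketch on a made-up three-document corpus. The documents and the helper name `tf_idf` are invented for the example.

```python
import math
from collections import Counter

# Toy corpus: three tiny "documents", already tokenized.
docs = {
    "d1": "the dollar fell the yen rose".split(),
    "d2": "the bank cut rates".split(),
    "d3": "dollar dollar dollar".split(),
}

counts = {name: Counter(words) for name, words in docs.items()}
n_docs = len(docs)
vocabulary = sorted(set().union(*docs.values()))

# Document frequency: in how many documents does each word occur?
doc_freq = {w: sum(1 for c in counts.values() if c[w] > 0) for w in vocabulary}


def tf_idf(word, doc):
    """Log-scaled term frequency times log-scaled inverse document frequency."""
    count = counts[doc][word]
    if count == 0:
        return 0.0
    tf = 1 + math.log10(count)
    idf = math.log10(n_docs / doc_freq[word])
    return tf * idf


# Term-document matrix: one row per word, one entry per document.
matrix = {w: [tf_idf(w, d) for d in docs] for w in vocabulary}
for word, row in matrix.items():
    print(f"{word:>7}", [round(score, 3) for score in row])
```

A word that occurs in every document would get idf = log10(1) = 0, so its whole row would be zero regardless of how often it appears.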
    
    

In [96]:
superset_words = set()
corpus_wordcounts = Counter()
wordcounts_per_article = {}

for article in money:
    words_in_article = reuters.words(article)
    # get rid of symbols and numbers
    words_in_article = [word.lower() for word in words_in_article
                        if word.isalnum() and not word.isnumeric()
                        # known hack: also drop alphanumeric tokens starting with these digits
                        and not word.startswith(("0", "1", "4", "5", "8"))]
    superset_words = superset_words | set(words_in_article)
    
    article_word_counts = Counter(words_in_article)
    corpus_wordcounts += article_word_counts
    wordcounts_per_article[article] = article_word_counts
    
# TODO: term freq is 1+log_10(count(t,d)) if count > 0, else 0
# TODO: tf-idf = tf * idf 

In [112]:
superset_words = sorted(list(superset_words))  # get words in alphabetical order

In [113]:
# raw_term_freqs = {}
# for article in money:
#     term_frequencies = [wordcounts_per_article[article][word] for word in superset_words]
#     raw_term_freqs[article] = term_frequencies
    
term_freqs = {}
for article, raw_counts in wordcounts_per_article.items():
    scaled_counts = Counter()
    for word, raw_count in raw_counts.items():
        if raw_count > 0:
            count = 1 + np.log10(raw_count)
        else:
            count = 0
        scaled_counts[word] = count
    # Counters are unordered, so use a list comprehension to get the values in sorted word order
    term_freqs[article] = [scaled_counts[word] for word in superset_words]
    
term_freq_df = pd.DataFrame.from_dict(term_freqs)
term_freq_df["word"] = superset_words
term_freq_df.set_index("word", inplace=True)

In [114]:
def count_N_docs_term_present(row):
    times_term_present = 0
    for count in row:
        if count > 0:
            times_term_present += 1
            
    return times_term_present

N_docs_term_present = term_freq_df.apply(count_N_docs_term_present, axis=1)
N_documents = len(money)
inverse_document_frequencies = np.log10(N_documents / N_docs_term_present)
tf_idf_df = term_freq_df.apply(lambda column: column * inverse_document_frequencies, axis=0)

In [115]:
tf_idf_df

word,test/14849,test/14861,test/14890,test/14913,test/14919,test/14931,test/14964,test/14987,test/15048,test/15212,...,training/9862,training/9864,training/9871,training/9880,training/9923,training/9946,training/9955,training/9957,training/9975,training/999
a,0.265163,0.186974,0.298803,0.230235,0.143712,0.298803,0.143712,0.265163,0.21228,0.265163,...,0.0,0.186974,0.244162,0.143712,0.265163,0.287424,0.0,0.143712,0.308424,0.0
aa,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0
abandon,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0
abandoned,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0
abandoning,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0
abandons,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0
abate,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0
abated,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0
abdel,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0
abdul,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0


2. Reduce the size of the matrix. Compute the maximum tf-idf score for each word and keep the 500 rows with the largest maxima. Did that remove the maximum tf-idf score of any column? Comment.
    

In [126]:
max_score_per_word = tf_idf_df.max(axis=1)
top_500_maxima = max_score_per_word.sort_values(ascending=False).head(500)
top_500_maxima_words = top_500_maxima.index


['sudan',
 'darman',
 'gleske',
 'unquoted',
 'roderick',
 'peso',
 'azzam',
 'melamed',
 'fekete',
 'eiu',
 'mueller',
 'leigh',
 'pemberton',
 'maekawa',
 'm0',
 'leutwiler',
 'bangemann',
 'fecs',
 'herrhausen',
 'marris',
 'bolivars',
 'cme',
 'liffe',
 'hungary',
 'ing',
 'cds',
 'bayliss',
 'se',
 'tietmeyer',
 'ft',
 'pdvsa',
 'herstatt',
 'ruding',
 'sprinkel',
 'seipp',
 'australia',
 'rand',
 'viermetz',
 'recognised',
 'telex',
 'netting',
 'monsod',
 'cambio',
 'usx',
 'khartoum',
 'lots',
 'cbot',
 'cdu',
 'raiders',
 'byers',
 'strikes',
 'zimbabwe',
 'chirac',
 'dri',
 'tian',
 'feb',
 'endaka',
 'shilling',
 'guiara',
 'ig',
 'cohen',
 'metall',
 'labor',
 'chinese',
 'la',
 'skeoch',
 'eyskens',
 'ldp',
 'miti',
 'sarney',
 'koehler',
 'unions',
 'venezuela',
 'naira',
 'kaufman',
 'amro',
 'buttrose',
 'phlx',
 'nigeria',
 'indonesia',
 'jordan',
 'cd',
 'option',
 'horner',
 'finland',
 'squeezing',
 'schlecht',
 'el',
 'nelissen',
 'leslie',
 'negara',
 'muldoon',
 

In [127]:
# TODO figure out how to do this

max_score_per_document = tf_idf_df.max(axis=0)
max_score_per_document

test/14849       3.094367
test/14861       2.010421
test/14890       4.713283
test/14913       2.378398
test/14919       1.539667
test/14931       3.094367
test/14964       1.855519
test/14987       2.855519
test/15048       3.715116
test/15212       3.715116
test/15234       2.855519
test/15253       3.323467
test/15364       4.403806
test/15375       3.323467
test/15378       2.378398
test/15431       2.855519
test/15436       2.010421
test/15442       3.552676
test/15444       2.969636
test/15448       2.931818
test/15449       2.702718
test/15450       3.158597
test/15452       3.323467
test/15453       3.715116
test/15460       3.471395
test/15510       1.651399
test/15522       2.615618
test/15523       2.253459
test/15527       2.931818
test/15539       3.574833
                   ...   
training/9689    2.855519
training/9698    3.094367
training/9699    1.812492
training/9701    2.855519
training/9720    2.554489
training/9727    3.323467
training/9730    2.554489
training/974
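One possible way to answer the question in step 2 is to compare each column's maximum before and after the reduction. The sketch below does this on a made-up 4x3 matrix with a cutoff of 2 rows standing in for 500; `toy_tf_idf` and `keep_n` are hypothetical names. On the real data the same comparison would presumably use `tf_idf_df` and `top_500_maxima_words`.

```python
import pandas as pd

# Toy stand-in for tf_idf_df: rows are words, columns are documents.
toy_tf_idf = pd.DataFrame(
    {
        "doc_a": [0.9, 0.1, 0.0, 0.2],
        "doc_b": [0.0, 0.3, 0.0, 0.1],
        "doc_c": [0.0, 0.0, 0.8, 0.0],
    },
    index=["w1", "w2", "w3", "w4"],
)
keep_n = 2  # stands in for 500

# Keep the rows (words) with the largest row maxima.
top_words = toy_tf_idf.max(axis=1).nlargest(keep_n).index
reduced = toy_tf_idf.loc[top_words]

# A column lost its maximum if its best score among the kept rows
# is lower than its best score in the full matrix.
full_max = toy_tf_idf.max(axis=0)
reduced_max = reduced.max(axis=0)
lost_max = full_max[reduced_max < full_max]

print(list(top_words))
print(lost_max)
```

A column's top word can only be dropped when that word's own row maximum falls below the cutoff, so the documents affected are those whose highest score is small compared with the rest of the corpus.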

3. Cluster the document vectors into five clusters using an unsupervised algorithm like k-means. Create a 5x5 matrix that compares each cluster to each of the above five categories, using the Jaccard Index (see below). Comment.
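The clustering step could be sketched as follows. This is a bare-bones NumPy version of Lloyd's k-means algorithm, written only to show the mechanics; the function names `kmeans` and `nearest_centroid` and the toy points are assumptions, and a library implementation such as scikit-learn's `KMeans` would be the more usual choice in practice.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Index of the closest centroid (squared Euclidean distance) per point."""
    distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. points: array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(points, centroids)
        # Move each centroid to the mean of its points; keep it if it is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the centroids that are returned.
    return nearest_centroid(points, centroids), centroids


# Toy data: two well-separated groups of three points each.
toy_points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

Because documents are the columns of the tf-idf matrix in this notebook, the matrix would need to be transposed first so that each document is one row, for example `kmeans(tf_idf_df.T.to_numpy(), k=5)`. Comparing five clusters with five categories also presupposes that the matrix holds documents from all five categories.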

4. Try clustering the words and comparing those clusters to the categories, too. Comment on the results.

The Jaccard Index compares two sets A and B using the formula

J(A, B) = |A ∩ B| / |A ∪ B|
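The formula translates directly into set operations. The sketch below computes the index and arranges it into a cluster-by-category matrix; the names `jaccard` and `jaccard_matrix` and the toy sets are made up for illustration, and returning 0.0 for two empty sets is a convention chosen here, not something the assignment specifies.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def jaccard_matrix(clusters, categories):
    """One row per cluster, one column per category.

    clusters:   dict of cluster id -> set of document ids
    categories: dict of category name -> set of document ids
    """
    return [
        [jaccard(members, docs) for docs in categories.values()]
        for members in clusters.values()
    ]


# Toy example with two clusters and two categories.
toy_clusters = {0: {"d1", "d2"}, 1: {"d3"}}
toy_categories = {"crude": {"d1"}, "ship": {"d2", "d3"}}
toy_matrix = jaccard_matrix(toy_clusters, toy_categories)
for cluster_id, row in zip(toy_clusters, toy_matrix):
    print(cluster_id, [round(value, 3) for value in row])
```

With five clusters and the five categories this yields the 5x5 matrix asked for in step 3; a clustering that recovers the categories well would show one value close to 1 in each row and each column.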