<h1>TF-IDF</h1>
Author: Tristan

You can learn about TF-IDF here:
https://en.wikipedia.org/wiki/Tf–idf
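In brief, it is a way to weight each term in a document: the weight grows with how often the term occurs in that document and shrinks with how many documents in the corpus contain it.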
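To make the weighting concrete, here is a minimal hand-rolled sketch on a made-up three-document corpus. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization; the corpus and the helper name `tfidf` are invented for illustration and are not part of this notebook's pipeline.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each document is already tokenized.
docs = [
    "star planet".split(),
    "star use time".split(),
    "use time time".split(),
]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one document, scaled to unit L2 length."""
    counts = Counter(doc)
    # Raw count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, c in counts.items()}
    length = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / length for t, w in raw.items()}

weights = tfidf(docs[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(term, round(w, 3))
```

Both words occur once in the first document, but "planet" appears in only one document of the corpus while "star" appears in two, so "planet" receives the larger weight.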

It is *not* a way to reduce the number of terms.
Terms can be reduced afterwards using semantic analysis (for which there are multiple methods):
https://en.wikipedia.org/wiki/Semantic_analysis_(machine_learning)
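As one possible illustration of reducing terms after weighting, the sketch below applies latent semantic analysis via scikit-learn's `TruncatedSVD` to a made-up four-document corpus. This is only one of the available methods and is an assumption for illustration; the notebook itself does not perform this step, and the names `toy_docs`, `toy_weights`, and `reduced` are hypothetical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Four tiny made-up documents with eight distinct words in total.
toy_docs = [
    "star planet orbit",
    "planet orbit moon",
    "goal match team",
    "team match player",
]

# TF-IDF keeps one column per term: 4 documents x 8 terms.
toy_weights = TfidfVectorizer().fit_transform(toy_docs)

# Truncated SVD projects each document onto 2 components instead of 8 terms.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(toy_weights)

print(toy_weights.shape, "->", reduced.shape)
```

The TF-IDF step leaves the number of columns unchanged; only the SVD step shrinks it.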

In [1]:
import pandas as pd
import numpy
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import scipy.sparse

In [2]:
import nlp

In [3]:
data = nlp.proc_text()
data.head()

Unnamed: 0,index,title,text,links,categories,process
0,3,Art,[[File:Chemin montant dans les hautes herbes -...,"[renoir, senses, bowl, creativity, drawing, pa...","[Art, Non-verbal communication, Basic English ...",art activ creation peopl import attract human ...
1,19,Abbreviation,An '''abbreviation''' is a shorter way to writ...,"[english language, apostrophe, period (punctua...",[Linguistics],abbrevi shorter way write word phrase peopl us...
2,29,Astronomy,[[File:Atlas Coelestis-1.jpg|thumb|280px|18th ...,"[natural science, atmosphere of earth, astrono...",[Astronomy],astronomi natur scienc studi everyth outsid at...
3,50,Browser,{{Distinguish|Web browser}}\n[[Image:Giraffe f...,"[herbivorous, mammal, leaves, shrub, grass, gr...","[Ecology, Zoology]",browser anim usual herbivor mammal eat leav sh...
4,63,Bubonic plague,{{Infobox disease\n| Name = Bubonic...,"[lymphatic system, plague, bacterium, yersinia...","[Plague, Pulmonology]",bubon plagu best known form diseas plagu caus ...


In [4]:
data['process'][2627]

'gerd ller born novemb n rdlingen former footbal player play fc bayern nchen germani nation team best striker time still own mani score record career club career start play footbal tsv n rdlingen score season goal play fc bayern nchen first season bayern nchen score match goal till score bundesliga match score goal record still exist till play us profession leagu ford lauderdal striker smith brother loung intern start intern career octob ankara versu turkey second match versu albania score first four goal nation team member fifa world cup team ten goal best scorer tournament germani european championship ller best scorer tournament germani fifa world cup score second goal victori netherland tournament resign team record ller score goal match germani best scorer team miroslav klose broke record two world cup score goal record broken ronaldo world cup honor titl bayern munich intercontinent cup european champion cup european cup winner cup bundesliga german cup regionalliga intern world 

In [5]:
vectorizer = TfidfVectorizer(analyzer = "word", max_df=1.0, min_df=.03)
#max_df: words that appear in more than this fraction of documents are ignored (1.0 ignores none).
#min_df: words that appear in fewer than this fraction of documents are ignored.
#norm="l2" (the default) normalizes each document vector to a Euclidean length of 1.
clean_text = data["process"]
weighted_words = vectorizer.fit_transform(clean_text)
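The effect of the two document-frequency thresholds is easier to see on a tiny example. The sketch below uses a made-up four-document corpus and thresholds chosen only for illustration (`min_df=0.5`, `max_df=0.9`); the names `toy_docs`, `toy_vectorizer`, and `kept` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four tiny made-up documents.
# Document frequencies: star 4/4, use 2/4, time 1/4, planet 1/4.
toy_docs = ["star use", "star time", "star use planet", "star"]

# Keep only words found in at least half of the documents (min_df=0.5)
# and in no more than 90% of them (max_df=0.9).
toy_vectorizer = CountVectorizer(min_df=0.5, max_df=0.9)
toy_vectorizer.fit(toy_docs)

# vocabulary_ maps each surviving word to its column index.
kept = sorted(toy_vectorizer.vocabulary_)
print(kept)
```

This prints `['use']`: "star" is dropped for being too common, while "time" and "planet" are dropped for being too rare.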

In [6]:
#also get the bag of words without weighting for comparison
vectorizer2 = CountVectorizer(analyzer = "word",min_df=.03)
unweighted_words = vectorizer2.fit_transform(clean_text)

In [7]:
unweighted_words.shape

(10000, 394)

In [8]:
#Want a function to list words alongside their frequency
def get_word_frequency(sparse_matrix,doc,word_list):
    #convert the chosen document's row to COO format, which exposes the
    #column indices (cx.col) and values (cx.data) of its nonzero entries
    cx = scipy.sparse.coo_matrix(sparse_matrix[doc,:])
    #build the DataFrame in one step, one row per distinct word in the document
    #(assigning cell by cell with word_frequency['word'][k] = ... is chained
    #assignment, which recent pandas versions warn about or silently ignore)
    word_frequency = pd.DataFrame({'word': [word_list[i] for i in cx.col],
                                   'frequency': cx.data})
    #Finally, sort the DataFrame, largest value first
    word_frequency.sort_values('frequency',inplace=True,ascending=False)
    return word_frequency

In [9]:
doc_number = 2
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
test_freq = get_word_frequency(unweighted_words,doc_number,vectorizer2.get_feature_names_out())
test_weight = get_word_frequency(weighted_words,doc_number,vectorizer.get_feature_names_out())
print("Document " + str(doc_number) + ", before TF-IDF:\n",test_freq[0:10])
print("After TF-IDF:\n",test_weight[0:10])

Document 2, before TF-IDF:
        word frequency
71     star        32
4       use        18
73     time        12
72  univers        12
70    studi        12
52    chang        10
14     bodi         9
9      make         8
95   around         8
42     look         8
After TF-IDF:
         word frequency
84      star  0.593086
85     studi  0.221979
151      use  0.220351
83   univers   0.20068
141     bodi  0.176538
103    chang  0.170967
113     look  0.156743
82      time  0.156632
97      come  0.139222
98    period  0.138106


# Saving data

Although this notebook was originally written to implement TF-IDF, it is also a convenient place to save the data.

About this data: we included a random selection of 10k articles from Simple English Wikipedia. The document-term matrix is unweighted and includes only the 394 words that occur in at least 3% of articles.

In [10]:
data.to_pickle("processed_10k_articles.pkl")

In [11]:
numpy.save("document_term_matrix",unweighted_words,allow_pickle=True)

In [12]:
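#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
pd.DataFrame(vectorizer.get_feature_names_out()).to_pickle("term_list.pkl")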

## Loading data
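The following code can load the data in its original form. Loading the npy file is a bit tricky because numpy.save stored the sparse matrix as a pickled object inside a zero-dimensional array, which has to be unwrapped after loading.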

In [13]:
processed_10k_articles = pd.read_pickle("processed_10k_articles.pkl")
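#allow_pickle=True is required because numpy.load refuses pickled objects by default since NumPy 1.16.3;
#.item() unwraps the sparse matrix from the zero-dimensional object array
document_term_matrix = numpy.load("document_term_matrix.npy",allow_pickle=True).item()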
term_list = pd.read_pickle("term_list.pkl")[0].tolist()
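A possible alternative that avoids the unwrapping step is SciPy's own sparse format, written with `scipy.sparse.save_npz` and read back with `scipy.sparse.load_npz`. The sketch below is not what this notebook does; it round-trips a small made-up matrix through a temporary directory, and the names `toy_matrix`, `path`, and `restored` are hypothetical.

```python
import os
import tempfile

import numpy
import scipy.sparse

# Small stand-in for the document-term matrix (values made up).
toy_matrix = scipy.sparse.csr_matrix(numpy.array([[0, 2, 0], [1, 0, 3]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_document_term_matrix.npz")
    # save_npz writes the sparse matrix directly, without pickling.
    scipy.sparse.save_npz(path, toy_matrix)
    # load_npz returns a sparse matrix, so no reshape or unwrapping is needed.
    restored = scipy.sparse.load_npz(path)

# True when no entry differs between the original and the restored matrix.
print((restored != toy_matrix).nnz == 0)
```

With this approach the loaded object is already a sparse matrix of the same shape and format as the one that was saved.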