A couple of reminders about how to use Jupyter Notebook: 
- Hit Shift+Enter to run code and finish typing in Markdown
- You should not have to change much code here, but be careful to document carefully if you do 

Import all the necessary packages into Python. If there are any you don't have, type "!pip install package-name" above the import statements and it should install that package. You should only have to do that once, so you can delete it right afterwards. 

In [46]:
import nltk
import re
import gensim
import string

This tutorial compares a set of three debates for each election for 6 presidential elections, for three presidents. It takes the first set of debates for a given president and compares it to the second set of debates for that president. However, this code can be adapted to do any comparison of any two (or more) documents. It measure "Term Frequency Inverse Document Frequency", which is a measure of frequency of a term in a document relative to all other documents in that set.  

We start by reading text files into a doc list. If you have more than two documents, you may consider creating a loop or a function to read them in. Since we're using a doc list for the rest of the steps, you can add as many docs as you like.

In [47]:
doc = []

with open ("Clinton_term_one.txt", "r",encoding="utf-8") as myfile:
    term_one=myfile.read().replace('\n', ' ')
    doc.append(term_one)

with open ("Clinton_term_two.txt", "r",encoding="utf-8") as myfile:
    term_two=myfile.read().replace('\n', ' ')
    doc.append(term_two)

Use the following code to get rid of punctuation and tokenize the documents - that is, break them into individual words. Uncomment "#print(term_vec[0])" to see the results of this preprocessing.

In [49]:
# Remove punctuation, then tokenize documents

punc = re.compile( '[%s]' % re.escape( string.punctuation ) )
term_vec = [ ]
stop_words = nltk.corpus.stopwords.words( 'english' )
stop_words = stop_words.append(["youd","5000","300","95","”","7","wasnt"])

for d in doc:
    d = d.lower()
    d = punc.sub( '', d )
    term_vec.append( nltk.word_tokenize( d ) )
    
#print(term_vec[0])

Remove stopwords from the docs, i.e. unnecessary words that don't provide much information. To see the results of this processing, uncomment the print statement.

In [50]:
# Remove stop words from term vectors

stop_words = nltk.corpus.stopwords.words( 'english' )
term_list = []
additional_words = ["youd","5000","300","95","”","7","wasnt"]
#doc_list = []
for doc in term_vec:
    for term in doc:
        if term in stop_words:
            doc.remove(term)
#print(term_vec[1])

Now we use the Porter stemmer to stem the words so that we don't get redundant information - we only need the roots of the words, like "lead", "leader", "leadership" will all be "lead" for our purposes.

In [51]:
# Porter stem remaining terms

porter = nltk.stem.porter.PorterStemmer()

for i in range( 0, len( term_vec ) ):
    for j in range( 0, len( term_vec[ i ] ) ):
        term_vec[ i ][ j ] = porter.stem( term_vec[ i ][ j ] )

#print(term_vec[0])

Here we use the gensim dictionary function to convert the term vectors into a gensim dictionary, which assigns an index to each unique word in term_vec. Below that we create a bag of words model (a model for text which ignores word order and context), and finally we create a tfidf matrix of the term frequency inverse document frequency values.

In [52]:
#  Convert term vectors into gensim dictionary

dict = gensim.corpora.Dictionary(term_vec)

corp = []
for i in range(0, len(term_vec)):
    corp.append( dict.doc2bow( term_vec[ i ] ) )

#  Create TFIDF vectors based on term vectors bag-of-word corpora

tfidf_model = gensim.models.TfidfModel( corp )

tfidf = [ ]
for i in range( 0, len( corp ) ):
    tfidf.append( tfidf_model[ corp[ i ] ] )

dictionary = dict.token2id
#print(tfidf[0])

Remember that the numbers you generated in the TFIDF list up there correspond to a measure of frequency of occurrences in *this* document compared to other documents. See which words show up in one document that don't in the other. 

In [53]:
#for key,val in dictionary.items():
#    if val == 121:
#        print(key)

for i in range(0,len(tfidf)):
    sorted_values = sorted(tfidf[i], key=lambda x: x[1])
#term_one_tf = sorted(tfidf[0], key=lambda x: x[1])
    first = (len(sorted_values))-39
    top_40 = sorted_values[first:]
    #print(top_40)
    values =  [int(i[0]) for i in top_40]
    print("Doc",str(i),"terms of note:")
    for key,val in dictionary.items():
        if val in values:
            print(key)
    print()

Doc 0 terms of note:
trickledown
challeng
bush
courag
experi
fourth
200000
middleclass
corpor
research
wealthi
statement
150000
engag
technolog
mobil
arm
applaud
quickli
execut
reinvest
low
near
rebuild
strength
statu
surplu
involv
medicin
oregon
ceil
wors
oversea
theori
format
again
wait
down
influenc

Doc 1 terms of note:
dole
took
peac
protect
environ
childrear
ban
bridg
young
tobacco
sale
focu
270
associ
scheme
financ
alway
prove
led
border
penalti
effect
east
becam
cuba
whatev
san
diego
partisan
depart
level
toxic
youll
stronger
left
sell
qualifi
religi
up

