
The NLTK Shakespeare corpus contains several plays, and I treat each play as one document. I'll compute TF-IDF for every term across all the documents, then use the computed TF-IDF matrix as the basis for a cosine distance computation against a given query.

In [9]:
import nltk
nltk.download('stopwords')
nltk.download('shakespeare')
from nltk.probability import FreqDist
from nltk.corpus import stopwords, shakespeare
import pandas as pd
from math import log

stop_words = set(stopwords.words('english'))

# The Shakespeare corpus ships with several plays; list their file ids.
print(shakespeare.fileids())

def process_document(doc):
  """Return the lowercased alphabetic tokens of a play, minus English stop words."""
  # Keep alphabetic tokens only and convert them to lowercase.
  words = [word.lower() for word in shakespeare.words(doc) if word.isalpha()]
  # Drop stop words (the NLTK stop word list is already lowercase).
  return [word for word in words if word not in stop_words]

all_words = []
words_in_document = {}
for play in shakespeare.fileids():
  words = process_document(play)
  all_words.extend(words)
  words_in_document[play] = words
  

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package shakespeare to /root/nltk_data...
[nltk_data]   Package shakespeare is already up-to-date!
['a_and_c.xml', 'dream.xml', 'hamlet.xml', 'j_caesar.xml', 'macbeth.xml', 'merchant.xml', 'othello.xml', 'r_and_j.xml']


The overall frequency distribution of words can be computed with NLTK's FreqDist. Here are the 20 most common words across all the plays.

In [7]:
fDist = FreqDist(all_words)
heading = ['Word','Frequency']
# most_common(20) already returns a list of (word, count) pairs.
tf_list = fDist.most_common(20)
print(pd.DataFrame(tf_list,columns=heading))

      Word  Frequency
0     thou       1141
1    shall        816
2      thy        670
3     lord        631
4     come        626
5     thee        626
6     good        606
7   caesar        570
8     love        567
9    enter        563
10     let        557
11  antony        517
12    well        502
13   would        479
14  hamlet        472
15     man        436
16      go        425
17    know        410
18    hath        408
19    upon        403


TF-IDF, however, is computed on a per-document basis, so the corpus-wide distribution above is less useful for this exercise. all_words still serves as the bag of words: it supplies the terms whose per-document term frequencies and corpus-wide document frequencies are computed next.
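To make the per-document computation concrete, here is a minimal sketch on a toy corpus. It assumes the common textbook weighting, tf = count / document length and idf = log(N / df), which may differ from the normalisation used in the cell that follows. The function name `tf_idf` and the toy documents are made up for illustration and are not part of the notebook.

```python
from collections import Counter
from math import log

def tf_idf(docs):
    """docs maps a document name to its list of tokens.

    Returns {doc: {term: weight}} with tf = count / len(doc) and
    idf = log(N / df). This weighting is an assumption of the sketch.
    """
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for c in counts.values() for term in c)
    weights = {}
    for name, tokens in docs.items():
        weights[name] = {
            term: (count / len(tokens)) * log(n_docs / df[term])
            for term, count in counts[name].items()
        }
    return weights

# Toy corpus, invented for illustration (not taken from the plays).
toy_docs = {
    "d1": ["caesar", "rome", "caesar"],
    "d2": ["rome", "love"],
}
toy_weights = tf_idf(toy_docs)
print(toy_weights)
```

A term that occurs in every document ("rome" here) gets idf = log(1) = 0, so its weight vanishes, while a term concentrated in a single document keeps a positive weight.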


In [14]:
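from collections import Counter

n_docs = len(words_in_document)
# Unique terms in first-seen order, so each term is processed only once.
vocabulary = list(dict.fromkeys(all_words))
# Per-document term counts, computed once instead of calling list.count() per term.
counts = {d: Counter(ws) for d, ws in words_in_document.items()}

# NOTE: tf here is the raw count divided by the number of documents, not by the
# document's length, so it is a scaled raw count rather than a relative frequency.
tfs = {d: {t: counts[d][t] / n_docs for t in vocabulary} for d in words_in_document}

# Despite its name, dfs holds the inverse document frequency log(N / df).
dfs = {}
for term in vocabulary:
  df = sum(1 for d in words_in_document if counts[d][term] > 0)
  dfs[term] = log(n_docs / df)

tf_idfs = {d: {t: tfs[d][t] * dfs[t] for t in vocabulary} for d in words_in_document}

print(pd.DataFrame.from_dict(tf_idfs, orient='columns'))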

            a_and_c.xml  dream.xml  ...  othello.xml  r_and_j.xml
tragedy        0.016691   0.016691  ...     0.016691     0.016691
antony        33.530995   0.000000  ...     0.000000     0.086643
cleopatra     47.480582   0.000000  ...     0.000000     0.173287
dramatis       0.000000   0.000000  ...     0.000000     0.000000
personae       0.000000   0.000000  ...     0.000000     0.000000
...                 ...        ...  ...          ...          ...
pothecary      0.000000   0.000000  ...     0.000000     0.259930
jointure       0.000000   0.000000  ...     0.000000     0.259930
sacrifices     0.000000   0.000000  ...     0.000000     0.259930
glooming       0.000000   0.000000  ...     0.000000     0.259930
punished       0.000000   0.000000  ...     0.000000     0.259930

[11192 rows x 8 columns]
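Each column of this matrix is one play's TF-IDF vector, so a query can be scored against every play with cosine similarity (cosine distance is 1 minus that value). The following is a minimal sketch under stated assumptions: vectors are plain `{term: weight}` dicts like the inner dicts of `tf_idfs`, the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the toy vectors are invented rather than taken from the output above. How the query itself should be weighted is left open here.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    # An all-zero (or empty) vector has no direction; report no similarity.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vector, doc_vectors):
    """Return (document, similarity) pairs, most similar first."""
    scores = {d: cosine_similarity(query_vector, vec)
              for d, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy vectors, invented for illustration.
toy_doc_vectors = {
    "play_a": {"antony": 2.0, "cleopatra": 1.0},
    "play_b": {"pothecary": 1.0},
}
toy_query = {"antony": 1.0}
ranking = rank_documents(toy_query, toy_doc_vectors)
print(ranking)
```

With the notebook's own data, `rank_documents(query_vector, tf_idfs)` would be the analogous call, given a query vector built over the same terms.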
