
This notebook calculates term frequency in a document, a basic and useful first step before any further analysis of the text.

In [1]:
# !pip install nltk   (not necessary if NLTK is already installed)

In [5]:
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords

Remove stop words such as "not", "and", and "in": they are common across most documents and add little value to the term frequency calculation. NLTK already provides a list of stop words that we can check our text against.

In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Remember to choose the English stop words. Storing them in a set keeps the membership checks fast.

In [7]:
stop_words=set(stopwords.words('english'))



Download one of the sample corpora that NLTK provides (the Gutenberg collection) so that we have text to work on.

In [8]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

Choose one text since this is a simple tutorial, and store its words. Convert the words to lower case, keep only those made up entirely of alphabetic characters, and then remove the words that appear in the stop word list we downloaded.

In [9]:
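words=nltk.Text(nltk.corpus.gutenberg.words('bryant-stories.txt'))
# Keep alphabetic tokens only, lowercased so they match the lower-case stop word list
words=[word.lower() for word in words if word.isalpha()]
# Drop stop words (the tokens are already lower case at this point)
words=[word for word in words if word not in stop_words]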
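
To see what these cleaning steps do to individual tokens, here is a minimal sketch on a tiny hand-made token list. The names `toy_stop_words`, `raw_tokens`, and `clean_tokens` are made up for this illustration, and the stop word set is a hand-picked stand-in for NLTK's much larger English list. Note that lowercasing happens before the stop word check, so "The" is caught as well as "the".

```python
# Hypothetical, hand-picked stop words for illustration only;
# the notebook itself uses NLTK's English stop word list.
toy_stop_words = {"the", "and", "in", "not", "a", "was"}

raw_tokens = ["The", "King", "was", "not", "in", "the", "castle",
              ",", "and", "1", "king", "laughed", "."]

def clean_tokens(tokens, stop_words):
    """Lowercase tokens, keep purely alphabetic ones, and drop stop words."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        # isalpha() rejects punctuation and numbers such as "," and "1"
        if lowered.isalpha() and lowered not in stop_words:
            cleaned.append(lowered)
    return cleaned

cleaned = clean_tokens(raw_tokens, toy_stop_words)
print(cleaned)  # ['king', 'castle', 'king', 'laughed']
```

This does in a single pass what the two list comprehensions in the cell above do in two.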

Let's get the frequency distribution of these words.

In [10]:
fDist=FreqDist(words)

Now before we go ahead, let us see how many words we have, and how many unique words make up this set. The unique set defines the vocabulary of this text.

In [11]:
print(len(words))
print(len(set(words)))

21718
3688


Now let's print the 10 most common words in the text.

In [12]:
for x,v in fDist.most_common(10):
  print(x,v)

little 597
said 453
came 191
one 183
could 158
king 141
went 122
would 112
great 110
day 107
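
`FreqDist` builds on Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The token list below is made up for illustration and is not taken from the corpus.

```python
from collections import Counter

# Made-up tokens, standing in for the cleaned word list
tokens = ["king", "little", "said", "little", "king", "little"]

counts = Counter(tokens)

# most_common(n) returns (word, count) pairs, highest count first
for word, count in counts.most_common(2):
    print(word, count)
# little 3
# king 2
```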


To get a better picture, let us now see what proportion of the entire text these words make up.

In [13]:
for x,v in fDist.most_common(10):
  # Divide by the total number of tokens, not by the number of distinct words (len(fDist))
  print(x, f"{v/len(words):.4f}")

little 0.0275
said 0.0209
came 0.0088
one 0.0084
could 0.0073
king 0.0065
went 0.0056
would 0.0052
great 0.0051
day 0.0049


No single word dominates: 'little' makes up about 2.7 percent of the filtered text, and 'said' about 2.1 percent. Whether this SAYS too LITTLE about this text, we will know only after further analysis.
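
A word's proportion of the text is its count divided by the total number of tokens, so the proportions of all words add up to 1. Here is a small sketch of that calculation with the standard library; `relative_frequencies` is a hypothetical helper written for this example, and the token list is made up.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens (count / total tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())  # total tokens, not the number of distinct words
    return {word: count / total for word, count in counts.items()}

tokens = ["little", "said", "little", "king"]
freqs = relative_frequencies(tokens)
print(freqs["little"])  # 0.5
print(freqs["said"])    # 0.25
```

Dividing by the number of distinct words instead would inflate every value, and the shares would no longer sum to 1.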

Another important feature, which helps us gauge the range of vocabulary used in a document and compare it against other texts, is the type-token ratio (TTR). This is the ratio of the number of unique words (types) to the total number of words (tokens) in the text. So let us compare the TTR of two texts.

We need to ensure that the texts we're comparing have the same number of words, because TTR tends to fall as a text gets longer and words start to repeat. With equal lengths, we can see which text uses a greater vocabulary range within the same word limit.
We repeat the same cleaning steps as before, but keep only the first 15000 words of each text.
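
As a minimal sketch of the idea, the hypothetical helper `type_token_ratio` below truncates a token list to a fixed length before computing the ratio. The two token lists are made up and only serve to show the effect of repetition.

```python
def type_token_ratio(tokens, limit=None):
    """Unique words divided by total words, over the first `limit` tokens."""
    sample = tokens if limit is None else tokens[:limit]
    if not sample:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(sample)) / len(sample)

repetitive = ["the", "king", "the", "king", "the", "king"]
varied = ["king", "queen", "castle", "river", "king", "forest"]

# Compare both lists over the same number of tokens
print(type_token_ratio(repetitive, limit=4))  # 0.5
print(type_token_ratio(varied, limit=4))      # 1.0
```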

In [14]:
words_bryant=nltk.Text(nltk.corpus.gutenberg.words('bryant-stories.txt'))
words_emma=nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

# Same cleaning as before: alphabetic tokens only, lowercased, stop words removed
words_bryant=[word.lower() for word in words_bryant if word.isalpha()]
words_bryant=[word for word in words_bryant if word not in stop_words][:15000]

words_emma=[word.lower() for word in words_emma if word.isalpha()]
words_emma=[word for word in words_emma if word not in stop_words][:15000]


In [15]:
TTR_Bryant= len(set(words_bryant))/len(words_bryant)
TTR_Emma=len(set(words_emma))/len(words_emma)

In [16]:
print('Bryant: Number of tokens= ',len(words_bryant),'Vocabulary length= ',len(set(words_bryant)))
print('Emma: Number of tokens= ',len(words_emma),'Vocabulary length= ',len(set(words_emma)))


Bryant: Number of tokens=  15000 Vocabulary length=  2796
Emma: Number of tokens=  15000 Vocabulary length=  3274
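
The cells above compute the two ratios but never display them. As a quick check, the TTR values can be recomputed from the counts printed above; the dictionary names `reported` and `ttr` are introduced only for this sketch.

```python
# (vocabulary size, number of tokens) as printed in the output above
reported = {"Bryant": (2796, 15000), "Emma": (3274, 15000)}

ttr = {name: vocab / tokens for name, (vocab, tokens) in reported.items()}

for name, value in ttr.items():
    print(f"{name}: TTR = {value:.4f}")
# Bryant: TTR = 0.1864
# Emma: TTR = 0.2183
```

Over the same 15000 filtered tokens, Emma has the higher ratio, which suggests that it draws on a wider vocabulary than the Bryant stories.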
