
The purpose of this notebook is to understand basic TF-IDF generation and, if possible, cosine similarity and querying. To implement this, I am constructing my own small corpus of 10 documents made up of simple sentences.

Most of the code is self-explanatory. I am also trying out a literate programming style, so you'll need to run all the code blocks in sequence.

In [19]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.probability import  FreqDist
import pandas as pd

stop_words = set(stopwords.words('english'))
my_corpus = {
  'd0': 'Agricultural robots are used for harvesting of crops and weed control',
  'd1': 'Weed control robots are more efficient alternatives to mass spraying of herbicide.',
  'd2': 'Robots employing computer vision algorithms target weeds and spray herbicides with high precision.',
  'd3': 'This reduces the development of herbicide resistance in weeds.',
  'd4': 'Precision spraying done by robots also reduces the amount of herbicide that is sprayed on crops.',
  'd5': 'Driverless tractors optimize operations on the farm and speed up the rate at which fields are tilled.',
  'd6': 'The speed of the tractors, obstacle detection and avoidance, and the definition of preset routes is handled by an AI system.',
  'd7': 'A supervisor monitors all operations from a central control room and takes remote control if necessary.',
  'd8': 'Automated harvesting can compensate for labor shortages.',
  'd9': 'Robotic crop pickers significantly reduce production costs for many crops.',
}

N_documents = len(my_corpus)  # 10 documents
words = []
for document in my_corpus.values():
  words.extend(document.split())

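# Convert to lower case and remove stop words. Note that isalpha() discards any
# token with punctuation attached (e.g. 'herbicide.' or 'tractors,'), so those
# words are dropped from the corpus rather than cleaned.
words = [word.lower() for word in words if word.isalpha()]
words = [word for word in words if word not in stop_words]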

# I might need to lemmatize the content for better processing. 
# But I'll get to this later once the basics are thorough.
print("There are {0} words in the corpus. {1} are unique.".format(len(words), len(set(words))))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
There are 74 words in the corpus. 61 are unique.
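
Because `isalpha()` rejects tokens such as `herbicide.`, the last word of most sentences never reaches the vocabulary. One possible alternative, shown here only as a sketch, is to strip surrounding punctuation before the check. The `tokenize` helper and the abbreviated `SAMPLE_STOP_WORDS` set are hypothetical stand-ins (the notebook itself uses the NLTK stop-word list), chosen so the example runs without downloading anything.

```python
import string

# Hypothetical, abbreviated stop-word set so the sketch runs without NLTK data.
SAMPLE_STOP_WORDS = {"are", "to", "of", "more", "the", "in", "this"}

def tokenize(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, strip surrounding punctuation, drop non-alphabetic tokens and stop words."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if token.isalpha() and token not in stop_words:
            tokens.append(token)
    return tokens

sentence = 'Weed control robots are more efficient alternatives to mass spraying of herbicide.'
tokens = tokenize(sentence)
print(tokens)
```

Swapping this in would keep words like `herbicide` and `crops` at sentence ends, so the word counts would differ from the ones printed above.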


We can calculate term frequencies using the FreqDist class in nltk. I computed them with a simple list.count as well as with FreqDist, and both give the same counts.

In [20]:
heading = ['Word','Frequency']
term_frequencies_custom = {}
for term in set(words):
  term_frequencies_custom[term] = words.count(term)
# print(pd.DataFrame(term_frequencies_custom.items(), columns=heading))

term_frequencies = FreqDist(words)
# most_common(5) already returns a list of (word, count) tuples
tf_list = term_frequencies.most_common(5)
print(pd.DataFrame(tf_list,columns=heading))

         Word  Frequency
0      robots          4
1     control          4
2  harvesting          2
3        weed          2
4    spraying          2


Next we count how often each term occurs in each document. A term's document frequency is then the number of documents in which this count is nonzero.

In [26]:
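# Tokenize each document once, with the same filtering used for the whole corpus.
# A separate name (doc_words) avoids overwriting the corpus-level `words` list.
tokens_by_document = {}
for k, document in my_corpus.items():
  doc_words = [word.lower() for word in document.split() if word.isalpha()]
  tokens_by_document[k] = [word for word in doc_words if word not in stop_words]

# One row per term, one column per document: occurrences of the term in that document.
document_frequencies = {
  term: [tokens.count(term) for tokens in tokens_by_document.values()]
  for term in term_frequencies.keys()
}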

freq_matrix = pd.DataFrame.from_dict(document_frequencies, orient='index', columns=my_corpus.keys())
print(freq_matrix)


               d0  d1  d2  d3  d4  d5  d6  d7  d8  d9
agricultural    1   0   0   0   0   0   0   0   0   0
robots          1   1   1   0   1   0   0   0   0   0
used            1   0   0   0   0   0   0   0   0   0
harvesting      1   0   0   0   0   0   0   0   1   0
crops           1   0   0   0   0   0   0   0   0   0
...            ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
significantly   0   0   0   0   0   0   0   0   0   1
reduce          0   0   0   0   0   0   0   0   0   1
production      0   0   0   0   0   0   0   0   0   1
costs           0   0   0   0   0   0   0   0   0   1
many            0   0   0   0   0   0   0   0   0   1

[61 rows x 10 columns]
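
The matrix above holds counts per document. To get a single document frequency per term, one option is to count the nonzero cells in each row. The sketch below rebuilds three rows copied from the printed matrix so that it runs on its own; `sample_counts` and `df_per_term` are illustrative names, not part of the notebook.

```python
import pandas as pd

doc_ids = ['d{}'.format(i) for i in range(10)]

# Three rows copied from the matrix printed above; the full matrix has 61 rows.
sample_counts = pd.DataFrame.from_dict(
    {
        'agricultural': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        'robots':       [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
        'harvesting':   [1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    },
    orient='index',
    columns=doc_ids,
)

# Document frequency: number of documents in which the term appears at least once.
df_per_term = (sample_counts > 0).sum(axis=1)
print(df_per_term)
```

Applied to the full matrix, the same expression would be `(freq_matrix > 0).sum(axis=1)`.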


Now that we have term and document frequencies, let's compute the TF-IDF values for the terms in each document.
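
As a rough illustration of what this step involves, here is a minimal sketch that assumes one common weighting: raw count as TF and the natural log of N / df as IDF. This is only one of several variants and not necessarily the one this notebook ends up using. The `tf_idf_row` helper and the three sample rows (copied from the matrix printed earlier) are hypothetical, chosen to keep the example self-contained.

```python
import math

N_DOCS = 10  # corpus size used in this notebook

# Per-document counts for three terms, copied from the matrix printed earlier.
counts = {
    'agricultural': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'robots':       [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    'harvesting':   [1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
}

def tf_idf_row(term_counts, n_docs):
    """Raw-count TF times log(N / df) for one term across all documents.

    Assumes the term occurs in at least one document, so df is never zero.
    """
    df = sum(1 for c in term_counts if c > 0)
    idf = math.log(n_docs / df)
    return [c * idf for c in term_counts]

tf_idf = {term: tf_idf_row(row, N_DOCS) for term, row in counts.items()}
for term, row in tf_idf.items():
    print(term, [round(v, 3) for v in row])
```

Under this weighting, a term that appears in a single document (`agricultural`) scores higher in d0 than one spread across four documents (`robots`), which is the behaviour TF-IDF is meant to capture.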
