<a href="https://colab.research.google.com/github/abishek/learning-anlp/blob/master/TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The purpose of this notebook is to understand basic tf-idf generation, and possibly cosine similarity and querying as well. To implement this, I am constructing my own small corpus of 10 documents, each a simple sentence.

Most of the code is self-explanatory. I am also trying out a literate programming style, so you'll need to run all the code blocks in sequence.

In [1]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.probability import  FreqDist
import pandas as pd

stop_words = set(stopwords.words('english'))
my_corpus = {
  'd0': 'Agricultural robots are used for harvesting of crops and weed control',
  'd1': 'Weed control robots are more efficient alternatives to mass spraying of herbicide.',
  'd2': 'Robots employing computer vision algorithms target weeds and spray herbicides with high precision.',
  'd3': 'This reduces the development of herbicide resistance in weeds.',
  'd4': 'Precision spraying done by robots also reduces the amount of herbicide that is sprayed on crops.',
  'd5': 'Driverless tractors optimize operations on the farm and speed up the rate at which fields are tilled.',
  'd6': 'The speed of the tractors, obstacle detection and avoidance, and the definition of preset routes is handled by an AI system.',
  'd7': 'A supervisor monitors all operations from a central control room and takes remote control if necessary.',
  'd8': 'Automated harvesting can compensate for labor shortages.',
  'd9': 'Robotic crop pickers significantly reduce production costs for many crops.',
}

N_documents = len(my_corpus)
words = []
for document in my_corpus.values():
  words.extend(document.split())

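# Keep purely alphabetic tokens, lower-case them, then remove stop words.
# Note: isalpha() discards any token with punctuation attached (e.g. 'herbicide.'),
# so those words are dropped from the corpus rather than cleaned.
words = [word.lower() for word in words if word.isalpha()]
words = [word for word in words if word not in stop_words]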

# I might need to lemmatize the content for better processing. 
# But I'll get to this later once the basics are thorough.
print("There are {0} words in the corpus. {1} are unique.".format(len(words), len(set(words))))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
There are 74 words in the corpus. 61 are unique.


We can calculate term frequencies using the FreqDist class in nltk. Here I have computed them with a simple list.count, which should give the same counts as FreqDist. The catch is that these counts treat the entire corpus as a single document, whereas I am interested in per-document numbers.

In [6]:
heading = ['Word', 'Frequency']
# Corpus-wide count of each unique term.
term_frequencies_custom = {term: words.count(term) for term in set(words)}
custom_tf_df = pd.DataFrame(list(term_frequencies_custom.items()), columns=heading)
# Most frequent first; ties are broken alphabetically.
custom_tf_df.sort_values(by=['Frequency', 'Word'], inplace=True, ascending=[False, True])
print(custom_tf_df[:5])


          Word  Frequency
52     control          4
32      robots          4
6   harvesting          2
31   herbicide          2
1   operations          2
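
To illustrate the difference between corpus-wide and per-document counts, here is a minimal sketch. It uses collections.Counter in place of FreqDist (FreqDist is a Counter subclass in nltk, so the counting behaviour should match), and the mini-corpus, stop-word set, and helper name `tokens` are assumptions made up for this example rather than part of the notebook.

```python
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only.
docs = {
    'a': 'robots spray weeds and robots harvest crops',
    'b': 'tractors till fields',
}
stop = {'and'}

def tokens(text):
    """Lower-cased alphabetic tokens with stop words removed."""
    return [w.lower() for w in text.split() if w.isalpha() and w.lower() not in stop]

# Corpus-wide counts: every document is merged into one token list.
corpus_tokens = [t for text in docs.values() for t in tokens(text)]
corpus_counts = Counter(corpus_tokens)

# Counter agrees with the list.count approach, term by term.
assert all(corpus_counts[t] == corpus_tokens.count(t) for t in corpus_counts)

# Per-document counts: one Counter per document.
per_doc_counts = {k: Counter(tokens(text)) for k, text in docs.items()}
print(per_doc_counts['a']['robots'])  # 2
print(per_doc_counts['b']['robots'])  # 0 (a Counter returns 0 for missing terms)
```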


Let's compute TF-IDF for each of these terms. TF is the same count as above, but taken per document.

We define IDF as follows:

$IDF_t = \log\left(\frac{N}{D_{ft}}\right)$

where $N$ is the number of documents and $D_{ft}$ is the number of documents containing the term $t$. The logarithm is the natural log, as computed by math.log.
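
As a quick sanity check on the formula, the sketch below evaluates it for a 10-document corpus. The function name `idf` and its parameter names are assumptions chosen for this example; only the formula itself comes from the definition above.

```python
from math import log

def idf(n_documents, document_frequency):
    """Inverse document frequency with the natural log: log(N / D_ft)."""
    return log(n_documents / document_frequency)

# A term found in only 1 of 10 documents gets the highest weight.
print(round(idf(10, 1), 6))   # 2.302585
# A term found in 2 of 10 documents is weighted lower.
print(round(idf(10, 2), 6))   # 1.609438
# A term found in every document carries no weight at all.
print(idf(10, 10))            # 0.0
```

A term with a raw count of 1 in a document therefore has a TF-IDF equal to its IDF, which is why rare terms dominate the scores in a corpus this small.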

In [19]:
from math import log

def tokenize(document):
  """Lower-cased alphabetic tokens with stop words removed (same filter as the first cell)."""
  tokens = [word.lower() for word in document.split() if word.isalpha()]
  return [token for token in tokens if token not in stop_words]

# Tokenize each document once, rather than once per term.
# This also avoids overwriting the corpus-level `words` list.
doc_tokens = {k: tokenize(document) for k, document in my_corpus.items()}

tfs = {}
idfs = {}
for term in term_frequencies_custom:
  # Raw count of the term in each document.
  tfs[term] = {k: tokens.count(term) for k, tokens in doc_tokens.items()}
  # Number of documents that contain the term at least once.
  dft = sum(1 for count in tfs[term].values() if count > 0)
  idfs[term] = log(N_documents / dft)

# TF-IDF is the per-document count scaled by the term's IDF.
tf_idfs = {term: {k: tfs[term][k] * idfs[term] for k in my_corpus} for term in term_frequencies_custom}

print(pd.DataFrame.from_dict(tf_idfs, orient='index'))


                    d0        d1        d2  ...        d7        d8        d9
reduce        0.000000  0.000000  0.000000  ...  0.000000  0.000000  2.302585
operations    0.000000  0.000000  0.000000  ...  1.609438  0.000000  0.000000
compensate    0.000000  0.000000  0.000000  ...  0.000000  2.302585  0.000000
crops         2.302585  0.000000  0.000000  ...  0.000000  0.000000  0.000000
alternatives  0.000000  2.302585  0.000000  ...  0.000000  0.000000  0.000000
...                ...       ...       ...  ...       ...       ...       ...
farm          0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
remote        0.000000  0.000000  0.000000  ...  2.302585  0.000000  0.000000
spray         0.000000  0.000000  2.302585  ...  0.000000  0.000000  0.000000
efficient     0.000000  2.302585  0.000000  ...  0.000000  0.000000  0.000000
done          0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000

[61 rows x 10 columns]
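
The stated goal of this notebook also mentions cosine similarity and querying. One possible way to get there, sketched under assumptions: treat each document as a sparse vector of term weights and rank documents by the cosine of the angle between each document and the query. The function `cosine_similarity`, the weights, and the document keys below are hypothetical and not taken from the matrix above.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical tf-idf weights for three documents and a one-word query.
docs = {
    'a': {'weed': 2.0, 'control': 1.0},
    'b': {'tractors': 2.0, 'speed': 1.0},
    'c': {'weed': 1.0, 'robots': 2.0},
}
query = {'weed': 1.0}

# Rank documents from most to least similar to the query.
ranking = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]), reverse=True)
print(ranking)  # ['a', 'c', 'b']
```

Because cosine similarity normalises by vector length, a document is ranked by how much of its weight falls on the query terms, not by how long it is.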
