
## TF-IDF Simple Implementation from Scratch
### TF-IDF is a technique used to convert text into numeric vectors.

In [2]:
# import the required libraries and functions

import numpy as np
from math import log

### We will now apply TF-IDF to convert a given corpus (a collection of text documents) into vectors.

In [3]:
corpus = [
     'this is the first document mostly',
     'this document is the second document',
     'and this is the third one',
     'is this the first document here',
]

Step 1> Construct a list of the unique words in the corpus.

In [7]:
corpus_words = []

# Iterate over each sentence (called a "document" in NLP) and split it into words.
for sent in corpus:
  words = sent.split()
  corpus_words.extend(words)
print("words in corpus: ",corpus_words)

# set() keeps only the unique words (11 for this corpus); the order of a set is arbitrary.
unique_words = list(set(corpus_words))
print("Unique words list: ", unique_words)
print(len(unique_words))

words in corpus:  ['this', 'is', 'the', 'first', 'document', 'mostly', 'this', 'document', 'is', 'the', 'second', 'document', 'and', 'this', 'is', 'the', 'third', 'one', 'is', 'this', 'the', 'first', 'document', 'here']
Unique words list:  ['one', 'is', 'first', 'mostly', 'this', 'here', 'second', 'third', 'the', 'and', 'document']
11
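
Because `set()` does not preserve insertion order, the order of `unique_words` can differ between runs. The following is a minimal sketch, assuming a stable order is wanted: it deduplicates with `dict.fromkeys`, which keeps the order in which words first appear. The name `ordered_unique_words` is introduced here for illustration only and is not part of the original notebook.

```python
corpus = [
    'this is the first document mostly',
    'this document is the second document',
    'and this is the third one',
    'is this the first document here',
]

# Flatten the corpus into a single list of tokens.
corpus_words = [word for sent in corpus for word in sent.split()]

# dict keys keep insertion order (Python 3.7+), so this removes duplicates
# while preserving the order of first appearance.
ordered_unique_words = list(dict.fromkeys(corpus_words))

print("Unique words (first-appearance order): ", ordered_unique_words)
print(len(ordered_unique_words))
```

A stable order makes it easier to compare results across runs, since each word keeps the same position in the list.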


Step 2> Count how often each unique word occurs in the corpus and store the counts in a dictionary of word: word_frequency (key: value) pairs.

In [9]:
word_freq_dict = {}

for word in unique_words:
  # every word in unique_words comes from corpus_words, so its count is always >= 1
  word_freq = corpus_words.count(word)
  print(word, " frequency is ",word_freq)

  # adding key:value pairs to dictionary
  word_freq_dict[word] = word_freq

# sorting the unique words based on their frequency(value) in decreasing order 
sorted_unique_words = sorted(word_freq_dict.items(), key=lambda x: x[1], reverse=True)

# sorted dictionary with word:word frequency as key:value pairs
sorted_unique_words_dict  = dict(sorted_unique_words)
print("sorted dictionary: ", sorted_unique_words_dict)

# list of keys or unique words of dictionary
unique_words_list = list(sorted_unique_words_dict.keys())
print("sorted keys of dictionary: ", unique_words_list)

one  frequency is  1
is  frequency is  4
first  frequency is  2
mostly  frequency is  1
this  frequency is  4
here  frequency is  1
second  frequency is  1
third  frequency is  1
the  frequency is  4
and  frequency is  1
document  frequency is  4
sorted dictionary:  {'is': 4, 'this': 4, 'the': 4, 'document': 4, 'first': 2, 'one': 1, 'mostly': 1, 'here': 1, 'second': 1, 'third': 1, 'and': 1}
sorted keys of dictionary:  ['is', 'this', 'the', 'document', 'first', 'one', 'mostly', 'here', 'second', 'third', 'and']
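
The counting and sorting in Step 2 can also be sketched with `collections.Counter` from the standard library. This is an alternative to the loop in the notebook, not the method it uses; the name `word_counts` is an assumption made for this example.

```python
from collections import Counter

corpus = [
    'this is the first document mostly',
    'this document is the second document',
    'and this is the third one',
    'is this the first document here',
]

# Count every token across the whole corpus in one pass.
word_counts = Counter(word for sent in corpus for word in sent.split())

# most_common() returns (word, count) pairs sorted by decreasing count.
for word, count in word_counts.most_common():
    print(word, " frequency is ", count)
```

`Counter` scans the tokens once, whereas calling `list.count` for every unique word rescans the full token list each time.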


Step 3> Computing IDF for the unique words

Inverse Document Frequency (IDF) of a word = log((total number of documents in the corpus) / (number of documents in the corpus containing that word)), where log is the natural logarithm, as computed by `math.log`.

Observation:  
*   For frequent words, IDF is lower.
*   For rare words, IDF is higher.
*   If a word occurs in every document, its IDF is zero, because log(1) = 0.
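
As a small worked example of these observations, the sketch below evaluates the IDF formula for a corpus of 4 documents and every possible document frequency. The names `n_docs`, `doc_freq` and `idf_by_doc_freq` are hypothetical and used only for this illustration.

```python
from math import log

n_docs = 4  # number of documents in the example corpus

# doc_freq is the number of documents that contain the word.
idf_by_doc_freq = {doc_freq: log(n_docs / doc_freq) for doc_freq in (4, 3, 2, 1)}

for doc_freq, value in idf_by_doc_freq.items():
    print("word found in", doc_freq, "of", n_docs, "documents -> IDF =", round(value, 4))
```

A word found in all 4 documents gets log(4/4) = 0, and the value grows as the word appears in fewer documents, up to log(4/1) ≈ 1.3863.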



In [11]:
idf = {}
len_corpus = len(corpus)

for word in unique_words_list:
  # count the documents that contain the word at least once
  count = 0
  for sentence in corpus:
    sentence = sentence.split()
    if word in sentence:
      count += 1
  # count is never 0 here because every word was taken from the corpus
  idf[word] = log(len_corpus/count)

print("IDF for unique words: ",idf)

IDF for unique words:  {'is': 0.0, 'this': 0.0, 'the': 0.0, 'document': 0.28768207245178085, 'first': 0.6931471805599453, 'one': 1.3862943611198906, 'mostly': 1.3862943611198906, 'here': 1.3862943611198906, 'second': 1.3862943611198906, 'third': 1.3862943611198906, 'and': 1.3862943611198906}


Step 4> Computing TF for each word in each document.

Term Frequency (TF) of a word = (number of times the word occurs in a document) / (total number of words in that document)

TF can be read as the empirical probability of picking that word when a word is drawn at random from the document.

TF always lies in the range [0, 1].
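
To make the definition concrete, here is a minimal sketch that computes TF for the second document of the corpus. The names `doc` and `tf` are assumptions made for this example.

```python
# Second document of the corpus, split into tokens.
doc = 'this document is the second document'.split()

# Term frequency: occurrences of the word divided by the document length.
tf = {word: doc.count(word) / len(doc) for word in set(doc)}

print("TF of 'document': ", tf['document'])   # 2 occurrences out of 6 tokens
print("Sum of TF values: ", sum(tf.values()))
```

The TF values of the distinct words in a document add up to 1 (up to floating-point rounding), which is why TF can be interpreted as a probability.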


---

TF-IDF of a word in a document = TF of the word in that document * IDF of the word

In [12]:
tf_idf= []
for sentence in corpus:
  sentence = sentence.split()
  for word in sentence:
    # term frequency of the word within this document
    tf_of_word = sentence.count(word)/len(sentence)

    # TF-IDF of the word in this document
    tf_idf_of_word = tf_of_word * idf[word]

    # rounding values to 2 decimals
    tf_idf_of_word = round((tf_idf_of_word), 2)
    tf_idf.append(tf_idf_of_word)

# One row per document, one column per token position (not per vocabulary word).
# The reshape works only because every document in this corpus has exactly 6 tokens.
tf_idf_arr = np.array(tf_idf)
tf_idf_arr = np.reshape(tf_idf_arr, (len(corpus), -1))
print(tf_idf_arr)

[[0.   0.   0.   0.12 0.05 0.23]
 [0.   0.1  0.   0.   0.23 0.1 ]
 [0.23 0.   0.   0.   0.23 0.23]
 [0.   0.   0.   0.12 0.05 0.23]]
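
In the array above, each column corresponds to a token position within a document, so the same column can stand for different words in different rows. The sketch below is one possible variation, not the notebook's method: it gives every vocabulary word its own column, which also works when documents have different lengths. The alphabetical column order and the names `docs`, `vocab` and `tf_idf_matrix` are assumptions made for this example.

```python
from math import log

import numpy as np

corpus = [
    'this is the first document mostly',
    'this document is the second document',
    'and this is the third one',
    'is this the first document here',
]

docs = [sent.split() for sent in corpus]
n_docs = len(docs)

# Fixed column order (assumption: alphabetical) so every row is comparable.
vocab = sorted({word for doc in docs for word in doc})

# IDF = log(number of documents / number of documents containing the word).
idf = {word: log(n_docs / sum(word in doc for doc in docs)) for word in vocab}

# Rows are documents, columns are vocabulary words.
tf_idf_matrix = np.zeros((n_docs, len(vocab)))
for i, doc in enumerate(docs):
    for j, word in enumerate(vocab):
        tf = doc.count(word) / len(doc)
        tf_idf_matrix[i, j] = tf * idf[word]

print(vocab)
print(np.round(tf_idf_matrix, 2))
```

With this layout, a word that is absent from a document gets 0 in that row, and the matrix has shape (number of documents, vocabulary size), here (4, 11).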
