# Introduction to TF-IDF
TF-IDF stands for Term Frequency - Inverse Document Frequency, a technique for weighting how important a term is to a document within a collection of documents.

## TF

### Definition
Term frequency measures how often a term occurs in a document.
The more often a term occurs in a document, the more relevant it is assumed to be to that document, relative to the other terms.

### Formula 
    t : term , d : document
    tf(t,d) = count of t in d / number of words in d
### Example 
I love natural language processing:

| Term | TF Score |
| --- | --- |
| I | 1/5 |  
| love | 1/5 |
| natural | 1/5 |
| language | 1/5 |
| processing | 1/5 |

Another example (counting case-insensitively, so “My” and “my” are the same term):
I have a job, I love my job. My job is cool:

| Term | TF Score |
| --- | --- |
| I | 2/12 |  
| have | 1/12 |
| a | 1/12 |
| job | 3/12 |
| love | 1/12 |
| my | 2/12 |
| is | 1/12 |
| cool | 1/12 |
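
As a rough check of the tables above, the TF formula can be sketched in a few lines of Python. The helper name `term_frequencies` and the tokenization rule (lowercase, letters only) are assumptions made for this illustration; the lowercase step is what lets “My” and “my” count as one term.

```python
import re
from collections import Counter
from fractions import Fraction

def term_frequencies(text):
    """Return {term: count / number of words} for one document."""
    # Assumed tokenization: lowercase, then keep runs of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {term: Fraction(count, len(words)) for term, count in counts.items()}

tf_example = term_frequencies("I have a job, I love my job. My job is cool")
for term, score in tf_example.items():
    # Fractions print in lowest terms, so 3/12 is shown as 1/4.
    print(term, score)
```

Exact fractions are used here only so the scores can be compared with the table by eye; a float division would work just as well.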


## IDF 
When computing TF, all terms are treated as equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times yet carry little meaning. We therefore need to weigh down the frequent terms and scale up the rare ones. IDF does this: it is a factor that diminishes the weight of terms that occur in many documents of the collection and increases the weight of terms that occur in few.

IDF is the inverse of the document frequency, and it measures how informative a term t is. It is very low for the most common words, such as stop words: a word like “is” is present in almost all documents, so df is close to N and N/df is close to its minimum of 1. This gives us what we want, a relative weighting.

Formula: idf(t) = N/df, where N is the number of documents and df is the number of documents that contain t.

This raw ratio has a couple of problems. With a large corpus, say 100,000,000 documents, the IDF of a rare term becomes huge; to dampen this effect we take the log of the ratio.

At query time, a word that is not in the vocabulary has df = 0. Since we cannot divide by 0, we smooth the value by adding 1 to the denominator.

That’s the final formula:

### Formula
    idf(t) = log(N/(df + 1))
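
To see what the log and the +1 do, here is a small sketch of the formula. The function name `smoothed_idf` is made up for this example, and a base-10 log is assumed because the code cells in this notebook use `math.log10`.

```python
import math

def smoothed_idf(n_docs, df):
    """idf(t) = log10(N / (df + 1)), with N = n_docs and df = document frequency."""
    return math.log10(n_docs / (df + 1))

n_docs = 100_000_000
rare = smoothed_idf(n_docs, 9)               # term found in 9 documents
unseen = smoothed_idf(n_docs, 0)             # df = 0 no longer divides by zero
common = smoothed_idf(n_docs, n_docs - 1)    # term found in nearly every document

print(n_docs / (9 + 1))   # the raw ratio, before taking the log
print(rare, unseen, common)
```

The log compresses a raw ratio in the millions down to a single-digit score, while a term that appears almost everywhere ends up near zero.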

In [8]:
import pandas as pd
import sklearn as sk
import math 

In [27]:
first_sentence = "I love natural language processing"
second_sentence = "I have a job in language modeling that I love"

# split each sentence into a list of words
first_sentence = first_sentence.split(" ")
second_sentence = second_sentence.split(" ")
# take the union of both word lists to build the vocabulary (duplicates removed)
total = set(first_sentence).union(set(second_sentence))

print(total)

{'natural', 'in', 'language', 'I', 'have', 'job', 'modeling', 'a', 'processing', 'that', 'love'}


In [28]:
wordDictA = dict.fromkeys(total, 0) 
wordDictB = dict.fromkeys(total, 0)
for word in first_sentence:
    wordDictA[word]+=1
    
for word in second_sentence:
    wordDictB[word]+=1

In [29]:
pd.DataFrame([wordDictA, wordDictB])

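| | I | a | have | in | job | language | love | modeling | natural | processing | that |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |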



In [30]:
def computeTF(wordDict, doc):
    # tf(t, d) = count of t in d / number of words in d
    tfDict = {}
    wordCount = len(doc)
    for word, count in wordDict.items():
        tfDict[word] = count / float(wordCount)
    return tfDict
# running our sentences through the tf function:
tfFirst = computeTF(wordDictA, first_sentence)
tfSecond = computeTF(wordDictB, second_sentence)
# converting to a dataframe for visualization
tf = pd.DataFrame([tfFirst, tfSecond])

In [31]:
computeTF(wordDictA, first_sentence)

{'natural': 0.2,
 'in': 0.0,
 'language': 0.2,
 'I': 0.2,
 'have': 0.0,
 'job': 0.0,
 'modeling': 0.0,
 'a': 0.0,
 'processing': 0.2,
 'that': 0.0,
 'love': 0.2}

### Stop Words with NLTK
That’s all for the TF formula. Before moving on, a word about stop words: we should remove them, because they are the most commonly occurring words and add little value to the document vector. Removing them also improves computation and space efficiency.

The nltk library can download a stop-word list, so instead of listing all the stop words ourselves we can iterate over our words and drop those that appear in the nltk list. There are more efficient ways to do this, but I’ll just show a simple method.

In [32]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

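# filter the vocabulary (the keys of wordDictA); compare in lowercase,
# because the nltk list is lowercase and 'I' would otherwise slip through
filtered_sentence = [w for w in wordDictA if w.lower() not in stop_words]
print(filtered_sentence)

[nltk_data] Downloading package stopwords to /home/wael/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['natural', 'language', 'job', 'modeling', 'processing', 'love']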


In [33]:
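def computeIDF(docList):
    N = len(docList)  # number of documents

    # start every term of the vocabulary at 0
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for word, val in idfDict.items():
        # note: val is always 0 here, since the document frequency is never
        # counted, so every term gets log10(N / 1)
        idfDict[word] = math.log10(N / (float(val) + 1))

    return idfDict

# compute the IDF over both sentences
idfs = computeIDF([wordDictA, wordDictB])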
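
Note that the `computeIDF` cell above reads `val` from a dictionary it has just filled with zeros, so it never counts how many documents contain each term. A possible variant that does count document frequency is sketched below. The names `compute_idf_with_df` and `count_words` are hypothetical; the formula is the same log10(N/(df + 1)) given earlier.

```python
import math

def count_words(words, vocabulary):
    """Build a {term: count} dictionary over the whole vocabulary."""
    counts = dict.fromkeys(vocabulary, 0)
    for word in words:
        counts[word] += 1
    return counts

def compute_idf_with_df(doc_list):
    """idf(t) = log10(N / (df + 1)), where df counts documents containing t."""
    n_docs = len(doc_list)
    df = dict.fromkeys(doc_list[0].keys(), 0)
    for doc in doc_list:
        for word, count in doc.items():
            if count > 0:
                df[word] += 1
    return {word: math.log10(n_docs / (freq + 1)) for word, freq in df.items()}

first = "I love natural language processing".split(" ")
second = "I have a job in language modeling that I love".split(" ")
vocabulary = set(first) | set(second)

idfs_counted = compute_idf_with_df(
    [count_words(first, vocabulary), count_words(second, vocabulary)]
)
print(idfs_counted["natural"])  # term in one document: log10(2 / 2)
print(idfs_counted["love"])     # term in both documents: log10(2 / 3)
```

With only two documents, the +1 in the denominator gives terms found in one document a score of exactly 0 and pushes terms found in both below zero, so this smoothing is easier to interpret on a larger corpus.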



Now that the IDF formula is implemented, let’s finish by calculating TF-IDF, the product of the two: tfidf(t, d) = tf(t, d) * idf(t).

In [34]:
def computeTFIDF(tfBow, idfs):
    # tfidf(t, d) = tf(t, d) * idf(t)
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf
# running our two sentences through the TF-IDF function:
tfidfFirst = computeTFIDF(tfFirst, idfs)
tfidfSecond = computeTFIDF(tfSecond, idfs)
# putting it in a dataframe
tfidfDf = pd.DataFrame([tfidfFirst, tfidfSecond])

That was a lot of work, but it is handy to know in case you are asked to code TF-IDF from scratch. In practice, this is much simpler with the sklearn library. Let’s look at an example:

In [35]:
# first step is to import the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer lowercases its input by default (lowercase=True),
# so "Data" and "data" are counted as the same term
firstV = "Data Science is the sexiest job of the 21st century"
secondV = "machine learning is the key for data science"
# creating the TfidfVectorizer
vectorize = TfidfVectorizer()
# fitting the model and transforming our sentences in one step:
response = vectorize.fit_transform([firstV, secondV])

In [36]:
print(response)

  (0, 2)	0.24342026926924518
  (0, 10)	0.24342026926924518
  (0, 4)	0.24342026926924518
  (0, 12)	0.48684053853849035
  (0, 11)	0.34211869506421816
  (0, 5)	0.34211869506421816
  (0, 9)	0.34211869506421816
  (0, 0)	0.34211869506421816
  (0, 1)	0.34211869506421816
  (1, 2)	0.28986933576883284
  (1, 10)	0.28986933576883284
  (1, 4)	0.28986933576883284
  (1, 12)	0.28986933576883284
  (1, 8)	0.40740123733358447
  (1, 7)	0.40740123733358447
  (1, 6)	0.40740123733358447
  (1, 3)	0.40740123733358447
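
These numbers differ from the from-scratch version because, with its default settings, TfidfVectorizer uses raw counts for TF, a smoothed IDF of ln((1 + N)/(1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below tries to reproduce the printed values in plain Python under those assumptions; `sklearn_style_tfidf` is a made-up name, and the regular expression mimics the vectorizer’s default tokenization (words of two or more characters).

```python
import math
import re
from collections import Counter

def sklearn_style_tfidf(documents):
    """Approximate TfidfVectorizer defaults: smooth IDF, then L2 normalization."""
    # Lowercase and keep tokens of at least two word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count times smoothed IDF.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Scale the row so that its Euclidean length is 1.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = sklearn_style_tfidf([
    "Data Science is the sexiest job of the 21st century",
    "machine learning is the key for data science",
])
print(round(rows[0]["the"], 6), round(rows[0]["sexiest"], 6))
print(round(rows[1]["the"], 6), round(rows[1]["machine"], 6))
```

In the sparse output above, each `(row, column)` pair indexes a document and a vocabulary term; the sketch keys the same weights by the term itself, which makes them easier to read.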


In [37]:
# recomputing the from-scratch TF-IDF for our two sentences
tfA = computeTF(wordDictA, first_sentence)
tfB = computeTF(wordDictB, second_sentence)

idfs = computeIDF([wordDictA, wordDictB])
tfidfA = computeTFIDF(tfA, idfs)
tfidfB = computeTFIDF(tfB, idfs)

df = pd.DataFrame([tfidfA, tfidfB])

In [38]:
df

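| | I | a | have | in | job | language | love | modeling | natural | processing | that |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.060206 | 0.0 | 0.0 | 0.0 | 0.0 | 0.060206 | 0.060206 | 0.0 | 0.060206 | 0.060206 | 0.0 |
| 1 | 0.060206 | 0.030103 | 0.030103 | 0.030103 | 0.030103 | 0.030103 | 0.030103 | 0.030103 | 0.0 | 0.0 | 0.030103 |

Each value here is the term’s TF multiplied by log10(2) ≈ 0.30103: `computeIDF` as written never counts document frequency, so every term receives the same IDF.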
