# TF-IDF (Term Frequency, Inverse Document Frequency)

## 1. Terminology

* t -- term (word)
* d -- document (sequence of words)
* N -- number of documents in the corpus
* corpus -- the full set of documents

## 2. Term Frequency (TF)

The number of times a term occurs in a document is called <b>term frequency</b>.

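The weight of a term in a document is proportional to its term frequency. The version below divides the raw count by the document length, so that longer documents do not automatically get larger weights.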

```text
tf(t, d) = count of t in d / number of words in d
```
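As a minimal sketch of the formula above, the helper below computes tf for every term of one document. The name `term_frequency` is illustrative (it is not from any library), and the input is assumed to be already tokenized on whitespace.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count of term / number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
print(tf["the"])  # "the" is 2 of the 6 tokens
print(tf["cat"])  # "cat" is 1 of the 6 tokens
```

Because every count is divided by the same document length, the tf values of one document sum to 1.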

## 3. Document Frequency (DF)

DF is the number of documents in the corpus that contain term t. It counts documents, not occurrences: a term that appears five times in a single document still contributes 1.

```text
df(t) = number of documents containing t
```
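A small sketch of this definition, assuming each document is a list of tokens; `document_frequency` is a hypothetical helper name.

```python
def document_frequency(term, corpus):
    """Count the documents (token lists) that contain `term` at least once."""
    return sum(1 for tokens in corpus if term in tokens)

corpus = [
    "the cat sat on the mat".split(),
    "the dog barked".split(),
    "a cat purred".split(),
]
print(document_frequency("the", corpus))   # found in the first two documents
print(document_frequency("cat", corpus))   # found in the first and third
print(document_frequency("fish", corpus))  # found in none
```

Note that "the" occurs three times across the corpus, yet its document frequency is 2.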

## 4. Inverse Document Frequency (IDF)

IDF is the inverse of the document frequency and measures how informative term t is. Very common words such as stop words get a low IDF: a word like “is” appears in almost every document, so df is close to N and N/df is close to 1. Rare terms get a high IDF. This gives us what we want, a relative weighting.

```text
idf(t) = N / df
```

The raw ratio has two problems. First, in a large corpus, say 100,000,000 documents, N/df becomes very large for rare terms; to dampen this effect we take the log of the ratio.

Second, at query time a word that is not in the vocabulary has a df of 0. As we cannot divide by 0, we smooth the value by adding 1 to the denominator.

```text
idf(t) = log(N / (df + 1))
```
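To see how the log and the +1 behave, here is a short sketch of the smoothed formula. The base of the logarithm is not fixed by the formula; base 10 is assumed here, and `inverse_document_frequency` is an illustrative name.

```python
import math

def inverse_document_frequency(df, n_docs):
    """idf = log10(N / (df + 1)); the +1 keeps df = 0 from dividing by zero."""
    return math.log10(n_docs / (df + 1))

n_docs = 1000
for df in (0, 9, 99, 999, 1000):
    # rarer terms (small df) get larger values
    print(df, round(inverse_document_frequency(df, n_docs), 3))
```

An unseen term (df = 0) gets the largest value, and a term that appears in every document (df = N) comes out slightly negative because of the +1 in the denominator.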

tf-idf is now a suitable measure of how important a word is to a document in a collection or corpus. There are many variations of TF-IDF, but for now let us concentrate on this basic version.

```text
tf-idf(t, d) = tf(t, d) * log(N/(df + 1))
```


In [1]:
import pandas as pd
import math

first_sentence = "Data science is the sexiest job of the 21st century"
second_sentence = "machine learning is the key for data science"

# split each sentence on spaces so that every word becomes its own string
# after the split, each sentence is a list of words
first_sentence = first_sentence.split(" ")
second_sentence = second_sentence.split(" ")
total = set(first_sentence).union(second_sentence)
print(total)

{'is', 'job', 'data', 'century', 'sexiest', 'key', 'the', 'for', 'Data', 'machine', 'learning', 'science', 'of', '21st'}
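The printed vocabulary contains both `'Data'` and `'data'`, because `split` keeps the original casing. One possible fix, sketched here under the assumption that case carries no meaning in this corpus, is to lowercase each sentence before splitting. The variable names are illustrative.

```python
first = "Data science is the sexiest job of the 21st century"
second = "machine learning is the key for data science"

# Lowercase before splitting so that "Data" and "data" map to one term.
first_tokens = first.lower().split()
second_tokens = second.lower().split()

vocabulary = set(first_tokens) | set(second_tokens)
print(sorted(vocabulary))
print(len(vocabulary))
```

The remaining cells keep the original, case-sensitive tokens, so their outputs still list both forms.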


Now let's count the words in each sentence, using one dictionary of word-count pairs per sentence:

In [2]:
wordDictA = dict.fromkeys(total, 0)
wordDictB = dict.fromkeys(total, 0)

for word in first_sentence:
    wordDictA[word] += 1

for word in second_sentence:
    wordDictB[word] += 1

Now we put them in a dataframe and then view the result:

In [3]:
pd.DataFrame([wordDictA, wordDictB])

   is  job  data  century  sexiest  key  the  for  Data  machine  learning  science  of  21st
0   1    1     0        1        1    0    2    0     1        0         0        1   1     1
1   1    0     1        0        0    1    1    1     0        1         1        1   0     0


Now let's write the TF function:

In [4]:
def computeTF(wordDict, doc):
    # tf(t, d) = count of t in d / number of words in d
    tfDict = {}
    docLength = len(doc)
    for word, count in wordDict.items():
        tfDict[word] = count / docLength
    return tfDict

# run both sentences through the TF function:
tfFirst = computeTF(wordDictA, first_sentence)
tfSecond = computeTF(wordDictB, second_sentence)

# Converting to data frame for visualization
tf = pd.DataFrame([tfFirst, tfSecond])
print(tf)

      is  job   data  century  sexiest    key    the    for  Data  machine  \
0  0.100  0.1  0.000      0.1      0.1  0.000  0.200  0.000   0.1    0.000   
1  0.125  0.0  0.125      0.0      0.0  0.125  0.125  0.125   0.0    0.125   

   learning  science   of  21st  
0     0.000    0.100  0.1   0.1  
1     0.125    0.125  0.0   0.0  


We should eliminate stop words because they are the most commonly occurring words and add little information to the document vector. In fact, removing them improves both computation and space efficiency.

The `nltk` library provides a downloadable stop word list, so instead of listing all the stop words ourselves we can iterate over the words and drop those that appear in that list. There are many efficient ways to do this.

In [5]:
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# the keys of wordDictA are the full vocabulary of both sentences
filtered_vocabulary = [w for w in wordDictA if w not in stop_words]
print(filtered_vocabulary)


['job', 'data', 'century', 'sexiest', 'key', 'Data', 'machine', 'learning', 'science', '21st']


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hongbing/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
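The cell above filters the keys of `wordDictA`, which hold the vocabulary of both sentences. To filter the tokens of a single sentence instead, something like the following sketch could be used. The tiny `STOP_WORDS` set is a hand-written stand-in for the NLTK list, and `remove_stop_words` is a hypothetical helper.

```python
# Tiny hand-written stop list, standing in for NLTK's much longer English list.
STOP_WORDS = {"is", "the", "of", "for"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not a stop word, preserving order."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

sentence = "Data science is the sexiest job of the 21st century".split()
print(remove_stop_words(sentence))
```

Comparing the lowercase form matters because stop word lists are usually lowercase, so a capitalized "The" would otherwise slip through.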


In [6]:
def computeIDF(docList):
    N = len(docList)

    # NOTE: every count starts at 0 and is never updated from docList,
    # so val (the df) is 0 for every word and each idf is log10(N / 1)
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / (float(val) + 1))

    return idfDict

# compute the IDF over both word-count dictionaries
idfs = computeIDF([wordDictA, wordDictB])

In [7]:
print(idfs)

{'is': 0.3010299956639812, 'job': 0.3010299956639812, 'data': 0.3010299956639812, 'century': 0.3010299956639812, 'sexiest': 0.3010299956639812, 'key': 0.3010299956639812, 'the': 0.3010299956639812, 'for': 0.3010299956639812, 'Data': 0.3010299956639812, 'machine': 0.3010299956639812, 'learning': 0.3010299956639812, 'science': 0.3010299956639812, 'of': 0.3010299956639812, '21st': 0.3010299956639812}
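Every value printed above is identical, log10(2 / 1), because the loop in `computeIDF` reads `val` from the dictionary it has just filled with zeros, so no document counts ever enter the formula. A version that counts df first might look like the sketch below. It assumes all dictionaries share the same keys, as `wordDictA` and `wordDictB` do; the function name and the toy dictionaries are illustrative.

```python
import math

def compute_idf_counted(doc_list):
    """idf = log10(N / (df + 1)), with df counted from per-document word counts."""
    n_docs = len(doc_list)
    df = dict.fromkeys(doc_list[0], 0)
    for doc in doc_list:
        for word, count in doc.items():
            if count > 0:  # the word appears in this document
                df[word] += 1
    return {word: math.log10(n_docs / (freq + 1)) for word, freq in df.items()}

# Two toy count dictionaries over a shared vocabulary (same shape as wordDictA/B).
counts_a = {"data": 0, "Data": 1, "science": 1, "is": 1, "job": 1, "key": 0}
counts_b = {"data": 1, "Data": 0, "science": 1, "is": 1, "job": 0, "key": 1}
idfs_counted = compute_idf_counted([counts_a, counts_b])
print(idfs_counted["job"])      # df = 1, so log10(2 / 2)
print(idfs_counted["science"])  # df = 2, so log10(2 / 3), which is negative
```

With only two documents the smoothed formula gives 0 to terms found in one document and a negative value to terms found in both, so the basic formula is better suited to larger corpora. The outputs in the following cells were produced with the original `computeIDF`.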


In [8]:
def computeTFIDF(tfBow, idfs):
    # tf-idf(t, d) = tf(t, d) * idf(t)
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

# run both sentences through the TF-IDF function:
tfidfFirst = computeTFIDF(tfFirst, idfs)
tfidfSecond = computeTFIDF(tfSecond, idfs)
# put the results in a dataframe
tfidf = pd.DataFrame([tfidfFirst, tfidfSecond])
print(tfidf)

         is       job      data   century   sexiest       key       the  \
0  0.030103  0.030103  0.000000  0.030103  0.030103  0.000000  0.060206   
1  0.037629  0.000000  0.037629  0.000000  0.000000  0.037629  0.037629   

        for      Data   machine  learning   science        of      21st  
0  0.000000  0.030103  0.000000  0.000000  0.030103  0.030103  0.030103  
1  0.037629  0.000000  0.037629  0.037629  0.037629  0.000000  0.000000  


In [9]:
# first step is to import the library
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer lowercases its input by default (lowercase=True),
# so the mixed-case sentence below needs no manual lowercasing
firstV = "Data Science is the sexiest job of the 21st century"
secondV = "machine learning is the key for data science"

# create the TfidfVectorizer
vectorize = TfidfVectorizer()
# fit the model and transform our sentences in one step:
response = vectorize.fit_transform([firstV, secondV])

In [10]:
print(response)

  (0, 1)	0.34211869506421816
  (0, 0)	0.34211869506421816
  (0, 9)	0.34211869506421816
  (0, 5)	0.34211869506421816
  (0, 11)	0.34211869506421816
  (0, 12)	0.48684053853849035
  (0, 4)	0.24342026926924518
  (0, 10)	0.24342026926924518
  (0, 2)	0.24342026926924518
  (1, 3)	0.40740123733358447
  (1, 6)	0.40740123733358447
  (1, 7)	0.40740123733358447
  (1, 8)	0.40740123733358447
  (1, 12)	0.28986933576883284
  (1, 4)	0.28986933576883284
  (1, 10)	0.28986933576883284
  (1, 2)	0.28986933576883284
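These values differ from the hand-computed table because `TfidfVectorizer`, with its default settings, uses a different variant: raw counts for tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The pure-Python sketch below reproduces the printed numbers under those assumptions. It tokenizes with `lower().split()`, which is enough for these two sentences but is not the vectorizer's actual tokenizer, and `sklearn_style_tfidf` is a hypothetical name.

```python
import math
from collections import Counter

docs = [
    "Data Science is the sexiest job of the 21st century",
    "machine learning is the key for data science",
]
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def sklearn_style_tfidf(tokens):
    """Raw count * (ln((1 + N) / (1 + df)) + 1), then L2-normalize the row."""
    counts = Counter(tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [sklearn_style_tfidf(tokens) for tokens in tokenized]
print(round(rows[0]["the"], 6), round(rows[0]["job"], 6), round(rows[0]["is"], 6))
print(round(rows[1]["key"], 6), round(rows[1]["the"], 6))
```

In the printed sparse matrix the column indices follow the alphabetically sorted vocabulary, so column 12 is `the` and column 1 is `century`.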
