# TF-IDF (Term Frequency and Inverse Document Frequency)

Definition:

TF-IDF measures how important a word is to a document in a collection, or corpus.

Or

How much weight a word carries in one document relative to the rest of the corpus.

# Term Frequency (TF)

Let’s first understand Term Frequency (TF). It is a measure of how frequently a term, t, appears in a document, d:
    
<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/tf-300x41.jpg" width="550px" />

Here, in the numerator, n is the number of times the term “t” appears in the document “d”; the denominator is the total number of terms in that document. Thus, each document and term pair has its own TF value.

# How to calculate the TF for a review

Review 1: This movie is very scary and long
    
Review 2: This movie is not scary and is slow
    
Review 3: This movie is spooky and good


Vocabulary: ‘This’, ‘movie’, ‘is’, ‘very’, ‘scary’, ‘and’, ‘long’, ‘not’,  ‘slow’, ‘spooky’,  ‘good’

Review 2 (8 words in total):
TF(‘this’) = 1/8
TF(‘movie’) = 1/8
TF(‘is’) = 2/8 = 1/4
TF(‘very’) = 0/8 = 0
TF(‘scary’) = 1/8
TF(‘and’) = 1/8
TF(‘long’) = 0/8 = 0
TF(‘not’) = 1/8
TF(‘slow’) = 1/8
TF(‘spooky’) = 0/8 = 0
TF(‘good’) = 0/8 = 0


<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/TF-matrix-1.png" width="550px" />
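The TF calculation for Review 2 can be sketched in a few lines of Python. This is a minimal illustration, assuming lowercase whitespace tokenization; the helper name `term_frequency` is made up for this example and is not part of any library.

```python
from collections import Counter

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]

def term_frequency(document):
    """Return {term: count / total number of terms} for one document."""
    tokens = document.lower().split()  # assumption: split on whitespace
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

tf_review2 = term_frequency(reviews[1])
print(tf_review2["is"])             # 2/8
print(tf_review2["movie"])          # 1/8
print(tf_review2.get("very", 0.0))  # absent terms have TF 0
```

Terms that do not occur in the review are simply missing from the dictionary, which corresponds to the zero entries in the TF matrix.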

# Inverse Document Frequency (IDF)

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/idf-300x44.jpg" width="550px" />

We can calculate the IDF values for all the words in Review 2 (using base-10 logarithms):
    
IDF(‘this’) =  log(number of documents/number of documents containing the word ‘this’) = log(3/3) = log(1) = 0

Similarly,

IDF(‘movie’) = log(3/3) = 0
IDF(‘is’) = log(3/3) = 0
IDF(‘not’) = log(3/1) = log(3) = 0.48
IDF(‘scary’) = log(3/2) = 0.18
IDF(‘and’) = log(3/3) = 0
IDF(‘slow’) = log(3/1) = 0.48

We can calculate the IDF values for each word like this. Thus, the IDF values for the entire vocabulary would be:
<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/IDF-matrix.png" width="550px" />

Hence, we see that words like “is”, “this”, “and”, etc., are reduced to 0 and have little importance; while words like “scary”, “long”, “good”, etc. are words with more importance and thus have a higher value.
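The IDF values can be reproduced with a short sketch. It assumes base-10 logarithms (which is what gives log(3) = 0.48) and the same whitespace tokenization as before; `inverse_document_frequency` is an illustrative helper name, not a library function.

```python
import math

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]
# One set of unique terms per review: IDF only asks whether a term occurs.
tokenized = [set(review.lower().split()) for review in reviews]

def inverse_document_frequency(term, tokenized_docs):
    """log10(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in tokenized_docs if term in doc)
    return math.log10(len(tokenized_docs) / containing)

# Every vocabulary term occurs in at least one review, so there is no division by zero.
vocabulary = sorted(set().union(*tokenized))
idf = {term: inverse_document_frequency(term, tokenized) for term in vocabulary}

for term in ("this", "scary", "not"):
    print(term, round(idf[term], 2))
```

Words that occur in every review get an IDF of exactly 0, while words that occur in only one review get the largest value, log10(3).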

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/tf_idf.jpg" width="550px" />

We can now calculate the TF-IDF score for every word in Review 2:

TF-IDF(‘this’, Review 2) = TF(‘this’, Review 2) * IDF(‘this’) = 1/8 * 0 = 0

Similarly,

TF-IDF(‘movie’, Review 2) = 1/8 * 0 = 0
TF-IDF(‘is’, Review 2) = 1/4 * 0 = 0
TF-IDF(‘not’, Review 2) = 1/8 * 0.48 = 0.06
TF-IDF(‘scary’, Review 2) = 1/8 * 0.176 = 0.022
TF-IDF(‘and’, Review 2) = 1/8 * 0 = 0
TF-IDF(‘slow’, Review 2) = 1/8 * 0.48 = 0.06

Similarly, we can calculate the TF-IDF scores for all the words with respect to all the reviews:

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/TF_IDF-matrix.png" width="550px" />

TF-IDF therefore gives larger values to less frequent words. It is highest when both the IDF and TF values are high, i.e. when the word is rare across the corpus but frequent in a single document.
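Putting the two pieces together, the scores for Review 2 can be computed as below. This is a sketch under the same assumptions as the hand calculation (whitespace tokens, base-10 logarithm, no smoothing or normalization); the function name `tf_idf` is chosen for illustration only.

```python
import math

reviews = [
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good",
]
docs = [review.lower().split() for review in reviews]

def tf_idf(term, doc, all_docs):
    """TF (count / document length) times IDF (log10 of N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in all_docs if term in d)
    idf = math.log10(len(all_docs) / df) if df else 0.0
    return tf * idf

review2 = docs[1]
# dict.fromkeys keeps each term once, in order of first appearance.
scores_review2 = {term: tf_idf(term, review2, docs) for term in dict.fromkeys(review2)}
for term, score in scores_review2.items():
    print(f"{term}: {score:.3f}")
```

Because the code multiplies by the unrounded IDF, ‘scary’ comes out as about 0.022. Words shared by all three reviews score 0, and ‘not’ and ‘slow’ score highest in Review 2.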

## Reference

https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76

TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document)

IDF(w) = log_10(Total number of documents / Number of documents with term w in it)

The worked example below uses base-10 logarithms, as do the review examples above; a different base only rescales every IDF value by the same constant.

Consider a document containing 100 words wherein the word 'Cauvery' appears 3 times.

The term frequency (tf) for 'Cauvery' is then TF = (3 / 100) = 0.03.

Now, assume we have 10 million documents and the word 'Cauvery' appears in 1000 of these. Then, the inverse document frequency (idf) is calculated as IDF = log_10(10,000,000 / 1,000) = log_10(10,000) = 4.

Thus, the TF-IDF weight is the product of these quantities: TF-IDF = 0.03 * 4 = 0.12.
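The arithmetic of the 'Cauvery' example can be checked directly. The sketch below also shows what the natural logarithm would give, since the value 4 corresponds to a base-10 logarithm; the variable names are illustrative.

```python
import math

# 'Cauvery' appears 3 times in a 100-word document.
tf = 3 / 100

# 10 million documents, 1000 of which contain the word.
idf_base10 = math.log10(10_000_000 / 1_000)
idf_natural = math.log(10_000_000 / 1_000)

print(tf)                             # term frequency
print(idf_base10)                     # IDF with a base-10 logarithm
print(round(tf * idf_base10, 2))      # TF-IDF weight from the example
print(round(idf_natural, 2))          # IDF with the natural logarithm, for comparison
```

With the natural logarithm the IDF is about 9.21 rather than 4, so the choice of base changes the scale of the scores but not the ranking of terms.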

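```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = [
    "Chennai Super Kings won the final 2018 IPL",
    "Chennai Super Kings Crowned IPL 2018 Champions",
    "Chennai super kings returns",
]
tfidf = TfidfVectorizer()              # default settings: lowercase word tokens
features = tfidf.fit_transform(texts)  # sparse matrix: one row per text, one column per term
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
pd.DataFrame(features.toarray(), columns=tfidf.get_feature_names_out())
```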

| | 2018 | champions | chennai | crowned | final | ipl | kings | returns | super | the | won |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.333407 | 0.0 | 0.258921 | 0.0 | 0.438391 | 0.333407 | 0.258921 | 0.0 | 0.258921 | 0.438391 | 0.438391 |
| 1 | 0.370954 | 0.48776 | 0.288079 | 0.48776 | 0.0 | 0.370954 | 0.288079 | 0.0 | 0.288079 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.412859 | 0.0 | 0.0 | 0.0 | 0.412859 | 0.69903 | 0.412859 | 0.0 | 0.0 |
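These numbers do not follow from the plain TF * IDF formula used earlier. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses raw term counts as TF, computes IDF as ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below reproduces that calculation in plain Python. It assumes simple whitespace tokenization, which gives the same tokens as the default tokenizer for these three texts, and `tfidf_row` is a hypothetical helper name.

```python
import math
from collections import Counter

texts = [
    "Chennai Super Kings won the final 2018 IPL",
    "Chennai Super Kings Crowned IPL 2018 Champions",
    "Chennai super kings returns",
]
docs = [text.lower().split() for text in texts]
n_docs = len(docs)

# Document frequency: in how many texts does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF, natural logarithm, plus 1 so that no term is zeroed out.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_row(doc):
    """Raw count * smoothed IDF, then L2-normalize the row."""
    counts = Counter(doc)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

rows = [tfidf_row(doc) for doc in docs]
for term in ("chennai", "2018", "won"):
    print(term, round(rows[0][term], 6))
```

The printed values for the first text should agree with row 0 of the table to display precision. The smoothing explains why a word like "chennai", which occurs in every text, still gets a non-zero weight.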


The constructor's parameters, written out explicitly with their default values:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer(
    input='content', encoding='utf-8', decode_error='strict', strip_accents=None,
    lowercase=True, preprocessor=None, tokenizer=None, analyzer='word',
    stop_words=None, token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 1),
    max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False,
    dtype=np.float64,  # pass the dtype object itself, not its printed repr
    norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False,
)
```

https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/