## Day24 - NLP

**Term frequency-inverse document frequency** (tf-idf) is a statistic computed for each token over a collection of documents. The idea is to highlight tokens that occur frequently in a single document but rarely across the entire collection of documents. It is defined as follows:

$$\begin{equation}
\text{TF-IDF}(t) = \text{freq}(t,d) \times \log{\frac{m}{\text{freq}(t,D)}}
\end{equation}$$

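where $t$ is the token, $d$ is the current document, and $D$ is the collection of $m$ documents. Here $freq(t,d)$ counts the occurrences of $t$ in $d$, and $freq(t,D)$ is taken to be the number of documents in $D$ that contain $t$.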
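To make the formula concrete, here is a minimal sketch in plain Python. The toy corpus, the `doc_freq` counter, and the `tf_idf` helper are illustrative names, not part of any library, and the sketch assumes that $freq(t,D)$ means the number of documents containing $t$.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and stripped of stop words.
docs = [["love", "wine"], ["hate", "wine"], ["wine", "bad"]]
m = len(docs)

# Assumption: freq(t, D) is the number of documents that contain t.
doc_freq = Counter(token for doc in docs for token in set(doc))

def tf_idf(token, doc):
    """freq(t, d) * log(m / freq(t, D)), following the formula above."""
    return doc.count(token) * math.log(m / doc_freq[token])

for i, doc in enumerate(docs):
    print(i, {t: round(tf_idf(t, doc), 3) for t in sorted(set(doc))})
```

Under these assumptions, "wine" scores 0 in every document because it appears in all three ($\log(3/3) = 0$), while each of the other tokens scores $\log 3 \approx 1.099$ in the one document that contains it.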

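```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = ["I love this wine", "I hate this wine", "This wine is bad"]

# Drop English stop words, then compute the tf-idf weights of each document
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
```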

|   | bad | hate | love | wine |
|---|-----|------|------|------|
| 0 | 0.0 | 0.0 | 0.861037 | 0.508542 |
| 1 | 0.0 | 0.861037 | 0.0 | 0.508542 |
| 2 | 0.861037 | 0.0 | 0.0 | 0.508542 |


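In this case, the most significant words are "hate", "love" and "bad": each occurs in only one document, whereas "wine" occurs in all three and therefore gets a lower weight.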
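Note that the weight of "wine" is not zero, although the plain formula would give $\log(3/3) = 0$. With its default settings, `TfidfVectorizer` uses a smoothed idf, $\ln\frac{1+m}{1+freq(t,D)} + 1$, and then scales each row to unit L2 norm. The sketch below reproduces the first row by hand; the `df`, `idf`, `raw` and `row` names are illustrative, and the document frequencies are read off the three-sentence corpus.

```python
import math

m = 3  # number of documents in the corpus
df = {"bad": 1, "hate": 1, "love": 1, "wine": 3}  # documents containing each token

# Smoothed idf, as used when smooth_idf=True (the default).
idf = {t: math.log((1 + m) / (1 + n)) + 1 for t, n in df.items()}

# First document, "I love this wine": only "love" and "wine" survive
# tokenization and stop-word removal, and each occurs once.
raw = {"love": 1 * idf["love"], "wine": 1 * idf["wine"]}

# L2 normalization of the row (norm="l2", also the default).
norm = math.sqrt(sum(v * v for v in raw.values()))
row = {t: round(v / norm, 6) for t, v in raw.items()}
print(row)
```

The two values agree with the 0.861037 and 0.508542 reported for row 0, so the ranking of the tokens matches the plain formula even though the absolute numbers differ.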

Another statistic is the **term frequency-document frequency** (tf-df), which gives higher importance to terms that are more frequent across all the documents. It is defined as follows:

$$\begin{equation}
\text{TF-DF}(t) = \text{freq}(t,d) \times \log{\text{freq}(t,D)}
\end{equation}$$

where $t$ is the token, $d$ is the current document, and $D$ is the collection of $m$ documents.

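As a consequence, tf-idf is suitable for heterogeneous collections, where the terms that set a document apart are the informative ones, whereas tf-df is suitable for homogeneous collections, where the terms shared across documents characterize the common topic.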
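The contrast can be seen with a minimal tf-df sketch in plain Python. As before, the toy corpus and the `tf_df` helper are illustrative names, and $freq(t,D)$ is assumed to be the number of documents containing $t$.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and stripped of stop words.
docs = [["love", "wine"], ["hate", "wine"], ["wine", "bad"]]

# Assumption: freq(t, D) is the number of documents that contain t.
doc_freq = Counter(token for doc in docs for token in set(doc))

def tf_df(token, doc):
    """freq(t, d) * log(freq(t, D)), following the tf-df formula."""
    return doc.count(token) * math.log(doc_freq[token])

scores = {t: round(tf_df(t, docs[0]), 3) for t in sorted(set(docs[0]))}
print(scores)
```

Here the ranking is reversed with respect to tf-idf: "wine", shared by all three documents, scores $\log 3 \approx 1.099$, while "love" scores $\log 1 = 0$ because it occurs in a single document.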