# TF-IDF

The TF-IDF formula measures how much weight a given word carries in a text document, a document body or a website, relative to the rest of the text and to the other documents in a collection.

## TF

TF is the abbreviation for **term frequency**. It determines the relative frequency of a word or combination of words in a document. This term frequency is compared to the occurrence of all the other words in the text, document or website being analyzed.

The term frequency is a simple ratio (no logarithm is involved at this stage):

$$ tf_{(i,j)} = \dfrac {n_{(i,j)}}{\sum_k n_{(k,j)}}  $$

Here $n_{(i,j)}$ is the number of times word $i$ appears in document $j$, and the denominator sums the counts of all words $k$ in that document, i.e. the total number of words in the document. ***Every document has its own term frequency.***
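As a minimal sketch of the ratio above, the following computes the term frequency of every word in one document. The helper name `term_frequencies`, the toy sentence, and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return {term: count / total number of tokens} for one document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical toy document, tokenized by lowercasing and splitting on whitespace.
document = "the cat sat on the mat"
tf = term_frequencies(document.lower().split())

print(tf["the"])  # 2 occurrences out of 6 tokens
print(tf["cat"])  # 1 occurrence out of 6 tokens
```

Since every count is divided by the same total, the term frequencies of one document always sum to 1.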

Because the count is divided by the length of the document, making a text longer while repeating a keyword at the same rate does not raise its value in the calculation. Where keyword density looks only at the percentage share of a single word in the text, the term frequency is computed for every word, so each word is weighed against the proportion of all the other words used in the text.

## IDF

IDF stands for **inverse document frequency** and completes the word evaluation analysis. It acts as a correction to the TF by bringing the document frequency of a specific word into the calculation: it compares the total number of known documents with the number of documents containing the word in question.

The following logarithm is used to "condense" the results:

$$ idf_{(i)} = \log \left( \frac {N}{df_{(i)}} \right) $$

This is the log of the total number of documents $N$ divided by the number of documents $df_{(i)}$ that contain the word $i$. Inverse document frequency determines the weight of rare words across all documents in the corpus: the rarer the word, the higher its IDF.
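The inverse document frequency can be sketched in a few lines as well. The function name `inverse_document_frequencies` and the three toy documents are hypothetical, and the natural logarithm is an assumption, since the formula does not fix the base.

```python
import math


def inverse_document_frequencies(tokenized_docs):
    """Return {term: log(N / df)} over a list of token lists (natural log)."""
    n_docs = len(tokenized_docs)
    df = {}
    for tokens in tokenized_docs:
        # set() ensures each document is counted at most once per term.
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
idf = inverse_document_frequencies([d.split() for d in docs])

print(idf["the"])   # appears in all 3 documents: log(3/3)
print(idf["cat"])   # appears in 2 of 3 documents: log(3/2)
print(idf["bird"])  # appears in 1 of 3 documents: log(3/1)
```

A word that occurs in every document gets an IDF of zero, which is how the IDF cancels out very common words.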

Consequently, the IDF indicates how informative a specific keyword is: a word found in nearly every document says little about any single text, while a rare word helps distinguish the texts that contain it.

Multiplied together, the two formulas give the relative weight of a word in a text compared to all the documents that contain the same keyword. To obtain useful results, the formula needs to be applied to every significant keyword in a text document.

The larger the document collection used to calculate the TF-IDF, the more reliable the results.


Combining the two gives the TF-IDF score $w_{(i,j)}$ of a word $i$ in a document $j$ of the corpus. It is the product of tf and idf:

$$ w_{(i,j)} = tf_{(i,j)} \cdot \log \left( \frac {N}{df_{(i)}} \right) $$

<center>$ tf_{(i,j)}$ = <i>term frequency of $i$ in $j$ (occurrences of $i$ in $j$ divided by the number of words in $j$)</i></center>
<center>$ df_{(i)}$ = <i>number of documents containing $i$</i></center>
<center>$ N $ = <i>total number of documents</i></center>
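
As a worked example of this product, take a hypothetical corpus of $N = 4$ short documents in which a word appears once in a five-word document and occurs in two of the four documents. Assuming the natural logarithm (the formula itself does not fix the base):

$$ tf_{(i,j)} = \frac{1}{5} = 0.2, \qquad \log \left( \frac{N}{df_{(i)}} \right) = \ln \left( \frac{4}{2} \right) \approx 0.693, \qquad w_{(i,j)} \approx 0.2 \cdot 0.693 \approx 0.139 $$

A word that occurs in all four documents gets $\ln(4/4) = 0$, so its weight is zero no matter how often it appears in a single document.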

## Applications of TF-IDF
Determining how relevant a word is to a document, which is what TF-IDF does, is useful in many ways, for example:

* **Information retrieval**  
TF-IDF was invented for document search and can be used to deliver the results that are most relevant to what you’re searching for. Imagine you have a search engine and somebody looks for LeBron. The results will be displayed in order of relevance. That’s to say the most relevant sports articles will be ranked higher because TF-IDF gives the word LeBron a higher score in those articles. Many search engines use TF-IDF, or a closely related weighting scheme, as one ingredient of their ranking algorithm.  


* **Keyword Extraction**  
TF-IDF is also useful for extracting keywords from text. How? The highest scoring words of a document are the most relevant to that document, and therefore they can be considered keywords for that document. Pretty straightforward.
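
Both applications can be sketched with the plain formula from the previous section. Everything named here (`tf_idf`, `top_keywords`, `rank_documents`, the toy corpus) is hypothetical, and scoring a query by summing the TF-IDF weights of its terms is a deliberately simple assumption, not how any particular search engine works.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: tf * log(N / df)} dict per document (natural log)."""
    n_docs = len(tokenized_docs)
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=2):
    """Keyword extraction: the k highest-scoring terms of one document."""
    return sorted(doc_weights, key=doc_weights.get, reverse=True)[:k]


def rank_documents(query_tokens, all_weights):
    """Retrieval: document indices ordered by summed TF-IDF of the query terms."""
    scores = [sum(w.get(t, 0.0) for t in query_tokens) for w in all_weights]
    return sorted(range(len(all_weights)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy corpus.
docs = [
    "lebron scores and lebron wins the game",
    "the election results and the vote",
    "the team wins the game",
]
weights = tf_idf([d.split() for d in docs])

print(top_keywords(weights[0]))                     # ['lebron', 'scores']
print(rank_documents(["lebron", "game"], weights))  # [0, 2, 1]
```

The word "the" never shows up as a keyword because it occurs in every document and therefore has a weight of zero.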

## Practice time!

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
import numpy as np

In [2]:
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [3]:
vectorizer = TfidfVectorizer()

In [4]:
X = vectorizer.fit_transform(corpus)

In [5]:
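# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# .tolist() converts the returned NumPy array into a plain list of Python strings.
vocabulary = vectorizer.get_feature_names_out().tolist()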

In [6]:
print(vocabulary)

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [7]:
X.toarray()

array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])

In [8]:
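# Same computation split into two explicit steps:
# 'count' builds the raw term-count matrix, 'tfid' turns the counts into TF-IDF weights.
pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),
                 ('tfid', TfidfTransformer())]).fit(corpus)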

In [9]:
pipe['count'].transform(corpus).toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
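
The values returned by `X.toarray()` do not match the plain formula from the IDF section, because scikit-learn's defaults differ from it: raw counts are used as the term frequency, the IDF is smoothed as $\ln\left(\frac{1 + N}{1 + df_{(i)}}\right) + 1$ (`smooth_idf=True`), and every row is scaled to unit length (`norm='l2'`). The sketch below, written in plain Python rather than with the library, starts from the count matrix above and applies those three steps by hand. The variable names are illustrative.

```python
import math

# Count matrix copied from the CountVectorizer output above (one row per document).
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents with a non-zero count for each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def l2_normalize(row):
    """Scale a row so that its Euclidean norm is 1 (all-zero rows are left as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

# Should agree with the first row of X.toarray() up to rounding.
print([round(v, 8) for v in tfidf[0]])
```

Because of the `+ 1` in the smoothed IDF, words that occur in every document (such as "is", "the" and "this" here) keep a non-zero weight instead of being cancelled out.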