# Text Representation using TF-IDF

In the approaches we have seen so far, all words in the text are treated as equally important; there is no notion of some words in a document being more important than others. TF-IDF (term frequency-inverse document frequency) addresses this issue. It aims to quantify the importance of a given word relative to other words in the document and in the corpus. It is a commonly used representation scheme in information retrieval systems, where it helps retrieve the documents in a corpus that are relevant to a given text query.

The intuition behind **TF-IDF** is as follows:  
if a word _w_ appears many times in a document _dᵢ_  
but does not occur much in the rest of the documents _dⱼ_ in the corpus,  
then the word _w_ must be of great importance to the document _dᵢ_.  

The importance of _w_ should **increase** in proportion to its frequency in _dᵢ_,  
but at the same time, its importance should **decrease** in proportion to the word’s frequency in other documents _dⱼ_ in the corpus.  

Mathematically, this is captured using two quantities: **TF** and **IDF**.  
The two are then combined to arrive at the **TF-IDF score**.
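
As a rough sketch of how the two quantities combine, the snippet below uses one common textbook formulation: TF is a word's count in a document divided by the document's length, and IDF is the natural log of the number of documents divided by the number of documents containing the word. The helper names (`tf`, `idf`, `tf_idf`) and the toy corpus are illustrative assumptions. Libraries such as scikit-learn use a smoothed and normalized variant, so their numbers will differ from these.

```python
import math

# Toy corpus, already lowercased and stripped of punctuation (assumed for illustration).
docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
corpus = [d.split() for d in docs]

def tf(term, tokens):
    # Term frequency: share of the document's tokens that are `term`.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Inverse document frequency: rarer terms get a larger weight.
    # Assumes `term` occurs in at least one document.
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Score every word of the third document, "dog eats meat".
scores = {t: tf_idf(t, corpus[2], corpus) for t in corpus[2]}
for term, score in scores.items():
    print(f"{term}: {score:.4f}")
```

In this sketch "meat" scores highest in its document because it appears in only one of the four documents, while "dog" scores lowest because it appears in three.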


In [2]:
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
# Minimal preprocessing: lowercase the text and strip the periods.
processed_docs = [doc.lower().replace(".", "") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [3]:
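from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the corpus, then build the
# document-term matrix of TF-IDF scores (one row per document).
tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(processed_docs)
bow_rep_tfidf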


<4x6 sparse matrix of type '<class 'numpy.float64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [6]:
# All words in the vocabulary.
print("All words in the vocabulary", tfidf.get_feature_names_out())
print("-" * 10)

# IDF for all words in the vocabulary (same order as the vocabulary above).
print("IDF for all words in the vocabulary", tfidf.idf_)
print("-" * 10)


All words in the vocabulary ['bites' 'dog' 'eats' 'food' 'man' 'meat']
----------
IDF for all words in the vocabulary [1.51082562 1.22314355 1.51082562 1.91629073 1.22314355 1.91629073]
----------
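
The IDF values printed above can be reproduced by hand. With its default settings (`smooth_idf=True`), scikit-learn documents the formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. The sketch below applies that formula in plain Python; the whitespace tokenization is an assumption that happens to match the default tokenizer on this small corpus.

```python
import math

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
tokenized = [d.split() for d in docs]

# Sorted vocabulary, matching the order of get_feature_names_out().
vocab = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents in which each word appears.
doc_freq = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}

# Smoothed IDF as documented for scikit-learn's default settings.
smoothed_idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

for w in vocab:
    print(f"{w}: df={doc_freq[w]}, idf={smoothed_idf[w]:.8f}")
```

Words that occur in three documents ("dog", "man") receive the smallest weight, and words that occur in only one ("food", "meat") receive the largest.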


In [10]:
# TF-IDF representation for all documents in our corpus
print("All documents: ")
print(processed_docs)
print("-" * 10)
print("TFIDF representation for all documents in our corpus\n", bow_rep_tfidf.toarray())
print("-" * 10)

All documents: 
['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']
----------
TFIDF representation for all documents in our corpus
 [[0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.         0.44809973 0.55349232 0.         0.         0.70203482]
 [0.         0.         0.55349232 0.70203482 0.44809973 0.        ]]
----------
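
Each row of the matrix above is a document's raw counts multiplied by the IDF weights and then scaled to unit length, since `TfidfVectorizer` applies L2 normalization by default (`norm='l2'`). The following sketch rebuilds a row in plain Python under those default settings; `tfidf_row` is a hypothetical helper name, and splitting on whitespace is an assumed stand-in for the library's tokenizer.

```python
import math
from collections import Counter

docs = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(tokenized)

# Smoothed IDF, following scikit-learn's documented default formula.
idf = {
    w: math.log((1 + n_docs) / (1 + sum(1 for t in tokenized if w in t))) + 1
    for w in vocab
}

def tfidf_row(text):
    # Raw term counts, restricted to words in the fitted vocabulary.
    counts = Counter(w for w in text.split() if w in idf)
    raw = [counts[w] * idf[w] for w in vocab]
    # L2 normalization: divide by the Euclidean length of the row.
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

row = tfidf_row("dog bites man")
print([round(x, 8) for x in row])
```

Because the representation ignores word order, "dog bites man" and "man bites dog" produce identical rows, which is why the first two rows of the matrix match.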


In [11]:
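# Transform unseen text with the already-fitted vectorizer.
# Words outside the learned vocabulary ("and", "are", "friends") are ignored.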
temp = tfidf.transform(["dog and man are friends"])
print("Tfidf representation for 'dog and man are friends':\n", temp.toarray())

Tfidf representation for 'dog and man are friends':
 [[0.         0.70710678 0.         0.         0.70710678 0.        ]]
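
The two values of 0.70710678 follow from the same arithmetic: only "dog" and "man" are in the fitted vocabulary, both have the same IDF, and L2 normalization of two equal weights gives 1/√2 each. A minimal sketch, assuming scikit-learn's default smoothed IDF and with the vocabulary and document frequencies hard-coded from the corpus above:

```python
import math

# Vocabulary and document frequencies taken from the four-document corpus.
vocab = ["bites", "dog", "eats", "food", "man", "meat"]
doc_freq = {"bites": 2, "dog": 3, "eats": 2, "food": 1, "man": 3, "meat": 1}
n_docs = 4

query = "dog and man are friends".split()

# Out-of-vocabulary words are dropped; each remaining word occurs once here,
# so its raw weight is just its smoothed IDF.
known = [w for w in query if w in doc_freq]
weights = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in known}

# L2-normalize so the vector has unit length.
norm = math.sqrt(sum(v * v for v in weights.values()))
vector = [weights.get(w, 0.0) / norm for w in vocab]
print([round(x, 8) for x in vector])
```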
