* Term Frequency – Inverse Document Frequency (tf-idf) is a method for evaluating how important a specific word is to a document within a corpus.
* It converts the textual representation of information into a Vector Space Model (VSM), that is, into sparse features.
* A VSM is an algebraic model that represents textual information as a vector. The components of this vector can represent the importance of a term (tf–idf) or simply its presence or absence in a document (Bag of Words).
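
To make the vector idea concrete, here is a minimal pure-Python sketch of turning one sentence into a count vector and a presence/absence vector. The four-term vocabulary is hard-coded for illustration, and `tokenize`, `count_vector` and `presence_vector` are hypothetical helper names, not part of scikit-learn. The tokenizer is deliberately crude and is only an assumption about how text might be split.

```python
# Illustrative vocabulary (an assumption for this sketch); position = vector index.
vocabulary = ["blue", "bright", "sky", "sun"]

def tokenize(text):
    # Crude tokenizer: split on whitespace, strip trailing punctuation, lowercase.
    return [word.strip(".,").lower() for word in text.split()]

def count_vector(text):
    # One component per vocabulary term: how many times the term occurs.
    tokens = tokenize(text)
    return [tokens.count(term) for term in vocabulary]

def presence_vector(text):
    # One component per vocabulary term: 1 if the term occurs at all, else 0.
    return [1 if count > 0 else 0 for count in count_vector(text)]

doc = "We can see the shining sun, the bright sun."
print(count_vector(doc))     # [0, 1, 0, 2]
print(presence_vector(doc))  # [0, 1, 0, 1]
```

Words outside the vocabulary (such as "shining") simply contribute nothing to either vector.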

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.", "We can see the shining sun, the bright sun.")

In [38]:
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer.fit_transform(train_set)
print("Vocabulary:", count_vectorizer.vocabulary_)

Vocabulary: {'sky': 2, 'blue': 0, 'sun': 3, 'bright': 1}


In [39]:
freq_term_matrix = count_vectorizer.transform(test_set)
print(freq_term_matrix.todense())

[[0 1 1 1]
 [0 1 0 2]]


In [40]:
from sklearn.feature_extraction.text import TfidfTransformer

In [41]:
tfidf = TfidfTransformer(norm="l2") #refers to Euclidean norm: https://en.wikipedia.org/wiki/Norm_%28mathematics%29#Euclidean_norm
tfidf.fit(freq_term_matrix)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [45]:
#print("IDF:", tfidf.idf_)

In [46]:
#print("Params:", tfidf.get_params())

In [44]:
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.toarray())

[[ 0.          0.50154891  0.70490949  0.50154891]
 [ 0.          0.4472136   0.          0.89442719]]


#### Notes:
* tf means term frequency, while tf-idf means term frequency times inverse document frequency. This is a common term-weighting scheme in information retrieval.
* The goal is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
* In other words, it scales down terms that are frequent across the corpus and scales up rare terms, which are empirically more informative than high-frequency terms.
* A term that occurs 10 times more often than another is not 10 times more important, which is why tf-idf applies logarithmic scaling. With the settings used here, the logarithm is applied in the idf component; the tf component stays linear because `sublinear_tf=False`.
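
The tf-idf matrix printed above can be reproduced by hand. The sketch below assumes the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, namely `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The count matrix is copied from the notebook output; `l2_normalize` is a hypothetical helper name.

```python
import math

# Term counts for the two test documents, copied from the output above.
# Column order: blue, bright, sky, sun.
counts = [[0, 1, 1, 1],
          [0, 1, 0, 2]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf (assumed formula, see lead-in): ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    # Divide each component by the row's Euclidean length.
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm if norm else 0.0 for x in row]

# tf-idf = raw count times idf, then each row scaled to unit length.
tfidf_rows = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in counts]

for row in tfidf_rows:
    print([round(x, 4) for x in row])
# [0.0, 0.5015, 0.7049, 0.5015]
# [0.0, 0.4472, 0.0, 0.8944]
```

In the first document, "sky" gets the highest weight because it appears in only one of the two documents, while "bright" and "sun" appear in both and receive the minimum idf of 1.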