<h1>How to Use TfidfTransformer and TfidfVectorizer</h1>

<h2> Using TfidfTransformer </h2>

In [37]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

In [78]:
# a toy corpus, kept tiny so that the usage differences are easy to follow
docs=["the House: had a tiny little mouse",
      "the cat saw the' mouse",
      "the mouse. ran away from the house",
      "the cat finally ate the mouse",
      "the end of the mouse story"
     ]

In [11]:
# instantiate CountVectorizer
cv = CountVectorizer()

# learn the vocabulary and count how often each word occurs in each document
word_count_vector = cv.fit_transform(docs)

In [29]:
# should have 5 rows for 5 docs and 16 columns for 16 unique words
word_count_vector.shape

(5, 16)

In [16]:
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [31]:
# get the idf score of each term (on scikit-learn older than 1.0, use cv.get_feature_names() instead)
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=['idf_weight'])

# sort ascending
df_idf.sort_values(by=['idf_weight'])

term,idf_weight
mouse,1.0
the,1.0
cat,1.693147
house,1.693147
ate,2.098612
away,2.098612
end,2.098612
finally,2.098612
from,2.098612
had,2.098612
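
<p> The weights above can be reproduced by hand. With <code>smooth_idf=True</code>, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below is only a sanity check: the document frequencies in <code>doc_freq</code> were counted by hand from the five toy documents, and the variable names are illustrative. </p>

```python
import math

n_docs = 5

# number of documents each term appears in, counted by hand from the toy corpus
doc_freq = {"mouse": 5, "the": 5, "cat": 2, "house": 2, "had": 1}

# smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

for term, value in idf.items():
    print(f"{term}: {value:.6f}")
```

<p> Terms that occur in every document ("mouse", "the") get the minimum weight of 1.0, and the weight grows as a term becomes rarer. </p>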


<h3> Let's compute the tf-idf scores now </h3>

In [34]:
# we already have the count matrix so let's get the tf-idf scores
tf_idf_vector = tfidf_transformer.transform(word_count_vector)

In [51]:
feature_names = cv.get_feature_names_out()
 
# get tfidf vector for first document
first_document_vector = tf_idf_vector[0]
 
# print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

# Note that 'a' is missing: CountVectorizer's default token_pattern only keeps tokens of two or more characters

term,tfidf
had,0.493562
little,0.493562
tiny,0.493562
house,0.398203
mouse,0.235185
the,0.235185
ate,0.0
away,0.0
cat,0.0
end,0.0
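
<p> The scores for the first document can also be checked by hand. With the default settings (<code>norm='l2'</code>), each term count is multiplied by its idf weight and the resulting vector is scaled to unit Euclidean length. The sketch below assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 and uses hand-counted values for the first document; all names in it are illustrative. </p>

```python
import math

n_docs = 5

# terms of the first document with (count in document, document frequency in corpus)
first_doc = {
    "the": (1, 5), "mouse": (1, 5), "house": (1, 2),
    "had": (1, 1), "tiny": (1, 1), "little": (1, 1),
}

# raw tf-idf: term count times smoothed idf
raw = {term: tf * (math.log((1 + n_docs) / (1 + df)) + 1)
       for term, (tf, df) in first_doc.items()}

# L2 normalisation: divide by the Euclidean length of the raw vector
length = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {term: value / length for term, value in raw.items()}

for term, value in sorted(tfidf.items(), key=lambda item: -item[1]):
    print(f"{term}: {value:.6f}")
```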


<h2> Using TfidfVectorizer now! </h2>
<p> With TfidfVectorizer, you compute the word counts, the idf values and the tf-idf scores in a single step, directly from the raw documents. </p>

In [79]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [80]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)

In [81]:
# tf-idf vector of the first document
first_document_tfidfvectorizer = tfidf_vectorizer_vectors[0]

# put the scores of all documents in a dataframe: one row per term, one column per document
df = pd.DataFrame(tfidf_vectorizer_vectors.T.todense(), index=tfidf_vectorizer.get_feature_names_out())
df

term,0,1,2,3,4
ate,0.0,0.0,0.0,0.513923,0.0
away,0.0,0.0,0.457093,0.0,0.0
cat,0.0,0.483344,0.0,0.41463,0.0
end,0.0,0.0,0.0,0.0,0.491753
finally,0.0,0.0,0.0,0.513923,0.0
from,0.0,0.0,0.457093,0.0,0.0
had,0.493562,0.0,0.0,0.0,0.0
house,0.398203,0.0,0.36878,0.0,0.0
little,0.493562,0.0,0.0,0.0,0.0
mouse,0.235185,0.285471,0.217807,0.244887,0.234323
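
<p> Column 0 of this table matches the scores obtained earlier with TfidfTransformer. A quick way to confirm that the two routes agree on every document is to compare the full matrices. The sketch below is a minimal check with illustrative variable names (<code>two_step</code>, <code>one_step</code>), using default settings apart from the parameters already shown in this notebook. </p>

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the House: had a tiny little mouse",
        "the cat saw the' mouse",
        "the mouse. ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

# route 1: count the words first, then weight the counts
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer(smooth_idf=True, use_idf=True).fit_transform(counts)

# route 2: do everything in one step from the raw text
one_step = TfidfVectorizer(use_idf=True).fit_transform(docs)

# both routes should produce the same matrix, up to floating-point error
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

<p> As a rule of thumb, TfidfTransformer is convenient when the count matrix is needed for something else as well, while TfidfVectorizer is the shorter route when only the tf-idf scores matter. </p>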


<h4>For further reference, click <a href="https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Xigm71MzZQJ">here</a>.</h4>