## TF-IDF Vectorizer, a scikit-learn feature

#### Importing the libraries

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd

#### Training and Test Data

In [16]:
train = ['The sky is blue.','The sun is bright.']
test = ['The sun in the sky is bright', 'We can see the shining sun, the bright sun.']

#### Vectorizer(s) Init

In [17]:
countvectorizer = CountVectorizer(analyzer='word', stop_words='english')
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')

#### Convert the docs into a matrix
Using `.fit_transform()`, create a *Document Term Matrix*.

In [18]:
count_wm = countvectorizer.fit_transform(train)
tfidf_wm = tfidfvectorizer.fit_transform(train)
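
To make the step above concrete, here is a minimal sketch of what `.fit_transform()` does, assuming the same `train` list as above. The names `vec`, `one_step`, and `two_step` are illustrative and not part of the original notebook. It checks that the result is a SciPy sparse matrix with one row per document and one column per vocabulary term, and that it matches calling `.fit()` followed by `.transform()`.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

train = ['The sky is blue.', 'The sun is bright.']

# One step: learn the vocabulary and build the document-term matrix.
vec = CountVectorizer(analyzer='word', stop_words='english')
one_step = vec.fit_transform(train)

# Two steps: learn the vocabulary first, then build the matrix.
two_step = (CountVectorizer(analyzer='word', stop_words='english')
            .fit(train)
            .transform(train))

print(issparse(one_step))   # the matrix is stored in sparse form
print(one_step.shape)       # (documents, vocabulary terms)
print(vec.vocabulary_)      # term -> column index
print(np.array_equal(one_step.toarray(), two_step.toarray()))
```

Calling `.toarray()` is fine for a toy corpus like this one; for large corpora it is usually better to keep the matrix sparse.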

#### Retrieve the terms found in the corpus
When both classes (`CountVectorizer` and `TfidfVectorizer`) are given the same parameters, they learn the same vocabulary, so `get_feature_names_out()` returns the same terms for both.

In [19]:
count_tokens = countvectorizer.get_feature_names_out()
tfidf_tokens = tfidfvectorizer.get_feature_names_out()

In [20]:
# Let's see how these look.
print("count_tokens: ", count_tokens)
print("tfidf_tokens: ", tfidf_tokens)

count_tokens:  ['blue' 'bright' 'sky' 'sun']
tfidf_tokens:  ['blue' 'bright' 'sky' 'sun']


#### Creating the Dataframe

In [21]:
df_countvect = pd.DataFrame(data=count_wm.toarray(),
                            index=['Doc1', 'Doc2'],
                            columns=count_tokens)
df_tfidfvect = pd.DataFrame(data=tfidf_wm.toarray(),
                            index=['Doc1', 'Doc2'],
                            columns=tfidf_tokens)

In [22]:
# Let's see the result.
print("Count Vectorizer\n")
print(df_countvect)
print("\nTF-IDF Vectorizer\n")
print(df_tfidfvect)

Count Vectorizer

      blue  bright  sky  sun
Doc1     1       0    1    0
Doc2     0       1    0    1

TF-IDF Vectorizer

          blue    bright       sky       sun
Doc1  0.707107  0.000000  0.707107  0.000000
Doc2  0.000000  0.707107  0.000000  0.707107


#### Conclusion
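Looking at the SciPy sparse matrices produced by the count and TF-IDF vectorizers above, we can conclude that:
1. `CountVectorizer` records how many times each vocabulary term occurs in each document.
2. `TfidfVectorizer` weights those counts by how rare each term is across all documents and then normalizes each document vector (L2 norm by default), which is why the nonzero entries here are 0.707107 rather than 1.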
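
To see where 0.707107 comes from, the sketch below recomputes the TF-IDF matrix by hand and compares it with scikit-learn. It assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train = ['The sky is blue.', 'The sun is bright.']

# Term counts from the Count Vectorizer table (columns: blue, bright, sky, sun).
counts = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (scikit-learn default)

weighted = counts * idf                     # raw tf * idf
# L2-normalize each row so every document vector has length 1.
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sk = (TfidfVectorizer(analyzer='word', stop_words='english')
      .fit_transform(train)
      .toarray())

print(np.round(manual, 6))
print(np.allclose(manual, sk))
```

Every term appears in exactly one training document, so all IDF values are equal. After normalization, each document's two nonzero entries become 1/√2 ≈ 0.707107.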

In [25]:
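# Reuse the vocabulary learned from `train`; words not seen during fitting are ignored.
term_vectors = countvectorizer.transform(test)

print("Dense form of the test-data term matrix:\n")
print(term_vectors.todense())

Dense form of the test-data term matrix: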

[[0 1 1 1]
 [0 1 0 2]]
