### Let’s define the dataset in a way that scikit-learn can use:

In [44]:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.")

In scikit-learn, what we have presented as the term frequency is implemented by CountVectorizer, so we need to import it and create a new instance:

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

In [46]:
vectorizer = CountVectorizer(stop_words='english')

By default, CountVectorizer uses the 'word' analyzer (older scikit-learn releases called it WordNGramAnalyzer), which converts the text to lowercase, extracts tokens, and filters stop words; accent removal is also available through the strip_accents parameter. You can see the full configuration by printing the instance:

In [47]:
print(vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


Let’s create now the vocabulary index:

In [48]:
vectorizer.fit_transform(train_set)
print(vectorizer.vocabulary_)

{'sky': 2, 'sun': 3, 'blue': 0, 'bright': 1}
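
To make the analyzer's role more concrete, here is a minimal pure-Python sketch of how such a vocabulary could be built. It assumes the default token pattern shown in the printed parameters and uses a tiny hypothetical stop list (`STOP_WORDS`) in place of scikit-learn's much larger built-in English list; `tokenize` and `build_vocabulary` are illustrative helpers, not library functions.

```python
import re

# Hypothetical, tiny stop list; scikit-learn's built-in English list is far larger.
STOP_WORDS = {"the", "is", "in", "we", "can", "see"}

# Same pattern as the default token_pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text, extract tokens, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(documents):
    """Map each distinct term to a column index, in alphabetical order."""
    terms = sorted({t for doc in documents for t in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


train_set = ("The sky is blue.", "The sun is bright.")
vocabulary = build_vocabulary(train_set)
print(vocabulary)
```

The term-to-index mapping agrees with the one printed by `vectorizer.vocabulary_`; only the order in which the dictionary entries are displayed differs.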


Note the vocabulary that was created: each term is mapped to a column index.
Let’s now use the same vectorizer to create the sparse matrix for our test_set documents:

In [49]:
freq_term_matrix = vectorizer.transform(test_set)
print(freq_term_matrix)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (1, 1)	1
  (1, 3)	2


You can convert it to a dense format:

In [50]:
freq_term_matrix.todense()

matrix([[0, 1, 1, 1],
        [0, 1, 0, 2]], dtype=int64)
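
If you want to check these counts by hand, the following is a rough pure-Python sketch of the counting step. It assumes the vocabulary printed earlier and the default token pattern; `count_row` is a hypothetical helper written for this illustration.

```python
import re
from collections import Counter

# Vocabulary learned from the training set (copied from the printed output).
vocabulary = {"blue": 0, "bright": 1, "sky": 2, "sun": 3}

test_set = ("The sun in the sky is bright.",
            "We can see the shining sun, the bright sun.")


def count_row(text, vocabulary):
    """Count occurrences of each vocabulary term in one document."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    # Tokens outside the vocabulary (such as "shining") are simply ignored.
    counts = Counter(t for t in tokens if t in vocabulary)
    row = [0] * len(vocabulary)
    for term, n in counts.items():
        row[vocabulary[term]] = n
    return row


dense = [count_row(doc, vocabulary) for doc in test_set]
print(dense)
```

Note that "shining" appears in the second test document but gets no column, because the vocabulary was fixed when the vectorizer was fitted on the training set.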

### The term frequency – inverse document frequency (tf-idf) weight

Now that we have the term-frequency matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which is responsible for calculating the tf-idf weights for our term-frequency matrix:

In [51]:
from sklearn.feature_extraction.text import TfidfTransformer

In [52]:
tfidf = TfidfTransformer(norm='l2')
tfidf.fit(freq_term_matrix)
print(tfidf.idf_)

[ 2.09861229  1.          1.40546511  1.        ]
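
These idf values can be reproduced by hand. The sketch below assumes the default `smooth_idf=True`, in which case the weight of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The dense matrix is copied from the earlier output.

```python
import math

# Dense term-frequency matrix of the test set (rows: documents, columns: terms).
freq = [[0, 1, 1, 1],
        [0, 1, 0, 2]]

n_docs = len(freq)
# Document frequency: number of documents in which each term appears.
df = [sum(1 for row in freq if row[j] > 0) for j in range(len(freq[0]))]
# Smoothed idf, assuming smooth_idf=True: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
print([round(v, 8) for v in idf])
```

The first term ("blue") never occurs in the test set, so it gets the largest weight, while terms that occur in every document get a weight of exactly 1.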


Note that I’ve specified the norm as L2. This is optional (L2 is actually the default), but I’ve added the parameter to make it explicit that the L2 norm is going to be used. Also note that you can see the calculated idf weights by accessing the fitted attribute idf_. Now that the fit() method has calculated the idf for the matrix, let’s transform freq_term_matrix into the tf-idf weight matrix:

In [53]:
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix)

  (0, 3)	0.501548907094
  (0, 2)	0.704909488931
  (0, 1)	0.501548907094
  (1, 3)	0.894427191
  (1, 1)	0.4472135955
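
To see where these numbers come from, here is a small pure-Python sketch of the transform step: each count is multiplied by the idf of its term, and each row is then scaled to unit L2 length. The counts and idf values are copied from the outputs above; `tfidf_row` is a hypothetical helper for this illustration.

```python
import math

freq = [[0, 1, 1, 1],
        [0, 1, 0, 2]]
idf = [2.09861229, 1.0, 1.40546511, 1.0]  # values printed by tfidf.idf_


def tfidf_row(counts, idf):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [c * w for c, w in zip(counts, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted


tfidf_rows = [tfidf_row(row, idf) for row in freq]
for row in tfidf_rows:
    print([round(v, 6) for v in row])
```

Up to rounding, the non-zero entries are the same values that the sparse tf-idf matrix lists.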



### Cosine Similarity for Vector Space Models

In [54]:
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)

Then we instantiate the scikit-learn TfidfVectorizer and transform our documents into the TF-IDF matrix:

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
print(tfidf_vectorizer)
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix.shape)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
(4, 5)


Now we have the TF-IDF matrix (tfidf_matrix), with one row per document (4 rows) and one column per tf-idf term (5 columns), so we can calculate the cosine similarity between the first document (“The sky is blue”) and every document in the set, including itself:

In [58]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

array([[ 1.        ,  0.        ,  0.40728206,  0.        ]])
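
As a cross-check, the same similarities can be approximated without scikit-learn. The sketch below is not the library's implementation. It assumes the default token pattern, smoothed idf and L2 normalization, and it replaces the built-in English stop list with a tiny hypothetical one (`STOP_WORDS`) that happens to cover these four sentences.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop list; scikit-learn's built-in English list is far larger.
STOP_WORDS = {"the", "is", "in", "we", "can", "see"}

documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
)


def tokenize(text):
    """Lowercase, extract tokens of two or more characters, drop stop words."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


counts = [Counter(tokenize(doc)) for doc in documents]
terms = sorted({t for c in counts for t in c})
n_docs = len(documents)
# Smoothed idf, assuming smooth_idf=True: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + sum(1 for c in counts if t in c))) + 1
       for t in terms}


def unit_tfidf(count):
    """Build one tf-idf row and scale it to unit L2 length."""
    row = [count[t] * idf[t] for t in terms]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


rows = [unit_tfidf(c) for c in counts]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


similarities = [cosine(rows[0], row) for row in rows]
print([round(s, 6) for s in similarities])
```

Because every row already has unit length, the cosine similarity reduces to a plain dot product here. The first document only shares a term ("sky") with the third one, which is why the other two similarities are zero.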

### Find the top features from the corpus

In [59]:
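# Sort term indices by idf, highest (rarest terms) first
indices = np.argsort(tfidf_vectorizer.idf_)[::-1]
# On scikit-learn 1.0 and later, use get_feature_names_out() instead;
# get_feature_names() was removed in 1.2
features = tfidf_vectorizer.get_feature_names()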

In [61]:
top_n = 5
top_features = [features[i] for i in indices[:top_n]]
print(top_features)

['shining', 'blue', 'sky', 'sun', 'bright']
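
Keep in mind that this ranking is by idf, so "top" here means the rarest terms in the corpus, not the terms with the largest tf-idf weight in any single document. A small sketch of the same ranking in plain Python, assuming smoothed idf and document frequencies counted by hand from the four documents:

```python
import math

# Document frequencies of the five terms across the four documents,
# counted by hand from the corpus above (an assumption of this sketch).
doc_freq = {"blue": 1, "bright": 3, "shining": 1, "sky": 2, "sun": 3}
n_docs = 4

# Smoothed idf, assuming smooth_idf=True: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

# Highest idf first; terms with equal idf may appear in either order.
ranked = sorted(idf, key=idf.get, reverse=True)
for term in ranked:
    print(f"{term:8s} {idf[term]:.4f}")
```

"blue" and "shining" each occur in one document and share the highest idf, while "sun" and "bright" occur in three documents and share the lowest, so the relative order within each pair is arbitrary.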
