# Exploring sklearn's TfidfVectorizer

`TfidfVectorizer` is part of scikit-learn's `sklearn.feature_extraction.text` module. It's commonly used to build a feature matrix from text data, and it's easy to work with: it can fit and transform data in a single step, and the token regex pattern, n-gram range, minimum and maximum document frequency, and stopwords can all be changed by setting parameters at initialization.

For the curious, `TfidfVectorizer`'s documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer).

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
# Create a test string from the first few paragraphs of a Wikipedia article on collaborative filtering
test_string = 'Collaborative filtering (CF) is a technique used by recommender systems. Collaborative filtering has two senses, a narrow one and a more general one. In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B\'s opinion on a different issue than that of a randomly chosen person. For example, a collaborative filtering recommendation system for television tastes could make predictions about which television show a user should like given a partial list of that user\'s tastes (likes or dislikes). Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an average (non-specific) score for each item of interest, for example based on its number of votes. In the more general sense, collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many different kinds of data including: sensing and monitoring data, such as in mineral exploration, environmental sensing over large areas or multiple sensors; financial data, such as financial service institutions that integrate many financial sources; or in electronic commerce and web applications where the focus is on user data, etc. The remainder of this discussion focuses on collaborative filtering for user data, although some of the methods and approaches may apply to the other major applications as well.'

In [10]:
# Split the string into sentences, which the vectorizer will consider as individual documents
string_data = test_string.split('. ')
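Splitting on `'. '` is a rough heuristic: it drops the period from every sentence except the last and ignores `?` and `!`. A slightly more careful sketch using the standard library's `re` module might look like the following. The `split_sentences` helper and the sample string are illustrative and not part of the original notebook, and abbreviations such as "etc." will still be treated as sentence ends.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace, keeping the punctuation."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# Hypothetical sample text, used only to show the behaviour
sample = 'Collaborative filtering has two senses. Is it useful? It is!'
for i, sentence in enumerate(split_sentences(sample)):
    print(i, ':', sentence)
```

Because the default tokenizer ignores punctuation anyway, either splitting approach should produce the same features for sentences that end in a period.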

In [72]:
# Initialize the vectorizer; by default it tokenizes text into single words (unigrams)
word_vectorizer = TfidfVectorizer()

# Glimpse the vectorizer's default parameters
word_vectorizer.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}

In [74]:
X_word = word_vectorizer.fit_transform(string_data)

In [14]:
X_word

<11x149 sparse matrix of type '<class 'numpy.float64'>'
	with 240 stored elements in Compressed Sparse Row format>

Using `fit_transform` rather than just `fit` tells the vectorizer to learn the vocabulary and document frequencies from the text data and to return the feature matrix in the same call. Note that `X_word` is an 11 x 149 matrix, so there are 11 rows (sentences) and 149 features (words).
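To make the relationship concrete, here is a minimal sketch on a toy corpus (the three documents are made up for illustration) comparing `fit_transform` with a separate `fit` followed by `transform`. It also shows that a fitted vectorizer can transform unseen text, ignoring words outside its vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative, not from the notebook)
docs = [
    'collaborative filtering is a technique',
    'filtering for information or patterns',
    'predictions are specific to the user',
]

# One step: learn the vocabulary and idf weights, then return the matrix
X_one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: fit() learns, transform() builds the matrix
two_step = TfidfVectorizer()
two_step.fit(docs)
X_two_step = two_step.transform(docs)

print(X_one_step.shape == X_two_step.shape)
print(np.allclose(X_one_step.toarray(), X_two_step.toarray()))

# Unseen text is mapped onto the learned columns; unknown words are dropped
X_new = two_step.transform(['filtering unseen words'])
print(X_new.shape, X_new.nnz)
```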

We can make a nice scrollable list of all the words that were extracted from our text sample:

In [25]:
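# get_feature_names() was removed in scikit-learn 1.2; on older versions, call get_feature_names() instead
for word in word_vectorizer.get_feature_names_out():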
    print(word)

about
agents
although
among
an
and
applications
applied
apply
approach
approaches
are
areas
as
assumption
automatic
average
based
been
but
by
cf
chosen
collaborating
collaboration
collaborative
collecting
commerce
could
data
different
differs
discussion
dislikes
each
electronic
environmental
etc
example
exploration
filtering
financial
focus
focuses
for
from
general
given
giving
gleaned
has
have
if
in
including
information
institutions
integrate
interest
interests
involve
involving
is
issue
item
its
kinds
large
like
likely
likes
list
major
make
making
many
may
method
methods
mineral
monitoring
more
multiple
narrow
narrower
newer
non
note
number
of
on
one
opinion
or
other
over
partial
patterns
person
predictions
preferences
process
randomly
recommendation
recommender
remainder
same
score
sense
senses
sensing
sensors
service
sets
should
show
simpler
some
sources
specific
such
system
systems
taste
tastes
technique
techniques
television
than
that
the
these
this
to
two
typically
underlying
u

We can also use the `.vocabulary_` attribute to look at the indices of each feature. Each feature (word) lives at an index in our feature matrix. For example, calling `X_word[0, 25]` will tell us the weighting of 'collaborative' within the first sentence.

In [29]:
word_vectorizer.vocabulary_

{'collaborative': 25,
 'filtering': 40,
 'cf': 21,
 'is': 62,
 'technique': 125,
 'used': 138,
 'by': 20,
 'recommender': 104,
 'systems': 122,
 'has': 50,
 'two': 134,
 'senses': 109,
 'narrow': 83,
 'one': 91,
 'and': 5,
 'more': 81,
 'general': 46,
 'in': 53,
 'the': 130,
 'newer': 85,
 'narrower': 84,
 'sense': 108,
 'method': 77,
 'of': 89,
 'making': 74,
 'automatic': 15,
 'predictions': 99,
 'about': 0,
 'interests': 59,
 'user': 139,
 'collecting': 26,
 'preferences': 100,
 'or': 93,
 'taste': 123,
 'information': 55,
 'from': 45,
 'many': 75,
 'users': 140,
 'collaborating': 23,
 'underlying': 136,
 'assumption': 14,
 'approach': 9,
 'that': 129,
 'if': 52,
 'person': 98,
 'same': 106,
 'opinion': 92,
 'as': 13,
 'on': 90,
 'an': 4,
 'issue': 63,
 'likely': 69,
 'to': 133,
 'have': 51,
 'different': 30,
 'than': 128,
 'randomly': 102,
 'chosen': 22,
 'for': 44,
 'example': 38,
 'recommendation': 103,
 'system': 121,
 'television': 127,
 'tastes': 124,
 'could': 28,
 'make': 73

In [92]:
for entry in range(11):
    print(entry, ":", X_word[entry, 25])

0 : 0.1666622159077999
1 : 0.13425908668519743
2 : 0.09542555931338716
3 : 0.06798342182634644
4 : 0.08035560781786677
5 : 0.0
6 : 0.0
7 : 0.09928901260456367
8 : 0.16848148408815192
9 : 0.05350060595361461
10 : 0.09512697975165428
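Rather than hardcoding a column number such as 25, the index can be looked up from `.vocabulary_` by word. The sketch below uses a small made-up corpus, so the numbers will differ from the output above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative, not from the notebook)
docs = [
    'collaborative filtering is a technique',
    'note that these predictions are specific',
    'collaborative filtering methods have been applied to many kinds of data',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index in X
col = vectorizer.vocabulary_['collaborative']

for row in range(X.shape[0]):
    print(row, ':', X[row, col])
```

Looking the column up this way keeps the code correct even if the vocabulary changes, for example after adding stopwords.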


It's a little easier to make sense of these weightings if we can look at the sentences. Let's print sentences 0, 3, and 5 below:

In [95]:
for num in [0, 3, 5]:
    print(num, ':', string_data[num])

0 : Collaborative filtering (CF) is a technique used by recommender systems
3 : The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person
5 : Note that these predictions are specific to the user, but use information gleaned from many users


Sentence 0 has an especially high weighting (0.1667) for 'collaborative' because the sentence is short: only 9 tokens once the default `token_pattern` drops the single-character word 'a'. Each row is L2-normalized, so with fewer terms sharing the row, 'collaborative' takes a larger share of the weight.

Sentence 3 is quite a bit longer than sentence 0, and 'collaborative' occurs only once. It thus gets a lower weighting of 0.0680.

Sentence 5 has no occurrence of 'collaborative', so its weighting for that feature is 0.
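The pattern above can be reproduced by hand. The following is a pure-Python sketch of the weighting under the default settings shown earlier (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), where the smoothed idf is `ln((1 + n) / (1 + df)) + 1`. The `tfidf_rows` helper and the pre-tokenized toy documents are illustrative assumptions, not scikit-learn's actual implementation.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical, already-tokenized documents of different lengths
docs = [
    ['collaborative', 'filtering', 'cf'],
    ['collaborative', 'filtering', 'has', 'two', 'narrow', 'and', 'general', 'senses'],
    ['note', 'these', 'predictions'],
]
rows = tfidf_rows(docs)
for i, row in enumerate(rows):
    print(i, ':', round(row.get('collaborative', 0.0), 4))
```

The short document gives 'collaborative' a larger weight than the long one, and the document without the word gives it zero, mirroring sentences 0, 3, and 5.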

Let's remove some stopwords. We can fit the input data again and specify the parameter `max_df = 8` to ignore any words that occur in more than 8 of our 11 sentences. ("Df" stands for "document frequency".)

In [56]:
word_vectorizer_stop = TfidfVectorizer(max_df = 8)
X_word_stop = word_vectorizer_stop.fit_transform(string_data)
word_vectorizer_stop.stop_words_

{'collaborative', 'filtering'}

Since the data is pulled from a Wikipedia article on collaborative filtering, these were the only two words to occur in more than 8 of the 11 sentences.
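The threshold semantics of `max_df` can be checked on a toy corpus. Per scikit-learn's documentation, an integer `max_df` drops terms whose document frequency is strictly higher than that number, so a term that appears in exactly `max_df` documents is kept. The four documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 'shared' appears in 4 documents, 'common' in exactly 3
docs = [
    'shared common alpha',
    'shared common beta',
    'shared common gamma',
    'shared delta',
]
vectorizer = TfidfVectorizer(max_df=3)
vectorizer.fit(docs)

print('shared' in vectorizer.vocabulary_)  # False: document frequency 4 > 3
print('common' in vectorizer.vocabulary_)  # True: document frequency 3 is not > 3
```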

An alternative to setting `max_df` (or `min_df`) is to pass an explicit list of stopwords, which keeps the vectorizer from including them in its vocabulary.

In [65]:
# Reinitialize word_vectorizer_stop
word_vectorizer_stop = TfidfVectorizer(stop_words = ['is', 'the', 'of', 'by', 'in', 'or', 'on', 'an', 'to'])
X_word_stop = word_vectorizer_stop.fit_transform(string_data)

word_vectorizer_stop.stop_words_

set()

In [66]:
word_vectorizer_stop.get_stop_words()

frozenset({'an', 'by', 'in', 'is', 'of', 'on', 'or', 'the', 'to'})

Notice that `word_vectorizer_stop.stop_words_` returns an empty set, while `get_stop_words()` returns a frozenset of our list of stopwords. According to the documentation, `.stop_words_` is built from words that were excluded based on setting the parameters `max_df`, `min_df`, and `max_features`, not from setting the `stop_words` parameter.

If you passed stopwords as a list rather than specifying any of those three parameters, then keep this in mind when trying to check which words were used as stopwords. Make sure to call `get_stop_words()` to check your stopwords, not `.stop_words_`.

We can now see that the new vectorizer has excluded the stopwords from its vocabulary, while the old one still includes them:

In [70]:
'of' in word_vectorizer.vocabulary_

True

In [71]:
'of' in word_vectorizer_stop.vocabulary_

False
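Instead of checking words one at a time, the two vocabularies can be compared as sets. This is a small sketch on made-up documents; `plain` and `filtered` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative, not from the notebook)
docs = [
    'the interests of a user',
    'the focus is on user data',
]
plain = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words=['the', 'of', 'is', 'on']).fit(docs)

# Terms present without the stopword list but absent with it
removed = set(plain.vocabulary_) - set(filtered.vocabulary_)
print(sorted(removed))
print(sorted(filtered.get_stop_words()))
```

Both printed lists should contain the same four words, since every stopword in the list occurs somewhere in these documents.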

Now let's make a vectorizer that will build 2-grams of words.

In [51]:
word2_vectorizer = TfidfVectorizer(ngram_range = (2, 2))
X_word2 = word2_vectorizer.fit_transform(string_data)
word2_vectorizer.vocabulary_

{'collaborative filtering': 36,
 'filtering cf': 58,
 'cf is': 33,
 'is technique': 103,
 'technique used': 200,
 'used by': 233,
 'by recommender': 32,
 'recommender systems': 177,
 'filtering has': 60,
 'has two': 83,
 'two senses': 229,
 'senses narrow': 182,
 'narrow one': 133,
 'one and': 156,
 'and more': 9,
 'more general': 129,
 'general one': 77,
 'in the': 89,
 'the newer': 215,
 'newer narrower': 135,
 'narrower sense': 134,
 'sense collaborative': 181,
 'filtering is': 61,
 'is method': 100,
 'method of': 124,
 'of making': 144,
 'making automatic': 119,
 'automatic predictions': 26,
 'predictions filtering': 172,
 'filtering about': 56,
 'about the': 0,
 'the interests': 212,
 'interests of': 97,
 'of user': 149,
 'user by': 235,
 'by collecting': 31,
 'collecting preferences': 37,
 'preferences or': 173,
 'or taste': 163,
 'taste information': 197,
 'information from': 91,
 'from many': 75,
 'many users': 122,
 'users collaborating': 239,
 'the underlying': 221,
 'underly

A glimpse inside our feature matrix shows that sentence 0 has occurrences of terms 36, 58, and 33. The original sentence was "Collaborative filtering (CF) is a technique used by recommender systems."

Taking a look at the top of our vocabulary list above shows that these terms are:

'collaborative filtering': 36
'filtering cf': 58
'cf is': 33

So it's clear that the vectorizer is taking words that occur next to each other and pairing them. We can also see that 'collaborative filtering' (36) is downweighted compared to other word pairs. This is because the other word pairs are unique to this sentence, while 'collaborative filtering' is not. 

In [102]:
print(X_word2[0, :])

  (0, 36)	0.15805741475500926
  (0, 58)	0.37321343063433804
  (0, 33)	0.37321343063433804
  (0, 103)	0.37321343063433804
  (0, 200)	0.37321343063433804
  (0, 233)	0.37321343063433804
  (0, 32)	0.37321343063433804
  (0, 177)	0.37321343063433804


In fact, as we know from our stopwords experiments above, 'collaborative filtering' appears in 9 out of 11 sentences, and is weighted with varying significance in each sentence. (Generally speaking: shorter sentence, higher weighting.)

In [103]:
print(X_word2[:, 36])

  (0, 0)	0.15805741475500926
  (1, 0)	0.14187939579983094
  (2, 0)	0.08332488626530703
  (3, 0)	0.07272669729102695
  (4, 0)	0.08163961550995026
  (7, 0)	0.08675073761032723
  (8, 0)	0.14808056479624188
  (9, 0)	0.05517764812255034
  (10, 0)	0.08452882670491463
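The `ngram_range` parameter is a `(min_n, max_n)` pair, so single words and word pairs can be kept in the same vocabulary by passing `(1, 2)` instead of `(2, 2)`. A minimal sketch on two made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative, not from the notebook)
docs = ['collaborative filtering methods', 'collaborative filtering systems']

# (1, 2) keeps unigrams and bigrams together
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)

for term in sorted(vectorizer.vocabulary_):
    print(term)
```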


And now, a character vectorizer:

In [39]:
char_vectorizer = TfidfVectorizer(analyzer = 'char')
X_char = char_vectorizer.fit_transform(string_data)
char_vectorizer.vocabulary_

{'c': 11,
 'o': 23,
 'l': 20,
 'a': 9,
 'b': 10,
 'r': 26,
 't': 28,
 'i': 17,
 'v': 30,
 'e': 13,
 ' ': 0,
 'f': 14,
 'n': 22,
 'g': 15,
 '(': 2,
 ')': 3,
 's': 27,
 'h': 16,
 'q': 25,
 'u': 29,
 'd': 12,
 'y': 33,
 'm': 21,
 'w': 31,
 ',': 4,
 'k': 19,
 'p': 24,
 "'": 1,
 'x': 32,
 '-': 5,
 ':': 7,
 ';': 8,
 'j': 18,
 '.': 6}

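Looking at this vocabulary, it would probably be appropriate to remove symbol and space characters when using a character vectorizer. Note that, per the documentation, the `stop_words` parameter only applies when `analyzer = 'word'`, so these characters would have to be stripped some other way, such as in a preprocessing step. Finally, let's see what a 2-character vectorizer looks like: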

In [108]:
# Now a 2-character vectorizer!

char2_vectorizer = TfidfVectorizer(analyzer = 'char', ngram_range = (2, 2))
X_char2 = char2_vectorizer.fit_transform(string_data)
char2_vectorizer.vocabulary_

{'co': 65,
 'ol': 177,
 'll': 137,
 'la': 133,
 'ab': 34,
 'bo': 54,
 'or': 181,
 'ra': 199,
 'at': 46,
 'ti': 235,
 'iv': 127,
 've': 253,
 'e ': 74,
 ' f': 6,
 'fi': 96,
 'il': 118,
 'lt': 139,
 'te': 233,
 'er': 86,
 'ri': 203,
 'in': 120,
 'ng': 162,
 'g ': 99,
 ' (': 0,
 '(c': 23,
 'cf': 61,
 'f)': 93,
 ') ': 27,
 ' i': 9,
 'is': 125,
 's ': 213,
 ' a': 1,
 'a ': 32,
 ' t': 18,
 'ec': 78,
 'ch': 62,
 'hn': 111,
 'ni': 163,
 'iq': 123,
 'qu': 195,
 'ue': 245,
 ' u': 19,
 'us': 251,
 'se': 219,
 'ed': 79,
 'd ': 68,
 ' b': 2,
 'by': 56,
 'y ': 263,
 ' r': 16,
 're': 201,
 'om': 178,
 'mm': 149,
 'me': 146,
 'en': 85,
 'nd': 159,
 'de': 70,
 'r ': 196,
 ' s': 17,
 'sy': 228,
 'ys': 266,
 'st': 226,
 'em': 84,
 'ms': 152,
 ' h': 8,
 'ha': 108,
 'as': 45,
 'tw': 241,
 'wo': 259,
 'o ': 171,
 'ns': 166,
 'es': 87,
 's,': 215,
 ', ': 28,
 ' n': 13,
 'na': 157,
 'ar': 44,
 'rr': 208,
 'ro': 207,
 'ow': 186,
 'w ': 256,
 ' o': 14,
 'on': 179,
 'ne': 160,
 'an': 42,
 ' m': 12,
 'mo': 150,
 

In [109]:
X_char2

<11x267 sparse matrix of type '<class 'numpy.float64'>'
	with 1054 stored elements in Compressed Sparse Row format>
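Returning to the symbols and spaces in the single-character vocabulary: one possible way to drop them is a custom `preprocessor`, which runs before the text is split into characters. The sketch below is an assumption about how one might do this, not something the original notebook ran. The `letters_only` helper is hypothetical, and because a custom preprocessor replaces the built-in lowercasing step, the helper lowercases the text itself.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def letters_only(text):
    """Lowercase the text and strip everything except the letters a-z."""
    return re.sub(r'[^a-z]', '', text.lower())

# Toy documents (illustrative, not from the notebook)
docs = [
    'Collaborative filtering (CF) is a technique.',
    'Note: these predictions are user-specific!',
]
char_vectorizer_clean = TfidfVectorizer(analyzer='char', preprocessor=letters_only)
char_vectorizer_clean.fit(docs)

print(sorted(char_vectorizer_clean.vocabulary_))
```

Removing spaces also means that character n-grams longer than 1 would span word boundaries, so whether to keep spaces depends on the use case.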