### This chapter covers
- Counting words and term frequencies to analyze meaning
- Predicting word occurrence probabilities with Zipf’s Law
- Representing words as vectors and how to start using them
- Finding relevant documents in a corpus using inverse document frequencies
- Estimating the similarity of pairs of documents with cosine similarity and Okapi BM25

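Detecting words is useful for simple tasks, like getting statistics about word usage or doing keyword search. But we would also like to know which words are more important to a particular document and across the corpus as a whole. Then we can use that “importance” value to find relevant documents in a corpus based on keyword importance within each document.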

##### In this chapter, we look at three increasingly powerful ways to represent words and their importance in a document:
- Bags of words: vectors of word counts or frequencies
- Bags of n-grams: counts of word pairs (bigrams), triplets (trigrams), and so on
- TF-IDF vectors: word scores that better represent their importance
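
As a rough illustration of the first two representations, the sketch below counts single tokens and bigrams for one short sentence. It uses plain `str.split()` instead of a real tokenizer, which is an assumption made only to keep the example self-contained, and the punctuation-free sentence is made up for the illustration.

```python
from collections import Counter

# Whitespace splitting is a simplification; a real tokenizer also separates punctuation.
tokens = "the faster harry got to the store the faster harry would get home".split()

# Bag of words: how often each token occurs.
bag_of_words = Counter(tokens)

# Bag of bigrams: how often each pair of adjacent tokens occurs.
bag_of_bigrams = Counter(zip(tokens, tokens[1:]))

print(bag_of_words.most_common(3))
print(bag_of_bigrams.most_common(2))
```

A bag of bigrams keeps a little word-order information that a plain bag of words throws away, at the cost of a much larger vocabulary.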

Note: A document that refers to “wings” and “rudder” frequently may be more relevant to a problem involving jet airplanes or air travel than, say, a document that refers frequently to “cats” and “gravity.” Or, if we have classified some words as expressing positive emotions (words like “good,” “best,” “joy,” and “fantastic”), the more of those words a document contains, the more likely it is to have positive “sentiment.”
##### Let’s look at an example where counting occurrences of words is useful:

In [1]:
import nltk

In [2]:
from nltk.tokenize import TreebankWordTokenizer

In [3]:
sentence = """The faster Harry got to the store, the faster Harry, the faster, would get home."""

In [4]:
tokenizer = TreebankWordTokenizer()

In [5]:
tokens = tokenizer.tokenize(sentence.lower())

In [6]:
tokens

['the',
 'faster',
 'harry',
 'got',
 'to',
 'the',
 'store',
 ',',
 'the',
 'faster',
 'harry',
 ',',
 'the',
 'faster',
 ',',
 'would',
 'get',
 'home',
 '.']

In [7]:
from collections import Counter

In [8]:
bag_of_words = Counter(tokens)

In [9]:
bag_of_words

Counter({'the': 4,
         'faster': 3,
         'harry': 2,
         'got': 1,
         'to': 1,
         'store': 1,
         ',': 3,
         'would': 1,
         'get': 1,
         'home': 1,
         '.': 1})

###### A Counter object has a handy method, most_common, for listing the most frequent tokens:

In [10]:
bag_of_words.most_common(4)

[('the', 4), ('faster', 3), (',', 3), ('harry', 2)]

The number of times a word occurs in a given document is called the term frequency, commonly abbreviated TF. In some examples you may see the count of word occurrences normalized (divided) by the number of terms in the document. That normalized value is really a probability (a relative frequency), so strictly speaking it should probably not be called a frequency.

Let’s calculate the term frequency of “harry” from the Counter object (bag_of_words) you defined above:

In [11]:
times_harry_appears = bag_of_words['harry']

In [12]:
num_tokens = len(tokens)  # Total number of tokens in the document, including repeats

In [13]:
tf = times_harry_appears / num_tokens

In [14]:
round(tf, 4)

0.1053
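
Normalized term frequencies can also be computed for every token at once. The sketch below hard-codes the token list printed earlier in this section so that it runs without NLTK, and it divides each count by the total number of tokens in the document. The helper name `term_frequencies` is hypothetical.

```python
from collections import Counter

# Tokens copied from the tokenizer output shown earlier (19 tokens in total).
tokens = ['the', 'faster', 'harry', 'got', 'to', 'the', 'store', ',',
          'the', 'faster', 'harry', ',', 'the', 'faster', ',',
          'would', 'get', 'home', '.']


def term_frequencies(tokens):
    """Return a dict mapping each token to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


tf = term_frequencies(tokens)
print(round(tf['harry'], 4))
print(round(sum(tf.values()), 4))
```

Dividing by the total token count means the values for one document sum to 1, which is what makes them comparable across documents of different lengths.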

Take these first few paragraphs from the Wikipedia article on kites:

In [19]:

kite_text1 = """A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react
against the air to create lift and drag. A kite consists of wings, tethers, and anchors.
Kites often have a bridle to guide the face of the kite at the correct angle so the wind
can lift it. A kite’s wing also may be so designed so a bridle is not needed; when
kiting a sailplane for launch, the tether meets the wing at a single point. A kite may
have fixed or moving anchors. Untraditionally in technical kiting, a kite consists of
tether-set-coupled wing sets; even in technical kiting, though, a wing in the system is
still often called the kite.
The lift that sustains the kite in flight is generated when air flows around the kite’s
surface, producing low pressure above and high pressure below the wings. The
interaction with the wind also generates horizontal drag along the direction of the
wind. The resultant force vector from the lift and drag force components is opposed
by the tension of one or more of the lines or tethers to which the kite is attached. The
anchor point of the kite line may be static or moving (such as the towing of a kite by
a running person, boat, free-falling anchors as in paragliders and fugitive parakites
or vehicle).
The same principles of fluid flow apply in liquids and kites are also used under water.
A hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite
lifting surface is called a kytoon.
Kites have a long and varied history and many different types are flown
individually and at festivals worldwide. Kites may be flown for recreation, art or
other practical uses. Sport kites can be flown in aerial ballet, sometimes as part of a
competition. Power kites are multi-line steerable kites designed to generate large forces
which can be used to power activities such as kite surfing, kite landboarding, kite
fishing, kite buggying and a new trend snow kiting. Even Man-lifting kites have
been made."""

In [16]:
from collections import Counter

In [17]:
from nltk.tokenize import TreebankWordTokenizer

In [18]:
tokenizer = TreebankWordTokenizer()

In [22]:
tokens = tokenizer.tokenize(kite_text1.lower())

In [23]:
token_counts = Counter(tokens)

In [24]:
token_counts

Counter({'a': 20,
         'kite': 14,
         'is': 7,
         'traditionally': 1,
         'tethered': 2,
         'heavier-than-air': 1,
         'craft': 2,
         'with': 2,
         'wing': 5,
         'surfaces': 1,
         'that': 2,
         'react': 1,
         'against': 1,
         'the': 26,
         'air': 2,
         'to': 5,
         'create': 1,
         'lift': 4,
         'and': 10,
         'drag.': 1,
         'consists': 2,
         'of': 10,
         'wings': 1,
         ',': 14,
         'tethers': 2,
         'anchors.': 2,
         'kites': 8,
         'often': 2,
         'have': 4,
         'bridle': 2,
         'guide': 1,
         'face': 1,
         'at': 3,
         'correct': 1,
         'angle': 1,
         'so': 3,
         'wind': 2,
         'can': 3,
         'it.': 1,
         'kite’s': 2,
         'also': 3,
         'may': 4,
         'be': 5,
         'designed': 2,
         'not': 1,
         'needed': 1,
         ';': 2,
         'when':

In [25]:
import nltk

In [26]:
nltk.download('stopwords', quiet=True)

True

In [27]:
stopwords = nltk.corpus.stopwords.words('english')

In [28]:
tokens = [x for x in tokens if x not in stopwords]

In [29]:
kite_counts = Counter(tokens)

In [30]:
kite_counts

Counter({'kite': 14,
         'traditionally': 1,
         'tethered': 2,
         'heavier-than-air': 1,
         'craft': 2,
         'wing': 5,
         'surfaces': 1,
         'react': 1,
         'air': 2,
         'create': 1,
         'lift': 4,
         'drag.': 1,
         'consists': 2,
         'wings': 1,
         ',': 14,
         'tethers': 2,
         'anchors.': 2,
         'kites': 8,
         'often': 2,
         'bridle': 2,
         'guide': 1,
         'face': 1,
         'correct': 1,
         'angle': 1,
         'wind': 2,
         'it.': 1,
         'kite’s': 2,
         'also': 3,
         'may': 4,
         'designed': 2,
         'needed': 1,
         ';': 2,
         'kiting': 3,
         'sailplane': 1,
         'launch': 1,
         'tether': 1,
         'meets': 1,
         'single': 1,
         'point.': 1,
         'fixed': 1,
         'moving': 2,
         'untraditionally': 1,
         'technical': 2,
         'tether-set-coupled': 1,
         'sets'

#### Vectorizing

In [31]:
document_vector = []

In [32]:
doc_length = len(tokens)

In [33]:
for key, value in kite_counts.most_common():
    document_vector.append(value / doc_length)

In [34]:
document_vector

[0.06422018348623854,
 0.06422018348623854,
 0.03669724770642202,
 0.022935779816513763,
 0.01834862385321101,
 0.01834862385321101,
 0.013761467889908258,
 0.013761467889908258,
 0.013761467889908258,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.009174311926605505,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.0045871559633027525,
 0.00

We can grab a couple more documents and make vectors for each of them as well. But the values within each vector need to be relative to something consistent across all the vectors. If we’re going to do math on them, they need to represent a position in a common space.
###### The first step in this process is to normalize the counts by calculating normalized term frequency instead of raw count in the document (as you did in the last section); the second step is to make all the vectors of standard length or dimension.

You’ll find every unique word in each document and then find every unique word in the union of those sets. This collection of words in your vocabulary is often called a lexicon.

Let’s check in on Harry. You had one “document” already; let’s round out the corpus with a couple more:

In [35]:
docs = ["The faster Harry got to the store, the faster and faster Harry would get home."]  # Start the list with one document; append() adds the rest

In [36]:
docs.append("Harry is hairy and faster than Jill.")

In [37]:
docs.append("Jill is not as hairy as Harry.")

In [38]:
doc_tokens = []

In [39]:
for doc in docs:
    doc_tokens += [sorted(tokenizer.tokenize(doc.lower()))]

In [43]:
len(doc_tokens[0])

17

In [44]:
all_doc_tokens = sum(doc_tokens, [])

In [45]:
len(all_doc_tokens)

33

In [46]:
lexicon = sorted(set(all_doc_tokens))

In [47]:
len(lexicon)

18

In [48]:
lexicon

[',',
 '.',
 'and',
 'as',
 'faster',
 'get',
 'got',
 'hairy',
 'harry',
 'home',
 'is',
 'jill',
 'not',
 'store',
 'than',
 'the',
 'to',
 'would']

Each of your three document vectors will need to have 18 values, even if the document for that vector doesn’t contain all 18 words in your lexicon. Each token is assigned a “slot” in your vectors corresponding to its position in your lexicon. Some of those token counts in the vector will be zeros, which is what you want:

In [49]:
from collections import OrderedDict

In [50]:
zero_vector = OrderedDict((token, 0) for token in lexicon)

In [51]:
zero_vector

OrderedDict([(',', 0),
             ('.', 0),
             ('and', 0),
             ('as', 0),
             ('faster', 0),
             ('get', 0),
             ('got', 0),
             ('hairy', 0),
             ('harry', 0),
             ('home', 0),
             ('is', 0),
             ('jill', 0),
             ('not', 0),
             ('store', 0),
             ('than', 0),
             ('the', 0),
             ('to', 0),
             ('would', 0)])

In [52]:
# Now we’ll make copies of that base vector, update the values of the vector for each document, and store them in an array:

In [53]:
import copy

In [54]:
doc_vectors = []

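In [57]:
for doc in docs:
    vec = copy.copy(zero_vector)  # Fresh copy so each document gets its own vector
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        vec[key] = value / len(tokens)  # Normalized TF: count divided by the number of tokens in this document
    doc_vectors.append(vec)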

copy.copy() creates an independent copy, a separate instance of your zero vector, rather than reusing a reference (pointer) to the original object’s memory location. Otherwise you’d just be overwriting the same zero_vector with new values in each loop, and you wouldn’t have a fresh zero vector on each pass of the loop. A shallow copy is enough here because the values are plain numbers.
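
The difference between a second reference and a copy can be seen in a minimal sketch. The three-token lexicon here is made up for the illustration.

```python
import copy
from collections import OrderedDict

zero_vector = OrderedDict((token, 0) for token in ['faster', 'harry', 'jill'])

alias = zero_vector             # A second name for the same object, not a copy
alias['harry'] = 1              # This also changes zero_vector
print(zero_vector['harry'])

zero_vector['harry'] = 0        # Reset the shared object
fresh = copy.copy(zero_vector)  # Shallow copy: a new OrderedDict with the same items
fresh['harry'] = 1              # Only the copy changes
print(zero_vector['harry'])
```

The first print shows the change leaking into `zero_vector` through the alias; the second shows `zero_vector` untouched after the copy is modified.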

###### The number of slots in each vector is its dimensionality, which we sometimes call capital letter “K.” This number of distinct words is also the vocabulary size of your corpus.

Cosine similarity is merely the cosine of the angle between two vectors (theta), shown in figure 3.3, which can be calculated from the Euclidean dot product using

$$A \cdot B = |A| \, |B| \cos \Theta$$

Cosine similarity is efficient to calculate because the dot product doesn’t require evaluation of any trigonometric functions. In addition, cosine similarity has a convenient range for most machine learning problems: -1 to +1.

In Python this would be

a.dot(b) == np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta)

Solving this relationship for cos(theta), you can derive the cosine similarity using

$$\cos \Theta = \frac{A \cdot B}{|A| \, |B|}$$

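
Dividing the dot product by the product of the two vector norms gives the cosine similarity. With numpy this can be sketched as follows; the helper name `cosine_similarity_np` is hypothetical, and the function assumes neither vector is all zeros.

```python
import numpy as np


def cosine_similarity_np(a, b):
    """Cosine of the angle between two 1-D vectors (assumes neither has zero length)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))


print(cosine_similarity_np([1, 0], [1, 0]))  # Same direction
print(cosine_similarity_np([1, 0], [0, 1]))  # Orthogonal
print(cosine_similarity_np([1, 2], [2, 4]))  # Parallel, different lengths
```

Because the norms divide out the vector lengths, the third pair scores the same as the first (up to floating-point rounding): only direction matters.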
###### Or you can do it in pure Python without numpy, as in the following listing.

###### Compute cosine similarity in Python

In [58]:
import math

In [60]:
def cosine_sim(vec1, vec2):
    """Return the cosine similarity of two dict vectors that share the same keys in the same order."""
    # Convert the dictionaries to lists of values for easier matching.
    vec1 = [val for val in vec1.values()]
    vec2 = [val for val in vec2.values()]
    dot_prod = 0
    for i, v in enumerate(vec1):
        dot_prod += v * vec2[i]
    mag_1 = math.sqrt(sum([x**2 for x in vec1]))
    mag_2 = math.sqrt(sum([x**2 for x in vec2]))
    return dot_prod / (mag_1 * mag_2)  # Raises ZeroDivisionError if either vector is all zeros
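
A vector of all zeros has zero magnitude, so the division in the listing fails for it. The variant below is a sketch, not part of the original listing: it returns 0.0 in that case, which is an assumed convention, and the name `cosine_sim_safe` and the small example vectors are made up.

```python
import math


def cosine_sim_safe(vec1, vec2):
    """Cosine similarity of two dict vectors with the same keys in the same order.

    Returns 0.0 when either vector has zero magnitude (an assumed convention).
    """
    values1 = list(vec1.values())
    values2 = list(vec2.values())
    dot_prod = sum(v1 * v2 for v1, v2 in zip(values1, values2))
    mag_1 = math.sqrt(sum(x ** 2 for x in values1))
    mag_2 = math.sqrt(sum(x ** 2 for x in values2))
    if mag_1 == 0 or mag_2 == 0:
        return 0.0
    return dot_prod / (mag_1 * mag_2)


doc_a = {'faster': 3, 'harry': 2, 'jill': 0}
doc_b = {'faster': 1, 'harry': 1, 'jill': 1}
empty = {'faster': 0, 'harry': 0, 'jill': 0}

print(round(cosine_sim_safe(doc_a, doc_b), 4))
print(cosine_sim_safe(doc_a, empty))
```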

#### Zipf’s Law

In [61]:
nltk.download('brown')  #The Brown corpus is about 3MB.


True

In [62]:
from nltk.corpus import brown

In [63]:
brown.words()[:10]  # words() is a built-in method of the NLTK corpus object that returns the tokenized corpus as a sequence of strs.

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [64]:
brown.tagged_words()[:5]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL')]

In [65]:
len(brown.words())

1161192

In [66]:
from collections import Counter

In [67]:
puncs = set((',', '.', '--', '-', '!', '?', ':', ';', '``', "''", '(', ')', '[', ']'))

In [68]:
word_list = (x.lower() for x in brown.words() if x not in puncs)

In [69]:
token_counts = Counter(word_list)

In [70]:
token_counts.most_common(20)

[('the', 69971),
 ('of', 36412),
 ('and', 28853),
 ('to', 26158),
 ('a', 23195),
 ('in', 21337),
 ('that', 10594),
 ('is', 10109),
 ('was', 9815),
 ('he', 9548),
 ('for', 9489),
 ('it', 8760),
 ('with', 7289),
 ('as', 7253),
 ('his', 6996),
 ('on', 6741),
 ('be', 6377),
 ('at', 5372),
 ('by', 5306),
 ('i', 5164)]

In [71]:
# check https://github.com/totalgood/nlpia/blob/master/src/nlpia/book/examples/ch03_zipf.py
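
Zipf’s Law says that the frequency of a word is roughly inversely proportional to its rank in the frequency table, so the second most common word should appear about half as often as the first, the third about a third as often, and so on. As a rough check, the sketch below compares that prediction with the top six counts copied from the `most_common(20)` output above. Using the count of the top word as the proportionality constant is a simplifying assumption.

```python
# Top six counts copied from the token_counts.most_common(20) output above.
top_counts = [('the', 69971), ('of', 36412), ('and', 28853),
              ('to', 26158), ('a', 23195), ('in', 21337)]

most_frequent = top_counts[0][1]

predictions = []
for rank, (word, count) in enumerate(top_counts, start=1):
    predicted = most_frequent / rank  # Zipf's Law: frequency is roughly proportional to 1 / rank
    predictions.append((word, count, predicted))
    print(f'{rank} {word:<4} observed={count:>6} predicted={predicted:>8.1f}')
```

In this sample the prediction for “of” lands within about 4% of the observed count, while later ranks drift further, so the law is best read as a rough trend rather than an exact rule.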

##### Topic Modeling

Let’s return to the Kite example from Wikipedia and grab another section (the History section); say it’s the second document in the Kite corpus:

In [75]:
kite_history = """Kites were invented in China, where materials ideal for kite building were readily available: 
silk fabric for sail material; fine, high-tensile-strength silk for flying line; and resilient bamboo for a strong, 
lightweight framework.
The kite has been claimed as the invention of the 5th-century BC Chinese philosophers Mozi (also Mo Di) and Lu Ban 
(also Gongshu Ban). By 549 AD paper kites were certainly being flown, as it was recorded that in that year a paper kite
was used as a message for a rescue mission. Ancient and medieval Chinese sources describe kites being used for measuring 
distances, testing the wind, lifting men, signaling, and communication for military operations. The earliest known Chinese 
kites were flat (not bowed) and often rectangular. Later, tailless kites incorporated a stabilizing bowline. Kites were 
decorated with mythological motifs and legendary figures; some were fitted with strings and whistles to make musical sounds 
while flying. From China, kites were introduced to Cambodia, Thailand, India, Japan, Korea and the western world.
After its introduction into India, the kite further evolved into the fighter kite, known as the patang in India, where
thousands are flown every year on festivals such as Makar Sankranti. Kites were known throughout Polynesia, as far as 
New Zealand, with the assumption being that the knowledge diffused from China along with the people. Anthropomorphic kites 
made from cloth and wood were used in religious ceremonies to send prayers to the gods. Polynesian kite traditions are used
by anthropologists get an idea of early “primitive” Asian traditions that are believed to have at one time existed in Asia."""

In [73]:
kite_intro = kite_text1.lower()

In [74]:
intro_tokens = tokenizer.tokenize(kite_intro)

In [76]:
kite_history = kite_history.lower()

In [77]:
history_tokens = tokenizer.tokenize(kite_history)

In [78]:
intro_total = len(intro_tokens)

In [79]:
intro_total

361

In [80]:
history_total = len(history_tokens)

In [81]:
history_total

295

Now, with a couple of tokenized kite documents in hand, let’s look at the term frequency of “kite” in each document. We’ll store the TFs we find in two dictionaries, one for each document:

In [83]:
intro_tf = {}

In [84]:
history_tf = {}

In [85]:
intro_counts = Counter(intro_tokens)

In [86]:
intro_tf['kite'] = intro_counts['kite'] / intro_total

In [87]:
history_counts = Counter(history_tokens)

In [88]:
history_tf['kite'] = history_counts['kite'] / history_total

In [89]:
'Term Frequency of "kite" in intro is: {:.4f}'.format(intro_tf['kite'])

'Term Frequency of "kite" in intro is: 0.0388'

In [90]:
'Term Frequency of "kite" in history is: {:.4f}'.format(history_tf['kite'])

'Term Frequency of "kite" in history is: 0.0203'

In [91]:
intro_tf['and'] = intro_counts['and'] / intro_total

In [92]:
history_tf['and'] = history_counts['and'] / history_total

In [93]:
print('Term Frequency of "and" in intro is: {:.4f}'.format(intro_tf['and']))

Term Frequency of "and" in intro is: 0.0277


In [94]:
print('Term Frequency of "and" in history is: {:.4f}'.format(history_tf['and']))

Term Frequency of "and" in history is: 0.0305


A term’s IDF is merely the ratio of the total number of documents to the number of documents the term appears in. In the case of “and” and “kite” in this current example, the answer is the same for both:

- 2 total documents / 2 documents contain “and” = 2/2 = 1
- 2 total documents / 2 documents contain “kite” = 2/2 = 1

Not very interesting. So let’s look at another word, “China.”

- 2 total documents / 1 document contains “China” = 2/1 = 2

##### Okay, that’s something different. Let’s use this “rarity” measure to weight the term frequencies:

In [109]:
num_docs_containing_and = 0

In [110]:
for doc in [intro_tokens, history_tokens]:
    if 'and' in doc:
        num_docs_containing_and += 1

In [111]:
num_docs_containing_kite = 0
for doc in [intro_tokens, history_tokens]:
    if 'kite' in doc:
        num_docs_containing_kite += 1

In [112]:
#the TF of “China” in the two documents
intro_tf['china'] = intro_counts['china'] / intro_total 
history_tf['china'] = history_counts['china'] / history_total

In [113]:
num_docs_containing_china = 0
for doc in [intro_tokens, history_tokens]:
    if 'china' in doc:
        num_docs_containing_china += 1

Now let’s calculate the IDF for all three terms. You’ll store the IDFs in dictionaries per document like you did with TF:

In [114]:
num_docs = 2
intro_idf = {}
history_idf = {}

In [116]:
intro_idf['and'] = num_docs / num_docs_containing_and
history_idf['and'] = num_docs / num_docs_containing_and

intro_idf['kite'] = num_docs / num_docs_containing_kite
history_idf['kite'] = num_docs / num_docs_containing_kite

intro_idf['china'] = num_docs / num_docs_containing_china
history_idf['china'] = num_docs / num_docs_containing_china

###### And then for the intro document we find:

In [117]:
intro_tfidf = {}

intro_tfidf['and'] = intro_tf['and'] * intro_idf['and']
intro_tfidf['kite'] = intro_tf['kite'] * intro_idf['kite']
intro_tfidf['china'] = intro_tf['china'] * intro_idf['china']

###### And then for the history document:

In [118]:
history_tfidf = {}

history_tfidf['and'] = history_tf['and'] * history_idf['and']
history_tfidf['kite'] = history_tf['kite'] * history_idf['kite']
history_tfidf['china'] = history_tf['china'] * history_idf['china']
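
The per-term bookkeeping above can be folded into three small functions. This is a sketch using the raw (unscaled) IDF ratio from this section; the function names are hypothetical, and the two tiny token lists are made-up stand-ins for `intro_tokens` and `history_tokens`.

```python
from collections import Counter


def tf(term, tokens):
    """Term frequency: occurrences of term divided by the number of tokens."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, tokenized_docs):
    """Raw inverse document frequency; 0 if no document contains the term."""
    num_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    return len(tokenized_docs) / num_containing if num_containing else 0


def tfidf(term, tokens, tokenized_docs):
    """TF-IDF of a term in one document, relative to the whole corpus."""
    return tf(term, tokens) * idf(term, tokenized_docs)


# Two tiny made-up documents, already tokenized and lowercased.
doc1 = ['a', 'kite', 'is', 'a', 'craft', 'and', 'a', 'toy']
doc2 = ['kites', 'and', 'the', 'kite', 'came', 'from', 'china']
corpus = [doc1, doc2]

for term in ('and', 'kite', 'china'):
    print(term, round(tfidf(term, doc1, corpus), 4), round(tfidf(term, doc2, corpus), 4))
```

Only “china” gets a boost in the second document, because it is the only one of the three terms that does not appear in every document.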

##### Return of Zipf

Let’s say, though, you have a corpus of 1 million documents (maybe you’re baby-Google), someone searches for the word “cat,” and in your 1 million documents you have exactly 1 document that contains the word “cat.” The raw IDF of this is

1,000,000 / 1 = 1,000,000

Let’s imagine you have 10 documents with the word “dog” in them. Your IDF for “dog” is

1,000,000 / 10 = 100,000

That’s a big difference. So Zipf’s Law suggests that you scale all your word frequencies (and document frequencies) with the log() function, the inverse of exp(). This ensures that words such as “cat” and “dog,” which have similar counts, aren’t exponentially different in frequency. And this distribution of word frequencies will ensure that your TF-IDF scores are more uniformly distributed. So you should redefine IDF to be the log of the inverse of the probability of that word occurring in one of your documents, that is, the log of the ratio computed above. You’ll want to take the log of the term frequency as well.

And then finally, for a given term, t, in a given document, d, in a corpus, D, you get:

$$\mathrm{tf}(t, d) = \frac{\mathrm{count}(t)}{\mathrm{count}(d)}$$

$$\mathrm{idf}(t, D) = \log \frac{\text{number of documents}}{\text{number of documents containing } t}$$

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$
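
To see what the log does to the “cat” and “dog” example, the sketch below computes both the raw ratio and its logarithm. The base-10 logarithm is an assumption chosen only because it makes the numbers easy to read; any base changes the scale but not the ordering.

```python
import math

num_docs = 1_000_000
docs_containing = {'cat': 1, 'dog': 10}

log_idfs = {}
for term, doc_count in docs_containing.items():
    raw_idf = num_docs / doc_count
    log_idfs[term] = math.log10(raw_idf)  # Base 10 is used here only for readability
    print(f'{term}: raw IDF = {raw_idf:,.0f}, log10 IDF = {log_idfs[term]:.1f}')
```

The raw values differ by 900,000, while the log-scaled values differ by 1.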

In [None]:
from math import log

# The four counts used below are placeholders; supply them from your own corpus.
log_tf = log(term_occurences_in_doc) - log(num_terms_in_doc)  # Log probability of a particular term in a particular document

log_idf = log(log(total_num_docs) - log(num_docs_containing_term))  # Log of the log-scaled IDF; the inner logs linearize the IDF (compensate for Zipf’s Law). Undefined when every document contains the term.

log_tf_idf = log_tf + log_idf  # Log TF-IDF is the log of the product of TF and IDF, or the sum of the logs of TF and IDF.


##### Relevance ranking

As you saw earlier, we can easily compare two vectors and get their similarity, but we have since learned that merely counting words isn’t as descriptive as using their TF-IDF. Therefore, in each document vector let’s replace each word’s word_count with the word’s TF-IDF. Now our vectors will more thoroughly reflect the meaning, or topic, of the document, as shown in this Harry example:

In [124]:
document_tfidf_vectors = []

We need to copy the zero_vector to create a new, separate object. Otherwise we’d end up overwriting the same object/vector each time through the loop.

In [125]:
for doc in docs:
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        # Count the documents containing this token, using the lowercased token lists in doc_tokens.
        docs_containing_key = 0
        for _doc_tokens in doc_tokens:
            if key in _doc_tokens:
                docs_containing_key += 1
        tf = value / len(tokens)
        if docs_containing_key:
            idf = len(docs) / docs_containing_key
        else:
            idf = 0
        vec[key] = tf * idf
    document_tfidf_vectors.append(vec)

Two vectors are considered similar if their cosine similarity is high, so you can find the vectors most similar to a given vector by maximizing:

$$\cos \Theta = \frac{A \cdot B}{|A| \, |B|}$$

##### The last step is then to find the documents whose vectors have the highest cosine similarities to the query and return those as the search results. Take your three documents about Harry and make the query “How long does it take to get to the store?” as shown here:

In [126]:
query = "How long does it take to get to the store?"

In [128]:
query_vec = copy.copy(zero_vector) # copy.copy() ensures you’re dealing with separate objects, not multiple references to the same object.

In [129]:
tokens = tokenizer.tokenize(query.lower())
token_counts = Counter(tokens)

In [133]:
for key, value in token_counts.items():
    # Count the documents containing this token, using the lowercased token lists in doc_tokens.
    docs_containing_key = 0
    for _doc_tokens in doc_tokens:
        if key in _doc_tokens:
            docs_containing_key += 1
    if docs_containing_key == 0:  # You didn’t find that token in the lexicon, so go to the next key.
        continue
    tf = value / len(tokens)
    idf = len(docs) / docs_containing_key
    query_vec[key] = tf * idf

In [134]:
cosine_sim(query_vec, document_tfidf_vectors[0])
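
A compact end-to-end version of this search can be sketched as follows. It is a simplified stand-in rather than the notebook’s exact code: a regular expression replaces the Treebank tokenizer (so punctuation is dropped), vectors are plain lists ordered by the lexicon, and a zero-magnitude vector is given a similarity of 0.0. The names `tokenize` and `tfidf_vector` are hypothetical.

```python
import math
import re
from collections import Counter

docs = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]


def tokenize(text):
    """Lowercase and keep runs of word characters (a stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())


tokenized_docs = [tokenize(doc) for doc in docs]
lexicon = sorted(set(token for tokens in tokenized_docs for token in tokens))


def tfidf_vector(tokens):
    """Return a list of TF-IDF scores, one per lexicon entry."""
    counts = Counter(tokens)
    vector = []
    for term in lexicon:
        # Every lexicon term occurs in at least one document, so this is never zero.
        docs_containing = sum(1 for doc in tokenized_docs if term in doc)
        tf = counts[term] / len(tokens)
        idf = len(tokenized_docs) / docs_containing
        vector.append(tf * idf)
    return vector


def cosine_sim(vec1, vec2):
    """Cosine similarity of two equal-length lists; 0.0 if either is all zeros."""
    dot_prod = sum(a * b for a, b in zip(vec1, vec2))
    mag = math.sqrt(sum(a * a for a in vec1)) * math.sqrt(sum(b * b for b in vec2))
    return dot_prod / mag if mag else 0.0


query_vec = tfidf_vector(tokenize("How long does it take to get to the store?"))
scores = [cosine_sim(query_vec, tfidf_vector(tokens)) for tokens in tokenized_docs]

# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

for i in ranking:
    print(round(scores[i], 4), docs[i])
```

Only the first document shares any tokens with the query (“to,” “get,” “the,” and “store”), so it ranks first and the other two score zero.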

##### Tools

In [136]:
#!pip install scipy
#!pip install scikit-learn




##### The sklearn TF-IDF class is a model with .fit() and .transform() methods that comply with the sklearn API for all machine learning models:

In [137]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [138]:
corpus = docs

In [139]:
vectorizer = TfidfVectorizer(min_df=1)

In [140]:
model = vectorizer.fit_transform(corpus)

The TfidfVectorizer model produces a SciPy sparse matrix, because a TF-IDF matrix usually contains mostly zeros, since most documents use a small portion of the total words in the vocabulary.

In [141]:
print(model.todense().round(2))  # .todense() converts the sparse matrix into a regular (dense) numpy matrix

[[0.16 0.   0.48 0.21 0.21 0.   0.25 0.21 0.   0.   0.   0.21 0.   0.64
  0.21 0.21]
 [0.37 0.   0.37 0.   0.   0.37 0.29 0.   0.37 0.37 0.   0.   0.49 0.
  0.   0.  ]
 [0.   0.75 0.   0.   0.   0.29 0.22 0.   0.29 0.29 0.38 0.   0.   0.
  0.   0.  ]]

###### The .todense() method converts a sparse matrix back into a regular numpy matrix (filling in the gaps with zeros) for your viewing pleasure.