In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

In [2]:
np.set_printoptions(edgeitems=30, linewidth=100000, precision=3)

In [3]:
vectorizer = CountVectorizer()

This vectorizer lowercases a string and tokenizes it into individual words. By default, a word is a run of at least 2 alphanumeric characters, so single-character words are dropped. As can be seen in the example below, the word *a* is dropped.

In [4]:
tokenize = vectorizer.build_analyzer()
tokenize("This is a text document to analyze")

['this', 'is', 'text', 'document', 'to', 'analyze']
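The tokenization rule can be approximated with the standard `re` module. This is a minimal sketch, assuming the default `token_pattern` is `(?u)\b\w\w+\b` and that the text is lowercased first. The helper name `simple_tokenize` is hypothetical and is not part of `sklearn`.

```python
import re

# Assumed to mirror CountVectorizer's default token_pattern:
# a run of two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def simple_tokenize(text):
    """Lowercase the text and keep tokens of at least two word characters."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())

tokens = simple_tokenize("This is a text document to analyze")
print(tokens)
```

The single-character word *a* cannot match a pattern that requires two or more characters, which is why it disappears.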

In [5]:
corpus = [
    "This is the first document",
    "This is the second second document",
    "And the third one",
    "Is this the first document"
]

In [6]:
D = vectorizer.fit_transform(corpus)
D

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

Each row of $D$ is a document-to-token mapping and each column is a token-to-document mapping, i.e., row $i$ holds the counts of all the tokens that appear in document $i$ and column $j$ holds the counts of word $j$ across all the documents. In other words, each cell $d_{i,j}$ is the frequency of word $j$ in document $i$.
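To make the meaning of each cell concrete, the count matrix can be rebuilt by hand with the standard library. This is only a sketch: it assumes the default tokenization rule (lowercase, tokens of two or more word characters) and an alphabetically sorted vocabulary, and the names `simple_tokenize`, `vocab` and `counts` are illustrative.

```python
import re
from collections import Counter

corpus = [
    "This is the first document",
    "This is the second second document",
    "And the third one",
    "Is this the first document",
]

def simple_tokenize(text):
    # Assumed default rule: lowercase, keep tokens of 2+ word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokenized = [simple_tokenize(doc) for doc in corpus]

# Column j corresponds to the j-th token in alphabetical order.
vocab = sorted({tok for doc in tokenized for tok in doc})

# Cell (i, j) is the number of times vocab[j] occurs in document i.
counts = []
for doc in tokenized:
    freq = Counter(doc)
    counts.append([freq[tok] for tok in vocab])

print(vocab)
for row in counts:
    print(row)
```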

#### Note
It is tempting to draw parallels between each row of $D$ and one-hot vectors. But remember that a one-hot vector represents a **single** word, whereas each row of $D$ represents an entire document.
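One way to relate the two representations is to view a document row as the sum of the one-hot vectors of its tokens. The sketch below illustrates this for the second document. The `one_hot` helper and the hard-coded vocabulary are assumptions made for the example.

```python
vocab = ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

def one_hot(token):
    """Return a vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

# Tokens of "This is the second second document".
doc_tokens = ['this', 'is', 'the', 'second', 'second', 'document']

# Summing the one-hot vectors column by column yields the count vector.
doc_vec = [sum(col) for col in zip(*(one_hot(tok) for tok in doc_tokens))]
print(doc_vec)
```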

In [15]:
D = D.toarray()
D

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

Given that the indices represent individual tokens, the `CountVectorizer` object has methods to get the tokens <--> idx mapping as shown below. 

The `get_feature_names` method returns the list of tokens in index order, i.e., the token with idx 0 is the first element of the output and so on. It is really an `idx_to_token` mapping. Note that this method was deprecated in scikit-learn 1.0 and removed in 1.2; its replacement, `get_feature_names_out`, returns a NumPy array instead of a list.

The `vocabulary_` attribute is an actual map, with the token as the key and its idx as the value. This is the `token_to_idx` mapping.

In [8]:
vectorizer.get_feature_names()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [9]:
idx_to_token = vectorizer.get_feature_names()
for idx, token in enumerate(idx_to_token):
    print(f"[{idx}] = {token}")

[0] = and
[1] = document
[2] = first
[3] = is
[4] = one
[5] = second
[6] = the
[7] = third
[8] = this


In [10]:
vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [11]:
token_to_idx = vectorizer.vocabulary_
for token, idx in token_to_idx.items():
    print(f"[{token}] = {idx}")

[this] = 8
[is] = 3
[the] = 6
[first] = 2
[document] = 1
[second] = 5
[and] = 0
[third] = 7
[one] = 4


In [12]:
def bag_of_words(docvec):
    """
    Takes a vectorized document, i.e., a row in D, and outputs its contents in the
    form of a bag of words. A bag of words is a list with the tokens that appear in
    the document in index order. Words that appear multiple times in the doc are
    repeated in the output.
    """
    tokens = []
    for idx, freq in enumerate(docvec):
        if freq > 0:
            tokens += [idx_to_token[idx]] * freq
    return tokens

In [16]:
print(bag_of_words(D[0]))

['document', 'first', 'is', 'the', 'this']


In [17]:
print(bag_of_words(D[1]))

['document', 'is', 'second', 'second', 'the', 'this']


The vectorizer's vocabulary was built by `fit_transform`. The `transform` method takes in **new** documents and vectorizes them just like the rows of $D$, using that fixed vocabulary. It drops any words that were not present in the original corpus. Think of these docs as queries against the corpus.

In [18]:
newdocs = vectorizer.transform([
    "This is a new document and a new example",
    "And another document here",
    "Something completely new"
]).toarray()
newdocs

array([[1, 1, 0, 1, 0, 0, 0, 0, 1],
       [1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [19]:
for docvec in newdocs:
    print(docvec, bag_of_words(docvec))

[1 1 0 1 0 0 0 0 1] ['and', 'document', 'is', 'this']
[1 1 0 0 0 0 0 0 0] ['and', 'document']
[0 0 0 0 0 0 0 0 0] []
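The dropping of unseen words can be mimicked with a fixed vocabulary dictionary. This is a rough sketch rather than the library's actual implementation: `simple_transform` is a hypothetical helper, and the tokenization rule is assumed to be the default one.

```python
import re

# The token-to-index map learned from the original corpus.
vocabulary = {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}

def simple_transform(doc):
    """Count known tokens; collect tokens that are not in the vocabulary."""
    vec = [0] * len(vocabulary)
    dropped = []
    for tok in re.findall(r"(?u)\b\w\w+\b", doc.lower()):
        if tok in vocabulary:
            vec[vocabulary[tok]] += 1
        else:
            dropped.append(tok)
    return vec, dropped

vec, dropped = simple_transform("This is a new document and a new example")
print(vec)      # counts for the known tokens only
print(dropped)  # tokens that have no column in the matrix
```

The word *a* does not show up in `dropped` because it never becomes a token in the first place.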


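The default `CountVectorizer` tokenizes with the regex `(?u)\b\w\w+\b` and only produces single words (unigrams). We can build ourselves a bi-gram vectorizer by setting `ngram_range=(1, 2)`, which emits both unigrams and bi-grams. The custom `token_pattern` used below additionally keeps single-character words.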

In [20]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b", min_df=1)

In [21]:
tokenize = bigram_vectorizer.build_analyzer()
tokenize("Bi-grams are cool!")

['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
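The analyzer output above can be reproduced with a few lines of plain Python. This is a sketch under the assumption that n-grams are formed by joining consecutive tokens with a space, unigrams first. The function `word_ngrams` is illustrative and is not the library's internal code.

```python
import re

def word_ngrams(text, ngram_range=(1, 2), token_pattern=r"\b\w+\b"):
    """Tokenize the lowercased text and emit all n-grams in the given range."""
    tokens = re.findall(token_pattern, text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("Bi-grams are cool!")
print(grams)
```

The hyphen and the exclamation mark act as separators, so `Bi-grams` splits into the two tokens `bi` and `grams`.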

In [23]:
bigram_vectorizer.fit_transform(corpus).toarray()

array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]])

# TF-IDF

In [24]:
D = np.array([[3, 0, 1],
              [2, 0, 0],
              [3, 0, 0],
              [4, 0, 0],
              [3, 2, 0],
              [3, 0, 2]])

## TF-IDF

### Token Frequency
Token (or term) frequency is the number of times token $j$ appears in document $i$. It is just the $j_{th}$ element of the $i_{th}$ document vector, so we can read this value straight off the main corpus matrix $D$: $d_{i, j}$.

### Inverse Document Frequency
The document frequency of token $j$ is the number of documents that contain it: $f_j = \sum_i \mathbb{1}_{d_{i, j} > 0}$. This is just the mathematical way of saying count the number of non-zero cells in the $j_{th}$ column. The inverse document frequency is the ratio of the total number of documents to the document frequency, $\frac{n}{f_j}$, where $n$ is the total number of documents. When the vocabulary is built from the corpus, every token appears in at least one document, so $1 \le f_j \le n$ and the ratio lies in $[1, n]$. A token that appears in every document gets the lowest value, 1, and a rarer token gets a higher one. This is what information retrieval needs: very frequent tokens say little about the particular document they appear in, so they should carry less weight. We usually take the log of this ratio so that the weight of rare tokens grows slowly rather than linearly.
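As a small illustration, the document frequency and the log of the ratio $\frac{n}{f_j}$ can be computed for the count matrix defined at the start of this section. The sketch uses plain Python lists, and the variable names are chosen just for this example.

```python
import math

D = [[3, 0, 1],
     [2, 0, 0],
     [3, 0, 0],
     [4, 0, 0],
     [3, 2, 0],
     [3, 0, 2]]

n = len(D)  # total number of documents

# f[j] counts the non-zero cells in column j.
f = [sum(1 for row in D if row[j] > 0) for j in range(len(D[0]))]

ratio = [n / fj for fj in f]             # inverse document frequency
log_ratio = [math.log(r) for r in ratio]  # natural log dampens large ratios

print(f)
print(ratio)
print([round(v, 3) for v in log_ratio])
```

Token 0 appears in every document, so its ratio is 1 and its log is 0. Token 1 appears in only one document and gets the largest value.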

### Calculating tf-idf for document $i$ and token $j$
The first step in calculating the tf-idf of token $j$ in document $i$ is:
$$
x'_{i, j} = d_{i, j} \left( \log \frac{n}{f_j} + 1 \right)
$$

After doing this for each token in the document, we then normalize the document's tf-idf vector.

$$
x_{i,j} = \frac{x'_{i,j}}{\left \| \mathbf x'_i \right \|} 
$$

Let's calculate the tf-idf for all the tokens in the first document.
$$
x'_{0,0} = d_{0,0} \left( \log \frac{n}{f_0} + 1 \right) \\
x'_{0,1} = d_{0,1} \left( \log \frac{n}{f_1} + 1 \right) \\
x'_{0,2} = d_{0,2} \left( \log \frac{n}{f_2} + 1 \right)
$$

In [25]:
d = D[0,:]
d

array([3, 0, 1])

In [26]:
f0 = np.sum(D[:,0] > 0)
f1 = np.sum(D[:,1] > 0)
f2 = np.sum(D[:,2] > 0)
f = np.array([f0, f1, f2])
f

array([6, 1, 2])

In [27]:
n = D.shape[0]
n

6

In [28]:
x0_ = d * (np.log(n/f)  + 1)
x0_

array([3.   , 0.   , 2.099])

The norm of this vector is $\left \| \mathbf x'_0 \right \| = \sqrt{{x'}_{0,0}^2 + {x'}_{0,1}^2 + {x'}_{0,2}^2}$.

In [29]:
x0_norm = np.linalg.norm(x0_)
x0_norm

3.6611710610334507

In [30]:
# The above function calculates the L2 norm
np.sqrt(x0_[0]**2 + x0_[1]**2 + x0_[2]**2)

3.6611710610334507

The normalized tf-idf vector for the first document is:
$$
x_0 = \frac{x'_0}{\left \| \mathbf x'_0 \right \|}
$$

In [31]:
x0 = x0_ / x0_norm
x0

array([0.819, 0.   , 0.573])

`sklearn` has a class that can do all of this in one shot.

In [32]:
from sklearn.feature_extraction.text import TfidfTransformer

In [33]:
xformer = TfidfTransformer(smooth_idf=False)
X = xformer.fit_transform(D).toarray()
X

array([[0.819, 0.   , 0.573],
       [1.   , 0.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [0.473, 0.881, 0.   ],
       [0.581, 0.   , 0.814]])
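The same matrix can be reproduced by vectorizing the manual steps over all rows. This is a sketch that assumes the unsmoothed formula $d_{i,j} \left( \log \frac{n}{f_j} + 1 \right)$ followed by L2 normalization of each row. The name `X_manual` is illustrative.

```python
import numpy as np

D = np.array([[3, 0, 1],
              [2, 0, 0],
              [3, 0, 0],
              [4, 0, 0],
              [3, 2, 0],
              [3, 0, 2]])

n = D.shape[0]                # number of documents
f = (D > 0).sum(axis=0)       # document frequency of each token
idf = np.log(n / f) + 1       # unsmoothed idf, one value per column

X_raw = D * idf               # broadcast idf across every row
# Divide each row by its own L2 norm.
X_manual = X_raw / np.linalg.norm(X_raw, axis=1, keepdims=True)

print(np.round(X_manual, 3))
```

Rows that contain only token 0 normalize to exactly `[1, 0, 0]`, regardless of how often the token occurs in them.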

So far we have seen the following pipeline:
  1. Start with a text corpus, typically a list of strings.
  2. Convert this to a list of document vectors by passing them through the `CountVectorizer`. We now have $D$.
  3. Convert this to a matrix of tf-idfs by passing $D$ through a `TfidfTransformer`.

Let's run this pipeline on our original corpus.

In [40]:
%reset -f

In [43]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [44]:
corpus = [
    "This is the first document",
    "This is the second second document",
    "And the third one",
    "Is this the first document"
]

In [52]:
count_vectorizer = CountVectorizer()
D = count_vectorizer.fit_transform(corpus)
tfidf_transformer = TfidfTransformer(smooth_idf=False)
X = tfidf_transformer.fit_transform(D).toarray()
print("Documents\n", D.toarray())
print("\nTF-IDF\n", X)

Documents
 [[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]

TF-IDF
 [[0.    0.433 0.569 0.433 0.    0.    0.336 0.    0.433]
 [0.    0.24  0.    0.24  0.    0.89  0.186 0.    0.24 ]
 [0.561 0.    0.    0.    0.561 0.    0.235 0.561 0.   ]
 [0.    0.433 0.569 0.433 0.    0.    0.336 0.    0.433]]


`sklearn` has a convenience class, `TfidfVectorizer`, that does all of this in one step: we just pass it the text corpus and get the tf-idf matrix out. It has the usual methods that we find on `CountVectorizer`.

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [54]:
tfidf_vectorizer = TfidfVectorizer(smooth_idf=False)
X = tfidf_vectorizer.fit_transform(corpus).toarray()
X

array([[0.   , 0.433, 0.569, 0.433, 0.   , 0.   , 0.336, 0.   , 0.433],
       [0.   , 0.24 , 0.   , 0.24 , 0.   , 0.89 , 0.186, 0.   , 0.24 ],
       [0.561, 0.   , 0.   , 0.   , 0.561, 0.   , 0.235, 0.561, 0.   ],
       [0.   , 0.433, 0.569, 0.433, 0.   , 0.   , 0.336, 0.   , 0.433]])

In [38]:
tfidf_vectorizer.get_feature_names()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [39]:
tfidf_vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [56]:
count_vectorizer.transform(["This is a new document and a new example"]).toarray()[0]

array([1, 1, 0, 1, 0, 0, 0, 0, 1])

In [57]:
tfidf_vectorizer.transform(["This is a new document and a new example"]).toarray()[0]

array([0.731, 0.394, 0.   , 0.394, 0.   , 0.   , 0.   , 0.   , 0.394])
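The tf-idf vector of the new document can be checked by hand. The sketch below assumes that `transform` reuses the document frequencies learned from the original four-document corpus and applies the same unsmoothed formula and L2 normalization. All variable names are illustrative.

```python
import math

# Count matrix of the original corpus (rows are documents).
counts = [[0, 1, 1, 1, 0, 0, 1, 0, 1],
          [0, 1, 0, 1, 0, 2, 1, 0, 1],
          [1, 0, 0, 0, 1, 0, 1, 1, 0],
          [0, 1, 1, 1, 0, 0, 1, 0, 1]]

n = len(counts)
n_tokens = len(counts[0])

# idf is computed once from the fitted corpus, not from the query.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_tokens)]
idf = [math.log(n / d) + 1 for d in df]

# Count vector of "This is a new document and a new example".
query = [1, 1, 0, 1, 0, 0, 0, 0, 1]

raw = [c * w for c, w in zip(query, idf)]
norm = math.sqrt(sum(v * v for v in raw))
tfidf = [round(v / norm, 3) for v in raw]
print(tfidf)
```

The token `and` occurs in only one corpus document, so it receives the largest weight in the query vector even though every known token appears once in the query.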