# Bag of Words and TF-IDF
Below, we'll look at three useful methods of vectorizing text.
- `CountVectorizer` - Bag of Words
- `TfidfTransformer` - TF-IDF values
- `TfidfVectorizer` - Bag of Words AND TF-IDF values

Let's first use an example from earlier and apply the text processing steps we saw in this lesson.

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/zacks/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/zacks/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/zacks/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
corpus = ["Little House on the Prairie", 
           "Mary had a Little Lamb", 
           "The Silence of the Lambs", 
           "Twinkle Twinkle Little Star"]

In [3]:
stop_words = stopwords.words("english")
stemmer = PorterStemmer()

Use the skills you learned so far to create a function `tokenize` that takes in a string of text and applies the following:
- case normalization (convert to all lowercase)
- punctuation removal
- tokenization, stop word removal, and stemming (with `PorterStemmer`) using `nltk`

Feel free to refer back to previous sections to complete these steps!

In [4]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # remove stop words, then stem the remaining tokens
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]

    return tokens
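
To see what the first step of `tokenize` does on its own, here is a minimal sketch that uses only the standard library. The helper name `normalize` and the sample string are made up for illustration, and `str.split` stands in for `word_tokenize` so that no NLTK data is needed.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

sample = "Mary had a Little Lamb!"  # hypothetical input with punctuation
normalized = normalize(sample)

print(repr(normalized))    # the "!" has become a trailing space
print(normalized.split())  # whitespace split as a stand-in for word_tokenize
```

Stop word removal and stemming then operate on this list of lowercase tokens.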

# `CountVectorizer` (Bag of Words)

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize count vectorizer object
countVectorizer = CountVectorizer(tokenizer=tokenize)

In [6]:
# get counts of each token (word) in text data
X = countVectorizer.fit_transform(corpus)

In [7]:
# convert Compressed Sparse Row matrix to numpy array to view
X.toarray()

array([[1, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 1, 2]])

In [8]:
# Get output feature names for transformation
countVectorizer.get_feature_names_out()

array(['hous', 'lamb', 'littl', 'mari', 'prairi', 'silenc', 'star',
       'twinkl'], dtype=object)

In [9]:
# View inverse transform
countVectorizer.inverse_transform(X)

[array(['littl', 'hous', 'prairi'], dtype='<U6'),
 array(['littl', 'mari', 'lamb'], dtype='<U6'),
 array(['lamb', 'silenc'], dtype='<U6'),
 array(['littl', 'twinkl', 'star'], dtype='<U6')]

In [10]:
# A mapping of terms to feature indices.
countVectorizer.vocabulary_

{'littl': 2,
 'hous': 0,
 'prairi': 4,
 'mari': 3,
 'lamb': 1,
 'silenc': 5,
 'twinkl': 7,
 'star': 6}
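
The vocabulary and the count matrix can also be rebuilt by hand, which may help clarify what `CountVectorizer` is doing. The sketch below is not scikit-learn's implementation; it assumes the token lists that `tokenize` yields for the four titles (copied from the outputs above) and uses a hypothetical helper `count_vector`.

```python
from collections import Counter

# Stemmed tokens per document, as produced by tokenize() in this notebook
tokenized_docs = [
    ["littl", "hous", "prairi"],
    ["mari", "littl", "lamb"],
    ["silenc", "lamb"],
    ["twinkl", "twinkl", "littl", "star"],
]

# Map each distinct term to a column index, in alphabetical order
all_terms = sorted({term for doc in tokenized_docs for term in doc})
vocabulary = {term: index for index, term in enumerate(all_terms)}

def count_vector(tokens, vocabulary):
    """Return the raw count of every vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]  # dict keeps insertion order

matrix = [count_vector(doc, vocabulary) for doc in tokenized_docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary term, matching the layout of `X.toarray()`.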

In [11]:
import pandas as pd

# Document-Term Matrix
df = pd.DataFrame(index=corpus, columns=countVectorizer.get_feature_names_out(), data=X.toarray())
df

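| | hous | lamb | littl | mari | prairi | silenc | star | twinkl |
|---|---|---|---|---|---|---|---|---|
| Little House on the Prairie | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| Mary had a Little Lamb | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| The Silence of the Lambs | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Twinkle Twinkle Little Star | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 |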


In [12]:
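# Document Frequency: the number of documents that contain each term
# (df.sum() would give total occurrences instead, e.g. 2 for 'twinkl')
document_frequency = pd.DataFrame(index=["Document Frequency"], data=(df > 0).sum().to_dict())
document_frequency

| | hous | lamb | littl | mari | prairi | silenc | star | twinkl |
|---|---|---|---|---|---|---|---|---|
| Document Frequency | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 1 |


In [13]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
pd.concat([df, document_frequency])

| | hous | lamb | littl | mari | prairi | silenc | star | twinkl |
|---|---|---|---|---|---|---|---|---|
| Little House on the Prairie | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| Mary had a Little Lamb | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| The Silence of the Lambs | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| Twinkle Twinkle Little Star | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 |
| Document Frequency | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 1 |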


# Calculate tf-idf manually

> [tf–idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)<br>
> [sklearn.feature_extraction.text.TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)

scikit-learn computes the tf-idf of a term $t$ in a document $d$ as $\text{tf-idf}(t, d) = tf(t, d) \cdot idf(t)$. With `smooth_idf=False`, the idf is $idf(t) = \log \frac{n}{df(t)} + 1$, where $\log$ is the natural logarithm, $n$ is the total number of documents in the document set, and $df(t)$, the document frequency, is the number of documents that contain the term $t$.

Adding 1 to the idf means that terms with zero idf, i.e., terms that occur in every document of the training set, are not ignored entirely. Note that this formula differs from the standard textbook notation, which defines the idf as $idf(t) = \log \frac{n}{df(t) + 1}$. In scikit-learn, $tf(t, d)$ is the raw count of the term, and by default each document's tf-idf vector is afterwards scaled to unit Euclidean (L2) length.
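
As a rough illustration of the idf formula, the sketch below evaluates it for a four-document corpus. The function names `idf_unsmoothed` and `idf_smoothed` are hypothetical helpers, not scikit-learn API. The smoothed variant follows the formula given in the `TfidfTransformer` documentation for `smooth_idf=True`, $idf(t) = \log \frac{1 + n}{1 + df(t)} + 1$, which behaves as if one extra document containing every term had been seen.

```python
from math import log  # natural logarithm

def idf_unsmoothed(n_docs, doc_freq):
    """idf as documented for smooth_idf=False: ln(n / df) + 1."""
    return log(n_docs / doc_freq) + 1

def idf_smoothed(n_docs, doc_freq):
    """idf as documented for smooth_idf=True: ln((1 + n) / (1 + df)) + 1."""
    return log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4
for doc_freq in (1, 2, 3, 4):
    print(doc_freq,
          round(idf_unsmoothed(n_docs, doc_freq), 6),
          round(idf_smoothed(n_docs, doc_freq), 6))
```

A term that appears in all four documents gets an idf of exactly 1 under both variants, so it keeps a nonzero weight.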

Term Frequency, $tf(t, d)$, is the relative frequency of term $t$ within document $d$,

$$\displaystyle tf(t, d) = \frac{f_{t, d}}{\sum_{t' \in d}f_{t', d}}$$

where $f_{t, d}$ is the *raw count* of a term in a document, i.e., the number of times that term $t$ occurs in document $d$. There are various other ways to define term frequency: 
- the raw count itself: $\displaystyle tf(t, d) = f_{t, d}$
- Boolean frequencies: $\displaystyle tf(t, d) = 1$ if $t$ occurs in $d$ and $0$ otherwise
- term frequency adjusted for document length: $\displaystyle tf(t, d) = f_{t, d} \div \textrm{(number of words in d)}$
- logarithmically scaled frequency: $\displaystyle tf(t, d) = \log(1 + f_{t, d})$
- augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most frequent term in the document: $\displaystyle tf(t, d) = 0.5 + 0.5 \cdot \frac{f_{t, d}}{\max\{f_{t', d}:t' \in d\}}$
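
The variants listed above can be compared side by side on a single document. This is a small sketch using the stemmed tokens of "Twinkle Twinkle Little Star"; the function name `tf_variants` and the dictionary keys are chosen here for illustration only.

```python
from collections import Counter
from math import log

tokens = ["twinkl", "twinkl", "littl", "star"]
counts = Counter(tokens)
total_terms = sum(counts.values())  # document length in tokens
max_count = max(counts.values())    # count of the most frequent term

def tf_variants(term):
    """Return several term-frequency definitions for one term of the document."""
    f = counts[term]  # Counter yields 0 for terms that do not occur
    return {
        "raw": f,
        "boolean": 1 if f > 0 else 0,
        "relative": f / total_terms,
        "log": log(1 + f),
        "augmented": 0.5 + 0.5 * f / max_count,
    }

for term in ("twinkl", "littl", "lamb"):
    print(term, tf_variants(term))
```

Note that the augmented form as written assigns 0.5 even to a term that does not occur in the document (`lamb` here), so in practice it is applied only to terms that are present.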

In [14]:
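# tf: divide each row (document) by the total number of terms in that document
df.div(df.sum(axis=1), axis=0)

| | hous | lamb | littl | mari | prairi | silenc | star | twinkl |
|---|---|---|---|---|---|---|---|---|
| Little House on the Prairie | 0.333333 | 0.0 | 0.333333 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 |
| Mary had a Little Lamb | 0.0 | 0.333333 | 0.333333 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 |
| The Silence of the Lambs | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| Twinkle Twinkle Little Star | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.25 | 0.5 |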


The **inverse document frequency** is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is the [logarithmically scaled](https://en.wikipedia.org/wiki/Logarithmic_scale "Logarithmic scale") inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$idf(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

with
- $N$: total number of documents in the corpus, $N = |D|$
- $|\{d \in D : t \in d\}|$: number of documents in which the term $t$ appears (i.e., $tf(t, d) \neq 0$). If the term does not occur in the corpus, this count is zero and the formula divides by zero. It is therefore common to adjust the denominator to $1 + |\{d \in D : t \in d\}|$.
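
The textbook definition, including the division-by-zero case, can be sketched directly from the token sets of the four titles. The function `idf` and its `adjust` flag are hypothetical names for this example, and the natural logarithm is an assumption since the formula above leaves the base open.

```python
from math import log

# Each document as a set of its stemmed terms
docs = [
    {"littl", "hous", "prairi"},
    {"mari", "littl", "lamb"},
    {"silenc", "lamb"},
    {"twinkl", "littl", "star"},
]

def idf(term, docs, adjust=False):
    """Textbook idf: log(N / df), or log(N / (1 + df)) when adjust is True."""
    n_containing = sum(1 for doc in docs if term in doc)
    if adjust:
        return log(len(docs) / (1 + n_containing))
    return log(len(docs) / n_containing)

print(round(idf("littl", docs), 6))   # common term, low idf
print(round(idf("twinkl", docs), 6))  # rare term, high idf

try:
    idf("wolf", docs)                 # term absent from the corpus
except ZeroDivisionError:
    print("unseen term: division by zero without the adjustment")

print(round(idf("wolf", docs, adjust=True), 6))
```

With the adjusted denominator, a term that occurs in every document gets a negative idf, which is one reason implementations differ in how they smooth.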

In [15]:
# idf (scikit-learn uses the natural logarithm, so import log rather than log10)
from math import log

In [16]:
# For term hous
# Document Frequency = 1
# Term Frequency = 1
# For smooth_idf=False: idf(t) = ln [ n / df(t) ] + 1
round(log(4 / 1) + 1, 6)

2.386294
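
The idf is only one ingredient of the final value. As a sketch of the remaining steps, the code below computes the tf-idf row for "Little House on the Prairie" by hand. It assumes what the `TfidfTransformer` documentation describes for `smooth_idf=False` with default settings: raw counts as tf, a natural-log idf plus 1, and L2 normalization of each row. Variable names are illustrative.

```python
from math import log, sqrt

n_docs = 4

# Raw counts in "Little House on the Prairie" and document frequencies in the corpus
raw_counts = {"hous": 1, "littl": 1, "prairi": 1}
doc_freq = {"hous": 1, "littl": 3, "prairi": 1}

# Unnormalized weights: tf(t, d) * (ln(n / df(t)) + 1)
weights = {
    term: raw_counts[term] * (log(n_docs / doc_freq[term]) + 1)
    for term in raw_counts
}

# Scale the row to unit Euclidean length
norm = sqrt(sum(weight ** 2 for weight in weights.values()))
tfidf_row = {term: weight / norm for term, weight in weights.items()}

for term, value in tfidf_row.items():
    print(term, round(value, 6))
```

The rounded values are 0.660648 for `hous` and `prairi` and 0.356496 for `littl`, which is what the first row of the `TfidfTransformer` result should contain.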

# `TfidfTransformer`

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

# initialize tf-idf transformer object
tfidfTransformer = TfidfTransformer(smooth_idf=False)

In [18]:
# use counts from count vectorizer results to compute tf-idf values
tfidf = tfidfTransformer.fit_transform(X)

In [19]:
# convert sparse matrix to numpy array to view
tfidf.toarray()

array([[0.66064766, 0.        , 0.3564959 , 0.        , 0.66064766,
        0.        , 0.        , 0.        ],
       [0.        , 0.52964479, 0.40280852, 0.74647284, 0.        ,
        0.        , 0.        , 0.        ],
       [0.        , 0.57866699, 0.        , 0.        , 0.        ,
        0.81556393, 0.        , 0.        ],
       [0.        , 0.        , 0.23458928, 0.        , 0.        ,
        0.        , 0.43473391, 0.86946782]])

In [20]:
pd.DataFrame(index=corpus, columns=countVectorizer.get_feature_names_out(), data=tfidf.toarray())

| | hous | lamb | littl | mari | prairi | silenc | star | twinkl |
|---|---|---|---|---|---|---|---|---|
| Little House on the Prairie | 0.660648 | 0.0 | 0.356496 | 0.0 | 0.660648 | 0.0 | 0.0 | 0.0 |
| Mary had a Little Lamb | 0.0 | 0.529645 | 0.402809 | 0.746473 | 0.0 | 0.0 | 0.0 | 0.0 |
| The Silence of the Lambs | 0.0 | 0.578667 | 0.0 | 0.0 | 0.0 | 0.815564 | 0.0 | 0.0 |
| Twinkle Twinkle Little Star | 0.0 | 0.0 | 0.234589 | 0.0 | 0.0 | 0.0 | 0.434734 | 0.869468 |


# `TfidfVectorizer`
`TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# initialize tf-idf vectorizer object
tfidfVectorizer = TfidfVectorizer(tokenizer=tokenize, smooth_idf=False)

In [22]:
# compute bag of word counts and tf-idf values
X = tfidfVectorizer.fit_transform(corpus)

In [23]:
# convert sparse matrix to numpy array to view
X.toarray()

array([[0.66064766, 0.        , 0.3564959 , 0.        , 0.66064766,
        0.        , 0.        , 0.        ],
       [0.        , 0.52964479, 0.40280852, 0.74647284, 0.        ,
        0.        , 0.        , 0.        ],
       [0.        , 0.57866699, 0.        , 0.        , 0.        ,
        0.81556393, 0.        , 0.        ],
       [0.        , 0.        , 0.23458928, 0.        , 0.        ,
        0.        , 0.43473391, 0.86946782]])

In [24]:
pd.DataFrame(index=corpus, columns=tfidfVectorizer.get_feature_names_out(), data=X.toarray())

| | hous | lamb | littl | mari | prairi | silenc | star | twinkl |
|---|---|---|---|---|---|---|---|---|
| Little House on the Prairie | 0.660648 | 0.0 | 0.356496 | 0.0 | 0.660648 | 0.0 | 0.0 | 0.0 |
| Mary had a Little Lamb | 0.0 | 0.529645 | 0.402809 | 0.746473 | 0.0 | 0.0 | 0.0 | 0.0 |
| The Silence of the Lambs | 0.0 | 0.578667 | 0.0 | 0.0 | 0.0 | 0.815564 | 0.0 | 0.0 |
| Twinkle Twinkle Little Star | 0.0 | 0.0 | 0.234589 | 0.0 | 0.0 | 0.0 | 0.434734 | 0.869468 |
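
The equivalence between the one-step and two-step approaches can be checked directly. The sketch below chains `CountVectorizer` and `TfidfTransformer` with `make_pipeline` and compares the result with `TfidfVectorizer`. To stay self-contained it uses scikit-learn's default tokenizer rather than the custom `tokenize` function, so the vocabulary (and therefore the numbers) will differ from the tables in this notebook. Only the comparison between the two approaches matters here.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

corpus = [
    "Little House on the Prairie",
    "Mary had a Little Lamb",
    "The Silence of the Lambs",
    "Twinkle Twinkle Little Star",
]

# Two steps: count tokens, then reweight the counts
two_step = make_pipeline(CountVectorizer(), TfidfTransformer(smooth_idf=False))
two_step_matrix = two_step.fit_transform(corpus).toarray()

# One step: TfidfVectorizer does both internally
one_step = TfidfVectorizer(smooth_idf=False)
one_step_matrix = one_step.fit_transform(corpus).toarray()

print(two_step_matrix.shape == one_step_matrix.shape)
print(np.allclose(two_step_matrix, one_step_matrix))
```

The two-step form is useful when the raw counts are needed as well, while `TfidfVectorizer` is more convenient when only the tf-idf values are of interest.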
