# Varieties of TF-IDF Vectorization

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import string
import numpy as np
import pandas as pd

TF-IDF Vectorization of documents for an NLP problem is an extremely popular pre-processing step. The heart of the idea is:

- to *reward* a token-document pair when the token appears in many places *in the document* (this is the **term frequency** of the term in the document), but

- to *punish* a pair when the token appears in many places *across the documents* in the corpus as a whole (this is the **document frequency** of the term in the corpus).

Notice that the document frequency, though defined and used for every term-document pair $\langle t, d \rangle$, is in fact independent of $d$.
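As a minimal sketch of these two counts, assuming simple whitespace tokenization: the helper names `term_frequency` and `document_frequency` and the three-document corpus below are illustrative only and are not part of `sklearn`.

```python
from collections import Counter

# Illustrative corpus; splitting on whitespace is an assumption
docs = ['love love monster', 'hate dwarf', 'love hate']
tokenized = [doc.split() for doc in docs]

def term_frequency(term, tokens):
    """Number of times `term` occurs in one tokenized document."""
    return Counter(tokens)[term]

def document_frequency(term, tokenized_docs):
    """Number of documents that contain `term` at least once."""
    return sum(term in tokens for tokens in tokenized_docs)

# Term frequency changes from document to document...
print([term_frequency('love', tokens) for tokens in tokenized])  # [2, 0, 1]
# ...while document frequency is a property of the term alone
print(document_frequency('love', tokenized))  # 2
```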

This idea is in fact a theme with [multiple variations](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition). Let's explore a few definitions, starting with the default calculation in `sklearn`'s tool.

## `TfidfVectorizer`

In [4]:
# Sample corpus of three two-token documents

toy_corpus = ['love monster', 'hate dwarf', 'love hate']

### No Smoothing

Let's first turn off the IDF smoothing that `sklearn` applies by default.

In [39]:
tfidf_toy = TfidfVectorizer(smooth_idf=False).fit(toy_corpus)

The fitted object stores the tokens in alphabetical order.

In [5]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
list(tfidf_toy.get_feature_names_out())

['dwarf', 'hate', 'love', 'monster']

It also stores the IDF values.

In [41]:
tfidf_toy.idf_

array([2.09861229, 1.40546511, 1.40546511, 2.09861229])

### Calculation

The calculation here is as follows:

For the inverse document frequency, $\large IDF_{w} = \log\left(\frac{|c|}{|\{d\in c: w\in d\}|}\right)+1$, i.e. the logarithm of the ratio of the number of documents in the corpus to the number of documents that contain the term in question, plus one.

Why do we take the logarithm?

<details>
    <summary>Expand here for answer!</summary>
    If we just went with the bare ratio, tokens that appeared in only one document would have *twice* the score of tokens that appeared in two documents. But generally speaking that would place too much weight on the difference between moderately rare tokens and extremely rare tokens.
    </details>

Why do we add $1$?

<details>
    <summary>Expand here for answer!</summary>
    If we had a word that appeared in *every* document, we'd calculate $\log(1) = 0$. A zero IDF would wipe out that word's TF-IDF score in every document, no matter how often the word occurred there, and so it's useful to add $1$ to our calculation of the IDF.
    </details>

For 'dwarf' and 'monster', for example, we'd calculate:

In [47]:
idf_dw_mon = np.log(3 / 1) + 1

In [48]:
idf_dw_mon

2.09861228866811

For 'hate' and 'love', we'd calculate:

In [49]:
idf_ha_lo = np.log(3 / 2) + 1

In [50]:
idf_ha_lo

1.4054651081081644
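The same calculation can be done for the whole vocabulary at once. The following is a sketch that assumes whitespace tokenization is good enough for this toy corpus; the names `vocab`, `doc_freq` and `idf_unsmoothed` are introduced here for illustration.

```python
import numpy as np

toy_corpus = ['love monster', 'hate dwarf', 'love hate']

# Vocabulary in alphabetical order, matching the fitted vectorizer
vocab = sorted({token for doc in toy_corpus for token in doc.split()})

# Document frequency: number of documents containing each token
doc_freq = np.array([sum(token in doc.split() for doc in toy_corpus)
                     for token in vocab])

# Unsmoothed IDF: log(|c| / document frequency) + 1
idf_unsmoothed = np.log(len(toy_corpus) / doc_freq) + 1

for token, value in zip(vocab, idf_unsmoothed):
    print(token, round(float(value), 8))
```

The values should agree with the `idf_` array stored on the vectorizer fitted with `smooth_idf=False`.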

### Smoothing

The smoothing in effect adds to the corpus a fourth document that contains every word in the vocabulary exactly once. In our case, this extra document would look like 'dwarf love hate monster'.

Thus the adjustment to our IDF calculation would be:

$\large IDF^*_{w} = \log\left(\frac{|c|+1}{|\{d\in c: w\in d\}|+1}\right)+1$

In [51]:
tfidf_toy_smooth = TfidfVectorizer().fit(toy_corpus)

In [52]:
tfidf_toy_smooth.idf_

array([1.69314718, 1.28768207, 1.28768207, 1.69314718])

For 'dwarf' and 'monster' the new calculation would be:

In [53]:
np.log((3+1) / (1+1)) + 1

1.6931471805599454

For 'hate' and 'love' the new calculation would be:

In [54]:
np.log((3+1) / (2+1)) + 1

1.2876820724517808
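The "extra document" reading of smoothing can be checked directly. This sketch appends the hypothetical fourth document to the toy corpus, fits an *unsmoothed* vectorizer on the result, and compares it with the smoothed vectorizer fitted on the original three documents. The name `augmented_corpus` is introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ['love monster', 'hate dwarf', 'love hate']

# Hypothetical extra document that contains every vocabulary term once
augmented_corpus = toy_corpus + ['dwarf love hate monster']

# Smoothed IDF on the original corpus
idf_smoothed = TfidfVectorizer(smooth_idf=True).fit(toy_corpus).idf_

# Unsmoothed IDF on the corpus with the extra document appended
idf_augmented = TfidfVectorizer(smooth_idf=False).fit(augmented_corpus).idf_

# The two arrays should agree up to floating-point error
print(np.allclose(idf_smoothed, idf_augmented))
```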

## A New Measure of IDF

There are other variants of IDF calculations, but every one I've seen proceeds by encoding the *binary* question of whether each document contains the term or not. We might see what happens if instead we consider *how many* times the term appears across the corpus.

If we insisted on a new name for this, we might call it "Inverse-Corpus-Term-Frequency", or ICTF.

Remarks:

- ICTF may be significantly larger than the traditional IDF score for a token that, though present in many documents, occurs rarely within documents, especially for a small corpus of long documents.
- ICTF may be significantly smaller than the traditional IDF score for a token that, though present in few documents, occurs frequently within documents, especially for a large corpus of short documents.
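The text above does not fix a formula for ICTF, so the following is only one possible definition, chosen by analogy with the unsmoothed IDF: $\log(\text{total tokens in corpus} / \text{occurrences of the term in corpus}) + 1$. Both that formula and the helper names `idf` and `ictf` are assumptions made for this sketch, and the corpus is made up so that the two remarks show up in the numbers.

```python
import math
from collections import Counter

def idf(term, tokenized_docs):
    """Unsmoothed IDF: log(N_docs / document frequency) + 1."""
    df = sum(term in tokens for tokens in tokenized_docs)
    return math.log(len(tokenized_docs) / df) + 1

def ictf(term, tokenized_docs):
    """Hypothetical ICTF: log(N_tokens / corpus-wide count) + 1."""
    counts = Counter(token for tokens in tokenized_docs for token in tokens)
    return math.log(sum(counts.values()) / counts[term]) + 1

# Illustrative corpus: 'monster' occurs in one document only, but four times
docs = [doc.split() for doc in
        ['monster monster monster monster', 'hate dwarf', 'love hate']]

for term in ('monster', 'hate'):
    print(term, round(idf(term, docs), 3), round(ictf(term, docs), 3))
# monster 2.099 1.693
# hate 1.405 2.386
```

Under this definition 'monster' scores lower on ICTF than on IDF (few documents, many occurrences), while 'hate' scores higher (more documents, one occurrence each).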

In [46]:
list(enumerate(toy_corpus))

[(0, 'love monster'), (1, 'hate dwarf'), (2, 'love hate')]

In [67]:
class TfictfVectorizer(TfidfVectorizer):

    def __init__(self):
        super().__init__()

    def fit(self, corpus, ictf=True):
        super().fit(raw_documents=corpus)
        # Always define both attributes so that ictf=False does not fail
        self.tfs, self.ictfs = None, None
        if ictf:
            # Tokenize on whitespace once and build a sorted vocabulary
            tokenized = [doc.split() for doc in corpus]
            vocab = sorted({token for doc in tokenized for token in doc})

            # tfs[x, y]: count of vocabulary term y in document x
            tfs = np.zeros((len(corpus), len(vocab)))
            for x, doc in enumerate(tokenized):
                for y, token in enumerate(vocab):
                    tfs[x, y] = doc.count(token)
            self.tfs = tfs

            # Each term's share of all tokens in the corpus. Counts come from
            # token lists, so that e.g. 'ate' is not matched inside 'hate'.
            # Note that this is the corpus frequency itself, not yet inverted.
            corpus_length = sum(len(doc) for doc in tokenized)
            self.ictfs = tfs.sum(axis=0) / corpus_length
        return self

    def transform(self, corpus):
        import pandas as pd
        columns = super().get_feature_names()
        return pd.Data

In [68]:
tfictf = TfictfVectorizer().fit(toy_corpus)

In [69]:
tfictf.tfs

array([[0., 0., 1., 1.],
       [1., 1., 0., 0.],
       [0., 1., 1., 0.]])

In [70]:
tfictf.ictfs

array([0.16666667, 0.33333333, 0.33333333, 0.16666667])

In [73]:
tfictf.tfs * tfictf.ictfs

array([[0.        , 0.        , 0.33333333, 0.16666667],
       [0.16666667, 0.33333333, 0.        , 0.        ],
       [0.        , 0.33333333, 0.33333333, 0.        ]])

In [62]:
tfictf.transform(toy_corpus, ictf)

NameError: name 'raw_documents' is not defined

In [13]:
data = fetch_20newsgroups()
X, y = data['data'], data['target']

In [14]:
snow = SnowballStemmer('english')

In [15]:
sw = stopwords.words('english')

In [16]:
sw.append('\n')

In [17]:
X_no_punc = [doc.translate(str.maketrans('', '', string.punctuation)) for doc in X]

In [18]:
X_no_punc[0]

'From lerxstwamumdedu wheres my thing\nSubject WHAT car is this\nNntpPostingHost rac3wamumdedu\nOrganization University of Maryland College Park\nLines 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day It was a 2door sports car looked to be from the late 60s\nearly 70s It was called a Bricklin The doors were really small In addition\nthe front bumper was separate from the rest of the body This is \nall I know If anyone can tellme a model name engine specs years\nof production where this car is made history or whatever info you\nhave on this funky looking car please email\n\nThanks\n IL\n    brought to you by your neighborhood Lerxst \n\n\n\n\n'

In [19]:
corpus_stemmed = []
for doc in X_no_punc:
    doc_stemmed = []
    for word in doc.split():
        doc_stemmed.append(snow.stem(word))
    corpus_stemmed.append(doc_stemmed)

corpus_stemmed[0][:10]

['from',
 'lerxstwamumdedu',
 'where',
 'my',
 'thing',
 'subject',
 'what',
 'car',
 'is',
 'this']


In [20]:
to_vec = [' '.join(doc) for doc in corpus_stemmed]

In [21]:
tfidf = TfidfVectorizer(stop_words=sw)

In [22]:
tfidf.fit(to_vec)

TfidfVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...])

In [25]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(tfidf.transform(to_vec).toarray(), columns=tfidf.get_feature_names_out())

In [26]:
df.loc[:, 'park']

0        0.112647
1        0.000000
2        0.000000
3        0.000000
4        0.000000
           ...   
11309    0.000000
11310    0.000000
11311    0.000000
11312    0.000000
11313    0.000000
Name: park, Length: 11314, dtype: float64

In [27]:
tfidf.idf_

array([6.46268355, 7.19839034, 8.54212509, ..., 9.64073738, 9.23527227,
       9.64073738])

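Let's try a rough hand calculation of the TF-IDF score for 'park' in the first document. Here the term frequency is taken as the token's count divided by the document's length (stop words included):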

In [28]:
to_vec[0].count('park')

1

In [29]:
len(to_vec[0].split())

120

In [30]:
TF = 1/120

In [31]:
ctr = 0
for doc in to_vec:
    if ' park ' in doc:
        ctr += 1
ctr

259

In [32]:
len(to_vec)

11314

In [33]:
IDF = 11314/259 + 1

In [34]:
TF*np.log(IDF)

0.03166335013457117
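This result, about 0.032, differs from the 0.112647 that `TfidfVectorizer` reported for 'park' in the first document. At least part of the gap comes from the vectorizer's default settings: it uses the raw count as the term frequency, applies the smoothed IDF, and then rescales every document's row to unit Euclidean (L2) length. The sketch below reproduces those three steps on the toy corpus, where the arithmetic is easy to follow. The names `by_hand` and `from_sklearn` are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = ['love monster', 'hate dwarf', 'love hate']

# Raw term counts: the default TF is the count, not count / document length
counts = CountVectorizer().fit_transform(toy_corpus).toarray()

# Smoothed IDF, as in the default TfidfVectorizer
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply, then scale each document (row) to unit Euclidean length
weighted = counts * idf
by_hand = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(toy_corpus).toarray()

# The two matrices should agree up to floating-point error
print(np.allclose(by_hand, from_sklearn))
```

Because of the row normalization, a document's scores depend on all of its other tokens, which is why a single-term calculation cannot reproduce the vectorizer's output on its own.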

In [35]:
np.log(1001/2)

6.215607598755275

In [36]:
np.log(1001/5)

5.29931686688112

In [37]:
0.0013 * (np.log(1001/2) + 1)

0.009380289878381857