## Term frequency - Inverse document frequency

## Links
 * https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3
 * http://scikit-learn.org/stable/modules/feature_extraction.html
 * https://zablo.net/blog/post/twitter-sentiment-analysis-python-scikit-word2vec-nltk-xgboost
 * http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

In [15]:
from collections import Counter
from collections import defaultdict
import math

import warnings
warnings.filterwarnings('ignore')
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

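TF-IDF stands for *term frequency* - *inverse document frequency*.

Term frequency (tf) is the number of times a word appears in a document divided by the total number of words in that document. The code in this notebook uses the raw count without dividing by the document length, which is also what scikit-learn does.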
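To make the length-normalized definition concrete, here is a minimal sketch. The helper name `term_frequency` is hypothetical, and whitespace tokenization is an assumption made for illustration.

```python
from collections import Counter

def term_frequency(text):
    """Return {word: count / total_words} for a whitespace-tokenized text."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("the car is driven on the road")
# "the" occurs 2 times out of 7 words
print(tf["the"])
```

With this normalization the tf values of a document sum to 1, so long and short documents become comparable.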

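Inverse document frequency (idf) is the natural logarithm of the ratio of $N$, the total number of documents in the corpus, to $d_i$, the number of documents containing term $i$

$$ idf_i = \log \frac{N}{d_i} $$

The code below adds 1 to this value, the convention scikit-learn follows when `smooth_idf=False`, so that a term occurring in every document keeps a non-zero weight.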
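A short sketch of the plain idf formula may help show how it behaves at the extremes. The function name `inverse_document_frequency` is hypothetical, and the two-document corpus size is chosen only for illustration.

```python
import math

def inverse_document_frequency(n_docs, doc_freq):
    """Plain idf: natural log of (total documents / documents containing the term)."""
    return math.log(n_docs / doc_freq)

# In a two-document corpus:
idf_rare = inverse_document_frequency(2, 1)    # term found in one document
idf_common = inverse_document_frequency(2, 2)  # term found in both documents
print(idf_rare, idf_common)
```

A term that appears in every document gets an idf of exactly zero under this definition, which removes it from the weighting entirely.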

The TF-IDF weight $w_i$ of a term $i$ is the product of its tf and idf values

$$ w_i = tf_i \times idf_i $$

Consider two sentences, `s1` and `s2`:

In [1]:
s1 = "the car is driven on the road"
s2 = "the truck is driven on the highway"

Split each sentence on whitespace and count the occurrences of each word

In [4]:
w1 = Counter(s1.split())
w2 = Counter(s2.split())
[w1, w2]

[Counter({'the': 2, 'car': 1, 'is': 1, 'driven': 1, 'on': 1, 'road': 1}),
 Counter({'the': 2, 'truck': 1, 'is': 1, 'driven': 1, 'on': 1, 'highway': 1})]

The set of unique words across both sentences (the vocabulary) is given by

In [5]:
words = set(w1.keys()) | set(w2.keys())
words

{'car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck'}

In [6]:
def compute_tf(word_counts, words):
    """Return (word, raw count) pairs for every vocabulary word, sorted by word."""
    # Raw counts are used (no division by document length) so that the
    # results match scikit-learn; words missing from the document get 0.
    return [(word, word_counts.get(word, 0)) for word in sorted(words)]

def compute_idf(corpus, words):
    """Return (word, idf) pairs with idf = ln(N / d_i) + 1, sorted by word."""
    n_docs = len(corpus)
    idf_dict = {}
    for word in words:
        # number of documents that contain the word
        doc_count = sum(1 for doc in corpus if word in doc)
        idf_dict[word] = math.log(n_docs / doc_count) + 1
    return [(word, idf_dict[word]) for word in sorted(words)]

In [7]:
compute_tf(w1, words)

[('car', 1),
 ('driven', 1),
 ('highway', 0),
 ('is', 1),
 ('on', 1),
 ('road', 1),
 ('the', 2),
 ('truck', 0)]

In [8]:
compute_idf([w1, w2], words)

[('car', 1.6931471805599454),
 ('driven', 1.0),
 ('highway', 1.6931471805599454),
 ('is', 1.0),
 ('on', 1.0),
 ('road', 1.6931471805599454),
 ('the', 1.0),
 ('truck', 1.6931471805599454)]
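
As a check on these values, a term that appears in one of the two documents (such as `car`) and a term that appears in both (such as `the`) work out as follows, using the natural logarithm and the $+1$ offset from `compute_idf`:

$$ idf_{car} = \ln \frac{2}{1} + 1 \approx 1.693, \qquad idf_{the} = \ln \frac{2}{2} + 1 = 1 $$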

In [9]:
def compute_tf_idf(corpus, words):
    """Map each word to its list of tf-idf weights, one entry per document."""
    idf_tuple = compute_idf(corpus, words)
    tf_idf = defaultdict(list)
    for word_counts in corpus:
        tf_tuple = compute_tf(word_counts, words)
        # tf_tuple and idf_tuple are both sorted by word, so they line up
        for (word, count), (_, idf) in zip(tf_tuple, idf_tuple):
            tf_idf[word].append(count * idf)
    return tf_idf

Convert the tf-idf dictionary to a pandas table

In [14]:
tf_idf_dict = compute_tf_idf([w1, w2], words)
# one row per term: the term followed by its weight in each document
data = [[term, *weights] for term, weights in tf_idf_dict.items()]

pd.DataFrame(data, columns=['term', 'doc1', 'doc2'])

| | term | doc1 | doc2 |
|---|---|---|---|
| 0 | car | 1.693147 | 0.0 |
| 1 | driven | 1.0 | 1.0 |
| 2 | highway | 0.0 | 1.693147 |
| 3 | is | 1.0 | 1.0 |
| 4 | on | 1.0 | 1.0 |
| 5 | road | 1.693147 | 0.0 |
| 6 | the | 2.0 | 2.0 |
| 7 | truck | 0.0 | 1.693147 |


## Using a CountVectorizer

A `CountVectorizer` tokenizes a list of documents and counts how many times each word occurs in each document

In [17]:
cv = CountVectorizer()
counts = cv.fit_transform([s1, s2])

Show the count matrix: one row per sentence and one column per vocabulary word

In [18]:
counts.toarray()

array([[1, 1, 0, 1, 1, 1, 2, 0],
       [0, 1, 1, 1, 1, 0, 2, 1]], dtype=int64)

Display the vocabulary, in the same order as the columns of the count matrix

In [33]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cv.get_feature_names_out().tolist()

['car', 'driven', 'highway', 'is', 'on', 'road', 'the', 'truck']
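
To look up a single word's column without scanning the list, the fitted vectorizer also exposes a `vocabulary_` dictionary. The following is a small self-contained sketch; the variable names `docs`, `col` and `the_counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

# vocabulary_ maps each term to its column index in the count matrix
col = cv.vocabulary_["the"]

# counts of "the" in each document, read from that column
the_counts = counts.toarray()[:, col]
print(col, the_counts)
```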

Display the count of each word in each sentence

In [42]:
dense = counts.toarray()
pd.DataFrame({'words': cv.get_feature_names_out(), 's1': dense[0], 's2': dense[1]})

| | words | s1 | s2 |
|---|---|---|---|
| 0 | car | 1 | 0 |
| 1 | driven | 1 | 1 |
| 2 | highway | 0 | 1 |
| 3 | is | 1 | 1 |
| 4 | on | 1 | 1 |
| 5 | road | 1 | 0 |
| 6 | the | 2 | 2 |
| 7 | truck | 0 | 1 |


## Use a TF-IDF transformer

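Scikit-learn's `TfidfTransformer` converts the count matrix produced by `CountVectorizer` into TF-IDF values. Setting `norm=None` and `smooth_idf=False` turns off row normalization and idf smoothing, so the result matches the manual calculation above.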

In [19]:
transformer = TfidfTransformer(norm=None, smooth_idf=False)
tfidf = transformer.fit_transform(counts)
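
The fitted transformer stores the idf weight of each vocabulary term in its `idf_` attribute. As a rough check, assuming the same two sentences, the sketch below compares `idf_` with $\ln(N / d_i) + 1$ computed by hand; `doc_freq` and `expected_idf` are names introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]

counts = CountVectorizer().fit_transform(docs)
transformer = TfidfTransformer(norm=None, smooth_idf=False).fit(counts)

# document frequency: number of documents in which each term occurs
doc_freq = (counts.toarray() > 0).sum(axis=0)
n_docs = counts.shape[0]

# idf as computed manually: ln(N / d_i) + 1
expected_idf = np.log(n_docs / doc_freq) + 1
print(np.allclose(transformer.idf_, expected_idf))
```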

Display the data as a table

In [21]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame.from_dict(
    {'words': cv.get_feature_names_out(),
     's1': tfidf.toarray()[0], 's2': tfidf.toarray()[1]})

| | words | s1 | s2 |
|---|---|---|---|
| 0 | car | 1.693147 | 0.0 |
| 1 | driven | 1.0 | 1.0 |
| 2 | highway | 0.0 | 1.693147 |
| 3 | is | 1.0 | 1.0 |
| 4 | on | 1.0 | 1.0 |
| 5 | road | 1.693147 | 0.0 |
| 6 | the | 2.0 | 2.0 |
| 7 | truck | 0.0 | 1.693147 |
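
The values above depend on `norm=None` and `smooth_idf=False`. With the default settings (`norm='l2'`, `smooth_idf=True`), scikit-learn uses the smoothed idf $\ln\frac{1 + N}{1 + d_i} + 1$ and scales each row to unit length. A minimal sketch of that behaviour, with illustrative variable names:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]

counts = CountVectorizer().fit_transform(docs)

# default settings: norm='l2', smooth_idf=True
default_transformer = TfidfTransformer()
default_tfidf = default_transformer.fit_transform(counts).toarray()

# smoothed idf computed by hand: ln((1 + N) / (1 + d_i)) + 1
n_docs = counts.shape[0]
doc_freq = (counts.toarray() > 0).sum(axis=0)
smoothed_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# each row (document) should have Euclidean length 1 after l2 normalization
row_norms = np.linalg.norm(default_tfidf, axis=1)
print(np.allclose(default_transformer.idf_, smoothed_idf), row_norms)
```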


## Use a TF-IDF vectorizer

The `TfidfVectorizer` combines both steps: it tokenizes and counts the raw text and applies the TF-IDF weighting in a single call, so no separate `CountVectorizer` is needed

In [23]:
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
tfidf = vectorizer.fit_transform([s1, s2])

Display the data as a table

In [25]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame.from_dict(
    {'words': vectorizer.get_feature_names_out(),
     's1': tfidf.toarray()[0], 's2': tfidf.toarray()[1]})

| | words | s1 | s2 |
|---|---|---|---|
| 0 | car | 1.693147 | 0.0 |
| 1 | driven | 1.0 | 1.0 |
| 2 | highway | 0.0 | 1.693147 |
| 3 | is | 1.0 | 1.0 |
| 4 | on | 1.0 | 1.0 |
| 5 | road | 1.693147 | 0.0 |
| 6 | the | 2.0 | 2.0 |
| 7 | truck | 0.0 | 1.693147 |
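
The table is identical to the one produced with `CountVectorizer` followed by `TfidfTransformer`. The sketch below checks that equivalence numerically; the names `two_step`, `one_step` and `same` are introduced only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the car is driven on the road",
        "the truck is driven on the highway"]

# two steps: count the words, then weight the counts
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer(norm=None, smooth_idf=False).fit_transform(counts)

# one step: TfidfVectorizer does both on the raw text
one_step = TfidfVectorizer(norm=None, smooth_idf=False).fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```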
