**Term Frequency-Inverse Document Frequency** (**TF-IDF**) is a numerical statistic that reflects how important a word is to a document in a corpus.

**Term Frequency** (**tf**): measures how often a word occurs in each document in the corpus. It is the ratio of the number of times the word appears in a document to the total number of words in that document, so it increases as the word occurs more often within the document.

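**Inverse Document Frequency** (**idf**): weights a word by how rare it is across all documents in the corpus. Words that occur in few documents get a high weight, and words that occur in most documents get a low one.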

We find a TF-IDF score for each word in each document of the corpus by multiplying tf and idf.

$$ \mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) $$


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from functools import reduce
from math import log
import re
import string

In [2]:
corpus1 = """
I took my dog for a walk
My dog ate some grass
My dog's name is Scoot
""".split("\n")[1:-1]

In [None]:
# Cleaning the data (lowercasing) and tokenizing on whitespace
l_A = corpus1[0].lower().split()
l_B = corpus1[1].lower().split()
l_C = corpus1[2].lower().split()

print(l_A)
print(l_B)
print(l_C)

In [None]:
## Bag of Words calculation
word_set = set(l_A).union(set(l_B)).union(set(l_C))
print(word_set)

In [None]:
## Continuing Bag of Words calculation

word_dict_A = dict.fromkeys(word_set, 0)
word_dict_B = dict.fromkeys(word_set, 0)
word_dict_C = dict.fromkeys(word_set, 0)

for word in l_A:
    word_dict_A[word] += 1

for word in l_B:
    word_dict_B[word] += 1

for word in l_C:
    word_dict_C[word] += 1

In [None]:
pd.DataFrame([word_dict_A, word_dict_B, word_dict_C])

For the term frequency $\mathrm{tf}(t,d)$, the simplest choice is the raw count of a term in a string. Here we normalize that count by the length of the string: $$ \mathrm{tf}(t,d) = \frac{n_t}{\sum_k n_k} $$ where $n_t$ is the number of occurrences of the word $t$ in the string, and the denominator is the total number of words in that string.

In [None]:
def compute_tf(word_dict, l):
    tf = {}
    sum_nk = len(l)
    for word, count in word_dict.items():
        tf[word] = count/sum_nk
    return tf

In [None]:
tf_A = compute_tf(word_dict_A, l_A)
tf_B = compute_tf(word_dict_B, l_B)
tf_C = compute_tf(word_dict_C, l_C)

idf is a measure of how much information the word provides: $$ \mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} $$

- $N$: total number of strings in the corpus, $N = |D|$
- $|\{d \in D : t \in d\}|$: number of strings in which the term $t$ appears (i.e., $\mathrm{tf}(t,d) \neq 0$). If the term is not in the corpus, this leads to a division by zero. It is therefore common to adjust the denominator to $1 + |\{d \in D : t \in d\}|$.

The code below uses the unadjusted form, which is safe here because every word in the vocabulary occurs in at least one string.
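The adjusted denominator can be sketched as a variant of the idf computation. This is only an illustration, not part of the walkthrough itself: the function name `compute_idf_adjusted` and the toy counts are made up for the example. Note that with this adjustment a term that appears in every string gets a negative weight, since $N / (1 + N) < 1$.

```python
from math import log

def compute_idf_adjusted(word_dicts):
    """idf with the denominator adjusted to 1 + df (hypothetical helper)."""
    n = len(word_dicts)
    # Vocabulary: every key that appears in any of the count dicts
    vocab = set().union(*word_dicts)
    idf = {}
    for word in vocab:
        # df: number of strings in which the word occurs at least once
        df = sum(1 for d in word_dicts if d.get(word, 0) > 0)
        # The +1 keeps the denominator non-zero even when df == 0
        idf[word] = log(n / (1 + df))
    return idf

# Toy counts (made up): "cat" occurs in no string at all
toy_counts = [
    {"dog": 1, "walk": 1, "cat": 0},
    {"dog": 1, "walk": 0, "cat": 0},
]
idf_adjusted = compute_idf_adjusted(toy_counts)
print(idf_adjusted)
```

Here `cat` no longer triggers a division by zero, `walk` gets a weight of exactly zero, and `dog`, which occurs in both strings, gets a negative weight.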

In [None]:
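def compute_idf(strings_list):
    # strings_list holds one word-count dict per string, all sharing the same keys
    n = len(strings_list)
    idf = dict.fromkeys(strings_list[0].keys(), 0)
    # Count the number of strings in which each word occurs
    for word_dict in strings_list:
        for word, count in word_dict.items():
            if count > 0:
                idf[word] += 1

    # Turn each document frequency into log(N / df)
    for word, df in idf.items():
        idf[word] = log(n / df)
    return idf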

In [None]:
idf = compute_idf([word_dict_A, word_dict_B, word_dict_C])

In [None]:
def compute_tf_idf(tf, idf):
    tf_idf = dict.fromkeys(tf.keys(), 0)
    for word, v in tf.items():
        tf_idf[word] = v * idf[word]
    return tf_idf

In [None]:
tf_idf_A = compute_tf_idf(tf_A, idf)
tf_idf_B = compute_tf_idf(tf_B, idf)
tf_idf_C = compute_tf_idf(tf_C, idf)

pd.DataFrame([tf_idf_A, tf_idf_B, tf_idf_C])
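
As a sanity check on the resulting table, a few entries can be worked out by hand. This assumes the whitespace tokenization used earlier (so the first string has 7 tokens, and `dog's` in the third string is a different token from `dog`) and the natural logarithm that `math.log` computes.

The word `walk` occurs once in the first string and in no other string:

$$ \mathrm{tfidf}(\text{walk}, A, D) = \frac{1}{7} \cdot \ln\frac{3}{1} \approx 0.1429 \cdot 1.0986 \approx 0.1569 $$

The word `dog` occurs in the first two strings only:

$$ \mathrm{tfidf}(\text{dog}, A, D) = \frac{1}{7} \cdot \ln\frac{3}{2} \approx 0.1429 \cdot 0.4055 \approx 0.0579 $$

The word `my` occurs in all three strings, so its idf is zero and its score vanishes in every row:

$$ \mathrm{tfidf}(\text{my}, A, D) = \frac{1}{7} \cdot \ln\frac{3}{3} = 0 $$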

## Alternative using sklearn library TfidfVectorizer()

In [3]:
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(corpus1)

In [4]:
pd.DataFrame(features.todense(), columns=tfidf.get_feature_names_out())

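|   | ate | dog | for | grass | is | my | name | scoot | some | took | walk |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.307144 | 0.52004 | 0.0 | 0.0 | 0.307144 | 0.0 | 0.0 | 0.0 | 0.52004 | 0.52004 |
| 1 | 0.52004 | 0.307144 | 0.0 | 0.52004 | 0.0 | 0.307144 | 0.0 | 0.0 | 0.52004 | 0.0 | 0.0 |
| 2 | 0.0 | 0.307144 | 0.0 | 0.0 | 0.52004 | 0.307144 | 0.52004 | 0.52004 | 0.0 | 0.0 | 0.0 |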
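
These numbers differ from the manual calculation because `TfidfVectorizer` uses different defaults. The sketch below tries to reproduce the table by hand, assuming the vectorizer's documented default settings: tokens are runs of two or more word characters (so `I`, `a`, and the `s` of `dog's` are dropped), tf is the raw count, idf is smoothed as $\ln\frac{1+N}{1+\mathrm{df}} + 1$, and each row is scaled to unit L2 norm. The variable names are illustrative only.

```python
import re
from math import log, sqrt

docs = [
    "I took my dog for a walk",
    "My dog ate some grass",
    "My dog's name is Scoot",
]

# Assumed default tokenization: lowercase, then runs of 2+ word characters
tokens = [re.findall(r"(?u)\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
n = len(docs)

# Smoothed idf: ln((1 + N) / (1 + df)) + 1
df = {w: sum(w in doc for doc in tokens) for w in vocab}
idf = {w: log((1 + n) / (1 + df[w])) + 1 for w in vocab}

rows = []
for doc in tokens:
    # Raw count times idf, then scale the row to unit L2 norm
    raw = {w: doc.count(w) * idf[w] for w in vocab}
    norm = sqrt(sum(v * v for v in raw.values()))
    rows.append({w: v / norm for w, v in raw.items()})

for row in rows:
    print([round(row[w], 6) for w in vocab])
```

With this tokenization `dog` and `my` occur in all three strings, so they share the lowest weight in every row, while every other word occurs in exactly one string and gets the higher weight.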
