### TF-IDF 
TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a way to quantify the words in a document. Its output looks similar to the vectors produced by Count Vectorizer, but there is one important difference: Count Vectorizer gives the raw count of each word in a document, while TF-IDF gives weights that indicate how important a word is to that document. The TF part measures how frequent a word is within a document, while the IDF part is the inverse of how frequent the word is across all the documents. The idea is that certain words, like 'the', 'is' and 'a', are frequent but add little to the meaning of a document, because they are prevalent across all documents. Terms like "spacecraft" and other technical terms, on the other hand, may be frequent within one document but will not appear in every document.
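
In symbols, the weighting computed by the code in this notebook can be written as follows, where $N$ is the number of documents and $\mathrm{df}(t)$ is the frequency count collected for term $t$. Other sources use slightly different variants of IDF, so treat this as the definition used here rather than the only one.

$$
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of tokens in } d},
\qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t) + 1},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$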

TF-IDF is most useful for information retrieval, for example when ranking documents by how well they match a query.
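
To make the retrieval use case concrete, here is a minimal sketch that scores a few made-up documents against a query. The function names, the toy documents and the query are all hypothetical, and the sketch assumes the plain IDF `ln(N / df)` (without the `+1` used later in this notebook) so that a word present in every document gets a weight of exactly zero.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())

def tfidf_scores(query, docs):
    """Score each document by the summed TF-IDF weight of the query terms."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / df[term])  # assumption: unsmoothed IDF
                score += tf * idf
        scores.append(score)
    return scores

# Hypothetical toy documents; "the" appears in all of them.
docs = [
    "The spacecraft reached orbit",
    "The dog chased the cat",
    "The cat sat on the mat",
    "The weather is mild",
]
scores = tfidf_scores("the spacecraft", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first document gets a non-zero score: "the" contributes nothing because it occurs everywhere, while "spacecraft" is both present and rare.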

So let's get started by importing the libraries and getting our document corpus.

#### Importing the Libraries

In [1]:
import re
import numpy as np
import nltk
from nltk.tokenize import RegexpTokenizer
import spacy
from random import shuffle
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import math

nlp = spacy.load("en_core_web_sm")

#### Get the corpus

In [2]:
#get the corpus
with open("sample text.txt","r") as f:
    content = f.read()

# get the sentences
corpus = content.split("\n")
corpus = [c for c in corpus if len(c.strip()) > 0]
# shuffle once and keep 10 random sentences
shuffle(corpus)
corpus = corpus[:10]
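
Because the corpus is shuffled, every run of the notebook picks different sentences. If repeatable output matters, one option is to draw the sample from a seeded random generator. This is only a sketch: `lines` is a hypothetical stand-in for the sentences read from the file, and the seed value is arbitrary.

```python
import random

# Hypothetical stand-in for the non-empty lines read from the text file.
lines = [f"sentence {i}" for i in range(100)]

rng = random.Random(42)         # arbitrary fixed seed, for repeatable runs
sample = rng.sample(lines, 10)  # 10 distinct lines in random order
print(len(sample))
```

`sample` draws without replacement, so it plays the same role as shuffling and then slicing the first ten items.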

#### Clean the sentences

In [3]:
# clean sents and tokenize
def bag_of_words(s):
    # raw string, so that "\w" is not read as an (invalid) string escape
    tokenizer = RegexpTokenizer(r"\w+")
    tokens = tokenizer.tokenize(s)
    tokens = [token for token in tokens if token not in string.punctuation]
    tokens = [token.lower() for token in tokens if re.findall(r"\w", token)]
    tokens = [token.strip().strip('.').strip("—").strip("'") for token in tokens]
    return tokens
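
To see what this kind of tokenization does to a sentence, here is a small stand-alone sketch. It uses `re.findall` with the same `\w+` pattern instead of NLTK's `RegexpTokenizer`, so treat it as an approximation of `bag_of_words` rather than the function itself; the helper name `simple_tokens` is made up for the example.

```python
import re

def simple_tokens(sentence):
    """Lowercase the sentence and keep runs of letters, digits and underscores."""
    return re.findall(r"\w+", sentence.lower())

tokens = simple_tokens("You are funny, Masser Holmes, ain't you?")
print(tokens)
```

Note that punctuation is dropped and that the apostrophe splits "ain't" into the two tokens "ain" and "t".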
    

#### Get the unique words

In [4]:
# get the unique words
def get_vocab(all_text):
    tokenized_text = [bag_of_words(s) for s in all_text]
    # flatten, de-duplicate and sort, so that every word gets a stable index
    return sorted({token for tokens in tokenized_text for token in tokens})

#### Compute the term frequency

In [5]:
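def compute_tf(doc, unique_words):
    """Return the term-frequency vector of doc, aligned with unique_words."""
    tf_dict = dict()
    tokens = bag_of_words(doc)
    tf = [0]*len(unique_words)
    N = len(tokens)
    # raw count of each token in the document
    for token in tokens:
        tf_dict[token] = tf_dict.get(token, 0) + 1

    # normalise the counts by the length of the document
    for word in tf_dict:
        tf_dict[word] = tf_dict[word] / N
    # put each frequency at the position of its word in the vocabulary
    for i, w in enumerate(unique_words):
        if w in tf_dict:
            tf[i] = tf_dict[w]

    return tf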
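
As a quick check of the idea, the term frequencies of a single sentence can be worked out with a `Counter`. The sketch below is stand-alone and uses a plain `re.findall` tokenizer as an assumption in place of `bag_of_words`; the sentence is the first one in the sampled corpus printed further down.

```python
import re
from collections import Counter

sentence = "I will be there, you may be sure.— MAUDIE."
tokens = re.findall(r"\w+", sentence.lower())

counts = Counter(tokens)
# term frequency = count of the word / number of tokens in the sentence
tf = {word: count / len(tokens) for word, count in counts.items()}

print(len(tokens))  # 9 tokens
print(tf["be"])     # "be" occurs twice, so 2/9
print(tf["you"])    # every other word occurs once, so 1/9
```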

#### Get the Term Frequencies for all the terms for all the documents in the corpus

In [6]:
# computing TF
vocabulary = get_vocab(corpus)
TF_vals = [compute_tf(para, vocabulary) for para in corpus]
print(*corpus, sep="\n")
print(TF_vals[0])

"I will be there, you may be sure.— MAUDIE."
"You are funny, Masser Holmes, ain't you?"
Story 54, Case 53, Collier's Weekly, November 8, 1924;
"It was just a week ago to-day. The creature was howling outside the old well-house, and Sir Robert was in one of his tantrums that morning. He caught it up, and I thought he would have killed it. Then he gave it to Sandy Bain, the jockey, and told him to take the dog to old Barnes at the Green Dragon, for he never wished to see it again."
"Yet you say he is affectionate?"
"'I have his letters to me in my pocket.'
"Dear me, Holmes!" I cried, "that seemed to me to be the most damning incident of all."
"No, I heard nothing. But, indeed, Mr. Holmes, I was so agitated and horrified by this terrible outbreak that I rushed to get back to the peace of my own room, and I was incapable of noticing anything which happened."
"But surely," said I, "the vampire was not necessarily a dead man? A living person might have the habit. I have read, for example, of

#### Get the document frequency

In [7]:
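def compute_df(all_text):
    tokenized_text = [bag_of_words(doc) for doc in all_text]
    unique_words = get_vocab(all_text)
    df = dict.fromkeys(unique_words, 0)
    # NOTE: this loop visits every token, so a word repeated within one
    # document is counted once per occurrence. The result is the total count
    # of the word in the corpus, not the number of documents containing it;
    # iterate over set(tokens) instead to get a true document frequency.
    for tokens in tokenized_text:
        for token in tokens:
            df[token] += 1
    return df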

In [8]:
# computing document frequency
df_val = compute_df(corpus)
from pprint import pprint
pprint(df_val["a"])

3
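
In the sentences printed earlier, "a" occurs three times but in only two sentences, so the 3 above looks like a count of occurrences rather than a count of documents. A document frequency in the strict sense counts each word at most once per document. The sketch below shows one way to do that; the function name, the `re.findall` tokenizer and the toy documents are all assumptions made for the example.

```python
import re
from collections import Counter

def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        # set() makes sure a word is counted once per document
        df.update(set(re.findall(r"\w+", doc.lower())))
    return dict(df)

# Hypothetical toy documents.
docs = ["A dog saw a cat.", "The cat ran.", "A bird sang."]
df = document_frequency(docs)
print(df["a"])  # "a" occurs three times in total, but in two documents
```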


In [9]:
def compute_idf(all_text):
    word_freq = compute_df(all_text)
    idf = dict()
    N = len(all_text)
    for word in word_freq:
        idf[word] = np.log(N/(word_freq[word] + 1))
    return idf
        

In [10]:
idf_vals = compute_idf(corpus)
idf_vals['a']

0.9162907318741551
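
The value above is just `ln(10 / (3 + 1))`. The same formula turns negative as soon as the count reaches the number of documents, which can happen here because the counts include repeated occurrences. The sketch below reproduces both cases and, for comparison, shows a smoothed variant, `ln((1 + N) / (1 + df)) + 1`, which is the form scikit-learn documents for `TfidfTransformer` with its default `smooth_idf=True`. The helper names are made up for the example.

```python
import math

N = 10  # number of documents in the sampled corpus

def idf(df, n_docs=N):
    """IDF as defined in compute_idf: ln(N / (df + 1))."""
    return math.log(n_docs / (df + 1))

def idf_smooth(df, n_docs=N):
    """Smoothed variant: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

print(idf(3))          # count of 3, as for "a": ln(10 / 4)
print(idf(10))         # count equal to N: ln(10 / 11), a negative weight
print(idf_smooth(10))  # the smoothed variant stays at 1 in the same case
```

With the smoothed variant, a word that appears in every document still gets a small positive weight instead of a negative one, as long as the count never exceeds the number of documents.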

#### Compute TF-IDF

In [11]:
def compute_tfidf(all_text, TF_VAL):
    # TF_VAL is not used: the term frequencies are recomputed below
    unique_words = get_vocab(all_text)
    tf_vals = [compute_tf(text,unique_words) for text in all_text]
    idf_vals = compute_idf(all_text)

    tf_idf = list()
    for i in range(len(all_text)):
        # weight each term frequency by the IDF of the corresponding word
        temp = [tf_vals[i][j] * idf_vals[w] for j, w in enumerate(unique_words)]
        tf_idf.append(temp)
    return tf_idf

In [12]:
ti_vals = compute_tfidf(corpus, TF_vals)

In [13]:
print(ti_vals[0])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2036201626387011, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.01059001997825832, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.17882643471490003, 0.17882643471490003, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.17882643471490003, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.0, 0.0, 0.0, 0.17882643471490003, 0.0, 0.0, -0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.17882643471490003, 0.0, 0.0, 0.0, 0.07701635339554948, 0.0, 0.0, 0.0]


In [14]:
print(TF_vals[0])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2222222222222222, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1111111111111111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1111111111111111, 0.1111111111111111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1111111111111111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1111111111111111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1111111111111111, 0, 0, 0, 0.1111111111111111, 0, 0, 0]
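
The non-zero entries of the two vectors above can be checked by hand. The first sentence has nine tokens, and in the sentences printed earlier "be" occurs 3 times, "i" 10 times, "you" 4 times, and the remaining words of the sentence once each. The sketch below plugs those counts into the formulas used in this notebook; the helper name `tfidf` is made up for the example.

```python
import math

N = 10       # documents in the sampled corpus
doc_len = 9  # tokens in the first sentence

def tfidf(count_in_doc, corpus_count):
    """TF-IDF with tf = count / doc_len and idf = ln(N / (corpus_count + 1))."""
    tf = count_in_doc / doc_len
    idf = math.log(N / (corpus_count + 1))
    return tf * idf

print(tfidf(2, 3))   # "be":  about  0.2036
print(tfidf(1, 10))  # "i":   about -0.0106
print(tfidf(1, 1))   # words seen once in the corpus: about 0.1788
print(tfidf(1, 4))   # "you": about  0.0770
```

The negative weight for "i" is a side effect of counting every occurrence: its count of 10 equals the number of documents, so `ln(10 / 11)` drops below zero. The `-0.0` entries are zero term frequencies multiplied by such a negative IDF.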
