This Python notebook implements a TF-IDF vectorizer from scratch. It uses the 'BBC-Document-Classification' CSV file as its dataset.

Steps:
* Form a dictionary that maps each word (key) to an integer index (value).
* Convert each row (document) in the dataset to a list of integers using the word-to-index dictionary.
* Form a term frequency matrix of size N x w (documents x words) that stores how many times each word index appears in each document.
* Form an inverse document frequency vector of size w by counting the number of documents in which each word index appears, then taking log(N / document frequency).
* Compute the TF-IDF matrix as tf * idf (element-wise, with the idf vector broadcast across the rows of tf).
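
The steps above can be sketched end to end on a tiny corpus before running them on the real dataset. This is only an illustration: the three sentences are made up, and `str.split()` stands in for nltk's `word_tokenize` so that the sketch needs nothing but NumPy.

```python
import numpy as np

# Hypothetical toy corpus (not part of the BBC dataset), already lowercased
docs = ["the cat sat", "the dog sat down", "the cat ran"]

# Steps 1-2: build the word -> index map and encode each document as integers
word_idx = {}
docs_as_int = []
for doc in docs:
    ids = []
    for word in doc.split():  # assumption: whitespace split instead of word_tokenize
        if word not in word_idx:
            word_idx[word] = len(word_idx)
        ids.append(word_idx[word])
    docs_as_int.append(ids)

N, w = len(docs), len(word_idx)

# Step 3: term frequency matrix of shape (N, w)
tf = np.zeros((N, w))
for i, ids in enumerate(docs_as_int):
    for j in ids:
        tf[i, j] += 1

# Step 4: document frequency per word, then idf = log(N / df)
document_frequency = np.sum(tf > 0, axis=0)
idf = np.log(N / document_frequency)

# Step 5: TF-IDF via broadcasting of idf across the rows of tf
tf_idf = tf * idf
print(tf_idf.shape)
```

Note that "the" appears in every toy document, so its idf is log(3/3) = 0 and its TF-IDF score is zero everywhere, which is the intended effect of the idf weighting.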


In [17]:
import numpy as np 
import pandas as pd 
import nltk

In [18]:
from nltk import word_tokenize

In [19]:
# Download necessary data for nltk's word tokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\ishid/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [20]:
df = pd.read_csv('bbc_doc_classification.csv')

In [21]:
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\r\n\r\nQuart...,business
1,Dollar gains on Greenspan speech\r\n\r\nThe do...,business
2,Yukos unit buyer faces loan claim\r\n\r\nThe o...,business
3,High fuel prices hit BA's profits\r\n\r\nBriti...,business
4,Pernod takeover talk lifts Domecq\r\n\r\nShare...,business


In [22]:
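# Build the word -> index vocabulary and encode each document as a list of integers
word_idx = {}
idx = 0
tokenized_docs = []
for doc in df['text']:
    words = word_tokenize(doc.lower())
    doc_as_int = []
    for word in words:
        # Assign the next free index to each previously unseen word
        if word not in word_idx:
            word_idx[word] = idx
            idx += 1

        doc_as_int.append(word_idx[word])
    tokenized_docs.append(doc_as_int)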

In [23]:
# Reverse Mapping
idx_to_word = {v:k for k,v in word_idx.items()}

In [24]:
# Number of Documents 
N = len(df['text'])

In [25]:
# Number of Words
w = len(word_idx)

In [26]:
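# Instantiate the term frequency matrix
# The counts are analogous to the output of scikit-learn's CountVectorizer
# (its default tokenizer differs from word_tokenize, so the vocabulary would not match exactly)
# Most entries are zero, so a sparse matrix would be more memory-efficient than a dense array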

tf = np.zeros((N,w))
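
The comment about sparsity can be illustrated without committing to a particular sparse-matrix library. The sketch below is an assumption-laden stand-in, not what the notebook uses: it keeps one `collections.Counter` per document so that only non-zero counts are stored. The names `toy_docs`, `toy_w`, and `sparse_tf` are hypothetical.

```python
from collections import Counter

# Hypothetical integer-encoded documents over a 3-word vocabulary
toy_docs = [[0, 1, 0], [2, 1]]
toy_w = 3

# One Counter per document: keys are word indices, values are term counts
sparse_tf = [Counter(doc) for doc in toy_docs]

# Compare the number of stored counts with the size of the dense matrix
stored = sum(len(c) for c in sparse_tf)
dense_cells = len(toy_docs) * toy_w
print(stored, "stored counts versus", dense_cells, "dense cells")
```

On the real data the gap is far larger, since each article uses only a small fraction of the full vocabulary.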

In [27]:
# Populate term frequency matrix
for i, doc_as_int in enumerate(tokenized_docs):
    for j in doc_as_int:
        tf[i,j] +=1

In [28]:
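# Compute IDF
# Document frequency: number of documents containing each word, shape (w,)
# Every vocabulary word occurs in at least one document, so the division is safe
document_frequency = np.sum(tf > 0, axis=0)
idf = np.log(N / document_frequency)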
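
The cell above uses the plain definition idf = log(N / df). For comparison, scikit-learn's `TfidfVectorizer` defaults to a smoothed variant, ln((1 + N) / (1 + df)) + 1, and also L2-normalizes each row, so its scores would not match this notebook's numbers. The sketch below contrasts the two idf formulas on made-up document frequencies; `toy_N` and `toy_df` are hypothetical values, not taken from the BBC data.

```python
import numpy as np

# Hypothetical corpus size and document frequencies for three words
toy_N = 4
toy_df = np.array([4, 2, 1])

# Plain idf, as used in this notebook
idf_plain = np.log(toy_N / toy_df)

# Smoothed idf, matching scikit-learn's default (smooth_idf=True)
idf_smooth = np.log((1 + toy_N) / (1 + toy_df)) + 1

print(idf_plain)
print(idf_smooth)
```

With the plain formula a word that occurs in every document gets an idf of exactly 0, whereas the smoothed formula assigns it 1, so such words are down-weighted rather than removed.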

In [29]:
# Compute TF-IDF: NumPy broadcasts idf, shape (w,), across the N rows of tf, shape (N, w)
tf_idf = tf * idf
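
To make the broadcasting step concrete, here is a small sketch with made-up numbers. It checks that multiplying an (N, w) matrix by a (w,) vector gives the same result as explicitly repeating the vector for every row. The `toy_` names are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
import numpy as np

# Hypothetical term frequencies for 2 documents and 3 words
toy_tf = np.array([[1.0, 0.0, 2.0],
                   [0.0, 3.0, 1.0]])
# Hypothetical idf weights, one per word
toy_idf = np.array([0.5, 1.0, 2.0])

# Broadcasting: toy_idf is applied to every row of toy_tf
toy_tf_idf = toy_tf * toy_idf

# Equivalent explicit version: repeat the idf vector once per document
explicit = toy_tf * np.tile(toy_idf, (toy_tf.shape[0], 1))

print(toy_tf_idf)
print(np.array_equal(toy_tf_idf, explicit))
```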

In [30]:
tf_idf.shape

(2225, 34762)

In [31]:
np.random.seed(123)  # To get consistent result for the next part of the notebook

In [32]:
# Let's take a random document and show its top 5 words by TF-IDF score
random_doc = np.random.randint(N)
row = df.iloc[random_doc]
print("Label : ", row['labels'])
print("Text : ", row['text'].split("\n",1)[0])
print("Top 5 words")
scores = tf_idf[random_doc]
indices = (-scores).argsort()

for j in indices[:5] :
    print(idx_to_word[j])

Label :  sport
Text :  Athens memories soar above lows
Top 5 words
paula
athens
1500m
her
kelly
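
Sorting all 34762 scores just to read off five of them works, but it is more than is needed. As an optional alternative, `np.argpartition` can select the top k indices without a full sort, after which only those k entries need ordering. The sketch below uses a made-up score vector; `toy_scores` and `k` are hypothetical.

```python
import numpy as np

# Hypothetical TF-IDF scores for a 6-word vocabulary
toy_scores = np.array([0.1, 0.9, 0.3, 0.7, 0.0, 0.5])
k = 3

# Indices of the k largest scores, in arbitrary order (requires k < len(toy_scores))
top = np.argpartition(-toy_scores, k)[:k]

# Order just those k indices by descending score
top = top[np.argsort(-toy_scores[top])]
print(top)
```

For a single document the difference is negligible; it would matter mainly if the top words were extracted for every row of the matrix.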
