### Implementing Bag of Words without using the sklearn library

## Q. What is Bag of Words?

Bag of Words is a technique for converting text into numerical features.

It describes the occurrence of words within a document.

It tells us two things:

    1. Whether a word is present or not
    2. How frequently the word occurs
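
As a minimal sketch of these two views, assuming simple whitespace tokenization: the two example sentences and the helper names `count_vector` and `presence_vector` are made up for illustration.

```python
from collections import Counter

# Hypothetical two-document corpus, tokenized on whitespace.
docs = ["the cat sat on the mat", "the dog sat"]

# Shared vocabulary: every distinct word, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc, vocabulary):
    """Frequency view: how many times each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


def presence_vector(doc, vocabulary):
    """Presence view: 1 if the vocabulary word occurs in doc, else 0."""
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocabulary]


print(vocabulary)                         # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(count_vector(docs[0], vocabulary))     # [1, 0, 1, 1, 1, 2]
print(presence_vector(docs[0], vocabulary))  # [1, 0, 1, 1, 1, 1]
```

The two vectors differ only for "the", which occurs twice in the first sentence.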


In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from tqdm import tqdm
import os

In [2]:
# Equivalent of the fit method of sklearn's CountVectorizer: builds the vocabulary
from collections import Counter
from scipy.sparse import csr_matrix
from tqdm import tqdm

def fit(dataset):
    unique_words=set()
    if isinstance(dataset,(list,)):
        for row in dataset:
            for word in row.split(" "):
                if len(word)<2:
                    continue
                unique_words.add(word)
        unique_words=sorted(list(unique_words))
        vocab={j:i for i,j in enumerate(unique_words)}

        return vocab
    else:
        print("enter list of sentence")


In [3]:
# Transform method in BOW: builds a sparse document-term count matrix
def transform(dataset,vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset,(list,)):
        for idx,row in enumerate(tqdm(dataset)):
            word_freq=dict(Counter(row.split()))
            for word,freq in word_freq.items():
                if len(word)<2:
                    continue
                # vocab.get returns -1 for words that are not in the vocabulary
                col_index=vocab.get(word,-1)
                if col_index !=-1:
                    rows.append(idx)
                    columns.append(col_index)
                    values.append(freq)
        return csr_matrix((values, (rows,columns)), shape=(len(dataset),len(vocab)))

    else:
        print("Enter list of sentence")


In [4]:
strings = ["the method of lagrange multipliers is the economists workhorse for solving optimization problems",
           "the technique is a centerpiece of economic theory but unfortunately its usually taught poorly"]
vocab = fit(strings)

In [5]:
trns=transform(strings,vocab)
print(list(vocab.keys()))
print(trns.toarray())

100%|██████████| 2/2 [00:00<00:00, 508.96it/s]

['but', 'centerpiece', 'economic', 'economists', 'for', 'is', 'its', 'lagrange', 'method', 'multipliers', 'of', 'optimization', 'poorly', 'problems', 'solving', 'taught', 'technique', 'the', 'theory', 'unfortunately', 'usually', 'workhorse']
[[0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 2 0 0 0 1]
 [1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0]]
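
The `transform` function builds its result from three parallel lists (row indices, column indices, values). As a small sketch of how `scipy.sparse.csr_matrix` turns such triplets into a matrix, the triplets below are made up and stand for "document 0 contains word 2 once and word 0 twice, document 1 contains word 1 once":

```python
from scipy.sparse import csr_matrix

# Hypothetical (row, column, value) triplets, one entry per stored count.
rows = [0, 0, 1]
columns = [2, 0, 1]
values = [1, 2, 1]

# shape = (number of documents, vocabulary size)
matrix = csr_matrix((values, (rows, columns)), shape=(2, 3))
dense = matrix.toarray()

print(dense)
# [[2 0 1]
#  [0 1 0]]
print(matrix.nnz)  # 3 stored entries; the zeros are not stored
```

Only the non-zero counts are stored, which matters for real corpora where each document uses a small fraction of the vocabulary.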





In [6]:
dictt={}
for row in list(vocab.keys()):
    dictt[row]=0


dictt

{'but': 0,
 'centerpiece': 0,
 'economic': 0,
 'economists': 0,
 'for': 0,
 'is': 0,
 'its': 0,
 'lagrange': 0,
 'method': 0,
 'multipliers': 0,
 'of': 0,
 'optimization': 0,
 'poorly': 0,
 'problems': 0,
 'solving': 0,
 'taught': 0,
 'technique': 0,
 'the': 0,
 'theory': 0,
 'unfortunately': 0,
 'usually': 0,
 'workhorse': 0}

### Implementing TF-IDF without using the sklearn library


<strong>What is TF-IDF? What is the mathematics behind TF-IDF?</strong>


<strong>Term Frequency (TF)</strong>: It measures how frequently a word occurs in a document. The more often a word appears in a document, the more weight it is given.

$TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}.$

<strong>Inverse Document Frequency (IDF)</strong>: It measures how important a term is across the corpus. If a word appears in many documents, it carries less importance.
For example, words like "is", "the", and "and" are present in almost all documents in the corpus, so they carry less importance than words that appear in only a few documents.

$IDF(t) = \log_{e}\frac{\text{Total number of documents}}{\text{Number of documents with term t in it}}.$

For numerical stability, the implementation below changes this formula slightly: it adds 1 to both the numerator and the denominator, and 1 to the result. This is the smoothing that scikit-learn's `TfidfVectorizer` applies by default.

$IDF(t) = 1 + \log_{e}\frac{1 + \text{Total number of documents}}{1 + \text{Number of documents with term t in it}}.$
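
As a worked illustration of the basic (unsmoothed) definitions of TF and IDF, here is a minimal sketch on a made-up three-document corpus. The helper names `term_frequency`, `document_frequency`, `idf`, and `tf_idf` are hypothetical and chosen only for this example.

```python
import math

# Hypothetical toy corpus, tokenized on whitespace.
corpus = ["the cat sat", "the dog sat down", "the bird flew"]
tokenized = [doc.split() for doc in corpus]


def term_frequency(term, tokens):
    """Occurrences of term divided by the number of terms in the document."""
    return tokens.count(term) / len(tokens)


def document_frequency(term, tokenized_docs):
    """Number of documents that contain term at least once."""
    return sum(1 for tokens in tokenized_docs if term in tokens)


def idf(term, tokenized_docs):
    """Unsmoothed IDF: log(N / df). Only defined when df > 0."""
    return math.log(len(tokenized_docs) / document_frequency(term, tokenized_docs))


def tf_idf(term, tokens, tokenized_docs):
    return term_frequency(term, tokens) * idf(term, tokenized_docs)


# "the" is in all 3 documents, so log(3/3) = 0 and it carries no weight.
print(idf("the", tokenized))
# "sat" is in 2 documents, "cat" in only 1, so "cat" gets the larger IDF.
print(idf("sat", tokenized), idf("cat", tokenized))
# TF-IDF of "cat" in the first document: (1/3) * log(3)
print(tf_idf("cat", tokenized[0], tokenized))
```

A term that never occurs in the corpus has a document frequency of 0, so the unsmoothed formula would divide by zero; this is the case the adjusted formula guards against.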



In [23]:
import math
def transform_tfidf(dataset,vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset,(list,)):
        total_number_of_documents=len(dataset)

        # Document frequency: number of documents that contain each vocabulary term
        Number_of_documents_with_term_t={key:0 for key in vocab}
        for row in tqdm(dataset):
            # set() counts each document at most once per word
            for word in set(row.split()):
                # skip words that are not in the vocabulary (e.g. one-letter words)
                if word in Number_of_documents_with_term_t:
                    Number_of_documents_with_term_t[word]+=1

        #TF-IDF calculation
        for idx,row in enumerate(tqdm(dataset)):
            words=row.split()
            word_freq=dict(Counter(words))
            for word, freq in word_freq.items():
                col_index=vocab.get(word,-1)
                if col_index !=-1:
                    rows.append(idx)
                    columns.append(col_index)
                    # TF: term count divided by the number of terms (not characters) in the document
                    TF=freq/len(words)
                    # Smoothed IDF: log((1+N)/(1+df)) + 1
                    IDF=math.log((1+total_number_of_documents)/(1+Number_of_documents_with_term_t[word]))+1
                    values.append(TF*IDF)
        return csr_matrix((values, (rows,columns)), shape=(len(dataset),len(vocab)))
    else:
        print("Enter list of document")
                    










In [24]:
corpus = [
'this is the first document',
'this document is the second document',
'and this is the third one',
'is this the first document',
]
vocab=fit(corpus)
print(vocab)


{'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4, 'second': 5, 'the': 6, 'third': 7, 'this': 8}
