# Naive Bayes Classifier - Implementation from Scratch
This notebook contains the code to implement a Naive Bayes classifier from scratch.

### Importing required libraries

In [105]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from scipy.special import logsumexp
import re

### Importing the dataset
- This is a Kaggle dataset of news articles, each labelled with the category it belongs to, such as business or sport.
- The objective is to build a classifier that predicts the category of a news article from its text.
- Sample records from the dataset are shown below for reference.

In [2]:
inp_dataset = pd.read_csv("C:\\Ujjwal\\Analytics\\Datasets\\News Classification\\News_train.csv")
inp_dataset.head(2)

| | ArticleId | Text | Category |
|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business |
| 1 | 154 | german business confidence slides german busin... | business |


### Cleaning the articles to remove the unwanted characters
- In this step we clean the dataset and divide it into training and test sets.
- The cleaning is done in the function **text_clean**. Because the focus is on identifying the words that are common in each category of articles, the function removes all numbers, punctuation and special characters from the text, and replaces any run of extra whitespace with a single space.
- Finally, the data is split into training and test sets using *sklearn's function* **train_test_split**.
- Notice the number of records in the original, training and testing datasets.

In [3]:
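def text_clean(text_series):
    text_series = text_series.str.lower()
    # Keep only letters and whitespace. regex=True is needed because
    # pandas >= 2.0 treats the pattern as a literal string by default.
    clean_1 = text_series.str.replace(r"[^a-zA-Z\s]", "", regex=True)
    # Collapse runs of whitespace into a single space.
    clean_2 = clean_1.str.replace(r"\s+", " ", regex=True)
    return clean_2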
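
To make the effect of the cleaning rules concrete, here is a minimal sketch that applies the same three steps to a single string with the standard `re` module. The function name `clean_text`, the sample sentence and the final `strip()` are illustrative assumptions, not part of the notebook.

```python
import re

def clean_text(text):
    """Lower-case, keep letters and whitespace only, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)   # drop digits, punctuation, symbols
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()                    # assumption: also trim the ends

# Hypothetical sample sentence, loosely modelled on the first article
sample = "WorldCom ex-boss launches   defence; lawyers met in 2005!"
cleaned = clean_text(sample)
print(cleaned)
```

Note that "ex-boss" becomes "exboss": removing the hyphen joins the two parts, which is the same behaviour visible in the Text_Clean column of the dataset.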

In [4]:
inp_dataset["Text_Clean"] = text_clean(inp_dataset["Text"])
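
The cells shown here do not include the split itself. As a rough illustration of what a holdout split does, the sketch below shuffles a list of records and cuts it into two parts. It is a plain-Python stand-in for `train_test_split`, and the function name `holdout_split`, the 20% test fraction, the seed and the toy records are all assumptions.

```python
import random

def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy (text, category) records, purely illustrative
articles = [("text %d" % i, "sport" if i % 2 else "business") for i in range(10)]
train, test = holdout_split(articles)
print(len(articles), len(train), len(test))
```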

### Creating Bag of Words
- In this step we create a dataframe that holds the number of times each word occurs in each document.
- This is done using *sklearn's* **CountVectorizer**. We have **passed the argument stop_words as "english" so that the vectorizer removes the English stop words on its own**.
- Observe the sample records from the dataframe. The number 1 below abacus in record 4 indicates that the **word abacus occurred once in this document**.

**Note:**

- Since the number of unique words across all the documents is very high, not all the words are visible

In [5]:
Cnt_Vec = CountVectorizer(stop_words="english")
BOW = Cnt_Vec.fit_transform(inp_dataset["Text_Clean"]).toarray()
# scikit-learn >= 1.0; older versions use get_feature_names() instead
BOW_Df = pd.DataFrame(BOW, columns=Cnt_Vec.get_feature_names_out())
BOW_Df[4:8]

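| | aa | aaa | aaas | aac | aadc | aaliyah | aaltra | aamir | aaron | aashare | ... | zonealarm | zones | zoom | zooms | zooropa | zorro | zuluaga | zurich | zutons | zvonareva |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |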
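
The idea behind the count matrix can be sketched without sklearn. The example below builds the same kind of document-by-word table with `collections.Counter`. The two toy documents and the tiny stop-word set are assumptions made for illustration and are not sklearn's English stop-word list.

```python
from collections import Counter

# Toy corpus and stop-word list (illustrative only)
docs = ["shares fell as profits fell", "the team won the match"]
stop_words = {"the", "as"}

# Tokenise on whitespace and drop stop words
tokenised = [[w for w in d.split() if w not in stop_words] for d in docs]

# Vocabulary in alphabetical order, as CountVectorizer orders its columns
vocab = sorted({w for doc in tokenised for w in doc})

# One row per document, one column per vocabulary word
matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenised]

print(vocab)
for row in matrix:
    print(row)
```

Each row is mostly zeros even in this tiny case, which is why the real table with roughly 25,000 columns shows only zeros in the visible corner.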


### Merging Dataframes & Creating training and test set
- To have the article category alongside the word frequencies, we merge the two dataframes
- This is needed because we are implementing the Naive Bayes approach from scratch rather than using sklearn's built-in functions

In [6]:
inp_dataset_final = pd.merge(inp_dataset, BOW_Df, left_index=True, right_index=True, how = "left")
inp_dataset_final.head(2)

| | ArticleId | Text | Category | Text_Clean | aa | aaa | aaas | aac | aadc | aaliyah | ... | zonealarm | zones | zoom | zooms | zooropa | zorro | zuluaga | zurich | zutons | zvonareva |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business | worldcom exboss launches defence lawyers defen... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 154 | german business confidence slides german busin... | business | german business confidence slides german busin... | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### Consolidating Bag of Words on the basis of News Categories
- In this section we sum the word frequencies within each news category
- An additional column, sum_all_words, is added to the dataframe. It holds the total count of all words across the documents in each category
- The ArticleId column holds the number of articles in each category
- Refer to the sample records shown below

In [7]:
cols = [col for col in inp_dataset_final.columns if col not in ["Text", "Category", "Text_Clean"]]
# Count the articles and sum the word frequencies per category (the text columns are skipped)
cons_dataset = inp_dataset_final.groupby("Category").agg({col: "count" if col == "ArticleId" else "sum" for col in cols})
cons_dataset_updated = cons_dataset[cols].copy()
cons_dataset_updated.reset_index(inplace=True)
# Total word occurrences per category; columns 0 and 1 are Category and ArticleId
cons_dataset_updated["sum_all_words"] = cons_dataset_updated.iloc[:, 2:].sum(axis=1)
cons_dataset_updated.head()

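| | Category | ArticleId | aa | aaa | aaas | aac | aadc | aaliyah | aaltra | aamir | ... | zones | zoom | zooms | zooropa | zorro | zuluaga | zurich | zutons | zvonareva | sum_all_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | business | 336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 58683 |
| 1 | entertainment | 273 | 0 | 0 | 0 | 0 | 0 | 4 | 1 | 1 | ... | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 46998 |
| 2 | politics | 274 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 61750 |
| 3 | sport | 346 | 0 | 4 | 1 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 5 | 0 | 4 | 56255 |
| 4 | tech | 261 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | ... | 1 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 65823 |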
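
The grouping step boils down to three tallies per category: word counts, article counts and the total number of word occurrences. A minimal plain-Python sketch of those tallies follows. The toy documents and the variable names are assumptions, chosen to mirror the ArticleId and sum_all_words columns.

```python
from collections import Counter, defaultdict

# Toy labelled, already tokenised documents (illustrative only)
labelled_docs = [
    ("business", ["shares", "fell", "profits", "fell"]),
    ("sport", ["team", "won", "match"]),
    ("sport", ["match", "postponed"]),
]

word_counts = defaultdict(Counter)   # category -> word -> count
article_counts = Counter()           # category -> number of articles

for category, tokens in labelled_docs:
    word_counts[category].update(tokens)
    article_counts[category] += 1

# Equivalent of the sum_all_words column
sum_all_words = {c: sum(counts.values()) for c, counts in word_counts.items()}

print(dict(article_counts))
print(sum_all_words)
```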


### Creating Probability Table
- Using the consolidated table we created in the previous section, we will now calculate the probability of each word occurring in each article category
- Laplace smoothing is built in so that words with zero frequency in a category do not end up with a probability of zero. This table will be used to calculate the class probabilities of the test documents
- See the sample dataframe output

In [9]:
alpha = 1  # Laplace smoothing constant
prob_table = pd.DataFrame()
prob_table["Category"] = cons_dataset_updated["Category"]
# Class prior P(C): share of all articles that fall in each category
prob_table["p_C"] = cons_dataset_updated["ArticleId"]/inp_dataset_final.shape[0]
cols = [col for col in cons_dataset_updated.columns if col not in ["Category", "ArticleId", "sum_all_words"]]
no_of_cols = len(cols)  # vocabulary size
for col in cols:
    # Smoothed P(word | C) = (count + alpha) / (total words in C + alpha * vocabulary size)
    prob_table[col] = (cons_dataset_updated[col]+alpha)/(cons_dataset_updated["sum_all_words"] + (alpha*no_of_cols))

In [92]:
prob_table.head()

| | Category | p_C | aa | aaa | aaas | aac | aadc | aaliyah | aaltra | aamir | ... | zonealarm | zones | zoom | zooms | zooropa | zorro | zuluaga | zurich | zutons | zvonareva |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | business | 0.225503 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | ... | 1.2e-05 | 4.8e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 3.6e-05 | 1.2e-05 | 1.2e-05 |
| 1 | entertainment | 0.183221 | 1.4e-05 | 1.4e-05 | 1.4e-05 | 1.4e-05 | 1.4e-05 | 6.9e-05 | 2.8e-05 | 2.8e-05 | ... | 1.4e-05 | 1.4e-05 | 1.4e-05 | 1.4e-05 | 2.8e-05 | 4.2e-05 | 1.4e-05 | 1.4e-05 | 2.8e-05 | 1.4e-05 |
| 2 | politics | 0.183893 | 2.3e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | ... | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 |
| 3 | sport | 0.232215 | 1.2e-05 | 6.1e-05 | 2.5e-05 | 4.9e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | ... | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 1.2e-05 | 2.5e-05 | 7.4e-05 | 1.2e-05 | 6.1e-05 |
| 4 | tech | 0.175168 | 1.1e-05 | 1.1e-05 | 2.2e-05 | 1.1e-05 | 4.4e-05 | 1.1e-05 | 1.1e-05 | 1.1e-05 | ... | 2.2e-05 | 2.2e-05 | 4.4e-05 | 3.3e-05 | 1.1e-05 | 1.1e-05 | 1.1e-05 | 1.1e-05 | 1.1e-05 | 1.1e-05 |
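
The smoothed estimate used in the table is P(word | C) = (count of word in C + alpha) / (sum_all_words of C + alpha * vocabulary size). The sketch below applies that formula to a toy category in plain Python. The function name `smoothed_probs`, the six-word vocabulary and the counts are assumptions for illustration.

```python
def smoothed_probs(counts, vocab, alpha=1):
    """Laplace-smoothed P(word | category) for every word in vocab."""
    total = sum(counts.get(w, 0) for w in vocab)   # sum_all_words for this category
    denom = total + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / denom for w in vocab}

# Toy vocabulary and counts for one category (illustrative only)
vocab = ["fell", "match", "profits", "shares", "team", "won"]
business_counts = {"fell": 2, "profits": 1, "shares": 1}

probs = smoothed_probs(business_counts, vocab)
print(probs["fell"], probs["match"], sum(probs.values()))
```

A word that never appears in the category ("match" here) still receives a small non-zero probability, and the probabilities over the whole vocabulary still sum to 1.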


### Creating a word tokenizer
- Here we define a word tokenizer function that splits the test articles into individual words
- We could have used our CountVectorizer object to transform the text, but this function is defined to demonstrate the process of tokenizing

In [11]:
def wt(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    # split() without arguments drops the empty strings that leading or trailing spaces would produce
    list_of_words = text.split()
    return list_of_words

### Testing the classification
- Using the probability table created in the previous section, a sample sentence is classified
- As can be seen from the results, the sentence is assigned to the sport category with a posterior probability of about 95%

In [67]:
text = "game went really good"

In [68]:
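prob = pd.DataFrame()
prob["Category"] = prob_table["Category"]
# Start from the class prior P(C) and multiply in P(word | C) for every known word
prob["prob"] = prob_table['p_C']
for val in wt(text):
    # Words that are not in the training vocabulary are skipped
    if val in prob_table.columns:
        prob["prob"] = prob["prob"] * prob_table[val]
# Normalise so that the class scores sum to 1
prob["probability"] = prob["prob"]/prob["prob"].sum()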

In [69]:
prob

| | Category | prob | probability |
|---|---|---|---|
| 0 | business | 1.823821e-16 | 0.000104 |
| 1 | entertainment | 2.270805e-14 | 0.012974 |
| 2 | politics | 7.476354e-15 | 0.004271 |
| 3 | sport | 1.664447e-12 | 0.950936 |
| 4 | tech | 5.551025e-14 | 0.031714 |
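
The unnormalised scores above are already as small as 1e-16 for a four-word sentence, so the product would underflow to zero for a full article. A common remedy is to work with logarithms and normalise with the log-sum-exp trick. The sketch below, in plain Python, applies that normalisation to the `prob` values from the table. In a real implementation the log prior and log word probabilities would be added up directly rather than taking the log of the finished product.

```python
import math

# Unnormalised class scores copied from the prob column above
scores = {
    "business": 1.823821e-16,
    "entertainment": 2.270805e-14,
    "politics": 7.476354e-15,
    "sport": 1.664447e-12,
    "tech": 5.551025e-14,
}

log_scores = {c: math.log(s) for c, s in scores.items()}

# log-sum-exp: subtract the maximum before exponentiating to stay in range
max_log = max(log_scores.values())
log_total = max_log + math.log(sum(math.exp(v - max_log) for v in log_scores.values()))

posterior = {c: math.exp(v - log_total) for c, v in log_scores.items()}
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 6))
```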


### Creating probability table of training dataset using Sklearn

In [70]:
Multi_NB = MultinomialNB(alpha=1)
cols = [col for col in inp_dataset_final.columns if col not in ["Category","ArticleId", "Text", "Text_Clean"]]
Multi_NB.fit(inp_dataset_final[cols], inp_dataset_final["Category"])

MultinomialNB(alpha=1)

In [71]:
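text_list = wt(text)
text_dict = {}
for word in text_list:
    text_dict[word] = text_dict.get(word, 0) + 1
# Build the count vector in the training column order. np.dot is positional, so a
# different order (or extra out-of-vocabulary columns) would pair the counts with the
# wrong features. The array printed under In [117] reflects the unaligned order, with
# the text words placed first, so it does not match the from-scratch result.
df = pd.DataFrame([[text_dict.get(word, 0) for word in cols]], columns=cols)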
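
Because the fitted model's log-probability matrix is indexed by position, the count vector for a new text has to follow the training vocabulary order exactly. A minimal plain-Python sketch of that mapping follows. The function name `to_count_vector`, the five-word vocabulary and the token list are assumptions for illustration.

```python
from collections import Counter

def to_count_vector(tokens, vocabulary):
    """Counts in vocabulary order; out-of-vocabulary tokens are dropped."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

# Toy vocabulary in training-column order, and a toy token list
vocabulary = ["aa", "game", "good", "match", "went"]
tokens = ["game", "went", "really", "good", "game"]

vector = to_count_vector(tokens, vocabulary)
print(vector)
```

The result has exactly one entry per vocabulary word, and "really", which is not in the toy vocabulary, contributes nothing.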

In [148]:
np.log(prob_table.iloc[0,2:].astype(np.float64))

aa          -11.335532
aaa         -11.335532
aaas        -11.335532
aac         -11.335532
aadc        -11.335532
               ...    
zorro       -11.335532
zuluaga     -11.335532
zurich      -10.236919
zutons      -11.335532
zvonareva   -11.335532
Name: 0, Length: 25062, dtype: float64

In [126]:
Multi_NB.feature_log_prob_

array([[-11.33553175, -11.33553175, -11.33553175, ..., -10.23691946,
        -11.33553175, -11.33553175],
       [-11.18525438, -11.18525438, -11.18525438, ..., -11.18525438,
        -10.4921072 , -11.18525438],
       [-10.67835296, -11.37150014, -11.37150014, ..., -11.37150014,
        -11.37150014, -11.37150014],
       [-11.30611038,  -9.69667246, -10.6129632 , ...,  -9.51435091,
        -11.30611038,  -9.69667246],
       [-11.41735025, -11.41735025, -10.72420307, ..., -11.41735025,
        -11.41735025, -11.41735025]])
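
The first row of `feature_log_prob_` agrees with the from-scratch table, and both can be reproduced by hand from figures already shown: the business category has sum_all_words = 58683, the vocabulary has 25062 words, and alpha is 1. The sketch below recomputes two entries with the standard `math` module. The variable names are illustrative.

```python
import math

alpha = 1
vocab_size = 25062        # length of the log-probability series above
business_total = 58683    # sum_all_words for the business category

denom = business_total + alpha * vocab_size

# A word never seen in business articles (for example "aa")
log_p_unseen = math.log((0 + alpha) / denom)

# "zurich" occurs twice in business articles
log_p_zurich = math.log((2 + alpha) / denom)

print(round(log_p_unseen, 6), round(log_p_zurich, 6))
```

These come out at roughly -11.3355 and -10.2369, the same values that appear in both the from-scratch output and sklearn's matrix.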

In [106]:
arr = np.dot(df, Multi_NB.feature_log_prob_.T) + Multi_NB.class_log_prior_

In [117]:
np.exp(arr - logsumexp(arr))

array([[0.01947449, 0.02886343, 0.02750577, 0.90234589, 0.02181042]])