# Naive Bayes Classifier - Implementation from scratch 
This notebook contains code that implements a Naive Bayes classifier from scratch.

### Importing required libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import re

### Importing the dataset
- This is a Kaggle dataset of news articles, each labelled with the category it belongs to (business, sport, etc.)
- The objective is to build a classifier that predicts the category of a news article from its text
- Sample records from the dataset are shown for reference

In [3]:
inp_dataset = pd.read_csv("C:\\Ujjwal\\Analytics\\Datasets\\News Classification\\News_train.csv")
inp_dataset.head(2)

Unnamed: 0,ArticleId,Text,Category
0,1833,worldcom ex-boss launches defence lawyers defe...,business
1,154,german business confidence slides german busin...,business


### Cleaning the articles to remove the unwanted characters
- In this step we clean the dataset and divide it into train and test sets
- The data cleaning is done in the function **text_clean**. Since the focus is on identifying words that are common in a specific category of articles, the function removes all numbers, punctuation and special characters from the text. It also replaces any run of whitespace with a single space.
- Finally, we split the data into train and test sets using *sklearn's function* **train_test_split**
- Notice the number of records in the original, training and testing datasets

In [4]:
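def text_clean(text_series):
    # Drop everything except letters and whitespace (regex=True makes pandas treat the pattern as a regular expression)
    clean_1 = text_series.str.replace(r"[^a-zA-Z\s]", "", regex=True)
    # Collapse any run of whitespace into a single space
    clean_2 = clean_1.str.replace(r"\s+", " ", regex=True)
    return clean_2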
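
To make the effect of the two patterns concrete, here is a minimal sketch that applies them to plain strings with the standard `re` module. The helper name `clean_one` and the sample sentences are made up for illustration and are not part of the notebook's pipeline.

```python
import re

def clean_one(text):
    # Hypothetical single-string counterpart of text_clean
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text)   # remove digits, punctuation, symbols
    return re.sub(r"\s+", " ", letters_only)           # collapse runs of whitespace

examples = ["Profits rose 12.5% in Q3!", "Goal!!!   2-1 at   half-time"]
cleaned = [clean_one(t) for t in examples]

for before, after in zip(examples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that removing characters can glue fragments together ("half-time" becomes "halftime") and can leave stray letters behind ("Q3" becomes "Q"), which is a trade-off of this simple cleaning rule.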

In [5]:
inp_dataset["Text_Clean"] = text_clean(inp_dataset["Text"])

In [6]:
Y = inp_dataset['Category']
train_x, test_x, train_y, test_y  = train_test_split(inp_dataset,Y,random_state = 8)
train_x.reset_index(inplace = True, drop = True)

In [8]:
print(f'Number of records in original dataset - {inp_dataset.shape[0]}\nNumber of records in training dataset - {train_x.shape[0]}\nNumber of records in testing dataset - {test_x.shape[0]}')

Number of records in original dataset - 1490
Number of records in training dataset - 1117
Number of records in testing dataset - 373


### Creating Bag of Words
- In this step we create a dataframe that holds the frequency of each word in each document
- This is done using *sklearn's* **CountVectorizer** class. We have **passed the argument stop_words as "english" so that it removes the English stop words on its own**.
- Observe the sample records from the dataframe. The number 1 below abacus in record 4 indicates that the **word abacus occurred once in this document**.

**Note:**

- Since the number of unique words across all the documents is very high, not all the words are visible

In [14]:
Cnt_Vec = CountVectorizer(stop_words="english")
BOW = Cnt_Vec.fit_transform(train_x["Text_Clean"]).toarray()
BOW_Df = pd.DataFrame(BOW, columns=Cnt_Vec.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases
BOW_Df[4:8]

Unnamed: 0,aa,aaa,aaas,aac,aadc,aaliyah,aamir,aaron,aashare,abacus,...,zombies,zone,zones,zoom,zooms,zooropa,zuluaga,zurich,zutons,zvonareva
4,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
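
As a rough illustration of what the count matrix contains, the sketch below builds a tiny bag of words in plain Python. It is a simplification and not how CountVectorizer is implemented: the two documents are invented, and lowercasing, stop-word removal and sklearn's tokenization rules are all left out.

```python
from collections import Counter

# Two invented documents, already cleaned and free of stop words
docs = ["stocks fall markets slide", "markets rally stocks rise stocks"]

# Vocabulary: every distinct word, in alphabetical order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[word] for word in vocab])

print(vocab)
for row in rows:
    print(row)
```

Each row plays the role of one record in the dataframe above: a column value of 2 under "stocks" means that word occurred twice in that document.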


### Merging Dataframes & Creating training and test set
- To get the article category values along with the word frequency, we are merging the 2 datasets
- This is done because we will be implementing the Naive Bayes approach from scratch and not using sklearn's inbuilt functions

In [16]:
inp_dataset_final = pd.merge(train_x, BOW_Df, left_index=True, right_index=True, how = "left")
inp_dataset_final.head(2)

Unnamed: 0,ArticleId,Text,Category,Text_Clean,aa,aaa,aaas,aac,aadc,aaliyah,...,zombies,zone,zones,zoom,zooms,zooropa,zuluaga,zurich,zutons,zvonareva
0,912,warning over tsunami aid website net users are...,tech,warning over tsunami aid website net users are...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,901,kenteris denies faking road crash greek sprint...,sport,kenteris denies faking road crash greek sprint...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Consolidating Bag of Words on the basis of News Categories
- In this section we are grouping all the word frequencies on the basis of different news categories they fall in
- An additional column is added to the dataframe which contains the count of all the words across the documents in specific category
- The ArticleId column holds the count of articles in each category
- Refer to the sample records as shown below

In [18]:
word_cols = [col for col in inp_dataset_final.columns if col not in ["ArticleId", "Text", "Category", "Text_Clean"]]
# Count the articles per category and sum each word's frequency; the text columns are left out of the aggregation
agg_map = {"ArticleId": "count", **{col: "sum" for col in word_cols}}
cons_dataset_updated = inp_dataset_final.groupby("Category").agg(agg_map)
cons_dataset_updated.reset_index(inplace=True)
# Total number of word occurrences in each category
cons_dataset_updated["sum_all_words"] = cons_dataset_updated[word_cols].sum(axis=1)
cons_dataset_updated.head()

Unnamed: 0,Category,ArticleId,aa,aaa,aaas,aac,aadc,aaliyah,aamir,aaron,...,zone,zones,zoom,zooms,zooropa,zuluaga,zurich,zutons,zvonareva,sum_all_words
0,business,245,0,0,0,0,0,0,0,0,...,1,3,0,0,0,0,1,0,0,42126
1,entertainment,211,0,0,0,0,0,4,1,0,...,0,0,0,0,1,0,0,1,0,35964
2,politics,208,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,48353
3,sport,261,0,2,1,3,0,0,0,2,...,4,0,0,0,0,1,4,0,1,42686
4,tech,192,0,0,0,0,3,0,0,0,...,1,1,3,1,0,0,0,0,0,48323


### Creating Probability Table
- Using the consolidated table created in the previous section, we now calculate the probability of each word occurring in each article category
- Laplace smoothing is built in so that a word with zero frequency in a category does not get a zero probability. This table will be used to calculate the category probabilities for the test documents
- See the sample dataframe output

In [19]:
alpha = 1  # Laplace smoothing constant
prob_table = pd.DataFrame()
prob_table["Category"] = cons_dataset_updated["Category"]
# Prior P(C): share of training articles that belong to each category
prob_table["p_C"] = cons_dataset_updated["ArticleId"]/cons_dataset_updated["ArticleId"].sum()
# Word columns only; sum_all_words is a total, not a word
cols = [col for col in cons_dataset_updated.columns if col not in ["Category", "ArticleId", "sum_all_words"]]
no_of_cols = len(cols)  # vocabulary size
# Smoothed P(word | C) = (count + alpha) / (total words in C + alpha * vocabulary size)
word_probs = (cons_dataset_updated[cols] + alpha).div(cons_dataset_updated["sum_all_words"] + (alpha*no_of_cols), axis=0)
prob_table = pd.concat([prob_table, word_probs], axis=1)

In [23]:
prob_table.head(2)

Unnamed: 0,Category,p_C,aa,aaa,aaas,aac,aadc,aaliyah,aamir,aaron,...,zone,zones,zoom,zooms,zooropa,zuluaga,zurich,zutons,zvonareva
0,business,0.219338,1.6e-05,1.6e-05,1.6e-05,1.6e-05,1.6e-05,1.6e-05,1.6e-05,1.6e-05,...,3.1e-05,6.2e-05,1.6e-05,1.6e-05,1.6e-05,1.6e-05,3.1e-05,1.6e-05,1.6e-05
1,entertainment,0.188899,1.7e-05,1.7e-05,1.7e-05,1.7e-05,1.7e-05,8.6e-05,3.4e-05,1.7e-05,...,1.7e-05,1.7e-05,1.7e-05,1.7e-05,3.4e-05,1.7e-05,1.7e-05,3.4e-05,1.7e-05
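
The smoothing formula can be checked by hand on a toy example. The sketch below uses invented counts for a three-word vocabulary; the function name `smoothed_prob` and all numbers are assumptions for illustration, not values from the dataset.

```python
def smoothed_prob(word_count, total_words, vocab_size, alpha=1):
    # Laplace-smoothed estimate of P(word | category)
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# Invented word counts for one category; "market" never occurs in it
counts = {"goal": 3, "market": 0, "match": 1}
total_words = sum(counts.values())   # 4 word occurrences in total
vocab_size = len(counts)             # 3 distinct words

probs = {word: smoothed_prob(c, total_words, vocab_size) for word, c in counts.items()}
for word, p in probs.items():
    print(word, round(p, 4))
```

With alpha = 1 the unseen word "market" gets probability 1/7 instead of 0, and the three probabilities still sum to 1 because the denominator grows by alpha for every vocabulary word.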


### Creating a word tokenizer
- Here we define a word tokenizer function to split the test articles into individual words
- We could have used our count vectorizer variable to transform the text, but this function is defined to demonstrate the process of tokenizing

In [24]:
def wt(text):
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = re.sub(r"\s+"," ", text)
    list_of_words = text.split(" ")
    return list_of_words

### Testing the classification
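- Using the probability table created in the previous section, the test dataset articles are classified
- Only the first 173 of the 373 test articles are evaluated, because the loop stops 200 records short of the full test set
- As can be seen from the results, close to 95% of the evaluated documents (164 of 173) were classified correctly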

In [25]:
test_x.reset_index(drop=True, inplace=True)

In [27]:
correct = 0
incorrect = 0
n_eval = test_x.shape[0]-200  # number of test articles that are evaluated
for i in range(n_eval):
    text = test_x.loc[i,"Text_Clean"]
    prob = pd.DataFrame()
    prob["Category"] = prob_table["Category"]
    # Start from the prior of every category
    prob["prob"] = prob_table['p_C']
    for val in wt(text):
        # Words that were not seen in training are skipped
        if val in prob_table.columns:
            # Multiply by P(word | category); the factor of 1000 slows down numerical underflow
            prob["prob"] = prob["prob"] * prob_table[val] * 1000
    # Normalize the scores so that they sum to 1 across categories
    prob["probability"] = prob["prob"]/prob["prob"].sum()
    prob.sort_values("probability",ascending = False, inplace=True)
    # The predicted category is the one with the highest probability
    if test_x.loc[i,'Category'] == prob.iloc[0,0]:
        correct += 1
    else:
        incorrect +=1
print(f'Total number of text documents evaluated - {n_eval}\nNumber of documents correctly classified - {correct}\nNumber of documents incorrectly classified - {incorrect}')

Total number of text documents evaluated - 173
Number of documents correctly classified - 164
Number of documents incorrectly classified - 9
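
The loop above multiplies many small probabilities together and uses a factor of 1000 to keep the product from shrinking too fast. A common alternative, sketched here under simplified assumptions, is to add log-probabilities instead. The function `classify_log`, the two categories and all probabilities below are invented for illustration; this is not the code that produced the results above.

```python
import math

def classify_log(tokens, priors, word_probs):
    # priors: {category: P(C)}, word_probs: {category: {word: P(word | C)}}
    scores = {}
    for category, prior in priors.items():
        score = math.log(prior)
        for token in tokens:
            p = word_probs[category].get(token)
            if p is not None:          # skip words that were not seen in training
                score += math.log(p)
        scores[category] = score
    best = max(scores, key=scores.get)
    return best, scores

priors = {"sport": 0.5, "tech": 0.5}
word_probs = {
    "sport": {"goal": 0.02, "software": 0.001},
    "tech": {"goal": 0.001, "software": 0.02},
}
tokens = ["goal"] * 400 + ["software", "unseenword"]

# Multiplying the raw probabilities underflows to 0.0 for a long document
direct_product = priors["sport"]
for _ in range(400):
    direct_product *= word_probs["sport"]["goal"]

label, scores = classify_log(tokens, priors, word_probs)
print(direct_product)
print(label, scores)
```

Because the logarithm is monotonic, the category with the largest log-score is the same one that would win with the plain product, but the sum stays in a range that floating-point numbers can represent.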
