
# Naïve Bayes from Scratch! 


# Outcome of this Tutorial - A Hands-On Scikit-learn Implementation of NB 
A complete walk-through of a Naïve Bayes (NB) implementation using Python's Holy Grail of Machine Learning - Scikit-learn


Let's begin with a few imports...

In [0]:
import pandas as pd 
import numpy as np 
from collections import defaultdict
import re 

Let's first write a handy text preprocessing function 

In [0]:
def preprocess_string(str_arg):
    
    """
        Parameters:
        ----------
        str_arg: example string to be preprocessed
        
        What does the function do?
        --------------------------
        Preprocess the string argument - str_arg - such that :
        1. every run of characters other than letters and whitespace is replaced by a space
        2. multiple spaces are replaced by a single space
        3. str_arg is converted to lower case 
        
        Example:
        --------
        Input :  Menu is absolutely perfect,loved it!
        Output:  'menu is absolutely perfect loved it ' (note the trailing space left by '!')
        

        Returns:
        ---------
        Preprocessed string (not a list of tokens; CountVectorizer tokenizes it later)
        
    """
    
    cleaned_str=re.sub(r'[^a-z\s]+',' ',str_arg,flags=re.IGNORECASE) #every run of non-letter, non-whitespace chars is replaced by a space
    cleaned_str=re.sub(r'\s+',' ',cleaned_str) #multiple whitespace chars are replaced by a single space
    cleaned_str=cleaned_str.lower() #converting the cleaned string to lower case
    
    return cleaned_str # returning the preprocessed string
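
As a quick sanity check, here is a minimal sketch that runs the same three cleaning steps on the docstring's example sentence. The function is redefined so the snippet runs on its own, and `sample`, `cleaned` and `tokens` are illustrative names that appear nowhere else in the tutorial. The function returns a single string; calling `.split()` on it gives the word list.

```python
import re

def preprocess_string(str_arg):
    # Same three steps as the tutorial's function, repeated here so the snippet is self-contained
    cleaned_str = re.sub(r'[^a-z\s]+', ' ', str_arg, flags=re.IGNORECASE)
    cleaned_str = re.sub(r'\s+', ' ', cleaned_str)
    return cleaned_str.lower()

sample = "Menu is absolutely perfect,loved it!"  # illustrative input
cleaned = preprocess_string(sample)
tokens = cleaned.split()  # whitespace split turns the cleaned string into words

print(repr(cleaned))  # 'menu is absolutely perfect loved it '
print(tokens)         # ['menu', 'is', 'absolutely', 'perfect', 'loved', 'it']
```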

# Loading the 20 Newsgroups Dataset  


In [0]:
from sklearn.datasets import fetch_20newsgroups

######################### Loading Training Dataset ############################

categories=['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med'] 
newsgroups_train=fetch_20newsgroups(subset='train',categories=categories)

train_data=newsgroups_train.data #getting all training examples
train_labels=newsgroups_train.target #getting training labels

print ("Total Number of Training Examples: ",len(train_data))
print ("Total Number of Training Labels: ",len(train_labels))


Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Total Number of Training Examples:  2257
Total Number of Training Labels:  2257


## Here is what the training dataset looks like in its raw form .....  🤔
Training Examples : <br>
    The full 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics; here we load only the 4 categories listed above, which gives 2257 training examples
    
Training Labels : <br>
    Training Labels are ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'] - where each training label has its own unique integer id (0 to 3, in the order listed)

In [0]:
pd.DataFrame(data=np.column_stack([train_data,train_labels]),columns=["Training Examples","Training Labels"]).head()

| | Training Examples | Training Labels |
|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier)\nSubj... | 1 |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\... | 1 |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson)\nSu... | 3 |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart)\nSubjec... | 3 |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 |


## Let's begin with the preprocessing of the training dataset that includes
1. Text Cleaning
2. Creating the BoW representation of our training Dataset (would need the same for test dataset as well)

### 1. Text Cleaning

In [0]:
train_data=[preprocess_string(train_str) for train_str in train_data]
print ("Data Cleaning Done")
print ("Total Number of Training Examples: ",len(train_data))

Data Cleaning Done
Total Number of Training Examples:  2257


## Here's what the processed training dataset looks like

In [0]:
pd.DataFrame(data=np.column_stack([train_data,train_labels]),columns=["Training Examples","Training Labels"]).head()

| | Training Examples | Training Labels |
|---|---|---|
| 0 | from sd city ac uk michael collier subject con... | 1 |
| 1 | from ani ms uky edu aniruddha b deglurkar subj... | 1 |
| 2 | from djohnson cs ucsd edu darin johnson subjec... | 3 |
| 3 | from s let rug nl m m zwart subject catholic c... | 3 |
| 4 | from stanly grok columbiasc ncr com stanly sub... | 3 |


### 2. Creating the BoW representation of our training Dataset (would need the same for test dataset as well)

In [0]:
from sklearn.feature_extraction.text import CountVectorizer #simply import CountVectorizer
count_vect = CountVectorizer() #instantiate its object
X_train_counts = count_vect.fit_transform(train_data) #learns the vocabulary, then builds a document-term matrix and returns it
print (X_train_counts.shape)


(2257, 31159)


## Regarding CountVectorizer - as explained on [Scikit_learn](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

What does the CountVectorizer do?
It takes in the text corpus, builds its document-term matrix (i.e. the BoW), and returns it.

Every word is assigned a fixed unique integer id, and the value of each cell of this matrix is the count of that word in a document - BoW

So for example X_train_counts[ i , j ] - where i refers to a document (in our case each document is a training example) and j refers to the integer id of a word w in the vocabulary - would return the count of word j in document i

X_train_counts[0,12048] will retrieve the count of the word with integer id = 12048 in the document/example with id 0

You can read more about Sklearn CountVectorizer at [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
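
To make the (document, word id) indexing concrete, here is a small sketch on a made-up two-document corpus. `toy_docs`, `toy_vect` and `toy_counts` are illustrative names, not part of the tutorial's pipeline. The learned `vocabulary_` attribute maps each word to its column id, so a word's count in a document can be looked up by name.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = ["the cat sat", "the cat saw the dog"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # rows = documents, columns = word ids

# Word -> integer id mapping learned from the corpus (ids follow alphabetical order)
print(sorted(toy_vect.vocabulary_.items(), key=lambda item: item[1]))

# Dense view of the sparse document-term matrix
print(toy_counts.toarray())
# [[1 0 1 0 1]
#  [1 1 0 1 2]]

# Count of the word "the" in document 1
the_id = toy_vect.vocabulary_["the"]
print(toy_counts[1, the_id])  # 2
```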

In [0]:
print (X_train_counts[0,12048])

1


In [0]:
print (X_train_counts)

  (0, 12048)	1
  (0, 8371)	1
  (0, 9880)	1
  (0, 16147)	1
  (0, 27392)	1
  (0, 29019)	1
  (0, 5237)	1
  (0, 21703)	1
  (0, 484)	1
  (0, 13293)	1
  (0, 27608)	1
  (0, 11615)	1
  (0, 5831)	1
  (0, 27740)	1
  (0, 14221)	1
  (0, 23397)	1
  (0, 1283)	1
  (0, 8664)	2
  (0, 20880)	1
  (0, 20903)	1
  (0, 12679)	1
  (0, 24151)	1
  (0, 7862)	1
  (0, 15887)	1
  (0, 898)	1
  :	:
  (2256, 28544)	1
  (2256, 2499)	1
  (2256, 19646)	1
  (2256, 17923)	1
  (2256, 31008)	2
  (2256, 27703)	1
  (2256, 30555)	1
  (2256, 1027)	1
  (2256, 30326)	1
  (2256, 1780)	1
  (2256, 22605)	2
  (2256, 10403)	1
  (2256, 16988)	1
  (2256, 3703)	1
  (2256, 8455)	2
  (2256, 5237)	1
  (2256, 13293)	1
  (2256, 27740)	1
  (2256, 14221)	2
  (2256, 19212)	2
  (2256, 15943)	1
  (2256, 27615)	6
  (2256, 19515)	1
  (2256, 26575)	1
  (2256, 10690)	1


# That's it!!! Let's Move to Training! ⛸⛸⛸

In [0]:
from sklearn.naive_bayes import MultinomialNB #importing Sklearn's NB functionality
clf = MultinomialNB() #simply instantiate a Multinomial Naive Bayes object
clf.fit(X_train_counts, train_labels)  #calling the fit method trains it
print ("Training Completed")

Training Completed
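
If you are curious what `fit` actually learns, the sketch below recomputes the fitted parameters by hand on a made-up count matrix. It assumes the default Laplace smoothing `alpha=1.0` and priors estimated from class frequencies. The arrays `X` and `y` are hypothetical and unrelated to the newsgroups data. The hand-computed smoothed log-probabilities should match the `feature_log_prob_` and `class_log_prior_` attributes of the fitted model.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: 4 documents, 3 vocabulary words
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([0, 0, 1, 1])

alpha = 1.0  # Laplace smoothing (scikit-learn's default)
nb = MultinomialNB(alpha=alpha).fit(X, y)

# Per-class word counts, smoothed and normalised into P(word | class)
word_counts = np.array([X[y == c].sum(axis=0) for c in nb.classes_])
smoothed = word_counts + alpha
manual_log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Class priors P(class) estimated from label frequencies
manual_log_prior = np.log(np.bincount(y) / len(y))

print(np.allclose(manual_log_prob, nb.feature_log_prob_))   # True
print(np.allclose(manual_log_prior, nb.class_log_prior_))   # True
```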


# So Now That We Have Trained NB Model - Let's Move to Testing! 🏄🏽🏄🏽🏄🏽

In [0]:
newsgroups_test=fetch_20newsgroups(subset='test',categories=categories) #loading test data
test_data=newsgroups_test.data #get test set examples
test_labels=newsgroups_test.target #get test set labels

print ("Number of Test Examples: ",len(test_data))
print ("Number of Test Labels: ",len(test_labels))

test_data=[preprocess_string(test_str) for test_str in test_data] #need to preprocess the test set as well!!
print ("Number of Test Examples: ",len(test_data))



Number of Test Examples:  1502
Number of Test Labels:  1502
Number of Test Examples:  1502


The same count_vect object that was instantiated for the training dataset will be used for the test dataset.
But remember that we are not calling fit_transform here, since we only want to transform the test data into a document-term matrix, whereas fit_transform learns the vocabulary dictionary first and then returns a document-term matrix. We are supposed to learn the vocabulary on the training dataset only; test-set words that are not in that vocabulary are simply ignored.

fit_transform : learns the vocabulary dictionary and returns the document-term matrix <br>
transform : transforms documents to a document-term matrix using the already-learned vocabulary
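
Here is a small sketch of that difference on made-up data. `toy_train`, `toy_test` and `vect` are illustrative names. The test matrix has the same number of columns as the training matrix, and a word never seen in training ("terrible") gets no column at all.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, only for illustration
toy_train = ["good food", "bad service"]
toy_test = ["good service", "terrible food"]  # "terrible" never appears in training

vect = CountVectorizer()
train_counts = vect.fit_transform(toy_train)  # learns the vocabulary: bad, food, good, service
test_counts = vect.transform(toy_test)        # reuses that vocabulary, learns nothing new

print(train_counts.shape, test_counts.shape)  # (2, 4) (2, 4)
print("terrible" in vect.vocabulary_)         # False
print(test_counts.toarray())
# [[0 0 1 1]
#  [0 1 0 0]]
```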


In [0]:
X_test_counts=count_vect.transform(test_data) #transforms test data to numerical form
print (X_test_counts.shape)

(1502, 31159)


# Now we can test on the transformed version of test data

In [0]:
predicted=clf.predict(X_test_counts)
print ("Test Set Accuracy : ",np.sum(predicted==test_labels)/float(len(predicted))) 

Test Set Accuracy :  0.936085219707
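
The manual accuracy formula above is equivalent to scikit-learn's `accuracy_score`. The sketch below checks this on made-up label arrays (`toy_true` and `toy_pred` are hypothetical, not the tutorial's predictions) and also builds a confusion matrix, which shows which classes get mixed up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, only for illustration
toy_true = np.array([0, 1, 2, 3, 1, 0])
toy_pred = np.array([0, 1, 2, 1, 1, 3])

manual_acc = np.sum(toy_pred == toy_true) / float(len(toy_pred))  # same formula as in the cell above
sk_acc = accuracy_score(toy_true, toy_pred)
print(manual_acc, sk_acc)  # both are 4/6

# Rows are true classes, columns are predicted classes;
# the diagonal holds the correctly classified examples
cm = confusion_matrix(toy_true, toy_pred)
print(cm)
```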


### The above code can be further reduced to literally 3 lines of code by using the pipeline functionality of sklearn!

# It's truly the ML Holy Toolkit!

In [0]:
from sklearn.pipeline import Pipeline #importing the pipeline functionality


"""
    We simply build a pipeline object by specifying the pipeline steps, and once that pipeline object is
    used for training, it will automatically perform the pipeline steps in the specified order.
    In our case, we first want to build a CountVectorizer for the purpose of BoW and then fit/train an 
    NB model, so we specify these steps in our pipeline in exactly that order. 
    
    Do note that now, when calling the fit method, we pass the original textual data, since
    the count_vect in the pipeline will itself convert it to numeric form. So it's important here that we
    pass the textual data, or else nasty errors will pop up. The same is the case for test data as well. No need
    to count vectorize it separately :) But we do still need to clean the test data with preprocess_string

"""

clf=Pipeline([('count_vect', CountVectorizer()),('clf', MultinomialNB())])
clf.fit(train_data,train_labels)  
print ("Done")

Done


In [0]:

print (len(test_data))
predicted=clf.predict(test_data)
print ("Test Set Accuracy : ",np.sum(predicted==test_labels)/float(len(predicted))) 

1502
Test Set Accuracy :  0.936085219707
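
One possible refinement, sketched here on made-up data rather than taken from the tutorial's own code: `CountVectorizer` accepts a `preprocessor` callable, so the cleaning function can live inside the pipeline and raw text can be passed to both `fit` and `predict`. The names `toy_train`, `toy_labels` and `text_clf` are hypothetical, and the cleaning function is redefined so the snippet runs on its own.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def preprocess_string(str_arg):
    # Same cleaning steps as the tutorial's function
    cleaned_str = re.sub(r'[^a-z\s]+', ' ', str_arg, flags=re.IGNORECASE)
    cleaned_str = re.sub(r'\s+', ' ', cleaned_str)
    return cleaned_str.lower()

# Hypothetical raw (uncleaned) toy data: 0 = graphics, 1 = medicine
toy_train = ["Great GPU, renders 3D fast!",
             "New 3D graphics card",
             "Doctor prescribed medicine.",
             "Medicine & patient care"]
toy_labels = [0, 0, 1, 1]

# The preprocessor runs on every document before tokenization,
# at both fit time and predict time
text_clf = Pipeline([
    ('count_vect', CountVectorizer(preprocessor=preprocess_string)),
    ('clf', MultinomialNB()),
])
text_clf.fit(toy_train, toy_labels)

pred = text_clf.predict(["GRAPHICS card!!", "the patient needs medicine"])
print(pred)  # [0 1]
```

With this arrangement the separate list-comprehension cleaning step is no longer needed, at the cost of re-cleaning the text on every call.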
