## CSCI 316 Individual Assignment 2 Task 1
### Dataset information
##### The wordsList file contains 72 pre-processed emails. Each line is a list of words extracted from one email.
##### The classList file contains the class labels that indicate whether the emails are ordinary or adverts (0 for ordinary and 1 for adverts).
### Objective
##### Develop a Naïve Bayesian classifier as an email filter in Python. Namely, the classifier predicts whether emails are ordinary or adverts.
### Task requirements
##### (1) Use stratified sampling to select 66 out of 72 lines for training and the remaining 6 lines for testing. Return the classification probabilities of these 6 records.
##### (2) The Naïve Bayesian classifier must account for multiple occurrences of words and implement techniques to overcome numerical underflow and zero counts.
##### (3) No ML library can be used in this task. The implementation must be developed from scratch. However, scientific computing libraries such as NumPy and SciPy are allowed.
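
Requirement (2) names two standard problems. The following is a minimal sketch, using made-up probabilities and counts rather than values from the dataset, of why a classifier would sum logarithms instead of multiplying probabilities, and why add-one (Laplace) smoothing helps with zero counts:

```python
import numpy as np

# Hypothetical per-word probabilities for a long email (values are made up).
word_probs = np.full(500, 1e-3)

# Multiplying many small probabilities underflows to exactly 0.0 in float64.
product = np.prod(word_probs)

# Summing logarithms carries the same information in a representable range.
log_sum = np.sum(np.log(word_probs))

# Add-one (Laplace) smoothing: a word never seen in a class still gets
# a small non-zero probability instead of zeroing out the whole product.
count_in_class = 0           # hypothetical: word absent from this class
total_words_in_class = 1000  # hypothetical class size in tokens
vocab_size = 200             # hypothetical vocabulary size
smoothed = (count_in_class + 1) / (total_words_in_class + vocab_size)

print(product)   # 0.0
print(log_sum)   # about -3453.9
print(smoothed)  # about 0.00083
```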

In [1]:
# import libraries needed for this task 
import pandas as pd
import numpy as np
from collections import defaultdict
import re

### Read the wordsList and classList files with pandas and add the class labels to the wordsList DataFrame. The resulting DataFrame has a wordList column and a classList column that indicates whether each email is ordinary or an advert (0 for ordinary and 1 for adverts).

In [2]:
# Import wordsList and classLists file
wordsListcol=["wordList"]
wordsList=pd.read_csv('wordsList',sep=" ",header=None,names=wordsListcol)
classListcol=["classList"]
classList=pd.read_csv('classList',sep=" ",header=None,names=classListcol)
wordsList["classList"]=classList['classList']
wordsList

Unnamed: 0,wordList,classList
0,"codeine,15mg,for,203,visa,only,codeine,methylm...",1
1,"peter,with,jose,out,town,you,want,meet,once,wh...",0
2,"hydrocodone,vicodin,brand,watson,vicodin,750,1...",1
3,"yay,you,both,doing,fine,working,mba,design,str...",0
4,"you,have,everything,gain,incredib1e,gains,leng...",1
...,...,...
67,"scifinance,now,automatically,generates,gpu,ena...",0
68,"you,have,everything,gain,incredib1e,gains,leng...",1
69,"will,there,the,latest",0
70,"experience,with,biggerpenis,today,grow,inches,...",1


### preprocess_string takes one parameter, str_arg, the string to clean. It replaces everything except letters and whitespace with a space, collapses runs of whitespace into a single space, converts the string to lower case, and returns the preprocessed string.

In [3]:
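def preprocess_string(str_arg):
    # Replace every run of characters that are neither letters nor whitespace with a space
    new_str=re.sub(r'[^a-z\s]+',' ',str_arg,flags=re.IGNORECASE)
    # Collapse runs of whitespace into a single space
    new_str=re.sub(r'\s+',' ',new_str)
    # Convert to lower case
    new_str=new_str.lower()

    return new_str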
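
As a quick illustration of the cleaning step, here is a self-contained sketch that repeats the function and runs it on a made-up line (the sample strings are assumptions, not rows from wordsList). Note that digits act as separators, so a token such as `incredib1e` is split in two:

```python
import re

def preprocess_string(str_arg):
    # Same logic as the notebook's function, repeated so this block runs on its own.
    new_str = re.sub(r'[^a-z\s]+', ' ', str_arg, flags=re.IGNORECASE)
    new_str = re.sub(r'\s+', ' ', new_str)
    return new_str.lower()

sample = "Codeine,15mg,for,203,VISA"   # made-up example line
cleaned = preprocess_string(sample)
print(cleaned)                          # codeine mg for visa
print(preprocess_string("incredib1e"))  # incredib e
```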

### The NaiveBayes class has four methods.
#### The two main methods are train and test. The other two support them: addToBow is used by train, and getExampleProbability is used by test.
##### addToBow is called by train. It takes two parameters: example, the preprocessed text of one email, and dict_index, the index of the class whose bag of words (BoW) the example belongs to. It splits the example on whitespace and increments the count of every token in the corresponding BoW.
##### train fits the NB model. It takes the features and the labels, builds one BoW per class, and precomputes the class priors, the vocabulary and the smoothing denominators.
##### getExampleProbability is called by test. It takes one parameter, a single preprocessed test example, and returns its log-posterior score for every class (log-likelihood plus log prior, not normalized).
##### test scores each test example against all classes and predicts the label whose score is highest. It returns the predictions for the test examples.
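
The counting idea behind addToBow can be sketched in isolation. The helper `add_to_bow` and the sample strings below are hypothetical and only illustrate how repeated words accumulate in a per-class dictionary:

```python
from collections import defaultdict

def add_to_bow(bow, example):
    # Hypothetical stand-alone version of the counting step:
    # every occurrence of a token increments its count.
    for token in example.split():
        bow[token] += 1

bow_adverts = defaultdict(int)  # missing keys start at 0
add_to_bow(bow_adverts, "codeine mg for visa only codeine")
add_to_bow(bow_adverts, "visa accepted")

print(bow_adverts["codeine"])         # 2
print(bow_adverts["visa"])            # 2
print(bow_adverts.get("meeting", 0))  # 0, and .get does not add the key
```

Using `.get(token, 0)` at lookup time, as the class does, avoids growing the dictionary with words that were never seen in training.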

In [4]:
class NaiveBayes:
    
    def __init__(self,unique_classes):
        
        self.classes=unique_classes # Constructor passed with unique number of classes of the training set
        

    def addToBow(self,example,dict_index):

        
        if isinstance(example,np.ndarray): example=example[0]
     
        for token_word in example.split(): #for every word in preprocessed example
          
            self.bow_dicts[dict_index][token_word]+=1 #increment in its count
            
    def train(self,features,label):
        

    
        self.examples=features
        self.label=label
        self.bow_dicts=np.array([defaultdict(lambda:0) for index in range(self.classes.shape[0])])
        
        # Only convert to numpy arrays if not already passed as numpy arrays - otherwise it's a useless recomputation
        
        if not isinstance(self.examples,np.ndarray): self.examples=np.array(self.examples)
        if not isinstance(self.label,np.ndarray): self.label=np.array(self.label)
            
        # Constructing Bow for each category
        for cat_index,cat in enumerate(self.classes):
          
            all_cat_examples=self.examples[self.label==cat] #filter all examples of category == cat
            
            # Get examples preprocessed
            
            cleaned_examples=[preprocess_string(cat_example) for cat_example in all_cat_examples]
            
            cleaned_examples=pd.DataFrame(data=cleaned_examples)
            
            # Now construct Bow of this particular category
            np.apply_along_axis(self.addToBow,1,cleaned_examples,cat_index)
            
                
        ###################################################################################################
        
        '''
            
            We are done with constructing of Bow for each category. But there is a need to precompute a few 
            other calculations at training time too:
            1. prior probability of each class - p(c)
            2. vocabulary |V| 
            3. denominator value of each class - [ count(c) + |V| + 1 ] 
            
            ---------------------
            All these 3 calculations can be done at test time too however doing so means to re-compute these 
            again and again every time the test function will be called - this would significantly
            increase the computation time especially when there is a lot of test examples to classify.  
            Which does not make sense to repeatedly compute the same thing.
            So precompute all of them & use them during test time to speed up predictions.
            
        '''
        
        ###################################################################################################
      
        prob_classes=np.empty(self.classes.shape[0])
        all_words=[]
        cat_word_counts=np.empty(self.classes.shape[0])
        for cat_index,cat in enumerate(self.classes):
           
            # Calculating prior probability p(c) for each class
            prob_classes[cat_index]=np.sum(self.label==cat)/float(self.label.shape[0]) 
            
            # Calculating total counts of all the words of each class, count(c)
            # (the +1 of the documented denominator is added once, when denoms is computed)
            cat_word_counts[cat_index]=np.sum(np.array(list(self.bow_dicts[cat_index].values())))
            
            # Get all words of this category                                
            all_words+=self.bow_dicts[cat_index].keys()
                                                     
        
        # Combine all words of every category & make them unique to get vocabulary -V- of entire training set
        
        self.vocab=np.unique(np.array(all_words))
        self.vocab_length=self.vocab.shape[0]
                                  
        # Computing denominator value                                      
        denoms=np.array([cat_word_counts[cat_index]+self.vocab_length+1 for cat_index,cat in enumerate(self.classes)])                                                                          
      
        '''
            Now that everything is precomputed, organize it as one tuple per class
            rather than keeping a separate list for each quantity.
            
            Every element of self.cats_info is a tuple of values:
            the dict at index 0, the prior probability at index 1, the denominator value at index 2
        '''
        
        self.cats_info=[(self.bow_dicts[cat_index],prob_classes[cat_index],denoms[cat_index]) for cat_index,
                        cat in enumerate(self.classes)]                               
        self.cats_info=np.array(self.cats_info)                                 
                                              
                                              
    def getExampleProbability(self,test_example):                                
        

                                 
                                              
        likelihood_prob=np.zeros(self.classes.shape[0]) # Store probability w.r.t each class
        
 
        for cat_index,cat in enumerate(self.classes): 
                             
            for test_token in test_example.split(): #split the test example and get p of each test word
                           
                
                #get total count of this test token from its respective training dict to get numerator value
                test_token_counts=self.cats_info[cat_index][0].get(test_token,0)+1
                
                   
                test_token_prob=test_token_counts/float(self.cats_info[cat_index][2])                              
                

                likelihood_prob[cat_index]+=np.log(test_token_prob)
                                              
        # Log-likelihood of the given example against every class; add the log prior to get the (unnormalized) log-posterior
        post_prob=np.empty(self.classes.shape[0])
        for cat_index,cat in enumerate(self.classes):
            post_prob[cat_index]=likelihood_prob[cat_index]+np.log(self.cats_info[cat_index][1])                                  
      
        return post_prob
    
   
    def test(self,test_set):
      
       
        predictions=[] # Store prediction of each test example
        for example in test_set: 
                                              
            # Preprocess the test example the same way we did for training set examples                                 
            cleaned_example=preprocess_string(example) 
             
            # Get the posterior probability of every example                                  
            post_prob=self.getExampleProbability(cleaned_example) #get prob of this example for both classes
            
            # Pick the max value and map against self.classes
            predictions.append(self.classes[np.argmax(post_prob)])
                
        return np.array(predictions) 
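
Requirement (1) asks for classification probabilities, while getExampleProbability returns unnormalized log-posterior scores. One possible way to turn such scores into probabilities that sum to 1 is the log-sum-exp trick, sketched below. The helper name `log_scores_to_probs` and the example scores are assumptions for illustration, not part of the class above:

```python
import numpy as np

def log_scores_to_probs(log_scores):
    # Hypothetical helper: normalize log-posterior scores into probabilities.
    log_scores = np.asarray(log_scores, dtype=float)
    # Subtracting the maximum keeps np.exp from underflowing to 0 for every class.
    shifted = log_scores - np.max(log_scores)
    weights = np.exp(shifted)
    return weights / np.sum(weights)

# Made-up scores of the size a long email might produce.
log_post = np.array([-1052.3, -1049.1])

print(np.exp(log_post))        # [0. 0.]  naive exponentiation underflows
probs = log_scores_to_probs(log_post)
print(probs)                   # roughly [0.039 0.961]
```

Applied to the output of `nb.getExampleProbability(cleaned_example)`, this would give one probability per class in the order of `nb.classes`.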

#### Use StratifiedShuffleSplit from scikit-learn to split the 72 lines into 66 for training and 6 for testing, which is about 93% for training. X is the features and y is the target label.

In [5]:
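from sklearn.model_selection import StratifiedShuffleSplit

X=wordsList.wordList
y=wordsList.classList

# train_size=0.93 of 72 rows gives 66 training rows; the remaining 6 rows form the test set
splitter=StratifiedShuffleSplit(train_size=0.93,random_state=42)

# n_splits defaults to 10, so this loop overwrites the variables on every pass
# and keeps only the last of the 10 splits
for train,test in splitter.split(X,y):
    X_train_SS = X.iloc[train]
    y_train_SS = y.iloc[train]
    X_test_SS = X.iloc[test]
    y_test_SS = y.iloc[test]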
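
Since requirement (3) rules out ML libraries, a stratified split can also be written with NumPy alone. The sketch below is one possible approach, not the method used in this notebook. The function `stratified_split` is hypothetical, and the balanced label vector is made up for the demonstration:

```python
import numpy as np

def stratified_split(labels, n_test, seed=0):
    # Hypothetical NumPy-only split: sample test rows per class in
    # proportion to the class frequencies.
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)

    # Proportional quota per class, rounded down.
    exact = counts * n_test / labels.size
    per_class = np.floor(exact).astype(int)
    # Hand leftover slots to the classes with the largest fractional parts.
    remainder = int(n_test - per_class.sum())
    order = np.argsort(-(exact - per_class))
    per_class[order[:remainder]] += 1

    test_idx = []
    for cls, k in zip(classes, per_class):
        cls_idx = np.flatnonzero(labels == cls)
        test_idx.extend(rng.choice(cls_idx, size=int(k), replace=False))
    test_idx = np.sort(np.array(test_idx, dtype=int))
    train_idx = np.setdiff1d(np.arange(labels.size), test_idx)
    return train_idx, test_idx

labels = np.array([0] * 36 + [1] * 36)  # made-up balanced labels
train_idx, test_idx = stratified_split(labels, n_test=6, seed=42)
print(len(train_idx), len(test_idx))    # 66 6
print(np.bincount(labels[test_idx]))    # [3 3]
```

The returned index arrays could be used with `X.iloc[...]` and `y.iloc[...]` in the same way as the indices produced by the splitter above.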


#### Check the split sizes: the training set has 66 rows and the test set has 6 rows. Row counts alone do not show the class balance; the stratification comes from StratifiedShuffleSplit.

In [6]:
print("X_train_SS no. of rows: "+ str(len(X_train_SS)))
print("X_test_SS no. of rows: "+ str(len(X_test_SS)))
print("y_train_SS no. of rows: "+ str(len(y_train_SS)))
print("y_test_SS no. of rows: "+ str(len(y_test_SS)))

X_train_SS no. of rows: 66
X_test_SS no. of rows: 6
y_train_SS no. of rows: 66
y_test_SS no. of rows: 6
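
To check stratification directly, the class proportions of the two sets can be compared. The following is a small sketch with a hypothetical helper `class_fractions` and made-up label arrays; in the notebook it would be applied to `y_train_SS` and `y_test_SS`:

```python
import numpy as np

def class_fractions(labels):
    # Hypothetical helper: fraction of rows belonging to each class.
    classes, counts = np.unique(np.asarray(labels), return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

y_train_demo = np.array([0] * 33 + [1] * 33)  # made-up training labels
y_test_demo = np.array([1, 0, 0, 1, 1, 0])    # made-up test labels

print(class_fractions(y_train_demo))  # {0: 0.5, 1: 0.5}
print(class_fractions(y_test_demo))   # {0: 0.5, 1: 0.5}
```

Similar fractions in both sets indicate that the split preserved the class balance.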


#### Train model using Naive Bayes class 

In [7]:
nb=NaiveBayes(np.unique(y_train_SS))
nb.train(X_train_SS,y_train_SS)

In [8]:
prediction=nb.test(X_test_SS)
prediction

array([1, 0, 0, 1, 1, 0], dtype=int64)

In [9]:
# Accuracy of the model
test_acc=np.sum(prediction==y_test_SS)/float(y_test_SS.shape[0]) 
print ("Test Set Accuracy: ",test_acc*100,"%")

Test Set Accuracy:  100.0 %
