# Lab 06: Generative classifiers: Naive Bayes

As discussed in class, a Naive Bayes classifier works as follows:
$$\begin{eqnarray}
p(y \mid \mathbf{x} ; \theta) & = & \frac{p(\mathbf{x} \mid y ; \theta) p(y ; \theta)}{p(\mathbf{x} ; \theta)} \\
& \propto & p(\mathbf{x} \mid y ; \theta) p(y ; \theta) \\
& \approx & p(y ; \theta) \prod_j p(x_j \mid y ; \theta)
\end{eqnarray}$$
We will use Naive Bayes to perform diabetes diagnosis and text classification.

## Example 1: Diabetes classification

In this example we predict wheter a patient with specific diagnostic measurements has diabetes or not. As the features are
continuous, we will model the conditional probabilities
$p(x_j \mid y ; \theta)$ as univariate Gaussians with mean $\mu_{j,y}$ and standard deviation $\sigma_{j,y}$.

The data are originally from the U.S. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and are available
from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database)


In [21]:
import csv
import math
import random
import pandas as pd
import numpy as np

### Data manipulation

First we have some functions to read the dataset, split it into train and test, and partition it according to target class ($y$).

In [22]:
# Load data from CSV file

def loadCsv(filename):
    data_raw = np.genfromtxt(filename,delimiter = ',', dtype=str);
    headers = data_raw[0,:];
    dataset = np.array(data_raw[1:,:], dtype=float);
    dataset = list(dataset)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset, headers

# Split dataset into test and train with given ratio

def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

# Separate training data according to target class
# Return key value pairs array in which keys are possible target variable values
# and values are the data records.

def data_split_byClass(dataset):
    Xy = {}
    for i in range(len(dataset)):
        datapair = dataset[i]
        # datapair[-1] (the last column) is the target class for this record.
        # Check if we already have this value as a key in the return array
        if (datapair[-1] not in Xy):
            # Add class as key
            Xy[datapair[-1]] = []
        # Append this record to array of records for this class key
        Xy[datapair[-1]].append(datapair)
    return Xy

### Model training

Next we have some functions used for training the model. Parameters include mean and standard deviation, used
to partition numerical variables into categorical variables, as well as 

In [23]:
# Parameters of a Gaussian are its mean and standard deviation

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

# Calculate Gaussian parameters mu and sigma for each attribute over a dataset

def gaussian_parameters(dataset):
    parameters = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del parameters[-1]
    return parameters

# Find mean and standard deviation for each attribute for each possible value of the target variable.
    
def parametersByClass(dataset):
    class_key_value_data = data_split_byClass(dataset)
    parameters = {}
    priors = {}
    # For each possible target variable value, collect parameters
    for key, values in class_key_value_data.items():
        # Get mean and standard deviation for each variable for each target variable value
        parameters[key] = gaussian_parameters(values)
     
    return parameters

# Gaussian pdf

def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1/(math.sqrt(2*math.pi)*stdev))*exponent

### Model testing

Next some functions for testing the model on a test set and computing its accuracy. Note that we assume
$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta), $$
which means we assume that the priors $p(y)$ are equal for each possible value of $y$.

In [24]:
# Calculate class conditional probabilities for given input data vector

def calculateClassProbabilities(parameters, inputVector):
    probabilities = {}
    for class_key, variable_parameters in parameters.items():
        probabilities[class_key] = 1
        for i in range(len(variable_parameters)):
            mean, stdev = variable_parameters[i]
            x = inputVector[i]
            probabilities[class_key] *= calculateProbability(x, mean, stdev)
    return probabilities

# Make prediction for given input data vector

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

# Get predictions for test set

def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

# Get accuracy for test set

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet)))*100.0

### Experiment

Here we load the diabetes dataset, split it into training and test data, train a Gaussian NB model, and test the model on the test set.

In [25]:
# Load dataset

filename = 'diabetes.csv'
dataset, headers = loadCsv(filename)
print(headers)
print(np.array(dataset)[0:5,:])

# Split into training and test

splitRatio = 0.60
trainingSet, testSet = splitDataset(dataset, splitRatio)
print('Total {0} rows,  train = {1} rows, test = {2} rows'.format(len(dataset),len(trainingSet),len(testSet)))

# Train model

parameters = parametersByClass(trainingSet)

# Test model

predictions = getPredictions(parameters, testSet)
accuracy = getAccuracy(testSet, predictions)
print("Accuracy: %.2f" % accuracy)

['Pregnancies' 'Glucose' 'BloodPressure' 'SkinThickness' 'Insulin' 'BMI'
 'DiabetesPedigreeFunction' 'Age' 'Outcome']
[[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01 1.000e+00]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01 0.000e+00]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01 1.000e+00]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01 0.000e+00]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01 1.000e+00]]
Total 768 rows,  train = 460 rows, test = 308 rows
Accuracy: 68.51


### In lab / take home work

Find out the proportion of the records in your dataset are positive vs. negative.  Can we conclude that $p(y=1) = p(y=0)$? If not, add
the priors $p(y=1)$ and $p(y=0)$ to your NB model. Does it improve the result?



In [26]:
def split_data_set(df, training_ratio=0.6, shuffle=True):
    if shuffle :
        df = df.sample(frac=1)
    t_size = int(df.shape[0] * training_ratio)
    # train , test 
    return df.iloc[:t_size], df.iloc[t_size:]

class NaiveBay:
    def __init__(self, with_prior=True):
        self.x_mean = {}
        self.x_std = {}
        
        self.priors = {}
        self.y_class = []
        self.pred_with_prior = with_prior
    
    def fit(self, X, y):
        # X : 2d array 
        # y : 1d array
        
        self.priors = {} # reset priority 
        
        # reset saving parameters
        self.x_mean = {}
        self.x_std = {}
        
        # xy_df should be in format columns = [x0, x1, ..., xn, y]
        self.y_class = np.unique(y)
        
        for y_class in self.y_class : 
            y_index = np.where(y == y_class) # select X by class 
            
            # select X data of y_class answer
            y_X = X.copy()[y_index] # X.copy() => prevent update the original X variable.
            
            x_mean = y_X.mean(axis=0) # find mean of each column
            x_std = y_X.std(axis=0) # find the stdev of each column
            # save parameter values 
            self.x_mean[y_class] = x_mean
            self.x_std[y_class] = x_std
            
            self.priors[y_class] = 1
            if self.pred_with_prior : 
                self.priors[y_class] *=  (y_X.shape[0]/ X.shape[0])                 
                
    def calculate_probability(self, x, mean, stdev):
        '''
        Equation from 
        https://en.wikipedia.org/wiki/Naive_Bayes_classifier#:~:text=A%20class's%20prior%20may%20be,total%20number%20of%20samples%3E).
        '''
        top = np.exp(-1 * np.power((x-mean), 2) / (2*np.power(stdev,2)))
        bot = np.sqrt(2*np.pi*np.power(stdev,2))
        p = top/bot
        return p 
    
    def predict(self, X):
        
        # save prob for each class 
        pred_df = pd.DataFrame()
        
        for y_class in self.y_class : 
            
            prior_y = self.priors[y_class]
            
            # a vector of mean 
            y_params_mean = self.x_mean[y_class]
            # a vector of variance 
            y_params_std = self.x_std[y_class]
            
            # probability of y given X 
            yp_by_features = self.calculate_probability(X, y_params_mean, y_params_std)
            yp_by_features =  yp_by_features * (prior_y ** prior_y)         
            yp = np.prod(yp_by_features, axis=1)
            pred_df[y_class] = yp
    
        pred_df['y_pred'] = np.argmax(pred_df.values, axis=1)
        pred_df['y_pred'] = [pred_df.columns[v] for v in pred_df['y_pred']]
        self.pred_df = pred_df
        return pred_df['y_pred'].values

In [27]:
def get_accuracy(y_pred, y_true):
    return (y_pred == y_true).sum() / len(y_true)

df = pd.read_csv('diabetes.csv')
# df.Outcome.hist()
training_set, test_set = split_data_set(df, 0.8)

outcome_ratio = (training_set.Outcome.value_counts() / len(training_set)).to_dict()

print('Outcome Proportion')
print(outcome_ratio)

print('Can we conclude p(y=1) == p(y=0) ? ')
print('The answer is', outcome_ratio[0] == outcome_ratio[1])

X_train, y_train = training_set.drop(columns='Outcome').values, training_set['Outcome'].values
X_test, y_test = test_set.drop(columns='Outcome').values, test_set['Outcome'].values

nb= NaiveBay(with_prior=False)
print('='*30)
print('Model without prior')
nb.fit(X_train, y_train)
train_pred = nb.predict(X_train)
test_pred = nb.predict(X_test)
print('Train Accuracy :', get_accuracy(train_pred, y_train))
print('Test Accuracy : ', get_accuracy(test_pred, y_test))
print('='*30)

print('Model with prior')
nb = NaiveBay(with_prior=True)
nb.fit(X_train, y_train)
print(nb.priors)
my_pred = nb.predict(X_test)
train_pred = nb.predict(X_train)
test_pred = nb.predict(X_test)
print('Train Accuracy :', get_accuracy(train_pred, y_train))
print('Test Accuracy : ', get_accuracy(test_pred, y_test))

Outcome Proportion
{0: 0.6465798045602605, 1: 0.3534201954397394}
Can we conclude p(y=1) == p(y=0) ? 
The answer is False
Model without prior
Train Accuracy : 0.752442996742671
Test Accuracy :  0.7337662337662337
Model with prior
{0: 0.6465798045602605, 1: 0.3534201954397394}
Train Accuracy : 0.7719869706840391
Test Accuracy :  0.7532467532467533


##### Priors are not equal here but with addition of priors the result does imporove by a small margin

## Example 2: Text classification

This example has been adapted from a post by Jaya Aiyappan, available at
[Analytics Vidhya](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data).

We will generate a small dataset of sentences that are classified as either "statements" or "questions."

We will assume that occurance and placement of words within a sentence is independent of each other
(i.e., the features are conditionally independent given $y$). So the sentence "this is my book" is the same as "is this my book."
We will treat words as case insensitive.

In [28]:
# Generate text data for two classes, "statement" and "question"

text_train = [['This is my novel book', 'statement'],
              ['this book has more than one author', 'statement'],
              ['is this my book', 'question'],
              ['They are novels', 'statement'],
              ['have you read this book', 'question'],
              ['who is the novels author', 'question'],
              ['what are the characters', 'question'],
              ['This is how I bought the book', 'statement'],
              ['I like fictional characters', 'statement'],
              ['what is your favorite book', 'question']]

text_test = [['this is the book', 'statement'], 
             ['who are the novels characters', 'question'], 
             ['is this the author', 'question']]

# Load training and test data into pandas data frames

training_data = pd.DataFrame(text_train, columns= ['sentence', 'class'])
print(training_data)
print('\n------------------------------------------\n')
testing_data = pd.DataFrame(text_test, columns= ['sentence', 'class'])
print(testing_data)


                             sentence      class
0               This is my novel book  statement
1  this book has more than one author  statement
2                     is this my book   question
3                     They are novels  statement
4             have you read this book   question
5            who is the novels author   question
6             what are the characters   question
7       This is how I bought the book  statement
8         I like fictional characters  statement
9          what is your favorite book   question

------------------------------------------

                        sentence      class
0               this is the book  statement
1  who are the novels characters   question
2             is this the author   question


In [29]:
# Partition training data by class

stmt_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'statement']
question_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'question']

# Get word frequencies for each sentence and class

def get_words(text):
    # Initialize word list
    words = [];
    # Loop through each sentence in input array
    for text_row in text:       
        # Check the number of words. Assume each word is separated by a blank space
        # so that the number of words is the number of blank spaces + 1
        number_of_spaces = text_row.count(' ')
        # loop through the sentence and get words between blank spaces.
        for i in range(number_of_spaces):
            # Check for for last word
            words.append([text_row[:text_row.index(' ')].lower()])
            text_row = text_row[text_row.index(' ')+1:]  
            i = i + 1        
        words.append([text_row])
    return np.unique(words)

# Get frequency of each word in each document

def get_doc_word_frequency(words, text):  
    word_freq_table = np.zeros((len(text),len(words)), dtype=int)
    i = 0
    for text_row in text:
        # Insert extra space between each pair of words to prevent
        # partial match of words
        text_row_temp = ''
        for idx, val in enumerate(text_row):
            if val == ' ':
                 text_row_temp = text_row_temp + '  '
            else:
                  text_row_temp = text_row_temp + val.lower()
        text_row = ' ' + text_row_temp + ' '
        j = 0
        for word in words: 
            word = ' ' + word + ' '
            freq = text_row.count(word)
            word_freq_table[i,j] = freq
            j = j + 1
        i = i + 1
    
    return word_freq_table

In [30]:
# Get word frequencies for statement documents

word_list_s = get_words(stmt_docs)
word_freq_table_s = get_doc_word_frequency(word_list_s, stmt_docs)
tdm_s = pd.DataFrame(word_freq_table_s, columns=word_list_s)
print(tdm_s)

   are  author  book  bought  characters  fictional  has  how  i  is  like  \
0    0       0     1       0           0          0    0    0  0   1     0   
1    0       1     1       0           0          0    1    0  0   0     0   
2    1       0     0       0           0          0    0    0  0   0     0   
3    0       0     1       1           0          0    0    1  1   1     0   
4    0       0     0       0           1          1    0    0  1   0     1   

   more  my  novel  novels  one  than  the  they  this  
0     0   1      1       0    0     0    0     0     1  
1     1   0      0       0    1     1    0     0     1  
2     0   0      0       1    0     0    0     1     0  
3     0   0      0       0    0     0    1     0     1  
4     0   0      0       0    0     0    0     0     0  


In [31]:
# Get word frequencies over all statement documents

freq_list_s = word_freq_table_s.sum(axis=0) 
freq_s = dict(zip(word_list_s,freq_list_s))
print(freq_s)

{'are': 1, 'author': 1, 'book': 3, 'bought': 1, 'characters': 1, 'fictional': 1, 'has': 1, 'how': 1, 'i': 2, 'is': 2, 'like': 1, 'more': 1, 'my': 1, 'novel': 1, 'novels': 1, 'one': 1, 'than': 1, 'the': 1, 'they': 1, 'this': 3}


In [32]:
# Get word frequencies for question documents

word_list_q = get_words(question_docs)
word_freq_table_q = get_doc_word_frequency(word_list_q, question_docs)
tdm_q = pd.DataFrame(word_freq_table_q, columns=word_list_q)
print(tdm_q)

   are  author  book  characters  favorite  have  is  my  novels  read  the  \
0    0       0     1           0         0     0   1   1       0     0    0   
1    0       0     1           0         0     1   0   0       0     1    0   
2    0       1     0           0         0     0   1   0       1     0    1   
3    1       0     0           1         0     0   0   0       0     0    1   
4    0       0     1           0         1     0   1   0       0     0    0   

   this  what  who  you  your  
0     1     0    0    0     0  
1     1     0    0    1     0  
2     0     0    1    0     0  
3     0     1    0    0     0  
4     0     1    0    0     1  


In [33]:
# Get word frequencies over all question documents

freq_list_q = word_freq_table_q.sum(axis=0) 
freq_q = dict(zip(word_list_q,freq_list_q))
print(freq_q)

{'are': 1, 'author': 1, 'book': 3, 'characters': 1, 'favorite': 1, 'have': 1, 'is': 3, 'my': 1, 'novels': 1, 'read': 1, 'the': 2, 'this': 2, 'what': 2, 'who': 1, 'you': 1, 'your': 1}


### In-lab exercise: Laplace smoothing

Run the code below and figure out why it fails.

When a word does not appear with a specific class in the training data, its class-conditional probability is 0, and we are unable to
get a reasonable probability for that class.

Research Laplace smoothing, and modify the code above to implement Laplace smoothing (setting the frequency of all words with frequency 0 to a frequency of 1).
Run the modified code on the test set.

In [34]:
# laplace smooth adds words to each class that appears in other class but not in its own with a count 1
# It also increments the counts of other words by 1 so as not to change the overall probabilities much

def laplace_smooth(word_list_s, freq_list_s, word_list_q, freq_list_q):
    s_extra = np.setdiff1d(word_list_q, word_list_s, assume_unique=True)
    q_extra = np.setdiff1d(word_list_s, word_list_q, assume_unique=True)
    
    word_list_s = np.concatenate((word_list_s, s_extra))
    word_list_q = np.concatenate((word_list_q, q_extra))  
    
    freq_list_s = np.concatenate((freq_list_s, np.zeros(s_extra.shape, dtype=int)))
    freq_list_q = np.concatenate((freq_list_q, np.zeros(q_extra.shape, dtype=int)))
    
    freq_list_s = freq_list_s + 1
    freq_list_q = freq_list_q + 1
    
    return word_list_s, freq_list_s, word_list_q, freq_list_q

word_list_s, freq_list_s, word_list_q, freq_list_q = laplace_smooth(word_list_s, freq_list_s, word_list_q, freq_list_q)

In [35]:
# Get word probabilities for statement class

prob_s = []
for word, count in zip(word_list_s, freq_list_s):
    #print(word, count)
    prob_s.append(count/len(word_list_s))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append(count/len(word_list_q))
    
print('Probability of words for "statement" class \n')
print(dict(zip(word_list_s, prob_s)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(word_list_q, prob_q)))

Probability of words for "statement" class 

{'are': 0.07407407407407407, 'author': 0.07407407407407407, 'book': 0.14814814814814814, 'bought': 0.07407407407407407, 'characters': 0.07407407407407407, 'fictional': 0.07407407407407407, 'has': 0.07407407407407407, 'how': 0.07407407407407407, 'i': 0.1111111111111111, 'is': 0.1111111111111111, 'like': 0.07407407407407407, 'more': 0.07407407407407407, 'my': 0.07407407407407407, 'novel': 0.07407407407407407, 'novels': 0.07407407407407407, 'one': 0.07407407407407407, 'than': 0.07407407407407407, 'the': 0.07407407407407407, 'they': 0.07407407407407407, 'this': 0.14814814814814814, 'favorite': 0.037037037037037035, 'have': 0.037037037037037035, 'read': 0.037037037037037035, 'what': 0.037037037037037035, 'who': 0.037037037037037035, 'you': 0.037037037037037035, 'your': 0.037037037037037035}
------------------------------------------- 

Probability of words for "question" class 

{'are': 0.07407407407407407, 'author': 0.07407407407407407, 'book': 

In [36]:
# Calculate prior for one class

def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
#     print("words: ",words)
    prob = 1
    for word in words:
        if className == 'statement':
#             print("word is: ",word)
            idx = np.where(word_list_s == word)
#             print("Shape of idx is: ",idx)
            prob = prob * prob_s[np.array(idx)[0,0]]
        else:
            idx = np.where(word_list_q == word)
            prob = prob * prob_q[np.array(idx)[0,0]]   
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'

def accuracy(test_docs, target):
    correct = 0
    for i in range(len(test_docs)):
        predicted_as = predict(test_docs[i])
        if predicted_as == target[i]:
            correct += 1
    return correct/len(test_docs) * 100

In [38]:
test_docs = list([[test['sentence']] for index,test in testing_data.iterrows()])
print('Getting prediction for "%s"' % test_docs[0][0])
predict(test_docs[0])

Getting prediction for "this is the book"


'question'

The classifier being word position insensitive classifies the statement "this is a book" as a question of form "is this a book"

In [39]:
acc = accuracy(test_docs, testing_data['class'])
print("Accuracy on test set is %.2f "%(acc))

Accuracy on test set is 66.67 




### Take home exercise

Find a more substantial text classification dataset, clean up the documents, and build your NB classifier. Write a brief report on your in-lab and take home exercises and results.

###### The dataset is from kaggle:
Link : https://www.kaggle.com/arbazkhan971/product-sentiment-analysis


In [41]:
data = pd.read_csv("product_review.csv")

In [42]:
data.head()

Unnamed: 0,Text_ID,Product_Description,Product_Type,Sentiment
0,3057,The Web DesignerÛªs Guide to iOS (and Android...,9,2
1,6254,RT @mention Line for iPad 2 is longer today th...,9,2
2,8212,Crazy that Apple is opening a temporary store ...,9,2
3,4422,The lesson from Google One Pass: In this digit...,9,2
4,5526,RT @mention At the panel: &quot;Your mom has a...,9,2


In [43]:
## remove some special characters and convert to lower case
data['Product_Description'] = data['Product_Description'].str.replace('[^a-zA-Z0-9 \n\.]|\.', '').str.lower()

In [44]:
data.head()

Unnamed: 0,Text_ID,Product_Description,Product_Type,Sentiment
0,3057,the web designers guide to ios and android app...,9,2
1,6254,rt mention line for ipad 2 is longer today tha...,9,2
2,8212,crazy that apple is opening a temporary store ...,9,2
3,4422,the lesson from google one pass in this digita...,9,2
4,5526,rt mention at the panel quotyour mom has an ip...,9,2


In [50]:
data['Sentiment'].value_counts()

2    3765
3    2089
1     399
0     111
Name: Sentiment, dtype: int64

In [45]:
class NaiveBayes:
    def __init__(self, with_prior):
        self.word_freqs = {} 
        self.n_words = {}
        self.y_class = {}
        self.with_prior = with_prior
        
    def get_word_frequency(self, sentence_array):
        # make text_array into 1d array 
        sentence_series = pd.Series(sentence_array)
        sentence_series += ' ' # add space for spliting words
        word_series = pd.Series(sentence_series.sum().split(' '))
        word_series = word_series[word_series != ' '] # drop unwanted elements
        return word_series.value_counts().to_dict(), len(word_series)
        
    def fit(self, X, y):
        # X => array of vector
        # y => array of class
        self.y_class = pd.Series(y).value_counts().to_dict()
        for y_class in self.y_class : 
            y_index = np.where(y == y_class)
            
            y_X = X.copy()[y_index]
            self.word_freqs[y_class], self.n_words[y_class] = self.get_word_frequency(y_X)
            
        all_X = X.copy()
        self.word_freqs['all'], self.n_words['all'] = self.get_word_frequency(all_X)
        self.y_class['all'] = X.shape[0] # count the number of samples
        
    def predict(self, X, alpha=1):
        d = len(self.word_freqs['all']) # the number of possible words in all categories
        pred_df = pd.DataFrame()        
        n = self.y_class['all'] # number of training size
        for y_class in self.y_class :
            if y_class != 'all' : 
                y_hat = []
                n_y = self.y_class[y_class] # the number of class y in training set 
                n_wy = self.n_words[y_class] # the number of words occur in y_class 
                for sentence in X : 
                    p_xy = 1
                    words = pd.Series(sentence.split(' '))                
                    for w in words : 
                        n_xy = 0
                        if w in self.word_freqs[y_class] : 
                            n_xy = self.word_freqs[y_class][w] # the number of word x happens in class y   
                        p_xy *= (n_xy + alpha) / (n_wy + alpha*d) 

                    if self.with_prior :
                        p_xy *= n_y / n
                    y_hat.append(p_xy)               
                pred_df[y_class] = y_hat
    
        pred_df['y_pred'] = np.argmax(pred_df.values, axis=1)
        pred_df['y_pred'] = [pred_df.columns[v] for v in pred_df['y_pred']]
        self.pred_df = pred_df
        return pred_df['y_pred'].values

In [47]:
# Split dataset into test and train set
def split_dataset(df, training_ratio=0.6, shuffle=True):
    if shuffle :
        df = df.sample(frac=1)
    t_size = int(df.shape[0] * training_ratio)
    # train , test 
    return df.iloc[:t_size], df.iloc[t_size:]
data_train , data_test = split_dataset(data, 0.8)


Unnamed: 0,Text_ID,Product_Description,Product_Type,Sentiment
99,8512,like mention download freak jet on your iphone...,7,3
756,8219,apple just brought is water in the ipad line sxsw,9,2
4171,1985,mention and mention saying hi at sxsw with the...,9,3
4275,7717,learned from sponsored sxsw software 1 austin ...,5,3
2108,611,i may not have invites to all the quotcoolquot...,9,3


In [48]:
m = NaiveBayes(True)
m.fit(data_train['Product_Description'].values, data_train['Sentiment'].values)
train_pred = m.predict(data_train['Product_Description'].values)
test_pred = m.predict(data_test['Product_Description'].values)

In [49]:
print('Accuracy')
print('Training set accuracy', get_accuracy(train_pred, data_train['Sentiment'].values))
print('Test set accuracy', get_accuracy(test_pred, data_test['Sentiment'].values))

Accuracy
Training set accuracy 0.8222353172264781
Test set accuracy 0.6614296936370778


The test accuract value is quite low, this may be because of the unbalanced dataset that we have in terms of number of data for each sentiment value. There Another point to note while working with text data is that, extensive data cleaning and preprocessing is required. 