Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "Bishal Khanal"
ID = "st122221"

---

# Lab 06: Generative Classifiers: Naive Bayes

As discussed in class, a naive Bayes classifier works as follows.

We are given a feature space $\mathcal{X}$ that could be discrete, continuous, or a mix of discrete and continuous features.

We are also given a discrete set $\mathcal{Y} = \{ y_1, \ldots, y_K \}$ of exhaustive, mutually exclusive classes thought to be the provenance of the dataset elements $\mathbf{x} \in \mathcal{X}$.

What does it mean to say that the features come from the classes? Specifically, we mean that the observation $\mathbf{x}^{(i)}$ is a random vector statistically dependent on a random variable $y^{(i)}$.

This means that $\mathbf{x}^{(i)} \sim p(\mathbf{x} \mid y^{(i)})$, where $y^{(i)} \in \mathcal{Y}$ and $y^{(i)} \sim p(y)$. $p(y)$, the *prior*, is assumed to be a multinomial distribution over the possible classes $\mathcal{Y}$, but the class conditional distribution $p(\mathbf{x} \mid y)$ can be an arbitrarily complicated joint distribution over the feature space that is different for each $y \in \mathcal{Y}$.

The random process just described, in which a $y$ is first sampled from a multinomial distribution over $\mathcal{Y}$ then an $\mathbf{x}$ is sampled from an arbitrary joint distribution over $\mathcal{X}$ that is conditioned on $y$, is a *generative model* for the provenance of our dataset. It may not be a fully accurate model for how nature gave us our dataset, but we nevertheless assume that it is.

With all those preliminaries, now, given a new sample $\mathbf{x}$ assumed to have been generated by the same generative process, we estimate, for each $y \in \mathcal{Y}$, the *posterior* $p(y \mid \mathbf{x})$ using the following strategy:
$$\begin{eqnarray}
p(y \mid \mathbf{x} ; \theta) & = & \frac{p(\mathbf{x} \mid y ; \theta) p(y ; \theta)}{p(\mathbf{x} ; \theta)} \\
& \propto & p(\mathbf{x} \mid y ; \theta) p(y ; \theta) \\
& = & p(y ; \theta) \prod_j p(x_j \mid y, x_1, \ldots, x_{j-1} ; \theta) \\
& \approx & p(y ; \theta) \prod_j p(x_j \mid y ; \theta).
\end{eqnarray}$$

The critical assumption here (besides the story of the generative random process assumed to be the origin of our dataset) is the *naive Bayes assumption* that the approximation

$$ p(x_j \mid y, x_1, \ldots, x_{j-1} ; \theta) \approx p(x_j \mid y ; \theta)$$

is close enough to reality to be useful. Note that if the features are truly *conditionally independent of each other given the class*, then the naive Bayes classifier is an exact probabilistic classifier.

So now we know that the parameters of a naive Bayes classifier will always include the parameters $\phi_1, \ldots, \phi_K$ of the multinomial distribution over $\mathcal{Y}$ plus the individual conditional feature distributions $p(x_j \mid y)$. If $x_j$ is discrete, we can represent this conditional distribution using a simple table of probabilities, and if $x_j$ is continuous, we represent the conditional distribution using the parameters of some continuous distribution such as a univariate Gaussian, univariate exponential, etc.
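
As a minimal sketch of the factorization above, the snippet below scores two classes for one sample that has one discrete and one continuous feature. All numbers, the class names `"a"` and `"b"`, and the helper names `gaussian_pdf` and `posterior` are made up for illustration and are not part of the lab code.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical parameters for two classes (illustrative values only).
prior = {"a": 0.7, "b": 0.3}                     # p(y)
p_x1 = {"a": {"red": 0.8, "blue": 0.2},          # probability table for discrete x_1
        "b": {"red": 0.1, "blue": 0.9}}
gauss_x2 = {"a": (0.0, 1.0), "b": (2.0, 1.0)}    # (mu, sigma) for continuous x_2

def posterior(x1, x2):
    # Unnormalized score p(y) * p(x_1 | y) * p(x_2 | y) for each class
    scores = {y: prior[y] * p_x1[y][x1] * gaussian_pdf(x2, *gauss_x2[y]) for y in prior}
    total = sum(scores.values())                 # plays the role of p(x)
    return {y: s / total for y, s in scores.items()}

post = posterior("blue", 1.5)
print(post)
```

Dividing by `total` is only needed to report probabilities; the predicted class is the same with or without it.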

In today's lab, we will use naive Bayes to perform diabetes diagnosis and text classification.

## Example 1: Diabetes classification

In this example we predict whether a patient with specific diagnostic measurements has diabetes or not. The target classes $\mathcal{Y} = \{ y_1, y_2 \}$ correspond respectively to "no diabetes" and "diabetes." As the features are continuous, we will model their conditional probabilities $p(x_j \mid y ; \theta)$ as univariate Gaussians with means $\mu_{j,y}$ and standard deviations $\sigma_{j,y}$.

The data are originally from the U.S. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and are available from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [2]:
import csv
import math
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Data manipulation

First we have some functions to read the dataset, split it into train and test, and partition it according to target class ($y$).

In [3]:
# Load data from CSV file
def loadCsv(filename):
    data_raw = pd.read_csv(filename)
    headers = data_raw.columns
    dataset = data_raw.values
    return dataset, headers

# Split dataset into test and train with given ratio
def splitDataset(test_size,*arrays,**kwargs):
    return train_test_split(*arrays,test_size=test_size,**kwargs)

# Separate training data according to target class
# Return key value pairs array in which keys are possible target variable values
# and values are the data records.

def data_split_byClass(dataset):
    Xy = {}
    for i in range(len(dataset)):
        datapair = dataset[i]
        # datapair[-1] (the last column) is the target class for this record.
        # Check if we already have this value as a key in the return array
        if (datapair[-1] not in Xy):
            # Add class as key
            Xy[datapair[-1]] = []
        # Append this record to array of records for this class key
        Xy[datapair[-1]].append(datapair)
    return Xy

### Model training

Next we have some functions used for training the model. Parameters include the conditional means and standard deviations for each feature as well as the parameters of the multinomial distribution (more specifically the Bernoulli distribution since this is a binary classification problem) over $\mathcal{Y}$.
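
For reference, the estimates computed by `get_gaussian_parameters()` are the maximum-likelihood ones, since `np.std` by default divides by the number of records rather than by one less. The symbols $m$ (number of training records) and $m_y$ (number of training records in class $y$) are notation introduced here, not taken from the lab text:

$$\begin{eqnarray}
\hat{p}(y) & = & \frac{m_y}{m} \\
\hat{\mu}_{j,y} & = & \frac{1}{m_y} \sum_{i \,:\, y^{(i)} = y} x_j^{(i)} \\
\hat{\sigma}_{j,y}^2 & = & \frac{1}{m_y} \sum_{i \,:\, y^{(i)} = y} \left( x_j^{(i)} - \hat{\mu}_{j,y} \right)^2.
\end{eqnarray}$$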

In [4]:
# Calculate Gaussian parameters mu and sigma for each attribute over a dataset

def get_gaussian_parameters(X, y):
    parameters = {}
    unique_y = np.unique(y)
    for uy in unique_y:
        mean = np.mean(X[y==uy], axis=0)
        std = np.std(X[y==uy], axis=0)
        py = y[y==uy].size / y.size
        parameters[uy] = { 'prior': py, 'mean': mean, 'std': std }
    return parameters, unique_y

def calculateProbability(x, mu, sigma):
    sigma = np.diag(sigma**2)
    x = x.reshape(-1,1)
    mu = mu.reshape(-1,1)
    exponent = np.exp(-1/2*(x-mu).T@np.linalg.inv(sigma)@(x-mu))
    return ((1/(np.sqrt(((2*np.pi)**x.size)*np.linalg.det(sigma))))*exponent)[0,0]

### Model testing

Next are some functions for testing the model on a test set and computing its accuracy. Note that `predict_one()` allows us to calculate $p(y \mid \mathbf{x} ; \theta)$ with or without the prior, i.e., as either

$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta),$$

which corresponds to the assumption that the priors $p(y)$ are equal, i.e., $p(y) = \frac{1}{K}$ for all $y$, or

$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta) p(y ; \theta),$$

which correctly includes the prior.
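
A minimal sketch of the same two decision rules in log space, which avoids multiplying many small densities together. The parameter values and the helper names `log_gaussian` and `predict_log` are hypothetical; `params` only mirrors the `prior`/`mean`/`std` layout used in this lab.

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """Sum over features of log N(x_j; mu_j, sigma_j^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2)

def predict_log(x, params, use_prior=True):
    classes = list(params)
    scores = []
    for c in classes:
        s = log_gaussian(x, params[c]["mean"], params[c]["std"])
        if use_prior:
            s += np.log(params[c]["prior"])  # adding log p(y) is multiplying by p(y)
        scores.append(s)
    return classes[int(np.argmax(scores))]

# Illustrative parameters: two features, two classes, unequal priors.
params = {
    0.0: {"prior": 0.9, "mean": np.array([0.0, 0.0]), "std": np.array([1.0, 1.0])},
    1.0: {"prior": 0.1, "mean": np.array([1.0, 1.0]), "std": np.array([1.0, 1.0])},
}
x = np.array([0.6, 0.6])  # slightly closer to the class-1 mean

print(predict_log(x, params, use_prior=False))  # likelihood only
print(predict_log(x, params, use_prior=True))   # likelihood times prior
```

With these made-up numbers the likelihood alone favors class 1, while the strong prior for class 0 flips the decision.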

In [5]:
# Calculate class conditional probabilities for given input data vector

def predict_one(x, parameters, unique_y, prior=True):
    probabilities = []
    for key in parameters.keys():
        probabilities.append(calculateProbability(x, parameters[key]['mean'], parameters[key]['std']) * (parameters[key]['prior']**(float(prior))))
    probabilities = np.array(probabilities)
    return unique_y[np.argmax(probabilities)]

def getPredictions(X, parameters, unique_y,prior=True):
    predictions = []
    for i in range(X.shape[0]):
        predictions.append(predict_one(X[i],parameters,unique_y,prior))
    return np.array(predictions)

# Get accuracy for test set

def getAccuracy(y, y_pred):
    correct = len(y[y==y_pred])
    return correct/y.size

### Experiment

Here we load the diabetes dataset, split it into training and test data, train a Gaussian NB model, and test the model on the test set.

In [6]:
# Load dataset

filename = 'diabetes.csv'
dataset, headers = loadCsv(filename)
#print(headers)
#print(np.array(dataset)[0:5,:])

# Split into training and test

X_train,X_test,y_train,y_test = splitDataset(0.4,dataset[:,:-1],dataset[:,-1])
print("Total =",len(dataset),"Train =", len(X_train),"Test =",len(X_test))

# Train model

parameters, unique_y = get_gaussian_parameters(X_train,y_train)

# Test model
prediction = getPredictions(X_test,parameters,unique_y)
print("Accuracy with Prior =",getAccuracy(y_test,prediction))

prediction = getPredictions(X_test,parameters,unique_y,prior = False)
print("Accuracy without Prior =",getAccuracy(y_test,prediction))

Total = 768 Train = 460 Test = 308
Accuracy with Prior = 0.7142857142857143
Accuracy without Prior = 0.7045454545454546


###  Exercise In lab / take home work (20 points)

Find out what proportion of the records in your dataset are positive vs. negative. Can we conclude that $p(y=1) = p(y=0)$? If not, we should use the version of the model in which we use the priors $p(y=1)$ and $p(y=0)$. Explain whether/how it improves the result.


In [7]:
# YOUR CODE HERE
for key,value in parameters.items():
    print(f"The prior, p(y={key}): {parameters[key]['prior']}")
    

The prior, p(y=0.0): 0.6521739130434783
The prior, p(y=1.0): 0.34782608695652173


**Explain whether you can conclude that $p(y=1) = p(y=0)$? If not, add
the priors $p(y=1)$ and $p(y=0)$ to your NB model and explain how it improves the result.**


From the output of the code above, p(y=0) = 0.6522 and p(y=1) = 0.3478 in the training set, so the priors are not equal. Comparing the two runs, the accuracy with the prior (0.714) is higher than the accuracy without it (0.705), so including the prior improves the result on this split, if only slightly.

If p(y=0) and p(y=1) were equal, including the priors in the prediction would have no effect on the accuracy, since the score of every class would be multiplied by the same number. That only rescales all class scores by one constant and never changes which is largest. When the priors are unequal, that is, when the training set is not balanced (it does not contain an equal number of samples for each class), the prior favors the class that occurs most often in the training samples. Ignoring the prior at prediction time then ignores how often each class is expected to occur in the test set, which we assume follows the same distribution as the training set.

If the class proportions in the test set are completely different from those in the training set, however, using the prior may hurt performance. In that case you may want to remove the prior from the model, meaning not using it to classify the test set.

Suppose we have a dataset to classify into two classes that is imbalanced like this: p(y=1) is 0.9 and p(y=0) is 0.1, and suppose this holds for both the training and test sets. If we pick 10 samples from such a test pool, we expect about 9 samples of class one and 1 sample of class zero. So if we simply predict class one every time, without looking at anything else, we can get about 90% accuracy. In other words, the priors alone give about 90% accuracy without touching the features x; we can call this the base accuracy. If we also consider the features x, we can hope to do better than that. But if we drop the priors in this situation, the model can perform worse than the base accuracy.

The explanation and example above are valid only if the class proportions p(y=1) and p(y=0) are the same, or nearly the same, in the training and test datasets.
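
The base-accuracy argument can be checked with a few lines. The labels below are synthetic (90% class 1, 10% class 0), chosen only to match the example; they are not taken from the diabetes data.

```python
import numpy as np

# Synthetic labels matching the example: 90% class 1, 10% class 0.
y_test = np.array([1] * 90 + [0] * 10)

# A "prior only" classifier ignores x and always predicts the most common class.
majority = np.bincount(y_test).argmax()
y_pred = np.full_like(y_test, majority)

baseline_accuracy = np.mean(y_pred == y_test)
print(majority, baseline_accuracy)
```

Any classifier that uses the features should be compared against this baseline rather than against 50%.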


## Example 2: Text classification

This example has been adapted from a post by Jaya Aiyappan, available at
[Analytics Vidhya](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data).

We will generate a small dataset of sentences that are classified as either "statements" or "questions."

We will assume that the occurrence and placement of words within a sentence are independent of each other, so the sentence "this is my book" will have the same features as the sentence "is this my book." We will treat words without case sensitivity.
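
A minimal sketch of this bag-of-words assumption, using `collections.Counter` from the standard library rather than the functions defined in this lab. The helper name `bag_of_words` is made up for illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    # Lowercase, split on whitespace, and count each word; word order is discarded.
    return Counter(sentence.lower().split())

a = bag_of_words("this is my book")
b = bag_of_words("Is this my book")
print(a == b)
```

Because only the counts are kept, the statement and the question end up with identical features.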

In [8]:
# Generate text data for two classes, "statement" and "question"

text_train = [['This is my novel book', 'statement'],
              ['this book has more than one author', 'statement'],
              ['is this my book', 'question'],
              ['They are novels', 'statement'],
              ['have you read this book', 'question'],
              ['who is the novels author', 'question'],
              ['what are the characters', 'question'],
              ['This is how I bought the book', 'statement'],
              ['I like fictional characters', 'statement'],
              ['what is your favorite book', 'question']]

text_test = [['this is the book', 'statement'], 
             ['who are the novels characters', 'question'], 
             ['is this the author', 'question'],
            ['I like apples']]

# Load training and test data into pandas data frames

training_data = pd.DataFrame(text_train, columns= ['sentence', 'class'])
print(training_data)
print('\n------------------------------------------\n')
testing_data = pd.DataFrame(text_test, columns= ['sentence', 'class'])
print(testing_data)


                             sentence      class
0               This is my novel book  statement
1  this book has more than one author  statement
2                     is this my book   question
3                     They are novels  statement
4             have you read this book   question
5            who is the novels author   question
6             what are the characters   question
7       This is how I bought the book  statement
8         I like fictional characters  statement
9          what is your favorite book   question

------------------------------------------

                        sentence      class
0               this is the book  statement
1  who are the novels characters   question
2             is this the author   question
3                  I like apples       None


In [9]:
# Partition training data by class

stmt_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'statement']
question_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'question']
all_docs = [train['sentence'] for index,train in training_data.iterrows()]

# Get word frequencies for each sentence and class

def get_words(text):
    # text must be a list of sentences; a bare string would be iterated
    # character by character
    # Initialize word list
    words = []
    # Loop through each sentence in input array
    for text_row in text:
        # Check the number of words. Assume each word is separated by a blank space
        # so that the number of words is the number of blank spaces + 1
        number_of_spaces = text_row.count(' ')
        # Loop through the sentence and get words between blank spaces.
        for i in range(number_of_spaces):
            words.append([text_row[:text_row.index(' ')].lower()])
            text_row = text_row[text_row.index(' ')+1:]
        # Whatever remains after the last space is the last word
        words.append([text_row.lower()])
    return np.unique(words)

# Get frequency of each word in each document

def get_doc_word_frequency(words, text):  
    word_freq_table = np.zeros((len(text),len(words)), dtype=int)
    i = 0
    for text_row in text:
        # Insert extra space between each pair of words to prevent
        # partial match of words
        text_row_temp = ''
        for idx, val in enumerate(text_row):
            if val == ' ':
                text_row_temp = text_row_temp + '  '
            else:
                text_row_temp = text_row_temp + val.lower()
        text_row = ' ' + text_row_temp + ' '
        j = 0
        for word in words: 
            word = ' ' + word + ' '
            freq = text_row.count(word)
            word_freq_table[i,j] = freq
            j = j + 1
        i = i + 1
    
    return word_freq_table

In [10]:
# Get word frequencies for statement documents

word_list_s = get_words(stmt_docs)
word_freq_table_s = get_doc_word_frequency(word_list_s, stmt_docs)
tdm_s = pd.DataFrame(word_freq_table_s, columns=word_list_s)
print(tdm_s)

   are  author  book  bought  characters  fictional  has  how  i  is  like  \
0    0       0     1       0           0          0    0    0  0   1     0   
1    0       1     1       0           0          0    1    0  0   0     0   
2    1       0     0       0           0          0    0    0  0   0     0   
3    0       0     1       1           0          0    0    1  1   1     0   
4    0       0     0       0           1          1    0    0  1   0     1   

   more  my  novel  novels  one  than  the  they  this  
0     0   1      1       0    0     0    0     0     1  
1     1   0      0       0    1     1    0     0     1  
2     0   0      0       1    0     0    0     1     0  
3     0   0      0       0    0     0    1     0     1  
4     0   0      0       0    0     0    0     0     0  


In [11]:
# Get word frequencies over all statement documents

freq_list_s = word_freq_table_s.sum(axis=0) 
freq_s = dict(zip(word_list_s,freq_list_s))
print(freq_s)

{'are': 1, 'author': 1, 'book': 3, 'bought': 1, 'characters': 1, 'fictional': 1, 'has': 1, 'how': 1, 'i': 2, 'is': 2, 'like': 1, 'more': 1, 'my': 1, 'novel': 1, 'novels': 1, 'one': 1, 'than': 1, 'the': 1, 'they': 1, 'this': 3}


In [12]:
# Get word frequencies for question documents

word_list_q = get_words(question_docs)
word_freq_table_q = get_doc_word_frequency(word_list_q, question_docs)
tdm_q = pd.DataFrame(word_freq_table_q, columns=word_list_q)
print(tdm_q)

   are  author  book  characters  favorite  have  is  my  novels  read  the  \
0    0       0     1           0         0     0   1   1       0     0    0   
1    0       0     1           0         0     1   0   0       0     1    0   
2    0       1     0           0         0     0   1   0       1     0    1   
3    1       0     0           1         0     0   0   0       0     0    1   
4    0       0     1           0         1     0   1   0       0     0    0   

   this  what  who  you  your  
0     1     0    0    0     0  
1     1     0    0    1     0  
2     0     0    1    0     0  
3     0     1    0    0     0  
4     0     1    0    0     1  


In [13]:
# Get word frequencies over all question documents

freq_list_q = word_freq_table_q.sum(axis=0) 
freq_q = dict(zip(word_list_q,freq_list_q))
print(freq_q)
print(freq_list_s)
print(freq_list_q)

{'are': 1, 'author': 1, 'book': 3, 'characters': 1, 'favorite': 1, 'have': 1, 'is': 3, 'my': 1, 'novels': 1, 'read': 1, 'the': 2, 'this': 2, 'what': 2, 'who': 1, 'you': 1, 'your': 1}
[1 1 3 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 3]
[1 1 3 1 1 1 3 1 1 1 2 2 2 1 1 1]


In [14]:
# Get word probabilities for statement class
a = 1
prob_s = []
for count in freq_list_s:
    #print(word, count)
    prob_s.append((count+a)/(sum(freq_list_s)+len(freq_list_s)*a))
prob_s.append(a/(sum(freq_list_s)+len(freq_list_s)*a))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append((count+a)/(sum(freq_list_q)+len(freq_list_q)*a))
prob_q.append(a/(sum(freq_list_q)+len(freq_list_q)*a))   
    
    
print('Probability of words for "statement" class \n')
print(dict(zip(word_list_s, prob_s)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(word_list_q, prob_q)))

Probability of words for "statement" class 

{'are': 0.043478260869565216, 'author': 0.043478260869565216, 'book': 0.08695652173913043, 'bought': 0.043478260869565216, 'characters': 0.043478260869565216, 'fictional': 0.043478260869565216, 'has': 0.043478260869565216, 'how': 0.043478260869565216, 'i': 0.06521739130434782, 'is': 0.06521739130434782, 'like': 0.043478260869565216, 'more': 0.043478260869565216, 'my': 0.043478260869565216, 'novel': 0.043478260869565216, 'novels': 0.043478260869565216, 'one': 0.043478260869565216, 'than': 0.043478260869565216, 'the': 0.043478260869565216, 'they': 0.043478260869565216, 'this': 0.08695652173913043}
------------------------------------------- 

Probability of words for "question" class 

{'are': 0.05128205128205128, 'author': 0.05128205128205128, 'book': 0.10256410256410256, 'characters': 0.05128205128205128, 'favorite': 0.05128205128205128, 'have': 0.05128205128205128, 'is': 0.10256410256410256, 'my': 0.05128205128205128, 'novels': 0.0512820512

In [15]:
# Calculate prior for one class

def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
        if className == 'statement':
            idx = np.where(word_list_s == word)
            prob = prob * prob_s[np.array(idx)[0,0]]
        else:
            idx = np.where(word_list_q == word)
            prob = prob * prob_q[np.array(idx)[0,0]]   
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'

### In-lab exercise: Laplace smoothing

Run the code below and figure out why it fails.

When a word does not appear with a specific class in the training data, its class-conditional probability is 0, and we are unable to
get a reasonable probability for that class.

Research Laplace smoothing, and modify the code above to implement Laplace smoothing (setting the frequency of all words with frequency 0 to a frequency of 1).
Run the modified code on the test set.
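
A minimal sketch of add-$a$ (Laplace) smoothing in its usual form, where $a$ is added to every count rather than only to the zero counts. The helper name `smoothed_prob` is hypothetical. The example numbers are the "statement" class totals from this lab (26 word occurrences, 20 distinct words), and the vocabulary size is taken per class, as the lab code does; using one vocabulary shared by all classes is another common choice.

```python
def smoothed_prob(count, total_count, vocab_size, a=1):
    """Add-a estimate: (count + a) / (total_count + a * vocab_size)."""
    return (count + a) / (total_count + a * vocab_size)

# Totals for the "statement" class in this lab.
total_count, vocab_size = 26, 20

print(smoothed_prob(3, total_count, vocab_size))  # a word seen 3 times, e.g. "book"
print(smoothed_prob(0, total_count, vocab_size))  # a word never seen with this class
```

An unseen word now gets a small nonzero probability instead of zero, so one missing word no longer wipes out the whole product.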

In [16]:
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])
print('Getting prediction for %s"' % test_docs[0])
predict(test_docs[0])


Getting prediction for this is the book"


IndexError: index 0 is out of bounds for axis 1 with size 0

### Exercise 1.1 (10 points)

Explain why it failed and how to solve the problem.


The failure occurs because the lookup finds nothing. `np.where(word_list_s == word)` returns an empty index array when `word` is not in the vocabulary built from the training data for that class, and indexing that empty result with `[0,0]` raises the **out of bounds index error**. This happens in two ways here. First, `predict()` passes a single string to `get_words()`, which expects a list of sentences, so the sentence is split into single characters instead of words, and those are not in either vocabulary. Second, even with correct word splitting, a word can be **absent** from the training data for a class, as "who" is for the statement class.

What I did is pass the sentence to `get_words()` as a one-element list and check whether each word is present in `word_list_s` (for statements) or `word_list_q` (for questions) before looking it up. If the word is not present, I use its Laplace-smoothed probability $a / (N + aV)$, where $N$ is the total word count of the class, $V$ is its vocabulary size, and $a = 1$. This value is the extra last entry appended to `prob_s` and `prob_q` above, so an unseen word multiplies the result by a small nonzero number instead of causing an error.



### Exercise 1.2 (20 points)

Modify the code to make it work using Laplace smoothing. Include the functions `prior()`, `classCondProb()`, and `predict()`.

In [18]:
# YOUR CODE HERE
def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    # get_words() expects a list of sentences, so wrap the single sentence
    words = get_words([sentence])
    prob = 1
    for word in words:
        if className == 'statement':
            if word in word_list_s:
                idx = np.where(word_list_s == word)
                prob = prob * prob_s[np.array(idx)[0,0]]
            else:
                # Laplace smoothing: unseen word gets a/(N + a*V), the last entry of prob_s
                prob = prob * prob_s[-1]
        else:
            if word in word_list_q:
                idx = np.where(word_list_q == word)
                prob = prob * prob_q[np.array(idx)[0,0]]
            else:
                # Laplace smoothing: unseen word gets a/(N + a*V), the last entry of prob_q
                prob = prob * prob_q[-1]
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'

In [19]:
# Test function: Do not remove
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])

for sentence in test_docs:
    print('Getting prediction for %s"' % sentence)
    print(predict(sentence))
    
print("success!")
# End Test function

Getting prediction for this is the book"
question
Getting prediction for who are the novels characters"
question
Getting prediction for is this the author"
question
Getting prediction for I like apples"
statement
success!


**Expected result**:\
Getting prediction for this is the book"\
question\
Getting prediction for who are the novels characters"\
question\
Getting prediction for is this the author"\
question\
Getting prediction for I like apples"\
statement\
success!

### Take home exercise

Find a more substantial text classification dataset, clean up the documents, and build your NB classifier. Write a brief report on your in-lab and take home exercises and results here.

In [20]:
import pandas as pd

In [21]:
#test part

# dp_split = dp.str.split()
# dppd = pd.DataFrame(dp_split.to_list())
# dppd = dppd.fillna(value='bishal')

# sums = dppd[dppd[0].str.isalpha()][0].value_counts()
# for i in dppd.columns:
#     if i!=0:
#         sums = (dppd[dppd[i].str.isalpha()][i].value_counts()).radd(sums, fill_value=0)
        
# labels = []
# count = 0
# for word in sums.index:
#     if len(word)<3 and word!="no":
#         labels.append(word)
        
# sums.drop(labels = labels, inplace=True)
# sums = sums[sums>10]

In [22]:
data_pos = pd.read_fwf('postive.txt')
data_neg = pd.read_fwf('negative.txt')

pdata = data_pos['Positive_text']
ndata = data_neg['Negative_text']

print("Number of positive Reviews", pdata.shape[0])
print("Number of negative Reviews", ndata.shape[0])


Number of positive Reviews 5331
Number of negative Reviews 5331


In [23]:
import random

random.seed(32)
idx = [i for i in range(pdata.shape[0])]
random.shuffle(idx)
train_len = int(0.8*len(idx))
train_idx = idx[:train_len]
test_idx = idx[train_len:]

In [24]:
ptrain = pdata.iloc[train_idx]
ptest = pdata.iloc[test_idx]
ntrain = ndata.iloc[train_idx]
ntest = ndata.iloc[test_idx]

In [25]:
print(ptrain.shape, ptest.shape, ntrain.shape, ntest.shape)

(4264,) (1067,) (4264,) (1067,)


In [26]:
#outputs a series containing words as index with value as count
#input is a series of lines
def process_text(data):
    dp_split = data.str.split()
    dppd = pd.DataFrame(dp_split.to_list())
    dppd = dppd.fillna(value='bishal')

    sums = dppd[dppd[0].str.isalpha()][0].value_counts()
    for i in dppd.columns:
        if i!=0:
            sums = (dppd[dppd[i].str.isalpha()][i].value_counts()).radd(sums, fill_value=0)
            
    labels = []
    count = 0
    for word in sums.index:
        if len(word)<3 and word!="no":
            labels.append(word)
            
    labels.append('bishal')    
    sums.drop(labels = labels, inplace=True)
    return sums

In [27]:
#bag of words for pos and neg class with frequency
pos = process_text(ptrain)
neg = process_text(ntrain)

print(f"Number of words in positive class bag {pos.shape} \nNumber of words in negative class bag: {neg.shape}")

#bag of words in both positive class bag and negative class bag
tot = pos.add(neg, fill_value=0, axis = 0)
print("total number of words mixing both classes bags:", tot.shape)

Number of words in positive class bag (10372,) 
Number of words in negative class bag: (10598,)
total number of words mixing both classes bags: (15297,)


In [28]:
# pos/tot[pos.index]
tot.head(10)

aaa          1.0
aaliyah      3.0
aan          1.0
abandon      8.0
abandone     1.0
abandoned    2.0
abandono     1.0
abbas        1.0
abbass       1.0
abbott       2.0
dtype: float64

### Model

In [29]:
#since there are equal numbers of positive and negative reviews in our training dataset,
#both priors will be equal to 0.5
prob_y_pos = ptrain.shape[0]/(ntrain.shape[0]+ptrain.shape[0])
prob_y_neg = ntrain.shape[0]/(ntrain.shape[0]+ptrain.shape[0])

#per-word score for each class: the fraction of a word's training occurrences
#that fall in that class (count in class / count in both classes)
prob_word_pos = pos/tot[pos.index]
prob_word_neg = neg/tot[neg.index]

### Testing

In [30]:
#code to process the text sentence
def process_textsent(dppd, index): #dataframe of text, index to process the text at index
    return dppd.iloc[index][dppd.iloc[index].str.isalpha()]


In [31]:
def sentence_p_n_prob(sentence, prob_word_pos, prob_word_neg, prob_y_neg, prob_y_pos):
    pi_p = 1
    pi_n = 1
    for word in sentence.values:
        if word in prob_word_neg:
            pi_n = pi_n * prob_word_neg[word]
        if word in prob_word_pos:
            pi_p = pi_p * prob_word_pos[word]

    prob_s_n = prob_y_neg * pi_n
    prob_s_p = prob_y_pos * pi_p
    
    return prob_s_p, prob_s_n


In [32]:
def process_testdb(df):
    dp_split = df.str.split()
    dppd = pd.DataFrame(dp_split.to_list())
    dppd = dppd.fillna(value='@') #so that I could remove this while processing the text
    return dppd

In [33]:
processed_ptest = process_testdb(ptest)
processed_ntest = process_testdb(ntest)

#testing for positive reviews
sum_predict_p = 0
for index in range(processed_ptest.shape[0]):
    sentence = process_textsent(processed_ptest, index);
    prob_p, prob_n = sentence_p_n_prob(sentence, prob_word_pos, prob_word_neg, prob_y_neg, prob_y_pos)
    if prob_p >= prob_n:
        sum_predict_p += 1

#testing for negative reviews
sum_predict_n = 0
for index in range(processed_ntest.shape[0]):
    sentence = process_textsent(processed_ntest, index);
    prob_p, prob_n = sentence_p_n_prob(sentence, prob_word_pos, prob_word_neg, prob_y_neg, prob_y_pos)
    if prob_n >= prob_p:
        sum_predict_n += 1
        

In [34]:
print(f"Score for positive class: {sum_predict_p}/{ptest.shape[0]} \nScore for negative class: {sum_predict_n}/{ntest.shape[0]}")

Score for positive class: 803/1067 
Score for negative class: 788/1067


In [35]:
pos_per = sum_predict_p*100/ptest.shape[0]
neg_per = sum_predict_n*100/ntest.shape[0]
print("Percentage of correct positive reviews: ", pos_per)
print("Percentage of correct negative reviews: ", neg_per)
print("Total accuracy of model: ", (pos_per + neg_per) / 2)

Percentage of correct positive reviews:  75.25773195876289
Percentage of correct negative reviews:  73.8519212746017
Total accuracy of model:  74.55482661668229


## Discussion

### Datasets

Dataset source: http://www.cs.cornell.edu/people/pabo/movie-review-data

Citation: this data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales," Proceedings of the ACL, 2005.

This dataset contains two text files, one with positive movie reviews and the other with negative movie reviews, one review per line. Both files contain the same number of reviews, 5331. I built a naive Bayes classifier that predicts whether a given review is positive or negative. Since the two classes are the same size, I split each class the same way: 20% (1067 reviews per class) for testing and 80% (4264 reviews per class) for training.


### Preprocessing

The dataset is plain text and well formatted: each review sits on its own line, which makes it easy to divide into training and test sets. Since the positive and negative reviews come in two separate files, I can load each one directly with a simple file loader.

Although the dataset is well formatted, the reviews themselves are not so clean, since every reviewer writes in their own way. Some reviews contain non-English or near-English characters (maybe French characters, I don't know :) ). In addition, there are digits (0-9) and symbols such as the dollar sign, at sign, comma, single quote, double quote, dash, tilde, and a few more. These would be easy to filter out if they were written as separate tokens, but that is not the case: they appear inside words. Another major problem is words that have an apostrophe between their characters.

Symbols on the outside of a word can be removed with a simple strip method. But I couldn't figure out how to process words that contain symbols between their characters.
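
As a small illustration of the strip approach, the sketch below uses `str.strip` with `string.punctuation`. The helper name `strip_outer_symbols` is hypothetical, and the exact set of symbols the notebook stripped may differ.

```python
import string

def strip_outer_symbols(word):
    """Remove punctuation from both ends of a word; inner symbols stay."""
    return word.strip(string.punctuation)

print(strip_outer_symbols('"great!"'))  # great
print(strip_outer_symbols("don't"))     # don't
```

The second call shows the limitation mentioned above: an apostrophe between characters is left untouched.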

Another possible step is stemming, meaning replacing the variants of a word with its base form. For example, the word eat has forms such as eaten, ate, eats, and eating. This particular example may not help to distinguish positive from negative reviews, but I hope it serves the purpose. It would be better to apply stemming, but I did not do it here.

So what did I actually do? I checked whether each word contains any non-alphabetic character and filtered out every word that does. This is the simplest form of text processing and also a poor one, since it may discard many important words (words that help to distinguish between positive and negative reviews). It turns out there were not many such words. The bag of words for each class has around 10,000 words, and the combined collection of both bags has around 15,000. Hence, I was confident this would still let me build a decent model.
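
A minimal sketch of this filter is shown below. The helper name `keep_alphabetic` and the sample sentence are made up for illustration, and using `str.isalpha` is an assumption about how the check was done.

```python
def keep_alphabetic(words):
    """Keep only the words made up entirely of alphabetic characters."""
    return [w for w in words if w.isalpha()]

tokens = "a $5 movie that isn't half-bad , really".split()
print(keep_alphabetic(tokens))  # ['a', 'movie', 'that', 'really']
```

Note that `str.isalpha` also accepts accented letters such as "é", so a stricter ASCII-only filter would need an extra check.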

From the training data I removed the word 'the', which has a count of around 5000 (almost the same as the number of reviews), along with some two-letter words (unimportant words, some of which are not even real words), while keeping a few important ones.
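
This pruning step might look like the sketch below. The function name `drop_uninformative`, the `keep_short` list, and the toy counts are all hypothetical; the notebook does not say which two-letter words were kept.

```python
from collections import Counter

def drop_uninformative(counts, stop_words=("the",), keep_short=("no", "ok")):
    """Drop stop words and two-letter words, except those listed in keep_short."""
    return Counter({
        word: count
        for word, count in counts.items()
        if word not in stop_words and (len(word) != 2 or word in keep_short)
    })

# Toy word counts, not taken from the real dataset.
counts = Counter({"the": 5000, "of": 900, "no": 120, "good": 300})
filtered = drop_uninformative(counts)
print(sorted(filtered.items()))
```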


I didn't use any fancy text processing library. I used only pandas to load the dataset, process the text, and store the bags of words and almost everything else.


### Model

The model's parameters are the class priors, which are 0.5 each in my case since both classes have the same amount of data, and the probability of each word given each class. The pandas DataFrame and Series data structures make these calculations easy. Likewise, during testing, looking up the probability of each word in the collection of bags of words is easy with a pandas Series.
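
To make the roles of these parameters concrete, here is a minimal sketch on toy data. It uses plain dictionaries as a stand-in for the pandas Series, and the helper names `train_word_probs` and `log_posterior` are hypothetical. Add-one (Laplace) smoothing and summing log probabilities are assumptions of this sketch, not necessarily what the notebook's `sentence_p_n_prob` does.

```python
import math
from collections import Counter

def train_word_probs(reviews, vocabulary, alpha=1.0):
    """Estimate p(word | class) from tokenized reviews with add-alpha smoothing."""
    counts = Counter(word for review in reviews for word in review)
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

def log_posterior(words, word_probs, prior):
    """Unnormalized log posterior: log p(y) + sum of log p(word | y)."""
    score = math.log(prior)
    for w in words:
        if w in word_probs:  # words never seen in training are skipped
            score += math.log(word_probs[w])
    return score

# Toy training data, one list of tokens per review.
pos_train = [["great", "movie"], ["great", "fun"]]
neg_train = [["boring", "movie"], ["bad", "boring"]]
vocab = {w for review in pos_train + neg_train for w in review}

prob_word_pos = train_word_probs(pos_train, vocab)
prob_word_neg = train_word_probs(neg_train, vocab)
prob_y_pos = prob_y_neg = 0.5  # equal priors, as in the balanced dataset

sentence = ["great", "fun", "movie"]
score_p = log_posterior(sentence, prob_word_pos, prob_y_pos)
score_n = log_posterior(sentence, prob_word_neg, prob_y_neg)
print("positive" if score_p >= score_n else "negative")
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the comparison `score_p >= score_n` mirrors the decision rule used in the test loops.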

### Results
The scores of my model for each class are listed below:\
**Score for positive class: 803/1067 
Score for negative class: 788/1067**

Out of 1067 test reviews per class, my model classifies 803 correctly for the positive class and 788 for the negative class.

The corresponding scores in percentage are:\
**Percentage of correct positive reviews:  75.25773195876289
Percentage of correct negative reviews:  73.8519212746017**

**So the overall accuracy of my model on 2134 test reviews (1067 for each class) is 74.55%**
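
The overall accuracy can be cross-checked directly from the raw counts reported above. The variable names here are illustrative only.

```python
correct_pos, correct_neg = 803, 788  # correctly classified test reviews per class
n_pos = n_neg = 1067                 # test reviews per class

overall = 100 * (correct_pos + correct_neg) / (n_pos + n_neg)
print(f"{overall:.2f}%")  # 74.55%
```

Because both test sets have the same size, this equals the average of the two per-class percentages.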

Performance could be improved with proper preprocessing of the text and with more training data. Words should not be filtered out based only on the characters they contain. Proper stemming could also improve model performance.

In conclusion, improving the text processing should make this Naive Bayes classifier perform better.

--End--


In [36]:
#test part

# labels = []
# count = 0
# for word in sums.index:
#     labels.append(word.strip("""''"$#!?/[]{&}()`;:|~<>-=+-_./@%^*","""))
# labels

# sums.filter(regex = r"""[,@#!/-;<|>_:(),$.*]""", axis=0)
# dp.str.split(r" ", expand=True)


#depending upon the smallest count words and largest count words we can filter out that
# pos.nsmallest()
# pos.nlargest()
# pos = pos[pos<1000]