#### Importing Libraries for: stemming the reviews, using the Naive Bayes Classifier, creating a confusion matrix and using the Logistic Regression Classifier.

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.stem import PorterStemmer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression

# Task 1 - Naive Bayes Classification

In [2]:
ford_reviews = pd.read_csv('car_reviews.csv')

### Step 1 - Splitting the Data. 

##### Note: Python's 'int' converter truncates a decimal number toward zero, which for a positive value means rounding down to the nearest integer. Since 80% of our dataset is not a whole number of reviews, we need to add one to get the desired 1106:276 split of our dataset.
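##### The arithmetic behind this note can be checked in isolation. The sketch below assumes only the split sizes quoted above (1106 + 276 reviews); `math.ceil` is shown as an alternative and is not what the notebook uses.

```python
import math

# 1106 + 276 = 1382 reviews in total (taken from the split sizes quoted above)
total_count = 1106 + 276

exact = total_count * 0.8        # not a whole number of reviews
truncated = int(exact)           # int() drops the fractional part
split = truncated + 1            # the '+ 1' used in the notebook
rounded_up = math.ceil(exact)    # alternative: round up directly

print(truncated, split, rounded_up)

# int() truncates toward zero, so a negative value moves up rather than down
print(int(-1.5), math.floor(-1.5))
```

##### The two approaches agree here, but they differ when 80% of the dataset is already a whole number: `math.ceil` would leave it unchanged, while the '+ 1' would still add an extra review to the training set.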

In [3]:
#Creating a randomized vector whose length is the length of our dataset
total_count = ford_reviews.shape[0]
np.random.seed(0)
shuffle = np.random.permutation(total_count)

#Splitting the dataset into 'x' and 'y', so that it can be used in our model
#'x' represents the reviews and 'y' represents the sentiments 
x = ford_reviews.iloc[shuffle, 1]
y = ford_reviews.iloc[shuffle, 0]

#splitting the dataset in training and testing sets in a 80:20 ratio
split = int(total_count * 0.8) + 1  #the required 80% split
x_train = x[:split]
y_train = y[:split]

x_test = x[split:]
y_test = y[split:]


#### Splitting the test and train sets at the beginning ensures that the two sets stay separate till the end. From here onwards, we work on the test set and the train set separately. Both Step 2 and Step 3 are performed separately on the two sets, thus ensuring that the two don't mix.

### Step 2 - Data Clean-up

##### Creating an all-in-one function that performs all the data cleaning steps in one fell swoop.  

In [4]:
def review_cleaner(reviews):
    #ListOfbow represents the bag of words that we will get
    ListOfbow = []
    #Getting the list of all stopwords
    stopwords = nltk.corpus.stopwords.words("english")
    porter = PorterStemmer()
    for review in range(reviews.shape[0]):
        #bowdict represents each review that will be returned as dictionaries
        bowdict = {}
        #Tokenize our review
        original = nltk.word_tokenize(reviews.iloc[review])
        for words in original:
            #Turn all the words to lowercase
            words = words.lower()
            #Stems the words
            new_word = porter.stem(words)
            #Skip stopwords; otherwise add one to the count of the stem (starting from 0)
            if words not in stopwords:
                bowdict[new_word] = bowdict.get(new_word, 0) + 1
        ListOfbow.append(bowdict)
    return ListOfbow

#Creating a separate bag of words for our test and train set
BOW_train = review_cleaner(x_train)
BOW_test = review_cleaner(x_test)

In [5]:
print(x_train.iloc[0])
print(' ')
print(BOW_train[0])
print(' ')
print('Some interesting stemmed words and their count:')
print('mechan -', BOW_train[0]['mechan'])
print('nois -', BOW_train[0]['nois'])
print('turn -', BOW_train[0]['turn'])

 
{'admit': 1, 'much': 2, 'mechan': 2, 'detail': 1, 'technic': 1, 'get': 2, 'basic': 1, 'six': 1, 'year': 2, 'ago': 1, 'parent': 1, 'bought': 1, '1996': 2, 'ford': 3, 'windstar': 2, 'today': 1, 'constant': 1, 'pain': 1, 'problem': 4, 'sort': 1, 'nois': 5, 'time': 1, 'whenev': 1, 'car': 3, 'turn': 2, 'appli': 1, 'brake': 4, 'also': 2, 'warn': 1, 'light': 1, 'constantli': 1, 'come': 1, 'lead': 1, 'road': 1, 'trip': 2, 'worri': 1, 'make': 1, 'home': 1, 'safe': 1, 'usual': 1, 'use': 1, 'long': 3, 'unless': 1, 'lot': 1, 'stuff': 1, 'carri': 1, 'transmiss': 1, 'even': 2, 'stop': 1, 'work': 1, 'howev': 1, 'take': 5, 'van': 1, 'place': 1, 'fix': 3, 'claim': 1, 'hear': 1, 'end': 1, 'differ': 1, 'next': 1, 'month': 1, 'back': 1, 'week': 1, 'shop': 3, 'day': 1, 'event': 1, 'longer': 1, 'major': 1, 'incoveni': 1, 'recal': 1, 'although': 1, 'appear': 1, 'reason': 1, 'appar': 1, 'among': 1, 'truli': 1, 'bless': 2, 'sinc': 1, 'other': 1, 'accid': 1, 'caus': 1, 'head': 1, 'gasket': 1, 'epinion': 1, 's

#### As you can see, after clean-up, we are only left with the stems of our essential words (i.e., words that WILL affect our sentiments). These stems are then turned into a dictionary of counts as part of the 'bag-of-words' technique. I have kept the training set as it is for the sake of purity.

#### In the above review from the training set, there are two words with the stem 'mechan', viz., mechanics and mechanic. Both of these have been identified, as the count of 2 in the output shows. Similarly, the words 'noise' and 'noises' are identified by their stem 'nois', and the words 'turned' and 'turns' by their stem 'turn'.

#### Also, notice that the word 'Six' with an upper case 'S' has been converted to 'six' with a lower case 's'. The same can be seen for other words like 'Ford' and 'Windstar'.

##### Note: Use Ctrl + F to find all the instances of the above words/stems. 
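##### The counting part of the clean-up can also be expressed with `collections.Counter`. This is a minimal sketch and not the notebook's method: `TOY_STOPWORDS` and `toy_bag_of_words` are hypothetical names, the stopword set is a tiny hand-written stand-in for the NLTK list, `str.split` replaces the NLTK tokenizer, and stemming is left out entirely.

```python
from collections import Counter

# Hypothetical stand-in for nltk.corpus.stopwords.words("english")
TOY_STOPWORDS = {"the", "a", "and", "is", "it", "of"}

def toy_bag_of_words(review):
    """Lowercase the tokens, drop stopwords, and count what remains."""
    tokens = [tok.lower() for tok in review.split()]
    return Counter(tok for tok in tokens if tok not in TOY_STOPWORDS)

bow = toy_bag_of_words("The brake noise and the brake light")
print(dict(bow))
```

##### A `Counter` starts every unseen key at zero, so the separate "first time" and "seen before" branches are not needed.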

### Step 3 - Creating vectors for our Algorithm

#### For the purposes of this Lab, I have decided to use vectors of 1s and 0s, even though the previous step counted the occurrences of each stem. The counts, nonetheless, helped in checking that my code worked as intended.



In [6]:
#Creates a 'Megalist' - list of all the unique stemwords from our training set
megalist = []
for i in range(x_train.shape[0]):
    for j in BOW_train[i]:
        if j not in megalist:
            megalist.append(j)


#### In the above cell, we created a 'megalist' of all the unique stemwords from our training set. Then in the cell below, we use the megalist as a backdrop to create our vectors of 1s and 0s, where '1' indicates that a particular word from the megalist is present in a given review and '0' indicates its absence.

#### We use the same process on the test set to get vectors as well. However, there are some words in the test set that don't appear in the training set. Thus, we put an 'if' condition inside the for loops of the test case to account for this issue. It makes sure that the code only accounts for the words that are IN the megalist and ignores the ones that it doesn't recognize, so their absence leaves the vector unchanged.

##### The reason for creating a megalist in the first place was to avoid dimension errors in our algorithm
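##### The same vectorisation idea can be sketched with a dictionary that maps each word to its column, which avoids the repeated `list.index` search. This is an illustration on made-up bags of words, not the notebook's code; `build_vocabulary` and `to_binary_vectors` are hypothetical helper names.

```python
def build_vocabulary(bags):
    """Map each unique word to a column index, in order of first appearance."""
    vocab = {}
    for bag in bags:
        for word in bag:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_binary_vectors(bags, vocab):
    """One row per review: 1 if the vocabulary word occurs in it, else 0."""
    rows = []
    for bag in bags:
        row = [0] * len(vocab)
        for word in bag:
            idx = vocab.get(word)  # None for words never seen in training
            if idx is not None:
                row[idx] = 1
        rows.append(row)
    return rows

# Toy bags of words (made up for illustration)
train_bags = [{"brake": 2, "nois": 1}, {"nois": 3, "great": 1}]
test_bags = [{"great": 1, "sunroof": 1}]  # 'sunroof' is not in the training vocabulary

vocab = build_vocabulary(train_bags)
train_vectors = to_binary_vectors(train_bags, vocab)
test_vectors = to_binary_vectors(test_bags, vocab)

print(vocab)
print(train_vectors)
print(test_vectors)
```

##### Because the vocabulary is built from the training bags only, every train and test row has the same length, and the unseen word 'sunroof' simply contributes nothing.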

In [7]:
#Starting with all zeros
bayes_train_array = np.zeros((x_train.shape[0], len(megalist)))
for i in range(x_train.shape[0]):
    for j in BOW_train[i]:
        bayes_train_array[i][megalist.index(j)] = 1

#prints the first 5 rows for display
print(bayes_train_array[0:5])

[[1. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]


In [8]:
#Starting with all zeros
bayes_test_array = np.zeros((x_test.shape[0], len(megalist)))
for i in range(x_test.shape[0]):
    for j in BOW_test[i]:
        #This accounts for missing words in the megalist
        if j in megalist:
            bayes_test_array[i][megalist.index(j)] = 1

#prints the first 5 rows for display
print(bayes_test_array[0:5])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]


In [9]:
print(len(bayes_train_array[0]))
print(len(bayes_test_array[0]))
print(len(megalist))

10109
10109
10109


### Step 4 - Using the Naive Bayes Classifier!!

##### I opted to use the Multinomial NB tool from the scikit-learn library
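##### To make the idea behind the classifier concrete, here is a small pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on toy data. It is a simplified illustration of the technique, not scikit-learn's implementation; the function names, the four toy columns and the 'Neg'/'Pos' labels are assumptions made for the example.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total occurrences of each feature within class c
        counts = [sum(row[j] for row in rows) for j in range(n_features)]
        total = sum(counts) + alpha * n_features
        log_likelihood[c] = [math.log((cnt + alpha) / total) for cnt in counts]
    return log_prior, log_likelihood

def predict_multinomial_nb(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(x * lp for x, lp in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy columns (assumed): 'problem', 'nois', 'great', 'love'
X_toy = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y_toy = ["Neg", "Neg", "Pos", "Pos"]

toy_model = fit_multinomial_nb(X_toy, y_toy)
pred_a = predict_multinomial_nb(toy_model, [1, 1, 0, 0])
pred_b = predict_multinomial_nb(toy_model, [0, 0, 0, 1])
print(pred_a, pred_b)
```

##### The smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.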

In [10]:
#We fit (train) our model according to our training set
clf = MultinomialNB()
clf.fit(bayes_train_array, y_train)
#Then we predict sentiments for our test set
R = clf.predict(bayes_test_array)

In [11]:
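#A confusion matrix to test the accuracy of our algorithm
#Note: scikit-learn's signature is confusion_matrix(y_true, y_pred), with the first argument on the rows.
#Since the predictions R are passed first here, the rows are predicted classes and the columns are actual classes,
#so the 'Actual'/'Predicted' labels used below are swapped. The diagonal, and hence the accuracy, is unaffected.
n_test = len(y_test)  #276 reviews in the test set
conf_mat_NB = confusion_matrix(R, y_test)
conf_mat_NB = conf_mat_NB/n_test

#Creating a confusion matrix represented by actual number of reviews
values_NB = {'Actual Neg': conf_mat_NB[0]*n_test, 'Actual Pos': conf_mat_NB[1]*n_test}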
finale_NB = pd.DataFrame.from_dict(values_NB, orient='index', columns=['Predicted Neg', 'Predicted Pos'])

#Creating a confusion matrix represented by percentage of total reviews
percentage_NB = {'Actual Neg': conf_mat_NB[0]*100.0, 'Actual Pos': conf_mat_NB[1]*100.0}
felina_NB = pd.DataFrame.from_dict(percentage_NB, orient='index', columns=['Predicted Neg', 'Predicted Pos'])

print('The confusion matrix in terms of the number of reviews is:')
print(finale_NB)
print(" ")
print('The confusion matrix in terms of the percentages is:')
print(felina_NB)

The confusion matrix in terms of the number of reviews is:
            Predicted Neg  Predicted Pos
Actual Neg          106.0           28.0
Actual Pos           24.0          118.0
 
The confusion matrix in terms of the percentages is:
            Predicted Neg  Predicted Pos
Actual Neg      38.405797      10.144928
Actual Pos       8.695652      42.753623


In [12]:
#The percentage of correct predictions is the sum of true positive and true negatives
Correct_predictions_NB = conf_mat_NB[0][0] + conf_mat_NB[1][1]
print('The percentage of Sentiments predicted correctly by the Naive Bayes method is: ', Correct_predictions_NB*100,'%')

The percentage of Sentiments predicted correctly by the Naive Bayes method is:  81.15942028985508 %
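##### A note on reading confusion matrices: scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, and puts its first argument on the rows. The sketch below mirrors that convention in plain Python on toy labels; `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(first, second, labels):
    """Count label pairs; rows follow `first`, columns follow `second`."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, b in zip(first, second):
        matrix[index[a]][index[b]] += 1
    return matrix

labels = ["Neg", "Pos"]
y_true = ["Neg", "Neg", "Neg", "Neg", "Pos", "Pos"]  # toy actual labels
y_pred = ["Neg", "Neg", "Pos", "Pos", "Pos", "Neg"]  # toy predictions

rows_actual = confusion_counts(y_true, y_pred, labels)     # rows = actual
rows_predicted = confusion_counts(y_pred, y_true, labels)  # rows = predicted

accuracy = sum(rows_actual[i][i] for i in range(len(labels))) / len(y_true)
print(rows_actual)
print(rows_predicted)
print(accuracy)
```

##### Swapping the two arguments transposes the matrix, so the off-diagonal cells (false positives and false negatives) trade places while the diagonal and the accuracy stay the same.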


# Task 2 - Logistic Regression

#### For my Task 2, I used an alternate classification algorithm, viz., the Logistic Regression Algorithm. Conveniently, the scikit-learn library has tools for logistic regression as well, and the inputs for these tools are identical to the ones required for the Naive Bayes Algorithm.

#### As for my reason behind using the Logistic Regression method, I was concerned about the independence assumption of the Naive Bayes classifier. The Naive Bayes Classifier assumes that all the variables in our dataset are independent of each other given the sentiment. However, when working with reviews, where words are part of sentences, this might not be completely true. It is possible that 'phrases' rather than 'words' are the determining variables in predicting sentiments. If that were true, then individual words would NOT be independent of each other. In that case, Naive Bayes could be a far worse algorithm than Logistic Regression, which does not make this independence assumption and fits its weights for all the words jointly.

#### For reference, here is one of the articles that I found online which highlighted this point - https://dataespresso.com/en/2017/10/24/comparison-between-naive-bayes-and-logistic-regression/
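##### The two model forms can be written side by side. These are the standard textbook definitions rather than anything specific to this notebook, and the symbols are notation introduced here: $c$ is the sentiment class, $x_j \in \{0, 1\}$ marks whether the $j$-th megalist word is present, and $\beta_j$ are the regression weights.

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{j=1}^{n} P(x_j \mid c)
$$

$$
P(\text{Pos} \mid x_1, \dots, x_n) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{j=1}^{n} \beta_j x_j\right)\right)}
$$

##### The product in the first line is where the independence assumption enters: each word contributes its own factor given the class. In the second line the weights are fitted jointly, so the weight of one word can adjust for other words that tend to occur with it.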

In [13]:
#We fit (train) our model according to our training set
model = LogisticRegression()
model.fit(bayes_train_array, y_train)
#Then we predict sentiments for our test set
Q = model.predict(bayes_test_array)

In [14]:
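#A confusion matrix to test the accuracy of our algorithm
#Note: the predictions Q are passed first, so the rows are predicted classes and the columns are actual classes
#(scikit-learn expects confusion_matrix(y_true, y_pred)); the accuracy on the diagonal is unaffected.
n_test = len(y_test)  #276 reviews in the test set
conf_mat_LR = confusion_matrix(Q, y_test)
conf_mat_LR = conf_mat_LR/n_test

#Creating a confusion matrix represented by actual number of reviews
values_LR = {'Actual Neg': conf_mat_LR[0]*n_test, 'Actual Pos': conf_mat_LR[1]*n_test}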
finale_LR = pd.DataFrame.from_dict(values_LR, orient='index', columns=['Predicted Neg', 'Predicted Pos'])

#Creating a confusion matrix represented by percentage of total reviews
percentage_LR = {'Actual Neg': conf_mat_LR[0]*100.0, 'Actual Pos': conf_mat_LR[1]*100.0}
felina_LR = pd.DataFrame.from_dict(percentage_LR, orient='index', columns=['Predicted Neg', 'Predicted Pos'])

print('The confusion matrix in terms of the number of reviews is:')
print(finale_LR)
print(" ")
print('The confusion matrix in terms of the percentages is:')
print(felina_LR)

The confusion matrix in terms of the number of reviews is:
            Predicted Neg  Predicted Pos
Actual Neg          104.0           29.0
Actual Pos           26.0          117.0
 
The confusion matrix in terms of the percentages is:
            Predicted Neg  Predicted Pos
Actual Neg      37.681159      10.507246
Actual Pos       9.420290      42.391304


In [15]:
#The percentage of correct predictions is the sum of true positive and true negatives
Correct_predictions_LR = conf_mat_LR[0][0] + conf_mat_LR[1][1]
print('The percentage of Sentiments predicted correctly by the Logistic Regression method is: ', Correct_predictions_LR*100,'%')

The percentage of Sentiments predicted correctly by the Logistic Regression method is:  80.07246376811594 %


### Comparison

In [16]:
print('The percent-wise confusion matrix we got for Naive Bayes was:')
print(felina_NB)
print(" ")
print('The percent-wise confusion matrix we got for Logistic Regression was:')
print(felina_LR)

The percent-wise confusion matrix we got for Naive Bayes was:
            Predicted Neg  Predicted Pos
Actual Neg      38.405797      10.144928
Actual Pos       8.695652      42.753623
 
The percent-wise confusion matrix we got for Logistic Regression was:
            Predicted Neg  Predicted Pos
Actual Neg      37.681159      10.507246
Actual Pos       9.420290      42.391304


#### On comparing the values of the above two confusion matrices, it is apparent that Logistic Regression did slightly worse than Naive Bayes: 221 correct predictions out of 276 (80.07%) against 224 (81.16%). The errors are not concentrated in one class. For Logistic Regression, the true positives and true negatives are fewer, and the false negatives and false positives are higher. The gap therefore looks like a small overall difference in how well the parameters were estimated rather than a bias toward one sentiment. The Naive Bayes parameters (the class priors and per-class word probabilities) were perhaps a bit better than the Logistic Regression weights.
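##### The size of the gap can be recomputed from the diagonal counts printed above. The sketch also includes one rough way to gauge a three-review difference, a binomial standard error for a single accuracy estimate; that part is an added assumption (it treats the test reviews as independent) and is not a formal significance test.

```python
import math

n_test = 276
correct_nb = 106 + 118   # diagonal of the Naive Bayes matrix
correct_lr = 104 + 117   # diagonal of the Logistic Regression matrix

acc_nb = correct_nb / n_test
acc_lr = correct_lr / n_test
gap_in_reviews = correct_nb - correct_lr

# Rough binomial standard error of one accuracy estimate (assumes independent reviews)
std_err_nb = math.sqrt(acc_nb * (1 - acc_nb) / n_test)

print(f"Naive Bayes: {acc_nb:.4f}, Logistic Regression: {acc_lr:.4f}")
print(f"Gap: {gap_in_reviews} reviews ({(acc_nb - acc_lr) * 100:.2f} percentage points)")
print(f"Rough standard error of the Naive Bayes accuracy: {std_err_nb * 100:.2f} percentage points")
```

##### On a test set of this size, the standard error of one accuracy figure is about twice the gap between the two models, so the ranking could plausibly change with a different random split.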

#### I think the reason that Logistic Regression did worse than Naive Bayes in this case is that the dataset was a bit small. Logistic Regression tends to need more training data than Naive Bayes before its parameters are estimated well, and here there were only 1106 training reviews for 10109 features. However, it is hard to know the threshold for these 'small' datasets. I assumed ours would be big enough for it to work, but clearly it wasn't. From the looks of it, the parameters learned by Logistic Regression were not good enough to give us a better result than the parameters calculated by the Naive Bayes Classifier, since there was not enough data to train them on.