# In this assignment, you will apply machine learning to text. You will use scikit-learn's multinomial Naive Bayes classifier to build a model that classifies a review as either positive or negative. A review is positive if the customer gave a good review and negative if the customer gave a bad one.

## To build and train the model, there are two training files: a positive review file and a negative review file. All reviews in the positive review file are positive but carry no explicit label; the same holds for the negative review file.

## To test your model, there is a testSet.txt file. In this file, the first 2989 reviews are positive and reviews 2990 to 4321 are negative. Since this data is not labeled, you have to preprocess it to attach a label to each review.

# Also, you should apply the data preprocessing techniques you have learned from the study notes. At the end, report the accuracy of your classifier.

In [1]:
import os

import re

from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

import numpy as np

import random

In [2]:
Neg = 'TrainingDataNegative.txt'
Pos = 'TrainingDataPositive.txt'
Tes = 'testSet.txt'

In [3]:
Pr_Tes = 'Processed_Test.csv'
Pr_Tr = 'Processed_Training.csv'

## Preprocessing
### The following code was taken from the study notes and adapted to this problem

### For Training Data

In [4]:
file1 = open(Neg)

file2 = open(Pos)

writeFile1 = open(Pr_Tr,"w")

#list of bad characters to be removed from the data.
badChar = "[,!.?#@=$%^&*\n]" 
    
for line in file1:

    line = line.lower().replace("\t"," ")
    #First convert the line to lower case, then replace each tab with a single space.

    line = re.sub(badChar,"",line) 
    #Using a regular expression, remove all bad characters from the text.

    arr = line.split(" ")
    #Split the line on spaces and put all the words into a list.
        
    label = 'negative'

    words = " ".join(word for word in arr) 
    #All of the words in the list are joined back to form the original sentence.

    toWrite = label + "," + words 
    #Line to be written: Class label, Review

    writeFile1.write(toWrite)

    writeFile1.write("\n")
    #After writing every line put new line character.
    
for line in file2:

    line = line.lower().replace("\t"," ")
   
    line = re.sub(badChar,"",line) 

    arr = line.split(" ")
        
    label = 'positive'

    words = " ".join(word for word in arr) 

    toWrite = label + "," + words 

    writeFile1.write(toWrite)

    writeFile1.write("\n")

    
file1.close()
file2.close()
writeFile1.close()
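
The training loops and the test loop repeat the same cleaning steps. As a minimal sketch, those steps could be factored into small helpers. The names `clean_review` and `to_record` are hypothetical and do not come from the study notes; the character set is the same `badChar` pattern used above.

```python
import re

# Same characters as badChar above, compiled once for reuse.
BAD_CHARS = re.compile(r"[,!.?#@=$%^&*\n]")


def clean_review(line):
    """Lower-case the line, turn tabs into spaces, and strip bad characters."""
    line = line.lower().replace("\t", " ")
    return BAD_CHARS.sub("", line)


def to_record(label, line):
    """Build one 'label,review' row in the format of the processed CSV files."""
    return label + "," + clean_review(line)


print(to_record("positive", "Great\tphone, works!\n"))
```

Because commas are among the removed characters, the only comma left in each record is the one separating the label from the review.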

### For Testing Data

In [5]:
file3 = open(Tes)

writeFile2 = open(Pr_Tes,"w")

for num, line in enumerate(file3, 1):

    line = line.lower().replace("\t"," ")

    line = re.sub(badChar,"",line) 

    arr = line.split(" ")
        
    if num < 2990:
        label = 'positive'
    else:
        label = 'negative'

    words = " ".join(word for word in arr) 
    
    toWrite = label + "," + words 

    writeFile2.write(toWrite)

    writeFile2.write("\n")
        
        
file3.close()
writeFile2.close()

## Defining functions for : Getting Data and Label, and Calculating Baseline
### The following code was taken from the study notes

In [6]:
def getDataAndLabel(processedFilePath):
    #Reading the processed file.
    file = open(processedFilePath)

    label = []

    data = []

    for line in file:

        arr = line.replace("\n","").split(",") 
        #Split with comma.

        label.append(arr[0])
        #First element is class label.

        data.append(arr[1].replace("\n",""))
        #Second element is Review.

    file.close()
    #Close the file once all lines have been read.

    return data,label
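
`getDataAndLabel` relies on the preprocessing step having removed every comma from the review text, since `split(",")` would otherwise cut the review at its first comma. A slightly more defensive sketch splits only once; `parse_record` is a hypothetical helper name used for illustration.

```python
def parse_record(line):
    """Split a 'label,review' row at the first comma only."""
    label, review = line.rstrip("\n").split(",", 1)
    return label, review


print(parse_record("positive,great phone\n"))
print(parse_record("negative,bad, really bad\n"))
```

With `maxsplit=1`, any commas that survive in the review stay part of the review instead of being dropped.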

In [7]:
# calculate baseline : it is percentage of records belonging to majority class
def calBaseLine(data): 

    classValues = np.unique(data) 
    #from target values find out unique classes.

    highest = 0

    baseClass = ""
    
    #iterate over these classes to find number of records belonging to that class.
    for label in classValues: 
        
        #create a list containing only label either positive or negative.
        count = [i for i in data if i == label ] 

        count = len(count) 
        #find how many of them are positive or negative.

        if count > highest:

            highest = count

            baseClass = label

    print("Base Class :",baseClass)

    print("Base Line :",(float(highest)/len(data))*100)
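
The same baseline can be computed more compactly with `collections.Counter`. This is only a sketch of an equivalent calculation; `majority_baseline` is a hypothetical name, and the label list below is rebuilt from the class counts stated for testSet.txt (2989 positive out of 4321 reviews).

```python
from collections import Counter


def majority_baseline(labels):
    """Return the majority class and the percentage of records it covers."""
    base_class, count = Counter(labels).most_common(1)[0]
    return base_class, 100.0 * count / len(labels)


# Labels reconstructed from the stated test-set layout: 2989 positive, 1332 negative.
labels = ["positive"] * 2989 + ["negative"] * (4321 - 2989)
base_class, base_line = majority_baseline(labels)
print("Base Class :", base_class)
print("Base Line :", base_line)
```

Unlike `calBaseLine`, this version returns the values instead of printing them, which makes the baseline easier to compare against the accuracy in code.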

## Working with the data

In [8]:
Tr_data, Tr_label = getDataAndLabel(Pr_Tr)
Tes_data, Tes_label = getDataAndLabel(Pr_Tes)

### Code from Study Notes

In [9]:
count_vect = CountVectorizer() 
#Create an instance of CountVectorizer.

X_train_counts = count_vect.fit_transform(Tr_data) 
#Learn the vocabulary and build a matrix of token counts.

tfidf_transformer = TfidfTransformer() 
#Create an instance of TfidfTransformer.

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts) 
#Calculate Term Frequency times Inverse Document Frequency.
#It is stored in compressed sparse row format.

In [10]:
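#Shuffling the data: features and labels are shuffled together so each row keeps its label.
#Shuffling X_train_tfidf.toarray() alone would only reorder a temporary dense copy.
from sklearn.utils import shuffle
X_train_tfidf, Tr_label = shuffle(X_train_tfidf, Tr_label, random_state=0)  #arbitrary seed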
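
A row shuffle is only meaningful if the features and the labels are permuted with the same ordering. Note also that `toarray()` on a sparse matrix returns a new dense array, so shuffling that array in place leaves the sparse matrix untouched. The sketch below uses a small made-up NumPy array rather than the notebook's data, and `shuffle_together` is a hypothetical helper name.

```python
import numpy as np


def shuffle_together(X, y, seed=0):
    """Apply one random permutation to both the rows of X and the labels y."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    return X[perm], y[perm]


# Toy data, made up for illustration only.
X = np.array([[1, 0], [0, 1], [2, 2], [3, 0]])
y = np.array(["negative", "positive", "positive", "negative"])

X_shuffled, y_shuffled = shuffle_together(X, y)
for row, label in zip(X_shuffled, y_shuffled):
    print(row, label)
```

Indexing with a permutation array returns new arrays, so the originals stay intact and every row still carries its own label.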

In [11]:
model = MultinomialNB(fit_prior = True) 
#create an instance of multinomial Naive Bayes.

### Training and Testing!

In [12]:
model.fit(X_train_tfidf, Tr_label)

MultinomialNB()

In [13]:
#Testing!
X_new_counts = count_vect.transform(Tes_data)

X_new_tfidf = tfidf_transformer.transform(X_new_counts)
#create Term Frequency times Inverse Document Frequency for test data.

In [14]:
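#Shuffling the data: features and labels are shuffled together so each row keeps its label.
#Shuffling X_new_tfidf.toarray() alone would only reorder a temporary dense copy.
from sklearn.utils import shuffle
X_new_tfidf, Tes_label = shuffle(X_new_tfidf, Tes_label, random_state=0)  #arbitrary seed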

In [15]:
#Predicting!
predLabel = model.predict(X_new_tfidf)
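
The vectorize, transform, fit and predict steps above can also be chained in a scikit-learn `Pipeline`, which applies the fitted vocabulary and IDF weights to new text automatically. The following is a sketch on a made-up toy corpus, not the assignment data; the step names `counts`, `tfidf` and `nb` are arbitrary labels chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy reviews and labels, made up for illustration only.
toy_train = [
    "good great phone",
    "love it great",
    "bad terrible phone",
    "hate it awful",
]
toy_labels = ["positive", "positive", "negative", "negative"]

text_clf = Pipeline([
    ("counts", CountVectorizer()),   # token counts
    ("tfidf", TfidfTransformer()),   # TF-IDF weighting
    ("nb", MultinomialNB()),         # multinomial Naive Bayes
])

text_clf.fit(toy_train, toy_labels)
toy_pred = text_clf.predict(["great love", "terrible awful"])
print(toy_pred)
```

Calling `predict` on raw strings reuses the transformers fitted during `fit`, which mirrors the `transform` calls made on the test data above.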

In [16]:
print("Accuracy of Multinomial Naive Bayes Classifier is: ", np.mean(predLabel == Tes_label) * 100)

Accuracy of Multinomial Naive Bayes Classifier is:  73.52464707243693


In [17]:
calBaseLine(Tes_label)

Base Class : positive
Base Line : 69.17380236056468


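## Shuffling had no effect on accuracy, which is expected: multinomial Naive Bayes only aggregates per-class statistics, so the order of the rows does not matter.
### Since the majority-class baseline is 69.17%, an accuracy of 73.52% is only about 4.35 percentage points better, so the classifier's performance is not that good.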
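
Because the test set is imbalanced, an overall accuracy close to the baseline can hide weak performance on the minority class. One way to check this is to measure the fraction of correct predictions per class. The sketch below uses made-up toy labels, not the notebook's predictions, and `per_class_recall` is a hypothetical helper name.

```python
import numpy as np


def per_class_recall(true_labels, pred_labels):
    """For each true class, return the fraction of its records predicted correctly."""
    true = np.asarray(true_labels)
    pred = np.asarray(pred_labels)
    return {str(c): float(np.mean(pred[true == c] == c)) for c in np.unique(true)}


# Toy example: a classifier that always predicts the majority class.
toy_true = ["positive", "positive", "positive", "negative"]
toy_pred = ["positive", "positive", "positive", "positive"]

print("Accuracy:", float(np.mean(np.asarray(toy_true) == np.asarray(toy_pred))))
print("Per-class recall:", per_class_recall(toy_true, toy_pred))
```

In this toy case the accuracy equals the majority-class baseline while every negative review is misclassified, which is the situation a per-class breakdown is meant to expose.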