Text Classification (Sentiment Analysis) Using Bayes Rule 
==============


## Goal

Your goal in this part of the assignment is to implement a multinomial Naive Bayes classifier that uses a bag-of-words model to classify text (movie reviews) into different categories.
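
For reference, the decision rule that a multinomial Naive Bayes classifier evaluates can be written as below, assuming add-one (Laplace) smoothing. The symbols are notation introduced here for illustration only: $d$ is a document with tokens $w_1, \dots, w_n$, $c$ is a class, $N_c$ is the number of training documents in class $c$, $N$ is the total number of training documents, $\mathrm{count}(w, c)$ is how often word $w$ occurs in the training documents of class $c$, and $|V|$ is the vocabulary size.

$$
\hat{c}(d) = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right],
\qquad
P(c) = \frac{N_c}{N},
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + |V|}
$$

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long reviews.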

**Note:** You may use only the libraries we have discussed in class, i.e. numpy, scipy, and pandas.

Once you have built and tested the model on the provided dataset, you will use the techniques you have learned to compete in a [Kaggle](https://www.kaggle.com/c/word2vec-nlp-tutorial) competition and report your final score and leaderboard ranking to get full credit.

For the final submission, attach a screenshot of the leaderboard showing your score.

In [1]:
%pylab inline
import scipy.stats
from collections import defaultdict

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


In [2]:
import re
import bs4

def parse_string(string): 
    """
        Parse the input string and tokenize it using regular expressions:
        First clean the string so that it has no punctuation or numbers; it must contain only a-z and A-Z.
        Spaces must not be disturbed while doing this, but runs of multiple spaces are converted
        to a single space.
        Then convert the string to lower case and return its words as a list of strings.
        
        Example:
        --------
        Input :  computer scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????
        Output:  ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']
        
        Parameters:
        ----------
        string: string to be parsed...
        
        Returns:
        ---------
        list of tokens extracted from the string...
    """
    
    # Remove everything except letters and whitespace (this also drops "_" and digits)
    newstring = re.sub(r'[^a-zA-Z\s]', '', string)
    newstring = newstring.lower()
    
    # split() with no arguments collapses runs of whitespace and never yields empty tokens
    return newstring.split()
    
parse_string("computer scien_tist-s are,,,  the  rock__stars of tomorrow_ 123<cool>  ????")

['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']

In [3]:
def parse_file(filename): # Parse a given file
    """
        Parameters:
        ----------
        filename: name of text file to be read
   
        
        Returns:
        ---------
        read file as raw string (with \n, \t, \r, etc included)
    """
    
    text=None

    with open(filename,'r',encoding='utf-8') as f:
        text = f.read()
        
    return text

In [4]:
def files_to_strings(X):
    
    """
        Read an array (or list) of files where each file content is read in a string...
        Input:
        -------
        X an array (or list) of file names
        
        Returns:
        --------
        X as a numpy array with each row containing a read string from the file...
    """
    
    file_content=[]
    for i in X:
        content= parse_file(i)
        file_content.append(content)
    
    file_content=np.array(file_content)
    return file_content
    

In [5]:
from nose.tools import assert_equal, assert_list_equal

assert_list_equal(parse_string("computer scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"),
        ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool'], "Incorrect cleaning")


strings = files_to_strings(np.array(["./data/imdb1/neg/cv000_29416.txt", "./data/imdb1/pos/cv000_29590.txt"]))
with open("./data/imdb1/neg/cv000_29416.txt") as f:
    text = f.read()

assert_equal(strings[0], text, "At first index should be text of first file")
assert_equal(strings.shape, (2,), "Shape must be (2,) for two files in list")

In [6]:
import nltk

from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
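_STOPWORDS = None  # cached stop-word set, loaded on first use

def removestopwords(listofwords):
    """Return listofwords without English stop words (case-insensitive)."""
    global _STOPWORDS
    if _STOPWORDS is None:
        # `t` is the course-provided tools module; it must be imported before the first call
        _STOPWORDS = set(t.read_txt_file(r'./data/english.stop'))
    
    return [word for word in listofwords if word.lower() not in _STOPWORDS]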

In [8]:
from itertools import chain
from collections import defaultdict
import numpy as np

class NaiveBayes:
    ''' Implements the Naive Bayes For Text Classification... '''
    def __init__(self, classes):
        self.classes = classes  # e.g., ['neg', 'pos']
        self.priors = {}  # Prior probabilities for each class
        self.word_to_id = {}  # Unique word to ID mapping
        self.word_count = {cls: defaultdict(int) for cls in classes}  # Word counts per class
        self.class_count = {cls: 0 for cls in classes}  # Document counts per class
        self.total_words = {cls: 0 for cls in classes}  # Total word counts per class
        self.vocab_size = 0  # Vocabulary size

    def train(self, X, Y):
        ''' Train the Naive Bayes classifier using the given data and labels. '''
        # Preprocess the training data
        new_X = []
        for i in X:
            filtered_str = parse_string(i[0])
            filtered_stopwords = removestopwords(filtered_str)
            new_X.append(filtered_stopwords)
        
        X = new_X
        
        # Build vocabulary
        words = list(chain(*X))
        unique_words = sorted(set(words))
        self.word_to_id = {word: idx for idx, word in enumerate(unique_words, start=1)}
        self.vocab_size = len(self.word_to_id)
        
        # Calculate priors and word counts
        for cls in self.classes:
            cls_indices = np.where(Y == cls)[0]
            self.class_count[cls] = len(cls_indices)
            self.priors[cls] = np.log(self.class_count[cls] / len(Y))  # Use log priors directly
            
            for idx in cls_indices:
                for word in X[idx]:
                    self.word_count[cls][word] += 1
                    self.total_words[cls] += 1

    def test(self, X):
        ''' Test the trained classifier on the given test data. '''
        # Preprocess the test data
        new_X = []
        for i in X:
            filtered_str = parse_string(i[0])
            filtered_stopwords = removestopwords(filtered_str)
            new_X.append(filtered_stopwords)
        
        X = new_X
        predictions = []
        
        for document in X:
            log_probs = {}
            for cls in self.classes:
                log_probs[cls] = self.priors[cls]  # Initialize with log prior
                for word in document:
                    # Get word count in class with Laplace smoothing
                    word_count = self.word_count[cls].get(word, 0)
                    # Calculate log probability with Laplace smoothing
                    log_prob = np.log((word_count + 1) / (self.total_words[cls] + self.vocab_size))
                    log_probs[cls] += log_prob
            # Assign the class with the highest log probability
            predicted_class = max(log_probs, key=log_probs.get)
            predictions.append(predicted_class)
        
        return np.array(predictions)
    
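    def predict(self, x):
        ''' Predict the class of a single input example (list of words). '''
        # test() treats the first element of each row as the raw document string,
        # so join the words into one string; passing [x] would score only x[0].
        return self.test([[" ".join(x)]])[0]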
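
To make the scoring step concrete, here is a minimal standalone sketch of the same computation (log prior plus add-one smoothed log likelihoods) on a toy corpus. The corpus, the variable names, and the helpers `log_score` and `classify` are invented for illustration and are not part of the assignment code; the sketch uses only the standard library so that the arithmetic is easy to follow by hand.

```python
import math
from collections import Counter

# Toy training corpus: (tokens, label) pairs invented for illustration.
train_docs = [
    (["good", "great", "fun"], "pos"),
    (["great", "acting"], "pos"),
    (["bad", "boring"], "neg"),
    (["bad", "acting", "awful"], "neg"),
]

classes = ["neg", "pos"]
word_count = {c: Counter() for c in classes}   # per-class word frequencies
doc_count = Counter()                          # per-class document counts
for tokens, label in train_docs:
    word_count[label].update(tokens)
    doc_count[label] += 1

vocab = {w for tokens, _ in train_docs for w in tokens}
total_words = {c: sum(word_count[c].values()) for c in classes}

def log_score(tokens, c):
    """Log prior plus the sum of add-one smoothed log likelihoods."""
    score = math.log(doc_count[c] / len(train_docs))
    for w in tokens:
        score += math.log((word_count[c][w] + 1) / (total_words[c] + len(vocab)))
    return score

def classify(tokens):
    """Return the class with the highest log score."""
    return max(classes, key=lambda c: log_score(tokens, c))

print(classify(["great", "fun"]))     # pos
print(classify(["boring", "awful"]))  # neg
```

With 7 vocabulary words and 5 tokens per class, "great" scores (2 + 1) / (5 + 7) under `pos` but only (0 + 1) / (5 + 7) under `neg`, which is why the first example is labelled `pos`.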

In [9]:
import pandas as pd
import tools as t

In [10]:
tdir= "./data/imdb1/" # training dir...
#load data, get list of files for each class...
posfiles=t.get_files(tdir+'/pos','*',withpath=True)
negfiles=t.get_files(tdir+'/neg','*',withpath=True)

In [11]:
#generate training and testing data...
plabels=['pos']*len(posfiles)
nlabels=['neg']*len(negfiles) # one 'neg' label per negative file
labels=np.concatenate((plabels,nlabels)) # concatenate the +ve and -ve labels
tX=np.concatenate((posfiles,negfiles))
print ("Training data Dimensions =", tX.shape," Training labels dimensions=", labels.shape)

Training data Dimensions = (2000,)  Training labels dimensions= (2000,)


In [12]:
tX

array(['./data/imdb1//pos\\cv000_29590.txt',
       './data/imdb1//pos\\cv001_18431.txt',
       './data/imdb1//pos\\cv002_15918.txt', ...,
       './data/imdb1//neg\\cv997_5152.txt',
       './data/imdb1//neg\\cv998_15691.txt',
       './data/imdb1//neg\\cv999_14636.txt'], dtype='<U33')

In [17]:
X=files_to_strings(tX) # read each file into a string and return a numpy array of strings
X = X.reshape((X.shape[0], 1))
#Split the data into two halves: a training set and a test set...
traindata,trainlabels,testdata,testlabels=t.split_data(X,labels)
#Find the classes to train
classes=np.unique(labels)

In [18]:
#Now build a Naive Bayes classifier and test it...
print ('[Info] training a classifier for following classes {}, {}'.format(classes[0],classes[1]))
nb=NaiveBayes(classes)
nb.train(traindata,trainlabels)
pclasses=nb.test(testdata)
acc=np.sum(pclasses==testlabels)/float(testlabels.shape[0])
print ("[Info] Accuracy = {}".format(acc))

[Info] training a classifier for following classes neg, pos
[Info] Accuracy = 0.81


### Test Cells Start
#### Do not Modify

In [None]:
from nose.tools import assert_in

nb=NaiveBayes(classes)
nb.train(traindata,trainlabels)
assert_equal (nb.test(testdata).shape[0], testdata.shape[0])
assert_in( type(nb.predict(["ok"])) , [str, np.string_, np.str, np.str_] , "Predict should return a label \
                                                                                            not list or array")

In [48]:
from nose.tools import assert_greater

nb=NaiveBayes(classes)
nb.train(traindata,trainlabels)
pclasses=nb.test(testdata)
acc=np.sum(pclasses==testlabels)/float(testlabels.shape[0])
assert_greater(acc, 0.77, "Acc must be greater then 77% you are doing something wrong")    

In [None]:
from nose.tools import assert_equal

comment_pos = "A nice movie, the case was good. Overall a perfect play"
comment_neg = "A waste of time, cast was bad. a clear No!"

#generate training and testing data...
tX=np.concatenate((posfiles,negfiles))
X=files_to_strings(tX)
X = X.reshape((X.shape[0], 1))

plabels=['pos']*len(posfiles)
nlabels=['neg']*len(posfiles)
true_labels = np.concatenate((plabels,nlabels))
inverted_labels = np.concatenate((nlabels,plabels))

true_nb=NaiveBayes(classes)
true_nb.train(X,true_labels)

inverted_nb=NaiveBayes(classes)
inverted_nb.train(X,inverted_labels)

assert_equal( true_nb.predict(comment_pos.split()), "pos" )
assert_equal( true_nb.predict(comment_neg.split()), "neg" )

assert_equal( inverted_nb.predict(comment_pos.split()), "neg" )
assert_equal( inverted_nb.predict(comment_neg.split()), "pos" )

### Test Cells End

# Cross Validation

Now let's throw our method to the winds of different folds and measure its accuracy on each...
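
The helper `t.generate_folds` comes from the course-provided `tools` module, whose source is not shown here. As a rough sketch of what an n-fold split involves, the hypothetical function `make_folds` below shuffles the example indices, cuts them into `nfolds` blocks, and uses each block once as the test set. It returns tuples ordered as (train X, train Y, test X, test Y) to mirror how `folds[k][0]` to `folds[k][3]` are indexed in the cells that follow; the real helper may differ, for example by splitting each class separately.

```python
import numpy as np

def make_folds(X, Y, nfolds, seed=0):
    """Hypothetical stand-in for t.generate_folds: shuffle, then cut into nfolds parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # shuffled example indices
    parts = np.array_split(order, nfolds)    # nfolds nearly equal index blocks
    folds = []
    for k in range(nfolds):
        test_idx = parts[k]
        train_idx = np.concatenate([parts[j] for j in range(nfolds) if j != k])
        folds.append((X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]))
    return folds

# Tiny demonstration with 20 dummy "documents" stored in an array of shape (20, 1).
X_demo = np.array([["doc %d" % i] for i in range(20)])
Y_demo = np.array(["pos", "neg"] * 10)
demo_folds = make_folds(X_demo, Y_demo, nfolds=5)
for train_X, train_Y, test_X, test_Y in demo_folds:
    print(train_X.shape, test_X.shape)   # (16, 1) (4, 1) for every fold
```

Every example lands in exactly one test block, so averaging the per-fold accuracies uses each review for testing once.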

In [19]:
#Now let's generate n-fold training and testing data...
nfolds=10
folds=t.generate_folds(X,labels,nfolds) # generate folds for cross-validation
for k in range(len(folds)):
    print (folds[k][0].shape, folds[k][2].shape)

Generating CV data for 2 classes
(1800, 1) (200, 1)
(1800, 1) (200, 1)
(1800, 1) (200, 1)
(1800, 1) (200, 1)
(1800, 1) (200, 1)
(1800, 1) (200, 1)
(1800, 1) (200, 1)
(1800, 1) (200, 1)
(1800, 1) (200, 1)
(1800, 1) (200, 1)


In [20]:
totacc=[]
#train a classifier for each fold...
classes=np.unique(labels)

for k in range(nfolds):
    nb=NaiveBayes(classes)
    
    traindata=folds[k][0]
    trainlabels=folds[k][1]
    
    #Let's first train the classifier
    nb.train(traindata,trainlabels)
    
    testdata=folds[k][2]
    testlabels=folds[k][3]
    
    #Let's test the classifier
    pclasses= nb.test(testdata)
    
    #print pclasses
    acc=np.sum(pclasses==testlabels)/float(testlabels.shape[0])
    print ("[Info] Fold {} Accuracy = {}".format(k+1, acc))
    
    totacc.append(acc)

print (totacc)

mean_acc = np.mean(totacc)
print ('[Info] Mean Accuracy =', mean_acc)

[Info] Fold 1 Accuracy = 0.825
[Info] Fold 2 Accuracy = 0.81
[Info] Fold 3 Accuracy = 0.81
[Info] Fold 4 Accuracy = 0.835
[Info] Fold 5 Accuracy = 0.82
[Info] Fold 6 Accuracy = 0.805
[Info] Fold 7 Accuracy = 0.78
[Info] Fold 8 Accuracy = 0.815
[Info] Fold 9 Accuracy = 0.805
[Info] Fold 10 Accuracy = 0.825
[0.825, 0.81, 0.81, 0.835, 0.82, 0.805, 0.78, 0.815, 0.805, 0.825]
[Info] Mean Accuracy = 0.813


# Excellent, now it's time to go into the real waters of Kaggle.


You will need to create a Kaggle account and download the data for the competition ["Bag of Words Meets Bags of Popcorn"](https://www.kaggle.com/c/word2vec-nlp-tutorial/data). Note that you only need to download "labeledTrainData.tsv" and "testData.tsv".


"labeledTrainData.tsv" is used for training your model and therefore has a prespecified label for each example review. "testData.tsv" is used for testing your model and therefore has no labels. You will predict the label for each review and upload your results to the Kaggle server, which will evaluate your model and assign a score to your entry. You will report this score in your assignment submission.

**[Caution]** Kaggle limits the number of evaluations to 5 per 24 hours to reduce overfitting on the test set, so be careful and thoroughly test your model before submitting your entry to the Kaggle server.

Read the instructions on the competition page. Note that you are not allowed to use any library other than those we have covered in class.

In [21]:
import pandas as pd

In [22]:
# read the data-set
train=pd.read_csv('labeledTrainData.tsv',sep='\t')

In [23]:
train.describe()

       sentiment
count    25000.0
mean         0.5
std      0.50001
min          0.0
25%          0.0
50%          0.5
75%          1.0
max          1.0


In [24]:
train.head()

       id  sentiment  review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [25]:
Yt=train['sentiment']
Xt=train['review']
Xt=np.array(Xt)
Yt=np.array(Yt)

print (Xt.shape)

(25000,)


In [26]:
#read test set...
test=pd.read_csv('testData.tsv',sep='\t')

In [27]:
test.head()

         id  review
0  12311_10  Naturally in a film who's main themes are of m...
1    8348_2  This movie is a disaster within a disaster fil...
2    5828_4  All in all, this is a movie for kids. We saw i...
3    7186_2  Afraid of the Dark left me with the impression...
4   12128_7  A very accurate depiction of small time mob li...


#### Training Time 

In [28]:
# Let's split the training data into two halves and test our accuracy...
traindata,trainlabels,testdata,testlabels=t.split_data(Xt.reshape((Xt.shape[0],1)),Yt)
classes=np.unique(trainlabels)

In [29]:
# Now let's train the model and see its performance...
print ('[Info] training a classifier for following classes {}, {}'.format(classes[0],classes[1]))
nb=NaiveBayes(classes)
nb.train(traindata,trainlabels)
pclasses=nb.test(testdata)
acc=np.sum(pclasses==testlabels)/float(testlabels.shape[0])
print ("[Info] Accuracy = {}".format(acc) )

[Info] training a classifier for following classes 0, 1
[Info] Accuracy = 0.8564


#### Cross-Validation Time...

In [30]:
#Split the training data into 10 folds and test the classifier's performance...

nfolds=10
folds=t.generate_folds(Xt.reshape((Xt.shape[0],1)),Yt,nfolds) # generate folds for cross-validation
for k in range(len(folds)):
    print (folds[k][0].shape, folds[k][2].shape)

Generating CV data for 2 classes
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)
(22500, 1) (2500, 1)


In [31]:
# This takes a while, so be careful: it can turn your machine into a red-hot oven
totacc=[]
classes=np.unique(Yt)

for k in range(nfolds):
    nb=NaiveBayes(classes)
    
    traindata=folds[k][0]
    trainlabels=folds[k][1]
    
    #Let's first train the classifier
    nb.train(traindata,trainlabels)
    
    testdata=folds[k][2]
    testlabels=folds[k][3]
    
    #Let's test the classifier
    pclasses= nb.test(testdata)
    
    acc=np.sum(pclasses==testlabels)/float(testlabels.shape[0])
    print ("[Info] Fold {} Accuracy = {}".format(k+1, acc) ) 
    
    totacc.append(acc)

print (totacc)
print ('[Info] Mean Accuracy =', np.mean(totacc))

[Info] Fold 1 Accuracy = 0.8576
[Info] Fold 2 Accuracy = 0.8776
[Info] Fold 3 Accuracy = 0.866
[Info] Fold 4 Accuracy = 0.8552
[Info] Fold 5 Accuracy = 0.848
[Info] Fold 6 Accuracy = 0.8544
[Info] Fold 7 Accuracy = 0.8596
[Info] Fold 8 Accuracy = 0.85
[Info] Fold 9 Accuracy = 0.8516
[Info] Fold 10 Accuracy = 0.8436
[0.8576, 0.8776, 0.866, 0.8552, 0.848, 0.8544, 0.8596, 0.85, 0.8516, 0.8436]
[Info] Mean Accuracy = 0.8563600000000001


# Now let's train on the complete dataset and test on the provided test set...

In [33]:
classes= np.unique(Yt)
print ('Training a Classifier on Full training set with classes =', classes)
nb=NaiveBayes(classes)
nb.train(Xt.reshape(Xt.shape[0],1),Yt)

Training a Classifier on Full training set with classes = [0 1]


In [36]:
# Get the test data...
Xtest = test['review'].values  # Convert the Series to a NumPy array directly

# Reshape the array if needed
Xtest = Xtest.reshape((Xtest.shape[0], 1))

# Test the classifier on the provided test set...
pclasses = nb.test(Xtest)

In [37]:
#write the result in the kaggle's required format
output = pd.DataFrame( data={"id":test["id"], "sentiment":pclasses} )

# Use pandas to write the comma-separated output file
output.to_csv( "Naive_bays_Bag_of_Words_model.csv", index=False, quoting=3 )

# Time to Upload the prediction to Kaggle...

Now upload the result to Kaggle and see your ranking and score. With this simple method you can reach an accuracy of around 0.80960.

# Improvement by Excluding Stop Words...

You can improve your score further by excluding the commonly occurring words of the English language (also known as stop words).
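
As a small illustration of the idea, the sketch below filters a token list against a stop-word set. The set `demo_stop_words` is a tiny hand-picked stand-in for the contents of the `english.stop` file, and `drop_stop_words` is a hypothetical helper written for this example, not the function used by the classifier above.

```python
# A tiny hand-picked stop-word set, standing in for the contents of english.stop.
demo_stop_words = {"a", "the", "of", "was", "is", "and"}

def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["a", "waste", "of", "time", "the", "cast", "was", "bad"]
print(drop_stop_words(tokens, demo_stop_words))
# ['waste', 'time', 'cast', 'bad']
```

Storing the stop words in a `set` keeps each membership check constant time on average, which matters when every token of every review is tested.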



In [None]:
#read and create a set of stop words
stopwords=set(t.read_txt_file('./data/sentiment/data/english.stop'))
print (stopwords)