The dataset consists of tab-separated files containing phrases from the Rotten Tomatoes dataset.

train.tsv contains the phrases and their associated sentiment labels.

The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

## Read the data and load it as a dataframe in the variable "dataset". Note: the file is tab-separated ( 1 mark )

In [2]:
# Pass the separator by keyword; recent pandas versions no longer accept it positionally
dataset = pd.read_csv('train.tsv', sep='\t')

## Print the dataframe ( 1 mark )

In [3]:
display(dataset.head(3), dataset.tail(2))
# Displaying only 5 records. To print the whole dataset, uncomment the line below
#dataset

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2


Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
156058,156059,8544,avuncular,2
156059,156060,8544,chortles,2


## Print the distribution of the Sentiment ( 1 mark )

In [4]:
dataset.groupby(dataset.Sentiment).size()
#dataset.Sentiment.value_counts()

Sentiment
0     7072
1    27273
2    79582
3    32927
4     9206
dtype: int64
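
The counts above describe a heavily imbalanced label distribution. As a rough sketch, the snippet below recomputes the class shares from those counts (copied from the output, not re-read from `train.tsv`) and derives the accuracy that a trivial classifier would reach by always predicting the most frequent class. The variable names are illustrative only.

```python
# Class counts copied from the groupby output above.
sentiment_counts = {0: 7072, 1: 27273, 2: 79582, 3: 32927, 4: 9206}

total = sum(sentiment_counts.values())
shares = {label: count / total for label, count in sentiment_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Accuracy of a baseline that always predicts the most frequent class.
majority_label = max(sentiment_counts, key=sentiment_counts.get)
majority_baseline = sentiment_counts[majority_label] / total
print("majority class:", majority_label, "baseline accuracy:", round(majority_baseline, 3))
```

Roughly half of all phrases are labelled neutral, so any test accuracy should be read against that baseline.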

In [5]:
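# Tokenize on runs of ASCII letters and digits, dropping punctuation
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
# Unigram bag-of-words counts with English stop words removed
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(dataset['Phrase'])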
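
To make the bag-of-words step concrete, here is a standard-library approximation of what the cell above computes, applied to two made-up phrases. It is not scikit-learn's implementation: the regex mirrors the `RegexpTokenizer` pattern, but the stop-word set is a tiny hypothetical stand-in for `stop_words='english'`, and the result is a dense list of lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list; scikit-learn's English list is much longer.
STOP_WORDS = {"a", "of", "the", "is", "and"}

def tokenize(text):
    """Lowercase, split on the same pattern as the RegexpTokenizer, drop stop words."""
    tokens = re.findall(r"[a-zA-Z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

phrases = ["A series of escapades", "The series is good, good fun"]

# Vocabulary: sorted unique tokens, one column per token.
vocabulary = sorted({tok for phrase in phrases for tok in tokenize(phrase)})

# Document-term matrix: one row per phrase, one count per vocabulary entry.
matrix = []
for phrase in phrases:
    counts = Counter(tokenize(phrase))
    matrix.append([counts[tok] for tok in vocabulary])

print(vocabulary)  # ['escapades', 'fun', 'good', 'series']
print(matrix)      # [[1, 0, 0, 1], [0, 1, 2, 1]]
```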

## Divide the data into train and test in the ratio 80 and 20 respectively. ( 1 mark )

In [6]:
# text_counts is used as X (features) and dataset["Sentiment"] as y (labels)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text_counts, dataset.Sentiment, train_size = 0.8, test_size = 0.2)
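
The split above is unseeded, so the exact metrics reported further down will vary from run to run. One possible variant, shown here on small synthetic stand-in data rather than the real features, fixes `random_state` and passes `stratify` so that both partitions keep the same label proportions. The seed value 42, the toy arrays and the `*_demo` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with imbalanced labels (80 of class 0, 20 of class 1).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 80/20 split, as in the cell above
    random_state=42,    # arbitrary seed so the split is reproducible
    stratify=y_demo,    # preserve the class ratio in both partitions
)

print(X_tr.shape, X_te.shape)
print("test label counts:", np.bincount(y_te))
```

With the notebook's data the same keywords would be added to the existing call, for example `stratify=dataset.Sentiment`.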

## Train Multinomial Naive Bayes Classification model using Sklearn ( 2 marks )

In [7]:
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)

MultinomialNB()
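
For intuition about what `fit` estimates, the following is a from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB()`. It illustrates the general technique and is not scikit-learn's code; the toy documents, labels and function names are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Toy training data (invented): tokenized documents with binary labels.
train_docs = [
    (["good", "fun", "good"], "pos"),
    (["great", "fun"], "pos"),
    (["bad", "boring"], "neg"),
    (["boring", "bad", "bad"], "neg"),
]

def fit_multinomial_nb(docs, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    class_doc_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    for tokens, label in docs:
        token_counts[label].update(tokens)
    vocab = sorted({tok for tokens, _ in docs for tok in tokens})

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # Denominator: all token occurrences in class c plus alpha per vocabulary entry.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in tokens if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

log_prior, log_likelihood = fit_multinomial_nb(train_docs)
print(predict(["good", "fun"], log_prior, log_likelihood))     # pos
print(predict(["boring", "bad"], log_prior, log_likelihood))   # neg
```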

## Calculate the Test Accuracy, Precision, Recall, and Confusion Matrix on test data ( 4 marks, each carries 1 mark )

In [8]:
from sklearn.metrics import classification_report

y_pred = nb_clf.predict(X_test)

# Confusion matrix: rows are actual labels, columns are predicted labels
print('\n\033[1m\033[4m' + "Confusion Matrix" + '\033[0m')
display(pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True))

# Overall accuracy on the test set
print('\n\033[1m\033[4m' + "Accuracy:" + '\033[0m', nb_clf.score(X_test, y_test))

# Full classification report showing precision, recall, and F1 per class
print('\n\033[1m\033[4m' + 'Report for Sentiment Analysis\n' + '\033[0m')
print(classification_report(y_test, y_pred, digits=3))


[1m[4mConfusion Matrix[0m


Actual \ Predicted,0,1,2,3,4,All
0,383,692,291,34,4,1404
1,385,2282,2478,304,26,5475
2,124,1442,12539,1713,112,15930
3,17,250,2584,3290,403,6544
4,1,34,327,1012,485,1859
All,910,4700,18219,6353,1030,31212



[1m[4mAccuracy:[0m 0.6080674099705241

[1m[4mReport for Sentiment Analysis
[0m
              precision    recall  f1-score   support

           0      0.421     0.273     0.331      1404
           1      0.486     0.417     0.449      5475
           2      0.688     0.787     0.734     15930
           3      0.518     0.503     0.510      6544
           4      0.471     0.261     0.336      1859

    accuracy                          0.608     31212
   macro avg      0.517     0.448     0.472     31212
weighted avg      0.592     0.608     0.595     31212
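
The figures in the report follow directly from the confusion matrix. As a check, this sketch recomputes accuracy, per-class precision and per-class recall from the counts printed above (rows are actual classes, columns are predicted classes). The list-of-lists layout and the variable names are just one convenient way to write it.

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted.
cm = [
    [383,  692,   291,   34,   4],
    [385, 2282,  2478,  304,  26],
    [124, 1442, 12539, 1713, 112],
    [ 17,  250,  2584, 3290, 403],
    [  1,   34,   327, 1012, 485],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

# Precision: diagonal over column sum; recall: diagonal over row sum.
precision = [cm[i][i] / sum(cm[r][i] for r in range(n_classes)) for i in range(n_classes)]
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

print("accuracy:", round(accuracy, 3))
for i in range(n_classes):
    print(i, round(precision[i], 3), round(recall[i], 3))
```

The low recall for classes 0 and 4 reflects the matrix rows: most strongly negative and strongly positive phrases are predicted as the neighbouring, milder class.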



## Predict the class for the sentence : "I ate pizza last night at dominos which was very healthy and tasty" ( 2 marks )

In [9]:
# Use new names so the built-in input() and the training matrix text_counts are not overwritten
sentences = ["I ate pizza last night at dominos which was very healthy and tasty"]
labelMap = {0:"negative", 1:"somewhat negative", 2:"neutral", 3:"somewhat positive", 4:"positive"}
sentence_counts = cv.transform(sentences)
prediction = nb_clf.predict(sentence_counts)
print("Predicted class:", prediction, "-", labelMap[prediction[0]])

Predicted class: [3] - somewhat positive
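
A single predicted label hides how confident the model is. `MultinomialNB` also exposes `predict_proba`, which returns one probability per class. The sketch below shows the call on a tiny invented corpus so that it runs on its own; the phrases, labels and `demo_*` names are assumptions, and with the notebook's objects the equivalent call would be `nb_clf.predict_proba(cv.transform([...]))`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus using two of the five sentiment labels, for illustration only.
phrases = ["tasty and healthy", "very tasty pizza", "stale and bland", "bland cold pizza"]
labels = [4, 4, 0, 0]

demo_cv = CountVectorizer()
demo_clf = MultinomialNB().fit(demo_cv.fit_transform(phrases), labels)

sentence = ["healthy and tasty pizza"]
probabilities = demo_clf.predict_proba(demo_cv.transform(sentence))[0]

# classes_ gives the label that each probability column refers to.
for label, p in zip(demo_clf.classes_, probabilities):
    print(label, round(p, 3))
```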
