# Assessment3 task1

Task 1: Creating a Sentiment Prediction Platform (50/100) [max. 3 pages]
We consider a real-life dataset consisting of 50,000 labeled gourmet food reviews from Amazon
extracted from the work of McAuley and Leskovec [1]. A food review is labeled as 1 (positive) if it
received four or more stars and 0 (negative) otherwise. The dataset is balanced and we provide two
files (positive_reviews.csv and negative_reviews.csv) containing 25,000 positive and negative
reviews respectively (see Figure 1).

| Review | Label |
|---|---|
| Over 6.00 a box and people call this a good deal?? Waste of money. You can get better deals at Sam's Club or Costco... | 0 |
| This was a very good buy and arrived in perfect condition in a very timely manner. Yes, I would order again. | 1 |

Figure 1: Example of food reviews from the data
The goal is to train machine learning (ML) models for sentiment prediction, evaluate them, and
deploy the best model on a platform. The result should be similar to the service proposed on the
following website: https://monkeylearn.com/sentiment-analysis-online/
Subtasks:
1.1) Using your own words, please explain the steps that you will need to go through to
create your sentiment prediction platform. Your description should include (but not be limited to) the
following points (2 marks):
• Data retrieval
• Feature Extraction
• Feature Engineering
• Model Evaluation
• Deployment
1.2) Please provide a short description of the dataset provided, along with how you imported the
data, providing snippets of code along with a detailed description (2 marks).
1.3) Employ exploratory data analysis (EDA) techniques to gain an initial understanding of the
data. Please provide appropriate visualization results and initial insights gained from EDA
(4 marks).
1.4) Motivate, explain, and apply any necessary pre-processing techniques on your food reviews.
Using an example, show how a string is transformed after each processing step. (6 marks)
1.5) Implement the following techniques for sentiment prediction:
• Logistic regression with BOW and TF-IDF word features
• Support vector machine with BOW and TF-IDF word features
• Long short-term memory (LSTM) network with an embedding layer
For each of them, describe in detail how you deployed them and adjusted their parameters,
going into detail on what each parameter does as well. You may use open source code and
libraries as long as you acknowledge them (14 marks).
1.6) Split your dataset into a training and test set of size 35,000 and 15,000 respectively. Train
and evaluate the performance of the techniques developed in task 1.5. Please present and
discuss your results using metrics and/or tables. (8 marks)
1.7) Select and save the “best” model, then deploy it on an online platform of your choice. The
end-user should be able to input a string of text and receive its polarity (along with a
confidence score). The report should contain the URL of the online platform along with
detailed explanations and screenshots. You may use Flask or Django and services like
Heroku. (14 marks)
********************************
Bonus – Optional: Should you decide to implement any further explanation features on your online
platform along with the predicted sentiment, there will be a bonus of up to 5 marks. To explain the
intuition of your explanation method and the code, you will be allowed at most one additional page.
The maximum overall mark for this assessment remains at 100/100; however, attempting the bonus
exercise will make you practice more on developing algorithms on your own and enhance your
chances of getting a higher mark overall.
********************************

In [4]:
!pip install contractions

Collecting contractions
  Using cached contractions-0.0.43-py2.py3-none-any.whl (6.0 kB)
Collecting textsearch
  Using cached textsearch-0.0.17-py2.py3-none-any.whl (7.5 kB)
Installing collected packages: textsearch, contractions
Successfully installed contractions-0.0.43 textsearch-0.0.17


In [6]:
!pip install afinn

Collecting afinn
  Downloading afinn-0.1.tar.gz (52 kB)
Building wheels for collected packages: afinn
  Building wheel for afinn (setup.py): started
  Building wheel for afinn (setup.py): finished with status 'done'
  Created wheel for afinn: filename=afinn-0.1-py3-none-any.whl size=53449 sha256=bc60bf6f92809587635d26d3dbe337e8c960b54ed73fe8d9e5b71200d1d246dd
  Stored in directory: c:\users\user\appdata\local\pip\cache\wheels\9d\16\3a\9f0953027434eab5dadf3f33ab3298fa95afa8292fcf7aba75
Successfully built afinn
Installing collected packages: afinn
Successfully installed afinn-0.1


In [1]:
# Usual data representation and manipulation libraries
import pandas as pd
import numpy as np
from collections import Counter

# NLTK is very useful for natural language applications
import nltk

# This will be used to tokenize sentences
from nltk.tokenize.toktok import ToktokTokenizer

# We use spacy for extracting useful information from English words
import spacy
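# 'en_core_web_sm' must be installed first: python -m spacy download en_core_web_sm
# The tagger stays enabled because lemmas and POS tags are used later on
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])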

# This dictionary will be used to expand contractions (e.g. we'll -> we will)
from contractions import contractions_dict
import re

# Unicodedata will be used to remove accented characters
import unicodedata

# BeautifulSoup will be used to remove html tags
from bs4 import BeautifulSoup

# Lexicon models
from afinn import Afinn
from nltk.corpus import sentiwordnet as swn
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Evaluation libraries
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [2]:
#We import the dataset
negative_reviews_raw = pd.read_csv("D:/MScAI/Semester1/CS5079AppliedAI/Assessm3/negative.csv")
positive_reviews_raw = pd.read_csv("D:/MScAI/Semester1/CS5079AppliedAI/Assessm3/positive.csv")
print(negative_reviews_raw.head(), negative_reviews_raw.shape)
print(positive_reviews_raw.head(), positive_reviews_raw.shape)                                  

                                              Review  Label
0  We love Malibu Rum but they sure missed the ma...      0
1  I just wanted to say that if you want to get y...      0
2  These seeds were accompanied by small broken p...      0
3  Way way way overpriced. I can get this same se...      0
4  I bought these on the strength of the reviews....      0 (25000, 2)
                                              Review  Label
0  Better than Wolff's Kasha. I grew up eating Ka...      1
1  It was such good product. Came in two differen...      1
2  MMMM Yes all chocolate is good.<br />But some ...      1
3  This is, as all of their cereals I've ordered ...      1
4  Whoever Photoshopped the cookie on the front o...      1 (25000, 2)


### Exploratory Data Analysis (EDA)

In [26]:
positive_reviews_raw['Review']

0        Better than Wolff's Kasha. I grew up eating Ka...
1        It was such good product. Came in two differen...
2        MMMM Yes all chocolate is good.<br />But some ...
3        This is, as all of their cereals I've ordered ...
4        Whoever Photoshopped the cookie on the front o...
                               ...                        
24995    Healthy alternative, high in protein, especial...
24996    I have tried a half a dozen unsweetened drinks...
24997    Is a little bit more costly then other dog foo...
24998    My husband is from Hungary and craves foods th...
24999    Katy Perry has sung the praises of <a href="ht...
Name: Review, Length: 25000, dtype: object

In [3]:
negative_reviews_raw.info()
positive_reviews_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  25000 non-null  object
 1   Label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  25000 non-null  object
 1   Label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [9]:
positive_reviews_raw['Review'].describe()

count                                                 25000
unique                                                23755
top       I'm addicted to salty and tangy flavors, so wh...
freq                                                      6
Name: Review, dtype: object

In [8]:
negative_reviews_raw['Review'].describe()

count                                                 25000
unique                                                21682
top       This review will make me sound really stupid, ...
freq                                                     29
Name: Review, dtype: object

In [12]:
print (positive_reviews_raw.Review.map(lambda x: len(x)).max())
print (negative_reviews_raw.Review.map(lambda x: len(x)).max())

10112
9296


In [15]:
print (positive_reviews_raw.Review.map(len).mean())
print (negative_reviews_raw.Review.map(len).mean())

420.4514
495.21012


In [16]:
print (positive_reviews_raw.Review.map(len).min())
print (negative_reviews_raw.Review.map(len).min())

45
33
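
The three length checks above (maximum, mean, minimum) can also be collected in a single pass per class. The following is a minimal sketch using only the standard library; the two short lists are hypothetical stand-ins for the `Review` columns, not data from the files.

```python
from statistics import mean

# Hypothetical stand-ins for the two review columns (not taken from the dataset)
positive_reviews = ["Great tea, would order again.", "Tasty!"]
negative_reviews = ["Stale on arrival and far too expensive.", "Not good."]

def length_summary(reviews):
    """Return the min, mean and max character length of a list of strings."""
    lengths = [len(r) for r in reviews]
    return {"min": min(lengths), "mean": mean(lengths), "max": max(lengths)}

summary = {
    "positive": length_summary(positive_reviews),
    "negative": length_summary(negative_reviews),
}
for label, stats in summary.items():
    print(label, stats)
```

With the real data, the same helper could be applied to `positive_reviews_raw.Review` and `negative_reviews_raw.Review` to put the per-class figures side by side.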


# Pre-processing

In [19]:
positive_reviews_raw.Review[2]

'MMMM Yes all chocolate is good.<br />But some chocolates are better than others and this is a "better than others chocolate"<br />It has a smooth, easy creamy taste not overly sweet IMO.<br />It is just pretty much perfect'

In [17]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [18]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def remove_special_characters(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text
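
A note on character-class ranges such as the one used in `remove_special_characters`: `A-Z` and `A-z` are not equivalent. The range `A-z` also covers the six ASCII characters that sit between `Z` and `a`, so those characters survive the filter. The short sketch below compares the two on a made-up string.

```python
import re

text = "five_star [really] good^ 100%"

# A-Z stops at 'Z'; A-z also spans the six ASCII characters between 'Z' and 'a':
# [ \ ] ^ _ `
strict = re.sub(r"[^a-zA-Z0-9\s]", "", text)
loose = re.sub(r"[^a-zA-z0-9\s]", "", text)

print(strict)  # fivestar really good 100
print(loose)   # five_star [really] good^ 100
```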

In [24]:
def expand_contractions(text, contraction_mapping=contractions_dict):
    # Build one alternation pattern from the keys of the mapping that was passed in
    contractions_pattern = re.compile('({})'.format('|'.join(map(re.escape, contraction_mapping.keys()))),
                                      flags=re.IGNORECASE|re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Try the exact match first, then fall back to the lowercase form
        expanded_contraction = contraction_mapping.get(match) or contraction_mapping.get(match.lower())
        # Keep the capitalisation of the first character of the original text
        return first_char + expanded_contraction[1:] if expanded_contraction is not None else match

    expanded_text = contractions_pattern.sub(expand_match, text)
    # Drop any apostrophes left over after expansion
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

In [25]:
def lemmatize_text(text):
    text = nlp(text)
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])

In [26]:
#nltk.download('stopwords')
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

def remove_stopwords(text, is_lower_case=False):
    tokenizer = ToktokTokenizer()
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


In [27]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True):
    i=0
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        i+=1
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
            if i==3: print(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
            if i==3: print(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
            if i==3: print(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
            if i==3: print(doc)
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        if i==3: print(doc)
        # insert spaces between special characters to isolate them    
        special_char_pattern = re.compile(r'([{.(-)!}])')
        doc = special_char_pattern.sub(" \\1 ", doc)
        if i==3: print(doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
            if i==3: print(doc)
        # remove special characters    
        if special_char_removal:
            doc = remove_special_characters(doc)
            if i==3: print(doc)
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        if i==3: print(doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            if i==3: print(doc)
            
        normalized_corpus.append(doc)
    return normalized_corpus
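
To show how a string changes after each step, the pipeline can also be expressed as a list of named steps whose intermediate results are recorded. The sketch below is a simplified, standard-library-only illustration: the helper names are hypothetical, it leaves out contraction expansion, lemmatisation and stopword removal, and its regex-based tag stripping replaces each tag with a space, whereas `BeautifulSoup.get_text()` joins the surrounding text directly.

```python
import html
import re
import unicodedata

def strip_tags(text):
    # Crude tag removal for illustration; each tag becomes a single space
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def strip_accents(text):
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def drop_special(text):
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()

STEPS = [
    ("strip HTML", strip_tags),
    ("remove accents", strip_accents),
    ("lowercase", str.lower),
    ("remove special characters", drop_special),
    ("collapse whitespace", squeeze_spaces),
]

def trace(text, steps=STEPS):
    """Apply each step in order and record the string after every step."""
    history = [("raw", text)]
    for name, step in steps:
        text = step(text)
        history.append((name, text))
    return history

for name, value in trace("Crème brûlée is GOOD.<br />Not overly sweet!"):
    print(f"{name:>26}: {value}")
```

Recording the history in a list keeps the tracing separate from the normalisation logic, so no document index has to be hard-coded to obtain the printout.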

In [28]:
# positive_reviews_raw['Review'] = normalize_corpus(positive_reviews_raw.Review)
# positive_reviews_raw.to_csv("normalized_positive_reviews.csv", index = False)
# negative_reviews_raw['Review'] = normalize_corpus(negative_reviews_raw.Review)
# negative_reviews_raw.to_csv("normalized_negative_reviews.csv", index = False)

MMMM Yes all chocolate is good.But some chocolates are better than others and this is a "better than others chocolate"It has a smooth, easy creamy taste not overly sweet IMO.It is just pretty much perfect
MMMM Yes all chocolate is good.But some chocolates are better than others and this is a "better than others chocolate"It has a smooth, easy creamy taste not overly sweet IMO.It is just pretty much perfect
MMMM Yes all chocolate is good.But some chocolates are better than others and this is a "better than others chocolate"It has a smooth, easy creamy taste not overly sweet IMO.It is just pretty much perfect
mmmm yes all chocolate is good.but some chocolates are better than others and this is a "better than others chocolate"it has a smooth, easy creamy taste not overly sweet imo.it is just pretty much perfect
mmmm yes all chocolate is good.but some chocolates are better than others and this is a "better than others chocolate"it has a smooth, easy creamy taste not overly sweet imo.it is 

# Creating the train and test data sets

In [33]:
normalized_positive_reviews = pd.read_csv("normalized_positive_reviews.csv")
normalized_negative_reviews = pd.read_csv("normalized_negative_reviews.csv")


In [34]:
X_test_df = pd.concat([normalized_positive_reviews.Review.iloc[:7500],
                       normalized_negative_reviews.Review.iloc[:7500]],
                       ignore_index=True)
y_test_df = pd.concat([normalized_positive_reviews.Label.iloc[:7500],
                       normalized_negative_reviews.Label.iloc[:7500]], 
                       ignore_index=True)
# Each file holds 25,000 rows, so the remaining 17,500 rows per class form the training set
X_train_df = pd.concat([normalized_positive_reviews.Review.iloc[7500:],
                        normalized_negative_reviews.Review.iloc[7500:]],
                        ignore_index=True)
y_train_df = pd.concat([normalized_positive_reviews.Label.iloc[7500:],
                        normalized_negative_reviews.Label.iloc[7500:]],
                        ignore_index=True)

print(X_test_df.shape, y_test_df.shape)
print(X_train_df.shape, y_train_df.shape)
X_train_df=X_train_df.astype('str')
X_train_df

(15000,) (15000,)
(35000,) (35000,)


0        always favorite coffee cremor make coffee smoo...
1        cat discerning palate eat fancy feast fragrant...
2        first attempt gluten free bread make regular b...
3        fiasconaro panettone make wonderful christmas ...
4        sparkling blackberry not get cola hope nice dr...
                               ...                        
34995    many glowing review feel little ashamed put lu...
34996    give 3 star purpose bake mix really good panca...
34997    affordable bully stick office fill incredible ...
34998    get daughter love blood hot chocolate get stuf...
34999    not enough flavor murky brown color lack aroma...
Name: Review, Length: 35000, dtype: object
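
The split above takes the first 7,500 rows of each file as the test set, so it depends on the order in which the reviews are stored. One possible alternative, sketched here with the standard library only, is to shuffle each class with a fixed seed before slicing. The function name, the seed, and the toy lists are assumptions made for illustration.

```python
import random

def split_per_class(items, n_test, seed=0):
    """Shuffle one class with a fixed seed, then cut it into (test, train)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_test], shuffled[n_test:]

# Hypothetical stand-ins for the two normalised review columns
positives = [f"pos review {i}" for i in range(10)]
negatives = [f"neg review {i}" for i in range(10)]

pos_test, pos_train = split_per_class(positives, n_test=3)
neg_test, neg_train = split_per_class(negatives, n_test=3)

# Both splits stay balanced because each class is cut at the same position
X_test = pos_test + neg_test
y_test = [1] * len(pos_test) + [0] * len(neg_test)
X_train = pos_train + neg_train
y_train = [1] * len(pos_train) + [0] * len(neg_train)

print(len(X_test), len(X_train))  # 6 14
```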

In [35]:
X_test = np.array(X_test_df)
y_test = np.array(y_test_df)
X_train = np.array(X_train_df)
y_train = np.array(y_train_df)
print (len(X_test), len(y_test))
print (len(X_train), len(y_train))
X_test

15000 15000
35000 35000


array(['well wolffs kasha grow eat kasha easy prepare healthy meal bobs red mill great selection thing grain ever portland area check store restaurant great vegetarian selection well meat eater stuff',
       'good product come two different box describe cover plastic wrap together',
       'mmmm yes chocolate good chocolate well well chocolateit smooth easy creamy taste not overly sweet imo pretty much perfect',
       ...,
       'three dog love greenie one dog not chew food gulps whole cut greenie small portion help clean tooth cause diarrhea think look another product tooth clean',
       'buy 5 bag think would great work energy protein mineral content first bag open okay could tolerate shake sprinkle food hemp protein powder open next bag share friend horrify bitter nasty taste not know get bad batch stuff would try one bag becausew stick 5 bag',
       'vanilla coffee taste pretty good mild expect buy discount amazon warehouse otherwise k cup little expensive amazon usually mean 

# Afinn

In [71]:
afn = Afinn(emoticons=True)

In [73]:
T = lambda x: 1 if x else 0
y_predicted = [T(x) for x in [afn.score(review)>=0 for review in X_test]]

In [74]:
print("The model accuracy score is: {}".format(accuracy_score(y_test, y_predicted)))
print("The model precision score is: {}".format(precision_score(y_test, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(y_test, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(y_test, y_predicted, average="weighted")))

print(classification_report(y_test, y_predicted))

display(pd.DataFrame(confusion_matrix(y_test, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.6056
The model precision score is: 0.7146707276296839
The model recall score is: 0.6056
The model F1-score is: 0.5482136661608737
              precision    recall  f1-score   support

           0       0.87      0.25      0.39      7500
           1       0.56      0.96      0.71      7500

    accuracy                           0.61     15000
   macro avg       0.71      0.61      0.55     15000
weighted avg       0.71      0.61      0.55     15000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 1869 | 5631 |
| Act. positive | 285 | 7215 |
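
The scores in the classification report can be checked by hand from the four counts of the AFINN confusion matrix. The sketch below recomputes them; the variable names are chosen for illustration, and the weighted precision is taken as a plain mean only because both classes have the same support (7,500).

```python
# Counts copied from the AFINN confusion matrix above
tn, fp = 1869, 5631   # actual negative row
fn, tp = 285, 7215    # actual positive row

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_pos = tp / (tp + fp)   # of all predicted positives, how many are right
recall_pos = tp / (tp + fn)      # of all actual positives, how many are found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

# Both classes have 7,500 samples, so the weighted average is the plain mean
weighted_precision = (precision_pos + precision_neg) / 2

print(f"accuracy            {accuracy:.4f}")
print(f"precision (0, 1)    {precision_neg:.2f}, {precision_pos:.2f}")
print(f"recall    (0, 1)    {recall_neg:.2f}, {recall_pos:.2f}")
print(f"weighted precision  {weighted_precision:.4f}")
```

The low recall for class 0 reflects the 5,631 negative reviews that received a non-negative AFINN score and were therefore labelled positive.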


## SentiWordNet

WordNet groups synonyms into synsets with short definitions and usage examples. In the example below, we print the synsets for the word `extravagant`. Note that each synset is associated with a positive, a negative, and an objectivity score.

In [75]:
extravagant = list(swn.senti_synsets('extravagant', 'a'))
# Use the public accessor methods rather than the private score attributes
pd.DataFrame.from_dict({"Synset": [s.synset for s in extravagant],
                        "Definition": [s.synset.definition() for s in extravagant],
                        "Positive Polarity": [s.pos_score() for s in extravagant],
                        "Negative Polarity": [s.neg_score() for s in extravagant],
                        "Objectivity Score": [s.obj_score() for s in extravagant]})

| | Synset | Definition | Positive Polarity | Negative Polarity | Objectivity Score |
|---|---|---|---|---|---|
| 0 | Synset('excessive.s.02') | unrestrained, especially with regard to feelings | 0.125 | 0.375 | 0.5 |
| 1 | Synset('extravagant.s.02') | recklessly wasteful | 0.0 | 0.125 | 0.875 |


In [76]:
def analyze_sentiment_sentiwordnet_lexicon(review):
    # tokenize and POS tag text tokens
    tagged_text = [(token.text, token.tag_) for token in nlp(review)]
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')): #NOUNS
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')): #VERBS
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')): #ADJECTIVES
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')): #ADVERBS
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found        
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    
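    # aggregate final scores, normalised by the number of tokens that had a synset
    final_score = pos_score - neg_score
    if token_count == 0:
        # no token matched a synset: treat the review as neutral
        return 0.0
    norm_final_score = round(float(final_score) / token_count, 2)
    return norm_final_score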
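
The aggregation at the end of `analyze_sentiment_sentiwordnet_lexicon` reduces to averaging (positive minus negative) over the tokens for which a synset was found. The sketch below isolates that arithmetic without NLTK or spaCy; the function name and the score pairs are hypothetical, and returning 0.0 for an empty input is an assumption about how a review with no matched token should be treated.

```python
def normalized_polarity(synset_scores):
    """Average (positive - negative) over the tokens that had a synset.

    `synset_scores` is a list of (pos, neg) pairs, one per matched token.
    Returns 0.0 when no token matched, which avoids a division by zero.
    """
    if not synset_scores:
        return 0.0
    net = sum(pos - neg for pos, neg in synset_scores)
    return round(net / len(synset_scores), 2)

# Hypothetical scores for three matched tokens: one positive, one negative, one neutral
scores = [(0.75, 0.0), (0.0, 0.25), (0.0, 0.0)]
print(normalized_polarity(scores))  # 0.17
print(normalized_polarity([]))      # 0.0
```

Because neutral tokens count in the denominator, long reviews with few opinion words end up with scores close to zero.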

In [77]:
sample_review_ids = [7626, 3533, 13010]

for review, sentiment in zip(X_test[sample_review_ids], y_test[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity: '+ str(analyze_sentiment_sentiwordnet_lexicon(review)))
    print('-'*60)

REVIEW: not say like soup not say dislike kind culinary non event not flavor speak potato good base though want dress add broccoli cheese little chicken stock
Actual Sentiment: 0
Predicted Sentiment polarity: -0.08
------------------------------------------------------------
REVIEW: use morsel bake particularly baking banana breads standard price conducive
Actual Sentiment: 1
Predicted Sentiment polarity: 0.0
------------------------------------------------------------
REVIEW: kind goody excited try learn quickly need sweetener eat plain oatmeal cocoa stuff pretty awful without something sweet eat work cheat use flavor coffee creamer instantly make quantity stuff pretty yummy good taste good make good stuff pretty happy wheat free cap front along vegan superfood marketing word note make facility also process wheat not sensitive sensitive not die explode good big complaint stuff nut oil seem little stale hate rancid oil pick mile away picky people would parent old eat kind thing conside

In [78]:
y_predicted = [T(x) for x in [analyze_sentiment_sentiwordnet_lexicon(review)>=0 for review in X_test]]

In [79]:
print("The model accuracy score is: {}".format(accuracy_score(y_test, y_predicted)))
print("The model precision score is: {}".format(precision_score(y_test, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(y_test, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(y_test, y_predicted, average="weighted")))

print(classification_report(y_test, y_predicted))

display(pd.DataFrame(confusion_matrix(y_test, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.6779333333333334
The model precision score is: 0.682149092310148
The model recall score is: 0.6779333333333334
The model F1-score is: 0.6760589658780278
              precision    recall  f1-score   support

           0       0.71      0.60      0.65      7500
           1       0.65      0.75      0.70      7500

    accuracy                           0.68     15000
   macro avg       0.68      0.68      0.68     15000
weighted avg       0.68      0.68      0.68     15000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 4514 | 2986 |
| Act. positive | 1845 | 5655 |


## VADER Lexicon

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based model for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabeled text data.

In the next cell, we can see that VADER returns four sentiment scores: `compound`, `neg`, `neu`, and `pos`. In the following model, we only use `compound` (i.e. the aggregated score).

In [80]:
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores('This movie was actually neither that funny, nor super witty.')

{'neg': 0.41, 'neu': 0.59, 'pos': 0.0, 'compound': -0.6759}

In [81]:
def analyze_sentiment_vader_lexicon(review, threshold=0.1):
    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold else 'negative'
    return final_sentiment

In [83]:
for review, sentiment in zip(X_test[sample_review_ids], y_test[sample_review_ids]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_vader_lexicon(review, threshold=0.4)    
    print('-'*60)

REVIEW: not say like soup not say dislike kind culinary non event not flavor speak potato good base though want dress add broccoli cheese little chicken stock
Actual Sentiment: 0
------------------------------------------------------------
REVIEW: use morsel bake particularly baking banana breads standard price conducive
Actual Sentiment: 1
------------------------------------------------------------
REVIEW: kind goody excited try learn quickly need sweetener eat plain oatmeal cocoa stuff pretty awful without something sweet eat work cheat use flavor coffee creamer instantly make quantity stuff pretty yummy good taste good make good stuff pretty happy wheat free cap front along vegan superfood marketing word note make facility also process wheat not sensitive sensitive not die explode good big complaint stuff nut oil seem little stale hate rancid oil pick mile away picky people would parent old eat kind thing consider past prime expiration date chocolate hot cereal let us call oatmeal 

In [86]:
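# Compare against the string label explicitly: T() only tests truthiness, and both
# 'positive' and 'negative' are non-empty strings, so T() would map every review to 1
# (which is what the all-positive output recorded below shows).
y_predicted = [1 if analyze_sentiment_vader_lexicon(review, threshold=0.4) == 'positive' else 0
               for review in X_test]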

print("The model accuracy score is: {}".format(accuracy_score(y_test, y_predicted)))
print("The model precision score is: {}".format(precision_score(y_test, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(y_test, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(y_test, y_predicted, average="weighted")))

print(classification_report(y_test, y_predicted))

display(pd.DataFrame(confusion_matrix(y_test, y_predicted),
                     columns=["Pred. negative", "Pred. positive"],
                     index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.5
The model precision score is: 0.25
The model recall score is: 0.5
The model F1-score is: 0.3333333333333333
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      7500
           1       0.50      1.00      0.67      7500

    accuracy                           0.50     15000
   macro avg       0.25      0.50      0.33     15000
weighted avg       0.25      0.50      0.33     15000



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 0 | 7500 |
| Act. positive | 0 | 7500 |
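
An empty "Pred. negative" column means that every review was predicted positive. A likely explanation lies in how string labels interact with the helper `T` from the AFINN section: `analyze_sentiment_vader_lexicon` returns the strings `'positive'` or `'negative'`, and `T` only checks truthiness. The sketch below reproduces that interaction in isolation.

```python
T = lambda x: 1 if x else 0  # same helper as defined in the AFINN section

labels = ["positive", "negative"]

# Truthiness test: any non-empty string counts as True
by_truthiness = [T(label) for label in labels]

# Explicit comparison against the positive label
by_comparison = [1 if label == "positive" else 0 for label in labels]

print(by_truthiness)  # [1, 1]
print(by_comparison)  # [1, 0]
```

`T` behaves as intended for AFINN and SentiWordNet because it receives booleans there (`score >= 0`), not strings.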


# The Bag of Words (BOW) 

In [44]:
# Libraries for feature engineering
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# ML Models
from sklearn.linear_model import SGDClassifier, LogisticRegression

In [37]:
# build BOW features on train reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(X_train)

In [41]:
# transform test reviews into features
cv_test_features = cv.transform(X_test)


In [42]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)


BOW model:> Train features shape: (35000, 623304)  Test features shape: (15000, 623304)
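
The large feature count comes from `ngram_range=(1,2)`: every distinct word and every distinct pair of adjacent words becomes one column. The following sketch counts unigrams and bigrams for one short string using only the standard library. It is a simplification, since it splits on whitespace, whereas `CountVectorizer` applies its own token pattern (which, by default, ignores single-character tokens).

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 2)):
    """Count whitespace-separated n-grams for every n in ngram_range (inclusive)."""
    tokens = text.split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("good chocolate not good")
for feature, count in sorted(counts.items()):
    print(f"{feature!r}: {count}")
```

Bigrams such as `'not good'` let a linear model treat a negated phrase differently from the single word `'good'`, at the cost of a much wider feature matrix.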


In [45]:
# We define our SVM and LR models
lr = LogisticRegression(penalty='l2', max_iter=1000, C=1)
svm = SGDClassifier(loss='hinge', max_iter=100)
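
The two estimators differ mainly in the loss they minimise. `LogisticRegression` with `penalty='l2'` minimises the logistic loss plus an L2 penalty, where `C` is the inverse of the regularisation strength. `SGDClassifier` with `loss='hinge'` fits a linear SVM by stochastic gradient descent, with `max_iter` capping the number of passes over the training data. The sketch below compares the two losses for a single positive example; it uses the common convention of labels in {-1, +1}, which is an assumption made for the illustration (the notebook's labels are 0 and 1).

```python
import math

def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y * score)

def log_loss(y, score):
    """Logistic loss for a label y in {-1, +1} and a raw decision score."""
    return math.log1p(math.exp(-y * score))

# Loss of a positive example (y = +1) as the decision score grows
for score in (-2.0, 0.0, 0.5, 2.0):
    print(f"score={score:+.1f}  hinge={hinge_loss(1, score):.3f}  log={log_loss(1, score):.3f}")
```

The hinge loss is exactly zero once an example is classified with a margin of at least 1, while the logistic loss keeps decreasing without ever reaching zero, which is also why logistic regression yields probability estimates directly.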

In [46]:
# Logistic Regression model on BOW features
lr.fit(cv_train_features,y_train)
y_predicted = lr.predict(cv_test_features)

print("The model accuracy score is: {}".format(accuracy_score(y_test, y_predicted)))
print("The model precision score is: {}".format(precision_score(y_test, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(y_test, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(y_test, y_predicted, average="weighted")))

print(classification_report(y_test, y_predicted))

display(pd.DataFrame(confusion_matrix(y_test, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.876
The model precision score is: 0.876000240640154
The model recall score is: 0.876
The model F1-score is: 0.8759999801599969
              precision    recall  f1-score   support

           0       0.88      0.88      0.88      7500
           1       0.88      0.88      0.88      7500

    accuracy                           0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 6573 | 927 |
| Act. positive | 933 | 6567 |


In [48]:
# SVM model on BOW
svm.fit(cv_train_features,y_train)
y_predicted = svm.predict(cv_test_features)

print("The model accuracy score is: {}".format(accuracy_score(y_test, y_predicted)))
print("The model precision score is: {}".format(precision_score(y_test, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(y_test, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(y_test, y_predicted, average="weighted")))

print(classification_report(y_test, y_predicted))

display(pd.DataFrame(confusion_matrix(y_test, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.8690666666666667
The model precision score is: 0.8691519555344733
The model recall score is: 0.8690666666666667
The model F1-score is: 0.869059103520486
              precision    recall  f1-score   support

           0       0.86      0.88      0.87      7500
           1       0.87      0.86      0.87      7500

    accuracy                           0.87     15000
   macro avg       0.87      0.87      0.87     15000
weighted avg       0.87      0.87      0.87     15000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 6575 | 925 |
| Act. positive | 1039 | 6461 |


# The TF-IDF

In [38]:
# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2),sublinear_tf=True)
tv_train_features = tv.fit_transform(X_train)

In [40]:
tv_test_features = tv.transform(X_test)

In [49]:
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

TFIDF model:> Train features shape: (35000, 623304)  Test features shape: (15000, 623304)
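
The TF-IDF matrix has the same shape as the BOW matrix; only the cell values change. As a rough sketch of the weighting, the code below computes sublinear term frequency (`1 + ln(count)`), a smoothed inverse document frequency (`ln((1 + n) / (1 + df)) + 1`), and L2 normalisation on a three-document toy corpus. These formulas follow what scikit-learn documents for `TfidfVectorizer` with `sublinear_tf=True` and its default smoothing and norm; the corpus and the whitespace tokenisation are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the dataset)
docs = ["good chocolate good taste", "bad taste", "good coffee"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Sublinear TF times smoothed IDF, followed by L2 normalisation."""
    counts = Counter(tokens)
    weights = {
        term: (1 + math.log(count)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vector = tfidf(tokenized[0])
for term, weight in sorted(vector.items()):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, `chocolate` receives a higher weight than `taste` even though both occur once, because `taste` appears in more documents and is therefore less distinctive.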


In [51]:
# Logistic Regression model on TF-IDF features
lr.fit(tv_train_features,y_train)
y_predicted = lr.predict(tv_test_features)

print("The model accuracy score is: {}".format(accuracy_score(y_test, y_predicted)))
print("The model precision score is: {}".format(precision_score(y_test, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(y_test, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(y_test, y_predicted, average="weighted")))

print(classification_report(y_test, y_predicted))

display(pd.DataFrame(confusion_matrix(y_test, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.8651333333333333
The model precision score is: 0.8652661213673928
The model recall score is: 0.8651333333333333
The model F1-score is: 0.8651210749371617
              precision    recall  f1-score   support

           0       0.86      0.87      0.87      7500
           1       0.87      0.86      0.86      7500

    accuracy                           0.87     15000
   macro avg       0.87      0.87      0.87     15000
weighted avg       0.87      0.87      0.87     15000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 6560 | 940 |
| Act. positive | 1083 | 6417 |


In [52]:
# SVM model on TF-IDF features
svm.fit(tv_train_features,y_train)
y_predicted = svm.predict(tv_test_features)

print("The model accuracy score is: {}".format(accuracy_score(y_test, y_predicted)))
print("The model precision score is: {}".format(precision_score(y_test, y_predicted, average="weighted")))
print("The model recall score is: {}".format(recall_score(y_test, y_predicted, average="weighted")))
print("The model F1-score is: {}".format(f1_score(y_test, y_predicted, average="weighted")))

print(classification_report(y_test, y_predicted))

display(pd.DataFrame(confusion_matrix(y_test, y_predicted), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.8652
The model precision score is: 0.865370467245198
The model recall score is: 0.8652
The model F1-score is: 0.865184275093847
              precision    recall  f1-score   support

           0       0.86      0.88      0.87      7500
           1       0.87      0.85      0.86      7500

    accuracy                           0.87     15000
   macro avg       0.87      0.87      0.87     15000
weighted avg       0.87      0.87      0.87     15000



| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 6570 | 930 |
| Act. positive | 1092 | 6408 |


# Long short-term memory (LSTM) network with an embedding layer

In [102]:
# Tokenize the reviews, then truncate and pad the sequences to a fixed length
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils import to_categorical

max_features = 1000  # vocabulary size: keep only the most frequent words
max_len = 1000       # sequence length after padding/truncation
tokenizer = Tokenizer(num_words=max_features, split=' ')
# Fit the vocabulary on the training reviews only, so no test information leaks in
tokenizer.fit_on_texts(X_train)
X_train_token = tokenizer.texts_to_sequences(X_train)
X_train_token = pad_sequences(X_train_token, maxlen=max_len)

X_test_token = tokenizer.texts_to_sequences(X_test)
X_test_token = pad_sequences(X_test_token, maxlen=max_len)
# One-hot encode the labels to match the two-unit output layer
Y_train = to_categorical(y_train)
Y_test = to_categorical(y_test)

print(X_train_token.shape,Y_train.shape)
print(X_test_token.shape,Y_test.shape)

(35000, 1000) (35000, 2)
(15000, 1000) (15000, 2)
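
What `Tokenizer` and `pad_sequences` do to a review can be illustrated with a simplified pure-Python sketch. It mimics, under stated assumptions, the Keras defaults (index 0 reserved for padding, only the `num_words - 1` most frequent words kept, unknown words dropped, padding and truncation applied at the front). The function names and toy texts are hypothetical, and the real `Tokenizer` also strips punctuation.

```python
from collections import Counter

def fit_vocab(train_texts, num_words):
    """Rank words by frequency on the training texts only; index 0 is reserved for padding."""
    counts = Counter(w for text in train_texts for w in text.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    # Keep indices 1 .. num_words - 1, mirroring the Keras num_words convention.
    return {w: i for i, w in enumerate(ranked[:num_words - 1], start=1)}

def to_padded(texts, vocab, max_len):
    """Map words to indices (dropping unknown words), then pre-truncate and pre-pad with zeros."""
    rows = []
    for text in texts:
        seq = [vocab[w] for w in text.lower().split() if w in vocab]
        seq = seq[-max_len:]  # keep at most the last max_len tokens
        rows.append([0] * (max_len - len(seq)) + seq)
    return rows

train = ["good tea good price", "bad tea"]           # toy training reviews
test = ["good coffee and good tea at a bad price"]   # words outside the vocabulary are dropped
vocab = fit_vocab(train, num_words=4)
padded = to_padded(train + test, vocab, max_len=5)
print(vocab)
print(padded)
```

Note that the vocabulary size (`num_words`) and the padded length (`max_len`) are independent settings; in the notebook both happen to be 1000, which is why every review becomes a row of 1000 integers.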


In [106]:
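embed_dim = 64   # size of each word embedding vector
lstm_out = 200   # number of LSTM units

model = Sequential()
model.add(Embedding(max_features, embed_dim, input_length=X_train_token.shape[1]))
model.add(LSTM(lstm_out))
# Two sigmoid units trained with binary cross-entropy on the one-hot labels;
# each unit is scored independently, so the two outputs need not sum to 1
model.add(Dense(2, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())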

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 1000, 64)          64000     
_________________________________________________________________
lstm_10 (LSTM)               (None, 200)               212000    
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 402       
Total params: 276,402
Trainable params: 276,402
Non-trainable params: 0
_________________________________________________________________
None
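
The parameter counts in the summary can be reproduced by hand. The sketch below applies the usual formulas: an embedding table of vocabulary size times embedding dimension, four LSTM gates that each hold input weights, recurrent weights, and a bias vector, and a dense layer of weights plus biases. The first three variable names mirror the notebook; `n_classes` and the `*_params` names are introduced here for illustration.

```python
max_features, embed_dim, lstm_out, n_classes = 1000, 64, 200, 2

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = max_features * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embed_dim + lstm_out) * lstm_out + lstm_out)

# Dense: one weight per (LSTM unit, output unit) pair plus one bias per output unit.
dense_params = lstm_out * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the capacity sits in the LSTM layer, so changing `lstm_out` affects the model size far more than changing `embed_dim` at this vocabulary size.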


In [107]:
batch_size = 64
model.fit(X_train_token, Y_train, epochs = 5, batch_size=batch_size, verbose = 1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x222a8de39c8>

In [108]:
# LSTM model prediction on test data
y_pred_lstm = model.predict(X_test_token)

In [109]:
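# LSTM model evaluation
# Note: .round() thresholds each sigmoid output separately, so the scores below are
# computed on one-hot (multilabel-style) rows and accuracy is an exact-match score,
# which is why accuracy and recall can differ. The confusion matrix uses argmax instead.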

print("The model accuracy score is: {}".format(accuracy_score(Y_test, y_pred_lstm.round())))
print("The model precision score is: {}".format(precision_score(Y_test, y_pred_lstm.round(), average="weighted")))
print("The model recall score is: {}".format(recall_score(Y_test, y_pred_lstm.round(), average="weighted")))
print("The model F1-score is: {}".format(f1_score(Y_test, y_pred_lstm.round(), average="weighted")))

print(classification_report(Y_test, y_pred_lstm.round()))

display(pd.DataFrame(confusion_matrix(y_test,np.argmax(y_pred_lstm.round(), axis=1)), columns=["Pred. negative", "Pred. positive"], index=["Act. negative", "Act. positive"]))

The model accuracy score is: 0.8388
The model precision score is: 0.8394841426340026
The model recall score is: 0.8402666666666667
The model F1-score is: 0.8398740152096007
              precision    recall  f1-score   support

           0       0.84      0.84      0.84      7500
           1       0.84      0.84      0.84      7500

   micro avg       0.84      0.84      0.84     15000
   macro avg       0.84      0.84      0.84     15000
weighted avg       0.84      0.84      0.84     15000
 samples avg       0.84      0.84      0.84     15000





| | Pred. negative | Pred. positive |
|---|---|---|
| Act. negative | 6297 | 1203 |
| Act. positive | 1200 | 6300 |
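
Because the two sigmoid outputs are rounded independently, a prediction row can end up as `[0, 0]` or `[1, 1]`, which scikit-learn then scores as multilabel output. This is a plausible reason why the reported accuracy (0.8388) is slightly lower than the (6297 + 6300) / 15000 = 0.8398 implied by the confusion matrix. The sketch below uses made-up probabilities and plain Python to show the difference; with NumPy one would typically take `np.argmax(y_pred_lstm, axis=1)` and compare it against `y_test`.

```python
# Made-up sigmoid outputs for three reviews: columns are (negative, positive).
probs = [[0.40, 0.45], [0.60, 0.70], [0.90, 0.20]]
y_true = [1, 1, 0]

# Rounding each output separately can produce rows that are not valid one-hot vectors.
rounded = [[round(p) for p in row] for row in probs]
one_hot_true = [[1 - y, y] for y in y_true]
# Exact-match ("subset") accuracy: rows such as [0, 0] or [1, 1] can never match.
subset_acc = sum(r == t for r, t in zip(rounded, one_hot_true)) / len(y_true)

# Single-label decision: pick the column with the larger probability.
argmax_pred = [max(range(len(row)), key=row.__getitem__) for row in probs]
argmax_acc = sum(p == t for p, t in zip(argmax_pred, y_true)) / len(y_true)

print("rounded rows:", rounded)
print("subset accuracy:", subset_acc)
print("argmax labels:", argmax_pred)
print("argmax accuracy:", argmax_acc)
```

Evaluating on argmax labels keeps the LSTM scores directly comparable with the single-label scores reported for the logistic regression and SVM models.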
