In [1]:
import nltk
import numpy as np
import pandas as pd

from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

from nltk.stem import PorterStemmer 

import string

The goal of this notebook is to build a machine learning model that detects the sentiment (positive or negative) of a product review. This type of analysis helps to automatically determine whether customers are happy with a specific product.

## Load the data and create the dataset

The data used to train the model consists of customer reviews of electronic devices.

In [2]:
# load the reviews

# data courtesy of http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html
# read each file inside a context manager so the file handle is closed afterwards
with open('electronics/positive.review') as f:
    positive_reviews = BeautifulSoup(f.read(), features="html5lib")
positive_reviews = positive_reviews.find_all('review_text')
positive_reviews = [i.contents[0] for i in positive_reviews]


with open('electronics/negative.review') as f:
    negative_reviews = BeautifulSoup(f.read(), features="html5lib")
negative_reviews = negative_reviews.find_all('review_text')
negative_reviews = [i.contents[0] for i in negative_reviews]

In [4]:
data = {'Text': positive_reviews + negative_reviews, 'Sentiment': [1] * len(positive_reviews) + [0] * len(negative_reviews)}
df = pd.DataFrame(data = data)
df.head()

Unnamed: 0,Text,Sentiment
0,\nI purchased this unit due to frequent blacko...,1
1,\nI ordered 3 APC Back-UPS ES 500s on the reco...,1
2,\nWish the unit had a separate online/offline ...,1
3,\nCheaper than thick CD cases and less prone t...,1
4,\nHi\n\nI brought 256 MB Kingston SD card from...,1


In [324]:
#Example of a positive review
print("Positive Review:", df.iloc[0]['Text'])

#Example of a negative review
print("\nNegative Review:", df.iloc[-1]['Text'])

Positive Review: 
I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.

I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.

As always, Amazon had it to me in <2 business days


Negative Review: 
I bought this for easy transfer of pictures from my digital camera with SD memory card anywhere not home and sometimes from other peoples memory card (xD and memory stick)..

First of all I was disappointed with the flimsy, plastic design and the size of it. But it would have been ok if it worked!..IT DOESNT READ my SD card. And as menetioned in other people's review hard to insert and take out the cards! I'm scared if the cards get scratch and ruined wh

## Split and preprocess the data

In [250]:
# Split the data into train and test sets (30% of the reviews are held out for testing)
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Sentiment'], test_size=0.3)


In [297]:
# Convert each review into a vector with the number of occurrences of each word

count_vect_model = CountVectorizer(decode_error='ignore', stop_words='english')
count_vect_model.fit(X_train)

X_train_cnt = count_vect_model.transform(X_train)
X_test_cnt = count_vect_model.transform(X_test)
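
To make the count representation concrete, here is a minimal standard-library sketch of the same idea on a toy corpus. The two documents and the helper `count_vector` are hypothetical and only illustrate the technique; the notebook itself relies on *CountVectorizer*, which additionally handles tokenization and stop words.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset).
docs = ["great price great sound", "poor sound"]

# Sorted vocabulary: one column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc, vocabulary):
    """Return how many times each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # column labels
print(vectors)     # one row of counts per document
```

Each row has one entry per vocabulary word, so most entries are zero for a realistic vocabulary, which is why the real vectorizer returns a sparse matrix.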


## Train and test a machine learning model to predict whether a review has a positive or negative connotation

In [298]:
model = LogisticRegression()
model.fit(X_train_cnt, y_train)

print("Test Accuracy:", round(model.score(X_test_cnt, y_test), 3))

Test Accuracy: 0.795




In [299]:
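# Words with the largest impact on the prediction (largest absolute coefficients)

threshold = 0.8
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
for i, j in zip(count_vect_model.get_feature_names_out(), model.coef_[0]):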
    if j > threshold or j < -threshold:
        print(i, round(j, 3))
        

armband -0.871
bad -0.949
best 1.207
bit 0.865
clear 0.855
disappointed -0.804
easy 1.162
excellent 1.427
extra 0.985
fast 0.809
great 1.401
highly 1.198
item -1.026
live -1.039
memory 0.875
months -0.927
perfect 1.173
perfectly 1.135
poor -1.196
pretty 0.876
price 1.149
refund -0.871
return -1.287
returned -1.203
returning -0.968
terrible -1.154
unless -0.866
waste -1.063


An accuracy of about **80%** (0.795) was obtained in predicting the sentiment of a review.

As we could expect, words like *excellent*, *fast* and *perfect* have a positive coefficient, meaning that they push the prediction towards a positive review. Conversely, words like *bad*, *poor* and *return* have a negative coefficient and are associated with a negative review.
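
The way these coefficients combine into a prediction can be sketched with plain arithmetic. The four coefficients below are copied from the output above; the intercept is not printed in the notebook, so the value `0.0` is an assumption made purely for illustration, and every other word is treated as having weight zero.

```python
import math

# Coefficients copied from the printed output; all other words are ignored here.
coef = {"excellent": 1.427, "price": 1.149, "return": -1.287, "poor": -1.196}
intercept = 0.0  # ASSUMPTION: the fitted intercept is not shown in the notebook


def positive_probability(word_counts):
    """Logistic regression score: sigmoid(intercept + sum(coef * count))."""
    score = intercept + sum(coef.get(w, 0.0) * n for w, n in word_counts.items())
    return 1.0 / (1.0 + math.exp(-score))


p_good = positive_probability({"excellent": 1, "price": 1})
p_bad = positive_probability({"poor": 1, "return": 1})
print(round(p_good, 3), round(p_bad, 3))
```

A review is classified as positive when this probability exceeds 0.5, which is the same as the weighted sum being greater than zero.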

## Additional Step - Filter words before training

In [270]:
count_vect_model.get_feature_names_out()[:15].tolist()

['00',
 '000',
 '002',
 '007radardetectors',
 '00ghz',
 '01',
 '0183',
 '05',
 '06',
 '09',
 '0_20',
 '0gb',
 '0s',
 '10',
 '100']

As we can see, the vocabulary contains many tokens that are not relevant for prediction, like *00*, *05* and *0s*. There are also many words that share the same base form, like *return*, *returned* and *returning*. Next, we will try to remove these redundancies by filtering the tokens and applying **stemming**.

In [326]:
ps = PorterStemmer() 


# Tokenize a review and keep only the relevant words: tokens shorter than 3 characters are dropped,
# punctuation is stripped, tokens containing digits are dropped, and the remaining tokens are stemmed
def stemming(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    tokens = [t for t in tokens if len(t) >= 3]
    tokens = [t.translate(str.maketrans('', '', string.punctuation)) for t in tokens]
    tokens = [t for t in tokens if not any(i.isdigit() for i in t)]
    tokens = [ps.stem(t) for t in tokens]
    
    return tokens



count_vect_model = CountVectorizer(decode_error='ignore', tokenizer = stemming)
count_vect_model.fit(X_train)

X_train_cnt = count_vect_model.transform(X_train)
X_test_cnt = count_vect_model.transform(X_test)

model = LogisticRegression()
model.fit(X_train_cnt, y_train)

print("Test Accuracy:", round(model.score(X_test_cnt, y_test), 3))

Test Accuracy: 0.81




In [318]:
count_vect_model.get_feature_names_out()[:15].tolist()

['',
 '\x1aafter',
 '\x1aat',
 '\x1ain',
 '\x1athe',
 'a',
 'aaa',
 'aac',
 'aback',
 'aband',
 'abandon',
 'abc',
 'abeit',
 'aberr',
 'abil']
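
The vocabulary above still contains an empty string, the single letter *a* and tokens starting with a control character (`\x1a`). The likely cause is that the length filter runs before punctuation is stripped, so a token such as `...` passes the length check and then becomes empty. A minimal sketch of a stricter filter follows. It is an assumption-laden variant, not the notebook's method: `clean_tokens` is a hypothetical helper, it splits on whitespace instead of using `nltk.tokenize.word_tokenize`, and the stemmer is passed in as a function (identity by default) so that the example runs with the standard library only.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tokens(text, stem=lambda t: t, min_len=3):
    """Lowercase, split on whitespace, strip punctuation, then filter and stem."""
    tokens = []
    for raw in text.lower().split():
        token = raw.translate(PUNCT_TABLE)
        # Check the length AFTER stripping punctuation so '' and 'a' are dropped.
        if len(token) < min_len:
            continue
        # Keep purely alphabetic tokens: this drops digits and control characters.
        if not re.fullmatch(r"[a-z]+", token):
            continue
        tokens.append(stem(token))
    return tokens


sample = "A... \x1aAfter 2 days it DOESN'T read my 00ghz card!"
cleaned = clean_tokens(sample)
print(cleaned)
```

In the notebook this could be plugged in as `tokenizer=lambda s: clean_tokens(s, stem=ps.stem)`. Note that tokens containing a control character are dropped entirely rather than repaired, which is a design choice and would change the reported accuracy.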

In [315]:
threshold = 0.8
for i, j in zip(count_vect_model.get_feature_names_out(), model.coef_[0]):
    if j > threshold or j < -threshold:
        print(i, round(j, 3))

almost -1.023
armband -0.883
back -0.964
best 1.157
bit 0.848
easi 1.076
excel 1.162
extra 0.849
fast 1.048
flaw -0.823
great 1.314
highli 0.962
item -0.821
memori 1.012
month -0.807
not -1.11
perfect 1.146
perfectli 1.134
poor -1.325
pretti 0.993
price 0.919
refund -0.82
return -1.725
terribl -1.203
wast -1.168
worth 0.989


As can be observed, applying stemming to the training corpus improved the test accuracy slightly (from 0.795 to 0.81). Further validation would be needed to conclude that stemming really helps, especially because the second vectorizer no longer removes English stop words (which is why *not* now appears among the most influential words), so the two runs differ in more than stemming. Stemming also has a cost: some relevant information may be lost. For example, two words that are spelled similarly but have different meanings may be reduced to the same stem and treated identically. On the other hand, it removes redundancy between forms such as *return* and *returned*.
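
One rough way to judge whether a gap between 0.795 and 0.81 is meaningful is to compare it with the binomial standard error of an accuracy estimate. The sketch below does that arithmetic. The test-set size is not stated in the notebook, so `n_test = 600` is an assumption used only for illustration.

```python
import math

acc_baseline = 0.795  # reported above, without stemming
acc_stemmed = 0.81    # reported above, with stemming
n_test = 600          # ASSUMPTION: the test-set size is not stated in the notebook


def standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


se = standard_error(acc_baseline, n_test)
gain = acc_stemmed - acc_baseline
print(f"gain={gain:.3f}, standard error={se:.3f}")
```

Under that assumed size the gain is of the same order as one standard error, which is consistent with the caution expressed above. Repeating the split several times or using cross-validation would give a firmer answer.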

Further work could include trying other machine learning techniques to improve the results, as well as other feature extraction methods besides *CountVectorizer*.
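
As one possible starting point, the `TfidfVectorizer` and `MultinomialNB` classes imported at the top of the notebook can be combined in a scikit-learn pipeline. This is a minimal sketch on a toy corpus, not a result from the dataset: the four reviews and their labels are hypothetical, and in the notebook `X_train` and `y_train` would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy reviews (hypothetical, not from the dataset); 1 = positive, 0 = negative.
texts = [
    "excellent sound and great price",
    "works perfectly, highly recommended",
    "poor quality, returned it",
    "terrible, a waste of money",
]
labels = [1, 1, 0, 0]

# TF-IDF down-weights words that occur in many reviews; Naive Bayes is a
# fast baseline classifier for sparse text features.
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(texts, labels)

predictions = pipeline.predict(["great price", "poor quality"])
print(predictions)
```

Wrapping the vectorizer and the classifier in a single pipeline keeps the vocabulary fitted on the training data only, which also makes it straightforward to evaluate with cross-validation.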