# Sentiment Analysis - Amazon Reviews

This notebook applies Naive Bayes and Random Forest classifiers to a dataset of 1,000 product reviews from amazon.com. Each review is labelled as positive (1) or negative (0).

# Imports

In [31]:
import numpy as np
import pandas as pd

# Preparing the dataset

In [13]:
dataset = pd.read_csv('amazon_cells_labelled.txt', delimiter = '\t', header = None)

In [18]:
dataset.columns = ['Review', 'Liked']

In [19]:
dataset.head()

| | Review | Liked |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [20]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
Review    1000 non-null object
Liked     1000 non-null int64
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


Cleaning the texts by removing non-alphabetic characters and stopwords, converting to lowercase, and applying stemming:

In [21]:
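import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download('stopwords', quiet=True)  # fetch the stopword list if it is not already installed
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
corpus = []
for text in dataset['Review']:
    # keep letters only, convert to lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # drop stopwords and reduce each remaining word to its stem
    words = [ps.stem(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))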
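
To make the cleaning steps concrete, the following sketch runs them on a single review using only the standard library. The three-word `STOPWORDS` set and the `clean` helper are hypothetical stand-ins for NLTK's English stopword list and the loop above, and stemming is left out, so the result will differ from what the notebook's pipeline produces.

```python
import re

# Hypothetical, tiny stopword set; NLTK's English list is much longer.
STOPWORDS = {"the", "is", "for"}

def clean(text):
    """Keep letters only, lowercase, split, and drop stopwords (no stemming)."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)

cleaned = clean("The mic is great.")
print(cleaned)
```

Punctuation is replaced by spaces before splitting, so the trailing full stop never reaches the word list.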

Creating the Bag of Words model:

In [32]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset['Liked'].values
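
As a rough illustration of what the Bag of Words matrix contains, the sketch below builds one by hand for two toy documents using only the standard library. The `docs` list and the variable names are made up for the example; `CountVectorizer` additionally handles tokenisation and keeps only the `max_features` most frequent terms, which this sketch does not attempt.

```python
from collections import Counter

# Toy cleaned documents (hypothetical, not taken from the dataset).
docs = ["great case great valu", "mic great"]

# Vocabulary: every distinct word, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, holding word counts.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
print(matrix)
```

Each row plays the role of one row of `X`: word order is discarded and only the counts per vocabulary term remain.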

# Train Test Split

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
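
With `test_size = 0.20`, 200 of the 1,000 reviews are held out for testing and 800 are used for training. The sketch below mimics that idea on row indices with the standard library; it is not scikit-learn's implementation, and the shuffle it produces will not match the one obtained with `random_state = 0`.

```python
import random

n_samples = 1000
test_size = 0.20

indices = list(range(n_samples))
random.Random(0).shuffle(indices)  # fixed seed so the split is repeatable

n_test = int(n_samples * test_size)
test_idx = indices[:n_test]    # rows held out for evaluation
train_idx = indices[n_test:]   # rows used to fit the classifiers

print(len(train_idx), len(test_idx))
```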

# Naive Bayes Classifier

In [34]:
from sklearn.naive_bayes import GaussianNB
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [35]:
y_pred_nb = nb_classifier.predict(X_test)

In [26]:
from sklearn.metrics import confusion_matrix, classification_report

In [36]:
print(confusion_matrix(y_test, y_pred_nb))
print('\n')
print(classification_report(y_test, y_pred_nb))

[[60 36]
 [20 84]]


             precision    recall  f1-score   support

          0       0.75      0.62      0.68        96
          1       0.70      0.81      0.75       104

avg / total       0.72      0.72      0.72       200



# Random Forest Classifier

In [28]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
rf_classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [29]:
y_pred_rf = rf_classifier.predict(X_test)

In [30]:
print(confusion_matrix(y_test, y_pred_rf))
print('\n')
print(classification_report(y_test, y_pred_rf))

[[86 10]
 [31 73]]


             precision    recall  f1-score   support

          0       0.74      0.90      0.81        96
          1       0.88      0.70      0.78       104

avg / total       0.81      0.80      0.79       200



Naive Bayes correctly identified more of the positive reviews (84 versus 73 of 104) and so produced fewer false negatives (20 versus 31), while the Random Forest correctly identified more of the negative reviews (86 versus 60 of 96) and produced fewer false positives (10 versus 36). Overall, the Random Forest performed better on this test set, with higher accuracy (79.5% versus 72.0%) and a higher average F1 score (0.79 versus 0.72).
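
The figures quoted above can be checked directly from the two confusion matrices. The sketch below recomputes accuracy and the positive-class precision, recall, and F1 by hand; the `metrics` helper is a name introduced here for illustration, and it assumes scikit-learn's layout of `[[TN, FP], [FN, TP]]` for labels 0 and 1.

```python
def metrics(cm):
    """Return accuracy, precision, recall and F1 for the positive class (1)."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

nb_cm = [[60, 36], [20, 84]]  # Naive Bayes confusion matrix from the output above
rf_cm = [[86, 10], [31, 73]]  # Random Forest confusion matrix from the output above

nb_metrics = metrics(nb_cm)
rf_metrics = metrics(rf_cm)
for name, m in [("Naive Bayes", nb_metrics), ("Random Forest", rf_metrics)]:
    print(name, [round(v, 3) for v in m])
```

Working from the raw counts makes the trade-off explicit: Naive Bayes has the higher recall on positive reviews, while the Random Forest has the higher precision and the higher overall accuracy.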