# Sentiment Analysis - Amazon Reviews

This notebook applies Naive Bayes and Random Forest classifiers to a dataset of 1,000 product reviews from amazon.com. Each review is labelled as positive (1) or negative (0).

# Preparing the dataset

In [6]:
dataset_original = read.delim('amazon_cells_labelled.txt', header = FALSE, quote = '', stringsAsFactors = FALSE)

In [7]:
head(dataset_original)

V1,V2
So there is no way for me to plug it in here in the US unless I go by a converter.,0
"Good case, Excellent value.",1
Great for the jawbone.,1
Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!,0
The mic is great.,1
I have to jiggle the plug to get it to line up right to get decent volume.,0


In [12]:
summary(dataset_original)

      V1                  V2     
 Length:1000        Min.   :0.0  
 Class :character   1st Qu.:0.0  
 Mode  :character   Median :0.5  
                    Mean   :0.5  
                    3rd Qu.:1.0  
                    Max.   :1.0  

Cleaning the texts by converting to lowercase, removing non-alphabetic characters and stopwords, applying stemming and stripping excess whitespace:

In [27]:
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$V1))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
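
To make the effect of these steps concrete, here is a minimal base-R sketch that approximates three of them (lowercasing, removing digits and punctuation, collapsing whitespace) on one review from the dataset. It is an illustration only: it deliberately skips stopword removal and stemming, and the variable names `review` and `cleaned` are made up for this example.

```r
# One raw review taken from the head() output above
review = "Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!"

cleaned = tolower(review)                       # lowercase
cleaned = gsub("[[:digit:]]+", "", cleaned)     # drop digits
cleaned = gsub("[[:punct:]]+", "", cleaned)     # drop punctuation (replaced by nothing)
cleaned = gsub("[[:space:]]+", " ", cleaned)    # collapse runs of whitespace

print(cleaned)
# "tied to charger for conversations lasting more than minutesmajor problems"
```

Because punctuation is deleted rather than replaced by a space, "minutes.MAJOR" is fused into the single token "minutesmajor". If that matters, one option is to substitute a space for punctuation before tokenising.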

Creating the Bag of Words model and dropping very rare terms (`removeSparseTerms` with a sparsity threshold of 0.999):

In [60]:
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Target = dataset_original$V2
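
The idea behind the document-term matrix can be sketched in base R on a few toy documents. This is not how `tm` builds its matrix internally; it is only a small illustration, and `docs`, `tokens`, `vocab` and `bow` are hypothetical names.

```r
# Toy, already-cleaned documents (illustrative only)
docs = c("great case great value", "bad charger", "great mic")

# Split each document into words
tokens = strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct word across all documents
vocab = sort(unique(unlist(tokens)))

# One row per document, one column per word; each cell is a word count
bow = t(sapply(tokens, function(words) table(factor(words, levels = vocab))))
print(bow)
```

Each review becomes a row of counts, which is why the resulting data frame has one column per retained term plus the `Target` column.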

In [61]:
dim(dataset)

Encoding the target feature as a factor:

In [58]:
dataset$Target = factor(dataset$Target, levels = c(0, 1))

# Train Test Split

In [59]:
library(caTools)
set.seed(123)
split = sample.split(dataset$Target, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
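
`sample.split` keeps the class proportions of the label vector in both subsets. A rough base-R sketch of that kind of stratified split is shown below; `stratified_split` is a hypothetical helper written for this illustration, not part of `caTools`, and the toy labels simply mirror the 500/500 class balance seen in the summary.

```r
set.seed(123)

# Toy labels: 500 negative and 500 positive
labels = factor(rep(c(0, 1), each = 500))

# Hypothetical helper: returns TRUE for rows assigned to the training set,
# sampling the requested ratio separately within each class
stratified_split = function(y, ratio) {
  in_train = logical(length(y))
  for (cls in unique(y)) {
    idx = which(y == cls)
    n_train = round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] = TRUE
  }
  in_train
}

toy_split = stratified_split(labels, 0.8)
print(table(labels, toy_split))
```

With a ratio of 0.8 each class contributes 400 rows to training and 100 to testing, which matches the 100/100 row totals in the confusion matrices below.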

# Naive Bayes Classifier

In [62]:
library(e1071)
# Column 620 holds Target, so drop it from the predictors
nb_classifier = naiveBayes(x = training_set[-620],
                        y = training_set$Target)

In [63]:
y_pred_nb = predict(nb_classifier, newdata = test_set[-620])

Confusion matrix:

In [74]:
cm_nb = table(test_set[, 620], y_pred_nb)
print(cm_nb)

   y_pred_nb
      0   1
  0   0 100
  1   1  99


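The Naive Bayes model classified all but one of the 200 test reviews as positive, so only 99 of 200 predictions are correct (accuracy 0.495), which is no better than chance.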
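
One possible explanation, not verified in this notebook, is that `naiveBayes` from `e1071` models numeric predictors with a Gaussian density, which fits sparse word counts poorly. A commonly suggested workaround is to recode each count as a presence/absence factor before training. The sketch below shows only the recoding step on toy data; `to_presence` is a hypothetical helper and `toy_counts` stands in for the term columns.

```r
# Hypothetical helper: recode term counts as "No"/"Yes" presence factors
to_presence = function(counts) {
  as.data.frame(lapply(counts, function(col) {
    factor(ifelse(col > 0, "Yes", "No"), levels = c("No", "Yes"))
  }))
}

# Toy term counts standing in for the document-term columns
toy_counts = data.frame(good = c(2, 0, 1), bad = c(0, 1, 0))
toy_presence = to_presence(toy_counts)
str(toy_presence)
```

The recoded predictors could then be passed to `naiveBayes` in place of the raw counts. Whether this improves the result on this dataset would need to be checked.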

# Random Forest Classifier

In [66]:
library(randomForest)
rf_classifier = randomForest(x = training_set[-620],
                          y = training_set$Target,
                          ntree = 10)

In [67]:
y_pred_rf = predict(rf_classifier, newdata = test_set[-620])

In [75]:
cm_rf = table(test_set[, 620], y_pred_rf)
print(cm_rf)

   y_pred_rf
     0  1
  0 85 15
  1 21 79


The Random Forest model performed much better than the Naive Bayes classifier, correctly classifying 164 of the 200 test reviews compared with 99 of 200.

In [83]:
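# Rows of cm_rf are actual labels, columns are predictions; class 1 is positive
TP = cm_rf[2, 2]; TN = cm_rf[1, 1]
FP = cm_rf[1, 2]; FN = cm_rf[2, 1]
accuracy = (TP + TN) / sum(cm_rf)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * precision * recall / (precision + recall)

rf_summary = data.frame(accuracy, precision, recall, F1)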

In [85]:
head(rf_summary)

accuracy,precision,recall,F1
0.82,0.8404255,0.79,0.814433
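
To compare both models on the same metrics, the calculation can be wrapped in a small function. This is a sketch under stated assumptions: `classification_metrics` is a hypothetical helper, it assumes a 2x2 matrix with actual labels in rows, predictions in columns and the positive class second, and the two matrices are retyped by hand from the printed confusion matrices.

```r
# Hypothetical helper: metrics from a 2x2 confusion matrix
# (rows = actual, columns = predicted, second level = positive class)
classification_metrics = function(cm) {
  TP = cm[2, 2]; TN = cm[1, 1]
  FP = cm[1, 2]; FN = cm[2, 1]
  precision = TP / (TP + FP)
  recall = TP / (TP + FN)
  data.frame(accuracy = (TP + TN) / sum(cm),
             precision = precision,
             recall = recall,
             F1 = 2 * precision * recall / (precision + recall))
}

# Counts copied from the confusion matrices printed above (filled column by column)
dims = list(actual = c("0", "1"), predicted = c("0", "1"))
nb_counts = matrix(c(0, 1, 100, 99), nrow = 2, dimnames = dims)
rf_counts = matrix(c(85, 21, 15, 79), nrow = 2, dimnames = dims)

metrics_nb = classification_metrics(nb_counts)
metrics_rf = classification_metrics(rf_counts)
print(rbind(naive_bayes = metrics_nb, random_forest = metrics_rf))
```

The Naive Bayes row shows a high recall alongside an accuracy near 0.5, which is what a model that labels almost everything positive produces on a balanced test set.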
