# Sentiment Analysis of Amazon Customer Reviews

This project performs supervised binary (positive or negative) sentiment classification on customer reviews from the Amazon.com Kindle Store. We apply several data pre-processing techniques and train three machine learning models: Naive Bayes, logistic regression, and a linear support vector classifier. The best model reaches about 87% prediction accuracy on the test set.

In [3]:
import pyspark
spark.conf.set('spark.sql.shuffle.partitions', '8')

### Load dataset
The data comes from the "Amazon product data" website (http://jmcauley.ucsd.edu/data/amazon/), maintained by Dr. Julian McAuley at UCSD. We use the smaller subset of customer reviews from the Amazon.com Kindle Store. The data is in JSON format and contains 982,619 reviews with metadata, spanning May 1996 to July 2014.

In [5]:
# load original .json data
kindle_json = spark.read.json('/FileStore/tables/Kindle_Store_5.json')

In [6]:
display(kindle_json)

asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
B000F83SZQ,"List(0, 0)",5.0,I enjoy vintage books and movies so I enjoyed reading this book. The plot was unusual. Don't think killing someone in self-defense but leaving the scene and the body without notifying the police or hitting someone in the jaw to knock them out would wash today.Still it was a good read for me.,"05 5, 2014",A1F6404F1VG29J,Avidreader,Nice vintage story,1399248000
B000F83SZQ,"List(2, 2)",4.0,"This book is a reissue of an old one; the author was born in 1910. It's of the era of, say, Nero Wolfe. The introduction was quite interesting, explaining who the author was and why he's been forgotten; I'd never heard of him.The language is a little dated at times, like calling a gun a ""heater."" I also made good use of my Fire's dictionary to look up words like ""deshabille"" and ""Canarsie."" Still, it was well worth a look-see.","01 6, 2014",AN0N05A9LIJEQ,critters,Different...,1388966400
B000F83SZQ,"List(2, 2)",4.0,"This was a fairly interesting read. It had old- style terminology.I was glad to get to read a story that doesn't have coarse, crasslanguage. I read for fun and relaxation......I like the free ebooksbecause I can check out a writer and decide if they are intriguing,innovative, and have enough of the command of Englishthat they can convey the story without crude language.","04 4, 2014",A795DMNCJILA6,dot,Oldie,1396569600
B000F83SZQ,"List(1, 1)",5.0,I'd never read any of the Amy Brewster mysteries until this one.. So I am really hooked on them now.,"02 19, 2014",A1FV0SX13TWVXQ,"Elaine H. Turley ""Montana Songbird""",I really liked it.,1392768000
B000F83SZQ,"List(0, 1)",4.0,"If you like period pieces - clothing, lingo, you will enjoy this mystery. Author had me guessing at least 2/3 of the way through.","03 19, 2014",A3SPTOKDG7WBLN,Father Dowling Fan,Period Mystery,1395187200
B000F83SZQ,"List(0, 0)",4.0,A beautiful in-depth character description makes it like a fast pacing movie. It is a pity Mr Merwin did not write 30 instead only 3 of the Amy Brewster mysteries.,"05 26, 2014",A1RK2OCZDSGC6R,ubavka seirovska,Review,1401062400
B000F83SZQ,"List(0, 0)",4.0,"I enjoyed this one tho I'm not sure why it's called An Amy Brewster Mystery as she's not in it very much. It was clean, well written and the characters well drawn.","06 10, 2014",A2HSAKHC3IBRE6,Wolfmist,Nice old fashioned story,1402358400
B000F83SZQ,"List(1, 1)",4.0,"Never heard of Amy Brewster. But I don't need to like Amy Brewster to like this book. Actually, Amy Brewster is a side kick in this story, who added mystery to the story not the one resolved it. The story brings back the old times, simple life, simple people and straight relationships.","03 22, 2014",A3DE6XGZ2EPADS,WPY,Enjoyable reading and reminding the old times,1395446400
B000FA64PA,"List(0, 0)",5.0,Darth Maul working under cloak of darkness committing sabotage now that is a story worth reading many times over. Great story.,"10 11, 2013",A1UG4Q4D3OAH3A,dsa,Darth Maul,1381449600
B000FA64PA,"List(0, 0)",4.0,"This is a short story focused on Darth Maul's role in helping the Trade Federation gain a mining colony. It's not bad, but it's also nothing exceptional. It's fairly short so we don't really get to see any characters develop. The few events that do happen seem to go by quickly, including what should have been major battles. The story is included in the novelShadow Hunter (Star Wars: Darth Maul), which is worth reading, so don't bother to buy this one separately.","02 13, 2011",AQZH7YTWQPOBE,Enjolras,"Not bad, not exceptional",1297555200
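To make the record layout concrete, here is a minimal sketch that parses a single review with Python's standard json module. It assumes the file stores one JSON object per line, which is the layout spark.read.json reads by default. The field names and values are taken from the table above, with the review text shortened for readability.

```python
import json

# One record in JSON Lines form. Field names follow the table above;
# the review text is shortened here, so treat this line as illustrative.
sample_line = (
    '{"asin": "B000FA64PA", "helpful": [0, 0], "overall": 5.0, '
    '"reviewText": "Great story.", "reviewTime": "10 11, 2013", '
    '"reviewerID": "A1UG4Q4D3OAH3A", "reviewerName": "dsa", '
    '"summary": "Darth Maul", "unixReviewTime": 1381449600}'
)

record = json.loads(sample_line)

# Only two fields are used for modeling: the star rating and the review text.
print(record["overall"], record["reviewText"])
```

Only overall and reviewText are carried forward into the modeling steps; the remaining fields are metadata.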


### Generate Sentiment Label

Reviews with overall rating of 1, 2, or 3 are labeled as negative (label=1), and reviews with overall rating of 4 or 5 are labeled as positive (label=0).

In [8]:
kindle_json.createOrReplaceTempView('kindle_json_view')

data_json = spark.sql('''
  SELECT CASE WHEN overall<4 THEN 1
          ELSE 0
          END as label,
        reviewText as text
  FROM kindle_json_view
  WHERE length(reviewText)>2''')

data_json.groupBy('label').count().show()
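The labeling rule and the length filter in the query above can be mirrored in plain Python, which makes the boundary cases easy to check. This is only a sketch; the function names sentiment_label and keep_review are hypothetical and do not appear elsewhere in the notebook.

```python
def sentiment_label(overall):
    """Return 1 (negative) for ratings below 4, otherwise 0 (positive)."""
    return 1 if overall < 4 else 0


def keep_review(review_text):
    """Mirror the WHERE clause: keep reviews longer than 2 characters."""
    # A missing review text is dropped, as a NULL comparison would be in SQL.
    return review_text is not None and len(review_text) > 2


labels = [sentiment_label(stars) for stars in (1, 2, 3, 4, 5)]
print(labels)
```

Note that a 3-star review falls on the negative side of the threshold.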

### Generate the dataset for modeling
We only sample a small portion of the data for demonstration and try to balance the two classes.

In [10]:
# Sampling data
pos = data_json.where('label=0').sample(False, 0.05, seed=1220)
neg = data_json.where('label=1').sample(False, 0.25, seed=1220)
data = pos.union(neg)
data.groupBy('label').count().show()
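The two sampling fractions are chosen so that both classes end up with a similar expected size. The arithmetic can be sketched as follows. The class counts used here are hypothetical and serve only to illustrate the calculation; they are not the counts of this dataset. Because sample() draws rows independently, the resulting class sizes are approximate rather than exact.

```python
def balancing_fractions(n_pos, n_neg, neg_fraction):
    """Return (pos_fraction, neg_fraction) that give equal expected class sizes."""
    expected_neg = n_neg * neg_fraction
    pos_fraction = expected_neg / n_pos
    return pos_fraction, neg_fraction


# Hypothetical counts (positive class five times larger than the negative class).
pos_fraction, neg_fraction = balancing_fractions(
    n_pos=500_000, n_neg=100_000, neg_fraction=0.25
)
print(pos_fraction, neg_fraction)
```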

In [11]:
# Negative reviews are on average longer than the positive reviews, but not significantly longer
from pyspark.sql.functions import length
data.withColumn('review_length', length('text')).groupBy('label').avg('review_length').show()

### Data Preprocessing
Preprocessing consists of the following steps:

* Use html.unescape to un-escape HTML entities in the text
* Change "can't" to "can not", and change "n't" to " not" (this helps the negation handling step)
* Pad punctuation marks with spaces
* Lowercase every word
* Word tokenization
* Word lemmatization
* Perform **negation handling**
    * Use a state variable to store the negation state
    * Transform a word that follows "not" or "no" into “not_” + word
    * Whenever the negation state variable is set, the words read are treated as “not_” + word
    * The state variable is reset when a punctuation mark is encountered or when there is double negation
* Use **bigram** and/or **trigram** models
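The negation-handling and n-gram steps can be sketched in isolation, without NLTK or Spark. This is a simplified illustration that works on already tokenized, lowercased input; the helper names mark_negation and ngrams are hypothetical.

```python
def mark_negation(tokens):
    """Prefix tokens with 'not_' while a negation is active."""
    out = []
    negated = False
    for tok in tokens:
        if tok in ("not", "no"):
            negated = not negated  # a second negation cancels the first
        elif not tok.isalpha():
            negated = False        # punctuation or other non-alphabetic token resets the state
        else:
            out.append("not_" + tok if negated else tok)
    return out


def ngrams(tokens, n):
    """Join every run of n consecutive tokens with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "i do not like this book . it is great".split()
print(mark_negation(tokens))
# ['i', 'do', 'not_like', 'not_this', 'not_book', 'it', 'is', 'great']

print(ngrams(["not", "a", "good", "book"], 2))
# ['not_a', 'a_good', 'good_book']
```

The negation words and punctuation marks themselves are dropped from the unigram output; they only drive the state variable.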

In [13]:
# Define preprocessing function
def clean(text):
    import html
    import string
    import nltk
    nltk.download('wordnet')
    
    line = html.unescape(text)
    line = line.replace("can't", 'can not')
    line = line.replace("n't", " not")
    # Pad punctuations with white spaces
    pad_punct = str.maketrans({key: " {0} ".format(key) for key in string.punctuation}) 
    line = line.translate(pad_punct)
    line = line.lower()
    line = line.split() 
    lemmatizer = nltk.WordNetLemmatizer()
    line = [lemmatizer.lemmatize(t) for t in line] 
    
    # Negation handling
    # Add a "not_" prefix to words that follow "not" or "no",
    # until the next punctuation mark or a second negation
    tokens = []
    negated = False
    for t in line:
        if t in ['not', 'no']:
            negated = not negated
        elif t in string.punctuation or not t.isalpha():
            negated = False
        else:
            tokens.append('not_' + t if negated else t)
    
    invalidChars = str(string.punctuation.replace("_", ""))  
    bi_tokens = list(nltk.bigrams(line))
    bi_tokens = list(map('_'.join, bi_tokens))
    bi_tokens = [i for i in bi_tokens if all(j not in invalidChars for j in i)]
    tri_tokens = list(nltk.trigrams(line))
    tri_tokens = list(map('_'.join, tri_tokens))
    tri_tokens = [i for i in tri_tokens if all(j not in invalidChars for j in i)]
    tokens = tokens + bi_tokens + tri_tokens      
    
    return tokens

In [14]:
# An example: how the function clean() pre-processes the input text
example = clean("I don't think this book has any decent information!!! It is full of typos and factual errors that I can't ignore.")
print(example)

In [15]:
# Perform data preprocessing
from pyspark.sql.functions import udf, col, size
from pyspark.sql.types import ArrayType, StringType
clean_udf = udf(clean, ArrayType(StringType()))
data_tokens = data.withColumn('tokens', clean_udf(col('text')))
data_tokens.show(3)

### Split dataset into training (70%) and testing (30%) sets

In [17]:
# Split data into 70% for training and 30% for testing
training, testing = data_tokens.randomSplit([0.7,0.3], seed=1220)
training.groupBy('label').count().show()

In [18]:
training.cache()

### Naive Bayes Model (with parameter tuning)

In [20]:
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml import Pipeline

count_vec = CountVectorizer(inputCol='tokens', outputCol='c_vec', minDF=5.0)
idf = IDF(inputCol="c_vec", outputCol="features")
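The two feature stages turn each token list into term counts and then rescale those counts by inverse document frequency. A minimal pure-Python sketch of that weighting follows, assuming the smoothed formula given in the Spark documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) the number of documents containing t. The sketch leaves out the minDF vocabulary filter, and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute idf(t) = log((m + 1) / (df(t) + 1)) for every term."""
    m = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Weight raw term counts by the fitted idf values."""
    counts = Counter(doc)
    return {term: counts[term] * idf[term] for term in counts if term in idf}


# Toy corpus (illustrative only).
docs = [["good", "book"], ["bad", "book"], ["good", "good", "story"]]
idf = fit_idf(docs)
weights = tf_idf(docs[2], idf)
print(weights)
```

Terms that appear in many documents receive a weight close to zero, so frequent but uninformative tokens contribute little to the feature vector.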

In [21]:
# Naive Bayes model
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()

pipeline_nb = Pipeline(stages=[count_vec, idf, nb])

model_nb = pipeline_nb.fit(training)
test_nb = model_nb.transform(testing)
test_nb.show(3)

#### Naive Bayes model performance (using default parameters)
* Area under the ROC curve: 0.8551
* Accuracy: 0.8553

In [23]:
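# Naive Bayes model ROC
# Note: rawPredictionCol='prediction' feeds the hard 0/1 predictions to the evaluator,
# so this is the area under a single-threshold ROC curve, not one built from scores.
from pyspark.ml.evaluation import BinaryClassificationEvaluator
roc_nb_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')
roc_nb = roc_nb_eval.evaluate(test_nb)
print("Area under ROC of the NB model: {}".format(roc_nb))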
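Because the evaluator is given the hard 0/1 predictions rather than scores, the ROC curve has a single interior point. Assuming the curve is built from the points (0, 0), (FPR, TPR) and (1, 1), its area reduces to (1 + TPR - FPR) / 2, the average of the true positive and true negative rates. This would explain why the reported area under the curve sits so close to the accuracy on a roughly balanced test set. The sketch below uses made-up labels and predictions.

```python
def auc_from_hard_predictions(labels, predictions):
    """Area under the ROC curve through (0,0), (FPR,TPR), (1,1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Trapezoid rule over the two line segments of the curve.
    return (1 + tpr - fpr) / 2


# Made-up example: 3 negative-sentiment rows (label 1) and 5 positive ones (label 0).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 0, 0, 1]
auc = auc_from_hard_predictions(labels, predictions)
print(auc)
```

Passing the model's rawPrediction or probability column to the evaluator instead would give the area under the full score-based curve.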

In [24]:
# Naive Bayes model accuracy
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_nb_eval = MulticlassClassificationEvaluator(metricName='accuracy')
acc_nb = acc_nb_eval.evaluate(test_nb)
print("Accuracy of the NB model: {}".format(acc_nb))

#### Naive Bayes model performance after parameter tuning
* CountVectorizer.minDF = 7.0
* NaiveBayes.smoothing = 1.0
* Accuracy: 0.8568 (increased from 0.8553)

In [26]:
# NB parameter tuning and CV
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid_nb = (ParamGridBuilder()
                .addGrid(count_vec.minDF, [3.0, 5.0, 7.0, 10.0, 15.0])
                .addGrid(nb.smoothing, [0.1, 0.5, 1.0])
                .build())
cv_nb = CrossValidator(estimator=pipeline_nb, estimatorParamMaps=paramGrid_nb, evaluator=acc_nb_eval, numFolds=5)
cv_model_nb = cv_nb.fit(training) 
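The cost of this search grows with the product of the grid size and the number of folds. The following sketch enumerates the same parameter grid in plain Python and counts the pipeline fits the cross-validation requires, not counting the final refit of the best setting on the full training set.

```python
from itertools import product

min_df_values = [3.0, 5.0, 7.0, 10.0, 15.0]
smoothing_values = [0.1, 0.5, 1.0]
num_folds = 5

# Every combination of the two parameters is one candidate setting.
grid = [
    {"minDF": min_df, "smoothing": smoothing}
    for min_df, smoothing in product(min_df_values, smoothing_values)
]

# Each candidate is trained once per fold.
fits = len(grid) * num_folds
print(len(grid), fits)  # 15 75
```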

In [27]:
test_cv_nb = cv_model_nb.transform(testing)
acc_nb_cv = acc_nb_eval.evaluate(test_cv_nb)
print("Accuracy of the NB CV model: {}".format(acc_nb_cv))

In [28]:
cv_model_nb.bestModel.stages[0].extractParamMap()

In [29]:
cv_model_nb.bestModel.stages[2].extractParamMap()

### Logistic Regression
Model performance (default parameters except maxIter=5)
* Area under the ROC curve: 0.8601
* Accuracy: 0.8610

In [31]:
# Logistic Regression model
from pyspark.ml.classification import LogisticRegression
lgr = LogisticRegression(maxIter=5)
pipeline_lgr = Pipeline(stages=[count_vec, idf, lgr])

model_lgr = pipeline_lgr.fit(training)
test_lgr = model_lgr.transform(testing)
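For intuition, the prediction rule of a binary logistic regression model can be sketched as follows: a weighted sum of the features is passed through the sigmoid function, and the row is assigned label 1 when the resulting probability exceeds a threshold (0.5 is assumed here). The weights and feature values are hypothetical, not taken from the fitted model.

```python
import math


def predict_proba(weights, bias, features):
    """Probability of label 1 under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(weights, bias, features, threshold=0.5):
    """Assign label 1 when the probability exceeds the threshold."""
    return 1 if predict_proba(weights, bias, features) > threshold else 0


# Hypothetical weights for two features.
weights, bias = [2.0, -1.0], 0.0
print(predict_label(weights, bias, [1.0, 0.5]))  # weighted sum is 1.5, so label 1
print(predict_label(weights, bias, [0.0, 2.0]))  # weighted sum is -2.0, so label 0
```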

In [32]:
# Logistic Regression model ROC
from pyspark.ml.evaluation import BinaryClassificationEvaluator
roc_lgr_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')
roc_lgr = roc_lgr_eval.evaluate(test_lgr)
print("Area under ROC of the model: {}".format(roc_lgr))

In [33]:
# Logistic Regression model accuracy
#from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_lgr_eval = MulticlassClassificationEvaluator(metricName='accuracy')
acc_lgr = acc_lgr_eval.evaluate(test_lgr)
print("Accuracy of the model: {}".format(acc_lgr))

### Linear SVC Model
Model performance (default parameters except maxIter=5)
* Area under the ROC curve: 0.8649
* Accuracy: 0.8656

In [35]:
# Linear SVC model
from pyspark.ml.classification import LinearSVC
lsvc = LinearSVC(maxIter=5)
pipeline_lsvc = Pipeline(stages=[count_vec, idf, lsvc])

model_lsvc = pipeline_lsvc.fit(training)
test_lsvc = model_lsvc.transform(testing)
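A linear SVC differs from logistic regression mainly in its loss function: it minimizes the hinge loss, which is zero for rows classified correctly with a margin of at least 1 and grows linearly otherwise. The sketch below shows that loss for a single row, with the 0/1 labels mapped to -1/+1. The scores are made-up values of the linear decision function.

```python
def hinge_loss(label, score):
    """Hinge loss for one row; label is 0 or 1, score is w.x + b."""
    y = 1.0 if label == 1 else -1.0
    return max(0.0, 1.0 - y * score)


print(hinge_loss(1, 2.0))  # correct with enough margin: 0.0
print(hinge_loss(1, 0.5))  # correct but inside the margin: 0.5
print(hinge_loss(0, 0.5))  # wrong side of the boundary: 1.5
```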

In [36]:
# Linear SVC model ROC
from pyspark.ml.evaluation import BinaryClassificationEvaluator
roc_lsvc_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')
roc_lsvc = roc_lsvc_eval.evaluate(test_lsvc)
print("Area under ROC of the model: {}".format(roc_lsvc))

In [37]:
# Linear SVC model accuracy
#from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_lsvc_eval = MulticlassClassificationEvaluator(metricName='accuracy')
acc_lsvc = acc_lsvc_eval.evaluate(test_lsvc)
print("Accuracy of the model: {}".format(acc_lsvc))

### Predict on new reviews:
To demonstrate the model's predictions on new review texts, we randomly chose five reviews of the Kindle book *The Brave Ones: A Memoir of Hope, Pride and Military Service* by Michael J. MacLeod.

The suffixes "_1", "_2", ..., "_5" indicate each review's actual star rating (1, 2, ..., 5).

The model correctly predicts the first three reviews as "negative" (label=1), and the last two as "positive" (label=0).

In [39]:
review_1 = ["WOW!!! No words describe how bland this book is. It took me a lot to even pick up to read. I would definitely not recommend this book."]

In [40]:
review_2 = ["A first person account of the war in Afghanistan. It skipps around a lot and is like a never-ending news article. On the positive side, you do get a feel for what desert fighting is like from a soldiers point of view."]

In [41]:
review_3 = ["I liked the premise and most of the book. At the end parts I lost a little interest because I lost the thread of who was who. War is hell. MacLeod did his service unlike most of us."]

In [42]:
review_4 = ["Very informative first person account of the the daily life of a US Paratrooper. From training to deployment in combat situations in Afghanistan. Well worth the read and makes you really understand and appreciate their sacrifices"]

In [43]:
review_5 = ["This is perhaps the best wrote book I have ever read. Articulate and thought provoking. Not just a riveting account of actual combat, but Michael was able to do what few before him have...captured the essence of what one feels as the battle unfolds. Perhaps most of all, I am grateful to call this author 'Fellow Warrior' Airborne all the way!!!"]

In [44]:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("text", StringType(), True)])

text = [review_1, review_2, review_3, review_4, review_5]
review_new = spark.createDataFrame(text, schema=schema)

In [45]:
# Data preprocessing
review_new_tokens = review_new.withColumn('tokens', clean_udf(col('text')))
review_new_tokens.show()

In [46]:
# Prediction using tuned Naive Bayes model
result = cv_model_nb.transform(review_new_tokens)
result.select('text', 'prediction').show()