<a href="https://colab.research.google.com/github/bwilson7/thinkful_drills/blob/master/6_8_4_challenge_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Toy & Game Sentiment Reviews

For this sentiment analysis I wanted to look at Amazon reviews of toys and games. In total there are ~167k reviews in the dataset.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz

In [0]:
# Install Spark-related dependencies for Python
!pip install -q findspark
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting py4j==0.10.7 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130387 sha256=8cd806116cc609dc31fdc028925f2181076d97333475c7ddd4558e6dc801fc46
  Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d184230058e654eb1b04467dbc1292f00eaa186544604b471
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 py

In [0]:
!pip install nltk



In [0]:
# Set up required environment variables

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# Module Imports

In [0]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel, LogisticRegression
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler, Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

from pyspark.sql.functions import isnan, when, count, col, split, collect_set, lit

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import nltk
from nltk.corpus import stopwords


In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
DATA_PATH = "/content/gdrive/My Drive/colab_datasets/reviews_Toys_and_Games_5.json"
APP_NAME = 'amazon_sentiment_analysis'
SPARK_URL = 'local[*]'

spark = SparkSession.builder.appName(APP_NAME).master(SPARK_URL).getOrCreate()
df = spark.read.options(inferschema = "true").json(DATA_PATH)

In [0]:
df.show(5)

+----------+-------+-------+--------------------+-----------+--------------+--------------+--------------------+--------------+----------+
|      asin|helpful|overall|          reviewText| reviewTime|    reviewerID|  reviewerName|             summary|unixReviewTime|isPositive|
+----------+-------+-------+--------------------+-----------+--------------+--------------+--------------------+--------------+----------+
|0439893577| [0, 0]|    5.0|I like the item p...|01 29, 2014|A1VXOAVRGKGEAK|         Angie|      Magnetic board|    1390953600|         1|
|0439893577| [1, 1]|    4.0|Love the magnet e...|03 28, 2014| A8R62G708TSCM|       Candace|it works pretty g...|    1395964800|         1|
|0439893577| [1, 1]|    5.0|Both sides are ma...|01 28, 2013|A21KH420DK0ICA|capemaychristy|          love this!|    1359331200|         1|
|0439893577| [0, 0]|    5.0|Bought one a few ...| 02 8, 2014| AR29QK6HPFYZ4|          dcrm|   Daughters love it|    1391817600|         1|
|0439893577| [1, 1]|    4.0

In [0]:
#adding a binary sentiment column: ratings of 3 and above are positive (1.0), the rest negative (0.0)
df = df.withColumn('isPositive', when(df.overall >= 3, 1.0).otherwise(0.0))
#df = df.withColumn('reviewArray', split(df['reviewText'], '\s+'))
df.groupby('isPositive').count().show()

+----------+------+
|isPositive| count|
+----------+------+
|       0.0| 11005|
|       1.0|156592|
+----------+------+



The isPositive column correctly labels each review as positive or negative based on the overall rating, but the classes are heavily imbalanced: negative reviews make up only about 6.6% of the dataset (11,005 of 167,597). I'll use Spark's LogisticRegression, since it has a built-in weight column parameter (weightCol) that I can use to compensate. For this I need to calculate a balancing ratio from the number of positive and negative reviews and the total dataset size. I'll compute the ratio on the training set only, since the class balance of the test set should be treated as unknown. In theory, a random split should give the training and test sets roughly the same imbalance.

# Testing Pipeline Transformers

Below I am testing the different transformers outputs so that I know the features column will be what I expect.

In [0]:
#checking that the Tokenizer transformer works; it also lowercases everything
tokenizer = Tokenizer(inputCol='reviewText', outputCol='reviewToken')
df = tokenizer.transform(df)
df.select('reviewText', 'reviewToken').show(5)

+--------------------+--------------------+
|          reviewText|         reviewToken|
+--------------------+--------------------+
|I like the item p...|[i, like, the, it...|
|Love the magnet e...|[love, the, magne...|
|Both sides are ma...|[both, sides, are...|
|Bought one a few ...|[bought, one, a, ...|
|I have a stainles...|[i, have, a, stai...|
+--------------------+--------------------+
only showing top 5 rows
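As a rough mental model of what the Tokenizer stage does to each row, here is a plain-Python sketch. It assumes lowercasing followed by splitting on whitespace; the helper name `simple_tokenize` and the sample sentence are made up for illustration and are not part of PySpark or the dataset, and Spark's handling of repeated whitespace may differ.

```python
def simple_tokenize(text):
    """Lowercase the text and split it on runs of whitespace (hypothetical helper)."""
    return text.lower().split()

# Made-up sample sentence; punctuation stays attached to the neighbouring word.
tokens = simple_tokenize("Both sides are magnetic.")
print(tokens)
```

Note that punctuation is not stripped, which is consistent with tokens such as `magnetic.` appearing in the stop-word output further down.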



In [0]:
# taking the tokenized reviews and removing stop words
# the next step will be to check term frequencies for each review
stopWords = stopwords.words('english')
remover = StopWordsRemover(inputCol='reviewToken', outputCol='reviewToken_stop', stopWords=stopWords)

df = remover.transform(df)
df.select('reviewText', 'reviewToken', 'reviewToken_stop').show(5)

+--------------------+--------------------+--------------------+
|          reviewText|         reviewToken|    reviewToken_stop|
+--------------------+--------------------+--------------------+
|I like the item p...|[i, like, the, it...|[like, item, pric...|
|Love the magnet e...|[love, the, magne...|[love, magnet, ea...|
|Both sides are ma...|[both, sides, are...|[sides, magnetic....|
|Bought one a few ...|[bought, one, a, ...|[bought, one, yea...|
|I have a stainles...|[i, have, a, stai...|[stainless, steel...|
+--------------------+--------------------+--------------------+
only showing top 5 rows
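The filtering step can be pictured with a small plain-Python sketch. The stop word set here is a tiny illustrative subset chosen by hand (an assumption; the notebook passes NLTK's full English list), and `remove_stop_words` is a hypothetical helper, not a PySpark function.

```python
# Tiny hand-picked stop word set for illustration only.
STOP_WORDS = {"i", "the", "a", "are", "both", "have"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep the tokens that are not stop words, preserving their order."""
    return [tok for tok in tokens if tok not in stop_words]

filtered = remove_stop_words(["i", "like", "the", "item"])
print(filtered)  # ['like', 'item']
```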



In [0]:
#collecting the frequent values of the filtered-token column (each value is a whole token array)
common_words = df.freqItems(['reviewToken_stop']).collect()

In [0]:
#combining them into a set for generation of columns of features
common = []
for words in common_words[0][0]:
    common += words
len(set(common))

1236

In [0]:
cv = CountVectorizer(inputCol='reviewToken_stop', outputCol='reviewVector')
model = cv.fit(df)
df = model.transform(df)
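To make the CountVectorizer output less opaque, here is a minimal plain-Python sketch of the idea: build a vocabulary index, then turn each token list into sparse term counts. It assumes the vocabulary is ordered by corpus-wide term frequency and ignores options such as vocabulary size limits; `fit_vocabulary` and `vectorize` are hypothetical helpers, and the two sample documents are made up.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each term to an index, most frequent term first."""
    counts = Counter(tok for doc in docs for tok in doc)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: idx for idx, (term, _) in enumerate(ordered)}

def vectorize(doc, vocab):
    """Return a sparse {vocabulary index: count} mapping for one tokenized document."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return dict(sorted(counts.items()))

# Made-up tokenized documents.
docs = [["love", "magnet", "love"], ["like", "item"]]
vocab = fit_vocabulary(docs)
print(vocab)
print(vectorize(docs[0], vocab))
```

Only the non-zero counts are stored, which is why the resulting column holds sparse vectors rather than one dense column per word.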

In [0]:
#splitting review into tokens
tokenizer = Tokenizer(inputCol='reviewText', outputCol='reviewToken')

#removing stopwords from tokened reviews
stopWords = stopwords.words('english')
remover = StopWordsRemover(inputCol='reviewToken', outputCol='reviewToken_stop', stopWords=stopWords)

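#sparse vectors that count occurrences of each vocabulary word
cv = CountVectorizer(inputCol='reviewToken_stop', outputCol='featureVector')

#dropping the exploratory columns added above so the pipeline stages can recreate them
df = df.drop('reviewToken', 'reviewToken_stop', 'reviewVector')
(trainingData, testData) = df.randomSplit([0.75, 0.25])

#logistic regression classifier, weighted by the classWeights column
lr = LogisticRegression(featuresCol='featureVector', labelCol='isPositive', weightCol='classWeights')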

pipeline = Pipeline(stages=[tokenizer, remover, cv, lr])

In [0]:
numPosTrain = trainingData.select('isPositive').where('isPositive == 1.0').count()
numNegTrain = trainingData.select('isPositive').where('isPositive == 0.0').count()
print('Training Size = {}'.format(numPosTrain + numNegTrain))
print('Number of Positive Reviews = {}'.format(numPosTrain))
print('Number of Negative Reviews = {}'.format(numNegTrain))
print('Minority Class Balancing Ratio = {}'.format(numPosTrain / (numPosTrain + numNegTrain)))

Training Size = 125898
Number of Positive Reviews = 117668
Number of Negative Reviews = 8230
Minority Class Balancing Ratio = 0.934629620804143


OK, so the negative reviews will get a class weight of 0.9346... and the positive reviews will get a class weight of 1 - 0.9346... = 0.0654...
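A quick arithmetic check, using the training counts printed above, shows why these weights balance the classes: each class's count multiplied by its weight gives the same total. This is a standalone sketch with the counts hard-coded, and the variable names are my own.

```python
# Training-set counts copied from the output above.
num_pos_train = 117668
num_neg_train = 8230
total = num_pos_train + num_neg_train

# The minority (negative) class gets the larger weight.
neg_weight = num_pos_train / total
pos_weight = 1 - neg_weight

# Total weight contributed by each class after weighting.
weighted_pos = num_pos_train * pos_weight
weighted_neg = num_neg_train * neg_weight
print(neg_weight, pos_weight)
print(weighted_pos, weighted_neg)
```

Both weighted totals come out equal (each is numPos * numNeg / total), so the two classes contribute equally to the weighted loss.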

In [0]:
classBalanceRatio = numPosTrain / (numPosTrain + numNegTrain)
trainingData = trainingData.withColumn('classWeights', when(trainingData.isPositive == 0.0, classBalanceRatio).otherwise(1-classBalanceRatio))
trainingData.groupby('classWeights').count().show()

+-------------------+------+
|       classWeights| count|
+-------------------+------+
|0.06537037919585698|117668|
|  0.934629620804143|  8230|
+-------------------+------+



Looks like all of the class weights were added correctly to the trainingData DataFrame. Hopefully this will help with the classification issues I saw when running a RandomForestClassifier without any class balancing (the model predicted only the positive class, since negative reviews are so uncommon).

In [0]:
model = pipeline.fit(trainingData)

In [0]:
predictions = model.transform(testData)

In [0]:
# Compute weighted precision, recall, and F1 on the test set predictions
evaluator = MulticlassClassificationEvaluator(
    labelCol="isPositive", predictionCol="prediction", metricName='weightedPrecision')

print('Precision:', evaluator.evaluate(predictions))

recall = evaluator.evaluate(predictions, {evaluator.metricName:'weightedRecall'})
print('Recall:', recall)

f1 = evaluator.evaluate(predictions, {evaluator.metricName:'f1'})
print('F1:', f1)

Precision: 0.920552393779295
Recall: 0.9170723518549606
F1: 0.9187512002462199


In [0]:
conf_matrix = predictions.crosstab('isPositive', 'prediction')
conf_matrix.show()

+---------------------+----+-----+
|isPositive_prediction| 0.0|  1.0|
+---------------------+----+-----+
|                  1.0|1856|37068|
|                  0.0|1173| 1602|
+---------------------+----+-----+



In [0]:
predictions.groupby('isPositive').count().show()
predictions.groupby('prediction').count().show()

+----------+-----+
|isPositive|count|
+----------+-----+
|       0.0| 2775|
|       1.0|38924|
+----------+-----+

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0| 3029|
|       1.0|38670|
+----------+-----+



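Not too bad without any hyperparameter tuning. The numbers of false positives (1,602) and false negatives (1,856) are similar, but the minority class still has a lot of error. Precision and recall are high for positive reviews (roughly 0.95 to 0.96), but for negative reviews they are still quite low: only 1,173 of the 2,775 negative test reviews (about 42%) were identified, and only about 39% of the negative predictions were correct. The weighted metrics above look strong mainly because the positive class dominates the test set.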
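Per-class precision and recall can be recomputed directly from the crosstab counts above. The sketch below hard-codes those four counts in plain Python; the `precision` and `recall` helpers are my own and are not PySpark functions.

```python
# Counts copied from the crosstab: (actual label, predicted label) -> number of reviews.
conf = {
    (1.0, 1.0): 37068,  # positive reviews predicted positive
    (1.0, 0.0): 1856,   # positive reviews predicted negative (false negatives)
    (0.0, 1.0): 1602,   # negative reviews predicted positive (false positives)
    (0.0, 0.0): 1173,   # negative reviews predicted negative
}

def precision(label):
    """Share of predictions of `label` that were correct."""
    predicted = sum(n for (_, pred), n in conf.items() if pred == label)
    return conf[(label, label)] / predicted

def recall(label):
    """Share of reviews whose true class is `label` that were predicted as such."""
    actual = sum(n for (act, _), n in conf.items() if act == label)
    return conf[(label, label)] / actual

for label in (1.0, 0.0):
    print(label, round(precision(label), 3), round(recall(label), 3))

# Overall accuracy, which is what the support-weighted recall reduces to.
accuracy = (conf[(1.0, 1.0)] + conf[(0.0, 0.0)]) / sum(conf.values())
print(accuracy)
```

The accuracy matches the weightedRecall value reported by the evaluator, which confirms that the weighted figures are dominated by the positive class.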