# Descriptive Analysis of Amazon Reviews

## Project objective

To learn and assess the capabilities of PySpark, the Python API for Apache Spark, for descriptive analysis of large datasets.

## Notes

The data have been reduced to the 5-core subset, meaning that every remaining user and item has at least 5 reviews. The analysis is performed with a PySpark Jupyter kernel on a local machine.
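As a rough illustration of what a k-core reduction involves, the sketch below filters (reviewer, item) pairs repeatedly until every remaining reviewer and item has at least `k` reviews. The function name `k_core` and the toy data are assumptions made for illustration; this is not necessarily how the published dataset was produced.

```python
from collections import Counter

def k_core(reviews, k=5):
    """Drop reviews until every remaining user and item has at least k reviews.

    `reviews` is a list of (reviewer_id, asin) pairs (an assumed input shape).
    """
    current = list(reviews)
    while True:
        user_counts = Counter(user for user, _ in current)
        item_counts = Counter(item for _, item in current)
        kept = [
            (user, item)
            for user, item in current
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        # stop once a full pass removes nothing
        if len(kept) == len(current):
            return kept
        current = kept

# toy data, filtered with k=2 to keep the example small
toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
print(k_core(toy, k=2))
# [('u1', 'a'), ('u1', 'b'), ('u2', 'a'), ('u2', 'b')]
```

The filter has to be repeated because removing one review can push another user or item below the threshold: here item `c` is dropped first, which leaves `u3` with a single review, so `u3` is dropped on the next pass.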

## References

http://spark.apache.org/docs/2.0.1/<br>
http://jmcauley.ucsd.edu/data/amazon/

In [1]:
# import libraries

import os, re
import pandas as pd

In [2]:
# construct data path

datapath = os.path.join('..', 'datasets', 'amazon-reviews/')

---

# Introductory exploration of PySpark

Getting familiar with PySpark and Spark DataFrames.

In [3]:
# load test data into a Spark DataFrame (`spark` is the session object supplied by the PySpark kernel)

autodata = spark.read.json(datapath+'reviews_Automotive_5.json.gz')

In [4]:
# some info on the dataframe

print('Type:', type(autodata))
print('Row count:', autodata.count())
print('Columns:', autodata.columns)
# print(autodata.describe())

Type: <class 'pyspark.sql.dataframe.DataFrame'>
Row count: 20473
Columns: ['asin', 'helpful', 'overall', 'reviewText', 'reviewTime', 'reviewerID', 'reviewerName', 'summary', 'unixReviewTime']


In [5]:
# example: find reviews that mention "audi"; the negative lookahead skips words such as "audio" and "audience"

dfaudi = autodata.filter('LCASE(reviewText) RLIKE "audi(?![obetn])"')
print(type(dfaudi))
print(dfaudi.count())
print(dfaudi.first())

<class 'pyspark.sql.dataframe.DataFrame'>
58
Row(asin='B0002NYE5W', helpful=[1, 1], overall=5.0, reviewText="Until I went through a detailing class I had never used automobile detailing clay.  Once you have used it, you can never go back.  Not long ago I detailed a new red Audi A5 that sat on the dealer's lot for about 6 months.  The amount of embedded dirt was amazing.", reviewTime='11 12, 2013', reviewerID='A108AWE1CYYZVB', reviewerName='Good Gora "Good Gora"', summary="Can't Detail Without It", unixReviewTime=1384214400)
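To see what the negative lookahead in the filter does, the same pattern can be tried with Python's `re` module on a few strings. The sample sentences are made up for illustration, apart from the first, which is taken from the row above. Spark's `RLIKE` uses Java regular expressions, but a pattern this simple should behave the same way in both engines.

```python
import re

# same pattern as in the filter; (?![obetn]) rejects "audi" when followed by o, b, e, t or n
pattern = re.compile(r"audi(?![obetn])")

samples = [
    "Not long ago I detailed a new red Audi A5",
    "The audio quality is great",
    "Fits my Audi.",
    "A very audible click",
    "My audience loved it",
]

# lowercase first, mirroring LCASE(reviewText) in the filter
matches = [bool(pattern.search(text.lower())) for text in samples]
for text, hit in zip(samples, matches):
    print(hit, text)
# True Not long ago I detailed a new red Audi A5
# False The audio quality is great
# True Fits my Audi.
# False A very audible click
# False My audience loved it
```

The pattern has no word boundary, so a word that merely contains "audi", such as "saudi", would also match.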


In [6]:
# convert the PySpark dataframe to a pandas dataframe if it is small enough to fit in memory

rowcount = autodata.count()
colcount = len(autodata.columns)

if rowcount < 100000 and colcount < 100:
    df = pd.DataFrame(autodata.collect(), columns=autodata.columns)
    print(df.shape)
    print(df.info())
else:
    print('Dataset too large to convert to pandas dataframe.')
    print('Rows:', str(rowcount), '\t', 'Columns:', str(colcount))

(20473, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20473 entries, 0 to 20472
Data columns (total 9 columns):
asin              20473 non-null object
helpful           20473 non-null object
overall           20473 non-null float64
reviewText        20473 non-null object
reviewTime        20473 non-null object
reviewerID        20473 non-null object
reviewerName      20260 non-null object
summary           20473 non-null object
unixReviewTime    20473 non-null int64
dtypes: float64(1), int64(1), object(7)
memory usage: 1.4+ MB
None


In [7]:
df.head(10)

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | B00002243X | [4, 4] | 5.0 | I needed a set of jumper cables for my new car... | 08 17, 2011 | A3F73SC1LY51OO | Alan Montgomery | Work Well - Should Have Bought Longer Ones | 1313539200 |
| 1 | B00002243X | [1, 1] | 4.0 | These long cables work fine for my truck, but ... | 09 4, 2011 | A20S66SKYXULG2 | alphonse | Okay long cables | 1315094400 |
| 2 | B00002243X | [0, 0] | 5.0 | Can't comment much on these since they have no... | 07 25, 2013 | A2I8LFSN2IS5EO | Chris | Looks and feels heavy Duty | 1374710400 |
| 3 | B00002243X | [19, 19] | 5.0 | I absolutley love Amazon!!! For the price of ... | 12 21, 2010 | A3GT2EWQSO45ZG | DeusEx | Excellent choice for Jumper Cables!!! | 1292889600 |
| 4 | B00002243X | [0, 0] | 5.0 | I purchased the 12' feet long cable set and th... | 07 4, 2012 | A3ESWJPAVRPWB4 | E. Hernandez | Excellent, High Quality Starter Cables | 1341360000 |
| 5 | B00002243X | [1, 1] | 5.0 | These Jumper cables are heavy Duty, Yet easy t... | 11 14, 2009 | A1ORODEBRN64C | James F. Magowan "Jimmy Mac" | Compact and Strong ! | 1258156800 |
| 6 | B00002243X | [1, 1] | 5.0 | bought these for my k2500 suburban plenty of l... | 01 10, 2012 | A2R49ZN3G6FTCQ | John M. Harrell | nice cables | 1326153600 |
| 7 | B00002243X | [0, 0] | 5.0 | these are good enough to get most motorized ve... | 06 13, 2013 | A1Q65KYDKXIX8E | Leeland H. | for cars and pickups | 1371081600 |
| 8 | B00002243X | [0, 0] | 4.0 | The Coleman Cable 08665 12-Feet Heavy-Duty Tru... | 07 18, 2013 | A3BI8BKIHESDNQ | L. J. Cunningham | Coleman Cable 08665 12-Feet Heavy-Duty Truck a... | 1374105600 |
| 9 | B00002243X | [0, 0] | 5.0 | I have an old car, Its bound to need these som... | 01 22, 2014 | A1R089P5AS26UE | Mike | Beefy | 1390348800 |


---
# Machine learning with Spark MLlib

In [37]:
# import libraries

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType, StringType

In [61]:
# begin by converting the helpfulness of reviews into something useful:
# `helpful` is a [helpful votes, total votes] pair, so split it into two integer columns

udf0 = UserDefinedFunction(lambda x: x[0], IntegerType())
udf1 = UserDefinedFunction(lambda x: x[1], IntegerType())

keepcols = ["helpful", "summary", "overall"]

helpvotedata = (
    autodata.select(*keepcols)
    .withColumn("helps", udf0("helpful"))
    .withColumn("votes", udf1("helpful"))
    .drop("helpful")
)
helpvotedata.take(5)

[Row(summary='Work Well - Should Have Bought Longer Ones', overall=5.0, helps=4, votes=4),
 Row(summary='Okay long cables', overall=4.0, helps=1, votes=1),
 Row(summary='Looks and feels heavy Duty', overall=5.0, helps=0, votes=0),
 Row(summary='Excellent choice for Jumper Cables!!!', overall=5.0, helps=19, votes=19),
 Row(summary='Excellent, High Quality Starter Cables', overall=5.0, helps=0, votes=0)]
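One way the two new columns could be combined is a helpfulness ratio. The notebook does not compute this; the plain-Python sketch below is only an assumption about what "something useful" might look like, and `helpful_ratio` is a hypothetical helper. Reviews with no votes need special handling to avoid dividing by zero.

```python
# (helpful, summary) pairs copied from the first rows of the dataset shown above
rows = [
    ([4, 4], "Work Well - Should Have Bought Longer Ones"),
    ([1, 1], "Okay long cables"),
    ([0, 0], "Looks and feels heavy Duty"),
    ([19, 19], "Excellent choice for Jumper Cables!!!"),
]

def helpful_ratio(helps, votes):
    """Return helps / votes, or None when the review received no votes (hypothetical helper)."""
    return helps / votes if votes else None

ratios = [helpful_ratio(helps, votes) for (helps, votes), _ in rows]
print(ratios)
# [1.0, 1.0, None, 1.0]
```

Returning `None` rather than `0.0` keeps unrated reviews distinguishable from reviews that were rated unhelpful.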

---
# Natural language processing with Spark MLlib

Convert the review summary text into numeric TF-IDF feature vectors that would be appropriate for a machine learning model.
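Before the Spark version, a plain-Python sketch can show what tokenizing and stop-word removal do to a single summary. Spark's `Tokenizer` lowercases the text and splits on whitespace, so punctuation stays attached to words; this is why tokens such as `'cables!!!'` appear in the output further down. The stop-word set and function names here are illustrative assumptions, not Spark's own list or API.

```python
import re

# small illustrative stop-word set (an assumption; Spark's default English list is much longer)
STOP_WORDS = {"a", "and", "for", "have", "should", "the"}

def whitespace_tokens(text):
    """Lowercase and split on whitespace; punctuation stays attached to words."""
    return text.lower().split()

def regex_tokens(text):
    """Lowercase and split on runs of non-word characters, dropping empty strings."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]

summary = "Excellent choice for Jumper Cables!!!"
print(remove_stop_words(whitespace_tokens(summary)))
# ['excellent', 'choice', 'jumper', 'cables!!!']
print(remove_stop_words(regex_tokens(summary)))
# ['excellent', 'choice', 'jumper', 'cables']
```

In Spark, `RegexTokenizer` from `pyspark.ml.feature` is the analogous option when punctuation should be stripped during tokenization.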

## Reference
http://spark.apache.org/docs/latest/ml-features.html

In [8]:
# import libraries

from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover

In [21]:
# tokenize text with bag of words approach and create new column
tokenizer = Tokenizer(inputCol="summary", outputCol="rawWords")
rawwordsData = tokenizer.transform(autodata)

# apply stopwords remover
remover = StopWordsRemover(inputCol="rawWords", outputCol="words")
wordsData = remover.transform(rawwordsData)

# hash each word into one of numFeatures bins
# term frequencies are then counted per bin index, so words that collide share a count
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=64)
featurizedData = hashingTF.transform(wordsData)
# alternatively, CountVectorizer can also be used to get term frequency vectors

# IDF is an estimator that is fit on a dataset and returns a model
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

# examine first 10 rows
for features_label in rescaledData.select("features", "summary", "words").take(10):
    print()
    print(features_label)


Row(features=SparseVector(64, {23: 3.3758, 27: 2.9896, 39: 3.0864, 45: 2.7952, 52: 1.7119, 63: 2.863}), summary='Work Well - Should Have Bought Longer Ones', words=['work', 'well', '-', 'bought', 'longer', 'ones'])

Row(features=SparseVector(64, {19: 3.6243, 40: 3.1735, 42: 3.2831}), summary='Okay long cables', words=['okay', 'long', 'cables'])

Row(features=SparseVector(64, {30: 3.478, 37: 2.804, 49: 5.176}), summary='Looks and feels heavy Duty', words=['looks', 'feels', 'heavy', 'duty'])

Row(features=SparseVector(64, {9: 3.2031, 30: 3.478, 35: 3.3407, 41: 3.0424}), summary='Excellent choice for Jumper Cables!!!', words=['excellent', 'choice', 'jumper', 'cables!!!'])

Row(features=SparseVector(64, {19: 3.6243, 20: 3.0993, 38: 3.1069, 49: 2.588, 54: 3.396}), summary='Excellent, High Quality Starter Cables', words=['excellent,', 'high', 'quality', 'starter', 'cables'])

Row(features=SparseVector(64, {8: 3.1312, 31: 2.9694, 62: 3.2115}), summary='Compact and Strong !', words=['compact'
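The output hints at hash collisions: 'Looks and feels heavy Duty' keeps four words but only three bins, and its weight for bin 49 (5.176) is twice the 2.588 seen for bin 49 in another row, which suggests two words landed in the same bin. The plain-Python sketch below mimics the hashing and IDF steps to make the mechanics visible. It uses `zlib.crc32` as a stand-in hash, so the bin indices will not match Spark's, and the IDF uses the smoothed form described in Spark's feature documentation, log((m + 1) / (df + 1)). All function names are illustrative.

```python
import math
import zlib
from collections import Counter

def hashed_tf(tokens, num_features=64):
    """Count tokens per hash bin.

    zlib.crc32 is a stand-in chosen for reproducibility; Spark's HashingTF
    uses a different hash function, so its bin indices will differ.
    """
    return Counter(zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)

def idf_weights(tf_vectors):
    """Smoothed IDF per bin: log((n_docs + 1) / (doc_freq + 1))."""
    n_docs = len(tf_vectors)
    # iterating a Counter yields each bin once, so this counts documents per bin
    doc_freq = Counter(bin_ for tf in tf_vectors for bin_ in tf)
    return {bin_: math.log((n_docs + 1) / (df + 1)) for bin_, df in doc_freq.items()}

# token lists copied from the output above
docs = [
    ["okay", "long", "cables"],
    ["looks", "feels", "heavy", "duty"],
    ["excellent", "choice", "jumper", "cables!!!"],
]

tf_vectors = [hashed_tf(doc) for doc in docs]
idf_by_bin = idf_weights(tf_vectors)
tfidf = [{b: count * idf_by_bin[b] for b, count in tf.items()} for tf in tf_vectors]

for vec in tfidf:
    print(sorted(vec.items()))
```

With only 64 bins, collisions are likely for any realistic vocabulary; a larger `numFeatures` reduces them at the cost of longer vectors.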

In [19]:
# sentenceData = spark.createDataFrame([
#     (0, ["I", "saw", "the", "red", "balloon"]),
#     (1, ["Mary", "had", "a", "little", "lamb"])
# ], ["id", "raw"])

# remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
# remover.transform(sentenceData).show(truncate=False)

print(StopWordsRemover.loadDefaultStopWords("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no