<a href="https://colab.research.google.com/github/bbchen33/Machine-Learning/blob/master/Amazon_review_pyspark_MachineLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The data are Amazon cell phone reviews from https://www.kaggle.com/grikomsn/amazon-cell-phones-reviews

My goal is to use PySpark and machine learning to determine whether the ratings can be predicted from the review contents.

In [1]:
from google.colab import files
upload_file = files.upload()

Saving amazon-cell-phones-reviews.zip to amazon-cell-phones-reviews.zip


In [2]:
!unzip amazon-cell-phones-reviews.zip

Archive:  amazon-cell-phones-reviews.zip
  inflating: 20190928-items.csv      
  inflating: 20190928-reviews.csv    


In [3]:
!sudo apt install openjdk-8-jdk
!sudo update-alternatives --config java

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java
  libatk-wrapper-java-jni libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin
  libgtk2.0-common libxxf86dga1 openjdk-8-jre x11-utils
Suggested packages:
  gvfs openjdk-8-demo openjdk-8-source visualvm icedtea-8-plugin mesa-utils
The following NEW packages will be installed:
  fonts-dejavu-core fonts-dejavu-extra libatk-wrapper-java
  libatk-wrapper-java-jni libgail-common libgail18 libgtk2.0-0 libgtk2.0-bin
  libgtk2.0-common libxxf86dga1 openjdk-8-jdk openjdk-8-jre x11-utils
0 upgraded, 13 newly installed, 0 to remove and 35 not upgraded.
Need to get 7,119 kB of archives.
After this operation, 20.1 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 libxxf86dga1 amd64 2:1.1.4-1 [13.7 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/m

In [4]:
!java -version

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)


In [5]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
     |████████████████████████████████| 215.7MB 59kB/s
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
     |████████████████████████████████| 204kB 56.5MB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216130387 sha256=9abb68a74322fcad6d7e9dd10bf8de4be22fa8a0ba8baa95ce5e08cdb645666e
  Stored in directory: /root/.cache/pip/wheels/ab/09/4d/0d184230058e654eb1b04467dbc1292f00eaa186544604b471
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.4


In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('phone_reviews').getOrCreate()

In [0]:
df = spark.read.csv('20190928-reviews.csv', header = True, inferSchema = True)

In [8]:
df.printSchema()

root
 |-- asin: string (nullable = true)
 |-- name: string (nullable = true)
 |-- rating: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- verified: boolean (nullable = true)
 |-- title: string (nullable = true)
 |-- body: string (nullable = true)
 |-- helpfulVotes: string (nullable = true)



In [9]:
df.describe().show()

+-------+----------+------------------+------------------+-----------------+------------------+--------+--------------------+
|summary|      asin|              name|            rating|             date|             title|    body|        helpfulVotes|
+-------+----------+------------------+------------------+-----------------+------------------+--------+--------------------+
|  count|     82815|             82815|             82815|            82815|             82815|   82799|               33859|
|   mean|      null|          Infinity|3.7603574231721306|             null|246.04250000000002|    null|   6.561823802163833|
| stddev|      null|               NaN|  1.60564358002101|             null| 614.6793183459753|    null|  25.759316802565234|
|    min|B0000SX2UC|"""Mark"" Anthony"|                 1|    April 1, 2010|                 !| "" ..."| "" Hand Candy""U...|
|    max|B07X51T2VK|       🥜 Potplant|                 5|September 9, 2019|            🥰🥰🥰|      🧐|                tim

The describe() output shows that some review bodies are missing: the body column has 82799 non-null entries, whereas most of the other columns have 82815.

In [0]:
new_df = df.select('rating','body')

We can filter out rows with null body. 

In [0]:
new_df = new_df.filter(new_df.body.isNotNull())

Now we can process the text in the body for machine learning.

In [12]:
new_df.columns

['rating', 'body']

In [0]:
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, IDF, StringIndexer

In [14]:
regexTokenize = RegexTokenizer(inputCol = 'body', outputCol = 'words', pattern = '\\W')
new_df = regexTokenize.transform(new_df)
new_df.head()

Row(rating=3, body="I had the Samsung A600 for awhile which is absolute doo doo. You can read my review on it and detect my rage at the stupid thing. It finally died on me so I used this Nokia phone I bought in a garage sale for $1. I wonder y she sold it so cheap?... Bad: ===> I hate the menu. It takes forever to get to what you want because you have to scroll endlessly. Usually phones have numbered categories so u can simply press the # and get where you want to go. ===> It's a pain to put it on silent or vibrate. If you're in class and it rings, you have to turn it off immediately. There's no fast way to silence the damn thing. Always remember to put it on silent! I learned that the hard way. ===> It's so true about the case. It's a mission to get off and will break ur nails in the process. Also, you'll damage the case each time u try. For some reason the phone started giving me problems once I did succeed in opening it. ===> Buttons could be a bit bigger. Vibration could be stronge

In [15]:
new_df.select('words').head()

Row(words=['i', 'had', 'the', 'samsung', 'a600', 'for', 'awhile', 'which', 'is', 'absolute', 'doo', 'doo', 'you', 'can', 'read', 'my', 'review', 'on', 'it', 'and', 'detect', 'my', 'rage', 'at', 'the', 'stupid', 'thing', 'it', 'finally', 'died', 'on', 'me', 'so', 'i', 'used', 'this', 'nokia', 'phone', 'i', 'bought', 'in', 'a', 'garage', 'sale', 'for', '1', 'i', 'wonder', 'y', 'she', 'sold', 'it', 'so', 'cheap', 'bad', 'i', 'hate', 'the', 'menu', 'it', 'takes', 'forever', 'to', 'get', 'to', 'what', 'you', 'want', 'because', 'you', 'have', 'to', 'scroll', 'endlessly', 'usually', 'phones', 'have', 'numbered', 'categories', 'so', 'u', 'can', 'simply', 'press', 'the', 'and', 'get', 'where', 'you', 'want', 'to', 'go', 'it', 's', 'a', 'pain', 'to', 'put', 'it', 'on', 'silent', 'or', 'vibrate', 'if', 'you', 're', 'in', 'class', 'and', 'it', 'rings', 'you', 'have', 'to', 'turn', 'it', 'off', 'immediately', 'there', 's', 'no', 'fast', 'way', 'to', 'silence', 'the', 'damn', 'thing', 'always', 'rem
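The tokens above show what the '\\W' pattern does: the text is lowercased and split on every non-word character, which is why "It's" becomes 'it' and 's' and why '$' and '#' disappear. A rough plain-Python approximation is sketched below; regex_tokenize is a hypothetical helper, not part of PySpark, and it assumes the tokenizer's default settings (lowercasing on, minimum token length 1).

```python
import re

# ASCII-only \W approximates the default behaviour of the Java regex engine
# that Spark runs on.
NON_WORD = re.compile(r"\W", re.ASCII)

def regex_tokenize(text, min_token_length=1):
    """Lowercase, split on every non-word character, drop empty pieces."""
    pieces = NON_WORD.split(text.lower())
    return [p for p in pieces if len(p) >= min_token_length]

tokens = regex_tokenize("It's a pain to put it on silent. Press the # and go!")
print(tokens)
```

Splitting on '\\W' also breaks contractions apart, so single-letter tokens such as 's' and 're' end up in the vocabulary.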

In [0]:
countVectorize = CountVectorizer(inputCol = 'words', outputCol = 'TF')
countVectorize_model = countVectorize.fit(new_df)
new_df = countVectorize_model.transform(new_df)

CountVectorizer produces a sparse vector of word counts for each review.
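The idea behind the sparse count vector can be sketched in plain Python. This is only an illustration on hypothetical toy documents, not Spark's implementation: the vocabulary is ordered by how often each term occurs in the whole corpus, and each document stores only the indices whose count is non-zero.

```python
from collections import Counter

# Hypothetical tokenized reviews.
docs = [
    ["the", "phone", "is", "great"],
    ["the", "battery", "died"],
    ["the", "phone", "died", "the", "phone", "died"],
]

# Vocabulary: terms ordered by total count across the corpus, most frequent first.
corpus_counts = Counter(term for doc in docs for term in doc)
vocabulary = [term for term, _ in corpus_counts.most_common()]
index_of = {term: i for i, term in enumerate(vocabulary)}

def to_sparse_counts(doc):
    """Return (vocabulary size, {term index: count}) with zero counts omitted."""
    counts = Counter(index_of[t] for t in doc)
    return len(vocabulary), dict(sorted(counts.items()))

size, counts = to_sparse_counts(docs[2])
print(vocabulary[0], size, counts)
```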

In [17]:
new_df.head()

Row(rating=3, body="I had the Samsung A600 for awhile which is absolute doo doo. You can read my review on it and detect my rage at the stupid thing. It finally died on me so I used this Nokia phone I bought in a garage sale for $1. I wonder y she sold it so cheap?... Bad: ===> I hate the menu. It takes forever to get to what you want because you have to scroll endlessly. Usually phones have numbered categories so u can simply press the # and get where you want to go. ===> It's a pain to put it on silent or vibrate. If you're in class and it rings, you have to turn it off immediately. There's no fast way to silence the damn thing. Always remember to put it on silent! I learned that the hard way. ===> It's so true about the case. It's a mission to get off and will break ur nails in the process. Also, you'll damage the case each time u try. For some reason the phone started giving me problems once I did succeed in opening it. ===> Buttons could be a bit bigger. Vibration could be stronge

To see which word each index of the count vector stands for, one can look at the vocabulary of countVectorize_model.

In [18]:
countVectorize_model.vocabulary[:5]

['the', 'i', 'it', 'phone', 'and']

We can use TF-IDF (term frequency-inverse document frequency) to weight the terms. A term gets a high weight when it occurs often within a review but appears in only a few reviews overall. A term that appears in nearly every review, such as the stop words "a", "the" or "it", gets a low weight because it says little about any particular review.
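As a worked example of the weighting, here is a small plain-Python sketch on hypothetical toy documents. It uses the smoothed formula documented for Spark's IDF, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. The helper names idf and tfidf are made up for illustration.

```python
import math

# Hypothetical tokenized reviews; "the" occurs in every one of them.
docs = [
    ["the", "phone", "is", "great"],
    ["the", "battery", "died"],
    ["the", "screen", "is", "great"],
]

num_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log((num_docs + 1) / (doc_freq + 1))

def tfidf(doc):
    """Raw term count in the document multiplied by the term's IDF."""
    return {term: doc.count(term) * idf(term) for term in set(doc)}

weights = tfidf(docs[1])
print(weights)
```

In this toy corpus "the" appears in all three documents, so its weight is log(4/4) = 0, while "battery" appears in only one and gets log(4/2).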

In [0]:
idf_model = IDF(inputCol = 'TF', outputCol = 'TFIDF').fit(new_df)
new_df = idf_model.transform(new_df)

In [20]:
new_df.columns

['rating', 'body', 'words', 'TF', 'TFIDF']

In [21]:
new_df.select('TFIDF').head()

Row(TFIDF=SparseVector(34250, {0: 8.6548, 1: 5.6328, 2: 11.2846, 3: 3.6635, 4: 4.3415, 5: 4.742, 6: 13.2212, 7: 3.0006, 8: 2.0553, 9: 3.5135, 12: 6.5855, 13: 1.348, 14: 11.0838, 15: 1.4097, 16: 1.5316, 17: 1.3605, 18: 5.9716, 19: 1.3818, 20: 4.9681, 21: 15.6472, 22: 3.0222, 23: 7.2207, 25: 10.4334, 26: 1.7364, 27: 1.8389, 28: 1.8969, 29: 1.8434, 31: 2.0429, 32: 2.0827, 33: 1.9715, 34: 6.2587, 35: 2.0421, 38: 1.9781, 39: 8.711, 40: 4.333, 41: 2.2323, 42: 2.1739, 44: 2.2719, 47: 2.3802, 48: 2.3069, 49: 2.3139, 50: 2.154, 51: 11.6793, 55: 2.4275, 56: 2.3889, 59: 2.4691, 60: 2.4951, 62: 5.2114, 64: 5.397, 66: 2.5357, 70: 2.6876, 72: 2.5666, 75: 2.6036, 76: 2.6484, 79: 2.7039, 82: 2.6802, 84: 5.6204, 91: 2.8374, 92: 2.7519, 99: 2.9394, 100: 2.867, 103: 2.9764, 105: 2.8993, 112: 3.0054, 115: 2.9794, 117: 6.1284, 126: 6.166, 127: 3.1157, 128: 3.1266, 139: 3.2036, 140: 6.4608, 145: 3.2195, 146: 3.3639, 156: 6.793, 157: 6.711, 167: 3.4114, 172: 3.3272, 176: 6.7551, 188: 6.905, 190: 3.4525, 217:

There are two ways to look at the label in this case, the rating. We could treat it as a category, so that ratings 1, 2, 3, 4 and 5 are five separate classes. Alternatively, even though people can only choose integers between 1 and 5, a phone can have an average rating of 3.5, so the ratings could also be treated as continuous.
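The practical difference between the two views shows up in how errors are scored. The sketch below uses hypothetical ratings and predictions (none of these numbers come from the dataset): accuracy treats a prediction of 4 for a true 5 as just as wrong as a prediction of 1, whereas a regression metric such as RMSE penalizes the second far more.

```python
import math

# Hypothetical true ratings and two sets of predictions.
actual = [5, 4, 1, 3]
near_misses = [4, 5, 2, 4]   # always off by one star
far_misses = [1, 1, 5, 5]    # off by two stars or more

def accuracy(y_true, y_pred):
    """Classification view: a prediction is either exactly right or wrong."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Regression view: errors are weighted by their distance from the truth."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

for name, preds in [("near", near_misses), ("far", far_misses)]:
    print(name, accuracy(actual, preds), rmse(actual, preds))
```

Both prediction sets score an accuracy of 0, but their RMSE differs by more than a factor of three.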

In [0]:
train_df, test_df = new_df.randomSplit([0.7, 0.3], seed = 1)

In [0]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol = 'TFIDF', labelCol = 'rating', maxIter = 10)
lrModel = lr.fit(train_df)

In [0]:
trainingSummary = lrModel.summary

In [25]:
trainingSummary.accuracy

0.8652262310899755

In [0]:
predictions = lrModel.transform(test_df)

In [27]:
predictions.columns

['rating',
 'body',
 'words',
 'TF',
 'TFIDF',
 'rawPrediction',
 'probability',
 'prediction']

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(
    labelCol = 'rating', predictionCol = 'prediction', metricName = 'accuracy')

The accuracy on the test set is only 0.68, well below the 0.87 on the training set, which is not great.

In [30]:
evaluator.evaluate(predictions)

0.6840999959613909

Let's try again with 100 iterations.

In [0]:
lr2 = LogisticRegression(featuresCol = 'TFIDF', labelCol = 'rating', maxIter = 100)
lrModel2 = lr2.fit(train_df)

In [32]:
predictions2 = lrModel2.transform(test_df)
print("The accuracy is: ", evaluator.evaluate(predictions2))

The accuracy is:  0.6292960704333428


The accuracy on the test set is lower than before (0.63 versus 0.68). This may be because the model is overfitting the training data.

The col function can be used to change a column name. It is not necessary here, but it is good practice and consistent with what people tend to do in the field.

In [0]:
from pyspark.ml.classification import RandomForestClassifier
RF = RandomForestClassifier(featuresCol = 'TFIDF', labelCol = 'rating')
RF_model = RF.fit(train_df)

In [0]:
RF_predictions = RF_model.transform(test_df)


In [37]:
evaluator.evaluate(RF_predictions)

0.5446064375429103

Again, the accuracy is not great. One obstacle is that some machine learning methods did not work with my PySpark installation, for reasons I have not yet identified. It is also possible that the rating is simply hard to predict from the review contents and that doing so may require deep learning. I'll come back to try to improve the models.