# Load Data

First we load the data from S3. It is stored as a simple CSV file with three columns:
1. product name
2. review text
3. rating (1 - 5)

In [2]:
from pyspark.sql.functions import *

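# Read every line as a single string column named 'value'
data = sqlContext.read.text("s3://dimajix-training/data/amazon_baby")

# Naive CSV parsing: split each line at commas and keep the first three fields.
# Rows whose third field is not an integer get a null rating and are dropped.
fields = split('value', ',')
data = data.select(
        fields[0].alias('name'),
        fields[1].alias('review'),
        fields[2].alias('rating').cast('int')
).filter(col('rating').isNotNull())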

data.limit(5).toPandas()

Unnamed: 0,name,review,rating
0,SoHo Bug's Party Baby Crib Nursery Bedding Set...,We purchased this bedding because we loved the...,5
1,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,this 0-3 mos gown is precious but it is signif...,3
2,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,Fit nicely with a little room to grow. VERY S...,5
3,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,I bought this for my sister who is doing koala...,5
4,Medela Symphony &amp; Lactina Double Pumping K...,Waste waste waste! Manual part sucked also.......,1
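Splitting each line at every comma is simple, but it only works when no field contains a comma itself. Whether the source file quotes such fields is not stated here, so the following is only a sketch with a made-up sample line. It compares naive splitting with Python's `csv` module, which honours quotes.

```python
import csv

# Hypothetical sample line: the review text contains commas and is quoted.
line = 'Baby Blanket,"Soft, warm, and easy to wash",5'

# Naive splitting cuts the quoted review into several pieces.
naive = line.split(',')

# The csv module respects the quotes and returns exactly three fields.
parsed = next(csv.reader([line]))

print(len(naive))
print(parsed)
```

With the naive approach the rating would no longer sit in the third position, so the cast to an integer would fail and the row would be filtered out.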


# Extract Sentiment

Since we want to perform a classification (positive review vs negative review), we need to extract a binary sentiment value. We will map the ratings as follows:

1. Ratings 1 and 2 count as a negative review
2. Rating 3 counts as a neutral review
3. Ratings 4 and 5 count as a positive review

Since we want a binary classification, we will also remove neutral reviews altogether.

In [3]:
data = data.filter(data.rating != 3)
data = data.withColumn('sentiment', when(data.rating < 3, 0.0).otherwise(1.0))

data.limit(10).toPandas()

Unnamed: 0,name,review,rating,sentiment
0,SoHo Bug's Party Baby Crib Nursery Bedding Set...,We purchased this bedding because we loved the...,5,1.0
1,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,Fit nicely with a little room to grow. VERY S...,5,1.0
2,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,I bought this for my sister who is doing koala...,5,1.0
3,Medela Symphony &amp; Lactina Double Pumping K...,Waste waste waste! Manual part sucked also.......,1,0.0
4,Medela Symphony &amp; Lactina Double Pumping K...,I bought this pumping kit for a Lactina Select...,5,1.0
5,Trend Lab Dr Seuss ABC Framed Receiving Blanket,Very soft little blanket. all of the Dr.Seuss ...,4,1.0
6,Infant Airplane Seat - Flyebaby Airplane Baby ...,Never got to use it!!! It should have come wit...,1,0.0
7,Infant Airplane Seat - Flyebaby Airplane Baby ...,Turns airline seat into car seat. Used on a U...,5,1.0
8,Infant Airplane Seat - Flyebaby Airplane Baby ...,I mine as well pull out my lighter and burn $5...,1,0.0
9,Infant Airplane Seat - Flyebaby Airplane Baby ...,Like some of the other reviews mention- the Fl...,5,1.0
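The mapping from rating to sentiment described above can also be written as a small plain Python function. This is only an illustrative sketch; the function name `rating_to_sentiment` and the sample ratings are made up.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a binary label, or None for a neutral review."""
    if rating == 3:
        return None  # neutral reviews are dropped
    return 0.0 if rating < 3 else 1.0

ratings = [5, 3, 5, 1, 4, 2]

# Keep only the non-neutral reviews and convert them to labels.
labels = [rating_to_sentiment(r) for r in ratings if r != 3]

print(labels)
```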


# Extract Features from Reviews

Now we want to split the review text into individual words, so we can create a "bag of words" model. To get a reasonably clean vocabulary, we first remove all punctuation from the reviews. This is done as the first step using a user-defined function (UDF) in PySpark.

In [4]:
import string
from pyspark.sql.types import *

def cleanup_text(text):
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text

remove_punctuation = udf(cleanup_text, StringType())
data2 = data.withColumn('review',remove_punctuation('review'))
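To see what the UDF does to a single review, the same function can be run on a plain string outside of Spark. The sample sentence is made up for illustration.

```python
import string

def cleanup_text(text):
    # Same logic as the UDF: every punctuation character becomes a space.
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text

sample = "Doesn't fit. VERY soft!"
cleaned = cleanup_text(sample)

print(repr(cleaned))
```

Note that the apostrophe is replaced as well, so "Doesn't" turns into the two fragments "Doesn" and "t". Replacing a character that is followed by a space also leaves two consecutive spaces behind.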

## Split Reviews into Words
We could do that ourselves using the Python split method, but we use a Transformer provided by PySpark instead. This saves us some time and helps to keep the code clean.

In [8]:
from pyspark.ml.feature import *

tokenizer = Tokenizer(inputCol='review', outputCol='words')
words = tokenizer.transform(data2)

words.limit(3).toPandas()

Unnamed: 0,name,review,rating,sentiment,words
0,SoHo Bug's Party Baby Crib Nursery Bedding Set...,We purchased this bedding because we loved the...,5,1.0,"[we, purchased, this, bedding, because, we, lo..."
1,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,Fit nicely with a little room to grow VERY S...,5,1.0,"[fit, nicely, with, a, little, room, to, grow,..."
2,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,I bought this for my sister who is doing koala...,5,1.0,"[i, bought, this, for, my, sister, who, is, do..."
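The following plain Python sketch mimics the tokenization step. It assumes that the Tokenizer lowercases the text and splits at every single whitespace character, which is consistent with the empty tokens visible in later outputs. The sample string is made up.

```python
import re

# Text as it might look after punctuation removal (note the double space).
cleaned = "Doesn t fit  VERY soft"

# Lowercase, then split at every single whitespace character.
tokens = re.split(r"\s", cleaned.lower())

# Alternative: split() without arguments collapses runs of whitespace.
tokens_without_empty = cleaned.lower().split()

print(tokens)
print(tokens_without_empty)
```

Two consecutive spaces produce an empty token in the first variant. That would explain why the empty string later appears as a very frequent "word".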


## Remove Stop Words

We also want to remove so-called stop words, the small words that mainly serve as glue for building sentences. They usually carry little information in a simple bag-of-words model, so we get rid of them.

This is such common practice that PySpark already contains a Transformer for doing just that.

In [10]:
stopWords=['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']
    
stopWordsRemover = StopWordsRemover(inputCol='words', outputCol='vwords', stopWords=stopWords)
vwords = stopWordsRemover.transform(words)

vwords.limit(3).toPandas()

Unnamed: 0,name,review,rating,sentiment,words,vwords
0,SoHo Bug's Party Baby Crib Nursery Bedding Set...,We purchased this bedding because we loved the...,5,1.0,"[we, purchased, this, bedding, because, we, lo...","[we, purchased, bedding, because, we, loved, c..."
1,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,Fit nicely with a little room to grow VERY S...,5,1.0,"[fit, nicely, with, a, little, room, to, grow,...","[fit, nicely, with, little, room, grow, , , ve..."
2,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,I bought this for my sister who is doing koala...,5,1.0,"[i, bought, this, for, my, sister, who, is, do...","[bought, my, sister, who, doing, koalas, her, ..."
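Conceptually, stop word removal is just a filter over the token list. The sketch below uses the stop word list from the cell above and the first tokens of the third review. It is a plain Python illustration, not the StopWordsRemover implementation.

```python
stop_words = set(['the', 'a', 'and', 'or', 'it', 'this', 'of', 'an', 'as', 'in',
                  'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i'])

words = ['i', 'bought', 'this', 'for', 'my', 'sister', 'who', 'is']

# Keep every token that is not in the stop word set.
vwords = [w for w in words if w not in stop_words]

print(vwords)
```

In this sketch, adding the empty string `''` to `stop_words` would also drop the empty tokens. The same should work with the PySpark Transformer, but that is an assumption not verified here.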


## Create Bag of Words Features

Finally we simply count the number of occurrences of each word within the reviews. Again we can use a Transformer from PySpark to perform that task.

In [11]:
countVectorizer = CountVectorizer(inputCol='vwords', outputCol='features', minDF=2.0)
countVectorizerModel = countVectorizer.fit(vwords)
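The idea behind the CountVectorizer can be sketched in plain Python. The sketch assumes that `minDF=2.0` means "keep only terms that occur in at least two documents" and that the vocabulary is ordered by overall term frequency. The three tiny documents are made up.

```python
from collections import Counter

docs = [
    ['great', 'product', 'great', 'price'],
    ['great', 'gate'],
    ['bad', 'product'],
]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))

# Keep only terms that occur in at least min_df documents.
min_df = 2
kept = [t for t in doc_freq if doc_freq[t] >= min_df]

# Order the vocabulary by total term count, ties broken alphabetically.
term_freq = Counter(term for doc in docs for term in doc)
vocabulary = sorted(kept, key=lambda t: (-term_freq[t], t))

def to_counts(doc):
    # One entry per vocabulary term, holding its count in this document.
    counts = Counter(doc)
    return [float(counts[t]) for t in vocabulary]

vectors = [to_counts(d) for d in docs]

print(vocabulary)
print(vectors)
```

Terms that appear in only one document ("price", "gate", "bad") do not make it into the vocabulary and therefore never become features.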

## Inspect Vocabulary

The countVectorizerModel contains an implicit vocabulary with all words. This is useful for mapping feature indices back to words.

In [12]:
print countVectorizerModel.vocabulary[0:50]

[u'', u'my', u'with', u'that', u'have', u'we', u'so', u'great', u't', u'baby', u'not', u'very', u'but', u's', u'they', u'one', u'you', u'she', u'love', u'these', u'he', u'when', u'would', u'can', u'them', u'her', u'use', u'be', u'just', u'at', u'old', u'all', u'product', u'our', u'up', u'had', u'easy', u'loves', u'out', u'like', u'little', u'has', u'bought', u'well', u'son', u'daughter', u'good', u'get', u'will', u'only']


# Tidy up DataFrame

The DataFrame now carries many columns, so let's remove intermediate columns we no longer need and focus on the model.

In [13]:
features = countVectorizerModel.transform(vwords).drop('words')

features.limit(3).toPandas()

Unnamed: 0,name,review,rating,sentiment,vwords,features
0,SoHo Bug's Party Baby Crib Nursery Bedding Set...,We purchased this bedding because we loved the...,5,1.0,"[we, purchased, bedding, because, we, loved, c...","(2.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, ..."
1,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,Fit nicely with a little room to grow VERY S...,5,1.0,"[fit, nicely, with, little, room, grow, , , ve...","(4.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ..."
2,Babysoy Unisex Baby Oh Soy Bundler - Eggplant ...,I bought this for my sister who is doing koala...,5,1.0,"[bought, my, sister, who, doing, koalas, her, ...","(3.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, ..."


# Split Train Data / Test Data

Now let's do the usual split of our data into a training data set and a test data set. We use 80% of all reviews for training and 20% for testing.

In [14]:
train_data, test_data = features.randomSplit([0.8,0.2], seed=0)

print "train_data: %d" % train_data.count()
print "test_data: %d" % test_data.count()

train_data: 23781
test_data: 5984


# Train Classifier

There are many different classification algorithms out there. We will use LogisticRegression; a DecisionTreeClassifier could be another interesting option.

In [15]:
from pyspark.ml.classification import *

logisticRegression = LogisticRegression(featuresCol='features',labelCol='sentiment')
logisticModel = logisticRegression.fit(train_data)

## Inspect Model
The LogisticRegressionModel holds one coefficient per feature, so every coefficient maps to an individual word of the vocabulary. Let's have a look at them.

In [16]:
print logisticModel.coefficients.toArray()[0:20]

[ 0.00425415  0.08691081  0.05572719  0.02105409  0.04835519  0.04449103
  0.0879977   0.29522709 -0.13587291  0.06643145 -0.40355055  0.0250124
 -0.03450168  0.03922628  0.01034398  0.03779361  0.00164203  0.06741657
  0.30557325  0.06181389]


In [18]:
# Convert the coefficients once and count weights by sign
coefficients = logisticModel.coefficients.toArray()
numPositiveWeights = len([x for x in coefficients if x > 0])
numNegativeWeights = len([x for x in coefficients if x < 0])

print "Number positive weights %d" % numPositiveWeights
print "Number negative weights %d" % numNegativeWeights

Number positive weights 6935
Number negative weights 3905


## Find Weights of some Words

Let's check what the coefficients look like for some clearly positive or negative words.

In [20]:
def print_word_weight(word):
    index = countVectorizerModel.vocabulary.index(word)
    weight = logisticModel.coefficients[index]
    print '%s : %f' % (word, weight)
    
print_word_weight('good')
print_word_weight('great')    
print_word_weight('bad')
print_word_weight('ugly')

good : 0.141940
great : 0.295227
bad : -0.339597
ugly : -0.447269
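These weights combine into a score for a whole review: each occurrence of a word adds its weight, and the logistic function turns the sum into a probability. The sketch below uses the four weights printed above. The intercept is not shown in this notebook, so the value `0.0` is an assumption for illustration only, as is the sample review.

```python
import math

# Weights as printed above; the intercept is an assumption (not shown here).
weights = {'good': 0.141940, 'great': 0.295227, 'bad': -0.339597, 'ugly': -0.447269}
intercept = 0.0

def score(words):
    # Linear score: intercept plus the weight of every occurrence of a known word.
    return intercept + sum(weights.get(w, 0.0) for w in words)

def sigmoid(x):
    # Logistic function mapping a raw score to a probability.
    return 1.0 / (1.0 + math.exp(-x))

review = ['great', 'great', 'bad']
raw = score(review)
probability = sigmoid(raw)

print("raw score: %.6f" % raw)
print("P(positive): %.4f" % probability)
```

Two occurrences of "great" outweigh one "bad", so the raw score is positive and the probability of a positive review lands above 0.5.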


## Find Extreme Words

Let us try to find the most positive and the most negative word according to the weights. This can be achieved with NumPy's argmin and argmax functions to find the index, and the vocabulary to map the index back to the actual word.

In [53]:
import numpy as np

worstWordIndex = np.argmin(logisticModel.coefficients.toArray())
worstWord = countVectorizerModel.vocabulary[worstWordIndex]
worstWeight = logisticModel.coefficients[worstWordIndex]
print "Worst word: %s  value %f" % (worstWord, worstWeight)

bestWordIndex = np.argmax(logisticModel.coefficients.toArray())
bestWord = countVectorizerModel.vocabulary[bestWordIndex]
bestWeight = logisticModel.coefficients[bestWordIndex]
print "Best word: %s value %f" % (bestWord, bestWeight)


Worst word: positon  value -3.163343
Best word: wiser value 2.061051
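A single extreme word can be a rare outlier, so it is often more informative to look at several words at each end. A minimal sketch without NumPy, using the four weights printed earlier as a stand-in for the full vocabulary:

```python
# Stand-in for the full vocabulary and coefficient array.
vocabulary = ['good', 'great', 'bad', 'ugly']
coefficients = [0.141940, 0.295227, -0.339597, -0.447269]

# Sort (weight, word) pairs by weight, most negative first.
ranked = sorted(zip(coefficients, vocabulary))

top_k = 2
most_negative = ranked[:top_k]
most_positive = ranked[-top_k:][::-1]

print(most_negative)
print(most_positive)
```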


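It also helps to look at the document frequency of each word, that is, the number of distinct reviews it appears in.

In [38]: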
doc_freq = features.select(explode('vwords').alias('word'),'review') \
     .distinct() \
     .groupBy('word').count()

doc_freq.orderBy(col('count').desc()).limit(10).toPandas()

Unnamed: 0,word,count
0,,27454
1,my,15368
2,with,8634
3,that,7602
4,have,7221
5,great,7166
6,so,6968
7,t,6209
8,very,6190
9,baby,6134


# Making Predictions

The primary idea is of course to make predictions of the sentiment using the learned model.

In [39]:
pred = logisticModel.transform(test_data)

pred.drop('features').limit(10).toPandas()



Unnamed: 0,name,review,rating,sentiment,vwords,rawPrediction,probability,prediction
0,,Awesome product does its job been having it f...,5,1.0,"[awesome, product, does, its, job, , been, hav...","[-4.59277428694, 4.59277428694]","[0.0100232477843, 0.989976752216]",1.0
1,,Fits at least 2 Dr Brown bottles with nipples...,5,1.0,"[fits, at, least, 2, dr, , brown, bottles, wit...","[-4.67755595987, 4.67755595987]","[0.00921599529836, 0.990784004702]",1.0
2,,Good productPro very sturdy and no assembly ...,4,1.0,"[good, productpro, , , very, sturdy, no, assem...","[-2.51569799721, 2.51569799721]","[0.0747649931555, 0.925235006845]",1.0
3,,Reasonable price Very sturdy with option to r...,5,1.0,"[reasonable, price, , very, sturdy, with, opti...","[-3.94537175885, 3.94537175885]","[0.0189769334697, 0.98102306653]",1.0
4,,She loves it She can pull up on it and it is...,5,1.0,"[she, loves, , , she, can, pull, up, heavy, en...","[-1.90070148925, 1.90070148925]","[0.130029100226, 0.869970899774]",1.0
5,,The dinosaur is exactly as shown and arrived m...,5,1.0,"[dinosaur, exactly, shown, arrived, more, than...","[-2.54981182416, 2.54981182416]","[0.0724391281993, 0.927560871801]",1.0
6,,These are fabulous We just moved into a new ...,5,1.0,"[these, fabulous, , , we, just, moved, into, n...","[-3.39417881657, 3.39417881657]","[0.0324778871925, 0.967522112808]",1.0
7,,This is a great gate It fits perfectly aroun...,5,1.0,"[great, gate, , , fits, perfectly, around, my,...","[-4.32541824417, 4.32541824417]","[0.0130553204062, 0.986944679594]",1.0
8,,We bought this to cover our daughter s car sea...,5,1.0,"[we, bought, cover, our, daughter, s, car, sea...","[-3.43093188572, 3.43093188572]","[0.031342627589, 0.968657372411]",1.0
9,,We have been housesitting for a couple who ins...,5,1.0,"[we, have, been, housesitting, couple, who, in...","[-3.18373914439, 3.18373914439]","[0.0397822539864, 0.960217746014]",1.0
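The columns rawPrediction and probability are closely related: applying the logistic function to the raw score of the positive class gives its probability. This can be checked with the values from the first row of the output above.

```python
import math

def sigmoid(x):
    # Logistic function mapping a raw score to a probability.
    return 1.0 / (1.0 + math.exp(-x))

# Values copied from the first row of the prediction output.
raw_positive = 4.59277428694
reported_probability = 0.989976752216

computed = sigmoid(raw_positive)

print("computed: %.9f" % computed)
print("reported: %.9f" % reported_probability)
```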


## Find the most Positive Review

Using the column rawPrediction, we can find the review which has the highest positive prediction.

In [40]:
# Extract one component from a Vector
extract_from_vector = udf(lambda v,i : float(v[i]), FloatType())

positives = pred.orderBy(extract_from_vector(pred.rawPrediction,lit(1)).desc())

positives.limit(6).toPandas()

Unnamed: 0,name,review,rating,sentiment,vwords,features,rawPrediction,probability,prediction
0,Hands Free Bottle Holder Multi-Purpose Bib,After we had twins I had a hard time producing...,5,1.0,"[after, we, had, twins, had, hard, time, produ...","(12.0, 4.0, 1.0, 4.0, 0.0, 3.0, 1.0, 0.0, 0.0,...","[-12.0156696708, 12.0156696708]","[6.04864837798e-06, 0.999993951352]",1.0
1,Maxi-Cosi Priori Convertible Car Seat - Penguin,My daughter is very small for her age She is ...,5,1.0,"[my, daughter, very, small, her, age, , she, 1...","(29.0, 5.0, 4.0, 7.0, 3.0, 14.0, 1.0, 0.0, 5.0...","[-10.679292409, 10.679292409]","[2.30161267337e-05, 0.999976983873]",1.0
2,Evenflo Portable Ultrasaucer,This is a lifesaver My 5 month old loves t...,5,1.0,"[lifesaver, , , , , my, 5, month, old, loves, ...","(27.0, 1.0, 1.0, 2.0, 1.0, 0.0, 1.0, 1.0, 0.0,...","[-9.0994326238, 9.0994326238]","[0.000111716700974, 0.999888283299]",1.0
3,Edushape Edu-Tiles 25 Piece Solid Play Mat wit...,I purchased these foam squares to create the p...,5,1.0,"[purchased, these, foam, squares, create, path...","(17.0, 3.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0,...","[-8.96044348875, 8.96044348875]","[0.000128372820199, 0.99987162718]",1.0
4,Fisher-Price Ocean Wonders Aquarium Bouncer,this is a must have product the soothing vibr...,5,1.0,"[must, have, product, , soothing, vibrator, se...","(6.0, 1.0, 0.0, 2.0, 2.0, 1.0, 3.0, 1.0, 0.0, ...","[-8.80785806875, 8.80785806875]","[0.000149530883059, 0.999850469117]",1.0
5,Golden Asian Silky Baby Crib Nursery Bedding S...,I must say I was apprehensive of buying the be...,5,1.0,"[must, say, apprehensive, buying, bedding, due...","(19.0, 2.0, 2.0, 1.0, 2.0, 0.0, 1.0, 1.0, 0.0,...","[-8.06884580403, 8.06884580403]","[0.000313046473905, 0.999686953526]",1.0


# Evaluation of Prediction

Again we want to assess the performance of the prediction model. This can be done using the built-in class BinaryClassificationEvaluator, which by default reports the area under the ROC curve.

In [41]:
from pyspark.ml.evaluation import *

evaluator = BinaryClassificationEvaluator(labelCol='sentiment')
result = evaluator.evaluate(pred)

print result

0.946327339714
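The area under the ROC curve can be read as the probability that a randomly chosen positive review gets a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores. It is meant to convey the idea and is not how Spark computes the metric internally.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1.0]
    negatives = [s for l, s in zip(labels, scores) if l == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1.0, 1.0, 0.0, 1.0, 0.0]
scores = [0.9, 0.4, 0.5, 0.8, 0.1]

auc = area_under_roc(labels, scores)

print("areaUnderROC = %f" % auc)
```

Five of the six positive/negative pairs are ordered correctly here, so the result is 5/6. The metric depends only on the ranking of the scores, not on any threshold.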


# Custom Evaluator

We also want to use a different metric, namely accuracy. Accuracy is defined as

    number_correct_predictions / total_number_predictions
    
First let us calculate that metric directly.

In [44]:
num_total = pred.count()
num_correct = pred.filter(pred.sentiment == pred.prediction).count()

model_accuracy = float(num_correct) / num_total

print "Model Accuracy: %f" % model_accuracy

Model Accuracy: 0.904412


## Compare with Dummy Predictor

It is always interesting to see how a trivial prediction performs. The trivial predictor simply predicts the most common class for all objects. In this case this would be a positive review.

In [45]:
num_total = pred.count()
num_positive = pred.filter(pred.sentiment == 1.0).count()

baseline_accuracy = float(num_positive) / num_total

print "Baseline Accuracy: %f" % baseline_accuracy

Baseline Accuracy: 0.850267
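The comparison between a model and the trivial predictor can be reproduced on a handful of made-up labels. The sketch below is plain Python; the helper name `accuracy` is chosen for illustration.

```python
from collections import Counter

# Made-up labels and model predictions.
labels      = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
predictions = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

def accuracy(labels, predictions):
    # Share of rows where prediction and label agree.
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return float(correct) / len(labels)

# The dummy predictor always answers with the most common class.
majority_class = Counter(labels).most_common(1)[0][0]
baseline_predictions = [majority_class] * len(labels)

model_accuracy = accuracy(labels, predictions)
baseline_accuracy = accuracy(labels, baseline_predictions)

print("Model Accuracy: %f" % model_accuracy)
print("Baseline Accuracy: %f" % baseline_accuracy)
```

With imbalanced classes the baseline is already high, so a model's accuracy should always be judged relative to it.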


## Custom Evaluator

Now let us create a new Evaluator class that implements accuracy as its metric.

In [46]:
class AccuracyClassificationEvaluator(Evaluator):
    def __init__(self, predictionCol='prediction', labelCol='label'):
        super(AccuracyClassificationEvaluator, self).__init__()
        self.predictionCol = predictionCol
        self.labelCol = labelCol
    
    def _evaluate(self, dataset):
        num_total = dataset.count()
        num_correct = dataset.filter(col(self.labelCol) == col(self.predictionCol)).count()
        accuracy = float(num_correct) / num_total
        return accuracy


In [48]:
evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')

print evaluator.evaluate(pred)

0.904411764706


# Tweak Hyper Parameters

Again we want to improve overall performance by tweaking model parameters. So first let's see which parameters are available for tweaking.

In [43]:
print LogisticRegression().explainParams()

elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
maxIter: max number of iterations (>= 0). (default: 100)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
regParam: regularization parameter (>= 0). (default: 0.1)
standardization: whether to standardize the training features before fitting the model. (default: True)
threshold: Threshold in binary classification predictio

Let us try some different parameters and check the results.

In [49]:
logisticRegression2 = LogisticRegression(featuresCol='features',labelCol='sentiment')
logisticRegression2.setRegParam(0.01).setMaxIter(100)
logisticModel2 = logisticRegression2.fit(train_data)

pred = logisticModel2.transform(test_data)

roc_evaluator = BinaryClassificationEvaluator(labelCol='sentiment', metricName="areaUnderROC")
acc_evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')

print "areaUnderROC = %f" % roc_evaluator.evaluate(pred)
print "Model Accuracy = %f" % acc_evaluator.evaluate(pred)


areaUnderROC = 0.931838
Model Accuracy = 0.912767


## Finding best Hyper Parameters

So accuracy improved (while the area under the ROC curve dropped slightly), but which value would be best? We need to try several.

In [50]:
for reg_param in [0.0, 0.0001, 0.01, 1.0, 100.0]:
    logisticRegression2 = LogisticRegression(featuresCol='features',labelCol='sentiment')
    logisticRegression2.setRegParam(reg_param).setMaxIter(100)
    logisticModel2 = logisticRegression2.fit(train_data)
    
    pred = logisticModel2.transform(test_data)
    
    roc_evaluator = BinaryClassificationEvaluator(labelCol='sentiment', metricName="areaUnderROC")
    acc_evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')

    print "reg_param = %f" % reg_param
    print "    areaUnderROC = %f" % roc_evaluator.evaluate(pred)
    print "    Model Accuracy = %f" % acc_evaluator.evaluate(pred)


reg_param = 0.000000
    areaUnderROC = 0.894418
    Model Accuracy = 0.899398
reg_param = 0.000100
    areaUnderROC = 0.908762
    Model Accuracy = 0.906250
reg_param = 0.010000
    areaUnderROC = 0.931838
    Model Accuracy = 0.912767
reg_param = 1.000000
    areaUnderROC = 0.943480
    Model Accuracy = 0.861631
reg_param = 100.000000
    areaUnderROC = 0.906305
    Model Accuracy = 0.850267
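Picking the best setting from these results depends on the metric. The sketch below collects the numbers printed above in a dictionary and selects the best regularization parameter once per metric.

```python
# Results copied from the output above, keyed by reg_param.
results = {
    0.0:    {'areaUnderROC': 0.894418, 'accuracy': 0.899398},
    0.0001: {'areaUnderROC': 0.908762, 'accuracy': 0.906250},
    0.01:   {'areaUnderROC': 0.931838, 'accuracy': 0.912767},
    1.0:    {'areaUnderROC': 0.943480, 'accuracy': 0.861631},
    100.0:  {'areaUnderROC': 0.906305, 'accuracy': 0.850267},
}

best_by_accuracy = max(results, key=lambda r: results[r]['accuracy'])
best_by_auc = max(results, key=lambda r: results[r]['areaUnderROC'])

print("best by accuracy: %s" % best_by_accuracy)
print("best by areaUnderROC: %s" % best_by_auc)
```

The two metrics disagree: accuracy favours 0.01, while the area under the ROC curve favours 1.0. Also note that with a value of 100.0 the accuracy equals the baseline accuracy of the dummy predictor.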


## ParamGridBuilder & CrossValidator

Since the selection of hyper parameters is a very common job and might be tedious work, there is some nice support in PySpark to simplify it. It is a two-step approach:
1. Use ParamGridBuilder to create a set of parameters to test, possibly for different hyper parameters
2. Use a CrossValidator for selecting the best set of parameters

In [51]:
from pyspark.ml.tuning import *

lr = LogisticRegression(featuresCol='features',labelCol='sentiment')
param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.0001, 0.01, 1.0, 100.0]) \
    .addGrid(lr.maxIter, [10, 100]) \
    .build()
    
for pset in param_grid:
    params = ["%s=%s" % (key.name, str(value)) for (key,value) in pset.items()]
    print ' '.join(params)

regParam=0.0 maxIter=10
regParam=0.0001 maxIter=10
regParam=0.01 maxIter=10
regParam=1.0 maxIter=10
regParam=100.0 maxIter=10
regParam=0.0 maxIter=100
regParam=0.0001 maxIter=100
regParam=0.01 maxIter=100
regParam=1.0 maxIter=100
regParam=100.0 maxIter=100
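The grid is simply the Cartesian product of all candidate values. A plain Python sketch of the same idea, using `itertools.product` and ordinary dictionaries instead of Spark's parameter maps:

```python
from itertools import product

reg_params = [0.0, 0.0001, 0.01, 1.0, 100.0]
max_iters = [10, 100]

# One dictionary per combination of the two hyper parameters.
grid = [{'regParam': r, 'maxIter': m} for m, r in product(max_iters, reg_params)]

for params in grid:
    print("regParam=%s maxIter=%s" % (params['regParam'], params['maxIter']))
```

The number of combinations is the product of the list lengths (here 5 x 2 = 10), so the grid grows quickly with every additional hyper parameter.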


In [52]:
lr = LogisticRegression(featuresCol='features',labelCol='sentiment')
evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3)
model = cv.fit(train_data)

print evaluator.evaluate(model.transform(test_data))

0.904411764706
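Cross validation with `numFolds=3` trains every parameter combination three times, each time holding out a different third of the training data for validation. The sketch below shows that splitting scheme on row indices. It uses a fixed stride to assign rows to folds, which is a simplification, since the assignment is normally random.

```python
def k_fold_indices(num_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(num_rows))
    for fold in range(num_folds):
        # Every num_folds-th row, starting at 'fold', is held out.
        validation = indices[fold::num_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation

folds = list(k_fold_indices(9, 3))

for train, validation in folds:
    print("train=%s validation=%s" % (train, validation))
```

Every row is used for validation exactly once. Because the test data is never touched during this search, the final evaluation on `test_data` remains an honest estimate.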
