# Load Data

First we load the data from HDFS. It is stored as a simple CSV file with three columns:
1. product name
2. review text
3. rating (1 - 5)

In [1]:
from pyspark.sql.functions import *

data = sqlContext.read.text("/user/cloudera/data/amazon_baby")
data = data.select(
        split('value',',')[0].alias('name'),
        split('value',',')[1].alias('review'),
        split('value',',')[2].alias('rating').cast('int')
).filter(col('rating').isNotNull())

data.limit(5).toPandas()

Unnamed: 0,name,review,rating
0,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
1,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
2,Nature's Lullabies Second Year Sticker Calendar,This was the only calender I could find for th...,5
3,Nature's Lullabies Second Year Sticker Calendar,I completed a calendar for my son's first year...,4
4,Nature's Lullabies Second Year Sticker Calendar,We wanted to get something to keep track of ou...,5
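
The `split('value', ',')` calls above treat every comma as a field separator. The following plain-Python sketch mimics that parsing on two made-up lines (they are not taken from the dataset) to show what happens when the review text itself contains a comma. `parse_line` is a hypothetical helper, and its `None` result stands in for the null that `cast('int')` yields for non-numeric text.

```python
def parse_line(line):
    """Mimic split(value, ',')[0..2] followed by cast('int')."""
    fields = line.split(',')
    name = fields[0] if len(fields) > 0 else None
    review = fields[1] if len(fields) > 1 else None
    raw_rating = fields[2] if len(fields) > 2 else None
    try:
        rating = int(raw_rating)
    except (TypeError, ValueError):
        rating = None  # stands in for Spark's null on a failed cast
    return name, review, rating

# Made-up sample lines, for illustration only
simple = "Wipe Pouch,works great,5"
with_comma = "Wipe Pouch,works great, but arrived late,4"

print(parse_line(simple))
print(parse_line(with_comma))
```

Under this parsing, a line like the second one ends up without a rating and would be dropped by the `isNotNull` filter, so the loaded DataFrame may contain fewer rows than the file has lines.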


# Split Train Data / Test Data

Now let's do the usual split of our data into a training data set and a validation data set. We use 80% of all reviews for training and 20% for validation.

In [2]:
train_data, test_data = data.randomSplit([0.8,0.2], seed=1)

print "train_data: %d" % train_data.count()
print "test_data: %d" % test_data.count()

train_data: 25613
test_data: 6497
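
The counts above are only approximately 80% and 20% because each row is assigned to a split at random. The sketch below illustrates that idea in plain Python. It is not Spark's implementation, and `random_split` is a hypothetical helper written for this illustration.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one bucket with probability proportional to weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if draw < cumulative:
                buckets[i].append(row)
                break
        else:
            buckets[-1].append(row)  # guard against floating-point edge cases
    return buckets

rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=1)
print("train: %d, test: %d" % (len(train), len(test)))
```

Using a fixed seed makes the split reproducible, which is why the notebook passes `seed=1` to `randomSplit`.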


# Implement Transformer

We need a custom Transformer to build the pipeline. The transformer should remove all punctuation from a given text column by replacing every punctuation character with a space.

In [4]:
from pyspark.ml import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

def remove_punctuations(text):
    import string
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text


class PunctuationCleanupTransformer(Transformer):
    def __init__(self, inputCol, outputCol):
        super(PunctuationCleanupTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        remove_punctuation_udf = udf(remove_punctuations, StringType())
        return dataset.withColumn(self.outputCol, remove_punctuation_udf(self.inputCol))
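
Before wrapping the function in a UDF, it can be checked on its own in plain Python. The sketch below repeats `remove_punctuations` so that it is self-contained; the two sample strings are made up for illustration.

```python
import string

def remove_punctuations(text):
    # Same logic as above: every punctuation character becomes a space
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text

print(remove_punctuations("my son's first year."))
print(remove_punctuations("2 1/2 yr old"))
```

Because each punctuation character is replaced by a space rather than deleted, the cleaned text can contain runs of consecutive spaces.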

## Test Transformer

Let's create an instance of the Transformer and test it.

In [5]:
cleaner = PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review')
clean_data = cleaner.transform(data)

clean_data.limit(4).toPandas()

Unnamed: 0,name,review,rating,clean_review
0,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love...
1,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
2,Nature's Lullabies Second Year Sticker Calendar,This was the only calender I could find for th...,5,This was the only calender I could find for th...
3,Nature's Lullabies Second Year Sticker Calendar,I completed a calendar for my son's first year...,4,I completed a calendar for my son s first year...


# Create ML Pipeline

Now we have all components for creating an initial ML Pipeline. Remember that we have used the following components before:

* Tokenizer - for splitting reviews into words
* StopWordsRemover - for removing stop words
* CountVectorizer - for creating bag-of-word features from the words
* LogisticRegression - for creating the real model
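
The first three components can be sketched in plain Python to show what each step contributes. This is a simplified illustration and not the Spark implementation: the sample documents and the short stop word list are made up, and the vocabulary is sorted alphabetically here for readability.

```python
from collections import Counter

stop_words = set(['the', 'a', 'and', 'is', 'this', 'i'])

def tokenize(text):
    # Lowercase and split on single spaces; consecutive spaces yield empty tokens
    return text.lower().split(' ')

def remove_stop_words(words):
    return [w for w in words if w not in stop_words]

docs = ["This bag is perfect", "I love this  bag"]
tokens = [remove_stop_words(tokenize(d)) for d in docs]

# Vocabulary: terms that occur in at least min_df documents
min_df = 2
doc_freq = Counter(term for doc in tokens for term in set(doc))
vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)

# Bag-of-words vector: one count per vocabulary term
vectors = [[doc.count(term) for term in vocabulary] for doc in tokens]
print(vocabulary)
print(vectors)
```

Note that the double space in the second document produces an empty token, just like the spaces left behind by the punctuation cleanup.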

You also need to transform the incoming rating (1-5) to a sentiment (0 or 1) and you need to drop reviews with a rating of 3. This can be done using one or more SQLTransformer instances.

In [9]:
from pyspark.ml.feature import *
from pyspark.ml.classification import *

stopWords=['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']

stages = [
    PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review'),
    SQLTransformer(statement='SELECT *,CASE WHEN rating < 3 THEN 0.0 ELSE 1.0 END AS sentiment FROM __THIS__ WHERE rating <> 3'),
    Tokenizer(inputCol='clean_review', outputCol='words'),
    StopWordsRemover(inputCol='words', outputCol='vwords', stopWords=stopWords),
    CountVectorizer(inputCol='vwords', outputCol='features', minDF=2.0),
    LogisticRegression(featuresCol='features',labelCol='sentiment')
]
pipe = Pipeline(stages = stages)
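
The rule encoded in the SQLTransformer statement can be written down in plain Python as a quick sanity check. `to_sentiment` is a hypothetical helper used only for this illustration.

```python
def to_sentiment(rating):
    """Return 0.0 for ratings below 3, 1.0 for ratings above 3, None for 3."""
    if rating == 3:
        return None  # the WHERE clause removes these rows
    return 0.0 if rating < 3 else 1.0

ratings = [1, 2, 3, 4, 5]
labelled = [(r, to_sentiment(r)) for r in ratings if to_sentiment(r) is not None]
print(labelled)
```

Since the filter is part of the pipeline, reviews with a rating of 3 are removed both when fitting and when transforming new data.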

# Fit Pipeline Model
Using the training data, we create a PipelineModel by fitting the Pipeline to the data.

In [10]:
model = pipe.fit(train_data)

# Predict Data

Let us predict the sentiment of the test data using the model.

In [11]:
pred = model.transform(test_data)

pred.limit(10).toPandas()



Unnamed: 0,name,review,rating,clean_review,sentiment,words,vwords,features,rawPrediction,probability,prediction
0,,I don't understand some of the high reviews fo...,1,I don t understand some of the high reviews fo...,0.0,"[i, don, t, understand, some, of, the, high, r...","[don, t, understand, some, high, reviews, item...","(15.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 4.0,...","[5.1398939764, -5.1398939764]","[0.994175809121, 0.00582419087928]",0.0
1,,I'm all for being 'green'; this bag is perfect...,5,I m all for being green this bag is perfect...,1.0,"[i, m, all, for, being, , green, , , this, bag...","[m, all, being, , green, , , bag, perfect, hol...","(6.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-2.38193431757, 2.38193431757]","[0.0845607094526, 0.915439290547]",1.0
2,,My youngest son is now 16 months old but we ha...,5,My youngest son is now 16 months old but we ha...,1.0,"[my, youngest, son, is, now, 16, months, old, ...","[my, youngest, son, now, 16, months, old, but,...","(3.0, 1.0, 0.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, ...","[-3.83209868013, 3.83209868013]","[0.0212047206389, 0.978795279361]",1.0
3,,This blanket goes perfectly in our future litt...,4,This blanket goes perfectly in our future litt...,1.0,"[this, blanket, goes, perfectly, in, our, futu...","[blanket, goes, perfectly, our, future, little...","(2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[-3.45763126954, 3.45763126954]","[0.0305420917109, 0.969457908289]",1.0
4,,This necklace is very light weight and has rea...,5,This necklace is very light weight and has rea...,1.0,"[this, necklace, is, very, light, weight, and,...","[necklace, very, light, weight, has, really, h...","(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-3.39360763737, 3.39360763737]","[0.0324958401926, 0.967504159807]",1.0
5,,does not come with pole that attaches to crib ...,1,does not come with pole that attaches to crib ...,0.0,"[does, not, come, with, pole, that, attaches, ...","[does, not, come, with, pole, that, attaches, ...","(0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","[0.502370008844, -0.502370008844]","[0.623016130218, 0.376983869782]",0.0
6,&quot;A Little Pillow Company&quot; Hypoallerg...,I purchased this pillow for my 2 1/2 yr old on...,5,I purchased this pillow for my 2 1 2 yr old on...,1.0,"[i, purchased, this, pillow, for, my, 2, 1, 2,...","[purchased, pillow, my, 2, 1, 2, yr, old, once...","(7.0, 1.0, 3.0, 1.0, 0.0, 2.0, 1.0, 0.0, 1.0, ...","[-6.72779333296, 6.72779333296]","[0.00119574029854, 0.998804259701]",1.0
7,&quot;A Little Pillow Company&quot; Hypoallerg...,If this pillow were any larger I would worry a...,4,If this pillow were any larger I would worry a...,1.0,"[if, this, pillow, were, any, larger, i, would...","[if, pillow, were, any, larger, would, worry, ...","(3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-3.38684333301, 3.38684333301]","[0.0327091828269, 0.967290817173]",1.0
8,&quot;A Little Pillow Company&quot; Hypoallerg...,This is just perfect size for a little toddler...,5,This is just perfect size for a little toddler...,1.0,"[this, is, just, perfect, size, for, a, little...","[just, perfect, size, little, toddler, , s, so...","(2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, ...","[-3.37929013603, 3.37929013603]","[0.0329490059584, 0.967050994042]",1.0
9,&quot;B&quot; Is for Babies (And Booties!) 201...,Exactly what I was looking for! These are porc...,5,Exactly what I was looking for These are porc...,1.0,"[exactly, what, i, was, looking, for, , these,...","[exactly, what, looking, , these, porcelain, ,...","(4.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[-1.16730279952, 1.16730279952]","[0.237342861917, 0.762657138083]",1.0
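
The `rawPrediction`, `probability` and `prediction` columns are closely related. For a binary logistic regression, the probabilities are the logistic (sigmoid) function applied to the raw values, and the prediction is the positive class whenever its probability exceeds the threshold. The sketch below checks this for the first row of the output above, assuming the default threshold of 0.5.

```python
import math

def sigmoid(margin):
    return 1.0 / (1.0 + math.exp(-margin))

# rawPrediction of the first row above: [5.1398939764, -5.1398939764]
raw_negative, raw_positive = 5.1398939764, -5.1398939764

p_negative = sigmoid(raw_negative)
p_positive = sigmoid(raw_positive)
prediction = 1.0 if p_positive > 0.5 else 0.0

print("probability = [%f, %f], prediction = %.1f" % (p_negative, p_positive, prediction))
```

The two probabilities always sum to one, so a larger absolute raw value simply means a more confident prediction.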


# Model Evaluation
As in the original exercise, we want to use a custom metric for assessing the performance.

In [13]:
from pyspark.ml.evaluation import *

class AccuracyClassificationEvaluator(Evaluator):
    def __init__(self, predictionCol='prediction', labelCol='label'):
        super(AccuracyClassificationEvaluator, self).__init__()
        self.predictionCol = predictionCol
        self.labelCol = labelCol
    
    def _evaluate(self, dataset):
        num_total = dataset.count()
        num_correct = dataset.filter(col(self.labelCol) == col(self.predictionCol)).count()
        accuracy = float(num_correct) / num_total
        return accuracy

## Assess Performance

With the evaluator we can assess the performance of the prediction and easily compare it to a simple model which always predicts 'positive'.

In [14]:
always_positive = pred.withColumn('prediction',lit(1.0))

evaluator = AccuracyClassificationEvaluator(predictionCol='prediction', labelCol='sentiment')

print "Model Accuracy = %f" % evaluator.evaluate(pred)
print "Baseline Accuracy = %f" % evaluator.evaluate(always_positive)

Model Accuracy = 0.910344
Baseline Accuracy = 0.848580
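
The accuracy metric and the always-positive baseline can be illustrated without Spark. The labels and predictions below are made up; `accuracy` is a hypothetical plain-Python counterpart of the evaluator above.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    if not labels:
        raise ValueError("cannot compute accuracy of an empty dataset")
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return float(correct) / len(labels)

# Made-up labels and predictions, for illustration only
labels      = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
predictions = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0]
always_positive = [1.0] * len(labels)

print("model:    %f" % accuracy(labels, predictions))
print("baseline: %f" % accuracy(labels, always_positive))
```

The baseline accuracy is simply the share of positive labels, so the value of 0.848580 above tells us that about 85% of the evaluated test reviews are positive.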


# Hyper Parameter Tuning

Again we want to tune some hyperparameters, but this time inside a pipeline. The methodology is the same as before: we can include the CrossValidator directly in the pipeline. But let's go step by step.

First let us have a look at all parameters of a LogisticRegression.

In [15]:
print LogisticRegression().explainParams()

elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
maxIter: max number of iterations (>= 0). (default: 100)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
regParam: regularization parameter (>= 0). (default: 0.1)
standardization: whether to standardize the training features before fitting the model. (default: True)
threshold: Threshold in binary classification predictio

## Create ParamGrid

Now we create a param grid that defines the parameter combinations to try. We want to tweak two parameters again:

* regParam should take values in [0.0, 0.0001, 0.01, 1.0, 100.0]
* maxIter should take values in [10, 100]

In order to create this grid, we first need to create an instance of a LogisticRegression, so we can access its parameters.

In [17]:
from pyspark.ml.tuning import *

lr = LogisticRegression(featuresCol='features',labelCol='sentiment')

param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.0001, 0.01, 1.0, 100.0]) \
    .addGrid(lr.maxIter, [10, 100]) \
    .build()
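
The grid is the Cartesian product of the value lists, so five values for `regParam` and two for `maxIter` give ten combinations. The sketch below reproduces this in plain Python; using dictionaries with string keys is a simplification for illustration.

```python
from itertools import product

reg_params = [0.0, 0.0001, 0.01, 1.0, 100.0]
max_iters = [10, 100]

# One dictionary per combination, analogous to one param map in the grid
grid = [{'regParam': r, 'maxIter': m} for r, m in product(reg_params, max_iters)]

print(len(grid))
print(grid[0])
```

Every additional parameter multiplies the number of combinations, so the grid should be kept small when each fit is expensive.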

## Create Pipeline

Now we can create a pipeline that uses a CrossValidator instead of using a LogisticRegression directly. The configuration of the Pipeline matches the old one, except that the CrossValidator takes the place of the LogisticRegression and acts as a wrapper around the regression algorithm.

In [19]:
evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')

stopWords=['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']

stages = [
    PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review'),
    SQLTransformer(statement='SELECT *,CASE WHEN rating < 3 THEN 0.0 ELSE 1.0 END AS sentiment FROM __THIS__ WHERE rating <> 3'),
    Tokenizer(inputCol='clean_review', outputCol='words'),
    StopWordsRemover(inputCol='words', outputCol='vwords', stopWords=stopWords),
    CountVectorizer(inputCol='vwords', outputCol='features', minDF=2.0),
    CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3)
]

pipe = Pipeline(stages = stages)

# Fit the pipeline to the training data
model = pipe.fit(train_data)
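
What happens inside the CrossValidator stage can be sketched in plain Python: the rows are partitioned into folds, every parameter combination is scored on each held-out fold, and the combination with the best average score wins. This is a simplified illustration, not Spark's implementation; `k_fold_indices`, `cross_validate` and `toy_fit_and_score` are hypothetical names, and the scoring function is a made-up stand-in for fitting and evaluating a real model.

```python
import random

def k_fold_indices(n_rows, num_folds, seed):
    """Randomly partition row indices into num_folds disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::num_folds] for i in range(num_folds)]

def cross_validate(n_rows, param_maps, fit_and_score, num_folds=3, seed=1):
    """Return the param map with the highest average validation score."""
    folds = k_fold_indices(n_rows, num_folds, seed)
    averages = []
    for params in param_maps:
        scores = []
        for held_out in folds:
            train = [i for fold in folds if fold is not held_out for i in fold]
            scores.append(fit_and_score(params, train, held_out))
        averages.append(sum(scores) / float(num_folds))
    best = max(range(len(param_maps)), key=lambda i: averages[i])
    return param_maps[best], averages

# Hypothetical stand-in for "fit on train rows, evaluate on held-out rows"
def toy_fit_and_score(params, train_rows, validation_rows):
    return 1.0 / (1.0 + params['regParam'])

param_maps = [{'regParam': r} for r in [0.0, 0.01, 1.0]]
best, averages = cross_validate(30, param_maps, toy_fit_and_score)
print(best)
```

With the ten parameter combinations and `numFolds=3` used above, this amounts to 30 model fits, followed by one final fit of the best combination on the complete training data.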

In [21]:
# Predict sentiment for test data
pred = model.transform(test_data)

print "Model Accuracy = %f" % evaluator.evaluate(pred)
print "Baseline Accuracy = %f" % evaluator.evaluate(always_positive)

Model Accuracy = 0.921302
Baseline Accuracy = 0.848580
