# Load Data

First we load the data from S3. It is stored as a simple CSV file with three columns:
1. product name
2. review text
3. rating (1 - 5)

In [1]:
from pyspark.sql.functions import *

data = spark.read.text("s3://dimajix-training/data/amazon_baby")
data = data.select(
        split('value',',')[0].alias('name'),
        split('value',',')[1].alias('review'),
        split('value',',')[2].alias('rating').cast('int')
).filter(col('rating').isNotNull())

data.limit(5).toPandas()

Unnamed: 0,name,review,rating
0,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
1,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
2,Nature's Lullabies Second Year Sticker Calendar,This was the only calender I could find for th...,5
3,Nature's Lullabies Second Year Sticker Calendar,I completed a calendar for my son's first year...,4
4,Nature's Lullabies Second Year Sticker Calendar,We wanted to get something to keep track of ou...,5
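
The cell above parses each line by splitting on every comma. The following plain-Python sketch shows what that implies for a review that itself contains a comma. The sample line is hypothetical, and whether the real file quotes its fields is an assumption. `to_int` mimics the way a failed integer cast yields a null, which is why the `isNotNull()` filter silently drops such rows.

```python
import csv
import io

# Hypothetical sample line, not taken from the data set.
line = 'Baby Wipes,"Soft, gentle and cheap",5'

def to_int(value):
    """Return the integer value, or None if the text is not a number."""
    try:
        return int(value)
    except ValueError:
        return None

# Naive split: the comma inside the review is treated as a column separator,
# so the third field is a piece of review text rather than the rating.
naive = line.split(',')
naive_rating = to_int(naive[2])

# A CSV parser respects the quotes and keeps the review in one piece.
parsed = next(csv.reader(io.StringIO(line)))
parsed_rating = to_int(parsed[2])

print(naive, naive_rating)
print(parsed, parsed_rating)
```

If the source file does quote its fields, reading it with a CSV-aware reader instead of a plain text split would probably keep more rows.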


# Split Train Data / Test Data

Now let's do the usual split of our data into a training data set and a test data set. We use 80% of all reviews for training and 20% for testing.

In [3]:
train_data, test_data = data.randomSplit([0.8,0.2], seed=1)

print("train_data: %d" % train_data.count())
print("test_data: %d" % test_data.count())

train_data: 25652
test_data: 6458
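
The counts above give 25652 / (25652 + 6458) ≈ 0.799, so the split is close to 80/20 but not exact. A minimal sketch of why, assuming each row is assigned to a split by an independent random draw (`random_split` is a hypothetical plain-Python helper, not the Spark implementation):

```python
import random

def random_split(rows, weights, seed):
    """Assign every row independently to one split, with probability
    proportional to the weight of that split."""
    rng = random.Random(seed)
    total = float(sum(weights))
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            # Guard against floating point rounding at the upper edge.
            splits[-1].append(row)
    return splits

train_rows, test_rows = random_split(range(32110), [0.8, 0.2], seed=1)
print(len(train_rows), len(test_rows))
```

Because the assignment is random, the sizes vary slightly with the seed, while the two parts always add up to the full data set.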


# Implement Transformer

We need a custom Transformer to build the pipeline. The transformer should remove all punctuation from a given text column.

In [4]:
from pyspark.ml import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

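def remove_punctuations(text):
    import string
    # Missing reviews arrive as None; pass them through unchanged
    if text is None:
        return None
    # Replace every punctuation character with a blank
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text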


class PunctuationCleanupTransformer(Transformer):
    def __init__(self, inputCol, outputCol):
        super(PunctuationCleanupTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        remove_punctuation_udf = udf(remove_punctuations, StringType())
        return dataset.withColumn(self.outputCol, remove_punctuation_udf(self.inputCol))

## Test Transformer

Let's create an instance of the Transformer and test it.

In [5]:
cleaner = PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review')
clean_data = cleaner.transform(data)

clean_data.limit(4).toPandas()

Unnamed: 0,name,review,rating,clean_review
0,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love...
1,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
2,Nature's Lullabies Second Year Sticker Calendar,This was the only calender I could find for th...,5,This was the only calender I could find for th...
3,Nature's Lullabies Second Year Sticker Calendar,I completed a calendar for my son's first year...,4,I completed a calendar for my son s first year...
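
The cleanup logic can also be checked without Spark. The sketch below is an alternative formulation using `str.translate`, which replaces all punctuation characters in a single pass. The name `remove_punctuations_fast` is hypothetical, and the function is meant to behave like `remove_punctuations`, not to replace it.

```python
import string

# Translation table that maps every punctuation character to a blank.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, ' ' * len(string.punctuation)
)

def remove_punctuations_fast(text):
    """Replace each punctuation character in text with a blank."""
    if text is None:
        return None
    return text.translate(PUNCTUATION_TO_SPACE)

print(remove_punctuations_fast("I completed a calendar for my son's first year."))
```

As in the output above, the apostrophe in "son's" turns into a blank, so the word is later tokenized as "son" and "s".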


# Create ML Pipeline

Now we have all components for creating an initial ML Pipeline. Remember that we have used the following components before:

* Tokenizer - for splitting reviews into words
* StopWordsRemover - for removing stop words
* NGram - for creating n-grams (we'll use two words per n-gram)
* CountVectorizer - for creating bag-of-words features from the n-grams
* IDF - for creating TF-IDF features from the n-gram counts
* LogisticRegression - for creating the actual model

You also need to transform the incoming rating (1-5) to a sentiment (0 or 1), and you need to drop reviews with a rating of 3. This can be done using one or more SQLTransformer instances.
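
The labelling rule described above can be written down in plain Python as a quick sanity check. `rating_to_sentiment` is a hypothetical helper used only for illustration.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 rating to a sentiment label.

    Ratings 1 and 2 count as negative (0.0), ratings 4 and 5 as
    positive (1.0). A rating of 3 is neutral and yields None, which
    stands for "drop this review".
    """
    if rating == 3:
        return None
    return 0.0 if rating < 3 else 1.0

ratings = [1, 2, 3, 4, 5]
labelled = [
    (rating, rating_to_sentiment(rating))
    for rating in ratings
    if rating_to_sentiment(rating) is not None
]
print(labelled)
```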

In [7]:
from pyspark.ml.feature import *
from pyspark.ml.classification import *

stopWords=['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']

stages = [
    PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review'),
    SQLTransformer(statement='SELECT *,CASE WHEN rating < 3 THEN 0.0 ELSE 1.0 END AS sentiment FROM __THIS__ WHERE rating <> 3'),
    Tokenizer(inputCol='clean_review', outputCol='words'),
    StopWordsRemover(inputCol='words', outputCol='vwords', stopWords=stopWords),
    NGram(inputCol='vwords', outputCol='ngrams', n=2),
    CountVectorizer(inputCol='ngrams', outputCol='tf', minDF=5.0),
    IDF(inputCol='tf', outputCol='features'),
    LogisticRegression(featuresCol='features',labelCol='sentiment')
]
pipe = Pipeline(stages = stages)
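
To see what the three text stages do to a single review, here is a rough plain-Python approximation. It assumes that tokenizing means lowercasing and splitting on single whitespace characters, so two consecutive blanks (left behind by the punctuation cleanup) produce an empty token. The helper names are hypothetical.

```python
import re

STOP_WORDS = ['the', 'a', 'and', 'or', 'it', 'this', 'of', 'an', 'as',
              'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']

def tokenize(text):
    """Lowercase the text and split it on single whitespace characters."""
    return re.split(r"\s", text.lower())

def remove_stop_words(words, stop_words):
    """Drop every word that appears in the stop word list."""
    stop = set(stop_words)
    return [word for word in words if word not in stop]

def bigrams(words):
    """Join each pair of neighbouring words into one two-word n-gram."""
    return [' '.join(pair) for pair in zip(words, words[1:])]

# Cleaned review fragment with a double blank where a full stop used to be.
words = tokenize("I just received these  I had them")
kept = remove_stop_words(words, STOP_WORDS)
print(words)
print(kept)
print(bigrams(kept))
```

The empty token survives stop word removal and ends up inside n-grams such as `'these '`. Collapsing repeated whitespace in the cleanup step would avoid this.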

# Fit Pipeline Model
Using the training data, we create a PipelineModel by fitting the Pipeline.

In [8]:
model = pipe.fit(train_data)

# Predict Data

Let us make some predictions on the test data using the model.

In [9]:
pred = model.transform(test_data)

pred.limit(10).toPandas()

Unnamed: 0,name,review,rating,clean_review,sentiment,words,vwords,ngrams,tf,features,rawPrediction,probability,prediction
0,,Been using for a number of months now and thes...,5,Been using for a number of months now and thes...,1.0,"[been, using, for, a, number, of, months, now,...","[been, using, number, months, now, these, have...","[been using, using number, number months, mont...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-56.0856711115, 56.0856711115]","[4.38836277488e-25, 1.0]",1.0
1,,I bought this and the extensions so I could fo...,5,I bought this and the extensions so I could fo...,1.0,"[i, bought, this, and, the, extensions, so, i,...","[bought, extensions, so, could, form, into, re...","[bought extensions, extensions so, so could, c...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-96.3992629011, 96.3992629011]","[1.36248604418e-42, 1.0]",1.0
2,,I just received these. I had them on my regist...,5,I just received these I had them on my regist...,1.0,"[i, just, received, these, , i, had, them, on,...","[just, received, these, , had, them, my, regis...","[just received, received these, these , had, ...","(0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 5.11631489614, 0.0, 0.0, 0.0, ...","[-228.504441611, 228.504441611]","[5.77805827959e-100, 1.0]",1.0
3,,I too ordered this based on the picture which ...,4,I too ordered this based on the picture which ...,1.0,"[i, too, ordered, this, based, on, the, pictur...","[too, ordered, based, picture, which, showed, ...","[too ordered, ordered based, based picture, pi...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.721261567719, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[2.83439161447, -2.83439161447]","[0.944506234909, 0.0554937650906]",0.0
4,,It was better than I thought. Extremely soft a...,5,It was better than I thought Extremely soft a...,1.0,"[it, was, better, than, i, thought, , extremel...","[better, than, thought, , extremely, soft, has...","[better than, than thought, thought , extreme...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.721261567719, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[-96.795510606, 96.795510606]","[9.16735122882e-43, 1.0]",1.0
5,,Love making lil bites for toddlers that are pi...,5,Love making lil bites for toddlers that are pi...,1.0,"[love, making, lil, bites, for, toddlers, that...","[love, making, lil, bites, toddlers, that, pic...","[love making, making lil, lil bites, bites tod...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-22.4782129651, 22.4782129651]","[1.72916384433e-10, 0.999999999827]",1.0
6,,Love this hat - good not great quality - cotto...,5,Love this hat good not great quality cotto...,1.0,"[love, this, hat, , , good, not, great, qualit...","[love, hat, , , good, not, great, quality, , ,...","[love hat, hat , , good, good not, not great...","(5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(3.6063078386, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...","[-151.141721745, 151.141721745]","[2.29078463118e-66, 1.0]",1.0
7,,My 3yr old loves helping me cut his sandwiches...,5,My 3yr old loves helping me cut his sandwiches...,1.0,"[my, 3yr, old, loves, helping, me, cut, his, s...","[my, 3yr, old, loves, helping, me, cut, his, s...","[my 3yr, 3yr old, old loves, loves helping, he...","(2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.44252313544, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-48.8340691591, 48.8340691591]","[6.18918173677e-22, 1.0]",1.0
8,,My kids sometimes complain about eating sandwi...,5,My kids sometimes complain about eating sandwi...,1.0,"[my, kids, sometimes, complain, about, eating,...","[my, kids, sometimes, complain, about, eating,...","[my kids, kids sometimes, sometimes complain, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-52.2447079852, 52.2447079852]","[2.04367396985e-23, 1.0]",1.0
9,,Soft snuggly practical item. Great service! Ba...,5,Soft snuggly practical item Great service Ba...,1.0,"[soft, snuggly, practical, item, , great, serv...","[soft, snuggly, practical, item, , great, serv...","[soft snuggly, snuggly practical, practical it...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-104.987221544, 104.987221544]","[2.53880306101e-46, 1.0]",1.0
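
In the output above, a count of 1.0 in the `tf` column becomes 0.721261567719 in `features`, 2.0 becomes 1.44252313544 and 5.0 becomes 3.6063078386: every count is multiplied by one weight per term. The sketch below reproduces that scaling in plain Python, assuming the smoothed inverse document frequency log((m + 1) / (df + 1)) given in the Spark documentation, where m is the number of documents and df the number of documents containing the term. The toy documents and helper names are hypothetical.

```python
import math
from collections import Counter

def fit_idf(documents):
    """Compute one smoothed IDF weight per term from a list of token lists."""
    num_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        doc_freq.update(set(doc))
    return {
        term: math.log((num_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
    }

def tf_idf(doc, idf):
    """Scale the raw term counts of one document by the IDF weights."""
    counts = Counter(doc)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

docs = [
    ['soft blanket', 'blanket love'],
    ['soft blanket', 'soft blanket', 'very soft'],
    ['not good'],
]
idf = fit_idf(docs)
print(tf_idf(docs[1], idf))
```

Terms that occur in many documents get a weight close to zero, while rare terms are weighted up.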


# Model Evaluation
As in the original exercise, we want to use a custom metric for assessing the performance.

In [10]:
from pyspark.ml.evaluation import *

class AccuracyClassificationEvaluator(Evaluator):
    def __init__(self, predictionCol='prediction', labelCol='label'):
        super(AccuracyClassificationEvaluator, self).__init__()
        self.predictionCol = predictionCol
        self.labelCol = labelCol
    
    def _evaluate(self, dataset):
        num_total = dataset.count()
        num_correct = dataset.filter(col(self.labelCol) == col(self.predictionCol)).count()
        accuracy = float(num_correct) / num_total
        return accuracy

## Assess Performance

With the evaluator we can assess the performance of the prediction and easily compare it to a simple model which always predicts 'positive'.

In [None]:
print("Num positive reviews: %d" % pred.filter(pred.sentiment > 0.5).count())
print("Num negative reviews: %d" % pred.filter(pred.sentiment < 0.5).count())

In [11]:
always_positive = pred.withColumn('prediction',lit(1.0))

evaluator = AccuracyClassificationEvaluator(predictionCol='prediction', labelCol='sentiment')

print("Model Accuracy = %f" % evaluator.evaluate(pred))
print("Baseline Accuracy = %f" % evaluator.evaluate(always_positive))

Model Accuracy = 0.923770
Baseline Accuracy = 0.845705
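
The accuracy of the "always positive" baseline is simply the share of positive reviews in the evaluated data, so the value above means that roughly 84.6% of the test reviews are positive. A plain-Python sketch with hypothetical labels illustrates this:

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)

# Hypothetical labels: 17 positive and 3 negative reviews.
labels = [1.0] * 17 + [0.0] * 3

# Baseline that predicts "positive" for every review.
baseline = accuracy(labels, [1.0] * len(labels))
positive_share = labels.count(1.0) / len(labels)
print(baseline, positive_share)
```

On imbalanced data like this, a model has to be compared against the baseline rather than against 50%.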


# Hyper Parameter Tuning

Again we want to tune some hyperparameters, but this time inside a pipeline. The methodology is the same as before, and we can include the CrossValidator directly in the pipeline. But let's go step by step.

First let us have a look at all parameters of a LogisticRegression.

In [12]:
print(LogisticRegression().explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
maxIter: max number of iterations (>= 0). (default: 100)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (defa

## Create ParamGrid

Now we create a param grid that defines the parameter combinations to try. We want to tweak two parameters again:

* regParam should take values in [0.0, 0.0001, 0.01, 1.0, 100.0]
* maxIter should take values in [10, 100]

In order to create this grid, we first need to create an instance of a LogisticRegression, so we can access its parameters.

In [13]:
from pyspark.ml.tuning import *

lr = LogisticRegression(featuresCol='features',labelCol='sentiment')

param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.0001, 0.01, 1.0, 100.0]) \
    .addGrid(lr.maxIter, [10, 100]) \
    .build()
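
The grid is the Cartesian product of the value lists, so five values for regParam and two for maxIter give ten parameter combinations. A plain-Python sketch of that expansion (the dictionary keys only mirror the parameter names and are not Spark objects):

```python
from itertools import product

reg_params = [0.0, 0.0001, 0.01, 1.0, 100.0]
max_iters = [10, 100]

# One dictionary per combination of the two parameters.
param_maps = [
    {'regParam': reg_param, 'maxIter': max_iter}
    for reg_param, max_iter in product(reg_params, max_iters)
]

print(len(param_maps))
print(param_maps[0], param_maps[-1])
```

Each additional parameter multiplies the number of combinations, and with it the training time.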

## Create Pipeline

Now we can create a pipeline using a CrossValidator instead of directly using a LogisticRegression. This means the configuration of the Pipeline should match the old one except that a CrossValidator is inserted instead of the LogisticRegression. The CrossValidator works as a wrapper of the regression algorithm.
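
As a reminder of what the wrapper does, here is a minimal plain-Python sketch of k-fold cross validation. It is an illustration under simplifying assumptions: folds are built by striding over the row indices rather than by random assignment, and `fit_and_score` is a hypothetical callback that trains on one index set and returns a score for the other.

```python
def k_fold_indices(num_rows, num_folds):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = [list(range(start, num_rows, num_folds)) for start in range(num_folds)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def cross_validate(num_rows, param_maps, num_folds, fit_and_score):
    """Return the parameter map with the best mean score over all folds."""
    best_params, best_score = None, float('-inf')
    for params in param_maps:
        scores = [
            fit_and_score(params, training, validation)
            for training, validation in k_fold_indices(num_rows, num_folds)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

calls = []

def toy_fit_and_score(params, training, validation):
    """Stand-in for model training: the score peaks at regParam = 0.01."""
    calls.append(params)
    return -abs(params['regParam'] - 0.01)

toy_grid = [{'regParam': value} for value in [0.0, 0.01, 1.0]]
best_params, best_score = cross_validate(9, toy_grid, 3, toy_fit_and_score)
print(best_params, len(calls))
```

Every parameter map is trained once per fold, so the grid of ten combinations with numFolds=3 results in 30 model fits, plus one final fit of the best combination on the complete training data.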

In [19]:
evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')

stopWords=['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']

stages = [
    PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review'),
    SQLTransformer(statement='SELECT *,CASE WHEN rating < 3 THEN 0.0 ELSE 1.0 END AS sentiment FROM __THIS__ WHERE rating <> 3'),
    Tokenizer(inputCol='clean_review', outputCol='words'),
    StopWordsRemover(inputCol='words', outputCol='vwords', stopWords=stopWords),
    NGram(inputCol='vwords', outputCol='ngrams', n=2),
    CountVectorizer(inputCol='ngrams', outputCol='tf', minDF=5.0),
    IDF(inputCol='tf', outputCol='features'),
    CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3)
]

pipe = Pipeline(stages = stages)

# Fit the pipeline to the training data
model = pipe.fit(train_data)

In [21]:
# Predict sentiment for test data
pred = model.transform(test_data)

print("Model Accuracy = %f" % evaluator.evaluate(pred))
print("Baseline Accuracy = %f" % evaluator.evaluate(always_positive))

Model Accuracy = 0.921302
Baseline Accuracy = 0.848580
