# Load Data

First we load the data from S3. It is stored as a simple CSV file with three columns:
1. product name
2. review text
3. rating (1 - 5)

In [1]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

schema =  StructType([
    StructField('name',StringType(),True),
    StructField('review',StringType(), True),
    StructField('rating',StringType(), True),
])

raw_data = spark.read.schema(schema).csv("s3://dimajix-training/data/amazon_baby")

| | name | review | rating |
|---|---|---|---|
| 0 | name | review | rating |
| 1 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 2 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 3 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |


## Clean and Cache Data

We need to convert the "rating" column to an integer, but the conversion cannot succeed for the first record, because that record contains the CSV header. So we need to perform some cleanup after the conversion: records whose rating or review is null are dropped.

To help distribute the workload, we also repartition the DataFrame and cache it.

In [2]:
data = raw_data.withColumn('rating',col('rating').cast(IntegerType())) \
    .filter(col('rating').isNotNull()) \
    .filter(col('review').isNotNull()) \
    .repartition(31) \
    .cache()

| | name | review | rating |
|---|---|---|---|
| 0 | Lamaze Peekaboo, I Love You | One of baby's first and favorite books, and it... | 4 |
| 1 | Our Baby Girl Memory Book | Really happy with this purchase. I was looking... | 5 |
| 2 | Semanario (7) Little Girls 14k Gold Overlay Ba... | . I am pleased with product. I love the bangle... | 4 |
| 3 | Neurosmith - Music Blocks with Mozart Music Ca... | It takes a youthful spirit of inquiry and fasc... | 5 |
| 4 | Fisher Price Nesting Action Vehicles | This is a great toy. The wheels really work a... | 5 |
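
The cleanup can be pictured in plain Python. This is only a sketch of the logic, not Spark code. It assumes that a failed conversion yields a missing value, which is how a Spark cast behaves when ANSI mode is disabled. The helper `to_int_or_none` and the third sample row are hypothetical.

```python
def to_int_or_none(value):
    """Convert a string to int, returning None when that is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

rows = [
    ("name", "review", "rating"),                          # CSV header read as data
    ("Planetwise Wipe Pouch", "it came early ...", "5"),   # regular record
    ("Hypothetical product", None, "4"),                   # made-up record without review
]

# Step 1: convert the rating column (the header row ends up with None)
converted = [(name, review, to_int_or_none(rating)) for name, review, rating in rows]

# Step 2: keep only records that have both a rating and a review
cleaned = [row for row in converted if row[2] is not None and row[1] is not None]

print(cleaned)
```

If ANSI mode is enabled in your Spark installation, an invalid cast may raise an error instead, so the filter-after-cast approach would need to be adapted.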


# Split Train Data / Test Data

Now let's do the usual split of our data into a training data set and a test data set. We use 80% of all reviews for training and keep 20% for testing.

In [3]:
train_data, test_data = data.randomSplit([0.8,0.2], seed=1)

print("train_data: %d" % train_data.count())
print("test_data: %d" % test_data.count())

train_data: 139461
test_data: 34861
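
The counts are close to an 80/20 split but not exact, because `randomSplit` assigns rows at random instead of cutting the data at a fixed position. The following plain-Python sketch illustrates that idea. The function `random_split` is a hypothetical stand-in and does not reproduce Spark's actual sampler or its per-partition behavior.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. roughly [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        # First split whose upper bound exceeds u; fall back to the last split
        index = next((i for i, b in enumerate(bounds) if u < b), len(weights) - 1)
        splits[index].append(row)
    return splits

train, test = random_split(list(range(10000)), [0.8, 0.2], seed=1)
print(len(train), len(test))
```

Fixing the seed makes the sketch repeatable, which is also the purpose of `seed=1` in the notebook cell.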


# Implement Transformer for Removing Punctuation

We need a custom Transformer to build the pipeline. The transformer should remove all punctuation characters from a given text column and replace each of them with a space.

In [4]:
from pyspark.ml import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

def remove_punctuations(text):
    import string
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text


class PunctuationCleanupTransformer(Transformer):
    def __init__(self, inputCol, outputCol):
        """
        Constructor of PunctuationCleanupTransformer which takes two arguments:
        inputCol - name of input column
        outputCol - name of output column
        """
        super(PunctuationCleanupTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        """
        Protected _transform method which will be called by the public transform
        method. You should not call this method directly.
        """
        remove_punctuation_udf = udf(remove_punctuations, StringType())
        return dataset.withColumn(self.outputCol, remove_punctuation_udf(self.inputCol))
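
As a side note, the character-by-character loop in `remove_punctuations` could also be written with a translation table. The sketch below is an alternative, not the notebook's implementation. The name `remove_punctuations_fast` and the `None` check are additions made here for illustration.

```python
import string

# Map every ASCII punctuation character to a single space
_PUNCT_TO_SPACE = str.maketrans({c: ' ' for c in string.punctuation})

def remove_punctuations_fast(text):
    """Replace each punctuation character in text with a space."""
    if text is None:
        # Pass missing values through instead of failing
        return None
    return text.translate(_PUNCT_TO_SPACE)

print(repr(remove_punctuations_fast("Great toy, love it!")))
```

Each punctuation character becomes one space, so the tokenizer that follows in the pipeline can still split on whitespace.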

# Implement Transformer for Stemming

We also need to stem words. For this we use the Python NLTK library, which has to be installed on all worker nodes because the stemmer runs inside a Python UDF.

In [5]:
from nltk.stem import PorterStemmer

def stem_word(words):
    ps = PorterStemmer()
    return [ps.stem(word) for word in words]


class PorterStemmerTransformer(Transformer):
    def __init__(self, inputCol, outputCol):
        """
        Constructor of PorterStemmerTransformer which takes two arguments:
        inputCol - name of input column
        outputCol - name of output column
        """
        super(PorterStemmerTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        """
        Protected _transform method which will be called by the public transform
        method. You should not call this method directly.
        """
        stem_word_udf = udf(stem_word, ArrayType(StringType()))
        return dataset.withColumn(self.outputCol, stem_word_udf(self.inputCol))

# Create ML Pipeline

Now we have all components for creating an initial ML Pipeline. Remember that we have used the following components before:

* PunctuationCleanupTransformer - for removing punctuation from reviews
* Tokenizer - for splitting reviews into words
* StopWordsRemover - for removing stop words
* PorterStemmerTransformer - for stemming words
* NGram - for creating n-grams (the initial pipeline uses three words per n-gram)
* CountVectorizer - for creating bag-of-words features from the n-grams
* IDF - for creating TF-IDF features from the n-gram counts
* LogisticRegression - for creating the actual model
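
To make the feature stages more concrete, here is a plain-Python sketch of what NGram, CountVectorizer and IDF compute together. It is an illustration under assumptions, not Spark's implementation: the functions `ngrams` and `tf_idf` and the sample documents are made up, the result is a dictionary per document instead of a sparse vector, and the IDF weight uses the formula log((m + 1) / (df + 1)) from the Spark documentation, where m is the number of documents and df the number of documents containing the term.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Join every run of n consecutive tokens with a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs, n=2, min_df=1):
    """Return one {ngram: weight} dictionary per document."""
    grams = [ngrams(doc, n) for doc in docs]
    # Document frequency: in how many documents does each n-gram occur?
    df = Counter(g for doc in grams for g in set(doc))
    # Vocabulary: only n-grams that occur in at least min_df documents
    m = len(docs)
    idf = {g: math.log((m + 1) / (count + 1)) for g, count in df.items() if count >= min_df}
    # Term frequency times inverse document frequency
    return [{g: tf * idf[g] for g, tf in Counter(doc).items() if g in idf} for doc in grams]

docs = [
    ['great', 'toy'],
    ['great', 'toy', 'break'],
    ['bad', 'toy'],
]
weights = tf_idf(docs, n=2, min_df=2)
print(weights)
```

With `min_df=2` only the bigram that occurs in two documents survives, which shows why minDF and the n-gram size interact: longer n-grams are rarer, so a high minDF removes more of them.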

You also need to transform the incoming rating (1-5) to a sentiment (0 or 1), and you need to drop reviews with a rating of 3. This can be done using one or more SQLTransformer instances. Inside a SQLTransformer instance you simply write SQL code and access the current DataFrame via `__THIS__`.
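
The intended mapping can be written down in a few lines of plain Python. This is only a sketch of the rule, with a hypothetical helper name `rating_to_sentiment`; in the pipeline itself the rule is expressed in SQL.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 rating to 0.0 (negative) or 1.0 (positive); None means 'drop'."""
    if rating == 3:
        # Neutral reviews are removed from the data
        return None
    return 0.0 if rating < 3 else 1.0

for rating in range(1, 6):
    print(rating, rating_to_sentiment(rating))
```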

In [6]:
from pyspark.ml.feature import *
from pyspark.ml.classification import *

# Hand-picked stop words; this list is replaced right away by Spark's default English list
stopWords = ['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']
stopWords = StopWordsRemover.loadDefaultStopWords("english")

stages = [
    PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review'),
    SQLTransformer(statement='SELECT *,CASE WHEN rating < 3 THEN 0.0 ELSE 1.0 END AS sentiment FROM __THIS__ WHERE rating <> 3'),
    Tokenizer(inputCol='clean_review', outputCol='words'),
    StopWordsRemover(inputCol='words', outputCol='vwords', stopWords=stopWords),
    PorterStemmerTransformer(inputCol='vwords', outputCol='stems'),
    NGram(inputCol='stems', outputCol='ngrams', n=3),
    CountVectorizer(inputCol='ngrams', outputCol='tf', minDF=2.0),
    IDF(inputCol='tf', outputCol='features'),
    LogisticRegression(featuresCol='features',labelCol='sentiment')
]
pipe = Pipeline(stages = stages)

# Model Evaluation
As in the original exercise, we want to use a custom metric for assessing the performance.

In [7]:
from pyspark.ml.evaluation import *

class AccuracyClassificationEvaluator(Evaluator):
    def __init__(self, predictionCol='prediction', labelCol='label'):
        super(AccuracyClassificationEvaluator, self).__init__()
        self.predictionCol = predictionCol
        self.labelCol = labelCol
    
    def _evaluate(self, dataset):
        num_total = dataset.count()
        num_correct = dataset.filter(col(self.labelCol) == col(self.predictionCol)).count()
        accuracy = float(num_correct) / num_total
        return accuracy
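
The metric itself is simply the fraction of rows where label and prediction agree. A minimal plain-Python sketch follows, using a hypothetical function `accuracy` and made-up labels. Unlike `_evaluate` above, it raises an explicit error for empty input instead of dividing by zero.

```python
def accuracy(labels, predictions):
    """Fraction of positions where label and prediction are equal."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy of an empty data set")
    num_correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return float(num_correct) / len(labels)

labels = [1.0, 0.0, 1.0, 1.0]          # made-up true sentiments
predictions = [1.0, 1.0, 1.0, 0.0]     # made-up model output
always_positive = [1.0] * len(labels)  # trivial baseline

print(accuracy(labels, predictions))
print(accuracy(labels, always_positive))
```

In this made-up example the trivial baseline scores higher than the model. That is why accuracy should be compared against an "always positive" baseline when most reviews are positive.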

# Hyper Parameter Tuning

The whole pipeline has some parameters which influence the result, i.e. the accuracy. For example, the size of the n-grams will probably have a big impact, and the minDF parameter of the CountVectorizer will probably have some impact as well. These settings are called "hyperparameters": they are parameters of the model, but they are not learned during the training phase and have to be chosen beforehand. But which values work best?

We will use a CrossValidator to select the best set of hyperparameters.

First let us have a look at the parameters of one of the stages.

In [8]:
print(pipe.getStages()[5].explainParams())

inputCol: input column name. (current: stems)
n: number of elements per n-gram (>=1) (default: 2, current: 3)
outputCol: output column name. (default: NGram_4f6286e3942710af6a00__output, current: ngrams)


## Create ParamGrid

Now we create a param grid that defines the different sets of parameters to try. We want to tweak two parameters again:

* NGram size (n) should take values in [2,3,5]
* CountVectorizer minimum document frequency (minDF) should take values in [1,2,3,5]

In order to create this grid, we first need to retrieve the corresponding stages from the pipeline, so we can access their parameters.

In [9]:
from pyspark.ml.tuning import *

ngram = pipe.getStages()[5]
count = pipe.getStages()[6]

param_grid = ParamGridBuilder() \
    .addGrid(ngram.n, [2, 3, 5]) \
    .addGrid(count.minDF, [1, 2, 3, 5]) \
    .build()
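
The grid is the Cartesian product of the value lists, so it contains 3 x 4 = 12 parameter sets. The plain-Python sketch below enumerates them. It uses plain strings as keys for readability, whereas the list returned by `build()` maps Spark `Param` objects to values.

```python
from itertools import product

ngram_sizes = [2, 3, 5]
min_dfs = [1, 2, 3, 5]

# One dictionary per combination of n-gram size and minimum document frequency
grid = [{'n': n, 'minDF': min_df} for n, min_df in product(ngram_sizes, min_dfs)]

print(len(grid))
for params in grid[:3]:
    print(params)
```

Every additional value in either list multiplies the number of models that have to be trained, so the grid should be kept small.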

## Perform Hyper Parameter tuning using CrossValidator

Now we can wrap the previous pipeline inside a CrossValidator, which trains the pipeline over and over again, once per fold for every entry in the param grid. The CrossValidator works as a wrapper around the whole pipeline and returns a fitted model that can be used like a pipeline model. In order to evaluate the quality of a fit, the CrossValidator also needs an Evaluator; we'll use our AccuracyClassificationEvaluator again.
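
The following plain-Python sketch shows the idea behind k-fold cross-validation and why it is expensive. It is a simplification under assumptions: the functions `k_fold_splits` and `total_fits` are hypothetical, and rows are dealt out round-robin here, whereas Spark assigns rows to folds at random.

```python
def k_fold_splits(num_rows, num_folds):
    """Yield (training_indices, validation_indices) pairs, one per fold."""
    # Simplification: deal rows out round-robin instead of at random
    folds = [list(range(k, num_rows, num_folds)) for k in range(num_folds)]
    for k in range(num_folds):
        validation = folds[k]
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, validation

def total_fits(num_param_maps, num_folds):
    """Number of pipeline fits performed by a full cross-validation run."""
    # One fit per parameter set and fold, plus a final refit of the best
    # parameter set on the complete training data
    return num_param_maps * num_folds + 1

splits = list(k_fold_splits(9, 3))
print(splits[0])
print(total_fits(12, 3))  # 12 * 3 + 1 = 37
```

With 12 parameter sets and 3 folds the pipeline is fitted 37 times, so caching the input data as done earlier pays off.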

In [None]:
# Create instance of AccuracyClassificationEvaluator with labelCol set to the real sentiment column
evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')

# Create a CrossValidator instance
validator = CrossValidator(estimator=pipe, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3)

# Fit the cross-validated pipeline to the training data
model = validator.fit(train_data)

In [None]:
# Predict sentiment for test data
pred = model.transform(test_data)
always_positive = pred.withColumn('prediction',lit(1.0))

print("Model Accuracy = %f" % evaluator.evaluate(pred))
print("Baseline Accuracy = %f" % evaluator.evaluate(always_positive))
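
After fitting, it is often useful to see which parameter set won. The helper below is a hedged sketch: `best_param_map` is a hypothetical function, and the metric values are made-up numbers for illustration, not results from this notebook.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Return the parameter set with the best average metric and that metric."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    return param_maps[best_index], avg_metrics[best_index]

# Made-up example values, only to show how the helper is called
example_maps = [{'n': 2, 'minDF': 1}, {'n': 2, 'minDF': 2}, {'n': 3, 'minDF': 1}]
example_metrics = [0.91, 0.93, 0.89]

print(best_param_map(example_maps, example_metrics))
```

With the fitted model from above, the inputs should be available as `validator.getEstimatorParamMaps()` and `model.avgMetrics`, which are expected to be in the same order. This has not been run as part of this notebook.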