# Load Data

First we load the data from S3. It is stored as a simple CSV file with three columns:
1. product name
2. review text
3. rating (1 - 5)

In [None]:
from pyspark.sql.functions import *

data = spark.read.text("s3://dimajix-training/data/amazon_baby")
data = data.select(
        split('value',',')[0].alias('name'),
        split('value',',')[1].alias('review'),
        split('value',',')[2].alias('rating').cast('int')
).filter(col('rating').isNotNull())

data.limit(5).toPandas()
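To see what the `split` / `cast` / `filter` combination above does to individual lines, here is a plain-Python sketch that mimics it on a few made-up lines. The helper `parse_line` and the sample lines are hypothetical, and Spark's own casting rules may differ in edge cases. Note that a comma inside the review text shifts the rating out of the third field, so such a line ends up being dropped.

```python
def parse_line(line):
    """Mimic split(',')[0..2] plus cast('int'); return None if the rating is not an integer."""
    fields = line.split(',')
    if len(fields) < 3:
        return None                      # no third field -> rating would be null
    try:
        rating = int(fields[2])
    except ValueError:
        return None                      # third field is not a number -> rating would be null
    return (fields[0], fields[1], rating)

# Hypothetical sample lines, not taken from the real data set
lines = [
    "name,review,rating",                # header line
    "Baby Bib,Great product,5",
    "Crib,Broke, after one week,1",      # comma inside the review text
    "Bottle,No rating",
]

# Keep only lines with a valid integer rating (the isNotNull filter)
parsed = [p for p in map(parse_line, lines) if p is not None]
print(parsed)
```

Only the second line survives in this sketch; the header, the review containing a comma, and the line without a rating are all filtered out.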

# Split Train Data / Test Data

Now let's do the usual split of our data into a training data set and a test data set used for validation. We use 80% of all reviews for training and 20% for validation.

In [None]:
train_data, test_data = data.randomSplit([0.8,0.2], seed=1)

print("train_data: %d" % train_data.count())
print("test_data: %d" % test_data.count())
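The counts printed above will usually not be an exact 80/20 split, because each row is assigned to a split at random. A minimal plain-Python sketch of that idea follows; the function `random_split` is a hypothetical stand-in and not Spark's actual implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if r < cumulative:
                splits[i].append(row)
                break
        else:
            # Guard against floating point rounding: fall back to the last split
            splits[-1].append(row)
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=1)

print("train: %d" % len(train))
print("test: %d" % len(test))
```

Using the same seed again reproduces the same split, which is why a fixed `seed` is passed in the cell above.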

# Implement Transformer

We need a custom Transformer to build the pipeline. The transformer should remove all punctuation from a given text column (the implementation below replaces each punctuation character with a space).

In [None]:
from pyspark.ml import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

def remove_punctuations(text):
    import string
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text


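class PunctuationCleanupTransformer(Transformer):
    """Replaces punctuation in inputCol with spaces and stores the result in outputCol."""
    def __init__(self, inputCol, outputCol):
        # Pass the class itself to super(), so the parent initializers of this class are run
        super(PunctuationCleanupTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        # Wrap the plain Python function into a Spark UDF returning a string
        remove_punctuation_udf = udf(remove_punctuations, StringType())
        return dataset.withColumn(self.outputCol, remove_punctuation_udf(self.inputCol))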
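The cleanup function can be tried outside of Spark on a single string. The sketch below repeats `remove_punctuations` and compares it with a possible alternative based on `str.translate`; the name `remove_punctuations_translate` and the sample text are made up for illustration.

```python
import string

def remove_punctuations(text):
    # Same logic as the function used by the transformer above
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text

# Alternative: build a translation table once and replace all characters in one pass
PUNCTUATION_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def remove_punctuations_translate(text):
    return text.translate(PUNCTUATION_TABLE)

sample = "Great product!!! Works well, highly recommended."   # hypothetical review text
cleaned = remove_punctuations(sample)

print(repr(cleaned))
print(cleaned.split())
```

Because every punctuation character is replaced by a space rather than deleted, the cleaned text keeps its length and neighbouring words are never glued together.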

## Test Transformer

Let's create an instance of the Transformer and test it.

In [None]:
# Create instance of PunctuationCleanupTransformer and apply it to the data. The result should be stored in clean_data
clean_data = ...

# Extract a couple of rows, so we can inspect result
clean_data.limit(4).toPandas()

# Create ML Pipeline

Now we have all components for creating an initial ML Pipeline. Remember that we have used the following components before:

* Tokenizer - for splitting reviews into words
* StopWordsRemover - for removing stop words
* NGram - for creating n-grams (we'll use two words per n-gram)
* CountVectorizer - for creating bag-of-words features from the words
* IDF - for creating TF-IDF features from the n-gram counts
* LogisticRegression - for creating the real model

Now we want to add the PunctuationCleanupTransformer. Note that punctuation needs to be removed *before* tokenization!
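Why the order matters can be seen in a small plain-Python sketch. The helpers below are rough stand-ins for the Tokenizer, StopWordsRemover and NGram stages, not their exact Spark behavior, and the stop word set and review text are illustrative only.

```python
import string

STOP_WORDS = {'the', 'a', 'and', 'or', 'it', 'this', 'of', 'is'}   # small illustrative subset

def clean(text):
    # Replace every punctuation character with a space
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text

def tokenize(text):
    # Rough stand-in for a tokenizer: lower-case and split on whitespace
    return text.lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def bigrams(tokens):
    # Join each pair of neighbouring tokens, as an n-gram stage with n=2 would
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

review = "This is great! The bib is soft, and it works."   # hypothetical review

raw_tokens = remove_stop_words(tokenize(review))
clean_tokens = remove_stop_words(tokenize(clean(review)))

print(raw_tokens)             # punctuation still attached to the words
print(clean_tokens)           # plain words only
print(bigrams(clean_tokens))
```

Without the cleanup step, tokens such as `great!` and `great` would be counted as different words by the later stages.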

You also need to transform the incoming rating (1-5) into a sentiment (0 or 1) and drop reviews with a rating of 3. This can be done using one or more SQLTransformer instances.
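As a hint, the rule can be written down both as an SQL statement and as plain Python. The sketch assumes that ratings above 3 count as positive (1) and ratings below 3 as negative (0); the text above only states that a rating of 3 is dropped. The names `SENTIMENT_SQL` and `rating_to_sentiment` are made up for this illustration.

```python
# SQL statement of the kind one could pass to SQLTransformer(statement=...);
# __THIS__ is the placeholder for the incoming DataFrame.
# Assumption: "rating > 3" means positive sentiment.
SENTIMENT_SQL = "SELECT *, CAST(rating > 3 AS DOUBLE) AS sentiment FROM __THIS__ WHERE rating <> 3"

def rating_to_sentiment(rating):
    """Plain-Python version of the same rule; None means 'drop this review'."""
    if rating == 3:
        return None
    return 1.0 if rating > 3 else 0.0

ratings = [1, 2, 3, 4, 5]
sentiments = {r: rating_to_sentiment(r) for r in ratings}
print(sentiments)
```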

In [None]:
from pyspark.ml.feature import *
from pyspark.ml.classification import *

# Define list of stopwords used in StopWordsRemover
stopWords=['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']

stages = [
    # You will probably need the following stages, in some meaningful order and with appropriate arguments
    #   CountVectorizer
    #   IDF
    #   LogisticRegression
    #   NGram
    #   PunctuationCleanupTransformer
    #   SQLTransformer
    #   StopWordsRemover
    #   Tokenizer
]

pipe = Pipeline(stages = stages)

# Fit Pipeline Model
Using the training data, we create a PipelineModel by fitting the Pipeline to the data.

In [None]:
model = pipe.fit(train_data)

# Predict Data

Let us make some predictions for the test data using the model.

In [None]:
pred = model.transform(test_data)

pred.limit(10).toPandas()

# Model Evaluation
As in the original exercise, we want to use a custom metric for assessing the performance.

In [None]:
# Copy and paste the AccuracyClassificationEvaluator from the last exercise

## Assess Performance

With the evaluator we can assess the performance of the prediction and easily compare it to a simple model which always predicts 'positive'.

In [None]:
always_positive = pred.withColumn('prediction',lit(1.0))

evaluator = AccuracyClassificationEvaluator(predictionCol='prediction', labelCol='sentiment')

print("Model Accuracy = %f" % evaluator.evaluate(pred))
print("Baseline Accuracy = %f" % evaluator.evaluate(always_positive))
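The comparison against the baseline can be illustrated without Spark. The sketch below assumes that the accuracy metric is simply the fraction of rows where prediction and label agree; the `accuracy` helper and all values are hypothetical and not results from the real data.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return correct / len(labels)

# Hypothetical labels and predictions (1.0 = positive, 0.0 = negative)
labels      = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
predictions = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
always_positive = [1.0] * len(labels)

print("Model Accuracy = %f" % accuracy(labels, predictions))
print("Baseline Accuracy = %f" % accuracy(labels, always_positive))
```

When most labels are positive, the always-positive baseline already reaches a high accuracy, so a model is only useful if it clearly beats that number.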

# Hyper Parameter Tuning

Again we want to tune some hyperparameters, but this time inside a pipeline. The methodology is the same as before: we can directly include the CrossValidator in the pipeline. But let's go step by step.

First let us have a look at all parameters of a LogisticRegression.

In [None]:
print(LogisticRegression().explainParams())

## Create ParamGrid

Now we create a param grid that defines the different parameter combinations to try. We want to tweak two parameters again:

* regParam should take values in [0.0, 0.0001, 0.01, 1.0, 100.0]
* maxIter should take values in [10, 100]

In order to create this grid, we first need to create an instance of a LogisticRegression, so we can access its parameters.

In [None]:
from pyspark.ml.tuning import *

# create LogisticRegression lr
lr = LogisticRegression(featuresCol='features',labelCol='sentiment')

# Build a param_grid with specified parameters using ParamGridBuilder()
param_grid = ...
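A param grid is the set of all combinations of the listed values. The plain-Python sketch below only illustrates how large the grid becomes; it does not use `ParamGridBuilder`, and the fold count of 3 is an assumption.

```python
from itertools import product

reg_params = [0.0, 0.0001, 0.01, 1.0, 100.0]
max_iters = [10, 100]

# Every combination of the two value lists, one dict per candidate model
grid = [{'regParam': r, 'maxIter': m} for r, m in product(reg_params, max_iters)]

num_folds = 3   # assumed number of cross-validation folds

print(len(grid))                 # number of parameter combinations
print(len(grid) * num_folds)     # number of model fits during cross-validation
```

Each additional parameter multiplies the number of fits, so it pays off to keep the value lists short.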

## Create Pipeline

Now we can create a pipeline using a CrossValidator instead of directly using a LogisticRegression. This means the configuration of the Pipeline should match the old one except that a CrossValidator is inserted instead of the LogisticRegression. The CrossValidator works as a wrapper of the regression algorithm.

We want to put our own AccuracyClassificationEvaluator into the CrossValidator.

In [None]:
# Provide Evaluator required by CrossValidator
evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')

# Define list of stopwords used in StopWordsRemover
stopWords=['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']

stages = [
    # You will probably need the following stages
    #   CountVectorizer
    #   CrossValidator
    #   IDF
    #   NGram
    #   PunctuationCleanupTransformer
    #   SQLTransformer
    #   StopWordsRemover
    #   Tokenizer
]

pipe = Pipeline(stages = stages)

# Fit the pipeline to the training data
model = pipe.fit(train_data)
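What happens during cross-validation can be sketched in plain Python. The code assigns rows to folds round-robin for simplicity and uses a made-up scoring function in place of "fit on the training folds, evaluate on the validation fold"; all names here are hypothetical and the sketch is not Spark's implementation.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cross_validate(n_rows, k, candidates, score):
    """Return the candidate with the best average score over all folds."""
    def avg_score(candidate):
        scores = [score(candidate, train, val) for train, val in k_fold_indices(n_rows, k)]
        return sum(scores) / len(scores)
    return max(candidates, key=avg_score)

# Hypothetical scoring function standing in for "fit on train, evaluate on validation"
def toy_score(candidate, train, validation):
    return 1.0 - abs(candidate - 0.01)

best = cross_validate(n_rows=10, k=3, candidates=[0.0, 0.0001, 0.01, 1.0, 100.0], score=toy_score)
print(best)
```

Every row is used for validation exactly once, and the candidate with the best average score across the folds is selected.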

In [None]:
# Predict sentiment for test data
pred = model.transform(test_data)

# Evaluate and compare against baseline
evaluator = AccuracyClassificationEvaluator(labelCol='sentiment')
print("Model Accuracy = %f" % evaluator.evaluate(pred))
print("Baseline Accuracy = %f" % evaluator.evaluate(always_positive))