# Load Data

First we load the data from S3. It is stored as a simple CSV file with three columns:
1. product name
2. review text
3. rating (1 - 5)

In [4]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

schema =  StructType([
    StructField('name',StringType(),True),
    StructField('review',StringType(), True),
    StructField('rating',StringType(), True),
])

raw_data = spark.read.schema(schema).csv("s3://dimajix-training/data/amazon_baby")
raw_data.limit(5).toPandas()

Unnamed: 0,name,review,rating
0,name,review,rating
1,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
2,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
3,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
4,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5


## Clean and Cache Data

We need to convert the "rating" column to an integer. For the first record this conversion yields NULL, because that record contains the CSV header instead of data. So after the cast we filter out all records with a NULL rating and, in the same step, all records without a review text.

To help distribute the workload, we repartition the DataFrame and cache it.

In [5]:
data = raw_data.withColumn('rating',col('rating').cast(IntegerType())) \
    .filter(col('rating').isNotNull()) \
    .filter(col('review').isNotNull()) \
    .repartition(31) \
    .cache()

data.limit(5).toPandas()

Unnamed: 0,name,review,rating
0,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4
1,Our Baby Girl Memory Book,Really happy with this purchase. I was looking...,5
2,Semanario (7) Little Girls 14k Gold Overlay Ba...,. I am pleased with product. I love the bangle...,4
3,Neurosmith - Music Blocks with Mozart Music Ca...,It takes a youthful spirit of inquiry and fasc...,5
4,Fisher Price Nesting Action Vehicles,This is a great toy. The wheels really work a...,5
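
The cleanup step relies on a lenient cast: a string that cannot be parsed as an integer becomes NULL instead of raising an error, which is why the header row can be filtered out afterwards. The following is a rough plain-Python emulation of that behaviour, assuming the non-ANSI cast semantics; `cast_to_int` and the sample rows are made up for illustration and are not part of the notebook.

```python
def cast_to_int(value):
    """Emulate a lenient string-to-int cast: unparsable values become None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

rows = [
    ("name", "review", "rating"),                     # header line read as data
    ("Planetwise Wipe Pouch", "it came early", "5"),  # regular record
    ("Some product", None, "4"),                      # hypothetical record without review text
]

# Cast the rating, then keep only rows with a valid rating and a review
casted = [(name, review, cast_to_int(rating)) for name, review, rating in rows]
cleaned = [r for r in casted if r[2] is not None and r[1] is not None]
print(cleaned)
```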


# Split Train Data / Test Data

Now let's do the usual split of our data into a training data set and a test data set. We use 80% of all reviews for training and 20% for testing.

In [6]:
train_data, test_data = data.randomSplit([0.8,0.2], seed=1)

print("train_data: %d" % train_data.count())
print("test_data: %d" % test_data.count())

train_data: 139461
test_data: 34861
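
Note that the two counts are only approximately in an 80/20 ratio, because the split assigns rows randomly rather than cutting at an exact position. A minimal plain-Python sketch of that idea follows, assuming each row is assigned independently according to the weights. The helper `random_split` is hypothetical and reproduces neither Spark's actual sampling nor its results for `seed=1`.

```python
import random

def random_split(rows, weights, seed):
    """Assign every row to one split by drawing a uniform random number."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(10000)), [0.8, 0.2], seed=1)
print(len(train), len(test))
```

Using a fixed seed makes the split reproducible, which matters when the accuracy of different pipeline variants is compared on the same test set.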


# Implement Transformer for Removing Punctuation

We need a custom Transformer to build the pipeline. The transformer should replace every punctuation character in a given text column with a space.

In [7]:
from pyspark.ml import *
from pyspark.sql.types import *
from pyspark.sql.functions import *

def remove_punctuations(text):
    import string
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text


class PunctuationCleanupTransformer(Transformer):
    def __init__(self, inputCol, outputCol):
        """
        Constructor of PunctuationCleanupTransformer which takes two arguments:
        inputCol - name of input column
        outputCol - name of output column
        """
        super(PunctuationCleanupTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        """
        Protected _transform method which will be called by the public transform
        method. You should not call this method directly.
        """
        remove_punctuation_udf = udf(remove_punctuations, StringType())
        return dataset.withColumn(self.outputCol, remove_punctuation_udf(self.inputCol))

## Test Transformer

Let's create an instance of the Transformer and test it.

In [8]:
cleaner = PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review')
clean_data = cleaner.transform(data)

clean_data.limit(4).toPandas()

Unnamed: 0,name,review,rating,clean_review
0,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,One of baby s first and favorite books and it...
1,Our Baby Girl Memory Book,Really happy with this purchase. I was looking...,5,Really happy with this purchase I was looking...
2,Semanario (7) Little Girls 14k Gold Overlay Ba...,. I am pleased with product. I love the bangle...,4,I am pleased with product I love the bangle...
3,Neurosmith - Music Blocks with Mozart Music Ca...,It takes a youthful spirit of inquiry and fasc...,5,It takes a youthful spirit of inquiry and fasc...
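
The loop in `remove_punctuations` calls `str.replace` once per punctuation character. A possible alternative, shown here only as a sketch, builds a translation table once and applies it in a single pass. The name `remove_punctuations_fast` is made up for this example, and the code assumes Python 3 (`str.maketrans`).

```python
import string

# Map every ASCII punctuation character to a single space
_PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def remove_punctuations_fast(text):
    """Replace all punctuation characters with spaces in one pass."""
    return text.translate(_PUNCT_TABLE)

def remove_punctuations(text):
    # Loop-based version from the notebook, repeated here for comparison
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text

sample = "One of baby's first and favorite books, and it..."
print(remove_punctuations_fast(sample))
```

Both versions keep the length of the text unchanged, since each punctuation character is replaced by exactly one space.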


# Implement Transformer for Stemming

We also need to stem words, and for this we use the Python NLTK library.

In [9]:
from nltk.stem import PorterStemmer

def stem_word(words):
    ps = PorterStemmer()
    return [ps.stem(word) for word in words]


class PorterStemmerTransformer(Transformer):
    def __init__(self, inputCol, outputCol):
        """
        Constructor of PorterStemmerTransformer which takes two arguments:
        inputCol - name of input column
        outputCol - name of output column
        """
        super(PorterStemmerTransformer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        """
        Protected _transform method which will be called by the public transform
        method. You should not call this method directly.
        """
        stem_word_udf = udf(stem_word, ArrayType(StringType()))
        return dataset.withColumn(self.outputCol, stem_word_udf(self.inputCol))

## Test Transformer

Again we want to test the `PorterStemmerTransformer`.

In [10]:
from pyspark.ml.feature import *

# First we need to Tokenize each line. In order to perform this task, we implement the following steps
# 1. Instantiate a Tokenizer instance from pyspark.ml.feature
# 2. Transform the raw data using the tokenizer
tokenizer = Tokenizer(inputCol='review', outputCol='words')
tokenized_data = tokenizer.transform(data)

# Then we can instantiate the Stemmer and use it on the words
stemmer = PorterStemmerTransformer(inputCol='words', outputCol='stemmed_review')
stemmed_data = stemmer.transform(tokenized_data)

stemmed_data.limit(4).toPandas()

Unnamed: 0,name,review,rating,words,stemmed_review
0,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,"[one, of, baby's, first, and, favorite, books,...","[one, of, baby', first, and, favorit, books,, ..."
1,Our Baby Girl Memory Book,Really happy with this purchase. I was looking...,5,"[really, happy, with, this, purchase., i, was,...","[realli, happi, with, thi, purchase., i, wa, l..."
2,Semanario (7) Little Girls 14k Gold Overlay Ba...,. I am pleased with product. I love the bangle...,4,"[., i, am, pleased, with, product., i, love, t...","[., i, am, pleas, with, product., i, love, the..."
3,Neurosmith - Music Blocks with Mozart Music Ca...,It takes a youthful spirit of inquiry and fasc...,5,"[it, takes, a, youthful, spirit, of, inquiry, ...","[it, take, a, youth, spirit, of, inquiri, and,..."
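
In the output above, tokens such as `books,` and `purchase.` keep their punctuation because the tokenizer was applied to the raw `review` column, and the stemmer returns them unchanged. Applying the punctuation cleanup first avoids this, at the cost of empty tokens wherever two spaces follow each other. The sketch below approximates the tokenizer in plain Python, assuming it lowercases the text and splits on every single whitespace character; `simple_tokenize` is a stand-in for illustration, not Spark's implementation.

```python
import re
import string

def simple_tokenize(text):
    """Lowercase the text and split on every single whitespace character."""
    return re.split(r"\s", text.lower())

def remove_punctuations(text):
    # Same logic as the notebook function: punctuation becomes a space
    for c in string.punctuation:
        text = text.replace(c, ' ')
    return text

raw = "Really happy with this purchase. I was looking"
raw_tokens = simple_tokenize(raw)
clean_tokens = simple_tokenize(remove_punctuations(raw))

print(raw_tokens)    # punctuation stays attached to 'purchase.'
print(clean_tokens)  # 'purchase' is clean, followed by an empty token
```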


# Create ML Pipeline

Now we have all components for creating an initial ML Pipeline. Recall that we have already used the following components:

* PunctuationCleanupTransformer - for removing punctuation from reviews
* Tokenizer - for splitting reviews into words
* StopWordsRemover - for removing stop words
* PorterStemmerTransformer - for stemming words
* NGram - for creating n-grams (we'll use three words per n-gram)
* CountVectorizer - for creating bag-of-words count features from the n-grams
* IDF - for creating TF-IDF features from the n-gram counts
* LogisticRegression - for fitting the actual model

You also need to transform the incoming rating (1-5) to a sentiment (0 or 1), and you need to drop reviews with a rating of 3. This can be done using one or more SQLTransformer instances. Inside the SQLTransformer instance you simply write SQL code and access the current DataFrame via `__THIS__`.
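
The intended mapping can be written down in plain Python first, which makes the rule easy to check before it is expressed in SQL. This is a sketch of the rule only (ratings 1 and 2 become 0.0, ratings 4 and 5 become 1.0, rating 3 is dropped); `rating_to_sentiment` is a hypothetical helper and not part of the pipeline.

```python
def rating_to_sentiment(rating):
    """Return 0.0 for negative, 1.0 for positive, None for neutral (rating 3)."""
    if rating == 3:
        return None
    return 0.0 if rating < 3 else 1.0

ratings = [1, 2, 3, 4, 5]
# Keep only ratings that map to a sentiment, mirroring the drop of rating 3
labelled = [(r, rating_to_sentiment(r)) for r in ratings
            if rating_to_sentiment(r) is not None]
print(labelled)
```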

In [11]:
from pyspark.ml.feature import *
from pyspark.ml.classification import *

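# A small hand-written stop word list; note that it is overwritten
# by Spark's default English list on the following line
stopWords = ['the','a','and','or', 'it', 'this', 'of', 'an', 'as', 'in', 'on', 'is', 'are', 'to', 'was', 'for', 'then', 'i']
stopWords = StopWordsRemover.loadDefaultStopWords("english")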

stages = [
    PunctuationCleanupTransformer(inputCol='review', outputCol='clean_review'),
    SQLTransformer(statement='SELECT *,CASE WHEN rating < 3 THEN 0.0 ELSE 1.0 END AS sentiment FROM __THIS__ WHERE rating <> 3'),
    Tokenizer(inputCol='clean_review', outputCol='words'),
    StopWordsRemover(inputCol='words', outputCol='vwords', stopWords=stopWords),
    PorterStemmerTransformer(inputCol='vwords', outputCol='stems'),
    NGram(inputCol='stems', outputCol='ngrams', n=3),
    CountVectorizer(inputCol='ngrams', outputCol='tf', minDF=2.0),
    IDF(inputCol='tf', outputCol='features'),
    LogisticRegression(featuresCol='features',labelCol='sentiment')
]
pipe = Pipeline(stages = stages)
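
To make the `NGram` stage with `n=3` concrete, here is a plain-Python sketch of a word n-gram: every run of three consecutive tokens is joined by a single space. The function `ngrams` is written for illustration and only approximates the Spark stage. The sample tokens include an empty string, as produced when punctuation has been replaced by a space next to an existing space.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

stems = ["son", "2", "year", "old", "", "bought"]
trigrams = ngrams(stems, 3)
print(trigrams)

# Fewer than n tokens yields no n-grams at all
print(ngrams(["too", "short"], 3))
```

Empty tokens end up inside the n-grams as extra spaces, so filtering them out before the `NGram` stage may be worth considering.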

# Fit Pipeline Model
We create a PipelineModel by fitting the Pipeline to the training data.

In [12]:
model = pipe.fit(train_data)
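
Fitting is where the data-dependent stages learn their state, for example the `CountVectorizer` vocabulary and the `IDF` weights. As an illustration, the sketch below computes inverse document frequencies on a toy corpus using the smoothed formula log((m + 1) / (df + 1)) described in the Spark MLlib documentation, where m is the number of documents and df the number of documents containing a term. The corpus and the helper `idf_weights` are invented for this example.

```python
import math

def idf_weights(docs):
    """Compute smoothed IDF weights: log((m + 1) / (df + 1)) per term."""
    m = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log((m + 1.0) / (count + 1.0)) for term, count in df.items()}

# Toy corpus: each document is a list of n-grams
docs = [
    ["great toy", "love it"],
    ["great toy", "broke fast"],
    ["love it", "great toy"],
]
weights = idf_weights(docs)
for term in sorted(weights):
    print(term, round(weights[term], 4))
```

A term that occurs in every document gets a weight of zero, so it contributes nothing to the TF-IDF features.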

# Predict Data

Let us make some predictions for the test data using the model.

In [13]:
pred = model.transform(test_data)

pred.limit(10).toPandas()

Unnamed: 0,name,review,rating,clean_review,sentiment,words,vwords,stems,ngrams,tf,features,rawPrediction,probability,prediction
0,,"My son is now 2 years old, we bought this when...",5,My son is now 2 years old we bought this when...,1.0,"[my, son, is, now, 2, years, old, , we, bought...","[son, 2, years, old, , bought, 7, months, , ha...","[son, 2, year, old, , bought, 7, month, , hand...","[son 2 year, 2 year old, year old , old bough...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-4044.31968499, 4044.31968499]","[0.0, 1.0]",1.0
1,,We have used this to enclose our wood stove to...,5,We have used this to enclose our wood stove to...,1.0,"[we, have, used, this, to, enclose, our, wood,...","[used, enclose, wood, stove, protect, kids, , ...","[use, enclos, wood, stove, protect, kid, , wor...","[use enclos wood, enclos wood stove, wood stov...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-3164.92569261, 3164.92569261]","[0.0, 1.0]",1.0
2,*SPECIAL PROMOTION*The Art of CureTM *SAFETY K...,"So as much as I love all things natural, we we...",5,So as much as I love all things natural we we...,1.0,"[so, as, much, as, i, love, all, things, natur...","[much, love, things, natural, , skeptical, pro...","[much, love, thing, natur, , skeptic, product,...","[much love thing, love thing natur, thing natu...","(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.82692062624, 0.0, 3.22663271037, 0.0, 0.0, ...","[-5341.67880727, 5341.67880727]","[0.0, 1.0]",1.0
3,100% Solid Wood Safety Rail Guard &bull; Honey...,This was a beautiful piece and fit nicely with...,5,This was a beautiful piece and fit nicely with...,1.0,"[this, was, a, beautiful, piece, and, fit, nic...","[beautiful, piece, fit, nicely, bunk, beds, , ...","[beauti, piec, fit, nice, bunk, bed, , , easi,...","[beauti piec fit, piec fit nice, fit nice bunk...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-1222.1505214, 1222.1505214]","[0.0, 1.0]",1.0
4,"2 Handle 8oz. Cup with Flip-It Straw Top, 1-pk...","There's nothing wrong with the cup itself, I l...",1,There s nothing wrong with the cup itself I l...,0.0,"[there, s, nothing, wrong, with, the, cup, its...","[nothing, wrong, cup, , love, cup, , make, mis...","[noth, wrong, cup, , love, cup, , make, mistak...","[noth wrong cup, wrong cup , cup love, love ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1610.33450113, -1610.33450113]","[1.0, 0.0]",0.0
5,2 Red Hens Whole Roost Bag W/ Changing Pad-Che...,I was amazed when I saw how cheap this bag was...,5,I was amazed when I saw how cheap this bag was...,1.0,"[i, was, amazed, when, i, saw, how, cheap, thi...","[amazed, saw, cheap, bag, compared, others, ev...","[amaz, saw, cheap, bag, compar, other, even, o...","[amaz saw cheap, saw cheap bag, cheap bag comp...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-992.717734987, 992.717734987]","[0.0, 1.0]",1.0
6,2 in 1 Floating Baby Bottle Brush,I have no idea why I even thought of getting t...,1,I have no idea why I even thought of getting t...,0.0,"[i, have, no, idea, why, i, even, thought, of,...","[idea, even, thought, getting, , , exterior, p...","[idea, even, thought, get, , , exterior, packa...","[idea even thought, even thought get, thought ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-837.094291579, 837.094291579]","[0.0, 1.0]",1.0
7,24 Sq. Ft. (set of 24 + borders) 'We Sell Mats...,I gave the mats to my daughter's preschool as ...,5,I gave the mats to my daughter s preschool as ...,1.0,"[i, gave, the, mats, to, my, daughter, s, pres...","[gave, mats, daughter, preschool, gift, , kids...","[gave, mat, daughter, preschool, gift, , kid, ...","[gave mat daughter, mat daughter preschool, da...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-677.727977688, 677.727977688]","[4.63958417669e-295, 1.0]",1.0
8,"3 Packs of NUK Replacement Silicone Spout, Clear",These are the best tips on the Nuk Bottles tha...,5,These are the best tips on the Nuk Bottles tha...,1.0,"[these, are, the, best, tips, on, the, nuk, bo...","[best, tips, nuk, bottles, get, teeth, start, ...","[best, tip, nuk, bottl, get, teeth, start, che...","[best tip nuk, tip nuk bottl, nuk bottl get, b...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-217.578766464, 217.578766464]","[3.21175452925e-95, 1.0]",1.0
9,Aden + Anais Issie Security Blanket Set Declan...,My 10 month old son is a great sleeper and has...,5,My 10 month old son is a great sleeper and has...,1.0,"[my, 10, month, old, son, is, a, great, sleepe...","[10, month, old, son, great, sleeper, since, a...","[10, month, old, son, great, sleeper, sinc, ar...","[10 month old, month old son, old son great, s...","(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 3.08723358449, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-6818.20425706, 6818.20425706]","[0.0, 1.0]",1.0


# Model Evaluation
As in the original exercise, we want to use a custom metric for assessing the performance.

In [14]:
from pyspark.ml.evaluation import *

class AccuracyClassificationEvaluator(Evaluator):
    def __init__(self, predictionCol='prediction', labelCol='label'):
        super(AccuracyClassificationEvaluator, self).__init__()
        self.predictionCol = predictionCol
        self.labelCol = labelCol
    
    def _evaluate(self, dataset):
        num_total = dataset.count()
        num_correct = dataset.filter(col(self.labelCol) == col(self.predictionCol)).count()
        accuracy = float(num_correct) / num_total
        return accuracy

## Assess Performance

With the evaluator we can assess the performance of the prediction and easily compare it to a simple model which always predicts 'positive'.

In [15]:
print("Num positive reviews: %d" % pred.filter(pred.sentiment > 0.5).count())
print("Num negative reviews: %d" % pred.filter(pred.sentiment < 0.5).count())

Num positive reviews: 26740
Num negative reviews: 5010


In [16]:
always_positive = pred.withColumn('prediction',lit(1.0))

evaluator = AccuracyClassificationEvaluator(predictionCol='prediction', labelCol='sentiment')

print("Model Accuracy = %f" % evaluator.evaluate(pred))
print("Baseline Accuracy = %f" % evaluator.evaluate(always_positive))

Model Accuracy = 0.879780
Baseline Accuracy = 0.842205
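
The baseline figure can be cross-checked by hand from the class counts printed earlier: a model that always predicts 'positive' is correct exactly on the positive reviews. A short check, using only the numbers shown in the outputs above:

```python
# Class counts taken from the notebook output
num_positive = 26740
num_negative = 5010

# Always predicting 'positive' is right for every positive review
baseline_accuracy = float(num_positive) / (num_positive + num_negative)
print("Baseline Accuracy = %f" % baseline_accuracy)

# Difference between the reported model accuracy and the baseline
model_accuracy = 0.879780
improvement = model_accuracy - baseline_accuracy
print("Improvement = %.4f" % improvement)
```

So the model improves on the trivial baseline by a little under four percentage points, which puts the seemingly high accuracy of about 88% into perspective for such an imbalanced data set.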
