# MLlib in Action

Now that we have described some of the core pieces you can expect to come across, let’s create a simple pipeline to demonstrate each of the components. We’ll use a small synthetic dataset that will help illustrate our point. Let’s read the data in and see a sample before talking about it further:

In [1]:
# the following line gets the bucket name attached to our cluster
bucket = spark._jsc.hadoopConfiguration().get("fs.gs.system.bucket")

# specifying the path to our bucket where the data is located (no need to edit this path anymore)
data = "gs://" + bucket + "/notebooks/jupyter/data/"
print(data)

gs://is843-demo/notebooks/jupyter/data/


In [2]:
df = spark.read.json(data + "simple-ml")
df.orderBy("value2").show(5)

                                                                                

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|  red|good|    35|14.386294994851129|
| blue| bad|    12|14.386294994851129|
|  red| bad|     2|14.386294994851129|
| blue| bad|     8|14.386294994851129|
|  red| bad|    16|14.386294994851129|
+-----+----+------+------------------+
only showing top 5 rows



This dataset consists of a categorical label with two values (good or bad), a categorical variable (color), and two numerical variables. While the data is synthetic, let’s imagine that this dataset represents a company’s customer health. The “color” column represents some categorical health rating made by a customer service representative. The “lab” column represents the true customer health. The other two values are some numerical measures of activity within an application (e.g., minutes spent on site and purchases). Suppose that we want to train a classification model where we hope to predict a binary variable—the label—from the other values.

## Feature Engineering with Transformers

As already mentioned, transformers help us manipulate our current columns in one way or another. Manipulating these columns is often in pursuit of building features (that we will input into our model). Transformers exist to either cut down the number of features, add more features, manipulate current ones, or simply to help us format our data correctly. Transformers add new columns to DataFrames.

In MLlib, inputs to machine learning algorithms must (with several exceptions) be of type Double (for labels) and Vector[Double] (for features). The current dataset does not meet that requirement, so we need to transform it into the proper format.

To achieve this in our example, we are going to specify an **RFormula**. This is a declarative language for specifying machine learning transformations and is simple to use once you understand the syntax. RFormula supports a limited subset of the R operators that in practice work quite well for simple models and manipulations (we demonstrate the manual approach to this problem in the next class). The basic RFormula operators are:

* `~`: Separate target and terms

* `+`: Concatenate terms; “+ 0” removes the intercept (this means that the y-intercept of the line that we will fit will be 0)

* `-`: Remove a term; “- 1” removes the intercept (yes, this does the same thing as “+ 0”)

* `:`: Interaction (multiplication for numeric values, or binarized categorical values)

* `.`: All columns except the target/dependent variable
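To make the operator semantics concrete, here is a toy, plain-Python sketch that expands a formula into a label and a list of terms. It is not how Spark parses formulas. The function name `expand_formula` is hypothetical, and it only understands `~`, `+`, `.` and `:`.

```python
def expand_formula(formula, columns):
    """Toy expansion of an R-style formula; NOT Spark's parser."""
    label, rhs = [part.strip() for part in formula.split("~")]
    terms = []
    for raw in rhs.split("+"):
        term = raw.strip()
        if term == ".":
            # "." stands for every column except the label
            expanded = [c for c in columns if c != label]
        elif ":" in term:
            # "a:b" is an interaction between a and b
            expanded = [tuple(t.strip() for t in term.split(":"))]
        else:
            expanded = [term]
        for t in expanded:
            if t not in terms:  # keep each term only once
                terms.append(t)
    return label, terms


label, terms = expand_formula(
    "lab ~ . + color:value1 + color:value2",
    ["color", "lab", "value1", "value2"],
)
print(label, terms)
```

The expansion gives three plain terms and two interaction terms, which is the shape we expect the real RFormula to resolve for this dataset.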

In order to specify transformations with this syntax, we need to import the relevant class. Then we go through the process of defining our formula. In this case we want to use all available variables (the .) and also add in the interactions between value1 and color and value2 and color, treating those as new features:

In [3]:
from pyspark.ml.feature import RFormula

supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")

At this point, we have declaratively specified how we would like to change our data into what we will train our model on. The next step is to fit the RFormula transformer to the data to let it discover the possible values of each column. Not all transformers have this requirement but because RFormula will automatically handle categorical variables for us, it needs to determine which columns are categorical and which are not, as well as what the distinct values of the categorical columns are. For this reason, we have to call the fit method. Once we call fit, it returns a “trained” version of our transformer we can then use to actually transform our data.
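The fit-then-transform split can be sketched in plain Python. The classes below are hypothetical stand-ins, not MLlib classes: `fit` looks at the data to discover the distinct values, and the object it returns only applies what was learned. Ordering the values by descending frequency is an assumption made for this illustration.

```python
from collections import Counter


class ToyIndexer:
    """Hypothetical stand-in for an estimator that must see the data first."""

    def fit(self, values):
        counts = Counter(values)
        # Most frequent value gets index 0.0; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
        return ToyIndexerModel({v: float(i) for i, v in enumerate(ordered)})


class ToyIndexerModel:
    """The "fitted" object: it only applies what fit() learned."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, values):
        return [self.mapping[v] for v in values]


model = ToyIndexer().fit(["bad", "bad", "good", "bad", "good"])
print(model.mapping)
print(model.transform(["good", "bad"]))
```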

Now that we covered those details, let’s continue on and prepare our DataFrame:

In [4]:
fittedRF = supervised.fit(df)  # fit the transformer
preparedDF = fittedRF.transform(df)  # transform
preparedDF.show(5, False)

                                                                                

+-----+----+------+------------------+----------------------------------------------------------------------+-----+
|color|lab |value1|value2            |features                                                              |label|
+-----+----+------+------------------+----------------------------------------------------------------------+-----+
|green|good|1     |14.386294994851129|(10,[1,2,3,5,8],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])  |1.0  |
|blue |bad |8     |14.386294994851129|(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129])        |0.0  |
|blue |bad |12    |14.386294994851129|(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129])      |0.0  |
|green|good|15    |38.97187133755819 |(10,[1,2,3,5,8],[1.0,15.0,38.97187133755819,15.0,38.97187133755819])  |1.0  |
|green|good|12    |14.386294994851129|(10,[1,2,3,5,8],[1.0,12.0,14.386294994851129,12.0,14.386294994851129])|1.0  |
+-----+----+------+------------------+----------------------------------------------------------------------+-----+

                                                                                

In the output we can see the result of our transformation: a column called features that holds our previously raw data in vector form. What’s happening behind the scenes is actually pretty simple. RFormula inspects our data during the fit call and outputs an object, called an RFormulaModel, that will transform our data according to the specified formula. This “trained” transformer always has the word Model in the type signature. When we use this transformer, Spark automatically converts our categorical variable to Doubles so that we can input it into a (yet to be specified) machine learning model. In particular, it assigns a numerical value to each possible color category, creates additional features for the interaction variables between colors and value1/value2, and puts them all into a single vector. We then call transform on that object in order to transform our input data into the expected output data.
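The features column above is printed in sparse vector notation, `(size,[indices],[values])`, which lists only the nonzero slots. As a minimal sketch, the plain-Python helper below (the name `to_dense` is ours, not an MLlib function) expands one of the displayed rows into a dense list so the zeros become visible.

```python
def to_dense(size, indices, values):
    """Expand the (size, [indices], [values]) sparse notation into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# One of the "blue" rows from the output above
blue = to_dense(10, [2, 3, 6, 9],
                [8.0, 14.386294994851129, 8.0, 14.386294994851129])
print(blue)
```

Our reading of the slots (positions 2 and 3 hold value1 and value2, the other nonzero positions hold the interaction terms) is an interpretation of the displayed rows rather than something the output states explicitly.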

Thus far you (pre)processed the data and added some features along the way. Now it is time to actually train a model (or a set of models) on this dataset. In order to do this, you first need to prepare a test set for evaluation.

TIP: Having a good test set is probably the most important thing you can do to ensure you train a model you can actually use in the real world (in a dependable way). Not creating a representative test set or using your test set for hyperparameter tuning are surefire ways to create a model that does not perform well in real-world scenarios. Don’t skip creating a test set—it’s a requirement to know how well your model actually does!

Let’s create a simple test set based off a random split of the data now (we’ll be using this test set throughout the remainder of the notebook):

In [5]:
train, test = preparedDF.randomSplit([0.7, 0.3], seed = 843)
test.show(2)

+-----+---+------+------------------+--------------------+-----+
|color|lab|value1|            value2|            features|label|
+-----+---+------+------------------+--------------------+-----+
| blue|bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|
| blue|bad|     8|14.386294994851129|(10,[2,3,6,9],[8....|  0.0|
+-----+---+------+------------------+--------------------+-----+
only showing top 2 rows
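To build intuition for what a seeded random split does, here is a plain-Python sketch that assigns each row to a split with an independent random draw. It illustrates the idea only and is not Spark's implementation. The function name `toy_random_split` and the row values are made up. As with `randomSplit`, the weights are normalized if they do not sum to 1.

```python
import random


def toy_random_split(rows, weights, seed):
    """Assign each row to a split by an independent random draw (illustration only)."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3]
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, b in enumerate(bounds):
            # The last split also catches any floating-point leftovers
            if x < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train_rows, test_rows = toy_random_split(list(range(1000)), [0.7, 0.3], seed=843)
print(len(train_rows), len(test_rows))
```

With per-row assignment the split sizes are only approximately 70/30, and reusing the same seed reproduces the same split.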



                                                                                

**Estimators**

Now that we have transformed our data into the correct format and created some valuable features, it’s time to actually fit our model. In this case we will use a classification algorithm called logistic regression. To create our classifier we instantiate *LogisticRegression* with the default configuration (hyperparameters). We then set the label column and the features column. The column names we are setting, label and features, are actually the defaults for all estimators in Spark MLlib, and in later notebooks we omit them:

In [6]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label",featuresCol="features")

Before we actually go about training this model, let’s inspect the parameters. This is also a great way to remind yourself of the options available for each particular model:

In [7]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The

The `explainParams` method exists on all algorithms available in MLlib.

Once we have instantiated an untrained algorithm, it is time to fit it to the data. In this case, fitting returns a *LogisticRegressionModel*:

In [8]:
fittedLR = lr.fit(train)

                                                                                

This code will kick off a Spark job to train the model. As opposed to the transformations that you saw, the fitting of a machine learning model is eager and performed immediately.

Once complete, you can use the model to make predictions. Logically this means transforming features into labels. We make predictions with the transform method. For example, **we can transform our test dataset to see what labels our model assigns to it and how those compare to the true labels**. The result, again, is just another DataFrame we can manipulate. Let’s perform that prediction with the following code snippet:

In [9]:
fittedLR.transform(test).select("label", "prediction").show(5)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 5 rows



### SUMMARY
All the above steps can be summarized in the following cell:

In [10]:
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import LogisticRegression

supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")  # define the R formula
fittedRF = supervised.fit(df)  # fit the transformer
preparedDF = fittedRF.transform(df)  # transform

train, test = preparedDF.randomSplit([0.7, 0.3], seed = 843)  # split into train/test
lr = LogisticRegression()  # instantiate an instance of LogisticRegression
fittedLR = lr.fit(train)  # fit the estimator
# fittedLR.transform(train).select("probability", "prediction").show(5)  # check out the predictions on the train dataset

In [11]:
fittedLR.transform(test).select("probability", "prediction").show(5, False)

+-----------+----------+
|probability|prediction|
+-----------+----------+
|[1.0,0.0]  |0.0       |
|[1.0,0.0]  |0.0       |
|[1.0,0.0]  |0.0       |
|[1.0,0.0]  |0.0       |
|[1.0,0.0]  |0.0       |
+-----------+----------+
only showing top 5 rows
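The relationship between the probability and prediction columns can be sketched in plain Python. We assume the usual binary rule here: predict class 1 when its probability exceeds the threshold, which defaults to 0.5. The helper name `predict_label` and the second example row are made up for illustration.

```python
def predict_label(probability, threshold=0.5):
    """Return 1.0 when the probability of class 1 exceeds the threshold, else 0.0."""
    return 1.0 if probability[1] > threshold else 0.0


print(predict_label([1.0, 0.0]))                 # a row like the ones shown above
print(predict_label([0.3, 0.7]))                 # made-up row
print(predict_label([0.3, 0.7], threshold=0.8))  # same row, stricter threshold
```

Raising the threshold makes the model more reluctant to predict the positive class, which trades false positives for false negatives.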



Our next step would be to manually evaluate this model and calculate performance metrics like the true positive rate, false negative rate, and so on. We might then turn around and try a different set of parameters to see if those perform better. However, while this is a useful process, it can also be quite tedious. Spark helps you avoid manually trying different models and evaluation criteria by allowing you to specify your workload as a declarative pipeline of work that includes all your transformations as well as tuning your hyperparameters.

**A REVIEW OF HYPERPARAMETERS**

Although we mentioned them previously, let’s define hyperparameters more formally. Hyperparameters are configuration parameters that affect the training process, such as model architecture and regularization. They are set before training starts. For instance, logistic regression has a hyperparameter that determines how much regularization is applied during the training phase (regularization is a technique that discourages models from overfitting the data). Coming next, you’ll see that we can set up our pipeline to try different hyperparameter values (e.g., different regularization values) in order to compare different variations of the same model against one another.
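As a rough illustration of what a regularization hyperparameter controls, the sketch below computes an elastic net penalty in plain Python. The form assumed here is regParam * (alpha * L1 + (1 - alpha) / 2 * L2 squared). That matches the parameter description printed earlier (alpha = 0 gives an L2 penalty, alpha = 1 an L1 penalty), but treat it as a summary rather than MLlib's code. The function name and the weights are made up.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the training loss for a given weight vector (sketch)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


weights = [3.0, -4.0]  # made-up coefficients
for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=alpha))
```

A larger `reg_param` makes large coefficients more expensive, which is how regularization pushes the model toward simpler fits.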

### Pipelining Our Workflow

As you probably noticed, if you are performing a lot of transformations, writing all the steps and keeping track of DataFrames ends up being quite tedious. That’s why Spark includes the Pipeline concept. A pipeline allows you to set up a dataflow of the relevant transformations that ends with an estimator that is automatically tuned according to your specifications, resulting in a tuned model ready for use. The figure below illustrates this process:

<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/09-02-Pipelining-the-ML-workflow.png?raw=true" width="800" align="center"/>


Note that it is essential that instances of transformers or models are not reused across different pipelines. Always create a new instance of a model before creating another pipeline.

**In order to make sure we don’t overfit, we are going to create a holdout test set and tune our hyperparameters based on a validation set** (note that we create this holdout set from the original dataset, not the preparedDF used in the previous example):

In [12]:
train, test = df.randomSplit([0.7, 0.3], seed = 843)
test.show(2)

+-----+---+------+------------------+
|color|lab|value1|            value2|
+-----+---+------+------------------+
| blue|bad|     8|14.386294994851129|
| blue|bad|     8|14.386294994851129|
+-----+---+------+------------------+
only showing top 2 rows



Now that you have a holdout set, let’s create the base stages in our pipeline. A stage simply represents a transformer or an estimator. In our case, we will have two estimators. The RFormula will first analyze our data to understand the types of input features and then transform them to create new features. Subsequently, the LogisticRegression object is the algorithm that we will train to produce a model:

In [13]:
rForm = RFormula()
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")

We will set the potential values for the RFormula in the next section. Now instead of manually using our transformations and then tuning our model we just make them stages in the overall pipeline, as in the following code snippet:

In [14]:
from pyspark.ml import Pipeline

stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)
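What fitting a pipeline means can be sketched with toy classes in plain Python. Stages are visited in order, estimators are fitted on the data as transformed so far, and every stage's output feeds the next one. All names below (`AddOne`, `MeanCenter`, `Centered`, `fit_pipeline`, `run_pipeline`) are hypothetical and only mimic the idea, not MLlib's `Pipeline` class.

```python
class AddOne:
    """Toy transformer: no fitting needed."""

    def transform(self, data):
        return [x + 1 for x in data]


class MeanCenter:
    """Toy estimator: fit() learns the mean and returns a fitted transformer."""

    def fit(self, data):
        return Centered(sum(data) / len(data))


class Centered:
    """Fitted result of MeanCenter."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


def fit_pipeline(stages, data):
    """Fit stages in order, passing each stage's output to the next one."""
    fitted = []
    for stage in stages:
        model = stage.fit(data) if hasattr(stage, "fit") else stage
        data = model.transform(data)
        fitted.append(model)
    return fitted


def run_pipeline(fitted, data):
    """Apply the already-fitted stages to new data."""
    for model in fitted:
        data = model.transform(data)
    return data


fitted_stages = fit_pipeline([AddOne(), MeanCenter()], [1, 2, 3])
print(run_pipeline(fitted_stages, [10]))
```

The fitted pipeline remembers what each estimator learned (here, the mean of the shifted training data), so new data goes through exactly the same steps.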

### Training and Evaluation

Now that you arranged the logical pipeline, the next step is training. In our case, we won’t train just one model (like we did previously); we will train several variations of the model by specifying different combinations of hyperparameters that we would like Spark to test. We will then select the best model using an Evaluator that compares their predictions on our validation data. We can test different hyperparameters in the entire pipeline, even in the RFormula that we use to manipulate the raw data. This code shows how we go about doing that:

In [15]:
from pyspark.ml.tuning import ParamGridBuilder

params = ParamGridBuilder()\
  .addGrid(rForm.formula, [
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2"])\
  .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
  .addGrid(lr.regParam, [0.1, 2.0])\
  .build()

In our current parameter grid, there are three hyperparameters that will diverge from the defaults:

* Two different versions of the RFormula

* Three different options for the ElasticNet parameter

* Two different options for the regularization parameter

This gives us a total of 12 different combinations of these parameters, which means we will be training 12 different versions of logistic regression. We explain the ElasticNet parameter as well as the regularization options in the next class.
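The count of 12 is simply the Cartesian product of the three lists. The plain-Python sketch below reproduces it with `itertools.product`. The dictionaries are only an illustration of one grid entry each, not the objects that `ParamGridBuilder` returns.

```python
from itertools import product

formulas = ["lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2"]
elastic_net_values = [0.0, 0.5, 1.0]
reg_values = [0.1, 2.0]

# One dictionary per combination, analogous to one entry of the parameter grid
grid = [
    {"formula": f, "elasticNetParam": a, "regParam": r}
    for f, a, r in product(formulas, elastic_net_values, reg_values)
]
print(len(grid))  # 2 * 3 * 2
```

Because the size is a product, adding one more value to any list multiplies the number of models to train, so grids grow quickly.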

Now that the grid is built, it’s time to specify our evaluation process. The evaluator allows us to automatically and objectively compare multiple models against the same evaluation metric. There are evaluators for classification and regression, covered in later notebooks, but in this case we will use the `BinaryClassificationEvaluator`, which supports a number of evaluation metrics that we’ll discuss in future notebooks. Here we will use `areaUnderROC`, which is the total area under the receiver operating characteristic (ROC) curve, a common measure of classification performance:

In [16]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')
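One way to build intuition for the area under the ROC curve is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The plain-Python sketch below uses that interpretation on made-up labels and scores. It is not how the evaluator computes the metric internally, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, for illustration only
print(pairwise_auc([1.0, 0.0, 1.0, 0.0], [0.9, 0.7, 0.6, 0.2]))
```

A value of 1.0 means every positive example is scored above every negative one, and 0.5 is what random scoring would give on average.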

Now that we have a pipeline that specifies how our data should be transformed, we will perform model selection to try out different hyperparameters in our logistic regression model and measure success by comparing their performance using the areaUnderROC metric.

As we discussed, it is a best practice in machine learning to fit hyperparameters on a validation set (instead of your test set) to prevent overfitting. For this reason, we cannot use our holdout test set (that we created before) to tune these parameters. Luckily, Spark provides two options for performing hyperparameter tuning automatically. We can use **TrainValidationSplit**, which will simply perform an arbitrary random split of our data into two different groups, or **CrossValidator**, which performs K-fold cross-validation by splitting the dataset into k non-overlapping, randomly partitioned folds:

In [17]:
from pyspark.ml.tuning import TrainValidationSplit

tvs = TrainValidationSplit()\
  .setTrainRatio(0.75)\
  .setEstimatorParamMaps(params)\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)
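For contrast with the single split configured above, here is a plain-Python sketch of what K-fold partitioning means: the rows are shuffled once, cut into k non-overlapping folds, and each fold serves as the validation set exactly once. The helper `k_fold_indices` is hypothetical and only illustrates the idea behind CrossValidator, not its implementation.

```python
import random


def k_fold_indices(n_rows, k, seed):
    """Yield (train_idx, validation_idx) pairs for k non-overlapping random folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_rows=10, k=3, seed=843))
for training, validation in splits:
    print(len(training), sorted(validation))
```

The price of the more stable estimate is cost: with k folds, every parameter combination is trained k times instead of once.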

Let’s run the entire pipeline we constructed. To review, running this pipeline will test out every version of the model against the validation set. Note the type of tvsFitted is TrainValidationSplitModel. Any time we fit a given model, it outputs a “model” type:

In [18]:
tvsFitted = tvs.fit(train)

And of course evaluate how it performs on the test set!

In [19]:
evaluator.evaluate(tvsFitted.transform(test))

1.0

In [20]:
bestParams = tvsFitted.bestModel.stages[-1].extractParamMap()
for param in bestParams:
    print(param.name, bestParams[param])

aggregationDepth 2
elasticNetParam 0.0
family auto
featuresCol features
fitIntercept True
labelCol label
maxBlockSizeInMB 0.0
maxIter 100
predictionCol prediction
probabilityCol probability
rawPredictionCol rawPrediction
regParam 0.1
standardization True
threshold 0.5
tol 1e-06


In [34]:
tvsFitted.bestModel.stages[0]

RFormulaModel: uid=RFormula_99142c29dcc3, resolvedFormula=ResolvedRFormula(label=lab, terms=[color,value1,value2,{color,value1},{color,value2}], hasIntercept=true)

### SUMMARY
All the above steps can be summarized in the following cell:

In [21]:
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import BinaryClassificationEvaluator

train, test = df.randomSplit([0.7, 0.3], seed = 843)  # create a holdout set before transformation
rForm = RFormula()  # defining stage 1 by creating an empty R formula
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")  # defining stage 2 by instantiating an instance of LogisticRegression

stages = [rForm, lr]  # setting the stages
pipeline = Pipeline().setStages(stages)  # adding the stages to the pipeline

# building the hyperparameter grid
params = ParamGridBuilder()\
  .addGrid(rForm.formula, [
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2"])\
  .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
  .addGrid(lr.regParam, [0.1, 2.0])\
  .build()

# setting the evaluator as AUC
evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')

# defining Train Validation Split to be used for hyperparameter tuning
tvs = TrainValidationSplit()\
  .setTrainRatio(0.75)\
  .setEstimatorParamMaps(params)\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)

tvsFitted = tvs.fit(train)  # fit the estimator

evaluator.evaluate(tvsFitted.transform(test))  # evaluate the test set (AUC)

1.0

### Persisting and Applying Models

Now that we trained this model, we can persist it to disk to use it for prediction purposes later on:

In [22]:
tvsFitted.bestModel.write().overwrite().save(data + "model/firstModel")

                                                                                

After writing out the model, we can load it into another Spark program to make predictions:

In [23]:
from pyspark.ml import PipelineModel

model = PipelineModel.load(data + "model/firstModel")
prediction = model.transform(test)

                                                                                

In [24]:
prediction.select("probability", "prediction", "label").show(10, False)

+----------------------------------------+----------+-----+
|probability                             |prediction|label|
+----------------------------------------+----------+-----+
|[0.9233050601958284,0.07669493980417164]|0.0       |0.0  |
|[0.9233050601958284,0.07669493980417164]|0.0       |0.0  |
|[0.9233050601958284,0.07669493980417164]|0.0       |0.0  |
|[0.9348943057269108,0.06510569427308921]|0.0       |0.0  |
|[0.9348943057269108,0.06510569427308921]|0.0       |0.0  |
|[0.49777956051405065,0.5022204394859493]|1.0       |0.0  |
|[0.4003383700385894,0.5996616299614106] |1.0       |1.0  |
|[0.4003383700385894,0.5996616299614106] |1.0       |1.0  |
|[0.4003383700385894,0.5996616299614106] |1.0       |1.0  |
|[0.4714656760682543,0.5285343239317457] |1.0       |1.0  |
+----------------------------------------+----------+-----+
only showing top 10 rows
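As a quick manual check, we can compute a few classification metrics from the ten rows displayed above. The sketch below is plain Python, with the (prediction, label) pairs copied by hand from that output. It covers only these ten rows, not the whole test set.

```python
# (prediction, label) pairs copied from the ten rows displayed above
rows = [(0.0, 0.0)] * 5 + [(1.0, 0.0)] + [(1.0, 1.0)] * 4

tp = sum(1 for p, y in rows if p == 1.0 and y == 1.0)  # true positives
fp = sum(1 for p, y in rows if p == 1.0 and y == 0.0)  # false positives
tn = sum(1 for p, y in rows if p == 0.0 and y == 0.0)  # true negatives
fn = sum(1 for p, y in rows if p == 0.0 and y == 1.0)  # false negatives

true_positive_rate = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
accuracy = (tp + tn) / len(rows)
print(tp, fp, tn, fn)
print(true_positive_rate, false_positive_rate, accuracy)
```

The single false positive is the row whose class 1 probability sits just above 0.5, which shows how sensitive borderline rows are to the threshold.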



### Deployment Patterns

In Spark there are several different deployment patterns for putting machine learning models into production. The figure below illustrates common workflows.

<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/09-02-productionization-process.png?raw=true" width="800" align="center"/>


Here are the various options for how you might go about deploying a Spark model. These are the general options you should be able to link to the process illustrated in the figure above.

* Train your machine learning (ML) model offline and then supply it with offline data. In this context, we mean offline data to be data that is stored for analysis, and not data that you need to get an answer from quickly. Spark is well suited to this sort of deployment.

* Train your model offline and then put the results into a database (usually a key-value store). This works well for something like recommendation but poorly for something like classification or regression where you cannot just look up a value for a given user but must calculate one based on the input.

* Train your ML algorithm offline, persist the model to disk, and then use that for serving. This is not a low-latency solution if you use Spark for the serving part, as the overhead of starting up a Spark job can be high, even if you’re not running on a cluster. Additionally this does not parallelize well, so you’ll likely have to put a load balancer in front of multiple model replicas and build out some REST API integration yourself. There are some interesting potential solutions to this problem, but no standards currently exist for this sort of model serving.

* Manually (or via some other software) convert your distributed model to one that can run much more quickly on a single machine. This works well when there is not too much manipulation of the raw data in Spark but can be hard to maintain over time. Again, there are several solutions in progress. For example, MLlib can export some models to PMML, a common model interchange format.

* Train your ML algorithm online and use it online. This is possible when used in conjunction with Structured Streaming, but can be complex for some models.

While these are some of the options, there are many other ways of performing model deployment and management. This is an area under heavy development and many potential innovations are currently being worked on.