 <img src="uva_seal.png"> 

## ML Pipelines

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: October 8, 2024

---  


### SOURCES

- Learning Spark, Chapter 11: Machine Learning with MLlib  
- https://spark.apache.org/docs/latest/ml-pipeline.html  
- http://blog.insightdatalabs.com/spark-pipelines-elegant-yet-powerful/  




### OBJECTIVES
- Introduction to ML Pipelines  


### CONCEPTS

- ML Pipeline
- `DataFrame`
- `Transformer`
- `Estimator`
- `Parameter`

---

**ML Pipelines**

ML Pipelines use the following objects:

**Transformer**  
Transforms one DataFrame into another DataFrame

**Estimator**  
An algorithm that can be fit on a DataFrame to produce a `Transformer` (e.g., `LogisticRegression`, which is fit to produce a `LogisticRegressionModel`)

**Parameter**  
Named, configurable settings of a `Transformer` or `Estimator` (e.g., maximum number of iterations, regularization parameter)

*Setter methods* are available for setting parameters:

**Set Parameters for Logistic Regression Instance**

```
# lr is a LogisticRegression estimator instance
# each setter returns the instance, so the calls can be chained

lr.setMaxIter(10).setRegParam(0.01)
```

**Pipeline**  
A sequential chain of multiple `Transformers` and `Estimators` to specify an ML workflow  

The pipeline in Spark is very similar to the pipeline in scikit-learn.  
It acts as a workflow to keep all steps together from start to finish, for example:
- Data preprocessing
- Feature extraction
- Model fitting
- Model tuning

Keeping track of these steps manually can be painful and error-prone.  
For example, the analyst might accidentally train on the test set.  That would be VERY bad, because the evaluation would no longer measure performance on unseen data.  

Pipelines can be saved, loaded, and applied to any dataset containing the necessary columns of data.  This works in training mode and scoring mode (scoring mode is when we make predictions; it is also called *inference*).

Encapsulating all of the steps in a pipeline makes the scoring code much shorter.  There is no need to code the steps again: load the pipeline, pass the data to it, and make predictions.  Data engineers love this!

---

**Pipeline Schematic**  
Cylinders represent DataFrames.

<img src="ml_pipeline_graph.png">  

**Pipeline example**

In [None]:
# DATA OUTLINE
# train_df  DataFrame of labels (1=like, 0=dislike), restaurant reviews (string), and ratings (integer);
#           used to train the logistic regression model
# test_df   DataFrame with the same fields, set aside for model evaluation
#----------------------------------------------------------------------------------------------

from pyspark.sql import SparkSession

# reuse the active Spark session, or start one if none exists
spark = SparkSession.builder.getOrCreate()

# some training data
train_df = spark.createDataFrame([
    (0, "The food was terrible...and such small portions!", 1),
    (1, "I would eat here EVERY DAY", 5),
    (1, "LOVE LOVE LOVE the tacos!!", 5)
], ["label", "review", "rating"])

train_df.show(truncate=False)

In [None]:
# Configure pipeline stages

from pyspark.ml import Pipeline  
from pyspark.ml.feature import Tokenizer, HashingTF, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# process review data into first feature
tok = Tokenizer(inputCol="review", outputCol="words")  
htf = HashingTF(inputCol="words", outputCol="tf", numFeatures=200)  

# process rating data into second feature
# (the integer rating is treated as a category index and one-hot encoded)
ohe = OneHotEncoder(inputCol="rating", outputCol="rc")

va = VectorAssembler(inputCols=["tf","rc"], outputCol="features")  
lr = LogisticRegression(labelCol='label', featuresCol='features', maxIter=10, regParam=0.01)

# Chain the stages into a pipeline, then fit it on the training data
pipeline = Pipeline(stages=[tok, htf, ohe, va, lr])
model = pipeline.fit(train_df)

In [None]:
# Create test set

test_df = spark.createDataFrame([
    (0, "I would give this place ZERO STARS if I could!", 1),
    (1, "Yum!", 5),
    (1, "Omg the best fries", 5)
], ["label", "review", "rating"])

test_df.show(truncate=False)

In [None]:
# Make predictions on test set
prediction = model.transform(test_df)
prediction.select('label', 'rawPrediction','probability','prediction').show(3, False)

The model gets the first instance wrong, but bear in mind that the training set is tiny.

---

At a high level, the pipeline outlines the steps that will take place sequentially: 

1. The data is processed into features  
2. The features are combined using `VectorAssembler`  
3. The combined features are input to the Logistic Regression model  

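Calling `pipeline.fit(train_df)` executes the workflow on the training data.  

Each step is either a `Transformer` or an `Estimator`.  

`Tokenizer`, `HashingTF`, and `VectorAssembler` are `Transformers`.  
`LogisticRegression` is an `Estimator`; so is `OneHotEncoder` in Spark 3.x (in Spark 2.x it was a `Transformer`).  
Fitting the pipeline returns a `PipelineModel`, which is itself a `Transformer`.  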
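The division of labor between the two stage types can be sketched without Spark. The plain-Python toy below is an illustration only: all class and function names are hypothetical, lists of dicts stand in for DataFrames, and it does not reproduce Spark's actual implementation. It shows the general idea that an `Estimator` stage is fit to produce a `Transformer`, and every stage then transforms the data before passing it on.

```python
class Lowercase:
    """Toy Transformer: maps rows to rows without learning anything."""
    def transform(self, rows):
        return [{**r, "text": r["text"].lower()} for r in rows]


class VocabularyEstimator:
    """Toy Estimator: learns a vocabulary, then returns a Transformer."""
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r["text"].split()})
        return VocabularyModel(vocab)


class VocabularyModel:
    """Toy fitted model: a Transformer that counts vocabulary words."""
    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            {**r, "counts": [r["text"].split().count(w) for w in self.vocab]}
            for r in rows
        ]


def fit_pipeline(stages, rows):
    """Fit each Estimator in order; pass transformed rows to the next stage."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):       # Estimator: fit it to get a Transformer
            stage = stage.fit(rows)
        rows = stage.transform(rows)    # every fitted stage can transform
        fitted.append(stage)
    return fitted


def transform_pipeline(fitted, rows):
    """Scoring mode: apply the already-fitted stages in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


train = [{"text": "Yum Yum"}, {"text": "Best fries"}]
fitted = fit_pipeline([Lowercase(), VocabularyEstimator()], train)
scored = transform_pipeline(fitted, [{"text": "BEST fries yum"}])
print(fitted[1].vocab)
print(scored[0]["counts"])
```

Scoring reuses the vocabulary learned during fitting; nothing is refit on the new rows. This is what makes a fitted pipeline safe to apply to test or production data.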

**Another Pipeline Example:**  
https://spark.apache.org/docs/1.6.0/ml-guide.html  

**Custom Transformers**  
There are many transformers available in `MLlib`  
Users can also create custom transformers.  

`Transformer` requirements:  

1. Implement the `transform` logic (in PySpark, by overriding `_transform`, which the public `transform` method calls)  
2. Specify an `inputCol` and `outputCol`  
3. Accept a DataFrame as input and return a DataFrame as output  


**Saving and Loading Pipeline**  
As mentioned earlier, pipelines can be saved for future use.  
This is helpful in several circumstances, including:  

1. The user wishes to return to model development at a later time  
2. Calling the pipeline to score records in production


**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) **Running the Pipeline**  
i. Copy the pipeline code to the cell below  
ii. Pass some data to the pipeline  
iii. Run the pipeline and show the predictions  
iv. Measure the accuracy on the passed data  