<img src="uva_seal.png">  

## MLlib Regression

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023

---  


### SOURCES
Learning Spark: Machine Learning with MLlib

*Details on regularization equation*  
https://spark.apache.org/docs/1.5.2/ml-linear-methods.html

https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression


### OBJECTIVES
- Introduction to major regression models in MLlib using the DataFrame API

### CONCEPTS

- Linear regression
- VectorAssembler
- RegressionEvaluator

---

### Introduction to Regression

Earlier, we discussed the classification problem, where the response variable is discrete.

Regression is another common form of supervised learning.  
The response variable in a regression problem is quantitative or continuous.  

Several classification models also have regression counterparts, including:

- Support vector machines  
- Tree-based methods like random forests and gradient-boosted trees  

**Spark's distributed processing is particularly helpful for random forests (and for k-fold cross-validation).**  
Because the trees are built independently of one another, the work can be spread across executors and the results aggregated.

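To implement the regression counterpart in the DataFrame API, import from `pyspark.ml.regression` instead of `pyspark.ml.classification` and use the regressor class (for example, `RandomForestRegressor` in place of `RandomForestClassifier`).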

### Linear Regression

Linear regression is the most fundamental model used in regression.

The model assumes a linear relationship between a set of explanatory variables $X$ (also known as features, factors, predictors, or independent variables) and a scalar response variable $Y$.

Linear regression models are most often fit using the *ordinary least squares* (*OLS*) approach.  
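
As a small illustration of OLS outside Spark, the sketch below fits a one-feature model in plain Python using the closed-form solution. The function name `fit_ols_1d` and the toy data are made up for this example; MLlib's own solver is distributed and handles many features.

```python
def fit_ols_1d(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Toy data lying exactly on y = 2x + 1 (illustrative, not from the housing file)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_ols_1d(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

Because the toy points sit exactly on a line, the fitted slope and intercept recover that line.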

A regularization term is often added to the loss function to help with generalization. Examples include:

- ridge regression ($L^2$-norm penalty)
- lasso ($L^1$-norm penalty)
- elastic net (a blend of the ridge and lasso penalties)
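
For reference, the penalized least-squares objective described in the Spark documentation listed under SOURCES can be written as below. The notation is chosen here for illustration: $w$ is the coefficient vector, $x_i$ and $y_i$ are the features and response for row $i$, $n$ is the number of rows, $\lambda \ge 0$ is the regularization strength, and $\alpha \in [0, 1]$ is the mixing weight. The intercept is omitted for brevity.

$$
\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} \left( w^\top x_i - y_i \right)^2 + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)
$$

Setting $\alpha = 0$ leaves only the $L^2$ term (ridge), $\alpha = 1$ leaves only the $L^1$ term (lasso), and values in between give elastic net.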

#### Linear Regression Example

In [1]:
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.master("local").appName("mllib_regression").getOrCreate()

# Load training data, inferring column types and reading the header row
filename = "sample_housing_data.csv"
training = spark.read.csv(filename, inferSchema=True, header=True)
training.show(2)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|          452600.0|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
|          358500.0|       8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
only showing top 2 rows



In the DataFrame API version of MLlib, the features need to be assembled into a single vector column before they are passed to the ML model.

`VectorAssembler` handles this.

In [2]:
from pyspark.ml.feature import VectorAssembler

# inputCols takes a list of column names
# outputCol is an arbitrary name for the new column; generally called "features"

assembler = VectorAssembler(inputCols=["median_income", "total_rooms"],
                            outputCol="features")

tr = assembler.transform(training)
tr.select("*").show(1, truncate=False)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|features      |
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+
|452600.0          |8.3252       |41.0              |880.0      |129.0         |322.0     |126.0     |37.88   |-122.23  |[8.3252,880.0]|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+
only showing top 1 row



Next, we set up the linear regression model and fit it.

In [3]:
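# NOTE: regularization is sensitive to feature scale, so features should be scaled first.
# (LinearRegression's `standardization` parameter, True by default, standardizes them internally.)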

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='features',         # feature vector name
                      labelCol='median_house_value',  # target variable name
                      maxIter=10,
                      regParam=0.3, 
                      elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(tr)

# Print the weights and intercept for linear regression
print("Weights: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

Weights: [19866.979640214464,-9.836075133390931]
Intercept: 261023.86960904166


Now let's measure the model fit.

In [4]:
from pyspark.ml.evaluation import RegressionEvaluator

# Compute predictions; this appends a "prediction" column to the DataFrame
lrPred = lrModel.transform(tr)
lrPred.show(1)

ev = RegressionEvaluator(predictionCol="prediction", labelCol="median_house_value")

print('-'*20)
print("METRICS")
print("Mean Squared Error:", ev.evaluate(lrPred, {ev.metricName: "mse"}))
print("R Squared:", ev.evaluate(lrPred, {ev.metricName:'r2'}))

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+------------------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|      features|        prediction|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+------------------+
|          452600.0|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|[8.3252,880.0]|417764.70239237114|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+--------------+------------------+
only showing top 1 row

--------------------
METRICS
Mean Squared Error: 703796169.4197679
R Squared: 0.6032614088503244


---  

Notice that we extracted a metric such as MSE by:  
1. passing the evaluator a DataFrame containing the label and prediction columns
2. passing a parameter map that sets the evaluator's `metricName` to the desired metric, here `"mse"`

```ev.evaluate(lrPred, {ev.metricName: "mse"})```
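
To make the two metrics concrete, here is a plain-Python sketch of the standard textbook definitions of MSE and R squared. The helper names and the toy numbers are invented for illustration; `RegressionEvaluator` computes its metrics over the distributed DataFrame rather than over Python lists.

```python
def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

def r_squared(labels, predictions):
    """1 - (residual sum of squares / total sum of squares)."""
    y_mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - y_mean) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Made-up labels and predictions (not taken from the housing data)
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.0, 5.0, 8.0, 9.0]

mse = mean_squared_error(labels, predictions)
r2 = r_squared(labels, predictions)
print("Mean Squared Error:", mse)
print("R Squared:", r2)
```

MSE is in squared units of the response, which is why the value reported for house prices above is so large, while R squared is unitless.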

--- 

#### Regularization Parameters

You might have noticed the parameters `regParam` and `elasticNetParam` in the `LinearRegression` constructor call above.  
You can read about them [here](https://spark.apache.org/docs/1.5.2/ml-linear-methods.html).  

The `regParam` parameter sets the overall strength of the penalty, and `elasticNetParam` controls the relative blending of lasso and ridge regression: a value of 0 gives a pure ridge ($L^2$) penalty and a value of 1 gives a pure lasso ($L^1$) penalty.  
These regularization terms often help a model generalize better to new data by reducing overfitting.
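
As a rough illustration of how the two parameters interact, the sketch below evaluates the elastic net penalty term from the Spark documentation for a fixed coefficient vector. The function `elastic_net_penalty` is a hypothetical helper written for this note, not part of PySpark, and the coefficients are made up.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2), with alpha = elastic_net_param."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

# Made-up coefficients: L1 norm is 7, squared L2 norm is 25
weights = [3.0, -4.0]

ridge_penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.0)
lasso_penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=1.0)
blend_penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8)

print("ridge only :", ridge_penalty)
print("lasso only :", lasso_penalty)
print("0.8 blend  :", blend_penalty)
```

The three calls differ only in `elastic_net_param`, which shows that the same `reg_param` can produce quite different penalties depending on the blend.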

---

#### Other Regression Models using the DataFrame API

PySpark supports several regression models including:  
- Generalized linear regression
- Decision tree regression
- Random forest regression
- Gradient-boosted tree regression

For more details, including code examples, please see [here](https://spark.apache.org/docs/latest/ml-classification-regression.html)

---

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) Copy the Linear Regression example from above, and modify it in the cells below to fit and evaluate these two models:  

i. Lasso Regression

ii. Ridge Regression

2) Think of at least one real-world example of when you would need to implement each of the following tasks:  
- regression
- binary classification
- multiclass classification
- multilabel classification

If you are not sure about the difference between multiclass and multilabel, here is one resource:

https://scikit-learn.org/stable/modules/multiclass.html