# Nonlinear Regression Modelling

In this tutorial, we'll recap Pipelines from the previous one, and develop and evaluate two more regression modelling algorithms from the PySpark ML module:

- [Decision Tree Regression](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor) and
- [Random Forest Regression](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.regression.RandomForestRegressor)

Remember that regression models predict a numeric value from a vector of input
variables.

In the Power Plant example, we will use the two methods above to create models
that predict PE (power output) from the other four variables:

- AT = Atmospheric Temperature in °C
- V = Exhaust Vacuum Speed
- AP = Atmospheric Pressure
- RH = Relative Humidity
- PE = Power Output. This is the value we are trying to predict given the
measurements above.

# Data preparation
First, let's initialize the Spark environment with the following code:

In [None]:
import pixiedust

To enable monitoring of Spark via the notebook:

In [None]:
pixiedust.enableJobMonitor()

A Spark session named `spark` is now available in this notebook. Check it out here:

In [None]:
spark

## Preprocessing Step 1: Loading data

In [None]:
powerPlantDF = spark.read.csv('../data/powerplant/', header=True, inferSchema = True)

## Preprocessing Step 2: Data splitting

Let's follow the same train-test split as we did in the previous tutorial.

In [None]:
seed = 1800009193

# Split the DataFrame loaded above into a 20% test set and an 80% training set
(split20DF, split80DF) = powerPlantDF.randomSplit([0.2, 0.8], seed)

testSetDF = split20DF.cache()
trainingSetDF = split80DF.cache()

## Step 3: Prepare features
Initialize the VectorAssembler to extract the input variables as features.

In [None]:
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler(
    inputCols=["AT", "V", "AP", "RH"],
    outputCol="features")

# Part 1: Decision Tree Regression

[Decision Tree Learning](https://en.wikipedia.org/wiki/Decision_tree_learning)
uses a [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree) as a
predictive model which maps observations about an item to conclusions about the
item's target value. It is one of the predictive modelling approaches used in
statistics, data mining and machine learning. Decision trees where the target
variable can take continuous values (typically real numbers) are called
regression trees.
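
To make the idea of a regression tree concrete, here is a minimal plain-Python sketch of how a single split could be chosen on one input variable: try each candidate threshold and keep the one that minimises the summed squared error around the mean on each side. The toy data and the helper names `sse` and `best_split` are invented for illustration; this is not Spark's implementation, which works on distributed data and binned candidate splits.

```python
# Toy data, invented for illustration: (temperature, power output) pairs.
samples = [(5.0, 480.0), (10.0, 475.0), (15.0, 465.0),
           (20.0, 450.0), (25.0, 440.0), (30.0, 435.0)]


def sse(values):
    """Sum of squared errors around the mean of `values`."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(data):
    """Return (threshold, total_sse) for the split with the lowest summed SSE."""
    data = sorted(data)
    best = None
    for i in range(1, len(data)):
        # Candidate threshold: midpoint between two neighbouring feature values.
        threshold = (data[i - 1][0] + data[i][0]) / 2
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


threshold, total_sse = best_split(samples)
print(threshold, total_sse)
```

A full tree repeats this search recursively on each side of the split until a stopping rule such as the maximum depth is reached, and each leaf predicts the mean target value of the training rows that fall into it.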

Spark ML Pipeline provides [DecisionTreeRegressor()](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor)
as an implementation of Decision Tree Learning.

In the cell below, we reproduce the learning steps above, now based on the
Spark ML Pipeline API for the Decision Tree Regressor.



### Challenge 1

Create a Decision Tree regressor for estimating the powerplant energy production!

- In the next cell, we create a DecisionTreeRegressor
and set the parameters for the method:
  - Set the name of the prediction column to "Predicted_PE"
  - Set the name of the label column to "PE"
  - Set the name of the features column to "features"
  - Set the maximum depth of the tree to 3
- Create the ML Pipeline and set the stages to the Vectorizer we created
earlier and the DecisionTreeRegressor learner we just created.

Check the [Decision Tree Regressor](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor) documentation, if needed.

In [None]:
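from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor

# Configure the learner: prediction, label and features columns, and tree depth
dt = DecisionTreeRegressor()
dt.setPredictionCol("Predicted_PE")\
  .setLabelCol("PE")\
  .setFeaturesCol("features")\
  .setMaxDepth(3)

# Chain the feature vectorizer and the learner into a single pipeline
dtPipeline = Pipeline()
dtPipeline.setStages([vectorizer, dt])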

### Challenge 2

Instead of guessing what parameters to use, employ _Model Selection_ to find the best model.

Reuse the CrossValidator of the previous tutorial by replacing the
Estimator with our new `dtPipeline`. Keep the rest of the parameters the same, i.e. the number of folds remains 3.

- Build a parameter grid with the
parameter `dt.maxDepth` and a list of the values 2 and 3, and add the grid to
the CrossValidator
- Run the CrossValidator to find the parameters that yield
the best model (i.e. lowest RMSE) and return the best model.

_Note that it will take some time to run the [CrossValidator](https://spark.apache.org/docs/1.6.2/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator)
as it will run almost 50 Spark jobs_
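
To see where the running time comes from, the plain-Python sketch below counts how many times the pipeline is fitted for the grid and fold count used in this challenge. It only enumerates the combinations and trains nothing; the variable names are invented for illustration. Each fit in turn triggers several Spark jobs, so the number of jobs is considerably larger than the number of fits.

```python
from itertools import product

max_depth_values = [2, 3]   # the parameter grid from this challenge
num_folds = 3               # same number of folds as in the previous tutorial

# Cross-validation fits one model per (parameter value, fold) combination ...
cv_fits = list(product(max_depth_values, range(num_folds)))

# ... and then refits the best parameter setting once on the full training set.
total_fits = len(cv_fits) + 1

print(cv_fits)
print(total_fits)
```

Adding more values to the grid, or more parameters, multiplies the number of fits, which is worth keeping in mind before widening the search.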

In [None]:
# Your code here

Now let's see how our tuned DecisionTreeRegressor model's RMSE and \\(r^2\\)
values compare to our tuned LinearRegression model.
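
As a reminder of what the two metrics measure, here is a minimal plain-Python sketch that computes them from a handful of label and prediction pairs. The numbers and the function names `rmse` and `r2` are invented for illustration. In Spark, the same quantities are available through `RegressionEvaluator` by setting `metricName` to `"rmse"` or `"r2"`.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: typical size of a prediction error, in label units."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """Coefficient of determination: share of label variance explained by the model."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration (not taken from the power plant data).
labels = [480.0, 470.0, 450.0, 440.0]
predictions = [478.0, 472.0, 452.0, 438.0]

print(rmse(labels, predictions))
print(r2(labels, predictions))
```

A lower RMSE and an \\(r^2\\) closer to 1 both indicate a better fit, so the two metrics should usually rank the models in the same order.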

Write the code to calculate and print the predictions of the best decision tree model.

In [None]:
# Your code here

# Part 2: Random Forest Regression
[Random forests](https://en.wikipedia.org/wiki/Random_forest) or random decision
tree forests are an ensemble learning method for regression that operates by
constructing a multitude of decision trees at training time and outputting the
mean prediction of the individual trees. Random decision forests correct for
decision trees' habit of overfitting to their training set.
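
The averaging idea can be sketched in plain Python: train several very small trees (depth-1 "stumps") on bootstrap samples of the training data and average their predictions. The toy data and the helper names `fit_stump` and `forest_predict` are invented for illustration. This is a simplified sketch and not Spark's implementation, which also randomises the features considered at each split.

```python
import random

# Toy data, invented for illustration: (feature, target) pairs.
data = [(5.0, 480.0), (10.0, 475.0), (15.0, 465.0),
        (20.0, 450.0), (25.0, 440.0), (30.0, 435.0)]


def mean(values):
    return sum(values) / len(values)


def fit_stump(sample):
    """Fit a depth-1 regression tree: one threshold and one mean per side."""
    xs = sorted({x for x, _ in sample})
    if len(xs) == 1:
        # Degenerate bootstrap sample: fall back to a constant prediction.
        constant = mean([y for _, y in sample])
        return lambda x: constant
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


rng = random.Random(42)  # fixed seed so the sketch is repeatable
trees = []
for _ in range(30):
    # Bootstrap: draw a sample of the same size with replacement.
    bootstrap = [rng.choice(data) for _ in data]
    trees.append(fit_stump(bootstrap))


def forest_predict(x):
    """The forest's prediction is the mean of the individual tree predictions."""
    return mean([tree(x) for tree in trees])


print(forest_predict(12.0))
```

Because each tree sees a slightly different sample, their individual errors partly cancel out in the average, which is why the ensemble tends to overfit less than a single deep tree.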

Spark ML Pipeline provides [RandomForestRegressor()](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.regression.RandomForestRegressor).

### Challenge 3

- In the next cell, create a [RandomForestRegressor()](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.regression.RandomForestRegressor)
- The next step is to set the parameters for the method:
  - Set the name of the prediction column to "Predicted_PE"
  - Set the name of the features column to "features"
  - Set the random number generator seed to 100088121
  - Set the maximum depth to 8
  - Set the number of trees to 30
- Create the ML Pipeline and set the stages to the Vectorizer we created
earlier and the RandomForestRegressor() learner we just created.
- Build a parameter grid with the parameter `rf.maxBins` and a list of the values 50 and 100, and add the grid to
the CrossValidator
- Run the CrossValidator to find the parameters that yield
the best model (i.e. lowest RMSE) and return the best model.

In [None]:
# Create a RandomForestRegressor
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor()

rf.setLabelCol("PE")\
  .setPredictionCol("Predicted_PE")\
  .setFeaturesCol("features")\
  .setSeed(100088121)\
  .setMaxDepth(8)\
  .setNumTrees(30)

# Your code here
  

Now let's see how our tuned RandomForestRegressor model's RMSE and \\(r^2\\)
values compare to our tuned LinearRegression and Decision Tree models.

Write the code to calculate and print the predictions of the best random forest model.

In [None]:
# Your code here

Note: Inspecting the best random forest:  
The line below pulls the Random Forest stage out of the fitted pipeline model `rfModel` and displays
it.


<pre>
print(rfModel.stages[1]._java_obj.toDebugString())
</pre>

In [None]:
# Your code here

**Discussion**

How do the \\(r^2\\) and RMSE values compare for the three models? Which model would you select?



# What comes next?
For your project work you may need other components of the ML module, for
example:

- [Principal Components Analysis](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.feature.PCA)
- [Clustering](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#module-pyspark.ml.clustering)
   - [K-Means](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans)
   - [Gaussian Mixture Models](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture)
- [Classification](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#module-pyspark.ml.classification)
   - [Decision Trees Classifier](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier)
   - [Bayesian models](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.classification.NaiveBayes)
   - [Neural networks](https://spark.apache.org/docs/2.0.0/api/python/pyspark.ml.html#pyspark.ml.classification.MultilayerPerceptronClassifier)

Consult the ML package documentation for those; they follow the same logic as
Regression.
_If you need extra help, ask [Ioannis](mailto:ioannis.athanasiadis@wur.nl)_.