# Predict bike sharing Using MLlib's pipeline and Gradient-Boosted Trees

**Topics**: *Loading CSV into a DataFrame, dropping columns, casting column types, splitting data, plotting a histogram, plotting GroupBy results in a line chart, VectorAssembler, VectorIndexer, GBTRegressor, RegressionEvaluator (RMSE), CrossValidator, ParamGridBuilder, Pipeline, scatter plot of predictions, testing transformers/estimators.*


This Python notebook demonstrates creating an ML Pipeline to preprocess a dataset, train a Machine Learning model, and make predictions. It is adapted from Databricks's examples.

**Data**: The dataset contains bike rental info from 2011 and 2012 in the Capital bikeshare system, plus additional relevant information such as weather.  This dataset is from Fanaee-T and Gama (2013) and is hosted by the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

**Goal**: We want to learn to predict bike rental counts (per hour) from information such as day of the week, weather, season, etc.  Having good predictions of customer demand allows a business or service to prepare and increase supply as needed.

**Approach**: We will use Spark ML Pipelines, which help users piece together parts of a workflow such as feature processing and model training.  We will also demonstrate [model selection (a.k.a. hyperparameter tuning)](http://spark.apache.org/docs/1.6.0/ml-guide.html) using [Cross Validation](http://spark.apache.org/docs/1.6.0/api/python/pyspark.ml.html) in order to fine-tune and improve our ML model.

## Step 1: Load and understand the data

1\. Download our data from [http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip) and unzip it. We will load the data file `hour.csv` as a dataframe.  We also cache the data so that we only read it from disk once.

In [None]:
%%bash

rm -f *.zip
wget -nv http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip

2022-10-23 04:40:45 URL:http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip [279992/279992] -> "Bike-Sharing-Dataset.zip" [1]


In [None]:
%%bash
unzip Bike-Sharing-Dataset.zip

Archive:  Bike-Sharing-Dataset.zip
  inflating: Readme.txt              
  inflating: day.csv                 
  inflating: hour.csv                


In [None]:
%%bash
head hour.csv

instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0,3,13,16
2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0,8,32,40
3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0,5,27,32
4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0,3,10,13
5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0,0,1,1
6,2011-01-01,1,0,1,5,0,6,0,2,0.24,0.2576,0.75,0.0896,0,1,1
7,2011-01-01,1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0,2,0,2
8,2011-01-01,1,0,1,7,0,6,0,1,0.2,0.2576,0.86,0,1,2,3
9,2011-01-01,1,0,1,8,0,6,0,1,0.24,0.2879,0.75,0,1,7,8


In [None]:
hour = spark.read.csv('file:/databricks/driver/hour.csv', inferSchema=True, header=True)

In [None]:
hour.cache()

Out[5]: DataFrame[instant: int, dteday: timestamp, season: int, yr: int, mnth: int, hr: int, holiday: int, weekday: int, workingday: int, weathersit: int, temp: double, atemp: double, hum: double, windspeed: double, casual: int, registered: int, cnt: int]

#### Data description

From the [UCI ML Repository description](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset), we know that the columns have the following meanings.

|field|description|
|--|--|
|instant| record index|
|dteday| date|
|season| season (1:spring, 2:summer, 3:fall, 4:winter)|
|yr| year (0:2011, 1:2012)|
|mnth| month (1 to 12)|
|hr| hour (0 to 23)|
|holiday| whether day is holiday or not|
|weekday| day of the week|
|workingday| if day is neither weekend nor holiday is 1, otherwise is 0.|
|weathersit| 1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog|
|temp| Normalized temperature in Celsius. The values are derived via `(t-t_min)/(t_max-t_min)`, `t_min=-8`, `t_max=+39` (only in hourly scale)|
|atemp| Normalized feeling temperature in Celsius. The values are derived via `(t-t_min)/(t_max-t_min)`, `t_min=-16`, `t_max=+50` (only in hourly scale) |
|hum| Normalized humidity. The values are divided by 100 (max)|
|windspeed| Normalized wind speed. The values are divided by 67 (max)|
|casual | count of casual users|
|registered | count of registered users|
|cnt | count of total rental bikes including both casual and registered|
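The normalization formulas in the table can be inverted to recover physical units. The sketch below does this for a single row; the function name `denormalize_row` and the dictionary layout are illustrative assumptions, not part of the dataset or of Spark.

```python
# Constants taken from the data description table above.
TEMP_MIN, TEMP_MAX = -8.0, 39.0       # Celsius
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0    # Celsius
HUM_MAX = 100.0
WINDSPEED_MAX = 67.0


def denormalize_row(row):
    """Invert the scaling; `row` is a hypothetical dict of normalized values."""
    return {
        "temp_c": row["temp"] * (TEMP_MAX - TEMP_MIN) + TEMP_MIN,
        "atemp_c": row["atemp"] * (ATEMP_MAX - ATEMP_MIN) + ATEMP_MIN,
        "hum_pct": row["hum"] * HUM_MAX,
        "windspeed": row["windspeed"] * WINDSPEED_MAX,
    }


# Normalized values from the first data row of hour.csv shown earlier.
first_row = {"temp": 0.24, "atemp": 0.2879, "hum": 0.81, "windspeed": 0.0}
physical = denormalize_row(first_row)
print(physical)
```

For that first row the normalized `temp` of 0.24 corresponds to roughly 3.3 degrees Celsius, which is plausible for midnight on January 1.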

2\. Now display some basic info about the data
- a few rows of the data
- schema
- number of rows

Note, most columns should be numerical ones. If not, you should adjust your data loading statement.

In [None]:
hour.limit(10).display()

instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
1,2011-01-01T00:00:00.000+0000,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
2,2011-01-01T00:00:00.000+0000,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
3,2011-01-01T00:00:00.000+0000,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
4,2011-01-01T00:00:00.000+0000,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
5,2011-01-01T00:00:00.000+0000,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1
6,2011-01-01T00:00:00.000+0000,1,0,1,5,0,6,0,2,0.24,0.2576,0.75,0.0896,0,1,1
7,2011-01-01T00:00:00.000+0000,1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0.0,2,0,2
8,2011-01-01T00:00:00.000+0000,1,0,1,7,0,6,0,1,0.2,0.2576,0.86,0.0,1,2,3
9,2011-01-01T00:00:00.000+0000,1,0,1,8,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
10,2011-01-01T00:00:00.000+0000,1,0,1,9,0,6,0,1,0.32,0.3485,0.76,0.0,8,6,14


In [None]:
hour.printSchema()

root
 |-- instant: integer (nullable = true)
 |-- dteday: timestamp (nullable = true)
 |-- season: integer (nullable = true)
 |-- yr: integer (nullable = true)
 |-- mnth: integer (nullable = true)
 |-- hr: integer (nullable = true)
 |-- holiday: integer (nullable = true)
 |-- weekday: integer (nullable = true)
 |-- workingday: integer (nullable = true)
 |-- weathersit: integer (nullable = true)
 |-- temp: double (nullable = true)
 |-- atemp: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- windspeed: double (nullable = true)
 |-- casual: integer (nullable = true)
 |-- registered: integer (nullable = true)
 |-- cnt: integer (nullable = true)



In [None]:
hour.count()

Out[8]: 17379

This dataset is nicely prepared for Machine Learning: values such as weekday are already indexed, and all of the columns except for the date (`dteday`) are numeric.

### Preprocess data

**label**: We want to predict the count of bike rentals, hence the `cnt` column is our label.

**Features**: We can use the rest as features, except these:
* `casual`, `registered`: The `cnt` column equals the sum of the `casual` + `registered` columns. Unless we are interested in separating `casual` and `registered`, these two are not useful to us. 
* `dteday`:  We will discard it because it is well-represented by the other date-related columns `season`, `yr`, `mnth`, and `weekday`. 
* `instant`: This will not be used for analysis.
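The claim that `cnt` equals `casual` + `registered` is easy to spot-check. The sketch below does so in plain Python on the first rows of `hour.csv` shown earlier; on the full DataFrame, an equivalent filter on rows where the sum differs from `cnt` should return nothing.

```python
# (casual, registered, cnt) triples copied from the first rows of hour.csv.
sample_rows = [
    (3, 13, 16),
    (8, 32, 40),
    (5, 27, 32),
    (3, 10, 13),
    (0, 1, 1),
    (0, 1, 1),
    (2, 0, 2),
    (1, 2, 3),
    (1, 7, 8),
]

# Collect any row that violates cnt == casual + registered.
mismatches = [row for row in sample_rows if row[0] + row[1] != row[2]]
print("rows where casual + registered != cnt:", len(mismatches))
```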

3\. Drop `instant`,`dteday`,`casual`,`registered` columns and verify the content and schema

In [None]:
hour2 = hour.drop("instant","dteday",'casual','registered')
hour2.limit(5).display()
hour2.printSchema()

season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,16
1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,40
1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,32
1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,13
1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,1


root
 |-- season: integer (nullable = true)
 |-- yr: integer (nullable = true)
 |-- mnth: integer (nullable = true)
 |-- hr: integer (nullable = true)
 |-- holiday: integer (nullable = true)
 |-- weekday: integer (nullable = true)
 |-- workingday: integer (nullable = true)
 |-- weathersit: integer (nullable = true)
 |-- temp: double (nullable = true)
 |-- atemp: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- windspeed: double (nullable = true)
 |-- cnt: integer (nullable = true)



5\. Split the dataset randomly to keep 70% for training and 30% for testing. Then verify the count of each dataset

- For grading purposes, please set the seed to 0 when you call `randomSplit` so that we all get the same split.

In [None]:
train, test = hour2.randomSplit([0.7, 0.3], seed=0)

In [None]:
print("train {}, test {}".format(train.count(), test.count()))

train 12081, test 5298
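Note that the split is not exactly 70/30. `randomSplit` assigns each row at random according to the weights, so the resulting sizes only approximate the requested proportions. A quick arithmetic check on the counts printed above:

```python
# Counts printed by the notebook: train, test, and the full dataset.
train_count, test_count, total_count = 12081, 5298, 17379

# The two parts together must cover every row exactly once.
covers_all_rows = (train_count + test_count == total_count)

train_fraction = train_count / total_count
print(f"covers all rows: {covers_all_rows}, train fraction: {train_fraction:.4f}")
```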


## Step 2. Visualize our data

Now that we have preprocessed our features and prepared a training dataset, we can quickly visualize our data to get a sense of the value distributions.

### Histogram of variables

The purpose of the histogram is to view the distribution of a variable (e.g. whether it is skewed, whether there are outliers).

We can **approximate histogram** with group-by-and-count.
- For categorical variables, a group by and count would do.
- For numerical variables, we can first bin the data using rounding, and then do a group by and count. 
- In each case, we should sort by the variable and produce a barchart.
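The round-then-count idea is independent of Spark. Here is a minimal plain-Python sketch of it, using a handful of made-up normalized temperatures (the values are for illustration only):

```python
from collections import Counter

# Illustrative normalized temperature values, similar to the `temp` column.
temps = [0.16, 0.18, 0.2, 0.1, 0.22, 0.14, 0.22, 0.18, 0.08, 0.14]

# Bin by rounding to one decimal place, then count the members of each bin.
bins = Counter(round(t, 1) for t in temps)

# Sort by the binned value and draw a crude text bar chart.
for value, count in sorted(bins.items()):
    print(f"{value:.1f} {'#' * count}")
```

The bin width is set by the rounding precision: rounding to one decimal place yields bins of width 0.1, which suits a variable normalized to the range 0 to 1.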

You may also leverage the **built-in histogram chart** of Databricks.  Keep in mind, however, that the chart is drawn from data collected to the driver, so a large dataset can be a problem.
- You may sample the column before collecting it.

6\. Display approximate histograms for `hr` (hour) and `temp` (temperature)

In [None]:
train.select("hr").groupBy("hr").count().orderBy("hr").display()

hr,count
0,505
1,490
2,503
3,489
4,481
5,490
6,505
7,516
8,497
9,499


In [None]:
from pyspark.sql.functions import round 
train.select(round(train.temp, 1).alias("temp_round")).groupBy("temp_round").count().orderBy("temp_round").display()

temp_round,count
0.0,24
0.1,202
0.2,1153
0.3,1892
0.4,1880
0.5,1748
0.6,2100
0.7,1976
0.8,852
0.9,243


7\. Plot the histogram for `temp` using databricks' built-in histogram chart.

- sample the data first with a 0.1 ratio, without replacement.

In [None]:
train.select("temp").sample(False, 0.1).display()

temp
0.16
0.18
0.2
0.1
0.22
0.14
0.22
0.18
0.08
0.14


### Examine the relationship between the feature and label.

It may also be useful to examine the relationship between a feature and the response. 

This may reveal whether a feature is predictive of the response variable (i.e. whether the latter systematically changes with the feature) and whether the relationship appears linear.

- for categorical/binary features, a group-by-and-average approach can produce a chart that captures this relationship. 
- for numerical features, you may want to draw an x-y scatter plot. We can leverage Databricks' built-in scatter plot, but keep in mind that too many data points may overwhelm the driver node. I recommend using sampling to control the number of data points.


8\. Draw the relationship between `hr` and `cnt` in a line chart.  Then repeat it for `mnth` and `cnt`.

- Does the `cnt` variable appear to change systematically with the feature?
- Does the relationship appear to be linear?

In some models such as linear regression, you need to enter a feature differently if it has a nonlinear relationship with the response variable (e.g., bucketize it or add square terms). In decision tree models, however, the trees can capture nonlinear relationships well, so no need for special processing of nonlinear relationships.
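To make the two options for linear models concrete, here is a plain-Python sketch of bucketizing the hour and of adding a square term. The bucket boundaries and labels are hypothetical choices for illustration; in a Spark pipeline, a transformer such as `Bucketizer` from `pyspark.ml.feature` would play the same role.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries for hour of day, chosen only for illustration.
HOUR_SPLITS = [0, 6, 10, 16, 20, 24]
HOUR_LABELS = ["night", "morning", "midday", "evening", "late"]


def bucketize_hour(hr):
    """Return the index of the half-open bucket [lower, upper) containing hr."""
    if not 0 <= hr < 24:
        raise ValueError("hr must be in the range 0..23")
    return bisect_right(HOUR_SPLITS, hr) - 1


def with_square_term(x):
    """Return (x, x squared) so a linear model can fit a quadratic curve in x."""
    return (x, x * x)


for hr in (0, 8, 17, 23):
    print(hr, HOUR_LABELS[bucketize_hour(hr)])
```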

In [None]:
train.select("hr",'cnt').groupBy("hr").avg("cnt").orderBy("hr").display()

hr,avg(cnt)
0,50.20990099009901
1,35.00816326530612
2,23.308151093439363
3,11.81799591002045
4,6.3056133056133055
5,19.757142857142856
6,75.8039603960396
7,214.7829457364341
8,355.9959758551308
9,221.0


In [None]:
train.select("mnth",'cnt').groupBy("mnth").avg("cnt").orderBy("mnth").display()

mnth,avg(cnt)
1,96.1831831831832
2,114.74020156774915
3,154.8230769230769
4,184.822
5,223.72586872586876
6,245.09428284854565
7,236.9560117302053
8,231.82079207920796
9,242.00289855072464
10,222.31892411143133


`hum` (humidity) and `temp` (temperature) are continuous variables. We plot their relationship with `cnt` via scatter plots.

9\. Scatter plots for `hum` and `temp`

- Draw a scatter plot for `hum` and `cnt` (after you sample the data)
- Repeat it for `temp` and `cnt` (after you sample the data)

In [None]:
train.select("hum","cnt").sample(False, 0.1).display()

hum,cnt
0.69,9
0.93,3
0.81,16
0.86,1
0.5,1
0.64,2
0.54,3
0.5,4
0.48,5
0.55,5


In [None]:
train.select("temp","cnt").sample(False, 0.1).display()

temp,cnt
0.1,25
0.18,11
0.26,23
0.1,5
0.22,20
0.02,18
0.26,2
0.2,1
0.2,2
0.26,1


## Step 3. Train a Machine Learning Pipeline

Now that we have understood our data and prepared it as a DataFrame with numeric values, let's learn an ML model to predict bike sharing rentals in the future.  Most ML algorithms expect to predict a single "label" column (`cnt` for our dataset) using a single "features" column of feature vectors.  

We will put together a simple Pipeline with the following stages (click on the links to view the official documentation):
* [VectorAssembler](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html): Assemble the feature columns into a feature vector.
* [VectorIndexer](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorIndexer.html): Identify columns which should be treated as categorical and indexed (some algorithms will accommodate categorical variables and treat them differently from ordinal variables).  This is done heuristically, identifying any column with a small number of distinct values as being categorical.  For us, this will be `yr` (2 values), `season` (4 values), `holiday` (2 values), `workingday` (2 values), and `weathersit` (4 values).
* [GBTRegressor](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GBTRegressor.html): This will use the Gradient-Boosted Trees (GBT) algorithm to learn how to predict rental counts from the feature vectors.
* [RegressionEvaluator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.RegressionEvaluator.html): This specifies the evaluation metric used for the regression model.
* [CrossValidator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html): The GBT algorithm has several hyperparameters, and tuning them to our data can improve accuracy.  We will do this tuning using Spark's Cross Validation framework, which automatically tests a grid of hyperparameters and chooses the best.

![Image of Pipeline](https://i.ibb.co/d0LqYwP/GBTpipeline.png)

First, we define the feature processing stages of the Pipeline:

10\. Create a `VectorAssembler` that assembles the feature columns into a feature vector with column name `rawFeatures`. Name it `va`. Note that all fields in `train` will be used except for `cnt`.

In [None]:
from pyspark.ml.feature import VectorAssembler, VectorIndexer

featureCols = train.columns
featureCols.remove("cnt")

va = VectorAssembler(inputCols=featureCols, outputCol='rawFeatures')

11\. Verify your assembler by transforming `train` to a new dataframe called `train_va` and show some sample rows from it.

In [None]:
train_va = va.transform(train)
train_va.limit(3).display()

season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt,rawFeatures
1,0,1,0,0,0,0,1,0.1,0.0758,0.42,0.3881,25,"Map(vectorType -> dense, length -> 12, values -> List(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.1, 0.0758, 0.42, 0.3881))"
1,0,1,0,0,0,0,1,0.16,0.1818,0.8,0.1045,33,"Map(vectorType -> dense, length -> 12, values -> List(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.16, 0.1818, 0.8, 0.1045))"
1,0,1,0,0,0,0,1,0.26,0.303,0.56,0.0,39,"Map(vectorType -> sparse, length -> 12, indices -> List(0, 2, 7, 8, 9, 10), values -> List(1.0, 1.0, 1.0, 0.26, 0.303, 0.56))"
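The third row above is shown as a sparse vector while the first two are dense. Both forms encode the same 12 features; the sparse form simply lists the non-zero positions and their values. The helper below is a hypothetical plain-Python illustration of how the sparse form expands back into the dense one, not a Spark API.

```python
def sparse_to_dense(length, indices, values):
    """Expand a sparse (length, indices, values) triple into a dense list."""
    dense = [0.0] * length
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


# The sparse vector from the third row of the output above.
third_row = sparse_to_dense(12, [0, 2, 7, 8, 9, 10],
                            [1.0, 1.0, 1.0, 0.26, 0.303, 0.56])
print(third_row)
```

The expanded list matches the original columns of that row (`season`=1, `mnth`=1, `weathersit`=1, `temp`=0.26, `atemp`=0.303, `hum`=0.56, and zeros elsewhere).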


12\. So far the variables are all presumed to be ordinal, but this does not make sense for certain variables such as season. We use `VectorIndexer` to add metadata to the variables in the features vector so that some are designated as categorical.

Create a VectorIndexer that indexes all features in `rawFeatures` with <= 4 distinct values. The output column should be called `features`, and the VectorIndexer should be called `vi`.

In [None]:
vi = VectorIndexer(inputCol='rawFeatures', outputCol="features", maxCategories=4)

13\. Verify your VectorIndexer by transforming `train_va` to a new dataframe called `train_vi` and show some sample rows from it.

- you may observe that some values in the `rawFeatures` and `features` have changed (due to zero-based indexing)
- additionally, you may view the metadata for `features` and `rawFeatures` using  `.schema.fields[n].metadata`

In [None]:
train_vi = vi.fit(train_va).transform(train_va)
train_vi.select("rawFeatures","features").limit(3).display()

rawFeatures,features
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.1, 0.0758, 0.42, 0.3881))","Map(vectorType -> dense, length -> 12, values -> List(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0758, 0.42, 0.3881))"
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.16, 0.1818, 0.8, 0.1045))","Map(vectorType -> dense, length -> 12, values -> List(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16, 0.1818, 0.8, 0.1045))"
"Map(vectorType -> sparse, length -> 12, indices -> List(0, 2, 7, 8, 9, 10), values -> List(1.0, 1.0, 1.0, 0.26, 0.303, 0.56))","Map(vectorType -> sparse, length -> 12, indices -> List(0, 2, 7, 8, 9, 10), values -> List(0.0, 1.0, 0.0, 0.26, 0.303, 0.56))"


In [None]:
train_vi.schema.fields[-1].metadata

Out[23]: {'ml_attr': {'attrs': {'numeric': [{'idx': 2, 'name': 'mnth'},
    {'idx': 3, 'name': 'hr'},
    {'idx': 5, 'name': 'weekday'},
    {'idx': 8, 'name': 'temp'},
    {'idx': 9, 'name': 'atemp'},
    {'idx': 10, 'name': 'hum'},
    {'idx': 11, 'name': 'windspeed'}],
   'nominal': [{'ord': False,
     'vals': ['1.0', '2.0', '3.0', '4.0'],
     'idx': 0,
     'name': 'season'},
    {'ord': False,
     'vals': ['1.0', '2.0', '3.0', '4.0'],
     'idx': 7,
     'name': 'weathersit'}],
   'binary': [{'vals': ['0.0', '1.0'], 'idx': 1, 'name': 'yr'},
    {'vals': ['0.0', '1.0'], 'idx': 4, 'name': 'holiday'},
    {'vals': ['0.0', '1.0'], 'idx': 6, 'name': 'workingday'}]},
  'num_attrs': 12}}

In [None]:
train_vi.schema.fields[-2].metadata

Out[24]: {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'season'},
    {'idx': 1, 'name': 'yr'},
    {'idx': 2, 'name': 'mnth'},
    {'idx': 3, 'name': 'hr'},
    {'idx': 4, 'name': 'holiday'},
    {'idx': 5, 'name': 'weekday'},
    {'idx': 6, 'name': 'workingday'},
    {'idx': 7, 'name': 'weathersit'},
    {'idx': 8, 'name': 'temp'},
    {'idx': 9, 'name': 'atemp'},
    {'idx': 10, 'name': 'hum'},
    {'idx': 11, 'name': 'windspeed'}]},
  'num_attrs': 12}}
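The effect of `maxCategories` and the zero-based re-indexing seen in step 13 can be imitated in plain Python. The sketch below is a simplified stand-in for the heuristic, not Spark's actual implementation; the function name and the column-dictionary layout are assumptions made for illustration.

```python
def index_categorical(columns, max_categories=4):
    """Re-index columns with at most `max_categories` distinct values.

    `columns` maps a column name to its list of values. Columns with few
    distinct values are treated as categorical and mapped to 0-based
    indices in ascending order; all other columns are left untouched.
    """
    indexed, mappings = {}, {}
    for name, values in columns.items():
        distinct = sorted(set(values))
        if len(distinct) <= max_categories:
            mapping = {v: float(i) for i, v in enumerate(distinct)}
            mappings[name] = mapping
            indexed[name] = [mapping[v] for v in values]
        else:
            indexed[name] = list(values)
    return indexed, mappings


toy = {"season": [1, 2, 3, 4, 1], "hr": [0, 1, 2, 3, 4, 5]}
indexed, mappings = index_categorical(toy, max_categories=4)
print(indexed["season"], sorted(mappings))
```

Here `season` has four distinct values and is shifted to start at zero, while `hr` has more than four and stays continuous, mirroring the `nominal` and `numeric` groups in the metadata above.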

14\. Define the GBTRegressor. 

- `GBTRegressor` takes `features` as feature vectors and `cnt` as labels. 
- Save the resulting GBTRegressor as `gbt`.

In [None]:
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(labelCol='cnt')

15\. We create an evaluator `e` and specify RMSE as the evaluation metric

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

e = RegressionEvaluator(metricName="rmse", labelCol='cnt')
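For reference, RMSE is the square root of the mean squared difference between labels and predictions, $\sqrt{\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2}$. A minimal plain-Python version of that formula (the function name is illustrative; the evaluator computes the same quantity on a DataFrame):

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two toy predictions that miss by 3 and 4 respectively.
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones; the result is in the same unit as the label (rentals per hour here).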

We wrap the model training stage within a `CrossValidator` stage.  `CrossValidator` knows how to call the GBT algorithm with different hyperparameter settings.  It will train multiple models and choose the best one by minimizing some metric. In this example, our metric is the **Root Mean Squared Error (RMSE)** configured in the evaluator `e` above.

16\. Create a parameter grid `paramGrid` that explores two values for GBT's `maxDepth` parameter: 5 (default) and 8

  - `maxDepth`: max depth of each decision tree in the GBT ensemble
  - `maxIter`: iterations, i.e., number of trees in each GBT ensemble

In this example, we keep these values small.  In practice, to get the highest accuracy, you would likely want to try deeper trees (10 or higher) and more trees in the ensemble (>100).

In [None]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = ParamGridBuilder().addGrid(gbt.maxDepth, [5,8]).build()
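The grid above varies only one parameter, so it holds two settings. When several parameters are added, the grid contains every combination of their values. The plain-Python sketch below shows that expansion; the `maxIter` values are hypothetical and `expand_grid` is an illustrative helper, not a Spark function.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# maxDepth values as in the notebook; maxIter values are made up.
param_maps = expand_grid({"maxDepth": [5, 8], "maxIter": [20, 50]})
for param_map in param_maps:
    print(param_map)
```

The grid size is the product of the list lengths, so each additional parameter multiplies the amount of training work.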

17\. Create a CrossValidator `cv`, which uses `gbt` as estimator, `e` as evaluator and `paramGrid` as parameter maps.

In [None]:
cv = CrossValidator(estimator=gbt, evaluator=e, estimatorParamMaps=paramGrid)

Finally, we can tie our feature processing and model training stages together into a single `Pipeline`.

18\. Create a Pipeline `pipeline` according to the above diagram.

In [None]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages = [va, vi, cv])

#### Train the Pipeline

Now that we have set up our workflow, we can train the Pipeline in a single call.  Calling `fit()` runs feature processing, model tuning, and training, and returns a fitted Pipeline with the best model found.

***Note***: This next cell can take up to **10 minutes**.  This is because it is training *a lot* of models:
* For each random sample of data in Cross Validation,
  * For each setting of the hyperparameters,
    * `CrossValidator` is training a separate GBT ensemble which contains many Decision Trees.
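A rough count of how many GBT ensembles get trained, assuming `CrossValidator` keeps its default of three folds (the notebook does not set `numFolds`) and refits the winning setting once on the full training set. The helper name is illustrative.

```python
def count_model_fits(num_param_maps, num_folds, refit_best=True):
    """Number of model fits: one per (fold, setting) pair, plus a final refit."""
    return num_param_maps * num_folds + (1 if refit_best else 0)


# Two maxDepth settings, three folds (assumed default), one final refit.
fits = count_model_fits(num_param_maps=2, num_folds=3)
print("GBT ensembles trained:", fits)
```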

19\. Train the pipeline and save the resulting model as `pipelinemodel`

In [None]:
pipelinemodel = pipeline.fit(train)

20\. View the best model parameters

- You can access the stages of `pipelinemodel` using `.stages[n]`
- The CrossValidator stage has a `bestModel` property that returns the best model (a transformer)
- Transformers have an `explainParams` method.

In [None]:
print(pipelinemodel.stages[2].bestModel.explainParams())

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 

In [None]:
print(pipelinemodel.stages[1].explainParams())

handleInvalid: How to handle invalid data (unseen labels or NULL values). Options are 'skip' (filter out rows with invalid data), 'error' (throw an error), or 'keep' (put invalid data in a special additional bucket, at index of the number of categories of the feature). (default: error)
inputCol: input column name. (current: rawFeatures)
maxCategories: Threshold for the number of values a categorical feature can take (>= 2). If a feature is found to have > maxCategories values, then it is declared continuous. (default: 20, current: 4)
outputCol: output column name. (default: VectorIndexer_230134507244__output, current: features)


## Step 4. Make predictions and evaluate the model performance on test data

Our final step will be to use our fitted model to make predictions on new data.  We will use our held-out test set, but you could also use this model to make predictions on completely new data.  For example, if we created some features data based on weather predictions for the next week, we could predict bike rentals expected during the next week!

We will also evaluate our predictions.  Computing evaluation metrics is important for understanding the quality of predictions, as well as for comparing models and tuning parameters.

21\. Call `transform()` on the `test` dataset to obtain predictions, saving the resulting DataFrame as `predictions`

In [None]:
predictions = pipelinemodel.transform(test)

In [None]:
predictions.printSchema()

root
 |-- season: integer (nullable = true)
 |-- yr: integer (nullable = true)
 |-- mnth: integer (nullable = true)
 |-- hr: integer (nullable = true)
 |-- holiday: integer (nullable = true)
 |-- weekday: integer (nullable = true)
 |-- workingday: integer (nullable = true)
 |-- weathersit: integer (nullable = true)
 |-- temp: double (nullable = true)
 |-- atemp: double (nullable = true)
 |-- hum: double (nullable = true)
 |-- windspeed: double (nullable = true)
 |-- cnt: integer (nullable = true)
 |-- rawFeatures: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- prediction: double (nullable = false)



We expect `predictions` to have a new column `prediction` (as well as intermediate results such as our `rawFeatures` column from previous steps). It is easier to view the results when we limit the columns displayed to:
* `cnt`: the true count of bike rentals
* `prediction`: our predicted count of bike rentals
* feature columns: our original (human-readable) feature columns

22\. Display `cnt`, `prediction`, and original feature columns, limit results to 10 rows

In [None]:
predictions.select("cnt", "prediction", *featureCols).limit(10).display()

Are these good results?  They are not perfect, but you can see correlation between the counts and predictions.  And there is room to improve---see the next section for ideas to take you further!

Before we continue, we give two tips on understanding results:

**(1) Metrics**: Manually viewing the predictions gives intuition about accuracy, but it can be useful to have a more concrete metric.  Below, we compute an evaluation metric which tells us how well our model makes predictions on all of our data.  In this case (for [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation)), lower is better.  This metric can be used to compare different models.  (This is what `CrossValidator` does internally.)

22\. Obtain the RMSE for predictions

In [None]:
rmse = e.evaluate(predictions)
rmse

Out[41]: 48.92290529101954

**(2) Visualization**: Plotting predictions against the true labels, or against individual features, can help us make sure that the model "understands" the input features and is using them properly to make predictions.  Below, we plot predictions against the true counts; for a good model the points cluster around the diagonal.

23\. Plot a scatter plot on a 10% sample of (`cnt`,`prediction`) pair.

In [None]:
predictions.select("cnt", "prediction").sample(False, 0.1).display()

cnt,prediction
22,25.485062785061995
40,20.724314524132836
3,3.427242585656655
3,3.3448830348265486
5,0.4658875612124294
3,-26.614170358068456
88,130.88790970904125
185,209.40737937205864
202,221.33982861060812
219,236.96097741836957


#### Improving our model

There are several ways we could further improve our model:
* **Expert knowledge**: We may not be experts on bike sharing programs, but we know a few things we can use:
  * The count of rentals cannot be negative.  `GBTRegressor` does not know that, but we could threshold the predictions to be `>= 0` post-hoc.
  * The count of rentals is the sum of `registered` and `casual` rentals.  These two counts may have different behavior.  (Frequent cyclists and casual cyclists probably rent bikes for different reasons.)  The best models for this dataset take this into account.  Try training one GBT model for `registered` and one for `casual`, and then add their predictions together to get the full prediction.
* **Better tuning**: To make this notebook run quickly, we only tried a few hyperparameter settings.  To get the most out of our data, we should test more settings.  Start by increasing the number of trees in our GBT model by setting `maxIter=200`; it will take longer to train but can be more accurate.
* **Feature engineering**: We used the basic set of features given to us, but we could potentially improve them.  For example, we may guess that weather is more or less important depending on whether it is a workday or a weekend.  To take advantage of that, we could build a few new features by combining those two base features.  MLlib provides a suite of feature transformers; find out more in the [ML guide](http://spark.apache.org/docs/latest/ml-features.html).
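Two of the ideas above are simple enough to sketch in plain Python: clipping negative predictions at zero, and combining `workingday` with `weathersit` into a single interaction category. Both helpers are hypothetical illustrations; in Spark the clipping could be done with a column expression such as `pyspark.sql.functions.greatest`.

```python
def clip_at_zero(predictions):
    """Replace negative predicted counts with 0.0, since rentals cannot be negative."""
    return [max(0.0, p) for p in predictions]


def interaction_index(workingday, weathersit, num_weather_levels=4):
    """Combine workingday (0 or 1) and weathersit (1..4) into one category 0..7."""
    return workingday * num_weather_levels + (weathersit - 1)


# Three predictions taken from the sample output above, one of them negative.
sample_predictions = [25.485062785061995, 3.427242585656655, -26.614170358068456]
clipped = clip_at_zero(sample_predictions)
print(clipped)

# Every (workingday, weathersit) pair maps to its own category.
print(sorted(interaction_index(w, s) for w in (0, 1) for s in (1, 2, 3, 4)))
```

Clipping can only lower the error on rows where the model predicted below zero, because the true count is never negative. The interaction category lets a tree split on "bad weather on a working day" directly instead of needing two consecutive splits.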