# Building a Linear Regression Model with PySpark and MLlib Using a Pipeline

@ author : Frederic Twahirwa

Apache Spark has become one of the most commonly used and supported open-source tools for machine learning and data science. In this post, I'll help you get started with Apache Spark's spark.ml Linear Regression by predicting bike rental counts. The data comes from the UCI Machine Learning Repository's Bike Sharing Data Set; this example uses the day data set (day.csv).

This example is built around the concept of a pipeline.
In machine learning, it is common to run a sequence of algorithms to process and learn from data. MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. The rest of this post builds one such workflow step by step.
More information: https://spark.apache.org/docs/2.2.0/ml-pipeline.html
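
To make the Transformer/Estimator distinction concrete, here is a toy sketch in plain Python. It is not Spark's implementation: the names `Scale`, `MeanCenter`, `Shift`, `fit_pipeline`, and `transform_pipeline` are hypothetical and only mimic the idea that a Transformer maps data to data, while an Estimator is fitted on data and produces a Transformer.

```python
class Scale:
    """Toy Transformer: multiplies every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [x * self.factor for x in rows]


class Shift:
    """Toy Transformer: adds a fixed offset to every value."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, rows):
        return [x + self.offset for x in rows]


class MeanCenter:
    """Toy Estimator: learns the mean from the data and returns a Shift."""
    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return Shift(-mean)


def fit_pipeline(stages, rows):
    """Run the stages in order, fitting each Estimator on the data seen so far."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):      # Estimator: fit it to get a Transformer
            stage = stage.fit(rows)
        rows = stage.transform(rows)   # pass the transformed data to the next stage
        fitted.append(stage)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply the already-fitted stages in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows


fitted = fit_pipeline([Scale(2.0), MeanCenter()], [1.0, 2.0, 3.0])
result = transform_pipeline(fitted, [1.0, 2.0, 3.0])
print(result)
```

In the real pipeline of this post, the two SQLTransformer stages and the VectorAssembler play the Transformer role, and LinearRegression is the Estimator.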

### Data set : Attribute information

- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23); present only in the hour data set, not in day.csv
- holiday : whether the day is a holiday or not
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit : weather situation
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
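
The min-max formula quoted for `temp` and `atemp` can be inverted to recover degrees Celsius. The sketch below assumes the formula applies as stated in the attribute list (which marks it as valid for the hourly scale); the helper name `denormalize` and the sample value 0.5 are illustrative only.

```python
def denormalize(value, t_min, t_max):
    """Invert value = (t - t_min) / (t_max - t_min) to recover t."""
    return value * (t_max - t_min) + t_min


# Bounds taken from the attribute list above.
temp_celsius = denormalize(0.5, t_min=-8, t_max=39)     # temp
atemp_celsius = denormalize(0.5, t_min=-16, t_max=50)   # atemp

print(temp_celsius, atemp_celsius)
```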


### Import the PySpark package, create a session, and load the data

In [1]:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.getOrCreate()

In [2]:
bike_sharing = spark_session.read.csv("./Documents/MachineLearning/BikeSharing/day.csv", header=True)
bike_sharing.limit(5).toPandas()

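| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |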


In [3]:
bike_sharing.printSchema()

root
 |-- instant: string (nullable = true)
 |-- dteday: string (nullable = true)
 |-- season: string (nullable = true)
 |-- yr: string (nullable = true)
 |-- mnth: string (nullable = true)
 |-- holiday: string (nullable = true)
 |-- weekday: string (nullable = true)
 |-- workingday: string (nullable = true)
 |-- weathersit: string (nullable = true)
 |-- temp: string (nullable = true)
 |-- atemp: string (nullable = true)
 |-- hum: string (nullable = true)
 |-- windspeed: string (nullable = true)
 |-- casual: string (nullable = true)
 |-- registered: string (nullable = true)
 |-- cnt: string (nullable = true)



### Split data

In [4]:
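# Hold out roughly 30% of the rows for testing. No seed is passed, so the split
# (and the metrics reported further down) will vary between runs; pass seed=<int> to fix it.
train, test = bike_sharing.randomSplit([0.7, 0.3])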

## Convert strings to integer or double using SQLTransformer

SQLTransformer implements transformations that are defined by a SQL statement. In the statement, `__THIS__` stands for the input DataFrame.

In [5]:
# Cast integer columns to int and float columns to double, and rename cnt to label.
# The explicit aliases keep the column names that the VectorAssembler stage expects.
from pyspark.ml.feature import SQLTransformer
sql_transformer01 = SQLTransformer(
    statement="""
    SELECT
        cast(season as int) as season,
        cast(yr as int) as yr,
        cast(mnth as int) as mnth,
        cast(holiday as int) as holiday,
        cast(weekday as int) as weekday,
        cast(workingday as int) as workingday,
        cast(weathersit as int) as weathersit,
        cast(temp as double) as temp,
        cast(atemp as double) as atemp,
        cast(hum as double) as hum,
        cast(windspeed as double) as windspeed,
        cast(cnt as int) as label
    FROM __THIS__
    """)

### VectorAssembler Transformer
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models.
VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns are concatenated into a vector in the specified order.

In [6]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler()\
    .setInputCols(["season",
                  "yr",
                  "mnth",
                  "holiday",
                  "weekday",
                  "workingday",
                  "weathersit",
                  "temp",
                  "atemp",
                  "hum",
                  "windspeed"])\
    .setOutputCol("features")

### Select Features and label by SQLTransformer

In [7]:
# select features and label only
sql_transformer02 = SQLTransformer(
    statement = """
    SELECT
        features,
        label
    FROM __THIS__
    """ )

## Train a linear regression in spark ML

In [8]:
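from pyspark.ml.regression import LinearRegression
# With default parameters, the estimator reads the "features" and "label" columns
# and writes its output to a "prediction" column.
lr = LinearRegression()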

### Configure an ML pipeline
The pipeline consists of four stages: sql_transformer01, assembler, sql_transformer02, and lr.

In [9]:
# pipeline
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = [
    sql_transformer01,
    assembler,
    sql_transformer02,
    lr
    ])

### Fit the pipeline to training data

In [10]:
pipeline_model = pipeline.fit(train)

In [11]:
# Apply the fitted pipeline to both splits; a prediction column is appended.
train01 = pipeline_model.transform(train)
test01 = pipeline_model.transform(test)
test01.limit(3).toPandas()

| | features | label | prediction |
|---|---|---|---|
| 0 | [2.0, 0.0, 4.0, 0.0, 4.0, 1.0, 1.0, 0.4675, 0.... | 3267 | 3922.35507 |
| 1 | [2.0, 0.0, 4.0, 0.0, 0.0, 0.0, 2.0, 0.581667, ... | 4191 | 2941.055603 |
| 2 | [2.0, 0.0, 4.0, 0.0, 4.0, 1.0, 2.0, 0.6175, 0.... | 4058 | 3316.944245 |


### Evaluate the model (on the test data)

In [12]:
from pyspark.ml.evaluation import RegressionEvaluator
# Defaults: labelCol="label", predictionCol="prediction"; the metric is overridden per call.
evaluator = RegressionEvaluator()
print("r2 =", evaluator.evaluate(test01, {evaluator.metricName: "r2"}))
print("rmse =", evaluator.evaluate(test01, {evaluator.metricName: "rmse"}))

r2 = 0.795052189267716
rmse = 874.8801605587502
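
For intuition about the two metrics, the sketch below computes them by hand in plain Python using the standard definitions (RMSE as the square root of the mean squared error, R² as one minus the ratio of residual to total sum of squares), which to my understanding are what RegressionEvaluator reports. The three label/prediction pairs are the rows shown in the test output earlier; three rows are far too few to reproduce the figures above.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - p)^2))."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# The three rows displayed from test01 earlier in this post.
labels = [3267, 4191, 4058]
predictions = [3922.35507, 2941.055603, 3316.944245]

print("rmse =", rmse(labels, predictions))
print("r2 =", r2(labels, predictions))
```

RMSE is expressed in the same unit as the label (rentals per day), while R² is unitless; it is at most 1 and can be negative for a poor fit.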


### Save the pipeline model for production use

It is often worthwhile to save a model or a pipeline to disk for later use.

In [13]:
# Save the fitted pipeline model. save() raises an error if the path already exists;
# pipeline_model.write().overwrite().save("pipeline_model") replaces it instead.
pipeline_model.save("pipeline_model")

### Load the pipeline model

In [14]:
# load the pipeline_model
from pyspark.ml import PipelineModel
new_pipeline_model = PipelineModel.load("pipeline_model")

In [15]:
# Score the test set again with the reloaded pipeline model
test02 = new_pipeline_model.transform(test)
print("r2 =", evaluator.evaluate(test02, {evaluator.metricName: "r2"}))
print("rmse =", evaluator.evaluate(test02, {evaluator.metricName: "rmse"}))

r2 = 0.795052189267716
rmse = 874.8801605587502


More information:
- https://spark.apache.org/docs/latest/ml-features.html#sqltransformer
- https://spark.apache.org/docs/2.2.0/ml-pipeline.html