## Regression Models in PySpark
##### Agenda
<hr>
* Pre-requisite of Data for Machine Learning
* Linear Model for Regression
* Evaluating Regression Models
* Generalized Linear Regression
* Survival Regression 
* Isotonic Regression
* Decision Trees for Regression
* Ensemble Methods for Regression

<hr>

### Pre-requisite of Data for Machine Learning
<hr>
* Feature data must be assembled into a single vector column before a model is fitted; `VectorAssembler` handles this.

### Linear Model for Regression
<hr>
* Features consist of p independent variables (a p-dimensional vector)
* The target (dependent variable) is represented by y
* The relation between features and target is given by the equation below
* The w's are the weights (coefficients) of each feature; w0 is the intercept
* The best line (or hyperplane) is the one with minimal loss
* The objective is to minimize the loss
* Loss can be measured as root mean squared error, L2 loss (squared error), L1 loss (absolute error), or a combination of these

<img src="http://abhijitbangera.com/wp-content/uploads/2017/04/multi-regression-equation.png" width="600px">
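
To make "minimizing the loss" concrete, here is a minimal sketch of the single-feature case solved in closed form with plain Python. The function names and toy data are illustrative assumptions, not part of the notebook, and Spark's `LinearRegression` uses its own distributed solvers rather than this formula.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (w0, w1) minimizing the sum of squared errors for y ~ w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for a single feature.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov_xy / var_x
    w0 = mean_y - w1 * mean_x
    return w0, w1


def mean_squared_error(xs, ys, w0, w1):
    """Average squared difference between targets and predictions."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data generated from y = 2 + 3x without noise.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]

w0, w1 = fit_simple_linear_regression(xs, ys)
loss = mean_squared_error(xs, ys, w0, w1)
print(w0, w1, loss)
```

Because the toy data is noise free, the fitted intercept and slope recover the generating values and the loss is zero; with noisy data the loss stays positive and the fit is the line that makes it smallest.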

In [4]:
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import pandas as pd

In [5]:
X,y = make_regression(n_features=1,n_samples=1000, noise=10)
X = X +10
y = y +10

In [6]:
df = pd.DataFrame({'X':X.ravel(),'y':y.ravel()})
data = spark.createDataFrame(df)

In [7]:
display(data)

X,y
10.773741712731471,39.160258765168535
8.887802309767446,-25.722562465631533
11.17207909628788,46.495439190734345
10.43525445563436,34.20142827659538
10.117147054677854,19.091311451638887
9.521059212241491,-16.95097123596604
11.05318462644746,33.87366093562471
9.561395388732103,-8.734483590024148
9.937884864456136,-8.873860132894002
9.388789430004495,-6.970095111296963


In [8]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [9]:
ve = VectorAssembler(inputCols=['X'],outputCol='features')

In [10]:
data = ve.transform(data)

In [11]:
from pyspark.ml.regression import LinearRegression

In [12]:
lr = LinearRegression(featuresCol='features',labelCol='y')

In [13]:
lr_model = lr.fit(data)

In [14]:
lr_model.coefficients

In [15]:
lr_model.intercept

In [16]:
out = lr_model.transform(data)

In [17]:
display(out)

X,y,features,prediction
10.773741712731471,39.160258765168535,"List(1, 1, List(), List(10.773741712731471))",40.87520231191888
8.887802309767446,-25.722562465631533,"List(1, 1, List(), List(8.887802309767446))",-34.27567018140371
11.17207909628788,46.495439190734345,"List(1, 1, List(), List(11.17207909628788))",56.74814174672986
10.43525445563436,34.20142827659538,"List(1, 1, List(), List(10.435254455634361))",27.38716942033153
10.117147054677854,19.091311451638887,"List(1, 1, List(), List(10.117147054677854))",14.711232594952662
9.521059212241491,-16.95097123596604,"List(1, 1, List(), List(9.521059212241493))",-9.041662843442964
11.05318462644746,33.87366093562471,"List(1, 1, List(), List(11.053184626447461))",52.01043748693451
9.561395388732103,-8.734483590024148,"List(1, 1, List(), List(9.561395388732105))",-7.43434775603879
9.937884864456136,-8.873860132894002,"List(1, 1, List(), List(9.937884864456137))",7.567996720537337
9.388789430004495,-6.970095111296963,"List(1, 1, List(), List(9.388789430004497))",-14.312346262003302


### Evaluating Regression Models
<hr>
* Choosing the best among competing models requires evaluating each of them
* PySpark provides `RegressionEvaluator` to evaluate regression models
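
The two metrics requested in the cells below, `r2` and `mae`, can be sketched in plain Python using their textbook definitions. The helper names and the sample numbers are assumptions made for illustration; this is not Spark's implementation.

```python
def mean_absolute_error(labels, predictions):
    """Average absolute difference between label and prediction."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def r2_score(labels, predictions):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels and predictions.
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.0, 5.0, 8.0, 9.0]

mae = mean_absolute_error(labels, predictions)
r2 = r2_score(labels, predictions)
print(mae, r2)
```

A lower MAE is better, while R² approaches 1 as the predictions explain more of the variance in the labels.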

In [19]:
from pyspark.ml.evaluation import RegressionEvaluator

In [20]:
regression_eval = RegressionEvaluator(predictionCol='prediction', labelCol='y')

In [21]:
regression_eval.evaluate(out, {regression_eval.metricName: "r2"})

In [22]:
regression_eval.evaluate(out, {regression_eval.metricName: "mae"})

### Generalized Linear Regression
<hr>
* The linear regression shown above is a special case of generalized linear regression
* Linear regression makes assumptions about the data, for example that the target is normally distributed
* Generalized linear regression relaxes this: the link can be identity for continuous targets, logit for binary targets, and log (Poisson) for counts
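
The role of the link function can be sketched as follows: the model always computes a linear predictor, and the inverse link maps it onto the scale of the target. The weights, intercept, and feature values here are made-up numbers for illustration only.

```python
import math


def linear_predictor(weights, intercept, features):
    """eta = w0 + w1*x1 + ... + wp*xp"""
    return intercept + sum(w * x for w, x in zip(weights, features))


def identity_inverse(eta):
    """Identity link: the mean of a continuous target is eta itself."""
    return eta


def logit_inverse(eta):
    """Logit link: maps eta to a probability in (0, 1) for a binary target."""
    return 1.0 / (1.0 + math.exp(-eta))


def log_inverse(eta):
    """Log link: maps eta to a positive expected count."""
    return math.exp(eta)


# Hypothetical coefficients and one feature row; eta works out to 1.0.
eta = linear_predictor([0.5, -0.25], 1.0, [2.0, 4.0])

continuous_mean = identity_inverse(eta)
probability = logit_inverse(eta)
expected_count = log_inverse(eta)
print(continuous_mean, probability, expected_count)
```

This is why the `prediction` column of the binomial model below contains values between 0 and 1 rather than raw linear scores.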

In [24]:
from pyspark.ml.regression import GeneralizedLinearRegression

In [25]:
glr = GeneralizedLinearRegression(family="binomial",link="logit", featuresCol='features', labelCol='y')

In [26]:
from sklearn.datasets import make_blobs
X,y = make_blobs(n_features=2, n_samples=1000, cluster_std=2,centers=2)

In [27]:
ve = VectorAssembler(inputCols=['X1','X2'], outputCol='features')

In [28]:
df = pd.DataFrame({'X1':X[:,0], 'X2':X[:,1],'y':y.ravel()})
data = spark.createDataFrame(df)

In [29]:
data = ve.transform(data)

In [30]:
glr_model = glr.fit(data)

In [31]:
out = glr_model.transform(data)

In [32]:
display(out)

X1,X2,y,features,prediction
-0.0294076792394546,10.055384586574752,0,"List(1, 2, List(), List(-0.029407679239454643, 10.055384586574753))",0.0003399453611665722
-5.77814120197689,3.5173164054359765,1,"List(1, 2, List(), List(-5.77814120197689, 3.5173164054359765))",0.9988787074112904
0.3700633454962521,4.956514069086183,1,"List(1, 2, List(), List(0.3700633454962521, 4.956514069086183))",0.1215136182390598
4.022892811622982,3.6339534478565088,0,"List(1, 2, List(), List(4.022892811622982, 3.6339534478565088))",0.0117728515731059
-1.3443846896132876,4.823459943065423,1,"List(1, 2, List(), List(-1.3443846896132876, 4.823459943065423))",0.5318562368324615
2.999200406837827,4.290139600564432,0,"List(1, 2, List(), List(2.999200406837827, 4.290139600564432))",0.0162270832208374
-2.1224418941953616,2.3174659615804294,1,"List(1, 2, List(), List(-2.1224418941953616, 2.3174659615804294))",0.9849543856759352
0.7008877253719983,11.584280939767991,0,"List(1, 2, List(), List(0.7008877253719983, 11.584280939767991))",2.147763133583518e-05
1.6149456797545785,11.462658603218143,0,"List(1, 2, List(), List(1.6149456797545783, 11.462658603218145))",8.919859926950338e-06
1.3214059146884811,5.450898433249814,0,"List(1, 2, List(), List(1.3214059146884811, 5.450898433249814))",0.0246157375518694


### Survival Regression
<hr>
* Survival analysis is generally defined as a set of methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest.
* The event can be death, occurrence of a disease, marriage, divorce.
* Observations are called censored when the information about their survival time is incomplete; the most commonly encountered form is right censoring.
* Suppose patients are followed in a study for 20 weeks. A patient who does not experience the event of interest for the duration of the study is said to be right censored.
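
The 20-week example can be sketched in plain Python. The patient data is hypothetical. The encoding used here (1 when the event was observed, 0 when the observation is censored) follows the convention that Spark documents for the `censorCol` of `AFTSurvivalRegression`.

```python
STUDY_WEEKS = 20


def observe(event_week):
    """Return (observed_time, event_observed) for one patient.

    event_week is the true week of the event, or None if it never happens.
    event_observed is 1 if the event fell inside the study, else 0 (right censored).
    """
    if event_week is not None and event_week <= STUDY_WEEKS:
        return event_week, 1
    # The study ended first: all we know is that survival exceeds STUDY_WEEKS.
    return STUDY_WEEKS, 0


# Hypothetical true event times for four patients.
true_event_weeks = [5, 12, 27, None]
observations = [observe(week) for week in true_event_weeks]
print(observations)
```

The last two patients contribute the same record, (20, 0), even though their true outcomes differ; a survival model has to account for that missing information instead of treating 20 as the actual survival time.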

In [34]:
lung_data = spark.read.csv('/FileStore/tables/lung_cancer.csv',inferSchema=True, header=True)

In [35]:
display(lung_data)

Treatment,Celltype,Survival_in_days,Status,Karnofsky_score,Months_from_Diagnosis,Age_in_years,Prior_therapy
'standard','squamous',72,'dead',60,7,69,'no'
'standard','squamous',411,'dead',70,5,64,'yes'
'standard','squamous',228,'dead',60,3,38,'no'
'standard','squamous',126,'dead',60,9,63,'yes'
'standard','squamous',118,'dead',70,11,65,'yes'
'standard','squamous',10,'dead',20,5,49,'no'
'standard','squamous',82,'dead',40,10,69,'yes'
'standard','squamous',110,'dead',80,29,68,'no'
'standard','squamous',314,'dead',50,18,43,'no'
'standard','squamous',100,'censored',70,6,70,'no'


In [36]:
lung_data.columns

In [37]:
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.x renamed it to OneHotEncoder
from pyspark.ml.feature import OneHotEncoderEstimator
from pyspark.ml.feature import StringIndexer

In [38]:
cat_cols = ['Treatment', 'Celltype', 'Status', 'Prior_therapy']

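In [39]:
# Caution: this set difference also keeps the label Survival_in_days, which then leaks into the features
num_cols = list(set(lung_data.columns) - set(cat_cols))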

In [40]:
num_cols

In [41]:
stringIndexers = []
for col in cat_cols:
  si = StringIndexer(inputCol=col, outputCol=col+'_tf')
  stringIndexers.append(si)

In [42]:
for idx,col in enumerate(cat_cols):
  lung_data = stringIndexers[idx].fit(lung_data).transform(lung_data)

In [43]:
display(lung_data)

Treatment,Celltype,Survival_in_days,Status,Karnofsky_score,Months_from_Diagnosis,Age_in_years,Prior_therapy,Treatment_tf,Celltype_tf,Status_tf,Prior_therapy_tf
'standard','squamous',72,'dead',60,7,69,'no',0.0,1.0,0.0,0.0
'standard','squamous',411,'dead',70,5,64,'yes',0.0,1.0,0.0,1.0
'standard','squamous',228,'dead',60,3,38,'no',0.0,1.0,0.0,0.0
'standard','squamous',126,'dead',60,9,63,'yes',0.0,1.0,0.0,1.0
'standard','squamous',118,'dead',70,11,65,'yes',0.0,1.0,0.0,1.0
'standard','squamous',10,'dead',20,5,49,'no',0.0,1.0,0.0,0.0
'standard','squamous',82,'dead',40,10,69,'yes',0.0,1.0,0.0,1.0
'standard','squamous',110,'dead',80,29,68,'no',0.0,1.0,0.0,0.0
'standard','squamous',314,'dead',50,18,43,'no',0.0,1.0,0.0,0.0
'standard','squamous',100,'censored',70,6,70,'no',0.0,1.0,1.0,0.0


In [44]:
ohe = OneHotEncoderEstimator(inputCols=['Treatment_tf','Celltype_tf','Prior_therapy_tf'], outputCols=['Treatment_en','Celltype_en','Prior_therapy_en'])

In [45]:
ohe_model = ohe.fit(lung_data)

In [46]:
lung_data = ohe_model.transform(lung_data)

In [47]:
lung_data.columns

In [48]:
ve = VectorAssembler(inputCols=['Treatment_en','Celltype_en','Prior_therapy_en']+num_cols, outputCol='features')

In [49]:
lung_data = ve.transform(lung_data)

In [50]:
display(lung_data)

Treatment,Celltype,Survival_in_days,Status,Karnofsky_score,Months_from_Diagnosis,Age_in_years,Prior_therapy,Treatment_tf,Celltype_tf,Status_tf,Prior_therapy_tf,Treatment_en,Celltype_en,Prior_therapy_en,features
'standard','squamous',72,'dead',60,7,69,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 69.0, 72.0, 60.0, 7.0))"
'standard','squamous',411,'dead',70,5,64,'yes',0.0,1.0,0.0,1.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(), List())","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 0.0, 64.0, 411.0, 70.0, 5.0))"
'standard','squamous',228,'dead',60,3,38,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 38.0, 228.0, 60.0, 3.0))"
'standard','squamous',126,'dead',60,9,63,'yes',0.0,1.0,0.0,1.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(), List())","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 0.0, 63.0, 126.0, 60.0, 9.0))"
'standard','squamous',118,'dead',70,11,65,'yes',0.0,1.0,0.0,1.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(), List())","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 0.0, 65.0, 118.0, 70.0, 11.0))"
'standard','squamous',10,'dead',20,5,49,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 49.0, 10.0, 20.0, 5.0))"
'standard','squamous',82,'dead',40,10,69,'yes',0.0,1.0,0.0,1.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(), List())","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 0.0, 69.0, 82.0, 40.0, 10.0))"
'standard','squamous',110,'dead',80,29,68,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 68.0, 110.0, 80.0, 29.0))"
'standard','squamous',314,'dead',50,18,43,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 43.0, 314.0, 50.0, 18.0))"
'standard','squamous',100,'censored',70,6,70,'no',0.0,1.0,1.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 70.0, 100.0, 70.0, 6.0))"


In [51]:
from pyspark.ml.regression import AFTSurvivalRegression

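In [52]:
# Caution: censorCol expects 1.0 when the event occurred and 0.0 when censored;
# StringIndexer encoded 'dead' as 0.0 above, so Status_tf may need to be inverted first
survival_regression = AFTSurvivalRegression(labelCol='Survival_in_days', censorCol='Status_tf',featuresCol='features')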

In [53]:
survival_regression_model = survival_regression.fit(lung_data)

In [54]:
lung_data = survival_regression_model.transform(lung_data)

In [55]:
display(lung_data)

Treatment,Celltype,Survival_in_days,Status,Karnofsky_score,Months_from_Diagnosis,Age_in_years,Prior_therapy,Treatment_tf,Celltype_tf,Status_tf,Prior_therapy_tf,Treatment_en,Celltype_en,Prior_therapy_en,features,prediction
'standard','squamous',72,'dead',60,7,69,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 69.0, 72.0, 60.0, 7.0))",97.54550659115826
'standard','squamous',411,'dead',70,5,64,'yes',0.0,1.0,0.0,1.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(), List())","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 0.0, 64.0, 411.0, 70.0, 5.0))",1289.202511734985
'standard','squamous',228,'dead',60,3,38,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 38.0, 228.0, 60.0, 3.0))",287.0067438260255
'standard','squamous',126,'dead',60,9,63,'yes',0.0,1.0,0.0,1.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(), List())","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 0.0, 63.0, 126.0, 60.0, 9.0))",131.46497776906273
'standard','squamous',118,'dead',70,11,65,'yes',0.0,1.0,0.0,1.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(), List())","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 0.0, 65.0, 118.0, 70.0, 11.0))",124.26925702333172
'standard','squamous',10,'dead',20,5,49,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 49.0, 10.0, 20.0, 5.0))",56.534579120387335
'standard','squamous',82,'dead',40,10,69,'yes',0.0,1.0,0.0,1.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(), List())","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 0.0, 69.0, 82.0, 40.0, 10.0))",99.8201871218082
'standard','squamous',110,'dead',80,29,68,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 68.0, 110.0, 80.0, 29.0))",160.54765748794492
'standard','squamous',314,'dead',50,18,43,'no',0.0,1.0,0.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 43.0, 314.0, 50.0, 18.0))",717.1988387681024
'standard','squamous',100,'censored',70,6,70,'no',0.0,1.0,1.0,0.0,"List(0, 1, List(0), List(1.0))","List(0, 3, List(1), List(1.0))","List(0, 1, List(0), List(1.0))","List(1, 9, List(), List(1.0, 0.0, 1.0, 0.0, 1.0, 70.0, 100.0, 70.0, 6.0))",119.4238341662238


### Isotonic Regression
<hr>
* Isotonic regression finds a non-decreasing approximation of a function while minimizing the mean squared error on the training data.
* The benefit of such a model is that it does not assume any form for the target function, such as linearity.

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_isotonic_regression_001.png">
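
One common way to compute such a fit is the pool adjacent violators algorithm (PAVA). The sketch below is a plain-Python version for equally weighted points that are already sorted by feature value; the function name and the sample values are assumptions for illustration.

```python
def pool_adjacent_violators(ys):
    """Return the non-decreasing sequence closest to ys in squared error."""
    blocks = []  # each block is [sum_of_values, count]
    for value in ys:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        # Every point in a block is predicted by the block mean.
        fitted.extend([total / count] * count)
    return fitted


# Hypothetical targets, ordered by their feature value.
ys = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0, 1.0]
fitted = pool_adjacent_violators(ys)
print(fitted)
```

Runs of points that would break monotonicity are replaced by their average, which produces the step-like shape of an isotonic fit: repeated prediction values over consecutive rows.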

In [57]:
import numpy as np
import pandas as pd

n = 100
x = np.arange(n)
y = np.random.randint(-50, 50, size=(n,)) + 50. * np.log1p(np.arange(n))

In [58]:
df = pd.DataFrame({'x':x, 'y':y})
data = spark.createDataFrame(df)

In [59]:
display(data)

x,y,features
0,-34.0,"List(1, 1, List(), List(0.0))"
1,39.65735902799727,"List(1, 1, List(), List(1.0))"
2,16.93061443340548,"List(1, 1, List(), List(2.0))"
3,56.31471805599453,"List(1, 1, List(), List(3.0))"
4,113.47189562170502,"List(1, 1, List(), List(4.0))"
5,124.58797346140275,"List(1, 1, List(), List(5.0))"
6,70.29550745276566,"List(1, 1, List(), List(6.0))"
7,151.9720770839918,"List(1, 1, List(), List(7.0))"
8,141.86122886681096,"List(1, 1, List(), List(8.0))"
9,153.1292546497023,"List(1, 1, List(), List(9.0))"


In [60]:
from pyspark.ml.regression import IsotonicRegression
iso_regress = IsotonicRegression(featuresCol='features', labelCol='y')

In [61]:
va = VectorAssembler(inputCols=['x'],outputCol='features')

In [62]:
data = va.transform(data)

In [63]:
iso_regress_model = iso_regress.fit(data)

In [64]:
result = iso_regress_model.transform(data)

In [65]:
display(result)

x,y,features,prediction
0,-34.0,"List(1, 1, List(), List(0.0))",-34.0
1,39.65735902799727,"List(1, 1, List(), List(1.0))",28.293986730701373
2,16.93061443340548,"List(1, 1, List(), List(2.0))",28.293986730701373
3,56.31471805599453,"List(1, 1, List(), List(3.0))",56.31471805599453
4,113.47189562170502,"List(1, 1, List(), List(4.0))",102.78512551195782
5,124.58797346140275,"List(1, 1, List(), List(5.0))",102.78512551195782
6,70.29550745276566,"List(1, 1, List(), List(6.0))",102.78512551195782
7,151.9720770839918,"List(1, 1, List(), List(7.0))",125.7038819167514
8,141.86122886681096,"List(1, 1, List(), List(8.0))",125.7038819167514
9,153.1292546497023,"List(1, 1, List(), List(9.0))",125.7038819167514


### Decision Trees for Regression
<hr>
* Feature data can be categorical or continuous
* Feature scaling is not required
* Highly interpretable
* The target can be non-linearly related to the features
* For regression, splits are chosen by variance reduction (entropy and Gini apply to classification); this calculation can be distributed and parallelized

<hr>

In [67]:
from pyspark.ml.regression import DecisionTreeRegressor

In [68]:
house_data = spark.read.csv('/FileStore/tables/house_rental_data.csv', inferSchema=True,header=True)

In [69]:
house_data = house_data.withColumnRenamed('Living.Room','LivingRoom')

In [70]:
display(house_data)

_c0,Sqft,Floor,TotalFloor,Bedroom,LivingRoom,Bathroom,Price
1,1177.698,2,7,2,2,2,62000
2,2134.8,5,7,4,2,2,78000
3,1138.56,5,7,2,2,1,58000
4,1458.78,2,7,3,2,2,45000
5,967.776,11,14,3,2,2,45000
6,1127.886,11,12,4,2,2,148000
7,1352.04,5,7,3,2,1,58000
8,757.854,5,14,1,0,1,48000
9,1152.792,10,12,3,2,2,45000
10,1423.2,4,5,4,2,2,65000


In [71]:
from pyspark.ml.feature import VectorAssembler

In [72]:
va = VectorAssembler(inputCols=['Sqft','Floor','TotalFloor','Bedroom','LivingRoom','Bathroom'], outputCol='features')

In [73]:
house_data = va.transform(house_data)

In [74]:
dt = DecisionTreeRegressor(maxDepth=5, featuresCol='features',labelCol='Price')

In [75]:
dt_model = dt.fit(house_data)

In [76]:
result = dt_model.transform(house_data)

In [77]:
display(result)

_c0,Sqft,Floor,TotalFloor,Bedroom,LivingRoom,Bathroom,Price,features,prediction
1,1177.698,2,7,2,2,2,62000,"List(1, 6, List(), List(1177.698, 2.0, 7.0, 2.0, 2.0, 2.0))",47521.875
2,2134.8,5,7,4,2,2,78000,"List(1, 6, List(), List(2134.8, 5.0, 7.0, 4.0, 2.0, 2.0))",68499.875
3,1138.56,5,7,2,2,1,58000,"List(1, 6, List(), List(1138.56, 5.0, 7.0, 2.0, 2.0, 1.0))",47521.875
4,1458.78,2,7,3,2,2,45000,"List(1, 6, List(), List(1458.78, 2.0, 7.0, 3.0, 2.0, 2.0))",47521.875
5,967.776,11,14,3,2,2,45000,"List(1, 6, List(), List(967.776, 11.0, 14.0, 3.0, 2.0, 2.0))",42209.782608695656
6,1127.886,11,12,4,2,2,148000,"List(1, 6, List(), List(1127.886, 11.0, 12.0, 4.0, 2.0, 2.0))",149000.0
7,1352.04,5,7,3,2,1,58000,"List(1, 6, List(), List(1352.04, 5.0, 7.0, 3.0, 2.0, 1.0))",47521.875
8,757.854,5,14,1,0,1,48000,"List(1, 6, List(), List(757.854, 5.0, 14.0, 1.0, 0.0, 1.0))",42209.782608695656
9,1152.792,10,12,3,2,2,45000,"List(1, 6, List(), List(1152.792, 10.0, 12.0, 3.0, 2.0, 2.0))",47521.875
10,1423.2,4,5,4,2,2,65000,"List(1, 6, List(), List(1423.2, 4.0, 5.0, 4.0, 2.0, 2.0))",47521.875


### Ensemble Methods for Regression
<hr>
* Combining several weak models into one that generalizes better is known as an ensemble method.
* Random forests and gradient-boosted trees are the two major ensemble techniques supported.
* A random forest for regression averages the outcomes of multiple decision trees.
* Different decision trees can be trained on different nodes of the cluster.
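
The averaging idea can be sketched with depth-1 trees (stumps) trained on bootstrap samples. This is a deliberately simplified illustration: it uses one feature, hypothetical data, and made-up helper names, and it omits the per-split feature subsampling that a real random forest also performs.

```python
import random


def fit_stump(xs, ys):
    """Depth-1 regression tree on one feature: (threshold, left_mean, right_mean)."""
    overall_mean = sum(ys) / len(ys)
    best = (float("inf"), None, overall_mean, overall_mean)
    distinct = sorted(set(xs))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:  # the bootstrap sample had one distinct x value
        return left_mean
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, num_trees, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(num_trees):
        rows = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in rows], [ys[i] for i in rows]))
    return forest


def predict_forest(forest, x):
    """The ensemble prediction is the average of the individual predictions."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


# Hypothetical areas (sqft) and prices (in thousands).
sqft = [700.0, 900.0, 1100.0, 2000.0, 2200.0]
price = [40.0, 42.0, 44.0, 80.0, 82.0]

forest = fit_forest(sqft, price, num_trees=10)
small_home = predict_forest(forest, 800.0)
large_home = predict_forest(forest, 2100.0)
print(small_home, large_home)
```

Since each stump only needs its own bootstrap sample, the stumps are independent of one another, which is what allows the trees of a forest to be trained in parallel.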

In [79]:
from pyspark.ml.regression import RandomForestRegressor

In [80]:
rf = RandomForestRegressor(numTrees=10, featuresCol='features',labelCol='Price')

In [81]:
rf_model = rf.fit(house_data)

In [82]:
results = rf_model.transform(house_data)

In [83]:
display(results)

_c0,Sqft,Floor,TotalFloor,Bedroom,LivingRoom,Bathroom,Price,features,prediction
1,1177.698,2,7,2,2,2,62000,"List(1, 6, List(), List(1177.698, 2.0, 7.0, 2.0, 2.0, 2.0))",49832.56661047177
2,2134.8,5,7,4,2,2,78000,"List(1, 6, List(), List(2134.8, 5.0, 7.0, 4.0, 2.0, 2.0))",71596.87270466669
3,1138.56,5,7,2,2,1,58000,"List(1, 6, List(), List(1138.56, 5.0, 7.0, 2.0, 2.0, 1.0))",43657.3335843472
4,1458.78,2,7,3,2,2,45000,"List(1, 6, List(), List(1458.78, 2.0, 7.0, 3.0, 2.0, 2.0))",50682.14061047176
5,967.776,11,14,3,2,2,45000,"List(1, 6, List(), List(967.776, 11.0, 14.0, 3.0, 2.0, 2.0))",52980.983987904445
6,1127.886,11,12,4,2,2,148000,"List(1, 6, List(), List(1127.886, 11.0, 12.0, 4.0, 2.0, 2.0))",62718.01630996375
7,1352.04,5,7,3,2,1,58000,"List(1, 6, List(), List(1352.04, 5.0, 7.0, 3.0, 2.0, 1.0))",46161.57151291862
8,757.854,5,14,1,0,1,48000,"List(1, 6, List(), List(757.854, 5.0, 14.0, 1.0, 0.0, 1.0))",35701.956192365215
9,1152.792,10,12,3,2,2,45000,"List(1, 6, List(), List(1152.792, 10.0, 12.0, 3.0, 2.0, 2.0))",52636.68408947131
10,1423.2,4,5,4,2,2,65000,"List(1, 6, List(), List(1423.2, 4.0, 5.0, 4.0, 2.0, 2.0))",46801.86874256116
