<font size=5>

Regression with boston_data.csv. The dataset was downloaded from Kaggle and is used to predict Boston housing prices.



</font>

| Code   | Description   |
|:---|:---|
|**CRIM** | per capita crime rate by town |
|**ZN**  | proportion of residential land zoned for lots over 25,000 sq.ft. | 
|**INDUS**  | proportion of non-retail business acres per town | 
|**CHAS**  | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) | 
|**NOX**  | nitric oxides concentration (parts per 10 million) | 
|**RM**  | average number of rooms per dwelling | 
|**AGE**  | proportion of owner-occupied units built prior to 1940 | 
|**DIS**  | weighted distances to five Boston employment centres | 
|**RAD**  | index of accessibility to radial highways | 
|**TAX**  | full-value property-tax rate per $10,000 | 
|**PTRATIO**  | pupil-teacher ratio by town | 
|**B**  | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | 
|**LSTAT**  | % lower status of the population | 
|**MEDV**  | Median value of owner-occupied homes in \$1000's | 



<font size=5>medv is the label; all other columns are features. </font>

<font size=5> Import PySpark libraries, create SparkContext and SQL context, then load the csv data file. </font>

In [63]:
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

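# Reuse the running SparkContext if there is one, otherwise create it
# (calling sc.stop() first raises a NameError when sc is not defined yet)
sc = SparkContext.getOrCreate()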
sqlContext = SQLContext(sc)
boston_house_df = sqlContext.read.format('csv').options(header='true', inferschema='true').load('file:///opt/spark/data/data_compare/boston_data.csv')

<font size=5> Show statistics of each column, including feature columns and label column (medv)  </font>

In [64]:
boston_house_df.describe().toPandas().transpose()

| summary | count | mean | stddev | min | max |
|:---|:---|:---|:---|:---|:---|
| crim | 404 | 3.7309119306930723 | 8.943922212251913 | 0.00632 | 88.9762 |
| zn | 404 | 10.509900990099009 | 22.053733184762923 | 0.0 | 95.0 |
| indus | 404 | 11.189900990099002 | 6.8149093223650885 | 0.46 | 27.74 |
| chas | 404 | 0.06930693069306931 | 0.25429026389960196 | 0.0 | 1.0 |
| nox | 404 | 0.5567103960396043 | 0.11732064984156548 | 0.392 | 0.871 |
| rm | 404 | 6.301450495049499 | 0.6758302935149543 | 3.561 | 8.78 |
| age | 404 | 68.60173267326732 | 28.066142579151702 | 2.9 | 100.0 |
| dis | 404 | 3.7996663366336647 | 2.1099159643057357 | 1.1691 | 12.1265 |
| rad | 404 | 9.836633663366337 | 8.834741064787444 | 1.0 | 24.0 |


<font size=5>

We need to find the correlation between each feature column and the label medv. The correlation coefficient ranges from -1 to 1: the closer it is to -1 or 1, the more strongly the feature column is negatively or positively correlated with medv; the closer it is to 0, the weaker the linear relationship between the feature column and medv.

   
    
</font>
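As a rough illustration of what the correlation coefficient measures, it can be computed by hand in plain Python. This is only a sketch: `pearson_corr` is a hypothetical helper (not part of PySpark), and the values are invented toy data, not rows from boston_data.csv.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, invented for illustration only
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price_up = [10.0, 14.0, 18.0, 22.0, 26.0]    # rises linearly with rooms
price_down = [30.0, 25.0, 20.0, 15.0, 10.0]  # falls linearly with rooms

print(pearson_corr(rooms, price_up))    # close to 1
print(pearson_corr(rooms, price_down))  # close to -1
```

A feature that rises in step with the label gives a value near 1, and one that falls as the label rises gives a value near -1.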

In [65]:
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

In [9]:
import six
for i in boston_house_df.columns:
    if not( isinstance(boston_house_df.select(i).take(1)[0][0], six.string_types)):
        print( "Correlation to medv for ", i, boston_house_df.stat.corr('medv',i))

Correlation to medv for  crim -0.4009558757372438
Correlation to medv for  zn 0.355607582415516
Correlation to medv for  indus -0.5016982293419979
Correlation to medv for  chas 0.14140044808241922
Correlation to medv for  nox -0.4392251926056786
Correlation to medv for  rm 0.6835409939262136
Correlation to medv for  age -0.39086335148339485
Correlation to medv for  dis 0.26487595153417776
Correlation to medv for  rad -0.4235083975722877
Correlation to medv for  tax -0.49579240671703434
Correlation to medv for  ptratio -0.5063125552383506
Correlation to medv for  black 0.36007109188975617
Correlation to medv for  lstat -0.7426954940642168
Correlation to medv for  medv 1.0


<font size=5>

Spark ML requires the feature columns of a dataset to be combined into a single vector column before a model can be fit.
VectorAssembler converts a Spark DataFrame into a vectorized DataFrame by merging the listed input columns into one vector column.

</font>
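To make the idea of "vectorizing" concrete, here is a plain-Python sketch of what happens to a single row. `assemble` is a hypothetical helper and the row holds invented toy values; the real VectorAssembler produces a Spark ML vector, not a Python list.

```python
def assemble(row, input_cols, label_col):
    """Collect the chosen columns of one row into a feature list plus the label."""
    features = [float(row[c]) for c in input_cols]
    return {"features": features, label_col: row[label_col]}

# Toy row, invented for illustration only
row = {"crim": 0.1, "rm": 6.0, "lstat": 12.5, "medv": 21.0}
assembled = assemble(row, ["crim", "rm", "lstat"], "medv")
print(assembled)
```

The order of `input_cols` fixes the position of each feature in the vector, which is why the coefficients of a fitted model line up with that same order.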

In [68]:
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat'], outputCol = 'features')
#vectorAssembler = VectorAssembler(inputCols = ['rm'], outputCol = 'features')
vector_house_df = vectorAssembler.transform(boston_house_df)
vector_house_df = vector_house_df.select(['features', 'medv'])
vector_house_df.show(2)

+--------------------+----+
|            features|medv|
+--------------------+----+
|[0.15876,0.0,10.8...|21.7|
|[0.10328,25.0,5.1...|19.6|
+--------------------+----+
only showing top 2 rows



<font size=5>  

Now randomly split Spark Vectorized DataFrame (dataset) into training data (70%) and testing data (30%)
    
    
</font>
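A weighted random split assigns rows at random, so the resulting sizes are only approximately 70% and 30% (for example, 115 test rows out of 404 is about 28%). `randomSplit` also accepts an optional `seed` argument, which makes the split repeatable across runs. The sketch below mimics the idea in plain Python; `random_split` is a hypothetical helper and is not how Spark implements it internally.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. about [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            if u < b:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last boundary
            splits[-1].append(row)
    return splits

rows = list(range(404))  # stand-in for the 404 rows of the dataset
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # roughly a 70/30 split, not exact
```

Using the same seed again returns the same two lists, which is what makes results comparable between runs.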

In [69]:
splits = vector_house_df.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]
print(test_df.count())

115


In [70]:
train_df.show(2)

+--------------------+----+
|            features|medv|
+--------------------+----+
|[0.01301,35.0,1.5...|32.7|
|[0.01381,80.0,0.4...|50.0|
+--------------------+----+
only showing top 2 rows



<font size=5>

Let's start with Linear Regression: fit the Linear Regression model with train_df.
    
</font>

In [14]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = 'features', labelCol='medv', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [-0.04605858960728708,0.0009782224296996135,-0.012130255724909542,1.6497514727852698,-5.725161579044625,4.364815635088261,0.0,-0.5725774171192526,0.0,0.0,-0.8172417837138445,0.007950352749432139,-0.49227049878479]
Intercept: 18.969627314022954


<font size=5>
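Linear Regression produces slope coefficients and an intercept:

$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b$

a1, a2, ..., an are the coefficients, one for each feature

b is the intercept

x1, x2, ..., xn are the independent variables (the features)

Coefficients of exactly 0.0 (here for age, rad and tax) belong to features that the L1 part of the elastic-net penalty (regParam=0.3, elasticNetParam=0.8) has dropped from the model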

</font>
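The formula above can be evaluated directly once the coefficients and intercept are known. The sketch below uses small invented numbers rather than the fitted values, and `predict` is a hypothetical helper, not a Spark function.

```python
def predict(coefficients, intercept, features):
    """Compute y = a1*x1 + a2*x2 + ... + an*xn + b for one row."""
    return sum(a * x for a, x in zip(coefficients, features)) + intercept

# Toy values, invented for illustration only
coefficients = [2.0, -0.5]
intercept = 1.0
features = [3.0, 4.0]

y = predict(coefficients, intercept, features)
print(y)  # 2.0*3.0 + (-0.5)*4.0 + 1.0 = 5.0
```

A coefficient of 0.0 contributes nothing to the sum, so the corresponding feature has no effect on the prediction.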

In [15]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 4.514423
r2: 0.734197


In [16]:
train_df.describe().show()

+-------+------------------+
|summary|              medv|
+-------+------------------+
|  count|               301|
|   mean|22.162458471760797|
| stddev| 8.770922752810543|
|    min|               5.0|
|    max|              50.0|
+-------+------------------+



<font size=5>

Test the model with test_df. Testing produces metrics that evaluate the regressor's performance: RMSE and the R2 score.

  
    
</font>
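For reference, both metrics can be computed by hand from labels and predictions. This is a plain-Python sketch using the standard definitions (RMSE as the square root of the mean squared error, R2 as one minus the ratio of residual to total sum of squares); `rmse` and `r2` are hypothetical helpers and the numbers are invented toy data.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: typical size of a prediction error."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)

def r2(labels, predictions):
    """R squared: share of the label variance explained by the predictions."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Toy values, invented for illustration only
labels = [20.0, 22.0, 24.0, 26.0]
predictions = [21.0, 21.0, 25.0, 25.0]

print("RMSE: %g" % rmse(labels, predictions))
print("R2: %g" % r2(labels, predictions))
```

A lower RMSE is better and is expressed in the units of the label (here \$1000's for medv), while an R2 closer to 1 is better.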

In [17]:
lr_predictions = lr_model.transform(test_df)
lr_predictions.select("prediction","medv","features").show(5)
from pyspark.ml.evaluation import RegressionEvaluator
lr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
|26.276616151728096|33.0|[0.01951,17.5,1.3...|
| 38.45717891007576|50.0|[0.02009,95.0,2.6...|
| 26.78556110995195|16.5|[0.02498,0.0,1.89...|
|28.312244074548467|30.8|[0.02763,75.0,2.9...|
|26.387800739076635|25.0|[0.02875,28.0,15....|
+------------------+----+--------------------+
only showing top 5 rows

R Squared (R2) on test data = 0.707176


In [18]:
test_result = lr_model.evaluate(test_df)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_result.rootMeanSquaredError)

Root Mean Squared Error (RMSE) on test data = 4.87695


In [20]:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show(2)

numIterations: 11
objectiveHistory: [0.5000000000000004, 0.43301699522033144, 0.232197220657441, 0.2072378796302617, 0.17607126421866942, 0.17384247862056668, 0.1731252493266702, 0.1714446715795995, 0.17116908096115527, 0.17103599732933553, 0.17092588169896675]
+------------------+
|         residuals|
+------------------+
|-6.435796203387795|
|0.9012965616796755|
+------------------+
only showing top 2 rows




<font size=5>
    
Now try the Gradient-Boosted Tree (GBT) regressor with the same train_df and test_df
    
    
</font>

In [71]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator


In [72]:
gbt = GBTRegressor(featuresCol="features",labelCol='medv', maxIter=10)
gbt_model = gbt.fit(train_df)


In [73]:
gbt_predictions = gbt_model.transform(test_df)
gbt_predictions.select("prediction","medv","features").show(5)

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
| 27.82113650393453|24.0|[0.00632,18.0,2.3...|
|19.633749465438637|18.9|[0.0136,75.0,4.0,...|
|30.478603890145347|29.1|[0.01439,60.0,2.9...|
|28.708202626278265|24.5|[0.01501,80.0,2.0...|
|22.722511085699136|21.6|[0.02731,0.0,7.07...|
+------------------+----+--------------------+
only showing top 5 rows



<font size=5>

Test the model with test_df. Testing produces metrics that evaluate the regressor's performance: RMSE and the R2 score.

The metrics of the Gradient-Boosted Tree are better than those of Linear Regression: R2 of 0.788 versus 0.707, and RMSE of 3.73 versus 4.88.
    
    
</font>

In [77]:
from pyspark.ml.evaluation import RegressionEvaluator
gbt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="r2")
print("R Squared (R2) on test data = %g" % gbt_evaluator.evaluate(gbt_predictions))

R Squared (R2) on test data = 0.788076


In [78]:
gbt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="rmse")

In [79]:
print("RMSE on test data = %g" % gbt_evaluator.evaluate(gbt_predictions))

RMSE on test data = 3.73236


<font size=5>

Now try Random Forest Regressor with the same train_df and test_df
    
</font>

In [80]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator


In [81]:

rf = RandomForestRegressor(featuresCol="features",labelCol='medv', maxDepth=3)
rf_model = rf.fit(train_df)

In [82]:
rf_predictions = rf_model.transform(test_df)
rf_predictions.select("prediction","medv","features").show(5)

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
|27.388638589503927|24.0|[0.00632,18.0,2.3...|
|22.301061336550966|18.9|[0.0136,75.0,4.0,...|
| 28.23928643587368|29.1|[0.01439,60.0,2.9...|
| 25.37852450611539|24.5|[0.01501,80.0,2.0...|
| 23.28916174580568|21.6|[0.02731,0.0,7.07...|
+------------------+----+--------------------+
only showing top 5 rows



<font size=5>
    
Test the model with test_df. Testing produces metrics that evaluate the regressor's performance: RMSE and the R2 score.

It looks like the metrics of Random Forest are better than those of Linear Regression, and similar to those of the Gradient-Boosted Tree.
    
</font>

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
rf_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="r2")
print("R Squared (R2) on test data = %g" % rf_evaluator.evaluate(rf_predictions))

In [84]:
rf_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="rmse")

In [85]:
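# Use the Random Forest evaluator and predictions here; evaluating gbt_predictions
# with gbt_evaluator would only repeat the Gradient-Boosted Tree RMSE (3.73236)
print("RMSE on test data = %g" % rf_evaluator.evaluate(rf_predictions))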


<font size=5>

Finally, try Decision Tree regressor with the same train_df and test_df
    
    
</font>

In [86]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator


In [87]:
dt = DecisionTreeRegressor(featuresCol="features",labelCol='medv', maxDepth=3)
dt_model = dt.fit(train_df)

In [88]:
dt_predictions = dt_model.transform(test_df)
dt_predictions.select("prediction","medv","features").show(5)

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
|26.283050847457623|24.0|[0.00632,18.0,2.3...|
|            21.025|18.9|[0.0136,75.0,4.0,...|
|26.283050847457623|29.1|[0.01439,60.0,2.9...|
|26.283050847457623|24.5|[0.01501,80.0,2.0...|
|            21.025|21.6|[0.02731,0.0,7.07...|
+------------------+----+--------------------+
only showing top 5 rows



<font size=5>
    
Test the model with test_df. Testing produces metrics that evaluate the regressor's performance: RMSE and the R2 score.

The metrics of the Decision Tree regressor are slightly better than those of Linear Regression (R2 of 0.727 versus 0.707, RMSE of 4.70 versus 4.88), but not as good as those of the Gradient-Boosted Tree (R2 of 0.788, RMSE of 3.73) and Random Forest.

</font>

In [92]:
from pyspark.ml.evaluation import RegressionEvaluator
dt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="r2")
print("R Squared (R2) on test data = %g" % dt_evaluator.evaluate(dt_predictions))

R Squared (R2) on test data = 0.727042


In [90]:
dt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="rmse")

In [51]:
print("RMSE on test data = %g" % dt_evaluator.evaluate(dt_predictions))

RMSE on test data = 4.70311
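
The test-set metrics recorded in this notebook can be collected and ranked in a few lines of plain Python. The numbers below are copied from the outputs above; no Random Forest R2 or RMSE is available in the recorded outputs, so that model is left out. `reported` and `ranking` are illustrative names. The first test rows shown for Linear Regression differ from those shown for the tree models, which suggests the outputs come from different random splits, so treat the ranking as indicative only.

```python
# Test-set metrics copied from the outputs recorded in this notebook
reported = {
    "Linear Regression": {"r2": 0.707176, "rmse": 4.87695},
    "Gradient-Boosted Tree": {"r2": 0.788076, "rmse": 3.73236},
    "Decision Tree": {"r2": 0.727042, "rmse": 4.70311},
}

# Rank models from lowest to highest RMSE (lower is better)
ranking = sorted(reported, key=lambda name: reported[name]["rmse"])
for name in ranking:
    metrics = reported[name]
    print("%-22s R2 = %.3f  RMSE = %.3f" % (name, metrics["r2"], metrics["rmse"]))
```

Sorting by R2 in descending order gives the same order for these three models.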


<font size=5>

This concludes the testing of Spark ML regressors

</font>