## Spark ML : Linear Regression Example1


### Concepts :

* Creating RDD using SparkContext
* Providing schema to create a DataFrame from an RDD
* Performing basic data analysis using Spark SQL
* Using Spark ML to train a linear regression model

### Input Dataset :

* California Housing Dataset: housing prices per census block group. Each row in the dataset corresponds to one block group, i.e. a group of residents living in a geographically compact area

### Objective :

* Build a model that is able to predict the median house price

### Dataset Details:

Features : 

* Latitude 
* Longitude
* Housing median age : median age of the houses in a block group 
* Total rooms : total number of rooms in the houses of the block group 
* Total bedrooms : total number of bedrooms in the houses of the block group
* Population : number of inhabitants of a block group 
* Households : units of houses and their occupants per block group 
* Median income : median income of people that belong to a block group 

Target :

* Median house value 

### Overall Workflow

1. Load Data
2. Inspect Data
3. Preprocess Data
4. Create Model
5. Make Predictions
6. Evaluate how good our predictions are

In [1]:
import os
my_home=os.environ['HOME']
dataset_path=my_home+"/spark-course/data/housing_data/"
outputs_path=my_home

In [2]:
import os
print(os.environ['SPARK_HOME'])

/usr/hdp/current/spark2-client


In [3]:
import findspark
findspark.init()
import pyspark

In [4]:
# Create a SparkSession and specify configuration
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Lab5-ML-LinearRegression-Example") \
    .getOrCreate()

In [5]:
spark.version

'2.1.1.2.6.2.14-5'

### Data Loading

Direct inspection shows that the input data has no header.
There are several ways to tackle this. Three ways of providing the schema needed to construct the DataFrame are shown here:
  * option 1 : use the Row object, constructing a DataFrame from Row objects (remember that a DataFrame is a Dataset[Row])
  * option 2 : infer the schema from the data and add a header
  * option 3 : manually provide a schema using the StructType construct
  * there are possibly other options, even simpler ones ... try providing one yourself

In [6]:
def readLine(line):
    """ Parse a line from the input data
    Args:
        line (str): a line (row) of the input data file
    Returns:
        Row : Row object containing the parsed elements from the line
        Note we are adding schema by directly transforming the str into double types
    """
    
    parts = line.split(",")
   
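    # Read in each feature PLUS THE TARGET
    # NOTE : in the df.show() output, values such as -122.23 land under `latitude`,
    # so the first two columns of the file are probably longitude then latitude;
    # swap the two assignments below if that holds for your copy of the data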
    lat = parts[0]
    lon = parts[1]
    age = parts[2]
    trm = parts[3]
    tbr = parts[4]
    pop = parts[5]
    hou = parts[6]
    inc = parts[7]
    val = parts[8]
        
    return Row(
                latitude=float(lat),
                longitude=float(lon),
                median_housing_age=float(age),
                total_rooms=float(trm),
                total_bedrooms=float(tbr),
                population=float(pop),
                households=float(hou),
                median_income=float(inc),
                median_value=float(val)
           )

In [7]:
# ---------
# Option 1 : use SparkContext and a function to map each line to a Row object
# ---------
from pyspark.sql import Row
import re
sc=spark.sparkContext
rdd = sc.textFile("file://"+dataset_path+"*.data")
#
df = rdd \
        .map(lambda line: readLine(line)) \
        .toDF()

In [8]:
df.show(10)

+----------+--------+---------+------------------+-------------+------------+----------+--------------+-----------+
|households|latitude|longitude|median_housing_age|median_income|median_value|population|total_bedrooms|total_rooms|
+----------+--------+---------+------------------+-------------+------------+----------+--------------+-----------+
|     126.0| -122.23|    37.88|              41.0|       8.3252|    452600.0|     322.0|         129.0|      880.0|
|    1138.0| -122.22|    37.86|              21.0|       8.3014|    358500.0|    2401.0|        1106.0|     7099.0|
|     177.0| -122.24|    37.85|              52.0|       7.2574|    352100.0|     496.0|         190.0|     1467.0|
|     219.0| -122.25|    37.85|              52.0|       5.6431|    341300.0|     558.0|         235.0|     1274.0|
|     259.0| -122.25|    37.85|              52.0|       3.8462|    342200.0|     565.0|         280.0|     1627.0|
|     193.0| -122.25|    37.85|              52.0|       4.0368|    2697

In [9]:
df.printSchema()

root
 |-- households: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- median_housing_age: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_value: double (nullable = true)
 |-- population: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- total_rooms: double (nullable = true)



In [10]:
# ---------
# Option 2 :  use SparkSession and infer schema, then add a header
# ---------

df2 = spark.read \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"*.data")

In [11]:
df2.printSchema()

root
 |-- _c0: double (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: double (nullable = true)
 |-- _c5: double (nullable = true)
 |-- _c6: double (nullable = true)
 |-- _c7: double (nullable = true)
 |-- _c8: double (nullable = true)



In [12]:
   
features=[ "latitude","longitude","median_housing_age", \
            "total_rooms","total_bedrooms","population", \
            "households","median_income"]
target=["median_value"]

fieldnames=features+target

rawnames=df2.schema.names

# Create a small function
def updateColNames(df,oldnames,newnames):
    for i in range(len(newnames)):
        df=df.withColumnRenamed(oldnames[i], newnames[i])
    return df

df2=updateColNames(df2,rawnames,fieldnames)

df2.printSchema()

root
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- median_housing_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_value: double (nullable = true)



In [13]:
# ---------
# Option 3 :  manually provide a Schema
# ---------

from pyspark.sql.types import StructType, StructField, DoubleType

fieldnames=[ "latitude","longitude","median_housing_age", \
            "total_rooms","total_bedrooms","population", \
            "households","median_income","median_value"]

# Every column is read as a nullable double
fields = [StructField(field_name, DoubleType(), True) for field_name in fieldnames]
schema = StructType(fields)

# Pass the schema to the CSV reader instead of inferring it
df3 = spark.read \
    .schema(schema) \
    .csv("file://"+dataset_path+"*.data")

In [14]:
# Create a temporary view for SQL access
# (registerTempTable is deprecated since Spark 2.0)
df.createOrReplaceTempView("houses")

### Data Inspection

In [15]:
# Records
df.count()

20640

In [16]:
# Summary statistics on the selected (or full) set of fields of the DataFrame
# ( Remember the total rooms are PER BLOCK GROUP of the census, not per house )
df.select('total_rooms','total_bedrooms','median_income','population').describe().show()

+-------+------------------+-----------------+------------------+------------------+
|summary|       total_rooms|   total_bedrooms|     median_income|        population|
+-------+------------------+-----------------+------------------+------------------+
|  count|             20640|            20640|             20640|             20640|
|   mean|2635.7630813953488|537.8980135658915|3.8706710029070246|1425.4767441860465|
| stddev|2181.6152515827944| 421.247905943133| 1.899821717945263|  1132.46212176534|
|    min|               2.0|              1.0|            0.4999|               3.0|
|    max|           39320.0|           6445.0|           15.0001|           35682.0|
+-------+------------------+-----------------+------------------+------------------+



### Worth Noting Here

Note that the **standard deviation is, in almost all cases, of the same order as the mean value**.

**This means there is a large spread in our data**, which points to the need to **normalize our data** in some way.
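
One way to make this comparison concrete is the ratio of standard deviation to mean (the coefficient of variation). The sketch below only recomputes that ratio from rounded values copied from the summary table; the `summary` and `cv` names are illustrative and not part of the original notebook.

```python
# Rounded (mean, stddev) pairs copied from the describe() output above
summary = {
    "total_rooms":    (2635.76, 2181.62),
    "total_bedrooms": (537.90, 421.25),
    "median_income":  (3.8707, 1.8998),
    "population":     (1425.48, 1132.46),
}

# Coefficient of variation = stddev / mean
cv = {name: std / mean for name, (mean, std) in summary.items()}

for name, ratio in cv.items():
    print("%-15s stddev/mean = %.2f" % (name, ratio))
```

All four ratios fall between roughly 0.5 and 0.8, which is what "of the same order as the mean" amounts to here.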

### Look for Correlations

In [17]:
# Create a small function
# that computes the correlation of each column against the target
# Computes Pearson Correlation Coefficient between the two columns
def computeCorrelation(df,targetColumnName):
    for col in df.columns:
        r=df.stat.corr(col,targetColumnName)
        print("Pearson correlation : %s %s %f \n" %(col,targetColumnName,r))

In [18]:
# NOTE : there seems to be a Spark installation problem with the stats lib
# FIXME !
computeCorrelation(df, 'median_value')

Pearson correlation : households median_value 0.065843 

Pearson correlation : latitude median_value -0.045967 

Pearson correlation : longitude median_value -0.144160 

Pearson correlation : median_housing_age median_value 0.105623 

Pearson correlation : median_income median_value 0.688075 

Pearson correlation : median_value median_value 1.000000 

Pearson correlation : population median_value -0.024650 

Pearson correlation : total_bedrooms median_value 0.050594 

Pearson correlation : total_rooms median_value 0.134153 
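
For reference, the Pearson coefficient is the covariance of two columns divided by the product of their standard deviations. The following is a minimal pure-Python sketch of that formula on made-up toy data; `pearson` is a hypothetical helper for illustration, not the routine Spark uses internally.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (made up): a perfectly linear relationship gives r = 1 or r = -1
r_up = pearson([1, 2, 3, 4], [10, 20, 30, 40])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Values near 0, such as the one for `population`, indicate almost no linear relationship with the target, while `median_income` at about 0.69 is by far the strongest linear predictor in this list.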



### Data Visualization


You would typically take a subsample (without replacement here) of the data JUST for plotting purposes.

Sampling with or without replacement has important statistical differences (selection bias), but here we are just plotting.

A useful explanation of the implications of sampling with or without replacement:

https://www.ma.utexas.edu/users/parker/sampling/repl.htm
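
The difference between the two sampling modes can be sketched with Python's standard `random` module; the list of integers below is just a stand-in for row indices, and the seed is arbitrary. In PySpark the corresponding call would be `DataFrame.sample(withReplacement, fraction, seed)`, which is not exercised here.

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable
population = list(range(20))     # stand-in for row indices

# Without replacement: every picked element is distinct
without = rng.sample(population, 10)

# With replacement: the same element may be picked more than once
with_repl = rng.choices(population, k=10)

print(len(set(without)), "distinct values out of", len(without))
print(len(set(with_repl)), "distinct values out of", len(with_repl))
```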

Linear Relationships (plotting):
https://seaborn.pydata.org/tutorial/regression.html

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Convert to pandas once and reuse it (toPandas() collects all rows to the driver)
pdf = df.toPandas()
sns.lmplot(x="median_income", y="median_value", data=pdf, fit_reg=True, markers=".")
#
# SIMPLE least squares fit
#
# IMPORTANT NOTE : you will normally want to scale your data
# before doing this; it is here just for simple demo purposes
#
# With full=True, polyfit returns the coefficients first, then the diagnostics
coeffs, resi, rank, sing, rcond = np.polyfit(pdf['median_income'], pdf['median_value'], 1, full=True)
print('residuals : %f' % resi[0])

In [None]:
import seaborn as sns
sns.set(style="ticks", color_codes=True )
sns.pairplot(df.toPandas(), markers=".")
plt.show()

### Data Preprocessing

**Adjust our target variable**

As noted earlier, it is recommended to scale the target variable before creating the model. Doing so avoids problems during model creation and when computing predictions, caused by large values and possible outliers in the data. It also eases the scaling process later.

**Add some 'new' features** to our existing set of features.
These are derived quantities that could help us predict the median_value of a house:

1. Rooms per household : average number of rooms per household in a block group
2. Population per household : average number of people per household in a block group 
3. Bedrooms per room : fraction of the rooms in a block group that are bedrooms
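
As a quick worked example, the three ratios and the rescaled target can be computed by hand for the first row shown in the `df.show()` output. This is plain Python arithmetic for illustration only; the `row` dictionary and the snake_case variable names are not part of the notebook.

```python
# First row of the dataset, copied from the df.show() output
row = {
    "total_rooms": 880.0,
    "total_bedrooms": 129.0,
    "population": 322.0,
    "households": 126.0,
    "median_value": 452600.0,
}

rooms_per_household = row["total_rooms"] / row["households"]
population_per_household = row["population"] / row["households"]
bedrooms_per_room = row["total_bedrooms"] / row["total_rooms"]
scaled_value = row["median_value"] / 100000   # target divided by 100,000

print("rooms per household      : %.3f" % rooms_per_household)
print("population per household : %.3f" % population_per_household)
print("bedrooms per room        : %.3f" % bedrooms_per_room)
print("scaled median value      : %.3f" % scaled_value)
```

For this block group that gives about 7 rooms and 2.6 people per household, with roughly 15% of the rooms being bedrooms.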

In [None]:
# Adjust the values of `median_value`
# We will express the median_value in units of 100,000
from pyspark.sql.functions import col
df = df.withColumn("median_value", col("median_value")/100000)

In [None]:
from pyspark.sql.functions import *
# 1.
roomsPerHousehold = df['total_rooms']/df['households']

# 2.
populationPerHousehold = df['population']/df['households']

# 3.
bedroomsPerRoom = df['total_bedrooms']/df['total_rooms']

# Add the new columns to `df`
df = df.withColumn("roomsPerHousehold",roomsPerHousehold) \
       .withColumn("populationPerHousehold",populationPerHousehold) \
       .withColumn("bedroomsPerRoom", bedroomsPerRoom)
   
# Check what is the output
df.first()

In [None]:
# Re-order and select columns
df = df.select('median_value',
              'total_bedrooms', 
              'population', 
              'households', 
              'median_income', 
              'roomsPerHousehold', 
              'populationPerHousehold', 
              'bedroomsPerRoom'
              )

In [None]:
from pyspark.ml.linalg import DenseVector

# We need to transform each row of features into a vector
# for the algorithm : in particular a DenseVector
# A DenseVector stores every value explicitly, whereas a SparseVector stores only the non-zero entries.
# The fewer zero entries a vector has, the denser it is.

# A full view of data types in the RDD-based API :
# https://spark.apache.org/docs/2.2.0/mllib-data-types.html


# Define the input data
# The median value (row[0]) is our target variable ( the label )
# The rest of the values row[1:] are our features
data = df.rdd.map(lambda row: (row[0], DenseVector(row[1:])))

# Replace df with the new DataFrame
df = spark.createDataFrame(data, ["label", "features"])

In [None]:
df.toPandas().head(4)

### Feature Scaling ( Standardization )

At this stage we can see that the features are not scaled.

Scaling features is a very common pre-processing step. It can improve the convergence rate of the optimization process, and it prevents features with very large variances from exerting an overly large influence during model training.
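
With its default settings (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. A minimal pure-Python sketch of that operation on a single column follows; the helper names are hypothetical and the five values are the first `population` entries from the `df.show()` output.

```python
import math

def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def scale_to_unit_std(values):
    """Divide every value by the column's sample standard deviation; no centering."""
    std = sample_std(values)
    return [v / std for v in values]

# First five population values from the df.show() output
population = [322.0, 2401.0, 496.0, 558.0, 565.0]
scaled = scale_to_unit_std(population)

print(scaled)
print("std after scaling :", sample_std(scaled))
```

After scaling, the column has unit standard deviation, so columns that were originally on very different scales become comparable.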

In [None]:
from pyspark.ml.feature import StandardScaler

standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")

# Fit the scaler to the DataFrame
scaler = standardScaler.fit(df)

# Transform the data in df using the scaler
scaled_df = scaler.transform(df)

# Inspect the result
scaled_df.take(2)

In [None]:
# Split the data into train and test sets
train_data, test_data = scaled_df.randomSplit([.8,.2],seed=1234)
print('Training records : %d' % train_data.count())
print('Test records : %d ' % test_data.count())

### Model Creation

In [None]:
from pyspark.ml.regression import LinearRegression

# Train on the scaled feature vectors produced by the StandardScaler
lr = LinearRegression(featuresCol="features_scaled", labelCol="label", maxIter=10, elasticNetParam=0.8)

# Fit one model per regularization strength
linearModelA = lr.fit(train_data,{lr.regParam:0.1})
linearModelB = lr.fit(train_data,{lr.regParam:0.3})
linearModelC = lr.fit(train_data,{lr.regParam:0.6})
# Generate predictions for each model
predictedA = linearModelA.transform(test_data)
predictedB = linearModelB.transform(test_data)
predictedC = linearModelC.transform(test_data)
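
The two regularization parameters interact as follows: `regParam` sets the overall strength of the penalty, and `elasticNetParam` sets the mix between L1 (lasso) and L2 (ridge). The sketch below evaluates the penalty term as described in the Spark ML documentation for a made-up weight vector; `elastic_net_penalty` is a hypothetical helper written for illustration, not a Spark API.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

weights = [0.5, -0.2, 0.0]   # made-up coefficients

# Same mixing parameter as in the notebook, three regularization strengths
for reg_param in (0.1, 0.3, 0.6):
    print(reg_param, elastic_net_penalty(weights, reg_param, 0.8))
```

With `elasticNetParam=0.8` the penalty is mostly L1, which tends to push small coefficients to exactly zero; a larger `regParam` strengthens that effect.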

### Model Evaluation

In [None]:
predictedB.toPandas().head(10)

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate the models on the test set
# This is a regression problem, so use RegressionEvaluator (RMSE here; lower is better)
evaluator = RegressionEvaluator(labelCol='label', predictionCol='prediction', metricName='rmse')
scoreA = evaluator.evaluate(predictedA)
scoreB = evaluator.evaluate(predictedB)
scoreC = evaluator.evaluate(predictedC)
print('RMSE for model A is : %f' % scoreA )
print('RMSE for model B is : %f' % scoreB )
print('RMSE for model C is : %f' % scoreC )

In [None]:
# Get the RMSE on the training data ( the standard deviation of the residuals,
# where residual = predicted - observed )
# It indicates the absolute fit of the model to the data,
# i.e. how close the observed data points are to the model's predicted values
# The smaller the RMSE, the closer predicted and observed values are.
linearModelA.summary.rootMeanSquaredError

In [None]:
# Get the R2 on the training data
# The R2 (coefficient of determination) is a relative measure of fit:
# the fraction of the variance of the label that the model explains.
# Spark returns it as a fraction, so 1.0 corresponds to 100%.
# 0% indicates that the model explains none of the variability of the response data around its mean, 
# and 100% indicates the opposite: it explains all the variability. 
# In general, the higher the R-squared, the better the model fits your data.
linearModelA.summary.r2
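
Both metrics are easy to reproduce by hand, which helps when interpreting them. The following pure-Python sketch uses made-up observed and predicted values; `rmse` and `r2` are hypothetical helpers for illustration, not the Spark implementation.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Toy values (made up)
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print("RMSE : %.4f" % rmse(observed, predicted))
print("R2   : %.4f" % r2(observed, predicted))
```

RMSE is expressed in the units of the label, whereas R2 is unitless, so the two are best read together.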

In [None]:
spark.stop()