Note: Since this notebook was created on Databricks, a Spark session is available by default as `spark` and does not need to be imported or created explicitly. In a Google Colab notebook, Spark and the other libraries would likely have to be set up and imported explicitly.

# Multiple Linear Regression

## A. Import Data
First, import the data that was uploaded to the Databricks File System (DBFS). Providing an explicit schema for the columns makes the import faster than letting PySpark infer it, because inference requires an extra pass over the data. Here, we allow schema inference.

In [0]:
# file location in DBFS
file_location = "/FileStore/tables/housing.csv"
# read the data from DBFS
df = spark.read.csv(path = file_location, header = True, inferSchema = True)
df.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              

In [0]:
df.count()

Out[7]: 20640

In [0]:
df.columns

Out[8]: ['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value',
 'ocean_proximity']
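With the column names known, the explicit-schema alternative could look roughly like the sketch below. It assumes the nine numeric columns are read as `DOUBLE` and `ocean_proximity` as `STRING`; `housing_schema_ddl` and `read_housing` are hypothetical names introduced for illustration, and the Spark session is passed in as an argument.

```python
# Column names as returned by df.columns. The types are an assumption:
# DOUBLE for every numeric column, STRING for the categorical one.
numeric_columns = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value",
]

# spark.read.csv accepts a DDL-formatted string as its schema argument.
housing_schema_ddl = ", ".join(
    [f"{name} DOUBLE" for name in numeric_columns] + ["ocean_proximity STRING"]
)


def read_housing(spark, path="/FileStore/tables/housing.csv"):
    """Read the CSV with a fixed schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=housing_schema_ddl)
```

In the notebook this would be called as `df = read_housing(spark)`.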

ASIDE: Create a temporary view of the Spark SQL dataframe.<br>
The lifetime of this temporary view is tied to the current SparkSession: the view is dropped once the session ends. A temporary view is useful if you want to query the same data multiple times within the notebook. It does not copy the underlying data anywhere.

In [0]:
df.createOrReplaceTempView("temp_table_view")

This view can be used to work with the data, for instance to display it with SQL as below.

In [0]:
%sql
SELECT * FROM temp_table_view
LIMIT 5

longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


For now, we will work only with the SQL DataFrame itself instead of its temporary view.

## B. Data Preprocessing
The DataFrame that we created has a specific schema. Its columns also need to be reorganized and otherwise transformed before the data can be used with the multiple linear regression model provided by the Spark MLlib API. In a real scenario, it is essential to verify that the data satisfies the assumptions of linear regression. That check is skipped here in order to focus on demonstrating linear regression model training with PySpark.

In [0]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)



In [0]:
df.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              

Among the given columns, **median_house_value** is the dependent variable, while the rest are independent variables. **ocean_proximity** is the only categorical variable; the rest are numerical.

### B.1 Handling Missing Values
Check for missing values and decide how to handle them. First, let's see whether any rows contain null values.

In [0]:
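# rows that have no null value in any column
df_without_null = df.dropna("any")
# rows removed by dropna, i.e. those with at least one null value
df_with_null = df.subtract(df_without_null)
df_with_null.show(10)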

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|   -122.1|   37.69|              41.0|      746.0|          null|     387.0|     161.0|       3.9063|          178400.0|       NEAR BAY|
|  -122.24|   37.75|              45.0|      891.0|          null|     384.0|     146.0|       4.9489|          247100.0|       NEAR BAY|
|  -121.95|   38.03|               5.0|     5526.0|          null|    3207.0|    1012.0|       4.0767|          143100.0|         INLAND|
|  -122.08|   37.88|              26.0|     2947.0|          null|     825.0|     626.0|        2.933|           85000.0|       NEAR BAY|
|  -122.28|   37.78|              

As seen above, there are records with null values. Let's calculate the total number of null values in each of the columns.

In [0]:
null_val_dict = {col:df.filter(df[col].isNull()).count() for col in df.columns}
null_val_dict

Out[15]: {'longitude': 0,
 'latitude': 0,
 'housing_median_age': 0,
 'total_rooms': 0,
 'total_bedrooms': 207,
 'population': 0,
 'households': 0,
 'median_income': 0,
 'median_house_value': 0,
 'ocean_proximity': 0}
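The dictionary comprehension above triggers one Spark job per column. A possible alternative, sketched below, computes all counts in a single aggregation using SQL expression strings. `null_count_exprs` and `null_counts` are hypothetical helper names, and the sketch assumes column names that are valid inside backticks.

```python
def null_count_exprs(columns):
    """Build one SQL aggregate per column that counts its null entries."""
    return [
        f"sum(cast(`{name}` IS NULL AS INT)) AS `{name}`" for name in columns
    ]


def null_counts(df):
    """Return {column: number of nulls}, computed in a single pass."""
    row = df.selectExpr(*null_count_exprs(df.columns)).first()
    return row.asDict()
```

Calling `null_counts(df)` on the DataFrame should give the same per-column counts as the dictionary above.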

Only the **total_bedrooms** column has null values (207 of them). There are different ways to deal with records that contain null values. Here, for simplicity, we filter out those records.

In [0]:
df = df.dropna("any")
df.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              

In [0]:
df.count()

Out[17]: 20433
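Dropping rows removes 207 of the 20,640 records (about 1%). Imputation is one of the other ways to handle nulls. The sketch below fills the missing values with the column median instead of dropping the rows. It illustrates that alternative only, it is not what this notebook does, and `fill_with_median` is a hypothetical helper that would have to be applied before `dropna` is called.

```python
def fill_with_median(df, column="total_bedrooms"):
    """Replace nulls in `column` with the median of its non-null values."""
    # relativeError=0.0 requests the exact quantile, which can be
    # expensive on large data; a small positive value trades accuracy
    # for speed.
    median = df.approxQuantile(column, [0.5], 0.0)[0]
    return df.fillna({column: median})
```

Unlike dropping rows, this keeps all 20,640 records, at the cost of introducing estimated values into the column.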

### B.2 Handling Categorical Features

In [0]:
df.select("ocean_proximity").distinct().collect()

Out[18]: [Row(ocean_proximity='ISLAND'),
 Row(ocean_proximity='NEAR OCEAN'),
 Row(ocean_proximity='NEAR BAY'),
 Row(ocean_proximity='<1H OCEAN'),
 Row(ocean_proximity='INLAND')]

As seen above, each housing record has one of five distinct values for the **ocean_proximity** attribute.

#### B.2.1 Label Encoding
Since most ML algorithms require numerical features, categorical features are converted into numerical ones using label encoding or one-hot encoding. One-hot encoding is usually preferred because label encoding gives a false sense of "intensity": a category that happens to be assigned a larger number looks larger to the model. Here, however, label encoding is used for demonstration purposes. The **StringIndexer** class performs label encoding in PySpark.

In [0]:
from pyspark.ml.feature import StringIndexer

In [0]:
string_indexer = StringIndexer(inputCol="ocean_proximity", outputCol="ocean_proximity_encoded")
df = string_indexer.fit(df).transform(df)
df = df.drop("ocean_proximity")
df.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+-----------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity_encoded|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+-----------------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|                    3.0|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|                    3.0|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|                    3.0|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|   
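To make the difference between the two encodings concrete, here is a plain-Python sketch with no Spark involved. It mimics the default ordering of `StringIndexer`, which assigns index 0 to the most frequent category, and then expands each index into a one-hot vector. The sample values and helper names are made up for illustration, and ties are broken alphabetically in this sketch.

```python
from collections import Counter

# Made-up sample of categories, for illustration only
values = ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "<1H OCEAN"]

# Label encoding: the most frequent category gets 0, the next gets 1, ...
counts = Counter(values)
ordered = sorted(counts, key=lambda cat: (-counts[cat], cat))
label_of = {cat: index for index, cat in enumerate(ordered)}
label_encoded = [label_of[v] for v in values]


def one_hot(index, size):
    """Return a list with a single 1 at `index` and 0 elsewhere."""
    return [1 if position == index else 0 for position in range(size)]


# One-hot encoding: one vector per record, one slot per category
one_hot_encoded = [one_hot(label, len(ordered)) for label in label_encoded]

print(label_of)
print(one_hot_encoded[1])
```

With label encoding, a linear model treats `NEAR BAY` (2) as twice `<1H OCEAN` (1). The one-hot vectors carry no such ordering.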

### B.3 Handling Numerical Features
To help the gradient descent (or ascent) algorithm converge faster, it is useful to bring all the numerical features to the same scale. However, since we have used label encoding, scaling the encoded column would give a false sense of "largeness" to its categories. Hence, for this demonstration, let us skip feature scaling for now.
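As a rough illustration of what scaling does, the following plain-Python sketch standardizes the `total_rooms` values of the first five records to zero mean and unit variance. It uses the population standard deviation and only illustrates the idea; it is not the scaler that Spark provides.

```python
from statistics import mean, pstdev

# total_rooms values of the first five rows of the dataset
total_rooms = [880.0, 7099.0, 1467.0, 1274.0, 1627.0]

mu = mean(total_rooms)
sigma = pstdev(total_rooms)

# z-score: subtract the mean, then divide by the standard deviation
standardized = [(value - mu) / sigma for value in total_rooms]

print([round(z, 3) for z in standardized])
```

After this transformation every feature is expressed in units of its own standard deviation, so no feature dominates merely because of its measurement scale.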

## C. Vector Assembling
Before the data can be passed to the ML models provided by PySpark, the input features must be combined into a single vector column, as PySpark requires. Notice that the target variable **median_house_value** is omitted from **inputCols** below.

In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
vector_assembler = VectorAssembler(
    inputCols=["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms",
               "population", "households", "median_income", "ocean_proximity_encoded"],
    outputCol="independent_features",
)
vectorized_features = vector_assembler.transform(df)
vectorized_features.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+-----------------------+--------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity_encoded|independent_features|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+-----------------------+--------------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|                    3.0|[-122.23,37.88,41...|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|                    3.0|[-122.22,37.86,21...|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|                    3.0|[-122.24,37.85,52...

As seen above, the transformation produces a new column that contains all the features together. This new column has the type `VectorUDT`, a **User-Defined Type (UDT)** that Spark ML uses for vectors.

In [0]:
vectorized_features.select("independent_features").show(5)

+--------------------+
|independent_features|
+--------------------+
|[-122.23,37.88,41...|
|[-122.22,37.86,21...|
|[-122.24,37.85,52...|
|[-122.25,37.85,52...|
|[-122.25,37.85,52...|
+--------------------+
only showing top 5 rows



In [0]:
prepared_data = vectorized_features.select("independent_features", "median_house_value")
prepared_data.show(5)

+--------------------+------------------+
|independent_features|median_house_value|
+--------------------+------------------+
|[-122.23,37.88,41...|          452600.0|
|[-122.22,37.86,21...|          358500.0|
|[-122.24,37.85,52...|          352100.0|
|[-122.25,37.85,52...|          341300.0|
|[-122.25,37.85,52...|          342200.0|
+--------------------+------------------+
only showing top 5 rows



## D. Model Training

In [0]:
from pyspark.ml.regression import LinearRegression

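# train-test split (80% / 20%); no seed is set, so the split differs between runs
train_data, test_data = prepared_data.randomSplit([0.8, 0.2])
# standardization=False: the features are not standardized before fitting
regressor = LinearRegression(featuresCol="independent_features", labelCol="median_house_value", maxIter=200, standardization=False)
# fit() returns the trained model; the name `regressor` is reused for it
regressor = regressor.fit(train_data)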

In [0]:
regressor.coefficients, regressor.intercept

Out[29]: (DenseVector([-43298.3578, -42760.0963, 1196.6226, -7.5649, 103.721, -43.0357, 66.362, 40032.0478, -1637.5999]),
 -3642223.304937113)

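The coefficients vary widely in magnitude because the features are on very different scales while the target is in the hundreds of thousands. Features with a small range, such as **longitude**, **latitude** and **median_income**, receive large coefficients, whereas features with a large range, such as **total_rooms** and **population**, receive small ones. The large intercept mostly offsets the large contributions of the longitude and latitude terms.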
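To see how the coefficients and the intercept combine, the sketch below recomputes a prediction by hand for the first record of the dataset: the dot product of the feature vector with the coefficients, plus the intercept. The coefficients are copied from the rounded output above, so the result only approximates what the fitted model itself would return.

```python
# Coefficients and intercept as printed above (coefficients are rounded)
coefficients = [-43298.3578, -42760.0963, 1196.6226, -7.5649, 103.721,
                -43.0357, 66.362, 40032.0478, -1637.5999]
intercept = -3642223.304937113

# First record, in the same order as the VectorAssembler inputCols:
# longitude, latitude, housing_median_age, total_rooms, total_bedrooms,
# population, households, median_income, ocean_proximity_encoded
features = [-122.23, 37.88, 41.0, 880.0, 129.0, 322.0, 126.0, 8.3252, 3.0]

# Linear regression prediction: intercept + sum(coefficient_i * feature_i)
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))

print(round(prediction))
```

The order of the coefficients follows the order of the columns in **inputCols**, which is why the features must be listed in that same order here.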

## E. Model Evaluation

In [0]:
test_predictions = regressor.evaluate(test_data)
test_predictions.predictions.show(5)

+--------------------+------------------+------------------+
|independent_features|median_house_value|        prediction|
+--------------------+------------------+------------------+
|[-124.35,40.54,52...|           94600.0|188646.22838411806|
|[-124.19,40.77,30...|           69000.0|145165.32279695896|
|[-124.18,40.79,40...|           64600.0| 99975.60574790556|
|[-124.17,40.77,30...|           81300.0| 117252.5025258218|
|[-124.17,40.8,52....|           75500.0|129241.83977745939|
+--------------------+------------------+------------------+
only showing top 5 rows



In [0]:
test_predictions.meanSquaredError, test_predictions.meanAbsoluteError, test_predictions.r2

Out[31]: (5165111140.376361, 51832.03286885627, 0.6149777410629)

As seen above, this simple model has large errors: an MSE of about 5.17 billion (an RMSE of roughly 71,900), an MAE of about 51,800 and an R² of about 0.61. There is considerable room for improvement through techniques such as feature selection, verification of the model assumptions, one-hot encoding and standardization/normalization. The usage of these techniques in PySpark will be explored later.
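Because the MSE is expressed in squared units of the target, its square root (the RMSE) is easier to compare with the house values themselves. A quick check using the MSE printed in the evaluation output:

```python
import math

# meanSquaredError copied from the evaluation output
mse = 5165111140.376361

# RMSE is in the same unit as median_house_value
rmse = math.sqrt(mse)

print(round(rmse, 1))
```

The evaluation summary also exposes this value directly as `rootMeanSquaredError`.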