Note: This notebook was created on Databricks, where Spark is available by default and does not need to be imported explicitly. In a Google Colab notebook, Spark and the other libraries would probably have to be imported explicitly.

# Multiple Linear Regression

## A. Import Data
First, import the data that was uploaded to the Databricks File System (DBFS). Providing an explicit schema for the columns makes the import faster than letting PySpark infer it, because inference requires an extra pass over the data. Here, we allow for schema inference.

In [0]:
# file location in DBFS
file_location = "/FileStore/tables/housing.csv"
# read the data from DBFS
df = spark.read.csv(path = file_location, header = True, inferSchema = True)
display(df)

longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY
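
For comparison, the explicit-schema alternative mentioned above could look roughly like the sketch below. It assumes every numeric column is a double and `ocean_proximity` is a string; `read_housing` and `schema_ddl` are hypothetical names introduced only for this illustration. `spark.read.csv` accepts either a `StructType` or a DDL-formatted string for its `schema` argument, and the string form is used here.

```python
# Column names are taken from the header row shown above. Treating every
# numeric column as DOUBLE is an assumption made for this sketch.
NUMERIC_COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value",
]

# Build a DDL-formatted schema string of the form "name TYPE, name TYPE, ...".
schema_ddl = ", ".join(f"{name} DOUBLE" for name in NUMERIC_COLUMNS)
schema_ddl += ", ocean_proximity STRING"


def read_housing(spark, path="/FileStore/tables/housing.csv"):
    """Read the CSV with an explicit schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=schema_ddl)

# In the notebook: df = read_housing(spark)
```

With an explicit schema, a value that does not fit the declared type is read as null under the default permissive mode, so it is worth checking the result once against the inferred version.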


In [0]:
df.columns

Out[20]: ['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value',
 'ocean_proximity']

ASIDE: Create a temporary view of the Spark SQL DataFrame.<br>
The lifetime of this temporary view is tied to the current SparkSession, and the view is dropped once the session ends. The temporary view is useful if you want to access the same data multiple times within the notebook. It does not copy the underlying data anywhere.

In [0]:
df.createOrReplaceTempView("temp_table_view")

This view can be used to query the data with SQL, for instance to display the first few rows:

In [0]:
%sql
SELECT * FROM temp_table_view
LIMIT 5

longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
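
The same view can also be queried from a Python cell through `spark.sql`, which returns a regular DataFrame. The sketch below counts the records per `ocean_proximity` category; `QUERY` and `records_per_proximity` are hypothetical names used only for this illustration.

```python
# Name of the temporary view registered with createOrReplaceTempView above.
VIEW_NAME = "temp_table_view"

# Count the house records in each ocean_proximity category.
QUERY = f"""
SELECT ocean_proximity, COUNT(*) AS n_records
FROM {VIEW_NAME}
GROUP BY ocean_proximity
ORDER BY n_records DESC
"""


def records_per_proximity(spark):
    """Run the aggregate against the temporary view and return a DataFrame."""
    return spark.sql(QUERY)

# In the notebook: display(records_per_proximity(spark))
```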


For now, we will work only with the SQL DataFrame itself instead of its temporary view.

## B. Data Preprocessing
The DataFrame that we created has a specific schema. Its columns need to be reorganized, and other changes made, before the data can be used in the Multiple Linear Regression model provided by the Spark MLlib API.

In [0]:
df.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
 |-- ocean_proximity: string (nullable = true)



In [0]:
df.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
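|  -122.25|   37.85|              52.0|     1627.0|         280.0|     565.0|     259.0|       3.8462|          342200.0|       NEAR BAY|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+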

Among the given columns, **median_house_value** is the dependent variable while the rest are the independent variables. **ocean_proximity** is the only categorical variable while the rest are numerical variables.

### B.1 Handling Categorical Features

In [0]:
df.select("ocean_proximity").distinct().collect()

Out[22]: [Row(ocean_proximity='ISLAND'),
 Row(ocean_proximity='NEAR OCEAN'),
 Row(ocean_proximity='NEAR BAY'),
 Row(ocean_proximity='<1H OCEAN'),
 Row(ocean_proximity='INLAND')]

As seen above, each house record has one of five possible distinct values for the **ocean_proximity** attribute.

In [0]:
# TODO
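
The cell above is still a placeholder. One possible way to fill it in is to turn **ocean_proximity** into 0/1 indicator columns. MLlib also provides `StringIndexer` and `OneHotEncoder` for this purpose; the sketch below instead uses plain SQL expressions with `selectExpr`, which keeps the resulting columns easy to inspect. The indicator column names, the choice of `<1H OCEAN` as the reference category, and the helper functions are all assumptions made for illustration.

```python
# The five categories returned by the distinct() call above, mapped to
# hypothetical, SQL-safe column names chosen for this sketch.
CATEGORY_COLUMNS = {
    "<1H OCEAN": "prox_lt_1h_ocean",
    "INLAND": "prox_inland",
    "ISLAND": "prox_island",
    "NEAR BAY": "prox_near_bay",
    "NEAR OCEAN": "prox_near_ocean",
}

# One category is left out as the reference level so that the indicator
# columns are not perfectly collinear with the regression intercept.
REFERENCE_CATEGORY = "<1H OCEAN"


def indicator_expressions(categories=CATEGORY_COLUMNS, reference=REFERENCE_CATEGORY):
    """Return one SQL expression per non-reference category."""
    return [
        f"CAST(ocean_proximity = '{value}' AS DOUBLE) AS {column}"
        for value, column in categories.items()
        if value != reference
    ]


def add_proximity_indicators(df):
    """Append 0.0/1.0 indicator columns to a Spark DataFrame."""
    return df.selectExpr("*", *indicator_expressions())

# In the notebook: df_encoded = add_proximity_indicators(df)
```

Dropping one category mirrors the usual dummy-variable convention for linear regression: a record in the reference category has all four indicators equal to 0.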

### B.2 Handling Numerical Features

In [0]:
# TODO
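
This cell is also a placeholder. MLlib estimators expect the predictors packed into a single vector column, so one plausible step here is to assemble the numerical columns with `VectorAssembler`. The sketch below is a guess at that step, not the notebook's final approach: the output column name `numeric_features`, the helper function, and the decision to skip rows containing nulls (`handleInvalid="skip"`) are assumptions.

```python
# Numeric predictors from the schema above. median_house_value is the label
# and ocean_proximity is categorical, so both are excluded here.
LABEL_COL = "median_house_value"
NUMERIC_FEATURE_COLS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
]


def assemble_numeric_features(df, output_col="numeric_features"):
    """Pack the numeric predictors into one vector column."""
    # Local import keeps the MLlib dependency next to its only use.
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(
        inputCols=NUMERIC_FEATURE_COLS,
        outputCol=output_col,
        handleInvalid="skip",  # assumption: drop rows with nulls rather than fail
    )
    return assembler.transform(df)

# In the notebook: df_features = assemble_numeric_features(df)
```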