## Housing costs prediction

**The main objective of the project** is to predict the median home value in a residential community based on housing data in California in 1990 using a linear regression model. 

**Metrics to assess the quality of the model:** RMSE, MAE, and R2.
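
As a reminder of what these three metrics measure, here is a minimal plain-Python sketch of their usual definitions. The helper `regression_metrics` and the toy values are illustrative assumptions and are not part of the notebook; the notebook itself computes the metrics with Spark's own evaluators.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                       # share of variance explained
    return rmse, mae, r2

# Toy values, not taken from the housing data
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(rmse, mae, r2)
```

RMSE and MAE are expressed in the units of the target (here, dollars), while R2 is unitless, which is why the notebook reports all three.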

**Process:**
1. [Read and preprocess the data](#preprocessing).
2. [Train the linear regression model in two variants (with and without the categorical feature)](#training).
3. [Select the best model and analyze the results](#results).

In this project we treat the dataset as "big data", so we use **PySpark**.

<a id='preprocessing'></a>
# Data preprocessing

In [1]:
# imports
import pandas as pd 
import numpy as np
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoder
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics

In [2]:
# spark-session initialization
spark = SparkSession.builder \
                    .master("local") \
                    .appName("EDA California Housing") \
                    .getOrCreate()

In [3]:
# data reading
df_housing = spark.read.load('/datasets/housing.csv', format='csv', inferSchema=True, header=True)
df_housing.show(5)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
|  -122.25|   37.85|              52.0|     1274.0|         235.0|     558.0|     219.0|       5.6431|          341300.0|       NEAR BAY|
|  -122.25|   37.85|              

In [4]:
# data types
print(pd.DataFrame(df_housing.dtypes, columns=['column', 'type']))

               column    type
0           longitude  double
1            latitude  double
2  housing_median_age  double
3         total_rooms  double
4      total_bedrooms  double
5          population  double
6          households  double
7       median_income  double
8  median_house_value  double
9     ocean_proximity  string


The dataframe contains **10 columns** with the following data:

- `longitude` - longitude;
- `latitude` - latitude;
- `housing_median_age` - median age of the houses in the housing estate;
- `total_rooms` - total number of rooms in the houses of the housing estate;
- `total_bedrooms` - total number of bedrooms in the houses of the housing estate;
- `population` - the number of people who live in the housing estate;
- `households` - number of households in the housing estate;
- `median_income` - median income of the inhabitants of the housing estate;
- `median_house_value` - median value of a house in the housing estate;
- `ocean_proximity` - proximity to the ocean.

The `median_house_value` column is the numeric **target variable**. All other attributes are numeric as well, except for `ocean_proximity`, which is a string.

In [5]:
# missing values
for col in df_housing.columns:
    print(col, df_housing.filter(F.col(col).isNull()).count())

longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 207
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0


Missing values are present only in the `total_bedrooms` column. Let's fill them with the median:

In [6]:
# median
bedrooms_median = df_housing.approxQuantile('total_bedrooms', [0.5], 0)[0]

In [7]:
df_housing = df_housing.na.fill(value=bedrooms_median, subset='total_bedrooms')

for col in df_housing.columns:
    print(col, df_housing.filter(F.col(col).isNull()).count())

longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
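
The median fill above can be pictured with a small plain-Python sketch. The helper `fill_with_median` and the toy column are assumptions made for illustration; the notebook itself uses `approxQuantile` and `na.fill` on the Spark dataframe.

```python
import statistics

def fill_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    return filled, median

# Toy column with two gaps, not the real total_bedrooms data
bedrooms = [129.0, None, 190.0, 235.0, None, 1106.0, 280.0]
filled, median = fill_with_median(bedrooms)
print(median, filled)
```

The median is used rather than the mean because it is less sensitive to a few very large values, such as the 1106.0 in the toy column.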


Before training the model, it is also necessary to encode the categorical features. In this dataset, there is only one.

In [8]:
stages = []

# categorical to numeric
indexer = StringIndexer(inputCols=['ocean_proximity'], 
                        outputCols=['ocean_proximity_idx'])

stages += [indexer]

In [9]:
# ohe
encoder = OneHotEncoder(inputCol='ocean_proximity_idx',
                        outputCol='ocean_proximity_ohe')

stages += [encoder]
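
To show what these two stages produce, here is a plain-Python sketch of the same idea. It assumes Spark's default settings: `StringIndexer` orders labels by descending frequency, and `OneHotEncoder` drops the last category. The helper names and the toy labels are illustrative (only `NEAR BAY` appears in the preview above).

```python
from collections import Counter

def index_by_frequency(labels):
    """Map each label to an index; the most frequent label gets index 0."""
    counts = Counter(labels)
    # Ties are broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, n_categories):
    """Encode an index as a vector of length n_categories - 1.

    The last category is represented by the all-zeros vector.
    """
    vec = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vec[index] = 1.0
    return vec

# Toy labels for illustration only
labels = ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN", "INLAND", "NEAR BAY"]
mapping = index_by_frequency(labels)
encoded = [one_hot_drop_last(mapping[label], len(mapping)) for label in labels]
print(mapping)
print(encoded)
```

Dropping one category avoids a redundant column: with an intercept in the linear model, the full set of indicator columns would always sum to one.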

The numerical features also need to be prepared by scaling their values.

In [10]:
numeric = df_housing.columns[:-2]

In [11]:
numeric

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']

In [12]:
# vectorization
numerical_assembler = VectorAssembler(inputCols=numeric, outputCol="numeric_idx")
stages += [numerical_assembler]

In [13]:
# scaling
standardScaler = StandardScaler(inputCol='numeric_idx', outputCol="numeric_scaled")
stages += [standardScaler]
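
With its default parameters (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. A rough plain-Python sketch of that behaviour for a single column follows; the helper `scale_to_unit_std` is hypothetical, and the values are the `median_income` entries from the four complete rows of the preview above.

```python
import statistics

def scale_to_unit_std(column):
    """Divide every value by the sample standard deviation (n - 1 denominator)."""
    std = statistics.stdev(column)
    return [x / std for x in column]

# median_income values from the rows shown in the data preview
median_income = [8.3252, 8.3014, 7.2574, 5.6431]
scaled = scale_to_unit_std(median_income)

print(statistics.stdev(scaled))  # the scaled column has unit standard deviation
print(statistics.mean(scaled))   # the mean is not shifted to zero
```

Scaling in this way puts features with very different ranges (for example `total_rooms` and `median_income`) on a comparable scale, which matters once a regularization term is applied to the coefficients.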

In [14]:
# final table
all_features = ['ocean_proximity_ohe','numeric_scaled']

final_assembler = VectorAssembler(inputCols=all_features, 
                                  outputCol="features") 

stages += [final_assembler]

<a id='training'></a>
# Model training

Splitting into sets for model training and testing.

In [15]:
train_data, test_data = df_housing.randomSplit([.8,.2], seed=123)
print(train_data.count(), test_data.count()) 

16442 4198
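
The two counts are close to, but not exactly, an 80/20 split of the 20640 rows. A likely reason is that rows are assigned to the sets at random rather than by exact count. The sketch below illustrates this idea in plain Python; it is a simplified assumption about how such a split behaves, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(20640))  # same number of rows as the housing dataframe
train, test = random_split(rows, 0.8, seed=123)
print(len(train), len(test))
```

The sizes produced by this sketch will differ from Spark's, because the random number generators are different, but fixing the seed makes either split repeatable.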


The chosen machine learning algorithm is *linear regression*. \
Let's compare two models: one trained on all features and one trained on the numerical features only.

In [16]:
# fit on full data
lr = LinearRegression(featuresCol = 'features', labelCol='median_house_value', regParam=0.3)

stages_full = stages.copy()
stages_full += [lr]

# pipeline creation
pipeline = Pipeline(stages=stages_full)
lr_model = pipeline.fit(train_data)

# model predictions
predictions = lr_model.transform(test_data)
predictedLabes = predictions.select('median_house_value', 'prediction')

# to RDD
predictionLabes_list = [(float(i[1]), float(i[0])) for i in predictedLabes.collect()]
predictionAndObservations = spark.sparkContext.parallelize(predictionLabes_list)

# metrics
metrics = RegressionMetrics(predictionAndObservations)

print('RMSE:', round(metrics.rootMeanSquaredError))
print('MAE:', round(metrics.meanAbsoluteError))
print('R2:', round(metrics.r2, 2))

RMSE: 67538
MAE: 49798
R2: 0.65


In [17]:
from pyspark.ml.evaluation import RegressionEvaluator
r2 = \
        RegressionEvaluator(labelCol='median_house_value',
                                            metricName='r2').evaluate(predictedLabes)
mae = \
        RegressionEvaluator(labelCol='median_house_value',
                                            metricName='mae').evaluate(predictedLabes)
rmse = \
        RegressionEvaluator(labelCol='median_house_value',
                                            metricName='rmse').evaluate(predictedLabes)
print('RMSE:', round(rmse, 2))
print('MAE:', round(mae, 2))
print('R2:', round(r2, 2))

RMSE: 67538.45
MAE: 49798.26
R2: 0.65


In [18]:
# fit on only numeric data
lr = LinearRegression(featuresCol = 'numeric_scaled', labelCol='median_house_value', regParam=0.3)

stages_part = stages.copy()
stages_part += [lr]

# pipeline creation
pipeline = Pipeline(stages=stages_part)
lr_model = pipeline.fit(train_data)

# model predictions
predictions = lr_model.transform(test_data)
predictedLabes = predictions.select('median_house_value', 'prediction')

# to RDD
predictionLabes_list = [(float(i[1]), float(i[0])) for i in predictedLabes.collect()]
predictionAndObservations = spark.sparkContext.parallelize(predictionLabes_list)

# metrics
metrics = RegressionMetrics(predictionAndObservations)

print('RMSE:', round(metrics.rootMeanSquaredError))
print('MAE:', round(metrics.meanAbsoluteError))
print('R2:', round(metrics.r2, 2))

RMSE: 68335
MAE: 50796
R2: 0.64


In [19]:
r2 = \
        RegressionEvaluator(labelCol='median_house_value',
                                            metricName='r2').evaluate(predictedLabes)
mae = \
        RegressionEvaluator(labelCol='median_house_value',
                                            metricName='mae').evaluate(predictedLabes)
rmse = \
        RegressionEvaluator(labelCol='median_house_value',
                                            metricName='rmse').evaluate(predictedLabes)
print('RMSE:', round(rmse, 2))
print('MAE:', round(mae, 2))
print('R2:', round(r2, 2))

RMSE: 68335.15
MAE: 50795.84
R2: 0.64


In [20]:
# close spark-session
spark.stop()

<a id='results'></a>
# Results

In this project, *PySpark* was used to build two linear regression models that predict the median home value in a residential community based on housing data in California in 1990. All available attributes were used to train the first model, and only the numerical attributes were used to train the second. The metrics selected to evaluate the models are RMSE (root mean squared error), MAE (mean absolute error), and R2.

In the preprocessing stage, missing values were filled with the column median, the categorical feature was encoded using *One Hot Encoding*, and the numerical features were scaled to unit standard deviation so that they are on a comparable scale.

**Training on the full dataset proved to be slightly more accurate.** The model that takes all attributes into account explains a larger proportion of the variance of the target variable (R2 of **0.65** versus 0.64) and also has lower prediction errors (RMSE and MAE).
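
To put the difference between the two models in perspective, the relative error reduction can be computed from the metrics printed in the training section. This is a small arithmetic sketch; the helper `relative_reduction` is an illustrative name.

```python
# Metrics reported above for the two models
full = {"rmse": 67538.45, "mae": 49798.26, "r2": 0.65}
numeric_only = {"rmse": 68335.15, "mae": 50795.84, "r2": 0.64}

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

rmse_gain = relative_reduction(numeric_only["rmse"], full["rmse"])
mae_gain = relative_reduction(numeric_only["mae"], full["mae"])
print(f"RMSE reduced by {rmse_gain:.2%}, MAE reduced by {mae_gain:.2%}")
```

Adding the categorical feature lowers RMSE by a little over 1% and MAE by about 2%, so the improvement is consistent but modest.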