## Predicting housing prices

In this project you need to train a linear regression model on 1990 California housing data. The goal is to predict the median house value in a residential block. Train the model and make predictions on the test set. Use the RMSE, MAE, and R2 metrics to evaluate model quality.

# Data preparation

Import the required modules:

In [1]:
import numpy as np
import pandas as pd

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler, OneHotEncoderEstimator 
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
    
pd.set_option('max_colwidth', 400)

RANDOM_SEED = 2022

Create a Spark session and read the dataset into a `DataFrame`:

In [2]:
spark = SparkSession.builder \
                    .master("local") \
                    .appName("California Housing") \
                    .getOrCreate()

df = spark.read.load('...', format='csv', header=True, inferSchema=True)
df.show(3)

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
|  -122.23|   37.88|              41.0|      880.0|         129.0|     322.0|     126.0|       8.3252|          452600.0|       NEAR BAY|
|  -122.22|   37.86|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|       8.3014|          358500.0|       NEAR BAY|
|  -122.24|   37.85|              52.0|     1467.0|         190.0|     496.0|     177.0|       7.2574|          352100.0|       NEAR BAY|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
only showing top 3 rows



Print summary statistics and the data type of each column:

In [3]:
display(df.describe().toPandas())
display(df.dtypes)

Unnamed: 0,summary,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0,20640
1,mean,-119.56970445736148,35.6318614341087,28.639486434108527,2635.7630813953488,537.8705525375618,1425.4767441860463,499.5396802325581,3.8706710029070246,206855.81690891477,
2,stddev,2.003531723502584,2.135952397457101,12.58555761211163,2181.6152515827944,421.3850700740312,1132.46212176534,382.3297528316098,1.899821717945263,115395.6158744136,
3,min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0,<1H OCEAN
4,max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0,NEAR OCEAN


[('longitude', 'double'),
 ('latitude', 'double'),
 ('housing_median_age', 'double'),
 ('total_rooms', 'double'),
 ('total_bedrooms', 'double'),
 ('population', 'double'),
 ('households', 'double'),
 ('median_income', 'double'),
 ('median_house_value', 'double'),
 ('ocean_proximity', 'string')]

Check the data for missing values:

In [4]:
for c in df.columns:
    print(f'N/A count for column "{c}" = {df.filter(df[c].isNull()).count()}')

N/A count for column "longitude" = 0
N/A count for column "latitude" = 0
N/A count for column "housing_median_age" = 0
N/A count for column "total_rooms" = 0
N/A count for column "total_bedrooms" = 207
N/A count for column "population" = 0
N/A count for column "households" = 0
N/A count for column "median_income" = 0
N/A count for column "median_house_value" = 0
N/A count for column "ocean_proximity" = 0


Only the `total_bedrooms` column has missing values.

Fill them with the mean of that column, then confirm that no missing values remain:

In [5]:
avg_bedrooms = df.select(F.mean('total_bedrooms')).collect()[0][0]
# Restrict the fill to total_bedrooms so no other numeric column is touched
df = df.fillna({'total_bedrooms': avg_bedrooms})

for c in df.columns:
    print(f'N/A count for column "{c}" = {df.filter(df[c].isNull()).count()}')

N/A count for column "longitude" = 0
N/A count for column "latitude" = 0
N/A count for column "housing_median_age" = 0
N/A count for column "total_rooms" = 0
N/A count for column "total_bedrooms" = 0
N/A count for column "population" = 0
N/A count for column "households" = 0
N/A count for column "median_income" = 0
N/A count for column "median_house_value" = 0
N/A count for column "ocean_proximity" = 0
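The imputation step above can be sketched without Spark. This is a minimal illustration in plain Python, assuming a list of dictionaries as a stand-in for the DataFrame; the helper `impute_mean` and the toy rows are hypothetical and not part of the notebook.

```python
from statistics import mean

# Toy rows standing in for the real dataset (illustrative values only).
rows = [
    {"total_rooms": 880.0, "total_bedrooms": 129.0},
    {"total_rooms": 7099.0, "total_bedrooms": None},
    {"total_rooms": 1467.0, "total_bedrooms": 190.0},
]


def impute_mean(rows, column):
    """Return copies of rows with None in `column` replaced by the column mean."""
    # The mean is computed over observed values only, as F.mean does in Spark.
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = mean(observed)
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in rows
    ]


filled = impute_mean(rows, "total_bedrooms")
print([r["total_bedrooms"] for r in filled])
```

Only the named column is filled, and the original rows are left unchanged, which keeps the imputation easy to audit.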


Convert the `ocean_proximity` column to numeric form with `StringIndexer` and `OneHotEncoderEstimator` (in Spark 3.0 and later this class is named `OneHotEncoder`):

In [6]:
numerical_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                    'total_bedrooms', 'population', 'households', 'median_income']
target = "median_house_value"

display(df.toPandas().sample(5, random_state=12345))

indexer = StringIndexer(inputCol='ocean_proximity', outputCol='ocean_proximity_idx')
df = indexer.fit(df).transform(df)

display(df.toPandas().sample(5, random_state=12345))

encoder = OneHotEncoderEstimator(inputCols=['ocean_proximity_idx'], outputCols=['categorical_features'])
df = encoder.fit(df).transform(df)

display(df.toPandas().sample(5, random_state=12345))

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
18219,-122.07,37.41,26.0,1184.0,225.0,815.0,218.0,5.7657,322300.0,NEAR BAY
10848,-117.91,33.66,26.0,5761.0,1326.0,2681.0,1116.0,4.0341,243300.0,<1H OCEAN
15119,-116.93,32.85,15.0,3273.0,895.0,1872.0,842.0,2.5388,119000.0,<1H OCEAN
8997,-118.34,34.0,44.0,3183.0,513.0,1183.0,473.0,5.0407,314900.0,<1H OCEAN
12807,-121.45,38.61,34.0,438.0,116.0,263.0,100.0,0.9379,67500.0,INLAND


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,ocean_proximity_idx
18219,-122.07,37.41,26.0,1184.0,225.0,815.0,218.0,5.7657,322300.0,NEAR BAY,3.0
10848,-117.91,33.66,26.0,5761.0,1326.0,2681.0,1116.0,4.0341,243300.0,<1H OCEAN,0.0
15119,-116.93,32.85,15.0,3273.0,895.0,1872.0,842.0,2.5388,119000.0,<1H OCEAN,0.0
8997,-118.34,34.0,44.0,3183.0,513.0,1183.0,473.0,5.0407,314900.0,<1H OCEAN,0.0
12807,-121.45,38.61,34.0,438.0,116.0,263.0,100.0,0.9379,67500.0,INLAND,1.0


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,ocean_proximity_idx,categorical_features
18219,-122.07,37.41,26.0,1184.0,225.0,815.0,218.0,5.7657,322300.0,NEAR BAY,3.0,"(0.0, 0.0, 0.0, 1.0)"
10848,-117.91,33.66,26.0,5761.0,1326.0,2681.0,1116.0,4.0341,243300.0,<1H OCEAN,0.0,"(1.0, 0.0, 0.0, 0.0)"
15119,-116.93,32.85,15.0,3273.0,895.0,1872.0,842.0,2.5388,119000.0,<1H OCEAN,0.0,"(1.0, 0.0, 0.0, 0.0)"
8997,-118.34,34.0,44.0,3183.0,513.0,1183.0,473.0,5.0407,314900.0,<1H OCEAN,0.0,"(1.0, 0.0, 0.0, 0.0)"
12807,-121.45,38.61,34.0,438.0,116.0,263.0,100.0,0.9379,67500.0,INLAND,1.0,"(0.0, 1.0, 0.0, 0.0)"
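To make the two encoding steps concrete, here is a rough plain-Python sketch of what they do conceptually. It assumes the default behaviour of ordering labels by descending frequency and of dropping the last category in the one-hot vector; the function names, the alphabetical tie-break, and the toy labels are assumptions for illustration, not Spark's implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here; that is a choice made for this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """Encode an index as a vector of length num_categories - 1.

    The last category is represented by the all-zeros vector.
    """
    vec = [0.0] * (num_categories - 1)
    if int(index) < num_categories - 1:
        vec[int(index)] = 1.0
    return vec


# Hypothetical sample of labels, not the real column.
labels = ["<1H OCEAN", "INLAND", "<1H OCEAN", "NEAR BAY", "INLAND", "<1H OCEAN"]
mapping = fit_string_index(labels)
encoded = [one_hot_drop_last(mapping[v], len(mapping)) for v in labels]
print(mapping)
print(encoded)
```

Dropping the last category avoids a redundant column, which would be consistent with the four-element vectors shown in the output above if the column has five distinct values.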


Assemble the numeric columns into a single vector, then scale them with `StandardScaler`:

In [7]:
numerical_assembler = VectorAssembler(inputCols=numerical_cols, outputCol="numerical_features_raw")
df = numerical_assembler.transform(df)

standardScaler = StandardScaler(inputCol='numerical_features_raw', outputCol="numerical_features")
df = standardScaler.fit(df).transform(df)
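With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its standard deviation without subtracting the mean. The sketch below checks this by hand, using the standard deviations from the `describe()` output above and the first data row; the helper `scale_by_std` is a hypothetical name used only for this illustration.

```python
def scale_by_std(value, std):
    """Scale to unit standard deviation without centering."""
    return value / std


# Standard deviations copied from the describe() output above.
longitude_std = 2.003531723502584
median_income_std = 1.899821717945263

# First row of the dataset: longitude = -122.23, median_income = 8.3252.
scaled_longitude = scale_by_std(-122.23, longitude_std)
scaled_income = scale_by_std(8.3252, median_income_std)
print(scaled_longitude, scaled_income)
```

This explains why scaled longitude values sit near -61 rather than near zero: the data are rescaled but not centered. The results should agree with the first row of the `df_final` output shown later in the notebook.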

Build the final dataset `df_final`. Its `features` column holds all features (including the encoded categorical one), and its `numerical_features` column holds the numeric features only:

In [8]:
all_features = 'features'
numerical_features = 'numerical_features'

final_assembler = VectorAssembler(inputCols=['categorical_features', 'numerical_features'], outputCol=all_features)
df_final = final_assembler.transform(df)[all_features, numerical_features, target]

display(df_final.toPandas().head(5))

Unnamed: 0,features,numerical_features,median_house_value
0,"[0.0, 0.0, 0.0, 1.0, -61.00726959606955, 17.734477624640412, 3.2577023016083064, 0.40337085073160667, 0.30768013087921575, 0.2843362208866199, 0.3295584480852433, 4.382095394195227]","[-61.00726959606955, 17.734477624640412, 3.2577023016083064, 0.40337085073160667, 0.30768013087921575, 0.2843362208866199, 0.3295584480852433, 4.382095394195227]",452600.0
1,"[0.0, 0.0, 0.0, 1.0, -61.002278409814444, 17.725114120086744, 1.668579227653035, 3.2540109878905406, 2.6379397267628883, 2.1201592122632746, 2.9764882057222772, 4.369567902917918]","[-61.002278409814444, 17.725114120086744, 1.668579227653035, 3.2540109878905406, 2.6379397267628883, 2.1201592122632746, 2.9764882057222772, 4.369567902917918]",358500.0
2,"[0.0, 0.0, 0.0, 1.0, -61.012260782324645, 17.720432367809913, 4.131719992283705, 0.6724375432082579, 0.453172285791093, 0.4379837439744208, 0.4629511532626037, 3.820042655291457]","[-61.012260782324645, 17.720432367809913, 4.131719992283705, 0.6724375432082579, 0.453172285791093, 0.4379837439744208, 0.4629511532626037, 3.820042655291457]",352100.0
3,"[0.0, 0.0, 0.0, 1.0, -61.01725196857974, 17.720432367809913, 4.131719992283705, 0.5839709816273487, 0.5605025640047729, 0.4927317119712234, 0.5728039692910182, 2.970331345671345]","[-61.01725196857974, 17.720432367809913, 4.131719992283705, 0.5839709816273487, 0.5605025640047729, 0.4927317119712234, 0.5728039692910182, 2.970331345671345]",341300.0
4,"[0.0, 0.0, 0.0, 1.0, -61.01725196857974, 17.720432367809913, 4.131719992283705, 0.7457776978867319, 0.6678328422184528, 0.4989129341644108, 0.6774256988418891, 2.024505754234575]","[-61.01725196857974, 17.720432367809913, 4.131719992283705, 0.7457776978867319, 0.6678328422184528, 0.4989129341644108, 0.6774256988418891, 2.024505754234575]",342200.0


# Model training

Split the data into training and test sets:

In [9]:
train_data, test_data = df_final.randomSplit([0.8, 0.2], seed=RANDOM_SEED)

train_data_all_features = train_data[all_features, target]
test_data_all_features = test_data[all_features, target]

train_data_numerical_features = train_data[numerical_features, target]
test_data_numerical_features = test_data[numerical_features, target]

print(f'Train dataset split size: {train_data.count()}')
display(train_data_all_features.toPandas().sample(5, random_state=12345))
display(train_data_numerical_features.toPandas().sample(5, random_state=12345))

print(f'Test dataset split size: {test_data.count()}')
display(test_data_all_features.toPandas().sample(5, random_state=12345))
display(test_data_numerical_features.toPandas().sample(5, random_state=12345))

Train dataset split size: 16437


Unnamed: 0,features,median_house_value
655,"[0.0, 0.0, 0.0, 1.0, -61.042207899855235, 17.7766133951319, 2.780965379421725, 0.8301188757669768, 0.899189219701274, 0.8044419168562447, 0.8892847011824027, 1.7764824815510607]",149700.0
4799,"[0.0, 1.0, 0.0, 0.0, -60.642912999447354, 18.024746265804005, 2.3042284572351437, 0.9392123558512069, 0.7870887069003194, 0.6949459808626394, 0.8082028607804776, 1.969342683400041]",98500.0
3334,"[0.0, 0.0, 1.0, 0.0, -58.506685282265174, 15.332738706625486, 2.7015092257239615, 2.750714176409513, 2.6498653132310754, 2.3435662429599047, 2.80386235196334, 2.414858171514061]",291000.0
6984,"[0.0, 1.0, 0.0, 0.0, -59.674622865958234, 16.901125719364277, 2.3042284572351437, 0.19435140989795593, 0.18603914890371184, 0.2507810146950313, 0.19093465643033938, 0.8060229997034487]",43800.0
1743,"[0.0, 0.0, 0.0, 1.0, -60.90744587096758, 17.58466155178178, 1.9069476887463257, 3.870526663156462, 3.5585950021068986, 3.9259591244158765, 3.8997749690087127, 2.4727583412831327]",240300.0


Unnamed: 0,numerical_features,median_house_value
655,"[-61.042207899855235, 17.7766133951319, 2.780965379421725, 0.8301188757669768, 0.899189219701274, 0.8044419168562447, 0.8892847011824027, 1.7764824815510607]",149700.0
4799,"[-60.642912999447354, 18.024746265804005, 2.3042284572351437, 0.9392123558512069, 0.7870887069003194, 0.6949459808626394, 0.8082028607804776, 1.969342683400041]",98500.0
3334,"[-58.506685282265174, 15.332738706625486, 2.7015092257239615, 2.750714176409513, 2.6498653132310754, 2.3435662429599047, 2.80386235196334, 2.414858171514061]",291000.0
6984,"[-59.674622865958234, 16.901125719364277, 2.3042284572351437, 0.19435140989795593, 0.18603914890371184, 0.2507810146950313, 0.19093465643033938, 0.8060229997034487]",43800.0
1743,"[-60.90744587096758, 17.58466155178178, 1.9069476887463257, 3.870526663156462, 3.5585950021068986, 3.9259591244158765, 3.8997749690087127, 2.4727583412831327]",240300.0


Test dataset split size: 4203


Unnamed: 0,features,median_house_value
2184,"[0.0, 1.0, 0.0, 0.0, -58.58155307609165, 15.978820520828332, 1.3507546128619807, 1.228905966831179, 1.1210051280095459, 1.4313944535938228, 1.2005343465962435, 2.039665071410477]",118500.0
2636,"[1.0, 0.0, 0.0, 0.0, -60.77767502833501, 17.28971115834135, 1.5891230739552715, 1.1858186259575756, 1.3046591596196202, 1.3545706920499223, 1.412393348936757, 1.283015125564655]",190400.0
2628,"[1.0, 0.0, 0.0, 0.0, -60.79264858710031, 17.467617744860974, 1.1918423054664535, 0.828285371900015, 0.9015743369949113, 1.1276315343857568, 0.8919002444211744, 2.3710119520434825]",164500.0
1864,"[0.0, 1.0, 0.0, 0.0, -59.529878464560376, 16.624902335031177, 2.22477230353738, 0.4991714277803632, 0.4269359955610823, 0.4803692675848486, 0.4969532153666368, 1.6990541636144203]",95800.0
2562,"[1.0, 0.0, 0.0, 0.0, -60.8375692633962, 17.537844029013456, 0.3972807684888179, 0.696273093478762, 0.44363181661654366, 0.6225373780281586, 0.48649104241154967, 5.463565292445541]",500001.0


Unnamed: 0,numerical_features,median_house_value
2184,"[-58.58155307609165, 15.978820520828332, 1.3507546128619807, 1.228905966831179, 1.1210051280095459, 1.4313944535938228, 1.2005343465962435, 2.039665071410477]",118500.0
2636,"[-60.77767502833501, 17.28971115834135, 1.5891230739552715, 1.1858186259575756, 1.3046591596196202, 1.3545706920499223, 1.412393348936757, 1.283015125564655]",190400.0
2628,"[-60.79264858710031, 17.467617744860974, 1.1918423054664535, 0.828285371900015, 0.9015743369949113, 1.1276315343857568, 0.8919002444211744, 2.3710119520434825]",164500.0
1864,"[-59.529878464560376, 16.624902335031177, 2.22477230353738, 0.4991714277803632, 0.4269359955610823, 0.4803692675848486, 0.4969532153666368, 1.6990541636144203]",95800.0
2562,"[-60.8375692633962, 17.537844029013456, 0.3972807684888179, 0.696273093478762, 0.44363181661654366, 0.6225373780281586, 0.48649104241154967, 5.463565292445541]",500001.0
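The split sizes above (16437 and 4203) are close to, but not exactly, 80% and 20% of 20640 rows. That is expected when each row is assigned to a part independently at random, which is how `randomSplit` behaves. The sketch below imitates the idea in plain Python; `random_split` is a hypothetical helper, and it will not reproduce Spark's actual row assignment.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to a part with probability given by weights."""
    total = sum(weights)
    # Cumulative upper bounds of each part on the [0, 1) interval.
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The final part also catches any floating-point rounding slack.
            if u < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


train, test = random_split(list(range(20640)), [0.8, 0.2], seed=2022)
print(len(train), len(test), len(train) / 20640)
```

Because the assignment is random per row, the fractions only approximate the requested weights; fixing the seed makes the split reproducible.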


Train two linear regression models on the training sets, one using all features and the other using numeric features only:

In [10]:
lr_model_all_features = (LinearRegression(featuresCol=all_features, labelCol=target)
                             .fit(train_data_all_features))
lr_model_numerical_features = (LinearRegression(featuresCol=numerical_features, labelCol=target)
                                   .fit(train_data_numerical_features))
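With default parameters (`regParam=0.0`), `LinearRegression` fits ordinary least squares. As a reminder of what that means, here is a minimal single-feature sketch using the closed-form solution. The function `fit_simple_ols` and the toy data are assumptions for illustration; Spark's solver works on multi-dimensional feature vectors and is implemented differently.

```python
def fit_simple_ols(xs, ys):
    """Fit y = slope * x + intercept by minimizing the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)
```

A linear model places no bound on its output, which is why a fitted model can produce negative price predictions for some rows.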

Make predictions on the test sets:

In [11]:
predictions_all_features = lr_model_all_features.transform(test_data_all_features)
predictions_numeric_features = lr_model_numerical_features.transform(test_data_numerical_features)

display(predictions_all_features.select('prediction', target)
                                .orderBy('median_house_value')
                                .toPandas()
                                .head(5))
display(predictions_numeric_features.select('prediction', target)
                                    .orderBy('median_house_value')
                                    .toPandas()
                                    .head(5))

Unnamed: 0,prediction,median_house_value
0,56720.56362,14999.0
1,151837.258589,22500.0
2,-12506.432852,25000.0
3,143144.503754,26900.0
4,71838.427295,32500.0


Unnamed: 0,prediction,median_house_value
0,61036.31631,14999.0
1,148148.672206,22500.0
2,-18312.056778,25000.0
3,175756.884199,26900.0
4,77464.172454,32500.0


Evaluate both models using the `RMSE`, `MAE`, and `R2` metrics:

In [12]:
def evaluate_for_metric(df_predictions, features_col_name, metric_name):
    evaluator = RegressionEvaluator(labelCol=target, predictionCol='prediction', metricName=metric_name)
    metric_value = evaluator.evaluate(df_predictions)
    features_used = "all" if features_col_name == all_features else "numeric"
    print(f'\tMetric "{metric_name.upper()}" for dataset with features "{features_used.upper()}" = {metric_value}')

# RMSE
print('RMSE:')
evaluate_for_metric(predictions_all_features, all_features, 'rmse')
evaluate_for_metric(predictions_numeric_features, numerical_features, 'rmse')
print()

# MAE
print('MAE:')
evaluate_for_metric(predictions_all_features, all_features, 'mae')
evaluate_for_metric(predictions_numeric_features, numerical_features, 'mae')
print()

# R2
print('R2:')
evaluate_for_metric(predictions_all_features, all_features, 'r2')
evaluate_for_metric(predictions_numeric_features, numerical_features, 'r2')
print()

RMSE:
	Metric "RMSE" for dataset with features "ALL" = 68135.01224890398
	Metric "RMSE" for dataset with features "NUMERIC" = 69347.88195014803

MAE:
	Metric "MAE" for dataset with features "ALL" = 49634.74027664459
	Metric "MAE" for dataset with features "NUMERIC" = 50810.477940772354

R2:
	Metric "R2" for dataset with features "ALL" = 0.6490707068972161
	Metric "R2" for dataset with features "NUMERIC" = 0.6364657385915713



# Analysis of results

On the test set, the linear regression model trained on all features (including the ocean-proximity feature) is more accurate than the model trained on numeric features only, on all three metrics.

For `RMSE` and `MAE`, lower values are better, and the all-features model has the lower error:

- `68135` versus `69348` for `RMSE`;
- `49635` versus `50810` for `MAE`.

For `R2`, higher values are better, and the all-features model again scores higher:

- `0.649` versus `0.636` for `R2`.

The difference is small (about 1.7% in `RMSE`), so the ocean-proximity feature improves the predictions only modestly.
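The comparison can be checked mechanically from the metric values printed by the evaluation cell. The sketch below is a small plain-Python helper; the dictionary layout and the `best_model` function are assumptions made for this illustration, and the numbers are copied from the output above.

```python
# Metric values copied from the evaluation output above.
metrics = {
    "rmse": {"all": 68135.01224890398, "numeric": 69347.88195014803},
    "mae": {"all": 49634.74027664459, "numeric": 50810.477940772354},
    "r2": {"all": 0.6490707068972161, "numeric": 0.6364657385915713},
}

# RMSE and MAE are errors (lower is better); R2 is a score (higher is better).
higher_is_better = {"rmse": False, "mae": False, "r2": True}


def best_model(metric_name):
    """Return the name of the feature set that wins on the given metric."""
    scores = metrics[metric_name]
    pick = max if higher_is_better[metric_name] else min
    return pick(scores, key=scores.get)


winners = {name: best_model(name) for name in metrics}
print(winners)
```

Encoding the direction of each metric explicitly guards against reading a lower error as a worse result.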