#Предсказание цены дома

Данный проект посвещен предсказанию цены дома. Объекты датасета имеют 79 призаков, которые описывают здания города Эймс. Данные взяты с kaggle соревнования [Housing Prices](https://www.kaggle.com/competitions/home-data-for-ml-course).

Целевая переменная:


*   SalePrice - Стоимость недвижимости в долларах.







## Импортирование библиотек и загрузка датасета

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [3]:
df_train = pd.read_csv('/content/gdrive/MyDrive/datasets/House_pricing_train.csv')
df_test = pd.read_csv('/content/gdrive/MyDrive/datasets/House_pricing_test.csv')
df_train

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


## Анализ данных

Узнаем количество пропущенных значений в столбцах train датасета:

In [4]:
missing_counts = df_train.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(columns_with_missing)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
BsmtExposure      38
BsmtFinType2      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
Electrical         1
dtype: int64


### Обработка пропущенных данных

В документации к данным указано, что означают пропущенные значения в колонках.



*   PoolQC: NA означает "No Pool".
*   MiscFeature: NA означает "None".
*   Alley: NA означает "No alley access"
*   Fence: NA означает "No Fence"
*   MasVnrType: NA означает "None"
*   FireplaceQu: NA означает "No Fireplace"
*   GarageType, GarageFinish, GarageQual, GarageCond: NA означает "No Garage"
*   BsmtExposure, BsmtFinType2, BsmtQual, BsmtCond, BsmtFinType1: NA означает "No Basement"            




In [5]:
columns = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'MasVnrType', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtExposure', 'BsmtFinType2', 'BsmtQual', 'BsmtCond', 'BsmtFinType1']
for col in columns:
  df_train[col] = df_train[col].fillna('None')


Для остальных столбцов в документации не указано значение пропусков. Выведем эти столбцы.

In [6]:
remained_columns = df_train.columns[df_train.isna().any()].tolist()
remained_columns

['LotFrontage', 'MasVnrArea', 'Electrical', 'GarageYrBlt']

In [7]:
for col in remained_columns:
  print(f"Процент пропущенных значений для столбца {col} равен: {100* df_train[col].isna().sum()/ len(df_train)}%")

Процент пропущенных значений для столбца LotFrontage равен: 17.73972602739726%
Процент пропущенных значений для столбца MasVnrArea равен: 0.547945205479452%
Процент пропущенных значений для столбца Electrical равен: 0.0684931506849315%
Процент пропущенных значений для столбца GarageYrBlt равен: 5.5479452054794525%


Удалим ячейки с пропущенными значениями.

In [8]:
# Удаляем все NaN
df_train.dropna(inplace=True)

Проверяем, что пропущенных значений действительно не осталось

In [9]:
df_train.isna().sum().sum()

np.int64(0)

### Создание новых признаков

Новые столбцы можно получить путем сложения нескольких заданных:

In [10]:
df_train['GeneralSquare'] = df_train['1stFlrSF'] + df_train['2ndFlrSF'] + df_train['TotalBsmtSF']
df_train['AmountBath'] = df_train["FullBath"] + df_train["BsmtFullBath"] + 0.5*(df_train["HalfBath"] + df_train["BsmtHalfBath"])
df_train['TotalPorch'] = df_train['OpenPorchSF'] + df_train['EnclosedPorch'] + df_train['3SsnPorch'] + df_train['ScreenPorch']

### Категориальные признаки

**Label Encoding**

Выведем категориальные признаки c нечисловыми значениями:

In [11]:
cat_features = []
for feature in df_train.columns:
  if df_train[feature].dtype == 'O':
    cat_features.append(feature)
print(cat_features)

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']


In [12]:
from sklearn.preprocessing import LabelEncoder

label_encoding = LabelEncoder()

df_train_changed = df_train

for column in cat_features:
  df_train_changed[column] = label_encoding.fit_transform(df_train[column])

In [13]:
df_train_changed.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,GeneralSquare,AmountBath,TotalPorch
0,1,60,3,65.0,8450,1,1,3,3,0,...,0,0,2,2008,8,4,208500,2566,3.5,61
1,2,20,3,80.0,9600,1,1,3,3,0,...,0,0,5,2007,8,4,181500,2524,2.5,0
2,3,60,3,68.0,11250,1,1,0,3,0,...,0,0,9,2008,8,4,223500,2706,3.5,42
3,4,70,3,60.0,9550,1,1,0,3,0,...,0,0,2,2006,8,0,140000,2473,2.0,307
4,5,60,3,84.0,14260,1,1,0,3,0,...,0,0,12,2008,8,4,250000,3343,3.5,84


## Подготовка данных для обучения модели

Подготовим обучающую и тестирующую части. Сперва избавимся от пропусков в test датасете и добавим введенные признаки.

---



In [14]:
test_features = df_test.columns.tolist()
num_features = list(set(test_features) ^ set(cat_features))

In [15]:
for col in cat_features:
  df_test[col] = df_test[col].fillna('None')

for col in num_features:
  mean_Age = np.mean(df_train[col].dropna().values)
  df_test[col] = df_test[col].fillna(mean_Age)

df_test['GeneralSquare'] = df_test['1stFlrSF'] + df_test['2ndFlrSF'] + df_test['TotalBsmtSF']
df_test['AmountBath'] = df_test["FullBath"] + df_test["BsmtFullBath"] + 0.5*(df_test["HalfBath"] + df_test["BsmtHalfBath"])
df_test['TotalPorch'] = df_test['OpenPorchSF'] + df_test['EnclosedPorch'] + df_test['3SsnPorch'] + df_test['ScreenPorch']

label_encoding_test = LabelEncoder()

df_test_changed = df_test

for column in cat_features:
  df_test_changed[column] = label_encoding_test.fit_transform(df_test[column])

Выделим в train датасете валидационную часть.

In [16]:
from sklearn.model_selection import train_test_split

X = df_train.drop(columns=['Id', 'SalePrice']).values
y = df_train['SalePrice'].values

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y
    , train_size=0.80
    , test_size=0.20
    , shuffle=True
)

X_changed = df_train_changed.drop(columns=['Id', 'SalePrice']).values
y_changed = df_train_changed['SalePrice'].values

X_train_changed, X_valid_changed, y_train_changed, y_valid_changed = train_test_split(
    X_changed, y_changed
    , train_size=0.80
    , test_size=0.20
    , shuffle=True
)

Отдельно выделим данные, на которых будем делать предсказания.

In [17]:
X_test = df_test.values
X_test_changed = df_test_changed.values

## Бустинг

In [18]:
!pip install xgboost -q
!pip install catboost -q

In [19]:
import catboost
import xgboost
from sklearn.metrics import mean_squared_error
rmse = lambda predictions, real_values: np.sqrt(mean_squared_error(predictions, real_values))

### XGBoost

In [20]:
boosting_model = xgboost.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)

boosting_model.fit(X_train_changed, y_train_changed)

y_predicted = boosting_model.predict(X_valid_changed)

In [21]:
xgb_rmse = rmse(y_predicted, y_valid_changed)
print('XGBoost Regressor RMSE: ', xgb_rmse)

XGBoost Regressor RMSE:  19947.208727037476


In [22]:
print('Train Score: %.2f%%' % (boosting_model.score(X_train_changed, y_train_changed) * 100))
print('Validation Score: %.2f%%' % (boosting_model.score(X_valid_changed, y_valid_changed) * 100))

Train Score: 97.16%
Validation Score: 93.87%


### CatBoost

При использовании catboost не надо кодировать категориальные признаки, поэтому передаем данные без кодировки.

In [23]:
catboost_model = catboost.CatBoostRegressor(learning_rate=0.1, loss_function='RMSE', depth=6)

catboost_model.fit(X_train, y_train)

y_predicted = boosting_model.predict(X_valid)

0:	learn: 78031.9263694	total: 58.5ms	remaining: 58.4s
1:	learn: 73177.0205674	total: 70.2ms	remaining: 35s
2:	learn: 68553.0241476	total: 77.2ms	remaining: 25.6s
3:	learn: 64677.2745983	total: 88.2ms	remaining: 22s
4:	learn: 60765.5326730	total: 95.8ms	remaining: 19.1s
5:	learn: 57786.1731131	total: 107ms	remaining: 17.8s
6:	learn: 54713.0452251	total: 115ms	remaining: 16.3s
7:	learn: 52442.3034726	total: 129ms	remaining: 15.9s
8:	learn: 50110.5729399	total: 137ms	remaining: 15s
9:	learn: 47861.5689956	total: 146ms	remaining: 14.5s
10:	learn: 45758.9929409	total: 153ms	remaining: 13.8s
11:	learn: 43933.0606929	total: 166ms	remaining: 13.7s
12:	learn: 42129.6967528	total: 185ms	remaining: 14.1s
13:	learn: 40508.8636518	total: 194ms	remaining: 13.7s
14:	learn: 39100.7668895	total: 207ms	remaining: 13.6s
15:	learn: 37591.5534288	total: 219ms	remaining: 13.5s
16:	learn: 36310.6000190	total: 226ms	remaining: 13s
17:	learn: 35084.4849239	total: 232ms	remaining: 12.7s
18:	learn: 34113.104879

In [24]:
xgb_rmse = rmse(y_predicted, y_valid)
print('CatBoost Regressor RMSE: ', xgb_rmse)

CatBoost Regressor RMSE:  15887.084062218591


In [25]:
print('Train Score: %.2f%%' % (boosting_model.score(X_train, y_train) * 100))
print('Validation Score: %.2f%%' % (boosting_model.score(X_valid, y_valid) * 100))

Train Score: 96.65%
Validation Score: 96.09%


Сделаем предсказания, испольуя catboost.

In [26]:
y_predicted_test = catboost_model.predict(X_test)

output = pd.DataFrame({'Id': df_test.Id,
                       'SalePrice': y_predicted_test})

output.to_csv('submission.csv', index=False)