# Problem definition

> We want to predict the house prices on the test dataset.

> Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.
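
As a minimal sketch of the metric described above, the function below computes the RMSE between the logarithms of predicted and observed prices with NumPy. The name `rmse_log` and the toy prices are assumptions made for illustration; they are not part of the competition material.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); both must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Toy prices, not taken from the competition data
observed = [200000, 150000, 300000]
predicted = [210000, 140000, 330000]
print(rmse_log(observed, predicted))
```

Because the error is taken on logarithms, it depends on the ratio between prediction and observation, so a 10% miss on a cheap house weighs the same as a 10% miss on an expensive one.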

# Data

* train.csv - the training set
* test.csv - the test set
* data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
* sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

Let's see the data description:

* SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

In [522]:
!pip3 install catboost
!pip3 install xgboost
# Utilities
from xgboost import XGBRegressor
from sklearn import tree
from sklearn.linear_model import LinearRegression, RANSACRegressor
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_halving_search_cv
from sklearn.feature_selection import mutual_info_regression
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV
from sklearn.metrics import mean_squared_error
# Models
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor




In [523]:
train_data = pd.read_csv('drive/MyDrive/House Prices Regression/data/train.csv')
train_data_2 = pd.read_csv('drive/MyDrive/House Prices Regression/data/AmesHousing.csv')
test_data = pd.read_csv('drive/MyDrive/House Prices Regression/data/test.csv')

In [524]:
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [525]:
test_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


## Data Preprocess

In [526]:
train_data_original = train_data.copy()

In [527]:
def show_missing_data_columns(df):
  for column in df.columns:
    if df[column].isnull().any():
      print(f'{column} {df[column].isnull().sum()/df.shape[0] * 100:.2f}% {df[column].dtypes}')

def get_missing_data_columns(df):
  column_list = []
  for column in df.columns:
    if df[column].isnull().any():
      column_list.append(column)
  return column_list

def fill_columns(df):
  '''
  Fill every column that has missing data, in place.
  String columns get their most frequent value; all other columns get their median.
  '''
  for column, content in df.items():
    if content.isnull().any():
      if pd.api.types.is_string_dtype(content):
        fill_value = content.value_counts().idxmax() # Most frequent value
      else:
        fill_value = content.median()
      print(f'Filling {column} with {fill_value}...')
      # Assign the result back instead of calling fillna(inplace = True) on a column selection
      df[column] = content.fillna(fill_value)

def category_converter(df):
  '''
  Replace every non-numeric column with integer category codes, in place.
  Codes start at 1 and follow the sorted order of the values found in df.
  '''
  for label, content in df.items():
    if not pd.api.types.is_numeric_dtype(content):
      df[label] = pd.Categorical(content).codes + 1
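
`category_converter` derives the codes from the values present in the frame it receives, so calling it separately on `train_data` and `test_data` can give the same label different codes whenever the two frames do not contain the same set of labels. The sketch below shows one possible way to avoid that by building a shared category list first. The function name `encode_with_shared_categories` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def encode_with_shared_categories(train_df, test_df):
    """Encode non-numeric columns so equal labels get equal codes in both frames."""
    train_df, test_df = train_df.copy(), test_df.copy()
    for label in train_df.columns:
        if label in test_df.columns and not pd.api.types.is_numeric_dtype(train_df[label]):
            # Category list built from both frames, sorted for a stable order
            categories = sorted(set(train_df[label].dropna()) | set(test_df[label].dropna()))
            train_df[label] = pd.Categorical(train_df[label], categories=categories).codes + 1
            test_df[label] = pd.Categorical(test_df[label], categories=categories).codes + 1
    return train_df, test_df

# Toy frames: 'FV' only occurs in the first one, 'RH' only in the second
toy_train = pd.DataFrame({'MSZoning': ['RL', 'RM', 'FV'], 'LotArea': [8450, 9600, 11250]})
toy_test = pd.DataFrame({'MSZoning': ['RM', 'RL', 'RH'], 'LotArea': [11622, 14267, 13830]})
enc_train, enc_test = encode_with_shared_categories(toy_train, toy_test)
print(enc_train['MSZoning'].tolist(), enc_test['MSZoning'].tolist())
```

With separate encoding, 'FV' in the first frame and 'RH' in the second would both receive code 1; with the shared list every label keeps a single code.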
      

In [528]:
show_missing_data_columns(train_data)

LotFrontage 17.74% float64
Alley 93.77% object
MasVnrType 0.55% object
MasVnrArea 0.55% float64
BsmtQual 2.53% object
BsmtCond 2.53% object
BsmtExposure 2.60% object
BsmtFinType1 2.53% object
BsmtFinType2 2.60% object
Electrical 0.07% object
FireplaceQu 47.26% object
GarageType 5.55% object
GarageYrBlt 5.55% float64
GarageFinish 5.55% object
GarageQual 5.55% object
GarageCond 5.55% object
PoolQC 99.52% object
Fence 80.75% object
MiscFeature 96.30% object


In [529]:
show_missing_data_columns(test_data)

MSZoning 0.27% object
LotFrontage 15.56% float64
Alley 92.67% object
Utilities 0.14% object
Exterior1st 0.07% object
Exterior2nd 0.07% object
MasVnrType 1.10% object
MasVnrArea 1.03% float64
BsmtQual 3.02% object
BsmtCond 3.08% object
BsmtExposure 3.02% object
BsmtFinType1 2.88% object
BsmtFinSF1 0.07% float64
BsmtFinType2 2.88% object
BsmtFinSF2 0.07% float64
BsmtUnfSF 0.07% float64
TotalBsmtSF 0.07% float64
BsmtFullBath 0.14% float64
BsmtHalfBath 0.14% float64
KitchenQual 0.07% object
Functional 0.14% object
FireplaceQu 50.03% object
GarageType 5.21% object
GarageYrBlt 5.35% float64
GarageFinish 5.35% object
GarageCars 0.07% float64
GarageArea 0.07% float64
GarageQual 5.35% object
GarageCond 5.35% object
PoolQC 99.79% object
Fence 80.12% object
MiscFeature 96.50% object
SaleType 0.07% object


In [530]:
# These columns are empty when the house simply lacks the feature (no alley, fireplace, pool, fence or misc feature),
# so they are not really missing data: let's fill them with a 'None' value
no_value_columns = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
for column in no_value_columns:
  train_data[column].fillna(value = 'None', inplace = True)
  test_data[column].fillna(value = 'None', inplace = True)


In [531]:
show_missing_data_columns(train_data)

LotFrontage 17.74% float64
MasVnrType 0.55% object
MasVnrArea 0.55% float64
BsmtQual 2.53% object
BsmtCond 2.53% object
BsmtExposure 2.60% object
BsmtFinType1 2.53% object
BsmtFinType2 2.60% object
Electrical 0.07% object
GarageType 5.55% object
GarageYrBlt 5.55% float64
GarageFinish 5.55% object
GarageQual 5.55% object
GarageCond 5.55% object


In [532]:
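# We can assume the same for all garage features (81 missing values in each of them in the training set)
# Note: filling GarageYrBlt with 'None' turns it into a string column, so it is later encoded as a category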
garage_features = ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']
for column in garage_features:
  train_data[column].fillna(value = 'None', inplace = True)
  test_data[column].fillna(value = 'None', inplace = True)
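
Filling `GarageYrBlt` with the string 'None' mixes text into a numeric column, and the later category conversion then treats the year as a label. A possible alternative, sketched here on a toy frame, keeps the year numeric and records the absence of a garage in a separate flag. The `HasGarage` column and the sentinel value 0 are assumptions for illustration, not something the notebook uses.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'GarageType': ['Attchd', np.nan, 'Detchd'],
    'GarageYrBlt': [2003.0, np.nan, 1976.0],
})

# Hypothetical indicator column: 1 when the house has a garage
toy['HasGarage'] = toy['GarageType'].notna().astype(int)
# Label columns get an explicit 'None' category
toy['GarageType'] = toy['GarageType'].fillna('None')
# The year stays numeric; 0 is an arbitrary sentinel for "no garage"
toy['GarageYrBlt'] = toy['GarageYrBlt'].fillna(0)
print(toy)
```

Tree-based models can usually split on such a sentinel, while a linear model would more likely rely on the indicator column.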

In [533]:
show_missing_data_columns(train_data)

LotFrontage 17.74% float64
MasVnrType 0.55% object
MasVnrArea 0.55% float64
BsmtQual 2.53% object
BsmtCond 2.53% object
BsmtExposure 2.60% object
BsmtFinType1 2.53% object
BsmtFinType2 2.60% object
Electrical 0.07% object


In [534]:
# Now only a few string and numerical columns still have missing data
# Let's impute numerical columns with the median and string columns with the most frequent value
fill_columns(train_data)

Filling LotFrontage with 69.0...
Filling MasVnrType with None...
Filling MasVnrArea with 0.0...
Filling BsmtQual with TA...
Filling BsmtCond with TA...
Filling BsmtExposure with No...
Filling BsmtFinType1 with Unf...
Filling BsmtFinType2 with Unf...
Filling Electrical with SBrkr...


In [535]:
fill_columns(test_data)

Filling MSZoning with RL...
Filling LotFrontage with 67.0...
Filling Utilities with AllPub...
Filling Exterior1st with VinylSd...
Filling Exterior2nd with VinylSd...
Filling MasVnrType with None...
Filling MasVnrArea with 0.0...
Filling BsmtQual with TA...
Filling BsmtCond with TA...
Filling BsmtExposure with No...
Filling BsmtFinType1 with GLQ...
Filling BsmtFinSF1 with 350.5...
Filling BsmtFinType2 with Unf...
Filling BsmtFinSF2 with 0.0...
Filling BsmtUnfSF with 460.0...
Filling TotalBsmtSF with 988.0...
Filling BsmtFullBath with 0.0...
Filling BsmtHalfBath with 0.0...
Filling KitchenQual with TA...
Filling Functional with Typ...
Filling GarageCars with 2.0...
Filling GarageArea with 480.0...
Filling SaleType with WD...
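
The two runs above fill `test_data` with its own statistics (for example, LotFrontage gets 67.0 there and 69.0 in the training set). A common alternative is to learn the fill values on the training set only and reuse them everywhere. The sketch below illustrates that idea on toy frames; `learn_fill_values` is a hypothetical helper, and it assumes every column has at least one non-missing value.

```python
import numpy as np
import pandas as pd

def learn_fill_values(train_df):
    """Median for numeric columns, most frequent value for the rest."""
    fill_values = {}
    for column in train_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[column]):
            fill_values[column] = train_df[column].median()
        else:
            fill_values[column] = train_df[column].mode().iloc[0]
    return fill_values

toy_train = pd.DataFrame({'LotFrontage': [60.0, 70.0, np.nan, 80.0],
                          'MSZoning': ['RL', 'RL', 'RM', np.nan]})
toy_test = pd.DataFrame({'LotFrontage': [np.nan, 50.0],
                         'MSZoning': [np.nan, 'RM']})

fill_values = learn_fill_values(toy_train)
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)  # same values as for the training set
print(toy_test_filled)
```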


In [536]:
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [537]:
# Now convert the string columns to category codes
category_converter(train_data)
category_converter(test_data)

In [538]:
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,4,65.0,8450,2,2,4,4,1,...,0,4,5,2,0,2,2008,9,5,208500
1,2,20,4,80.0,9600,2,2,4,4,1,...,0,4,5,2,0,5,2007,9,5,181500
2,3,60,4,68.0,11250,2,2,1,4,1,...,0,4,5,2,0,9,2008,9,5,223500
3,4,70,4,60.0,9550,2,2,1,4,1,...,0,4,5,2,0,2,2006,9,1,140000
4,5,60,4,84.0,14260,2,2,1,4,1,...,0,4,5,2,0,12,2008,9,5,250000


## Data Analysis

In [539]:
# Mutual Information
X = train_data.drop(labels = 'SalePrice', axis = 1)
y = train_data['SalePrice']
mi_score = mutual_info_regression(X, y)
mi_df = pd.DataFrame({'Feature': X.columns,
                      'MI_score': mi_score}).sort_values(by = ['MI_score'], ascending = False)
mi_df['MI_score'] = mi_df['MI_score'].astype(float) # astype returns a new Series, so assign it back

mi_df.head(20)

Unnamed: 0,Feature,MI_score
17,OverallQual,0.577712
12,Neighborhood,0.509438
46,GrLivArea,0.48043
38,TotalBsmtSF,0.366246
62,GarageArea,0.366168
19,YearBuilt,0.364174
61,GarageCars,0.358159
27,ExterQual,0.344364
53,KitchenQual,0.328042
43,1stFlrSF,0.308905


In [540]:
mi_df.tail(20)

Unnamed: 0,Feature,MI_score
35,BsmtFinType2,0.018412
39,Heating,0.01658
36,BsmtFinSF2,0.016422
10,LotConfig,0.016142
74,MiscFeature,0.013937
14,Condition2,0.011294
5,Street,0.010693
11,LandSlope,0.008132
22,RoofMatl,0.007134
48,BsmtHalfBath,0.007028


In [541]:
useless_features = [column for column in mi_df[mi_df['MI_score'] < 0.06]['Feature']]
useless_features

['GarageQual',
 'BsmtExposure',
 'Electrical',
 'PavedDrive',
 'BldgType',
 'Alley',
 'KitchenAbvGr',
 'ScreenPorch',
 'LandContour',
 'Fence',
 'EnclosedPorch',
 'Condition1',
 'BsmtFullBath',
 'ExterCond',
 'BsmtCond',
 'RoofStyle',
 'BsmtFinType2',
 'Heating',
 'BsmtFinSF2',
 'LotConfig',
 'MiscFeature',
 'Condition2',
 'Street',
 'LandSlope',
 'RoofMatl',
 'BsmtHalfBath',
 'MiscVal',
 'PoolArea',
 'LowQualFinSF',
 'Utilities',
 'YrSold',
 '3SsnPorch',
 'Functional',
 'PoolQC',
 'MoSold',
 'Id']

In [542]:
for column in useless_features:
  if column != 'Id':
    train_data.drop(labels = column, axis = 1, inplace = True)
    test_data.drop(labels = column, axis = 1, inplace = True)

print(len(train_data.columns))

46


# Modelling

## First Model

In [543]:
# Let's split the data into train and validation sets
#np.random.seed(21)
train_set, val_set = train_test_split(train_data,
                                      test_size = 0.2,
                                      shuffle = True)

train_set.shape, val_set.shape

((1168, 46), (292, 46))

In [544]:
# Now we're going to split both sets into X and y
X_train, y_train = train_set.drop(labels = 'SalePrice', axis = 1), train_set['SalePrice']
X_val, y_val = val_set.drop(labels = 'SalePrice', axis = 1), val_set['SalePrice']
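
Two details of the split are worth a sketch: the seed is commented out, so every run produces a different split and different scores, and `Id` stays among the features although it is only a row identifier. The example below shows one way to handle both on a toy frame. The seed 21 is arbitrary (taken from the commented-out line) and the toy values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    'Id': np.arange(1, 11),
    'GrLivArea': np.linspace(800, 2600, 10),
    'SalePrice': np.linspace(100000, 280000, 10),
})

toy_X = toy.drop(columns=['Id', 'SalePrice'])  # Id is kept out of the features
toy_y = toy['SalePrice']

# random_state=21 is an arbitrary seed; any fixed integer makes the split repeatable
split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=21)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=21)
X_tr, X_va, y_tr, y_va = split_a
print(X_va.index.tolist())
```

The `Id` column can still be kept in a separate variable for building the submission file.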

In [545]:
# Fit five baseline regressors with default parameters and compare their R^2 score on the validation set
#np.random.seed(21)

cbr = CatBoostRegressor(verbose = False)
xgbr = XGBRegressor(verbosity = 0)
rfr = RandomForestRegressor()
lr = LinearRegression()
treer = tree.DecisionTreeRegressor()

cbr.fit(X_train, y_train)
xgbr.fit(X_train, y_train)
rfr.fit(X_train, y_train)
lr.fit(X_train, y_train)
treer.fit(X_train, y_train)

print(f'CatBoostRegressor: {cbr.score(X_val, y_val)}')
print(f'XGBRegressor: {xgbr.score(X_val, y_val)}')
print(f'RandomForestRegressor: {rfr.score(X_val, y_val)}')
print(f'LinearRegression: {lr.score(X_val, y_val)}')
print(f'DecisionTreeRegressor: {treer.score(X_val, y_val)}')

CatBoostRegressor: 0.8869313913875267
XGBRegressor: 0.8606138889041897
RandomForestRegressor: 0.8382101417810871
LinearRegression: 0.6981307558376344
DecisionTreeRegressor: 0.6052370269276204


In [546]:
# RMSE in dollars on the raw prices (the competition metric is computed on log prices instead)
y_val_preds = cbr.predict(X_val)
np.sqrt(mean_squared_error(y_val, y_val_preds))

28993.53587413615
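
The RMSE above is measured in dollars, while the competition scores the error on log prices. One option, sketched here with scikit-learn's `TransformedTargetRegressor`, is to fit the model on the log of the target so that training and evaluation use the same scale. The synthetic data and the choice of `LinearRegression` are assumptions made to keep the example small; the notebook itself trains on raw prices.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
area = rng.uniform(800, 3000, size=200).reshape(-1, 1)
# Synthetic prices that grow multiplicatively with area
price = 50000 * np.exp(0.0005 * area[:, 0]) * np.exp(rng.normal(0, 0.05, size=200))

# The regressor is fitted on log1p(price); predictions are mapped back with expm1
model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   func=np.log1p, inverse_func=np.expm1)
model.fit(area, price)
preds = model.predict(area)

# Error on the log scale, the same scale the competition metric uses
log_rmse = np.sqrt(np.mean((np.log(preds) - np.log(price)) ** 2))
print(log_rmse)
```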

In [547]:
y_preds = cbr.predict(test_data)

In [548]:
house_price_predictions = pd.DataFrame({'Id': test_data['Id'],
                                        'SalePrice': y_preds})
house_price_predictions

Unnamed: 0,Id,SalePrice
0,1461,125727.641431
1,1462,167222.579835
2,1463,189897.871089
3,1464,193070.984979
4,1465,194147.184942
...,...,...
1454,2915,78923.464880
1455,2916,80528.297518
1456,2917,175000.734438
1457,2918,120130.202827


In [549]:
house_price_predictions.to_csv('drive/MyDrive/House Prices Regression/predictions/house_price_predictions.csv',
                               index = False)

## Model Tuning

In [550]:
cbr.get_all_params()

{'auto_class_weights': 'None',
 'bayesian_matrix_reg': 0.10000000149011612,
 'best_model_min_trees': 1,
 'boost_from_average': True,
 'boosting_type': 'Plain',
 'bootstrap_type': 'MVS',
 'border_count': 254,
 'classes_count': 0,
 'depth': 6,
 'eval_metric': 'RMSE',
 'feature_border_type': 'GreedyLogSum',
 'force_unit_auto_pair_weights': False,
 'grow_policy': 'SymmetricTree',
 'iterations': 1000,
 'l2_leaf_reg': 3,
 'leaf_estimation_backtracking': 'AnyImprovement',
 'leaf_estimation_iterations': 1,
 'leaf_estimation_method': 'Newton',
 'learning_rate': 0.04196000099182129,
 'loss_function': 'RMSE',
 'max_leaves': 64,
 'min_data_in_leaf': 1,
 'model_shrink_mode': 'Constant',
 'model_shrink_rate': 0,
 'model_size_reg': 0.5,
 'nan_mode': 'Min',
 'penalties_coefficient': 1,
 'pool_metainfo_options': {'tags': {}},
 'posterior_sampling': False,
 'random_seed': 0,
 'random_strength': 1,
 'rsm': 1,
 'sampling_frequency': 'PerTree',
 'score_function': 'Cosine',
 'sparse_features_conflict_fracti

In [551]:
# Let's tune CatBoostRegressor
#param_grid = {
#    'iterations': np.arange(100, 1000, 200),
#    'learning_rate': np.arange(0.01, 0.9, 0.5),
#    'depth': np.arange(4, 10, 1)
#}
#
#cbr_hs = HalvingGridSearchCV(estimator = cbr,
#                             param_grid = param_grid,
#                             cv = 5,
#                             factor = 3,
#                             n_jobs = -1,
#                             verbose = True)
#
#cbr_hs.fit(X_train, y_train)
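
For reference, here is a small runnable sketch of the same halving search pattern. It swaps in a `DecisionTreeRegressor` and synthetic data so that it finishes quickly, and it sets `scoring` to the negative RMSE; both choices are assumptions, not what the commented cell does. Note also that `np.arange(0.01, 0.9, 0.5)` in the grid above yields only two learning rates, 0.01 and 0.51.

```python
import numpy as np
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 1, size=(300, 3))
toy_y = 10 * toy_X[:, 0] + rng.normal(0, 0.5, size=300)

param_grid = {'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 5]}
search = HalvingGridSearchCV(estimator=DecisionTreeRegressor(random_state=0),
                             param_grid=param_grid,
                             cv=5,
                             factor=3,
                             scoring='neg_root_mean_squared_error',
                             random_state=0)
search.fit(toy_X, toy_y)
# best_score_ is a negative RMSE, so flip the sign to read it as an error
print(search.best_params_, -search.best_score_)
```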

In [552]:
cbr_hs = CatBoostRegressor(depth = 5, iterations = 900, learning_rate = 0.01, verbose = False)
cbr_hs.fit(X_train, y_train)
cbr_hs.score(X_val, y_val)

0.8719996435587265

# Other Data

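There's another dataset from Kaggle that we can use to improve our model! 😄 Note that its first rows, shown below, have the same LotArea, MiscVal and sale dates as the first rows of test.csv, so it appears to overlap with the competition data. Validation scores obtained after merging it may therefore be optimistic.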

In [553]:
train_data_2.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [554]:
len(train_data_original.columns), len(train_data_2.columns)

(81, 82)
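
Before merging an external dataset it can help to check how many of its rows also occur in the competition files. The sketch below counts matches on a handful of key columns with a pandas merge. `count_shared_rows` is a hypothetical helper, the toy values are copied from the `head()` outputs in this notebook, and matching on three columns is only a heuristic, since unrelated houses can share those values.

```python
import pandas as pd

def count_shared_rows(df_a, df_b, key_columns):
    """Count rows of df_a whose key_columns values also occur in df_b."""
    keys_b = df_b[key_columns].drop_duplicates()
    merged = df_a[key_columns].merge(keys_b, on=key_columns, how='left', indicator=True)
    return int((merged['_merge'] == 'both').sum())

# Toy frames: the first two "extra" rows repeat the two "test" rows
toy_extra = pd.DataFrame({'LotArea': [11622, 14267, 31770],
                          'YrSold': [2010, 2010, 2010],
                          'MoSold': [6, 6, 5]})
toy_test = pd.DataFrame({'LotArea': [11622, 14267],
                         'YrSold': [2010, 2010],
                         'MoSold': [6, 6]})
print(count_shared_rows(toy_extra, toy_test, ['LotArea', 'YrSold', 'MoSold']))
```

Rows that match the test set carry its true prices, and rows that match the training set become duplicates that can land on both sides of a train/validation split.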

## Data Preprocess

In [555]:
def unique_columns_classifier(df_1, df_2):
  '''Print the columns (ignoring spaces in their names) that appear in only one of the two dataframes.'''
  df_1_columns = [column.replace(' ', '') for column in df_1.columns]
  df_2_columns = [column.replace(' ', '') for column in df_2.columns]
  for column in df_2_columns:
    if column not in df_1_columns:
      print(f'Column {column} from df_2 not in df_1')

  for column in df_1_columns:
    if column not in df_2_columns:
      print(f'Column {column} from df_1 not in df_2')

In [556]:
unique_columns_classifier(train_data_original, train_data_2)

Column Order from df_2 not in df_1
Column PID from df_2 not in df_1
Column YearRemod/Add from df_2 not in df_1
Column Id from df_1 not in df_2
Column YearRemodAdd from df_1 not in df_2


In [557]:
train_data_2.rename(columns = {'PID':'Id', 'Year Remod/Add':'YearRemodAdd'}, inplace = True)
train_data_2.drop(labels = 'Order', axis = 1, inplace = True)

In [558]:
train_data_2.columns = train_data_2.columns.str.replace(' ','')

In [559]:
no_value_columns = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
for column in no_value_columns:
  train_data_2[column].fillna(value = 'None', inplace = True)

In [560]:
garage_features = ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']
for column in garage_features:
  train_data_2[column].fillna(value = 'None', inplace = True)

In [561]:
fill_columns(train_data_2)
category_converter(train_data_2)

Filling LotFrontage with 68.0...
Filling MasVnrType with None...
Filling MasVnrArea with 0.0...
Filling BsmtQual with TA...
Filling BsmtCond with TA...
Filling BsmtExposure with No...
Filling BsmtFinType1 with GLQ...
Filling BsmtFinSF1 with 370.0...
Filling BsmtFinType2 with Unf...
Filling BsmtFinSF2 with 0.0...
Filling BsmtUnfSF with 466.0...
Filling TotalBsmtSF with 990.0...
Filling Electrical with SBrkr...
Filling BsmtFullBath with 0.0...
Filling BsmtHalfBath with 0.0...
Filling GarageCars with 2.0...
Filling GarageArea with 480.0...


In [562]:
for column in useless_features:
  if column != 'Id':
    train_data_2.drop(labels = column, axis = 1, inplace = True)
print(len(train_data_2.columns))

46


In [563]:
len(train_data.columns), len(train_data_2.columns)

(46, 46)

In [564]:
train_data_full = pd.concat([train_data, train_data_2])

In [565]:
train_data_full

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,LotShape,Neighborhood,HouseStyle,OverallQual,OverallCond,...,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageCond,WoodDeckSF,OpenPorchSF,SaleType,SaleCondition,SalePrice
0,1,60,4,65.0,8450,4,6,6,7,5,...,90,3,2.0,548.0,6,0,61,9,5,208500
1,2,20,4,80.0,9600,4,25,3,6,8,...,63,3,2.0,460.0,6,298,0,9,5,181500
2,3,60,4,68.0,11250,1,6,6,7,5,...,88,3,2.0,608.0,6,0,42,9,5,223500
3,4,70,4,60.0,9550,1,7,6,7,5,...,85,4,3.0,642.0,6,0,35,9,1,140000
4,5,60,4,84.0,14260,1,16,6,8,5,...,87,3,3.0,836.0,6,192,84,9,5,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,923275080,80,6,37.0,7937,1,15,8,6,6,...,76,4,2.0,588.0,6,120,0,10,5,142500
2926,923276100,20,6,68.0,8885,1,15,3,5,5,...,75,4,2.0,484.0,6,164,0,10,5,131000
2927,923400125,85,6,62.0,10441,4,15,7,5,5,...,104,2,0.0,0.0,4,80,32,10,5,132000
2928,924100070,20,6,77.0,10010,4,15,3,5,5,...,67,3,2.0,418.0,6,240,38,10,5,170000


## Training on the Combined Data

In [566]:
train_set, val_set = train_test_split(train_data_full,
                                      test_size = 0.2,
                                      shuffle = True)

train_set.shape, val_set.shape

((3512, 46), (878, 46))

In [567]:
X_train, y_train = train_set.drop(labels = 'SalePrice', axis = 1), train_set['SalePrice']
X_val, y_val = val_set.drop(labels = 'SalePrice', axis = 1), val_set['SalePrice']

In [568]:
cbr = CatBoostRegressor(verbose = False)
xgbr = XGBRegressor(verbosity = 0)
rfr = RandomForestRegressor()
lr = LinearRegression()
treer = tree.DecisionTreeRegressor()

cbr.fit(X_train, y_train)
xgbr.fit(X_train, y_train)
rfr.fit(X_train, y_train)
lr.fit(X_train, y_train)
treer.fit(X_train, y_train)

print(f'CatBoostRegressor: {cbr.score(X_val, y_val)}')
print(f'XGBRegressor: {xgbr.score(X_val, y_val)}')
print(f'RandomForestRegressor: {rfr.score(X_val, y_val)}')
print(f'LinearRegression: {lr.score(X_val, y_val)}')
print(f'DecisionTreeRegressor: {treer.score(X_val, y_val)}')

CatBoostRegressor: 0.9583814131928995
XGBRegressor: 0.9336487905681977
RandomForestRegressor: 0.937521181660207
LinearRegression: 0.8472204212616676
DecisionTreeRegressor: 0.8606617326815569


In [569]:
test_data = test_data.reindex(sorted(test_data.columns), axis=1)
train_data_full = train_data_full.reindex(sorted(train_data_full.columns), axis=1)

In [570]:
test_data.columns, train_data_full.columns

(Index(['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtFinSF1', 'BsmtFinType1',
        'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'ExterQual', 'Exterior1st',
        'Exterior2nd', 'FireplaceQu', 'Fireplaces', 'Foundation', 'FullBath',
        'GarageArea', 'GarageCars', 'GarageCond', 'GarageFinish', 'GarageType',
        'GarageYrBlt', 'GrLivArea', 'HalfBath', 'HeatingQC', 'HouseStyle', 'Id',
        'KitchenQual', 'LotArea', 'LotFrontage', 'LotShape', 'MSSubClass',
        'MSZoning', 'MasVnrArea', 'MasVnrType', 'Neighborhood', 'OpenPorchSF',
        'OverallCond', 'OverallQual', 'SaleCondition', 'SaleType',
        'TotRmsAbvGrd', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt',
        'YearRemodAdd'],
       dtype='object'),
 Index(['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtFinSF1', 'BsmtFinType1',
        'BsmtQual', 'BsmtUnfSF', 'CentralAir', 'ExterQual', 'Exterior1st',
        'Exterior2nd', 'FireplaceQu', 'Fireplaces', 'Foundation', 'FullBath',
        'GarageArea', 'GarageCars', 'Gara

In [571]:
y_val_preds = cbr.predict(X_val)
house_price_val_preds = pd.DataFrame({'Id': X_val['Id'],
                                      'Prediction': y_val_preds,
                                      'SalePrice': y_val})

house_price_val_preds

Unnamed: 0,Id,Prediction,SalePrice
761,762,120746.360453,100000
1337,903228040,154862.536529,157000
2605,535382130,158675.228795,170000
1925,535179120,119967.307627,116000
600,601,259467.057231,275000
...,...,...,...
2256,916253320,217233.352899,330000
2435,528250060,179807.176241,180000
492,493,171575.521602,172785
279,280,217082.329393,192000


In [572]:
y_preds_improved = cbr.predict(test_data)
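
The columns of `test_data` were sorted alphabetically after the model had been fitted on `X_train`, so the frame passed to `predict` no longer has the training column order. Whether a library matches columns by name or by position varies, so one defensive option is to select the training columns explicitly before predicting. The toy frames below are invented for illustration.

```python
import pandas as pd

# Toy values; the training frame keeps its original column order
toy_X_train = pd.DataFrame({'LotArea': [8450, 9600], 'GrLivArea': [1710, 1262], 'Id': [1, 2]})
# The test frame has the same columns in alphabetical order
toy_test = pd.DataFrame({'GrLivArea': [896], 'Id': [1461], 'LotArea': [11622]})

# Select the training columns, in the training order
toy_test_aligned = toy_test[toy_X_train.columns]
print(list(toy_test_aligned.columns))
```

This selection also raises a `KeyError` if a training column is missing from the test frame, which surfaces mismatches early.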

In [573]:
house_price_predictions_improved = pd.DataFrame({'Id': test_data['Id'],
                                                 'SalePrice': y_preds_improved})
house_price_predictions_improved

Unnamed: 0,Id,SalePrice
0,1461,120417.981444
1,1462,168266.057089
2,1463,189046.454604
3,1464,196262.980398
4,1465,180679.523448
...,...,...
1454,2915,79075.193471
1455,2916,82955.754661
1456,2917,134308.418609
1457,2918,123104.009872


In [574]:
house_price_predictions_improved.to_csv('drive/MyDrive/House Prices Regression/predictions/house_price_predictions_improved.csv',
                                        index = False)