Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

**LOAD DATA**

In [None]:
df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
ids = df_test['Id'].values

In [None]:
df_train

## Building hypotheses. Target variable

In [None]:
df_train.describe()

In [None]:
df_train.columns

To build hypotheses, it is necessary to clearly understand the task: in our case, the goal is to predict the price of a residential building. The data contains 81 columns (38 of them numeric, as `describe()` shows), among which there are non-informative features (such as Id) and the target variable (SalePrice). Let's start with the target variable.

In [None]:
df_train['SalePrice'].describe()

In [None]:
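# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df_train['SalePrice'], kde=True)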

What conclusions can be drawn?
Firstly, the data looks correct (the price is greater than 0, there are no obvious outliers) and the distribution resembles a right-skewed normal distribution with a mean of ~180,000 and a std of ~79,000 (quite a large spread).
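The asymmetry of the target can be quantified, and a log transform is a common way to reduce it. The sketch below uses synthetic log-normal "prices" rather than the competition data, so the numbers are only illustrative; the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); NOT the competition data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

skew_raw = prices.skew()            # positive value means a long right tail
skew_log = np.log1p(prices).skew()  # near zero when the data is log-normal

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

If the real SalePrice behaves similarly, training on `np.log1p(y)` and converting predictions back with `np.expm1` may be worth trying.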

Final step:
write the target variable into a separate variable, removing it from the features

In [None]:
y_train = df_train.SalePrice.values
x_train = df_train.drop('SalePrice', axis=1)  # positional axis is no longer accepted by recent pandas

## Building hypotheses. Features

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)
plt.figure(figsize=(8, 6))
sns.boxplot(x='OverallQual', y="SalePrice", data=data)

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['GrLivArea']], axis=1)
data.plot.scatter(x='GrLivArea', y='SalePrice')

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['TotalBsmtSF']], axis=1)
data.plot.scatter(x='TotalBsmtSF', y='SalePrice')

For the numerical variables (TotalBsmtSF, GrLivArea) we observe a roughly linear trend with the price.

Now consider a categorical feature: Neighborhood

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['Neighborhood']], axis=1)
plt.figure(figsize=(20, 6))
sns.boxplot(x='Neighborhood', y="SalePrice", data=data)

There is no obvious trend, but one can still single out, for example, expensive areas (albeit with a very wide spread) and noticeably cheaper ones such as BrDale.

To make sure we don't miss anything, let's build a correlation heatmap

In [None]:
#correlation matrix
corrmat = df_train.corr(numeric_only=True)  # skip the text columns, which cannot be correlated
plt.figure(figsize=(12, 12))
sns.heatmap(corrmat, vmax=.8, square=True)

Here we see confirmation of the importance of the OverallQual feature. We also see many interesting connections here: for example, we can conclude that garages are built together with the house =) (GarageYrBlt and YearBuilt are strongly correlated), while LotArea surprisingly does not affect the price much.
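A heatmap is easy to misread, so it can help to rank the features by the strength of their correlation with the target. Below is a minimal sketch on a toy frame; the column names `quality`, `area`, `noise` and `price` are made up for the illustration and are not the competition columns.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; NOT the competition data.
rng = np.random.default_rng(1)
n = 200
quality = rng.integers(1, 11, size=n)
area = rng.normal(1500, 300, size=n)
toy = pd.DataFrame({
    "quality": quality,
    "area": area,
    "noise": rng.normal(0, 1, size=n),
    "price": 20000 * quality + 100 * area + rng.normal(0, 10000, size=n),
})

# Absolute correlation of every numeric column with the target, strongest first.
ranking = (
    toy.select_dtypes("number")
    .corr()["price"]
    .drop("price")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

On the real data the same pattern would presumably be applied as `df_train.select_dtypes("number").corr()["SalePrice"]`.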

You can analyze the data for a long time and find interesting dependencies, but let's get back to iterative development and move on to the next step.

## Data preparation: filling in the gaps

There are several classic approaches: drop the rows (or columns) with missing data, fill with an average, fill with something logical (depending on the specifics of the data), or build a model such as a random forest and fill in the gaps iteratively. Let's start by looking at the missing values.

In [None]:
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Let's analyze: the first 6 candidates have a large share of missing values (more than 17%). Since these features did not show a strong correlation with the price in the previous analysis, we drop those columns entirely. For the rest, we fill the gaps with the most frequent value.

In [None]:
x_train = x_train.drop(columns=missing_data[missing_data['Total'] > 81].index)
x_train = x_train.apply(lambda x:x.fillna(x.value_counts().index[0]))
x_train.isnull().sum().max()

In [None]:
x_train.shape
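The drop-or-fill rule used above can be wrapped in a small helper so that train and test are treated identically. This is only a sketch: `drop_and_fill` and its `max_missing` threshold are hypothetical names, and, unlike the notebook cells, it fills the test set with the modes computed on the training set.

```python
import pandas as pd

def drop_and_fill(train, test, max_missing=0.5):
    """Drop columns whose missing share in `train` exceeds `max_missing`,
    then fill the remaining gaps in both frames with the train-set mode."""
    share = train.isnull().mean()
    to_drop = share[share > max_missing].index
    train = train.drop(columns=to_drop)
    test = test.drop(columns=to_drop, errors="ignore")
    modes = train.mode().iloc[0]  # most frequent value per column
    return train.fillna(modes), test.fillna(modes), list(to_drop)

# Toy data, not the competition files.
toy_train = pd.DataFrame({
    "pool": [None, None, None, "Gd"],   # 75% missing, will be dropped
    "zone": ["RL", "RL", None, "RM"],   # 25% missing, will be filled
    "cars": [2.0, 2.0, 1.0, 2.0],
})
toy_test = pd.DataFrame({
    "pool": ["Ex", None],
    "zone": [None, "RM"],
    "cars": [None, 3.0],
})

clean_train, clean_test, dropped = drop_and_fill(toy_train, toy_test)
print(dropped)
print(clean_test)
```

Using the training modes for the test set keeps the two frames consistent, at the cost of ignoring whatever the test distribution looks like.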

Let's deal with the missing values in the test set. Here we can't drop rows, because every test row needs a prediction.

In [None]:
df_test.info()

In [None]:
df_test = df_test.drop(columns=missing_data[missing_data['Total'] > 81].index)
df_test

In [None]:
df_test = df_test.apply(lambda x:x.fillna(x.value_counts().index[0]))

In [None]:
df_test.shape

## Data preparation. Normalization and cleaning

Let's remove the identifiers, since they are unique and non-informative. Let's do the same for the test set.

In [None]:
x_train.drop("Id", axis = 1, inplace = True)
df_test.drop("Id", axis = 1, inplace = True)

In [None]:
x_train.shape

In [None]:
df_test.shape

Encoding of categorical variables: we translate them into numerical values, using the same mapping for the test set.

In [None]:
x_train.select_dtypes(include='object').columns

In [None]:
from sklearn.preprocessing import LabelEncoder
cols = x_train.select_dtypes(include='object').columns

for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(x_train[c].values)) 
    x_train[c] = lbl.transform(list(x_train[c].values))
    df_test[c] = lbl.transform(list(df_test[c].values))

print('Shape all_data: {}'.format(x_train.shape))
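One caveat of fitting the encoder on the training set only: `LabelEncoder.transform` raises an error if the test set contains a category it has never seen. Whether that happens on the competition data is not checked here; the sketch below just shows one possible safeguard with scikit-learn's `OrdinalEncoder`, which is a swapped-in alternative and not what the notebook uses. The toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: "C (all)" appears only in the test frame.
toy_train = pd.DataFrame({"zone": ["RL", "RM", "RL", "FV"]})
toy_test = pd.DataFrame({"zone": ["RM", "C (all)"]})

# Unknown categories are mapped to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(toy_train[["zone"]])
test_codes = enc.transform(toy_test[["zone"]])

print(train_codes.ravel())
print(test_codes.ravel())
```

Tree-based models such as XGBoost can usually cope with a sentinel code like -1, but for linear models a dedicated "unknown" handling may be preferable.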

Let's also clean a few outliers out of the data:

In [None]:
indexes = x_train[(df_train['GrLivArea']>4000) & (df_train['SalePrice']<300000)].index 

x_train = x_train.drop(indexes)
y_train = np.delete(y_train, indexes)

In [None]:
x_train.shape, y_train.shape
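Note that `DataFrame.drop` works with index labels while `np.delete` works with positions, so the two only agree while the frame still has its default 0..n-1 index. A boolean mask avoids that dependency. A minimal sketch on toy data (the index values are chosen to be non-default on purpose):

```python
import numpy as np
import pandas as pd

# Toy data with a non-default index; NOT the competition data.
X = pd.DataFrame({"GrLivArea": [1500, 4700, 2000, 5600]}, index=[10, 11, 12, 13])
y = np.array([200000, 180000, 250000, 160000])

# Large houses sold cheaply are treated as outliers.
is_outlier = ((X["GrLivArea"] > 4000) & (y < 300000)).to_numpy()

# The same positional mask filters both the features and the target.
X_clean = X[~is_outlier]
y_clean = y[~is_outlier]

print(X_clean.shape, y_clean.shape)
```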

The process of data preparation can be continued indefinitely, generating new features, filling in gaps in different ways, etc. But let's go further, build the first model and see what we already have

## Building the model

In [None]:
# Hold out 10% of the training data for validation
from sklearn.model_selection import train_test_split
x_train1,x_valid,y_train1,y_valid = train_test_split(x_train,y_train,test_size = 0.1,random_state=42)

In [None]:
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb

In [None]:
model_xgb = xgb.XGBRegressor(n_estimators=2200)

In [None]:
n_folds = 5

def rmse_cv(model):
    # Cross-validated RMSE; pass the KFold object itself so that shuffling is applied
    kf = KFold(n_folds, shuffle=True, random_state=42)
    scores = cross_val_score(model, x_train.values, y_train,
                             scoring="neg_mean_squared_error", cv=kf)
    return np.sqrt(-scores)

def rmse(y, y_pred):
    # RMSE between the true and the predicted values
    return np.sqrt(mean_squared_error(y, y_pred))

**XGB Method**

In [None]:
model_xgb.fit(x_train1, y_train1)
xgb_train_pred = model_xgb.predict(x_valid)
xgb_pred = model_xgb.predict(df_test)
print(rmse(y_valid, xgb_train_pred)) 

In [None]:
from sklearn.metrics import mean_squared_log_error
mean_squared_log_error(y_valid, xgb_train_pred)  
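`mean_squared_log_error` returns the mean of the squared log differences; taking its square root gives the RMSLE, which is closer to how the competition scores submissions (to my knowledge, RMSE between the logarithms of predicted and observed prices). A minimal NumPy sketch, where `rmsle` is a helper name introduced for this example and the prices are made up:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared error between log1p(y_pred) and log1p(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up prices: two predictions 10% too high, one 5% too low.
y_true = [100000, 200000, 300000]
y_pred = [110000, 190000, 330000]

print(rmsle(y_true, y_pred))
```

Because the error is computed on logarithms, a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one.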

After a few iterations, a still-simple regressor with a large number of estimators gives a good result, although it doesn't look like a top-1 score yet.

In [None]:
sub = pd.DataFrame()
sub['Id'] = ids
sub['SalePrice'] = xgb_pred
sub.to_csv('submission.csv',index=False) 

Let's tune the model and try to climb a little higher

In [None]:
model_xgb = xgb.XGBRegressor(reg_lambda=0.8571, n_estimators=2200, n_jobs=-1)

In [None]:
model_xgb.fit(x_train1, y_train1)
xgb_train_pred1 = model_xgb.predict(x_valid)
xgb_pred = model_xgb.predict(df_test)
print(rmse(y_valid, xgb_train_pred1))

In [None]:
mean_squared_log_error(y_valid, xgb_train_pred1)  # score the tuned model, not the earlier one

In [None]:
sub = pd.DataFrame()
sub['Id'] = ids
sub['SalePrice'] = xgb_pred
sub.to_csv('submission.csv',index=False) 

## Stacked and Ensemble Models

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import ElasticNet, Lasso
import lightgbm as lgb

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)
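The key idea in `fit` is the out-of-fold prediction: every training row is predicted by a model that never saw it, so the meta-model is trained on honest inputs. The sketch below isolates that step with a single `LinearRegression` on synthetic data; it is an illustration of the mechanism under those assumptions, not a replacement for the class above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data; NOT the competition data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

kfold = KFold(n_splits=5, shuffle=True, random_state=156)
oof = np.full(len(X), np.nan)                  # out-of-fold predictions
times_predicted = np.zeros(len(X), dtype=int)  # how often each row was held out

for train_idx, holdout_idx in kfold.split(X):
    # Fit on four folds, predict the fifth.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    oof[holdout_idx] = model.predict(X[holdout_idx])
    times_predicted[holdout_idx] += 1

print(times_predicted.min(), times_predicted.max())
```

In the stacking class this column of predictions is produced once per base model, and the resulting matrix becomes the feature set of the meta-model.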

Let's assemble a model consisting of a set of base regressors of different types

In [None]:
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

GBoost = GradientBoostingRegressor(n_estimators=3000, random_state =42)

ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005,random_state=42))
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=42))

stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

In [None]:
import xgboost as xgb
model_xgb = xgb.XGBRegressor(n_estimators=2200, n_jobs=-1)
model_xgb.fit(x_train1, y_train1)

model_lgb = lgb.LGBMRegressor(objective='regression',n_estimators=720)
model_lgb.fit(x_train1, y_train1) 

# fit() returns the fitted model, not predictions, so there is nothing to store here
stacked_averaged_models.fit(x_train1.values, y_train1)

In [None]:
lgb_pred = model_lgb.predict(x_valid)
xgb_pred = model_xgb.predict(x_valid)
stacked_pred = stacked_averaged_models.predict(x_valid.values)

In [None]:
print(rmse(y_valid, stacked_pred))

In [None]:
from sklearn.metrics import mean_squared_log_error
mean_squared_log_error(y_valid, stacked_pred)

In [None]:
# Use separate names for test-set predictions so the validation predictions stay available
lgb_test_pred = model_lgb.predict(df_test)
xgb_test_pred = model_xgb.predict(df_test)
stacked_test_pred = stacked_averaged_models.predict(df_test.values)

In [None]:
sub = pd.DataFrame()
sub['Id'] = ids
sub['SalePrice'] = xgb_test_pred
sub.to_csv('submission.csv',index=False)

In [None]:
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

In [None]:
print(rmse(y_valid, ensemble))

In [None]:
mean_squared_log_error(y_valid, ensemble) 
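The blend is a weighted average of the three prediction vectors, and it only stays on the price scale if the weights sum to 1. A small sketch with made-up predictions for two houses (the numbers and variable names are illustrative assumptions):

```python
import numpy as np

# Made-up predictions for two houses from three models.
stacked = np.array([200000.0, 150000.0])
xgb_p = np.array([210000.0, 140000.0])
lgb_p = np.array([190000.0, 160000.0])

weights = np.array([0.70, 0.15, 0.15])
# Weights that do not sum to 1 would systematically inflate or shrink the prices.
weights_ok = bool(np.isclose(weights.sum(), 1.0))

# Row i of the stacked matrix is model i; average over the model axis.
blend = np.average(np.vstack([stacked, xgb_p, lgb_p]), axis=0, weights=weights)
print(weights_ok, blend)
```

The 0.70 / 0.15 / 0.15 split is the one used in this notebook; other splits could be compared on the validation set in the same way.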

In [None]:
lgb_pred = model_lgb.predict(df_test)
xgb_pred = model_xgb.predict(df_test)
stacked_pred = stacked_averaged_models.predict(df_test.values)

In [None]:
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

In [None]:
sub = pd.DataFrame() 
sub['Id'] = ids
sub['SalePrice'] = ensemble  # submit the blended prediction rather than XGBoost alone
sub.to_csv('submission.csv',index=False)