<a href="https://colab.research.google.com/github/clanguser/house-price-prediction/blob/main/MGT_House_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

"House Prices - Advanced Regression Techniques" is a Kaggle machine learning competition that challenges participants to build models that predict the sale price of residential homes from features such as square footage, number of bedrooms, and location. The dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.

The goal of the competition is to create a regression model with the lowest root-mean-squared error (RMSE) between the predicted and actual sale prices. The competition computes the RMSE on the logarithm of the prices, so errors on cheap and expensive homes weigh equally. Participants train a model on the provided training dataset and use it to predict the sale prices of the homes in the test dataset.
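
As a minimal sketch of the metric, the function below computes the RMSE on the log scale. The name `rmse_log` and the predicted values are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of actual and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

actual = [208500, 181500, 223500]      # first three SalePrice values in train
predicted = [200000, 190000, 223500]   # made-up predictions for illustration
print(rmse_log(actual, predicted))
```

Because the error is taken on logs, scaling both actual and predicted prices by the same factor leaves the score unchanged.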

To tackle this problem, I will analyze the features provided in the dataset, understand their meanings, and identify any missing or incomplete data. I will also investigate the relationships between the features and their correlations with the sale price. This shows which features matter most for predicting the sale price and informs the strategy for feature engineering and selection.

# <h2 style = "font-family:Georgia;font-weight: bold; font-size:30px; background-color: white; color : #1192AA; border-radius: 100px 100px; text-align:left">Table of Contents</h2>

* &nbsp; **[Introduction](#Introduction)**
    
* &nbsp; **[Import](#Import)**

* &nbsp; **[Check Dataset](#Check-Dataset)**
   
* &nbsp; **[Exploratory Data Analysis](#EDA)**

* &nbsp; **[Data Cleaning](#Data-Cleaning)**
    
* &nbsp; **[Feature Engineering](#Feature-Engineering)**
    
* &nbsp; **[Data Preprocessing](#Data-Preprocessing)**
    
* &nbsp; **[Model Building](#Model-Building)**
    
* &nbsp; **[Blend and Predict](#Blend-and-Predict)**

# Import

In [None]:
pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp39-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1.1


In [None]:
# Essentials
import numpy as np
import pandas as pd

# Visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# Misc
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

# Models
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from mlxtend.regressor import StackingCVRegressor
from sklearn.linear_model import ElasticNetCV

# Show every column when displaying a pandas DataFrame
pd.set_option('display.max_columns', None)

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")

In [None]:
# Load train dataset and make a copy of it
# train = pd.read_csv('D:/MGT Dataset/train.csv')
train = pd.read_csv('train.csv')
df_train = train.copy()

# Load test dataset
# test = pd.read_csv('D:/MGT Dataset/test.csv')
test = pd.read_csv('test.csv')

# Check Dataset

In [None]:
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [None]:
df_train.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [None]:
df_train.shape

(1460, 81)

In [None]:
df_train.isnull().sum().sort_values(ascending=False).head(20)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtFinType2      38
BsmtExposure      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
MasVnrType         8
Electrical         1
Id                 0
dtype: int64

In [None]:
object_features = []
for column in df_train.columns:
    if df_train[column].dtype == 'object':
        object_features.append(column)
        
print(f'Object features: {object_features} \n\nNumber of object features: {len(object_features)}')

Object features: ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition'] 

Number of object features: 43


# EDA

<p style = "font-family:Georgia;font-size:14px; color:#000000 ; text-align: left;" >Check the Sale Price distribution:</p>

In [None]:
# Create figure
fig = px.histogram(x = df_train['SalePrice'],
                   template='simple_white',
                   color_discrete_sequence = ['#1192AA'])



# Set Title and x/y axis labels
fig.update_layout(
    xaxis_title="Sale Price",
    yaxis_title="Frequency",
    showlegend = False,
    font = dict(
            size = 14
            ),    
    title={
        'text': "Sale Price Distribution",
        'y':0.95,
        'x':0.5
        }
    )

# Display
fig.show() # for Kaggle version

The distribution is clearly right-skewed. Let's check its skewness and kurtosis:

In [None]:
print(f"Skewness: {df_train['SalePrice'].skew()}")
print(f"Kurtosis: {df_train['SalePrice'].kurt()}")

Skewness: 1.8828757597682129
Kurtosis: 6.536281860064529


Let's create a correlation heatmap of all numeric features:

In [None]:
# Select numeric features
numeric_dtypes = ['int64', 'float64']
numeric = []
for i in df_train.columns:
    if df_train[i].dtype in numeric_dtypes:
        numeric.append(i)

# Create figure
fig = px.imshow(df_train.loc[:, numeric].corr(), template='simple_white')

# Display
fig.show()
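
A heatmap of this size can be hard to read, so a ranked list of correlations with the target is a useful complement. The sketch below uses a hypothetical helper, `top_correlations`, and a tiny made-up DataFrame; in the notebook it could be called as `top_correlations(df_train, 'SalePrice')`.

```python
import pandas as pd

def top_correlations(df, target, n=5):
    """Return the n numeric features most correlated (in absolute value) with target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data for illustration only
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "YrSold": [2008, 2006, 2009, 2007],
    "SalePrice": [100000, 150000, 200000, 250000],
})
ranking = top_correlations(toy, "SalePrice")
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.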

Distribution of Numeric Values

In [None]:
fig = px.histogram(df_train[numeric], template='simple_white')
fig.update_layout(title='Distribution of Numeric Values')

fig.show()

Let's see how SalePrice relates to some of the features in the dataset:

In [None]:
# Select features
features = ['OverallQual', 'GrLivArea', 'GarageArea', 'GarageCars', 'TotalBsmtSF']

for feature in features:
    # Create figure 
    fig = px.scatter(df_train, feature, 'SalePrice', 
                     trendline="ols", trendline_scope="overall", trendline_color_override="red",
                     template = 'simple_white',
                     color_discrete_sequence = ['#1192AA'])
    
    # Set Title and x/y axis labels
    fig.update_layout(
        xaxis_title=feature,
        yaxis_title="Sale Price",
        showlegend = False,
        font = dict(
                size = 14
                ),    
        title={
            'text': feature,
            'y':0.95,
            'x':0.5
            }
        )
    
    # Display
    fig.show()

# Data Cleaning

In [None]:
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


<p style = "font-family:Georgia;font-size:14px; color:#000000 ;background-color:  ; text-align: left;" >Log-transform SalePrice to reduce its right skew:</p>

In [None]:
# Log transform SalePrice 
df_train['SalePrice'] = np.log1p(df_train['SalePrice'])
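
Since the target is now on the `log1p` scale, any model trained on it predicts log prices, and those predictions have to be mapped back before they can be read as dollar amounts. A small sketch of the round trip, using the first three sale prices from the training data:

```python
import numpy as np

prices = np.array([208500.0, 181500.0, 223500.0])

log_prices = np.log1p(prices)     # forward transform applied to the target
restored = np.expm1(log_prices)   # inverse transform for predictions

print(log_prices)
print(restored)
```

`np.expm1` is the exact inverse of `np.log1p`, so the round trip recovers the original prices up to floating-point error.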

In [None]:
# Create figure
fig = px.histogram(x = df_train['SalePrice'],
                   template='simple_white',
                   color_discrete_sequence = ['#1192AA'])



# Set Title and x/y axis labels
fig.update_layout(
    xaxis_title="SalePrice",
    yaxis_title="Frequency",
    showlegend = False,
    font = dict(
            size = 14
            ),    
    title={
        'text': "Normalized Sale Price Distribution",
        'y':0.95,
        'x':0.5
        }
    )

# Display
fig.show() # for Kaggle version
#fig.show("svg") # for GitHub version

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Create a function for cleaning the dataset: </p>

In [None]:
def clean(X):
    
    # Replace some corrupted data
    X["Exterior2nd"] = X["Exterior2nd"].replace({"Brk Cmn": "BrkComm"})
    X["GarageYrBlt"] = X["GarageYrBlt"].where(X.GarageYrBlt <= 2010, X.YearBuilt)
    
    # Change data types in numerical features that should be categorical
    X['MSSubClass'] = X['MSSubClass'].apply(str)
    X['YrSold'] = X['YrSold'].astype(str)
    X['MoSold'] = X['MoSold'].astype(str)
    
    # Handle missing
    X['Functional'] = X['Functional'].fillna('Typ') 
    X['Electrical'] = X['Electrical'].fillna("SBrkr") 
    X['KitchenQual'] = X['KitchenQual'].fillna("TA") 
    X["PoolQC"] = X["PoolQC"].fillna("None")
    X['Exterior1st'] = X['Exterior1st'].fillna(X['Exterior1st'].mode()[0]) 
    X['Exterior2nd'] = X['Exterior2nd'].fillna(X['Exterior2nd'].mode()[0])
    X['SaleType'] = X['SaleType'].fillna(X['SaleType'].mode()[0])
    
    # Replacing the missing values with 0, since no garage = no cars in garage
    for column in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
        X[column] = X[column].fillna(0)
        
    # Replacing the missing values with None
    for column in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
        X[column] = X[column].fillna('None')
        
    # NaN in these categorical basement features means there is no basement
    for column in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
        X[column] = X[column].fillna('None')
        
    # Fill missing values in the remaining categorical features with 'None'
    objects = []
    for i in X.columns:
        if X[i].dtype == object:
            objects.append(i)
    X.update(X[objects].fillna('None'))
        
    # And we do the same thing for numerical features, but this time with 0s
    numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric = []
    for i in X.columns:
        if X[i].dtype in numeric_dtypes:
            numeric.append(i)
    X.update(X[numeric].fillna(0))

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Create a function for log-transforming skewed numerical features: </p>

In [None]:
def log_transform(X):

    numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric = []
    for i in X.columns:
        if X[i].dtype in numeric_dtypes:
            numeric.append(i)

    # Compute skewness
    skewed_features = X[numeric].apply(lambda x: skew(x)).sort_values(ascending=False)
    skewed_features = skewed_features[skewed_features > 0.5]
    skewed_features = skewed_features.index

    # Transform skewed features
    for i in skewed_features:
        X[i] = np.log1p(X[i])
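
To see the effect of this kind of transform in isolation, here is a self-contained sketch on synthetic data. The helper `log_transform_skewed` is a hypothetical variant that also returns the names of the columns it changed; the column names and distributions are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def log_transform_skewed(df, threshold=0.5):
    """Apply log1p in place to numeric columns whose skewness exceeds threshold."""
    numeric_cols = df.select_dtypes(include="number").columns
    skewness = df[numeric_cols].apply(skew)
    skewed_cols = skewness[skewness > threshold].index.tolist()
    df[skewed_cols] = np.log1p(df[skewed_cols])
    return skewed_cols

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9, sigma=0.5, size=1000),   # right-skewed
    "Symmetric": rng.normal(loc=50, scale=5, size=1000),       # roughly symmetric
})

before = skew(toy["LotArea"])
transformed = log_transform_skewed(toy)
after = skew(toy["LotArea"])
print(transformed, round(before, 2), round(after, 2))
```

Only the right-skewed column crosses the 0.5 threshold, and its skewness drops close to zero after the transform.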

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Apply those functions to the train dataset and check the result: </p>

In [None]:
clean(df_train)
log_transform(df_train)
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,9.04204,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,1.791759,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,5.283204,Gd,TA,PConc,Gd,TA,No,GLQ,6.561031,Unf,0.0,5.01728,6.753438,GasA,Ex,Y,SBrkr,6.753438,6.751101,0.0,7.444833,0.693147,0.0,2,0.693147,3,0.693147,Gd,2.197225,Typ,0.0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0.0,4.127134,0.0,0.0,0.0,0.0,,,,0.0,2,2008,WD,Normal,12.247699
1,2,20,RL,80.0,9.169623,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,2.197225,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,6.886532,Unf,0.0,5.652489,7.141245,GasA,Ex,Y,SBrkr,7.141245,0.0,0.0,7.141245,0.0,0.693147,2,0.0,3,0.693147,TA,1.94591,Typ,0.693147,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,5.700444,0.0,0.0,0.0,0.0,0.0,,,,0.0,5,2007,WD,Normal,12.109016
2,3,60,RL,68.0,9.328212,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,1.791759,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,5.09375,Gd,TA,PConc,Gd,TA,Mn,GLQ,6.188264,Unf,0.0,6.075346,6.82546,GasA,Ex,Y,SBrkr,6.82546,6.765039,0.0,7.488294,0.693147,0.0,2,0.693147,3,0.693147,Gd,1.94591,Typ,0.693147,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0.0,3.7612,0.0,0.0,0.0,0.0,,,,0.0,9,2008,WD,Normal,12.317171
3,4,70,RL,60.0,9.164401,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,1.791759,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,5.379897,Unf,0.0,6.293419,6.629363,GasA,Gd,Y,SBrkr,6.869014,6.629363,0.0,7.448916,0.693147,0.0,1,0.0,3,0.693147,Gd,2.079442,Typ,0.693147,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0.0,3.583519,5.609472,0.0,0.0,0.0,,,,0.0,2,2006,WD,Abnorml,11.849405
4,5,60,RL,84.0,9.565284,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,1.791759,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,5.860786,Gd,TA,PConc,Gd,TA,Av,GLQ,6.486161,Unf,0.0,6.196444,7.044033,GasA,Ex,Y,SBrkr,7.044033,6.960348,0.0,7.695758,0.693147,0.0,2,0.693147,4,0.693147,Gd,2.302585,Typ,0.693147,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,5.26269,4.442651,0.0,0.0,0.0,0.0,,,,0.0,12,2008,WD,Normal,12.42922


<p style = "font-family:Georgia;font-size:14px; color:#000000 ; text-align: left;" >Check if we have any missing values: </p>

In [None]:
df_train.isnull().sum().sum()

0

# <h2 style = "font-family: Georgia;font-weight: bold; font-size: 30px; color: #1192AA; text-align:left">Feature Engineering</h2>

In [None]:
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,9.04204,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,1.791759,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,5.283204,Gd,TA,PConc,Gd,TA,No,GLQ,6.561031,Unf,0.0,5.01728,6.753438,GasA,Ex,Y,SBrkr,6.753438,6.751101,0.0,7.444833,0.693147,0.0,2,0.693147,3,0.693147,Gd,2.197225,Typ,0.0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0.0,4.127134,0.0,0.0,0.0,0.0,,,,0.0,2,2008,WD,Normal,12.247699
1,2,20,RL,80.0,9.169623,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,2.197225,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,6.886532,Unf,0.0,5.652489,7.141245,GasA,Ex,Y,SBrkr,7.141245,0.0,0.0,7.141245,0.0,0.693147,2,0.0,3,0.693147,TA,1.94591,Typ,0.693147,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,5.700444,0.0,0.0,0.0,0.0,0.0,,,,0.0,5,2007,WD,Normal,12.109016
2,3,60,RL,68.0,9.328212,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,1.791759,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,5.09375,Gd,TA,PConc,Gd,TA,Mn,GLQ,6.188264,Unf,0.0,6.075346,6.82546,GasA,Ex,Y,SBrkr,6.82546,6.765039,0.0,7.488294,0.693147,0.0,2,0.693147,3,0.693147,Gd,1.94591,Typ,0.693147,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0.0,3.7612,0.0,0.0,0.0,0.0,,,,0.0,9,2008,WD,Normal,12.317171
3,4,70,RL,60.0,9.164401,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,1.791759,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,5.379897,Unf,0.0,6.293419,6.629363,GasA,Gd,Y,SBrkr,6.869014,6.629363,0.0,7.448916,0.693147,0.0,1,0.0,3,0.693147,Gd,2.079442,Typ,0.693147,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0.0,3.583519,5.609472,0.0,0.0,0.0,,,,0.0,2,2006,WD,Abnorml,11.849405
4,5,60,RL,84.0,9.565284,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,1.791759,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,5.860786,Gd,TA,PConc,Gd,TA,Av,GLQ,6.486161,Unf,0.0,6.196444,7.044033,GasA,Ex,Y,SBrkr,7.044033,6.960348,0.0,7.695758,0.693147,0.0,2,0.693147,4,0.693147,Gd,2.302585,Typ,0.693147,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,5.26269,4.442651,0.0,0.0,0.0,0.0,,,,0.0,12,2008,WD,Normal,12.42922


For this competition I am going to use a linear model, namely Elastic Net regression. Linear models assume a linear relationship between the input features and the output variable. If the true relationship is more complex, a linear model may not capture all of the nuances in the data. Feature engineering, that is, manually creating new features, often lets a linear model capture these more complex patterns.
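
The following sketch illustrates the point on synthetic data, not on the Ames dataset. The target depends on the product of two inputs, so a plain linear fit misses part of the signal until the product is added as an engineered feature. All variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
width = rng.uniform(5, 20, size=200)
depth = rng.uniform(5, 20, size=200)
price = width * depth + rng.normal(0, 1, size=200)   # target driven by the area

X_raw = np.column_stack([width, depth])                  # original features only
X_eng = np.column_stack([width, depth, width * depth])   # plus engineered area

r2_raw = LinearRegression().fit(X_raw, price).score(X_raw, price)
r2_eng = LinearRegression().fit(X_eng, price).score(X_eng, price)
print(f"R^2 raw: {r2_raw:.3f}, with product feature: {r2_eng:.3f}")
```

The model itself is unchanged; only the inputs differ, which is the motivation for the engineered features built in this section.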

Remove unnecessary attributes

In [None]:
def drop_uninformative(X):
    X.drop(['Id', 'Utilities', 'Street', 'PoolQC', 'MiscFeature', 'MiscVal', 'YearRemodAdd'], axis=1, inplace = True)

Add binary attributes such as HasPool and HasGarage, derived from continuous numerical attributes such as PoolArea and GarageArea. Each binary attribute indicates whether that amenity is present at all.

In [None]:
def counts(X):
    X['HasWoodDeck'] = X['WoodDeckSF'].apply(lambda x: 1 if x > 0 else 0)
    X['HasOpenPorch'] = X['OpenPorchSF'].apply(lambda x: 1 if x > 0 else 0)
    X['HasEnclosedPorch'] = X['EnclosedPorch'].apply(lambda x: 1 if x > 0 else 0)
    X['Has3SsnPorch'] = X['3SsnPorch'].apply(lambda x: 1 if x > 0 else 0)
    X['HasScreenPorch'] = X['ScreenPorch'].apply(lambda x: 1 if x > 0 else 0)
    X['HasPool'] = X['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
    X['Has2ndFloor'] = X['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
    X['HasGarage'] = X['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
    X['HasBsmt'] = X['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
    X['HasFireplace'] = X['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
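
The same flags can be computed without a row-wise `apply` by comparing the whole column at once. This is a sketch of an equivalent vectorized form; the helper `add_presence_flags` and the toy values are hypothetical.

```python
import pandas as pd

def add_presence_flags(df, mapping):
    """Add a 0/1 column for each (new_col -> source_col) pair, 1 where source > 0."""
    for new_col, source_col in mapping.items():
        df[new_col] = (df[source_col] > 0).astype(int)
    return df

# Toy data for illustration only
toy = pd.DataFrame({"PoolArea": [0, 512, 0], "GarageArea": [548, 0, 608]})
add_presence_flags(toy, {"HasPool": "PoolArea", "HasGarage": "GarageArea"})
print(toy)
```

Keeping the column pairs in a dictionary also makes it easy to add or remove a flag without repeating the comparison logic.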

Create custom attributes based on domain knowledge

In [None]:
def math_transform(X):
    X["SqFtPerRoom"] = X["GrLivArea"] / (X["TotRmsAbvGrd"] +
                                                       X["FullBath"] +
                                                       X["HalfBath"] +
                                                       X["KitchenAbvGr"])
    X['TotalSqrFootage'] = (X['BsmtFinSF1'] + X['BsmtFinSF2'] + X['1stFlrSF'] + X['2ndFlrSF'])
    X['TotalBathrooms'] = (X['FullBath'] + (0.5 * X['HalfBath']) + X['BsmtFullBath'] + (0.5 * X['BsmtHalfBath']))
    X['TotalPorchSF'] = (X['OpenPorchSF'] + X['3SsnPorch'] + X['EnclosedPorch'] + X['ScreenPorch'] + X['WoodDeckSF'])
    X['TotalHomeQuality'] = (X['OverallQual'] + X['OverallCond'])
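    # Note: this reassignment only rebinds the local name X, so the caller's
    # DataFrame keeps these columns (they still appear in the output below).
    # Pass inplace=True to drop them from the caller's DataFrame as well.
    X = X.drop(['GrLivArea', 'TotRmsAbvGrd', 'FullBath', 'HalfBath', 'KitchenAbvGr', 'BsmtFinSF1', 'BsmtFinSF2', '1stFlrSF', '2ndFlrSF', 'BsmtFullBath', 'BsmtHalfBath', 'OpenPorchSF', '3SsnPorch', 'EnclosedPorch', 'ScreenPorch', 'WoodDeckSF', 'OverallQual', 'OverallCond'], axis=1)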
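
One detail matters for functions that modify a DataFrame passed as an argument: assigning to a column changes the caller's object, while reassigning the parameter name does not. The sketch below shows the difference with two hypothetical helpers on a toy DataFrame.

```python
import pandas as pd

def drop_rebind(df):
    # drop() returns a new DataFrame; only the local name df points to it
    df = df.drop(columns=["b"])

def drop_inplace(df):
    # inplace=True mutates the object the caller passed in
    df.drop(columns=["b"], inplace=True)

first = pd.DataFrame({"a": [1], "b": [2]})
second = first.copy()

drop_rebind(first)
drop_inplace(second)
print(list(first.columns), list(second.columns))
```

Returning the new DataFrame and reassigning it at the call site is an alternative that avoids in-place mutation altogether.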

Encode categorical features

In [None]:
def encode_features(X):
    return pd.get_dummies(X).reset_index(drop=True)
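
When the same encoding is later applied to the test set, the two frames can end up with different dummy columns, because a category may appear in only one of them. One common way to handle this, shown here as a sketch on made-up data rather than as the approach this notebook takes, is to reindex the encoded test frame to the training columns.

```python
import pandas as pd

# Toy frames: 'FV' and 'RM' appear only in train, 'RH' only in test
train_toy = pd.DataFrame({"MSZoning": ["RL", "RM", "FV"], "LotArea": [8450, 9600, 11250]})
test_toy = pd.DataFrame({"MSZoning": ["RL", "RH"], "LotArea": [11622, 14267]})

train_enc = pd.get_dummies(train_toy)

# Keep exactly the training columns: unseen-in-test categories become all-zero
# columns, and categories unseen in train (here 'RH') are dropped
test_enc = pd.get_dummies(test_toy).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

The model then sees the same feature layout at training and prediction time.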

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Apply all functions to the train dataset:</p>


In [None]:
drop_uninformative(df_train)
counts(df_train)
math_transform(df_train)
df_train = encode_features(df_train)

# Check dataset
df_train.head()

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,SalePrice,HasWoodDeck,HasOpenPorch,HasEnclosedPorch,Has3SsnPorch,HasScreenPorch,HasPool,Has2ndFloor,HasGarage,HasBsmt,HasFireplace,SqFtPerRoom,TotalSqrFootage,TotalBathrooms,TotalPorchSF,TotalHomeQuality,MSSubClass_120,MSSubClass_160,MSSubClass_180,MSSubClass_190,MSSubClass_20,MSSubClass_30,MSSubClass_40,MSSubClass_45,MSSubClass_50,MSSubClass_60,MSSubClass_70,MSSubClass_75,MSSubClass_80,MSSubClass_85,MSSubClass_90,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Alley_Grvl,Alley_None,Alley_Pave,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,LotConfig_Corner,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,LandSlope_Gtl,LandSlope_Mod,LandSlope_Sev,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,Condition1_RRNe,Condition1_RRNn,Condition2_Artery,Condition2_Feedr,Condition2_Norm,Condition2_PosA,Condition2_PosN,Condition2_RRAe,Condition2_RRAn,Condition2_RRNn,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,RoofMatl_ClyTile,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,Exterior1st_AsbShng,Exterior1st_AsphShn,Exterior1st_BrkComm,Exterior1st_BrkFace,Exterior1st_CBlock,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing,Exterior2nd_AsbShng,Exterior2nd_AsphShn,Exterior2nd_BrkComm,Exterior2nd_BrkFace,Exterior2nd_CBlock,Exterior2nd_CmentBd,Exterior2nd_HdBoard,Exterior2nd_ImStucc,Exterior2nd_MetalSd,Exterior2nd_Other,Exterior2nd_Plywood,Exterior2nd_Stone,Exterior2nd_Stucco,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_Wd Shng,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,ExterQual_Ex,ExterQual_Fa,ExterQual_Gd,ExterQual_TA,ExterCond_Ex,ExterCond_Fa,ExterCond_Gd,ExterCond_Po,ExterCond_TA,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,BsmtQual_Ex,BsmtQual_Fa,BsmtQual_Gd,BsmtQual_None,BsmtQual_TA,BsmtCond_Fa,BsmtCond_Gd,BsmtCond_None,BsmtCond_Po,BsmtCond_TA,BsmtExposure_Av,BsmtExposure_Gd,BsmtExposure_Mn,BsmtExposure_No,BsmtExposure_None,BsmtFinType1_ALQ,BsmtFinType1_BLQ,BsmtFinType1_GLQ,BsmtFinType1_LwQ,BsmtFinType1_None,BsmtFinType1_Rec,BsmtFinType1_Unf,BsmtFinType2_ALQ,BsmtFinType2_BLQ,BsmtFinType2_GLQ,BsmtFinType2_LwQ,BsmtFinType2_None,BsmtFinType2_Rec,BsmtFinType2_Unf,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,HeatingQC_Ex,HeatingQC_Fa,HeatingQC_Gd,HeatingQC_Po,HeatingQC_TA,CentralAir_N,CentralAir_Y,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,KitchenQual_Ex,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,Functional_Maj1,Functional_Maj2,Functional_Min1,Functional_Min2,Functional_Mod,Functional_Sev,Functional_Typ,FireplaceQu_Ex,FireplaceQu_Fa,FireplaceQu_Gd,FireplaceQu_None,FireplaceQu_Po,FireplaceQu_TA,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_None,GarageFinish_Fin,GarageFinish_None,GarageFinish_RFn,GarageFinish_Unf,GarageQual_Ex,GarageQual_Fa,GarageQual_Gd,GarageQual_None,GarageQual_Po,GarageQual_TA,GarageCond_Ex,GarageCond_Fa,GarageCond_Gd,GarageCond_None,GarageCond_Po,GarageCond_TA,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MoSold_1,MoSold_10,MoSold_11,MoSold_12,MoSold_2,MoSold_3,MoSold_4,MoSold_5,MoSold_6,MoSold_7,MoSold_8,MoSold_9,YrSold_2006,YrSold_2007,YrSold_2008,YrSold_2009,YrSold_2010,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,65.0,9.04204,7,1.791759,2003,5.283204,6.561031,0.0,5.01728,6.753438,6.753438,6.751101,0.0,7.444833,0.693147,0.0,2,0.693147,3,0.693147,2.197225,0.0,2003.0,2,548,0.0,4.127134,0.0,0.0,0.0,0.0,12.247699,0,1,0,0,0,0,1,1,1,0,1.333359,20.06557,3.039721,4.127134,8.791759,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
1,80.0,9.169623,6,2.197225,1976,0.0,6.886532,0.0,5.652489,7.141245,7.141245,0.0,0.0,7.141245,0.0,0.693147,2,0.0,3,0.693147,1.94591,0.693147,1976.0,2,460,5.700444,0.0,0.0,0.0,0.0,0.0,12.109016,1,0,0,0,0,0,0,1,1,1,1.539374,14.027777,2.346574,5.700444,8.197225,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
2,68.0,9.328212,7,1.791759,2001,5.09375,6.188264,0.0,6.075346,6.82546,6.82546,6.765039,0.0,7.488294,0.693147,0.0,2,0.693147,3,0.693147,1.94591,0.693147,2001.0,2,608,0.0,3.7612,0.0,0.0,0.0,0.0,12.317171,0,1,0,0,0,0,1,1,1,1,1.404352,19.778763,3.039721,3.7612,8.791759,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
3,60.0,9.164401,7,1.791759,1915,0.0,5.379897,0.0,6.293419,6.629363,6.869014,6.629363,0.0,7.448916,0.693147,0.0,1,0.0,3,0.693147,2.079442,0.693147,1998.0,3,642,0.0,3.583519,5.609472,0.0,0.0,0.0,11.849405,0,1,1,0,0,0,1,1,1,1,1.974484,18.878275,1.693147,9.192991,8.791759,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
4,84.0,9.565284,8,1.791759,2000,5.860786,6.486161,0.0,6.196444,7.044033,7.044033,6.960348,0.0,7.695758,0.693147,0.0,2,0.693147,4,0.693147,2.302585,0.693147,2000.0,3,836,5.26269,4.442651,0.0,0.0,0.0,0.0,12.42922,1,1,0,0,0,0,1,1,1,1,1.352772,20.490541,3.039721,9.705341,9.791759,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0


# <h2 style = "font-family: Georgia;font-weight: bold; font-size: 30px; color: #1192AA; text-align:left">Data Preprocessing</h2>

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Create a function that applies every preprocessing step to the dataset:</p>

In [None]:
def data_preprocessing(X):
    
    # From Data Cleaning section
    clean(X)
    log_transform(X)
    
    # From Feature Engineering section
    drop_uninformative(X)
    counts(X)
    math_transform(X)
    X = encode_features(X)
    return X

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Merge the train and test datasets so that they are preprocessed together, and set the training target (y):</p>

In [None]:
# Save shapes 
ntrain = train.shape[0]
ntest = test.shape[0]

# Here we take log transformed values from df_train to set a train target
y = df_train.SalePrice.values 

# Create new dataset to preprocess data: 
df_new = pd.concat((train, test)).reset_index(drop=True)
df_new.drop(['SalePrice'], axis=1, inplace=True)

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Call data_preprocessing and check our data:</p>

In [None]:
all_data = data_preprocessing(df_new)
all_data.tail()

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,HasWoodDeck,HasOpenPorch,HasEnclosedPorch,Has3SsnPorch,HasScreenPorch,HasPool,Has2ndFloor,HasGarage,HasBsmt,HasFireplace,SqFtPerRoom,TotalSqrFootage,TotalBathrooms,TotalPorchSF,TotalHomeQuality,MSSubClass_120,MSSubClass_150,MSSubClass_160,MSSubClass_180,MSSubClass_190,MSSubClass_20,MSSubClass_30,MSSubClass_40,MSSubClass_45,MSSubClass_50,MSSubClass_60,MSSubClass_70,MSSubClass_75,MSSubClass_80,MSSubClass_85,MSSubClass_90,MSZoning_C (all),MSZoning_FV,MSZoning_None,MSZoning_RH,MSZoning_RL,MSZoning_RM,Alley_Grvl,Alley_None,Alley_Pave,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,LotConfig_Corner,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,LandSlope_Gtl,LandSlope_Mod,LandSlope_Sev,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,Condition1_Artery,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,Condition1_RRNe,Condition1_RRNn,Condition2_Artery,Condition2_Feedr,Condition2_Norm,Condition2_PosA,Condition2_PosN,Condition2_RRAe,Condition2_RRAn,Condition2_RRNn,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE
,HouseStyle_1.5Fin,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,RoofMatl_ClyTile,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,Exterior1st_AsbShng,Exterior1st_AsphShn,Exterior1st_BrkComm,Exterior1st_BrkFace,Exterior1st_CBlock,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing,Exterior2nd_AsbShng,Exterior2nd_AsphShn,Exterior2nd_BrkComm,Exterior2nd_BrkFace,Exterior2nd_CBlock,Exterior2nd_CmentBd,Exterior2nd_HdBoard,Exterior2nd_ImStucc,Exterior2nd_MetalSd,Exterior2nd_Other,Exterior2nd_Plywood,Exterior2nd_Stone,Exterior2nd_Stucco,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_Wd Shng,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,ExterQual_Ex,ExterQual_Fa,ExterQual_Gd,ExterQual_TA,ExterCond_Ex,ExterCond_Fa,ExterCond_Gd,ExterCond_Po,ExterCond_TA,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,BsmtQual_Ex,BsmtQual_Fa,BsmtQual_Gd,BsmtQual_None,BsmtQual_TA,BsmtCond_Fa,BsmtCond_Gd,BsmtCond_None,BsmtCond_Po,BsmtCond_TA,BsmtExposure_Av,BsmtExposure_Gd,BsmtExposure_Mn,BsmtExposure_No,BsmtExposure_None,BsmtFinType1_ALQ,BsmtFinType1_BLQ,BsmtFinType1_GLQ,BsmtFinType1_LwQ,BsmtFinType1_None,BsmtFinType1_Rec,BsmtFinType1_Unf,BsmtFinType2_ALQ,BsmtFinType2_BLQ,BsmtFinType2_GLQ,BsmtFinType2_LwQ,BsmtFinType2_None,BsmtFinType2_Rec,BsmtFinType2_Unf,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,HeatingQC_Ex,HeatingQC_Fa,HeatingQC_Gd,HeatingQC_Po,HeatingQC_TA,CentralAir_N,CentralAir_Y,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,KitchenQual_Ex,KitchenQual_Fa,Ki
tchenQual_Gd,KitchenQual_TA,Functional_Maj1,Functional_Maj2,Functional_Min1,Functional_Min2,Functional_Mod,Functional_Sev,Functional_Typ,FireplaceQu_Ex,FireplaceQu_Fa,FireplaceQu_Gd,FireplaceQu_None,FireplaceQu_Po,FireplaceQu_TA,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_None,GarageFinish_Fin,GarageFinish_None,GarageFinish_RFn,GarageFinish_Unf,GarageQual_Ex,GarageQual_Fa,GarageQual_Gd,GarageQual_None,GarageQual_Po,GarageQual_TA,GarageCond_Ex,GarageCond_Fa,GarageCond_Gd,GarageCond_None,GarageCond_Po,GarageCond_TA,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MoSold_1,MoSold_10,MoSold_11,MoSold_12,MoSold_2,MoSold_3,MoSold_4,MoSold_5,MoSold_6,MoSold_7,MoSold_8,MoSold_9,YrSold_2006,YrSold_2007,YrSold_2008,YrSold_2009,YrSold_2010,SaleType_COD,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
2914,21.0,7.568896,4,2.079442,1970,0.0,0.0,0.0,6.304449,6.304449,6.304449,6.304449,0.0,6.996681,0.0,0.0,1,0.693147,3,0.693147,1.791759,0.0,1970.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,1,0,1,0,1.674627,12.608898,1.346574,0.0,6.079442,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
2915,21.0,7.546974,4,1.791759,1970,0.0,5.533389,0.0,5.686975,6.304449,6.304449,6.304449,0.0,6.996681,0.0,0.0,1,0.693147,3,0.693147,1.94591,0.0,1970.0,1.0,286.0,0.0,3.218876,0.0,0.0,0.0,0.0,0,1,0,0,0,0,1,1,1,0,1.61504,18.142287,1.346574,3.218876,5.791759,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
2916,160.0,9.903538,5,2.079442,1960,0.0,7.110696,0.0,0.0,7.110696,7.110696,0.0,0.0,7.110696,0.693147,0.0,1,0.0,4,0.693147,2.079442,0.693147,1960.0,2.0,576.0,6.163315,0.0,0.0,0.0,0.0,0.0,1,0,0,0,0,0,0,1,1,1,1.884832,14.221392,1.693147,6.163315,7.079442,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
2917,62.0,9.253591,5,1.791759,1992,0.0,5.823046,0.0,6.356108,6.816736,6.878326,0.0,0.0,6.878326,0.0,0.693147,1,0.0,3,0.693147,1.94591,0.0,1992.0,0.0,0.0,4.394449,3.496508,0.0,0.0,0.0,0.0,1,1,0,0,0,0,0,0,1,0,1.89014,12.701372,1.346574,7.890957,6.791759,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
2918,74.0,9.172431,7,1.791759,1993,4.553877,6.632002,0.0,5.476464,6.904751,6.904751,6.912743,0.0,7.601402,0.0,0.0,2,0.693147,3,0.693147,2.302585,0.693147,1993.0,3.0,650.0,5.252273,3.89182,0.0,0.0,0.0,0.0,1,1,0,0,0,0,1,1,1,1,1.336186,20.449495,2.346574,9.144094,8.791759,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0


<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Split the preprocessed data back into train and test. Check the shapes (train and test must have the same number of columns; train and the target y must have the same number of rows):</p>

In [None]:
train = all_data[:ntrain]
test = all_data[ntrain:]
print(f"train shape: {train.shape}\ny_train shape: {y.shape}\ntest shape: {test.shape}")

train shape: (1460, 333)
y_train shape: (1460,)
test shape: (1459, 333)
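
Why encode train and test together? A minimal sketch on toy data, assuming the encoding step behaves roughly like `pandas.get_dummies` (the notebook's own `encode_features` is defined earlier and may differ). All frames and column values here are hypothetical:

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical data).
toy_train = pd.DataFrame({"Area": [1000, 1500, 1200],
                          "Style": ["1Story", "2Story", "1Story"]})
toy_test = pd.DataFrame({"Area": [900, 1700],
                         "Style": ["SLvl", "2Story"]})

# Encoding each frame separately gives different column sets,
# because each frame contains different categories.
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)

# Encoding the concatenated frame and splitting by the saved row count
# gives both parts the same columns in the same order.
n_toy_train = toy_train.shape[0]
combined = pd.get_dummies(pd.concat((toy_train, toy_test)).reset_index(drop=True))
enc_train = combined[:n_toy_train]
enc_test = combined[n_toy_train:]

print(list(sep_train.columns))
print(list(sep_test.columns))
print(list(enc_train.columns) == list(enc_test.columns))
```

Categories that appear only in one of the two frames still get a column in both parts, which is what the shape check relies on.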


# <h2 style = "font-family: Georgia; font-weight: bold; font-size: 30px; color: #1192AA; text-align:left">Model Building</h2>

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >
The main idea was to use Elastic Net regression. One-hot encoding in the feature engineering part produced a large number of features (333 columns), and regularized linear models cope well with that, because the penalty helps to prevent overfitting and improves the model's generalization performance.
</p>
<p style = "font-family:Georgia;font-size:14px; color:#000000 ;background-color:  ; text-align: left;" >
However, Elastic Net regression may not be the best-performing model for a given dataset. Therefore, I will also test other regression models: XGBoost Regressor, Support Vector Regressor, and LGBMRegressor. These models are known to perform well on a wide range of datasets, and I'm interested in seeing how they compare to Elastic Net (which combines the Lasso and Ridge penalties) on the House Prices dataset.
</p>
<p style = "font-family:Georgia;font-size:14px; color:#000000 ;background-color:  ; text-align: left;" >
Once I have tested all of the regression models, I will blend them together to create a final model. Model blending is a form of ensemble modeling that combines the predictions of multiple models to create a more accurate and robust model. I will use a weighted average approach to blend the models, where the weights are determined based on the individual models' performance on the validation set. This approach helps to reduce the impact of individual models that may not perform well on the dataset while emphasizing the strengths of the better-performing models.
</p>
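
As an illustration of performance-based weighting, here is a minimal sketch that turns validation errors into blend weights by inverse-RMSE weighting. This is one possible scheme, not necessarily how the weights used later in this notebook were chosen; the model names, RMSE values and predictions are all hypothetical:

```python
import numpy as np

# Hypothetical validation RMSEs for three models (illustrative values only).
val_rmse = {"model_a": 0.12, "model_b": 0.13, "model_c": 0.15}

# Inverse-error weighting: a lower RMSE gives a larger weight.
# Dividing by the total makes the weights sum to 1.
inv = {name: 1.0 / err for name, err in val_rmse.items()}
total = sum(inv.values())
weights = {name: value / total for name, value in inv.items()}

# Hypothetical per-model predictions for two houses (log-price scale).
preds = {
    "model_a": np.array([12.10, 11.90]),
    "model_b": np.array([12.20, 11.80]),
    "model_c": np.array([12.00, 12.00]),
}

# Weighted average of the predictions.
blended = sum(weights[name] * preds[name] for name in preds)

print(weights)
print(blended)
```

Because the weights sum to 1, every blended value stays between the smallest and largest individual prediction for that house.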

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Set up cross-validation folds:</p>

In [None]:
kf = KFold(n_splits=12, shuffle=True, random_state=11)

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Define error metrics:</p>

In [None]:
# Root Mean Squared Error
def rmse(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

# Cross Validation of the Root Mean Square Error
def cv_rmse(model, X=train):
    return np.sqrt(-cross_val_score(model, X, y,
                                    scoring="neg_mean_squared_error", cv=kf))
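
The minus sign inside `cv_rmse` is there because scikit-learn's `neg_mean_squared_error` scorer returns the negated MSE, so that higher is always better. A small self-contained sketch on synthetic data (the data, the `Ridge` model and the names ending in `_demo` are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=120)

kf_demo = KFold(n_splits=4, shuffle=True, random_state=11)

# One negated MSE per fold; the values are less than or equal to zero.
neg_mse = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=kf_demo)

# Flip the sign, then take the square root to get one RMSE per fold.
fold_rmse = np.sqrt(-neg_mse)

print(neg_mse)
print(fold_rmse.mean())
```

The result is an array with one RMSE per fold, which is why the score cells below call `.mean()` on it.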

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Set up models:</p>

Elastic Net's hyperparameters are optimised by cross-validation using the ElasticNetCV class.

SVR's hyperparameters are optimised using GridSearchCV.



In [None]:
# Set up Parameters 
# define the range of alpha values to search
alpha_range = [0.001, 0.01, 0.1, 1, 10]

# define the range of l1_ratio values to search
l1_ratio_range = [0.1, 0.3, 0.5, 0.7, 0.9]

# define the Elastic Net regression model
elastic_net = make_pipeline(RobustScaler(),
                            ElasticNetCV(alphas=alpha_range, l1_ratio=l1_ratio_range, cv=kf))

svr_grid = {'C': [20, 22], 'epsilon':[0.008, 0.009], 'gamma': [0.001, 0.002, 0.0025]}

# Support Vector Regression
svr = make_pipeline(RobustScaler(),
                    GridSearchCV(SVR(),svr_grid, cv=kf))

# Light Gradient Boosting Regression
lightgbm = LGBMRegressor(n_estimators=6999,
                         learning_rate=0.01, 
                         num_leaves=6,
                         bagging_seed=8,
                         feature_fraction_seed=8,
                         objective='mse',
                         random_state=11,
                         )

# XGBoost Regression
xgboost = XGBRegressor(n_estimators=2000,
                       learning_rate=0.03,
                       max_depth=4,
                       subsample=0.72,
                       colsample_bytree=0.41,
                       random_state = 11)
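
Once a pipeline like `elastic_net` has been fitted, the hyperparameters picked by cross-validation can be read back from the `ElasticNetCV` step. A minimal sketch on synthetic data (the data and the names `pipe`, `alpha_grid` and `l1_grid` are assumptions; the grids simply mirror the ones above):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 8))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 3] + rng.normal(scale=0.1, size=150)

alpha_grid = [0.001, 0.01, 0.1, 1, 10]
l1_grid = [0.1, 0.3, 0.5, 0.7, 0.9]

pipe = make_pipeline(RobustScaler(),
                     ElasticNetCV(alphas=alpha_grid, l1_ratio=l1_grid, cv=5))
pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
search = pipe.named_steps["elasticnetcv"]
best_alpha = float(search.alpha_)
best_l1_ratio = float(search.l1_ratio_)

print(best_alpha, best_l1_ratio)
```

A fitted `GridSearchCV` step exposes the analogous information through its `best_params_` attribute.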

<p style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;" >Check scores:</p>

In [None]:
score_lightgbm = cv_rmse(lightgbm)
print(f"lightgbm: {score_lightgbm.mean()}")

lightgbm: 0.132723000543209


In [None]:
score_xgboost = cv_rmse(xgboost)
print(f"xgboost: {score_xgboost.mean()}")

xgboost: 0.12575723320483315


In [None]:
score_elastic = cv_rmse(elastic_net)
print(f"elastic: {score_elastic.mean()}")

elastic: 0.12294385286108138


In [None]:
score_svr = cv_rmse(svr)
print(f"svr: {score_svr.mean()}")

svr: 0.1235645124649269


In [None]:
scores = {'lightgbm': score_lightgbm,
          'xgboost': score_xgboost,
          'elastic net': score_elastic,
          'svr': score_svr}

<div style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;">

In addition, I am going to train a stacked model with StackingCVRegressor, using CatBoostRegressor as the meta-model.

Meta-learning is a type of machine learning that involves learning from other machine learning models. In the context of regression models, meta-learning can be used to combine the predictions of multiple base models to create a more accurate overall prediction.
</div>

A meta-model is a higher-level model that takes the predictions of several base models as its inputs. The aim is to combine the strengths of the base models and offset their individual weaknesses, which often gives better overall performance.

StackingCVRegressor trains each base model (here Elastic Net, XGBoost, LightGBM and SVR) with cross-validation, so that the predictions passed to the meta-model come from folds the base model was not fitted on. This is what the "CV" in the name refers to, and it helps to prevent overfitting. CatBoostRegressor is the meta-model, not one of the base models. Because use_features_in_secondary=True is set below, it is trained on the original features together with the base models' predictions.
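
To make the stacking idea concrete, here is a hand-rolled sketch using only scikit-learn. It is an illustration of the general technique, not mlxtend's actual implementation; the base models, the linear meta-model and the synthetic data are all stand-ins chosen to keep the example small:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=3, random_state=0)]
folds = KFold(n_splits=5, shuffle=True, random_state=11)

# Out-of-fold predictions: every row is predicted by a model
# that was fitted without that row.
oof = np.column_stack([cross_val_predict(m, X_demo, y_demo, cv=folds)
                       for m in base_models])

# Meta features = base predictions plus the original features
# (the analogue of use_features_in_secondary=True).
meta_X = np.hstack([oof, X_demo])
meta = LinearRegression().fit(meta_X, y_demo)

# For new data, refit the base models on all rows and stack their predictions.
fitted = [m.fit(X_demo, y_demo) for m in base_models]
X_new = rng.normal(size=(3, 4))
new_meta_X = np.hstack([np.column_stack([m.predict(X_new) for m in fitted]), X_new])
stacked_pred = meta.predict(new_meta_X)

print(stacked_pred)
```

The key point is that the meta-model never sees a base prediction for a row that the base model was trained on.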

In [None]:
meta_model = CatBoostRegressor(iterations = 6000,
                               learning_rate = 0.005,
                               depth = 4,
                               l2_leaf_reg = 1,
                               eval_metric = 'RMSE',
                               random_seed = 11,
                               logging_level = 'Silent')

stacking_model = StackingCVRegressor(regressors=(elastic_net, xgboost, lightgbm, svr),
                                      meta_regressor=meta_model,
                                      use_features_in_secondary=True)

<div style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;">
You can use this code to check the cross-validation score of the stacking model, but it takes a long time to run.
</div>
    
```python
score_stacking_model = cv_rmse(stacking_model)
scores['stacking'] = score_stacking_model
print(f"stacking: {score_stacking_model.mean()}")
```

 <h2 style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;">Fit the models:</h2>

In [None]:
lightgbm_fit = lightgbm.fit(train, y)

xgboost_fit = xgboost.fit(train, y)

elasticNet_fit = elastic_net.fit(train, y)

svr_fit = svr.fit(train, y)

stacking_model_fit = stacking_model.fit(np.array(train), np.array(y))

# <h2 style = "font-family: Georgia; font-weight: bold; font-size: 30px; color: #1192AA; text-align:left">Blend and Predict</h2>

<h2 style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;">Define function:</h2>

The function blend_predictions takes a single argument X and returns a blended prediction that combines the predictions of five fitted models, all trained on the same training data (train, y):

* A LightGBM model (lightgbm_fit)
* An XGBoost model (xgboost_fit)
* An ElasticNet model (elasticNet_fit)
* A Support Vector Regression (SVR) model (svr_fit)
* A stacking model (stacking_model_fit), which itself uses the four models above as base regressors

Each model generates a prediction for the input data X, and the function returns a weighted average of these predictions. The weights are:

* 0.1 for the LightGBM model
* 0.1 for the XGBoost model
* 0.4 for the ElasticNet model
* 0.15 for the SVR model
* 0.25 for the stacking model

The weights sum to 1, so the output stays on the same scale as the individual predictions. By blending the predictions from multiple models, the function aims to produce a more accurate and robust prediction than a single model would. The choice of models and weights reflects experimentation and tuning on the training dataset.

In [None]:
def blend_predictions(X):
    # Weighted average of the five fitted models; the weights sum to 1
    return ((0.1 * lightgbm_fit.predict(X)) +
            (0.1 * xgboost_fit.predict(X)) +
            (0.4 * elasticNet_fit.predict(X)) +
            (0.15 * svr_fit.predict(X)) +
            (0.25 * stacking_model_fit.predict(np.array(X))))

<h2 style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;">Check the blended score on the training data:</h2>

In [None]:
blended_score = rmse(y, blend_predictions(train))
scores['blended'] = blended_score
print(f"Blended score: {blended_score}")

Blended score: 0.06569235712718698
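
This score is computed on the same rows the models were fitted on, so it is optimistic and should not be compared directly with the cross-validated scores of the individual models. A like-for-like estimate would blend out-of-fold predictions instead. A minimal sketch of that idea on synthetic data (the two stand-in models, the equal weights and the data are assumptions, not the notebook's setup):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with noticeable noise (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

models = [Ridge(alpha=1.0), DecisionTreeRegressor(random_state=0)]
blend_weights = [0.5, 0.5]
folds = KFold(n_splits=5, shuffle=True, random_state=11)

def blend(columns):
    # Weighted average of the per-model prediction arrays.
    return sum(w * col for w, col in zip(blend_weights, columns))

# In-sample: fit on all rows, predict the same rows.
in_sample = blend([m.fit(X_demo, y_demo).predict(X_demo) for m in models])

# Out-of-fold: each row is predicted by models that did not see it.
out_of_fold = blend([cross_val_predict(m, X_demo, y_demo, cv=folds) for m in models])

in_sample_rmse = np.sqrt(mean_squared_error(y_demo, in_sample))
oof_rmse = np.sqrt(mean_squared_error(y_demo, out_of_fold))

print(in_sample_rmse, oof_rmse)
```

In this toy setup the in-sample RMSE comes out lower than the out-of-fold RMSE, because the unpruned tree memorises the training rows.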


<h2 style = "font-family:Georgia;font-size:14px; color:#000000; text-align: left;">Plot all scores:</h2>

In [None]:
# Create figure (per-model values are mean CV RMSE; 'blended' is the training-set RMSE)
fig = px.line(x=list(scores.keys()), y=[score.mean() for score in scores.values()],
              markers=True, color_discrete_sequence = ['lightblue'], template = 'simple_white')

# Set Title and x/y axis labels
fig.update_layout(
    xaxis_title="Model",
    yaxis_title="Root Mean Squared Error",
    showlegend = False,
    font = dict(
            size = 14
            ),    
    title={
        'text': "RMSE by Model",
        'y':0.95,
        'x':0.5
        }
    )

# Display
fig.show()