# Importing data + EDA

Importing the necessary libraries and dataset for this section.

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
import math

#Create a scrollable window so dataframe will not overflow the screen
#Taken from https://www.youtube.com/watch?v=NQQ3DRdXAXE&list=PL1CS4FDP5q9n5UM0qH_2XgsjH-8a6TK06&index=66
def scrollable_table(df, table_id, title):
    html = f'<h3>{title}</h3>'
    html += f'<div id="{table_id}" style="height:200px; overflow:auto;">'
    html += df.to_html()
    html += '</div>'
    return html

In [None]:
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
html_train = scrollable_table(df_train, "training_df", "Train")
HTML(html_train)

In [None]:
print(f"Train dataset has {df_train.shape[1]} columns and {df_train.shape[0]} rows.")
print(f"Test dataset has {df_test.shape[1]} columns and {df_test.shape[0]} rows.")

In [None]:
print(f"There are {len(set(df_train.dtypes))} unique data types in the df, which are {set(df_train.dtypes)}.")

In [None]:
numerical_features = df_train.select_dtypes(include=['int64', 'float64'])
categorical_features = df_train.select_dtypes(include=['O'])

Below is a quick way to visualize every feature with seaborn, so you can roughly understand the general distribution of all features before doing any deeper analysis. I find this a very useful first step, especially when you have a huge number of features like in this dataset.

I use regplot rather than scatterplot for numerical features, since the fitted line adds some extra information (the correlation with SalePrice), and boxplot for categorical features. You can change these based on your preferences.

In [None]:
#cluster same datatype together so it will be visually better
train_dtcluster = numerical_features.join(categorical_features) 

#The function takes 3 arguments: your dataset, the target column (SalePrice in this case),
#and the number of subplot columns, which has a default value of 4
def plot_all(df, target_col, num_cols=4):
    num_rows = math.ceil(len(df.columns)/num_cols)
    fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(16, 4*num_rows))
    sns.set(font_scale=1.2, style='whitegrid')
    for i, col_name in enumerate(df.columns):
        if col_name != target_col.name:
            ax = axes[i // num_cols, i % num_cols]
            if df[col_name].dtype == "O":
                sns.boxplot(x=df[col_name], y=target_col, ax=ax) #Categorical plot
            else:
                sns.regplot(x=df[col_name], y=target_col, ax=ax) #Numerical plot
            ax.set_xlabel(col_name, fontsize=14)
            ax.set_ylabel(target_col.name, fontsize=14)
            ax.tick_params(axis='both', which='major', labelsize=12)
    plt.tight_layout()
    plt.show()

plot_all(train_dtcluster.drop('Id',axis=1), df_train.SalePrice, num_cols=5)

The plots above give us some insight into the relationship between each feature and the target sale price. For example, features like Utilities, PoolArea and MiscVal look nearly uninformative by eye, so we can potentially drop them for better training efficiency and results.

Also, the regplots make it clear that some of the 'numerical_features' are really categorical and should be labelled as such, which is what we will do next.

In [None]:
df_train[['MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 
          'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 
          'Fireplaces', 'GarageCars', 'MoSold', 'YrSold']]= df_train[['MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFullBath', 'BsmtHalfBath', 
          'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 
          'Fireplaces', 'GarageCars', 'MoSold', 'YrSold']].astype(str)

In [None]:
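#Keep only numeric columns: recent pandas versions raise an error if corr() sees string columns
sns.heatmap(df_train.select_dtypes(include='number').corr())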

In [None]:
sns.histplot(df_train.SalePrice, kde=True)
plt.title("Distribution of SalePrice")

The histplot shows a positive skew in our target distribution, which suggests that we could apply a log transformation to the target variable. The log transformation reduces the skewness and brings the distribution closer to normal, which can improve the performance of the many ML algorithms that assume a normal distribution. But remember to take exp of the final predictions to get back to the original price scale at the end.
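As a rough illustration of the log/exp round trip described above, here is a minimal sketch on synthetic data. The log-normal "prices" and all variable names are made up for the example and are not the competition data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" standing in for SalePrice (assumption, not real data)
toy_prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

toy_log_prices = np.log(toy_prices)    # the scale a model would be trained on
toy_recovered = np.exp(toy_log_prices) # the scale we need for the final predictions

skew_before = skew(toy_prices)
skew_after = skew(toy_log_prices)
print(f"skew before: {skew_before:.2f}, skew after log: {skew_after:.2f}")
```

The skewness should shrink towards zero after the log, and exp brings the values back to the original scale.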

In [None]:
df_train['SalePrice'] = np.log(df_train.SalePrice)
y_train = df_train['SalePrice']

In [None]:
sns.histplot(df_train.SalePrice, kde=True)
plt.title("Log-Transformed SalePrice Distribution")

In [None]:
nan_df = pd.DataFrame({"missing_count": df_train.isna().sum().values, 
                       "dtype": df_train.dtypes.values,
                       "Percentage of missing values": (df_train.isna().sum().values / df_train.shape[0]) * 100},
              index=df_train.columns)
html_nandf = scrollable_table(nan_df.sort_values(by="missing_count", ascending=False), "nan",
                             "Count of missing values in each features")
HTML(html_nandf)

As we can see, PoolQC, MiscFeature, Alley and Fence all have a high percentage of missing values, so I will drop them in this case.

In [None]:
df_train = df_train.drop(["PoolQC", "MiscFeature", "Alley", "Fence"],axis=1)

For the remaining columns that have missing values, I will use SimpleImputer from sklearn to impute the mean for numerical features and a constant 'missing' value for categorical features. The constant makes more sense in this case, since a missing value in a categorical feature may be a sign that the house does not have that particular feature, e.g. the house does not have a garage.
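The two imputation strategies can be sketched on a tiny made-up frame. The column names are borrowed from the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame with one missing numeric and one missing categorical value
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageType": pd.Series(["Attchd", np.nan, "Detchd"], dtype=object),
})

num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")

# SimpleImputer expects 2D input, hence the double brackets
filled_num = num_imputer.fit_transform(toy[["LotFrontage"]])[:, 0]
filled_cat = cat_imputer.fit_transform(toy[["GarageType"]])[:, 0]

print(filled_num)
print(filled_cat)
```

The numeric gap is filled with the column mean, while the categorical gap becomes its own 'missing' category that the model can learn from.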

Another observation, from the plot_all output and from the feature descriptions themselves, is that there are quite a number of similar features in the dataset, which will most likely result in **multicollinearity** (high correlation between two or more predictor variables). This can cause serious problems, especially for algorithms such as linear regression.

There are multiple approaches to reducing the problem of multicollinearity, such as manually dropping features, L1/L2 regularization and PCA, which is what I will be doing in this notebook as part of the pipeline.
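To make the PCA idea a bit more concrete, here is a small sketch on synthetic data with two nearly duplicate features. Passing a float to `n_components` asks sklearn's PCA to keep enough components to reach that share of variance; the data and names below are assumptions for the example only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=500)
# Two nearly duplicate features plus one independent feature (synthetic)
X_toy = np.column_stack([
    base,
    base + rng.normal(scale=0.05, size=500),
    rng.normal(size=500),
])
toy_corr = np.corrcoef(X_toy[:, 0], X_toy[:, 1])[0, 1]

X_toy_scaled = StandardScaler().fit_transform(X_toy)
# Keep as many components as needed to explain at least 95% of the variance
toy_pca = PCA(n_components=0.95, svd_solver="full")
X_toy_reduced = toy_pca.fit_transform(X_toy_scaled)
retained = toy_pca.explained_variance_ratio_.sum()

print(f"correlation of the duplicate pair: {toy_corr:.3f}")
print(f"components kept: {X_toy_reduced.shape[1]}, variance retained: {retained:.3f}")
```

The duplicate pair collapses into a single component, so the three original features shrink to two with almost no information lost.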

# Data Preprocessing

Now we will preprocess our data. In this section I will cover:

1. Feature Selection
2. Removing Outliers
3. Numerical to categorical transformation
4. Imputing
5. Normalizing and scaling numerical features
6. One Hot Encoding
7. Pass everything through a pipeline

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import skew

train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
train = train.drop('Id',axis=1)
train = train[train.GrLivArea < 4000] #Remove outliers
test_id = test.Id
test = test.drop('Id',axis=1)

In [None]:
#Transfer the columns to categorical data
transfer_col = ['MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFullBath', 
                'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 
                'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 
                'MoSold', 'YrSold']
for i in transfer_col:
    train[i] = train[i].astype(str)

#Remove outliers (already filtered when loading above, so this is a no-op here)
train = train[train.GrLivArea < 4000] 

#Drop unnecessary features
X = train.drop(["SalePrice", "PoolQC", "MiscFeature", "Alley", "Fence", "Utilities"] ,axis=1)

#Log y
y = np.log(train.SalePrice)

#Imputing both categorical and numerical features
num_col = X.select_dtypes(include=['int64', 'float64']).columns
cat_col = X.select_dtypes(include=['object']).columns

imputer_num = SimpleImputer(strategy='mean')
imputer_cat = SimpleImputer(strategy='constant', fill_value='missing')

for col in X.columns:
    if X[col].dtype == 'O':
        X[col] = imputer_cat.fit_transform(X[col].values.reshape(-1,1))[:,0] 
    else:
        X[col] = imputer_num.fit_transform(X[col].values.reshape(-1,1))[:,0] 

#normalize features
skewness = X[num_col].apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
X[skewed_features] = np.log1p(X[skewed_features])

#StandardScaler
X[num_col] = StandardScaler().fit_transform(X[num_col])

#One hot encoding
for col in cat_col:
    X[col] = X[col].apply(lambda x: str(x) + str(col))
    
dummies_train = pd.get_dummies(X[cat_col])
X = X.drop(cat_col, axis=1)
X = X.join(dummies_train)

X_preprocessed_train = X

In [None]:
X_preprocessed_train.head()

We do basically the same for the test set, except for one hot encoding, where we use the train dummies as a framework and the **.align** function so that the encoded test set ends up with the same columns.
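Here is a minimal sketch of how `align` with `join='left'` forces the test dummies onto the train columns. The "Dirt" category is invented purely to show what happens with a value that never appears in train.

```python
import pandas as pd

# Made-up categorical column; "Dirt" only exists in the toy test set
toy_train_cat = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
toy_test_cat = pd.DataFrame({"Street": ["Pave", "Dirt"]})

toy_train_dummies = pd.get_dummies(toy_train_cat)
toy_test_dummies = pd.get_dummies(toy_test_cat)

# join='left' keeps exactly the train columns, in the train order;
# columns that only exist in test are dropped, missing ones become NaN
_, toy_test_aligned = toy_train_dummies.align(toy_test_dummies, join="left", axis=1)
toy_test_aligned = toy_test_aligned.fillna(0)

print(toy_test_aligned)
```

After alignment, both frames share the same column layout, so a model trained on the train dummies can accept the test dummies directly.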

In [None]:
#Transfer the columns to categorical data
transfer_col = ['MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFullBath', 
                'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 
                'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 
                'MoSold', 'YrSold']
for i in transfer_col:
    test[i] = test[i].astype(str)

X = test.drop(["PoolQC", "MiscFeature", "Alley", "Fence", "Utilities"] ,axis=1)

num_col = X.select_dtypes(include=['int64', 'float64']).columns
cat_col = X.select_dtypes(include=['object']).columns

imputer_num = SimpleImputer(strategy='mean')
imputer_cat = SimpleImputer(strategy='constant', fill_value='missing')

for col in X.columns:
    if X[col].dtype == 'O':
        X[col] = imputer_cat.fit_transform(X[col].values.reshape(-1,1))[:,0] 
    else:
        X[col] = imputer_num.fit_transform(X[col].values.reshape(-1,1))[:,0] 

skewness = X[num_col].apply(lambda x: skew(x))
skewness = skewness[abs(skewness) > 0.5]
skewed_features = skewness.index
X[skewed_features] = np.log1p(X[skewed_features])

X[num_col] = StandardScaler().fit_transform(X[num_col])

#One hot encode with align
for col in cat_col:
    X[col] = X[col].apply(lambda x: str(x) + str(col))
dummies_test = pd.get_dummies(X[cat_col])
X = X.drop(cat_col, axis=1)
final_train, final_test = dummies_train.align(dummies_test, join='left', axis=1)
X = X.join(final_test)
X = X.fillna(0)

X_preprocessed_test = X

In [None]:
X_preprocessed_test.head()

## Pipeline

Now we will pass everything through a pipeline. What I am doing below is creating estimators; you can imagine each one as a function that takes a dataset in and returns the transformed output.

(Note: none of the code above has any effect on what we will be doing from now on; it's just for demonstration purposes.)
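Before the full set of estimators, here is a stripped-down sketch of the pattern: a class with `fit` and `transform` that inherits from `BaseEstimator` and `TransformerMixin`, chained in a `Pipeline`. The two classes and the toy frame are hypothetical and are not the ones used later in this notebook.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ToyColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that drops the given columns."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        return X.drop(columns=self.columns)

class ToyToString(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that casts the given columns to str."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()  # avoid modifying the caller's dataframe
        for col in self.columns:
            X[col] = X[col].astype(str)
        return X

toy = pd.DataFrame({"Id": [1, 2], "MoSold": [5, 12], "LotArea": [8450, 9600]})
toy_pipeline = Pipeline([
    ("drop", ToyColumnDropper(["Id"])),
    ("to_str", ToyToString(["MoSold"])),
])
toy_out = toy_pipeline.fit_transform(toy)
print(toy_out)
```

Copying inside `transform` is a design choice in this sketch: it keeps the input dataframe untouched, so the same raw data can be passed through the pipeline again.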

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import skew

train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
train = train[train.GrLivArea < 4000] #Remove outliers
y = np.log(train.SalePrice)
train = train.drop(['Id','SalePrice'],axis=1)
test_id = test.Id
test = test.drop('Id',axis=1)

In [None]:
class FeatureDropper(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.drop(["PoolQC", "MiscFeature", "Alley", "Fence", "Utilities"], axis=1)

class categorical_transformation(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        transfer_col = ['MSSubClass', 'OverallQual', 'OverallCond', 'BsmtFullBath', 
                        'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 
                        'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 
                        'MoSold', 'YrSold']
        for i in transfer_col:
            X[i] = X[i].astype(str)
        return X

class Imputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        imputer_num = SimpleImputer(strategy='mean')
        imputer_cat = SimpleImputer(strategy='constant', fill_value='missing')
        for col in X.columns:
            if X[col].dtype == 'O':
                X[col] = imputer_cat.fit_transform(X[col].values.reshape(-1,1))[:,0] 
            else:
                X[col] = imputer_num.fit_transform(X[col].values.reshape(-1,1))[:,0] 
        return X

class normalize(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        num_col = X.select_dtypes(exclude="O").columns
        X[num_col] = StandardScaler().fit_transform(X[num_col])
        return X

class log_skewed(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        num_col = X.select_dtypes(exclude='O').columns
        skewness = X[num_col].apply(lambda x: skew(x))
        skewness = skewness[abs(skewness) > 0.5]
        skewed_features = skewness.index
        X[skewed_features] = np.log1p(X[skewed_features])
        return X

class FeatureEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        
        #Create a framework based on train dataset
        cat_col = framework.select_dtypes(include='O').columns
        framework_dummies = pd.get_dummies(framework[cat_col])
        
        #One Hot encoding for passed in dataset
        for col in cat_col:
            X[col] = X[col].apply(lambda x: str(x) + str(col))
        X_dummies = pd.get_dummies(X[cat_col])
        X = X.drop(cat_col,axis=1)
        _, final = framework_dummies.align(X_dummies, join='left', axis=1)
        X = X.join(final)
        X = X.fillna(0)
        return X


class PCA_transformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        pca = PCA(n_components=107) #retain 95% variance
        X_pca = pca.fit_transform(X)
        return X_pca

The pipeline_framework is used to create a framework prototype for one hot encoding. The train set and test set contain different categorical values, so if we one hot encode the two datasets directly without a common reference, we get datasets with different columns, which we cannot pass to our models.
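For comparison, and not as the approach used in this notebook, sklearn's `OneHotEncoder` can address the same column-mismatch problem: it learns the categories from the train set, and `handle_unknown="ignore"` encodes unseen test categories as all zeros. A minimal sketch on made-up data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column; "Dirt" only exists in the toy test set
toy_train_cat = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
toy_test_cat = pd.DataFrame({"Street": ["Pave", "Dirt"]})

toy_encoder = OneHotEncoder(handle_unknown="ignore")
# Categories are learned from train only, then reused for test
toy_train_encoded = toy_encoder.fit_transform(toy_train_cat).toarray()
toy_test_encoded = toy_encoder.transform(toy_test_cat).toarray()

print(toy_encoder.categories_)
print(toy_test_encoded)
```

Both encoded arrays have one column per train category, and the unseen "Dirt" row is all zeros.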

In [None]:
pipeline = Pipeline([
        ('dropfeatures', FeatureDropper()),
        ('cat_trans', categorical_transformation()),
        ('impute', Imputer()),
        ('log_feature', log_skewed()),
        ('scaler', normalize()),
        ('Encoding', FeatureEncode()),
])
pipeline_framework = Pipeline([
        ('dropfeatures', FeatureDropper()),
        ('cat_trans', categorical_transformation()),
        ('impute', Imputer()),
])

framework = pipeline_framework.fit_transform(train)
cat_col = framework.select_dtypes(include='O').columns
for col in cat_col:
    framework[col] = framework[col].apply(lambda x: str(x) + str(col))
    
X_preprocessed_train = pipeline.fit_transform(train)
X_preprocessed_test = pipeline.fit_transform(test)

In [None]:
X_preprocessed_train.head()

In [None]:
X_preprocessed_test.head()

# Modelling

In [None]:
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X_preprocessed_train, y, 
                                                    test_size=0.2, random_state=0)

def rmse_cv(model):
    cv_score = cross_val_score(model, X_preprocessed_train, y, cv=5, 
                               scoring='neg_mean_squared_error')
    rmse = np.sqrt(-cv_score).mean()
    return rmse

First we start off with models at their default hyperparameters; we will do hyperparameter tuning with GridSearchCV in the next section.

In [None]:
models_raw = {
    'Ridge': RidgeCV(),
    'Lasso': LassoCV(alphas=np.logspace(-4,4,9)),
    'ElasticNet': ElasticNetCV(),
    'XGBoost': XGBRegressor(),
    'SVR': SVR()
}


for name, model in models_raw.items():
    print(f"{name} RMSE: {rmse_cv(model)}")

Linear regressors appear to perform best at this point, but this may be largely because more complex models like XGBoost usually improve a lot after hyperparameter tuning.

# Hyperparameter Tuning

Note that I commented the whole script out since it takes a very long time to train the models. I include the results as a comment at the bottom if you are interested.

In [None]:
param_grids = {
    'Ridge': {
        'fit_intercept': [True, False],
        'alphas': [[0.1, 1.0, 10.0], [0.01, 0.1, 1.0], [0.001, 0.01, 0.1]]
    },
    'Lasso': {
        'fit_intercept': [True, False],
    },
    'ElasticNet': {
        'fit_intercept': [True, False],
        'l1_ratio': [0.1, 0.5, 0.9],
        'alphas': [[0.1, 1.0, 10.0], [0.01, 0.1, 1.0], [0.001, 0.01, 0.1]]
    },
    'XGBoost': {
        'n_estimators': [500, 1000],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 8],
        'reg_alpha': [0.0, 0.2],
        'reg_lambda': [0.0, 0.2],
        'min_child_weight': [1, 3, 5],
    },
    'SVR': {
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'C': [0.1, 1, 10],
        'epsilon': [0.01, 0.1, 1],
        'gamma': ['scale', 'auto'],
    }
    
}

In [None]:
from sklearn.model_selection import GridSearchCV

grids = {}
cv = 3

for model_name, model in models_raw.items():
    
    grids[model_name] = GridSearchCV(estimator=model, 
                                     param_grid=param_grids[model_name], 
                                     cv=cv, 
                                     scoring='neg_mean_squared_error', 
                                     n_jobs=-1)
    grids[model_name].fit(X_preprocessed_train, y)
    best_params = grids[model_name].best_params_
    best_score = np.sqrt(-1 * grids[model_name].best_score_)
    
    print(f'Best parameters for {model_name}: {best_params}')
    print(f'Best RMSE for {model_name}: {best_score}\n')
    
"""
Best parameters for Ridge: {'alphas': [0.1, 1.0, 10.0], 'fit_intercept': True}
Best RMSE for Ridge: 0.11655349784666286

Best parameters for Lasso: {'fit_intercept': True}
Best RMSE for Lasso: 0.1171914969370188

Best parameters for ElasticNet: {'alphas': [0.001, 0.01, 0.1], 'fit_intercept': True, 'l1_ratio': 0.5}
Best RMSE for ElasticNet: 0.11489623589933039

Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 1000, 'reg_alpha': 0.0, 'reg_lambda': 0.2}
Best RMSE for XGBoost: 0.12194649951014798

Best parameters for SVR: {'C': 10, 'epsilon': 0.01, 'gamma': 'auto', 'kernel': 'rbf'}
Best RMSE for SVR: 0.11591388647499179"""

Based on these results, we will run a second round of hyperparameter tuning to deal with edge cases, where the best value sat at the boundary of its grid. It's basically the same script with different param_grids.
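The coarse-then-fine idea can be sketched on synthetic data as follows. The estimator, the grids and the `toy_search` helper are assumptions chosen to keep the example fast; they are not the grids used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
# Aliased so it does not clash with the `Ridge` variable defined later in this notebook
from sklearn.linear_model import Ridge as SkRidge
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

def toy_search(alphas):
    """Hypothetical helper: grid-search alpha and return (best alpha, best RMSE)."""
    grid = GridSearchCV(SkRidge(), {"alpha": alphas}, cv=3,
                        scoring="neg_mean_squared_error")
    grid.fit(X_toy, y_toy)
    return grid.best_params_["alpha"], np.sqrt(-grid.best_score_)

# Round 1: coarse grid spanning several orders of magnitude
coarse_alpha, coarse_rmse = toy_search([0.01, 1.0, 100.0])
# Round 2: finer grid centred on the round-1 winner
fine_alpha, fine_rmse = toy_search([coarse_alpha / 3, coarse_alpha, coarse_alpha * 3])

print(f"coarse: alpha={coarse_alpha}, RMSE={coarse_rmse:.3f}")
print(f"fine:   alpha={fine_alpha}, RMSE={fine_rmse:.3f}")
```

Because the second grid still contains the first-round winner, the refined score can only match or improve on the coarse one under the same cross-validation splits.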

In [None]:
param_grids = {
    'Ridge': {
        'fit_intercept': [True],
        'alphas': [[0.1, 1.0, 10.0]],
    },
    'Lasso': {
        'fit_intercept': [True],
        'max_iter': [500, 800, 1000],

    },
    'ElasticNet': {
        'fit_intercept': [True],
        'l1_ratio': [0.01, 0.05, 0.1],
        'alphas': [[0.001, 0.01, 0.1]],

    },
    'XGBoost': {
        'n_estimators': [1000],
        'learning_rate': [0.05, 0.1, 0.2],
        'max_depth': [3, 4, 5],
        'reg_alpha': [0.0],
        'reg_lambda': [0.1, 0.2, 0.5],
        'min_child_weight': [3],
    },
    'SVR': {
        'kernel': ['rbf'],
        'C': [8, 10, 15, 20],
        'epsilon': [0.01, 0.05],
        'gamma': ['auto']
    }
    
}

grids = {}
cv = 3

for model_name, model in models_raw.items():
    
    grids[model_name] = GridSearchCV(estimator=model, 
                                     param_grid=param_grids[model_name], 
                                     cv=cv, 
                                     scoring='neg_mean_squared_error', 
                                     n_jobs=-1)
    grids[model_name].fit(X_preprocessed_train, y)
    best_params = grids[model_name].best_params_
    best_score = np.sqrt(-1 * grids[model_name].best_score_)
    
    print(f'Best parameters for {model_name}: {best_params}')
    print(f'Best RMSE for {model_name}: {best_score}\n')
    
"""
Best parameters for Ridge: {'alphas': [0.1, 1.0, 10.0], 'fit_intercept': True}
Best RMSE for Ridge: 0.11655349784666286

Best parameters for Lasso: {'fit_intercept': True, 'max_iter': 500}
Best RMSE for Lasso: 0.1171914969370188

Best parameters for ElasticNet: {'alphas': [0.001, 0.01, 0.1], 'fit_intercept': True, 'l1_ratio': 0.1}
Best RMSE for ElasticNet: 0.11558577029260873

Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 1000, 'reg_alpha': 0.0, 'reg_lambda': 0.2}
Best RMSE for XGBoost: 0.12194649951014798

Best parameters for SVR: {'C': 8, 'epsilon': 0.01, 'gamma': 'auto', 'kernel': 'rbf'}
Best RMSE for SVR: 0.11511200323011732
"""

In [None]:
from sklearn.linear_model import LinearRegression #missing from the imports above
LinearRegression = LinearRegression()
Ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], fit_intercept=True)
Lasso = LassoCV(alphas=np.logspace(-4,4,9), fit_intercept=True, max_iter=500)
ElasticNet = ElasticNetCV(alphas=[0.001, 0.01, 0.1], fit_intercept=True, l1_ratio=0.1)
XGBoost = XGBRegressor(learning_rate=0.1, max_depth=3, min_child_weight=3, n_estimators=1000, reg_lambda=0.2)
SVR = SVR(C=8, epsilon=0.01, gamma='auto', kernel='rbf')

In [None]:
base_models = {
    'Ridge': Ridge,
    'Lasso': Lasso,
    'ElasticNet': ElasticNet,
    'XGBoost': XGBoost,
    'SVR': SVR,
}

for model_name, model in base_models.items():
    model.fit(X_train, y_train)
    model_cvscore = np.sqrt(-cross_val_score(model, X_preprocessed_train, y, cv=5, scoring='neg_mean_squared_error')).mean()
    print(f"{model_name} score: {model_cvscore}")

# Model Ensembling

This section ensembles the above models together, using stacking and voting.

Ensembling is a very powerful technique in machine learning: we use it to combine the predictions of different models to get a more robust outcome, which usually gives better performance in general.
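As a small sanity check of what voting does for regression, the sketch below shows that an unweighted `VotingRegressor` simply averages the predictions of its fitted base models. The synthetic data and the two base models are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

toy_voter = VotingRegressor(estimators=[
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
toy_voter.fit(X_toy, y_toy)

# Average the predictions of the fitted base models by hand
manual_average = np.mean([est.predict(X_toy) for est in toy_voter.estimators_], axis=0)
voter_pred = toy_voter.predict(X_toy)

print(np.allclose(voter_pred, manual_average))
```

Stacking differs in that a final estimator learns how to combine the base predictions instead of taking a plain average.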

## Voting

In [None]:
from sklearn.ensemble import VotingRegressor

# Create a voting regressor (Ridge uses an L2 penalty, Lasso an L1 penalty)
voting_model = VotingRegressor(estimators=[('L2', Ridge), 
                                          ('L1', Lasso),
                                          ('EN', ElasticNet),
                                          ('xgb', XGBoost),
                                          ('svr', SVR)])

voting_model.fit(X_preprocessed_train, y)
rmse_voting = np.sqrt(-cross_val_score(voting_model, X_preprocessed_train, y, cv=5, scoring='neg_mean_squared_error')).mean()
print(f"VotingRegressor RMSE: {rmse_voting}")
#RMSE: 0.10977104827444395

In [None]:
voting_model2 = VotingRegressor(estimators=[('L2', Ridge), 
                                          ('L1', Lasso),
                                          ('EN', ElasticNet),])

voting_model2.fit(X_preprocessed_train, y)
rmse_voting2 = np.sqrt(-cross_val_score(voting_model2, X_preprocessed_train, y, cv=5, scoring='neg_mean_squared_error')).mean()
print(f"VotingRegressor2 RMSE: {rmse_voting2}")
#RMSE: 0.11280943325732835

## Stacking

In [None]:
from sklearn.ensemble import StackingRegressor

base_regressors = [('L2', Ridge), 
                   ('L1', Lasso),
                   ('EN', ElasticNet),
                   ('xgb', XGBoost),
                   ('svr', SVR)]


ensemble = StackingRegressor(
    estimators=base_regressors,
    final_estimator=XGBoost
)
ensemble.fit(X_preprocessed_train,y)

scores = cross_val_score(
    ensemble, X_preprocessed_train, y, cv=5, scoring='neg_mean_squared_error'
)

print("Mean score:", np.sqrt(-scores).mean())
print("Standard deviation:", np.sqrt(-scores).std())

# Mean score: 0.13003926654506529
# Standard deviation: 0.006159681441211842

# Submission

Now we are going to make our submissions. I will submit all the baseline models along with the three ensembled models to see the difference between them. We pass our test dataframe through our pipeline and make our predictions. This shows one of the reasons why we use a pipeline: we can easily reuse it on the test set, or make modifications to it later, which makes everything more convenient.

In [None]:
final_models = {
    'Ridge': Ridge,
    'Lasso': Lasso,
    'ElasticNet': ElasticNet,
    'XGBoost': XGBoost,
    'SVR': SVR,
    'Voting1': voting_model,
    'Voting2': voting_model2,
    'Stack': ensemble
}
for model_name, model in final_models.items():
    submission_df = pd.DataFrame({'Id': test_id, 'SalePrice': np.exp(model.predict(X_preprocessed_test))})
    submission_name = 'final_submission_' + model_name + '.csv'
    submission_df.to_csv(submission_name, index=False)

The score of each submission is as follows:

Voting1: 0.13708 

Voting2: 0.13798

Ridge: 0.13957

Lasso: 0.14043

SVR: 0.14772

ElasticNet: 0.14782

XGBoost: 0.14825

Stack: 0.15377

The overall result is fairly alright, standing at around 30% on the leaderboard. You can further improve the model performance by training additional models and, most importantly, by adding useful features based on domain knowledge, which is something I didn't do much of in this notebook.

If you are still here, feel free to leave comments on how to improve the model or on anything I did wrong in this notebook. Your help and support are very much appreciated.