# **DIVE INTO CODE COURSE**
## **Sprint Machine Learning Ensemble Learning**
**Student Name**: Doan Anh Tien<br>
**Student ID**: 1852789<br>
**Email**: tien.doan.g0pr0@hcmut.edu.vn

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import math
import seaborn as sns
import random

In [None]:
! pip install -q kaggle

In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"atien228","key":"ddfcb28e1466be0be321d2cc7b11510f"}'}

In [5]:
! cp kaggle.json ~/.kaggle/

In [6]:
! chmod 600 ~/.kaggle/kaggle.json

In [7]:
! kaggle competitions download -c house-prices-advanced-regression-techniques

Downloading data_description.txt to /content
  0% 0.00/13.1k [00:00<?, ?B/s]
100% 13.1k/13.1k [00:00<00:00, 25.1MB/s]
Downloading test.csv to /content
  0% 0.00/441k [00:00<?, ?B/s]
100% 441k/441k [00:00<00:00, 61.4MB/s]
Downloading sample_submission.csv to /content
  0% 0.00/31.2k [00:00<?, ?B/s]
100% 31.2k/31.2k [00:00<00:00, 26.8MB/s]
Downloading train.csv to /content
  0% 0.00/450k [00:00<?, ?B/s]
100% 450k/450k [00:00<00:00, 29.8MB/s]


---

### **[Problem 1] Blending scratch mounting**

**1. First feature selection**

**Dataset Preparation**

In [8]:
df_train = pd.read_csv('/content/train.csv')
df_test = pd.read_csv('/content/test.csv')
print("Train dataset -- Rows: {}, Columns: {}".format(df_train.shape[0], df_train.shape[1]))
print("Test dataset -- Rows: {}, Columns: {}".format(df_test.shape[0], df_test.shape[1]))

Train dataset -- Rows: 1460, Columns: 81
Test dataset -- Rows: 1459, Columns: 80


In [39]:
target = df_train['SalePrice']

In [40]:
df_train_f = df_train[['GrLivArea', 'YearBuilt']]

In [41]:
from sklearn.model_selection import train_test_split
X_house_train, X_house_test, y_house_train, y_house_test = train_test_split(df_train_f, target, train_size=0.7, random_state=1)

In [42]:
print("Train dataset: {}".format(X_house_train.shape))
print("Test dataset: {}".format(X_house_test.shape))

Train dataset: (1021, 2)
Test dataset: (439, 2)


In [43]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_house_train = sc.fit_transform(X_house_train)
X_house_test = sc.fit_transform(X_house_test)

The target of the dataset is to predict the price of house based on provided features. At this point, using Regression technique is the most suitable approach. We will first implement the the standalone Regression model, using Linear Regression in order to compared with the ensemble model.

For each run, we will use different features for accuracy metrics comparision. The ensemble model works from the point where the all of the estimator will train the data 

In [22]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from numpy import hstack

**Standalone model**

In [45]:
def retrieve_models():
    models = {}
    models['Linear Regression'] = LinearRegression()
    models['KNeighbors'] = KNeighborsRegressor()
    models['Decision Tree'] = DecisionTreeRegressor()
    models['SVM'] = SVR()

    return models

In [46]:
standalone_models = retrieve_models()

In [47]:
standalone_mse = {}

for name, model in standalone_models.items():
    model.fit(X_house_train, y_house_train)
    y_pred = model.predict(X_house_test)
    mse = mean_squared_error(y_house_test, y_pred)
    standalone_mse[name] = mse
    print("{} MSE: {}".format(name, mse))

Linear Regression MSE: 2045601786.2250392
KNeighbors MSE: 1941208867.6051934
Decision Tree MSE: 3859759801.8213115
SVM MSE: 7377896914.681474


**Ensemble model**

We will create a class for ensemble learning. Specifically, I would like to implement the ensemble model using base form as a combination of these three models: Decision Tree Regressor, KNeighbors Regressor and Support Vector Regressor.

In [48]:
class BlendModel():
    def __init__(self, models, weights):
        self.models = models
        self.weights = weights
        
    def fit(self, X_train, y_train):
        self.models_copy = [x for x in self.models.values()]
        
        for model in self.models_copy:
            model.fit(X_train, y_train)
    
    def predict(self, X_test):
        y_pred = np.column_stack([
            model.predict(X_test) for model in self.models_copy
        ])
        return np.sum(y_pred*self.weights, axis=1)

**Training**

In [49]:
base_list = retrieve_models()

In [50]:
print("The base model used: \n{}".format(list(base_list.keys())))

The base model used: 
['Linear Regression', 'KNeighbors', 'Decision Tree', 'SVM']


We will assign the weights for Linear Regression (0.35), KNeighbors Regression (0.4), Decision Tree Regression (0.2) and SVR (0.05) based on their standalone performance with the MSE metrics.

In [51]:
blend_model = BlendModel(base_list, [0.35, 0.4, 0.2, 0.05])
blend_model.fit(X_house_train, y_house_train)
blend_pred = blend_model.predict(X_house_test)
blend_mse = mean_squared_error(y_house_test, blend_pred)

print("Blending MSE: {}".format(blend_mse))

Blending MSE: 1918774926.4734585


**Comparing the MSE ratio**

The ratio between blended predictions and standalone predictions, the less the ratio is, the higher accuracy the blend model performed compared to the corresponding standalone model.

In [52]:
for key, values in standalone_mse.items():
    print("Blended with {}: {:.2f}".format(key, blend_mse/values))

Blended with Linear Regression: 0.94
Blended with KNeighbors: 0.99
Blended with Decision Tree: 0.50
Blended with SVM: 0.26


As a result, the blended model performs well compared to the other single models, but only work slightly better than LR and KNeighbors.

**2. Second feature selection**

In [53]:
def blend_and_compare(model_list, X_train, y_train, X_test, y_test):

    print("------- STANDALONE MODELS -------\n")
    standalone_models = model_list
    standalone_mse = {}
  
    for name, model in standalone_models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        standalone_mse[name] = mse
        print("{} MSE: {}".format(name, mse))


    print("\n------- BLENDING MODELS -------\n")
    base_list = model_list
    print("The base model used: \n{}".format(list(base_list.keys())))
    blend_model = BlendModel(base_list, [0.35, 0.4, 0.2, 0.05])
    blend_model.fit(X_train, y_train)
    blend_pred = blend_model.predict(X_test)
    blend_mse = mean_squared_error(y_test, blend_pred)
    print("Blending MSE: {}".format(blend_mse))


    print("\n------- COMPARING MODELS -------\n")
    for key, values in standalone_mse.items():
      print("Blended with {}: {:.2f}".format(key, blend_mse/values))

In the second run, we will choose the linear feet of streeted connected to property (LotFrontage) and Lot size in square feet (LotArea) as features.

In [54]:
df_train_f = df_train[['LotFrontage', 'LotArea']].copy()

In [55]:
print("Total missing value BEFORE imputing: {}".format(df_train_f.isna().sum().sum()))

Total missing value BEFORE imputing: 259


As there will be some missing values in our features, we will use impute methods for preprocessing.

In [56]:
from sklearn.impute import SimpleImputer

def fill_missing_value(df):
    new_df = df
    imr = SimpleImputer(missing_values=np.NaN, strategy='median')
    for columns in new_df.columns[0:]:
        imr_all = imr.fit(new_df[[columns]])
        new_df[columns] = imr_all.transform(new_df[[columns]]).ravel()

    return new_df

In [57]:
df_train_f = fill_missing_value(df_train_f)
print("Total missing value AFTER imputing: {}".format(df_train_f.isna().sum().sum()))

Total missing value AFTER imputing: 0


In [58]:
df_train_f.head(10)

Unnamed: 0,LotFrontage,LotArea
0,65.0,8450.0
1,80.0,9600.0
2,68.0,11250.0
3,60.0,9550.0
4,84.0,14260.0
5,85.0,14115.0
6,75.0,10084.0
7,69.0,10382.0
8,51.0,6120.0
9,50.0,7420.0


**Tranining**

In [59]:
X_house_train, X_house_test, y_house_train, y_house_test = train_test_split(df_train_f, target, train_size=0.7, random_state=1)

In [60]:
base_list = retrieve_models()
blend_and_compare(base_list, X_house_train, y_house_train, X_house_test, y_house_test)

------- STANDALONE MODELS -------

Linear Regression MSE: 6045602962.295516
KNeighbors MSE: 6225423409.447107
Decision Tree MSE: 9283787773.034727
SVM MSE: 7400201187.189829

------- BLENDING MODELS -------

The base model used: 
['Linear Regression', 'KNeighbors', 'Decision Tree', 'SVM']
Blending MSE: 5723811083.468643

------- COMPARING MODELS -------

Blended with Linear Regression: 0.95
Blended with KNeighbors: 0.92
Blended with Decision Tree: 0.62
Blended with SVM: 0.77


Similar to the first run, the blended model performs well compared to the other single models, but in this case, the ratio between its performance with Decision and SVM has been increase. This also show that changing the features may affect the behavior of our single model as well as the blended model.

**3. Third feature selection**

In the second run, we will total square feet of basement area (TotalBsmtSF) and square feet of size of garage (GarageArea) as features.

In [61]:
df_train_f = df_train[['TotalBsmtSF', 'GarageArea']].copy()

In [62]:
print("Total missing value BEFORE imputing: {}".format(df_train_f.isna().sum().sum()))

Total missing value BEFORE imputing: 0


**Tranining**

In [63]:
X_house_train, X_house_test, y_house_train, y_house_test = train_test_split(df_train_f, target, train_size=0.7, random_state=1)

In [64]:
base_list = retrieve_models()
blend_and_compare(base_list, X_house_train, y_house_train, X_house_test, y_house_test)

------- STANDALONE MODELS -------

Linear Regression MSE: 3169303691.2070436
KNeighbors MSE: 2813904177.7951703
Decision Tree MSE: 4276863547.11997
SVM MSE: 7387955691.922731

------- BLENDING MODELS -------

The base model used: 
['Linear Regression', 'KNeighbors', 'Decision Tree', 'SVM']
Blending MSE: 2798181637.8131413

------- COMPARING MODELS -------

Blended with Linear Regression: 0.88
Blended with KNeighbors: 0.99
Blended with Decision Tree: 0.65
Blended with SVM: 0.38


Lastly, the blended model still performs well compared to the other single models, and it can be seen that KNeighbors has the best consistency in terms of performance throughout 3 testing with different features.

### **[Problem 2] Scratch mounting of bagging**

**Dataset Preparation**

In [93]:
df_train_f = df_train[['GrLivArea', 'YearBuilt']]

In [95]:
from sklearn.model_selection import train_test_split
X_house_train, X_house_test, y_house_train, y_house_test = train_test_split(df_train_f, target, train_size=0.7, random_state=1)

In [96]:
print("Train dataset: {}".format(X_house_train.shape))
print("Test dataset: {}".format(X_house_test.shape))

Train dataset: (1021, 2)
Test dataset: (439, 2)


In [97]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_house_train = sc.fit_transform(X_house_train)
X_house_test = sc.fit_transform(X_house_test)

**Standalone Decision Tree Regressor**

In [199]:
model_rt = DecisionTreeRegressor()

In [200]:
model_rt.fit(X_house_train, y_house_train)
y_pred = model_rt.predict(X_house_test)
mse = mean_squared_error(y_house_test, y_pred)
print("Decision Tree Regressor MSE: {}".format(mse))

Decision Tree Regressor MSE: 3636547999.3862314


**Bagging Decision Tree Regressor**

In [195]:
class Bagging():

    def __init__(self, bootstrap, max_depth=100, seed=None):
        self.b = bootstrap
        self.max_d = max_depth
        self.seed = seed
        self.submodel = []

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
        self.m = X_train.shape[0] # M samples
        self.n = X_train.shape[1] # N features
        
        np.random.seed(self.seed)

        # Bootstrap samples
        for b in range(self.b):
            subsamp = np.random.choice(np.arange(self.m), size=self.m, replace=True)
            X_train_b = X_train[subsamp]
            y_train_b = y_train.to_numpy()[subsamp]

            subtree = DecisionTreeRegressor()
            subtree.fit(X_train_b, y_train_b)
            self.submodel.append(subtree)

    def predict(self, X_test):

        pred_avg = np.empty((len(self.submodel), len(X_test)))
        for i, model in enumerate(self.submodel):
            pred_avg[i] = model.predict(X_test)

        return pred_avg.mean(0) 

**Training**

In [198]:
n_bootstrap = 8

bagger = Bagging(n_bootstrap, seed=228)
bagger.fit(X_house_train, y_house_train)
y_pred_bagger = bagger.predict(X_house_test)
mse_bagger = mean_squared_error(y_house_test, y_pred_bagger)
print("Decision Tree Regressor MSE Bagger: {}".format(mse_bagger))

Decision Tree Regressor MSE Bagger: 2250479608.951358


**Comparing the MSE ratio**

The ratio between blended predictions and standalone predictions, the less the ratio is, the higher accuracy the bagging model performed compared to the corresponding standalone model.

In [201]:
print("Single with Bagging ratio: {:.2f}".format(mse_bagger/mse))

Single with Bagging ratio: 0.62


We can see that the ratio is 0.62 which mean the bagging model works very well in terms of decreasing the MSE.

### **[Problem 3] Scratch mounting of stacking**

**Level 0 Layer**

In the first level, we will implement a stacked layer consiting of multiple models with its best hyperparameters. Since getting hyperparameters is not a glance of eye, using GridSearch in the first step might help us to ultilize the training performance.

And after using their best hypermeters, we can achieve such results providing insight about their performance, that is, the sign to choose the best model to be a final learner in this stacking architecture.

In [228]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score

In [27]:
def best_hyperparameters(X_train, X_test, y_train, y_test,
                         models, param_grid, cv=10, score='neg_mean_squared_error',
                         scoring_test=r2_score):
  
    grid = GridSearchCV(estimator=models,
                        param_grid=param_grid,
                        scoring=score,
                        n_jobs=-1,
                        cv=cv,
                        verbose=2)
    
    fit_model = grid.fit(X_train, y_train)
    best_model = fit_model.best_estimator_

    pred = fit_model.predict(X_test)

    score = scoring_test(y_test, pred)

    return [best_model, score, score]

In [282]:
models = [KNeighborsRegressor(), DecisionTreeRegressor(), SVR(), RandomForestRegressor(), XGBRegressor(), LGBMRegressor()]

grid_hyperparamters = [{'n_neighbors': [3,5,7,10],
                        'weights': ['uniform', 'distance'],
                        'leaf_size': [20, 30, 40],
                        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
                        },
                       {'criterion': ['mse', 'friedman_mse', 'mae', 'poisson'],
                        'splitter': ['best', 'random'],
                        'max_depth': [5, 10, 15, 20]
                        },
                       {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
                        'epsilon': [0.01, 0.05, 0.1],
                        'tol': [1e-4, 1e-3, 1e-2]
                        },
                       {'max_depth':[3, 5, 10, 13], 
                        'n_estimators':[200, 400, 600],
                        'max_features':[2, 4, 6, 8]
                       },
                       {'n_estimators': [400, 700, 1000],
                        'colsample_bytree': [0.7, 0.8],
                        'max_depth': [15,20,25],
                        'reg_alpha': [1.1, 1.2, 1.3],
                        'reg_lambda': [1.1, 1.2, 1.3],
                        'subsample': [0.7, 0.8, 0.9]
                       },
                       {'n_estimators': [400, 700, 1000],
                        'learning_rate': [0.12],
                        'colsample_bytree': [0.7, 0.8],
                        'max_depth': [4],
                        'num_leaves': [10, 20],
                        'reg_alpha': [1.1, 1.2],
                        'reg_lambda': [1.1, 1.2],
                        'min_split_gain': [0.3, 0.4],
                        'subsample': [0.8, 0.9],
                        'subsample_freq': [10, 20]
                       }
]

In [230]:
model_score = []

for i, model in enumerate(models):
    param_grid = grid_hyperparamters[i]
    result = best_hyperparameters(X_house_train, X_house_test, y_house_train, y_house_test,
                                  model, param_grid, cv=5)
    
    model_score.append(result)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 420 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:    1.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Done 127 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:    0.8s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Done 128 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    7.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   11.7s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:   46.7s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:  1.2min finished


Fitting 5 folds for each of 486 candidates, totalling 2430 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    9.6s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:   44.1s
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 644 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 1009 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done 1454 tasks      | elapsed: 11.5min
[Parallel(n_jobs=-1)]: Done 1981 tasks      | elapsed: 16.1min
[Parallel(n_jobs=-1)]: Done 2430 out of 2430 | elapsed: 20.2min finished


Fitting 5 folds for each of 384 candidates, totalling 1920 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 158 tasks      | elapsed:    9.6s
[Parallel(n_jobs=-1)]: Done 361 tasks      | elapsed:   31.8s
[Parallel(n_jobs=-1)]: Done 644 tasks      | elapsed:   59.0s
[Parallel(n_jobs=-1)]: Done 1009 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 1454 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1920 out of 1920 | elapsed:  3.3min finished


In [231]:
for result in model_score:
    print('Model: {0}, Score: {1}'.format(type(result[0]).__name__, result[2]))

Model: KNeighborsRegressor, Score: 0.745664126198454
Model: DecisionTreeRegressor, Score: 0.7287195514362808
Model: SVR, Score: -0.017919347470168878
Model: RandomForestRegressor, Score: 0.7364874549797271
Model: XGBRegressor, Score: 0.49332315189520126
Model: LGBMRegressor, Score: 0.7018765195892105


All the score did very well except for SVR, therefore, we will remove it from our Level 1 estimators list.

In the end, we will use **Linear Regression** as our final estimator in the Level 1 layer.

**Stacking Model ( + Level 1)**

In [233]:
from sklearn.model_selection import KFold 

In [207]:
class Stacking():

    def __init__(self, models, final_model, K):
        self.level0 = models
        self.level1 = final_model
        self.K = K
        self.M = len(models)

    
    def fit(self, X_train, y_train, X_test, y_test):

        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test

        train_data = None
        test_data = None


        for model in self.level0:

            predictions = None
            batch_size = int(len(self.X_train)/self.K)

            for fold in range(self.K):

                if fold == (self.K - 1):
                    test = self.X_train[(batch_size * fold):,:]
                    batch_start = batch_size * fold
                    batch_finish = self.X_train.shape[0]
                else:
                    test = self.X_train[(batch_size * fold): (batch_size * (fold + 1)),:]
                    batch_start = batch_size * fold
                    batch_finish = batch_size * (fold + 1)
                
                fold_X_test = self.X_train[batch_start:batch_finish,:]
                fold_X_train = self.X_train[[index for index in range(self.X_train.shape[0]) if index not in range(batch_start, batch_finish)],:]

                fold_y_test = self.y_train.to_numpy()[batch_start:batch_finish]
                fold_y_train = self.y_train.to_numpy()[[index for index in range(self.X_train.shape[0]) if index not in range(batch_start, batch_finish)]]

                # Fit current classifier
                model.fit(fold_X_train, fold_y_train)
                fold_y_pred = model.predict(fold_X_test)

                # Store predictions for each fold_x_test
                if isinstance(predictions, np.ndarray):
                    predictions = np.concatenate((predictions, fold_y_pred))
                else:
                    predictions = fold_y_pred


            test_pred_values = self.level0_proc(model)

            if isinstance(train_data, np.ndarray):
                train_data = np.vstack((train_data, predictions))
            else:
                train_data = predictions

            if isinstance(test_data, np.ndarray):
                test_data = np.vstack((test_data, test_pred_values))
            else:
                test_data = test_pred_values


        train_data = train_data.T
        test_data = test_data.T

        self.level1_proc(self.level1, train_data, test_data)


    def level0_proc(self, model):
        # Train the data on level 1 layer with M models
        model.fit(self.X_train, self.y_train)
        y_pred = model.predict(self.X_test)

        return y_pred

    def level1_proc(self, final_model, train_data, test_data):
        # Train the data on the last estimator
        self.final_model = final_model.fit(train_data, self.y_train)
        train_pred = self.final_model.predict(train_data)
        test_pred = self.final_model.predict(test_data)
        print("Train MSE: {}".format(mean_squared_error(train_pred, self.y_train)))
        print("Test MSE: {}".format(mean_squared_error(test_pred, self.y_test)))

In [286]:
models_copy = models

In [295]:
# Pop out SVR model since its score is very bad
models_copy.pop(2)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [299]:
stack_model = Stacking(models_copy, LinearRegression(), 3)

In [300]:
stack_model.fit(X_house_train, y_house_train, X_house_test, y_house_test)

Train MSE: 1731556654.8236802
Test MSE: 1805118265.6692512


**Standalone Linear Regression**

In [292]:
model_lr = LinearRegression()

In [293]:
model_lr.fit(X_house_train, y_house_train)
y_pred = model_lr.predict(X_house_test)
mse_lr = mean_squared_error(y_house_test, y_pred)
print("Linear Regression MSE: {}".format(mse_lr))

Linear Regression MSE: 2045601786.2250392


As we can see that our stack model have a better result in terms of MSE reduction.