# BAGGING

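Bagging (bootstrap aggregating) improves the performance of many base algorithms by making their predictions more robust: several copies of a model are trained on random resamples of the training data and their predictions are averaged.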

We will use Linear Regression and Decision Tree Regressor models as our base models for Bagging.

We will use a housing price dataset (regression problem) with 13 features and 506 data samples.
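
Before working with the real dataset, the idea can be sketched on toy data. The block below is a minimal illustration, not the assignment's method: the synthetic data, the single-feature least-squares fit and the names `fit_line` and `bagged_predict` are all assumptions made for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic 1-D regression data (illustrative assumption): y = 2x + 1 + noise
xs = [i / 5 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

def fit_line(x, y):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def bagged_predict(x_train, y_train, x_new, n_models=25):
    """Average the predictions of n_models lines, each fitted on a bootstrap sample."""
    n = len(x_train)
    total = [0.0] * len(x_new)
    for _ in range(n_models):
        # Bootstrap sample: n row indices drawn with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        slope, intercept = fit_line([x_train[i] for i in idx],
                                    [y_train[i] for i in idx])
        for j, x in enumerate(x_new):
            total[j] += slope * x + intercept
    return [t / n_models for t in total]

x_new = [0.0, 5.0, 10.0]
bagged = bagged_predict(xs, ys, x_new)
print(bagged)  # should land near the noise-free values 1, 11 and 21
```

The rest of the notebook follows the same two steps (resample, then average), with sklearn models in place of `fit_line`.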

## IMPORTING PACKAGES

The important packages have been imported for you.

In [None]:
# IMPORTING IMPORTANT PACKAGES
# RUN THE CELL AS IT IS.
# DO NOT CHANGE THIS CELL!

import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt
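# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; this notebook needs an older version.
from sklearn.datasets import load_boston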
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
ans = [0]*7

## RANDOM NUMBER/LIST GENERATOR

We will be using this function to generate a list of random numbers to get consistent results which are important for evaluation. While doing your own projects and work you can use libraries like random and numpy.

A brief description of the function:

_random_generator(seed, low, high, size)_

**seed** = The same seed always generates the same random numbers; a different seed generates a different list of random numbers.

**low** = Lower limit of the range in which to generate random numbers. (**INCLUSIVE**)

**high** = Upper limit of the range in which to generate random numbers. (**EXCLUSIVE**)

**size** = Number of random numbers to generate. If size = 1, then one scalar number is returned. If size>1 then a list of random numbers is generated.

The same applies to _unique_random_generator_, which, unlike _random_generator_, returns a list of unique random numbers.

In [None]:
def random_generator(seed = 0, low = 0, high = None, size = None):
    s = seed
    a = 11
    b = 13

    if high is None:
        return ("Error. Upper Limit not found")
    if size is None:
        return ("Error. Size not found")
    if size == 1:
        return ((a*s+b)%high)
    random_list = []
    for i in range(size):
        random_list.append((a*s+b)%high)
        s = (a*s+b)
    return random_list
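
As a quick illustration of the properties described above, the sketch below re-implements the same recurrence under the hypothetical name `demo_generator` (it is not part of the assignment). It checks that a fixed seed always yields the same list, and that 404 draws from the range [0, 404) contain repeated indices.

```python
def demo_generator(seed, high, size):
    """Illustrative copy of the recurrence s -> 11*s + 13 used by random_generator."""
    s, values = seed, []
    for _ in range(size):
        s = 11 * s + 13
        values.append(s % high)
    return values

first_four = demo_generator(1, 404, 4)
print(first_four)  # → [24, 277, 232, 141]

idx = demo_generator(1, 404, 404)
same_again = demo_generator(1, 404, 404)
print(idx == same_again)             # same seed, same list
print(len(idx) - len(set(idx)) > 0)  # some row indices occur more than once
```

Repeated indices are exactly what "sampling with replacement" produces: some training rows appear several times in a resample and others not at all.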

## LOADING DATASET
This is a regression dataset.

The target label is housing price. There are 13 features all of which are numerical in nature.


In [None]:
# LOADING THE DATASET
# RUN THE CELL AS IT IS
# DO NOT CHANGE THIS CELL

# reg_dataset -> Regression Dataset : Boston Dataset with Housing price as the target and 13 Features related to the houses.
# There are no categorical variables.

def load_dataset():
    reg_x, reg_y = load_boston(return_X_y = True)
    reg_data = np.concatenate((reg_x, np.array(reg_y).reshape(-1, 1)), axis = 1)
    cols = ["feature"+str(i) for i in range(1, 14)]
    cols = cols + ["price(target)"]
    reg_dataset = pd.DataFrame(data = reg_data, columns = cols)

    return reg_dataset

reg_dataset = load_dataset()
print("REGRESSION DATASET : \n", reg_dataset)

## SPLITTING INTO TRAIN AND TEST SET

We will divide the dataset into train and test sets in the ratio 8:2.
Use sklearn's built-in function "train_test_split" for this.

eg:- train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = k, random_state = integer), where k is a floating point number between 0 and 1 giving the fraction of the dataset assigned to the test set.

**NOTE**: Use random_state = 40 for evaluation purposes

Confirm that your training and test sets are correctly allotted by checking the number of samples in each.
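
The expected sizes can be worked out by hand. The arithmetic below is a sketch that assumes the test count is rounded up when the fraction does not divide evenly; that assumption is consistent with the 404 training samples used in the questions that follow.

```python
import math

n_samples = 506   # size of the housing dataset
test_size = 0.2   # 8:2 split

# Assumption: the test set receives the rounded-up share, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # → 404 102
```

After splitting, comparing these numbers with the lengths of your own training and test sets is a quick sanity check.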

In [None]:
# DIVIDE THE DATASET INTO TRAINING AND TEST SET
# Use "train_test_split" from sklearn to split the dataset into 8:2 ratio.
# TRAINING SET SIZE : TEST SET SIZE = 8 : 2
# NOTE: USE "random_state = 40" WHILE SPLITTING. OTHERWISE EVALUATION MIGHT BE WRONG

# START YOUR CODE HERE

X = reg_dataset.iloc[:, :-1]    # all columns except the last one (the target)
y = reg_dataset.iloc[:, [-1]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 40)

# END YOUR CODE HERE

## QUESTIONS:
The first part of the assignment focuses on Linear Regression Models and the second part focuses on Decision Tree models.

Places where you need to write your code have been indicated. Some parameters have been fixed for evaluation purposes. Be careful not to change them.

#### **BAGGING WITH LINEAR REGRESSION MODELS**

### **QUESTION 1**: Fit a Linear Regression model on randomly sampled (with replacement) training data and assign the mean squared error of the predictions on the test set to ans[0]. (1 mark)
**NOTE**: You can use the mean squared error from sklearn which has been imported above for you.

**NOTE**: While randomly sampling we need to select 404 training samples but not all of them need to be unique.

**HINT**: mse = mean_squared_error(y_true, y_pred) where the mean squared error gets stored in the variable mse.

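**HINT**: df.iloc[[0, 2, 0, 4], :] selects the 1st, 3rd, 1st and 5th data samples (rows) from the df DataFrame; the same row may be selected more than once. (df.iloc[[0, 2, 0, 4], 1] would return only the second column of those rows.)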
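
To get a feel for how many of the 404 sampled rows are typically unique, here is a small sketch using Python's standard `random` module (an assumption for illustration; the assignment itself uses the fixed indices from `random_generator`). For n draws with replacement from n rows, the chance that a given row is picked at least once is 1 - (1 - 1/n)^n, which is close to 0.632 for large n.

```python
import random

n = 404
rng = random.Random(42)  # arbitrary seed, only for repeatability

# n draws with replacement from the row positions 0 .. n-1
idx = [rng.randrange(n) for _ in range(n)]

unique_fraction = len(set(idx)) / n
expected = 1 - (1 - 1 / n) ** n  # probability a given row appears at least once

print(round(unique_fraction, 3), round(expected, 3))
```

So a bootstrap sample of 404 rows usually contains only about 63% of the distinct training rows; the remainder are repeats.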

In [None]:
# SOME PART OF THE CODE HAS BEEN WRITTEN FOR YOU. DO NOT CHANGE THEM OTHERWISE IT MIGHT BE WRONGLY EVALUATED.
# THE DATA SAMPLES HAVE BEEN SELECTED FOR YOU FOR EVALUATION PURPOSES.


linreg1 = LinearRegression()                 # 1st Linear Regression Model for you to use
row_index = random_generator(1, 0, 404, 404) # Row indexes that you need to fit your model on. Do not change it.

# Try printing row_index to understand what "randomly sampled with replacement" means

# START YOUR CODE HERE:

X_train_temp = X_train.iloc[row_index, :]
y_train_temp = y_train.iloc[row_index, :]

linreg1 = linreg1.fit(X_train_temp, y_train_temp)
y_pred1 = linreg1.predict(X_test)

mse_test = mean_squared_error(y_test, y_pred1)

# END YOUR CODE HERE


In [None]:
# SUBSTITUTE YOUR ANSWER IN PLACE OF None

ans[0] = mse_test     

### **QUESTION 2**: Fit a second linear Regression model on new randomly sampled (with replacement) training data and assign the mean squared error of the average of the predictions of the two linear regression models on the test set to ans[1]. (1 mark)
eg:- If for a particular data point, model1 predicts 20.0 and model2 predicts 30.0 then the final prediction should be 25.0.

**WARNING**: The question asks for the mean squared error of the predictions and not the predictions themselves.
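
The averaging step and its effect on the error can be tried on toy numbers first. Everything in the sketch below (the target values, the two prediction lists and the helper `mse`) is made up for illustration and is unrelated to the housing data.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [22.0, 31.0, 18.0]
pred1 = [20.0, 35.0, 15.0]   # hypothetical predictions of model 1
pred2 = [30.0, 29.0, 19.0]   # hypothetical predictions of model 2

# Element-wise average: the first entry is (20.0 + 30.0) / 2 = 25.0
avg = [(a + b) / 2 for a, b in zip(pred1, pred2)]

mse1, mse2, mse_avg = mse(y_true, pred1), mse(y_true, pred2), mse(y_true, avg)
print(avg)                 # → [25.0, 32.0, 17.0]
print(mse1, mse2, mse_avg)
```

Because the squared error is convex, the MSE of the averaged prediction never exceeds the mean of the individual MSEs, although it is not guaranteed to beat the better of the two models.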

In [None]:
# SOME PART OF THE CODE HAS BEEN WRITTEN FOR YOU. DO NOT CHANGE WHERE NOT INDICATED.
# THE DATA SAMPLES HAVE BEEN SELECTED FOR YOU

linreg2 = LinearRegression()                   # 2nd Linear Regression Model
row_index = random_generator(3, 0, 404, 404)   # Row indexes that you need to fit your model on. Do not change it.

# Use the previous linear regression model (linreg1) as your first model. You have already trained your linreg1 so 
# you do not need to do that again. The final predictions would be the average of the predictions of these two models.

# START YOUR CODE HERE:

X_train_temp = X_train.iloc[row_index, :]
y_train_temp = y_train.iloc[row_index, :]

linreg2 = linreg2.fit(X_train_temp, y_train_temp)
y_pred2 = linreg2.predict(X_test)

y_pred_final = (y_pred1 + y_pred2) / 2

mse_test = mean_squared_error(y_test, y_pred_final)

# END YOUR CODE HERE

In [None]:
# SUBSTITUTE YOUR ANSWER IN PLACE OF None

ans[1] = mse_test 

Did the combined predictions have a lower mean squared error compared to the individual mean squared errors of the two models?

### **QUESTION 3**: Fit a third linear Regression model on new randomly sampled (with replacement) training data and assign the mean squared error of the average of the predictions of the three linear regression models on the test set to ans[2]. (2 marks)
eg:- If for a particular data point, model1 predicts 20.0, model2 predicts 30.0 and model3 predicts 70.0 then the final prediction should be (20.0+30.0+70.0)/3 = 40.0.

**WARNING**: The question asks for the mean squared error of the predictions and not the predictions themselves.

In [None]:
# SOME PART OF THE CODE HAS BEEN WRITTEN FOR YOU. DO NOT CHANGE WHERE NOT INDICATED.
# THE DATA SAMPLES HAVE BEEN SELECTED FOR YOU

linreg3 = LinearRegression()                  # 3rd Linear Regression Model
row_index = random_generator(5, 0, 404, 404)      # Row indexes that you need to fit your model on. Do not change it.

# linreg1, linreg2 and linreg3 would be your 3 models. You have already fitted linreg1 and linreg2 so you do not need to 
# train them again.

# START YOUR CODE HERE:

X_train_temp = X_train.iloc[row_index, :]
y_train_temp = y_train.iloc[row_index, :]

linreg3 = linreg3.fit(X_train_temp, y_train_temp)
y_pred3 = linreg3.predict(X_test)

y_pred_final = (y_pred1 + y_pred2 + y_pred3) / 3

mse_test = mean_squared_error(y_test, y_pred_final)

# END YOUR CODE HERE

In [None]:
# SUBSTITUTE YOUR ANSWER IN PLACE OF None

ans[2] = mse_test             

Did the mean squared error of the averaged predictions (ensembled predictions) reduce further? Is it less than the individual mean squared errors of the models?

### **QUESTION 4**: Fit a fourth linear Regression model on new randomly sampled training data and assign the mean squared error of the average of the predictions of the four linear regression models on the test set to ans[3]. (1 mark)
**WARNING**: The question asks for the mean squared error of the predictions and not the predictions themselves.

In [None]:
# SOME PART OF THE CODE HAS BEEN WRITTEN FOR YOU. DO NOT CHANGE WHERE NOT INDICATED.
# THE DATA SAMPLES HAVE BEEN SELECTED FOR YOU

linreg4 = LinearRegression()                  # 4th Linear Regression Model
row_index = random_generator(7, 0, 404, 404)    # Row indexes that you need to fit your model on. Do not change it.

# Use the previous 3 models(linreg1, linreg2, linreg3) and do not train them again.

# START YOUR CODE HERE:

X_train_temp = X_train.iloc[row_index, :]
y_train_temp = y_train.iloc[row_index, :]

linreg4 = linreg4.fit(X_train_temp, y_train_temp)
y_pred4 = linreg4.predict(X_test)

y_pred_final = (y_pred1 + y_pred2 + y_pred3 + y_pred4) / 4

mse_test = mean_squared_error(y_test, y_pred_final)

# END YOUR CODE HERE

In [None]:
# SUBSTITUTE YOUR ANSWER IN PLACE OF None

ans[3] = mse_test

Did the mean squared error reduce this time as well? If not, then why?

So the final error depends also on how good the individual models are. Let's look at how a large number of models perform together.
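
To see how a single weak model can pull an averaged prediction away from the target, consider this toy sketch. All numbers and names below are made up for illustration and are unrelated to the housing data.

```python
def mse(y_true, y_pred):
    """Mean squared error of two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def average(pred_lists):
    """Element-wise mean of several prediction lists."""
    return [sum(vals) / len(vals) for vals in zip(*pred_lists)]

y_true = [10.0, 20.0, 30.0]
good1 = [11.0, 19.0, 32.0]   # hypothetical reasonable model
good2 = [10.0, 22.0, 29.0]   # hypothetical reasonable model
poor = [25.0, 5.0, 45.0]     # hypothetical badly fitted model

mse_two = mse(y_true, average([good1, good2]))
mse_three = mse(y_true, average([good1, good2, poor]))
print(mse_two, mse_three)  # the second value is much larger
```

With only a few models in the ensemble, each one carries a large weight, so one poor fit can outweigh the benefit of averaging.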

### **QUESTION 5**: Fit 50 linear Regression models on randomly sampled training data (new random sampling for each model) and assign the mean squared error of the average of the predictions of the 50 linear regression models on the test set to ans[4]. (2 marks)
eg:- If model1 predicts y1, model2 predicts y2 and so on till y50, then final prediction would be (y1+y2+...+y50)/50.

**WARNING**: The question asks for the mean squared error of the predictions and not the predictions themselves.

In [None]:
# SOME PART OF THE CODE HAS BEEN WRITTEN FOR YOU. DO NOT CHANGE WHERE NOT INDICATED.
# FEEL FREE TO USE MORE CELLS FOR YOUR CODE.
# THE DATA SAMPLES HAVE BEEN SELECTED FOR YOU
# HINT: You can add the predictions in a for loop.

n = 50                                           # 50 Linear Regression models 
train_preds = 0
test_preds = 0

test_mse_all = []

np.random.seed(10)                               # Used for consistent answers for evaluation purposes
subset_seed = random_generator(10, 0, 200, n)
for i in range(n):
    np.random.seed(subset_seed[i])                            # Used for consistent answers for evaluation purposes
    row_index = random_generator(subset_seed[i], 0, 404, 404) # Row indexes that you need to fit your model on. Do not change it.
    
  # START YOUR CODE HERE:

    linreg = LinearRegression()    

    X_train_temp = X_train.iloc[row_index, :]
    y_train_temp = y_train.iloc[row_index, :]

    linreg = linreg.fit(X_train_temp, y_train_temp)
    
    train_pred = linreg.predict(X_train)
    test_pred = linreg.predict(X_test)

    train_preds += train_pred
    test_preds += test_pred

    test_mse_all.append(mean_squared_error(y_test, test_preds / (i+1)))

train_preds /= n
test_preds /= n

mse_test = mean_squared_error(y_test, test_preds)

  # END YOUR CODE HERE

In [None]:
# SUBSTITUTE YOUR ANSWER IN PLACE OF None

ans[4] = mse_test

Is it lower than the previous mean squared errors?
If yes, does it go all the way down to 0? To check, try plotting the test mean squared error against the number of linear regression models used.
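
To build the curve for such a plot, one option is to track the MSE of the running average as models are added one at a time. The helper below, `running_ensemble_mse`, is a hypothetical name and a pure-Python sketch on toy numbers; it is not the notebook's required implementation.

```python
def running_ensemble_mse(per_model_preds, y_true):
    """MSE of the average of the first k models, for k = 1 .. number of models."""
    running_sum = [0.0] * len(y_true)
    curve = []
    for k, preds in enumerate(per_model_preds, start=1):
        # Add the k-th model's predictions to the running total
        running_sum = [s + p for s, p in zip(running_sum, preds)]
        avg = [s / k for s in running_sum]
        curve.append(sum((t - a) ** 2 for t, a in zip(y_true, avg)) / len(y_true))
    return curve

y_true = [1.0, 2.0]
per_model_preds = [[2.0, 2.0], [0.0, 2.0], [1.0, 5.0]]  # toy predictions of 3 models
curve = running_ensemble_mse(per_model_preds, y_true)
print(curve)  # → [0.5, 0.0, 0.5]
```

The same function can be fed predictions on the whole training set to obtain a training curve for comparison.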

Some of the matplotlib code has been written for you.

#### PLOTTING TEST MSE VS NUM OF MODELS USED

In [None]:
# num_models : list of the number of models
# test_mse : list of the corresponding test mean squared error

num_models = [*range(1, n+1)]        # remember to replace it with your number of models used
test_mse = test_mse_all         # remember to replace it with your test mean square errors

plt.plot(num_models, test_mse, c = "b", label = "Test Mean Squared Error")
plt.xlabel("Number of Linear Regression Models")
plt.legend()
plt.ylabel("Mean Squared Error")
plt.title("Mean Squared Error vs Num of Linear Regression Models")
plt.gca().set_xlim(left = 0)
plt.show()

Can you justify why bagging helps?
If there is a slight dip in the test MSE curve (i.e. it does not saturate at its lowest point), can you explain why?

**HINT**: You can also try plotting the mean squared error of the predictions on the _WHOLE_ training set to gather more insight.
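
One way to reason about these questions is a standard result, stated here under simplifying assumptions rather than derived from this notebook. Suppose the predictions of the $B$ models at a point $x$ each have variance $\sigma^2$ and pairwise correlation $\rho$. Then the variance of their average is

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} \hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
$$

The second term shrinks as $B$ grows, but the first does not, because models trained on resamples of the same data are correlated. Averaging identically distributed models also leaves the bias unchanged, so the test error is expected to level off rather than fall to 0.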

#### **BAGGING WITH DECISION TREE MODELS**

### **QUESTION 6**: Fit a single Decision Tree Regression model on randomly sampled (with replacement) training data and assign the mean squared error of the predictions on the test set to ans[5]. (1 mark)

In [None]:
# SOME PART OF THE CODE HAS BEEN WRITTEN FOR YOU. DO NOT MAKE CHANGES WHERE NOT INDICATED.
# THE DATA SAMPLES HAVE BEEN SELECTED FOR YOU.

dt1 = DecisionTreeRegressor(max_depth = 3, random_state = 10)    # Decision tree that you have to use. Don't change parameters.
row_index = random_generator(8, 0, 404, 404)                     # Row indexes that you need to fit your model on. Do not change it.
 
# START YOUR CODE HERE:

X_train_temp = X_train.iloc[row_index, :]
y_train_temp = y_train.iloc[row_index, :]

dt1 = dt1.fit(X_train_temp, y_train_temp)
y_pred_dt1 = dt1.predict(X_test)

mse_test = mean_squared_error(y_test, y_pred_dt1)

# END YOUR CODE HERE

In [None]:
# SUBSTITUTE YOUR ANSWER IN PLACE OF None

ans[5] = mse_test

### **QUESTION 7**: Train and fit 50 Decision Trees on randomly sampled training data (new random sampling for each decision tree) and assign the mean squared error of the average of all predictions on the test set to ans[6]. (2 marks)
eg:- If model1 predicts y1, model2 predicts y2 and so on till y50, then final prediction would be (y1+y2+...+y50)/50.

**WARNING**: The question asks for the mean squared error and not the final prediction.

In [None]:
# SOME PART OF THE CODE HAS BEEN WRITTEN FOR YOU. DO NOT MAKE CHANGES WHERE NOT INDICATED.
# THE DATA SAMPLE HAS BEEN SELECTED FOR YOU.

n = 50
train_preds = 0
test_preds = 0

test_mse_all = []

np.random.seed(11)                                                  # Used for consistent answers for evaluation purposes.
subset_seed = random_generator(9, 0, 200, n)
for i in range(n):
    row_index = random_generator(subset_seed[i], 0, 404, 404)                     # Row indexes that you need to fit your model on. Do not change it.
    dt = DecisionTreeRegressor(max_depth = 3, random_state = 10)      # Decision Tree that you need to use. Don't change parameters.

  # START YOUR CODE HERE:  

    X_train_temp = X_train.iloc[row_index, :]
    y_train_temp = y_train.iloc[row_index, :]

    dt = dt.fit(X_train_temp, y_train_temp)
    
    train_pred = dt.predict(X_train)
    test_pred = dt.predict(X_test)

    train_preds += train_pred
    test_preds += test_pred

    test_mse_all.append(mean_squared_error(y_test, test_preds / (i+1)))

train_preds /= n
test_preds /= n

mse_test = mean_squared_error(y_test, test_preds)

  # END YOUR CODE HERE 

In [None]:
# SUBSTITUTE YOUR ANSWER IN PLACE OF None

ans[6] = mse_test

Try to plot the test mean squared error vs the num of decision trees used for that prediction. The graph should be similar to the Linear Regression graph. 

You can also try to plot the mean squared error of the predictions on the _WHOLE_ training dataset to gain more insight.

Some of the matplotlib code has been written for you.

#### PLOTTING THE TEST MSE VS NUM OF DECISION TREES

In [None]:
# num_trees : list of the number of trees
# test_mse : list of the corresponding test mean squared error

num_trees = [*range(1, n+1)]         # remember to replace it with your implemented code
test_mse = test_mse_all          # remember to replace it with your implemented code

plt.plot(num_trees, test_mse, c = "b", label = "Test Mean Squared Error")
plt.xlabel("Number of Decision Trees")
plt.legend()
plt.ylabel("Mean Squared Error")
plt.title("Mean Squared Error vs Num of Decision Trees")
plt.gca().set_xlim(left = 0)
plt.show()

Did the Decision Tree models perform better than the Linear Regression models, or worse?

Was there a dip in the test mean squared error in the Decision Trees?

Keep the test mean squared error in mind for the next assignment, where we will tweak the decision tree model slightly to arrive at a very popular model called Random Forest.

In [None]:
import json
ans = [str(item) for item in ans]

filename = "group15_sayush7755@gmail.com_Abhishek_Singh_Baaging"

# Eg if your name is Saurav Joshi and group id is 0, filename becomes
# filename = group0_Saurav_Joshi_Bagging

## Do not change anything below!!
- Make sure you have changed the above variable "filename" with the correct value. Do not change anything below!!

In [None]:

from importlib import import_module
import os
from pprint import pprint

findScore = import_module('findScore')
response = findScore.main(ans)
response['details'] = filename
with open(f'evaluation_{filename}.json', 'w') as outfile:
    json.dump(response, outfile)
pprint(response)