# Bagging Algorithm (Regression)

Data Source: [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg)

**Data Attributes**

This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original".

"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993).

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

In [1]:
# Importing necessary packages
import pandas as pd
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt

In [2]:
# Read the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
cars = pd.read_table(url, delim_whitespace = True)
cars

,18.0,8,307.0,130.0,3504.,12.0,70,1,chevrolet chevelle malibu
0,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
1,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
2,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
3,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
4,15.0,8,429.0,198.0,4341.0,10.0,70,1,ford galaxie 500
...,...,...,...,...,...,...,...,...,...
392,27.0,4,140.0,86.0,2790.0,15.6,82,1,ford mustang gl
393,44.0,4,97.0,52.0,2130.0,24.6,82,2,vw pickup
394,32.0,4,135.0,84.0,2295.0,11.6,82,1,dodge rampage
395,28.0,4,120.0,79.0,2625.0,18.6,82,1,ford ranger


In [3]:
# Display the characteristics of dataset
print("Dimensions of the dataset are: ", cars.shape)
print("The names of variables present in dataset are: \n", cars.columns)

Dimensions of the dataset are:  (397, 9)
The names of variables present in dataset are: 
 Index(['18.0', '8', '307.0', '130.0', '3504.', '12.0', '70', '1',
       'chevrolet chevelle malibu'],
      dtype='object')
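
The column index above consists of data values (`'18.0'`, `'8'`, ...), which indicates that the file has no header row and the first record was consumed as the header. A minimal sketch of one way to avoid this is shown here, assuming `header=None` with explicit `names`. The inline sample stands in for the downloaded file; its third row is a made-up record that uses `?` as a missing-value marker, which the raw `auto-mpg.data` file uses in the horsepower column. The names `sample`, `cars_fixed` and `cars_clean` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                "acceleration", "model_year", "origin", "car_name"]

# Inline stand-in for auto-mpg.data. The first two rows appear in the output above;
# the third row is hypothetical and only demonstrates the "?" marker.
sample = io.StringIO(
    '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"\n'
    '15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"\n'
    '25.0 4 98.0 ? 2046. 19.0 71 1 "hypothetical car"\n'
)

# header=None keeps the first record as data; na_values="?" turns the marker into NaN
cars_fixed = pd.read_csv(sample, sep=r"\s+", header=None,
                         names=column_names, na_values="?")

# One possible treatment of the missing values: drop the affected rows
cars_clean = cars_fixed.dropna(subset=["horsepower"])

print(cars_fixed.shape, cars_clean.shape)
```

Dropping rows with a missing horsepower would be consistent with the row count falling from 397 to 391 later in this notebook, although the notebook itself does not show that step.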


In [4]:
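# The file has no header row, so pandas used the first record as column names.
# Assign the names from the data source (that first record is not recovered here).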
cars.columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year",
               "origin", "car_name"]
cars.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
1,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
2,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
3,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
4,15.0,8,429.0,198.0,4341.0,10.0,70,1,ford galaxie 500


In [5]:
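# Export to .csv (create the target folder first so to_csv does not fail)
import os
os.makedirs("./autompg", exist_ok = True)
cars.to_csv("./autompg/autompg.csv", index = False)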

In [6]:
# Load and read the .csv file
auto = pd.read_csv("./autompg/autompg.csv")
auto.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
1,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
2,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
3,17.0,8,302.0,140,3449,10.5,70,1,ford torino
4,15.0,8,429.0,198,4341,10.0,70,1,ford galaxie 500


In [7]:
# Check the information of dataset
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 391 entries, 0 to 390
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           391 non-null    float64
 1   cylinders     391 non-null    int64  
 2   displacement  391 non-null    float64
 3   horsepower    391 non-null    int64  
 4   weight        391 non-null    int64  
 5   acceleration  391 non-null    float64
 6   model_year    391 non-null    int64  
 7   origin        391 non-null    int64  
 8   car_name      391 non-null    object 
dtypes: float64(3), int64(5), object(1)
memory usage: 27.6+ KB


In [8]:
# Check for missing values in the loaded dataframe
# (isnull() and isna() are aliases, so both calls report the same counts)
print("The null values in the dataframe are: \n", auto.isnull().sum())
print("The not available values in the dataframe are: \n", auto.isna().sum())

The null values in the dataframe are: 
 mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64
The not available values in the dataframe are: 
 mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64


In [9]:
# Drop model_year, origin and car_name; these features are not used as predictors in this model
auto = auto.drop(["model_year", "origin", "car_name"], axis = 1)
auto

,mpg,cylinders,displacement,horsepower,weight,acceleration
0,15.0,8,350.0,165,3693,11.5
1,18.0,8,318.0,150,3436,11.0
2,16.0,8,304.0,150,3433,12.0
3,17.0,8,302.0,140,3449,10.5
4,15.0,8,429.0,198,4341,10.0
...,...,...,...,...,...,...
386,27.0,4,140.0,86,2790,15.6
387,44.0,4,97.0,52,2130,24.6
388,32.0,4,135.0,84,2295,11.6
389,28.0,4,120.0,79,2625,18.6


In [10]:
# Seed NumPy's global random generator so that the train-test split below is reproducible
np.random.seed(3000)

In [11]:
# Train-Test Split
training, test = train_test_split(auto, test_size = 0.3)

x_trg = training.drop("mpg", axis = 1)
y_trg = training["mpg"]

x_test = test.drop("mpg", axis = 1)
y_test = test["mpg"]
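
The split above relies on the global NumPy seed set in the previous cell. As an alternative, the seed can be passed directly through the `random_state` parameter of `train_test_split`, which keeps the split reproducible even if other code draws random numbers in between. The sketch below uses a small hypothetical array (`data`) in place of the `auto` dataframe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the feature table: 10 rows, 2 columns
data = np.arange(20).reshape(10, 2)

# Two calls with the same random_state produce the same partition
first_train, first_test = train_test_split(data, test_size=0.3, random_state=3000)
second_train, second_test = train_test_split(data, test_size=0.3, random_state=3000)

same_split = np.array_equal(first_test, second_test)
print(same_split, first_train.shape, first_test.shape)
```

With `test_size = 0.3`, scikit-learn rounds the test set size up, which is why 391 rows yield the 118 test predictions shown later in this notebook.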

### Model Building - Bagging Model

In [12]:
# Model building - Bagging model
auto_bag = BaggingRegressor(random_state = 0)

# Fit the model
auto_bag.fit(x_trg, y_trg)

BaggingRegressor(random_state=0)
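
With default settings, `BaggingRegressor` trains 10 decision trees, each on a bootstrap sample (rows drawn with replacement) of the training set, and averages their predictions. The sketch below illustrates that idea with NumPy only. It is not scikit-learn's implementation: a least-squares line replaces the decision tree as base learner, the data is synthetic, and the helper names `fit_line`, `bagging_fit` and `bagging_predict` are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line; a simple stand-in for the decision tree base learner."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def bagging_fit(x, y, n_estimators=10, seed=0):
    """Fit one base learner per bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(x)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        models.append(fit_line(x[idx], y[idx]))
    return models

def bagging_predict(models, x_new):
    """Average the predictions of all base learners."""
    preds = np.array([slope * x_new + intercept for slope, intercept in models])
    return preds.mean(axis=0)

# Synthetic data (hypothetical): heavier cars get fewer miles per gallon
rng = np.random.default_rng(42)
weight = rng.uniform(1.6, 5.0, size=200)  # weight in 1000 lbs
mpg = 46.0 - 7.5 * weight + rng.normal(0.0, 2.0, size=200)

models = bagging_fit(weight, mpg, n_estimators=10, seed=0)
pred = bagging_predict(models, np.array([2.0, 4.0]))
print(pred)
```

Averaging over bootstrap samples mainly reduces variance, which is why bagging is usually paired with high-variance base learners such as fully grown trees.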

In [13]:
# Display the R^2 score of the bagging model (score() returns R^2 for regressors, not accuracy)
print("R^2 of bagging model on Training set is: ", auto_bag.score(x_trg, y_trg))
print("R^2 of bagging model on Test set is: ", auto_bag.score(x_test, y_test))

R^2 of bagging model on Training set is:  0.9635909424663078
R^2 of bagging model on Test set is:  0.6020232922813146
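
For a regressor such as `BaggingRegressor`, `score` returns the coefficient of determination R², not a classification accuracy. R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, so 1.0 is a perfect fit and 0.0 matches always predicting the mean. A small worked example follows. The target values are the first four mpg values displayed earlier, while the predictions and the helper `r_squared` are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [15.0, 18.0, 16.0, 17.0]  # mpg values shown earlier
y_pred = [16.0, 17.0, 16.0, 18.0]  # hypothetical predictions

# SS_res = 1 + 1 + 0 + 1 = 3, SS_tot = 5 (mean is 16.5), so R^2 = 1 - 3/5
r2 = r_squared(y_true, y_pred)
print(round(r2, 3))  # 0.4
```

The large gap between the training score (about 0.96) and the test score (about 0.60) reported above suggests that the default model overfits the training data.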


In [14]:
# Predictions on Test set
auto_pred = auto_bag.predict(x_test)
auto_pred

array([33.01, 35.9 , 35.56, 13.4 , 24.18, 35.54, 23.6 , 14.8 , 27.24,
       18.17, 33.42, 25.73, 28.06, 32.75, 31.78, 24.91, 24.95, 15.76,
       27.44, 29.13, 13.  , 18.98, 21.22, 19.67, 28.65, 19.23, 35.29,
       25.33, 15.55, 32.47, 26.64, 14.85, 26.72, 26.98, 14.25, 25.2 ,
       15.01, 24.24, 30.14, 14.1 , 15.95, 23.92, 14.55, 36.84, 24.3 ,
       26.65, 12.65, 19.63, 15.85, 26.3 , 12.5 , 13.9 , 25.7 , 33.35,
       20.  , 14.35, 30.4 , 35.52, 19.54, 12.  , 27.92, 25.35, 13.2 ,
       14.75, 30.73, 29.96, 19.25, 28.98, 17.25, 36.78, 25.49, 25.6 ,
       21.84, 18.8 , 12.4 , 22.28, 12.5 , 25.65, 26.7 , 23.14, 28.2 ,
       35.6 , 14.05, 20.8 , 14.28, 21.94, 15.53, 22.38, 29.97, 18.68,
       21.84, 24.05, 28.72, 37.07, 17.7 , 25.73, 32.75, 15.04, 20.26,
       20.95, 21.16, 32.11, 35.53, 27.34, 19.22, 16.91, 24.32, 21.31,
       29.44, 24.63, 19.07, 28.6 , 18.84, 34.71, 33.66, 12.9 , 14.55,
       12.9 ])

In [15]:
# Compute the RMSE of the bagging model
auto_rmse = sqrt(mean_squared_error(y_test, auto_pred))
print("The RMSE of the Bagging Model is: ", auto_rmse)

The RMSE of the Bagging Model is:  4.696891507210751
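
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same unit as the target (miles per gallon here). The sketch below computes it by hand, which is what `sqrt(mean_squared_error(...))` does in the cell above. The predictions and the helper `rmse` are hypothetical.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

y_true = [15.0, 18.0, 16.0, 17.0]  # mpg values shown earlier
y_pred = [16.0, 17.0, 16.0, 18.0]  # hypothetical predictions

# Squared errors are 1, 1, 0, 1, so MSE = 0.75 and RMSE = sqrt(0.75)
error = rmse(y_true, y_pred)
print(round(error, 3))  # 0.866
```

An RMSE of about 4.7 therefore means the test predictions are off by roughly 4.7 mpg in the root-mean-square sense.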


#### Creating a new Bagging Model with Best parameters

In [16]:
# Import the GridSearchCV package from sklearn
from sklearn.model_selection import GridSearchCV

In [17]:
# Create grid for best parameters
grid = {"n_estimators" : [10,20,30], "max_samples" : [0.5,0.8,1.0], "max_features" : [0.5,0.7,1.0]}

# Model building
auto_bag_grid = BaggingRegressor()

In [18]:
# Search for best parameters
auto_bag_CV = GridSearchCV(estimator = auto_bag_grid, param_grid = grid, cv = 5)

# Model fit
auto_bag_results = auto_bag_CV.fit(x_trg, y_trg)
print("Best Parameters are: \n", auto_bag_CV.best_params_)

Best Parameters are: 
 {'max_features': 0.5, 'max_samples': 0.8, 'n_estimators': 20}
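
`GridSearchCV` evaluates every combination of the values in the grid with cross-validation. The sketch below counts the work involved for the grid defined above, using only the standard library. The names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

grid = {"n_estimators": [10, 20, 30],
        "max_samples": [0.5, 0.8, 1.0],
        "max_features": [0.5, 0.7, 1.0]}
cv_folds = 5

# Every combination of one value per parameter is one candidate model
candidates = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]

# Each candidate is fitted once per cross-validation fold
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)  # 27 135
```

By default `GridSearchCV` also refits the best candidate once on the full training set after the search, so the total is one fit more than the number above.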


In [19]:
# Model Building - New Bagging Model with best parameters
auto_bag_best = BaggingRegressor(n_estimators = auto_bag_results.best_params_["n_estimators"],
                                max_samples = auto_bag_results.best_params_["max_samples"],
                                max_features = auto_bag_results.best_params_["max_features"])

In [20]:
auto_bag_best

BaggingRegressor(max_features=0.5, max_samples=0.8, n_estimators=20)
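
Rebuilding the model from `best_params_` works, but a fitted `GridSearchCV` already exposes the winning model as `best_estimator_`, refit on the whole training set because `refit=True` is the default. The sketch below compares both routes on synthetic data. The data, the reduced grid `small_grid` and the fixed `random_state=0` are assumptions made to keep the example fast and deterministic; they are not taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical), only to make the example runnable
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(120, 4))
y = 10.0 * X[:, 0] - 5.0 * X[:, 1] + rng.normal(0.0, 0.5, size=120)

small_grid = {"n_estimators": [10, 20], "max_features": [0.5, 1.0]}
search = GridSearchCV(estimator=BaggingRegressor(random_state=0),
                      param_grid=small_grid, cv=5)
search.fit(X, y)

# Route 1: use the model that GridSearchCV already refit on all of X, y
best_model = search.best_estimator_

# Route 2: rebuild it by hand from best_params_, as the notebook does
manual_model = BaggingRegressor(random_state=0, **search.best_params_).fit(X, y)

same_predictions = np.allclose(best_model.predict(X), manual_model.predict(X))
print(search.best_params_, same_predictions)
```

The two routes agree here only because `random_state` is fixed. Without it, as in the notebook's `BaggingRegressor()` calls, each fit draws different bootstrap samples and the scores vary from run to run.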

#### New Model Evaluation

In [21]:
# Evaluating the performance of the new bagging model (score() returns R^2 for regressors)
auto_bag_best.fit(x_trg, y_trg)
print("R^2 of new bagging model on Training set is: ", auto_bag_best.score(x_trg, y_trg))
print("R^2 of new bagging model on Test set is: ", auto_bag_best.score(x_test, y_test))

R^2 of new bagging model on Training set is:  0.9370275323427278
R^2 of new bagging model on Test set is:  0.6522038536477504


In [22]:
# Prediction on Test set
auto_pred_2 = auto_bag_best.predict(x_test)
auto_pred_2

array([32.43      , 33.815     , 31.7075    , 14.23125   , 26.04816667,
       33.60558333, 24.42166667, 16.24869048, 26.86845238, 16.665     ,
       31.50208333, 25.56178571, 28.66375   , 33.94      , 32.10833333,
       24.0212987 , 24.81666667, 14.56119048, 27.12301587, 27.81151587,
       13.845     , 19.1675    , 22.59388889, 19.7612987 , 29.42888889,
       18.59      , 34.44      , 26.21      , 15.43238095, 29.8875    ,
       26.5975    , 14.015     , 27.81583333, 27.16541667, 16.02369048,
       25.37678571, 15.22      , 24.523     , 28.56416667, 13.45555556,
       18.3925    , 24.55301587, 13.29      , 32.9875    , 25.23055556,
       25.425     , 12.865     , 20.90735931, 15.90952381, 28.2375    ,
       12.955     , 13.825     , 27.59333333, 31.73416667, 20.91261905,
       14.56035714, 27.72      , 33.961     , 18.57      , 13.14035714,
       33.18      , 28.71666667, 13.705     , 16.178125  , 27.95468254,
       29.41541667, 19.16166667, 28.06      , 16.74      , 33.31

In [23]:
# Compute the RMSE of new bagging model
auto_rmse_2 = sqrt(mean_squared_error(y_test, auto_pred_2))
print("The RMSE of new Bagging Model is: ", auto_rmse_2)

The RMSE of new Bagging Model is:  4.390804331089603


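We can see that the test RMSE of the bagging model dropped from about `4.70` to `4.39`, while the test R² rose from about `0.60` to `0.65`. So the model tuned with the **Grid Search** approach predicts the test set more accurately than the default model on this split.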