# Random Forest (Regression)

Data Source: [Energy Efficiency](https://archive.ics.uci.edu/ml/datasets/energy+efficiency)

**Data Attributes**

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses. 

Specifically:
- X1 Relative Compactness
- X2 Surface Area
- X3 Wall Area
- X4 Roof Area
- X5 Overall Height
- X6 Orientation
- X7 Glazing Area
- X8 Glazing Area Distribution
- y1 Heating Load
- y2 Cooling Load

In [1]:
# Importing the necessary packages
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load and read the dataset
energy = pd.read_csv("./energy/ENB2012_data.csv")
energy.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2,Unnamed: 10,Unnamed: 11
0,0.98,514.5,294.0,110.25,7.0,2.0,0.0,0.0,15.55,21.33,,
1,0.98,514.5,294.0,110.25,7.0,3.0,0.0,0.0,15.55,21.33,,
2,0.98,514.5,294.0,110.25,7.0,4.0,0.0,0.0,15.55,21.33,,
3,0.98,514.5,294.0,110.25,7.0,5.0,0.0,0.0,15.55,21.33,,
4,0.9,563.5,318.5,122.5,7.0,2.0,0.0,0.0,20.84,28.28,,


In [3]:
# Drop the last two columns from the dataset
energy.drop(["Unnamed: 10", "Unnamed: 11"], axis = 1, inplace = True)
energy.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2.0,0.0,0.0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3.0,0.0,0.0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4.0,0.0,0.0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5.0,0.0,0.0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2.0,0.0,0.0,20.84,28.28


In [4]:
# Display the characteristics of the dataset
print("Dimensions of dataset are: ", energy.shape)
print("The variables present in dataset are: \n", energy.columns)

Dimensions of dataset are:  (1296, 10)
The variables present in dataset are: 
 Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'Y1', 'Y2'], dtype='object')


In [5]:
# Convert every column to a numeric dtype
# (apply() returns a new DataFrame, so the result must be assigned)
energy = energy.apply(pd.to_numeric)

# For the sake of simplicity, let's drop the rows with missing values
energy = energy.dropna()
print(energy)

       X1     X2     X3      X4   X5   X6   X7   X8     Y1     Y2
0    0.98  514.5  294.0  110.25  7.0  2.0  0.0  0.0  15.55  21.33
1    0.98  514.5  294.0  110.25  7.0  3.0  0.0  0.0  15.55  21.33
2    0.98  514.5  294.0  110.25  7.0  4.0  0.0  0.0  15.55  21.33
3    0.98  514.5  294.0  110.25  7.0  5.0  0.0  0.0  15.55  21.33
4    0.90  563.5  318.5  122.50  7.0  2.0  0.0  0.0  20.84  28.28
..    ...    ...    ...     ...  ...  ...  ...  ...    ...    ...
763  0.64  784.0  343.0  220.50  3.5  5.0  0.4  5.0  17.88  21.40
764  0.62  808.5  367.5  220.50  3.5  2.0  0.4  5.0  16.54  16.88
765  0.62  808.5  367.5  220.50  3.5  3.0  0.4  5.0  16.44  17.11
766  0.62  808.5  367.5  220.50  3.5  4.0  0.4  5.0  16.48  16.61
767  0.62  808.5  367.5  220.50  3.5  5.0  0.4  5.0  16.64  16.03

[768 rows x 10 columns]
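
`DataFrame.apply` returns a new DataFrame instead of modifying the original, so a conversion only takes effect once its result is assigned. The sketch below illustrates this on a small hypothetical frame (the values and the `toy` name are made up for illustration, not taken from the energy data). The `errors="coerce"` argument is an optional choice that is not used in this notebook; it turns unparseable entries into `NaN` so that `dropna()` can remove them.

```python
import pandas as pd

# Hypothetical toy frame with numbers stored as strings (not the energy data)
toy = pd.DataFrame({"X1": ["0.98", "0.90", "bad"], "Y2": ["21.33", "28.28", "16.03"]})

# apply() returns a new DataFrame; without the assignment, toy stays unchanged
converted = toy.apply(pd.to_numeric, errors="coerce")  # "bad" becomes NaN

print(toy.dtypes)        # still non-numeric (strings)
print(converted.dtypes)  # float64 for both columns

# Rows containing NaN can then be dropped, as in the cell above
cleaned = converted.dropna()
print(cleaned.shape)
```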


In [6]:
# Ensure the result is a DataFrame (dropna() already returns one, so this is effectively a no-op)
energy = pd.DataFrame(energy)
energy.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2.0,0.0,0.0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3.0,0.0,0.0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4.0,0.0,0.0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5.0,0.0,0.0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2.0,0.0,0.0,20.84,28.28


In [7]:
# Check the information of dataframe
energy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      768 non-null    float64
 1   X2      768 non-null    float64
 2   X3      768 non-null    float64
 3   X4      768 non-null    float64
 4   X5      768 non-null    float64
 5   X6      768 non-null    float64
 6   X7      768 non-null    float64
 7   X8      768 non-null    float64
 8   Y1      768 non-null    float64
 9   Y2      768 non-null    float64
dtypes: float64(10)
memory usage: 66.0 KB


In [8]:
# Check for missing values (isna() is an alias of isnull(), so both counts match)
print("The null values in the dataset are: \n", energy.isnull().sum())
print("The NA values in the dataset are: \n", energy.isna().sum())

The null values in the dataset are: 
 X1    0
X2    0
X3    0
X4    0
X5    0
X6    0
X7    0
X8    0
Y1    0
Y2    0
dtype: int64
The NA values in the dataset are: 
 X1    0
X2    0
X3    0
X4    0
X5    0
X6    0
X7    0
X8    0
Y1    0
Y2    0
dtype: int64

In [9]:
# Set NumPy's global random seed so that the train-test split below is reproducible
np.random.seed(3000)

In [10]:
# Train-test split (70% training, 30% test)
# Use Y2 (cooling load) as the target; both Y1 and Y2 are removed from the features
training, test = train_test_split(energy, test_size = 0.3)

x_trg = training.drop(["Y1", "Y2"], axis = 1)
y_trg = training["Y2"]

x_test = test.drop(["Y1", "Y2"], axis = 1)
y_test = test["Y2"]
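
An alternative to seeding NumPy globally is to pass `random_state` directly to `train_test_split`, which pins that one split regardless of any other random calls made earlier. The sketch below uses hypothetical toy arrays (`X`, `y`) rather than the energy data, and reuses the seed value 3000 purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features (not the energy dataset)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Passing random_state makes this particular split reproducible,
# independent of NumPy's global seed
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=3000)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=3000)

print(np.array_equal(X_te1, X_te2))  # the same test rows are selected both times
print(X_tr1.shape, X_te1.shape)
```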

### Creating a Random Forest Model

In [11]:
# Model Building - Random Forest
energy_forest = RandomForestRegressor(random_state = 0)

# Fit the model
energy_forest.fit(x_trg, y_trg)

RandomForestRegressor(random_state=0)

In [12]:
# Compute the R^2 score of the Random Forest model
# (for regressors, score() returns the coefficient of determination, not accuracy)
print("The R^2 score on training set is: ", energy_forest.score(x_trg, y_trg))
print("The R^2 score on test set is: ", energy_forest.score(x_test, y_test))

The R^2 score on training set is:  0.9954309205360165
The R^2 score on test set is:  0.9672820046856093
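
For scikit-learn regressors, `score` returns the coefficient of determination R^2, so the values above are best read as the fraction of variance explained rather than as a classification-style accuracy. The sketch below, on hypothetical synthetic data (not the energy dataset), computes R^2 by hand and checks that it matches both `r2_score` and `score`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Hypothetical synthetic regression data (not the energy dataset)
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The three values agree
print(r2_manual, r2_score(y, pred), model.score(X, y))
```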


In [13]:
# Prediction via Random Forest model
pred_forest = energy_forest.predict(x_test)

In [14]:
pred_forest

array([30.2544, 13.622 , 39.2688, 13.435 , 16.3947, 31.6078, 19.3502,
       29.7503, 15.9719, 44.4337, 39.9382, 16.3137, 16.1919, 15.9243,
       14.6884, 15.99  , 15.8719, 32.493 , 41.7103, 15.1397, 41.0268,
       41.7555, 34.5523, 33.2598, 27.1518, 13.6462, 35.3317, 36.5708,
       14.1771, 40.2632, 14.2888, 33.92  , 15.8751, 37.2368, 14.0938,
       15.2774, 11.8703, 19.255 , 16.0319, 41.1288, 17.1529, 15.0968,
       28.2135, 15.8532, 33.2757, 25.9456, 16.72  , 33.7575, 14.2385,
       14.1544, 22.0791, 13.8093, 29.9907, 41.2563, 15.9869, 39.3578,
       15.4253, 33.4892, 14.8942, 16.3771, 29.7864, 13.5871, 14.3185,
       15.5903, 23.9204, 39.4692, 30.463 , 38.9331, 41.3766, 18.0214,
       30.311 , 40.5258, 19.2813, 39.1713, 25.9859, 15.935 , 17.0418,
       16.2579, 18.1617, 20.302 , 26.8757, 18.0157, 33.1501, 14.0981,
       33.697 , 14.3777, 17.1166, 12.3531, 17.1473, 39.1947, 32.3387,
       16.3426, 44.2446, 14.6336, 10.9656, 14.2847, 33.8649, 14.3004,
       31.1141, 39.8

In [15]:
# Compute the RMSE for the model
rmse_forest = sqrt(mean_squared_error(y_test, pred_forest))
print("The RMSE value for Random Forest model is: ", rmse_forest)

The RMSE value for Random Forest model is:  1.8041210522132292
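
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target (here, the units of Y2). As a minimal sketch of the calculation, the hypothetical `rmse` helper below reproduces what `sqrt(mean_squared_error(...))` computes, using made-up numbers.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical values: the errors are 1, 0 and -2, so MSE = (1 + 0 + 4) / 3
example = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(example)  # sqrt(5/3), about 1.291
```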


### Creating a Bagging Model

In [16]:
# Model Building - Bagging (the default base estimator is a decision tree)
energy_bag = BaggingRegressor(random_state = 0)

# Fit the model and report R^2 (for regressors, score() returns R^2, not accuracy)
energy_bag.fit(x_trg, y_trg)
print("R^2 score of Bagging model on training set is: ", energy_bag.score(x_trg, y_trg))
print("R^2 score of Bagging model on test set is: ", energy_bag.score(x_test, y_test))

R^2 score of Bagging model on training set is:  0.9941567945050306
R^2 score of Bagging model on test set is:  0.9632999806041446


In [17]:
# Prediction via Bagging model
energy_bag_pred = energy_bag.predict(x_test)

# Compute the RMSE of Bagging model
energy_bag_rmse = sqrt(mean_squared_error(y_test, energy_bag_pred))
print("The RMSE of Bagging model is: ", energy_bag_rmse)

The RMSE of Bagging model is:  1.910757083471008
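
With default settings, `BaggingRegressor` bags decision trees but builds only 10 of them, whereas `RandomForestRegressor` builds 100 by default (in scikit-learn 0.22 and later). Part of the gap between the two models may therefore come from ensemble size rather than from the method itself. The sketch below explores this on hypothetical synthetic data; it does not reproduce the numbers above, and the data-generating formula is an arbitrary assumption.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data (not the energy dataset)
rng = np.random.RandomState(0)
X = rng.uniform(size=(300, 4))
y = 20 * X[:, 0] + 10 * np.sin(6 * X[:, 1]) + rng.normal(scale=1.0, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n_trees in (10, 100):
    # The base estimator is passed positionally because its keyword name
    # differs between scikit-learn versions
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=n_trees, random_state=0)
    bag.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, bag.predict(X_te)))
    print(n_trees, "trees -> test RMSE:", rmse)
```

To apply the same idea in this notebook, `n_estimators` could be set explicitly on both models so that they are compared at equal ensemble size.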


### Creating a Decision Tree Model

In [18]:
# Model Building - a single Decision Tree
energy_tree = DecisionTreeRegressor(random_state = 0)

# Fit the model
energy_tree.fit(x_trg, y_trg)

# Feature importances are listed in column order (X1 ... X8)
print("Feature importances of the Decision Tree model for energy dataset are: \n", energy_tree.feature_importances_)

# Prediction via Decision Tree model (stored separately so the Bagging predictions are not overwritten)
energy_tree_pred = energy_tree.predict(x_test)

# Compute the RMSE of Decision Tree model
energy_tree_rmse = sqrt(mean_squared_error(y_test, energy_tree_pred))
print("The RMSE of Decision Tree model is: ", energy_tree_rmse)

Feature importances of the Decision Tree model for energy dataset are: 
 [1.07801535e-02 2.01280741e-04 1.03957304e-01 6.00425845e-03
 8.06400378e-01 1.30738165e-02 4.29857984e-02 1.65970109e-02]
The RMSE of Decision Tree model is:  2.3587433370059854
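
The importances printed above are easier to read once they are paired with the attribute names and sorted. The sketch below copies the printed values by hand and assumes they follow the column order X1 to X8, which holds because the feature frame keeps the original column order after `Y1` and `Y2` are dropped. The `names`, `importances` and `ranked` variables are introduced only for illustration.

```python
# Importances copied from the Decision Tree output above, in column order X1..X8
importances = [1.07801535e-02, 2.01280741e-04, 1.03957304e-01, 6.00425845e-03,
               8.06400378e-01, 1.30738165e-02, 4.29857984e-02, 1.65970109e-02]

# Attribute names from the "Data Attributes" list at the top of the notebook
names = ["X1 Relative Compactness", "X2 Surface Area", "X3 Wall Area", "X4 Roof Area",
         "X5 Overall Height", "X6 Orientation", "X7 Glazing Area",
         "X8 Glazing Area Distribution"]

# Sort features from most to least important
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:<30} {value:.4f}")
```

In this tree, Overall Height (X5) accounts for about 0.81 of the total importance and Wall Area (X3) for about 0.10. Inside the notebook, the same ranking could be obtained without copying values by pairing the fitted tree's `feature_importances_` with `x_trg.columns`.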
