<h1> Energy Efficiency of Buildings </h1>
<br>
    <q>This notebook has been made for the project at<a href="https://app.patika.dev/moduller/dspg-projeleri/binalar%C4%B1n_enerji_verimliligi"> patika.dev </a>.

<h3> Dataset </h3>
<br>
The data used in the project was obtained from the <a href="https://www.kaggle.com/elikplim/eergy-efficiency-dataset"> Energy Efficiency Dataset</a>.
<br><br>
The dataset contains eight attributes (or features, denoted by X1…X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.



In [177]:
#libraries import
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
import math

<h2> Download and Read Data </h2>

In [7]:
#reading data
df = pd.read_csv('ENB2012_data.csv')
df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.90,563.5,318.5,122.50,7.0,2,0.0,0,20.84,28.28
...,...,...,...,...,...,...,...,...,...,...
763,0.64,784.0,343.0,220.50,3.5,5,0.4,5,17.88,21.40
764,0.62,808.5,367.5,220.50,3.5,2,0.4,5,16.54,16.88
765,0.62,808.5,367.5,220.50,3.5,3,0.4,5,16.44,17.11
766,0.62,808.5,367.5,220.50,3.5,4,0.4,5,16.48,16.61


In [9]:
#detecting missing values
print(df.isnull().sum())

X1    0
X2    0
X3    0
X4    0
X5    0
X6    0
X7    0
X8    0
Y1    0
Y2    0
dtype: int64


We don't have any missing values.
<br>
<br>
We don't have any categorical variable.

In [22]:
#checking for outliers

#z-score
from scipy import stats as st
z = np.abs(st.zscore(df))
print(np.where(z > 3))

(array([], dtype=int64), array([], dtype=int64))


We don't have any outliers.

<h2> Split Data </h2>

In [26]:
#train-validation-test
#I will split the data into 3 as 60% train, 20% validation and 20% test.

train, validate, test = np.split(df.sample(frac=1, random_state=42),
                                [int(.6*len(df)), int(.8*len(df))])
print("train " , train)
print("validate " , validate)
print("test " , test)

train         X1     X2     X3      X4   X5  X6    X7  X8     Y1     Y2
668  0.62  808.5  367.5  220.50  3.5   2  0.40   3  16.47  16.90
324  0.66  759.5  318.5  220.50  3.5   2  0.25   1  13.17  16.39
624  0.98  514.5  294.0  110.25  7.0   2  0.40   3  32.82  32.78
690  0.79  637.0  343.0  147.00  7.0   4  0.40   4  41.32  46.23
473  0.64  784.0  343.0  220.50  3.5   3  0.25   4  16.69  19.76
..    ...    ...    ...     ...  ...  ..   ...  ..    ...    ...
190  0.62  808.5  367.5  220.50  3.5   4  0.10   3  12.71  14.14
115  0.79  637.0  343.0  147.00  7.0   5  0.10   2  36.03  42.86
732  0.82  612.5  318.5  147.00  7.0   2  0.40   5  30.00  29.93
467  0.69  735.0  294.0  220.50  3.5   5  0.25   4  12.86  16.13
94   0.62  808.5  367.5  220.50  3.5   4  0.10   1  12.93  14.33

[460 rows x 10 columns]
validate         X1     X2     X3      X4   X5  X6    X7  X8     Y1     Y2
180  0.66  759.5  318.5  220.50  3.5   2  0.10   3  11.59  13.46
301  0.82  612.5  318.5  147.00  7.0   3  0.25  

In [34]:
#now i split x columns and y columns
x_train = train.iloc[:,:-2]
y_train = train.iloc[:,8:]
x_valid = validate.iloc[:,:-2]
y_valid = validate.iloc[:,8:]
x_test = test.iloc[:,:-2]
y_test = test.iloc[:,8:]

<h2> print_score function </h2>

In [79]:
def rmse(x,y):
    return math.sqrt(((x-y)**2).mean())

In [102]:
def print_score(m):
    m.fit(x_train,y_train)
    
    print(f"R^2 of train set: {m.score(x_train, y_train)}")
    print(f"R^2 of validation set: {m.score(x_valid, y_valid)}")

<h2> Random Forest Regressor </h2>
<p> I will try to find the RandomForestRegressor model that gives the best r^2 score by changing the max_features and max_leaf_nodes parameters with the tree numbers given below. </p>

<h3> m10 = 10 tree </h3>

In [103]:
m = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_features = .5, max_leaf_nodes=10)
m.fit(x_train, y_train)
m.score(x_train, y_train)
m.score(x_valid, y_valid)


0.9605775307517177

In [128]:
#max leaf and features=0.5
m1_1 = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_features = .5)
#max leaf and max features
m1_2 = RandomForestRegressor(n_estimators=10, n_jobs=-1)
#25 leaf and max faeatures
m1_3 = RandomForestRegressor(n_estimators=10, n_jobs=-1,max_leaf_nodes=25)
#50 leaf and features=0.75
m1_4 = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_leaf_nodes=50
                            , max_features = .75)
#50 leaf and features=0.25
m1_5 = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_leaf_nodes=100
                            , max_features = .25)

In [105]:
print_score(m1_1)

R^2 of train set: 0.9880806518514571
R^2 of validation set: 0.9600077697798572


In [129]:
print_score(m1_1)
print("*********************")
print_score(m1_2)
print("*********************")
print_score(m1_3)
print("*********************")
print_score(m1_4)
print("*********************")
print_score(m1_5)

R^2 of train set: 0.996398188164646
R^2 of validation set: 0.9734927763944191
*********************
R^2 of train set: 0.9956796952779898
R^2 of validation set: 0.9741291103935839
*********************
R^2 of train set: 0.9818708496616224
R^2 of validation set: 0.9787318832100618
*********************
R^2 of train set: 0.9905824959870366
R^2 of validation set: 0.9766056776390311
*********************
R^2 of train set: 0.9930593220515449
R^2 of validation set: 0.97793353367846


Since the m1_3 model gives the best r^2 score, I will use this as the RandomForestRegressor model to compare with other n_estimators parameters.

In [126]:
m10 = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_leaf_nodes=25)

<h3> m20 = 20 tree </h3>

In [130]:
#max leaf and features=0.5
m2_1 = RandomForestRegressor(n_estimators=20, n_jobs=-1, max_features = .5)
#max leaf and max features
m2_2 = RandomForestRegressor(n_estimators=20, n_jobs=-1)
#25 leaf and max faeatures
m2_3 = RandomForestRegressor(n_estimators=20, n_jobs=-1,max_leaf_nodes=25)
#50 leaf and features=0.75
m2_4 = RandomForestRegressor(n_estimators=20, n_jobs=-1, max_leaf_nodes=50
                            , max_features = .75)
#50 leaf and features=0.25
m2_5 = RandomForestRegressor(n_estimators=20, n_jobs=-1, max_leaf_nodes=100
                            , max_features = .25)

In [131]:
print_score(m2_1)
print("*********************")
print_score(m2_2)
print("*********************")
print_score(m2_3)
print("*********************")
print_score(m2_4)
print("*********************")
print_score(m2_5)

R^2 of train set: 0.9965795836582008
R^2 of validation set: 0.977201593279623
*********************
R^2 of train set: 0.9959077707703345
R^2 of validation set: 0.9739533301954467
*********************
R^2 of train set: 0.9829721635770632
R^2 of validation set: 0.9794495003227996
*********************
R^2 of train set: 0.990392157483637
R^2 of validation set: 0.9777429407651048
*********************
R^2 of train set: 0.9932938749960539
R^2 of validation set: 0.9758166603401394


Since the m2_3 model gives the best r^2 score, I will use this as the RandomForestRegressor model to compare with other n_estimators parameters.

In [132]:
m20 = RandomForestRegressor(n_estimators=20, n_jobs=-1,max_leaf_nodes=25)

<h3> m30 = 30 tree </h3>

In [133]:
#max leaf and features=0.5
m3_1 = RandomForestRegressor(n_estimators=30, n_jobs=-1, max_features = .5)
#max leaf and max features
m3_2 = RandomForestRegressor(n_estimators=30, n_jobs=-1)
#25 leaf and max faeatures
m3_3 = RandomForestRegressor(n_estimators=30, n_jobs=-1,max_leaf_nodes=25)
#50 leaf and features=0.75
m3_4 = RandomForestRegressor(n_estimators=30, n_jobs=-1, max_leaf_nodes=50
                            , max_features = .75)
#50 leaf and features=0.25
m3_5 = RandomForestRegressor(n_estimators=30, n_jobs=-1, max_leaf_nodes=100
                            , max_features = .25)

In [134]:
print_score(m3_1)
print("*********************")
print_score(m3_2)
print("*********************")
print_score(m3_3)
print("*********************")
print_score(m3_4)
print("*********************")
print_score(m3_5)

R^2 of train set: 0.9966151666520702
R^2 of validation set: 0.9774469542947077
*********************
R^2 of train set: 0.9966663650164855
R^2 of validation set: 0.9732543290498091
*********************
R^2 of train set: 0.9827817891602917
R^2 of validation set: 0.9794545534801269
*********************
R^2 of train set: 0.9905057111603782
R^2 of validation set: 0.9789023445378096
*********************
R^2 of train set: 0.9939571755429409
R^2 of validation set: 0.9778706237916464


Since the m3_3 model gives the best r^2 score, I will use this as the RandomForestRegressor model to compare with other n_estimators parameters.

In [140]:
m30 = RandomForestRegressor(n_estimators=30, n_jobs=-1,max_leaf_nodes=25)

<h3> m40 = 40 tree </h3>

In [137]:
#max leaf and features=0.5
m4_1 = RandomForestRegressor(n_estimators=40, n_jobs=-1, max_features = .5)
#max leaf and max features
m4_2 = RandomForestRegressor(n_estimators=40, n_jobs=-1)
#25 leaf and max faeatures
m4_3 = RandomForestRegressor(n_estimators=40, n_jobs=-1,max_leaf_nodes=25)
#50 leaf and features=0.75
m4_4 = RandomForestRegressor(n_estimators=40, n_jobs=-1, max_leaf_nodes=50
                            , max_features = .75)
#50 leaf and features=0.25
m4_5 = RandomForestRegressor(n_estimators=40, n_jobs=-1, max_leaf_nodes=100
                            , max_features = .25)

In [145]:
print_score(m4_1)
print("*********************")
print_score(m4_2)
print("*********************")
print_score(m4_3)
print("*********************")
print_score(m4_4)
print("*********************")
print_score(m4_5)

R^2 of train set: 0.9970101612169089
R^2 of validation set: 0.9779036791002965
*********************
R^2 of train set: 0.9967047363570607
R^2 of validation set: 0.9719624614480764
*********************
R^2 of train set: 0.9831239082823453
R^2 of validation set: 0.9794045298082368
*********************
R^2 of train set: 0.9911408883781501
R^2 of validation set: 0.9782221447823787
*********************
R^2 of train set: 0.9938664194397182
R^2 of validation set: 0.9780947045352669


Since the m4_3 model gives the best r^2 score, I will use this as the RandomForestRegressor model to compare with other n_estimators parameters.

In [139]:
m40 = RandomForestRegressor(n_estimators=40, n_jobs=-1,max_leaf_nodes=25)

<h3> m50 = 50 tree </h3>

In [141]:
#max leaf and features=0.5
m5_1 = RandomForestRegressor(n_estimators=50, n_jobs=-1, max_features = .5)
#max leaf and max features
m5_2 = RandomForestRegressor(n_estimators=50, n_jobs=-1)
#25 leaf and max faeatures
m5_3 = RandomForestRegressor(n_estimators=50, n_jobs=-1,max_leaf_nodes=25)
#50 leaf and features=0.75
m5_4 = RandomForestRegressor(n_estimators=50, n_jobs=-1, max_leaf_nodes=50
                            , max_features = .75)
#50 leaf and features=0.25
m5_5 = RandomForestRegressor(n_estimators=50, n_jobs=-1, max_leaf_nodes=100
                            , max_features = .25)

In [150]:
print_score(m5_1)
print("*********************")
print_score(m5_2)
print("*********************")
print_score(m5_3)
print("*********************")
print_score(m5_4)
print("*********************")
print_score(m5_5)

R^2 of train set: 0.996968217149897
R^2 of validation set: 0.9773810464088855
*********************
R^2 of train set: 0.9968864307077103
R^2 of validation set: 0.9737485567811439
*********************
R^2 of train set: 0.9830502035408093
R^2 of validation set: 0.9790537537605919
*********************
R^2 of train set: 0.9908092731640654
R^2 of validation set: 0.9772870611388831
*********************
R^2 of train set: 0.994263654475785
R^2 of validation set: 0.9784521127184547


Since the m5_3 model gives the best r^2 score, I will use this as the RandomForestRegressor model to compare with other n_estimators parameters.

In [147]:
m50 = RandomForestRegressor(n_estimators=50, n_jobs=-1,max_leaf_nodes=25)

Now I want to see how much the score changes as I increase the number of trees more.
I want to see the difference by doing 100 and 150.

In [152]:
#max leaf and features=0.5
m6_1 = RandomForestRegressor(n_estimators=150, n_jobs=-1, max_features = .5)
#max leaf and max features
m6_2 = RandomForestRegressor(n_estimators=150, n_jobs=-1)
#25 leaf and max faeatures
m6_3 = RandomForestRegressor(n_estimators=150, n_jobs=-1,max_leaf_nodes=25)
#50 leaf and features=0.75
m6_4 = RandomForestRegressor(n_estimators=150, n_jobs=-1, max_leaf_nodes=50
                            , max_features = .75)
#50 leaf and features=0.25
m6_5 = RandomForestRegressor(n_estimators=150, n_jobs=-1, max_leaf_nodes=100
                            , max_features = .25)

In [153]:
print_score(m6_1)
print("*********************")
print_score(m6_2)
print("*********************")
print_score(m6_3)
print("*********************")
print_score(m6_4)
print("*********************")
print_score(m6_5)

R^2 of train set: 0.9972570184041105
R^2 of validation set: 0.9781843825146721
*********************
R^2 of train set: 0.9968548125649741
R^2 of validation set: 0.9736957508432196
*********************
R^2 of train set: 0.9831335710086467
R^2 of validation set: 0.9791776635751221
*********************
R^2 of train set: 0.9908222725249729
R^2 of validation set: 0.9780670719063431
*********************
R^2 of train set: 0.9944581344006764
R^2 of validation set: 0.9789555512335587


In [154]:
#max leaf and features=0.5
m7_1 = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_features = .5)
#max leaf and max features
m7_2 = RandomForestRegressor(n_estimators=100, n_jobs=-1)
#25 leaf and max faeatures
m7_3 = RandomForestRegressor(n_estimators=100, n_jobs=-1,max_leaf_nodes=25)
#50 leaf and features=0.75
m7_4 = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_leaf_nodes=50
                            , max_features = .75)
#50 leaf and features=0.25
m7_5 = RandomForestRegressor(n_estimators=100, n_jobs=-1, max_leaf_nodes=100
                            , max_features = .25)

In [155]:
print_score(m6_1)
print("*********************")
print_score(m6_2)
print("*********************")
print_score(m6_3)
print("*********************")
print_score(m6_4)
print("*********************")
print_score(m6_5)

R^2 of train set: 0.9972594729648755
R^2 of validation set: 0.9777179336590003
*********************
R^2 of train set: 0.9970276576129611
R^2 of validation set: 0.9739625175487712
*********************
R^2 of train set: 0.9830146770698956
R^2 of validation set: 0.9794421370362979
*********************
R^2 of train set: 0.9907849117883933
R^2 of validation set: 0.9776211952697216
*********************
R^2 of train set: 0.9945044751484262
R^2 of validation set: 0.978751544376911


When I hit 100, I scored higher than both 150 and 50. Now I will try to find the number of trees trying to give the best score in between. I will only use this combination as it always scores better than the 3rd combination.

In [158]:
md1 = RandomForestRegressor(n_estimators=100, n_jobs=-1,max_leaf_nodes=25)
md2 = RandomForestRegressor(n_estimators=90, n_jobs=-1,max_leaf_nodes=25)
md3 = RandomForestRegressor(n_estimators=80, n_jobs=-1,max_leaf_nodes=25)
md4 = RandomForestRegressor(n_estimators=70, n_jobs=-1,max_leaf_nodes=25)
md5 = RandomForestRegressor(n_estimators=60, n_jobs=-1,max_leaf_nodes=25)
md6 = RandomForestRegressor(n_estimators=50, n_jobs=-1,max_leaf_nodes=25)
md7 = RandomForestRegressor(n_estimators=110, n_jobs=-1,max_leaf_nodes=25)
md8 = RandomForestRegressor(n_estimators=120, n_jobs=-1,max_leaf_nodes=25)
md9 = RandomForestRegressor(n_estimators=130, n_jobs=-1,max_leaf_nodes=25)
md10 = RandomForestRegressor(n_estimators=140, n_jobs=-1,max_leaf_nodes=25)
md11 = RandomForestRegressor(n_estimators=150, n_jobs=-1,max_leaf_nodes=25)

In [159]:
print_score(md1)
print("*********************")
print_score(md2)
print("*********************")
print_score(md3)
print("*********************")
print_score(md4)
print("*********************")
print_score(md5)
print("*********************")
print_score(md6)
print("*********************")
print_score(md7)
print("*********************")
print_score(md8)
print("*********************")
print_score(md9)
print("*********************")
print_score(md10)

R^2 of train set: 0.9827912587939376
R^2 of validation set: 0.9794077442809037
*********************
R^2 of train set: 0.9831573214731825
R^2 of validation set: 0.9794146984510304
*********************
R^2 of train set: 0.9829945580993059
R^2 of validation set: 0.979248110099213
*********************
R^2 of train set: 0.9828485837962101
R^2 of validation set: 0.9794819080811036
*********************
R^2 of train set: 0.9827466263646187
R^2 of validation set: 0.9790178137440475
*********************
R^2 of train set: 0.9831320408327603
R^2 of validation set: 0.9789915769946071
*********************
R^2 of train set: 0.9829965170575033
R^2 of validation set: 0.9794199902789479
*********************
R^2 of train set: 0.9831393682105383
R^2 of validation set: 0.9794300797355942
*********************
R^2 of train set: 0.9829989088848086
R^2 of validation set: 0.9793052925211027
*********************
R^2 of train set: 0.9830923317073713
R^2 of validation set: 0.9792570648366592


Model 4 performed better than the others.

In [160]:
m70 = RandomForestRegressor(n_estimators=70, n_jobs=-1,max_leaf_nodes=25)

<h3> Best RandomForestRegressor Model Selection </h3>

In [169]:
print_score(m10)
print("*********************")
print_score(m20)
print("*********************")
print_score(m30)
print("*********************")
print_score(m40)
print("*********************")
print_score(m50)
print("*********************")
print_score(m70)


R^2 of train set: 0.9824256656174482
R^2 of validation set: 0.9778517570620489
*********************
R^2 of train set: 0.9826153193834573
R^2 of validation set: 0.9787115066820639
*********************
R^2 of train set: 0.9829090320149236
R^2 of validation set: 0.979144110933814
*********************
R^2 of train set: 0.9829008335073481
R^2 of validation set: 0.9787572945377525
*********************
R^2 of train set: 0.9828700781492679
R^2 of validation set: 0.979132994555967
*********************
R^2 of train set: 0.9830878804311725
R^2 of validation set: 0.9793358161689694


The m50 and m70 models give very close scores to each other. Sometimes one gives a better score, sometimes the other.
That's why I want to compare both of them with the test data in the final stage.

In [170]:
rfr_model1 = m50
rfr_model2 = m70

<h2> Linear Regression </h2>

In [174]:
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)
print_score(lr_model)

R^2 of train set: 0.9024829375666665
R^2 of validation set: 0.9032843577136344


<h2> Compare these 3 models </h2>

In [182]:
print("RandomForestRegressor with 50 n_estimators")
print()
print_score(rfr_model1)
print()
print("*********************")
print()
print("RandomForestRegressor with 70 n_estimators")
print()
print_score(rfr_model2)
print()
print("*********************")
print()
print("Linear Regression")
print()
print_score(lr_model)
print()

RandomForestRegressor with 50 n_estimators

R^2 of train set: 0.9830105146442695
R^2 of validation set: 0.9788176469804055

*********************

RandomForestRegressor with 70 n_estimators

R^2 of train set: 0.9828391996567383
R^2 of validation set: 0.9791808105921496

*********************

Linear Regression

R^2 of train set: 0.9024829375666665
R^2 of validation set: 0.9032843577136344



Now let's look at the test data:

In [183]:
def print_score2(m):
    m.fit(x_train,y_train)
    
    print(f"R^2 of train set: {m.score(x_train, y_train)}")
    print(f"R^2 of validation set: {m.score(x_valid, y_valid)}")
    print(f"R^2 of test set: {m.score(x_test, y_test)}")

In [192]:
print("RandomForestRegressor with 50 n_estimators")
print()
print_score2(rfr_model1)
print()
print("*********************")
print()
print("RandomForestRegressor with 70 n_estimators")
print()
print_score2(rfr_model2)
print()
print("*********************")
print()
print("Linear Regression")
print()
print_score2(lr_model)
print()

RandomForestRegressor with 50 n_estimators

R^2 of train set: 0.9825221850974113
R^2 of validation set: 0.9786415052921925
R^2 of test set: 0.9795738581796989

*********************

RandomForestRegressor with 70 n_estimators

R^2 of train set: 0.9828634388750188
R^2 of validation set: 0.9793689059714474
R^2 of test set: 0.9801043053275207

*********************

Linear Regression

R^2 of train set: 0.9024829375666665
R^2 of validation set: 0.9032843577136344
R^2 of test set: 0.8907347170836288



Again, the RFR model, which we have given as n_estimators 50 and 70, gives values close to each other. Linear Regression performed approximately 10% worse than RFR.

As a result, we have a model that predicts the result with 98% accuracy.