# Data Description

We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ in glazing area, glazing area distribution, and orientation, among other parameters. Simulating various settings as functions of these characteristics yields 768 building configurations. The dataset therefore comprises 768 samples and 8 features, and the goal is to predict two real-valued responses. It can also be treated as a multi-class classification problem if each response is rounded to the nearest integer.
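As a rough illustration of the classification framing, the sketch below rounds a few response values to the nearest integer so that each value can serve as a class label. The four numbers are taken from the first rows of the dataset purely as examples, and the names `response` and `response_class` are hypothetical, not part of the original analysis.

```python
import pandas as pd

# A few example response values (illustrative only)
response = pd.Series([15.55, 20.84, 21.33, 28.28], name="response")

# Round to the nearest integer and cast to int so each value acts as a class label
response_class = response.round().astype(int)
print(response_class.tolist())
```

Note that this can produce many sparsely populated classes, so the regression framing used in the rest of this notebook is usually the more natural one.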


The data can be accessed through:
https://archive.ics.uci.edu/ml/datasets/Energy+efficiency#

### Attribute Information:

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses. 

Specifically: 

X1:	Relative Compactness 

X2:	Surface Area 

X3:	Wall Area 

X4:	Roof Area 

X5:	Overall Height 

X6:	Orientation 

X7:	Glazing Area 

X8:	Glazing Area Distribution 

y1:	Heating Load 

y2:	Cooling Load

To predict each response, a linear regression model is built for each outcome in two ways: first with a single train/test split, then with cross-validation.

### Imported Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

### Data Import

In [3]:
df = pd.read_excel('ENB2012_data.xlsx')
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,Y1,Y2
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0,20.84,28.28


In [4]:
n_cols = ["relative_compactness", "surface_area", "wall_area", "roof_area", "overall_height", "orientation", "glazing_area", "glazing_area_distribution", "heating_load", "cooling_load"]
df.columns = n_cols
df.head()

Unnamed: 0,relative_compactness,surface_area,wall_area,roof_area,overall_height,orientation,glazing_area,glazing_area_distribution,heating_load,cooling_load
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0,20.84,28.28


### EDA

In [5]:
df.shape

(768, 10)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
relative_compactness         768 non-null float64
surface_area                 768 non-null float64
wall_area                    768 non-null float64
roof_area                    768 non-null float64
overall_height               768 non-null float64
orientation                  768 non-null int64
glazing_area                 768 non-null float64
glazing_area_distribution    768 non-null int64
heating_load                 768 non-null float64
cooling_load                 768 non-null float64
dtypes: float64(8), int64(2)
memory usage: 60.1 KB


In [7]:
df.describe()

Unnamed: 0,relative_compactness,surface_area,wall_area,roof_area,overall_height,orientation,glazing_area,glazing_area_distribution,heating_load,cooling_load
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,0.764167,671.708333,318.5,176.604167,5.25,3.5,0.234375,2.8125,22.307195,24.58776
std,0.105777,88.086116,43.626481,45.16595,1.75114,1.118763,0.133221,1.55096,10.090204,9.513306
min,0.62,514.5,245.0,110.25,3.5,2.0,0.0,0.0,6.01,10.9
25%,0.6825,606.375,294.0,140.875,3.5,2.75,0.1,1.75,12.9925,15.62
50%,0.75,673.75,318.5,183.75,5.25,3.5,0.25,3.0,18.95,22.08
75%,0.83,741.125,343.0,220.5,7.0,4.25,0.4,4.0,31.6675,33.1325
max,0.98,808.5,416.5,220.5,7.0,5.0,0.4,5.0,43.1,48.03


In [8]:
# pd.scatter_matrix was removed from the top-level namespace; use pd.plotting instead
_ = pd.plotting.scatter_matrix(df, figsize=(15, 15))


### Data Setup

In [9]:
X = df.iloc[:, 0:8]
X.head()

Unnamed: 0,relative_compactness,surface_area,wall_area,roof_area,overall_height,orientation,glazing_area,glazing_area_distribution
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0


In [10]:
h_l = df['heating_load']
h_l.head()

0    15.55
1    15.55
2    15.55
3    15.55
4    20.84
Name: heating_load, dtype: float64

In [11]:
c_l = df['cooling_load']
c_l.head()

0    21.33
1    21.33
2    21.33
3    21.33
4    28.28
Name: cooling_load, dtype: float64

## Linear Regression - Train Test Split

### Model Setup

In [13]:
# Data set preparation
X_h_train, X_h_test, h_l_train, h_l_test = train_test_split(X, h_l, random_state=42)
X_c_train, X_c_test, c_l_train, c_l_test = train_test_split(X, c_l, random_state=42)

# Regression models
reg_h_l = LinearRegression()
reg_c_l = LinearRegression()

# Model fitting
reg_h_l.fit(X_h_train, h_l_train)
reg_c_l.fit(X_c_train, c_l_train)

# Model scores
model_h_l = reg_h_l.score(X_h_test, h_l_test)
model_c_l = reg_c_l.score(X_c_test, c_l_test)

print('R squared: {} for heating load model'.format(model_h_l))
print('R squared: {} for cooling load model'.format(model_c_l))

R squared: 0.9146441887306919 for heating load model
R squared: 0.8916910572971993 for cooling load model
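Because both calls to `train_test_split` above use the same `X` and the same `random_state`, they select the same rows for training and testing. One possible simplification, sketched here on synthetic data rather than the real dataset, is to split once and fit a single `LinearRegression` on a two-column target. The names `X_demo`, `Y_demo`, `reg_both`, and `r2_per_target` are hypothetical, and the numbers this sketch prints are unrelated to the scores reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 200 samples, 8 features, 2 responses
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
true_coefs = rng.normal(size=(8, 2))
Y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=(200, 2))

# One split covers both responses, so train and test rows are shared
X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, random_state=42
)

# LinearRegression accepts a 2-D target and fits one coefficient row per column
reg_both = LinearRegression().fit(X_train, Y_train)

# multioutput="raw_values" returns one R squared value per response column
r2_per_target = r2_score(
    Y_test, reg_both.predict(X_test), multioutput="raw_values"
)
print(r2_per_target)
```

For ordinary least squares, fitting the two targets jointly gives the same coefficients as fitting them separately, so this changes only the amount of code, not the models.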


## Linear Regression - Cross Validation

### Model Setup

In [14]:
# Regression models
reg_h_l_cv = LinearRegression()
reg_c_l_cv = LinearRegression()

# Cross-validated R squared scores (5 folds)
cv_h_l = cross_val_score(reg_h_l_cv, X, h_l, cv=5)
cv_c_l = cross_val_score(reg_c_l_cv, X, c_l, cv=5)

print('R squared for heating load model: {}'.format(cv_h_l))
print('R squared for cooling load model: {}'.format(cv_c_l))

R squared for heating load model: [0.79620933 0.89341416 0.91723321 0.9248329  0.91352298]
R squared for cooling load model: [0.82910137 0.86053501 0.89021091 0.8961209  0.89301643]
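The per-fold scores are easier to compare once summarized. The first part of the sketch below averages the fold scores reported above. The second part is an assumption-laden variation: with an integer `cv`, `cross_val_score` uses unshuffled folds for a regressor, and the rows of this dataset appear to be ordered, which may partly explain the lower first-fold score. Passing a shuffled `KFold` is one way to check this. The synthetic data and the names `X_demo`, `y_demo`, `kfold`, and `shuffled_scores` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Fold scores copied from the output above
cv_h_l_reported = np.array([0.79620933, 0.89341416, 0.91723321, 0.9248329, 0.91352298])
cv_c_l_reported = np.array([0.82910137, 0.86053501, 0.89021091, 0.8961209, 0.89301643])

print('Heating load: mean {:.3f}, std {:.3f}'.format(
    cv_h_l_reported.mean(), cv_h_l_reported.std()))
print('Cooling load: mean {:.3f}, std {:.3f}'.format(
    cv_c_l_reported.mean(), cv_c_l_reported.std()))

# Synthetic stand-in data to show shuffled K-fold (not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.1, size=200)

# shuffle=True randomizes row order before the folds are formed
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold)
print('Shuffled CV mean R squared: {:.3f}'.format(shuffled_scores.mean()))
```

On the real data, `cv=kfold` would replace `cv=5` in the calls above; whether that raises the first-fold score is something to verify rather than assume.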
