
# Machine Learning: Regression - Predicting Energy Efficiency of Buildings




## Introduction

According to the United Nations Environment Programme (UNEP) Sustainable Buildings and Climate Initiative, the construction trade contributes as much as 30% of all global greenhouse gas emissions and consumes up to 40% of all energy used worldwide. Climate change is currently having a powerful impact on how buildings are designed and constructed. In this project, I will develop a multiple linear regression model to study this effect.

## Dataset
The dataset for this project is the [Appliances Energy Prediction data](https://archive.ics.uci.edu/ml/machine-learning-databases/00374/). The data was recorded at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions about every 3.3 minutes. The wireless data was then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column.

Two random variables have been included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters).

**Below is a description of the columns I'll work with from the dataset:**

- **Date**: time year-month-day hour:minute:second
- **Appliances**: energy use in Wh
- **Lights**: energy use of light fixtures in the house in Wh
- **T1**: Temperature in kitchen area, in Celsius
- **RH_1**: Humidity in kitchen area, in %
- **T2**: Temperature in living room area, in Celsius
- **RH_2**: Humidity in living room area, in %
- **T3**: Temperature in laundry room area, in Celsius
- **RH_3**: Humidity in laundry room area, in %
- **T4**: Temperature in office room, in Celsius
- **RH_4**: Humidity in office room, in %
- **T5**: Temperature in bathroom, in Celsius
- **RH_5**: Humidity in bathroom, in %
- **T6**: Temperature outside the building (north side), in Celsius
- **RH_6**: Humidity outside the building (north side), in %
- **T7**: Temperature in ironing room, in Celsius
- **RH_7**: Humidity in ironing room, in %
- **T8**: Temperature in teenager room 2, in Celsius
- **RH_8**: Humidity in teenager room 2, in %
- **T9**: Temperature in parents room, in Celsius
- **RH_9**: Humidity in parents room, in %
- **To**: Temperature outside (from Chievres weather station), in Celsius
- **Pressure** (from Chievres weather station), in mm Hg
- **RH_out**: Humidity outside (from Chievres weather station), in %
- **Wind speed** (from Chievres weather station), in m/s
- **Visibility** (from Chievres weather station), in km
- **Tdewpoint** (from Chievres weather station), in °C
- **rv1**: Random variable 1, nondimensional
- **rv2**: Random variable 2, nondimensional

**NOTE** : 
To answer some questions, you will need to normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a  random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set.


## Opening and Exploring the Data

Let's start by opening the dataset and then explore it.

In [1]:
#read dataset from github into a dataframe
import numpy as np
import pandas as pd
url = 'https://github.com/doyinsolamiolaoye/hamoye_internship/blob/master/energydata_complete.csv?raw=true'
df = pd.read_csv(url)

In [2]:
#number of rows and columns of the dataset
df.shape

(19735, 29)

In [3]:
#column labels
df.columns

Index(['date', 'Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3',
       'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
       'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
       'Visibility', 'Tdewpoint', 'rv1', 'rv2'],
      dtype='object')

In [4]:
#show the first 5 rows of the dataset
df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,55.2,7.026667,84.256667,17.2,41.626667,18.2,48.9,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,55.2,6.833333,84.063333,17.2,41.56,18.2,48.863333,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,55.09,6.56,83.156667,17.2,41.433333,18.2,48.73,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,55.09,6.433333,83.423333,17.133333,41.29,18.1,48.59,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,55.09,6.366667,84.893333,17.2,41.23,18.1,48.59,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [5]:
df.tail()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
19730,2016-05-27 17:20:00,100,0,25.566667,46.56,25.89,42.025714,27.2,41.163333,24.7,45.59,23.2,52.4,24.796667,1.0,24.5,44.5,24.7,50.074,23.2,46.79,22.733333,755.2,55.666667,3.333333,23.666667,13.333333,43.096812,43.096812
19731,2016-05-27 17:30:00,90,0,25.5,46.5,25.754,42.08,27.133333,41.223333,24.7,45.59,23.23,52.326667,24.196667,1.0,24.557143,44.414286,24.7,49.79,23.2,46.79,22.6,755.2,56.0,3.5,24.5,13.3,49.28294,49.28294
19732,2016-05-27 17:40:00,270,10,25.5,46.596667,25.628571,42.768571,27.05,41.69,24.7,45.73,23.23,52.266667,23.626667,1.0,24.54,44.4,24.7,49.66,23.2,46.79,22.466667,755.2,56.333333,3.666667,25.333333,13.266667,29.199117,29.199117
19733,2016-05-27 17:50:00,420,10,25.5,46.99,25.414,43.036,26.89,41.29,24.7,45.79,23.2,52.2,22.433333,1.0,24.5,44.295714,24.6625,49.51875,23.2,46.8175,22.333333,755.2,56.666667,3.833333,26.166667,13.233333,6.322784,6.322784
19734,2016-05-27 18:00:00,430,10,25.5,46.6,25.264286,42.971429,26.823333,41.156667,24.7,45.963333,23.2,52.2,21.026667,1.0,24.5,44.054,24.736,49.736,23.2,46.845,22.2,755.2,57.0,4.0,27.0,13.2,34.118851,34.118851


In [6]:
df.describe()

Unnamed: 0,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,19.592106,50.949283,7.910939,54.609083,20.267106,35.3882,22.029107,42.936165,19.485828,41.552401,7.411665,755.522602,79.750418,4.039752,38.330834,3.760707,24.988033,24.988033
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,1.844623,9.022034,6.090347,31.149806,2.109993,5.114208,1.956162,5.224361,2.014712,4.151497,5.317409,7.399441,14.901088,2.451221,11.794719,4.194648,14.496634,14.496634
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,15.33,29.815,-6.065,1.0,15.39,23.2,16.306667,29.6,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,18.2775,45.4,3.626667,30.025,18.7,31.5,20.79,39.066667,18.0,38.5,3.666667,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,19.39,49.09,7.3,55.29,20.033333,34.863333,22.1,42.375,19.39,40.9,6.916667,756.1,83.666667,3.666667,40.0,3.433333,24.897653,24.897653
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,20.619643,53.663333,11.256,83.226667,21.6,39.0,23.39,46.536,20.6,44.338095,10.408333,760.933333,91.666667,5.5,40.0,6.566667,37.583769,37.583769
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,25.795,96.321667,28.29,99.9,26.0,51.4,27.23,58.78,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653


I will drop the columns that are not needed for this project:

In [7]:
df.drop(['lights','date'], axis=1,inplace=True)

## Solution to Quiz

In [8]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [9]:
#Firstly, we normalise our dataset to a common scale using the min max scaler
scaler = MinMaxScaler()
normalised_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
normalised_df.head()

Unnamed: 0,Appliances,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,0.046729,0.32735,0.566187,0.225345,0.684038,0.215188,0.746066,0.351351,0.764262,0.175506,0.381691,0.38107,0.841827,0.170594,0.653428,0.173329,0.661412,0.223032,0.67729,0.37299,0.097674,0.894737,0.5,0.953846,0.538462,0.265449,0.265449
1,0.046729,0.32735,0.541326,0.225345,0.68214,0.215188,0.748871,0.351351,0.782437,0.175506,0.381691,0.375443,0.839872,0.170594,0.651064,0.173329,0.660155,0.2265,0.678532,0.369239,0.1,0.894737,0.47619,0.894872,0.533937,0.372083,0.372083
2,0.037383,0.32735,0.530502,0.225345,0.679445,0.215188,0.755569,0.344745,0.778062,0.175506,0.380037,0.367487,0.830704,0.170594,0.646572,0.173329,0.655586,0.219563,0.676049,0.365488,0.102326,0.894737,0.452381,0.835897,0.529412,0.572848,0.572848
3,0.037383,0.32735,0.52408,0.225345,0.678414,0.215188,0.758685,0.341441,0.770949,0.175506,0.380037,0.3638,0.833401,0.16431,0.641489,0.164175,0.650788,0.219563,0.671909,0.361736,0.104651,0.894737,0.428571,0.776923,0.524887,0.908261,0.908261
4,0.046729,0.32735,0.531419,0.225345,0.676727,0.215188,0.758685,0.341441,0.762697,0.178691,0.380037,0.361859,0.848264,0.170594,0.639362,0.164175,0.650788,0.219563,0.671909,0.357985,0.106977,0.894737,0.404762,0.717949,0.520362,0.201611,0.201611
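
As a quick sanity check on the scaled values above, the min-max formula can be applied by hand. This is a minimal sketch, assuming the standard formula `(x - min) / (max - min)`; the helper name `min_max_scale` is my own and is not part of scikit-learn. The inputs are the first-row `Appliances` value (60) and the column minimum and maximum (10 and 1080) from the `df.describe()` output.

```python
def min_max_scale(value, col_min, col_max):
    """Rescale a single value to [0, 1] given its column's min and max."""
    return (value - col_min) / (col_max - col_min)

# Appliances: first-row value, then column min and max from df.describe()
first_row_scaled = min_max_scale(60.0, 10.0, 1080.0)
print(round(first_row_scaled, 6))  # 0.046729
```

The result agrees with the first `Appliances` entry of `normalised_df`.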


### Question 12 - From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value to two decimal places?

Simple linear regression

In [10]:
#Split the Data into predictors (X) and target variables (y)
X = normalised_df[['T2']]
y = normalised_df[['T6']]

# splitting data into training and testing set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

reg = LinearRegression()
reg.fit(X_train, y_train)

pred = reg.predict(X_test)

#calculate the r-squared value to 2dp
round(r2_score(y_test, pred), 2)

0.64

Multiple linear regression

In [11]:
#Split the Data into predictors (X) and target variables (y)
X = normalised_df.drop(columns=['Appliances'])
y = normalised_df['Appliances']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

reg_multi = LinearRegression()
reg_multi.fit(X_train, y_train)

pred = reg_multi.predict(X_test)

### Question 13 - What is the Mean Absolute Error (in two decimal places)?

In [12]:
mae = mean_absolute_error(y_test, pred)
round(mae, 2)

0.05
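
Because the target was min-max scaled, this error is in normalised units rather than Wh. A rough sketch of the conversion back, assuming the `Appliances` range of 10 to 1080 Wh from `df.describe()`: an absolute error is not affected by the offset, so only the range matters. The helper `scaled_error_to_wh` is a hypothetical name of mine, and since 0.05 is itself rounded, the result is only approximate.

```python
# Column range taken from the df.describe() output for Appliances
APPLIANCES_MIN_WH = 10.0
APPLIANCES_MAX_WH = 1080.0

def scaled_error_to_wh(scaled_error):
    """Convert an absolute error on the min-max scale back to Wh."""
    return scaled_error * (APPLIANCES_MAX_WH - APPLIANCES_MIN_WH)

mae_wh = scaled_error_to_wh(0.05)  # 0.05 is the rounded MAE reported above
print(f"MAE is roughly {mae_wh:.1f} Wh")
```

So a normalised MAE of 0.05 corresponds to an average miss of somewhere around 50 Wh per 10-minute reading.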

### Question 14 - What is the Residual Sum of Squares (in two decimal places)?

In [13]:
rss = np.sum(np.square(y_test - pred))
round(rss, 2)

45.35

### Question 15 - What is the Root Mean Squares Error (in three decimal places)?

In [14]:
rmse = np.sqrt(mean_squared_error(y_test, pred))
round(rmse, 3)

0.088
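
The RMSE can be cross-checked against the residual sum of squares, since RMSE = sqrt(RSS / n) where n is the number of test rows. The sketch below assumes that `train_test_split` rounds the test size up (I believe it does, but treat that as an assumption) and uses the rounded RSS of 45.35 reported earlier.

```python
import math

n_rows = 19735                     # from df.shape
n_test = math.ceil(0.3 * n_rows)   # assumed rounding rule for a 30% test split
rss = 45.35                        # rounded RSS reported above

# RMSE is the square root of the mean squared residual
rmse_from_rss = math.sqrt(rss / n_test)
print(n_test, round(rmse_from_rss, 3))
```

Under these assumptions the value rounds to 0.088, which is consistent with the RMSE computed from `mean_squared_error`.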

### Question 16 - What is the Coefficient of Determination (in two decimal places)?


In [15]:
r2_score_2 = r2_score(y_test, pred)
round(r2_score_2, 2)

0.15
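
To make the meaning of this number concrete, here is a minimal sketch of the usual definition, R^2 = 1 - RSS / TSS, where TSS is the total sum of squares around the mean of the true values. The function `r_squared` and the toy lists are illustrative assumptions of mine, not values from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - rss / tss

# Toy values for illustration only
y_true_toy = [3.0, -0.5, 2.0, 7.0]
y_pred_toy = [2.5, 0.0, 2.0, 8.0]
toy_r2 = r_squared(y_true_toy, y_pred_toy)
print(round(toy_r2, 3))  # 0.949
```

Read this way, a score of 0.15 means the model's residuals are only about 15% smaller, in squared terms, than those of a model that always predicts the mean.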

### Question 17 - Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

In [16]:
#comparing the effects of regularisation
def get_weights_df(model, feat, col_name):
    #this function returns the weight of every feature
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features', col_name]
    #weights_df[col_name] = weights_df[col_name].round(3)
    return weights_df

In [17]:
linear_model_weights = get_weights_df(reg_multi, X_train, 'Linear_Model_Weight')

linear_model_weights.sort_values('Linear_Model_Weight', ascending  = False)

| | Features | Linear_Model_Weight |
|---|---|---|
| 25 | RH_1 | 0.553547 |
| 24 | T3 | 0.290627 |
| 23 | T6 | 0.236425 |
| 22 | Tdewpoint | 0.117758 |
| 21 | T8 | 0.101995 |
| 20 | RH_3 | 0.096048 |
| 19 | RH_6 | 0.038049 |
| 18 | Windspeed | 0.029183 |
| 17 | T4 | 0.028981 |
| 16 | RH_4 | 0.026386 |


### Question 18 - Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

In [18]:
# Define the Ridge model with an alpha value of 0.4
ridge_reg = Ridge(alpha = 0.4)
ridge_reg.fit(X_train, y_train)

# Make predictions
ridge_reg_pred = ridge_reg.predict(X_test)

# RMSE
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge_reg_pred))
round(ridge_rmse, 3)

0.088

To three decimal places, this is the same value obtained with the linear regression model above: 0.088.

### Question 19 - Train a lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [19]:
# Define the Lasso model and call its fit method
lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(X_train, y_train)

# Call the get_weights_df function and pass the appropriate parameters
lasso_model_weights = get_weights_df(lasso_reg, X_train, 'Lasso_Model_Weight')

np.count_nonzero(lasso_model_weights['Lasso_Model_Weight'])

4

### Question 20 - What is the new RMSE with the Lasso Regression (in 3 decimal places)?

In [20]:
lasso_reg_pred = lasso_reg.predict(X_test)

# RMSE
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_reg_pred))
round(lasso_rmse, 3)

0.094

## Conclusion

It can be concluded from the analysis above that:
- There is no change in the root mean squared error (0.088 to three decimal places) between the linear regression and ridge regression models.
- The coefficient of determination of the multiple linear regression is very low (0.15), so the model does not fit the data well: on the test set it explains only about 15% of the variance in appliance energy use.