                                                      NNONA Ebuka John

                                                        Ebukajohnn


                            MACHINE LEARNING REGRESSION PREDICTING ENERGY EFFICIENCY OF BUILDINGS

The dataset for this quiz is the Appliances Energy Prediction data. The data set is at 10 min for about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions around 3.3 min. Then, the wireless data was averaged for 10 minutes periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru), and merged together with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and to filter out non predictive attributes (parameters).

https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv

Features
Date, time year-month-day hour:minute:second
Appliances, energy use in Wh
lights, energy use of light fixtures in the house in Wh
T1, Temperature in kitchen area, in Celsius
RH_1, Humidity in kitchen area, in %
T2, Temperature in living room area, in Celsius
RH_2, Humidity in living room area, in %
T3, Temperature in laundry room area
RH_3, Humidity in laundry room area, in %
T4, Temperature in office room, in Celsius
RH_4, Humidity in office room, in %
T5, Temperature in bathroom, in Celsius
RH_5, Humidity in bathroom, in %
T6, Temperature outside the building (north side), in Celsius
RH_6, Humidity outside the building (north side), in %
T7, Temperature in ironing room , in Celsius
RH_7, Humidity in ironing room, in %
T8, Temperature in teenager room 2, in Celsius
RH_8, Humidity in teenager room 2, in %
T9, Temperature in parents room, in Celsius
RH_9, Humidity in parents room, in %
To, Temperature outside (from Chievres weather station), in Celsius
Pressure (from Chievres weather station), in mm Hg
RH_out, Humidity outside (from Chievres weather station), in %
Wind speed (from Chievres weather station), in m/s
Visibility (from Chievres weather station), in km
Tdewpoint (from Chievres weather station), Â°C
rv1, Random variable 1, nondimensional
rv2, Random variable 2, nondimensional


In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.model_selection as Ms
import warnings 
warnings.filterwarnings("ignore", category=RuntimeWarning)
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error

In [46]:
# read in the data
datum=pd.read_csv('energydata_complete.csv')
datum.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


# Exploratory Data Analysis

In [28]:
datum.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

Checking for missing values in the dataset

In [29]:
datum.isnull().sum()

date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

In [49]:
#Firstly, we normalise our dataset to a common scale using the min max scaler
scaler = MinMaxScaler()
#normalised_df = pd.DataFrame(scaler.fit_transform(df.drop(columns = "date")), columns=df.drop(columns = "date").columns)
normalised_datum = pd.DataFrame(scaler.fit_transform(datum.drop(columns="date")), columns=datum.drop(columns="date").columns)


# assign x and y and reshape x 
x = np.array(normalised_df.T2).reshape(-1, 1)
y = normalised_df['T6']


#Now, we split our dataset into the training and testing dataset. 
x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.3, random_state=1)

#x_train, x_test, y_train, y_test = train_test_split(features_datum, temperature_target,test_size=0.3, random_state=1)

linear_model = LinearRegression()

#fit the model to the training dataset
linear_model.fit(x_train, y_train)

#obtain predictions
predicted_values = linear_model.predict(x_test)

In [50]:
slr_score = r2_score(y_test, predicted_values)
round(slr_score, 2)

0.65

In [None]:
#13. Norm removing ["date","lights"], target variable ["Appliances"]. 70-30 train_test set split, random state 42

In [69]:
#Firstly, we normalise our dataset to a common scale using the min max scaler
scaler = MinMaxScaler()
normalised_datum = pd.DataFrame(scaler.fit_transform(datum.drop(columns = "date")), columns=datum.drop(columns = "date").columns)

features_datum = normalised_datum.drop(columns=['lights', "Appliances"])
mlr_target = normalised_datum['Appliances']


#Now, we split our dataset into the training and testing dataset. Recall that we had earlier segmented the features 
# and target variables.
x_train, x_test, y_train, y_test = train_test_split(features_datum, mlr_target,test_size=0.3, random_state=42)

mlr_model = LinearRegression()

#fit the model to the training dataset
mlr_model.fit(x_train, y_train)

#obtain predictions
mlr_predicted_values = mlr_model.predict(x_test)

In [71]:
mae = mean_absolute_error(y_test, mlr_predicted_values)
round(mae, 2)

0.05

In [None]:
# 14 Residual Sum of Squares, 2 dec. Place

In [72]:
rss = np.sum(np.square(y_test - mlr_predicted_values))
round(rss, 2)

45.35

In [None]:
# 15 Root Mean Squared Error

In [36]:
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, mlr_predicted_values))
round(rmse, 3)

0.088

In [None]:
# 16 coeff of determination

In [37]:
mlr_score = r2_score(y_test, mlr_predicted_values)
round(mlr_score, 2)

0.15

In [None]:
#17. Obtain the feature weights from your linear model above.

In [38]:
#comparing the effects of regularisation
def get_weights_df(model, feat, col_name):
    #this function returns the weight of every feature
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features', col_name]
    weights_df[col_name].round(3)
    return weights_df

linear_model_weights = get_weights_df(mlr_model, x_train, 'Linear_Model_Weight')

In [39]:
linear_model_weights

Unnamed: 0,Features,Linear_Model_Weight
0,RH_2,-0.456698
1,T_out,-0.32186
2,T2,-0.236178
3,T9,-0.189941
4,RH_8,-0.157595
5,RH_out,-0.077671
6,RH_7,-0.044614
7,RH_9,-0.0398
8,T5,-0.015657
9,T1,-0.003281


In [None]:
#18. Train a ridge regression model with an alpha value 0.4

In [40]:
from sklearn.linear_model import Ridge

In [56]:
ridge_reg = Ridge(alpha=0.4)
ridge_reg.fit(x_train, y_train)

Ridge(alpha=0.4)

In [57]:
pred_ridge = ridge_reg.predict(x_test)

In [58]:
rmse1 = np.sqrt(mean_squared_error(y_test, pred_ridge))
round(rmse1, 3)

0.088

In [None]:
#19. Train a lasso regression with alpha value 0.001

In [None]:
from sklearn.linear_model import Lasso

In [42]:
lasso_reg = Ridge(alpha=0.001)
lasso_reg.fit(x_train, y_train)

Ridge(alpha=0.001)

In [59]:
pred_lasso = lasso_reg.predict(x_test)

In [60]:
lasso_weights_df = get_weights_df(lasso_reg, x_train, 'Lasso_weight')
lasso_weights_df

Unnamed: 0,Features,Lasso_weight
0,RH_2,-0.45657
1,T_out,-0.321674
2,T2,-0.23608
3,T9,-0.189939
4,RH_8,-0.157594
5,RH_out,-0.077597
6,RH_7,-0.044617
7,RH_9,-0.039805
8,T5,-0.015669
9,T1,-0.003325


In [None]:
#20. New RMSE with the lasso regression in 3 dec place

In [65]:
rmse1 = np.sqrt(mean_squared_error(y_test, pred_lasso))
round(rmse1, 3)

0.088