## Introduction

### Dataset Description

The dataset for the remainder of this quiz is the Appliances Energy Prediction data. It is sampled at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity readings roughly every 3.3 minutes, and the wireless data were then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters). The attribute information can be seen below.

Attribute Information:

Date, time year-month-day hour:minute:second

Appliances, energy use in Wh

lights, energy use of light fixtures in the house in Wh

T1, Temperature in kitchen area, in Celsius

RH_1, Humidity in kitchen area, in %

T2, Temperature in living room area, in Celsius

RH_2, Humidity in living room area, in %

T3, Temperature in laundry room area, in Celsius

RH_3, Humidity in laundry room area, in %

T4, Temperature in office room, in Celsius

RH_4, Humidity in office room, in %

T5, Temperature in bathroom, in Celsius

RH_5, Humidity in bathroom, in %

T6, Temperature outside the building (north side), in Celsius

RH_6, Humidity outside the building (north side), in %

T7, Temperature in ironing room, in Celsius

RH_7, Humidity in ironing room, in %

T8, Temperature in teenager room 2, in Celsius

RH_8, Humidity in teenager room 2, in %

T9, Temperature in parents room, in Celsius

RH_9, Humidity in parents room, in %

T_out, Temperature outside (from Chievres weather station), in Celsius

Pressure (from Chievres weather station), in mm Hg

RH_out, Humidity outside (from Chievres weather station), in %

Wind speed (from Chievres weather station), in m/s

Visibility (from Chievres weather station), in km

Tdewpoint (from Chievres weather station), in °C

rv1, Random variable 1, nondimensional

rv2, Random variable 2, nondimensional
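The averaging and merging steps in the dataset description can be illustrated with a minimal sketch. Everything here is hypothetical: the readings are random numbers, the 200-second spacing stands in for "roughly 3.3 minutes", and the column names `T1` and `T_out` are borrowed from the attribute list only for readability. It is not the procedure the dataset authors actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sensor readings, one every 200 seconds (about 3.3 minutes)
rng = np.random.default_rng(0)
raw_index = pd.date_range("2016-01-11 17:00:00", periods=18, freq="200s")
raw = pd.DataFrame({"T1": 19.0 + rng.random(18)}, index=raw_index)

# Average the raw readings over 10-minute periods
sensor_10min = raw.resample("10min").mean()

# Hypothetical weather records that are already on a 10-minute grid
weather = pd.DataFrame(
    {"T_out": np.linspace(6.6, 6.1, 6)},
    index=pd.date_range("2016-01-11 17:00:00", periods=6, freq="10min"),
)

# Merge the two sources on their shared timestamp and expose it as a 'date' column
merged = sensor_10min.join(weather, how="inner").rename_axis("date").reset_index()
print(merged)
```

Each 10-minute bin in this toy example averages three raw readings, which mirrors the ratio implied by the description.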



## Importing relevant packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

## Data Assessing

In [2]:
df = pd.read_csv('energydata_complete.csv')

In [3]:
df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [4]:
#checking basic dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

In [5]:
#checking for missing values in each column of the dataset
df.isnull().sum()

date           0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
T6             0
RH_6           0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
rv1            0
rv2            0
dtype: int64

In [6]:
#checking for duplicated rows in the dataset
df.duplicated().sum()

0

## Data Cleaning

In [7]:
#Converting the date column from object to datetime format
df['date'] = pd.to_datetime(df['date'])
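Once the column is a datetime type, the 10-minute sampling interval mentioned in the dataset description can be checked directly. The sketch below uses a short hypothetical series of timestamps in the same string format as the CSV; it is an illustration, not a cell from the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the 'date' column, stored as strings like the CSV
dates = pd.Series(
    [
        "2016-01-11 17:00:00",
        "2016-01-11 17:10:00",
        "2016-01-11 17:20:00",
        "2016-01-11 17:30:00",
    ]
)
parsed = pd.to_datetime(dates)

# Gaps between consecutive rows; the first row has no predecessor, so drop it
gaps = parsed.diff().dropna()
is_regular = bool((gaps == pd.Timedelta(minutes=10)).all())
print(parsed.dtype, is_regular)
```

Applied to the real `date` column, the same check would reveal any missing or irregular time steps.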

## Machine Learning Problems

### Question 12:

In [8]:
#Creating a subset of the data with just the temperature in the living room (T2) and the temperature outside the building (T6)
df_subset = df[['T2','T6']]

In [9]:
#Making T2 the independent variable and T6 the dependent variable
X = df_subset[['T2']]
y = df_subset['T6']

In [10]:
#Splitting the variables into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
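To see what a 70-30 split means in row counts, the sketch below applies the same `test_size` and `random_state` to dummy arrays with 19735 rows, the row count reported by `df.info()`. The arrays are placeholders, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays with the same number of rows as the dataset (19735)
X_dummy = np.arange(19735).reshape(-1, 1)
y_dummy = np.arange(19735)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.3, random_state=42
)

# The test share is rounded up, so this should give 13814 training rows and 5921 test rows
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.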

In [11]:
#Creating a linear regression model
linear_model = LinearRegression()

In [12]:
#Fitting the training data 
linear_model.fit(X_train, y_train)

In [13]:
#Getting predictions from the testing data
y_pred = linear_model.predict(X_test)

In [14]:
#Calculating the r2_score
round(r2_score(y_test, y_pred),2)

0.64
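The score above is the coefficient of determination, which compares the model's squared errors with those of always predicting the mean. As a minimal sketch with hypothetical numbers (not values from the dataset), the same quantity can be computed by hand and compared with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5, 10.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(round(r2_manual, 4), round(r2_score(y_true, y_hat), 4))
```

A value of 0.64 therefore means the single-feature model removes about 64% of the squared error that a constant mean prediction would leave on the test set.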

### Question 13:

In [15]:
#dropping the 'date' and 'lights' columns
df.drop(['date', 'lights'], axis=1, inplace=True)

In [16]:
#normalizing the dataset using the MinMaxScaler 
scaler = MinMaxScaler()
normalised_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

In [17]:
normalised_df.head()

Unnamed: 0,Appliances,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,0.046729,0.32735,0.566187,0.225345,0.684038,0.215188,0.746066,0.351351,0.764262,0.175506,...,0.223032,0.67729,0.37299,0.097674,0.894737,0.5,0.953846,0.538462,0.265449,0.265449
1,0.046729,0.32735,0.541326,0.225345,0.68214,0.215188,0.748871,0.351351,0.782437,0.175506,...,0.2265,0.678532,0.369239,0.1,0.894737,0.47619,0.894872,0.533937,0.372083,0.372083
2,0.037383,0.32735,0.530502,0.225345,0.679445,0.215188,0.755569,0.344745,0.778062,0.175506,...,0.219563,0.676049,0.365488,0.102326,0.894737,0.452381,0.835897,0.529412,0.572848,0.572848
3,0.037383,0.32735,0.52408,0.225345,0.678414,0.215188,0.758685,0.341441,0.770949,0.175506,...,0.219563,0.671909,0.361736,0.104651,0.894737,0.428571,0.776923,0.524887,0.908261,0.908261
4,0.046729,0.32735,0.531419,0.225345,0.676727,0.215188,0.758685,0.341441,0.762697,0.178691,...,0.219563,0.671909,0.357985,0.106977,0.894737,0.404762,0.717949,0.520362,0.201611,0.201611
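`MinMaxScaler` with default settings rescales each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks this on a small hypothetical frame; the values are illustrative and only loosely resemble the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small hypothetical frame, for illustration only
toy = pd.DataFrame(
    {"Appliances": [60, 50, 50, 60, 110], "T1": [19.89, 19.89, 19.5, 20.1, 21.0]}
)

scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# The same transformation written out by hand, column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(np.allclose(scaled, manual))
```

Because `Appliances` is scaled along with the features, error metrics computed on `normalised_df` are in scaled units rather than Wh.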


In [18]:
#Making 'Appliances' the dependent column and all other columns the independent columns
X2 = normalised_df.drop(['Appliances'], axis=1)
y2 = normalised_df['Appliances']

In [19]:
#Splitting the variables into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.3, random_state=42)

In [20]:
#Creating a linear regression model
linear_model2 = LinearRegression()

In [21]:
#Fitting the training data
linear_model2.fit(X_train, y_train)

In [22]:
#Getting predictions from the testing data
y_pred = linear_model2.predict(X_test) 

In [23]:
#Calculating the mean absolute error
mae = mean_absolute_error(y_test, y_pred) 
round(mae, 2)

0.05

### Question 14:

In [24]:
#Calculating the residual sum of squares(rss)
rss = np.sum(np.square(y_test - y_pred)) 
round(rss, 2)

45.35

### Question 15:

In [25]:
#Calculating the Root Mean Square Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
round(rmse, 3)

0.088
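MAE, RSS and RMSE are all summaries of the same residuals, and RMSE equals the square root of RSS divided by the number of test rows. The sketch below shows this on hypothetical arrays, then cross-checks the reported figures. The count of 5921 test rows is an assumption (30% of 19735, rounded up), not a number printed in the notebook.

```python
import math
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical scaled targets and predictions, for illustration only
y_true = np.array([0.05, 0.04, 0.10, 0.03, 0.20])
y_hat = np.array([0.06, 0.05, 0.07, 0.05, 0.12])

mae = mean_absolute_error(y_true, y_hat)
rss = np.sum(np.square(y_true - y_hat))
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

# RMSE is the square root of RSS divided by the number of rows
rmse_matches_rss = bool(np.isclose(rmse, np.sqrt(rss / len(y_true))))
print(rmse_matches_rss)

# Cross-check of the reported values, assuming 5921 test rows
rmse_from_rss = math.sqrt(45.35 / 5921)
print(round(rmse_from_rss, 3))  # 0.088
```

Under that assumption, the reported RSS of 45.35 and RMSE of 0.088 agree with each other.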

### Question 16:

In [26]:
#Calculating the Coefficient of Determination (R2 Score)
round(r2_score(y_test, y_pred), 2)

0.15

### Question 17:

In [27]:
#this function returns the weight of every feature, sorted from lowest to highest
def get_weights_df(model, feat, col_name):
    weights = pd.Series(model.coef_, feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features', col_name]
    #weights are returned unrounded; to round them, assign the result back:
    #weights_df[col_name] = weights_df[col_name].round(3)
    return weights_df

In [28]:
#Obtaining which features have the lowest and highest weights
linear_model_weights = get_weights_df(linear_model2, X_train, 'Linear_Model_Weight')
linear_model_weights

Unnamed: 0,Features,Linear_Model_Weight
0,RH_2,-0.456698
1,T_out,-0.32186
2,T2,-0.236178
3,T9,-0.189941
4,RH_8,-0.157595
5,RH_out,-0.077671
6,RH_7,-0.044614
7,RH_9,-0.0398
8,T5,-0.015657
9,T1,-0.003281
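The listing above shows only the first rows of the sorted table, so the feature with the highest weight is not visible in it. A minimal sketch of reading both extremes programmatically is given below. The feature names and weights are hypothetical, not the notebook's values.

```python
import pandas as pd

# Hypothetical weights, for illustration only (not the notebook's values)
weights_df = pd.DataFrame(
    {
        "Features": ["a", "b", "c", "d"],
        "Linear_Model_Weight": [0.31, -0.46, 0.02, 0.55],
    }
)

col = "Linear_Model_Weight"
# idxmin / idxmax return the row labels of the smallest and largest weights
lowest = weights_df.loc[weights_df[col].idxmin(), "Features"]
highest = weights_df.loc[weights_df[col].idxmax(), "Features"]
print(lowest, highest)
```

The same two lines applied to `linear_model_weights` would name the lowest and highest weighted features without relying on the displayed rows.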


### Question 18:

In [29]:
#Creating a ridge regression model
ridge_reg = Ridge(alpha=0.4)
#Fitting the training data
ridge_reg.fit(X_train, y_train)

In [30]:
ridge_pred = ridge_reg.predict(X_test)

In [31]:
#Calculating the r2_score
round(r2_score(y_test, ridge_pred), 2)

0.15
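Ridge regression adds an L2 penalty that shrinks the coefficients toward zero, and with a small `alpha` the shrinkage is mild. That is one plausible reason the score here matches the unpenalised linear model. The sketch below compares coefficient norms on synthetic data; the data and true coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(42)
X_toy = rng.random((200, 5))
y_toy = X_toy @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(0, 0.1, 200)

ols = LinearRegression().fit(X_toy, y_toy)
ridge = Ridge(alpha=0.4).fit(X_toy, y_toy)

# The L2 penalty never increases the overall size of the coefficient vector
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(ols_norm, ridge_norm)
```

Larger values of `alpha` would shrink the coefficients further, at the cost of more bias.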

### Question 19:

In [32]:
#Creating a lasso regression model
lasso_reg = Lasso(alpha=0.001)
#Fitting the training data
lasso_reg.fit(X_train, y_train)

In [33]:
#Obtaining how many features have non-zero feature weights
lasso_weights_df = get_weights_df(lasso_reg, X_train, 'Lasso_weight')
lasso_weights_df

Unnamed: 0,Features,Lasso_weight
0,RH_out,-0.049557
1,RH_8,-0.00011
2,T1,0.0
3,Tdewpoint,0.0
4,Visibility,0.0
5,Press_mm_hg,-0.0
6,T_out,0.0
7,RH_9,-0.0
8,T9,-0.0
9,T8,0.0
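Since only the first rows of the table are displayed, the number of non-zero weights is easier to obtain by counting than by reading the listing. The sketch below uses four rows copied from the visible output as a stand-in for the full table, so its count applies to those four rows only.

```python
import numpy as np
import pandas as pd

# Four rows copied from the visible output; the full table has more rows
lasso_like = pd.DataFrame(
    {
        "Features": ["RH_out", "RH_8", "T1", "Press_mm_hg"],
        "Lasso_weight": [-0.049557, -0.00011, 0.0, -0.0],
    }
)

# -0.0 compares equal to 0, so it is not counted as non-zero
n_nonzero = int((lasso_like["Lasso_weight"] != 0).sum())
print(n_nonzero)  # 2
```

On the fitted model, `np.count_nonzero(lasso_reg.coef_)` would give the count over all features.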


### Question 20:

In [34]:
#Getting predictions from the testing data using the lasso model
lasso_pred = lasso_reg.predict(X_test)

In [35]:
#Calculating the Root Mean Square Error (RMSE)
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_pred))
round(lasso_rmse, 3)

0.094