# Faryar Memon
##### ID: 14770bef0541f000 <br>
-----
#### STAGE B: Machine Learning: Regression - Predicting Energy Efficiency of Buildings 
----- 
<br>Dataset ([Appliances Energy Prediction](https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction))

    Dataset Description:
    
        - Date, time year-month-day hour:minute:second
        - Appliances, energy use in Wh
        - lights, energy use of light fixtures in the house in Wh
        - T1, Temperature in kitchen area, in Celsius
        - RH_1, Humidity in kitchen area, in %
        - T2, Temperature in living room area, in Celsius
        - RH_2, Humidity in living room area, in %
        - T3, Temperature in laundry room area, in Celsius
        - RH_3, Humidity in laundry room area, in %
        - T4, Temperature in office room, in Celsius
        - RH_4, Humidity in office room, in %
        - T5, Temperature in bathroom, in Celsius
        - RH_5, Humidity in bathroom, in %
        - T6, Temperature outside the building (north side), in Celsius
        - RH_6, Humidity outside the building (north side), in %
        - T7, Temperature in ironing room , in Celsius
        - RH_7, Humidity in ironing room, in %
        - T8, Temperature in teenager room 2, in Celsius
        - RH_8, Humidity in teenager room 2, in %
        - T9, Temperature in parents room, in Celsius
        - RH_9, Humidity in parents room, in %
        - To, Temperature outside (from Chievres weather station), in Celsius
        - Pressure (from Chievres weather station), in mm Hg
        - RH_out, Humidity outside (from Chievres weather station), in %
        - Wind speed (from Chievres weather station), in m/s
        - Visibility (from Chievres weather station), in km
        - Tdewpoint (from Chievres weather station), in °C
        - rv1, Random variable 1, nondimensional
        - rv2, Random variable 2, nondimensional
<br>





In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
%matplotlib inline
plt.style.use('dark_background')

In [2]:
path = 'energydata_complete.csv'
df = pd.read_csv(path)

In [3]:
df.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


### Understanding the dataset: data types, basic statistics for each column, missing values, and duplicated rows



In [4]:
def DB_Info(df):
    """
    Prints detailed information about the dataset: its data types,
    basic statistics, the number of missing values in each column, and any duplicated rows
    """
    print(df.info())
    print('-'*20)
    print(df.describe(include='all'))
    print('-'*20)
    d = df.isnull().sum()
    # prints the columns with null values with a total count of null values it contains
    if d[d>0].any():
        print(d[d>0])
    else:
        print('There are no null values')
    print('-'*20)
    # prints the duplicated rows
    if df.duplicated().any():
        print(df[df.duplicated()])
    else:
        print('There are no duplicated rows')
    
DB_Info(df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

As stated in the output, the dataset consists of 19735 entries (rows) and 29 columns (variables).
<br>There are no missing values or duplicated rows.

In [5]:
# Checking for categorical columns (fewer than 15 unique values)
for col in df.columns:
    a = df[col].unique()
    if len(a)<15:
        print(f'{col} has {len(a)} unique values ->> {a}', end = '\n\n')

lights has 8 unique values ->> [30 40 50 70 60 10 20  0]



## Question 12

From the dataset, fit a linear model on the relationship between the temperature in the living room in Celsius (x = T2) and the temperature outside the building (y = T6). What is the R^2 value in two d.p.?

In [6]:
features = df[['T2']] #X = T2 
target = df[['T6']] #Y = T6

# Splitting the feature and target variables into train and test sets
x_train, x_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.3)
# fitting the linear regression model
linear_model1 = LinearRegression()
linear_model1.fit(x_train, y_train)
p1 = linear_model1.predict(x_test)

In [7]:
r2_1 = round(r2_score(y_test, p1), 2)
print(f'R Squared of the linear model is {r2_1}.')

R Squared of the linear model is 0.63.


### Answer:
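The R^2 value in two d.p. is 0.63 for the run shown above. The train-test split is not seeded, so the value can differ slightly between runs (for example, 0.64).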
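
The R^2 reported by `r2_score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. A minimal sketch on made-up numbers follows; the values and the helper name `r_squared` are illustrative and are not taken from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    tss = sum((yt - mean_y) ** 2 for yt in y_true)               # total sum of squares
    return 1 - rss / tss


# Made-up observations and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]

print(round(r_squared(y_true, y_pred), 2))
```

A perfect prediction gives 1, and always predicting the mean of the observed values gives 0.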

## Question 13
Normalize the dataset using the MinMaxScaler after removing the following columns: [“date”, “lights”]. The target variable is “Appliances”. Use a 70-30 train-test set split with a random state of 42 (for reproducibility). Run a multiple linear regression using the training set and evaluate your model on the test set. Answer the following questions:

What is the Mean Absolute Error (in two decimal places)?

In [8]:
# dropping the columns 'date' & 'lights' as per instruction
new_df = df.drop(columns=['date', 'lights'])
scaler = MinMaxScaler()
# normalizing the new data set (every column, including the target, is scaled to [0, 1])
normalized_new_df = pd.DataFrame(scaler.fit_transform(new_df), columns=new_df.columns)

In [9]:
# Creating features and target
features2 = normalized_new_df.drop(columns=['Appliances'])
target2 = normalized_new_df['Appliances']

x2_train, x2_test, y2_train, y2_test = train_test_split(features2, target2,
                                                    test_size=0.3, random_state=42)
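# The `normalize` argument was removed from LinearRegression in scikit-learn 1.2;
# the features are already min-max scaled, so the default constructor is used
linear_model2 = LinearRegression()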
linear_model2.fit(x2_train, y2_train)
p2 = linear_model2.predict(x2_test)

In [10]:
mae2 = round(mean_absolute_error(y2_test, p2),2)
print(f'Mean Absolute Error (MAE) of this multiple linear regression model is {mae2}.')

Mean Absolute Error (MAE) of this multiple linear regression model is 0.05.


### Answer:
The Mean Absolute Error (MAE) of this multiple linear regression model, to two decimal places, is 0.05.
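
Because `Appliances` was min-max scaled together with the features, this MAE is on the 0 to 1 scale rather than in Wh. Min-max scaling is linear, so multiplying a scaled absolute error by the target's range (max minus min) recovers the error in the original units. The sketch below shows this on made-up readings; `y_min`, `y_max`, and the helper names are hypothetical and do not come from the notebook.

```python
def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Made-up energy readings and predictions in Wh, for illustration only
y_true_wh = [50.0, 60.0, 100.0, 250.0]
y_pred_wh = [70.0, 55.0, 130.0, 200.0]
y_min, y_max = 50.0, 250.0  # hypothetical range of the target

mae_wh = mae(y_true_wh, y_pred_wh)
mae_scaled = mae(min_max_scale(y_true_wh, y_min, y_max),
                 min_max_scale(y_pred_wh, y_min, y_max))

# Undo the scaling: an error on the scaled target times the range gives Wh
mae_back = mae_scaled * (y_max - y_min)
print(mae_wh, mae_scaled, mae_back)
```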

## Question 14
What is the Residual Sum of Squares (in two decimal places)?

In [11]:
rss2 = round(np.sum(np.square(y2_test-p2)), 2)
print(f'Residual Sum of Squares (RSS) of this multiple linear regression model is {rss2}.')

Residual Sum of Squares (RSS) of this multiple linear regression model is 45.35.


### Answer:
The Residual Sum of Squares (RSS) of this multiple linear regression model, to two decimal places, is 45.35.

## Question 15
What is the Root Mean Squared Error (in three decimal places)?

In [12]:
rmse2 = round(np.sqrt(mean_squared_error(y2_test, p2)), 3)
print(f'Root Mean Squared Error (RMSE) of this multiple linear regression model is {rmse2}.')

Root Mean Squared Error (RMSE) of this multiple linear regression model is 0.088.


### Answer:
The Root Mean Squared Error (RMSE) of this multiple linear regression model, to three decimal places, is 0.088.
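
The RMSE is tied to the RSS from Question 14 by RMSE = sqrt(RSS / n_test), so the two reported figures can be cross-checked with plain arithmetic. The sketch below assumes the test set holds the 30% fraction of the 19735 rows rounded up; that rounding rule is an assumption about `train_test_split` and is not shown in the notebook output.

```python
import math

n_rows = 19735        # number of entries reported by df.info()
test_fraction = 0.3   # test_size used in the split
rss = 45.35           # RSS reported for Question 14

# Assumption: the test-set size is the fraction of rows, rounded up
n_test = math.ceil(test_fraction * n_rows)

rmse_from_rss = math.sqrt(rss / n_test)
print(n_test, round(rmse_from_rss, 3))
```

Under that assumption the result rounds to 0.088, in agreement with the value printed by the notebook.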

## Question 16
What is the Coefficient of Determination (in two decimal places)?

In [13]:
r2_2 = round(r2_score(y2_test, p2), 2)
print(f'Coefficient of Determination (R Squared) of this multiple linear regression model is {r2_2}.')

Coefficient of Determination (R Squared) of this multiple linear regression model is 0.15.


### Answer:
The Coefficient of Determination (R Squared) of this multiple linear regression model, to two decimal places, is 0.15.

## Question 17
Obtain the feature weights from your linear model above. Which features have the lowest and highest weights respectively?

In [14]:
# A Series has an index but no columns, so the weights are simply indexed by feature name
linear_weights2 = pd.Series(linear_model2.coef_, index=x2_train.columns).sort_values()
linear_weights2

RH_2          -0.456698
T_out         -0.321860
T2            -0.236178
T9            -0.189941
RH_8          -0.157595
RH_out        -0.077671
RH_7          -0.044614
RH_9          -0.039800
T5            -0.015657
T1            -0.003281
rv2            0.000770
rv1            0.000770
Press_mm_hg    0.006839
T7             0.010319
Visibility     0.012307
RH_5           0.016006
RH_4           0.026386
T4             0.028981
Windspeed      0.029183
RH_6           0.038049
RH_3           0.096048
T8             0.101995
Tdewpoint      0.117758
T6             0.236425
T3             0.290627
RH_1           0.553547
dtype: float64

In [15]:
print(f'Highest weighted feature is {linear_weights2.idxmax()} with weight {round(linear_weights2.max(), 6)}.')

Highest weighted feature is RH_1 with weight 0.553547.


In [16]:
print(f'Lowest weighted feature is {linear_weights2.idxmin()} with weight {round(linear_weights2.min(),6)}.')

Lowest weighted feature is RH_2 with weight -0.456698.


### Answer:
Highest weighted feature is <strong>RH_1</strong> with weight 0.553547.
<br>Lowest weighted feature is <strong>RH_2</strong> with weight -0.456698.

## Question 18
Train a ridge regression model with an alpha value of 0.4. Is there any change to the root mean squared error (RMSE) when evaluated on the test set?

In [17]:
ridge_reg = Ridge(alpha=0.4)
ridge_reg.fit(x2_train, y2_train)
p3 = ridge_reg.predict(x2_test)

In [18]:
rmse3 = round(np.sqrt(mean_squared_error(y2_test, p3)), 3)
print(f'Root Mean Squared Error (RMSE) of this Ridge Regression model is {rmse3}.')

Root Mean Squared Error (RMSE) of this Ridge Regression model is 0.088.


### Answer:
The Root Mean Squared Error (RMSE) of this Ridge Regression model is 0.088.
Since the RMSE of the Linear Regression model was also 0.088, there is NO CHANGE in RMSE at three decimal places.
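
An unchanged RMSE is what one would expect when alpha is small relative to the amount of data. For a single centred feature the ridge slope has the closed form Sxy / (Sxx + alpha), against Sxy / Sxx for ordinary least squares, so a small alpha barely moves the fit. The sketch below illustrates this on synthetic one-feature data. It is not the notebook's model: the generated values and the helper names `fit_slope` and `rmse` are hypothetical, and the error is evaluated on the same data used for fitting to keep the example short.

```python
import math
import random


def fit_slope(x, y, alpha=0.0):
    """Slope and intercept for one feature; alpha=0 gives ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + alpha)  # the ridge penalty shrinks the slope, not the intercept
    return slope, my - slope * mx


def rmse(x, y, slope, intercept):
    """Root mean squared error of the fitted line on (x, y)."""
    sq_err = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return math.sqrt(sq_err / len(x))


random.seed(42)
x = [random.random() for _ in range(5000)]         # feature already in [0, 1]
y = [0.3 * a + random.gauss(0, 0.09) for a in x]   # noisy synthetic target

ols = fit_slope(x, y)
ridge = fit_slope(x, y, alpha=0.4)

rmse_ols, rmse_ridge = rmse(x, y, *ols), rmse(x, y, *ridge)
print(rmse_ols, rmse_ridge)
```

Printing the unrounded values makes any difference visible; rounding both to three decimal places hides it.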

## Question 19
Train a lasso regression model with an alpha value of 0.001 and obtain the new feature weights with it. How many of the features have non-zero feature weights?

In [19]:
lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(x2_train,y2_train)
p4 = lasso_reg.predict(x2_test)

In [20]:
# A Series has an index but no columns, so the weights are simply indexed by feature name
lasso_weights = pd.Series(lasso_reg.coef_, index=x2_train.columns).sort_values()
lasso_weights

RH_out        -0.049557
RH_8          -0.000110
T1             0.000000
Tdewpoint      0.000000
Visibility     0.000000
Press_mm_hg   -0.000000
T_out          0.000000
RH_9          -0.000000
T9            -0.000000
T8             0.000000
RH_7          -0.000000
rv1           -0.000000
T7            -0.000000
T6             0.000000
RH_5           0.000000
T5            -0.000000
RH_4           0.000000
T4            -0.000000
RH_3           0.000000
T3             0.000000
RH_2          -0.000000
T2             0.000000
RH_6          -0.000000
rv2           -0.000000
Windspeed      0.002912
RH_1           0.017880
dtype: float64

In [21]:
non_zero_wcount = lasso_weights[lasso_weights!=0.0].count()
print(f'There are {non_zero_wcount} features with non-zero feature weights in this Lasso Regression model.')

There are 4 features with non-zero feature weights in this Lasso Regression model.


### Answer:
There are *4* features with non-zero feature weights.
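
Lasso produces exact zeros because its L1 penalty soft-thresholds each coefficient. For one centred feature, and assuming the objective (1 / (2n)) * RSS + alpha * |w| documented for scikit-learn's `Lasso`, the slope is soft(cov(x, y), alpha) / var(x), which is exactly zero whenever |cov(x, y)| <= alpha. The sketch below shows this on made-up data; the helper names and values are hypothetical and are not taken from the notebook.

```python
def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha; return exactly 0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_slope_1d(x, y, alpha):
    """Closed-form lasso slope for a single feature with an unpenalised intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    return soft_threshold(cov_xy, alpha) / var_x


# Made-up data: one target depends strongly on x, the other only weakly
x = [0.0, 0.25, 0.5, 0.75, 1.0]
y_strong = [0.0, 0.1, 0.2, 0.3, 0.4]          # true slope 0.4
y_weak = [0.100, 0.101, 0.102, 0.103, 0.104]  # true slope 0.004

weights = [lasso_slope_1d(x, y, alpha=0.001) for y in (y_strong, y_weak)]
non_zero = sum(1 for w in weights if w != 0.0)
print(weights, non_zero)
```

The weak relationship is zeroed out entirely while the strong one is only slightly shrunk, which matches the pattern in the weights above.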

## Question 20
What is the new RMSE with the lasso regression? (Answer should be in three (3) decimal places)

In [22]:
rmse4 = round(np.sqrt(mean_squared_error(y2_test, p4)), 3)
print(f'Root Mean Squared Error (RMSE) of this Lasso Regression model is {rmse4}.')

Root Mean Squared Error (RMSE) of this Lasso Regression model is 0.094.


### Answer:
Root Mean Squared Error (RMSE) of this Lasso Regression model is 0.094 (in 3 decimal places).