<a href="https://colab.research.google.com/github/Umesh1307/Appliances-Energy-Prediction-Regression-Analysis-Almabetter-Capstone-Project-No.-2/blob/main/Appliances_Energy_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement:

---



## The data set is recorded at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the temperature and humidity conditions about every 3.3 minutes, and the wireless data was then averaged over 10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters. Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru) and merged with the experimental data sets using the date and time column. Two random variables have been included in the data set for testing the regression models and for filtering out non-predictive attributes (parameters).

---



### 1. date: year-month-day hour:minute:second
### 2. T1: Temperature in kitchen area, in Celsius
### 3. RH_1: Humidity in kitchen area, in %
### 4. T2: Temperature in living room area, in Celsius
### 5. RH_2: Humidity in living room area, in %
### 6. T3: Temperature in laundry room area, in Celsius
### 7. RH_3: Humidity in laundry room area, in %
### 8. T4: Temperature in office room, in Celsius
### 9. RH_4: Humidity in office room, in %
### 10. T5: Temperature in bathroom, in Celsius
### 11. RH_5: Humidity in bathroom, in %
### 12. T6: Temperature outside the building (north side), in Celsius
### 13. RH_6: Humidity outside the building (north side), in %
### 14. T7: Temperature in ironing room, in Celsius
### 15. RH_7: Humidity in ironing room, in %
### 16. T8: Temperature in teenager room 2, in Celsius
### 17. RH_8: Humidity in teenager room 2, in %
### 18. T9: Temperature in parents’ room, in Celsius
### 19. RH_9: Humidity in parents’ room, in %
### 20. T_out: Temperature outside (from Chievres weather station), in Celsius
### 21. Pressure: (from Chievres weather station), in mm Hg
### 22. RH_out: Humidity outside (from Chievres weather station), in %
### 23. Wind speed: (from Chievres weather station), in m/s
### 24. Visibility: (from Chievres weather station), in km
### 25. T_dewpoint: (from Chievres weather station), in °C
### 26. rv1: Random variable 1, non-dimensional
### 27. rv2: Random variable 2, non-dimensional
### 28. Lights: energy use of light fixtures in the house in Wh
### 29. Appliances: energy use in Wh (Target Variable)

## Where indicated, hourly data (then interpolated) from the nearest airport weather station (Chievres Airport, Belgium) was downloaded from a public data set from Reliable Prognosis (rp5.ru). Permission was obtained from Reliable Prognosis for the distribution of the 4.5 months of weather data.

---





## 😇 Before diving into the code, let's understand the problem statement together 😇

---


### Scientists define energy as the ability to do work. Modern civilization is possible because people have learned how to change energy from one form to another and then use it to do work. People use energy to walk and bicycle, to move cars along roads and boats through water, to cook food on stoves, to make ice in freezers, to light homes and offices, to manufacture products, and to send astronauts into space.

### There are many different forms of energy, including:

### Heat

### Light

### Motion

### Electrical

### Chemical

### Gravitational

## 😇 Curious to know more about energy? Refer to this [overview from the U.S. Energy Information Administration](https://www.eia.gov/energyexplained/what-is-energy/)

---




## Definition of Attributes:

---
* ## **Relative Humidity:**



### **Relative humidity is the amount of water vapour present in the air, expressed as a percentage of the amount needed for saturation at the same temperature. In other words, it is the ratio of the water vapour actually in the air to the maximum the air can hold at that temperature.**



* ## **Dew Point Temperature:**


### **The dew point temperature is the temperature to which air must be cooled for water vapour to begin condensing out of the air into liquid form.**



* ## **Temperature**:

### **It denotes the degree of hotness and coldness of a body.**



* ## **Wind Speed**:

### **It is the speed at which wind flows. It plays a vital role in the cooling of buildings and in condensation.**



* ## **Visibility**:

### **The degree of clearness of the atmosphere; specifically, the greatest distance through the atmosphere toward the horizon at which prominent objects can be identified with the naked eye.**
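
### The relationship between temperature, relative humidity and dew point defined above can be illustrated with a small sketch. It uses the Magnus approximation, which is not part of the original data set documentation; the coefficients `A` and `B` are one commonly quoted set and are an assumption here, and the function name `dew_point_celsius` is hypothetical.

```python
import math

# Magnus coefficients (assumed; one commonly quoted set for water vapour over liquid water)
A = 17.62
B = 243.12  # degrees Celsius

def dew_point_celsius(temp_c: float, rel_humidity_pct: float) -> float:
    """Approximate dew point (in Celsius) from air temperature (Celsius) and relative humidity (%)."""
    gamma = math.log(rel_humidity_pct / 100.0) + (A * temp_c) / (B + temp_c)
    return (B * gamma) / (A - gamma)

# At 100 % relative humidity the air is saturated, so the dew point equals the air temperature.
print(round(dew_point_celsius(20.0, 100.0), 2))

# Drier air at the same temperature has a lower dew point.
print(round(dew_point_celsius(20.0, 50.0), 2))
```

### This is only an approximation for everyday temperature ranges; the data set's own Tdewpoint column comes from the weather station, not from this formula.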


---





# Objective of Project:
---
### The rising trend in energy consumption is becoming a cause of concern for the entire world: as energy consumption increases year after year, so do carbon and greenhouse gas emissions. The majority of the electricity generated is consumed by the industrial sector, but a considerable amount is also consumed by the residential sector. It is therefore important to study energy-consuming behaviour in the residential sector and to predict the energy consumption of home appliances, as they consume the largest share of energy in the residence. This project focuses on predicting the energy consumption of home appliances based on humidity and temperature.

---

# What we can do?

---
### Energy prediction of appliances requires identifying and predicting individual appliance energy consumption when combined in a closed chain environment. This experiment aims to provide insight into reducing energy consumption by identifying trends and appliances involved.
### Power prediction has been a major concern in power systems, as it supports effective energy utilization and helps reduce demand.

# Tentative Roadmap to Follow:

---

* ### Loading the dataset.

* ### Cleaning and transforming features (null value treatment, data type consistency check).
* ### Descriptive statistical analysis.
* ### Skewness and outlier (anomaly) detection analysis.
* ### Feature engineering (standardizing, normalizing, multicollinearity assumption check, linearity check between independent and dependent variables).
* ### Exploratory data analysis (understanding the pattern and behaviour of the data. EDA involves generating summary statistics for numerical data in the dataset and creating various graphical representations to understand the data better).
* ### Understanding the feature importance (PCA can be handy for feature selection; lasso regression can be another option).
* ### Model Selection.
* ### Model Training.
* ### Model Evaluation.
* ### Conclusion.


---






# ***STEP 1: LOADING THE DATASET***

---



In [None]:
# Let's get started with the very first step: loading the weapons (libraries):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.neural_network import MLPRegressor
import xgboost as xgb
from sklearn import neighbors
from sklearn.svm import SVR
import time
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from tensorflow.keras import Sequential, layers, Input
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Mounting the drive:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Creating the directorial path for the data set:
dir_path="/content/drive/MyDrive/Almabetter Project/Capstone - Projects/Module 4 Supervised ML Regression/Appliances Energy Prediction"

In [None]:
# Loading the dataset:
energy_df=pd.read_csv(dir_path+"/data_application_energy.csv")

In [None]:
# Checking the head of the dataset, traditional way yet useful
energy_df.head()

In [None]:
# Let's use the Colab data table feature to visualize the data explicitly! This feature was a new one for me :)
from google.colab.data_table import DataTable
DataTable(energy_df)

# A few interesting features of the data table display:😇

---



* ### Clicking the Filter button in the upper right allows you to search for terms or values in any particular column.
* ### Clicking on any column title lets you sort the results according to that column's value.
* ### The table displays only a subset of the data at a time. You can navigate through pages of data using the controls on the lower right.

---



In [None]:
# Checking the tail of the dataset:)
energy_df.tail(3)


In [None]:
# Checking the shape of the dataset:)
print(f"The Shape of dataset is {energy_df.shape} There are {energy_df.shape[0]} rows and {energy_df.shape[1]} columns")

In [None]:
# Let's have a look at the data type of the features.
energy_df.info()



---


# Observations:

---


* ### **Number of entries** : **19735**
* ### **Number of features : 27 ( 2 Random Variables excluded )**

* ### **Target Variable : Appliances**

* ### **There are no categorical features in the data. There are two 'int' features which might be categorical. We will check this later in the notebook.**

* ### **There is no information about the features rv1 and rv2 or what they denote. We will keep or discard them based on their relationship with the target variable.**

* ### **There are in total 29 variables out of which 28 are independent variables and 1 is dependent which is our target variable.**

---



In [None]:
# Rechecking for the null values if any
energy_df.isnull().sum()



---


## ***DESCRIPTIVE STATISTICAL ANALYSIS***

---
### Here we will use the pandas describe method to get an intuition about the basic behaviour of the data. Furthermore, we will use pandas profiling to understand the data better.


In [None]:
# Now let's use pandas describe method
energy_df.describe()

## **Looking at the description of the whole dataset at once is confusing and makes it hard to draw inferences. Instead, we can create two separate views of the dataset: one containing all the temperatures and the other all the relative humidities.**

---





In [None]:
# Creating a dictionary of all the temperatures from the dataset, naming it temp_dict
temp_dict = {
    'T1' : 'temp_kitchen', 'T2' : 'temp_living', 'T3' : 'temp_laundry', 
    'T4' : 'temp_office', 'T5' : 'temp_bath', 'T6' : 'temp_outside',
    'T7' : 'temp_iron', 'T8' : 'temp_teen', 'T9' : 'temp_parents', 'T_out' : 'temp_station'
}

In [None]:
# Renaming the attributes
energy_df = energy_df.rename(columns=temp_dict)

In [None]:
# Creating a dictionary of all the relative humidity attributes
humid_dict = {
    'RH_1' : 'humid_kitchen', 'RH_2' : 'humid_living', 'RH_3' : 'humid_laundry', 
    'RH_4' : 'humid_office', 'RH_5' : 'humid_bath', 'RH_6' : 'humid_outside',
    'RH_7' : 'humid_iron', 'RH_8' : 'humid_teen', 'RH_9' : 'humid_parents', 'RH_out' : 'humid_station'
}

In [None]:
# Renaming the attributes
energy_df = energy_df.rename(columns=humid_dict)

In [None]:
# Let's have a look at the description of the temperature columns :)
from google.colab import data_table                           # Importing the datatable extension of google colab.
data_table._DEFAULT_FORMATTERS[float] = lambda x: f"{x:.2f}"  # Formatting the float values of the attributes.
energy_df[list(temp_dict.values())].describe()                # Using the describe method of pandas on an explicit list of columns.



---


## **OBSERVATIONS :**

---



* ### **Average outside temperature over the 4.5 months is around 7.5 degrees. It ranges from -6 to 28 degrees.**
* ### **Average temperature inside the building has been around 20 degrees for all the rooms. It ranges from 14 to 30 degrees.**
* ### **This implies that heating appliances have been used to keep the inside of the building warm. There may be some correlation between temperature and energy consumption inside the house.**
---



In [None]:
# Let's draw some conclusions from humidity dict:
from google.colab import data_table                             # Importing the datatable extension of google colab.
data_table._DEFAULT_FORMATTERS[float] = lambda x: f"{x:.2f}"    # Formatting the float values of the attributes.
energy_df[humid_dict.values()].describe()                       # Using the describe method of pandas.




---


## **Observations:**

---


* ### **Average humidity outside the building has been higher than the average humidity inside.**
* ### **Average humidity at the weather station is significantly higher than the outside humidity near the building.**

* ### **Average humidity in the bathroom is significantly higher than in other rooms, for obvious reasons.**

* ### **The kids' and parents' rooms show comparatively higher average humidity as well, suggesting that the inhabitants of this building spend most of their time in these rooms.**

---



In [None]:
# Segregate the columns based on their category.
temp_cols =['temp_kitchen', 'temp_living', 'temp_laundry', 
    'temp_office', 'temp_bath', 'temp_outside',
    'temp_iron', 'temp_teen',  'temp_parents', 'temp_station']

humid_cols =['humid_kitchen','humid_living',  'humid_laundry', 
    'humid_office', 'humid_bath',  'humid_outside',
    'humid_iron', 'humid_teen', 'humid_parents', 'humid_station']

weather_cols =["Tdewpoint","Press_mm_hg","Windspeed","Visibility"]

light_cols = ["lights"]

random_cols = ["rv1", "rv2"]

date_time_cols = ['month', 'weekday', 'hour', 'week']

target = ["Appliances"]

In [None]:
numeric_features=energy_df[temp_cols+ humid_cols+ weather_cols+ light_cols+ random_cols+ target]

In [None]:
# Scatter plot of each feature against the target, with a fitted linear trend line
for col in numeric_features.columns[:-1]:   # the last column is the target itself
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = energy_df[col]
    label = energy_df['Appliances']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Appliances')
    ax.set_title('Appliances vs ' + col + ' - correlation: ' + str(correlation))
    # Least-squares straight line through the points
    z = np.polyfit(energy_df[col], energy_df['Appliances'], 1)
    y_hat = np.poly1d(z)(energy_df[col])
    plt.plot(energy_df[col], y_hat, "r--", lw=1)
    plt.show()
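
### The scatter plots report one correlation per figure. As a complement, the same numbers can be collected into a single sorted table. The sketch below uses synthetic data; `demo_df`, `feature_a`, `feature_b` and `target` are hypothetical names standing in for the notebook's feature columns and Appliances.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: feature_a drives the target, feature_b is pure noise
rng = np.random.default_rng(0)
n = 500
demo_df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
demo_df["target"] = 2.0 * demo_df["feature_a"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of every feature with the target, strongest (by absolute value) first
corr_with_target = (
    demo_df.drop(columns="target")
    .corrwith(demo_df["target"])
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```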

In [None]:
# Let's check the distribution of the data using histogram
energy_df[temp_cols+ humid_cols+ weather_cols+ light_cols+ random_cols+ target].hist(bins=50, figsize=(20,25))
plt.show()

### A histogram shows the distribution of the data. However, a normal probability plot is a more direct check of normality: it plots the ordered data against the theoretical quantiles of a normal distribution, so deviations from the straight line show where the data departs from normality. Below we use probability plots to see how each feature's distribution deviates from the theoretical one.

---



In [None]:
# Let's use normal probability plot for checking the normal distribution of data
for col in energy_df[temp_cols+ humid_cols+ weather_cols+ light_cols+ random_cols+ target].columns:
    stats.probplot(energy_df[col], dist='norm', plot=plt, fit=True)
    plt.title(col)
    plt.show()
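
### Reading many probability plots by eye can be tiring. One possible shortcut, sketched below on synthetic data, is to look at the correlation coefficient `r` that `scipy.stats.probplot` returns for the fitted line: values very close to 1 suggest the sample is close to normal. The helper name `probplot_r` and the two samples are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=20.0, scale=2.0, size=2000)       # roughly bell-shaped
skewed_sample = rng.lognormal(mean=4.0, sigma=0.6, size=2000)    # right-skewed

def probplot_r(values) -> float:
    """Correlation between the ordered sample and the theoretical normal quantiles."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return float(r)

print("normal sample r:", round(probplot_r(normal_sample), 4))
print("skewed sample r:", round(probplot_r(skewed_sample), 4))
```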


---


## **Observations**:

---



## **Features Near to Normal Distribution:**



### pressure, humid_iron, humid_teen, humid_parents , temp_station, humid_living,humid_laundry,humid_office,humid_bath, humid_kitchen, temp_outside, temp_iron, temp_teen, temp_kitchen, temp_living.







## **Skewed Features:**

### Appliances, visibility, windspeed, humid_station, Tdewpoint, temp_parents, temp_laundry, temp_office, temp_bath.




## **Randomly Distributed Features:**

### rv1, rv2, humid_outside


---



### Note: Our target variable is skewed. We will apply a transformation to bring it closer to a normal distribution. Some transformations that can make a feature more normal are:

1. Log
2. Exponential
3. square root
4. box-cox
5. reciprocal
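
### To compare such transformations with a number rather than by eye, one option is to compute the skewness after each one (values near 0 indicate a more symmetric distribution). The sketch below uses a synthetic right-skewed sample, `demo_target`, as a stand-in for Appliances; it is an illustration, not the notebook's result.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive, right-skewed sample (hypothetical stand-in for the target)
rng = np.random.default_rng(0)
demo_target = rng.lognormal(mean=4.0, sigma=0.6, size=5000)

# Box-Cox returns the transformed values and the fitted lambda
boxcox_values, boxcox_lambda = stats.boxcox(demo_target)

candidates = {
    "original": demo_target,
    "log": np.log(demo_target),
    "square_root": np.sqrt(demo_target),
    "reciprocal": 1.0 / demo_target,
    "box_cox": boxcox_values,
}

# Sample skewness for each candidate transformation
skewness = pd.Series({name: stats.skew(values) for name, values in candidates.items()})
print(skewness.round(3))
```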

---




In [None]:
target_var_original = energy_df[['Appliances']].copy()
# normality check
def normality(data,feature):
    plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    sns.kdeplot(data[feature])
    plt.subplot(1,2,2)
    stats.probplot(data[feature],plot=plt)
    plt.show()

In [None]:
# Checking the normality of the target variable using a KDE (kernel density estimation) line and a normal probability plot
normality(target_var_original,'Appliances')

In [None]:
# Performing the log transformation on target variable
target_var_original['log_transform']=np.log(target_var_original['Appliances'])
normality(target_var_original,'log_transform')
# Performing the reciprocal transformation on target variable
target_var_original['reciprocal_transform']=1/target_var_original.Appliances
normality(target_var_original,'reciprocal_transform')
# Performing the square root transformation on target variable
target_var_original['sqroot_transform']= np.sqrt(target_var_original.Appliances)
normality(target_var_original,'sqroot_transform')
# Performing the boxcox transformation
target_var_original['boxcox_transform'], parameters= stats.boxcox(target_var_original['Appliances'])
normality(target_var_original,'boxcox_transform')



---


## **Observation:**

---


### We observe that none of the transformations makes our target variable perfectly normal, but the log transformation still gives better results than the others. So we will apply the log transformation to the target variable.

---



In [None]:
# Applying the log transformation on target variable
energy_df = energy_df.copy()
energy_df['Log_Appliances'] = np.log(energy_df['Appliances'])
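
### A model trained on the log-transformed target predicts on the log scale, so predictions need `np.exp` to get back to Wh before being compared with the original Appliances values. A minimal sketch with made-up numbers (`actual_wh` and `log_predictions` are hypothetical):

```python
import numpy as np

actual_wh = np.array([50.0, 60.0, 100.0, 400.0])                 # hypothetical true values in Wh
log_predictions = np.log(np.array([55.0, 58.0, 120.0, 300.0]))   # hypothetical model outputs on the log scale

# Invert the log transformation to return to the original unit (Wh)
predicted_wh = np.exp(log_predictions)

# Root mean squared error expressed in Wh rather than in log units
rmse_wh = float(np.sqrt(np.mean((actual_wh - predicted_wh) ** 2)))
print(round(rmse_wh, 2))
```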

In [None]:
energy_df

In [None]:
# Appliance usage distribution before and after log transformation.
fig, ax = plt.subplots(1,2, figsize=(16,6))
sns.histplot(x='Appliances', data=energy_df, binwidth=20, ax=ax[0],kde=True,color='blue')
sns.histplot(x='Log_Appliances', data=energy_df, binwidth=0.09, ax=ax[1],kde=True,color='blue')
ax[0].set_title('Appliance Usage Distribution Before Log Transformation')
ax[1].set_title('Appliance Usage Distribution After Log Transformation')
plt.show()

In [None]:
# Checking the distribution of the lights feature
energy_df.lights.hist(bins=8)
plt.title("Distribution of Lights feature")
plt.show()

In [None]:
# Checking the value count of the lights features
energy_df.lights.value_counts()

### Since most of the values in the lights column are 0, it is unlikely to play much of a role in our model. Hence we drop the lights feature from our dataframe.
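
### One way to back this decision with a number is to compute the share of zero readings. The sketch below uses a small made-up series, `demo_lights`, rather than the real column.

```python
import pandas as pd

demo_lights = pd.Series([0, 0, 0, 10, 0, 20, 0, 0])  # hypothetical readings in Wh

# Comparing to 0 gives a boolean series; its mean is the fraction of zeros
zero_share = (demo_lights == 0).mean()
print(f"{zero_share:.0%} of readings are zero")

# The same information per distinct value
print(demo_lights.value_counts(normalize=True))
```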

---


In [None]:
# Dropping the column lights
energy_df = energy_df.drop('lights', axis=1)

In [None]:
# Converting the date feature from object type to datetime and extracting the weekday flag, hour and month
parsed_date = pd.to_datetime(energy_df['date'])
energy_df['weekday'] = (parsed_date.dt.dayofweek < 5).astype(int)   # 1 = Monday-Friday, 0 = Saturday/Sunday
energy_df['hour'] = parsed_date.dt.hour
energy_df['month'] = parsed_date.dt.month
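
### Note that the `weekday` column is a flag (1 for Monday to Friday, 0 for Saturday and Sunday), not the day number. The small check below, on three hand-picked timestamps in a hypothetical `demo` frame, illustrates how the flag, hour and month are derived.

```python
import pandas as pd

demo = pd.DataFrame({
    "date": ["2016-01-11 17:00:00", "2016-01-16 09:30:00", "2016-01-17 23:50:00"]
})
parsed = pd.to_datetime(demo["date"])

# dayofweek is 0 for Monday ... 6 for Sunday, so values below 5 are working days
demo["weekday"] = (parsed.dt.dayofweek < 5).astype(int)
demo["hour"] = parsed.dt.hour
demo["month"] = parsed.dt.month
demo["day_name"] = parsed.dt.day_name()   # added only to make the flag easy to verify
print(demo)
```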

In [None]:
# Plotting the Appliances energy consumption per hours of the day
fig, ax = plt.subplots(1,1,figsize=(18,10))
energy_df.groupby('hour').agg({'Appliances' : 'mean'}).plot.bar(ax=ax)
plt.title("Hours of Day vs Appliances energy consumption")
ax.set_xlabel("Hours of the day")
ax.set_ylabel('Appliance energy (Wh)')
plt.show()



---


## **Observations**

---


### The figure above shows the average energy consumption of appliances at different times of the day over a period of 4.5 months. We observe two peak hours: one at 11 AM and the other at 6 PM. The peak at 11 AM is shallow and low, while the peak at 6 PM is higher and sharper.

### Over the sleeping hours (10 PM - 6 AM), the energy consumption of appliances is around 50 Wh. After about 6 AM, consumption rises gradually until 11 AM (probably due to morning chores) and then gradually decreases to around 100 Wh at about 3 PM. After that, consumption shoots up until 6 PM (probably due to the need for lights in rooms). It then reverts to around 50 Wh as night approaches and people in the house go to bed at around 10 PM.

---



In [None]:
# Comparing appliances energy consumption on weekday and weekends
fig, ax = plt.subplots(1,2,figsize=(18,9))
week_df = energy_df.groupby(['weekday','hour']).agg({'Appliances':'mean'}).reset_index(0)
week_df[week_df.weekday==0].Appliances.plot.bar(ax=ax[0], label='weekends')
week_df[week_df.weekday==1].Appliances.plot.bar(ax=ax[1], label='weekdays')
ax[0].legend(loc='best')
ax[1].legend(loc='best')
ax[0].set_ylabel('Appliance Energy (Wh)')
ax[1].set_ylabel('Appliance Energy (Wh)')
plt.show()



---


## Observation:

---


### We observe that the energy consumption of appliances during office hours (8 AM - 4 PM) is higher on weekends than on weekdays. Average overall consumption is also higher on weekends.

### Next, let's look at how consumption varies by hour within each month.

---



In [None]:
# Plotting the appliances energy consumption per hour per month
fig, ax = plt.subplots(1,1,figsize=(25,5))
energy_df.groupby(['month','hour']).agg({'Appliances' : 'mean'}).plot.bar(ax=ax)
ax.set_ylabel('Appliance energy (Wh)')
plt.title("Appliances Energy Consumption per Hours per Month")
plt.show()

### The trend of high-consumption hours for each month seems to be similar to the overall trend.

---



### Let's look at how temperature and humidity levels vary inside different rooms!

---



In [None]:
# Checking the temperature level for each kind of rooms on an hourly basis
fig, axes = plt.subplots(2,5,figsize=(40,16))
for i, temp in enumerate(temp_dict.values()):
  energy_df.groupby('hour').agg({temp : 'mean'}).plot.bar(ax=axes[i//5, i%5])
  axes[i//5, i%5].legend(loc='best')
  axes[i//5, i%5].set_title(temp)



---


## Observations:

---


* ### The average temperature inside each of the rooms has been almost constant over the day.
* ### However, the average temperature outside the building and near the station changes over the course of the day.
* ### The average night-time temperature is around 6 °C, while the average daytime temperature varies over the hours and peaks at 12 °C at about 2-3 PM.

---



In [None]:
# Checking the trend of temperature over the course of 4.5 months
fig, axes = plt.subplots(2,5,figsize=(40,16))
for i, temp in enumerate(temp_dict.values()):
  energy_df.groupby('month').agg({temp : 'mean'}).plot.bar(ax=axes[i//5, i%5])
  axes[i//5, i%5].legend(loc='best')
  axes[i//5, i%5].set_title(temp)



---


## **Observations:**

---


* ### We observe a significant increasing trend in daytime outside temperatures over the five calendar months covered, from an average of 4 °C in month 1 to an average of 15 °C in month 5.

* ### The outside temperatures seem to have an impact on the temperature inside too, although the variance of temperatures inside the building is low, since the temperature inside is controlled. The increase in temperature, however, seems to have no impact on the appliance consumption patterns.

---



In [None]:
# Observing the humidity level of each hour a day
fig, axes = plt.subplots(2,5,figsize=(40,16))
for i, humid in enumerate(humid_dict.values()):
  energy_df.groupby('hour').agg({humid : 'mean'}).plot.bar(ax=axes[i//5, i%5])
  axes[i//5, i%5].legend(loc='best')
  axes[i//5, i%5].set_title(humid)

In [None]:
# Observing the humidity level over the course of 4.5 months
fig, axes = plt.subplots(2,5,figsize=(40,16))
for i, humid in enumerate(humid_dict.values()):
  energy_df.groupby('month').agg({humid : 'mean'}).plot.bar(ax=axes[i//5, i%5])
  axes[i//5, i%5].legend(loc='best')
  axes[i//5, i%5].set_title(humid)

* ### Although the humidity outside the building tends to decrease over the months, the humidity inside the rooms seems to be unaffected. The humidity levels outside seem to be negatively correlated with the temperature levels outside. Let's check!

In [None]:
energy_df[['temp_outside', 'humid_outside']].corr()

* ### Indeed, there is a strong negative correlation between temperature and humidity levels outside. As temperature increases, relative humidity decreases. We also observe that during the daytime, when temperatures are high, humidity levels are low.

In [None]:
# Checking the correlation between features and target variable
cols = list(temp_dict.values())
cols.extend(list(humid_dict.values()))
cols.extend(['Appliances'])
fig, ax = plt.subplots(1,1,figsize=(15,10))
sns.heatmap(energy_df[cols].corr(),ax=ax, annot=True)
plt.show()



---


# Observations :

---



* ### From the correlation heatmap we clearly observe that the temperature features are positively correlated among themselves, as are the humidity features, whereas the two groups have very little to no correlation with each other.

* ### Humidity outside has a strong negative correlation with temperature levels, as already discussed.


* ### Apart from that, we observe that a few features, such as humidity at the station, temperature outside the building and temperature in the living room, have a comparatively high absolute correlation (above 0.12) with Appliances energy consumption.

---



In [None]:
# let us plot the variation of energy consumption with these variables
fig, axes = plt.subplots(4,5,figsize = (50,40))
for i, col in enumerate(cols[:-1]):
  ax = axes[i//5, i%5]
  ax.scatter(energy_df[col], energy_df['Appliances'])
  ax.set_xlabel(col)
  ax.set_ylabel('Appliances')

## **Let's look at the dependence of appliance energy consumption on the newly created variables!**

In [None]:
# Checking the correlation between target variable and newly created variables.
fig, ax = plt.subplots(1,1,figsize=(10,8))
sns.heatmap(energy_df[['month', 'weekday', 'hour', 'Appliances']].corr(), annot=True, ax=ax)
plt.show()



---


**Observations**

---


* ### As observed earlier, there seems to be no correlation between month and the observed energy use, i.e. the energy consumption remains similar across all months.

* ### Similarly, there is no direct effect of weekdays on appliance energy consumption.

* ### There is, however, a correlation of 0.22 between hour and Appliances.

---



In [None]:
# Writing a simple function to create a new feature called session
def create_session(x):
  if x <= 6 or x >= 22:
    return 1
  elif x>6 and x <=15:
    return 2
  else:
    return 3

In [None]:
# Let's create a new column based on our observations
energy_df['session'] = energy_df['hour'].apply(create_session)
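
### The session rule can also be written without a row-by-row `apply`. The sketch below is one possible vectorized alternative using `np.select`, checked against the same rule as `create_session` for all 24 hours; `session_vectorized` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def create_session(hour: int) -> int:
    """Same rule as the notebook's helper: 1 = night, 2 = morning to mid-afternoon, 3 = evening."""
    if hour <= 6 or hour >= 22:
        return 1
    elif hour <= 15:
        return 2
    return 3

hours = pd.Series(range(24), name="hour")
hour_values = hours.to_numpy()

# np.select walks the conditions in order and takes the first one that matches
conditions = [(hour_values <= 6) | (hour_values >= 22), hour_values <= 15]
session_vectorized = pd.Series(np.select(conditions, [1, 2], default=3), index=hours.index)

# Both approaches should agree for every hour of the day
print((session_vectorized == hours.apply(create_session)).all())
```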

In [None]:
# let's plot a correlation plot
fig,ax = plt.subplots(1,1,figsize=(10,8))
sns.heatmap(energy_df[['session', 'Appliances']].corr(), ax = ax, annot=True)
plt.show()

In [None]:
# Plotting the appliances energy usages cross session
fig, ax = plt.subplots(1,1,figsize=(10,8))
sns.boxplot(x='session',y='Appliances',data=energy_df, ax = ax)
plt.title("Appliances energy consumption across different session")
plt.show()

### We were able to increase the correlation to 0.34 by creating this new column. We see a clear distinction in power consumption between sessions.

## Let's look at features related to weather as well.

In [None]:
# Again rechecking the correlation between features and target variable
fig,ax = plt.subplots(1,1,figsize=(20,15))
sns.heatmap(energy_df[weather_cols + cols].corr(), ax = ax, annot=True)
plt.show()



---


# **Observations**

---


### Tdewpoint shows a higher correlation with most of the temperature and humidity features than any other weather parameter. Pressure, windspeed and visibility show little to no correlation. We might need to include only these weakly correlated features in our final model.

---



## Let's now dive into reducing the temperature and humidity parameters through some feature engineering and come up with features that explain the maximum variability.

In [None]:
# Finding the correlation between the target variable and the mean indoor temperature
# (a list comprehension keeps the column order stable, unlike a set difference)
temp_cols = [c for c in temp_dict.values() if c not in {'temp_outside', 'temp_station'}]
energy_df['mean_temp'] = energy_df[temp_cols].mean(axis=1)
energy_df[['mean_temp', 'Appliances']].corr()

# **Observations**

---


### Since most of the temperature variables inside the rooms show little to no correlation with the target variable, let's try to find components that explain the maximum variance, which might improve the correlation with the target variable as well.

---



### Before doing PCA, we need to split the data into train and test sets and fit PCA on the train set only, so that no information from the test set leaks into the components.

In [None]:
# Creating a train test split for PCA
train_energy_df, test_energy_df = train_test_split(energy_df, test_size=0.2, random_state=1)

In [None]:
# Fitting PCA 
pca = PCA()
pca.fit(train_energy_df[temp_cols])
temp_pca = pca.transform(energy_df[temp_cols])
variance = pca.explained_variance_ratio_*100
fig, ax = plt.subplots(1,1,figsize=(10,8))
ax.bar(range(len(variance)), variance)
ax.plot(range(len(variance)), np.cumsum(variance),'r.-',linewidth=2, label='Cumulative variance %')
ax.set_xlabel('Principal components')
ax.set_ylabel('Explained variance %')
plt.legend(loc='best')
plt.show()

In [None]:
# Checking the variance
variance

### First two components seem to explain more than 91 % of variance in data.
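
### Instead of reading the number of components off the chart, it can be computed from the cumulative explained variance. The sketch below does this on synthetic data; `demo_rooms` is a hypothetical stand-in for the room temperature columns, and the 0.91 threshold is an assumed cut-off taken from the figure quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for correlated room readings: one shared signal plus room-level noise
shared_signal = rng.normal(size=(1000, 1))
demo_rooms = shared_signal + 0.5 * rng.normal(size=(1000, 8))

demo_pca = PCA().fit(demo_rooms)
cumulative = np.cumsum(demo_pca.explained_variance_ratio_)

threshold = 0.91  # assumed cut-off
# Index of the first component at which the cumulative ratio reaches the threshold, plus one
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative.round(3))
```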

In [None]:
# Adding new features of PCA in dataframe
for i in range(temp_pca.shape[1]):
    energy_df[f'temp_pca{i+1}'] = temp_pca[:,i]

In [None]:

# Observing the correlation of the PCA features with the target variable
fig,ax = plt.subplots(1,1,figsize=(10,8))
sns.heatmap(energy_df[['temp_pca1', 'temp_pca2', 'temp_pca3', 'temp_pca4','temp_pca5', 'temp_pca6', 'temp_pca7', 'temp_pca8', 'Appliances']].corr(), ax=ax,annot=True)
plt.show() 

In [None]:
# Let's look at the loadings of temp_pca8 (index 7 of the temperature PCA components)
dict(zip(temp_cols, pca.components_[7,:]))

In [None]:
# Indoor humidity columns only (a list comprehension keeps the column order stable)
humid_cols = [c for c in humid_dict.values() if c not in {'humid_outside', 'humid_station'}]
energy_df['mean_humid'] = energy_df[humid_cols].mean(axis=1)
energy_df[['mean_humid', 'Appliances']].corr()

In [None]:
# Performing pca on Humid features
pca = PCA()
pca.fit(train_energy_df[humid_cols])
humid_pca = pca.transform(energy_df[humid_cols])
variance = pca.explained_variance_ratio_*100
fig, ax = plt.subplots(1,1,figsize=(10,8))
ax.bar(range(len(variance)), variance)
ax.plot(range(len(variance)), np.cumsum(variance),'r.-',linewidth=2, label='Cumulative variance %')
ax.set_xlabel('Principal components')
ax.set_ylabel('Explained variance %')
plt.legend(loc='best')
plt.show()

In [None]:
# Adding Humid pca features to the dataframe
for i in range(humid_pca.shape[1]):
  energy_df[f'humid_pca{i+1}'] = humid_pca[:,i]

In [None]:
variance

In [None]:
# Observing the correlation with newly created humid pca with target variable
fig,ax = plt.subplots(1,1,figsize=(10,8))
sns.heatmap(energy_df[['humid_pca1', 'humid_pca2', 'humid_pca3', 'humid_pca4', 'humid_pca6', 'humid_pca7', 'humid_pca8', 'Log_Appliances']].corr(), ax = ax, annot=True)
plt.show()

In [None]:
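# Create the figure first so the scatter plot is drawn on it
plt.figure(figsize=(10,8))
plt.scatter(energy_df['humid_pca4'], energy_df['Appliances'])
plt.xlabel('humid_pca4')
plt.ylabel('Appliances')
plt.show()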

In [None]:
#Lets look at components of humid_pca4
dict(zip(humid_cols, pca.components_[3,:]))

In [None]:
# Checking the correlation between temperature difference and target variable
energy_df['diff_temp'] = energy_df['temp_outside'] - energy_df['mean_temp']
energy_df[['diff_temp', 'Appliances']].corr()

In [None]:
# Rechecking the log appliances
energy_df['log_Appliances'] = np.log(energy_df['Appliances'])
sns.histplot(energy_df.log_Appliances,bins=50,kde=True)
plt.show()


# Modeling With PCA Features

---



In [None]:
# Finalizing features
final_features = ['temp_pca1', 'temp_pca2', 'humid_pca1', 'humid_pca2', 'temp_outside', 'humid_outside', 'weekday', 'session', 'Windspeed', 'Press_mm_hg', 'log_Appliances']

In [None]:
# Checking the correlation among the finalised features
fig,ax = plt.subplots(1,1,figsize=(10,8))
sns.heatmap(energy_df[final_features].corr(), ax = ax, annot=True)
plt.show()

In [None]:
# Train test split
final_train_df, final_test_df = train_test_split(energy_df[final_features], test_size = 0.2, random_state = 1)


In [None]:
energy_df.head()

In [None]:
# Creating X_train, X_test, Y_train, Y_test.
X_train, y_train = final_train_df.drop(['log_Appliances'], axis=1), final_train_df['log_Appliances']
X_test, y_test = final_test_df.drop(['log_Appliances'], axis=1), final_test_df['log_Appliances']

In [None]:
# Standardization of features
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)



---


# The ML regressor models that we use are :

---



### Lasso Regressor

### Ridge Regressor

### KNeighbors Regressor

### Support Vector Regressor

### Random Forest Regressor

### Extra Tree Regressor

### Gradient Boosting Regressor

### XGB Regressor

### MLP Regressor

In [None]:
models=[
           ['Lasso: ', Lasso()],
           ['Ridge: ', Ridge()],
           ['KNeighborsRegressor: ',  neighbors.KNeighborsRegressor()],
           ['SVR:' , SVR(kernel='rbf')],
           ['RandomForest ',RandomForestRegressor()],
           ['ExtraTreeRegressor :',ExtraTreesRegressor()],
           ['GradientBoostingRegressor: ', GradientBoostingRegressor()] ,
           ['XGBRegressor: ', xgb.XGBRegressor()] ,
           ['MLPRegressor: ', MLPRegressor(  activation='relu', solver='adam',learning_rate='adaptive',max_iter=1000,learning_rate_init=0.01,alpha=0.01)]
         ]

In [None]:
# Evaluation of Models
model_data = []
for name, curr_model in models:
    curr_model_data = {}
    # Seed only the estimators that expose a random_state parameter
    if 'random_state' in curr_model.get_params():
        curr_model.set_params(random_state=78)
    curr_model_data["Name"] = name
    start = time.time()
    curr_model.fit(X_train, y_train)
    end = time.time()
    curr_model_data["Train_Time"] = end - start
    # Predict once on the test set and reuse it for both metrics
    y_test_pred = curr_model.predict(X_test)
    curr_model_data["Train_R2_Score"] = r2_score(y_train, curr_model.predict(X_train))
    curr_model_data["Test_R2_Score"] = r2_score(y_test, y_test_pred)
    curr_model_data["Test_RMSE_Score"] = np.sqrt(mean_squared_error(y_test, y_test_pred))
    model_data.append(curr_model_data)

In [None]:
# Storing the results of each model into a dataframe
results_df = pd.DataFrame(model_data)

In [None]:
results_df


In [None]:
# Evaluation of each model
results_df.plot.bar(x="Name", y=['Test_R2_Score' , 'Train_R2_Score' , 'Test_RMSE_Score'], title = 'Results' , width = .6, figsize= (20,8))
plt.show()



---


# OBSERVATIONS :

---



* ### Extra Tree Regressor performs the best so far, with an R2 score of 0.59 and an RMSE of 0.65.
* ### Lasso regression is the worst-performing model so far.
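
Rankings like the ones above can also be read off the results table directly instead of the bar chart. The sketch below is a hedged illustration on synthetic data with only two models. `demo_results` and `R2_Gap` are hypothetical names, not part of the notebook, and the numbers it prints have no relation to the scores reported above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only to make the sketch runnable
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

rows = []
for name, model in [("Lasso", Lasso()), ("Ridge", Ridge())]:
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    rows.append({
        "Name": name,
        "Train_R2_Score": r2_score(y_tr, model.predict(X_tr)),
        "Test_R2_Score": r2_score(y_te, y_pred),
        "Test_RMSE_Score": np.sqrt(mean_squared_error(y_te, y_pred)),
    })

demo_results = pd.DataFrame(rows)
# Gap between train and test R2: a large gap hints at overfitting
demo_results["R2_Gap"] = demo_results["Train_R2_Score"] - demo_results["Test_R2_Score"]
# Best model first
ranked = demo_results.sort_values("Test_R2_Score", ascending=False).reset_index(drop=True)
print(ranked)
```

Applying the same `sort_values` call to `results_df` would list the notebook's models from best to worst test R2.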

---



# Hyper-parameter Tuning

In [None]:
# Hyper-parameter Tuning
from sklearn.model_selection import GridSearchCV
param_grid = [{
              'max_depth': [80, 150, 200,250],
              'n_estimators' : [100,150,200,250],
              'max_features': [1.0, "sqrt", "log2"]  # 1.0 uses all features (the former "auto", removed in recent scikit-learn releases)
            }]
reg = ExtraTreesRegressor(random_state=40)
grid_search = GridSearchCV(estimator = reg, param_grid = param_grid, cv = 5, n_jobs = -1 , scoring='r2' , verbose=2)
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

In [None]:
grid_search.best_estimator_.score(X_train,y_train)

In [None]:
grid_search.best_estimator_.score(X_test,y_test)

In [None]:
np.sqrt(mean_squared_error(y_test, grid_search.best_estimator_.predict(X_test)))

In [None]:
X_test = pd.DataFrame(X_test, columns=final_features[:-1])
# Drop any existing index so y_test lines up row by row with the new X_test frame
y_test = pd.Series(np.asarray(y_test), name=final_features[-1])

In [None]:
session_errors = []
best_model = grid_search.best_estimator_
# Recover the original session labels (1-3) from the standardized column
session = np.round(X_test.session * sc_X.scale_[-3] + sc_X.mean_[-3]).values
for i in range(1, 4):
    mask = session == i
    actual = y_test[mask] * sc_y.scale_ + sc_y.mean_
    predicted = best_model.predict(X_test[mask]) * sc_y.scale_ + sc_y.mean_
    session_errors.append(np.sqrt(mean_squared_error(actual, predicted)))

In [None]:
plt.bar(x=['10 PM - 6 AM', '6 AM - 3 PM', '3 PM - 10 PM'], height=session_errors, width = 0.5)
plt.axhline(np.sqrt(mean_squared_error(y_test*sc_y.scale_ + sc_y.mean_, 
                                       grid_search.best_estimator_.predict(X_test)*sc_y.scale_ + sc_y.mean_)), 
            color='red', label='average RMSE')
plt.xlabel('Time of day')
plt.ylabel('RMSE of appliance energy (Wh)')
plt.legend(loc='best')
plt.show()



---


# Observations:

---


### When we look at the root mean squared errors in predicting appliance energy consumption at different times of the day, we observe that the errors for one of the time frames are well below the average RMSE of the entire test set. This is intuitive, since energy consumption showed little to no variance during those hours. However, the errors are above average for the other two time frames, where we had seen considerable variance in energy levels.
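
The per-session comparison described above can also be computed with a `groupby`, which avoids building one boolean mask per session. The following is a minimal sketch on synthetic data. The session labels, the noise levels, and the names `errors` and `rmse_by_session` are assumptions for illustration, not the notebook's values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 600
# Hypothetical session labels 1-3 standing in for the three time frames
session = rng.integers(1, 4, size=n)
# Illustrative assumption: session 1 has little variance, the others have more
noise_scale = np.where(session == 1, 5.0, 40.0)
actual = 100 + rng.normal(scale=noise_scale)
predicted = np.full(n, 100.0)  # a constant prediction keeps the sketch simple

errors = pd.DataFrame({"session": session, "sq_err": (actual - predicted) ** 2})
# RMSE per session = square root of the mean squared error within each group
rmse_by_session = np.sqrt(errors.groupby("session")["sq_err"].mean())
overall_rmse = np.sqrt(errors["sq_err"].mean())
print(rmse_by_session)
print(overall_rmse)
```

Because the overall MSE is a weighted average of the group MSEs, at least one session always lies at or above the overall RMSE, so a low-variance session falling below the average implies that others sit above it.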

---



In [None]:
fig, axes = plt.subplots(1,1,figsize=(20,5))
plt.plot(range(len(y_test[:200])), y_test[:200]*sc_y.scale_ + sc_y.mean_, label='actual')
plt.plot(range(len(y_test[:200])), grid_search.best_estimator_.predict(X_test.iloc[:200,:])*sc_y.scale_ + sc_y.mean_, label='predicted')
plt.legend(loc='best')
plt.ylabel('Appliance energy consumption (Wh)')
plt.xlabel('Samples')
plt.show()

In [None]:
# Rank features by importance, most important first
importances = grid_search.best_estimator_.feature_importances_
indices = np.argsort(importances)[::-1]
# Use the predictor names only (the last entry of final_features is the target)
names = [final_features[i] for i in indices]

plt.figure(figsize=(10,6))
plt.title("Feature Importance")
plt.bar(range(X_train.shape[1]), importances[indices])
plt.xticks(range(X_train.shape[1]), names, rotation=90)
plt.show()

# Modeling Without PCA features:
### Including all temperature and humidity features and engineered feature 'session' in our features set.

In [None]:
final_features = ['temp_laundry','temp_bath', 'temp_kitchen', 'temp_parents', 'temp_office', 'temp_living', 'temp_teen', 'temp_iron','humid_kitchen',
 'humid_office', 'humid_bath', 'humid_living', 'humid_parents', 'humid_laundry', 'humid_teen', 'humid_iron',
  'temp_outside', 'humid_outside', 'temp_station', 'humid_station', 'weekday', 'session', 'Windspeed', 'Press_mm_hg', 'Appliances']

In [None]:
final_train_df, final_test_df = train_test_split(energy_df[final_features], test_size = 0.2, random_state = 1)

In [None]:
X_train, y_train = final_train_df.drop('Appliances', axis=1), final_train_df['Appliances']
X_test, y_test = final_test_df.drop('Appliances', axis=1), final_test_df['Appliances']

In [None]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [None]:
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.values.reshape([-1,1])).flatten()
y_test = sc_y.transform(y_test.values.reshape([-1,1])).flatten()

In [None]:
models = [
           ['Lasso: ', Lasso()],
           ['Ridge: ', Ridge()],
           ['KNeighborsRegressor: ',  neighbors.KNeighborsRegressor()],
           ['SVR:' , SVR(kernel='rbf')],
           ['RandomForest ',RandomForestRegressor()],
           ['ExtraTreeRegressor :',ExtraTreesRegressor()],
           ['GradientBoostingRegressor: ', GradientBoostingRegressor()],
           ['XGBRegressor: ', xgb.XGBRegressor()],
           ['MLPRegressor: ', MLPRegressor(  activation='relu', solver='adam',learning_rate='adaptive',max_iter=1000,learning_rate_init=0.01,alpha=0.01)]
         ]

In [None]:
model_data = []
for name, curr_model in models:
    curr_model_data = {}
    # Seed only the estimators that expose a random_state parameter
    if 'random_state' in curr_model.get_params():
        curr_model.set_params(random_state=78)
    curr_model_data["Name"] = name
    start = time.time()
    curr_model.fit(X_train, y_train)
    end = time.time()
    curr_model_data["Train_Time"] = end - start
    # Predict once on the test set and reuse it for both metrics
    y_test_pred = curr_model.predict(X_test)
    curr_model_data["Train_R2_Score"] = r2_score(y_train, curr_model.predict(X_train))
    curr_model_data["Test_R2_Score"] = r2_score(y_test, y_test_pred)
    curr_model_data["Test_RMSE_Score"] = np.sqrt(mean_squared_error(y_test, y_test_pred))
    model_data.append(curr_model_data)

In [None]:
results_df = pd.DataFrame(model_data)


In [None]:
results_df


In [None]:
results_df.plot.bar(x="Name", y=['Test_R2_Score' , 'Train_R2_Score' , 'Test_RMSE_Score'], title = 'Results' , width = .6, figsize= (20,8))
plt.show()

## Hyper-parameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [{
              'max_depth': [80, 150, 200,250],
              'n_estimators' : [100,150,200,250],
              'max_features': [1.0, "sqrt", "log2"]  # 1.0 uses all features (the former "auto", removed in recent scikit-learn releases)
            }]
reg = ExtraTreesRegressor(random_state=40)
grid_search = GridSearchCV(estimator = reg, param_grid = param_grid, cv = 5, n_jobs = -1 , scoring='r2' , verbose=2)
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

In [None]:
grid_search.best_estimator_.score(X_train,y_train)

In [None]:
grid_search.best_estimator_.score(X_test,y_test)

In [None]:
np.sqrt(mean_squared_error(y_test, grid_search.best_estimator_.predict(X_test)))