<a href="https://colab.research.google.com/github/anasjy/appliance-energy-prediction/blob/main/Appliance_energy_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Regression**
# **All about data**
Data-driven prediction of energy use of appliances   

The data set is recorded at 10-minute intervals over about 4.5 months. The house temperature and humidity conditions
were monitored with a ZigBee wireless sensor network. Each wireless node transmitted the
temperature and humidity conditions roughly every 3.3 minutes. The wireless data was then averaged over
10-minute periods. The energy data was logged every 10 minutes with m-bus energy meters.
Weather data from the nearest airport weather station (Chievres Airport, Belgium) was downloaded
from a public data set from Reliable Prognosis (rp5.ru) and merged with the
experimental data sets using the date and time column. Two random variables have been
included in the data set for testing the regression models and for filtering out non-predictive attributes
(parameters).
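
As a rough illustration of the averaging and merging steps described above, the sketch below builds synthetic sensor readings every 200 seconds (about 3.3 minutes), averages them into 10-minute periods, and merges them with a synthetic 10-minute energy log on the timestamp. All values and the names `sensor`, `energy` and `merged` are made up for this example; it is not the actual pipeline used to build the data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic sensor readings every 200 s (about 3.3 min); values are illustrative only
sensor_index = pd.date_range("2016-01-11 17:00:00", periods=54, freq="200s")
sensor = pd.DataFrame(
    {"T1": 20 + rng.normal(0, 0.1, len(sensor_index))}, index=sensor_index
)

# Average the raw readings over 10-minute periods
sensor_10min = sensor.resample("10min").mean()

# Synthetic energy log with one reading every 10 minutes
energy = pd.DataFrame(
    {
        "date": pd.date_range("2016-01-11 17:00:00", periods=18, freq="10min"),
        "Appliances": rng.integers(1, 10, 18) * 10,
    }
)

# Merge the two sources on the date and time column
merged = energy.merge(sensor_10min, left_on="date", right_index=True, how="inner")
print(merged.head())
```

Each 10-minute row in `merged` then carries the mean of the three raw sensor readings that fall inside that period.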

# **Attribute Information**
Date: time (yr:mon:day:hr:min:sec)

Appliances: Energy use in Wh

lights: Energy use of light fixtures in the house (Wh)

T1: Temperature in kitchen area (°C), RH_1: Humidity in kitchen area (%)

T2: Temperature in living room area (°C), RH_2: Humidity in living room area (%)

T3: Temperature in laundry room area (°C), RH_3: Humidity in laundry room area (%)

T4: Temperature in office room (°C), RH_4: Humidity in office room (%)

T5: Temperature in bathroom (°C), RH_5: Humidity in bathroom (%)

T6: Temperature outside the building, north side (°C), RH_6: Humidity outside the building, north side (%)

T7: Temperature in ironing room (°C), RH_7: Humidity in ironing room (%)

T8: Temperature in teenager room 2 (°C), RH_8: Humidity in teenager room 2 (%)

T9: Temperature in parents room (°C), RH_9: Humidity in parents room (%)

T_out: Temperature outside, from Chievres weather station (°C)

Press_mm_hg: Pressure, from Chievres weather station (mm Hg)

RH_out: Humidity outside, from Chievres weather station (%)

Windspeed: Wind speed, from Chievres weather station (m/s)

Visibility: Visibility, from Chievres weather station (km)

Tdewpoint: Dew point temperature, from Chievres weather station (°C)

rv1: Random variable 1, nondimensional

rv2: Random variable 2, nondimensional

Where indicated, the weather data is hourly data (then interpolated) from the nearest airport weather station (Chievres Airport, Belgium), downloaded from a public data set from Reliable Prognosis (rp5.ru). Permission was obtained from Reliable Prognosis for the distribution of the 4.5 months of weather data.
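
The interpolation of hourly weather data onto the 10-minute grid can be sketched as follows. This is a minimal example assuming simple linear interpolation; the method actually used for the data set is not stated here. The 17:00 value matches the first `T_out` entry in the data preview further below, while the 18:00 and 19:00 values are assumed for illustration.

```python
import pandas as pd

# Hypothetical hourly outdoor temperatures (18:00 and 19:00 values are assumed)
hourly = pd.Series(
    [6.6, 5.9, 5.3],
    index=pd.date_range("2016-01-11 17:00:00", periods=3, freq="60min"),
    name="T_out",
)

# Upsample to a 10-minute grid and fill the new timestamps by linear interpolation
ten_min = hourly.resample("10min").interpolate(method="linear")
print(ten_min.head(7))
```

Between two hourly readings, each 10-minute step changes by one sixth of the hourly difference.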



# **Import Packages**
# First import the necessary packages and the dataset

In [2]:
import numpy as np
import pandas as pd
import math  # standard-library math; numpy.math was only an alias for it

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
sns.set(color_codes=True)



# **Exploratory Data Analysis**

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
slr = pd.read_csv('/content/drive/MyDrive/Capstone machine learning regression/data_application_energy.csv')

In [None]:
slr

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.890000,47.596667,19.200000,44.790000,19.790000,44.730000,19.000000,45.566667,17.166667,55.200000,7.026667,84.256667,17.200000,41.626667,18.2000,48.900000,17.033333,45.5300,6.600000,733.5,92.000000,7.000000,63.000000,5.300000,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.890000,46.693333,19.200000,44.722500,19.790000,44.790000,19.000000,45.992500,17.166667,55.200000,6.833333,84.063333,17.200000,41.560000,18.2000,48.863333,17.066667,45.5600,6.483333,733.6,92.000000,6.666667,59.166667,5.200000,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.890000,46.300000,19.200000,44.626667,19.790000,44.933333,18.926667,45.890000,17.166667,55.090000,6.560000,83.156667,17.200000,41.433333,18.2000,48.730000,17.000000,45.5000,6.366667,733.7,92.000000,6.333333,55.333333,5.100000,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.890000,46.066667,19.200000,44.590000,19.790000,45.000000,18.890000,45.723333,17.166667,55.090000,6.433333,83.423333,17.133333,41.290000,18.1000,48.590000,17.000000,45.4000,6.250000,733.8,92.000000,6.000000,51.500000,5.000000,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.890000,46.333333,19.200000,44.530000,19.790000,45.000000,18.890000,45.530000,17.200000,55.090000,6.366667,84.893333,17.200000,41.230000,18.1000,48.590000,17.000000,45.4000,6.133333,733.9,92.000000,5.666667,47.666667,4.900000,10.084097,10.084097
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19730,2016-05-27 17:20:00,100,0,25.566667,46.560000,25.890000,42.025714,27.200000,41.163333,24.700000,45.590000,23.200000,52.400000,24.796667,1.000000,24.500000,44.500000,24.7000,50.074000,23.200000,46.7900,22.733333,755.2,55.666667,3.333333,23.666667,13.333333,43.096812,43.096812
19731,2016-05-27 17:30:00,90,0,25.500000,46.500000,25.754000,42.080000,27.133333,41.223333,24.700000,45.590000,23.230000,52.326667,24.196667,1.000000,24.557143,44.414286,24.7000,49.790000,23.200000,46.7900,22.600000,755.2,56.000000,3.500000,24.500000,13.300000,49.282940,49.282940
19732,2016-05-27 17:40:00,270,10,25.500000,46.596667,25.628571,42.768571,27.050000,41.690000,24.700000,45.730000,23.230000,52.266667,23.626667,1.000000,24.540000,44.400000,24.7000,49.660000,23.200000,46.7900,22.466667,755.2,56.333333,3.666667,25.333333,13.266667,29.199117,29.199117
19733,2016-05-27 17:50:00,420,10,25.500000,46.990000,25.414000,43.036000,26.890000,41.290000,24.700000,45.790000,23.200000,52.200000,22.433333,1.000000,24.500000,44.295714,24.6625,49.518750,23.200000,46.8175,22.333333,755.2,56.666667,3.833333,26.166667,13.233333,6.322784,6.322784


In [None]:
slr.shape

(19735, 29)

In [None]:
slr.columns

Index(['date', 'Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3',
       'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8',
       'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed',
       'Visibility', 'Tdewpoint', 'rv1', 'rv2'],
      dtype='object')

In [None]:
slr.head()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,55.2,7.026667,84.256667,17.2,41.626667,18.2,48.9,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,55.2,6.833333,84.063333,17.2,41.56,18.2,48.863333,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,55.09,6.56,83.156667,17.2,41.433333,18.2,48.73,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,55.09,6.433333,83.423333,17.133333,41.29,18.1,48.59,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,55.09,6.366667,84.893333,17.2,41.23,18.1,48.59,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [None]:
slr.tail()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
19730,2016-05-27 17:20:00,100,0,25.566667,46.56,25.89,42.025714,27.2,41.163333,24.7,45.59,23.2,52.4,24.796667,1.0,24.5,44.5,24.7,50.074,23.2,46.79,22.733333,755.2,55.666667,3.333333,23.666667,13.333333,43.096812,43.096812
19731,2016-05-27 17:30:00,90,0,25.5,46.5,25.754,42.08,27.133333,41.223333,24.7,45.59,23.23,52.326667,24.196667,1.0,24.557143,44.414286,24.7,49.79,23.2,46.79,22.6,755.2,56.0,3.5,24.5,13.3,49.28294,49.28294
19732,2016-05-27 17:40:00,270,10,25.5,46.596667,25.628571,42.768571,27.05,41.69,24.7,45.73,23.23,52.266667,23.626667,1.0,24.54,44.4,24.7,49.66,23.2,46.79,22.466667,755.2,56.333333,3.666667,25.333333,13.266667,29.199117,29.199117
19733,2016-05-27 17:50:00,420,10,25.5,46.99,25.414,43.036,26.89,41.29,24.7,45.79,23.2,52.2,22.433333,1.0,24.5,44.295714,24.6625,49.51875,23.2,46.8175,22.333333,755.2,56.666667,3.833333,26.166667,13.233333,6.322784,6.322784
19734,2016-05-27 18:00:00,430,10,25.5,46.6,25.264286,42.971429,26.823333,41.156667,24.7,45.963333,23.2,52.2,21.026667,1.0,24.5,44.054,24.736,49.736,23.2,46.845,22.2,755.2,57.0,4.0,27.0,13.2,34.118851,34.118851


In [None]:
slr.describe()

Unnamed: 0,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
count,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0,19735.0
mean,97.694958,3.801875,21.686571,40.259739,20.341219,40.42042,22.267611,39.2425,20.855335,39.026904,19.592106,50.949283,7.910939,54.609083,20.267106,35.3882,22.029107,42.936165,19.485828,41.552401,7.411665,755.522602,79.750418,4.039752,38.330834,3.760707,24.988033,24.988033
std,102.524891,7.935988,1.606066,3.979299,2.192974,4.069813,2.006111,3.254576,2.042884,4.341321,1.844623,9.022034,6.090347,31.149806,2.109993,5.114208,1.956162,5.224361,2.014712,4.151497,5.317409,7.399441,14.901088,2.451221,11.794719,4.194648,14.496634,14.496634
min,10.0,0.0,16.79,27.023333,16.1,20.463333,17.2,28.766667,15.1,27.66,15.33,29.815,-6.065,1.0,15.39,23.2,16.306667,29.6,14.89,29.166667,-5.0,729.3,24.0,0.0,1.0,-6.6,0.005322,0.005322
25%,50.0,0.0,20.76,37.333333,18.79,37.9,20.79,36.9,19.53,35.53,18.2775,45.4,3.626667,30.025,18.7,31.5,20.79,39.066667,18.0,38.5,3.666667,750.933333,70.333333,2.0,29.0,0.9,12.497889,12.497889
50%,60.0,0.0,21.6,39.656667,20.0,40.5,22.1,38.53,20.666667,38.4,19.39,49.09,7.3,55.29,20.033333,34.863333,22.1,42.375,19.39,40.9,6.916667,756.1,83.666667,3.666667,40.0,3.433333,24.897653,24.897653
75%,100.0,0.0,22.6,43.066667,21.5,43.26,23.29,41.76,22.1,42.156667,20.619643,53.663333,11.256,83.226667,21.6,39.0,23.39,46.536,20.6,44.338095,10.408333,760.933333,91.666667,5.5,40.0,6.566667,37.583769,37.583769
max,1080.0,70.0,26.26,63.36,29.856667,56.026667,29.236,50.163333,26.2,51.09,25.795,96.321667,28.29,99.9,26.0,51.4,27.23,58.78,24.5,53.326667,26.1,772.3,100.0,14.0,66.0,15.5,49.99653,49.99653


In [None]:
slr.dtypes

date            object
Appliances       int64
lights           int64
T1             float64
RH_1           float64
T2             float64
RH_2           float64
T3             float64
RH_3           float64
T4             float64
RH_4           float64
T5             float64
RH_5           float64
T6             float64
RH_6           float64
T7             float64
RH_7           float64
T8             float64
RH_8           float64
T9             float64
RH_9           float64
T_out          float64
Press_mm_hg    float64
RH_out         float64
Windspeed      float64
Visibility     float64
Tdewpoint      float64
rv1            float64
rv2            float64
dtype: object

In [None]:
slr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19735 entries, 0 to 19734
Data columns (total 29 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         19735 non-null  object 
 1   Appliances   19735 non-null  int64  
 2   lights       19735 non-null  int64  
 3   T1           19735 non-null  float64
 4   RH_1         19735 non-null  float64
 5   T2           19735 non-null  float64
 6   RH_2         19735 non-null  float64
 7   T3           19735 non-null  float64
 8   RH_3         19735 non-null  float64
 9   T4           19735 non-null  float64
 10  RH_4         19735 non-null  float64
 11  T5           19735 non-null  float64
 12  RH_5         19735 non-null  float64
 13  T6           19735 non-null  float64
 14  RH_6         19735 non-null  float64
 15  T7           19735 non-null  float64
 16  RH_7         19735 non-null  float64
 17  T8           19735 non-null  float64
 18  RH_8         19735 non-null  float64
 19  T9  

In [None]:
# Find the duplicate rows present in this data; the printed shape is (duplicate rows, columns).

duplicate_rows = slr[slr.duplicated()]
print("Number of duplicate rows:",duplicate_rows.shape)



Number of duplicate rows: (0, 29)


# In this dataset there are no duplicate rows

# **Data Preparation and Cleaning**

In [None]:
slr["Appliances"].isnull().sum()

0

In [None]:
# Count null values per column
slr.isnull().sum().sort_values(ascending=False)

rv2            0
T6             0
Appliances     0
lights         0
T1             0
RH_1           0
T2             0
RH_2           0
T3             0
RH_3           0
T4             0
RH_4           0
T5             0
RH_5           0
RH_6           0
rv1            0
T7             0
RH_7           0
T8             0
RH_8           0
T9             0
RH_9           0
T_out          0
Press_mm_hg    0
RH_out         0
Windspeed      0
Visibility     0
Tdewpoint      0
date           0
dtype: int64

# As you can see there are no null values.

In [8]:
slr= slr.drop(['date'], axis=1)

# Reason for dropping date: we are not treating this as a time-series problem; instead we regress on the "Appliances" column.

# **Data Visualization**

Let's move on to extracting information about the data and working with it.

In [9]:
# 80% of the data is used for training the models and the rest is used for testing
train, test = train_test_split(slr,test_size=0.20,random_state=40)

In [None]:
train.describe()

In [13]:
## Divide the columns based on type for clear column management 

col_temp = ["T1","T2","T3","T4","T5","T6","T7","T8","T9"]

col_hum = ["RH_1","RH_2","RH_3","RH_4","RH_5","RH_6","RH_7","RH_8","RH_9"]

col_weather = ["T_out", "Tdewpoint","RH_out","Press_mm_hg",
                "Windspeed","Visibility"] 
col_light = ["lights"]
col_randoms = ["rv1", "rv2"]

col_target = ["Appliances"]

In [14]:
# Separate dependent and independent variables
feature_vars = train[col_temp + col_hum + col_weather + col_light + col_randoms ]
target_vars = train[col_target]

In [None]:
feature_vars.describe()

In [None]:
# Check the distribution of values in lights column
feature_vars.lights.value_counts()

In [None]:
target_vars.describe()

# **Observations**
Temperature columns - Temperature inside the house varies between 14.89 °C and 29.85 °C, while the temperature outside (T6) varies between -6.06 °C and 28.29 °C. T6 has the wider range because its sensor is kept outside the house.

Humidity columns - Humidity inside the house varies between 20.60% and 63.36%, with the exception of RH_5 (bathroom) and RH_6 (outside the house), which vary between 29.82% and 96.32% and between 1% and 99.9% respectively.

Appliances - 75% of the Appliances readings are at or below 100 Wh. With a maximum consumption of 1080 Wh, this column has outliers: a small number of cases where consumption is very high.

Lights column - Initially I believed the lights column would give useful information: light consumption, together with the humidity level in a room, could indicate human presence in the room and hence its impact on appliance consumption. However, with 11438 zero entries in 14801 rows, this column will not add much value to the model, so for now I will be dropping it.
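
To put the zero count quoted above into proportion, the short sketch below works out the share of zero readings from the two numbers in the observation, and shows the equivalent pandas expression on a small made-up Series (`toy_lights` is hypothetical and not part of the data set).

```python
import pandas as pd

# Counts quoted in the observation above
zero_entries = 11438
total_rows = 14801

zero_share = zero_entries / total_rows
print(f"Share of zero readings in lights: {zero_share:.1%}")

# The same measure computed directly on a column (toy values for illustration)
toy_lights = pd.Series([0, 0, 0, 10, 0, 20, 0, 0])
toy_share = (toy_lights == 0).mean()
print(f"Share of zeros in the toy column: {toy_share:.1%}")
```

Roughly three out of four rows carry no light consumption at all, which is the basis for setting the column aside.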

In [12]:
# Due to the large number of zero entries this column is of little use and is ignored in the rest of the model
slr.drop(['lights'], axis=1, inplace=True)

In [None]:
feature_vars.head(2)

# **Data Visualization**

In [None]:
# Histogram of all the features to understand the distribution
feature_vars.hist(bins = 20 , figsize= (12,16),color='green') ;

### Focused distribution plots for RH_6, RH_out, rv1, rv2, Visibility and Windspeed, due to their irregular distributions

### An interactive library such as Plotly helps us visualize data better, as it allows us to zoom in on the distribution and hover over the plot to read the values along each axis. The plots below are static seaborn charts.

In [None]:
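# Focused distribution plots for RH_6, RH_out, rv1, rv2, Visibility and Windspeed due to irregular distribution
# histplot with kde=True and stat="density" draws a density histogram with a KDE curve
f, ax = plt.subplots(2,3,figsize=(12,8))
vis1 = sns.histplot(feature_vars["RH_6"], bins=10, kde=True, stat="density", ax=ax[0][0])
vis2 = sns.histplot(feature_vars["RH_out"], bins=10, kde=True, stat="density", ax=ax[0][1])
vis3 = sns.histplot(feature_vars["Visibility"], bins=10, kde=True, stat="density", ax=ax[1][0])
vis4 = sns.histplot(feature_vars["Windspeed"], bins=10, kde=True, stat="density", ax=ax[1][1])
vis5 = sns.histplot(feature_vars["rv1"], bins=10, kde=True, stat="density", ax=ax[0][2])
vis6 = sns.histplot(feature_vars["rv2"], bins=10, kde=True, stat="density", ax=ax[1][2])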

In [None]:
# Distribution of values in Appliances column
target_vars.hist(color='green',bins=10)

In [None]:
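sns.set(rc={'figure.figsize':(12,6)})
# Plot the Appliances column as a density histogram with a KDE curve, then label the axes
sns.histplot(target_vars["Appliances"], color="r", bins=10, kde=True, stat="density")
plt.xlabel('Appliance consumption in Wh')
plt.ylabel('Density')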

In [None]:
# Check the distribution of values in Appliances column
slr['Appliances'].value_counts().head(15)

# **Observations**
Temperature - All the columns follow a roughly normal distribution except T9  
Humidity - All columns follow a roughly normal distribution except RH_6 and RH_out, primarily because these sensors are outside the house  
Appliances - This column is positively skewed; most of the values lie around the mean of about 100 Wh, and there are outliers in this column  
Visibility - This column is negatively skewed  
Windspeed - This column is positively skewed  
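
The skewness read off the histograms can also be quantified with `DataFrame.skew()`. The sketch below uses synthetic columns (not the notebook's data) purely to show how the sign of the statistic maps to the shapes described above: positive for a long right tail, negative for a long left tail, and near zero for a symmetric distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns with known shapes, for illustration only
toy = pd.DataFrame(
    {
        "right_skewed": rng.exponential(scale=1.0, size=5000),   # long right tail
        "left_skewed": -rng.exponential(scale=1.0, size=5000),   # long left tail
        "symmetric": rng.normal(size=5000),                      # bell-shaped
    }
)

# Sample skewness per column
skewness = toy.skew()
print(skewness.round(2))
```

Applied to the notebook's data, the same call would be `feature_vars.skew()` and `target_vars.skew()`.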

In [None]:
#Appliance column range with consumption less than 200 Wh
print('Percentage of the appliance consumption is less than 200 Wh')
print(((target_vars[target_vars <= 200].count()) / (len(target_vars)))*100 )

Percentage of the appliance consumption is less than 200 Wh
Appliances    90.315429
dtype: float64
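
The 200 Wh cut-off above is a round number chosen by eye. One common convention for flagging outliers is Tukey's 1.5 × IQR rule; it is shown here only as an assumed alternative, not as something this notebook applies. The quartiles are the Appliances values from the `slr.describe()` output earlier (25% = 50 Wh, 75% = 100 Wh).

```python
# Appliances quartiles taken from the slr.describe() output
q1, q3 = 50.0, 100.0

# Interquartile range and Tukey fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr} Wh")
print(f"Values above {upper_fence} Wh would be flagged as high outliers")
```

Under this rule the upper fence is 175 Wh, slightly stricter than the 200 Wh threshold used above; the lower fence is negative and therefore never reached by this column.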


# **Correlation Plots**

In [None]:
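# Correlation of all columns in the data

sns.set(rc={'figure.figsize':(15,12)})
# Keep the correlation matrix in its own variable so the slr DataFrame is not overwritten
corr_all = slr.corr().round(2)
sns.heatmap(data=corr_all, annot=True,cmap="vlag")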

In [None]:
# Use the weather, temperature, humidity, Appliances and random columns to see the correlation

train_corr = train[col_temp + col_hum + col_weather +col_target+col_randoms]
corr = train_corr.corr()
# Mask the repeated values (upper triangle, including the diagonal)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(16, 14))
# Generate heat map, allow annotations and place floats in map
sns.heatmap(corr, annot=True, fmt=".2f", mask=mask)
# Apply xticks at the cell centres
plt.xticks(np.arange(len(corr.columns)) + 0.5, corr.columns)
# Apply yticks at the cell centres
plt.yticks(np.arange(len(corr.columns)) + 0.5, corr.columns)
# Show plot
plt.show()

In [None]:
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

# Function to get top correlations 

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(train_corr, 40))

In [None]:
# Correlation matrix of the dataset, skipping the first column
slr.iloc[:,1:].corr()

# **Observations based on correlation plot**

1. Temperature - All the temperature variables from T1-T9 and T_out have positive correlation with the target Appliances. For the indoor temperatures, the correlations between rooms are high as expected, since the ventilation is driven by the heat recovery ventilation (HRV) unit, which minimizes air temperature differences between rooms. Four columns have a high degree of correlation with T9 (T3, T5, T7 and T8), and T6 and T_out are also highly correlated (both are outside temperatures). Hence T6 and T9 can be removed from the training set, as the information they provide is available from other fields.

2. Weather attributes - Visibility, Tdewpoint and Press_mm_hg have low correlation values

3. Humidity - There are no significantly high correlation cases (> 0.9) for humidity sensors.

4. Random variables - rv1 and rv2 have no role to play
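
The reasoning in the observations above (drop one feature from each highly correlated pair, and confirm that a random column carries no signal) can be sketched on synthetic data. The column names `T_a`, `T_b`, `RH_a`, `rv` and the 0.9 threshold are assumptions for this example; the notebook itself selects the columns to drop by inspecting the heat map.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)

# Synthetic features: T_b is nearly a copy of T_a, rv is pure noise
toy = pd.DataFrame(
    {
        "T_a": base,
        "T_b": base + rng.normal(scale=0.05, size=n),
        "RH_a": rng.normal(size=n),
        "rv": rng.uniform(0, 50, size=n),
    }
)
toy["target"] = 2.0 * toy["T_a"] + rng.normal(scale=0.5, size=n)

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"
features = toy.drop(columns="target")
corr = features.corr().abs()

# Keep only the strict upper triangle so every pair is looked at once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
print("Redundant columns:", to_drop)

# Correlation of every column with the target; the random column stays near zero
print(toy.corr()["target"].round(2))
```

With this ordering the later column of each correlated pair is the one dropped; which member of a pair to keep is a judgment call.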


# **Data Preprocessing**


# **Splitting the data into training and testing dataset**

In [21]:
# Due to the large number of zero entries the lights column is of little use and is ignored in the rest of the model
feature_vars = feature_vars.drop(['lights'], axis=1)

In [22]:
# Split training dataset into independent and dependent variables
train_X = train[feature_vars.columns]
train_y = train[target_vars.columns]

In [32]:
# Split testing dataset into independent and dependent variables
test_X = test[feature_vars.columns]
test_y = test[target_vars.columns]

In [None]:
# Based on the conclusions above, the columns below are removed
train_X = train_X.drop(["rv1","rv2","Visibility","T6","T9"], axis=1)

In [34]:
# Based on the conclusions above, the columns below are removed
test_X = test_X.drop(["rv1","rv2","Visibility","T6","T9"], axis=1)

In [25]:
train_X.columns

Index(['T1', 'T2', 'T3', 'T4', 'T5', 'T7', 'T8', 'RH_1', 'RH_2', 'RH_3',
       'RH_4', 'RH_5', 'RH_6', 'RH_7', 'RH_8', 'RH_9', 'T_out', 'Tdewpoint',
       'RH_out', 'Press_mm_hg', 'Windspeed'],
      dtype='object')

In [35]:
test_X.columns

Index(['T1', 'T2', 'T3', 'T4', 'T5', 'T7', 'T8', 'RH_1', 'RH_2', 'RH_3',
       'RH_4', 'RH_5', 'RH_6', 'RH_7', 'RH_8', 'RH_9', 'T_out', 'Tdewpoint',
       'RH_out', 'Press_mm_hg', 'Windspeed'],
      dtype='object')

In [39]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()

# Create test and training set by including Appliances column.
# Both frames are selected from the full slr DataFrame here, so they hold the same rows.

train = slr[list(train_X.columns.values) + col_target ]

test = slr[list(test_X.columns.values) + col_target ]

# Fit the scaler on the training frame and keep the scaled values in a new DataFrame

sc_train = pd.DataFrame(sc.fit_transform(train), columns=train.columns, index=train.index)

# Scale the test frame with the statistics fitted on the training frame

sc_test = pd.DataFrame(sc.transform(test), columns=test.columns, index=test.index)
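
A held-out evaluation normally keeps the training and test rows disjoint and fits the scaler on the training rows only. The sketch below shows that pattern on synthetic data; the frame `toy` and the names `toy_train`, `toy_test` and `scaler` are hypothetical and stand in for the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned dataset (values are illustrative only)
toy = pd.DataFrame(
    {
        "T1": rng.normal(21, 1.5, 500),
        "RH_1": rng.normal(40, 4, 500),
        "Appliances": rng.integers(1, 50, 500) * 10,
    }
)

# Split first, so the two frames share no rows
toy_train, toy_test = train_test_split(toy, test_size=0.20, random_state=40)

# Fit the scaler on the training rows only, then reuse it for the test rows
scaler = StandardScaler()
sc_toy_train = pd.DataFrame(
    scaler.fit_transform(toy_train), columns=toy_train.columns, index=toy_train.index
)
sc_toy_test = pd.DataFrame(
    scaler.transform(toy_test), columns=toy_test.columns, index=toy_test.index
)

print(sc_toy_train.mean().round(3))  # training columns are centred at zero
print(sc_toy_test.mean().round(3))   # test columns are close to, but not exactly, zero
```

Because the test rows are scaled with the training mean and standard deviation, no information from the test set reaches the fitted scaler.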

In [40]:
sc_train.head()

Unnamed: 0,T1,T2,T3,T4,T5,T7,T8,RH_1,RH_2,RH_3,RH_4,RH_5,RH_6,RH_7,RH_8,RH_9,T_out,Tdewpoint,RH_out,Press_mm_hg,Windspeed,Appliances
0,-1.118645,-0.520411,-1.235063,-0.908217,-1.314903,-1.453646,-1.957509,1.843821,1.073683,1.68613,1.506438,0.47116,0.951798,1.219861,1.141572,0.958136,-0.152647,0.366975,0.82208,-2.976328,1.207694,-0.367676
1,-1.118645,-0.520411,-1.235063,-0.908217,-1.314903,-1.453646,-1.957509,1.616807,1.057097,1.704566,1.604528,0.47116,0.945592,1.206825,1.134554,0.965363,-0.174588,0.343135,0.82208,-2.962813,1.071703,-0.367676
2,-1.118645,-0.520411,-1.235063,-0.944115,-1.314903,-1.453646,-1.957509,1.517959,1.03355,1.748608,1.580918,0.458968,0.916484,1.182057,1.109032,0.95091,-0.196529,0.319294,0.82208,-2.949298,0.935713,-0.465215
3,-1.118645,-0.520411,-1.235063,-0.962063,-1.314903,-1.485243,-2.008631,1.459321,1.02454,1.769092,1.542526,0.458968,0.925045,1.15403,1.082233,0.926821,-0.21847,0.295454,0.82208,-2.935783,0.799723,-0.465215
4,-1.118645,-0.520411,-1.235063,-0.962063,-1.296832,-1.453646,-2.008631,1.526336,1.009797,1.769092,1.497991,0.458968,0.972238,1.142298,1.082233,0.926821,-0.240411,0.271613,0.82208,-2.922268,0.663733,-0.367676


In [41]:
sc_test.head()

Unnamed: 0,T1,T2,T3,T4,T5,T7,T8,RH_1,RH_2,RH_3,RH_4,RH_5,RH_6,RH_7,RH_8,RH_9,T_out,Tdewpoint,RH_out,Press_mm_hg,Windspeed,Appliances
0,-1.118645,-0.520411,-1.235063,-0.908217,-1.314903,-1.453646,-1.957509,1.843821,1.073683,1.68613,1.506438,0.47116,0.951798,1.219861,1.141572,0.958136,-0.152647,0.366975,0.82208,-2.976328,1.207694,-0.367676
1,-1.118645,-0.520411,-1.235063,-0.908217,-1.314903,-1.453646,-1.957509,1.616807,1.057097,1.704566,1.604528,0.47116,0.945592,1.206825,1.134554,0.965363,-0.174588,0.343135,0.82208,-2.962813,1.071703,-0.367676
2,-1.118645,-0.520411,-1.235063,-0.944115,-1.314903,-1.453646,-1.957509,1.517959,1.03355,1.748608,1.580918,0.458968,0.916484,1.182057,1.109032,0.95091,-0.196529,0.319294,0.82208,-2.949298,0.935713,-0.465215
3,-1.118645,-0.520411,-1.235063,-0.962063,-1.314903,-1.485243,-2.008631,1.459321,1.02454,1.769092,1.542526,0.458968,0.925045,1.15403,1.082233,0.926821,-0.21847,0.295454,0.82208,-2.935783,0.799723,-0.465215
4,-1.118645,-0.520411,-1.235063,-0.962063,-1.296832,-1.453646,-2.008631,1.526336,1.009797,1.769092,1.497991,0.458968,0.972238,1.142298,1.082233,0.926821,-0.240411,0.271613,0.82208,-2.922268,0.663733,-0.367676


In [42]:
# Separate the Appliances target from the features in the training and test sets

train_X =  sc_train.drop(['Appliances'] , axis=1)
train_y = sc_train['Appliances']

test_X =  sc_test.drop(['Appliances'] , axis=1)
test_y = sc_test['Appliances']

In [43]:
train_X.head()

Unnamed: 0,T1,T2,T3,T4,T5,T7,T8,RH_1,RH_2,RH_3,RH_4,RH_5,RH_6,RH_7,RH_8,RH_9,T_out,Tdewpoint,RH_out,Press_mm_hg,Windspeed
0,-1.118645,-0.520411,-1.235063,-0.908217,-1.314903,-1.453646,-1.957509,1.843821,1.073683,1.68613,1.506438,0.47116,0.951798,1.219861,1.141572,0.958136,-0.152647,0.366975,0.82208,-2.976328,1.207694
1,-1.118645,-0.520411,-1.235063,-0.908217,-1.314903,-1.453646,-1.957509,1.616807,1.057097,1.704566,1.604528,0.47116,0.945592,1.206825,1.134554,0.965363,-0.174588,0.343135,0.82208,-2.962813,1.071703
2,-1.118645,-0.520411,-1.235063,-0.944115,-1.314903,-1.453646,-1.957509,1.517959,1.03355,1.748608,1.580918,0.458968,0.916484,1.182057,1.109032,0.95091,-0.196529,0.319294,0.82208,-2.949298,0.935713
3,-1.118645,-0.520411,-1.235063,-0.962063,-1.314903,-1.485243,-2.008631,1.459321,1.02454,1.769092,1.542526,0.458968,0.925045,1.15403,1.082233,0.926821,-0.21847,0.295454,0.82208,-2.935783,0.799723
4,-1.118645,-0.520411,-1.235063,-0.962063,-1.296832,-1.453646,-2.008631,1.526336,1.009797,1.769092,1.497991,0.458968,0.972238,1.142298,1.082233,0.926821,-0.240411,0.271613,0.82208,-2.922268,0.663733


In [44]:
train_y.head()

0   -0.367676
1   -0.367676
2   -0.465215
3   -0.465215
4   -0.367676
Name: Appliances, dtype: float64

# We will be looking at the following algorithms

# Improved linear regression models

# 1. Ridge regression

# 2. Lasso regression