# How Do Weather Conditions Impact Electricity Consumption?

In this notebook, we use a dataset from a now-closed Analytics Vidhya hackathon to predict hourly electricity consumption in the fictitious country of Electrovania from weather-related factors.

## Import Packages

In [2]:
# Exploratory Data Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Exploratory Data Analysis

### Reading in the Data

The training and testing data were originally split into separate CSV files as part of Analytics Vidhya's hackathon. For this analysis, however, I will rejoin them and split them randomly with Scikit-learn, in order to introduce more randomness into the experiments.

In [7]:
# two separate DataFrames
df1 = pd.read_csv('Electricity_Consumption/train.csv')
df2 = pd.read_csv('Electricity_Consumption/test.csv')
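
As a rough illustration of the random split described above, the sketch below applies Scikit-learn's `train_test_split` to a toy DataFrame built from the first five rows of the training file. The name `toy` and the values `test_size=0.2` and `random_state=42` are assumptions chosen for illustration, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the labeled data: the first five rows of train.csv
toy = pd.DataFrame({
    "temperature": [-11.4, -12.1, -12.9, -11.4, -11.4],
    "pressure": [1003.0, 996.0, 1000.0, 995.0, 1005.0],
    "windspeed": [571.91, 575.04, 578.435, 582.58, 586.6],
    "electricity_consumption": [216.0, 210.0, 225.0, 216.0, 222.0],
})

# Separate the features from the target variable
X = toy.drop(columns=["electricity_consumption"])
y = toy["electricity_consumption"]

# test_size and random_state are illustrative assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (4, 3) (1, 3)
```

Fixing `random_state` keeps a given split reproducible, while changing it yields a different random partition for each experiment.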

In [8]:
df1.head()

Unnamed: 0,ID,datetime,temperature,var1,pressure,windspeed,var2,electricity_consumption
0,0,2013-07-01 00:00:00,-11.4,-17.1,1003.0,571.91,A,216.0
1,1,2013-07-01 01:00:00,-12.1,-19.3,996.0,575.04,A,210.0
2,2,2013-07-01 02:00:00,-12.9,-20.0,1000.0,578.435,A,225.0
3,3,2013-07-01 03:00:00,-11.4,-17.1,995.0,582.58,A,216.0
4,4,2013-07-01 04:00:00,-11.4,-19.3,1005.0,586.6,A,222.0


In [11]:
df2.tail()

Unnamed: 0,ID,datetime,temperature,var1,pressure,windspeed,var2
8563,35059,2017-06-30 19:00:00,-5.7,-18.6,998.0,233.595,A
8564,35060,2017-06-30 20:00:00,-5.7,-17.1,995.0,238.78,A
8565,35061,2017-06-30 21:00:00,-7.1,-19.3,1004.0,244.325,A
8566,35062,2017-06-30 22:00:00,-6.4,-19.3,1008.0,247.47,A
8567,35063,2017-06-30 23:00:00,-5.0,-16.4,1001.0,250.6,A


### Combining DataFrames into One

From the above, we can see that:

- both DataFrames share the same feature columns
- each sample appears in only one of the two DataFrames
- however, the testing DataFrame lacks the 'electricity_consumption' column, which is our target variable!

Therefore, we simply want to append one DataFrame to the other. This looks best-suited for an "outer" merge in pandas!
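
The observations above can also be checked programmatically rather than by eye. The following is a minimal sketch on toy stand-ins for `df1` and `df2`, built from rows shown earlier and restricted to a few columns; the names `train_toy` and `test_toy` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df1 and df2, using rows shown above (subset of columns)
train_toy = pd.DataFrame({
    "ID": [0, 1, 2],
    "datetime": ["2013-07-01 00:00:00", "2013-07-01 01:00:00", "2013-07-01 02:00:00"],
    "temperature": [-11.4, -12.1, -12.9],
    "electricity_consumption": [216.0, 210.0, 225.0],
})
test_toy = pd.DataFrame({
    "ID": [35061, 35062, 35063],
    "datetime": ["2017-06-30 21:00:00", "2017-06-30 22:00:00", "2017-06-30 23:00:00"],
    "temperature": [-7.1, -6.4, -5.0],
})

# Columns that exist in one frame but not the other
only_in_train = set(train_toy.columns) - set(test_toy.columns)
only_in_test = set(test_toy.columns) - set(train_toy.columns)

# IDs that appear in both frames, and whether IDs repeat within a frame
shared_ids = set(train_toy["ID"]) & set(test_toy["ID"])
ids_unique = train_toy["ID"].is_unique and test_toy["ID"].is_unique

print(only_in_train)  # {'electricity_consumption'}
print(only_in_test)   # set()
print(shared_ids)     # set()
print(ids_unique)     # True
```

Running the same four checks on the full `df1` and `df2` would confirm the observations for every row, not just the ones displayed.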

In [13]:
# Merge the DataFrames
df = pd.merge(df1, df2, how='outer')
# see if the head matches the head of df1
df.head()

Unnamed: 0,ID,datetime,temperature,var1,pressure,windspeed,var2,electricity_consumption
0,0,2013-07-01 00:00:00,-11.4,-17.1,1003.0,571.91,A,216.0
1,1,2013-07-01 01:00:00,-12.1,-19.3,996.0,575.04,A,210.0
2,2,2013-07-01 02:00:00,-12.9,-20.0,1000.0,578.435,A,225.0
3,3,2013-07-01 03:00:00,-11.4,-17.1,995.0,582.58,A,216.0
4,4,2013-07-01 04:00:00,-11.4,-19.3,1005.0,586.6,A,222.0


In [12]:
# AND, check to see if the bottom of the new df == bottom of df2!
df.tail()

Unnamed: 0,ID,datetime,temperature,var1,pressure,windspeed,var2,electricity_consumption
35059,35059,2017-06-30 19:00:00,-5.7,-18.6,998.0,233.595,A,
35060,35060,2017-06-30 20:00:00,-5.7,-17.1,995.0,238.78,A,
35061,35061,2017-06-30 21:00:00,-7.1,-19.3,1004.0,244.325,A,
35062,35062,2017-06-30 22:00:00,-6.4,-19.3,1008.0,247.47,A,
35063,35063,2017-06-30 23:00:00,-5.0,-16.4,1001.0,250.6,A,
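
For stacking rows from two frames that share columns, `pd.concat` is a common alternative to an outer merge. The sketch below shows that approach on toy stand-ins for `df1` and `df2`. It is not the method used in this notebook, and the `source` column and the `train_toy`/`test_toy` names are assumptions added for illustration.

```python
import pandas as pd

# Toy stand-ins for df1 and df2, using rows shown above (subset of columns)
train_toy = pd.DataFrame({
    "ID": [0, 1, 2],
    "temperature": [-11.4, -12.1, -12.9],
    "electricity_consumption": [216.0, 210.0, 225.0],
})
test_toy = pd.DataFrame({
    "ID": [35061, 35062, 35063],
    "temperature": [-7.1, -6.4, -5.0],
})

# Stack the rows; the hypothetical 'source' column records each row's origin
combined = pd.concat(
    [train_toy.assign(source="train"), test_toy.assign(source="test")],
    ignore_index=True,
)

# Rows that came from the test file have no target value
n_missing = combined["electricity_consumption"].isna().sum()
print(combined.shape)  # (6, 4)
print(n_missing)       # 3
```

Keeping a column such as `source` makes it easy to separate labeled from unlabeled rows later, since only the labeled rows can be used to fit or evaluate a supervised model.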


### Removing Columns

The 'ID', 'var1', and 'var2' columns came with no description, so it is unclear whether they carry any useful information. Therefore, I will be dropping them.

In [15]:
df = df.drop(columns=['ID', 'var1', 'var2'])
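
Before discarding undocumented columns, it can be worth a quick look at what they contain. The sketch below is one possible check, run on a toy sample built from the five training rows shown earlier; the name `sample` is hypothetical, and five rows are far too few to draw any conclusion about the full dataset.

```python
import pandas as pd

# Toy sample: var1, var2, and the target from the first five training rows
sample = pd.DataFrame({
    "var1": [-17.1, -19.3, -20.0, -17.1, -19.3],
    "var2": ["A", "A", "A", "A", "A"],
    "electricity_consumption": [216.0, 210.0, 225.0, 216.0, 222.0],
})

# How many times each category of var2 occurs
var2_counts = sample["var2"].value_counts()

# Pearson correlation between var1 and the target
var1_corr = sample["var1"].corr(sample["electricity_consumption"])

print(var2_counts)
print(var1_corr)
```

On the full DataFrame, a single dominant category or a near-zero correlation would support dropping a column, while a strong relationship with the target would be a reason to keep it.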