<a href="https://colab.research.google.com/github/brepowell/ML-Contest-Series/blob/main/MLSeriesSupervisedLearningTemplate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# To run:
Use an IDE that will allow you to open a Jupyter Notebook.

For example, use Anaconda Navigator to open Visual Studio Code.

You may need to select a kernel to run the program.

# STEP 1: DATA GATHERING / FEATURE EXPLORATION

## Import Libraries

In [137]:
# This step can be done in any cell of the notebook. It does not have to be at the top.
import pandas as pd


## Data Information

From
https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households

"Created 9 years ago, updated 2 years ago
Energy consumption readings for a sample of 5,567 London Households that took part in the UK Power Networks led Low Carbon London project between November 2011 and February 2014.

Readings were taken at half hourly intervals. The customers in the trial were recruited as a balanced sample representative of the Greater London population.

The dataset contains energy consumption, in kWh (per half hour), unique household identifier, date and time. The CSV file is around 10GB when unzipped and contains around 167million rows.

Within the data set are two groups of customers. The first is a sub-group, of approximately 1100 customers, who were subjected to Dynamic Time of Use (dToU) energy prices throughout the 2013 calendar year period. The tariff prices were given a day ahead via the Smart Meter IHD (In Home Display) or text message to mobile phone. Customers were issued High (67.20p/kWh), Low (3.99p/kWh) or normal (11.76p/kWh) price signals and the times of day these applied. The dates/times and the price signal schedule is available as part of this dataset. All non-Time of Use customers were on a flat rate tariff of 14.228pence/kWh.

The signals given were designed to be representative of the types of signal that may be used in the future to manage both high renewable generation (supply following) operation and also test the potential to use high price signals to reduce stress on local distribution grids during periods of stress.

The remaining sample of approximately 4500 customers energy consumption readings were not subject to the dToU tariff."


## Load Data -- Daily

In [138]:
# Use Pandas to load the data into a dataframe
path = "data/archive/"
startYear = 2011
endYear = 2014

londonData = pd.read_csv(path + 'daily_dataset.csv')
londonData.shape


(3510433, 9)

## Load Data -- Half Hourly

In [139]:
# tstp = timestamp of each half-hour reading
# This shows readings every half hour (LCLid + tstp + energy(kWh/hh) = 3 columns)
path = "data/archive/halfhourly_dataset/halfhourly_dataset/"
block1 = pd.read_csv(path + 'block_0.csv')
block1.shape


(1222670, 3)

## Load Data -- Half Hourly 2nd set

In [140]:
# This shows readings every half hour (LCLid + day + 48 half hour columns = 50 columns) per day, per ID
# It is the same data as the halfhourly_dataset, except formatted differently and with more precision.
path = "data/archive/hhblock_dataset/hhblock_dataset/"
halfblock0 = pd.read_csv(path + 'block_0.csv')
halfblock0.shape

(25286, 50)

## Check the Min/Max/Avg -- Daily

In [141]:
londonData.head()

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
0,MAC000131,2011-12-15,0.485,0.432045,0.868,22,0.239146,9.505,0.072
1,MAC000131,2011-12-16,0.1415,0.296167,1.116,48,0.281471,14.216,0.031
2,MAC000131,2011-12-17,0.1015,0.189812,0.685,48,0.188405,9.111,0.064
3,MAC000131,2011-12-18,0.114,0.218979,0.676,48,0.202919,10.511,0.065
4,MAC000131,2011-12-19,0.191,0.325979,0.788,48,0.259205,15.647,0.066


In [142]:
londonData.tail() 

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
3510428,MAC004977,2014-02-24,0.095,0.118458,0.58,48,0.093814,5.686,0.052
3510429,MAC004977,2014-02-25,0.0675,0.084208,0.176,48,0.037107,4.042,0.046
3510430,MAC004977,2014-02-26,0.108,0.1205,0.282,48,0.069332,5.784,0.046
3510431,MAC004977,2014-02-27,0.072,0.114062,0.431,48,0.094482,5.475,0.047
3510432,MAC004977,2014-02-28,0.097,0.097,0.097,1,,0.097,0.097


In [143]:
londonData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3510433 entries, 0 to 3510432
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   LCLid          object 
 1   day            object 
 2   energy_median  float64
 3   energy_mean    float64
 4   energy_max     float64
 5   energy_count   int64  
 6   energy_std     float64
 7   energy_sum     float64
 8   energy_min     float64
dtypes: float64(6), int64(1), object(2)
memory usage: 241.0+ MB
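
The `info()` output above lists `day` as `object`, which means the dates are stored as strings. A minimal sketch of converting them, assuming the `YYYY-MM-DD` format seen in `head()`. The small `sample` frame is a hypothetical stand-in for `londonData`, not part of the notebook:

```python
import pandas as pd

# Hypothetical stand-in for londonData, using two rows shown by head() above
sample = pd.DataFrame({
    "LCLid": ["MAC000131", "MAC000131"],
    "day": ["2011-12-15", "2011-12-16"],
    "energy_sum": [9.505, 14.216],
})

# Parse the string dates so they can be sorted, filtered, and grouped by date
sample["day"] = pd.to_datetime(sample["day"], format="%Y-%m-%d")

print(sample["day"].dtype)
print(sample["day"].dt.year.tolist())
```

The same conversion can probably be done at load time by passing `parse_dates=["day"]` to `pd.read_csv`.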


In [144]:
# The minimum of every column is zero
londonData.describe()

Unnamed: 0,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
count,3510403.0,3510403.0,3510403.0,3510433.0,3499102.0,3510403.0,3510403.0
mean,0.1587395,0.2117305,0.834521,47.80364,0.1726673,10.12414,0.05962578
std,0.1701865,0.190846,0.6683156,2.810982,0.1531208,9.128793,0.08701312
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.067,0.09808333,0.346,48.0,0.06911626,4.682,0.02
50%,0.1145,0.1632917,0.688,48.0,0.132791,7.815,0.039
75%,0.191,0.2624583,1.128,48.0,0.2293124,12.569,0.071
max,6.9705,6.92825,10.761,48.0,4.024569,332.556,6.524
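
The `count` row above is not the same for every column: `energy_count` has 3,510,433 entries, most other columns have 3,510,403, and `energy_std` has 3,499,102. Some rows therefore contain missing values (for example, `energy_std` is empty in the last row of `tail()`, where only one reading was recorded). A sketch of counting missing values per column, on a made-up toy frame rather than `londonData`:

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; they are not taken from the dataset
toy = pd.DataFrame({
    "energy_count": [48, 1, 0],
    "energy_mean": [0.118, 0.097, np.nan],
    "energy_std": [0.094, np.nan, np.nan],
})

# isna() marks missing cells; sum() counts them column by column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running `londonData.isna().sum()` in the notebook should show which columns account for the gaps.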


In [145]:
# There are 5566 distinct LCLids
customerIDs = londonData.groupby("LCLid").count()
customerIDs.shape


(5566, 8)

In [146]:
customerIDs.head()

LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
MAC000002,505,505,505,505,505,503,505,505
MAC000003,740,740,740,740,740,739,740,740
MAC000004,662,662,662,662,662,660,662,662
MAC000005,638,638,638,638,638,636,638,638
MAC000006,761,761,761,761,761,760,761,761


In [147]:
# Looking at min, it appears that some LCLids have only one line of information!
customerIDs.describe()["day"]

count    5566.000000
mean      630.692239
std       111.945099
min         1.000000
25%       600.000000
50%       651.000000
75%       684.000000
max       829.000000
Name: day, dtype: float64
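
Households with very few recorded days may be worth excluding before modelling. A sketch of filtering by a minimum number of days, where the threshold `min_days` and the toy frame are assumptions chosen for illustration:

```python
import pandas as pd

# Toy stand-in for londonData: household A has 3 days, B has 1, C has 2
toy = pd.DataFrame({
    "LCLid": ["A", "A", "A", "B", "C", "C"],
    "day": ["2013-01-01", "2013-01-02", "2013-01-03",
            "2013-01-01", "2013-01-01", "2013-01-02"],
})

min_days = 2  # hypothetical threshold; pick one that suits the analysis

# Count recorded days per household, then keep households at or above the threshold
days_per_id = toy.groupby("LCLid")["day"].count()
keep_ids = days_per_id[days_per_id >= min_days].index
filtered = toy[toy["LCLid"].isin(keep_ids)]

print(sorted(filtered["LCLid"].unique()))
```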

## Power Outage Days

In [148]:
# There are 15,138 household-days where max energy was 0, which possibly indicate power outages
zeroPower = londonData[londonData['energy_max'] == 0]
zeroPower.shape

(15138, 9)

In [149]:
# What is energy_count? Its maximum is 48, which matches the number of half-hour readings in a day
zeroPower.head()

Unnamed: 0,LCLid,day,energy_median,energy_mean,energy_max,energy_count,energy_std,energy_sum,energy_min
12271,MAC004560,2013-04-02,0.0,0.0,0.0,48,0.0,0.0,0.0
12272,MAC004560,2013-04-03,0.0,0.0,0.0,48,0.0,0.0,0.0
12273,MAC004560,2013-04-04,0.0,0.0,0.0,48,0.0,0.0,0.0
12274,MAC004560,2013-04-05,0.0,0.0,0.0,1,,0.0,0.0
41810,MAC001340,2013-11-15,0.0,0.0,0.0,48,0.0,0.0,0.0
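
One way to probe what `energy_count` represents is to check whether `energy_sum` equals `energy_mean` multiplied by `energy_count`. If it does, `energy_count` is likely the number of half-hourly readings recorded that day (48 for a complete day). A sketch using three rows copied from the `head()` and `tail()` outputs of `londonData` above:

```python
import numpy as np
import pandas as pd

# Rows copied from the londonData head() and tail() outputs
rows = pd.DataFrame({
    "energy_mean": [0.432045, 0.296167, 0.097],
    "energy_count": [22, 48, 1],
    "energy_sum": [9.505, 14.216, 0.097],
})

# If energy_count is the number of readings, mean * count should reproduce the sum
reconstructed = rows["energy_mean"] * rows["energy_count"]
matches = np.isclose(reconstructed, rows["energy_sum"], atol=1e-3)

print(matches.all())
```

The tolerance `atol=1e-3` allows for the rounding visible in the printed values.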


In [150]:
zeroPower.to_csv("PowerOutages.csv")

In [151]:
# There are possibly 819 days where the power was out for at least one location
zeroPowerDays = zeroPower.groupby("day").count()
zeroPowerDays.shape

(819, 8)

In [152]:
# There are possibly 281 customers affected by a power outage
zeroPowerIDs = zeroPower.groupby("LCLid").count()
zeroPowerIDs.shape

(281, 8)
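
A full day of zero readings does not have to be an outage. A long run of consecutive zero days could also point to an empty property or a meter fault, although that is an assumption and not something the dataset states. A sketch of measuring the longest run of consecutive zero days per household, on a toy frame standing in for `zeroPower`:

```python
import pandas as pd

# Toy stand-in for zeroPower: A has a 3-day run plus one isolated day, B has one day
zero_days = pd.DataFrame({
    "LCLid": ["A", "A", "A", "A", "B"],
    "day": pd.to_datetime(["2013-04-02", "2013-04-03", "2013-04-04",
                           "2013-04-10", "2013-11-15"]),
})
zero_days = zero_days.sort_values(["LCLid", "day"])

# A new run starts whenever the gap to the previous zero day is not exactly one day
# (the first row of each household has no previous day, so it also starts a run)
gap = zero_days.groupby("LCLid")["day"].diff()
zero_days["run_id"] = (gap != pd.Timedelta(days=1)).cumsum()

# Length of each run, then the longest run per household
run_lengths = zero_days.groupby(["LCLid", "run_id"]).size()
longest_run = run_lengths.groupby(level="LCLid").max()

print(longest_run)
```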

## Customer Information

In [153]:
# Use Pandas to load the data into a dataframe
path = "data/archive/"
house = pd.read_csv(path + 'informations_households.csv')

In [154]:
# Look up MAC001340, a household that reported zero energy use for a long stretch
house[house["LCLid"] == "MAC001340"]

Unnamed: 0,LCLid,stdorToU,Acorn,Acorn_grouped,file
4394,MAC001340,Std,ACORN-N,Adversity,block_87


In [155]:
# Use Pandas to load the data into a dataframe
#path = "data/archive/"
#acorn = pd.read_csv(path + 'acorn_details.csv', encoding="utf-16")

## Check the Min/Max/Avg -- Half Hourly

In [156]:
block1.head()

Unnamed: 0,LCLid,tstp,energy(kWh/hh)
0,MAC000002,2012-10-12 00:30:00.0000000,0
1,MAC000002,2012-10-12 01:00:00.0000000,0
2,MAC000002,2012-10-12 01:30:00.0000000,0
3,MAC000002,2012-10-12 02:00:00.0000000,0
4,MAC000002,2012-10-12 02:30:00.0000000,0


In [157]:
block1.tail()

Unnamed: 0,LCLid,tstp,energy(kWh/hh)
1222665,MAC005492,2014-02-27 22:00:00.0000000,0.182
1222666,MAC005492,2014-02-27 22:30:00.0000000,0.122
1222667,MAC005492,2014-02-27 23:00:00.0000000,0.14
1222668,MAC005492,2014-02-27 23:30:00.0000000,0.192
1222669,MAC005492,2014-02-28 00:00:00.0000000,0.088


In [158]:
block1.describe()

Unnamed: 0,LCLid,tstp,energy(kWh/hh)
count,1222670,1222670,1222670.0
unique,50,39292,5022.0
top,MAC000246,2012-12-16 00:00:00.0000000,0.013
freq,39245,50,6238.0
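
The `describe()` output above reports `unique`, `top`, and `freq` for `energy(kWh/hh)`, which is what pandas prints for non-numeric columns, so the readings appear to have been loaded as strings. A sketch of coercing them to numbers. The `"Null"` entry in the toy data is an assumed example of a non-numeric value, not something confirmed from the file:

```python
import pandas as pd

# Toy stand-in for block1; "Null" is a hypothetical non-numeric entry
toy = pd.DataFrame({"energy(kWh/hh)": ["0.182", "0.122", "Null"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
toy["energy_kwh"] = pd.to_numeric(toy["energy(kWh/hh)"], errors="coerce")

print(toy["energy_kwh"].isna().sum())
print(toy["energy_kwh"].describe())
```

After conversion, `describe()` should report min, max, and mean, which is what this section sets out to check.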


In [159]:
# block_0.csv covers only 50 households; at half-hour granularity the data is split across multiple block files
hourlycustomerIDs = block1.groupby("LCLid").count()
hourlycustomerIDs.shape

(50, 2)
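
To look at more than 50 households, several block files can be combined. A sketch of one way to do that, assuming the files follow the `block_*.csv` naming seen above. The helper `load_blocks` and its `limit` parameter are hypothetical, and the demonstration writes two tiny made-up files to a temporary folder instead of reading the real data:

```python
import glob
import os
import tempfile

import pandas as pd


def load_blocks(folder, limit=None):
    """Concatenate block_*.csv files in `folder`, optionally only the first `limit` files."""
    # Note: sorted() orders names as text, so block_10 comes before block_2
    files = sorted(glob.glob(os.path.join(folder, "block_*.csv")))
    if limit is not None:
        files = files[:limit]
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)


# Demonstration on two tiny made-up files
with tempfile.TemporaryDirectory() as tmp:
    for i, lclid in enumerate(["MAC000002", "MAC005492"]):
        pd.DataFrame({
            "LCLid": [lclid],
            "tstp": ["2012-10-12 00:30:00"],
            "energy(kWh/hh)": [0.1],
        }).to_csv(os.path.join(tmp, f"block_{i}.csv"), index=False)
    combined = load_blocks(tmp)

print(combined["LCLid"].nunique())
```

The full dataset holds around 167 million rows according to the description above, so `limit` is there to keep memory use manageable while exploring.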

## Check for Imbalanced Data
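
The notebook has not chosen a target variable yet. If tariff type (`stdorToU` in `informations_households.csv`) were the target, the dataset description suggests an imbalance of roughly 1,100 dToU customers to 4,500 standard ones. A sketch of checking class proportions, where the toy frame and the `"ToU"` label value are assumptions:

```python
import pandas as pd

# Toy stand-in for the house dataframe; "ToU" is an assumed label value
toy_house = pd.DataFrame({"stdorToU": ["Std", "Std", "Std", "Std", "ToU"]})

# normalize=True returns each class as a share of the total instead of a raw count
shares = toy_house["stdorToU"].value_counts(normalize=True)

print(shares)
```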

# STEP 2: FEATURE ENGINEERING / DATA CLEANING / PRE-PROCESSING TECHNIQUES

## Label your target variable

In [160]:
# Ex) Change "yes" or "no" to 1 or 0 so the model understands the label
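
A minimal sketch of the yes/no example above, using a made-up `label` column (the notebook has not defined a target yet):

```python
import pandas as pd

# Hypothetical target column with string labels
toy = pd.DataFrame({"label": ["yes", "no", "yes"]})

# Map each string label to an integer the model can consume
toy["label_encoded"] = toy["label"].map({"yes": 1, "no": 0})

print(toy["label_encoded"].tolist())
```

Values missing from the mapping become `NaN`, which makes unexpected labels easy to spot.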

## Fix the data imbalance and other problems from STEP 1

## Separate features (x) from labels (y)

## Feature Reduction

## Normalize / Scale the Data

## Investigate Variance or Feature Importance

# STEP 3: MODEL TRAINING & BUILDING


## Split into Train and Test data

## Use a Model

## Perform a Hyperparameter Search

## Save the best model

# STEP 4: EVALUATE THE MODEL

## Look at Metrics - Ex) Precision, Recall, F1 score

## Plot a Confusion Matrix

# STEP 5: DEPLOY MODEL