# Abstract

**Objective:** To do a roundtrip analysis on a file: reading, transforming, writing.

**Method:** Load a given file and


# Use Kaggle to Download the Data

This section installs the `kaggle` command-line app in the Colab runtime and uses it to search for and download a smart-meter dataset. You will need a few things set up in advance.

You will need a `kaggle.json` file. The steps are:

1. Create a Kaggle account via email
2. Open your profile, and go to the **Account** settings: `https://www.kaggle.com/$USERNAME_HERE/account`
3. Click the **Create New API Token** button. This downloads a `kaggle.json` file to your local PC.

Next, put the file on Google Drive:

1. Make sure you have a Google Drive account
2. Create a folder `Colab Data`
3. Upload the `kaggle.json` file to that `Colab Data` folder on Google Drive.

You now have what you need to execute this notebook.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
!mkdir -p ~/.kaggle

In [7]:
!cp "/content/drive/My Drive/Colab Data/kaggle.json" ~/.kaggle

In [8]:
!chmod 600 ~/.kaggle/kaggle.json

In [9]:
!ls ~/.kaggle

kaggle.json
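
The shell cells above (`mkdir -p`, `cp`, `chmod 600`) can also be written as one Python helper. This is only a sketch: the function name `install_kaggle_token` is made up for illustration and is not part of the `kaggle` package. It assumes Drive is already mounted when it is called with the paths used in this notebook.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source: Path, config_dir: Path) -> Path:
    """Copy a kaggle.json token into config_dir, readable only by its owner.

    Hypothetical helper: it mirrors the shell cells, nothing more.
    """
    config_dir.mkdir(parents=True, exist_ok=True)  # like: mkdir -p ~/.kaggle
    target = config_dir / "kaggle.json"
    shutil.copyfile(source, target)                # like: cp ... ~/.kaggle
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)      # like: chmod 600
    return target


# Example call with the paths from this notebook (left commented out because
# it only works once Drive is mounted):
# install_kaggle_token(
#     Path("/content/drive/My Drive/Colab Data/kaggle.json"),
#     Path.home() / ".kaggle",
# )
```

Restricting the file to its owner matters because the token is a credential.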


In [10]:
!pip install kaggle



In [11]:
!kaggle datasets list --search energy

ref                                                             title                                             size  lastUpdated          downloadCount  
--------------------------------------------------------------  -----------------------------------------------  -----  -------------------  -------------  
robikscube/hourly-energy-consumption                            Hourly Energy Consumption                         11MB  2018-08-30 14:17:03          19541  
unitednations/international-energy-statistics                   International Energy Statistics                    7MB  2017-11-16 00:06:06           5814  
loveall/appliances-energy-prediction                            Appliances Energy Prediction                       2MB  2017-09-16 10:43:26           2875  
lucabasa/dutch-energy                                           Energy consumption of the Netherlands            139MB  2020-06-21 18:51:28           6819  
nicholasjhana/energy-consumption-generation-prices-and-wea

In [12]:
# !kaggle datasets download jeanmidev/smart-meters-in-london --path /content/drive/My\ Drive/Colab\ Data/

# Unpack and Verify the Smart-Meter Dataset

In this section we unzip the data, list the files, and check that they match the description on [the dataset's page on Kaggle](https://www.kaggle.com/jeanmidev/smart-meters-in-london).
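
The check can also be done in code rather than by eye. The sketch below assumes a hypothetical helper `missing_entries`. Its example list names only the two folders this notebook reads later, not the full contents described on the Kaggle page.

```python
from pathlib import Path
from typing import List, Sequence


def missing_entries(root: Path, expected: Sequence[str]) -> List[str]:
    """Return the names in `expected` that do not exist directly under `root`."""
    return [name for name in expected if not (root / name).exists()]


# Assumption: only the two folders used in this notebook are checked here.
EXPECTED_FOLDERS = ["hhblock_dataset", "halfhourly_dataset"]

# Example call (commented out because it needs the mounted Drive):
# root = Path("/content/drive/My Drive/Colab Data/smart-meters-in-london")
# print(missing_entries(root, EXPECTED_FOLDERS))
```

An empty list means every expected folder was found.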

In [13]:
# !mkdir /content/drive/My\ Drive/Colab\ Data/smart-meters-in-london

In [14]:
# !unzip -o -q /content/drive/My\ Drive/Colab\ Data/smart-meters-in-london.zip \
# -d /content/drive/My\ Drive/Colab\ Data/smart-meters-in-london

In [15]:
!ls /content/drive/My\ Drive/Colab\ Data/smart-meters-in-london/halfhourly_dataset/

halfhourly_dataset


In [16]:
import os
import pathlib
import pandas as pd
from IPython.display import display, Markdown, Image

In [17]:
DATASET_PATH=pathlib.Path('/content/drive/My Drive/Colab Data/smart-meters-in-london')
ELEC_READINGS_ONE_ROW_PER_DAY=DATASET_PATH / 'hhblock_dataset' / 'hhblock_dataset'  # one row per day
ELEC_READINGS_ONE_ROW_PER_READING=DATASET_PATH / 'halfhourly_dataset' / 'halfhourly_dataset'  # one row per timestamp

In [18]:
path1_block = pd.read_csv(ELEC_READINGS_ONE_ROW_PER_DAY / 'block_0.csv')

In [19]:
path2_block = pd.read_csv(ELEC_READINGS_ONE_ROW_PER_READING / 'block_0.csv')

In [20]:
path1_block.head()

Unnamed: 0,LCLid,day,hh_0,hh_1,hh_2,hh_3,hh_4,hh_5,hh_6,hh_7,hh_8,hh_9,hh_10,hh_11,hh_12,hh_13,hh_14,hh_15,hh_16,hh_17,hh_18,hh_19,hh_20,hh_21,hh_22,hh_23,hh_24,hh_25,hh_26,hh_27,hh_28,hh_29,hh_30,hh_31,hh_32,hh_33,hh_34,hh_35,hh_36,hh_37,hh_38,hh_39,hh_40,hh_41,hh_42,hh_43,hh_44,hh_45,hh_46,hh_47
0,MAC000002,2012-10-13,0.263,0.269,0.275,0.256,0.211,0.136,0.161,0.119,0.167,0.109,0.168,0.107,0.166,0.117,0.157,0.126,0.146,0.106,0.135,0.191,0.915,0.933,0.122,0.138,0.076,0.133,0.076,0.133,0.085,0.263,0.134,0.235,0.124,0.184,0.23,0.176,0.388,0.26,0.918,0.278,0.267,0.239,0.23,0.233,0.235,0.188,0.259,0.25
1,MAC000002,2012-10-14,0.262,0.166,0.226,0.088,0.126,0.082,0.123,0.083,0.12,0.079,0.121,0.075,0.124,0.073,0.125,0.07,0.13,0.108,0.196,0.346,0.524,0.076,0.129,0.667,0.23,0.22,0.163,0.091,0.17,0.11,0.11,0.121,0.099,0.157,0.093,0.371,0.386,1.085,1.075,0.956,0.821,0.745,0.712,0.511,0.231,0.21,0.278,0.159
2,MAC000002,2012-10-15,0.192,0.097,0.141,0.083,0.132,0.07,0.13,0.074,0.124,0.078,0.118,0.082,0.112,0.087,0.106,0.14,0.12,1.075,0.146,0.123,0.082,0.127,0.077,0.551,0.149,0.129,0.075,0.13,0.075,0.129,0.075,0.128,0.166,0.194,0.695,0.26,0.227,0.255,1.164,0.249,0.225,0.258,0.26,0.334,0.299,0.236,0.241,0.237
3,MAC000002,2012-10-16,0.237,0.237,0.193,0.118,0.098,0.107,0.094,0.109,0.091,0.105,0.091,0.104,0.092,0.103,0.093,0.101,0.144,0.1,0.408,0.102,0.1,0.116,0.354,0.146,0.19,0.991,0.31,0.121,0.113,0.094,0.119,0.087,0.13,0.238,0.204,0.284,0.447,0.266,0.966,0.172,0.192,0.228,0.203,0.211,0.188,0.213,0.157,0.202
4,MAC000002,2012-10-17,0.157,0.211,0.155,0.169,0.101,0.117,0.084,0.118,0.08,0.119,0.075,0.123,0.071,0.126,0.067,0.124,0.118,0.132,0.358,0.628,0.784,0.681,0.749,0.593,0.502,0.115,0.113,0.092,0.124,0.084,0.125,0.078,0.136,0.227,0.207,0.141,0.258,0.217,0.223,0.075,0.23,0.208,0.265,0.377,0.327,0.277,0.288,0.256


In [21]:
path2_block.iloc[46:,:].head(10)

Unnamed: 0,LCLid,tstp,energy(kWh/hh)
46,MAC000002,2012-10-13 00:00:00.0000000,0.263
47,MAC000002,2012-10-13 00:30:00.0000000,0.269
48,MAC000002,2012-10-13 01:00:00.0000000,0.275
49,MAC000002,2012-10-13 01:30:00.0000000,0.256
50,MAC000002,2012-10-13 02:00:00.0000000,0.211
51,MAC000002,2012-10-13 02:30:00.0000000,0.136
52,MAC000002,2012-10-13 03:00:00.0000000,0.161
53,MAC000002,2012-10-13 03:30:00.0000000,0.119
54,MAC000002,2012-10-13 04:00:00.0000000,0.167
55,MAC000002,2012-10-13 04:30:00.0000000,0.109


The previews above show the same readings in two layouts. What we want is the hhblock version (one row per household per day), so we read that in and process it. Each file contains multiple households.
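
To make the relationship between the two layouts concrete, here is a rough sketch of pivoting the one-row-per-reading layout into the one-row-per-day layout. The function name `halfhourly_to_daily` is made up for illustration, and the column names are those shown in the previews. Coercing non-numeric energy values to NaN is a defensive assumption, not something the previews demonstrate.

```python
import pandas as pd


def halfhourly_to_daily(long_df: pd.DataFrame) -> pd.DataFrame:
    """Pivot one-row-per-reading data into one row per (household, day)."""
    tstp = pd.to_datetime(long_df["tstp"])
    tidy = pd.DataFrame({
        "LCLid": long_df["LCLid"].to_numpy(),
        "day": tstp.dt.strftime("%Y-%m-%d").to_numpy(),
        # Half-hour slot within the day: 00:00 -> 0, 00:30 -> 1, ..., 23:30 -> 47
        "slot": (tstp.dt.hour * 2 + tstp.dt.minute // 30).to_numpy(),
        # Assumption: anything that is not a number becomes NaN
        "energy": pd.to_numeric(long_df["energy(kWh/hh)"], errors="coerce").to_numpy(),
    })
    wide = tidy.pivot_table(
        index=["LCLid", "day"], columns="slot", values="energy", aggfunc="first"
    )
    wide.columns = [f"hh_{slot}" for slot in wide.columns]
    return wide.reset_index()
```

Slots with no reading on a given day come out as NaN in the wide table.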

In [27]:
blocks = []
for reading_file in ELEC_READINGS_ONE_ROW_PER_DAY.iterdir():
  # Only the block_*.csv files hold readings; skip anything else in the folder
  if reading_file.name.startswith('block_') and reading_file.name.endswith('.csv'):
    print(f"Reading in {reading_file}")
    blocks.append(pd.read_csv(reading_file))

In [28]:
readings = pd.concat(blocks, sort=False)

In [31]:
del blocks

# Experiment 1: An Autoencoder for Daily Energy Consumption

In this experiment we stack all days from all households together, ignoring the household-specific clustering, and fit a variational auto-encoder to the daily data to try to find a low-rank representation of a household-day.

To make this easier to model in Gaussian terms, we should log-transform the data, but we'll skip this for the time being.
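
For reference, one possible form of such a transform is sketched below. It is not applied anywhere in this notebook. Using `log(1 + x)` instead of a plain logarithm is an assumption, made so that a reading of exactly zero stays finite. The function names are illustrative.

```python
import numpy as np


def to_log_scale(kwh):
    """Map non-negative readings to log(1 + x); a zero reading stays at zero."""
    return np.log1p(np.asarray(kwh, dtype=float))


def from_log_scale(log_kwh):
    """Invert to_log_scale, recovering readings in kWh per half hour."""
    return np.expm1(np.asarray(log_kwh, dtype=float))


# Round trip on a few values taken from the preview tables above
sample = np.array([0.263, 0.076, 1.085])
restored = from_log_scale(to_log_scale(sample))
```

Keeping the inverse next to the transform makes it easy to map reconstructions from the model back to kWh.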

In [35]:
HouseId = 'LCLid'
Day = 'day'

In [39]:
readings.set_index([HouseId, Day], inplace=True)
readings.columns = list(range(48))

In [40]:
readings.head()

LCLid,day,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47
MAC000002,2012-10-13,0.263,0.269,0.275,0.256,0.211,0.136,0.161,0.119,0.167,0.109,0.168,0.107,0.166,0.117,0.157,0.126,0.146,0.106,0.135,0.191,0.915,0.933,0.122,0.138,0.076,0.133,0.076,0.133,0.085,0.263,0.134,0.235,0.124,0.184,0.23,0.176,0.388,0.26,0.918,0.278,0.267,0.239,0.23,0.233,0.235,0.188,0.259,0.25
MAC000002,2012-10-14,0.262,0.166,0.226,0.088,0.126,0.082,0.123,0.083,0.12,0.079,0.121,0.075,0.124,0.073,0.125,0.07,0.13,0.108,0.196,0.346,0.524,0.076,0.129,0.667,0.23,0.22,0.163,0.091,0.17,0.11,0.11,0.121,0.099,0.157,0.093,0.371,0.386,1.085,1.075,0.956,0.821,0.745,0.712,0.511,0.231,0.21,0.278,0.159
MAC000002,2012-10-15,0.192,0.097,0.141,0.083,0.132,0.07,0.13,0.074,0.124,0.078,0.118,0.082,0.112,0.087,0.106,0.14,0.12,1.075,0.146,0.123,0.082,0.127,0.077,0.551,0.149,0.129,0.075,0.13,0.075,0.129,0.075,0.128,0.166,0.194,0.695,0.26,0.227,0.255,1.164,0.249,0.225,0.258,0.26,0.334,0.299,0.236,0.241,0.237
MAC000002,2012-10-16,0.237,0.237,0.193,0.118,0.098,0.107,0.094,0.109,0.091,0.105,0.091,0.104,0.092,0.103,0.093,0.101,0.144,0.1,0.408,0.102,0.1,0.116,0.354,0.146,0.19,0.991,0.31,0.121,0.113,0.094,0.119,0.087,0.13,0.238,0.204,0.284,0.447,0.266,0.966,0.172,0.192,0.228,0.203,0.211,0.188,0.213,0.157,0.202
MAC000002,2012-10-17,0.157,0.211,0.155,0.169,0.101,0.117,0.084,0.118,0.08,0.119,0.075,0.123,0.071,0.126,0.067,0.124,0.118,0.132,0.358,0.628,0.784,0.681,0.749,0.593,0.502,0.115,0.113,0.092,0.124,0.084,0.125,0.078,0.136,0.227,0.207,0.141,0.258,0.217,0.223,0.075,0.23,0.208,0.265,0.377,0.327,0.277,0.288,0.256


## Following a Convolutional Variational Auto-Encoder 

We first follow [the Keras tutorial on variational auto-encoders](https://keras.io/examples/generative/vae/), which operates on digits. Subsequently we will adapt it to work on time series.

In [41]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


Note that this tutorial follows the new class-based Keras API.

The first layer is the "Sampling Layer", which draws the latent vector z, i.e. the target low-rank representation.

In [43]:
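class Sampling(layers.Layer):
  """
  Uses a Gaussian over the latent space z, parameterised by z_mean and
  z_log_var.
  """

  def call(self, inputs):
    # `inputs` is a pair of tensors, each of shape (batch, latent_dim)
    z_mean, z_log_var = inputs
    # tf.shape returns the tensor's shape at run time, so the batch size
    # need not be known when the model is built
    batch = tf.shape(z_mean)[0]
    dim = tf.shape(z_mean)[1]
    epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
    # exp(0.5 * log(variance)) is the standard deviation
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon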
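
The arithmetic inside `call` can be checked without TensorFlow. The NumPy sketch below draws many samples with the same formula and compares their empirical mean and standard deviation with the values implied by `z_mean` and `z_log_var`. The function name `sample_latent` and the test values are chosen purely for illustration.

```python
import numpy as np


def sample_latent(z_mean, z_log_var, rng):
    """Draw z = z_mean + exp(0.5 * z_log_var) * epsilon, with epsilon ~ N(0, I)."""
    z_mean = np.asarray(z_mean, dtype=float)
    z_log_var = np.asarray(z_log_var, dtype=float)
    epsilon = rng.standard_normal(size=z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * epsilon


rng = np.random.default_rng(0)
n_samples = 100_000
# Two latent dimensions: means 1 and -2, variances 0.25 and 4
z_mean = np.tile([1.0, -2.0], (n_samples, 1))
z_log_var = np.tile(np.log([0.25, 4.0]), (n_samples, 1))

z = sample_latent(z_mean, z_log_var, rng)
print(z.mean(axis=0))  # should be close to the means 1 and -2
print(z.std(axis=0))   # should be close to the standard deviations 0.5 and 2
```

Writing the sample as a deterministic function of `z_mean`, `z_log_var` and independent noise is what lets gradients flow back to the two parameters during training.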