In [1]:
import numpy as np
import pandas as pd

import keras
from keras.preprocessing.sequence import TimeseriesGenerator

import matplotlib.pyplot as plt

Using TensorFlow backend.


In [2]:
train = pd.read_csv("gs://123test_bucket/train.csv")

#### Exploring TimeseriesGenerator using dummy data

In [3]:
data = np.array([[i] for i in range(50)])
targets = np.array([[i] for i in range(50)])

In [4]:
data_gen = TimeseriesGenerator(data, targets,
                               length=10, sampling_rate=2,
                               batch_size=2)

The parameters for the TimeseriesGenerator are:
1. **data**: the input time series, passed as an indexable array (2-D here, with time along the first axis), from which the look-back windows are drawn
2. **targets**: the values to predict, aligned with data; here it is the same series, so each y value is the one that follows its window
3. **length**: a "look-back" parameter giving the number of timesteps before each target that the window spans
4. **sampling_rate**: the step between successive timesteps kept within a window; with a value of 2, every other timestep is skipped
5. **batch_size**: the number of (window, target) samples that the generator returns in each batch

Further reading: https://keras.io/preprocessing/sequence/
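
As a rough sanity check on these parameters, the sketch below estimates how many batches the generator should yield and how many timesteps each window should hold. It is plain Python rather than a Keras call. The helper names `expected_batches` and `window_size` are hypothetical, and the formulas assume the generator's default `stride=1`, `start_index=0` and `end_index=None`.

```python
import math

def expected_batches(n_rows, length, batch_size):
    # Hypothetical helper: the first `length` rows have no full look-back
    # window, so only n_rows - length rows can serve as targets.
    n_samples = n_rows - length
    return math.ceil(n_samples / batch_size)

def window_size(length, sampling_rate):
    # Hypothetical helper: every sampling_rate-th timestep inside the
    # look-back window is kept.
    return math.ceil(length / sampling_rate)

print(expected_batches(50, length=10, batch_size=2))  # 20
print(window_size(length=10, sampling_rate=2))        # 5
```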

In [5]:
assert len(data_gen) == 20
print(len(data_gen))

20


In [6]:
batch_0 = data_gen[0]
x, y = batch_0

In [7]:
print(y)

[[10]
 [11]]


In [8]:
print(x)

[[[0]
  [2]
  [4]
  [6]
  [8]]

 [[1]
  [3]
  [5]
  [7]
  [9]]]


As the output above shows, TimeseriesGenerator produced 2 values of y in this batch, because we set batch_size to 2. Each y value is paired with one array in x, and each array holds 5 values. Although we set length to 10, we also set sampling_rate to 2, so only every second timestep is kept, which gives 10/2 = 5 values per window.
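
To make the windowing concrete, here is a minimal sketch that rebuilds the first batch by hand using list slicing. It is an illustration of the indexing, not the Keras implementation. The function name `first_batch` is hypothetical, and a plain list stands in for the NumPy array.

```python
def first_batch(data, targets, length, sampling_rate, batch_size):
    # Hypothetical helper reproducing only the first batch.
    xs, ys = [], []
    # The first usable target sits at index `length`.
    for i in range(length, length + batch_size):
        # Look back `length` steps, keeping every sampling_rate-th value.
        xs.append(data[i - length:i:sampling_rate])
        ys.append(targets[i])
    return xs, ys

series = list(range(50))
x0, y0 = first_batch(series, series, length=10, sampling_rate=2, batch_size=2)
print(x0)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
print(y0)  # [10, 11]
```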

In [9]:
assert np.array_equal(x,
                      np.array([[[0], [2], [4], [6], [8]],
                                [[1], [3], [5], [7], [9]]]))
assert np.array_equal(y,
                      np.array([[10], [11]]))

#### Applying TimeseriesGenerator to the ASHRAE training data

One difference between the example above and our situation is that we have multiple time series, one for each meter in each building. So we need to modify the code a little.

Below, we check how many meters exist in the dataset.

In [10]:
len(train[['building_id', 'meter']].drop_duplicates())

2380

Taking a subset of the data.

In [11]:
train_sub = train[train.building_id.isin(np.arange(0,10))]
train_sub.describe()

| | building_id | meter | meter_reading |
|---|---|---|---|
| count | 102372.0 | 102372.0 | 102372.0 |
| mean | 4.996816 | 0.141953 | 456.649096 |
| std | 2.95176 | 0.349004 | 983.878076 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 0.0 |
| 50% | 5.0 | 0.0 | 80.2691 |
| 75% | 7.0 | 0.0 | 425.576 |
| max | 9.0 | 1.0 | 8442.07 |


The code below is adapted from this Stack Overflow answer, with some modifications:
https://stackoverflow.com/questions/55116638/use-keras-timeseriesgenerator-function-to-generate-squence-group-by-some-id/55118459#55118459

The main modification is that after subsetting the data by building ID, we subset it again by meter type, so that each (building, meter) pair gets its own TimeseriesGenerator.

Further reading on writing custom Keras generator classes:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

In [13]:
# https://stackoverflow.com/questions/55116638/use-keras-timeseriesgenerator-function-to-generate-squence-group-by-some-id/55118459#55118459
# https://keras.io/preprocessing/sequence/
class DataGenerator(keras.utils.Sequence):
    """Serve batches from one TimeseriesGenerator per (building_id, meter) pair."""
    def __init__(self, dt, length=168, batch_size=10):
        self.tgs = list()
        for i in range(dt['building_id'].min(), dt['building_id'].max() + 1):
            sub = dt.loc[dt['building_id'] == i, ['meter', 'meter_reading']]
            for meter in sub['meter'].unique():
                # subset this building's rows for the current meter type
                adf = sub.loc[sub['meter'] == meter, 'meter_reading']
                self.tgs.append(TimeseriesGenerator(adf.values, adf.values,
                                                    length=length, batch_size=batch_size))
        # total number of batches across all generators
        self.len = sum(len(tg) for tg in self.tgs)
        # for global batch k: idx_j[k] is the generator, idx_i[k] the batch within it
        self.idx_i = list()
        self.idx_j = list()

        for i, tg in enumerate(self.tgs):
            self.idx_i.extend(list(range(len(tg))))
            self.idx_j.extend([i] * len(tg))

    def __len__(self):
        return self.len

    def __getitem__(self, index):
        return self.tgs[self.idx_j[index]][self.idx_i[index]]
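
The two index lists are what let a single global batch index address many generators. The sketch below builds the same kind of lookup for three imaginary generators holding 3, 2 and 4 batches. The function name `build_index` and the batch counts are made up for illustration, and plain integers stand in for the real generators.

```python
def build_index(batches_per_generator):
    # Hypothetical helper mirroring the idx_i / idx_j bookkeeping.
    idx_i, idx_j = [], []
    for j, n_batches in enumerate(batches_per_generator):
        idx_i.extend(range(n_batches))   # batch position inside generator j
        idx_j.extend([j] * n_batches)    # which generator owns that batch
    return idx_i, idx_j

idx_i, idx_j = build_index([3, 2, 4])
# Each pair is (generator, batch within generator) for global index 0..8.
print(list(zip(idx_j, idx_i)))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2), (2, 3)]
```

Because the lookups are ordinary Python lists, a negative global index also resolves, counting back from the last batch of the last generator.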



A manual check showed 12 unique (building, meter) pairs in the train_sub dataset.

In [14]:
assert len(train_sub[['building_id', 'meter']].drop_duplicates()) == 12
df_category = len(train_sub[['building_id', 'meter']].drop_duplicates())
df_category

12

In our case we want a look-back of 24*7 timesteps, that is, 7 days of hourly readings. The batch size is something we can experiment with; we use 3 here to keep the example short.

In [18]:
# Test
length = 24*7
batch_size = 3
g = DataGenerator(train_sub,length, batch_size = batch_size)

In [19]:
df_len = len(train_sub)
df_len 

102372

In [21]:
len(g)

33453

In [22]:
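# Each meter loses its first `length` rows as targets; the generator can exceed
# that count by fewer than batch_size slots per meter (final partial batches).
assert 0 <= len(g)*batch_size - (df_len - length*df_category) < batch_size*df_category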

Notice that train_sub, the input to the DataGenerator, has 102,372 rows, yet the generator has a length of only 33,453 batches.
The assert statement above checks that this makes sense. If we multiply the length of g by its batch size, we get roughly all the rows except those that lack enough history to form a window of size 24*7. These are the first 24*7 rows of each meter's time series, which cannot be used as targets, hence the length is multiplied by the number of meters.
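
The batch count can also be bracketed by hand from the figures reported in this section. The sketch below is a back-of-the-envelope check, assuming each meter's generator rounds its final partial batch up. The variable names are illustrative, and the constants are copied from the outputs above.

```python
import math

# Figures copied from the notebook outputs above.
df_len, length, batch_size, n_meters = 102372, 24 * 7, 3, 12
observed_batches = 33453

# Rows that can serve as targets once each meter loses its first `length` rows.
usable = df_len - length * n_meters
# Fewest batches possible: all usable rows packed into full batches.
lower = math.ceil(usable / batch_size)
# Most batches possible: every meter ends with one extra partial batch.
upper = usable // batch_size + n_meters

print(usable, lower, upper)  # 100356 33452 33464
assert lower <= observed_batches <= upper
```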

In [175]:
x, y = g[-1000] # For example

In [177]:
y

array([928.628, 970.838, 928.628])

In [176]:
x

array([[1013.05 ,  970.838,  970.838,  970.838,  928.628, 1224.1  ,
         970.838, 1519.57 , 1899.47 , 1688.41 , 1519.57 , 1519.57 ,
        1435.15 , 1561.78 , 1308.52 , 1181.89 , 1055.26 , 1139.68 ,
         759.786,  886.418, 1224.1  , 1097.47 , 1139.68 , 1013.05 ,
         759.786,  970.838,  928.628,  886.418,  928.628, 1139.68 ,
        1308.52 , 1350.73 , 1435.15 , 1519.57 , 1435.15 , 1392.94 ,
        1392.94 , 1477.36 , 1477.36 , 1435.15 , 1139.68 , 1097.47 ,
         844.207, 1013.05 ,  759.786, 1055.26 , 1013.05 ,  970.838,
         844.207,  886.418,  717.576,  886.418, 1139.68 , 1181.89 ,
        1139.68 , 1097.47 , 1350.73 , 1392.94 , 1350.73 ,  717.576,
        1519.57 , 1392.94 , 1266.31 , 1435.15 , 1308.52 , 1139.68 ,
         844.207, 1097.47 , 1055.26 ,  970.838, 1055.26 ,  970.838,
         928.628,  801.997, 1055.26 ,  844.207, 1013.05 , 1139.68 ,
        1055.26 , 1055.26 , 1055.26 , 1477.36 , 1435.15 , 1308.52 ,
        1224.1  , 1308.52 , 1308.52 , 1477.36 , 

As in our first example, y holds one value per sample, 3 here because batch_size is 3, and x contains 3 arrays, each of length 24*7.