<a href="https://colab.research.google.com/github/chandan2294/Time-Series-with-Deep-Learning-/blob/master/Preparing_Features_and_Labels_for_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
try:
  %tensorflow_version 2.x
except Exception:
  pass

In [2]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
print(tf.__version__)

2.2.0-rc3


Create a simple dataset containing a few instances: the integers 0 through 19.

In [4]:
dataset = tf.data.Dataset.range(20)
for val in dataset:
  print(val.numpy())

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19


Now, we'll window the data into chunks of five items, shifting by one each time.

In [9]:
dataset = tf.data.Dataset.range(20)
dataset = dataset.window(5, shift = 1)  # Windows of 5 items, each shifted by 1 from the previous one
for window_dataset in dataset:
  for val in window_dataset:
    print(val.numpy(), end = " ")
  print()

0 1 2 3 4 
1 2 3 4 5 
2 3 4 5 6 
3 4 5 6 7 
4 5 6 7 8 
5 6 7 8 9 
6 7 8 9 10 
7 8 9 10 11 
8 9 10 11 12 
9 10 11 12 13 
10 11 12 13 14 
11 12 13 14 15 
12 13 14 15 16 
13 14 15 16 17 
14 15 16 17 18 
15 16 17 18 19 
16 17 18 19 
17 18 19 
18 19 
19 


Each line of the output is one window: the first starts at 0, the second at 1, and so on. Near the end of the dataset there aren't enough items left to fill a window of five, so the last four lines are shorter.

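To get only full chunks of five records, we'll set *drop_remainder = True*. Each window is itself a small dataset, so we also use flat_map with batch(5) to turn every window into a single tensor of five values.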
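
To make the windowing arithmetic concrete, here is a minimal pure-Python sketch of the same idea. The helper `sliding_windows` is a hypothetical name introduced only for illustration; it is not part of tf.data and only mimics the behaviour seen in the outputs of this notebook.

```python
def sliding_windows(items, size, shift=1, drop_remainder=False):
    """Yield one window per start position, mimicking the notebook's output."""
    items = list(items)
    for start in range(0, len(items), shift):
        window = items[start:start + size]
        # Once a window comes up short, every later window is short too.
        if drop_remainder and len(window) < size:
            break
        yield window


ragged = list(sliding_windows(range(20), size=5))
full = list(sliding_windows(range(20), size=5, drop_remainder=True))

print(len(ragged), ragged[-1])  # 20 windows, the last one holds only [19]
print(len(full), full[-1])      # 16 windows, the last one is [15, 16, 17, 18, 19]
```

With 20 items, a window size of 5 and a shift of 1, the number of full windows is 20 - 5 + 1 = 16.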

In [10]:
dataset = tf.data.Dataset.range(20)
dataset = dataset.window(5, shift = 1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))  # Turn each window dataset into one tensor of 5 values
for window in dataset:
  print(window.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
[5 6 7 8 9]
[ 6  7  8  9 10]
[ 7  8  9 10 11]
[ 8  9 10 11 12]
[ 9 10 11 12 13]
[10 11 12 13 14]
[11 12 13 14 15]
[12 13 14 15 16]
[13 14 15 16 17]
[14 15 16 17 18]
[15 16 17 18 19]


Now we split each window into x's and y's, or features and labels. We'll take the last value in each window as the label, and we'll split using a lambda: window[:-1] is every value except the last one, and window[-1:] is the last value only.
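
The two slices behave the same way on a plain Python list, which makes them easy to check in isolation. The sketch below uses an ordinary list instead of a tensor, so it only illustrates the slicing, not the tf.data pipeline itself.

```python
window = [0, 1, 2, 3, 4]

features = window[:-1]     # everything except the last value
label = window[-1:]        # the last value, kept as a length-1 sequence
scalar_label = window[-1]  # plain indexing would drop the sequence wrapper

print(features)      # [0, 1, 2, 3]
print(label)         # [4]
print(scalar_label)  # 4
```

Using `[-1:]` rather than `[-1]` keeps the label as a one-element sequence, which is why the labels print as `[4]`, `[5]`, and so on.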

In [11]:
dataset = tf.data.Dataset.range(20)
dataset = dataset.window(5, shift = 1, drop_remainder = True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
for x, y in dataset:
  print(x.numpy(), y.numpy())

[0 1 2 3] [4]
[1 2 3 4] [5]
[2 3 4 5] [6]
[3 4 5 6] [7]
[4 5 6 7] [8]
[5 6 7 8] [9]
[6 7 8 9] [10]
[ 7  8  9 10] [11]
[ 8  9 10 11] [12]
[ 9 10 11 12] [13]
[10 11 12 13] [14]
[11 12 13 14] [15]
[12 13 14 15] [16]
[13 14 15 16] [17]
[14 15 16 17] [18]
[15 16 17 18] [19]


Next, we shuffle the data. This is achieved with the shuffle method. Shuffling rearranges the examples so that their order doesn't accidentally introduce a sequence bias.

Sequence Bias: Sequence bias is when the order of things can impact the selection of things. For example, if I were to ask you your favorite TV show, and listed "Game of Thrones", "Killing Eve", "Travellers" and "Doctor Who" in that order, you'd probably be more likely to select "Game of Thrones", since you are familiar with it and it's the first thing you see, even if you like the other shows equally. When training on a dataset, we don't want the order of the examples to impact the training in a similar way, so it's good to shuffle them up.
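
The shuffle method takes a buffer_size argument. As a rough mental model, it keeps a buffer of that many elements, emits a random one, and refills the slot with the next incoming element. The pure-Python sketch below imitates that idea; `buffered_shuffle` is a hypothetical helper written for this illustration and is not how TensorFlow implements it internally.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Shuffle a stream using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # drain whatever is left in random order
    yield from buffer


no_shuffle = list(buffered_shuffle(range(16), buffer_size=1))
small_buffer = list(buffered_shuffle(range(16), buffer_size=4, seed=0))
full_shuffle = list(buffered_shuffle(range(16), buffer_size=20, seed=0))
```

A buffer of size 1 leaves the order unchanged, and a small buffer only mixes nearby elements. A buffer at least as large as the dataset (16 windows here, with buffer_size=20) allows any ordering.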

In [12]:
dataset = tf.data.Dataset.range(20)
dataset = dataset.window(5, shift = 1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
dataset = dataset.shuffle(buffer_size=20)
for x, y in dataset:
  print(x.numpy(), y.numpy())

[ 7  8  9 10] [11]
[13 14 15 16] [17]
[0 1 2 3] [4]
[12 13 14 15] [16]
[ 8  9 10 11] [12]
[15 16 17 18] [19]
[2 3 4 5] [6]
[4 5 6 7] [8]
[11 12 13 14] [15]
[3 4 5 6] [7]
[6 7 8 9] [10]
[14 15 16 17] [18]
[10 11 12 13] [14]
[ 9 10 11 12] [13]
[1 2 3 4] [5]
[5 6 7 8] [9]


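Finally comes batching. By setting a batch size of three, our data gets batched into 3 x's and 3 y's at a time. Since there are 16 windows, the last batch holds the single leftover example. The prefetch(1) call lets the pipeline prepare the next batch while the current one is being consumed.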
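
The number of batches follows directly from the number of windows. A small pure-Python sketch of the arithmetic, where `make_batches` is a hypothetical helper used only to illustrate the grouping:

```python
import math


def make_batches(items, batch_size):
    """Group items into consecutive batches; the last one may be smaller."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


num_windows = 20 - 5 + 1  # 16 full windows from 20 items
batches = make_batches(range(num_windows), batch_size=3)

print(len(batches), math.ceil(num_windows / 3))  # 6 6
print(len(batches[-1]))                          # 1
```

If a partial final batch is not wanted, the tf.data batch method also accepts drop_remainder=True, which would leave five batches of three here.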

In [14]:
dataset = tf.data.Dataset.range(20)
dataset = dataset.window(5, shift = 1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
dataset = dataset.shuffle(buffer_size=20)
dataset = dataset.batch(3).prefetch(1)
for x, y in dataset:
  print("X = ", x.numpy())
  print('y = ', y.numpy())

X =  [[ 2  3  4  5]
 [ 0  1  2  3]
 [14 15 16 17]]
y =  [[ 6]
 [ 4]
 [18]]
X =  [[6 7 8 9]
 [1 2 3 4]
 [3 4 5 6]]
y =  [[10]
 [ 5]
 [ 7]]
X =  [[ 5  6  7  8]
 [13 14 15 16]
 [ 9 10 11 12]]
y =  [[ 9]
 [17]
 [13]]
X =  [[12 13 14 15]
 [ 7  8  9 10]
 [ 4  5  6  7]]
y =  [[16]
 [11]
 [ 8]]
X =  [[15 16 17 18]
 [ 8  9 10 11]
 [10 11 12 13]]
y =  [[19]
 [12]
 [14]]
X =  [[11 12 13 14]]
y =  [[15]]
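
The steps above can be collected into one reusable function. This is a minimal sketch, assuming TensorFlow 2.x and NumPy are installed; the function name `windowed_dataset` and its parameter names are choices made for this illustration, not something defined earlier in the notebook.

```python
import numpy as np
import tensorflow as tf


def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    """Build (features, label) batches from a 1-D series."""
    dataset = tf.data.Dataset.from_tensor_slices(series)
    # window_size features plus one label per window
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
    dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
    dataset = dataset.shuffle(shuffle_buffer)
    return dataset.batch(batch_size).prefetch(1)


series = np.arange(20)
batches = list(windowed_dataset(series, window_size=4, batch_size=3,
                                shuffle_buffer=20))
for x, y in batches:
    print("X = ", x.numpy())
    print("y = ", y.numpy())
```

Here window_size counts only the features, so the function windows by window_size + 1 to leave room for the label. The batches come out in a different order on each run because of the shuffle.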
