### Objectives of sprint 1:

- Scaling numeric data and transforming categorical data.
- Padding and truncating input sequences of varied lengths.
- Transforming input sequences into a supervised learning problem.

### Preparing Numeric Data

Normalize Series Data

In [2]:
from pandas import Series
from sklearn.preprocessing import MinMaxScaler

# define contrived series
data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
series = Series(data)
print(series)

# prepare data for normalization
values = series.values
values = values.reshape((len(values), 1))

# train the normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(values)
# data_min_ and data_max_ hold one entry per feature; index the single feature
print('\nMin: %f, Max: %f' % (scaler.data_min_[0], scaler.data_max_[0]))

# normalize the dataset and print
normalized = scaler.transform(values)
print('\n', normalized)

# inverse transform and print
inversed = scaler.inverse_transform(normalized)
print('\n', inversed)

0     10.0
1     20.0
2     30.0
3     40.0
4     50.0
5     60.0
6     70.0
7     80.0
8     90.0
9    100.0
dtype: float64

Min: 10.000000, Max: 100.000000

 [[0.        ]
 [0.11111111]
 [0.22222222]
 [0.33333333]
 [0.44444444]
 [0.55555556]
 [0.66666667]
 [0.77777778]
 [0.88888889]
 [1.        ]]

 [[ 10.]
 [ 20.]
 [ 30.]
 [ 40.]
 [ 50.]
 [ 60.]
 [ 70.]
 [ 80.]
 [ 90.]
 [100.]]
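
The normalization above can also be reproduced by hand, which makes the arithmetic explicit. The following is a minimal sketch, assuming plain NumPy and the same contrived series; the variable names are illustrative and not part of scikit-learn.

```python
import numpy as np

# same contrived series as in the cell above
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0])

# min-max normalization: y = (x - min) / (max - min)
data_min, data_max = values.min(), values.max()
normalized = (values - data_min) / (data_max - data_min)
print(normalized)

# inverse transform: x = y * (max - min) + min
inversed = normalized * (data_max - data_min) + data_min
print(inversed)
```

The minimum and maximum should be taken from the training data only, so that the same scale can be applied to new data later.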


Standardize Series Data

In [4]:
from sklearn.preprocessing import StandardScaler
from math import sqrt

# define contrived series
data = [1.0, 5.5, 9.0, 2.6, 8.8, 3.0, 4.1, 7.9, 6.3]
series = Series(data)
print('\n', series)

# prepare data for standardization
values = series.values
values = values.reshape((len(values), 1))

# train the standardization
scaler = StandardScaler()
scaler = scaler.fit(values)
# mean_ and var_ hold one entry per feature; index the single feature
print('\nMean: %f, StandardDeviation: %f' % (scaler.mean_[0], sqrt(scaler.var_[0])))

# standardize the dataset and print
standardized = scaler.transform(values)
print('\n', standardized)

# inverse transform and print
inversed = scaler.inverse_transform(standardized)
print('\n', inversed)


 0    1.0
1    5.5
2    9.0
3    2.6
4    8.8
5    3.0
6    4.1
7    7.9
8    6.3
dtype: float64

Mean: 5.355556, StandardDeviation: 2.712568

 [[-1.60569456]
 [ 0.05325007]
 [ 1.34354035]
 [-1.01584758]
 [ 1.26980948]
 [-0.86838584]
 [-0.46286604]
 [ 0.93802055]
 [ 0.34817357]]

 [[1. ]
 [5.5]
 [9. ]
 [2.6]
 [8.8]
 [3. ]
 [4.1]
 [7.9]
 [6.3]]
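
Standardization subtracts the mean and divides by the standard deviation. The sketch below reproduces the result by hand, assuming plain NumPy; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# same contrived series as in the cell above
values = np.array([1.0, 5.5, 9.0, 2.6, 8.8, 3.0, 4.1, 7.9, 6.3])

# standardization: y = (x - mean) / std, with the population std (ddof=0)
mean = values.mean()
std = values.std()
standardized = (values - mean) / std
print('Mean: %f, StandardDeviation: %f' % (mean, std))
print(standardized)

# inverse transform: x = y * std + mean
inversed = standardized * std + mean
print(inversed)
```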


### Preparing Categorical Data

Convert Categorical Data to Numerical Data

This involves two steps:
1. Integer Encoding.
2. One Hot Encoding.

In [6]:
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm']
values = array(data)
print('\n', values)

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print('\n', integer_encoded)

# one hot encode
# (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print('\n', onehot_encoded)

# invert first example
inverted = label_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print('\n', inverted)


 ['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm']

 [0 0 2 0 1 1 2 0 2]

 [[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]]

 ['cold']
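
The two steps can also be sketched without scikit-learn, which shows what each encoder does. This assumes plain NumPy; `classes` plays the role of the label encoder's sorted class list.

```python
import numpy as np

data = np.array(['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm'])

# integer encode: classes are the sorted unique labels,
# integer_encoded holds each label's index into classes
classes, integer_encoded = np.unique(data, return_inverse=True)
print(classes, integer_encoded)

# one hot encode: row i of the identity matrix is the vector for class i
onehot_encoded = np.eye(len(classes))[integer_encoded]
print(onehot_encoded)

# invert the first example: argmax recovers the integer, classes the label
inverted = classes[np.argmax(onehot_encoded[0])]
print(inverted)
```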


### Prepare Sequences with Varied Lengths

Sequence Padding

In [12]:
from keras.preprocessing.sequence import pad_sequences

# define sequences
sequences = [[1, 2, 3, 4],
             [1, 2, 3],
             [1]]

# pre-sequence padding
padded = pad_sequences(sequences)
print(padded)

# post-sequence padding
padded = pad_sequences(sequences, padding = 'post')
print('\n', padded)

[[1 2 3 4]
 [0 1 2 3]
 [0 0 0 1]]

 [[1 2 3 4]
 [1 2 3 0]
 [1 0 0 0]]


Sequence Truncation

In [16]:
# pre-sequence truncation
truncated = pad_sequences(sequences, maxlen = 2)
print(truncated)

# post-sequence truncation
truncated = pad_sequences(sequences, maxlen = 2, truncating = 'post')
print('\n', truncated)

[[3 4]
 [2 3]
 [0 1]]

 [[1 2]
 [1 2]
 [0 1]]
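
To make the `padding` and `truncating` options concrete, here is a minimal NumPy sketch that mimics the behavior shown above. `pad_or_truncate` is a hypothetical helper written for illustration, not a Keras function, and it assumes integer sequences and `maxlen >= 1`.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical stand-in for pad_sequences (integer data, maxlen >= 1)."""
    if maxlen is None:
        # default: pad everything to the longest sequence
        maxlen = max(len(s) for s in sequences)
    out = np.full((len(sequences), maxlen), value, dtype=int)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue
        # 'pre' truncation drops values from the front, 'post' from the end
        trunc = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        if padding == 'pre':
            out[i, maxlen - len(trunc):] = trunc  # zeros go in front
        else:
            out[i, :len(trunc)] = trunc           # zeros go at the end
    return out

sequences = [[1, 2, 3, 4], [1, 2, 3], [1]]
print(pad_or_truncate(sequences))
print(pad_or_truncate(sequences, padding='post'))
print(pad_or_truncate(sequences, maxlen=2))
print(pad_or_truncate(sequences, maxlen=2, truncating='post'))
```

Both options default to `'pre'`, so a sequence shorter than `maxlen` is still pre-padded even when `truncating='post'` is requested.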


Transform time series data into a supervised learning problem

In [27]:
from pandas import DataFrame

# define the sequence
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)

   t
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9


In [28]:
# shift forward
df['t-1'] = df['t'].shift(1)
print(df)

# shift backward
df['t+1'] = df['t'].shift(-1)
print('\n', df)

   t  t-1
0  0  NaN
1  1  0.0
2  2  1.0
3  3  2.0
4  4  3.0
5  5  4.0
6  6  5.0
7  7  6.0
8  8  7.0
9  9  8.0

    t  t-1  t+1
0  0  NaN  1.0
1  1  0.0  2.0
2  2  1.0  3.0
3  3  2.0  4.0
4  4  3.0  5.0
5  5  4.0  6.0
6  6  5.0  7.0
7  7  6.0  8.0
8  8  7.0  9.0
9  9  8.0  NaN
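
The shifted frame still contains `NaN` rows at both ends, which have no complete input/output pair. A minimal sketch of the remaining step, assuming the goal is to predict `t+1` from `t-1` and `t` (the names `X` and `y` are illustrative):

```python
from pandas import DataFrame

# rebuild the shifted frame from the cells above
df = DataFrame({'t': list(range(10))})
df['t-1'] = df['t'].shift(1)
df['t+1'] = df['t'].shift(-1)

# drop the first and last rows, which contain NaN after shifting
supervised = df.dropna()

# inputs are the lagged and current values, the target is the next value
X = supervised[['t-1', 't']].values
y = supervised['t+1'].values
print(X.shape, y.shape)
print(X[0], y[0])
```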


### Objectives of sprint 2:

- Defining an LSTM model, including how to reshape your data for the required 3D input.
- Fitting and evaluating an LSTM model and using it to make predictions on new data.
- Taking fine-grained control over the internal state in the model and when it is reset.

### Define the Model

In [51]:
from keras import Sequential, layers

model = Sequential()

model.add(layers.LSTM(2))  # LSTM hidden layer with 2 memory cells
model.add(layers.Dense(1))  # Dense output layer with 1 neuron

Another way of defining the model

In [52]:
layer = [layers.LSTM(2), layers.Dense(1)]
model = Sequential(layer)

The first hidden layer in the network must define the number of inputs to expect, i.e. the shape of the input. Input must be three-dimensional, comprising samples, time steps, and features, in that order.

- Samples. These are the rows in your data. One sample may be one sequence.
- Time steps. These are the past observations for a feature, such as lag variables.
- Features. These are columns in your data.

In [48]:
import numpy as np

data = np.array([[1, 2],
                 [2, 3],
                 [3, 4],
                 [4, 5],
                 [5, 6],
                 [6, 7],
                 [7, 8]])

out = [3, 4, 5, 6, 7, 8, 9]

data = data.reshape((data.shape[0], data.shape[1], 1))

print('Number of samples:', data.shape[0])
print('Number of time steps:', data.shape[1])
print('Number of features per time step:', data.shape[2])

Number of samples: 7
Number of time steps: 2
Number of features per time step: 1
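
The `data` and `out` arrays above look like a sliding window over the series 1 to 9. Assuming that is how they were built, the sketch below derives both from the raw series and then applies the same 3D reshape; `series` and `n_steps` are illustrative names.

```python
import numpy as np

series = np.arange(1, 10)  # 1, 2, ..., 9
n_steps = 2                # time steps per sample

# each sample is a window of n_steps values; the target is the value that follows
X = np.array([series[i:i + n_steps] for i in range(len(series) - n_steps)])
y = series[n_steps:]

# reshape to (samples, time steps, features) with a single feature
X = X.reshape((X.shape[0], X.shape[1], 1))
print(X.shape, y.shape)
```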


The `input_shape` argument expects a tuple containing the number of
**time steps** and the number of **features**.

In [54]:
model = Sequential()
model.add(layers.LSTM(5, input_shape = (2,1)))
model.add(layers.Dense(1))
model.add(layers.Activation('sigmoid'))
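
As a sanity check on the model defined above, the number of trainable weights can be worked out by hand. This is a sketch assuming a standard LSTM layer (four gate blocks, each with input weights, recurrent weights, and a bias); `lstm_params` and `dense_params` are hypothetical helpers, not Keras functions.

```python
def lstm_params(units, features):
    # four gate blocks, each with input weights, recurrent weights, and a bias
    return 4 * (units * (features + units) + units)

def dense_params(inputs, outputs):
    # one weight per input/output pair plus one bias per output
    return inputs * outputs + outputs

lstm = lstm_params(units=5, features=1)    # LSTM(5) fed 1 feature per time step
dense = dense_params(inputs=5, outputs=1)  # Dense(1) on the 5 LSTM outputs
print(lstm, dense, lstm + dense)           # 140 6 146
```

The number of time steps does not appear in the count because the same weights are reused at every step, and the activation layer adds no weights.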

Compiling the model

In [57]:
model.compile(optimizer='sgd', loss='mse', metrics=['accuracy'])