# Preparing Datasets for LSTM Models

The input to every LSTM layer must be three-dimensional.

The three dimensions of this input are:

- <font color=blue> __Samples:__</font>    One sequence is one sample. A batch consists of one or more samples.
- <font color=blue> __Time Steps:__</font> One time step is one point of observation in the sample.
- <font color=blue> __Features:__</font>   One feature is one observation at a time step.

This means that the input layer expects a 3D array of data when fitting the model and when making predictions, even if specific dimensions of the array contain a single value, e.g. one sample or one feature.

When defining the input layer of your LSTM network, the network assumes you have 1 or more samples and requires that you specify the number of time steps and the number of features. You can do this by specifying a tuple to the “input_shape” argument.

For example, the model below defines an input layer that expects 1 or more samples, 50 time steps, and 2 features:

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(50, 2)))
model.add(Dense(1))

Now that we know how to define an LSTM input layer and the expectations of 3D inputs, let’s look at some examples of how we can prepare our data for the LSTM.

<font color=red> __Notice__:</font>

- LSTMs expect 3D input, and getting your data into that shape can be challenging.
- LSTMs tend to work poorly on sequences of more than 200-400 time steps, so longer series need to be split into samples.

## Example of LSTM With Single Input Sample

Consider the case where you have one sequence of multiple time steps and one feature.

For example, this could be a sequence of 10 values:

0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0

We can define this sequence of numbers as a NumPy array.

In [18]:
from numpy import array
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

We can then use the reshape() function on the NumPy array to reshape this one-dimensional array into a three-dimensional array with 1 sample, 10 time steps, and 1 feature at each time step.

When called on an array, the reshape() function takes the new shape of the array, usually passed as a tuple. Not every tuple of numbers is valid: reshape() only reorganizes the existing data, so the product of the new dimensions must equal the number of elements in the array.

In [19]:
data = data.reshape((1, 10, 1))

Once reshaped, we can print the new shape of the array.

In [20]:
print(data.shape)

(1, 10, 1)
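As a small illustration of the constraint on the new shape, the sketch below reshapes the same 10 values with a valid shape, with an inferred dimension (`-1`), and with a shape that does not fit. The variable names are only for illustration.

```python
import numpy as np

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# 10 elements fit the shape (1, 10, 1) because 1 * 10 * 1 == 10.
ok = data.reshape((1, 10, 1))

# Passing -1 lets NumPy infer that dimension from the element count.
inferred = data.reshape((1, -1, 1))

# A shape whose product is not 10 is rejected with a ValueError.
try:
    data.reshape((1, 3, 1))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(ok.shape, inferred.shape, mismatch_rejected)
```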


Putting all of this together, the complete example is listed below.

In [21]:
from numpy import array
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
data = data.reshape((1, 10, 1))
print(data.shape)

(1, 10, 1)


This data is now ready to be used as input (X) to the LSTM with an input_shape of (10, 1).

In [None]:
model = Sequential()
model.add(LSTM(32, input_shape=(10, 1)))
model.add(Dense(1))

## Example of LSTM with Multiple Input Features

Consider the case where you have multiple parallel series as input for your model.

For example, this could be two parallel series of 10 values:

<font color=red> series 1:</font> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0

<font color=red> series 2:</font> 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1

We can define these data as a matrix of 2 columns with 10 rows:

In [23]:
from numpy import array
data = array([
	[0.1, 1.0],
	[0.2, 0.9],
	[0.3, 0.8],
	[0.4, 0.7],
	[0.5, 0.6],
	[0.6, 0.5],
	[0.7, 0.4],
	[0.8, 0.3],
	[0.9, 0.2],
	[1.0, 0.1]])

This data can be framed as 1 sample with 10 time steps and 2 features.

It can be reshaped as a 3D array as follows:

In [24]:
data = data.reshape(1, 10, 2)

Putting all of this together, the complete example is listed below. Running it prints the new 3D shape of the single sample.

In [27]:
from numpy import array
data = array([
	[0.1, 1.0],
	[0.2, 0.9],
	[0.3, 0.8],
	[0.4, 0.7],
	[0.5, 0.6],
	[0.6, 0.5],
	[0.7, 0.4],
	[0.8, 0.3],
	[0.9, 0.2],
	[1.0, 0.1]])
data = data.reshape(1, 10, 2)
print(data.shape)

(1, 10, 2)


This data is now ready to be used as input (X) to the LSTM with an input_shape of (10, 2).

In [None]:
model = Sequential()
model.add(LSTM(32, input_shape=(10, 2)))
model.add(Dense(1))
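If the two series start out as separate 1D arrays rather than as a ready-made matrix, they first need to be stacked as columns. The sketch below assumes two illustrative arrays, `series_1` and `series_2`, and also shows a common pitfall: reshaping a (2, 10) array directly does not pair up the values by time step.

```python
import numpy as np

series_1 = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
series_2 = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])

# Stack the series as columns: 10 rows (time steps) by 2 columns (features).
stacked = np.column_stack((series_1, series_2))

# Add the leading samples dimension: [samples, time steps, features].
X = stacked.reshape((1, 10, 2))
print(X.shape)   # (1, 10, 2)
print(X[0, 0])   # first time step: one value from each series

# Pitfall: stacking the series as rows gives shape (2, 10), and reshaping
# that directly mixes neighbouring values of series_1 into one time step.
wrong = np.array([series_1, series_2]).reshape((1, 10, 2))
print(wrong[0, 0])
```

In `X`, the first time step holds 0.1 and 1.0 (one value per series), whereas in `wrong` it holds 0.1 and 0.2, both taken from the first series.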

## Tips for LSTM Input

This section lists some tips to help you when preparing your input data for LSTMs.

- The LSTM input layer must be 3D.
- The 3 input dimensions are, in order: samples, time steps, and features.
- The LSTM input layer is defined by the input_shape argument on the first hidden layer.
- The input_shape argument takes a tuple of two values that define the number of time steps and features.
- The number of samples is assumed to be 1 or more.
- The reshape() function on NumPy arrays can be used to reshape your 1D or 2D data to be 3D.
- The reshape() function takes the new shape as its argument, usually as a tuple.
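The tips above can be folded into a small convenience function. This is only a sketch: `to_lstm_input` is a hypothetical helper, not part of NumPy or Keras, and it assumes that a 2D array is laid out as [time steps, features] for a single sample, as in the multiple-features example.

```python
import numpy as np

def to_lstm_input(arr):
    """Hypothetical helper: return arr as [samples, time steps, features]."""
    arr = np.asarray(arr)
    if arr.ndim == 1:
        # One sequence with one feature per time step.
        return arr.reshape((1, arr.shape[0], 1))
    if arr.ndim == 2:
        # Assumption: rows are time steps and columns are features.
        return arr.reshape((1, arr.shape[0], arr.shape[1]))
    if arr.ndim == 3:
        # Already 3D, so return it unchanged.
        return arr
    raise ValueError("expected a 1D, 2D or 3D array, got %dD" % arr.ndim)

print(to_lstm_input(np.arange(10)).shape)      # (1, 10, 1)
print(to_lstm_input(np.zeros((10, 2))).shape)  # (1, 10, 2)
```

If your 2D data is instead laid out as [samples, time steps], the 2D branch would need to add the trailing features dimension rather than the leading samples dimension.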

# Techniques to Handle Very Long Sequences with LSTMs

Long Short-Term Memory or LSTM recurrent neural networks are capable of learning and remembering over long sequences of inputs.

LSTMs work very well if your problem has one output for every input, like time series forecasting or text translation. But LSTMs can be challenging to use when you have very long input sequences and only one or a handful of outputs.

This is often called sequence classification (and sometimes, more loosely, sequence labeling).

Some examples include:

- Classification of sentiment in documents containing thousands of words (natural language processing).
- Classification of an EEG trace of thousands of time steps (medicine).
- Classification of coding or non-coding genes for sequences of thousands of DNA base pairs (bioinformatics).

These so-called sequence classification tasks require special handling when using recurrent neural networks, like LSTMs.

## 1. Use Sequences As-Is

The starting point is to use the long sequence data as-is without change.

This may result in the problem of very long training times.

More troubling, attempting to back-propagate across very long input sequences may result in vanishing gradients, and in turn, an unlearnable model.

A reasonable limit of 250-500 time steps is often used in practice with large LSTM models.

## 2. Truncate Sequences

A common technique for handling very long sequences is to simply truncate them.

This can be done by selectively removing time steps from the beginning or the end of input sequences.

This will allow you to force the sequences to a manageable length at the cost of losing data.

The risk of truncating input sequences is that data the model needs in order to make accurate predictions is lost.
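A minimal sketch of truncation with NumPy slicing is shown below. The `truncate` function and its `keep` parameter are hypothetical names used for illustration; which end to keep depends on where the useful signal sits in your data.

```python
import numpy as np

def truncate(sequence, max_len, keep="end"):
    """Hypothetical helper: keep at most max_len time steps of a sequence."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    sequence = np.asarray(sequence)
    if keep == "end":
        # Drop time steps from the beginning, keeping the most recent ones.
        return sequence[-max_len:]
    if keep == "start":
        # Drop time steps from the end, keeping the earliest ones.
        return sequence[:max_len]
    raise ValueError("keep must be 'start' or 'end'")

long_sequence = np.arange(1000)
recent = truncate(long_sequence, 250)                  # steps 750..999
earliest = truncate(long_sequence, 250, keep="start")  # steps 0..249
print(recent.shape, earliest.shape)
```

Sequences that are already shorter than `max_len` are returned unchanged.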

## 3. Summarize Sequences

In some problem domains, it may be possible to summarize the input sequences.

For example, in the case where input sequences are words, it may be possible to remove all words from input sequences that are above a specified word frequency (e.g. “and”, “the”, etc.).

This could be framed as keeping only the observations whose frequency rank in the entire training dataset is above some fixed value, where rank 1 is the most frequent observation.

Summarization may result in both focusing the problem on the most salient parts of the input sequences and sufficiently reducing the length of input sequences.
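The word-frequency idea can be sketched with the standard library as follows. The function name `drop_frequent_words`, the `top_k` cut-off, and the toy documents are assumptions made for the example, not a prescribed recipe.

```python
from collections import Counter

def drop_frequent_words(documents, top_k):
    """Hypothetical helper: remove the top_k most frequent words overall."""
    # Count every word across the whole (training) collection.
    counts = Counter(word for doc in documents for word in doc)
    frequent = {word for word, _ in counts.most_common(top_k)}
    # Keep the remaining words in their original order.
    return [[w for w in doc if w not in frequent] for doc in documents]

docs = [
    ["the", "movie", "was", "great", "and", "the", "cast", "and", "the", "score"],
    ["the", "plot", "and", "the", "pacing", "were", "weak"],
]
shortened = drop_frequent_words(docs, top_k=2)
print(shortened)
```

Here "the" and "and" are the two most frequent words, so they are removed and both sequences become shorter while the less common, more informative words remain.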

## 4. Random Sampling

A less systematic approach may be to summarize a sequence using random sampling.

Random time steps may be selected and removed from the sequence in order to reduce it to a specific length.

Alternatively, random contiguous subsequences may be selected to construct a new sampled sequence of the desired length, taking care to handle overlap or non-overlap as required by the domain.

This approach may be suitable in cases where there is no obvious way to systematically reduce the sequence length.

This approach may also be used as a type of data augmentation scheme in order to create many possible different input sequences from each input sequence. Such methods can improve the robustness of models when available training data is limited.
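Both variants of random sampling can be sketched with NumPy's random generator. The sequence, the target length of 200, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sequence = np.arange(1000)
target_len = 200

# Variant 1: keep a random subset of time steps. Sorting the chosen
# indices preserves the original temporal order of the observations.
keep_idx = np.sort(rng.choice(len(sequence), size=target_len, replace=False))
subsampled = sequence[keep_idx]

# Variant 2: take one random contiguous window of the target length.
start = rng.integers(0, len(sequence) - target_len + 1)
window = sequence[start:start + target_len]

print(subsampled.shape, window.shape)
```

Drawing several windows or subsets from the same sequence, each with a fresh random draw, is one way to use this as data augmentation.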

## 5. Use Truncated Backpropagation Through Time

Rather than updating the model based on the entire sequence, the gradient can be estimated from a subset of the last time steps.

This is called Truncated Backpropagation Through Time, or TBPTT for short. It can dramatically speed up the learning process of recurrent neural networks like LSTMs on long sequences.

This allows entire sequences to be provided as input and run through the forward pass, while only the last tens or hundreds of time steps are used to estimate the gradients and update the weights.

Some modern implementations of LSTMs permit you to specify the number of time steps to use for updates, separate from the time steps used as input sequences.
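To make the idea concrete without a deep learning library, the sketch below uses a toy scalar linear recurrence, h_t = w * h_{t-1} + x_t, rather than an LSTM. It is an illustration under that simplifying assumption only: the forward pass covers all 500 time steps, while the gradient of the final state with respect to `w` is accumulated over the last `k` steps. The names `forward` and `grad_w` are hypothetical.

```python
def forward(xs, w):
    """Run h_t = w * h_{t-1} + x_t from h_0 = 0 and return [h_0, ..., h_T]."""
    hs = [0.0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs

def grad_w(hs, w, k):
    """Gradient of h_T with respect to w, using only the last k time steps."""
    T = len(hs) - 1
    grad = 0.0
    upstream = 1.0  # d h_T / d h_t, starting at t = T
    for t in range(T, max(T - k, 0), -1):
        grad += upstream * hs[t - 1]  # direct effect of w at step t
        upstream *= w                 # propagate back to h_{t-1}
    return grad

xs = [1.0] * 500
w = 0.9
hs = forward(xs, w)            # the forward pass uses the full sequence

full = grad_w(hs, w, k=500)    # ordinary backpropagation through time
truncated = grad_w(hs, w, k=50)
print(full, truncated)
```

With a decaying recurrence such as `w = 0.9`, the truncated estimate stays close to the full gradient at a tenth of the backward-pass cost. When long-range dependencies matter, the truncation discards exactly that information.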

## 6. Use an Encoder-Decoder Architecture

You can use an autoencoder to learn a new, shorter representation of long sequences, then a decoder network to interpret the encoded representation into the desired output.

This may involve an unsupervised autoencoder as a pre-processing pass on sequences, or the more recent encoder-decoder LSTM style networks used for natural language translation.

Again, there may still be difficulties in learning from very long sequences, but the more sophisticated architecture may offer additional leverage or skill, especially if combined with one or more of the techniques above.

## 1. Load the Data

I assume you know how to load the data as a Pandas Series or DataFrame.

Here, we will mock loading by defining a new dataset in memory with 5,000 time steps.

In [8]:
from numpy import array

# load...
data = list()
n = 5000
for i in range(n):
	data.append([i+1, (i+1)*10])
data = array(data)
print(data[:5, :])
print(data.shape)

[[ 1 10]
 [ 2 20]
 [ 3 30]
 [ 4 40]
 [ 5 50]]
(5000, 2)


Running this cell prints both the first 5 rows of the data and the shape of the loaded data.

## 2. Drop Time

If your time series data is uniform over time and there are no missing values, you can drop the time column.

If not, you may want to look at imputing the missing values, resampling the data to a new time scale, or developing a model that can handle missing values.

Here, we just drop the first column:

In [9]:
# drop time
data = data[:, 1]
print(data.shape)

(5000,)


Now we have an array of 5,000 values.
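Whether it is safe to drop the time column can be checked before doing so. The sketch below rebuilds the mocked dataset under the name `raw` (an illustrative name) and tests for a constant step between consecutive time values and for missing observations.

```python
import numpy as np

# Rebuild the mocked dataset: column 0 is time, column 1 is the observation.
n = 5000
raw = np.array([[i + 1, (i + 1) * 10] for i in range(n)])

# The spacing is uniform if every difference between consecutive
# time values equals the first difference.
steps = np.diff(raw[:, 0])
is_uniform = bool(np.all(steps == steps[0]))

# Missing observations would show up as NaN after conversion to float.
has_missing = bool(np.isnan(raw.astype(float)).any())

if is_uniform and not has_missing:
    values = raw[:, 1]  # safe to drop the time column
else:
    raise ValueError("impute or resample before dropping the time column")

print(is_uniform, has_missing, values.shape)
```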

## 3. Split Into Samples

LSTMs need to process samples where each sample is a single time series.

In this case, 5,000 time steps is too long; LSTMs work better with 200-to-400 time steps based on some papers I’ve read. Therefore, we need to split the 5,000 time steps into multiple shorter sub-sequences.

Here, we will split the 5,000 time steps into 25 sub-sequences of 200 time steps each. Rather than using NumPy or Python tricks, we will do this the old-fashioned way so you can see what is going on.

In [10]:
# split into samples (e.g. 5000/200 = 25)
samples = list()
length = 200
# step over the 5,000 in jumps of 200
for i in range(0,n,length):
	# grab from i to i + 200
	sample = data[i:i+length]
	samples.append(sample)
print(len(samples))

25


We now have 25 sub-sequences of 200 time steps each.
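For comparison, the same split can be written with a single NumPy reshape. The sketch below uses a hypothetical helper, `split_into_samples`, which also drops any trailing time steps that do not fill a complete sample. With the explicit loop, a series whose length is not a multiple of 200 would end with a shorter final sample.

```python
import numpy as np

def split_into_samples(series, length):
    """Hypothetical helper: cut a 1D series into rows of `length` time steps."""
    series = np.asarray(series)
    n_samples = len(series) // length
    # Drop the trailing remainder so that every sample has the same length.
    usable = series[:n_samples * length]
    return usable.reshape((n_samples, length))

series = np.arange(1, 5001) * 10           # same values as the mocked data
samples_2d = split_into_samples(series, 200)
print(samples_2d.shape)                    # (25, 200)

# 5,050 time steps: the last 50 do not fill a sample and are dropped.
ragged = split_into_samples(np.arange(5050), 200)
print(ragged.shape)                        # (25, 200)
```

Dropping the remainder is only one option. Padding the final sample or using overlapping windows may suit some problems better.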

## 4. Reshape Subsequences

The LSTM needs data in the format [samples, time steps, features]. Here, we have 25 samples, 200 time steps per sample, and 1 feature.

First, we need to convert our list of arrays into a 2D NumPy array of 25 x 200.

In [12]:
# convert list of arrays into 2d array
data = array(samples)
print(data.shape)

(25, 200)


Next, we can use the reshape() function to add one additional dimension for our single feature.

In [13]:
# reshape into [samples, timesteps, features]
# expect [25, 200, 1]
data = data.reshape((len(samples), length, 1))
print(data.shape)

(25, 200, 1)


## References:

1. How to Reshape Input Data for Long Short-Term Memory Networks in Keras ([Link](https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/))
1. Techniques to Handle Very Long Sequences with LSTMs ([Link](https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/))

## Further Reading:

1. How to Convert a Time Series to a Supervised Learning Problem in Python ([Link](https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/))
2. Time Series Forecasting as Supervised Learning ([Link](https://machinelearningmastery.com/time-series-forecasting-supervised-learning/))
3. How to Load and Explore Time Series Data in Python ([Link](https://machinelearningmastery.com/load-explore-time-series-data-python/))
4. How To Load Machine Learning Data in Python ([Link](https://machinelearningmastery.com/load-machine-learning-data-python/))
5. How to Develop LSTM Models for Time Series Forecasting ([Link](https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/))
