Adapted from chapter 9 of https://d2l.ai/

In [25]:
!rm ./ftse*
!wget https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/ftse100.png 

We often need statistical tools to analyse sequence data.

For example, look at this graph:

![FTSE 100 index over about 30 years.](https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/ftse100.png )



Let us denote the prices by $x_t$, i.e., at *time step* $t \in \mathbb{Z}^+$ we observe price $x_t$.
Note that for sequences in this text,
$t$ will typically be discrete and vary over the integers or a subset of them.
Suppose that
a trader who wants to do well in the stock market on day $t$ predicts $x_t$ via

$$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1).$$



To do this, we could train a linear regression model. However, the number of inputs, $x_{t-1}, \ldots, x_1$, grows with the number of values we have already seen, which makes the problem intractable as stated. Much of the work on autoregressive models revolves around estimating this conditional efficiently. There are two common strategies:

1. Do not use an arbitrarily long input. Instead, use only the previous $\tau$ values, $x_{t-1}, \ldots, x_{t-\tau}$. The number of inputs is then fixed, which makes regression possible. Such models literally regress on themselves, hence the name *autoregressive models*.

2. Keep a summary $h_t$ of the past observations, and update $h_t$ in addition to making the prediction $\hat{x}_t$.
This leads to models that estimate $x_t$ with $\hat{x}_t = P(x_t \mid h_{t})$ and updates of the form $h_t = g(h_{t-1}, x_{t-1})$. Since $h_t$ is never observed, these models are also called *latent autoregressive models*.

![A latent autoregressive model.](https://raw.githubusercontent.com/d2l-ai/d2l-en/a49cf28aaaf533595707bca63ec1e567f65c16ca/img/sequence-model.svg)
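To make the update $h_t = g(h_{t-1}, x_{t-1})$ concrete, here is a minimal sketch. The choice of $g$ as an exponential moving average, the name `latent_predict`, and the parameters `alpha` and `h0` are assumptions made for illustration; a real latent autoregressive model would learn $g$ from data.

```python
def latent_predict(xs, alpha=0.5, h0=0.0):
    """Predict each x_t from a running summary h_t of the earlier values.

    Hypothetical choice of g: an exponential moving average,
    h_{t+1} = alpha * h_t + (1 - alpha) * x_t.
    """
    h = h0
    preds = []
    for x in xs:
        preds.append(h)                    # x_hat_t depends only on h_t
        h = alpha * h + (1 - alpha) * x    # fold x_t into the summary
    return preds


preds = latent_predict([1.0, 1.0, 1.0, 1.0])
print(preds)
```

The model never stores the full history. Everything it knows about the past is compressed into the single number `h`.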

Both cases raise the obvious question of how to generate training data. One typically uses historical observations to predict the next observation given the ones up to right now. Obviously we do not expect time to stand still. However, a common assumption is that while the specific values of $x_t$ might change, at least the dynamics of the sequence itself will not. This is reasonable, since novel dynamics are just that, novel and thus not predictable using data that we have so far. Statisticians call dynamics that do not change *stationary*.
Regardless of what we do, we will thus get an estimate of the entire sequence via

$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}, \ldots, x_1).$$
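This factorization can be checked numerically on a toy example. The joint table below is made up for illustration (three binary variables with arbitrary probabilities that sum to 1), and the helper names are hypothetical. The conditionals are derived from the joint by marginalization.

```python
import itertools

# Hypothetical joint distribution over (x1, x2, x3), each in {0, 1}.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {
    xs: w / total
    for xs, w in zip(itertools.product([0, 1], repeat=3), weights)
}


def marginal(prefix):
    """P(x_1..x_k = prefix), summing the joint over the remaining variables."""
    k = len(prefix)
    return sum(p for xs, p in joint.items() if xs[:k] == prefix)


def conditional(prefix):
    """P(x_k = prefix[-1] | x_1..x_{k-1} = prefix[:-1])."""
    return marginal(prefix) / marginal(prefix[:-1])


def chain_rule(xs):
    """Product over t of P(x_t | x_{t-1}, ..., x_1)."""
    prob = 1.0
    for t in range(1, len(xs) + 1):
        prob *= conditional(xs[:t])
    return prob


# Largest difference between the joint table and the product of conditionals.
max_gap = max(abs(chain_rule(xs) - p) for xs, p in joint.items())
print(max_gap)
```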

Note that the above considerations still hold if we deal with discrete objects, such as words, rather than continuous numbers. The only difference is that in such a situation we need to use a classifier rather than a regression model to estimate $P(x_t \mid  x_{t-1}, \ldots, x_1)$.

### Generating training data

We will use a cosine function with some added Gaussian noise for 1000 time steps (`T = 1000`).

In [26]:
import torch

In [27]:
T = 1000
time = torch.arange(1,T+1)
len(time)

In [28]:
x = torch.cos(time * 0.01) + torch.normal(0,0.2,(T,))
x.shape

In [29]:
import matplotlib.pyplot as plt

plt.plot(time, x, label='train data')
plt.legend()
plt.show()

This does look like a noisy cosine curve!

### Turning train data into features and labels

To do so, we use each window of `tau = 4` consecutive values as the features and the value that immediately follows the window as the label.

In [30]:
tau  = 4

features = torch.zeros(T-tau, tau)
labels = torch.zeros(T-tau)
for i in range(0, T-tau):
    features[i,: ] = x[i:i+tau] 
    labels[i] = x[i+tau]

features[:5], labels[:5]
    

In [31]:
labels = labels.reshape((-1,1))

### Creating train and test data

We will use `torch.utils.data.TensorDataset` to load the features and labels directly.

In [32]:
batch_size = 4
n_train = 600

train_iter = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(*(features[:n_train], labels[:n_train])), batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(*(features[n_train:], labels[n_train:])), batch_size=batch_size, shuffle=True)

We do this because we need to turn the sequence into features and labels that our model can train on. Based on the embedding dimension $\tau$, we map the data into pairs $y_t = x_t$ and $\mathbf{x}_t = [x_{t-\tau}, \ldots, x_{t-1}]$. The astute reader might have noticed that this gives us $\tau$ fewer data examples, since we do not have sufficient history for the first $\tau$ of them. A simple fix, in particular if the sequence is long, is to discard those few terms. Alternatively, we could pad the sequence with zeros. Here we only use the first 600 feature-label pairs for training.
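The sliding-window construction described here can also be written without an explicit loop. This is a sketch using `torch.Tensor.unfold`; the name `make_windows` is introduced here for illustration, and a noise-free cosine is used so that the comparison with the loop version is deterministic.

```python
import torch


def make_windows(x, tau):
    """Return (features, labels) with features[i] = x[i:i+tau], labels[i] = x[i+tau]."""
    # Each row of `windows` holds tau + 1 consecutive values of x.
    windows = x.unfold(0, tau + 1, 1)            # shape: (len(x) - tau, tau + 1)
    return windows[:, :tau], windows[:, tau:]    # labels keep shape (N, 1)


T, tau = 1000, 4
x = torch.cos(torch.arange(1, T + 1) * 0.01)
features, labels = make_windows(x, tau)

# Reference: the loop-based construction.
loop_features = torch.zeros(T - tau, tau)
loop_labels = torch.zeros(T - tau)
for i in range(T - tau):
    loop_features[i] = x[i:i + tau]
    loop_labels[i] = x[i + tau]

print(features.shape, labels.shape)
print(torch.equal(features, loop_features))
```

Note that `unfold` returns views that share memory with `x`, so call `.clone()` on the results if `x` will be modified afterwards.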

### We are ready to train our model.

In order to train our model, we need to:

1. define a neural network
2. define the loss and the optimizer
3. define the learning rate
4. initialise all parameters
5. implement the training loop

For initialisation we use Xavier initialisation, and we create a simple sequential model.

In [33]:
from torch import nn

def init_net(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)

def get_net():
    net = nn.Sequential(nn.Linear(4,32), nn.ReLU(), nn.Linear(32,1))
    net.apply(init_net)
    return net

    
# 'cuda' is the PyTorch device type for NVIDIA GPUs; 'gpu' is not a valid device string
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [34]:
net = get_net()
pred_x = net(features)
pred_x.shape

Here is how the prediction looks before training:

In [35]:
plt.plot(time, x, label="actual")
plt.plot(time[4:], pred_x.detach().numpy(), label="predicted")
plt.legend()
plt.title("before training")
plt.grid()
plt.show()

This is the standard PyTorch training loop; see https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

In [36]:
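def train(net, train_iter, test_iter, num_epochs, device, lr=0.01):
    # test_iter is accepted but not used inside this function
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss = nn.MSELoss()  # squared error, averaged over each minibatch

    net = net.to(device)

    for epoch in range(num_epochs):
        overall_loss = 0.0
        total_number = 0
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            total_number += X.shape[0]
            # l is a per-batch mean, so weight it by the batch size;
            # .item() extracts a float so the autograd graph is not kept alive
            overall_loss += l.item() * X.shape[0]

        epoch_loss = overall_loss / total_number
        print(f"for epoch {epoch} the loss: {epoch_loss}")

    # the later cells feed CPU tensors to the model, so move it back
    net.to('cpu')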

In [37]:
net = get_net()
train(net, train_iter, test_iter, 10, device, 0.01)
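The `train` function receives `test_iter` but does not consume it. Below is a minimal sketch of how a held-out loss could be measured. `evaluate_mse` is a hypothetical helper, not part of the original notebook, and the tiny model and data in the usage lines exist only to make the block runnable on its own.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate_mse(net, data_iter, device):
    """Mean squared error per example over a DataLoader, without gradient tracking."""
    # Summing per batch keeps the average correct when the last batch is smaller.
    loss = nn.MSELoss(reduction="sum")
    net.eval()
    total_loss, total_count = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            total_loss += loss(net(X), y).item()
            total_count += y.numel()
    return total_loss / total_count


# Stand-in data (an assumption): a zero-initialised layer predicts 0, targets are all 2.
demo_net = nn.Linear(4, 1)
nn.init.zeros_(demo_net.weight)
nn.init.zeros_(demo_net.bias)
demo_iter = DataLoader(
    TensorDataset(torch.ones(10, 4), torch.full((10, 1), 2.0)), batch_size=4
)
demo_mse = evaluate_mse(demo_net, demo_iter, torch.device("cpu"))
print(demo_mse)
```

With the objects from this notebook the call would be `evaluate_mse(net, test_iter, device)`, assuming the model has been moved to `device`.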

## Prediction

Let's put the neural network to some use.

### One step ahead prediction

Since the training loss is small, we would expect our model to work well. Let us see what this means in practice. The first thing to check is how well the model is able to predict what happens just in the next time step, namely the one-step-ahead prediction.

In [38]:
pred_x = net(features)
pred_x.shape

In [39]:
plt.plot(time, x, label="actual")
plt.plot(time[4:], pred_x.detach().numpy(), label="predicted")
plt.legend()
plt.title("after training")
plt.grid()
plt.show()

The one-step-ahead predictions look nice, just as we expected. Even beyond 604 (n_train + tau) observations the predictions still look trustworthy. However, there is just one little problem to this: if we observe sequence data only until time step 604, we cannot hope to receive the inputs for all the future one-step-ahead predictions. Instead, we need to work our way forward one step at a time:

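$$\begin{aligned}
\hat{x}_{605} &= f(x_{601}, x_{602}, x_{603}, x_{604}), \\
\hat{x}_{606} &= f(x_{602}, x_{603}, x_{604}, \hat{x}_{605}), \\
\hat{x}_{607} &= f(x_{603}, x_{604}, \hat{x}_{605}, \hat{x}_{606}), \\
\hat{x}_{608} &= f(x_{604}, \hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}), \\
\hat{x}_{609} &= f(\hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}, \hat{x}_{608}), \\
&\ldots
\end{aligned}$$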

### k step ahead prediction

Generally, for an observed sequence up to $x_t$, its predicted output $\hat{x}_{t+k}$ at time step $t+k$ is called the $k$-step-ahead prediction. Since we have observed up to $x_{604}$, its $k$-step-ahead prediction is $\hat{x}_{604+k}$. In other words, we will have to use our own predictions to make multistep-ahead predictions.

In [43]:
multistep_preds = torch.zeros(T)
multistep_preds[:n_train+ tau] = x[:n_train+tau]

for i in range(n_train + tau, T):
    # predict entry i from the previous tau entries; after the first tau
    # steps these inputs are all the model's own earlier predictions
    multistep_preds[i] = net(multistep_preds[i - tau:i].reshape((1, -1)))

plt.plot(time, x, label="actual")
plt.plot(time[4:], pred_x.detach().numpy(), label = "one step ahead")
plt.plot(time, multistep_preds.detach().numpy(), label = "multi step ahead")
plt.legend()
plt.grid()
plt.show()

The green line is our multistep prediction. So what just happened? Why did it fail beyond the training data?

### Explanation for the green line

As the above example shows, this is a spectacular failure. The predictions decay to a constant pretty quickly after a few prediction steps. Why did the algorithm work so poorly? This is ultimately due to the fact that the errors build up. Let us say that after step 1 we have some error $\epsilon_1 = \bar\epsilon$. Now the input for step 2 is perturbed by $\epsilon_1$, hence we suffer some error in the order of $\epsilon_2 = \bar\epsilon + c \epsilon_1$ for some constant $c$, and so on. The error can diverge rather rapidly from the true observations. This is a common phenomenon. For instance, weather forecasts for the next 24 hours tend to be pretty accurate but beyond that the accuracy declines rapidly. We will discuss methods for improving this throughout this chapter and beyond.
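The error recursion in the paragraph above can be iterated numerically. This is a worked illustration only: the values chosen for $\bar\epsilon$ and $c$ are arbitrary, the function name is hypothetical, and the real error of a neural network does not follow such a clean linear rule.

```python
def accumulated_error(eps_bar, c, steps):
    """Iterate eps_1 = eps_bar, eps_k = eps_bar + c * eps_{k-1}."""
    errors = [eps_bar]
    for _ in range(steps - 1):
        errors.append(eps_bar + c * errors[-1])
    return errors


for c in (0.5, 1.0, 1.1):
    errs = accumulated_error(eps_bar=0.01, c=c, steps=64)
    print(f"c={c}: eps_1={errs[0]:.3f}, eps_16={errs[15]:.3f}, eps_64={errs[63]:.3f}")
```

For $c < 1$ the error saturates near $\bar\epsilon / (1 - c)$, for $c = 1$ it grows linearly as $k \bar\epsilon$, and for $c > 1$ it grows geometrically.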

Let us take a closer look at the difficulties in $k$-step-ahead predictions by computing predictions on the entire sequence for $k = 1, 4, 16, 64$.


In [44]:
max_steps = 64

ahead_features = torch.zeros((T - tau - max_steps + 1, tau + max_steps))

for i in range(tau):
    ahead_features[:,i] = x[i:i + T - tau - max_steps + 1]

for i in range(tau, tau + max_steps):
    ahead_features[:, i] = net(ahead_features[:, i - tau:i]).reshape(-1)

steps = (1, 4, 16, 64)

for i in steps:
    plt.plot(time[tau + i -1: T - max_steps + i], ahead_features[:, (tau + i - 1)].detach().numpy(), label=f"{i} pred steps" )
plt.legend()
plt.grid()
plt.show()
    

This clearly illustrates how the quality of the prediction changes as we try to predict further into the future. While the 4-step-ahead predictions still look good, anything beyond that is almost useless.
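The visual impression can be turned into numbers by computing the mean squared error for each horizon $k$. The sketch below mirrors the column layout of `ahead_features`, but it is self-contained: `horizon_mse` is a hypothetical helper, and the stand-in predictor (repeat the last observed value) is an assumption used in place of the trained network.

```python
import torch


def horizon_mse(predict, x, tau, max_steps):
    """MSE of k-step-ahead predictions for k = 1..max_steps.

    `predict` maps an (N, tau) tensor of past values to N next-value predictions.
    """
    n = len(x) - tau - max_steps + 1
    cols = torch.zeros(n, tau + max_steps)
    for i in range(tau):                         # observed values seed the first tau columns
        cols[:, i] = x[i:i + n]
    with torch.no_grad():
        for i in range(tau, tau + max_steps):    # later columns are fed-back predictions
            cols[:, i] = predict(cols[:, i - tau:i]).reshape(-1)
    mses = []
    for k in range(1, max_steps + 1):
        target = x[tau + k - 1:tau + k - 1 + n]  # true values that column k tries to predict
        mses.append(torch.mean((cols[:, tau + k - 1] - target) ** 2).item())
    return mses


# Stand-in predictor (an assumption for this sketch): repeat the last observed value.
T, tau = 1000, 4
x = torch.cos(torch.arange(1, T + 1) * 0.01)
mses = horizon_mse(lambda window: window[:, -1], x, tau, max_steps=64)
for k in (1, 4, 16, 64):
    print(f"{k}-step-ahead MSE: {mses[k - 1]:.6f}")
```

To score the trained model instead, `predict` could be replaced by `net`, assuming the model and `x` live on the same device.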

### Summary

There is quite a difference in difficulty between interpolation and extrapolation. Consequently, if you have a sequence, always respect the temporal order of the data when training, i.e., never train on future data.

Sequence models require specialized statistical tools for estimation. Two popular choices are autoregressive models and latent-variable autoregressive models.

For causal models (e.g., time going forward), estimating the forward direction is typically a lot easier than the reverse direction.

For an observed sequence up to time step $t$, its predicted output at time step $t+k$ is the $k$-step-ahead prediction. As we predict further in time by increasing $k$, the errors accumulate and the quality of the prediction degrades, often dramatically.

### Things to ponder

1. Improve the model in the experiment of this section.
    1. Incorporate more than the past 4 observations? How many do you really need? (Tried, in a limited way. Close to one seems better.)
    2. How many past observations would you need if there was no noise? Hint: you can write $\sin$ and $\cos$ as a differential equation. (I tried with sin and cos. It's the same story: `x = torch.cos(time * 0.01)`.)
    3. Can you incorporate older observations while keeping the total number of features constant? Does this improve accuracy? Why? (I don't understand the question.)
    4. Change the neural network architecture and evaluate the performance. (Tried. Same result.)
2. An investor wants to find a good security to buy. He looks at past returns to decide which one is likely to do well. What could possibly go wrong with this strategy? (The only constant in life is change.)
3. Does causality also apply to text? To which extent? (Words do have some causality, I believe: you can expect "h" to follow "W" at the start of a sentence with a relatively high degree of confidence. But it is very topical.)
4. Give an example for when a latent autoregressive model might be needed to capture the dynamic of the data. (In the stock market!)

### Multistep predictions for all horizons from 1 to 63

In [46]:
# plot every prediction horizon from 1 to 63 (range(1, 64) excludes 64)

steps = range(1, 64)

for i in steps:
    plt.plot(time[tau + i -1: T - max_steps + i], ahead_features[:, (tau + i - 1)].detach().numpy(), label=f"{i} pred steps" )
#plt.legend()
plt.grid()
plt.show()


### Let's eliminate the noise altogether and see whether it makes a difference to multistep prediction

In [56]:
# if x had no noise

x = torch.cos(time * 0.01)
x.shape

In [57]:
plt.plot(time, x, label='train data')
plt.legend()
plt.show()


In [58]:
tau = 4

features = torch.zeros(T-tau, tau)
labels = torch.zeros(T-tau)
for i in range(0, T-tau):
    features[i,: ] = x[i:i+tau] 
    labels[i] = x[i+tau]

features[:5], labels[:5]

In [59]:
labels = labels.reshape((-1,1))

In [60]:
batch_size = 4
n_train = 600

train_iter = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(*(features[:n_train], labels[:n_train])), batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(*(features[n_train:], labels[n_train:])), batch_size=batch_size, shuffle=True)

In [61]:
net = get_net()
train(net, train_iter, test_iter, 10, device, 0.01)

In [62]:
pred_x = net(features)

In [63]:
plt.plot(time, x, label="actual")
plt.plot(time[4:], pred_x.detach().numpy(), label="predicted")
plt.legend()
plt.title("after training")
plt.grid()
plt.show()

In [64]:
multistep_preds = torch.zeros(T)
multistep_preds[:n_train+ tau] = x[:n_train+tau]

for i in range(n_train + tau, T):
    # predict entry i from the previous tau entries; after the first tau
    # steps these inputs are all the model's own earlier predictions
    multistep_preds[i] = net(multistep_preds[i - tau:i].reshape((1, -1)))

plt.plot(time, x, label="actual")
plt.plot(time[4:], pred_x.detach().numpy(), label = "one step ahead")
plt.plot(time, multistep_preds.detach().numpy(), label = "multi step ahead")
plt.legend()
plt.grid()
plt.show()

Conclusion: even with no noise, it's the same story.
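One way to read this result is that the failure is not caused by a lack of information in the signal. For $x_t = \cos(\omega t)$, the identity $\cos((t+1)\omega) + \cos((t-1)\omega) = 2\cos(\omega)\cos(t\omega)$ gives the exact recurrence $x_{t+1} = 2\cos(\omega)\, x_t - x_{t-1}$, so two past observations suffice in principle. The sketch below rolls this recurrence forward from the last two observed points. It uses the known $\omega = 0.01$ and double precision, which are assumptions that the learned network does not have access to.

```python
import math

omega = 0.01
T, n_known = 1000, 604    # same sequence length and cutoff (n_train + tau) as above
x = [math.cos(omega * t) for t in range(1, T + 1)]

preds = x[:n_known]       # copy of the observed prefix
coef = 2 * math.cos(omega)
for _ in range(n_known, T):
    # x_{t+1} = 2 cos(omega) x_t - x_{t-1}, fed only with earlier outputs
    preds.append(coef * preds[-1] - preds[-2])

max_err = max(abs(p - a) for p, a in zip(preds, x))
print(f"max abs error over {T - n_known} extrapolated steps: {max_err:.2e}")
```

This suggests that the drift of the multistep predictions comes from small errors in the learned map being fed back into its own inputs, as described in the explanation above, rather than from the window being too short.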