# Advanced layer types: Recurrence

## Questions

- Why do we need layers specifically designed for sequential data?
- What are Recurrent Neural Networks (RNNs) and LSTMs?
- How does an LSTM "remember" important information over time?
- What are alternatives like attention?

## Objectives

- Understand the structure and motivation behind RNN and LSTM layers
- Relate LSTM concepts to earlier architectures (dense, CNN)
- Explore a simple forecasting example using LSTM


## Revisiting sunshine hours

Yesterday, we predicted today's sunshine hours (in Basel) using weather variables from just yesterday — a one-to-one mapping. Each input was a single day's data. Let's rebuild that model quickly to remind ourselves of the test set performance. 

In [None]:
import pandas as pd
data = pd.read_csv("https://zenodo.org/record/5071376/files/weather_prediction_dataset_light.csv?download=1")

In [None]:
import pandas as pd

filename_data = "data/weather_prediction_dataset_light.csv"
data = pd.read_csv(filename_data)
data.head()

### Select Basel features only
To speed up our computation and simplify this model, we'll just focus on predictors from Basel.

In [None]:
# Use only Basel-specific predictors
basel_columns = [col for col in data.columns if col.startswith('BASEL_')]

nr_rows = 9*365 # using 9 years to match our example on day 2

# Drop DATE and MONTH, keep Basel predictors
X_data = data.loc[:nr_rows, basel_columns]
y_data = data.loc[1:(nr_rows + 1), "BASEL_sunshine"]
print(X_data.shape)
X_data.head()

In [None]:
y_data.head() # shows the next day's sunshine

### Temporal train/test split
This time, we'll apply a more appropriate split for a temporal dataset. We'll turn off shuffling so that the test set consists only of later time points than those in the training set. This setup lets us evaluate how well the model predicts future values using only past information, without accidentally training on data from the future. This matters because the future often doesn't resemble the past, and we want to see how well our model handles that.

In [None]:
from sklearn.model_selection import train_test_split
test_set_size = 0.2
X_train, X_holdout, y_train, y_holdout = train_test_split(X_data, y_data, test_size=test_set_size, random_state=0, shuffle=False)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=0, shuffle=False)
print(X_train.shape)
print(X_test.shape)
print(X_val.shape)


## Compute baseline performance: tomorrow's sunshine is equal to today's
Since we're using a new train/test split, it's a good idea to re-compute the baseline model.

In [None]:
y_baseline_prediction = X_test['BASEL_sunshine']
from sklearn.metrics import root_mean_squared_error
rmse_baseline = root_mean_squared_error(y_test, y_baseline_prediction)
print('Baseline:', rmse_baseline)


## Recompute dense model performance

### Set random seed
Set seeds to control for random weight initialization.

In [None]:
from tensorflow import keras
from numpy.random import seed
seed(42)
keras.utils.set_random_seed(42)

In [None]:
from tensorflow import keras

def create_dense_nn(input_shape):
    # Input layer
    inputs = keras.Input(shape=input_shape, name='input')

    # Dense layers
    layers_dense = keras.layers.Dense(100, 'relu')(inputs)
    layers_dense = keras.layers.Dense(50, 'relu')(layers_dense)

    # Output layer
    outputs = keras.layers.Dense(1)(layers_dense)

    return keras.Model(inputs=inputs, outputs=outputs, name="dense_weather_prediction_model")



In [None]:
model_dense = create_dense_nn(input_shape=(X_train.shape[1],))
model_dense.summary()

With just 9 input features, this fully connected model has 6,101 parameters.
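
As a quick check on that figure, the count can be reproduced by hand: each dense layer has one weight per input-output pair plus one bias per output unit. The helper name `dense_param_count` is hypothetical and used only for this sketch; it is not part of Keras.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for n_units in layer_sizes:
        total += n_inputs * n_units + n_units  # weights + biases for this layer
        n_inputs = n_units  # this layer's output feeds the next layer
    return total

# 9 Basel features -> Dense(100) -> Dense(50) -> Dense(1)
n_params = dense_param_count(9, [100, 50, 1])
print(n_params)  # 6101
```

The three layers contribute 1,000, 5,050, and 51 parameters respectively.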

**Challenge**: What is the data:params ratio for this model?

**Challenge**: Is this an "underparameterized" or "overparameterized" model? How do you think it will perform?

In [None]:
def compile_model(model):
    model.compile(optimizer='adam',
                  loss='mse',
                  metrics=[keras.metrics.RootMeanSquaredError()])


In [None]:
compile_model(model_dense)

Fit with early stopping, as we did previously

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

earlystopper = EarlyStopping(
    monitor='val_loss',
    patience=10
    )

history_dense = model_dense.fit(X_train, y_train,
                    batch_size = 32,
                    epochs = 200,
                    validation_data=(X_val, y_val),
                    callbacks=[earlystopper])

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def plot_history(history, metrics):
    """
    Plot the training history

    Args:
        history (keras History object that is returned by model.fit())
        metrics (str, list): Metric or a list of metrics to plot
    """
    history_df = pd.DataFrame.from_dict(history.history)
    sns.lineplot(data=history_df[metrics])
    plt.xlabel("epochs")
    plt.ylabel("metric")


In [None]:
plot_history(history_dense, ['root_mean_squared_error', 'val_root_mean_squared_error'])

In [None]:
model_dense.evaluate(X_test, y_test)

If you recall, our baseline model had an RMSE of 4.27. Our fully connected neural network does somewhat better than the baseline (3.59). While it's encouraging to see an improvement over the baseline, there is much more we can do to improve this result.

Recall that here, we are only looking at the current day's weather in Basel to determine sunshine hours in Basel the next day. But what if sunshine patterns depend on the past week, past month, or past year?

### When is a single lag not enough?

In many real-world tasks, yesterday's data alone isn't sufficient — patterns unfold over time:

- Rainy streaks often last several days
- Cold fronts move gradually, not all at once
- In other domains: heartbeats, language, gestures, and biological sequences like proteins rely on order and context across multiple steps

#### Include multiple lags manually?
A natural next step is to **include multiple lags manually** as input features. 

For example: add sunshine hours from the last 30 days as separate columns. 



In [None]:
# Create lagged features for all Basel-specific predictors
data_lagged = data.copy()
basel_columns = [col for col in data.columns if col.startswith('BASEL_')]

# Add lags for each Basel predictor (store in dictionary first)
window_size = 30
lagged_features = {}

for col in basel_columns:
    for lag in range(1, window_size + 1):
        lagged_features[f'{col}_lag{lag}'] = data[col].shift(lag)

# Concatenate all lagged features at once
data_lagged = pd.concat([data_lagged, pd.DataFrame(lagged_features)], axis=1)

# Drop rows with NaNs caused by lagging
data_lagged = data_lagged.dropna().reset_index(drop=True)

# Define X and y using only lagged Basel features
all_lagged_cols = [col for col in data_lagged.columns if col.startswith('BASEL_') and 'lag' in col]
X_data_lagged = data_lagged.loc[:nr_rows, all_lagged_cols]
y_data_lagged = data_lagged.loc[:nr_rows, 'BASEL_sunshine']  # target is same as before

# Train/validation/test split (preserving time order)
from sklearn.model_selection import train_test_split

X_train_lagged, X_holdout_lagged, y_train_lagged, y_holdout_lagged = train_test_split(
    X_data_lagged, y_data_lagged, test_size=test_set_size, random_state=0, shuffle=False)

X_val_lagged, X_test_lagged, y_val_lagged, y_test_lagged = train_test_split(
    X_holdout_lagged, y_holdout_lagged, test_size=0.5, random_state=0, shuffle=False)


In [None]:
X_train_lagged.head()

In [None]:
print(X_train_lagged.shape)
print(X_test_lagged.shape)
print(X_val_lagged.shape)


**Note**: Number of predictors grows drastically! How do you think this will impact model performance? 

Let's try it and see.

In [None]:
# Create new dense model with additional input features
model_dense_lagged = create_dense_nn(input_shape=(X_train_lagged.shape[1],))

# view model summary
model_dense_lagged.summary()


**Note**: With 270 input features, the total number of weights increases dramatically as well: 32,201 (30 lags) versus 6,101 (1 lag).

Compile and train the model next.

In [None]:
# compile model 
compile_model(model_dense_lagged)
# train model and store results
history_dense_lagged = model_dense_lagged.fit(X_train_lagged, y_train_lagged,
                    batch_size = 32,
                    epochs = 200,
                    validation_data=(X_val_lagged, y_val_lagged),
                    callbacks=[earlystopper])


In [None]:
plot_history(history_dense_lagged, ['root_mean_squared_error', 'val_root_mean_squared_error'])

In [None]:
model_dense_lagged.evaluate(X_test_lagged, y_test_lagged)


### Discuss: What might be happening here?
We see that the explosion in input features has made the problem more challenging to model. We get a test set error of 3.96 compared to 3.59 in our previous 1-lag dense model.

### Why manual lagging isn't enough

Previously, we gave our models access to past information by manually adding lagged versions of each predictor. While this works in principle, it introduces several problems:

- **No awareness of time**: The model treats each lag (e.g., x_{t-1}, x_{t-30}) as a separate feature, with no sense that lag 1 is more recent than lag 30.
- **Feature explosion**: Including many lags across many predictors causes the input size to grow rapidly — in our case, 30 lags × 9 features = 270 inputs.
- **Manual choices**: You have to decide how many lags to include and which features to use. These decisions are often arbitrary and may not generalize.
- **Separate weights for each lag**: A dense model learns a different set of weights for each timestep, which limits its ability to recognize repeating or shifting patterns.

This approach flattens time into a wide input vector, forcing the model to memorize temporal structure instead of modeling it directly.

## Introducing recurrence

Recurrent neural networks (RNNs) address these limitations by processing one timestep at a time while maintaining a **hidden state** that evolves as the sequence progresses.

At each timestep `t`, the RNN:
- Receives the input `x_t`
- Uses the hidden state from the previous timestep, `h_{t-1}`
- Computes a new hidden state `h_t` using a shared transformation

This process is called **recurrence** — the same computation (with shared parameters) is applied at every timestep. The hidden state acts as a running summary of what the model has seen so far.

This is illustrated below:  
![Unrolled RNN](rnn_unfolded.png)

- Left: A single RNN cell, showing how it uses `x_t` and `h_{t-1}` to produce `o`
- Right: The same cell **unrolled through time**, with the hidden state passed from one step to the next

By the end of the sequence, the final hidden state reflects the model's accumulated understanding of everything it's seen.
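
The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: the input is random stand-in data, the weights are random rather than learned, and `tanh` is assumed as the activation. `U`, `V`, and `b` denote the input weights, recurrent weights, and biases.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, n_features, n_units = 30, 9, 16

# One toy input sequence: 30 timesteps, 9 features each (random stand-in data)
x = rng.normal(size=(timesteps, n_features))

# Shared parameters, reused at every timestep (random here, learned in practice)
U = rng.normal(scale=0.1, size=(n_features, n_units))  # input -> hidden
V = rng.normal(scale=0.1, size=(n_units, n_units))     # hidden -> hidden
b = np.zeros(n_units)

h = np.zeros(n_units)  # initial hidden state
for x_t in x:
    # Same transformation at every step: combine the current input
    # with the previous hidden state to get the new hidden state
    h = np.tanh(x_t @ U + h @ V + b)

print(h.shape)  # (16,)
```

Only `h` changes from step to step. `U`, `V`, and `b` stay fixed throughout the loop, which is what "shared parameters" means here.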

### Why RNNs don't need lagged features

Rather than handing the model 30 manually lagged inputs, we give it the full sequence and let it learn what to remember over time. The hidden state captures temporal context dynamically — often retaining short-term memory well, but struggling with long-range dependencies due to how gradients behave during training (more on that later).

RNNs resemble a *first-order Markov process*: each hidden state `h_t` depends only on `h_{t-1}` and `x_t`. But unlike fixed-state Markov models, RNNs learn a continuous, task-specific representation of state.

Key benefits:
- **Fewer parameters**: the same weights are reused across time
- **Built-in temporal structure**: the model processes sequences step by step
- **Better generalization**: no need to manually engineer lag features

### How can one set of weights model an entire sequence?

A common point of confusion: if the RNN uses the same weights at each timestep, how can it handle complex, time-dependent patterns?

The answer is that the model doesn't try to learn the entire sequence at once. Instead, it learns **how to update a memory (the hidden state)** at each step based on the current input. The recurrence isn't solving the whole task at every point in time — it's gradually evolving a compact summary of the sequence.

Each `h_t` is shaped by both `x_t` and everything that came before it, as encoded in `h_{t-1}`. The final hidden state reflects this cumulative process.

## Building a recurrent model

Recurrent layers expect 3D input: one sequence per sample. Instead of flattening the last 30 days into a single row, we reshape our data into sequences of timesteps, where each timestep contains a feature vector.

At each step, the RNN:
- Receives the current input `x_t`
- Uses the previous hidden state `h_{t-1}`
- Computes a new hidden state `h_t`

This `h_t` can be passed forward to the next timestep or used to make a prediction — typically just once at the final step in many-to-one setups.

### Preparing the inputs and outputs

To train a recurrent model, we’ll transform our dataset so that each training sample is a short sequence of consecutive days (e.g., 30 timesteps) with the same set of predictors.

Recurrent layers expect input tensors shaped `(batch_size, timesteps, features)`.


In [None]:
X_data = data.loc[:nr_rows].drop(columns=["DATE", "MONTH"]) # nr_rows gives us 9 years of data again
y_data = data.loc[window_size:(nr_rows + window_size)]["BASEL_sunshine"] # predict starting with first day after window_size
X_data.shape

Keep only BASEL predictors, as we did before.

In [None]:
# Keep only BASEL predictors 
basel_columns = [col for col in data.columns if col.startswith('BASEL_')]

# Drop NaNs and reset index to keep it tidy
data_basel = data[basel_columns].dropna().reset_index(drop=True)
data_basel.shape

#### How are the input sequences and targets constructed?

Each input sample is a sequence of 30 consecutive days of predictor values.  
We slide this 30-day window over the dataset to create many such sequences.


In [None]:
import numpy as np

# Build sequences
n_seq = len(data_basel) - window_size

X_seq = np.stack([
    data_basel[basel_columns].iloc[i:i+window_size].values
    for i in range(n_seq)
])
print("X_seq shape:", X_seq.shape)  # (samples, 30, features)


This creates a 3D array of shape (n_seq, window_size, features). Each slice `X_seq[i]` contains predictors for days i through i+window_size-1 (the end of the slice is exclusive).


For each input sequence, we want the model to predict the sunshine on the day after the last timestep (i.e., day i+window_size). We align the target values accordingly:

In [None]:
y_seq = data_basel['BASEL_sunshine'].iloc[window_size:].values # index starts at 0, so this gives us the 31st day as first data point in y_seq

print("y_seq shape:", y_seq.shape)
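
Off-by-one mistakes are easy to make when aligning windows and targets, so it can help to repeat the same construction on a tiny series where the answer is obvious. The names `toy_series`, `toy_window`, `X_toy`, and `y_toy` are hypothetical and exist only for this check.

```python
import numpy as np

toy_series = np.arange(10)  # stand-in for a daily series: day 0, day 1, ..., day 9
toy_window = 3

# Same slicing logic as for X_seq and y_seq, on a 1D series
n_toy = len(toy_series) - toy_window
X_toy = np.stack([toy_series[i:i + toy_window] for i in range(n_toy)])
y_toy = toy_series[toy_window:]

print(X_toy[0], "->", y_toy[0])    # [0 1 2] -> 3
print(X_toy[-1], "->", y_toy[-1])  # [6 7 8] -> 9
```

Each target is the value immediately after the last day in its window, which is the alignment we want for next-day prediction.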

Temporal train/test split (without shuffling)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_holdout, y_train, y_holdout = train_test_split(X_seq, y_seq, test_size=test_set_size, random_state=0, shuffle=False)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=0, shuffle=False)

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

### Modeling sequences with a recurrent neural network

Now that our input data is structured as sequences (30 timesteps × 9 features), we can use a recurrent neural network to learn from patterns over time.

Instead of manually adding lagged features, we feed the model the full sequence and let it learn which parts of the past are useful. This keeps the input compact while enabling the model to capture temporal dependencies internally.

Let's define our first `SimpleRNN` model.

In [None]:
def create_simple_rnn(input_shape, rnn_units=16):
    # Input layer for sequences of shape (timesteps, features)
    inputs = keras.Input(shape=input_shape, name='input_sequence')

    # Simple RNN layer with ReLU activation
    x = keras.layers.SimpleRNN(rnn_units, activation='relu')(inputs)

    # Output layer for regression
    outputs = keras.layers.Dense(1)(x)

    return keras.Model(inputs=inputs, outputs=outputs, name="simple_rnn_weather_model")


In [None]:
model_rnn = create_simple_rnn(input_shape=X_train.shape[1:], rnn_units=16)
model_rnn.summary()



The SimpleRNN layer has three components contributing to its parameter count of 416:

- **Input weights (`U`)**:  
  One weight per input feature × hidden unit  
  → `9 input features × 16 units = 144`

- **Recurrent weights (`V`)**:  
  One weight per hidden unit × hidden unit  
  → `16 units × 16 units = 256`

  These weights connect the hidden state at time `t-1` to the hidden state at time `t`, enabling the model to "remember" and update internal state across timesteps.

- **Biases (`b`)**:  
  One bias term per hidden unit  
  → `16`

**Total for RNN**:  


In [None]:
144 + 256 + 16

These weights are **shared across all 30 timesteps** — the same weights are applied as the model "unrolls" across time. This is one of the reasons RNNs can model long sequences without requiring separate weights for each lag like a dense model would.


In comparison, our dense model containing 30 lags with all predictors had 32,201 weights!
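
To make the contrast concrete, the sketch below compares how the two counts respond to a longer window. Both helper names are hypothetical. The dense figure covers only the first layer of the lagged model, since that is the part that grows with the number of lags.

```python
def simple_rnn_param_count(n_features, n_units):
    """Input weights + recurrent weights + biases for one SimpleRNN layer."""
    return n_features * n_units + n_units * n_units + n_units

def lagged_dense_first_layer_count(n_features, n_lags, n_units):
    """Weights + biases in the first dense layer when every lag is its own input column."""
    return n_features * n_lags * n_units + n_units

for window in (30, 90):
    rnn = simple_rnn_param_count(n_features=9, n_units=16)
    dense = lagged_dense_first_layer_count(n_features=9, n_lags=window, n_units=100)
    print(window, rnn, dense)
# 30 416 27100
# 90 416 81100
```

The recurrent layer's count does not depend on the window length at all, while the dense layer's count scales linearly with it.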

### What does the Dense layer do here?
After the RNN layer processes the full sequence, it returns the **final hidden state** — not the full sequence of outputs, just the one at the last timestep. This is typical in many-to-one sequence models (e.g., predicting tomorrow’s value from a 30-day window). If you want to output an entire sequence rather than just the next time-step, you can set: `return_sequences=True` in the RNN layer. We'll practice this in a few minutes.

In the many-to-one setup we've used here, the RNN outputs a single vector of size 16 (one value per hidden unit). That vector is then passed to a `Dense` layer, which acts as the output layer. The Dense layer:

- Takes all 16 values from the final hidden state
- Learns one weight per unit, to map them to a single prediction (sunshine hours)
- Adds a bias term

So the 17 parameters in the Dense layer are:
- **16 weights**: One from each hidden unit to the output
- **1 bias**: A constant shift to the prediction

**Note:** If you’ve seen diagrams like the one we showed above and below, you may notice an output weight matrix `W` that maps the hidden state `h_t` to an output `o_t` at each timestep. In our case, we aren’t producing an output at every step. Instead, we apply a single Dense layer *after* the final timestep, which effectively takes the place of `W` — but only once, not repeatedly. 

![Unrolled RNN](rnn_unfolded.png)


### Compile and fit the model

In [None]:
compile_model(model_rnn)

In [None]:
history_rnn = model_rnn.fit(
    X_train, y_train,
    batch_size=32,
    epochs=200,
    validation_data=(X_val, y_val),
    callbacks=[earlystopper]
)

In [None]:
plot_history(history_rnn, ['root_mean_squared_error', 'val_root_mean_squared_error'])


In [None]:
model_rnn.evaluate(X_test, y_test)

We see an improvement with the RNN model! Our test error is now 3.49 compared to the dense net's error of 3.96 (30 lags) and 3.59 (1 lag). Let's try a slightly more complicated RNN. This time, we'll use two recurrent layers.

#### RMSEs summarized
* **RNN_30**: 3.49
* **Dense_1**: 3.59
* **Dense_30**: 3.96
* **Baseline**: 4.55


### Stacked RNN

Let's try a stacked RNN next to give our model greater capacity to learn.

* The first SimpleRNN layer reads the input sequence and **returns a sequence of hidden states** — one for each timestep. It will have an output shape of: (batch_size, timesteps, units)
This preserves temporal information across all steps.

* The second SimpleRNN layer then treats that sequence as its input, processing it step by step and finally returning a single hidden state — the one from the final timestep.
Output shape: (batch_size, units)
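
The two output shapes can be illustrated with a small NumPy sketch for a single sample (so there is no batch dimension). The function `rnn_layer` and the helper `init` are hypothetical stand-ins that mimic the effect of `return_sequences`. They use random, untrained weights and assume a `tanh` activation.

```python
import numpy as np

def rnn_layer(x_seq, U, V, b, return_sequences=False):
    """Run a tanh RNN over one sequence of shape (timesteps, features)."""
    h = np.zeros(V.shape[0])
    states = []
    for x_t in x_seq:
        h = np.tanh(x_t @ U + h @ V + b)
        states.append(h)
    # Either every hidden state, or only the one from the final timestep
    return np.stack(states) if return_sequences else h

rng = np.random.default_rng(0)
timesteps, n_features, n_units = 30, 9, 16
x_seq = rng.normal(size=(timesteps, n_features))  # random stand-in data

def init(n_in, n_out):
    """Random input weights, recurrent weights, and zero biases for one layer."""
    return (rng.normal(scale=0.1, size=(n_in, n_out)),
            rng.normal(scale=0.1, size=(n_out, n_out)),
            np.zeros(n_out))

h_all = rnn_layer(x_seq, *init(n_features, n_units), return_sequences=True)
h_last = rnn_layer(h_all, *init(n_units, n_units))

print(h_all.shape)   # (30, 16)
print(h_last.shape)  # (16,)
```

The second layer receives a sequence with 16 "features" per timestep: the hidden units of the first layer.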

In [None]:
from tensorflow import keras

def create_stacked_rnn_model(input_shape, rnn_units=16):
    inputs = keras.Input(shape=input_shape, name="input_sequence")

    x = keras.layers.SimpleRNN(rnn_units, return_sequences=True)(inputs) 
    x = keras.layers.SimpleRNN(rnn_units)(x) 
    outputs = keras.layers.Dense(1)(x)

    return keras.Model(inputs, outputs, name="stacked_rnn_weather_model")



### What happens when we stack recurrent layers?

Stacking multiple RNN layers (e.g., using `return_sequences=True` in the first layer and feeding it into a second RNN) increases the **capacity** of the model — it allows the network to learn more complex or hierarchical temporal patterns.

This is somewhat analogous to stacking convolutional layers: early layers learn local features, later layers learn broader context. In a stacked RNN:
- The **first layer** processes the raw input sequence and emits a hidden state for each timestep.
- The **second layer** sees that sequence of hidden states and learns higher-level patterns across time — for example, shifts in trends or repeated structures based on short-term signals.


But there's a key limitation: stacking RNN layers doesn't fix their tendency to forget. Vanilla RNNs (like `SimpleRNN`) struggle to retain information over long sequences because of the **vanishing gradient problem**.

During training, RNNs rely on backpropagation through time to update their weights. But as the number of timesteps increases, the gradient (the signal used to update weights) becomes smaller and smaller — until it's effectively zero. This means early inputs (within each input window) have little influence on model predictions, even if they were important.
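
A toy calculation shows how quickly this can happen. The sketch below assumes a one-unit RNN, `h_t = tanh(w * h_{t-1} + x_t)`, with an arbitrary recurrent weight and a constant input. It tracks how sensitive the hidden state at step `t` is to the initial state by multiplying the local derivatives along the chain. The specific numbers are illustrative assumptions, not values from our weather model.

```python
import math

w = 0.9     # recurrent weight (arbitrary illustrative value)
h = 0.0     # initial hidden state
grad = 1.0  # sensitivity of h_t to the initial state, built up via the chain rule

for t in range(1, 31):
    h = math.tanh(w * h + 0.5)   # constant toy input of 0.5 at every step
    grad *= w * (1.0 - h ** 2)   # local derivative of h_t with respect to h_{t-1}
    if t in (1, 10, 30):
        print(t, grad)
```

Every factor in the product is smaller than 1, so the sensitivity shrinks geometrically with the number of steps. By step 30, a change at the start of the window has almost no effect on the final state.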

#### What counts as "long"?

There’s no universal threshold, but **even moderately long sequences** — sometimes as short as 20–30 steps — can expose this limitation in practice. The tipping point depends on several factors:

- The activation function (tanh exacerbates vanishing more than ReLU)
- How weights are initialized
- The complexity and variability of the sequence
- The specifics of optimization and training dynamics

In our case, with 30-day weather windows, SimpleRNN may already be at the edge of what it can retain reliably. As sequence length increases, the risk of forgetting early inputs becomes more severe unless mitigated.

So while stacking adds modeling power, it doesn’t address this memory bottleneck.

If your task requires learning long-range dependencies — for example, subtle signals from 30 days ago influencing tomorrow’s prediction — SimpleRNNs, even when stacked, are unlikely to perform well.

In those cases, consider using:
- **LSTM layers**: These use gating mechanisms to preserve and control memory over time
- **Transformer-based models**: These use attention to model the entire sequence at once, avoiding recurrence altogether

Stacked RNNs can be useful for capturing **short- to mid-range patterns**, but they’re not a solution for true long-term memory.


In [None]:
model_rnn_stacked = create_stacked_rnn_model(input_shape=X_train.shape[1:], rnn_units=16)
model_rnn_stacked.summary()
compile_model(model_rnn_stacked)
history_rnn_stacked = model_rnn_stacked.fit(
    X_train, y_train,
    batch_size=64,
    epochs=200,
    validation_data=(X_val, y_val),
    callbacks=[earlystopper]
)

In [None]:
plot_history(history_rnn_stacked, ['root_mean_squared_error', 'val_root_mean_squared_error'])


In [None]:
model_rnn_stacked.evaluate(X_test, y_test)

#### RMSEs summarized
* **RNN_stacked_30**: 3.42
* **RNN_30**: 3.49
* **Dense_1**: 3.59
* **Dense_30**: 3.96
* **Baseline**: 4.55


Further improvement! Yay recurrence! 

While we should certainly take a victory lap at this point, it may still be unsatisfying that we're missing the mark by more than 3 hours, on average.

## The problem with vanilla RNNs

Basic RNNs can capture short-term dependencies, but they struggle to retain information across long sequences — a limitation known as the vanishing gradient problem. Imagine whispering a message down a long chain of people like the game of telephone. As the message travels further, it degrades. This is similar to how gradient information gets lost in deep RNNs.

This is why we need specialized architectures like **LSTM** that help preserve memory over long sequences.

### LSTM to the rescue

LSTM (Long Short-Term Memory) layers help address the vanishing gradient problem by adding a memory component: the cell state.

In [None]:
#           ┌────────────┐
# x_t ───►  │  LSTM cell │ ───►   h_t
#           └────────────┘
#             ▲       ▲
#         h_{t-1}   c_{t-1} (memory)

At each timestep `t`, the LSTM takes:
- the input `x_t`
- the previous hidden state `h_{t-1}`
- the previous cell state `c_{t-1}`

The cell state acts as long-term memory, while the hidden state provides a short-term summary. The LSTM uses **gates with learned weights** to decide how much information to erase, update, or reveal at each step:

- **Forget gate**: Learns which parts of the cell state to erase
- **Input gate**: Learns which new information to add to memory
- **Output gate**: Learns what information to send to the next hidden state

These gates allow the model to maintain and control a persistent internal state across many timesteps, helping it overcome the vanishing gradient problem and track longer-term dependencies more effectively.
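
A single LSTM timestep can be sketched in NumPy using the standard LSTM update equations. This is an illustration with random, untrained weights, not the Keras implementation. The function name `lstm_step`, the names `W`, `R`, `b`, and the order in which the four blocks are stacked are all choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM timestep. W, R, b stack four blocks in the order:
    input gate, forget gate, candidate values, output gate."""
    n = h_prev.shape[0]
    z = x_t @ W + h_prev @ R + b       # shape (4 * n,)
    i = sigmoid(z[0 * n:1 * n])        # input gate: what to add to memory
    f = sigmoid(z[1 * n:2 * n])        # forget gate: what to erase from memory
    g = np.tanh(z[2 * n:3 * n])        # candidate values to write
    o = sigmoid(z[3 * n:4 * n])        # output gate: what to reveal
    c_t = f * c_prev + i * g           # update the cell state (long-term memory)
    h_t = o * np.tanh(c_t)             # new hidden state (short-term summary)
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, n_units = 9, 7
W = rng.normal(scale=0.1, size=(n_features, 4 * n_units))
R = rng.normal(scale=0.1, size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in rng.normal(size=(30, n_features)):  # random stand-in sequence
    h, c = lstm_step(x_t, h, c, W, R, b)

print(h.shape, c.shape)  # (7,) (7,)
```

Note that the cell state is updated additively (`f * c_prev + i * g`) rather than being pushed through a squashing function at every step. That additive path is what helps gradients survive across many timesteps.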


## Train LSTM model

In [None]:
def create_lstm_model(input_shape, lstm_units=16):
    inputs = keras.Input(shape=input_shape, name="input_sequence")

    # Stacked LSTM layers, for comparison with the stacked RNN
    x = keras.layers.LSTM(lstm_units, return_sequences=True)(inputs) 
    x = keras.layers.LSTM(lstm_units)(x) 

    # Output layer
    outputs = keras.layers.Dense(1)(x)

    return keras.Model(inputs, outputs, name="lstm_weather_model")


In [None]:
model_lstm = create_lstm_model(input_shape=X_train.shape[1:], lstm_units=7) # we'll choose a number of units so that the total param count is comparable to the stacked RNN (961)
model_lstm.summary()


The LSTM layers each have four sets of parameters: one for each of the three gates (input, forget, output) and one for the candidate cell update. Each set has its own input weights, recurrent weights, and bias term. That’s why the total parameter count for each LSTM layer scales with a factor of 4.

These parameters are shared across all timesteps in the input sequence, so the number of parameters depends on the input/output dimensions — not the sequence length.

### Breakdown of parameters

#### First LSTM layer (`lstm`)
- **Input shape**: (batch_size, 30 timesteps, 9 features)
- **Output shape**: (batch_size, 30 timesteps, 7 units)
- **Parameters**: 476

Breakdown:
- Input weights: `9 inputs × 7 units × 4 gates = 252`
- Recurrent weights: `7 units × 7 units × 4 gates = 196`
- Biases: `7 units × 4 gates = 28`
- **Total**: 252 + 196 + 28 = **476**

#### Second LSTM layer (`lstm_1`)
- **Input shape**: (batch_size, 30 timesteps, 7 features from previous LSTM)
- **Output shape**: (batch_size, 7 units)
- **Parameters**: 420

Breakdown:
- Input weights: `7 inputs × 7 units × 4 gates = 196`
- Recurrent weights: `7 units × 7 units × 4 gates = 196`
- Biases: `7 units × 4 gates = 28`
- **Total**: 196 + 196 + 28 = **420**

#### Final Dense layer
- Input shape: 7 (from final LSTM output)
- Output shape: 1
- **Parameters**: 7 weights + 1 bias = **8**

**Total model parameters**:  
476 (first LSTM) + 420 (second LSTM) + 8 (Dense) = **904**
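
The breakdown above follows one formula, which can be checked with a short helper. The name `lstm_param_count` is hypothetical, introduced only for this check.

```python
def lstm_param_count(n_inputs, n_units):
    """Four parameter sets, each with input weights, recurrent weights, and biases."""
    return 4 * (n_inputs * n_units + n_units * n_units + n_units)

first = lstm_param_count(n_inputs=9, n_units=7)
second = lstm_param_count(n_inputs=7, n_units=7)
dense = 7 * 1 + 1
print(first, second, dense, first + second + dense)  # 476 420 8 904
```

The same formula with 16 units gives 1,664 parameters for the first layer alone, which is why we reduced the unit count to keep the LSTM comparable in size to the stacked RNN.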


In [None]:
compile_model(model_lstm)
history_lstm = model_lstm.fit(
    X_train, y_train,
    batch_size=32,
    epochs=200,
    validation_data=(X_val, y_val),
    callbacks=[earlystopper]
)

In [None]:
plot_history(history_lstm, ['root_mean_squared_error', 'val_root_mean_squared_error'])


In [None]:
model_lstm.evaluate(X_test, y_test)

#### RMSEs summarized
* **LSTM_stacked_30**: 3.48
* **RNN_stacked_30**: 3.42
* **RNN_30**: 3.49
* **Dense_1**: 3.59
* **Dense_30**: 3.96
* **Baseline**: 4.55


### Discuss
Why might the LSTM be doing worse in this situation? What do you think will happen if we increase the sequence length?

Let's try it!

In [None]:
import numpy as np
window_size_larger = 90

# Build sequences
n_seq = len(data_basel) - window_size_larger

X_seq = np.stack([
    data_basel[basel_columns].iloc[i:i+window_size_larger].values
    for i in range(n_seq)
])
print("X_seq shape:", X_seq.shape)  # (samples, 90, features)

y_seq = data_basel['BASEL_sunshine'].iloc[window_size_larger:].values

print("y_seq shape:", y_seq.shape)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_holdout, y_train, y_holdout = train_test_split(X_seq, y_seq, test_size=test_set_size, random_state=0, shuffle=False)
X_val, X_test, y_val, y_test = train_test_split(X_holdout, y_holdout, test_size=0.5, random_state=0, shuffle=False)

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

In [None]:
model_lstm = create_lstm_model(input_shape=X_train.shape[1:], lstm_units=7)
model_lstm.summary()
compile_model(model_lstm)
history_lstm = model_lstm.fit(
    X_train, y_train,
    batch_size=32,
    epochs=200,
    validation_data=(X_val, y_val),
    callbacks=[earlystopper]
)

**Note**: You may notice that training takes a while. RNNs and LSTMs are notoriously slow to train because they process each sequence one timestep at a time, so the work within a sequence cannot be parallelized across timesteps.

In [None]:
plot_history(history_lstm, ['root_mean_squared_error', 'val_root_mean_squared_error'])


In [None]:
model_lstm.evaluate(X_test, y_test)

#### RMSEs summarized
* **LSTM_stacked_90**: 3.43
* **LSTM_stacked_30**: 3.47
* **RNN_stacked_30**: 3.42
* **RNN_30**: 3.49
* **Dense_1**: 3.59
* **Dense_30**: 3.96

### Discuss result


### Exercise: Adding dropout

Dropout is a common regularization technique that helps prevent overfitting by randomly "dropping" units during training. However, **how you apply dropout depends on the type of layer**.

- For **dense layers**, you add a separate `Dropout(...)` layer in between the dense layers.
- For **RNNs and LSTMs**, you don't use a separate layer. Instead, you pass dropout settings directly as arguments to the recurrent layer itself. This ensures that dropout is applied correctly over time without disrupting the sequence structure.
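
To see what "dropping" units means numerically, here is a minimal NumPy sketch of the training-time behaviour, assuming the common "inverted dropout" scheme in which surviving units are scaled up by `1 / (1 - rate)`. The toy activations are all ones so the effect is easy to read off.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.1
activations = np.ones((1000, 16))  # toy activations: 1000 samples, 16 units

# Training time: zero out roughly 10% of units, scale survivors by 1 / (1 - rate)
keep_mask = rng.random(activations.shape) >= rate
dropped = activations * keep_mask / (1.0 - rate)

print(keep_mask.mean())  # fraction of units kept, close to 0.9
print(dropped.mean())    # close to 1.0, so the expected activation is unchanged
```

At prediction time no units are dropped, which is why the rescaling during training matters: it keeps the average activation the same in both modes.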

In this exercise, you'll experiment with adding dropout to the models we built earlier. You may want to create a copy of this notebook for this next experiment so you don't lose track of our current results.

#### 1. Add dropout of 0.1 to the RNN and LSTM models (both 30 and 90 window sizes)

Use the built-in dropout arguments:
```python
dropout = 0.1 # add this at top of notebook in case you want to experiment with other values
# For SimpleRNN
x = keras.layers.SimpleRNN(rnn_units, activation='relu', dropout=dropout)(inputs)

# For LSTM
x = keras.layers.LSTM(lstm_units, dropout=dropout)(x)

```

Run each model with and without dropout, and compare their performance.
Does dropout help reduce overfitting? How does it affect validation and test RMSE?

#### 2. Add dropout to the dense model
For dense layers, add Dropout as a separate layer after the first activation:

```python
x = keras.layers.Dense(100, activation='relu')(inputs)
x = keras.layers.Dropout(dropout)(x)
```

### Discuss: Validation vs test 
If we are still comparing models at this stage, should we be using the test set to compare performance?

### Bonus exercises as homework

1. Use KerasTuner to find a decent set of hyperparameters for each model tested (RNN, LSTM and Dense). Experiment with different numbers of units, activation functions, and dropout levels. Feel free to experiment with others if you wish, but be mindful of combinatorial explosion.
2. Adjust window_size to 3 (near the top of this notebook), and restart the notebook kernel / run all cells. Which model performs the best? Why do you think this might be?
3. Adjust window_size to 360 (near the top of this notebook), and restart the notebook kernel / run all cells. Which model performs the best? Why do you think this might be?
4. Adjust the patience parameter to 50, and max epochs to 10000. This should help us find the "second descent" in the training curve, if one is possible to observe with this data. How do the training curves look? How long did training take? Were you able to find a better model?
5. Try to fit a CNN to the data. Does it do any better than the LSTM?


## Keypoints

- RNNs and LSTMs allow neural networks to process data step-by-step
- LSTMs retain long-term context using gated memory
- Sequence models are widely used in time series, language, and biology

### Wrapping up: What is attention?

So far, we’ve seen how RNNs and LSTMs process sequences step by step and pass a hidden state forward in time as a kind of memory. But ultimately, they compress everything they’ve seen into a single vector — often just the final hidden state — and use that to make a prediction.

That's a limitation when important information appears earlier in the sequence. Even with gates, LSTMs can struggle to retain distant context.

Attention changes this by asking a different question: Instead of trying to remember everything as you go, why not look back across all the memories you built — and decide which ones matter most right now?

You can think of attention as a spotlight the model moves across the sequence of past hidden states. It assigns weights to each timestep based on how relevant it thinks that step is for the current task. Then it combines those weighted memories into a new context vector, which it uses to make a prediction.

This allows the model to:
- Access the full sequence without compressing it into a single state
- Focus on different parts of the sequence depending on the context
- Better handle long-range dependencies
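
The spotlight idea can be sketched with NumPy. This is one simple variant (dot-product scoring with the final hidden state as the query), chosen here for illustration. The hidden states are random stand-ins rather than outputs of a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, n_units = 30, 16

H = rng.normal(size=(timesteps, n_units))  # stand-in for hidden states at each timestep
query = H[-1]                              # use the final hidden state as the "question"

scores = H @ query                         # one relevance score per timestep
scores = scores - scores.max()             # shift for numerical stability
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: weights sum to 1
context = weights @ H                      # weighted combination of all hidden states

print(weights.shape, context.shape)  # (30,) (16,)
```

The context vector has the same size as a single hidden state, but it can draw on any timestep in the window rather than only the last one.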

Attention was originally introduced alongside RNNs and LSTMs — for tasks like translation, where aligning parts of an input with an output matters. But later, researchers asked: what if we removed recurrence altogether and just used attention? That led to the Transformer architecture.

We'll come back to Transformers in a future lesson. For now, just keep this in mind:

LSTMs decide what to remember over time.  
Attention decides where to look across the stored memories.
