## Columbia University
### ECBM E4040 Neural Networks and Deep Learning. Fall 2025.

## **Task 2: RNN Application - Time Series Forecasting** (25%)

# Forecasting Task

The purpose of this workbook is to teach you how to use neural networks for time series forecasting. A time series is a collection of discrete data points indexed over time. Being able to predict the next data point in a sequence is a valuable skill with applications in climate science, finance, health science, economics, earthquake prediction, and many other fields.

For this assignment, we are going to examine Microsoft's stock price over a 7-year period. The task focuses on data processing and splitting, model selection, training, and forecasting on time series data.

## Import Packages
First, we need to install Pandas, a very popular data science package for Python. It is especially well suited to storing tabular data in DataFrames. You can read more about Pandas at: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html


In [None]:
# Run this cell to install Pandas
!pip install -q pandas

In [None]:
# Import modules
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

%load_ext autoreload
%autoreload 2

## Load Data

In [None]:
# Load the data from the CSV and display the Pandas DataFrame Head (first 5 rows of the DataFrame)
df = pd.read_csv('stock_data/Microsoft_Stock.csv')
df['Date'] = pd.to_datetime(df['Date'])
open_prices = df['Open'].values
df.head()

Now, let's plot the data to visualize what the open price looks like over time.
For this assignment we are only going to focus on the open price.

In [None]:
# Plot 'Open' vs 'Date'
plt.figure(figsize=(10, 6))  # Set the figure size
plt.plot(df['Date'], open_prices, label='Open Price')

# Add labels and Title
plt.xlabel('Date')
plt.ylabel('Open Price')
plt.title('MSFT Open Prices 2015-2021')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

plt.grid(True)
plt.legend()
plt.show()

## Create Lookback Dataset (5%)
To ensure the dataset has the right "lookback window" for making predictions, we need to feed the model a sequence of past data points leading up to the target point. For example, if we want to predict the value at time $t=6$, we should use data points from $t=0$ to $t=5$ as the input $X$ for the model. The target value, $y$, for this input sequence would then be the data point at $t=6$. Importantly, the lookback window (in this case, 6 data points) should always exclude the actual point we’re trying to predict. This ensures that the model is only using historical data to make its predictions, rather than information from the future.

To implement this, we’ll create a function that generates a lookback dataset by passing a sliding window over the data. The length of this window, or "lookback," determines how many data points are included in each sequence. Here, we’ll set the lookback to a default of 10, though this can be adjusted later when fine-tuning the model. For now, start with a lookback of 10 as a baseline.
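The sliding-window idea can be sketched on a toy series. This is only an illustration under assumed names (`sliding_windows`, `toy_series`, `toy_X`, `toy_y` are hypothetical and not part of the assignment code), using a lookback of 3 on eight integers so the pairing is easy to check by eye.

```python
import numpy as np

def sliding_windows(series, lookback):
    """Pair each run of `lookback` consecutive values with the value right after it."""
    series = np.asarray(series)
    inputs, targets = [], []
    for start in range(len(series) - lookback):
        end = start + lookback
        inputs.append(series[start:end])   # values at positions start .. end-1
        targets.append(series[end])        # the next value, never part of the window
    return np.array(inputs), np.array(targets)

toy_series = np.arange(8)                  # stand-in data: 0, 1, ..., 7
toy_X, toy_y = sliding_windows(toy_series, lookback=3)
print(toy_X.shape, toy_y.shape)            # (5, 3) (5,)
print(toy_X[0], toy_y[0])                  # [0 1 2] 3
```

A series of length $N$ yields $N - \text{lookback}$ input/target pairs, because the first `lookback` points have no complete window preceding them.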

Before generating the lookback dataset, we also need to normalize the data (`open_prices`) to standardize the range of input values. To do this, perform a standard normalization by subtracting the mean of `open_prices` from each data point and then dividing by the standard deviation. This normalized data is then ready to be passed to the lookback function, which will create the dataset for training the model.
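As a small worked example of standard normalization, the sketch below uses made-up values (`toy_prices` is hypothetical, not the stock data). After the transformation, the mean should be approximately 0 and the standard deviation approximately 1.

```python
import numpy as np

toy_prices = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # made-up values
toy_mean = toy_prices.mean()
toy_std = toy_prices.std()   # NumPy's default is the population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation
toy_normalized = (toy_prices - toy_mean) / toy_std

# Both printed values should be very close to 0 and 1 respectively
print(toy_normalized.mean(), toy_normalized.std())
```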

<center>
<img src="./img/lookback.png" width="500" class="center"/>
</center>

<font color="red"><strong>TODO:</strong></font> Normalize the data following the description above.

In [None]:
###################################################
# TODO: Set open_prices_normalized following the  #
#       description above.                        #
#                                                 #
###################################################

open_prices_normalized = None

###################################################
# ENDTODO #
###################################################

<font color="red"><strong>TODO:</strong></font> Complete the `create_dataset` function following the description above.

In [None]:
def create_dataset(data, lookback = 10):
    X, y = [], []
    ###################################################
    # TODO: Create the lookback dataset following the #
    #       description above.                        #
    #                                                 #
    ###################################################



    ###################################################
    # ENDTODO #
    ###################################################

    return np.array(X), np.array(y)

# Define the lookback period (You can change this later)
lookback = 10

# Prepare the dataset with lookback
X, y = create_dataset(open_prices_normalized, lookback)


# If this was done correctly, the number of samples for X and y should be equal,
# and X should have the lookback number as its second dimension.
print("X_open_prices_lookback", X.shape)
print("y_open_prices", y.shape)

## Split Data into Training, Validation, and Test (5%)

By this point in your data science career, you have become familiar with good practice in splitting data between training, validation, and testing. To properly set up the dataset for time series forecasting, we’ll split the data again into training, validation, and testing sets. However, unlike other types of data we have worked with thus far in the class, time series data requires that we maintain the order of observations, so we can't use random selection to split the data. Instead, we’ll divide it sequentially: the first 70% of the data will be used for training, the next 15% for validation, and the final 15% for testing.

You will first have to calculate the number of entries for each group: training, validation, and testing. Then you can split the `X` data, `y` data, and the timestamps in similar fashions.
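A sequential split can be illustrated on a toy array of 20 time-ordered samples. The names below (`toy_samples`, `n_train`, `n_val`, and so on) are hypothetical, and using integer arithmetic for the sizes is just one possible choice that avoids floating-point rounding surprises.

```python
import numpy as np

toy_samples = np.arange(20)                 # stand-in for 20 time-ordered samples
n_total = len(toy_samples)

n_train = n_total * 70 // 100               # first 70% -> 14
n_val = n_total * 15 // 100                 # next 15%  -> 3

# Slice in order: no shuffling, so every later split lies strictly in the future
toy_train = toy_samples[:n_train]
toy_val = toy_samples[n_train:n_train + n_val]
toy_test = toy_samples[n_train + n_val:]    # remainder, absorbs any rounding

print(len(toy_train), len(toy_val), len(toy_test))   # 14 3 3
```

The same slice boundaries would then be applied to the inputs, the targets, and the timestamps so that all three stay aligned.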

<font color="red"><strong>TODO:</strong></font> Split the dataset into training, validation, and testing following the description above.

In [None]:
###################################################
# TODO: Split the dataset and fill in the empty   #
#       variables following the description above #
#                                                 #
###################################################

# Calculate the index sizes for the splits
train_size = None
val_size = None
test_size = None

# Split the data into train, validation, and test sets
X_train = None
y_train = None

X_val = None
y_val = None

X_test = None
y_test = None

# Split timestamps into train, validation, and test
lookback_timestamps = df['Date'][lookback:]

train_timestamps = None
val_timestamps = None
test_timestamps = None


###################################################
# ENDTODO #
###################################################



# Reshape X_train, X_val, and X_test from 2D (samples, lookback) to 3D shape (samples, lookback, features=1)
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_val = X_val.reshape((X_val.shape[0], X_val.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))


# Print the sizes of each dataset to confirm
print("Train", X_train.shape, y_train.shape, train_timestamps.shape)
print("Validation", X_val.shape, y_val.shape, val_timestamps.shape)
print("Test", X_test.shape, y_test.shape, test_timestamps.shape)

## Visualize the split dataset

In [None]:
# Plot the Normalized Data Splits
plt.figure(figsize=(10, 6))  # Set the figure size
plt.plot(train_timestamps, y_train, label='Open Price - Train')
plt.plot(val_timestamps, y_val, label='Open Price - Validation')
plt.plot(test_timestamps, y_test, label='Open Price - Test')


# Adding labels and title
plt.xlabel('Date')
plt.ylabel('Open Price')
plt.title('Normalized MSFT Open Prices 2015-2021')
plt.grid(True)
plt.legend()
plt.show()

<span style="color:red">__TODO:__</span> Why do you think we split the data sequentially instead of randomly? Answer in 1-3 sentences.

<span style="color:red">__Answer:__</span>

## Building the Neural Network (5%)
For this part of the task, you’ll create a neural network model using TensorFlow’s Sequential API. The goal is to design a model for time series forecasting by experimenting with various layer types, layer counts, and unit configurations.

Begin by constructing a basic Sequential model with an InputLayer that matches the shape of your lookback data (the number of time steps and feature dimension). Start with at least one recurrent layer, like LSTM, GRU, or a simple RNN, and follow it with a Dense layer to produce the final output. Experiment with different combinations and configurations, such as adding additional Dense layers or stacking multiple RNN/LSTM/GRU layers with different numbers of units. Once the model structure is defined, consider applying optional techniques we learned in class: dropout, batch normalization, early stopping to monitor validation performance, etc.

Remember, the purpose of this neural network is to take in a lookback window and produce a prediction for a single point, `y_predict`, which is then compared to the actual label, `y`. This is a regression problem, and you need to think about the output layer when designing your model.

<center>
<img src="./img/prediction.png" width="500" class="center"/>
</center>
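To make the window-in, single-value-out behavior concrete, here is a hand-written NumPy sketch of a vanilla recurrent cell followed by a linear output. It is not the Keras model this task asks for: the weights are random and untrained, and every name and size (`rnn_forecast`, `W_x`, `W_h`, `n_hidden`, ...) is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lookback, n_features, n_hidden = 10, 1, 4    # illustrative sizes

# Randomly initialized (untrained) parameters
W_x = rng.normal(scale=0.5, size=(n_features, n_hidden))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)
W_out = rng.normal(scale=0.5, size=(n_hidden, 1))
b_out = np.zeros(1)

def rnn_forecast(window):
    """Run a vanilla RNN over one (lookback, n_features) window and return one scalar."""
    h = np.zeros(n_hidden)
    for x_t in window:                            # one time step at a time
        h = np.tanh(x_t @ W_x + h @ W_h + b_h)    # hidden state update
    return (h @ W_out + b_out).item()             # linear output, suited to regression

toy_window = rng.normal(size=(lookback, n_features))
toy_prediction = rnn_forecast(toy_window)
print(toy_prediction)
```

The output has no squashing activation, so the prediction can take any real value, which is what a regression target needs.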

<span style="color:red">__TODO:__</span> Build an RNN-based model for time series forecasting using TensorFlow and Keras following the instructions above.

In [None]:
###################################################
# TODO: Declare your TensorFlow model             #
#                                                 #
###################################################

# Declare your TF Keras model
ts_model = None


###################################################
# ENDTODO #
###################################################

<span style="color:red">__TODO:__</span> Briefly describe the decisions you made regarding the design of your neural network and why you made those decisions. Please answer in a few short sentences.

<span style="color:red">__Answer:__</span>

## Compile and Train your model (6%)

Compile your model by setting an optimizer, a learning rate, and an appropriate loss function for this regression problem. Experiment with different learning rates to find the one that works best for your model. Remember that this is a regression task, and think about how that affects your choice of loss function.
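As a reminder of how two common regression losses behave, the sketch below computes mean squared error and mean absolute error by hand on made-up values (`toy_true` and `toy_pred` are hypothetical). It is meant only to show how each loss weighs large errors, not to prescribe a choice.

```python
import numpy as np

toy_true = np.array([0.0, 1.0, 2.0, 3.0])
toy_pred = np.array([0.5, 1.0, 1.0, 5.0])

errors = toy_pred - toy_true       # [0.5, 0.0, -1.0, 2.0]
mse = np.mean(errors ** 2)         # (0.25 + 0 + 1 + 4) / 4 = 1.3125
mae = np.mean(np.abs(errors))      # (0.5 + 0 + 1 + 2) / 4 = 0.875
print(mse, mae)                    # 1.3125 0.875
```

The single error of 2.0 contributes 4.0 of the 5.25 total squared error, so MSE is far more sensitive to occasional large misses than MAE.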

<span style="color:red">__TODO:__</span> Choose a learning rate, an optimizer, and a loss function, then compile your model.

In [None]:
###################################################
# TODO: Select hyperparameters for compiling your #
#       model                                     #
#                                                 #
###################################################

# Select hyperparameters
LEARNING_RATE = None
OPTIMIZER = None
LOSS = None


###################################################
# ENDTODO #
###################################################


ts_model.compile(
    optimizer=OPTIMIZER,
    loss=LOSS
)

Select the number of epochs and the batch size, add any extras you like (such as early stopping callbacks), and then train your model. Remember to use your validation data! To read more about the `fit()` function, which you will need to train your TensorFlow Keras model, check out the TF guide: https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit

We want to get the training history from the fit function so that we can graph the training progress. This will provide insights into your model’s learning progress and any potential overfitting or underfitting.
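One way to read a validation-loss curve is to ask when it stopped improving. The plain-Python sketch below illustrates the idea behind patience-based early stopping on made-up loss values. It is a hand-rolled illustration, not Keras's implementation, and `best_epoch_with_patience` and `toy_val_losses` are hypothetical names.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (epoch_to_keep, epoch_stopped), both 0-indexed."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch      # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1

# Made-up curve: improves until epoch 3, then drifts upward
toy_val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.44, 0.47, 0.50]
print(best_epoch_with_patience(toy_val_losses, patience=3))   # (3, 6)
```

A validation loss that rises while the training loss keeps falling, as in the tail of this toy curve, is the classic visual signature of overfitting.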

<span style="color:red">__TODO:__</span> Choose training parameters and train your model following the description above.

In [None]:
###################################################
# TODO: Select parameters for your training       #
#       and train your model following the        #
#       instructions above.                       #
#                                                 #
###################################################

# Hyperparameters
EPOCHS = None
BATCH_SIZE = None

# Train your model
history = None


###################################################
# ENDTODO #
###################################################

<span style="color:red">__TODO:__</span> Run the following cell to plot your loss.

In [None]:
# Extract the loss values for training and validation
train_loss = history.history['loss']  # Training loss
val_loss = history.history['val_loss']  # Validation loss

# Plot the training and validation loss over epochs
plt.figure(figsize=(10, 6))
plt.plot(train_loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')

# Add labels and title
plt.title('Training and Validation Loss Over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

# Show the plot
plt.grid(True)
plt.show()

<span style="color:red">__TODO:__</span> Does your training graph indicate signs of overfitting? Answer in 1-2 sentences.

<span style="color:red">__Answer:__</span>

## Evaluate the Model's Performance (4%)
Use your model to make predictions for the training, validation, and test datasets.
Remember that the data is normalized, so the model's output will also be normalized.
We have provided the code to graph your normalized predictions against the normalized labels.
However, if you want to recover actual Microsoft stock prices, remember to de-normalize your predictions!
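De-normalization simply inverts the standardization step: multiply by the standard deviation, then add the mean. The sketch below shows the round trip on made-up values (`toy_prices`, `toy_norm`, and `toy_restored` are hypothetical). For real predictions, the mean and standard deviation would have to be the same ones used to normalize the original data.

```python
import numpy as np

toy_prices = np.array([100.0, 110.0, 120.0, 130.0])   # made-up dollar values
toy_mean, toy_std = toy_prices.mean(), toy_prices.std()

toy_norm = (toy_prices - toy_mean) / toy_std          # forward: normalize
toy_restored = toy_norm * toy_std + toy_mean          # inverse: de-normalize

print(np.allclose(toy_restored, toy_prices))          # True
```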

<span style="color:red">__TODO:__</span> Use the TF Keras `predict()` function (https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict) to get the output. Then see how well the model performed on the test dataset.

In [None]:
###################################################
# TODO: Generate predictions for each of the      #
#       dataset splits                            #
#                                                 #
###################################################

# Generate predictions for training, validation, and test data
y_train_pred_norm = None
y_val_pred_norm = None
y_test_pred_norm = None


###################################################
# ENDTODO #
###################################################

In [None]:
# Create MSE loss function
mse_loss = tf.keras.losses.MeanSquaredError()

# Calculate mean squared error for each dataset
train_loss = mse_loss(y_train, y_train_pred_norm).numpy()
val_loss = mse_loss(y_val, y_val_pred_norm).numpy()
test_loss = mse_loss(y_test, y_test_pred_norm).numpy()

print("Training Loss:", train_loss)
print("Validation Loss:", val_loss)
print("Test Loss:", test_loss)


# Plotting the predictions and actual values
plt.figure(figsize=(12, 8))

# Plot for training data
plt.plot(train_timestamps, y_train, label='Training Actuals',
         color='blue', linewidth = 4)
plt.plot(train_timestamps, y_train_pred_norm, label='Training Predictions',
         color='lightblue', linestyle='--', linewidth = 3)

# Plot for validation data
plt.plot(val_timestamps, y_val, label='Validation Actuals',
         color='green', linewidth = 4)
plt.plot(val_timestamps, y_val_pred_norm, label='Validation Predictions',
         color='lightgreen', linestyle='--', linewidth = 3)

# Plot for test data
plt.plot(test_timestamps, y_test, label='Test Actuals',
         color='red', linewidth = 4)
plt.plot(test_timestamps, y_test_pred_norm, label='Test Predictions',
         color='orange', linestyle='--', linewidth = 3)

# Add labels, legend, and grid
plt.xlabel('Date')
plt.ylabel('Open Price')
plt.title('Predictions vs Actual Prices for Training, Validation, and Test Sets')
plt.legend()
plt.grid(True)

# Rotate x-axis labels for readability
plt.xticks(rotation=45)

# Show the plot
plt.show()


<span style="color:red">__TODO:__</span> How well did your model perform on the validation and test data? Answer in 2-3 sentences.

<span style="color:red">__Answer:__</span>


<span style="color:red">__TODO:__</span> Discussion: Why might your RNN predictions lag behind sudden changes?

When you compare your RNN’s predictions with the ground truth,
you may notice that the forecast reacts *slowly* to sudden changes.
Explain briefly why this lag occurs based on how the RNN updates its hidden state,
and suggest one way (architectural or training) to make it respond faster.

No more than 5 sentences.

<span style="color:red">__Answer:__</span>