## Stock Price Forecasting with RNNs

GOAL: use previous days' stock prices to forecast forward stock prices. The model below uses the previous 60 days of closing prices as the inputs for each prediction.

<img src="https://responsive.fxempire.com/v7/_fxempire_/2021/04/Stock-Market-2-2.jpg?func=cover&q=70&width=500" alt="Stock Price Demo" width="300" height=150/>

Reference: FX Empire

#### Libraries

In [None]:
import pandas as pd
import numpy as np

#to plot within notebook
import matplotlib.pyplot as plt

#importing required libraries
from sklearn.preprocessing import MinMaxScaler # rescales values to a fixed range (0 to 1 by default)

from keras.models import Sequential # builds a model as a linear stack of layers
from keras.layers import Dense, Dropout, LSTM # LSTM is a specialised RNN layer; Dense and Dropout are standard layers

In [None]:
# config - display plots inline
%matplotlib inline

#setting figure size
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 20,10 # figure size in inches

##### Data Import & EDA

To add a data file to Colab, open the file browser (folder icon) on the left-hand side, go up a level to the main directory and choose the "content" folder. Drag and drop the data file into that folder, then right-click the uploaded file and select "Copy path".

In [None]:
# Exercise - see if you can import the stock price data using the instruction above
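
One possible solution is sketched below. The path `/content/stock_prices.csv` is a placeholder for whatever "Copy path" gives you, and the three sample rows are made up purely so the snippet runs on its own. Only the `Date` and `Close` column names are taken from the rest of this notebook.

```python
import io
import pandas as pd

# In Colab you would pass the copied path instead, e.g. (hypothetical file name):
# df = pd.read_csv("/content/stock_prices.csv")

# Stand-in for the uploaded file: a tiny in-memory CSV with made-up values
sample_csv = io.StringIO(
    "Date,Close\n"
    "2018-10-08,208.3\n"
    "2018-10-05,210.1\n"
    "2018-10-04,214.9\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # Date is read as text until it is converted to datetime
```

`pd.read_csv` accepts either a file path or a file-like object, so the same call works once the real path is substituted.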



In [None]:
df.shape

In [None]:
df.info()

#### Data Wrangling / Formatting

Pandas reads the "Date" column as strings (dtype object). We convert it to a pandas datetime format and use it as the index, so that the rows can be sorted chronologically and plotted on a time axis before we set up our RNN.

In [None]:
#setting index as date values
df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d') # Y is 4 digit year, y would be 2 digit year
df.index = df['Date'] # set Date column as an index

# sort the data so that Date is in ascending order
data = df.sort_index(ascending=True, axis=0)

# isolate the Date and Close columns into a two-column version of our data (called new_data)
# reset_index gives new_data a plain 0..n-1 integer index for now;
# selecting the columns directly avoids a row-by-row loop with chained assignment
new_data = data[['Date', 'Close']].reset_index(drop=True)

new_data.head()

In [None]:
# aside 
data.head()

#### Data Wrangling - Data Partitioning

In [None]:
new_data.index = new_data.Date # set the Date column of "new_data" as the index
new_data.head()

In [None]:
# drop the extra Date column
new_data.drop("Date", axis = 1, inplace = True)
new_data.head() # check

In [None]:
# create a training and test set using an 80/20 split

dataset = new_data.values # copying the values from the new_data dataframe

# we set training data as the first 80% of rows, and test data as the remaining 20%
train = dataset[0:int(len(new_data)*0.8-1),:] # from rows 0 to 987 (exclusive)
test = dataset[int(len(new_data)*0.8-1):,:] # from row 987 to the end (row 1234)

In [None]:
# post checks
len(train)

987 rows is roughly 80% of the original data size (1235 rows)
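
The arithmetic behind the split can be checked directly. This is a small sketch that only reuses the row count quoted above (1235) and the slicing expression from the split cell; the variable names are illustrative.

```python
n_rows = 1235                      # total rows, as stated above
split = int(n_rows * 0.8 - 1)      # same expression as in the slicing code
n_train = split                    # rows 0 .. split-1
n_test = n_rows - split            # rows split .. n_rows-1

print(split, n_train, n_test)      # 987 987 248
print(round(n_train / n_rows, 3))  # 0.799, i.e. just under 80%
```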

In [None]:
len(test)


In [None]:
# Exercise - scale the data so that the stock prices are between 0 and 1
# use MinMaxScaler and apply to "dataset"
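
One way to approach this exercise is sketched below on made-up prices. The names `scaler` and `scaled_data` match the ones used in the later cells; the five sample values are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up closing prices standing in for `dataset` (shape: n_rows x 1)
dataset = np.array([[150.0], [160.0], [155.0], [170.0], [165.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)  # learns min/max, then rescales to [0, 1]

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(scaled_data)

print(scaled_data.ravel())
print(restored.ravel())
```

Fitting on the full `dataset` follows the exercise wording. Fitting the scaler on the training rows only is a common alternative that avoids using test-set information during preprocessing.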


#### Prepare the data for Recurrent Neural Network

For RNNs (including LSTMs), we use the previous time steps as the input "features" for each prediction.
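
To make the idea concrete, here is a minimal sketch on a toy series with a lookback of 3 instead of 60. The helper `make_windows` is a hypothetical name introduced for illustration, not part of this notebook.

```python
import numpy as np

def make_windows(series, lookback):
    """Return (X, y): each row of X holds `lookback` consecutive values,
    and y holds the value that immediately follows each window."""
    X, y = [], []
    for i in range(lookback, len(series)):
        X.append(series[i - lookback:i])
        y.append(series[i])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)   # toy series 0.0 .. 9.0
X, y = make_windows(series, lookback=3)

print(X.shape, y.shape)   # one row per predictable time step
print(X[0], y[0])         # first window and its target
```

A series of length 10 with a lookback of 3 yields 7 samples, because the first 3 values have no complete window before them.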

In [None]:
# add the last 60 time steps as features to our training set
x_train, y_train = [], [] # initialisation as empty lists

# loop through the training set (starting from the 60th time step)
for i in range(60, len(train)):
  x_train.append(scaled_data[i-60:i,0]) # features are the previous 60 time steps (a vector of 60 values)
  y_train.append(scaled_data[i,0]) # target is just the current time step (a single value)

# convert to numpy array format
x_train, y_train = np.array(x_train), np.array(y_train)

# reshape to the 3D tensor that Keras RNN layers expect: (samples, time steps, features)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))

In [None]:
x_train.shape # check tensor dimensions

#### Fit LSTM Model

In [None]:
# Exercise - try on your own before viewing solution below

# a) create / define an LSTM network (STEP 1)
# b) compile (STEP 2) the model using adam optimization with MSE loss
# c) fit the data to the model and train (STEP 3)

In [None]:
# a) create / define an LSTM network (STEP 1)
model = Sequential() # TF sequential model

model.add(LSTM(50, activation='relu', return_sequences=True, input_shape=(x_train.shape[1], 1))) # first hidden layer BASELINE
#model.add(LSTM(50, activation='tanh', return_sequences=True, input_shape=(x_train.shape[1], 1))) # b) rerun the model with a different activation function (tanh)
model.add(LSTM(50)) # second LSTM layer; input shape is inferred from the previous layer
model.add(Dense(1)) # output layer: a single unit for the predicted (scaled) price

In [None]:
# b) compile (STEP 2) the model using adam optimization with MSE loss
model.compile(optimizer='adam', loss='mse')

In [None]:
# c) fit the data to the model and train (STEP 3)

batch_size = 1 # baseline
#batch_size = 10 # a) rerun the model with a different batch size and compare RMS error

model.fit(x_train, 
          y_train, 
          epochs=1, 
          batch_size=batch_size, 
          verbose=1) # verbose = 1 gives us a bit more info when the training process runs

#### Model Evaluation / Predictions

In [None]:
# wrangle the test data in the same way as the training data

# take the 60 time steps before the test set starts, plus the test set itself:
# rows from 1235 - 248 - 60 = 927 to the end (row 1234), so 248 + 60 = 308 rows
inputs = new_data[len(new_data) - len(test) - 60:].values

inputs = inputs.reshape(-1,1) # -1 lets NumPy infer the number of rows; the result has a single column
inputs = scaler.transform(inputs) # apply the same scaling, as the model was trained on scaled data

# we start from row 60 of inputs, which is the first row of the TEST set
x_test = []
for i in range(60, inputs.shape[0]):
    x_test.append(inputs[i-60:i,0]) # gather up the last 60 time steps for each of the test set timesteps
x_test = np.array(x_test) # convert the actual test data to a numpy array

x_test = np.reshape(x_test, (x_test.shape[0],x_test.shape[1], 1)) # convert to 3D numpy array for predicting
pricePred = model.predict(x_test) # predict our stock price for next time step

# post-process: "unscale" the predictions so they can be compared with the test set and plotted below
pricePred = scaler.inverse_transform(pricePred) # reverse the scaling back into original price units

In [None]:
# have a look at model performance
# RMS error = square root of the mean squared error between actual (test) and forecast stock prices (pricePred)

rms = np.sqrt(np.mean(np.power(test-pricePred,2)))
rms

The value above should be compared with our average stock price:

In [None]:
# average (actual) stock price 
new_data["Close"].mean()

Our RMS error (residual) is about 12/169 ≈ 7% of the average stock price.
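
The same comparison can be written as a small calculation. The prices below are made up for illustration and are not taken from this dataset; the sketch also divides by the mean of the sample's actual prices, whereas the cell above uses the mean of the whole `Close` series.

```python
import numpy as np

actual = np.array([100.0, 102.0, 98.0])     # made-up actual prices
forecast = np.array([101.0, 100.0, 99.0])   # made-up forecasts

# root mean squared error, in the same units as the prices
rmse = np.sqrt(np.mean((actual - forecast) ** 2))

# express the error as a fraction of the average actual price
relative_rmse = rmse / actual.mean()

print(round(rmse, 3), f"{relative_rmse:.1%}")
```

Dividing by the average price turns the error into a unit-free percentage, which makes runs on differently priced stocks easier to compare.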

#### Visualise our Forecast

We need to do some further wrangling to get our data into the correct format to plot as a time series chart with matplotlib.

In [None]:
# We plot the actual stock prices against our predictions
# and we colour the training set, test set and our predictions differently

plt.figure(figsize=(10, 7)) # inches

# plot the actual stock price time series chart against our predictions

# re-defining the training set, our test set and our predictions for the chart we are about to show
train_new = new_data[:int(len(new_data)*0.8-1)] # the first ~80% of rows in new_data
test_new = new_data[int(len(new_data)*0.8-1):].copy() # the remaining ~20%; .copy() so a column can be added safely
test_new["Predictions"] = pricePred # copy our predicted stock prices

plt.rcParams.update({'font.size': 15})

plt.plot(train_new["Close"]) # plot the training set data
plt.plot(test_new[["Close","Predictions"]]) # plot the test set stock prices and our predictions
# NB the line above plots both the actual test stock prices and the predictions
# in a single plot call

plt.title("TATA Stock Price - RNN forecasting comparison")

plt.legend(["train","test","prediction"])

In [None]:
# Exercise - see if you can

# a) make the plot bigger
# b) decrease the axes font
# c) change the legend to show train (80%), test (20%)

In [None]:
# We plot the actual stock prices against our predictions
# and we colour the training set, test set and our predictions differently

plt.figure(figsize=(12, 8)) # a) make the plot bigger

# plot the actual stock price time series chart against our predictions

# re-defining the training set, our test set and our predictions for the chart we are about to show
train_new = new_data[:int(len(new_data)*0.8-1)] # the first ~80% of rows in new_data
test_new = new_data[int(len(new_data)*0.8-1):].copy() # the remaining ~20%; .copy() so a column can be added safely
test_new["Predictions"] = pricePred # copy our predicted stock prices

plt.rcParams.update({'font.size': 10}) # b) decrease the axes font

plt.plot(train_new["Close"]) # plot the training set data
plt.plot(test_new[["Close","Predictions"]]) # plot the test set stock prices and our predictions
# NB the line above plots both the actual test stock prices and the predictions
# in a single plot call

plt.title("TATA Stock Price - RNN forecasting comparison")

plt.legend(["train 80%","test 20%","prediction"])

In [None]:
# Exercise (scenario runs)
# a) rerun the model with a different batch size and compare RMS error
# b) rerun the model with a different activation function (tanh)