# Predicting the Stock Market with AI

By:
* Chenghao LYU
* Daniel STULBERG HUF
* Kerryan MACÉ

## 1. Introduction

In this applied course, you will get a glimpse of how AI can be applied to financial markets to predict future prices from past market behaviour.

In this notebook, we will show you how to predict the closing stock price of any company listed on a US stock exchange for the next 10 days. For this task, we will use an LSTM (long short-term memory) recurrent neural network.

<br><strong>Acknowledgment:</strong> This file was adapted from a notebook created by the author of the YouTube channel <a href="https://www.youtube.com/@ComputerSciencecompsci112358">Computer Science</a>, and the reference video is available <a href="https://youtu.be/QIUxPv5PJOY">here</a>.

<strong>Disclaimer:</strong> This applied course is purely didactic; it does not promote any specific stock or give any investment advice. And always remember: <strong>past performance is no guarantee of future results</strong>.

<br>First, let's install _yfinance_, an open-source Python library used for downloading market data from Yahoo! Finance's API without any cost.

In [None]:
!pip install "yfinance"

Next, we will import all the libraries required to run this notebook.

In [None]:
import datetime as dt
import yfinance as yf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

## 2. Data Import and Preprocessing

In this section, we will obtain the data and prepare it for use in our model.

We start by choosing an arbitrary company listed on a US stock exchange (META in this case). Then, we set a time interval for which to download the company's price data.

In [None]:
company = "META"

start_date = dt.datetime(2017, 1, 1)
end_date = dt.datetime.now()
stock_data = yf.download(company, start=start_date, end=end_date)

We can print the first lines of the generated DataFrame to check which information was obtained in this download.

In [None]:
print(stock_data.head())

As we can see, the data obtained is a time series containing the opening price, highest price, lowest price, closing price, adjusted closing price and volume of the META stock within the specified time interval.

Our model will use only the closing price as a feature to predict future values, so we can filter our DataFrame accordingly.

In [None]:
close_price = stock_data[["Close"]]

We can now visualize the closing price history of the company.

In [None]:
plt.figure(figsize=(12, 6))
plt.title(f"{company} Stock prices", fontsize=18)
plt.xlabel("Date", fontsize=18)
plt.ylabel("Closing price USD", fontsize=18)
plt.plot(close_price)
plt.grid()
plt.show()

The next step is to split the data into a training set (80% of the data) and a test set (20% of the data). But before that, let's check the number of rows in our dataset.

In [None]:
qt_lines = len(close_price)
qt_train = round(.8 * qt_lines)
qt_test = qt_lines - qt_train

print(f"Quantity of whole set: {qt_lines}")
print(f"Quantity of training set: {qt_train}")
print(f"Quantity of test set: {qt_test}")

We will also normalize the data to the range 0 to 1 before splitting the sets. This step is important because neural networks train more reliably when their inputs lie in a small, consistent range, whereas raw prices can span very different magnitudes. We will use the ```fit_transform``` method, which computes the minimum and maximum values of the data and then rescales every value so that the minimum maps to 0 and the maximum maps to 1.

In [None]:
# Data normalization
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(close_price)
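
As a rough illustration of what min-max scaling does, the sketch below reimplements the formula by hand with NumPy on made-up prices. The helper names `min_max_scale` and `min_max_unscale` are hypothetical. Unlike the cell above, the minimum and maximum are taken from the training portion only; this is a common variant (an assumption here, not the notebook's method) that keeps test-set information out of the preprocessing.

```python
import numpy as np

# Toy closing prices (illustrative values, not real market data)
prices = np.array([[100.0], [120.0], [110.0], [150.0], [160.0]])

# Assumption: compute min and max on the first 80% of rows only
n_train = round(0.8 * len(prices))
p_min = prices[:n_train].min()
p_max = prices[:n_train].max()

def min_max_scale(x):
    """Map x so that p_min -> 0 and p_max -> 1."""
    return (x - p_min) / (p_max - p_min)

def min_max_unscale(x_scaled):
    """Invert min_max_scale, returning values in price units."""
    return x_scaled * (p_max - p_min) + p_min

scaled = min_max_scale(prices)
restored = min_max_unscale(scaled)

print(scaled.ravel())    # the last value lies above 1: it exceeds the training maximum
print(restored.ravel())  # back to the original prices
```

With this variant, test values outside the training range fall outside the interval from 0 to 1, which is expected.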

When splitting the sets, we want to maintain the order of the time series, because in an LSTM application each value is intrinsically related to the values that precede it.

In [None]:
# Splitting the sets
train_set = scaled_data[:qt_train]
test_set = scaled_data[qt_train: qt_train+qt_test]
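
A minimal sketch of the chronological split on a toy series (the names `series`, `n_train`, `train` and `test` are illustrative only). Every training row comes before every test row, which would not hold if the rows were shuffled first.

```python
import numpy as np

# Toy "time series": the value at each position is simply its time index
series = np.arange(10).reshape(-1, 1)

n_train = round(0.8 * len(series))
train = series[:n_train]   # the first 80% of the rows, in order
test = series[n_train:]    # the remaining 20%, strictly later in time

print(train.ravel())
print(test.ravel())
```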

The function below formats the training and test sets the way the network expects to receive them. You do not need to understand every detail; just keep in mind that for each target value _v_ in a set, the model will use the _p_ values preceding it to predict _v_ (_p_ is the number of days the model looks back in time), and that the sets are converted to _NumPy arrays_ so that they can be fed to the LSTM model.

In [None]:
def prepare_data(dataset, prediction_length=1):

  x_data, y_data = [], []

  for i in range(prediction_length, len(dataset)):
    x_data.append(dataset[i-prediction_length:i, 0])    # independent training variables
    y_data.append(dataset[i, 0])                        # dependent target variables

  return np.array(x_data), np.array(y_data)
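
To make the windowing concrete, here is a small sketch that applies the same idea to a toy series. The function `make_windows` and the parameter name `lookback` are hypothetical stand-ins that mirror `prepare_data` and `prediction_length`.

```python
import numpy as np

def make_windows(dataset, lookback):
    """Same windowing idea as prepare_data, rewritten for a toy example."""
    x_data, y_data = [], []
    for i in range(lookback, len(dataset)):
        x_data.append(dataset[i - lookback:i, 0])  # the `lookback` previous values
        y_data.append(dataset[i, 0])               # the value to predict
    return np.array(x_data), np.array(y_data)

toy = np.arange(6, dtype=float).reshape(-1, 1)  # column vector 0, 1, ..., 5
x_toy, y_toy = make_windows(toy, lookback=3)

print(x_toy)  # rows: [0 1 2], [1 2 3], [2 3 4]
print(y_toy)  # targets: 3, 4, 5
```

Note that a series of length 6 with a look-back of 3 yields only 3 samples: the first `lookback` values have no complete window before them.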

In this case, we will have the model look back <strong>60</strong> days to predict the price of the following day.

Complete the following lines of code that will call the function above.

In [None]:
# TO DO
prediction_length = ####
X_train, Y_train = ###
X_test, Y_test = ###

<details>
  <summary> Solution </summary>
  <pre>
prediction_length = 60
X_train, Y_train = prepare_data(train_set, prediction_length)
X_test, Y_test = prepare_data(test_set, prediction_length)</pre>
</details>

In [None]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

As you can see above, the ```X_train``` and ```X_test``` datasets are still 2-dimensional at this point. However, an LSTM network expects 3-dimensional inputs with the following dimensions:

1) Number of samples 

2) Number of time steps

3) Number of features

So for this last step, you will have to reshape the data to fit in the input of the model. Complete the parameters of the ```reshape``` method (whose documentation can be found <a href="https://numpy.org/doc/stable/reference/generated/numpy.reshape.html">here</a>).

In [None]:
# TO DO
X_train = X_train.reshape(#samples, #time steps, #features)
X_test = X_test.reshape(#samples, #time steps, #features)

<details>
  <summary> Hint 1 </summary>
  <pre>
For the number of samples, just check the size of the first dimension of the datasets.</pre>
</details>
<details>
  <summary> Hint 2 </summary>
  <pre>
For the number of time steps, just pick the number of days you previously set the model to look back at.</pre>
</details>
<details>
  <summary> Hint 3 </summary>
  <pre>
For the number of features, remember the number of columns you reduced the DataFrame to in the beginning of the notebook.</pre>
</details>
<details>
  <summary> Solution </summary>
  <pre>
X_train = X_train.reshape(X_train.shape[0], prediction_length, 1)
X_test = X_test.reshape(X_test.shape[0], prediction_length, 1)</pre>
</details>
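
The reshape can also be checked on a small, made-up array. The sketch below (with illustrative names `x_2d` and `x_3d`) only adds a trailing axis of size 1; the values themselves are unchanged.

```python
import numpy as np

# Hypothetical 2-D input: 4 samples, each a window of 3 past values
x_2d = np.arange(12, dtype=float).reshape(4, 3)

# Add a trailing axis of size 1: one feature (the closing price) per time step
x_3d = x_2d.reshape(x_2d.shape[0], x_2d.shape[1], 1)

print(x_2d.shape)  # (4, 3)
print(x_3d.shape)  # (4, 3, 1)
```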

## 3. LSTM Model

In this section, we will build, train and validate an LSTM recurrent neural network. You have already watched the video that explains how the LSTM model works, so we will not go into much detail here. Luckily, TensorFlow will do almost all the work for us.

First, we will group a linear stack of layers into a TensorFlow model using the ```Sequential``` class.

In [None]:
model = Sequential()

Then, we will add two consecutive LSTM layers with 50 units each. The first LSTM layer sets ```return_sequences``` to <strong>True</strong>, so that it outputs its hidden state at every time step; the next LSTM layer needs this full sequence as its input. In the first layer we must also specify the input shape of the data, which is a tuple containing the number of time steps (the days to look back) and the number of features (1, the closing price).

The second and last LSTM layer only needs to pass its output at the final time step to the dense layers, so it should not return the full sequence.

In [None]:
model.add(LSTM(units=50, return_sequences=True, input_shape=(prediction_length, 1)))

# TO DO 
model.add(#second LSTM layer)

<details>
  <summary> Solution </summary>
  <pre>
model.add(LSTM(units=50, return_sequences=False))</pre>
</details>

Finally, we add two fully connected layers with 25 and 1 units, respectively, to the network. The last of these layers produces the final output of the model.

In order to add the layers, take a look at the documentation for the TensorFlow Dense layer <a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense">here</a>.

In [None]:
# TO DO
model.add(#First fc layer)
model.add(#Second fc layer)

<details>
  <summary> Solution </summary>
  <pre>
model.add(Dense(units=25))
model.add(Dense(units=1))</pre>
</details>

We will compile our model using the <strong>adam</strong> optimizer and we define the loss function to be the <strong>mean squared error</strong>.

In [None]:
model.compile(optimizer="adam", loss="mse")
model.summary()
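
The parameter counts reported by `model.summary()` can be reproduced by hand. The sketch below assumes the standard Keras layout (four gate blocks per LSTM layer, biases enabled everywhere) and the layer sizes used in this notebook; the helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(units, input_dim):
    """Weights of one LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """Weights of one Dense layer: a weight matrix plus one bias per unit."""
    return units * input_dim + units

layer_params = [
    lstm_params(units=50, input_dim=1),    # first LSTM reads 1 feature per time step
    lstm_params(units=50, input_dim=50),   # second LSTM reads the 50 outputs of the first
    dense_params(units=25, input_dim=50),
    dense_params(units=1, input_dim=25),
]
total_params = sum(layer_params)

print(layer_params)
print(total_params)
```

Most of the parameters sit in the second LSTM layer, because its input has 50 dimensions instead of 1.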

After that, we are going to train our model for 25 epochs with a batch size of 32. Note that ```validation_data``` is simply the data on which the loss of the model is evaluated at the end of each epoch; <strong>the model is not trained on this data</strong> (more details about the ```model.fit``` method can be found <a href="https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit">here</a>).

This step should take no longer than 3 minutes.

In [None]:
# Training the model
validation = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=25, batch_size=32)
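
To get a feeling for what these settings mean, the following sketch counts the weight updates performed during training. The sample count is a made-up placeholder (the real one is `X_train.shape[0]`), and `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches per epoch; the last batch may be smaller than the others."""
    return math.ceil(n_samples / batch_size)

n_samples = 1500   # hypothetical number of training windows
batch_size = 32
epochs = 25

steps = updates_per_epoch(n_samples, batch_size)
print(steps)           # batches (weight updates) per epoch
print(steps * epochs)  # weight updates over the whole training run
```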

We can visualize the training and validation losses (mean squared error) to check how our model behaved.

In [None]:
# Visualizing the loss of the model
plt.plot(validation.history["loss"], label="Training loss")
plt.plot(validation.history["val_loss"], label="Validation loss")
plt.legend()
plt.show()

Now that the training has ended, we make predictions using the test set.

In [None]:
# Making the prediction
predicted_prices = model.predict(X_test)

Remember that the values predicted by the model are normalized, so we need to convert them back to actual prices with the ```inverse_transform``` method.

In [None]:
predicted_prices = scaler.inverse_transform(predicted_prices)

After that, we can add the generated data to the original DataFrame containing the closing prices. 

In [None]:
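# Build the new column first, then attach it (avoids chained assignment in pandas)
prediction_column = np.full(len(close_price), np.nan)
prediction_column[-len(predicted_prices):] = predicted_prices.ravel()
close_price = close_price.assign(Prediction=prediction_column)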

We can now visualize the price curves in the training, test, and prediction sets.

In [None]:
plt.figure(figsize=(12, 6))
plt.title(f"{company} Stock prices", fontsize=18)
plt.xlabel("Date", fontsize=18)
plt.ylabel("Price", fontsize=18)
plt.plot(close_price["Close"].iloc[:len(close_price)-len(predicted_prices)])
plt.plot(close_price["Close"].iloc[len(close_price)-len(predicted_prices):])
plt.plot(close_price["Prediction"])
plt.legend(["Training", "Validation", "Prediction"], loc="upper left")
plt.grid()
plt.show()
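
Beyond the visual check, the error can be quantified. The sketch below computes the root mean squared error (RMSE) in price units and compares it with a naive baseline that predicts tomorrow's price as today's price. All numbers are made up for illustration, and `rmse` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the inputs."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up prices for illustration only
actual = np.array([100.0, 102.0, 101.0, 105.0, 107.0])
model_pred = np.array([99.0, 100.5, 102.0, 102.5, 105.0])

# Naive baseline: tomorrow's price equals today's price
naive_pred = actual[:-1]

model_error = rmse(actual[1:], model_pred[1:])
naive_error = rmse(actual[1:], naive_pred)

print(model_error)  # error of the (hypothetical) model
print(naive_error)  # error of the naive baseline
```

In this notebook, the same helper could be applied to the last `len(predicted_prices)` closing prices and `predicted_prices.ravel()`. A model is only useful if it clearly beats the naive baseline.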

## 4. Future Predictions

In this last section, we are going to predict the stock price for the next 10 days. This block of code was adapted from <a href="https://youtu.be/CvfAx3_nGME">this video</a>, and once again, you do not need to understand every detail. The key idea is that the model always receives an input window of length 60: at each of the 10 iterations, the oldest value in the window is dropped and the value predicted in the previous iteration is appended at the end. The loop stops once the requested number of future days has been reached.

In [None]:
# Getting the last `prediction_length` scaled values of the test set
output_set = test_set[-prediction_length:, 0].tolist()

# Loop for predicting the following 10 days, one day at a time
pred_output = []
future_days = 10

for i in range(future_days):
  # The model expects an input of shape (samples, time steps, features)
  input_set = np.array(output_set).reshape((1, prediction_length, 1))
  pred = model.predict(input_set, verbose=0)
  print(f"Day {i + 1}. Predicted value (scaled) -> {pred[0, 0]}")
  pred_output.extend(pred.tolist())

  # Sliding the window: drop the oldest value, append the newest prediction
  output_set = output_set[1:] + [float(pred[0, 0])]

# Transforming the output to the corresponding prices
future_pred = scaler.inverse_transform(pred_output)
future_pred = np.array(future_pred).reshape(1, -1)[0].tolist()

# Creating a DataFrame indexed by the following business days
future_dates = pd.date_range(close_price.index[-1] + dt.timedelta(days=1), periods=future_days, freq="B")
df_future = pd.DataFrame({"Future": future_pred}, index=future_dates)
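
The sliding-window mechanism can be isolated from the neural network. In the sketch below, a trivial stand-in "model" (the mean of the window) replaces the LSTM so that the loop can be followed by hand; `recursive_forecast` and `mean_model` are hypothetical names used only for this illustration.

```python
def recursive_forecast(history, lookback, steps, predict_next):
    """Forecast `steps` values, feeding each prediction back into the window."""
    window = list(history[-lookback:])
    forecasts = []
    for _ in range(steps):
        next_value = predict_next(window)
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # drop the oldest, append the newest
    return forecasts

# Stand-in "model": the mean of the window (an assumption for illustration)
def mean_model(window):
    return sum(window) / len(window)

forecasts = recursive_forecast([1.0, 2.0, 3.0], lookback=3, steps=2, predict_next=mean_model)
print(forecasts)  # first the mean of [1, 2, 3], then the mean of [2, 3, 2.0]
```

Because every new prediction is built on earlier predictions, errors can accumulate as the forecast horizon grows.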

We can now visualize the price curves in the test, prediction, and future sets.

In [None]:
plt.figure(figsize=(12, 6))
plt.title(f"{company} Stock prices", fontsize=18)
plt.xlabel("Date", fontsize=18)
plt.ylabel("Price", fontsize=18)
plt.plot(close_price["Close"].iloc[len(close_price)-len(predicted_prices):])
plt.plot(close_price["Prediction"])
plt.plot(df_future)
plt.legend(["Validation", "Prediction", "Future"], loc="upper left")
plt.grid()
plt.show()

Try to assess the performance of the model by yourself. Is the model predicting well? Is the predicted curve accurate, too steep, or too delayed?

If you want to, you can change the company at the beginning of the notebook, train the model again and then check its performance. For example, try it with AMZN, AAPL or GOOGL.

## 5. MCQ

<b>1. Which statement(s) is (are) true about LSTM (long short-term memory) networks?</b>

<div>
  A. <input type="checkbox">
  <label>LSTM is a kind of recurrent neural network.</label>
</div>

<div>
  B. <input type="checkbox">
  <label>The input of an LSTM layer is expected to be 2-dimensional.</label>
</div>

<div>
  C. <input type="checkbox">
  <label>Gradient flow during backpropagation can be regulated to mitigate vanishing and exploding gradients, which helps retain long- and short-term memory.</label>
</div>

<div>
  D. <input type="checkbox">
  <label>LSTM is normally augmented by recurrent gates called "forget gates", which allow selective forgetting.</label>
</div>

<br>
<details>
  <summary> Answer </summary>
ACD
</details>

<b>2. Which statement(s) is (are) true about the advantages of LSTM?</b>

<div>
  A. <input type="checkbox">
  <label>It is well suited to problems involving time series.</label>
</div>

<div>
  B. <input type="checkbox">
  <label>In general, it needs less training time than a CNN (Convolutional Neural Network).</label>
</div>

<div>
  C. <input type="checkbox">
  <label>It is capable of dealing with sequences of any length.</label>
</div>

<br>
<details>
  <summary> Answer </summary>
A
</details>

<b>3. What do the dimensions of the LSTM input consist of?</b>

<div>
  A. <input type="checkbox">
  <label>Number of samples.</label>
</div>

<div>
  B. <input type="checkbox">
  <label>Number of time steps.</label>
</div>

<div>
  C. <input type="checkbox">
  <label>Number of nodes.</label>
</div>

<div>
  D. <input type="checkbox">
  <label>Number of features.</label>
</div>

<br>
<details>
  <summary> Answer </summary>
ABD
</details>

<b>4. Which statement(s) is (are) true about the application in this notebook?</b>

<div>
  A. <input type="checkbox">
  <label>The reshape method is used to change the time steps of data.</label>
</div>

<div>
  B. <input type="checkbox">
  <label>The loss function is defined to be the mean squared error.</label>
</div>

<div>
  C. <input type="checkbox">
  <label>The input set grows with each prediction in order to improve the results.</label>
</div>

<div>
  D. <input type="checkbox">
  <label>The predicted values could be used as input data in future iterations.</label>
</div>

<br>
<details>
  <summary> Answer </summary>
BD
</details>

<b>5. What can you conclude about the quality of the final results obtained in this notebook?</b>

<div>
  A. <input type="checkbox">
  <label>Using the closing price as the only input feature of the model is enough to predict the stock market with a low error rate.</label>
</div>

<div>
  B. <input type="checkbox">
  <label>Results are still fairly poor. We can capture only upward and downward trends, not the precise variations of future stock prices.</label>
</div>

<br>
<details>
  <summary> Answer </summary>
B
</details>