This guide is based on notes from this TensorFlow 2.0 course and is organized as follows:

- Building a Deep Q-Learning Trading Network
- Stock Market Data Preprocessing
- Training our Deep Q-Learning Trading Agent
- Summary: Deep Reinforcement Learning for Trading with TensorFlow 2.0

# 1. Building a Deep Q-Learning Trading Network

In [1]:
# Importing the libraries
import math
import random
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas_datareader as data_reader
import yfinance as yf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.optimizers import Adam

from tqdm import tqdm_notebook, tqdm
from collections import deque

## Defining our Deep Q-Learning Trader

Now we need to define the algorithm itself with the `AI_Trader` class. Below are a few important points:

- In trading we have an action space of 3: Buy, Sell, and Sit
- We set the experience replay memory to a `deque` with a maximum length of 2000 elements
- We create an empty list called `inventory`, which will hold the stocks we've already bought
- We set the `gamma` parameter to `0.95`; this discount factor controls how much future rewards count relative to the immediate reward, so a value close to 1 favors long-term reward
- The `epsilon` parameter is used to determine whether we should take a random action or use the model to choose the action. We start by setting it to 1.0 so that the agent takes random actions in the beginning, when the model is not yet trained.
- Over time we want to decrease the random actions and mostly use the trained model instead, so we set `epsilon_final` to 0.01
- We then set the speed at which epsilon decreases with the `epsilon_decay` parameter
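To get a feel for how quickly exploration fades with these values, the decay schedule can be sketched on its own. This is a minimal illustration, not part of the course code: `steps_until_final` is a hypothetical helper, and it assumes epsilon is multiplied by `epsilon_decay` once per training call, as the training function later in this guide does.

```python
# Hypothetical helper (not in the original notebook): counts how many
# multiplicative decay steps it takes for epsilon to fall from its starting
# value to at or below epsilon_final.
def steps_until_final(epsilon=1.0, epsilon_final=0.01, epsilon_decay=0.995):
    steps = 0
    while epsilon > epsilon_final:
        epsilon *= epsilon_decay
        steps += 1
    return steps

decay_steps = steps_until_final()
print("Decay steps until epsilon <= epsilon_final:", decay_steps)
```

With the defaults above the count comes out at a bit over 900 steps, so if training runs once per timestep the agent stops exploring heavily well within the first pass over a dataset with several thousand daily prices.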

In [2]:
class AI_Trader():
  
  def __init__(self, state_size, action_space=3, model_name="AITrader"):
    
    self.state_size = state_size
    self.action_space = action_space
    self.memory = deque(maxlen=2000)
    self.inventory = []
    self.model_name = model_name
    
    self.gamma = 0.95
    self.epsilon = 1.0
    self.epsilon_final = 0.01
    self.epsilon_decay = 0.995

## Defining the Neural Network

Next we need to start defining our neural network.

The first step in defining our neural network is to create a method called `model_builder`, which takes no arguments other than `self`.

We then define the model with `tf.keras.models.Sequential()`.

The model's states are the stock prices of the previous `n` days.

Since a state is just a vector of numbers, we can use a fully connected (dense) network.

Next, we add the first dense layer with `tf.keras.layers.Dense()` and specify the number of neurons in the layer to 32 and set the activation to `relu`. We also need to define the input shape in the first layer with `input_dim=self.state_size`

We're going to use 3 hidden layers in this network, so we add 2 more and change the number of neurons to 64 in the second layer and 128 in the last.

We then need to define the output layer and compile the network.

To define the output layer we need to set the number of neurons to the number of actions we can take, 3 in this case. We also change the activation function to `linear`, because the outputs are unbounded Q-value estimates and we're using mean-squared error for the loss:

In [3]:
def model_builder(self):
      
      model = Sequential()
      
      model.add(Dense(units=32, activation='relu', input_dim=self.state_size))
      model.add(Dense(units=64, activation='relu'))
      model.add(Dense(units=128, activation='relu'))
      model.add(Dense(units=self.action_space, activation='linear'))

Finally, we need to compile the model. Since this is a regression task, accuracy is not a meaningful measure, so we use `mse` as the loss. We then use the `Adam` optimizer, set the learning rate to 0.001, and return the model:

In [4]:
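# These lines complete model_builder; they stay commented out here because
# they only make sense inside the function body.
# model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
# return model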

## Building a Trading Function

Now that we've defined the neural network, we need to build a function to trade that takes the state as input and returns an action to perform in that state.

To do this we're going to create a function called `trade` that takes in one argument: `state`.

For each state, we need to determine if we should use a randomly generated action or the neural network.

To do this, we draw a random number with `random.random()`, and if it is less than or equal to our `epsilon` we return a random action with `random.randrange()`, passing in `self.action_space`.

If the number is greater than `epsilon` we use our model to choose the action. To do this, we set `actions` equal to `self.model.predict` and pass in the state as the argument.

We then use `np.argmax` to return a single number: the index of the action with the highest predicted Q-value.

To summarize:

- The function takes the state as input and generates a random number
- If the number is less than or equal to epsilon it will generate a random action (this will always be the case in the beginning)
- If it is greater than epsilon it will use the model to perform a prediction on the input state and return the action that has the highest predicted Q-value
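The selection rule summarized above can be tried without a trained network. In the sketch below, `epsilon_greedy` is a hypothetical standalone function and the list `q_values` is a made-up stand-in for `self.model.predict(state)[0]`; the pure-Python `max` call plays the role of `np.argmax`.

```python
import random

def epsilon_greedy(q_values, epsilon, action_space=3):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() <= epsilon:
        return random.randrange(action_space)
    # Index of the largest Q-value (what np.argmax would return)
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(0)  # only to make the illustration repeatable

# Made-up Q-values, one per action; stands in for model.predict(state)[0]
q_values = [0.1, 0.7, 0.2]

greedy_action = epsilon_greedy(q_values, epsilon=0.0)   # always exploits
explore_action = epsilon_greedy(q_values, epsilon=1.0)  # always explores
```

With `epsilon=0.0` the function falls through to the greedy branch and picks index 1, the largest entry. With `epsilon=1.0` it ignores the Q-values entirely.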

In [5]:
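def trade(self, state):
      # Explore: take a random action with probability epsilon
      if random.random() <= self.epsilon:
          return random.randrange(self.action_space)
      
      # Exploit: take the action with the highest predicted Q-value
      actions = self.model.predict(state)
      return np.argmax(actions[0])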

## Training the Model

Now that we've implemented the `trade` function let's build a custom training function.

This function will take a batch of saved data and train the model on it. Below is a step-by-step process to do this:

- We define this function as `batch_train` and it will take `batch_size` as an argument
- We select data from the experience replay memory by first setting `batch` to an empty list
- We then iterate through the memory with a for loop
- Since we're dealing with time series data we need to sample from the end of the memory instead of randomly sampling from it
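- Now that we have a batch of data we iterate through each sample in it (`state`, `action`, `reward`, `next_state`, and `done`) and train the model on it
- If the agent is not in a terminal state we calculate the discounted total reward as the current `reward` plus `gamma` times the highest Q-value the model predicts for `next_state`
- Next we define the `target` variable, which is the model's own prediction for `state` with the entry for the chosen `action` replaced by that discounted reward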
- Next we fit the model with `self.model.fit()`
- At the end of this function we want to decrease the epsilon parameter so that we slowly stop performing random actions
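The target calculation in the steps above is the standard Q-learning update, and it can be checked by hand on made-up numbers. In this sketch `q_target` is a hypothetical helper, and the two lists of Q-values are invented stand-ins for what `self.model.predict` would return for `next_state` and `state`.

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Target for one transition: r if terminal, else r + gamma * max Q(next_state)."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Invented Q-values standing in for model.predict(next_state)[0]
next_q_values = [0.5, 2.0, 1.0]
non_terminal = q_target(1.0, next_q_values, done=False)  # 1.0 + 0.95 * 2.0
terminal = q_target(1.0, next_q_values, done=True)       # just the reward

# Only the entry for the action actually taken is overwritten; the other
# entries keep the model's own predictions, so they contribute no error.
predicted_q = [0.3, 0.4, 0.1]  # stands in for model.predict(state)[0]
action = 2
target = list(predicted_q)
target[action] = non_terminal
```

Leaving the other two entries unchanged is what lets a plain mean-squared error loss update the network only through the action that was taken.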

In [6]:
def batch_train(self, batch_size):

    batch = []
    for i in range(len(self.memory) - batch_size + 1, len(self.memory)):
      batch.append(self.memory[i])

    for state, action, reward, next_state, done in batch:
      reward = reward
      if not done:
        reward = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

      target = self.model.predict(state)
      target[0][action] = reward

      self.model.fit(state, target, epochs=1, verbose=0)

    if self.epsilon > self.epsilon_final:
      self.epsilon *= self.epsilon_decay
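
To see which transitions the index range in `batch_train` selects, here is a small sketch that uses integers as stand-ins for stored transitions. The tiny memory and `batch_size = 4` are made up for illustration.

```python
from collections import deque

memory = deque(range(10), maxlen=2000)  # integers stand in for transitions
batch_size = 4

# Same index range as in batch_train above
batch = [memory[i] for i in range(len(memory) - batch_size + 1, len(memory))]

# Starting one index earlier would yield exactly batch_size items
full_batch = [memory[i] for i in range(len(memory) - batch_size, len(memory))]
```

As written, the range picks the most recent `batch_size - 1` transitions. Whether that is intended or an off-by-one is not stated in the course notes; starting the range at `len(self.memory) - batch_size` would give a full batch.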

# 2. Stock Market Data Preprocessing

Now that we've built our `AI_Trader` class we need to create a few helper functions that will be used in the learning process.

In particular, we need to define the following 3 functions:

1. sigmoid - sigmoid is an activation function, generally used at the end of a network for binary classification as it scales a number to a range from 0 to 1. Here it will be used to normalize the differences between consecutive stock prices.

In [7]:
def sigmoid(x):
  return 1 / (1 + math.exp(-x))

2. stocks_price_format - this is a formatting function to print out the prices of the stocks we bought or sold.

In [8]:
def stocks_price_format(n):
  if n < 0:
    return "- $ {0:2f}".format(abs(n))
  else:
    return "$ {0:2f}".format(abs(n))

3. dataset_loader - this function connects with a data source and pulls the stock data from it. In this case we're loading data from Yahoo Finance:

In [9]:
def dataset_loader(stock_name):
  # dataset = data_reader.DataReader(stock_name, start="2010-01-01", end="2020-01-01", data_source="yahoo")
  # dataset = data_reader.DataReader(stock_name, data_source="yahoo")
  # start_date = str(dataset.index[0]).split()[0]
  # end_date = str(dataset.index[-1]).split()[0]
  dataset = yf.download(stock_name)
  close = dataset['Close']
  return close

Below we can take a look at the AAPL dataset. With this information we are going to build states for our network.

In [10]:
stock_name = "AAPL"
data = dataset_loader(stock_name)
data.head()

[*********************100%%**********************]  1 of 1 completed


Date
1980-12-12    0.128348
1980-12-15    0.121652
1980-12-16    0.112723
1980-12-17    0.115513
1980-12-18    0.118862
Name: Close, dtype: float64

## State Creator

Now that we have our dataset_loader function we need to create a function that takes this data and generates states from it.

Let's first look at how we can translate the problem of stock market trading to a reinforcement learning environment.

- Each point on a stock graph is just a floating number that represents a stock price at a given time.
- Our task is to predict what is going to happen in the next period, and as mentioned there are 3 possible actions: buy, sell, or sit.

Framed naively, this is a regression problem: with `window_size = 5` we would use the 5 most recent data points to predict a target that is a continuous number.

Instead of predicting real numbers for our target, we want to predict one of our 3 actions.

Next we're going to change our input states to be differences in stock prices, which will represent price changes over time.

To implement this in Python we're going to create a function `state_creator` which takes 3 arguments: `data`, `timestep`, and `window_size`:

- We first need to calculate the `starting_id`, the index where the window begins
- When `starting_id` is non-negative we slice the window directly from the data, and when it is negative (not enough history yet) we pad the front with the first price until the window reaches `window_size`
- Next we define an empty list called `state` and iterate through the `window_data` list
- As we append to the state we normalize each day-to-day price difference with the `sigmoid` function
- To complete the function we return a NumPy array of the state

In [11]:
def state_creator(data, timestep, window_size):
  starting_id = timestep - window_size + 1
  
  if starting_id >= 0:
    window_data = data[starting_id:timestep+1]
  else:
    window_data = - starting_id * [data[0]] + list(data[0:timestep+1])
  
  state = []
  for i in range(window_size - 1):
    state.append(sigmoid(window_data[i+1] - window_data[i]))
  
  return np.array([state])
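
To make the padding and differencing concrete, the sketch below runs the same logic on a short list of made-up prices. It is a standalone variant for illustration only: it returns a plain nested list rather than a NumPy array so that it has no dependencies, and the `prices` values are invented.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def state_creator(data, timestep, window_size):
    # Same logic as above, but returning a nested list instead of np.array
    starting_id = timestep - window_size + 1
    if starting_id >= 0:
        window_data = list(data[starting_id:timestep + 1])
    else:
        # Not enough history yet: pad the front with the first price
        window_data = -starting_id * [data[0]] + list(data[0:timestep + 1])
    return [[sigmoid(window_data[i + 1] - window_data[i])
             for i in range(window_size - 1)]]

prices = [10.0, 11.0, 10.5, 12.0, 12.0, 13.0]  # made-up prices

first_state = state_creator(prices, 0, 5)  # fully padded: every difference is 0
later_state = state_creator(prices, 4, 5)  # built from prices[0:5]
```

A window of 5 prices yields 4 differences, so the state has one fewer entry than `window_size`. An unchanged price maps to exactly 0.5, a rise maps above 0.5, and a fall maps below it.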

## Loading a Dataset

Now that we have our `state_creator` function we can load our dataset.

First we need to define a new variable called `stock_name`, and for this example we'll use `AAPL`.

Then we define a variable called `data` with our `dataset_loader` function:

In [12]:
stock_name = "AAPL"
data = dataset_loader(stock_name)
data.head()

[*********************100%%**********************]  1 of 1 completed


Date
1980-12-12    0.128348
1980-12-15    0.121652
1980-12-16    0.112723
1980-12-17    0.115513
1980-12-18    0.118862
Name: Close, dtype: float64

# 3. Training the Q-Learning Trading Agent

Before we proceed to training our model, let's define a few hyperparameters, including:

In [13]:
window_size = 10
episodes = 1000

batch_size = 32
data_samples = len(data) - 1

Putting the pieces from section 1 together, the complete `AI_Trader` class looks like this:

In [32]:
class AI_Trader():
    def __init__(self, state_size, action_space=3, model_name="AITrader"):
      self.state_size = state_size
      self.action_space = action_space
      self.memory = deque(maxlen=2000)
      self.inventory = []
      self.model_name = model_name
      
      self.gamma = 0.95
      self.epsilon = 1.0
      self.epsilon_final = 0.01
      self.epsilon_decay = 0.995
      self.model = self.model_builder()
        
    def model_builder(self):
        model = Sequential()
      
        model.add(Dense(units=32, activation='relu', input_dim=self.state_size))
        model.add(Dense(units=64, activation='relu'))
        model.add(Dense(units=128, activation='relu'))
        model.add(Dense(units=self.action_space, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
        
        return model
    
  
    def trade(self, state):
        if random.random() <= self.epsilon:
            return random.randrange(self.action_space)
      
        actions = self.model.predict(state, verbose=0)
        return np.argmax(actions[0])
    
    def batch_train(self, batch_size):
        batch = []
        for i in range(len(self.memory) - batch_size + 1, len(self.memory)):
            batch.append(self.memory[i])
            
        for state, action, reward, next_state, done in batch:
            reward = reward
            if not done:
                reward = reward + self.gamma * np.amax(self.model.predict(next_state, verbose=0)[0])
            target = self.model.predict(state, verbose=0)
            target[0][action] = reward
            
            self.model.fit(state, target, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_final:
            self.epsilon *= self.epsilon_decay

Now it's time to create our trading agent and take a look at a summary of the model:

In [33]:
trader = AI_Trader(window_size)
model = trader.model
model.summary()



Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_28 (Dense)            (None, 32)                352       
                                                                 
 dense_29 (Dense)            (None, 64)                2112      
                                                                 
 dense_30 (Dense)            (None, 128)               8320      
                                                                 
 dense_31 (Dense)            (None, 3)                 387       
                                                                 
Total params: 11171 (43.64 KB)
Trainable params: 11171 (43.64 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
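
The parameter counts in the summary can be reproduced by hand, which is a quick way to confirm that the input size matches `window_size`. A minimal sketch, where `dense_params` is a hypothetical helper and the layer sizes are taken from the model above:

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return (n_inputs + 1) * n_units

# state_size (window_size = 10), three hidden layers, action_space
layer_sizes = [10, 32, 64, 128, 3]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

The per-layer values and the total agree with the `Param #` column and the `Total params` line in the summary.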


## Defining a Training Loop

Now we need to train our model, which we're going to do with a `for` loop that will iterate through all of the `episodes`.

- First we print out the current episode
- We then need to define our initial state with `state_creator`
- Then we set `total_profit` to 0 and empty our inventory at the beginning of each episode with `trader.inventory = []`
- Next we loop over the timesteps (1 timestep is 1 day) with a for loop that runs over all of our samples. Inside it we define our `action`, `next_state`, and `reward`.
- Then we want to update our `inventory` based on the given `action`
- Based on the actions we can calculate our `reward` and update the `total_profit`
- We then need to check if this is the last sample in our dataset
- Next we need to append all of the data to our trader's experience replay buffer with `trader.memory.append()`
- We then change the state to the `next_state` so we can iterate through the whole `episode`
- Finally, we print out the `total_profit` if `done = True`, and add print statements for each buy or sell that show the price and, for sells, the profit

There are two more things to do before starting the training process:

- We need to check if we have more information in our `memory` than our `batch_size`. If that is true we call `trader.batch_train` and pass in the `batch_size` argument
- We're then going to check if the episode number is divisible by 10, and if that is the case we're going to save the model with `trader.model.save()` in an H5 file
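
The inventory, reward, and profit bookkeeping from the steps above can be exercised without a model by replaying a fixed sequence of actions. In this sketch `simulate` is a hypothetical helper and the prices and actions are made up. The action codes follow the training loop in this guide (1 = buy, 2 = sell), and anything else is treated as sit.

```python
def simulate(prices, actions):
    """Replay fixed actions (0 = sit, 1 = buy, 2 = sell) with a FIFO inventory."""
    inventory, rewards, total_profit = [], [], 0.0
    for price, action in zip(prices, actions):
        reward = 0.0
        if action == 1:
            inventory.append(price)
        elif action == 2 and inventory:
            buy_price = inventory.pop(0)        # oldest purchase is sold first
            reward = max(price - buy_price, 0)  # losses are clipped to zero reward
            total_profit += price - buy_price   # profit still records the loss
        rewards.append(reward)
    return rewards, total_profit

rewards, total_profit = simulate(
    [10.0, 12.0, 11.0, 15.0, 9.0, 8.0],  # made-up prices
    [1, 1, 2, 2, 1, 2],                  # buy, buy, sell, sell, buy, sell
)
```

The last trade loses 1.0, which lowers `total_profit` but produces a reward of 0 rather than a negative one. The agent is therefore never penalized directly for a losing sale, only denied a reward.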

In [None]:
stock_name = "AAPL"
data = dataset_loader(stock_name)

for episode in range(1, episodes + 1):
  
  print("Episode: {}/{}".format(episode, episodes))
  
  state = state_creator(data, 0, window_size + 1)
  
  total_profit = 0
  trader.inventory = []
  
  for t in tqdm(range(data_samples)):
    
    action = trader.trade(state)
    
    next_state = state_creator(data, t+1, window_size + 1)
    reward = 0
    
    if action == 1: #Buying
      trader.inventory.append(data[t])
      print("AI Trader bought: ", stocks_price_format(data[t]))
      
    elif action == 2 and len(trader.inventory) > 0: #Selling
      buy_price = trader.inventory.pop(0)
      
      reward = max(data[t] - buy_price, 0)
      total_profit += data[t] - buy_price
      print("AI Trader sold: ", stocks_price_format(data[t]), " Profit: " + stocks_price_format(data[t] - buy_price) )
      
    if t == data_samples - 1:
      done = True
    else:
      done = False
      
    trader.memory.append((state, action, reward, next_state, done))
    
    state = next_state
    
    if done:
      print("########################")
      print("TOTAL PROFIT: {}".format(total_profit))
      print("########################")
    
    if len(trader.memory) > batch_size:
      trader.batch_train(batch_size)
      
  if episode % 10 == 0:
    trader.model.save("ai_trader_{}.h5".format(episode))


Episode: 1/1000


  0%|          | 0/10889 [00:00<?, ?it/s]

AI Trader bought:  $ 0.121652
AI Trader bought:  $ 0.112723
AI Trader bought:  $ 0.115513
AI Trader bought:  $ 0.126116
AI Trader sold:  $ 0.132254  Profit: $ 0.010602
AI Trader sold:  $ 0.137835  Profit: $ 0.025112
AI Trader sold:  $ 0.158482  Profit: $ 0.042969
AI Trader sold:  $ 0.160714  Profit: $ 0.034598
AI Trader bought:  $ 0.152344
AI Trader bought:  $ 0.154018
AI Trader sold:  $ 0.143973  Profit: - $ 0.008371
AI Trader bought:  $ 0.137835
AI Trader bought:  $ 0.135045
AI Trader bought:  $ 0.142299
AI Trader sold:  $ 0.136719  Profit: - $ 0.017299
AI Trader bought:  $ 0.139509
AI Trader bought:  $ 0.146763
AI Trader bought:  $ 0.142299
AI Trader bought:  $ 0.145089
AI Trader sold:  $ 0.146763  Profit: $ 0.008928
AI Trader bought:  $ 0.142857
AI Trader bought:  $ 0.138393


  0%|          | 33/10889 [00:02<12:16, 14.74it/s]

AI Trader sold:  $ 0.118862  Profit: - $ 0.016183


  0%|          | 35/10889 [00:06<40:50,  4.43it/s]

AI Trader bought:  $ 0.123326


  0%|          | 36/10889 [00:08<58:05,  3.11it/s]

AI Trader sold:  $ 0.127790  Profit: - $ 0.014509


  0%|          | 37/10889 [00:10<1:19:22,  2.28it/s]

AI Trader sold:  $ 0.127790  Profit: - $ 0.011719


  0%|          | 38/10889 [00:12<1:45:00,  1.72it/s]

AI Trader sold:  $ 0.128348  Profit: - $ 0.018415


  0%|          | 39/10889 [00:14<2:14:10,  1.35it/s]

AI Trader sold:  $ 0.121652  Profit: - $ 0.020647


  0%|          | 41/10889 [00:18<3:19:25,  1.10s/it]

AI Trader bought:  $ 0.117746


  0%|          | 43/10889 [00:22<4:16:39,  1.42s/it]

AI Trader bought:  $ 0.113839


  0%|          | 44/10889 [00:24<4:40:29,  1.55s/it]

AI Trader bought:  $ 0.116629


  0%|          | 45/10889 [00:26<4:58:54,  1.65s/it]

AI Trader bought:  $ 0.121652


  0%|          | 46/10889 [00:28<5:16:36,  1.75s/it]

AI Trader sold:  $ 0.114397  Profit: - $ 0.030692


  0%|          | 47/10889 [00:30<5:28:46,  1.82s/it]

AI Trader bought:  $ 0.108259


  0%|          | 48/10889 [00:32<5:39:29,  1.88s/it]

AI Trader bought:  $ 0.109933


  0%|          | 51/10889 [00:38<5:58:03,  1.98s/it]

AI Trader sold:  $ 0.114397  Profit: - $ 0.028460


  0%|          | 52/10889 [00:40<5:58:54,  1.99s/it]

AI Trader bought:  $ 0.118304


  0%|          | 53/10889 [00:42<6:01:41,  2.00s/it]

AI Trader bought:  $ 0.118862


  0%|          | 54/10889 [00:44<6:02:45,  2.01s/it]

AI Trader bought:  $ 0.117188


  1%|          | 55/10889 [00:46<6:05:58,  2.03s/it]

AI Trader sold:  $ 0.116071  Profit: - $ 0.022322


  1%|          | 56/10889 [00:48<6:10:49,  2.05s/it]

AI Trader sold:  $ 0.115513  Profit: - $ 0.007813


  1%|          | 57/10889 [00:50<6:11:28,  2.06s/it]

AI Trader bought:  $ 0.114397


  1%|          | 58/10889 [00:52<6:12:05,  2.06s/it]

AI Trader bought:  $ 0.105469


  1%|          | 61/10889 [00:58<6:10:59,  2.06s/it]

AI Trader sold:  $ 0.100446  Profit: - $ 0.017300


  1%|          | 62/10889 [01:00<6:11:28,  2.06s/it]

AI Trader bought:  $ 0.099330


  1%|          | 63/10889 [01:02<6:14:41,  2.08s/it]

AI Trader bought:  $ 0.103237


  1%|          | 64/10889 [01:04<6:07:38,  2.04s/it]

AI Trader bought:  $ 0.108259


  1%|          | 65/10889 [01:06<6:06:21,  2.03s/it]

AI Trader sold:  $ 0.114955  Profit: $ 0.001116


  1%|          | 67/10889 [01:10<6:03:52,  2.02s/it]

AI Trader bought:  $ 0.114955


  1%|          | 68/10889 [01:13<6:07:22,  2.04s/it]

AI Trader sold:  $ 0.119420  Profit: $ 0.002791


  1%|          | 69/10889 [01:15<6:09:48,  2.05s/it]

AI Trader sold:  $ 0.118862  Profit: - $ 0.002790


  1%|          | 71/10889 [01:19<6:12:34,  2.07s/it]

AI Trader sold:  $ 0.114397  Profit: $ 0.006138


  1%|          | 72/10889 [01:21<6:13:24,  2.07s/it]

AI Trader sold:  $ 0.110491  Profit: $ 0.000558


  1%|          | 74/10889 [01:25<6:15:14,  2.08s/it]

AI Trader bought:  $ 0.109375


  1%|          | 75/10889 [01:27<6:14:03,  2.08s/it]

AI Trader sold:  $ 0.108259  Profit: - $ 0.010045


  1%|          | 78/10889 [01:33<6:16:09,  2.09s/it]

AI Trader sold:  $ 0.116071  Profit: - $ 0.002791


  1%|          | 79/10889 [01:36<6:19:40,  2.11s/it]

AI Trader bought:  $ 0.114955


  1%|          | 81/10889 [01:40<6:14:54,  2.08s/it]

AI Trader bought:  $ 0.122768


  1%|          | 84/10889 [01:46<6:13:28,  2.07s/it]

AI Trader bought:  $ 0.124442


  1%|          | 86/10889 [01:50<6:02:59,  2.02s/it]

AI Trader bought:  $ 0.111607


  1%|          | 87/10889 [01:52<6:01:04,  2.01s/it]

AI Trader sold:  $ 0.114955  Profit: - $ 0.002233


  1%|          | 89/10889 [01:56<5:57:44,  1.99s/it]

AI Trader bought:  $ 0.127232


  1%|          | 90/10889 [01:58<5:59:42,  2.00s/it]

AI Trader sold:  $ 0.130580  Profit: $ 0.016183


  1%|          | 91/10889 [02:00<6:03:32,  2.02s/it]

AI Trader sold:  $ 0.129464  Profit: $ 0.023995


  1%|          | 92/10889 [02:02<6:05:51,  2.03s/it]