# Investment Strategies with Deep Q-Network (DQN)

The aim of this Jupyter Notebook is to apply the Deep Q-Network (DQN) algorithm to investment.\
The first essential step is to create an environment in which the agent operates. The environment must expose the same methods and attributes as any Gymnasium environment.\
Specifically, given a market to analyze as input, it must return all the relevant elements of a typical environment, such as the available actions, the observation space, the `terminated` and `truncated` flags and, most importantly, the **REWARD** signal, which tells the agent what we want it to achieve.\
Once the environment is ready, we connect it to the DQN algorithm and, after some training, evaluate the reward it achieves in order to assess the quality of our model.
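
As a rough illustration of the interface described above, the sketch below mimics the Gymnasium `reset`/`step` contract in plain Python. `ToyTradingEnv`, its price list, and its one-share-per-action rule are hypothetical stand-ins invented for this example; it is not the `TradingEnv` class used in this notebook and it does not subclass `gymnasium.Env`.

```python
class ToyTradingEnv:
    """Hypothetical, dependency-free stand-in that mimics the Gymnasium reset/step contract."""

    BUY, HOLD, SELL = 0, 1, 2

    def __init__(self, prices, initial_capital=100.0):
        self.prices = list(prices)
        self.initial_capital = initial_capital

    def _value(self):
        # mark-to-market portfolio value at the current time step
        return self.cash + self.shares * self.prices[self.t]

    def reset(self, seed=None):
        # seed is accepted for API compatibility; this toy has no randomness
        self.t, self.cash, self.shares = 0, self.initial_capital, 0
        return self.prices[self.t], {}              # (observation, info)

    def step(self, action):
        price = self.prices[self.t]
        value_before = self._value()
        if action == self.BUY and self.cash >= price:
            self.cash -= price
            self.shares += 1
        elif action == self.SELL and self.shares > 0:
            self.cash += price
            self.shares -= 1
        self.t += 1
        reward = self._value() - value_before       # assumed reward: change in portfolio value
        terminated = False                          # this toy has no failure state
        truncated = self.t == len(self.prices) - 1  # the price series is exhausted
        return self.prices[self.t], reward, terminated, truncated, {}


toy_env = ToyTradingEnv(prices=[10.0, 11.0, 12.0, 11.0])
obs, info = toy_env.reset()
total, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, info = toy_env.step(ToyTradingEnv.BUY)
    total += reward
    done = terminated or truncated
print(total, toy_env.cash, toy_env.shares)
```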

It is important to point out that [Gymnasium](https://gymnasium.farama.org/) does not provide any predefined trading environment. All that is available are some 'unofficial' projects, which often lack correctness, rigorous methods, or documentation:

* [gym-anytrading](https://github.com/AminHP/gym-anytrading), a GitHub repository providing a possible trading environment.
* [tensortrade](https://github.com/tensortrade-org/tensortrade), a Python library for reinforcement learning applied to trading.
* [q-trader](https://github.com/edwardhdlu/q-trader?tab=readme-ov-file), an application of reinforcement learning to the stock market.

These three seem to be the most relevant sources available on the web but, as already noted, they lack proper documentation and a rigorous method.

In any case, the application of RL techniques to investment is already quite widespread, as mentioned in this [article](https://medium.com/ibm-data-ai/reinforcement-learning-the-business-use-case-part-2-c175740999), where prestigious firms such as IBM or J.P. Morgan already apply some of these techniques.

The first step is the design of the environment. I have designed my own, contained in a script called `tradingenv`, which defines our environment class, `TradingEnv`.\
It implements all the methods that make it compliant with Gymnasium environments. The most important thing to understand is how the **reward** signal is defined:

$$
reward = (cash_{t+1} + shares_{t+1} \cdot p_{t+1}) - (cash_{t} + shares_{t} \cdot p_{t})
$$

where:
* $reward$ is the reward obtained after taking the action at time step $t$
* $cash_{t}$ is the amount of cash held at time step $t$
* $shares_{t}$ is the number of shares held at time step $t$
* $p_{t}$ is the price of a single share at time step $t$
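
The reward definition above can be checked with a small worked example. The numbers are hypothetical, and the sketch assumes that a purchase executes at the current price $p_{t}$; the function names are illustrative and are not taken from `tradingenv`.

```python
def portfolio_value(cash, shares, price):
    """Mark-to-market value: cash plus the shares valued at `price`."""
    return cash + shares * price


def step_reward(cash_t, shares_t, p_t, cash_t1, shares_t1, p_t1):
    """Reward = portfolio value at time step t+1 minus portfolio value at time step t."""
    return portfolio_value(cash_t1, shares_t1, p_t1) - portfolio_value(cash_t, shares_t, p_t)


# Hypothetical scenario: the agent holds 1000 in cash and 2 shares priced at 100,
# buys one more share at 100, and the price then rises to 105.
reward = step_reward(cash_t=1000.0, shares_t=2, p_t=100.0,
                     cash_t1=900.0, shares_t1=3, p_t1=105.0)
print(reward)  # 15.0: three shares each gained 5
```

With no shares held, a price move leaves the portfolio value unchanged, so the reward is zero regardless of the market.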

## Implementation with the Notebook

Import all the necessary libraries.

In [None]:
# import our custom environment
# NOTE: the script "tradingenv.py" must be importable (for example, placed next to this notebook)
from tradingenv import TradingEnv

import torch #for Neural Networks
import numpy as np #for numerical operations
import random #for random number generation
import joblib #for saving and loading the replay memory
import matplotlib.pyplot as plt #for plotting the results

Example of how to use the environment:

In [None]:
""" example of how to use the described environment
env = TradingEnv("AAPL", "1d", sliding_window=10, start_date="2020-01-01", end_date="2021-01-01", initial_capital=10000)
obs, info = env.reset()
action = env.action_space.sample()  # e.g. 0=buy, 1=hold, 2=sell
obs, reward, terminated, truncated, info = env.step(action)
"""

Instantiation of the trading environment with the stock ticker of interest.\
Note that you can select the granularity through the interval parameter of the `yfinance` library, which represents the time scale over which you want to operate.\
Supported time scales:

#### Short-term trading (intraday)

* `1m` – 1 minute
* `2m` – 2 minutes
* `5m` – 5 minutes
* `15m` – 15 minutes
* `30m` – 30 minutes
* `60m` – 60 minutes
* `90m` – 90 minutes
* `1h` – 1 hour

#### Long-term trading (daily and longer)
* `1d` – 1 day
* `5d` – 5 days
* `1wk` – 1 week
* `1mo` – 1 month
* `3mo` – 3 months

In [None]:
# instantiate the environment
env = TradingEnv("AAPL", "1d", sliding_window=10, start_date="2020-01-01", end_date="2020-05-06", initial_capital=30000)

state_size = env.observation_space.shape
action_size = env.action_space.n

print('State Space: ', state_size)
print('Action Space: ', action_size)


Show the time series of interest that we are analyzing. Note that closing prices are used as prices.

In [None]:
env.show_data()

We fix the seed of the random number generator in order to reproduce the results

In [None]:
# set the seeds for reproducibility of results
seed = 34

torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)

env.reset(seed=seed)
env.action_space.seed(seed)
env.observation_space.seed(seed)

Baseline random policy, used as a reference to understand whether we are actually learning something.

In [None]:
def random_pi(state):
    # selects an action uniformly at random
    # from the environment's action space.
    return env.action_space.sample()

If available on the current machine, we take advantage of the **GPU**.

In [None]:
# pick the best available backend: CUDA first, then Apple MPS, otherwise the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
# print the used device
print(f"Using device: {device}")

Experience Replay Buffer, used to create semi-independent experiences.

In [None]:
# define the structured dtype for an experience tuple
experience_type = np.dtype([
    ('state',      np.float32, state_size),   # current state
    ('action',     np.int8),                  # action taken
    ('reward',     np.float32),               # reward received
    ('next_state', np.float32, state_size),   # next state
    ('failure',    np.int8)                   # terminal flag (1 if done)
])

In [None]:
# Set the replay memory size hyperparameter
memory_size = 100000

# Create the replay memory
replay_memory = {
    'size': memory_size,
    'buffer': np.empty(shape=(memory_size,), dtype=experience_type),
    'index': 0,
    'entries': 0
}

Function to store a new experience in the replay buffer

In [None]:
def store_experience(experience):
    # store the experience in the buffer
    replay_memory['buffer'][replay_memory['index']] = experience

    # update the number of experiences in the buffer
    replay_memory['entries'] = min(replay_memory['entries'] + 1, replay_memory['size'])

    # update index, if the memory is full, start from the beginning
    replay_memory['index'] += 1
    replay_memory['index'] = replay_memory['index'] % replay_memory['size']

Function to sample a mini-batch of experiences from the replay buffer defined above

In [None]:
# Set the batch size for sampling experiences
batch_size = 32

# function to sample a batch of experiences from the replay memory
def sample_experiences():

    # select uniformly at random a batch of experiences from the memory
    idxs = np.random.choice(range(replay_memory['entries']), batch_size, replace=False)

    # return the batch of experiences
    experiences = replay_memory['buffer'][idxs]

    return experiences
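
To make the wrap-around behaviour of the replay memory easier to see, here is a minimal pure-Python sketch of the same idea. The `RingBuffer` class is a hypothetical illustration, not the NumPy structured array used in this notebook; it also adds an explicit guard for the case where fewer experiences are stored than the requested batch size.

```python
import random


class RingBuffer:
    """Fixed-size replay memory that overwrites the oldest entry once it is full."""

    def __init__(self, size):
        self.size = size
        self.buffer = [None] * size
        self.index = 0      # next write position
        self.entries = 0    # number of valid experiences stored

    def store(self, experience):
        self.buffer[self.index] = experience
        self.entries = min(self.entries + 1, self.size)
        self.index = (self.index + 1) % self.size

    def sample(self, batch_size, rng=random):
        # sample without replacement, only among the valid entries
        if batch_size > self.entries:
            raise ValueError("not enough experiences stored yet")
        idxs = rng.sample(range(self.entries), batch_size)
        return [self.buffer[i] for i in idxs]


memory = RingBuffer(size=3)
for t in range(5):              # store 5 experiences in a buffer of size 3
    memory.store(("state", t))

print(memory.buffer)            # experiences 3 and 4 have overwritten 0 and 1
print(memory.index, memory.entries)
```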

### Neural Network architecture, it outputs $Q(s,a)$ for all possible actions

In [None]:
#hyperparameters for the neural network architecture to be tweaked
first_hidden_layer= 512
second_hidden_layer= 128

def create_network():

      # Define a deep neural network using Sequential:
      # Each layer feeds directly into the next one.
      dnn = torch.nn.Sequential( 
            # First fully connected layer maps state inputs to the first hidden layer
            torch.nn.Linear(state_size[0], first_hidden_layer),
            
            # ReLU activation introduces nonlinearity
            torch.nn.ReLU(),
            
            # Second fully connected layer (second hidden layer)
            torch.nn.Linear(first_hidden_layer, second_hidden_layer),
            
            # Another ReLU activation
            torch.nn.ReLU(),
            
            # Output layer: one unit per possible action
            # Produces Q-values
            torch.nn.Linear(second_hidden_layer, action_size)
      )
    
      # Return the constructed model
      return dnn

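Creation of two separate networks (in terms of weights), in order to address the issue of targets that are not **identically distributed** (non-stationary): the targets are computed with the weights $\theta^{-}$ of the target network, which stay fixed between synchronizations.

$\displaystyle L(\theta) = E_{(s,a,r,s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta) \right)^2 \right]$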
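
For a single transition, the quantity inside the expectation can be worked out by hand. The sketch below uses made-up Q-values in place of the two networks, so all numbers and the helper names `td_target` and `squared_td_error` are assumptions for illustration only.

```python
gamma = 0.99  # discount factor


def td_target(reward, next_q_values, failure, gamma=gamma):
    """r + gamma * max over next-state Q-values, with the bootstrap term zeroed at terminal states."""
    bootstrap = 0.0 if failure else max(next_q_values)
    return reward + gamma * bootstrap


def squared_td_error(q_sa, target):
    """Squared difference between the target and the online estimate Q(s, a)."""
    return (target - q_sa) ** 2


# Hypothetical transition: reward 2.0, target-network Q-values for the next state,
# and an online estimate Q(s, a) = 4.0 for the action actually taken.
target = td_target(reward=2.0, next_q_values=[1.0, 3.0, 0.5], failure=False)
loss = squared_td_error(q_sa=4.0, target=target)
print(round(target, 2), round(loss, 4))
```

When the transition ends in a terminal state, the target reduces to the reward alone, which is what the `failure` flag stored in the replay memory is used for.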

In [None]:
online_q = create_network() # online q-network for the prediction
target_q = create_network() # target network for the Ground-Truth Q-value estimation

Optimizer for the online network only, since it is the only one to be trained.

In [None]:
learning_rate = 0.007
optimizer = torch.optim.RMSprop(online_q.parameters(), lr=learning_rate)

The target network is not trained, but sometimes we need to **update its weights to match the online network weights**

In [None]:
def update_target():
    # copy the parameters from the online model to the target model
    for target, online in zip(target_q.parameters(), online_q.parameters()):
        target.data.copy_(online.data)

At the end of training the online network will be used to implement the policy

In [None]:
def dqn_pi(state):
    # convert the state into a tensor
    state = torch.as_tensor(state, dtype=torch.float32)

    # compute Q-values from the network
    q_values = online_q(state).detach().numpy().squeeze()

    # select greedy action
    action = int(np.argmax(q_values))

    # return the action
    return action

Function to evaluate a policy: it returns the average reward obtained over a number of episodes.

In [None]:
def evaluate(pi, episodes=1):

     # collect total rewards per episode
    rewards = []

    # loop over episodes
    for episode in range(episodes):

        # reset the environment
        state, _ = env.reset()
        done = False
        total_reward = 0.0

        # run an episode
        while not done:
            action = pi(state)
            state, reward, terminal, truncated, _ = env.step(action)
            total_reward += reward
            done = terminal or truncated

        # store the total reward    
        rewards.append(total_reward)
            
    # return the average reward over the episodes        
    return np.mean(rewards)

Optimization step: compute the loss and run backpropagation.

In [None]:
# define the discount factor
gamma = 0.99

def optimize():

    # sample a batch of experiences
    batch = sample_experiences()
    
    # prepare the experience as tensors
    states      = torch.from_numpy(batch['state'].copy()).float()    
    actions     = torch.from_numpy(batch['action'].copy()).long()   
    rewards     = torch.from_numpy(batch['reward'].copy()).float()    
    next_states = torch.from_numpy(batch['next_state'].copy()).float() 
    failures    = torch.from_numpy(batch['failure'].copy()).float()

    # get the values of the Q-function at next state from the "target" network 
    # remember to detach, we need to treat these values as constants 
    q_target_next = target_q(next_states).detach()
    
    # get the max value 
    max_q_target_next = q_target_next.max(1)[0]

    # one important step, often overlooked, is to ensure 
    # that failure states are grounded to zero
    max_q_target_next *= (1 - failures.float())

    # calculate the target 
    target = rewards + gamma * max_q_target_next

    # finally, we get the current estimate of Q(s,a)
    # here we query the current "online" network
    q_online_current = torch.gather(online_q(states), 1, actions.unsqueeze(1)).squeeze(1)

    # create the errors
    td_error = target - q_online_current

    # calculate the loss
    loss = td_error.pow(2).mean()

    # backward pass: compute the gradients
    optimizer.zero_grad()
    loss.backward()

    # update model parameters
    optimizer.step()

Hyperparameters for the exploration strategy.

In [None]:
# define decay parameters (max, min, steps)
epsilon_max = 1.0
epsilon_min = 0.01
epsilon_decay_steps = 10000

# generate exponentially decaying values from 1 down to 0.01
epsilons = np.logspace(start=0, stop=-2, num=epsilon_decay_steps, base=10)

# normalize epsilons to the [0, 1] range, using the array's own extremes
epsilons = (epsilons - epsilons.min()) / (epsilons.max() - epsilons.min())

# scale epsilons to the desired range [epsilon_min, epsilon_max]
epsilons = (epsilon_max - epsilon_min) * epsilons + epsilon_min
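
The shape of the decay can be inspected on a small scale without NumPy. The helper below is a hypothetical sketch that interpolates exponentially between the two bounds; for `eps_max=1.0` and `eps_min=0.01` it produces the same values as `np.logspace(0, -2, num)`.

```python
def epsilon_schedule(eps_max, eps_min, decay_steps):
    """Exponentially interpolate from eps_max down to eps_min over decay_steps values."""
    if decay_steps < 2:
        raise ValueError("decay_steps must be at least 2")
    ratio = eps_min / eps_max
    return [eps_max * ratio ** (i / (decay_steps - 1)) for i in range(decay_steps)]


# five values only, to keep the output readable
schedule = epsilon_schedule(eps_max=1.0, eps_min=0.01, decay_steps=5)
print([round(e, 4) for e in schedule])
```

Most of the decay happens in the first steps, so the agent explores heavily at the beginning and settles close to the minimum for the rest of training.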

Chosen exploration strategy.\
Based on the current $\epsilon$, we alternate between random action selection and greedy selection (greedy with respect to the values $Q(s,a;\theta)$ returned by our neural network).

In [None]:
import random

def epsilon_greedy(state, step):
    # get the epsilon value    
    epsilon = epsilons[step] if step < epsilon_decay_steps else epsilon_min

    # Exploration
    if random.random() < epsilon:
        action = random_pi(state)

    # Exploitation
    else:
        action = dqn_pi(state)
    
    return action
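
The two extremes of this strategy can be checked in isolation. In the sketch below a fixed list of Q-values stands in for the network output; `epsilon_greedy_action` and the numbers are assumptions made for the example, not part of the notebook's code.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation


rng = random.Random(34)
q_values = [0.1, 0.7, 0.2]   # hypothetical Q-values for three actions

# epsilon = 0: always the greedy action (index 1 here)
greedy = [epsilon_greedy_action(q_values, 0.0, rng) for _ in range(100)]

# epsilon = 1: always a uniformly random action
explore = [epsilon_greedy_action(q_values, 1.0, rng) for _ in range(1000)]

print(set(greedy), sorted(set(explore)))
```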

#### Deep Q-Network

In [None]:
def dqn(memory_start_size, target_update_steps, max_episodes):
    
    # create a score tracker for statistic purposes
    scores = []
    
    # counter for the number of steps 
    step = 0

    # update the target model with the online one
    update_target()
                   
    # train until the maximum number of episodes
    for episode in range(max_episodes):
        
        # reset the environment before starting the episode
        state, _ = env.reset()
        done = False

        # interact with the environment until the episode is done
        while not done:
                    
            # select the action using the exploration policy
            action = epsilon_greedy(state, step)

            # perform the selected action
            next_state, reward, terminal, truncated, _ = env.step(action)
            done = terminal or truncated
            failure = terminal and not truncated

            # store the experience into the replay buffer
            experience = (state, action, reward, next_state, failure)
            store_experience(experience)
    
            # optimize the online model after the replay buffer is large enough
            if replay_memory['entries'] > memory_start_size:
                optimize()
                 
                # sometimes, synchronize the target model with the online model
                if step % target_update_steps == 0:
                    update_target()
                
            # update current state to next state
            state = next_state

            # update the step counter
            step += 1

        # After each episode, evaluate the policy
        score = evaluate(dqn_pi, episodes=10)

        # store the score in the tracker
        scores.append(score)

        # print some informative logging 
        message = 'Episode {:03}, score {:05.1f}'
        message = message.format(episode+1, score)
        print(message, end='\r', flush=True)
        
    return scores

Apply the DQN to our `TradingEnv`

In [None]:
# set the hyperparameters
memory_start_size = 1000
max_episodes = 5 #TODO! 100
target_update_steps = 10

# run the DQN algorithm
dqn(memory_start_size, target_update_steps, max_episodes)

Definition of the experiment

In [None]:
def experiment(max_episodes):

    global online_q, target_q, optimizer, replay_memory, epsilons

    # List of random seeds to test algorithm stability
    seeds = (12, 34, 56, 78, 90)

    # Container to collect all experiment results
    results = []

    # Run an independent training experiment per seed
    for seed in seeds:

        print("Experiment seed: ", seed)

         # Set all relevant random seeds for reproducibility
        torch.manual_seed(seed)
        np.random.seed(seed)
        random.seed(seed)

        # reset the environment
        env.reset(seed=seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)

        # create online and target models
        online_q = create_network()
        target_q = create_network()
        optimizer = torch.optim.RMSprop(online_q.parameters(), lr=learning_rate)

        # create the replay memory
        replay_memory = {
            'size': memory_size,
            'buffer': np.empty(shape=(memory_size,), dtype=experience_type),
            'index': 0,
            'entries': 0
        }

        # create the epsilon values (normalized with the array's own extremes)
        epsilons = np.logspace(start=0, stop=-2, num=epsilon_decay_steps, base=10)
        epsilons = (epsilons - epsilons.min()) / (epsilons.max() - epsilons.min())
        epsilons = (epsilon_max - epsilon_min) * epsilons + epsilon_min

        # train the network    
        scores = dqn(memory_start_size, target_update_steps, max_episodes)
        
        # smooth the result using a sliding window
        sliding_windows = 25
        scores = np.convolve(scores, np.ones(sliding_windows)/sliding_windows, mode='valid')
                
        # collect the results
        results.append(scores)

        print("")

    # calculate max, min and average scores among experiments
    max_score = np.max(results, axis=0).T
    min_score = np.min(results, axis=0).T
    mean_score = np.mean(results, axis=0).T

    # prepare the results
    experiment_results = {
        'max_score': max_score,
        'min_score': min_score,
        'mean_score': mean_score
    }

    # save permanently
    joblib.dump(experiment_results, '../dqn_results.joblib');
    
    return experiment_results
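
The smoothing step in the experiment is a moving average. A pure-Python sketch of the same computation is shown below; `moving_average` is a hypothetical helper, not a function from this notebook. One point worth keeping in mind: `np.convolve` with `mode='valid'` returns `max(M, N) - min(M, N) + 1` values, so a run with fewer scores than the window size does not yield a moving average. The helper makes that case an explicit error.

```python
def moving_average(values, window):
    """Average of each full window of consecutive values (the 'valid' positions only)."""
    if window > len(values):
        raise ValueError("need at least `window` values to smooth")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


smoothed = moving_average([1, 2, 3, 4, 5], window=3)
print(smoothed)  # [2.0, 3.0, 4.0]
```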

Run the experiment on our environment

In [None]:
# Define the experiment setup and hyperparameters
gamma = 0.99               # discount factor
learning_rate = 0.001      # step size for the optimizer
batch_size = 512           # number of experiences per batch
epochs = 8                 # optimization steps per batch (not referenced by the code in this notebook)
epsilon = 0.5              # exploration vs exploitation parameter (not referenced: epsilon follows the decay schedule)
first_hidden_layer = 256   # size of the first hidden layer
second_hidden_layer = 128  # size of the second hidden layer

# Run the experiment
dqn_results = experiment(max_episodes=1) #TODO! 1500

In [None]:
def experiment_random(max_episodes):

    # List of random seeds to test algorithm stability
    seeds = (12, 34, 56, 78, 90)

    # Container to collect all experiment results
    results = []

    # Run an independent training experiment per seed
    for seed in seeds:

        print("Experiment seed: ", seed)

         # Set all relevant random seeds for reproducibility
        np.random.seed(seed)
        random.seed(seed)

        # reset the environment
        env.reset(seed=seed)
        env.action_space.seed(seed)
        env.observation_space.seed(seed)

        # evaluate the random policy one episode at a time, to get one score per episode
        scores = [evaluate(random_pi, episodes=1) for _ in range(max_episodes)]
        
        # smooth the result using a sliding window
        sliding_windows = 25
        scores = np.convolve(scores, np.ones(sliding_windows)/sliding_windows, mode='valid')
                
        # collect the results
        results.append(scores)

        print("")

    # calculate max, min and average scores among experiments
    max_score = np.max(results, axis=0).T
    min_score = np.min(results, axis=0).T
    mean_score = np.mean(results, axis=0).T

    # prepare the results
    experiment_results = {
        'max_score': max_score,
        'min_score': min_score,
        'mean_score': mean_score
    }

    
    return experiment_results

In [None]:
random_pi_results = experiment_random(max_episodes=1) #TODO! 1500

In [None]:
list(random_pi_results)  # keys of the results dictionary (indexing it by position raises KeyError)

In [None]:
len(random_pi_results['mean_score'])

In [None]:
len(dqn_results['mean_score'])

TODO! PLOT THE COMPARISON BETWEEN THE RANDOM POLICY AND THE DQN

Plot of the results

In [None]:
plt.figure(figsize=(12,6))
plt.title('DQN vs Random Policy')
plt.ylabel('Average reward per episode [$]')
plt.xlabel('Episodes')

dqn_episodes = range(len(dqn_results['max_score']))
random_pi_episodes = range(len(random_pi_results['max_score']))

plt.plot(random_pi_results['max_score'], 'y', linewidth=1, label="random_pi")
plt.plot(random_pi_results['min_score'], 'y', linewidth=1)
plt.plot(random_pi_results['mean_score'], 'y', linewidth=2)
plt.fill_between(random_pi_episodes, random_pi_results['min_score'], random_pi_results['max_score'], facecolor='y', alpha=0.3)

plt.plot(dqn_results['max_score'], 'b', linewidth=1, label="DQN")
plt.plot(dqn_results['min_score'], 'b', linewidth=1)
plt.plot(dqn_results['mean_score'], 'b', linewidth=2)
plt.fill_between(dqn_episodes, dqn_results['min_score'], dqn_results['max_score'], facecolor='b', alpha=0.3)

plt.legend()
plt.show()

## Conclusions, limitations and further improvements

* Only a single share can be bought or sold per action.
* The market is assumed to be Markovian.
* See my paper.
* See my FFT for finance GitHub repository (percentage variation of prices).



# Important things

1. Use the percentage variation of prices.
2. Keep the environment class as simple as possible, with nothing superfluous.
3. Implement a simple rule for what to buy and sell, such as just ONE stock, or a portion of it?
4. Even the action space must be extremely simple.
5. Make a very important comparison between what the author of the online implementation did, Berta's notes, and what I want to do.