# RL Project

## Key Changes

#### Hedging Environment
see *hedge_env.py*
- Changed from "cash flow PnL" to **"Accounting PnL"** (see Cao 2019 for a fuller description)
    - The general idea is that the daily reward now includes the daily mark-to-market value of the options
    - For *stationary vol*: mark-to-market uses the Black-Scholes price
    - For *non-stationary vol*: mark-to-market uses the Hagan SABR implied vol (see the Hagan 2002 paper for more)
- Adjusted the stock dynamics; the stationary-vol case now matches the discretized analytical Black-Scholes solution
    - Stationary Vol:
        - Before: $S_{t+1} = S_t (1 + \sigma \cdot \sqrt{dt} \cdot Z_1)$
        - New: Euler discretization of GBM analytical solution
    - Stochastic Vol:
        - Changed to a SABR model with log-normal returns (i.e. beta = 1)
- Changed the 4th state variable from "time fraction" to tau
- Changed from hedging 1 option to hedging 100 options; the larger reward and action magnitudes help the model learn better
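
The accounting-PnL reward and the new stationary-vol dynamics can be sketched in a few lines. This is only an illustration, not the code in *hedge_env.py*: it assumes a short position in European calls, zero rates, zero drift by default, and the per-step reward form described in Cao 2019 (change in position value, plus hedge gain, minus proportional transaction cost). The names `bs_call_price`, `gbm_exact_step`, and `accounting_pnl_step` are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(S, K, sigma, tau, r=0.0):
    """Black-Scholes call price; falls back to intrinsic value at expiry."""
    if tau <= 0.0:
        return max(S - K, 0.0)
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * norm_cdf(d1) - K * math.exp(-r * tau) * norm_cdf(d2)


def gbm_exact_step(S, sigma, dt, z, mu=0.0):
    """One step of the analytical GBM solution (log-normal update)."""
    return S * math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)


def accounting_pnl_step(S0, S1, tau0, tau1, h0, h1, K, sigma, kappa, n_options=100):
    """Per-step accounting PnL for a short call position (assumed reward form).

    h0 is the stock holding carried over the step, h1 the holding after
    rebalancing at the new price S1.
    """
    v0 = -n_options * bs_call_price(S0, K, sigma, tau0)  # short position value before
    v1 = -n_options * bs_call_price(S1, K, sigma, tau1)  # short position value after
    hedge_gain = h0 * (S1 - S0)
    trading_cost = kappa * abs(S1 * (h1 - h0))
    return (v1 - v0) + hedge_gain - trading_cost
```

With the mark-to-market term included, a step in which nothing moves and nothing is traded yields a reward of exactly zero, rather than deferring all option PnL to expiry as the cash-flow formulation does.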

#### Actor and Critic Models
see *models.py*
- I only have DDPG models. In my view neither PPO nor GRPO makes sense in this context, since we would have to reparameterize the policy to output a probability distribution over the action space. Given the high precision required to hedge the option properly, this isn't stable enough to converge (in my experience)
- Added batch normalization per layer on all networks
- Added sigmoid activation to action output of actor
- I use 2 separate Q-functions (the "cost critic" and the "risk critic"), modelling $E[C_t]$ and $E[C_t^2]$ respectively. This follows Cao 2019; see the paper for more details.
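
To make the role of the two critics concrete: Cao 2019 combines the first and second moments of the hedging cost into a mean-plus-standard-deviation objective, $Y = E[C_t] + c\sqrt{E[C_t^2] - E[C_t]^2}$. A minimal sketch, assuming the critic outputs are available as plain floats and that `c` corresponds to the environment's `risk_aversion` (both the function name and that mapping are assumptions, not taken from *models.py*):

```python
import math


def hedging_objective(cost_q, risk_q, c=1.5):
    """Combine E[C] (cost critic) and E[C^2] (risk critic) into E[C] + c * std(C)."""
    # Function approximators can return E[C^2] < E[C]^2, so clamp the variance at 0.
    variance = max(risk_q - cost_q**2, 0.0)
    return cost_q + c * math.sqrt(variance)
```

The actor would then be trained to minimize this quantity (or maximize its negative if the environment reports rewards instead of costs).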

#### Hot Start Actor
see *hot_start_actor.py*
- Supervised learning to teach the actor either the BS hedge (stationary vol) or the Bartlett delta hedge (stochastic vol)
- The trained model works well, but it's hard to train a critic function on it and then proceed with DDPG
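
For the stationary-vol case the supervised target is the Black-Scholes delta. A minimal sketch of how such labels could be produced, assuming a European call, zero rates, and a position of 100 options; `bs_delta` and `make_hot_start_labels` are hypothetical names, and the real `hot_start_gen_actor_samples` may build its features and targets differently:

```python
import math


def bs_delta(S, K, sigma, tau, r=0.0):
    """Black-Scholes call delta N(d1); step function at expiry."""
    if tau <= 0.0:
        return 1.0 if S > K else 0.0
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))


def make_hot_start_labels(prices, taus, K, sigma, n_options=100):
    """Target stock holdings (in shares) for each (price, time-to-maturity) pair."""
    return [n_options * bs_delta(S, K, sigma, tau) for S, tau in zip(prices, taus)]
```

The actor is then fit to these targets with an ordinary regression loss, which is what makes this step behavior cloning rather than reinforcement learning.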

#### Hot Start Critic
see *hot_start_critic.py*
- Takes a pre-trained actor and uses a similar DDPG method (with cost and risk critic functions) to train the 2 critic functions. It seems to have trouble learning long-range dependencies. Feel free to experiment with it.
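
Both critics can be trained from the same transition. The bootstrapped targets below come from writing the cost-to-go as $C_t = r + \gamma C_{t+1}$, squaring it, and taking expectations, which is the idea behind the second-moment critic in Cao 2019. This is a sketch under those assumptions, not the code in *hot_start_critic.py*, and `critic_targets` is a hypothetical name:

```python
def critic_targets(step_cost, next_cost_q, next_risk_q, discount=1.0, done=False):
    """Return (cost_target, risk_target) for one transition.

    cost critic: E[C]   = r + g * Q1'
    risk critic: E[C^2] = r^2 + 2 * g * r * Q1' + g^2 * Q2'
    where Q1' and Q2' are the target critics evaluated at the next state-action pair.
    """
    if done:
        # No future cost after the terminal step.
        return step_cost, step_cost**2
    cost_target = step_cost + discount * next_cost_q
    risk_target = (
        step_cost**2
        + 2.0 * discount * step_cost * next_cost_q
        + discount**2 * next_risk_q
    )
    return cost_target, risk_target
```

Because the risk target depends on the cost critic's estimate, errors in the cost critic feed straight into the risk critic, which may be one reason the pair is hard to train over long horizons.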

#### DDPG Trainer
see *DDPG_train.py*
- I overhauled the code to match the Cao 2019 approach. Very gradual updates.
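
The two knobs that make updates gradual in the demo calls are `tau` and `epsilon_decay`. A sketch of what they typically do in DDPG, assuming a Polyak-style soft target update and a multiplicative per-episode epsilon schedule (neither is verified against *DDPG_train.py*; the helper names are hypothetical):

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def decayed_epsilon(epsilon, decay, episodes, min_epsilon=0.0):
    """Exploration rate after a number of episodes of multiplicative decay."""
    return max(epsilon * decay**episodes, min_epsilon)
```

With `tau=0.00001` the target networks absorb only 0.001% of the online weights per update. With `epsilon=1` and `epsilon_decay=0.9998`, the schedule gives about 0.980 after 100 episodes, which is consistent with the `e:` column printed in the cold-start log in the demo.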

#### Policy Evaluation Function
see *evaluate.py*
- Simulates paths and computes the average total objective function
- Can run the simulation with the BS policy, Bartlett policy, Do Nothing policy, or a user-supplied policy
- Will use this to compare actors
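
A sketch of the kind of summary such an evaluation produces, assuming each simulated path yields one total PnL and that the reported risk is `risk_aversion` times the standard deviation of those totals. That definition of risk is an assumption (the exact statistic lives in *evaluate.py*), and `summarize_paths` is a hypothetical name. The demo logs are at least consistent with `Mean Objective = PnL - Risk`.

```python
import statistics


def summarize_paths(total_pnls, risk_aversion=1.5):
    """Mean PnL, risk penalty, and objective over a batch of simulated paths."""
    mean_pnl = statistics.fmean(total_pnls)
    risk = risk_aversion * statistics.pstdev(total_pnls)  # population std of path PnLs
    return {"pnl": mean_pnl, "risk": risk, "objective": mean_pnl - risk}
```

For example, `summarize_paths([1.0, 3.0], risk_aversion=2.0)` has a mean of 2.0 and a standard deviation of 1.0, so the risk is 2.0 and the objective is 0.0.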

## Demo
Note that the costs are more than 100 times larger than in our previous environment (i.e. a reward of -600 can be thought of as akin to a reward of -6 in our previous environments). Will train 3 different actors and compare their performance with the Black-Scholes and Do Nothing benchmarks:
1. Analytical Models
2. Cold-start model
3. Hot-started model

In [1]:
from hedge_env import HedgingEnv
from hot_start_actor import hot_start_gen_actor_samples, hot_start_actor
from hot_start_critic import hot_start_critic_q_func, hot_start_critic_value_func
from DDPG_train import DDPG_train
from evaluate import evaluation
from models import Actor, DDPG_Cost_Critic, DDPG_Risk_Critic

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

#### Analytical Models:
Our analytical models are already coded into the evaluate function; we simply need to call them.

In [2]:
# initialize environment
env = HedgingEnv(T = 0.25, kappa = 0.01, risk_aversion = 1.5, stochastic_vol=False, n_steps = 30)

In [3]:
# Test performance of policy that does nothing every step
print("'Do Nothing' policy:")
_,nothing_history = evaluation().eval_policy(env, "Nothing", verbose = True)

# Test analytical BS hedge
print()
print("BS policy:")
_,bs_history = evaluation().eval_policy(env, "BS", verbose = True)

# Doesn't really make sense here (vol is non-stochastic), but also demonstrate the Bartlett hedge
print()
print("Bartlett policy:")
_,bartlett_history = evaluation().eval_policy(env, "Bartlett", verbose = True)

'Do Nothing' policy:
testing...
Episode: 100 | Mean Objective: -864.490 | PnL: 18.010, Risk: 882.499
Episode: 200 | Mean Objective: -818.869 | PnL: 21.839, Risk: 840.708
Episode: 300 | Mean Objective: -931.933 | PnL: -7.004, Risk: 924.930
Episode: 400 | Mean Objective: -908.907 | PnL: 9.934, Risk: 918.841
Episode: 500 | Mean Objective: -920.351 | PnL: -1.969, Risk: 918.382
Episode: 600 | Mean Objective: -908.163 | PnL: 6.843, Risk: 915.006
Episode: 700 | Mean Objective: -922.916 | PnL: -2.266, Risk: 920.650
Episode: 800 | Mean Objective: -918.379 | PnL: 4.640, Risk: 923.019
Episode: 900 | Mean Objective: -978.994 | PnL: -17.242, Risk: 961.752
Episode: 1000 | Mean Objective: -959.104 | PnL: -12.419, Risk: 946.685

BS policy:
testing...
Episode: 100 | Mean Objective: -452.701 | PnL: -245.073, Risk: 207.628
Episode: 200 | Mean Objective: -462.747 | PnL: -243.250, Risk: 219.496
Episode: 300 | Mean Objective: -451.425 | PnL: -229.559, Risk: 221.866
Episode: 400 | Mean Objective: -453.933 | 

#### Cold Start Model:
1. Initialize actor and critic functions
2. Call DDPG train

In [None]:
# initialize
cold_actor = Actor(4,1)
cold_cost_critic = DDPG_Cost_Critic(4,1)
cold_risk_critic = DDPG_Risk_Critic(4,1)

# train
cold_obj_hist, _ = DDPG_train(cold_actor,
                            cold_cost_critic,
                            cold_risk_critic,
                            env,
                            episodes= 50000,
                            batch_size = 128,
                            lr = 0.0001,
                            tau=0.00001,
                            epsilon = 1,
                            epsilon_decay = 0.9998,
                            discount = 1,
                            eval_freq = 100,
                            min_epsilon = 0.0,
                            noisy_exploration = False
                            )

Training Critic (learning q function)...
Episode: 100 | Mean Objective: -1711.108 | PnL: -1048.616, Risk: 662.492 | e: 0.980
Episode: 200 | Mean Objective: -1696.257 | PnL: -994.694, Risk: 701.563 | e: 0.961
Episode: 300 | Mean Objective: -1674.942 | PnL: -982.505, Risk: 692.438 | e: 0.942
Episode: 400 | Mean Objective: -1696.979 | PnL: -1047.649, Risk: 649.330 | e: 0.923
Episode: 500 | Mean Objective: -1743.487 | PnL: -972.133, Risk: 771.354 | e: 0.905
Episode: 600 | Mean Objective: -1705.085 | PnL: -997.389, Risk: 707.695 | e: 0.887
Episode: 700 | Mean Objective: -1483.226 | PnL: -915.080, Risk: 568.146 | e: 0.869
Episode: 800 | Mean Objective: -1670.579 | PnL: -950.000, Risk: 720.579 | e: 0.852
Episode: 900 | Mean Objective: -1465.568 | PnL: -853.213, Risk: 612.354 | e: 0.835
Episode: 1000 | Mean Objective: -1560.740 | PnL: -872.185, Risk: 688.556 | e: 0.819
Episode: 1100 | Mean Objective: -1607.894 | PnL: -968.512, Risk: 639.382 | e: 0.803
Episode: 1200 | Mean Objective: -1495.384 

#### Hot-Start Model:
1. Generate Training Data
2. Hot start actor
3. Hot start critic (still not very good)
4. Do DDPG on pre-trained model

In [3]:
# initialize
hot_actor = Actor(4,1)
hot_cost_critic = DDPG_Cost_Critic(4,1)
hot_risk_critic = DDPG_Risk_Critic(4,1)

# Generate data
X, y = hot_start_gen_actor_samples(env,n_paths = 100)

# BC Train actor
hot_actor, loss_hist = hot_start_actor(hot_actor, X, y, lr=0.01, batch_size=32, epochs = 200)

Generating Samples for hot start...
Training Actor...
Epoch: 1, Batch: 156/156 | Loss: 193.156
Epoch: 11, Batch: 156/156 | Loss: 52.9809
Epoch: 21, Batch: 156/156 | Loss: 84.7890
Epoch: 31, Batch: 156/156 | Loss: 43.7909
Epoch: 41, Batch: 156/156 | Loss: 11.8410
Epoch: 51, Batch: 156/156 | Loss: 14.1049
Epoch: 61, Batch: 156/156 | Loss: 19.2050
Epoch: 71, Batch: 156/156 | Loss: 6.7309
Epoch: 81, Batch: 156/156 | Loss: 14.4409
Epoch: 91, Batch: 156/156 | Loss: 1.52295
Epoch: 101, Batch: 156/156 | Loss: 33.447
Epoch: 111, Batch: 156/156 | Loss: 6.67307
Epoch: 121, Batch: 156/156 | Loss: 25.1629
Epoch: 131, Batch: 156/156 | Loss: 0.84589
Epoch: 141, Batch: 156/156 | Loss: 7.78742
Epoch: 151, Batch: 156/156 | Loss: 1.2996
Epoch: 161, Batch: 156/156 | Loss: 3.93620
Epoch: 171, Batch: 156/156 | Loss: 1.65184
Epoch: 181, Batch: 156/156 | Loss: 2.3470
Epoch: 191, Batch: 156/156 | Loss: 16.0974
Epoch: 200, Batch: 156/156 | Loss: 5.6181

In [4]:
# Hot start critic
hot_obj, hot_q_guess = hot_start_critic_q_func(hot_actor,
                                            hot_cost_critic,
                                            hot_risk_critic,
                                            env,
                                            episodes= 10000,
                                            batch_size = 128,
                                            lr = 0.01,
                                            tau=0.000025,
                                            epsilon = 1.0,
                                            epsilon_decay = 0.9995,
                                            discount = 1.0,
                                            eval_freq = 100,
                                            min_epsilon = 0.01,
                                            )

Training Critic (learning q function)...
Episode: 100 | Mean Objective: -1695.656 | Guess: -129.341, Diff: 1566.315 | e: 0.905
Episode: 200 | Mean Objective: -1314.876 | Guess: -198.921, Diff: 1115.955 | e: 0.819
Episode: 300 | Mean Objective: -1326.176 | Guess: -208.903, Diff: 1117.272 | e: 0.741
Episode: 400 | Mean Objective: -1046.115 | Guess: -202.690, Diff: 843.425 | e: 0.6700
Episode: 500 | Mean Objective: -1388.094 | Guess: -215.204, Diff: 1172.890 | e: 0.606
Episode: 600 | Mean Objective: -1090.885 | Guess: -226.710, Diff: 864.174 | e: 0.5491
Episode: 700 | Mean Objective: -839.421 | Guess: -236.638, Diff: 602.783 | e: 0.4969
Episode: 800 | Mean Objective: -972.599 | Guess: -247.633, Diff: 724.966 | e: 0.4491
Episode: 900 | Mean Objective: -876.007 | Guess: -259.975, Diff: 616.032 | e: 0.406
Episode: 1000 | Mean Objective: -800.088 | Guess: -265.295, Diff: 534.793 | e: 0.368
Episode: 1100 | Mean Objective: -1028.005 | Guess: -266.853, Diff: 761.153 | e: 0.333
Episode: 1200 | Me

KeyboardInterrupt: 

In [5]:
# Continue training with DDPG on the hot-started actor and critics
hot_obj_train_hist, _ = DDPG_train(hot_actor,
                                    hot_cost_critic,
                                    hot_risk_critic,
                                    env,
                                    episodes= 20000,
                                    batch_size = 128,
                                    lr = 0.0001,
                                    tau=0.00001,
                                    epsilon = 5.00,
                                    epsilon_decay = 0.9999,
                                    discount = 1,
                                    eval_freq = 100,
                                    min_epsilon = 0.5,
                                    noisy_exploration = True
                                    )

Training Critic (learning q function)...
Episode: 100 | Mean Objective: -688.824 | PnL: -103.745, Risk: 585.079 | e: 0.000
Episode: 200 | Mean Objective: -714.381 | PnL: 38.457, Risk: 752.838 | e: 0.000
Episode: 300 | Mean Objective: -828.202 | PnL: -16.817, Risk: 811.385 | e: 0.000
Episode: 400 | Mean Objective: -736.541 | PnL: -80.881, Risk: 655.659 | e: 0.000
Episode: 500 | Mean Objective: -881.368 | PnL: -173.640, Risk: 707.728 | e: 0.000
Episode: 600 | Mean Objective: -665.988 | PnL: 6.854, Risk: 672.842 | e: 0.000
Episode: 700 | Mean Objective: -724.194 | PnL: 31.031, Risk: 755.225 | e: 0.000
Episode: 749 | Mean Objective: -825.548 | e: 0.000

KeyboardInterrupt: 