# RL Project

## Key Changes

#### Hedging Environment
see *hedge_env.py*
- Changed from "cash flow PnL" to **"Accounting PnL"** (see Cao 2019 for a fuller description)
    - The general idea is that the daily reward now includes the daily mark-to-market price of the options
    - For *stationary vol*: mark-to-market is the Black-Scholes price
    - For *non-stationary vol*: mark-to-market uses the Hagan SABR implied vol (see the Hagan 2002 paper for more)
- Adjusted the stationary-vol stock dynamics to match the Euler discretization of the analytical Black-Scholes (GBM) solution
    - Stationary Vol:
        - Before: $S_{t+1} = S_t (1 + \sigma \sqrt{dt} \, Z_1)$
        - New: Euler discretization of the GBM analytical solution
    - Stochastic Vol:
        - Changed to the SABR model with log-normal returns (i.e. $\beta = 1$)
- Changed the 4th state variable from "time fraction" to tau
- Changed from hedging 1 option to hedging 100 options; the larger reward and action magnitudes help the model learn better
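
A minimal sketch of the stationary-vol pieces described above, not the code in *hedge_env.py*. It assumes a European call, a zero interest rate by default, and the accounting PnL form $R_{t+1} = V_{t+1} - V_t + H_t (S_{t+1} - S_t) - \kappa |S_{t+1} (H_{t+1} - H_t)|$ in the spirit of Cao 2019, where $V$ is the mark-to-market value of the option position (negative when short) and $H$ is the stock holding. All function names are hypothetical.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call_price(s, k, tau, sigma, r=0.0):
    """Black-Scholes price of a European call (assumed option type)."""
    if tau <= 0.0:
        return max(s - k, 0.0)  # intrinsic value at expiry
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return s * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)

def gbm_step(s, mu, sigma, dt, z):
    """Log-normal update taken from the GBM analytical solution."""
    return s * math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)

def accounting_pnl_step(v_prev, v_next, h_prev, h_next, s_prev, s_next, kappa):
    """One-period accounting PnL: option revaluation + stock PnL - trading cost."""
    revaluation = v_next - v_prev
    stock_pnl = h_prev * (s_next - s_prev)
    cost = kappa * abs(s_next * (h_next - h_prev))
    return revaluation + stock_pnl - cost

# Example: short one call, marked to market with Black-Scholes each step
s0, k, sigma, dt = 100.0, 100.0, 0.2, 1.0 / 50
s1 = gbm_step(s0, mu=0.0, sigma=sigma, dt=dt, z=1.0)
v0 = -bs_call_price(s0, k, 1.0, sigma)
v1 = -bs_call_price(s1, k, 1.0 - dt, sigma)
reward = accounting_pnl_step(v0, v1, h_prev=0.5, h_next=0.6, s_prev=s0, s_next=s1, kappa=0.01)
```

Unlike the old update, the log-normal step can never produce a negative price, whatever the draw of `z`.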

#### Actor and Critic Models
see *models.py*
- I only have DDPG models. In my view neither PPO nor GRPO makes sense in this context, since we would have to reparameterize the policy to output a probability distribution over the action space. Given the high precision required to hedge the option properly, this isn't stable enough to converge (in my experience)
- Added batch normalization per layer on all networks
- Added sigmoid activation to action output of actor
- I use 2 separate Q-functions (the "cost critic" and "risk critic"), modelling $E[C_t]$ and $E[C_t^2]$ respectively. This follows Cao 2019; see the paper for more details.
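
As a rough illustration of why two critics are enough, the sketch below combines the two moments into a mean-plus-standard-deviation objective of the kind used in Cao 2019, $E[C] + c \sqrt{E[C^2] - E[C]^2}$. The function name and the clamping of the variance at zero are assumptions, not taken from *models.py*. The demo logs use the reward-side version of the same idea, where the printed values satisfy Objective = PnL - Risk.

```python
import math

def hedging_objective(expected_cost, expected_cost_sq, risk_aversion):
    """Combine the cost critic (E[C]) and risk critic (E[C^2]) outputs.

    Returns E[C] + c * sqrt(E[C^2] - E[C]^2), a quantity to be minimized.
    """
    # Two separately learned estimates can be slightly inconsistent,
    # so the implied variance is clamped at zero before the square root.
    variance = max(expected_cost_sq - expected_cost ** 2, 0.0)
    return expected_cost + risk_aversion * math.sqrt(variance)

# Example: E[C] = 2, E[C^2] = 13 gives a standard deviation of 3
objective = hedging_objective(2.0, 13.0, risk_aversion=1.5)
```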

#### Hot Start Actor
see *hot_start_actor.py*
- Supervised learning to teach actor either BS hedge (stationary vol) or Bartlett delta hedge (stochastic vol)
- The trained model works well but it's hard to train a critic function on it and then proceed with DDPG
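
For the stationary-vol case, the supervised target can be sketched as the Black-Scholes delta. The example below assumes a European call and a zero interest rate; the function names are hypothetical and this is not the code in *hot_start_actor.py*. A call delta lies in [0, 1], which lines up with the sigmoid on the actor output.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call_delta(s, k, tau, sigma, r=0.0):
    """Black-Scholes delta of a European call; tau is time to expiry in years."""
    if tau <= 0.0:
        return 1.0 if s > k else 0.0  # at expiry the delta is a step function
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    return norm_cdf(d1)

# Hypothetical training labels: the hedge ratio the actor should imitate
spots = [80.0, 100.0, 120.0]
labels = [bs_call_delta(s, k=100.0, tau=0.5, sigma=0.2) for s in spots]
```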

#### Hot Start Critic
see *hot_start_critic.py*
- Takes a pre-trained actor and uses a similar DDPG method (with cost and risk critic functions) to train the 2 critic functions. It seems to have trouble learning long-range dependencies. Feel free to experiment with it.

#### DDPG Trainer
see *DDPG_train.py*
- I overhauled the code to match the Cao 2019 approach. Very gradual updates.
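
One common way to get gradual updates in DDPG is a soft (Polyak) update of the target networks. The sketch below assumes that the `tau` argument passed to `DDPG_train` in the demo is such a rate; this has not been checked against the trainer code, and `soft_update` is a hypothetical helper operating on plain lists of floats rather than network parameters.

```python
def soft_update(target, online, tau):
    """Move each target parameter a fraction tau of the way toward the online one."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

# With tau = 1e-5, the target covers only about 1% of the gap after 1000 updates
target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(1000):
    target = soft_update(target, online, tau=1e-5)
```

A rate this small keeps the bootstrapped critic targets almost fixed between updates, trading learning speed for stability.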

#### Policy Evaluation Function
see *evaluate.py*
- Simulates paths and reports the average total objective
- Can simulate with the BS policy, the Bartlett policy, the Do Nothing policy, or a user-supplied policy
- Will use this to compare actors

## Demo
Will compare the following 3 approaches, using the Black-Scholes and Do Nothing policies as benchmarks. Note that the costs are more than 100 times larger than in our previous environment (i.e. a reward of -600 is roughly comparable to a reward of -6 in our previous environments).
1. Analytical Models
2. Cold-start model
3. Hot-started model

In [1]:
from hedge_env import HedgingEnv
from hot_start_actor import hot_start_gen_actor_samples, hot_start_actor
from hot_start_critic import hot_start_critic_q_func, hot_start_critic_value_func
from PPO_train import PPO_train
from DDPG_train import DDPG_train
from evaluate import evaluation
from models import Actor, DDPG_Cost_Critic, DDPG_Risk_Critic

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

#### Analytical Models:
Our analytical models are already coded into the evaluate function, so we just need to call them.

In [2]:
# initialize environment
env = HedgingEnv(T = 1, kappa = 0.01, risk_aversion = 1.5, stochastic_vol=False, n_steps = 50)

# Test performance of policy that does nothing every step
print("'Do Nothing' policy:")
_,nothing_history = evaluation().eval_policy(env, "Nothing", verbose = True)

# Test analytical BS hedge
print()
print("BS policy:")
_,bs_history = evaluation().eval_policy(env, "BS", verbose = True)

# Bartlett hedge, shown for completeness (not very meaningful here, since vol is non-stochastic)
print()
print("Bartlett policy:")
_,bartlett_history = evaluation().eval_policy(env, "Bartlett", verbose = True)

'Do Nothing' policy:
testing...
Episode: 100 | Mean Objective: -1812.241 | PnL: 80.839, Risk: 1893.080
Episode: 200 | Mean Objective: -1896.349 | PnL: 38.335, Risk: 1934.684
Episode: 300 | Mean Objective: -1646.584 | PnL: 91.122, Risk: 1737.706
Episode: 400 | Mean Objective: -1664.372 | PnL: 94.468, Risk: 1758.839
Episode: 500 | Mean Objective: -1721.654 | PnL: 101.258, Risk: 1822.911
Episode: 600 | Mean Objective: -1769.551 | PnL: 84.089, Risk: 1853.640
Episode: 700 | Mean Objective: -1859.789 | PnL: 56.966, Risk: 1916.755
Episode: 800 | Mean Objective: -1896.552 | PnL: 39.904, Risk: 1936.456
Episode: 900 | Mean Objective: -1904.885 | PnL: 31.479, Risk: 1936.364
Episode: 1000 | Mean Objective: -1912.011 | PnL: 24.921, Risk: 1936.932

BS policy:
testing...
Episode: 100 | Mean Objective: -580.826 | PnL: -280.202, Risk: 300.624
Episode: 200 | Mean Objective: -618.071 | PnL: -287.185, Risk: 330.886
Episode: 300 | Mean Objective: -633.413 | PnL: -285.534, Risk: 347.879
Episode: 400 | Mean 

#### Cold Start Model:
1. Initialize actor and critic functions
2. Call DDPG train

In [None]:
# initialize
cold_actor = Actor(4,1)
cold_cost_critic = DDPG_Cost_Critic(4,1)
cold_risk_critic = DDPG_Risk_Critic(4,1)

# train
cold_obj_hist, _ = DDPG_train(cold_actor,
                            cold_cost_critic,
                            cold_risk_critic,
                            env,
                            episodes= 50000,
                            batch_size = 128,
                            lr = 0.0001,
                            tau=0.00001,
                            epsilon = 1,
                            epsilon_decay = 0.998,
                            discount = 1,
                            eval_freq = 100,
                            min_epsilon = 0.05,
                            )

Training Critic (learning q function)...
Episode: 100 | Mean Objective: -2828.787 | PnL: -854.519, Risk: 1974.268
Episode: 110 | Mean Objective: -2578.046
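
To get a feel for the exploration settings above, the sketch below assumes `epsilon` is multiplied by `epsilon_decay` once per episode and floored at `min_epsilon`. This schedule is an assumption about `DDPG_train`, not something confirmed by the trainer code, and `episodes_until_floor` is a hypothetical helper.

```python
def episodes_until_floor(epsilon, decay, min_epsilon):
    """Count multiplicative decay steps until epsilon reaches the floor."""
    episodes = 0
    while epsilon > min_epsilon:
        epsilon *= decay
        episodes += 1
    return episodes

# Cold-start settings from the call above: epsilon = 1, decay = 0.998, floor = 0.05
cold_episodes = episodes_until_floor(1.0, 0.998, 0.05)
```

Under that assumption the cold-start run reaches the exploration floor after roughly 1,500 of its 50,000 episodes, so almost all of the training happens at `min_epsilon`.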

#### Hot-Start Model:
1. Generate Training Data
2. Hot start actor
3. Hot start critic (still not very good)
4. Do DDPG on pre-trained model

In [None]:
# initialize
hot_actor = Actor(4,1)
hot_cost_critic = DDPG_Cost_Critic(4,1)
hot_risk_critic = DDPG_Risk_Critic(4,1)

# Generate data
X, y = hot_start_gen_actor_samples(env,n_paths = 100)

# BC Train actor
hot_actor, loss_hist = hot_start_actor(hot_actor, X, y, lr=0.01, batch_size=32, epochs = 200)

In [None]:
# Hot start critic
hot_obj, hot_q_guess = hot_start_critic_q_func(hot_actor,
                                        hot_cost_critic,
                                        hot_risk_critic,
                                        env,
                                        episodes= 10000,
                                        batch_size = 128,
                                        lr = 0.01,
                                        tau=0.000025,
                                        epsilon = 0.8,
                                        epsilon_decay = 0.995,
                                        discount = 1,
                                        eval_freq = 100)

In [None]:
# Continue training
hot_obj_train_hist, _ = DDPG_train(hot_actor,
                                    hot_cost_critic,
                                    hot_risk_critic,
                                    env,
                                    episodes= 20000,
                                    batch_size = 128,
                                    lr = 0.0001,
                                    tau=0.00001,
                                    epsilon = 0.1,
                                    epsilon_decay = 0.9995,
                                    discount = 1,
                                    eval_freq = 100,
                                    min_epsilon = 0.05,
                                    )