# RL Project

## Key Changes

#### Hedging Environment
see *hedge_env.py*
- Changed from "cash flow PnL" to **"Accounting PnL"** (see Cao 2019 for a fuller description)
    - The general idea is that the daily reward now includes the daily mark-to-market price of the options
    - For *stationary vol*: mark-to-market is Black-Scholes
    - For *non-stationary vol*: mark-to-market uses the Hagan SABR implied vol (see the Hagan 2002 paper for more)
- Adjusted the stock dynamics; the stationary vol case now matches the Euler discretization of the analytical Black-Scholes solution
    - Stationary Vol:
        - Before: $S_{t+1} = S_t (1 + \sigma \cdot \sqrt{dt} \cdot Z_1)$
        - New: Euler discretization of the GBM analytical solution
    - Stochastic Vol:
        - Changed to SABR model with log-normal returns (i.e. beta = 1)
- Changed the 4th state variable from "time fraction" to tau
- Changed from hedging 1 option to hedging 100 options; the larger reward and action magnitudes help the model learn better
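
The change in dynamics described above can be sketched as follows. This is a minimal illustration, not the code in *hedge_env.py*: the function names, the zero default drift `mu`, and the SABR parameters `nu` (vol of vol) and `rho` (correlation) are all assumptions.

```python
import math
import random


def old_step(s, sigma, dt, z):
    """Previous update: additive step applied to the price itself."""
    return s * (1.0 + sigma * math.sqrt(dt) * z)


def gbm_step(s, sigma, dt, z, mu=0.0):
    """Step taken from the analytical GBM solution (price stays positive)."""
    return s * math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)


def sabr_lognormal_step(s, sigma, dt, z1, z2, nu, rho):
    """One step of SABR with beta = 1; z1 and z2 are independent normals."""
    zv = rho * z1 + math.sqrt(1.0 - rho**2) * z2  # vol shock correlated with z1
    s_next = s * math.exp(-0.5 * sigma**2 * dt + sigma * math.sqrt(dt) * z1)
    sigma_next = sigma * math.exp(-0.5 * nu**2 * dt + nu * math.sqrt(dt) * zv)
    return s_next, sigma_next


# Feed the same shocks to the old and new stationary-vol updates
rng = random.Random(0)
dt = 1 / 252
s_old = s_new = 100.0
for _ in range(21):
    z = rng.gauss(0.0, 1.0)
    s_old = old_step(s_old, 0.2, dt, z)
    s_new = gbm_step(s_new, 0.2, dt, z)
print(s_old, s_new)
```

For small `dt` the two stationary-vol updates stay close, but only the exponential form guarantees a positive price for any shock.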
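
As a rough illustration of the accounting PnL idea, the sketch below marks a short call position to market with Black-Scholes at each step and adds the hedge gain and trading cost. It assumes the reward takes the form $V_{t+1} - V_t + H_t (S_{t+1} - S_t) - \kappa |S_{t+1}(H_{t+1} - H_t)|$; the function names, the zero interest rate, and the argument layout are assumptions rather than the contents of *hedge_env.py*.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(s, k, tau, sigma):
    """Black-Scholes call price with zero interest rate (assumption)."""
    if tau <= 0.0:
        return max(s - k, 0.0)
    d1 = (math.log(s / k) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return s * norm_cdf(d1) - k * norm_cdf(d2)


def accounting_pnl(s0, s1, h0, h1, tau0, tau1, k, sigma, kappa, n_options=100):
    """One-step reward for a short position in n_options calls.

    h0 is the stock holding carried over the step; h1 is the holding
    after rebalancing at the new price s1.
    """
    v0 = -n_options * bs_call(s0, k, tau0, sigma)  # short option value before
    v1 = -n_options * bs_call(s1, k, tau1, sigma)  # short option value after
    hedge_gain = h0 * (s1 - s0)
    trading_cost = kappa * abs(s1 * (h1 - h0))
    return (v1 - v0) + hedge_gain - trading_cost


dt = 1 / 252
r = accounting_pnl(s0=100.0, s1=101.0, h0=50.0, h1=55.0,
                   tau0=21 * dt, tau1=20 * dt, k=100.0,
                   sigma=0.2, kappa=0.01)
print(f"one-step accounting PnL: {r:.3f}")
```

Because the option is re-priced every step, gains and losses show up in the reward on the day they occur rather than only at expiry.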

#### Actor and Critic Models
see *models.py*
- I only have DDPG models. In my mind neither PPO nor GRPO makes sense in this context, since we would have to re-parameterize the policy to output a probability distribution over our action space. Given the high precision required to properly hedge the option, this isn't stable enough to converge (from what I've experienced)
- Added batch normalization per layer on all networks
- Added sigmoid activation to action output of actor
- I use 2 separate Q-functions (the "cost critic" and "risk critic"), modelling $E[C_t]$ and $E[C_t^2]$ respectively. This is from Cao 2019; see the paper for more details.
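
To show how two such critics could be combined, the sketch below turns estimates of $E[C_t]$ and $E[C_t^2]$ into a mean-plus-standard-deviation objective. The function name and the weight `c` (playing the role of `risk_aversion`) are assumptions; the exact form used in *models.py* and the trainer may differ.

```python
import math


def hedging_objective(q_cost, q_risk, c=1.5):
    """Combine first and second moment estimates into mean + c * std.

    q_cost ~ E[C] and q_risk ~ E[C^2]. The variance is clipped at zero
    because two separately trained networks can give E[C^2] < E[C]^2.
    """
    variance = max(q_risk - q_cost**2, 0.0)
    return q_cost + c * math.sqrt(variance)


# Example: expected cost 40 with standard deviation 300
y = hedging_objective(q_cost=40.0, q_risk=40.0**2 + 300.0**2)
print(y)  # 490.0
```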

#### Hot Start Actor
see *hot_start_actor.py*
- Supervised learning to teach actor either BS hedge (stationary vol) or Bartlett delta hedge (stochastic vol)
- The trained model works well but it's hard to train a critic function on it and then proceed with DDPG
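
For the stationary-vol case, the supervised targets could be generated along these lines. This is a sketch only: the function names, the zero interest rate, the strike of 100, and the scaling by 100 options are assumptions, and the Bartlett delta used for stochastic vol is not shown.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_delta(s, k, tau, sigma):
    """Black-Scholes call delta with zero interest rate (assumption)."""
    if tau <= 0.0:
        return 1.0 if s > k else 0.0
    d1 = (math.log(s / k) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))
    return norm_cdf(d1)


def hedge_target(s, k, tau, sigma, n_options=100):
    """Shares to hold against a short position in n_options calls."""
    return n_options * bs_delta(s, k, tau, sigma)


# (state, target) pairs for a few spot prices one month from expiry
samples = [((s, 100.0, 1 / 12, 0.2), hedge_target(s, 100.0, 1 / 12, 0.2))
           for s in (90.0, 100.0, 110.0)]
for state, target in samples:
    print(state, round(target, 2))
```

An actor trained by regression on pairs like these starts DDPG from a policy that already behaves like the analytical hedge.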

#### Hot Start Critic
see *hot_start_critic.py*
- Takes a pre-trained actor and uses a similar DDPG method (with cost and risk critic functions) to train the 2 critic functions. It seems to have trouble learning long-range dependencies. Feel free to experiment with it.

#### DDPG Trainer
see *DDPG_train.py*
- I overhauled the code to match the Cao 2019 approach. Very gradual updates.
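
One common way to get very gradual updates in DDPG is a soft (Polyak) target update, and the small `tau` passed to `DDPG_train` in the cells below suggests something similar. The sketch is an assumption about that mechanism, written with plain Python lists in place of network parameters; `soft_update` is a made-up name.

```python
def soft_update(target, online, tau):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target, online)]


target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(1000):
    target = soft_update(target, online, tau=0.00005)
print(target)
```

With `tau = 0.00005` the target has covered just under 5% of the gap after 1000 updates, which gives a sense of how slowly the target networks would move.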

#### Policy Evaluation Function
see *evaluate.py*
- Simulates paths and computes the average total objective
- Can simulate with the BS policy, the Bartlett policy, the Do Nothing policy, or a user-supplied policy
- Will use this to compare actors
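
The summary printed by the evaluation can be pictured as follows. In the logs further down, the reported objective equals the PnL term minus the risk term; how the risk term itself is computed is an assumption here (`risk_aversion` times the standard deviation of episode PnL), as are the function and variable names.

```python
import statistics


def summarize(episode_pnls, risk_aversion=1.5):
    """Return (objective, mean_pnl, risk) for a list of total episode PnLs."""
    mean_pnl = statistics.fmean(episode_pnls)
    risk = risk_aversion * statistics.pstdev(episode_pnls)
    return mean_pnl - risk, mean_pnl, risk


objective, mean_pnl, risk = summarize([-30.0, 10.0, -50.0, 20.0])
print(f"Mean Objective: {objective:.3f} | PnL: {mean_pnl:.3f}, Risk: {risk:.3f}")
```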

## Demo
Will train 3 different actors and compare their performance with the Black-Scholes and Do Nothing benchmarks. Note that the costs are roughly 100 times larger than in our previous environment (i.e. a reward of -600 is akin to a reward of -6 in our previous environments). The demo covers:
1. Analytical Models
2. Cold-start model
3. Hot-started model

In [1]:
from hedge_env import HedgingEnv
from hot_start_actor import hot_start_gen_actor_samples, hot_start_actor
from hot_start_critic import hot_start_critic_q_func, hot_start_critic_value_func
from DDPG_train import DDPG_train
from evaluate import evaluation
from models import Actor, DDPG_Cost_Critic, DDPG_Risk_Critic

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Analytical Models:
Our analytical models are already built into the evaluation function, so we simply need to call them.

In [37]:
# initialize environment
env = HedgingEnv(T = 1/12, sigma0 = 0.2, kappa = 0.01, risk_aversion = 1.5, stochastic_vol=True, n_steps = 21)

#### Evals

In [46]:
# Number of simulated episodes per policy
n = 10000

# Benchmark: policy that does nothing every step
print("'Do Nothing' policy:")
_,nothing_history = evaluation().eval_policy(env, "Nothing", episodes = n, verbose = True)

# Benchmark: analytical BS hedge
print()
print("BS policy:")
_,bs_history = evaluation().eval_policy(env, "BS", episodes = n, verbose = True)

# Trained actors: each history gets its own name so bs_history is not overwritten
print()
print("Cold Start Actor:")
_,cold_history = evaluation().eval_policy(env, torch.load(f'cold_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth'), episodes = n, verbose = True)

print()
print("Hot Start Actor:")
_,hot_history = evaluation().eval_policy(env, torch.load(f'hot_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth'), episodes = n, verbose = True)

'Do Nothing' policy:
testing...
Episode: 100 | Mean Objective: -584.040 | PnL: -42.001, Risk: 542.040
Episode: 200 | Mean Objective: -520.432 | PnL: -16.731, Risk: 503.701
Episode: 300 | Mean Objective: -527.977 | PnL: -15.074, Risk: 512.902
Episode: 400 | Mean Objective: -507.628 | PnL: 3.390, Risk: 511.018
Episode: 500 | Mean Objective: -499.918 | PnL: 6.926, Risk: 506.843
Episode: 600 | Mean Objective: -483.776 | PnL: 12.396, Risk: 496.172
Episode: 700 | Mean Objective: -494.854 | PnL: 10.304, Risk: 505.158
Episode: 800 | Mean Objective: -498.495 | PnL: 7.507, Risk: 506.002
Episode: 900 | Mean Objective: -505.522 | PnL: 5.785, Risk: 511.307
Episode: 1000 | Mean Objective: -507.858 | PnL: 4.158, Risk: 512.016
Episode: 1100 | Mean Objective: -518.831 | PnL: -1.476, Risk: 517.355
Episode: 1200 | Mean Objective: -516.209 | PnL: -1.167, Risk: 515.043
Episode: 1300 | Mean Objective: -512.268 | PnL: -0.697, Risk: 511.570
Episode: 1400 | Mean Objective: -517.110 | PnL: -1.506, Risk: 515.604

  _,bs_history = evaluation().eval_policy(env, torch.load(f'cold_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth'), episodes = n, verbose = True)


Episode: 100 | Mean Objective: -332.862 | PnL: -66.761, Risk: 266.102
Episode: 200 | Mean Objective: -350.220 | PnL: -72.445, Risk: 277.774
Episode: 300 | Mean Objective: -334.398 | PnL: -65.345, Risk: 269.052
Episode: 400 | Mean Objective: -333.536 | PnL: -62.035, Risk: 271.501
Episode: 500 | Mean Objective: -330.682 | PnL: -58.613, Risk: 272.069
Episode: 600 | Mean Objective: -326.437 | PnL: -58.143, Risk: 268.294
Episode: 700 | Mean Objective: -332.027 | PnL: -58.196, Risk: 273.832
Episode: 800 | Mean Objective: -336.711 | PnL: -59.399, Risk: 277.312
Episode: 900 | Mean Objective: -338.009 | PnL: -57.576, Risk: 280.433
Episode: 1000 | Mean Objective: -335.155 | PnL: -56.710, Risk: 278.445
Episode: 1100 | Mean Objective: -331.297 | PnL: -54.247, Risk: 277.050
Episode: 1200 | Mean Objective: -334.088 | PnL: -57.158, Risk: 276.930
Episode: 1300 | Mean Objective: -338.002 | PnL: -58.623, Risk: 279.380
Episode: 1400 | Mean Objective: -339.929 | PnL: -59.462, Risk: 280.467
Episode: 1500 |

  _,bs_history = evaluation().eval_policy(env, torch.load(f'hot_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth'), episodes = n, verbose = True)


Episode: 100 | Mean Objective: -357.046 | PnL: -75.348, Risk: 281.698
Episode: 200 | Mean Objective: -359.273 | PnL: -81.387, Risk: 277.886
Episode: 300 | Mean Objective: -387.206 | PnL: -82.838, Risk: 304.369
Episode: 400 | Mean Objective: -380.385 | PnL: -82.358, Risk: 298.028
Episode: 500 | Mean Objective: -373.584 | PnL: -79.411, Risk: 294.172
Episode: 600 | Mean Objective: -383.572 | PnL: -83.420, Risk: 300.152
Episode: 700 | Mean Objective: -371.411 | PnL: -75.958, Risk: 295.453
Episode: 800 | Mean Objective: -370.710 | PnL: -76.995, Risk: 293.715
Episode: 900 | Mean Objective: -372.090 | PnL: -77.351, Risk: 294.739
Episode: 1000 | Mean Objective: -365.562 | PnL: -73.456, Risk: 292.106
Episode: 1100 | Mean Objective: -366.908 | PnL: -74.425, Risk: 292.483
Episode: 1200 | Mean Objective: -364.893 | PnL: -73.840, Risk: 291.053
Episode: 1300 | Mean Objective: -365.173 | PnL: -73.844, Risk: 291.329
Episode: 1400 | Mean Objective: -364.837 | PnL: -73.883, Risk: 290.954
Episode: 1500 |

#### Cold Start Model:
1. Initialize actor and critic functions
2. Call DDPG train

In [44]:
# initialize
cold_actor = Actor(4,1)
cold_cost_critic = DDPG_Cost_Critic(4,1)
cold_risk_critic = DDPG_Risk_Critic(4,1)

# train
cold_obj_hist, _ = DDPG_train(cold_actor,
                            cold_cost_critic,
                            cold_risk_critic,
                            env,
                            episodes= 50000,
                            batch_size = 128,
                            lr = 0.0001,
                            tau=0.00005,
                            epsilon = 1,
                            epsilon_decay = 0.9999,
                            discount = 1,
                            eval_freq = 100,
                            min_epsilon = 0.0,
                            noisy_exploration = False,
                            inital_buffer = 1000
                            )

Training Critic (learning q function)...
Episode: 100 | Mean Objective: -1182.663 | PnL: -719.668, Risk: 462.995 | e: 0.990
Episode: 200 | Mean Objective: -1107.030 | PnL: -669.455, Risk: 437.574 | e: 0.980
Episode: 300 | Mean Objective: -1106.163 | PnL: -671.607, Risk: 434.556 | e: 0.970
Episode: 400 | Mean Objective: -1171.221 | PnL: -758.832, Risk: 412.389 | e: 0.961
Episode: 500 | Mean Objective: -1121.366 | PnL: -694.669, Risk: 426.697 | e: 0.951
Episode: 600 | Mean Objective: -1158.925 | PnL: -726.193, Risk: 432.733 | e: 0.942
Episode: 700 | Mean Objective: -1071.113 | PnL: -678.975, Risk: 392.138 | e: 0.932
Episode: 800 | Mean Objective: -1145.819 | PnL: -736.247, Risk: 409.572 | e: 0.923
Episode: 900 | Mean Objective: -1071.420 | PnL: -672.657, Risk: 398.763 | e: 0.914
Episode: 1000 | Mean Objective: -1070.463 | PnL: -682.054, Risk: 388.409 | e: 0.905
Episode: 1100 | Mean Objective: -1127.588 | PnL: -710.365, Risk: 417.223 | e: 0.896
Episode: 1200 | Mean Objective: -1237.975 | 
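
The `e:` column in the log above is consistent with a multiplicative decay applied once per episode and floored at `min_epsilon`. The helper below is a sketch under that assumption (the function name is made up); with `epsilon = 1` and `epsilon_decay = 0.9999` it reproduces the logged values at episodes 100, 200 and 1000.

```python
def epsilon_at(episode, epsilon, epsilon_decay, min_epsilon=0.0):
    """Exploration level after `episode` multiplicative decay steps."""
    return max(min_epsilon, epsilon * epsilon_decay**episode)


for ep in (100, 200, 1000):
    print(ep, f"{epsilon_at(ep, 1.0, 0.9999):.3f}")
# 100 0.990
# 200 0.980
# 1000 0.905
```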

In [45]:
torch.save(cold_actor, f'cold_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(cold_cost_critic, f'cold_cost_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(cold_risk_critic, f'cold_risk_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')

np.save(f'cold_training_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.npy', cold_obj_hist)
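
Since the same file-name pattern is repeated for every save and load, a small helper could build it once. The helper name `run_tag` and the stand-in `env_stub` object below are hypothetical; only the attribute names and the format string come from the cells above.

```python
from types import SimpleNamespace


def run_tag(env):
    """Suffix identifying the environment settings, as used in the file names."""
    return (f"{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_"
            f"{env.sigma0}_{env.kappa}_{env.risk_aversion}")


# Stand-in for HedgingEnv with the 1-month stochastic-vol settings
env_stub = SimpleNamespace(T=1 / 12, n_steps=21, stochastic_vol=True,
                           sigma0=0.2, kappa=0.01, risk_aversion=1.5)
print(f"cold_actor_{run_tag(env_stub)}.pth")
# cold_actor_1.0m_21_True_0.2_0.01_1.5.pth
```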

#### Hot-Start Model:
1. Generate Training Data
2. Hot start actor
3. Hot start critic (still not very good)
4. Do DDPG on pre-trained model

In [40]:
# initialize
hot_actor = Actor(4,1)
hot_cost_critic = DDPG_Cost_Critic(4,1)
hot_risk_critic = DDPG_Risk_Critic(4,1)

# Generate data
X, y = hot_start_gen_actor_samples(env,n_paths = 1000)

# BC Train actor
hot_actor, loss_hist = hot_start_actor(hot_actor, X, y, lr=0.001, batch_size=128, epochs = 300)

Generating Samples for hot start...
Training Actor...
Epoch: 1, Batch: 164/164 | Loss: 585.1235
Epoch: 11, Batch: 164/164 | Loss: 168.440
Epoch: 21, Batch: 164/164 | Loss: 166.432
Epoch: 31, Batch: 164/164 | Loss: 144.622
Epoch: 41, Batch: 164/164 | Loss: 85.6487
Epoch: 51, Batch: 164/164 | Loss: 80.9184
Epoch: 61, Batch: 164/164 | Loss: 83.5688
Epoch: 71, Batch: 164/164 | Loss: 85.5526
Epoch: 81, Batch: 164/164 | Loss: 96.5416
Epoch: 91, Batch: 164/164 | Loss: 102.780
Epoch: 101, Batch: 164/164 | Loss: 67.0517
Epoch: 111, Batch: 164/164 | Loss: 37.3928
Epoch: 121, Batch: 164/164 | Loss: 29.3563
Epoch: 131, Batch: 164/164 | Loss: 29.2694
Epoch: 141, Batch: 164/164 | Loss: 29.0160
Epoch: 151, Batch: 164/164 | Loss: 23.2487
Epoch: 161, Batch: 164/164 | Loss: 23.9586
Epoch: 171, Batch: 164/164 | Loss: 29.6952
Epoch: 181, Batch: 164/164 | Loss: 32.010
Epoch: 191, Batch: 164/164 | Loss: 17.5691
Epoch: 201, Batch: 164/164 | Loss: 17.338
Epoch: 211, Batch: 164/164 | Loss: 26.388
Epoch: 221, B

In [41]:
# Hot start critic
hot_obj, hot_q_guess = hot_start_critic_q_func(hot_actor,
                                            hot_cost_critic,
                                            hot_risk_critic,
                                            env,
                                            episodes= 5000,
                                            batch_size = 128,
                                            lr = 0.0005,
                                            tau=0.001,
                                            epsilon = 60.0,
                                            epsilon_decay = 0.9993,
                                            discount = 1.0,
                                            eval_freq = 100,
                                            min_epsilon = 10.0,
                                            noisy_exploration = True
                                              )

Training Critic (learning q function)...
Episode: 100 | Mean Objective: -2052.429 | Guess: -177.497, Diff: 1874.933 | e: 55.942
Episode: 200 | Mean Objective: -2053.041 | Guess: -246.102, Diff: 1806.940 | e: 52.159
Episode: 300 | Mean Objective: -1890.845 | Guess: -301.025, Diff: 1589.820 | e: 48.631
Episode: 400 | Mean Objective: -1821.772 | Guess: -351.681, Diff: 1470.091 | e: 45.343
Episode: 500 | Mean Objective: -1773.828 | Guess: -409.134, Diff: 1364.694 | e: 42.276
Episode: 600 | Mean Objective: -1671.061 | Guess: -460.460, Diff: 1210.601 | e: 39.417
Episode: 700 | Mean Objective: -1550.176 | Guess: -504.756, Diff: 1045.420 | e: 36.751
Episode: 800 | Mean Objective: -1455.615 | Guess: -545.348, Diff: 910.267 | e: 34.2667
Episode: 900 | Mean Objective: -1335.399 | Guess: -585.998, Diff: 749.401 | e: 31.948
Episode: 1000 | Mean Objective: -1243.743 | Guess: -627.285, Diff: 616.458 | e: 29.788
Episode: 1100 | Mean Objective: -1098.540 | Guess: -662.202, Diff: 436.338 | e: 27.773
Epi

In [42]:
# Continue training
hot_obj_train_hist, _ = DDPG_train(hot_actor,
                                    hot_cost_critic,
                                    hot_risk_critic,
                                    env,
                                    episodes= 20000,
                                    batch_size = 128,
                                    lr = 0.0001,
                                    tau=0.00005,
                                    epsilon = 10.00,
                                    epsilon_decay = 0.9998,
                                    discount = 1,
                                    eval_freq = 100,
                                    min_epsilon = 0.0,
                                    noisy_exploration = True,
                                    inital_buffer = 1000
                                    )

Training Critic (learning q function)...
Episode: 100 | Mean Objective: -562.257 | PnL: -312.328, Risk: 249.928 | e: 9.802
Episode: 200 | Mean Objective: -584.218 | PnL: -299.826, Risk: 284.392 | e: 9.608
Episode: 300 | Mean Objective: -641.892 | PnL: -335.275, Risk: 306.617 | e: 9.418
Episode: 400 | Mean Objective: -607.761 | PnL: -322.130, Risk: 285.631 | e: 9.231
Episode: 500 | Mean Objective: -574.870 | PnL: -325.943, Risk: 248.926 | e: 9.048
Episode: 600 | Mean Objective: -595.143 | PnL: -288.715, Risk: 306.429 | e: 8.869
Episode: 700 | Mean Objective: -514.892 | PnL: -295.794, Risk: 219.098 | e: 8.693
Episode: 800 | Mean Objective: -570.746 | PnL: -282.732, Risk: 288.015 | e: 8.521
Episode: 900 | Mean Objective: -643.795 | PnL: -326.596, Risk: 317.199 | e: 8.353
Episode: 1000 | Mean Objective: -612.937 | PnL: -322.420, Risk: 290.517 | e: 8.187
Episode: 1100 | Mean Objective: -713.236 | PnL: -320.611, Risk: 392.625 | e: 8.025
Episode: 1200 | Mean Objective: -771.052 | PnL: -315.87

In [43]:
torch.save(hot_actor, f'hot_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(hot_cost_critic, f'hot_cost_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(hot_risk_critic, f'hot_risk_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')

np.save(f'hot_training_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.npy', hot_obj+hot_obj_train_hist)
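
Whether `hot_obj+hot_obj_train_hist` concatenates or adds element-wise depends on the types returned (Python lists concatenate, NumPy arrays add). If the intent is one continuous training curve, an explicit concatenation avoids the ambiguity. The helper below is a sketch with a made-up name, and the numbers are placeholders.

```python
from itertools import chain


def join_histories(*histories):
    """Concatenate any number of list- or array-like histories into one list."""
    return [float(x) for x in chain.from_iterable(histories)]


full_curve = join_histories([-2052.4, -2053.0], (-562.3, -584.2))
print(full_curve)  # [-2052.4, -2053.0, -562.3, -584.2]
```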

#### 3M Env

In [83]:
# initialize environment
env = HedgingEnv(T = 1/4, sigma0 = 0.2, kappa = 0.01, risk_aversion = 1.5, stochastic_vol=False, n_steps = 65)

In [None]:
# initialize
cold_actor = Actor(4,1)
cold_cost_critic = DDPG_Cost_Critic(4,1)
cold_risk_critic = DDPG_Risk_Critic(4,1)

# train
cold_obj_hist, _ = DDPG_train(cold_actor,
                            cold_cost_critic,
                            cold_risk_critic,
                            env,
                            episodes= 50000,
                            batch_size = 128,
                            lr = 0.0001,
                            tau=0.00005,
                            epsilon = 1,
                            epsilon_decay = 0.9999,
                            discount = 1,
                            eval_freq = 100,
                            min_epsilon = 0.0,
                            noisy_exploration = False,
                            inital_buffer = 1000
                            )

torch.save(cold_actor, f'cold_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(cold_cost_critic, f'cold_cost_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(cold_risk_critic, f'cold_risk_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')

np.save(f'cold_training_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.npy', cold_obj_hist)

Training Critic (learning q function)...
Episode: 100 | Mean Objective: -2905.213 | PnL: -2150.071, Risk: 755.142 | e: 0.990
Episode: 200 | Mean Objective: -3143.255 | PnL: -2304.955, Risk: 838.300 | e: 0.980
Episode: 300 | Mean Objective: -2866.280 | PnL: -2168.423, Risk: 697.857 | e: 0.970
Episode: 400 | Mean Objective: -2935.323 | PnL: -2174.903, Risk: 760.420 | e: 0.961
Episode: 500 | Mean Objective: -3011.270 | PnL: -2209.826, Risk: 801.444 | e: 0.951
Episode: 600 | Mean Objective: -2925.807 | PnL: -2189.229, Risk: 736.579 | e: 0.942
Episode: 700 | Mean Objective: -3103.830 | PnL: -2248.733, Risk: 855.097 | e: 0.932
Episode: 800 | Mean Objective: -2957.910 | PnL: -2262.814, Risk: 695.096 | e: 0.923
Episode: 900 | Mean Objective: -3079.852 | PnL: -2278.127, Risk: 801.725 | e: 0.914
Episode: 1000 | Mean Objective: -2826.445 | PnL: -2162.091, Risk: 664.354 | e: 0.905
Episode: 1100 | Mean Objective: -3250.146 | PnL: -2383.462, Risk: 866.684 | e: 0.896
Episode: 1200 | Mean Objective: -

In [73]:
# initialize
hot_actor = Actor(4,1)
hot_cost_critic = DDPG_Cost_Critic(4,1)
hot_risk_critic = DDPG_Risk_Critic(4,1)

# Generate data
X, y = hot_start_gen_actor_samples(env,n_paths = 1000)

Generating Samples for hot start...
Progress: 100.00%

In [74]:
# BC Train actor
hot_actor, loss_hist = hot_start_actor(hot_actor, X, y, lr=0.001, batch_size=128, epochs = 150)

Training Actor...
Epoch: 1, Batch: 508/508 | Loss: 48.1297
Epoch: 11, Batch: 508/508 | Loss: 67.2236
Epoch: 21, Batch: 508/508 | Loss: 45.4249
Epoch: 31, Batch: 508/508 | Loss: 39.1863
Epoch: 41, Batch: 508/508 | Loss: 12.688
Epoch: 51, Batch: 508/508 | Loss: 11.357
Epoch: 61, Batch: 508/508 | Loss: 17.420
Epoch: 71, Batch: 508/508 | Loss: 12.213
Epoch: 81, Batch: 508/508 | Loss: 8.5863
Epoch: 91, Batch: 508/508 | Loss: 5.7333
Epoch: 101, Batch: 508/508 | Loss: 20.114
Epoch: 111, Batch: 508/508 | Loss: 2.3906
Epoch: 121, Batch: 508/508 | Loss: 4.3301
Epoch: 131, Batch: 508/508 | Loss: 4.6699
Epoch: 141, Batch: 508/508 | Loss: 3.1783
Epoch: 150, Batch: 508/508 | Loss: 3.7646

In [None]:
# Hot start critic
hot_obj, hot_q_guess = hot_start_critic_q_func(hot_actor,
                                            hot_cost_critic,
                                            hot_risk_critic,
                                            env,
                                            episodes= 5000,
                                            batch_size = 128,
                                            lr = 0.0005,
                                            tau=0.001,
                                            epsilon = 60.0,
                                            epsilon_decay = 0.9993,
                                            discount = 1.0,
                                            eval_freq = 100,
                                            min_epsilon = 10.0,
                                            noisy_exploration = True
                                              )

# Continue training
hot_obj_train_hist, _ = DDPG_train(hot_actor,
                                    hot_cost_critic,
                                    hot_risk_critic,
                                    env,
                                    episodes= 20000,
                                    batch_size = 128,
                                    lr = 0.0001,
                                    tau=0.00005,
                                    epsilon = 10.00,
                                    epsilon_decay = 0.9998,
                                    discount = 1,
                                    eval_freq = 100,
                                    min_epsilon = 0.0,
                                    noisy_exploration = True,
                                    inital_buffer = 1000
                                    )

torch.save(hot_actor, f'hot_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(hot_cost_critic, f'hot_cost_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(hot_risk_critic, f'hot_risk_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')

np.save(f'hot_training_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.npy', hot_obj+hot_obj_train_hist)

In [None]:
# Test performance of policy that does nothing every step
n = 10000

print("'Do Nothing' policy:")
_,nothing_history = evaluation().eval_policy(env, "Nothing", episodes = n, verbose = True)

# Test analytical BS hedge
print()
print("BS policy:")
_,bs_history = evaluation().eval_policy(env, "BS", episodes = n, verbose = True)

print()
print("Cold Start Actor:")
_,bs_history = evaluation().eval_policy(env, torch.load(f'cold_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth'), episodes = n, verbose = True)

print()
print("Hot Start Actor:")
_,bs_history = evaluation().eval_policy(env, torch.load(f'hot_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth'), episodes = n, verbose = True)

#### 3M Stoch Env

In [None]:
# initialize environment
env = HedgingEnv(T = 1/4, sigma0 = 0.2, kappa = 0.01, risk_aversion = 1.5, stochastic_vol=True, n_steps = 65)

In [None]:
# initialize
cold_actor = Actor(4,1)
cold_cost_critic = DDPG_Cost_Critic(4,1)
cold_risk_critic = DDPG_Risk_Critic(4,1)

# train
cold_obj_hist, _ = DDPG_train(cold_actor,
                            cold_cost_critic,
                            cold_risk_critic,
                            env,
                            episodes= 50000,
                            batch_size = 128,
                            lr = 0.0001,
                            tau=0.00005,
                            epsilon = 1,
                            epsilon_decay = 0.9999,
                            discount = 1,
                            eval_freq = 100,
                            min_epsilon = 0.0,
                            noisy_exploration = False,
                            inital_buffer = 1000
                            )

torch.save(cold_actor, f'cold_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(cold_cost_critic, f'cold_cost_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(cold_risk_critic, f'cold_risk_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')

np.save(f'cold_training_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.npy', cold_obj_hist)

In [81]:
# initialize
hot_actor_stoch = Actor(4,1)
hot_cost_critic = DDPG_Cost_Critic(4,1)
hot_risk_critic = DDPG_Risk_Critic(4,1)

# Generate data
X, y = hot_start_gen_actor_samples(env,n_paths = 1000)

Generating Samples for hot start...
Progress: 100.00%

In [82]:
# BC Train actor
hot_actor_stoch, loss_hist = hot_start_actor(hot_actor_stoch, X, y, lr=0.001, batch_size=128, epochs = 150)

Training Actor...
Epoch: 1, Batch: 508/508 | Loss: 78.7154
Epoch: 11, Batch: 508/508 | Loss: 49.1207
Epoch: 21, Batch: 508/508 | Loss: 31.1011
Epoch: 31, Batch: 508/508 | Loss: 46.1536
Epoch: 41, Batch: 508/508 | Loss: 26.4436
Epoch: 51, Batch: 508/508 | Loss: 10.092
Epoch: 61, Batch: 508/508 | Loss: 13.302
Epoch: 71, Batch: 508/508 | Loss: 25.181
Epoch: 81, Batch: 508/508 | Loss: 8.8741
Epoch: 91, Batch: 508/508 | Loss: 4.3059
Epoch: 101, Batch: 508/508 | Loss: 10.478
Epoch: 111, Batch: 508/508 | Loss: 5.3453
Epoch: 121, Batch: 508/508 | Loss: 11.563
Epoch: 131, Batch: 508/508 | Loss: 5.7084
Epoch: 141, Batch: 508/508 | Loss: 6.8103
Epoch: 150, Batch: 508/508 | Loss: 4.1548

In [None]:
# Hot start critic
hot_obj, hot_q_guess = hot_start_critic_q_func(hot_actor_stoch,
                                            hot_cost_critic,
                                            hot_risk_critic,
                                            env,
                                            episodes= 5000,
                                            batch_size = 128,
                                            lr = 0.0005,
                                            tau=0.001,
                                            epsilon = 60.0,
                                            epsilon_decay = 0.9993,
                                            discount = 1.0,
                                            eval_freq = 100,
                                            min_epsilon = 10.0,
                                            noisy_exploration = True
                                              )

# Continue training
hot_obj_train_hist, _ = DDPG_train(hot_actor_stoch,
                                    hot_cost_critic,
                                    hot_risk_critic,
                                    env,
                                    episodes= 20000,
                                    batch_size = 128,
                                    lr = 0.0001,
                                    tau=0.00005,
                                    epsilon = 10.00,
                                    epsilon_decay = 0.9998,
                                    discount = 1,
                                    eval_freq = 100,
                                    min_epsilon = 0.0,
                                    noisy_exploration = True,
                                    inital_buffer = 1000
                                    )

torch.save(hot_actor_stoch, f'hot_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(hot_cost_critic, f'hot_cost_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')
torch.save(hot_risk_critic, f'hot_risk_critic_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth')

np.save(f'hot_training_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.npy', hot_obj+hot_obj_train_hist)

In [None]:
# Test performance of policy that does nothing every step
n = 10000

print("'Do Nothing' policy:")
_,nothing_history = evaluation().eval_policy(env, "Nothing", episodes = n, verbose = True)

# Test analytical BS hedge
print()
print("BS policy:")
_,bs_history = evaluation().eval_policy(env, "BS", episodes = n, verbose = True)

print()
print("Cold Start Actor:")
_,bs_history = evaluation().eval_policy(env, torch.load(f'cold_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth'), episodes = n, verbose = True)

print()
print("Hot Start Actor:")
_,bs_history = evaluation().eval_policy(env, torch.load(f'hot_actor_{env.T*12:.1f}m_{env.n_steps}_{env.stochastic_vol}_{env.sigma0}_{env.kappa}_{env.risk_aversion}.pth'), episodes = n, verbose = True)