## 3.1 From Delta Fragility to minimax RL

Classical delta-hedging collapses the continuous-time Black–Scholes ideal onto a single-regime volatility. Section 2 presented empirical evidence, from the 2020-22 SPX volatility spikes, that mis-specifying σ inflates tail P&L losses. Deep-hedging papers inject neural nets, yet still train on one fitted model. Recent “robust RL” studies (e.g. Rajeswaran 2017; Merton-Heston PPO 2025) instead cast hedging as a two-player game: the hedger chooses positions, while Nature perturbs dynamics inside a statistically plausible set. We now formalise that game for latent regime-switch volatility and show that the resulting minimax problem admits a stationary optimal policy and an efficient PPO-style solver.

## 3.2 Market & Hedge Dynamics

We model an options trader who has sold an option and must dynamically hedge the position by trading the underlying asset. The goal is to minimise the final hedging shortfall when the option expires.

We first discretise the trading horizon $[0,T]$ into $N$ equal steps of length $\Delta t= \frac{T}{N}$. <br>
Let $S_t$ denote the underlying price and $C_t$ the option fair value at time $t$.

#### 3.2.1 Underlying dynamics
At each step

$$S_{t+\Delta t} = S_t \times \exp\big[(\mu - \frac{1}{2}\,\sigma_t^{2})\,\Delta t + \sigma_t\sqrt{\Delta t}\,Z_t\big] \tag{3.6}$$ 

where $Z_t \sim \mathcal N(0,1)$ and the latent volatility regime $\sigma_t$ takes the low or high value defined in Section 3.3. The drift contains the Itô correction $-\frac{1}{2}\sigma_t^2$ because the exponential map is convex (its second derivative is always positive) and, in stochastic calculus, the quadratic variation satisfies $(dW_t)^2 = dt$ rather than zero as in deterministic calculus. Subtracting this term removes the "convexity lift" implied by Jensen's inequality. It guarantees that, conditional on $S_t$ and $\sigma_t$, $\mathbb E[S_{t+\Delta t}] = S_t e^{\mu\Delta t}$, so the discounted price is a martingale under the risk-neutral measure, where $\mu = r$. <i> See Appendix A for a full treatment. </i>
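As a quick sanity check of equation (3.6), the sketch below simulates a single step many times and compares the sample mean of $S_{t+\Delta t}/S_t$ with $e^{\mu\Delta t}$. The function names `gbm_step` and `mean_growth` and all parameter values are illustrative assumptions, not calibrated figures from this chapter.

```python
import math
import random

def gbm_step(s, mu, sigma, dt, z):
    """One step of equation (3.6): exact log-normal update of the price."""
    return s * math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)

def mean_growth(mu, sigma, dt, n_paths=200_000, seed=0):
    """Monte Carlo estimate of E[S_{t+dt} / S_t] for a fixed regime sigma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += gbm_step(1.0, mu, sigma, dt, rng.gauss(0.0, 1.0))
    return total / n_paths

if __name__ == "__main__":
    mu, sigma, dt = 0.05, 0.21, 1.0 / 252.0  # hypothetical values
    # The two printed numbers should agree up to Monte Carlo noise.
    print(mean_growth(mu, sigma, dt), math.exp(mu * dt))
```

Dropping the $-\frac{1}{2}\sigma^2$ term from `gbm_step` would push the simulated mean above $e^{\mu\Delta t}$, which is the convexity lift described above.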

#### 3.2.2 From random walks to the Heston model
We now discuss how to simulate the price of an underlying asset and why such mathematical models are necessary. At its core, hedging depends on our ability to predict how derivative securities will respond to changes in the underlying asset price, market volatility and other risk factors. We therefore require mathematical models that capture the essential statistical properties of asset price movements with sufficient accuracy.
<br>
<br>

<b>The need for stochastic models. </b><br>
Many real-world observations, not just asset prices, exhibit randomness. If asset prices followed predictable patterns, arbitrage opportunities would be immediately exploited given sufficient capital. Consider an equity price chart, where every tick is buffeted by a constant stream of influences that no deterministic model could ever capture. News arrives unpredictably, traders receive private information, and market microstructure effects create noise that persists across multiple time scales. We can capture this essential randomness mathematically by writing the percentage change as 

$$\frac{dS_t}{S_t} = (\text{predictable drift})dt + (\text{unpredictable volatility})dW_t$$

where $W_t$ represents standard Brownian motion, a continuous-time mathematical representation of pure randomness. Several key properties make Brownian motion the right tool for modelling financial uncertainty.

1. $W_0 = 0$, which provides a clean reference point for measuring cumulative random effects.
2. Over any small interval $\Delta t$, the random increment $W_{t+\Delta t} - W_t \sim\mathcal N(0,\Delta t)$, which captures the idea that uncertainty accumulates gradually.
3. Increments over non-overlapping time periods are independent, reflecting the efficient market hypothesis that past price movements provide no information about future changes.

What sets Brownian motion apart is its quadratic variation: $(dW_t)^2 = dt$ rather than zero as in deterministic calculus. This means even infinitesimal random movements accumulate measurable effects over time, which is why stock price models require stochastic treatment. 
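The quadratic-variation property can be checked numerically. The following minimal sketch (function names and grid size are assumptions chosen for illustration) sums squared Brownian increments over $[0, T]$ and contrasts the result with the same sum for the smooth path $f(t) = t$.

```python
import math
import random

def quadratic_variation(T=1.0, n_steps=100_000, seed=1):
    """Sum of squared Brownian increments over [0, T] on a uniform grid."""
    rng = random.Random(seed)
    dt = T / n_steps
    return sum(rng.gauss(0.0, math.sqrt(dt)) ** 2 for _ in range(n_steps))

def smooth_quadratic_variation(T=1.0, n_steps=100_000):
    """Same sum for the smooth path f(t) = t, whose increments all equal dt."""
    dt = T / n_steps
    return n_steps * dt ** 2

if __name__ == "__main__":
    # The Brownian sum stays close to T; the smooth sum shrinks towards 0.
    print(quadratic_variation(), smooth_quadratic_variation())
```

As the grid is refined, the Brownian sum concentrates around $T$ while the smooth sum vanishes, which is the practical content of $(dW_t)^2 = dt$.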
<br>
<br>
<b> The foundation: Geometric Brownian Motion </b><br>
Multiplying the percentage-change equation through by $S_t$, with constant drift $\mu$ and constant volatility $\sigma$, gives

$$dS_t = \underbrace{\mu S_tdt}_{\text{drift term}} + \underbrace{\sigma S_tdW_t}_{\text{diffusion term i.e. volatility or "shake"}}$$

We care about percentage changes because absolute movements have different impacts depending on the existing price: a \$2 move on a \$10 stock means more than a \$2 move on a \$600 stock. Solving this SDE over one time step with Itô's lemma yields equation (3.6).
<br>
<br>

<b> The volatility puzzle: when constants betray reality </b><br>
Geometric Brownian Motion makes one *strong* assumption: volatility $\sigma$ remains constant through time. Any follower of the financial markets will know this to be false. Market practitioners observe distinct behavioural regimes: extended periods of relative calm punctuated by sudden volatility explosions, seemingly without warning. Whilst every model requires assumptions to work, we can relax this one to obtain a more accurate representation of the dynamics we seek to capture. 
<br>
<br>
<b> Stochastic volatility: modeling the volatility of volatility </b><br>
Given volatility exhibits random behaviour, it would be prudent to allow $\sigma$ itself to evolve stochastically rather than remaining frozen at some historical average. We use the Heston (1993) model by treating variance as its own mean-reverting process.

$$dS_t = \mu S_t dt + \sqrt{v_t} S_t dW_t^S$$
$$dv_t = \kappa(\theta - v_t)dt + \xi\sqrt{v_t}dW_t^v$$

The model has five parameters:
1. $v_0$ : the initial variance
2. $\theta$ : the long-run mean variance; $\mathbb{E}[v_t]$ tends to $\theta$ as $t$ tends to $\infty$
3. $\rho$ : the correlation between the two Brownian motions, i.e. $dW_t^S\,dW_t^v = \rho\,dt$
4. $\kappa$ : the mean-reversion speed, i.e. the rate at which $v_t$ reverts to $\theta$
5. $\xi$ : the volatility of volatility, which determines the variance of $v_t$

If the Feller condition $2\kappa \theta > \xi^2$ is met, then the process $v_t$ is strictly positive. 
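To make the two Heston equations concrete, here is a minimal simulation sketch. It assumes an Euler scheme with full truncation (negative variance is floored at zero inside the drift and diffusion terms), which is one common discretisation choice rather than the scheme used in this chapter; the correlation parameter is written `rho`, and all function names and parameter values are hypothetical. The helper `feller_satisfied` checks the positivity condition $2\kappa\theta > \xi^2$, often called the Feller condition.

```python
import math
import random

def feller_satisfied(kappa, theta, xi):
    """Positivity (Feller) condition 2*kappa*theta > xi^2."""
    return 2.0 * kappa * theta > xi ** 2

def simulate_heston(s0, v0, mu, kappa, theta, xi, rho, T, n_steps, seed=0):
    """Euler scheme with full truncation; returns a list of (price, variance)."""
    rng = random.Random(seed)
    dt = T / n_steps
    s, v = s0, v0
    path = [(s, v)]
    for _ in range(n_steps):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        z_s = z1
        z_v = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # corr(z_s, z_v) = rho
        v_pos = max(v, 0.0)  # truncation: discrete steps can overshoot below 0
        s *= math.exp((mu - 0.5 * v_pos) * dt + math.sqrt(v_pos * dt) * z_s)
        v += kappa * (theta - v_pos) * dt + xi * math.sqrt(v_pos * dt) * z_v
        path.append((s, v))
    return path

if __name__ == "__main__":
    params = dict(kappa=2.0, theta=0.04, xi=0.3)  # hypothetical values
    print("Feller condition holds:", feller_satisfied(**params))
    path = simulate_heston(100.0, 0.04, 0.05, rho=-0.7, T=1.0, n_steps=252, **params)
    print("final price and variance:", path[-1])
```

Note that the Feller condition concerns the continuous-time process; a discretised path can still dip below zero, which is why the truncation step is included.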

<b> Connecting to regime control </b><br>
In Section 3.3 we discuss behavioural shifts between well-defined regimes. Whilst Heston models volatility as a continuous diffusion process, we capture the real-world economic insight that markets alternate between volatility states with distinct, discrete parameters. Market participants naturally think in terms of market regimes rather than continuous stochastic processes. In practice we keep the Heston form for both regimes but assign low versus high parameter sets, letting the Markov chain decide when the parameters switch. We will see that this discrete state space enables our minimax optimisation framework while maintaining computational tractability. 

#### 3.2.3 Hedging control
The agent (trader or other acting entity) chooses a hedge position $h_t \in \mathbb R$, the number of underlying shares held at price $S_t$.
The self-financing portfolio then evolves as: <br><br>
$\Pi_{t+\Delta t} = e^{r\Delta t}\big(\Pi_t - h_t S_t\big) + h_t S_{t+\Delta t} - C_{t+\Delta t}$ <br> <br>
with $r$ the risk-free rate and $C_t$ the option’s fair value.


#### 3.2.4 Self-financing portfolio intuition
Consider an options trader who has just sold a call option to a client for price $C_0$. The trader now has a liability: depending on market movements, the trader could owe the client a payoff at expiration. To manage this risk, the trader creates a dynamic hedging portfolio that aims to offset the option liability. This portfolio consists of two components: some number of shares of the underlying ($h_t$ shares) and some cash invested at the risk-free rate.<br><br>

The portfolio must be self-financing: the trader cannot inject or withdraw money after the initial setup, so any change to the position must be funded by selling other assets within the portfolio.


<b>Mathematical Framework:</b> <br> 
We denote the portfolio value at time $t$ by $\Pi_t$; it consists of $h_t$ shares worth $h_t \times S_t$ plus a cash position. <br>
At any moment the total portfolio value is  $\Pi_t$ = (value of stock holdings) + (value of cash holdings). <br> <br>
The self-financing constraint then works as follows: <br>
Suppose the trader decides to change the hedge ratio from $h_t$ to $h_{t+\Delta t}$. If they increase stock holdings, the purchase must be funded by reducing cash holdings by exactly the same amount. Similarly, if they reduce stock holdings, the proceeds are added to the cash account.
<br> <br>
<b>Rebalancing mechanics:</b> <br> 
At time $t$, suppose the trader holds $h_t$ shares and some cash. At time $t+\Delta t$, they observe the new stock price $S_{t+\Delta t}$ and decide to adjust the hedge to $h_{t+\Delta t}$ shares. The change in stock position, multiplied by the new price, determines the corresponding adjustment to the cash holdings. <br><br>
Meanwhile, the cash holdings have grown at the risk-free rate $r$ during this period: a cash amount $X$ at time $t$ becomes $Xe^{r\Delta t}$ at time $t + \Delta t$. <br> <br>
Cash at $t+\Delta t$ = (Cash at $t$)$ \times e^{r\Delta t}$ - (Change in stock position)$\times S_{t+\Delta t}$
<br> <br>
<b>Connecting to Portfolio Value:</b> <br>  <br>
Portfolio value at time $t$ can be written as: <br> <br>
$\Pi_t = h_tS_t$ + (Cash holdings at time $t$)<br> <br>
Rearranging: <br> <br>
(Cash holdings at time $t$) = $\Pi_t - h_tS_t$ <br> <br>
After rebalancing at time $t+\Delta t$, the new portfolio value becomes: <br><br>
$\Pi_{t+\Delta t} = h_{t+\Delta t}S_{t+\Delta t}$ + (Cash holdings at time $t+\Delta t$)<br> <br>
Substituting the expression for cash holdings: <br><br>
$\Pi_{t+\Delta t} = h_{t+\Delta t}S_{t+\Delta t}$ + $(\Pi_t - h_tS_t)e^{r\Delta t} - (h_{t+\Delta t} - h_t)S_{t+\Delta t}$<br> <br>
Simplifying the stock terms: <br><br>
$\Pi_{t+\Delta t} = (\Pi_t - h_tS_t)e^{r\Delta t} + h_tS_{t+\Delta t}$<br> <br>
<b>Subtracting the call option liability: </b> <br><br>
The trader also owes the option, whose value after the step is $C_{t+\Delta t}$. Subtracting it gives the net position and recovers the update stated under Hedging control: <br><br>
$\Pi_{t+\Delta t} = (\Pi_t - h_tS_t)e^{r\Delta t} + h_tS_{t+\Delta t} - C_{t+\Delta t}$<br> <br>
<b>Economic Implications: </b> <br><br>
The self-financing constraint dictates that profits or losses within the hedging strategy come purely from the difference between the performance of the hedge portfolio and the option liability. Under the Black-Scholes assumptions (frictionless markets, continuous rebalancing and a correctly specified model), the trader could replicate the option payoff perfectly, making the hedging shortfall zero. In reality, rebalancing occurs only at discrete times, markets have transaction costs, and the underlying model is imperfect. These imperfections create the hedging shortfall that the trader seeks to minimise.
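The simplification of the stock terms in the derivation above can be verified numerically. The sketch below computes the hedge portfolio value (stock plus cash, before the option liability is subtracted) in two ways: directly from the simplified formula, and the long way by growing cash, trading to a new hedge and re-valuing. The function names and the numbers in the usage example are hypothetical.

```python
import math

def hedge_portfolio_step(pi_t, h_t, s_t, s_next, r, dt):
    """Simplified self-financing update: (Pi_t - h_t S_t) e^{r dt} + h_t S_next."""
    cash_t = pi_t - h_t * s_t  # cash implied by the holdings at time t
    return cash_t * math.exp(r * dt) + h_t * s_next

def rebalanced_value(pi_t, h_t, h_next, s_t, s_next, r, dt):
    """Same value the long way: grow cash, trade to h_next, re-value."""
    cash_next = (pi_t - h_t * s_t) * math.exp(r * dt) - (h_next - h_t) * s_next
    return h_next * s_next + cash_next

if __name__ == "__main__":
    args = dict(pi_t=10.0, h_t=0.5, s_t=100.0, s_next=102.0, r=0.03, dt=1.0 / 252.0)
    direct = hedge_portfolio_step(**args)
    for h_next in (0.2, 0.5, 0.8):
        # The new hedge h_next never changes the value at the moment of trading.
        print(h_next, abs(rebalanced_value(h_next=h_next, **args) - direct) < 1e-9)
```

The check confirms that rebalancing is value-neutral at the instant it happens: the choice of $h_{t+\Delta t}$ only affects the portfolio through the following step.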


#### 3.2.5 Terminal loss (hedging shortfall)
$L(\pi,P) = -\big(\Pi_T - C_0\big)$ <br><br>
Here $\pi$ denotes the trader's hedging policy, mapping observed states to hedge size, and $P$ is the transition matrix selected by the adversary $\Omega$. We expand on both in Section 3.3. <br>
The minus sign converts terminal P&L into a loss, because the optimisation problem is to minimise the hedging shortfall: a positive $L$ means the terminal position $\Pi_T$ has fallen short of the initial option price $C_0$, whilst a negative $L$ is a gain relative to $C_0$.

## 3.3 Market regime and uncertainty sets

#### Markets have moods
Most days an equity index drifts with some noise; occasionally it erupts into violent swings. Volatility is therefore stateful: the market switches between calm and storm moods. We capture these moods with an unobservable two-state Markov chain, so the agent must hedge without seeing the regime directly. Before we continue, let us briefly define two key terms: <br>
<b>Markov chain </b>: a stochastic process whose next state depends only on the current state. <br>
<b>Hidden regime </b>: the latent volatility state $\sigma_t$, which the agent cannot observe; the agent works instead with a filtered belief $q_t \;=\;\Pr(\sigma_t=\sigma_H \mid \mathcal F_t)$, where $\mathcal F_t$ is the information available at time $t$. 

#### Markov chain matrix

A finite-state Markov chain moves between states according to a transition matrix whose rows are probability vectors (each row must sum to 1). We denote the latent volatility state on trading day $t$ by:
$$\sigma_t\in\{\sigma_L,\;\sigma_H\}$$
where
- $\sigma_L$ = [~8%] represents the "calm" annualised volatility level
- $\sigma_H$ = [~21%] represents the "storm" annualised volatility level (typically ~3x $\sigma_L$) <br><br>

We encode the one step dynamics by:
$$P=
\begin{pmatrix}
p_{LL} & 1-p_{LL}\\
p_{HL} & 1-p_{HL}
\end{pmatrix}
\tag{3.7}$$
<br>
<br>

State L indicates a low-volatility day and state H a high-volatility day. The entries of $P$ tell us how sticky each regime or mood is. Because each row of $P$ sums to 1, $p_{LL}$ is the probability that a low-volatility regime persists and $1-p_{LL}$ is the probability of a switch to the high-volatility regime; in the second row, $p_{HL}$ is the probability that a storm reverts to calm and $1-p_{HL}$ the probability that it persists. <br> 
<i> Please see Appendix B for a refresher on finite-state Markov chains, spell lengths, and the Chapman-Kolmogorov equation.</i><br> <br>

#### Calibrating a reference matrix $\bar P$

Classifying daily SPX returns from <b> 2010 to 2024</b> into calm or storm yields the empirical estimate, rounded to two decimal places,
$$
\bar P=
\begin{pmatrix}
0.98 & 0.02\\
0.04 & 0.96
\end{pmatrix}.
\tag{3.8}
$$
These values line up with earlier regime-switch studies (Hamilton & Susmel 1994; Guo 2013): whilst the estimates differ slightly, they are within an acceptable ballpark. <br><br>
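To read the reference matrix in more intuitive terms, the sketch below converts it into expected spell lengths and long-run regime frequencies. It relies on two standard facts for a two-state chain: the holding time in a state is geometric with mean 1 / (probability of leaving), and the stationary probability of the calm state is $p_{HL} / (p_{LH} + p_{HL})$ with $p_{LH} = 1 - p_{LL}$. The function names are assumptions for illustration.

```python
P_BAR = [[0.98, 0.02],
         [0.04, 0.96]]  # reference matrix from equation (3.8), ordered (L, H)

def expected_spell_lengths(p):
    """Mean number of days spent in (L, H) before switching regime."""
    return 1.0 / (1.0 - p[0][0]), 1.0 / (1.0 - p[1][1])

def stationary_distribution(p):
    """Long-run fraction of days spent in (L, H) for a two-state chain."""
    p_lh, p_hl = p[0][1], p[1][0]
    total = p_lh + p_hl
    return p_hl / total, p_lh / total

if __name__ == "__main__":
    print("mean spell lengths (calm, storm):", expected_spell_lengths(P_BAR))
    print("stationary distribution (calm, storm):", stationary_distribution(P_BAR))
```

For the matrix in (3.8) this gives calm spells of about 50 trading days, storm spells of about 25, and roughly two thirds of all days spent in the calm regime.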

#### Addressing model risk through uncertainty sets 
Historical calibration provides our best estimate of regime transition behaviour, but it remains imperfect: a new crisis could shorten calm spells or lengthen storms. Market conditions change over time, sample periods may not capture all possible regime dynamics, and estimation error adds further uncertainty around the true transition probabilities. To guard against such model risk we give Nature (adversary $\Omega$) the right to nudge any entry of the matrix by at most $\varepsilon$ whilst preserving each row sum. This uncertainty set acknowledges that the true regime dynamics may differ systematically from our calibrated model. 

$$
\boxed{\;
\mathcal P
=\Bigl\{P:\,\|P-\bar P\|_\infty\le\varepsilon,\;P\mathbf1=\mathbf1\Bigr\}},
\qquad
\varepsilon=\mathbf{[0.02]}.
\tag{3.9}
$$

<b> Explanation of epsilon and selection </b> <br><br>
The parameter $\varepsilon$ measures how far we allow the true transition probabilities to deviate from the historical calibration; the less we trust $\bar P$, the larger $\varepsilon$ should be. Smaller values produce hedging policies closely tailored to the historical estimate: they perform well when the calibration proves accurate but may perform poorly when market conditions differ significantly from the calibration period. Larger values produce more conservative policies that maintain reasonable performance across a much wider range of regime dynamics, at the cost of some optimality when the historical estimate is accurate. This is a robustness versus performance trade-off, analogous to the classic bias versus variance trade-off.<br><br>
Thus, choosing $\varepsilon$ requires careful thought. The most natural approach builds on the statistical uncertainty inherent in estimating $\bar P$: every estimate from historical data comes with a confidence interval that reflects the precision of the estimation procedure. We now outline how to define such an interval.<br><br>
<b> Wilson Confidence Interval</b><br><br>
We base $\varepsilon$ on the Wilson 95% confidence interval for each entry of $\bar P$. Wilson's interval gives a more reliable margin of error for proportions than the textbook (Wald) method, particularly for estimates close to 0% or 100%. Taking the largest half-width across all four entries gives a data-driven bound $\varepsilon_{\text{stat}}=0.015$; we round up to 0.02 for a modest safety margin. <i> See Appendix C for a derivation of Wilson's interval. </i>
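A minimal sketch of this calculation follows, using the standard Wilson score formula. The transition counts in the usage example are purely hypothetical placeholders (the actual SPX counts are not reproduced here), so the printed value is not expected to match the 0.015 quoted above; the function names are likewise assumptions.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n))
    return centre - half, centre + half

def epsilon_from_counts(counts, z=1.96):
    """Largest Wilson half-width over all entries of the transition matrix.

    counts[i][j] is the number of observed transitions from state i to j."""
    widths = []
    for row in counts:
        n = sum(row)
        for c in row:
            lo, hi = wilson_interval(c, n, z)
            widths.append((hi - lo) / 2.0)
    return max(widths)

if __name__ == "__main__":
    hypothetical_counts = [[2450, 50], [40, 960]]  # illustrative only
    print("data-driven epsilon:", epsilon_from_counts(hypothetical_counts))
```

Because the half-width shrinks roughly with the square root of the row count, the less frequently visited regime typically determines the bound.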

<b> Infinity Norm </b><br><br>
The infinity-norm acts entry-by-entry:
$$\|P-\bar P\|_\infty=\max_{i,j}|p_{ij}-\bar p_{ij}|$$
Requiring this quantity to be $\le \varepsilon$ is equivalent to placing the same absolute tolerance $\pm \varepsilon$ on each transition probability. Graphically, the admissible points form an axis-aligned rectangle around $\bar P$ in the $(p_{LL}, p_{HL})$ plane (see Figure 3.1). <br> This provides several benefits for practitioners. 
1. The desk can ask "How wrong might the calm-to-storm probability be?" without simultaneously worrying about the storm-to-calm leg; each has its own $\pm\varepsilon$ band.
2. Stress scenarios are easy to construct by moving transition probabilities to their lower or upper bands, for example to shorten calm spells and lengthen storm spells. Such stresses are simple to explain to non-quant stakeholders, because only one intuitive metric changes at a time.
3. A rectangle is convex, so the robust-MDP machinery guarantees the minimax value is reached by a deterministic stationary policy and admits efficient dynamic-programming evaluation. This provides tractability.<br><br> 
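The rectangle can be made concrete with a short sketch that tests membership in the uncertainty set of equation (3.9) and lists its four corner matrices. The sketch additionally assumes that every entry must remain a valid probability in $[0, 1]$, which (3.9) leaves implicit; the function names are hypothetical.

```python
def in_uncertainty_set(p, p_bar, eps, tol=1e-12):
    """True if p has rows summing to 1, entries in [0, 1] (assumed), and
    entrywise distance at most eps from p_bar."""
    for row, row_bar in zip(p, p_bar):
        if abs(sum(row) - 1.0) > tol:
            return False
        for x, x_bar in zip(row, row_bar):
            if x < -tol or x > 1.0 + tol or abs(x - x_bar) > eps + tol:
                return False
    return True

def _clip(x):
    """Keep a probability inside [0, 1]."""
    return min(1.0, max(0.0, x))

def rectangle_corners(p_bar, eps):
    """Corner matrices of the eps-rectangle in the (p_LL, p_HL) plane."""
    corners = []
    for d_ll in (-eps, eps):
        for d_hl in (-eps, eps):
            p_ll = _clip(p_bar[0][0] + d_ll)
            p_hl = _clip(p_bar[1][0] + d_hl)
            corners.append([[p_ll, 1.0 - p_ll], [p_hl, 1.0 - p_hl]])
    return corners

if __name__ == "__main__":
    p_bar = [[0.98, 0.02], [0.04, 0.96]]
    for corner in rectangle_corners(p_bar, 0.02):
        print(corner, in_uncertainty_set(corner, p_bar, 0.02, tol=1e-9))
```

The corner with both $p_{LL}$ and $p_{HL}$ at their lower bands is the "shorter calm spells, longer storms" stress described in the list above.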

<b>The significance of convexity </b><br><br>
$\mathcal P$ is an axis-aligned rectangle, hence convex. Convexity is essential because it triggers the existence guarantees established by Iyengar (2005) for robust Markov Decision Processes: when uncertainty sets are convex and satisfy additional regularity conditions, deterministic stationary minimax policies exist and can be computed using standard dynamic programming techniques.<br><br>
These guarantees matter when we solve minimax optimisation problems. In our hedging application, we want policies that minimise the worst-case loss over all transition matrices in the uncertainty set. Without such guarantees, these problems may have no solution, multiple solutions, or solutions that are too computationally expensive to obtain.

<b>[TODO: state and prove the convexity lemma formally]</b><br>
<b>[TODO: cite Iyengar 2005 and show proof why convexity enables existence of optimal policies]</b><br>

#### Game structure and information flow
$$\boxed{%
\Omega: P\in\mathcal P \;\Longrightarrow\;
\sigma_t \xrightarrow{\text{Heston}} S_t \xrightarrow{\pi} h_t }$$

<b> Episode start </b>: Nature $\Omega$ secretly draws a matrix $P$ inside the $\varepsilon$-rectangle.<br>
<b> Daily loop </b>: The hidden chain generates $\sigma_t$, which drives the price dynamics; the agent observes prices, updates the belief $q_t$ and sets the hedge $h_t = \pi(X_t)$, where $X_t$ denotes the agent's observed state. <br>
The information structure reflects the asymmetry a trader would face in practice. Nature's choice of $P$ and the resulting regime path remain hidden throughout the episode, forcing the trader to make hedging decisions based on probabilistic beliefs about the current volatility state rather than perfect regime knowledge.
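One possible way to compute the belief update in the daily loop is a two-state Bayes filter, sketched below. It rests on several assumptions that the text does not fix: the agent filters with a known matrix (for instance the reference $\bar P$, since Nature's $P$ is hidden), log-returns are Gaussian given the regime as in equation (3.6), and the regime at time $t$ drives the return observed at $t$. The names `normal_pdf` and `update_belief` are hypothetical.

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def update_belief(q_prev, log_return, p, sigma_l, sigma_h, mu, dt):
    """One filter step for q_t = Pr(sigma_t = sigma_H | F_t).

    p is a 2x2 transition matrix ordered (L, H); log_return is
    log(S_t / S_{t-dt})."""
    # Predict: propagate the previous belief through the Markov chain.
    prior_h = (1.0 - q_prev) * p[0][1] + q_prev * p[1][1]
    # Correct: weight each regime by the likelihood of the observed return.
    like_l = normal_pdf(log_return, (mu - 0.5 * sigma_l ** 2) * dt, sigma_l * math.sqrt(dt))
    like_h = normal_pdf(log_return, (mu - 0.5 * sigma_h ** 2) * dt, sigma_h * math.sqrt(dt))
    num = prior_h * like_h
    return num / (num + (1.0 - prior_h) * like_l)

if __name__ == "__main__":
    p_bar = [[0.98, 0.02], [0.04, 0.96]]
    q = 0.1
    for r in (0.001, -0.002, -0.03, 0.025):  # hypothetical daily log-returns
        q = update_belief(q, r, p_bar, 0.08, 0.21, 0.0, 1.0 / 252.0)
        print(f"return {r:+.3f} -> belief in storm regime {q:.4f}")
```

Small returns pull the belief towards the calm regime, while a single large move is enough to push it close to one, which mirrors how a trader would infer the regime from price action alone.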

## 3.4 Conditional Value at Risk (CVaR)

### 3.4.1 Why tail risk matters
A delta-hedged option book is usually sleepy: 95% of the time P&L wiggles within a narrow daily band. Most trading days follow predictable patterns, most market movements fall within expected ranges, and most hedging strategies work reasonably well under normal conditions. Blow-ups happen in the remaining 5% of days, triggered by events such as earnings calls, flash crashes and pandemics. If risk metrics ignore the magnitude of these bad days, significant losses can go unnoticed until they occur. <br><br>
This asymmetry between frequency and impact becomes particularly acute in regime-switching environments. During calm volatility periods, hedging errors remain manageable and traditional risk measures provide reasonable guidance. However, when markets transition into high volatility regimes, the distribution of potential losses develops much fatter tails, placing more probability mass in extreme loss regions. Hedging strategies therefore generate their largest errors precisely when protection is needed most. The focus of modern trading desks has accordingly shifted to tail risk, and our robust hedging framework should optimise a statistic that cares about the size of the tail, not just its probability. 

### 3.4.2 Definition of Conditional Value at Risk
For a loss random variable $L$ and confidence level $\alpha$ (assume 95%), the Value at Risk $$\operatorname{VaR}_\alpha(L) = \text{inf}\{l :Pr[L\le l]\ge \alpha\}$$
tells us the smallest loss threshold to cover 95% of all possible days (assuming $\alpha$ is 95%).

Whilst this is useful, it does not tell us by how much losses exceed that threshold once it is breached. Conditional Value at Risk accounts for this by averaging over the worst $1 - \alpha$ of scenarios. $$\boxed{\;
\mathrm{CVaR}_{\alpha}(L)=
\mathbb E\bigl[L\;\big|\;L>\mathrm{VaR}_{\alpha}(L)\bigr]
=\frac{1}{1-\alpha}\,\mathbb E\bigl[L\,\mathbf 1\{L>\mathrm{VaR}_{\alpha}(L)\}\bigr]
\;}
\tag{3.10}$$
The second equality holds when $L$ has a continuous distribution, so that $\Pr[L>\mathrm{VaR}_{\alpha}(L)]=1-\alpha$. Put simply, CVaR is the average loss in the worst 5% of scenarios. Consider a scenario where 95% VaR is \$1m but 95% CVaR is \$4m. The tail is fat: on average, the bad 5% of days are far worse than \$1m.
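
The definition can be checked on a small sample. The sketch below is illustrative only: the function name `empirical_var_cvar` and the 100-day loss sample are assumptions chosen to reproduce the \$1m / \$4m example, not data from this study.

```python
import math

def empirical_var_cvar(losses, alpha=0.95):
    """Return (VaR, CVaR) of an equally weighted loss sample at level alpha."""
    ordered = sorted(losses)
    n = len(ordered)
    # Smallest sample value l with empirical Pr[L <= l] >= alpha.
    # The tiny offset guards against floating-point noise in alpha * n.
    k = math.ceil(alpha * n - 1e-9) - 1
    var = ordered[k]
    tail = ordered[k + 1:]                      # outcomes strictly worse than VaR
    cvar = sum(tail) / len(tail) if tail else var
    return var, cvar

# Hypothetical sample (losses in $m): 94 quiet days, one day at the VaR level,
# and five blow-up days.
losses = [0.1] * 94 + [1.0] + [2.0, 3.0, 4.0, 5.0, 6.0]
var95, cvar95 = empirical_var_cvar(losses, alpha=0.95)
print(var95, cvar95)
```

Here the 95% VaR is 1.0 while the average of the five worse days is 4.0, which is the gap between threshold and severity that CVaR is designed to expose.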

### 3.4.3 Why CVaR dominates VaR for hedging
<b> Captures severity, not just the cut-off </b><br>
Assume two trading books have identical VaR but wildly different CVaR. Only CVaR will flag which book is liable to explode. <br><br>
<b> Coherent and convex (Artzner et al., 1999)</b><br>
CVaR satisfies sub-additivity, meaning that combining portfolios never produces a CVaR larger than the sum of the individual CVaRs, a property that aligns with economic intuition about risk reduction through diversification. VaR is not sub-additive in general and can actually increase after diversification, which makes it unsuitable for portfolio optimisation. The coherence properties ensure that CVaR provides economically sensible guidance for risk management decisions. <br><br>
<b> Optimisation-friendly </b><br>
Rockafellar & Uryasev (2000) showed CVaR can be written as the minimum of a convex objective, so gradient-based algorithms and DP recursion apply without hacks.

### 3.4.4 CVaR under model-uncertainty
Our hedge problem is a two-player game. The agent has a policy $\pi$ which determines how many shares to hold each day, and it seeks to minimise the CVaR. Nature, our adversary, chooses a transition matrix $P\in\mathcal P(\varepsilon)$ and seeks to maximise this same CVaR. Given that Nature moves after seeing our policy, we must prepare for the worst-case CVaR. The reader may question why we play a game in which Nature can observe our strategy and then craft its own. This information asymmetry can seem almost 'unfair' and not reflective of reality. However, the objective is to create a policy with which our agent can weather extreme events. It is akin to preparing arguments for a debate when your opponent has already seen them. We must ensure that no matter how strong our opponent's move is, our policy is robust. <br><br>
<b> Necessity of convexity </b><br>
First, consider the possibility that our objective function is not convex and exhibits multiple local minima scattered throughout the parameter space. Our optimisation algorithm may converge to one of these local minima and report that the optimal solution has been obtained when, in reality, a better solution exists elsewhere. <br><br>
Consider another circumstance whereby our objective function exhibits discontinuities or undefined regions. Such instances can cause our algorithm to crash, produce inconsistent results, or converge to points that represent mathematical artifacts rather than economically meaningful solutions. Such cases can waste significant computational time whilst simultaneously yielding unreliable results. 
<br><br>
Suppose we have two policies. 
- Policy A involves aggressive position adjustments that produce a CVaR of 2 million under certain market conditions. 
- Policy B involves conservative position management that produces CVaR of 4 million under the same conditions.

Assume we now blend these two strategies, holding a fraction $\lambda$ of the position prescribed by A and $1-\lambda$ of the position prescribed by B. Convexity guarantees that the blended strategy has a CVaR no greater than the corresponding weighted average of 2 million and 4 million, so even the worst blend cannot exceed 4 million. With a 50/50 blend, the CVaR is at most 3 million. Whilst this seems trivial and almost obvious, not all mathematical functions operate this way. Mathematically, for any $\lambda\in[0,1]$
$$\mathrm{CVaR}_{\alpha}\bigl(\lambda L_A+(1-\lambda)L_B\bigr)
\;\le\;
\lambda\,\mathrm{CVaR}_{\alpha}(L_A)
+(1-\lambda)\,\mathrm{CVaR}_{\alpha}(L_B).$$
<i> Please see Appendix D for sketch and proof of Rockafellar and Uryasev (2000).</i><br><br>
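
The inequality can be spot-checked numerically. The following is a minimal sketch under stated assumptions: both books are evaluated on the same simulated scenarios, CVaR is estimated as the mean of the worst 5% of equally likely outcomes, and the Gaussian loss samples are hypothetical rather than output of the hedging model.

```python
import random

def tail_mean(losses, alpha=0.95):
    """Average of the worst (1 - alpha) share of equally likely scenarios."""
    m = max(1, round((1 - alpha) * len(losses)))
    return sum(sorted(losses)[-m:]) / m

rng = random.Random(0)
n = 1000
loss_a = [rng.gauss(0.0, 2.0) for _ in range(n)]   # hypothetical aggressive book
loss_b = [rng.gauss(0.5, 1.0) for _ in range(n)]   # hypothetical conservative book

gaps = []
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    # Scenario-by-scenario blend of the two loss profiles.
    blended = [lam * a + (1 - lam) * b for a, b in zip(loss_a, loss_b)]
    lhs = tail_mean(blended)
    rhs = lam * tail_mean(loss_a) + (1 - lam) * tail_mean(loss_b)
    gaps.append(rhs - lhs)          # convexity predicts a non-negative gap

print(gaps)
```

Every gap should be non-negative up to rounding, with equality at $\lambda=0$ and $\lambda=1$.
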
<b> Result of convexity </b><br>
Returning to the uncertainty set in section 3.3, we note that each point in the set represents a different set of regime dynamics that Nature may choose. A priori, Nature's choice could lie anywhere in the set. However, given that CVaR is convex, its value at any interior point, which is itself a weighted average of the extreme corner points, cannot exceed the largest value attained at those corners. At least one corner is therefore as bad as any interior point, so Nature, which always seeks to maximise this objective, can restrict itself to the corners. This means that Nature need only enumerate four specific transition matrices, changing the problem from continuous optimisation to a discrete one and significantly reducing the computational load.

### 3.4.5 Economic Interpretation
Each corner of the uncertainty set corresponds to a concrete stress scenario rather than an abstract mathematical construction: we are simply shifting entries of the calibrated matrix by $\varepsilon$ in one direction or the other.
- The bottom left corner reflects a market in which calm periods are less likely to persist and storm days are more likely to persist.
- The top right corner reflects the opposite: calm periods are more persistent and storm periods are more likely to revert to calm.
- The other corners give mixed cases, capturing realistic possibilities where different market conditions are impacted differently.
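
The corner construction can be sketched as follows. Since section 3.3 is not reproduced here, the parametrisation is an assumption: a two-state chain described by its persistence probabilities `p_ll` and `p_hh`, each shifted by plus or minus `eps`. The helper name `corner_matrices` and the numerical values are hypothetical.

```python
from itertools import product

def corner_matrices(p_ll, p_hh, eps):
    """Four corners of the eps-rectangle around a 2x2 transition matrix.

    Rows are (Low -> Low, Low -> High) and (High -> Low, High -> High).
    """
    corners = []
    for d_ll, d_hh in product((-eps, eps), repeat=2):
        a = min(max(p_ll + d_ll, 0.0), 1.0)   # clamp to a valid probability
        b = min(max(p_hh + d_hh, 0.0), 1.0)
        corners.append(((a, 1.0 - a), (1.0 - b, b)))
    return corners

# Hypothetical calibrated persistence probabilities and radius.
corners = corner_matrices(p_ll=0.95, p_hh=0.80, eps=0.03)
for P in corners:
    print(P)
```

The first corner printed lowers both persistence probabilities, and the second lowers calm persistence while raising storm persistence, which corresponds to the stress scenario in the first bullet.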

## 3.5 The MinMax Objective

We can now write the complete mathematical formulation of our robust hedging game. Our agent, as discussed above, seeks a policy to minimise the worst-case CVaR across all possible regime dynamics that Nature might choose. We denote the minimax objective in the Rockafellar-Uryasev (hinge) form:
$$
\boxed{\;
\min_{\pi\in\Pi}
\;\max_{P\in\mathcal P(\varepsilon)}
\;\min_{\zeta\in\mathbb R}
\Bigl\{
\zeta
+\frac{1}{1-\alpha}\;
\mathbb{E}_{\tau\sim(\pi,P)}
\!\bigl[(L(\tau)-\zeta)_+\bigr]
\Bigr\}
\;}
\tag{3.11}
$$

The agent (policy $\pi$) chooses a hedge strategy. <br>
Nature picks the worst regime dynamics $P$ within the uncertainty set $\mathcal P(\varepsilon)$. <br>
The buffer $\zeta$ self-adjusts to the VaR level, converting the tail-average CVaR into a convex objective. <i> Please see Appendix E for a detailed breakdown of the self-adjustment to VaR. </i> <br>

### 3.5.1 Helping optimisation through $\zeta$
We note the inclusion of a new buffer variable $\zeta$ which transforms eq 3.10 into eq 3.11. <br><br>
<b> Drawback of calculating CVaR alone </b><br>
Typically, we collect all possible loss outcomes, sort them, find the percentile which matches our $\alpha$, and average everything above that threshold. <br>
However, sorting and selecting a percentile is inherently discontinuous: a small change in the policy can reorder the loss distribution, which makes gradients ill-defined. <i> Please see Appendix F for a detailed breakdown of the issues that arise on this matter. </i> <br> <br>
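
To see that the buffer form recovers the sorted tail average, the sketch below evaluates the Rockafellar-Uryasev objective on a small sample. The function name `ru_objective` and the loss sample (the same hypothetical \$1m / \$4m example used in 3.4.2) are assumptions for illustration.

```python
def ru_objective(zeta, losses, alpha=0.95):
    """Rockafellar-Uryasev objective: zeta + E[(L - zeta)_+] / (1 - alpha)."""
    excess = sum(max(l - zeta, 0.0) for l in losses) / len(losses)
    return zeta + excess / (1.0 - alpha)

# Hypothetical sample (losses in $m) with VaR 1.0 and tail average 4.0.
losses = [0.1] * 94 + [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# The objective is piecewise linear and convex in zeta, so a minimiser
# can always be found among the sample points themselves.
best_zeta = min(set(losses), key=lambda z: ru_objective(z, losses))
best_value = ru_objective(best_zeta, losses)
print(best_zeta, best_value)
```

The minimum value matches the tail average of 4.0 and is attained at a buffer between the VaR and the next-worst loss, without any explicit percentile selection inside the objective.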

### 3.5.2 Swapping Min and Max
From eq 3.11, we note the computationally challenging nested structure: for every $\pi$ we consider, we must solve an inner optimisation problem to find the worst-case $P$ and the optimal $\zeta$. If we could swap the order of $P$ and $\zeta$, we could solve for the policy and threshold jointly first, then find Nature's response. This is much more tractable algorithmically, as we would not need to solve Nature's problem as a subroutine within every iteration of our policy search. Fortunately we can do this because, for any fixed policy $\pi$ and threshold $\zeta$, the objective function $\zeta + \frac{1}{1-\alpha}\mathbb{E}[(L^\pi(P) - \zeta)_+]$ is linear in the probability measure over paths that $P$ induces. <br><br>
<b>Linear in P explained</b><br>
Suppose we have our transition matrix and we wished to calculate the expected loss over a two-day period starting from a low volatility state. Observe the four outcomes
$$
\begin{aligned}
&\text{Low}\rightarrow\text{Low}\rightarrow\text{Low}: && P_{LL}\,P_{LL}\\
&\text{Low}\rightarrow\text{Low}\rightarrow\text{High}: && P_{LL}\,P_{LH}\\
&\text{Low}\rightarrow\text{High}\rightarrow\text{Low}: && P_{LH}\,P_{HL}\\
&\text{Low}\rightarrow\text{High}\rightarrow\text{High}: && P_{LH}\,P_{HH}
\end{aligned}
$$
Suppose now that we have fixed our hedging policy $\pi$ and threshold $\zeta$ for this example. Each of the four trajectories then has a determined loss. $P$ only changes how likely a path is, not the dollar amount attached to it. <br>
Let us now say the losses are 1.2, 2.8, 3.5 and 4.9 million respectively. Observe the expectation inside the objective function

$$\mathbb{E}\bigl[(L^\pi(P) - \zeta)_+\bigr] = \underbrace{(1.2-\zeta)_+}_{\text{fixed}}\,\underbrace{P_{LL}\,P_{LL}}_{\text{can change}} + \underbrace{(2.8-\zeta)_+}_{\text{fixed}}\,\underbrace{P_{LL}\,P_{LH}}_{\text{can change}} + \underbrace{(3.5-\zeta)_+}_{\text{fixed}}\,\underbrace{P_{LH}\,P_{HL}}_{\text{can change}} + \underbrace{(4.9-\zeta)_+}_{\text{fixed}}\,\underbrace{P_{LH}\,P_{HH}}_{\text{can change}}$$

Since weighted sums are linear in their weights, this expectation is linear in the path probabilities induced by $P$, which is precisely the property required by Sion's minimax theorem to justify swapping the optimisation order. This linearity holds irrespective of the number of market states or the trading horizon. <br><br>
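
The two-day example can be reproduced directly. In this sketch the two transition matrices and the threshold `zeta = 3.0` are hypothetical values, and the helper names are illustrative. Note that the linearity being checked is in the path weights; the entries of $P$ enter those weights through products.

```python
def path_probs(P):
    """Probabilities of the four two-day regime paths starting from Low."""
    (p_ll, p_lh), (p_hl, p_hh) = P
    return [p_ll * p_ll, p_ll * p_lh, p_lh * p_hl, p_lh * p_hh]

def tail_objective(weights, path_losses, zeta, alpha=0.95):
    """zeta + E[(L - zeta)_+] / (1 - alpha) for given path weights."""
    hinge = [max(l - zeta, 0.0) for l in path_losses]
    return zeta + sum(w * h for w, h in zip(weights, hinge)) / (1.0 - alpha)

path_losses = [1.2, 2.8, 3.5, 4.9]          # losses from the text, in $m
zeta = 3.0                                  # hypothetical threshold
P1 = ((0.95, 0.05), (0.20, 0.80))           # hypothetical matrices
P2 = ((0.92, 0.08), (0.17, 0.83))

w1, w2 = path_probs(P1), path_probs(P2)
mixed = [0.5 * a + 0.5 * b for a, b in zip(w1, w2)]

# Linearity in the weights: objective of the mixture equals mixture of objectives.
lhs = tail_objective(mixed, path_losses, zeta)
rhs = 0.5 * tail_objective(w1, path_losses, zeta) + 0.5 * tail_objective(w2, path_losses, zeta)
print(lhs, rhs)
```

The two printed numbers agree up to rounding, and only the weights, never the per-path losses, change between `P1` and `P2`.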
With this in mind, we now obtain a more tractable form where we solve for the policy and threshold jointly, then find Nature's worst case response.
$$
\boxed{\;
\min_{\pi\in\Pi}
\;\min_{\zeta\in\mathbb R}
\;\max_{P\in\mathcal P(\varepsilon)}
\Bigl\{
\zeta
+\frac{1}{1-\alpha}\;
\mathbb{E}_{\tau\sim(\pi,P)}
\!\bigl[(L(\tau)-\zeta)_+\bigr]
\Bigr\}
\;}
\tag{3.11'}
$$

### 3.5.3 Nature's simplified problem: The Extreme Point Theorem
For a fixed ($\pi, \zeta$), the maximiser $$P^*=\arg\max_{P\in\mathcal P(\varepsilon)}
\mathbb E[(L^{\pi}(P)-\zeta)_+]$$
is a vertex of the $\varepsilon$-rectangle $\mathcal P(\varepsilon)$. Recall from section 3.4 that Nature need only evaluate the objective function at the four specific corner matrices and select the transition matrix which maximises CVaR. For any candidate ($\pi, \zeta$) we can therefore enumerate these four matrices, compute CVaR exactly, and pick the worst-case scenario without numerical tolerance issues. This discrete step is what makes the robust PPO algorithm in Chapter 4 both fast and exact.
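
Nature's enumeration step can be sketched as below. This is a toy under stated assumptions, not the Chapter 4 implementation: the chain has two regimes (0 = Low, 1 = High) and starts in Low, the rectangle is parametrised by the two persistence probabilities, and the per-path loss `toy_loss` is a hypothetical stand-in that grows with the number of high-volatility days.

```python
from itertools import product

def expected_tail(P, zeta, n_days, loss_of_path):
    """Exact E[(L - zeta)_+] by enumerating every regime path from Low."""
    total = 0.0
    for path in product((0, 1), repeat=n_days):
        prob, prev = 1.0, 0
        for regime in path:
            prob *= P[prev][regime]
            prev = regime
        total += prob * max(loss_of_path(path) - zeta, 0.0)
    return total

def worst_corner(p_ll, p_hh, eps, zeta, n_days, loss_of_path):
    """Evaluate the four corner matrices and return Nature's maximiser."""
    corners = [((a, 1.0 - a), (1.0 - b, b))
               for a, b in product((p_ll - eps, p_ll + eps),
                                   (p_hh - eps, p_hh + eps))]
    return max(corners, key=lambda P: expected_tail(P, zeta, n_days, loss_of_path))

def toy_loss(path):
    """Hypothetical loss: each high-volatility day adds 1.5 to a base of 1.0."""
    return 1.0 + 1.5 * sum(path)

P_star = worst_corner(0.95, 0.80, 0.03, zeta=2.0, n_days=5, loss_of_path=toy_loss)
print(P_star)
```

With this toy loss the selected corner is the one where calm is least persistent and storms are most persistent, as the economic interpretation in 3.4.5 suggests.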

### 3.5.4 Computational structure and implementation
Each training epoch repeats three steps:
1. Nature enumerates the four rectangle corners and picks $P^*$
2. Buffer update: $\displaystyle \zeta\leftarrow\zeta-\eta_\zeta\Bigl[1-\frac{1}{1-\alpha}\,\widehat{\mathbb P}(L>\zeta)\Bigr]$ whose gradient vanishes exactly at $\operatorname{VaR}_{\alpha}$
3. Policy update: apply PPO to minimise the tail loss $(L^{\pi}-\zeta)_+$
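
Step 2 can be illustrated in isolation. The sketch below applies the buffer update to a fixed batch of losses and checks that $\zeta$ settles near the empirical 95% quantile. The standard-normal loss sample, learning rate and step count are hypothetical choices, and the policy and Nature are held fixed.

```python
import bisect
import random

def track_var(losses, alpha=0.95, lr=0.01, n_steps=2000, zeta0=0.0):
    """Iterate zeta <- zeta - lr * [1 - P_hat(L > zeta) / (1 - alpha)]."""
    ordered = sorted(losses)
    n = len(ordered)
    zeta = zeta0
    for _ in range(n_steps):
        # Empirical exceedance probability P_hat(L > zeta) via binary search.
        exceed = (n - bisect.bisect_right(ordered, zeta)) / n
        zeta -= lr * (1.0 - exceed / (1.0 - alpha))
    return zeta

rng = random.Random(1)
losses = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # hypothetical loss batch
zeta_hat = track_var(losses)
exceed_frac = sum(l > zeta_hat for l in losses) / len(losses)
print(zeta_hat, exceed_frac)
```

At the resting point roughly 5% of the batch lies above the buffer, which is the sense in which the gradient vanishes at $\operatorname{VaR}_{\alpha}$.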




## 3.6 Existence and stationarity

### 3.6.1 What must be proved
Before training an agent, two guarantees are required.
1. Existence: The minimax CVaR value must be finite and attainable.
2. Stationarity: An optimal policy can be kept time-homogeneous, i.e. one decision rule $\pi^*:S\!\to\!A$ suffices for all dates.

It is important to note that time to expiry is included in the state space and as such we maintain a finite state set and do not violate Iyengar's theorem. Our neural network, architecture discussed in Chapter 4, will receive "time till expiry" as an input feature. 

### 3.6.2 The three pillars of Existence
Because the state and action spaces are finite, the uncertainty set $\mathcal P(\varepsilon)$ is a compact rectangle, and the objective is convex-concave, the robust MDP results of Iyengar (2005) and Nilim & El Ghaoui (2005) apply. They guarantee that an optimal stationary policy $\pi^\star$ and a saddle point $(\pi^{\star},\zeta^{\star},P^{\star})$ exist, and that policy-gradient methods converge to this minimax value.


<b> Finite State and Action Spaces </b><br>
Consider a function $f(x) = x$ on the real line and suppose we seek its maximum value. There is none, since we can always choose a larger $x$: the function is unbounded. The finite number of volatility regimes prevents this issue. On a finite set, any function attains its maximum and minimum values, so the optimisation cannot escape to infinity.
<br>
<br>
<b> Compact Uncertainty Set </b><br>
Given that $\mathcal P(\varepsilon)$ is a compact rectangle, it has two properties:

1. Closed: it contains all of its boundary points. The worst case is therefore attained on the set itself rather than only approached asymptotically, which is what allows the Extreme Point Theorem to deliver an attained maximum and minimum.
2. Bounded: it does not extend to infinity in any direction.


We know our uncertainty set is defined by $\|P-\bar P\|_\infty\le\varepsilon$, which creates a rectangular region around the calibrated transition matrix $\bar P$. It is bounded because entries cannot deviate by more than $\varepsilon$, and closed because it includes the boundary where deviations equal $\varepsilon$. Since a continuous function attains its maximum and minimum on a compact set, Nature can actually find the worst-case transition matrix $P^*$.
<br>
<br>
<b> Convex-Concave Structure </b><br>
Our objective function has a special mathematical structure
- Convex in ($\pi, \zeta$) : Agent's variables
- Concave in $P$ : Nature's variable

Under these properties, Sion's theorem gives

$$\min_{x \in X}\,\max_{y \in Y} f(x,y) = \max_{y \in Y}\,\min_{x \in X} f(x,y)$$

which tells us that the worst-case scenario Nature can create against our best strategy is the same as the best strategy we can deploy against Nature's worst case. This is why it fundamentally makes no difference that Nature moves second.

### 3.6.3 Why Stationarity emerges
In our setup we posit the following:
1. Markov property: future regime transitions depend only on the current regime, not the entire history. This may seem counterintuitive given the assumption that calm days tend to follow calm days; however, it simply means that tomorrow depends only on today, and the day after depends only on tomorrow. If we go from calm to storm, then today's calm spell will no longer be a factor.
2. Finite horizon: the episode ends at option expiration.
3. Stationary environment: the transition probabilities do not change over time.

These conditions mean that, once time to expiry is included as a state feature, the optimal decision depends only on the current state and not separately on calendar time. This is precisely why we are able to utilise a stationary policy $\pi^*$ rather than a time-dependent policy $\pi^*_t(s_t)$.

### 3.6.4 Practical implications 
These theoretical guarantees have direct computational benefits:
1. Our RL algorithm is guaranteed to converge to the global optimum
2. We benefit from simpler stationary policies rather than more complicated time-dependent ones; we therefore do not require time-indexed policy weights, resulting in fewer parameters
3. The minimax value provides a meaningful worst-case performance bound ensuring robustness.



### 3.7 Algorithm Choice
Having established our minimax CVaR objective and proven the existence of optimal stationary policies, we now face the practical question: which learning engine can shoulder all four structural challenges?
- continuous hedge ratios
- hidden (unobservable) volatility regime states
- tail risk (CVaR) focus
- adversarial dynamics (corner-matrix Nature)

As such, we require a practical RL engine that can accomplish the following
- Discover that policy in a continuous hedge space
- Cope with hidden-regime uncertainty
- Align neatly with the CVaR-buffer objective whilst remaining stable under adversarial stress

#### 3.7.1 Why Policy Gradient Methods

The nature of our hedging problem rules out classical value-based reinforcement learning approaches such as Q-learning or DQN (insert reference). At each trading period, our agent must choose a hedge ratio $h_t \in \mathbb R$, representing the number of shares to hold. This continuous action space gives an uncountably infinite set of actions, which traditional tabular methods cannot enumerate.
<br>
<br>
One may argue for discretising the action space into bins (e.g. hedge ratios from -2 to +2 in increments of 0.1); however, this introduces constraints that may be costly. A hedge ratio of 0.45 may be optimal, but forcing the agent to choose between 0.4 and 0.5 could lead to significant errors over time. Additionally, achieving acceptable performance would require such a fine grid, and hence such a large discrete action space, that it would defeat the purpose of discretising in the first place.
<br>
<br>
Policy gradient methods elegantly sidestep this issue by directly parametrising a stochastic policy $\pi_\theta(h|s)$ that outputs a probability distribution over continuous hedge ratios. During training, the agent samples from this distribution, naturally exploring the continuous space while gradually concentrating probability mass around optimal actions as learning progresses. This, in turn, also eliminates the need to tune an exploration decay schedule as seen in classical algorithms.

#### 3.7.2 Which Policy Gradient Method?
We will briefly introduce several algorithms that could be selected, and why we settle on our choice. <br><br>
<b>Proximal Policy Optimisation  </b><br>
Schulman et al. (2017) introduced Proximal Policy Optimisation (PPO), which takes a conservative approach to policy updates: a clipped objective keeps each update within a "trust region" of the previous policy. Within our use case, assume Nature discovers a particularly devastating transition matrix that causes large losses. A naive policy gradient might overreact, drastically changing the hedging strategy based on what could have been a rare event. PPO's clipping mechanism acts as a natural damper, ensuring that even under adversarial pressure, policy updates remain measured and stable.

Before continuing, let us visit other possible choices and why they have not been selected. Trust Region Policy Optimisation (Schulman et al., 2015) works similarly to PPO, constraining the Kullback-Leibler (KL) divergence between successive policies; however, it relies on second-order information, which adds significant complexity even when techniques such as the conjugate gradient method are used. Empirically, PPO has attained comparable KL control to TRPO.

Soft Actor-Critic (SAC) (Haarnoja et al., 2019) is another compelling algorithm, encouraging diverse, exploratory behaviour by maximising entropy, i.e. rewarding randomness in actions. In many domains this exploration bonus accelerates learning. However, when attempting to minimise worst-case risk, we typically want to approach deterministic, precise hedging strategies. SAC's drive for randomness directly opposes our goal of tight risk control.

In addition, we are able to dispense with the need to create an adversarial agent that learns. Since our Nature is solved by corner enumeration, as established throughout this chapter, we do not require an adversary network to learn a mapping that is already known. Such a network would add complexity and introduce stochasticity without any upside.

Despite the widespread adoption and viability of PPO, we must adjust it to our purpose. The standard implementation optimises for expected returns; however, our focus on CVaR requires careful modifications to the advantage estimation and policy updates. We must teach the algorithm to care disproportionately about tail events, whilst maintaining the stability benefits of PPO's trust region approach. Essentially, we require special emphasis on worst-case scenarios, given the risk-based nature of our problem. For this we introduce "RobustCVaR-PPO".

#### 3.7.3 Robust CVaR-PPO
Let us first visit the concept of Actor Critic Reinforcement Learning methods. Actor Critic algorithms combine both policy based and value based methods. They have an Actor, which as the name suggests, selects actions. A Critic then evaluates them. It is akin to having an external entity providing feedback on your decisions.
The formulation is as follows:
1. Actor : follows a Gaussian policy $\pi_\theta(h|s) = \mathcal N(\mu_\theta(s), \sigma^2(s))$
2. Critic : tail value function $V_\phi(s)=\mathbb E\!\bigl[(L-\zeta)_+ \mid s\bigr]$


where 
- $\mu$ is our best guess delta
- $\sigma$ is how nervous we are about that guess
- $V_\phi$ as shown earlier, is the expected loss beyond the current VaR buffer
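
The actor's role can be made concrete with a minimal sketch. Everything here is a stand-in: `policy_mean` is a hypothetical linear map in place of the neural network discussed in Chapter 4, and the state features, weights and fixed `sigma` are illustrative assumptions.

```python
import math
import random

def policy_mean(state, weights, bias):
    """Hypothetical linear stand-in for the actor network mu_theta(s)."""
    return sum(w * x for w, x in zip(weights, state)) + bias

def sample_hedge(mu, sigma, rng):
    """Draw a hedge ratio h ~ N(mu, sigma^2)."""
    return rng.gauss(mu, sigma)

def log_prob(h, mu, sigma):
    """Log-density of N(mu, sigma^2) at h, as used in the PPO probability ratio."""
    z = (h - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

rng = random.Random(42)
# Hypothetical features: moneyness, time to expiry, belief in the high-vol regime.
state = (1.02, 0.25, 0.7)
mu = policy_mean(state, weights=(0.5, -0.1, 0.2), bias=0.0)
h = sample_hedge(mu, sigma=0.05, rng=rng)
lp = log_prob(h, mu, 0.05)
print(mu, h, lp)
```

A small `sigma` keeps sampled hedge ratios close to the mean, while a larger one widens exploration around it.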

As mentioned before, we replace the expensive constraint imposed by TRPO and its second derivatives with a more compute-friendly, empirically sound and stable clip of the probability ratio
$$r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$$
where, if $r_t$ drifts outside $[1-\varepsilon,\,1+\varepsilon]$, we stop the gradient. Here $\varepsilon$ denotes the PPO clip parameter, not the radius of the uncertainty set $\mathcal P(\varepsilon)$.
<br>
<br>
We implement the loss function as shown by Schulman et al. (2017) with a slight engineering tweak
$$\boxed{\;
\mathcal{L}_{\text{PPO}}(\theta)\;=\;
\mathbb E_t\!\big[
\min\bigl(
r_t(\theta)\,\hat A^{\text{tail}}_t,\;
\operatorname{clip}(r_t,1\!\pm\!\varepsilon)\,\hat A^{\text{tail}}_t
\bigr)
\bigr]\;}
\tag{3.27}$$
We note the change of $\hat A$ to $\hat A^{\text{tail}}$, which reflects the change from the vanilla advantage $$\displaystyle \hat A_t = Q_\theta(s_t,a_t) - V_\phi(s_t)$$ to $$\displaystyle \hat A^{\text{tail}}_t = (L_t - \zeta)_+ - V_\phi(s_t)$$
This does not, however, alter the clipping objective, so we can make use of PPO's compute-friendly clipping while ensuring the trust region safety still holds.
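
Eq. 3.27 with the tail advantage can be sketched on a toy batch. The ratios, advantages and `clip_eps = 0.2` below are hypothetical numbers, and whether the surrogate is maximised or minimised depends on the sign convention adopted for losses, which this sketch does not settle.

```python
def tail_advantage(loss, zeta, value_estimate):
    """A_tail = (L - zeta)_+ - V_phi(s)."""
    return max(loss - zeta, 0.0) - value_estimate

def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(r * adv, clipped_r * adv))
    return sum(terms) / len(terms)

# Hypothetical batch: probability ratios and tail advantages for three steps.
ratios = [0.5, 1.0, 1.5]
advantages = [
    tail_advantage(loss=4.0, zeta=2.0, value_estimate=1.0),   # 2.0 - 1.0 = 1.0
    tail_advantage(loss=1.0, zeta=2.0, value_estimate=2.0),   # 0.0 - 2.0 = -2.0
    tail_advantage(loss=5.0, zeta=2.0, value_estimate=2.0),   # 3.0 - 2.0 = 1.0
]
objective = clipped_surrogate(ratios, advantages)
print(objective)
```

The third sample shows the clip at work: a ratio of 1.5 is capped at 1.2, so that step contributes 1.2 rather than 1.5 to the batch mean.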

#### 3.7.4
Revisiting our four constraints, we note how this implementation suits our practical needs:<br><br>
<b> Continuous $h_t$ </b> : Gaussian actor outputs $\mu$ and $\sigma$ directly ($\mu$ is a real number, so gradient descent can push it to any point on $\mathbb R$; the search space is therefore continuous)<br>
<b> Hidden Regimes </b> : Filtered belief state<br>
<b> Tail risk </b> : Tail advantage implementation through PPO with $\zeta$ tracker<br>
<b> Adversary </b> : Cheap four-corner enumeration, with PPO clipping keeping updates stable even in the worst case<br>

With the theoretical foundations now in place, we turn in Chapter 4 to the practical methodology and implementation details.
