# Choice of Model
## choosing what to model
We are going to model the UK term structure of interest rates. The raw dataset is the time series of spot yields available from the Bank of England (BoE) website.

We need to decide whether to model:
- the whole interest rate curve
  or
- changes in the interest rate curve from one period to the next

We go with _"changes in the interest rate curve"_ since this is more appropriate for the use case of calculating an SCR.

- ?? why do we use $y_{t+1} / y_t$ and not $(y_{t+1} - y_t) / y_t$ ??
- do we base the analysis on the covariance matrix or the correlation matrix
  - and how does this decision affect backing out to arrive at the stress

### which measure of change in interest rates
There are several options:
- absolute differences $\Delta y_t = y_t - y_{t-1}$
- log differences $\Delta y_t = \log(y_t) - \log(y_{t-1})$
- percentage change $\Delta y_t = \frac{y_t - y_{t-1}}{y_{t-1}}$

Absolute differences don't work well when rates are low, and percentage changes can be extremely large when rates are close to zero. Although less intuitive, log differences keep changes comparable across different rate environments. Note that log differences are only defined for strictly positive yields.
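A minimal sketch of the three measures using pandas. The yields, dates and column names below are made up for illustration and are not BoE data.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end spot yields (in %) for two maturities; not real BoE data
yields = pd.DataFrame(
    {"1y": [0.50, 0.55, 0.45], "10y": [4.00, 4.10, 3.90]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
)

abs_diff = yields.diff().dropna()           # y_t - y_{t-1}
log_diff = np.log(yields).diff().dropna()   # log(y_t) - log(y_{t-1}); needs yields > 0
pct_change = yields.pct_change().dropna()   # (y_t - y_{t-1}) / y_{t-1}

print(log_diff.round(4))
```

In this toy data the absolute moves at the 1y point are smaller than at the 10y point, while the log and percentage moves are larger, which is the scaling effect described above.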

## transforming dataset via principal component analysis
To create a parsimonious model for demonstrating the Bayesian technique, we will perform PCA on the raw dataset. This helps reduce noise and extract the most significant patterns of variation. Additionally, it requires fewer parameters than more complex affine term structure models.

## Covariance Vs Correlation

|Approach   |When to use   |Scale |Impact of variances|
|---|---|---|---|
|Covariance   |yields at each maturity have similar units and variances   |retains the original scale of the data   |?? magnitude of the variance at each maturity has an impact ??|
|Correlation   |variances differ by maturity   |standardises the data   |different volatilities at different maturities don't dominate the results|

?? what are the advantages of using covariances then  ??

!!we should probably make a choice with a scatterplot of adjusted dataset!!

read here that the two would yield the same results:  https://medium.com/towards-data-science/applying-pca-to-the-yield-curve-4d2023e555b3
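One way to see the difference in the table is to compare the share of variance explained under each approach. The sketch below uses synthetic data in which one maturity is far more volatile than the others; the data and the helper name `explained_variance` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log changes: 500 observations, 3 maturities driven by one common
# factor, with very different volatilities at each maturity
common = rng.normal(size=(500, 1))
changes = common * np.array([3.0, 1.0, 0.5]) + rng.normal(scale=0.3, size=(500, 3))

def explained_variance(matrix):
    """Share of total variance explained by each component, largest first."""
    eigenvalues = np.linalg.eigvalsh(matrix)[::-1]  # eigvalsh returns ascending order
    return eigenvalues / eigenvalues.sum()

cov_share = explained_variance(np.cov(changes, rowvar=False))
corr_share = explained_variance(np.corrcoef(changes, rowvar=False))

print("covariance PCA: ", cov_share.round(3))
print("correlation PCA:", corr_share.round(3))
```

With the covariance matrix the most volatile maturity dominates the first component; standardising via the correlation matrix spreads the weight more evenly. Whether the two approaches agree on real yield data depends on how similar the variances are across maturities.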

## Modeling Steps
- demeaning the dataset (PCA expects centering around zero)
- generate ??covariance or correlation?? matrix
- ??eigenvectors and eigenvalues??
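The three steps above can be sketched with NumPy as follows. The input matrix is random placeholder data standing in for the log changes, and the covariance variant is shown only as an example since the choice is still open.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical matrix of log changes: rows are dates, columns are maturities
log_changes = rng.normal(size=(240, 5)) @ rng.normal(size=(5, 5)) * 0.01

# 1. demean each maturity
mean = log_changes.mean(axis=0)
centred = log_changes - mean

# 2. covariance matrix (np.corrcoef would give the correlation variant)
cov = np.cov(centred, rowvar=False)

# 3. eigen-decomposition; eigh returns ascending eigenvalues, so reorder
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# principal component scores: projection of the centred data onto the eigenvectors
scores = centred @ eigenvectors

print("explained variance share:", (eigenvalues / eigenvalues.sum()).round(3))
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, so the eigenvalues are real and the eigenvectors orthonormal.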

# Steps
- calculate log differences (don't use `df.pct_change()`, which gives percentage changes)
- demean the dataset
- calculate covariance matrix, eigenvalues and eigenvectors
- derive a calibration dataset
- attempt to fit a normal or Student-t distribution (Student-t is better for heavier tails)
  - Q-Q plots or histograms
  - a Q-Q plot is great for seeing whether the data is normally distributed
- ?? BAYESIAN INFERENCE PARTS OF THE PROCESS ??
  - any hyperparameters
- ?? DERIVING STRESSES, COMPARING CLASSICAL VS BAYESIAN APPROACH ??
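The distribution-fitting step can be sketched with `scipy.stats`. The sample below is simulated from a Student-t distribution purely as a stand-in for one principal component's scores; the real calibration dataset would replace it.

```python
import numpy as np
from scipy import stats

# Hypothetical heavy-tailed sample standing in for one principal component's scores
sample = stats.t.rvs(df=4, size=2000, random_state=0)

# maximum-likelihood fits of both candidate distributions
mu, sigma = stats.norm.fit(sample)
nu, loc, scale = stats.t.fit(sample)

# compare log-likelihoods: higher means a better fit to this sample
ll_norm = stats.norm.logpdf(sample, mu, sigma).sum()
ll_t = stats.t.logpdf(sample, nu, loc, scale).sum()
print(f"normal log-likelihood:    {ll_norm:.1f}")
print(f"Student-t log-likelihood: {ll_t:.1f} (fitted df = {nu:.1f})")

# Q-Q data against the normal; r close to 1 means the points lie near a straight line
(theoretical_q, ordered), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(f"Q-Q correlation against the normal: {r:.4f}")
```

Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `stats.probplot` draws the Q-Q plot directly. Heavy tails show up as points bending away from the line at both ends.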


## Deriving Stresses
$Y_t = \log\frac{X_t}{X_{t-1}}$ <br><br>
simulate $Y_t$<br><br>
$\exp(Y_t) = \frac{X_t}{X_{t-1}}$<br><br>
$X_{t-1}e^{Y_t} = X_t$<br><br>
$X_{t-1}$ is the current value of the curve and $X_t$ is its value one year from now
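A minimal sketch of applying simulated log changes to the current curve. The current yields, the independent-normal simulation and the 1-in-200 percentiles are all assumptions for illustration; in practice $Y_t$ would be drawn from the fitted model, and reading percentiles maturity by maturity is a simplification.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical current spot yields (in %) at three maturities: X_{t-1}
current_curve = np.array([4.0, 4.2, 4.5])

# Placeholder simulation of one-year log changes Y_t (10,000 scenarios);
# independent normals are an assumption, not the calibrated model
simulated_Y = rng.normal(loc=0.0, scale=0.25, size=(10_000, 3))

# X_t = X_{t-1} * exp(Y_t)
simulated_curves = current_curve * np.exp(simulated_Y)

# Assumed 1-in-200 calibration: 0.5th and 99.5th percentiles at each maturity
down_stress = np.percentile(simulated_curves, 0.5, axis=0)
up_stress = np.percentile(simulated_curves, 99.5, axis=0)

print("down:", down_stress.round(2))
print("up:  ", up_stress.round(2))
```

Because the stress is applied multiplicatively, the simulated yields stay positive whatever value of $Y_t$ is drawn.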

### the steps
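- draw a realisation of the PC scores from the probabilistic model
- multiply by the eigenvectors to map the scores back to (demeaned) log changes
- (if using correlations) rescale using the s.d. for each yield maturity
- add back the mean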
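These steps can be sketched as follows for the correlation variant. Every input here (means, standard deviations, eigenvectors, eigenvalues, current curve) is a made-up stand-in for calibration output, and independent normal scores are an assumption rather than the chosen model. The projection through the eigenvectors is included because scores must be mapped back to maturities before rescaling.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical calibration outputs for 4 maturities, keeping 2 components
mean = np.array([0.00, 0.01, 0.01, 0.02])   # mean log change per maturity
std = np.array([0.30, 0.25, 0.20, 0.15])    # s.d. per maturity (correlation PCA only)
eigenvectors = np.linalg.qr(rng.normal(size=(4, 2)))[0]  # stand-in orthonormal columns
eigenvalues = np.array([2.5, 1.0])

# draw a realisation of the PC scores (assumed independent normals here)
scores = rng.normal(scale=np.sqrt(eigenvalues))

# map the scores back to standardised log changes at each maturity
standardised = eigenvectors @ scores

# rescale by each maturity's s.d. (skip this step for covariance PCA)
rescaled = standardised * std

# add back the mean
log_change = rescaled + mean

# apply to a hypothetical current curve: X_t = X_{t-1} * exp(Y_t)
current_curve = np.array([4.0, 4.2, 4.5, 4.6])
stressed_curve = current_curve * np.exp(log_change)
print(stressed_curve.round(3))
```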

# Getting Raw Data into Dataframes


The Bank of England provides two spreadsheets with historic spot yields at https://www.bankofengland.co.uk/statistics/yield-curves/
We import each of these into a dataframe (df1 and df2) and join them to make a single dataframe (df).

In [79]:
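import pandas as pd

#load in first spreadsheet to df1
df1 = pd.read_excel("GLC Nominal month end data_1970 to 2015.xlsx",sheet_name="4. spot curve",engine="openpyxl",skiprows=5,header=None)
#create an appropriate set of headers: the maturity row, with the first column named "Date"
col_names=pd.read_excel("GLC Nominal month end data_1970 to 2015.xlsx",sheet_name="4. spot curve",engine="openpyxl",skiprows=3,nrows=1,header=None).iloc[0].tolist()
col_names[0]="Date"
df1.columns = col_names
#load in second spreadsheet to df2 and label its columns the same way, so that concat aligns on maturity
#(assumes the second workbook uses the same sheet layout as the first)
df2 = pd.read_excel("GLC Nominal month end data_2016 to present.xlsx",sheet_name="4. spot curve",engine="openpyxl",skiprows=5,header=None)
col_names2=pd.read_excel("GLC Nominal month end data_2016 to present.xlsx",sheet_name="4. spot curve",engine="openpyxl",skiprows=3,nrows=1,header=None).iloc[0].tolist()
col_names2[0]="Date"
df2.columns = col_names2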

In [81]:
#join the two dataframes to create df
df = pd.concat([df1, df2], ignore_index=True)

#producing some sense checks on the first spreadsheet (df1)
print("the first date is "+ str(df1.iloc[0,0].strftime('%Y-%m-%d'))+" and the last is " +str(df1.iloc[-1,0].strftime('%Y-%m-%d') ))
print("one would therefore expect 12 x 46yrs = 552 entries")
print("and indeed we see the number of rows in df1 is "+str(len(df1)))

the first date is 1970-01-31 and the last is 2015-12-31
one would therefore expect 12 x 46yrs = 552 entries
and indeed we see the number of rows in df1 is 552


# Incorporating Bayesian Framework

PCA reveals latent factors (level, slope and curvature).

Having decided to model the principal components of interest rate changes, which economic outlooks correspond to each component?

|Principal Component   |Relevant Insights   |
|---|---|
|PC1|level of interest rates  -  rates expected to stay at current levels for a prolonged period, or gradual hikes in response to prolonged inflation |
|PC2|slope  -  short term vs long term expectations   |
|PC3|curvature  -  medium term expectations relative to short term and long term expectations   |



could the Bayesian rules enforce arbitrage-freeness ??

- Economic theory imposes constraints on the first moments (see https://www.nber.org/system/files/working_papers/w24618/w24618.pdf)

# Some links
https://www.thegoldensource.com/pca-and-the-term-structure/#:~:text=The%20purpose%20of%20PCA%20is,14%20orthogonal%20lines%20using%20eigenvectors.