# Expected losses and VaR

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gaarutyunov/credit-risk/blob/master/notebooks/colab_el_var.ipynb)

## Environment settings

For better performance, change the Colab runtime type to GPU.

In [1]:
import numpy as np
import scipy.stats
!git clone https://github.com/gaarutyunov/credit-risk.git

Cloning into 'credit-risk'...
remote: Enumerating objects: 431, done.
remote: Counting objects: 100% (97/97), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 431 (delta 50), reused 74 (delta 30), pack-reused 334
Receiving objects: 100% (431/431), 10.94 MiB | 15.34 MiB/s, done.
Resolving deltas: 100% (254/254), done.


In [2]:
%cd credit-risk

/content/credit-risk


In [3]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wing
  Cloning https://github.com/sberbank-ai/wing.git (to revision master) to /tmp/pip-install-4dhu6sqg/wing_83adb08c7aed4405b07c4f7de6ed9eed
  Running command git clone -q https://github.com/sberbank-ai/wing.git /tmp/pip-install-4dhu6sqg/wing_83adb08c7aed4405b07c4f7de6ed9eed
Collecting hydra-core
  Downloading hydra_core-1.2.0-py3-none-any.whl (151 kB)
Collecting omegaconf
  Downloading omegaconf-2.2.2-py3-none-any.whl (79 kB)
Collecting catboost
  Downloading catboost-1.0.6-cp37-none-manylinux1_x86_64.whl (76.6 MB)
Collecting antlr4-python3-runtime==4.9.*
  Downloading antlr4-python3-runtime-4.9.3.tar.gz (117 kB)
Collecting PyYAML>=

To get a Kaggle username and API key, follow the instructions in the [Kaggle API readme](https://github.com/Kaggle/kaggle-api).

In [None]:
%env KAGGLE_USERNAME=<username>
%env KAGGLE_KEY=<key>
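
Outside a notebook, the same two environment variables can be set from Python before the Kaggle client is invoked. The sketch below is an illustration only: the helper name `set_kaggle_credentials` is hypothetical and not part of the credit-risk repository, and the placeholder values must be replaced with real credentials.

```python
import os


def set_kaggle_credentials(username: str, key: str) -> None:
    """Expose Kaggle API credentials through environment variables.

    Hypothetical helper, not part of the credit-risk repository.
    """
    if not username or not key:
        raise ValueError("both username and key are required")
    os.environ["KAGGLE_USERNAME"] = username
    os.environ["KAGGLE_KEY"] = key


# Placeholder values for illustration only
set_kaggle_credentials("<username>", "<key>")
```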

In [5]:
!kaggle datasets download wordsforthewise/lending-club

Downloading lending-club.zip to /content/credit-risk
 99% 1.25G/1.26G [00:08<00:00, 114MB/s]
100% 1.26G/1.26G [00:08<00:00, 164MB/s]


In [6]:
!unzip lending-club.zip

Archive:  lending-club.zip
  inflating: accepted_2007_to_2018Q4.csv.gz  
  inflating: accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv  
  inflating: rejected_2007_to_2018Q4.csv.gz  
  inflating: rejected_2007_to_2018q4.csv/rejected_2007_to_2018Q4.csv  


In [7]:
!mkdir data

In [8]:
!mv accepted_2007_to_2018q4.csv/accepted_2007_to_2018Q4.csv data/accepted_2007_to_2018Q4.csv
!mv rejected_2007_to_2018q4.csv/rejected_2007_to_2018Q4.csv data/rejected_2007_to_2018Q4.csv

## Preprocessing

In [10]:
from pipeline import get_pipeline

preprocessing = get_pipeline(
    name="cat_boost",
    group="preprocessing",
    overrides=[
        "preprocessing_pipeline=raw_data"
    ],
    debug=True,
)

_target_: pipeline.ReaderPipeline
memory: ./cache/preprocessing/raw
steps:
- - CSVReader
  - _target_: pipeline.CSVReader
    _convert_: all
    file: data/accepted_2007_to_2018Q4.csv
    columns:
    - loan_amnt
    - term
    - emp_title
    - emp_length
    - home_ownership
    - verification_status
    - purpose
    - zip_code
    - addr_state
    - earliest_cr_line
    - fico_range_low
    - fico_range_high
    - revol_bal
    - application_type
    - verification_status_joint
    - sec_app_earliest_cr_line
    - loan_status
    - issue_d
    - funded_amnt
    - disbursement_method
- - EmpTitle
  - _target_: pipeline.JobTransformer
    _convert_: all
    max_jobs: 20
- - ImputeNumerical
  - _target_: pipeline.ApplyToColumns
    _convert_: all
    inner:
      _target_: sklearn.impute.SimpleImputer
      strategy: mean
    columns:
    - loan_amnt
    - fico_range_low
    - fico_range_high
    - revol_bal
- - ImputeCategorical
  - _target_: pipeline.ApplyToColumns
    _convert_: al

In [11]:
X = preprocessing.fit_transform([], y=[])

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  **fit_params_steps[name],


In [12]:
import pandas as pd

X['issue_d'] = pd.to_datetime(X['issue_d'])

In [13]:
X = X[X['issue_d'] >= '2017-01-01']

In [14]:
X['issue_d'] = X['issue_d'].dt.strftime('%b-%Y')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [None]:
import numpy as np
import scipy.stats


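def vasicek(PD, rho, alpha):
    # Argument (z-score) of the one-factor Vasicek formula; it is not mapped
    # through scipy.stats.norm.cdf, so the result is negative whenever the
    # stressed default rate is below 0.5
    return (scipy.stats.norm.ppf(PD) + np.sqrt(rho) * scipy.stats.norm.ppf(alpha)) / np.sqrt(1 - rho)


def mean_confidence_interval(data, confidence=0.999):
    # Half-width of the t-based confidence interval for the mean of `data`
    a = 1.0 * np.array(data)
    n = len(a)
    se = scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return h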
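
The `vasicek` helper above returns the z-score inside the one-factor Vasicek formula. The worst-case default rate is usually stated as the standard normal CDF of that z-score. The sketch below shows that form; the function name `vasicek_wcdr` and the sample probabilities are assumptions made for illustration, not part of the repository.

```python
import numpy as np
from scipy.stats import norm


def vasicek_wcdr(pd_, rho, alpha):
    """Worst-case default rate at confidence level alpha (one-factor Vasicek).

    WCDR = N((N^-1(PD) + sqrt(rho) * N^-1(alpha)) / sqrt(1 - rho))
    """
    z = (norm.ppf(pd_) + np.sqrt(rho) * norm.ppf(alpha)) / np.sqrt(1 - rho)
    return norm.cdf(z)


pds = np.array([0.01, 0.05, 0.20])  # made-up default probabilities

# With zero correlation the systematic factor drops out and WCDR equals PD
wcdr_independent = vasicek_wcdr(pds, 0.0, 0.999)

# With positive correlation the stressed default rate exceeds PD
wcdr_correlated = vasicek_wcdr(pds, 0.06, 0.999)
```

Because the result is a probability between 0 and 1, multiplying it by an exposure gives a non-negative stressed loss, so no sign flip is needed when summing.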

## CatBoost

In [None]:
from pipeline import get_pipeline

catboost = get_pipeline(
    name="cat_boost",
    group="prediction",
    debug=True,
)

In [None]:
X["PD_CB"] = catboost.predict_proba(X.drop(columns=["funded_amnt", "issue_d", "loan_status"]))[:, 1]

In [None]:
LGD = 1.0

In [None]:
X["EL_CB"] = LGD * X["PD_CB"] * X["funded_amnt"]

print(f"Expected losses: {X['EL_CB'].sum():.2f}")
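
To make the arithmetic concrete, here is the same expected-loss calculation on a three-loan toy portfolio. The probabilities and amounts are made up; only the formula EL = LGD × PD × exposure is taken from the cell above.

```python
import pandas as pd

LGD = 1.0  # loss given default; 1.0 means the full funded amount is lost

# Made-up portfolio for illustration
portfolio = pd.DataFrame(
    {
        "PD": [0.02, 0.10, 0.25],
        "funded_amnt": [10_000.0, 5_000.0, 2_000.0],
    }
)

# Per-loan expected loss, then the portfolio total
portfolio["EL"] = LGD * portfolio["PD"] * portfolio["funded_amnt"]
total_el = portfolio["EL"].sum()
print(f"Expected losses: {total_el:.2f}")  # prints "Expected losses: 1200.00"
```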

In [None]:
term_36_mask = X.term.str.strip().str.startswith('36')

In [None]:
VaR_1 = vasicek(X.loc[term_36_mask, "PD_CB"], 0, .999) * X.loc[term_36_mask, "funded_amnt"]
VaR_6_1 = vasicek(X.loc[term_36_mask, "PD_CB"], .06, .999) * X.loc[term_36_mask, "funded_amnt"]

In [None]:
VaR = vasicek(X["PD_CB"], 0, .999) * X["funded_amnt"]
-VaR.sum()

In [None]:
mean_confidence_interval(VaR)
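
`mean_confidence_interval` returns only the half-width of the interval. If the bounds themselves are needed, a variant along the following lines could return them together with the mean. The name `mean_with_interval` and the synthetic sample are assumptions made for illustration.

```python
import numpy as np
import scipy.stats


def mean_with_interval(data, confidence=0.999):
    """Return the sample mean and the bounds of its t-based confidence interval."""
    a = np.asarray(data, dtype=float)
    m, se = a.mean(), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2.0, len(a) - 1)
    return m, m - h, m + h


rng = np.random.default_rng(0)
sample = rng.normal(loc=100.0, scale=15.0, size=1_000)  # synthetic data
mean, lower, upper = mean_with_interval(sample)
```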

In [None]:
X["EL_CB"].sum()

In [None]:
C = -VaR.sum() - X["EL_CB"].sum()
C

In [None]:
VaR_6 = vasicek(X["PD_CB"], 0.06, .999) * X["funded_amnt"]
-VaR_6.sum()

In [None]:
mean_confidence_interval(VaR_6)

In [None]:
C = -VaR_6.sum() - X["EL_CB"].sum()
C

In [68]:
mean_confidence_interval(VaR_6)

24.04852008398609

In [69]:
C = -VaR_6.sum() - X["EL_CB"].sum()
C

-2045003135.3392162
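
The capital calculation can also be sketched on a toy portfolio as the stressed loss minus the expected loss. The sketch assumes the CDF form of the one-factor Vasicek formula, whereas the notebook's `vasicek` returns the unwrapped z-score. All probabilities and exposures are made up.

```python
import numpy as np
from scipy.stats import norm

LGD = 1.0      # loss given default
RHO = 0.06     # asset correlation
ALPHA = 0.999  # confidence level

pd_ = np.array([0.02, 0.10, 0.25])            # made-up default probabilities
ead = np.array([10_000.0, 5_000.0, 2_000.0])  # made-up exposures

# Stressed default rate per loan (CDF form of the one-factor Vasicek formula)
z = (norm.ppf(pd_) + np.sqrt(RHO) * norm.ppf(ALPHA)) / np.sqrt(1 - RHO)
wcdr = norm.cdf(z)

expected_loss = (LGD * pd_ * ead).sum()
stressed_loss = (LGD * wcdr * ead).sum()

# Capital covers the loss in excess of what is already expected
capital = stressed_loss - expected_loss
```

Under these assumptions the capital figure is positive whenever the correlation is positive, since each stressed default rate is at least as large as the corresponding PD.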