# Assignment 2
## Advanced Financial Economics M-335

#### An Exercise in Financial Crisis Prediction Using Machine Learning Techniques

This assignment is inspired by the research article *'Credit growth, the yield curve and financial crisis prediction: evidence from a machine learning approach'* (2019), by Bluwstein et al. The idea is to reproduce the financial crisis prediction exercise in that paper on a smaller scale.

Our prediction exercise takes the form of a binary classification problem: each data point, characterized by a vector of predictors $\textbf{x}_t=(x^1_t, x^2_t,\dots,x^N_t)$ realized at time $t$, must be classified into one of two categories: 1) there **will** be a financial crisis **at time *t+1 or t+2***; 2) there **will not** be a financial crisis **at time *t+1 or t+2***. Time *t* is measured in years. Your task is to implement this prediction problem using five machine learning algorithms:
 1. logistic regression
 2. logistic regression with LASSO regularization
 3. random trees
 4. random forest
 5. neural networks

You should then assess the accuracy of each model and make whatever performance comparisons you deem appropriate, based on what we have learned in class.

I will walk you through the beginning of this exercise, which mainly consists of importing and cleaning the data. The rest is up to you.

To get started, open the Jupyter Notebook where you will execute your assignment.


##### 1. Import the Python modules you need
First import the modules you need. You will certainly need to import more later on: `matplotlib` or `seaborn` for plotting, and of course `sklearn`. Remember to add `%matplotlib inline` so that plots are shown in the notebook.

In [1]:
import numpy as np
import pandas as pd

##### 2. Load the dataset as a Pandas Data Frame
Import the dataset from the Excel file (located in the same directory as the notebook) as a Pandas DataFrame.

In [2]:
df=pd.read_excel("JSTdatasetR4.xlsx",sheet_name="Data")

##### 3. Creating the desired variables in a new Data Frame
Now we need to do a few manipulations on the dataset. Basically, we need to create a smaller data frame comprising only the variables we need, which are those in the *Baseline Experiment* of Bluwstein et al. (2019). The code below (with comments) shows how I would do it, but feel free to follow your own method based on your reading of the paper:

In [3]:
df.country.unique()

array(['Australia', 'Belgium', 'Canada', 'Denmark', 'Finland', 'France',
       'Germany', 'Italy', 'Japan', 'Netherlands', 'Norway', 'Portugal',
       'Spain', 'Sweden', 'Switzerland', 'UK', 'USA'], dtype=object)

In [4]:
# let's make a copy, in order to preserve the original dataset
df_copy=df.copy()
# let's create new (temporary) columns with the transformed variables we need:
# slope of the yield curve
df_copy["slope_yield_curve"]=df_copy["ltrate"]/100-df_copy["stir"]/100
# credit: loans to the private sector / gdp
df_copy["credit"]=df_copy["tloans"]/df_copy["gdp"]
# debt service ratio: credit * long-term interest rate
df_copy["debt_serv_ratio"]=(df_copy["tloans"]/df_copy["gdp"])*df_copy["ltrate"]/100
# broad money over gdp
df_copy["bmoney_gdp"]=df_copy["money"]/df_copy["gdp"]
# current account over gdp
df_copy["curr_acc_gdp"]=df_copy["ca"]/df_copy["gdp"]
# Now we need to compute 1-year absolute variations and percentage variations for a few variables.
# Obviously this must be done country-wise, so we cannot act on the dataframe as it is.
# A convenient way of doing this is the Pandas method 'groupby()'
df_copy_group=df_copy.groupby("iso") # 'iso' is the country code
# create 1 year-variation of credit from grouped dataframe and add back to initial dataframe
df_copy["delta_credit"]=df_copy_group["credit"].diff(periods=1)
# create 1 year-variation of debt service ratio from grouped dataframe and add back to initial dataframe
df_copy["delta_debt_serv_ratio"]=df_copy_group["debt_serv_ratio"].diff(periods=1)
# create 1 year-variation of investment/gdp from grouped dataframe and add back to initial dataframe
df_copy["delta_investm_ratio"]=df_copy_group["iy"].diff(periods=1)
# create 1 year-variation of public debt/gdp from grouped dataframe and add back to initial dataframe
df_copy["delta_pdebt_ratio"]=df_copy_group["debtgdp"].diff(periods=1)
# create 1 year-variation of broad money / gdp from grouped dataframe and add back to initial dataframe
df_copy["delta_bmoney_gdp"]=df_copy_group["bmoney_gdp"].diff(periods=1)
# create 1 year-variation of current account / gdp from grouped dataframe and add back to initial dataframe
df_copy["delta_curr_acc_gdp"]=df_copy_group["curr_acc_gdp"].diff(periods=1)
# now we need to create new variables which are 1-year growth rates of existing ones

# we will need this function to apply to the columns of the dataframe

def lag_pct_change(x):
    """ Computes percentage changes """
    lag = np.array(pd.Series(x).shift(1))
    return (x - lag) / lag

# create 1 year growth rate of CPI from grouped dataframe and add back to initial dataframe
# ('transform' returns a result indexed like the original rows, so it aligns with df_copy row by row)
df_copy["growth_cpi"]=df_copy_group["cpi"].transform(lag_pct_change)
# create 1 year growth rate of consumption per capita from grouped dataframe and add back to initial dataframe
df_copy["growth_cons"]=df_copy_group["rconpc"].transform(lag_pct_change)

# now let's create the crisis early-warning label: a dummy variable which takes value one if there
# will be a crisis in the next one or two years

# temporary array of zeros, dimension number of rows in database
temp_array=np.zeros(len(df_copy))
# loop to create dummy
# (note: rows i+1 and i+2 are read from the stacked panel; if the file is stacked country by country,
# for the last two years of a country they belong to the next country)
for i in np.arange(0,len(df_copy)-2):
    temp_array[i]= 1 if ( (df_copy.loc[i+1,'crisisJST']== 1) or (df_copy.loc[i+2,'crisisJST']== 1)  ) else 0

#put the dummy in the dataframe

df_copy["crisis_warning"]=temp_array.astype("int64")

# create a smaller dataframe including only the variables we are interested in: the first ten are predictors (X) and the last one is the output, or label (y)
variables=["slope_yield_curve","delta_credit","delta_debt_serv_ratio","delta_investm_ratio","delta_pdebt_ratio","delta_bmoney_gdp","delta_curr_acc_gdp","growth_cpi","growth_cons","eq_tr","crisis_warning"]
df_final=df_copy[variables].dropna()

# let's also create a version of our dataframe which includes the year
df_final_withyear=df_copy[["year"]+variables].dropna()
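
The label loop above reads rows `i+1` and `i+2` regardless of country. A country-wise variant can be sketched with `groupby(...).shift()`. This is only an illustration on a hypothetical toy panel (the `toy` frame and its values are invented), not the procedure of the paper:

```python
import pandas as pd

# Hypothetical toy panel: two countries, five years each.
toy = pd.DataFrame({
    "iso": ["AAA"] * 5 + ["BBB"] * 5,
    "year": list(range(2000, 2005)) * 2,
    "crisisJST": [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
})

# Look one and two years ahead *within* each country.
grouped = toy.groupby("iso")["crisisJST"]
next_1 = grouped.shift(-1)
next_2 = grouped.shift(-2)

# 1 if a crisis occurs at t+1 or t+2; missing leads at the end of a
# country's sample compare as False and therefore become 0.
toy["crisis_warning"] = ((next_1 == 1) | (next_2 == 1)).astype("int64")
print(toy)
```

In this toy panel the last two years of `AAA` get a 0, whereas a row-wise loop would set them to 1 because of the crisis in the first year of `BBB`.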


Notice that Bluwstein et al. (2019) drop more observations than I did, in order to obtain more robust results. Feel free to follow their procedure more closely. Otherwise, I am fine with `df_final`.

##### 4. Now perform your analysis
Remember that the feature that you need to predict (the outcome $y$) is the variable `df_final["crisis_warning"]`, while all the other columns in the data frame are the features $x$ that you use to predict it.

Inspired by what we have learned from the notebooks `regression.ipynb` and `classification.ipynb`, and possibly your reading of the additional material in the repository, perform your data analysis:

1. Randomly split the data into a training and a test sample.
2. Fit the following models on the training sample:
    * logistic regression
    * logistic regression with LASSO regularization. Here, select the regularization parameter using 5-fold cross-validation
    * random trees. Experiment with different tree depths, not necessarily with cross-validation
    * random forest
    * neural networks. Experiment with different numbers of hidden layers and of neurons per layer, not necessarily using cross-validation
3. Plot the ROC curves for the best versions of your models and compute the AUROC. According to this criterion, which model performs best?
4. Compare the confusion matrices generated by the models.
5. Which variables 'survive' in the logistic regression with LASSO?
6. OPTIONAL: Is there a way in the logistic regression to conclude which variables are more important for the prediction performance?
7. OPTIONAL: Now let's see if, in a real-time experiment, any of our models would have predicted the financial crisis of 2007-08. Put all the observations up to and including 2005 in the training sample, and the rest in the test sample. You can use the data frame `df_final_withyear` for this purpose. Fit your preferred model for each of the 5 categories on the training sample. Would they have warned us in 2006 and 2007 of the imminent financial crisis? (For the logistic regressions, in order to draw this conclusion, use the probability thresholds which, on the ROC curves, obtain an 80% rate of true positives.)
8. OPTIONAL: compute any indicator you like.


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# 1. Train test split

In [8]:
X = df_final.drop('crisis_warning', axis = 1)
y = df_final.crisis_warning

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
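
Crisis warnings are rare, so a purely random split can leave very few positive cases in the test sample. One possible refinement, sketched here on hypothetical synthetic data (`X_toy` and `y_toy` are invented), is the `stratify` argument of `train_test_split`, which keeps the share of positives the same in both samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 observations, 10% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 20 + [0] * 180)

# stratify=y_toy preserves the class proportions in both samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```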

# 2. Fitting

## Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression

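# unpenalized logistic regression (scikit-learn releases before 1.2 spell this penalty = 'none')
log_reg = LogisticRegression(penalty = None, max_iter = 1000).fit(X_train, y_train)
# note: this scores the hard 0/1 predictions; an ROC curve is normally built from predict_proba scores
roc_auc_score(y_test, log_reg.predict(X_test))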

0.5185185185185185
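
The score above is computed from hard class predictions, which collapses the ROC curve to a single threshold. As a sketch of the difference, on hypothetical synthetic data rather than the JST sample, the AUROC can be computed both ways:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the crisis sample.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# AUROC from hard 0/1 labels (one threshold only) ...
auc_labels = roc_auc_score(y_b, clf.predict(X_b))
# ... versus AUROC from predicted probabilities of class 1 (all thresholds).
auc_scores = roc_auc_score(y_b, clf.predict_proba(X_b)[:, 1])
print(auc_labels, auc_scores)
```

In the notebook, the analogous call would pass `log_reg.predict_proba(X_test)[:, 1]` as the second argument.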

## Logistic Regression with Lasso

In [9]:
from sklearn.model_selection import GridSearchCV

In [13]:
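# L1-penalized (LASSO) logistic regression; in scikit-learn a smaller C means stronger regularization
log_reg_lasso = LogisticRegression(penalty = 'l1', solver = 'liblinear', max_iter = 1000)
parameters = {'C':[1, 10, 100]}
# 5-fold cross-validation, made explicit here; the default score for a classifier is accuracy
grid_search_results = GridSearchCV(log_reg_lasso, parameters, cv = 5).fit(X_train, y_train)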

In [18]:
# coefficients of the unpenalized logistic regression fitted earlier, for comparison
log_reg.coef_

array([[-25.82032589,   6.54519306,  64.88130481,  15.381356  ,
         -3.9211358 ,   0.87129992,  -5.05051519,  -5.98374812,
         -6.2618073 ,   0.23731307]])

In [14]:
pd.DataFrame(grid_search_results.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.007026 | 0.001722 | 0.002756 | 0.00099 | 1 | {'C': 1} | 0.933086 | 0.933086 | 0.929368 | 0.929368 | 0.932836 | 0.931549 | 0.001783 | 1 |
| 1 | 0.004653 | 0.002219 | 0.00177 | 0.000415 | 10 | {'C': 10} | 0.933086 | 0.933086 | 0.929368 | 0.929368 | 0.932836 | 0.931549 | 0.001783 | 1 |
| 2 | 0.003517 | 0.000396 | 0.001769 | 0.000601 | 100 | {'C': 100} | 0.929368 | 0.929368 | 0.929368 | 0.929368 | 0.929104 | 0.929315 | 0.000105 | 3 |
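
`GridSearchCV` scores classifiers by accuracy unless told otherwise, and with a rare positive class accuracy may barely move across values of `C`. A possible alternative, sketched on hypothetical synthetic data, is to score by AUROC over a finer grid with stratified folds. The grid and the fold settings below are assumptions, not values from the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced data standing in for X_train, y_train.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=3,
    weights=[0.9, 0.1], random_state=0
)

lasso = LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)
param_grid = {"C": np.logspace(-2, 2, 9)}  # assumed grid: 0.01 ... 100
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="roc_auc" ranks the candidates by cross-validated AUROC.
search = GridSearchCV(lasso, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```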


# Random Trees
## Mathias
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
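
A possible starting point for this section, reading "random trees" as a single decision tree (an assumption) and using hypothetical synthetic data in place of `X_train` and `y_train`, is a loop over tree depths:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced data standing in for the crisis sample.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

# Fit one tree per depth and record its test-sample AUROC.
tree_auc = {}
for depth in [1, 2, 3, 5, 8]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_a, y_a)
    tree_auc[depth] = roc_auc_score(y_b, tree.predict_proba(X_b)[:, 1])
print(tree_auc)
```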

# Random Forest
## Mathias
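
A minimal sketch for this section, again on hypothetical synthetic data. The number of trees is an arbitrary choice, not a recommendation from the paper:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the crisis sample.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_a, y_a)
forest_auc = roc_auc_score(y_b, forest.predict_proba(X_b)[:, 1])

# Impurity-based importances, one per predictor, normalized to sum to one.
importances = forest.feature_importances_
print(forest_auc, importances.round(3))
```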

# Neural Networks
## Andrey
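
One way to approach this section is `MLPClassifier` inside a pipeline that standardizes the predictors first, since neural networks tend to be sensitive to feature scale. The architectures tried below are arbitrary examples and the data are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced data standing in for the crisis sample.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

# Each tuple lists the number of neurons in each hidden layer.
mlp_auc = {}
for layers in [(5,), (10,), (10, 5)]:
    net = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=layers, max_iter=2000, random_state=0),
    )
    net.fit(X_a, y_a)
    mlp_auc[layers] = roc_auc_score(y_b, net.predict_proba(X_b)[:, 1])
print(mlp_auc)
```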

# 3. AUC ROC
## Andrey
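
The points of an ROC curve can be obtained with `roc_curve` and the area under it with `auc`. A sketch on hypothetical synthetic data, with the plotting step left out:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the crisis sample.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)
scores = clf.predict_proba(X_b)[:, 1]

# False positive rate and true positive rate at each candidate threshold.
fpr, tpr, thresholds = roc_curve(y_b, scores)
area = auc(fpr, tpr)  # trapezoidal area under the (fpr, tpr) points
print(area)
```

Plotting `tpr` against `fpr` for each fitted model on the same axes gives the comparison asked for in point 3.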

# 4. Confusion matrices
## Mathias
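
A confusion matrix can be computed with `confusion_matrix`. The sketch below uses hypothetical synthetic data and the default 0.5 probability threshold implied by `predict`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the crisis sample.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# Rows are true classes, columns are predicted classes (0 first, then 1).
cm = confusion_matrix(y_b, clf.predict(X_b))
tn, fp, fn, tp = cm.ravel()
print(cm)
```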

# 5. Which factors survived?

In [20]:
best_log_reg_lasso = LogisticRegression(penalty = 'l1', C = 1, solver = 'liblinear', max_iter = 1000)
best_log_reg_lasso.fit(X_train, y_train)

LogisticRegression(C=1, max_iter=1000, penalty='l1', solver='liblinear')

In [21]:
best_log_reg_lasso.coef_

array([[ 0.        ,  4.66433991,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        , -0.45938952,  0.        , -0.0838737 ]])
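
To read off which variables survive, the coefficients can be paired with the predictor names. The sketch below copies the coefficient array printed above and the first ten names of `variables`, assuming the columns of `X_train` are in that order:

```python
import numpy as np
import pandas as pd

# Predictor names in the order of the `variables` list (label excluded).
predictors = [
    "slope_yield_curve", "delta_credit", "delta_debt_serv_ratio",
    "delta_investm_ratio", "delta_pdebt_ratio", "delta_bmoney_gdp",
    "delta_curr_acc_gdp", "growth_cpi", "growth_cons", "eq_tr",
]

# Coefficients copied from the output of best_log_reg_lasso.coef_ above.
coef = np.array([[0.0, 4.66433991, 0.0, 0.0, 0.0,
                  0.0, 0.0, -0.45938952, 0.0, -0.0838737]])

lasso_coef = pd.Series(coef.ravel(), index=predictors)
survivors = lasso_coef[lasso_coef != 0]  # keep non-zero coefficients only
print(survivors)
```

In the notebook itself, `pd.Series(best_log_reg_lasso.coef_.ravel(), index=X_train.columns)` would give the same pairing without copying numbers by hand.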

# 6. To be written
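
One possible angle for this point: raw logistic coefficients depend on the units of each predictor, so a common heuristic is to standardize the predictors first and compare absolute coefficients. This is a sketch on hypothetical synthetic data with invented names `x0` to `x5`, and is only one of several notions of importance:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with six predictors.
X_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=0)
names = [f"x{i}" for i in range(6)]

# Standardize, then fit; coefficients are now per standard deviation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_syn, y_syn)

coefs = pipe.named_steps["logisticregression"].coef_.ravel()
std_coef = pd.Series(coefs, index=names).abs().sort_values(ascending=False)
print(std_coef)
```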

# 7. Prediction
## Andrey
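
A sketch of the real-time experiment on a hypothetical toy frame (the columns `x1`, `x2` and the simulated labels are invented). The threshold is read off the training-sample ROC curve, which is one possible reading of point 7:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Hypothetical data with a year column, standing in for df_final_withyear.
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "year": rng.integers(1950, 2017, size=n),
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
# Simulated label: a warning is more likely when x1 is high.
p = 1 / (1 + np.exp(-(-2.0 + 1.5 * toy["x1"])))
toy["crisis_warning"] = (rng.uniform(size=n) < p).astype(int)

# Time-based split: train up to and including 2005, test afterwards.
train = toy[toy["year"] <= 2005]
test = toy[toy["year"] > 2005]
features = ["x1", "x2"]

model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["crisis_warning"])

# First threshold on the training ROC curve with a true positive rate >= 80%.
train_prob = model.predict_proba(train[features])[:, 1]
fpr, tpr, thresholds = roc_curve(train["crisis_warning"], train_prob)
idx = np.argmax(tpr >= 0.8)
threshold = thresholds[idx]

# A warning is issued whenever the predicted probability reaches the threshold.
warnings_issued = model.predict_proba(test[features])[:, 1] >= threshold
print(threshold, warnings_issued.mean())
```

With the actual data, the rows of the test sample for 2006 and 2007 would then show whether a warning was issued for each country.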