# Objective

- Create a model to predict delinquency probability and measure its performance.
- Submit a document that explains your methodology, preferably as an IPython notebook.

# Datasets
You will find 3 csv files. The files were created from our database as of 2020 Feb 05.

#####      *Brazil_DS_loans_2019-11-10_2019-12-05*

It contains the loans made over a period of 25 days, with the following important fields:

- Loan_id - unique identifier for a loan
- Uuid - user identifier
- Created_at - time when loan was created
- Paid_at - time when it was paid. If it is missing, the loan had not been paid as of the file creation date
- Amount - amount of loan

A loan is considered repaid if it is paid within 60 days. The objective is to create a predictive model for loan repayment based on the labels in this file.
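The labelling rule can be sketched as follows. This is a minimal illustration using only the standard library; `is_repaid` and `REPAYMENT_WINDOW` are hypothetical names, and treating day 60 itself as still "within 60 days" is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

REPAYMENT_WINDOW = timedelta(days=60)  # assumption: the 60-day limit is inclusive


def is_repaid(created_at: datetime, paid_at: Optional[datetime]) -> bool:
    """Return True when the loan was paid within 60 days of its creation."""
    if paid_at is None:  # a missing Paid_at means the loan was never paid
        return False
    return paid_at - created_at <= REPAYMENT_WINDOW


created = datetime(2019, 11, 10)
print(is_repaid(created, datetime(2019, 12, 1)))   # paid after 21 days
print(is_repaid(created, datetime(2020, 1, 20)))   # paid after 71 days
print(is_repaid(created, None))                    # never paid
```

Even a loan created on 2019-12-05 reaches day 60 on 2020-02-03, before the 2020-02-05 extraction date, so every loan in the file has a fully observed repayment window.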

#####       *Brazil_DS_prev_loans*

This file has the previous loans taken by the users in the file above and should have the same schema.

#####       *Brazil_DS_recharges_2019-08-10_2019-12-05*

A user pays for loans by making recharges after taking a loan. This file contains recharges for all users for about 4 months.

#### Read

The datasets Brazil_DS_loans_2019-11-10_2019-12-05 and Brazil_DS_prev_loans were concatenated.

In [13]:
from functions.read import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from functions.feature_default import ForecastDefault
import logging

logger = logging.getLogger()
logger.setLevel(logging.ERROR)
loans, recharges = load_dfs()

#### Fit model

In [14]:
forecast = ForecastDefault(
    loans_hist=loans,
    recharges_hist=recharges,
    estimators_list=[
        LogisticRegression,
        XGBClassifier,
        RandomForestClassifier,
    ],
    eval_metric='roc_auc',
    inicial_date='2019-01-01',
    days_to_default=60,
    limit_date='2019-12-31',
)

# Report

#### Exploratory data analysis

The analysis window runs from *2019-01-16* to *2019-12-04*. The timeline used to build the training dataset is cut off at *2019-10-05*, which is 60 days before the last loan creation date. The target variable, customers who did not repay their loans within 60 days, has a prevalence of only about 11.5%.
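The cut-off date follows from subtracting the 60-day default window from the last loan creation date. A quick check with the standard library (the variable names are illustrative, not taken from `ForecastDefault`):

```python
from datetime import date, timedelta

DAYS_TO_DEFAULT = 60
last_loan_date = date(2019, 12, 4)  # last loan creation date in the analysis window

# Training cut-off: 60 days before the last loan creation date.
training_cutoff = last_loan_date - timedelta(days=DAYS_TO_DEFAULT)
print(training_cutoff)  # 2019-10-05
```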

| Default | Relative % |
|---------|------------|
| No      | 88.5%      |
| Yes     | 11.5%      |


In [15]:
forecast.train_df.groupby('target').count()['uuid'].to_frame() / len(forecast.train_df)

| target | uuid     |
|--------|----------|
| 0      | 0.884802 |
| 1      | 0.115198 |


In [16]:
forecast.train_df.describe()

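|       | target   | sum_amount | freq_recharges_weekly | recharges_weekly | delta_after_recharges |
|-------|----------|------------|-----------------------|------------------|-----------------------|
| count | 1033.0   | 1033.0     | 1033.0                | 1033.0           | 1033.0                |
| mean  | 0.115198 | 12.662149  | 0.956559              | 0.170196         | 0.541723              |
| std   | 0.319416 | 14.9216    | 0.824489              | 0.087977         | 0.96441               |
| min   | 0.0      | 0.0        | 0.0                   | 0.0              | 0.0                   |
| 25%   | 0.0      | 5.0        | 0.375                 | 0.125            | 0.075                 |
| 50%   | 0.0      | 10.0       | 0.75                  | 0.125            | 0.285                 |
| 75%   | 0.0      | 15.0       | 1.25                  | 0.25             | 0.67                  |
| max   | 1.0      | 150.0      | 5.5                   | 0.8125           | 20.05                 |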


In [17]:
forecast.train_df.median()

target                    0.000
sum_amount               10.000
freq_recharges_weekly     0.750
recharges_weekly          0.125
delta_after_recharges     0.285
dtype: float64

##### Created variables

The data were aggregated per *UUID*.

```python
# Engineered variables (one value per UUID) and what each one means.
created_variables = {
    "already_default": "0 if not defaulted, 1 if defaulted",
    "sum_amount": "sum amount of all previous loans paid",
    "count_loans": "number of previous loans paid",
    "freq_recharges_weekly": "mean frequency recharges per week",
    "recharges_weekly": "median frequency recharges per week",
    "delta_after_recharges": "difference between balance after recharge and recharge value",
}
```
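As an illustration of the per-UUID aggregation, the two loan-based variables could be computed as in the sketch below. The records and the `was_paid` flag are hypothetical; the real pipeline lives in `functions.feature_default` and is not shown in this notebook.

```python
from collections import defaultdict

# Hypothetical previous-loan records: (uuid, amount, was_paid)
prev_loans = [
    ("user_a", 5.0, True),
    ("user_a", 10.0, True),
    ("user_a", 15.0, False),
    ("user_b", 5.0, True),
]

sum_amount = defaultdict(float)
count_loans = defaultdict(int)
for uuid, amount, was_paid in prev_loans:
    if was_paid:  # only paid loans contribute, per the variable definitions
        sum_amount[uuid] += amount
        count_loans[uuid] += 1

print(dict(sum_amount))   # {'user_a': 15.0, 'user_b': 5.0}
print(dict(count_loans))  # {'user_a': 2, 'user_b': 1}
```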



The variable *count_loans* was removed because it is almost perfectly collinear with *sum_amount*: their correlation is 0.975, above the 95% limit.

In [18]:
forecast.corr_df.corr()

|                       | target    | sum_amount | count_loans | freq_recharges_weekly | recharges_weekly | delta_after_recharges |
|-----------------------|-----------|------------|-------------|-----------------------|------------------|-----------------------|
| target                | 1.0       | -0.062007  | -0.061411   | -0.237355             | -0.085627        | -0.054256             |
| sum_amount            | -0.062007 | 1.0        | 0.974835    | 0.624845              | 0.477428         | 0.074858              |
| count_loans           | -0.061411 | 0.974835   | 1.0         | 0.609952              | 0.449434         | 0.071755              |
| freq_recharges_weekly | -0.237355 | 0.624845   | 0.609952    | 1.0                   | 0.6974           | 0.065405              |
| recharges_weekly      | -0.085627 | 0.477428   | 0.449434    | 0.6974                | 1.0              | 0.097317              |
| delta_after_recharges | -0.054256 | 0.074858   | 0.071755    | 0.065405              | 0.097317         | 1.0                   |
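The collinearity filter can be reproduced from the correlation values above. This is a sketch, not the code inside `ForecastDefault`: the 0.95 threshold is how the "over 95%" rule is read here, and dropping the second feature of each offending pair is an assumption.

```python
THRESHOLD = 0.95  # assumption: "over 95%" read as |r| > 0.95

# Pairwise correlations between features, copied from the matrix above.
feature_corr = {
    ("sum_amount", "count_loans"): 0.974835,
    ("sum_amount", "freq_recharges_weekly"): 0.624845,
    ("sum_amount", "recharges_weekly"): 0.477428,
    ("sum_amount", "delta_after_recharges"): 0.074858,
    ("count_loans", "freq_recharges_weekly"): 0.609952,
    ("count_loans", "recharges_weekly"): 0.449434,
    ("count_loans", "delta_after_recharges"): 0.071755,
    ("freq_recharges_weekly", "recharges_weekly"): 0.697400,
    ("freq_recharges_weekly", "delta_after_recharges"): 0.065405,
    ("recharges_weekly", "delta_after_recharges"): 0.097317,
}

# For each pair above the threshold, drop the second feature of the pair.
to_drop = sorted({b for (_, b), r in feature_corr.items() if abs(r) > THRESHOLD})
print(to_drop)  # ['count_loans']
```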


The training dataset was balanced by undersampling the majority class, resulting in the following shape:

In [19]:
forecast.model.train_X.shape

(238, 4)
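The 238 rows are consistent with the class counts reported earlier: 11.5198% of 1033 rows is 119 defaults, and keeping an equal number of non-defaults gives 2 × 119 = 238. A minimal sketch of random undersampling follows; `undersample` is a hypothetical helper, and the sampling method actually used by `ForecastDefault` is not shown in this notebook.

```python
import random


def undersample(rows, label_of, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    positives = [r for r in rows if label_of(r) == 1]
    negatives = [r for r in rows if label_of(r) == 0]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))
    balanced = minority + kept_majority
    rng.shuffle(balanced)
    return balanced


# Toy data with the class counts implied by the report:
# 1033 rows, of which 119 (11.5%) are defaults.
rows = [{"target": 1}] * 119 + [{"target": 0}] * (1033 - 119)
balanced = undersample(rows, label_of=lambda r: r["target"])
print(len(balanced))  # 238
```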

## Fit model and hypothesis cases

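The candidate models were scored across cross-validation folds (five, according to the logs), and the one with the best mean score on the chosen evaluation metric was selected. The candidates were *XGBoost*, *Random Forest*, and *Logistic Regression*. Different hypotheses about the model's goal were considered, each with its own evaluation metric.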
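Selection by mean fold score can be sketched with the per-fold recall values that appear in the training log of case 1. `pick_best` is a hypothetical helper, not part of `ForecastDefault`.

```python
from statistics import mean

# Per-fold recall values, copied from the training log of case 1.
fold_scores = {
    "LogisticRegression": [0.913, 0.6, 0.957, 0.87, 0.64],
    "XGBClassifier": [0.826, 0.72, 0.87, 0.957, 0.84],
    "RandomForestClassifier": [0.826, 0.6, 0.87, 0.913, 0.76],
}


def pick_best(scores_by_model):
    """Return the model name with the highest mean fold score (hypothetical helper)."""
    return max(scores_by_model, key=lambda name: mean(scores_by_model[name]))


for name, scores in fold_scores.items():
    print(f"{name}: mean = {mean(scores):.3f}")
print("Best model:", pick_best(fold_scores))  # Best model: XGBClassifier
```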

#### 1 - If the model's main goal is to detect delinquencies, the evaluation metric should be *recall*

*XGBClassifier* is the model of choice, with a recall of 0.647 on the test sample (mean recall across the training folds: 0.843).

In [20]:
logger.setLevel(logging.INFO)
forecast = ForecastDefault(
    loans_hist=loans,
    recharges_hist=recharges,
    estimators_list=[
        LogisticRegression,
        XGBClassifier,
        RandomForestClassifier,
    ],
    eval_metric='recall',
    inicial_date='2019-01-01',
    days_to_default=60,
    limit_date='2019-12-31',
)

INFO:root:<class 'sklearn.linear_model._logistic.LogisticRegression'> : Training sample metrics | [0.913, 0.6, 0.957, 0.87, 0.64] | mean: 0.796
INFO:root:<class 'xgboost.sklearn.XGBClassifier'> : Training sample metrics | [0.826, 0.72, 0.87, 0.957, 0.84] | mean: 0.843
INFO:root:<class 'sklearn.ensemble._forest.RandomForestClassifier'> : Training sample metrics | [0.826, 0.6, 0.87, 0.913, 0.76] | mean: 0.794
INFO:root:Best model :XGBClassifier
INFO:root:Test sampling recall : 0.647
INFO:root:Classification report 
               precision    recall  f1-score   support

           0       0.85      0.54      0.66       187
           1       0.28      0.65      0.39        51

    accuracy                           0.56       238
   macro avg       0.56      0.59      0.52       238
weighted avg       0.73      0.56      0.60       238

INFO:root:Confusion matrix 
 [[101  86]
 [ 18  33]]
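The headline numbers in the classification report can be recomputed by hand from the confusion matrix. The sketch below assumes the usual scikit-learn layout (rows are true classes, columns are predictions), which matches the supports of 187 and 51 in the report.

```python
# Confusion matrix from the log: rows are true classes, columns are predictions.
tn, fp = 101, 86
fn, tp = 18, 33

recall = tp / (tp + fn)            # share of real defaults that were caught
precision = tp / (tp + fp)         # share of predicted defaults that were real
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# recall=0.647 precision=0.277 f1=0.388 accuracy=0.563
```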


#### 2 - If the model's main goal is to make as few wrong predictions as possible overall, the evaluation metric should be *accuracy*

*Random Forest* is the model of choice, with an accuracy of 0.563 on the test sample (mean accuracy across the training folds: 0.849).

In [21]:
logger.setLevel(logging.INFO)
forecast = ForecastDefault(
    loans_hist=loans,
    recharges_hist=recharges,
    estimators_list=[
        LogisticRegression,
        XGBClassifier,
        RandomForestClassifier,
    ],
    eval_metric='acc',
    inicial_date='2019-01-01',
    days_to_default=60,
    limit_date='2019-12-31',
)

INFO:root:<class 'sklearn.linear_model._logistic.LogisticRegression'> : Training sample metrics | [0.708, 0.688, 0.833, 0.766, 0.745] | mean: 0.748
INFO:root:<class 'xgboost.sklearn.XGBClassifier'> : Training sample metrics | [0.812, 0.812, 0.854, 0.872, 0.851] | mean: 0.84
INFO:root:<class 'sklearn.ensemble._forest.RandomForestClassifier'> : Training sample metrics | [0.833, 0.812, 0.896, 0.872, 0.83] | mean: 0.849
INFO:root:Best model :RandomForestClassifier
INFO:root:Test sampling acc : 0.563
INFO:root:Classification report 
               precision    recall  f1-score   support

           0       0.85      0.54      0.66       187
           1       0.28      0.65      0.39        51

    accuracy                           0.56       238
   macro avg       0.56      0.59      0.52       238
weighted avg       0.73      0.56      0.60       238

INFO:root:Confusion matrix 
 [[101  86]
 [ 18  33]]


#### 3 - If the model's main goal is to balance precision against the correct detection of delinquencies, the evaluation metric should be *F1*

*XGBClassifier* is the model of choice according to the log, with an F1 score of 0.388 on the test sample (mean F1 across the training folds: 0.839).

In [22]:
logger.setLevel(logging.INFO)
forecast = ForecastDefault(
    loans_hist=loans,
    recharges_hist=recharges,
    estimators_list=[
        LogisticRegression,
        XGBClassifier,
        RandomForestClassifier,
    ],
    eval_metric='f1',
    inicial_date='2019-01-01',
    days_to_default=60,
    limit_date='2019-12-31',
)

INFO:root:<class 'sklearn.linear_model._logistic.LogisticRegression'> : Training sample metrics | [0.75, 0.667, 0.846, 0.784, 0.727] | mean: 0.755
INFO:root:<class 'xgboost.sklearn.XGBClassifier'> : Training sample metrics | [0.809, 0.8, 0.851, 0.88, 0.857] | mean: 0.839
INFO:root:<class 'sklearn.ensemble._forest.RandomForestClassifier'> : Training sample metrics | [0.844, 0.75, 0.87, 0.875, 0.851] | mean: 0.838
INFO:root:Best model :XGBClassifier
INFO:root:Test sampling f1 : 0.388
INFO:root:Classification report 
               precision    recall  f1-score   support

           0       0.85      0.54      0.66       187
           1       0.28      0.65      0.39        51

    accuracy                           0.56       238
   macro avg       0.56      0.59      0.52       238
weighted avg       0.73      0.56      0.60       238

INFO:root:Confusion matrix 
 [[101  86]
 [ 18  33]]


#### 4 - If the model's main goal is to distinguish between the classes, the evaluation metric should be *AUC*

*Random Forest* is the model of choice, with a ROC AUC of 0.586 on the test sample (mean ROC AUC across the training folds: 0.851).

In [23]:
forecast = ForecastDefault(
    loans_hist=loans,
    recharges_hist=recharges,
    estimators_list=[
        LogisticRegression,
        XGBClassifier,
        RandomForestClassifier,
    ],
    eval_metric='roc_auc',
    inicial_date='2019-01-01',
    days_to_default=60,
    limit_date='2019-12-31',
)

INFO:root:<class 'sklearn.linear_model._logistic.LogisticRegression'> : Training sample metrics | [0.717, 0.691, 0.838, 0.768, 0.752] | mean: 0.753
INFO:root:<class 'xgboost.sklearn.XGBClassifier'> : Training sample metrics | [0.813, 0.817, 0.855, 0.874, 0.852] | mean: 0.842
INFO:root:<class 'sklearn.ensemble._forest.RandomForestClassifier'> : Training sample metrics | [0.853, 0.82, 0.875, 0.873, 0.835] | mean: 0.851
INFO:root:Best model :RandomForestClassifier
INFO:root:Test sampling roc_auc : 0.586
INFO:root:Classification report 
               precision    recall  f1-score   support

           0       0.84      0.55      0.66       187
           1       0.27      0.63      0.38        51

    accuracy                           0.56       238
   macro avg       0.56      0.59      0.52       238
weighted avg       0.72      0.56      0.60       238

INFO:root:Confusion matrix 
 [[102  85]
 [ 19  32]]
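The reported test ROC AUC of 0.586 matches the value obtained when AUC is computed from hard 0/1 predictions, where it reduces to the mean of sensitivity and specificity. The sketch below checks this with a small pairwise AUC function (`auc_from_scores` is a hypothetical helper). Whether `ForecastDefault` actually scores labels rather than predicted probabilities is not shown in this notebook, so this is an observation and not a diagnosis.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Hard 0/1 predictions rebuilt from the confusion matrix [[102, 85], [19, 32]].
pos_scores = [1] * 32 + [0] * 19    # true defaults: 32 flagged, 19 missed
neg_scores = [1] * 85 + [0] * 102   # true repayers: 85 flagged, 102 cleared

auc = auc_from_scores(pos_scores, neg_scores)
balanced_accuracy = (32 / 51 + 102 / 187) / 2
print(f"{auc:.3f} {balanced_accuracy:.3f}")  # 0.586 0.586
```

If probabilities are available from the fitted model, scoring them instead of labels would give a threshold-independent AUC, which is what the metric is normally meant to capture.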


## Future improvements

A next step would be to set an allowed loan limit for each customer by combining regression models with business strategy, simulating different scenarios and estimating the possible return on investment (ROI) of each strategy. Once a strategy is chosen, it could be validated with A/B tests.