# DATA CHALLENGE: Non-Performing Loans

Welcome to this data challenge! 

## Framework

A non-performing loan (**NPL**) is a loan on which the debtor has not made the scheduled payments for at least 90 days. Once a loan is non-performing, the odds that it will be fully repaid are considered substantially lower. High levels of NPLs inhibit banks' capacity to lend to the economy and take up valuable bank management time. As of the third quarter of 2016, NPLs of significant institutions in the euro area amounted to €921bln (average NPE of 7%). The ECB therefore asks banks to devise a strategy to manage and reduce the volume of impaired loans. 


## Challenge Objective

The general goal of this data challenge is to help NPL portfolio managers in their daily tasks by re-ranking their customers according to their likelihood of repayment and/or by suggesting new strategies to **maximize the recovery rate**. In particular, you are asked to analyze the given data, train predictive models and present your findings in a clear and data-driven way. 
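
One way to make the re-ranking goal measurable is to check how much of the recovered amount falls on the customers ranked first. The sketch below is only an illustration under stated assumptions: `scores` (model output per customer) and `recovered` (amount actually recovered per customer) are hypothetical inputs, and the function name and the 20% default cut-off are arbitrary choices, not part of the challenge.

```python
def recovery_captured_at_k(scores, recovered, k_fraction=0.2):
    """Share of the total recovered amount held by the top-scored customers.

    scores and recovered are hypothetical, position-aligned lists:
    one score and one recovered amount per customer.
    """
    if len(scores) != len(recovered):
        raise ValueError("scores and recovered must have the same length")
    if not 0 < k_fraction <= 1:
        raise ValueError("k_fraction must be in (0, 1]")
    total = sum(recovered)
    if total == 0:
        return 0.0
    # Rank customers from the highest to the lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_n = max(1, round(len(scores) * k_fraction))
    return sum(recovered[i] for i in order[:top_n]) / total


# Toy data: five customers, the two best-scored ones are inspected
scores = [0.9, 0.1, 0.6, 0.3, 0.8]
recovered = [500.0, 0.0, 200.0, 50.0, 250.0]
share = recovery_captured_at_k(scores, recovered, k_fraction=0.4)
```

A ranking that concentrates recoveries at the top of the list yields a share well above `k_fraction`; a random ranking yields a share close to it.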


## Data

You should have received several CSV files along with this README Jupyter Notebook. These files contain synthetic but very realistic data on NPL counterparties. In particular, you should have: 


| FILE NAME  | BRIEF DESCRIPTION |
| ------------- | ------------- |
| ANAGRAFICA_CLIENTI.csv  | customers' registry data |
| CC.csv | bank accounts data |
| CENTRALE_RISCHI.csv | central risk data |
| GARANZIE.csv | guarantees data |
| MUTUI.csv | mortgage data |
| PERIMETRO_INIZIALE.csv | customers in scope |
| TRANSCODIFICA_GARANZIE.csv | decoding table for guarantees |
| DICTIONARY.xlsx | detailed explanation of columns |

The last file contains a detailed description for every column of each CSV listed above. 
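
A first look at how the files relate can start from their headers alone. The sketch below lists, for each pair of CSV files in a folder, the column names they share, which are natural candidates for join keys. It is a minimal sketch under assumptions: the delimiter is guessed from the first line, the encoding is assumed to be UTF-8, and the helper names `read_header` and `shared_columns` are hypothetical. The actual column meanings should be checked against DICTIONARY.xlsx.

```python
import csv
from itertools import combinations
from pathlib import Path


def read_header(path, candidates=(",", ";", "|", "\t")):
    """Return the column names of a CSV file, guessing the delimiter.

    The delimiter is the candidate that occurs most often in the first line
    (an assumption; adjust it if the files use something else).
    """
    with open(path, newline="", encoding="utf-8-sig") as handle:
        first_line = handle.readline()
    delimiter = max(candidates, key=first_line.count)
    return next(csv.reader([first_line], delimiter=delimiter), [])


def shared_columns(data_dir):
    """Map each pair of CSV files to the column names they have in common."""
    headers = {
        path.name: set(read_header(path))
        for path in sorted(Path(data_dir).glob("*.csv"))
    }
    shared = {}
    for left, right in combinations(headers, 2):
        common = headers[left] & headers[right]
        if common:
            shared[(left, right)] = sorted(common)
    return shared
```

Calling `shared_columns(".")` in the folder that holds the CSVs returns a dictionary keyed by file pairs. A shared name is only a hint: whether the column is a valid key still depends on its uniqueness and coverage in each table.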


## Target variable 

No target variable is given because, in a real-world setting, it has to be created according to business requirements and project objectives. You are therefore asked to create the target variable yourself and to justify your choices coherently. 


## Useful Advice

Please keep the advice below in mind during your analysis. 

* Manage your time carefully to deliver a high-quality analysis that answers most of the required tasks. 
* Ask questions and give reasonable answers in a data-driven way. 
* Pay attention to details without losing focus on the general objective of this challenge. 
* Keep track of your assumptions and motivate them during the discussion. 
* Model performance is important, but it is not the most important thing.
* Knowing how to run model.fit() doesn't automatically make you a data scientist. :) 


#### Enjoy the challenge!

______________________________________________________

Perform all your analysis in __Python__ or __R__. Present your findings in a clean and coherent way, for example by using a **Jupyter Notebook** or **R Markdown**. The code that you deliver along with the presentation should be well-organized and appropriately commented, so that it can be re-run to reproduce any of your results, including on a new dataset. 

## Required tasks

This list contains the minimum required tasks you should complete (not necessarily in this order). However, we strongly encourage any additional observations that help achieve the challenge objective. 

### 1. Data Exploration
Explore the given CSVs, point out any interesting findings and answer the questions below. 
 * How are the files connected? Which columns should you use to join them? 
 * Focus on a single dataset: CC.csv
     * How many months of data do you have?  
     * Are there seasonal patterns over time? If yes, in which (aggregated) variables? Make some plots. 
     * Are there correlations between columns? 
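
The month count and a first seasonality check can be sketched as follows. This is a minimal illustration, assuming each record can be reduced to a pair of an ISO-formatted date string and a numeric amount; the toy `rows` and the function names are hypothetical, and the real column names of CC.csv are described in DICTIONARY.xlsx.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_totals(rows):
    """Aggregate (ISO date string, amount) pairs into totals per calendar month."""
    totals = defaultdict(float)
    for day, amount in rows:
        parsed = date.fromisoformat(day)
        totals[(parsed.year, parsed.month)] += amount
    return dict(sorted(totals.items()))


def month_of_year_profile(totals):
    """Average the monthly totals by month of year (1-12) to expose seasonality."""
    by_month = defaultdict(list)
    for (_, month), value in totals.items():
        by_month[month].append(value)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Toy records standing in for a date column and an aggregated variable
rows = [
    ("2017-01-31", 100.0), ("2017-01-31", 50.0), ("2017-02-28", 80.0),
    ("2018-01-31", 170.0), ("2018-02-28", 60.0),
]
totals = monthly_totals(rows)
n_months = len(totals)                   # number of distinct months observed
profile = month_of_year_profile(totals)  # one average per month of year
```

If the profile is roughly flat across the twelve months, there is little evidence of seasonality in that variable; recurring peaks in the same months across years suggest the opposite and are worth plotting.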
 
### 2. Data Cleaning 
Select at least two datasets.
 * Detect anomalous values, outliers and missing values. 
     * Which techniques or algorithms do you use to detect them?
     * How do you treat them? Is it ok to remove them from the dataset? 
     * Is there any evident inconsistency in the data? If yes, how would you clean the data to overcome it? 
 * Why did you select these datasets?
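
As one possible starting point for the detection step, the sketch below flags missing entries and applies Tukey's interquartile-range fences to the remaining values. It is an illustration, not a recommended method: the `raw` list is made-up data, missing values are assumed to be encoded as `None`, and the multiplier `k=1.5` is the conventional default rather than a tuned choice.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def flag_values(raw):
    """Return the positions of missing entries and of IQR outliers."""
    missing = [i for i, value in enumerate(raw) if value is None]
    present = [value for value in raw if value is not None]
    low, high = iqr_bounds(present)
    outliers = [
        i for i, value in enumerate(raw)
        if value is not None and not low <= value <= high
    ]
    return missing, outliers


# Toy column with two missing entries and one extreme value
raw = [10.0, 12.0, None, 11.0, 13.0, 12.5, 500.0, 11.5, None, 12.2]
missing, outliers = flag_values(raw)
```

Flagging is deliberately kept separate from treatment: whether a flagged value is removed, capped or kept is a business decision that depends on what the column represents.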
 
### 3. Target Variable
As in the real world, you are asked to create the target variable for this challenge by keeping in mind the challenge objective stated above. 
 * Create a data matrix to feed models (you can decide to use only a subset of the given datasets).
 * Which granularity did you select and why?
 * Which columns identify each occurrence?
 * Propose a target variable construction, write it as a mathematical formula and implement it. 
 * Explain and validate every choice/assumption you made in the previous point, also with business intuitions. 
 * If you decide not to have a target variable to predict, carefully explain your decision and how it could influence the next steps of this analysis.  
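
To illustrate what a formula plus implementation could look like, consider one possible (not prescribed) definition: the recovery rate of customer $i$ over a horizon of $H$ months after a reference date $t_0$,

$$RR_i = \min\left(1, \frac{\sum_{t=t_0+1}^{t_0+H} P_{i,t}}{E_{i,t_0}}\right),$$

where $P_{i,t}$ is the amount repaid in month $t$ and $E_{i,t_0}$ is the exposure at the reference date. All symbols, the 12-month horizon and the 0.5 threshold below are assumptions made for the sake of the example; the challenge leaves these choices to you.

```python
def recovery_rate(exposure, payments, horizon_months=12):
    """Fraction of the starting exposure repaid within the horizon, clipped to [0, 1].

    exposure: outstanding amount at the reference date (hypothetical input).
    payments: amounts repaid in each following month, oldest first.
    """
    if exposure <= 0:
        return None  # undefined: nothing was owed at the reference date
    repaid = sum(payments[:horizon_months])
    return max(0.0, min(1.0, repaid / exposure))


def binary_target(rate, threshold=0.5):
    """Turn the recovery rate into a 0/1 label; the threshold is an arbitrary example."""
    return None if rate is None else int(rate >= threshold)


# Toy customer: 1000 owed, 100 repaid in each of the next six months
rate = recovery_rate(1000.0, [100.0] * 6)
label = binary_target(rate)
```

Keeping the continuous rate and the binary label as separate steps makes it easy to test how sensitive the results are to the horizon and the threshold.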

### 4. Feature Engineering
Whether or not you defined a target variable above, you can choose the features to use in a predictive model. 
 * Create new features (we suggest ~3) from existing columns. Explain the intuition behind them.
 * Select a subset of features to use to train a predictive model. The selection process should be rigorous and reproducible. 
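
As an example of the kind of derived features meant here, the sketch below builds three of them from one customer's monthly history. The inputs (`balances`, `credit_limit`, `payments`) and the feature names are hypothetical and are not taken from the provided files; the intuition behind each feature is given in the comments.

```python
def engineer_features(balances, credit_limit, payments):
    """Derive three example features from one customer's monthly history.

    balances, payments: monthly values, oldest first (hypothetical inputs).
    """
    # 1. Utilization: latest balance relative to the granted limit.
    #    Intuition: accounts close to their limit have less room to recover.
    utilization = balances[-1] / credit_limit if credit_limit > 0 else None

    # 2. Recency: months elapsed since the last non-zero payment.
    #    Intuition: a recent payment signals a still-engaged debtor.
    months_since_payment = None
    for lag, amount in enumerate(reversed(payments)):
        if amount > 0:
            months_since_payment = lag
            break

    # 3. Trend: average month-over-month change in balance.
    #    Intuition: a falling balance suggests ongoing repayment.
    steps = [later - earlier for earlier, later in zip(balances, balances[1:])]
    balance_trend = sum(steps) / len(steps) if steps else 0.0

    return {
        "utilization": utilization,
        "months_since_payment": months_since_payment,
        "balance_trend": balance_trend,
    }


features = engineer_features(
    balances=[1000.0, 900.0, 900.0, 800.0],
    credit_limit=2000.0,
    payments=[100.0, 0.0, 100.0, 0.0],
)
```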

### 5. Modeling and Performance
This is the fancy part, right? 
 * How would you split the dataset into train, validation and test sets? 
 * Which models have you tried to fit? Explain them.
 * Which metrics have you used to measure the performance? Why?
 * What have you done to improve the models' performance? Were these changes useful? 
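
When the data contain several rows per customer (for example one per month), a row-level random split lets the same customer appear in both train and test, which inflates the measured performance. One possible way around this, sketched below under that assumption, is to assign whole customers to a split through a hash of their identifier. The function name, the identifier format and the 70/15/15 proportions are hypothetical choices; an out-of-time split by date is an equally valid alternative.

```python
import hashlib


def assign_split(customer_id, train=0.7, validation=0.15):
    """Deterministically map a customer ID to 'train', 'validation' or 'test'.

    Every row of the same customer lands in the same split, and the
    assignment is reproducible across runs and across new datasets.
    """
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a pseudo-uniform value in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


# Toy rows: (customer ID, month); both rows of a customer share a split
rows = [("C001", "2018-01"), ("C001", "2018-02"), ("C002", "2018-01")]
splits = [assign_split(customer_id) for customer_id, _ in rows]
```

Because the mapping depends only on the identifier, re-running the notebook or adding new months of data never moves a customer from one split to another.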

### 6. Presentation of the results
Along with the Jupyter Notebook or the R Markdown document, prepare a concise PowerPoint presentation that explains steps 1-5 to a non-technical audience and proposes future developments. 



Last update: 2019-05-20 00:54:13.015023
