![Image](../OUTPUT/compimage.png )


<h1 style="text-align: center;" markdown="1">Data Science Presentation: Give Me Some Credit</h1>

## ***The Brief:***

### Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years

* **Prize pool was `$5,000` (`$3,000` for first, `$1,500` for second and `$500` for third)**
* **Winner of the competition was *Perfect Storm***

The scoring criterion used was the **Area Under the Curve** (AUC) of the ROC curve. The AUC is a commonly used evaluation metric for binary classification problems such as this one. Its interpretation is that, given one randomly chosen positive observation and one randomly chosen negative observation, the AUC is the proportion of the time the model scores the positive one higher. It is less affected by class imbalance than accuracy, which is one reason to prefer it over metrics read off a confusion matrix. A perfect model scores an AUC of 1, while random guessing scores around 0.5, i.e. a 50% chance of ranking each pair correctly.
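
To make the pairwise interpretation concrete, here is a minimal sketch in plain Python. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; they are not part of the competition data. `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie in score counts as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (illustrative only)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, the AUC is unchanged if every score is rescaled by the same increasing function.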

```python
import pandas as pd
import sklearn
import numpy as np
import matplotlib as mpl

# Import the initial training data, using the 'idx' column as the index
training = pd.read_csv('../DATA/cs-training.csv', index_col='idx')

training.head()
```

| idx | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30to59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60to89DaysPastDueNotWorse | NumberOfDependents |
|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
| 2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600 | 4 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0.65818 | 38 | 1 | 0.085113 | 3042 | 2 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0.23381 | 30 | 0 | 0.03605 | 3300 | 5 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |


The target is the binary variable `SeriousDlqin2yrs` ("Serious Delinquency in 2 years"), which is 1 if the person experienced a delinquency of 90 days past due within that two-year period, and 0 otherwise.

## Features
The most interesting features to consider are often the key splitting criteria of a classification tree. If we run a simple decision tree, we find that the three variables used as splitting criteria are:
- Number of times borrower has been 90 days or more past due in the last 2 years.
- Number of times borrower has been 60-89 days past due but no worse in the last 2 years.
- Number of times borrower has been 30-59 days past due but no worse in the last 2 years.

![Image](../OUTPUT/tree3.png )
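
As a rough sketch of how the splitting variables of a shallow tree can be read off in scikit-learn: the column names below follow the competition file, but the values and the target are synthetic stand-ins generated for illustration, and the depth of 2 is an assumption rather than the setting used for the tree pictured above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data: real column names, made-up values
X = pd.DataFrame({
    "NumberOfTimes90DaysLate": rng.poisson(0.3, n),
    "NumberOfTime60to89DaysPastDueNotWorse": rng.poisson(0.3, n),
    "NumberOfTime30to59DaysPastDueNotWorse": rng.poisson(0.5, n),
    "age": rng.integers(21, 90, n),
})

# Made-up target: more late payments means a higher chance of distress
late_total = X.drop(columns="age").sum(axis=1)
y = (late_total + rng.normal(0, 0.5, n) > 1.5).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# tree_.feature holds the column index used at each node; leaves are marked with -2
split_idx = [i for i in tree.tree_.feature if i >= 0]
split_features = sorted({X.columns[i] for i in split_idx})
print(split_features)
```

On the real training data, the same inspection would list whichever columns the fitted tree actually split on.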

One point that is not obvious from the tree is the effect of income, which surely plays a role, and the debt ratio is surely another key factor. As we will see later, the variable importance plots from random forests and gradient boosting provide a very intuitive way of understanding which variables played a role in the prediction.
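
A minimal sketch of how variable importances can be extracted from a random forest. The data here is synthetic, with a made-up target driven by `DebtRatio` alone so that the expected ranking is known in advance; it says nothing about which variables matter in the real competition data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-in data: real column names, made-up values
X = pd.DataFrame({
    "MonthlyIncome": rng.lognormal(8.5, 0.6, n),
    "DebtRatio": rng.uniform(0, 1, n),
    "age": rng.integers(21, 90, n),
})

# Made-up target that depends on DebtRatio only (plus a little noise)
y = (X["DebtRatio"] + rng.normal(0, 0.1, n) > 0.7).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` would turn the same series into a simple importance plot.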


## My Approach
* Data Cleaning
* Focus was on the model build
* Cross Validation - Training & Test splits, Out-of-Bag and Grid Searching
* Models attempted
    * Linear Discriminant Analysis - A supervised dimensionality reduction approach
    * Logistic Regression
    * Classification Trees
    * Random Forests
    * Gradient Boosting
    * Support Vector Machines
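
The validation steps listed above (a training/test split plus a cross-validated grid search scored on AUC) could be sketched roughly as follows. The data is synthetic and the parameter grid is hypothetical; neither reflects the settings actually used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the competition data
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hold out a test split, stratified so both splits keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid, kept small so the example runs quickly
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # same metric as the competition
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)                # mean cross-validated AUC of the best setting
test_auc = search.score(X_test, y_test)  # AUC of the refitted model on the held-out split
print(test_auc)
```

For random forests, passing `oob_score=True` to `RandomForestClassifier` gives an out-of-bag estimate without a separate validation split.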
    
## Data Cleaning 
* Missing data issues for monthly income and the number of dependants
    * Some basic methods such as median for income and mode for the number of dependants were employed. Ultimately more sophisticated approaches could have been investigated. 
    * Late in the model development process, it was discovered that the debt ratio was not actually a ratio for records where the original data was missing. A simple approach was to recalculate the debt ratio for those records, which did help to improve the overall prediction.
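
The median and mode imputation described above might look like this in pandas. The small frame is made up for illustration and only borrows the competition's column names; the debt ratio recalculation is not shown because its details are not given here.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using the competition's column names
df = pd.DataFrame({
    "MonthlyIncome": [9120.0, 2600.0, np.nan, 3300.0, np.nan],
    "NumberOfDependents": [2.0, 1.0, 0.0, np.nan, 0.0],
})

# median() skips missing values by default
income_median = df["MonthlyIncome"].median()

# mode() returns a Series (there can be ties), so take the first entry
dependents_mode = df["NumberOfDependents"].mode().iloc[0]

# Fill each column with its own statistic
cleaned = df.fillna({
    "MonthlyIncome": income_median,
    "NumberOfDependents": dependents_mode,
})
print(cleaned)
```

When scoring a separate test file, the median and mode would normally be computed on the training data and reused, so the test set does not influence the fill values.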
    
# Model Build
This was made difficult by this being the first time I've used Python. A key goal for me coming out of this course was to be able to develop a broad range of models to add to my arsenal.


## Difficulties faced


## Outcomes of the modelling process

In order of worst to best:


| Model | Modelling outcome | AUC |
|----------|:-------------|--------:|
| Linear Discriminant Analysis | Training score was so poor that I did not submit it to Kaggle | **N/A** |
| Logistic Regression | Poor prediction; non-linearities in the data cause issues | **0.697912** |
| Classification Trees | Large improvement in classification | **0.848429** |
| Random Forests | Improvement on CART, though only minimal because of the low number of features | **0.852720** |
| Gradient Boosting | Improvement again on Random Forests. The best stand-alone prediction result achieved | **0.866799** |
| Support Vector Machines | Very hard to fit and cross-validate, so only minimal CV was conducted | **X** |



## Variable Importance

## For improvement 