# Data Understanding

In [None]:
%matplotlib inline
#Initialise Libraries
import pandas as pd
import sklearn
import numpy as np
import matplotlib as mpl


In [None]:
# Import the initial training data
training = pd.read_csv('cs-training.csv', index_col='id')

In [None]:
training.head(n=5)

### Feature Discussion

#### The training dataset has 150,000 observations. The target is a binary indicator (i.e. Y/N) of whether the individual experienced a delinquency of 90 days past due or worse.

**The features we will seek to make use of are:**
> **RevolvingUtilizationOfUnsecuredLines** = Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits. (percentage)

> **age** = The age of borrower in years. (integer)

> **NumberOfTime30-59DaysPastDueNotWorse** = Number of times borrower has been 30-59 days past due but no worse in the last 2 years. (integer)

> **DebtRatio** = Monthly debt payments, alimony, and living costs divided by monthly gross income. (percentage)

> **MonthlyIncome** = Monthly income (real)

> **NumberOfOpenCreditLinesAndLoans** = Number of Open loans (installment like car loan or mortgage) and Lines of credit (e.g. credit cards). (integer)

> **NumberOfTimes90DaysLate** = Number of times borrower has been 90 days or more past due. (integer)

> **NumberRealEstateLoansOrLines** = Number of mortgage and real estate loans including home equity lines of credit. (integer)

> **NumberOfTime60-89DaysPastDueNotWorse** = Number of times borrower has been 60-89 days past due but no worse in the last 2 years. (integer) 

> **NumberOfDependents** = Number of dependents in family excluding themselves (spouse, children etc.). (integer)




In [None]:
# Let's get a feel for all the columns in the dataframe
training.describe(include='all')

In [None]:
# Count the number and percentage of serious delinquencies in the training set
n_delinquent = training.SeriousDlqin2yrs.sum()
pct_delinquent = 100.0 * training.SeriousDlqin2yrs.mean()  # mean of a 0/1 column is the share of 1s
print("The number of serious delinquencies in the training set is %d" % n_delinquent)
print("The percentage of delinquencies in the training set is %.2f%%" % pct_delinquent)

#### Observations:
> The training data is quite unbalanced, which is largely expected given the nature of the problem. Unless we carefully control for this imbalance, it will cause issues even for some very good learning algorithms: the learner will often treat the minority class as noise. We may still end up with high accuracy but low sensitivity. Ultimately we want to identify the individuals who will experience financial distress, as these are the applications we are more likely to refuse as a bank, because they threaten the bank's liquidity position.
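
The gap between accuracy and sensitivity can be illustrated with a small sketch. The labels below are synthetic (a 7% positive rate chosen purely for illustration, not the rate in `cs-training.csv`), and the "model" is a hypothetical baseline that never predicts a delinquency.

```python
import numpy as np

# Synthetic labels for illustration only: 1,000 borrowers, 70 of them delinquent.
y_true = np.zeros(1000, dtype=int)
y_true[:70] = 1

# A trivial baseline that predicts "no delinquency" for everyone.
y_pred = np.zeros_like(y_true)

# Accuracy: share of all predictions that match the true label.
accuracy = (y_pred == y_true).mean()

# Sensitivity (recall): share of actual delinquencies that were caught.
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
sensitivity = true_positives / float((y_true == 1).sum())

print("accuracy:    %.3f" % accuracy)
print("sensitivity: %.3f" % sensitivity)
```

The baseline scores well on accuracy while catching none of the delinquent borrowers, which is why accuracy alone is a poor guide on data like this.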

In [None]:
# Check for missing values as well. These will present an issue later when we attempt to apply modelling frameworks,
# so we had better check now.
training.isnull().sum()

#### Observations:
> The training data has no missing values for any variable except the monthly income and the number of dependents. The monthly income variable will need to be treated with caution. Here we need to ask ourselves: what is the data generating process? The data would most likely be gathered from loan applications, where income would presumably be a mandatory field. So there is data loss which we will need to account for. As one would expect monthly income to have a strong relationship with the rate of delinquency, we will need to infer the monthly income somehow. The variable 'DebtRatio' is defined as "Monthly debt payments, alimony, living costs divided by monthly gross income". The data here is complete, so we may be able to use this variable to impute values for monthly income, based on the reported incomes of individuals with similar ratios.

> The number of dependents also presents an issue for a learning algorithm. Using some common sense, we could argue that the field may not have been filled out because the individual has no dependents, in which case every missing value would be "0". Before such a crude rule is applied, we may decide to cluster individuals to see if we can find commonalities.

***(nb) Some learners can handle missing data. For instance, classification and regression trees can deal with missing data by using surrogate splits. However, the scikit-learn implementation does not support this, so we will be explicitly controlling for missing data in all models.***
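
As a rough sketch of what explicit control for missing data could look like, the example below fills the two affected columns on a toy frame. Note the assumptions: the values are invented, the income fill uses a plain median rather than the DebtRatio-based imputation discussed above, and the `MonthlyIncome_missing` flag column is a hypothetical name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are invented, not taken from cs-training.csv.
toy = pd.DataFrame({
    'MonthlyIncome': [5000.0, np.nan, 3000.0, 4000.0, np.nan],
    'NumberOfDependents': [2.0, 0.0, np.nan, 1.0, np.nan],
})

# Record which incomes were originally missing, so a model can still use that signal.
toy['MonthlyIncome_missing'] = toy['MonthlyIncome'].isnull().astype(int)

# Simple stand-in for a more careful imputation: fill with the median of observed incomes.
income_median = toy['MonthlyIncome'].median()
toy['MonthlyIncome'] = toy['MonthlyIncome'].fillna(income_median)

# The crude rule from the discussion: treat a blank dependents field as zero dependents.
toy['NumberOfDependents'] = toy['NumberOfDependents'].fillna(0)

print(toy)
```

Keeping the indicator column means the fill does not silently erase the fact that a value was missing, which may itself be informative about the applicant.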

### Feature Analysis 

In [None]:
# Given all variables are numeric, we will look at the correlations in the data and histograms.
# training.corr(method='kendall')  # non-parametric method (nb: takes longer to run)
training.corr(method='pearson')  # simple linear correlation

### Visualisation

One of the key challenges of data analysis is extracting meaningful visualisations from the data. Considering that we are dealing with a classification problem and there are 10 features, we will consider how the data appears in a lower dimensional space. We'll perform singular value decomposition to identify the directions onto which we should perform orthogonal projections.
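
A minimal sketch of this idea, assuming a standardised feature matrix: the rows of `Vt` returned by `np.linalg.svd` are orthonormal directions, and projecting onto the first two gives a 2-D view of the data. The matrix `X` below is random data standing in for the 10 features, not the actual training set.

```python
import numpy as np

# Synthetic stand-in for the feature matrix: 200 observations, 10 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))

# Centre and scale each column so no single feature dominates the decomposition.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Thin SVD: X_std = U * diag(s) * Vt, with the rows of Vt as orthonormal directions.
U, s, Vt = np.linalg.svd(X_std, full_matrices=False)

# Orthogonal projection onto the two leading directions.
X_2d = X_std.dot(Vt[:2].T)

# Share of total variance captured along each direction.
explained = (s ** 2) / (s ** 2).sum()

print("projected shape:", X_2d.shape)
print("variance share of first two directions: %.3f" % explained[:2].sum())
```

The two columns of `X_2d` can then be plotted against each other, coloured by the target, to see whether the classes separate in the reduced space.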