In [5]:
import pipeline


In [6]:
df = pipeline.read_load('/home/erhla/Downloads/credit-data.csv')
pipeline.explore(df, ['PersonID','zipcode']) 

FileNotFoundError: [Errno 2] File b'/home/erhla/Downloads/credit-data.csv' does not exist: b'/home/erhla/Downloads/credit-data.csv'

## Load and Explore data

The above cell loads the data set (using pd.read_csv) and explores it. Exploration is facilitated by temporarily dropping descriptive columns before printing the DataFrame.describe() summary for each remaining column. From this output we can see that about 16.1% of entries in the database have the outcome SeriousDlqin2yrs. Additionally, variables such as revolving utilization, number of times past due, debt ratio, and monthly income show that most of the variables have some entries with severe financial distress (represented by the large gap between the 75% quantile and the maximum value).
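As a rough illustration of the exploration step, the sketch below drops the descriptive columns and summarizes what is left. The function name `explore_sketch` and the toy data are hypothetical; the actual `pipeline.explore` may do more than this.

```python
import pandas as pd

def explore_sketch(df, descriptive_cols):
    """Summarize a frame after temporarily dropping descriptive columns."""
    # drop() returns a copy, so the caller's DataFrame is left untouched
    numeric = df.drop(columns=descriptive_cols)
    return numeric.describe()

# Toy data standing in for the credit data set (values are made up)
toy = pd.DataFrame({
    'PersonID': [1, 2, 3, 4],
    'zipcode': [60601, 60618, 60625, 60637],
    'age': [30, 40, 50, 60],
})
summary = explore_sketch(toy, ['PersonID', 'zipcode'])
print(summary)
```

Comparing the `75%` and `max` rows of this summary is what reveals the long right tails discussed above.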

Additionally, some of these variables are correlated. The number of times past due 30-59 days, 60-89 days, and 90+ days are highly correlated with one another. A more thorough analysis might consider dropping some of these variables. Number of real estate loans is also correlated with number of open credit lines, though less strongly.
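One way to surface such correlated pairs is to scan the pairwise correlation matrix. This is a minimal sketch on made-up data; the column names and the 0.8 threshold are assumptions for illustration, not values taken from the pipeline.

```python
import pandas as pd

# Toy columns standing in for the past-due counts and open credit lines
toy = pd.DataFrame({
    'late_30_59': [0, 1, 2, 3, 4],
    'late_60_89': [0, 1, 2, 3, 5],
    'open_lines': [5, 3, 6, 2, 4],
})

corr = toy.corr()  # Pearson correlation between every pair of columns

threshold = 0.8  # hypothetical cutoff for "highly correlated"
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```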

Finally, by looking at values more than five standard deviations from each variable's mean, some entries are identified as possible outliers. Considering the nature of this data set (one capturing financial distress), these outliers were not removed, as they are indicative of the target outcome variable. Number of real estate loans had the most values flagged, at 154.
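The five-standard-deviation check can be sketched as a per-column z-score count. The helper `count_outliers` and the toy data are hypothetical, and the pipeline's own implementation may differ (for example in how it estimates the standard deviation).

```python
import pandas as pd

def count_outliers(df, n_std=5):
    """Count, per column, values more than n_std standard deviations from the mean."""
    z = (df - df.mean()) / df.std()  # column-wise z-scores (sample std, ddof=1)
    return (z.abs() > n_std).sum()

# 99 ordinary values plus one extreme value in 'loans'; 'age' has no extremes
toy = pd.DataFrame({
    'loans': [1] * 99 + [100],
    'age': list(range(100)),
})
counts = count_outliers(toy)
print(counts)
```

Note that a small sample cannot produce a z-score this large, which is why the toy frame uses 100 rows.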

Below is a simple percent comparison of attribute means between those with SeriousDlqin2yrs and those without.

In [None]:
no_df = df[df['SeriousDlqin2yrs'] == 0]
yes_df = df[df['SeriousDlqin2yrs'] == 1]

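# relative difference in attribute means, expressed as a fraction of the outcome = yes mean
print('relative difference in attribute means: (outcome = yes - outcome = no) / outcome = yes')
(yes_df.agg('mean') - no_df.agg('mean')) / yes_df.agg('mean')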

In [None]:
df = pipeline.preprocess(df, ['PersonID','zipcode']) 

In [None]:
df = pipeline.generate_features(df, 'SeriousDlqin2yrs', 'dummy')
df = pipeline.generate_features(df, 'MonthlyIncome', 'discretized', 10)
df['zipcode'] = df['zipcode'].astype('category')

## Preprocess and Generate Features

Preprocessing at this stage is simple: it merely fills NA values with each variable's median. The NA counts can be seen above.
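Median imputation of this kind can be sketched in a couple of lines of pandas. The toy frame below is made up, and this is only an assumption about what `pipeline.preprocess` does internally.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'MonthlyIncome': [1000.0, np.nan, 3000.0, 5000.0],
    'dependents': [0.0, 2.0, np.nan, 4.0],
})

na_counts = toy.isna().sum()         # missing values per column, before filling
filled = toy.fillna(toy.median())    # each NA replaced by its own column's median
print(na_counts)
print(filled)
```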

Generating features is more complex. generate_features replaces a column with a discretized or dummy version. Discretizing splits a variable into quantile bins (so MonthlyIncome takes 10 possible values, one bin per decile). Setting a variable as dummy simply converts the column into a categorical variable and updates the category name. Creating dummy variables for columns with more than two values is only partially implemented; it generates multiple new columns.
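For reference, quantile binning and multi-valued dummies can both be done with standard pandas calls. The sketch below uses `pd.qcut` and `pd.get_dummies` on made-up data; whether `generate_features` uses these same calls is an assumption, and the `grade` column is hypothetical.

```python
import pandas as pd

# Discretize: split a numeric column into 10 quantile bins (deciles)
income = pd.Series(range(1, 101), name='MonthlyIncome')
deciles = pd.qcut(income, q=10, labels=False)  # integer bin codes 0..9

# Dummies: one indicator column per category value
grade = pd.Series(['a', 'b', 'c', 'a'], name='grade')
dummies = pd.get_dummies(grade, prefix='grade')

print(deciles.value_counts().sort_index())
print(dummies)
```

Because the bins are quantile-based, each decile holds roughly the same number of rows regardless of how skewed the underlying variable is.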

In [None]:

model, x_test, y_test = pipeline.build_classifier(df, 'SeriousDlqin2yrs', 0.1, 4, 20)
pipeline.evaluate_classifier(model, x_test, y_test)

## Build Classifier and Evaluate

These stages of the pipeline build the model using sklearn.tree.DecisionTreeClassifier. Withholding ten percent of the data for testing, this block builds a decision tree with max_depth=4 and a minimum split size of 20 (min_samples_split in scikit-learn terms).
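A minimal sketch of what such a build-and-evaluate step might look like with scikit-learn directly is shown below. The synthetic data, the random seeds, and the mapping of the last argument to `min_samples_split` are assumptions; the real `pipeline.build_classifier` may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: the label depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# Hold out 10% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

model = DecisionTreeClassifier(max_depth=4, min_samples_split=20, random_state=0)
model.fit(x_train, y_train)

accuracy = model.score(x_test, y_test)  # fraction of test rows classified correctly
print(accuracy)
```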

Evaluated against the testing data, this model has an accuracy of about 95%. Below is an image of the decision tree produced. A key mapping the feature labels shown in the tree to column names is provided below the model.

In [None]:
#see https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176

from io import StringIO  # sklearn.externals.six was removed in scikit-learn 0.23
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()

export_graphviz(model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

In [None]:
# export_graphviz labels features by zero-based column index when feature_names is not given
for i, col in enumerate(x_test.columns):
    print(f'X_{i}', col)