# Assignment \#2: Simple Machine Learning Pipeline

Ben Fogarty  
University of Chicago, Harris School of Public Policy  
CAPP 30254: Machine Learning for Public Policy  
Spring 2019

## Project overview & requirements

This project's folder contains the following files:

- writeup.ipynb: the assignment write up
- pipeline_library.py: general functions for a machine learning pipeline (reading data, preprocessing data, generating features, building models, etc.)
- predict_financial.py: specific functions for applying the functions in pipeline_library to predicting who will experience financial distress within the next two years
- tree.pdf: a visualization of the decision tree generated in this model
- credit-data.csv: the dataset used for training and testing the tree predicting who will experience financial distress within the next two years
- data-dictionary.csv: dictionary describing the dataset in credit-data.csv
- hw2.pdf: the assignment statement

The project was developed using Python 3.7.3 on macOS Mojave 10.14.4. It requires the following libraries:

| Package        | Version     |
| :------------: | :---------: |
| graphviz       | 2.40.1      |
| pandas         | 0.24.2      |
| matplotlib     | 3.0.3       |
| numpy          | 1.16.2      |
| seaborn        | 0.9.0       |
| scikit-learn   | 0.20.3      |

Helpful documentation and references are cited throughout the docstrings of the code.

## Building a simple machine learning pipeline

All code for this portion of the project is located in the pipeline_library module. Excerpts from this module are included throughout.

### Read data

The pipeline_library module provides a function, read_csv, which imports CSV files into pandas dataframes. The user can optionally specify which columns to import from the CSV, what type each column should have in the resulting dataframe, and an index column. This function simply wraps the read_csv function provided by the pandas library.

```
def read_csv(filepath, cols=None, col_types=None, index_col=None):
    '''
    ...
    '''
    return pd.read_csv(filepath, usecols=cols, dtype=col_types, index_col=index_col)
```
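As a rough illustration of the three optional arguments, the sketch below calls pandas directly on a small in-memory CSV. The CSV text and its values are made up for the example; only the column names echo the credit dataset.

```python
import io

import pandas as pd

# Hypothetical two-row CSV held in memory instead of a file on disk.
csv_text = (
    "PersonID,age,zipcode,MonthlyIncome\n"
    "1,55,60601,0\n"
    "2,71,60637,15666\n"
)

# usecols limits the imported columns, dtype sets the column types,
# and index_col picks the column used as the row index.
toy_df = pd.read_csv(io.StringIO(csv_text),
                     usecols=['PersonID', 'age', 'zipcode'],
                     dtype={'age': float, 'zipcode': str},
                     index_col='PersonID')
print(toy_df)
```

Reading zipcode as a string keeps it from being treated as a number later in the pipeline.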

### Explore data

The pipeline_library module also provides a suite of functions for exploratory data analysis. The first, show_distribution, returns a histogram and box plot for pandas series with a numeric type, and a bar plot for series with a non-numeric type.

```
def show_distribution(series):
    '''
    ...
    '''
    sns.set()
    if pd.api.types.is_numeric_dtype(series):
        f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
        sns.distplot(series, kde=False, ax=ax1)
        sns.boxplot(x=series, ax=ax2, orient='h')
        ax1.set_title('Histogram')
        ax1.set_ylabel('Count')
        ax1.set_xlabel('')
        ax2.set_title('Box plot')
    else:
        f, ax = plt.subplots(1, 1)
        val_counts = series.value_counts()
        sns.barplot(x=val_counts.index, y=val_counts.values, ax=ax)
        ax.set_ylabel('Count')

    f.suptitle('Distribution of {}'.format(series.name))
    f.subplots_adjust(hspace=.5, wspace=.5)

    return f
```

The next, pw_correlate, calculates a table of pairwise correlations between numeric type variables. The user can optionally specify which variables to include pairwise correlations for, and enable visualization. If visualization is enabled, the function also generates a heat map to help the user identify strong correlations.

```
def pw_correlate(df, variables=None, visualize=False):
    '''
    ...
    '''
    if not variables:
        variables = [col for col in df.columns
                         if pd.api.types.is_numeric_dtype(df[col])]

    corr_table = np.corrcoef(df[variables].dropna(), rowvar=False)
    corr_table = pd.DataFrame(corr_table, index=variables, columns=variables)

    if visualize:
        sns.set()
        f, ax = plt.subplots(figsize=(8, 6))
        sns.heatmap(corr_table, annot=True, annot_kws={"size": 'small'}, 
                    fmt='.2f', linewidths=0.5, vmin=-1, vmax=1, square=True,
                    cmap='coolwarm', ax=ax)

        labels = ['-\n'.join(wrap(l.get_text(), 16)) for l in ax.get_yticklabels()]
        ax.set_yticklabels(labels)
        labels = ['-\n'.join(wrap(l.get_text(), 16)) for l in ax.get_xticklabels()]
        ax.set_xticklabels(labels)
        ax.tick_params(axis='both', rotation=0, labelsize='small')
        ax.tick_params(axis='x', rotation=90, labelsize='small')

        ax.set_title('Correlation Table')
        f.tight_layout()
        f.show()

    return corr_table
```

The function summarize_data provides summary statistics over numeric data columns. By default, the function summarizes all numeric columns; however, the user can restrict the summary statistics to certain numeric columns using the agg_cols keyword argument. The user can also choose to group observations based on one or more categorical variables and then compute summaries over each group. The aggregating functions are count, mean, standard deviation, min, 25th percentile, median (50th percentile), 75th percentile, and max.

```
def summarize_data(df, grouping_vars=None, agg_cols=None):
    '''
    ...
    '''
    if agg_cols:
        df = df[agg_cols]
    
    if grouping_vars:
        summary = df.groupby(grouping_vars)\
                    .describe()
    else:
        summary = df.describe()

    return summary.transpose()
```
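To make the shape of the grouped output concrete, here is a minimal sketch on a made-up four-row dataframe. It calls pandas directly rather than summarize_data, and shows how one statistic can be pulled out of the transposed result.

```python
import pandas as pd

# Hypothetical toy data: two zipcodes with two people each.
toy = pd.DataFrame({
    'zipcode': ['60601', '60601', '60637', '60637'],
    'age': [40.0, 60.0, 30.0, 50.0],
    'DebtRatio': [0.2, 0.4, 0.1, 0.5],
})

# Grouped describe, transposed: rows are (variable, statistic) pairs
# and columns are the groups.
by_zip = toy.groupby('zipcode').describe().transpose()

# Select a single statistic (here the mean) for every variable.
means = by_zip.xs('mean', axis=0, level=1)
print(means)
```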

The final function for exploratory data analysis, find_outliers, relies on a helper function find_outliers_univariate. The find_outliers function identifies the outliers in each numeric column of a dataframe, then records the number and percent of evaluated columns for which each observation is an outlier. 

The return value is a dataframe in which each row contains booleans describing whether the associated row in the passed-in dataframe is considered an outlier for each numeric column, along with the number and percent of evaluated columns for which the associated row is considered an outlier.

For the purposes of this analysis, an outlier is defined as an observation falling more than 1.5x the interquartile range below the 25th percentile value of a variable or more than 1.5x the interquartile range above the 75th percentile value of a variable. 

Optionally, the user can exclude certain columns from this procedure with the keyword argument excluded.

```
def find_ouliers_univariate(series):
    '''
    ...
    '''
    quartiles = np.quantile(series.dropna(), [0.25, 0.75])
    iqr = quartiles[1] - quartiles[0]
    lower_bound = quartiles[0] - 1.5 * iqr
    upper_bound = quartiles[1] + 1.5 * iqr

    return (lower_bound > series) | (upper_bound < series)

def find_outliers(df, excluded=None):
    '''
    ...
    '''
    if not excluded:
        excluded = []

    numeric_cols = list(df.select_dtypes(include=[np.number]).columns)

    outliers = df[numeric_cols]\
                 .drop(excluded, axis=1, errors='ignore')\
                 .apply(find_ouliers_univariate, axis=0)
    outliers['Count Outlier'] = outliers.sum(axis=1, numeric_only=True)
    outliers['% Outlier'] = (outliers['Count Outlier'] /
                             (len(outliers.columns) - 1) * 100)

    return outliers
```
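The 1.5x interquartile range rule can be checked by hand on a small example. The sketch below applies the same bounds to a made-up series; the values are hypothetical and chosen so that exactly one of them is extreme.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: nine typical values and one extreme value.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name='toy')

# Quartiles and interquartile range, ignoring missing values.
q1, q3 = np.quantile(values.dropna(), [0.25, 0.75])
iqr = q3 - q1

# Observations beyond 1.5 x IQR from either quartile count as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
is_outlier = (values < lower_bound) | (values > upper_bound)

print(lower_bound, upper_bound)
print(values[is_outlier].tolist())
```

For a binary column whose 75th percentile is 0, the interquartile range is 0, so every nonzero value is flagged under this rule.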

### Preprocess Data

The preprocess_data function in the pipeline_library module also relies on a helper function, replace_missing. At this time, the only preprocessing step is to replace any missing values in the dataframe. The replace_missing function takes one column of a dataframe, in the form of a series, as its input. It then determines whether the series contains numeric type data, and if so, replaces the missing values in the series with the median value of the series. If the series does not contain numeric type data, the data is assumed to be unordered categorical data, and the function replaces missing values with the modal value of the series, since a median cannot be calculated for unordered categorical data. The preprocess_data function applies this algorithm for replacing missing data to all the columns of a given dataframe.

```
def replace_missing(series):
    '''
    ...
    '''
    if pd.api.types.is_numeric_dtype(series):
        median = np.median(series.dropna())
        return series.fillna(median)
    else:
        mode = series.mode().iloc[0]
        return series.fillna(mode)

def preprocess_data(df):
    '''
    ...
    '''
    return df.apply(replace_missing, axis=0)
```
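A small worked example may help show the two imputation rules side by side. The dataframe below is made up, and fill_column is a hypothetical stand-in written for this sketch, not the module's own function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    'income': [1000.0, 3000.0, np.nan, 8000.0],
    'zipcode': ['60601', None, '60601', '60637'],
})


def fill_column(series):
    """Median for numeric columns, most frequent value otherwise."""
    if pd.api.types.is_numeric_dtype(series):
        return series.fillna(series.median())
    return series.fillna(series.mode().iloc[0])


filled = toy.apply(fill_column, axis=0)
print(filled)
```

The missing income becomes 3000.0 (the median of the three observed values) and the missing zipcode becomes '60601' (the most frequent value).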

### Generate features/predictors

To discretize a continuous variable, the pipeline_library module provides the cut_variable function. This function takes in a single column of a dataframe (in the form of a pandas series) and returns that column discretized into bins. The user can either specify a list of "edges" for the bins (for example, \[0, 0.5, 1.0\] would create the bins \[0, 0.5\] and (0.5, 1.0\], since pd.cut closes intervals on the right by default and include_lowest adds the lowest edge to the first bin) or a number of bins n, which creates approximately n bins with approximately the same number of observations (there may be slightly fewer bins or some bins with significantly more observations depending on the exact distribution of the data). The user can also specify labels for the bins.

```
def cut_variable(series, bins, labels=None):
    '''
    ...
    '''
    if type(bins) is int:
        return pd.qcut(series, bins, labels=labels, duplicates='drop')\
                 .astype('category')

    return pd.cut(series, bins, labels=labels, include_lowest=True)\
             .astype('category')
```
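The two modes behave differently enough that a quick check is useful. The sketch below calls pd.cut and pd.qcut directly with the same arguments the function passes; both toy series are made up for illustration.

```python
import pandas as pd

# Explicit edges: intervals are closed on the right, and include_lowest=True
# lets the first bin also contain its left edge.
scores = pd.Series([0.0, 0.2, 0.5, 0.7, 1.0])
by_edges = pd.cut(scores, [0, 0.5, 1.0], labels=['low', 'high'],
                  include_lowest=True)
print(by_edges.tolist())

# Integer bin count: edges are quantiles. With many repeated values, some
# quantile edges coincide and duplicates='drop' merges them, so fewer
# bins come back than were requested.
skewed = pd.Series([0, 0, 0, 0, 1, 2, 3, 4])
by_quantile = pd.qcut(skewed, 4, duplicates='drop')
print(by_quantile.cat.categories)
```

Here four bins are requested but only three are returned, because the 0th and 25th percentiles are both 0. Passing labels together with duplicates='drop' can fail in that situation, since the number of labels no longer matches the number of bins.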

To convert a categorical variable into a set of dummy variables, the pipeline_library module provides another custom function, create_dummies. This function takes in a dataframe and the name of the column to create dummies from, and returns a new dataframe with the categorical column removed and the new dummy columns appended to the end of the dataframe.

The pandas library also provides a function to convert categorical variables to dummies, pd.get_dummies. I chose to write a custom function, however, because I was dissatisfied with how the pandas library encodes missing data (it makes all dummies false where the categorical column is NA, whereas the function in pipeline_library makes dummies with missing values where the categorical column value is missing). I dislike the pandas approach because this may falsely convey that we have some knowledge that is not actually reflected by the data.

```
def create_dummies(df, column):
    '''
    ...
    '''
    col = df[column]
    values = list(col.value_counts().index)
    output = df.drop(column, axis=1)
    for value in values:
        dummy_name = '{}_{}'.format(column, value)
        output[dummy_name] = (col == value)
        output.loc[col.isnull(), dummy_name] = float('nan')

    return output
```
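The difference in missing-value handling is easiest to see on a tiny example. The sketch below compares pd.get_dummies with a simplified re-implementation of the approach described above; the three-row dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing category.
toy = pd.DataFrame({'zipcode': ['60601', '60637', np.nan]})

# pandas default: the row with a missing zipcode gets all-zero dummies.
pandas_dummies = pd.get_dummies(toy, columns=['zipcode'])

# Approach described above: carry the missing value into every dummy.
custom = pd.DataFrame(index=toy.index)
for value in toy['zipcode'].dropna().unique():
    dummy = (toy['zipcode'] == value).astype(float)
    dummy[toy['zipcode'].isnull()] = np.nan
    custom['zipcode_{}'.format(value)] = dummy

print(pandas_dummies)
print(custom)
```

pandas also offers a dummy_na option on get_dummies, which adds a separate indicator column for missing values instead of propagating them.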

### Build Classifier

The pipeline designed for this project can be used to generate a Decision Tree Classifier using the generate_decision_tree function. The function takes in training data in the form of a pandas dataframe of features, and a pandas series of labels for the same observations, and optionally an instance of sklearn.tree.DecisionTreeClassifier. 

By default, the decision tree generated by this function uses all the default values specified in the sklearn.tree.DecisionTreeClassifier object ([see the documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)). The optional decision tree parameter allows for the user to customize the properties of the DecisionTreeClassifier by instantiating a sklearn.tree.DecisionTreeClassifier and passing it to the function.

```
def generate_decision_tree(features, target, dt=None):
    '''
    ...
    '''
    if not dt:
        dt = tree.DecisionTreeClassifier()

    return dt.fit(features, target)
```

### Evaluating Classifier

A simple function, score_decision_tree, returns the mean accuracy of a decision tree when used to predict the target attribute for a set of observations for which the real value of the target attribute is known. The function takes in a decision tree, a set of testing data in the form of a pandas dataframe of features (in the same order as the data on which the tree was trained), and a pandas series of the target attribute values for the same observations.

```
def score_decision_tree(dt, test_features, test_target):
    '''
    ...
    '''
    return dt.score(test_features, test_target)
```

### Visualizing Classifier

Lastly, the function visualize_decision_tree saves and opens a PDF containing a visual representation of a DecisionTreeClassifier. For this function, the user must provide a decision tree, a list of feature names in the same order as the data on which the tree was trained, and a list of class names for the target attribute that the tree predicts. Optionally, the user may also specify an output path for the generated PDF.

```
def visualize_decision_tree(dt, feature_names, class_names, filepath='tree'):
    '''
    ...
    '''
    dot_data = tree.export_graphviz(dt, None, feature_names=feature_names, 
                                  class_names=class_names, filled=True)
    graph = graphviz.Source(dot_data)
    output_path = graph.render(filename=filepath, view=True)
```

## Applying the simple machine learning pipeline

All the code for this portion of the project is in the predict_financial module, and can be run from the command line with the following command:

```
python3 predict_financial.py credit-data.csv
```
For the purposes of this write-up, however, I will step through the code in that module with added explanations at some points.

Before we begin, we load in the pipeline_library module with the alias pl and set up matplotlib to output to the notebook properly.

In [1]:
%matplotlib notebook
import pipeline_library as pl

### Reading in the data

The data required for this project is located in a file in the root of the project directory titled credit-data.csv. We load it in using the function from the pipeline created in the first section.

Based on the data dictionary, we specify that all columns except the zipcode column be loaded in as floats since they either represent numeric data or represent booleans that may contain missing data. The zipcode column is read in as a string because it contains categorical data. PersonID is specified as the index column.

In [2]:
col_types = {'SeriousDlqin2yrs': float,
             'RevolvingUtilizationOfUnsecuredLines': float,
             'age': float,
             'NumberOfTime30-59DaysPastDueNotWorse': float,
             'zipcode': str,
             'DebtRatio': float,
             'MonthlyIncome': float,
             'NumberOfOpenCreditLinesAndLoans': float,
             'NumberOfTimes90DaysLate': float,
             'NumberRealEstateLoansOrLines': float,
             'NumberOfTime60-89DaysPastDueNotWorse': float,
             'NumberOfDependents': float}
df = pl.read_csv('credit-data.csv', col_types=col_types, index_col='PersonID')

Looking at the first five rows and one individual row from the dataframe we just loaded in:

In [3]:
df.head(5)

| PersonID | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | zipcode | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 98976 | 0.0 | 1.0 | 55.0 | 60601 | 0.0 | 505.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 98991 | 0.0 | 0.547745 | 71.0 | 60601 | 0.0 | 0.459565 | 15666.0 | 7.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 99012 | 0.0 | 0.04428 | 51.0 | 60601 | 0.0 | 0.01452 | 4200.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 99023 | 0.0 | 0.914249 | 55.0 | 60601 | 4.0 | 0.794875 | 9052.0 | 12.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 99027 | 0.0 | 0.026599 | 45.0 | 60601 | 0.0 | 0.049966 | 10406.0 | 4.0 | 0.0 | 0.0 | 0.0 | 2.0 |


In [4]:
df.iloc[0]

SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        1
age                                        55
zipcode                                 60601
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                 505
MonthlyIncome                               0
NumberOfOpenCreditLinesAndLoans             2
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDependents                          0
Name: 98976, dtype: object

Each row has its person ID as the index/name, and each row contains 12 columns/variables. Since our goal is to predict financial delinquency within the next two years, we anticipate that our target attribute will be SeriousDlqin2yrs.

### Explore Data

Before building our model, we first explore our data, looking for the distribution of each variable, any correlations, notable outliers, and other summaries.

To begin, we use the summarize_data function from our pipeline to generate various aggregations of each numeric type variable in our dataset, ignoring missing values.

In [5]:
summary = pl.summarize_data(df)

In [6]:
summary

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SeriousDlqin2yrs | 41016.0 | 0.1614 | 0.367904 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| RevolvingUtilizationOfUnsecuredLines | 41016.0 | 6.37587 | 221.61895 | 0.0 | 0.03431 | 0.18973 | 0.66716 | 22000.0 |
| age | 41016.0 | 51.683489 | 14.74688 | 21.0 | 41.0 | 51.0 | 62.0 | 109.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 41016.0 | 0.589233 | 5.205628 | 0.0 | 0.0 | 0.0 | 0.0 | 98.0 |
| DebtRatio | 41016.0 | 331.458137 | 1296.109695 | 0.0 | 0.176375 | 0.369736 | 0.866471 | 106885.0 |
| MonthlyIncome | 33042.0 | 6578.995733 | 13446.82593 | 0.0 | 3333.0 | 5250.0 | 8055.75 | 1794060.0 |
| NumberOfOpenCreditLinesAndLoans | 41016.0 | 8.403477 | 5.207324 | 0.0 | 5.0 | 8.0 | 11.0 | 56.0 |
| NumberOfTimes90DaysLate | 41016.0 | 0.419592 | 5.190382 | 0.0 | 0.0 | 0.0 | 0.0 | 98.0 |
| NumberRealEstateLoansOrLines | 41016.0 | 1.008801 | 1.153826 | 0.0 | 0.0 | 1.0 | 2.0 | 32.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 41016.0 | 0.371587 | 5.169641 | 0.0 | 0.0 | 0.0 | 0.0 | 98.0 |


Looking at these summaries, the first thing of note is the relative infrequency of serious financial distress (as it is defined by our target variable) among our observations. Only 16% of the individuals in the dataset experienced 90 days past due delinquency or worse within the two-year window.

Next, I notice that the maximum of the revolving utilization of unsecured lines variable is 22,000, despite the variable being described as a percentage in the data dictionary. This value is surprising because this variable is the ratio of an individual's total balance on certain personal lines of credit to the sum of the credit limits on these lines, so it seems that this value should be bounded closer to 1. It's possible that this is a miscoding; however, it may also be that an individual had numerous credit lines revoked, causing their revolving utilization to become very high. It'll be important to assess whether there are other similarly high values when we show the distribution of that data.

Other noteworthy takeaways from this data summary include that the monthly income variable is highly spread out, with a standard deviation of 13,447 (more than twice its mean of 6,579). In contrast, many of the other variables describing an individual's credit history or current credit usage are tightly distributed (having minimum values and 75th percentiles that are very close to one another) but have one or more substantial outliers (visible through extremely high maximum values). Examples of variables demonstrating this include NumberOfTime30-59DaysPastDueNotWorse, NumberOfOpenCreditLinesAndLoans, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, and NumberOfTime60-89DaysPastDueNotWorse.

Next, we use the pw_correlate function in our pipeline with the visualize option enabled to see a correlation matrix between all the numeric variables in our data.

In [7]:
correlate = pl.pw_correlate(df, visualize=True)

In [8]:
correlate

| Variable | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SeriousDlqin2yrs | 1.0 | -0.006503 | -0.152462 | 0.144919 | -0.004334 | -0.03281 | -0.034495 | 0.129647 | -0.004545 | 0.108975 | 0.065389 |
| RevolvingUtilizationOfUnsecuredLines | -0.006503 | 1.0 | -0.008925 | -0.00163 | -0.001146 | 0.005832 | -0.014338 | -0.001594 | 0.002381 | -0.0012 | 0.010279 |
| age | -0.152462 | -0.008925 | 1.0 | -0.054464 | 0.003723 | 0.048138 | 0.196067 | -0.055686 | 0.078648 | -0.049221 | -0.204317 |
| NumberOfTime30-59DaysPastDueNotWorse | 0.144919 | -0.00163 | -0.054464 | 1.0 | -0.001386 | -0.015224 | -0.054963 | 0.97548 | -0.028853 | 0.981599 | 0.002203 |
| DebtRatio | -0.004334 | -0.001146 | 0.003723 | -0.001386 | 1.0 | -0.022988 | 0.01479 | -0.002675 | 0.015829 | -0.001696 | -0.001443 |
| MonthlyIncome | -0.03281 | 0.005832 | 0.048138 | -0.015224 | -0.022988 | 1.0 | 0.1071 | -0.017954 | 0.127313 | -0.015336 | 0.060528 |
| NumberOfOpenCreditLinesAndLoans | -0.034495 | -0.014338 | 0.196067 | -0.054963 | 0.01479 | 0.1071 | 1.0 | -0.087372 | 0.437245 | -0.07376 | 0.03079 |
| NumberOfTimes90DaysLate | 0.129647 | -0.001594 | -0.055686 | 0.97548 | -0.002675 | -0.017954 | -0.087372 | 1.0 | -0.048272 | 0.987877 | -0.005589 |
| NumberRealEstateLoansOrLines | -0.004545 | 0.002381 | 0.078648 | -0.028853 | 0.015829 | 0.127313 | 0.437245 | -0.048272 | 1.0 | -0.040207 | 0.106386 |
| NumberOfTime60-89DaysPastDueNotWorse | 0.108975 | -0.0012 | -0.049221 | 0.981599 | -0.001696 | -0.015336 | -0.07376 | 0.987877 | -0.040207 | 1.0 | -0.007141 |


Looking at this correlation table and the associated visualization, the most noteworthy results are the strong positive correlations between the different variables denoting the number of times a borrower has been past due, with the pairwise correlations between NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, and NumberOfTimes90DaysLate all above 0.97. Relative to the rest of the dataset, there also appears to be a strong positive correlation between the number of real estate loans/credit lines and the number of open credit lines and loans, with a correlation coefficient of 0.44.

Looking at our target attribute, SeriousDlqin2yrs, no other variable has a particularly strong correlation with it; the largest in magnitude are age (-0.15) and NumberOfTime30-59DaysPastDueNotWorse (0.14).

Since zipcode is a categorical variable, it was not included in the pairwise correlations table. Instead, we consider it separately by grouping the observations by zipcode and displaying various summary statistics. For simplicity, we limit the output to the means of each variable.

In [9]:
by_zip = pl.summarize_data(df, grouping_vars='zipcode')

In [10]:
by_zip.xs('mean', axis=0, level=1)

| zipcode | 60601 | 60618 | 60625 | 60629 | 60637 | 60644 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SeriousDlqin2yrs | 0.169753 | 0.176406 | 0.16956 | 0.166744 | 0.180516 | 0.0 |
| RevolvingUtilizationOfUnsecuredLines | 5.426326 | 3.533973 | 5.848579 | 6.410459 | 10.909085 | 7.113789 |
| age | 51.650848 | 51.623026 | 51.720707 | 51.498529 | 51.426503 | 52.840945 |
| NumberOfTime30-59DaysPastDueNotWorse | 0.528551 | 0.665824 | 0.62531 | 0.554265 | 0.699559 | 0.181496 |
| DebtRatio | 326.090201 | 351.840384 | 318.77168 | 319.513465 | 355.413892 | 329.127883 |
| MonthlyIncome | 6939.742488 | 6414.652413 | 6506.113519 | 6437.078613 | 6389.541109 | 7274.661621 |
| NumberOfOpenCreditLinesAndLoans | 8.569784 | 8.338598 | 8.398171 | 8.329463 | 8.330973 | 8.540945 |
| NumberOfTimes90DaysLate | 0.344951 | 0.499368 | 0.464585 | 0.380709 | 0.510072 | 0.053543 |
| NumberRealEstateLoansOrLines | 1.002023 | 1.005843 | 1.017746 | 0.988233 | 1.017312 | 1.018898 |
| NumberOfTime60-89DaysPastDueNotWorse | 0.306208 | 0.440145 | 0.408478 | 0.336585 | 0.465376 | 0.033071 |


Based on the group means, individuals in the dataset living in the 60644 zipcode appear to have a much more secure financial status than those living in the other zipcodes in the dataset. None of the individuals in the 60644 zipcode experienced 90 days past due delinquency or worse (the mean of SeriousDlqin2yrs is 0), and the mean counts for every past-due status are substantially lower among these observations than among those from other zipcodes.

Among the other zipcodes, there is some variation in these measures of financial health; however, none of these differences is as substantial as those between 60644 and each of the other zipcodes.

Next, we loop over all the variables in the dataset and generate figures showing their distributions with the pipeline's show_distribution function. For numeric type variables, we generate two charts, a histogram and a box plot. For all other variables, we generate a simple bar chart.

In [11]:
for var in df.columns:
    pl.show_distribution(df[var].dropna()).show()

Looking through these plots, many variables, particularly those representing credit history or current credit characteristics, have extremely long right tails. To avoid overfitting, it may be a good idea to discretize these variables before training the model, perhaps by percentiles, or in a binary fashion denoting those who have or have not experienced a certain past due status within the previous two years. For some variables such as monthly income, which actually display quite a bit of variance that is obscured by the long right tails in these graphs, it may also be appropriate to create bins by percentiles.

Debt ratio has numerous observations with different extremely high values, which suggests that these values are not miscodings. This variable, with its long right tail, may also be a good candidate for discretizing.

Finally, we generate a table summarizing how many numeric columns each observation is an outlier in and how many outliers there are for each numeric variable.

In [12]:
outliers = pl.find_outliers(df)

In [13]:
outliers['Count Outlier'].value_counts().sort_index()

0    18915
1    13640
2     4719
3     2287
4     1143
5      288
6       23
7        1
Name: Count Outlier, dtype: int64

In [14]:
outliers.drop(['Count Outlier', '% Outlier'], axis=1).sum().sort_values()

age                                       45
RevolvingUtilizationOfUnsecuredLines     207
NumberRealEstateLoansOrLines             255
NumberOfOpenCreditLinesAndLoans         1086
MonthlyIncome                           1392
NumberOfTime60-89DaysPastDueNotWorse    3028
NumberOfTimes90DaysLate                 3430
NumberOfDependents                      3726
SeriousDlqin2yrs                        6620
NumberOfTime30-59DaysPastDueNotWorse    7934
DebtRatio                               8373
dtype: int64

Overall, a majority of our datapoints (22,101 of 41,016) are considered outliers in one or more of the numeric variables in our dataset. The variable with the fewest outliers is age, followed by revolving utilization of unsecured lines, and then number of real estate loans/lines. The variable with the greatest number of outliers is debt ratio, followed by the number of times 30-59 days past due, and whether a person had been 90 days past due delinquent or worse. Because that last variable is binary and its 75th percentile is 0, every positive case is flagged as an outlier, which is why its count of 6,620 matches the 16% positive rate.

These findings are not entirely surprising given the distribution graphs, which showed that most measures of credit history had extremely long right tails, and the correlation table, which showed strong correlations between variables describing credit history/status.

### Preprocess Data

To replace the missing data with the median value for numeric variables and the modal value for non-numeric variables, we simply use the preprocess_data function from our pipeline.

In [15]:
df = pl.preprocess_data(df)

In [16]:
df.isna().sum()

SeriousDlqin2yrs                        0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
zipcode                                 0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64

As we can see above, our dataset now contains no missing values.

### Generate Features

To generate features, we will use two functions from the pipeline, cut_variable and create_dummies. The first will be used to bin/discretize continuous variables, and the second will be used to convert categorical variables into numeric type dummy variables so that they are compatible with scikit-learn.

The first step is to convert the categorical variable representing zipcodes into dummy variables.

In [17]:
df = pl.create_dummies(df, 'zipcode')

In [18]:
list(df.columns)

['SeriousDlqin2yrs',
 'RevolvingUtilizationOfUnsecuredLines',
 'age',
 'NumberOfTime30-59DaysPastDueNotWorse',
 'DebtRatio',
 'MonthlyIncome',
 'NumberOfOpenCreditLinesAndLoans',
 'NumberOfTimes90DaysLate',
 'NumberRealEstateLoansOrLines',
 'NumberOfTime60-89DaysPastDueNotWorse',
 'NumberOfDependents',
 'zipcode_60625',
 'zipcode_60629',
 'zipcode_60601',
 'zipcode_60637',
 'zipcode_60618',
 'zipcode_60644']

We can now see that the zipcodes category has been dropped and replaced with dummies for each zipcode in the set.

Next, I'd like to discretize the monthly income variable into approximately 20 groups (each containing approximately 5% of the observations) and then turn that categorical variable into a set of dummies.

In [19]:
df.MonthlyIncome = pl.cut_variable(df.MonthlyIncome, 20)
df = pl.create_dummies(df, 'MonthlyIncome')

Next, I cut some of the other continuously-valued variables describing a person's credit history and current credit characteristics. In particular, I intend to cut revolving utilization of unsecured lines by deciles, and to cut the number of times 30-59, 60-89, and 90 days past due into indicators of whether or not an individual has experienced these forms of financial distress within the past two years.

These cuts are intended to help reduce overfitting within the model.

In [20]:
#RevolvingUtilizationOfUnsecuredLines
df.RevolvingUtilizationOfUnsecuredLines = pl.cut_variable(df.RevolvingUtilizationOfUnsecuredLines, 10)
df = pl.create_dummies(df, 'RevolvingUtilizationOfUnsecuredLines')

#NumberOfTime30-59DaysPastDueNotWorse
df['NumberOfTime30-59DaysPastDueNotWorse'] = pl.cut_variable(df['NumberOfTime30-59DaysPastDueNotWorse'], [0, 1, float('inf')], labels=['Zero', 'One or more'])
df = pl.create_dummies(df, 'NumberOfTime30-59DaysPastDueNotWorse')

#NumberOfTime60-89DaysPastDueNotWorse
df['NumberOfTime60-89DaysPastDueNotWorse'] = pl.cut_variable(df['NumberOfTime60-89DaysPastDueNotWorse'], [0, 1, float('inf')], labels=['Zero', 'One or more'])
df = pl.create_dummies(df, 'NumberOfTime60-89DaysPastDueNotWorse')

#NumberOfTimes90DaysLate
df['NumberOfTimes90DaysLate'] = pl.cut_variable(df['NumberOfTimes90DaysLate'], [0, 1, float('inf')], labels=['Zero', 'One or more'])
df = pl.create_dummies(df, 'NumberOfTimes90DaysLate')
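One detail worth checking in the cuts above: pd.cut builds right-closed intervals by default, so with the edges [0, 1, inf] the bins are [0, 1] and (1, inf], and a count of exactly 1 lands in the first bin together with 0. The sketch below shows this on a made-up series and, as one possible alternative (an assumption, not part of the original pipeline), builds the indicator with a direct comparison.

```python
import pandas as pd

# Hypothetical past-due counts.
counts = pd.Series([0, 1, 2, 5])

# Same edges and labels as the cells above, passed straight to pd.cut
# with include_lowest=True as cut_variable does.
right_closed = pd.cut(counts, [0, 1, float('inf')],
                      labels=['Zero', 'One or more'], include_lowest=True)
print(right_closed.tolist())

# Alternative sketch: compare directly, so a count of 1 is "One or more".
direct = (counts >= 1).map({False: 'Zero', True: 'One or more'})
print(direct.tolist())
```

With these edges, the feature effectively separates "zero or one" from "two or more" occurrences.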

### Creating the decision tree

Having finished exploring and preprocessing our data and generating our features, we are now ready to create our decision tree using the generate_decision_tree function from the pipeline. As an additional measure to prevent overfitting, I plan to limit the maximum depth of the generated decision tree. In total, we have 11 distinct variables to predict with. I will set a somewhat arbitrary limit of 15 as the maximum depth of the tree, since a number of our variables were categorical or discretized into categories with more than two values. (One possible future improvement would be writing a function to loop over different maximum depths.)

Because we want to customize the properties of the tree, we first need to create a new sklearn.tree.DecisionTreeClassifier object. In addition to setting the max depth, we also set the splitting criterion to information gain, as seen in class.

In [21]:
from sklearn import tree
dt = tree.DecisionTreeClassifier(criterion='entropy', max_depth=15)
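The possible improvement of looping over maximum depths could look roughly like the sketch below. The helper compare_depths is hypothetical and not part of pipeline_library, and the data is synthetic (generated with scikit-learn's make_classification), so the scores say nothing about the credit dataset.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def compare_depths(X_train, y_train, X_test, y_test, depths):
    """Hypothetical helper: fit one tree per depth, record held-out accuracy."""
    scores = {}
    for depth in depths:
        candidate = tree.DecisionTreeClassifier(criterion='entropy',
                                                max_depth=depth,
                                                random_state=0)
        candidate.fit(X_train, y_train)
        scores[depth] = candidate.score(X_test, y_test)
    return scores


# Synthetic stand-in data (not the credit dataset), with 11 features.
X, y = make_classification(n_samples=500, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

depth_scores = compare_depths(X_train, y_train, X_test, y_test,
                              [1, 3, 5, 10, 15])
best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores, best_depth)
```

Scoring each depth on held-out data rather than on the training data matters here, because training accuracy generally keeps rising as the tree gets deeper.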

Next, we break the larger dataframe into a dataframe containing all our feature columns and a series containing the target attribute.

In [22]:
features = df.drop('SeriousDlqin2yrs', axis=1)
target = df.SeriousDlqin2yrs

We are now ready to actually generate our decision tree.

In [23]:
dt = pl.generate_decision_tree(features, target, dt=dt)

We can visualize this tree using the visualize_decision_tree function from the pipeline. We pass in the column names of our training dataframe as the feature names and a list of descriptive labels for the two values of our target attribute as the class names.

In [24]:
pl.visualize_decision_tree(dt, features.columns, ['No Financial Distress', 'Financial Distress'])

This visualization is available in the tree.pdf file in the root of the project directory.

### Evaluating Our Classifier

Finally, we want to evaluate the accuracy of our classifier. Though it is poor practice, this assignment specifically allows us to use the same data to train and test our decision tree. We thus apply the score_decision_tree function from our pipeline to our previous features and target objects. The output of this function is the proportion of observations for which the target attribute was accurately predicted.

In [25]:
pl.score_decision_tree(dt, features, target)

0.8960405695338405

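Given that this decision tree is scored using its training data, it is unsurprising that the accuracy is so high. For context, always predicting no financial distress would already be correct for about 84% of observations, since only 16% of individuals in the dataset are positive cases.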
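Outside the constraints of this assignment, a more informative evaluation would hold out part of the data for testing. The sketch below illustrates the idea on synthetic, imbalanced data generated with scikit-learn; the data, split size, and class weights are assumptions chosen for illustration and do not reproduce any result from the credit dataset.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 16% positive cases.
X, y = make_classification(n_samples=1000, n_features=11,
                           weights=[0.84, 0.16], random_state=0)

# Hold out 30% of the observations, keeping the class balance in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

dt = tree.DecisionTreeClassifier(criterion='entropy', max_depth=15,
                                 random_state=0)
dt.fit(X_train, y_train)

train_accuracy = dt.score(X_train, y_train)
test_accuracy = dt.score(X_test, y_test)

# Accuracy of always predicting the most common class in the test set.
baseline_accuracy = max(np.mean(y_test == 0), np.mean(y_test == 1))
print(train_accuracy, test_accuracy, baseline_accuracy)
```

Comparing the held-out accuracy against the majority-class baseline gives a clearer picture of how much the tree has actually learned than the training accuracy alone.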