# Assignment \#2: Simple Machine Learning Pipeline

Ben Fogarty  
University of Chicago, Harris School of Public Policy  
CAPP 30254: Machine Learning for Public Policy  
Spring 2019

## Project overview & requirements

This project's folder contains the following files:

- writeup.ipynb: the assignment write up
- pipeline_library.py: general functions for a machine learning pipeline (reading data, preprocessing data, generating features, building models, etc.)
- predict_financial.py: specific functions for applying the functions in pipeline_library to predicting who will experience financial distress within the next two years
- tree.pdf: a visualization of the decision tree generated in this model
- credit-data.csv: the dataset used for training and testing the tree predicting who will experience financial distress within the next two years
- data-dictionary.csv: dictionary describing the dataset in credit-data.csv
- hw2.pdf: the assignment statement

The project was developed using Python 3.7.3 on macOS Mojave 10.14.4. It requires the following libraries:

| Package        | Version     |
| :------------: | :---------: |
| graphviz       | 2.40.1      |
| pandas         | 0.24.2      |
| matplotlib     | 3.0.3       |
| numpy          | 1.16.2      |
| seaborn        | 0.9.0       |
| scikit-learn   | 0.20.3      |

Helpful documentation and references are cited throughout the docstrings of the code.

## Building a simple machine learning pipeline

All code for this portion of the project is located in the pipeline_library module. Excerpts from this module are included throughout.

### Read data

The pipeline_library module provides a function, read_csv, which imports a CSV file into a pandas dataframe. The user can optionally specify which columns to import from the CSV and what type each column should have in the resulting dataframe. This function simply wraps the read_csv function provided by the pandas library.
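As a rough illustration of what such a wrapper could look like, the sketch below forwards two optional arguments to pandas. The name `read_csv_sketch` and the parameter names `cols` and `col_types` are assumptions made for this example and may not match the signature in pipeline_library.

```python
import io

import pandas as pd


def read_csv_sketch(filepath_or_buffer, cols=None, col_types=None):
    """Thin wrapper around pandas.read_csv (hypothetical signature)."""
    # usecols restricts which columns are read; dtype sets column types.
    return pd.read_csv(filepath_or_buffer, usecols=cols, dtype=col_types)


# Small in-memory CSV so the example does not depend on a file on disk.
sample = io.StringIO("id,age,zipcode\n1,34,60637\n2,51,60615\n")
df = read_csv_sketch(sample, cols=["id", "zipcode"], col_types={"zipcode": str})
```

Reading `zipcode` as a string rather than an integer is the kind of case where specifying column types matters, since it is an identifier rather than a quantity.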

### Explore data

The pipeline_library module also provides a suite of functions for exploratory data analysis. The first, show_distribution, returns a histogram and box plot for variables with a numeric type, and a bar plot for variables with a non-numeric type.

The next, pw_correlate, calculates a table of pairwise correlations between numeric type variables. The user can optionally specify which variables to include pairwise correlations for, and enable visualization. If visualization is enabled, the function also generates a heat map to help the user identify strong correlations.
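The tabular part of this step can be sketched with pandas alone. The function name `pw_correlate_sketch` and its `variables` parameter are assumptions for illustration, and the optional heat map is left out to keep the example short.

```python
import pandas as pd


def pw_correlate_sketch(df, variables=None):
    """Pairwise correlations between numeric columns (hypothetical signature)."""
    numeric = df.select_dtypes(include="number")
    if variables is not None:
        # Assumes every requested variable is a numeric column.
        numeric = numeric[variables]
    return numeric.corr()


df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": ["a", "b", "a", "b"],  # non-numeric, so it is ignored
})
corr = pw_correlate_sketch(df)
```

In this toy data, `y` is a positive multiple of `x` and `z` decreases linearly as `x` increases, so the table shows one perfectly positive and one perfectly negative correlation.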

The function summarize_data provides summary statistics over numeric data columns. By default, the function summarizes over all numeric columns; however, the user can restrict the summary statistics to certain numeric columns using the agg_cols argument. Additionally, the user can change the aggregating functions; the defaults are mean, variance, and quartiles. Lastly, the user can choose to group observations based on one or more categorical variables and then compute summaries over each group. This functionality can be helpful for seeing the relationship between categorical variables and numeric variables.
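A minimal sketch of grouped and ungrouped summaries follows. The parameter names `grouping_vars` and `agg_funcs` are assumptions (only agg_cols is named in the text above), and the sketch defaults to mean, variance, and median rather than the full set of quartiles.

```python
import pandas as pd


def summarize_data_sketch(df, grouping_vars=None, agg_cols=None,
                          agg_funcs=("mean", "var", "median")):
    """Summary statistics over numeric columns (hypothetical signature)."""
    grouping_vars = list(grouping_vars or [])
    if agg_cols is None:
        # Default: every numeric column that is not used for grouping.
        numeric_cols = df.select_dtypes(include="number").columns
        agg_cols = [c for c in numeric_cols if c not in grouping_vars]
    if grouping_vars:
        return df.groupby(grouping_vars)[agg_cols].agg(list(agg_funcs))
    return df[agg_cols].agg(list(agg_funcs))


df = pd.DataFrame({
    "region": ["a", "a", "b", "b"],
    "income": [10.0, 20.0, 30.0, 50.0],
})
overall = summarize_data_sketch(df)
by_region = summarize_data_sketch(df, grouping_vars=["region"])
```

The ungrouped result is indexed by statistic name, while the grouped result has one row per group and a two-level column index of (column, statistic).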

The final function for exploratory data analysis, find_outliers, relies on a helper function, find_outlier_univariate. The find_outliers function identifies the outliers in each numeric column of a dataframe, then records the number and percent of evaluated columns for which each observation is an outlier. The return value is a dataframe in which each row contains booleans describing whether the associated row in the passed-in dataframe is considered an outlier for each numeric column, along with the number and percent of evaluated columns for which that row is considered an outlier. For the purposes of this analysis, an outlier is a value falling more than 1.5x the interquartile range below the 25th percentile value or more than 1.5x the interquartile range above the 75th percentile value. Optionally, the user can exclude certain columns from this procedure with the keyword argument excluded.
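The 1.5x interquartile range rule described above can be sketched as follows. The function names carry a `_sketch` suffix and the output column names `count_outlier` and `pct_outlier` are assumptions for this example; the module's actual output layout may differ.

```python
import pandas as pd


def find_outlier_univariate_sketch(series):
    """Flag values more than 1.5 * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)


def find_outliers_sketch(df, excluded=None):
    """Per-column outlier flags plus per-row count and share of flags."""
    excluded = excluded or []
    numeric_cols = [c for c in df.select_dtypes(include="number").columns
                    if c not in excluded]
    flags = df[numeric_cols].apply(find_outlier_univariate_sketch)
    count = flags.sum(axis=1)
    result = flags.copy()
    result["count_outlier"] = count          # hypothetical column name
    result["pct_outlier"] = count / len(numeric_cols)  # hypothetical column name
    return result


df = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
outliers = find_outliers_sketch(df)
```

For column `a`, the quartiles are 2 and 4, so the bounds are -1 and 7 and only the value 100 is flagged; column `b` has no outliers, so the last row is an outlier in one of the two evaluated columns.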

### Preprocess Data

The preprocess_data function in the pipeline_library module also relies on a helper function, replace_missing. At this time, the only preprocessing step is to replace missing values in any variable that has them. The replace_missing function takes one column of a dataframe, in the form of a series, as its input. It then determines whether the series contains numeric type data, and if so, replaces the missing values with the median value in the series. If the series does not contain numeric type data, the data is assumed to be unordered categorical data, and the function replaces missing values with the modal value in the series, since a median cannot be calculated for unordered categorical data. The preprocess_data function applies this algorithm for replacing missing data to all the columns of a given dataframe.
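The median-or-mode rule can be sketched in a few lines. The `_sketch` names are used to signal that this is an illustration under assumed signatures, not the module's code; it also assumes every non-numeric column has at least one non-missing value.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def replace_missing_sketch(series):
    """Fill numeric columns with the median, other columns with the mode."""
    if is_numeric_dtype(series):
        return series.fillna(series.median())
    # mode() can return several values on ties; take the first one.
    return series.fillna(series.mode().iloc[0])


def preprocess_data_sketch(df):
    """Apply the replacement rule to every column of the dataframe."""
    return df.apply(replace_missing_sketch)


df = pd.DataFrame({
    "income": [1.0, None, 3.0, 10.0],
    "city": ["a", None, "a", "b"],
})
clean = preprocess_data_sketch(df)
```

Here the missing income is filled with 3.0, the median of the observed values, and the missing city is filled with "a", the most frequent category.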

### Generate features/predictors

To discretize a continuous variable, the pipeline_library module provides the cut_variable function. This function takes in a single column of a dataframe (in the form of a pandas series) and returns that column discretized into bins. The user can either specify a list of "edges" for the bins (for example, \[0, 0.5, 1.0\] would create the bins \[0, 0.5) and \[0.5, 1) ) or a number of bins n, which creates n approximately equipercentile bins. The user can also specify labels for the bins.
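One way such a function might be written is sketched below, using left-closed, right-open bins as in the example above. The signature, the default label format, and the handling of values outside the edges (left as missing) are all assumptions for illustration; in particular, nudging the top edge upward so that the maximum lands in the last equipercentile bin is a choice made for this sketch.

```python
import numpy as np
import pandas as pd


def cut_variable_sketch(series, bins, labels=None):
    """Discretize a series into [lower, upper) bins (hypothetical signature)."""
    if isinstance(bins, int):
        # Approximately equipercentile edges from the observed quantiles.
        edges = np.quantile(series.dropna(), np.linspace(0, 1, bins + 1))
        # Assumption: widen the top edge slightly so the maximum is included.
        edges[-1] = np.nextafter(edges[-1], np.inf)
    else:
        edges = np.asarray(bins, dtype=float)
    n_bins = len(edges) - 1
    if labels is None:
        labels = [f"[{lo:g}, {hi:g})" for lo, hi in zip(edges[:-1], edges[1:])]
    # Index i satisfies edges[i] <= value < edges[i + 1] when 0 <= i < n_bins.
    idx = np.searchsorted(edges, series.to_numpy(), side="right") - 1
    binned = [labels[i] if 0 <= i < n_bins else None for i in idx]
    return pd.Series(binned, index=series.index, name=series.name)


shares = pd.Series([0.0, 0.25, 0.5, 0.99, 1.0])
by_edges = cut_variable_sketch(shares, [0, 0.5, 1.0], labels=["low", "high"])

ages = pd.Series(range(1, 9))
by_quartile = cut_variable_sketch(ages, 4)
```

With explicit edges, the value 1.0 falls outside the right-open bin \[0.5, 1) and is left missing, which makes the bin boundaries easy to reason about.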

Though the pandas library contains a function, pd.cut, which can be used to discretize continuous data, I decided to write a custom function for greater control over choosing bin size and greater certainty about how continuous variables are being discretized.

To convert a categorical variable into a set of dummy variables, the pipeline_library module provides another custom function, create_dummies. This function takes in a dataframe and the name of the column to create dummies from, and returns a new dataframe with the categorical column removed and the new dummy columns appended to the end of the dataframe.

The pandas library also provides a function to convert categorical variables to dummies, pd.get_dummies. I chose to write a custom function, however, because I was dissatisfied with how the pandas library encodes missing data (it makes all dummies false where the categorical column is NA, whereas the function in pipeline_library makes dummies with missing values where the categorical column is NA) and to provide the opportunity for additional customization in the future.
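The difference in missing-value handling can be illustrated with a short sketch. For brevity it builds on pd.get_dummies and then overwrites the rows where the source column is NA, which is an assumption made for this example and may not be how create_dummies is actually implemented.

```python
import pandas as pd


def create_dummies_sketch(df, column):
    """Replace a categorical column with dummies that stay NA where it is NA."""
    dummies = pd.get_dummies(df[column], prefix=column, dtype=float)
    # pandas leaves these rows as all zeros; mark them missing instead.
    dummies.loc[df[column].isna(), :] = float("nan")
    return pd.concat([df.drop(columns=column), dummies], axis=1)


df = pd.DataFrame({"id": [1, 2, 3], "color": ["red", None, "blue"]})
with_dummies = create_dummies_sketch(df, "color")
```

Keeping the dummies missing for the second row preserves the fact that the category is unknown, rather than implying the observation belongs to none of the categories.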

### Build Classifier

The pipeline designed for this project can be used to generate a decision tree classifier using the generate_decision_tree function. The function takes in training data in the form of a pandas dataframe of features and a pandas series of labels for the same observations, and optionally an instance of sklearn.tree.DecisionTreeClassifier. By default, the decision tree generated by this function uses all the default values specified in the sklearn.tree.DecisionTreeClassifier object ([see the documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)), except it uses information gain instead of Gini impurity as its criterion for splitting. The optional decision tree parameter allows the user to customize the properties of the DecisionTreeClassifier by instantiating a sklearn.tree.DecisionTreeClassifier and passing it to the function. If the user passes a DecisionTreeClassifier object, that object is used instead of the default decision tree generated by the function.
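A minimal sketch of this behavior follows. The parameter names `features`, `target`, and `dt` are assumptions for illustration; in scikit-learn, information gain corresponds to `criterion="entropy"`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def generate_decision_tree_sketch(features, target, dt=None):
    """Fit a decision tree, defaulting to the entropy criterion."""
    if dt is None:
        dt = DecisionTreeClassifier(criterion="entropy")
    # fit returns the fitted estimator itself.
    return dt.fit(features, target)


features = pd.DataFrame({"debt_ratio": [0.1, 0.2, 0.8, 0.9]})
target = pd.Series([0, 0, 1, 1])

default_tree = generate_decision_tree_sketch(features, target)

# A user-supplied classifier overrides the default settings.
custom = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=0)
custom_tree = generate_decision_tree_sketch(features, target, dt=custom)
```

Accepting a pre-configured classifier keeps the function signature small while still exposing every scikit-learn hyperparameter to the caller.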

### Evaluating Classifier

A simple function, score_decision_tree, returns the mean accuracy of a decision tree when used to predict the target attribute for a set of observations where the value of the target attribute is known. The function takes in a decision tree, and a set of testing data in the form of a pandas dataframe of features (in the same order as the data on which the tree was trained) and a pandas series of classes for the same observations.
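Mean accuracy is available directly from a fitted scikit-learn classifier, so a sketch of this step can be very short. The function and parameter names below are assumptions, and the data is invented purely to show the calculation.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def score_decision_tree_sketch(dt, test_features, test_target):
    """Mean accuracy of a fitted tree on held-out data."""
    return dt.score(test_features, test_target)


train_x = pd.DataFrame({"debt_ratio": [0, 1, 2, 3]})
train_y = pd.Series([0, 0, 1, 1])
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(train_x, train_y)

# The last test observation is labeled 1 but sits on the "0" side of the split.
test_x = pd.DataFrame({"debt_ratio": [0, 3, 1]})
test_y = pd.Series([0, 1, 1])
accuracy = score_decision_tree_sketch(tree, test_x, test_y)
```

The tree splits between 1 and 2, so it classifies two of the three test observations correctly and the mean accuracy is two thirds.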

### Visualizing Classifier

Lastly, the function visualize_decision_tree saves and opens a PDF containing a visual representation of a DecisionTreeClassifier. For this function, the user must provide a decision tree, a list of feature names in the same order as the data on which the tree was trained, and the list of class names for the target attribute that the tree predicts. Optionally, the user may also specify an output path for the generated PDF.
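The first half of this step, turning a fitted tree into Graphviz DOT source, can be sketched with scikit-learn's export_graphviz. The function name, parameter names, and class labels below are assumptions for illustration, and the rendering to PDF is only described in a comment so the example runs without a Graphviz installation.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz


def tree_to_dot_sketch(dt, feature_names, class_names):
    """Return DOT source describing a fitted decision tree."""
    return export_graphviz(dt, out_file=None, feature_names=feature_names,
                           class_names=class_names, filled=True)


features = pd.DataFrame({"debt_ratio": [0.1, 0.2, 0.8, 0.9]})
target = pd.Series([0, 0, 1, 1])
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(features, target)

dot_source = tree_to_dot_sketch(tree, ["debt_ratio"],
                                ["no distress", "distress"])
# One option for producing a PDF from here is the graphviz package, e.g.
# graphviz.Source(dot_source).render("tree", format="pdf", view=True),
# which requires the Graphviz binaries to be installed.
```

Passing the feature and class names explicitly is what makes the rendered nodes readable, since the fitted tree only stores column positions and class indices.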

## Applying the simple machine learning pipeline