# Feature Engineering

The second step of this project is feature engineering. In this notebook, I've done several things:

* Generated 10 new features, including:
    * Bank ID
    * Expense Type
    * Rent-to-Income Ratio
    * Measures of timing and seasonality
* Recoded 4 old features, including:
    * Duration of email, residence, and bank account
    * Payment Frequency
* Imputed missing data using sklearn imputer modules
* Standardized continuous variables to mean = 0, stdev = 1
* One-hot encoded categorical variables into 42 new binary variables
* Dynamically expanded data dictionary to 79 entries

In [1]:
import pandas as pd
import numpy as np
import warnings

import scipy.stats
import sklearn
import xgboost
import statsmodels.api as sm

from sklearn.impute import SimpleImputer, KNNImputer

from utils import *

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 50)

In [2]:
data = pd.read_pickle('output_data/00_data.pkl')
data_dict = pd.read_pickle('output_data/00_data_dict.pkl')

# Building New Features

## Deriving Bank Information from Routing Number

Bank information can be derived from the provided bank routing number. [Plaid](https://plaid.com/resources/banking/account-numbers-explained/) provides the following explanation: 

>* The first four digits of a routing number are called the Federal Reserve Processing Symbol.
>* The first two digits are typically a number between 01-12. These numbers refer to the head branch of the Federal Reserve District Office under which the bank falls. The Districts start with 01 in Boston and end with 12 in San Francisco and often encompass surrounding states. A range of 61-72 is assigned to non-bank payment processors and the number 80 to travelers’ checks.
>* The third digit identifies the Regional Federal Reserve Processing Center assigned to the bank within its district. For example, “1” would indicate bank check processing center 1.
>* The fourth digit indicates the bank’s location within a Federal Reserve District. A bank that is located in a Federal Reserve District is designated with a 0. Numbers 1-9 are used according to the state that the Federal Reserve District is in.
>* Digits **five through eight comprise a unique identity code** assigned to the bank by the American Bankers Association. 
>* The final digit of the routing number is called a “check digit.” Whereas the other digits refer to identifying information about the specific financial institution, the check digit is calculated using the other eight digits as a way to ensure authenticity and prevent fraud. (Thus “check” is used in the sense of “verify” rather than “paper check.”) 

Because the Federal Reserve Districts and Processing Centers are primarily geographic, and geography is suspect as a credit underwriting criterion under ECOA because of its potential for disparate impact, I have not used the first four digits of the routing number as a feature of the model. 

Instead I generated a new variable called 'bank_id' consisting of digits five through eight in the routing number and identifying an individual bank. This allows us to monitor which banks see high default rates, perhaps indicating a low underlying quality of the credit pool and leading to more defaults in the future. 
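
The digit layout quoted above can be sketched as a small parser. This is only an illustration, not the code used in this notebook: `parse_routing_number` is a hypothetical helper, it assumes the routing number may have been read in as an integer (which drops leading zeros), and it uses the standard ABA checksum weights of 3, 7, 1.

```python
def parse_routing_number(routing_number):
    """Split a 9-digit ABA routing number into its parts (hypothetical helper)."""
    # Restore leading zeros that are lost if the value was stored as an integer
    digits = str(routing_number).zfill(9)
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"Not a 9-digit routing number: {routing_number!r}")

    # ABA checksum: weights 3, 7, 1 repeated; the weighted sum must be divisible by 10
    weights = [3, 7, 1] * 3
    checksum = sum(int(d) * w for d, w in zip(digits, weights))

    return {
        'fed_processing_symbol': digits[0:4],  # digits 1-4 (geographic)
        'bank_id': digits[4:8],                # digits 5-8 (institution)
        'check_digit': digits[8],              # digit 9
        'is_valid': checksum % 10 == 0,
    }


# Integer input that is missing its leading zero
parsed = parse_routing_number(21000021)
print(parsed['bank_id'], parsed['is_valid'])
```

If `bank_routing_number` is already stored as a 9-character string, the `zfill` step is a no-op and the slice matches the one used in the next cell.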

In [3]:
# Generate bank_id
data['bank_id'] = data['bank_routing_number'].astype(str).str.slice(4,8)

In [4]:
# Assess the available sample size of bank_ids
# rename_axis + reset_index(name=...) yields the same columns across pandas versions
bank_id_value_counts = (data['bank_id'].value_counts()
                        .rename_axis('bank_id')
                        .reset_index(name='count'))
# Keep banks with at least 30 customers
bank_id_n30 = bank_id_value_counts.loc[bank_id_value_counts['count'] >= 30,
                                       'bank_id'].tolist()

print(f"{len(bank_id_n30)} banks have a sample of 30 or more customers.")
print(f"They are: {bank_id_n30}")

6 banks have a sample of 30 or more customers.
They are: ['0297', '0154', '7751', '0005', '0215', '7955']


In [5]:
# Suppress individual banks with sample sizes of <30 customers
data.loc[~data['bank_id'].isin(bank_id_n30), 'bank_id'] = '9999'
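
The suppression step can also be written as a reusable function. This is a minimal sketch on toy data; `collapse_rare_categories` and its parameters are hypothetical names, not part of utils.py.

```python
import pandas as pd


def collapse_rare_categories(series, min_count=30, other_label='9999'):
    """Replace values seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # Keep frequent values as-is, overwrite everything else
    return series.where(series.isin(keep), other_label)


# Toy example with a threshold of 2 instead of 30
toy = pd.Series(['0297'] * 3 + ['0154'] * 2 + ['7751'])
collapsed = collapse_rare_categories(toy, min_count=2)
print(collapsed.value_counts().to_dict())
```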

In [6]:
# Add bank_id to the data dictionary
# Function from utils.py
data_dict = add_entry_to_dictionary(data, data_dict, 'bank_id', 
                                    eda_category = 'personal_finance',
                                    categorical = 1)

In [7]:
# Show values
bad_rate_by_category(data, 'bank_id')

| bank_id | bad (sum) | bad (count) | bad_rate |
|---|---|---|---|
| 0215 | 14 | 33 | 0.424242 |
| 7751 | 24 | 53 | 0.45283 |
| 0154 | 33 | 72 | 0.458333 |
| 9999 | 194 | 345 | 0.562319 |
| 7955 | 18 | 32 | 0.5625 |
| 0005 | 23 | 40 | 0.575 |
| 0297 | 47 | 75 | 0.626667 |


## Timestamp, Seasonality, and Secular Trends

In [8]:
# Generating variables based on time of application
data['application_year'] = data['application_when'].dt.year
data['application_month'] = data['application_when'].dt.month
# Calendar day of the year (1-366), taken directly from the timestamp
data['application_day_of_year'] = data['application_when'].dt.dayofyear
data['application_month_of_cycle'] = (data['application_month'] - 10) % 12
data['application_hour_of_day'] = data['application_when'].dt.hour
data['application_day_of_week'] = data['application_when'].dt.weekday

After some data exploration, the above variables look less useful than hoped.
The default rate generally falls between the first applications in 
October 2010 and the last ones in April 2011. This is a secular trend that 
the default model should correct for, not depend upon. 

Later in the modeling process, I will implement a train-test split that 
depends on the month of application. This avoids look-ahead bias
and makes our accuracy estimates more reliable. 
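
A time-based split of this kind might look like the sketch below. It uses toy data and a hypothetical cutoff date; only the `application_when` and `bad` column names come from this notebook, and the actual split used later may differ.

```python
import pandas as pd

# Toy stand-in for the loan data
toy = pd.DataFrame({
    'application_when': pd.to_datetime([
        '2010-10-15', '2010-12-01', '2011-01-20',
        '2011-02-28', '2011-03-10', '2011-04-05']),
    'bad': [1, 1, 0, 1, 0, 0],
})

# Hypothetical cutoff: train on applications before March 2011, test on the rest
cutoff = pd.Timestamp('2011-03-01')
train = toy[toy['application_when'] < cutoff]
test = toy[toy['application_when'] >= cutoff]

print(len(train), len(test))
```

Because every training row predates every test row, the model is never evaluated on applications older than the ones it learned from.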

In [9]:
bad_rate_by_category(data, 'application_month').sort_index()

| application_month | bad (sum) | bad (count) | bad_rate |
|---|---|---|---|
| 1 | 67 | 110 | 0.609091 |
| 2 | 74 | 131 | 0.564885 |
| 3 | 70 | 149 | 0.469799 |
| 4 | 47 | 95 | 0.494737 |
| 10 | 11 | 18 | 0.611111 |
| 11 | 37 | 58 | 0.637931 |
| 12 | 47 | 89 | 0.52809 |


One potentially useful variable is time of day. The middle
of the day seems to have the lowest default rates, while earlier and 
later hours have higher default rates. Below we generate a new variable 
that splits the day into three 8-hour blocks and check the default rate 
for each block. 

In [10]:
data['application_third_of_day'] = np.floor(data['application_hour_of_day'] / 8)

In [11]:
bad_rate_by_category(data, 'application_third_of_day')

| application_third_of_day | bad (sum) | bad (count) | bad_rate |
|---|---|---|---|
| 1.0 | 192 | 379 | 0.506596 |
| 2.0 | 115 | 195 | 0.589744 |
| 0.0 | 46 | 76 | 0.605263 |


Finally, let's add these new variables to the data dictionary. I've defined a function in utils.py for this purpose. 

In [12]:
# Data Dictionary entries
# List of tuples consisting of (variable_name, categorical_bool)
timestamp_vars = [('application_year', 1),
                  ('application_month', 1),
                  ('application_day_of_year', 0),
                  ('application_month_of_cycle', 0),
                  ('application_hour_of_day', 0),
                  ('application_day_of_week', 1),
                  ('application_third_of_day', 1)]

# Dynamically add entries to data dictionary
for var_name, cat_bool in timestamp_vars:
    # Function from utils.py
    data_dict = add_entry_to_dictionary(data, data_dict, var_name, 
                                eda_category = 'other_info',
                                categorical = cat_bool)

## Rent to Income Ratio

Some of the most commonly used credit underwriting criteria compare cash flows to costs: debt-to-income ratio, payment-to-income ratio, measures of disposable income, and so forth. While we do not have the information necessary to compute all of these statistics, we can compute a simple rent-to-income ratio. 

In [13]:
data['rent_to_income_ratio'] = data['monthly_rent_amount'] / data['monthly_income_amount']
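
One thing to watch with a ratio feature is a zero denominator, which pandas turns into `inf` rather than a missing value. The sketch below uses toy numbers (not the loan data) to show one way to guard against that, assuming a zero income should be treated as missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'monthly_rent_amount': [800.0, 500.0, 0.0],
    'monthly_income_amount': [2400.0, 0.0, 3000.0],
})

# Treat zero income as missing so the division yields NaN instead of inf
income = toy['monthly_income_amount'].replace(0, np.nan)
toy['rent_to_income_ratio'] = toy['monthly_rent_amount'] / income

print(toy['rent_to_income_ratio'].tolist())
```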

In [14]:
# Add to data dictionary
data_dict = add_entry_to_dictionary(data, data_dict, 
                                    'rent_to_income_ratio', 
                                    eda_category = 'personal_finance',
                                    categorical = 0)

## Expense Type

In [15]:
# Combine categories. Still might not be significant. 
recode_purpose = {
    'Bills': 'general',
    'Bills (Auto)': 'transportation',
    'Bills (General)': 'general',
    'Bills (Home / Utilities)': 'housing',
    'Bills (Medical)': 'medical',
    'Car': 'transportation',
    'Gifts / Leisure': 'general',
    'Medical': 'medical',
    'None': 'general',
    'Other': 'general',
    'Pay off loans / overdrawn acct': 'credit',
    'Rent': 'housing',
    'Rent / Mortgage': 'housing',
    'School': 'education',
    "Won't say": 'general'
}

data['expense_type'] = data['how_use_money'].apply(lambda x: recode_purpose.get(x))

In [16]:
# Add to data dictionary
data_dict = add_entry_to_dictionary(data, data_dict, 'expense_type', 
                                    eda_category = 'personal_finance',
                                    categorical = 1)

In [17]:
bad_rate_by_category(data, 'expense_type')

| expense_type | bad (sum) | bad (count) | bad_rate |
|---|---|---|---|
| credit | 2.0 | 11 | 0.181818 |
| education | 2.0 | 5 | 0.4 |
| housing | 12.0 | 25 | 0.48 |
| medical | 21.0 | 41 | 0.512195 |
| general | 271.0 | 490 | 0.553061 |
| transportation | 45.0 | 78 | 0.576923 |


## Duration Categories to Continuous

#### <span style='color:red'>Representing Ordinal Variables with Continuous Quantities</span>

Ordinal variables are variables whose values can be ranked and ordered, but for which the distance between values is not defined. For example, runners in a race can be ranked by finishing order without recording their exact finishing times. This tells you the relative ranking of the runners, but nothing about the size of the gaps between them. 

Standard classification models do not often support ordinal categories. Instead of recognizing the ranked nature of the values, they assume each value is a nominal category that is distinct from and incomparable to other categories. This eliminates valuable information that can be used in the prediction process. 

To more accurately represent our data, I have generated continuous variables to represent ordinal categories. The value assigned to each bounded category is the midpoint of its range in months (for example, '7-12 months' becomes 9.5). Open-ended categories such as '3+ years' have no midpoint, so they are assigned an assumed representative value (72 months for '3+ years'). The new variables are then used during the modeling process as continuous variables, such that the distance between values is now meaningful.

In [18]:
# Construct numerical alternative to categorical durations
duration_vars = ['email_duration', 
                 'residence_duration', 
                 'bank_account_duration']

duration_decoder = {
    '3 months or less': 1.5,
    '6 months or less': 3,
    '4-12 months': 8,
    '7-12 months': 9.5,
    '1-2 years': 18,
    '1 year or more': 36,
    '3+ years': 72
}

for var in duration_vars:
    new_varname = var + '_months'
    data[new_varname] = data[var].apply(lambda x: duration_decoder.get(x))

In [19]:
# Dynamically add entries to data dictionary
new_duration_vars = ['email_duration_months',
                     'residence_duration_months',
                     'bank_account_duration_months']

for var_name in new_duration_vars:
    # Function from utils.py
    data_dict = add_entry_to_dictionary(data, data_dict, var_name, 
                                eda_category = 'other_info',
                                categorical = 0)

In [20]:
bad_rate_by_category(data, 'bank_account_duration_months')

| bank_account_duration_months | bad (sum) | bad (count) | bad_rate |
|---|---|---|---|
| 9.5 | 28 | 57 | 0.491228 |
| 72.0 | 162 | 328 | 0.493902 |
| 18.0 | 77 | 132 | 0.583333 |
| 3.0 | 85 | 132 | 0.643939 |


## Payment Frequency Recoding

#### <span style='color:red'>Representing Ordinal Variables with Continuous Quantities</span>

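Note: semi-monthly means twice per month (24 payments per year), not once every two months, so it is coded as 2 payments per month. Bi-weekly (every two weeks, 26 payments per year) and weekly (52 payments per year) are rounded to 2 and 4 payments per month respectively.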

In [21]:
payment_frequency_decoder = {
    'Weekly': 4,
    'Bi-weekly': 2, 
    'Monthly': 1, 
    'Semi-monthly': 2,
}

data['payment_frequency_per_month'] = \
    data['payment_frequency'].apply(lambda x: payment_frequency_decoder.get(x))
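
For reference, the exact payments-per-month figures behind the rounded values in the decoder can be computed from the number of pay periods per year. This is a small illustration only; the notebook keeps the rounded values, and `PERIODS_PER_YEAR` is a name introduced here.

```python
# Pay periods per year for each schedule (standard calendar conventions)
PERIODS_PER_YEAR = {
    'Weekly': 52,
    'Bi-weekly': 26,     # every two weeks
    'Semi-monthly': 24,  # twice a month
    'Monthly': 12,
}

# Divide by 12 to express each schedule as payments per month
exact_per_month = {name: n / 12 for name, n in PERIODS_PER_YEAR.items()}

for name, per_month in exact_per_month.items():
    print(f"{name}: {per_month:.2f} payments per month")
```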

In [22]:
# Dynamically add entries to data dictionary
data_dict = add_entry_to_dictionary(data, data_dict, 
                                    'payment_frequency_per_month', 
                                    eda_category = 'personal_finance',
                                    categorical = 0)

In [23]:
bad_rate_by_category(data, 'payment_frequency_per_month')

| payment_frequency_per_month | bad (sum) | bad (count) | bad_rate |
|---|---|---|---|
| 1 | 76 | 176 | 0.431818 |
| 2 | 255 | 446 | 0.571749 |
| 4 | 22 | 28 | 0.785714 |


## Missing Data Imputation

In [24]:
# Data Validation: Am I missing the missing values?
# It looks like there's missing data that .count() is not catching. 

# This returns no results! Doesn't make sense. 
for col in data.columns:
    if None in list(data[col]):
        print(col)

# Got it! Very tricky, nicely done
# 'None' is present as a string
for col in data.columns:
    if 'None' in list(data[col]):
        print(f"Strings of 'None' found in {col}")
        
# Replacing 'None' strings with np.nan
# (the np.NaN alias was removed in NumPy 2.0)
data = data.replace('None', np.nan)

# Verifying deletion of 'None' strings
successfully_replaced = True
for col in data.columns:
    if 'None' in list(data[col]):
        print("Failed to replace 'None' strings!")
        successfully_replaced = False
    
if successfully_replaced:
    print("Successfully replaced 'None' strings with np.NaN")

Strings of 'None' found in bank_account_duration
Strings of 'None' found in other_phone_type
Strings of 'None' found in how_use_money
Successfully replaced 'None' strings with np.NaN
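
The first loop above finds nothing because the placeholder is the string `'None'`, not the Python object `None`, so pandas counts those cells as present. A vectorized way to run the same check is sketched below on toy data; the column names are borrowed from this dataset but the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'other_phone_type': ['Mobile', 'None', 'Home'],
    'monthly_income_amount': [2400, 1800, 3000],
})

# Columns containing the literal string 'None'
has_none_string = toy.isin(['None']).any()
print(has_none_string[has_none_string].index.tolist())

# After replacement the cells count as missing
cleaned = toy.replace('None', np.nan)
print(cleaned.isna().sum().to_dict())
```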


In [25]:
# Which variables have missing values?
count_df = pd.DataFrame(data.count().reset_index())
count_df.rename({'index': 'variable',
                  0: 'count'}, axis=1, inplace=True)

count_df[count_df['count'] < data.shape[0]]

| index | variable | count |
|---|---|---|
| 14 | payment_amount_approved | 628 |
| 20 | bank_account_duration | 649 |
| 23 | other_phone_type | 359 |
| 24 | how_use_money | 648 |
| 45 | bank_account_duration_months | 649 |


Some of these variables are designed to have missing values:
* other_phone_type is designed to be missing for people without another phone. 
* bank_id_supp_n30 is designed to be missing for low sample size banks. 


Others are lender decisions, not consumer characteristics (think outputs, not inputs):
* payment_amount_approved is a term issued by the bank, not a predictive feature of the model. 
* how_use_money is about the category of loan, which does not necessarily reflect creditworthiness. I will impute the missing data here because we might use it in the model, though expense_type is probably more useful and already groups missing values into the 'general' category. 


But we do have two variables with a single clearly missing data point:
* bank_account_duration has one missing data point that should be imputed. 
* By extension, bank_account_duration_months has a missing value as well. 

In [26]:
# Run SimpleImputer to fill bank_account_duration
# Note: `features` is listed for reference; the single-column imputers below do not use it
features = [
    'raw_FICO_telecom',
    'raw_FICO_retail',
    'raw_FICO_bank_card',
    'raw_FICO_money',
    'raw_l2c_score',
    'monthly_income_amount',
    'monthly_rent_amount',
    'bank_account_direct_deposit',
    'bank_account_duration', 
    'email_duration',
    'residence_duration',
    'residence_rent_or_own',
    'home_phone_type',
    'other_phone_type',
    'how_use_money'
]

# Fill the missing category with the most frequent one
simple_imputer = SimpleImputer(strategy='most_frequent')
# fit_transform returns a 2D array; ravel() flattens it back to one column
data['bank_account_duration'] = simple_imputer.fit_transform(
    data[['bank_account_duration']]).ravel()

In [27]:
# Run KNN Imputer to fill bank_account_duration_months
# Note: with a single input column there are no other features to measure
# distance on, so KNNImputer falls back to the column mean here
knn_imputer = KNNImputer(n_neighbors = 10, weights = 'uniform')
data['bank_account_duration_months'] = knn_imputer.fit_transform(
    data[['bank_account_duration_months']]).ravel()
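
KNN imputation only uses neighbors when other columns are available to measure distance on. The toy sketch below (made-up numbers, with column 0 standing in for a hypothetical predictor) contrasts a single-column call, which falls back to the column mean, with a two-column call, which borrows the value from the nearest row.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Column 0: hypothetical predictor. Column 1: duration with one missing value.
X = np.array([
    [1.0, 3.0],
    [1.1, np.nan],
    [5.0, 72.0],
    [5.2, 72.0],
])

# Single column: no distances can be computed, so the column mean is used
single = KNNImputer(n_neighbors=1).fit_transform(X[:, [1]])
print(single[1, 0])

# Two columns: the nearest row by column 0 supplies the missing value
both = KNNImputer(n_neighbors=1).fit_transform(X)
print(both[1, 1])
```

Passing several of the columns in `features` (after encoding the categorical ones) would be one way to let the imputer use neighbors; whether that is worthwhile for a single missing value is a judgment call.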

In [28]:
# Show counts of newly imputed columns
data[['bank_account_duration', 'bank_account_duration_months']].count()

# TODO: Update coverage column in data dictionary

bank_account_duration           650
bank_account_duration_months    650
dtype: int64

## Standardization

Continuous variables are standardized to a mean of 0 and a standard deviation of 1. 

In [29]:
# Create a copy of the data
data = data.copy()

# Designate variables to be standardized
standardize_categories = ['personal_finance', 
                          'credit_score', 
                          'other_info']

# TODO: Address the FutureWarning
standardize_cols = list(data_dict.loc[(data_dict['eda_category'].isin(standardize_categories)) & 
                                      (data_dict['categorical'] == 0), 'variable'].values)

# Add or remove individual columns
standardize_cols.remove('application_month_of_cycle')

# Standardize them
data[standardize_cols] = (data[standardize_cols] - data[standardize_cols].mean()) \
                                / data[standardize_cols].std()
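
A small point about the formula above: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while sklearn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The toy sketch below shows the difference; with 650 rows it is negligible, but it explains small mismatches if the two approaches are compared.

```python
import pandas as pd

toy = pd.DataFrame({'monthly_income_amount': [1000.0, 2000.0, 3000.0]})

# pandas default: sample standard deviation (ddof=1)
z_sample = (toy - toy.mean()) / toy.std()

# Population standard deviation (ddof=0), matching StandardScaler
z_population = (toy - toy.mean()) / toy.std(ddof=0)

print(z_sample['monthly_income_amount'].tolist())
print(z_population['monthly_income_amount'].round(4).tolist())
```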

In [30]:
# Show mean and standard deviation of standardized columns
# Round off floating-point decimal errors

data[standardize_cols].describe().round(2).head(3)

| statistic | monthly_rent_amount | monthly_income_amount | raw_l2c_score | raw_FICO_telecom | raw_FICO_retail | raw_FICO_bank_card | raw_FICO_money | application_day_of_year | application_hour_of_day | rent_to_income_ratio | email_duration_months | residence_duration_months | bank_account_duration_months | payment_frequency_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 | 650.0 |
| mean | 0.0 | -0.0 | -0.0 | -0.0 | 0.0 | -0.0 | 0.0 | 0.0 | -0.0 | 0.0 | -0.0 | 0.0 | 0.0 | 0.0 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## One Hot Encoding

In [31]:
# Designate variables for one-hot
one_hot_categories = ['personal_finance', 'other_info']
one_hot_cols = data_dict.loc[(data_dict['eda_category'].isin(one_hot_categories)) & 
                             (data_dict['categorical'] == 1), 'variable'].str.strip().values

# Encode using pd.get_dummies()
df_dummies = pd.get_dummies(data[one_hot_cols])
# Keep only the newly created dummy columns, preserving their order
new_columns = [col for col in df_dummies.columns if col not in set(one_hot_cols)]
df_dummies = df_dummies[new_columns]

# Eliminate categoricals and concatenate dummies
data = data.drop(one_hot_cols, axis=1)
data = pd.concat([data, df_dummies], axis=1)
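
One behavior of `pd.get_dummies()` worth keeping in mind: by default it only expands object and category columns, and numeric columns pass through unchanged. In the cell above, any categorical stored as a number would therefore be dropped rather than encoded; whether that applies here depends on the dtypes in `data`, which this notebook does not show. The toy sketch below illustrates the default and the `columns=` alternative.

```python
import pandas as pd

toy = pd.DataFrame({
    'expense_type': ['housing', 'medical', 'housing'],  # object dtype
    'application_month': [1, 2, 1],                     # integer-coded categorical
})

# Default: only the object column is expanded; application_month passes through
default_dummies = pd.get_dummies(toy)
print(sorted(default_dummies.columns))

# Naming the columns explicitly expands the integer-coded column as well
explicit_dummies = pd.get_dummies(toy, columns=['expense_type', 'application_month'])
print(sorted(explicit_dummies.columns))
```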

In [32]:
print(f"Number of features after one-hot encoding: {len(data.columns)}")

Number of features after one-hot encoding: 79


In [33]:
# Add dummies to data_dictionary
def get_eda_category(data_dict, dummy_name):
    """
    For a binary variable generated with pd.get_dummies(), finds
    the eda_category of the underlying categorical variable. 
    
    Returns: eda_category
    """
    # Strip the trailing '_<category value>' suffix added by pd.get_dummies()
    var_name = dummy_name.rsplit('_', 1)[0]
    eda_category = data_dict.loc[data_dict['variable']==var_name, 'eda_category'].values
    
    return eda_category

In [34]:
# Generate data dictionary entry for every new column
for dummy_name in new_columns:
    # Dynamically determine eda_category through data dictionary
    eda_category = get_eda_category(data_dict, dummy_name)[0]
    
    # Clean dummy names of stray characters and whitespace
    clean_dummy_name = clean_name(dummy_name)
    data = data.rename({dummy_name: clean_dummy_name}, axis=1)
    
    # Add newly cleaned entry to the data dictionary
    data_dict = add_entry_to_dictionary(data, data_dict, 
                                        clean_dummy_name, 
                                        eda_category = eda_category,
                                        categorical = -1)

In [35]:
# Eliminate data dictionary entries for dropped categoricals
data_dict = data_dict[~data_dict['variable'].isin(one_hot_cols)]
data_dict = data_dict.reset_index(drop=True)

In [36]:
print(f"Data dictionary has {data_dict.shape[0]} variables.")

Data dictionary has 79 variables.


## Export Data

In [37]:
data.to_pickle('output_data/01_data.pkl')
data_dict.to_pickle('output_data/01_data_dict.pkl')