# Introduction


Evaluating correlations:
* For all categorical variables, show a bar chart of default rates.
* For all continuous variables, show the point-biserial correlation or ANOVA.
  * ANOVA gives a p-value for the null hypothesis of equal means. Point-biserial is tougher to compare.
* Further dimensionality reduction: run an LR and show coefficients, p-values, and regularizer results.
* Chi-squared for categorical variables?
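As a rough illustration of the continuous-variable options above, the sketch below runs both a point-biserial correlation and a one-way ANOVA on synthetic data. The `demo` frame, the `score` column, and the way defaults are simulated are all assumptions made for the example. They are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Synthetic data for illustration only: a continuous score and a binary default flag.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'score': rng.normal(650, 50, size=500)})

# Assumption for the demo: lower scores are more likely to default.
prob_bad = 1 / (1 + np.exp((demo['score'] - 620) / 25))
demo['bad'] = (rng.random(500) < prob_bad).astype(int)

# Point-biserial correlation between the binary flag and the continuous score
r_pb, p_pb = scipy.stats.pointbiserialr(demo['bad'], demo['score'])

# One-way ANOVA: are mean scores equal across the two outcome groups?
groups = [g['score'].values for _, g in demo.groupby('bad')]
f_stat, p_anova = scipy.stats.f_oneway(*groups)

print(f'point-biserial r={r_pb:.3f}, p={p_pb:.3g}')
print(f'ANOVA F={f_stat:.3f}, p={p_anova:.3g}')
```

With only two outcome groups, the two tests are equivalent, so the p-values should agree. The correlation coefficient adds a sign and a magnitude that the F statistic does not.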

How do you evaluate the association between two categorical variables: ANOVA or Kruskal-Wallis (KW)?
* ANOVA tests whether different samples are drawn from populations with the same mean. It assumes a normally distributed dependent variable, which binary data is not, but [academics have argued](https://journals.sagepub.com/doi/abs/10.3102/00346543042003237?journalCode=rera) that it can be used accurately for binary data anyway.
* Kruskal-Wallis is a non-parametric alternative to ANOVA that similarly tests whether series of observations come from the same distribution. [Other academics have argued](https://journals.sagepub.com/doi/abs/10.3102/00346543051004499?journalCode=rera) that it more accurately assesses correlations between categorical values. But I've used this before and gotten unclear results. I'd like to discuss with the team before choosing a method here.
* Decision: chi-squared test of association, using `scipy.stats.chi2_contingency`, as documented [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) and implemented [here](https://stackoverflow.com/questions/25139326/chi-squared-test-in-python).
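A minimal sketch of how the chi-squared decision above might be applied, in the same shape as the Pearson helper used later in this notebook. The function name `chi2_association`, the `phone_type` column, and the toy counts are hypothetical. Only `pd.crosstab` and `scipy.stats.chi2_contingency` are real library calls.

```python
import pandas as pd
import scipy.stats

def chi2_association(var1, var2, data):
    """Hypothetical helper: chi-squared test of association between two categorical columns."""
    # Drop rows missing either variable
    data_notna = data.dropna(axis=0, subset=[var1, var2])

    # Build the contingency table of observed counts
    contingency = pd.crosstab(data_notna[var1], data_notna[var2])

    # Run the chi-squared test of independence
    chi2, p_value, dof, expected = scipy.stats.chi2_contingency(contingency)
    return chi2, p_value, dof, expected

# Toy data for illustration only: (category, good loans, bad loans)
rows = []
for category, n_good, n_bad in [('work', 30, 10), ('home', 20, 20), ('cell', 25, 15)]:
    rows += [(category, 0)] * n_good + [(category, 1)] * n_bad
demo = pd.DataFrame(rows, columns=['phone_type', 'bad'])

chi2, p_value, dof, expected = chi2_association('phone_type', 'bad', demo)
print(f'chi2={chi2:.3f}, p={p_value:.3g}, dof={dof}')
```

The `expected` array is worth inspecting before trusting the p-value, since the chi-squared approximation is usually considered unreliable when expected cell counts are very small.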


In [1]:
import numpy as np
import pandas as pd
import scipy.stats

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 50)

## Data

In [2]:
# Import data and data dictionary
data = pd.read_pickle('output_data/01_data.pkl')

data_dict = pd.read_pickle('output_data/01_data_dict.pkl')

## Potential Features

In [3]:
# Designate eda categories as containing features
prelim_feature_dict = ['credit_score', 'personal_finance', 'other_info']
data_dict['potential_feature'] = data_dict['eda_category'].isin(prelim_feature_dict)

## Pearson Correlations

In [4]:
# Function to evaluate pearson correlation
def correlate_pearson(var1, var2, data):
    # Drop rows without either input or output variable
    data_notna = data.dropna(axis=0, subset=[var1, var2])
    
    # Run Pearson correlation
    pearson, pearson_p = scipy.stats.pearsonr(data_notna[var1], data_notna[var2])
    
    return pearson, pearson_p

In [5]:
# Designate list of potential model features
potential_features = data_dict.loc[data_dict['potential_feature'], 'variable'].values

# For each potential feature, calculate Pearson's correlation 
# and its p-value and store them in the data dictionary
for var in potential_features:
    pearson, pearson_p = correlate_pearson(var, 'bad', data)
    data_dict.loc[data_dict['variable']==var,'pearson'] = pearson
    data_dict.loc[data_dict['variable']==var,'pearson_p'] = pearson_p
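Because `bad` is a binary outcome, the Pearson correlation computed in the loop above is numerically the same as the point-biserial correlation mentioned in the introduction. The short check below illustrates this on synthetic data. The `feature` and `bad_flag` arrays are made up for the example.

```python
import numpy as np
import scipy.stats

# Synthetic data for illustration only
rng = np.random.default_rng(1)
feature = rng.normal(size=200)
bad_flag = (rng.random(200) < 0.3).astype(int)

# Pearson correlation, called the same way as in correlate_pearson
pearson, pearson_p = scipy.stats.pearsonr(feature, bad_flag)

# Point-biserial correlation on the same data
point_biserial, point_biserial_p = scipy.stats.pointbiserialr(bad_flag, feature)

print(f'pearson={pearson:.6f}, point-biserial={point_biserial:.6f}')
```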

In [11]:
data_dict.sort_values('eda_category').reset_index(drop=True)

| | variable | var_dtype | eda_category | categorical | coverage | potential_feature | pearson | pearson_p |
|---|---|---|---|---|---|---|---|---|
| 0 | application_when | datetime64[ns] | application | 1 | 1.0 | False | | |
| 1 | loan_duration | int64 | application | 0 | 1.0 | False | | |
| 2 | raw_FICO_money | int64 | credit_score | 0 | 1.0 | True | -0.152437 | 9.6e-05 |
| 3 | raw_FICO_bank_card | int64 | credit_score | 0 | 1.0 | True | -0.152722 | 9.3e-05 |
| 4 | raw_FICO_retail | int64 | credit_score | 0 | 1.0 | True | -0.180558 | 4e-06 |
| 5 | raw_FICO_telecom | int64 | credit_score | 0 | 1.0 | True | -0.156673 | 6e-05 |
| 6 | raw_l2c_score | int64 | credit_score | 0 | 1.0 | True | 0.017142 | 0.662671 |
| 7 | other_phone_type_work | uint8 | other_info | -1 | 1.0 | True | 0.054079 | 0.168479 |
| 8 | application_day_of_year | int64 | other_info | 0 | 1.0 | True | 0.013557 | 0.730101 |
| 9 | email_duration_months | float64 | other_info | 0 | 1.0 | True | 0.017248 | 0.660712 |


## Sets of Features

### Confounders

Confounders are characteristics that affect loan performance in this dataset but cannot be used in future underwriting. Ideally you would control for them by adjusting observed performance for the underlying likelihood of default implied by those characteristics. That is not a run-of-the-mill exercise, so for this analysis I'll simply call out the risks they present.

In [None]:
# Loan Duration
display(bad_rate_by_category('duration_approved', data))
print('My hypothesis that longer loan durations allow more time for default is incorrect.')
print('Bad rates remain consistent across different loan durations.')
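The cell above calls `bad_rate_by_category`, which is not defined anywhere in this notebook. The sketch below is one guess at what such a helper might look like, assuming it returns the loan count and mean of `bad` for each level of a categorical variable. The signature, the `target` parameter, the output column names, and the toy data are all assumptions.

```python
import pandas as pd

def bad_rate_by_category(var, data, target='bad'):
    """Hypothetical helper: loan count and default rate of `target` per level of `var`."""
    summary = (
        data.groupby(var)[target]
        .agg(n_loans='count', bad_rate='mean')
        .reset_index()
    )
    return summary

# Toy data for illustration only
demo = pd.DataFrame({
    'duration_approved': [3, 3, 6, 6, 6, 12],
    'bad': [0, 1, 0, 0, 1, 1],
})
bad_rates = bad_rate_by_category('duration_approved', demo)
print(bad_rates)
```

Reporting the count next to the rate makes it easier to spot categories whose default rate rests on only a handful of loans.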

## Export Data

In [12]:
data.to_pickle('output_data/02_data.pkl')
data_dict.to_pickle('output_data/02_data_dict.pkl')