# Robustness Study
This notebook evaluates the robustness of the final model. Results and findings are presented in [capstone report.ipynb](capstone%20report.ipynb).

## Data Preprocessing
Read in the data file, then call `preprocess` to encode categorical variables and impute missing values, and `remove_outliers` to remove outliers. Display the top 10 rows as a sample.

In [37]:
import numpy as np
import pandas as pd
import itertools
import matplotlib.pyplot as plt
from pandas.plotting import table
import seaborn as sns
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import random
%matplotlib inline


from preprocess_visuals import *

pd.options.display.max_rows = 160
pd.options.display.max_columns = 200

In [5]:
df = pd.read_csv('data/LoanStats_securev1_2017Q1.csv.zip', skiprows=1, skipfooter=2,
                 engine='python', usecols=get_usecols(), converters=get_conv())

# dummy encode categorical variables and impute missing values
df = preprocess(df)
df = remove_outliers(df)

## Random Seed
Run the final model 30 times, each with a different random state for both the train-test split and the classifier, then analyze the mean and standard deviation of the resulting metrics.

In [91]:
precisions = []
recalls = []
accuracies = []
f2s = []
n_tries = 30

for i in range(n_tries):
    # Derive a reproducible random state for this run
    random.seed(i*3)
    ri1 = random.randint(27, 99999)
    # Calculate the training and testing scores of best classifier
    clf = RandomForestClassifier(criterion='entropy', class_weight='balanced', random_state=ri1, 
                                 max_depth=1, max_features=50, n_estimators=10)

    y = df.loan_status
    X = df.drop(columns='loan_status')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=ri1)
    clf.fit(X_train, y_train)
    y_predicted = clf.predict(X_test)
    a = round(accuracy_score(y_test, y_predicted), 4)
    # support is None when average='binary', so discard it instead of overwriting a loop variable
    p, r, f2, _ = precision_recall_fscore_support(y_test, y_predicted, beta=2, average='binary')
    
    precisions.append(p)
    recalls.append(r)
    accuracies.append(a)
    f2s.append(f2)

print("Number of tries:", n_tries)
print("Precision mean: %.3f, std: %.3f " % (np.mean(precisions), np.std(precisions)))
print("Recall mean: %.3f, std: %.3f " % (np.mean(recalls), np.std(recalls)))
print("Accuracy mean: %.3f, std: %.3f " % (np.mean(accuracies), np.std(accuracies)))
print("F2 score mean: %.3f, std: %.3f " % (np.mean(f2s), np.std(f2s)))

print("\nrecalls:", recalls)

Number of tries: 30
Precision mean: 0.072, std: 0.010 
Recall mean: 0.716, std: 0.117 
Accuracy mean: 0.570, std: 0.109 
F2 score mean: 0.251, std: 0.011 

recalls: [0.4618086040386304, 0.8361233480176211, 0.7918436703483432, 0.7409326424870466, 0.8248730964467005, 0.8307291666666666, 0.7636518771331058, 0.4804421768707483, 0.7602389078498294, 0.8409286328460877, 0.7879558948261238, 0.7523809523809524, 0.8344709897610921, 0.8406445837063563, 0.8063139931740614, 0.7648514851485149, 0.6466552315608919, 0.5849870578084556, 0.49828178694158076, 0.719626168224299, 0.7790697674418605, 0.620353982300885, 0.5924686192468619, 0.517566409597258, 0.6418642681929682, 0.7878535773710482, 0.7662447257383966, 0.8464912280701754, 0.6257408975444538, 0.726649528706084]


In [92]:
from scipy import stats
print('t-statistic = %6.3f pvalue = %6.4e' %  stats.ttest_1samp(recalls, 0.5))

# check whether the samples are consistent with a t-distribution (30 degrees of freedom) using the KS test:
print('KS-statistic D = %6.3f pvalue = %6.4f' % stats.kstest(recalls, 't', (30,)))

t-statistic =  9.962 pvalue = 7.1916e-11
KS-statistic D =  0.676 pvalue = 0.0000


## Results from T-statistic

The p-value is the probability of observing a t-statistic at least this extreme if the classifier's mean recall were equal to the naive classifier's recall (0.5 in the test above).
Since the p-value is 7.1916e-11, which is less than alpha=0.05, we can reject the null hypothesis and accept the alternative that the classifier's mean recall is higher.
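
The t-test in the cell above is two-sided by default, while the stated alternative (mean recall is higher than the naive recall) is one-sided. The following is a minimal sketch of how the one-sided p-value could be obtained from the same statistic. The recall values are copied from the printed output above and rounded to 4 decimals; the names `naive_recall`, `dof`, and `p_one_sided` are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Recall values from the printed output above, rounded to 4 decimals.
recalls = np.array([
    0.4618, 0.8361, 0.7918, 0.7409, 0.8249, 0.8307, 0.7637, 0.4804, 0.7602, 0.8409,
    0.7880, 0.7524, 0.8345, 0.8406, 0.8063, 0.7649, 0.6467, 0.5850, 0.4983, 0.7196,
    0.7791, 0.6204, 0.5925, 0.5176, 0.6419, 0.7879, 0.7662, 0.8465, 0.6257, 0.7266,
])

naive_recall = 0.5  # null-hypothesis value used in the cell above

# Two-sided one-sample t-test, as in the cell above.
t_stat, p_two_sided = stats.ttest_1samp(recalls, naive_recall)

# One-sided p-value for H1: mean recall > naive_recall, taken from the
# survival function of a t-distribution with n - 1 degrees of freedom.
dof = len(recalls) - 1
p_one_sided = stats.t.sf(t_stat, dof)

print('t-statistic = %6.3f' % t_stat)
print('two-sided p = %6.4e, one-sided p = %6.4e' % (p_two_sided, p_one_sided))
```

Because the t-statistic is positive, the one-sided p-value is half the two-sided one, so the conclusion at alpha=0.05 does not change.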

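The Kolmogorov-Smirnov test p-value < 0.05 means that we reject the null hypothesis that the recall values follow a t-distribution with 30 degrees of freedom. Note that the test compares the raw recall values, which all lie between 0 and 1, against a standard t-distribution centered at 0, so this result says little about the shape of the recall distribution itself.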
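
The KS test above compares the raw recall values against a standard t-distribution, so its statistic mostly reflects the difference in location and scale. The sketch below illustrates this and shows one possible alternative: standardizing the recalls and comparing them with a standard normal distribution, which is the shape the one-sample t-test assumes for the underlying data. This is an assumption-laden illustration, not the notebook's method. In particular, estimating the mean and standard deviation from the same sample makes the resulting KS p-value only approximate. The recall values are copied from the printed output above and rounded to 4 decimals.

```python
import numpy as np
from scipy import stats

# Recall values from the printed output above, rounded to 4 decimals.
recalls = np.array([
    0.4618, 0.8361, 0.7918, 0.7409, 0.8249, 0.8307, 0.7637, 0.4804, 0.7602, 0.8409,
    0.7880, 0.7524, 0.8345, 0.8406, 0.8063, 0.7649, 0.6467, 0.5850, 0.4983, 0.7196,
    0.7791, 0.6204, 0.5925, 0.5176, 0.6419, 0.7879, 0.7662, 0.8465, 0.6257, 0.7266,
])

# KS test of the raw values against a standard t-distribution (30 dof),
# as in the cell above.
d_raw, p_raw = stats.kstest(recalls, 't', args=(30,))

# The empirical CDF is 0 just below the smallest recall, while the reference
# CDF is already well above 0.5 there; that gap determines the statistic.
cdf_at_min = stats.t.cdf(recalls.min(), 30)

# Assumption: standardize with the sample mean and sample standard deviation,
# then compare the shape with a standard normal distribution.
z = (recalls - recalls.mean()) / recalls.std(ddof=1)
d_std, p_std = stats.kstest(z, 'norm')

print('raw:          D = %6.3f pvalue = %6.4f' % (d_raw, p_raw))
print('t.cdf(min recall, 30) = %6.3f' % cdf_at_min)
print('standardized: D = %6.3f pvalue = %6.4f' % (d_std, p_std))
```

With only 30 samples, a normality check such as this has limited power, so a non-significant result should be read as "no strong evidence against normality" rather than as confirmation.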
