# Test v1 Models on New Data
## Use Model #1 (Classification - Fully Paid vs. Default) to Filter Rows for Model #2 (Regression - Annualized Return %)
- I will use Model #1's predicted classification on the train set to select the rows fed to Model #2, rather than filtering on the pre-labelled classes
- This is what I would need to do with new, real-world data, where the true outcome of a loan is not yet known
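
The two-stage idea above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual models: `clf`, `reg`, the features, and both targets are hypothetical stand-ins for Model #1, Model #2, and the loan data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in data (assumption): the real features come from the loan files.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y_class = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)  # 1 = Fully Paid
y_return = 5.0 + 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)       # annualized return %

# Stand-in for Model #1 (classification).
clf = LogisticRegression().fit(X, y_class)
# Stand-in for Model #2 (regression), fitted here on rows labelled Fully Paid.
reg = LinearRegression().fit(X[y_class == 1], y_return[y_class == 1])

# On new data only the classifier's prediction is available, not the true label,
# so the predicted class decides which rows reach the regression model.
X_new = rng.normal(size=(50, 3))
pred_paid = clf.predict(X_new) == 1
X_for_regression = X_new[pred_paid]
predicted_returns = reg.predict(X_for_regression)
```

Rows the classifier marks as Default never reach the regression step, which mirrors what would happen with unlabelled loans.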


In [28]:
import pandas as pd
import numpy as np
from functions_and_objects import (impute_means_zeros_maxs_X,parse_percentage,
                                   clean_LC_data_classification_eval,preprocessing_future_test, 
                                   columns_list, nan_max_cols, nan_mean_cols, nan_zero_cols, dtype,
                                   one_hot_encode_current, concat_X_and_6ohe_dfs,
                                   prep_all_df_for_classification)
import joblib
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, roc_curve,precision_recall_fscore_support)

%load_ext autoreload
%autoreload 2



### Import New Data from 2018-2019

In [25]:
raw_q12019 = pd.read_csv('~/peervest/lc_data/LoanStats_securev1_2019Q1.csv',header=1)
raw_q42018 = pd.read_csv('~/peervest/lc_data/LoanStats_securev1_2018Q4.csv',header=1)
raw_q32018 = pd.read_csv('~/peervest/lc_data/LoanStats_securev1_2018Q3.csv',header=1)
raw_q22018 = pd.read_csv('~/peervest/lc_data/LoanStats_securev1_2018Q2.csv',header=1)
raw_q12018 = pd.read_csv('~/peervest/lc_data/LoanStats_securev1_2018Q1.csv',header=1)
# Note: raw_q12019 is loaded above but is not included in dfs_list
dfs_list = [raw_q42018,
            raw_q32018,
            raw_q22018,
            raw_q12018]



### Clean Data

In [45]:
# Use the eval cleaner instead of the current one: these loans have already
# completed, so predictions can be compared against actual outcomes
# rather than being made blind (which is what the app will do)

clean_lc_df_future = clean_LC_data_classification_eval(dfs_list) #takes in dfs_list, outputs clean_lc_df



### Preprocess, no train-test-split: entire dataset is our test data

In [46]:
X_future, y_future = preprocessing_future_test(clean_lc_df_future) # takes in clean_lc_df_future, outputs X_future, y_future



### One Hot Encode

In [47]:
(ohe_home_ownership, ohe_purpose, ohe_zip_code, 
 ohe_application_type, ohe_sub_grade, ohe_emp_title_2) = one_hot_encode_current(X_future)

In [48]:
X_future = concat_X_and_6ohe_dfs(X_future, ohe_home_ownership, ohe_purpose, ohe_zip_code, 
                                 ohe_application_type, ohe_sub_grade, ohe_emp_title_2)
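
When new data is one-hot encoded, the resulting columns need to match the columns the model was trained on. How `one_hot_encode_current` handles this is not shown here, so the following is only a sketch of one common approach with plain pandas; the toy frames and column names are hypothetical.

```python
import pandas as pd

# Toy training and new data (assumption): "wedding" never appears in training,
# and "house" never appears in the new data.
train = pd.DataFrame({"purpose": ["car", "house", "car"], "loan_amnt": [1000, 2000, 1500]})
new = pd.DataFrame({"purpose": ["car", "wedding"], "loan_amnt": [1200, 800]})

train_X = pd.get_dummies(train, columns=["purpose"])
new_X = pd.get_dummies(new, columns=["purpose"])

# Align new data to the training columns: categories unseen in training are
# dropped, and categories missing from the new data become all-zero columns.
new_X = new_X.reindex(columns=train_X.columns, fill_value=0)
```

After the reindex, `new_X` has the same columns in the same order as `train_X`, which is what a fitted scikit-learn estimator expects at prediction time.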

#### Prep for Classification

- set the existing `index` column as the DataFrame index

In [49]:
X_future.set_index('index',inplace=True)

- drop non-numeric & OHE source columns

In [50]:
prep_all_df_for_classification(X_future)

In [51]:
X_future.shape

(67586, 1124)

### X_future is ready to input into Classification Model

#### Load Model with joblib

In [52]:
loaded_log_reg_v1 = joblib.load('log_reg_v1.joblib')

In [53]:
future_preds = loaded_log_reg_v1.predict(X_future)
future_preds_proba = loaded_log_reg_v1.predict_proba(X_future)
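
As a self-contained sketch of the load-and-predict step, the example below saves and reloads a toy logistic regression with joblib and shows how `predict` and `predict_proba` relate. The model, data, and file name are hypothetical; only the `joblib.dump`/`joblib.load` and scikit-learn calls mirror what the notebook uses.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data and model (assumption), standing in for log_reg_v1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Round-trip the fitted model through a joblib file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.joblib")
    joblib.dump(model, path)
    loaded = joblib.load(path)

labels = loaded.predict(X)        # hard class labels
proba = loaded.predict_proba(X)   # one column per class, ordered as loaded.classes_

# Probability of the positive class (1 = Fully Paid in this notebook's encoding).
p_paid = proba[:, list(loaded.classes_).index(1)]
```

Looking up the column through `classes_` avoids assuming which column holds the positive class.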

In [54]:
print("Precision Fully Paid: {}".format(precision_score(y_future,future_preds)))
print("Recall Fully Paid: {}".format(recall_score(y_future,future_preds)))
print("F-1 Score Fully Paid: {}".format(f1_score(y_future,future_preds)))
# ROC-AUC is computed here from hard class predictions, not from future_preds_proba
print("ROC-AUC Score: {}".format(roc_auc_score(y_future,future_preds)))

Precision Fully Paid: 0.9999470815473356
Recall Fully Paid: 0.9998412614424044
F-1 Score Fully Paid: 0.9998941686951001
ROC-AUC Score: 0.9997828770248113
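
The ROC-AUC call above is given hard class predictions (`future_preds`). With binary labels as the score, `roc_auc_score` reduces to the mean of the two per-class recalls (balanced accuracy), while passing positive-class probabilities gives the threshold-free AUC. A small sketch on made-up values, where `y_true` and `scores` are hypothetical:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Made-up labels and positive-class scores (assumption).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.6, 0.55, 0.7, 0.8, 0.9, 0.45])
y_pred = (scores >= 0.5).astype(int)  # hard labels at a 0.5 threshold

auc_from_labels = roc_auc_score(y_true, y_pred)   # single operating point
auc_from_scores = roc_auc_score(y_true, scores)   # ranks all positive/negative pairs
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(auc_from_labels, bal_acc, auc_from_scores)
```

In the notebook's terms, the probability-based version would take the positive-class column of `future_preds_proba` as the second argument.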


In [56]:
# precision, recall, F1-score, and support
print("Precision, Recall, F, & Support By Class [0,1] aka [Default,Fully Paid]: {}".format(
    precision_recall_fscore_support(y_future,future_preds)))

print("Precision, Recall, F, & Support Weighted Average by Support: {}".format(
    precision_recall_fscore_support(y_future,future_preds,average='weighted')))

Precision, Recall, F, & Support By Class [0,1] aka [Default,Fully Paid]: (array([0.99917393, 0.99994708]), array([0.99972449, 0.99984126]), array([0.99944914, 0.99989417]), array([10889, 56697]))
Precision, Recall, F, & Support Weighted Average by Support: (0.9998225170729338, 0.9998224484360666, 0.9998224681900859, None)
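
The tuple of arrays returned by `precision_recall_fscore_support` is easier to read as a labelled table. A minimal sketch on made-up labels; `y_true`, `y_pred`, and the row names are hypothetical, and the class order [0, 1] follows the [Default, Fully Paid] convention used above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels (assumption): 0 = Default, 1 = Fully Paid.
y_true = np.array([0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1, 0])

# Without `average`, each returned array has one entry per class, in sorted label order.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

report = pd.DataFrame(
    {"precision": precision, "recall": recall, "f1": f1, "support": support},
    index=["Default (0)", "Fully Paid (1)"],
)
print(report)
```

With `average='weighted'`, the same function returns scalars and `None` for support, which is why the last element of the weighted output above is `None`.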
