
# Final Project
* Course: Machine Learning
* Team members:
    * Marina Ortín
    * Mayra Goicochea


In [1]:
import pandas as pd
import numpy as np

## Objectives

The goal of this exercise is to build a model that identifies borrowers who are likely to default on the credit they were granted. We will try several models, from logistic regression to XGBoost, to find the best one. The "KPI" section defines the performance indicators that guide this choice.

To do so, we need to analyze which variables are decisive for granting credit. This study is in the "Exploratory Analysis (EDA)" section.


## KPI

In this section we briefly describe the performance indicators used to choose the best model.

Although accuracy is our primary measure, we also report the confusion matrix, the area under the ROC curve (AUC) and the F1 score.

We define **accuracy** as the ratio of correct predictions made by the model to the total number of input observations.
It is the basic measure we use for model selection; in addition, we compute:

* **Confusion matrix**: a matrix that describes the model's performance in full. Its main diagonal holds the number of correct predictions, and the off-diagonal cells hold the false negatives and false positives.

* **Area under the curve (AUC), ROC curve**: widely used in binary classification problems such as this one. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as the classification threshold varies; the AUC summarizes that curve in a single number.

* **F1 score**: the harmonic mean of precision and recall, $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. Precision is the number of true positives over the total number of observations predicted as positive.

$\text{Precision} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}}$

Recall is the number of true positives over the total number of observations that should have been identified as positive.

$\text{Recall} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}$
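
As a quick illustration of the definitions above, the sketch below computes accuracy, precision, recall and F1 from the four cells of a confusion matrix. The function name `classification_metrics` and the counts are hypothetical, chosen only to show how these measures can diverge on imbalanced data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 100 actual defaults among 1,000 loans.
metrics = classification_metrics(tp=30, fp=20, fn=70, tn=880)
print(metrics)
```

With these counts the model is right 91% of the time yet recovers only 30% of the defaults, which is why accuracy is read alongside the other measures.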

## Data

A preliminary review of the dataset file shows 2,260,668 records and 145 columns.
The columns `id`, `member_id` and `url` have no discriminating power for the credit analysis, so they are excluded from the final dataset.
The `loan_status` column is restricted to the categories "Default", "Charged Off" and "Fully Paid": these are historical loans, used as the reference to define the two classes "Default" (unpaid loans) and "No Default" (repaid loans).
Given these conditions, we load the data into pandas DataFrames with these restrictions applied:

In [2]:
data_path_train = '../data/training_set.csv'
data_path_test = '../data/test_set.csv'

In [3]:
loan_train = pd.read_csv(data_path_train)
loan_train.drop('Unnamed: 0', axis=1,inplace=True)

In [4]:
loan_test = pd.read_csv(data_path_test)
loan_test.drop('Unnamed: 0', axis=1,inplace=True)
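
The status filter described in the Data section is not shown in this notebook, since the CSV files are loaded already prepared. A minimal sketch of how such a filter might look follows. The toy frame `raw` is illustrative, and mapping both "Default" and "Charged Off" to class 1 is an assumption, not something the notebook confirms.

```python
import pandas as pd

# Toy frame standing in for the raw file; the values are illustrative only.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["u1", "u2", "u3", "u4", "u5"],
    "loan_status": ["Fully Paid", "Charged Off", "Current",
                    "Default", "Fully Paid"],
    "loan_amnt": [5000, 12000, 8000, 20000, 15000],
})

# Keep only the three historical statuses named in the text.
closed_statuses = ["Default", "Charged Off", "Fully Paid"]
loan = raw[raw["loan_status"].isin(closed_statuses)].copy()

# Assumption: "Default" and "Charged Off" are class 1, "Fully Paid" is class 0.
loan["target"] = (loan["loan_status"] != "Fully Paid").astype(int)

# Drop identifier columns; errors="ignore" skips names absent from the toy frame.
loan = loan.drop(columns=["id", "member_id", "url"], errors="ignore")
print(loan)
```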

## Exploratory Analysis (EDA)

### Target Variable

In [5]:
loan_train['credit_risk'] = loan_train['target']
loan_train.drop('target', axis=1, inplace=True)

In [6]:
loan_test['credit_risk'] = loan_test['target']
loan_test.drop('target', axis=1, inplace=True)

In [7]:
loan_train = loan_train[loan_train.application_type == 'Individual']
loan_test = loan_test[loan_test.application_type == 'Individual']

### Independent Variables

In [8]:
# Summarize the columns with missing values: count, percentage of rows, and data type.
def missing_values_table(df):
    # Missing values per column
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_type = df.dtypes
    # Build the table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent, mis_val_type], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values', 2: 'type'})
    # Keep only the columns with missing values, sorted in descending order
    mis_val_table_ren_columns = mis_val_table_ren_columns[ mis_val_table_ren_columns.iloc[:,1] != 0].sort_values('% of Total Values', ascending=False).round(1)
    # Print the summary
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n" "There are " + str(mis_val_table_ren_columns.shape[0]) + " columns that have missing values.")
    # Return the missing-value table
    return mis_val_table_ren_columns

In [9]:
loan_train['loan_amnt'].describe() #0-40k
buck = [0, 5000, 10000, 15000, 20000, 25000 , 40000]
lab = ['0-5000', '5000-10000', '10000-15000', '15000-20000', '20000-25000','25000+']
loan_train['loan_amnt_range'] = pd.cut(loan_train['loan_amnt'], buck, labels=lab)

In [10]:
loan_test['loan_amnt'].describe() #0-40k
buck = [0, 5000, 10000, 15000, 20000, 25000 , 40000]
lab = ['0-5000', '5000-10000', '10000-15000', '15000-20000', '20000-25000','25000+']
loan_test['loan_amnt_range'] = pd.cut(loan_test['loan_amnt'], buck, labels=lab)

In [11]:
loan_train['term'] = loan_train['term'].apply(lambda s: np.int8(s.split()[0]))

In [12]:
loan_test['term'] = loan_test['term'].apply(lambda s: np.int8(s.split()[0]))

In [13]:
from datetime import datetime
loan_train['issue_year'] = pd.to_datetime(loan_train.issue_d).dt.year

In [14]:
loan_test['issue_year'] = pd.to_datetime(loan_test.issue_d).dt.year

In [15]:
loan_train.earliest_cr_line = pd.to_datetime(loan_train.earliest_cr_line)
dttoday = pd.Timestamp(datetime.now().strftime('%Y-%m-%d'))
loan_train['earliest_cr_line'].fillna(dttoday, inplace = True)

In [16]:
loan_test.earliest_cr_line = pd.to_datetime(loan_test.earliest_cr_line)
dttoday = pd.Timestamp(datetime.now().strftime('%Y-%m-%d'))
loan_test['earliest_cr_line'].fillna(dttoday, inplace = True)

In [17]:
# Years elapsed since the earliest credit line; rows filled with dttoday get 0.
loan_train['Age_Borrower'] = (dttoday - loan_train.earliest_cr_line).dt.days / 365

In [18]:
# Years elapsed since the earliest credit line; rows filled with dttoday get 0.
loan_test['Age_Borrower'] = (dttoday - loan_test.earliest_cr_line).dt.days / 365

In [19]:
loan_train.loc[(loan_train['purpose'] == 'debt_consolidation')|(loan_train['purpose'] =="credit_card"), 'purpose_g'] = 'debt' 
loan_train.loc[(loan_train['purpose'] == 'home_improvement')|(loan_train['purpose'] =="major_purchase")|
                 (loan_train['purpose'] == 'car')|(loan_train['purpose'] =="house")|
                 (loan_train['purpose'] == 'vacation')|(loan_train['purpose'] =="renewable_energy"),
                 'purpose_g'] = 'major_purchase' 
loan_train.loc[(loan_train['purpose'] == 'small_business')|(loan_train['purpose'] =="medical")|
                 (loan_train['purpose'] == 'moving')|(loan_train['purpose'] =="wedding")|
                 (loan_train['purpose'] == 'educational'),
                 'purpose_g'] = 'life_event'
loan_train.loc[(loan_train['purpose'] == 'other'), 'purpose_g'] = 'other'

In [20]:
loan_test.loc[(loan_test['purpose'] == 'debt_consolidation')|(loan_test['purpose'] =="credit_card"), 'purpose_g'] = 'debt' 
loan_test.loc[(loan_test['purpose'] == 'home_improvement')|(loan_test['purpose'] =="major_purchase")|
                 (loan_test['purpose'] == 'car')|(loan_test['purpose'] =="house")|
                 (loan_test['purpose'] == 'vacation')|(loan_test['purpose'] =="renewable_energy"),
                 'purpose_g'] = 'major_purchase' 
loan_test.loc[(loan_test['purpose'] == 'small_business')|(loan_test['purpose'] =="medical")|
                 (loan_test['purpose'] == 'moving')|(loan_test['purpose'] =="wedding")|
                 (loan_test['purpose'] == 'educational'),
                 'purpose_g'] = 'life_event'
loan_test.loc[(loan_test['purpose'] == 'other'), 'purpose_g'] = 'other'
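
Because the same grouping is written out twice, once per dataset, a lookup table applied with `Series.map` may be easier to keep in sync. The sketch below uses the groups from the cells above; the names `PURPOSE_GROUPS` and `add_purpose_group` are hypothetical.

```python
import pandas as pd

# Same grouping as the .loc assignments above, expressed as a lookup table.
PURPOSE_GROUPS = {
    "debt_consolidation": "debt", "credit_card": "debt",
    "home_improvement": "major_purchase", "major_purchase": "major_purchase",
    "car": "major_purchase", "house": "major_purchase",
    "vacation": "major_purchase", "renewable_energy": "major_purchase",
    "small_business": "life_event", "medical": "life_event",
    "moving": "life_event", "wedding": "life_event",
    "educational": "life_event",
    "other": "other",
}


def add_purpose_group(df):
    """Return a copy of df with a purpose_g column; unmapped purposes become NaN."""
    out = df.copy()
    out["purpose_g"] = out["purpose"].map(PURPOSE_GROUPS)
    return out


demo = add_purpose_group(
    pd.DataFrame({"purpose": ["car", "wedding", "credit_card", "other"]})
)
print(demo)
```

Calling the helper once on `loan_train` and once on `loan_test` would guarantee that both sets use identical groups.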

In [21]:
loan_train['emp_length'].fillna(0, inplace=True)
loan_train['emp_length'].replace('10+ years', '10 years', inplace=True)
loan_train['emp_length'].replace('< 1 year', '0 years', inplace=True)
loan_train.emp_length.map( lambda x: str(x).split()[0]).value_counts(dropna=True).sort_index()
loan_train['emp_length'] = loan_train.emp_length.map( lambda x: float(str(x).split()[0]))

In [22]:
loan_test['emp_length'].fillna(0, inplace=True)
loan_test['emp_length'].replace('10+ years', '10 years', inplace=True)
loan_test['emp_length'].replace('< 1 year', '0 years', inplace=True)
loan_test.emp_length.map( lambda x: str(x).split()[0]).value_counts(dropna=True).sort_index()
loan_test['emp_length'] = loan_test.emp_length.map( lambda x: float(str(x).split()[0]))

In [23]:
loan_train['home_ownership'].replace(['NONE','ANY'],'OTHER', inplace=True)

In [24]:
loan_test['home_ownership'].replace(['NONE','ANY'],'OTHER', inplace=True)

In [25]:
loan_train['verification_status'].replace(['Source Verified','Verified'],'Verified', inplace=True)
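# Encode verification_status as 1/0 (this is a feature, not the target variable).
# NOTE: only exact matches are replaced; if the raw label is 'Not Verified', as in the
# public Lending Club files, the "No Verified" key below will not match it.
di = {"Verified":1, "No Verified":0}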
loan_train = loan_train.replace({"verification_status": di})

In [26]:
loan_test['verification_status'].replace(['Source Verified','Verified'],'Verified', inplace=True)
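# Encode verification_status as 1/0 (this is a feature, not the target variable).
# NOTE: only exact matches are replaced; if the raw label is 'Not Verified', as in the
# public Lending Club files, the "No Verified" key below will not match it.
di = {"Verified":1, "No Verified":0}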
loan_test = loan_test.replace({"verification_status": di})

In [27]:
# create region of residence based on state
west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
south_west = ['AZ', 'TX', 'NM', 'OK']
south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']

def finding_regions(state):
    if state in west:
        return 'West'
    elif state in south_west:
        return 'SouthWest'
    elif state in south_east:
        return 'SouthEast'
    elif state in mid_west:
        return 'MidWest'
    elif state in north_east:
        return 'NorthEast'

loan_train['region'] = loan_train['addr_state'].apply(finding_regions)
loan_test['region'] = loan_test['addr_state'].apply(finding_regions)

In [28]:
grade_mapping={'A':7,'B':6,'C':5,'D':4,'E':3,'F':2,'G':1}
loan_train["grade"] = loan_train["grade"].replace(grade_mapping)
loan_test["grade"] = loan_test["grade"].replace(grade_mapping)

In [29]:
loan_train["delinq_2yrs_cat"] = 0
loan_train.loc[loan_train["delinq_2yrs"] > 0, "delinq_2yrs_cat"] = 1

In [30]:
loan_test["delinq_2yrs_cat"] = 0
loan_test.loc[loan_test["delinq_2yrs"] > 0, "delinq_2yrs_cat"] = 1

In [31]:
loan_train["pub_rec_cat"] = 0
loan_train.loc[loan_train["pub_rec"]>0,"pub_rec_cat"] = 1

In [32]:
loan_test["pub_rec_cat"] = 0
loan_test.loc[loan_test["pub_rec"]>0,"pub_rec_cat"] = 1

In [33]:
loan_train['total_acc'].fillna(0, inplace = True) 
loan_train['open_acc'].fillna(0, inplace = True) 
loan_train['acc_ratio'] = loan_train.open_acc / loan_train.total_acc
loan_train['acc_ratio'].fillna(0, inplace = True) 

In [34]:
loan_test['total_acc'].fillna(0, inplace = True) 
loan_test['open_acc'].fillna(0, inplace = True) 
loan_test['acc_ratio'] = loan_test.open_acc / loan_test.total_acc
loan_test['acc_ratio'].fillna(0, inplace = True) 
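
One edge case in the ratio above is worth spelling out: `0 / 0` yields `NaN`, which `fillna(0)` handles, but a positive `open_acc` over a zero `total_acc` yields `inf`, which `fillna` leaves untouched. The sketch below shows one way to cover both cases. The frame `accounts` is made up, and its third row is deliberately implausible.

```python
import numpy as np
import pandas as pd

# Illustrative values only: row 2 is 0/0, row 3 is 3/0.
accounts = pd.DataFrame({"open_acc": [5.0, 0.0, 3.0],
                         "total_acc": [10.0, 0.0, 0.0]})

ratio = accounts["open_acc"] / accounts["total_acc"]  # 0.5, NaN, inf

# Map +/-inf to NaN first, then fill every undefined ratio with 0.
safe_ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(safe_ratio.tolist())
```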

In [35]:
loan_train["revol_bal_cat"] = 0
loan_train.loc[loan_train["revol_bal"]>0,"revol_bal_cat"] = 1

In [36]:
loan_test["revol_bal_cat"] = 0
loan_test.loc[loan_test["revol_bal"]>0,"revol_bal_cat"] = 1

In [37]:
loan_train["hardship_type"].fillna("NO HARDSHIP PLAN", inplace = True) 

In [38]:
loan_test["hardship_type"].fillna("NO HARDSHIP PLAN", inplace = True) 

In [39]:
cols_drop = ['id','url','member_id','application_type','loan_amnt','issue_d','earliest_cr_line','pymnt_plan','emp_title',
             'desc','title','purpose','policy_code','addr_state','zip_code','sub_grade','num_actv_rev_tl',
             'num_actv_bc_tl','num_bc_sats','num_bc_tl','num_op_rev_tl','num_rev_accts','num_rev_tl_bal_gt_0',
             'delinq_2yrs','inq_last_6mths','pub_rec','num_il_tl','num_tl_120dpd_2m','num_tl_30dpd',
             'num_tl_op_past_12m','pct_tl_nvr_dlq','pub_rec_bankruptcies','open_acc_6m','open_act_il',
             'open_il_12m','open_il_24m','mths_since_rcnt_il','open_rv_12m','open_rv_24m','inq_fi',
             'inq_last_12m','mo_sin_rcnt_rev_tl_op','revol_bal','total_rev_hi_lim','total_acc','total_bal_il',
             'max_bal_bc','all_util', 'total_cu_tl','total_bal_ex_mort','tot_hi_cred_lim','total_bc_limit',
             'out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp','total_rec_int',
             'total_rec_late_fee','recoveries','collection_recovery_fee','last_pymnt_d','last_pymnt_amnt',
             'next_pymnt_d','last_credit_pull_d','mths_since_recent_bc','mths_since_recent_bc_dlq',
             'mths_since_recent_inq','mths_since_recent_revol_delinq','hardship_flag','hardship_reason', 
             'hardship_status','deferral_term', 'hardship_amount', 'hardship_start_date','hardship_end_date',
             'payment_plan_start_date', 'hardship_length','hardship_dpd', 'hardship_loan_status',
             'hardship_last_payment_amount','orig_projected_additional_accrued_interest',
             'hardship_payoff_balance_amount', 'debt_settlement_flag', 'debt_settlement_flag_date',
             'settlement_status', 'settlement_date', 'settlement_amount','settlement_percentage', 'settlement_term',
            'annual_inc_joint','dti_joint','verification_status_joint', 'revol_bal_joint','sec_app_earliest_cr_line',
           'sec_app_inq_last_6mths','sec_app_mort_acc', 'sec_app_open_acc','sec_app_revol_util',
           'sec_app_open_act_il','sec_app_num_rev_accts',
           'sec_app_chargeoff_within_12_mths',
           'sec_app_collections_12_mths_ex_med',
           'sec_app_mths_since_last_major_derog']
loan_train.drop(cols_drop, axis=1,inplace=True)
loan_test.drop(cols_drop, axis=1,inplace=True)

In [40]:
di = {np.nan:0}  # np.NaN was removed in NumPy 2.0; np.nan works in every version
loan_train.replace({'tax_liens':di,'annual_inc':di,'delinq_2yrs':di,'pub_rec':di,
                    'mths_since_last_delinq':di,'mths_since_last_record':di,
                    'mths_since_last_major_derog':di,'collections_12_mths_ex_med':di,
                    'tot_coll_amt':di,'acc_open_past_24mths':di,'chargeoff_within_12_mths':di,
                    'mo_sin_old_il_acct':di,'mo_sin_old_rev_tl_op':di,'mo_sin_rcnt_tl':di,
                    'num_accts_ever_120_pd':di,'num_tl_90g_dpd_24m':di,'percent_bc_gt_75':di,
                    'revol_util':di,'total_acc':di,'num_sats':di,'acc_now_delinq':di,
                    'tot_cur_bal':di,'total_bal_il':di,'il_util':di,'max_bal_bc':di,'all_util':di,
                    'total_cu_tl':di,'avg_cur_bal':di,'bc_open_to_buy':di,'bc_util':di,
                    'delinq_amnt':di,'mort_acc':di,'total_bc_limit':di,'total_il_high_credit_limit':di},inplace=True)

In [41]:
loan_test.replace({'tax_liens':di,'annual_inc':di,'delinq_2yrs':di,'pub_rec':di,
                    'mths_since_last_delinq':di,'mths_since_last_record':di,
                    'mths_since_last_major_derog':di,'collections_12_mths_ex_med':di,
                    'tot_coll_amt':di,'acc_open_past_24mths':di,'chargeoff_within_12_mths':di,
                    'mo_sin_old_il_acct':di,'mo_sin_old_rev_tl_op':di,'mo_sin_rcnt_tl':di,
                    'num_accts_ever_120_pd':di,'num_tl_90g_dpd_24m':di,'percent_bc_gt_75':di,
                    'revol_util':di,'total_acc':di,'num_sats':di,'acc_now_delinq':di,
                    'tot_cur_bal':di,'total_bal_il':di,'il_util':di,'max_bal_bc':di,'all_util':di,
                    'total_cu_tl':di,'avg_cur_bal':di,'bc_open_to_buy':di,'bc_util':di,
                    'delinq_amnt':di,'mort_acc':di,'total_bc_limit':di,'total_il_high_credit_limit':di},inplace=True)

In [42]:
missing_values_table(loan_train)

Your selected dataframe has 50 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values,type


In [43]:
missing_values_table(loan_test)

Your selected dataframe has 50 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values,type


In [44]:
for col in loan_train.select_dtypes(include=['object']).columns:
    loan_train[col] = loan_train[col].astype('category')

In [45]:
for col in loan_test.select_dtypes(include=['object']).columns:
    loan_test[col] = loan_test[col].astype('category')

In [46]:
loan_train = pd.get_dummies(loan_train, columns=loan_train.select_dtypes(include=['category']).columns)

In [47]:
loan_test = pd.get_dummies(loan_test, columns=loan_test.select_dtypes(include=['category']).columns)
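
Encoding the two sets separately works here because both end up with the same 69 columns. If a category were present in only one set, the column lists would differ. A hedged sketch of one common safeguard, reindexing the test set to the training columns, is shown below on toy frames whose names and values are illustrative.

```python
import pandas as pd

# Toy frames: the test set lacks two of the regions seen in training.
train = pd.DataFrame({"region": ["West", "MidWest", "NorthEast"],
                      "dti": [10.0, 20.0, 15.0]})
test = pd.DataFrame({"region": ["West", "West"], "dti": [12.0, 18.0]})

train_d = pd.get_dummies(train, columns=["region"])
test_d = pd.get_dummies(test, columns=["region"])

# Reindex the test set to the training columns: missing dummies are added
# as 0, and any test-only column would be dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```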

## Variable Selection

In [48]:
# Number of non-null values in each column.
var_n = pd.DataFrame(loan_train.count()).reset_index()
var_n.columns=['index', 'n']

# Correlation of each variable with the target variable (credit_risk)
var_cor = pd.DataFrame((loan_train.corr()['credit_risk'])).reset_index()
var_cor.columns=['index', 'correlation']

# Data type of each variable
var_type = pd.DataFrame(loan_train.dtypes).reset_index()
var_type.columns=['index', 'type']
var_type

| | index | type |
|---|---|---|
| 0 | funded_amnt | int64 |
| 1 | funded_amnt_inv | float64 |
| 2 | term | int64 |
| 3 | int_rate | float64 |
| 4 | installment | float64 |
| 5 | grade | int64 |
| 6 | emp_length | float64 |
| 7 | annual_inc | float64 |
| 8 | dti | float64 |
| 9 | mths_since_last_delinq | float64 |


In [54]:
from scipy import stats, optimize, interpolate
import scipy
import scipy.integrate as integrate

In [50]:
def do_test(x):
    # Run Student's t-test on every numeric variable. is_numeric_dtype also
    # matches the dummy columns, which a strict float64/int64 check skips.
    if pd.api.types.is_numeric_dtype(x):
        group1, group2 = [g[1] for g in x.astype(float).groupby(loan_train['credit_risk'])]
        t, p = scipy.stats.ttest_ind(group1, group2, nan_policy='omit')
        # An undefined statistic (e.g. a constant column) is treated as not significant.
        if np.isnan(np.ma.getdata(t)):
            p = 1
        return p

    # Run the chi-squared test on every categorical variable.
    if x.dtypes == 'object':
        observed = pd.crosstab(x, loan_train['credit_risk'])
        chi, p = scipy.stats.chi2_contingency(observed, correction=False)[0:2]
        return p
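
To make the two tests in `do_test` concrete, the sketch below runs each on small made-up data: a two-sample t-test on a numeric feature split by class, and a chi-squared test on a 2x2 contingency table. The arrays and counts are hypothetical and only illustrate the calls.

```python
import numpy as np
from scipy import stats

# Numeric feature: values for repaid loans vs. defaulted loans (made-up data).
paid = np.arange(10.0, 20.0)   # 10, 11, ..., 19
defaulted = paid + 15.0        # same spread, shifted mean
t_stat, p_ttest = stats.ttest_ind(paid, defaulted, nan_policy="omit")

# Categorical feature: rows are categories, columns are the two classes.
observed = np.array([[300, 100],
                     [200, 200]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed, correction=False)

print(f"t-test p-value: {p_ttest:.3g}")
print(f"chi-squared statistic: {chi2:.2f}, p-value: {p_chi2:.3g}")
```

In both cases a small p-value indicates that the feature is distributed differently across the two classes, which is the criterion the selection step relies on.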

In [55]:
# Compute the p-value of each variable.
pval = pd.DataFrame(loan_train.apply(do_test)).reset_index()
pval.columns=['index', 'p']

In [56]:
var_n = pd.DataFrame(loan_train.count()).reset_index()
var_n.columns=['index', 'n']

# Correlation of each variable with the target variable (credit_risk)
var_cor = pd.DataFrame((loan_train.corr()['credit_risk'])).reset_index()
var_cor.columns=['index', 'correlation']

# Data type of each variable
var_type = pd.DataFrame(loan_train.dtypes).reset_index()
var_type.columns=['index', 'type']

In [57]:
var_n.shape

(69, 2)

In [58]:
# Merge the p-values with the information for each variable.
var_info = pd.merge(var_n, pval, how='left', on=['index'])
var_info = pd.merge(var_info, var_type, how='left', on=['index'])
var_select = var_info.loc[((var_info['n']/len(loan_train)>.95)) & (var_info['p'] < 0.05/len(var_info))]
var_select.to_csv('selected_vars.csv')

print('Total number of variables selected =',len(var_select))
print('Type of variables:')
print(var_select['type'].value_counts())

import matplotlib.pyplot as plt
plt.plot(var_select['p'],'ro')
var_select.sort_values(by=['index'])

Total number of variables selected = 62
Type of variables:
float64    31
uint8      24
int64       7
Name: type, dtype: int64


| | index | n | p | type |
|---|---|---|---|---|
| 37 | Age_Borrower | 980547 | 0.000000e+00 | float64 |
| 15 | acc_now_delinq | 980547 | 7.594770e-05 | float64 |
| 19 | acc_open_past_24mths | 980547 | 0.000000e+00 | float64 |
| 40 | acc_ratio | 980547 | 0.000000e+00 | float64 |
| 7 | annual_inc | 980547 | 3.039275e-277 | float64 |
| 20 | avg_cur_bal | 980547 | 0.000000e+00 | float64 |
| 21 | bc_open_to_buy | 980547 | 0.000000e+00 | float64 |
| 22 | bc_util | 980547 | 0.000000e+00 | float64 |
| 13 | collections_12_mths_ex_med | 980547 | 3.472952e-70 | float64 |
| 35 | credit_risk | 980547 | 0.000000e+00 | int64 |


In [59]:
loan_test.shape

(325634, 69)

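According to Student's t-test and the chi-squared test, 61 explanatory variables would be selected: the 62 variables reported above include the target `credit_risk` itself.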
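
The selection rule used above compares each p-value with `0.05/len(var_info)`, which is a Bonferroni correction: the significance level is divided by the number of tests (69 here, so roughly 7.2e-4). A minimal sketch of that rule follows; the function name `bonferroni_select` and the p-values are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Keep the names whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [name for name, p in p_values.items() if p < threshold]


# Hypothetical p-values for four candidate features (threshold = 0.05 / 4).
p_values = {"int_rate": 1e-12, "dti": 4e-4,
            "delinq_amnt": 0.02, "tot_coll_amt": 0.30}
selected = bonferroni_select(p_values)
print(selected)
```

Note that `delinq_amnt` would pass an uncorrected 0.05 cutoff but is rejected once the threshold is divided by the number of tests.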

In [60]:
# List the variables that were not selected and will be dropped.
drop_list = [col for col in loan_train.columns if col not in var_select['index'].tolist()]
drop_list

['tot_coll_amt',
 'chargeoff_within_12_mths',
 'delinq_amnt',
 'revol_bal_cat',
 'home_ownership_OTHER',
 'purpose_g_debt',
 'region_SouthWest']

In [61]:
loan_select =loan_train.drop(labels=drop_list, axis=1)
loan_test_s =loan_test.drop(labels=drop_list, axis=1)

In [62]:
loan_select.to_csv('../data/2_TrainSet.gz', compression='gzip', index=False)

In [66]:
loan_select.shape

(980547, 62)

In [63]:
loan_test_s.to_csv('../data/2_TestSet.gz', compression='gzip', index=False)