# LASSO

Lasso es un método de regularización que favorece que algunos de los coeficientes terminen valiendo 0. Esto quiere decir que el modelo va a ignorar algunas de las características predictivas, lo que puede ser considerado un tipo de selección automática de características. El incluir menos características supone un modelo más sencillo de interpretar que puede poner de manifiesto las características más importantes del conjunto de datos. En el caso de que exista cierta correlación entre las características predictivas, Lasso tenderá a escoger una de ellas al azar

__Importamos todas las librerías que se van a utilizar en este notebook:__

In [1]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt # plots
import seaborn as sns # plots
from sklearn.ensemble import IsolationForest
from sklearn import preprocessing 
from sklearn.linear_model import Lasso, LassoCV
import pickle

Importamos los datos resultantes del notebook anterior, considerados como data_cleaned, ya que hemos realizado las labores de data engeneering necesarias.

In [2]:
path = ('../data/02_intermediate/data_cleaned.csv') # importamos los datos de la carpteta
data_cleaned = pd.read_csv(path)

In [3]:
data_cleaned.head() # visulalizamos la carga de los datos

Unnamed: 0.1,Unnamed: 0,id,loan_amnt,funded_amnt,funded_amnt_inv,installment,annual_inc,dti,delinq_2yrs,fico_range_low,...,verification_status_Source Verified,verification_status_Verified,purpose_credit_card,purpose_debt_consolidation,purpose_housing,purpose_leisure,purpose_medical,purpose_other,application_type_Individual,application_type_Joint App
0,8,112038251,11575,11575,11575.0,359.26,153000.0,16.99,0,720,...,0,0,1,0,0,0,0,0,1,0
1,10,112149045,7200,7200,7200.0,285.7,50000.0,6.07,0,685,...,1,0,0,1,0,0,0,0,1,0
2,24,112052261,7500,7500,7500.0,232.79,110000.0,13.12,0,710,...,0,0,0,1,0,0,0,0,1,0
3,42,111999259,10000,10000,10000.0,243.29,51979.0,10.11,0,690,...,1,0,0,1,0,0,0,0,1,0
4,91,111808508,14000,14000,14000.0,492.34,75000.0,10.86,1,685,...,0,1,0,1,0,0,0,0,1,0


In [4]:
# en la visualización vemos que se ha añadido una columna así que la eliminamos:
del data_cleaned['Unnamed: 0']

Nombramos y seleccionamos las variables predictoras y la variable target:

In [5]:

predictiveVariables = data_cleaned.loc[:, data_cleaned.columns != 'target'] #serán todas menos la varoables target

target = data_cleaned['target'] # es la variable 'target' de la bbdd importada

## Definimos el modelo LASSO:
Como este modelo va a ser utilizado para seleccionar aquellas variables más relevantes definimos que realice 500 iteraciones y cross validation, además lo realizará con toda la muestra ya que en este caso no será utilizado para futurass predicciones o clasificaciones

In [6]:
%%time
lassocv = LassoCV(alphas = None, # no predefinimos los alphas, queremos que los seleccione el propio modelo
                  cv = 10, # indicamos que el cross validation realice 10K
                  max_iter = 500, # número máximo de iteraciones = 500
                  normalize = True) # normalizamos las variables

# entrenamos el modelo de cross validation con toda la muestra:
lassocv.fit(predictiveVariables, target)

# valores de alpha seleccionados por el modelo:
lassocv.alpha_


# definimos: lasso function, en el que el alpha será el calculado anteriormente
model_lasso = Lasso(alpha=lassocv.alpha_) 
# entrenamos el modelo lasso con toda la muestra
model_lasso.fit(predictiveVariables, target)

Wall time: 2min 38s


  model = cd_fast.enet_coordinate_descent(


Lasso(alpha=4.453967939231418e-07)

Como se ha comentado anteriormente este modelo lo realizamos para conocer las variables más significativas del modelo. Por lo tanto, el objetivo es conocer los coeficientes resulates del modelo Lasso:

In [7]:
# coefficientes lasso de las variables predictoras:
lasso_coefficients = pd.DataFrame(model_lasso.coef_, set(predictiveVariables), columns = ['Coefficients'])

# término independiente del modelo Lasso
lasso_coefficients.loc['Intercept'] = model_lasso.intercept_
lasso_coefficients

# data frame de los coeficientes Lasso
df_lasso = pd.DataFrame(lasso_coefficients)

# ordenamos las variables en función de sus coeficientes
df_lasso_ordered = df_lasso.sort_values(by = "Coefficients")

In [8]:
pd.set_option("display.max_rows", None, "display.max_columns", None) # para que muestre todos los valores del df

print(df_lasso_ordered)

                                     Coefficients
num_bc_sats                         -1.194939e-01
mort_acc                            -1.127738e-01
id                                  -9.558519e-02
home_ownership_ANY                  -1.738155e-02
purpose_credit_card                 -1.451818e-02
num_rev_tl_bal_gt_0                 -1.352649e-02
loan_amnt                           -1.150573e-02
last_fico_range_high                -8.425800e-03
pct_tl_nvr_dlq                      -6.799665e-03
tot_coll_amt                        -5.603795e-03
mths_since_recent_inq               -5.126447e-03
tax_liens                           -4.575474e-03
num_sats                            -3.980897e-03
fico_range_low                      -3.462910e-03
delinq_amnt                         -3.419417e-03
purpose_housing                     -3.237984e-03
mths_since_last_major_derog         -2.501222e-03
avg_cur_bal                         -2.316511e-03
mo_sin_rcnt_rev_tl_op               -2.055137e-03


Considerando el valor absoluto de los coeficientes, hemos determinado que seleccionaremos aquellas variables que su coeficientes sea > |0.001|.

In [9]:
# seleccionamos aquellas variables con coef >  |0.001|
important_variables = lasso_coefficients[(lasso_coefficients['Coefficients'] > 0.001) |
                                         (lasso_coefficients['Coefficients'] < -0.001)]

# visualizamos las variables  y los coeficientes que cumplen este requisito
lasso_coefficients_001 = important_variables.sort_values(['Coefficients'])
lasso_coefficients_001

Unnamed: 0,Coefficients
num_bc_sats,-0.119494
mort_acc,-0.112774
id,-0.095585
home_ownership_ANY,-0.017382
purpose_credit_card,-0.014518
num_rev_tl_bal_gt_0,-0.013526
loan_amnt,-0.011506
last_fico_range_high,-0.008426
pct_tl_nvr_dlq,-0.0068
tot_coll_amt,-0.005604


In [10]:
lasso_coefficients_001.shape

(45, 1)

Vemos que aplicanto este criterio son 45 variables las resultantes, éstas serán las que utilicemos en los próximos modelos:

In [11]:
final_variables = list(lasso_coefficients_001.index) # seleccionamos las variables que cumple el criterio coef >|0.001|
final_variables

['num_bc_sats',
 'mort_acc',
 'id',
 'home_ownership_ANY',
 'purpose_credit_card',
 'num_rev_tl_bal_gt_0',
 'loan_amnt',
 'last_fico_range_high',
 'pct_tl_nvr_dlq',
 'tot_coll_amt',
 'mths_since_recent_inq',
 'tax_liens',
 'num_sats',
 'fico_range_low',
 'delinq_amnt',
 'purpose_housing',
 'mths_since_last_major_derog',
 'avg_cur_bal',
 'mo_sin_rcnt_rev_tl_op',
 'num_tl_op_past_12m',
 'open_acc',
 'emp_length_> 10 years',
 'mths_since_last_delinq',
 'num_accts_ever_120_pd',
 'all_util',
 'open_il_24m',
 'open_act_il',
 'num_tl_30dpd',
 'num_actv_bc_tl',
 'installment',
 'home_ownership_MORTGAGE',
 'emp_length_5-10 years',
 'max_bal_bc',
 'funded_amnt',
 'num_tl_90g_dpd_24m',
 'tot_cur_bal',
 'acc_now_delinq',
 'open_il_12m',
 'verification_status_Not Verified',
 'num_rev_accts',
 'purpose_leisure',
 'mths_since_rcnt_il',
 'total_rev_hi_lim',
 'home_ownership_OWN',
 'Intercept']

Guardamos en un data frame los datos de las variables seleccionadas:

In [14]:
final_df = data_cleaned.loc[:, ['num_tl_30dpd',
                                 'tot_hi_cred_lim',
                                 'revol_bal',
                                 'purpose_debt_consolidation',
                                 'home_ownership_ANY',
                                 'max_bal_bc',
                                 'total_cu_tl',
                                 'open_rv_24m',
                                 'num_tl_120dpd_2m',
                                 'open_il_12m',
                                 'mo_sin_rcnt_rev_tl_op',
                                 'num_il_tl',
                                 'id',
                                 'dti',
                                 'num_tl_90g_dpd_24m',
                                 'funded_amnt',
                                 'purpose_credit_card',
                                 'open_acc_6m',
                                 'inq_last_6mths',
                                 'application_type_Joint App',
                                 'mort_acc',
                                 'policy_code',
                                 'percent_bc_gt_75',
                                 'mths_since_recent_bc_dlq',
                                 'term_ 60 months',
                                 'mths_since_recent_bc',
                                 'num_accts_ever_120_pd',
                                 'last_fico_range_low',
                                 'purpose_medical',
                                 'installment',
                                 'collections_12_mths_ex_med',
                                 'open_il_24m',
                                 'fico_range_high',
                                 'verification_status_Verified',
                                 'purpose_housing',
                                 'mths_since_recent_revol_delinq',
                                 'total_bal_ex_mort',
                                 'num_sats',
                                 'home_ownership_MORTGAGE',
                                 'mths_since_last_record',
                                 'pub_rec',
                                 'delinq_amnt',
                                 'num_bc_tl',
                                 'mths_since_rcnt_il']]
final_df.head()

Unnamed: 0,num_tl_30dpd,tot_hi_cred_lim,revol_bal,purpose_debt_consolidation,home_ownership_ANY,max_bal_bc,total_cu_tl,open_rv_24m,num_tl_120dpd_2m,open_il_12m,mo_sin_rcnt_rev_tl_op,num_il_tl,id,dti,num_tl_90g_dpd_24m,funded_amnt,purpose_credit_card,open_acc_6m,inq_last_6mths,application_type_Joint App,mort_acc,policy_code,percent_bc_gt_75,mths_since_recent_bc_dlq,term_ 60 months,mths_since_recent_bc,num_accts_ever_120_pd,last_fico_range_low,purpose_medical,installment,collections_12_mths_ex_med,open_il_24m,fico_range_high,verification_status_Verified,purpose_housing,mths_since_recent_revol_delinq,total_bal_ex_mort,num_sats,home_ownership_MORTGAGE,mths_since_last_record,pub_rec,delinq_amnt,num_bc_tl,mths_since_rcnt_il
0,0,528172,8550,0,0,1581.0,6.0,8.0,0.0,0.0,3,12,112038251,16.99,0,11575,1,1.0,0,0,2,1,11.1,176.0,0,3.0,0,720,0,359.26,0,0.0,724,0,0,180.0,100865,20,0,84.0,1,0,16,27.0
1,0,7600,3560,1,0,2779.0,0.0,1.0,0.0,0.0,14,1,112149045,6.07,0,7200,0,0.0,0,0,0,1,100.0,176.0,0,14.0,1,665,0,285.7,0,1.0,689,0,0,180.0,5588,4,0,121.0,0,0,3,21.0
2,0,350617,23348,1,0,5965.0,8.0,6.0,0.0,1.0,3,8,112052261,13.12,0,7500,0,1.0,2,0,4,1,8.3,176.0,0,3.0,0,715,0,232.79,0,5.0,714,0,0,180.0,45955,19,1,121.0,0,0,13,7.0
3,0,34200,5733,1,0,3898.0,0.0,5.0,0.0,1.0,6,4,111999259,10.11,0,10000,0,1.0,0,0,0,1,0.0,176.0,1,6.0,0,655,0,243.29,0,2.0,694,0,0,180.0,10956,15,0,55.0,2,0,8,9.0
4,0,170591,2700,1,0,2700.0,0.0,3.0,0.0,1.0,5,3,111808508,10.86,0,14000,0,1.0,0,0,1,1,100.0,176.0,0,8.0,0,680,0,492.34,0,2.0,689,1,0,180.0,27684,4,1,121.0,0,0,5,7.0


Añadimos la variable 'target' al df final que usaremos en nuestros modelos

In [20]:
final_df['target'] = data_cleaned['target'] # renombramos el df ya que va ser el final para nuestros modelos
final_df.head()

Unnamed: 0,num_tl_30dpd,tot_hi_cred_lim,revol_bal,purpose_debt_consolidation,home_ownership_ANY,max_bal_bc,total_cu_tl,open_rv_24m,num_tl_120dpd_2m,open_il_12m,mo_sin_rcnt_rev_tl_op,num_il_tl,id,dti,num_tl_90g_dpd_24m,funded_amnt,purpose_credit_card,open_acc_6m,inq_last_6mths,application_type_Joint App,mort_acc,policy_code,percent_bc_gt_75,mths_since_recent_bc_dlq,term_ 60 months,mths_since_recent_bc,num_accts_ever_120_pd,last_fico_range_low,purpose_medical,installment,collections_12_mths_ex_med,open_il_24m,fico_range_high,verification_status_Verified,purpose_housing,mths_since_recent_revol_delinq,total_bal_ex_mort,num_sats,home_ownership_MORTGAGE,mths_since_last_record,pub_rec,delinq_amnt,num_bc_tl,mths_since_rcnt_il,target
0,0,528172,8550,0,0,1581.0,6.0,8.0,0.0,0.0,3,12,112038251,16.99,0,11575,1,1.0,0,0,2,1,11.1,176.0,0,3.0,0,720,0,359.26,0,0.0,724,0,0,180.0,100865,20,0,84.0,1,0,16,27.0,0
1,0,7600,3560,1,0,2779.0,0.0,1.0,0.0,0.0,14,1,112149045,6.07,0,7200,0,0.0,0,0,0,1,100.0,176.0,0,14.0,1,665,0,285.7,0,1.0,689,0,0,180.0,5588,4,0,121.0,0,0,3,21.0,0
2,0,350617,23348,1,0,5965.0,8.0,6.0,0.0,1.0,3,8,112052261,13.12,0,7500,0,1.0,2,0,4,1,8.3,176.0,0,3.0,0,715,0,232.79,0,5.0,714,0,0,180.0,45955,19,1,121.0,0,0,13,7.0,0
3,0,34200,5733,1,0,3898.0,0.0,5.0,0.0,1.0,6,4,111999259,10.11,0,10000,0,1.0,0,0,0,1,0.0,176.0,1,6.0,0,655,0,243.29,0,2.0,694,0,0,180.0,10956,15,0,55.0,2,0,8,9.0,0
4,0,170591,2700,1,0,2700.0,0.0,3.0,0.0,1.0,5,3,111808508,10.86,0,14000,0,1.0,0,0,1,1,100.0,176.0,0,8.0,0,680,0,492.34,0,2.0,689,1,0,180.0,27684,4,1,121.0,0,0,5,7.0,0


#### Guardamos el modelo y los datos seleccionados:
 - Guardamos el modelo

In [21]:
def save_models(filename, model): # definimos una funcion para guardar los modelos 
    with open(filename, 'wb') as file:
        pickle.dump(model, file) # IMPORTANTE ARCHIVO PICKLE

In [22]:
save_models('data/03_processed/model_lasso.pkl', model_lasso) 

 - Guardamos los datos seleccionados en un CSV para poder utilizarlo en los siguientes modelos

In [23]:
ruta = '../data/03_processed/final_df.csv' # ruta para guardar el csv

In [24]:
final_df.to_csv(ruta)

Por último hemos creado un reporte a modo resumen EDA de los datos que serán utilizados en los modelos

In [26]:
# importing sweetviz
import sweetviz as sv
#analyzing the dataset
final_report = sv.analyze(final_df)
#display the report
final_report.show_html('../data/reporting_06/final_report.html')

HBox(children=(HTML(value=''), FloatProgress(value=0.0, layout=Layout(flex='2'), max=46.0), HTML(value='')), l…


Report final_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
