![](https://www.dii.uchile.cl/wp-content/uploads/2021/06/Magi%CC%81ster-en-Ciencia-de-Datos.png)


# Project: Risk at Giturra Bank

**MDS7202: Laboratorio de Programación Científica para Ciencia de Datos**

### Teaching Staff:

- Instructors: Pablo Badilla, Ignacio Meza De La Jara
- Teaching assistant: Sebastián Tinoco
- Assistants: Diego Cortez M., Felipe Arias T.

_Please read the assignment instructions carefully before you start writing._

---

## Rules

- Due date: 01/06/2021
- **Groups of 2 people.**
- Any questions outside class hours go to the forum. Messages to the teaching staff will be answered through that channel.
- Copying is strictly prohibited.
- You may use any course material you consider useful.


---


# Problem Statement


![](https://www.diarioeldia.cl/u/fotografias/fotosnoticias/2019/11/8/67218.jpg)


**Giturra**, a shrewd and ambitious banker, founded his own bank with the goal of making enormous profits. However, his reputation was tarnished by the usurious interest rates he imposed on his customers. As his bank grew, Giturra faced a growing number of unpaid loans, which threatened both his business and his prestige.

To address this challenge, Giturra recognized the need to reduce lending risk and improve the quality of the loans granted. He decided to take advantage of data science and credit risk analysis, and hired a team of experts to develop a predictive credit risk model.

Note that the models the banker requests must be interpretable, since interpretability lets the team understand and explain how credit decisions are made. Using clear visualizations and detailed explanations, the team should be able to identify the most relevant features, analyze the distribution of variable importance, and assess whether the models are consistent with the business.

To this end, Giturra asks you to build a risk model and provides a wide range of variables about his customers, such as credit histories, income, and other relevant financial factors, to estimate each customer's probability of default. With this information, Giturra will be able to make better-informed lending decisions and offer more favorable terms to customers with a lower risk of default.


## Library Installation and Data Loading


For your project, use the `dataset.pq` dataset to train a model of your choice. The project data also includes a file named `requirements.txt` that lists all the libraries and versions needed for the project. We recommend creating a `conda` environment to install these libraries and avoid version conflicts.


In [3]:
!  pip install -r requirements.txt

Collecting joblib==1.3.1 (from -r requirements.txt (line 1))
  Using cached joblib-1.3.1-py3-none-any.whl (301 kB)
Collecting lightgbm==4.0.0 (from -r requirements.txt (line 2))
  Using cached lightgbm-4.0.0-py3-none-macosx_13_0_arm64.whl
Collecting numpy==1.25.1 (from -r requirements.txt (line 3))
  Using cached numpy-1.25.1-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
Collecting pandas==2.0.3 (from -r requirements.txt (line 4))
  Using cached pandas-2.0.3-cp310-cp310-macosx_11_0_arm64.whl (10.8 MB)
Collecting pytz==2023.3 (from -r requirements.txt (line 6))
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting scikit-learn==1.3.0 (from -r requirements.txt (line 7))
  Using cached scikit_learn-1.3.0-cp310-cp310-macosx_12_0_arm64.whl (9.5 MB)
Collecting scipy==1.11.1 (from -r requirements.txt (line 8))
  Using cached scipy-1.11.1-cp310-cp310-macosx_12_0_arm64.whl (29.6 MB)
Collecting threadpoolctl==3.2.0 (from -r requirements.txt (line 10))
  Using cached threadpoolctl-3.2

In [5]:
# Data Handle
import pandas as pd
import numpy as np
import random

# Viz
#import matplotlib.pyplot as plt
#import seaborn as sns
#import plotly.express as px

# Seed
seed = 42

In [8]:
! pip install pyarrow
! pip install fastparquet

Collecting pyarrow
  Using cached pyarrow-12.0.1-cp310-cp310-macosx_11_0_arm64.whl (22.6 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-12.0.1
Collecting fastparquet
  Downloading fastparquet-2023.7.0-cp310-cp310-macosx_11_0_arm64.whl (583 kB)
Collecting cramjam>=2.3 (from fastparquet)
  Downloading cramjam-2.6.2-cp310-cp310-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (3.2 MB)
Collecting fsspec (from fastparquet)
  Using cached fsspec-2023.6.0-py3-none-any.whl (163 kB)
Installing collected packages: fsspec, cramjam, fastparquet
Successfully installed cramjam-2.6.2 fastparquet-2023.7.0 fsspec-2023.6.0


## 1 Introduction

## 2 Data Loading and EDA

In [9]:
df = pd.read_parquet('/Users/cristobalperez/Desktop/Proyecto_2/dataset.pq')
df

Unnamed: 0,customer_id,age,occupation,annual_income,monthly_inhand_salary,num_bank_accounts,num_credit_card,interest_rate,num_of_loan,delay_from_due_date,...,num_credit_inquiries,outstanding_debt,credit_utilization_ratio,credit_history_age,payment_of_min_amount,total_emi_per_month,amount_invested_monthly,payment_behaviour,monthly_balance,credit_score
0,CUS_0xd40,23.0,Scientist,19114.12,1824.843333,3,4,3,4.0,3,...,4.0,809.98,23.933795,,No,49.574949,24.785217,High_spent_Medium_value_payments,358.124168,0
1,CUS_0x21b1,28.0,Teacher,34847.84,3037.986667,2,4,6,1.0,3,...,2.0,605.03,32.933856,27.0,No,18.816215,218.904344,Low_spent_Small_value_payments,356.078109,0
2,CUS_0x2dbc,34.0,Engineer,143162.64,12187.220000,1,5,8,3.0,8,...,3.0,1303.01,38.374753,18.0,No,246.992319,10000.000000,High_spent_Small_value_payments,895.494583,0
3,CUS_0xb891,55.0,Entrepreneur,30689.89,2612.490833,2,5,4,-100.0,4,...,4.0,632.46,27.332515,17.0,No,16.415452,125.617251,High_spent_Small_value_payments,379.216381,0
4,CUS_0x1cdb,21.0,Developer,35547.71,2853.309167,7,5,5,-100.0,1,...,4.0,943.86,25.862922,31.0,Yes,0.000000,181.330901,High_spent_Small_value_payments,364.000016,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12495,CUS_0x372c,19.0,Lawyer,42903.79,3468.315833,0,4,6,1.0,9,...,1.0,1079.48,35.716618,28.0,No,34.975457,115.184984,High_spent_Medium_value_payments,,0
12496,CUS_0xf16,45.0,Media_Manager,16680.35,,1,1,5,4.0,1,...,8.0,897.16,41.212367,,No,41.113561,70.805550,Low_spent_Large_value_payments,,0
12497,CUS_0xaf61,50.0,Writer,37188.10,3097.008333,1,4,5,3.0,7,...,3.0,620.64,39.300980,30.0,No,84.205949,42.935566,High_spent_Medium_value_payments,,0
12498,CUS_0x8600,29.0,Architect,20002.88,1929.906667,10,8,29,5.0,33,...,9.0,3571.70,37.140784,6.0,Yes,60.964772,34.662906,High_spent_Large_value_payments,,0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12500 entries, 0 to 12499
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               12500 non-null  object 
 1   age                       12500 non-null  float64
 2   occupation                12500 non-null  object 
 3   annual_income             12500 non-null  float64
 4   monthly_inhand_salary     10584 non-null  float64
 5   num_bank_accounts         12500 non-null  int64  
 6   num_credit_card           12500 non-null  int64  
 7   interest_rate             12500 non-null  int64  
 8   num_of_loan               12500 non-null  float64
 9   delay_from_due_date       12500 non-null  int64  
 10  num_of_delayed_payment    11660 non-null  float64
 11  changed_credit_limit      12246 non-null  float64
 12  num_credit_inquiries      12243 non-null  float64
 13  outstanding_debt          12500 non-null  float64
 14  credit

In [25]:
df.describe().T.round(1)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,12500.0,105.8,664.5,-500.0,25.0,33.0,42.0,8678.0
annual_income,12500.0,161620.6,1297842.0,7005.9,19453.3,37572.4,72690.2,23834698.0
monthly_inhand_salary,10584.0,4186.6,3173.7,303.6,1622.4,3087.6,5967.9,15204.6
num_bank_accounts,12500.0,16.9,114.4,-1.0,3.0,6.0,7.0,1756.0
num_credit_card,12500.0,23.2,132.0,0.0,4.0,5.0,7.0,1499.0
interest_rate,12500.0,73.2,468.7,1.0,8.0,14.0,20.0,5789.0
num_of_loan,12500.0,3.1,65.1,-100.0,1.0,3.0,5.0,1495.0
delay_from_due_date,12500.0,21.1,14.9,-5.0,10.0,18.0,28.0,67.0
num_of_delayed_payment,11660.0,32.9,237.4,-3.0,9.0,14.0,18.0,4293.0
changed_credit_limit,12246.0,10.4,6.8,-6.5,5.4,9.4,14.9,37.0


Outliers in:
- age: negative and extreme values
- annual_income: extreme values
- num_bank_accounts: negative and extreme values
- num_of_loan: negative and extreme values
- delay_from_due_date: negative and extreme values
- num_of_delayed_payment: negative and extreme values
- changed_credit_limit: negative values
- num_credit_inquiries: extreme values
- total_emi_per_month: extreme values
- monthly_balance: negative and extreme values
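
One possible way to act on the outlier list above is to replace implausible values with `NaN` so that they can later be imputed like any other missing value. The sketch below runs on a small toy frame built from extreme values seen in `df.describe()`. The bounds in `PLAUSIBLE_RANGES` and the helper `mask_implausible` are illustrative assumptions, not part of the assignment, and should be revisited with domain knowledge.

```python
import numpy as np
import pandas as pd

# Hypothetical plausible ranges (assumptions for illustration only).
PLAUSIBLE_RANGES = {
    "age": (0, 100),
    "num_of_loan": (0, 50),
    "num_bank_accounts": (0, 50),
}


def mask_implausible(frame: pd.DataFrame, ranges: dict) -> pd.DataFrame:
    """Return a copy where values outside [low, high] are replaced by NaN."""
    out = frame.copy()
    for column, (low, high) in ranges.items():
        if column in out.columns:
            # `where` keeps values that satisfy the condition and sets the rest to NaN.
            out[column] = out[column].where(out[column].between(low, high))
    return out


# Toy rows mimicking the extremes reported by df.describe().
toy = pd.DataFrame({
    "age": [23.0, -500.0, 8678.0, 45.0],
    "num_of_loan": [4.0, -100.0, 1495.0, 3.0],
    "num_bank_accounts": [3, -1, 1756, 6],
})
cleaned = mask_implausible(toy, PLAUSIBLE_RANGES)
print(cleaned)
```

Masking instead of dropping rows keeps all 12,500 customers available for training, at the cost of adding more missing values to impute.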

In [33]:
mask = df.loc[:,'age'] <= 18
df.loc[mask,:]

Unnamed: 0,customer_id,age,occupation,annual_income,monthly_inhand_salary,num_bank_accounts,num_credit_card,interest_rate,num_of_loan,delay_from_due_date,...,num_credit_inquiries,outstanding_debt,credit_utilization_ratio,credit_history_age,payment_of_min_amount,total_emi_per_month,amount_invested_monthly,payment_behaviour,monthly_balance,credit_score
36,CUS_0x4080,16.0,Mechanic,29469.980,2227.831667,7,7,24,5.0,53,...,11.0,3421.66,32.962950,13.0,Yes,69.685459,24.066131,High_spent_Medium_value_payments,379.031577,1
45,CUS_0x8e9b,16.0,Entrepreneur,55829.790,,10,10,18,8.0,30,...,11.0,3422.49,37.865668,11.0,Yes,314.901785,255.729501,Low_spent_Small_value_payments,156.316963,0
51,CUS_0xb986,14.0,Developer,39887.220,3224.935000,9,9,16,-100.0,57,...,14.0,3119.60,34.970866,13.0,NM,133.470845,145.940549,High_spent_Small_value_payments,303.082106,0
58,CUS_0x47db,16.0,Musician,78988.480,6449.373333,9,5,29,7.0,27,...,9.0,1746.90,28.955438,10.0,Yes,291.782760,162.081601,!@9#%8,431.072972,1
65,CUS_0x5cdf,18.0,Musician,75273.240,,8,6,32,5.0,50,...,8.0,2497.34,39.124418,5.0,Yes,351.367045,369.491720,Low_spent_Medium_value_payments,343.166224,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12344,CUS_0xbf23,17.0,Accountant,8246.785,582.232083,10,9,22,4.0,39,...,10.0,1528.01,32.856037,9.0,Yes,24.475115,62.001973,!@9#%8,,0
12393,CUS_0x50d4,18.0,Engineer,41159.540,3638.961667,9,6,16,7.0,31,...,10.0,2370.64,22.415697,19.0,Yes,157.126939,321.852891,Low_spent_Small_value_payments,,1
12412,CUS_0x6c58,18.0,Accountant,9409.450,952.120833,5,6,13,7.0,7,...,7.0,209.02,40.709515,8.0,Yes,37.524075,33.744336,Low_spent_Large_value_payments,,0
12438,CUS_0x504e,17.0,_______,72572.460,6233.705000,4,4,7,4.0,29,...,8.0,1321.46,24.551789,20.0,NM,176.374432,414.373280,Low_spent_Large_value_payments,,0


In [34]:
mask = df.loc[:,'age'] >90
df.loc[mask,:]

Unnamed: 0,customer_id,age,occupation,annual_income,monthly_inhand_salary,num_bank_accounts,num_credit_card,interest_rate,num_of_loan,delay_from_due_date,...,num_credit_inquiries,outstanding_debt,credit_utilization_ratio,credit_history_age,payment_of_min_amount,total_emi_per_month,amount_invested_monthly,payment_behaviour,monthly_balance,credit_score
50,CUS_0xb14,3052.0,Manager,49967.01,4091.917500,6,5,31,6.0,27,...,12.0,2253.95,31.520782,17.0,Yes,156.003312,116.361294,High_spent_Medium_value_payments,386.827144,1
67,CUS_0xa156,4431.0,Entrepreneur,58674.66,,8,5,15,4.0,55,...,11.0,2425.38,33.808584,13.0,NM,116.103417,334.269165,Low_spent_Small_value_payments,322.082919,1
147,CUS_0x710f,3115.0,Scientist,33502.10,2890.841667,7,9,27,6.0,57,...,12.0,2362.68,22.183579,18.0,Yes,158.607927,184.548187,Low_spent_Large_value_payments,215.928052,1
343,CUS_0x8c03,4670.0,Scientist,54987.52,,4,5,10,0.0,18,...,8.0,1263.12,29.000913,16.0,Yes,0.000000,420.219964,Low_spent_Large_value_payments,326.009369,0
373,CUS_0xa560,395.0,_______,133410.66,11387.555000,1,3,4,0.0,5,...,3.0,933.84,27.005230,27.0,No,0.000000,518.235781,High_spent_Small_value_payments,880.519719,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12083,CUS_0x6ff5,7445.0,Entrepreneur,43856.38,3870.698333,2,5,6,0.0,4,...,1.0,975.82,29.259389,16.0,No,42213.000000,81.237487,High_spent_Small_value_payments,565.832346,0
12277,CUS_0x42c3,598.0,Entrepreneur,40056.10,,6,4,13,3.0,29,...,2.0,182.58,29.849686,,No,76.833249,82.693506,High_spent_Medium_value_payments,429.574078,0
12398,CUS_0x6334,1683.0,Mechanic,13944362.00,8216.153333,5,6,15,5.0,25,...,4.0,698.23,24.567582,14.0,Yes,302.916240,168.311204,High_spent_Large_value_payments,,0
12436,CUS_0x290c,1070.0,Engineer,28954.04,2703.836667,8,5,25,5.0,36,...,10.0,2077.68,37.483994,16.0,Yes,96.250067,,!@9#%8,,1


In [31]:
(df.isna().sum()/len(df)).round(3)*100

customer_id                  0.0
age                          0.0
occupation                   0.0
annual_income                0.0
monthly_inhand_salary       15.3
num_bank_accounts            0.0
num_credit_card              0.0
interest_rate                0.0
num_of_loan                  0.0
delay_from_due_date          0.0
num_of_delayed_payment       6.7
changed_credit_limit         2.0
num_credit_inquiries         2.1
outstanding_debt             0.0
credit_utilization_ratio     0.0
credit_history_age           9.0
payment_of_min_amount        0.0
total_emi_per_month          0.0
amount_invested_monthly      4.7
payment_behaviour            0.0
monthly_balance              2.8
credit_score                 0.0
dtype: float64

Missing values in:
- monthly_inhand_salary
- num_of_delayed_payment
- changed_credit_limit
- num_credit_inquiries
- credit_history_age
- amount_invested_monthly
- monthly_balance

In [26]:
df.loc[:,'customer_id'].duplicated().sum()

0

In [21]:
df.loc[:,'occupation'].unique()

array(['Scientist', 'Teacher', 'Engineer', 'Entrepreneur', 'Developer',
       'Lawyer', 'Media_Manager', 'Doctor', 'Journalist', 'Manager',
       'Accountant', 'Musician', 'Mechanic', 'Writer', 'Architect',
       '_______'], dtype=object)

In [28]:
mask = df.loc[:,'occupation'] == '_______'
df.loc[mask,:]

Unnamed: 0,customer_id,age,occupation,annual_income,monthly_inhand_salary,num_bank_accounts,num_credit_card,interest_rate,num_of_loan,delay_from_due_date,...,num_credit_inquiries,outstanding_debt,credit_utilization_ratio,credit_history_age,payment_of_min_amount,total_emi_per_month,amount_invested_monthly,payment_behaviour,monthly_balance,credit_score
37,CUS_0x706a,20.0,_______,72559.360,6284.613333,4,5,17,4.0,15,...,5.0,1173.70,38.170982,28.0,No,215.839171,105.717873,High_spent_Medium_value_payments,556.904290,0
44,CUS_0xaedb,20.0,_______,85554.030,7185.502500,4,2,3,0.0,12,...,3.0,1095.73,41.513288,,No,81050.000000,78.798183,High_spent_Large_value_payments,879.752067,0
46,CUS_0x609d,27.0,_______,14165.230,1057.435833,7,10,33,9.0,61,...,11.0,2797.17,35.201445,14.0,NM,58.868441,133.882578,Low_spent_Small_value_payments,202.992565,0
60,CUS_0x7d0b,22.0,_______,19981.600,1459.133333,10,8,20,7.0,54,...,12.0,4834.59,30.813295,10.0,Yes,78.418272,171.639827,Low_spent_Small_value_payments,185.855234,1
64,CUS_0xab76,26.0,_______,60162.100,5197.508333,5,7,5,7.0,18,...,9.0,1037.45,32.078619,13.0,Yes,50812.000000,231.766467,Low_spent_Medium_value_payments,347.770639,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12441,CUS_0x85dd,45.0,_______,8974.555,783.879583,10,8,28,7.0,21,...,7.0,1660.14,33.883240,16.0,Yes,30.443262,57.441336,Low_spent_Small_value_payments,,0
12456,CUS_0x2e91,31.0,_______,70926.060,5741.505000,7,7,10,2.0,13,...,6.0,332.75,32.450983,29.0,No,67.185777,297.237364,Low_spent_Large_value_payments,,0
12469,CUS_0x4a8f,24.0,_______,34493.920,3043.493333,9,10,24,7.0,20,...,13.0,4138.67,30.807734,5.0,Yes,110.383098,348.266986,Low_spent_Small_value_payments,,1
12491,CUS_0xb11c,38.0,_______,15319.650,1460.637500,6,7,15,4.0,54,...,6.0,1453.61,34.557510,11.0,Yes,28.182033,191.877779,Low_spent_Small_value_payments,,1


We can assume that occupation = '_______' means the information is unavailable or was not provided, or that the customer has no stable or formal job.

In [22]:
df.loc[:,'payment_of_min_amount'].unique()

array(['No', 'Yes', 'NM'], dtype=object)

In [23]:
df.loc[:,'payment_behaviour'].unique()

array(['High_spent_Medium_value_payments',
       'Low_spent_Small_value_payments',
       'High_spent_Small_value_payments', '!@9#%8',
       'Low_spent_Medium_value_payments',
       'High_spent_Large_value_payments',
       'Low_spent_Large_value_payments'], dtype=object)

In [29]:
mask = df.loc[:,'payment_behaviour'] == '!@9#%8'
df.loc[mask,:]

Unnamed: 0,customer_id,age,occupation,annual_income,monthly_inhand_salary,num_bank_accounts,num_credit_card,interest_rate,num_of_loan,delay_from_due_date,...,num_credit_inquiries,outstanding_debt,credit_utilization_ratio,credit_history_age,payment_of_min_amount,total_emi_per_month,amount_invested_monthly,payment_behaviour,monthly_balance,credit_score
5,CUS_0x95ee,31.0,Lawyer,73928.46,5988.705000,4,5,8,0.0,8,...,,548.20,31.580990,32.0,No,0.000000,42.635590,!@9#%8,796.234910,0
22,CUS_0xac86,20.0,Entrepreneur,106733.13,8873.427500,4,4,1,0.0,7,...,1.0,76.23,39.243640,33.0,No,0.000000,384.763725,!@9#%8,792.579025,0
27,CUS_0x3edc,32.0,Accountant,43070.24,,3,3,18,-100.0,11,...,8.0,1233.10,26.365700,19.0,Yes,30.576085,403.516809,!@9#%8,218.125773,0
41,CUS_0x6a1b,33.0,Accountant,30788.44,2623.703333,7,9,31,6.0,49,...,9.0,3470.08,38.254422,,Yes,114.533021,242.532349,!@9#%8,195.304963,0
56,CUS_0x3f5b,25.0,Doctor,80108.31,6866.692500,5,3,17,0.0,25,...,4.0,997.28,28.123278,18.0,Yes,0.000000,127.886289,!@9#%8,808.782961,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12436,CUS_0x290c,1070.0,Engineer,28954.04,2703.836667,8,5,25,5.0,36,...,10.0,2077.68,37.483994,16.0,Yes,96.250067,,!@9#%8,,1
12462,CUS_0x1ea8,30.0,Teacher,33896.53,,7,3,18,2.0,26,...,,807.65,26.800550,27.0,Yes,40.293529,88.098247,!@9#%8,,0
12473,CUS_0x62f5,54.0,Musician,99520.50,8479.375000,3,1,4396,3.0,8,...,5.0,547.21,38.798114,16.0,NM,196.528591,147.563516,!@9#%8,,0
12476,CUS_0x64f0,19.0,Architect,39977.21,,4,7,11,0.0,13,...,4.0,832.09,35.328081,30.0,No,0.000000,90.549071,!@9#%8,,0


Not much can be concluded about payment_behaviour = '!@9#%8'; it may simply mean that no information is available for that customer.
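
If these placeholder tokens are treated as "information not available", one option is to map them to an explicit category before encoding. The following is a minimal sketch on toy data. The label `"Unknown"` and the helper `label_placeholders` are assumptions introduced here for illustration.

```python
import pandas as pd

# Placeholder tokens observed in the EDA above, keyed by column.
PLACEHOLDERS = {"occupation": "_______", "payment_behaviour": "!@9#%8"}


def label_placeholders(frame: pd.DataFrame, placeholders: dict, label: str = "Unknown") -> pd.DataFrame:
    """Return a copy where each placeholder token is replaced by an explicit label."""
    out = frame.copy()
    for column, token in placeholders.items():
        # Exact-match replacement; other categories are left untouched.
        out[column] = out[column].replace(token, label)
    return out


toy = pd.DataFrame({
    "occupation": ["Lawyer", "_______"],
    "payment_behaviour": ["!@9#%8", "Low_spent_Small_value_payments"],
})
labeled = label_placeholders(toy, PLACEHOLDERS)
print(labeled)
```

Keeping the placeholder as its own category lets a one-hot encoder learn whether "missing" is itself informative, instead of discarding those rows.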

## 3 Data Preparation

### 3.1 Preprocessing with ColumnTransformer

Convert columns that were read with the wrong type

In [40]:
df.loc[:'num_of_delayed_payment'] 

Unnamed: 0,customer_id,age,occupation,annual_income,monthly_inhand_salary,num_bank_accounts,num_credit_card,interest_rate,num_of_loan,delay_from_due_date,...,num_credit_inquiries,outstanding_debt,credit_utilization_ratio,credit_history_age,payment_of_min_amount,total_emi_per_month,amount_invested_monthly,payment_behaviour,monthly_balance,credit_score
0,CUS_0xd40,23.0,Scientist,19114.12,1824.843333,3,4,3,4.0,3,...,4.0,809.98,23.933795,,No,49.574949,24.785217,High_spent_Medium_value_payments,358.124168,0
1,CUS_0x21b1,28.0,Teacher,34847.84,3037.986667,2,4,6,1.0,3,...,2.0,605.03,32.933856,27.0,No,18.816215,218.904344,Low_spent_Small_value_payments,356.078109,0
2,CUS_0x2dbc,34.0,Engineer,143162.64,12187.220000,1,5,8,3.0,8,...,3.0,1303.01,38.374753,18.0,No,246.992319,10000.000000,High_spent_Small_value_payments,895.494583,0
3,CUS_0xb891,55.0,Entrepreneur,30689.89,2612.490833,2,5,4,-100.0,4,...,4.0,632.46,27.332515,17.0,No,16.415452,125.617251,High_spent_Small_value_payments,379.216381,0
4,CUS_0x1cdb,21.0,Developer,35547.71,2853.309167,7,5,5,-100.0,1,...,4.0,943.86,25.862922,31.0,Yes,0.000000,181.330901,High_spent_Small_value_payments,364.000016,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12495,CUS_0x372c,19.0,Lawyer,42903.79,3468.315833,0,4,6,1.0,9,...,1.0,1079.48,35.716618,28.0,No,34.975457,115.184984,High_spent_Medium_value_payments,,0
12496,CUS_0xf16,45.0,Media_Manager,16680.35,,1,1,5,4.0,1,...,8.0,897.16,41.212367,,No,41.113561,70.805550,Low_spent_Large_value_payments,,0
12497,CUS_0xaf61,50.0,Writer,37188.10,3097.008333,1,4,5,3.0,7,...,3.0,620.64,39.300980,30.0,No,84.205949,42.935566,High_spent_Medium_value_payments,,0
12498,CUS_0x8600,29.0,Architect,20002.88,1929.906667,10,8,29,5.0,33,...,9.0,3571.70,37.140784,6.0,Yes,60.964772,34.662906,High_spent_Large_value_payments,,0


In [43]:
df.loc[:,'age'] = pd.to_numeric(df.loc[:,'age'], errors='coerce', downcast='integer')
df.loc[:,'num_of_loan'] = pd.to_numeric(df.loc[:,'num_of_loan'], errors='coerce', downcast='integer')
df.loc[:,'delay_from_due_date'] = pd.to_numeric(df.loc[:,'delay_from_due_date'], errors='coerce', downcast='integer')
df.loc[:,'num_of_delayed_payment'] = pd.to_numeric(df.loc[:,'num_of_delayed_payment'], errors='coerce', downcast='integer')
df.loc[:,'changed_credit_limit'] = pd.to_numeric(df.loc[:,'changed_credit_limit'], errors='coerce', downcast='integer')
df.loc[:,'num_credit_inquiries'] = pd.to_numeric(df.loc[:,'num_credit_inquiries'], errors='coerce', downcast='integer')
df.loc[:,'credit_history_age'] = pd.to_numeric(df.loc[:,'credit_history_age'], errors='coerce', downcast='integer')

Build the ColumnTransformer

In [35]:
# Machine Learning - Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Set transform_output so that transformers return a DataFrame
from sklearn import set_config
set_config(transform_output='pandas')

In [36]:
# Machine Learning - Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer

In [44]:
df.dtypes

customer_id                  object
age                         float64
occupation                   object
annual_income               float64
monthly_inhand_salary       float64
num_bank_accounts             int64
num_credit_card               int64
interest_rate                 int64
num_of_loan                 float64
delay_from_due_date           int64
num_of_delayed_payment      float64
changed_credit_limit        float64
num_credit_inquiries        float64
outstanding_debt            float64
credit_utilization_ratio    float64
credit_history_age          float64
payment_of_min_amount        object
total_emi_per_month         float64
amount_invested_monthly     float64
payment_behaviour            object
monthly_balance             float64
credit_score                  int64
dtype: object

In [53]:
log_numeric = [
    'age','annual_income','monthly_inhand_salary','delay_from_due_date','outstanding_debt','amount_invested_monthly'
    ]
numeric = [
    'num_of_delayed_payment','num_bank_accounts','num_of_loan','num_credit_inquiries','total_emi_per_month',
    'credit_utilization_ratio','credit_history_age','changed_credit_limit','monthly_balance','num_credit_card',
    'interest_rate',
    'age','annual_income','monthly_inhand_salary','delay_from_due_date','outstanding_debt','amount_invested_monthly'
    ]
categorical = [
    'occupation', 'payment_of_min_amount', 'payment_behaviour'
    ]
ordinal = [
    ]
not_used = [
    'customer_id','credit_score'
    ]

In [None]:
# Log Transformer
LogTransformer = FunctionTransformer(np.log1p)

In [None]:
ct = ColumnTransformer(
    [
        # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
        ('One Hot',OneHotEncoder(sparse_output=False, handle_unknown="ignore",drop='if_binary'),categorical),
        ('Ordinal',OrdinalEncoder(),ordinal),
        ('Log Scaler',LogTransformer,log_numeric),
        ('Scaler',RobustScaler(),numeric)
    ],
    #remainder='passthrough'
)

Test the ColumnTransformer

In [None]:
ct.fit_transform(df)
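
Several columns in `log_numeric` contain negative values (for example, `age` has a minimum of -500 and `delay_from_due_date` a minimum of -5), and `np.log1p` returns `NaN` for inputs below -1. The sketch below shows one possible guard: clipping negatives to zero before taking the logarithm. The function `safe_log1p` is a hypothetical helper, and clipping is an assumption. Masking or correcting those values upstream may be more appropriate.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def safe_log1p(X):
    """Clip negative values to 0, then apply log(1 + x) element-wise."""
    return np.log1p(pd.DataFrame(X).clip(lower=0))


# Hypothetical drop-in alternative to FunctionTransformer(np.log1p).
SafeLogTransformer = FunctionTransformer(safe_log1p)

toy = pd.DataFrame({"delay_from_due_date": [-5, 0, 10]})
result = SafeLogTransformer.fit_transform(toy)
print(result)
```

Existing `NaN` values pass through unchanged, so this transformer still needs to be combined with an imputation step.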

### 3.2 Holdout

In [None]:
# Train-Test Split
from sklearn.model_selection import train_test_split

In [None]:
# First, split the data into train and test
features = df.drop(columns=not_used)
y = df.loc[:, 'credit_score']

X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.2, shuffle=True, stratify=y, random_state=seed
)

### 3.3 Missing Values

In [None]:
def Imputer_Function(X):
   
    return X

In [None]:
imputer = FunctionTransformer(Imputer_Function)
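
The `Imputer_Function` stub above is stateless, so it cannot remember statistics learned from the training set. As a hedged alternative, the sketch below uses scikit-learn's `SimpleImputer` inside a small pipeline, which learns the median during `fit` and reuses it during `transform`. The name `numeric_branch`, the toy data, and the choice of the median are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Impute first, then scale: the scaler never sees NaN values.
numeric_branch = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])

toy_train = pd.DataFrame({"monthly_inhand_salary": [1000.0, np.nan, 3000.0, 5000.0]})
toy_test = pd.DataFrame({"monthly_inhand_salary": [np.nan]})

numeric_branch.fit(toy_train)

# The median is learned from the training data only.
learned_median = numeric_branch.named_steps["imputer"].statistics_[0]
transformed = numeric_branch.transform(toy_test)
print(learned_median)
print(transformed)
```

A branch like this could replace `RobustScaler()` in the `ColumnTransformer`, which keeps imputation inside the pipeline and avoids leaking test-set statistics.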

### 3.4 Feature Engineering

In [None]:
def Feature_Extractor(X):

    return X

In [None]:
feature_extractor = FunctionTransformer(Feature_Extractor)
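
As an illustration of what `Feature_Extractor` might compute, the sketch below adds a debt-to-income ratio from two existing columns. The feature `debt_to_income` and the helper `add_ratio_features` are hypothetical and not required by the assignment. Whether such a ratio helps should be checked empirically.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_ratio_features(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with a hypothetical debt-to-income ratio column."""
    out = X.copy()
    # Treat a zero income as missing to avoid division by zero.
    income = out["annual_income"].replace(0, np.nan)
    out["debt_to_income"] = out["outstanding_debt"] / income
    return out


ratio_extractor = FunctionTransformer(add_ratio_features)

toy = pd.DataFrame({
    "annual_income": [20000.0, 0.0],
    "outstanding_debt": [1000.0, 500.0],
})
enriched = ratio_extractor.fit_transform(toy)
print(enriched)
```

Any column created this way also has to be added to the relevant list (`numeric` or `log_numeric`) so that the `ColumnTransformer` picks it up.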

New ColumnTransformer:
The features produced by the Feature Extractor are added here.

In [None]:
log_numeric = [
    'age','annual_income','monthly_inhand_salary','delay_from_due_date','outstanding_debt','amount_invested_monthly'
    ]
numeric = [
    'num_of_delayed_payment','num_bank_accounts','num_of_loan','num_credit_inquiries','total_emi_per_month',
    'credit_utilization_ratio','credit_history_age','changed_credit_limit','monthly_balance','num_credit_card',
    'interest_rate',
    'age','annual_income','monthly_inhand_salary','delay_from_due_date','outstanding_debt','amount_invested_monthly'
    ]
categorical = [
    'occupation', 'payment_of_min_amount', 'payment_behaviour'
    ]
ordinal = [
    ]
not_used = [
    'customer_id','credit_score'
    ]

In [None]:
ct = ColumnTransformer(
    [
        # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
        ('One Hot',OneHotEncoder(sparse_output=False, handle_unknown="ignore",drop='if_binary'),categorical),
        ('Ordinal',OrdinalEncoder(),ordinal),
        ('Log Scaler',LogTransformer,log_numeric),
        ('Scaler',RobustScaler(),numeric)
    ],
    #remainder='passthrough'
)

## 4 Baseline

In [None]:
# Machine Learning - Models - Classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

### Pipelines

In [None]:
clf_dummy = Pipeline(steps=[
                    #('extractor', feature_extractor),
                    ("preprocessing", ct),
                    #('imputer', imputer),
                    ("model", DummyClassifier(strategy='stratified',random_state=seed) )
])

clf_logit = Pipeline(steps=[
                    #('extractor', feature_extractor),
                    ("preprocessing", ct),
                    #('imputer', imputer),
                    ("model", LogisticRegression(random_state=seed) )
])

In [None]:
clf_tree = Pipeline(steps=[
                    #('extractor', feature_extractor),
                    ("preprocessing", ct),
                    #('imputer', imputer),
                    ("model", DecisionTreeClassifier(random_state=seed) )
])

clf_knn = Pipeline(steps=[
                    #('extractor', feature_extractor),
                    ("preprocessing", ct),
                    #('imputer', imputer),
                    # KNeighborsClassifier is deterministic and takes no random_state
                    ("model", KNeighborsClassifier() )
])

clf_svm = Pipeline(steps=[
                    #('extractor', feature_extractor),
                    ("preprocessing", ct),
                    #('imputer', imputer),
                    ("model", SVC(random_state=seed) )
])

clf_rf = Pipeline(steps=[
                    #('extractor', feature_extractor),
                    ("preprocessing", ct),
                    #('imputer', imputer),
                    ("model", RandomForestClassifier(random_state=seed) )
])

In [None]:
clf_xgb = Pipeline(steps=[
                    #('extractor', feature_extractor),
                    ("preprocessing", ct),
                    #('imputer', imputer),
                    ("model", XGBClassifier(random_state=seed) )
])

clf_lgbm = Pipeline(steps=[
                    #('extractor', feature_extractor),
                    ("preprocessing", ct),
                    #('imputer', imputer),
                    ("model", LGBMClassifier(random_state=seed) )
])

### Evaluation

In [None]:
# Machine Learning - Metrics
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, confusion_matrix, classification_report

In [None]:
metricas = pd.DataFrame(columns=['Modelo','Precision_Test'])

In [None]:
# Fit
clf_dummy.fit(X_train,y_train)

# Predict
y_pred = clf_dummy.predict(X_train)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\nPrecision-Score", round(precision_score(y_train, y_pred), 2))

In [None]:
# Predict
y_pred = clf_dummy.predict(X_test)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\nPrecision-Score", round(precision_score(y_test, y_pred), 2))

In [None]:
nueva_fila = {'Modelo': 'Dummy', 'Precision_Test': precision_score(y_test, y_pred)}

# Append the new row to the DataFrame (DataFrame.append was removed in pandas 2.0)
metricas = pd.concat([metricas, pd.DataFrame([nueva_fila])], ignore_index=True)

In [None]:
# Fit
clf_logit.fit(X_train,y_train)

# Predict
y_pred = clf_logit.predict(X_train)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\nPrecision-Score", round(precision_score(y_train, y_pred), 2))

In [None]:
# Predict
y_pred = clf_logit.predict(X_test)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\nPrecision-Score", round(precision_score(y_test, y_pred), 2))

In [None]:
nueva_fila = {'Modelo': 'Logit', 'Precision_Test': precision_score(y_test, y_pred)}

# Append the new row to the DataFrame (DataFrame.append was removed in pandas 2.0)
metricas = pd.concat([metricas, pd.DataFrame([nueva_fila])], ignore_index=True)

In [None]:
# Fit
clf_tree.fit(X_train,y_train)

# Predict
y_pred = clf_tree.predict(X_train)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\nPrecision-Score", round(precision_score(y_train, y_pred), 2))

In [None]:
# Predict
y_pred = clf_tree.predict(X_test)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\nPrecision-Score", round(precision_score(y_test, y_pred), 2))

In [None]:
nueva_fila = {'Modelo': 'DecisionTree', 'Precision_Test': precision_score(y_test, y_pred)}

# Append the new row to the DataFrame (DataFrame.append was removed in pandas 2.0)
metricas = pd.concat([metricas, pd.DataFrame([nueva_fila])], ignore_index=True)

In [None]:
# Fit
clf_knn.fit(X_train,y_train)

# Predict
y_pred = clf_knn.predict(X_train)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\nPrecision-Score", round(precision_score(y_train, y_pred), 2))

In [None]:
# Predict
y_pred = clf_knn.predict(X_test)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\nPrecision-Score", round(precision_score(y_test, y_pred), 2))

In [None]:
nueva_fila = {'Modelo': 'KNN', 'Precision_Test': precision_score(y_test, y_pred)}

# Append the new row to the DataFrame (DataFrame.append was removed in pandas 2.0)
metricas = pd.concat([metricas, pd.DataFrame([nueva_fila])], ignore_index=True)

In [None]:
# Fit
clf_svm.fit(X_train,y_train)

# Predict
y_pred = clf_svm.predict(X_train)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\nPrecision-Score", round(precision_score(y_train, y_pred), 2))

In [None]:
# Predict
y_pred = clf_svm.predict(X_test)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\nPrecision-Score", round(precision_score(y_test, y_pred), 2))

In [None]:
nueva_fila = {'Modelo': 'SVM', 'Precision_Test': precision_score(y_test, y_pred)}

# Append the new row to the DataFrame (DataFrame.append was removed in pandas 2.0)
metricas = pd.concat([metricas, pd.DataFrame([nueva_fila])], ignore_index=True)

In [None]:
# Fit
clf_rf.fit(X_train,y_train)

# Predict
y_pred = clf_rf.predict(X_train)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\nPrecision-Score", round(precision_score(y_train, y_pred), 2))

In [None]:
# Predict
y_pred = clf_rf.predict(X_test)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\nPrecision-Score", round(precision_score(y_test, y_pred), 2))

In [None]:
nueva_fila = {'Modelo': 'RandomForest', 'Precision_Test': precision_score(y_test, y_pred)}

# Append the new row to the DataFrame (DataFrame.append was removed in pandas 2.0)
metricas = pd.concat([metricas, pd.DataFrame([nueva_fila])], ignore_index=True)

In [None]:
# Fit
clf_xgb.fit(X_train,y_train)

# Predict
y_pred = clf_xgb.predict(X_train)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\nPrecision-Score", round(precision_score(y_train, y_pred), 2))

In [None]:
# Predict
y_pred = clf_xgb.predict(X_test)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\nPrecision-Score", round(precision_score(y_test, y_pred), 2))

In [None]:
nueva_fila = {'Modelo': 'XGBoost', 'Precision_Test': precision_score(y_test, y_pred)}

# Append the new row to the DataFrame (DataFrame.append was removed in pandas 2.0)
metricas = pd.concat([metricas, pd.DataFrame([nueva_fila])], ignore_index=True)

In [None]:
# Fit
clf_lgbm.fit(X_train,y_train)

# Predict
y_pred = clf_lgbm.predict(X_train)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_train, y_pred))
print(classification_report(y_train, y_pred))
print("\nPrecision-Score", round(precision_score(y_train, y_pred), 2))

In [None]:
# Predict
y_pred = clf_lgbm.predict(X_test)

# Results (scikit-learn metrics expect y_true first, then y_pred)
print("Confusion matrix\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("\nPrecision-Score", round(precision_score(y_test, y_pred), 2))

In [None]:
nueva_fila = {'Modelo': 'LGBM', 'Precision_Test': precision_score(y_test, y_pred)}

# Append the new row to the DataFrame (DataFrame.append was removed in pandas 2.0)
metricas = pd.concat([metricas, pd.DataFrame([nueva_fila])], ignore_index=True)

### Model Summary

In [None]:
metricas.sort_values(by='Precision_Test', ascending=False)
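
The per-model cells above repeat the same fit, predict, and score steps. As a sketch of a more compact alternative, the loop below builds the same kind of summary table in one pass. It runs on synthetic data from `make_classification` so that it is self-contained. The helper `evaluate_models` and the toy model dictionary are assumptions for illustration; in the notebook, the dictionary would hold the pipelines defined earlier.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return test precision per model, best first."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # y_true goes first, y_pred second.
        score = precision_score(y_test, y_pred, zero_division=0)
        rows.append({"Modelo": name, "Precision_Test": score})
    summary = pd.DataFrame(rows).sort_values(by="Precision_Test", ascending=False)
    return summary.reset_index(drop=True)


# Synthetic stand-in for the real train/test split.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

toy_models = {
    "Dummy": DummyClassifier(strategy="stratified", random_state=42),
    "Logit": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}
summary = evaluate_models(toy_models, X_tr, X_te, y_tr, y_te)
print(summary)
```

Collecting the rows in a list and building the DataFrame once avoids repeated concatenation and keeps every model on the same metric call.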

## 5 Model Optimization

## 6 Interpretability

## 7 Conclusions