# Dataset processing

This notebook processes the dataset according to the previous analysis. The same processing applies both to the training set and to any set that will be classified with the models. Sharing this processing ensures that the data the models handle always has the same formats and properties.
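
As a rough illustration of the idea of a shared processing step, the sketch below wraps two of the column transformations from this notebook in a single function and applies it to a toy training frame and a toy frame to classify. The function name `process_dataset` and the sample rows are assumptions made for this example; they are not part of the original notebook.

```python
import pandas as pd


def process_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical wrapper: apply the same column steps to any input frame."""
    out = df.copy()
    # Binary flag for the bank's state; nulls are treated as "OH"
    out["BankStateInOhio"] = (out["BankState"].fillna("OH") == "OH").astype("int64")
    out = out.drop(columns=["BankState"])
    # Binary grouping of the number of employees: 0 up to 10, 1 above 10
    out["NoEmpGrouped"] = (out["NoEmp"] > 10).astype("int64")
    out = out.drop(columns=["NoEmp"])
    return out


# Toy data (made up for illustration)
train_df = pd.DataFrame(
    {"BankState": ["OH", "RI", None], "NoEmp": [2, 40, 10], "Accept": [1, 1, 0]}
)
new_df = pd.DataFrame({"BankState": ["CA", "OH"], "NoEmp": [11, 3]})

train_processed = process_dataset(train_df)
new_processed = process_dataset(new_df)
print(train_processed)
print(new_processed)
```

Because both frames go through the same function, the feature columns end up with the same names and dtypes, which is the property the paragraph above asks for.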

In [53]:
# external imports
import os

import pandas as pd

In [54]:
# constants
data_folder = "../../data"
raw_data_folder = f"{data_folder}/raw"
processed_data_folder = f"{data_folder}/processed"

original_train_dataset_path = f"{raw_data_folder}/train.csv"

processed_train_dataset_path = f"{processed_data_folder}/train_processed.csv"

In [55]:
# load the dataset
original_train_df = pd.read_csv(original_train_dataset_path, sep=",")
original_train_df.head()

Unnamed: 0,id,LoanNr_ChkDgt,Name,City,State,Bank,BankState,ApprovalDate,ApprovalFY,NoEmp,...,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,Accept
0,bd9d6267ec5,1523195006,"P-SCAPE LAND DESIGN, LLC",NORTHFIELD,OH,CITIZENS BANK NATL ASSOC,RI,1-Nov-05,2006,2,...,0,2,0,1,N,N,31-Dec-05,"$8,000.00",$0.00,1
1,9eebf6d8098,1326365010,The Fresh & Healthy Catering C,CANTON,OH,"FIRSTMERIT BANK, N.A.",OH,6-Jun-05,2005,2,...,1,2,1,1,N,N,31-Jul-05,"$166,000.00",$0.00,1
2,83806858500,6179584001,AARON MASON & HOWE LLC,SAWYERWOOD,OH,"PNC BANK, NATIONAL ASSOCIATION",OH,18-Mar-03,2003,2,...,4,2,1,2,Y,N,31-Mar-03,"$25,000.00",$0.00,1
3,a21ab9cb3af,8463493009,MID OHIO CAR WASH,COLUMBUS,OH,THE HUNTINGTON NATIONAL BANK,OH,28-Jun-95,1995,2,...,0,0,1,0,N,N,31-Jan-96,"$220,100.00",$0.00,1
4,883b5e5385e,3382225007,Bake N Brew LLC,Newark,OH,THE HUNTINGTON NATIONAL BANK,OH,16-Apr-09,2009,0,...,0,0,0,1,N,N,31-May-09,"$25,000.00",$0.00,0


## Per-column processing

In this section we apply the appropriate processing to each column. A common treatment is to fill in all null values, so that the datasets used for prediction contain no nulls.
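
One possible way to guarantee that prediction datasets contain no nulls is to compute the fill values once and reuse them. The sketch below takes the mode of each column from a toy training frame and applies it to both frames through `DataFrame.fillna` with a dictionary. The frames and the `fill_values` name are assumptions for illustration; the notebook itself computes the mode on whichever frame is being processed.

```python
import pandas as pd

# Toy data (made up for illustration)
train = pd.DataFrame(
    {"NewExist": [1.0, 1.0, 2.0, None], "RevLineCr": ["N", "Y", "N", None]}
)
to_classify = pd.DataFrame({"NewExist": [None, 2.0], "RevLineCr": [None, "Y"]})

# Most frequent value per column, taken from the training data only
fill_values = {col: train[col].mode()[0] for col in ["NewExist", "RevLineCr"]}

# Passing a dict fills each column with its own value and returns a new frame
train_filled = train.fillna(value=fill_values)
to_classify_filled = to_classify.fillna(value=fill_values)

print(to_classify_filled)
```

Assigning the result of `fillna` back, instead of calling it with `inplace=True` on a selected column, also avoids the chained-assignment warning that pandas raises for the latter pattern.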

In [56]:
processed_train_df = original_train_df.copy()
processed_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22835 entries, 0 to 22834
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 22835 non-null  object 
 1   LoanNr_ChkDgt      22835 non-null  int64  
 2   Name               22834 non-null  object 
 3   City               22834 non-null  object 
 4   State              22835 non-null  object 
 5   Bank               22813 non-null  object 
 6   BankState          22813 non-null  object 
 7   ApprovalDate       22835 non-null  object 
 8   ApprovalFY         22835 non-null  int64  
 9   NoEmp              22835 non-null  int64  
 10  NewExist           22833 non-null  float64
 11  CreateJob          22835 non-null  int64  
 12  RetainedJob        22835 non-null  int64  
 13  FranchiseCode      22835 non-null  int64  
 14  UrbanRural         22835 non-null  int64  
 15  RevLineCr          22744 non-null  object 
 16  LowDoc             227

In [57]:
# Column processing
column = "id"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [58]:
# Column processing
column = "LoanNr_ChkDgt"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [59]:
# Column processing
column = "Name"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [60]:
# Column processing
column = "City"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)


In [61]:
# Column processing
column = "State"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [62]:
# Column processing
column = "Bank"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [63]:
# Column processing
column = "BankState"

# Convert to a binary flag: 1 if the bank is in Ohio, 0 otherwise.
# Nulls are treated as "OH". The result of fillna is assigned back
# instead of using inplace=True on the selected column.
processed_train_df[column] = processed_train_df[column].fillna("OH")
processed_train_df.loc[
    processed_train_df[column] == "OH",
    "BankStateInOhio"
] = int(1)

processed_train_df.loc[
    processed_train_df[column] != "OH",
    "BankStateInOhio"
] = int(0)

processed_train_df.drop(columns=column, inplace=True)

processed_train_df["BankStateInOhio"] = processed_train_df["BankStateInOhio"].astype("int64")


In [64]:
# Column processing
column = "ApprovalDate"

# Being a date, it is parsed and reduced to its month
# (the year is not extracted in this cell)
processed_train_df[column] = pd.to_datetime(processed_train_df[column])

processed_train_df["ApprovalDateMonth"] = processed_train_df[column].dt.month
processed_train_df["ApprovalDateMonth"] = processed_train_df["ApprovalDateMonth"].astype("int64")

# Drop the column that is no longer needed
processed_train_df.drop(columns=column, inplace=True)
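
The dates shown in the preview look like `1-Nov-05`, that is, day, abbreviated English month name and two-digit year. If every value follows that pattern (an assumption, since only a few rows are visible here), an explicit `format` can be passed to `pd.to_datetime` so that pandas does not have to infer it. A minimal sketch on made-up values:

```python
import pandas as pd

# Sample values copied from the preview of the dataset
dates = pd.Series(["1-Nov-05", "28-Jun-95", "16-Apr-09"])

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(dates, format="%d-%b-%y")

months = parsed.dt.month.astype("int64")
print(months.tolist())
```

With an explicit format, a value that does not match raises an error instead of being parsed by a fallback, which may or may not be desirable depending on how clean the column is.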


In [65]:
# Column processing
column = "ApprovalFY"
column_grouped = f"{column}Grouped"

fiscal_year_mode = processed_train_df[column].mode()[0]

# Group the years: the 1970s map to 1975, the 1980s map to 1989,
# years after 2025 are replaced by the mode, and any other year is kept
def agrupar_años(year:int):
    if 1970 <= year < 1980:
        return 1975
    elif 1980 <= year < 1990:
        return 1989
    elif year > 2025:
        return fiscal_year_mode
    else:
        return year

processed_train_df[column_grouped] = processed_train_df[column].apply(agrupar_años)
# Cast to integers
processed_train_df[column_grouped] = processed_train_df[column_grouped].astype("int64")
# Drop the column that is no longer needed
processed_train_df.drop(columns=column, inplace=True)
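
To make the year grouping easier to check, the sketch below restates the same rule as a standalone function and runs it on a few sample years. The name `group_year`, the explicit `fallback` parameter and the fallback value 2006 are assumptions for this example; in the notebook the fallback is the mode of the column.

```python
def group_year(year: int, fallback: int) -> int:
    """Same rule as the cell above, with the fallback passed in explicitly."""
    if 1970 <= year < 1980:
        return 1975
    elif 1980 <= year < 1990:
        return 1989
    elif year > 2025:
        # Implausible future year: replace it with the fallback value
        return fallback
    return year


sample_years = [1972, 1985, 1995, 2009, 2030]
grouped = [group_year(y, fallback=2006) for y in sample_years]
print(grouped)  # [1975, 1989, 1995, 2009, 2006]
```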


In [66]:
# Column processing
column = "NoEmp"
column_grouped = f"{column}Grouped"

# Group by number of employees: 0 for up to 10 employees, 1 otherwise
def agrupar_numero_empleado(value:int):
    if value <= 10:
        return 0
    return 1

processed_train_df[column_grouped] = processed_train_df[column].apply(agrupar_numero_empleado)
# Cast to integers
processed_train_df[column_grouped] = processed_train_df[column_grouped].astype("int64")
# Drop the column that is no longer needed
processed_train_df.drop(columns=column, inplace=True)


In [67]:
# Column processing
column = "NewExist"

# Replace nulls with the mode (assigned back instead of inplace=True)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])
# Cast to integers
processed_train_df[column] = processed_train_df[column].astype("int64")

# Any value other than 2 is set to 1
processed_train_df.loc[
    processed_train_df[column] != 2,
    column
] = int(1)


In [68]:
# Column processing
column = "CreateJob"
column_grouped = "CreateJobBinary"

# Replace nulls with the mode (assigned back instead of inplace=True)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])
# Cast to integers
processed_train_df[column] = processed_train_df[column].astype("int64")

# Binary flag: 1 if at least one job was created, 0 otherwise
processed_train_df[column_grouped] = 0
processed_train_df.loc[
    processed_train_df[column] > 0,
    column_grouped
] = int(1)

# Drop the column that is no longer needed
processed_train_df.drop(columns=column, inplace=True)


In [69]:
# Column processing
column = "RetainedJob"
column_grouped = "RetainedJobBinary"

# Replace nulls with the mode (assigned back instead of inplace=True)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])
# Cast to integers
processed_train_df[column] = processed_train_df[column].astype("int64")

# Binary flag: 1 if at least one job was retained, 0 otherwise
processed_train_df[column_grouped] = 0
processed_train_df.loc[
    processed_train_df[column] > 0,
    column_grouped
] = int(1)

# Drop the column that is no longer needed
processed_train_df.drop(columns=column, inplace=True)


In [70]:
# Column processing
column = "FranchiseCode"
column_grouped = "IsFranchise"

# Replace nulls with the mode (assigned back instead of inplace=True)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])
# Cast to integers
processed_train_df[column] = processed_train_df[column].astype("int64")

# Binary flag: 1 for any code other than 0 or 1, 0 otherwise
processed_train_df[column_grouped] = 0
processed_train_df.loc[
    (processed_train_df[column] != 0) &
    (processed_train_df[column] != 1),
    column_grouped
] = int(1)


# Drop the column that is no longer needed
processed_train_df.drop(columns=column, inplace=True)


In [71]:
# Column processing
column = "UrbanRural"

# Replace nulls with the mode (assigned back instead of inplace=True)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])
# Cast to integers
processed_train_df[column] = processed_train_df[column].astype("int64")


In [72]:
# Column processing
column = "RevLineCr"

# Replace nulls with the mode (assigned back instead of inplace=True)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])

# Normalize the values: "N" becomes 0, "Y" and "T" become 1
processed_train_df.loc[
    processed_train_df[column] == "N",
    column
] = int(0)

processed_train_df.loc[
    (processed_train_df[column] == "Y") |
    (processed_train_df[column] == "T"),
    column
] = int(1)

# Any other unexpected value becomes 0
processed_train_df.loc[
    (processed_train_df[column] != 1) &
    (processed_train_df[column] != 0),
    column
] = int(0)

# Cast to integers
processed_train_df[column] = processed_train_df[column].astype("int64")
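
The three `.loc` assignments above amount to "1 if the value is `Y` or `T`, 0 for everything else". As a sketch of a more compact alternative (not the notebook's own code), `Series.isin` produces the same encoding in one step without mixing strings and integers in the column. The sample values are made up for illustration.

```python
import pandas as pd

# Toy column with expected values, an unexpected value and a null
rev_line_cr = pd.Series(["N", "Y", "T", "0", None, "N"])

# Replace nulls with the most frequent value, as in the cell above
filled = rev_line_cr.fillna(rev_line_cr.mode()[0])

# True for "Y" or "T", False for anything else; then cast to 0/1
encoded = filled.isin(["Y", "T"]).astype("int64")
print(encoded.tolist())  # [0, 1, 1, 0, 0, 0]
```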


In [73]:
# Column processing
column = "LowDoc"

# Replace nulls with the mode (assigned back instead of inplace=True)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])

# Normalize the values: "Y" becomes 1, anything else becomes 0
processed_train_df.loc[
    processed_train_df[column] == "Y",
    column
] = int(1)

processed_train_df.loc[
    (processed_train_df[column] != 1),
    column
] = int(0)

# Cast to integers
processed_train_df[column] = processed_train_df[column].astype("int64")


In [74]:
# Column processing
column = "DisbursementDate"

# Drop the column, which is not used
processed_train_df.drop(columns=column, inplace=True)

In [75]:
# Column processing
column = "DisbursementGross"
column_grouped = "DisbursementGrossGrouped"


# Convert the currency string into a number.
# A raw string is used so that "\$" is not treated as an invalid escape.
processed_train_df[column] = processed_train_df[column].replace(r'[\$,]', '', regex=True).astype(float)

# Group the amounts into four ranges
def agrupar_gross(value:float):
    if value <= 50000:
        return 0
    elif 50000 < value <= 250000:
        return 1
    elif 250000 < value <= 1000000:
        return 2
    else:
        return 3

processed_train_df[column_grouped] = processed_train_df[column].apply(agrupar_gross)

# Cast to integers
processed_train_df[column_grouped] = processed_train_df[column_grouped].astype("int64")

# Drop the column that is no longer needed
processed_train_df.drop(columns=column, inplace=True)
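
The same four ranges can also be expressed with `pd.cut`, which may be easier to adjust if the thresholds change. This is only a sketch under the thresholds used above; the sample amounts are made up. By default `pd.cut` builds right-closed intervals, so a value of exactly 50,000 lands in group 0, matching the `<=` comparisons in the function.

```python
import pandas as pd

# Toy currency strings in the same format as the dataset
amounts = pd.Series(["$8,000.00", "$50,000.00", "$166,000.00", "$1,000,000.00", "$2,500,000.00"])

# Strip the dollar sign and thousands separators, then convert to float
numeric = amounts.str.replace(r"[$,]", "", regex=True).astype(float)

# Right-closed bins: (-inf, 50k], (50k, 250k], (250k, 1M], (1M, inf)
groups = pd.cut(
    numeric,
    bins=[float("-inf"), 50_000, 250_000, 1_000_000, float("inf")],
    labels=False,
).astype("int64")
print(groups.tolist())  # [0, 0, 1, 2, 3]
```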


In [76]:
# Column processing
column = "BalanceGross"

# Drop the column, which is not used
processed_train_df.drop(columns=column, inplace=True)

## General processing

Once the columns have been categorized, we remove the duplicate rows as a final step.

In [77]:
processed_train_df.drop_duplicates(inplace=True)
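
Categorizing columns tends to create duplicates, because rows that differed only in the raw values become identical. The toy example below (made-up data) shows this, and also shows that `drop_duplicates` keeps the original index labels, which is why the index of the processed frame can have gaps. Resetting the index is optional and is shown only as a possibility.

```python
import pandas as pd

# Two rows differ only in the raw number of employees
df = pd.DataFrame({"NoEmp": [2, 5, 40], "Accept": [1, 1, 0]})

# After grouping, rows 0 and 1 become identical
df["NoEmpGrouped"] = (df["NoEmp"] > 10).astype("int64")
df = df.drop(columns="NoEmp")

deduplicated = df.drop_duplicates()
print(deduplicated.index.tolist())  # [0, 2]

# Optional: renumber the rows from 0 without keeping the old index
renumbered = deduplicated.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```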

## Saving the processed dataset

In [78]:
processed_train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14434 entries, 0 to 22829
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   NewExist                  14434 non-null  int64
 1   UrbanRural                14434 non-null  int64
 2   RevLineCr                 14434 non-null  int64
 3   LowDoc                    14434 non-null  int64
 4   Accept                    14434 non-null  int64
 5   BankStateInOhio           14434 non-null  int64
 6   ApprovalDateMonth         14434 non-null  int64
 7   ApprovalFYGrouped         14434 non-null  int64
 8   NoEmpGrouped              14434 non-null  int64
 9   CreateJobBinary           14434 non-null  int64
 10  RetainedJobBinary         14434 non-null  int64
 11  IsFranchise               14434 non-null  int64
 12  DisbursementGrossGrouped  14434 non-null  int64
dtypes: int64(13)
memory usage: 1.5 MB


In [79]:
processed_train_df.to_csv(processed_train_dataset_path, sep=",", index=False)