# Dataset processing

Processing of the dataset based on the previous analysis. The same processing applies both to the training set and to any sets that will be classified with the models. Sharing a single processing step ensures that the data the models handle always have the same formats and properties.
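One way to keep the processing identical for every dataset is to wrap it in a single function and call it on each frame. The sketch below is only an illustration of that idea: `process_dataset`, `columns_to_drop` and the toy frames are hypothetical names and data, not part of this notebook.

```python
import pandas as pd


def process_dataset(df: pd.DataFrame, columns_to_drop: list[str]) -> pd.DataFrame:
    """Return a processed copy of df, leaving the input frame untouched."""
    processed = df.copy()
    # errors="ignore" lets the same call run on frames that lack a column,
    # for example a prediction set without the target.
    processed = processed.drop(columns=columns_to_drop, errors="ignore")
    return processed


# Toy frames standing in for the training set and a set to classify (hypothetical data)
train = pd.DataFrame({"id": ["a", "b"], "NoEmp": [2, 5], "Accept": [1, 0]})
new = pd.DataFrame({"id": ["c"], "NoEmp": [7]})

train_processed = process_dataset(train, ["id"])
new_processed = process_dataset(new, ["id"])
```

Because both frames go through the same function, any later change to the processing reaches the training and prediction data at the same time.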

In [1]:
# external imports
import os

import pandas as pd

In [2]:
# constants
data_folder = "../../data"
raw_data_folder = f"{data_folder}/raw"
processed_data_folder = f"{data_folder}/processed"

original_train_dataset_path = f"{raw_data_folder}/train.csv"

processed_train_dataset_path = f"{processed_data_folder}/train_processed.csv"

In [3]:
# load the dataset
original_train_df = pd.read_csv(original_train_dataset_path, sep=",")
original_train_df.head()

Unnamed: 0,id,LoanNr_ChkDgt,Name,City,State,Bank,BankState,ApprovalDate,ApprovalFY,NoEmp,...,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,Accept
0,bd9d6267ec5,1523195006,"P-SCAPE LAND DESIGN, LLC",NORTHFIELD,OH,CITIZENS BANK NATL ASSOC,RI,1-Nov-05,2006,2,...,0,2,0,1,N,N,31-Dec-05,"$8,000.00",$0.00,1
1,9eebf6d8098,1326365010,The Fresh & Healthy Catering C,CANTON,OH,"FIRSTMERIT BANK, N.A.",OH,6-Jun-05,2005,2,...,1,2,1,1,N,N,31-Jul-05,"$166,000.00",$0.00,1
2,83806858500,6179584001,AARON MASON & HOWE LLC,SAWYERWOOD,OH,"PNC BANK, NATIONAL ASSOCIATION",OH,18-Mar-03,2003,2,...,4,2,1,2,Y,N,31-Mar-03,"$25,000.00",$0.00,1
3,a21ab9cb3af,8463493009,MID OHIO CAR WASH,COLUMBUS,OH,THE HUNTINGTON NATIONAL BANK,OH,28-Jun-95,1995,2,...,0,0,1,0,N,N,31-Jan-96,"$220,100.00",$0.00,1
4,883b5e5385e,3382225007,Bake N Brew LLC,Newark,OH,THE HUNTINGTON NATIONAL BANK,OH,16-Apr-09,2009,0,...,0,0,0,1,N,N,31-May-09,"$25,000.00",$0.00,0


## Column-by-column processing

In the following section we apply the appropriate processing to each column. One common treatment is to fill in all null values, so that the datasets used for prediction contain no nulls.
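As a minimal sketch of this null-filling treatment, the fill values can be computed once and then applied to any frame. Reusing the training modes on a prediction set is an assumption made here for illustration. The helpers `learn_fill_values` and `fill_nulls` and the toy data are hypothetical.

```python
import pandas as pd


def learn_fill_values(train_df: pd.DataFrame, columns: list[str]) -> dict:
    """Compute the mode of each listed column (nulls are ignored by mode())."""
    return {col: train_df[col].mode()[0] for col in columns}


def fill_nulls(df: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Return a copy of df where nulls are replaced column by column."""
    return df.fillna(value=fill_values)


# Toy data (hypothetical); the column names match the dataset
train = pd.DataFrame({
    "Bank": ["A", "A", "B", None],
    "BankState": ["OH", None, "OH", "RI"],
})
new = pd.DataFrame({"Bank": [None, "C"], "BankState": ["RI", None]})

fill_values = learn_fill_values(train, ["Bank", "BankState"])
train_filled = fill_nulls(train, fill_values)
new_filled = fill_nulls(new, fill_values)

# Neither frame should contain nulls in the treated columns
remaining_nulls = int(train_filled.isna().sum().sum() + new_filled.isna().sum().sum())
```

Passing a dictionary to `DataFrame.fillna` fills each column with its own value in a single call and returns a new frame, so the original data is not modified.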

In [4]:
processed_train_df = original_train_df.copy()
processed_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22835 entries, 0 to 22834
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 22835 non-null  object 
 1   LoanNr_ChkDgt      22835 non-null  int64  
 2   Name               22834 non-null  object 
 3   City               22834 non-null  object 
 4   State              22835 non-null  object 
 5   Bank               22813 non-null  object 
 6   BankState          22813 non-null  object 
 7   ApprovalDate       22835 non-null  object 
 8   ApprovalFY         22835 non-null  int64  
 9   NoEmp              22835 non-null  int64  
 10  NewExist           22833 non-null  float64
 11  CreateJob          22835 non-null  int64  
 12  RetainedJob        22835 non-null  int64  
 13  FranchiseCode      22835 non-null  int64  
 14  UrbanRural         22835 non-null  int64  
 15  RevLineCr          22744 non-null  object 
 16  LowDoc             227

In [5]:
# Column processing
column = "id"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [6]:
# Column processing
column = "LoanNr_ChkDgt"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [7]:
# Column processing
column = "Name"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [None]:
# Column processing
column = "City"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)


City
COLUMBUS      2147
CINCINNATI    1617
CLEVELAND     1118
TOLEDO         761
DAYTON         729
              ... 
GATE MILLS       1
WAUSSEON         1
MIAMI            1
METAMORA         1
NEFFS            1
Name: count, Length: 1181, dtype: int64
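The counts above show 1181 distinct cities, several of which appear only once. The notebook only states that the column is not relevant for the model. High cardinality is one plausible reason for dropping it, and it can be checked with a small report like the sketch below. The helper `cardinality_report` and the toy frame are hypothetical.

```python
import pandas as pd


def cardinality_report(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Summarise how many distinct values each column has and how many occur once."""
    rows = []
    for col in columns:
        counts = df[col].value_counts()
        rows.append({
            "column": col,
            "unique_values": df[col].nunique(),
            "singletons": int((counts == 1).sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy frame (hypothetical data) with the same column names as the dataset
toy = pd.DataFrame({
    "City": ["COLUMBUS", "COLUMBUS", "TOLEDO", "NEFFS"],
    "State": ["OH", "OH", "OH", "OH"],
})
report = cardinality_report(toy, ["City", "State"])
```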




In [9]:
# Column processing
column = "State"

# Not relevant for the model
processed_train_df.drop(columns=column, inplace=True)

In [10]:
# Column processing
column = "Bank"

# Fill nulls with the mode (assigning the result back avoids the chained inplace call)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])


In [11]:
# Column processing
column = "BankState"

# Fill nulls with the mode (assigning the result back avoids the chained inplace call)
processed_train_df[column] = processed_train_df[column].fillna(processed_train_df[column].mode()[0])


In [12]:
# Column processing
column = "ApprovalDate"

# Since this is a date, it is split into 2 columns, one for the month and one for the year
processed_train_df[column] = pd.to_datetime(processed_train_df[column])

processed_train_df["ApprovalDateMonth"] = processed_train_df[column].dt.month
processed_train_df["ApprovalDateYear"] = processed_train_df[column].dt.year

# Drop the column that is no longer needed
processed_train_df.drop(columns=column, inplace=True)
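When `pd.to_datetime` is called without a `format`, pandas has to infer the layout of the strings. The sketch below passes the layout explicitly, assuming every value follows the day, abbreviated English month, two-digit year pattern seen in the preview rows (for example `1-Nov-05`). That assumption has not been checked against the full column, and the sample series is illustrative.

```python
import pandas as pd

# Sample values copied from the preview of the ApprovalDate column
dates = pd.Series(["1-Nov-05", "6-Jun-05", "28-Jun-95"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(dates, format="%d-%b-%y")

months = parsed.dt.month
years = parsed.dt.year
```

With `%y`, two-digit years from 69 to 99 map to 1969-1999 and 00 to 68 map to 2000-2068, so very old approval dates would need a separate check.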




In [None]:
# Column processing
column = "ApprovalFY"
