scikit-learn==1.5.2

feature-engine==1.8.2

# Feature Engineering with In-House Software

In this notebook, we set up all the feature engineering steps within a Scikit-learn pipeline, using open-source transformers together with those we developed in-house.

# Step 1: Reproducibility: Setting the seed


To ensure reproducibility between runs of the same notebook, and between the research and production environments, it is extremely important that we **set the seed** for every step that involves some element of randomness.
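
As a minimal sketch of what setting the seed guarantees, using a small toy array rather than this notebook's data: two calls to `train_test_split` with the same `random_state` return identical partitions. The array shapes and `test_size` below are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, only for illustration)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed -> same partition on every call
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare X_train, X_test, y_train and y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Without `random_state`, each call shuffles differently, so downstream results could change from run to run.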

In [None]:
#!pip install feature-engine==1.8.2

In [None]:
#! pip install scikit-learn==1.5.2

In [217]:
# data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# for saving the pipeline
import joblib

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Binarizer, FunctionTransformer, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)


from feature_engine.transformation import (
    LogTransformer
)

from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper


import sys
import os
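# Go up one level from the notebooks folder to reach the project root directory.
# Because '__file__' is a quoted string here, the path resolves relative to the current working directory.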
module_path = os.path.abspath(os.path.join(os.path.dirname('__file__'), '..'))
if module_path not in sys.path:
    sys.path.append(module_path)


# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [218]:
# load dataset
data = pd.read_parquet('../data/train.parquet')
# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(766608, 71)


Unnamed: 0,DNI,BASE,CELULAR1,CELULAR2,CELULAR3,12M_MONTO,12M_TASA,18M_MONTO,18M_TASA,24M_MONTO,24M_TASA,36M_MONTO,36M_TASA,EDAD,MARCA_LABORAL,DEPARTAMENTO,PROVINCIA,DISTRITO_INEI,LIMAS,PROPENSION,PLD_NACION,PLD_BCP,PLD_BBVA,PLD_SAGA,PLD_SCOTIA,PLD_C_HUANCAYO,PLD_CREDISCOTIA,PLD_INTERBANK,PLD_C_AREQUIPA,PLD_C_CUSCO,PLD_MIBANCO,PLD_RIPLEY,PLD_C_PIURA,PLD_EFECTIVA,PLD_PICHINCHA,PLD_CONFIANZA,TC_BCP,TC_SAGA,TC_INTERBANK,TC_BBVA,TC_OH,TC_RIPLEY,TC_SCOTIA,TC_CREDISCOTIA,TC_PICHINCHA,TC_CENCOSUD,MENSAJE_TASA,COMPETITIVIDAD,PRINCIPALIDAD_CONSUMO,MENSAJE_VARIACION,ULTIMA_AGRUPACION,ULTIMO_RESULTADO,ULTIMO_MOTIVO,RANGO_RCI,ESTADO_CIVIL,GENERO,veces_acepto_producto,tiempo_desde_ultima_conversion,tiempo_desde_ultima_negacion,intentos_totales,meses_gestionados,dias_ultima_gestion,ultima_gestion,veces_sin_respuesta,veces_solicitud_seguimiento,promedio_dias_entre_gestiones,max_intentos_en_un_mes,veces_respuesta_positiva,veces_respuesta_negativa,_merge_variables,target
0,90000,2024_07,965756532.0,,,6100.0,0.799,6900.0,0.799,6900.0,0.799,6900.0,0.799,69.0,4.INFORMAL,LIMA,LIMA,SAN MARTIN DE PORRES,LIMA NORTE,PROPENSION 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NO APLICA A EXCEPCION DE TASA,,SIN DEUDA CONSUMO,MISMA OFERTA,NO CONTACTO,NO CONTACTO MAQUINA,Abandono en sistema por timeout (CDN),"1. <0%,10%>",Otro,M,0.0,,,24.0,1.0,36.0,NO CONTACTO,24.0,0.0,0.0,24.0,0.0,0.0,1.0,0.0
1,90001,2024_07,995834373.0,,,6500.0,0.799,8600.0,0.799,9800.0,0.799,9800.0,0.799,36.0,3.INDEPENDIENTE,LIMA,LIMA,COMAS,LIMA NORTE,PROPENSION 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,931.0,0.0,0.0,0.0,0.0,0.0,NO APLICA A EXCEPCION DE TASA,OFERTA MAYOR A ALGUNA DEUDA,DEUDA EN BANCOS MEDIANOS,MISMA OFERTA,NO CONTACTO,VOLVER A INTENTAR,SE CORTA LLAMADA SIN MOTIVO,"2. [10%,20%>",Casado,F,,,,,,,NO GESTIONADO,,,,,,,0.0,0.0
2,90002,2024_07,949534932.0,,,5500.0,0.5,7600.0,0.5,9000.0,0.5,9000.0,0.5,36.0,4.INFORMAL,LIMA,LIMA,VILLA MARIA DEL TRIUNFO,LIMA SUR,PROPENSION 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,701.0,0.0,0.0,0.0,0.0,0.0,NO APLICA A EXCEPCION DE TASA,OFERTA MAYOR A ALGUNA DEUDA,DEUDA EN BANCOS MEDIANOS,MISMA OFERTA,NO CONTACTO,VOLVER A INTENTAR,OCUPADO,"1. <0%,10%>",Casado,M,,,,,,,NO GESTIONADO,,,,,,,0.0,0.0
3,90003,2024_07,952301341.0,,,4500.0,0.799,6100.0,0.799,6400.0,0.799,6400.0,0.799,70.0,4.INFORMAL,CALLAO,CALLAO,CALLAO,CALLAO,PROPENSION 2,3564.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NO APLICA A EXCEPCION DE TASA,OFERTA MAYOR A ALGUNA DEUDA,DEUDA EN BANCOS MEDIANOS,MISMA OFERTA,CONTACTO NO EFECTIVO,NEGATIVO,DE VIAJE,"2. [10%,20%>",Otro,F,,,,,,,NO GESTIONADO,,,,,,,0.0,0.0
4,90004,2024_07,,,,5800.0,0.799,6400.0,0.799,6400.0,0.799,6400.0,0.799,36.0,2.DEPEN+INDEPEN,LIMA,LIMA,CARABAYLLO,LIMA NORTE,PROPENSION 1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,NO APLICA A EXCEPCION DE TASA,,SIN DEUDA CONSUMO,MISMA OFERTA,NO CONTACTO,NO CONTACTO MAQUINA,Contestador/Fax,"1. <0%,10%>",Soltero,M,,,,,,,NO GESTIONADO,,,,,,,0.0,0.0


In [219]:
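# keep only the three most recent BASE periods; the 'YYYY_MM' labels sort chronologically as strings
data = data[data.BASE.isin(sorted(data.BASE.unique())[-3:])]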

# Step 2: Separate dataset into train and test

It is important to separate our data into training and testing sets.

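When we engineer features, some techniques learn parameters from data. It is important to learn these parameters only from the train set, so that no information from the test set leaks into the transformations. This helps avoid over-fitting and overly optimistic evaluation.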

Our feature engineering techniques will learn:

- mean (to impute missing numerical values)
- mode (to impute missing categorical values)
- minimum and maximum of each feature (for min-max scaling)

from the train set.

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**
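
The idea of learning parameters from the train set only can be sketched with a toy mean imputation. This assumes plain pandas in place of the feature-engine imputers used later in this notebook, and the values are hypothetical: the mean is computed on the train rows and then reused, unchanged, to fill the test rows.

```python
import numpy as np
import pandas as pd

# Toy train/test columns (hypothetical values)
train = pd.DataFrame({"intentos_totales": [2.0, 4.0, np.nan, 6.0]})
test = pd.DataFrame({"intentos_totales": [np.nan, 100.0]})

# Learn the parameter from the train set only (NaN is skipped by mean())
train_mean = train["intentos_totales"].mean()

# Apply the same learned value to both sets
train_imputed = train.fillna({"intentos_totales": train_mean})
test_imputed = test.fillna({"intentos_totales": train_mean})

print(train_mean)
print(test_imputed)
```

The large test value (100.0) never influences the imputation value, which is the point of fitting on the train set alone.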

In [220]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['DNI','BASE', 'target'], axis=1), # predictive variables
    data['target'], # target
    test_size=0.1, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

X_train.shape, X_test.shape

((190350, 68), (21150, 68))

In [223]:
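# Collect the values observed in the train set for each categorical column.
# NOTE: limpiar_cats is defined in the Config cell further down; the execution
# counts show that cell was run before this one.
valores_permitidos_cats = {}
for columna in limpiar_cats:
    print(f"Variable: {columna}")
    # keep every observed value except empty or whitespace-only strings
    valores = [x for x in X_train[columna].unique() if str(x).strip() != '']
    print(f"Valores permitidos: {valores}")
    valores_permitidos_cats[columna] = valores

# The computed dictionary is then overridden with a fixed, hard-coded version
valores_permitidos_cats = {'ULTIMA_AGRUPACION': ['CONTACTO NO EFECTIVO',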
  None,
  'NO CONTACTO',
  'NO CONTACTO MAQUINA',
  'CONTACTO EFECTIVO'],
 'GENERO': ['M', 'F', None],
 'COMPETITIVIDAD': ['OFERTA MAYOR A ALGUNA DEUDA', None],
 'PRINCIPALIDAD_CONSUMO': ['DEUDA EN BANCOS GRANDES',
  'SIN DEUDA CONSUMO',
  'DEUDA EN BANCOS MEDIANOS',
  'DEUDA EN CAJAS'],
 'MARCA_LABORAL': ['3.INDEPENDIENTE',
  '2.DEPEN+INDEPEN',
  '1.DEPENDIENTE',
  '4.INFORMAL'],
 'PROPENSION': ['PROPENSION 1', 'PROPENSION 2'],
 'LIMAS': ['LIMA CENTRO',
  'FUERA DE LIMA',
  'LIMA OESTE',
  'LIMA NORTE',
  'LIMA ESTE',
  'LIMA SUR',
  'CALLAO'],
 'MENSAJE_VARIACION': ['MISMA OFERTA Y MAS TASA',
  'CON MAS OFERTA Y MISMA TASA',
  'CON MAS OFERTA Y MAS TASA',
  'CON MAS OFERTA',
  'MISMA OFERTA',
  'MISMA OFERTA Y MENOS TASA',
  'MISMA OFERTA Y MISMA TASA',
  'NUEVA OFERTA',
  'MENOS OFERTA Y MAS TASA',
  'MENOS OFERTA Y MISMA TASA',
  'CON MAS OFERTA Y MENOS TASA',
  'MENOS OFERTA Y MENOS TASA'],
 'RANGO_RCI': ['3. [20%,30%>',
  '1. <0%,10%>',
  '4. [30%,60%]',
  '2. [10%,20%>',
  '0. SIN DEUDA'],
 'ESTADO_CIVIL': ['Soltero', 'Casado', 'Otro'],
 'ultima_gestion': ['NO GESTIONADO',
  'NO CONTACTO',
  'CONTACTO NO EFECTIVO',
  'CONTACTO EFECTIVO']}

Variable: ULTIMA_AGRUPACION
Valores permitidos: [None, 'NO CONTACTO', 'CONTACTO NO EFECTIVO', 'CONTACTO EFECTIVO']
Variable: GENERO
Valores permitidos: ['F', 'M', None]
Variable: COMPETITIVIDAD
Valores permitidos: ['OFERTA MAYOR A ALGUNA DEUDA', None]
Variable: PRINCIPALIDAD_CONSUMO
Valores permitidos: ['DEUDA EN BANCOS GRANDES', 'DEUDA EN BANCOS MEDIANOS', 'SIN DEUDA CONSUMO', 'DEUDA EN CAJAS']
Variable: MARCA_LABORAL
Valores permitidos: ['3.INDEPENDIENTE', '2.DEPEN+INDEPEN', '4.INFORMAL', '1.DEPENDIENTE']
Variable: PROPENSION
Valores permitidos: ['PROPENSION 1', 'PROPENSION 2']
Variable: LIMAS
Valores permitidos: ['LIMA NORTE', 'LIMA SUR', 'FUERA DE LIMA', 'LIMA CENTRO', 'LIMA ESTE', 'CALLAO', 'LIMA OESTE']
Variable: MENSAJE_VARIACION
Valores permitidos: ['NUEVA OFERTA', 'CON MAS OFERTA', 'MISMA OFERTA Y MENOS TASA', 'MISMA OFERTA Y MAS TASA', 'MISMA OFERTA Y MISMA TASA', 'CON MAS OFERTA Y MISMA TASA', 'CON MAS OFERTA Y MENOS TASA', 'MISMA OFERTA', 'CON MAS OFERTA Y MAS TASA', 'MENOS

# Target

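The target is binary and heavily imbalanced: fewer than 1% of the rows belong to the positive class. We check the class counts in the train and test sets.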

In [224]:
y_train.value_counts() , y_test.value_counts()

(target
 0.0    189850
 1.0       500
 Name: count, dtype: int64,
 target
 0.0    21090
 1.0       60
 Name: count, dtype: int64)
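
To make the imbalance concrete, the positive rate can be computed directly from the counts printed above. This is a small arithmetic sketch; the `positive_rate` helper is a hypothetical name introduced for illustration.

```python
# Class counts copied from the value_counts() output above
train_counts = {0.0: 189850, 1.0: 500}
test_counts = {0.0: 21090, 1.0: 60}


def positive_rate(counts):
    """Share of rows whose target equals 1."""
    return counts[1.0] / sum(counts.values())


train_rate = positive_rate(train_counts)
test_rate = positive_rate(test_counts)
print(f"train: {train_rate:.4%}, test: {test_rate:.4%}")
```

Both rates are below 0.3%, so metrics that are sensitive to class imbalance deserve attention when a model is trained on these features.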

# Config

In [122]:
bancos_comerciales_pld = ["PLD_BCP", "PLD_BBVA", "PLD_SCOTIA", "PLD_INTERBANK", "PLD_PICHINCHA", "PLD_NACION"]
bancos_comerciales_tc = ["TC_BCP", "TC_BBVA", "TC_SCOTIA", "TC_INTERBANK", "TC_PICHINCHA"]

cajas_ahorro_pld = ["PLD_C_HUANCAYO", "PLD_C_AREQUIPA", "PLD_C_CUSCO", "PLD_C_PIURA", "PLD_EFECTIVA", "PLD_CONFIANZA"]
cajas_ahorro_tc = []  # No TC columns for savings banks (cajas de ahorro)

retail_financieras_pld = ["PLD_SAGA", "PLD_RIPLEY", "PLD_CREDISCOTIA", "PLD_MIBANCO"]
retail_financieras_tc = ["TC_SAGA", "TC_RIPLEY", "TC_CENCOSUD", "TC_OH", "TC_CREDISCOTIA"]

pld_columns = bancos_comerciales_pld + cajas_ahorro_pld + retail_financieras_pld
tc_columns = bancos_comerciales_tc + cajas_ahorro_tc + retail_financieras_tc
col_drop = tc_columns + pld_columns

FEATURES_TO_DROP = ['12M_MONTO', '12M_TASA', '18M_MONTO', '18M_TASA' , 
                    'DEPARTAMENTO', 'PROVINCIA', 'DISTRITO_INEI',
                    '36M_MONTO','36M_TASA','MENSAJE_TASA','MENSAJE_VARIACION',"ULTIMO_RESULTADO","ULTIMO_MOTIVO",'LIMAS']

limpiar_cats =['ULTIMA_AGRUPACION',
 'GENERO',
 'COMPETITIVIDAD',
 'PRINCIPALIDAD_CONSUMO',
 'MARCA_LABORAL',
 'PROPENSION',
 'LIMAS',
 'MENSAJE_VARIACION',
 'RANGO_RCI',
 'ESTADO_CIVIL',
 'ultima_gestion']
# categorical variables with NA in train set
CATEGORICAL_VARS_WITH_NA_FREQUENT = ['GENERO']


CATEGORICAL_VARS_WITH_NA_MISSING = ['ULTIMA_AGRUPACION','COMPETITIVIDAD']


# numerical variables with NA in train set
NUMERICAL_VARS_WITH_NA = ['veces_acepto_producto',
                         'tiempo_desde_ultima_conversion',
                         'tiempo_desde_ultima_negacion',
                         'intentos_totales',
                         'meses_gestionados',
                         'dias_ultima_gestion',
                         'veces_sin_respuesta',
                         'veces_solicitud_seguimiento',
                         'promedio_dias_entre_gestiones',
                         'max_intentos_en_un_mes',
                         'veces_respuesta_positiva',
                         'veces_respuesta_negativa']


# variables to log transform
NUMERICALS_LOG_VARS = ['24M_MONTO', '24M_TASA', 'EDAD']

BINARIZE_VARS = [
    'tiempo_desde_ultima_negacion',
     'dias_ultima_gestion',
     'veces_sin_respuesta',
     'promedio_dias_entre_gestiones',
     'veces_respuesta_negativa',
     'Bancos_PLD_Total',
     'Cajas_PLD_Total',
     'Retail_PLD_Total',
     'Bancos_TC_Total',
     'Retail_TC_Total'
]


QUAL_MAPPINGS = {
    "MARCA_LABORAL": {'2_DEPEN_INDEPEN': 1, '3_INDEPENDIENTE': 3, '1_DEPENDIENTE': 2, '4_INFORMAL': 4},
    "PRINCIPALIDAD_CONSUMO": {"SIN_DEUDA_CONSUMO": 0 , "DEUDA_EN_CAJAS":3 , 
                              "DEUDA_EN_BANCOS_MEDIANOS":2 , "DEUDA_EN_BANCOS_GRANDES":1},
    "ESTADO_CIVIL": {"Soltero":2,"Casado":1,"Missing":1 , "Otro":1},
    "ULTIMA_AGRUPACION": {"NO_GESTIONADO": 2 , "NO_CONTACTO": 0 , "CONTACTO_NO_EFECTIVO":1,
                          "CONTACTO_EFECTIVO":1 , "Missing":2 , "NO_CONTACTO_MAQUINA":0},
    "ultima_gestion": {"NO_GESTIONADO": 2, 'Missing':2, "NO_CONTACTO": 0 , "CONTACTO_NO_EFECTIVO":1,
                       "CONTACTO_EFECTIVO":1},
    "GENERO": {"F":0,"M":1,"Missing":0 , '':0 , None:0},
}

def transformar_mensaje_variacion(X):
    X = X.copy()
    X["ESTADO_TASA"] = X["MENSAJE_VARIACION"].apply(lambda x: 2 if "MENOS_TASA" in str(x) else 
                                                          (1 if "MISMA_TASA" in str(x) else 
                                                           (3 if "NUEVA" in str(x) else 0)))
    
    X["ESTADO_OFERTA"] = X["MENSAJE_VARIACION"].apply(lambda x: 0 if "MENOS_OFERTA" in str(x) else 
                                                             (1 if "MISMA_OFERTA" in str(x) else 
                                                              (3 if "NUEVA" in str(x) else 2)))
    
    X["NUEVA_OFERTA"] = X["MENSAJE_VARIACION"].apply(lambda x: 1 if "NUEVA" in str(x) else 0)
    
    return X

class DeudaTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # Nothing to fit
    
    def transform(self, X):
        X = X.copy()  # Avoid modifying the original data

        # Compute debt totals
        X["Bancos_PLD_Total"] = X[bancos_comerciales_pld].sum(axis=1)
        X["Cajas_PLD_Total"] = X[cajas_ahorro_pld].sum(axis=1)
        X["Retail_PLD_Total"] = X[retail_financieras_pld].sum(axis=1)
        X["PLD_Total"] = X[pld_columns].sum(axis=1)

        X["Bancos_TC_Total"] = X[bancos_comerciales_tc].sum(axis=1)
        X["Retail_TC_Total"] = X[retail_financieras_tc].sum(axis=1)
        X["TC_Total"] = X[tc_columns].sum(axis=1)

        # Count entities with non-zero debt
        X["Bancos_PLD_Entidades"] = (X[bancos_comerciales_pld] != 0).sum(axis=1)
        X["Cajas_PLD_Entidades"] = (X[cajas_ahorro_pld] != 0).sum(axis=1)
        X["Retail_PLD_Entidades"] = (X[retail_financieras_pld] != 0).sum(axis=1)
        X["PLD_Entidades"] = (X[pld_columns] != 0).sum(axis=1)

        X["Bancos_TC_Entidades"] = (X[bancos_comerciales_tc] != 0).sum(axis=1)
        X["Retail_TC_Entidades"] = (X[retail_financieras_tc] != 0).sum(axis=1)
        X["TC_Entidades"] = (X[tc_columns] != 0).sum(axis=1)

        # Create binary variables
        X["TC_Entidades_Mas3"] = X["TC_Entidades"].map(lambda x: 1 if x > 3 else 0)
        X["Tiene_Deuda_PLD"] = X["PLD_Entidades"].map(lambda x: 1 if x > 0 else 0)

        return X.drop(columns=col_drop)  # Drop the original debt columns

# ================================
# 📌 Transformer for phone numbers (celulares)
# ================================
class CelularTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # Nothing to fit
    
    def transform(self, X):
        X = X.copy()
        # Count the non-null values across every column whose name contains "CEL"
        X["CANTIDAD_CELULARES"] = X[[col for col in X.columns if "CEL" in col]].notna().sum(axis=1)
        return X.drop(columns=["CELULAR1", "CELULAR2", "CELULAR3"], errors="ignore")  # Drop the original phone columns

class ConversionColumnas(BaseEstimator , TransformerMixin):
    def fit(self , X , y = None):
        return self
    
    def transform(self , X):
        data = X.copy()
        data["PROPENSION"] = data["PROPENSION"].str.split("_").str[1].astype(int)
        data["RANGO_RCI"] = data["RANGO_RCI"].str.split("_").str[0].astype(int)
        data["COMPETITIVIDAD"] = data["COMPETITIVIDAD"].map(lambda x : 0 if x =='Missing' else 1)
        return data

import re
class LimpiarCategorias(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables  # List of categorical variables

    def fit(self, X, y=None):
        return self  # Nothing to learn

    def transform(self, X):
        X = X.copy()
        for col in self.variables:
            # None becomes "Missing"; any run of non-word characters becomes "_"
            X[col] = X[col].apply(lambda val: "Missing" if val is None else re.sub(r'\W+', '_', str(val))).astype("object")
        return X
    
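# NOTE: `columnas_dummies` is not defined in this notebook and this transformer
# is not used in the pipeline below, so it is commented out to avoid a NameError.
# onehot_transformer = ColumnTransformer([
#     ('onehot', OneHotEncoder(handle_unknown='ignore', drop='first'), columnas_dummies)
# ], remainder='passthrough')  # Keeps the other columns unchanged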



class CustomMapper(BaseEstimator, TransformerMixin):
    def __init__(self, mappings, default_value=-1):  
        self.mappings = mappings
        self.default_value = default_value  

    def fit(self, X, y=None):
        return self  
    
    def transform(self, X):
        X = X.copy()
        for col, mapping in self.mappings.items():
            print(f"📌 Columna: {col}")  # <-- Debugging

            X[col] = X[col].apply(lambda x: self.clean_value(x, mapping))
            
            print(f"✅ Valores después del mapeo: {X[col].unique()}")  # <-- Debugging
        return X
    
    def clean_value(self, x, mapping):
        """Normalise a value before mapping it."""
        if pd.isna(x) or x in [None, "", " ", "  "]:  # Treat NaN, None and blank strings as "Missing"
            x = "Missing"
        
        x = str(x).strip()  # Make sure there is no surrounding whitespace
        return mapping.get(x, self.default_value)  # Fall back to the default if the value is not in the dictionary

class ValueFilterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, allowed_values):
        """
        Transformer that replaces disallowed values in the given columns with None
        and builds a report of the values it modified.

        :param allowed_values: Dictionary mapping each column to its list of allowed values.
        """
        self.allowed_values = allowed_values
        self.report = {}

    def fit(self, X, y=None):
        # Initialise the report
        self.report = {col: {} for col in self.allowed_values.keys()}
        return self  # Nothing to fit

    def transform(self, X):
        X = X.copy()  # Avoid modifying the original DataFrame
        
        for col, allowed in self.allowed_values.items():
            if col in X.columns:
                # Count the disallowed values before replacing them
                mask_invalid = ~X[col].isin(allowed)  # Values that are NOT in the allowed list
                counts = X.loc[mask_invalid, col].value_counts()

                # Store the counts in the report
                self.report[col] = counts.to_dict()

                # Replace disallowed values with None
                X.loc[mask_invalid, col] = None
        
        return X

    def get_report(self):
        """Return the report of values removed from each column."""
        return self.report
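
The keys in `QUAL_MAPPINGS` use underscores (for example `'2_DEPEN_INDEPEN'`) because `LimpiarCategorias` rewrites the raw labels before the mapper sees them. The sketch below reproduces that rule on a few labels taken from the allowed-values dictionary; the `normalize_label` helper is a hypothetical stand-alone version of the lambda inside the transformer.

```python
import re


def normalize_label(val):
    """Same rule as LimpiarCategorias: None -> 'Missing', non-word runs -> '_'."""
    return "Missing" if val is None else re.sub(r"\W+", "_", str(val))


raw_labels = ["2.DEPEN+INDEPEN", "SIN DEUDA CONSUMO", "PROPENSION 2", "3. [20%,30%>", None]
cleaned = [normalize_label(v) for v in raw_labels]
print(cleaned)

# ConversionColumnas then splits on '_' to recover the numeric part
propension = int(cleaned[2].split("_")[1])  # from 'PROPENSION_2'
rango_rci = int(cleaned[3].split("_")[0])   # from '3_20_30_'
print(propension, rango_rci)
```

This also shows why the order of the steps matters: `ConversionColumnas` and `CustomMapper` both expect the cleaned, underscore-separated labels.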

# Pipeline - Feature engineering

In [192]:
from feature_engine.encoding import OrdinalEncoder

In [193]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['DNI','BASE', 'target'], axis=1), # predictive variables
    data['target'], # target
    test_size=0.1, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

X_train.shape, X_test.shape

((689947, 68), (76661, 68))

In [194]:
data.BASE.unique()

array(['2024_07', '2024_08', '2024_09', '2024_10', '2024_11', '2024_12'],
      dtype=object)

In [195]:
# set up the pipeline
price_pipe = Pipeline([
    ('filter_values', ValueFilterTransformer(allowed_values=valores_permitidos_cats)),
    ('limpiar_categorias', LimpiarCategorias(variables=limpiar_cats)),

    ('deuda_transformer', DeudaTransformer()),      # Adds debt totals and counts entities
    ('celular_transformer', CelularTransformer()),  # Counts phone numbers and drops the original columns
    ('columnas_transformer', ConversionColumnas()),

    # ========================== IMPUTATION ==========================
    # add missing indicator
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),

    # impute numerical variables with the mean
    ('mean_imputation', MeanMedianImputer(
        imputation_method='mean', variables=NUMERICAL_VARS_WITH_NA
    )),

    # impute categorical variables with the most frequent category
    ('frequent_imputation', CategoricalImputer(
        imputation_method='frequent', variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),

    # ===================== VARIABLE TRANSFORMATION ======================
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),

    ('binarizer', SklearnTransformerWrapper(
        transformer=Binarizer(threshold=0), variables=BINARIZE_VARS)),

    # =========================== MAPPERS ===============================
    ('custom_mapper', CustomMapper(QUAL_MAPPINGS)),

    ('mapper_function', FunctionTransformer(transformar_mensaje_variacion)),

    ('drop_features', DropFeatures(features_to_drop=FEATURES_TO_DROP)),
    ('MinMaxScaler', MinMaxScaler()),
])
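
Later steps in this pipeline depend on columns created by earlier ones (for example, `BINARIZE_VARS` includes totals that only exist after `DeudaTransformer` runs), so it can help to inspect the data just before the final scaler. A minimal sketch, assuming a toy two-step pipeline rather than `price_pipe`: slicing a fitted `Pipeline` returns a sub-pipeline that reuses the already fitted steps. The `add_total` function and the toy columns are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def add_total(X):
    """Toy feature-creation step: adds a column derived from two others."""
    X = X.copy()
    X["total"] = X["a"] + X["b"]
    return X


toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

pipe = Pipeline([
    ("add_total", FunctionTransformer(add_total)),
    ("scale", MinMaxScaler()),
])
pipe.fit(toy)

# Everything except the last step: still a DataFrame with named columns
before_scaling = pipe[:-1].transform(toy)
print(before_scaling.columns.tolist())

# Full pipeline: a NumPy array produced by the scaler
after_scaling = pipe.transform(toy)
print(after_scaling.shape)
```

The same slicing pattern should apply to a longer pipeline, for instance to check which columns reach a given step.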

In [196]:
# train the pipeline
price_pipe.fit(X_train, y_train)

📌 Columna: MARCA_LABORAL
✅ Valores después del mapeo: [3 1 2 4]
📌 Columna: PRINCIPALIDAD_CONSUMO
✅ Valores después del mapeo: [1 0 2 3]
📌 Columna: ESTADO_CIVIL
✅ Valores después del mapeo: [2 1]
📌 Columna: ULTIMA_AGRUPACION
✅ Valores después del mapeo: [1 2 0]
📌 Columna: ultima_gestion
✅ Valores después del mapeo: [2 0 1]
📌 Columna: GENERO
✅ Valores después del mapeo: [1 0]


In [226]:
price_pipe

In [210]:
X_train_ = price_pipe.transform(X_train)

📌 Columna: MARCA_LABORAL
✅ Valores después del mapeo: [3 1 4 2]
📌 Columna: PRINCIPALIDAD_CONSUMO
✅ Valores después del mapeo: [1 2 0 3]
📌 Columna: ESTADO_CIVIL
✅ Valores después del mapeo: [2 1]
📌 Columna: ULTIMA_AGRUPACION
✅ Valores después del mapeo: [2 0 1]
📌 Columna: ultima_gestion
✅ Valores después del mapeo: [2 0 1]
📌 Columna: GENERO
✅ Valores después del mapeo: [0 1]


In [211]:
%time X_test_ = price_pipe.transform(X_test)

📌 Columna: MARCA_LABORAL
✅ Valores después del mapeo: [4 2 1 3]
📌 Columna: PRINCIPALIDAD_CONSUMO
✅ Valores después del mapeo: [2 0 1 3]
📌 Columna: ESTADO_CIVIL
✅ Valores después del mapeo: [1 2]
📌 Columna: ULTIMA_AGRUPACION
✅ Valores después del mapeo: [0 1 2]
📌 Columna: ultima_gestion
✅ Valores después del mapeo: [2 0 1]
📌 Columna: GENERO
✅ Valores después del mapeo: [1 0]
CPU times: total: 484 ms
Wall time: 491 ms


In [212]:
X_train_ = pd.DataFrame(X_train_, columns = price_pipe.named_steps["drop_features"].get_feature_names_out())
X_test_ = pd.DataFrame(X_test_, columns = price_pipe.named_steps["drop_features"].get_feature_names_out())

In [None]:
X_train_.to_parquet('../data/xtrain_postprocess.parquet')
X_test_.to_parquet('../data/xtest_postprocess.parquet')
pd.DataFrame(y_train,columns=["target"]).reset_index(drop=True).to_parquet('../data/ytrain_postprocess.parquet')
pd.DataFrame(y_test,columns=["target"]).reset_index(drop=True).to_parquet('../data/ytest_postprocess.parquet')

## Export the Pipeline


In [229]:
joblib.dump(price_pipe, '../src/pipeline_preprocesamiento.pkl')


['../src/pipeline_preprocesamiento.pkl']
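
A quick way to gain confidence in the exported file is a round trip: load it back and check that it transforms data identically. The sketch below does this with a toy pipeline and a temporary directory, so the file name and data are hypothetical. Note that pickle stores custom classes by reference, so the classes defined in this notebook (such as `DeudaTransformer`) would need to be importable wherever the real file is loaded.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy fitted pipeline (stand-in for price_pipe)
X = np.array([[1.0], [2.0], [3.0]])
pipe = Pipeline([("scale", MinMaxScaler())]).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original output
round_trip_ok = np.allclose(pipe.transform(X), restored.transform(X))
print(round_trip_ok)
```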

In [201]:
# Check that there are no null (NA) values in the train set
[var for var in X_train_.columns if X_train_[var].isnull().sum() > 0]

[]

In [202]:
# Check that there are no null (NA) values in the test set
[var for var in X_test_.columns if X_test_[var].isnull().sum() > 0]

[]
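
The same check can be written without a Python loop. A small sketch on a toy frame (hypothetical data), using `isnull().any()` to build a boolean mask over the columns:

```python
import numpy as np
import pandas as pd

# Toy frame with one column containing a missing value
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 3.0], "c": [0.0, 0.0]})

# Columns with at least one null value
cols_with_na = toy.columns[toy.isnull().any()].tolist()
print(cols_with_na)
```

An empty list, as in the two outputs above, means no column contains nulls.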

In [216]:
X_train_.shape

(190350, 57)