# 02 - Preprocessing Pipeline Demo

**Objective:** Visualize and verify each step of the cleaning and feature-transformation pipeline defined in `src/`.

**Inputs:**
- `data/raw/insurance.csv`

**Outputs:**
- `data/processed/train_proc.parquet` (processed artifact)
- Visual checks of the data before and after processing.

**Dependencies:**
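- Requires the modules `src/clean.py`, `src/features.py`, and `src/config.py` to be implemented.
- The setup cell also imports `preprocess` from `src.train` and `setup_logger` from `src.logging_utils`.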

In [22]:
%load_ext autoreload
%autoreload 2
import sys
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Add the parent directory (project root) to the path so the src package can be imported
sys.path.append('..')

from src.train import preprocess
from src.config import PROC_PATH, RAW_TRAIN
from src.logging_utils import setup_logger

logger = setup_logger("preprocess_demo")
sns.set(style="whitegrid")

logger.info("Notebook 02_preprocess_demo started.")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
2025-08-23 16:35:09.877 | INFO     | __main__:<module>:18 - Notebook 02_preprocess_demo started.
2025-08-23 16:35:09 | INFO | Notebook 02_preprocess_demo started.


In [23]:
logger.info("Running the preprocess() function...")
# Run the function that orchestrates all cleaning and feature engineering
processed_df = preprocess(raw_path=f"../{RAW_TRAIN}")
logger.info("Preprocessing complete.")

logger.info(f"Processed DataFrame dimensions: {processed_df.shape}")
logger.info("Processed DataFrame columns:")

# Print the processed column names as quoted, comma-terminated entries
for col in processed_df.columns:
    print(f'    "{col}",')

2025-08-23 16:35:25.965 | INFO     | __main__:<module>:1 - Running the preprocess() function...
2025-08-23 16:35:25 | INFO | Running the preprocess() function...
2025-08-23 16:35:28.651 | INFO     | __main__:<module>:4 - Preprocessing complete.
2025-08-23 16:35:28 | INFO | Preprocessing complete.
2025-08-23 16:35:28.652 | INFO     | __main__:<module>:6 - Processed DataFrame dimensions: (1200000, 40)
2025-08-23 16:35:28 | INFO | Processed DataFrame dimensions: (1200000, 40)
2025-08-23 16:35:28.652 | ...

In [24]:
processed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 40 columns):
 #   Column                      Non-Null Count    Dtype         
---  ------                      --------------    -----         
 0   id                          1200000 non-null  int64         
 1   Age                         1200000 non-null  float64       
 2   Annual Income               1200000 non-null  float64       
 3   Number of Dependents        1200000 non-null  float64       
 4   Education Level             1200000 non-null  int16         
 5   Health Score                1200000 non-null  float64       
 6   Policy Type                 1200000 non-null  int16         
 7   Previous Claims             1200000 non-null  float64       
 8   Vehicle Age                 1200000 non-null  float64       
 9   Credit Score                1200000 non-null  float64       
 10  Insurance Duration          1200000 non-null  float64       
 11  Policy Start Date       
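
The listing above is cut off, but the visible rows report `Education Level` and `Policy Type` as `int16`, which would be consistent with an ordinal encoding of those categories. The following is a minimal sketch of how such an encoding might look, not the implementation in `src/`. The label names and their order in `EDUCATION_ORDER` are assumptions made for illustration, and `ordinal_encode` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical label order, lowest to highest. The real mapping is defined
# somewhere in src/ and is not shown in this notebook.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "PhD"]


def ordinal_encode(series, order):
    """Map each label to its position in `order` and store the result as int16."""
    cat = pd.Categorical(series, categories=order, ordered=True)
    # Categorical codes are -1 for missing values and for labels not in `order`.
    return pd.Series(cat.codes, index=series.index, name=series.name).astype("int16")


demo_levels = pd.Series(["PhD", "High School", None, "Master's"], name="Education Level")
demo_codes = ordinal_encode(demo_levels, EDUCATION_ORDER)
print(demo_codes.tolist())
```

Because missing labels come out as `-1` in this sketch, a pipeline built this way would need to impute or otherwise handle missing values before encoding if the processed column is expected to start at 0.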

In [26]:
processed_df.head()

| | id | Age | Annual Income | Number of Dependents | Education Level | Health Score | Policy Type | Previous Claims | Vehicle Age | Credit Score | ... | Location_Urban | Location_Unknown | Exercise Frequency_Monthly | Exercise Frequency_Rarely | Exercise Frequency_Weekly | Exercise Frequency_Unknown | Property Type_Condo | Property Type_House | Property Type_Unknown | event_timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19.0 | 10049.0 | 1.0 | 1 | 22.598761 | 2 | 2.0 | 17.0 | 372.0 | ... | True | False | False | False | True | False | False | True | False | 2023-12-23 15:21:39.134960 |
| 1 | 1 | 39.0 | 31678.0 | 3.0 | 2 | 15.569731 | 1 | 1.0 | 12.0 | 694.0 | ... | False | False | True | False | False | False | False | True | False | 2023-06-12 15:21:39.111551 |
| 2 | 2 | 23.0 | 25602.0 | 3.0 | 0 | 47.177549 | 2 | 1.0 | 14.0 | 595.0 | ... | False | False | False | False | True | False | False | True | False | 2023-09-30 15:21:39.221386 |
| 3 | 3 | 21.0 | 138543.02 | 2.0 | 1 | 10.938144 | 0 | 1.0 | 0.0 | 367.0 | ... | False | False | False | False | False | False | False | False | False | 2024-06-12 15:21:39.226954 |
| 4 | 4 | 21.0 | 39651.0 | 1.0 | 1 | 20.376094 | 2 | 0.0 | 8.0 | 598.0 | ... | False | False | False | False | True | False | False | True | False | 2021-12-01 15:21:39.252145 |


In [20]:
processed_df.select_dtypes(include=['float64']).columns

Index(['Age', 'Annual Income', 'Number of Dependents', 'Health Score',
       'Previous Claims', 'Vehicle Age', 'Credit Score', 'Insurance Duration',
       'Premium Amount', 'psd_year', 'psd_month', 'psd_dow', 'psd_month_sin',
       'psd_month_cos'],
      dtype='object')
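
The `psd_*` columns listed above look like date-derived features, with `psd_month_sin` and `psd_month_cos` forming a cyclical encoding of the month. The sketch below shows one common way to build such features. It assumes the prefix `psd` refers to `Policy Start Date` and that the month is mapped onto a 12-step circle; neither is confirmed by the notebook, and `add_date_features` is a hypothetical helper rather than the function in `src/features.py`.

```python
import numpy as np
import pandas as pd


def add_date_features(df, date_col, prefix):
    """Derive year, month, day-of-week and cyclical month features from a date column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out[f"{prefix}_year"] = dates.dt.year.astype("float64")
    out[f"{prefix}_month"] = dates.dt.month.astype("float64")
    out[f"{prefix}_dow"] = dates.dt.dayofweek.astype("float64")  # Monday = 0
    # Map the month onto the unit circle so that December and January are neighbours.
    angle = 2 * np.pi * (out[f"{prefix}_month"] - 1) / 12
    out[f"{prefix}_month_sin"] = np.sin(angle)
    out[f"{prefix}_month_cos"] = np.cos(angle)
    return out


# Illustrative dates only, not taken from the real dataset.
demo_dates = pd.DataFrame({"Policy Start Date": ["2023-12-23", "2023-06-12", "2024-01-05"]})
demo_features = add_date_features(demo_dates, "Policy Start Date", "psd")
print(demo_features.filter(like="psd_"))
```

The sine/cosine pair keeps month 12 and month 1 close together in feature space, which a raw month number does not.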

In [15]:
processed_df.select_dtypes(include=['category']).columns

Index(['Gender', 'Smoking Status'], dtype='object')

In [28]:
processed_df.select_dtypes(include=['bool']).columns.to_list()

['Gender_Male',
 'Gender_Unknown',
 'Smoking Status_Yes',
 'Smoking Status_Unknown',
 'Marital Status_Married',
 'Marital Status_Single',
 'Marital Status_Unknown',
 'Occupation_Self-Employed',
 'Occupation_Unemployed',
 'Occupation_Unknown',
 'Location_Suburban',
 'Location_Urban',
 'Location_Unknown',
 'Exercise Frequency_Monthly',
 'Exercise Frequency_Rarely',
 'Exercise Frequency_Weekly',
 'Exercise Frequency_Unknown',
 'Property Type_Condo',
 'Property Type_House',
 'Property Type_Unknown']
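
The column names above suggest one-hot encoding with an explicit `Unknown` level for missing values and the first level of each feature dropped as the baseline. A minimal sketch of that pattern with `pd.get_dummies` follows. It is an assumption about how the pipeline might work, not the code in `src/`; `one_hot_with_unknown` is a hypothetical helper and the demo values are made up.

```python
import pandas as pd


def one_hot_with_unknown(df, columns):
    """One-hot encode `columns`, treating missing values as an explicit 'Unknown' level."""
    out = df.copy()
    for col in columns:
        cat = out[col].astype("category")
        # Append 'Unknown' as the last category so its dummy column comes last.
        if "Unknown" not in cat.cat.categories:
            cat = cat.cat.add_categories("Unknown")
        out[col] = cat.fillna("Unknown")
    # drop_first removes the first category of each feature (the baseline level).
    return pd.get_dummies(out, columns=columns, drop_first=True, dtype=bool)


# Illustrative values only, not taken from the real dataset.
demo_locations = pd.DataFrame({"Location": ["Urban", "Rural", None, "Suburban"]})
encoded = one_hot_with_unknown(demo_locations, ["Location"])
print(encoded.columns.tolist())
```

In this sketch a row whose dummy columns are all `False` belongs to the dropped baseline level.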

In [11]:
processed_df.describe().T

| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| id | 1200000.0 | 599999.5 | 0.0 | 299999.75 | 599999.5 | 899999.25 | 1199999.0 | 346410.305851 |
| Age | 1200000.0 | 41.143294 | 18.0 | 30.0 | 41.0 | 53.0 | 64.0 | 13.43402 |
| Annual Income | 1200000.0 | 32367.140383 | 498.0 | 8646.0 | 23911.0 | 43936.0 | 138543.02 | 31440.709329 |
| Number of Dependents | 1200000.0 | 2.009026 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 1.351022 |
| Education Level | 1200000.0 | 1.517826 | 0.0 | 1.0 | 2.0 | 3.0 | 3.0 | 1.112623 |
| Health Score | 1200000.0 | 25.550001 | 2.012237 | 16.554572 | 24.578648 | 33.77289 | 58.975914 | 11.823428 |
| Policy Type | 1200000.0 | 1.002743 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 | 0.816696 |
| Previous Claims | 1200000.0 | 0.999513 | 0.0 | 0.0 | 1.0 | 1.0 | 4.0 | 0.809647 |
| Vehicle Age | 1200000.0 | 9.569891 | 0.0 | 5.0 | 10.0 | 15.0 | 19.0 | 5.776174 |
| Credit Score | 1200000.0 | 593.162846 | 300.0 | 484.0 | 595.0 | 706.0 | 849.0 | 141.104056 |