# 02 - Feature Engineering

This notebook prepares the data generated in the **EDA** step for model training.

Project pipeline:

EDA ✔  
↓  
Feature Engineering ✔  
↓  
Model-ready dataset ✔  
↓  
Model training (next step)

All features created here are reused later, both in model training and in the API.


## 1. Imports

In [11]:
import pandas as pd
import numpy as np
from pathlib import Path

## 2. Load the project data

In [12]:
# Assumes the notebook runs from a folder one level below the project root.
PROJECT_ROOT = Path("..").resolve()
DATA_PATH = PROJECT_ROOT / "data"

df = pd.read_csv(DATA_PATH / "retail_sales.csv")

print("Dataset loaded successfully")
df.head()

Dataset loaded successfully


,customer_id,age,income,purchase_frequency,avg_ticket,recency_days,online_ratio,future_sales
0,1,56,58353.457443,11,58.722302,285,0.951013,402.149162
1,2,69,67462.27886,7,56.573908,153,0.801525,418.461486
2,3,46,54219.250087,14,55.370637,40,0.205749,749.016655
3,4,32,66306.787159,7,75.118287,26,0.391301,423.560829
4,5,60,47034.133936,9,122.306864,273,0.927792,452.521689


## 3. Validate available columns

In [13]:
print("Available columns:")
print(df.columns.tolist())

Available columns:
['customer_id', 'age', 'income', 'purchase_frequency', 'avg_ticket', 'recency_days', 'online_ratio', 'future_sales']
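
Printing the column list relies on a visual check. A minimal sketch of a programmatic check is shown here, assuming the required columns are exactly the ones printed above; the helper `check_required_columns` and the constant `REQUIRED_COLUMNS` are illustrative names, not part of the project.

```python
import pandas as pd

# Assumed list of required columns, copied from the printed output above.
REQUIRED_COLUMNS = [
    "customer_id", "age", "income", "purchase_frequency",
    "avg_ticket", "recency_days", "online_ratio", "future_sales",
]


def check_required_columns(frame: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required columns that are absent from `frame` (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]


# Toy frame that deliberately lacks two of the required columns.
toy = pd.DataFrame(
    columns=[c for c in REQUIRED_COLUMNS if c not in ("income", "future_sales")]
)
missing_cols = check_required_columns(toy, REQUIRED_COLUMNS)
print(missing_cols)
```

Returning the list of missing names, rather than a boolean, lets the caller decide whether to raise, log, or continue.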


## 4. High-value customer

In [14]:
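# Cutoff at the 75th percentile of future sales (the top quarter of customers).
threshold = df["future_sales"].quantile(0.75)

# Binary label: 1 if the customer is at or above the cutoff, 0 otherwise.
df["high_value_customer"] = (
    df["future_sales"] >= threshold
).astype(int)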

df["high_value_customer"].value_counts()

high_value_customer
0    3750
1    1250
Name: count, dtype: int64
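
The 3750/1250 split follows from the cutoff: values at or above the 75th percentile make up roughly the top quarter of customers. The toy example below illustrates the same rule on made-up numbers (not the project data), using pandas' default linear interpolation for the quantile.

```python
import pandas as pd

# Toy sales values, made up for illustration only.
sales = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0])

# 75th percentile; with 8 values it falls between 600 and 700.
cutoff = sales.quantile(0.75)

# 1 when the value is at or above the cutoff, 0 otherwise.
label = (sales >= cutoff).astype(int)

print(cutoff)
print(label.tolist())
```

Here two of the eight values receive the label 1, which is the same 25% share seen in the notebook output.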

## 5. Normalized purchase frequency

In [15]:
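# Purchase frequency relative to recency. The +1 avoids division by zero
# when recency_days is 0, and clip caps extreme ratios at 5.
df["freq_per_day"] = (
    df["purchase_frequency"] / (df["recency_days"] + 1)
).clip(upper=5)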

df[["purchase_frequency", "recency_days", "freq_per_day"]].head()

,purchase_frequency,recency_days,freq_per_day
0,11,285,0.038462
1,7,153,0.045455
2,14,40,0.341463
3,7,26,0.259259
4,9,273,0.032847
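
The two safeguards in this feature are easiest to see on edge cases. The sketch below applies the same formula to three made-up rows: a typical customer, one with `recency_days` equal to 0, and one whose ratio stays under the cap.

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "purchase_frequency": [11, 12, 6],
    "recency_days": [285, 0, 1],
})

# Same formula as the notebook: +1 in the denominator, then cap at 5.
toy["freq_per_day"] = (
    toy["purchase_frequency"] / (toy["recency_days"] + 1)
).clip(upper=5)

print(toy["freq_per_day"].round(6).tolist())
```

The second row would be 12 without the cap and is clipped to 5; the third row (6 / 2 = 3) passes through unchanged.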


## 6. Customer engagement score

In [16]:
# Standardize each input to a z-score (mean 0, sample standard deviation 1).
activity_norm = (
    df["purchase_frequency"] - df["purchase_frequency"].mean()
) / df["purchase_frequency"].std()

digital_norm = (
    df["online_ratio"] - df["online_ratio"].mean()
) / df["online_ratio"].std()

# Engagement score: the average of the two z-scores.
df["engagement_score"] = (activity_norm + digital_norm) / 2

df[["purchase_frequency", "online_ratio", "engagement_score"]].head()

,purchase_frequency,online_ratio,engagement_score
0,11,0.951013,0.945427
1,7,0.801525,0.049258
2,14,0.205749,0.125423
3,7,0.391301,-0.664671
4,9,0.927792,0.587011
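
The score depends on the mean and standard deviation of the full dataset, so scoring a new customer later (for example in the API) would presumably need those same statistics rather than ones recomputed on new data. The following is a minimal sketch of that idea, not the project's implementation; `fit_engagement_stats` and `engagement_score` are hypothetical helpers and the data is made up.

```python
import pandas as pd

INPUT_COLUMNS = ("purchase_frequency", "online_ratio")


def fit_engagement_stats(frame: pd.DataFrame) -> dict:
    """Collect the mean and sample std of each input column (hypothetical helper)."""
    return {
        col: {"mean": float(frame[col].mean()), "std": float(frame[col].std())}
        for col in INPUT_COLUMNS
    }


def engagement_score(frame: pd.DataFrame, stats: dict) -> pd.Series:
    """Average the two z-scores, using previously fitted statistics."""
    z_scores = [
        (frame[col] - stats[col]["mean"]) / stats[col]["std"]
        for col in INPUT_COLUMNS
    ]
    return (z_scores[0] + z_scores[1]) / 2


# Toy training data, made up for illustration only.
train = pd.DataFrame({
    "purchase_frequency": [5, 10, 15],
    "online_ratio": [0.2, 0.5, 0.8],
})
stats = fit_engagement_stats(train)
train_scores = engagement_score(train, stats)

# A single unseen row is scored with the statistics fitted on the training data.
new_customer = pd.DataFrame({"purchase_frequency": [10], "online_ratio": [0.5]})
new_score = engagement_score(new_customer, stats)

print(train_scores.tolist(), new_score.tolist())
```

Because `stats` is a plain dictionary of floats, it could be serialized (for instance as JSON) alongside the model and loaded wherever new rows are scored.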


## 7. Model-ready dataset

In [17]:
# customer_id is a row identifier, so it is excluded from the modeling table.
model_df = df.drop(columns=["customer_id"])

model_df.head()

,age,income,purchase_frequency,avg_ticket,recency_days,online_ratio,future_sales,high_value_customer,freq_per_day,engagement_score
0,56,58353.457443,11,58.722302,285,0.951013,402.149162,0,0.038462,0.945427
1,69,67462.27886,7,56.573908,153,0.801525,418.461486,0,0.045455,0.049258
2,46,54219.250087,14,55.370637,40,0.205749,749.016655,1,0.341463,0.125423
3,32,66306.787159,7,75.118287,26,0.391301,423.560829,0,0.259259,-0.664671
4,60,47034.133936,9,122.306864,273,0.927792,452.521689,0,0.032847,0.587011
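
Note that `high_value_customer` is computed directly from `future_sales`, so using either column as an input while predicting the other would leak the target into the features. The sketch below shows one way to separate inputs from the target at training time. It assumes the model predicts the binary label, which this notebook does not state; `TARGET` and `LEAKY_COLUMNS` are illustrative names and the data is made up.

```python
import pandas as pd

# Toy stand-in for model_df with a subset of its columns (values made up).
toy_model_df = pd.DataFrame({
    "age": [56, 46],
    "freq_per_day": [0.04, 0.34],
    "engagement_score": [0.9, 0.1],
    "future_sales": [402.1, 749.0],
    "high_value_customer": [0, 1],
})

# Assumption: the model predicts the binary label. Both target-derived
# columns are removed from the inputs to avoid leakage.
TARGET = "high_value_customer"
LEAKY_COLUMNS = ["future_sales", "high_value_customer"]

X = toy_model_df.drop(columns=LEAKY_COLUMNS)
y = toy_model_df[TARGET]

print(X.columns.tolist())
print(y.tolist())
```

If the model were a regressor on `future_sales` instead, the same pattern would apply with `TARGET = "future_sales"`.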


## 8. Save the final dataset

In [18]:
OUTPUT_PATH = PROJECT_ROOT / "outputs"
OUTPUT_PATH.mkdir(exist_ok=True)

model_df.to_csv(
    OUTPUT_PATH / "dataset_model_ready.csv",
    index=False
)

print("Final dataset saved successfully")

Final dataset saved successfully
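
A quick way to confirm the export is to read the file back and compare its shape and columns. The sketch below does this on a toy frame written to a temporary directory, so it does not touch the project's `outputs` folder; the values are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame, made up for illustration only.
toy = pd.DataFrame({"age": [56, 69], "freq_per_day": [0.038462, 0.045455]})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "dataset_model_ready.csv"
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

print(reloaded.columns.tolist())
print(reloaded.shape)
```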


## 9. Validate the generated features

In [19]:
expected_columns = [
    "high_value_customer",
    "freq_per_day",
    "engagement_score"
]

missing = [c for c in expected_columns if c not in df.columns]

if missing:
    print("Missing features:", missing)
else:
    print("All features were created successfully")

df[expected_columns].describe()

All features were created successfully


,high_value_customer,freq_per_day,engagement_score
count,5000.0,5000.0,5000.0
mean,0.25,0.150937,-5.542233e-17
std,0.433056,0.392862,0.7038456
min,0.0,0.005682,-2.094309
25%,0.0,0.035206,-0.4896143
50%,0.0,0.054545,-0.0134632
75%,0.25,0.109697,0.4814402
max,1.0,5.0,2.411786
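
The summary above is consistent with how each feature was built: the label takes only the values 0 and 1, `freq_per_day` tops out at the cap of 5, and `engagement_score` is centered near zero. Beyond checking that the columns exist, these ranges can also be checked in code. The following is a minimal sketch with a hypothetical helper, `check_feature_ranges`, run on made-up rows.

```python
import pandas as pd


def check_feature_ranges(frame: pd.DataFrame) -> dict:
    """Return a pass/fail flag per generated feature (hypothetical helper)."""
    return {
        # The label must contain only 0 and 1.
        "high_value_customer": bool(frame["high_value_customer"].isin([0, 1]).all()),
        # The ratio is non-negative and capped at 5 (bounds inclusive).
        "freq_per_day": bool(frame["freq_per_day"].between(0, 5).all()),
        # The score must not contain missing values.
        "engagement_score": bool(frame["engagement_score"].notna().all()),
    }


# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "high_value_customer": [0, 1, 0],
    "freq_per_day": [0.04, 5.0, 0.26],
    "engagement_score": [0.95, 0.05, -0.66],
})
result = check_feature_ranges(toy)
print(result)
```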
