# Baseline Model & Segmented Evaluation (Responsible AI)

This notebook trains simple models (Logistic Regression and Decision Tree) and evaluates their performance **segmented by sensitive groups** (sex, age, race, and native country if present).
It also compares versions **without** and **with** `class_weight="balanced"` as a basic **class-imbalance mitigation** strategy.

> **Goal:** deliver the *training* and *segmented evaluation* artifacts, and leave the block for the *before/after mitigation comparison* ready.

In [13]:
# === Setup ===
import os
import numpy as np
import pandas as pd
import kagglehub

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, roc_auc_score, confusion_matrix, classification_report
)

pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")
print("Versions -> pandas", pd.__version__)

Versions -> pandas 2.3.2


## 1. Data loading

In [14]:
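# Local CSV path (override with the ADULT_CSV environment variable).
# Note: this cell reads the kagglehub download below; CSV_PATH is not used.
CSV_PATH = os.environ.get("ADULT_CSV", "adult.csv")

def normalize_col(c):
    # Lowercase and replace hyphens and spaces with underscores.
    # Dots are left as-is, so "education.num" stays "education.num".
    return c.strip().lower().replace("-", "_").replace(" ", "_")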

# Download latest version
path = kagglehub.dataset_download("uciml/adult-census-income")

print("Path to dataset files:", path)
df = pd.read_csv(os.path.join(path, "adult.csv"))
df.info()
df.columns = [normalize_col(c) for c in df.columns]
print("Shape:", df.shape)
df.head()

Path to dataset files: /home/buntu/.cache/kagglehub/datasets/uciml/adult-census-income/versions/3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
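
The `df.info()` output above shows dotted column names such as `education.num`, which `normalize_col` leaves unchanged because it only replaces hyphens and spaces. A minimal sketch of a variant that also maps dots to underscores follows. The name `normalize_col_dots` is hypothetical and not part of the notebook, and adopting it would change which columns the later cells select, so the recorded outputs below would no longer apply.

```python
import re

def normalize_col_dots(c: str) -> str:
    """Hypothetical variant: lowercase and map '.', '-', and whitespace to '_'."""
    return re.sub(r"[.\-\s]+", "_", c.strip().lower())

# A few of the raw column names listed by df.info().
raw_columns = ["age", "education.num", "marital.status", "hours.per.week", "native.country"]
normalized = [normalize_col_dots(c) for c in raw_columns]
print(normalized)
```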


## 2. Problem definition and columns

In [15]:
# === Problem definition ===
TARGET_COL = "income"
CANDIDATE_TARGETS = ["income", "class", "salario", "target"]
if TARGET_COL not in df.columns:
    for cand in CANDIDATE_TARGETS:
        if cand in df.columns:
            TARGET_COL = cand
            break
print("TARGET_COL =", TARGET_COL)

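# === Expected columns (adjust) ===
# Only names present in df.columns are kept. The dataset's dotted names
# (e.g. "education.num", "native.country") do not match these underscore
# names, so they are filtered out here.
num_cols = [c for c in ["age", "education_num", "hours_per_week", "capital_gain", "capital_loss"] if c in df.columns]
cat_cols = [c for c in ["workclass","education","marital_status","occupation","sex","race","native_country"] if c in df.columns]

# Quick cleanup: treat "?" placeholders as missing values
for c in cat_cols:
    df[c] = df[c].replace("?", np.nan).replace(" ?", np.nan)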

df = df[df[TARGET_COL].notna()].copy()
df[TARGET_COL] = df[TARGET_COL].astype(str).str.strip()

X = df[num_cols + cat_cols].copy()

labels = df[TARGET_COL].astype(str).str.strip()

# 1) Typical Adult case: labels "<=50K" and ">50K"
if set(labels.unique()) >= {">50K", "<=50K"}:
    y = (labels == ">50K").astype(int)

# 2) Already-numeric binary case
elif set(labels.unique()) <= {"0","1"}:
    y = labels.astype(int)

# 3) Case where some label starts with ">"
elif labels.str.startswith(">").any():
    y = labels.str.startswith(">").astype(int)

else:
    raise ValueError(
        f"Unrecognized labels: {sorted(labels.unique()[:10])}. "
        "Check TARGET_COL."
    )

print("Classes and proportions:")
print(y.value_counts(normalize=True))
assert y.nunique() == 2, "y still has a single class; check the target mapping."


TARGET_COL = income
Classes and proportions:
income
0   0.7592
1   0.2408
Name: proportion, dtype: float64
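
The roughly 76/24 split printed above is what `class_weight="balanced"` compensates for. scikit-learn documents that heuristic as `n_samples / (n_classes * count_c)`, which is equivalent to `1 / (n_classes * proportion_c)`. The sketch below applies that arithmetic to the full-dataset proportions as an approximation; the notebook's models compute the weights on the stratified training split, so the exact values may differ slightly.

```python
# Class proportions as printed by the notebook (0 = "<=50K", 1 = ">50K").
proportions = {0: 0.7592, 1: 0.2408}
n_classes = len(proportions)

# "balanced" weight per class: 1 / (n_classes * proportion_c),
# i.e. n_samples / (n_classes * count_c) expressed with proportions.
balanced_weights = {c: 1.0 / (n_classes * p) for c, p in proportions.items()}

for c, w in balanced_weights.items():
    print(f"class {c}: weight ~ {w:.3f}")
```

Under these assumptions the minority class is weighted roughly three times as heavily as the majority class, which is why the balanced models predict the positive class more often.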


## 3. Preprocessing

In [16]:
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])
preprocess = ColumnTransformer(
    transformers=[
        ("num", num_transformer, num_cols),
        ("cat", cat_transformer, cat_cols),
    ],
    remainder="drop"
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
X_train.shape, X_test.shape

((24420, 6), (8141, 6))

## 4. Model training (baseline)

In [17]:
# Two configurations: without balancing and with class_weight="balanced"
models = {
    "logreg_unbalanced": LogisticRegression(max_iter=200, n_jobs=None),
    "logreg_balanced": LogisticRegression(max_iter=200, class_weight="balanced", n_jobs=None),
    "tree_balanced": DecisionTreeClassifier(max_depth=6, min_samples_leaf=20, class_weight="balanced", random_state=42),
}

from collections import OrderedDict
trained = OrderedDict()
for name, clf in models.items():
    pipe = Pipeline(steps=[("preprocess", preprocess), ("model", clf)])
    pipe.fit(X_train, y_train)
    trained[name] = pipe

list(trained.keys())

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


['logreg_unbalanced', 'logreg_balanced', 'tree_balanced']

## 5. Global metrics

In [19]:
def eval_global(model, X, y, label="test"):
    y_pred = model.predict(X)
    row = {
        "split": label,
        "accuracy": accuracy_score(y, y_pred),
        "f1": f1_score(y, y_pred)
    }
    try:
        y_proba = model.predict_proba(X)[:,1]
        row["roc_auc"] = roc_auc_score(y, y_proba)
    except Exception:
        row["roc_auc"] = np.nan
    return row

global_rows = []
for name, pipe in trained.items():
    global_rows.append({"model": name, **eval_global(pipe, X_test, y_test, "test")})

global_df = pd.DataFrame(global_rows).sort_values(by="f1", ascending=False)
display(global_df)

best_name = global_df.iloc[0]["model"]
best = trained[best_name]
y_pred = best.predict(X_test)
print(f"Best model: {best_name}")
print("Confusion matrix [[tn, fp], [fn, tp]]:", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred, digits=3))

| | model | split | accuracy | f1 | roc_auc |
|---|---|---|---|---|---|
| 1 | logreg_balanced | test | 0.7364 | 0.5795 | 0.8228 |
| 2 | tree_balanced | test | 0.6593 | 0.5473 | 0.8061 |
| 0 | logreg_unbalanced | test | 0.8099 | 0.5123 | 0.8222 |


Best model: logreg_balanced
Confusion matrix [[tn, fp], [fn, tp]]: [[4516 1665]
 [ 481 1479]]

Classification report:
               precision    recall  f1-score   support

           0      0.904     0.731     0.808      6181
           1      0.470     0.755     0.580      1960

    accuracy                          0.736      8141
   macro avg      0.687     0.743     0.694      8141
weighted avg      0.799     0.736     0.753      8141
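
The positive-class figures in the report can be re-derived from the confusion matrix printed above. The short sketch below does that arithmetic with the four reported counts; it is a consistency check, not part of the original notebook.

```python
# Confusion matrix reported for logreg_balanced on the test split.
tn, fp, fn, tp = 4516, 1665, 481, 1479

precision = tp / (tp + fp)                 # share of predicted positives that are correct
recall = tp / (tp + fn)                    # true positive rate (TPR)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)
fpr = fp / (fp + tn)                       # false positive rate

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.4f} "
      f"accuracy={accuracy:.4f} fpr={fpr:.4f}")
```

The precision of about 0.47 reflects the 1,665 false positives: the balanced model recovers most true positives at the cost of many incorrect positive predictions.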



## 6. Segmented evaluation by sensitive groups

In [20]:
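def group_metrics(pipe, X, y, group_series, group_name):
    """Per-group accuracy, F1, TPR, FPR and positive rate, plus max-min gaps.

    Groups with fewer than 20 rows are skipped.
    """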
    df_aux = X.copy()
    df_aux["_y"] = y.values
    df_aux["_g"] = group_series.values

    rows = []
    for g in df_aux["_g"].dropna().unique():
        idx = df_aux.index[df_aux["_g"] == g]
        Xi = X.loc[idx]
        yi = y.loc[idx]
        if len(yi) < 20:
            continue
        y_pred = pipe.predict(Xi)
        tp = ((y_pred == 1) & (yi == 1)).sum()
        tn = ((y_pred == 0) & (yi == 0)).sum()
        fp = ((y_pred == 1) & (yi == 0)).sum()
        fn = ((y_pred == 0) & (yi == 1)).sum()
        tpr = tp / (tp + fn) if (tp + fn) > 0 else np.nan
        fpr = fp / (fp + tn) if (fp + tn) > 0 else np.nan
        pos_rate = (y_pred == 1).mean()

        rows.append({
            group_name: g,
            "n": int(len(yi)),
            "accuracy": accuracy_score(yi, y_pred),
            "f1": f1_score(yi, y_pred),
            "tpr": tpr,
            "fpr": fpr,
            "positive_rate": pos_rate,
        })
    out = pd.DataFrame(rows).sort_values(by="n", ascending=False)
    def gap(col):
        return out[col].max() - out[col].min() if len(out) > 0 else np.nan
    gaps = {f"gap_{c}": gap(c) for c in ["accuracy","f1","tpr","fpr","positive_rate"]}
    return out, gaps

best_name = global_df.iloc[0]["model"]
best = trained[best_name]

seg_tables = {}
gap_rows = []

if "sex" in X_test.columns:
    sex_tbl, sex_gaps = group_metrics(best, X_test, y_test, X_test["sex"], "sex")
    seg_tables["sex"] = sex_tbl
    gap_rows.append({"group":"sex", **sex_gaps})

if "age" in X_test.columns:
    age_bins = pd.cut(X_test["age"], bins=[-np.inf,29,49,69,np.inf], labels=["<30","30–49","50–69","70+"])
    age_tbl, age_gaps = group_metrics(best, X_test, y_test, age_bins, "age_bin")
    seg_tables["age_bin"] = age_tbl
    gap_rows.append({"group":"age_bin", **age_gaps})

if "race" in X_test.columns:
    race_tbl, race_gaps = group_metrics(best, X_test, y_test, X_test["race"], "race")
    seg_tables["race"] = race_tbl
    gap_rows.append({"group":"race", **race_gaps})

if "native_country" in X_test.columns:
    top_countries = X_test["native_country"].value_counts().head(5).index
    mask = X_test["native_country"].isin(top_countries)
    if mask.sum() > 0:
        tbl, gaps = group_metrics(best, X_test[mask], y_test[mask], X_test.loc[mask, "native_country"], "native_country")
        seg_tables["native_country_top5"] = tbl
        gap_rows.append({"group":"native_country_top5", **gaps})

for name, tbl in seg_tables.items():
    print(f"\n### Segment: {name}")
    display(tbl)

gaps_df = pd.DataFrame(gap_rows)
print("\nGaps (max-min) by group and metric:")
display(gaps_df)


### Segment: sex


| | sex | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | Male | 5458 | 0.684 | 0.6126 | 0.8187 | 0.3753 | 0.5106 |
| 1 | Female | 2683 | 0.8431 | 0.3533 | 0.3912 | 0.1013 | 0.1331 |



### Segment: age_bin


| | age_bin | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | 30–49 | 3927 | 0.7023 | 0.597 | 0.7205 | 0.3057 | 0.4326 |
| 1 | <30 | 2407 | 0.8803 | 0.28 | 0.4211 | 0.0928 | 0.1109 |
| 2 | 50–69 | 1647 | 0.6393 | 0.6387 | 0.8929 | 0.5014 | 0.6412 |
| 3 | 70+ | 160 | 0.4062 | 0.4025 | 0.8649 | 0.7317 | 0.7625 |



### Segment: race


| | race | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 1 | White | 7017 | 0.723 | 0.5874 | 0.7715 | 0.2937 | 0.4158 |
| 2 | Black | 725 | 0.8538 | 0.4536 | 0.5 | 0.0973 | 0.1462 |
| 4 | Asian-Pac-Islander | 257 | 0.7043 | 0.5422 | 0.6818 | 0.288 | 0.3891 |
| 3 | Amer-Indian-Eskimo | 79 | 0.8354 | 0.4348 | 0.5556 | 0.1286 | 0.1772 |
| 0 | Other | 63 | 0.8889 | 0.2222 | 0.3333 | 0.0833 | 0.0952 |



Gaps (max-min) by group and metric:


| | group | gap_accuracy | gap_f1 | gap_tpr | gap_fpr | gap_positive_rate |
|---|---|---|---|---|---|---|
| 0 | sex | 0.1591 | 0.2593 | 0.4276 | 0.274 | 0.3776 |
| 1 | age_bin | 0.4741 | 0.3587 | 0.4718 | 0.6389 | 0.6516 |
| 2 | race | 0.1846 | 0.3652 | 0.4381 | 0.2104 | 0.3206 |
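
To make the per-group rates concrete, here is a toy sketch of the same TPR, FPR, positive-rate, and max-min gap computation using a pandas `groupby`. The data and the helper name `rates_by_group` are made up for illustration. Unlike `group_metrics`, this sketch does not guard against groups with no positives or no negatives, where the mean would be NaN.

```python
import pandas as pd

# Made-up labels, predictions, and group membership (illustration only).
toy = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 1, 0, 1, 0, 0, 0],
})

def rates_by_group(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: TPR, FPR, and positive rate per group."""
    rows = []
    for g, part in df.groupby("group"):
        pos = part[part["y_true"] == 1]   # actual positives in this group
        neg = part[part["y_true"] == 0]   # actual negatives in this group
        rows.append({
            "group": g,
            "tpr": float((pos["y_pred"] == 1).mean()),   # TP / (TP + FN)
            "fpr": float((neg["y_pred"] == 1).mean()),   # FP / (FP + TN)
            "positive_rate": float((part["y_pred"] == 1).mean()),
        })
    return pd.DataFrame(rows).set_index("group")

rates = rates_by_group(toy)
gaps = rates.max() - rates.min()   # max-min gap per metric, as in gaps_df
print(rates)
print(gaps)
```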


## 7. Before vs. after mitigation comparison (class_weight)

In [21]:
compare_rows = []
for name in ["logreg_unbalanced", "logreg_balanced"]:
    pipe = trained[name]
    row = {"model": name, **eval_global(pipe, X_test, y_test, "test")}
    compare_rows.append(row)
cmp_global = pd.DataFrame(compare_rows)
display(cmp_global)

if "sex" in X_test.columns:
    cmps = []
    for name in ["logreg_unbalanced", "logreg_balanced"]:
        pipe = trained[name]
        tbl, gaps = group_metrics(pipe, X_test, y_test, X_test["sex"], "sex")
        g = {"model": name, **gaps}
        cmps.append(g)
        print(f"\nSegmented by sex -> {name}")
        display(tbl)
    cmp_gaps = pd.DataFrame(cmps)
    print("\nGaps by 'sex' (lower is better):")
    display(cmp_gaps)

| | model | split | accuracy | f1 | roc_auc |
|---|---|---|---|---|---|
| 0 | logreg_unbalanced | test | 0.8099 | 0.5123 | 0.8222 |
| 1 | logreg_balanced | test | 0.7364 | 0.5795 | 0.8228 |



Segmented by sex -> logreg_unbalanced


| | sex | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | Male | 5458 | 0.7688 | 0.5563 | 0.4748 | 0.1021 | 0.2158 |
| 1 | Female | 2683 | 0.8934 | 0.1333 | 0.0748 | 0.0059 | 0.0134 |



Segmented by sex -> logreg_balanced


| | sex | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | Male | 5458 | 0.684 | 0.6126 | 0.8187 | 0.3753 | 0.5106 |
| 1 | Female | 2683 | 0.8431 | 0.3533 | 0.3912 | 0.1013 | 0.1331 |



Gaps by 'sex' (lower is better):


| | model | gap_accuracy | gap_f1 | gap_tpr | gap_fpr | gap_positive_rate |
|---|---|---|---|---|---|---|
| 0 | logreg_unbalanced | 0.1246 | 0.4229 | 0.4 | 0.0962 | 0.2024 |
| 1 | logreg_balanced | 0.1591 | 0.2593 | 0.4276 | 0.274 | 0.3776 |
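
One way to read the gap table is to subtract the unbalanced gaps from the balanced ones, so that a negative difference means the gap narrowed after re-weighting. The sketch below does this with the reported values copied by hand; the variable names are illustrative and not part of the notebook.

```python
# Gaps by sex as reported above (max - min across groups).
gaps_unbalanced = {"accuracy": 0.1246, "f1": 0.4229, "tpr": 0.4000,
                   "fpr": 0.0962, "positive_rate": 0.2024}
gaps_balanced = {"accuracy": 0.1591, "f1": 0.2593, "tpr": 0.4276,
                 "fpr": 0.2740, "positive_rate": 0.3776}

# Negative delta: the gap narrowed with class_weight="balanced".
delta = {m: round(gaps_balanced[m] - gaps_unbalanced[m], 4) for m in gaps_unbalanced}
narrowed = [m for m, d in delta.items() if d < 0]
widened = [m for m, d in delta.items() if d > 0]

for m, d in delta.items():
    print(f"{m:>14}: {d:+.4f}")
print("narrowed:", narrowed)
print("widened:", widened)
```

On these numbers only the F1 gap narrows, while the accuracy, TPR, FPR, and positive-rate gaps between sexes widen, so class re-weighting alone does not reduce the disparity by sex in this run.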
