# Baseline Model & Segmented Evaluation (Responsible AI)

This notebook trains two simple models (Logistic Regression and a Decision Tree) and evaluates performance **segmented by sensitive groups** (sex, age, race, and native country if present).
It also compares versions **without** and **with** `class_weight="balanced"` as a basic strategy for **mitigating class imbalance**.

In [1]:
# === Setup ===
import os
import numpy as np
import pandas as pd
import kagglehub

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, roc_auc_score, confusion_matrix, classification_report
)

pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")
print("Versions -> pandas", pd.__version__)



Versions -> pandas 2.3.2


## 1. Data loading

In [2]:
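# Optional local CSV path (defaults to 'adult.csv' in the same folder).
# Note: this value is not used below; the data is read from the kagglehub download.
CSV_PATH = os.environ.get("ADULT_CSV", "adult.csv")

def normalize_col(c):
    # Lowercase and replace '-' and ' ' with '_'.
    # Note: '.' is not replaced, so 'education.num' stays 'education.num'.
    return c.strip().lower().replace("-", "_").replace(" ", "_")

# Download the latest version of the dataset
path = kagglehub.dataset_download("uciml/adult-census-income")

print("Path to dataset files:", path)
df = pd.read_csv(os.path.join(path, "adult.csv"))
df.info()
df.columns = [normalize_col(c) for c in df.columns]
print("Shape:", df.shape)
df.head()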

Downloading from https://www.kaggle.com/api/v1/datasets/download/uciml/adult-census-income?dataset_version_number=3...


100%|██████████| 450k/450k [00:00<00:00, 2.27MB/s]

Extracting files...
Path to dataset files: C:\Users\bcarr\.cache\kagglehub\datasets\uciml\adult-census-income\versions\3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)




| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


## 2. Problem definition and columns

In [3]:
# === Problem definition ===
TARGET_COL = "income"
CANDIDATE_TARGETS = ["income", "class", "salario", "target"]
if TARGET_COL not in df.columns:
    for cand in CANDIDATE_TARGETS:
        if cand in df.columns:
            TARGET_COL = cand
            break
print("TARGET_COL =", TARGET_COL)

# === Expected columns (adjust as needed); only names present in df are kept ===
num_cols = [c for c in ["age", "education_num", "hours_per_week", "capital_gain", "capital_loss"] if c in df.columns]
cat_cols = [c for c in ["workclass","education","marital_status","occupation","sex","race","native_country"] if c in df.columns]

# Quick cleanup: treat "?" placeholders as missing values
for c in cat_cols:
    df[c] = df[c].replace("?", np.nan).replace(" ?", np.nan)

# Drop rows without a target and strip whitespace from the labels
df = df[df[TARGET_COL].notna()].copy()
df[TARGET_COL] = df[TARGET_COL].astype(str).str.strip()

X = df[num_cols + cat_cols].copy()

labels = df[TARGET_COL].astype(str).str.strip()

# 1) Typical Adult case: labels "<=50K" and ">50K"
if set(labels.unique()) >= {">50K", "<=50K"}:
    y = (labels == ">50K").astype(int)

# 2) Binary target that is already numeric
elif set(labels.unique()) <= {"0","1"}:
    y = labels.astype(int)

# 3) Some label carries a ">" prefix
elif labels.str.startswith(">").any():
    y = labels.str.startswith(">").astype(int)

else:
    raise ValueError(
        f"Unrecognized labels: {sorted(labels.unique()[:10])}. "
        "Check TARGET_COL."
    )

print("Classes and proportions:")
print(y.value_counts(normalize=True))
assert y.nunique() == 2, "y does not have exactly two classes; check the target mapping."


TARGET_COL = income
Classes and proportions:
income
0   0.7592
1   0.2408
Name: proportion, dtype: float64


## 3. Preprocessing

In [4]:
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])
preprocess = ColumnTransformer(
    transformers=[
        ("num", num_transformer, num_cols),
        ("cat", cat_transformer, cat_cols),
    ],
    remainder="drop"
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
X_train.shape, X_test.shape

((24420, 6), (8141, 6))
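
The `(24420, 6)` shape above suggests that only six of the twelve listed feature columns were found in `df`. The raw headers use dots (`education.num`, `hours.per.week`) and `normalize_col` leaves dots untouched, so the underscore names in `num_cols` and `cat_cols` do not match them. A minimal sketch of a stricter normalizer follows; `normalize_col_strict` is a hypothetical name, and switching to it would change the feature set and therefore every metric reported in this notebook.

```python
import re

def normalize_col_strict(c):
    # Hypothetical variant of normalize_col: lowercase, then collapse every
    # run of non-alphanumeric characters ('.', '-', ' ') into a single '_'.
    return re.sub(r"[^0-9a-z]+", "_", c.strip().lower()).strip("_")

# A few of the raw headers shown in the df.info() output
raw_cols = ["age", "education.num", "marital.status", "hours.per.week", "native.country"]
normalized = [normalize_col_strict(c) for c in raw_cols]
print(normalized)
```

With names normalized this way, the `if c in df.columns` filters in section 2 would keep all listed columns, and the `native_country` segment in section 6 would no longer be skipped.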

## 4. Model training (baseline)

In [17]:
from collections import OrderedDict

# Baseline configurations without class weighting
# (the class_weight="balanced" variants are trained in section 7)
models = {
    "logreg_unbalanced": LogisticRegression(max_iter=200, n_jobs=None),
    "tree_unbalanced": DecisionTreeClassifier(max_depth=6, min_samples_leaf=20, random_state=42)
}

# Fit each model inside a pipeline that shares the same preprocessing
trained = OrderedDict()
for name, clf in models.items():
    pipe = Pipeline(steps=[("preprocess", preprocess), ("model", clf)])
    pipe.fit(X_train, y_train)
    trained[name] = pipe

list(trained.keys())

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


['logreg_unbalanced', 'tree_unbalanced']
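
The convergence warning above points to unscaled numeric inputs. The following is a minimal sketch, on synthetic data rather than the Adult dataset, of how standardizing features before `LogisticRegression` lets the default `lbfgs` solver finish well within `max_iter=200`. In this notebook the analogous change would presumably be a `StandardScaler` step after the imputer in `num_transformer`; that is an assumption, not something the original run did.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two features on very different scales (assumption)
X_demo = np.column_stack([rng.normal(40, 12, 500), rng.normal(0, 1, 500)])
y_demo = (X_demo[:, 0] / 12 + X_demo[:, 1] + rng.normal(0, 1, 500) > 40 / 12).astype(int)

scaled_logreg = Pipeline(steps=[
    ("scale", StandardScaler()),                   # zero mean, unit variance
    ("model", LogisticRegression(max_iter=200)),
])
scaled_logreg.fit(X_demo, y_demo)

# n_iter_ holds the number of solver iterations actually used
n_iter = int(scaled_logreg.named_steps["model"].n_iter_[0])
print("lbfgs iterations used:", n_iter)
```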

## 5. Global metrics

In [18]:
def eval_global(model, X, y, label="test"):
    """Return accuracy, F1 (positive class) and ROC AUC for one model on one split."""
    y_pred = model.predict(X)
    row = {
        "split": label,
        "accuracy": accuracy_score(y, y_pred),
        "f1": f1_score(y, y_pred)
    }
    try:
        y_proba = model.predict_proba(X)[:, 1]
        row["roc_auc"] = roc_auc_score(y, y_proba)
    except Exception:
        # Models without predict_proba get NaN instead of an AUC
        row["roc_auc"] = np.nan
    return row

global_rows = []
for name, pipe in trained.items():
    global_rows.append({"model": name, **eval_global(pipe, X_test, y_test, "test")})

# Rank models by F1 on the positive class
global_df = pd.DataFrame(global_rows).sort_values(by="f1", ascending=False)
display(global_df)

best_name = global_df.iloc[0]["model"]
best = trained[best_name]
y_pred = best.predict(X_test)
print(f"Best model: {best_name}")
print("Confusion matrix [[tn, fp], [fn, tp]]:", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred, digits=3))

| | model | split | accuracy | f1 | roc_auc |
|---|---|---|---|---|---|
| 0 | logreg_unbalanced | test | 0.8096 | 0.511 | 0.8221 |
| 1 | tree_unbalanced | test | 0.7978 | 0.4607 | 0.8029 |


Best model: logreg_unbalanced
Confusion matrix [[tn, fp], [fn, tp]]: [[5781  400]
 [1150  810]]

Classification report:
               precision    recall  f1-score   support

           0      0.834     0.935     0.882      6181
           1      0.669     0.413     0.511      1960

    accuracy                          0.810      8141
   macro avg      0.752     0.674     0.696      8141
weighted avg      0.794     0.810     0.793      8141



- Class balance (test): 0 = 6181 (76%), 1 = 1960 (24%).
- Model comparison (test):
  - logreg_unbalanced: accuracy 0.8096, F1 0.5110, ROC AUC 0.8221 (best).
  - tree_unbalanced: accuracy 0.7978, F1 0.4607, ROC AUC 0.8029.
- Best model: logreg_unbalanced.
- Confusion matrix: TN=5781, FP=400, FN=1150, TP=810.
- Per-class metrics (logreg_unbalanced):
  - Class 0: precision 0.834, recall 0.935, F1 0.882 (support 6181).
  - Class 1: precision 0.669, recall 0.413, F1 0.511 (support 1960).
- Overall: accuracy 0.810; macro avg F1 0.696; weighted avg F1 0.793.
- Behavior: the model is conservative on class 1. It predicts positive for ≈14.9% of test rows versus a true prevalence of 24%, so it protects class 0 (few FP, specificity 0.935) but misses many positives (1150 FN, recall 0.413).
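
The rates quoted in the list above follow directly from the confusion matrix. A short sketch of that arithmetic, using the four counts reported for logreg_unbalanced (variable names are illustrative):

```python
import numpy as np

# Confusion matrix reported for logreg_unbalanced: [[tn, fp], [fn, tp]]
cm = np.array([[5781, 400], [1150, 810]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

predicted_positive_rate = (fp + tp) / total   # share of rows flagged as >50K
prevalence = (fn + tp) / total                # share of rows that are truly >50K
recall = tp / (tp + fn)                       # TPR for class 1
specificity = tn / (tn + fp)                  # recall for class 0
precision = tp / (tp + fp)

print(f"predicted positive rate: {predicted_positive_rate:.3f}")
print(f"prevalence:              {prevalence:.3f}")
print(f"recall / specificity:    {recall:.3f} / {specificity:.3f}")
print(f"precision:               {precision:.3f}")
```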

## 6. Segmented evaluation by sensitive groups

In [19]:
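def group_metrics(pipe, X, y, group_series, group_name):
    """Per-group accuracy, F1, TPR, FPR and positive rate, plus max-min gaps per metric.

    Groups with fewer than 20 rows are skipped.
    """
    df_aux = X.copy()
    df_aux["_y"] = y.values
    df_aux["_g"] = group_series.values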

    rows = []
    for g in df_aux["_g"].dropna().unique():
        idx = df_aux.index[df_aux["_g"] == g]
        Xi = X.loc[idx]
        yi = y.loc[idx]
        if len(yi) < 20:
            continue
        y_pred = pipe.predict(Xi)
        tp = ((y_pred == 1) & (yi == 1)).sum()
        tn = ((y_pred == 0) & (yi == 0)).sum()
        fp = ((y_pred == 1) & (yi == 0)).sum()
        fn = ((y_pred == 0) & (yi == 1)).sum()
        tpr = tp / (tp + fn) if (tp + fn) > 0 else np.nan
        fpr = fp / (fp + tn) if (fp + tn) > 0 else np.nan
        pos_rate = (y_pred == 1).mean()

        rows.append({
            group_name: g,
            "n": int(len(yi)),
            "accuracy": accuracy_score(yi, y_pred),
            "f1": f1_score(yi, y_pred),
            "tpr": tpr,
            "fpr": fpr,
            "positive_rate": pos_rate,
        })
    out = pd.DataFrame(rows).sort_values(by="n", ascending=False)
    def gap(col):
        return out[col].max() - out[col].min() if len(out) > 0 else np.nan
    gaps = {f"gap_{c}": gap(c) for c in ["accuracy","f1","tpr","fpr","positive_rate"]}
    return out, gaps

best_name = global_df.iloc[0]["model"]
best = trained[best_name]

seg_tables = {}
gap_rows = []

if "sex" in X_test.columns:
    sex_tbl, sex_gaps = group_metrics(best, X_test, y_test, X_test["sex"], "sex")
    seg_tables["sex"] = sex_tbl
    gap_rows.append({"group":"sex", **sex_gaps})

if "age" in X_test.columns:
    age_bins = pd.cut(X_test["age"], bins=[-np.inf,29,49,69,np.inf], labels=["<30","30–49","50–69","70+"])
    age_tbl, age_gaps = group_metrics(best, X_test, y_test, age_bins, "age_bin")
    seg_tables["age_bin"] = age_tbl
    gap_rows.append({"group":"age_bin", **age_gaps})

if "race" in X_test.columns:
    race_tbl, race_gaps = group_metrics(best, X_test, y_test, X_test["race"], "race")
    seg_tables["race"] = race_tbl
    gap_rows.append({"group":"race", **race_gaps})

if "native_country" in X_test.columns:
    top_countries = X_test["native_country"].value_counts().head(5).index
    mask = X_test["native_country"].isin(top_countries)
    if mask.sum() > 0:
        tbl, gaps = group_metrics(best, X_test[mask], y_test[mask], X_test.loc[mask, "native_country"], "native_country")
        seg_tables["native_country_top5"] = tbl
        gap_rows.append({"group":"native_country_top5", **gaps})

for name, tbl in seg_tables.items():
    print(f"\n### Segment: {name}")
    display(tbl)

gaps_df = pd.DataFrame(gap_rows)
print("\nGaps (max-min) per group and metric:")
display(gaps_df)


### Segment: sex


| | sex | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | Male | 5458 | 0.7684 | 0.5549 | 0.473 | 0.1018 | 0.2151 |
| 1 | Female | 2683 | 0.8934 | 0.1333 | 0.0748 | 0.0059 | 0.0134 |



### Segment: age_bin


| | age_bin | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | 30–49 | 3927 | 0.7601 | 0.4875 | 0.3727 | 0.069 | 0.162 |
| 1 | <30 | 2407 | 0.9393 | 0.1412 | 0.0902 | 0.011 | 0.0154 |
| 2 | 50–69 | 1647 | 0.7511 | 0.6103 | 0.5459 | 0.135 | 0.2817 |
| 3 | 70+ | 160 | 0.675 | 0.5273 | 0.7838 | 0.3577 | 0.4562 |



### Segment: race


| | race | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 1 | White | 7017 | 0.8002 | 0.5212 | 0.4253 | 0.071 | 0.1616 |
| 2 | Black | 725 | 0.8855 | 0.2783 | 0.1818 | 0.0173 | 0.0372 |
| 4 | Asian-Pac-Islander | 257 | 0.8093 | 0.5586 | 0.4697 | 0.0733 | 0.1751 |
| 3 | Amer-Indian-Eskimo | 79 | 0.8608 | 0.0 | 0.0 | 0.0286 | 0.0253 |
| 0 | Other | 63 | 0.9206 | 0.0 | 0.0 | 0.0333 | 0.0317 |



Gaps (max-min) per group and metric:


| | group | gap_accuracy | gap_f1 | gap_tpr | gap_fpr | gap_positive_rate |
|---|---|---|---|---|---|---|
| 0 | sex | 0.125 | 0.4216 | 0.3982 | 0.0959 | 0.2017 |
| 1 | age_bin | 0.2643 | 0.4691 | 0.6936 | 0.3467 | 0.4409 |
| 2 | race | 0.1204 | 0.5586 | 0.4697 | 0.056 | 0.1498 |


### Sex
- Male: TPR 0.473, FPR 0.102, positive_rate 0.215
- Female: TPR 0.075, FPR 0.006, positive_rate 0.013

Signals: strong disparity. Equal opportunity (TPR) is far lower for Female, and the demographic parity ratio is ≈ 0.013/0.215 ≈ 0.06, well below 0.8. This points to adverse impact against Female.

### Age
- <30: TPR 0.090, positive_rate 0.015
- 30–49: TPR 0.373, positive_rate 0.162
- 50–69: TPR 0.546, positive_rate 0.282
- 70+: TPR 0.784, FPR 0.358, positive_rate 0.456

Signals: the TPR gap is ≈ 0.69 (very high). Younger groups receive far fewer positive predictions and their positives are detected less often; 70+ has a high TPR but also a high FPR (possible over-assignment of positives).

### Race
- White: TPR 0.425, positive_rate 0.162
- Black: TPR 0.182, positive_rate 0.037
- Asian-Pac-Islander: TPR 0.470, positive_rate 0.175
- Amer-Indian-Eskimo/Other: TPR 0.000, positive_rate ~0.03 (very small n)

Signals: relevant disparities. Black has clearly lower TPR and positive_rate than White; the small-n groups (n = 79 and n = 63) show TPR = 0, an estimate that is unstable at that sample size (possible under-coverage or overfitting). The parity ratio is roughly min/max ≈ 0.025/0.175 ≈ 0.14.
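
The parity ratios and TPR gaps quoted in the three subsections above can be reproduced from the per-group tables. A minimal sketch using the values from the `sex` table; `fairness_summary` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

# Per-group values copied from the 'sex' table for logreg_unbalanced
sex_tbl = pd.DataFrame({
    "sex": ["Male", "Female"],
    "tpr": [0.4730, 0.0748],
    "positive_rate": [0.2151, 0.0134],
})

def fairness_summary(tbl):
    # Hypothetical helper: ratio of lowest to highest selection rate, and
    # difference between highest and lowest TPR across groups.
    return {
        "demographic_parity_ratio": tbl["positive_rate"].min() / tbl["positive_rate"].max(),
        "equal_opportunity_difference": tbl["tpr"].max() - tbl["tpr"].min(),
    }

summary = fairness_summary(sex_tbl)
print({k: round(v, 4) for k, v in summary.items()})
```

The same helper applies to the tables returned by `group_metrics`, since they carry `tpr` and `positive_rate` columns.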

## Conclusion
Yes, there are signs of bias: the model systematically favors Male and older age groups, and disadvantages Female, <30 and, to a lesser extent, Black and groups with very few examples. The gaps in TPR (equal opportunity) and positive_rate (demographic parity) are large.

## Mitigation strategies
1. Adjust the global threshold to raise the TPR of the positive class, then re-evaluate the gaps.
2. Try class_weight="balanced" and/or reweighing during training; measure TPR/FPR per group again.
3. Increase the support of small groups (merging categories or resampling techniques) to avoid TPR=0 caused by low n.
4. Add formal metrics: demographic parity ratio and equal opportunity difference per group, with an acceptance threshold (e.g., ratio ≥ 0.8).
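
Strategy 1 can be sketched as follows. The example uses synthetic imbalanced data from `make_classification` as a stand-in for the Adult dataset, and the thresholds 0.5 and 0.3 are arbitrary illustrative choices; it only shows that lowering the threshold on `predict_proba` raises TPR and FPR together.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the same class balance as the notebook (assumption)
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=8, weights=[0.76, 0.24], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

def rates_at(threshold):
    # Classify as positive when the predicted probability reaches the threshold
    pred = (proba >= threshold).astype(int)
    tpr = ((pred == 1) & (y_te == 1)).sum() / (y_te == 1).sum()
    fpr = ((pred == 1) & (y_te == 0)).sum() / (y_te == 0).sum()
    return tpr, fpr

results = {t: rates_at(t) for t in (0.5, 0.3)}
for t, (tpr, fpr) in results.items():
    print(f"threshold={t}: TPR={tpr:.3f}, FPR={fpr:.3f}")
```

In practice the threshold would be chosen on a validation split, and the per-group tables from section 6 recomputed with the thresholded predictions.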

## 7. Implementing mitigation with class_weight = "balanced"

In [20]:
# Mitigated configurations: the same two models with class_weight="balanced"
models = {
    "logreg_balanced": LogisticRegression(max_iter=200, class_weight="balanced", n_jobs=None),
    "tree_balanced": DecisionTreeClassifier(max_depth=6, min_samples_leaf=20, class_weight="balanced", random_state=42),
}

# Add the balanced models to the `trained` dict created in section 4
for name, clf in models.items():
    pipe = Pipeline(steps=[("preprocess", preprocess), ("model", clf)])
    pipe.fit(X_train, y_train)
    trained[name] = pipe

list(trained.keys())

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


['logreg_unbalanced', 'tree_unbalanced', 'logreg_balanced', 'tree_balanced']

In [21]:
# eval_global is reused as defined in section 5.
global_rows = []
for name, pipe in trained.items():
    global_rows.append({"model": name, **eval_global(pipe, X_test, y_test, "test")})

# Rank all four models by F1 on the positive class
global_df = pd.DataFrame(global_rows).sort_values(by="f1", ascending=False)
display(global_df)

best_name = global_df.iloc[0]["model"]
best = trained[best_name]
y_pred = best.predict(X_test)
print(f"Best model: {best_name}")
print("Confusion matrix [[tn, fp], [fn, tp]]:", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred, digits=3))

| | model | split | accuracy | f1 | roc_auc |
|---|---|---|---|---|---|
| 2 | logreg_balanced | test | 0.7375 | 0.5797 | 0.8231 |
| 3 | tree_balanced | test | 0.6593 | 0.5473 | 0.8061 |
| 0 | logreg_unbalanced | test | 0.8096 | 0.511 | 0.8221 |
| 1 | tree_unbalanced | test | 0.7978 | 0.4607 | 0.8029 |


Best model: logreg_balanced
Confusion matrix [[tn, fp], [fn, tp]]: [[4530 1651]
 [ 486 1474]]

Classification report:
               precision    recall  f1-score   support

           0      0.903     0.733     0.809      6181
           1      0.472     0.752     0.580      1960

    accuracy                          0.738      8141
   macro avg      0.687     0.742     0.694      8141
weighted avg      0.799     0.738     0.754      8141



- Selection by F1 (positive class): logreg_balanced becomes the best model (F1 = 0.5797 vs 0.5110 for logreg_unbalanced). AUC is almost unchanged (0.8231 vs 0.8221).
- Accuracy drops (0.738 vs 0.810), which is expected when prioritizing the minority class.

Confusion matrix (logreg_unbalanced vs logreg_balanced):
- Before: TN=5781, FP=400, FN=1150, TP=810
- Now: TN=4530, FP=1651, FN=486, TP=1474
- Changes: FN −664 (better recall), FP +1251 (worse precision), TP +664, TN −1251.
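
The deltas above, and the share of rows predicted positive by each model, can be checked with a few lines of arithmetic on the two reported matrices (variable names are illustrative):

```python
import numpy as np

# Reported confusion matrices, laid out as [[tn, fp], [fn, tp]]
before = np.array([[5781, 400], [1150, 810]])    # logreg_unbalanced
after = np.array([[4530, 1651], [486, 1474]])    # logreg_balanced

d_tn, d_fp, d_fn, d_tp = (after - before).ravel()
print(f"TN {d_tn:+d}, FP {d_fp:+d}, FN {d_fn:+d}, TP {d_tp:+d}")

# Column 1 holds fp and tp, i.e. every row predicted as positive
pos_rate_before = before[:, 1].sum() / before.sum()
pos_rate_after = after[:, 1].sum() / after.sum()
print(f"predicted positive rate: {pos_rate_before:.3f} -> {pos_rate_after:.3f}")
```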

Per-class metrics:
- Class 1 (positive): precision 0.472 (down from 0.669), recall 0.752 (up sharply from 0.413), F1 0.580 (up from 0.511).
- Class 0: precision 0.903 (up from 0.834), recall 0.733 (down from 0.935), F1 0.809 (down from 0.882).

Predicted positive rate: rises from ~14.9% to ~38.4% (the model flags ">50K" more aggressively).

AUC is stable: class_weight mainly shifts the effective decision threshold through the loss penalty; it barely changes ranking ability.
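
For reference, scikit-learn's `class_weight="balanced"` weights each class by `n_samples / (n_classes * count)`. A small sketch of the resulting weights, using the test-split class counts reported earlier as a stand-in for the training counts (an assumption; the training split is stratified, so the proportions should be close):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 6181 negatives and 1960 positives, as in the test split reported above
y_demo = np.array([0] * 6181 + [1] * 1960)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
manual = len(y_demo) / (2 * np.bincount(y_demo))   # same formula written out

print("balanced weights [class 0, class 1]:", weights.round(4))
```

Under these counts an error on a positive row costs roughly three times as much as one on a negative row, which is consistent with the model flagging many more rows as positive.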

### Summary
The class imbalance was mitigated: logreg_balanced captures many more true positives (recall rises from 0.413 to 0.752) at the cost of more false positives.

## 8. Comparison before vs after mitigation (class_weight)

In [26]:
compare_names = ["logreg_unbalanced", "logreg_balanced"]

# Global metrics for the two logistic regression variants
cmp_global = pd.DataFrame(
    [{"model": name, **eval_global(trained[name], X_test, y_test, "test")} for name in compare_names]
)
display(cmp_global)

# (grouping series, label) for each sensitive attribute present in X_test
segments = []
if "sex" in X_test.columns:
    segments.append((X_test["sex"], "sex"))
if "age" in X_test.columns:
    segments.append((age_bins, "age_bin"))  # age_bins was built in section 6
if "race" in X_test.columns:
    segments.append((X_test["race"], "race"))

# Per-group tables and max-min gaps, before and after mitigation
for group_series, group_name in segments:
    cmps = []
    for name in compare_names:
        tbl, gaps = group_metrics(trained[name], X_test, y_test, group_series, group_name)
        cmps.append({"model": name, **gaps})
        print(f"\nSegmented by {group_name} -> {name}")
        display(tbl)
    print(f"\nGaps for '{group_name}':")
    display(pd.DataFrame(cmps))

| | model | split | accuracy | f1 | roc_auc |
|---|---|---|---|---|---|
| 0 | logreg_unbalanced | test | 0.8096 | 0.511 | 0.8221 |
| 1 | logreg_balanced | test | 0.7375 | 0.5797 | 0.8231 |



Segmented by sex -> logreg_unbalanced


| | sex | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | Male | 5458 | 0.7684 | 0.5549 | 0.473 | 0.1018 | 0.2151 |
| 1 | Female | 2683 | 0.8934 | 0.1333 | 0.0748 | 0.0059 | 0.0134 |



Segmented by sex -> logreg_balanced


| | sex | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | Male | 5458 | 0.6849 | 0.613 | 0.8175 | 0.3734 | 0.509 |
| 1 | Female | 2683 | 0.8446 | 0.3495 | 0.381 | 0.0984 | 0.1293 |



Gaps for 'sex':


| | model | gap_accuracy | gap_f1 | gap_tpr | gap_fpr | gap_positive_rate |
|---|---|---|---|---|---|---|
| 0 | logreg_unbalanced | 0.125 | 0.4216 | 0.3982 | 0.0959 | 0.2017 |
| 1 | logreg_balanced | 0.1597 | 0.2635 | 0.4366 | 0.2751 | 0.3796 |



Segmented by age_bin -> logreg_unbalanced


| | age_bin | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | 30–49 | 3927 | 0.7601 | 0.4875 | 0.3727 | 0.069 | 0.162 |
| 1 | <30 | 2407 | 0.9393 | 0.1412 | 0.0902 | 0.011 | 0.0154 |
| 2 | 50–69 | 1647 | 0.7511 | 0.6103 | 0.5459 | 0.135 | 0.2817 |
| 3 | 70+ | 160 | 0.675 | 0.5273 | 0.7838 | 0.3577 | 0.4562 |



Segmented by age_bin -> logreg_balanced


| | age_bin | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 0 | 30–49 | 3927 | 0.7041 | 0.5979 | 0.7188 | 0.3024 | 0.4298 |
| 1 | <30 | 2407 | 0.8816 | 0.2674 | 0.391 | 0.0897 | 0.1064 |
| 2 | 50–69 | 1647 | 0.6387 | 0.6387 | 0.8946 | 0.5033 | 0.643 |
| 3 | 70+ | 160 | 0.4062 | 0.4025 | 0.8649 | 0.7317 | 0.7625 |



Gaps for 'age_bin':


| | model | gap_accuracy | gap_f1 | gap_tpr | gap_fpr | gap_positive_rate |
|---|---|---|---|---|---|---|
| 0 | logreg_unbalanced | 0.2643 | 0.4691 | 0.6936 | 0.3467 | 0.4409 |
| 1 | logreg_balanced | 0.4753 | 0.3714 | 0.5036 | 0.642 | 0.6561 |



Segmented by race -> logreg_unbalanced


| | race | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 1 | White | 7017 | 0.8002 | 0.5212 | 0.4253 | 0.071 | 0.1616 |
| 2 | Black | 725 | 0.8855 | 0.2783 | 0.1818 | 0.0173 | 0.0372 |
| 4 | Asian-Pac-Islander | 257 | 0.8093 | 0.5586 | 0.4697 | 0.0733 | 0.1751 |
| 3 | Amer-Indian-Eskimo | 79 | 0.8608 | 0.0 | 0.0 | 0.0286 | 0.0253 |
| 0 | Other | 63 | 0.9206 | 0.0 | 0.0 | 0.0333 | 0.0317 |



Segmented by race -> logreg_balanced


| | race | n | accuracy | f1 | tpr | fpr | positive_rate |
|---|---|---|---|---|---|---|---|
| 1 | White | 7017 | 0.7241 | 0.5872 | 0.7676 | 0.2908 | 0.4127 |
| 2 | Black | 725 | 0.8566 | 0.4694 | 0.5227 | 0.0973 | 0.149 |
| 4 | Asian-Pac-Islander | 257 | 0.7043 | 0.5422 | 0.6818 | 0.288 | 0.3891 |
| 3 | Amer-Indian-Eskimo | 79 | 0.8354 | 0.4348 | 0.5556 | 0.1286 | 0.1772 |
| 0 | Other | 63 | 0.873 | 0.2 | 0.3333 | 0.1 | 0.1111 |



Gaps for 'race':


| | model | gap_accuracy | gap_f1 | gap_tpr | gap_fpr | gap_positive_rate |
|---|---|---|---|---|---|---|
| 0 | logreg_unbalanced | 0.1204 | 0.5586 | 0.4697 | 0.056 | 0.1498 |
| 1 | logreg_balanced | 0.1687 | 0.3872 | 0.4342 | 0.1935 | 0.3016 |


Below is a consolidated summary of the three sensitive groups evaluated. Per-group details (tables and gaps) appear in the output of the previous cell; here we summarize the key findings and how they change after applying `class_weight="balanced"`.

### Sex (sex)
- Before: TPR and positive_rate are substantially higher for Male than for Female; Female has a very low positive rate and very low detection of the >50K class.
- After: TPR rises for both sexes (Male 0.473 → 0.818, Female 0.075 → 0.381), but FPR and positive_rate rise too. The absolute TPR gap (equal opportunity) does not shrink; it widens slightly (0.398 → 0.437), although the Female/Male TPR ratio improves from ≈0.16 to ≈0.47. The F1 gap narrows (0.422 → 0.264), while the FPR gap (0.096 → 0.275) and the positive_rate gap (0.202 → 0.380, demographic parity) widen.

### Age (age_bin)
- Before: TPR grows with age; positives under 30 are almost never detected; 70+ has a high TPR but also an elevated FPR. Positive_rate is much lower for younger groups than for older ones.
- After: TPR improves in every bin; FPR and positive_rate increase most in the older groups. Overall, the TPR gap shrinks (0.694 → 0.504), but the FPR gap (0.347 → 0.642) and the positive_rate gap (0.441 → 0.656) grow, and the accuracy gap worsens (0.264 → 0.475) because of additional false positives in the older bins.

### Race (race)
- Before: White has higher TPR and positive_rate than Black; groups with very low n are unstable (e.g., TPR = 0).
- After: TPR rises in every race group, which reduces the TPR gap slightly (0.470 → 0.434) and the F1 gap more clearly (0.559 → 0.387). However, FPR and positive_rate rise more in some groups (White, Asian-Pac-Islander), widening those gaps (FPR 0.056 → 0.194, positive_rate 0.150 → 0.302).

### Operational conclusion
- Mitigation via `class_weight` improves recall (TPR) for the positive class in every group and narrows the TPR gap for age and race (for sex the absolute gap widens slightly), at the cost of higher FPR and a higher predicted positive rate. This widens selection disparities (positive_rate) and error disparities (FPR) across all three attributes.
- It is advisable to explore threshold tuning (PR curves), formal metrics (e.g., demographic parity ratio and equal opportunity difference), and post-processing/rescaling to balance performance and fairness across groups.
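
One possible form of the post-processing mentioned above is a separate decision threshold per group, chosen so that each group reaches roughly the same TPR. The sketch below uses synthetic scores and two made-up groups `A` and `B`; the target TPR of 0.75 is an arbitrary illustrative value, and in practice thresholds would be fitted on a validation split rather than on the evaluation data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
group = rng.choice(["A", "B"], size=n)
y_true = rng.binomial(1, 0.3, size=n)

# Synthetic scores (assumption): positives in group B score lower on average,
# mimicking a group whose positives are rarely detected at a global threshold.
shift = np.where(group == "B", -0.15, 0.0)
scores = 0.35 + (0.30 + shift) * y_true + rng.normal(0, 0.15, n)

def tpr(pred, y):
    # Share of true positives that were predicted positive
    return float(pred[y == 1].mean())

target_tpr = 0.75
thresholds = {}
for g in np.unique(group):
    pos_scores = scores[(group == g) & (y_true == 1)]
    # Threshold at which about target_tpr of this group's positives pass
    thresholds[g] = np.quantile(pos_scores, 1 - target_tpr)

global_pred = scores >= 0.5
group_pred = scores >= np.array([thresholds[g] for g in group])

tpr_global, tpr_group = {}, {}
for g in np.unique(group):
    m = group == g
    tpr_global[g] = tpr(global_pred[m], y_true[m])
    tpr_group[g] = tpr(group_pred[m], y_true[m])
    print(f"{g}: TPR global={tpr_global[g]:.3f}, per-group threshold={tpr_group[g]:.3f}")
```

Equalizing TPR this way says nothing about FPR or positive_rate, so the per-group tables would still need to be recomputed to see how those gaps move.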