# 🏀 NBA Playoffs Simulator: Feature Engineering
**Project:** I simulated the NBA Playoffs thousands of times… and found an unexpected contender

**Notebook 02:** Building smart features

This is the **key step** of the pipeline. This is where the human thinks
before the machine does:
- We transform raw stats into signals with basketball meaning
- We build the training dataset for XGBoost
- Every feature has a clear intuition: it is not magic, it is applied thinking

In [None]:
# ============================================================
# SETUP: Mount Drive and load data from Notebook 01
# ============================================================
import pandas as pd
import numpy as np
import warnings
import os

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Project paths
PROJECT_DIR = '/content/drive/MyDrive/nba-playoffs-simulator'
DATA_DIR = f'{PROJECT_DIR}/data'

print(f'📁 Project: {PROJECT_DIR}')
print(f'📁 Data:    {DATA_DIR}')
print(f'\nAvailable files:')
for f in sorted(os.listdir(DATA_DIR)):
    size = os.path.getsize(f'{DATA_DIR}/{f}') / 1024
    print(f'  📄 {f:<36} ({size:.1f} KB)')

In [None]:
# ============================================================
# Load datasets from Notebook 01
# ============================================================

# Current stats for the 2025-26 season
df_current = pd.read_csv(f'{DATA_DIR}/team_stats_2026.csv')

# Historical playoff series (2015-2025)
df_playoffs = pd.read_csv(f'{DATA_DIR}/historical_playoffs.csv')

# Regular-season stats by season (historical)
df_hist_stats = pd.read_csv(f'{DATA_DIR}/historical_team_stats.csv')

# Historical standings
df_hist_standings = pd.read_csv(f'{DATA_DIR}/historical_standings.csv')

print(f'✅ Data loaded:')
print(f'  → Current stats:        {df_current.shape}')
print(f'  → Historical playoffs:  {df_playoffs.shape}')
print(f'  → Historical stats:     {df_hist_stats.shape}')
print(f'  → Historical standings: {df_hist_standings.shape}')

---
## 🔍 Section 1: Quick data exploration

Before building features, we need to understand what we have:
a quick look at the available columns and the quality of the data.
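
Beyond null counts, it can help to confirm that the join keys line up before any features are built. The sketch below is illustrative only: `key_coverage` is a hypothetical helper, and the toy frames merely reuse the `winner_id`/`season` and `TEAM_ID`/`SEASON` column names that appear later in this notebook. The season format in the toy data is an assumption.

```python
import pandas as pd


def key_coverage(series_df, stats_df, id_col, stats_id_col="TEAM_ID",
                 season_col="season", stats_season_col="SEASON"):
    """Fraction of (team, season) pairs in series_df that have a stats row."""
    keys = series_df[[id_col, season_col]].drop_duplicates()
    stats_keys = stats_df[[stats_id_col, stats_season_col]].drop_duplicates()
    merged = keys.merge(
        stats_keys,
        left_on=[id_col, season_col],
        right_on=[stats_id_col, stats_season_col],
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    return (merged["_merge"] == "both").mean()


# Toy data (hypothetical IDs and season label)
toy_series = pd.DataFrame({"winner_id": [1, 2, 3], "season": ["2023-24"] * 3})
toy_stats = pd.DataFrame({"TEAM_ID": [1, 2], "SEASON": ["2023-24"] * 2})

coverage = key_coverage(toy_series, toy_stats, "winner_id")
print(f"Key coverage: {coverage:.0%}")
```

A coverage below 100% would mean some series are silently dropped when stats are looked up, so it is worth checking for both `winner_id` and `loser_id`.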

In [None]:
# ============================================================
# 1.1: Which columns do the historical stats have?
# ============================================================
print('📊 Columns in historical stats:\n')
for col in df_hist_stats.columns:
    non_null = df_hist_stats[col].notna().sum()
    print(f'  {col:<24} → {non_null}/{len(df_hist_stats)} values')

print(f'\n📊 Seasons available: {sorted(df_hist_stats["SEASON"].unique())}')

In [None]:
# ============================================================
# 1.2: What do the historical series look like?
# ============================================================
print('🏆 Structure of the playoff series:\n')
print(df_playoffs.head(10).to_string(index=False))

print(f'\n📊 Series per round:')
round_names = {1: 'First Round', 2: 'Conf. Semis', 3: 'Conf. Finals', 4: 'NBA Finals'}
for r in sorted(df_playoffs['round'].unique()):
    count = len(df_playoffs[df_playoffs['round'] == r])
    print(f'  Round {r} ({round_names.get(r, "?")}): {count} series')

In [None]:
# ============================================================
# 1.3: Which columns do the historical standings have?
# ============================================================
print('🌱 Columns in historical standings:\n')
print(list(df_hist_standings.columns))
print(f'\n📊 Preview:')
# Show only the key columns
key_stand_cols = [c for c in ['TeamID', 'TeamCity', 'TeamName', 'Conference',
                               'PlayoffRank', 'Record', 'SEASON']
                  if c in df_hist_standings.columns]
df_hist_standings[key_stand_cols].head(10)

---
## ⚙️ Section 2: Build the Training Dataset

XGBoost needs to learn from **historical matchups**: given the profiles of
two teams that met in the playoffs, who won the series?

**Training set structure:**
- Each row = one playoff series
- Features = stat differentials between the two teams (Team A - Team B)
- Target = did Team A win? (1 = yes, 0 = no)

**Why differentials?** Because the model does not care whether a team has a
Net Rating of +8; it cares whether that team is **better than its opponent**.
A +8 vs +2 matchup is very different from a +8 vs +7 matchup.
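
As a minimal illustration of that point, the two cases from the paragraph above can be written out as a toy table. The column names `net_rating_a` and `net_rating_b` are made up for this sketch and are not columns in the project data.

```python
import pandas as pd

# Hypothetical Net Ratings mirroring the two cases in the text
toy = pd.DataFrame({
    "matchup": ["+8 vs +2", "+8 vs +7"],
    "net_rating_a": [8.0, 8.0],
    "net_rating_b": [2.0, 7.0],
})

# Same absolute rating for Team A, very different edge over the opponent
toy["net_rating_diff"] = toy["net_rating_a"] - toy["net_rating_b"]
print(toy)
```

Team A looks identical in both rows if you only read its own rating; the differential is what separates a comfortable favorite from a near coin flip.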

In [None]:
# ============================================================
# 2.1: Assign a seed to each team in each historical series
# ============================================================

def get_team_seed(team_id, season, standings_df):
    """
    Look up a team's seed (PlayoffRank) for a given season.
    Returns a (seed, conference) tuple, or (None, None) if not found.
    """
    # Match on TeamID; without that column there is nothing to look up
    if 'TeamID' in standings_df.columns:
        mask = (standings_df['TeamID'] == team_id) & (standings_df['SEASON'] == season)
    else:
        return None, None

    match = standings_df[mask]
    if len(match) == 0:
        return None, None

    row = match.iloc[0]
    seed = row.get('PlayoffRank', None)
    conf = row.get('Conference', None)
    return seed, conf


# Add seeds to each series
seeds_winner = []
seeds_loser = []
confs_winner = []

for _, row in df_playoffs.iterrows():
    w_seed, w_conf = get_team_seed(row['winner_id'], row['season'], df_hist_standings)
    l_seed, l_conf = get_team_seed(row['loser_id'], row['season'], df_hist_standings)
    seeds_winner.append(w_seed)
    seeds_loser.append(l_seed)
    confs_winner.append(w_conf)

df_playoffs['winner_seed'] = seeds_winner
df_playoffs['loser_seed'] = seeds_loser
df_playoffs['conference'] = confs_winner

# Verify
has_seeds = df_playoffs['winner_seed'].notna().sum()
print(f'✅ Seeds assigned: {has_seeds}/{len(df_playoffs)} series')
print(f'\n📊 Preview:')
df_playoffs[['season', 'winner_abbr', 'winner_seed', 'loser_abbr', 'loser_seed',
             'round', 'total_games']].head(10)

In [None]:
# ============================================================
# 2.2: Define Team A as the better seed (favorite)
# ============================================================
# This standardizes the perspective: Team A is always the favorite by seed.
# The model then learns: "given that A is the better seed, did it win or not?"

rows = []

for _, s in df_playoffs.iterrows():
    w_seed = s['winner_seed']
    l_seed = s['loser_seed']

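    # Skip series with a missing seed
    if pd.isna(w_seed) or pd.isna(l_seed):
        continue

    # Team A = better seed (lower number).
    # Note: on a seed tie (possible in the Finals, where seeds come from
    # different conferences) the winner becomes Team A, so team_a_won = 1.
    if w_seed <= l_seed:
        # The winner was the favorite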
        rows.append({
            'season': s['season'],
            'round': s['round'],
            'team_a_id': s['winner_id'],
            'team_a_abbr': s['winner_abbr'],
            'team_a_seed': int(w_seed),
            'team_b_id': s['loser_id'],
            'team_b_abbr': s['loser_abbr'],
            'team_b_seed': int(l_seed),
            'team_a_won': 1,
            'series_games': s['total_games']
        })
    else:
        # The winner was the underdog (upset)
        rows.append({
            'season': s['season'],
            'round': s['round'],
            'team_a_id': s['loser_id'],
            'team_a_abbr': s['loser_abbr'],
            'team_a_seed': int(l_seed),
            'team_b_id': s['winner_id'],
            'team_b_abbr': s['winner_abbr'],
            'team_b_seed': int(w_seed),
            'team_a_won': 0,  # The favorite did NOT win
            'series_games': s['total_games']
        })

df_matchups = pd.DataFrame(rows)

print(f'üìã Matchups estructurados: {len(df_matchups)} series')
print(f'\nüìä Tasa de victoria del favorito (mejor seed):')
print(f'   {df_matchups["team_a_won"].mean():.1%}\n')

# Tasa por ronda
print('üìä Tasa de victoria del favorito por ronda:')
for r in sorted(df_matchups['round'].unique()):
    subset = df_matchups[df_matchups['round'] == r]
    rate = subset['team_a_won'].mean()
    print(f'   Ronda {r}: {rate:.1%} ({len(subset)} series)')

---
## 🧠 Section 3: Feature Engineering (the brain of the model)

This is the most important step of the whole project. This is where
**the human adds real value**: each feature reflects an intuition
about what wins playoff series.

**Features we are going to build (as Team A - Team B differentials):**

| Feature | Basketball intuition |
|---|---|
| `NET_RATING_diff` | Who is the better team overall? |
| `OFF_RATING_diff` | Who has the better offense? |
| `DEF_RATING_diff` | Who defends better? (sign flipped, so positive favors Team A) |
| `W_PCT_diff` | Who won more in the regular season? |
| `PACE_diff` | Who controls the pace? |
| `EFG_PCT_diff` | Who shoots better? |
| `TM_TOV_PCT_diff` | Who takes better care of the ball? (sign flipped, so positive favors Team A) |
| `REB_PCT_diff` | Who controls the boards? |
| `seed_diff` | Difference in seeding position |

The code also builds differentials for `TS_PCT`, `PIE`, `AST_RATIO`, `OREB_PCT`, and `DREB_PCT`.

**Why differentials?** Think of it as a head-to-head.
It does not matter that your offense is worth 115 points per 100 possessions.
What matters is how much **better** you are than your opponent.
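
The same differential idea can also be expressed with two merges instead of a row-by-row loop. What follows is a sketch under assumptions, not this notebook's implementation: `diff_features_vectorized` and `LOWER_IS_BETTER` are hypothetical names, the toy numbers are invented, and, unlike a loop that fills gaps with 0, missing stats here simply stay as NaN.

```python
import pandas as pd

# Stats where a lower value is better, so the sign of the difference is flipped
LOWER_IS_BETTER = {"DEF_RATING", "TM_TOV_PCT"}


def diff_features_vectorized(matchups, stats, features):
    """Team A minus Team B differentials via two merges (illustrative sketch)."""
    cols = ["TEAM_ID", "SEASON"] + list(features)
    a = stats[cols].rename(columns={"TEAM_ID": "team_a_id", "SEASON": "season"})
    b = stats[cols].rename(columns={"TEAM_ID": "team_b_id", "SEASON": "season"})

    # Inner joins drop any matchup where either team has no stats row
    merged = (
        matchups
        .merge(a, on=["team_a_id", "season"], how="inner")
        .merge(b, on=["team_b_id", "season"], how="inner", suffixes=("_a", "_b"))
    )

    out = merged[["season", "team_a_id", "team_b_id"]].copy()
    for feat in features:
        sign = -1.0 if feat in LOWER_IS_BETTER else 1.0
        out[f"{feat}_diff"] = sign * (merged[f"{feat}_a"] - merged[f"{feat}_b"])
    return out


# Toy data: team 1 is better on both offense-adjusted margin and defense
toy_stats = pd.DataFrame({
    "TEAM_ID": [1, 2],
    "SEASON": [2024, 2024],
    "NET_RATING": [8.0, 2.0],
    "DEF_RATING": [108.0, 112.0],
})
toy_matchups = pd.DataFrame({"season": [2024], "team_a_id": [1], "team_b_id": [2]})

result = diff_features_vectorized(toy_matchups, toy_stats, ["NET_RATING", "DEF_RATING"])
print(result)
```

With this sign convention every `_diff` column reads the same way: a positive value means Team A holds the edge on that stat.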

In [None]:
# ============================================================
# 3.1: Feature Engineering function for matchups
# ============================================================

# Stats that we will turn into differentials
DIFF_FEATURES = [
    'NET_RATING',
    'OFF_RATING',
    'DEF_RATING',
    'W_PCT',
    'PACE',
    'EFG_PCT',
    'TM_TOV_PCT',
    'REB_PCT',
    'TS_PCT',
    'PIE',
    'AST_RATIO',
    'OREB_PCT',
    'DREB_PCT'
]


def build_matchup_features(matchups_df, stats_df):
    """
    For each playoff series, compute the feature differentials
    between Team A and Team B using the stats from the
    corresponding regular season.
    """
    feature_rows = []

    for _, matchup in matchups_df.iterrows():
        season = matchup['season']

        # Look up both teams' stats for that season
        season_stats = stats_df[stats_df['SEASON'] == season]

        team_a_stats = season_stats[season_stats['TEAM_ID'] == matchup['team_a_id']]
        team_b_stats = season_stats[season_stats['TEAM_ID'] == matchup['team_b_id']]

        if len(team_a_stats) == 0 or len(team_b_stats) == 0:
            continue

        team_a_stats = team_a_stats.iloc[0]
        team_b_stats = team_b_stats.iloc[0]

        row = {
            'season': season,
            'round': matchup['round'],
            'team_a_abbr': matchup['team_a_abbr'],
            'team_b_abbr': matchup['team_b_abbr'],
            'team_a_seed': matchup['team_a_seed'],
            'team_b_seed': matchup['team_b_seed'],
            'team_a_won': matchup['team_a_won'],
        }

        # Compute differentials
        for feat in DIFF_FEATURES:
            if feat in team_a_stats.index and feat in team_b_stats.index:
                val_a = team_a_stats[feat]
                val_b = team_b_stats[feat]
                if pd.notna(val_a) and pd.notna(val_b):
                    # For DEF_RATING and TM_TOV_PCT, lower is better, so we
                    # flip the sign: positive always means Team A is better
                    if feat in ['DEF_RATING', 'TM_TOV_PCT']:
                        row[f'{feat}_diff'] = round(val_b - val_a, 4)
                    else:
                        row[f'{feat}_diff'] = round(val_a - val_b, 4)
                else:
                    # Missing value: fall back to 0 ("no difference")
                    row[f'{feat}_diff'] = 0
            else:
                # Stat column not present: fall back to 0 as well
                row[f'{feat}_diff'] = 0

        # Seed difference (positive = A has the better seed)
        row['seed_diff'] = matchup['team_b_seed'] - matchup['team_a_seed']

        feature_rows.append(row)

    return pd.DataFrame(feature_rows)


print('✅ Feature Engineering function ready')

In [None]:
# ============================================================
# 3.2: Build the training dataset
# ============================================================
df_training = build_matchup_features(df_matchups, df_hist_stats)

print(f'📋 Training dataset: {df_training.shape[0]} series × {df_training.shape[1]} columns')
print(f'\n📊 Differential features generated:')
diff_cols = [c for c in df_training.columns if c.endswith('_diff')]
for col in diff_cols:
    print(f'  → {col}')

print(f'\n📊 Target (team_a_won): {df_training["team_a_won"].value_counts().to_dict()}')
print(f'   Balance: {df_training["team_a_won"].mean():.1%} favorite wins')

In [None]:
# ============================================================
# 3.3: Explore the differentials: do they have predictive power?
# ============================================================
import matplotlib.pyplot as plt
import seaborn as sns

# Dark style so it looks sharp on camera
plt.style.use('dark_background')

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
fig.suptitle('Do differentials predict playoff wins?',
             fontsize=16, fontweight='bold', y=1.02)

key_features = ['NET_RATING_diff', 'OFF_RATING_diff', 'DEF_RATING_diff',
                'W_PCT_diff', 'EFG_PCT_diff', 'seed_diff']
titles = ['Net Rating', 'Offense', 'Defense',
          'Win %', 'eFG%', 'Seed']

for ax, feat, title in zip(axes.flatten(), key_features, titles):
    if feat not in df_training.columns:
        ax.set_visible(False)
        continue

    wins = df_training[df_training['team_a_won'] == 1][feat]
    losses = df_training[df_training['team_a_won'] == 0][feat]

    ax.hist(wins, bins=15, alpha=0.7, label='Favorite WON', color='#00E676')
    ax.hist(losses, bins=15, alpha=0.7, label='Favorite LOST', color='#FF5252')
    ax.set_title(f'Differential: {title}', fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)
    ax.set_xlabel(f'{feat}')

plt.tight_layout()
plt.savefig('feature_distributions.png', dpi=150, bbox_inches='tight',
            facecolor='black', edgecolor='none')
plt.show()

print('\nüìä Si las distribuciones se separan claramente, el feature es predictivo.')
print('   Verde = favorito gan√≥ | Rojo = favorito perdi√≥ (upset)')

In [None]:
# ============================================================
# 3.4: Correlation of each feature with the outcome
# ============================================================
print('📊 Correlation of each feature with a favorite win:\n')

correlations = {}
for col in diff_cols:
    if col in df_training.columns:
        corr = df_training[col].corr(df_training['team_a_won'])
        # A constant column (e.g. all zeros) yields NaN; skip it so the
        # bar-length int() conversion below cannot fail
        if pd.notna(corr):
            correlations[col] = corr

corr_sorted = sorted(correlations.items(), key=lambda x: abs(x[1]), reverse=True)

for feat, corr in corr_sorted:
    bar = '█' * int(abs(corr) * 40)
    sign = '+' if corr > 0 else '-'
    print(f'  {feat:<24} {sign}{abs(corr):.3f}  {bar}')

print(f'\n💡 Features with higher absolute correlation → more (linear) predictive power')

---
## 🏀 Section 4: Current team profiles (2025-26)

Now we apply the same Feature Engineering to the 16 teams that
will qualify for the playoffs this season.

These profiles will feed the Monte Carlo simulation.
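
To make the link to the simulation concrete, here is one way a pair of team profiles could be turned into a single row of differential features for a hypothetical series. This is a sketch under assumptions: `matchup_row` is a made-up helper, the team names and numbers are invented, and only the sign conventions (flip for lower-is-better stats, `seed_diff` as B minus A) follow the training code.

```python
import pandas as pd

# Stats where a lower value is better, so the sign of the difference is flipped
LOWER_IS_BETTER = {"DEF_RATING", "TM_TOV_PCT"}


def matchup_row(profile_a, profile_b, features):
    """Build one row of differential features for a hypothetical series."""
    row = {}
    for feat in features:
        delta = profile_a[feat] - profile_b[feat]
        row[f"{feat}_diff"] = -delta if feat in LOWER_IS_BETTER else delta
    # Positive seed_diff means Team A holds the better (lower) seed
    row["seed_diff"] = profile_b["SEED"] - profile_a["SEED"]
    return row


# Invented profiles for a 1-vs-8 first-round pairing
toy_profiles = pd.DataFrame({
    "TEAM_NAME": ["Team A", "Team B"],
    "SEED": [1, 8],
    "NET_RATING": [9.5, 1.5],
    "DEF_RATING": [108.0, 113.0],
}).set_index("TEAM_NAME")

features = ["NET_RATING", "DEF_RATING"]
row = matchup_row(toy_profiles.loc["Team A"], toy_profiles.loc["Team B"], features)
print(row)
```

Keeping the sign conventions identical to the training set matters here: a model trained on "positive favors Team A" would be fed inverted signals otherwise.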

In [None]:
# ============================================================
# 4.1: Identify playoff teams (top 8 per conference)
# ============================================================

# Split by conference using the standings
east_teams = df_current[df_current['Conference'] == 'East'].nsmallest(8, 'PlayoffRank')
west_teams = df_current[df_current['Conference'] == 'West'].nsmallest(8, 'PlayoffRank')

df_playoff_teams = pd.concat([east_teams, west_teams])

print(f'üèÄ Equipos clasificados a playoffs: {len(df_playoff_teams)}\n')

print('WESTERN CONFERENCE:')
for _, t in west_teams.iterrows():
    seed = int(t['PlayoffRank'])
    print(f'  ({seed}) {t["TEAM_NAME"]:<28} {int(t["W"])}-{int(t["L"])}  '
          f'Net: {t["NET_RATING"]:+.1f}')

print(f'\nEASTERN CONFERENCE:')
for _, t in east_teams.iterrows():
    seed = int(t['PlayoffRank'])
    print(f'  ({seed}) {t["TEAM_NAME"]:<28} {int(t["W"])}-{int(t["L"])}  '
          f'Net: {t["NET_RATING"]:+.1f}')

In [None]:
# ============================================================
# 4.2: Build full profiles for the simulation
# ============================================================

# The feature columns we need to compute differentials
feature_cols = DIFF_FEATURES + [
    'TEAM_ID', 'TEAM_NAME', 'W', 'L', 'W_PCT', 'GP',
    'Conference', 'PlayoffRank',
    'consistency_score', 'clutch_win_pct', 'momentum_delta',
    'last15_win_pct', 'last15_avg_plus_minus',
    'season_avg_plus_minus', 'SEASON'
]

# Keep only columns that exist, dropping repeats (W_PCT is listed twice above)
# so the profile table does not end up with duplicate columns
available_cols = [c for c in dict.fromkeys(feature_cols) if c in df_playoff_teams.columns]
df_profiles = df_playoff_teams[available_cols].copy()

# Add a clean seed column
df_profiles['SEED'] = df_profiles['PlayoffRank'].astype(int)

print(f'📋 Playoff profiles: {df_profiles.shape[0]} teams × {df_profiles.shape[1]} columns')
print(f'\n📊 Features available per team:')
for col in available_cols:
    if col not in ['TEAM_ID', 'TEAM_NAME', 'SEASON', 'Conference', 'PlayoffRank']:
        print(f'  → {col}')

In [None]:
# ============================================================
# 4.3: Rank teams by key features
# ============================================================

print('🏆 PLAYOFF TEAM RANKINGS: 2025-26 Season\n')

# Net Rating (main indicator of team quality)
print('📊 By Net Rating (who is the best team?):')
for i, (_, t) in enumerate(df_profiles.sort_values('NET_RATING', ascending=False).iterrows(), 1):
    conf = 'W' if t['Conference'] == 'West' else 'E'
    print(f'  {i:>2}. [{conf}{int(t["SEED"])}] {t["TEAM_NAME"]:<28} {t["NET_RATING"]:+.2f}')

# Momentum
if 'momentum_delta' in df_profiles.columns:
    print(f'\nüìä Por Momentum (¬øqui√©n llega m√°s caliente?):')
    for i, (_, t) in enumerate(
        df_profiles.sort_values('momentum_delta', ascending=False).head(8).iterrows(), 1):
        conf = 'W' if t['Conference'] == 'West' else 'E'
        print(f'  {i:>2}. [{conf}{int(t["SEED"])}] {t["TEAM_NAME"]:<28} {t["momentum_delta"]:+.2f}')

# Clutch
if 'clutch_win_pct' in df_profiles.columns:
    print(f'\nüìä Por Clutch Factor (¬øqui√©n gana juegos cerrados?):')
    for i, (_, t) in enumerate(
        df_profiles.sort_values('clutch_win_pct', ascending=False).head(8).iterrows(), 1):
        conf = 'W' if t['Conference'] == 'West' else 'E'
        print(f'  {i:>2}. [{conf}{int(t["SEED"])}] {t["TEAM_NAME"]:<28} {t["clutch_win_pct"]:.1%}')

---
## 💾 Section 5: Save processed datasets

We save:
1. **Training set** → to train XGBoost in notebook 03
2. **Current profiles** → to feed the Monte Carlo simulation
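
Since a later notebook is expected to read these files back, a small round-trip check can catch column drift early. The sketch below is an assumption about how a downstream notebook might consume a saved feature list; `save_feature_list` and `load_feature_list` are hypothetical helpers, and a temporary directory stands in for Google Drive.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_feature_list(columns, path):
    """Write one feature name per line."""
    Path(path).write_text("\n".join(columns), encoding="utf-8")


def load_feature_list(path):
    """Read feature names back, ignoring empty lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line]


# Toy training frame with two differential features and the target
toy_training = pd.DataFrame({
    "NET_RATING_diff": [6.0],
    "seed_diff": [7],
    "team_a_won": [1],
})
feature_list = [c for c in toy_training.columns if c.endswith("_diff")]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_columns.txt"
    save_feature_list(feature_list, path)
    loaded = load_feature_list(path)

# Any name listed here would be missing from the data at training time
missing = [c for c in loaded if c not in toy_training.columns]
X = toy_training[loaded]  # columns in exactly the saved order
print(loaded, missing)
```

Selecting columns through the loaded list, rather than re-deriving them, keeps the column order identical between training and simulation.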

In [None]:
# ============================================================
# 5.1: Save datasets (local + Google Drive)
# ============================================================
import shutil

os.makedirs('data', exist_ok=True)

datasets = {
    'training_matchups.csv': df_training,
    'team_profiles_2026.csv': df_profiles,
    'historical_matchups_raw.csv': df_matchups
}

for filename, df in datasets.items():
    local_path = f'data/{filename}'
    df.to_csv(local_path, index=False)
    shutil.copy(local_path, f'{DATA_DIR}/{filename}')
    print(f'✅ {filename:<32} → {df.shape}  [local + Drive]')

# Also save the feature list for consistency across notebooks
feature_list = [c for c in df_training.columns if c.endswith('_diff')]
with open(f'{DATA_DIR}/feature_columns.txt', 'w') as f:
    f.write('\n'.join(feature_list))
print(f'\n✅ feature_columns.txt → {len(feature_list)} features saved')

print(f'\n📁 Files in Drive: {DATA_DIR}/')
for f_name in sorted(os.listdir(DATA_DIR)):
    print(f'  📄 {f_name}')

---
## ✅ Summary: Feature Engineering complete

### What we built:

**1. Training dataset** (`training_matchups.csv`)
- ~150 historical playoff series (2015-2025)
- Each series has stat differentials between the two teams
- Binary target: did the favorite win?

**2. Current profiles** (`team_profiles_2026.csv`)
- 16 teams qualified for the 2025-26 playoffs
- Advanced stats + momentum + clutch + consistency
- Ready for computing matchup probabilities

**3. Feature columns** (`feature_columns.txt`)
- Standardized list to guarantee consistency across notebooks

### Key insight from the EDA:
The highest correlation with playoff wins is usually the **Net Rating differential**.
But the model can capture interactions that a single number does not see.
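
A synthetic example can show why a per-feature correlation may miss an interaction. The data below is randomly generated and has nothing to do with NBA results; it only illustrates that a feature can be nearly uncorrelated with the target on its own while still being informative in combination with another.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(10_000)
x2 = rng.standard_normal(10_000)

# The target depends only on whether x1 and x2 share the same sign
y = (x1 * x2 > 0).astype(float)

# On its own, x1 carries almost no linear signal about y
corr_x1 = np.corrcoef(x1, y)[0, 1]

# The interaction term x1 * x2 carries a strong signal
corr_product = np.corrcoef(x1 * x2, y)[0, 1]

print(f"corr(x1, y)      = {corr_x1:+.3f}")
print(f"corr(x1 * x2, y) = {corr_product:+.3f}")
```

Tree-based models can pick up this kind of pattern through successive splits, which is one reason a correlation ranking is a useful first look rather than a final verdict on a feature.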

### ➡️ Next notebook: `03_model_calibration.ipynb`
There we train the XGBoost model, validate its calibration, and leave it ready to simulate.