# 🧮 NBA 2026 Prediction, Notebook 02: Feature Engineering (TEAM)

In this notebook we start from the processed datasets:

- `team_games_2025_26.csv`: team-game level dataset (one row per team per game).
- `games_2025_26_basic.csv`: game-level dataset (HOME vs AWAY).

Goals:

1. Compute advanced per-team metrics:
   - Estimated possessions.
   - Offensive/defensive/net ratings.
   - Pace of play (PACE).
   - Rolling statistics over the most recent games.
2. Project those metrics to the game level (HOME vs AWAY).
3. Create the *targets* used by the models:
   - `HOME_WIN` (classification).
   - `MARGIN_HOME` (regression).
   - `TOTAL_POINTS` (regression).
   - Additional threshold-based targets (`BLOWOUT`, `OVER_*`).
4. Save the final dataset `games_2025_26_features.csv`.


In [24]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", 80)

df_team_games = pd.read_csv("../data/processed/team_games_2025_26.csv", parse_dates=["GAME_DATE"])
df_games = pd.read_csv("../data/processed/games_2025_26_basic.csv", parse_dates=["GAME_DATE"])

df_team_games.head()


Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN,MARGIN,FGA,FTA,OREB,TOV,MATCHUP,SEASON_ID
0,22500082,2025-10-22,1610612737,ATL,Atlanta Hawks,1,118,138,0,-20,90,37,8,16,ATL vs. TOR,22025
1,22500090,2025-10-24,1610612737,ATL,Atlanta Hawks,0,111,107,1,4,85,28,9,16,ATL @ ORL,22025
2,22500101,2025-10-25,1610612737,ATL,Atlanta Hawks,1,100,117,0,-17,85,21,15,17,ATL vs. OKC,22025
3,22500115,2025-10-27,1610612737,ATL,Atlanta Hawks,0,123,128,0,-5,98,20,8,6,ATL @ CHI,22025
4,22500130,2025-10-29,1610612737,ATL,Atlanta Hawks,0,117,112,1,5,93,22,13,8,ATL @ BKN,22025


In [25]:
df_team_games.columns.tolist()
# should include: GAME_ID, TEAM_ID, TEAM_NAME, IS_HOME, POINTS_FOR, POINTS_AGAINST, MARGIN, WIN, FGA, FTA, OREB, TOV, ...


['GAME_ID',
 'GAME_DATE',
 'TEAM_ID',
 'TEAM_ABBREVIATION',
 'TEAM_NAME',
 'IS_HOME',
 'POINTS_FOR',
 'POINTS_AGAINST',
 'WIN',
 'MARGIN',
 'FGA',
 'FTA',
 'OREB',
 'TOV',
 'MATCHUP',
 'SEASON_ID']

In [26]:
# Estimated possessions per team and game
df_team_games["POSSESSIONS"] = (
    df_team_games["FGA"]
    - df_team_games["OREB"]
    + df_team_games["TOV"]
    + 0.44 * df_team_games["FTA"]
)

# Avoid division by zero
df_team_games["POSSESSIONS"] = df_team_games["POSSESSIONS"].replace(0, np.nan)

# Offensive / defensive / net ratings (points per 100 possessions).
# Note: DEF_RTG divides by the team's own possessions, as an approximation
# of the opponent's possessions.
df_team_games["OFF_RTG"] = 100 * df_team_games["POINTS_FOR"] / df_team_games["POSSESSIONS"]
df_team_games["DEF_RTG"] = 100 * df_team_games["POINTS_AGAINST"] / df_team_games["POSSESSIONS"]
df_team_games["NET_RTG"] = df_team_games["OFF_RTG"] - df_team_games["DEF_RTG"]

# PACE: average possessions of the two teams in the game
df_team_games["PACE"] = (
    df_team_games
    .groupby("GAME_ID")["POSSESSIONS"]
    .transform("mean")
)

df_team_games[["GAME_ID", "TEAM_ID", "POSSESSIONS", "OFF_RTG", "DEF_RTG", "NET_RTG", "PACE"]].head()


Unnamed: 0,GAME_ID,TEAM_ID,POSSESSIONS,OFF_RTG,DEF_RTG,NET_RTG,PACE
0,22500082,1610612737,114.28,103.255163,120.756038,-17.500875,114.52
1,22500090,1610612737,104.32,106.403374,102.569018,3.834356,104.3
2,22500101,1610612737,96.24,103.906899,121.571072,-17.664173,97.8
3,22500115,1610612737,104.8,117.366412,122.137405,-4.770992,106.02
4,22500130,1610612737,97.68,119.77887,114.660115,5.118755,99.28
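As a quick check of the possession estimate, the sketch below recomputes the first Atlanta row by hand from the box-score values shown earlier (FGA 90, OREB 8, TOV 16, FTA 37, 118 points scored, 138 allowed). The variable names are illustrative only. As in the cell above, the defensive rating here divides by the team's own possessions.

```python
# Box-score values for ATL in game 22500082, copied from the first table
fga, oreb, tov, fta = 90, 8, 16, 37
points_for, points_against = 118, 138

# Same estimate as the notebook: FGA - OREB + TOV + 0.44 * FTA
possessions = fga - oreb + tov + 0.44 * fta

# Ratings expressed per 100 possessions
off_rtg = 100 * points_for / possessions
def_rtg = 100 * points_against / possessions
net_rtg = off_rtg - def_rtg

print(round(possessions, 2), round(off_rtg, 2), round(def_rtg, 2), round(net_rtg, 2))
```

The results should agree with the first row of the output table (`POSSESSIONS`, `OFF_RTG`, `DEF_RTG`, `NET_RTG`) up to rounding.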


## 3. Fatigue: rest days between games (`REST_DAYS`)

For each team, we compute how many days have elapsed between this game and the previous one.

- We sort by `GAME_DATE`.
- For each `TEAM_ID`, we measure the difference in days from the previous game.
- Each team's first game has no previous game, so it gets `NaN`, which we impute later.

The idea: more days means more rest; fewer days (down to a back-to-back, `REST_DAYS == 1`) means more fatigue.


In [27]:
# Ensure rows are ordered by team and date
df_team_games = df_team_games.sort_values(["TEAM_ID", "GAME_DATE"]).copy()

def add_rest_days(team_df: pd.DataFrame) -> pd.DataFrame:
    team_df = team_df.sort_values("GAME_DATE").copy()
    prev_date = team_df["GAME_DATE"].shift(1)
    # Difference in days; subtract 1 if you want to exclude the game day itself
    team_df["REST_DAYS"] = (team_df["GAME_DATE"] - prev_date).dt.days
    return team_df

df_team_games = (
    df_team_games
    .groupby("TEAM_ID", group_keys=False)
    .apply(add_rest_days)
)

df_team_games[["TEAM_ID", "GAME_DATE", "REST_DAYS"]].head(10)



Unnamed: 0,TEAM_ID,GAME_DATE,REST_DAYS
0,1610612737,2025-10-22,
1,1610612737,2025-10-24,2.0
2,1610612737,2025-10-25,1.0
3,1610612737,2025-10-27,2.0
4,1610612737,2025-10-29,2.0
5,1610612737,2025-10-31,2.0
6,1610612737,2025-11-02,2.0
7,1610612737,2025-11-04,2.0
8,1610612737,2025-11-07,3.0
9,1610612737,2025-11-08,1.0
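The same rest-day computation can also be sketched without `apply`, using a grouped `diff`. The toy frame below and the `IS_B2B` back-to-back flag are hypothetical, added only for illustration; the notebook itself does not create that column.

```python
import pandas as pd

# Toy schedule for two teams (dates are made up for the example)
toy = pd.DataFrame({
    "TEAM_ID": [1, 1, 1, 2, 2],
    "GAME_DATE": pd.to_datetime(
        ["2025-10-22", "2025-10-24", "2025-10-25", "2025-10-21", "2025-10-24"]
    ),
})

# Sort so that diff() compares each game with the team's previous game
toy = toy.sort_values(["TEAM_ID", "GAME_DATE"])

# Days since the previous game; the first game of each team stays NaN
toy["REST_DAYS"] = toy.groupby("TEAM_ID")["GAME_DATE"].diff().dt.days

# Hypothetical flag: 1 when the team also played the day before
toy["IS_B2B"] = (toy["REST_DAYS"] == 1).astype(int)

print(toy)
```

A grouped `diff` avoids the per-group Python function call, which may matter once several seasons are concatenated.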


## 4. Win/loss streak (`STREAK`)

We define `STREAK` as:

- The number of **consecutive wins** before the game (positive).
- The number of **consecutive losses** before the game (negative).

Example: for the results `W, W, L, L`, the streaks entering each game are `0, +1, +2, -1`; a fifth game would start at `-2`.
(The streak is computed with a `shift(1)` so the current game's result is never used.)


In [28]:

def compute_streak(win_series: pd.Series) -> pd.Series:
    streaks = []
    streak = 0
    # iterate over the shifted series: previous games only
    for w in win_series.shift(1):
        if pd.isna(w):
            streak = 0
        else:
            if w == 1:
                # win
                streak = streak + 1 if streak >= 0 else 1
            else:
                # loss
                streak = streak - 1 if streak <= 0 else -1
        streaks.append(streak)
    return pd.Series(streaks, index=win_series.index)

df_team_games["STREAK"] = (
    df_team_games
    .groupby("TEAM_ID")["WIN"]
    .transform(lambda s: compute_streak(s))
)

df_team_games[["TEAM_ID", "GAME_DATE", "WIN", "STREAK"]].head(10)


Unnamed: 0,TEAM_ID,GAME_DATE,WIN,STREAK
0,1610612737,2025-10-22,0,0
1,1610612737,2025-10-24,1,-1
2,1610612737,2025-10-25,0,1
3,1610612737,2025-10-27,0,-1
4,1610612737,2025-10-29,1,-2
5,1610612737,2025-10-31,1,1
6,1610612737,2025-11-02,0,2
7,1610612737,2025-11-04,1,-1
8,1610612737,2025-11-07,0,1
9,1610612737,2025-11-08,1,-1
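The streak definition can also be sketched without an explicit loop, by labelling each run of identical results and counting positions inside it. `streak_before_game` is a hypothetical helper, not part of the notebook, and it assumes the input is a single team's `WIN` series already sorted by date.

```python
import pandas as pd

def streak_before_game(win: pd.Series) -> pd.Series:
    """Signed streak entering each game (hypothetical vectorized helper)."""
    # A new run starts whenever the result differs from the previous game
    run_id = (win != win.shift()).cumsum()
    # Position within the current run: 1, 2, 3, ...
    run_len = win.groupby(run_id).cumcount() + 1
    # Wins count as positive, losses as negative
    signed = run_len.where(win == 1, -run_len)
    # Shift by one game so the current result is never used
    return signed.shift(1).fillna(0).astype(int)

# The W, W, L, L example from the text
example = streak_before_game(pd.Series([1, 1, 0, 0]))

# ATL's first ten results, copied from the WIN column above
atl = streak_before_game(pd.Series([0, 1, 0, 0, 1, 1, 0, 1, 0, 1]))

print(example.tolist())
print(atl.tolist())
```

On the ATL results this should reproduce the `STREAK` column shown in the output table; for the full dataset it would be applied per `TEAM_ID`, for instance through `groupby(...).transform`.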


## 5. Recent form per team (rolling stats)

For each team we compute, **before each game**:

- `PF_AVG_LAST5`: average points scored over the last 5 games.
- `PA_AVG_LAST5`: average points allowed over the last 5 games.
- `WIN_RATE_LAST5`: share of wins over the last 5 games.
- `MARGIN_AVG_LAST5`: average point margin (in the team's favor) over the last 5 games.

The same cell also produces rolling offensive/defensive/net ratings, pace, and the standard deviation of margin and pace.

We use `shift(1)` to avoid information leakage (only past games are used).


In [29]:
def add_rolling_stats(team_df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    team_df = team_df.sort_values("GAME_DATE").copy()

    # Means
    team_df["PF_AVG_LAST5"] = (
        team_df["POINTS_FOR"].shift(1).rolling(window=window, min_periods=1).mean()
    )
    team_df["PA_AVG_LAST5"] = (
        team_df["POINTS_AGAINST"].shift(1).rolling(window=window, min_periods=1).mean()
    )
    team_df["WIN_RATE_LAST5"] = (
        team_df["WIN"].shift(1).rolling(window=window, min_periods=1).mean()
    )
    team_df["MARGIN_AVG_LAST5"] = (
        team_df["MARGIN"].shift(1).rolling(window=window, min_periods=1).mean()
    )

    # Per-possession ratings
    team_df["OFF_RTG_LAST5"] = (
        team_df["OFF_RTG"].shift(1).rolling(window=window, min_periods=1).mean()
    )
    team_df["DEF_RTG_LAST5"] = (
        team_df["DEF_RTG"].shift(1).rolling(window=window, min_periods=1).mean()
    )
    team_df["NET_RTG_LAST5"] = (
        team_df["NET_RTG"].shift(1).rolling(window=window, min_periods=1).mean()
    )
    team_df["PACE_LAST5"] = (
        team_df["PACE"].shift(1).rolling(window=window, min_periods=1).mean()
    )

    # Volatility
    team_df["MARGIN_STD_LAST5"] = (
        team_df["MARGIN"].shift(1).rolling(window=window, min_periods=2).std()
    )
    team_df["PACE_STD_LAST5"] = (
        team_df["PACE"].shift(1).rolling(window=window, min_periods=2).std()
    )

    return team_df

df_team_games = (
    df_team_games
    .groupby("TEAM_ID", group_keys=False)
    .apply(add_rolling_stats)
)

df_team_games.head()



Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN,MARGIN,FGA,FTA,OREB,TOV,MATCHUP,SEASON_ID,POSSESSIONS,OFF_RTG,DEF_RTG,NET_RTG,PACE,REST_DAYS,STREAK,PF_AVG_LAST5,PA_AVG_LAST5,WIN_RATE_LAST5,MARGIN_AVG_LAST5,OFF_RTG_LAST5,DEF_RTG_LAST5,NET_RTG_LAST5,PACE_LAST5,MARGIN_STD_LAST5,PACE_STD_LAST5
0,22500082,2025-10-22,1610612737,ATL,Atlanta Hawks,1,118,138,0,-20,90,37,8,16,ATL vs. TOR,22025,114.28,103.255163,120.756038,-17.500875,114.52,,0,,,,,,,,,,
1,22500090,2025-10-24,1610612737,ATL,Atlanta Hawks,0,111,107,1,4,85,28,9,16,ATL @ ORL,22025,104.32,106.403374,102.569018,3.834356,104.3,2.0,-1,118.0,138.0,0.0,-20.0,103.255163,120.756038,-17.500875,114.52,,
2,22500101,2025-10-25,1610612737,ATL,Atlanta Hawks,1,100,117,0,-17,85,21,15,17,ATL vs. OKC,22025,96.24,103.906899,121.571072,-17.664173,97.8,1.0,1,114.5,122.5,0.5,-8.0,104.829268,111.662528,-6.83326,109.41,16.970563,7.226631
3,22500115,2025-10-27,1610612737,ATL,Atlanta Hawks,0,123,128,0,-5,98,20,8,6,ATL @ CHI,22025,104.8,117.366412,122.137405,-4.770992,106.02,2.0,-1,109.666667,120.666667,0.333333,-11.0,104.521812,114.965376,-10.443564,105.54,13.076697,8.428689
4,22500130,2025-10-29,1610612737,ATL,Atlanta Hawks,0,117,112,1,5,93,22,13,8,ATL @ BKN,22025,97.68,119.77887,114.660115,5.118755,99.28,2.0,-2,113.0,122.5,0.25,-9.5,107.732962,116.758383,-9.025421,105.66,11.090537,6.886179
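To see what `shift(1)` buys us, the sketch below contrasts the leak-free rolling mean with a variant that includes the current game. It uses ATL's first five `POINTS_FOR` values from the table above; the variable names are illustrative.

```python
import pandas as pd

# ATL points scored in its first five games, taken from the table above
points_for = pd.Series([118, 111, 100, 123, 117])

# Leak-free: shift(1) removes the current game from its own window
pf_avg_prior = points_for.shift(1).rolling(window=5, min_periods=1).mean()

# Leaky variant (for contrast only): the window includes the current game
pf_avg_leaky = points_for.rolling(window=5, min_periods=1).mean()

print(pd.DataFrame({
    "POINTS_FOR": points_for,
    "prior_only": pf_avg_prior,
    "with_current": pf_avg_leaky,
}))
```

The `prior_only` column should match `PF_AVG_LAST5` in the output above. Because `min_periods=1`, only a team's first game is left without a value; the standard-deviation columns use `min_periods=2`, so they are missing for the first two games.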


## 5. Merging TEAM-GAME features to the GAME level

We use `df_team_games` to build:

- `home_features`: stats of the home team in each game.
- `away_features`: stats of the away team.

We then join everything to `df_games` using `GAME_ID`.


In [30]:
# HOME
home_features = (
    df_team_games[df_team_games["IS_HOME"] == 1]
    .rename(columns={
        "TEAM_ID": "HOME_TEAM_ID",
        "TEAM_NAME": "HOME_TEAM_NAME",
        "PF_AVG_LAST5": "HOME_PF_AVG_LAST5",
        "PA_AVG_LAST5": "HOME_PA_AVG_LAST5",
        "WIN_RATE_LAST5": "HOME_WIN_RATE_LAST5",
        "MARGIN_AVG_LAST5": "HOME_MARGIN_AVG_LAST5",
        "OFF_RTG_LAST5": "HOME_OFF_RTG_LAST5",
        "DEF_RTG_LAST5": "HOME_DEF_RTG_LAST5",
        "NET_RTG_LAST5": "HOME_NET_RTG_LAST5",
        "PACE_LAST5": "HOME_PACE_LAST5",
        "MARGIN_STD_LAST5": "HOME_MARGIN_STD_LAST5",
        "PACE_STD_LAST5": "HOME_PACE_STD_LAST5",
    })
    [
        [
            "GAME_ID", "GAME_DATE", "HOME_TEAM_ID", "HOME_TEAM_NAME",
            "HOME_PF_AVG_LAST5", "HOME_PA_AVG_LAST5",
            "HOME_WIN_RATE_LAST5", "HOME_MARGIN_AVG_LAST5",
            "HOME_OFF_RTG_LAST5", "HOME_DEF_RTG_LAST5",
            "HOME_NET_RTG_LAST5", "HOME_PACE_LAST5",
            "HOME_MARGIN_STD_LAST5", "HOME_PACE_STD_LAST5",
        ]
    ]
)

# AWAY
away_features = (
    df_team_games[df_team_games["IS_HOME"] == 0]
    .rename(columns={
        "TEAM_ID": "AWAY_TEAM_ID",
        "TEAM_NAME": "AWAY_TEAM_NAME",
        "PF_AVG_LAST5": "AWAY_PF_AVG_LAST5",
        "PA_AVG_LAST5": "AWAY_PA_AVG_LAST5",
        "WIN_RATE_LAST5": "AWAY_WIN_RATE_LAST5",
        "MARGIN_AVG_LAST5": "AWAY_MARGIN_AVG_LAST5",
        "OFF_RTG_LAST5": "AWAY_OFF_RTG_LAST5",
        "DEF_RTG_LAST5": "AWAY_DEF_RTG_LAST5",
        "NET_RTG_LAST5": "AWAY_NET_RTG_LAST5",
        "PACE_LAST5": "AWAY_PACE_LAST5",
        "MARGIN_STD_LAST5": "AWAY_MARGIN_STD_LAST5",
        "PACE_STD_LAST5": "AWAY_PACE_STD_LAST5",
    })
    [
        [
            "GAME_ID", "AWAY_TEAM_ID", "AWAY_TEAM_NAME",
            "AWAY_PF_AVG_LAST5", "AWAY_PA_AVG_LAST5",
            "AWAY_WIN_RATE_LAST5", "AWAY_MARGIN_AVG_LAST5",
            "AWAY_OFF_RTG_LAST5", "AWAY_DEF_RTG_LAST5",
            "AWAY_NET_RTG_LAST5", "AWAY_PACE_LAST5",
            "AWAY_MARGIN_STD_LAST5", "AWAY_PACE_STD_LAST5",
        ]
    ]
)
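Before joining these tables to `df_games`, it can help to confirm that each game contributes exactly one home row and one away row. The sketch below uses the `validate` argument of `DataFrame.merge` on toy frames; the frames and their values are made up for the example.

```python
import pandas as pd

# Toy stand-ins for df_games, home_features and away_features
games = pd.DataFrame({"GAME_ID": [1, 2], "HOME_WIN": [1, 0]})
home = pd.DataFrame({"GAME_ID": [1, 2], "HOME_PF_AVG_LAST5": [110.0, 105.0]})
away = pd.DataFrame({"GAME_ID": [1, 2], "AWAY_PF_AVG_LAST5": [108.0, 112.0]})

# validate="one_to_one" raises if GAME_ID is duplicated on either side
merged = (
    games
    .merge(home, on="GAME_ID", how="left", validate="one_to_one")
    .merge(away, on="GAME_ID", how="left", validate="one_to_one")
)

# A duplicated GAME_ID on the right side is rejected instead of
# silently multiplying rows
away_dup = pd.concat([away, away.iloc[[0]]], ignore_index=True)
try:
    games.merge(away_dup, on="GAME_ID", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```

Using `how="left"` here keeps every game from the left table even when a feature row is missing, which makes gaps visible as NaN rather than as dropped games.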


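## 6. Rest advantage and handling of early rows without history

The plan is to create:

- `REST_ADVANTAGE` = `HOME_REST_DAYS` - `AWAY_REST_DAYS`

and to handle the rows where `HOME_PF_AVG_LAST5` or `AWAY_PF_AVG_LAST5` is NaN. Because the rolling means use `min_periods=1`, this affects each team's first game (and the first two games for the `*_STD_LAST5` columns), where there is no history yet.

In the cell below, `home_features` and `away_features` do not yet carry `REST_DAYS`, so `REST_ADVANTAGE` is not computed here, and the missing values are imputed with the column mean rather than dropped.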


In [31]:
# Start from the basic df_games (actual HOME vs AWAY points)
df_model = df_games.merge(home_features, on=["GAME_ID", "GAME_DATE"])
df_model = df_model.merge(away_features, on="GAME_ID")

# Simple rest advantage (REST_DAYS can be added later if desired)
# For now we work only with the recent-form metrics

# Targets already come with df_games:
# - HOME_WIN
# - MARGIN_HOME
# - TOTAL_POINTS

# Additional threshold-based targets
df_model["BLOWOUT"] = (df_model["MARGIN_HOME"].abs() >= 10).astype(int)

median_total = df_model["TOTAL_POINTS"].median()
p75_total = df_model["TOTAL_POINTS"].quantile(0.75)

df_model["OVER_MEDIAN"] = (df_model["TOTAL_POINTS"] > median_total).astype(int)
df_model["OVER_P75"] = (df_model["TOTAL_POINTS"] > p75_total).astype(int)

median_total, p75_total
feature_cols = [
    "HOME_PF_AVG_LAST5", "HOME_PA_AVG_LAST5",
    "HOME_WIN_RATE_LAST5", "HOME_MARGIN_AVG_LAST5",
    "HOME_OFF_RTG_LAST5", "HOME_DEF_RTG_LAST5",
    "HOME_NET_RTG_LAST5", "HOME_PACE_LAST5",
    "HOME_MARGIN_STD_LAST5", "HOME_PACE_STD_LAST5",
    "AWAY_PF_AVG_LAST5", "AWAY_PA_AVG_LAST5",
    "AWAY_WIN_RATE_LAST5", "AWAY_MARGIN_AVG_LAST5",
    "AWAY_OFF_RTG_LAST5", "AWAY_DEF_RTG_LAST5",
    "AWAY_NET_RTG_LAST5", "AWAY_PACE_LAST5",
    "AWAY_MARGIN_STD_LAST5", "AWAY_PACE_STD_LAST5",
]

df_model[feature_cols] = df_model[feature_cols].fillna(
    df_model[feature_cols].mean()
)

df_model[feature_cols].isna().sum().sum()



np.int64(0)

## 5. Projecting the metrics to the game level (HOME vs AWAY)

In this section we build two tables:

- `home_features`: recent-form metrics of the home team.
- `away_features`: recent-form metrics of the away team.

We include:

- Rolling averages of points scored/allowed.
- Win rate and average margin.
- Offensive / defensive / net ratings per possession.
- Pace of play (PACE) and its volatility.
- Streak (`STREAK`) and rest days (`REST_DAYS`).


In [32]:
# === HOME ===
home_features = (
    df_team_games[df_team_games["IS_HOME"] == 1]
    .rename(columns={
        "TEAM_ID": "HOME_TEAM_ID",
        "TEAM_NAME": "HOME_TEAM_NAME",
        "PF_AVG_LAST5": "HOME_PF_AVG_LAST5",
        "PA_AVG_LAST5": "HOME_PA_AVG_LAST5",
        "WIN_RATE_LAST5": "HOME_WIN_RATE_LAST5",
        "MARGIN_AVG_LAST5": "HOME_MARGIN_AVG_LAST5",
        "OFF_RTG_LAST5": "HOME_OFF_RTG_LAST5",
        "DEF_RTG_LAST5": "HOME_DEF_RTG_LAST5",
        "NET_RTG_LAST5": "HOME_NET_RTG_LAST5",
        "PACE_LAST5": "HOME_PACE_LAST5",
        "MARGIN_STD_LAST5": "HOME_MARGIN_STD_LAST5",
        "PACE_STD_LAST5": "HOME_PACE_STD_LAST5",
        "STREAK": "HOME_STREAK",
        "REST_DAYS": "HOME_REST_DAYS",
    })
    [[
        "GAME_ID", "GAME_DATE",
        "HOME_TEAM_ID", "HOME_TEAM_NAME",
        "HOME_PF_AVG_LAST5", "HOME_PA_AVG_LAST5",
        "HOME_WIN_RATE_LAST5", "HOME_MARGIN_AVG_LAST5",
        "HOME_OFF_RTG_LAST5", "HOME_DEF_RTG_LAST5",
        "HOME_NET_RTG_LAST5", "HOME_PACE_LAST5",
        "HOME_MARGIN_STD_LAST5", "HOME_PACE_STD_LAST5",
        "HOME_STREAK", "HOME_REST_DAYS",
    ]]
)

# === AWAY ===
away_features = (
    df_team_games[df_team_games["IS_HOME"] == 0]
    .rename(columns={
        "TEAM_ID": "AWAY_TEAM_ID",
        "TEAM_NAME": "AWAY_TEAM_NAME",
        "PF_AVG_LAST5": "AWAY_PF_AVG_LAST5",
        "PA_AVG_LAST5": "AWAY_PA_AVG_LAST5",
        "WIN_RATE_LAST5": "AWAY_WIN_RATE_LAST5",
        "MARGIN_AVG_LAST5": "AWAY_MARGIN_AVG_LAST5",
        "OFF_RTG_LAST5": "AWAY_OFF_RTG_LAST5",
        "DEF_RTG_LAST5": "AWAY_DEF_RTG_LAST5",
        "NET_RTG_LAST5": "AWAY_NET_RTG_LAST5",
        "PACE_LAST5": "AWAY_PACE_LAST5",
        "MARGIN_STD_LAST5": "AWAY_MARGIN_STD_LAST5",
        "PACE_STD_LAST5": "AWAY_PACE_STD_LAST5",
        "STREAK": "AWAY_STREAK",
        "REST_DAYS": "AWAY_REST_DAYS",
    })
    [[
        "GAME_ID",
        "AWAY_TEAM_ID", "AWAY_TEAM_NAME",
        "AWAY_PF_AVG_LAST5", "AWAY_PA_AVG_LAST5",
        "AWAY_WIN_RATE_LAST5", "AWAY_MARGIN_AVG_LAST5",
        "AWAY_OFF_RTG_LAST5", "AWAY_DEF_RTG_LAST5",
        "AWAY_NET_RTG_LAST5", "AWAY_PACE_LAST5",
        "AWAY_MARGIN_STD_LAST5", "AWAY_PACE_STD_LAST5",
        "AWAY_STREAK", "AWAY_REST_DAYS",
    ]]
)

home_features.head(), away_features.head()


(    GAME_ID  GAME_DATE  HOME_TEAM_ID HOME_TEAM_NAME  HOME_PF_AVG_LAST5  \
 0  22500082 2025-10-22    1610612737  Atlanta Hawks                NaN   
 2  22500101 2025-10-25    1610612737  Atlanta Hawks              114.5   
 7  22500166 2025-11-04    1610612737  Atlanta Hawks              115.4   
 8  22500030 2025-11-07    1610612737  Atlanta Hawks              120.8   
 9  22500185 2025-11-08    1610612737  Atlanta Hawks              115.6   
 
    HOME_PA_AVG_LAST5  HOME_WIN_RATE_LAST5  HOME_MARGIN_AVG_LAST5  \
 0                NaN                  NaN                    NaN   
 2              122.5                  0.5                   -8.0   
 7              116.4                  0.4                   -1.0   
 8              115.4                  0.6                    5.4   
 9              111.6                  0.6                    4.0   
 
    HOME_OFF_RTG_LAST5  HOME_DEF_RTG_LAST5  HOME_NET_RTG_LAST5  \
 0                 NaN                 NaN                 NaN   


## 6. Building the final dataset `df_model`

We start from `df_games` (the actual game outcomes used as targets) and add:

- Recent-form metrics of the home team (`home_features`).
- Recent-form metrics of the away team (`away_features`).
- Additional threshold-based targets:
  - `BLOWOUT` (absolute margin ≥ 10).
  - `OVER_MEDIAN` and `OVER_P75` for total points.
- Rest advantage `REST_ADVANTAGE` = `HOME_REST_DAYS - AWAY_REST_DAYS`.


In [33]:
# Start from df_games (which already has HOME_WIN, MARGIN_HOME, TOTAL_POINTS)
df_model = df_games.merge(home_features, on=["GAME_ID", "GAME_DATE"])
df_model = df_model.merge(away_features, on="GAME_ID")

# Rest advantage
df_model["REST_ADVANTAGE"] = df_model["HOME_REST_DAYS"] - df_model["AWAY_REST_DAYS"]

# Threshold-based targets
df_model["BLOWOUT"] = (df_model["MARGIN_HOME"].abs() >= 10).astype(int)

median_total = df_model["TOTAL_POINTS"].median()
p75_total = df_model["TOTAL_POINTS"].quantile(0.75)

df_model["OVER_MEDIAN"] = (df_model["TOTAL_POINTS"] > median_total).astype(int)
df_model["OVER_P75"] = (df_model["TOTAL_POINTS"] > p75_total).astype(int)

median_total, p75_total


(np.float64(232.0), np.float64(246.0))
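As a worked example of the threshold targets, the sketch below applies the reported thresholds (median 232, 75th percentile 246) to five games from this dataset. The small frame is built by hand for illustration; in the notebook the thresholds come from the full `TOTAL_POINTS` column.

```python
import pandas as pd

# MARGIN_HOME and TOTAL_POINTS for five games from this dataset
# (LAL-GSW, OKC-HOU, ATL-TOR, CHI-DET, MIL-WAS)
toy = pd.DataFrame({
    "MARGIN_HOME": [-10, 1, -20, 4, 13],
    "TOTAL_POINTS": [228, 249, 256, 226, 253],
})

# Thresholds reported by the cell above for the full dataset
median_total, p75_total = 232.0, 246.0

# Blowout: either team wins by 10 or more
toy["BLOWOUT"] = (toy["MARGIN_HOME"].abs() >= 10).astype(int)

# Strictly above each threshold, matching the notebook's '>' comparison
toy["OVER_MEDIAN"] = (toy["TOTAL_POINTS"] > median_total).astype(int)
toy["OVER_P75"] = (toy["TOTAL_POINTS"] > p75_total).astype(int)

print(toy)
```

Since the thresholds are computed on every game, a later train/test split would see information from the test period in these labels; computing them on the training portion only is one possible refinement, not something this notebook does.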

### 7. Cleaning up duplicated columns (`_x` / `_y`)

After the `merge` calls, columns present in both tables appear twice, with the suffixes `_x` and `_y`. We keep the `_x` versions (those from `df_games`), drop the `_y` versions, and strip the suffix from the final name.


In [34]:
# Drop every column ending in '_y'
cols_to_drop = [c for c in df_model.columns if c.endswith("_y")]
df_model = df_model.drop(columns=cols_to_drop)

# Rename '_x' columns -> no suffix
df_model = df_model.rename(columns=lambda c: c[:-2] if c.endswith("_x") else c)

df_model.head(3)


Unnamed: 0,GAME_ID,GAME_DATE,HOME_TEAM_ID,HOME_TEAM_ABBR,HOME_TEAM_NAME,HOME_PTS,HOME_PA,HOME_MARGIN,AWAY_TEAM_ID,AWAY_TEAM_ABBR,AWAY_TEAM_NAME,AWAY_PTS,AWAY_PA,AWAY_MARGIN,HOME_WIN,MARGIN_HOME,TOTAL_POINTS,HOME_PF_AVG_LAST5,HOME_PA_AVG_LAST5,HOME_WIN_RATE_LAST5,HOME_MARGIN_AVG_LAST5,HOME_OFF_RTG_LAST5,HOME_DEF_RTG_LAST5,HOME_NET_RTG_LAST5,HOME_PACE_LAST5,HOME_MARGIN_STD_LAST5,HOME_PACE_STD_LAST5,HOME_STREAK,HOME_REST_DAYS,AWAY_PF_AVG_LAST5,AWAY_PA_AVG_LAST5,AWAY_WIN_RATE_LAST5,AWAY_MARGIN_AVG_LAST5,AWAY_OFF_RTG_LAST5,AWAY_DEF_RTG_LAST5,AWAY_NET_RTG_LAST5,AWAY_PACE_LAST5,AWAY_MARGIN_STD_LAST5,AWAY_PACE_STD_LAST5,AWAY_STREAK,AWAY_REST_DAYS,REST_ADVANTAGE,BLOWOUT,OVER_MEDIAN,OVER_P75
0,22500002,2025-10-21,1610612747,LAL,Los Angeles Lakers,109,119,-10,1610612744,GSW,Golden State Warriors,119,109,10,0,-10,228,,,,,,,,,,,0,,,,,,,,,,,,0,,,1,0,0
1,22500001,2025-10-21,1610612760,OKC,Oklahoma City Thunder,125,124,1,1610612745,HOU,Houston Rockets,124,125,-1,1,1,249,,,,,,,,,,,0,,,,,,,,,,,,0,,,0,1,1
2,22500082,2025-10-22,1610612737,ATL,Atlanta Hawks,118,138,-20,1610612761,TOR,Toronto Raptors,138,118,20,0,-20,256,,,,,,,,,,,0,,,,,,,,,,,,0,,,1,1,1
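The suffixes appear because columns such as the team identifiers exist on both sides of the merge. A possible alternative, sketched here on toy frames with made-up values, is to drop the overlapping columns from the right-hand table before merging, so no cleanup is needed afterwards.

```python
import pandas as pd

# Toy frames: HOME_TEAM_ID exists in both tables, as in the notebook
games = pd.DataFrame({"GAME_ID": [1, 2], "HOME_TEAM_ID": [10, 20]})
home = pd.DataFrame({
    "GAME_ID": [1, 2],
    "HOME_TEAM_ID": [10, 20],
    "HOME_STREAK": [0, -1],
})

# Naive merge: the shared non-key column is duplicated with _x / _y
naive = games.merge(home, on="GAME_ID")

# Alternative: remove shared non-key columns from the right table first
overlap = [c for c in home.columns if c in games.columns and c != "GAME_ID"]
clean = games.merge(home.drop(columns=overlap), on="GAME_ID")

print(list(naive.columns))
print(list(clean.columns))
```

Either approach ends with the same column names; dropping the overlap up front simply avoids relying on suffix conventions.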


### 8. Feature selection and saving the final dataset


In [35]:
feature_cols = [
    # Offensive/defensive form
    "HOME_PF_AVG_LAST5", "HOME_PA_AVG_LAST5",
    "HOME_WIN_RATE_LAST5", "HOME_MARGIN_AVG_LAST5",
    "HOME_OFF_RTG_LAST5", "HOME_DEF_RTG_LAST5",
    "HOME_NET_RTG_LAST5", "HOME_PACE_LAST5",
    "HOME_MARGIN_STD_LAST5", "HOME_PACE_STD_LAST5",
    "AWAY_PF_AVG_LAST5", "AWAY_PA_AVG_LAST5",
    "AWAY_WIN_RATE_LAST5", "AWAY_MARGIN_AVG_LAST5",
    "AWAY_OFF_RTG_LAST5", "AWAY_DEF_RTG_LAST5",
    "AWAY_NET_RTG_LAST5", "AWAY_PACE_LAST5",
    "AWAY_MARGIN_STD_LAST5", "AWAY_PACE_STD_LAST5",

    # Streaks and rest
    "HOME_STREAK", "AWAY_STREAK",
    "HOME_REST_DAYS", "AWAY_REST_DAYS",
    "REST_ADVANTAGE",
]

df_model[feature_cols] = df_model[feature_cols].fillna(
    df_model[feature_cols].mean()
)

print("Number of NaNs after imputation:", df_model[feature_cols].isna().sum().sum())

df_model.to_csv("../data/processed/games_2025_26_features.csv", index=False)
df_model.head()


Number of NaNs after imputation: 0


Unnamed: 0,GAME_ID,GAME_DATE,HOME_TEAM_ID,HOME_TEAM_ABBR,HOME_TEAM_NAME,HOME_PTS,HOME_PA,HOME_MARGIN,AWAY_TEAM_ID,AWAY_TEAM_ABBR,AWAY_TEAM_NAME,AWAY_PTS,AWAY_PA,AWAY_MARGIN,HOME_WIN,MARGIN_HOME,TOTAL_POINTS,HOME_PF_AVG_LAST5,HOME_PA_AVG_LAST5,HOME_WIN_RATE_LAST5,HOME_MARGIN_AVG_LAST5,HOME_OFF_RTG_LAST5,HOME_DEF_RTG_LAST5,HOME_NET_RTG_LAST5,HOME_PACE_LAST5,HOME_MARGIN_STD_LAST5,HOME_PACE_STD_LAST5,HOME_STREAK,HOME_REST_DAYS,AWAY_PF_AVG_LAST5,AWAY_PA_AVG_LAST5,AWAY_WIN_RATE_LAST5,AWAY_MARGIN_AVG_LAST5,AWAY_OFF_RTG_LAST5,AWAY_DEF_RTG_LAST5,AWAY_NET_RTG_LAST5,AWAY_PACE_LAST5,AWAY_MARGIN_STD_LAST5,AWAY_PACE_STD_LAST5,AWAY_STREAK,AWAY_REST_DAYS,REST_ADVANTAGE,BLOWOUT,OVER_MEDIAN,OVER_P75
0,22500002,2025-10-21,1610612747,LAL,Los Angeles Lakers,109,119,-10,1610612744,GSW,Golden State Warriors,119,109,10,0,-10,228,117.239333,117.559333,0.494857,-0.32,112.383405,112.731232,-0.347827,104.389002,13.39477,4.452629,0,1.96,117.600095,117.086952,0.508095,0.513143,112.543709,112.025836,0.517873,104.57089,13.093572,4.475017,0,1.965714,-0.005747,1,0,0
1,22500001,2025-10-21,1610612760,OKC,Oklahoma City Thunder,125,124,1,1610612745,HOU,Houston Rockets,124,125,-1,1,1,249,117.239333,117.559333,0.494857,-0.32,112.383405,112.731232,-0.347827,104.389002,13.39477,4.452629,0,1.96,117.600095,117.086952,0.508095,0.513143,112.543709,112.025836,0.517873,104.57089,13.093572,4.475017,0,1.965714,-0.005747,0,1,1
2,22500082,2025-10-22,1610612737,ATL,Atlanta Hawks,118,138,-20,1610612761,TOR,Toronto Raptors,138,118,20,0,-20,256,117.239333,117.559333,0.494857,-0.32,112.383405,112.731232,-0.347827,104.389002,13.39477,4.452629,0,1.96,117.600095,117.086952,0.508095,0.513143,112.543709,112.025836,0.517873,104.57089,13.093572,4.475017,0,1.965714,-0.005747,1,1,1
3,22500084,2025-10-22,1610612741,CHI,Chicago Bulls,115,111,4,1610612765,DET,Detroit Pistons,111,115,-4,1,4,226,117.239333,117.559333,0.494857,-0.32,112.383405,112.731232,-0.347827,104.389002,13.39477,4.452629,0,1.96,117.600095,117.086952,0.508095,0.513143,112.543709,112.025836,0.517873,104.57089,13.093572,4.475017,0,1.965714,-0.005747,0,0,0
4,22500086,2025-10-22,1610612749,MIL,Milwaukee Bucks,133,120,13,1610612764,WAS,Washington Wizards,120,133,-13,1,13,253,117.239333,117.559333,0.494857,-0.32,112.383405,112.731232,-0.347827,104.389002,13.39477,4.452629,0,1.96,117.600095,117.086952,0.508095,0.513143,112.543709,112.025836,0.517873,104.57089,13.093572,4.475017,0,1.965714,-0.005747,1,1,1
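The identical feature values in the rows above are the column means written into games where a team had no history yet. The sketch below contrasts that mean imputation with dropping such games, on a toy frame with made-up numbers; the `NO_HISTORY` flag is a hypothetical addition that the notebook does not create.

```python
import numpy as np
import pandas as pd

# Toy frame: the first games lack rolling history for one or both teams
toy = pd.DataFrame({
    "GAME_ID": [1, 2, 3, 4],
    "HOME_PF_AVG_LAST5": [np.nan, 110.0, 114.0, 112.0],
    "AWAY_PF_AVG_LAST5": [np.nan, np.nan, 108.0, 104.0],
})
feature_cols = ["HOME_PF_AVG_LAST5", "AWAY_PF_AVG_LAST5"]

# Option A (what the notebook does): fill gaps with the column mean
imputed = toy.copy()
imputed[feature_cols] = imputed[feature_cols].fillna(imputed[feature_cols].mean())

# Hypothetical flag so a model can tell imputed rows apart
imputed["NO_HISTORY"] = toy[feature_cols].isna().any(axis=1).astype(int)

# Option B (alternative): keep only games where both teams have history
dropped = toy.dropna(subset=feature_cols)

print(imputed)
print(dropped)
```

Imputation keeps every game at the cost of giving early-season games uninformative features; dropping loses a few rows but leaves only observed values. Which trade-off is preferable depends on the models in the next notebook.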


## 💾 8. Saving the enriched dataset for modeling

We save the result to `data/processed/games_2025_26_features.csv`
for use in the modeling notebook (03).


In [36]:
df_model.to_csv("../data/processed/games_2025_26_features.csv", index=False)
df_model.head()




# ✅ Summary of Notebook 02: Feature Engineering

In this notebook we:

- Loaded the TEAM-GAME view of the games (`df_team_games`) and derived possessions, ratings and pace.
- Computed for each team, before each game:
  - Average points scored/allowed over the last 5 games.
  - Win rate and average margin over the last 5 games.
  - Win/loss streak (`STREAK`).
  - Rest days (`REST_DAYS`).
- Projected these features to the GAME level:
  - `HOME_...` and `AWAY_...` variables for the home and away teams.
  - Rest advantage (`REST_ADVANTAGE`).
- Imputed missing feature values (each team's first games, which have no history) with the column mean instead of dropping those games.
- Saved the enriched dataset to:
  `data/processed/games_2025_26_features.csv`.


