# 🧩 NBA 2026 Prediction: Feature Engineering

In this notebook we build advanced *features* from the game-level dataset
already processed in notebook 01.

We work on the current season and add information such as:

- Recent form of each team (averages over the last few games)
- Win/loss streak
- Recent win percentage
- Rest days between games (fatigue)
- Rest advantage between the home and away teams

The goal is to produce an enriched dataset ready for training models
(classification and regression) in the next notebook.


In [1]:
import pandas as pd

pd.set_option("display.max_columns", 80)

df_games = pd.read_csv("../data/processed/games_2025_26_basic.csv",
                       parse_dates=["GAME_DATE"])
df_games.head()


Unnamed: 0,GAME_ID,GAME_DATE,HOME_TEAM_ID,HOME_TEAM_NAME,HOME_TEAM_ABBR,HOME_PTS,AWAY_TEAM_ID,AWAY_TEAM_NAME,AWAY_TEAM_ABBR,AWAY_PTS,MARGIN_HOME,HOME_WIN,TOTAL_POINTS
0,22500001,2025-10-21,1610612760,Oklahoma City Thunder,OKC,125,1610612745,Houston Rockets,HOU,124,1,1,249
1,22500002,2025-10-21,1610612747,Los Angeles Lakers,LAL,109,1610612744,Golden State Warriors,GSW,119,-10,0,228
2,22500003,2025-10-22,1610612752,New York Knicks,NYK,119,1610612739,Cleveland Cavaliers,CLE,111,8,1,230
3,22500004,2025-10-22,1610612742,Dallas Mavericks,DAL,92,1610612759,San Antonio Spurs,SAS,125,-33,0,217
4,22500080,2025-10-22,1610612766,Charlotte Hornets,CHA,136,1610612751,Brooklyn Nets,BKN,117,19,1,253


## 1. Building the TEAM-GAME dataset

Starting from `df_games` (one row per game), we create a dataframe
`df_team_games` where:

- Each row represents a **team in a game**.
- The columns are:
  - `TEAM_ID`, `TEAM_NAME`
  - `IS_HOME` (1 if the team played at home, 0 if it played away)
  - `POINTS_FOR`, `POINTS_AGAINST`
  - `WIN` (1 = the team won, 0 = the team lost)
  - `GAME_DATE`, `GAME_ID`


In [2]:
rows = []
for _, row in df_games.iterrows():
    # Home team
    rows.append({
        "GAME_ID": row["GAME_ID"],
        "GAME_DATE": row["GAME_DATE"],
        "TEAM_ID": row["HOME_TEAM_ID"],
        "TEAM_NAME": row["HOME_TEAM_NAME"],
        "IS_HOME": 1,
        "POINTS_FOR": row["HOME_PTS"],
        "POINTS_AGAINST": row["AWAY_PTS"],
        "WIN": 1 if row["HOME_PTS"] > row["AWAY_PTS"] else 0,
    })
    # Away team
    rows.append({
        "GAME_ID": row["GAME_ID"],
        "GAME_DATE": row["GAME_DATE"],
        "TEAM_ID": row["AWAY_TEAM_ID"],
        "TEAM_NAME": row["AWAY_TEAM_NAME"],
        "IS_HOME": 0,
        "POINTS_FOR": row["AWAY_PTS"],
        "POINTS_AGAINST": row["HOME_PTS"],
        "WIN": 1 if row["AWAY_PTS"] > row["HOME_PTS"] else 0,
    })

df_team_games = pd.DataFrame(rows).sort_values(
    ["TEAM_ID", "GAME_DATE"]
).reset_index(drop=True)

df_team_games["MARGIN"] = df_team_games["POINTS_FOR"] - df_team_games["POINTS_AGAINST"]
df_team_games.head()


Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN,MARGIN
0,22500082,2025-10-22,1610612737,Atlanta Hawks,1,118,138,0,-20
1,22500090,2025-10-24,1610612737,Atlanta Hawks,0,111,107,1,4
2,22500101,2025-10-25,1610612737,Atlanta Hawks,1,100,117,0,-17
3,22500115,2025-10-27,1610612737,Atlanta Hawks,0,123,128,0,-5
4,22500130,2025-10-29,1610612737,Atlanta Hawks,0,117,112,1,5
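The same reshape can also be written without `iterrows()`. The following is a minimal sketch on toy data, not the notebook's own code: `to_team_games` is a hypothetical helper name, and the toy `games` frame only carries the columns the reshape needs.

```python
import pandas as pd

# Toy stand-in for df_games (assumed column names match the real dataset).
games = pd.DataFrame({
    "GAME_ID": [1, 2],
    "GAME_DATE": pd.to_datetime(["2025-10-21", "2025-10-22"]),
    "HOME_TEAM_ID": [10, 20],
    "HOME_TEAM_NAME": ["A", "B"],
    "HOME_PTS": [125, 109],
    "AWAY_TEAM_ID": [20, 10],
    "AWAY_TEAM_NAME": ["B", "A"],
    "AWAY_PTS": [124, 119],
})


def to_team_games(games: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (team, game), built column-wise instead of row by row."""

    def side(own: str, opp: str, is_home: int) -> pd.DataFrame:
        # Select the columns of one side and rename them to the team-level schema.
        return pd.DataFrame({
            "GAME_ID": games["GAME_ID"],
            "GAME_DATE": games["GAME_DATE"],
            "TEAM_ID": games[f"{own}_TEAM_ID"],
            "TEAM_NAME": games[f"{own}_TEAM_NAME"],
            "IS_HOME": is_home,
            "POINTS_FOR": games[f"{own}_PTS"],
            "POINTS_AGAINST": games[f"{opp}_PTS"],
        })

    out = pd.concat(
        [side("HOME", "AWAY", 1), side("AWAY", "HOME", 0)],
        ignore_index=True,
    )
    out["MARGIN"] = out["POINTS_FOR"] - out["POINTS_AGAINST"]
    out["WIN"] = (out["MARGIN"] > 0).astype(int)
    return out.sort_values(["TEAM_ID", "GAME_DATE"]).reset_index(drop=True)


team_games = to_team_games(games)
print(team_games[["TEAM_ID", "GAME_ID", "IS_HOME", "MARGIN", "WIN"]])
```

On a full season the loop is still fast enough, so this is mostly a readability choice.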


## 2. Recent form per team (rolling stats)

For each team we compute, **before each game**:

- `PF_AVG_LAST5`: average points scored over the last 5 games.
- `PA_AVG_LAST5`: average points allowed over the last 5 games.
- `WIN_RATE_LAST5`: share of wins over the last 5 games.
- `MARGIN_AVG_LAST5`: average point margin (in the team's favor) over the last 5 games.

We use `shift(1)` to avoid information leakage: each value is computed from past games only.
Because the window uses `min_periods=1`, a team's early games average over however many
previous games exist, and only its first game of the season is NaN.
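A small sketch of why the shift matters, using the Hawks' first five `POINTS_FOR` values from the table above. The names `leaky` and `safe` are illustrative only.

```python
import pandas as pd

# POINTS_FOR for Atlanta's first five games, taken from df_team_games.head().
points = pd.Series([118, 111, 100, 123, 117])

# Without the shift, the window includes the game we are trying to predict.
leaky = points.rolling(window=5, min_periods=1).mean()

# With shift(1), the window ends at the previous game.
safe = points.shift(1).rolling(window=5, min_periods=1).mean()

print(pd.DataFrame({"points": points, "leaky": leaky, "safe": safe}))
```

The `safe` column starts with NaN and then reads 118.0, 114.5, 109.67, 113.0, while `leaky` already contains each game's own score.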


In [3]:
def add_rolling_stats(team_df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    team_df = team_df.sort_values("GAME_DATE").copy()

    team_df["PF_AVG_LAST5"] = (
        team_df["POINTS_FOR"].shift(1)
        .rolling(window=window, min_periods=1)
        .mean()
    )
    team_df["PA_AVG_LAST5"] = (
        team_df["POINTS_AGAINST"].shift(1)
        .rolling(window=window, min_periods=1)
        .mean()
    )
    team_df["WIN_RATE_LAST5"] = (
        team_df["WIN"].shift(1)
        .rolling(window=window, min_periods=1)
        .mean()
    )
    team_df["MARGIN_AVG_LAST5"] = (
        team_df["MARGIN"].shift(1)
        .rolling(window=window, min_periods=1)
        .mean()
    )
    return team_df

df_team_games = (
    df_team_games
    .groupby("TEAM_ID", group_keys=False)
    .apply(add_rolling_stats)
)

df_team_games.head()




Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN,MARGIN,PF_AVG_LAST5,PA_AVG_LAST5,WIN_RATE_LAST5,MARGIN_AVG_LAST5
0,22500082,2025-10-22,1610612737,Atlanta Hawks,1,118,138,0,-20,,,,
1,22500090,2025-10-24,1610612737,Atlanta Hawks,0,111,107,1,4,118.0,138.0,0.0,-20.0
2,22500101,2025-10-25,1610612737,Atlanta Hawks,1,100,117,0,-17,114.5,122.5,0.5,-8.0
3,22500115,2025-10-27,1610612737,Atlanta Hawks,0,123,128,0,-5,109.666667,120.666667,0.333333,-11.0
4,22500130,2025-10-29,1610612737,Atlanta Hawks,0,117,112,1,5,113.0,122.5,0.25,-9.5


## 3. Win/loss streak

We define `STREAK` as:

- The number of consecutive wins before the game (positive).
- The number of consecutive losses before the game (negative).

Example:
- [W, W, L, L] → `STREAK` before each game: 0, +1, +2, -1 (shifted with `shift(1)`); a fifth game would start at -2.


In [4]:
def compute_streak(win_series: pd.Series) -> pd.Series:
    streaks = []
    streak = 0
    for w in win_series.shift(1):  # previous results only
        if pd.isna(w):
            streak = 0
        else:
            if w == 1:
                streak = streak + 1 if streak >= 0 else 1
            else:
                streak = streak - 1 if streak <= 0 else -1
        streaks.append(streak)
    return pd.Series(streaks, index=win_series.index)

df_team_games["STREAK"] = (
    df_team_games
    .groupby("TEAM_ID", group_keys=False)["WIN"]
    .apply(compute_streak)
)
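For comparison, here is a loop-free sketch of the same definition. It assumes `WIN` holds only 0/1 and that the series belongs to a single team, already sorted by date; `streak_before_game` is a hypothetical name, not part of the notebook.

```python
import pandas as pd


def streak_before_game(wins: pd.Series) -> pd.Series:
    """Signed length of the current run of results, counted before each game."""
    prev = wins.shift(1)                         # result of the previous game (NaN for the first)
    run_id = (prev != prev.shift(1)).cumsum()    # new id every time the result flips
    run_len = prev.groupby(run_id).cumcount() + 1
    sign = prev * 2 - 1                          # 1 -> +1, 0 -> -1, NaN stays NaN
    return (run_len * sign).fillna(0).astype(int)


print(streak_before_game(pd.Series([1, 1, 0, 0])).tolist())  # [0, 1, 2, -1]
```

Applied per team with `groupby("TEAM_ID")["WIN"]`, it should agree with `compute_streak`; the explicit loop is arguably easier to read.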




## 4. Rest days and fatigue

We compute each team's number of rest days before each game:

- `REST_DAYS` = days between the current game and the same team's previous game.
- If it is the team's first game of the season → `NaN` (handled later).

This lets us compute the **rest advantage** between the home and away teams.


In [5]:
def add_rest_days(team_df: pd.DataFrame) -> pd.DataFrame:
    team_df = team_df.sort_values("GAME_DATE").copy()
    prev_date = team_df["GAME_DATE"].shift(1)
    team_df["REST_DAYS"] = (team_df["GAME_DATE"] - prev_date).dt.days
    return team_df

df_team_games = (
    df_team_games
    .groupby("TEAM_ID", group_keys=False)
    .apply(add_rest_days)
)

df_team_games.head()



Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN,MARGIN,PF_AVG_LAST5,PA_AVG_LAST5,WIN_RATE_LAST5,MARGIN_AVG_LAST5,STREAK,REST_DAYS
0,22500082,2025-10-22,1610612737,Atlanta Hawks,1,118,138,0,-20,,,,,0,
1,22500090,2025-10-24,1610612737,Atlanta Hawks,0,111,107,1,4,118.0,138.0,0.0,-20.0,-1,2.0
2,22500101,2025-10-25,1610612737,Atlanta Hawks,1,100,117,0,-17,114.5,122.5,0.5,-8.0,1,1.0
3,22500115,2025-10-27,1610612737,Atlanta Hawks,0,123,128,0,-5,109.666667,120.666667,0.333333,-11.0,-1,2.0
4,22500130,2025-10-29,1610612737,Atlanta Hawks,0,117,112,1,5,113.0,122.5,0.25,-9.5,-2,2.0
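The rest-day calculation can also be expressed with a grouped `diff()`, which avoids `groupby(...).apply(...)`. This is a sketch on toy data with made-up team IDs, assuming `GAME_DATE` is already a datetime column.

```python
import pandas as pd

# Toy team-game rows: team 1 plays three games, team 2 plays two.
team_games = pd.DataFrame({
    "TEAM_ID": [1, 1, 1, 2, 2],
    "GAME_DATE": pd.to_datetime(
        ["2025-10-22", "2025-10-24", "2025-10-25", "2025-10-21", "2025-10-24"]
    ),
})

team_games = team_games.sort_values(["TEAM_ID", "GAME_DATE"]).reset_index(drop=True)

# Difference to the previous game of the same team, in whole days.
team_games["REST_DAYS"] = (
    team_games.groupby("TEAM_ID")["GAME_DATE"].diff().dt.days
)

print(team_games)
```

Each team's first row is NaN, matching the behavior of `add_rest_days`.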


## 5. Merging TEAM-GAME features to the GAME level

We use `df_team_games` to build:

- `home_features`: stats of the home team in each game.
- `away_features`: stats of the away team.

We then join both to `df_games` on `GAME_ID` and the corresponding team ID.


In [6]:
# HOME
home_features = (
    df_team_games[df_team_games["IS_HOME"] == 1]
    .rename(columns={
        "PF_AVG_LAST5": "HOME_PF_AVG_LAST5",
        "PA_AVG_LAST5": "HOME_PA_AVG_LAST5",
        "WIN_RATE_LAST5": "HOME_WIN_RATE_LAST5",
        "MARGIN_AVG_LAST5": "HOME_MARGIN_AVG_LAST5",
        "STREAK": "HOME_STREAK",
        "REST_DAYS": "HOME_REST_DAYS",
    })
    [["GAME_ID", "TEAM_ID",
      "HOME_PF_AVG_LAST5", "HOME_PA_AVG_LAST5",
      "HOME_WIN_RATE_LAST5", "HOME_MARGIN_AVG_LAST5",
      "HOME_STREAK", "HOME_REST_DAYS"]]
)

# AWAY
away_features = (
    df_team_games[df_team_games["IS_HOME"] == 0]
    .rename(columns={
        "PF_AVG_LAST5": "AWAY_PF_AVG_LAST5",
        "PA_AVG_LAST5": "AWAY_PA_AVG_LAST5",
        "WIN_RATE_LAST5": "AWAY_WIN_RATE_LAST5",
        "MARGIN_AVG_LAST5": "AWAY_MARGIN_AVG_LAST5",
        "STREAK": "AWAY_STREAK",
        "REST_DAYS": "AWAY_REST_DAYS",
    })
    [["GAME_ID", "TEAM_ID",
      "AWAY_PF_AVG_LAST5", "AWAY_PA_AVG_LAST5",
      "AWAY_WIN_RATE_LAST5", "AWAY_MARGIN_AVG_LAST5",
      "AWAY_STREAK", "AWAY_REST_DAYS"]]
)

df_model = (
    df_games
    .merge(home_features, left_on=["GAME_ID", "HOME_TEAM_ID"],
           right_on=["GAME_ID", "TEAM_ID"], how="left")
    .merge(away_features, left_on=["GAME_ID", "AWAY_TEAM_ID"],
           right_on=["GAME_ID", "TEAM_ID"], how="left",
           suffixes=("_HOME", "_AWAY"))
    .drop(columns=["TEAM_ID_HOME", "TEAM_ID_AWAY"])
)

df_model["REST_ADVANTAGE"] = df_model["HOME_REST_DAYS"] - df_model["AWAY_REST_DAYS"]
df_model.head()


Unnamed: 0,GAME_ID,GAME_DATE,HOME_TEAM_ID,HOME_TEAM_NAME,HOME_TEAM_ABBR,HOME_PTS,AWAY_TEAM_ID,AWAY_TEAM_NAME,AWAY_TEAM_ABBR,AWAY_PTS,MARGIN_HOME,HOME_WIN,TOTAL_POINTS,HOME_PF_AVG_LAST5,HOME_PA_AVG_LAST5,HOME_WIN_RATE_LAST5,HOME_MARGIN_AVG_LAST5,HOME_STREAK,HOME_REST_DAYS,AWAY_PF_AVG_LAST5,AWAY_PA_AVG_LAST5,AWAY_WIN_RATE_LAST5,AWAY_MARGIN_AVG_LAST5,AWAY_STREAK,AWAY_REST_DAYS,REST_ADVANTAGE
0,22500001,2025-10-21,1610612760,Oklahoma City Thunder,OKC,125,1610612745,Houston Rockets,HOU,124,1,1,249,,,,,0,,,,,,0,,
1,22500002,2025-10-21,1610612747,Los Angeles Lakers,LAL,109,1610612744,Golden State Warriors,GSW,119,-10,0,228,,,,,0,,,,,,0,,
2,22500003,2025-10-22,1610612752,New York Knicks,NYK,119,1610612739,Cleveland Cavaliers,CLE,111,8,1,230,,,,,0,,,,,,0,,
3,22500004,2025-10-22,1610612742,Dallas Mavericks,DAL,92,1610612759,San Antonio Spurs,SAS,125,-33,0,217,,,,,0,,,,,,0,,
4,22500080,2025-10-22,1610612766,Charlotte Hornets,CHA,136,1610612751,Brooklyn Nets,BKN,117,19,1,253,,,,,0,,,,,,0,,
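Since each game should match exactly one home row and one away row, the merge can be asked to verify that. The sketch below uses toy frames and the `validate` argument of `DataFrame.merge`; it is an optional safeguard, not something the notebook does.

```python
import pandas as pd

games = pd.DataFrame({"GAME_ID": [1, 2], "HOME_TEAM_ID": [10, 20]})
home_features = pd.DataFrame({
    "GAME_ID": [1, 2],
    "TEAM_ID": [10, 20],
    "HOME_REST_DAYS": [2.0, 1.0],
})

merge_args = dict(
    left_on=["GAME_ID", "HOME_TEAM_ID"],
    right_on=["GAME_ID", "TEAM_ID"],
    how="left",
    validate="one_to_one",  # raises MergeError if keys repeat on either side
)

merged = games.merge(home_features, **merge_args).drop(columns="TEAM_ID")
same_row_count = len(merged) == len(games)

# A duplicated feature row would silently duplicate games without validate.
duplicated = pd.concat([home_features, home_features.iloc[[0]]], ignore_index=True)
try:
    games.merge(duplicated, **merge_args)
    raised = False
except pd.errors.MergeError:
    raised = True

print(same_row_count, raised)
```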


## 6. Rest advantage and handling of initial rows without history

The previous cell already created:

- `REST_ADVANTAGE` = `HOME_REST_DAYS` - `AWAY_REST_DAYS`

The rolling features and rest days are NaN only where a team plays its first game
of the season: because the rolling windows use `min_periods=1`, every later game
already has some history. Instead of dropping those rows, we fill the NaNs with
the mean of each column.


In [8]:
cols_rolling = [
    "HOME_PF_AVG_LAST5", "AWAY_PF_AVG_LAST5",
    "HOME_PA_AVG_LAST5", "AWAY_PA_AVG_LAST5",
    "HOME_WIN_RATE_LAST5", "AWAY_WIN_RATE_LAST5",
    "HOME_MARGIN_AVG_LAST5", "AWAY_MARGIN_AVG_LAST5",
    "HOME_STREAK", "AWAY_STREAK",
    "HOME_REST_DAYS", "AWAY_REST_DAYS",
    "REST_ADVANTAGE",
]

# Fill NaNs with the mean of each column
df_model[cols_rolling] = df_model[cols_rolling].fillna(
    df_model[cols_rolling].mean()
)

df_model[cols_rolling].isna().sum()


HOME_PF_AVG_LAST5        0
AWAY_PF_AVG_LAST5        0
HOME_PA_AVG_LAST5        0
AWAY_PA_AVG_LAST5        0
HOME_WIN_RATE_LAST5      0
AWAY_WIN_RATE_LAST5      0
HOME_MARGIN_AVG_LAST5    0
AWAY_MARGIN_AVG_LAST5    0
HOME_STREAK              0
AWAY_STREAK              0
HOME_REST_DAYS           0
AWAY_REST_DAYS           0
REST_ADVANTAGE           0
dtype: int64
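There are two common ways to treat rows without history: drop them or impute them. The toy sketch below contrasts both; the values are invented for illustration. Note that a mean taken over the whole season also uses later games, so dropping can be the more conservative choice if strict leakage control matters.

```python
import pandas as pd

# Toy model table: the first rows lack history for one or both teams.
df = pd.DataFrame({
    "GAME_ID": [1, 2, 3, 4],
    "HOME_PF_AVG_LAST5": [None, 110.0, 120.0, 115.0],
    "AWAY_PF_AVG_LAST5": [None, None, 105.0, 112.0],
})
cols = ["HOME_PF_AVG_LAST5", "AWAY_PF_AVG_LAST5"]

# Option A: drop games where either team has no history.
dropped = df.dropna(subset=cols)

# Option B: keep every game and fill the gaps with the column mean.
filled = df.copy()
filled[cols] = filled[cols].fillna(filled[cols].mean())

print(len(dropped), filled.loc[0, cols].tolist())
```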

## 💾 7. Save the enriched dataset for modeling

We save the result to `data/processed/games_2025_26_features.csv`
for use in the modeling notebook (03).


In [9]:
output_model_path = "../data/processed/games_2025_26_features.csv"
df_model.to_csv(output_model_path, index=False)
output_model_path


'../data/processed/games_2025_26_features.csv'

# ✅ Summary of Notebook 02: Feature Engineering

In this notebook we:

- Converted the game dataset to a TEAM-GAME view (`df_team_games`).
- Computed for each team, before each game:
  - Average points scored/allowed over the last 5 games.
  - Win rate and average margin over the last 5 games.
  - Win/loss streak (`STREAK`).
  - Rest days (`REST_DAYS`).
- Projected these features to the GAME level:
  - `HOME_...` and `AWAY_...` variables for the home and away teams.
  - Rest advantage (`REST_ADVANTAGE`).
- Filled the missing values in games without prior history (each team's first game) with the column mean.
- Saved the enriched dataset to:
  `data/processed/games_2025_26_features.csv`.

In the next notebook (`03_modelos_regresion.ipynb`) we will train classification
and regression models (Random Forest / Gradient Boosting / XGBoost) to
predict:

- Home win (`HOME_WIN`).
- Margin of victory (`MARGIN_HOME`).
- Total points in the game (`TOTAL_POINTS`).
- Points per team (`HOME_PTS`, `AWAY_PTS`).
