# 🧩 NBA 2026 Prediction: Feature Engineering

In this notebook we build advanced *features* from the games dataset
already processed in notebook 01.

We work on the current season and add information such as:

- Recent form of each team (averages over the last few games)
- Win/loss streak (*streak*)
- Recent win percentage
- Rest days between games (fatigue)
- Rest advantage between the home and away teams

The goal is to produce an enriched dataset ready for training models
(classification and regression) in the next notebook.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 50)

# Load the base games dataset saved in notebook 01
games_path = "../data/processed/games_2025_26_basic.csv"
df_games = pd.read_csv(games_path, parse_dates=["GAME_DATE"])

df_games.head()


Unnamed: 0,GAME_ID,GAME_DATE,HOME_TEAM_ID,HOME_TEAM_NAME,HOME_TEAM_ABBR,HOME_PTS,AWAY_TEAM_ID,AWAY_TEAM_NAME,AWAY_TEAM_ABBR,AWAY_PTS,MARGIN_HOME,HOME_WIN,TOTAL_POINTS
0,22500001,2025-10-21,1610612760,Oklahoma City Thunder,OKC,125,1610612745,Houston Rockets,HOU,124,1,1,249
1,22500002,2025-10-21,1610612747,Los Angeles Lakers,LAL,109,1610612744,Golden State Warriors,GSW,119,-10,0,228
2,22500003,2025-10-22,1610612752,New York Knicks,NYK,119,1610612739,Cleveland Cavaliers,CLE,111,8,1,230
3,22500004,2025-10-22,1610612742,Dallas Mavericks,DAL,92,1610612759,San Antonio Spurs,SAS,125,-33,0,217
4,22500080,2025-10-22,1610612766,Charlotte Hornets,CHA,136,1610612751,Brooklyn Nets,BKN,117,19,1,253


## 1. Building the TEAM-GAME dataset

Starting from `df_games` (one row per game), we create a dataframe
`df_team_games` where:

- Each row represents a **team in a game**.
- The columns are:
  - `TEAM_ID`, `TEAM_NAME`
  - `IS_HOME` (1 if the team was at home, 0 if it was away)
  - `POINTS_FOR`, `POINTS_AGAINST`
  - `WIN` (1 = that team won, 0 = it lost)
  - `GAME_DATE`, `GAME_ID`


In [2]:
rows = []

for _, row in df_games.iterrows():
    # Home team
    rows.append({
        "GAME_ID": row["GAME_ID"],
        "GAME_DATE": row["GAME_DATE"],
        "TEAM_ID": row["HOME_TEAM_ID"],
        "TEAM_NAME": row["HOME_TEAM_NAME"],
        "IS_HOME": 1,
        "POINTS_FOR": row["HOME_PTS"],
        "POINTS_AGAINST": row["AWAY_PTS"],
        "WIN": 1 if row["HOME_WIN"] == 1 else 0,
    })
    # Away team
    rows.append({
        "GAME_ID": row["GAME_ID"],
        "GAME_DATE": row["GAME_DATE"],
        "TEAM_ID": row["AWAY_TEAM_ID"],
        "TEAM_NAME": row["AWAY_TEAM_NAME"],
        "IS_HOME": 0,
        "POINTS_FOR": row["AWAY_PTS"],
        "POINTS_AGAINST": row["HOME_PTS"],
        "WIN": 1 if row["HOME_WIN"] == 0 else 0,
    })

df_team_games = pd.DataFrame(rows)
df_team_games = df_team_games.sort_values(["TEAM_ID", "GAME_DATE"]).reset_index(drop=True)
df_team_games.head()


Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN
0,22500082,2025-10-22,1610612737,Atlanta Hawks,1,118,138,0
1,22500083,2025-10-22,1610612738,Boston Celtics,1,116,117,0
2,22500003,2025-10-22,1610612739,Cleveland Cavaliers,0,111,119,0
3,22500085,2025-10-22,1610612740,New Orleans Pelicans,0,122,128,0
4,22500084,2025-10-22,1610612741,Chicago Bulls,1,115,111,1
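
The same reshape can also be done without a row loop. The sketch below is one possible vectorized variant, not the notebook's own method: it builds the home and away halves as two frames and stacks them with `pd.concat`. The helper name `to_team_games` and the toy games are hypothetical, and the sketch assumes `HOME_WIN` only takes the values 0 and 1.

```python
import pandas as pd

# Toy stand-in for df_games: two made-up games (values are illustrative only).
toy_games = pd.DataFrame({
    "GAME_ID": [1, 2],
    "GAME_DATE": pd.to_datetime(["2025-10-21", "2025-10-22"]),
    "HOME_TEAM_ID": [10, 30],
    "AWAY_TEAM_ID": [20, 10],
    "HOME_PTS": [110, 95],
    "AWAY_PTS": [100, 105],
    "HOME_WIN": [1, 0],
})


def to_team_games(games: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: two rows per game (home and away), no row loop."""
    home = pd.DataFrame({
        "GAME_ID": games["GAME_ID"],
        "GAME_DATE": games["GAME_DATE"],
        "TEAM_ID": games["HOME_TEAM_ID"],
        "IS_HOME": 1,
        "POINTS_FOR": games["HOME_PTS"],
        "POINTS_AGAINST": games["AWAY_PTS"],
        "WIN": games["HOME_WIN"],
    })
    away = pd.DataFrame({
        "GAME_ID": games["GAME_ID"],
        "GAME_DATE": games["GAME_DATE"],
        "TEAM_ID": games["AWAY_TEAM_ID"],
        "IS_HOME": 0,
        "POINTS_FOR": games["AWAY_PTS"],
        "POINTS_AGAINST": games["HOME_PTS"],
        # Assumes HOME_WIN is strictly 0/1, so the away result is its complement.
        "WIN": 1 - games["HOME_WIN"],
    })
    return (
        pd.concat([home, away], ignore_index=True)
        .sort_values(["TEAM_ID", "GAME_DATE"])
        .reset_index(drop=True)
    )


toy_team_games = to_team_games(toy_games)
print(toy_team_games)
```

On a single season the loop and the vectorized version should both be fast enough; the vectorized form mainly helps if several seasons are processed at once.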


## 2. Recent form per team (rolling stats)

For each team we compute, **before each game**:

- `PF_AVG_LAST5`: average points scored over the last 5 games.
- `PA_AVG_LAST5`: average points allowed over the last 5 games.
- `WIN_RATE_LAST5`: share of wins over the last 5 games.
- `MARGIN_AVG_LAST5`: average point margin (in the team's favor) over the last 5 games.

We use `shift(1)` to avoid information leakage: each value is computed only from games played before the current one. Because `rolling(window)` requires a full window by default, these columns are NaN until a team has played 5 previous games.


In [3]:
df_team_games["MARGIN"] = df_team_games["POINTS_FOR"] - df_team_games["POINTS_AGAINST"]

def add_rolling_stats(team_df, window=5):
    team_df = team_df.sort_values("GAME_DATE").copy()

    team_df["PF_AVG_LAST5"] = (
        team_df["POINTS_FOR"]
        .shift(1)
        .rolling(window)
        .mean()
    )
    team_df["PA_AVG_LAST5"] = (
        team_df["POINTS_AGAINST"]
        .shift(1)
        .rolling(window)
        .mean()
    )
    team_df["WIN_RATE_LAST5"] = (
        team_df["WIN"]
        .shift(1)
        .rolling(window)
        .mean()
    )
    team_df["MARGIN_AVG_LAST5"] = (
        team_df["MARGIN"]
        .shift(1)
        .rolling(window)
        .mean()
    )
    return team_df

df_team_games = (
    df_team_games
    .groupby("TEAM_ID", group_keys=False)
    .apply(add_rolling_stats)
)

df_team_games.head(10)



Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN,MARGIN,PF_AVG_LAST5,PA_AVG_LAST5,WIN_RATE_LAST5,MARGIN_AVG_LAST5
0,22500082,2025-10-22,1610612737,Atlanta Hawks,1,118,138,0,-20,,,,
1,22500083,2025-10-22,1610612738,Boston Celtics,1,116,117,0,-1,,,,
2,22500003,2025-10-22,1610612739,Cleveland Cavaliers,0,111,119,0,-8,,,,
3,22500085,2025-10-22,1610612740,New Orleans Pelicans,0,122,128,0,-6,,,,
4,22500084,2025-10-22,1610612741,Chicago Bulls,1,115,111,1,4,,,,
5,22500004,2025-10-22,1610612742,Dallas Mavericks,1,92,125,0,-33,,,,
6,22500002,2025-10-21,1610612744,Golden State Warriors,0,119,109,1,10,,,,
7,22500001,2025-10-21,1610612745,Houston Rockets,0,124,125,0,-1,,,,
8,22500087,2025-10-22,1610612746,LA Clippers,0,108,129,0,-21,,,,
9,22500002,2025-10-21,1610612747,Los Angeles Lakers,1,109,119,0,-10,,,,
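
To see what `shift(1)` changes, the following toy sketch compares a shifted rolling mean with an unshifted one. The point values are made up, and the window is shortened to 2 (instead of the notebook's 5) only to keep the example small.

```python
import pandas as pd

# Made-up points for one team over four consecutive games.
points = pd.Series([100, 110, 120, 130], name="POINTS_FOR")
window = 2  # assumption for the demo; the notebook uses window=5

# Pre-game average: only games strictly before the current one.
pre_game_avg = points.shift(1).rolling(window).mean()

# Unshifted average: includes the current game's own score (leaks the outcome).
leaky_avg = points.rolling(window).mean()

print(pd.DataFrame({"points": points, "pre_game": pre_game_avg, "leaky": leaky_avg}))
```

The pre-game column is NaN for the first `window` games, while the leaky column already contains the score of the game being predicted. `rolling` also accepts a `min_periods` argument for cases where partial windows are acceptable early in the season.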


## 3. Win/loss streak (*streak*)

We define `STREAK` as:

- The number of consecutive wins before the game (positive).
- The number of consecutive losses before the game (negative).
- 0 for a team's first game, when there is no previous result.

Example:
- [W, W, L, L] → streak entering each game: 0, +1, +2, -1; a fifth game would start at -2 (shifted with `shift(1)`).


In [4]:
def compute_streak(win_series: pd.Series) -> pd.Series:
    streaks = []
    streak = 0
    for w in win_series.shift(1):  # only up to the previous game
        if pd.isna(w):
            streak = 0
        else:
            if w == 1:
                streak = streak + 1 if streak >= 0 else 1
            else:
                streak = streak - 1 if streak <= 0 else -1
        streaks.append(streak)
    # IMPORTANT: return the Series with the same index as the input
    return pd.Series(streaks, index=win_series.index)

df_team_games = df_team_games.sort_values(["TEAM_ID", "GAME_DATE"]).reset_index(drop=True)

df_team_games["STREAK"] = (
    df_team_games
        .groupby("TEAM_ID", group_keys=False)["WIN"]
        .apply(compute_streak)
)

df_team_games.head(12)


Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN,MARGIN,PF_AVG_LAST5,PA_AVG_LAST5,WIN_RATE_LAST5,MARGIN_AVG_LAST5,STREAK
0,22500082,2025-10-22,1610612737,Atlanta Hawks,1,118,138,0,-20,,,,,0
1,22500083,2025-10-22,1610612738,Boston Celtics,1,116,117,0,-1,,,,,0
2,22500003,2025-10-22,1610612739,Cleveland Cavaliers,0,111,119,0,-8,,,,,0
3,22500085,2025-10-22,1610612740,New Orleans Pelicans,0,122,128,0,-6,,,,,0
4,22500084,2025-10-22,1610612741,Chicago Bulls,1,115,111,1,4,,,,,0
5,22500004,2025-10-22,1610612742,Dallas Mavericks,1,92,125,0,-33,,,,,0
6,22500002,2025-10-21,1610612744,Golden State Warriors,0,119,109,1,10,,,,,0
7,22500001,2025-10-21,1610612745,Houston Rockets,0,124,125,0,-1,,,,,0
8,22500087,2025-10-22,1610612746,LA Clippers,0,108,129,0,-21,,,,,0
9,22500002,2025-10-21,1610612747,Los Angeles Lakers,1,109,119,0,-10,,,,,0
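
Since the real data shown here only covers each team's first game, every `STREAK` is 0. The sketch below checks the streak logic on a made-up sequence instead. `streak_before_each_game` is a hypothetical, condensed restatement of the streak rule, written to be self-contained.

```python
import pandas as pd


def streak_before_each_game(wins: pd.Series) -> pd.Series:
    """Signed streak entering each game (condensed restatement of the rule)."""
    out, streak = [], 0
    for w in wins.shift(1):  # result of the previous game, NaN for the first
        if pd.isna(w):
            streak = 0
        elif w == 1:
            streak = streak + 1 if streak > 0 else 1
        else:
            streak = streak - 1 if streak < 0 else -1
        out.append(streak)
    return pd.Series(out, index=wins.index)


# Made-up results: W, W, L, L, W
toy_wins = pd.Series([1, 1, 0, 0, 1])
toy_streak = streak_before_each_game(toy_wins)
print(toy_streak.tolist())  # [0, 1, 2, -1, -2]
```

The final win does not appear in the output because it would only affect the streak entering a sixth game.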


## 4. Rest days and fatigue

We compute each team's number of rest days before each game:

- `REST_DAYS` = days between the current game and the same team's previous game.
- For a team's first game of the season → `NaN` (handled later).

This lets us compute the **rest advantage** between the home and away teams.


In [5]:
def add_rest_days(team_df):
    team_df = team_df.sort_values("GAME_DATE").copy()
    prev_date = team_df["GAME_DATE"].shift(1)
    team_df["REST_DAYS"] = (team_df["GAME_DATE"] - prev_date).dt.days
    return team_df

df_team_games = (
    df_team_games
    .groupby("TEAM_ID", group_keys=False)
    .apply(add_rest_days)
)

df_team_games.head(10)



Unnamed: 0,GAME_ID,GAME_DATE,TEAM_ID,TEAM_NAME,IS_HOME,POINTS_FOR,POINTS_AGAINST,WIN,MARGIN,PF_AVG_LAST5,PA_AVG_LAST5,WIN_RATE_LAST5,MARGIN_AVG_LAST5,STREAK,REST_DAYS
0,22500082,2025-10-22,1610612737,Atlanta Hawks,1,118,138,0,-20,,,,,0,
1,22500083,2025-10-22,1610612738,Boston Celtics,1,116,117,0,-1,,,,,0,
2,22500003,2025-10-22,1610612739,Cleveland Cavaliers,0,111,119,0,-8,,,,,0,
3,22500085,2025-10-22,1610612740,New Orleans Pelicans,0,122,128,0,-6,,,,,0,
4,22500084,2025-10-22,1610612741,Chicago Bulls,1,115,111,1,4,,,,,0,
5,22500004,2025-10-22,1610612742,Dallas Mavericks,1,92,125,0,-33,,,,,0,
6,22500002,2025-10-21,1610612744,Golden State Warriors,0,119,109,1,10,,,,,0,
7,22500001,2025-10-21,1610612745,Houston Rockets,0,124,125,0,-1,,,,,0,
8,22500087,2025-10-22,1610612746,LA Clippers,0,108,129,0,-21,,,,,0,
9,22500002,2025-10-21,1610612747,Los Angeles Lakers,1,109,119,0,-10,,,,,0,
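
All `REST_DAYS` values above are NaN because each team has only one game so far. The sketch below shows what the column would look like on a made-up schedule, using a per-team `diff()` as an alternative way to express the same date difference. Team IDs and dates are invented for illustration.

```python
import pandas as pd

# Made-up schedule: team 10 plays three games, team 20 plays two.
toy = pd.DataFrame({
    "TEAM_ID": [10, 10, 10, 20, 20],
    "GAME_DATE": pd.to_datetime(
        ["2025-10-21", "2025-10-22", "2025-10-25", "2025-10-22", "2025-10-24"]
    ),
}).sort_values(["TEAM_ID", "GAME_DATE"]).reset_index(drop=True)

# Days since the same team's previous game; NaN for each team's first game.
toy["REST_DAYS"] = toy.groupby("TEAM_ID")["GAME_DATE"].diff().dt.days

print(toy)
```

Under this definition a back-to-back (games on consecutive days) gives `REST_DAYS` = 1, not 0, since the value is a difference between calendar dates.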


## 5. Merging TEAM-GAME features to the GAME level

We use `df_team_games` to build:

- `home_features`: stats of the home team in each game.
- `away_features`: stats of the away team.

We then join both onto `df_games` using `GAME_ID` and the corresponding team ID.


In [6]:
# HOME view
home_features = (
    df_team_games[df_team_games["IS_HOME"] == 1]
    .rename(columns={
        "POINTS_FOR": "HOME_POINTS_FOR",
        "POINTS_AGAINST": "HOME_POINTS_AGAINST",
        "PF_AVG_LAST5": "HOME_PF_AVG_LAST5",
        "PA_AVG_LAST5": "HOME_PA_AVG_LAST5",
        "WIN_RATE_LAST5": "HOME_WIN_RATE_LAST5",
        "MARGIN_AVG_LAST5": "HOME_MARGIN_AVG_LAST5",
        "STREAK": "HOME_STREAK",
        "REST_DAYS": "HOME_REST_DAYS",
    })
)

home_features = home_features[[
    "GAME_ID", "TEAM_ID",
    "HOME_PF_AVG_LAST5", "HOME_PA_AVG_LAST5",
    "HOME_WIN_RATE_LAST5", "HOME_MARGIN_AVG_LAST5",
    "HOME_STREAK", "HOME_REST_DAYS"
]]

# AWAY view
away_features = (
    df_team_games[df_team_games["IS_HOME"] == 0]
    .rename(columns={
        "POINTS_FOR": "AWAY_POINTS_FOR",
        "POINTS_AGAINST": "AWAY_POINTS_AGAINST",
        "PF_AVG_LAST5": "AWAY_PF_AVG_LAST5",
        "PA_AVG_LAST5": "AWAY_PA_AVG_LAST5",
        "WIN_RATE_LAST5": "AWAY_WIN_RATE_LAST5",
        "MARGIN_AVG_LAST5": "AWAY_MARGIN_AVG_LAST5",
        "STREAK": "AWAY_STREAK",
        "REST_DAYS": "AWAY_REST_DAYS",
    })
)

away_features = away_features[[
    "GAME_ID", "TEAM_ID",
    "AWAY_PF_AVG_LAST5", "AWAY_PA_AVG_LAST5",
    "AWAY_WIN_RATE_LAST5", "AWAY_MARGIN_AVG_LAST5",
    "AWAY_STREAK", "AWAY_REST_DAYS"
]]

home_features.head(), away_features.head()


(    GAME_ID     TEAM_ID  HOME_PF_AVG_LAST5  HOME_PA_AVG_LAST5  \
 0  22500082  1610612737                NaN                NaN   
 1  22500083  1610612738                NaN                NaN   
 4  22500084  1610612741                NaN                NaN   
 5  22500004  1610612742                NaN                NaN   
 9  22500002  1610612747                NaN                NaN   
 
    HOME_WIN_RATE_LAST5  HOME_MARGIN_AVG_LAST5  HOME_STREAK  HOME_REST_DAYS  
 0                  NaN                    NaN            0             NaN  
 1                  NaN                    NaN            0             NaN  
 4                  NaN                    NaN            0             NaN  
 5                  NaN                    NaN            0             NaN  
 9                  NaN                    NaN            0             NaN  ,
     GAME_ID     TEAM_ID  AWAY_PF_AVG_LAST5  AWAY_PA_AVG_LAST5  \
 2  22500003  1610612739                NaN                NaN   
 

In [7]:
df_model = (
    df_games
    .merge(home_features, left_on=["GAME_ID", "HOME_TEAM_ID"], right_on=["GAME_ID", "TEAM_ID"], how="left")
    .merge(away_features, left_on=["GAME_ID", "AWAY_TEAM_ID"], right_on=["GAME_ID", "TEAM_ID"], how="left", suffixes=("_HOME", "_AWAY"))
)

# The TEAM_ID_HOME / TEAM_ID_AWAY columns from the merge are no longer needed
df_model = df_model.drop(columns=["TEAM_ID_HOME", "TEAM_ID_AWAY"])

df_model.head()


Unnamed: 0,GAME_ID,GAME_DATE,HOME_TEAM_ID,HOME_TEAM_NAME,HOME_TEAM_ABBR,HOME_PTS,AWAY_TEAM_ID,AWAY_TEAM_NAME,AWAY_TEAM_ABBR,AWAY_PTS,MARGIN_HOME,HOME_WIN,TOTAL_POINTS,HOME_PF_AVG_LAST5,HOME_PA_AVG_LAST5,HOME_WIN_RATE_LAST5,HOME_MARGIN_AVG_LAST5,HOME_STREAK,HOME_REST_DAYS,AWAY_PF_AVG_LAST5,AWAY_PA_AVG_LAST5,AWAY_WIN_RATE_LAST5,AWAY_MARGIN_AVG_LAST5,AWAY_STREAK,AWAY_REST_DAYS
0,22500001,2025-10-21,1610612760,Oklahoma City Thunder,OKC,125,1610612745,Houston Rockets,HOU,124,1,1,249,,,,,0,,,,,,0,
1,22500002,2025-10-21,1610612747,Los Angeles Lakers,LAL,109,1610612744,Golden State Warriors,GSW,119,-10,0,228,,,,,0,,,,,,0,
2,22500003,2025-10-22,1610612752,New York Knicks,NYK,119,1610612739,Cleveland Cavaliers,CLE,111,8,1,230,,,,,0,,,,,,0,
3,22500004,2025-10-22,1610612742,Dallas Mavericks,DAL,92,1610612759,San Antonio Spurs,SAS,125,-33,0,217,,,,,0,,,,,,0,
4,22500080,2025-10-22,1610612766,Charlotte Hornets,CHA,136,1610612751,Brooklyn Nets,BKN,117,19,1,253,,,,,0,,,,,,0,
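
A possible variation on this join, sketched on made-up frames: renaming `TEAM_ID` to match the key in the games table lets `merge` use `on=` directly, so no suffixed `TEAM_ID_*` columns need to be dropped afterwards. The `validate="one_to_one"` argument asks pandas to raise if a key is duplicated on either side. The frame names and values below are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for df_games, home_features and away_features.
games = pd.DataFrame({
    "GAME_ID": [1, 2],
    "HOME_TEAM_ID": [10, 30],
    "AWAY_TEAM_ID": [20, 10],
})
home = pd.DataFrame({"GAME_ID": [1, 2], "TEAM_ID": [10, 30], "HOME_STREAK": [0, 0]})
away = pd.DataFrame({"GAME_ID": [1, 2], "TEAM_ID": [20, 10], "AWAY_STREAK": [0, 1]})

merged = (
    games
    .merge(
        home.rename(columns={"TEAM_ID": "HOME_TEAM_ID"}),
        on=["GAME_ID", "HOME_TEAM_ID"],
        how="left",
        validate="one_to_one",  # fail loudly on duplicated keys
    )
    .merge(
        away.rename(columns={"TEAM_ID": "AWAY_TEAM_ID"}),
        on=["GAME_ID", "AWAY_TEAM_ID"],
        how="left",
        validate="one_to_one",
    )
)

print(merged)
```

The row count staying equal to the number of games is a quick sanity check that the join did not duplicate any game.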


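## 6. Rest advantage and handling of early rows without history

We create:

- `REST_ADVANTAGE` = `HOME_REST_DAYS` - `AWAY_REST_DAYS`

We then deal with the rows where:

- `HOME_PF_AVG_LAST5` or `AWAY_PF_AVG_LAST5` are NaN
(these are roughly each team's first 5 games, when there is no history yet).

Rather than dropping those rows, the cell below fills the missing values with each column's mean, so the row count is unchanged (12 before and after). A column with no observed values at all has a NaN mean and therefore stays NaN, which is why the preview in section 7 still shows NaN.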


In [8]:
df_model["REST_ADVANTAGE"] = df_model["HOME_REST_DAYS"] - df_model["AWAY_REST_DAYS"]

len_before = len(df_model)

df_model_clean = df_model.copy()

cols_rolling = [
    "HOME_PF_AVG_LAST5", "AWAY_PF_AVG_LAST5",
    "HOME_PA_AVG_LAST5", "AWAY_PA_AVG_LAST5",
    "HOME_WIN_RATE_LAST5", "AWAY_WIN_RATE_LAST5",
    "HOME_MARGIN_AVG_LAST5", "AWAY_MARGIN_AVG_LAST5",
    "HOME_STREAK", "AWAY_STREAK",
    "HOME_REST_DAYS", "AWAY_REST_DAYS",
    "REST_ADVANTAGE",
]

# Fill NaN with each column's mean (for the early games without history)
df_model_clean[cols_rolling] = df_model_clean[cols_rolling].fillna(
    df_model_clean[cols_rolling].mean()
)

len_after = len(df_model_clean)
len_before, len_after



(12, 12)
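
The behavior of mean imputation on a column with no data can be checked on a small made-up frame. The sketch below also shows `dropna` as the alternative of removing rows without history. The column values are invented, and which strategy is preferable depends on the modeling step.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HOME_PF_AVG_LAST5": [np.nan, np.nan, np.nan],  # no history at all
    "HOME_REST_DAYS": [np.nan, 2.0, 4.0],           # partially observed
})

# Mean imputation: only columns with at least one observed value get filled.
filled = toy.fillna(toy.mean())

# Alternative: drop rows where the rolling feature is missing.
dropped = toy.dropna(subset=["HOME_PF_AVG_LAST5"])

print(filled)
print(len(toy), len(dropped))
```

Here `HOME_REST_DAYS` is filled with 3.0, while `HOME_PF_AVG_LAST5` stays entirely NaN because its mean is itself NaN. Dropping on that column removes every row.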

## 7. Quick look at features and targets

We review the final columns available for training the models.


In [9]:
df_model_clean.head()


Unnamed: 0,GAME_ID,GAME_DATE,HOME_TEAM_ID,HOME_TEAM_NAME,HOME_TEAM_ABBR,HOME_PTS,AWAY_TEAM_ID,AWAY_TEAM_NAME,AWAY_TEAM_ABBR,AWAY_PTS,MARGIN_HOME,HOME_WIN,TOTAL_POINTS,HOME_PF_AVG_LAST5,HOME_PA_AVG_LAST5,HOME_WIN_RATE_LAST5,HOME_MARGIN_AVG_LAST5,HOME_STREAK,HOME_REST_DAYS,AWAY_PF_AVG_LAST5,AWAY_PA_AVG_LAST5,AWAY_WIN_RATE_LAST5,AWAY_MARGIN_AVG_LAST5,AWAY_STREAK,AWAY_REST_DAYS,REST_ADVANTAGE
0,22500001,2025-10-21,1610612760,Oklahoma City Thunder,OKC,125,1610612745,Houston Rockets,HOU,124,1,1,249,,,,,0,,,,,,0,,
1,22500002,2025-10-21,1610612747,Los Angeles Lakers,LAL,109,1610612744,Golden State Warriors,GSW,119,-10,0,228,,,,,0,,,,,,0,,
2,22500003,2025-10-22,1610612752,New York Knicks,NYK,119,1610612739,Cleveland Cavaliers,CLE,111,8,1,230,,,,,0,,,,,,0,,
3,22500004,2025-10-22,1610612742,Dallas Mavericks,DAL,92,1610612759,San Antonio Spurs,SAS,125,-33,0,217,,,,,0,,,,,,0,,
4,22500080,2025-10-22,1610612766,Charlotte Hornets,CHA,136,1610612751,Brooklyn Nets,BKN,117,19,1,253,,,,,0,,,,,,0,,


In [10]:
df_model_clean.columns


Index(['GAME_ID', 'GAME_DATE', 'HOME_TEAM_ID', 'HOME_TEAM_NAME',
       'HOME_TEAM_ABBR', 'HOME_PTS', 'AWAY_TEAM_ID', 'AWAY_TEAM_NAME',
       'AWAY_TEAM_ABBR', 'AWAY_PTS', 'MARGIN_HOME', 'HOME_WIN', 'TOTAL_POINTS',
       'HOME_PF_AVG_LAST5', 'HOME_PA_AVG_LAST5', 'HOME_WIN_RATE_LAST5',
       'HOME_MARGIN_AVG_LAST5', 'HOME_STREAK', 'HOME_REST_DAYS',
       'AWAY_PF_AVG_LAST5', 'AWAY_PA_AVG_LAST5', 'AWAY_WIN_RATE_LAST5',
       'AWAY_MARGIN_AVG_LAST5', 'AWAY_STREAK', 'AWAY_REST_DAYS',
       'REST_ADVANTAGE'],
      dtype='object')
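
One way to separate these columns into model inputs and targets is sketched below. The split itself is an assumption about how the next notebook might use the data: `TARGET_COLS`, `ID_COLS` and `split_features_targets` are hypothetical names, and the one-row frame is made up. The key point is that the game's own scores and result must not appear among the inputs.

```python
import pandas as pd

# Assumed grouping of the columns listed above (not taken from notebook 03).
TARGET_COLS = ["HOME_WIN", "MARGIN_HOME", "TOTAL_POINTS", "HOME_PTS", "AWAY_PTS"]
ID_COLS = [
    "GAME_ID", "GAME_DATE",
    "HOME_TEAM_ID", "HOME_TEAM_NAME", "HOME_TEAM_ABBR",
    "AWAY_TEAM_ID", "AWAY_TEAM_NAME", "AWAY_TEAM_ABBR",
]


def split_features_targets(df: pd.DataFrame):
    """Hypothetical helper: everything that is neither a target nor an ID is a feature."""
    excluded = set(TARGET_COLS) | set(ID_COLS)
    feature_cols = [c for c in df.columns if c not in excluded]
    return df[feature_cols], df[TARGET_COLS]


# Made-up single game with a reduced set of columns.
toy = pd.DataFrame({
    "GAME_ID": [1],
    "GAME_DATE": pd.to_datetime(["2025-11-01"]),
    "HOME_PTS": [110],
    "AWAY_PTS": [100],
    "MARGIN_HOME": [10],
    "HOME_WIN": [1],
    "TOTAL_POINTS": [210],
    "HOME_STREAK": [2],
    "AWAY_STREAK": [-1],
    "REST_ADVANTAGE": [1.0],
})

X, y = split_features_targets(toy)
print(list(X.columns))
```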

## 💾 8. Saving the enriched dataset for modeling

We save the result to `data/processed/games_2025_26_features.csv`
for use in the modeling notebook (03).


In [11]:
output_model_path = "../data/processed/games_2025_26_features.csv"
df_model_clean.to_csv(output_model_path, index=False)
output_model_path


'../data/processed/games_2025_26_features.csv'

# ✅ Summary of Notebook 02: Feature Engineering

In this notebook we:

- Converted the games dataset to a TEAM-GAME view (`df_team_games`).
- Computed for each team, before each game:
  - Average points scored/allowed over the last 5 games.
  - Win rate and average margin over the last 5 games.
  - Win/loss streak (`STREAK`).
  - Rest days (`REST_DAYS`).
- Projected these features to the GAME level:
  - `HOME_...` and `AWAY_...` variables for the home and away teams.
  - Rest advantage (`REST_ADVANTAGE`).
- Filled missing values for games without enough history (each team's first games) with the column mean; no rows were dropped.
- Saved the enriched dataset to:
  `data/processed/games_2025_26_features.csv`.

In the next notebook (`03_modelos_regresion.ipynb`) we will train classification
and regression models (Random Forest / Gradient Boosting / XGBoost) to
predict:

- Home win (`HOME_WIN`).
- Home point margin (`MARGIN_HOME`).
- Total points in the game (`TOTAL_POINTS`).
- Points per team (`HOME_PTS`, `AWAY_PTS`).
