# üèÄ NBA Playoffs Simulator ‚Äî Data Collection
**Proyecto:** Simul√© los Playoffs NBA miles de veces‚Ä¶ y encontr√© un contender inesperado

**Notebook 01:** Adquisici√≥n y estructuraci√≥n de datos

Este notebook recolecta:
1. **Stats avanzadas actuales (2025-26)** ‚Äî El perfil de cada equipo hoy
2. **Game logs actuales** ‚Äî Para calcular consistencia y momentum
3. **Resultados hist√≥ricos de playoffs (2015-2025)** ‚Äî Para entrenar el modelo
4. **Stats hist√≥ricas por temporada** ‚Äî Features de cada equipo en cada a√±o

In [None]:
# ============================================================
# INSTALACI√ìN DE DEPENDENCIAS
# ============================================================
!pip install nba_api --quiet
print('‚úÖ nba_api instalado correctamente')

In [None]:
# ============================================================
# IMPORTS Y CONFIGURACI√ìN
# ============================================================
import pandas as pd
import numpy as np
import time
import warnings
import os

from nba_api.stats.endpoints import (
    leaguedashteamstats,
    leaguestandingsv3,
    leaguegamelog
)

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Crear carpeta de datos
os.makedirs('data', exist_ok=True)

# Pausa entre llamadas API para evitar bloqueos de stats.nba.com
API_DELAY = 0.8  # segundos

print('‚úÖ Librer√≠as cargadas')

---
## üìä Secci√≥n 1: Stats Avanzadas ‚Äî Temporada Actual (2025-26)

Aqu√≠ obtenemos el **perfil anal√≠tico** de cada equipo en la temporada actual.

Las stats avanzadas (Net Rating, Pace, eFG%, etc.) son las que realmente
miden la calidad de un equipo, no solo su r√©cord.

In [None]:
# ============================================================
# 1.1 ‚Äî Stats avanzadas de equipo (temporada actual)
# ============================================================
CURRENT_SEASON = '2025-26'

print(f'üì° Obteniendo stats avanzadas ‚Äî Temporada {CURRENT_SEASON}...')

# Advanced stats: OFF_RATING, DEF_RATING, NET_RATING, PACE, TS%, eFG%, etc.
advanced = leaguedashteamstats.LeagueDashTeamStats(
    season=CURRENT_SEASON,
    measure_type_detailed_defense='Advanced',
    season_type_all_star='Regular Season'
)
df_advanced = advanced.get_data_frames()[0]
time.sleep(API_DELAY)

# Four Factors: eFG%, TOV%, OREB%, FT Rate (ofensivo y defensivo)
four_factors = leaguedashteamstats.LeagueDashTeamStats(
    season=CURRENT_SEASON,
    measure_type_detailed_defense='Four Factors',
    season_type_all_star='Regular Season'
)
df_four = four_factors.get_data_frames()[0]
time.sleep(API_DELAY)

# Base stats: W, L, PTS, REB, AST, etc.
base = leaguedashteamstats.LeagueDashTeamStats(
    season=CURRENT_SEASON,
    measure_type_detailed_defense='Base',
    season_type_all_star='Regular Season'
)
df_base = base.get_data_frames()[0]
time.sleep(API_DELAY)

print(f'  ‚Üí Advanced stats: {df_advanced.shape}')
print(f'  ‚Üí Four Factors:   {df_four.shape}')
print(f'  ‚Üí Base stats:     {df_base.shape}')
print('‚úÖ Stats actuales obtenidas')

In [None]:
# ============================================================
# 1.2 ‚Äî Standings actuales (Conference, Seed, R√©cord)
# ============================================================
print(f'üì° Obteniendo standings ‚Äî Temporada {CURRENT_SEASON}...')

standings = leaguestandingsv3.LeagueStandingsV3(
    season=CURRENT_SEASON,
    season_type='Regular Season'
)
df_standings = standings.get_data_frames()[0]
time.sleep(API_DELAY)

print(f'  ‚Üí Standings: {df_standings.shape}')
print('‚úÖ Standings obtenidos')

In [None]:
# ============================================================
# 1.3 ‚Äî Unificar datos actuales en una sola tabla
# ============================================================

# --- Columnas relevantes de cada fuente ---

# Advanced
adv_cols = [
    'TEAM_ID', 'TEAM_NAME',
    'W', 'L', 'W_PCT', 'GP',
    'OFF_RATING', 'DEF_RATING', 'NET_RATING',
    'PACE', 'TS_PCT', 'EFG_PCT',
    'AST_PCT', 'AST_TO', 'AST_RATIO',
    'OREB_PCT', 'DREB_PCT', 'REB_PCT',
    'TM_TOV_PCT', 'PIE'
]
# Filtrar columnas que existan (por si la API cambia nombres)
adv_cols = [c for c in adv_cols if c in df_advanced.columns]
df_current = df_advanced[adv_cols].copy()

# Four Factors (solo columnas de oponente para contexto defensivo)
four_opp_cols = [c for c in df_four.columns if 'OPP' in c]
four_merge = df_four[['TEAM_ID'] + four_opp_cols].copy()
df_current = df_current.merge(four_merge, on='TEAM_ID', how='left')

# Base stats: PTS, REB, AST, STL, BLK, TOV, PLUS_MINUS
base_extra = ['TEAM_ID', 'PTS', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PLUS_MINUS']
base_extra = [c for c in base_extra if c in df_base.columns]
df_current = df_current.merge(df_base[base_extra], on='TEAM_ID', how='left')

# Standings: Conference, PlayoffRank
stand_cols = ['TeamID', 'Conference', 'PlayoffRank', 'ConferenceRecord']
stand_cols = [c for c in stand_cols if c in df_standings.columns]
df_stand_merge = df_standings[stand_cols].rename(columns={'TeamID': 'TEAM_ID'})
df_current = df_current.merge(df_stand_merge, on='TEAM_ID', how='left')

# Agregar columna de temporada
df_current['SEASON'] = CURRENT_SEASON

print(f'üìã Tabla unificada: {df_current.shape[0]} equipos √ó {df_current.shape[1]} columnas')
print(f'\nColumnas disponibles:\n{list(df_current.columns)}')
df_current.sort_values('NET_RATING', ascending=False).head(10)

---
## üéÆ Secci√≥n 2: Game Logs ‚Äî Temporada Actual

Necesitamos los resultados **juego a juego** para calcular:
- **Consistencia** ‚Üí desviaci√≥n est√°ndar del diferencial de puntos
- **Momentum** ‚Üí rendimiento en los √∫ltimos 15 juegos vs toda la temporada
- **Clutch factor** ‚Üí r√©cord en juegos decididos por ‚â§5 puntos

In [None]:
# ============================================================
# 2.1 ‚Äî Game logs de toda la liga (temporada actual)
# ============================================================
print(f'üì° Obteniendo game logs ‚Äî Temporada {CURRENT_SEASON}...')

gamelogs = leaguegamelog.LeagueGameLog(
    season=CURRENT_SEASON,
    season_type_all_star='Regular Season'
)
df_gamelogs = gamelogs.get_data_frames()[0]
time.sleep(API_DELAY)

print(f'  ‚Üí {df_gamelogs.shape[0]} registros de juegos individuales')
print(f'  ‚Üí {df_gamelogs["TEAM_ID"].nunique()} equipos')
print('‚úÖ Game logs obtenidos')

df_gamelogs.head(3)

In [None]:
# ============================================================
# 2.2 ‚Äî M√©tricas derivadas de game logs
# ============================================================

def compute_gamelog_features(df_logs):
    """
    A partir de los game logs, calcula por equipo:
    - consistency_score: 1 / std(PLUS_MINUS) ‚Üí mayor = m√°s consistente
    - clutch_record: win% en juegos con |PLUS_MINUS| <= 5
    - last15_win_pct: win% en los √∫ltimos 15 juegos
    - last15_plus_minus: promedio PLUS_MINUS √∫ltimos 15
    """
    results = []

    for team_id, team_df in df_logs.groupby('TEAM_ID'):
        team_df = team_df.sort_values('GAME_DATE', ascending=True).copy()
        team_name = team_df['TEAM_ABBREVIATION'].values[0]

        # --- Consistencia ---
        std_pm = team_df['PLUS_MINUS'].std()
        consistency = round(1 / std_pm, 4) if std_pm > 0 else 0

        # --- Clutch (juegos decididos por 5 puntos o menos) ---
        clutch_games = team_df[team_df['PLUS_MINUS'].abs() <= 5]
        if len(clutch_games) > 0:
            clutch_wins = (clutch_games['WL'] == 'W').sum()
            clutch_pct = round(clutch_wins / len(clutch_games), 4)
        else:
            clutch_pct = 0.5

        # --- √öltimos 15 juegos (momentum) ---
        last15 = team_df.tail(15)
        l15_win_pct = round((last15['WL'] == 'W').sum() / len(last15), 4)
        l15_plus_minus = round(last15['PLUS_MINUS'].mean(), 2)

        # --- Season averages para calcular momentum_delta ---
        season_plus_minus = round(team_df['PLUS_MINUS'].mean(), 2)

        results.append({
            'TEAM_ID': team_id,
            'TEAM_ABBR': team_name,
            'consistency_score': consistency,
            'clutch_win_pct': clutch_pct,
            'clutch_games_played': len(clutch_games),
            'last15_win_pct': l15_win_pct,
            'last15_avg_plus_minus': l15_plus_minus,
            'season_avg_plus_minus': season_plus_minus,
            'momentum_delta': round(l15_plus_minus - season_plus_minus, 2)
        })

    return pd.DataFrame(results)


df_gamelog_features = compute_gamelog_features(df_gamelogs)

print(f'üìã Features de game logs: {df_gamelog_features.shape}')
print(f'\nEquipos con MAYOR momentum (mejorando):')
df_gamelog_features.sort_values('momentum_delta', ascending=False).head(5)[
    ['TEAM_ABBR', 'consistency_score', 'clutch_win_pct', 'last15_win_pct', 'momentum_delta']
]

In [None]:
# ============================================================
# 2.3 ‚Äî Integrar features de game logs a la tabla principal
# ============================================================
df_current = df_current.merge(
    df_gamelog_features.drop(columns=['TEAM_ABBR']),
    on='TEAM_ID',
    how='left'
)

print(f'üìã Tabla actual completa: {df_current.shape[0]} equipos √ó {df_current.shape[1]} columnas')
print('‚úÖ Features de game logs integradas')

---
## üèÜ Secci√≥n 3: Resultados Hist√≥ricos de Playoffs (2015-2025)

Este es el **dataset de entrenamiento** para nuestro modelo XGBoost.

Necesitamos reconstruir cada serie de playoffs:
- Qui√©n jug√≥ contra qui√©n
- Qui√©n gan√≥ y en cu√°ntos juegos
- En qu√© ronda fue

Esto nos da ~150 series hist√≥ricas para entrenar el modelo.

In [None]:
# ============================================================
# 3.1 ‚Äî Funci√≥n para reconstruir series desde game logs
# ============================================================

def get_playoff_series(season):
    """
    Obtiene los game logs de playoffs para una temporada
    y reconstruye cada serie (qui√©n gan√≥, en cu√°ntos juegos, qu√© ronda).
    """
    try:
        games = leaguegamelog.LeagueGameLog(
            season=season,
            season_type_all_star='Playoffs'
        )
        df = games.get_data_frames()[0]
    except Exception as e:
        print(f'  ‚ö†Ô∏è Error en {season}: {e}')
        return pd.DataFrame()

    if df.empty:
        return pd.DataFrame()

    # --- Identificar los 2 equipos en cada juego ---
    game_info = []
    for game_id in df['GAME_ID'].unique():
        game_rows = df[df['GAME_ID'] == game_id].sort_values('TEAM_ID')
        if len(game_rows) != 2:
            continue

        teams = game_rows[['TEAM_ID', 'TEAM_ABBREVIATION', 'WL']].values
        winner_idx = 0 if teams[0][2] == 'W' else 1

        game_info.append({
            'game_id': game_id,
            'game_date': game_rows['GAME_DATE'].values[0],
            'team_a_id': int(teams[0][0]),
            'team_a_abbr': teams[0][1],
            'team_b_id': int(teams[1][0]),
            'team_b_abbr': teams[1][1],
            'winner_id': int(teams[winner_idx][0])
        })

    if not game_info:
        return pd.DataFrame()

    games_df = pd.DataFrame(game_info)

    # --- Agrupar juegos por par de equipos = una serie ---
    games_df['series_key'] = games_df.apply(
        lambda x: tuple(sorted([x['team_a_id'], x['team_b_id']])), axis=1
    )

    series_list = []
    for key, group in games_df.groupby('series_key'):
        team_ids = list(key)
        group = group.sort_values('game_date')

        # Contar victorias por equipo
        wins = {tid: (group['winner_id'] == tid).sum() for tid in team_ids}

        # Obtener abreviaciones
        abbrs = {}
        for _, row in group.iterrows():
            abbrs[row['team_a_id']] = row['team_a_abbr']
            abbrs[row['team_b_id']] = row['team_b_abbr']

        winner_id = max(wins, key=wins.get)
        loser_id = [t for t in team_ids if t != winner_id][0]

        series_list.append({
            'season': season,
            'winner_id': winner_id,
            'winner_abbr': abbrs.get(winner_id, ''),
            'winner_wins': wins[winner_id],
            'loser_id': loser_id,
            'loser_abbr': abbrs.get(loser_id, ''),
            'loser_wins': wins[loser_id],
            'total_games': len(group),
            'first_game_date': group['game_date'].values[0]
        })

    series_df = pd.DataFrame(series_list).sort_values('first_game_date')

    # --- Asignar n√∫mero de ronda por orden cronol√≥gico ---
    # Playoff est√°ndar: 8 R1 + 4 R2 + 2 CF + 1 Finals = 15 series
    round_sizes = [8, 4, 2, 1]
    round_labels = [1, 2, 3, 4]
    rounds = []
    idx = 0
    for size, label in zip(round_sizes, round_labels):
        rounds.extend([label] * min(size, max(0, len(series_df) - idx)))
        idx += size
    series_df['round'] = rounds[:len(series_df)]

    return series_df


print('‚úÖ Funci√≥n de reconstrucci√≥n de series lista')

In [None]:
# ============================================================
# 3.2 ‚Äî Recolectar series de playoffs (2015-16 a 2024-25)
# ============================================================
HISTORICAL_SEASONS = [
    '2015-16', '2016-17', '2017-18', '2018-19', '2019-20',
    '2020-21', '2021-22', '2022-23', '2023-24', '2024-25'
]

all_series = []

print('üèÜ Recolectando series hist√≥ricas de playoffs...\n')
for season in HISTORICAL_SEASONS:
    print(f'  üì° {season}...', end=' ')
    series = get_playoff_series(season)
    if not series.empty:
        all_series.append(series)
        print(f'{len(series)} series encontradas')
    else:
        print('‚ö†Ô∏è Sin datos')
    time.sleep(API_DELAY)

df_historical_playoffs = pd.concat(all_series, ignore_index=True)

print(f'\nüìã Total: {len(df_historical_playoffs)} series hist√≥ricas')
print(f'   Rondas: {df_historical_playoffs["round"].value_counts().sort_index().to_dict()}')
print('‚úÖ Datos hist√≥ricos de playoffs completos')

In [None]:
# ============================================================
# 3.3 ‚Äî Vista previa: Playoffs hist√≥ricos
# ============================================================
round_names = {1: 'First Round', 2: 'Conf. Semis', 3: 'Conf. Finals', 4: 'NBA Finals'}

print('üèÜ √öltimas NBA Finals en el dataset:\n')
finals = df_historical_playoffs[df_historical_playoffs['round'] == 4].copy()
finals['round_name'] = 'NBA Finals'
finals['result'] = finals.apply(
    lambda x: f"{x['winner_abbr']} {int(x['winner_wins'])}-{int(x['loser_wins'])} {x['loser_abbr']}",
    axis=1
)
print(finals[['season', 'result']].to_string(index=False))

---
## üìà Secci√≥n 4: Stats Hist√≥ricas por Temporada Regular

Para entrenar el modelo necesitamos las **stats de temporada regular** de cada equipo
que jug√≥ playoffs. As√≠ podemos calcular los diferenciales de features entre equipos
y ense√±arle al XGBoost qu√© combinaciones de ventajas predicen victorias en series.

In [None]:
# ============================================================
# 4.1 ‚Äî Funci√≥n para obtener stats hist√≥ricas
# ============================================================

def get_season_team_stats(season):
    """
    Obtiene stats avanzadas + base de todos los equipos
    para una temporada regular dada.
    """
    try:
        # Stats avanzadas
        adv = leaguedashteamstats.LeagueDashTeamStats(
            season=season,
            measure_type_detailed_defense='Advanced',
            season_type_all_star='Regular Season'
        )
        df_adv = adv.get_data_frames()[0]
        time.sleep(API_DELAY)

        # Stats base
        base = leaguedashteamstats.LeagueDashTeamStats(
            season=season,
            measure_type_detailed_defense='Base',
            season_type_all_star='Regular Season'
        )
        df_base = base.get_data_frames()[0]
        time.sleep(API_DELAY)

        # Seleccionar columnas clave de advanced
        adv_cols = [
            'TEAM_ID', 'TEAM_NAME',
            'GP', 'W', 'L', 'W_PCT',
            'OFF_RATING', 'DEF_RATING', 'NET_RATING',
            'PACE', 'TS_PCT', 'EFG_PCT',
            'AST_PCT', 'AST_TO', 'AST_RATIO',
            'OREB_PCT', 'DREB_PCT', 'REB_PCT',
            'TM_TOV_PCT', 'PIE'
        ]
        adv_cols = [c for c in adv_cols if c in df_adv.columns]
        df = df_adv[adv_cols].copy()

        # Agregar PTS y PLUS_MINUS de base
        base_cols = ['TEAM_ID', 'PTS', 'PLUS_MINUS']
        base_cols = [c for c in base_cols if c in df_base.columns]
        df = df.merge(df_base[base_cols], on='TEAM_ID', how='left')

        df['SEASON'] = season
        return df

    except Exception as e:
        print(f'  ‚ö†Ô∏è Error en {season}: {e}')
        return pd.DataFrame()


print('‚úÖ Funci√≥n de stats hist√≥ricas lista')

In [None]:
# ============================================================
# 4.2 ‚Äî Recolectar stats de cada temporada
# ============================================================
all_stats = []

print('üìà Recolectando stats hist√≥ricas por temporada...\n')
for season in HISTORICAL_SEASONS:
    print(f'  üì° {season}...', end=' ')
    stats = get_season_team_stats(season)
    if not stats.empty:
        all_stats.append(stats)
        print(f'{len(stats)} equipos')
    else:
        print('‚ö†Ô∏è Sin datos')

df_historical_stats = pd.concat(all_stats, ignore_index=True)

print(f'\nüìã Total: {len(df_historical_stats)} registros equipo-temporada')
print(f'   Temporadas: {df_historical_stats["SEASON"].nunique()}')
print(f'   Equipos por temporada: ~{len(df_historical_stats) // df_historical_stats["SEASON"].nunique()}')
print('‚úÖ Stats hist√≥ricas completas')

In [None]:
# ============================================================
# 4.3 ‚Äî Vista previa: mejores equipos por Net Rating cada a√±o
# ============================================================
print('üìä Mejor Net Rating por temporada:\n')
for season in HISTORICAL_SEASONS:
    season_data = df_historical_stats[df_historical_stats['SEASON'] == season]
    if season_data.empty:
        continue
    best = season_data.sort_values('NET_RATING', ascending=False).iloc[0]
    print(f"  {season}: {best['TEAM_NAME']:<28} (Net Rating: {best['NET_RATING']:+.1f})")

---
## üå± Secci√≥n 5: Standings Hist√≥ricos (Seeding)

Para saber qu√© equipo ten√≠a ventaja de cancha en cada serie,
necesitamos el **seed** de cada equipo en su conferencia.

In [None]:
# ============================================================
# 5.1 ‚Äî Recolectar standings hist√≥ricos
# ============================================================

all_standings = []

print('üå± Recolectando standings hist√≥ricos...\n')
for season in HISTORICAL_SEASONS:
    print(f'  üì° {season}...', end=' ')
    try:
        stand = leaguestandingsv3.LeagueStandingsV3(
            season=season,
            season_type='Regular Season'
        )
        df_s = stand.get_data_frames()[0]
        df_s['SEASON'] = season
        all_standings.append(df_s)
        print(f'{len(df_s)} equipos')
    except Exception as e:
        print(f'‚ö†Ô∏è Error: {e}')
    time.sleep(API_DELAY)

df_historical_standings = pd.concat(all_standings, ignore_index=True)

print(f'\nüìã Total: {len(df_historical_standings)} registros')
print('‚úÖ Standings hist√≥ricos completos')

---
## ‚òÅÔ∏è Secci√≥n 6: Conectar Google Drive

Montamos Google Drive para guardar los datasets de forma **persistente**.
As√≠ el notebook 02 (y los siguientes) pueden leer estos archivos directamente.

In [None]:
# ============================================================
# 6.1 ‚Äî Montar Google Drive y crear carpeta del proyecto
# ============================================================
from google.colab import drive
drive.mount('/content/drive')

# Carpeta del proyecto en Drive
PROJECT_DIR = '/content/drive/MyDrive/nba-playoffs-simulator'
DATA_DIR = f'{PROJECT_DIR}/data'

os.makedirs(DATA_DIR, exist_ok=True)

print(f'‚úÖ Google Drive montado')
print(f'üìÅ Carpeta del proyecto: {PROJECT_DIR}')

---
## üíæ Secci√≥n 7: Guardar Datasets

Guardamos los archivos tanto **localmente** (en el runtime de Colab)
como en **Google Drive** (persistente entre sesiones).

Archivos generados:
1. `team_stats_2026.csv` ‚Üí Stats + features del equipo actual (temporada 2025-26)
2. `team_gamelogs_2026.csv` ‚Üí Juego a juego (por si necesitamos recalcular)
3. `historical_playoffs.csv` ‚Üí Series de playoffs para entrenar el modelo
4. `historical_team_stats.csv` ‚Üí Stats por temporada para features hist√≥ricos
5. `historical_standings.csv` ‚Üí Seeding por temporada

In [None]:
# ============================================================
# 7.1 ‚Äî Guardar todos los datasets (local + Google Drive)
# ============================================================
import shutil

datasets = {
    'team_stats_2026.csv': df_current,
    'team_gamelogs_2026.csv': df_gamelogs,
    'historical_playoffs.csv': df_historical_playoffs,
    'historical_team_stats.csv': df_historical_stats,
    'historical_standings.csv': df_historical_standings
}

for filename, df in datasets.items():
    # Guardar local
    local_path = f'data/{filename}'
    df.to_csv(local_path, index=False)

    # Copiar a Google Drive
    drive_path = f'{DATA_DIR}/{filename}'
    shutil.copy(local_path, drive_path)

    print(f'‚úÖ {filename:<32} ‚Üí {df.shape}  [local + Drive]')

print(f'\nüìÅ Archivos locales:  data/')
print(f'‚òÅÔ∏è  Archivos en Drive: {DATA_DIR}/')

---
## ‚úÖ Resumen: Data Collection completado

| Dataset | Filas | Descripci√≥n |
|---|---|---|
| `team_stats_2026.csv` | 30 equipos | Stats actuales + consistency/clutch/momentum |
| `team_gamelogs_2026.csv` | ~2,400+ juegos | Logs juego a juego (referencia) |
| `historical_playoffs.csv` | ~150 series | Resultados de series 2015-2025 |
| `historical_team_stats.csv` | ~300 registros | Stats avanzadas por temporada |
| `historical_standings.csv` | ~300 registros | Seeding y conferencia por temporada |

Todos los archivos est√°n guardados en Google Drive (`nba-playoffs-simulator/data/`)
y pueden ser accedidos desde cualquier notebook.

### ‚û°Ô∏è Siguiente notebook: `02_feature_engineering.ipynb`
Donde transformamos estos datos crudos en features inteligentes para el modelo XGBoost.