# 🧠 Feature Engineering for NBA Game Prediction

This section describes the **Feature Engineering (FE)** process used to enrich our raw game data and generate meaningful predictors for modeling NBA game outcomes.

---

## 🎯 Objective

We aim to predict the **probability of the home team winning** a given NBA game. The output will feed into a web interface, showing the predicted win probabilities for each team.

To achieve this:

* We process game data in **team-game format** (1 row per team per game).
* After feature engineering, we convert to **wide format** (1 row per game) for modeling.

---

## 🧱 Base Columns

We start from the team-game format with the following base columns:

* `game_id`, `season`, `date.start`, `arena.city`, `winner_id`, `team_id`, `opponent_id`
* `is_home`, `win`, `date`, `hour`, `time_of_day`
* Box score stats: `points`, `fgm`, `fga`, `fgp`, `tpm`, `tpa`, `tpp`, `ftm`, `fta`, `ftp`,
  `offReb`, `defReb`, `totReb`, `assists`, `steals`, `blocks`, `turnovers`, `pFouls`,
  `plusMinus`, `fastBreakPoints`, `pointsInPaint`, `biggestLead`, `secondChancePoints`,
  `pointsOffTurnovers`, `longestRun`

---

## 🛠️ Feature Engineering Plan

### 1. 🔁 **Rolling averages** (last 5 games)

Smoothed performance indicators over a team's previous 5 games. The current game is excluded, so each feature only uses information available before tip-off.

**Columns used:**

* `points`, `fgm`, `fga`, `fgp`, `tpm`, `tpa`, `tpp`, `ftm`, `fta`, `ftp`,
  `offReb`, `defReb`, `totReb`, `assists`, `steals`, `blocks`, `turnovers`, `plusMinus`

**Features created:**

* `rolling_avg_<col>_5`
* `rolling_std_<col>_5`

### 2. 📈 **Expanding averages**

Team evolution over the season (cumulative average since season start).

**Columns used:**

* Same as rolling averages.

**Features created:**

* `exp_avg_<col>`
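
As a rough illustration, the expanding average could be sketched as follows. This assumes the average resets at each season and that the current game is excluded with `shift(1)` to avoid leakage. The toy frame and its values are made up for the example.

```python
import pandas as pd

# Toy team-game frame (illustrative values, not real data)
toy = pd.DataFrame({
    "team_id": [1, 1, 1, 1],
    "season": [2021, 2021, 2021, 2022],
    "date.start": pd.to_datetime(
        ["2021-10-20", "2021-10-22", "2021-10-25", "2022-10-19"]
    ),
    "points": [100, 110, 120, 90],
})
toy = toy.sort_values(["team_id", "date.start"]).reset_index(drop=True)

# Cumulative mean over the games already played this season.
# shift(1) drops the current game; the first game of a season stays NaN.
toy["exp_avg_points"] = toy.groupby(["team_id", "season"])["points"].transform(
    lambda s: s.shift(1).expanding().mean()
)
```

Grouping by `season` as well as `team_id` is a design choice: without it, the average would carry over from one season to the next.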

### 3. 💥 **Matchup differentials** (calculated in wide format only)

After converting to wide format, we calculate the difference between home and away teams.

**Features created (examples):**

* `net_points_5 = home_rolling_avg_points_5 - away_rolling_avg_points_5`
* `net_exp_avg_points = home_exp_avg_points - away_exp_avg_points`

Computing differentials at this stage removes the need to precompute opponent stats in the team-game format, which would store every value twice.
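
A minimal sketch of the two example differentials, assuming a wide frame that already has `home_`- and `away_`-prefixed copies of the team-level features. The values are illustrative.

```python
import pandas as pd

# Toy wide frame: one row per game (illustrative values)
wide = pd.DataFrame({
    "game_id": [1, 2],
    "home_rolling_avg_points_5": [110.0, 102.0],
    "away_rolling_avg_points_5": [105.0, 108.0],
    "home_exp_avg_points": [108.0, 101.0],
    "away_exp_avg_points": [104.0, 107.0],
})

# Positive values mean the home team has the edge on that indicator
wide["net_points_5"] = (
    wide["home_rolling_avg_points_5"] - wide["away_rolling_avg_points_5"]
)
wide["net_exp_avg_points"] = (
    wide["home_exp_avg_points"] - wide["away_exp_avg_points"]
)
```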

### 4. 🏠 **Home/Away effects**

Rolling averages split by home and away performance.

**Columns used:**

* Same as rolling averages.

**Features created:**

* `rolling_avg_<col>_home_5`, `rolling_avg_<col>_away_5`
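
One possible way to build these split features is sketched below. It assumes every row should carry both the team's recent home form and its recent away form, based only on earlier games. The toy data uses a window of 2 instead of 5 to keep the example short.

```python
import pandas as pd

# Toy team-game frame for a single team (illustrative values)
toy = pd.DataFrame({
    "team_id": [1] * 6,
    "date.start": pd.date_range("2022-01-01", periods=6, freq="2D"),
    "is_home": [1, 0, 1, 0, 1, 1],
    "points": [100, 90, 110, 95, 120, 105],
})
toy = toy.sort_values(["team_id", "date.start"]).reset_index(drop=True)

window, min_periods = 2, 1  # small window for the toy data; the plan uses 5

for flag, label in [(1, "home"), (0, "away")]:
    col = f"rolling_avg_points_{label}_{window}"
    # Rolling mean computed only over games played at this venue type
    subset = toy[toy["is_home"] == flag]
    venue_avg = subset.groupby("team_id")["points"].transform(
        lambda s: s.rolling(window, min_periods=min_periods).mean()
    )
    # Index alignment leaves NaN on rows of the other venue type
    toy[col] = venue_avg
    # Carry the latest value forward, then shift so a row never sees its own game
    toy[col] = toy.groupby("team_id")[col].transform(lambda s: s.ffill().shift(1))
```

The forward fill followed by `shift(1)` is what makes the home-form value available on away rows (and vice versa) without using the current game.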

### 5. 📊 **Season-to-date (STD) metrics**

Cumulative indicators of a team's season.

**Features created:**

* `games_played`
* `win_rate = wins / games_played`
* `avg_margin_victory = avg(points - opponent_points)`
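
The season-to-date metrics could be sketched like this, counting only games played before the current one. Note that `opponent_points` is not one of the base columns; the example assumes it has already been merged in from the opponent's row. All values are illustrative.

```python
import pandas as pd

# Toy team-game frame (illustrative values); opponent_points is a
# hypothetical column assumed to come from the opponent's row
toy = pd.DataFrame({
    "team_id": [1, 1, 1, 1],
    "season": [2022, 2022, 2022, 2022],
    "date.start": pd.to_datetime(
        ["2022-10-19", "2022-10-21", "2022-10-23", "2022-10-26"]
    ),
    "win": [1, 0, 1, 1],
    "points": [108, 100, 112, 103],
    "opponent_points": [100, 104, 100, 100],
})
toy = toy.sort_values(["team_id", "date.start"]).reset_index(drop=True)
keys = ["team_id", "season"]

# Games already played before the current one (0 for the season opener)
toy["games_played"] = toy.groupby(keys).cumcount()

# Wins before the current game; win_rate is NaN until one game has been played
wins_before = toy.groupby(keys)["win"].cumsum() - toy["win"]
toy["win_rate"] = wins_before / toy["games_played"].where(toy["games_played"] > 0)

# Average scoring margin over previous games only
toy["margin"] = toy["points"] - toy["opponent_points"]
toy["avg_margin_victory"] = toy.groupby(keys)["margin"].transform(
    lambda s: s.shift(1).expanding().mean()
)
```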

### 6. 📅 **Time-based features**

To capture time-related trends or fatigue.

**Features created:**

* `days_since_last_game`
* `days_into_season`
* `day_of_week`
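
A short sketch of the three time features, assuming `date.start` has already been parsed to a datetime column and that the season start can be approximated by the earliest game of that season in the data.

```python
import pandas as pd

# Toy team-game frame (illustrative dates)
toy = pd.DataFrame({
    "team_id": [1, 1, 1, 2],
    "season": [2022, 2022, 2022, 2022],
    "date.start": pd.to_datetime(
        ["2022-10-19", "2022-10-21", "2022-10-22", "2022-10-20"]
    ),
})
toy = toy.sort_values(["team_id", "date.start"]).reset_index(drop=True)

# Rest days since the team's previous game in the same season
# (NaN for the first game, so the off-season gap is not counted)
toy["days_since_last_game"] = (
    toy.groupby(["team_id", "season"])["date.start"].diff().dt.days
)

# Days elapsed since the earliest game of the season in the data
season_start = toy.groupby("season")["date.start"].transform("min")
toy["days_into_season"] = (toy["date.start"] - season_start).dt.days

# 0 = Monday ... 6 = Sunday
toy["day_of_week"] = toy["date.start"].dt.dayofweek
```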

### 7. 🔣 **Team encoding**

Represents team identity for the model. This step is optional and can be deferred to the modeling stage.

**Options:**

* One-hot encoding of `team_id` and `opponent_id`
* External rating like ELO or team strength score
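
For the rating option, a basic Elo update could look like the sketch below. The column names `home_team_id` and `away_team_id`, the K-factor of 20, and the starting rating of 1500 are all assumptions chosen for illustration, and no home-court adjustment is applied.

```python
import pandas as pd

def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def add_pregame_elo(games: pd.DataFrame, k: float = 20.0,
                    base: float = 1500.0) -> pd.DataFrame:
    """Attach each side's rating before the game, then update with the result."""
    games = games.sort_values("date.start").reset_index(drop=True)
    ratings = {}
    home_elo, away_elo = [], []
    for home, away, home_win in zip(
        games["home_team_id"], games["away_team_id"], games["home_win"]
    ):
        r_home = ratings.get(home, base)
        r_away = ratings.get(away, base)
        # Store pre-game ratings so the feature never sees the game's outcome
        home_elo.append(r_home)
        away_elo.append(r_away)
        delta = k * (home_win - expected_score(r_home, r_away))
        ratings[home] = r_home + delta
        ratings[away] = r_away - delta
    return games.assign(home_elo=home_elo, away_elo=away_elo)

# Toy wide frame (illustrative values)
games = pd.DataFrame({
    "date.start": pd.to_datetime(["2022-10-19", "2022-10-21"]),
    "home_team_id": [1, 2],
    "away_team_id": [2, 1],
    "home_win": [1, 0],
})
rated = add_pregame_elo(games)
```

Unlike one-hot encoding, a rating adds a single numeric column per side and adapts as results come in.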

---

## 🧾 Output of this Stage

A cleaned and enriched **team-game dataframe** with one row per team per game, including all engineered features.

This will then be transformed into **wide format**, merging home and away teams into a single row per game with a binary target: `home_win`.
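
The team-game to wide transformation might be sketched as a self-merge on `game_id`. This assumes each game has exactly one home row and one away row. The feature list is shortened for the example.

```python
import pandas as pd

# Toy team-game frame: two rows per game (illustrative values)
team_games_toy = pd.DataFrame({
    "game_id": [10, 10, 11, 11],
    "team_id": [1, 2, 2, 3],
    "is_home": [1, 0, 0, 1],
    "win": [1, 0, 1, 0],
    "rolling_avg_points_5": [110.0, 105.0, 106.0, 99.0],
})

feature_cols = ["team_id", "rolling_avg_points_5"]

home = team_games_toy[team_games_toy["is_home"] == 1]
away = team_games_toy[team_games_toy["is_home"] == 0]

# Prefix the features with their side; the home row's win flag is the target
home_wide = home[["game_id"] + feature_cols + ["win"]].rename(
    columns={**{c: f"home_{c}" for c in feature_cols}, "win": "home_win"}
)
away_wide = away[["game_id"] + feature_cols].rename(
    columns={c: f"away_{c}" for c in feature_cols}
)

# validate raises if a game has duplicated home or away rows
wide = home_wide.merge(away_wide, on="game_id", how="inner",
                       validate="one_to_one")
```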

---

## ✅ Why this setup?

* Team-game format allows easy use of **rolling, expanding, and home/away** features.
* Opponent features are computed in **wide format** to avoid redundancy.
* Wide format aligns with how predictions will be used in production (web app).
* The final model returns `P(home wins)`, from which we infer `P(away wins) = 1 - P(home wins)`.

---

Next step: implement each block of features in code using `groupby`, `rolling`, `expanding`, `shift`, and `merge` operations.


In [10]:
import pandas as pd

In [11]:
pd.set_option("display.max_columns", None)
team_games = pd.read_parquet("team_games_df.parquet")

In [12]:
# For is_home == 1, team_id is the home team
# For is_home == 0, team_id is the away team

# Build a team_id -> home city lookup from home rows only (is_home == 1)
home_rows = team_games.query("is_home == 1")[['team_id', 'arena.city']].dropna()
# unique() is unreliable here because the API has some recording errors,
# so take the most frequent city per team instead
home_city = home_rows.groupby('team_id')['arena.city'].agg(lambda x: x.mode().iloc[0])  # a Series works directly with map, no reset_index needed

# The game is played in the home team's city, so away rows look up opponent_id
home_team_id = team_games['team_id'].where(team_games['is_home'] == 1, team_games['opponent_id'])
team_games['arena.city'] = team_games['arena.city'].fillna(home_team_id.map(home_city))

# Rolling Averages

In [None]:
rolling_cols = ['points', 'fgm', 'fga', 'fgp', 'tpm', 'tpa', 'tpp', 'ftm', 'fta', 'ftp',
  'offReb', 'defReb', 'totReb', 'assists', 'steals', 'blocks', 'turnovers', 'plusMinus']  # points also gets a rolling std below
windows_minperiods = {5: 3}  # window -> min_periods; add e.g. 10: 6 or 3: 2 to experiment with more windows

team_games = team_games.sort_values(['team_id', 'date.start']).reset_index(drop=True)

for col in rolling_cols:
    for window, minperiods in windows_minperiods.items():
        team_games[f"roll_avg_{col}_{window}"] = team_games.groupby(['team_id'])[col].transform(
            lambda x: x.shift(1).rolling(window=window, min_periods=minperiods).mean())
            

# Some NaNs are expected: the first rows of the data (e.g. 2015) have no history,
# and the API did not return stats for some games.
# These are cleaned up later, before modeling.

team_games['roll_std_points_5'] = team_games.groupby('team_id')['points'].transform(
    lambda x: x.shift(1).rolling(window=5, min_periods=3).std())
# Alternative windows to experiment with:
#team_games['roll_std_points_3'] = team_games.groupby('team_id')['points'].transform(lambda x: x.shift(1).rolling(window=3, min_periods=2).std())
#team_games['roll_std_points_10'] = team_games.groupby('team_id')['points'].transform(lambda x: x.shift(1).rolling(window=10, min_periods=6).std())



