# **LineupLab**: NBA Matchup Prediction using Transformer Networks

## Project Overview
This project is part of the final requirement for the **Introduction to Deep Learning** course. The objective is to develop a machine learning model that predicts NBA matchup outcomes based on player lineups and team configurations. 

By leveraging the BallDontLie API, we will retrieve, clean, and process NBA data to create a dataset suitable for training and testing. A transformer-based deep learning model will be implemented using PyTorch to analyze player lineups and generate predictions.

## Goals
1. **Data Exploration**: Analyze and preprocess NBA data to ensure compatibility with the model.
2. **Model Creation**: Build a transformer network to learn relationships between players in a lineup and predict game outcomes.
3. **Hyperparameter Tuning**: Experiment with learning rate, optimizer, number of epochs, and other hyperparameters to optimize performance.
4. **Evaluation and Analysis**: Evaluate model performance using metrics such as accuracy, loss, and F1-score. Provide insights into the model's strengths, limitations, and potential improvements.

## Key Features
- **Transformer Networks**: Leveraging multi-head attention to capture player and team relationships.
- **Comprehensive Dataset**: Utilizing player stats, game results, and team information from the BallDontLie API.
- **Visualization and Analysis**: Incorporating visual representations of data distributions, training progress, and performance metrics.

This notebook will serve as the main documentation for the project, including all steps from data retrieval to model evaluation.

Created with the aid of ChatGPT 4o - Using AI to build AI

In [None]:
# Importing necessary libraries
import requests  # For API requests
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import torch  # For deep learning model implementation
import torch.nn as nn  # For neural network components
import torch.optim as optim  # For optimization algorithms
from torch.utils.data import Dataset, DataLoader  # For data handling in PyTorch
from torch.utils.data._utils.collate import default_collate
from nba_api.stats.endpoints import leaguedashlineups
import json
import time
import os
from concurrent.futures import ThreadPoolExecutor
import threading
from sklearn.preprocessing import MinMaxScaler
import warnings


# Ensure plots are displayed inline
%matplotlib inline

# Display confirmation message
print("Libraries successfully loaded!")


## **I. Data Exploration and Preparation**

### Overview
In this section, we will collect, prepare, and process NBA data using the **BallDontLie API** to build a dataset for training a transformer-based deep learning model.

### Goals and Actions
1. **Data Collection**:
   - Retrieve detailed NBA game data (2003–2023) using game IDs.
   - Collect individual player statistics (e.g., minutes played, offensive and defensive ratings, usage percentages) for each game.
   - Extract team information (e.g., home and away team IDs, scores) and player metadata (e.g., names, positions).

2. **Data Cleaning**:
   - Normalize data formats, particularly for time strings (e.g., parsing minutes played).
   - Handle missing or incomplete data by assigning default or null values where necessary.
   - Rename and standardize column names (e.g., `id` → `game_id`) for consistency.

3. **Parallelized Processing**:
   - Implement a **threaded processing solution** to speed up API calls and data extraction.
   - Use incremental saving to ensure progress is retained during long data processing tasks.

4. **Dataset Preparation**:
   - Combine game-level and player-level statistics into a single dataset.
   - Structure the data for model input, including creating columns for the top 12 players (home and away) with their associated metrics.
   - Include player positions directly from the API to enhance the feature set for modeling.

5. **Static Tensor Creation**:
   - Generate and save **static tensors** for each game, containing normalized, game-independent features for both teams.
   - Normalize features (e.g., player stats, height, and weight) using `MinMaxScaler` for consistency.
   - Organize tensors in season-specific directories (`static_game_tensors`) for efficient retrieval during training.
   - Ensure static tensors are padded to maintain consistent dimensions for the transformer network.

### Summary
This section provides the foundation for the predictive model by ensuring the dataset is comprehensive, clean, and ready for efficient use in a transformer-based deep learning framework. The data is now structured and stored in a format that enables seamless integration with dynamic encodings during training.


In [None]:
# Fetching Game Data for NBA Seasons 2003–2023
# -----------------------------------------------------------
# This script retrieves game data for NBA seasons between 2003 and 2023 
# using the BallDon'tLie API. The data includes game IDs, dates, seasons, 
# scores, team names, and team IDs for home and away teams. 
# The data is retrieved in a paginated manner and stored in a CSV file 
# for future use in analysis or model training.

# Key Steps:
# 1. Define the API base URL and authorization key for secure access.
# 2. Set the range of seasons (2003–2023) for which data will be fetched.
# 3. Define a function (`fetch_games_for_season`) that:
#    - Iterates through the pages of results for a given season using cursors.
#    - Handles potential API rate limits with throttling.
#    - Handles errors with retry logic.
#    - Collects game data in a structured format.
# 4. Iterate over each season and call the function to retrieve all games.
# 5. Convert the collected raw data into a structured Pandas DataFrame, 
#    with relevant columns such as game date, season, scores, team names, 
#    and team IDs.
# 6. Save the DataFrame to a CSV file named `games_2003_2023.csv`.

# -----------------------------------------------------------
# Note:
# - Throttling (`time.sleep`) is used to avoid hitting API rate limits.
# - A retry mechanism waits 60 seconds in case of a non-200 response.
# - Paginated API responses are managed using a `cursor` for fetching 
#   additional pages of data until no more data is available.
# - The collected data is structured in a list of dictionaries and 
#   transformed into a Pandas DataFrame for easier manipulation.
# -----------------------------------------------------------

# Base URL and API Key
BASE_URL = "https://api.balldontlie.io/v1/games"
API_KEY = ""

# Headers for the API request
HEADERS = {
    "Authorization": API_KEY
}

# Define the seasons to retrieve (2003 to 2023)
START_YEAR = 2003
END_YEAR = 2023

# List to store game data
all_games = []

# Function to fetch games for a specific season
def fetch_games_for_season(season):
    cursor = None  # Start without a cursor
    while True:
        print(f"Fetching season {season}, cursor: {cursor}")
        
        # Construct the API URL with cursor for pagination
        url = f"{BASE_URL}?seasons[]={season}&per_page=100"
        if cursor:
            url += f"&cursor={cursor}"
        
        response = requests.get(url, headers=HEADERS)
        
        if response.status_code != 200:
            print(f"Error fetching data: {response.status_code}. Retrying in 60 seconds...")
            time.sleep(60)
            continue

        data = response.json()
        
        # Ensure the response contains new data
        if not data['data']:
            print(f"No more data found for season {season}, exiting loop.")
            break  # Exit loop if no more games are found

        # Add new games to the list
        all_games.extend(data['data'])
        print(f"Fetched {len(data['data'])} games. Total games collected: {len(all_games)}")
        
        # Update the cursor for the next page
        cursor = data.get('meta', {}).get('next_cursor', None)
        if not cursor:  # No more pages
            print(f"All pages fetched for season {season}.")
            break
        
        # Throttle requests to avoid hitting rate limits
        time.sleep(0.5)

# Fetch data for each season
for season in range(START_YEAR, END_YEAR + 1):
    fetch_games_for_season(season)

# Process the collected data into a DataFrame
print("Processing data into DataFrame...")
games_data = [
    {
        "id": game["id"],
        "date": game["date"],
        "season": game["season"],
        "status": game["status"],
        "home_team_score": game["home_team_score"],
        "visitor_team_score": game["visitor_team_score"],
        "home_team_name": game["home_team"]["full_name"],
        "home_team_id": game["home_team"]["id"],
        "visitor_team_name": game["visitor_team"]["full_name"],
        "visitor_team_id": game["visitor_team"]["id"]
    }
    for game in all_games
]

games_df = pd.DataFrame(games_data)

# Save the data to a CSV file
output_file = "games_2003_2023.csv"
games_df.to_csv(output_file, index=False)
print(f"Data saved to {output_file}.")


#### **Success!** We have successfully gathered the scores for every game from 2003-2023

#### Now we will enrich this data-frame with the top 12 most used players for each team including their Player-ID's, Position, Minutes, Offensive Rating, Defensive Rating, and Usage Pctg.

In [None]:
#Example of how we are going to expand data with information from stats and advanced_stats - we will do this on a loop for every game we have gathered
game_id = 15486  # Example game ID

# Base URLs and API Key
BASE_URL_STATS = "https://api.balldontlie.io/v1/stats"
BASE_URL_ADVANCED = "https://api.balldontlie.io/v1/stats/advanced"
API_KEY = ""
HEADERS = {"Authorization": API_KEY}

# Function to fetch stats
def fetch_stats(url, game_id):
    response = requests.get(f"{url}?game_ids[]={game_id}&per_page=100", headers=HEADERS)
    if response.status_code == 200:
        return response.json()["data"]
    else:
        raise Exception(f"Error fetching stats: {response.status_code}, {response.text}")

# Refined parse_minutes function
def parse_minutes(value):
    try:
        if isinstance(value, str):
            if ":" in value:  # Time string in "MM:SS" format
                parts = value.split(":")
                minutes = int(parts[0])
                seconds = int(parts[1])
                return minutes + seconds / 60  # Convert seconds to fractional minutes
            elif value.isdigit():  # Whole number string like "38"
                return float(value)  # Convert directly to float
        return 0  # Default for invalid or missing values
    except Exception as e:
        print(f"Error parsing minutes value '{value}': {e}")
        return 0

# Fetch data
base_stats = fetch_stats(BASE_URL_STATS, game_id)
advanced_stats = fetch_stats(BASE_URL_ADVANCED, game_id)

# Convert to DataFrames
base_df = pd.DataFrame(base_stats)
adv_df = pd.DataFrame(advanced_stats)

# Parse minutes played
base_df["minutes_played"] = base_df["min"].apply(parse_minutes)

# Merge base and advanced stats on player ID
base_df["player_id"] = base_df["player"].apply(lambda x: x["id"])
adv_df["player_id"] = adv_df["player"].apply(lambda x: x["id"])
adv_df["position"] = adv_df["player"].apply(lambda x: x.get("position", None))  # Extract position
merged_df = pd.merge(
    base_df,
    adv_df[["player_id", "offensive_rating", "defensive_rating", "usage_percentage", "position"]],
    on="player_id",
    how="inner"
)

# Add team and player full name
merged_df["team_id"] = merged_df["team"].apply(lambda x: x["id"])
merged_df["team_name"] = merged_df["team"].apply(lambda x: x["full_name"])
merged_df["full_name"] = merged_df["player"].apply(lambda x: f"{x['first_name']} {x['last_name']}")

# Split into home and away teams and take top 12 players by minutes played
home_team_id = 14  # Los Angeles Lakers
away_team_id = 7   # Dallas Mavericks
home_players = merged_df[merged_df["team_id"] == home_team_id].nlargest(12, "minutes_played")
away_players = merged_df[merged_df["team_id"] == away_team_id].nlargest(12, "minutes_played")

# Print top 12 players for each team
print("\nTop 12 Home Players:")
print(home_players[["full_name", "team_name", "minutes_played", "position", "offensive_rating", "defensive_rating", "usage_percentage"]])

print("\nTop 12 Away Players:")
print(away_players[["full_name", "team_name", "minutes_played", "position", "offensive_rating", "defensive_rating", "usage_percentage"]])

# Combine results into a single row for testing
game_row = {
    "id": game_id,
    "date": "2003-10-28",
    "season": 2003,
    "status": "Final",
    "home_team_score": 109,
    "visitor_team_score": 93,
    "home_team_name": "Los Angeles Lakers",
    "home_team_id": home_team_id,
    "visitor_team_name": "Dallas Mavericks",
    "visitor_team_id": away_team_id,
}

for i in range(1, 13):
    if i <= len(home_players):
        game_row[f"home_player_{i}_id"] = home_players.iloc[i - 1]["player_id"]
        game_row[f"home_player_{i}_name"] = home_players.iloc[i - 1]["full_name"]
        game_row[f"home_player_{i}_minutes"] = home_players.iloc[i - 1]["minutes_played"]
        game_row[f"home_player_{i}_position"] = home_players.iloc[i - 1]["position"]
        game_row[f"home_player_{i}_off_rating"] = home_players.iloc[i - 1]["offensive_rating"]
        game_row[f"home_player_{i}_def_rating"] = home_players.iloc[i - 1]["defensive_rating"]
        game_row[f"home_player_{i}_usage"] = home_players.iloc[i - 1]["usage_percentage"]
    else:
        game_row[f"home_player_{i}_id"] = None
        game_row[f"home_player_{i}_name"] = None
        game_row[f"home_player_{i}_minutes"] = None
        game_row[f"home_player_{i}_position"] = None
        game_row[f"home_player_{i}_off_rating"] = None
        game_row[f"home_player_{i}_def_rating"] = None
        game_row[f"home_player_{i}_usage"] = None

    if i <= len(away_players):
        game_row[f"away_player_{i}_id"] = away_players.iloc[i - 1]["player_id"]
        game_row[f"away_player_{i}_name"] = away_players.iloc[i - 1]["full_name"]
        game_row[f"away_player_{i}_minutes"] = away_players.iloc[i - 1]["minutes_played"]
        game_row[f"away_player_{i}_position"] = away_players.iloc[i - 1]["position"]
        game_row[f"away_player_{i}_off_rating"] = away_players.iloc[i - 1]["offensive_rating"]
        game_row[f"away_player_{i}_def_rating"] = away_players.iloc[i - 1]["defensive_rating"]
        game_row[f"away_player_{i}_usage"] = away_players.iloc[i - 1]["usage_percentage"]
    else:
        game_row[f"away_player_{i}_id"] = None
        game_row[f"away_player_{i}_name"] = None
        game_row[f"away_player_{i}_minutes"] = None
        game_row[f"away_player_{i}_position"] = None
        game_row[f"away_player_{i}_off_rating"] = None
        game_row[f"away_player_{i}_def_rating"] = None
        game_row[f"away_player_{i}_usage"] = None

# Create final DataFrame
final_df = pd.DataFrame([game_row])

# Save the data to a CSV file
output_file = "miniexp_games_2003_2023_top12_with_positions.csv"
final_df.to_csv(output_file, index=False)
print(f"Data saved to {output_file}.")


In [None]:
# Load the existing games DataFrame, and expand with our new categories ready to be filled in the next step
games_df = pd.read_csv("games_2003_2023.csv")

# Define the player-specific columns for home and away teams, including position
player_columns = []
for i in range(1, 13):
    player_columns.extend([
        f"home_player_{i}_id", f"home_player_{i}_name", f"home_player_{i}_position",
        f"home_player_{i}_minutes", f"home_player_{i}_off_rating",
        f"home_player_{i}_def_rating", f"home_player_{i}_usage",
    ])
for i in range(1, 13):
    player_columns.extend([
        f"away_player_{i}_id", f"away_player_{i}_name", f"away_player_{i}_position",
        f"away_player_{i}_minutes", f"away_player_{i}_off_rating",
        f"away_player_{i}_def_rating", f"away_player_{i}_usage",
    ])

# Create an empty DataFrame for player columns
empty_player_df = pd.DataFrame(columns=player_columns)

# Initialize all values as None
for col in empty_player_df.columns:
    empty_player_df[col] = None

# Append the empty player DataFrame to games_df
games_df = pd.concat([games_df, empty_player_df], axis=1)

# Rename the `id` column to `game_id`
if 'id' in games_df.columns:
    games_df.rename(columns={'id': 'game_id'}, inplace=True)
    print("Column 'id' renamed to 'game_id'.")
else:
    print("'id' column not found. Ensure the dataset is correct.")

# Save the updated dataset
expanded_games_file_updated = "expanded_games_2003_2023.csv"
games_df.to_csv(expanded_games_file_updated, index=False)
print(f"Updated dataset saved as {expanded_games_file_updated}.")


#### **Great.** Now we are ready to load our player statistics from historic data into the expanded_games_2003_2023.csv file
#### We are now going to be parsing JSON from BallDontLie's Basic and Advanced statistics for player data using game_id.
#### This was taking quite a while (originally was going to take like 12+ hours - without error - to upload). So, I created a threaded implementation to speed things up. 

In [None]:
# Filling out the expanded_games_2003_2023 file with the player-by-player data retrieved from BallDon'tLie's base and advanced player statistics API calls.
# 
# This script is designed to enhance the existing game dataset (`expanded_games_2003_2023.csv`) 
# by appending detailed player-level statistics for each game. The data includes both basic 
# and advanced player statistics fetched from two API endpoints. 
# 
# Key Steps:
# 1. **Load Dataset**:
#    - Load the existing game data into a Pandas DataFrame. This file is assumed to already 
#      contain game-level details such as game ID, teams, and scores, along with placeholders 
#      for player-level statistics.
#    - Check for the existence of the file to prevent errors.
# 
# 2. **Identify Unprocessed Games**:
#    - Determine which games are missing player-level data by checking for `NaN` values in the 
#      player-related columns (e.g., `home_player_1_id`).
#    - Create a list of `game_id`s for games that need processing.
# 
# 3. **Set Up Multithreading**:
#    - Use Python's `ThreadPoolExecutor` to process games concurrently, improving efficiency 
#      for large datasets.
#    - Implement a `lock` to ensure safe access to shared resources like the `games_df` DataFrame.
#    - Define a `batch_size` to periodically save progress during processing.
# 
# 4. **Fetch and Process Data for Each Game**:
#    - Use the `fetch_stats` function to retrieve player statistics from the base and advanced 
#      API endpoints for a given game ID.
#    - Parse the fetched data into DataFrames and calculate additional metrics, such as 
#      `minutes_played`, by transforming raw values.
#    - Merge the base and advanced stats using `player_id` as the key and extract relevant 
#      player attributes like offensive/defensive ratings, usage percentage, and position.
# 
# 5. **Organize Player Data**:
#    - Separate players into home and away teams based on `team_id`.
#    - Select the top 12 players per team based on minutes played to ensure relevance.
#    - Populate a dictionary (`player_data`) with player statistics for each team, including 
#      placeholders for missing players (set to `None`).
# 
# 6. **Update the Dataset**:
#    - Use the `update_games_df` function to fill in the player-level columns for each game in 
#      the main DataFrame.
#    - Save progress periodically using the `save_progress` function, ensuring no data is lost 
#      in case of an interruption.
# 
# 7. **Final Save**:
#    - Once all games are processed, save the completed dataset to a file.
#    - Print a confirmation message to indicate the process is complete.
# 
# Notes:
# - The script includes error handling to skip games with missing or invalid data, logging 
#   messages to inform the user of any issues.
# - Multithreading significantly reduces processing time for large datasets but requires careful 
#   management of shared resources to avoid data corruption.
# - API responses are parsed and validated to ensure compatibility with the dataset structure.
# 
# The final output is an updated CSV file (`expanded_games_2003_2023.csv`) containing both game-level 
# and player-level statistics, ready for advanced analysis or model training.

expanded_games_file = "expanded_games_2003_2023.csv"
if os.path.exists(expanded_games_file):
    games_df = pd.read_csv(expanded_games_file)
else:
    raise FileNotFoundError(f"{expanded_games_file} not found.")

# Identify unprocessed games
unprocessed_games = games_df[games_df["home_player_1_id"].isna()]["game_id"].tolist()
print(f"Found {len(unprocessed_games)} unprocessed games.")

# Set threading and batch processing parameters
lock = threading.Lock()
processed_games = 0
batch_size = 100

def process_game(game_id):
    try:
        # print(f"Processing game ID {game_id}...")

        # Fetch stats for the game
        base_stats = fetch_stats(BASE_URL_STATS, game_id)
        advanced_stats = fetch_stats(BASE_URL_ADVANCED, game_id)

        if not base_stats or not advanced_stats:
            print(f"No stats found for game ID {game_id}. Skipping...")
            return None, game_id

        # Convert to DataFrames
        base_df = pd.DataFrame(base_stats)
        adv_df = pd.DataFrame(advanced_stats)

        # Parse minutes played
        base_df["minutes_played"] = base_df["min"].apply(parse_minutes)

        # Merge base and advanced stats on player ID
        base_df["player_id"] = base_df["player"].apply(lambda x: x["id"])
        adv_df["player_id"] = adv_df["player"].apply(lambda x: x["id"])
        adv_df["position"] = adv_df["player"].apply(lambda x: x.get("position", None))  # Extract position
        merged_df = pd.merge(
            base_df,
            adv_df[["player_id", "offensive_rating", "defensive_rating", "usage_percentage", "position"]],
            on="player_id",
            how="inner"
        )

        # Add team and player full name
        merged_df["team_id"] = merged_df["team"].apply(lambda x: x["id"])
        merged_df["team_name"] = merged_df["team"].apply(lambda x: x["full_name"])
        merged_df["full_name"] = merged_df["player"].apply(lambda x: f"{x['first_name']} {x['last_name']}")

        # Extract home and away team IDs from the main DataFrame
        home_team_id = games_df.loc[games_df["game_id"] == game_id, "home_team_id"].values[0]
        away_team_id = games_df.loc[games_df["game_id"] == game_id, "visitor_team_id"].values[0]

        # Split into home and away players and select top 12 by minutes played
        home_players = merged_df[merged_df["team_id"] == home_team_id].nlargest(12, "minutes_played")
        away_players = merged_df[merged_df["team_id"] == away_team_id].nlargest(12, "minutes_played")

        # Create a dictionary to store processed player data
        player_data = {}
        for i in range(1, 13):
            for team, players in [("home", home_players), ("away", away_players)]:
                if i <= len(players):
                    player_data[f"{team}_player_{i}_id"] = players.iloc[i - 1]["player_id"]
                    player_data[f"{team}_player_{i}_name"] = players.iloc[i - 1]["full_name"]
                    player_data[f"{team}_player_{i}_minutes"] = players.iloc[i - 1]["minutes_played"]
                    player_data[f"{team}_player_{i}_position"] = players.iloc[i - 1]["position"]
                    player_data[f"{team}_player_{i}_off_rating"] = players.iloc[i - 1]["offensive_rating"]
                    player_data[f"{team}_player_{i}_def_rating"] = players.iloc[i - 1]["defensive_rating"]
                    player_data[f"{team}_player_{i}_usage"] = players.iloc[i - 1]["usage_percentage"]
                else:
                    player_data[f"{team}_player_{i}_id"] = None
                    player_data[f"{team}_player_{i}_name"] = None
                    player_data[f"{team}_player_{i}_minutes"] = None
                    player_data[f"{team}_player_{i}_position"] = None
                    player_data[f"{team}_player_{i}_off_rating"] = None
                    player_data[f"{team}_player_{i}_def_rating"] = None
                    player_data[f"{team}_player_{i}_usage"] = None

        # print(f"Finished processing game ID {game_id}.")
        return player_data, game_id

    except Exception as e:
        print(f"Error processing game {game_id}: {e}")
        return None, game_id
# Function to update the DataFrame with processed game data
def update_games_df(game_data, game_id):
    global games_df
    with lock:
        for key, value in game_data.items():
            games_df.loc[games_df["game_id"] == game_id, key] = value

# Function to save progress to a file
def save_progress():
    global games_df
    with lock:
        games_df.to_csv(expanded_games_file, index=False)
        print(f"Progress saved at {time.strftime('%Y-%m-%d %H:%M:%S')}")

# ThreadPoolExecutor for processing games
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = executor.map(process_game, unprocessed_games)
    for game_data, game_id in futures:
        if game_data:
            update_games_df(game_data, game_id)
            processed_games += 1
            if processed_games % batch_size == 0:
                save_progress()

# Final save
save_progress()
print("Processing complete.")


In [None]:
# Created playerBio.csv with player_Ids normalized, heights, and weights for quick lookup during static tensor creation
GAME_DATA_FILE = "expanded_games_2003_2023.csv"
PLAYER_BIO_FILE = "playerBio.csv"

# Load the game data
games_df = pd.read_csv(GAME_DATA_FILE)

# Extract player ID columns for both home and away teams
player_columns = [f"{team}_player_{i}_id" for team in ["home", "away"] for i in range(1, 13)]
player_ids = games_df[player_columns].stack().dropna().astype(int).unique()  # Stack, drop NaN, ensure integers

# Create a DataFrame for unique player IDs
player_bio_df = pd.DataFrame(player_ids, columns=["player_id"])

# Save the unique player IDs to playerBio.csv
player_bio_df.to_csv(PLAYER_BIO_FILE, index=False)
print(f"Created {PLAYER_BIO_FILE} with {len(player_bio_df)} unique player IDs.")

# Normalize player IDs
player_bio_df["normalized_id"] = range(len(player_bio_df))  # Assign normalized IDs sequentially

# Save the updated player bio file
player_bio_df.to_csv(PLAYER_BIO_FILE, index=False)
print(f"Updated {PLAYER_BIO_FILE} with normalized player IDs.")
print(player_bio_df.head())

API_KEY = ""
PLAYER_API_URL = "https://api.balldontlie.io/v1/players"

# Load playerBio.csv
player_bio_df = pd.read_csv(PLAYER_BIO_FILE)

# Add height and weight columns if they do not exist
if "height" not in player_bio_df.columns:
    player_bio_df["height"] = None
if "weight" not in player_bio_df.columns:
    player_bio_df["weight"] = None

# Get all player_ids that still need height and weight
players_to_fetch = player_bio_df[player_bio_df["height"].isna()]["player_id"].tolist()

# Batch size for saving progress
batch_size = 50

# Loop over player IDs
for idx, player_id in enumerate(players_to_fetch, start=1):
    try:
        # Make the API request
        response = requests.get(f"{PLAYER_API_URL}?player_ids[]={player_id}", headers={"Authorization": API_KEY})
        if response.status_code == 200:
            player_data = response.json()

            # Check if the API response contains valid data
            if "data" in player_data and len(player_data["data"]) > 0:
                player = player_data["data"][0]

                # Parse height
                height_str = player.get("height", "0")
                if "-" in height_str:  # Format is "6-1"
                    feet, inches = map(int, height_str.split("-"))
                    height = feet * 12 + inches
                elif height_str.isdigit():  # Format is "6"
                    height = int(height_str) * 12
                else:
                    height = 0  # Default for invalid/missing height

                # Parse weight
                weight = int(player.get("weight", 0))

                # Update playerBio DataFrame
                player_bio_df.loc[player_bio_df["player_id"] == player_id, "height"] = height
                player_bio_df.loc[player_bio_df["player_id"] == player_id, "weight"] = weight
            else:
                print(f"No valid data for player ID {player_id}.")
        else:
            print(f"Failed to fetch data for player ID {player_id}: {response.status_code}")
    except Exception as e:
        print(f"Error fetching data for player ID {player_id}: {e}")

    # Save progress every 50 players
    if idx % batch_size == 0:
        player_bio_df.to_csv(PLAYER_BIO_FILE, index=False)
        print(f"Saved progress after processing {idx} players.")

    # Delay to avoid hitting rate limits
    time.sleep(0.5)

# Final save
player_bio_df.to_csv(PLAYER_BIO_FILE, index=False)
print("Completed fetching height and weight for all players. Saved final progress.")

In [None]:
# Creating a tensor for a single game example
# -----------------------------------------------------------
# This block demonstrates the process of generating and saving static tensors for a single game.
# Static tensors represent team-level features (e.g., player stats) that do not change during the game.
# The tensors are saved in a structured directory format for easy access during model training.
# Key steps include:
# 1. Defining the storage structure for tensors.
# 2. Fetching player-specific static features (e.g., height, weight, minutes played).
# 3. Normalizing and converting player data into tensors.
# 4. Saving the tensors to disk with season-specific organization.
# -----------------------------------------------------------

# File for player height and weight lookup
PLAYER_BIO_FILE = "playerBio.csv"

# Directory for saving static tensors
TENSOR_DIR = "static_game_tensors"
os.makedirs(TENSOR_DIR, exist_ok=True)

# Create the player bio file if it doesn't exist
if not os.path.exists(PLAYER_BIO_FILE):
    pd.DataFrame(columns=["player_id", "height", "weight"]).to_csv(PLAYER_BIO_FILE, index=False)

def fetch_or_lookup_player_stats(player_id):
    """
    Fetch the player's height and weight from the lookup file or API.
    """
    player_bio = pd.read_csv(PLAYER_BIO_FILE)
    if player_id in player_bio["player_id"].values:
        player_row = player_bio[player_bio["player_id"] == player_id].iloc[0]
        return {"height": player_row["height"], "weight": player_row["weight"]}
    else:
        # Fallback for height/weight API lookup if not in the CSV
        return {"height": 72, "weight": 200}  # Default values for height/weight

def process_team_static(game_row, team_prefix):
    """
    Process a team's static features for saving.
    """
    players = []
    for i in range(1, 13):
        player_id = game_row[f"{team_prefix}_player_{i}_id"]
        player_data = {
            "player_id": int(player_id) if not pd.isna(player_id) else None,
            "minutes": game_row[f"{team_prefix}_player_{i}_minutes"],
            "off_rating": game_row[f"{team_prefix}_player_{i}_off_rating"],
            "def_rating": game_row[f"{team_prefix}_player_{i}_def_rating"],
            "usage": game_row[f"{team_prefix}_player_{i}_usage"],
        }
        if player_data["player_id"] is not None:
            stats = fetch_or_lookup_player_stats(player_data["player_id"])
            player_data.update(stats)
        players.append(player_data)
    
    # Normalize static features
    numeric_features = ["minutes", "off_rating", "def_rating", "usage", "height", "weight"]
    scaler = MinMaxScaler()
    numeric_data = [[player[f] for f in numeric_features] for player in players if player["player_id"] is not None]
    if len(numeric_data) > 0:
        scaler.fit(numeric_data)
        normalized_data = scaler.transform(numeric_data)
    else:
        normalized_data = []

    team_tensor = []
    for i, player in enumerate(players):
        if player["player_id"] is None:
            continue
        numeric_values = [player[f] if f in player and player[f] is not None else 0 for f in numeric_features]
        normalized_features = scaler.transform([numeric_values])[0] if len(numeric_data) > 0 else [0] * len(numeric_features)
        static_tensor = torch.tensor(normalized_features, dtype=torch.float32)
        team_tensor.append(static_tensor)
    
    # Pad to ensure 12 players
    while len(team_tensor) < 12:
        team_tensor.append(torch.zeros_like(team_tensor[0]))

    return torch.stack(team_tensor)

def save_static_tensors(game_row, home_tensor, away_tensor):
    """
    Save home and away static tensors for a game in season-specific folders.
    """
    season = game_row["season"]  # Extract season
    season_dir = os.path.join(TENSOR_DIR, f"season_{season}")
    os.makedirs(season_dir, exist_ok=True)  # Create season directory if it doesn't exist

    # Save tensors
    game_id = game_row["game_id"]
    home_path = os.path.join(season_dir, f"{game_id}_home_static.pt")
    away_path = os.path.join(season_dir, f"{game_id}_away_static.pt")
    torch.save(home_tensor, home_path)
    torch.save(away_tensor, away_path)
    print(f"Saved static tensors for game {game_id} in season {season}: Home -> {home_path}, Away -> {away_path}")

# Example: Process and save static tensors for a single game
game_row = games_df[games_df["game_id"] == game_id].iloc[0]
home_static_tensor = process_team_static(game_row, "home")
away_static_tensor = process_team_static(game_row, "away")
save_static_tensors(game_row, home_static_tensor, away_static_tensor)


In [39]:
# Processing and Saving Static Tensors for All Games in the Dataset
# -----------------------------------------------------------------
# This section handles the creation of static tensors for every game in the dataset.
# Static tensors represent normalized, game-independent features for each team's players,
# such as player stats (minutes, offensive/defensive ratings, usage), and physical attributes
# (height, weight). These tensors are saved for each game and organized into season-specific folders.
#
# Workflow:
# 1. **Player Bio Cache**:
#    - A CSV file (`playerBio.csv`) is used to store height and weight data for players to minimize
#      redundant API calls. Default values are provided for players not found in the cache.
#
# 2. **Feature Normalization**:
#    - Player statistics and physical attributes are normalized using `MinMaxScaler`.
#    - Ensures that all numeric features are scaled consistently for better input to the model.
#
# 3. **Tensor Construction**:
#    - Static features for each team (home and away) are converted into PyTorch tensors.
#    - Tensors are padded to ensure 12 players per team, maintaining a consistent shape.
#
# 4. **Game Organization**:
#    - Processed tensors are saved in season-specific directories within `static_game_tensors`.
#    - Each tensor file is named using the game ID and team (e.g., `gameID_home_static.pt`).
#
# 5. **Validation and Error Handling**:
#    - Games with missing or invalid data (e.g., missing player IDs or stats) are skipped with a warning.
#    - Errors during processing are caught and logged without interrupting the overall loop.
#
# 6. **Dataset Loading and Iteration**:
#    - The expanded dataset (`expanded_games_2003_2023.csv`) is loaded and iterated row by row.
#    - Static tensors for each game are processed and saved in a batch manner.
#
# Output:
# - Static tensors are stored in `static_game_tensors` organized by season.
# - These tensors will later be combined with dynamic encodings (e.g., `playerID`, `position`, `teamID`)
#   at runtime during training or testing of the deep learning model.
#
# Notes:
# - Tensors are crucial for representing game data in a format suitable for transformer networks.
# - This preprocessing step is designed for efficient retrieval and consistent data structure.

PLAYER_BIO_FILE = "playerBio.csv"

# Directory for saving static tensors
TENSOR_DIR = "static_game_tensors_redo"
os.makedirs(TENSOR_DIR, exist_ok=True)

# Create the player bio file if it doesn't exist
if not os.path.exists(PLAYER_BIO_FILE):
    pd.DataFrame(columns=["player_id", "height", "weight"]).to_csv(PLAYER_BIO_FILE, index=False)

def fetch_or_lookup_player_stats(player_id):
    """
    Fetch the player's height and weight from the lookup file or API.
    """
    player_bio = pd.read_csv(PLAYER_BIO_FILE)
    if player_id in player_bio["player_id"].values:
        player_row = player_bio[player_bio["player_id"] == player_id].iloc[0]
        return {"height": player_row["height"], "weight": player_row["weight"]}
    else:
        return {"height": 72, "weight": 200}  # Default values

def process_team_static(game_row, team_prefix):
    """
    Process a team's static features for saving.
    """
    players = []
    for i in range(1, 13):
        player_id = game_row[f"{team_prefix}_player_{i}_id"]
        player_data = {
            "player_id": int(player_id) if not pd.isna(player_id) else None,
            "minutes": game_row[f"{team_prefix}_player_{i}_minutes"],
            "off_rating": game_row[f"{team_prefix}_player_{i}_off_rating"],
            "def_rating": game_row[f"{team_prefix}_player_{i}_def_rating"],
            "usage": game_row[f"{team_prefix}_player_{i}_usage"],
        }
        if player_data["player_id"] is not None:
            stats = fetch_or_lookup_player_stats(player_data["player_id"])
            player_data.update(stats)
        players.append(player_data)

    # Normalize static features
    numeric_features = ["minutes", "off_rating", "def_rating", "usage", "height", "weight"]
    scaler = MinMaxScaler()
    numeric_data = [[player[f] for f in numeric_features] for player in players if player["player_id"] is not None]
    if len(numeric_data) > 0:
        scaler.fit(numeric_data)
        normalized_data = scaler.transform(numeric_data)
    else:
        normalized_data = []

    team_tensor = []
    for i, player in enumerate(players):
        if player["player_id"] is None:
            continue
        numeric_values = [player[f] if f in player and player[f] is not None else 0 for f in numeric_features]
        normalized_features = scaler.transform([numeric_values])[0] if len(numeric_data) > 0 else [0] * len(numeric_features)
        static_tensor = torch.tensor(normalized_features, dtype=torch.float32)
        team_tensor.append(static_tensor)

    # Sort players by playtime (minutes played) in descending order
    team_tensor = [tensor for _, tensor in sorted(zip(numeric_data, team_tensor), key=lambda x: x[0][0], reverse=True)]

    # Pad to ensure 12 players
    while len(team_tensor) < 12:
        team_tensor.append(torch.zeros_like(team_tensor[0]))

    return torch.stack(team_tensor)

def save_static_tensors(game_row):
    """
    Save home and away static tensors for a game in season-specific folders.
    """
    try:
        # Extract season and game ID
        season = game_row["season"]
        game_id = game_row["game_id"]

        # Skip rows with invalid critical values
        if pd.isna(season) or pd.isna(game_id) or pd.isna(game_row["home_team_id"]) or pd.isna(game_row["visitor_team_id"]):
            print(f"Skipping invalid game data: Game ID {game_id}")
            return

        # Process home and away teams
        home_static_tensor = process_team_static(game_row, "home")
        away_static_tensor = process_team_static(game_row, "away")

        # Save tensors
        season_dir = os.path.join(TENSOR_DIR, f"season_{season}")
        os.makedirs(season_dir, exist_ok=True)
        home_path = os.path.join(season_dir, f"{game_id}_home_static.pt")
        away_path = os.path.join(season_dir, f"{game_id}_away_static.pt")
        torch.save(home_static_tensor, home_path)
        torch.save(away_static_tensor, away_path)
        print(f"Saved static tensors for game {game_id} in season {season}: Home -> {home_path}, Away -> {away_path}")
    except Exception as e:
        print(f"Error processing game {game_row['game_id'] if 'game_id' in game_row else 'unknown'}: {e}")

# Load the dataset
dataset_path = "expanded_games_2003_2023.csv"
try:
    games_df = pd.read_csv(dataset_path)
    print(f"Dataset '{dataset_path}' loaded successfully. Number of rows: {len(games_df)}")
except FileNotFoundError:
    print(f"Error: File '{dataset_path}' not found.")
    raise

# Loop through all games and save tensors
for _, game_row in games_df.iterrows():
    save_static_tensors(game_row)


Dataset 'expanded_games_2003_2023.csv' loaded successfully. Number of rows: 26947
Saved static tensors for game 15486 in season 2003: Home -> static_game_tensors_redo/season_2003/15486_home_static.pt, Away -> static_game_tensors_redo/season_2003/15486_away_static.pt
Saved static tensors for game 15487 in season 2003: Home -> static_game_tensors_redo/season_2003/15487_home_static.pt, Away -> static_game_tensors_redo/season_2003/15487_away_static.pt
Saved static tensors for game 15778 in season 2003: Home -> static_game_tensors_redo/season_2003/15778_home_static.pt, Away -> static_game_tensors_redo/season_2003/15778_away_static.pt
Saved static tensors for game 15488 in season 2003: Home -> static_game_tensors_redo/season_2003/15488_home_static.pt, Away -> static_game_tensors_redo/season_2003/15488_away_static.pt
Saved static tensors for game 15489 in season 2003: Home -> static_game_tensors_redo/season_2003/15489_home_static.pt, Away -> static_game_tensors_redo/season_2003/15489_away_sta

KeyboardInterrupt: 

#### Another issue that popped up during the creation and testing of our model had to deal with the size of playerID numbers in newer seasons (they grew huge).
#### We overcome that here through normalization in the PlayerBio.csv file.

### Summary of Accomplishments

- Successfully retrieved and processed **~27,000 NBA games** spanning two decades (2003–2023).
- Built a comprehensive dataset incorporating detailed player and game statistics, including **player usage**, **minutes played**, and **team compositions**.
- Overcame challenges such as missing data, API rate limits, and long processing times using **incremental saving** and **parallelized API calls**.
- Created normalized **static tensors** for each game, containing player-level metrics (e.g., offensive/defensive ratings, height, and weight) and structured them for efficient model integration.
- Established a flexible framework that supports the addition of more detailed player statistics (e.g., height/weight updates) through quick API lookups if needed.

This dataset now serves as a robust foundation for training the transformer-based deep learning model, enabling advanced analysis and predictions in subsequent sections.


## **II. Model Creation**

### Overview
This section focuses on building a transformer-based deep learning model to predict NBA game outcomes. The model will analyze static player features and dynamically learned embeddings to generate predictions.

### Goals
1. **Model Architecture**:
   - Implement a transformer network using **PyTorch**.
   - Use multi-head attention mechanisms to analyze player-to-player and team-to-team relationships.
   - Experiment with architectures that include residual connections to enhance model depth and stability.

2. **Input and Output Design**:
   - **Inputs**:
     - Static tensors for each team, including player statistics normalized per game.
     - Dynamically learned embeddings for PlayerID, PositionID, TeamID, and Season, added at runtime.
   - **Outputs**:
     - Predict the final scores for the home and away teams.

3. **Model Training**:
   - Train the model chronologically, using data from past games to predict future ones.
   - Define the training loop with appropriate loss functions and optimizers.
   - Split data into training, validation, and test sets based on game chronology for temporal consistency.

### Implementation Steps
1. **Set Up Dynamic Embeddings**:
   - Initialize embeddings for PlayerID, PositionID, TeamID, and Season, which will be added dynamically to the static tensors at runtime.

2. **Combine Static and Dynamic Features**:
   - Design a process to concatenate static and dynamic features for each player token during training and inference.

3. **Define the Transformer Architecture**:
   - Specify input dimensions, number of attention heads, and transformer layers.
   - Experiment with designs, including multi-head attention blocks and fully connected layers.

4. **Configure the Training Pipeline**:
   - Choose a regression-based loss function (e.g., Mean Squared Error) and an optimizer (e.g., Adam).
   - Set hyperparameters such as learning rate, batch size, and number of epochs.

5. **Initial Testing**:
   - Train the model on a subset of the data (e.g., a single season) to validate functionality and debug issues.
   - Evaluate performance metrics (e.g., MSE, RMSE) and refine the architecture.

6. **Scaling and Tuning**:
   - Train on the full dataset after initial testing.
   - Perform hyperparameter tuning to optimize model performance.

---

This section will document the detailed process of creating and implementing the model, highlighting the reasoning behind architectural and design decisions.


In [53]:
# Transformer Input Token Creation for NBA Prediction Model
# -----------------------------------------------------------
# This section focuses on creating player tokens, which serve as input for a transformer-based deep learning model.
# The tokens combine dynamic embeddings (e.g., player IDs, positions, team relationships) with static tensors
# (game-independent player features like normalized stats and physical attributes).
# -----------------------------------------------------------

STATIC_TENSOR_DIR = "static_game_tensors_redo"

# Embedding dimensions
E_player = 18  # PlayerID embedding size
E_position = 4
E_team = 8
E_season = 4

# Max ranges for embedding indices
max_normalized_player_id = 2500  # Using normalized PlayerIDs
max_team_id = 32  # Total teams (incl. Supersonics)
max_position_id = 3  # C, F, G, None -> 0, 1, 2, 3
max_season_id = 22  # Seasons 2003 to 2023 -> 0 to 20

# Initialize embeddings
player_embedding = nn.Embedding(max_normalized_player_id + 1, E_player)
position_embedding = nn.Embedding(max_position_id + 1, E_position)
team_embedding = nn.Embedding(max_team_id + 1, E_team)
season_embedding = nn.Embedding(max_season_id + 1, E_season)

# Load the player ID mapping from playerBio.csv
player_bio_df = pd.read_csv("playerBio.csv").set_index("player_id")
player_id_mapping = player_bio_df["normalized_id"].to_dict()

def load_static_tensor(game_id, team_type, season):
    """
    Load the static tensor for a team (home/away) in a specific game.
    """
    tensor_path = os.path.join(STATIC_TENSOR_DIR, f"season_{season}", f"{game_id}_{team_type}_static.pt")
    if not os.path.exists(tensor_path):
        raise FileNotFoundError(f"Static tensor not found: {tensor_path}")
    return torch.load(tensor_path)


def create_player_token(player_id, position_id, team_id_for, team_id_against, season_id, static_tensor, player_id_mapping):
    """
    Concatenate dynamic embeddings with static features to create a full player token.
    Replaces NaN height and weight in static_tensor with league averages (6'7", 220 lbs).
    """
    # Map raw player ID to normalized ID
    normalized_player_id = player_id_mapping.get(player_id, 0)  # Default to 0 for invalid IDs

    try:
        # Replace NaN height and weight in static_tensor with league averages
        static_tensor = torch.nan_to_num(static_tensor, nan=torch.tensor(79.0 if i == 0 else 220.0, dtype=torch.float32)) 

        # Generate dynamic embeddings
        player_id_tensor = player_embedding(torch.tensor(normalized_player_id, dtype=torch.long))
        position_tensor = position_embedding(torch.tensor(position_id if position_id is not None else 3, dtype=torch.long))
        team_for_tensor = team_embedding(torch.tensor(team_id_for, dtype=torch.long))
        team_against_tensor = team_embedding(torch.tensor(team_id_against, dtype=torch.long))
        season_tensor = season_embedding(torch.tensor(season_id, dtype=torch.long))

        # Validate all tensors for NaN
        for tensor, name in [
            (player_id_tensor, "player_id_tensor"),
            (position_tensor, "position_tensor"),
            (team_for_tensor, "team_for_tensor"),
            (team_against_tensor, "team_against_tensor"),
            (season_tensor, "season_tensor"),
            (static_tensor, "static_tensor"),
        ]:
            assert not torch.isnan(tensor).any(), f"NaN detected in {name}"

        # Concatenate dynamic and static features
        token = torch.cat([
            player_id_tensor,
            position_tensor,
            team_for_tensor,
            team_against_tensor,
            season_tensor,
            static_tensor
        ])

        # Validate the final token
        assert not torch.isnan(token).any(), "NaN detected in player token"

        return token

    except Exception as e:
        print(f"Error in token creation for player_id {player_id} (normalized ID: {normalized_player_id}): {e}")
        # Return a zeroed-out token as a fallback
        return torch.zeros(
            E_player + E_position + E_team * 2 + E_season + static_tensor.shape[0]
        )


def create_game_tokens(game_id, season, home_team_id, away_team_id, games_df, player_id_mapping):
    """
    Generate tokens for all players in a game (home and away).
    """
    # Mapping for position to IDs
    position_mapping = {"C": 0, "F": 1, "G": 2, None: 3}

    # Load static tensors
    home_static_tensor = load_static_tensor(game_id, "home", season)
    away_static_tensor = load_static_tensor(game_id, "away", season)

    # Extract player data dynamically from the dataframe
    home_players = [
        {"player_id": int(games_df[f"home_player_{i}_id"]) if not pd.isna(games_df[f"home_player_{i}_id"]) else None,
         "position_id": position_mapping.get(games_df[f"home_player_{i}_position"], 3)}
        for i in range(1, 13)
    ]
    away_players = [
        {"player_id": int(games_df[f"away_player_{i}_id"]) if not pd.isna(games_df[f"away_player_{i}_id"]) else None,
         "position_id": position_mapping.get(games_df[f"away_player_{i}_position"], 3)}
        for i in range(1, 13)
    ]

    # Create player tokens for home team
    home_tokens = []
    for i, player in enumerate(home_players):
        if player["player_id"] is None:
            continue
        player_token = create_player_token(
            player_id=player["player_id"],
            position_id=player["position_id"],
            team_id_for=home_team_id,
            team_id_against=away_team_id,
            season_id=season - 2003,  # Normalize season to 0-20
            static_tensor=home_static_tensor[i],
            player_id_mapping=player_id_mapping
        )
        home_tokens.append(player_token)

    # Create player tokens for away team
    away_tokens = []
    for i, player in enumerate(away_players):
        if player["player_id"] is None:
            continue
        player_token = create_player_token(
            player_id=player["player_id"],
            position_id=player["position_id"],
            team_id_for=away_team_id,
            team_id_against=home_team_id,
            season_id=season - 2003,  # Normalize season to 0-20
            static_tensor=away_static_tensor[i],
            player_id_mapping=player_id_mapping
        )
        away_tokens.append(player_token)

    # Ensure each team has exactly 12 tokens
    while len(home_tokens) < 12:
        home_tokens.append(torch.zeros_like(home_tokens[0]))
    while len(away_tokens) < 12:
        away_tokens.append(torch.zeros_like(away_tokens[0]))

    # Stack tokens for both teams
    home_tokens = torch.stack(home_tokens)
    away_tokens = torch.stack(away_tokens)

    # Validate final tokens
    assert not torch.isnan(home_tokens).any(), "NaN detected in final home tokens"
    assert not torch.isnan(away_tokens).any(), "NaN detected in final away tokens"

    return home_tokens, away_tokens

# Load the player ID mapping from playerBio.csv
player_bio_df = pd.read_csv("playerBio.csv").set_index("player_id")
player_id_mapping = player_bio_df["normalized_id"].to_dict()

# Example Usage
game_id = 15486
season = 2003
home_team_id = 13  # Lakers
away_team_id = 7  # Mavericks

# Load games dataframe
games_df = pd.read_csv("expanded_games_2003_2023.csv").set_index("game_id").loc[game_id]

# Generate tokens
home_tokens, away_tokens = create_game_tokens(game_id, season, home_team_id, away_team_id, games_df, player_id_mapping)

# Print results
print(f"Home Tokens Shape: {home_tokens.shape}")
print(f"Away Tokens Shape: {away_tokens.shape}")
for i in range(12):
    print(f"Home Token {i}:\n{home_tokens[i]}")
for i in range(12):
    print(f"Away Token {i}:\n{away_tokens[i]}")


Home Tokens Shape: torch.Size([12, 48])
Away Tokens Shape: torch.Size([12, 48])
Home Token 0:
tensor([ 0.2969,  1.1050, -1.0692, -0.1681,  1.6968,  0.4612,  1.1499,  1.1575,
        -0.1793,  1.6901, -0.6646,  2.1300, -1.1579,  1.1120, -0.6387, -0.3155,
         1.2548, -1.2322,  1.1627, -0.0210,  0.8587,  0.1787, -0.0986,  2.3293,
        -1.9904,  0.0079, -0.3418, -1.3153, -1.6089, -0.1789, -1.7088, -0.3227,
        -1.5775, -1.1946, -0.8241,  0.8506,  0.1329, -1.4886, -0.1724,  0.7989,
         1.3324,  0.4034,  1.0000,  0.9069,  0.7266,  0.7092,  0.0000,  0.1786],
       grad_fn=<SelectBackward0>)
Home Token 1:
tensor([ 1.2598, -0.2457,  1.0140,  0.0974, -1.6120, -0.3490,  0.3838,  0.2361,
         1.4533, -0.2876, -0.3169, -0.0651, -0.4180, -1.2510, -0.5809, -0.9387,
        -1.0161, -0.7057,  1.1627, -0.0210,  0.8587,  0.1787, -0.0986,  2.3293,
        -1.9904,  0.0079, -0.3418, -1.3153, -1.6089, -0.1789, -1.7088, -0.3227,
        -1.5775, -1.1946, -0.8241,  0.8506,  0.1329, -1.4

### Anatomy of our 24x48 Tensor Transformer Network Inputs

For each NBA game, we represent the **home team** and **away team** with two separate tensors, each with a shape of `12x48`. These tensors include both **static features** and **dynamic embeddings** for the top 12 players from each team. Here's a breakdown of the tensor structure:

---

#### **Per Team Tensor Dimensions**
- **12 Rows**: Each row represents a player token, ordered by minutes played during the game.
- **48 Columns**: Each column corresponds to a feature derived from either static or dynamic data.

---

#### **Feature Breakdown (48 Features per Player)**

1. **Dynamic Embeddings (42 Features)**:
   - **PlayerID Embedding (18 Features)**:
     - A learned embedding representing the unique player.
   - **PositionID Embedding (4 Features)**:
     - A learned embedding representing the player's position (`C`, `F`, `G`, or `None`).
   - **TeamID Embedding (16 Features)**:
     - Two separate embeddings for:
       1. The team the player is playing **for** (8 features).
       2. The team the player is playing **against** (8 features).
   - **Season Embedding (4 Features)**:
     - A learned embedding representing the season the game took place in.

2. **Static Features (6 Features)**:
   - **Minutes Played** (normalized): The player's total minutes on the court.
   - **Offensive Rating** (normalized): Player's offensive efficiency.
   - **Defensive Rating** (normalized): Player's defensive efficiency.
   - **Usage Percentage** (normalized): Player's involvement in the offense.
   - **Height** (normalized): Player's height in inches.
   - **Weight** (normalized): Player's weight in pounds.

---

#### **Token Example**
For a single game:
- **Shape**: `[1, 48]`
- **Example for Player 1**:
  ```plaintext
  tensor([PlayerID Embedding (18), 
          PositionID Embedding (4), 
          Team For Embedding (8), 
          Team Against Embedding (8), 
          Season Embedding (4), 
          Minutes Played (1), 
          Offensive Rating (1), 
          Defensive Rating (1), 
          Usage Percentage (1), 
          Height (1), 
          Weight (1)])


In [51]:
warnings.filterwarnings(
    "ignore",
    message=(
        "enable_nested_tensor is True, but self.use_nested_tensor is False because"
        " encoder_layer.self_attn.batch_first was not True"
    ),
)
warnings.filterwarnings(
    "ignore",
    message=r"You are using `torch.load` with `weights_only=False`.*",
    category=FutureWarning
)

STATIC_TENSOR_DIR = "static_game_tensors_redo"

# Dataset class for the 2003 season
class NBA2003Dataset(Dataset):
    def __init__(self, games_df):
        self.games_df = games_df[games_df["season"] == 2003]  # Filter for 2003 season
        self.game_ids = self.games_df.index.tolist()

    def __len__(self):
        return len(self.game_ids)

    def __getitem__(self, idx):
        game_id = self.game_ids[idx]

        try:
            # Extract the row as a Series
            game_row = self.games_df.loc[game_id]
            if isinstance(game_row, pd.DataFrame):  # Handle duplicate game_id scenario
                game_row = game_row.iloc[0]  # Take the first occurrence

            # Ensure all scalar values
            season = int(game_row["season"])
            home_team_id = int(game_row["home_team_id"])
            away_team_id = int(game_row["visitor_team_id"])

            # Use your existing create_game_tokens function
            home_tokens, away_tokens = create_game_tokens(
                game_id, season, home_team_id, away_team_id, game_row, player_id_mapping
            )

            # Target extraction
            target = torch.tensor([game_row["home_team_score"], game_row["visitor_team_score"]], dtype=torch.float32)

            return home_tokens, away_tokens, target

        except FileNotFoundError as e:
            print(f"Skipping game {game_id}: {e}")
            return None  # Returning None will be handled in the DataLoader collate function


# Custom collate function to handle None
def custom_collate_fn(batch):
    # Filter out None values
    batch = [item for item in batch if item is not None]
    if len(batch) == 0:
        return None, None, None
    return default_collate(batch)  # Use PyTorch's default collate for valid data


# Load games and player mapping
games_df = pd.read_csv("expanded_games_2003_2023.csv").set_index("game_id")
player_bio_df = pd.read_csv("playerBio.csv").set_index("player_id")
player_id_mapping = player_bio_df["normalized_id"].to_dict()

# Initialize dataset and data loader
dataset = NBA2003Dataset(games_df)
data_loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0, collate_fn=custom_collate_fn)

# Model definition remains unchanged
class LineupLab(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, num_layers, output_dim=2):
        super(LineupLab, self).__init__()
        self.home_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.away_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.combined_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.fc = nn.Sequential(
            nn.Flatten(start_dim=1),  # Flatten the tokens and feature dimensions for each batch
            nn.Linear(input_dim * 24, 128),  # Adjust for 24 tokens (12 home + 12 away)
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )

    def forward(self, home_tokens, away_tokens):
        # Apply home and away transformers
        home_output = self.home_transformer(home_tokens)  # Shape: [batch_size, 12, input_dim]
        away_output = self.away_transformer(away_tokens)  # Shape: [batch_size, 12, input_dim]
        
        # Concatenate outputs along the token dimension
        combined_input = torch.cat((home_output, away_output), dim=1)  # Shape: [batch_size, 24, input_dim]
        
        # Pass through combined transformer
        combined_output = self.combined_transformer(combined_input)  # Shape: [batch_size, 24, input_dim]
        
        # Flatten and pass through fully connected layers
        scores = self.fc(combined_output)  # Shape: [batch_size, output_dim]
        return scores


# Initialize model
input_dim = 48
hidden_dim = 128
num_heads = 4
num_layers = 2
model = LineupLab(input_dim, hidden_dim, num_heads, num_layers)

# Optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

# Training loop
epochs = 10
for epoch in range(epochs):
    model.train()
    epoch_loss = 0
    for batch in data_loader:
        if batch is None:
            continue
        home_tokens, away_tokens, target = batch
        optimizer.zero_grad()
        prediction = model(home_tokens, away_tokens)
        loss = loss_fn(prediction, target)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss:.4f}")


Skipping game 17959: Static tensor not found: static_game_tensors_redo/season_2003/17959_home_static.pt
Epoch 1/10, Loss: 119347.3756
Skipping game 17959: Static tensor not found: static_game_tensors_redo/season_2003/17959_home_static.pt
Epoch 2/10, Loss: 11867.7079
Skipping game 17959: Static tensor not found: static_game_tensors_redo/season_2003/17959_home_static.pt
Epoch 3/10, Loss: 11704.0942
Skipping game 17959: Static tensor not found: static_game_tensors_redo/season_2003/17959_home_static.pt
Epoch 4/10, Loss: 11834.8847
Skipping game 17959: Static tensor not found: static_game_tensors_redo/season_2003/17959_home_static.pt
Epoch 5/10, Loss: 12010.7127
Skipping game 17959: Static tensor not found: static_game_tensors_redo/season_2003/17959_home_static.pt
Epoch 6/10, Loss: 11470.1634
Skipping game 17959: Static tensor not found: static_game_tensors_redo/season_2003/17959_home_static.pt
Epoch 7/10, Loss: 10307.1736
Skipping game 17959: Static tensor not found: static_game_tensors_re

In [55]:
warnings.filterwarnings(
    "ignore",
    message=(
        "enable_nested_tensor is True, but self.use_nested_tensor is False because"
        " encoder_layer.self_attn.batch_first was not True"
    ),
)
warnings.filterwarnings(
    "ignore",
    message=r"You are using `torch.load` with `weights_only=False`.*",
    category=FutureWarning
)

STATIC_TENSOR_DIR = "static_game_tensors_redo"

# Dataset class for all seasons
class NBAFullDataset(Dataset):
    def __init__(self, games_df):
        self.games_df = games_df
        self.game_ids = self.games_df.index.tolist()

    def __len__(self):
        return len(self.game_ids)

    def __getitem__(self, idx):
        game_id = self.game_ids[idx]

        try:
            # Extract the row as a Series
            game_row = self.games_df.loc[game_id]
            if isinstance(game_row, pd.DataFrame):  # Handle duplicate game_id scenario
                game_row = game_row.iloc[0]  # Take the first occurrence

            # Ensure all scalar values
            season = int(game_row["season"])
            home_team_id = int(game_row["home_team_id"])
            away_team_id = int(game_row["visitor_team_id"])

            # Use the existing create_game_tokens function
            home_tokens, away_tokens = create_game_tokens(
                game_id, season, home_team_id, away_team_id, game_row, player_id_mapping
            )

            # Target extraction
            target = torch.tensor([game_row["home_team_score"], game_row["visitor_team_score"]], dtype=torch.float32)

            return home_tokens, away_tokens, target

        except FileNotFoundError as e:
            print(f"Skipping game {game_id}: {e}")
            return None  # Returning None will be handled in the DataLoader collate function


# Custom collate function to handle None
def custom_collate_fn(batch):
    # Filter out None values
    batch = [item for item in batch if item is not None]
    if len(batch) == 0:
        return None, None, None
    return default_collate(batch)  # Use PyTorch's default collate for valid data


# Load games and player mapping
games_df = pd.read_csv("expanded_games_2003_2023.csv").set_index("game_id")
player_bio_df = pd.read_csv("playerBio.csv").set_index("player_id")
player_id_mapping = player_bio_df["normalized_id"].to_dict()

# Split the data by year
train_df = games_df[games_df["season"].isin(range(2003, 2019))]
val_df = games_df[games_df["season"].isin(range(2019, 2022))]
test_df = games_df[games_df["season"].isin(range(2022, 2024))]

# Print sizes to verify splits
print(f"Training data: {len(train_df)} games")
print(f"Validation data: {len(val_df)} games")
print(f"Testing data: {len(test_df)} games")

# Create datasets
train_dataset = NBAFullDataset(train_df)
val_dataset = NBAFullDataset(val_df)
test_dataset = NBAFullDataset(test_df)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=0, collate_fn=custom_collate_fn)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False, num_workers=0, collate_fn=custom_collate_fn)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False, num_workers=0, collate_fn=custom_collate_fn)

# Model definition remains unchanged
class LineupLab(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, num_layers, output_dim=2):
        super(LineupLab, self).__init__()
        self.home_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.away_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.combined_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.fc = nn.Sequential(
            nn.Flatten(start_dim=1),  # Flatten the tokens and feature dimensions for each batch
            nn.Linear(input_dim * 24, 128),  # Adjust for 24 tokens (12 home + 12 away)
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )

    def forward(self, home_tokens, away_tokens):
        # Apply home and away transformers
        home_output = self.home_transformer(home_tokens)  # Shape: [batch_size, 12, input_dim]
        away_output = self.away_transformer(away_tokens)  # Shape: [batch_size, 12, input_dim]

        # Concatenate outputs along the token dimension
        combined_input = torch.cat((home_output, away_output), dim=1)  # Shape: [batch_size, 24, input_dim]

        # Pass through combined transformer
        combined_output = self.combined_transformer(combined_input)  # Shape: [batch_size, 24, input_dim]

        # Flatten and pass through fully connected layers
        scores = self.fc(combined_output)  # Shape: [batch_size, output_dim]
        return scores

# Initialize model, optimizer, and loss function
input_dim = 48
hidden_dim = 128
num_heads = 4
num_layers = 2
model = LineupLab(input_dim, hidden_dim, num_heads, num_layers)

optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

# Percentage loss function (MAPE)
def percentage_loss(pred, target):
    return torch.mean(torch.abs((pred - target) / target)) * 100

# Training and validation loop
epochs = 10
for epoch in range(epochs):
    # Training phase
    model.train()
    train_loss = 0
    train_percentage_loss = 0
    train_samples = 0

    for batch in train_loader:
        if batch is None:
            continue

        # Unpack batch
        home_tokens, away_tokens, target = batch

        # Forward pass
        optimizer.zero_grad()
        prediction = model(home_tokens, away_tokens)

        # Compute losses
        mse_loss = loss_fn(prediction, target)
        perc_loss = percentage_loss(prediction, target)

        # Backward pass
        mse_loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping
        optimizer.step()

        train_loss += mse_loss.item()
        train_percentage_loss += perc_loss.item()
        train_samples += target.size(0)  # Add the number of games in the batch

    train_loss_per_game_team = train_loss / (2 * train_samples)  # Divide by 2 for home and away
    train_percentage_loss_avg = train_percentage_loss / train_samples

    # Validation phase
    model.eval()
    val_loss = 0
    val_percentage_loss = 0
    val_samples = 0

    with torch.no_grad():
        for batch in val_loader:
            if batch is None:
                continue

            # Unpack batch
            home_tokens, away_tokens, target = batch

            # Forward pass
            prediction = model(home_tokens, away_tokens)

            # Compute losses
            mse_loss = loss_fn(prediction, target)
            perc_loss = percentage_loss(prediction, target)

            val_loss += mse_loss.item()
            val_percentage_loss += perc_loss.item()
            val_samples += target.size(0)  # Add the number of games in the batch

    val_loss_per_game_team = val_loss / (2 * val_samples)  # Divide by 2 for home and away
    val_percentage_loss_avg = val_percentage_loss / val_samples

    print(f"Epoch {epoch + 1}/{epochs}")
    print(f"Train Loss: {train_loss:.4f}, Train Loss/Game/Team: {train_loss_per_game_team:.4f}")
    print(f"Train Percentage Loss: {train_percentage_loss_avg:.2f}%")
    print(f"Validation Loss: {val_loss:.4f}, Validation Loss/Game/Team: {val_loss_per_game_team:.4f}")
    print(f"Validation Percentage Loss: {val_percentage_loss_avg:.2f}%")

# Testing phase
model.eval()
test_loss = 0
test_percentage_loss = 0
test_samples = 0

with torch.no_grad():
    for batch in test_loader:
        if batch is None:
            continue

        home_tokens, away_tokens, target = batch

        # Forward pass
        prediction = model(home_tokens, away_tokens)

        # Compute losses
        mse_loss = loss_fn(prediction, target)
        perc_loss = percentage_loss(prediction, target)

        test_loss += mse_loss.item()
        test_percentage_loss += perc_loss.item()
        test_samples += target.size(0)

test_loss_per_game_team = test_loss / (2 * test_samples)
test_percentage_loss_avg = test_percentage_loss / test_samples

print(f"Cumulative Test Loss: {test_loss:.4f}")
print(f"Test Loss Per Game Per Team: {test_loss_per_game_team:.4f}")
print(f"Test Percentage Loss: {test_percentage_loss_avg:.2f}%")


Training data: 20685 games
Validation data: 3649 games
Testing data: 2613 games
Skipping game 19630: Static tensor not found: static_game_tensors_redo/season_2005/19630_home_static.pt
Skipping game 34983: Static tensor not found: static_game_tensors_redo/season_2016/34983_home_static.pt
Skipping game 25294: Static tensor not found: static_game_tensors_redo/season_2009/25294_home_static.pt
Skipping game 49061: Static tensor not found: static_game_tensors_redo/season_2018/49061_home_static.pt
Skipping game 22715: Static tensor not found: static_game_tensors_redo/season_2007/22715_home_static.pt
Skipping game 21576: Static tensor not found: static_game_tensors_redo/season_2008/21576_home_static.pt
Skipping game 46970: Static tensor not found: static_game_tensors_redo/season_2018/46970_home_static.pt
Skipping game 49071: Static tensor not found: static_game_tensors_redo/season_2018/49071_home_static.pt
Skipping game 19068: Static tensor not found: static_game_tensors_redo/season_2006/19068

In [60]:
# Directory to save embeddings
embedding_save_dir = "saved_embeddings"
os.makedirs(embedding_save_dir, exist_ok=True)

# Save each embedding separately
torch.save(player_embedding.state_dict(), os.path.join(embedding_save_dir, "player_embedding.pt"))
torch.save(position_embedding.state_dict(), os.path.join(embedding_save_dir, "position_embedding.pt"))
torch.save(team_embedding.state_dict(), os.path.join(embedding_save_dir, "team_embedding.pt"))
torch.save(season_embedding.state_dict(), os.path.join(embedding_save_dir, "season_embedding.pt"))

print("Dynamic embeddings saved successfully.")


Dynamic embeddings saved successfully.


In [59]:
# Reset dynamic embeddings
def reset_embeddings():
    global player_embedding, position_embedding, team_embedding, season_embedding

    # Reinitialize embeddings with the same dimensions and ranges
    player_embedding = nn.Embedding(max_normalized_player_id + 1, E_player)
    position_embedding = nn.Embedding(max_position_id + 1, E_position)
    team_embedding = nn.Embedding(max_team_id + 1, E_team)
    season_embedding = nn.Embedding(max_season_id + 1, E_season)

    print("Dynamic embeddings have been reset to their initial state.")

# Example usage
reset_embeddings()


Dynamic embeddings have been reset to their initial state.


In [57]:
# Directory to load embeddings
embedding_save_dir = "saved_embeddings"

# Reload embeddings
player_embedding = nn.Embedding(max_normalized_player_id + 1, E_player)
player_embedding.load_state_dict(torch.load(os.path.join(embedding_save_dir, "player_embedding.pt")))

position_embedding = nn.Embedding(max_position_id + 1, E_position)
position_embedding.load_state_dict(torch.load(os.path.join(embedding_save_dir, "position_embedding.pt")))

team_embedding = nn.Embedding(max_team_id + 1, E_team)
team_embedding.load_state_dict(torch.load(os.path.join(embedding_save_dir, "team_embedding.pt")))

season_embedding = nn.Embedding(max_season_id + 1, E_season)
season_embedding.load_state_dict(torch.load(os.path.join(embedding_save_dir, "season_embedding.pt")))

print("Dynamic embeddings loaded successfully.")


Dynamic embeddings loaded successfully.


In [46]:
# Full training cycle for LineupLab with training, validation, and testing.
# Define year ranges for splits
train_years = range(2003, 2019)  # Training: 2003–2018
val_years = range(2019, 2022)    # Validation: 2019–2021
test_years = range(2022, 2024)   # Testing: 2022–2023

# Split the data by year
games_df = pd.read_csv("expanded_games_2003_2023.csv").set_index("game_id")
train_df = games_df[games_df["season"].isin(train_years)]
val_df = games_df[games_df["season"].isin(val_years)]
test_df = games_df[games_df["season"].isin(test_years)]

# Print sizes to verify splits
print(f"Training data: {len(train_df)} games")
print(f"Validation data: {len(val_df)} games")
print(f"Testing data: {len(test_df)} games")

# Dataset class for all seasons with data validation
class NBAFullDataset(Dataset):
    def __init__(self, games_df):
        # Clean dataset: remove rows with missing or invalid data
        self.games_df = games_df.dropna()
        self.game_ids = self.games_df.index.tolist()
        self.position_mapping = {"C": 0, "F": 1, "G": 2, None: 3}  # Map positions to integers

    def __len__(self):
        return len(self.game_ids)

    def __getitem__(self, idx):
        game_id = self.game_ids[idx]
        try:
            # Extract the row as a Series
            game_row = self.games_df.loc[game_id]
            if isinstance(game_row, pd.DataFrame):  # Handle duplicate game_id scenario
                game_row = game_row.iloc[0]  # Take the first occurrence

            # Ensure all scalar values
            season = int(game_row["season"])
            home_team_id = int(game_row["home_team_id"])
            away_team_id = int(game_row["visitor_team_id"])

            # Load static tensors
            home_static_tensor_path = f"static_game_tensors_redo/season_{season}/{game_id}_home_static.pt"
            away_static_tensor_path = f"static_game_tensors_redo/season_{season}/{game_id}_away_static.pt"

            if not os.path.exists(home_static_tensor_path) or not os.path.exists(away_static_tensor_path):
                raise FileNotFoundError(f"Static tensor not found for game {game_id}")

            home_static_tensor = torch.load(home_static_tensor_path)
            away_static_tensor = torch.load(away_static_tensor_path)

            # Extract player data dynamically from the dataframe
            home_players = [
                {"player_id": int(game_row[f"home_player_{i}_id"]) if not pd.isna(game_row[f"home_player_{i}_id"]) else None,
                 "position_id": self.position_mapping.get(game_row[f"home_player_{i}_position"], 3)}
                for i in range(1, 13)
            ]
            away_players = [
                {"player_id": int(game_row[f"away_player_{i}_id"]) if not pd.isna(game_row[f"away_player_{i}_id"]) else None,
                 "position_id": self.position_mapping.get(game_row[f"away_player_{i}_position"], 3)}
                for i in range(1, 13)
            ]

            # Create tokens for both teams
            home_tokens, away_tokens = create_game_tokens(
                game_id, season, home_team_id, away_team_id, game_row
            )

            # Target extraction
            target = torch.tensor([game_row["home_team_score"], game_row["visitor_team_score"]], dtype=torch.float32)

            # Sanity checks
            assert not torch.isnan(home_tokens).any(), f"NaN in home tokens for game {game_id}"
            assert not torch.isnan(away_tokens).any(), f"NaN in away tokens for game {game_id}"
            assert not torch.isnan(target).any(), f"NaN in target for game {game_id}"

            return home_tokens, away_tokens, target

        except FileNotFoundError as e:
            print(f"Skipping game {game_id}: {e}")
            return None  # Return None for missing data

# Custom collate function to handle None
def custom_collate_fn(batch):
    batch = [item for item in batch if item is not None]  # Filter out None values
    if len(batch) == 0:
        return None, None, None
    return default_collate(batch)  # Use PyTorch's default collate for valid data

# Create datasets
train_dataset = NBAFullDataset(train_df)
val_dataset = NBAFullDataset(val_df)
test_dataset = NBAFullDataset(test_df)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=0, collate_fn=custom_collate_fn)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False, num_workers=0, collate_fn=custom_collate_fn)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False, num_workers=0, collate_fn=custom_collate_fn)

# Model definition
class LineupLab(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, num_layers, output_dim=2):
        super(LineupLab, self).__init__()
        self.home_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.away_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.combined_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers,
        )
        self.fc = nn.Sequential(
            nn.Flatten(start_dim=1),  # Flatten the tokens and feature dimensions for each batch
            nn.Linear(input_dim * 24, 128),  # Adjust for 24 tokens (12 home + 12 away)
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )

    def forward(self, home_tokens, away_tokens):
        # Apply home and away transformers
        home_output = self.home_transformer(home_tokens)  # Shape: [batch_size, 12, input_dim]
        away_output = self.away_transformer(away_tokens)  # Shape: [batch_size, 12, input_dim]

        # Concatenate outputs along the token dimension
        combined_input = torch.cat((home_output, away_output), dim=1)  # Shape: [batch_size, 24, input_dim]

        # Pass through combined transformer
        combined_output = self.combined_transformer(combined_input)  # Shape: [batch_size, 24, input_dim]

        # Flatten and pass through fully connected layers
        scores = self.fc(combined_output)  # Shape: [batch_size, output_dim]
        return scores

# Initialize model, optimizer, and loss function
input_dim = 48
hidden_dim = 128
num_heads = 4
num_layers = 2
model = LineupLab(input_dim, hidden_dim, num_heads, num_layers)

optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

# Training loop with debugging
epochs = 10
for epoch in range(epochs):
    model.train()
    train_loss = 0
    for batch in train_loader:
        if batch is None:
            continue

        # Unpack batch
        home_tokens, away_tokens, target = batch

        # Forward pass
        optimizer.zero_grad()
        prediction = model(home_tokens, away_tokens)

        # Compute loss
        loss = loss_fn(prediction, target)

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # Gradient clipping
        optimizer.step()

        train_loss += loss.item()

    print(f"Epoch {epoch + 1}/{epochs}, Train Loss: {train_loss:.4f}")

# Testing phase
model.eval()
test_loss = 0
with torch.no_grad():
    for batch in test_loader:
        if batch is None:
            continue
        home_tokens, away_tokens, target = batch
        prediction = model(home_tokens, away_tokens)
        loss = loss_fn(prediction, target)
        test_loss += loss.item()
print(f"Cumulative Test Loss: {test_loss:.4f}")        
print(f"Test Loss Per Game Per Team: {test_loss / (2 * len(test_loader.dataset)):.4f}")


Training data: 20685 games
Validation data: 3649 games
Testing data: 2613 games


AssertionError: NaN detected in player token

In [None]:
# Full training cycle for LineupLab with training, validation, and testing.
# Define year ranges for splits
train_years = range(2003, 2019)  # Training: 2003–2018
val_years = range(2019, 2022)    # Validation: 2019–2021
test_years = range(2022, 2024)   # Testing: 2022–2023

# Split the data by year
games_df = pd.read_csv("expanded_games_2003_2023.csv").set_index("game_id")
train_df = games_df[games_df["season"].isin(train_years)]
val_df = games_df[games_df["season"].isin(val_years)]
test_df = games_df[games_df["season"].isin(test_years)]

# Print sizes to verify splits
print(f"Training data: {len(train_df)} games")
print(f"Validation data: {len(val_df)} games")
print(f"Testing data: {len(test_df)} games")

# Dataset class for all seasons
class NBAFullDataset(Dataset):
    def __init__(self, games_df, position_mapping):
        self.games_df = games_df
        self.position_mapping = position_mapping
        self.game_ids = self.games_df.index.tolist()

    def __len__(self):
        return len(self.game_ids)

    def __getitem__(self, idx):
        game_id = self.game_ids[idx]
        try:
            game_row = self.games_df.loc[game_id]
            if isinstance(game_row, pd.DataFrame):  # Handle duplicate game_id scenario
                game_row = game_row.iloc[0]

            season = int(game_row["season"])
            home_team_id = int(game_row["home_team_id"])
            away_team_id = int(game_row["visitor_team_id"])

            # Load static tensors
            home_static_tensor_path = f"static_game_tensors/season_{season}/{game_id}_home_static.pt"
            away_static_tensor_path = f"static_game_tensors/season_{season}/{game_id}_away_static.pt"
            if not os.path.exists(home_static_tensor_path) or not os.path.exists(away_static_tensor_path):
                raise FileNotFoundError(f"Static tensor not found for game {game_id}")

            home_static_tensor = torch.load(home_static_tensor_path)
            away_static_tensor = torch.load(away_static_tensor_path)

            # Extract player data
            home_players = [
                {"player_id": int(game_row[f"home_player_{i}_id"]) if not pd.isna(game_row[f"home_player_{i}_id"]) else None,
                 "position_id": self.position_mapping.get(game_row[f"home_player_{i}_position"], 3)}
                for i in range(1, 13)
            ]
            away_players = [
                {"player_id": int(game_row[f"away_player_{i}_id"]) if not pd.isna(game_row[f"away_player_{i}_id"]) else None,
                 "position_id": self.position_mapping.get(game_row[f"away_player_{i}_position"], 3)}
                for i in range(1, 13)
            ]

            return {
                "home_static_tensor": home_static_tensor,
                "away_static_tensor": away_static_tensor,
                "home_players": home_players,
                "away_players": away_players,
                "season_id": season - 2003,  # Normalize season
                "home_team_id": home_team_id,
                "away_team_id": away_team_id,
                "target": torch.tensor([game_row["home_team_score"], game_row["visitor_team_score"]], dtype=torch.float32)
            }
        except FileNotFoundError as e:
            print(f"Skipping game {game_id}: {e}")
            return None


# Custom collate function
def custom_collate_fn(batch):
    batch = [item for item in batch if item is not None]
    if len(batch) == 0:
        return None
    return default_collate(batch)

# Embedding parameters
max_player_id = 2500
max_position_id = 3
max_team_id = 32
max_season_id = 22

# Model definition
class LineupLab(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, num_layers, max_player_id, max_position_id, max_team_id, max_season_id, output_dim=2):
        super(LineupLab, self).__init__()
        # Embedding layers
        self.player_embedding = nn.Embedding(max_player_id + 1, 18)
        self.position_embedding = nn.Embedding(max_position_id + 1, 4)
        self.team_embedding = nn.Embedding(max_team_id + 1, 8)
        self.season_embedding = nn.Embedding(max_season_id + 1, 4)

        # Transformers
        self.home_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers
        )
        self.away_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers
        )
        self.combined_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=input_dim, nhead=num_heads, dim_feedforward=hidden_dim),
            num_layers=num_layers
        )

        # Fully connected layers
        self.fc = nn.Sequential(
            nn.Flatten(start_dim=1),
            nn.Linear(input_dim * 24, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )

    def forward(self, data):
        # Generate tokens for both teams
        def generate_team_tokens(players, team_id_for, team_id_against, static_tensor, season_id):
            tokens = []
            for i, player in enumerate(players):
                if player["player_id"] is None:
                    continue
                player_token = torch.cat([
                    self.player_embedding(torch.tensor(player["player_id"])),
                    self.position_embedding(torch.tensor(player["position_id"])),
                    self.team_embedding(torch.tensor(team_id_for)),
                    self.team_embedding(torch.tensor(team_id_against)),
                    self.season_embedding(torch.tensor(season_id)),
                    static_tensor[i]
                ])
                tokens.append(player_token)
            return torch.stack(tokens)

        home_tokens = generate_team_tokens(
            data["home_players"], data["home_team_id"], data["away_team_id"], data["home_static_tensor"], data["season_id"]
        )
        away_tokens = generate_team_tokens(
            data["away_players"], data["away_team_id"], data["home_team_id"], data["away_static_tensor"], data["season_id"]
        )

        # Apply transformers
        home_output = self.home_transformer(home_tokens)
        away_output = self.away_transformer(away_tokens)
        combined_input = torch.cat((home_output, away_output), dim=1)
        combined_output = self.combined_transformer(combined_input)

        # Fully connected layers
        return self.fc(combined_output)


# Create datasets
position_mapping = {"C": 0, "F": 1, "G": 2, None: 3}
train_dataset = NBAFullDataset(train_df, position_mapping)
val_dataset = NBAFullDataset(val_df, position_mapping)
test_dataset = NBAFullDataset(test_df, position_mapping)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=custom_collate_fn)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False, collate_fn=custom_collate_fn)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False, collate_fn=custom_collate_fn)

# Initialize model, optimizer, and loss function
model = LineupLab(
    input_dim=48,
    hidden_dim=128,
    num_heads=4,
    num_layers=2,
    max_player_id=max_player_id,
    max_position_id=max_position_id,
    max_team_id=max_team_id,
    max_season_id=max_season_id
)
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

# Training and validation loop
epochs = 10
for epoch in range(epochs):
    model.train()
    train_loss = 0
    for batch in train_loader:
        if batch is None:
            continue
        optimizer.zero_grad()
        prediction = model(batch)
        loss = loss_fn(prediction, batch["target"])
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            if batch is None:
                continue
            prediction = model(batch)
            loss = loss_fn(prediction, batch["target"])
            val_loss += loss.item()

    print(f"Epoch {epoch + 1}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")


In [None]:
# Function to test the model on a single game
def test_model_on_game(game_id, season, model, games_df, static_tensor_dir="static_game_tensors"):
    model.eval()  # Set the model to evaluation mode
    position_mapping = {"C": 0, "F": 1, "G": 2, None: 3}  # Map positions to integers

    try:
        # Extract game row from DataFrame
        game_row = games_df.loc[game_id]
        if isinstance(game_row, pd.DataFrame):  # Handle duplicate game_id scenario
            game_row = game_row.iloc[0]  # Take the first occurrence

        # Load static tensors
        home_static_tensor = load_static_tensor(game_id, "home", season)
        away_static_tensor = load_static_tensor(game_id, "away", season)

        # Extract player data dynamically
        home_players = [
            {"player_id": int(game_row[f"home_player_{i}_id"]) if not pd.isna(game_row[f"home_player_{i}_id"]) else None,
             "position_id": position_mapping.get(game_row[f"home_player_{i}_position"], 3)}
            for i in range(1, 13)
        ]
        away_players = [
            {"player_id": int(game_row[f"away_player_{i}_id"]) if not pd.isna(game_row[f"away_player_{i}_id"]) else None,
             "position_id": position_mapping.get(game_row[f"away_player_{i}_position"], 3)}
            for i in range(1, 13)
        ]

        # Create tokens for both teams
        home_tokens, away_tokens = create_game_tokens(
            game_id, season, int(game_row["home_team_id"]), int(game_row["visitor_team_id"]), game_row
        )

        # Check if home_tokens and away_tokens are already tensors, otherwise convert them
        if not isinstance(home_tokens, torch.Tensor):
            home_tokens_tensor = torch.tensor(home_tokens).unsqueeze(0)  # Shape: [1, 12, input_dim]
        else:
            home_tokens_tensor = home_tokens.unsqueeze(0)

        if not isinstance(away_tokens, torch.Tensor):
            away_tokens_tensor = torch.tensor(away_tokens).unsqueeze(0)  # Shape: [1, 12, input_dim]
        else:
            away_tokens_tensor = away_tokens.unsqueeze(0)



        # Perform prediction
        with torch.no_grad():
            predicted_scores = model(home_tokens_tensor, away_tokens_tensor)

        # Extract ground truth scores
        actual_scores = torch.tensor(
            [game_row["home_team_score"], game_row["visitor_team_score"]], dtype=torch.float32
        )

        # Output results
        print(f"Game ID: {game_id}")
        print(f"Season: {season}")
        print(f"Predicted Scores: Home: {predicted_scores[0, 0].item():.2f}, Away: {predicted_scores[0, 1].item():.2f}")
        print(f"Actual Scores: Home: {actual_scores[0].item()}, Away: {actual_scores[1].item()}")

    except Exception as e:
        print(f"Error processing game {game_id}: {e}")


# Example usage
games_df = pd.read_csv("expanded_games_2003_2023.csv").set_index("game_id")
game_id_to_test = 15882412  # Replace with the actual game ID from another season
season_to_test = 2023    # Replace with the actual season for the game ID

test_model_on_game(game_id_to_test, season_to_test, model, games_df)


## **III. Hyperparameter Tuning**

### Overview
This section explores the impact of various hyperparameters on the model's performance. By systematically adjusting key parameters, we aim to optimize the transformer network for better predictions.

### Goals
1. **Experimentation**:
   - Test different values for hyperparameters such as:
     - Learning rate.
     - Number of epochs.
     - Optimizer (e.g., Adam, SGD).
     - Batch size.
     - Number of transformer layers and attention heads.
2. **Performance Evaluation**:
   - Assess the impact of each hyperparameter on model accuracy, loss, and F1-score.
   - Document observations to identify the most effective configurations.

### Implementation Steps
1. **Baseline Configuration**:
   - Train the model with default or commonly used hyperparameter values.
   - Record baseline performance metrics.
2. **Iterative Testing**:
   - Adjust one hyperparameter at a time while keeping others constant.
   - Monitor changes in performance and identify trends.
3. **Optimal Configuration**:
   - Combine the best-performing hyperparameters into a final configuration for training the model.

This section will detail the experiments conducted and the resulting insights into hyperparameter optimization for the transformer network.


## **IV. Evaluation and Analysis**

### Overview
In this section, we evaluate the performance of the transformer-based model using two primary loss metrics and other relevant performance indicators. The focus will be on understanding the model's strengths, limitations, and areas for improvement.

### Metrics
1. **Score Prediction Accuracy**:
   - Measure the actual distance between predicted scores and the true game scores (e.g., Mean Squared Error or Mean Absolute Error).
   - Assess how well the model captures the scoring trends in games.
2. **Winning Outcome Prediction**:
   - Evaluate the model's ability to correctly predict the winning team (e.g., Accuracy, F1-score).
   - Analyze classification performance using confusion matrices.

### Goals
1. **Performance Metrics**:
   - Quantify how accurately the model predicts game outcomes and scores.
   - Identify patterns or biases in the model’s predictions.
2. **Visual Representations**:
   - Plot training and validation loss over epochs.
   - Generate confusion matrices for winning outcome predictions.
   - Visualize score prediction distributions.
3. **Strengths and Limitations**:
   - Discuss areas where the model performs well and where it struggles.
   - Identify real-world scenarios where the model could be applied effectively.
4. **Future Improvements**:
   - Suggest ways to enhance the model, such as adjusting hyperparameters, adding new features, or increasing dataset size.

This section will summarize the model's overall performance, supported by quantitative metrics and visualizations.
