# LineupLab: NBA Matchup Prediction using Transformer Networks

## Project Overview
This project is part of the final requirement for the **Introduction to Deep Learning** course. The objective is to develop a machine learning model that predicts NBA matchup outcomes based on player lineups and team configurations. 

By leveraging the BallDontLie API, we will retrieve, clean, and process NBA data to create a dataset suitable for training and testing. A transformer-based deep learning model will be implemented using PyTorch to analyze player lineups and generate predictions.

## Goals
1. **Data Exploration**: Analyze and preprocess NBA data to ensure compatibility with the model.
2. **Model Creation**: Build a transformer network to learn relationships between players in a lineup and predict game outcomes.
3. **Hyperparameter Tuning**: Experiment with learning rate, optimizer, number of epochs, and other hyperparameters to optimize performance.
4. **Evaluation and Analysis**: Evaluate model performance using metrics such as accuracy, loss, and F1-score. Provide insights into the model's strengths, limitations, and potential improvements.

## Key Features
- **Transformer Networks**: Leveraging multi-head attention to capture player and team relationships.
- **Comprehensive Dataset**: Utilizing player stats, game results, and team information from the BallDontLie API.
- **Visualization and Analysis**: Incorporating visual representations of data distributions, training progress, and performance metrics.

This notebook will serve as the main documentation for the project, including all steps from data retrieval to model evaluation.


In [25]:
# Import necessary libraries
import requests  # For API requests
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import torch  # For deep learning model implementation
import torch.nn as nn  # For neural network components
import torch.optim as optim  # For optimization algorithms
from torch.utils.data import Dataset, DataLoader  # For data handling in PyTorch
from nba_api.stats.endpoints import leaguedashlineups
import json
import time

# Ensure plots are displayed inline
%matplotlib inline

# Display confirmation message
print("Libraries successfully loaded!")


Libraries successfully loaded!


## Data Exploration and Preparation

### Overview
The first step in this project is to collect and clean NBA data from the **BallDontLie API**. This involves retrieving player statistics, game results, and team information to create a dataset suitable for training a transformer-based deep learning model.

### Goals
1. **Data Collection**: Retrieve comprehensive NBA data, including:
   - Player information (e.g., stats, positions, teams).
   - Game results and team statistics.
   - Lineup configurations and other relevant details.
2. **Data Cleaning**: 
   - Handle missing or inconsistent data.
   - Format and preprocess data into a tokenized structure compatible with transformer networks.
   - Normalize numerical features for model training.
3. **Data Analysis**: Explore the dataset to understand key features, distributions, and relationships between variables.

This section will document all steps taken to prepare the data for input into the machine learning model, ensuring it is accurate, complete, and well-structured.
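
The tokenization and normalization steps listed above can be sketched as follows. This is a minimal illustration, not the notebook's final preprocessing: the two-game `sample` frame, the helper names `build_player_vocab`, `tokenize`, and `zscore`, and the choice of index 0 as a padding token are all assumptions made for the example. Only the column naming (`home_player_{i}_id`, `home_player_{i}_off_rating`) follows the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical two-game sample (2 player slots shown instead of 12).
sample = pd.DataFrame({
    "home_player_1_id": [237, 115],
    "home_player_2_id": [None, 237],          # empty roster slot in game 1
    "home_player_1_off_rating": [110.0, 120.0],
    "home_player_2_off_rating": [None, 100.0],
})

def build_player_vocab(df, id_columns):
    """Map each player ID to a token index; index 0 is reserved for padding."""
    ids = pd.unique(df[id_columns].to_numpy(dtype=float).ravel())
    ids = sorted(int(i) for i in ids if pd.notna(i))
    return {player_id: idx + 1 for idx, player_id in enumerate(ids)}

def tokenize(df, id_columns, vocab):
    """Replace player IDs with token indices; missing slots become 0."""
    return df[id_columns].apply(
        lambda col: col.map(lambda v: vocab[int(v)] if pd.notna(v) else 0)
    ).astype(int)

def zscore(df, columns):
    """Standardize with a mean/std pooled over all slots; missing values become 0."""
    values = df[columns]
    flat = values.to_numpy(dtype=float).ravel()
    mean, std = np.nanmean(flat), np.nanstd(flat)
    return ((values - mean) / std).fillna(0.0)

id_cols = ["home_player_1_id", "home_player_2_id"]
rating_cols = ["home_player_1_off_rating", "home_player_2_off_rating"]

vocab = build_player_vocab(sample, id_cols)
tokens = tokenize(sample, id_cols, vocab)
normalized = zscore(sample, rating_cols)
print(vocab)
print(tokens)
print(normalized)
```

In a real pipeline the mean and standard deviation would normally be computed on the training split only and then reused for validation and test data.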


In [None]:
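# Base URL and API key
# The key is read from the environment rather than hard-coded in the notebook.
# (The variable name BALLDONTLIE_API_KEY is an assumption; use whatever name you export.)
import os
BASE_URL = "https://api.balldontlie.io/v1/games"
API_KEY = os.environ["BALLDONTLIE_API_KEY"]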

# Headers for the API request
HEADERS = {
    "Authorization": API_KEY
}

# Define the seasons to retrieve (2003 to 2023)
START_YEAR = 2003
END_YEAR = 2023

# List to store game data
all_games = []

# Function to fetch games for a specific season
def fetch_games_for_season(season):
    cursor = None  # Start without a cursor
    while True:
        print(f"Fetching season {season}, cursor: {cursor}")
        
        # Construct the API URL with cursor for pagination
        url = f"{BASE_URL}?seasons[]={season}&per_page=100"
        if cursor:
            url += f"&cursor={cursor}"
        
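        response = requests.get(url, headers=HEADERS)
        
        if response.status_code == 429:
            # Rate limited: wait, then retry the same page
            print("Rate limit hit (429). Retrying in 60 seconds...")
            time.sleep(60)
            continue
        # Fail fast on any other error (e.g., 401) instead of retrying forever
        response.raise_for_status()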

        data = response.json()
        
        # Ensure the response contains new data
        if not data['data']:
            print(f"No more data found for season {season}, exiting loop.")
            break  # Exit loop if no more games are found

        # Add new games to the list
        all_games.extend(data['data'])
        print(f"Fetched {len(data['data'])} games. Total games collected: {len(all_games)}")
        
        # Update the cursor for the next page
        cursor = data.get('meta', {}).get('next_cursor', None)
        if not cursor:  # No more pages
            print(f"All pages fetched for season {season}.")
            break
        
        # Throttle requests to avoid hitting rate limits
        time.sleep(0.5)

# Fetch data for each season
for season in range(START_YEAR, END_YEAR + 1):
    fetch_games_for_season(season)

# Process the collected data into a DataFrame
print("Processing data into DataFrame...")
games_data = [
    {
        "id": game["id"],
        "date": game["date"],
        "season": game["season"],
        "status": game["status"],
        "home_team_score": game["home_team_score"],
        "visitor_team_score": game["visitor_team_score"],
        "home_team_name": game["home_team"]["full_name"],
        "home_team_id": game["home_team"]["id"],
        "visitor_team_name": game["visitor_team"]["full_name"],
        "visitor_team_id": game["visitor_team"]["id"]
    }
    for game in all_games
]

games_df = pd.DataFrame(games_data)

# Save the data to a CSV file
output_file = "games_2003_2023.csv"
games_df.to_csv(output_file, index=False)
print(f"Data saved to {output_file}.")


Fetching season 2003, cursor: None
Fetched 100 games. Total games collected: 100
Fetching season 2003, cursor: 16582
Fetched 100 games. Total games collected: 200
Fetching season 2003, cursor: 17359
Fetched 100 games. Total games collected: 300
Fetching season 2003, cursor: 12772
Fetched 100 games. Total games collected: 400
Fetching season 2003, cursor: 13238
Fetched 100 games. Total games collected: 500
Fetching season 2003, cursor: 16627
Fetched 100 games. Total games collected: 600
Fetching season 2003, cursor: 15539
Fetched 100 games. Total games collected: 700
Fetching season 2003, cursor: 16428
Fetched 100 games. Total games collected: 800
Fetching season 2003, cursor: 17812
Fetched 100 games. Total games collected: 900
Fetching season 2003, cursor: 16065
Fetched 100 games. Total games collected: 1000
Fetching season 2003, cursor: 13761
Fetched 100 games. Total games collected: 1100
Fetching season 2003, cursor: 13772
Fetched 100 games. Total games collected: 1200
[output truncated]

#### The scores for every game from the 2003 through 2023 seasons have now been collected.

#### Next, we enrich this DataFrame with the player IDs for home_player_1 through home_player_12 and away_player_1 through away_player_12, along with each player's offensive and defensive ratings.

In [34]:
# Game details: an example of extracting statistics for a single game given its game_id
# (game IDs for all 2003-2023 games are saved in games_df above)
game_id = 15486  # Example game ID

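# Base URLs and API key
# The key is read from the environment rather than hard-coded in the notebook.
# (The variable name BALLDONTLIE_API_KEY is an assumption; use whatever name you export.)
import os
BASE_URL_STATS = "https://api.balldontlie.io/v1/stats"
BASE_URL_ADVANCED = "https://api.balldontlie.io/v1/stats/advanced"
API_KEY = os.environ["BALLDONTLIE_API_KEY"]
HEADERS = {"Authorization": API_KEY}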

# Function to fetch stats
def fetch_stats(url, game_id):
    response = requests.get(f"{url}?game_ids[]={game_id}&per_page=100", headers=HEADERS)
    if response.status_code == 200:
        return response.json()["data"]
    else:
        raise Exception(f"Error fetching stats: {response.status_code}, {response.text}")

# Fetch data
base_stats = fetch_stats(BASE_URL_STATS, game_id)
advanced_stats = fetch_stats(BASE_URL_ADVANCED, game_id)

# Convert to DataFrames
base_df = pd.DataFrame(base_stats)
adv_df = pd.DataFrame(advanced_stats)

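# Process base stats for minutes played
def parse_minutes(value):
    # Accepts "MM:SS" or plain "MM" strings; anything else counts as 0 minutes
    if not isinstance(value, str) or not value.strip():
        return 0
    try:
        minutes, _, seconds = value.partition(":")
        return int(minutes) + (int(seconds) / 60 if seconds else 0)
    except ValueError:
        # Non-numeric text (e.g., a "did not play" marker)
        return 0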

base_df["minutes_played"] = base_df["min"].apply(parse_minutes)

# Merge base and advanced stats on player ID
base_df["player_id"] = base_df["player"].apply(lambda x: x["id"])
adv_df["player_id"] = adv_df["player"].apply(lambda x: x["id"])
merged_df = pd.merge(base_df, adv_df[["player_id", "offensive_rating", "defensive_rating", "usage_percentage"]], on="player_id")

# Add team and player full name
merged_df["team_id"] = merged_df["team"].apply(lambda x: x["id"])
merged_df["team_name"] = merged_df["team"].apply(lambda x: x["full_name"])
merged_df["full_name"] = merged_df["player"].apply(lambda x: f"{x['first_name']} {x['last_name']}")

# Split into home and away teams
home_team_id = 14  # Los Angeles Lakers
away_team_id = 7   # Dallas Mavericks
home_players = merged_df[merged_df["team_id"] == home_team_id].nlargest(12, "minutes_played")
away_players = merged_df[merged_df["team_id"] == away_team_id].nlargest(12, "minutes_played")

# Base game information
game_row = {
    "id": game_id,
    "date": "2003-10-28",
    "season": 2003,
    "status": "Final",
    "home_team_score": 109,
    "visitor_team_score": 93,
    "home_team_name": "Los Angeles Lakers",
    "home_team_id": home_team_id,
    "visitor_team_name": "Dallas Mavericks",
    "visitor_team_id": away_team_id,
}

# Add player statistics for top 12 players per team
for i in range(1, 13):
    if i <= len(home_players):
        game_row[f"home_player_{i}_id"] = home_players.iloc[i - 1]["player_id"]
        game_row[f"home_player_{i}_name"] = home_players.iloc[i - 1]["full_name"]
        game_row[f"home_player_{i}_minutes"] = home_players.iloc[i - 1]["minutes_played"]
        game_row[f"home_player_{i}_off_rating"] = home_players.iloc[i - 1]["offensive_rating"]
        game_row[f"home_player_{i}_def_rating"] = home_players.iloc[i - 1]["defensive_rating"]
        game_row[f"home_player_{i}_usage"] = home_players.iloc[i - 1]["usage_percentage"]
    else:
        game_row[f"home_player_{i}_id"] = None
        game_row[f"home_player_{i}_name"] = None
        game_row[f"home_player_{i}_minutes"] = None
        game_row[f"home_player_{i}_off_rating"] = None
        game_row[f"home_player_{i}_def_rating"] = None
        game_row[f"home_player_{i}_usage"] = None

for i in range(1,13):
    if i <= len(away_players):
        game_row[f"away_player_{i}_id"] = away_players.iloc[i - 1]["player_id"]
        game_row[f"away_player_{i}_name"] = away_players.iloc[i - 1]["full_name"]
        game_row[f"away_player_{i}_minutes"] = away_players.iloc[i - 1]["minutes_played"]
        game_row[f"away_player_{i}_off_rating"] = away_players.iloc[i - 1]["offensive_rating"]
        game_row[f"away_player_{i}_def_rating"] = away_players.iloc[i - 1]["defensive_rating"]
        game_row[f"away_player_{i}_usage"] = away_players.iloc[i - 1]["usage_percentage"]
    else:
        game_row[f"away_player_{i}_id"] = None
        game_row[f"away_player_{i}_name"] = None
        game_row[f"away_player_{i}_minutes"] = None
        game_row[f"away_player_{i}_off_rating"] = None
        game_row[f"away_player_{i}_def_rating"] = None
        game_row[f"away_player_{i}_usage"] = None

# Create final DataFrame
final_df = pd.DataFrame([game_row])

# Save the data to a CSV file
output_file = "expanded_games_2003_2023.csv"
final_df.to_csv(output_file, index=False)
print(f"Data saved to {output_file}.")


Data saved to expanded_games_2003_2023.csv.


In [None]:
# Load the existing games DataFrame
games_df = pd.read_csv("games_2003_2023.csv")  # Replace with your actual CSV file path
game_ids = games_df["id"].tolist()

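# Base URLs and API key
# The key is read from the environment rather than hard-coded in the notebook.
# (The variable name BALLDONTLIE_API_KEY is an assumption; use whatever name you export.)
import os
BASE_URL_STATS = "https://api.balldontlie.io/v1/stats"
BASE_URL_ADVANCED = "https://api.balldontlie.io/v1/stats/advanced"
API_KEY = os.environ["BALLDONTLIE_API_KEY"]
HEADERS = {"Authorization": API_KEY}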

# Function to fetch stats
def fetch_stats(url, game_id):
    response = requests.get(f"{url}?game_ids[]={game_id}&per_page=100", headers=HEADERS)
    if response.status_code == 200:
        return response.json()["data"]
    else:
        print(f"Error fetching stats for game {game_id}: {response.status_code}, {response.text}")
        return []

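# Process base stats for minutes played
def parse_minutes(value):
    # Accepts "MM:SS" or plain "MM" strings; anything else counts as 0 minutes
    if not isinstance(value, str) or not value.strip():
        return 0
    try:
        minutes, _, seconds = value.partition(":")
        return int(minutes) + (int(seconds) / 60 if seconds else 0)
    except ValueError:
        # Non-numeric text (e.g., a "did not play" marker)
        return 0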

# List to store all game rows
all_games = []

# Loop over game IDs
for game_id in game_ids:
    try:
        # Fetch data
        base_stats = fetch_stats(BASE_URL_STATS, game_id)
        advanced_stats = fetch_stats(BASE_URL_ADVANCED, game_id)

        # Skip if no data, but still pause so failed requests do not hammer the API
        if not base_stats or not advanced_stats:
            time.sleep(1)
            continue

        # Convert to DataFrames
        base_df = pd.DataFrame(base_stats)
        adv_df = pd.DataFrame(advanced_stats)

        # Process base stats
        base_df["minutes_played"] = base_df["min"].apply(parse_minutes)
        base_df["player_id"] = base_df["player"].apply(lambda x: x["id"])
        adv_df["player_id"] = adv_df["player"].apply(lambda x: x["id"])

        # Merge base and advanced stats
        merged_df = pd.merge(
            base_df,
            adv_df[["player_id", "offensive_rating", "defensive_rating", "usage_percentage"]],
            on="player_id"
        )

        # Add team and player full name
        merged_df["team_id"] = merged_df["team"].apply(lambda x: x["id"])
        merged_df["team_name"] = merged_df["team"].apply(lambda x: x["full_name"])
        merged_df["full_name"] = merged_df["player"].apply(lambda x: f"{x['first_name']} {x['last_name']}")

        # Look up this game's row once, then identify home and away teams
        game_info = games_df.loc[games_df["id"] == game_id].iloc[0]
        home_team_id = game_info["home_team_id"]
        away_team_id = game_info["visitor_team_id"]
        home_players = merged_df[merged_df["team_id"] == home_team_id].nlargest(12, "minutes_played")
        away_players = merged_df[merged_df["team_id"] == away_team_id].nlargest(12, "minutes_played")

        # Base game information
        game_row = {
            "id": game_id,
            "date": game_info["date"],
            "season": game_info["season"],
            "status": game_info["status"],
            "home_team_score": game_info["home_team_score"],
            "visitor_team_score": game_info["visitor_team_score"],
            "home_team_name": game_info["home_team_name"],
            "home_team_id": home_team_id,
            "visitor_team_name": game_info["visitor_team_name"],
            "visitor_team_id": away_team_id,
        }

        # Add player statistics for top 12 players per team
        for i in range(1, 13):
            if i <= len(home_players):
                game_row[f"home_player_{i}_id"] = home_players.iloc[i - 1]["player_id"]
                game_row[f"home_player_{i}_name"] = home_players.iloc[i - 1]["full_name"]
                game_row[f"home_player_{i}_minutes"] = home_players.iloc[i - 1]["minutes_played"]
                game_row[f"home_player_{i}_off_rating"] = home_players.iloc[i - 1]["offensive_rating"]
                game_row[f"home_player_{i}_def_rating"] = home_players.iloc[i - 1]["defensive_rating"]
                game_row[f"home_player_{i}_usage"] = home_players.iloc[i - 1]["usage_percentage"]
            else:
                game_row[f"home_player_{i}_id"] = None
                game_row[f"home_player_{i}_name"] = None
                game_row[f"home_player_{i}_minutes"] = None
                game_row[f"home_player_{i}_off_rating"] = None
                game_row[f"home_player_{i}_def_rating"] = None
                game_row[f"home_player_{i}_usage"] = None

            if i <= len(away_players):
                game_row[f"away_player_{i}_id"] = away_players.iloc[i - 1]["player_id"]
                game_row[f"away_player_{i}_name"] = away_players.iloc[i - 1]["full_name"]
                game_row[f"away_player_{i}_minutes"] = away_players.iloc[i - 1]["minutes_played"]
                game_row[f"away_player_{i}_off_rating"] = away_players.iloc[i - 1]["offensive_rating"]
                game_row[f"away_player_{i}_def_rating"] = away_players.iloc[i - 1]["defensive_rating"]
                game_row[f"away_player_{i}_usage"] = away_players.iloc[i - 1]["usage_percentage"]
            else:
                game_row[f"away_player_{i}_id"] = None
                game_row[f"away_player_{i}_name"] = None
                game_row[f"away_player_{i}_minutes"] = None
                game_row[f"away_player_{i}_off_rating"] = None
                game_row[f"away_player_{i}_def_rating"] = None
                game_row[f"away_player_{i}_usage"] = None

        # Append to all_games
        all_games.append(game_row)

        # Throttle API requests to avoid rate limits
        time.sleep(1)

    except Exception as e:
        print(f"Error processing game {game_id}: {e}")

# Create final DataFrame
final_df = pd.DataFrame(all_games)

# Save the data to a CSV file
output_file = "expanded_games_2003_2023.csv"
final_df.to_csv(output_file, index=False)
print(f"Data saved to {output_file}.")


## Model Creation

### Overview
This section focuses on building the transformer-based deep learning model to predict NBA game outcomes. The model will analyze player lineups and their relationships to generate predictions.

### Goals
1. **Model Architecture**:
   - Implement a transformer network using **PyTorch**.
   - Utilize multi-head attention mechanisms to analyze player and team relationships.
   - Rely on residual connections (built into standard transformer layers) to keep deeper models stable during training.
2. **Input and Output Design**:
   - Process tokenized player information as inputs.
   - Predict game outcomes (e.g., winners, scores) as outputs.
3. **Model Training**:
   - Define the training loop, including loss functions and optimizers.
   - Split data into training, validation, and test sets for evaluation.
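
One possible shape for such an architecture is sketched below. It is an illustration under stated assumptions, not the project's final model: the class name `LineupTransformer`, the layer sizes, the home/away "side" embedding, and the mean-pooling step are all hypothetical choices. It treats each game as a sequence of 24 player tokens (12 home, 12 away), uses token 0 for empty roster slots, and outputs two logits (away win, home win). The residual connections mentioned above come with `nn.TransformerEncoderLayer`.

```python
import torch
import torch.nn as nn

class LineupTransformer(nn.Module):
    """Sketch: encode 24 player tokens per game and predict the winner."""

    def __init__(self, num_players, num_features=3, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        # Index 0 is the padding token for empty roster slots.
        self.player_embedding = nn.Embedding(num_players + 1, d_model, padding_idx=0)
        # Projects per-player numeric stats (e.g., ratings, usage) into the model dimension.
        self.feature_proj = nn.Linear(num_features, d_model)
        # Tells the model which side a player is on: 0 = home, 1 = away.
        self.side_embedding = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )
        self.head = nn.Linear(d_model, 2)

    def forward(self, player_tokens, player_features, side_ids):
        x = (
            self.player_embedding(player_tokens)
            + self.feature_proj(player_features)
            + self.side_embedding(side_ids)
        )
        padding_mask = player_tokens == 0  # True where the slot is empty
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        # Mean-pool over real players only, then classify.
        keep = (~padding_mask).unsqueeze(-1).float()
        pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)

# Usage with random stand-in data: 4 games, 24 slots, 3 features per player.
torch.manual_seed(0)
model = LineupTransformer(num_players=50)
tokens = torch.randint(1, 51, (4, 24))
tokens[:, -2:] = 0  # pretend the last two away slots are empty
features = torch.randn(4, 24, 3)
sides = torch.cat([torch.zeros(12), torch.ones(12)]).long().repeat(4, 1)
logits = model(tokens, features, sides)
print(logits.shape)
```

Predicting scores instead of (or in addition to) the winner would only require a different output head, for example a linear layer with two regression outputs.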

### Implementation Steps
1. **Define the Transformer Architecture**:
   - Specify input dimensions, number of attention heads, and transformer layers.
2. **Configure the Training Pipeline**:
   - Choose a loss function and optimizer (e.g., CrossEntropyLoss, Adam).
   - Set hyperparameters like learning rate, number of epochs, and batch size.
3. **Initial Testing**:
   - Train the model on a subset of the data to ensure functionality.
   - Evaluate initial performance before moving to hyperparameter tuning.

This section will document the step-by-step process of creating and implementing the model, including explanations for each architectural choice.
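
The training pipeline described in the steps above could look roughly like the following sketch. Everything here is a stand-in: the data is synthetic, the small `nn.Sequential` network is a placeholder for the transformer, and the 70/15/15 split, batch size, and learning rate are illustrative values rather than tuned choices.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)

# Synthetic stand-in data: 200 "games", 8 features each, binary label (home win).
features = torch.randn(200, 8)
labels = (features[:, 0] > 0).long()
dataset = TensorDataset(features, labels)

# 70/15/15 train/validation/test split.
train_set, val_set, test_set = random_split(
    dataset, [140, 30, 30], generator=torch.Generator().manual_seed(0)
)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

# Placeholder classifier; the transformer model would be used here instead.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

def run_epoch(loader, train):
    """Run one pass over the loader; update weights only when train=True."""
    model.train(train)
    total_loss, total_correct, total_seen = 0.0, 0, 0
    with torch.set_grad_enabled(train):
        for batch_x, batch_y in loader:
            logits = model(batch_x)
            loss = criterion(logits, batch_y)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * len(batch_y)
            total_correct += (logits.argmax(dim=1) == batch_y).sum().item()
            total_seen += len(batch_y)
    return total_loss / total_seen, total_correct / total_seen

history = []
for epoch in range(5):
    train_loss, train_acc = run_epoch(train_loader, train=True)
    val_loss, val_acc = run_epoch(val_loader, train=False)
    history.append((train_loss, val_loss, val_acc))
    print(f"epoch {epoch + 1}: train_loss={train_loss:.3f} val_loss={val_loss:.3f} val_acc={val_acc:.2f}")
```

A random split is used here for brevity. For game data, splitting by season (training on earlier seasons, evaluating on later ones) may give a more realistic estimate, since it prevents the model from seeing games that occur after the ones it is tested on.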


## Hyperparameter Tuning

### Overview
This section explores the impact of various hyperparameters on the model's performance. By systematically adjusting key parameters, we aim to optimize the transformer network for better predictions.

### Goals
1. **Experimentation**:
   - Test different values for hyperparameters such as:
     - Learning rate.
     - Number of epochs.
     - Optimizer (e.g., Adam, SGD).
     - Batch size.
     - Number of transformer layers and attention heads.
2. **Performance Evaluation**:
   - Assess the impact of each hyperparameter on model accuracy, loss, and F1-score.
   - Document observations to identify the most effective configurations.

### Implementation Steps
1. **Baseline Configuration**:
   - Train the model with default or commonly used hyperparameter values.
   - Record baseline performance metrics.
2. **Iterative Testing**:
   - Adjust one hyperparameter at a time while keeping others constant.
   - Monitor changes in performance and identify trends.
3. **Optimal Configuration**:
   - Combine the best-performing hyperparameters into a final configuration for training the model.

This section will detail the experiments conducted and the resulting insights into hyperparameter optimization for the transformer network.
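
The one-at-a-time procedure in the steps above can be sketched in plain Python. The function names, the search space, and especially `fake_train_and_evaluate` are hypothetical: the fake scorer is a toy formula standing in for a real training run that returns a validation score.

```python
def one_at_a_time_sweep(baseline, search_space, train_and_evaluate):
    """Score the baseline, then vary one hyperparameter at a time around it."""
    results = [(dict(baseline), train_and_evaluate(baseline))]
    for name, candidates in search_space.items():
        for value in candidates:
            if value == baseline[name]:
                continue  # already covered by the baseline run
            config = {**baseline, name: value}
            results.append((config, train_and_evaluate(config)))
    return results

def combine_best(baseline, results):
    """For each hyperparameter, keep the value that scored highest in its own sweep."""
    baseline_score = results[0][1]
    best = dict(baseline)
    best_scores = {name: baseline_score for name in baseline}
    for config, score in results[1:]:
        # Exactly one hyperparameter differs from the baseline in each run.
        name = next(n for n in baseline if config[n] != baseline[n])
        if score > best_scores[name]:
            best_scores[name] = score
            best[name] = config[name]
    return best

def fake_train_and_evaluate(config):
    """Toy stand-in for a training run; returns a made-up validation score."""
    score = 0.6 - abs(config["learning_rate"] - 1e-3) * 10
    score += 0.02 * (config["num_layers"] == 4)
    score += 0.01 * (config["optimizer"] == "adam")
    return score

baseline = {"learning_rate": 1e-2, "num_layers": 2, "optimizer": "sgd"}
search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "num_layers": [2, 4],
    "optimizer": ["sgd", "adam"],
}

results = one_at_a_time_sweep(baseline, search_space, fake_train_and_evaluate)
final_config = combine_best(baseline, results)
print(final_config)
```

Because each hyperparameter is varied in isolation, interactions between them are not measured. The combined configuration should therefore be retrained and evaluated before it is treated as the best one.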


## Evaluation and Analysis

### Overview
In this section, we evaluate the transformer-based model on two primary tasks, score prediction and winner prediction, along with other relevant performance indicators. The focus is on understanding the model's strengths, limitations, and areas for improvement.

### Metrics
1. **Score Prediction Error**:
   - Measure the distance between predicted and true game scores (e.g., Mean Squared Error or Mean Absolute Error).
   - Assess how well the model captures the scoring trends in games.
2. **Winning Outcome Prediction**:
   - Evaluate the model's ability to correctly predict the winning team (e.g., Accuracy, F1-score).
   - Analyze classification performance using confusion matrices.
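
As a small worked example, the metrics above can be computed with NumPy alone. The helper names `score_errors` and `binary_classification_report` and the tiny arrays are made up for illustration; label 1 is assumed to mean "home team wins".

```python
import numpy as np

def score_errors(pred_scores, true_scores):
    """Mean absolute and mean squared error between predicted and true scores."""
    diff = np.asarray(pred_scores, dtype=float) - np.asarray(true_scores, dtype=float)
    return {"mae": float(np.mean(np.abs(diff))), "mse": float(np.mean(diff ** 2))}

def binary_classification_report(pred_labels, true_labels):
    """Accuracy, precision, recall, F1, and a 2x2 confusion matrix for labels in {0, 1}."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    tp = int(np.sum((pred == 1) & (true == 1)))
    tn = int(np.sum((pred == 0) & (true == 0)))
    fp = int(np.sum((pred == 1) & (true == 0)))
    fn = int(np.sum((pred == 0) & (true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true labels (0, 1); columns are predicted labels (0, 1).
        "confusion_matrix": np.array([[tn, fp], [fn, tp]]),
    }

# Two games, each row is [home_score, visitor_score].
errors = score_errors([[100, 90], [95, 105]], [[104, 90], [95, 101]])
report = binary_classification_report([1, 0, 0, 1, 1], [1, 0, 1, 1, 0])
print(errors)
print(report)
```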

### Goals
1. **Performance Metrics**:
   - Quantify how accurately the model predicts game outcomes and scores.
   - Identify patterns or biases in the model’s predictions.
2. **Visual Representations**:
   - Plot training and validation loss over epochs.
   - Generate confusion matrices for winning outcome predictions.
   - Visualize score prediction distributions.
3. **Strengths and Limitations**:
   - Discuss areas where the model performs well and where it struggles.
   - Identify real-world scenarios where the model could be applied effectively.
4. **Future Improvements**:
   - Suggest ways to enhance the model, such as adjusting hyperparameters, adding new features, or increasing dataset size.

This section will summarize the model's overall performance, supported by quantitative metrics and visualizations.
