# Project Title and Authors
Project Title: NBA Predictor

Authors: Sam Motto and Alex Lehman

## Introduction
We tackled a **classification** problem: predicting the likelihood that each NBA team in a season will win the championship based on historical performance data. Our primary data source is **Basketball-Reference.com**, from which we collected both advanced and per-possession statistics.




In [22]:
import pandas as pd
import numpy as np
import re
import time
import os

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from collections import defaultdict

## Methods: Data
We combined advanced and per-possession tables for each season from 2004 to 2024. After cleaning and merging, we labeled the champion teams for each season. Key features include offensive/defensive metrics like eFG%, TOV%, and net rating. We split the data by leaving one season out at a time for testing (leave-one-year-out cross-validation). The final combined dataset was then used to build and evaluate our models.
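The leave-one-year-out split described above can be sketched on its own, separate from model fitting. This is a minimal illustration, not the notebook's exact code: the helper name `leave_one_year_out` and the toy values are assumptions made up for the example, and only the `Year` and `Champion` column names come from our dataset.

```python
import pandas as pd

def leave_one_year_out(df, year_col="Year"):
    """Yield (held_out_year, train_df, test_df) once per season in df."""
    for year in sorted(df[year_col].unique()):
        test_mask = df[year_col] == year
        # Train on every other season, test on the held-out one.
        yield year, df[~test_mask], df[test_mask]

# Toy stand-in for the combined dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Team": ["A", "B", "A", "B", "A", "B"],
    "Year": [2004, 2004, 2005, 2005, 2006, 2006],
    "NRtg": [5.1, -2.0, 3.3, 0.4, -1.2, 6.0],
    "Champion": [1, 0, 1, 0, 0, 1],
})

splits = list(leave_one_year_out(toy))
for year, train_df, test_df in splits:
    print(year, len(train_df), len(test_df))
```

Splitting by season rather than by random rows keeps all teams from one season together, so a model is never tested on a season it has partly seen during training.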


In [29]:
# --------------------------
# Helper: Clean Team Name
# --------------------------
def clean_team_name(name):
    return re.sub(r'[^\w\s]', '', name).strip()

# --------------------------
# Fetch and Process Tables
# --------------------------
def fetch_table(url, table_id):
    header = 1 if table_id == 'advanced-team' else 0
    df = pd.read_html(url, header=header, attrs={'id': table_id})[0]
    df.dropna(axis=1, how='all', inplace=True)

    for col in ["Rk", "Arena", "Attend.", "Attend./G"]:
        if col in df.columns:
            df.drop(columns=col, inplace=True)

    if "Team" in df.columns:
        df["Team"] = df["Team"].astype(str).apply(clean_team_name)

    if table_id == 'advanced-team':
        df = df[:-1].reset_index(drop=True)
        rename_map = {
            "eFG%": "Off_eFG%",
            "TOV%": "Off_TOV%",
            "FT/FGA": "Off_FT/FGA",
            "eFG%.1": "Def_eFG%",
            "TOV%.1": "Def_TOV%",
            "FT/FGA.1": "Def_FT/FGA"
        }
        df.rename(columns=rename_map, inplace=True)

    return df

# --------------------------
# Download and Combine Data
# --------------------------
base_url = "https://www.basketball-reference.com/leagues/NBA_{}.html"
os.makedirs("data", exist_ok=True)

for year in range(2004, 2025):
    print(f"Processing {year} season...")
    url = base_url.format(year)

    df_advanced = None
    df_perposs = None

    try:
        # NOTE: We have already provided the data from 2004-2024
        # If you try to fetch too many tables at once you get blacklisted for being a bot
        #df_advanced = fetch_table(url, "advanced-team")
        #df_perposs = fetch_table(url, "per_poss-team")
        pass
    except Exception as e:
        print(f"Failed to fetch data for {year}: {e}")
        continue

    if df_advanced is not None and df_perposs is not None:
        merged = pd.merge(df_advanced, df_perposs, on="Team", suffixes=("_adv", "_poss"))
        merged["Year"] = year
        merged.to_csv(f"data/{year}_merged.csv", index=False)
        print(f"Saved merged data to data/{year}_merged.csv")
    else:
        print(f"Skipping {year} already have data.")

    time.sleep(1)

Processing 2004 season...
Skipping 2004 already have data.
Processing 2005 season...
Skipping 2005 already have data.
Processing 2006 season...
Skipping 2006 already have data.
Processing 2007 season...
Skipping 2007 already have data.
Processing 2008 season...
Skipping 2008 already have data.
Processing 2009 season...
Skipping 2009 already have data.
Processing 2010 season...
Skipping 2010 already have data.
Processing 2011 season...
Skipping 2011 already have data.
Processing 2012 season...
Skipping 2012 already have data.
Processing 2013 season...
Skipping 2013 already have data.
Processing 2014 season...
Skipping 2014 already have data.
Processing 2015 season...
Skipping 2015 already have data.
Processing 2016 season...
Skipping 2016 already have data.
Processing 2017 season...
Skipping 2017 already have data.
Processing 2018 season...
Skipping 2018 already have data.
Processing 2019 season...
Skipping 2019 already have data.
Processing 2020 season...
Skipping 2020 already have data.

In [24]:
# --------------------------
# Combine All and Add Champion Label
# --------------------------
print("\nMerging all years into one file...")

# Load and concatenate all year files
all_years = []
for filename in sorted(os.listdir("data")):  # sorted so the concatenation order is reproducible
    if filename.endswith("_merged.csv"):
        all_years.append(pd.read_csv(os.path.join("data", filename)))
df = pd.concat(all_years, ignore_index=True)

# Champion team per year
champion_teams = {
    "2004": "Detroit Pistons",
    "2005": "San Antonio Spurs",
    "2006": "Miami Heat",
    "2007": "San Antonio Spurs",
    "2008": "Boston Celtics",
    "2009": "Los Angeles Lakers",
    "2010": "Los Angeles Lakers",
    "2011": "Dallas Mavericks",
    "2012": "Miami Heat",
    "2013": "Miami Heat",
    "2014": "San Antonio Spurs",
    "2015": "Golden State Warriors",
    "2016": "Cleveland Cavaliers",
    "2017": "Golden State Warriors",
    "2018": "Golden State Warriors",
    "2019": "Toronto Raptors",
    "2020": "Los Angeles Lakers",
    "2021": "Milwaukee Bucks",
    "2022": "Golden State Warriors",
    "2023": "Denver Nuggets",
    "2024": "Boston Celtics"
}

# Clean team names just in case
df["Team"] = df["Team"].astype(str).apply(clean_team_name)

# Label champions
df["Champion"] = df.apply(
    lambda row: 1 if row["Team"].strip() == champion_teams.get(str(row["Year"]), None) else 0, axis=1)

# Save final combined file
df.to_csv("data/combined_data_with_champions.csv", index=False)
print("Saved combined_data_with_champions.csv.")


Merging all years into one file...
Saved combined_data_with_champions.csv.


## Methods: Training/Validation
We explored five different classification models:
1. Logistic Regression
2. Random Forest
3. Gradient Boosting
4. K-Neighbors
5. Support Vector Classifier (SVC)

To select the best model, we performed a **leave-one-year-out** validation, training on all seasons except one and then testing on the held-out season. We tracked each model’s ability to correctly “rank” the actual champion. The model with the lowest average champion rank across all seasons was chosen as our final model.
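The selection rule above (rank the true champion within each held-out season, then pick the model with the lowest average rank) can be sketched in isolation. The function name `champion_rank`, the model names, and all numbers below are hypothetical and only illustrate the mechanics.

```python
import numpy as np

def champion_rank(probs, champion_index):
    """Return the 1-based rank of the champion when teams are ordered
    by predicted probability, highest first."""
    probs = np.asarray(probs, dtype=float)
    # Stable sort on negated probabilities: on ties, the team listed first ranks higher.
    order = np.argsort(-probs, kind="stable")
    return int(np.where(order == champion_index)[0][0]) + 1

# One held-out season with three teams; the champion is the team at index 2.
print(champion_rank([0.1, 0.5, 0.4], champion_index=2))

# Made-up per-season champion ranks for two candidate models.
ranks_by_model = {"model_a": [1, 3, 2], "model_b": [2, 5, 1]}
avg_rank = {name: float(np.mean(ranks)) for name, ranks in ranks_by_model.items()}
best = min(avg_rank, key=avg_rank.get)
print(avg_rank, best)
```

We used average champion rank rather than accuracy because each season has only one positive example among roughly 30 teams, so a model that never predicts a champion would still score high on accuracy.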


In [25]:
df = pd.read_csv("data/combined_data_with_champions.csv")

# Choose features (remove non-numeric and labels)
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
for col in ["Champion", "Year"]:
    if col in numeric_features:
        numeric_features.remove(col)

# Define candidate models
models = {
    "LogisticRegression": LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "KNeighbors": KNeighborsClassifier(),
    "SVC": SVC(probability=True, class_weight='balanced', random_state=42)
}

# Track champion ranks per model
champion_ranks = defaultdict(list)

# Leave-one-year-out loop
for test_year in range(2004, 2025):
    print(f"\nEvaluating on {test_year} season...")
    
    train_df = df[df["Year"] != test_year]
    test_df = df[df["Year"] == test_year]

    X_train = train_df[numeric_features]
    y_train = train_df["Champion"]
    X_test = test_df[numeric_features]

    # Get the true champion team name
    true_champ_row = test_df[test_df["Champion"] == 1]
    if true_champ_row.empty:
        print(f"No champion found for {test_year}, skipping...")
        continue
    true_champ_team = true_champ_row.iloc[0]["Team"]

    for model_name, clf in models.items():
        pipeline = Pipeline([
            ("scaler", StandardScaler()),
            ("clf", clf)
        ])
        pipeline.fit(X_train, y_train)

        # Predict probabilities
        probs = pipeline.predict_proba(X_test)[:, 1]
        test_df_copy = test_df.copy()
        test_df_copy["Predicted_Prob"] = probs / probs.sum()
        test_df_copy["Predicted_Rank"] = test_df_copy["Predicted_Prob"].rank(ascending=False, method="first")

        # Get champion's predicted rank
        champ_rank = test_df_copy[test_df_copy["Team"] == true_champ_team]["Predicted_Rank"].values[0]
        champion_ranks[model_name].append(champ_rank)
        print(f"{model_name}: '{true_champ_team}' ranked #{int(champ_rank)}")
        
# Calculate average rank of champions per model
print("\nAverage Rank of Champions Across Seasons:")
avg_ranks = {}
for model_name, ranks in champion_ranks.items():
    avg = np.mean(ranks)
    avg_ranks[model_name] = avg
    print(f"{model_name}: Average Rank = {avg:.2f}")

# Identify the best model
best_model = min(avg_ranks, key=avg_ranks.get)
print(f"\nBest Model: {best_model} (Lowest Avg Champion Rank)")


Evaluating on 2004 season...
LogisticRegression: 'Detroit Pistons' ranked #5
RandomForest: 'Detroit Pistons' ranked #4
GradientBoosting: 'Detroit Pistons' ranked #5
KNeighbors: 'Detroit Pistons' ranked #2
SVC: 'Detroit Pistons' ranked #5

Evaluating on 2005 season...
LogisticRegression: 'San Antonio Spurs' ranked #2
RandomForest: 'San Antonio Spurs' ranked #1
GradientBoosting: 'San Antonio Spurs' ranked #2
KNeighbors: 'San Antonio Spurs' ranked #3
SVC: 'San Antonio Spurs' ranked #3

Evaluating on 2006 season...
LogisticRegression: 'Miami Heat' ranked #4
RandomForest: 'Miami Heat' ranked #5
GradientBoosting: 'Miami Heat' ranked #5
KNeighbors: 'Miami Heat' ranked #6
SVC: 'Miami Heat' ranked #5

Evaluating on 2007 season...
LogisticRegression: 'San Antonio Spurs' ranked #1
RandomForest: 'San Antonio Spurs' ranked #1
GradientBoosting: 'San Antonio Spurs' ranked #1
KNeighbors: 'San Antonio Spurs' ranked #1
SVC: 'San Antonio Spurs' ranked #1

Evaluating on 2008 season...
LogisticRegression:

## Results on Test Set
After identifying the best model based on champion ranks, we refit it on all 2004–2024 seasons and tested it on six earlier seasons: 1991–1993 and 1996–1998, the years of Michael Jordan’s championship runs with the Chicago Bulls. For each season we computed every team's normalized predicted probability and the resulting rank, then recorded where the model placed the Bulls. We also calculated summary statistics, including the Bulls' average rank across those seasons.


In [26]:
best_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", models[best_model])
])
best_pipeline.fit(df[numeric_features], df["Champion"])

test_folder = "test"
bulls_ranks = []

for year in [1991, 1992, 1993, 1996, 1997, 1998]:
    test_path = os.path.join(test_folder, f"{year}_merged.csv")
    df_test = pd.read_csv(test_path)
    X_test = df_test[numeric_features]

    # Predict and normalize
    probs = best_pipeline.predict_proba(X_test)[:, 1]
    df_test["Predicted_Prob"] = probs / probs.sum()
    df_test["Predicted_Rank"] = df_test["Predicted_Prob"].rank(ascending=False, method="first")

    # Track Bulls' rank
    bulls_row = df_test[df_test["Team"].str.strip() == "Chicago Bulls"]
    if not bulls_row.empty:
        bulls_rank = int(bulls_row["Predicted_Rank"].values[0])
        bulls_ranks.append((year, bulls_rank))
        print(f"{year}: Chicago Bulls ranked #{bulls_rank}")
    else:
        print(f"{year}: Chicago Bulls not found.")

# Summary
avg_bulls_rank = np.mean([rank for _, rank in bulls_ranks])
print(f"\nAverage Bulls Rank (1991–1998): {avg_bulls_rank:.2f}")

1991: Chicago Bulls ranked #2
1992: Chicago Bulls ranked #1
1993: Chicago Bulls ranked #2
1996: Chicago Bulls ranked #1
1997: Chicago Bulls ranked #1
1998: Chicago Bulls ranked #1

Average Bulls Rank (1991–1998): 1.33
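
The summary for this section can be reproduced directly from the ranks printed above. This is a minimal sketch using only the standard library; the `top1_rate` and `worst_rank` fields are additions for illustration and are not computed in the notebook.

```python
from statistics import mean

# (season, predicted rank) pairs copied from the output above.
bulls_ranks = [(1991, 2), (1992, 1), (1993, 2), (1996, 1), (1997, 1), (1998, 1)]
ranks = [rank for _, rank in bulls_ranks]

summary = {
    "mean_rank": mean(ranks),                                # average predicted rank
    "top1_rate": sum(r == 1 for r in ranks) / len(ranks),    # share of seasons ranked #1
    "worst_rank": max(ranks),                                # lowest placement in any season
}
print(summary)
```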


## Discussion / Conclusion

Our model, trained on seasons from 2004–2024, was tested on earlier years to evaluate how well it generalizes to an era it never saw. On the six seasons in the 1990s that the Chicago Bulls won (1991–1993 and 1996–1998), the model ranked the Bulls **#1 in four out of six years** and **#2 in the other two**, for an average rank of 1.33. That level of consistency across different basketball eras gives us strong confidence in the model's ability to assess team strength.

Looking ahead, we recognize that some features, such as **net rating (NRtg)** or **simple rating system (SRS)**, may deserve different weighting or scaling. Although they have historically been strong indicators, their raw values may not fully capture playoff readiness or roster context. Additionally, while restricting training to 2004–2024 data kept conditions consistent (modern pace, the 3-point era, etc.), carefully incorporating older seasons might further improve the model, as long as we adjust for shifts in playstyle.
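
One possible way to adjust for era differences, which we have not tried, is to standardize each feature within its own season so that a team is measured against its contemporaries rather than against raw values from another decade. The sketch below is an assumption about how that could look: the helper name `standardize_within_season` is hypothetical, and the toy `Pace` and `3PAr` values are made up.

```python
import pandas as pd

def standardize_within_season(df, feature_cols, year_col="Year"):
    """Return a copy of df with each feature z-scored within its season."""
    out = df.copy()
    grouped = out.groupby(year_col)[feature_cols]
    # Subtract the season mean and divide by the season standard deviation.
    # A feature that is constant within a season would produce NaN here.
    out[feature_cols] = (out[feature_cols] - grouped.transform("mean")) / grouped.transform("std")
    return out

# Toy data: the same relative spread in two very different eras (made-up values).
toy = pd.DataFrame({
    "Year": [1991, 1991, 1991, 2024, 2024, 2024],
    "Team": ["A", "B", "C", "A", "B", "C"],
    "Pace": [95.0, 97.0, 99.0, 98.0, 100.0, 102.0],
    "3PAr": [0.05, 0.08, 0.11, 0.35, 0.40, 0.45],
})

scaled = standardize_within_season(toy, ["Pace", "3PAr"])
print(scaled)
```

After this transformation, the best team in a low three-point-rate season and the best team in a high-rate season land on a comparable scale, which is the kind of adjustment the paragraph above has in mind.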

After identifying **Logistic Regression** as our best model based on average champion ranking during validation, we ran it on the current **2025 season**. The model predicted the following top 5 teams most likely to win the championship:

| Rank | Team                    | Predicted Probability |
|------|-------------------------|------------------------|
| 1    | **Cleveland Cavaliers** | 15.62%                |
| 2    | Boston Celtics          | 15.44%                |
| 3    | New York Knicks         | 13.77%                |
| 4    | LA Clippers             | 13.31%                |
| 5    | OKC Thunder             | 13.20%                |

And here are the **5 lowest-ranked teams** by predicted championship probability:

| Rank | Team                     | Predicted Probability       |
|------|--------------------------|------------------------------|
| 26   | Charlotte Hornets        | 6.15e-08                    |
| 27   | Brooklyn Nets            | 4.89e-08                    |
| 28   | New Orleans Pelicans     | 2.54e-08                    |
| 29   | Utah Jazz                | 3.92e-09                    |
| 30   | Washington Wizards       | 1.37e-10                    |

While these probabilities don’t guarantee an outcome, they reflect relative strength compared to the rest of the league.
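
The reason these values read as relative strength is the normalization step (`probs / probs.sum()`): each team's raw classifier output is divided by the league total, so the shares sum to 1 and the ordering of teams is unchanged. Here is a small sketch with made-up raw scores; the function name `normalize_probs` is hypothetical.

```python
import numpy as np

def normalize_probs(raw_probs):
    """Rescale raw per-team probabilities so they sum to 1 across the league."""
    raw = np.asarray(raw_probs, dtype=float)
    total = raw.sum()
    if total == 0:
        raise ValueError("All raw probabilities are zero; cannot normalize.")
    return raw / total

# Made-up raw classifier outputs for a four-team league.
raw = [0.90, 0.60, 0.30, 0.20]
shares = normalize_probs(raw)
print(shares, shares.sum())
```

Because each team is scored independently by the classifier, the raw outputs need not sum to 1; the normalized shares are better read as a relative ranking signal than as calibrated title odds.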

## Disclosures
We used ChatGPT to help clean up our code comments, turning our shorthand notes into clearer, more detailed explanations. We also used it for general debugging and for figuring out where the code was breaking.