# Project Title and Authors
Project Title: NBA Predictor
Authors: Sam Motto and Alex Lehman

## Introduction
"Introduce the regression or classification task your group has chosen. Specify clearly whether it is a regression or a classification problem. Provide a reference or note the source of your dataset (e.g., UCI, Kaggle, Google’s Dataset Search, etc.)."



In [12]:
import pandas as pd
import numpy as np
import re
import time
import os

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from collections import defaultdict

## Methods: Data
"Describe your dataset in detail. Include its source, key features, and any important contextual information. Explain the data cleaning or pre-processing steps and how you split the data into training/validation and test sets."

In [10]:
# --------------------------
# Helper: Clean Team Name
# --------------------------
def clean_team_name(name):
    """Remove punctuation (e.g. a trailing '*') and surrounding whitespace from a team name."""
    return re.sub(r'[^\w\s]', '', name).strip()

# --------------------------
# Fetch and Process Tables
# --------------------------
def fetch_table(url, table_id):
    # The advanced table is read with header=1 so the second header row supplies the column names
    header = 1 if table_id == 'advanced-team' else 0
    df = pd.read_html(url, header=header, attrs={'id': table_id})[0]
    df = df.dropna(axis=1, how='all')

    # Drop ranking, arena and attendance columns; errors="ignore" skips any that are absent
    df = df.drop(columns=["Rk", "Arena", "Attend.", "Attend./G"], errors="ignore")

    if "Team" in df.columns:
        df["Team"] = df["Team"].astype(str).apply(clean_team_name)

    if table_id == 'advanced-team':
        # Drop the final row of the advanced table (a summary row, not a team)
        df = df[:-1].reset_index(drop=True)
        rename_map = {
            "eFG%": "Off_eFG%",
            "TOV%": "Off_TOV%",
            "FT/FGA": "Off_FT/FGA",
            "eFG%.1": "Def_eFG%",
            "TOV%.1": "Def_TOV%",
            "FT/FGA.1": "Def_FT/FGA"
        }
        df.rename(columns=rename_map, inplace=True)

    return df

# --------------------------
# Download and Combine Data
# --------------------------
base_url = "https://www.basketball-reference.com/leagues/NBA_{}.html"
os.makedirs("data", exist_ok=True)

for year in range(2004, 2025):
    print(f"Processing {year} season...")
    url = base_url.format(year)

    df_advanced = None
    df_perposs = None

    try:
        # NOTE: The 2004-2024 data is already provided in the data/ folder.
        # Fetching too many tables in quick succession can get you blocked as a bot,
        # so the fetch calls below are left commented out.
        #df_advanced = fetch_table(url, "advanced-team")
        #df_perposs = fetch_table(url, "per_poss-team")
        pass
    except Exception as e:
        print(f"Failed to fetch data for {year}: {e}")
        continue

    if df_advanced is not None and df_perposs is not None:
        merged = pd.merge(df_advanced, df_perposs, on="Team", suffixes=("_adv", "_poss"))
        merged["Year"] = year
        merged.to_csv(f"data/{year}_merged.csv", index=False)
        print(f"Saved merged data to data/{year}_merged.csv")
    else:
        print(f"Skipping {year}: data already provided.")

    time.sleep(1)

Processing 1996 season...
Saved merged data to data/1996_merged.csv
Processing 1997 season...
Saved merged data to data/1997_merged.csv
Processing 1998 season...
Saved merged data to data/1998_merged.csv
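
The note in the download loop warns that fetching too many tables at once can get the client blocked. One possible way to be gentler on the server is to retry a failed request with increasing delays. The sketch below is an illustration only: `fetch_with_backoff`, `fetch_fn`, `retries`, `base_delay` and `sleep` are hypothetical names that do not appear in the notebook, and the stand-in fetcher simulates failures instead of touching the network.

```python
import time


def fetch_with_backoff(fetch_fn, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch_fn(); on failure wait base_delay * 2**attempt seconds, then retry.

    All names here are hypothetical and exist only for this sketch.
    """
    for attempt in range(retries):
        try:
            return fetch_fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in fetcher that fails twice before succeeding (no network access).
calls = {"n": 0}
waits = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "table"


# Passing waits.append as `sleep` records the delays instead of actually waiting.
result = fetch_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In the notebook, the same wrapper could be given something like `lambda: fetch_table(url, "advanced-team")` as `fetch_fn`, with the default `time.sleep` left in place.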


In [19]:
# --------------------------
# Combine All and Add Champion Label
# --------------------------
print("\nMerging all years into one file...")

# Load and concatenate all year files
all_years = []
for filename in sorted(os.listdir("data")):  # sorted so the row order is reproducible
    if filename.endswith("_merged.csv"):
        all_years.append(pd.read_csv(os.path.join("data", filename)))
df = pd.concat(all_years, ignore_index=True)

# Champion team per year
champion_teams = {
    "2004": "Detroit Pistons",
    "2005": "San Antonio Spurs",
    "2006": "Miami Heat",
    "2007": "San Antonio Spurs",
    "2008": "Boston Celtics",
    "2009": "Los Angeles Lakers",
    "2010": "Los Angeles Lakers",
    "2011": "Dallas Mavericks",
    "2012": "Miami Heat",
    "2013": "Miami Heat",
    "2014": "San Antonio Spurs",
    "2015": "Golden State Warriors",
    "2016": "Cleveland Cavaliers",
    "2017": "Golden State Warriors",
    "2018": "Golden State Warriors",
    "2019": "Toronto Raptors",
    "2020": "Los Angeles Lakers",
    "2021": "Milwaukee Bucks",
    "2022": "Golden State Warriors",
    "2023": "Denver Nuggets",
    "2024": "Boston Celtics"
}

# Clean team names just in case
df["Team"] = df["Team"].astype(str).apply(clean_team_name)

# Label champions
df["Champion"] = df.apply(
    lambda row: 1 if row["Team"].strip() == champion_teams.get(str(row["Year"]), None) else 0, axis=1)

# Save final combined file
df.to_csv("data/combined_data_with_champions.csv", index=False)
print("Saved combined_data_with_champions.csv.")


Merging all years into one file...
Saved combined_data_with_champions.csv.
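
Because the champion label comes from an exact string match on the team name, a mismatch would silently leave a season with no positive example. A minimal sketch of a sanity check follows, assuming the labels can be iterated as `(year, champion)` pairs. The function names and the toy rows are made up for illustration and are not part of the notebook.

```python
from collections import Counter


def champions_per_year(year_label_pairs):
    """Count rows labelled Champion == 1 for each season."""
    counts = Counter()
    for year, is_champion in year_label_pairs:
        counts[year] += int(is_champion)
    return dict(counts)


def seasons_with_bad_labels(year_label_pairs):
    """Return, in sorted order, the seasons whose champion count is not exactly 1."""
    counts = champions_per_year(year_label_pairs)
    return sorted(year for year, n in counts.items() if n != 1)


# Toy rows (made up for illustration): 2005 has no labelled champion.
rows = [(2004, 1), (2004, 0), (2005, 0), (2005, 0), (2006, 1), (2006, 0)]
bad = seasons_with_bad_labels(rows)
print(bad)  # [2005]
```

With the combined DataFrame, the check could be called as `seasons_with_bad_labels(zip(df["Year"], df["Champion"]))`; an empty list would mean every season has exactly one labelled champion.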


## Methods: Training/Validation
Describe how you performed the training/validation process. Summarize at least five different model variations that you tried (e.g., changes in features, different algorithms, or hyperparameters) and explain your method for model selection.

In [20]:
df = pd.read_csv("data/combined_data_with_champions.csv")

# Choose features (remove non-numeric and labels)
numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()
for col in ["Champion", "Year"]:
    if col in numeric_features:
        numeric_features.remove(col)

# Define candidate models
models = {
    "LogisticRegression": LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "KNeighbors": KNeighborsClassifier(),
    "SVC": SVC(probability=True, class_weight='balanced', random_state=42)
}

# Track champion ranks per model
champion_ranks = defaultdict(list)

# Leave-one-year-out loop
for test_year in range(2004, 2025):
    print(f"\nEvaluating on {test_year} season...")
    
    train_df = df[df["Year"] != test_year]
    test_df = df[df["Year"] == test_year]

    X_train = train_df[numeric_features]
    y_train = train_df["Champion"]
    X_test = test_df[numeric_features]

    # Get the true champion team name
    true_champ_row = test_df[test_df["Champion"] == 1]
    if true_champ_row.empty:
        print(f"No champion found for {test_year}, skipping...")
        continue
    true_champ_team = true_champ_row.iloc[0]["Team"]

    for model_name, clf in models.items():
        pipeline = Pipeline([
            ("scaler", StandardScaler()),
            ("clf", clf)
        ])
        pipeline.fit(X_train, y_train)

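        # Predict championship probabilities and normalize them to sum to 1 within the season
        probs = pipeline.predict_proba(X_test)[:, 1]
        test_df_copy = test_df.copy()
        # Guard against a 0/0 division when a model gives every team zero probability
        test_df_copy["Predicted_Prob"] = probs / probs.sum() if probs.sum() > 0 else probs
        # method="first" breaks ties by row order, so tied teams still get distinct ranks
        test_df_copy["Predicted_Rank"] = test_df_copy["Predicted_Prob"].rank(ascending=False, method="first")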

        # Get champion's predicted rank
        champ_rank = test_df_copy[test_df_copy["Team"] == true_champ_team]["Predicted_Rank"].values[0]
        champion_ranks[model_name].append(champ_rank)
        print(f"{model_name}: '{true_champ_team}' ranked #{int(champ_rank)}")
        
# Calculate average rank of champions per model
print("\nAverage Rank of Champions Across Seasons:")
avg_ranks = {}
for model_name, ranks in champion_ranks.items():
    avg = np.mean(ranks)
    avg_ranks[model_name] = avg
    print(f"{model_name}: Average Rank = {avg:.2f}")

# Identify the best model
best_model = min(avg_ranks, key=avg_ranks.get)
print(f"\nBest Model: {best_model} (Lowest Avg Champion Rank)")


Evaluating on 2004 season...
LogisticRegression: 'Detroit Pistons' ranked #5
RandomForest: 'Detroit Pistons' ranked #4
GradientBoosting: 'Detroit Pistons' ranked #5
KNeighbors: 'Detroit Pistons' ranked #2
SVC: 'Detroit Pistons' ranked #5

Evaluating on 2005 season...
LogisticRegression: 'San Antonio Spurs' ranked #2
RandomForest: 'San Antonio Spurs' ranked #1
GradientBoosting: 'San Antonio Spurs' ranked #2
KNeighbors: 'San Antonio Spurs' ranked #3
SVC: 'San Antonio Spurs' ranked #3

Evaluating on 2006 season...
LogisticRegression: 'Miami Heat' ranked #4
RandomForest: 'Miami Heat' ranked #5
GradientBoosting: 'Miami Heat' ranked #5
KNeighbors: 'Miami Heat' ranked #6
SVC: 'Miami Heat' ranked #5

Evaluating on 2007 season...
LogisticRegression: 'San Antonio Spurs' ranked #1
RandomForest: 'San Antonio Spurs' ranked #1
GradientBoosting: 'San Antonio Spurs' ranked #1
KNeighbors: 'San Antonio Spurs' ranked #1
SVC: 'San Antonio Spurs' ranked #1

Evaluating on 2008 season...
LogisticRegression:
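
The selection rule above uses only the average champion rank. As a hedged sketch, the same per-season ranks could also be summarized by how often the champion is the top pick, how often it lands in the top three, and by mean reciprocal rank. `ranking_metrics` is a hypothetical helper, not part of the notebook; the example ranks are the LogisticRegression values for 2004-2007 copied from the output above.

```python
def ranking_metrics(ranks):
    """Summarize champion ranks (1 = the model's top pick) across seasons."""
    n = len(ranks)
    return {
        "mean_rank": sum(ranks) / n,
        "top1_rate": sum(r == 1 for r in ranks) / n,
        "top3_rate": sum(r <= 3 for r in ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank
    }


# LogisticRegression champion ranks for 2004-2007, copied from the printed output.
logreg_ranks = [5, 2, 4, 1]
metrics = ranking_metrics(logreg_ranks)
print(metrics)
```

In the notebook, the helper could be applied to each list in `champion_ranks` to compare models on more than one criterion.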

## Results on Test Set
Describe how you evaluated your final model using the test set. Include key performance metrics and relevant visualizations (e.g., ROC curves, confusion matrices, graphs, summary statistics, accuracy, etc.).

In [21]:
# Refit the selected model on all 2004-2024 seasons before scoring the held-out test seasons
best_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", models[best_model])
])
best_pipeline.fit(df[numeric_features], df["Champion"])

test_folder = "test"
bulls_ranks = []

# Held-out test seasons: the six years in which the Chicago Bulls won the title
for year in [1991, 1992, 1993, 1996, 1997, 1998]:
    test_path = os.path.join(test_folder, f"{year}_merged.csv")
    df_test = pd.read_csv(test_path)
    X_test = df_test[numeric_features]

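    # Predict and normalize so the probabilities sum to 1 within the season
    probs = best_pipeline.predict_proba(X_test)[:, 1]
    # Guard against a 0/0 division when every team gets zero probability
    df_test["Predicted_Prob"] = probs / probs.sum() if probs.sum() > 0 else probs
    df_test["Predicted_Rank"] = df_test["Predicted_Prob"].rank(ascending=False, method="first")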

    # Track Bulls' rank
    bulls_row = df_test[df_test["Team"].str.strip() == "Chicago Bulls"]
    if not bulls_row.empty:
        bulls_rank = int(bulls_row["Predicted_Rank"].values[0])
        bulls_ranks.append((year, bulls_rank))
        print(f"{year}: Chicago Bulls ranked #{bulls_rank}")
    else:
        print(f"{year}: Chicago Bulls not found.")

# Summary
avg_bulls_rank = np.mean([rank for _, rank in bulls_ranks])
print(f"\nAverage Bulls Rank (1991–1998): {avg_bulls_rank:.2f}")

1991: Chicago Bulls ranked #2
1992: Chicago Bulls ranked #1
1993: Chicago Bulls ranked #2
1996: Chicago Bulls ranked #1
1997: Chicago Bulls ranked #1
1998: Chicago Bulls ranked #1

Average Bulls Rank (1991–1998): 1.33
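
To put the average rank in context, it can be compared with what a random ordering of teams would give: if the champion were placed uniformly at random among n teams, its expected rank would be (n + 1) / 2. The sketch below is illustrative only. `random_baseline_rank` is a hypothetical helper, and the league size of 27 is an assumption for the example, not a value read from the test files.

```python
def random_baseline_rank(n_teams):
    """Expected rank of the champion if teams were ordered uniformly at random."""
    return (n_teams + 1) / 2


# Bulls ranks copied from the printed output above.
bulls_ranks = [2, 1, 2, 1, 1, 1]
avg_rank = sum(bulls_ranks) / len(bulls_ranks)

# ASSUMPTION: 27 teams is an illustrative league size, not taken from the test data.
baseline = random_baseline_rank(27)
print(f"model: {avg_rank:.2f}, random baseline: {baseline:.1f}")
```

In the notebook, `len(df_test)` for each season would give the actual number of teams to use in place of the assumed value.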


## Discussion / Conclusion
Discuss the outcomes of your project. Summarize your key findings, interpret the results, and offer reflections on what worked well and what could be improved in future iterations of your work.

## Disclosures
Include a brief description of your use of ChatGPT or any other AI tools to help with your project, if applicable.