# Project Title and Authors
Project Title: NBA Predictor
Authors: Sam Motto and Alex Lehman

## Introduction
"Introduce the regression or classification task your group has chosen. Specify clearly whether it is a regression or a classification problem. Provide a reference or note the source of your dataset (e.g., UCI, Kaggle, Google’s Dataset Search, etc.)."



In [3]:
import pandas as pd
import re
import time
import os

## Methods: Data
"Describe your dataset in detail. Include its source, key features, and any important contextual information. Explain the data cleaning or pre-processing steps and how you split the data into training/validation and test sets."

In [4]:
# --------------------------
# Helper: Clean Team Name
# --------------------------
def clean_team_name(name):
    return re.sub(r'[^\w\s]', '', name).strip()

# --------------------------
# Fetch and Process Tables
# --------------------------
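def fetch_table(url, table_id):
    """Read one stats table from a season page and return it as a cleaned DataFrame."""
    # The advanced table has a grouping row above the column names,
    # so its real header is the second row.
    header = 1 if table_id == 'advanced-team' else 0
    df = pd.read_html(url, header=header, attrs={'id': table_id})[0]
    # Drop spacer columns that contain no values
    df = df.dropna(axis=1, how='all')

    # Drop rank and arena/attendance columns if present
    df = df.drop(columns=["Rk", "Arena", "Attend.", "Attend./G"], errors="ignore")

    if "Team" in df.columns:
        df["Team"] = df["Team"].astype(str).apply(clean_team_name)

    if table_id == 'advanced-team':
        # Drop the final row (assumed to be a league summary row, not a team)
        df = df[:-1].reset_index(drop=True)
        # pandas suffixes repeated column names with ".1"; the first set is
        # labeled offense and the second defense
        rename_map = {
            "eFG%": "Off_eFG%",
            "TOV%": "Off_TOV%",
            "FT/FGA": "Off_FT/FGA",
            "eFG%.1": "Def_eFG%",
            "TOV%.1": "Def_TOV%",
            "FT/FGA.1": "Def_FT/FGA"
        }
        df = df.rename(columns=rename_map)

    return df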

# --------------------------
# Download and Combine Data
# --------------------------
base_url = "https://www.basketball-reference.com/leagues/NBA_{}.html"
os.makedirs("data", exist_ok=True)

for year in range(1990, 1995):
    out_path = f"data/{year}_merged.csv"

    # NOTE: We have already provided the data from 2004-2024.
    # Skip seasons that are already on disk: fetching too many tables
    # at once gets you blacklisted for being a bot.
    if os.path.exists(out_path):
        print(f"Skipping {year}: already have data.")
        continue

    print(f"Processing {year} season...")
    url = base_url.format(year)

    try:
        df_advanced = fetch_table(url, "advanced-team")
        df_perposs = fetch_table(url, "per_poss-team")
    except Exception as e:
        print(f"Failed to fetch data for {year}: {e}")
        time.sleep(1)  # still pause after a failed request
        continue

    merged = pd.merge(df_advanced, df_perposs, on="Team", suffixes=("_adv", "_poss"))
    merged["Year"] = year
    merged.to_csv(out_path, index=False)
    print(f"Saved merged data to {out_path}")

    time.sleep(1)

Processing 1990 season...
Failed to fetch data for 1990: Missing optional dependency 'lxml'.  Use pip or conda to install lxml.
Processing 1991 season...
Failed to fetch data for 1991: Missing optional dependency 'lxml'.  Use pip or conda to install lxml.
Processing 1992 season...
Failed to fetch data for 1992: Missing optional dependency 'lxml'.  Use pip or conda to install lxml.
Processing 1993 season...
Failed to fetch data for 1993: Missing optional dependency 'lxml'.  Use pip or conda to install lxml.
Processing 1994 season...
Failed to fetch data for 1994: Missing optional dependency 'lxml'.  Use pip or conda to install lxml.
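
Every request above failed for the same reason: `pd.read_html` could not import the `lxml` parser. A minimal sketch of a pre-flight check that could run before the download loop is shown here. The helper name `missing_modules` is hypothetical, and the module list is an assumption based only on the error message above.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# "lxml" comes from the error message above; extend the list as needed.
missing = missing_modules(["pandas", "lxml"])
if missing:
    print("Install before scraping:", ", ".join(missing))
else:
    print("All scraping dependencies are available.")
```

Running this first avoids sending requests to the site only to fail on the local import.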


In [8]:
# --------------------------
# Combine All and Add Champion Label
# --------------------------
print("\nMerging all years into one file...")

# Load and concatenate all year files (sorted so the row order is reproducible)
all_years = []
for filename in sorted(os.listdir("data")):
    if filename.endswith("_merged.csv"):
        all_years.append(pd.read_csv(os.path.join("data", filename)))
df = pd.concat(all_years, ignore_index=True)

# Champion team per year
champion_teams = {
    "2004": "Detroit Pistons",
    "2005": "San Antonio Spurs",
    "2006": "Miami Heat",
    "2007": "San Antonio Spurs",
    "2008": "Boston Celtics",
    "2009": "Los Angeles Lakers",
    "2010": "Los Angeles Lakers",
    "2011": "Dallas Mavericks",
    "2012": "Miami Heat",
    "2013": "Miami Heat",
    "2014": "San Antonio Spurs",
    "2015": "Golden State Warriors",
    "2016": "Cleveland Cavaliers",
    "2017": "Golden State Warriors",
    "2018": "Golden State Warriors",
    "2019": "Toronto Raptors",
    "2020": "Los Angeles Lakers",
    "2021": "Milwaukee Bucks",
    "2022": "Golden State Warriors",
    "2023": "Denver Nuggets",
}

# Clean team names just in case
df["Team"] = df["Team"].astype(str).apply(clean_team_name)

# Label champions: 1 if the team matches that year's champion, else 0
# (years missing from champion_teams map to NaN and are labeled 0)
df["Champion"] = (df["Team"] == df["Year"].astype(str).map(champion_teams)).astype(int)

# Save final combined file
df.to_csv("data/combined_data_with_champions.csv", index=False)
print("Saved combined_data_with_champions.csv.")


Merging all years into one file...
Saved combined_data_with_champions.csv.
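
Because the label depends on an exact string match between the cleaned team name and the `champion_teams` entry, a quick sanity check is to count champions per season. The sketch below uses a small illustrative DataFrame in place of the real file; the helper name `champion_counts` and the toy rows are assumptions for demonstration only.

```python
import pandas as pd

def champion_counts(df):
    """Count how many rows are labeled champion in each season."""
    return df.groupby("Year")["Champion"].sum()

# Toy stand-in for combined_data_with_champions.csv (rows are illustrative).
toy = pd.DataFrame({
    "Year": [2004, 2004, 2005, 2005],
    "Team": ["Detroit Pistons", "Los Angeles Lakers",
             "San Antonio Spurs", "Detroit Pistons"],
    "Champion": [1, 0, 1, 0],
})

counts = champion_counts(toy)
bad_years = counts[counts != 1]  # seasons with zero or multiple champions
if bad_years.empty:
    print("Every season has exactly one champion.")
else:
    print(bad_years)
```

On the real data, a season with a count of 0 would point to a name mismatch or to a year that has no entry in `champion_teams`.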


## Methods: Training/Validation
Describe how you performed the training/validation process. Summarize at least five different model variations that you tried (e.g., changes in features, different algorithms, or hyperparameters) and explain your method for model selection.

In [None]:
# put training here

## Results on Test Set
Describe how you evaluated your final model using the test set. Include key performance metrics and relevant visualizations (e.g., ROC curves, confusion matrices, graphs, summary statistics, accuracy, etc.).

In [None]:
# put results on test set here

## Discussion / Conclusion
Discuss the outcomes of your project. Summarize your key findings, interpret the results, and offer reflections on what worked well and what could be improved in future iterations of your work.

## Disclosures
We used ChatGPT to help clean up our code comments, turning our shorthand comments into clearer, more detailed explanations. We also used it for general debugging and for figuring out where the code was breaking.
