# Homework Four
For this homework I wanted to keep working with decision trees, but explore them further this time. I want to see whether AdaBoost and gradient boosting can use LeBron's in-game statistics to predict whether his team wins.

## Collecting the data
In a previous homework, I wrote a script that pulls LeBron's game logs for the last five seasons from the NBA API and stores them in a .csv file:

In [None]:
from nba_api.stats.endpoints import playergamelog
from nba_api.stats.static import players
import pandas as pd
import os

lebron = players.find_players_by_full_name("LeBron James")[0]
lebron_id = lebron['id']

seasons = ['2020-21', '2021-22', '2022-23', '2023-24', '2024-25']
all_gamelogs = pd.DataFrame()

for season in seasons:
    gamelogs = playergamelog.PlayerGameLog(player_id=lebron_id, season=season)
    gamelogs_df = gamelogs.get_data_frames()[0]
    gamelogs_df['SEASON'] = season
    all_gamelogs = pd.concat([all_gamelogs, gamelogs_df], ignore_index=True)

data_directory = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'data', 'lebron')
output_file = os.path.join(data_directory, 'lebron_gamelogs_last_five.csv')

all_gamelogs.to_csv(output_file, index=False)
print("Data saved!")

print(all_gamelogs.head())

I will be using this dataset once again for this homework.

## Basic Decision Tree
First, I built a basic decision tree on LeBron's in-game stats to see how it would perform:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import os

data_directory = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'data', 'lebron')
data_file = os.path.join(data_directory, 'lebron_gamelogs_last_five.csv')
data = pd.read_csv(data_file)

data['WIN'] = data['WL'].apply(lambda x: 1 if x == 'W' else 0)

features = ['PTS', 'FG_PCT', 'FG3_PCT', 'AST', 'REB', 'TOV', 'STL', 'MIN']
X = data[features]
y = data['WIN']

trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2, random_state=50)

dt = DecisionTreeClassifier(random_state=None, max_depth=None, min_samples_leaf=1, 
                            min_samples_split=2, criterion='entropy', class_weight=None, min_weight_fraction_leaf=0.0,
                            splitter='best')

dt.fit(trainX, trainY)
print(dt.score(testX, testY))

0.5714285714285714

Not a great model. Let's try a Random Forest on this dataset instead, this time with a smaller feature set.

In [None]:
from sklearn.ensemble import RandomForestClassifier

features = ['PTS', 'FG_PCT', 'FG3_PCT', 'AST', 'MIN']
X = data[features]
y = data['WIN']

trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2, random_state=50)

rf = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None,
                            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
                            min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
                            oob_score=False, random_state=None, verbose=0, warm_start=False)

rf.fit(trainX, trainY)
print(rf.score(testX, testY))

0.5892857142857143

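Still not that good: going from 0.571 to 0.589 is a single extra correct prediction on a 56-game test set. That is fine, though. It gives us better insight into how LeBron's offensive performance impacts winning, and it looks like his individual numbers alone don't tell us much.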
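
To put accuracies like these in context, it helps to compare them with the accuracy of always predicting the more common outcome. The sketch below is only an illustration: it assumes a hypothetical list of win/loss labels (not the real game logs) and computes that majority-class baseline in plain Python.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical test labels: 1 = win, 0 = loss (illustrative only)
example_test_labels = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]

baseline = majority_baseline_accuracy(example_test_labels)
print(f"Majority-class baseline: {baseline:.2f}")
```

A classifier is only informative if it beats this number on the same test set.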

## Boosting
Next, I tried AdaBoost to see whether boosting would improve the model:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import os

data_directory = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'data', 'lebron')
data_file = os.path.join(data_directory, 'lebron_gamelogs_last_five.csv')
data = pd.read_csv(data_file)

data['WIN'] = data['WL'].apply(lambda x: 1 if x == 'W' else 0)
features = ['PTS', 'FG_PCT', 'FG3_PCT', 'AST', 'MIN']
X = data[features]
y = data['WIN']

trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2, random_state=50)

base_estimator = DecisionTreeClassifier(
    max_depth=6,
    random_state=None
)

ada = AdaBoostClassifier(
    estimator=base_estimator,
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)
ada.fit(trainX, trainY)

test_accuracy = ada.score(testX, testY)
print(f"Testing Accuracy: {test_accuracy:.2f}")

Testing Accuracy: 0.54

It does not look like any kind of tree is going to predict whether LeBron's team won based on his performance, but why is that? For starters, LeBron is typically the best player on his team, and the best players tend to perform well no matter the situation. Basketball is a team sport, and you can't rely on one player to carry your team.
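
Because each score above comes from a single train/test split, small differences between models may just be noise. One way to get a steadier estimate is k-fold cross-validation. The sketch below is a minimal illustration on randomly generated data (the arrays `X_demo` and `y_demo` are made up and stand in for the real features and `WIN` column), using scikit-learn's `cross_val_score` with an AdaBoost classifier on its default base estimator.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 "games", 5 features, random win/loss labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.integers(0, 2, size=200)

clf = AdaBoostClassifier(n_estimators=50, random_state=42)

# Five folds: each game is used for testing exactly once
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```

With random labels the mean should hover around 0.5; on the real data, the spread across folds gives a sense of how much a single split can move the score.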

## Using Gradient Boosting to Predict Points
The first thing I had to do here was add some fields to my data. I wanted to see if I could predict LeBron's points in a game from whether the game was home or away and a five-game rolling average of his points. To do this I had to modify the imported dataset:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import os
data_directory = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(__file__))), 'data', 'lebron')
data_file = os.path.join(data_directory, 'lebron_gamelogs_last_five.csv')
data = pd.read_csv(data_file)

data['GAME_DATE'] = pd.to_datetime(data['GAME_DATE'], format='%b %d, %Y')

data['WIN'] = data['WL'].apply(lambda x: 1 if x == 'W' else 0)
data['HOME'] = data['MATCHUP'].apply(lambda x: 1 if "vs" in x else 0)

data.sort_values(['SEASON', 'GAME_DATE'], inplace=True)

def calculate_avg_pts_last_5(season_df):
    season_df['AVG_PTS_LAST_5'] = season_df['PTS'].rolling(window=5, min_periods=5).mean()
    return season_df.iloc[5:]

data = (
    data.groupby('SEASON', group_keys=False, as_index=False)
    .apply(lambda df: calculate_avg_pts_last_5(df), include_groups=False)
    .reset_index(drop=True)
)
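
One thing to watch with a rolling feature like this: `rolling(window=5)` includes the current row, so the average contains the very points total the model is trying to predict. A possible variant, sketched here on a tiny made-up table rather than the real game logs, shifts the series by one game first so that only previous games feed the average. The column name `AVG_PTS_PREV_5` is hypothetical.

```python
import pandas as pd

# Made-up single-season example (illustrative only)
games = pd.DataFrame({
    "SEASON": ["A"] * 7,
    "PTS": [10, 20, 30, 40, 50, 60, 70],
})

# shift(1) drops the current game from the window, so each row
# only sees the five games played before it
games["AVG_PTS_PREV_5"] = games.groupby("SEASON")["PTS"].transform(
    lambda s: s.shift(1).rolling(window=5, min_periods=5).mean()
)

print(games)
```

The first five games of each season end up as NaN, since they do not have five earlier games to average, and would need to be dropped before training.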

Using the default hyperparameters, I ran the model like this:

In [None]:
X = data[['HOME', 'AVG_PTS_LAST_5']]
y = data['PTS']

trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2, random_state=None)
dt = GradientBoostingRegressor(random_state=None)
dt.fit(trainX, trainY)
predY = dt.predict(testX)

mae = mean_absolute_error(testY, predY)
mse = mean_squared_error(testY, predY)
rmse = mse ** 0.5  # RMSE is the square root of the MSE

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Mean Absolute Error (MAE): 5.81
Mean Squared Error (MSE): 53.00
Root Mean Squared Error (RMSE): 7.28

That mean absolute error isn't terrible, but an RMSE of about 7.28 points leaves room for improvement, so I added some more features: LeBron's average against the opponent, and a flag for whether the game was the second night of a back-to-back (two games on consecutive days):

In [None]:
data.sort_values(['SEASON_ID', 'GAME_DATE'], inplace=True)
print(data.head())
def calculate_avg_pts_last_5(season_df):
    season_df['AVG_PTS_LAST_5'] = season_df['PTS'].rolling(window=5, min_periods=5).mean()
    return season_df.iloc[5:]


data = (
    data.groupby('SEASON_ID', group_keys=False, as_index=False)
    .apply(lambda df: calculate_avg_pts_last_5(df))
    .reset_index(drop=True)
)
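data.dropna(subset=['AVG_PTS_LAST_5'], inplace=True)
# Days since the previous game in the same season; a gap of 1 day marks a back-to-back
data['Game_Date_Diff'] = data.groupby('SEASON_ID')['GAME_DATE'].diff().dt.days
data['BACK_TO_BACK'] = data['Game_Date_Diff'].apply(lambda x: 1 if x == 1 else 0)
# The opponent abbreviation is the last token of MATCHUP
data['OPPONENT'] = data['MATCHUP'].apply(lambda x: x.split()[-1])

# Expanding mean of points against each opponent (includes the current game)
data['AVG_PTS_VS_OPPONENT'] = data.groupby('OPPONENT')['PTS'].transform(lambda x: x.expanding().mean())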

data.drop(columns=['Game_Date_Diff', 'OPPONENT'], inplace=True)

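X = data[['HOME', 'AVG_PTS_LAST_5', 'BACK_TO_BACK', 'AVG_PTS_VS_OPPONENT']]
y = data['PTS']
trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingRegressor(random_state=42)
model.fit(trainX, trainY)
predY = model.predict(testX)

mae = mean_absolute_error(testY, predY)
mse = mean_squared_error(testY, predY)
rmse = mse ** 0.5  # RMSE is the square root of the MSE

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Mean Absolute Error (MAE): 5.29
Mean Squared Error (MSE): 42.75
Root Mean Squared Error (RMSE): 6.54

This lowered both errors (MAE from 5.81 to 5.29, MSE from 53.00 to 42.75), but an RMSE of about 6.54 points is still pretty high. Since the earlier run used an unseeded split, part of the difference may also come from the split itself.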
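
The error metrics used in this section are easy to mix up: scikit-learn's `mean_squared_error` returns the MSE, which is in squared points, and the RMSE is its square root, which is back in points and directly comparable to the MAE. The sketch below works through the three metrics by hand on a few made-up point totals (the numbers are hypothetical, not taken from the game logs).

```python
import math


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


# Hypothetical point totals and predictions (illustrative only)
actual_pts = [30, 22, 28, 35]
predicted_pts = [26, 25, 28, 30]

mae, mse, rmse = regression_errors(actual_pts, predicted_pts)
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}")
# prints "MAE: 3.00, MSE: 12.50, RMSE: 3.54"
```

Because squaring weights large misses more heavily, the RMSE is always at least as large as the MAE, and a wide gap between the two points to a few games with very large errors.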