<a href="https://colab.research.google.com/github/dani0621/Project01_ML/blob/main/ADV_CSIII_Project_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NBA Statistics Prediction Model**

### **Imports**

In [172]:
# importing files and libraries, connecting data file from Google Drive
import pandas as pd
import numpy as np
import math

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn import metrics
from google.colab import files
from google.colab import drive

drive.mount("/content/drive")
data=pd.read_csv('/content/drive/My Drive/MLA/team.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### **Clean Data/ Basic Analysis**

In [173]:
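# Keep only the games played on or after 10/19/2018, since only the data from the past few years is relevant to the current win prediction model.
# The dates are parsed first because comparing the raw strings to '10/19/2018' is lexicographic, not chronological.
data_filtered = data[pd.to_datetime(data['date']) >= pd.Timestamp('2018-10-19')]

# drops rows with missing values and duplicate rows
data_clean = data_filtered.dropna().drop_duplicates()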

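# Winning percentage equation from ChatGPT. The NBA winning percentage calculation I found requires point differential, which is a statistic I don't have, so I asked ChatGPT for other ways of calculating win percentage and this is the one it gave me.
# Note: x.index is the row's position in the whole file (the default index from read_csv), not the number of games the team has played so far.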
data['W%'] = data.groupby('team')['win'].transform(lambda x: x.cumsum() / (x.index + 1))

# True Shooting Percentage (formula found on Google by looking up the calculation for true shooting percentage in the NBA); stored in the 'TO%' column
# TS% = (points [PTS] * 100) / (2 * (field goal attempts [FGA] + 0.44 * free throw attempts [FTA]))
data['TO%'] = (data['PTS'] * 100) / (2 * (data['FGA'] + 0.44 * data['FTA']))

data_cleaned = data[['home','away', 'date', 'team','season', 'win', 'PTS', 'REB', 'TOV', '+/-', 'FTM', 'FTA', 'AST', 'BLK', 'TO%', 'W%', 'OREB', 'DREB', 'FG%', 'FT%', '3P%', 'FGA', 'FGM', '3PM', 'STL']]
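
The `W%` line above divides each team's cumulative wins by `x.index + 1`. With the default index from `pd.read_csv`, that is the row's position in the whole file, not the number of games the team has played. Below is a minimal sketch of a per-team running win percentage. The `games` frame and its values are made up for illustration (not taken from `team.csv`), and the rows are assumed to be in chronological order.

```python
import pandas as pd

# Illustrative data only: one row per team per game, in chronological order.
games = pd.DataFrame({
    "team": ["ATL", "BOS", "ATL", "BOS", "ATL"],
    "win":  [1, 0, 0, 1, 1],
})

# Wins accumulated so far by each team (including the current game).
wins_so_far = games.groupby("team")["win"].cumsum()

# Number of games each team has played so far (cumcount starts at 0).
games_so_far = games.groupby("team").cumcount() + 1

games["W%"] = wins_so_far / games_so_far
print(games)
```

In this sketch the running percentage includes the result of the current game. Shifting it by one game per team would be the variant to use if the value is meant to describe a team before tip-off.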

### **Linear Regressions**

#### **Points**

In [174]:
# All of the remaining linear regressions were copied from this model (all AI used here for debugging/errors/corrections applies to the following linear regressions as well)
# performLinRegPTS creates the model that predicts the number of points the home team will score; most of this model came from YT videos that are documented in the README.
def performLinRegPTS(dataframe):
  # factors
  featureColumns = ['W%', 'REB', 'TOV', '+/-', 'TO%', 'FTM', 'FTA', 'AST', 'BLK', 'FT%', 'FG%', 'OREB', '3PM'] #There were bugs with this that I used Gemini to resolve (forgot to include those columns in my data_cleaned so the columns didn't match up and resulted in an error).


  X = dataframe[featureColumns]
  Y = dataframe['PTS']

  # training and testing sets
  X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)

  linreg = LinearRegression()
  linreg.fit(X_train, Y_train)
  Y_pred = linreg.predict(X_test)

  r2 = r2_score(Y_test, Y_pred) # originally used accuracy_score from sklearn.metrics, but that metric is for classification and raised an error on my continuous target; when I asked Gemini, r2_score was the solution it provided
  global global_r2_pts # made global in order to access in print statement later
  global_r2_pts = r2

  # model performance, accuracy, from Jake Kandell's NBA Predictor that is cited in the README
  print("Mean Absolute Error:", metrics.mean_absolute_error(Y_test, Y_pred))
  print("Mean Squared Error:", metrics.mean_squared_error(Y_test, Y_pred))
  print("Root Mean Squared Error:", metrics.root_mean_squared_error(Y_test, Y_pred)) # used Gemini to correct this; my original metrics call raised a warning because the function name was going to change in a future version
  print(f"R-squared: {global_r2_pts}")
  print('----------------------------------')

  # coefficients of factors within the multiple variable linear regression line, also from Jake Kandell's NBA Predictor
  print('Coefficient Information:')
  for i, feature in enumerate(featureColumns):
      print(f"{feature}: {linreg.coef_[i]}")

  return linreg

pt_model = performLinRegPTS(data_cleaned)

Mean Absolute Error: 1.9898205228091077
Mean Squared Error: 7.250260815337064
Root Mean Squared Error: 2.692630835323896
R-squared: 0.961263083479729
----------------------------------
Coefficient Information:
W%: -27.528460217467877
REB: 0.11628550664406721
TOV: -0.18538398624432362
+/-: -0.029171984090845926
TO%: -6.7130451084391884
FTM: 4.457002048283626
FTA: -1.6768591350285613
AST: 0.12222758287897297
BLK: -0.002926477926993189
FT%: 0.043276065896767996
FG%: 7.552358688809156
OREB: 0.1834352913344229
3PM: 4.723191080683752
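
The four numbers printed above are all computed from the same test-set errors. As a rough illustration of how they relate, the sketch below computes them by hand with NumPy on four made-up scores. `y_true` and `y_pred` are invented values, not predictions from `pt_model`.

```python
import numpy as np

# Made-up actual and predicted point totals for four games.
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 123.0, 129.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average size of a miss, in points
mse = np.mean(errors ** 2)      # average squared miss
rmse = np.sqrt(mse)             # back in points; penalizes large misses more than MAE

# R-squared: share of the variance in y_true that the predictions account for.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R-squared={r2:.3f}")
```

The same relationship holds in the output above: the reported root mean squared error is the square root of the reported mean squared error.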


#### **Rebounds**

In [175]:
# performLinRegRB creates the model that predicts the number of rebounds the home team will make
def performLinRegRB(dataframe):
  # factors
  featureColumns = ['OREB', 'DREB','FG%', 'PTS', 'FT%', '+/-', 'BLK', '3P%']

  X = dataframe[featureColumns]
  Y = dataframe['REB']

  # training and testing sets
  X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)

  linreg = LinearRegression()
  linreg.fit(X_train, Y_train)
  Y_pred = linreg.predict(X_test)
  r2 = r2_score(Y_test, Y_pred)
  global global_r2_rb
  global_r2_rb = r2

  # model performance, accuracy
  print("Mean Absolute Error:", metrics.mean_absolute_error(Y_test, Y_pred))
  print("Mean Squared Error:", metrics.mean_squared_error(Y_test, Y_pred))
  print("Root Mean Squared Error:", metrics.root_mean_squared_error(Y_test, Y_pred))
  print(f"R-squared: {global_r2_rb}")
  print('----------------------------------')

  # coefficients of factors within the multiple variable linear regression line
  print('Coefficient Information:')
  for i, feature in enumerate(featureColumns):
      print(f"{feature}: {linreg.coef_[i]}")

  return linreg

rb_model = performLinRegRB(data_cleaned)

Mean Absolute Error: 6.780843535893902e-15
Mean Squared Error: 8.267029139530236e-29
Root Mean Squared Error: 9.092320462637816e-15
R-squared: 1.0
----------------------------------
Coefficient Information:
OREB: 0.999999999999999
DREB: 0.9999999999999996
FG%: 0.0
PTS: 4.996003610813204e-16
FT%: -2.220446049250313e-16
+/-: 1.5265566588595902e-16
BLK: 7.91033905045424e-16
3P%: -2.220446049250313e-16
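
An R-squared of exactly 1.0, with coefficients of 1 on `OREB` and `DREB` and roughly 0 on everything else, is what a linear regression produces when the target is the sum of two of its features. That matches the usual box-score definition of total rebounds, which is assumed here and not checked against `team.csv`. The sketch below shows the same effect on synthetic data; all values are randomly generated for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic box-score columns (illustrative ranges only).
oreb = rng.integers(5, 16, size=200)
dreb = rng.integers(25, 41, size=200)
pts = rng.integers(90, 131, size=200)

X = np.column_stack([oreb, dreb, pts])
y = oreb + dreb  # target is fully determined by two of the features

model = LinearRegression().fit(X, y)

print("coefficients:", model.coef_)    # close to [1, 1, 0]
print("R-squared:", model.score(X, y))  # close to 1.0
```

A perfect score here says the model has rediscovered an identity rather than learned something predictive, so removing `OREB` and `DREB` from the feature list would give a more informative rebounds model.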


#### **Turnovers**

In [176]:
# performLinRegTO creates the model that predicts the number of turnovers the home team will make
def performLinRegTO(dataframe):
  # factors
  featureColumns = ['AST', 'REB', '+/-', "TO%", "FGA", "3P%", "STL","OREB","DREB", "FG%", "FT%"]

  X = dataframe[featureColumns]
  Y = dataframe['TOV']

  #  training and testing sets
  X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)

  linreg = LinearRegression()
  linreg.fit(X_train, Y_train)
  Y_pred = linreg.predict(X_test)
  r2 = r2_score(Y_test, Y_pred)
  global global_r2_to
  global_r2_to = r2

  # model performance, accuracy
  print("Mean Absolute Error:", metrics.mean_absolute_error(Y_test, Y_pred))
  print("Mean Squared Error:", metrics.mean_squared_error(Y_test, Y_pred))
  print("Root Mean Squared Error:", metrics.root_mean_squared_error(Y_test, Y_pred))
  print(f"R-squared: {global_r2_to}")
  print('----------------------------------')

  # coefficients of factors within the multiple variable linear regression line
  print('Coefficient Information:')
  for i, feature in enumerate(featureColumns):
      print(f"{feature}: {linreg.coef_[i]}")

  return linreg

to_model = performLinRegTO(data_cleaned)

Mean Absolute Error: 2.4644556403748776
Mean Squared Error: 9.550926586419886
Root Mean Squared Error: 3.0904573425983224
R-squared: 0.3999822296122554
----------------------------------
Coefficient Information:
AST: 0.06297350687186756
REB: 0.29052661346558983
+/-: -0.19311804125435839
TO%: 0.05118309883273594
FGA: -0.3077789049212875
3P%: 0.025861892263149625
STL: 0.47222723380966963
OREB: 0.18169449334830737
DREB: 0.10883212011728206
FG%: 0.19004771803722326
FT%: 0.027874930966953468


#### **Assists**

In [177]:
# performLinRegAS creates the model that predicts the number of assists the home team will make
def performLinRegAS(dataframe):
  #factors
  featureColumns = ['PTS', 'FGM', 'TOV', '+/-']

  X = dataframe[featureColumns]
  Y = dataframe['AST']

  # training and testing sets
  X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)

  linreg = LinearRegression()
  linreg.fit(X_train, Y_train)
  Y_pred = linreg.predict(X_test)
  r2 = r2_score(Y_test, Y_pred)
  global global_r2_as
  global_r2_as = r2

  # model performance, accuracy
  print("Mean Absolute Error:", metrics.mean_absolute_error(Y_test, Y_pred))
  print("Mean Squared Error:", metrics.mean_squared_error(Y_test, Y_pred))
  print("Root Mean Squared Error:", metrics.root_mean_squared_error(Y_test, Y_pred))
  print(f"R-squared: {global_r2_as}")
  print('----------------------------------')

  # coefficients of factors within the multiple variable linear regression line
  print('Coefficient Information:')
  for i, feature in enumerate(featureColumns):
      print(f"{feature}: {linreg.coef_[i]}")

  return linreg

as_model = performLinRegAS(data_cleaned)

Mean Absolute Error: 3.1088928401598808
Mean Squared Error: 15.079532864267359
Root Mean Squared Error: 3.8832374205380953
R-squared: 0.46285466635412076
----------------------------------
Coefficient Information:
PTS: 0.034039116873522124
FGM: 0.527068780148347
TOV: 0.07289457262179229
+/-: 0.04540861334488039


#### **Free Throws Made**

In [178]:
# performLinRegFT creates the model that predicts the number of free throws the home team will make
def performLinRegFT(dataframe):
  # factors
  featureColumns = ['PTS', 'FGA', 'REB']

  X = dataframe[featureColumns]
  Y = dataframe['FTM']

  # training and testing sets
  X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)

  linreg = LinearRegression()
  linreg.fit(X_train, Y_train)
  Y_pred = linreg.predict(X_test)
  r2 = r2_score(Y_test, Y_pred)
  global global_r2_ft
  global_r2_ft = r2

  # model performance, accuracy
  print("Mean Absolute Error:", metrics.mean_absolute_error(Y_test, Y_pred))
  print("Mean Squared Error:", metrics.mean_squared_error(Y_test, Y_pred))
  print("Root Mean Squared Error:", metrics.root_mean_squared_error(Y_test, Y_pred))
  print(f"R-squared: {global_r2_ft}")
  print('----------------------------------')

  # coefficients of factors within the multiple variable linear regression line
  print('Coefficient Information:')
  for i, feature in enumerate(featureColumns):
      print(f"{feature}: {linreg.coef_[i]}")

  return linreg

ft_model = performLinRegFT(data_cleaned)

Mean Absolute Error: 4.241471663787593
Mean Squared Error: 28.582730891816823
Root Mean Squared Error: 5.34628196897777
R-squared: 0.2580716288908135
----------------------------------
Coefficient Information:
PTS: 0.20542215863990398
FGA: -0.41380116522957094
REB: 0.19557826793725192


#### **Blocks**

In [179]:
# performLinRegBL creates the model that predicts the number of blocks the home team will make
def performLinRegBL(dataframe):
  # factors
  featureColumns = ['REB', '+/-', 'STL']

  X = dataframe[featureColumns]
  Y = dataframe['BLK']

  # training and testing sets
  X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)

  linreg = LinearRegression()
  linreg.fit(X_train, Y_train)
  Y_pred = linreg.predict(X_test)
  r2 = r2_score(Y_test, Y_pred)
  global global_r2_bl
  global_r2_bl = r2

  # model performance, accuracy
  print("Mean Absolute Error:", metrics.mean_absolute_error(Y_test, Y_pred))
  print("Mean Squared Error:", metrics.mean_squared_error(Y_test, Y_pred))
  print("Root Mean Squared Error:", metrics.root_mean_squared_error(Y_test, Y_pred))
  print(f"R-squared: {global_r2_bl}")
  print('----------------------------------')

  # coefficients of factors within the multiple variable linear regression line
  print('Coefficient Information:')
  for i, feature in enumerate(featureColumns):
      print(f"{feature}: {linreg.coef_[i]}")

  return linreg

bl_model = performLinRegBL(data_cleaned)

Mean Absolute Error: 1.9836677265868443
Mean Squared Error: 6.279345348697958
Root Mean Squared Error: 2.505862196669633
R-squared: 0.04783517099650747
----------------------------------
Coefficient Information:
REB: 0.050033906444684126
+/-: 0.02673897229383953
STL: -0.0015176630248344453
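
One way to read an R-squared this low: because R-squared equals 1 - MSE / Var(Y_test), the reported MSE and R-squared together imply how large the error would be if the model always predicted the average number of blocks. The short calculation below uses only the two numbers printed above.

```python
import math

# Values copied from the Blocks output above.
mse = 6.279345348697958
r2 = 0.04783517099650747

# R-squared = 1 - MSE / variance, so the test-set variance can be recovered.
variance = mse / (1 - r2)

model_rmse = math.sqrt(mse)          # error of the regression
baseline_rmse = math.sqrt(variance)  # error of always predicting the mean

print(f"model RMSE:    {model_rmse:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

The two values come out within about a tenth of a block of each other, so the blocks model is barely better than guessing the average.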


#### **Field Goals Made**

In [180]:
# performLinRegFGM creates the model that predicts the number of field goals the home team will make
def performLinRegFGM(dataframe):
  # factors
  featureColumns = ['FGA', 'FG%', 'PTS','3PM', 'AST', 'TOV', '3P%', 'TO%']

  X = dataframe[featureColumns]
  Y = dataframe['FGM']

  # training and testing sets
  X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)

  linreg = LinearRegression()
  linreg.fit(X_train, Y_train)
  Y_pred = linreg.predict(X_test)
  r2 = r2_score(Y_test, Y_pred)
  global global_r2_fgm
  global_r2_fgm = r2

  # model performance, accuracy
  print("Mean Absolute Error:", metrics.mean_absolute_error(Y_test, Y_pred))
  print("Mean Squared Error:", metrics.mean_squared_error(Y_test, Y_pred))
  print("Root Mean Squared Error:", metrics.root_mean_squared_error(Y_test, Y_pred))
  print(f"R-squared: {global_r2_fgm}")
  print('----------------------------------')

  # coefficients of factors within the multiple variable linear regression line
  print('Coefficient Information:')
  for i, feature in enumerate(featureColumns):
      print(f"{feature}: {linreg.coef_[i]}")

  return linreg

fgm_model = performLinRegFGM(data_cleaned)

Mean Absolute Error: 0.3032681089186106
Mean Squared Error: 0.20843469918927887
Root Mean Squared Error: 0.4565464918157612
R-squared: 0.9932721478283176
----------------------------------
Coefficient Information:
FGA: 0.4072944419965778
FG%: 0.8581281406828246
PTS: 0.03716008890695488
3PM: 0.03834974393248469
AST: 0.007622021724911665
TOV: -0.0017173140660781683
3P%: -0.0030041086041882603
TO%: -0.11379529654977974
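
The seven functions above differ only in their feature list, target column, and global variable name. As a possible refactor, the sketch below folds them into one helper that returns the R-squared instead of storing it in a global. `fit_linear_model`, its `random_state` default, and the `toy` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_linear_model(dataframe, feature_columns, target, random_state=0):
    """Fit a linear regression and return (model, test-set R-squared)."""
    X = dataframe[feature_columns]
    y = dataframe[target]
    # A fixed random_state makes the split, and so the metrics, repeatable.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, shuffle=True, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    return model, r2


# Synthetic stand-in for data_cleaned (illustrative values only).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "FGA": rng.integers(80, 100, 300),
    "FG%": rng.uniform(40, 55, 300),
})
toy["FGM"] = toy["FGA"] * toy["FG%"] / 100

fgm_model_sketch, fgm_r2 = fit_linear_model(toy, ["FGA", "FG%"], "FGM")
r2_scores = {"FGM": fgm_r2}  # a dict can replace the global_r2_* variables
print(r2_scores)
```

With a helper like this, each model becomes a single call, and the R-squared values can be collected in a dictionary keyed by statistic for the summary printout.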


### **Converting Abbreviations to Full Names**

In [181]:
# converts the inputted abbreviation to the team's full name; used ChatGPT to generate the list of full names for all of the abbreviations
team_names = {
  'ATL': 'Atlanta Hawks',
  'BKN': 'Brooklyn Nets',
  'BOS': 'Boston Celtics',
  'CHA': 'Charlotte Hornets',
  'CHI': 'Chicago Bulls',
  'CLE': 'Cleveland Cavaliers',
  'DAL': 'Dallas Mavericks',
  'DEN': 'Denver Nuggets',
  'DET': 'Detroit Pistons',
  'GSW': 'Golden State Warriors',
  'HOU': 'Houston Rockets',
  'IND': 'Indiana Pacers',
  'LAC': 'Los Angeles Clippers',
  'LAL': 'Los Angeles Lakers',
  'MEM': 'Memphis Grizzlies',
  'MIA': 'Miami Heat',
  'MIL': 'Milwaukee Bucks',
  'MIN': 'Minnesota Timberwolves',
  'NOP': 'New Orleans Pelicans',
  'NYK': 'New York Knicks',
  'OKC': 'Oklahoma City Thunder',
  'ORL': 'Orlando Magic',
  'PHI': 'Philadelphia 76ers',
  'PHX': 'Phoenix Suns',
  'POR': 'Portland Trail Blazers',
  'SAC': 'Sacramento Kings',
  'SAS': 'San Antonio Spurs',
  'TOR': 'Toronto Raptors',
  'UTA': 'Utah Jazz',
  'WAS': 'Washington Wizards'
  }

def get_full_team_name(team_abbreviation):
  # abbreviation -> full name
  return team_names.get(team_abbreviation)
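
`dict.get` returns `None` for a key that is not in the dictionary, so an abbreviation missing from `team_names` would show up as "None" in the prediction messages. A minimal sketch of a variant that falls back to the abbreviation itself is shown below. `get_full_team_name_or_code` is a hypothetical name, and the two-entry dictionary is a shortened copy for illustration.

```python
# Shortened copy of the lookup table, for illustration only.
team_names = {
    'ATL': 'Atlanta Hawks',
    'BOS': 'Boston Celtics',
}


def get_full_team_name_or_code(team_abbreviation):
    # The second argument to dict.get is returned when the key is missing.
    return team_names.get(team_abbreviation, team_abbreviation)


print(get_full_team_name_or_code('ATL'))
print(get_full_team_name_or_code('XYZ'))
```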

### **Prediction for One Game**

In [182]:
# takes user input of the game to predict using the home team, the away team, the date of the game, and the season the game was played
# prints the predicted score for both teams along with the other predicted statistics for the home team
def predict_game_performance(home_team, away_team, game_date, season, dataframe):
  # makes the date format consistent with the file; Gemini fixed a previous issue so that the game can be located even if the date is formatted slightly differently
  game_date = pd.to_datetime(game_date)
  dataframe = dataframe.assign(date=pd.to_datetime(dataframe['date']))
  formatted_date = game_date.strftime('%m-%d-%Y')

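  # gets stats for home and away team on that game day (note: the home filter matches on the 'home' column only, so if the file stores one row per team per game it can return both teams' rows)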
  game_data = dataframe[(dataframe['home'] == home_team) & (dataframe['date'] == game_date) & (dataframe['season'] == season)]
  opp_data = dataframe[(dataframe['team'] == away_team) & (dataframe['date'] == game_date) & (dataframe['season'] == season)]

  if game_data.empty or opp_data.empty:
      print(f"No data available for the game between the {get_full_team_name(home_team)} and the {get_full_team_name(away_team)} on {formatted_date}.")
      return

  # specifying the factors for each linear regression model (I didn't use Gemini to write this, but the prediction lines below were raising column errors because each model expects only the columns it was trained on, as Gemini explained, so I listed the columns separately for each model)
  pt_featureColumns = ['W%', 'REB', 'TOV', '+/-', 'TO%', 'FTM', 'FTA', 'AST', 'BLK', 'FT%', 'FG%', 'OREB', '3PM']
  rb_featureColumns = ['OREB', 'DREB','FG%', 'PTS', 'FT%', '+/-', 'BLK', '3P%']
  to_featureColumns = ['AST', 'REB', '+/-', "TO%", "FGA", "3P%", "STL","OREB","DREB", "FG%", "FT%"]
  as_featureColumns = ['PTS', 'FGM', 'TOV', '+/-']
  ft_featureColumns = ['PTS', 'FGA', 'REB']
  bl_featureColumns = ['REB', '+/-', 'STL']
  fgm_featureColumns = ['FGA', 'FG%', 'PTS','3PM', 'AST', 'TOV', '3P%', 'TO%']


  # predict the stats for the home team
  pt_stats_pred = pt_model.predict(game_data[pt_featureColumns])
  rb_stats_pred = rb_model.predict(game_data[rb_featureColumns])
  to_stats_pred = to_model.predict(game_data[to_featureColumns])
  as_stats_pred = as_model.predict(game_data[as_featureColumns])
  ft_stats_pred = ft_model.predict(game_data[ft_featureColumns])
  bl_stats_pred = bl_model.predict(game_data[bl_featureColumns])
  fgm_stats_pred = fgm_model.predict(game_data[fgm_featureColumns])
  opp_stats_pred = pt_model.predict(opp_data[pt_featureColumns])


  # predictions for all variables
  points_pred = pt_stats_pred[0]
  rebounds_pred = rb_stats_pred[0]
  turnovers_pred = to_stats_pred[0]
  assists_pred = as_stats_pred[0]
  free_throws_pred = ft_stats_pred[0]
  blocks_pred = bl_stats_pred[0]
  fgm_pred = fgm_stats_pred[0]
  opp_pts_pred= opp_stats_pred[0]

  # rounding points to the nearest integer
  round_pts_home = round(points_pred)
  round_pts_away = round(opp_pts_pred)

  # printing prediction results for all linear regression predictions (points, rebounds, turnovers, assists, free throws, blocks, and field goals made)
  if round_pts_home > round_pts_away:
    print(f'On {formatted_date}, the {get_full_team_name(home_team)} (home team) will defeat the {get_full_team_name(away_team)} (away team) by scoring {round(points_pred)} points to {round(opp_pts_pred)} with the following statistics:')
  elif np.equal(round_pts_home,round_pts_away):
    print(f'On {formatted_date}, the {get_full_team_name(home_team)} (home team) will tie the {get_full_team_name(away_team)} (away team) by both scoring {round(points_pred)} points with the following statistics:')
  else:
    print(f'On {formatted_date}, the {get_full_team_name(home_team)} (home team) will lose to the {get_full_team_name(away_team)} (away team) by scoring {round(points_pred)} points to {round(opp_pts_pred)} with the following statistics:')

  print(f'- {round(rebounds_pred)} rebounds')
  print(f'- {round(turnovers_pred)} turnovers')
  print(f'- {round(assists_pred)} assists')
  print(f'- {round(free_throws_pred)} free throws')
  print(f'- {round(blocks_pred)} blocks')
  print(f'- {round(fgm_pred)} field goals made')


# Print statements for all inputs
print('This NBA Predictor model predicts the scores of the two NBA teams the user inputs on the date of the game (also user input) along with \nother basketball statistics for the home team. The capability of the linear regression models to explain the data for each statistic \nis as follows (scale of 0-1, where 0 means the model explains none of the variance and 1 means the model explains all of the variance):\n')
print(f'- Points: {global_r2_pts:.6f}')
print(f'- Rebounds: {global_r2_rb:.6f}')
print(f'- Turnovers: {global_r2_to:.6f}')
print(f'- Assists: {global_r2_as:.6f}')
print(f'- Free Throws Made: {global_r2_ft:.6f}')
print(f'- Blocks: {global_r2_bl:.6f}')
print(f'- Field Goals Made: {global_r2_fgm:.6f}')
print('\n----------------------------------')
print('Predictions:')
# predicting for one game (user entered home team, away team, date of game, and season)
predict_game_performance('NOP', 'PHX', '2022-04-22', 2022, data_cleaned)
print('\n')
predict_game_performance('CLE', 'WAS', '2021-11-10', 2022, data_cleaned)
print('\n')
predict_game_performance('DEN', 'MIN', '2024-05-14', 2024, data_cleaned)
print('\n')
predict_game_performance('DAL', 'IND', '2022-03-02', 2022, data_cleaned)

This NBA Predictor model predicts the scores of the two NBA teams the user inputs on the date of the game (also user input) along with 
other basketball statistics for the home team. The capability of the linear regression models to explain the data for each statistic 
is as follows (scale of 0-1, where 0 means the model explains none of the variance and 1 means the model explains all of the variance):

- Points: 0.961263
- Rebounds: 1.000000
- Turnovers: 0.399982
- Assists: 0.462855
- Free Throws Made: 0.258072
- Blocks: 0.047835
- Field Goals Made: 0.993272

----------------------------------
Predictions:
On 04-22-2022, the New Orleans Pelicans (home team) will tie the Phoenix Suns (away team) by both scoring 112 points with the following statistics:
- 45 rebounds
- 16 turnovers
- 23 assists
- 22 free throws
- 5 blocks
- 38 field goals made


On 11-10-2021, the Cleveland Cavaliers (home team) will lose to the Washington Wizards (away team) by scoring 95 points to 100 with the fo