<a href="https://colab.research.google.com/github/cbsebastian24/randomStuff/blob/main/module_project_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module Project 5

`nba_api` is a Python library that provides programmatic access to NBA data sourced directly from NBA.com. ("API" stands for "application programming interface" and describes how we can interact with software.) Developed and maintained by the community, it allows users to retrieve a wide range of statistics, game logs, player profiles, team data, shot charts, and advanced metrics through a single Python interface. The library is especially popular among data scientists and basketball analysts for building dashboards, conducting performance analysis, and training predictive models.


In [None]:
#| include: false

import pandas as pd
import numpy as np
import seaborn as sb
import statsmodels.api as sm

!pip install nba_api

Collecting nba_api
  Downloading nba_api-1.11.3-py3-none-any.whl.metadata (5.8 kB)
Downloading nba_api-1.11.3-py3-none-any.whl (318 kB)
Installing collected packages: nba_api
Successfully installed nba_api-1.11.3


## Question 1: Using the Library

Documentation for the library can be found at the [`nba_api` GitHub repository](https://github.com/swar/nba_api/blob/master/docs/table_of_contents.md). We will walk through a few examples.


### Teams and Players

Static (unchanging) data is accessed through the `static` subpackage. These include teams and players.

In [None]:
from nba_api.stats.static import players
from nba_api.stats.static import teams

all_players = pd.DataFrame(players.get_players())
all_teams = pd.DataFrame(teams.get_teams())

* Find the oldest team in the data set. (Hint: recall the use of `.idxmin`).
* What proportion of players in the data set are active?
* Find and display the row for "LeBron James". Save the "id" column's value into a variable `lebron_id`.
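
As a reminder of how `.idxmin` behaves, here is a minimal sketch on a made-up table. The column names and values are hypothetical and are not taken from `all_teams` or `all_players`; the point is only that `.idxmin` returns an index label, which `.loc` can then use to pull the full row.

```python
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration
toy = pd.DataFrame({
    "name": ["Ants", "Bees", "Cats"],
    "founded": [1971, 1950, 1988],
    "is_active": [True, False, True],
})

# .idxmin gives the index label of the smallest value in the column
oldest_label = toy["founded"].idxmin()

# .loc with that label returns the whole row
oldest_row = toy.loc[oldest_label]

# The mean of a boolean column is the proportion of True values
active_share = toy["is_active"].mean()
```

The same pattern should carry over to the real tables once you have checked which columns they contain.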

In [None]:
### solution

In [None]:
#### solution

In [None]:
### solution

### Dynamic Data

An "endpoint" is a URL we can connect to. The `nba_api` package exposes several endpoints from the NBA's API. Unlike the static data, the data behind these endpoints is updated frequently, particularly for the current season, so the same request can return different results from one day to the next. We will focus on older seasons to make our results more consistent, but you could certainly use this interface if you wanted more immediate data.

#### Players in the 2024-25 Season

In [None]:
import nba_api.stats.endpoints.leaguedashplayerstats as playerstats

In [None]:
# Get player stats for the 2024–25 season
response = playerstats.LeagueDashPlayerStats(season='2024-25')
player_stats_2425 = response.get_data_frames()[0] # gives a list with one table in it
player_stats_2425.columns

Index(['PLAYER_ID', 'PLAYER_NAME', 'NICKNAME', 'TEAM_ID', 'TEAM_ABBREVIATION',
       'AGE', 'GP', 'W', 'L', 'W_PCT', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M',
       'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST',
       'TOV', 'STL', 'BLK', 'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS',
       'NBA_FANTASY_PTS', 'DD2', 'TD3', 'WNBA_FANTASY_PTS', 'GP_RANK',
       'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK', 'FGM_RANK', 'FGA_RANK',
       'FG_PCT_RANK', 'FG3M_RANK', 'FG3A_RANK', 'FG3_PCT_RANK', 'FTM_RANK',
       'FTA_RANK', 'FT_PCT_RANK', 'OREB_RANK', 'DREB_RANK', 'REB_RANK',
       'AST_RANK', 'TOV_RANK', 'STL_RANK', 'BLK_RANK', 'BLKA_RANK', 'PF_RANK',
       'PFD_RANK', 'PTS_RANK', 'PLUS_MINUS_RANK', 'NBA_FANTASY_PTS_RANK',
       'DD2_RANK', 'TD3_RANK', 'WNBA_FANTASY_PTS_RANK', 'TEAM_COUNT'],
      dtype='object')

* Which player has played the most games? (See the `GP` column.)
* What are the three teams with the highest median points scored? (See the `TEAM_ABBREVIATION` and `PTS` columns.)
* What was LeBron James' free throw percentage (`FT_PCT`)?
* Plot the distribution of the field goal percentage column (`FG_PCT`) for all players. (A field goal is any made basket other than a free throw, so it covers both two-point and three-point shots.)

In [None]:
### solution

In [None]:
### solution

In [None]:
### solution

In [None]:
### solution

#### Games in the 2024-25 Season

We have records of individual games:

In [None]:
from nba_api.stats.endpoints import leaguegamelog

response = leaguegamelog.LeagueGameLog(season = '2024-25')
gamelog_2425 = response.get_data_frames()[0]
gamelog_2425["GAME_DATE"] = pd.to_datetime(gamelog_2425["GAME_DATE"])
gamelog_2425.columns

Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M',
       'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST',
       'STL', 'BLK', 'TOV', 'PF', 'PTS', 'PLUS_MINUS', 'VIDEO_AVAILABLE'],
      dtype='object')

* Demonstrate that all games include two teams by using the `GAME_ID` column.
* What two teams played in the game with the biggest point spread? (See the `PLUS_MINUS` column.)
* Plot the distribution of the number of games won per team.

In [None]:
### solution

In [None]:
### solution

In [None]:
### solution

## Question 2

### Part a

Graph the column `"FGA"` (field goal attempts: the number of two- and three-point shots a player takes) in the `player_stats_2425` table. Does this plot exhibit any skew? If so, try out the three transformations we have covered (reciprocal, square root, and log) to see which would help address the skew. There are a number of players who did not attempt any field goals, and both the log and reciprocal transformations need strictly positive data. You can replace all zero values with one by appending `.clip(1)` when using the data from that column.

Label your transformed column "TFGA"; we will use it in subsequent sections.
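
To see what `.clip(1)` and the three transformations do, here is a small sketch on made-up counts (the values are hypothetical, not NBA data). It compares the sample skewness before and after each transformation; which one is best for `FGA` is left for you to judge from your own plots.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts, including a zero
counts = pd.Series([0, 1, 2, 3, 5, 8, 40, 200])

# .clip(1) raises every value below 1 up to 1, so zeros become ones
positive = counts.clip(1)

transformed = pd.DataFrame({
    "raw": counts,
    "sqrt": np.sqrt(counts),        # defined at zero, so no clipping needed
    "log": np.log(positive),        # needs strictly positive input
    "reciprocal": 1 / positive,     # needs nonzero input
})

# Sample skewness of each column; values near 0 suggest a roughly symmetric shape
skews = transformed.skew()
```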

In [None]:
### Solution

In [None]:
#### solution

In [None]:
### solution

### Part b



![As Michael Scott says, "You miss 100% of the shots you don't take - Wayne Gretzky."](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*iPquMvzdAlPD3w5ZD24jhA.jpeg)

Are players who take more shots those that also make more shots? Graph the joint distribution of `TFGA` and `FG_PCT`.

In [None]:
### solution

Perform a linear regression of FG_PCT (Y) on TFGA (X).

In [None]:
### Solution

What does the $R^2$ tell you about how well this model works to explain variation in field goal percentage?

**ANSWER**:

### Part c

Using the `sm.nonparametric.lowess` function, fit a smoothed conditional mean model relating FG_PCT to TFGA. Plot the results.

In [None]:
### solution

### Part d

Using your lowess results from the previous part, notice that the first column corresponds to all the TFGA values in the data set ($X_i$), and the second column corresponds to the $\hat Y_i$ values. This table is sorted from smallest to largest on $X_i$. You can reorder the `player_stats_2425` table to be in the same order:

In [None]:
# uncomment after creating the TFGA column
# ps_sorted = player_stats_2425.sort_values("TFGA")
# ps_sorted[["PLAYER_NAME", "TFGA"]].head()

In [None]:
# ps_sorted[["PLAYER_NAME", "TFGA"]].tail()

Compute the $R^2$ for this model by finding the sum of squared residuals using your lowess fit:

$$R^2 = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2 }$$
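
The formula translates directly into array operations. The helper below is a sketch with a hypothetical name (`r_squared`) and made-up numbers; it assumes the observed and fitted values are in the same row order, which is why the sorting step above matters.

```python
import numpy as np

def r_squared(y, y_hat):
    """Return 1 - SSR/SST for observed values y and fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_resid = np.sum((y - y_hat) ** 2)      # numerator: squared residuals
    ss_total = np.sum((y - y.mean()) ** 2)   # denominator: squared deviations from the mean
    return 1 - ss_resid / ss_total

# Made-up numbers: observed values and fitted values in the same row order
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_fit = np.array([1.5, 1.5, 3.5, 3.5])
r2_example = r_squared(y_obs, y_fit)   # SSR = 1, SST = 5, so this is 0.8
```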

In [None]:
### solution

Comment on whether this model shows improvement over the linear regression model.

**Answer:**

### Part e

Try out the `frac` argument to the `lowess` function. The default value is 2/3. Try something smaller and something bigger. Can you find a value that works better than the default value? Report your best $R^2$ value and plot the result.

In [None]:
### solution

In [None]:
### solution

### Part f

Interpret the plot from the previous part. What does it suggest about the relationship between the number of shots a player attempts and the percentage of those shots the player makes? Include the $R^2$ of this model. What does that tell us about whether this model is the final word on this topic?

*Answer*

## Question 3

Continuing to think about shot success, we can treat each shot as a binary variable: $Y = 1$ (make the shot) or $Y = 0$ (miss the shot). We can model the conditional probability of making a shot using *logistic regression*, which fits the model:

$$P(Y = 1 \mid \mathbf{X} = \mathbf{x}) = \frac{e^{a + b_1 x_1 + \ldots + b_p x_p}}{1 + e^{a + b_1 x_1 + \ldots + b_p x_p}}$$

We will investigate the conditional probability of making a shot based on several player level features.
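
To get a feel for the logistic formula, the sketch below evaluates it for a single predictor. The function name and the coefficients are hypothetical and chosen only to show the shape of the curve: when the linear part $a + b x$ equals zero the probability is exactly one half, and it moves toward 0 or 1 as the linear part becomes more negative or more positive.

```python
import math

def logistic_probability(a, b, x):
    """P(Y = 1 | X = x) for a single predictor, using the logistic formula."""
    linear = a + b * x
    return math.exp(linear) / (1 + math.exp(linear))

# Hypothetical coefficients chosen only to illustrate the shape of the curve
a_example, b_example = -1.0, 0.5
p_at_0 = logistic_probability(a_example, b_example, 0)   # linear part is -1
p_at_2 = logistic_probability(a_example, b_example, 2)   # linear part is 0
p_at_4 = logistic_probability(a_example, b_example, 4)   # linear part is +1
```

The same calculation, with fitted coefficients in place of the made-up ones, turns a model summary into a predicted probability.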

### Part a

Notice that our data are in a slightly different format than this model assumes. Rather than one row per shot, each row summarizes all of a player's shots, so `FGM` (field goals made) is the sum of the $Y_j$ for $j = 1$ to $j = \text{FGA}$. This isn't a problem for logistic regression, but we need to provide two outcomes: the number of successful shots and the number of missed shots.

- Create a new column called `FGMISS` that is the difference of `FGA` and `FGM`.
- Pull out a table with the field goal columns (`FGM`, `FGA`, and `FGMISS`) and age.
- Using the `sm.GLM` function, pass `FGM` and `FGMISS` together as a single two-column table for the outcome.
- Use age as the predictor.
- Pass the `family = sm.families.Binomial()` argument.
- Print out the summary from this model.

In [None]:
### solution

Answer the following questions:

- Would you reject the claim that the probability of making a shot is independent of age at the $\alpha = 0.05$ level?
- Comparing younger and older players, which group is more likely to make a shot?
- The median age of the player stats table is 25. How likely is a 25-year-old player to make a field goal?


**Answer:**


### Part b

Are some teams more likely to make shots than others? Using the `pd.get_dummies` function, turn the `TEAM_ABBREVIATION` column into dummy variables and perform a logistic regression of shot success on this variable (be careful about how you either use reference encoding or leave out the intercept). Print out the summary table and find the team that is most likely to make a field goal.
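
The warning about reference encoding versus dropping the intercept can be illustrated on a toy column (the labels below are hypothetical). Keeping every dummy column together with an intercept makes the predictors redundant, because the dummy columns always sum to one; the sketch shows the two usual ways around that.

```python
import pandas as pd

# Hypothetical group labels, invented for illustration
toy = pd.DataFrame({"group": ["A", "B", "C", "A", "B"]})

# Option 1: one column per group; fit the model WITHOUT an intercept,
# so each coefficient describes its own group directly
all_levels = pd.get_dummies(toy["group"], dtype=float)

# Option 2: reference encoding; drop the first level ("A") and fit WITH an intercept,
# so each coefficient is a difference from the reference group
reference = pd.get_dummies(toy["group"], drop_first=True, dtype=float)

# With all levels kept, every row sums to 1, which duplicates an intercept column
row_sums = all_levels.sum(axis=1)
```

Either option is a reasonable choice; what changes is how the coefficients in the summary table should be read.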

In [None]:
### solution

*Answer*:

## Question 4

Games are played at the stadium of one of the teams, which we call the "home team." The other team is the "away" or "visiting" team. There is a theory that teams play better at home. Let's find out.

In [None]:
gamelog_2425.head()

|  | SEASON_ID | TEAM_ID | TEAM_ABBREVIATION | TEAM_NAME | GAME_ID | GAME_DATE | MATCHUP | WL | MIN | FGM | ... | DREB | REB | AST | STL | BLK | TOV | PF | PTS | PLUS_MINUS | VIDEO_AVAILABLE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22024 | 1610612738 | BOS | Boston Celtics | 22400061 | 2024-10-22 | BOS vs. NYK | W | 240 | 48 | ... | 29 | 40 | 33 | 6 | 3 | 4 | 15 | 132 | 23 | 1 |
| 1 | 22024 | 1610612747 | LAL | Los Angeles Lakers | 22400062 | 2024-10-22 | LAL vs. MIN | W | 240 | 42 | ... | 31 | 46 | 22 | 7 | 8 | 7 | 22 | 110 | 7 | 1 |
| 2 | 22024 | 1610612750 | MIN | Minnesota Timberwolves | 22400062 | 2024-10-22 | MIN @ LAL | L | 240 | 35 | ... | 35 | 47 | 17 | 4 | 1 | 16 | 22 | 103 | -7 | 1 |
| 3 | 22024 | 1610612752 | NYK | New York Knicks | 22400061 | 2024-10-22 | NYK @ BOS | L | 240 | 43 | ... | 29 | 34 | 20 | 2 | 3 | 12 | 12 | 109 | -23 | 1 |
| 4 | 22024 | 1610612737 | ATL | Atlanta Hawks | 22400064 | 2024-10-23 | ATL vs. BKN | W | 240 | 39 | ... | 33 | 45 | 25 | 12 | 9 | 16 | 20 | 120 | 4 | 1 |


Notice that this table has one row for each team and game combination. It appears that the home team has a `"MATCHUP"` value of the form `"TEAM1 vs. TEAM2"` and the visiting team lists it as `"TEAM2 @ TEAM1"`. Here is some code to help us create a table with one row per game.

Before we do that, however, we are going to grab the previous game for each team to see how they performed.


In [None]:
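df = gamelog_2425.copy()

# sort_values returns a new table, so assign the result back
df = df.sort_values(["TEAM_ID", "GAME_DATE"])

df['PREV_GAME_ID'] = df.groupby("TEAM_ID")["GAME_ID"].shift(1)

# Match each row to the same team's previous game; GAME_ID alone would also match the opponent's row
previous = df.add_suffix("_PREVIOUS")
df = pd.merge(df, previous, left_on = ["TEAM_ID", "PREV_GAME_ID"], right_on = ["TEAM_ID_PREVIOUS", "GAME_ID_PREVIOUS"])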

def extract_teams(row):
    if "vs" in row['MATCHUP']:
        # Home team case: "TEAM1 vs TEAM2"
        home, away = row['MATCHUP'].split(' vs. ')
    elif "@" in row['MATCHUP']:
        # Away team case: "TEAM2 @ TEAM1"
        away, home = row['MATCHUP'].split(' @ ')
    else:
        home, away = None, None
    return pd.Series([home, away])

df[['home_team', 'away_team']] = df.apply(extract_teams, axis=1)

home_df = df[df['MATCHUP'].str.contains('vs.')].copy()
away_df = df[df['MATCHUP'].str.contains('@')].copy()

# Rename relevant columns for merging
home_df = home_df.add_prefix('home_')
away_df = away_df.add_prefix('away_')

# Merge on game_id
games = pd.merge(
    home_df,
    away_df,
    left_on='home_GAME_ID',
    right_on='away_GAME_ID'
)

games.columns

Index(['home_SEASON_ID', 'home_TEAM_ID', 'home_TEAM_ABBREVIATION',
       'home_TEAM_NAME', 'home_GAME_ID', 'home_GAME_DATE', 'home_MATCHUP',
       'home_WL', 'home_MIN', 'home_FGM',
       ...
       'away_STL_PREVIOUS', 'away_BLK_PREVIOUS', 'away_TOV_PREVIOUS',
       'away_PF_PREVIOUS', 'away_PTS_PREVIOUS', 'away_PLUS_MINUS_PREVIOUS',
       'away_VIDEO_AVAILABLE_PREVIOUS', 'away_PREV_GAME_ID_PREVIOUS',
       'away_home_team', 'away_away_team'],
      dtype='object', length=124)

#### Part a

Using the `games` table, answer the following:

- How often do home teams win?
- Plot the distribution of points scored at home and the distribution of points scored away for all teams.
- Comment on these results. Do you see evidence of a "home team advantage"?

In [None]:
#### solution

In [None]:
#### solution

In [None]:
#### solution

#### Part b

Using a decision tree, fit a model of `away_WIN` (an indicator you will need to create, for example from the `away_WL` column) using whichever variables you wish *from the previous game* for both teams.

#### Part c

Split the data into two groups: a training and a test set. Try different values of tree depth and compare the results on the model fit evaluated using the test set. Summarize your results in a brief paragraph.
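
One way to organize the comparison is sketched below using scikit-learn, which is an assumption on our part since the assignment does not name a library. The features and outcome are synthetic; in your solution they would be columns of the `games` table.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and a binary outcome, for illustration only
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Hold out 25% of the rows; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit one tree per candidate depth and record accuracy on the held-out rows
test_accuracy = {}
for depth in [1, 2, 4, 8]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    test_accuracy[depth] = tree.score(X_test, y_test)

best_depth = max(test_accuracy, key=test_accuracy.get)
```

Comparing training accuracy with test accuracy at each depth is a useful extra check, since deep trees can fit the training rows well without doing better on held-out rows.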


#### Part d

Use a random forest model. How does it compare to the best decision tree you created?

## Question 5

### Part a

Consider the `games` table from the previous section to be a sample of all games that could occur.

Using bootstrapping, get a confidence interval for the median points difference between teams in a game (you can use the `home_PLUS_MINUS` column for this).

Comment on this result. What does it tell us about home team advantage?
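
As a refresher on the mechanics, here is a sketch of a percentile bootstrap for a median. The sample is synthetic and merely stands in for a column of point differences; the number of resamples and the 95% level are choices made for this illustration.

```python
import numpy as np

# Synthetic sample standing in for a column of point differences
rng = np.random.default_rng(3)
sample = rng.normal(loc=2.0, scale=12.0, size=500)

n_boot = 2000
boot_medians = np.empty(n_boot)
for i in range(n_boot):
    # Resample the rows with replacement, keeping the original sample size
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_medians[i] = np.median(resample)

# Percentile interval: the middle 95% of the bootstrap medians
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
```

Whether the resulting interval contains zero is one natural thing to look at when commenting on the result.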


In [None]:
#### solution

**Answer:**


### Part b

Consider two types of games:

- Those in which the home team won the previous game
- Those in which the home team lost the previous game

Perform a permutation test to compare these two groups on whether they win the next game. As a test statistic, use the difference of proportions that win the current game in the two groups. Again, comment on the results with respect to home team advantage. Do home teams that won the previous game have an additional advantage?
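
The structure of such a test can be sketched on synthetic 0/1 outcomes (the group sizes and win rates below are made up). The outcomes are pooled, shuffled, and split back into groups of the original sizes, and the difference of proportions is recomputed each time.

```python
import numpy as np

# Synthetic 0/1 outcomes for two groups, for illustration only
rng = np.random.default_rng(4)
group_a = rng.binomial(1, 0.60, size=300)
group_b = rng.binomial(1, 0.55, size=250)

observed = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a = group_a.size
n_perm = 5000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    # Shuffle the outcomes so group labels carry no information
    shuffled = rng.permutation(pooled)
    perm_stats[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
```

A one-sided version would compare `perm_stats >= observed` instead, which may suit the question of an *additional* advantage better.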

In [None]:
#### Solution

**Answer:**
