# Hypothesis testing with NBA statistics

In phase II of the final project, we will examine four phenomena in basketball:
1. Home court advantage: We'll test the hypothesis that teams have an advantage when playing at home.
2. Weekend vs weekday: We'll test the hypothesis that teams have different scoring patterns on weekdays vs weekends.
3. Michael Jordan effect: We'll test the hypothesis that Michael Jordan improved the Chicago Bulls scores.
4. Rest days (optional): We'll test the hypothesis that having more rest days affects game scores.

The tests will be run with real NBA data. This dataset is provided to you in `nba_data.csv` in the same directory as this notebook. We also provide a utility function in `utils.py`.

**Due date:** May 10, 2024 11:59pm

**Total:** 65 pts

**Submission:** Please submit a zip file containing this .ipynb file and a PDF version of this file.

In [None]:
#Import the necessary libraries
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import time
from scipy import stats
import warnings
import utils

warnings.filterwarnings('ignore')
plt.rc('font', size = 14)

## 1. Data

Let's start by exploring and visualizing the data.

**(1a).** [1 pt] Read the data file (`nba_data.csv`) as a pandas dataframe. Store the dataframe object in a variable called `df`. What is the total number of games in the dataset? Display the first 5 rows.

*Hint:* First, use the `read_csv()` function in pandas. Then, to convert the date column into pandas DateTime format, use the `pd.to_datetime()` function with format string `"%A, %B %d, %Y"`
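As a small illustration of the format string in the hint, the sketch below parses two made-up date strings written as "weekday, month day, year". The strings and variable names are assumptions for illustration only and are not taken from `nba_data.csv`.

```python
import pandas as pd

# Hypothetical date strings in "weekday, month day, year" form
raw_dates = pd.Series(["Wednesday, June 7, 2017", "Sunday, June 4, 2017"])

# %A = full weekday name, %B = full month name, %d = day of month, %Y = 4-digit year
parsed = pd.to_datetime(raw_dates, format="%A, %B %d, %Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Passing an explicit `format` avoids leaving pandas to guess how the strings are laid out.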

**(1b).** [2 pts] Plot the average home and away score per game for each season (score on y-axis, season on x-axis). Start with season 1960 and end with season 2017. Comment on the general trends across all these years. Label the plot axes, add a title, and include a legend.

**A:** (Type your answer here)

**(1c).** [1 pt] Add a column `total_score` to `df` containing the combined points scored by the home and away teams in each game.

To help you check your code works, we added a test case below that should pass with your amended dataframe. Note that the test cases in this project are not exhaustive, so you should still check if your function is correct even if the test cases pass.

In [None]:
assert df.loc[np.logical_and(df['home_team']=='Boston Celtics',
                             df['date'] == pd.to_datetime('2017-05-25'))]['total_score'].values[0] == 237

**(1d).** [2 pts] Write a function called <b>weekend_new_column()</b> to create two new columns: <br><br>

`day_of_week`: this column should be set to an integer between $0$ and $6$, both inclusive. If the game was played on Monday then set `day_of_week = 0`, if it was played on Tuesday then `day_of_week = 1`, ..., and if it was played on Sunday then `day_of_week = 6`.

`weekend`: this column should be set to either 0 or 1. If the game was played over the weekend, i.e. on Saturday or Sunday then it should be = 1, else 0.

Feel free to follow the example here: https://stackoverflow.com/questions/52398383/finding-day-of-the-week-for-a-datetime64.

<b>Make sure to double-check that the column content matches the convention we defined above! (It is easy to make off-by-one mistakes)</b>
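One way to check the convention is to try it on a couple of dates whose weekday you already know. The sketch below does this for the two dates that appear in the test cases, using a standalone Series rather than `df`. The variable names are illustrative only.

```python
import pandas as pd

# The two dates used in the notebook's test cases
dates = pd.Series(pd.to_datetime(["2017-06-07", "2017-06-04"]))

# .dt.dayofweek uses Monday = 0, ..., Sunday = 6
day_numbers = dates.dt.dayofweek
# Saturday (5) and Sunday (6) count as the weekend
is_weekend = (day_numbers >= 5).astype(int)

print(day_numbers.tolist(), is_weekend.tolist())
```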

Your function should pass the test cases below.

In [None]:
def weekend_new_column(df):
    '''
    Add two columns to the dataframe
    @param df: dataframe containing column 'date', will be modified by function
    @return: None
    '''
    pass

In [None]:
weekend_new_column(df)
# The test cases look rows up by date, so they still work if the dataframe was reordered
assert df.loc[df['date'] == pd.to_datetime('2017-06-07')].day_of_week.values[0] == 2
assert df.loc[df['date'] == pd.to_datetime('2017-06-07')].weekend.values[0]     == 0
assert df.loc[df['date'] == pd.to_datetime('2017-06-04')].day_of_week.values[0] == 6
assert df.loc[df['date'] == pd.to_datetime('2017-06-04')].weekend.values[0]     == 1

## 2. Generalized likelihood ratio test for home advantage as a proportion

We will use the NBA dataset to determine whether "home-court advantage" is statistically significant. Some people believe a team will perform better playing at home because they are more familiar with the environment, less tired from traveling, and more supported by their fans. This article explains what home court advantage is: https://bleacherreport.com/articles/1520496-how-important-is-home-court-advantage-in-the-nba

We will define home advantage as follows for this part: The proportion of wins at home for a team is higher than the proportion of wins away for that team.

Let $A_k$ be an indicator random variable for whether team $k$ is playing a game at home. Let $B_k$ be an indicator random variable for whether team $k$ wins the game. We are testing for the independence of $A_k$ and $B_k$. When testing for independence of two discrete variables, we can construct a contingency table where the rows are the values of one variable and the columns are the values of the other variable. Then we can use the generalized likelihood ratio test. Let $i$ and $j$ index the rows and columns, respectively, of the contingency table. This test examines whether the number of observations $O_{ij}$ in each entry of the contingency table aligns with the expected numbers $E_{ij}$. The statistic is based on likelihood ratios:

\begin{equation}
    G = 2 \sum_{i,j} O_{ij} \ln \left(\frac{O_{ij}}{E_{ij}}\right)
\end{equation}

Under the null hypothesis, the G-statistic approximately follows a chi-squared distribution with (number of rows - 1)(number of columns - 1) degrees of freedom.
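To make the formula concrete, here is a minimal sketch on a hypothetical 2x2 table (the counts are made up, not taken from the NBA data). It assumes the expected counts come from the row and column totals and that every cell is non-zero, since the logarithm is undefined otherwise. The result is cross-checked against `scipy.stats.chi2_contingency` with its log-likelihood option.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = (home, away), columns = (wins, losses)
observed = np.array([[30.0, 10.0],
                     [20.0, 20.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n_total = observed.sum()

# Under independence, E_ij = (row i total) * (column j total) / N
expected = row_totals * col_totals / n_total

g_statistic = 2.0 * np.sum(observed * np.log(observed / expected))
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(g_statistic, dof)  # upper-tail probability

# Cross-check against SciPy's log-likelihood variant of the contingency test
g_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False, lambda_="log-likelihood")
print(g_statistic, p_value, dof)
```

The survival function `chi2.sf` is used instead of `1 - chi2.cdf` because it is more accurate for very small p-values.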

**(2a).** [2 pts] For a single team $k$, state your null and alternative hypotheses using the random variables $A_k$ and $B_k$ defined above.

**A:** (Type your answer here)

**(2b).** [1 pt] We will restrict the data to seasons 2000-2017 for parts 2 and 3. This is because the mean yearly scores prior to season 2000 demonstrate far greater variance than during this most recent period. Ideally, we want to answer questions about home advantage, for example, during a period when the inter-seasonal variability was less extreme. Create a subset of the dataframe called `df_millenial_subset` for seasons in the range $[2000, 2017]$ (both inclusive).

**(2c).** [1 pt] To ensure we have sufficient data for each hypothesis test, we will only test the home advantage hypothesis for teams that play in all 18 seasons. Create a list `team_names_in_all_seasons` of teams that play in all 18 seasons. Print the number of teams in your list.

Note that although we will not test the hypotheses for the excluded teams, matches that include those teams may still be used for the hypothesis tests you conduct.

**(2d).** [2 pts] For each team in the list you created in 2c, create a 2x2 contingency table that contains `home` and `away` as the two rows and `wins` and `losses` as the two columns. The entry in the row `home` and column `wins` is the number of home games that are won by that team. The entry in the row `away` and column `losses` is the number of away games that are lost by that team. The other two entries are defined analogously.
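If the idea of a contingency table is new, the sketch below builds one from a handful of made-up game records with `pd.crosstab`. The `location` and `result` columns are hypothetical and do not exist in `nba_data.csv`. How you derive the equivalent information from the real data is up to you.

```python
import pandas as pd

# Hypothetical per-game records for one team (column names are made up)
games = pd.DataFrame({
    "location": ["home", "home", "away", "home", "away", "away"],
    "result":   ["wins", "losses", "losses", "wins", "wins", "losses"],
})

# Count games for every (location, result) combination
table = pd.crosstab(games["location"], games["result"])
# Put rows and columns in the order described above
table = table.loc[["home", "away"], ["wins", "losses"]]
print(table)
```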

Keep a list called `observed_tables` containing the contingency tables. This list should follow the order in `team_names_in_all_seasons`.

Print the table for the New York Knicks.

Your table for the Boston Celtics should pass the test case given below. The test case is written assuming the contingency tables are saved in pandas data frames since that allows the rows and columns to be titled. Other data structures will also be accepted. If you use a different data structure, feel free to modify the test cases in 2d and 2e, as well as the function specification in 2f.

In [None]:
boston_celtics_idx = team_names_in_all_seasons.index('Boston Celtics')
assert np.all(observed_tables[boston_celtics_idx].values == np.array([[506, 308], [359, 450]]))

**(2e).** [2 pts] Create a list of tables `expected_tables` in the same format containing the expected number of samples for each entry under the null hypothesis. The teams should be in the same order as `team_names_in_all_seasons`.

Print the table for the New York Knicks.

Your table for the Boston Celtics should pass the test case given below.

*Hint:* The marginal probabilities for each row and column may be helpful when computing the expected values.

In [None]:
boston_celtics_idx = team_names_in_all_seasons.index('Boston Celtics')
assert np.all(np.around(expected_tables[boston_celtics_idx].values, decimals = 1) 
                        == np.array([[433.8, 380.2], [431.2, 377.8]]))

**(2f).** [2 pts] Write a function that computes the G-statistic from a 2x2 contingency table containing the observed counts and a 2x2 contingency table containing the expected counts.

Your function should pass the test below.

In [None]:
def compute_g_statistic(observed_df,
                        expected_df):
    '''
    Compute the G-statistic from the observed and expected contingency tables.
    @param observed_df: pandas DataFrame, observed contingency table, 
                        contains columns wins, losses
    @param expected_df: pandas DataFrame, expected contingency table, contains same columns
    @return: float, G-statistic
    '''
    assert len(observed_df) == 2
    assert len(expected_df) == 2
    assert len(observed_df.columns) == 2
    assert len(expected_df.columns) == 2
    
    pass

In [None]:
test_observed_table = pd.DataFrame(data = {'wins': [6, 3],
                                           'losses': [4, 5]},
                                   columns = ['wins', 'losses'])
test_expected_table = pd.DataFrame(data = {'wins': [5, 4],
                                           'losses': [5, 4]},
                                   columns = ['wins', 'losses'])
assert np.around(compute_g_statistic(test_observed_table, test_expected_table), decimals = 1) == .9

**(2g).** [2 pts] Write a function that computes the p-value from the G-statistic from a 2x2 contingency table. The statistic follows a chi-squared distribution. How many degrees of freedom does this distribution have?

*Hint:* You may find the **scipy.stats.chi2** object to be useful. 

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html 

**A:** (Type your answer here)

In [None]:
def compute_p_value_from_g_statistic(g_statistic):
    '''
    Compute the p-value for a G-statistic
    @param g_statistic: float
    @return: float, p-value
    '''
    pass

**(2h).** [2 pts] Using the functions you wrote in parts 2f and 2g, compute the G-statistics and p-values for each team. Display the results for all teams in a pandas dataframe with columns `Team name`, `G-statistic`, `P-value`.

**(2i).** [1 pt] Pick any team. State whether the computed p-value supports the idea that the team you selected has a home advantage.

**A:** (Type your answer here)

**(2j).** [1 pt] When testing multiple hypotheses, some true null hypotheses may be rejected by random chance. (For a fun illustration, see https://xkcd.com/882/.) To guarantee that the probability of rejecting any true null hypothesis is at most .05, we can check whether each p-value is below .05 divided by the number of hypotheses tested. The union bound gives us the desired guarantee. This multiple hypothesis correction is called Bonferroni's correction. We will learn about multiple hypothesis corrections in a later lecture.
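The thresholding step can be sketched as follows, using four made-up p-values rather than anything computed from the NBA data. Note how a p-value of 0.02 would pass an uncorrected .05 threshold but not the corrected one.

```python
import numpy as np

alpha = 0.05
# Hypothetical p-values from four separate tests (not computed from the NBA data)
p_values = np.array([0.001, 0.02, 0.012, 0.3])

# Bonferroni: compare each p-value with alpha divided by the number of tests
threshold = alpha / len(p_values)
rejected = p_values < threshold
print(threshold, rejected.tolist())
```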

Using the p-value threshold in Bonferroni's correction, for which teams can we conclude there is home advantage?

**A:** (Type your answer here)

## 3. One-sample t-test for home advantage as a mean of differences between paired games

In this part, we will consider another definition for home advantage: Consider a pair of games between teams $i$ and $j$. In one game, team $i$ is playing at home and scores $h_i$, while team $j$ is playing away and scores $a_j$. In another game, team $i$ is playing away and scores $a_i$, while team $j$ is playing at home and scores $h_j$. For this pair of games, home advantage is defined as

\begin{equation}
    \left(h_i - a_j\right) - \left(a_i - h_j\right)
\end{equation}
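As a quick worked example with made-up scores, suppose $h_i = 105$, $a_j = 98$, $a_i = 96$, and $h_j = 101$. Team $i$ wins by 7 at home and loses by 5 away, so

\begin{equation}
    \left(105 - 98\right) - \left(96 - 101\right) = 7 - \left(-5\right) = 12
\end{equation}

A positive value means team $i$'s point margin was larger at home than away for this pair of games.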

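We will perform a one-sided, one-sample t-test on the mean home advantage. With $\hat{\mu}$ the sample mean, $\mu_0$ the mean under the null hypothesis, and $se\left(\hat{\mu}\right)$ the standard error of the sample mean, the t-statistic is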

\begin{equation}
    t = \frac{\hat{\mu} - \mu_0}{se\left(\hat{\mu}\right)}
\end{equation}
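The sketch below evaluates this statistic by hand on a small made-up sample and compares it with `scipy.stats.ttest_1samp`. The sample values are hypothetical, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of paired-game differences (not from the NBA data)
sample = np.array([4.0, -2.0, 7.0, 1.0, 5.0, -3.0, 6.0, 2.0])
mu_0 = 0.0  # mean under the null hypothesis

mu_hat = sample.mean()
# Standard error of the mean uses the sample standard deviation (ddof=1)
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
t_manual = (mu_hat - mu_0) / standard_error
# One-sided p-value for the alternative "mean > mu_0", with n - 1 degrees of freedom
p_manual = stats.t.sf(t_manual, df=len(sample) - 1)

# Same test via SciPy
result = stats.ttest_1samp(sample, popmean=mu_0, alternative="greater")
print(t_manual, p_manual)
```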

**(3a).** [2 pts] Implement a function called `compute_home_advantage` to compute the home advantage. Your function should use `construct_game_pairs` provided in `utils.py` to construct pairs of games in the same season. For any pair of teams $i$ and $j$, each game with team $i$ at home and team $j$ away in a season is paired with each game with team $i$ away and team $j$ at home in the same season. This function will create a dataframe with one game pair per row.  Your job is to add a column `home_advantage` that computes the home advantage.

Your function should pass the test case below.

In [None]:
def compute_home_advantage(df):
    '''
    Create a dataframe with one pair of games per row and add a column containing the home_advantage for each pair
    @param df: pandas DataFrame containing columns date, season, home_team, away_team, home_pt, away_pt, etc.
    @return: pandas DataFrame containing one paired game per row and home_advantage column
    '''
    pass

In [None]:
pair_games_df = compute_home_advantage(df_millenial_subset)
assert pair_games_df.loc[np.logical_and.reduce((
    pair_games_df['season'] == 2017,
    pair_games_df['pair_game1_date'] == pd.to_datetime('2017-05-02'),
    pair_games_df['pair_game1_home_team'] == 'Boston Celtics',
    pair_games_df['pair_game1_away_team'] == 'Washington Wizards',
    pair_games_df['pair_game2_date'] == pd.to_datetime('2017-05-07')))].home_advantage.values[0] == 29

**(3b).** [2 pts] Let's examine the distribution of home advantage.

1. Print the mean home advantage.
2. Create a histogram showing the density of the data with 40 bins.

**(3c).** [2 pts] Let $\mu_{HA}$ be the mean home advantage. State your null and alternative hypotheses in math and in words. Please define "home court advantage" in the context of this test instead of using those exact words.

**A:** (Type your answer here)

**(3d).** [3 pts] Assuming the game pairs are independent, run a 1-sided t-test to assess whether the mean home advantage is significantly above 0. Print the t-statistic and p-value. Interpret your result.

Hint: You may find `scipy.stats.ttest_1samp` useful.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html

**A:** (Type your conclusion here)

**(3e).** [2 pts] The one-sample t-test you just ran makes the following 2 assumptions:

1. The mean home advantage is Normally distributed.
2. The home advantages from all the game pairs are independent and identically distributed.

Do these 2 assumptions hold?

*Hint:* This question can be answered by reasoning. No data analysis necessary.

**A:** (Type your answer here)

As a fun fact, in the National Hockey League, home advantage is actually baked into the rules. During face-offs, the home team gets the last change: the home team decides which players to send onto the ice after seeing which players the away team sent. There is no equivalent rule in the NBA, so home advantage in the NBA (if it exists) does not come from the rules of the game.

## 4. Weekend vs weekday

We will use the NBA dataset to assess whether the total number of points scored is different in weekend vs weekday games. Some people believe more points are scored on the weekend because players are better supported by packed audiences. On the other hand, some people believe that because players take more time off on the weekend, they are more focused and in better shape on weekdays, so they might score more points on weekdays.

In this section, we will run a two-sided two-sample t-test. You may assume the two samples $X$ and $Y$ have equal variance. The t-statistic is 

\begin{equation}
    t = \frac{\left(\bar{X} - \bar{Y}\right) - \left(\mu_X - \mu_Y\right)}{s \sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}}
\end{equation}

where $s^2$ is the pooled sample variance:

\begin{equation}
    s^2 = \frac{\left(n_X - 1\right) s_X^2 + \left(n_Y - 1\right) s_Y^2}{\left(n_X - 1\right) + \left(n_Y - 1\right)}
\end{equation}

$n_X$ and $n_Y$ are the numbers of samples, and $s_X^2$ and $s_Y^2$ are the empirical variances.
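As a sanity check on these formulas, the sketch below computes the pooled statistic by hand for two small made-up samples and compares it with `scipy.stats.ttest_ind` using `equal_var=True`. The sample values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (not from the NBA data)
x = np.array([201.0, 195.0, 210.0, 188.0, 205.0, 199.0])
y = np.array([190.0, 202.0, 185.0, 197.0, 193.0])

n_x, n_y = len(x), len(y)
# Pooled variance: weighted average of the two sample variances (ddof=1)
pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / (n_x + n_y - 2)

# Under the null hypothesis mu_X - mu_Y = 0, so that term drops out
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / n_x + 1 / n_y))
# Two-sided p-value with n_X + n_Y - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n_x + n_y - 2)

result = stats.ttest_ind(x, y, equal_var=True)
print(t_manual, p_manual)
```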

**(4a).** [1 pt] Create two subsets `df_weekend` and `df_weekday` of games played on weekends/weekdays, respectively. We will continue to restrict to games in 2000-2017 for part 4.

**(4b).** [1 pt] What are average total scores for weekend vs weekday games? Print both.

**(4c).** [1 pt] What are the standard deviations of the total scores for weekend vs weekday games? Print both.

**A:** (Type your answer here)

**(4d).** [2 pts] Plot a histogram of the densities of the total scores in games played on weekends and weekdays.

**(4e).** [2 pts] Let $\mu_0$ and $\mu_1$ be the mean of total scores on the weekdays and weekends, respectively. State your null and alternative hypotheses in math and words.

**A:** (Type your answer here)

**(4f).** [3 pts] Assume the total scores from each game are independent and identically distributed. Assume the variances are equal. Run the t-test for your hypotheses. Print the t-statistic and p-value. What would you conclude?

*Hint:* You may use the `ttest_ind` function in `scipy.stats`

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

**A:** (Type your conclusion here)

Note: Although we assumed the total scores from each game are independent and identically distributed, this does not necessarily hold. For instance, in the next part, we will see how games with Michael Jordan might tend to have higher scores!

## 5. The Michael Jordan Effect

Widely regarded as one of the all-time greatest players in the history of the NBA, Michael Jordan led the Chicago Bulls to six championships (1991-93 and 1996-98). The other constants during the 1991-1998 Bulls era were Hall of Famer Scottie Pippen and coach Phil Jackson. We would like to study whether Jordan's presence on the team was (statistically) significant in bringing the championships to the Bulls.

Conveniently (for us), Jordan retired from basketball after winning three consecutive NBA championships with the Chicago Bulls. He retired ahead of the 1994 season but returned from retirement late in the 1995 season and eventually led the Bulls to three more titles in the 1996, 1997, and 1998 seasons.

During his absence, both Pippen and Jackson were still at the Bulls, allowing us to study the impact of Michael Jordan's absence on the Chicago Bulls' performance. This fact will allow us to assume that the only change in the years with or without Jordan in the period 1992-1997 was Michael Jordan's presence/absence. You may assume there are no other confounding factors at play.

**(5a).** [2 pts] Produce a subset of the data which includes the following seasons only: 1992-1997. Additionally, the subset should only include games (home or away) featuring the Chicago Bulls.

**(5b).** [3 pts] Create three new columns on the subset as follows: <br>
1. `for_score`: points scored by the Chicago Bulls (home or away) in the seasons 1992-1997. <br>
2. `against_score`: points scored against the Chicago Bulls (home or away) in the seasons 1992-1997. <br>
3. `net_score`: net score by the Chicago Bulls (home or away) in the seasons 1992-1997.
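Selecting a score that depends on whether a team is home or away can be sketched with `np.where`. The example below uses three made-up games for a hypothetical "Team A". The column names `home_team`, `away_team`, `home_pt`, and `away_pt` follow the notebook's dataframe, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical games involving "Team A"; column names follow the notebook's dataframe
games = pd.DataFrame({
    "home_team": ["Team A", "Team B", "Team A"],
    "away_team": ["Team B", "Team A", "Team C"],
    "home_pt":   [100, 95, 88],
    "away_pt":   [90, 101, 92],
})

team_is_home = games["home_team"] == "Team A"
# Pick the home score when the team hosts, otherwise the away score
games["for_score"] = np.where(team_is_home, games["home_pt"], games["away_pt"])
games["against_score"] = np.where(team_is_home, games["away_pt"], games["home_pt"])
games["net_score"] = games["for_score"] - games["against_score"]
print(games[["for_score", "against_score", "net_score"]])
```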

Your amended dataframe should pass the tests below.

In [None]:
test_game = bulls_df.loc[bulls_df['date'] == pd.to_datetime('1997-06-13')]
assert test_game.for_score.values[0] == 90
assert test_game.against_score.values[0] == 86
assert test_game.net_score.values[0] == 4

**(5c).** [1 pt] Using the subset above, produce two disjoint subsets as follows: <br> 
`df_subset_jordan`: includes seasons 1992, 1993, 1996, 1997 (we leave 1994 and 1995 out). <br>
`df_subset_no_jordan`: includes seasons 1994, 1995 only.

**(5d).** [2 pts] Let $\mu_1$ be the mean of `for_score` for the Chicago Bulls during the Jordan seasons ('92, '93, '96, '97). Let $\mu_0$ be the mean of `for_score` for the Chicago Bulls during the seasons without him ('94, '95). What are the null and alternative hypotheses if we want to assess the Michael Jordan effect on `for_score` in these two time periods? Please state in math and in words.

**A:** (Type your answer here)

**(5e).** [1 pt] Produce the mean `for_score` by the Chicago Bulls during the Jordan seasons and the seasons without him. Print both.

**(5f).** [2 pts] Conduct a two-sample t-test for differences between the `for_score` during the Michael Jordan seasons and seasons without him. Feel free to use the `scipy.stats` library to perform the t-test for independent samples and assume equal variances. Print the t-statistic and the associated p-value.

**(5g).** [3 pts] Repeat 5e and 5f for the `against_score`. Print the t-statistic and the associated p-value.

**(5h).** [2 pts] What would you conclude based on the two tests?

**A:** (Type your answer here)

**(5i).** [3 pts] Repeat 5e and 5f for the `net_score`. Print the t-statistic and the associated p-value.

**(5j).** [1 pt] Does this test confirm the conclusion you reached based on the earlier two tests?

**A:** (Type your answer here)

**(5k).** [2 pts] While Scottie Pippen and Phil Jackson were constants during the entire period 1992-1998, can you think of reasons to cast doubt on the conclusions reached by the tests?

*Hint:* We want you to question whether the assumption about "no other confounding variables" is valid? Could there be other factors at play?

**A:** (Type your answer here)

## 6. Do back-to-back games impact scoring? (OPTIONAL: NOT FOR GRADE)
We want to test whether teams playing back-to-back games (i.e. playing on consecutive nights of the week) has any effect on total scoring during a game. 

**(6a).** Write a function, `rest_days()`, which takes as input the complete dataset as a dataframe and adds a new column called `rest` that records the total number of nights of rest that the home and away teams have had before each game. Call your function to update the dataframe.

*Note:* the column must track the <b>sum</b> of the rest days for the home and away team.

If a team has a prolonged rest (i.e. > 7 days), set that to 8 days. This will help eliminate the skewed effects of strikes and long periods off (e.g. the summer). The first game for a team can be set to 8 days as well. Therefore, the range for the new column is 2-16 days (both inclusive).
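The capping rule can be sketched for a single team as follows. The game dates are made up, and the sketch assumes that rest is measured as the number of days between consecutive game dates, so a back-to-back game counts as 1. The column you build for the assignment then sums this quantity over the home and away teams.

```python
import pandas as pd

# Hypothetical game dates for one team, in chronological order
game_dates = pd.Series(pd.to_datetime(
    ["2017-01-01", "2017-01-02", "2017-01-05", "2017-01-20"]))

# Days since the team's previous game; the first game has no predecessor (NaN)
days_since_last = game_dates.diff().dt.days

# Treat the first game as 8 days and cap long breaks at 8
team_rest = days_since_last.fillna(8).clip(upper=8).astype(int)
print(team_rest.tolist())
```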

Your function should pass the test case below.

In [None]:
def rest_days(df):
    '''
    Add a column for the total number of rest days the home and away team have before a game
    @param df: pandas DataFrame, contains columns date, home_team, away_team, etc., will be modified by function
    @return: None
    '''
    pass

In [None]:
rest_days(df)
assert df.loc[np.logical_and.reduce((df['home_team'] == 'Boston Celtics',
                                     df['away_team'] == 'Miami Heat',
                                     df['date'] == pd.to_datetime('2017-03-26')))].rest.values[0] == 5

**(6b).** Produce a histogram of the rest days with 10 bins.

**(6c).** What does the histogram tell you about the schedule of an NBA team during the season?

**A:** (Type your answer here)

**(6d).** For the seasons 2000-2017 only, produce two subsets:<br>
`back_to_back`: with games where the rest days are <= 3 (i.e. at least one team is playing back-to-back)<br>
`longer`: games where the rest days are >=4 and <= 5.

**(6e).** What would your null hypothesis be if you wanted to test whether back_to_back games produce different total scores than games with a combined rest of 4-5 days? Assume there are no other confounding variables at play.

**A:** (Type your answer here)

**(6f).** Print the mean total scores per game for the <b>back_to_back</b> and <b>longer</b> subsets (for seasons 2000-2017).

**(6g).** For the seasons 2000-2017 and using the two subsets you created, conduct a two-sample t-test to test your hypothesis. Feel free to use the `scipy.stats` library to perform the t-test for independent samples with equal variances. Print the t-statistic and the associated p-value.

**(6h).** What do you conclude about back-to-back games?

**A:** (Type your answer here)

**(6i).** Do you think the difference/no difference between <= 3 rest days and 4-5 rest days could be a result of confounding (i.e. other factors which may explain or be responsible for the observations)? What could be the confounding factors?

*Hint:* Think about situations during the season when the distributions of rest days could be different. 

**A:** (Type your answer here)