# Project Two: Hypothesis Testing

This notebook contains the step-by-step directions for Project Two. It is very important to run through the steps in order. Some steps depend on the outputs of earlier steps. Once you have completed the steps in this notebook, be sure to write your summary report.

You are a data analyst for a basketball team and have access to a large set of historical data that you can use to analyze performance patterns. The coach of the team and your management have requested that you perform several hypothesis tests to statistically validate claims about your team's performance. This analysis will provide evidence for these claims and help make key decisions to improve the performance of the team. You will use the Python programming language to perform the statistical analyses and then prepare a report of your findings for the team’s management. Since the managers are not data analysts, you will need to interpret your findings and describe their practical implications.

There are four important variables in the data set that you will study in Project Two.

| Variable | What it Represents                                          |
| -------- | ----------------------------------------------------------- |
|  pts     | Points scored by the team in a game                         |
|  elo_n   | A measure of relative skill level of the team in the league |
| year_id  | Year when the team played the games                         |
| fran_id  | Name of the NBA team                                        |

The ELO rating, represented by the variable **elo_n**, is used as a measure of the relative skill of a team. This measure is inferred based on the final score of a game, the game location, and the outcome of the game relative to the probability of that outcome. The higher the number, the higher the relative skill of a team.
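
To make "the outcome of the game relative to the probability of that outcome" concrete, the sketch below uses the standard Elo expected-score formula. It is an illustration only: the function names `expected_win_prob` and `updated_rating` and the step size `k=20` are assumptions, and the sketch ignores the game location and final score that the ratings in this data set also take into account.

```python
def expected_win_prob(rating, opp_rating):
    """Standard Elo expected score: chance that `rating` beats `opp_rating`."""
    return 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))


def updated_rating(rating, opp_rating, won, k=20.0):
    """Move the rating toward the observed result; k is a hypothetical step size."""
    actual = 1.0 if won else 0.0
    return rating + k * (actual - expected_win_prob(rating, opp_rating))


# Hypothetical ratings: a 1600-rated team hosting a 1500-rated team.
prob = expected_win_prob(1600, 1500)
print("Expected win probability =", round(prob, 3))
print("Rating after a win  =", round(updated_rating(1600, 1500, won=True), 2))
print("Rating after a loss =", round(updated_rating(1600, 1500, won=False), 2))
```

An expected win moves the rating only slightly, while an upset moves it more, which is why a higher rating indicates a stronger team over time.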

In addition to studying data on your own team, your management has also assigned you a second team so that you can compare its performance with your own team's.

| Team          | What it Represents                                                            |
| ------------- | ----------------------------------------------------------------------------- |
| Your Team     | The team that has hired you as an analyst                                     |
| Assigned Team | The team that the management has assigned to you to compare against your team |

Reminder: It may be beneficial to review the summary report template for Project Two prior to starting this Python script. That will give you an idea of the questions you will need to answer with the outputs of this script.

## Step 1: Data Preparation & the Assigned Team

This step uploads the data set from a CSV file. It also selects the Assigned Team for this analysis. Do not make any changes to the code block below.

The Assigned Team is the Chicago Bulls from the years 1996-1998.

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
from IPython.display import display, HTML


nba_orig_df = pd.read_csv('../nbaallelo.csv')
nba_orig_df = nba_orig_df[(nba_orig_df['lg_id']=='NBA') & (nba_orig_df['is_playoffs']==0)]
columns_to_keep = ['game_id','year_id','fran_id','pts','opp_pts','elo_n','opp_elo_n', 'game_location', 'game_result']
nba_orig_df = nba_orig_df[columns_to_keep]

# The dataframe for the assigned team is called assigned_team_df.
# The assigned team is the Bulls from 1996-1998.
assigned_years_league_df = nba_orig_df[(nba_orig_df['year_id'].between(1996, 1998))]
assigned_team_df = assigned_years_league_df[(assigned_years_league_df['fran_id']=='Bulls')]
assigned_team_df = assigned_team_df.reset_index(drop=True)

display(HTML(assigned_team_df.head().to_html()))
print("printed only the first five observations...")
print("Number of rows in the dataset =", len(assigned_team_df))

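|   | game_id      | year_id | fran_id | pts | opp_pts | elo_n     | opp_elo_n | game_location | game_result |
| - | ------------ | ------- | ------- | --- | ------- | --------- | --------- | ------------- | ----------- |
| 0 | 199511030CHI | 1996    | Bulls   | 105 | 91      | 1598.2924 | 1531.7449 | H             | W           |
| 1 | 199511040CHI | 1996    | Bulls   | 107 | 85      | 1604.394  | 1458.6415 | H             | W           |
| 2 | 199511070CHI | 1996    | Bulls   | 117 | 108     | 1605.7983 | 1310.9349 | H             | W           |
| 3 | 199511090CLE | 1996    | Bulls   | 106 | 88      | 1618.8701 | 1452.8268 | A             | W           |
| 4 | 199511110CHI | 1996    | Bulls   | 110 | 106     | 1621.1591 | 1490.2861 | H             | W           |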


printed only the first five observations...
Number of rows in the dataset = 246
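
The filtering pattern used in Step 1 can be tried on a small table first. The sketch below is a toy illustration, not part of the project code: `toy_df` and its rows are made up, and only the `between`, boolean-mask, and `reset_index` calls mirror the cell above.

```python
import pandas as pd

# Toy stand-in for nba_orig_df; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "year_id": [1995, 1996, 1997, 1998, 1999, 1997],
    "fran_id": ["Bulls", "Bulls", "Bulls", "Bulls", "Bulls", "Knicks"],
    "pts": [98, 105, 107, 117, 91, 99],
})

# between() is inclusive on both ends by default, so 1996 and 1998 are kept.
in_years = toy_df["year_id"].between(1996, 1998)
is_bulls = toy_df["fran_id"] == "Bulls"

# Combine the masks with & (element-wise AND), then renumber the rows from 0.
toy_team_df = toy_df[in_years & is_bulls].reset_index(drop=True)
print(toy_team_df)
```

Each condition needs its own parentheses when written inline, as in the cell above, because `&` binds more tightly than `==` in Python.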


## Step 2: Pick Your Team

In this step, you will pick your team. The range of years that you will study for your team is 2013-2015. Make the following edits to the code block below:

1. Replace ??TEAM?? with your choice of team from one of the following team names.
  Bucks, Bulls, Cavaliers, Celtics, Clippers, Grizzlies, Hawks, Heat, Jazz, Kings, Knicks, Lakers, Magic, Mavericks, Nets, Nuggets, Pacers, Pelicans, Pistons, Raptors, Rockets, Sixers, Spurs, Suns, Thunder, Timberwolves, Trailblazers, Warriors, Wizards

*Remember to enter the team name within single quotes. For example, if you picked the Suns, then ??TEAM?? should be replaced with 'Suns'.*

In [3]:
# Range of years: 2013-2015. Note: the line below selects all teams within the three-year period 2013-2015; this is not your team's dataframe.
your_years_leagues_df = nba_orig_df[(nba_orig_df['year_id'].between(2013, 2015))]

# The dataframe for your team is called your_team_df.
your_team_df = your_years_leagues_df[(your_years_leagues_df['fran_id']=='Knicks')]
your_team_df = your_team_df.reset_index(drop=True)

display(HTML(your_team_df.head().to_html()))
print("printed only the first five observations...")
print("Number of rows in the dataset =", len(your_team_df))

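|   | game_id      | year_id | fran_id | pts | opp_pts | elo_n     | opp_elo_n | game_location | game_result |
| - | ------------ | ------- | ------- | --- | ------- | --------- | --------- | ------------- | ----------- |
| 0 | 201211020NYK | 2013    | Knicks  | 104 | 84      | 1548.2699 | 1647.6675 | H             | W           |
| 1 | 201211040NYK | 2013    | Knicks  | 100 | 84      | 1557.5126 | 1535.9276 | H             | W           |
| 2 | 201211050PHI | 2013    | Knicks  | 110 | 88      | 1580.3411 | 1513.0991 | A             | W           |
| 3 | 201211090NYK | 2013    | Knicks  | 104 | 94      | 1586.0647 | 1533.1604 | H             | W           |
| 4 | 201211130ORL | 2013    | Knicks  | 99  | 89      | 1594.3969 | 1421.3483 | A             | W           |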


printed only the first five observations...
Number of rows in the dataset = 246


## Step 3: Hypothesis Test for the Population Mean (I)

A relative skill level of 1340 represents a critically low skill level in the league. The management of your team has hypothesized that the average relative skill level of your team in the years 2013-2015 is greater than 1340. Test this claim using a 5% level of significance. For this test, assume that the population standard deviation for relative skill level is unknown. Make the following edits to the code block below:

  - Null hypothesis: The null hypothesis is that the average relative skill level of your team in the years 2013-2015 is equal to 1340. H0: μ = 1340
  - Alternative hypothesis: The alternative hypothesis is that the average relative skill level is greater than 1340. Ha: μ > 1340
  - Test Type: This is a one-tailed test because the alternative hypothesis is directional (greater than). The test is a right-tailed test.
  - Find the p-value and make a decision: Based on the p-value, decide whether to reject the null hypothesis. If the p-value is less than 0.05, reject the null hypothesis. Otherwise, do not reject the null hypothesis.
  - Find the confidence interval: Find the 95% confidence interval for the average relative skill level of your team in the years 2013-2015.

1. Replace ??DATAFRAME_YOUR_TEAM?? with the name of your team's dataframe. See Step 2 for the name of your team's dataframe.
2. Replace ??RELATIVE_SKILL?? with the name of the variable for relative skill. See the table included in the Project Two instructions above to pick the variable name. Enclose this variable in single quotes. For example, if the variable name is var2 then replace ??RELATIVE_SKILL?? with 'var2'.
3. Replace ??NULL_HYPOTHESIS_VALUE?? with the mean value of the relative skill under the null hypothesis.

After you are done with your edits, click the block of code below and hit the Run button above.

In [4]:
import scipy.stats as st


null_hyp_val = 1340
alpha = 0.05
n = len(your_team_df['elo_n'])

# Mean relative skill level of your team
mean_elo_your_team = your_team_df['elo_n'].mean()
print("Mean Relative Skill of your team in the years 2013 to 2015 =", round(mean_elo_your_team,2))

# Calculate the degrees of freedom: n-1
df = n - 1
print("Degrees of Freedom =", df)

# Calculate the standard deviation of the sample
std_dev = your_team_df['elo_n'].std()
print("Standard Deviation of the Sample =", round(std_dev,2))


# Hypothesis Test
# ttest_1samp returns a two-sided p-value. For this right-tailed test, halve it
# when the test statistic is positive; otherwise the one-sided p-value is 1 - p/2.
test_statistic, p_value_two_sided = st.ttest_1samp(your_team_df['elo_n'], null_hyp_val)
p_value = p_value_two_sided / 2 if test_statistic > 0 else 1 - p_value_two_sided / 2

print("Hypothesis Test for the Population Mean")
print("Test Statistic =", round(test_statistic,2))
print("P-value =", round(p_value,4))

# Compare the unrounded p-value with alpha so that rounding cannot change the decision.
if p_value < alpha:
    print("Reject the null hypothesis, the mean relative skill level of the team is significantly greater than 1340")
else:
    print("Fail to reject the null hypothesis, the mean relative skill level of the team is not significantly greater than 1340")

Mean Relative Skill of your team in the years 2013 to 2015 = 1471.29
Degrees of Freedom = 245
Standard Deviation of the Sample = 110.85
Hypothesis Test for the Population Mean
Test Statistic = 18.58
P-value = 0.0
Reject the null hypothesis, the mean relative skill level of the team is significantly greater than 1340
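
Step 3 also asks for a 95% confidence interval, which the cell above does not print. A minimal sketch follows, assuming the rounded summary statistics from the output above are accurate enough for illustration; the variable names `sample_mean`, `sample_std`, `ci_lower`, and `ci_upper` are introduced here and are not part of the project code.

```python
import math
import scipy.stats as st

# Summary statistics copied from the Step 3 output above (rounded values).
n = 246
sample_mean = 1471.29
sample_std = 110.85
confidence = 0.95

# Standard error of the mean and the two-sided t critical value (df = n - 1).
std_err = sample_std / math.sqrt(n)
t_crit = st.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

ci_lower = sample_mean - t_crit * std_err
ci_upper = sample_mean + t_crit * std_err
print("95% confidence interval =", (round(ci_lower, 2), round(ci_upper, 2)))
```

With the full dataframe available, `st.t.interval(0.95, df, loc=mean_elo_your_team, scale=st.sem(your_team_df['elo_n']))` should produce the same interval from the unrounded data. If the whole interval lies above 1340, that agrees with rejecting the null hypothesis.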


## Step 4: Hypothesis Test for the Population Mean (II)

A team averaging 106 points is likely to do very well during the regular season. The coach of your team has hypothesized that your team averaged fewer than 106 points per game in the years 2013-2015. Test this claim at a 1% level of significance. For this test, assume that the population standard deviation for points scored is unknown.

- Null Hypothesis: The null hypothesis is that the average points scored by your team is equal to 106. H0: μ = 106
- Alternative Hypothesis: The alternative hypothesis is that the average points scored by your team is less than 106. Ha: μ < 106
- Test Type: This is a one-tailed test because the alternative hypothesis is directional (less than). The test is a left-tailed test.

1. The dataframe for your team is called your_team_df.
2. The variable 'pts' represents the points scored by your team.
3. Calculate and print the mean points scored by your team during the years you picked.
4. Identify the mean score under the null hypothesis. You only have to identify this value and do not have to print it. (Hint: this is given in the problem statement)
5. Assuming that the population standard deviation is unknown, use Python methods to carry out the hypothesis test.
6. Calculate and print the test statistic rounded to two decimal places.
7. Calculate and print the P-value rounded to four decimal places.

In [108]:
import scipy.stats as st


null_hyp_val = 106
alpha = 0.01
n = len(your_team_df['pts'])

# Mean points scored by your team.
mean_points_your_team = your_team_df['pts'].mean()
print("Mean Points Scored by your team in the years 2013 to 2015 =", round(mean_points_your_team,2))

# Calculate the degrees of freedom: n-1
df = n - 1
print("Degrees of Freedom =", df)

# Calculate the standard deviation of the sample
std_dev = your_team_df['pts'].std()
print("Standard Deviation of the Sample =", round(std_dev,2))


# Hypothesis Test
# ttest_1samp returns a two-sided p-value. For this left-tailed test, halve it
# when the test statistic is negative; otherwise the one-sided p-value is 1 - p/2.
test_statistic, p_value_two_sided = st.ttest_1samp(your_team_df['pts'], null_hyp_val)
p_value = p_value_two_sided / 2 if test_statistic < 0 else 1 - p_value_two_sided / 2

print("Hypothesis Test for the Population Mean")
print("Test Statistic =", round(test_statistic,2))
print("P-value =", round(p_value,4))

# Compare the unrounded p-value with alpha so that rounding cannot change the decision.
if p_value < alpha:
    print("Reject the null hypothesis, the mean points scored by the team is significantly less than 106")
else:
    print("Fail to reject the null hypothesis, the mean points scored by the team is not significantly less than 106")

Mean Points Scored by your team in the years 2013 to 2015 = 96.81
Degrees of Freedom = 245
Standard Deviation of the Sample = 11.23
Hypothesis Test for the Population Mean
Test Statistic = -12.84
P-value = 0.0
Reject the null hypothesis, the mean points scored by the team is significantly less than 106
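
The one-tailed p-value can also be computed directly from the t distribution, which shows why the direction of the test statistic matters. The sketch below is illustrative: the helper `one_sided_p_value` is a hypothetical name, and the inputs are the rounded summary statistics from the output above.

```python
import math
import scipy.stats as st


def one_sided_p_value(t_stat, df, tail):
    """One-tailed p-value for a t statistic; tail is 'left' or 'right'."""
    if tail == "left":
        return st.t.cdf(t_stat, df)  # P(T <= t)
    if tail == "right":
        return st.t.sf(t_stat, df)   # P(T >= t)
    raise ValueError("tail must be 'left' or 'right'")


# Summary statistics copied from the Step 4 output above (rounded values).
n, sample_mean, sample_std, null_mean = 246, 96.81, 11.23, 106

t_stat = (sample_mean - null_mean) / (sample_std / math.sqrt(n))
p_left = one_sided_p_value(t_stat, n - 1, "left")
p_right = one_sided_p_value(t_stat, n - 1, "right")

print("t =", round(t_stat, 2))
print("left-tailed p =", p_left)
print("right-tailed p =", p_right)
```

Because the statistic is negative, the left-tailed p-value is tiny and equals half of the two-sided value. The right-tailed p-value is close to 1, so halving a two-sided p-value is only valid when the statistic points in the direction of the alternative hypothesis.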


## Step 5: Hypothesis Test for the Population Proportion

Suppose the management claims that the proportion of games that your team wins when scoring 102 or more points is 0.90. Test this claim using a 5% level of significance. Make the following edits to the code block below:

- The null hypothesis is that the proportion of games that your team wins when scoring 102 or more points is equal to 0.90. H0: p = 0.90
- The alternative hypothesis is that the proportion of games that your team wins when scoring 102 or more points is not equal to 0.90. Ha: p ≠ 0.90
- Test Type: This is a two-tailed test because the alternative hypothesis is non-directional (not equal to).

1. Replace ??COUNT_VAR?? with the variable name that represents the number of games won when your team scores 102 or more points. (Hint: this variable is in the code block below).
2. Replace ??NOBS_VAR?? with the variable name that represents the total number of games when your team scores 102 or more points. (Hint: this variable is in the code block below).
3. Replace ??NULL_HYPOTHESIS_VALUE?? with the proportion under the null hypothesis.

If the P-value is less than 0.05, reject the null hypothesis. Otherwise, do not reject the null hypothesis.

In [5]:
from statsmodels.stats.proportion import proportions_ztest


null_hyp_val = 0.90
alpha = 0.05
n = len(your_team_df)

your_team_gt_102_df = your_team_df[(your_team_df['pts'] >= 102)]

# Number of games WON when your team scored 102 or more pts.
counts = (your_team_gt_102_df['game_result'] == 'W').sum()
print("Number of games WON by your team when scoring 102 points or more in the years 2013 to 2015 =", counts)

# Total number of games PLAYED when your team scores 102 or more pts
nobs = len(your_team_gt_102_df['game_result'])

# Proportion of games won when your team scores 102 or more pts
p = counts/nobs
print("Proportion of games won/played by your team when scoring 102 points or more in the years 2013 to 2015 =", round(p,4))


# Hypothesis Test
# The alternative hypothesis is "not equal to 0.90", so this is a two-tailed test.
# proportions_ztest returns the two-sided p-value by default, so it is used as is.
test_statistic, p_value = proportions_ztest(counts, nobs, null_hyp_val, prop_var=null_hyp_val)

print("Hypothesis Test for the Population Proportion")
print("Test Statistic =", round(test_statistic,2))
print("P-value =", round(p_value,4))

if p_value < alpha:
    print("Reject the null hypothesis, the proportion of games won by the team when they score 102 or more points is significantly different from 0.90")
else:
    print("Fail to reject the null hypothesis, the proportion of games won by the team when they score 102 or more points is not significantly different from 0.90")

Number of games WON by your team when scoring 102 points or more in the years 2013 to 2015 = 58
Proportion of games won/played by your team when scoring 102 points or more in the years 2013 to 2015 = 0.7945
Hypothesis Test for the Population Proportion
Test Statistic = -3.0
P-value = 0.0027
Reject the null hypothesis, the proportion of games won by the team when they score 102 or more points is significantly different from 0.90
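
The z statistic behind this test can be reproduced by hand. The sketch below is an illustration under stated assumptions: the game count of 73 is not printed by the notebook and is inferred here from 58 wins and the printed proportion 0.7945, and the names `wins`, `games`, and `p_two_sided` are introduced for this example.

```python
import math
import scipy.stats as st

wins = 58         # printed in the Step 5 output
games = 73        # assumption: inferred from 58 / 0.7945, not printed above
null_prop = 0.90  # proportion under the null hypothesis

sample_prop = wins / games

# The standard error uses the null proportion, matching prop_var=null_hyp_val.
std_err = math.sqrt(null_prop * (1 - null_prop) / games)
z_stat = (sample_prop - null_prop) / std_err

# Two-tailed p-value: chance of a z at least this extreme in either direction.
p_two_sided = 2 * st.norm.sf(abs(z_stat))

print("z =", round(z_stat, 2))
print("two-tailed p =", round(p_two_sided, 4))
```

The two-tailed p-value is twice the area in a single tail, which is the appropriate quantity when the alternative hypothesis is "not equal to".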


## Step 6: Hypothesis Test for the Difference Between Two Population Means

The management of your team wants to compare the team with the assigned team (the Bulls in 1996-1998). They claim that the skill level of your team in 2013-2015 is the same as the skill level of the Bulls in 1996 to 1998. In other words, the mean relative skill level of your team in 2013 to 2015 is the same as the mean relative skill level of the Bulls in 1996-1998. Test this claim using a 1% level of significance. Assume that the population standard deviation is unknown. Make the following edits to the code block below:

1. Replace ??DATAFRAME_ASSIGNED_TEAM?? with the name of assigned team's dataframe. See Step 1 for the name of assigned team's dataframe.
2. Replace ??DATAFRAME_YOUR_TEAM?? with the name of your team's dataframe. See Step 2 for the name of your team's dataframe.
3. Replace ??RELATIVE_SKILL?? with the name of the variable for relative skill. See the table included in Project Two instructions above to pick the variable name. Enclose this variable in single quotes. For example, if the variable name is var2 then replace ??RELATIVE_SKILL?? with 'var2'.

In [119]:
import scipy.stats as st


# Null Hypothesis: The mean relative skill level of the assigned team is equal to the mean relative skill level of your team. This is a two-tailed test.
alpha = 0.01

mean_elo_n_project_team = assigned_team_df['elo_n'].mean()
print("Mean Relative Skill of the assigned team in the years 1996 to 1998 =", round(mean_elo_n_project_team,2))

mean_elo_n_your_team = your_team_df['elo_n'].mean()
print("Mean Relative Skill of your team in the years 2013 to 2015  =", round(mean_elo_n_your_team,2))


# Hypothesis Test
test_statistic, p_value = st.ttest_ind(assigned_team_df['elo_n'], your_team_df['elo_n'])

print("Hypothesis Test for the Difference Between Two Population Means")
print("Test Statistic =", round(test_statistic,2))
print("P-value =", round(p_value,4))

# Compare the unrounded p-value with alpha so that rounding cannot change the decision.
if p_value < alpha:
    print("Reject the null hypothesis, the mean relative skill level of the assigned team is significantly different from the mean relative skill level of your team")
else:
    print("Fail to reject the null hypothesis, the mean relative skill level of the assigned team is not significantly different from the mean relative skill level of your team")

Mean Relative Skill of the assigned team in the years 1996 to 1998 = 1739.8
Mean Relative Skill of your team in the years 2013 to 2015  = 1471.29
Hypothesis Test for the Difference Between Two Population Means
Test Statistic = 34.45
P-value = 0.0
Reject the null hypothesis, the mean relative skill level of the assigned team is significantly different from the mean relative skill level of your team
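
The call in Step 6 uses the default pooled-variance form of `st.ttest_ind`, which assumes the two populations have equal variances. If that assumption is in doubt, Welch's t-test is a common alternative. The sketch below compares the two on synthetic data; the group means and spreads are made-up values chosen only for illustration, not statistics from the project data.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Synthetic ratings, not real data: two groups with different spreads.
group_a = rng.normal(loc=1740, scale=40, size=246)
group_b = rng.normal(loc=1470, scale=110, size=246)

# Pooled-variance test (the default, equal_var=True).
t_pooled, p_pooled = st.ttest_ind(group_a, group_b)

# Welch's test drops the equal-variance assumption.
t_welch, p_welch = st.ttest_ind(group_a, group_b, equal_var=False)

print("Pooled t =", round(t_pooled, 2), " p =", p_pooled)
print("Welch  t =", round(t_welch, 2), " p =", p_welch)
```

When both groups have the same number of observations, as here, the two test statistics coincide and only the degrees of freedom differ, so the choice matters more when the sample sizes are unequal.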


## Step 7: Summary Report

In [120]:


# To compare the points scored by the assigned team and your team, we will use a two-sample t-test.

# Null Hypothesis: The mean points scored by the assigned team is equal to the mean points scored by your team. This is a two-tailed test.
alpha = 0.01

mean_points_project_team = assigned_team_df['pts'].mean()
print("Mean Points Scored by the assigned team in the years 1996 to 1998 =", round(mean_points_project_team,2))

mean_points_your_team = your_team_df['pts'].mean()
print("Mean Points Scored by your team in the years 2013 to 2015 =", round(mean_points_your_team,2))


# Hypothesis Test
test_statistic, p_value = st.ttest_ind(assigned_team_df['pts'], your_team_df['pts'])

print("Hypothesis Test for the Difference Between Two Population Means")
print("Test Statistic =", round(test_statistic,2))
print("P-value =", round(p_value,4))

# Compare the unrounded p-value with alpha so that rounding cannot change the decision.
if p_value < alpha:
    print("Reject the null hypothesis, the mean points scored by the assigned team is significantly different from the mean points scored by your team")
else:
    print("Fail to reject the null hypothesis, the mean points scored by the assigned team is not significantly different from the mean points scored by your team")

Mean Points Scored by the assigned team in the years 1996 to 1998 = 101.68
Mean Points Scored by your team in the years 2013 to 2015 = 96.81
Hypothesis Test for the Difference Between Two Population Means
Test Statistic = 4.79
P-value = 0.0
Reject the null hypothesis, the mean points scored by the assigned team is significantly different from the mean points scored by your team
