<a href="https://colab.research.google.com/github/annaqas/projects_codecademy/blob/main/NBA_training_variables_association.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project, you’ll analyze data from the NBA (National Basketball Association) and explore possible associations.

This data was originally sourced from 538’s Analysis of the Complete History Of The NBA and contains the original, unmodified data from Basketball Reference as well as several additional variables 538 added to perform their own analysis.

You can read more about the data and how it’s being used by 538 here. For this project we’ve limited the data to just 5 teams and 10 columns (plus one constructed column, point_diff, the difference between pts and opp_pts).

You will create several charts and tables in this project, so you’ll need to use plt.clf() between plots in your code so that the plots don’t layer on top of one another.

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns

import codecademylib3
np.set_printoptions(suppress=True, precision = 2)

nba = pd.read_csv('./nba_games.csv')

# Subset Data to 2010 Season, 2014 Season
nba_2010 = nba[nba.year_id == 2010]
nba_2014 = nba[nba.year_id == 2014]

print(nba_2010.head())
print(nba_2014.head())

Suppose you want to compare the Knicks to the Nets with respect to points scored per game. Using the pts column from the nba_2010 DataFrame, create two series named knicks_pts_2010 (fran_id = "Knicks") and nets_pts_2010 (fran_id = "Nets") that represent the points each team scored in its games.

Calculate the difference between the two teams’ average points scored and save the result as diff_means_2010. Based on this value, do you think fran_id and pts are associated? Why or why not?

In [None]:
knicks_pts_2010 = nba_2010.pts[nba_2010.fran_id == 'Knicks']
nets_pts_2010 = nba_2010.pts[nba_2010.fran_id == 'Nets']

knicks_pts_mean_2010 = np.mean(knicks_pts_2010)
nets_pts_mean_2010 = np.mean(nets_pts_2010)
print(knicks_pts_mean_2010)
print(nets_pts_mean_2010)
diff_means_2010 = knicks_pts_mean_2010 - nets_pts_mean_2010
print(diff_means_2010)
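
The same per-team comparison can be done in one pass with a groupby. The sketch below uses a tiny made-up table (the team labels and scores are hypothetical, not values from nba_games.csv) only to show the shape of the calculation.

```python
import pandas as pd

# Hypothetical stand-in for nba_2010; team labels and scores are made up.
toy = pd.DataFrame({
    "fran_id": ["Team A", "Team A", "Team A", "Team B", "Team B", "Team B"],
    "pts": [100, 104, 108, 90, 95, 100],
})

# Mean points per team, computed for every team at once.
team_means = toy.groupby("fran_id")["pts"].mean()

# Difference in means, analogous to diff_means_2010.
toy_diff_means = team_means["Team A"] - team_means["Team B"]

print(team_means)
print(toy_diff_means)
```

A difference in means on its own does not say how much the two sets of scores overlap, which is why the next exercise looks at the full distributions.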

Rather than comparing means alone, it’s useful to look at the full distribution of values to understand whether a difference in means is meaningful. Create a set of overlapping histograms to compare the points scored by the Knicks and the Nets. Use the series you created in the previous step and the code below to create the plot. Do the distributions appear to be the same?

In [None]:
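# density=True replaces the normed argument, which was removed in Matplotlib 3.1
plt.hist(knicks_pts_2010, color='blue', label='Knicks', density=True, alpha=0.5)
plt.hist(nets_pts_2010, color='red', label='Nets', density=True, alpha=0.5)
plt.legend()  # show which color belongs to which team
plt.show()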
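
The histograms above are normalized so that each one has a total area of 1, which keeps two groups comparable even when they contain different numbers of games. Here is a minimal sketch of what that normalization does, using NumPy's histogram function on randomly generated scores (the samples, sizes, and bin edges are all assumptions for illustration).

```python
import numpy as np

# Hypothetical scores for two groups of different sizes (randomly generated).
rng = np.random.default_rng(1)
scores_a = rng.normal(100, 10, size=82)
scores_b = rng.normal(95, 10, size=40)

# Shared bin edges so the two histograms line up.
bins = np.linspace(50, 150, 21)

# density=True rescales bar heights so that the bars integrate to 1.
dens_a, edges = np.histogram(scores_a, bins=bins, density=True)
dens_b, _ = np.histogram(scores_b, bins=bins, density=True)

# Area = sum of (bar height * bin width).
area_a = (dens_a * np.diff(edges)).sum()
area_b = (dens_b * np.diff(edges)).sum()
print(area_a, area_b)
```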

Now, let’s compare the 2010 games to 2014. Replicate the steps from the previous three exercises using nba_2014. First, calculate the difference between the two teams’ mean points scored. Save and print the value as diff_means_2014. Did the difference in points get larger or smaller in 2014? Then, plot the overlapping histograms. Does the mean difference you calculated make sense?

In [None]:
knicks_pts_2014 = nba_2014.pts[nba_2014.fran_id == 'Knicks']
nets_pts_2014 = nba_2014.pts[nba_2014.fran_id == 'Nets']

knicks_pts_mean_2014 = np.mean(knicks_pts_2014)
nets_pts_mean_2014 = np.mean(nets_pts_2014)
print(knicks_pts_mean_2014)
print(nets_pts_mean_2014)
diff_means_2014 = knicks_pts_mean_2014 - nets_pts_mean_2014
print(diff_means_2014)

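plt.clf()
# density=True replaces the normed argument, which was removed in Matplotlib 3.1
plt.hist(knicks_pts_2014, color='blue', label='Knicks', density=True, alpha=0.5)
plt.hist(nets_pts_2014, color='red', label='Nets', density=True, alpha=0.5)
plt.legend()  # show which color belongs to which team
plt.show()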

Using nba_2010, generate side-by-side boxplots with points scored (pts) on the y-axis and team (fran_id) on the x-axis. Is there any overlap between the boxes? Does this chart suggest that fran_id and pts are associated? Which pairs of teams, if any, earn different average scores per game?

In [None]:
plt.clf()
sns.boxplot(data = nba_2010, x = 'fran_id', y = 'pts')
plt.show()
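
Whether two boxes overlap can also be checked numerically, since each box spans the first to the third quartile. The sketch below does this for two made-up teams (hypothetical labels and scores, not values from the dataset).

```python
import pandas as pd

# Hypothetical scores for two teams; values are made up for illustration.
toy = pd.DataFrame({
    "fran_id": ["Team A"] * 5 + ["Team B"] * 5,
    "pts": [90, 95, 100, 105, 110, 100, 104, 108, 112, 116],
})

# First and third quartiles per team: these are the edges of each box.
quartiles = toy.groupby("fran_id")["pts"].quantile([0.25, 0.75]).unstack()
print(quartiles)

# Two boxes overlap when the higher of the lower edges does not exceed
# the lower of the upper edges.
boxes_overlap = quartiles[0.25].max() <= quartiles[0.75].min()
print(boxes_overlap)
```

Little or no overlap between boxes points toward an association between team and points; heavy overlap points away from one.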

The variable game_result indicates whether a team won a particular game ('W' stands for “win” and 'L' stands for “loss”). The variable game_location indicates whether a team was playing at home or away ('H' stands for “home” and 'A' stands for “away”). Do teams tend to win more games at home compared to away?

Data scientists will often calculate a contingency table of frequencies to help them determine if categorical variables are associated. Calculate a table of frequencies that shows the counts of game_result and game_location.

Save your result as location_result_freq and print your result. Based on this table, do you think the variables are associated?

Convert this table of frequencies to a table of proportions and save the result as location_result_proportions. Print your result.

In [None]:
location_result_freq = pd.crosstab(nba_2010.game_location, nba_2010.game_result)
print(location_result_freq)

#output
#game_result      L    W
#game_location          
#A              133   92
#H              105  120

# Divide each count by the total number of games to get proportions
location_result_proportions = location_result_freq / len(nba_2010)
print(location_result_proportions)
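
Dividing by the grand total gives overall proportions, but the question "do teams win more at home?" is often easier to read from row-wise proportions. The following sketch rebuilds the table from the counts printed above and computes both; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the printed 2010 contingency table above.
freq = pd.DataFrame(
    {"L": [133, 105], "W": [92, 120]},
    index=pd.Index(["A", "H"], name="game_location"),
)
freq.columns.name = "game_result"

# Overall proportions: each cell divided by the grand total (450 games).
overall = freq / freq.to_numpy().sum()

# Row-wise proportions: share of losses and wins within each location.
by_location = freq.div(freq.sum(axis=1), axis=0)

print(overall.round(3))
print(by_location.round(3))
```

With the raw game rows, passing normalize='index' to pd.crosstab should give the row-wise table directly.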

Using the contingency table created in the previous exercise (Ex. 7), calculate the expected contingency table (if there were no association) and the Chi-Square statistic and print your results. Does the actual contingency table look similar to the expected table — or different? Based on this output, do you think there is an association between these variables?

In [None]:
chi2, pval, dof, expected = chi2_contingency(location_result_freq)
print(np.round(expected))
print(chi2)
#output - chi2 = 6.502
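
To see where the expected table and the statistic come from, the calculation can be reproduced by hand from the printed counts. This is a sketch, assuming SciPy's default behavior of applying the Yates continuity correction to 2x2 tables; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the 2010 contingency table (rows: A, H; columns: L, W).
observed = np.array([[133, 92],
                     [105, 120]])

# Under no association: expected = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()
print(expected_manual)

# Yates correction: shrink each |observed - expected| by 0.5 before squaring.
adjusted_diff = np.abs(observed - expected_manual) - 0.5
chi2_manual = (adjusted_diff ** 2 / expected_manual).sum()
print(round(chi2_manual, 3))

# Compare with SciPy's result on the same table.
chi2_scipy, pval, dof, expected_scipy = chi2_contingency(observed)
print(round(chi2_scipy, 3))
```

Every observed cell differs from its expected count by 14 games, which is the gap the Chi-Square statistic summarizes.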

For each game, 538 has calculated the probability that each team will win the game. In the data, this is saved as forecast. The point_diff column gives the margin of victory/defeat for each team (positive values mean that the team won; negative values mean that they lost). Did teams with a higher probability of winning (according to 538) also tend to win games by more points?

Using nba_2010, calculate the covariance between forecast (538’s projected win probability) and point_diff (the margin of victory/defeat) in the dataset. Save and print your result. Looking at the matrix, what is the covariance between these two variables?

Calculate the correlation between forecast and point_diff. Save and print your result. Does this value suggest an association between the two variables?

Generate a scatter plot of forecast (on the x-axis) and point_diff (on the y-axis). Does the correlation value make sense?

In [None]:
# Covariance: association between two quantitative variables
# np.cov returns a 2x2 matrix; the covariance is the off-diagonal entry
cov_forecast_point = np.cov(nba_2010.forecast, nba_2010.point_diff)
print(cov_forecast_point)
# covariance = 1.37

#Pearson correlation
pearson_forecast_point, p = pearsonr(nba_2010.forecast, nba_2010.point_diff)
print(pearson_forecast_point)
#pearson = 0.44 - linear association

plt.clf()
plt.scatter(x=nba_2010.forecast, y=nba_2010.point_diff)
plt.xlabel('Forecast')
plt.ylabel('Point difference')
plt.show()
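
Covariance depends on the units of both variables, while correlation rescales it to lie between -1 and 1. The sketch below shows the link between the two on randomly generated data (the arrays are hypothetical stand-ins, not the real forecast and point_diff columns).

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical stand-ins for forecast and point_diff (randomly generated).
rng = np.random.default_rng(0)
forecast = rng.uniform(0.1, 0.9, size=200)
point_diff = 30 * (forecast - 0.5) + rng.normal(0, 12, size=200)

# np.cov returns a 2x2 matrix: variances on the diagonal,
# the covariance between the two variables off the diagonal.
cov_matrix = np.cov(forecast, point_diff)
cov_xy = cov_matrix[0, 1]

# Pearson correlation = covariance / (std of x * std of y).
# ddof=1 matches the sample covariance that np.cov computes by default.
corr_manual = cov_xy / (np.std(forecast, ddof=1) * np.std(point_diff, ddof=1))

corr_scipy, _ = pearsonr(forecast, point_diff)
print(cov_xy, corr_manual, corr_scipy)
```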