# Some Teams One Dream
Hand checking NBA teams from the 2014-2018 seasons

In [1]:
#acquire libraries
import pandas as pd

#explore libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

from prepare import wrangle_nba

#model libraries
from model import (
    logistic_regression, decision_tree, random_forest, kneighbors,
    logistic_regression_validate, decision_tree_validate,
    random_forest_validate, kneighbors_validate,
)


import warnings
warnings.filterwarnings("ignore")

## Acquire

In [None]:
#uploading the nba csv and saving it as a dataframe called nba
nba = pd.read_csv('nba.games.stats.csv')

### Initial analysis of the data

In [None]:
#sneak peek into the data
nba.head()

- Will need to drop the unnamed column because it isn't needed
- Won't need the game number because it will be in the index

In [None]:
#checking to see how many rows and columns there are
nba.shape

- Four seasons' worth of data; might split it up by season

In [None]:
#checking data types, null values, and column names
nba.info()

- No null values, BIG PLUS

In [None]:
#looking at the summary statistics of all the numeric columns
nba.describe().T

- Every value is filled
- Some outliers, but that's just part of the game. Will keep all the data for the first go-around

In [None]:
#plotting a histogram for every integer column
num_cols = nba.select_dtypes(include='int64').columns
for col in num_cols:
    plt.hist(nba[col])
    plt.title(col)
    plt.show()

- Judging by the histograms, all of the numerical columns look roughly normally distributed
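
Histograms can hide mild skew, so a numeric check is a handy complement to eyeballing the plots. The sketch below runs on synthetic data, not the NBA file, and the column names `bell_shaped` and `right_skewed` are made up for illustration. It reports the sample skewness and the p-value of SciPy's D'Agostino-Pearson `normaltest` for each column.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data; the column names are hypothetical, not from the NBA csv
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "bell_shaped": rng.normal(loc=100, scale=12, size=5000),
    "right_skewed": rng.exponential(scale=5, size=5000),
})

# Skewness near 0 is consistent with a symmetric, normal-looking column
skewness = demo.apply(stats.skew)

# D'Agostino-Pearson test: a small p-value is evidence against normality
normality_p = demo.apply(lambda col: stats.normaltest(col).pvalue)

print(pd.DataFrame({"skew": skewness, "normaltest_p": normality_p}))
```

With thousands of rows the test flags even tiny departures from normality, so the skewness value is usually the more practical number to read.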

## Prepare

- Added two columns, home_is_west and away_is_west, for teams playing in different conferences
- Changed Home, Conference, Opp.Conference and Wins into dummy variables
- Dropped dates and the columns that deal with point totals so the models can't simply read the winner off the score
- Split into train, validate and test
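
The actual `wrangle_nba` function lives in `prepare.py` and is not shown here, so the following is only a rough sketch of the dummy-variable and splitting steps on a toy frame. The values, the 60/20/20 proportions, the `random_state` and the use of scikit-learn's `train_test_split` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with made-up values; only the idea mirrors the steps listed above
games = pd.DataFrame({
    "Home": ["Home", "Away"] * 50,
    "Conference": ["West", "East", "East", "West"] * 25,
    "W": [1, 0, 0, 1, 1, 0, 1, 0, 0, 1] * 10,
})

# Turn the categorical columns into 0/1 dummy variables
games = pd.get_dummies(games, columns=["Home", "Conference"], drop_first=True, dtype=int)

# Hold out 20% for test, then 25% of the remainder for validate (60/20/20 overall),
# stratifying on the target so every split keeps the same win rate
train_validate, test = train_test_split(
    games, test_size=0.2, random_state=123, stratify=games["W"])
train, validate = train_test_split(
    train_validate, test_size=0.25, random_state=123, stratify=train_validate["W"])

print(train.shape, validate.shape, test.shape)
```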

## Explore

In [None]:
train, validate, test = wrangle_nba()

In [None]:
train.shape, validate.shape, test.shape

In [None]:
#correlation heatmap, masking the redundant upper triangle
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(20,12))
sns.heatmap(corr, cmap='Purples', annot=True, mask=mask)
plt.show()

- fieldgoal%, assist, 3point% and totalrebounds have the strongest positive correlation with wins
- oppfieldgoal%, oppassist, opp3point% and opptotalrebounds have the strongest negative correlation with wins
- offrebounds, oppoffrebounds, homeiswest and awayiswest seem to have little to no correlation
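
A large heatmap is hard to rank by eye, so it can help to pull out just the target's column of the correlation matrix and sort it. Here is a minimal sketch on hand-made toy data; the column names `fg_pct`, `opp_fg_pct` and `off_reb` are hypothetical stand-ins, not the real columns.

```python
import pandas as pd

# Hand-made toy data; the column names are hypothetical stand-ins
toy = pd.DataFrame({
    "W":          [1, 1, 1, 0, 0, 0],
    "fg_pct":     [0.50, 0.48, 0.47, 0.44, 0.43, 0.42],
    "opp_fg_pct": [0.42, 0.43, 0.44, 0.47, 0.48, 0.50],
    "off_reb":    [10, 11, 9, 10, 11, 9],
})

# Correlation of every feature with the target, sorted from most negative to most positive
corr_with_target = toy.corr()["W"].drop("W").sort_values()
print(corr_with_target)
```

The same one-liner should work on `train` as long as every column passed to `corr()` is numeric.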

### Is there a relationship between wins and home games?

- **$H_0$:** There is no dependence between wins and home games
- **$H_a$:** There is a dependence between wins and home games

In [None]:
alpha = 0.01

In [None]:
observed = pd.crosstab(train.W, train.Home)

In [None]:
chi2, p, degf, expected = stats.chi2_contingency(observed)

In [None]:
if p < alpha:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
p

In [None]:
sns.catplot(x="W", hue="Home", kind="count", data=train)
plt.title('Does being at Home Improve your Win Chance?')
plt.ylabel('# of Games')
plt.xlabel('Loss or Win')
plt.show()

In [None]:
print(train[train.W == 1].Home.value_counts())
print("The share of wins that came at Home is", 1591/(1591+1164))

In [None]:
print(train[train.W == 0].Home.value_counts())
print("The share of losses that came at Home is", 1145/(1145+1610))

#### Takeaways:
- The gap between 58% and 42% is a big one when it comes to win percentage
- The evidence suggests that wins and being at home have some sort of relationship/dependence.
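
To make the test concrete, the sketch below rebuilds the contingency table from the four counts hard-coded in the cells above and reruns `chi2_contingency` on it. How the counts are assigned to home and away is an assumption: 1591 is taken to be home wins and 1145 home losses. It also computes the win rate per venue (wins divided by all games at that venue), which is the quantity the takeaways talk about.

```python
import pandas as pd
from scipy import stats

# Counts from the cells above; the home/away assignment is an assumption
observed = pd.DataFrame(
    [[1610, 1145],   # losses: away, home
     [1164, 1591]],  # wins:   away, home
    index=["loss", "win"],
    columns=["away", "home"],
)

chi2, p, degf, expected = stats.chi2_contingency(observed)

# Win rate by venue: wins divided by all games played at that venue
win_rate = observed.loc["win"] / observed.sum()

print(f"chi2 = {chi2:.1f}, p = {p:.3g}, degrees of freedom = {degf}")
print(win_rate.round(3))
```

`expected` holds the counts we would see if wins and venue were independent, so comparing it with `observed` shows where the dependence comes from.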

### Do winning teams have the same free throw percentage as losing teams?

- **$H_0$:** Winning and losing teams shoot the same free throw percentage
- **$H_a$:** Winning and losing teams do not shoot the same free throw percentage

In [None]:
win = train[train.W == 1]
lose = train[train.W == 0]

t, p = stats.ttest_ind(win['FreeThrows.'], lose['FreeThrows.'])

In [None]:
if p < alpha:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
p

In [None]:
sns.catplot(x="W", y="FreeThrows.", kind="bar", data=train)
plt.title('Do Winning Teams shoot better free throws than Losing Teams?')
plt.xlabel('Win or Lose')
plt.ylabel('Free Throw %')
plt.show()

In [None]:
print("The average free throw percentage for winning teams is", round(win['FreeThrows.'].mean(),3))

In [None]:
print("The average free throw percentage for losing teams is", round(lose['FreeThrows.'].mean(),3))

#### Takeaways:
- The difference between 78% and 75% seems almost too small to matter
    - The statistical testing suggests otherwise
- Winning teams do shoot slightly better
    - Maybe those few points are the difference in winning a game
- The evidence suggests there is a significant difference in free throw percentage between winning and losing teams
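
A small gap can still come out statistically significant when each group has thousands of games, because the p-value shrinks as the sample grows. An effect size such as Cohen's d gives a sense of how big the gap is in practical terms. The sketch below uses synthetic percentages: the 0.78 and 0.75 means echo the takeaways, while the 0.10 standard deviation and the group sizes are assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic free throw percentages; spread and sample sizes are assumptions
rng = np.random.default_rng(0)
win_ft = rng.normal(loc=0.78, scale=0.10, size=2755)
lose_ft = rng.normal(loc=0.75, scale=0.10, size=2755)

t, p = stats.ttest_ind(win_ft, lose_ft)

# Cohen's d: difference in means divided by the pooled standard deviation
pooled_sd = np.sqrt((win_ft.var(ddof=1) + lose_ft.var(ddof=1)) / 2)
cohens_d = (win_ft.mean() - lose_ft.mean()) / pooled_sd

print(f"p = {p:.2e}, Cohen's d = {cohens_d:.2f}")
```

By the usual rule of thumb, a d around 0.2 is a small effect and 0.5 a medium one, so a tiny p-value and a modest effect can sit side by side.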

### Do winning teams have the same number of offensive rebounds as losing teams?

- **$H_0$:** Winning and losing teams have the same number of offensive rebounds
- **$H_a$:** Winning and losing teams do not have the same number of offensive rebounds

In [None]:
t, p = stats.ttest_ind(win['OffRebounds'], lose['OffRebounds'])

In [None]:
if p < alpha:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")
p

In [None]:
sns.catplot(x="W", y="OffRebounds", kind="bar", data=train)
plt.title('Do Winning Teams grab more offensive rebounds than Losing Teams?')
plt.xlabel('Win or Lose')
plt.ylabel('# of Offensive Rebounds')
plt.show()

In [None]:
print("The average number of offensive rebounds for winning teams is", round(win['OffRebounds'].mean(),3))

In [None]:
print("The average number of offensive rebounds for losing teams is", round(lose['OffRebounds'].mean(),3))

#### Takeaways:
- The difference between 10.1 and 10.3 offensive rebounds seems almost too small to matter
    - The statistical testing suggests otherwise
- Oddly enough, losing teams grab more offensive rebounds than winning teams
    - Maybe because they are missing more shots?
- The evidence suggests there is a significant difference in offensive rebounds between winning and losing teams

# Model 
## Train

In [None]:
X_train = train.drop(columns = ['Team', 'Opponent', 'W'])
y_train = train.W

- On the first model we are going to try all the variables and see which have the most influence on the model

### Setting the Baseline

In [None]:
train['baseline'] = train.W.value_counts().index[0]

In [None]:
baseline_accuracy = (train.baseline == train.W).mean()
print(f"The baseline accuracy is {baseline_accuracy:.4f}")
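
The baseline is simply "always predict the most common class". Here is a minimal sketch of that idea on a five-row toy target (the values are made up), using `mode()` in place of `value_counts().index[0]`:

```python
import pandas as pd

# Toy target column; 1 = win, 0 = loss
y = pd.Series([1, 0, 1, 1, 0], name="W")

# The baseline always predicts the most common class
most_common = y.mode()[0]
baseline_accuracy = (y == most_common).mean()

print(f"Baseline predicts {most_common} with accuracy {baseline_accuracy:.2f}")
```

If the hard-coded counts in the explore section are right (1591 + 1164 wins and 1145 + 1610 losses), train is evenly split, so the baseline there should land at about 50%.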

### Model 1

In [None]:
coeff, cm, class_report = logistic_regression(X_train, y_train)

In [None]:
coeff.T

In [None]:
cm

In [None]:
class_report

- Model 1 performed well above the baseline accuracy, but it used all the features
- Let's pull out the features with the most influence, i.e. coefficients with abs() > 1
- `FieldGoals.`, `X3PointShots.`, `FreeThrows.`, `Opp.FieldGoals.`, `Opp.3PointShots.`, `Opp.FreeThrows.`
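
One thing to keep in mind with an abs() > 1 cutoff: the size of a logistic regression coefficient depends on the scale of its feature, so percentages stored as small decimals tend to get large coefficients while count stats get small ones. The sketch below shows this on synthetic data with two equally informative features; the feature scales and the use of scikit-learn's `LogisticRegression` and `StandardScaler` are assumptions, since the helper in `model.py` is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two equally informative features on very different scales
rng = np.random.default_rng(1)
z = rng.normal(size=(2000, 2))
y = (rng.random(2000) < 1 / (1 + np.exp(-2 * (z[:, 0] + z[:, 1])))).astype(int)
X = np.column_stack([
    0.45 + 0.05 * z[:, 0],   # a percentage-like feature stored as a decimal
    25 + 5 * z[:, 1],        # a count-like feature
])

# Fit once on the raw features and once on standardized features
raw_coef = LogisticRegression(max_iter=5000).fit(X, y).coef_[0]
X_scaled = StandardScaler().fit_transform(X)
scaled_coef = LogisticRegression(max_iter=5000).fit(X_scaled, y).coef_[0]

print("raw coefficients:   ", raw_coef.round(2))
print("scaled coefficients:", scaled_coef.round(2))
```

On the raw features only the percentage-like column clears the cutoff, while after standardizing the two coefficients come out similar. Comparing coefficients on scaled features is one way to avoid dropping useful count stats by accident.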

### Model 2

In [None]:
X_train2 = train[['FieldGoals.', 'X3PointShots.', 'FreeThrows.', 'Opp.FieldGoals.', 'Opp.3PointShots.', 'Opp.FreeThrows.']]
y_train2 = train.W

In [None]:
coeff2, cm2, class_report2 = logistic_regression(X_train2, y_train2)

In [None]:
coeff2

In [None]:
cm2

In [None]:
class_report2

- Model 2's accuracy dropped by more than 7 percentage points.
- Maybe it's because the model only included the shooting stats for the team and its opponent
- Will try other models then think about mixing in new features

### Model 3

In [None]:
cm3, class_report3 = decision_tree(X_train2, y_train2, 3)

In [None]:
cm3

In [None]:
class_report3

### Model 4

In [None]:
cm4, class_report4 = decision_tree(X_train2, y_train2, 5)

In [None]:
cm4

In [None]:
class_report4

- For Models 3 & 4 we used a decision tree with different max depths
- Neither has performed better than the logistic regression
- Will check accuracy further with validate to make sure we did not overfit
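
To show what that overfitting check looks like, here is a sketch that fits trees of different depths and compares train accuracy with accuracy on held-out rows. The data is synthetic and noisy, and calling scikit-learn's `DecisionTreeClassifier` directly is an assumption, since the `decision_tree` helper in `model.py` is not shown.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy classification data standing in for the NBA features
rng = np.random.default_rng(7)
X = rng.normal(size=(1500, 6))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=1500) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y)

scores = {}
for depth in (3, 5, None):   # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_val, y_val))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, validate={scores[depth][1]:.3f}")
```

A wide gap between the train and validate columns is the sign of overfitting; the unrestricted tree memorizes the training rows but does much worse on rows it has not seen.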

### Model 5

In [None]:
cm5, class_report5 = random_forest(X_train2, y_train2, 500, 3)

In [None]:
cm5

In [None]:
class_report5

### Model 6

In [None]:
cm6, class_report6 = random_forest(X_train2, y_train2, 100, 6)

In [None]:
cm6

In [None]:
class_report6

- Models 5 & 6 use a random forest with different max depths and minimum sampling
- Model 6 did not do much better and, for fear of overfitting, we will only move forward with Model 5

### Model 7

In [None]:
cm7, class_report7 = kneighbors(X_train2, y_train2, 3)

In [None]:
cm7

In [None]:
class_report7

- Model 7 has done the best so far with the chosen features
- Will definitely keep a close eye on this one; k-nearest neighbors with a small k is prone to overfitting
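
Part of the reason to be wary: when a k-nearest neighbors model is scored on its own training rows, every point counts among its own neighbors and votes for itself. The sketch below makes this visible with labels that are pure noise, so any accuracy above 50% is memorization rather than learning. It uses scikit-learn's `KNeighborsClassifier` directly, which is an assumption about what the `kneighbors` helper wraps.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with pure-noise labels: there is nothing real to learn here
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 6))
y = rng.integers(0, 2, size=1000)

train_acc = {}
for k in (1, 3, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Scoring on the training rows lets every point vote for itself
    train_acc[k] = knn.score(X, y)
    print(f"k={k}: train accuracy = {train_acc[k]:.3f}")
```

Train accuracy falls back toward chance as k grows, which is why the validate score is the number to trust for a model like this.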

#### Results
| Model | Accuracy |
| --- | --- |
| Model 1 | .9027 |
| Model 2 | .8272 |
| Model 3 | .7902 |
| Model 4 | .8156 |
| Model 5 | .8045 |
| Model 7 | .8673 |

## Validate