# NBA Data Project

*Ayden Rivera*

## Preliminaries

The dataset and its description are both in the `data` folder. For this project you'll need `numpy`, `pandas`, and either `matplotlib` or `seaborn` for visualization. 

In the next cell, make your imports and load the dataset:

In [2]:
# Imports and loading the dataset
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
nba = pd.read_csv('data/nba_data.csv')


## Exploration

Use `.head()` and `.info()` to confirm that the data loaded correctly and to get a feel for the data type of each column. Use `.describe()` to check for extreme values that don't make sense (*e.g., can someone play negative minutes or score negative points? Can someone play a million minutes when there are only 48 minutes per game and 82 games in a season?*).

In [3]:
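# Exploring the dataset
nba.head()      # not rendered here: a notebook cell only displays its last expression
nba.info()      # prints column names, non-null counts, and dtypes
nba.describe()  # summary statistics; rendered because it is the last expression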

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 679 entries, 0 to 678
Data columns (total 31 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Rk                 679 non-null    int64  
 1   Player             679 non-null    object 
 2   Pos                679 non-null    object 
 3   Age                679 non-null    int64  
 4   Tm                 679 non-null    object 
 5   G                  679 non-null    int64  
 6   GS                 679 non-null    int64  
 7   MP                 679 non-null    int64  
 8   FG                 679 non-null    int64  
 9   FGA                679 non-null    int64  
 10  FG%                676 non-null    float64
 11  3P                 679 non-null    int64  
 12  3PA                679 non-null    int64  
 13  3P%                655 non-null    float64
 14  2P                 679 non-null    int64  
 15  2PA                679 non-null    int64  
 16  2P%                672 non

|  | Rk | Age | G | GS | MP | FG | FGA | FG% | 3P | 3PA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 679.0 | 679.0 | 679.0 | 679.0 | 679.0 | 679.0 | 679.0 | 676.0 | 679.0 | 679.0 | ... | 642.0 | 679.0 | 679.0 | 679.0 | 679.0 | 679.0 | 679.0 | 679.0 | 679.0 | 679.0 |
| mean | 265.976436 | 26.025037 | 43.338733 | 20.069219 | 984.421208 | 169.387334 | 357.377025 | 0.464241 | 50.795287 | 140.606775 | ... | 0.752586 | 42.156112 | 133.879234 | 176.035346 | 102.970545 | 29.698085 | 18.718704 | 54.263623 | 81.194404 | 463.21944 |
| std | 154.956296 | 4.325709 | 24.727306 | 25.766359 | 800.236331 | 169.157722 | 350.737612 | 0.11279 | 57.218086 | 151.702365 | ... | 0.150094 | 49.18752 | 130.378234 | 172.793776 | 122.358385 | 27.079014 | 24.58479 | 55.433154 | 64.05651 | 471.423224 |
| min | 1.0 | 19.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 132.5 | 23.0 | 22.0 | 0.0 | 266.5 | 36.0 | 79.0 | 0.415 | 4.5 | 16.5 | ... | 0.6865 | 9.0 | 32.0 | 45.0 | 18.0 | 7.0 | 4.0 | 12.0 | 26.0 | 95.5 |
| 50% | 264.0 | 25.0 | 45.0 | 6.0 | 797.0 | 108.0 | 237.0 | 0.454 | 30.0 | 89.0 | ... | 0.769 | 27.0 | 94.0 | 126.0 | 57.0 | 22.0 | 11.0 | 36.0 | 68.0 | 291.0 |
| 75% | 399.5 | 29.0 | 65.5 | 36.5 | 1663.5 | 253.5 | 529.0 | 0.506 | 77.0 | 218.0 | ... | 0.84475 | 57.0 | 196.5 | 260.5 | 137.5 | 45.0 | 24.5 | 81.0 | 123.0 | 689.0 |
| max | 539.0 | 42.0 | 83.0 | 83.0 | 2963.0 | 728.0 | 1559.0 | 1.0 | 301.0 | 731.0 | ... | 1.0 | 274.0 | 744.0 | 973.0 | 741.0 | 128.0 | 193.0 | 300.0 | 279.0 | 2225.0 |
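The sanity checks described above can also be written as explicit filters rather than read off the `.describe()` table by eye. The following is a minimal sketch on a small made-up frame (`toy` and its values are invented for illustration, not taken from the dataset); the same filters should apply to `nba` if its columns are named as shown in the `.info()` output.

```python
import pandas as pd

# Toy stand-in for the real data; every value here is invented for illustration.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "G": [82, 10, 41],
    "MP": [2900, 120, 5000],   # 5000 is more than a full season allows
    "PTS": [2000, -3, 400],    # negative points are impossible
})

# Counting stats can never be negative.
count_cols = ["G", "MP", "PTS"]
negative_rows = toy[(toy[count_cols] < 0).any(axis=1)]

# Loose ceiling on minutes: 48 regulation minutes x 82 games.
MAX_SEASON_MINUTES = 48 * 82
too_many_minutes = toy[toy["MP"] > MAX_SEASON_MINUTES]

print(negative_rows["Player"].tolist())     # ['B']
print(too_many_minutes["Player"].tolist())  # ['C']
```

An empty result from both filters on the real data would suggest there are no obviously impossible values in those columns.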


## Data Cleaning

- The `Rk` and `Player-additional` columns won't be useful to us. Delete them.
- There are several columns with null entries; deal with them appropriately:
    - Is it reasonable for null entries to exist in these columns?
    - Do we need to replace the null values with some other value?

In [4]:
# Cleaning the dataset
nba = nba.drop(columns=['Rk', 'Player-additional'])
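The cell above does not yet address the null entries. One plausible explanation, which should be verified rather than assumed, is that a shooting percentage is null exactly when the player had zero attempts (0 divided by 0). The sketch below checks that hypothesis on a made-up frame; `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in: player B never attempted a three, so 3P% is undefined.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "3P": [10, 0, 0],
    "3PA": [25, 0, 4],
    "3P%": [0.4, np.nan, 0.0],
})

# How many nulls does each column have?
null_counts = toy.isna().sum()

# Hypothesis: every null percentage belongs to a player with zero attempts.
nulls_explained = bool((toy.loc[toy["3P%"].isna(), "3PA"] == 0).all())

print(null_counts["3P%"], nulls_explained)  # 1 True
```

If the hypothesis holds on the real data, leaving the values as `NaN` is a defensible choice: pandas skips `NaN` when averaging, whereas filling with 0 would treat "never attempted" as "missed everything".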

## Data Augmentation

While the stats included in the dataset give us a wide view of a player's contributions throughout the season, basketball fans and analysts have devised more advanced statistics to quantify these contributions more accurately. You can look up any of these statistics to see how they're calculated.

Add the following statistics as new columns to the dataframe (suggested column name in parentheses):
- Points per shot (PPS)
- Points per possession (PPP)
- True Shooting Percentage (TS%)
- Free Throw Rate (FTR)
- Assist-to-Turnover Ratio (ATO)
- Hollinger Assist Ratio (hAST%)

In [10]:
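# Adding Additional Analytics
# PPS: field-goal points per field-goal attempt (free throws excluded; some sources use PTS / FGA)
nba['PPS'] = (2 * nba['2P'] + 3 * nba['3P']) / nba['FGA']
# PPP: points per estimated possession (0.44 * FTA approximates possessions ending in free throws)
nba['PPP'] = nba['PTS'] / (nba['FGA'] - nba['ORB'] + nba['TOV'] + 0.44 * nba['FTA'])
# TS%: shooting efficiency that accounts for twos, threes, and free throws
nba['TS%'] = nba['PTS'] / (2 * (nba['FGA'] + 0.44 * nba['FTA']))
# FTR: free-throw attempts per field-goal attempt
nba['FTR'] = nba['FTA'] / nba['FGA']
# ATO: assists per turnover
nba['ATO'] = nba['AST'] / nba['TOV']
# hAST%: share of a player's possessions ending in an assist (stored as a fraction, not multiplied by 100)
nba['hAST%'] = nba['AST'] / (nba['FGA'] + 0.475 * nba['FTA'] + nba['AST'] + nba['TOV'])
nba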

|  | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | ... | BLK | TOV | PF | PTS | PPS | PPP | TS% | FTR | ATO | hAST% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Precious Achiuwa | C | 23 | TOR | 55 | 12 | 1140 | 196 | 404 | 0.485 | ... | 30 | 59 | 102 | 508 | 1.042079 | 1.216592 | 0.553908 | 0.306931 | 0.847458 | 0.087428 |
| 1 | Steven Adams | C | 29 | MEM | 42 | 42 | 1133 | 157 | 263 | 0.597 | ... | 46 | 79 | 98 | 361 | 1.193916 | 1.953886 | 0.564486 | 0.490494 | 1.227848 | 0.193893 |
| 2 | Bam Adebayo | C | 25 | MIA | 75 | 75 | 2598 | 602 | 1114 | 0.540 | ... | 61 | 187 | 208 | 1529 | 1.081688 | 1.181717 | 0.592232 | 0.360862 | 1.283422 | 0.138572 |
| 3 | Ochai Agbaji | SG | 22 | UTA | 59 | 22 | 1209 | 165 | 386 | 0.427 | ... | 15 | 41 | 99 | 467 | 1.064767 | 1.127039 | 0.560813 | 0.178756 | 1.634146 | 0.127189 |
| 4 | Santi Aldama | PF | 22 | MEM | 77 | 20 | 1682 | 247 | 525 | 0.470 | ... | 48 | 60 | 143 | 696 | 1.120000 | 1.235444 | 0.591475 | 0.274286 | 1.616667 | 0.129264 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 674 | Thaddeus Young | PF | 34 | TOR | 54 | 9 | 795 | 108 | 198 | 0.545 | ... | 5 | 42 | 88 | 240 | 1.121212 | 1.330082 | 0.572956 | 0.131313 | 1.785714 | 0.229113 |
| 675 | Trae Young | PG | 24 | ATL | 73 | 73 | 2541 | 597 | 1390 | 0.429 | ... | 9 | 300 | 104 | 1914 | 0.969784 | 0.999394 | 0.572656 | 0.459712 | 2.470000 | 0.270979 |
| 676 | Omer Yurtseven | C | 24 | MIA | 9 | 0 | 83 | 16 | 27 | 0.593 | ... | 2 | 4 | 16 | 40 | 1.296296 | 1.560062 | 0.674764 | 0.222222 | 0.500000 | 0.055788 |
| 677 | Cody Zeller | C | 30 | MIA | 15 | 2 | 217 | 37 | 59 | 0.627 | ... | 4 | 14 | 33 | 98 | 1.254237 | 1.545741 | 0.658602 | 0.593220 | 0.714286 | 0.100376 |
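Several of these ratios divide by a counting stat that can be zero (the `.describe()` output shows minimums of 0 for `FGA` and `TOV`). With pandas, a nonzero value divided by zero becomes `inf` and zero divided by zero becomes `NaN`, which can distort later sorting and averaging. A minimal sketch of one way to guard against this, using a made-up frame (`toy` is invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"AST": [10, 5, 0], "TOV": [4, 0, 0]})

# Plain division: 10/4 = 2.5, 5/0 = inf, 0/0 = NaN
raw = toy["AST"] / toy["TOV"]

# Mask zero denominators first so every undefined ratio becomes NaN.
safe = toy["AST"] / toy["TOV"].where(toy["TOV"] > 0)

print(raw.tolist())   # [2.5, inf, nan]
print(safe.tolist())  # [2.5, nan, nan]
```

Whether an undefined ratio should be `NaN`, or something else such as the raw assist count, is a judgment call; `NaN` simply keeps those players out of rankings and averages.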


### Team-Contextual Analytics
Understanding a player's value in the context of their *team* is also an important consideration. For example, could a given player be scoring more points simply because they're playing next to a superstar who commands more defensive attention? Is a given center grabbing lots of defensive rebounds because they're skilled, or because their teammates are forcing more bad shots? The following statistics are a bit more difficult to calculate, but may yield better insight into a player's *context* within their team:

- Rebound Rate (TRB%)
- Usage Percentage (USG%)

For these stats you'll need to calculate *team* totals. I recommend creating a pivot table called `team_totals` that aggregates the sum of each column in your original dataset on a per-team basis. Then, when you need a player's team totals, you can look up the appropriate row and column of the `team_totals` dataframe.
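A minimal sketch of the `team_totals` approach follows, on a made-up frame (`toy` and its values are invented, and two-player teams are unrealistic). The usage formula is the commonly published one, `100 * ((FGA + 0.44 * FTA + TOV) * (Tm MP / 5)) / (MP * (Tm FGA + 0.44 * Tm FTA + Tm TOV))`. The published rebound-rate formula also needs opponent rebounds, which do not appear among this dataset's columns, so the `TmTRB%` column below is a simplified, hypothetical variant that uses team rebounds only.

```python
import pandas as pd

# Toy stand-in: two teams of two players each; all values are invented.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Tm": ["X", "X", "Y", "Y"],
    "MP": [2000, 1000, 1500, 1500],
    "FGA": [900, 300, 600, 600],
    "FTA": [300, 100, 200, 200],
    "TOV": [150, 50, 100, 100],
    "TRB": [400, 200, 300, 300],
})

# Sum each stat per team.
stat_cols = ["MP", "FGA", "FTA", "TOV", "TRB"]
team_totals = toy.pivot_table(index="Tm", values=stat_cols, aggfunc="sum")

# Attach each player's team totals as Tm_* columns.
df = toy.join(team_totals.add_prefix("Tm_"), on="Tm")

# Usage percentage: share of team plays used while the player is on the floor.
player_plays = df["FGA"] + 0.44 * df["FTA"] + df["TOV"]
team_plays = df["Tm_FGA"] + 0.44 * df["Tm_FTA"] + df["Tm_TOV"]
df["USG%"] = 100 * (player_plays * (df["Tm_MP"] / 5)) / (df["MP"] * team_plays)

# Simplified rebound share (assumption: opponent rebounds are unavailable).
df["TmTRB%"] = 100 * (df["TRB"] * (df["Tm_MP"] / 5)) / (df["MP"] * df["Tm_TRB"])

print(df[["Player", "USG%", "TmTRB%"]].round(1))
```

If the real dataset contains combined rows for players who changed teams midseason (some exports label these with a team code such as `TOT`), those rows would need to be excluded before summing, or their stats would be counted twice.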

In [6]:
# Team-Contextual Analytics


## Querying the Data

### Simple Lookups
Display the top five players in the league for the following stats: minutes, points, free-throw attempts, 3-pointers made, and assists (for each statistic, number 1 should be the *largest* value).

Do these lists make sense? (If you're not sure, check with a friend who's into basketball. They'll help!)
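One way to produce these lists is `DataFrame.nlargest`, which sorts in descending order and keeps the first *n* rows. A minimal sketch on a made-up frame (`toy` and its values are invented; on the real data the loop would run over `MP`, `PTS`, `FTA`, `3P`, and `AST`):

```python
import pandas as pd

# Toy stand-in with six invented players.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "PTS": [500, 2225, 40, 1914, 696, 1529],
    "AST": [50, 300, 5, 741, 120, 240],
})

top_five = {}
for stat in ["PTS", "AST"]:
    # Largest value first, so row 1 of each list is the league leader.
    top_five[stat] = toy.nlargest(5, stat)[["Player", stat]]
    print(top_five[stat], end="\n\n")
```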

### More Complex Lookups

- Print the positions in order from highest to lowest average points per player.
- Repeat the previous question for average blocks per player, per position.
- Determine the league's top scorers in terms of *points per minute* among players who have played at least half their team's total minutes. 
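The lookups above can be sketched with `groupby`. The frame below is made up (`toy` and its values are invented). The minutes threshold is open to interpretation: this sketch assumes "half their team's total minutes" means half of the minutes available to a single player, that is, the team's summed minutes divided by 5 (five players share the floor) and then halved. Treat that reading as an assumption, not as the intended definition.

```python
import pandas as pd

# Toy stand-in: six invented players on two teams.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Pos": ["C", "C", "PG", "PG", "SG", "SG"],
    "Tm": ["X", "X", "X", "Y", "Y", "Y"],
    "MP": [2400, 300, 2100, 2500, 2000, 400],
    "PTS": [1200, 200, 1500, 2000, 900, 300],
    "BLK": [100, 20, 10, 15, 20, 4],
})

# Average points and blocks per player at each position, highest first.
avg_pts = toy.groupby("Pos")["PTS"].mean().sort_values(ascending=False)
avg_blk = toy.groupby("Pos")["BLK"].mean().sort_values(ascending=False)

# Minutes threshold (assumed reading): half of (team minutes / 5).
team_mp = toy.groupby("Tm")["MP"].transform("sum")
eligible = toy[toy["MP"] >= 0.5 * team_mp / 5].copy()

# Rank the eligible players by points per minute.
eligible["PTS_per_min"] = eligible["PTS"] / eligible["MP"]
top_scorers = eligible.sort_values("PTS_per_min", ascending=False)

print(avg_pts.index.tolist())           # ['PG', 'C', 'SG']
print(avg_blk.index.tolist())           # ['C', 'PG', 'SG']
print(top_scorers["Player"].tolist())   # ['D', 'C', 'A', 'E']
```

Players B and F score at a high per-minute rate in this toy data but fall below the threshold, which is the point of the filter: it keeps small-sample outliers out of the ranking.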

## Visualizing the Data

- Create a heatmap that shows the number of players in each *quintile* of points scored at each position (this should be a 5x5 heatmap).
- Create a scatter plot that shows players' total points on the y-axis vs. minutes on the x-axis. Draw a trendline fit to the data. What does that trendline represent?
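Both plots can be sketched with `pandas` and `matplotlib` alone. The data below is randomly generated purely for illustration (`toy`, the seed, and the 0.45 points-per-minute rate are all invented). Quintiles are assigned with `pd.qcut` on ranks rather than raw points, an implementation choice made here so that tied point totals cannot produce duplicate bin edges.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in: 200 invented players whose points grow with minutes.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pos": rng.choice(["PG", "SG", "SF", "PF", "C"], size=n),
    "MP": rng.integers(1, 3000, size=n),
})
toy["PTS"] = (0.45 * toy["MP"] + rng.normal(0, 100, size=n)).clip(lower=0).round()

# Split players into five equal-sized groups by points (Q1 lowest, Q5 highest).
labels = ["Q1", "Q2", "Q3", "Q4", "Q5"]
toy["PTS_quintile"] = pd.qcut(toy["PTS"].rank(method="first"), q=5, labels=labels)

# Count players per (position, quintile) pair: the 5x5 grid for the heatmap.
counts = pd.crosstab(toy["Pos"], toy["PTS_quintile"])

# Least-squares line: PTS ~ slope * MP + intercept.
slope, intercept = np.polyfit(toy["MP"], toy["PTS"], deg=1)

fig, (ax_heat, ax_scatter) = plt.subplots(1, 2, figsize=(12, 5))

# Heatmap with the count written in each cell.
im = ax_heat.imshow(counts.to_numpy(), cmap="Blues")
ax_heat.set_xticks(range(counts.shape[1]))
ax_heat.set_xticklabels(counts.columns)
ax_heat.set_yticks(range(counts.shape[0]))
ax_heat.set_yticklabels(counts.index)
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        ax_heat.text(j, i, str(counts.iat[i, j]), ha="center", va="center")
fig.colorbar(im, ax=ax_heat)
ax_heat.set_title("Players per points quintile, by position")

# Scatter plot with the fitted trendline drawn across the range of minutes.
xs = np.array([toy["MP"].min(), toy["MP"].max()])
ax_scatter.scatter(toy["MP"], toy["PTS"], s=10, alpha=0.6)
ax_scatter.plot(xs, slope * xs + intercept, color="red")
ax_scatter.set_xlabel("Minutes played")
ax_scatter.set_ylabel("Total points")
```

The slope of the fitted line is in units of points per minute, so one reasonable reading of the trendline is the typical scoring rate across the league: players above the line score more than expected for their minutes, and players below it score less. With `seaborn`, `sns.heatmap(counts, annot=True, fmt="d")` would be a shorter way to draw the same heatmap.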