
# NBA 3-Point Statistics

## Table of Contents
1. [Introduction](#Introduction)
2. [Data Wrangling](#Data-Wrangling)
3. [New NBA Stats](#New-NBA-Stats)
4. [3Pave Rankings](#3Pave-Rankings)
5. [Conclusion](#Conclusion)


## Introduction

There has never been a single metric to determine the best NBA 3-point shooters. Most fans form an impression from watching the games. When they look for objective evidence, most consider two numbers: 3-point Percentage and 3-Pointers Made.

Is there an empirical way to combine 3-point Percentage and 3-pointers Made into one metric? Will the metric verify that Steph Curry is the greatest 3-point shooter of all-time? What about the best 3-point shooting team of all-time?

To answer these questions, I have developed 3-point Average, or 3Pave. This metric computes the number of points the team gains per possession when the player makes a 3-pointer, minus the number of points the team loses when the player misses a 3-pointer. This metric is evaluated based on the expected value of points on an average possession.

This Jupyter Notebook contains Exploratory Data Analysis of the new metric, 3Pave. It evaluates 3-point shooters throughout NBA history using 3Pave and another new metric, Expected Minutes before a 3, or EM3. Data Wrangling steps are included for those with an interest in learning pandas. Readers uninterested in pandas can skip directly to [New NBA Stats](#New-NBA-Stats).


#### References

https://www.kaggle.com/drgilermo/nba-players-stats <br>
https://www.basketball-reference.com/

#### Copyright

Corey J Wade<br>
May 31, 2018

This Jupyter Notebook and the statistics within may be redistributed with credit given to the author, Corey J Wade.


## Data Wrangling

The following csv file is taken from https://www.kaggle.com/drgilermo/nba-players-stats. When I downloaded the file, it contained standard statistics through 2017. Dr. Guillermo scraped it from https://www.basketball-reference.com/. 

#### NBA Stats Through 2017

In [558]:
# Import pandas
import pandas as pd

# Open file as dataframe, from Dr. Guillermo via Kaggle
df_2017 = pd.read_csv('Seasons_Stats.csv')

# Display first five rows
df_2017.head()


Basketball statistics were not widely computed before the modern era, hence the null values. Also, the 3-point shot did not exist before 1979, so we can start there.

In [None]:
# Delete unnecessary column
del df_2017['Unnamed: 0']

# Only select years from 1979 onward
df_2017 = df_2017[df_2017['Year']>=1979]

# Display last five rows
df_2017.tail()

#### 2018 NBA Stats

The 2018 NBA season recently finished. I used the same link, https://www.basketball-reference.com/, to scrape the 2018 statistics.

In [None]:
# Read html file
df_2018, = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2018_totals.html", header=0)

# Convert to csv file
df_2018.to_csv("df_2018.csv", index=False)

# Display first five rows
df_2018.head()

Since there is no column for year, I will add one.

In [None]:
# Delete unnecessary column
del df_2018['Rk']

# Add column for year, place at index 0
df_2018.insert(0, 'Year', 2018.0)

# Display last five rows
df_2018.tail()

#### Concatenating Dataframes

Since the dataframes have a different number of columns, I select the relevant columns before concatenating.

In [None]:
# Select relevant columns
tp_2017 = df_2017[['Year', 'Tm', 'Player', 'G','MP', 'PTS', '3P', '3PA', '3P%']]
tp_2018 = df_2018[['Year', 'Tm', 'Player', 'G','MP', 'PTS', '3P', '3PA', '3P%']]

# Concatenate dataframes
tp = pd.concat([tp_2017, tp_2018], ignore_index=True)

# Show last five rows
tp.tail()

#### Column Consistency

In [None]:
# Display column info
tp.info()

With the exception of 'Year', the data has not been rendered as numbers. They must be converted to floats for mathematical operations.

In [None]:
# Convert numeric columns to floats; unparseable entries become NaN
tp.G = pd.to_numeric(tp.G, errors='coerce')
tp.MP = pd.to_numeric(tp.MP, errors='coerce')
tp.PTS = pd.to_numeric(tp.PTS, errors='coerce')
tp['3P'] = pd.to_numeric(tp['3P'], errors='coerce')
tp['3PA'] = pd.to_numeric(tp['3PA'], errors='coerce')
tp['3P%'] = pd.to_numeric(tp['3P%'], errors='coerce')

# Check columns
tp.info()
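
To illustrate what `errors='coerce'` does in the conversion above, here is a minimal sketch on a small made-up frame, not the real data. The middle row imitates a repeated header line, which is only an assumption about what a scraped table might contain. The loop is an equivalent, more compact way to convert several columns at once.

```python
import pandas as pd

# Hypothetical miniature of a scraped table; the second row mimics a
# repeated header row (an assumption for illustration only)
toy = pd.DataFrame({
    'Player': ['A', 'Player', 'B'],
    'G': ['82', 'G', '41'],
    '3P%': ['.402', '3P%', '.350'],
})

# Convert each numeric column; values that cannot be parsed become NaN
numeric_cols = ['G', '3P%']
for col in numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

print(toy.dtypes)
print(toy)
```

Rows that turn into NaN this way are later removed by the minimum requirements, since comparisons with NaN evaluate to False.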

#### Minimum Requirements

It's not necessary to examine data from all players. If a player was never recorded as taking a 3-pointer, he can be excluded from the dataframe. Furthermore, I am not interested in players that only took a few 3's. The purpose of the minimum requirements is to eliminate non-3-point shooters and very low outliers. My minimum requirements are less stringent than other NBA "qualified" statistics online. See, for instance, https://stats.nba.com/help/statminimums/.

In [None]:
# Choose players with more than 20 3's per season
tp = tp[(tp['3P'] > 20)]

# Choose players with more than 320 minutes played per season
tp = tp[(tp['MP'] > 320)]

# Choose players with more than 41 games per season
tp = tp[(tp['G'] > 41)]

# Display last 5 rows
tp.tail()

In [None]:
tp.info()

Now all columns have the same number of rows.

#### Points Per Possession

The last piece of Data Wrangling is points per possession. It will be used to compute the expected value of points each time a team has the ball. I obtained the team ratings through NBA history at https://www.basketball-reference.com/leagues/NBA_stats.html.

In [None]:
# Read html file
df_teams, = pd.read_html("https://www.basketball-reference.com/leagues/NBA_stats.html", header=0)

# Display first five rows
df_teams.head()

In [None]:
# Drop first row
df_teams.drop(df_teams.index[0], inplace=True)

# Choose relevant columns (copy so later assignments do not act on a view)
df_PPP = df_teams[['Unnamed: 1','Unnamed: 31']].copy()

# Rename columns
df_PPP.columns = ['Year', 'PPP']

# Show first five rows
df_PPP.head()

In [None]:
# Show column info
df_PPP.info()

In [None]:
# Convert year column to year listed before hyphen
df_PPP['Year'] = df_PPP['Year'].str.split('-').str[0]

# Convert columns to numbers
df_PPP['Year'] = pd.to_numeric(df_PPP['Year'], errors='coerce')
df_PPP['PPP'] = pd.to_numeric(df_PPP['PPP'], errors='coerce')

# Drop NaN values
df_PPP = df_PPP.dropna()

# Add 1 to each year, since NBA seasons are marked by the second, not first number
df_PPP['Year'] = df_PPP['Year'] + 1

# Offensive rating is given in points per 100 possessions,
# so divide by 100 to convert to points per possession
df_PPP['PPP'] = df_PPP['PPP']/100

# View dataframe
df_PPP
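
The year and rating conversions above can be followed on a small example. This is a sketch on made-up rows: the column names `Season` and `ORtg` and the values are hypothetical, chosen only to show each step, and the last row imitates a stray header line.

```python
import pandas as pd

# Hypothetical season labels and offensive ratings, for illustration only
seasons = pd.DataFrame({
    'Season': ['2017-18', '1999-00', 'Season'],
    'ORtg': ['108.6', '104.1', 'ORtg'],
})

# Take the part before the hyphen, convert it to a number, then add 1 so
# the season is labelled by the year in which it ends
start_year = pd.to_numeric(seasons['Season'].str.split('-').str[0], errors='coerce')
seasons['Year'] = start_year + 1

# Offensive rating is points per 100 possessions, so divide by 100
seasons['PPP'] = pd.to_numeric(seasons['ORtg'], errors='coerce') / 100

# Drop the header-like row, which became NaN during conversion
seasons = seasons.dropna(subset=['Year', 'PPP'])

print(seasons[['Year', 'PPP']])
```

Labelling seasons by their ending year keeps `Year` consistent with the player dataframe, which is what makes a later merge on `Year` possible.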

## New NBA Stats

### Expected Minutes

This first group of statistics computes the number of minutes players are on the court before attempting and making 3's.

#### AM3A : Average Minutes per 3-point Attempt

A player's Average Minutes per 3-Point Attempt is total minutes played divided by total 3-pointers attempted.

In [None]:
# Define new column, AM3A: Average Minutes per 3-point Attempt 
tp['AM3A'] = tp['MP'] / tp['3PA']

# Show last five entrants
tp.tail()

#### EM3A : Expected Minutes before 3-point Attempt

This statistic assumes that, from the moment a player checks in, his next 3-point attempt is equally likely to come at any point within an interval as long as his average minutes per attempt. Under that assumption, the expected value is the halfway mark. Will Nick Young (listed above) take a 3 once he checks in, or after 4.27 minutes? His expected value is halfway between, at 2.135 minutes. This is his expected minutes played before attempting a 3.

In [None]:
# Define new column, EM3A: Expected Minutes before 3-point Attempt
tp['EM3A'] = tp['AM3A'] / 2

# Sort dataframe by new category
tp_EM3A = tp.sort_values('EM3A', ascending=True)

# View players who attempt 3s faster than anyone in NBA history
tp_EM3A.head(20)
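
The halving step can be checked with a quick simulation. This is a sketch under one stated assumption: the wait before the next attempt is uniformly distributed between 0 and the player's average minutes per attempt. Other waiting-time models would give other values, so EM3A is best read as a convention rather than a derived law. The figure 4.27 is the Nick Young example from the text.

```python
import random

# Average minutes per attempt from the example in the text
am3a = 4.27

# Assumption: the wait is uniformly distributed on [0, am3a]
random.seed(0)
draws = [random.uniform(0, am3a) for _ in range(100_000)]
simulated_mean = sum(draws) / len(draws)

print('Midpoint:', am3a / 2)
print('Simulated mean wait:', round(simulated_mean, 3))
```

The simulated mean lands close to the midpoint, which is the value the `EM3A` column stores.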

Statistical Notes<ul>
    <li> Many players on the list come off the bench. EM3A does not distinguish between starters and reserves.</li>
    <li> Most top performers are from the last few years, due to the meteoric rise of NBA 3-pointers. </li>
     <li> Joe Hassett from 1982 is a shocker! </li>
    <li> EM3A is a valuable 3-point statistic. It's informative to know that Eric Gordon will likely take a 3 within 2 minutes of checking in.</li>
    </ul>

#### AM3 : Average Minutes per 3-Pointer

AM3 is like AM3A except it computes the average minutes played per each 3-pointer made.

In [None]:
# Define new column, AM3: Average Minutes per 3-pointer made
tp['AM3'] = tp['MP']/tp['3P']

# Show last five rows
tp.tail()

#### EM3 : Expected Minutes Before a 3

This is my favorite statistic of the group. It's how long a player is expected to be on the court before making a 3. As before, EM3 is Average Minutes per 3-pointer divided by two.

In [None]:
# Define new column, EM3: Expected Minutes before 3-pointer
tp['EM3'] = tp['AM3'] / 2

# Sort dataframe by new category
tp_EM3 = tp.sort_values('EM3', ascending=True)

# Display top twenty seasons of all-time
tp_EM3.head(20)

EM3 Statistical Notes:<ul>
    <li> EM3 measures how quickly shooters make 3-pointers upon taking the court.</li> 
    <li> More restrictive minimum requirements could eliminate reserves. I personally prefer leaving them in. </li>
     <li> Steph Curry's legendary 2016 MVP season is a clear #1. </li>
    </ul>

EM3A and EM3 are more intriguing, and telling, than AM3A and AM3. The latter can be eliminated since they are just doubles of the former.

In [None]:
# Delete extraneous columns
del tp['AM3A'] 
del tp['AM3']

### 3Pave

The 3-point statistics above are compelling, but they do not provide a single metric to rank all 3-point shooters. This is where 3Pave, or 3-point Average, comes in. 3Pave adds what the team gains beyond the expected value, and subtracts what the team loses beyond the expected value.

#### Points Per Possession

3Pave depends on the expected value. Should the expected value be points per possession? Or points per field goal attempt? There is no definitive answer. I have chosen points per possession since each time a team has the ball, this is what they are expected to earn. I have computed the mean points per possession throughout NBA history. This statistic was first computed in 1974.

In [None]:
# Compute ev, expected value in points per possession
ev = df_PPP['PPP'].mean()

# Display ev
print('Avg. Points Per Possession:', ev)

This is very close to what current teams average, 1.08.

#### 3Pave Formula

When a player makes a 3-pointer, the team gains an extra 3 points minus the expected value. When a player misses a 3-pointer, the team loses the expected value.

In [None]:
# Formula for 3Pave, 3-point Average

# Compute 3PG, 3-pointers per Game
tp['3PG']=tp['3P']/tp['G']

# Compute 3PAG, 3-point Attempts per Game
tp['3PAG']= tp['3PA']/tp['G']

# Compute 3-point misses per game
tp_misses =tp['3PAG']-tp['3PG']

# Declare expected value (mean points per possession, as computed above)
ev = df_PPP['PPP'].mean()

# Compute 3Pave, 3-point Average
tp['3Pave']=tp['3PG'] * (3 - ev) - tp_misses * ev

# (3 - ev) is what the team gains per 3-pointer made
# -ev is what the team loses per 3-pointer missed
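
The formula can also be written as a small standalone function and checked by hand. This is a sketch, not part of the notebook's pipeline: the function names, the shooter's numbers and the expected value of 1.05 are all hypothetical. Expanding the formula gives 3 × makes − ev × attempts, so 3Pave is positive exactly when a player's 3-point percentage exceeds ev / 3.

```python
def three_pave(made_per_game, attempts_per_game, ev):
    """Points gained per game from 3-point shooting, relative to ev."""
    misses_per_game = attempts_per_game - made_per_game
    return made_per_game * (3 - ev) - misses_per_game * ev


def break_even_pct(ev):
    """3-point percentage at which 3Pave is exactly zero."""
    return ev / 3


# Hypothetical shooter and expected value, for illustration only
ev_example = 1.05
value = three_pave(made_per_game=5, attempts_per_game=11, ev=ev_example)

print(round(value, 2))                       # 3.45
print(round(break_even_pct(ev_example), 2))  # 0.35
```

With these made-up numbers, 5 makes are worth 5 × 1.95 = 9.75 and 6 misses cost 6 × 1.05 = 6.30, for a net gain of 3.45 points per game.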

## 3Pave Rankings

#### The Top 25

In [None]:
# Sort dataframe by 3Pave
tp=tp.sort_values('3Pave', ascending=False)

# Reset index
tp = tp.reset_index(drop=True)

# Start index at 1 instead of 0
tp.index = tp.index + 1

# Display top 25 3-point shooting seasons of all-time
tp.head(25)

3Pave Statistical Notes:<ul>
    <li> Steph Curry's legendary MVP season is head and shoulders above the rest, and he dominates the list as a player.</li> 
    <li> 3Pave does a great job of comparing 3-point shooters over the years. </li>
     <li> Different expected values will produce different results. </li>
    <li> 3Pave has real meaning. It conveys the actual points a team gains by the player shooting 3-pointers. </li>
    </ul>

#### Weighted

It's telling to use the same measure, mean points per possession, across all years. But is it justifiable? Teams score more points per possession these days, so it could be argued that 3-pointers were more valuable in years past. The expected value can be weighted, by taking the mean points per possession for each given year. 

In [None]:
# Merge df_PPP, dataframe with 'Year' and 'PPP', with tp, the current dataframe
tp = tp.merge(df_PPP)

# Declare weighted expected value
evw = tp['PPP']

# Compute 3-point misses per game
tp_misses =tp['3PAG']-tp['3PG']
                          
# Compute 3Pave using weighted expected value
tp['3Pave/w']=tp['3PG'] * (3 - evw) - tp_misses * evw
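
A small example shows how the merge attaches each season's points per possession to every player row. This is a sketch on made-up data: the players and PPP values are hypothetical. It passes `on='Year'` and `how='left'` explicitly, which is a choice made for this illustration. The cell above relies on the `merge` defaults, an inner join on the shared column.

```python
import pandas as pd

# Hypothetical player seasons and league PPP values, for illustration only
players = pd.DataFrame({
    'Year': [2000.0, 2018.0, 2018.0],
    'Player': ['A', 'B', 'C'],
    '3PG': [2.0, 3.0, 1.0],
    '3PAG': [5.0, 8.0, 4.0],
})
league = pd.DataFrame({'Year': [2000.0, 2018.0], 'PPP': [1.04, 1.09]})

# A left merge on 'Year' keeps every player row and attaches that season's PPP
merged = players.merge(league, on='Year', how='left')

# Same formula as 3Pave, but with a per-row expected value
misses = merged['3PAG'] - merged['3PG']
merged['3Pave/w'] = merged['3PG'] * (3 - merged['PPP']) - misses * merged['PPP']

print(merged)
```

A left join would leave NaN in `PPP` for any season missing from the league table. The default inner join drops those rows instead.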

#### The Top 25, Weighted

In [None]:
# Keep dataframe tight by eliminating unnecessary columns
tp.drop(['MP', 'PPP'], axis=1, inplace=True)

# Sort dataframe by 3Pave/w
tp=tp.sort_values('3Pave/w', ascending=False)

# Reset index
tp = tp.reset_index(drop=True)

# Start index at 1 instead of 0
tp.index = tp.index + 1

# Display top 25 3-point shooting weighted seasons
tp.head(25)

The values are very close. Some players from earlier eras, like Ray Allen, move up the list, but others, like Glen Rice, actually move down. It depends on how many points per possession the league averaged that year. Players from higher scoring eras, like 2018, drop down. Consider Klay Thompson's 2018 drop from 9 to 17. Was his 3-point season not as valuable because the whole league was better at shooting 3s? 

In [None]:
# Return to index with 3Pave, unweighted, as default order

# Sort dataframe by 3Pave
tp=tp.sort_values('3Pave', ascending=False)

# Reset index
tp = tp.reset_index(drop=True)

# Start index at 1 instead of 0
tp.index = tp.index + 1

#### 2018 League Leaders

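We can check the league leaders for any given year. Note that within a single year, the weighted and unweighted columns usually produce a very similar order, since each subtracts a constant expected value and the two constants differ only slightly. The order is not guaranteed to be identical, because the two expected values penalize attempts by different amounts.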

In [None]:
# Create 2018 dataframe
tp_2018 = tp[tp['Year']==2018.0]

# Reset index
tp_2018 = tp_2018.reset_index(drop=True)

# Start index at 1 instead of 0
tp_2018.index = tp_2018.index + 1

# Show top 10 3Pave
tp_2018.head(10)

The Golden State Warriors dominate the list. What about the Houston Rockets? They have the reputation of being a great 3-point shooting team.

#### 2018 Warriors v Rockets

In [None]:
# Create 2018 dataframe for GSW and HOU
tp_2018_GSW_HOU = tp_2018[(tp_2018['Tm']=='GSW') | (tp_2018['Tm']=='HOU')]

# Display dataframe
tp_2018_GSW_HOU

Golden State is at the top and bottom, while Houston dominates the middle. It's interesting to note that Eric Gordon is a + or - depending on whether the column is weighted. Summing 3Pave will give us the winner.

In [None]:
tp_2018_GSW_HOU.groupby('Tm')['3Pave'].sum()

Golden State is the clear winner, almost doubling Houston in points gained by shooting 3's. How does the team rank historically?

#### Best 3-Point Shooting Teams of All-Time

In [None]:
# Eliminate TOT, Total for traded players, from list of NBA Teams
tp_teams = tp[tp['Tm'] != 'TOT']

# Group teams and year by 3Pave, sort in order
tp_teams = tp_teams.groupby(['Tm','Year'])['3Pave'].sum().sort_values(ascending=False)

# Convert top 25 to DataFrame
pd.DataFrame(tp_teams.head(25))
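
The team aggregation can be traced on a handful of rows. This is a sketch on made-up values, not the real rankings. Note that because `tp` was filtered by the minimum requirements earlier, a team's total in the notebook covers only its qualifying players.

```python
import pandas as pd

# Hypothetical rows; 'TOT' marks the combined line for a traded player
rows = pd.DataFrame({
    'Tm': ['GSW', 'GSW', 'HOU', 'TOT', 'HOU'],
    'Year': [2018.0, 2018.0, 2018.0, 2018.0, 2017.0],
    '3Pave': [1.5, 0.5, 0.4, 0.3, 0.6],
})

# Exclude the combined rows, then total 3Pave for each team season
team_totals = (
    rows[rows['Tm'] != 'TOT']
    .groupby(['Tm', 'Year'])['3Pave']
    .sum()
    .sort_values(ascending=False)
)

# reset_index turns the (Tm, Year) MultiIndex back into ordinary columns
print(team_totals.reset_index())
```

Grouping by both `Tm` and `Year` treats each team season as its own entry, which is why the same franchise can appear several times in the top 25.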

It's no surprise that the Warriors take the top 3 spots, with the 73-win team way above the rest. The 7-seconds-or-less Suns are close behind. The Charlotte Hornets from '97 are a surprise #7 until one recalls that they had Dell Curry and Glen Rice. Four of the last five NBA champions made the top 10. The legendary '96 Bulls are 16th.

#### Best 3-Point Shooting Teams of All-Time, Weighted

In [None]:
# Eliminate TOT, Total for traded players, from list of NBA Teams
tp_teams = tp[tp['Tm'] != 'TOT']

# Group teams and year by weighted 3Pave, and sort in order
tp_teams = tp_teams.groupby(['Tm','Year'])['3Pave/w'].sum().sort_values(ascending=False)

# Convert top 25 to DataFrame
pd.DataFrame(tp_teams.head(25))

The weighted list includes more Spurs teams, the Reggie Miller Pacers team that made the finals, and a Ray Allen Bucks team. The Warriors comfortably hold the first two spots.

#### Best of the 90s

In [None]:
# Create 90s dataframe
tp_90s = tp[(tp['Year']<2000) & (tp['Year']>1989)]

# Display top 25 3-point shooting seasons of the 90s
tp_90s.head(25)

The big 90s shooters: Glen Rice, Reggie Miller, Dell Curry, Dennis Scott, Mitch Richmond, Dale Ellis.

#### Best of the 80s

In [None]:
# Create 80s dataframe
tp_80s = tp[tp['Year']<1990]
# The first 3-point shot was recorded in 1979, officially the 1980 season.

# Display top 25
tp_80s.head(25)

Craig Hodges. Mark Price. Larry Bird. More Dale Ellis. The first 3-point shot was made in 1979, so we can't go back much further.

#### Career Totals

Which players gained the most points by shooting 3's over their entire careers?

In [None]:
# Create 3Pave/c using same formula as 3Pave, but use totals instead of per game
tp['3Pave/c'] = tp['3P'] * (3 - ev) - (tp['3PA'] - tp['3P']) * ev

# Group by player, and sum over their career
tp_player = tp.groupby('Player')['3Pave/c'].sum()

# Order from the top
tp_player = tp_player.sort_values(ascending=False)

# Convert to DataFrame for nicer viewing
tp_player = pd.DataFrame(tp_player)

# Display top 25
tp_player.head(25)

Kyle Korver beats out Ray Allen and Reggie Miller. (Stats are for the regular season only.) The only players on this list who are not retired or at the end of their careers are Steph Curry and Klay Thompson. It's amazing to think that Steph Curry is already number two. Let's compare this to traditional 3-pointers made.

In [None]:
# Convert to DataFrame: Group by player, sum over 3 pointers, sort values, display top 25
pd.DataFrame(tp.groupby('Player')['3P'].sum().sort_values(ascending=False).head(25))

Most fans would agree that Kobe Bryant, LeBron James and Nick Van Exel are not as good at 3-pointers as Glen Rice, Dell Curry and Steve Kerr. Finally, we have a statistic to support it.

#### Career Averages

In [None]:
# Count number of seasons
tp['seasons'] = tp.groupby('Player')['3Pave'].transform('count')

# Require at least 4 seasons
tp_seasons = tp[tp['seasons']>=4]

# Compute average, divide sum of player's 3Pave by the number of seasons
tp_av = tp_seasons.groupby('Player')['3Pave'].sum()/tp_seasons.groupby('Player')['3Pave'].count()

# Sort and display top 25 as DataFrame
pd.DataFrame(tp_av.sort_values(ascending=False).head(25))
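
The season count and career average can be followed on a toy example. This is a sketch on made-up values. It uses `.mean()` in place of dividing the sum by the count, which gives the same result when there are no missing values.

```python
import pandas as pd

# Hypothetical seasons, for illustration only
seasons = pd.DataFrame({
    'Player': ['A', 'A', 'A', 'A', 'B', 'B'],
    '3Pave': [1.0, 2.0, 3.0, 2.0, 4.0, 5.0],
})

# transform('count') returns one value per row, aligned with the original
# index, so it can be stored directly as a new column
seasons['seasons'] = seasons.groupby('Player')['3Pave'].transform('count')

# Keep players with at least 4 seasons, then average per player
qualified = seasons[seasons['seasons'] >= 4]
career_avg = qualified.groupby('Player')['3Pave'].mean()

print(career_avg)
```

Player B has the higher per-season values but only two seasons, so the minimum of four removes him from the career ranking.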

Steph Curry doubles everyone on the list except for teammate Klay Thompson, and Kyle Korver. I think we can safely answer the questions posed at the beginning of this notebook. The Warriors are the best 3-point shooting team of all-time because they have the two best 3-point shooters of all-time. The statistics verify what every fan knows to be true: Steph Curry is the greatest of all-time.

## Conclusion

Three new NBA statistics have been presented. EM3A, Expected Minutes before a 3-point Attempt, could be of value to coaches preparing for opponents and working with their own players. EM3, Expected Minutes before a 3, is a fun statistic that could be used for similar reasons. 3Pave, 3-point Average, is a powerful statistic that provides a single number to rank 3-point shooters across all seasons.

3Pave rewards players for making 3-point shots, and penalizes them for missing. Players who make a lot of 3s but shoot a low percentage are exposed as making slight contributions to their teams. Players who shoot a high percentage need to make a high volume to be competitive. 3Pave rankings are statistically verifiable while simultaneously communicating valuable information.

3Pave reveals the gain in points beyond the league average that a player adds to his team by shooting 3's. It can be weighted, summed, or displayed as per game averages. It can be used as a barometer to determine whether a player should be encouraged or discouraged from shooting 3's. Any positive score is a plus for the team, while negative scores are a detriment.

3Pave stands up well to general expectations. With the analysis above, one can argue that Steph Curry, Klay Thompson, and Kyle Korver are the best 3-point shooters of all-time, with Ray Allen close behind. The Golden State Warriors are verifiably the greatest 3-point shooting team of all time, followed by the 7-seconds-or-less Phoenix Suns.

3Pave can be used to analyze playoff statistics and clutch 3-point shooters. It can be used during basketball seasons past and future to analyze the success of 3-point shooters. It can be used for any league, WNBA, college, high school, etc., provided that an appropriate expected value, like points per possession, is utilized.