# NBA BasketballReference Draft Scraper + Analysis & Visualizations

## Project Overview:

### Objective:

I want to collect rookie-season and career stats for the top picks of the past 20 NBA drafts. I will compare these stats across drafts and explore relationships in the scraped data using a range of visualization techniques.

### Tasks:

1. From www.basketball-reference.com/draft, scrape the rookie stats of the top 10 overall picks of the last 20 NBA drafts (2001 - 2020)
2. Scrape data into a single Pandas MultiIndex DataFrame with an outer index of draft year and an inner index of pick number
3. Run DataFrame manipulations and various visualization techniques to gain insight into questions regarding drafting teams, picks, and player stats.

*** In this analysis, a "drafting team" is the team a rookie plays for during his entire rookie season, which includes teams that traded for the pick. Teams that trade away or receive a rookie midway through the season are not counted.
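
The MultiIndex layout described in task 2 can be sketched on a toy DataFrame. The values below are hypothetical placeholders rather than scraped data; only the `year` and `pick` level names come from this notebook.

```python
import pandas as pd

# Hypothetical toy rows; the numbers are placeholders, not scraped stats.
index = pd.MultiIndex.from_tuples(
    [(2019, 1), (2019, 2), (2020, 1), (2020, 2)], names=["year", "pick"]
)
toy = pd.DataFrame({"pts_per_g": [10.0, 12.5, 8.0, 9.5]}, index=index)

# Select every pick from one draft year (outer level).
picks_2020 = toy.loc[2020]

# Select the same pick number across all draft years (inner level).
first_picks = toy.xs(1, level="pick")
```

Keeping `year` as the outer level makes per-draft slices cheap, while `xs` (or `reset_index`, as used later in this notebook) covers per-pick comparisons.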

## Imports

In [None]:
import numpy as np
from scipy import stats

import pandas as pd

import requests
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

%matplotlib inline

## Scraper Code and Processing

In [None]:
# Initialize dictionary and lists to be used in scraper

player_data = {'team_id':[],'pos':[],'player':[],'mp_per_g':[],'efg_pct':[],
               'pts_per_g':[],'trb_per_g':[],'ast_per_g':[],'ws':[],'bpm':[]}   # Columns for DataFrame

index_array = [[],[]]  # Used for MultiIndex

positions = ["Point Guard","Shooting Guard","Small Forward","Power Forward","Center"] # Used to scrape Positions

In [None]:
def get_doc(url):   # Return the page at `url` as a parsed BeautifulSoup document
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    return BeautifulSoup(response.text, 'html.parser')


def scrape_draft(doc,year): # Scrape players' data for single draft
    tr_tags = doc.find_all('tr')  # find_all is the current name for the older findAll alias
    for pick in range(2,12,1):
        get_advanced_stats(tr_tags, pick)  # Scrape certain stats from main draft page
        player_url = "https://www.basketball-reference.com" + tr_tags[pick].find('td',{'data-stat':'player'}).a['href']
        get_player_data(get_doc(player_url))  # Scrape stats from individual player pages
        index_array[0].append(year)
        index_array[1].append(pick - 1)
        
        
def get_advanced_stats(tags, pick):  # Scrape player's career advanced stats from the draft page
    for key in ['ws','bpm','player']:
        field = tags[pick].find('td',{'data-stat': key})
        if field is not None:
            player_data[key].append(field.text)
        else:
            player_data[key].append(np.nan)  # np.NaN was removed in NumPy 2.0

    
def get_player_data(doc):  # Scrape player's rookie season data from the individual player page
    rookie_stats = doc.find_all('tr',{'class':'full_table'})[0]  # First full season row is the rookie season
    for key in player_data:
        if key not in ['pos','ws','bpm','player']:
            field = rookie_stats.find('td',{'data-stat': key})
            if field is not None:
                player_data[key].append(field.text) # Append values to dictionary
            else:
                player_data[key].append(np.nan)   # Append a missing value if there is no data to scrape
    get_player_pos(doc)
    
        
def get_player_pos(doc):  # Scrape player positions
    paragraphs = doc.find_all('p')[1:5]  # Paragraphs that may contain the position titles
    test_string = "".join(p.text.strip() for p in paragraphs)
    pos = [pos for pos in positions if pos in test_string]  
    player_data['pos'].append(pos) # Appends list of positions on website to dictionary
    
    
def draft_scraper(start_year,end_year):   # Entire web scraper function
    for year in range(start_year, end_year + 1, 1):  # Iterates through all draft years
        draft_url = 'https://www.basketball-reference.com/draft/NBA_' + str(year) + '.html'
        scrape_draft(get_doc(draft_url),year)
 
    # Form MultiIndex
    index_tuples = list(zip(*index_array))   
    index = pd.MultiIndex.from_tuples(index_tuples, names=["year", "pick"])  # Create a Pandas MultiIndex

    return pd.DataFrame(player_data,index=index)  # Returns DataFrame version of player_data dictionary
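
The scraper functions above all rely on looking up table cells by their `data-stat` attribute. A minimal offline sketch of that lookup follows, assuming a hand-written HTML row: `SAMPLE_ROW`, the player name, the link, and the helper `read_cell` are hypothetical and only mimic the attribute layout the scraper expects. Unlike the scraper, this sketch returns `None` for a missing cell instead of NaN.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates one row of a draft table.
SAMPLE_ROW = """
<table><tr>
  <td data-stat="player"><a href="/players/x/example01.html">Example Player</a></td>
  <td data-stat="ws">12.3</td>
</tr></table>
"""

def read_cell(row, key):
    """Return the text of the cell tagged with data-stat=key, or None if absent."""
    cell = row.find("td", {"data-stat": key})
    return cell.text if cell is not None else None

row = BeautifulSoup(SAMPLE_ROW, "html.parser").find("tr")

name = read_cell(row, "player")   # text of the player cell
ws = read_cell(row, "ws")         # still a string at this stage
bpm = read_cell(row, "bpm")       # this cell is missing from the sample row

# The relative link used to build the individual player page URL.
href = row.find("td", {"data-stat": "player"}).a["href"]
```

Testing the lookup against a saved or hand-written snippet like this avoids repeated requests to the site while the parsing logic is being developed.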

### Scrape Data into DataFrame

In [None]:
draft_data = draft_scraper(2001,2020)

In [None]:
draft_data # Check the DataFrame

### Data Cleaning / Processing

In [None]:
#Convert 'numeric' objects into numeric types

player_data = {'team_id':[],'pos':[],'player':[],'mp_per_g':[],'efg_pct':[],
               'pts_per_g':[],'trb_per_g':[],'ast_per_g':[],'ws':[],'bpm':[]}  #For use in following processing

draft_data.replace('',np.nan,inplace=True)  # np.NaN was removed in NumPy 2.0
for key in player_data:
    if key not in ['team_id','pos','player','g']:
        draft_data[key] = draft_data[key].astype(float)
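
The cell above first replaces empty strings with NaN and then casts with `astype(float)`. One possible alternative, sketched here on hypothetical toy values, is `pd.to_numeric` with `errors="coerce"`, which turns any unparseable entry into NaN in a single step. The frame `raw` and the list `numeric_cols` are illustrative names, not part of the scraper.

```python
import pandas as pd

# Hypothetical scraped strings, including one empty cell.
raw = pd.DataFrame({"ws": ["12.3", "", "-0.4"],
                    "team_id": ["CHI", "TOT", "GSW"]})

numeric_cols = ["ws"]  # columns that should hold numbers
for col in numeric_cols:
    # Unparseable entries (such as '') become NaN instead of raising.
    raw[col] = pd.to_numeric(raw[col], errors="coerce")
```

The trade-off is that `errors="coerce"` also hides unexpected text values, so the explicit `replace` plus `astype(float)` approach is stricter about surprises in the scraped data.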

In [None]:
#New Positions: guard, guard/forward, forward, forward/center, center

def parse_pos(row):  #Function to parse scraped positions into 1 position
    if len(row) == 1:
        if row[0] == 'Point Guard' or row[0] == 'Shooting Guard':
            return 'Guard'
        elif row[0] == 'Small Forward' or row[0] == 'Power Forward':
            return 'Forward'
        elif row[0] == 'Center':
            return 'Center'
    elif row == ['Power Forward','Center']:
        return 'Forward/Center'
    elif row == ['Small Forward','Power Forward']:
        return 'Forward'
    elif row == ['Point Guard','Shooting Guard']:
        return 'Guard'
    elif row == ['Shooting Guard','Small Forward']:
        return 'Guard/Forward'
    elif len(row) == 3 and row[0] == 'Point Guard':
        return 'Guard'
    else:
        return 'Forward'

draft_data['pos'] = draft_data['pos'].apply(parse_pos)   # "Resets" each row to a single position instead of a list of positions
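
The branching in `parse_pos` can also be expressed as a lookup table, which makes the mapping easier to audit. This is only a sketch of an equivalent design: `POSITION_GROUPS` and `collapse_positions` are hypothetical names, and the sketch assumes, as in the scraper, that positions always appear in the fixed order of the `positions` list.

```python
# Lookup from the scraped position combination to a single label.
POSITION_GROUPS = {
    ("Point Guard",): "Guard",
    ("Shooting Guard",): "Guard",
    ("Small Forward",): "Forward",
    ("Power Forward",): "Forward",
    ("Center",): "Center",
    ("Power Forward", "Center"): "Forward/Center",
    ("Small Forward", "Power Forward"): "Forward",
    ("Point Guard", "Shooting Guard"): "Guard",
    ("Shooting Guard", "Small Forward"): "Guard/Forward",
}

def collapse_positions(row):
    """Collapse a list of scraped positions into one label."""
    label = POSITION_GROUPS.get(tuple(row))
    if label is not None:
        return label
    # Same fallbacks as parse_pos: three positions led by Point Guard
    # count as Guard, and anything else defaults to Forward.
    if len(row) == 3 and row[0] == "Point Guard":
        return "Guard"
    return "Forward"

example = collapse_positions(["Power Forward", "Center"])
```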

In [None]:
# Replace TOT with Multiple (For players with multiple teams their rookie season)

draft_data.replace('TOT','Multiple',inplace=True)

In [None]:
draft_data

## Analysis Questions (For Selected 2001 - 2020 NBA Drafts):

The selected questions are addressed in order below.

### Analysis 1: Which teams had the most top-10 picks play for them their entire rookie season? Top-5 picks?

In [None]:
most_top10_picks = draft_data.team_id.value_counts().sort_values(ascending=False).head(6)

most_top10_picks

In [None]:
reset = draft_data.reset_index()
most_top5_picks = reset[reset.pick < 6].team_id.value_counts().sort_values(ascending=False).head(6)

most_top5_picks

### Analysis 2: Of the 6 teams with the most top-10 picks playing for them their entire rookie season, which typically chose the rookies with the most career win shares?

In [None]:
top_pick_teams = most_top10_picks.index.tolist()
# Select 'ws' before aggregating so the median is not attempted on text columns
top_team_ws = draft_data.groupby('team_id')['ws'].median()[top_pick_teams].sort_values(ascending=True)

top_team_ws

Let's visualize this list of teams with their win share numbers in a barplot using Seaborn.

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x=top_team_ws.index,y=top_team_ws.values,palette='Blues')

plt.title('Median Career Win Shares of Players Drafted by Teams with Most Top-10 Picks',fontdict={'fontsize':15})
plt.ylabel('ws')

plt.tight_layout()

plt.savefig('barplot.png')

As this plot shows, the Golden State Warriors and the Chicago Bulls typically chose the rookies with the most career win shares over the past 20 drafts. This could partly be attributed to the teams' player development and training staff, and/or to their draft position, which may have let them select a clear star talent. However, these numbers do not always reflect well on these teams, since some players may have been traded early in their careers and accumulated their win shares elsewhere.

### Analysis 3: Have guards had increasing 3PT% in years closer to 2020 than 2001 due to the transition to a distance shooting era?

In [None]:
guards = draft_data[draft_data.pos == 'Guard'].reset_index()
guards

plot = sns.jointplot(x='year',y='efg_pct',data=guards,kind='scatter',color='red')
plot.fig.suptitle('Rookie EFG% for Guards drafted from 2001 - 2020 NBA Drafts', x=0.5, y=1, fontsize=15)
plt.tight_layout()


plt.savefig('jointplot.png')

This scatterplot shows no clear correlation between draft year and the guards' rookie effective field goal percentage (eFG%), the scraped stat used here as a proxy for 3-point shooting because it gives extra weight to made 3-pointers. One would expect a positive correlation between these two variables, since the league has recently transitioned to an era dominated by the 3-point shot. It would follow that incoming guards have focused more on their outside jumper and thus shoot better from distance than rookie guards from earlier drafts.

However, as displayed here, this is not the case. A clearer positive correlation might exist between draft year and the number of 3-point attempts per game. This would better reflect the transition to a 3-point era, as guards would be taking more 3-point shots but not necessarily making a higher percentage of them.
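
The plot above uses the scraped `efg_pct` column, so it helps to see how effective field goal percentage differs from plain 3-point percentage. The standard definition is eFG% = (FG + 0.5 × 3P) / FGA. The sketch below applies it to two hypothetical shooters; the function names and shot totals are made up for illustration.

```python
def efg_pct(fg, fg3, fga):
    """Effective field goal percentage: made threes count 1.5x."""
    return (fg + 0.5 * fg3) / fga

def fg3_pct(fg3, fg3a):
    """Plain 3-point percentage."""
    return fg3 / fg3a

# Hypothetical shooters: both make 2 of 5 threes,
# but shooter A makes more two-point shots than shooter B.
a_efg = efg_pct(fg=5, fg3=2, fga=10)
b_efg = efg_pct(fg=3, fg3=2, fga=10)
a_3p = fg3_pct(fg3=2, fg3a=5)
b_3p = fg3_pct(fg3=2, fg3a=5)
```

Two players with identical 3-point percentages can therefore have different eFG% values, which is one reason eFG% is only a rough proxy for the 3-point trend discussed above.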

### Analysis 4: Can we visualize clear groupings by position based on players' points, assists, and rebounding averages?

In [None]:
# Let us regroup the positions into 3 categories: Guards, Wings, Big-Men 
# First, we need to see the best way to group these categories:

draft_data['pos'].value_counts()

In [None]:
"""
Let's map:
    Guard --> Guards
    Guard/Forward , Forward --> Wings
    Forward/Center , Center --> Big-Men
"""

temp = draft_data.copy()  # Work on a copy so the new column is not added to draft_data itself

temp['new_pos'] = temp.pos.map({'Guard':'Guard','Guard/Forward':'Wing','Forward':'Wing',
                                'Forward/Center':'Big Man','Center':'Big Man'})

#Graph a 3D plot
import plotly.express as pex

plot = pex.scatter_3d(data_frame = temp,x='pts_per_g',y='ast_per_g',z='trb_per_g',color='new_pos')
plot.show()

To view this interactive plot, please download the raw plot file [here](https://github.com/asattiraju13/NBA-Draft-Scraper-Visualization/blob/main/plots/3dscatter.html) as an HTML file, which you can select to display in your browser.

This interactive plot illustrates some of the differences between the three main basketball position categories (guards, wings, and big men), as it is relatively easy to see three different groupings. The Big Man grouping is located higher on the trb_per_g axis than the others, because big men typically average more rebounds per game than other positions due to their height. The Guard grouping is located farther out on the ast_per_g axis than the others, because guards are more responsible for distributing the ball and so generate more assists than other positions. Notice that there is no clear grouping along the pts_per_g axis, as all three positions can score the ball depending on which position a certain team and offense emphasizes.

In [None]:
plot.write_html("3dscatter.html")

##### Let's explore this further ... using Linear Discriminant Analysis with PPG, RPG, and APG as our input and new_pos as our class label.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda

X = temp.reset_index().drop(['year','pick','team_id','pos','player','new_pos',
                             'efg_pct','mp_per_g','bpm','ws'],axis=1)   #LDA with PPG, RPG, APG as only input factors
y = temp.new_pos

lda_model = lda(n_components=2)
newX = lda_model.fit_transform(X.values,y)

In [None]:
lda_model.explained_variance_ratio_

From the explained variance ratio, the X axis of the plot below explains about 96% of the variance in the data (a ratio of 0.96); we could expect the RPG and APG features to contribute most to this axis, as these features separated the position groupings fairly clearly on the interactive plot above.

In [None]:
plt.title('LDA graph with Position Groupings',fontdict={'fontsize':15})
sns.scatterplot(x=newX[:,0],y=newX[:,1],hue=y)
plt.savefig('ldascatter.png')

This Linear Discriminant Analysis plot projects the data points, described by the PPG, RPG, and APG features, onto 2 axes chosen to maximize separation between the three position labels. As we can see, there are clear groupings of the Guard and Big Man categories and a less clear grouping of the Wing category, as it overlaps both of the others.

### Analysis 5: Is there a clear relationship between career box score plus minus or win shares with pick number?

In [None]:
# Average only the two numeric columns of interest; text columns cannot be averaged
by_pick = draft_data.reset_index().groupby('pick')[['ws','bpm']].mean().reset_index()

fig, axs = plt.subplots(1, 2, figsize=(15, 5))

ws_stats = stats.pearsonr(by_pick.pick,by_pick.ws)
sns.regplot(ax = axs[0], x='pick',y='ws',data=by_pick,color='red')
axs[0].set_title('Average Career Win Shares vs. Pick # (2001 - 2020 drafts)',fontsize=15)
axs[0].text(7,50,'R^2: ' + str((ws_stats[0]**2).round(2)) + ', p-val: ' + str(ws_stats[1].round(3)), fontsize=12)

bpm_stats = stats.pearsonr(by_pick.pick,by_pick.bpm)
sns.regplot(ax = axs[1], x='pick',y='bpm',data=by_pick, color = 'blue')
axs[1].set_title('Average Career Box Plus-Minus vs. Pick # (2001 - 2020 drafts)',fontsize=15)
axs[1].text(7,1.4,'R^2: ' + str((bpm_stats[0]**2).round(2)) + ', p-val: ' + str(bpm_stats[1].round(3)), fontsize=12)

plt.tight_layout()

plt.savefig('scatterplots.png')
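
The plot annotations above report R^2, the square of the Pearson correlation coefficient returned by `stats.pearsonr`. Squaring discards the sign, so the direction of the relationship has to be read from the slope. The sketch below shows this on hypothetical per-pick averages; the numbers are placeholders rather than scraped results, and `pearson_r` is an illustrative helper, not a library function.

```python
import math

# Hypothetical per-pick averages; placeholder values, not scraped results.
picks = [1, 2, 3, 4, 5]
avg_ws = [10.0, 8.0, 9.0, 5.0, 4.0]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(picks, avg_ws)   # negative: values fall as pick number rises
r_squared = r ** 2             # the quantity shown on the plots

# Recover r from R^2 by reattaching the sign of the relationship.
r_recovered = math.copysign(math.sqrt(r_squared), r)
```

Because the regression here is fit on only ten averaged points (one per pick), the resulting R^2 describes the trend in the averages, not the spread of individual players.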

The two graphs above show a negative correlation both for average career win shares vs. pick number and for average career Box Plus-Minus vs. pick number. With R^2 at roughly 0.55, the correlation coefficient for both graphs is roughly $-\sqrt{0.55} \approx -0.74$, which indicates a moderate negative correlation, as supported by the graphs. One would expect these relationships, as lower pick numbers are typically the better players and will have more success over their entire careers. Let's look more closely at the ranges of win shares and Box Plus-Minus vs. pick number below using boxplots in Seaborn.

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(20, 10))

sns.boxplot(ax = axs[0], x='pick',y='ws',data=draft_data.reset_index(),palette='Reds')
axs[0].set_ylim(-5,150)
axs[0].set_title('Career Win Shares vs. Pick # (2001 - 2020 drafts)',fontsize=15)

sns.boxplot(ax = axs[1], x='pick',y='bpm',data=draft_data.reset_index(),palette='Blues')
axs[1].set_ylim(-8,10)
axs[1].set_title('Career Box Plus-Minus vs. Pick # (2001 - 2020 drafts)',fontsize=15)

plt.savefig('boxplots.png')

These boxplots lend some support to the conclusion drawn from the scatterplots, as the median win shares and Box Plus-Minus of the top picks (smaller pick numbers) are mostly higher than those of the later picks (higher pick numbers). However, the relationship is only moderate, since there are some aberrations in the trend. This is illustrated by the median win shares and Box Plus-Minus of the number 2 pick in particular, which are much lower than expected. Such an aberration with a top pick could indicate a greater likelihood of number 2 picks not living up to their star potential and becoming "busts".

### Analysis 6: Are there differences in the rookie median minutes per game between positions among the top 6 drafting teams?

In [None]:
most_picks_teams = np.array(most_top10_picks.index)

fig, axs = plt.subplots(3, 2, figsize=(15, 10))
fig.suptitle('Median Rookie MPG Distribution among Positions Drafted by Top Drafting Teams', x=0.5, y=1, fontsize=15)

ax_list = [axs[0,0],axs[0,1],axs[1,0],axs[1,1],axs[2,0],axs[2,1]]

for i in range(6):
    
    # Select 'mp_per_g' before aggregating so the median is not attempted on text columns
    data = draft_data[draft_data.team_id == most_picks_teams[i]].groupby('pos')['mp_per_g'].median()
    sns.barplot(ax = ax_list[i], x = data.index, y = data.values, palette='RdBu_r')
    ax_list[i].set_title(most_picks_teams[i],fontsize=15)
   
plt.tight_layout()
plt.setp(axs[:,0],ylabel = 'MPG')

plt.savefig('barplot_grid.png')

This visualization is quite interesting, as it offers some insight into how different top-drafting teams value the players they drafted differently based on position. This could be due to a number of factors, such as the growth of another "veteran" player on the team, thus limiting a rookie's minutes, or the need to fill a void left by a star player's departure from the team, thus increasing a rookie's minutes. There are several reasons, each unique to the drafting team's circumstances. Of course, this could also depend upon the rookie's success throughout the season, which translates to the number of minutes played. 

From this barplot grid, we can see that Chicago, Cleveland, and Golden State have favored rookie guards and forwards in terms of minutes in the past 20 draft years. This makes sense, as Chicago drafted former MVP Derrick Rose, Cleveland drafted star Kyrie Irving, and Golden State drafted former MVP Stephen Curry. We also see that the minute distribution per position for Phoenix and Sacramento is fairly uniform, with rookie guards / forwards and centers having the most minutes in both cities. Minnesota seems to have favored centers more, drafting big men such as star Karl-Anthony Towns. We could also propose explanations for the lack of minutes at some positions, which makes this visualization quite versatile.