# What's this about?
For the next few blog posts I'll be dealing with NBA draft data. If you pay attention to the NBA at all, you'll be aware of what the draft is and its presumed importance. Basketball analysts such as Zach Lowe and Bill Simmons have argued that the best chance a bad team has of becoming a good team is to have a high draft pick, and several empirical studies show the value of draft position: [here][3], [here][2], and [here][1]. General managers have staked their careers on the premise that the way to a great team is through drafting a star player. Given the importance of the draft, I decided to take a look at the link between draft position and player quality (a pretty well-established correlation), as well as at whether certain GMs or coaches consistently beat the odds in the draft.
Since teams have equivalent information on the players they are going to draft, there should be a clear link between the quality of a player and that player's draft position. GMs and coaches who consistently get good players from lower draft positions are then either evaluating the available information differently or are better at developing players. Teasing these two apart may not be possible, but given that Sam Hinkie based his career on having high draft picks, I think it's worth looking at whether there are GMs or coaches who are able to grab players that exceed the average player's performance at the same draft position. This analysis has actually been done [before][4] (and [again][5]), so my analysis will extend it to more years and try to pull apart the influence of the coach as compared to the GM on drafting performance. Another advantage of this project having been done before is that I have something to compare my results to (granted, I'll need to find access to these projects) and to look to for ideas on how to actually analyze the data.
# What does this project teach me?
To complete this project, I'll need to scrape data from a number of websites, convert it into Pandas DataFrames, and merge them. Most of this project will be gathering and cleaning data, but eventually I will be graphing player performance against draft position and analyzing which GMs and coaches did better in the draft.

[1]: http://nyloncalculus.com/2016/03/07/evaluating-draft-prospects-a-first-pass/
[2]: http://www.82games.com/barzilai1.htm
[3]: https://statsbylopez.com/2016/06/22/the-making-and-comparison-of-draft-curves/Teams
[4]: http://basketball.realgm.com/wiretap/232776/study-finds-isiah-thomas-as-best-drafting-gm-since-1989
[5]: http://www.nbaminer.com/draft-success-of-nba-teams/

# And it begins...
This project will be carried out in parts, and I will try to share how I did things as well as the difficulties I encountered. To begin, I am trying to grab draft data by following this [blog post][6], but as you'll see, I quickly deviated because I ran into many errors and tried whatever I could to get around them. Part 1 of this project focuses on getting the draft data from a single season, 1976 (the NBA merger season); Part 2 will use the template here to build a loop that grabs all draft data from 1976 to 2014 and combines it into a single large DataFrame.

[6]: http://savvastjortjoglou.com/nba-draft-part01-scraping.html

In [1]:
import pandas as pd
from pandas import Series,DataFrame,read_html
import numpy as np

from bs4 import BeautifulSoup
import html5lib

Above are some standard imports. I tried to follow the blog, but despite having installed requests, I kept getting an error that the module was not available, so I switched tactics after finding the pandas 'read_html' function.
Below I will go through scraping data from a single season and then progress to building a loop to grab all the seasons that I want, before finally merging all the DataFrames together.

In [2]:
url = 'http://www.basketball-reference.com/draft/NBA_1976.html'

In [3]:
# read_html returns a list with one DataFrame per <table> found on the page
dframe_list = pd.read_html(url)

In [4]:
type(dframe_list)

list

'read_html' returns a list containing one DataFrame for each table it finds on the page, so the first element of the list is already a DataFrame holding the draft table. Getting it out is just a matter of indexing the list.

In [5]:
dframe = dframe_list[0]
dframe.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Round 1,Unnamed: 4,Totals,Shooting,Per Game,Advanced,Rk,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST.1,WS,WS/48,BPM,VORP
0,1,1,HOU,John Lucas,University of Maryland,14,928,25556,9951,2151,...,10.3,,,,,,,,,
1,2,2,CHI,Scott May,Indiana University,7,355,8029,3690,1450,...,2.0,,,,,,,,,
2,3,3,KCK,Richard Washington,"University of California, Los Angeles",6,351,7874,3456,2204,...,-0.6,,,,,,,,,
3,4,4,DET,Leon Douglas,University of Alabama,7,456,10111,3587,2954,...,1.1,,,,,,,,,
4,5,5,POR,Wally Walker,University of Virginia,8,565,10168,3968,1759,...,-0.8,,,,,,,,,


In [6]:
type(dframe)

pandas.core.frame.DataFrame

In [7]:
dframe.columns.values

array(['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2', u'Round 1', 'Unnamed: 4',
       u'Totals', u'Shooting', u'Per Game', u'Advanced', u'Rk', u'Pk',
       u'Tm', u'Player', u'College', u'Yrs', u'G', u'MP', u'PTS', u'TRB',
       u'AST', u'FG%', u'3P%', u'FT%', u'MP.1', u'PTS.1', u'TRB.1',
       u'AST.1', u'WS', u'WS/48', u'BPM', u'VORP'], dtype=object)

Above I've shown that dframe is a DataFrame and printed out its column headers. As you can see, these column headers are clearly wrong. The blog I was following deals with this by parsing a copy of the web page into a BeautifulSoup object, searching that object for the column headers, and assigning them to the DataFrame. As I'll briefly discuss below, that did not work for me: I kept getting an empty list when I tried to grab the column headers. I took an alternative route instead, one that may not be ideal for a data set with a vast number of column headers (so sooner or later I'll need to learn the first method).

First, in looking at the dataset, I noticed that the DataFrame had been read in with both the first and second rows of the web page's table as column headers (look at the web page to see what I mean). This resulted in misaligned values and a bunch of columns at the end filled with NaN values. My next steps are to drop those NaN-filled columns and to create a list of accurate column headers to replace the misaligned ones. Finally, because the web page has rows that act as breaks between rounds of the draft, I need to drop those useless rows.
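
Depending on the pandas version, a two-row header like this may instead come back as a two-level (MultiIndex) set of column labels rather than the flattened, misaligned labels shown above. A minimal sketch, assuming that case and using a made-up three-column frame rather than the real page, of keeping only the lower header row:

```python
import pandas as pd

# Toy stand-in for a table with two header rows (not scraped from the site).
columns = pd.MultiIndex.from_tuples([
    ("Round 1", "Pk"),
    ("Round 1", "Player"),
    ("Totals", "PTS"),
])
toy = pd.DataFrame(
    [[1, "John Lucas", 9951], [2, "Scott May", 3690]],
    columns=columns,
)

# Keep only the second header row as the column labels.
toy.columns = toy.columns.get_level_values(1)
flat_labels = list(toy.columns)
print(flat_labels)
```

If the labels come back flat, as they did for me, this step does not apply and the manual approach below is still needed.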

In [8]:
# dropping the NaN filled columns
# dframe.dropna(axis=1,how='all',inplace=True)

I first tried the code above to drop the NaN-filled columns. It worked, but I also ended up losing rows for players who never played in the NBA, whose entries were all NaN to begin with. Since I know there are 22 actual columns, I will instead drop every column after the 22nd.
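
As a sanity check on what 'dropna' does, here is a small sketch on made-up data (the column names and values are placeholders, not the draft table). With `axis=1` and `how='all'`, it removes only columns in which every value is NaN:

```python
import numpy as np
import pandas as pd

# Made-up frame: player B has no stats, and the last column is entirely NaN.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "PTS": [10.0, np.nan, 7.0],
    "Empty": [np.nan, np.nan, np.nan],
})

# axis=1 works on columns; how="all" requires every value in the column to be NaN.
trimmed = toy.dropna(axis=1, how="all")
print(trimmed)
```

On this toy frame all three rows survive, including the player with missing stats, so if rows go missing on real data it is worth checking which call in the cell is responsible.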

In [9]:
# Drop every column from position 22 onward (positions 22 through 30)
dframe.drop(dframe.columns[22:], axis=1, inplace=True)
# A slice works directly inside the brackets; writing it inside a list, e.g. [[22:30]], is a syntax error
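
On the slicing question: in Python a slice such as `22:30` is only valid directly inside indexing brackets, not inside a list literal, and the end of a slice is exclusive. A small sketch on a made-up five-column frame showing three equivalent ways to drop trailing columns by position:

```python
import pandas as pd

toy = pd.DataFrame([[1, 2, 3, 4, 5]], columns=["a", "b", "c", "d", "e"])

# A slice goes directly inside the indexing brackets; position 5 is excluded.
dropped = toy.drop(toy.columns[2:5], axis=1)

# An explicit list of positions can be built from range() instead of typed out.
dropped_range = toy.drop(toy.columns[list(range(2, 5))], axis=1)

# Keeping the leading columns by position gives the same result.
kept = toy.iloc[:, :2]

print(list(dropped.columns))
```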

In [10]:
# the list of accurate column headers, this is pretty inefficient to just create the list
column_names = ['Rk','Pk','Team','Player','College','Yrs','Games','Minutes Played','PTS','TRB','AST','FG_Percentage','TP_Percentage','FT_Percentage','Minutes per Game','Points per Game','TRB per game','Assits per Game','Win Share','WS_per_game','BPM','VORP']

Now I can replace the column headers with the list I created here. As you can tell, I'm having to do some of these things by typing in values by hand, which is not ideal, so I still have a ways to go in learning the syntax of Python and pandas.

In [11]:
# replacing the column headers with the properly named list created above
dframe.columns=column_names
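
One way to avoid typing the names by hand, sketched here as an idea rather than something I ran on the real data: in the misaligned header printed earlier, the last 22 of the 31 labels are the site's own short names for the 22 real columns. The list below is copied from that output; on the live DataFrame the labels would have to be captured before the extra columns are dropped, and this assumes other seasons have the same layout.

```python
# The 31 labels printed by dframe.columns.values earlier, copied by hand.
raw_labels = ['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2', 'Round 1', 'Unnamed: 4',
              'Totals', 'Shooting', 'Per Game', 'Advanced', 'Rk', 'Pk',
              'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB',
              'AST', 'FG%', '3P%', 'FT%', 'MP.1', 'PTS.1', 'TRB.1',
              'AST.1', 'WS', 'WS/48', 'BPM', 'VORP']

# The last 22 labels line up with the 22 real columns, in order.
short_names = raw_labels[-22:]
print(len(short_names), short_names[0], short_names[-1])
```

The short names are less readable than the hand-written ones, so a rename afterwards may still be worthwhile.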

In [12]:
dframe.head()

Unnamed: 0,Rk,Pk,Team,Player,College,Yrs,Games,Minutes Played,PTS,TRB,...,TP_Percentage,FT_Percentage,Minutes per Game,Points per Game,TRB per game,Assits per Game,Win Share,WS_per_game,BPM,VORP
0,1,1,HOU,John Lucas,University of Maryland,14,928,25556,9951,2151,...,0.303,0.776,27.5,10.7,2.3,7.0,53.7,0.101,-0.4,10.3
1,2,2,CHI,Scott May,Indiana University,7,355,8029,3690,1450,...,0.0,0.811,22.6,10.4,4.1,1.7,17.4,0.104,-1.0,2.0
2,3,3,KCK,Richard Washington,"University of California, Los Angeles",6,351,7874,3456,2204,...,0.25,0.711,22.4,9.8,6.3,1.2,10.8,0.066,-2.3,-0.6
3,4,4,DET,Leon Douglas,University of Alabama,7,456,10111,3587,2954,...,0.0,0.601,22.2,7.9,6.5,1.1,15.2,0.072,-1.6,1.1
4,5,5,POR,Wally Walker,University of Virginia,8,565,10168,3968,1759,...,0.2,0.643,18.0,7.0,3.1,1.5,12.9,0.061,-2.3,-0.8


## An Aside
In trying to follow the same steps as the blog to replace the column headers, I kept running into an error saying there was no item at index [1] every time I tried to use the headers I found with 'soup.findAll'. I ended up assigning the result to an object, 'rows', and on inspecting 'rows', as shown below, found it to be an empty list. I poked around quite a bit online, and lots of people seem to run into this issue, but for a wide variety of reasons. At this point I'm not sure whether it has something to do with the versions of Python and BeautifulSoup that I'm running or something else. For the sake of simplicity I decided to use the method demonstrated above and created a list of column_names.

In [49]:
rows = soup.findAll('tr', limit=2)
rows

[]
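
For reference, here is a minimal sketch of the header-scraping idea on a small inline HTML string (made up to mimic a two-row header, not the real page). It assumes BeautifulSoup 4 with Python's built-in `html.parser`; `find_all` is the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Made-up table with a grouping header row followed by the real header row.
html = """
<table>
  <thead>
    <tr><th colspan="2">Round 1</th><th>Totals</th></tr>
    <tr><th>Pk</th><th>Player</th><th>PTS</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>John Lucas</td><td>9951</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr", limit=2)                         # the two header rows
headers = [th.get_text() for th in rows[1].find_all("th")]  # labels from the second row
print(headers)

# Parsing an empty document gives an empty result, like the one I ran into.
empty = BeautifulSoup("", "html.parser").find_all("tr", limit=2)
print(len(empty))
```

An empty result only means the parsed document contained no matching tags, so a reasonable first check is whether the HTML string handed to BeautifulSoup actually contains the table.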

There are a few things I need to do now.

1. I need to drop the rows that served as breaks for different rounds in the original dataset
2. I'll need to re-align the index of the rows to be correct
3. I need to fill the NaN values in as zeroes
4. I need to fix the object type for each column

Since having the correct type for each column can speed up the other three things I need to do, I'll do that first.

In [13]:
dframe.dtypes

Rk                  object
Pk                  object
Team                object
Player              object
College             object
Yrs                 object
Games               object
Minutes Played      object
PTS                 object
TRB                 object
AST                 object
FG_Percentage       object
TP_Percentage       object
FT_Percentage       object
Minutes per Game    object
Points per Game     object
TRB per game        object
Assits per Game     object
Win Share           object
WS_per_game         object
BPM                 object
VORP                object
dtype: object

In [14]:
# Team, Player, and College are text, so leave them out of the numeric conversion
# Slicing builds a new list, so column_names itself is left unchanged
numeric_columns = column_names[:2] + column_names[5:]

for column in numeric_columns:
    dframe[column] = pd.to_numeric(dframe[column], errors='coerce')
dframe.dtypes

Rk                  float64
Pk                  float64
Team                 object
Player               object
College              object
Yrs                 float64
Games               float64
Minutes Played      float64
PTS                 float64
TRB                 float64
AST                 float64
FG_Percentage       float64
TP_Percentage       float64
FT_Percentage       float64
Minutes per Game    float64
Points per Game     float64
TRB per game        float64
Assits per Game     float64
Win Share           float64
WS_per_game         float64
BPM                 float64
VORP                float64
dtype: object
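
To see what `errors='coerce'` is doing, here is a small sketch on a made-up column (the stray 'Rk' value stands in for the text that appears in a round-break row; I'm assuming that is roughly what those rows contain). Anything that cannot be parsed as a number becomes NaN, which is also why the whole column ends up as float64 rather than an integer type.

```python
import pandas as pd

# Made-up column mixing numeric strings with a stray text label.
raw = pd.Series(["1", "2", "Rk", "3"])

# 'Rk' cannot be parsed as a number, so it is replaced by NaN.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.tolist())
print(converted.dtype)
```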

Now that I have these in numeric form, I should be able to drop the excess rows and then convert the remaining NaNs to zeroes.


In [15]:
# dropping the rows that served as breaks for different rounds of the draft
dframe = dframe[dframe.Rk.notnull()]
dframe.head()

Unnamed: 0,Rk,Pk,Team,Player,College,Yrs,Games,Minutes Played,PTS,TRB,...,TP_Percentage,FT_Percentage,Minutes per Game,Points per Game,TRB per game,Assits per Game,Win Share,WS_per_game,BPM,VORP
0,1,1,HOU,John Lucas,University of Maryland,14,928,25556,9951,2151,...,0.303,0.776,27.5,10.7,2.3,7.0,53.7,0.101,-0.4,10.3
1,2,2,CHI,Scott May,Indiana University,7,355,8029,3690,1450,...,0.0,0.811,22.6,10.4,4.1,1.7,17.4,0.104,-1.0,2.0
2,3,3,KCK,Richard Washington,"University of California, Los Angeles",6,351,7874,3456,2204,...,0.25,0.711,22.4,9.8,6.3,1.2,10.8,0.066,-2.3,-0.6
3,4,4,DET,Leon Douglas,University of Alabama,7,456,10111,3587,2954,...,0.0,0.601,22.2,7.9,6.5,1.1,15.2,0.072,-1.6,1.1
4,5,5,POR,Wally Walker,University of Virginia,8,565,10168,3968,1759,...,0.2,0.643,18.0,7.0,3.1,1.5,12.9,0.061,-2.3,-0.8


In [16]:
# changing the remaining NaN's to zeroes
dframe = dframe.fillna(0)
# resetting the row index to 0..n-1 without hard-coding the number of rows
dframe = dframe.reset_index(drop=True)
dframe.head()

Unnamed: 0,Rk,Pk,Team,Player,College,Yrs,Games,Minutes Played,PTS,TRB,...,TP_Percentage,FT_Percentage,Minutes per Game,Points per Game,TRB per game,Assits per Game,Win Share,WS_per_game,BPM,VORP
0,1,1,HOU,John Lucas,University of Maryland,14,928,25556,9951,2151,...,0.303,0.776,27.5,10.7,2.3,7.0,53.7,0.101,-0.4,10.3
1,2,2,CHI,Scott May,Indiana University,7,355,8029,3690,1450,...,0.0,0.811,22.6,10.4,4.1,1.7,17.4,0.104,-1.0,2.0
2,3,3,KCK,Richard Washington,"University of California, Los Angeles",6,351,7874,3456,2204,...,0.25,0.711,22.4,9.8,6.3,1.2,10.8,0.066,-2.3,-0.6
3,4,4,DET,Leon Douglas,University of Alabama,7,456,10111,3587,2954,...,0.0,0.601,22.2,7.9,6.5,1.1,15.2,0.072,-1.6,1.1
4,5,5,POR,Wally Walker,University of Virginia,8,565,10168,3968,1759,...,0.2,0.643,18.0,7.0,3.1,1.5,12.9,0.061,-2.3,-0.8
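
The three row-level steps can also be chained. A minimal sketch on a made-up frame (three rows, one of them a fake round-break row), using `reset_index(drop=True)` so the number of rows never has to be typed in:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Rk": [1.0, np.nan, 2.0],        # NaN marks the made-up round-break row
    "Player": ["A", "Round 2", "B"],
    "PTS": [10.0, np.nan, np.nan],   # player B never played, so PTS is missing
})

cleaned = (
    toy[toy["Rk"].notnull()]   # drop the break row
    .fillna(0)                 # remaining NaN values become 0
    .reset_index(drop=True)    # renumber the rows 0..n-1
)
print(cleaned)
```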


In [17]:
# checking whether there are still any missing values
dframe.isnull().sum().sum()

0

## Moving forward
Now that I have the dataframe with proper column headers and all rows are properly filled in, I need to add a column for the draft year and can drop the 'Rk' column.

In [18]:
# Inserting a column for draft year 
dframe.insert(0, 'Draft_Yr', 1976)

In [19]:
# Deleting the 'Rk' column
dframe.drop('Rk', axis=1, inplace=True)
dframe.head()

Unnamed: 0,Draft_Yr,Pk,Team,Player,College,Yrs,Games,Minutes Played,PTS,TRB,...,TP_Percentage,FT_Percentage,Minutes per Game,Points per Game,TRB per game,Assits per Game,Win Share,WS_per_game,BPM,VORP
0,1976,1,HOU,John Lucas,University of Maryland,14,928,25556,9951,2151,...,0.303,0.776,27.5,10.7,2.3,7.0,53.7,0.101,-0.4,10.3
1,1976,2,CHI,Scott May,Indiana University,7,355,8029,3690,1450,...,0.0,0.811,22.6,10.4,4.1,1.7,17.4,0.104,-1.0,2.0
2,1976,3,KCK,Richard Washington,"University of California, Los Angeles",6,351,7874,3456,2204,...,0.25,0.711,22.4,9.8,6.3,1.2,10.8,0.066,-2.3,-0.6
3,1976,4,DET,Leon Douglas,University of Alabama,7,456,10111,3587,2954,...,0.0,0.601,22.2,7.9,6.5,1.1,15.2,0.072,-1.6,1.1
4,1976,5,POR,Wally Walker,University of Virginia,8,565,10168,3968,1759,...,0.2,0.643,18.0,7.0,3.1,1.5,12.9,0.061,-2.3,-0.8


In [26]:
cd NBA_Data

/Users/rorypulvino/Dropbox (Personal)/Python/blog/content/NBA_Data


In [27]:
dframe.to_csv('1976_Draft.csv')
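
One thing to be aware of when saving: by default 'to_csv' also writes the row index as an unnamed first column, which comes back as 'Unnamed: 0' when the file is read again. A small sketch, using an in-memory buffer and a made-up two-row frame instead of the real file, of writing without the index:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Draft_Yr": [1976, 1976],
    "Pk": [1, 2],
    "Player": ["John Lucas", "Scott May"],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # index=False leaves out the row-number column
buffer.seek(0)

round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```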

# And I finally have it...
Creating this first DataFrame for one year of the draft took quite a bit more time than I would have thought (about a week, working for a few hours in the evening when there was time). Even with a blog post to follow, I ran into quite a few errors that required searching around the internet (mainly Stack Exchange) to find solutions. Python's syntax is trickier than what I'm used to with Stata: where in Stata I would have solved some of these problems by building loops, Python has ready-made commands for them, while its loops are still a little more difficult for me to implement. Now on to the second part of the blog post: creating a loop to grab all this draft data and put it into a single DataFrame.
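
As a rough preview of where Part 2 is headed, here is a sketch of how the steps above might be wrapped in a function and applied once per year. Everything in it is an assumption: `clean_draft_table` is a hypothetical helper, and the raw tables are tiny made-up frames standing in for what 'read_html' would return for each season.

```python
import pandas as pd

def clean_draft_table(raw, year, column_names, text_columns):
    """Hypothetical helper that repeats the Part 1 steps on one raw table."""
    df = raw.iloc[:, :len(column_names)].copy()   # keep only the real columns
    df.columns = column_names
    for name in column_names:
        if name not in text_columns:
            df[name] = pd.to_numeric(df[name], errors="coerce")
    # Drop break rows, fill missing stats with 0, renumber the rows.
    df = df[df["Rk"].notnull()].fillna(0).reset_index(drop=True)
    df.insert(0, "Draft_Yr", year)
    return df.drop("Rk", axis=1)

# Made-up raw tables: four real columns plus one empty trailing column.
names = ["Rk", "Pk", "Player", "PTS"]
raw_by_year = {
    1976: pd.DataFrame([["1", "1", "A", "10", None],
                        ["Rk", "Pk", "Player", "PTS", None]]),
    1977: pd.DataFrame([["1", "1", "B", None, None]]),
}

frames = [clean_draft_table(raw, year, names, ["Player"])
          for year, raw in raw_by_year.items()]
all_drafts = pd.concat(frames, ignore_index=True)
print(all_drafts)
```

The real version would need the full 22-name list and a loop over the seasons from 1976 to 2014, and it may well need per-year adjustments if the page layout differs between seasons.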