# Part 2
Having scraped a DataFrame for a single draft year, I now want to do the same for every draft year from 1976 through 2015. I'll still follow this [blog's][1] basic outline, though I expect to deviate from it significantly. The blog also explains in much more detail what Python is doing at each step. After that, I'll merge the DataFrames and move on to scraping the players' career accomplishments (All-Star appearances, All-NBA teams, etc.) and the GMs and coaches of each team to add to the DataFrame. Finally, I'll be able to analyze how draft position relates to player quality and see which GMs and coaches beat expectations.
[1]: http://savvastjortjoglou.com/nba-draft-part01-scraping.html

In [1]:
# Standard imports 
import pandas as pd
from pandas import Series,DataFrame,read_html
import numpy as np

from bs4 import BeautifulSoup
import html5lib

## Building the loop
The next part of this project requires a hopefully simple loop to go through the NBA drafts, apply the steps from Part 1, and combine the completed DataFrames into a single large DataFrame.

In [2]:
# start by creating a url template to use in the loop
url_template = "http://www.basketball-reference.com/draft/NBA_{year}.html"

In [3]:
# create an empty DataFrame to append each draft year DataFrame to
draft_dframe = DataFrame()

In [4]:
for year in range(1976, 2016): # Will build the DataFrame for each year of interest
    url = url_template.format(year = year) # Grabbing the correct url
    
    dframe_list_year = pd.read_html(url) # returns a list of every table found on the page
    dframe_year = dframe_list_year[0]
    
    # Dropping the NaN filled columns
    dframe_year.drop(dframe_year.columns[[0,22,23,24,25,26,27,28,29,30]],inplace=True,axis=1)
    
    # Renaming the columns
    column_names = ['Pk','Team','Player','College','Yrs','Games','Minutes Played','PTS','TRB','AST','FG_Percentage','TP_Percentage','FT_Percentage','Minutes per Game','Points per Game','TRB per game','Assits per Game','Win Share','WS_per_game','BPM','VORP']
    dframe_year.columns = column_names
    
    # Add in a column for the draft year
    dframe_year.insert(0, 'Draft_Yr', year)
    
    # Append to the big DataFrame (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    draft_dframe = pd.concat([draft_dframe, dframe_year], ignore_index=True)


In [6]:
draft_dframe.head()

| | Draft_Yr | Pk | Team | Player | College | Yrs | Games | Minutes Played | PTS | TRB | ... | TP_Percentage | FT_Percentage | Minutes per Game | Points per Game | TRB per game | Assits per Game | Win Share | WS_per_game | BPM | VORP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1976 | 1 | HOU | John Lucas | University of Maryland | 14 | 928 | 25556 | 9951 | 2151 | ... | 0.303 | 0.776 | 27.5 | 10.7 | 2.3 | 7.0 | 53.7 | 0.101 | -0.4 | 10.3 |
| 1 | 1976 | 2 | CHI | Scott May | Indiana University | 7 | 355 | 8029 | 3690 | 1450 | ... | 0.0 | 0.811 | 22.6 | 10.4 | 4.1 | 1.7 | 17.4 | 0.104 | -1.0 | 2.0 |
| 2 | 1976 | 3 | KCK | Richard Washington | University of California, Los Angeles | 6 | 351 | 7874 | 3456 | 2204 | ... | 0.25 | 0.711 | 22.4 | 9.8 | 6.3 | 1.2 | 10.8 | 0.066 | -2.3 | -0.6 |
| 3 | 1976 | 4 | DET | Leon Douglas | University of Alabama | 7 | 456 | 10111 | 3587 | 2954 | ... | 0.0 | 0.601 | 22.2 | 7.9 | 6.5 | 1.1 | 15.2 | 0.072 | -1.6 | 1.1 |
| 4 | 1976 | 5 | POR | Wally Walker | University of Virginia | 8 | 565 | 10168 | 3968 | 1759 | ... | 0.2 | 0.643 | 18.0 | 7.0 | 3.1 | 1.5 | 12.9 | 0.061 | -2.3 | -0.8 |


In the blog post I am following, the author pulls all the data and appends it together before cleaning it. I'm not sure why they do it this way, but I imagine it's more efficient: the cleaning steps run once on the combined DataFrame rather than once for each draft year.
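To make the "combine first, clean once" idea concrete, here is a minimal sketch. It uses two tiny made-up tables in place of the scraped draft pages, so the values and the `frames` list are illustrative assumptions and not the blog author's actual code. Each year is collected in a plain Python list, combined with a single `pd.concat` call, and cleaned in one pass.

```python
import pandas as pd

# Two tiny stand-in tables (made-up values) in place of scraped draft years.
# The "Pk" string mimics a repeated header row inside the scraped table.
frames = []
for year, picks in [(1976, ["1", "2", "Pk"]), (1977, ["1", "Pk", "2"])]:
    dframe_year = pd.DataFrame({"Pk": picks})
    dframe_year.insert(0, "Draft_Yr", year)  # tag each row with its draft year
    frames.append(dframe_year)               # plain list append, no DataFrame copying

# Combine once, then clean once
combined = pd.concat(frames, ignore_index=True)
combined["Pk"] = pd.to_numeric(combined["Pk"], errors="coerce")
combined = combined[combined["Pk"].notnull()].reset_index(drop=True)
print(combined)
```

Collecting the yearly DataFrames in a list also avoids rebuilding the big DataFrame on every pass through the loop.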

In [7]:
# Converting the data types to the proper type
numeric_columns = list(column_names) # copy the list so column_names itself is not modified
del numeric_columns[1:4] # Dropping the string columns 'Team', 'Player', 'College' from this list

# Converting data types using the numeric_columns list
for column in numeric_columns:
    draft_dframe[column] = pd.to_numeric(draft_dframe[column], errors='coerce')
draft_dframe.dtypes

Draft_Yr              int64
Pk                  float64
Team                 object
Player               object
College              object
Yrs                 float64
Games               float64
Minutes Played      float64
PTS                 float64
TRB                 float64
AST                 float64
FG_Percentage       float64
TP_Percentage       float64
FT_Percentage       float64
Minutes per Game    float64
Points per Game     float64
TRB per game        float64
Assits per Game     float64
Win Share           float64
WS_per_game         float64
BPM                 float64
VORP                float64
dtype: object

## Stupid hiccup in converting data types
I changed tactics and dropped the 'Rk' column while building the DataFrame, which changed both the column list and the loop for converting the data types. I also kept making the same mistake when building the list of numeric columns: I deleted the wrong columns because I forgot that the draft year is not in the column names list, so the positions in that list are offset from the positions in the DataFrame.
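One way to sidestep this kind of off-by-one mistake is to pick the numeric columns by name instead of by position. The sketch below uses a shortened, hypothetical column list and a hypothetical `string_columns` set purely for illustration.

```python
# Hypothetical, shortened column list for illustration
column_names = ["Pk", "Team", "Player", "College", "Yrs", "Games"]
string_columns = {"Team", "Player", "College"}

# Filter by name; this builds a new list and leaves column_names untouched
numeric_columns = [c for c in column_names if c not in string_columns]
print(numeric_columns)  # ['Pk', 'Yrs', 'Games']
```

Because nothing depends on where a column sits, adding or removing a column such as the draft year does not shift the result.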

I used the .notnull() method in the previous post to remove the unnecessary rows, but didn't explain what was going on or why. Converting certain columns to numeric values turned the unnecessary rows, made up of strings like 'Rk' in the 'Rk' column, into NaN values. Applying .notnull() to that column then gives True where there is a number and False where there is a NaN. Indexing the DataFrame with that result keeps only the rows where .notnull() is True, and that becomes the new DataFrame. Since 'Rk' is dropped here, the cell below applies the same check to the 'Pk' column.
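A small sketch of that filtering step, using a made-up three-row table (the `toy` DataFrame and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up rows: the middle one mimics a repeated header row in the scraped table
toy = pd.DataFrame({"Pk": ["1", "Pk", "2"], "Player": ["A", "Player", "B"]})

toy["Pk"] = pd.to_numeric(toy["Pk"], errors="coerce")  # the string "Pk" becomes NaN
mask = toy["Pk"].notnull()   # True where a number survived, False where NaN
print(mask.tolist())         # [True, False, True]

cleaned = toy[mask]          # keep only the rows where the mask is True
print(cleaned["Player"].tolist())  # ['A', 'B']
```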

In [8]:
# Dropping the rows that served as breaks for different rounds of the draft
draft_dframe = draft_dframe[draft_dframe.Pk.notnull()]

In [10]:
draft_dframe.tail()

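| | Draft_Yr | Pk | Team | Player | College | Yrs | Games | Minutes Played | PTS | TRB | ... | TP_Percentage | FT_Percentage | Minutes per Game | Points per Game | TRB per game | Assits per Game | Win Share | WS_per_game | BPM | VORP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4232 | 2015 | 56.0 | NOP | Branden Dawson | Michigan State University | 1.0 | 6.0 | 29.0 | 5.0 | 4.0 | ... | NaN | 1.0 | 4.8 | 0.8 | 0.7 | 0.0 | 0.0 | 0.069 | -6.6 | 0.0 |
| 4233 | 2015 | 57.0 | DEN | Nikola Radicevic | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4234 | 2015 | 58.0 | PHI | J.P. Tokoto | University of North Carolina | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4235 | 2015 | 59.0 | ATL | Dimitrios Agravanis | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4236 | 2015 | 60.0 | PHI | Luka Mitrovic | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |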


In [11]:
# changing the remaining NaN's to zeroes
draft_dframe = draft_dframe.fillna(0)
# reindexing so the row index runs from 0 to 3984 (3985 rows) with no gaps
draft_dframe = draft_dframe.reset_index(drop=True)

In [12]:
# checking whether there are still any missing values
draft_dframe.isnull().sum().sum()

0

In [13]:
cd NBA_Data

/Users/rorypulvino/Dropbox (Personal)/Python/blog/content/NBA_Data


In [14]:
draft_dframe.to_csv('1976_to_2015_Draft.csv')

## Got the first DataFrame, now need to fill out the rest
This DataFrame contains all the draft picks from 1976 to 2015 from basketball-reference.com. The next steps are:
1. Build DataFrames of NBA coaches and GMs. I've been looking for websites with this information, and luckily basketball-reference has it. The data is not as conveniently organized as the draft data, but it can be organized.
2. Add data on the players' personal accomplishments, such as making the NBA All-Star Game or being named to an All-NBA team. This seems relevant because it is very rare for a team to win a championship without such players.
3. 'Fix' the DataFrame so that relocated teams such as the OKC Thunder appear as the OKC Thunder throughout, rather than as the Seattle SuperSonics. This will simply make analysis easier.
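
For the third step, one possible approach is a lookup from old team abbreviations to current ones. The sketch below is only an illustration: the `franchise_map` dictionary, the `Franchise` column name, and the two rows are assumptions, and a real mapping would need an entry for every relocated or renamed team.

```python
import pandas as pd

# Hypothetical mapping from a relocated franchise's old abbreviation to its current one
franchise_map = {"SEA": "OKC"}

# Made-up rows standing in for the draft DataFrame
toy = pd.DataFrame({"Draft_Yr": [2007, 2009], "Team": ["SEA", "OKC"]})

# Abbreviations not in the mapping are left unchanged
toy["Franchise"] = toy["Team"].replace(franchise_map)
print(toy["Franchise"].tolist())  # ['OKC', 'OKC']
```

Writing the result to a new column keeps the original `Team` value available in case the drafting city matters later.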