# Filling out the data
For this third post, I'll be adding information on player performance so that I can fully examine the correlation between draft position and player quality. In this post I have one objective:
Get more player quality data from basketball-reference. Specifically, I'm going to get data on All-Star appearances, All-NBA selections, etc. This is mainly for more practice with scraping, as well as with merging data sets.

In [1]:
# Standard imports 
import pandas as pd
from pandas import Series,DataFrame,read_html
import numpy as np

from bs4 import BeautifulSoup
import html5lib

# Player accomplishment data

In [2]:
# Dealing with All-NBA team selections
url = "http://www.basketball-reference.com/awards/all_league.html"
dframe_list = pd.io.html.read_html(url)
df = dframe_list[0]
df.head()

Unnamed: 0,Season,Lg,Tm,Unnamed: 4,.1,.2,.3,.4
0,2015-16,NBA,1st,DeAndre Jordan C,LeBron James F,Kawhi Leonard F,Stephen Curry G,Russell Westbrook G
1,2015-16,NBA,2nd,DeMarcus Cousins C,Draymond Green F,Kevin Durant F,Chris Paul G,Damian Lillard G
2,2015-16,NBA,3rd,Andre Drummond C,LaMarcus Aldridge F,Paul George F,Kyle Lowry G,Klay Thompson G
3,,,,,,,,
4,2014-15,NBA,1st,Marc Gasol C,LeBron James F,Anthony Davis F,James Harden G,Stephen Curry G


This code chunk seems pretty great for scraping a website and returning a DataFrame for me to then manipulate. Now I need to properly label the columns, drop the useless rows, relabel some rows and pivot the table. I am not going to keep whether a player was selected to the first, second, or third All-NBA team because I don't think the distinction is very relevant for this analysis. Why not? Each team is restricted to the five starting positions, so although Russell Westbrook may be a more valuable player than, say, Marc Gasol, he may not be as valuable as Curry or Harden, and he gets bumped to the second team despite being more valuable or 'better' than one of the first-team players. What matters for this analysis is not how good a player is relative to others at his position, but how good he is relative to the league as a whole. As such, I think being considered among the top ten or fifteen players league-wide is a more relevant statistic than his rank at his position.

In [5]:
# Dropping columns
df.drop(df.columns[[1,2]], inplace=True, axis=1)


In [6]:
# Inserting proper column names
column_names = ['year', 'name1', 'name2', 'name3', 'name4', 'name5']
df.columns = column_names

# Dropping NaN filled rows
df = df[df.name1.notnull()]
df.head()

Unnamed: 0,year,name1,name2,name3,name4,name5
0,2015-16,DeAndre Jordan C,LeBron James F,Kawhi Leonard F,Stephen Curry G,Russell Westbrook G
1,2015-16,DeMarcus Cousins C,Draymond Green F,Kevin Durant F,Chris Paul G,Damian Lillard G
2,2015-16,Andre Drummond C,LaMarcus Aldridge F,Paul George F,Kyle Lowry G,Klay Thompson G
4,2014-15,Marc Gasol C,LeBron James F,Anthony Davis F,James Harden G,Stephen Curry G
5,2014-15,Pau Gasol C,DeMarcus Cousins C,LaMarcus Aldridge F,Chris Paul G,Russell Westbrook G


## A challenge arises
I now want to essentially pivot this DataFrame so that the first column is the name of every person ever selected to an All-NBA team and the second column is the number of times they have been selected. The other part of this challenge is that the columns here do not simply hold a player's name, but also include his position. I think this means I will likely need to use a regular expression to grab only the player's name so that I can eventually merge this DataFrame with others that just have a player's name.

I played around for about 10 minutes with the regular expression idea before deciding that it was unlikely to be the most efficient way. I instead googled for a way to split apart pieces of a column and found [this][1]. Using code from this I can now break up the elements of a cell in the DataFrame.
[1]: http://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows
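
As a side note, here is a minimal sketch of two ways to peel the position off the end of each entry, assuming every entry ends in a single-letter position (C, F or G) as in the rows shown above. The toy Series is for illustration only, and the third name is made up to show what happens with a name longer than two words.

```python
import pandas as pd

# Toy data in the same "Name Position" shape as the scraped table.
# "Made Up Player" is a hypothetical three-word name.
names = pd.Series(["DeAndre Jordan C", "LeBron James F", "Made Up Player F"])

# Split once from the right so only the trailing position is peeled off
parts = names.str.rsplit(n=1, expand=True)
parts.columns = ["Player", "Pos"]

# The regular expression route: drop a final C, F or G and the space before it
stripped = names.str.replace(r"\s+[CFG]$", "", regex=True)

print(parts)
print(stripped)
```

Splitting from the right keeps the whole name together however many words it has, whereas keeping only the first two pieces of a plain split would cut a three-word name short.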

In [7]:
# First, I'm going to deal with the years column so that I can
# drop all observations from before 1976.
# This makes a separate DataFrame with the years split apart
year = pd.DataFrame(df.year.str.split('-').tolist())
year.head()

Unnamed: 0,0,1
0,2015,16
1,2015,16
2,2015,16
3,2014,15
4,2014,15


In [8]:
# Now I'm simply dropping the useless second column and renaming
year.drop(year.columns[[1]], inplace=True, axis=1)
column_name = ['year']
year.columns = column_name
year.head()

Unnamed: 0,year
0,2015
1,2015
2,2015
3,2014
4,2014


In [9]:
# Now I'm going to drop the column of years from the original 
# DataFrame
df.drop(df.columns[[0]], inplace=True, axis=1)

# And then create a new DataFrame that is just my original 
# DataFrame, but now with just names and the new DataFrame of just
# the years
df_allNBA = [year, df]
df_allNBA = pd.concat(df_allNBA, axis=1)
df_allNBA.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


Unnamed: 0,year,name1,name2,name3,name4,name5
0,2015,DeAndre Jordan C,LeBron James F,Kawhi Leonard F,Stephen Curry G,Russell Westbrook G
1,2015,DeMarcus Cousins C,Draymond Green F,Kevin Durant F,Chris Paul G,Damian Lillard G
2,2015,Andre Drummond C,LaMarcus Aldridge F,Paul George F,Kyle Lowry G,Klay Thompson G
3,2014,,,,,
4,2014,Marc Gasol C,LeBron James F,Anthony Davis F,James Harden G,Stephen Curry G
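
The empty row at index 3 above is worth a second look. `pd.concat(..., axis=1)` lines rows up by index label, and the `year` DataFrame was rebuilt from a list, so it has a fresh 0, 1, 2, ... index, while `df` kept its original labels (with 3 missing after the blank row was dropped). The split also leaves the years as strings rather than numbers. Below is a minimal sketch, on made-up rows, of one way to keep the two in step: derive the year from the column in place so the index never changes, and cast it to an integer so a numeric comparison behaves as expected. The names `toy`, `rebuilt`, `misaligned`, `aligned` and `recent` are all hypothetical.

```python
import pandas as pd

# Toy frame with a gap in the index, like df after the blank rows were dropped
toy = pd.DataFrame(
    {
        "year": ["2015-16", "2015-16", "2014-15"],
        "name1": ["Player A C", "Player B C", "Player C C"],
    },
    index=[0, 1, 4],
)

# Rebuilding from a list resets the index to 0, 1, 2, so concat misaligns
rebuilt = pd.DataFrame(toy["year"].str.split("-").tolist())
misaligned = pd.concat([rebuilt, toy[["name1"]]], axis=1)
print(misaligned)  # one row has no name, another has no year

# Working on the column in place keeps the original index labels,
# and astype(int) turns the year strings into numbers
aligned = toy.assign(year=toy["year"].str.split("-").str[0].astype(int))
recent = aligned[aligned["year"] >= 2015]
print(recent)
```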


In [10]:
# Dropping players from before the 1976 season
df_allNBA = df_allNBA[df_allNBA.year >= 1976]

In [11]:
# Now that I've dropped players from before 1976, I can drop the 
# years column. This will also help me later when I stack the 
# DataFrame in order to break the player's name from his position
df_allNBA.drop(df_allNBA.columns[[0]], inplace=True, axis=1)

# This will reshape the dataframe into a single column
stacked = pd.DataFrame(df_allNBA.stack())
stacked.columns = ['Player']
stacked.head()

Unnamed: 0,Unnamed: 1,Player
0,name1,DeAndre Jordan C
0,name2,LeBron James F
0,name3,Kawhi Leonard F
0,name4,Stephen Curry G
0,name5,Russell Westbrook G


In [12]:
# Back to using that piece of code to break apart elements of a 
# cell. This creates a dataframe now with two columns that I will
# need to join back together
stacked = pd.DataFrame(stacked.Player.str.split().tolist()).iloc[:, 0:2]
stacked.head()

Unnamed: 0,0,1
0,DeAndre,Jordan
1,LeBron,James
2,Kawhi,Leonard
3,Stephen,Curry
4,Russell,Westbrook


In [13]:
# Joining those columns back together
stacked[0] = stacked[0] + " " + stacked[1]
stacked.drop(stacked.columns[[1]], inplace=True, axis=1)
stacked.columns = ['Player']
stacked.head()

Unnamed: 0,Player
0,DeAndre Jordan
1,LeBron James
2,Kawhi Leonard
3,Stephen Curry
4,Russell Westbrook


In [14]:
# Now I can run through just a single column to count up how many
# times a player's name appears
df_allNBA = pd.DataFrame(stacked['Player'].value_counts())
df_allNBA.head()

Unnamed: 0,Player
Tim Duncan,15
Kobe Bryant,15
Shaquille O'Neal,14
Kareem Abdul-Jabbar,14
Karl Malone,14


In [15]:
# As you can see, after value_counts the player's name actually 
# appears as the index. So this next piece of code resets the index
# and moves the player's name out into a column
df_allNBA.reset_index(level=0, inplace=True)

# Renaming the columns
column_names = ['Player', 'All_NBA']
df_allNBA.columns = column_names
df_allNBA.head()

Unnamed: 0,Player,All_NBA
0,Tim Duncan,15
1,Kobe Bryant,15
2,Shaquille O'Neal,14
3,Kareem Abdul-Jabbar,14
4,Karl Malone,14
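
For comparison, the stack, split and count steps above can be sketched as one chained expression. This is a minimal sketch on made-up names, assuming each entry ends in a single position token; `wide` and `counts` are hypothetical names, and this is not a drop-in replacement for the cells above.

```python
import pandas as pd

# Toy selections in the same wide layout: one row per team, one column per slot
wide = pd.DataFrame(
    {
        "name1": ["Player A C", "Player C C"],
        "name2": ["Player B F", "Player A F"],
    }
)

counts = (
    wide.stack()                 # wide table -> one long Series of entries
    .str.rsplit(n=1).str[0]      # drop the trailing position
    .value_counts()              # number of selections per player
    .rename_axis("Player")       # name the index so it becomes a column
    .reset_index(name="All_NBA")
)
print(counts)
```

Naming the index before `reset_index` avoids the separate renaming step and gives the same two columns, `Player` and `All_NBA`.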


## Why so much work and a loop that didn't work
The code above probably seems like a lot, and I feel pretty confident that a more experienced Python user could have broken up those columns and counted players' names much more efficiently. But for me, this is actually a shortened version. Right where I noted that a 'challenge' arose, my first instinct was to wrap the code that splits up the elements of a cell in a loop (shown below). My thought was to create a DataFrame for each column of names, break the names apart to get rid of the position, put them back together, and then merge all of them plus the years DataFrame. From there I would drop observations from before 1976, drop the years, stack the columns, count each observation, etc. I ran into problems because the loop didn't work. If I ran the code on each column by hand it worked, but inside a loop there didn't appear to be a way to call the column as a variable. This is because the loop variable just holds a string, for example 'name1', and not the actual column. So when the loop reaches the point where it needs the actual column, as in `df.column.str`, it can't run because there's nothing to run on. There might be a way to get a loop to call the column properly, but I tried a lot of different things and nothing worked. 
This issue also led to my first posting on [StackOverflow][2]. Unfortunately, (1) I started my question off wrong by offering too little information, and (2) no one has yet offered a working solution more elegant than what I have above.
[2]: http://stackoverflow.com/questions/38666111/why-does-a-column-from-pandas-dataframe-not-work-in-this-loop/38666726?noredirect=1#comment64715921_38666726

In [21]:
column_names = [name1, name2, name3, name4, name5]
for column in column_names:
    column = pd.DataFrame(df.column.str.split().tolist()).ix[:,0:1]
    column[0] = column[0] + " " + column[1]
    column.drop(column.columns[[1]], inplace=True, axis=1)
    column.columns = column

ValueError: Length mismatch: Expected axis has 1 elements, new values have 177 elements
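
For what it's worth, here is a minimal sketch of a loop of this kind that does run, on made-up data. Two things differ from the attempt above. First, bracket indexing, `df_toy[column]`, looks up the column by the string the loop variable holds, whereas `df_toy.column` looks for a column literally named `column`. Second, the result goes into a separate container instead of being assigned back to `column`; reusing the loop variable appears to be what the last line of the attempt trips over, since it ends up assigning a whole DataFrame as the column labels. The names `df_toy`, `cleaned` and `df_clean` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the All-NBA frame
df_toy = pd.DataFrame(
    {
        "name1": ["Player A C", "Player C C"],
        "name2": ["Player B F", "Player A F"],
    }
)

cleaned = {}
for column in ["name1", "name2"]:
    # df_toy[column] uses the string stored in the loop variable;
    # df_toy.column would look for a column literally called "column"
    cleaned[column] = df_toy[column].str.rsplit(n=1).str[0]

# Reassemble the cleaned columns, keeping the original index
df_clean = pd.DataFrame(cleaned)
print(df_clean)
```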

## All-Star Game selections now
Now I can turn to creating a DataFrame of All-Star Game appearances. This was definitely easier.

In [16]:
# Dealing with All-Star Game selections
url = "http://www.basketball-reference.com/awards/all_star_by_player.html"
dframe_star = pd.io.html.read_html(url)
df_allStar = dframe_star[0]
df_allStar.head()

Unnamed: 0,0,1,2,3,4
0,Rk,Player,Tot,NBA,ABA


This was a strange result, and my immediate instinct is that the pandas operation cannot read the table for some reason. Googling around suggests this is the [case][3], though the solution is a bit more complicated.
[3]: https://github.com/pydata/pandas/issues/6393

In [17]:
import requests

This is a pretty standard import when scraping websites, or at least I see it quite a bit when I've looked at others' code. I haven't been using it because the way that I have previously been grabbing data was working and didn't require working as much with BeautifulSoup. My understanding is that BeautifulSoup makes a webpage very searchable as it takes the HTML and recognizes the many tags in the file. This allows it to identify various pieces such as tables. 

In [18]:
url = "http://www.basketball-reference.com/awards/all_star_by_player.html"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

In [19]:
soup.findAll("table")[0].findAll("tr")[1]

<tr>\n<td class="right ranker" csk="1">1</td>\n<td csk="Abdul-Jabbar,Kareem"><a href="/players/a/abdulka01.html">Kareem Abdul-Jabbar</a></td>\n<td class="center">19</td>\n<td class="center">19</td>\n<td class="center">0</td>\n</tr>

I think this just shows that findAll did find a table in the HTML. Luckily, marking up a table with a `<table>` tag is standard practice in HTML, so tables are easy to find.

In [20]:
df_star = pd.read_html(str(soup.find_all('table')[0]))
df_star

[       0                    1   2   3  4
 0      1  Kareem Abdul-Jabbar  19  19  0
 1      2          Kobe Bryant  18  18  0
 2      3        Julius Erving  16  11  5
 3      4           Tim Duncan  15  15  0
 4      5        Kevin Garnett  15  15  0
 5      6     Shaquille O'Neal  15  15  0
 6      7       Michael Jordan  14  14  0
 7      8          Karl Malone  14  14  0
 8      9           Jerry West  14  14  0
 9     10     Wilt Chamberlain  13  13  0
 10    11            Bob Cousy  13  13  0
 11    12        John Havlicek  13  13  0
 12    13         Moses Malone  13  12  1
 13    14        Dirk Nowitzki  13  13  0
 14    15           Rick Barry  12   8  4
 15    16           Larry Bird  12  12  0
 16    17        George Gervin  12   9  3
 17    18          Elvin Hayes  12  12  0
 18    19         LeBron James  12  12  0
 19    20        Magic Johnson  12  12  0
 20    21      Hakeem Olajuwon  12  12  0
 21    22      Oscar Robertson  12  12  0
 22    23         Bill Russell  12

Back to using pandas `read_html`, but now in a directed way: with `find_all` having grabbed the right part of the HTML and `str` having turned it into a string, pandas has an easy time parsing that string into a list of DataFrames.

In [21]:
df_star = df_star[0]

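The code above simply takes the first (and here only) element of that list. `read_html` always returns a list of DataFrames, one per table it finds, so indexing with `[0]` gives me the DataFrame itself.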

In [22]:
# Renaming the columns
column_names = ['PK', 'Player', 'Tot', 'NBA', 'ABA']
df_star.columns = column_names
df_star.head()

Unnamed: 0,PK,Player,Tot,NBA,ABA
0,1,Kareem Abdul-Jabbar,19,19,0
1,2,Kobe Bryant,18,18,0
2,3,Julius Erving,16,11,5
3,4,Tim Duncan,15,15,0
4,5,Kevin Garnett,15,15,0


In [23]:
df_star['All-Star'] = df_star['Tot']
df_star.drop(df_star.columns[[0,2,3,4]], inplace=True, axis=1)
df_star.head()
# This DataFrame was wonderfully easy to compile!

Unnamed: 0,Player,All-Star
0,Kareem Abdul-Jabbar,19
1,Kobe Bryant,18
2,Julius Erving,16
3,Tim Duncan,15
4,Kevin Garnett,15


In [24]:
# Need to now merge these player accomplishment dataframes into 
# one dataframe by player name and accomplishment and then merge
# that with my earlier data frame of draft history

draft_dframe = pd.read_csv('NBA_Data/1976_to_2015_Draft.csv')
draft_dframe.head()

Unnamed: 0.1,Unnamed: 0,Draft_Yr,Pk,Team,Player,College,Yrs,Games,Minutes Played,PTS,...,TP_Percentage,FT_Percentage,Minutes per Game,Points per Game,TRB per game,Assits per Game,Win Share,WS_per_game,BPM,VORP
0,0,1976,1.0,HOU,John Lucas,University of Maryland,14.0,928.0,25556.0,9951.0,...,0.303,0.776,27.5,10.7,2.3,7.0,53.7,0.101,-0.4,10.3
1,1,1976,2.0,CHI,Scott May,Indiana University,7.0,355.0,8029.0,3690.0,...,0.0,0.811,22.6,10.4,4.1,1.7,17.4,0.104,-1.0,2.0
2,2,1976,3.0,KCK,Richard Washington,"University of California, Los Angeles",6.0,351.0,7874.0,3456.0,...,0.25,0.711,22.4,9.8,6.3,1.2,10.8,0.066,-2.3,-0.6
3,3,1976,4.0,DET,Leon Douglas,University of Alabama,7.0,456.0,10111.0,3587.0,...,0.0,0.601,22.2,7.9,6.5,1.1,15.2,0.072,-1.6,1.1
4,4,1976,5.0,POR,Wally Walker,University of Virginia,8.0,565.0,10168.0,3968.0,...,0.2,0.643,18.0,7.0,3.1,1.5,12.9,0.061,-2.3,-0.8


In [25]:
df_accomplishment = pd.merge(df_allNBA, df_star, on = ['Player',])
df_accomplishment.head()

Unnamed: 0,Player,All_NBA,All-Star
0,Tim Duncan,15,15
1,Kobe Bryant,15,18
2,Shaquille O'Neal,14,15
3,Kareem Abdul-Jabbar,14,19
4,Karl Malone,14,14


In [26]:
df_draft = pd.merge(df_accomplishment, draft_dframe, on = 'Player',
                   how = 'outer')
df_draft.head()

Unnamed: 0.1,Player,All_NBA,All-Star,Unnamed: 0,Draft_Yr,Pk,Team,College,Yrs,Games,...,TP_Percentage,FT_Percentage,Minutes per Game,Points per Game,TRB per game,Assits per Game,Win Share,WS_per_game,BPM,VORP
0,Tim Duncan,15.0,15.0,2863.0,1997.0,1.0,SAS,Wake Forest University,19.0,1392.0,...,0.179,0.696,34.0,19.0,10.8,3.0,206.4,0.209,5.5,89.3
1,Kobe Bryant,15.0,18.0,2817.0,1996.0,13.0,CHH,0,20.0,1346.0,...,0.329,0.837,36.1,25.0,5.2,4.7,172.7,0.17,3.9,72.1
2,Shaquille O'Neal,14.0,15.0,2585.0,1992.0,1.0,ORL,Louisiana State University,19.0,1207.0,...,0.045,0.527,34.7,23.7,10.9,2.5,181.7,0.208,5.0,74.0
3,Kareem Abdul-Jabbar,14.0,19.0,,,,,,,,...,,,,,,,,,,
4,Karl Malone,14.0,14.0,1875.0,1985.0,13.0,UTA,Louisiana Tech University,19.0,1476.0,...,0.274,0.742,37.2,25.0,10.1,3.6,234.6,0.205,5.4,102.5


Merging data in pandas is pretty easy, easier than it seems in Stata. I did a few quality checks, such as making sure that all players transferred over and that players who went undrafted or were drafted before 1976 came out okay. When I do this in Stata, I feel like over half the time something has gone wrong because the thing I want to merge on, such as subject id, has some nuance that screws things up. Luckily, that's not the case here and everything is in order.
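
Quality checks like these can be sketched with the `indicator` option of `pd.merge`, which records where each row came from. This is a minimal sketch on made-up tables, not the exact checks I ran; `awards`, `draft`, `merged` and `dupes` are hypothetical names.

```python
import pandas as pd

# Toy versions of the accomplishment and draft tables
awards = pd.DataFrame({"Player": ["Player A", "Player B"], "All_NBA": [3, 1]})
draft = pd.DataFrame({"Player": ["Player B", "Player C"], "Pk": [1.0, 2.0]})

# indicator=True adds a _merge column: left_only, right_only or both
merged = pd.merge(awards, draft, on="Player", how="outer", indicator=True)
print(merged["_merge"].value_counts())

# Repeated names are a common way for a name-based merge to go wrong
dupes = draft[draft.duplicated("Player", keep=False)]
print(len(dupes))
```

One thing to keep in mind when merging on names: `pd.merge` defaults to an inner join when `how` is not given, so it keeps only the names present in both tables.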

In [27]:
# Dropping undrafted players and those drafted before 1976
df_draft = df_draft[df_draft.Pk.notnull()]

In [28]:
# Filling the NaN cells for drafted players with zeroes
df_draft = df_draft.fillna(0)
# Dropping the leftover 'Unnamed: 0' index column from the draft CSV
df_draft.drop(df_draft.columns[[3]], inplace=True, axis=1)
df_draft.head()

Unnamed: 0,Player,All_NBA,All-Star,Draft_Yr,Pk,Team,College,Yrs,Games,Minutes Played,...,TP_Percentage,FT_Percentage,Minutes per Game,Points per Game,TRB per game,Assits per Game,Win Share,WS_per_game,BPM,VORP
0,Tim Duncan,15.0,15.0,1997.0,1.0,SAS,Wake Forest University,19.0,1392.0,47368.0,...,0.179,0.696,34.0,19.0,10.8,3.0,206.4,0.209,5.5,89.3
1,Kobe Bryant,15.0,18.0,1996.0,13.0,CHH,0,20.0,1346.0,48637.0,...,0.329,0.837,36.1,25.0,5.2,4.7,172.7,0.17,3.9,72.1
2,Shaquille O'Neal,14.0,15.0,1992.0,1.0,ORL,Louisiana State University,19.0,1207.0,41918.0,...,0.045,0.527,34.7,23.7,10.9,2.5,181.7,0.208,5.0,74.0
4,Karl Malone,14.0,14.0,1985.0,13.0,UTA,Louisiana Tech University,19.0,1476.0,54852.0,...,0.274,0.742,37.2,25.0,10.1,3.6,234.6,0.205,5.4,102.5
6,Hakeem Olajuwon,12.0,12.0,1984.0,1.0,HOU,University of Houston,18.0,1238.0,44222.0,...,0.202,0.712,35.7,21.8,11.1,2.5,162.8,0.177,4.9,77.1


In [29]:
cd NBA_Data

/Users/rorypulvino/Dropbox (Personal)/Python/blog/content/NBA_Data


In [30]:
df_draft.to_csv('1976_to_2015_Draftees.csv')

## Player data loaded
All the data I wanted for player stats for draftees from 1976 to 2015 is now in one DataFrame which is pretty awesome. I could just run analysis on this now looking at the correlation between player quality and draft pick, but would like to merge the coaching and GM data first. In my next post, I'll be dealing with the GM and coaching data issues and merging it into this DataFrame.

Overall, completing this post took longer than expected. This is mostly because I had originally intended to work on this and the GM and coaching data all in a single post, but the GM data took much longer than expected, as did the All-NBA selections. Oh, and I moved to Nairobi during this time. I'd estimate this was about 10 hours of work, with 8 of those hours spent working through the issues in the All-NBA DataFrame.