# Filling out the data
For this third post, I'll be gathering more information on player performance so that I can fully examine the correlation between draft position and player quality. In this post I have one objective:
Get more player quality data from basketball-reference. Specifically, I'm going to get data on All-Star appearances, All-NBA selections, etc. This is mainly for more practice with scraping, as well as with merging data sets.

In [None]:
# Standard imports 
import pandas as pd
from pandas import Series,DataFrame,read_html
import numpy as np

from bs4 import BeautifulSoup
import html5lib

# Player accomplishment data

In [None]:
# Dealing with All-NBA team selections
url = "http://www.basketball-reference.com/awards/all_league.html"
dframe_list = pd.read_html(url)
df = dframe_list[0]
df.head()

This code chunk works nicely: it scrapes the page and returns a DataFrame for me to manipulate. Now I need to properly label the columns, drop the useless rows, relabel some rows, and pivot the table. I am not going to keep track of whether a player was selected to the first, second, or third All-NBA team because I don't think the distinction is very relevant for this analysis. Why not? Each team is restricted to the five starting positions, so although Russell Westbrook may be a more valuable player than, say, Marc Gasol, he may not be as valuable as Curry or Harden, and he gets bumped to the second team despite being more valuable or 'better' than one of the first-team players. What matters for this analysis is not how good a player is relative to others at his position, but how good he is relative to the league as a whole. As such, I think being considered one of the top ten or fifteen players league-wide is a more relevant statistic than his rank at his position.

In [None]:
# Dropping columns
df.drop(df.columns[[1,2]], inplace=True, axis=1)


In [None]:
# Inserting proper column names
column_names = ['year', 'name1', 'name2', 'name3', 'name4', 'name5']
df.columns = column_names

# Dropping NaN filled rows
df = df[df.name1.notnull()]
df.head()

## A challenge arises
I now want to essentially pivot this DataFrame so that the first column is the name of every person ever selected to an All-NBA team and the second column is the number of times they have been selected. The other part of this challenge is that the columns here do not simply have a player's name, but also include his position. I think this means I will likely need to use a regular expression to grab only the player's name so that I can eventually merge this DataFrame with others that have just a player's name.

I played around for about 10 minutes with the regular expression idea before deciding that it was unlikely to be the most efficient way. I instead googled for a way to split apart pieces of a column and found [this][1]. Using code from this I can now break up the elements of a cell in the DataFrame.
[1]: http://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows
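For comparison, the regular-expression idea could look something like the sketch below. It assumes each cell is a name followed by a single position letter (`C`, `F`, or `G`); that cell format, the placeholder names, and the helper `strip_position` are assumptions for illustration rather than anything taken from the scraped table.

```python
import re

# Assumed cell format: "<player name> <position letter>", e.g. "Some Player G"
POSITION_SUFFIX = re.compile(r"\s+[CFG]$")

def strip_position(cell):
    """Drop a trailing one-letter position code, keeping the full name."""
    return POSITION_SUFFIX.sub("", cell.strip())

print(strip_position("Some Player G"))      # Some Player
print(strip_position("Three Part Name F"))  # Three Part Name
```

One difference from keeping only the first two words of each cell is that names with more than two words survive intact.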

In [None]:
# First, I'm going to deal with the years column so that I can
# drop all observations from before 1976.
# This makes a separate DataFrame with the years split apart,
# keeping df's index so the rows still line up when concatenated
year = pd.DataFrame(df.year.str.split('-').tolist(), index=df.index)
year.head()

In [None]:
# Now I'm simply dropping the useless second column and renaming
year.drop(year.columns[[1]], inplace=True, axis=1)
column_name = ['year']
year.columns = column_name
year.head()

In [None]:
# Now I'm going to drop the column of years from the original 
# DataFrame
df.drop(df.columns[[0]], inplace=True, axis=1)

# And then create a new DataFrame that is just my original 
# DataFrame, but now with just names and the new DataFrame of just
# the years
df_allNBA = [year, df]
df_allNBA = pd.concat(df_allNBA, axis=1)
df_allNBA.head()
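One thing worth checking at this step: `pd.concat` with `axis=1` lines rows up by index label, not by position. The toy frames below (made-up placeholder values, not the scraped data) sketch what happens when one frame has had rows filtered out and the other has a fresh index, and how resetting the index avoids it.

```python
import pandas as pd

# Toy stand-in for the names table, with one empty row to filter out
names = pd.DataFrame({"name1": ["Player A G", None, "Player B F"]})
names = names[names.name1.notnull()]      # index labels are now 0 and 2

# Toy stand-in for a years table built from a list (fresh labels 0 and 1)
years = pd.DataFrame({"year": ["2015", "2014"]})

# Labels differ, so rows are matched by label and gaps are filled with NaN
misaligned = pd.concat([years, names], axis=1)

# Resetting the index makes the labels match, so rows pair up by position
aligned = pd.concat([years, names.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```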

In [None]:
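# Dropping players from before the 1976 season. The split above
# leaves the years as strings, so convert them to numbers first
df_allNBA = df_allNBA[pd.to_numeric(df_allNBA.year, errors='coerce') >= 1976]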
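The year handling can also be sketched in a couple of lines with the `.str` accessor. This assumes season labels of the form `"2014-15"`, which is a guess based on the split on `'-'` above; the values below are placeholders.

```python
import pandas as pd

# Placeholder season labels in an assumed "YYYY-YY" format
seasons = pd.Series(["2014-15", "1975-76", "1949-50"])

# Take the part before the dash and convert it to an integer
start_year = seasons.str.split("-").str[0].astype(int)

# Boolean mask keeps only seasons starting in 1976 or later
recent = seasons[start_year >= 1976]
print(recent.tolist())
```

Converting to integers before comparing matters because comparing strings with `>= 1976` does not do a numeric comparison.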

In [None]:
# Now that I've dropped players from before 1976, I can drop the 
# years column. This will also help me later when I stack the 
# DataFrame in order to break the player's name from his position
df_allNBA.drop(df_allNBA.columns[[0]], inplace=True, axis=1)

# This will reshape the dataframe into a single column
stacked = pd.DataFrame(df_allNBA.stack())
stacked.columns = ['Player']
stacked.head()

In [None]:
# Back to using that piece of code to break apart elements of a 
# cell. This creates a dataframe now with two columns that I will
# need to join back together
stacked = pd.DataFrame(stacked.Player.str.split().tolist()).iloc[:, 0:2]
stacked.head()

In [None]:
# Joining those columns back together
stacked[0] = stacked[0] + " " + stacked[1]
stacked.drop(stacked.columns[[1]], inplace=True, axis=1)
stacked.columns = ['Player']
stacked.head()

In [None]:
# Now I can run through just a single column to count up how many
# times a player's name appears
df_allNBA = pd.DataFrame(stacked['Player'].value_counts())
df_allNBA.head()

In [None]:
# As you can see, after counting, the player's name actually
# appears as the index. So this next piece of code resets the index
# and bumps the player's name out of the index into a column
df_allNBA.reset_index(level=0, inplace=True)

# Renaming the columns
column_names = ['Player', 'All_NBA']
df_allNBA.columns = column_names
df_allNBA.head()
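For reference, the stack, clean, and count steps can be chained into one expression. This is only a sketch on a toy frame: the placeholder names and the assumption that every cell ends in a one-letter position (`C`, `F`, or `G`) are mine, not taken from the scraped table.

```python
import pandas as pd

# Toy stand-in for the names-only DataFrame
teams = pd.DataFrame({
    "name1": ["Player One G", "Player Two F"],
    "name2": ["Player Two F", "Player Three C"],
})

counts = (
    teams.stack()                                       # one long Series of cells
         .str.replace(r"\s+[CFG]$", "", regex=True)     # drop trailing position
         .value_counts()                                # selections per player
         .rename_axis("Player")                         # name the index
         .reset_index(name="All_NBA")                   # index -> column
)
print(counts)
```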

## Why so much work and a loop that didn't work
The code above probably seems like a lot, and I feel pretty confident that a more experienced Python user could have broken up those columns and counted players' names much more efficiently. But for me, this is actually a shortened version. Right where I noted that a 'challenge' arose, my first instinct was to wrap the cell-splitting code in a loop (shown below). My thought was to create a DataFrame for each column of names, break the cells apart to get rid of the position, put the names back together, and then merge all of them plus the years DataFrame. From there I would drop observations from before 1976, drop the years, stack the columns, count each observation, etc. I ran into problems because the loop didn't work. If I ran the code on each column by hand, it worked, but inside the loop there didn't appear to be a way to refer to the column through a variable. This is because the loop variable just holds a string, for example 'name1', and not the actual column. So when the loop reaches `df.column.str`, pandas looks for a column literally named `column` rather than using the string stored in the variable, and there's nothing to run on. There might be a way to get a loop to call the column properly, but I tried a lot of different things and nothing worked.
This issue also led to my first post on [StackOverflow][1]. Unfortunately, (1) I started my question off wrong by offering too little information, and (2) no one has yet offered a working solution more elegant than what I have above.
[1]: http://stackoverflow.com/questions/38666111/why-does-a-column-from-pandas-dataframe-not-work-in-this-loop/38666726?noredirect=1#comment64715921_38666726

In [None]:
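# The loop that did NOT work, kept here for reference
column_names = ['name1', 'name2', 'name3', 'name4', 'name5']
for column in column_names:
    # Fails here: df.column looks for a column literally called
    # 'column' instead of using the string held by the loop variable
    column = pd.DataFrame(df.column.str.split().tolist()).ix[:,0:1]
    column[0] = column[0] + " " + column[1]
    column.drop(column.columns[[1]], inplace=True, axis=1)
    column.columns = column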
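A loop like this can be made to work by looking the column up with square brackets, since `df[column]` uses the string stored in the variable while `df.column` asks for an attribute literally named `column`. The sketch below uses a toy frame with placeholder names and keeps the first two words of each cell, as in the cells above.

```python
import pandas as pd

# Toy stand-in for the names table
df_names = pd.DataFrame({
    "name1": ["Player One G", "Player Two F"],
    "name2": ["Player Three C", "Player Four G"],
})

cleaned = {}
for column in ["name1", "name2"]:
    parts = df_names[column].str.split()          # bracket lookup by variable
    cleaned[column] = parts.str[0] + " " + parts.str[1]

df_clean = pd.DataFrame(cleaned)
print(df_clean)
```

The loop variable is never reassigned here; results go into a separate dictionary so the column name stays available on each pass.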

## All-Star Game selections now
Now I can turn to creating a DataFrame of All-Star Game appearances. This was definitely easier.

In [None]:
# Dealing with All-Star Game selections
url = "http://www.basketball-reference.com/awards/all_star_by_player.html"
dframe_star = pd.read_html(url)
df_allStar = dframe_star[0]
df_allStar.head()

That was a strange error, and my immediate instinct is that pandas cannot read the table for some reason. Googling the error suggests this is the [case][1], though the solution is a bit more complicated.
[1]: https://github.com/pydata/pandas/issues/6393

In [None]:
import requests

This is a pretty standard import when scraping websites, or at least I see it quite a bit when I look at others' code. I haven't been using it because the way I had been grabbing data was working and didn't require as much work with BeautifulSoup. My understanding is that BeautifulSoup makes a webpage very searchable: it parses the HTML and recognizes the tags in the file, which allows it to identify pieces such as tables.

In [None]:
url = "http://www.basketball-reference.com/awards/all_star_by_player.html"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

In [None]:
soup.find_all("table")[0].find_all("tr")[1]

I think this just shows that `find_all` did find a table in the HTML. Luckily, marking up a table with a `<table>` tag is standard practice in HTML, so tables are easy to find.
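To make the tag search concrete, here is a small offline sketch on a hand-written HTML string. The table contents are placeholders, and it uses Python's built-in `html.parser` rather than `lxml` so that it needs nothing beyond BeautifulSoup itself.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a downloaded page
html = """
<html><body>
<table id="example">
  <tr><th>Player</th><th>Tot</th></tr>
  <tr><td>Player One</td><td>3</td></tr>
  <tr><td>Player Two</td><td>1</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]          # first <table> in the document

# One list per <tr>, holding the text of each header or data cell
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows[1])
```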

In [None]:
df_star = pd.read_html(str(soup.find_all('table')[0]))
df_star

Back to using pandas' `read_html`, but now in a more directed way: `find_all` grabs the right part of the HTML, that piece is converted to a string, and pandas has an easy time turning the string into a list of DataFrames.

In [None]:
df_star = df_star[0]

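The code above looks strange, but `read_html` returns a list of DataFrames, one per table it finds, so indexing with `[0]` just pulls the first (and here only) DataFrame out of the list.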

In [None]:
# Renaming the columns
column_names = ['PK', 'Player', 'Tot', 'NBA', 'ABA']
df_star.columns = column_names
df_star.head()

In [None]:
df_star['All-Star'] = df_star['Tot']
df_star.drop(df_star.columns[[0,2,3,4]], inplace=True, axis=1)
df_star.head()
# This DataFrame was wonderfully easy to compile!

In [None]:
# Need to now merge these player accomplishment dataframes into 
# one dataframe by player name and accomplishment and then merge
# that with my earlier data frame of draft history

draft_dframe = pd.read_csv('NBA_Data/1976_to_2015_Draft.csv')
draft_dframe.head()

In [None]:
df_accomplishment = pd.merge(df_allNBA, df_star, on='Player')
df_accomplishment.head()

In [None]:
df_draft = pd.merge(df_accomplishment, draft_dframe, on = 'Player',
                   how = 'outer')
df_draft.head()

Merging data in pandas is pretty easy, easier than it seems in Stata. I did a few quality checks, such as making sure that all players transferred over and that players who went undrafted or were drafted before 1976 came out okay. When I do this in Stata, I feel like over half the time something has gone wrong because the thing you want to merge on, such as subject id, has some nuance that screws things up. Luckily, that's not the case here and everything is in order.
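Quality checks of this kind can be sketched with two options of `pd.merge`: `indicator=True` adds a `_merge` column saying which table each row came from, and `validate` raises an error if the key is not unique where it should be. The frames and the `Pk` column below are made-up placeholders, not the real draft file.

```python
import pandas as pd

# Toy stand-ins for the accomplishments and draft tables
accomplishments = pd.DataFrame({"Player": ["Player One", "Player Two"],
                                "All_NBA": [2, 1]})
draft = pd.DataFrame({"Player": ["Player Two", "Player Three"],
                      "Pk": [1, 2]})

merged = pd.merge(accomplishments, draft, on="Player", how="outer",
                  indicator=True,            # adds the _merge column
                  validate="one_to_one")     # errors on duplicate names

# How many rows matched, and how many came from only one side
merge_counts = merged["_merge"].value_counts().to_dict()
print(merge_counts)
```

Note that `pd.merge` defaults to `how='inner'`, so a merge without a `how` argument keeps only players present in both tables; whether that is what's wanted depends on the analysis.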