# Web Scraping of www.basketball-reference.com

In this part of the project, I collected the data I needed by scraping www.basketball-reference.com with BeautifulSoup, pandas, and urllib.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request

The first question I had to answer was which players to consider in my analysis. I decided to use Hall of Fame Probability as a preliminary filter. Created by the people at Basketball Reference, Hall of Fame Probability measures exactly what the name not-so-subtly suggests: a player's probability of making it into the Basketball Hall of Fame.

Given the nature of my problem (predicting a player's performance dropoff before it happens), it seemed most logical to look at players who performed at a high level. 

I scraped the all-time leaders in this metric as well as the active leaders. There was some overlap between the two lists, so I ended up with 311 unique players.

In [2]:
# Access Basketball Reference to get the Hall of Fame Probability leaders
# Specify the URL
site = "https://www.basketball-reference.com/leaders/hof_prob.html"
# Query the website and store the response in the variable 'page'
page = urllib.request.urlopen(site)
# Parse the HTML in 'page' into a BeautifulSoup object
soup = BeautifulSoup(page, 'lxml')

In [3]:
# Save two Hall of Fame Probability leader tables (all time and active)
alltime_table, active_table = soup.find_all('table')

In [4]:
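# Create lists of (player name, href) tuples
alltime_list = []
active_list = []
for line in alltime_table('td'):
    try:
        alltime_list.append((str(line.a.string), str(line.a.get('href'))))
    except AttributeError:
        # Cell has no <a> tag (line.a is None), so it is not a player cell
        pass

for line in active_table('td'):
    try:
        active_list.append((str(line.a.string), str(line.a.get('href'))))
    except AttributeError:
        # Cell has no <a> tag (line.a is None), so it is not a player cell
        pass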

`alltime_list` and `active_list` are lists of tuples, each containing a player's name and the path to that player's season statistics. Below are the length and first five entries of each.

In [5]:
len(alltime_list), alltime_list[:5]

(250,
 [('Kareem Abdul-Jabbar', '/players/a/abdulka01.html'),
  ('Michael Jordan', '/players/j/jordami01.html'),
  ('Bill Russell', '/players/r/russebi01.html'),
  ('Kobe Bryant', '/players/b/bryanko01.html'),
  ('Wilt Chamberlain', '/players/c/chambwi01.html')])

In [6]:
len(active_list), active_list[:5]

(100,
 [('LeBron James', '/players/j/jamesle01.html'),
  ('Dwyane Wade', '/players/w/wadedw01.html'),
  ('Dirk Nowitzki', '/players/n/nowitdi01.html'),
  ('Kevin Durant', '/players/d/duranke01.html'),
  ('Chris Paul', '/players/p/paulch01.html')])

Next I needed to combine the two lists and drop the duplicate entries for players who appear on both. I did this with set operations.

In [7]:
# Create list of players on active_list but not alltime_list
# and add to alltime_list
active_list = list(set(active_list).difference(set(alltime_list)))
alltime_list += active_list
# Print new list length and last 10 players of list
print(len(alltime_list))
print(alltime_list[-10:])

311
[('Jodie Meeks', '/players/m/meeksjo01.html'), ('Mike Conley', '/players/c/conlemi01.html'), ('Brook Lopez', '/players/l/lopezbr01.html'), ('Rudy Gobert', '/players/g/goberru01.html'), ('Ricky Rubio', '/players/r/rubiori01.html'), ('Jae Crowder', '/players/c/crowdja01.html'), ('Harrison Barnes', '/players/b/barneha02.html'), ('Tony Allen', '/players/a/allento01.html'), ('Jamal Crawford', '/players/c/crawfja01.html'), ('Shaun Livingston', '/players/l/livinsh01.html')]
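
Because a `set` does not preserve order, the active-only players above are appended in arbitrary order. If a stable order matters, the merge could be written as below. This is a minimal sketch assuming the same list-of-tuples shape; `merge_unique` and the two sample lists are illustrative names and data, not part of the scraped results.

```python
def merge_unique(primary, secondary):
    """Return primary plus the entries of secondary not already in it, keeping order."""
    seen = set(primary)
    merged = list(primary)
    for entry in secondary:
        if entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged


# Illustrative sample in the same (name, href) shape as the scraped lists
alltime_sample = [('LeBron James', '/players/j/jamesle01.html'),
                  ('Tim Duncan', '/players/d/duncati01.html')]
active_sample = [('LeBron James', '/players/j/jamesle01.html'),
                 ('Rudy Gobert', '/players/g/goberru01.html')]

combined = merge_unique(alltime_sample, active_sample)
print(len(combined))
print(combined[-1])
```

The set is still used for fast membership checks, but the output order follows the two input lists.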


# Scraping Season Statistics
This for loop gets the season-by-season stats for each player in `alltime_list` (only the first 10 players in this sample run). It first parses the player page HTML into a BeautifulSoup object and takes the first table on the page as the per-game stats table. I also pull the player's height from the top of the page and convert it from a 'feet-inches' string to inches.
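
The height conversion described above can be isolated as a small helper. This is only a sketch: `height_to_inches` is a hypothetical name, and it assumes the height is always reported as a `feet-inches` string.

```python
def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-6' to total inches."""
    feet, inches = height.strip().split('-')
    return int(feet) * 12 + int(inches)


print(height_to_inches('6-6'))  # 78
```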

In [8]:
all_seasons = [] 
for player in alltime_list[:10]:
    reference_site = 'https://www.basketball-reference.com' + player[1]
    page = urllib.request.urlopen(reference_site)
    soup = BeautifulSoup(page, 'lxml')
    per_game_table = soup.table
    
    height = str(soup.find_all('div', {'id':'info'})[0].find_all('span', {'itemprop':'height'})[0].string)
    height = height.split('-')
    height = int(height[0])*12 + int(height[1])
    
    # Collect every cell of the per-game table into one flat list;
    # the column names come first, followed by the season values
    values = []
    length = 0
    for row in (per_game_table('tr')):
        for num, column in enumerate(row):
            # Players from different decades have different
            # available stats, but PTS is always the last column
            if column.string == 'PTS':
                length = int((num+1) / 2)
            if column != '\n':
                values.append(str(column.string))
    categories = values[:length]
    # Ignores category names
    values = values[length:]

    # Create list of individual season stats lists
    player_career = [['Player', 'href', 'Height'] + categories]
    season = [player[0], player[1], height]
    for value in values:
        if value.startswith('Did'):
            # Find players who took time off mid-career
            print(player[0], value)
        season.append(str(value))
        # Keep the season once the row is complete and its Age cell is not 'None'
        if (len(season) == length+3) and (season[4] != 'None'):
            player_career.append(season)
            season = [player[0], player[1], height]
    # Optional filter (disabled): player must have played at least 12 seasons
    #if (len(player_career) > 12):# and (len(player_career[0]) == 33):
    all_seasons.append(player_career)
    print(str((len(player_career)-1)), ' seasons of ', player[0], 'added')


20  seasons of  Kareem Abdul-Jabbar added
Michael Jordan Did Not Play (retired)
Michael Jordan Did Not Play (retired)
Michael Jordan Did Not Play (retired)
17  seasons of  Michael Jordan added
13  seasons of  Bill Russell added
20  seasons of  Kobe Bryant added
16  seasons of  Wilt Chamberlain added
16  seasons of  LeBron James added
19  seasons of  Tim Duncan added
21  seasons of  Shaquille O'Neal added
16  seasons of  John Havlicek added
14  seasons of  Oscar Robertson added
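
The core idea in the loop above, finding the header width from the position of `PTS` and then cutting the flat list of cells into rows of that width, can be sketched on its own. `split_rows` and the sample cells are illustrative, not taken from the notebook. The sketch assumes every row has a full set of cells, so shortened rows such as "Did Not Play" would need separate handling.

```python
def split_rows(cells):
    """Split a flat list of table cells into a header and fixed-width rows.

    Assumes 'PTS' is the last header cell, as in the loop above.
    """
    width = cells.index('PTS') + 1
    header, body = cells[:width], cells[width:]
    # Step through the body one row at a time; a trailing partial row is dropped
    rows = [body[i:i + width] for i in range(0, len(body) - width + 1, width)]
    return header, rows


# Illustrative flat list: three header cells followed by two seasons
sample_cells = ['Season', 'Age', 'PTS',
                '1984-85', '21', '28.2',
                '1985-86', '22', '22.7']
header, rows = split_rows(sample_cells)
print(header)
print(rows)
```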


# Converting to DataFrame and Exporting to .csv
The scraped data is stored in nested lists, which are not convenient to work with, so I converted it to a pandas DataFrame and then exported it to a csv file.

In [9]:
# Column names, in the order they should appear in the exported csv
category_list= ['Player', 'href', 'Height', 'Season', 'Age', 'Tm', 'Lg', 
                'Pos', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', 
                '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 
                'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'
                ]

In [10]:
# Build a one-row DataFrame per season, concatenate them, and export to csv
season_df_list = []
for career in all_seasons:
    labels = career[0]
    for season in career[1:]:
        season_df = pd.DataFrame(data=season, index=labels)
        season_df_list.append(season_df.transpose())
    print(career[1][0] + ' added')
career_df = pd.concat(season_df_list, ignore_index=True, sort=True)

career_df.to_csv('all-stats-sample.csv', columns=category_list)

Kareem Abdul-Jabbar added
Michael Jordan added
Bill Russell added
Kobe Bryant added
Wilt Chamberlain added
LeBron James added
Tim Duncan added
Shaquille O'Neal added
John Havlicek added
Oscar Robertson added
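
As an alternative to building one DataFrame per season, each career could be turned into a DataFrame in a single call and the careers concatenated afterwards. The sketch below assumes the same nested-list shape as `all_seasons` (header row first, then one list per season). The rows in `sample_seasons` are made-up placeholders, not scraped values.

```python
import pandas as pd

# Made-up data in the same shape as all_seasons: one list per player,
# header row first, then one list per season
sample_seasons = [
    [['Player', 'Season', 'PTS'],
     ['Player A', '1990-91', '20.1'],
     ['Player A', '1991-92', '22.4']],
    [['Player', 'Season', '3P', 'PTS'],
     ['Player B', '2010-11', '1.5', '18.0']],
]

# One DataFrame per career, using that career's own header row
frames = [pd.DataFrame(career[1:], columns=career[0]) for career in sample_seasons]
sample_df = pd.concat(frames, ignore_index=True, sort=False)

# Fix the column order; stats a player does not have are left as NaN
sample_df = sample_df.reindex(columns=['Player', 'Season', '3P', 'PTS'])
print(sample_df.shape)
```

Players from different decades have different columns, so the concatenation leaves missing values for stats a given player lacks.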
