# Scraping all seasons of all teams

In the notebook titled "Scraping a season- May24," we assembled the team stats of every Warriors game of the 2017-2018 regular season.

In this notebook, we will seek to assemble the team stats of every NBA game since the 2002-2003 regular season. 

In [28]:
#necessary libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import numpy as np

## Links to team schedules

Every game has a Game ID attached to it. There isn't a clear pattern to how these Game ID's are determined (e.g., by date or teams involved), so we will use the same strategy as in the notebook "Scraping a season- May24": we first find the links to the schedules of games for each season and then use requests and BeautifulSoup to grab the Game ID's.

By inspection, the link to a team's schedule has the form "http://www.espn.com/nba/team/schedule/_/name/{0}/year/{1}/seasontype/{2}". Here, {0} represents the abbreviation for a team (e.g., "bos" for the Boston Celtics and "gs" for the Golden State Warriors), {1} represents the final year of the season (e.g., "2018" for the 2017-2018 season), and {2} represents which part of the season the schedule is for ("1" for preseason, "2" for regular season, and "3" for postseason).

For now, we are most interested in the regular season, so {2} will be given by 2. ESPN.com includes stats since the 2002-2003 season, so we will let {1} vary from 2003 to 2018. The most difficult part will be finding the teams' abbreviations. We will do this by exploiting the fact that the "Teams" page from the NBA section of ESPN.com contains links that include the teams' abbreviations.
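
As a minimal sketch of the URL pattern described above, the template can be filled in with `str.format`. The helper name `build_schedule_url` and the constant `SCHEDULE_URL` are hypothetical, introduced here only for illustration; the template string is the one quoted in the text.

```python
# URL template quoted above: {0} team abbreviation, {1} final year of season, {2} season type
SCHEDULE_URL = "http://www.espn.com/nba/team/schedule/_/name/{0}/year/{1}/seasontype/{2}"

def build_schedule_url(abbrev, year, season_type=2):
    """Hypothetical helper: fill in the schedule URL template (season_type 2 = regular season)."""
    return SCHEDULE_URL.format(abbrev, year, season_type)

# Regular season schedule of the Golden State Warriors for 2017-2018
warriors_2018 = build_schedule_url("gs", 2018)
print(warriors_2018)
```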

In [2]:
url = "http://www.espn.com/nba/teams"

html = requests.get(url).content

soup = BeautifulSoup(html, 'lxml') #process webpage using BeautifulSoup

In [3]:
headings = soup.find_all('h5') #one heading for each team

abbrevs = []

for heading in headings:
    team_url = heading.contents[0].get('href') #looks like "http://www.espn.com/nba/team/_/name/bos/boston-celtics"
    abbrev = team_url.split('/')[-2] #abbreviation is the segment before the last '/'
    abbrevs.append(abbrev)

We check that we got the desired abbreviations, stored as a list in the variable "abbrevs".

In [4]:
print(abbrevs)

['bos', 'bkn', 'ny', 'phi', 'tor', 'gs', 'lac', 'lal', 'phx', 'sac', 'chi', 'cle', 'det', 'ind', 'mil', 'dal', 'hou', 'mem', 'no', 'sa', 'atl', 'cha', 'mia', 'orl', 'wsh', 'den', 'min', 'okc', 'por', 'utah']
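
The extraction step can also be checked offline, without requesting the page. The sketch below applies the same `split('/')` logic to the sample URL from the code comment above; the function name `abbrev_from_team_url` is hypothetical.

```python
def abbrev_from_team_url(team_url):
    """Hypothetical helper: return the second-to-last path segment of a team URL."""
    # rstrip guards against a trailing '/', which would otherwise shift the segments by one
    return team_url.rstrip('/').split('/')[-2]

sample_url = "http://www.espn.com/nba/team/_/name/bos/boston-celtics"
sample_abbrev = abbrev_from_team_url(sample_url)
print(sample_abbrev)  # bos
```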


## Game ID's for the 2004-05 through 2017-18 seasons

Now that we have the team abbreviations, we can find the Game ID's for all of the regular season games from the 2004-05 season through the 2017-18 season. We start by bringing back a function written in the notebook "Scraping a season- May24" that extracts the dates and game ID's from the webpage of a team's schedule.

In [5]:
def schedule_to_date_gameids(url,year):
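    '''
    Extract the dates and game ID's from a team's schedule of games.
    
    Input:
    url (schedule of games played with scores)
    year (final year of the season, e.g. 2018 for the 2017-2018 season)
    
    Output:
    List of [date, game ID] pairs, with each date a datetime.date object
    '''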
    
    page = requests.get(url)
    html = page.content
    soup = BeautifulSoup(html, 'lxml')
    
    tb = soup.find_all('table')[0]
    
    date_gameid = []
    
    for row in tb.find_all('tr', {'class':['oddrow', 'evenrow']}):
        
        date = row.find_all('td')[0].contents[0].strip() #date in format DAY, MONTH DATE
        date = date.split(',')[1][1:] #store MONTH DATE
        
        #convert into datetime.date object (strptime defaults to the year 1900)
        try:
            date_object = datetime.strptime(date, '%b %d').date()
            
            #correct year of game: January-July games fall in the final year of the season
            if date_object.month <= 7:
                game_year = year
            else:
                game_year = year - 1
                
            date_object = date_object.replace(year=game_year)

        except ValueError as error: #'Feb 29' cannot be parsed with the default year 1900
            #a February game always falls in the final year of the season
            date_object = datetime.strptime('Feb 29 ' + str(year), '%b %d %Y').date()
        
        game_link = row.find_all('li', {'class':'score'})
        
        try:
            #game ID at end of link to recap of game
            game_id = game_link[0].contents[0].get('href').split('/')[-1]
            date_gameid.append([date_object, game_id])
            
        except IndexError as error:
            pass
    
    return date_gameid
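
The year-correction rule used inside the function can be isolated and tested without any web request. The following is only a sketch of that rule, not the function above: the helper name `schedule_date_to_date` is hypothetical, and it assumes the date string has the form "MONTH DAY" with an abbreviated English month name. It works out the year first and then parses the full date, so 'Feb 29' needs no special case in leap years.

```python
from datetime import datetime

def schedule_date_to_date(month_day, season_end_year):
    """Hypothetical helper: turn 'Oct 17' or 'Feb 29' into a datetime.date.

    Games dated August-December belong to the calendar year before the
    season's final year; games dated January-July belong to the final year.
    """
    month = datetime.strptime(month_day.split()[0], '%b').month
    game_year = season_end_year if month <= 7 else season_end_year - 1
    # Parsing with the real year lets 'Feb 29' succeed in leap years
    return datetime.strptime('{0} {1}'.format(month_day, game_year), '%b %d %Y').date()

print(schedule_date_to_date('Oct 17', 2018))  # 2017-10-17
print(schedule_date_to_date('Feb 29', 2012))  # 2012-02-29
```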

In [6]:
#base url for team schedule
# {0}: abbreviation, {1}: end year of season, {2}: season type (1 for pre, 2 for regular, 3 for post)
base_schedule_url = "http://www.espn.com/nba/team/schedule/_/name/{0}/year/{1}/seasontype/{2}"

year = 2018 #final year of season
season_type = 2 #regular season games

We will store the date/Game ID pairs in a nested dictionary. The outer keys are the final years of the seasons; each value is itself a dictionary whose keys are team abbreviations and whose values are lists of [date, Game ID] pairs. The program runs fairly quickly, averaging about 35 seconds per season.

An error occurred when accessing the schedule for the 2003-2004 season. Rather than fix this issue now, we will only consider the seasons after that one.

In [7]:
game_ids_dict = {}


start_time = time.time()

for yr in range(2005, 2019):
    game_ids_dict[yr] = {}
    for abbrev in abbrevs:
        schedule_url = base_schedule_url.format(abbrev, str(yr), str(season_type))
        game_ids_dict[yr][abbrev] = schedule_to_date_gameids(schedule_url, yr)
    print('The {0}-{1} season took {2} seconds'.format(str(yr-1), str(yr), time.time()-start_time))
    start_time = time.time()

The 2004-2005 season took 34.892163038253784 seconds
The 2005-2006 season took 35.17851114273071 seconds
The 2006-2007 season took 36.24991226196289 seconds
The 2007-2008 season took 34.481783866882324 seconds
The 2008-2009 season took 34.22162103652954 seconds
The 2009-2010 season took 42.67656421661377 seconds
The 2010-2011 season took 33.26324915885925 seconds
The 2011-2012 season took 29.46262288093567 seconds
The 2012-2013 season took 36.67962431907654 seconds
The 2013-2014 season took 34.758257150650024 seconds
The 2014-2015 season took 33.82126808166504 seconds
The 2015-2016 season took 34.857800245285034 seconds
The 2016-2017 season took 36.437708139419556 seconds
The 2017-2018 season took 33.0653440952301 seconds


In [8]:
print(game_ids_dict)

{2005: {'bos': [[datetime.date(2004, 11, 3), '241103002'], [datetime.date(2004, 11, 5), '241105002'], [datetime.date(2004, 11, 6), '241106018'], [datetime.date(2004, 11, 10), '241110002'], [datetime.date(2004, 11, 12), '241112002'], [datetime.date(2004, 11, 17), '241117027'], [datetime.date(2004, 11, 19), '241119002'], [datetime.date(2004, 11, 21), '241121002'], [datetime.date(2004, 11, 23), '241123011'], [datetime.date(2004, 11, 24), '241124020'], [datetime.date(2004, 11, 26), '241126002'], [datetime.date(2004, 11, 28), '241128014'], [datetime.date(2004, 11, 29), '241129019'], [datetime.date(2004, 12, 1), '241201002'], [datetime.date(2004, 12, 3), '241203002'], [datetime.date(2004, 12, 5), '241205023'], [datetime.date(2004, 12, 6), '241206009'], [datetime.date(2004, 12, 9), '241209022'], [datetime.date(2004, 12, 11), '241211025'], [datetime.date(2004, 12, 13), '241213012'], [datetime.date(2004, 12, 15), '241215002'], [datetime.date(2004, 12, 17), '241217002'], [datetime.date(2004, 12,
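
The nesting printed above (year, then team abbreviation, then a list of [date, Game ID] pairs) can be queried as in the following sketch. It uses a small stand-in dictionary, `toy_game_ids`, a hypothetical name holding only the first two Celtics entries from the output above, so that it runs without scraping.

```python
from datetime import date

# Toy stand-in with the same nesting as game_ids_dict:
# final year of season -> team abbreviation -> list of [date, game ID] pairs
toy_game_ids = {
    2005: {
        'bos': [[date(2004, 11, 3), '241103002'], [date(2004, 11, 5), '241105002']],
    },
}

# First game of the 2004-2005 Celtics season
first_date, first_game_id = toy_game_ids[2005]['bos'][0]

# Only the game ID's, dropping the dates
celtics_ids = [game_id for _, game_id in toy_game_ids[2005]['bos']]

print(first_date, first_game_id)
print(celtics_ids)
```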

## Saving dates/game ID's in DataFrames

We will now save these date/game ID pairs in DataFrames. For every season, we will have a DataFrame of 30 columns (one for each team) and 82 records (one for each game), with one exception: the 2011-2012 season, which was shortened to 66 games due to a lockout.

In [35]:
base_dataframe = pd.DataFrame(columns=abbrevs)

regular_season_dfs = {yr:base_dataframe.copy() for yr in range(2005,2019)}


#df_2018 = pd.concat([pd.DataFrame([game_ids_dict[2018][abbrev][0] for abbrev in abbrevs], columns=abbrevs) for idx in range(5)], \
                    #ignore_index=True)

#print(df_2018)

#print([game_ids_dict[2018][abbrev][0] for abbrev in abbrevs])

#df_2018 = base_dataframe.copy()

#for idx in range(82):
 #   df_2018.loc[idx,:] = [game_ids_dict[2018][abbrev][idx] for abbrev in abbrevs]

In [36]:
for yr in range(2005, 2019):
    if yr == 2012: #lockout shortened regular season to 66 games
        for idx in range(66):
            regular_season_dfs[yr].loc[idx,:] = [game_ids_dict[yr][abbrev][idx] for abbrev in abbrevs]
    else: #all other seasons have 82 games
        for idx in range(82):
            try:
                regular_season_dfs[yr].loc[idx,:] = [game_ids_dict[yr][abbrev][idx] for abbrev in abbrevs]
            except IndexError as error:
                print(yr, idx)

2013 81


We see that there is an issue with the 2012-2013 season: some teams played fewer than 82 regular season games. With a simple for loop, we find that the Boston Celtics and Indiana Pacers played only 81 games during that season. With further research, we find that the April 16th game between them was canceled due to the bombing during the Boston Marathon. It was decided at the end of the regular season not to reschedule the game because it would not affect the playoff seeding.

In [37]:
for abbrev in abbrevs:
    print(abbrev, len(game_ids_dict[2013][abbrev]))

bos 81
bkn 82
ny 82
phi 82
tor 82
gs 82
lac 82
lal 82
phx 82
sac 82
chi 82
cle 82
det 82
ind 81
mil 82
dal 82
hou 82
mem 82
no 82
sa 82
atl 82
cha 82
mia 82
orl 82
wsh 82
den 82
min 82
okc 82
por 82
utah 82


In [38]:
#append final row to 2012-13 season
last_game_2013 = []

for abbrev in abbrevs:
    if abbrev=='bos' or abbrev=='ind': #Boston and Indiana played only 81 games, so their last entry is NaN
        last_game_2013.append(np.nan)
    else:
        last_game_2013.append(game_ids_dict[2013][abbrev][81])
        
regular_season_dfs[2013].loc[81,:] = last_game_2013
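
As an alternative sketch (not the approach used in this notebook), pandas can pad short columns with NaN on its own when each team's list is wrapped in a `pd.Series`, which would avoid special-casing the 66-game and 81-game situations. The data below is made up for illustration and holds plain ID strings rather than [date, game ID] pairs.

```python
import pandas as pd

# Toy data: made-up game IDs, with two teams one game short of the third
toy_season = {
    'bos': ['g1', 'g2'],
    'ind': ['g3', 'g4'],
    'ny': ['g5', 'g6', 'g7'],
}

# Wrapping each list in a Series lets pandas align on the row index and fill gaps with NaN
toy_df = pd.DataFrame({abbrev: pd.Series(games) for abbrev, games in toy_season.items()})
print(toy_df)
```

Here the last row holds NaN for 'bos' and 'ind', mirroring the manual fix for the 2012-2013 season.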

Finally, we save the DataFrames of dates and Game ID's to csv files, one per season.

In [39]:
base_csv_name = '{0}-{1}-date-gameIDs'

for yr in range(2005, 2019):
    csv_name = base_csv_name.format(str(yr),'regular')
    regular_season_dfs[yr].to_csv(csv_name, sep=',')
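
A saved file can later be read back with `pd.read_csv`. The sketch below round-trips a small made-up DataFrame through an in-memory buffer instead of a file on disk; the 'gs' game IDs are invented for illustration. It assumes the cells hold plain ID strings. Cells that hold [date, game ID] lists, as in the DataFrames above, are written as their string representation and would need to be parsed after loading.

```python
import io
import pandas as pd

# Toy DataFrame; the 'gs' IDs are invented for illustration
toy_df = pd.DataFrame({'bos': ['241103002', '241105002'],
                       'gs': ['241103009', '241105009']})

# Write to an in-memory buffer the same way the loop above writes to a file
buffer = io.StringIO()
toy_df.to_csv(buffer, sep=',')
buffer.seek(0)

# index_col=0 restores the row index; reading the team columns as str keeps the IDs as text
restored = pd.read_csv(buffer, index_col=0, dtype={col: str for col in toy_df.columns})
print(restored)
```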