# Accumulating all games since the 2004-2005 season

In the previous notebooks, "Scraping all regular seasons- June3" and "Saving Game and Team Data in CSV's-June10", we gathered data on every regular season NBA game played since the 2004-2005 season and stored it in a DataFrame.

In this notebook we do the same for postseason games. We then combine all of the regular season and postseason games into a single DataFrame and save it to a CSV file.

In [11]:
#necessary libraries

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time

## Regular season games

We start by loading the CSV file we created in the previous notebook. We will later append the postseason games to this data.

In [2]:
all_regular_season = pd.read_csv('all_regular_season_games_since_ohfour.csv')

#the first column is the zero-based game number of the season (so 1 would mean that is the team's second game of the season)
#rename 'Unnamed: 0' to 'game_number' and shift it so that numbering starts at 1

all_regular_season = all_regular_season.rename(columns={'Unnamed: 0': 'game_number'})
all_regular_season['game_number'] = all_regular_season['game_number'] + 1

#see first few rows 
all_regular_season.head(10)

Unnamed: 0,game_number,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
0,1,bos,2004,2005,regular,11,3,2004,11/3/2004,241103002
1,2,bos,2004,2005,regular,11,5,2004,11/5/2004,241105002
2,3,bos,2004,2005,regular,11,6,2004,11/6/2004,241106018
3,4,bos,2004,2005,regular,11,10,2004,11/10/2004,241110002
4,5,bos,2004,2005,regular,11,12,2004,11/12/2004,241112002
5,6,bos,2004,2005,regular,11,17,2004,11/17/2004,241117027
6,7,bos,2004,2005,regular,11,19,2004,11/19/2004,241119002
7,8,bos,2004,2005,regular,11,21,2004,11/21/2004,241121002
8,9,bos,2004,2005,regular,11,23,2004,11/23/2004,241123011
9,10,bos,2004,2005,regular,11,24,2004,11/24/2004,241124020


In [3]:
#store all team abbreviations in list, alphabetically sorted

#just get abbreviations from 2017-2018 season
season_eighteen = all_regular_season[all_regular_season['season_end_year'] == 2018] 
abbrevs = list(set(season_eighteen.loc[:,'team']))
abbrevs.sort()

print(abbrevs)

['atl', 'bkn', 'bos', 'cha', 'chi', 'cle', 'dal', 'den', 'det', 'gs', 'hou', 'ind', 'lac', 'lal', 'mem', 'mia', 'mil', 'min', 'no', 'ny', 'okc', 'orl', 'phi', 'phx', 'por', 'sa', 'sac', 'tor', 'utah', 'wsh']


In [4]:
url = 'http://www.espn.com/nba/team/schedule/_/name/gs/year/2018/seasontype/3'

html = requests.get(url).content

soup = BeautifulSoup(html,'lxml')

tb = soup.find_all('table')[0]

#the header row states whether this is a regular season or postseason schedule
print(tb.find('tr', {'class':'stathead'}))

<tr class="stathead"><td colspan="9">2018 Postseason Schedule</td></tr>


We now wish to scrape all of the playoff games played since the 2004-2005 season. The base URL for a team's playoff schedule is http://www.espn.com/nba/team/schedule/_/name/{abbrev}/year/{year}/seasontype/3, where {abbrev} is the team's abbreviation and {year} is the year in which the playoffs take place. For example, http://www.espn.com/nba/team/schedule/_/name/gs/year/2018/seasontype/3 leads to the Warriors' 2018 playoff schedule.

It is tempting to simply iterate over all team abbreviations and years. However, this template fails for a team that did __NOT__ make the playoffs. For example, the Brooklyn Nets missed the playoffs in 2018. If we try http://www.espn.com/nba/team/schedule/_/name/bkn/year/2018/seasontype/3, the page exists, but it shows the regular season schedule instead. Thus, we need to determine which teams made the playoffs in each season.

To do this, note that the top of each schedule reads `{year} Regular Season Schedule` or `{year} Postseason Schedule`, where `{year}` is the end year of the season. We use the requests and BeautifulSoup libraries to read this header and find which teams made the playoffs each season.

In [5]:
def regular_or_postseason(year, abbrev):
    '''
    Determines if a team made the playoffs during a year by analyzing a page at a url.
    
    Input:
    year: int that is final year of season (between 2005-2018)
    abbrev: string that is abbreviation for team
    
    Output:
    string ('Regular' or 'Postseason')
    
    '''
    
    #url either leads to playoff schedule (if team made playoffs) or regular season schedule (if it did not)
    base_url = 'http://www.espn.com/nba/team/schedule/_/name/{0}/year/{1}/seasontype/3'
    
    html = requests.get(base_url.format(abbrev, str(year))).content
    soup = BeautifulSoup(html, 'lxml')
    
    tb = soup.find_all('table')[0]
    
    #schedule will have header "{year} Regular Season Schedule" or "{year} Postseason Schedule"
    #extract "Regular" or "Postseason" to determine if team made playoffs
    
    season_type_html = tb.find('tr',{'class':'stathead'})
    return season_type_html.contents[0].contents[0].split()[1]
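
The header-parsing step can be checked without any network access. The following is a minimal sketch, assuming the header text always has the form `{year} Regular Season Schedule` or `{year} Postseason Schedule`. The helper name `season_type_from_header` is hypothetical and does not appear in the earlier notebooks.

```python
def season_type_from_header(header_text):
    """Return 'Regular' or 'Postseason' from a schedule header string."""
    words = header_text.split()
    # The second word is the season type, e.g. ['2018', 'Postseason', 'Schedule']
    if len(words) < 2 or words[1] not in ('Regular', 'Postseason'):
        raise ValueError('Unexpected schedule header: ' + repr(header_text))
    return words[1]


print(season_type_from_header('2018 Postseason Schedule'))
print(season_type_from_header('2018 Regular Season Schedule'))
```

Raising an error on an unexpected header is a design choice: if the page layout changes, the scrape stops loudly instead of silently misclassifying a team.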

In [6]:

#keys: years
#values: lists of teams that made playoffs that year
made_playoffs = {year: [] for year in range(2005,2019)}

for year in range(2005,2019):
    start_time = time.time()
    for abbrev in abbrevs:
        if regular_or_postseason(year,abbrev) == 'Postseason':
            made_playoffs[year].append(abbrev)
    print(str(year) + ' took ' + str(time.time()-start_time) + ' seconds.')
            

2005 took 24.0380380153656 seconds.
2006 took 25.403692960739136 seconds.
2007 took 23.47454309463501 seconds.
2008 took 24.653419017791748 seconds.
2009 took 23.37662100791931 seconds.
2010 took 24.133108854293823 seconds.
2011 took 23.285101175308228 seconds.
2012 took 24.276541233062744 seconds.
2013 took 24.003634929656982 seconds.
2014 took 27.07154893875122 seconds.
2015 took 25.207557201385498 seconds.
2016 took 26.439001083374023 seconds.
2017 took 25.57918691635132 seconds.
2018 took 24.282296895980835 seconds.


In [7]:
for year in made_playoffs:
    print(str(year) + ' playoff teams: ' \
          + str(len(made_playoffs[year])) + ' teams')
    for team in made_playoffs[year]:
        print(team, end = ' ')
    print('\n')

2005 playoff teams: 16 teams
bkn bos chi dal den det hou ind mem mia okc phi phx sa sac wsh 

2006 playoff teams: 16 teams
bkn chi cle dal den det ind lac lal mem mia mil phx sa sac wsh 

2007 playoff teams: 16 teams
bkn chi cle dal den det gs hou lal mia orl phx sa tor utah wsh 

2008 playoff teams: 16 teams
atl bos cle dal den det hou lal no orl phi phx sa tor utah wsh 

2009 playoff teams: 16 teams
atl bos chi cle dal den det hou lal mia no orl phi por sa utah 

2010 playoff teams: 16 teams
atl bos cha chi cle dal den lal mia mil okc orl phx por sa utah 

2011 playoff teams: 16 teams
atl bos chi dal den ind lal mem mia no ny okc orl phi por sa 

2012 playoff teams: 16 teams
atl bos chi dal den ind lac lal mem mia ny okc orl phi sa utah 

2013 playoff teams: 16 teams
atl bkn bos chi den gs hou ind lac lal mem mia mil ny okc sa 

2014 playoff teams: 16 teams
atl bkn cha chi dal gs hou ind lac mem mia okc por sa tor wsh 

2015 playoff teams: 16 teams
atl bkn bos chi cle dal gs hou lac 

It appears that our code works, since each season lists 16 playoff teams. It should be noted that the Seattle Sonics became the Oklahoma City Thunder just before the 2008-2009 season and the New Jersey Nets became the Brooklyn Nets before the 2012-2013 season. The Seattle Sonics and New Jersey Nets are denoted by their successors' abbreviations (`okc` and `bkn`).
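
When reading the lists above, it can help to translate an abbreviation back to the name the franchise used in a given season. The following is a minimal sketch covering only the two relocations mentioned above. The names `FORMER_NAMES` and `former_name` are hypothetical, and the cutoff years are derived from the season boundaries stated in the text.

```python
# Hypothetical lookup: abbreviation -> (last season_end_year under the old name, old name)
# Only the two relocations mentioned in the text are covered.
FORMER_NAMES = {
    'okc': (2008, 'Seattle SuperSonics'),
    'bkn': (2012, 'New Jersey Nets'),
}


def former_name(abbrev, season_end_year):
    """Return the franchise's earlier name for that season, or None."""
    if abbrev in FORMER_NAMES:
        last_year, name = FORMER_NAMES[abbrev]
        if season_end_year <= last_year:
            return name
    return None


print(former_name('okc', 2005))
print(former_name('bkn', 2013))
```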

## Gathering postseason data

Now that we know which teams made the playoffs each year since the 2004-2005 season, we can gather info on all of the playoff games played since that season.

We bring back a function written in the notebook "Scraping all regular seasons- June3".

In [12]:
def schedule_to_date_gameids(url,year):
    '''
    We will extract the dates and game ID's from a schedule of games.
    
    Input:
    url (schedule of games played with scores)
    year (int, end year of the season, used to assign the correct year to each date)
    
    Output:
    List (pairs of dates and game ID's)
    '''
    
    page = requests.get(url)
    html = page.content
    soup = BeautifulSoup(html, 'lxml')
    
    tb = soup.find_all('table')[0]
    
    date_gameid = []
    
    for row in tb.find_all('tr', {'class':['oddrow', 'evenrow']}):
        
        date = row.find_all('td')[0].contents[0].strip() #date in format DAY, MONTH DATE
        date = date.split(',')[1][1:] #store MONTH DATE
        
        #convert into datetime.date object (strptime defaults to the year 1900)
        try:
            date_object = datetime.strptime(date, '%b %d').date()
            
            #correct year of game: January-July games fall in the season's end year
            if date_object.month <= 7:
                game_year = year
            else:
                game_year = year - 1
                
            date_object = date_object.replace(year=game_year)

        except ValueError as error: #February 29 does not exist in 1900, the default year
            #a February game always falls in the season's end year
            date_object = datetime.strptime('Feb 29 ' + str(year), '%b %d %Y').date()
        
        game_link = row.find_all('li', {'class':'score'})
        
        try:
            #game ID at end of link to recap of game
            game_id = game_link[0].contents[0].get('href').split('/')[-1]
            date_gameid.append([date_object, game_id])
            
        except IndexError as error:
            pass
    
    return date_gameid

In [14]:
#base url for a postseason schedule
base_url = 'http://www.espn.com/nba/team/schedule/_/name/{0}/year/{1}/seasontype/3'

#check that function and base url work on 2017-2018 Warriors postseason
schedule_to_date_gameids(base_url.format('gs',2018),2018)

[[datetime.date(2018, 4, 14), '401029441'],
 [datetime.date(2018, 4, 16), '401029446'],
 [datetime.date(2018, 4, 19), '401029453'],
 [datetime.date(2018, 4, 22), '401029455'],
 [datetime.date(2018, 4, 24), '401029456'],
 [datetime.date(2018, 4, 28), '401031412'],
 [datetime.date(2018, 5, 1), '401031645'],
 [datetime.date(2018, 5, 4), '401031646'],
 [datetime.date(2018, 5, 6), '401031647'],
 [datetime.date(2018, 5, 8), '401031648'],
 [datetime.date(2018, 5, 14), '401032761'],
 [datetime.date(2018, 5, 16), '401032762'],
 [datetime.date(2018, 5, 20), '401032763'],
 [datetime.date(2018, 5, 22), '401032764'],
 [datetime.date(2018, 5, 24), '401032765'],
 [datetime.date(2018, 5, 26), '401032766'],
 [datetime.date(2018, 5, 28), '401032767'],
 [datetime.date(2018, 5, 31), '401034613'],
 [datetime.date(2018, 6, 3), '401034614'],
 [datetime.date(2018, 6, 6), '401034615'],
 [datetime.date(2018, 6, 8), '401034616']]

In the last notebook, we wrote a function that transforms a team's schedule into a DataFrame containing information on the team's games for the season. There, however, the input was a string representation of a list, whereas here it is a list of date and game ID pairs. Hence, we will rewrite the function.

Once again, these DataFrames will have the following columns:
- `team`, the team's abbreviation.
- `season_start_year`, first year of season (each season spans two calendar years).
- `season_end_year`, second year of season.
- `season_type`, type of game played (either 'preseason', 'regular', or 'postseason').
- `game_month`, month that the game occurs.
- `game_day`, day that the game occurs (integer between 1 and 31).
- `game_year`, year that the game occurs.
- `game_date`, full date of game in form MM/DD/YYYY.
- `matchup_id`, Matchup ID of the game.

In [27]:
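def matchups_to_dataframe(matchups_list, abbrev, yr, season_type):
    '''
    Builds a DataFrame of a team's games from a list of date and game ID pairs.
    
    Input:
    matchups_list: list of [datetime.date, game ID string] pairs
    abbrev: string that is abbreviation for team
    yr: int that is final year of season
    season_type: string ('preseason', 'regular', or 'postseason')
    
    Output:
    DataFrame with one row per game
    '''
    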
    #column names of DataFrame as keys
    schedule_info = {}
    
    schedule_info['team'] = [abbrev] * len(matchups_list)
    schedule_info['season_start_year'] = [yr - 1] * len(matchups_list)
    schedule_info['season_end_year'] = [yr] * len(matchups_list)
    schedule_info['season_type'] = [season_type] * len(matchups_list)
    schedule_info['game_month'] = [pair[0].month for pair in matchups_list]
    schedule_info['game_day'] = [pair[0].day for pair in matchups_list]
    schedule_info['game_year'] = [pair[0].year for pair in matchups_list]
    schedule_info['game_date'] = [pair[0].strftime('%m/%d/%Y') for pair in matchups_list]
    schedule_info['matchup_id'] = [int(pair[1]) for pair in matchups_list]
    
    #turn dictionary into DataFrame
    team_season_df = pd.DataFrame.from_dict(schedule_info)
    
    #rearrange columns of DataFrame
    columns = ['team', 'season_start_year', 'season_end_year', 'season_type',\
               'game_month', 'game_day', 'game_year', 'game_date', 'matchup_id']
    
    return team_season_df[columns]    

We first test the function on the 2017-2018 Warriors postseason.

In [29]:
matchups_to_dataframe(schedule_to_date_gameids(base_url.format('gs',2018),2018), 'gs', 2018, 'postseason')

Unnamed: 0,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
0,gs,2017,2018,postseason,4,14,2018,04/14/2018,401029441
1,gs,2017,2018,postseason,4,16,2018,04/16/2018,401029446
2,gs,2017,2018,postseason,4,19,2018,04/19/2018,401029453
3,gs,2017,2018,postseason,4,22,2018,04/22/2018,401029455
4,gs,2017,2018,postseason,4,24,2018,04/24/2018,401029456
5,gs,2017,2018,postseason,4,28,2018,04/28/2018,401031412
6,gs,2017,2018,postseason,5,1,2018,05/01/2018,401031645
7,gs,2017,2018,postseason,5,4,2018,05/04/2018,401031646
8,gs,2017,2018,postseason,5,6,2018,05/06/2018,401031647
9,gs,2017,2018,postseason,5,8,2018,05/08/2018,401031648


We now gather all of the postseason games played by all teams since the 2004-2005 season. The 2017-2018 postseason raises a `ValueError` in the loop below, so we handle that season separately afterwards.

In [62]:
playoff_games_by_year = {}

for year in range(2005, 2019):
    try:
        playoff_games_by_year[year] = pd.concat(
            matchups_to_dataframe(
                schedule_to_date_gameids(base_url.format(abbrev, year), year),
                abbrev, year, 'postseason')
            for abbrev in made_playoffs[year])
    except ValueError as error:
        print(year)

2018
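
The loop above reports only the year in which a `ValueError` occurred. To narrow the problem down to a single team, one option is to catch the error per team and record it. The following is a minimal sketch of that pattern using plain lists so that it runs offline. The names `gather_by_year`, `build_rows` and `fake_build_rows` are hypothetical; in this notebook the role of `build_rows` would be played by the combination of `schedule_to_date_gameids` and `matchups_to_dataframe`.

```python
def gather_by_year(teams_by_year, build_rows):
    """Collect rows per year and record which (year, team) pairs fail.

    build_rows is a hypothetical callable (abbrev, year) -> list of rows.
    """
    rows_by_year = {}
    failures = []
    for year, teams in teams_by_year.items():
        rows_by_year[year] = []
        for abbrev in teams:
            try:
                rows_by_year[year].extend(build_rows(abbrev, year))
            except ValueError as error:
                # Keep going, but remember exactly where the error occurred
                failures.append((year, abbrev, str(error)))
    return rows_by_year, failures


# Stand-in for the real scraper so the sketch runs without network access
def fake_build_rows(abbrev, year):
    if abbrev == 'bad':
        raise ValueError('could not parse schedule')
    return [(abbrev, year)]


rows, failures = gather_by_year({2017: ['gs', 'cle'], 2018: ['gs', 'bad']}, fake_build_rows)
print(failures)
```

With this structure, one failing team no longer discards the rest of that season's games, and the recorded message points directly at the schedule page to inspect.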


In [68]:
#deal with 2018 season separately
year = 2018
last_postseason = pd.concat(
    matchups_to_dataframe(
        schedule_to_date_gameids(base_url.format(abbrev, year), year),
        abbrev, year, 'postseason')
    for abbrev in made_playoffs[year])

In [71]:
all_playoff_games_but_one_season = pd.concat(playoff_games_by_year[year] for year in range(2005,2018))

#DataFrame of all playoff games since the 2004-2005 season
all_playoff_games = pd.concat((all_playoff_games_but_one_season,last_postseason))

In [75]:
print(all_playoff_games.shape)

all_playoff_games.head(20)

(2344, 9)


Unnamed: 0,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
0,bkn,2004,2005,postseason,4,24,2005,04/24/2005,250424014
1,bkn,2004,2005,postseason,4,26,2005,04/26/2005,250426014
2,bkn,2004,2005,postseason,4,28,2005,04/28/2005,250428017
3,bkn,2004,2005,postseason,5,1,2005,05/01/2005,250501017
0,bos,2004,2005,postseason,4,23,2005,04/23/2005,250423002
1,bos,2004,2005,postseason,4,25,2005,04/25/2005,250425002
2,bos,2004,2005,postseason,4,28,2005,04/28/2005,250428011
3,bos,2004,2005,postseason,4,30,2005,04/30/2005,250430011
4,bos,2004,2005,postseason,5,3,2005,05/03/2005,250503002
5,bos,2004,2005,postseason,5,5,2005,05/05/2005,250505011


In [82]:
playoff_columns = all_playoff_games.columns

all_games_info = pd.concat((all_regular_season[playoff_columns], all_playoff_games))

print(all_games_info.shape)

all_games_info.head(20)

(36140, 9)


Unnamed: 0,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
0,bos,2004,2005,regular,11,3,2004,11/3/2004,241103002
1,bos,2004,2005,regular,11,5,2004,11/5/2004,241105002
2,bos,2004,2005,regular,11,6,2004,11/6/2004,241106018
3,bos,2004,2005,regular,11,10,2004,11/10/2004,241110002
4,bos,2004,2005,regular,11,12,2004,11/12/2004,241112002
5,bos,2004,2005,regular,11,17,2004,11/17/2004,241117027
6,bos,2004,2005,regular,11,19,2004,11/19/2004,241119002
7,bos,2004,2005,regular,11,21,2004,11/21/2004,241121002
8,bos,2004,2005,regular,11,23,2004,11/23/2004,241123011
9,bos,2004,2005,regular,11,24,2004,11/24/2004,241124020


We conclude this notebook by writing this DataFrame to a CSV file.

In [79]:
all_games_info.to_csv('all_games_04_on.csv')
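
Note that the combined DataFrame keeps each piece's original row index, and `to_csv` writes that index as an unnamed first column, which is where the `Unnamed: 0` column came from when we loaded the regular season file. If that column is not wanted, one option is to reset the index when concatenating and to omit it when saving. The following is a minimal sketch on made-up miniature DataFrames (the variable names are hypothetical), writing to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Tiny stand-ins for the regular season and playoff DataFrames
regular = pd.DataFrame({'team': ['bos', 'bos'],
                        'season_type': ['regular', 'regular'],
                        'matchup_id': [241103002, 241105002]})
playoffs = pd.DataFrame({'team': ['bos'],
                         'season_type': ['postseason'],
                         'matchup_id': [250423002]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, ...
combined = pd.concat((regular, playoffs), ignore_index=True)

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

Whether to keep the index is a judgment call: in the regular season file it doubled as a per-team game counter, so dropping it is only appropriate if that information is stored in its own column.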