# Scraping a season

In the previous notebook, we wrote a function that takes as input a url with a team stats table and outputs a DataFrame containing the team stats.

In this notebook, we will build on that function to grab the team stats tables from all of the playoff games of the 2018 Warriors (up to Game 4 of the 2018 WCF), and then from every game of their regular season.

The main challenge will be gathering the Game ID's for each of the playoff games. There doesn't appear to be a pattern in how the Game ID's are generated. We will get around this by using the BeautifulSoup library to find the Game ID's from links on the team's schedule page.
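
As a minimal sketch of the idea, the snippet below pulls a game ID out of a recap link. The HTML fragment is a made-up stand-in for one schedule row (the real ESPN markup may differ), and `extract_game_id` is a hypothetical helper rather than part of this notebook's code.

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for one schedule row (assumption, not real ESPN markup)
sample_html = '''
<ul>
  <li class="score"><a href="/nba/recap/_/id/401029441">113-92</a></li>
</ul>
'''

def extract_game_id(row_soup):
    """Return the last path segment of the score link, or None if there is no link."""
    score_item = row_soup.find('li', {'class': 'score'})
    link = score_item.find('a') if score_item is not None else None
    if link is None or not link.get('href'):
        return None  # no recap link, e.g. the game has not been played yet
    return link['href'].rstrip('/').split('/')[-1]

soup = BeautifulSoup(sample_html, 'html.parser')
game_id = extract_game_id(soup)
print(game_id)
```

Returning `None` for a missing link is one way to skip unplayed games without relying on an exception.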

At the end of the notebook, we will create two DataFrames: one for the Warriors and one for their opponents. Each DataFrame will have 82 records (one for each game of the regular season) and 33 columns (the game date plus various statistics). We then write these DataFrames to csv files.

## Necessary libraries

We begin by importing all of the libraries we will need for this notebook.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime #find date of games
import time #keep track of lengths of program run time

## Extracting dates and game ID's

We will begin with a webpage that has a schedule of games already played by the Golden State Warriors in the 2018 playoffs. We then go game-by-game, extracting dates and game ID's and compiling them in a list.

In [2]:
#link to 2018 Warriors playoffs schedule
url = 'http://www.espn.com/nba/team/schedule/_/name/gs'

page = requests.get(url)
html = page.content
soup = BeautifulSoup(html, 'lxml')


tb = soup.find_all('table')[0] #there is just one table

#final year of season (season spans over 2 years)
year = 2018

#to store dates and game ID's
date_gameid = []


for row in tb.find_all('tr', {'class':['oddrow', 'evenrow']}):
    
    tdx = row.find_all('td')
    
    date = tdx[0].contents[0].strip() #currently in format DAY, MONTH DATE
    date = date.split(',')[1][1:] #store MONTH DATE
    
    #convert into datetime.date object (wrong year)
    date_object = datetime.strptime(date, '%b %d').date() 
    
    #games from September onward belong to the first calendar year of the season
    if date_object.month < 9:
        game_year = year
    else:
        game_year = year - 1
        
    date_object = date_object.replace(year=game_year)
    
    game_link = row.find_all('li', {'class':'score'})
    
    
    try:
        #game ID at end of link to recap of game
        game_id = game_link[0].contents[0].get('href').split('/')[-1]
        date_gameid.append([date_object, game_id])
        
    #game hasn't occurred yet, so no recap available    
    except IndexError as error:
        pass
    

print(date_gameid)

[[datetime.date(2018, 4, 14), '401029441'], [datetime.date(2018, 4, 16), '401029446'], [datetime.date(2018, 4, 19), '401029453'], [datetime.date(2018, 4, 22), '401029455'], [datetime.date(2018, 4, 24), '401029456'], [datetime.date(2018, 4, 28), '401031412'], [datetime.date(2018, 5, 1), '401031645'], [datetime.date(2018, 5, 4), '401031646'], [datetime.date(2018, 5, 6), '401031647'], [datetime.date(2018, 5, 8), '401031648'], [datetime.date(2018, 5, 14), '401032761'], [datetime.date(2018, 5, 16), '401032762'], [datetime.date(2018, 5, 20), '401032763'], [datetime.date(2018, 5, 22), '401032764']]
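
The year-assignment step can also be sketched on its own. The helper below, `schedule_date`, is hypothetical, and the September cutoff is an assumption about when a season's first calendar year ends. It parses the year together with the month and day, so a date such as Feb 29 does not depend on `strptime`'s default year.

```python
from datetime import datetime, date

def schedule_date(text, season_end_year, first_month_of_season=9):
    """Convert text like 'Sat, Apr 14' into a date in the correct calendar year."""
    month_day = text.split(',')[1].strip()  # e.g. 'Apr 14'
    month = datetime.strptime(month_day.split()[0], '%b').month
    # Months before the cutoff fall in the season's final year (assumed cutoff: September)
    year = season_end_year if month < first_month_of_season else season_end_year - 1
    return datetime.strptime(f'{month_day} {year}', '%b %d %Y').date()

playoff_game = schedule_date('Sat, Apr 14', 2018)
opening_night = schedule_date('Tue, Oct 17', 2018)
print(playoff_game, opening_night)
```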


## Team Stats DataFrames for 2018 Playoffs

For each pair of a date and a game ID, we will create a DataFrame for the team stats of the game. We will compile these DataFrames in a list. We will need to bring back the Team Stats function written in the previous notebook.

In [3]:
#Base URL for Team Stats table
teamstat_base_url = 'http://www.espn.com/nba/matchup?gameId='

def teamstats(url):
    '''
    Extract a table of team stats from a webpage at a url and store the stats in a DataFrame.
    
    Input:
    url to page with team stats table
    
    Output:
    DataFrame (2 records x 32 columns)
        One record each for visitor team and home team
    '''
    
    page = requests.get(url)
    html = page.content
    
    soup = BeautifulSoup(html, 'lxml')
    tables = soup.find_all('table')
    
    tb0 = tables[0].tbody #table with team names, points scored each quarter, total points
    tb1 = tables[1].tbody #table with traditional team stats (assists, total rebounds, etc.)
    
    [visitor_team_row, home_team_row] = [row for row in tb0.find_all('tr')]
    #lists of team name, points scored each quarter, total points    
    visitor_team_name_points = [val.contents[0].strip() for val in visitor_team_row.find_all('td')]
    home_team_name_points = [val.contents[0].strip() for val in home_team_row.find_all('td')]
    
    #Handling the case that the game went to overtime
    if len(visitor_team_name_points) > 6: #it was an overtime game
        num_overtimes = len(visitor_team_name_points) - 6 #number of overtime periods
        #List of int's for visitor and home teams of points scored per overtime period
        visitor_overtime_points = list(map(int,visitor_team_name_points[5:5+num_overtimes]))
        home_overtime_points = list(map(int,home_team_name_points[5:5+num_overtimes]))
        #then remove these items from lists of team name and point totals
        #(delete the whole slice at once; popping one index at a time shifts the list)
        del visitor_team_name_points[5:5 + num_overtimes]
        del home_team_name_points[5:5 + num_overtimes]
    else: #no overtime
        num_overtimes = 0
        visitor_overtime_points = []
        home_overtime_points = []

    #create 3 lists for Team Stats table
    #List 1: names of stats in Team Stats table
    #List 2: corresponding visitor team's stat
    #List 3: corresponding home team's stat
    tb1_stat_names, tbl_visitor_stats, tb1_home_stats = [], [], []
    
    #cycle over different stats
    for row in tb1.find_all('tr'):
        
        tdx = [val for val in row.find_all('td')]
        
        tb1_stat_names += tdx[0].contents[0].strip().split('-')
        tbl_visitor_stats += tdx[1].contents[0].strip().split('-')
        tb1_home_stats += tdx[2].contents[0].strip().split('-')
    
    
    #precede each stat 'Attempted' with type of shot attempted
    tb1_stat_names[1] = 'FG Attempted'
    tb1_stat_names[4] = '3PT Attempted'
    tb1_stat_names[7] = 'FT Attempted'
    
    tb0_stat_names = ['Team name', '1st Qtr Points', '2nd Qtr Points', \
                  '3rd Qtr Points', '4th Qtr Points', 'Total Points']
    
    #names of all stats, including team name, rebounds, etc.
    stat_names = tb0_stat_names + tb1_stat_names + ['Number of OT Periods', 'OT Points']
    #corresponding stats for teams
    visitor_stats = visitor_team_name_points + tbl_visitor_stats + [num_overtimes, visitor_overtime_points]
    home_stats = home_team_name_points + tb1_home_stats + [num_overtimes, home_overtime_points]
    
    #create DataFrame of all stats (most entries are still strings scraped from the page; they are converted below)
    stats_df = pd.DataFrame(columns=stat_names)
    stats_df.loc[0] = visitor_stats
    stats_df.loc[1] = home_stats
    
    #append column of which team won (1 if won, 0 if lost)
    if int(stats_df.loc[0,'Total Points']) > int(stats_df.loc[1,'Total Points']):
        stats_df.loc[:,'Won?'] = pd.Series([1,0])
    else:
        stats_df.loc[:,'Won?'] = pd.Series([0,1])
        
    stats_df['Away or home?'] = pd.Series(['Away', 'Home'])
        
    
    #convert entries from string type to int or float type (except for Team name and OT Points)
    for stat in stat_names:
        if stat in ('Team name', 'OT Points'):
            continue
        if '%' in stat: #convert percentage stats to float type
            stats_df[stat] = stats_df[stat].astype(float)
        else: #convert other stats to int type
            stats_df[stat] = stats_df[stat].astype(int)
            
    return stats_df    
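
The overtime handling inside `teamstats` can be checked in isolation. The sketch below uses a hypothetical helper, `split_line_score`, and assumes a row always has the layout team name, four quarters, any overtime periods, then the total. The double-overtime numbers are made up for illustration.

```python
def split_line_score(cells):
    """Split [team, Q1, Q2, Q3, Q4, OT..., total] into regulation and overtime parts."""
    num_overtimes = max(len(cells) - 6, 0)
    overtime_points = [int(points) for points in cells[5:5 + num_overtimes]]
    regulation = cells[:5] + cells[-1:]  # team name, four quarters, total points
    return regulation, num_overtimes, overtime_points

# Made-up double-overtime row (illustrative values only)
regulation, num_overtimes, overtime_points = split_line_score(
    ['HOU', '25', '30', '28', '27', '10', '12', '132'])
print(regulation, num_overtimes, overtime_points)
```

Slicing keeps the total in place however many overtime periods were played.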

We first check that we can obtain a Team Stats table from the base URL + game ID format. We test on Game 1 of the First Round of 2018 Playoffs between the San Antonio Spurs and the Warriors.

In [4]:
print(teamstats(teamstat_base_url + date_gameid[0][1]))

  Team name  1st Qtr Points  2nd Qtr Points  3rd Qtr Points  4th Qtr Points  \
0        SA              17              24              22              29   
1        GS              28              29              29              27   

   Total Points  FG Made  FG Attempted  Field Goal %  3PT Made      ...        \
0            92       32            80          40.0         9      ...         
1           113       44            81          54.3        10      ...         

   Points Off Turnovers  Fast Break Points  Points in Paint  Personal Fouls  \
0                    11                  5               22              20   
1                    19                  6               34              18   

   Technical Fouls  Flagrant Fouls  Number of OT Periods  OT Points  Won?  \
0                0               0                     0         []     0   
1                1               0                     0         []     1   

   Away or home?  
0           Away  
1           Home  

We then compile a list of DataFrames from the list of dates with game ID's. We include the game date as the first column of each DataFrame.

In [5]:
start_time = time.time()

stats_df_list = []

for game in date_gameid:
    
    stats_df = teamstats(teamstat_base_url + game[1])
    
    #add column for game date
    stats_df['Date'] = pd.Series([game[0],game[0]])
    
    #switch date to first column
    cols = list(stats_df.columns)
    cols = [cols[-1]] + cols[:-1]
    stats_df = stats_df[cols]
    
    stats_df_list.append(stats_df)
    
print(stats_df_list[0])

print('Program took ', time.time()-start_time, 'seconds to run.')

         Date Team name  1st Qtr Points  2nd Qtr Points  3rd Qtr Points  \
0  2018-04-14        SA              17              24              22   
1  2018-04-14        GS              28              29              29   

   4th Qtr Points  Total Points  FG Made  FG Attempted  Field Goal %  \
0              29            92       32            80          40.0   
1              27           113       44            81          54.3   

       ...        Points Off Turnovers  Fast Break Points  Points in Paint  \
0      ...                          11                  5               22   
1      ...                          19                  6               34   

   Personal Fouls  Technical Fouls  Flagrant Fouls  Number of OT Periods  \
0              20                0               0                     0   
1              18                1               0                     0   

   OT Points  Won?  Away or home?  
0         []     0           Away  
1         []     1           Home  
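
As an alternative to appending the date and then reordering the columns, `DataFrame.insert` can place the column at the front in one step. This is only a sketch on a two-column stand-in for the real team stats table.

```python
from datetime import date
import pandas as pd

# Small stand-in for a team stats table (only two of the real columns)
stats_df = pd.DataFrame({'Team name': ['SA', 'GS'], 'Total Points': [92, 113]})

# Insert the game date as the first column; the scalar is repeated for every row
stats_df.insert(0, 'Date', date(2018, 4, 14))

print(stats_df.columns.tolist())
```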

## Encapsulating storing team stats from season

Previously, we turned a url with a playoffs schedule into a list of dates and game ID's. We will now encapsulate this process into a function.

In [6]:
year = 2018

def schedule_to_date_gameids(url,year):
    '''
    Extract the dates and game ID's from a schedule of games.
    
    Input:
    url (schedule of games played with scores)
    year (final calendar year of the season, e.g. 2018 for the 2017-18 season)
    
    Output:
    List (pairs of dates and game ID's)
    '''
    
    page = requests.get(url)
    html = page.content
    soup = BeautifulSoup(html, 'lxml')
    
    tb = soup.find_all('table')[0]
    
    date_gameid = []
    
    for row in tb.find_all('tr', {'class':['oddrow', 'evenrow']}):
        
        date = row.find_all('td')[0].contents[0].strip() #date in format DAY, MONTH DATE
        date = date.split(',')[1][1:] #store MONTH DATE
        
        #convert into datetime.date object (wrong year)
        date_object = datetime.strptime(date, '%b %d').date()
        
        #correct year of game (games from September onward belong to the first calendar year of the season)
        if date_object.month < 9:
            game_year = year
        else:
            game_year = year - 1
        
        date_object = date_object.replace(year=game_year)
        
        game_link = row.find_all('li', {'class':'score'})
        
        try:
            #game ID at end of link to recap of game
            game_id = game_link[0].contents[0].get('href').split('/')[-1]
            date_gameid.append([date_object, game_id])
            
        except IndexError as error:
            pass
    
    return date_gameid

We test this function on the 2018 playoffs schedule of the Warriors and find that it works.

In [7]:
start_time = time.time()

url = 'http://www.espn.com/nba/team/schedule/_/name/gs'

date_gameid = schedule_to_date_gameids(url,year)
print(date_gameid)

print('\n' + '-'*50 + '\n')

print('Program takes ', time.time() - start_time, 'seconds.')

[[datetime.date(2018, 4, 14), '401029441'], [datetime.date(2018, 4, 16), '401029446'], [datetime.date(2018, 4, 19), '401029453'], [datetime.date(2018, 4, 22), '401029455'], [datetime.date(2018, 4, 24), '401029456'], [datetime.date(2018, 4, 28), '401031412'], [datetime.date(2018, 5, 1), '401031645'], [datetime.date(2018, 5, 4), '401031646'], [datetime.date(2018, 5, 6), '401031647'], [datetime.date(2018, 5, 8), '401031648'], [datetime.date(2018, 5, 14), '401032761'], [datetime.date(2018, 5, 16), '401032762'], [datetime.date(2018, 5, 20), '401032763'], [datetime.date(2018, 5, 22), '401032764']]

--------------------------------------------------

Program takes  0.20714592933654785 seconds.


The function works! It also takes a negligible amount of time relative to compiling the list of Team Stats DataFrames.

We now test this function on a regular season schedule.

In [8]:
#base url for regular season schedules
regular_base_url = 'http://www.espn.com/nba/team/schedule/_/name/gs/year/{}/seasontype/2'

schedule_2018_url = regular_base_url.format(str(year))

date_gameid_2018_regular = schedule_to_date_gameids(schedule_2018_url,year)

print(date_gameid_2018_regular[:10])

[[datetime.date(2017, 10, 17), '400974438'], [datetime.date(2017, 10, 20), '400974444'], [datetime.date(2017, 10, 21), '400974784'], [datetime.date(2017, 10, 23), '400974796'], [datetime.date(2017, 10, 25), '400974814'], [datetime.date(2017, 10, 27), '400974826'], [datetime.date(2017, 10, 29), '400974842'], [datetime.date(2017, 10, 30), '400974851'], [datetime.date(2017, 11, 2), '400974868'], [datetime.date(2017, 11, 4), '400974886']]


We see that it works again! We will now write a function that takes in a schedule url and returns a list of DataFrames. It will use both the "schedule_to_date_gameids" function and the "teamstats" function.

We first write a function that takes in a game pair (a list of a date and game ID) and returns a DataFrame with the date and team stats.

In [9]:
def gamepair_to_dataframe(game):
    '''
    Turns a pair of a date and game ID into a DataFrame.
    
    Input:
    list of a datetime.date object and a string (date of game and game ID)
    
    Output:
    DataFrame with 2 rows and 33 columns
    '''
    
    #base URL to team stats table
    teamstat_base_url = 'http://www.espn.com/nba/matchup?gameId={}'
    
    stats_df = teamstats(teamstat_base_url.format(game[1])) #attach game ID to base url
    
    #add column for game date
    stats_df['Date'] = pd.Series([game[0], game[0]])
    
    #switch date to first column
    cols = stats_df.columns.tolist()
    cols = [cols[-1]] + cols[:-1]
    stats_df = stats_df[cols]
    
    return stats_df    

In [10]:
def schedule_to_teamstats_dataframes(url, year):
    '''
    Takes a url containing a schedule of games and returns a list of team stats tables for the games.
    
    Input:
    url (to schedule of games)
    year (final calendar year of the season)
    
    Output:
    list of DataFrames (of team stats tables)
    '''
    
    #list of dates and game ID's
    date_gameid = schedule_to_date_gameids(url, year)
    
    stats_df_list = [gamepair_to_dataframe(game) for game in date_gameid]
    
    return stats_df_list

start_time = time.time()

#store all DataFrames of team stats
all_game_dataframes = schedule_to_teamstats_dataframes(schedule_2018_url, year)

#check that list has the right number of games and that the first DataFrame looks as expected
print(len(all_game_dataframes))
print(all_game_dataframes[0])


print('Program took ', time.time() - start_time, 'seconds.')
    


82
         Date Team name  1st Qtr Points  2nd Qtr Points  3rd Qtr Points  \
0  2017-10-17       HOU              34              28              26   
1  2017-10-17        GS              35              36              30   

   4th Qtr Points  Total Points  FG Made  FG Attempted  Field Goal %  \
0              34           122       47            97          48.5   
1              20           121       43            80          53.8   

       ...        Points Off Turnovers  Fast Break Points  Points in Paint  \
0      ...                          11                 15               54   
1      ...                          21                 36               32   

   Personal Fouls  Technical Fouls  Flagrant Fouls  Number of OT Periods  \
0              16                1               1                     0   
1              25                0               0                     0   

   OT Points  Won?  Away or home?  
0         []     1           Away  
1         []     0           Home  

We will collect the Warriors' rows from every game into one DataFrame and their opponents' rows into a second DataFrame. Because the rows are copied one at a time into DataFrames that start out empty, the columns generally end up with the generic object dtype rather than int or float, so we will need to convert most of the columns back to either type float or int.

In [11]:
#retrieve columns of each DataFrame
cols = all_game_dataframes[0].columns.tolist()

warriors_team_stats = pd.DataFrame(columns=cols)
opponents_team_stats = pd.DataFrame(columns=cols)

for idx, game_df in enumerate(all_game_dataframes):
        
    if game_df.loc[0,'Team name'] == 'GS': #warriors are away team (row 0 is the visitor)
        warriors_team_stats.loc[idx+1] = game_df.loc[0,:]
        opponents_team_stats.loc[idx+1] = game_df.loc[1,:]
    
    elif game_df.loc[1,'Team name'] == 'GS': #warriors are home team (row 1 is the home team)
        warriors_team_stats.loc[idx+1] = game_df.loc[1,:]
        opponents_team_stats.loc[idx+1] = game_df.loc[0,:]



In [12]:
#rows copied into an initially empty DataFrame leave the columns with a generic dtype
#change back to type (int or float)

for col in cols:
    if col in ['Date', 'Team name', 'OT Points', 'Away or home?']:
        continue
    #convert all percentage stats into type float
    if '%' in col:
        warriors_team_stats[col] = warriors_team_stats[col].astype(float)
        opponents_team_stats[col] = opponents_team_stats[col].astype(float)
    #convert all other stats to type int
    else:
        warriors_team_stats[col] = warriors_team_stats[col].astype(int)
        opponents_team_stats[col] = opponents_team_stats[col].astype(int)
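
A possible alternative, sketched here under assumptions, is to stack the per-game DataFrames with `pd.concat` and split them with a boolean mask. The columns then keep their numeric dtypes and no conversion is needed afterwards. The two small frames below are stand-ins with only three of the real columns, and the second game's values are illustrative.

```python
import pandas as pd

# Stand-ins for two scraped game tables (row 0 is the visitor, row 1 is the home team)
games = [
    pd.DataFrame({'Team name': ['HOU', 'GS'], 'Total Points': [122, 121],
                  'Field Goal %': [48.5, 53.8]}),
    pd.DataFrame({'Team name': ['GS', 'NO'], 'Total Points': [128, 120],
                  'Field Goal %': [50.0, 47.0]}),  # illustrative values
]

# Stack all games; the outer index level numbers the games starting from 1
all_rows = pd.concat(games, keys=list(range(1, len(games) + 1)), names=['Game', 'Row'])

is_warriors = all_rows['Team name'] == 'GS'
warriors = all_rows[is_warriors].droplevel('Row')
opponents = all_rows[~is_warriors].droplevel('Row')

print(warriors.dtypes)
```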

We finally print out our final DataFrames for Warriors and opponents stats.

In [13]:
print(warriors_team_stats)

          Date Team name  1st Qtr Points  2nd Qtr Points  3rd Qtr Points  \
1   2017-10-17        GS              35              36              30   
2   2017-10-20        GS              26              35              37   
3   2017-10-21        GS              26              25              20   
4   2017-10-23        GS              40              25              34   
5   2017-10-25        GS              29              32              30   
6   2017-10-27        GS              27              26              34   
7   2017-10-29        GS              35              22              24   
8   2017-10-30        GS              34              40              33   
9   2017-11-02        GS              24              26              34   
10  2017-11-04        GS              36              24              43   
11  2017-11-06        GS              22              28              25   
12  2017-11-08        GS              22              29              44   
13  2017-11-

In [15]:
print(opponents_team_stats)

          Date Team name  1st Qtr Points  2nd Qtr Points  3rd Qtr Points  \
1   2017-10-17       HOU              34              28              26   
2   2017-10-20        NO              39              25              26   
3   2017-10-21       MEM              31              25              32   
4   2017-10-23       DAL              24              38              22   
5   2017-10-25       TOR              26              27              33   
6   2017-10-27       WSH              34              33              30   
7   2017-10-29       DET              27              25              30   
8   2017-10-30       LAC              28              29              29   
9   2017-11-02        SA              33              22              23   
10  2017-11-04       DEN              23              32              21   
11  2017-11-06       MIA              21              16              20   
12  2017-11-08       MIN              22              28              26   
13  2017-11-

We conclude by saving these two DataFrames to csv files.

In [17]:
warriors_team_stats.to_csv('2018-regular-warriors', sep=',')
opponents_team_stats.to_csv('2018-regular-warriors-opponents', sep=',')
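
When the files are read back later, the dates come back as plain strings unless pandas is told to parse them. The sketch below shows one way to do this on a one-row stand-in table, writing to an in-memory buffer instead of a file on disk.

```python
import io
from datetime import date
import pandas as pd

# One-row stand-in for the saved table
df = pd.DataFrame({'Date': [date(2018, 4, 14)], 'Team name': ['GS'], 'Total Points': [113]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row labels
buffer.seek(0)

# parse_dates turns the Date column back into datetime values
restored = pd.read_csv(buffer, parse_dates=['Date'])
print(restored.dtypes)
```

A list-valued column such as 'OT Points' would still come back as text (for example '[]') and would need its own parsing step.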