## Grabbing all player stats

In the last notebook "Scraping_all_team_stats- June11", we grabbed the team stats tables for every game since the 2008-2009 season and stored them in a single DataFrame/CSV. In the notebook before that "Accumulating all games since 2004-2005 season-June11", we stored information on every game since the 2004-2005 season, including Matchup ID's.

We will build upon this by scraping all of the player stats table since 2004-2005.

In [2]:
#necessary libraries

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
from nose.tools import assert_equal

## Scraping one player stats table

We will begin by scraping the player stats table for one game. For this, we choose Game 3 of the 2018 NBA Finals, where the Golden State Warriors beat the Cleveland Cavaliers 110-102 to win the 2017-2018 NBA championship. This game will be most well-known for Stephen Curry only 1-10 from 3PointLand and Kevin Durant hitting a massively important three-pointer with less than a minute left to then push the score to 106-100. The box score to be analyzed can be found at http://www.espn.com/nba/boxscore?gameId=401034615.

We decided to choose analyzing this game over analyzing Game 4 because Game 3 includes DNP players (Did Not Play). We will need to decide how to handle these players.

In [2]:
#matchup ID for Warriors-Cavs game
matchup_id = 401034615

#url to game's boxscore
boxscore_url = 'http://www.espn.com/nba/boxscore?gameId={0}'.format(matchup_id)

print(boxscore_url)

http://www.espn.com/nba/boxscore?gameId=401034615


In [3]:
html = requests.get(boxscore_url).content

soup = BeautifulSoup(html, 'lxml')

tables = soup.find_all('tbody')

print(len(tables))

7


We print the beginning of each table to understand their contents.

In [4]:
for idx in range(len(tables)):
    print('Table {0}: '.format(idx))
    print(tables[idx].prettify()[:500])
    
    print('-'*50)

Table 0: 
<tbody>
 <tr>
  <td class="team-name">
   GS
  </td>
  <td>
   28
  </td>
  <td>
   24
  </td>
  <td>
   31
  </td>
  <td>
   27
  </td>
  <td class="final-score">
   110
  </td>
 </tr>
 <tr>
  <td class="team-name">
   CLE
  </td>
  <td>
   29
  </td>
  <td>
   29
  </td>
  <td>
   23
  </td>
  <td>
   21
  </td>
  <td class="final-score">
   102
  </td>
 </tr>
</tbody>
--------------------------------------------------
Table 1: 
<tbody>
 <tr>
  <td class="name">
   <a href="http://www.espn.com/nba/player/_/id/6589/draymond-green" name="&amp;lpos=nba:game:boxscore:playercard">
    <span>
     D. Green
    </span>
    <span class="abbr">
     D. Green
    </span>
   </a>
   <span class="position">
    PF
   </span>
  </td>
  <td class="min">
   40
  </td>
  <td class="fg">
   4-8
  </td>
  <td class="3pt">
   0-2
  </td>
  <td class="ft">
   2-2
  </td>
  <td class="oreb">
   0
  </td>
  <td class="dreb">
   2
  </td>
  <
--------------------------------------------------
Tab

We find:
    
- `Table 0`: contains team names, points scored per quarter, and points scored in the game.
- `Table 1`: contains visiting team's starting players stats
- `Table 2`: contains visiting team's bench players stats
- `Table 3`: contains home team's starting players stats
- `Table 4`: contains home team's bench player stats
- `Table 5`: contains visiting team's cumulative stats
- `Table 6`: contains home team's cumulative stats

One curious issue we have to figure out how to handle is the case of players who Do Not Play. In the box score, they have a description `DNP-` followed by an explanation of why they didn't play. We will see how that looks in the html file. More specifically, we will see how Zaza Pachulia's DNP looks like in table 2.

In [5]:
#find which row Zaza Pachulia is in
divisions = tables[2].find_all('td',{'class':'name'})
for idx, division in enumerate(divisions):
    print(idx, ': ', division.contents[0])

0 :  <a href="http://www.espn.com/nba/player/_/id/2177/david-west" name="&amp;lpos=nba:game:boxscore:playercard"><span>D. West</span><span class="abbr">D. West</span></a>
1 :  <a href="http://www.espn.com/nba/player/_/id/3155535/kevon-looney" name="&amp;lpos=nba:game:boxscore:playercard"><span>K. Looney</span><span class="abbr">K. Looney</span></a>
2 :  <a href="http://www.espn.com/nba/player/_/id/2386/andre-iguodala" name="&amp;lpos=nba:game:boxscore:playercard"><span>A. Iguodala</span><span class="abbr">A. Iguodala</span></a>
3 :  <a href="http://www.espn.com/nba/player/_/id/3064427/jordan-bell" name="&amp;lpos=nba:game:boxscore:playercard"><span>J. Bell</span><span class="abbr">J. Bell</span></a>
4 :  <a href="http://www.espn.com/nba/player/_/id/2393/shaun-livingston" name="&amp;lpos=nba:game:boxscore:playercard"><span>S. Livingston</span><span class="abbr">S. Livingston</span></a>
5 :  <a href="http://www.espn.com/nba/player/_/id/3243/nick-young" name="&amp;lpos=nba:game:boxscore:p

We see that Zaza occurs at the index 7. We use that to print Zaza's entry in table 2.

In [6]:
zaza_row = tables[2].find_all('tr')[7]
print(zaza_row.prettify())

<tr>
 <td class="name">
  <a href="http://www.espn.com/nba/player/_/id/2016/zaza-pachulia" name="&amp;lpos=nba:game:boxscore:playercard">
   <span>
    Z. Pachulia
   </span>
   <span class="abbr">
    Z. Pachulia
   </span>
  </a>
  <span class="position">
   C
  </span>
 </td>
 <td class="dnp" colspan="14">
  DNP-COACH'S DECISION
 </td>
</tr>



We see that Zaza's row includes his abbreviation, position, and a division of class `"dnp"`. We can compare this with a standard row of a player who played. Here, we take the first bench player, David West.

In [7]:
west_row = tables[2].find_all('tr')[0]
print(west_row.prettify())

<tr>
 <td class="name">
  <a href="http://www.espn.com/nba/player/_/id/2177/david-west" name="&amp;lpos=nba:game:boxscore:playercard">
   <span>
    D. West
   </span>
   <span class="abbr">
    D. West
   </span>
  </a>
  <span class="position">
   PF
  </span>
 </td>
 <td class="min">
  5
 </td>
 <td class="fg">
  0-2
 </td>
 <td class="3pt">
  0-0
 </td>
 <td class="ft">
  0-0
 </td>
 <td class="oreb">
  1
 </td>
 <td class="dreb">
  1
 </td>
 <td class="reb">
  2
 </td>
 <td class="ast">
  0
 </td>
 <td class="stl">
  0
 </td>
 <td class="blk">
  0
 </td>
 <td class="to">
  1
 </td>
 <td class="pf">
  0
 </td>
 <td class="plusminus">
  -3
 </td>
 <td class="pts">
  0
 </td>
</tr>



We would like to store all of these players stats in a DataFrame. Each record in this DataFrame should represent a player's performance in a game. 

The way we will handle `DNP`'s is having a column that states whether a player played or not. For each of the stats of a `DNP` player, we will store 0. We will just need to remember to take into account later of whether the player played or not.

Our DataFrame will have the following 21 columns (with all stats representing performance in game): 

1. `team_name`: name of player's team.
2. `player_name`: player norm in form "stephen-curry", this will be taken from the link of the player (which looks like http://www.espn.com/nba/player/_/id/3975/stephen-curry).
3. `player_id`: each player is assigned a unique player ID (e.g., 3975 for Stephen Curry). This ID will taken by the player link.
4. `position`: position of the player (e.g., 'P.G.' or 'C'.)
5. `started`: 'yes' if starting player, 'no' if bench or DNP.
6. `played`: string that is 'yes' if player played, 'no' if player didn't play.
7. `min`: minutes played by player.
8. `fg`: field goals made and attempted (e.g., string '3-5' if player made 3 out of 5 FG attempts).
9. `3pt`: 3-point fields made and attempted (e.g., string '4-6' if player made 4 of out 6 3PT FG attempts).
10. `ft`: free throws made and attempted (e.g., string '7-10' if player made 7 out of 10 FT attempts.)
11. `oreb`: offensive rebounds.
12. `dreb`: defensive rebounds.
13. `reb`: total rebounds (oreb + dreb).
14. `ast`: assists.
15. `stl`: steals.
16. `blk`: blocks.
17. `to`: turnovers.
18. `pf`: personal fouls.
19. `plusminus`: plus/minus (points scored by player's team - points scored by opponent's team while player was playing).
20. `pts`: points scored by player.
21. `matchup_id`: matchup ID of game.

In [8]:
# gather team names
[visiting_team, home_team] = [team.contents[0] for team in tables[0].find_all('td',{'class':'team-name'})]
print(visiting_team)
print(home_team)

GS
CLE


In [9]:
'''
get all column names
'''

#start with one player, like Draymond Green, and gather names from class names
divisions = tables[1].find_all('tr')[0].find_all('td')

column_names = [division.get('class')[0] for division in divisions]
print(column_names)

['name', 'min', 'fg', '3pt', 'ft', 'oreb', 'dreb', 'reb', 'ast', 'stl', 'blk', 'to', 'pf', 'plusminus', 'pts']


This has almost all of the column names. We will manually add the rest in and remove 'name'.

In [10]:
column_names = ['team_name', 'player_name', 'player_id', 'position', 'started', 'played'] + column_names[1:] + ['matchup_id']

print(str(len(column_names)) + ' columns.')
print(column_names)

21 columns.
['team_name', 'player_name', 'player_id', 'position', 'started', 'played', 'min', 'fg', '3pt', 'ft', 'oreb', 'dreb', 'reb', 'ast', 'stl', 'blk', 'to', 'pf', 'plusminus', 'pts', 'matchup_id']


We will now gather the stats for all of the home team's starting players. We will start by just pulling the stats for the first player, Draymond Green.

In [11]:
home_starting = tables[1].find_all('tr')

print(home_starting[0].prettify())

<tr>
 <td class="name">
  <a href="http://www.espn.com/nba/player/_/id/6589/draymond-green" name="&amp;lpos=nba:game:boxscore:playercard">
   <span>
    D. Green
   </span>
   <span class="abbr">
    D. Green
   </span>
  </a>
  <span class="position">
   PF
  </span>
 </td>
 <td class="min">
  40
 </td>
 <td class="fg">
  4-8
 </td>
 <td class="3pt">
  0-2
 </td>
 <td class="ft">
  2-2
 </td>
 <td class="oreb">
  0
 </td>
 <td class="dreb">
  2
 </td>
 <td class="reb">
  2
 </td>
 <td class="ast">
  9
 </td>
 <td class="stl">
  2
 </td>
 <td class="blk">
  0
 </td>
 <td class="to">
  2
 </td>
 <td class="pf">
  4
 </td>
 <td class="plusminus">
  -2
 </td>
 <td class="pts">
  10
 </td>
</tr>



In [12]:
home_starting = tables[1].find_all('tr') #list all rows (same as players)

green_stats = home_starting[0].find_all('td') #stats for Draymond Green
print(green_stats)

[<td class="name"><a href="http://www.espn.com/nba/player/_/id/6589/draymond-green" name="&amp;lpos=nba:game:boxscore:playercard"><span>D. Green</span><span class="abbr">D. Green</span></a><span class="position">PF</span></td>, <td class="min">40</td>, <td class="fg">4-8</td>, <td class="3pt">0-2</td>, <td class="ft">2-2</td>, <td class="oreb">0</td>, <td class="dreb">2</td>, <td class="reb">2</td>, <td class="ast">9</td>, <td class="stl">2</td>, <td class="blk">0</td>, <td class="to">2</td>, <td class="pf">4</td>, <td class="plusminus">-2</td>, <td class="pts">10</td>]


In [13]:
name_stat = green_stats[0]
player_url = name_stat.contents[0].get('href') #link to player's personal ESPN page
print('Player url: ' + player_url)

#extract player ID, name, and position
player_id = player_url.split('/')[-2]
player_name = player_url.split('/')[-1]
player_position = name_stat.contents[1].contents[0]
print('Player ID: ' + player_id)
print('Player name: ' + player_name)
print('Player position: ' + player_position)

Player url: http://www.espn.com/nba/player/_/id/6589/draymond-green
Player ID: 6589
Player name: draymond-green
Player position: PF


We will now retreive each of the other stats. Luckily, the class name for each stat (besides name) is the column name for the relevant stat in the DataFrame. This coincidence will help us easily retrieve all of the stats. We will store Green's stats in a dictionary to make it easier to convert into a DataFrame later.

In [14]:
green_dict = {column:0 for column in column_names}

green_dict['player_name'] = player_name
green_dict['player_id'] = player_id
green_dict['played'] = 'yes'
green_dict['position'] = player_position
green_dict['matchup_id'] = matchup_id
green_dict['started'] = 'yes'

for idx in range(1,len(green_stats)):
    green_dict[green_stats[idx].get('class')[0]] = green_stats[idx].contents[0]
    
print(green_dict)

{'team_name': 0, 'player_name': 'draymond-green', 'player_id': '6589', 'position': 'PF', 'started': 'yes', 'played': 'yes', 'min': '40', 'fg': '4-8', '3pt': '0-2', 'ft': '2-2', 'oreb': '0', 'dreb': '2', 'reb': '2', 'ast': '9', 'stl': '2', 'blk': '0', 'to': '2', 'pf': '4', 'plusminus': '-2', 'pts': '10', 'matchup_id': 401034615}


We could write retreiving all of the stats as functions. But first we will handle the case of `DNP` players. These players will be treated slightly different since their player records for the game don't contain certain stats, like assists and rebounds. Thus, we will first write a function that determines if a player played.

Recall that Zaza Pachulia didn't play in Game 3 of the 2018 Finals. Looking at his html file above, we see that his row has a division of class `DNP` while players who play do not have this division. This motivates the writing of the following function.

In [15]:
def did_play(row):
    '''
    Checks if player (represented by row) played in game.
    
    Input:
    row: row in html file
    
    Output:
    'yes' if player played in game
    'no' if player didn't play in game
    '''
    
    #player didn't play if division of class 'dnp' exists
    dnp_exists = len(row.find_all('td',{'class':'dnp'}))
    if dnp_exists == 0:
        return 'yes'
    elif dnp_exists == 1:
        return 'no'
    else:
        return 'dnp error'

We check if this code worked on Zaza Pachulia (who didn't play) and David West (who did play).

In [16]:
print('Zaza Pachulia: ' + did_play(zaza_row))

print('David West: ' + did_play(west_row))

Zaza Pachulia: no
David West: yes


We now write functions to extract each of the players' personal features and stats.

In [17]:
def get_player_feature(row, feature):
    '''
    Extracts a player's name, ID, or position.
    
    Input:
    row: html representing player
    feature: string ('player_name', 'player_id', or 'position')
    
    Output:
    string: desired feature of player
    '''
    
    if feature == 'position':
        return row.find_all('td')[0].contents[1].contents[0]        
    
    else:
        #print(row.prettify())
        player_url = row.find_all('td')[0].contents[0].get('href') #link to player's personal ESPN page
        if feature == 'player_name':
            return player_url.split('/')[-1]
        elif feature == 'player_id':
            return player_url.split('/')[-2]


def get_player_stat(row, idx, played):
    '''
    Extract relevant stat based on index if player played. Return 0 otherwise.
    
    Input:
    row: html representing player
    idx: int at least 1 (index for stat based on column names)
    played: 'yes' (played) or 'no' (didn't play)
    
    Output:
    string: value for stat
    '''
    
    #player played, so return relevant stat
    if played == 'yes':
        stats_list = row.find_all('td')
        return stats_list[idx].contents[0]        
    
    #player did not play
    #all stats should be 0 or '0-0'
    elif played == 'no':
        if idx in (2,3,4):
            return '0-0'
        else:
            return '0'

We check these stats against stats shown earlier for Green in the game and on Zaza.

In [18]:
assert_equal(get_player_feature(zaza_row, 'player_name'), 'zaza-pachulia')
assert_equal(get_player_feature(zaza_row, 'position'), 'C')
assert_equal(get_player_feature(zaza_row, 'player_id'), '2016')
assert_equal(get_player_stat(zaza_row,1,'no'),'0') #stat for minutes played
assert_equal(get_player_stat(zaza_row,2,'no'),'0-0') #stat for 3 pointers 

green_row = home_starting = tables[1].find_all('tr')[0] #row for Draymond Green
assert_equal(get_player_feature(green_row, 'player_name'), 'draymond-green')
assert_equal(get_player_stat(green_row, 1,'yes'), '40') #stat for minutes played

Our eventual goal will be to loop over boxscores by looping over matchup ID's and using the base url for NBA box scores. In each iteration, we will loop over the four tables of player stats- two tables of player stats for each team, one for starters and one for bench players. One weird case that could happen is if one team never plays any bench players over the course of a game, but let's hope for now that this case doesn't arise to be dealt with.

To achieve our goal, we will eventually write a function that takes a matchup ID as input and spits out a table of all of the players stats as output. Within this, we will need to employ another function that accepts each of the four box score's tables as input and spits out a DataFrame describing that table. We will do this now.

In [19]:
def table_specific(table_index, starting_or_team, visiting_team='', home_team=''):
    '''
    Returns if players in table started or which team players in table played on.
    
    Input:
    table_index: int (between 1 and 4)
    starting_or_team: string for column interested in ('started' or 'team_name')
    visiting_team: name of visiting team (only necessary if interested in 'team_name)
    home_team: name of home team (only necessary if interested in 'team_name)
    '''
    
    #interested in whether players started in table
    if starting_or_team == 'started':
        
        if table_index in (1,3):
            return 'yes'
        elif table_index in (2,4):
            return 'no'
    
    #interested in team player played on
    elif starting_or_team == 'team_name':
        
        if table_index in (1,2):
            return visiting_team
        elif table_index in (3,4):
            return home_team

In [20]:
#check that function works on GS-CLE game

assert_equal(table_specific(1,'started'), 'yes')
assert_equal(table_specific(4,'started'), 'no')

visiting_team = 'GS'
home_team = 'CLE'

assert_equal(table_specific(2,'team_name', visiting_team, home_team), visiting_team)
assert_equal(table_specific(3,'team_name', visiting_team, home_team), home_team)

In [58]:
def table_to_dataframe(table, table_index, match_id, visiting_team, home_team, stat_names, column_names):
    '''
    Stores all of the player stats in a table to a DataFrame.
    
    Input:
    table: html tbody of player stats
    table_index: int (between 1 and 4)
    matchup_id: Matchup ID for game
    visiting_team: string (abbreviation)
    home_team: string (abbreviation)
    stat_names: list of names of 15 standard stats ('name', 'min', ... 'pts')
    column_names: all column names of DataFrame
    
    Output:
    DataFrame containing all players' information and performance for game 
    (21 columns, number of rows equals number of players in table)
    '''
    
    #only finds player rows and not team stats rows
    player_rows = table.find_all('tr',{'class':None})
    number_of_players = len(player_rows)
    
    #keys: column names
    #values: lists of stats of all players in table
    player_stats = {}
    
    player_stats['team_name'] = [table_specific(table_index,'team_name', visiting_team, home_team)] * number_of_players
    
    #extract player name stats
    for player_feature in ('player_name', 'player_id', 'position'):
        player_stats[player_feature] = [get_player_feature(row,player_feature)\
                                        for row in player_rows]
        
    player_stats['started'] = [table_specific(table_index, 'started')] * number_of_players
    
    player_stats['played'] = [did_play(row) for row in player_rows]
    
    for idx in range(1,15):
        player_stats[stat_names[idx]] = [get_player_stat(row,idx,did_play(row)) for row in player_rows]
        
    player_stats['matchup_id'] = [match_id] * number_of_players
    
    #DataFrame from player_stats dictionary
    player_stats_df = pd.DataFrame.from_dict(player_stats)
    
    
    
    return player_stats_df[column_names]


    

In [22]:
stat_names = [division.get('class')[0] for division in divisions]

warriors_starting_players_stats = table_to_dataframe(tables[1], 1, matchup_id, visiting_team, home_team, stat_names, column_names)
warriors_starting_players_stats

Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg,3pt,ft,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,GS,draymond-green,6589,PF,yes,yes,40,4-8,0-2,2-2,0,2,2,9,2,0,2,4,-2,10,401034615
1,GS,kevin-durant,3202,SF,yes,yes,43,15-23,6-9,7-7,1,12,13,7,1,1,3,3,15,43,401034615
2,GS,javale-mcgee,3452,C,yes,yes,14,5-7,0-0,0-0,2,1,3,0,0,2,1,1,3,10,401034615
3,GS,stephen-curry,3975,PG,yes,yes,39,3-16,1-10,4-4,0,5,5,6,1,0,2,3,0,11,401034615
4,GS,klay-thompson,6475,SG,yes,yes,41,4-11,2-5,0-0,0,4,4,2,1,1,0,3,14,10,401034615


In [23]:
warriors_bench_players_stats = table_to_dataframe(tables[2], 2, matchup_id, visiting_team, home_team, stat_names, column_names)
warriors_bench_players_stats


Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg,3pt,ft,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,GS,david-west,2177,PF,no,yes,5,0-2,0-0,0-0,1,1,2,0,0,0,1,0,-3,0,401034615
1,GS,kevon-looney,3155535,SF,no,yes,0,0-0,0-0,0-0,0,0,0,0,0,0,0,0,0,0,401034615
2,GS,andre-iguodala,2386,SF,no,yes,22,3-4,0-0,2-2,0,2,2,1,1,0,1,3,14,8,401034615
3,GS,jordan-bell,3064427,C,no,yes,12,4-5,0-0,2-4,2,4,6,0,0,1,0,0,0,10,401034615
4,GS,shaun-livingston,2393,PG,no,yes,17,4-5,0-0,0-0,0,0,0,2,0,0,0,3,3,8,401034615
5,GS,nick-young,3243,SG,no,yes,5,0-0,0-0,0-0,0,0,0,0,0,0,0,0,-5,0,401034615
6,GS,patrick-mccaw,3137730,SG,no,yes,1,0-0,0-0,0-0,0,0,0,0,0,0,0,0,1,0,401034615
7,GS,zaza-pachulia,2016,C,no,no,0,0-0,0-0,0-0,0,0,0,0,0,0,0,0,0,0,401034615


In [24]:
cavs_starting_players_stats = table_to_dataframe(tables[3], 3, matchup_id, visiting_team, home_team, stat_names, column_names)
cavs_starting_players_stats

Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg,3pt,ft,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,CLE,lebron-james,1966,SF,yes,yes,47,13-28,1-6,6-7,3,7,10,11,2,2,4,2,-8,33,401034615
1,CLE,kevin-love,3449,C,yes,yes,31,6-13,3-7,5-5,5,8,13,3,1,0,2,2,-1,20,401034615
2,CLE,tristan-thompson,6474,C,yes,yes,34,4-8,0-0,0-1,1,6,7,0,0,1,0,2,-2,8,401034615
3,CLE,george-hill,3438,PG,yes,yes,27,2-6,1-2,0-0,1,1,2,4,0,0,4,3,-2,5,401034615
4,CLE,jr-smith,2444,SG,yes,yes,33,5-14,3-10,0-0,1,3,4,0,3,0,0,4,-4,13,401034615


In [25]:
cavs_bench_players_stats = table_to_dataframe(tables[4], 4, matchup_id, visiting_team, home_team, stat_names, column_names)
cavs_bench_players_stats

Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg,3pt,ft,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,CLE,larry-nance-jr,2580365,PF,no,yes,13,2-4,0-0,1-2,2,1,3,1,0,0,1,1,-4,5,401034615
1,CLE,jeff-green,3209,SF,no,yes,18,1-4,1-3,0-0,0,0,0,1,0,0,1,1,-7,3,401034615
2,CLE,kyle-korver,2011,SG,no,yes,11,0-4,0-2,0-0,0,2,2,0,0,0,1,1,0,0,401034615
3,CLE,rodney-hood,2581177,SG,no,yes,26,7-11,0-1,1-2,2,4,6,0,0,1,0,2,-12,15,401034615
4,CLE,ante-zizic,4017838,PF,no,no,0,0-0,0-0,0-0,0,0,0,0,0,0,0,0,0,0,401034615
5,CLE,cedi-osman,3893016,SF,no,no,0,0-0,0-0,0-0,0,0,0,0,0,0,0,0,0,0,401034615
6,CLE,jose-calderon,2806,PG,no,no,0,0-0,0-0,0-0,0,0,0,0,0,0,0,0,0,0,401034615
7,CLE,jordan-clarkson,2528426,PG,no,no,0,0-0,0-0,0-0,0,0,0,0,0,0,0,0,0,0,401034615


Comparing with the boxscore at http://www.espn.com/nba/boxscore?gameId=401034615, these DataFrames are correct. However, all of the columns (except for matchup_id) are of type `string`. We would all of the stat columns to be of type `int`.

However, the worse issue is that the shot columns (`fg`, `3pt`, and `ft`) can't be converted into type `int` by simply casting into `int`. We will need to split each of these columns into two columns (one for made shots and one for attempted shots), taking the spot of the original column. We write a function that does this.

Before, we find the indices at which 'fg', '3pt', and 'ft' occur.

In [26]:
old_column_names = warriors_starting_players_stats.columns.tolist()

print('fg: index ' + str(column_names.index('fg')))
print('3pt: index ' + str(column_names.index('3pt')))
print('ft: index ' + str(column_names.index('ft')))

fg: index 7
3pt: index 8
ft: index 9


In [49]:
def split_shot_columns(stats_df):
    '''
    Replaces each shot column with two columns of made and attempted shots.
    
    Input:
    stats_df: DataFrame to be split
    
    Output:
    new DataFrame with three shot columns split
    '''
    
    stats_df_copy = stats_df.copy() #make deep copy
    
    columns_to_replace = ['fg', '3pt', 'ft']
    
    for column_name in columns_to_replace:
        column = stats_df_copy.loc[:,column_name]
        stats_df_copy.loc[:,column_name + '_made'] = column.apply(lambda x: x.split('-')[0])
        stats_df_copy.loc[:,column_name + '_attempted'] = column.apply(lambda x: x.split('-')[1])
        
    new_columns = stats_df_copy.columns.tolist()[:7] \
                    + [shot + made_or_att for shot in columns_to_replace for made_or_att in ['_made', '_attempted']] \
                    + stats_df_copy.columns.tolist()[10:]
            
    return stats_df_copy[new_columns].loc[:,:'matchup_id']
    
    

In [28]:
split_warriors_starting_players_stats = split_shot_columns(warriors_starting_players_stats)

print(split_warriors_starting_players_stats.shape)

split_warriors_starting_players_stats

(5, 24)


Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg_made,fg_attempted,3pt_made,3pt_attempted,ft_made,ft_attempted,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,GS,draymond-green,6589,PF,yes,yes,40,4,8,0,2,2,2,0,2,2,9,2,0,2,4,-2,10,401034615
1,GS,kevin-durant,3202,SF,yes,yes,43,15,23,6,9,7,7,1,12,13,7,1,1,3,3,15,43,401034615
2,GS,javale-mcgee,3452,C,yes,yes,14,5,7,0,0,0,0,2,1,3,0,0,2,1,1,3,10,401034615
3,GS,stephen-curry,3975,PG,yes,yes,39,3,16,1,10,4,4,0,5,5,6,1,0,2,3,0,11,401034615
4,GS,klay-thompson,6475,SG,yes,yes,41,4,11,2,5,0,0,0,4,4,2,1,1,0,3,14,10,401034615


Before combining all 4 tables, we take note all columns to convert to type `int`, which consists of player ID to all stats past `played`.

In [29]:
split_column_names = split_warriors_starting_players_stats.columns.tolist()
minutes_index = split_column_names.index('min')
columns_to_int = ['player_id'] + [split_column_names[idx] for idx in range(minutes_index, len(split_column_names))]
print(columns_to_int)

['player_id', 'min', 'fg_made', 'fg_attempted', '3pt_made', '3pt_attempted', 'ft_made', 'ft_attempted', 'oreb', 'dreb', 'reb', 'ast', 'stl', 'blk', 'to', 'pf', 'plusminus', 'pts', 'matchup_id']


In [54]:
def get_game_player_stats(match_id, columns_to_int):
    '''
    Get DataFrame of all player stats from game.
    
    Input:
    match_id: Matchup ID of game.
    columns_to_int: list of columns to convert to type int.
    
    Output:
    DataFrame with 27 columns and number of rows equal to number of players in game.
    '''
    
    base_url = 'http://www.espn.com/nba/boxscore?gameId={0}'
    
    game_html = requests.get(base_url.format(match_id)).content
    
    game_soup = BeautifulSoup(game_html,'lxml')
    
    game_tables = game_soup.find_all('tbody')
    
    #find visiting and home team names
    names = game_tables[0].find_all('td',{'class':'team-name'})
    visiting_team = names[0].contents[0]
    home_team = names[1].contents[0]
    
    #create DataFrame of all players' stats with shot stats split
    try:
        stats_df = pd.concat([split_shot_columns(table_to_dataframe(game_tables[table_idx], table_idx, match_id, visiting_team, home_team, stat_names, column_names))
                for table_idx in range(1,5)])
    except AttributeError as error:
        print('AttributeError with match ' + str(match_id))
    
    #convert certain columns to type int
    for column_name in columns_to_int:
        try:
            stats_df.loc[:,column_name] = stats_df.loc[:,column_name].apply(lambda x: int(x))
        except ValueError as error:
            print(match_id, column_name)
        
    return stats_df   

In [47]:
#check function on Warriors-Cavs game
get_game_player_stats(matchup_id, columns_to_int)

Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg_made,fg_attempted,3pt_made,3pt_attempted,ft_made,ft_attempted,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,GS,draymond-green,6589,PF,yes,yes,40,4,8,0,2,2,2,0,2,2,9,2,0,2,4,-2,10,401034615
1,GS,kevin-durant,3202,SF,yes,yes,43,15,23,6,9,7,7,1,12,13,7,1,1,3,3,15,43,401034615
2,GS,javale-mcgee,3452,C,yes,yes,14,5,7,0,0,0,0,2,1,3,0,0,2,1,1,3,10,401034615
3,GS,stephen-curry,3975,PG,yes,yes,39,3,16,1,10,4,4,0,5,5,6,1,0,2,3,0,11,401034615
4,GS,klay-thompson,6475,SG,yes,yes,41,4,11,2,5,0,0,0,4,4,2,1,1,0,3,14,10,401034615
0,GS,david-west,2177,PF,no,yes,5,0,2,0,0,0,0,1,1,2,0,0,0,1,0,-3,0,401034615
1,GS,kevon-looney,3155535,SF,no,yes,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,401034615
2,GS,andre-iguodala,2386,SF,no,yes,22,3,4,0,0,2,2,0,2,2,1,1,0,1,3,14,8,401034615
3,GS,jordan-bell,3064427,C,no,yes,12,4,5,0,0,2,4,2,4,6,0,0,1,0,0,0,10,401034615
4,GS,shaun-livingston,2393,PG,no,yes,17,4,5,0,0,0,0,0,0,0,2,0,0,0,3,3,8,401034615


It works! 

We then check our function on the game between the Warriors and the New Orleans Pelicans on October 20, 2018, just to make sure it generalizes. The Warriors won this game 128-120 and the box score is available at http://www.espn.com/nba/boxscore?gameId=400974444.

In [48]:
gs_pel_id = 400974444

get_game_player_stats(gs_pel_id, columns_to_int)

Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg_made,fg_attempted,3pt_made,3pt_attempted,ft_made,ft_attempted,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,GS,draymond-green,6589,PF,yes,yes,35,3,11,2,7,0,0,0,5,5,9,0,0,5,4,-1,8,400974444
1,GS,kevin-durant,3202,SF,yes,yes,38,9,18,3,6,1,1,0,8,8,2,1,7,5,3,-1,22,400974444
2,GS,zaza-pachulia,2016,C,yes,yes,21,3,6,0,0,2,2,3,6,9,0,2,0,2,5,-1,8,400974444
3,GS,stephen-curry,3975,PG,yes,yes,35,7,16,4,11,10,10,0,3,3,8,0,0,1,1,-1,28,400974444
4,GS,klay-thompson,6475,SG,yes,yes,37,13,20,7,13,0,0,2,2,4,2,0,1,1,2,-1,33,400974444
0,GS,david-west,2177,PF,no,yes,13,4,5,1,1,2,4,1,0,1,2,0,0,0,4,-1,11,400974444
1,GS,andre-iguodala,2386,SF,no,yes,23,3,5,0,1,1,1,1,6,7,3,0,0,1,0,-1,7,400974444
2,GS,javale-mcgee,3452,C,no,yes,9,1,2,0,0,0,0,0,4,4,1,0,0,1,2,-1,2,400974444
3,GS,jordan-bell,3064427,C,no,yes,5,2,2,0,0,0,0,1,1,2,0,1,0,0,0,-1,4,400974444
4,GS,shaun-livingston,2393,PG,no,yes,14,1,4,0,0,0,2,2,3,5,1,1,0,2,1,-1,2,400974444


## Scraping many games at once

We will now test our function on multiple games at once to get an idea of how long it takes to run.

We will import our DataFrame written in the notebook "Accumulating all games since 2004-2005 season-June11" that includes a list of all Matchup ID's since the 2004-2005 season. We then check how long it takes for our code to run on 30 games.

In [33]:
all_games_played = pd.read_csv('all_games_04_on.csv', index_col='Unnamed: 0')

print(all_games_played.shape)

all_games_played.head(20)

(36140, 9)


Unnamed: 0,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
0,bos,2004,2005,regular,11,3,2004,11/3/2004,241103002
1,bos,2004,2005,regular,11,5,2004,11/5/2004,241105002
2,bos,2004,2005,regular,11,6,2004,11/6/2004,241106018
3,bos,2004,2005,regular,11,10,2004,11/10/2004,241110002
4,bos,2004,2005,regular,11,12,2004,11/12/2004,241112002
5,bos,2004,2005,regular,11,17,2004,11/17/2004,241117027
6,bos,2004,2005,regular,11,19,2004,11/19/2004,241119002
7,bos,2004,2005,regular,11,21,2004,11/21/2004,241121002
8,bos,2004,2005,regular,11,23,2004,11/23/2004,241123011
9,bos,2004,2005,regular,11,24,2004,11/24/2004,241124020


We test our function on 30 games to see how quick it works. We decide to check our function on the first 30 games of the Warriors 2017-2018 regular season just because the Warriors are awesome.

In [34]:
#select first 30 of GS 2017-2018 season
all_games_played_2018 = all_games_played[all_games_played['season_end_year']==2018]
print('Total games in 2017-2018 season (with multiplicity 2): ' +  str(all_games_played_2018.shape[0]))
all_games_played_gs_2018 = all_games_played_2018[all_games_played_2018['team']=='gs']
#filter to Warriors game during 2017-2018 season
print(all_games_played_gs_2018.shape)
all_games_played_gs_2018.head()

Total games in 2017-2018 season (with multiplicity 2): 2624
(103, 9)


Unnamed: 0,team,season_start_year,season_end_year,season_type,game_month,game_day,game_year,game_date,matchup_id
31746,gs,2017,2018,regular,10,17,2017,10/17/2017,400974438
31747,gs,2017,2018,regular,10,20,2017,10/20/2017,400974444
31748,gs,2017,2018,regular,10,21,2017,10/21/2017,400974784
31749,gs,2017,2018,regular,10,23,2017,10/23/2017,400974796
31750,gs,2017,2018,regular,10,25,2017,10/25/2017,400974814


In [35]:
#list of first 30 matchup ID's
matchup_id_gs_30 = [list(all_games_played_gs_2018.loc[:,'matchup_id'])[idx] for idx in range(30)]
#check that list looks correct
print(matchup_id_gs_30)

[400974438, 400974444, 400974784, 400974796, 400974814, 400974826, 400974842, 400974851, 400974868, 400974886, 400974899, 400974914, 400974935, 400974950, 400974966, 400974982, 400974987, 400975014, 400975027, 400975035, 400975047, 400975063, 400975069, 400975086, 400975097, 400975109, 400975119, 400975146, 400975168, 400975201]


In [36]:
start_time = time.time()

players_stats_gs_30 = pd.concat([get_game_player_stats(matchup_id_gs_30[idx], columns_to_int) for idx in range(len(matchup_id_gs_30))])

print(players_stats_gs_30.shape)

players_stats_gs_30.head(50)

print('30 games took ' + str(time.time()-start_time) \
      + ' seconds for an average of ' \
      + str((time.time()-start_time)/30) \
      + ' seconds per game.')

(768, 24)
30 games took 39.775038957595825 seconds for an average of 1.3258348941802978 seconds per game.


Not too slow! With an average of around 1300 games per season, this means it should take about 30 minutes per season. This is very similar to how long it took to scrape the team stats data. This is a bit surprising since each player stats table has much more information (information on each player instead of just each team overall) than each team stats table. Perhaps it's because this code is written better than the previous code.

In [37]:
#check all games represented
assert_equal(set(players_stats_gs_30.loc[:,'matchup_id']),set(matchup_id_gs_30))

In [60]:
players_stats_gs_30.head(100)

Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg_made,fg_attempted,3pt_made,3pt_attempted,ft_made,ft_attempted,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,HOU,ryan-anderson,3412,PF,yes,yes,33,5,12,3,8,0,0,3,5,8,1,1,1,0,3,8,13,400974438
1,HOU,trevor-ariza,2426,SF,yes,yes,38,3,9,2,5,0,0,1,5,6,5,2,0,1,2,-8,8,400974438
2,HOU,clint-capela,3102529,C,yes,yes,18,6,10,0,0,0,1,1,3,4,0,1,1,2,0,-23,12,400974438
3,HOU,chris-paul,2779,PG,yes,yes,33,2,9,0,4,0,0,1,7,8,10,2,1,1,4,-13,4,400974438
4,HOU,james-harden,3992,PG,yes,yes,36,10,23,4,9,3,4,1,5,6,11,1,0,3,2,1,27,400974438
0,HOU,luc-mbah-a-moute,3451,PF,no,yes,23,6,9,2,3,0,0,2,2,4,0,0,0,2,1,4,14,400974438
1,HOU,pj-tucker,3033,SF,no,yes,29,6,9,4,6,4,6,1,5,6,0,1,0,1,3,20,20,400974438
2,HOU,eric-gordon,3431,SG,no,yes,29,9,16,0,6,6,8,0,1,1,1,1,2,2,1,16,24,400974438
3,HOU,zhou-qi,3892894,PF,no,no,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,400974438
4,HOU,tarik-black,2528393,PF,no,no,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,400974438


## Scraping all stats since the 2004-2005 season

Since we have Matchup ID's going back to the 2004-2005 season, we will just try to run the code on all of the Matchup ID's since then and see what happens. We organize all of the Matchup ID's as in the previous notebook "Scraping_all_team_stats- June11".

In [39]:
#divide large DataFrame into smaller DataFrames by year
all_games_played_by_year = {}
for year in range(2005, 2019):
    all_games_played_by_year[year] = all_games_played[all_games_played['season_end_year']==year]

In [40]:
#keys: year
#values: all matchup ID's of games from that year
all_matchup_id = {}

for year in range(2018,2004,-1):
    all_matchup_id[year] = list(set(all_games_played_by_year[year].loc[:,'matchup_id']))


To start things off, we will just run our code on the 2017-2018 season. We will see if our code works for such a large batch of games (over 1300 games!) and how long it takes to run.

In [51]:
#keys: year
#values: DataFrame of all players stats of games in that year
all_players_stats_by_year = {}

#first test on 2018 season
last_year = 2018

start_time = time.time()
print('Start of ' + str(last_year) + ' season.')
all_players_stats_by_year[last_year] = pd.concat(get_game_player_stats(all_matchup_id[last_year][idx],columns_to_int) for idx in range(len(all_matchup_id[last_year])))
all_players_stats_by_year[last_year].to_csv('all_players_stats_' + str(last_year) + '.csv')
print(str(last_year) + ' took ' + str(time.time()-start_time) + ' seconds.')

Start of 2018 season.
400975115 min
400975115 fg_made
400975115 fg_attempted
400975115 3pt_made
400975115 3pt_attempted
400975115 ft_made
400975115 ft_attempted
400975115 oreb
400975115 dreb
400975115 reb
400975115 ast
400975115 stl
400975115 blk
400975115 to
400975115 pf
400975115 plusminus
400975115 pts
400975606 min
400975606 fg_made
400975606 fg_attempted
400975606 3pt_made
400975606 3pt_attempted
400975606 ft_made
400975606 ft_attempted
400975606 oreb
400975606 dreb
400975606 reb
400975606 ast
400975606 stl
400975606 blk
400975606 to
400975606 pf
400975606 plusminus
400975606 pts
400975750 min
400975750 fg_made
400975750 fg_attempted
400975750 3pt_made
400975750 3pt_attempted
400975750 ft_made
400975750 ft_attempted
400975750 oreb
400975750 dreb
400975750 reb
400975750 ast
400975750 stl
400975750 blk
400975750 to
400975750 pf
400975750 plusminus
400975750 pts
2005 took 1516.9311182498932 seconds.


We find that the 2018 season took 1516 seconds, or about 25 minutes to scrape. There were also issues with 3 games with matchup ID's `400975115`, `400975606`, `400975750`. By going to the box scores of these games, we see the issue. In these games, there were 4 players for which each stat is listed as '--' or '-----'. The function was unable to convert these dashes into `int`'s. It's mostly likely that these records were mistakenly written in the box score and the player just didn't play. We will need to take note of this when we analyze this DataFrame with SQL. Otherwise, the program worked swimmingly! 

We will view the beginning of the table and view its size before scraping all seasons.


In [52]:
print(all_players_stats_by_year[2018].shape)

all_players_stats_by_year[2018].head(100)

(33100, 24)


Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg_made,fg_attempted,3pt_made,3pt_attempted,ft_made,ft_attempted,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,OKC,carmelo-anthony,1975,PF,yes,yes,32,3,6,3,3,0,0,1,2,3,2,2,1,0,2,12,9,400975872
1,OKC,corey-brewer,3191,SF,yes,yes,29,5,8,1,4,3,4,1,1,2,3,2,0,1,4,6,14,400975872
2,OKC,paul-george,4251,SF,yes,yes,39,9,20,3,6,5,5,0,7,7,6,4,0,3,3,5,26,400975872
3,OKC,steven-adams,2991235,C,yes,yes,37,5,11,0,0,0,2,5,8,13,1,0,0,2,3,10,10,400975872
4,OKC,russell-westbrook,3468,PG,yes,yes,38,7,19,0,4,5,8,1,10,11,5,3,1,7,5,7,19,400975872
0,OKC,patrick-patterson,4264,PF,no,yes,12,3,3,1,1,0,0,1,0,1,0,0,0,0,0,-17,7,400975872
1,OKC,jerami-grant,2991070,SF,no,yes,14,0,3,0,1,0,0,1,2,3,0,0,1,1,2,-13,0,400975872
2,OKC,raymond-felton,2753,PG,no,yes,16,4,8,2,4,0,0,0,3,3,1,0,0,1,2,-16,10,400975872
3,OKC,alex-abrines,2995702,SG,no,yes,9,0,3,0,3,1,2,0,0,0,0,0,0,0,0,-9,1,400975872
4,OKC,terrance-ferguson,4230546,SG,no,yes,14,1,4,1,4,0,0,0,0,0,1,0,0,0,2,-5,3,400975872


In [53]:
#keys: year
#values: DataFrame of all players stats of games in that year

for year in range(2017,2004,-1):
    print('Start of ' + str(year) + ' season.')
    start_time = time.time()
    all_players_stats_by_year[year] = pd.concat(get_game_player_stats(all_matchup_id[year][idx],columns_to_int) for idx in range(len(all_matchup_id[year])))
    all_players_stats_by_year[year].to_csv('all_players_stats_' + str(year) + '.csv')
    print(str(year) + ' took ' + str(time.time()-start_time) + ' seconds.')

Start of 2017 season.
400900168 min
400900168 fg_made
400900168 fg_attempted
400900168 3pt_made
400900168 3pt_attempted
400900168 ft_made
400900168 ft_attempted
400900168 oreb
400900168 dreb
400900168 reb
400900168 ast
400900168 stl
400900168 blk
400900168 to
400900168 pf
400900168 plusminus
400900168 pts
400900276 min
400900276 fg_made
400900276 fg_attempted
400900276 3pt_made
400900276 3pt_attempted
400900276 ft_made
400900276 ft_attempted
400900276 oreb
400900276 dreb
400900276 reb
400900276 ast
400900276 stl
400900276 blk
400900276 to
400900276 pf
400900276 plusminus
400900276 pts
400900304 min
400900304 fg_made
400900304 fg_attempted
400900304 3pt_made
400900304 3pt_attempted
400900304 ft_made
400900304 ft_attempted
400900304 oreb
400900304 dreb
400900304 reb
400900304 ast
400900304 stl
400900304 blk
400900304 to
400900304 pf
400900304 plusminus
400900304 pts
400900318 min
400900318 fg_made
400900318 fg_attempted
400900318 3pt_made
400900318 3pt_attempted
400900318 ft_made
4009003

AttributeError: 'NoneType' object has no attribute 'split'

We see that the code successfully scraped the player stats from the 2016-2017 season to the 2013-2014 season, with a few errors with some games that were likely the result of the same '--' error. We will store this output into the text file 'player_stats_errors' for future analysis. We then hit a critical error involving some tables during the 2012-2013 season.

For now, we will try to rectify the issue with the tables. After trying to run the code on the 2012-2013 season, we see there is an issue with a game in November 2012 between the Raptors and 76ers. Looking at the box score at http://www.espn.com/nba/boxscore?gameId=400277874, we see that the error is most likely due to the fact that a player's name is listed as 'null'.  

These errors are getting a bit annoying to deal with. Thus, we will just keep track of all of the games (by Matchup ID) that lead to problems and return empty DataFrames for these games.

In [61]:
def get_game_player_stats_or_error(match_id, columns_to_convert, split_columns):
    '''
    Either returns expected player stats DataFrame or there is an error/return empty DataFrame.
    
    Inputs:
    match_id: Matchup ID of game.
    columns_to_convert: columns to convert to type int
    split_columns: column names of DataFrame.
    
    Outputs:
    DataFrame (empty if there is an error)
    
    '''
    
    
    try: 
        return get_game_player_stats(match_id, columns_to_convert)
    except:
        print('Problem with ' + str(match_id))
        return pd.DataFrame(columns=split_columns)

In [62]:
year_2013 = 2013

print('Start of ' + str(year_2013) + ' season.')
start_time = time.time()
all_players_stats_by_year[year_2013] = pd.concat(get_game_player_stats_or_error(all_matchup_id[year_2013][idx],\
                                                                                columns_to_int,\
                                                                                split_column_names)\
                                                 for idx in range(len(all_matchup_id[year_2013])))
all_players_stats_by_year[year_2013].to_csv('all_players_stats_' + str(year_2013) + '.csv')
print(str(year_2013) + ' took ' + str(time.time()-start_time) + ' seconds.')

Start of 2013 season.
400277756 min
400277756 fg_made
400277756 fg_attempted
400277756 3pt_made
400277756 3pt_attempted
400277756 ft_made
400277756 ft_attempted
400277756 oreb
400277756 dreb
400277756 reb
400277756 ast
400277756 stl
400277756 blk
400277756 to
400277756 pf
400277756 plusminus
400277756 pts
AttributeError with match 400277874
Problem with 400277874
400278328 min
400278328 fg_made
400278328 fg_attempted
400278328 3pt_made
400278328 3pt_attempted
400278328 ft_made
400278328 ft_attempted
400278328 oreb
400278328 dreb
400278328 reb
400278328 ast
400278328 stl
400278328 blk
400278328 to
400278328 pf
400278328 plusminus
400278328 pts
400278344 min
400278344 fg_made
400278344 fg_attempted
400278344 3pt_made
400278344 3pt_attempted
400278344 ft_made
400278344 ft_attempted
400278344 oreb
400278344 dreb
400278344 reb
400278344 ast
400278344 stl
400278344 blk
400278344 to
400278344 pf
400278344 plusminus
400278344 pts
2013 took 1780.0473568439484 seconds.


We will now run the code from the 2011-2012 season to the 2004-2005 season, keeping track of games that lead to errors.

In [63]:
#keys: year
#values: DataFrame of all players stats of games in that year

for year in range(2012,2004,-1):
    print('Start of ' + str(year) + ' season.')
    start_time = time.time()
    all_players_stats_by_year[year] = pd.concat(get_game_player_stats_or_error(all_matchup_id[year][idx],\
                                                                               columns_to_int,\
                                                                               split_column_names) \
                                                for idx in range(len(all_matchup_id[year])))
    all_players_stats_by_year[year].to_csv('all_players_stats_' + str(year) + '.csv')
    print(str(year) + ' took ' + str(time.time()-start_time) + ' seconds.')

Start of 2012 season.
320211030 min
320211030 fg_made
320211030 fg_attempted
320211030 3pt_made
320211030 3pt_attempted
320211030 ft_made
320211030 ft_attempted
320211030 oreb
320211030 dreb
320211030 reb
320211030 ast
320211030 stl
320211030 blk
320211030 to
320211030 pf
320211030 plusminus
320211030 pts
320422024 min
320422024 fg_made
320422024 fg_attempted
320422024 3pt_made
320422024 3pt_attempted
320422024 ft_made
320422024 ft_attempted
320422024 oreb
320422024 dreb
320422024 reb
320422024 ast
320422024 stl
320422024 blk
320422024 to
320422024 pf
320422024 plusminus
320422024 pts
320420005 min
320420005 fg_made
320420005 fg_attempted
320420005 3pt_made
320420005 3pt_attempted
320420005 ft_made
320420005 ft_attempted
320420005 oreb
320420005 dreb
320420005 reb
320420005 ast
320420005 stl
320420005 blk
320420005 to
320420005 pf
320420005 plusminus
320420005 pts
320416028 min
320416028 fg_made
320416028 fg_attempted
320416028 3pt_made
320416028 3pt_attempted
320416028 ft_made
3204160

We combine these DataFrames into a single DataFrame and save it. We then take a glance at hard-earned DataFrame.

In [65]:
all_players_stats = pd.concat(all_players_stats_by_year[year] for year in range(2005,2019))

all_players_stats.to_csv('all_players_stats_05_to_18.csv')

In [66]:
print(all_players_stats.shape)

all_players_stats.head(20)

(479654, 24)


Unnamed: 0,team_name,player_name,player_id,position,started,played,min,fg_made,fg_attempted,3pt_made,3pt_attempted,ft_made,ft_attempted,oreb,dreb,reb,ast,stl,blk,to,pf,plusminus,pts,matchup_id
0,PHI,marc-jackson,377.0,PF,yes,yes,31,8,17,0,0,1,2,1,4,5,0,1,1,1,2,--,17,250112004.0
1,PHI,kenny-thomas,849.0,PF,yes,yes,34,2,7,0,0,0,0,1,5,6,4,2,1,2,5,--,4,250112004.0
2,PHI,john-salmons,1726.0,SF,yes,yes,24,1,7,1,4,0,0,1,2,3,1,0,0,3,1,--,3,250112004.0
3,PHI,andre-iguodala,2386.0,SF,yes,yes,27,2,3,0,0,2,2,1,5,6,7,0,1,2,0,--,6,250112004.0
4,PHI,allen-iverson,366.0,SG,yes,yes,39,8,21,0,3,5,6,0,2,2,8,2,0,4,1,--,21,250112004.0
0,PHI,corliss-williamson,936.0,PF,no,yes,29,5,9,0,0,4,5,1,1,2,0,1,0,1,4,--,14,250112004.0
1,PHI,brian-skinner,779.0,PF,no,yes,2,0,0,0,0,0,0,1,0,1,0,0,0,0,0,--,0,250112004.0
2,PHI,kevin-ollie,620.0,PG,no,yes,7,1,3,0,0,0,0,0,1,1,1,0,0,0,0,--,2,250112004.0
3,PHI,willie-green,2004.0,SG,no,yes,7,0,3,0,0,0,0,0,1,1,2,1,0,0,2,--,0,250112004.0
4,PHI,kyle-korver,2011.0,SG,no,yes,29,3,9,2,7,1,1,0,2,2,1,0,0,1,1,--,9,250112004.0
