# Scrape play-by-play data from ESPN

The code is a bit messy, but the idea is pretty simple. Pro-Football-Reference.com's play-by-play tables are one of the best resources out there, but they don't say which team has the ball. That's easy for a person to figure out from context, but not for an algorithm that doesn't know which players are on which team. ESPN's play-by-play pages for individual games make this easier: each drive has a little team logo next to it showing which team has the ball. So I wrote a scraper to get that information from ESPN.

The scraper works like this:
* Use the requests library to get an html version of the webpage.
* Make a soup object from the source html.
* Use a customized function to grab a particular table from the soup object.
* In that table, pull out individual drives, which each load an image. The name of the image tells us which team has the ball.
* Within each drive, grab individual plays. I make a dictionary for each column that I then put together to make a single pandas DataFrame.

The code written here first checks a different page to compile a list of urls for individual games. I then use the scraper to loop through webpages for individual games.
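
As a rough illustration of the logo step, here is a minimal sketch of pulling a team abbreviation out of an image URL. The example URL is made up, and the helper name `team_from_logo_url` is hypothetical; the only thing taken from the scraper is the assumption that the abbreviation sits between `nfl/500/` and `.png`.

```python
def team_from_logo_url(src, prefix="nfl/500/", suffix=".png"):
    """Return the upper-case team abbreviation embedded in a logo URL, or None."""
    start = src.find(prefix)
    if start == -1:
        return None  # prefix not found, so this is not a logo URL we recognize
    start += len(prefix)
    end = src.find(suffix, start)
    if end == -1:
        return None  # no file extension after the prefix
    return src[start:end].upper()


# Hypothetical URL, shaped like the ones the scraper expects
example = "http://example.com/i/teamlogos/nfl/500/sf.png&h=100"
print(team_from_logo_url(example))
```

Returning `None` instead of raising makes it easy to skip drives whose header has no recognizable logo.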

# Process the raw data

Making sense of the raw data is crucial. I do a few things to make the data useful.
* Parse down, distance, and field position.
* Parse quarter and time remaining.
* Make columns for score difference and total score.
* Make columns for whether the home team has possession and whether it wins.
* Parse play detail to determine whether an individual play is a run, pass, scramble, punt, field goal, whether a pass is complete, and how many yards were gained on the play.
* Determine whether an offensive play is "successful".
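
To give a feel for the play-detail step, here is a minimal sketch of keyword-based play classification. It is not the parser used later in this notebook. The keyword lists are assumptions based on the wording seen in ESPN's play descriptions, and `classify_play` and `yards_gained` are hypothetical helper names.

```python
import re

# Phrases that (by assumption) show up in descriptions of running plays
RUN_WORDS = ("up the middle", "left tackle", "right tackle",
             "left guard", "right guard", "left end", "right end")


def classify_play(detail):
    """Guess a play type from the free-text play description."""
    text = detail.lower()
    if "field goal" in text:
        return "field_goal"
    if "punts" in text:
        return "punt"
    if "kicks" in text:
        return "kickoff"
    if "scrambles" in text:
        return "scramble"
    if "sacked" in text:
        return "sack"
    if " pass " in text:
        return "pass"
    if any(w in text for w in RUN_WORDS) or " rush" in text:
        return "run"
    return "other"


def yards_gained(detail):
    """Return yards gained as an int, or None if the text doesn't say."""
    m = re.search(r"for (-?\d+) yards?", detail)
    if m:
        return int(m.group(1))
    if "for no gain" in detail:
        return 0
    return None


play = "R. Tannehill sacked at MIA 1 for -8 yards (C. Avril)."
print(classify_play(play), yards_gained(play))
```

The order of the checks matters: scrambles are tested before the run keywords because a scramble description can also mention "right end".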

In [3]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [2]:
def make_soup(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')
    return soup

In [3]:
class pbp_drive():
    
    def parse_drive(self,drive):
        
        # Try to read drive header
        header = drive.find("div",{"class":"accordion-header"})
        self.is_half = False
        if header is None: # Then we've got the end of a half or something
            self.is_half = True
#            print("found end of half, etc.")
            text = drive.find("span",{"class":"post-play"}).contents
            df = pd.DataFrame([[[],text]], columns=['downdist','detail'])
            return df
        
        # Grab information from the drive header
        possessor_logo = drive.find("span",{"class":"home-logo"}).contents[0]
        s = "nfl/500/"
        e = ".png"
        # Cut off pieces of url before and after the team abbreviation
        self.offense = (str(possessor_logo).split(s))[1].split(e)[0].upper()
        # Get result of the drive
        self.result = header.find("span",{"class":"headline"}).contents
        # Get info about home/away score
        home_info = header.find("span",{"class":"home"}).contents
        self.home_team = home_info[0].contents[0]
        self.home_score_after = home_info[1].contents[0]
        away_info = header.find("span",{"class":"away"}).contents
        self.away_team = away_info[0].contents[0]
        self.away_score_after = away_info[1].contents[0]
        # Get drive summary
        self.drive_detail = header.find("span",{"class":"drive-details"}).contents
#        print(self.drive_detail)
        self.num_plays = self.drive_detail[0].split()[0]
        self.num_yards = self.drive_detail[0].split()[2]
        self.time_of_poss = self.drive_detail[0].split()[4]
#        print(self.result)
#        print([self.home_team,self.home_score_after,self.away_team,self.away_score_after])
    
        # Make a dataframe for the drive
        # Grab info about individual plays from this drive
        playlist = []
        plays = drive.find_all("li")
        for p in plays:
            try:
                downdist = p.h3.contents
                detail = p.span.contents[0].replace("\n","").replace("\t","")
                playlist.append([downdist,detail])
#                print([downdist,detail])
            except (AttributeError, IndexError, TypeError):
                # Skip list items that don't look like a play (no h3/span)
                pass
            
        # Return a dataFrame of plays for this drive
        df = pd.DataFrame(playlist, columns=['downdist','detail'])
        df['play_num'] = df.index + 1
        return df
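
The drive summary is pulled apart above by position (`split()[0]`, `[2]`, `[4]`). A minimal sketch of the same idea with numeric conversion is below. It assumes the summary text looks like "14 plays, 81 yards, 6:31", which is inferred from those indices rather than confirmed, and `parse_drive_summary` is a hypothetical name.

```python
def parse_drive_summary(summary):
    """Split a drive summary into plays, yards, and time of possession in seconds."""
    parts = summary.split()
    # Assumed layout: ['14', 'plays,', '81', 'yards,', '6:31']
    num_plays = int(parts[0])
    num_yards = int(parts[2])
    mins, secs = parts[4].split(":")
    return {"plays": num_plays,
            "yards": num_yards,
            "seconds": 60 * int(mins) + int(secs)}


print(parse_drive_summary("14 plays, 81 yards, 6:31"))
```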


In [4]:
# Putting all of the pieces together
def get_game_df(url):
    
    # Make a soup object with html
    soup = make_soup(url)
    
    # Find article with play-by-play table
    article = soup.find("article", {"class":"sub-module play-by-play"})
    # Article is constructed like accordion, with items corresponding
    # to individual drives
    accordion = article.find("ul", {"class":"css-accordion"})
    drives = accordion.find_all("li", {"class":"accordion-item"})
    
    # Now parse each of the drives into a dataFrame
    drivelist = []
    for i, drive in enumerate(drives):
        # Initialize drive object, then parse 
        d = pbp_drive()
        d.df = d.parse_drive(drive)
        d.drive_num = i
    
        if i == 0:
            d.home_score_before = 0
            d.away_score_before = 0
        else:
            d.home_score_before = drivelist[-1].home_score_after
            d.away_score_before = drivelist[-1].away_score_after
    
        # If the drive isn't a special section marking the end of half/game
        # Then add drive's dataFrame to the drive list
        if not d.is_half:
            d.df['home'] = d.home_team
            d.df['away'] = d.away_team
            d.df['possession'] = d.offense
            d.df['home_score_before'] = d.home_score_before
            d.df['away_score_before'] = d.away_score_before
            d.df['home_score_after'] = d.home_score_after
            d.df['away_score_after'] = d.away_score_after
            d.df['drive_num'] = d.drive_num

            #print(d.df)
            drivelist.append(d)

    # Make a dataFrame for individual drives
    drive_dicts = [{'drive':dd.drive_num,
                    'offense':dd.offense,
                    'plays':dd.num_plays,
                    'yds_gained':dd.num_yards,
                    'time':dd.time_of_poss,
                    'result':dd.result[0],
                    'home':dd.home_team,
                    'away':dd.away_team,
                    'home_score_before':dd.home_score_before,
                    'away_score_before':dd.away_score_before,
                    'home_score_after':dd.home_score_after,
                    'away_score_after':dd.away_score_after }
                   for dd in drivelist ]
    drives_df = pd.DataFrame(drive_dicts)
            
    pbp_df = pd.concat([d.df for d in drivelist])
    return pbp_df, drives_df

In [5]:
game, drives = get_game_df("http://www.espn.com/nfl/playbyplay?gameId=400951568")
print(game.head(10))
drives.head(10)

                 downdist                                             detail  \
0                      []  (15:00 - 1st) G.Zuerlein kicks 65 yards from L...   
1   [1st and 10 at SF 25]  (14:52 - 1st)  B.Hoyer pass short right intend...   
0  [1st and Goal at SF 3]  (14:48 - 1st) Todd Gurley 3 Yard Rush G.Zuerle...   
0                      []  (14:43 - 1st) G.Zuerlein kicks 64 yards from L...   
1   [1st and 10 at SF 19]  (14:31 - 1st)  C.Hyde up the middle to SF 21 f...   
2    [2nd and 8 at SF 21]  (14:01 - 1st)  C.Hyde left tackle to SF 41 for...   
3   [1st and 10 at SF 41]  (13:29 - 1st)  C.Hyde right guard to SF 44 for...   
4    [2nd and 7 at SF 44]  (13:14 - 1st)  M.Breida left tackle to SF 41 f...   
5   [3rd and 10 at SF 41]  (12:33 - 1st)  (Shotgun) B.Hoyer pass short ri...   
6    [4th and 2 at SF 49]  (11:57 - 1st)  B.Pinion punts 36 yards to LA 1...   

   play_num home away possession home_score_before away_score_before  \
0         1  LAR   SF         SF               

Unnamed: 0,away,away_score_after,away_score_before,drive,home,home_score_after,home_score_before,offense,plays,result,time,yds_gained
0,SF,0,0,0,LAR,0,0,SF,1,Interception,0:08,0
1,SF,0,0,1,LAR,7,0,LAR,1,Touchdown,0:04,3
2,SF,7,0,2,LAR,7,7,SF,14,Touchdown,6:31,81
3,SF,7,7,3,LAR,14,7,LAR,8,Touchdown,3:45,75
4,SF,7,7,4,LAR,14,14,SF,5,Fumble,1:27,42
5,SF,7,7,5,LAR,17,14,LAR,7,Field Goal,3:10,38
6,SF,7,7,6,LAR,17,17,SF,6,Punt,3:20,32
7,SF,10,7,7,LAR,17,17,SF,4,Field Goal,1:58,-6
8,SF,10,10,8,LAR,17,17,LAR,3,Punt,1:07,-2
9,SF,13,10,9,LAR,17,17,SF,8,Field Goal,4:44,47
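
The "before" scores in this table are just the previous drive's "after" scores, starting from 0-0. A minimal sketch of that carry-forward in plain Python (the function name `scores_before` is hypothetical):

```python
def scores_before(scores_after):
    """Given (home, away) scores after each drive, return the scores before each drive."""
    before = []
    prev = (0, 0)  # every game starts at 0-0
    for after in scores_after:
        before.append(prev)
        prev = after
    return before


# (home, away) after the first four drives in the table above
after = [(0, 0), (7, 0), (7, 7), (14, 7)]
print(scores_before(after))
```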


In [6]:
# Function to get gameIds for a particular year/week
def get_gameId(year,week):
    # Make a soup object for the appropriate page
    url = "http://www.espn.com/nfl/schedule/_/week/{0}/year/{1}/seasontype/2".format(week,year)
    soup = make_soup(url)
    sched_page = soup.find("section",{"id":"main-container"})
    
    # Make a list for gameIds and a dictionary for the text shown for each game
    gameids = []
    results = {}
    for link in sched_page.find_all('a'):
        # Some anchors have no href, so fall back to an empty string
        href = link.get('href') or ""
        if "gameId" in href:
            # Extract last bit of url listed
            s = "gameId="
            this_game = href.split(s)[1]
            gameids.append(this_game)
            # And add text displayed to a dictionary
            results[this_game] = link.contents[0]
    
    return gameids, results

In [7]:
gameids, results = get_gameId(2017,1)
print(gameids)
print(results)

['400951566', '400951567', '400951570', '400951572', '400951574', '400951576', '400951584', '400951592', '400951580', '400951597', '400951601', '400951605', '400951608', '400951581', '400951612', '400951615']
{'400951566': 'KC 42, NE 27', '400951567': 'BUF 21, NYJ 12', '400951570': 'ATL 23, CHI 17', '400951572': 'BAL 20, CIN 0', '400951574': 'PIT 21, CLE 18', '400951576': 'DET 35, ARI 23', '400951584': 'OAK 26, TEN 16', '400951592': 'PHI 30, WSH 17', '400951580': 'JAX 29, HOU 7', '400951597': 'LAR 46, IND 9', '400951601': 'GB 17, SEA 9', '400951605': 'CAR 23, SF 3', '400951608': 'DAL 19, NYG 3', '400951581': 'Postponed', '400951612': 'MIN 29, NO 19', '400951615': 'DEN 24, LAC 21'}
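
Splitting on `"gameId="` works as long as the id is the last thing in the link. As an alternative sketch, the standard library's URL parser can pull the query parameter out directly. The helper name `game_id_from_href` is hypothetical, and this is a swapped-in technique, not what the function above does.

```python
from urllib.parse import urlparse, parse_qs


def game_id_from_href(href):
    """Return the gameId query parameter from a link, or None if there isn't one."""
    if not href:
        return None
    values = parse_qs(urlparse(href).query).get("gameId")
    return values[0] if values else None


print(game_id_from_href("http://www.espn.com/nfl/playbyplay?gameId=400951568"))
```

This version ignores any extra query parameters that might follow the id.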


In [8]:
# Loop over desired weeks/years to get gameIds that can be used to look up play-by-play for each of the games
import time
gameids = []
gameresults = {}
gameyear = {}
gameweek = {}
for year in range(2009,2018):
    for week in range(1,18):
        print("Looking up gameIds for {0} week {1}".format(year,week))
        ids, results = get_gameId(year,week)
        gameids.append(ids)
        
        # Add entry in dictionaries for each gameId
        for i in ids:
            gameyear[i] = year
            gameweek[i] = week
            gameresults[i] = results[i]
        
        # Sleep so we don't get blocked
        time.sleep(0.5)
        
# Now flatten gameids, which is a list of weekly lists
ids = [i for sublist in gameids for i in sublist]
print(gameresults)

Looking up gameIds for 2009 week 1
Looking up gameIds for 2009 week 2
Looking up gameIds for 2009 week 3
Looking up gameIds for 2009 week 4
Looking up gameIds for 2009 week 5
Looking up gameIds for 2009 week 6
Looking up gameIds for 2009 week 7
Looking up gameIds for 2009 week 8
Looking up gameIds for 2009 week 9
Looking up gameIds for 2009 week 10
Looking up gameIds for 2009 week 11
Looking up gameIds for 2009 week 12
Looking up gameIds for 2009 week 13
Looking up gameIds for 2009 week 14
Looking up gameIds for 2009 week 15
Looking up gameIds for 2009 week 16
Looking up gameIds for 2009 week 17
Looking up gameIds for 2010 week 1
Looking up gameIds for 2010 week 2
Looking up gameIds for 2010 week 3
Looking up gameIds for 2010 week 4
Looking up gameIds for 2010 week 5
Looking up gameIds for 2010 week 6
Looking up gameIds for 2010 week 7
Looking up gameIds for 2010 week 8
Looking up gameIds for 2010 week 9
Looking up gameIds for 2010 week 10
Looking up gameIds for 2010 week 11
Looking up

In [9]:
# Make dataframe for game-specific information using our dictionaries for year and week
data = [ {'gameId':i,
          'season':gameyear[i],
          'week':gameweek[i],
          'result':gameresults[i]}
        for i in ids]
gamedata_df = pd.DataFrame(data)
# Set the gameId as the unique identifier for each row
gamedata_df.set_index('gameId', inplace=True)
gamedata_df.sample(10)

gameId,result,season,week
321230010,"TEN 38, JAX 20",2012,17
311211029,"ATL 31, CAR 23",2011,14
400554230,"HOU 30, TEN 16",2014,8
321111016,"MIN 34, DET 24",2012,10
400791530,"CLE 24, SF 10",2015,14
400791626,"SEA 26, CHI 0",2015,3
321104013,"TB 42, OAK 32",2012,9
331117019,"NYG 27, GB 13",2013,11
400791574,"ARI 26, BAL 18",2015,7
311217027,"DAL 31, TB 15",2011,15


In [10]:
# Create list of individual game dataFrames
pbp_list = []
drivelevel_list = []
game_home = {}
game_away = {}

# Now loop over gameIds to scrape individual game play-by-play
for i in ids:
    
    # Check year of game. Only search for pbp of games from 2004 or later.
    if gameyear[i] >= 2004:
        
        try:
    
            print(i)
            # Make whole url for play-by-play
            url = "http://www.espn.com/nfl/playbyplay?gameId="+i
    
            # Call function to scrape and parse info into a dataFrame
            pbp_df, drives_df = get_game_df(url)
    
            # Add column to dataframe for gameId
            pbp_df['gameId'] = i
            drives_df['gameId'] = i
    
            # Extract home/away from game df
            game_home[i] = pbp_df['home'].values[0]
            game_away[i] = pbp_df['away'].values[0]

            pbp_list.append(pbp_df)
            drivelevel_list.append(drives_df)
            
        except Exception:
            # e.g. pages with no play-by-play table; report the failure and move on
            print("Failed to scrape gameId "+i)
        
        time.sleep(0.5)

    
# Put individual game dataFrames together into one big dataFrame
allplays_df = pd.concat(pbp_list)
alldrives_df = pd.concat(drivelevel_list)

290910023
290913001
290913004
290913005
290913011
290913018
290913027
290913029
290913033
290913034
290913019
290913022
290913026
290913009
290914017
290914013
290920001
290920008
290920009
290920010
290920012
290920020
290920021
290920028
290920030
290920002
290920025
290920003
290920007
290920024
290920006
290921015
290927008
290927014
290927016
290927017
290927020
290927021
290927027
290927033
290927034
290927002
290927026
290927004
290927013
290927024
290927022
290928006
291004003
291004005
291004011
291004012
291004017
291004028
291004030
291004034
291004015
291004018
291004007
291004025
291004023
291005016
291011002
291011008
291011012
291011014
291011019
291011021
291011029
291011033
291011025
291011007
291011022
291011026
291011010
291012015
291018004
291018009
291018016
291018018
291018023
291018027
291018028
291018030
291018013
291018026
291018017
291018020
291018001
291019024
291025005
291025012
291025014
291025023
291025027
291025034
291025013
291025029
291025004
291025006


320930012
320930014
320930020
320930034
320930007
320930022
320930030
320930009
320930027
320930021
321001006
321004014
321007004
321007011
321007012
321007019
321007023
321007028
321007029
321007030
321007016
321007017
321007025
321007018
321008020
321011010
321014001
321014005
321014015
Failed to scrape gameId 321014015
321014020
321014021
321014027
321014033
321014022
321014026
321014025
321014028
321014034
321015024
321018025
321021002
321021011
321021014
321021016
321021019
321021027
321021029
321021034
321021013
321021017
321021004
321022003
321025016
321028003
321028005
321028008
321028009
321028010
321028014
321028020
321028021
321028023
321028012
321028006
321028007
321029022
321101024
321104004
321104005
321104009
321104010
321104011
321104028
321104030
Failed to scrape gameId 321104030
321104034
321104013
321104026
321104019
321104001
321105018
321108030
321111004
321111015
321111016
321111017
321111018
321111027
321111029
321111033
321111026
321111021
321111025
321111003
32

400791625
400791598
400791621
400791637
400791663
400791667
400791670
400791674
400791676
400791503
400761515
400791523
400791517
400791520
400791527
400791529
400791533
400791536
400791511
400791540
400791564
400791567
400791574
400791601
400761516
400791605
400791642
400791609
400791613
400791618
400791649
400791646
400791653
400791657
400791681
400791684
400791690
400791698
400791725
400791727
400791721
400791722
400791728
400791730
400791700
400791732
400791731
400791733
400791734
400791735
400791486
400791490
400791499
400791543
400791552
400791556
400791494
400791549
400791581
400791586
400791595
400791591
400791599
400791622
400791632
400791640
400791673
400791636
400791703
400791668
400791707
400791679
400791665
400791715
400791677
400791713
400791710
400791717
400791507
400791509
400791510
400791514
400791516
400791502
400791505
400791504
400791522
400791506
400791519
400791512
400791526
400791532
400791537
400791561
400791568
400791611
400791647
400791619
400791643
400791576


In [11]:
# Add columns to gamedata_df for home and away team dictionaries
gamedata_df['home'] = pd.Series(game_home)
gamedata_df['away'] = pd.Series(game_away)

print(gamedata_df[gamedata_df['season']>=2004].head(5))
allplays_df.sample(10)

                        result  season  week home away
gameId                                                
290910023  PIT 13, TEN 10 (OT)    2009     1  TEN  PIT
290913001        ATL 19, MIA 7    2009     1  MIA  ATL
290913004        DEN 12, CIN 7    2009     1  DEN  CIN
290913005       MIN 34, CLE 20    2009     1  MIN  CLE
290913011       IND 14, JAX 12    2009     1  JAX  IND


Unnamed: 0,downdist,detail,play_num,home,away,possession,home_score_before,away_score_before,home_score_after,away_score_after,drive_num,gameId
9,[4th and Goal at CHI 20],(4:47 - 4th) M.Crosby 38 yard field goal is B...,10,GB,CHI,GB,38,17,38,17,14,400554207
0,[1st and 10 at CIN 45],(5:45 - 2nd) M.Stafford pass incomplete deep m...,1,DET,CIN,DET,7,14,7,14,9,291206004
6,[1st and 10 at DEN 36],(0:30 - 2nd) (Shotgun) B.Osweiler pass incomp...,7,NYG,DEN,DEN,17,3,17,3,11,400951782
10,[2nd and 3 at CHI 11],(11:06 - 4th) A.Luck pass short left to D.Aver...,11,IND,CHI,IND,14,34,21,34,22,320909003
2,[3rd and 9 at NO 38],(12:44 - 3rd) (Shotgun) M.Stafford pass incom...,3,DET,NO,DET,10,31,10,31,15,400951704
5,[4th and 8 at DET 32],(0:50 - 1st) (Punt formation) S.Martin punts ...,6,DET,CHI,DET,0,7,0,7,5,400874668
1,[1st and 10 at DET 14],(8:25 - 2nd) T.Riddick right guard to DET 18 ...,2,DET,CHI,DET,7,17,14,17,6,400951701
2,[1st and 10 at ATL 39],(2:33 - 2nd) M.Ryan pass incomplete deep left...,3,ATL,KC,ATL,17,17,20,17,6,320909012
3,[1st and 10 at SF 37],(5:47 - 3rd) SF #3-Beathard in at QB. C.Hyde ...,4,SF,ARI,SF,9,9,12,9,14,400951735
1,[2nd and 7 at CIN 17],(14:16 - 1st) C.Benson left guard to CIN 21 fo...,2,MIA,CIN,CIN,0,0,0,7,0,301031004


In [12]:
alldrives_df.sample(10)

Unnamed: 0,away,away_score_after,away_score_before,drive,home,home_score_after,home_score_before,offense,plays,result,time,yds_gained,gameId
20,SD,26,26,21,BAL,31,28,BAL,4,Field Goal,1:34,0,290920024
12,NO,14,14,13,SF,17,10,NO,5,Intercepted Pass,2:53,26,331117018
19,NO,38,38,20,NE,17,17,NE,3,Punt,1:21,2,291130018
4,STL,10,3,4,DEN,0,0,STL,1,Touchdown,0:08,63,400554383
15,SD,38,31,16,GB,45,45,SD,3,Touchdown,1:07,55,311106024
28,CHI,20,20,29,SEA,23,23,SEA,4,End of Game,1:54,-3,301017003
11,IND,10,10,12,ARI,3,3,ARI,3,Punt,1:32,8,400951632
21,BAL,7,7,22,MIN,12,6,MIN,5,Touchdown,1:40,20,331208033
18,SEA,39,32,19,IND,18,18,SEA,9,Touchdown,4:46,74,400951747
13,WSH,10,10,14,DAL,7,0,DAL,6,Touchdown,3:26,34,300912028


## Begin processing the dataFrames

In [13]:
# Get information about winning team and final scores
winner = {}
home_score = {}
away_score = {}
ot = {}
for i in list(gamedata_df.index.values):
    try:
        final = gameresults[i]
    
        # Winner should be first team listed
        winner[i] = final.split()[0]
    
        if "(OT)" in final:
            ot[i] = 1
        else:
            ot[i] = 0
        
        if game_home[i] == winner[i]:
            # Home team wins, their score is listed first
            home_score[i] = final.split()[1].rstrip(",")
            away_score[i] = final.split()[3]
        else:
            # Away team wins, their score is listed first
            home_score[i] = final.split()[3]
            away_score[i] = final.split()[1].rstrip(",")
        
        # Check for a tie
        if home_score[i] == away_score[i]:
            winner[i] = "TIE"
            
    except Exception:
        # No home/away info (scrape failed) or a result like "Postponed"
        winner[i] = "unknown"
        ot[i] = "unknown"
        home_score[i] = "unknown"
        away_score[i] = "unknown"
    
gamedata_df['winner'] = pd.Series(winner)
gamedata_df['home_score'] = pd.Series(home_score)
gamedata_df['away_score'] = pd.Series(away_score)
gamedata_df['OT'] = pd.Series(ot)
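
For comparison, here is a minimal regex-based sketch of the same result parsing. It assumes result strings look like "PIT 13, TEN 10 (OT)" with the winner listed first, as in the table above. The names `RESULT_RE` and `parse_result` are hypothetical.

```python
import re

# Two "TEAM score" pairs separated by a comma, with an optional "(OT)" suffix
RESULT_RE = re.compile(r"^([A-Z]+) (\d+), ([A-Z]+) (\d+)( \(\d?OT\))?$")


def parse_result(final, home_team):
    """Parse a result string; return None for anything that isn't a final score."""
    m = RESULT_RE.match(final.strip())
    if m is None:
        return None  # e.g. "Postponed"
    first, first_pts, second, second_pts, ot = m.groups()
    first_pts, second_pts = int(first_pts), int(second_pts)
    if first == home_team:
        home_score, away_score = first_pts, second_pts
    else:
        home_score, away_score = second_pts, first_pts
    winner = "TIE" if first_pts == second_pts else first
    return {"winner": winner,
            "home_score": home_score,
            "away_score": away_score,
            "OT": int(ot is not None)}


print(parse_result("PIT 13, TEN 10 (OT)", "TEN"))
```

Unlike the loop above, this keeps scores as integers and reports unparseable rows as `None` rather than the string "unknown".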

In [14]:
# Games where grabbing pbp failed have some unknown values
gamedata_df[ gamedata_df['winner'] == "unknown" ]

gameId,result,season,week,home,away,winner,home_score,away_score,OT
321014015,"MIA 17, STL 14",2012,6,,,unknown,unknown,unknown,unknown
321104030,"DET 31, JAX 14",2012,9,,,unknown,unknown,unknown,unknown
331201020,"MIA 23, NYJ 3",2013,13,,,unknown,unknown,unknown,unknown
400554331,Postponed,2014,12,,,unknown,unknown,unknown,unknown
400554366,"WSH 27, PHI 24",2014,16,,,unknown,unknown,unknown,unknown
400554443,"NO 23, TB 20",2014,17,,,unknown,unknown,unknown,unknown
400874508,"SEA 6, ARI 6 (OT)",2016,7,,,unknown,unknown,unknown,unknown
400951581,Postponed,2017,1,,,unknown,unknown,unknown,unknown
400951748,"BUF 22, MIA 16",2017,17,,,unknown,unknown,unknown,unknown


In [15]:
# Double check a game that couldn't get home/away
url = "http://www.espn.com/nfl/playbyplay?gameId="+"400554366"
g_df,d_df = get_game_df(url)

AttributeError: 'NoneType' object has no attribute 'find'

In [16]:
gamedata_df.sample(15)

gameId,result,season,week,home,away,winner,home_score,away_score,OT
291011014,"MIN 38, STL 10",2009,5,MIN,STL,MIN,38,10,0
291018016,"MIN 33, BAL 31",2009,6,BAL,MIN,MIN,31,33,0
320913009,"GB 23, CHI 10",2012,2,CHI,GB,GB,10,23,0
291206030,"JAX 23, HOU 18",2009,13,HOU,JAX,JAX,18,23,0
321104011,"IND 23, MIA 20",2012,9,MIA,IND,IND,20,23,0
400791741,"SEA 13, DET 10",2015,4,DET,SEA,SEA,10,13,0
400874508,"SEA 6, ARI 6 (OT)",2016,7,,,unknown,unknown,unknown,unknown
400874699,"SEA 25, SF 23",2016,17,SEA,SF,SEA,25,23,0
311204015,"MIA 34, OAK 14",2011,13,OAK,MIA,MIA,14,34,0
311208023,"PIT 14, CLE 3",2011,14,CLE,PIT,PIT,3,14,0


## And start working with the play-by-play data

In [17]:
# Start by saving all dataFrames to disk in case I mess anything up
gamedata_df.to_csv("../data/espn_gamedata2009-2017.csv")
allplays_df.to_csv("../data/espn_rawplays2009-2017.csv")
alldrives_df.to_csv("../data/espn_drives2009-2017.csv")

In [4]:
# Load dataframes from disk
gamedata_df = pd.read_csv("../data/espn_gamedata2009-2017.csv")
allplays_df = pd.read_csv("../data/espn_rawplays2009-2017.csv")
alldrives_df = pd.read_csv("../data/espn_drives2009-2017.csv")

In [5]:
# Take another look at what we've got so far
gamedata_df.set_index('gameId', inplace=True)
print(allplays_df.info())
allplays_df.sample(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396448 entries, 0 to 396447
Data columns (total 13 columns):
Unnamed: 0           396448 non-null int64
downdist             396448 non-null object
detail               396448 non-null object
play_num             396448 non-null int64
home                 396448 non-null object
away                 396448 non-null object
possession           396448 non-null object
home_score_before    396448 non-null int64
away_score_before    396448 non-null int64
home_score_after     396448 non-null int64
away_score_after     396448 non-null int64
drive_num            396448 non-null int64
gameId               396448 non-null int64
dtypes: int64(8), object(5)
memory usage: 39.3+ MB
None


Unnamed: 0.1,Unnamed: 0,downdist,detail,play_num,home,away,possession,home_score_before,away_score_before,home_score_after,away_score_after,drive_num,gameId
16479,1,['3rd and 20 at CAR 28'],(6:16 - 2nd) J.Delhomme pass short left to S.S...,2,BUF,CAR,CAR,7,2,7,2,14,291025029
295706,1,['1st and 10 at NYJ 20'],(15:00 - 1st) R.Fitzpatrick scrambles right e...,2,NYJ,NYG,NYJ,0,0,0,0,0,400791572
378605,8,['2nd and 5 at MIN 14'],(11:23 - 1st) (No Huddle) J.Goff pass incompl...,9,LAR,MIN,LAR,0,0,7,0,0,400951775
167986,3,['3rd and 4 at PIT 37'],(6:37 - 1st) B.Roethlisberger pass short midd...,4,PIT,DAL,PIT,0,3,0,3,2,321216006
192100,4,['1st and 15 at SF 45'],(9:31 - 2nd) C.Kaepernick pass short left to ...,5,SF,TEN,SF,3,0,10,0,5,331020010


In [6]:
# Start by trying to parse down, distance, and field position

# Make lists to populate with a value for each play
down = []
dist = []
home_fieldpos = []

# Make lists out of home and away teams for comparison with field position
hometeam = allplays_df.home.values
awayteam = allplays_df.away.values

# Function to return field position from the home team's point of view:
# negative on the home team's side of the 50, positive on the away team's side
def get_fieldpos(teamside,ydline,j):
    if teamside == hometeam[j]:
        # Ball is on the home team's half. Location should be negative
        return -1*(50-ydline)
    elif teamside == awayteam[j]:
        return 50-ydline
    else:
        return "x"
        

for j, c in enumerate(allplays_df['downdist'].values):
    
    x = c.strip("[]'")
    # Check for an empty list. This probably means an end of quarter/half line
    if not x:
        down.append(0)
        dist.append(0)
        home_fieldpos.append(0)
#        print("Found empty list")
        
    else:
        
        pieces = x.split()
#        print(pieces)
    
        # Get down
        if not pieces[0][0].isalpha():  # check if first character is alphabetic
            # Then first character is numeric. This is the down number.
            down.append(int(pieces[0][0]))
        else:
            down.append(0)
        
        # Get distance
        for i, word in enumerate(pieces):
            if word == "and":
                dist.append(pieces[i+1])  # Keep as string to preserve goal-to-go situations
        
        # Get fieldposition from the home team's perspective
        for i, word in enumerate(pieces):
            if word == "at":
                if pieces[i+1] == '50':
                    home_fieldpos.append(0)
                else:
                    teamside = pieces[i+1]
                    ydline = int(pieces[i+2])
                    
                    fieldpos = get_fieldpos(teamside,ydline,j)
                    
                    if fieldpos == "x":  # Failed to match teamside with home/away teams
                        # Change teamside in a couple cases to account for teams moving
                        # ESPN seems to always use most recent short form name in fieldposition
                        if teamside == "LAR":
                            teamside = "STL"
                        elif teamside == "LAC":
                            teamside = "SD"
                        # Try again with new teamside
                        fieldpos = get_fieldpos(teamside,ydline,j)
                    
                    if fieldpos == "x":
                        home_fieldpos.append(0)
                        print(pieces)
                        print("Failed to find side of field correctly")
                        
                    else:
                        home_fieldpos.append(fieldpos)
                        
                
#    print([down[-1], dist[-1], home_fieldpos[-1]])
allplays_df['down'] = down
allplays_df['dist'] = dist
allplays_df['home_fieldpos'] = home_fieldpos
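
The same down/distance/field-position parse can be sketched with a single regular expression. This is an illustration, not a drop-in replacement: it does not handle the relocated-team aliases (LAR/STL, LAC/SD) dealt with above, and `DOWNDIST_RE` and `parse_downdist` are hypothetical names.

```python
import re

# "3rd and 20 at CAR 28", "1st and Goal at SF 3", or "... at 50" for midfield
DOWNDIST_RE = re.compile(
    r"^(\d)(?:st|nd|rd|th) and (\w+) at (?:([A-Z]+) (\d+)|50)$")


def parse_downdist(text, home, away):
    """Return (down, dist, home_fieldpos), or None if the text doesn't match."""
    m = DOWNDIST_RE.match(text.strip())
    if m is None:
        return None
    down, dist, side, ydline = m.groups()
    if side is None:
        fieldpos = 0                     # ball is on the 50
    elif side == home:
        fieldpos = -(50 - int(ydline))   # home team's half is negative
    elif side == away:
        fieldpos = 50 - int(ydline)
    else:
        fieldpos = None                  # side matches neither team
    return int(down), dist, fieldpos     # dist stays a string to keep "Goal"


print(parse_downdist("1st and 10 at TEN 27", home="TEN", away="MIA"))
```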

In [7]:
allplays_df.sample(5)

Unnamed: 0.1,Unnamed: 0,downdist,detail,play_num,home,away,possession,home_score_before,away_score_before,home_score_after,away_score_after,drive_num,gameId,down,dist,home_fieldpos
152962,4,['1st and 10 at TEN 27'],(12:55 - 3rd) J.Locker pass incomplete deep ri...,5,TEN,MIA,TEN,24,3,31,3,15,321111015,1,10,-23
229518,0,['1st and 10 at KC 11'],(2:24 - 2nd) (Shotgun) A.Smith pass short rig...,1,NE,KC,KC,0,14,0,17,8,400554325,1,10,39
285113,2,['2nd and 11 at NO 44'],(10:05 - 4th) (Shotgun) M.Mariota pass short ...,3,TEN,NO,TEN,20,28,28,28,18,400791722,2,11,6
9003,11,['3rd and 1 at BAL 17'],(8:30 - 4th) S.Vollmer reported in as eligible...,12,BAL,NE,NE,21,24,21,27,15,291004017,3,1,-33
225247,3,['2nd and 16 at BAL 35'],(2:39 - 4th) (Shotgun) J.Flacco pass incomple...,4,BAL,CLE,BAL,20,21,20,21,18,400554240,2,16,-15


In [8]:
# Now look to extract the time remaining (in seconds)
detail = allplays_df.detail.values

# Make lists for quarter and time_remaining
qtr = []
time_rem = []

for d in detail[:]:
#    print(d)
#    print(type(d))
    try:
        if not d:
            # detail is an empty list
            qtr.append(0)
            time_rem.append("0:00")
        
        else:
            pieces = d.split()
#            print(pieces)
            if pieces[0][0] == "E":
                # Found End of Quarter/Overtime line
                qtr.append(0)
                time_rem.append("0:00")

            elif pieces[0][0] == "(":
                # Found beginning of standard "(1:23 - 4th)" template
                qtr.append(pieces[2][0])
                time_rem.append(pieces[0].lstrip("("))
            
            else:
                # Not sure what this is, so print it and record quarter 0 with 0:00 remaining
                print(d)
                qtr.append(0)
                time_rem.append("0:00")
            
    except:
        print("Default parse failed for:")
        print(d)
        qtr.append(0)
        time_rem.append("0:00")
            
#    print(qtr[-1])
#    print(time_rem[-1])
    
allplays_df['qtr'] = qtr
allplays_df['time_rem'] = time_rem

 R. Tannehill sacked at MIA 1 for -8 yards (C. Avril).
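
A minimal sketch of the same clock parse using a regular expression, assuming each play description starts with a prefix like "(14:52 - 1st)". The overtime example is an assumed format, and `CLOCK_RE` and `parse_clock` are hypothetical names.

```python
import re

# Matches the leading "(M:SS - period)" prefix of a play description
CLOCK_RE = re.compile(r"^\((\d{1,2}:\d{2}) - (\w+)\)")


def parse_clock(detail):
    """Return (quarter_char, time_remaining), or None if there is no clock prefix."""
    m = CLOCK_RE.match(detail.strip())
    if m is None:
        return None
    time_rem, period = m.groups()
    return period[0], time_rem  # "1st" -> "1", "OT" -> "O"


print(parse_clock("(14:52 - 1st)  B.Hoyer pass short right"))
print(parse_clock(" R. Tannehill sacked at MIA 1 for -8 yards (C. Avril)."))
```

Lines without the prefix, like the Tannehill sack printed above, come back as `None` so they can be handled separately.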


In [9]:
allplays_df[['downdist','detail','down','dist','qtr','time_rem']].sample(25)

Unnamed: 0,downdist,detail,down,dist,qtr,time_rem
55811,['2nd and 8 at LAC 45'],(8:17 - 4th) J.Campbell pass short left to L.M...,2,8,4,8:17
56483,['3rd and 1 at GB 27'],(2:18 - 3rd) L.Polite up the middle to GB 27 f...,3,1,3,2:18
314090,['3rd and 28 at BUF 13'],(13:49 - 4th) (Shotgun) T.Taylor pass short l...,3,28,4,13:49
64991,['1st and 10 at OAK 44'],(15:00 - 4th) M.Cassel pass incomplete short ...,1,10,4,15:00
43962,['1st and 10 at DET 15'],(11:14 - 3rd) Shaun.Hill pass short right to J...,1,10,3,11:14
285799,['3rd and 8 at ATL 26'],(3:25 - 2nd) (Shotgun) B.Gabbert pass incompl...,3,8,2,3:25
381200,['1st and 10 at KC 24'],(13:46 - 2nd) L.McCoy up the middle to KC 23 ...,1,10,2,13:46
64833,['4th and 2 at SEA 2'],(0:02 - 3rd) L.Tynes 20 yard field goal is GOO...,4,2,3,0:02
348444,['1st and 10 at IND 22'],(10:48 - 3rd) DeAndre Washington 22 Yard Rush ...,1,10,3,10:48
7906,['2nd and 5 at ARI 45'],(0:22 - 3rd) K.Warner pass incomplete short le...,2,5,3,0:22


In [10]:
# Make a column for seconds remaining in the game
qtr = allplays_df.qtr.values
time_rem = allplays_df.time_rem.values

secs_rem = []

for i, tr in enumerate(time_rem):
    if qtr[i] in ["1","2","3","4"]:
        q = int(qtr[i])
    elif qtr[i] == "O":
        # Overtime: count the clock as if it were the 4th quarter
        q = 4
    else:
        q = 0
    mins = int(tr.split(":")[0])
    secs = int(tr.split(":")[1])
    secs_rem.append( 900*(4-q) + 60*mins + secs )
    
allplays_df['secs_rem'] = secs_rem

In [11]:
allplays_df[['qtr','time_rem','secs_rem']].sample(10)

Unnamed: 0,qtr,time_rem,secs_rem
394282,2,0:24,1824
26824,4,7:57,477
163533,3,10:43,1543
289002,2,1:30,1890
277313,4,10:52,652
343524,1,3:58,2938
128315,2,0:16,1816
119343,4,12:01,721
36310,2,1:01,1861
343083,3,5:06,1206
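
The conversion is 900 seconds for each full quarter still to come, plus the time left in the current quarter. A standalone sketch of the formula follows, mirroring the loop above, with the hypothetical function name `seconds_remaining`:

```python
def seconds_remaining(qtr, time_rem):
    """Seconds left in regulation, given a quarter character and an 'M:SS' clock."""
    if qtr in ("1", "2", "3", "4"):
        q = int(qtr)
    elif qtr == "O":
        q = 4   # overtime is counted like the 4th quarter
    else:
        q = 0   # placeholder rows (end of quarter, unparsed lines)
    mins, secs = time_rem.split(":")
    return 900 * (4 - q) + 60 * int(mins) + int(secs)


# Second quarter, 0:24 on the clock: two full quarters (1800 s) plus 24 s
print(seconds_remaining("2", "0:24"))
```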


In [12]:
allplays_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396448 entries, 0 to 396447
Data columns (total 19 columns):
Unnamed: 0           396448 non-null int64
downdist             396448 non-null object
detail               396448 non-null object
play_num             396448 non-null int64
home                 396448 non-null object
away                 396448 non-null object
possession           396448 non-null object
home_score_before    396448 non-null int64
away_score_before    396448 non-null int64
home_score_after     396448 non-null int64
away_score_after     396448 non-null int64
drive_num            396448 non-null int64
gameId               396448 non-null int64
down                 396448 non-null int64
dist                 396448 non-null object
home_fieldpos        396448 non-null int64
qtr                  396448 non-null object
time_rem             396448 non-null object
secs_rem             396448 non-null int64
dtypes: int64(11), object(8)
memory usage: 57.5+ MB


In [13]:
# Make column for score difference
allplays_df['home_lead'] = allplays_df['home_score_before'] - allplays_df['away_score_before']

# Make column for total score
allplays_df['total_score'] = allplays_df['home_score_before'] + allplays_df['away_score_before']

In [14]:
# Make a column for adjusted lead: the home lead divided by the square root
# of (seconds elapsed in regulation + 1)
import math
def adjusted_lead(play):
    try:
        return play.home_lead / math.sqrt( 3600-play.secs_rem + 1 )
    except (ValueError, ZeroDivisionError):
        # secs_rem above 3600 can make the square root invalid or the denominator zero
        return 0
    
allplays_df['adj_lead'] = allplays_df.apply(adjusted_lead, axis=1)
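
To make the metric concrete, here is the same formula as a plain function of two numbers, checked against values that show up in the tables further down. It is a sketch for illustration only; `adjusted_lead_value` is a hypothetical name, and the fallback to 0 mirrors the `except` branch above.

```python
import math


def adjusted_lead_value(home_lead, secs_rem):
    """Home lead divided by sqrt(seconds elapsed in regulation + 1)."""
    denom_sq = 3600 - secs_rem + 1
    if denom_sq <= 0:
        # secs_rem above 3600: no valid square root (or a zero denominator)
        return 0
    return home_lead / math.sqrt(denom_sq)


# Home team down 3 with 1662 seconds left: -3 / sqrt(1939)
print(round(adjusted_lead_value(-3, 1662), 6))
# Home team up 17 with 1793 seconds left: 17 / sqrt(1808)
print(round(adjusted_lead_value(17, 1793), 6))
```

Note that the denominator grows with time elapsed, so the same lead gets a smaller adjusted value later in the game.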

In [15]:
# Make a column for whether this play takes place in overtime
allplays_df['OT'] = [1 if " OT" in str(p) else 0 for p in allplays_df.detail.values]

In [16]:
allplays_df[allplays_df.OT == 1].sample(10)

Unnamed: 0.1,Unnamed: 0,downdist,detail,play_num,home,away,possession,home_score_before,away_score_before,home_score_after,...,down,dist,home_fieldpos,qtr,time_rem,secs_rem,home_lead,total_score,adj_lead,OT
12193,8,['1st and 10 at NE 29'],(11:03 - OT) L.Jordan up the middle to NE 27 f...,9,NE,DEN,DEN,17,17,17,...,1,10,-21,O,11:03,663,0,34,0.0,1
11043,2,['3rd and 9 at DAL 17'],(11:17 - OT) T.Romo pass incomplete deep left...,3,DAL,KC,DAL,20,20,20,...,3,9,-33,O,11:17,677,0,40,0.0,1
232800,6,[],(5:18 - OT) Timeout #2 by CAR at 05:18.,7,CAR,CIN,CAR,34,37,37,...,0,0,0,O,5:18,318,-3,71,-0.052358,1
156572,2,['2nd and 10 at HOU 31'],(14:28 - OT) M.Schaub pass short middle to A.F...,3,JAX,HOU,HOU,34,34,34,...,2,10,19,O,14:28,868,0,68,0.0,1
227004,0,[],(15:00 - OT) B.McManus kicks 65 yards from DEN...,1,DEN,SEA,SEA,20,20,20,...,0,0,0,O,15:00,900,0,40,0.0,1
129739,3,['1st and 10 at SEA 48'],(11:53 - OT) L.Stephens-Howling right end to S...,4,SEA,ARI,ARI,20,20,20,...,1,10,-2,O,11:53,713,0,40,0.0,1
277389,15,['4th and 3 at CLE 16'],(4:56 - OT) Brandon McManus 34 Yd Field Goal,16,DEN,CLE,DEN,23,23,26,...,4,3,34,O,4:56,296,0,46,0.0,1
294171,2,['2nd and 10 at NE 20'],(14:57 - OT) (Shotgun) T.Brady sacked at NE 1...,3,NE,DEN,NE,24,24,24,...,2,10,-30,O,14:57,897,0,48,0.0,1
56897,0,['1st and 10 at BAL 20'],(15:00 - OT) J.Flacco pass short middle to R.R...,1,BAL,NE,BAL,20,20,20,...,1,10,-30,O,15:00,900,0,40,0.0,1
116022,3,['3rd and 14 at LAC 20'],(10:48 - OT) P.Rivers pass incomplete short m...,4,DEN,SD,SD,13,13,13,...,3,14,30,O,10:48,648,0,26,0.0,1


In [17]:
# Make column for whether home team has possession
# (1 = home team, 0 = away team, "X" = possession matches neither team)
hometeam = allplays_df.home.values
awayteam = allplays_df.away.values
possession = allplays_df.possession.values

home_possession = [
    1 if hometeam[i] == p else 0 if awayteam[i] == p else "X" for i, p in enumerate(possession)
]
allplays_df['home_possession'] = home_possession

In [18]:
# Make a column for whether the home team wins
gamedata_df.columns

Index(['result', 'season', 'week', 'home', 'away', 'winner', 'home_score',
       'away_score', 'OT'],
      dtype='object')

In [19]:
hometeam = gamedata_df.home.values
awayteam = gamedata_df.away.values
winner = gamedata_df.winner.values

# Make new home_win column for gamedata (1 = home wins, 0 = away wins, "X" = neither, e.g. a tie)
home_wins = [1 if hometeam[i] == w else 0 if awayteam[i] == w else "X" for i, w in enumerate(winner)]
gamedata_df['home_win'] = home_wins

# Try pandas join
#allplays_df = allplays_df.join(gamedata_df['home_win'], on='gameId')

In [20]:
# Must be an easier way to make the home_win column but hopefully this works
joined_df = pd.merge(allplays_df, gamedata_df[['season','week','home_win']], 
                     how='left',
                     left_on='gameId',
                     right_index=True)
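
One way to check that a merge like this behaves as hoped is to pass `validate` and `indicator` to `pd.merge`. The sketch below uses made-up toy frames (`plays` and `games` are hypothetical stand-ins for `allplays_df` and `gamedata_df`, and it assumes the game table is indexed by game id, as the merge above does).

```python
import pandas as pd

# Toy stand-ins; the values are invented for illustration
plays = pd.DataFrame({
    "gameId": [101, 101, 102, 103],
    "detail": ["play a", "play b", "play c", "play d"],
})
games = pd.DataFrame(
    {"season": [2016, 2017], "week": [1, 2], "home_win": [1, 0]},
    index=[101, 102],  # indexed by game id
)

merged = pd.merge(
    plays, games,
    how="left",
    left_on="gameId",
    right_index=True,
    validate="many_to_one",  # raises MergeError if a game id repeats in `games`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Plays whose game id has no row in `games` get NaN for home_win
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

If `unmatched` is empty on the real data, every play found its game; if the row count of `merged` equals that of the play table, no plays were duplicated by the join.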

In [21]:
joined_df.columns

Index(['Unnamed: 0', 'downdist', 'detail', 'play_num', 'home', 'away',
       'possession', 'home_score_before', 'away_score_before',
       'home_score_after', 'away_score_after', 'drive_num', 'gameId', 'down',
       'dist', 'home_fieldpos', 'qtr', 'time_rem', 'secs_rem', 'home_lead',
       'total_score', 'adj_lead', 'OT', 'home_possession', 'season', 'week',
       'home_win'],
      dtype='object')

In [22]:
final_cols = [
    'downdist', 'detail', 'home', 'away', 'possession',
       'home_score_before', 'away_score_before', 'gameId', 'down', 'dist', 'home_fieldpos',
       'qtr', 'time_rem', 'secs_rem', 'home_lead', 'total_score', 'adj_lead',
       'OT','home_possession','home_win','season','week'
]
print(joined_df[final_cols].describe())
joined_df[final_cols].sample(20)

       home_score_before  away_score_before        gameId           down  \
count      396448.000000      396448.000000  3.964480e+05  396448.000000   
mean           10.384837          11.824252  3.515140e+08       1.789629   
std             9.514909          10.177912  4.557265e+07       1.143247   
min             0.000000           0.000000  2.909100e+08       0.000000   
25%             3.000000           3.000000  3.110090e+08       1.000000   
50%             7.000000          10.000000  3.311100e+08       2.000000   
75%            17.000000          17.000000  4.007917e+08       3.000000   
max            59.000000          62.000000  4.009814e+08       4.000000   

       home_fieldpos       secs_rem      home_lead    total_score  \
count  396448.000000  396448.000000  396448.000000  396448.000000   
mean       -0.259605    1733.033508      -1.439415      22.209089   
std        24.703645    1063.493014      10.958499      16.375533   
min       -57.000000       0.000000    

Unnamed: 0,downdist,detail,home,away,possession,home_score_before,away_score_before,gameId,down,dist,...,time_rem,secs_rem,home_lead,total_score,adj_lead,OT,home_possession,home_win,season,week
355456,['1st and 10 at KC 13'],(12:42 - 3rd) (Shotgun) A.Smith pass short le...,PHI,KC,KC,3,6,400951636,1,10,...,12:42,1662,-3,9,-0.068129,0,0,0,2017,2
224163,['1st and 10 at OAK 24'],(14:53 - 3rd) M.Watson reported in as eligibl...,HOU,OAK,OAK,17,0,400554284,1,10,...,14:53,1793,17,17,0.399806,0,0,1,2014,2
395503,['1st and 10 at JAX 18'],(9:17 - 3rd) D.Henry left end pushed ob at JA...,JAX,TEN,TEN,3,12,400951788,1,10,...,9:17,1457,-9,15,-0.19437,0,0,0,2017,17
155818,['2nd and 6 at KC 42'],(12:41 - 4th) A.Dalton pass incomplete short r...,CIN,KC,CIN,21,6,321118012,2,6,...,12:41,761,15,27,0.28147,0,1,1,2012,11
299256,[],END QUARTER 3,OAK,DEN,OAK,9,12,400791608,0,0,...,0:00,3600,-3,21,-3.0,0,1,1,2015,14
129342,['2nd and 6 at KC 47'],(12:52 - 4th) W.McGahee up the middle to KC 4...,KC,DEN,DEN,7,3,320101007,2,6,...,12:52,772,4,10,0.075204,0,0,1,2011,17
105482,['and -1 at PIT 35'],(15:00 - 3rd) S.Suisham kicks 61 yards from PI...,NE,PIT,NE,10,17,311030023,0,-1,...,15:00,1800,-7,27,-0.164946,0,1,0,2011,8
222343,['2nd and 9 at ATL 2'],(5:29 - 4th) M.Ryan pass short right to L.Toi...,ATL,CIN,ATL,10,24,400554245,2,9,...,5:29,329,-14,34,-0.244749,0,1,0,2014,2
289972,['2nd and 5 at PHI 43'],"(6:33 - 1st) (No Huddle, Shotgun) M.Sanchez p...",TB,PHI,PHI,0,7,400791668,2,5,...,6:33,3093,-7,7,-0.310575,0,0,1,2015,11
50225,['2nd and 6 at HOU 22'],(14:22 - 3rd) A.Foster right guard to HST 25 f...,DAL,HOU,HOU,10,3,300926034,2,6,...,14:22,1762,7,13,0.163233,0,0,1,2010,3


In [23]:
# Objective: Assuming no fumbled snap or pre-snap penalty or other shenanigans,
#  should be able to figure out which plays are "successful" and then look at success rates

# Practical things to deal with:
    # What raw data to handle?
    # Need to be able to take a raw-ish pbp table and extract success rate
    # So make one function that will try and label plays as run/pass, yardage gained, and success
def found_pass(detail):
    d = detail.lower()
    pass_terms = [" pass", " sacked", " scramble",
                  "interception", "intercepted"]
    for term in pass_terms:
        if term in d:
            return True
    return False

def found_scramble(detail):
    d = detail.lower()
    if " scramble" in d:
        return True
    return False
    
def found_run(detail):
    d = detail.lower()
    run_terms = [" run ", " rush", " left tackle ", " up the middle ",
                 " left end ", " right end ", " left guard ", " right guard "]
    if " scramble" not in d:
        for term in run_terms:
            if term in d:
                return True
    return False

        
def found_punt(detail):
    d = detail.lower()
    if " punts " in d:
        return True
    elif " punt return" in d:
        return True
    return False
    
def found_fieldgoal(detail):
    d = detail.lower()
    if " field goal" in d:
        return True
    return False

def yds_run( i, detail ):
    words = detail.lower().split()
    # look for yardage in format "for X yards"
    for j, w in enumerate(words):
        if w == "for" and len(words) > j+2:
            if words[j+2].rstrip(".,") in ("yd","yds","yrd","yrds","yard","yards"):
                return int(words[j+1])
            # or "for no gain"
            elif "no" in words[j+1] and "gain" in words[j+2]:
                return 0
        
        # or "X yard run/rush"
        elif w in ("yd","yds","yrd","yrds","yard","yards") and len(words) >= j+2:
            if words[j+1].rstrip(".,") in ("run","rush"):
                return int(words[j-1])
        
    return "x"
    
def yds_passed( i, detail ):
    words = detail.lower().split()
    # look for yardage in format "for X yards"
    for j, w in enumerate(words):
        if w == "for" and len(words) > j+2:
            if words[j+2].rstrip(".,") in ("yd","yds","yrd","yrds","yard","yards"):
                return int(words[j+1])
            # or "for no gain"
            elif "no" in words[j+1] and "gain" in words[j+2]:
                return 0
            
        # or "X yard pass"
        elif w in ("yd","yds","yrd","yrds","yard","yards") and len(words) >= j+2:
            if words[j+1].rstrip(".,") == "pass":
                return int(words[j-1])

    # Or maybe pass went incomplete
    if "incomplete" in detail.lower():
        return 0
    
    # Or maybe pass was intercepted. In this case, just say yds_gained is zero
    elif ("intercepted" in detail.lower()) or ("interception" in detail.lower()):
        return 0
    
    return "x"

    
def parse_details(df):
    print(df.columns)
    details = df.detail.values
    down = df.down.values
    
    # Make a bunch of parallel lists for storing play-specific data
    # Each list has one entry per play, in the same order as "details"
    is_parseable = [False for d in details]
    is_run = [False for d in details]
    is_scramble = [False for d in details]
    is_pass = [False for d in details]
    is_punt = [False for d in details]
    is_fieldgoal = [False for d in details]
    yds_gained = ["x" for d in details]
    runpass_play = [False for d in details]
    
    # Loop through details going through logic tree to find appropriate values
    for i, d in enumerate(details):
        
        # Look exclusively for play details on downs 1-4
        if down[i] in [1,2,3,4]:
            
            # Parse a scramble
            if found_scramble(d):
                is_scramble[i] = True
                yds_gained[i] = yds_run(i,d)
            
            # Try and parse a run
            if found_run(d):
                is_run[i] = True
                yds_gained[i] = yds_run(i,d)
            
            # Try and parse a pass (found_pass also matches sacks and scrambles,
            # so those plays are flagged as passes too)
            if found_pass(d):
                is_pass[i] = True
                yds_gained[i] = yds_passed(i,d)
            
            # Try and parse a punt
            elif found_punt(d):
                is_punt[i] = True
            
            # Try and parse a field goal
            elif found_fieldgoal(d):
                is_fieldgoal[i] = True
                
    for i, yds in enumerate(yds_gained):
        if (is_run[i] or is_pass[i] or is_scramble[i]) and (yds != "x"):
            is_parseable[i] = True
            runpass_play[i] = True
        elif is_punt[i]:
            is_parseable[i] = True
        elif is_fieldgoal[i]:
            is_parseable[i] = True
                
                
    # Now write the columns to the end of the df
    df['is_parseable'] = is_parseable
    df['is_run'] = is_run
    df['is_pass'] = is_pass
    df['is_scramble'] = is_scramble
    df['is_punt'] = is_punt
    df['is_fieldgoal'] = is_fieldgoal
    df['yds_gained'] = yds_gained
    df['runpass'] = runpass_play
    
    return df
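
The word-by-word yardage search in `yds_run` and `yds_passed` could also be expressed with regular expressions. The following is a sketch under stated assumptions, not a drop-in replacement: `yards_from_detail` is a hypothetical helper, it returns `None` instead of `"x"` when nothing is found, it tries the three patterns in a fixed order rather than scanning left to right, and it does not handle incompletions or interceptions.

```python
import re

_UNIT = r"(?:yards?|yds?|yrds?)"
# "for X yards" (X may be negative, e.g. on a sack)
_FOR_YARDS = re.compile(r"\bfor (-?\d+) " + _UNIT + r"\b", re.IGNORECASE)
# "for no gain"
_NO_GAIN = re.compile(r"\bfor no gain\b", re.IGNORECASE)
# "X yard run/rush/pass"
_YARD_PLAY = re.compile(r"\b(\d+) " + _UNIT + r" (?:run|rush|pass)\b", re.IGNORECASE)


def yards_from_detail(detail):
    """Return yards gained as an int, or None if no yardage pattern matches."""
    match = _FOR_YARDS.search(detail)
    if match:
        return int(match.group(1))
    if _NO_GAIN.search(detail):
        return 0
    match = _YARD_PLAY.search(detail)
    if match:
        return int(match.group(1))
    return None


print(yards_from_detail("R. Tannehill sacked at MIA 1 for -8 yards (C. Avril)."))
print(yards_from_detail("(1:58 - 4th) Two-Minute Warning"))
```

The first string is the sack printed earlier in this notebook; the second has no yardage, so the helper returns `None`.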

In [24]:
# DataFrame.copy() makes a deep copy by default, so joined_df is left untouched
parsed_df = parse_details( joined_df.copy() )
parsed_df.sample(15)

Index(['Unnamed: 0', 'downdist', 'detail', 'play_num', 'home', 'away',
       'possession', 'home_score_before', 'away_score_before',
       'home_score_after', 'away_score_after', 'drive_num', 'gameId', 'down',
       'dist', 'home_fieldpos', 'qtr', 'time_rem', 'secs_rem', 'home_lead',
       'total_score', 'adj_lead', 'OT', 'home_possession', 'season', 'week',
       'home_win'],
      dtype='object')


Unnamed: 0.1,Unnamed: 0,downdist,detail,play_num,home,away,possession,home_score_before,away_score_before,home_score_after,...,week,home_win,is_parseable,is_run,is_pass,is_scramble,is_punt,is_fieldgoal,yds_gained,runpass
259507,0,['1st and 10 at CLE 39'],(10:29 - 1st) (Shotgun) J.Manziel pass incomp...,1,CLE,CAR,CLE,0,0,0,...,16,0,True,False,True,False,False,False,0,True
2600,11,['3rd and 3 at OAK 41'],(10:18 - 3rd) P.Rivers pass short middle inte...,12,SD,OAK,SD,10,10,10,...,1,1,True,False,True,False,False,False,7,True
57281,2,['3rd and 5 at ATL 41'],(1:15 - 4th) E.Buckley left end to ATL 37 for ...,3,ATL,PHI,PHI,17,31,17,...,6,0,True,True,False,False,False,False,4,True
306563,3,[],(1:58 - 4th) Two-Minute Warning,4,TEN,IND,TEN,24,30,24,...,17,0,False,False,False,False,False,False,x,False
263659,4,['4th and 3 at GB 46'],(8:04 - 1st) (Punt formation) T.Masthay punts...,5,GB,CHI,GB,0,3,0,...,1,1,True,False,False,False,True,False,x,False
152571,6,['and Goal at JAX 0'],End of Period,7,IND,JAX,IND,3,0,10,...,10,1,False,False,False,False,False,False,x,False
207739,1,['2nd and 11 at WSH 44'],(0:39 - 4th) E.Manning kneels to WAS 45 for -1...,2,NYG,WSH,NYG,24,17,24,...,13,1,False,False,False,False,False,False,x,False
152304,3,['1st and 10 at DAL 44'],(4:45 - 3rd) T.Romo pass short right to J.Witt...,4,DAL,ATL,DAL,6,6,6,...,9,0,True,False,True,False,False,False,2,True
222907,8,['3rd and 17 at NYG 29'],(11:27 - 1st) (Shotgun) Penalty on ARZ-D.Stan...,9,ARI,NYG,ARI,0,0,7,...,2,1,False,False,False,False,False,False,x,False
214469,17,['4th and 7 at WSH 7'],(6:14 - 4th) D.Bailey 25 yard field goal is GO...,18,DAL,WSH,DAL,14,23,17,...,16,1,True,False,False,False,False,True,x,False
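
The stated objective is to label plays as "successful", but this notebook stops at `yds_gained`. As a sketch of what that last step could look like, the function below uses thresholds that are an assumption on my part (gaining 40% of the distance on 1st down, 60% on 2nd, and all of it on 3rd or 4th is one commonly cited convention); `is_successful` and its `thresholds` parameter are hypothetical and not defined anywhere else here.

```python
def is_successful(down, dist, yds_gained, thresholds=(0.4, 0.6, 1.0, 1.0)):
    """Return True/False for a parseable run or pass play, else None.

    thresholds[k] is the assumed fraction of `dist` that must be gained
    on down k+1 for the play to count as a success.
    """
    try:
        down, dist, yds = int(down), int(dist), int(yds_gained)
    except (TypeError, ValueError):
        return None  # e.g. yds_gained == "x" for an unparseable play
    if down not in (1, 2, 3, 4) or dist <= 0:
        return None  # kickoffs, timeouts, and similar rows
    return yds >= thresholds[down - 1] * dist


print(is_successful(1, 10, 4))    # 4 yards on 1st and 10
print(is_successful(2, 10, 5))    # 5 yards on 2nd and 10
print(is_successful(1, 10, "x"))  # unparseable yardage
```

Keeping the thresholds as a parameter makes it easy to swap in a different definition of success without touching the rest of the logic.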


In [25]:
# And finally save the DataFrame to a csv
parsed_df.to_csv("espn_parsedplays2009-2017.csv")

## End of finished code