# Day 23

Use NFL play-by-play data from 1999 onward to build a game-team-level dataset showing each team's final score and the number of each type of scoring play that produced it. Ultimately I want to compare the share of points scored by offense versus defense to see how teams perform and whether there are any outliers at the weekly, team, or season level.

Today's update:
- Did a sanity check on a number of columns, making sure that extreme values made sense based on the play descriptions
- Figured out that the play-by-play dataset is missing 59 games across the 1999-2022 seasons. I didn't filter out any games in the API call, so I might need to manually check the data in the [nflverse repository](https://github.com/nflverse/nflverse-data)

Current challenges:
- See if I can find data on the missing 59 games
- Confirm whether the data is actually missing or whether something is wrong with the data call on my end
- Check score distribution. Confirm zero scores

Solutions:
- Run histograms and utilize frequencies
- Check a few sample cases of rare scoring plays and see if my code places them in the right category
- Update JOIN logic to FULL JOIN
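
A minimal sketch of the frequency check, using a small made-up DataFrame in place of the real summary table. The `score` column name matches the summary built below; the `toy` frame and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical game-team scores; the real values come from the summary query.
toy = pd.DataFrame({"score": [0, 3, 3, 6, 7, 10, 14, 17, 17, 24]})

# Frequency of each score, sorted by score so gaps at the low end stand out.
score_freq = toy["score"].value_counts().sort_index()
print(score_freq)

# Share of game-team rows below 6 points; a share of exactly zero on real
# data would suggest that low-scoring teams are being dropped somewhere.
low_share = (toy["score"] < 6).mean()
print(f"share of rows with score < 6: {low_share:.2f}")
```

A plotted histogram would show the same thing visually; the frequency table is used here only because it needs nothing beyond pandas.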

In [65]:
import pandas as pd
import sqlite3
import nfl_data_py as nfl

# Create database connection
conn = sqlite3.connect('../../data/db/database.db')

## Get Score Summary Table

In [66]:
query = """
WITH offense AS (
    SELECT
        game_id,
        season,
        week,
        season_type,
        home_team,
        away_team,
        posteam,
        SUM(touchdown) AS tot_tds,
        SUM(pass_touchdown) AS tot_pass_tds,
        SUM(rush_touchdown) AS tot_rush_tds,
        SUM(return_touchdown) AS tot_ret_tds
    FROM pbp
    WHERE posteam IS NOT NULL 
        AND posteam <> ''
        AND posteam = td_team
        AND (
            pass_touchdown = 1
            OR rush_touchdown = 1
            OR return_touchdown = 1
        )
    GROUP BY game_id, posteam), 
extra_pts AS (
    SELECT
        game_id,
        posteam,
        COUNT(*) AS tot_extra_pts
    FROM pbp
    WHERE extra_point_attempt = 1 AND extra_point_result = 'good'
    GROUP BY game_id, posteam),
field_goals AS (
    SELECT 
        game_id,
        -- Correct for data error in game_id: 2000_11_OAK_DEN
        CASE WHEN game_id = '2000_11_OAK_DEN' 
            AND desc LIKE '%J.Elam%' 
            AND desc LIKE '%field goal%' 
            THEN 'DEN'
        ELSE posteam
        END AS team,
        COUNT(*) AS tot_fgs
    FROM pbp
    WHERE field_goal_attempt = 1 
        AND field_goal_result = 'made'
        AND desc LIKE '%field goal%'
    GROUP BY game_id, team), 
two_pt_convs AS (
    SELECT 
        game_id,
        posteam,
        COUNT(*) AS tot_2pt_conv
    FROM pbp
    WHERE two_point_attempt = 1 AND two_point_conv_result = 'success'
    GROUP BY game_id, posteam),
-- Counts defensive TDs and punt/kickoff return TDs
defense AS (
    SELECT
        game_id,
        td_team AS team,
        COUNT(*) AS tot_def_tds
    FROM pbp
    WHERE touchdown = 1
        AND (
            defteam_score_post <> defteam_score
            OR (defteam_score IS NULL AND defteam_score_post >= 6)
        )
    GROUP BY game_id, td_team),
safeties AS (
    SELECT
        game_id,
        CASE WHEN defteam_score_post <> defteam_score THEN defteam
        ELSE posteam
        END AS team,
        COUNT(*) AS tot_safeties
    FROM pbp
    WHERE safety = 1
    GROUP BY game_id, team),
def_2pt_att AS (
    SELECT
        game_id,
        defteam AS team,
        COUNT(*) AS tot_def_2pt
    FROM pbp
    WHERE desc LIKE '%DEFENSIVE TWO-POINT ATTEMPT%'
        AND defteam_score_post <> defteam_score
    GROUP BY game_id, team),
off_fumb_recovery AS (
    SELECT
        game_id,
        posteam AS team,
        COUNT(*) AS tot_off_fumble_recov_td
    FROM pbp
    WHERE desc LIKE '%fumble%'
        AND posteam_score_post <> posteam_score
        AND touchdown = 1
        AND pass_touchdown = 0
        AND rush_touchdown = 0
        AND return_touchdown = 0
    GROUP BY game_id, team),
joined AS (
    SELECT 
        offense.*,
        COALESCE(tot_extra_pts, 0) AS tot_extra_pts,
        COALESCE(tot_fgs, 0) AS tot_fgs,
        COALESCE(tot_2pt_conv, 0) AS tot_2pt_conv,
        COALESCE(tot_def_tds, 0) AS tot_def_tds,
        COALESCE(tot_safeties, 0) AS tot_safeties,
        COALESCE(tot_def_2pt, 0) AS tot_def_2pt,
        COALESCE(tot_off_fumble_recov_td, 0) AS tot_off_fumble_recov_td
    FROM offense
    LEFT JOIN extra_pts
        ON extra_pts.game_id = offense.game_id
            AND extra_pts.posteam = offense.posteam
    LEFT JOIN field_goals
        ON field_goals.game_id = offense.game_id
            AND field_goals.team = offense.posteam
    LEFT JOIN two_pt_convs
        ON two_pt_convs.game_id = offense.game_id
            AND two_pt_convs.posteam = offense.posteam
    LEFT JOIN defense
        ON defense.game_id = offense.game_id
            AND defense.team = offense.posteam
    LEFT JOIN safeties
        ON safeties.game_id = offense.game_id
            AND safeties.team = offense.posteam
    LEFT JOIN def_2pt_att
        ON def_2pt_att.game_id = offense.game_id
            AND def_2pt_att.team = offense.posteam
    LEFT JOIN off_fumb_recovery
        ON off_fumb_recovery.game_id = offense.game_id
            AND off_fumb_recovery.team = offense.posteam
)
SELECT *,
    (tot_pass_tds * 6
    + tot_rush_tds * 6
    + tot_ret_tds * 6
    + tot_extra_pts * 1
    + tot_fgs * 3
    + tot_2pt_conv * 2
    + tot_def_tds * 6
    + tot_safeties * 2
    + tot_def_2pt * 2
    + tot_off_fumble_recov_td * 6) AS score,
    -- Use old team abbrev. that matches game_id for teams that moved (pbp data has new names)
    CASE
        WHEN game_id LIKE '%OAK%' AND posteam = 'LV' THEN 'OAK'
        WHEN game_id LIKE '%SD%' AND posteam = 'LAC' THEN 'SD'
        WHEN game_id LIKE '%STL%' AND posteam = 'LA' THEN 'STL'
        ELSE posteam
    END AS team_fixed
FROM joined
"""

df_pbp = pd.read_sql(query, conn)
df_pbp.head(10)

Unnamed: 0,game_id,season,week,season_type,home_team,away_team,posteam,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score,team_fixed
0,1999_01_ARI_PHI,1999,1,REG,PHI,ARI,ARI,2.0,1.0,1.0,0.0,1,4,0,0,0,0,0,25.0,ARI
1,1999_01_ARI_PHI,1999,1,REG,PHI,ARI,PHI,3.0,2.0,1.0,0.0,3,1,0,0,0,0,0,24.0,PHI
2,1999_01_BUF_IND,1999,1,REG,IND,BUF,BUF,1.0,1.0,0.0,0.0,0,2,1,0,0,0,0,14.0,BUF
3,1999_01_BUF_IND,1999,1,REG,IND,BUF,IND,3.0,2.0,1.0,0.0,4,1,0,1,0,0,0,31.0,IND
4,1999_01_CAR_NO,1999,1,REG,NO,CAR,CAR,1.0,1.0,0.0,0.0,1,1,0,0,0,0,0,10.0,CAR
5,1999_01_CAR_NO,1999,1,REG,NO,CAR,NO,1.0,1.0,0.0,0.0,1,2,0,1,0,0,0,19.0,NO
6,1999_01_CIN_TEN,1999,1,REG,TEN,CIN,CIN,4.0,2.0,2.0,0.0,1,2,2,0,0,0,0,35.0,CIN
7,1999_01_CIN_TEN,1999,1,REG,TEN,CIN,TEN,4.0,3.0,1.0,0.0,4,2,0,0,1,0,0,36.0,TEN
8,1999_01_DAL_WAS,1999,1,REG,WAS,DAL,DAL,6.0,5.0,1.0,0.0,5,0,0,0,0,0,0,41.0,DAL
9,1999_01_DAL_WAS,1999,1,REG,WAS,DAL,WAS,4.0,2.0,2.0,0.0,3,2,1,0,0,0,0,35.0,WAS


## Data Validation

In [68]:
df_pbp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11509 entries, 0 to 11508
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   game_id                  11509 non-null  object 
 1   season                   11509 non-null  int64  
 2   week                     11509 non-null  int64  
 3   season_type              11509 non-null  object 
 4   home_team                11509 non-null  object 
 5   away_team                11509 non-null  object 
 6   posteam                  11509 non-null  object 
 7   tot_tds                  11509 non-null  float64
 8   tot_pass_tds             11509 non-null  float64
 9   tot_rush_tds             11509 non-null  float64
 10  tot_ret_tds              11509 non-null  float64
 11  tot_extra_pts            11509 non-null  int64  
 12  tot_fgs                  11509 non-null  int64  
 13  tot_2pt_conv             11509 non-null  int64  
 14  tot_def_tds           

In [5]:
df_pbp.describe()

Unnamed: 0,season,week,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score
count,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0,11509.0
mean,2010.493787,9.428274,2.512295,1.57833,0.907116,0.026849,2.439308,1.549222,0.083152,0.180381,0.031541,0.000956,0.004171,23.499348
std,6.79108,5.285374,1.258945,1.101183,0.929834,0.164842,1.380473,1.174154,0.296167,0.43279,0.177249,0.030902,0.064449,9.399743
min,1999.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
25%,2005.0,5.0,2.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,17.0
50%,2011.0,9.0,2.0,1.0,1.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,23.0
75%,2016.0,14.0,3.0,2.0,1.0,0.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,30.0
max,2022.0,22.0,8.0,7.0,8.0,2.0,8.0,8.0,4.0,4.0,2.0,1.0,1.0,62.0


In [8]:
pd.crosstab(df_pbp['week'], df_pbp['season_type'])

week \ season_type,POST,REG
1,0,693
2,0,691
3,0,681
4,0,648
5,0,627
6,0,613
7,0,600
8,0,609
9,0,615
10,0,632


In [17]:
df_pbp.query("season_type == 'REG' and week == 18 and season != 2021")

Unnamed: 0,game_id,season,week,season_type,home_team,away_team,posteam,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score,team_fixed


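Within this dataset, 2021 is the only season with regular-season games in Week 18 (17 games over 18 weeks). The 2022 season uses the same format, but the data here stops partway through that season, so this checks out.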
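
The same check can be made for every season at once by looking at the highest regular-season week per season. This is a sketch on invented rows; the column names `season`, `season_type`, and `week` follow the summary table, while `toy` and its values are placeholders.

```python
import pandas as pd

# Invented rows standing in for the game-team summary table.
toy = pd.DataFrame({
    "season": [2020, 2020, 2021, 2021, 2022],
    "season_type": ["REG", "POST", "REG", "REG", "REG"],
    "week": [17, 18, 17, 18, 10],
})

# Highest regular-season week observed in each season.
max_reg_week = toy[toy["season_type"] == "REG"].groupby("season")["week"].max()
print(max_reg_week)
```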

In [24]:
# Check if total touchdowns matches the total of passing, rushing, and return touchdowns
print(len(df_pbp.query("tot_tds == (tot_pass_tds + tot_rush_tds + tot_ret_tds)")))
print(len(df_pbp))

11509
11509


In [19]:
pd.crosstab(df_pbp['tot_tds'], df_pbp['tot_pass_tds'])

tot_tds \ tot_pass_tds,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0
1.0,954,1822,0,0,0,0,0,0
2.0,500,1562,1474,0,0,0,0,0
3.0,184,698,1210,718,0,0,0,0
4.0,50,223,504,543,267,0,0,0
5.0,9,42,104,200,154,54,0,0
6.0,2,10,25,45,46,44,18,0
7.0,1,0,4,8,16,9,2,3
8.0,1,0,0,0,1,0,2,0


In [20]:
pd.crosstab(df_pbp['tot_tds'], df_pbp['tot_rush_tds'])

tot_tds \ tot_rush_tds,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0
1.0,1853,923,0,0,0,0,0,0,0
2.0,1517,1545,474,0,0,0,0,0,0
3.0,760,1207,666,177,0,0,0,0,0
4.0,288,556,486,211,46,0,0,0,0
5.0,57,158,200,102,39,7,0,0,0
6.0,20,44,48,45,24,7,2,0,0
7.0,4,2,10,16,7,3,0,1,0
8.0,0,0,2,0,1,0,0,0,1


In [26]:
pd.crosstab(df_pbp['tot_tds'], df_pbp['tot_fgs'])

tot_tds \ tot_fgs,0,1,2,3,4,5,6,7,8
1.0,474,811,737,483,202,59,9,1,0
2.0,615,1086,990,578,229,31,4,2,1
3.0,550,969,829,341,103,18,0,0,0
4.0,386,607,394,157,38,5,0,0,0
5.0,160,228,133,37,5,0,0,0,0
6.0,81,64,34,10,0,1,0,0,0
7.0,23,14,6,0,0,0,0,0,0
8.0,2,2,0,0,0,0,0,0,0


### Two Point Conversions

In [27]:
df_pbp.query("tot_2pt_conv >= 4")

Unnamed: 0,game_id,season,week,season_type,home_team,away_team,posteam,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score,team_fixed
607,2000_07_ATL_STL,2000,7,REG,LA,ATL,LA,6.0,3.0,2.0,1.0,1,0,4,0,0,0,0,45.0,STL


In [41]:
pd.set_option("display.max_colwidth", None)

df = pd.read_sql("""
SELECT
    posteam,
    --defteam,
    --posteam_score,
    --posteam_score_post,
    --defteam_score,
    --defteam_score_post,
    desc
FROM pbp
WHERE game_id = '2000_07_ATL_STL'
    AND (
        (defteam_score_post - defteam_score) = 2
        OR (posteam_score_post - posteam_score) = 2
    )
""", conn)
df

Unnamed: 0,posteam,desc
0,LA,(Kick formation) TWO-POINT CONVERSION ATTEMPT. K.Lyle pass to L.Fletcher is complete. ATTEMPT SUCCEEDS.
1,LA,TWO-POINT CONVERSION ATTEMPT. M.Faulk rushes right end. ATTEMPT SUCCEEDS.
2,LA,TWO-POINT CONVERSION ATTEMPT. M.Faulk rushes up the middle. ATTEMPT SUCCEEDS. St. Louis #50 R.Tucker injured on the play.
3,ATL,TWO-POINT CONVERSION ATTEMPT. J.Anderson rushes left tackle. ATTEMPT SUCCEEDS.
4,LA,TWO-POINT CONVERSION ATTEMPT. K.Warner pass to R.Williams is complete. ATTEMPT SUCCEEDS.


### Defensive TDs

In [42]:
df_pbp.query("tot_def_tds >= 4")

Unnamed: 0,game_id,season,week,season_type,home_team,away_team,posteam,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score,team_fixed
6670,2012_17_JAX_TEN,2012,17,REG,TEN,JAX,TEN,1.0,0.0,1.0,0.0,5,1,0,4,0,0,0,38.0,TEN


In [46]:
pd.set_option("display.max_colwidth", None)

df = pd.read_sql("""
SELECT
    posteam,
    defteam,
    --posteam_score,
    --posteam_score_post,
    --defteam_score,
    --defteam_score_post,
    desc
FROM pbp
WHERE game_id = '2012_17_JAX_TEN'
    AND (
        (defteam_score_post - defteam_score) = 6
        --OR (posteam_score_post - posteam_score) = 6
    )
""", conn)
df

Unnamed: 0,posteam,defteam,desc
0,JAX,TEN,"(2:00) (Shotgun) 7-C.Henne pass short left intended for 89-M.Lewis INTERCEPTED by 55-Z.Brown at TEN 21. 55-Z.Brown for 79 yards, TOUCHDOWN."
1,JAX,TEN,"(:48) (Punt formation) 19-B.Anger punts 47 yards to TEN 31, Center-48-J.Cain. 25-D.Reynaud for 69 yards, TOUCHDOWN."
2,JAX,TEN,"(13:13) (Punt formation) 19-B.Anger punts 54 yards to TEN 19, Center-48-J.Cain. 25-D.Reynaud for 81 yards, TOUCHDOWN."
3,JAX,TEN,"(12:11) 7-C.Henne pass short left intended for 17-T.Clemons INTERCEPTED by 55-Z.Brown (37-T.Campbell) at JAX 30. 55-Z.Brown for 30 yards, TOUCHDOWN."
4,TEN,JAX,"(2:25) 6-B.Kern punt is BLOCKED by 20-M.Harris, Center-48-B.Brinkley, RECOVERED by JAX-20-M.Harris at TEN 19. 20-M.Harris for 19 yards, TOUCHDOWN."


### Field Goals

In [48]:
df_pbp.query("tot_fgs >= 8")

Unnamed: 0,game_id,season,week,season_type,home_team,away_team,posteam,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score,team_fixed
3954,2007_07_TEN_HOU,2007,7,REG,HOU,TEN,TEN,2.0,0.0,2.0,0.0,2,8,0,0,0,0,0,38.0,TEN


In [52]:
pd.set_option("display.max_colwidth", None)

df = pd.read_sql("""
SELECT
    posteam,
    defteam,
    --posteam_score,
    --posteam_score_post,
    --defteam_score,
    --defteam_score_post,
    desc
FROM pbp
WHERE game_id = '2007_07_TEN_HOU'
    AND (
        --(defteam_score_post - defteam_score) = 3
        (posteam_score_post - posteam_score) = 3
    )
""", conn)
df

Unnamed: 0,posteam,defteam,desc
0,TEN,HOU,"(12:35) 2-R.Bironas 52 yard field goal is GOOD, Center-58-K.Amato, Holder-15-C.Hentrich."
1,TEN,HOU,"(:55) 2-R.Bironas 25 yard field goal is GOOD, Center-58-K.Amato, Holder-15-C.Hentrich."
2,TEN,HOU,"(11:49) 2-R.Bironas 21 yard field goal is GOOD, Center-58-K.Amato, Holder-15-C.Hentrich."
3,TEN,HOU,"(1:09) 2-R.Bironas 30 yard field goal is GOOD, Center-58-K.Amato, Holder-15-C.Hentrich."
4,TEN,HOU,"(:04) 2-R.Bironas 28 yard field goal is GOOD, Center-58-K.Amato, Holder-15-C.Hentrich."
5,TEN,HOU,"(10:58) 2-R.Bironas 43 yard field goal is GOOD, Center-58-K.Amato, Holder-15-C.Hentrich."
6,TEN,HOU,"(3:49) 2-R.Bironas 29 yard field goal is GOOD, Center-58-K.Amato, Holder-15-C.Hentrich."
7,TEN,HOU,"(:02) 2-R.Bironas 29 yard field goal is GOOD, Center-58-K.Amato, Holder-15-C.Hentrich."


### Extra Points

In [53]:
df_pbp.query("tot_extra_pts >= 8")

Unnamed: 0,game_id,season,week,season_type,home_team,away_team,posteam,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score,team_fixed
447,1999_19_MIA_JAX,1999,19,POST,JAX,MIA,JAX,7.0,4.0,3.0,0.0,8,2,0,1,0,0,0,62.0,JAX
2491,2004_07_ATL_KC,2004,7,REG,KC,ATL,KC,8.0,0.0,8.0,0.0,8,0,0,0,0,0,0,56.0,KC
3838,2007_03_DET_PHI,2007,3,REG,PHI,DET,PHI,8.0,4.0,4.0,0.0,8,0,0,0,0,0,0,56.0,PHI
4041,2007_11_NE_BUF,2007,11,REG,BUF,NE,NE,7.0,5.0,2.0,0.0,8,0,0,1,0,0,0,56.0,NE
4901,2009_06_TEN_NE,2009,6,REG,NE,TEN,NE,8.0,6.0,2.0,0.0,8,1,0,0,0,0,0,59.0,NE
5409,2010_07_OAK_DEN,2010,7,REG,DEN,LV,LV,7.0,2.0,5.0,0.0,8,1,0,1,0,0,0,59.0,OAK
5488,2010_10_PHI_WAS,2010,10,REG,WAS,PHI,PHI,7.0,4.0,3.0,0.0,8,1,0,1,0,0,0,59.0,PHI
5901,2011_07_IND_NO,2011,7,REG,NO,IND,NO,7.0,5.0,2.0,0.0,8,2,0,1,0,0,0,62.0,NO
6494,2012_11_IND_NE,2012,11,REG,NE,IND,NE,5.0,3.0,2.0,0.0,8,1,0,3,0,0,0,59.0,NE
7100,2013_15_KC_OAK,2013,15,REG,LV,KC,KC,7.0,5.0,2.0,0.0,8,0,0,1,0,0,0,56.0,KC


In [59]:
pd.set_option("display.max_colwidth", None)

df = pd.read_sql("""
SELECT
    posteam,
    defteam,
    --posteam_score,
    --posteam_score_post,
    --defteam_score,
    --defteam_score_post,
    desc
FROM pbp
WHERE game_id = '1999_19_MIA_JAX'
    AND (
        --(defteam_score_post - defteam_score) = 6
        (posteam_score_post - posteam_score) = 1
    )
""", conn)
df

Unnamed: 0,posteam,defteam,desc
0,JAX,MIA,"M.Hollis extra point is GOOD, Center-Q.Neujahr, Holder-B.Barker."
1,JAX,MIA,"M.Hollis extra point is GOOD, Center-Q.Neujahr, Holder-B.Barker."
2,JAX,MIA,"M.Hollis extra point is GOOD, Center-Q.Neujahr, Holder-B.Barker."
3,JAX,MIA,"M.Hollis extra point is GOOD, Center-Q.Neujahr, Holder-B.Barker."
4,JAX,MIA,"M.Hollis extra point is GOOD, Center-Q.Neujahr, Holder-B.Barker."
5,MIA,JAX,"O.Mare extra point is GOOD, Center-E.Perry, Holder-D.Huard."
6,JAX,MIA,"M.Hollis extra point is GOOD, Center-Q.Neujahr, Holder-B.Barker."
7,JAX,MIA,"M.Hollis extra point is GOOD, Center-Q.Neujahr, Holder-B.Barker."
8,JAX,MIA,"M.Hollis extra point is GOOD, Center-Q.Neujahr, Holder-B.Barker."


### Defensive 2pt

In [60]:
df_pbp.query("tot_def_2pt >= 1")

Unnamed: 0,game_id,season,week,season_type,home_team,away_team,posteam,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score,team_fixed
8033,2015_13_CAR_NO,2015,13,REG,NO,CAR,NO,4.0,3.0,1.0,0.0,4,0,1,1,0,1,0,38.0,NO
8227,2016_02_BAL_CLE,2016,2,REG,CLE,BAL,BAL,2.0,2.0,0.0,0.0,2,3,0,0,0,1,0,25.0,BAL
8454,2016_10_DEN_NO,2016,10,REG,NO,DEN,DEN,2.0,2.0,0.0,0.0,2,3,0,0,0,1,0,25.0,DEN
8544,2016_13_KC_ATL,2016,13,REG,ATL,KC,KC,3.0,1.0,2.0,0.0,3,0,0,1,0,1,0,29.0,KC
8559,2016_14_ARI_MIA,2016,14,REG,MIA,ARI,MIA,3.0,3.0,0.0,0.0,3,1,0,0,0,1,0,26.0,MIA
9124,2017_16_JAX_SF,2017,16,REG,SF,JAX,JAX,4.0,2.0,2.0,0.0,2,1,1,0,0,1,0,33.0,JAX
9599,2018_15_NO_CAR,2018,15,REG,CAR,NO,CAR,1.0,1.0,0.0,0.0,1,0,0,0,0,1,0,9.0,CAR
9659,2018_17_LAC_DEN,2018,17,REG,DEN,LAC,LAC,2.0,1.0,1.0,0.0,3,0,0,1,0,1,0,23.0,LAC
9976,2019_11_JAX_IND,2019,11,REG,IND,JAX,IND,4.0,1.0,3.0,0.0,4,1,0,0,0,1,0,33.0,IND
10031,2019_13_OAK_KC,2019,13,REG,KC,LV,KC,4.0,1.0,3.0,0.0,5,1,0,1,0,1,0,40.0,KC


In [61]:
pd.set_option("display.max_colwidth", None)

df = pd.read_sql("""
SELECT
    posteam,
    defteam,
    --posteam_score,
    --posteam_score_post,
    --defteam_score,
    --defteam_score_post,
    desc
FROM pbp
WHERE game_id = '2016_02_BAL_CLE'
    AND (
        (defteam_score_post - defteam_score) = 2
        --(posteam_score_post - posteam_score) = 1
    )
""", conn)
df

Unnamed: 0,posteam,defteam,desc
0,CLE,BAL,"2-P.Murray extra point is Blocked (93-L.Guy), Center-47-C.Hughlett, Holder-4-B.Colquitt. DEFENSIVE TWO-POINT ATTEMPT. 36-T.Young recovered the blocked kick. ATTEMPT SUCCEEDS."


### Score

In [67]:
# There should be some games where a team scored fewer than 6 points (e.g. only a field goal, or no points at all)
df_pbp.query("score < 6")

Unnamed: 0,game_id,season,week,season_type,home_team,away_team,posteam,tot_tds,tot_pass_tds,tot_rush_tds,tot_ret_tds,tot_extra_pts,tot_fgs,tot_2pt_conv,tot_def_tds,tot_safeties,tot_def_2pt,tot_off_fumble_recov_td,score,team_fixed


## Check for Missing Data
No values are missing within the batch of games I have, but I know I am missing some game-team records, since there should be games where a team scored fewer than 6 points. I can find the missing records by joining against the schedules table again, this time performing a LEFT JOIN with schedules as the left dataset instead of play-by-play.
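
A minimal sketch of that left join on toy data, using the `indicator` option of `DataFrame.merge` to label unmatched rows. The frames `sched` and `summary` and all their values are invented stand-ins for the schedules table and the play-by-play summary.

```python
import pandas as pd

# Toy stand-ins for the schedules and play-by-play summary tables (values invented).
sched = pd.DataFrame({
    "game_id": ["g1", "g1", "g2", "g2"],
    "team": ["AAA", "BBB", "CCC", "DDD"],
    "score": [24, 3, 0, 17],
})
summary = pd.DataFrame({
    "game_id": ["g1", "g2"],
    "team": ["AAA", "DDD"],
})

# Left join from schedules; `_merge` marks rows with no match in the summary.
check = sched.merge(summary, how="left", on=["game_id", "team"], indicator=True)
missing = check[check["_merge"] == "left_only"]
print(missing[["game_id", "team", "score"]])
```

With schedules on the left, the row count stays at the schedule's game-team count, and the `left_only` rows are exactly the records the summary lacks.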

In [70]:
query = """
WITH sched_historical AS (
    WITH home_games AS (
        SELECT
            game_id,
            season,
            week,
            game_type,
            home_team AS team,
            home_score AS score
        FROM schedules
        WHERE season < 2022
    ), away_games AS (
        SELECT
            game_id,
            season,
            week,
            game_type,
            away_team AS team,
            away_score AS score
        FROM schedules
        WHERE season < 2022
    )
    -- Stack the data
    SELECT *
    FROM home_games
    UNION ALL
    SELECT *
    FROM away_games
),
-- SELECT * FROM sched_historical;
sched_2022 AS (
    WITH home_games AS (
        SELECT
            game_id,
            season,
            week,
            game_type,
            home_team AS team,
            home_score AS score
        FROM schedules
        WHERE season = 2022 AND week <= 10
    ), away_games AS (
        SELECT
            game_id,
            season,
            week,
            game_type,
            away_team AS team,
            away_score AS score
        FROM schedules
        WHERE season = 2022 AND week <= 10
    )
    SELECT *
    FROM home_games
    UNION ALL
    SELECT *
    FROM away_games
    
)
SELECT
    game_id,
    season,
    week,
    game_type,
    team,
    score
FROM sched_historical
UNION ALL
SELECT
    game_id,
    season,
    week,
    game_type,
    team,
    score
FROM sched_2022;
"""

df_schedules = pd.read_sql(query, conn)
print(len(df_schedules))
df_schedules.head()

12574


Unnamed: 0,game_id,season,week,game_type,team,score
0,1999_01_MIN_ATL,1999,1,REG,ATL,14.0
1,1999_01_KC_CHI,1999,1,REG,CHI,20.0
2,1999_01_PIT_CLE,1999,1,REG,CLE,0.0
3,1999_01_OAK_GB,1999,1,REG,GB,28.0
4,1999_01_BUF_IND,1999,1,REG,IND,31.0


In [87]:
# Check missing game_id
merged = df_schedules[['game_id', 'team', 'season', 'score']].merge(
    df_pbp[['game_id', 'team_fixed']],
    how='left',
    left_on=['game_id', 'team'],
    right_on=['game_id', 'team_fixed']
    # on=['game_id']
)

print(len(merged))
merged.sort_values('game_id').head()

12574


Unnamed: 0,game_id,team,season,score,team_fixed
6145,1999_01_ARI_PHI,ARI,1999,25.0,ARI
8,1999_01_ARI_PHI,PHI,1999,24.0,PHI
10,1999_01_BAL_STL,STL,1999,27.0,
6147,1999_01_BAL_STL,BAL,1999,10.0,
6141,1999_01_BUF_IND,BUF,1999,14.0,BUF


In [88]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12574 entries, 0 to 12573
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   game_id     12574 non-null  object 
 1   team        12574 non-null  object 
 2   season      12574 non-null  int64  
 3   score       12574 non-null  float64
 4   team_fixed  11509 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 589.4+ KB


In [89]:
# List game-team records missing from the pbp summary, newest games first; this looks like mainly an issue for older data
merged[merged['team_fixed'].isnull()].sort_values('game_id', ascending=False)

Unnamed: 0,game_id,team,season,score,team_fixed
12553,2022_09_IND_NE,IND,2022,3.0,
12538,2022_08_LV_NO,LV,2022,0.0,
12520,2022_07_TB_CAR,TB,2022,3.0,
12374,2022_07_IND_TEN,TEN,2022,19.0,
12522,2022_07_DET_DAL,DET,2022,6.0,
...,...,...,...,...,...
6142,1999_01_SF_JAX,SF,1999,3.0,
2,1999_01_PIT_CLE,CLE,1999,0.0,
6148,1999_01_NYG_TB,NYG,1999,17.0,
6147,1999_01_BAL_STL,BAL,1999,10.0,


In [90]:
merged[merged['team_fixed'].isnull()].sort_values('game_id', ascending=False)['score'].value_counts()

3.0     286
6.0     204
0.0     169
9.0     136
12.0     54
10.0     39
13.0     34
7.0      24
15.0     22
16.0     21
19.0     11
14.0      7
18.0      7
5.0       6
17.0      6
11.0      6
21.0      5
8.0       5
23.0      4
24.0      3
22.0      3
25.0      3
20.0      3
2.0       2
30.0      1
26.0      1
42.0      1
29.0      1
27.0      1
Name: score, dtype: int64

In [91]:
# Number of games with at least one team missing from the pbp summary
len(merged[merged['team_fixed'].isnull()].drop_duplicates('game_id'))

1006
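
To see how the missing records are spread across seasons, a per-season count could look like the sketch below. `toy_merged` is an invented frame that mimics the shape of `merged` (null `team_fixed` means no match in the summary); the game ids and values are placeholders.

```python
import pandas as pd

# Toy version of the merged frame: team_fixed is null where the summary has no row.
toy_merged = pd.DataFrame({
    "game_id": ["1999_01_A_B", "1999_01_A_B", "1999_02_C_D", "2022_07_E_F"],
    "season": [1999, 1999, 1999, 2022],
    "team_fixed": [None, None, "C", None],
})

missing = toy_merged[toy_merged["team_fixed"].isnull()]

# Missing game-team rows per season, and distinct affected games per season.
rows_by_season = missing.groupby("season").size()
games_by_season = missing.groupby("season")["game_id"].nunique()
print(rows_by_season)
print(games_by_season)
```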

In [92]:
# Cross-check with SQL: scheduled games that have no rows at all in pbp
pd.read_sql("""
SELECT game_id
FROM schedules
WHERE game_id NOT IN (
    SELECT game_id
    FROM pbp
) AND (
    season = 2022 AND week <= 10
    OR season < 2022
)""", conn)

Unnamed: 0,game_id
0,1999_01_BAL_STL
1,2000_03_SD_KC
2,2000_06_BUF_MIA


It seems the pbp data in the database is missing only 3 games, while my summary table query is missing far more. I'll need to reconcile this tomorrow.
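
One possible explanation, not yet confirmed: the summary query uses the `offense` CTE as the left table of every join, and that CTE only contains teams with at least one passing, rushing, or return touchdown, so a team that scored only field goals (or nothing) would never appear. The sketch below reproduces that pattern on an invented in-memory table. The `plays` table, its columns, and the `teams` key CTE are all hypothetical. As far as I know, SQLite only supports FULL OUTER JOIN from version 3.39.0, so the sketch uses a key table plus LEFT JOINs instead.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE plays (game_id TEXT, posteam TEXT, touchdown INTEGER, fg_made INTEGER);
INSERT INTO plays VALUES
    ('g1', 'AAA', 1, 0),
    ('g1', 'AAA', 0, 1),
    ('g1', 'BBB', 0, 1),   -- BBB only kicks a field goal
    ('g1', 'BBB', 0, 0);
""")

# Per-category aggregates shared by both queries.
ctes = """
tds AS (
    SELECT game_id, posteam, COUNT(*) AS tot_tds
    FROM plays WHERE touchdown = 1
    GROUP BY game_id, posteam),
fgs AS (
    SELECT game_id, posteam, COUNT(*) AS tot_fgs
    FROM plays WHERE fg_made = 1
    GROUP BY game_id, posteam)
"""

# Left table built only from touchdown plays: BBB never appears.
td_driven = demo.execute(f"""
WITH {ctes}
SELECT tds.posteam, tot_tds, COALESCE(tot_fgs, 0)
FROM tds
LEFT JOIN fgs ON fgs.game_id = tds.game_id AND fgs.posteam = tds.posteam
ORDER BY tds.posteam
""").fetchall()

# Left table built from every game-team pair: BBB is kept with zero touchdowns.
key_driven = demo.execute(f"""
WITH teams AS (SELECT DISTINCT game_id, posteam FROM plays),
{ctes}
SELECT teams.posteam, COALESCE(tot_tds, 0), COALESCE(tot_fgs, 0)
FROM teams
LEFT JOIN tds ON tds.game_id = teams.game_id AND tds.posteam = teams.posteam
LEFT JOIN fgs ON fgs.game_id = teams.game_id AND fgs.posteam = teams.posteam
ORDER BY teams.posteam
""").fetchall()

print(td_driven)
print(key_driven)
```

If this is the cause in the real query, it would also be consistent with the minimum summary score being 6 and with most of the unmatched schedule rows having scores such as 0, 3, 6, and 9.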