# Day 24

Use NFL play-by-play data from 1999 onward to build a dataset at the game-team level that shows each team's final score and the number of each type of scoring play that made up that score. Ultimately I want to compare the percentage of points scored by offense versus defense to see how teams perform and whether there are any outliers at the weekly, team, or season level.

Today's update:
- Figured out why I was missing games. I was performing a LEFT JOIN where the left table contained only teams that scored at least one passing, rushing, or return touchdown on offense. This was incorrect: it left out games where a team scored zero points, only a field goal, or points by some other means.
- Figured out that the play-by-play dataset is missing only 3 games across the 1999-2022 seasons. I didn't filter out any games in the API call, so I might need to manually check the data in the [nflverse repository](https://github.com/nflverse/nflverse-data). Still, if I'm aggregating by season, 3 missing games shouldn't be a significant issue for any statistical analysis I might want to do in the future.
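
The join problem from the first bullet can be reproduced on toy data. The sketch below is only an illustration: the `plays` table, its columns, and the team codes are hypothetical and much simpler than the real `pbp` schema. It compares a LEFT JOIN anchored on teams that scored a touchdown with one anchored on every game-team pair.

```python
import sqlite3

# Toy data: a hypothetical `plays` table, NOT the real pbp schema.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE plays (game_id TEXT, posteam TEXT, touchdown INTEGER, field_goal INTEGER);
INSERT INTO plays VALUES
    ('G1', 'AAA', 1, 0),
    ('G1', 'BBB', 0, 1),   -- BBB only kicked a field goal
    ('G2', 'CCC', 0, 0),   -- CCC was shut out
    ('G2', 'DDD', 1, 0);
""")

# Per-type aggregates, shared by both queries below.
scoring_ctes = """
    tds AS (
        SELECT game_id, posteam AS team, SUM(touchdown) AS tot_tds
        FROM plays
        WHERE touchdown = 1
        GROUP BY game_id, posteam),
    fgs AS (
        SELECT game_id, posteam AS team, SUM(field_goal) AS tot_fgs
        FROM plays
        WHERE field_goal = 1
        GROUP BY game_id, posteam)
"""

# Anchored on touchdown scorers: BBB and CCC never appear in the result.
td_anchor = conn_demo.execute(f"""
    WITH {scoring_ctes}
    SELECT tds.game_id, tds.team, tot_tds, COALESCE(tot_fgs, 0)
    FROM tds
    LEFT JOIN fgs ON fgs.game_id = tds.game_id AND fgs.team = tds.team
    ORDER BY tds.game_id, tds.team
""").fetchall()

# Anchored on every game-team pair: all four rows survive, with zeros filled in.
all_anchor = conn_demo.execute(f"""
    WITH all_games AS (
        SELECT DISTINCT game_id, posteam AS team FROM plays),
    {scoring_ctes}
    SELECT all_games.game_id, all_games.team,
           COALESCE(tot_tds, 0), COALESCE(tot_fgs, 0)
    FROM all_games
    LEFT JOIN tds ON tds.game_id = all_games.game_id AND tds.team = all_games.team
    LEFT JOIN fgs ON fgs.game_id = all_games.game_id AND fgs.team = all_games.team
    ORDER BY all_games.game_id, all_games.team
""").fetchall()

print(len(td_anchor), len(all_anchor))
```

The query in this notebook follows the second pattern: `all_games` is the left table and every scoring CTE is joined onto it.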

In [1]:
import pandas as pd
import sqlite3
import nfl_data_py as nfl

# Create database connection
conn = sqlite3.connect('../../data/db/database.db')

## Query Data and Validate

In [54]:
query = """
WITH pbp_summary AS (
    WITH all_games AS (
        SELECT DISTINCT game_id, season, week, season_type, posteam AS team
        FROM pbp
        WHERE posteam IS NOT NULL
            AND posteam <> ''
        ORDER BY game_id),
    offense AS (
        SELECT
        game_id,
        posteam AS team,
        SUM(pass_touchdown) AS tot_pass_tds,
        SUM(rush_touchdown) AS tot_rush_tds
        FROM pbp
        WHERE posteam IS NOT NULL AND posteam <> ''
        GROUP BY game_id, team),
    ret_tds AS (
        SELECT 
            game_id, 
            td_team AS team,
            SUM(return_touchdown) AS tot_ret_tds
        FROM pbp
        WHERE return_touchdown = 1 AND td_team = posteam
        GROUP BY game_id, team), 
    extra_pts AS (
        SELECT
            game_id,
            posteam AS team,
            COUNT(*) AS tot_extra_pts
        FROM pbp
        WHERE extra_point_attempt = 1 AND extra_point_result = 'good'
        GROUP BY game_id, posteam),
    field_goals AS (
        SELECT 
            game_id,
            -- Correct for data error in game_id: 2000_11_OAK_DEN
            CASE WHEN game_id = '2000_11_OAK_DEN' 
                AND desc LIKE '%J.Elam%' 
                AND desc LIKE '%field goal%' 
                THEN 'DEN'
            ELSE posteam
            END AS team,
            COUNT(*) AS tot_fgs
        FROM pbp
        WHERE field_goal_attempt = 1 
            AND field_goal_result = 'made'
            AND desc LIKE '%field goal%'
        GROUP BY game_id, team), 
    two_pt_convs AS (
        SELECT 
            game_id,
            posteam AS team,
            COUNT(*) AS tot_2pt_conv
        FROM pbp
        WHERE two_point_attempt = 1 AND two_point_conv_result = 'success'
        GROUP BY game_id, posteam),
    -- Counts defensive TDs and punt/kickoff return TDs
    defense AS (
        SELECT
            game_id,
            td_team AS team,
            COUNT(*) AS tot_def_tds
        FROM pbp
        WHERE touchdown = 1
            AND (
                defteam_score_post <> defteam_score
                OR (defteam_score IS NULL AND defteam_score_post >= 6)
            )
        GROUP BY game_id, td_team),
    safeties AS (
        SELECT
            game_id,
            CASE WHEN defteam_score_post <> defteam_score THEN defteam
            ELSE posteam
            END AS team,
            COUNT(*) AS tot_safeties
        FROM pbp
        WHERE safety = 1
        GROUP BY game_id, team),
    def_2pt_att AS (
        SELECT
            game_id,
            defteam AS team,
            COUNT(*) AS tot_def_2pt
        FROM pbp
        WHERE desc LIKE '%DEFENSIVE TWO-POINT ATTEMPT%'
            AND defteam_score_post <> defteam_score
        GROUP BY game_id, team),
    off_fumb_recovery AS (
        SELECT
            game_id,
            posteam AS team,
            COUNT(*) AS tot_off_fumble_recov_td
        FROM pbp
        WHERE desc LIKE '%fumble%'
            AND posteam_score_post <> posteam_score
            AND touchdown = 1
            AND pass_touchdown = 0
            AND rush_touchdown = 0
            AND return_touchdown = 0
        GROUP BY game_id, team),
    joined AS (
        SELECT 
            all_games.*,
            -- offense.*,
            -- A team with no plays of a given type has a NULL count; treat it as 0
            COALESCE(tot_pass_tds, 0) AS tot_pass_tds,
            COALESCE(tot_rush_tds, 0) AS tot_rush_tds,
            COALESCE(tot_ret_tds, 0) AS tot_ret_tds,
            COALESCE(tot_extra_pts, 0) AS tot_extra_pts,
            COALESCE(tot_fgs, 0) AS tot_fgs,
            COALESCE(tot_2pt_conv, 0) AS tot_2pt_conv,
            COALESCE(tot_def_tds, 0) AS tot_def_tds,
            COALESCE(tot_safeties, 0) AS tot_safeties,
            COALESCE(tot_def_2pt, 0) AS tot_def_2pt,
            COALESCE(tot_off_fumble_recov_td, 0) AS tot_off_fumble_recov_td
        FROM all_games
        LEFT JOIN offense
            ON offense.game_id = all_games.game_id
                AND offense.team = all_games.team
        LEFT JOIN ret_tds
            ON ret_tds.game_id = all_games.game_id
                AND ret_tds.team = all_games.team  
        LEFT JOIN extra_pts
            ON extra_pts.game_id = all_games.game_id
                AND extra_pts.team = all_games.team
        LEFT JOIN field_goals
            ON field_goals.game_id = all_games.game_id
                AND field_goals.team = all_games.team
        LEFT JOIN two_pt_convs
            ON two_pt_convs.game_id = all_games.game_id
                AND two_pt_convs.team = all_games.team
        LEFT JOIN defense
            ON defense.game_id = all_games.game_id
                AND defense.team = all_games.team
        LEFT JOIN safeties
            ON safeties.game_id = all_games.game_id
                AND safeties.team = all_games.team
        LEFT JOIN def_2pt_att
            ON def_2pt_att.game_id = all_games.game_id
                AND def_2pt_att.team = all_games.team
        LEFT JOIN off_fumb_recovery
            ON off_fumb_recovery.game_id = all_games.game_id
                AND off_fumb_recovery.team = all_games.team
    )
    SELECT *,
        (tot_pass_tds * 6
        + tot_rush_tds * 6
        + tot_ret_tds * 6
        + tot_extra_pts * 1
        + tot_fgs * 3
        + tot_2pt_conv * 2
        + tot_def_tds * 6
        + tot_safeties * 2
        + tot_def_2pt * 2
        + tot_off_fumble_recov_td * 6) AS score,
        -- Use old team abbrev. that matches game_id for teams that moved (pbp data has new names)
        CASE
            WHEN game_id LIKE '%OAK%' AND team = 'LV' THEN 'OAK'
            WHEN game_id LIKE '%SD%' AND team = 'LAC' THEN 'SD'
            WHEN game_id LIKE '%STL%' AND team = 'LA' THEN 'STL'
            ELSE team
        END AS team_fixed
    FROM joined
), sched AS (
    WITH sched_historical AS (
        WITH home_games AS (
            SELECT
                game_id,
                season,
                week,
                game_type,
                home_team AS team,
                home_score AS score
            FROM schedules
            WHERE season < 2022
        ), away_games AS (
            SELECT
                game_id,
                season,
                week,
                game_type,
                away_team AS team,
                away_score AS score
            FROM schedules
            WHERE season < 2022
        )
        -- Stack the data
        SELECT *
        FROM home_games
        UNION ALL
        SELECT *
        FROM away_games
    ), sched_2022 AS (
        WITH home_games AS (
            SELECT
                game_id,
                season,
                week,
                game_type,
                home_team AS team,
                home_score AS score
            FROM schedules
            WHERE season = 2022 AND week <= 10
        ), away_games AS (
            SELECT
                game_id,
                season,
                week,
                game_type,
                away_team AS team,
                away_score AS score
            FROM schedules
            WHERE season = 2022 AND week <= 10
        )
        SELECT *
        FROM home_games
        UNION ALL
        SELECT *
        FROM away_games
    )
    SELECT *
    FROM sched_historical
    UNION ALL
    SELECT *
    FROM sched_2022
)
SELECT
    sched.game_id,
    sched.team,
    sched.score AS sched_score,
    pbp_summary.score AS pbp_score,
    (sched.score - pbp_summary.score) AS score_diff,
    tot_pass_tds,
    tot_rush_tds,
    tot_ret_tds,
    tot_extra_pts,
    tot_fgs,
    tot_2pt_conv,
    tot_def_tds,
    tot_safeties,
    tot_def_2pt,
    tot_off_fumble_recov_td
FROM sched
LEFT JOIN pbp_summary
    ON pbp_summary.game_id = sched.game_id
    AND pbp_summary.team_fixed = sched.team
-- WHERE sched_score <> pbp_score
-- WHERE pbp_score IS NULL
"""

df = pd.read_sql(query, conn)
print(len(df))
df.head()

12574


| | game_id | team | sched_score | pbp_score | score_diff | tot_pass_tds | tot_rush_tds | tot_ret_tds | tot_extra_pts | tot_fgs | tot_2pt_conv | tot_def_tds | tot_safeties | tot_def_2pt | tot_off_fumble_recov_td |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1999_01_MIN_ATL | ATL | 14.0 | 14.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1999_01_KC_CHI | CHI | 20.0 | 20.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1999_01_PIT_CLE | CLE | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1999_01_OAK_GB | GB | 28.0 | 28.0 | 0.0 | 4.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1999_01_BUF_IND | IND | 31.0 | 31.0 | 0.0 | 2.0 | 1.0 | 0.0 | 4.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


In [58]:
df.query('pbp_score.isnull()')[['game_id', 'team', 'sched_score', 'pbp_score']].sort_values('game_id')

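| | game_id | team | sched_score | pbp_score |
|---|---|---|---|---|
| 10 | 1999_01_BAL_STL | STL | 27.0 | NaN |
| 6147 | 1999_01_BAL_STL | BAL | 10.0 | NaN |
| 294 | 2000_03_SD_KC | KC | 42.0 | NaN |
| 6431 | 2000_03_SD_KC | SD | 10.0 | NaN |
| 334 | 2000_06_BUF_MIA | MIA | 22.0 | NaN |
| 6471 | 2000_06_BUF_MIA | BUF | 13.0 | NaN |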


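The play-by-play data is missing 3 games: one from the 1999 season (`1999_01_BAL_STL`) and two from the 2000 season (`2000_03_SD_KC` and `2000_06_BUF_MIA`). Each game shows up as two rows, one per team.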
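
A similar check can be done directly in pandas with a left merge and `indicator=True`, which labels rows that exist only in the schedule. This is a minimal sketch on toy frames: `sched_demo` and `pbp_demo` are hypothetical stand-ins for the `schedules` and `pbp` tables, populated with a few game IDs from this notebook.

```python
import pandas as pd

# Toy stand-ins for the schedules and pbp tables (hypothetical, tiny subsets).
sched_demo = pd.DataFrame(
    {"game_id": ["1999_01_BAL_STL", "1999_01_MIN_ATL", "2000_03_SD_KC"]}
)
# pbp has many rows per game, so the same game_id repeats.
pbp_demo = pd.DataFrame({"game_id": ["1999_01_MIN_ATL", "1999_01_MIN_ATL"]})

# One row per game on the pbp side, then flag schedule rows with no match.
merged = sched_demo.merge(
    pbp_demo.drop_duplicates("game_id"),
    on="game_id",
    how="left",
    indicator=True,
)
missing_ids = merged.loc[merged["_merge"] == "left_only", "game_id"].tolist()
print(missing_ids)
```

Working at the game level rather than the game-team level means each missing game is listed once instead of twice.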

In [60]:
df.query('sched_score != pbp_score and ~pbp_score.isnull()')

| | game_id | team | sched_score | pbp_score | score_diff | tot_pass_tds | tot_rush_tds | tot_ret_tds | tot_extra_pts | tot_fgs | tot_2pt_conv | tot_def_tds | tot_safeties | tot_def_2pt | tot_off_fumble_recov_td |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


All manually calculated scores from the aggregated play-by-play table match the score from the schedules table!
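
The match depends on the point weights used in the query's final `SELECT`. As a sanity check, here is a minimal pandas sketch of the same weighting applied to three rows copied from the `df.head()` output above. The `POINTS` dict and the `sample` frame are names introduced here for illustration only; they are not part of the notebook's pipeline.

```python
import pandas as pd

# Point value of each scoring-play count, mirroring the weights in the SQL query.
POINTS = {
    "tot_pass_tds": 6,
    "tot_rush_tds": 6,
    "tot_ret_tds": 6,
    "tot_extra_pts": 1,
    "tot_fgs": 3,
    "tot_2pt_conv": 2,
    "tot_def_tds": 6,
    "tot_safeties": 2,
    "tot_def_2pt": 2,
    "tot_off_fumble_recov_td": 6,
}

# Three rows copied from the df.head() output; omitted counts are zero.
sample = pd.DataFrame([
    {"team": "ATL", "sched_score": 14, "tot_pass_tds": 1, "tot_rush_tds": 1,
     "tot_extra_pts": 2},
    {"team": "CHI", "sched_score": 20, "tot_pass_tds": 2, "tot_extra_pts": 2,
     "tot_fgs": 2},
    {"team": "IND", "sched_score": 31, "tot_pass_tds": 2, "tot_rush_tds": 1,
     "tot_extra_pts": 4, "tot_fgs": 1, "tot_def_tds": 1},
])

# Add any missing count columns as 0, then take the weighted sum per row.
counts = sample.reindex(columns=list(POINTS), fill_value=0).fillna(0)
sample["pbp_score"] = counts.mul(pd.Series(POINTS)).sum(axis=1).astype(int)

# Rows where the recomputed score disagrees with the schedule score.
mismatches = sample[sample["sched_score"] != sample["pbp_score"]]
print(sample[["team", "sched_score", "pbp_score"]])
print(len(mismatches))
```

Keeping the weights in one mapping makes it easier to later group the same columns into offensive and defensive scoring when computing each side's share of the total.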