# Steelers Franchise Summary Data Scrape

[Pro Football Reference](https://www.pro-football-reference.com/) includes NFL data dating back to 1967: player statistics, all-time leaders, draft history, coaches, and much more. Statistics are updated every week, no later than Tuesday at 6pm. Additional data is available behind a paid subscription.

*This overview comes from [Ohio State's Sports and Society Initiative](https://sportsandsociety.osu.edu/sports-data-sets).*

In [None]:
# packages
import pandas as pd
import warnings

# scraping
import requests
from bs4 import BeautifulSoup
import re
import lxml  # HTML parser backend used by pd.read_html

# bigquery
import os
from dotenv import load_dotenv
from google.cloud import bigquery
from datetime import datetime
import db_dtypes

### Web Scraping

In [2]:
# step 1: define the URL for the Steelers page
url = "https://www.pro-football-reference.com/teams/pit/"

# step 2: get the HTML content with a User-Agent header
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

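# check for a successful request; raise instead of calling exit(),
# which can stop the notebook kernel
if response.status_code != 200:
    raise RuntimeError(f"Failed to retrieve the page (status {response.status_code})")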

If no failure message appears, we are good to continue.
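The status check above gives up on the first non-200 response. If the site occasionally answers with a temporary error, one option is to retry a few times before stopping. The following is a minimal sketch, not part of the original notebook: `fetch_with_retry`, its parameters, and the back-off schedule are all assumptions made for illustration. It accepts any zero-argument callable that returns an object with a `status_code` attribute, so it does not depend on `requests` itself.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns status 200 or the retries run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (for example a requests.Response).
    This helper is hypothetical and only illustrates the idea.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = fetch()
        last_status = response.status_code
        if last_status == 200:
            return response
        if attempt < retries:
            # wait a little longer after each failed attempt
            sleep(delay * attempt)
    raise RuntimeError(f"Failed to retrieve the page (last status {last_status})")
```

In this notebook it could be called as `response = fetch_with_retry(lambda: requests.get(url, headers=headers, timeout=10))`, where the `timeout` value is likewise an assumption.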

In [5]:
warnings.filterwarnings('ignore')

# step 3: parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# step 4: locate the table by its ID
table = soup.find("table", id="team_index")

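# step 5: load the table into a DataFrame
# (wrap the HTML in StringIO; newer pandas versions warn when given a literal HTML string)
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]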

# Display the DataFrame
df.head()

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Points,Points,...,Def Rank,Overall Rank,Overall Rank,Overall Rank,Overall Rank,Simple Rating System,Simple Rating System,Simple Rating System,Simple Rating System,Simple Rating System
Unnamed: 0_level_1,Year,Lg,Tm,W,L,T,Div. Finish,Playoffs,PF,PA,...,Yds,T/G,Pts±,Yds±,out of,MoV,SoS,SRS,OSRS,DSRS
0,2024,NFL,Pittsburgh Steelers,6,2,0,1st of 4,,187,119,...,9,2,3,11,32,8.5,-1.4,7.1,1.8,5.3
1,2023,NFL,Pittsburgh Steelers*,10,7,0,3rd of 4,Lost WC,304,324,...,21,3,21,25,32,-1.2,1.9,0.7,-3.0,3.7
2,2022,NFL,Pittsburgh Steelers,9,8,0,3rd of 4,,308,346,...,13,9,24,19,32,-2.2,1.5,-0.8,-3.0,2.3
3,2021,NFL,Pittsburgh Steelers*,9,7,1,2nd of 4,Lost WC,343,398,...,24,13,22,25,32,-3.2,0.8,-2.5,-2.6,0.1
4,2020,NFL,Pittsburgh Steelers*,12,4,0,1st of 4,Lost WC,416,312,...,3,3,7,12,32,6.5,-1.8,4.7,0.3,4.4


Step 6 is data cleaning, which is only done if needed. The output above shows that it <ins>**is**</ins> needed: the header is a two-level index padded with `Unnamed` placeholders.

In [None]:
# step 6: clean the DataFrame

df2 = df.copy()  # work on a copy so the original df stays intact if this cell is rerun or we need to revert

# flatten the multi-level column index into single-level names
df2.columns = ['_'.join(col).strip() for col in df2.columns.values]

# rename columns containing 'Unnamed' to add 'Misc_' prefix and clean the rest
df2.columns = [
    'Misc_' + re.sub(r'^Unnamed:.*?_level_0_', '', col) if 'Unnamed' in col else col
    for col in df2.columns
]

# remove special characters
df2.columns = [re.sub(r'±', '_plus_minus', col) for col in df2.columns]
df2.columns = [re.sub(r'/', '_', col) for col in df2.columns]
df2.columns = [re.sub(r'\.', '', col) for col in df2.columns]

# add underscores to column headers instead of spaces
df2.columns = [re.sub(r' ', '_', col) for col in df2.columns]

# drop rows where "Points_PF" equals 'Points' or 'PF'; these are header rows repeated within the table on the website
# (.copy() avoids a SettingWithCopyWarning when columns are reassigned below)
df2 = df2[~df2["Points_PF"].isin(['Points', 'PF'])].copy()

# names of the numeric columns to convert to float
cols_to_convert = ['Misc_W', 'Misc_L', 'Misc_T', 'Points_PF', 'Points_PA', 'Points_PD', 'Off_Rank_Pts','Off_Rank_Yds', 'Def_Rank_Pts', 
                   'Def_Rank_Yds', 'Overall_Rank_T_G','Overall_Rank_Pts_plus_minus', 'Overall_Rank_Yds_plus_minus','Overall_Rank_out_of', 
                   'Simple_Rating_System_MoV','Simple_Rating_System_SoS', 'Simple_Rating_System_SRS','Simple_Rating_System_OSRS', 'Simple_Rating_System_DSRS']

# convert the specified columns to float
df2[cols_to_convert] = df2[cols_to_convert].astype(float)

# make sure it looks better
df2.columns
#df2.head()

Index(['Misc_Year', 'Misc_Lg', 'Misc_Tm', 'Misc_W', 'Misc_L', 'Misc_T',
       'Misc_Div_Finish', 'Misc_Playoffs', 'Points_PF', 'Points_PA',
       'Points_PD', 'Misc_Coaches', 'Top_Players_AV', 'Top_Players_Passer',
       'Top_Players_Rusher', 'Top_Players_Receiver', 'Off_Rank_Pts',
       'Off_Rank_Yds', 'Def_Rank_Pts', 'Def_Rank_Yds', 'Overall_Rank_T_G',
       'Overall_Rank_Pts_plus_minus', 'Overall_Rank_Yds_plus_minus',
       'Overall_Rank_out_of', 'Simple_Rating_System_MoV',
       'Simple_Rating_System_SoS', 'Simple_Rating_System_SRS',
       'Simple_Rating_System_OSRS', 'Simple_Rating_System_DSRS'],
      dtype='object')
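The renaming steps in the cell above can also be collected into a single function, which makes the rules easier to check on a few header tuples before applying them to the whole frame. This is a sketch only: `clean_column` is a hypothetical name, and the sample tuples are modeled on the two-level headers shown in the `df.head()` output rather than copied from the live page.

```python
import re


def clean_column(col):
    """Flatten one (level_0, level_1) header tuple into a single column name."""
    name = "_".join(col).strip()
    if "Unnamed" in name:
        # placeholder top level: drop it and mark the column as miscellaneous
        name = "Misc_" + re.sub(r"^Unnamed:.*?_level_0_", "", name)
    # replace or remove the special characters, then swap spaces for underscores
    name = name.replace("±", "_plus_minus").replace("/", "_").replace(".", "")
    return name.replace(" ", "_")


# sample header tuples (illustrative)
examples = [
    ("Unnamed: 0_level_0", "Year"),
    ("Unnamed: 6_level_0", "Div. Finish"),
    ("Overall Rank", "Pts±"),
    ("Overall Rank", "T/G"),
]
cleaned = [clean_column(col) for col in examples]
print(cleaned)
# ['Misc_Year', 'Misc_Div_Finish', 'Overall_Rank_Pts_plus_minus', 'Overall_Rank_T_G']
```

With the scraped frame, the equivalent call would be `df2.columns = [clean_column(col) for col in df.columns]`.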

In [26]:
df2.head()

Unnamed: 0,Misc_Year,Misc_Lg,Misc_Tm,Misc_W,Misc_L,Misc_T,Misc_Div_Finish,Misc_Playoffs,Points_PF,Points_PA,...,Def_Rank_Yds,Overall_Rank_T_G,Overall_Rank_Pts_plus_minus,Overall_Rank_Yds_plus_minus,Overall_Rank_out_of,Simple_Rating_System_MoV,Simple_Rating_System_SoS,Simple_Rating_System_SRS,Simple_Rating_System_OSRS,Simple_Rating_System_DSRS
0,2024,NFL,Pittsburgh Steelers,6.0,2.0,0.0,1st of 4,,187.0,119.0,...,9.0,2.0,3.0,11.0,32.0,8.5,-1.4,7.1,1.8,5.3
1,2023,NFL,Pittsburgh Steelers*,10.0,7.0,0.0,3rd of 4,Lost WC,304.0,324.0,...,21.0,3.0,21.0,25.0,32.0,-1.2,1.9,0.7,-3.0,3.7
2,2022,NFL,Pittsburgh Steelers,9.0,8.0,0.0,3rd of 4,,308.0,346.0,...,13.0,9.0,24.0,19.0,32.0,-2.2,1.5,-0.8,-3.0,2.3
3,2021,NFL,Pittsburgh Steelers*,9.0,7.0,1.0,2nd of 4,Lost WC,343.0,398.0,...,24.0,13.0,22.0,25.0,32.0,-3.2,0.8,-2.5,-2.6,0.1
4,2020,NFL,Pittsburgh Steelers*,12.0,4.0,0.0,1st of 4,Lost WC,416.0,312.0,...,3.0,3.0,7.0,12.0,32.0,6.5,-1.8,4.7,0.3,4.4


In [27]:
df2.dtypes

Misc_Year                       object
Misc_Lg                         object
Misc_Tm                         object
Misc_W                         float64
Misc_L                         float64
Misc_T                         float64
Misc_Div_Finish                 object
Misc_Playoffs                   object
Points_PF                      float64
Points_PA                      float64
Points_PD                      float64
Misc_Coaches                    object
Top_Players_AV                  object
Top_Players_Passer              object
Top_Players_Rusher              object
Top_Players_Receiver            object
Off_Rank_Pts                   float64
Off_Rank_Yds                   float64
Def_Rank_Pts                   float64
Def_Rank_Yds                   float64
Overall_Rank_T_G               float64
Overall_Rank_Pts_plus_minus    float64
Overall_Rank_Yds_plus_minus    float64
Overall_Rank_out_of            float64
Simple_Rating_System_MoV       float64
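Simple_Rating_System_SoS       float64
Simple_Rating_System_SRS       float64
Simple_Rating_System_OSRS      float64
Simple_Rating_System_DSRS      float64
dtype: object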
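Note that `Misc_Year` is still `object` in the output above, because it was not in the conversion list. One alternative to listing columns by hand is `pd.to_numeric` with `errors="coerce"`, which turns anything non-numeric (such as a repeated header row) into `NaN`. The sketch below uses a small made-up frame, `toy`, as a stand-in for the scraped data; it is an illustration under that assumption, not the approach used in this notebook.

```python
import pandas as pd

# toy stand-in for the scraped table, including one repeated header row
toy = pd.DataFrame({
    "Misc_Year": ["2024", "2023", "Year", "2022"],
    "Points_PF": ["187", "304", "PF", "308"],
})

# coerce non-numeric strings (the repeated header row) to NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")

# drop rows that are entirely NaN, then set the final dtypes
clean = numeric.dropna(how="all").astype(
    {"Misc_Year": "int64", "Points_PF": "float64"}
)
print(clean.dtypes)
```

This only suits columns that are meant to be fully numeric; text columns such as `Misc_Tm` would be coerced to `NaN` and should be left out of the `apply`.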

### Load scraped data as a table to BigQuery

In [29]:
# authentication, used for both BigQuery reads and writes:
# point the client library at the service-account key file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'bq-crudek-data.json'

# initialize the BigQuery Client
client = bigquery.Client()

# set table_id to the ID of the table to create
table_id = 'crudek-data.practice_data.steelers_summary'

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("Misc_Year", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Misc_Lg", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Misc_Tm", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Misc_Div_Finish", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Misc_Playoffs", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Misc_Coaches", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Top_Players_AV", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Top_Players_Passer", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Top_Players_Rusher", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("Top_Players_Receiver", bigquery.enums.SqlTypeNames.STRING),
    ],
    write_disposition="WRITE_TRUNCATE",
)

# make API request
job = client.load_table_from_dataframe(
    df2, table_id, job_config=job_config
)  
# wait for the job to complete.
job.result()  

LoadJob<project=crudek-data, location=US, id=099031bd-a5b9-4f89-abdc-2ea4018bc50f>
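The schema in the load job above names only the string columns, so the types of the remaining columns are left to be inferred from the DataFrame. If an explicit type for every column is preferred, the column/type pairs can be generated from the dtypes instead of typed out. This is a rough sketch: `DTYPE_TO_BQ` and `schema_pairs` are hypothetical names, and the mapping covers only the dtypes that appear in this notebook.

```python
# assumed mapping from pandas dtype names to BigQuery type names
DTYPE_TO_BQ = {"object": "STRING", "float64": "FLOAT64", "int64": "INT64"}


def schema_pairs(dtypes):
    """Turn {column: dtype name} into a list of (column, BigQuery type) pairs."""
    pairs = []
    for column, dtype in dtypes.items():
        if dtype not in DTYPE_TO_BQ:
            raise ValueError(f"No BigQuery type mapped for {column!r} ({dtype})")
        pairs.append((column, DTYPE_TO_BQ[dtype]))
    return pairs


# a few columns from df2.dtypes, written out by hand for illustration
sample = {"Misc_Year": "object", "Misc_W": "float64", "Points_PF": "float64"}
pairs = schema_pairs(sample)
print(pairs)
```

With the real frame, the input could come from `df2.dtypes.astype(str).to_dict()`, and the pairs could be turned into schema entries with `[bigquery.SchemaField(name, bq_type) for name, bq_type in pairs]`.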

In [30]:
# confirm with shape
table = client.get_table(table_id)
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)

Loaded 90 rows and 29 columns to crudek-data.practice_data.steelers_summary
