### SportsBERT
  
*Gian Favero and Michael Montemurri, Mila, 2024*

This notebook performs the following steps for the 2025 NFL Big Data Bowl competition:
1. Load the BERT model trained on sports text corpora.
2. Fine-tune the model on the BigData25 data formatted as text.
3. Train a shallow regressor to predict the xYardsGained and xWinProbAdded for various plays.

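The `df_xPassRush` dataframe contains play and player data for every game. The code below keeps only the defensive players and pivots them so that each play occupies a single row. Each player's `xPassRush` value goes into a column named after the player's position plus a per-play counter (for example `xPassRush_OLB1`, `xPassRush_OLB2`).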
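
The position-plus-counter naming can be sketched in plain Python, independent of pandas. This is only an illustration of the idea: the rows and values are made up, and `counts` and `wide` are hypothetical names that do not appear in the notebook.

```python
from collections import defaultdict

# Toy rows: (gameId, playId, position, xPassRush). Values are invented for illustration.
rows = [
    (1, 56, "ILB", 0.10),
    (1, 56, "OLB", 0.83),
    (1, 56, "OLB", 0.90),
    (1, 80, "ILB", 0.05),
]

counts = defaultdict(int)  # running count per (game, play, position)
wide = defaultdict(dict)   # one dict of columns per (game, play)

for game_id, play_id, position, x_pass_rush in rows:
    counts[(game_id, play_id, position)] += 1
    suffix = f"{position}{counts[(game_id, play_id, position)]}"
    wide[(game_id, play_id)][f"xPassRush_{suffix}"] = x_pass_rush

for play_key, columns in wide.items():
    print(play_key, columns)
```

A play with two outside linebackers ends up with both `xPassRush_OLB1` and `xPassRush_OLB2`, while a play with none simply lacks those keys, which corresponds to the NaN entries in the pivoted dataframe.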

In [71]:
# Load dataframe
import pandas as pd

df = pd.read_csv('../data/processed/df_xPassRush.csv')

# Filter for defensive players only
defensive_players = df[df["club"] != df["possessionTeam"]].copy()

# Add a unique identifier for each position within a play
defensive_players["positionCount"] = (
    defensive_players.groupby(["gameId", "playId", "position"]).cumcount() + 1
)

# Create a unique column suffix for each player using their position
defensive_players["playerSuffix"] = defensive_players["position"] + defensive_players["positionCount"].astype(str)

# Pivot the defensive players' xPassRush values into one column per position slot
pivoted = defensive_players.pivot_table(
    index=["gameId", "playId"],
    columns="playerSuffix",
    values=["xPassRush"],
    aggfunc="first"
)

# Flatten the multi-level columns for easier readability
pivoted.columns = [f"{metric}_{suffix}" for metric, suffix in pivoted.columns]

# Reset index to merge with game context
pivoted.reset_index(inplace=True)

# Extract game context (assuming it doesn't vary within a play)
game_context = defensive_players.drop_duplicates(subset=["gameId", "playId"])

# Merge pivoted data with game context
result_df = pd.merge(game_context, pivoted, on=["gameId", "playId"], how="left")

# Compare the result to the original dataframe; it checks out.
# Plays field different numbers of each position, so a column is generated for every
# position slot and left as NaN where a play has no such player.
print(result_df.head())
print(result_df.columns)


       gameId  playId homeTeamAbbr visitorTeamAbbr  frameId  nflId  \
0  2022090800      56           LA             BUF       76  38577   
1  2022090800      80           LA             BUF       23  38577   
2  2022090800     101           LA             BUF       46  38577   
3  2022090800     122           LA             BUF       78  38577   
4  2022090800     167           LA             BUF       75  38577   

    displayName position club  down  ...  xPassRush_ILB4  xPassRush_OLB1  \
0  Bobby Wagner      ILB   LA     1  ...             NaN             NaN   
1  Bobby Wagner      ILB   LA     2  ...             NaN             NaN   
2  Bobby Wagner      ILB   LA     1  ...             NaN        0.834283   
3  Bobby Wagner      ILB   LA     2  ...             NaN        0.895475   
4  Bobby Wagner      ILB   LA     2  ...             NaN        0.900457   

  xPassRush_OLB2 xPassRush_OLB3 xPassRush_OLB4  xPassRush_S1 xPassRush_S2  \
0            NaN            NaN            Na

Some additional data processing can occur here. Plays where the QB was sacked have no determinable outcome in terms of what the original play call was. These plays should be excluded from the training data: the intended call could have been anything, and leaving the call empty would let the model latch onto the missing value as a signal for fewer yards gained.

In [72]:
# Reduce the size of the dataframe by removing unnecessary columns
game_context_columns = [
        "gameId",
        "playId",
        "homeTeamAbbr",
        "visitorTeamAbbr",
        "frameId",
        "nflId",
        "displayName",
        "position",
        "club",
        "down",
        "quarter",
        "yardsToGo",
        "possessionTeam",
        "defensiveTeam",
        "yardlineSide",
        "yardlineNumber",
        "gameClock",
        "preSnapHomeScore",
        "preSnapVisitorScore",
        "preSnapHomeTeamWinProbability",
        "preSnapVisitorTeamWinProbability",
        "event",
    ]

# Offensive formation and receiver alignment related to OC
offense_columns = [
        "offenseFormation",
        "receiverAlignment",
    ]

# Man/zone indicator, pass coverage, and per-player pass-rush columns related to DC
defensive_columns = [
        "pff_manZone",
        "pff_passCoverage",
        "xPassRush"
]
defensive_columns = [col for col in result_df.columns if any(prefix in col for prefix in defensive_columns)]
defensive_columns.remove('xPassRush')

# Play description, pass location, rush location, and PFF run concept related to play call
play_columns = [
        "playDescription",
        "playAction",
        "passLocationType",
        "rushLocationType",
        "pff_runConceptPrimary",
    ]

# Run play mappings (combine common run concepts)
run_concept_mapping = {
    "outside zone": "zone",
    "inside zone": "zone",
    "pull lead": "power",
    "power": "power",
    "man": "power",
    "trap": "power",
    "counter": "misdirection",
    "draw": "misdirection",
    "fb run": "power",
    "trick": "trick",
    "undefined": "undefined",
}

# Yards gained and win probability added related to play outcome
outcome_columns = [
        "yardsGained",
        "homeTeamWinProbabilityAdded",
        "visitorTeamWinProbilityAdded",
    ]

# Combine all columns
useful_columns = game_context_columns + offense_columns + defensive_columns + play_columns + outcome_columns

df_reduced = result_df[useful_columns].copy()

# Remove plays where the QB was sacked
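# na=False treats a missing description as "not a sack" instead of producing NaN in the mask
df_reduced = df_reduced[~df_reduced["playDescription"].str.contains("sacked", na=False)]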

We can define functions to extract the relevant information from each row in the dataframe and form sentence prompts for a BERT model.
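
The extraction functions below repeat the same "take the value, or fall back to `"N/A"` when it is missing" expression on almost every line. A minimal sketch of how that pattern could be factored out is shown here; `value_or_default` and `format_probability` are hypothetical helpers that are not part of the notebook, and the NaN check assumes the values are Python or NumPy 64-bit floats. The second helper also guards against one edge case: applying a `:.2f` format to the string `"N/A"` raises a `ValueError`.

```python
import math


def value_or_default(value, default="N/A"):
    """Return `default` when `value` is None or a float NaN, otherwise `value`."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    return value


def format_probability(value):
    """Format a win probability to two decimals, tolerating missing values."""
    value = value_or_default(value)
    return value if value == "N/A" else f"{value:.2f}"


print(value_or_default(float("nan")))
print(format_probability(0.4123))
```

With helpers like these, a line such as the quarter lookup would reduce to a single call on `game_context["quarter"].values[0]`.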

In [73]:
def remove_spaces(text):
    # Remove leading and trailing spaces and double spaces
    return " ".join(text.split())

def get_game_context(game_context):
    quarter = game_context["quarter"].values[0] if not pd.isna(game_context["quarter"].values[0]) else "N/A"
    down = game_context["down"].values[0] if not pd.isna(game_context["down"].values[0]) else "N/A"
    yards_to_go = game_context["yardsToGo"].values[0] if not pd.isna(game_context["yardsToGo"].values[0]) else "N/A"
    yardline_number = game_context["yardlineNumber"].values[0] if not pd.isna(game_context["yardlineNumber"].values[0]) else "N/A"
    yardline_side = game_context["yardlineSide"].values[0] if not pd.isna(game_context["yardlineSide"].values[0]) else "N/A"
    game_clock = game_context["gameClock"].values[0] if not pd.isna(game_context["gameClock"].values[0]) else "N/A"
    pre_snap_home_score = game_context["preSnapHomeScore"].values[0] if not pd.isna(game_context["preSnapHomeScore"].values[0]) else "N/A"
    pre_snap_visitor_score = game_context["preSnapVisitorScore"].values[0] if not pd.isna(game_context["preSnapVisitorScore"].values[0]) else "N/A" 

    home_team_abbr = game_context["homeTeamAbbr"].values[0] if not pd.isna(game_context["homeTeamAbbr"].values[0]) else "N/A"
    visitor_team_abbr = game_context["visitorTeamAbbr"].values[0] if not pd.isna(game_context["visitorTeamAbbr"].values[0]) else "N/A"
    possession_team = game_context["possessionTeam"].values[0] if not pd.isna(game_context["possessionTeam"].values[0]) else "N/A"

    pre_snap_home_team_win_probability = game_context["preSnapHomeTeamWinProbability"].values[0] if not pd.isna(game_context["preSnapHomeTeamWinProbability"].values[0]) else "N/A"
    pre_snap_visitor_team_win_probability = game_context["preSnapVisitorTeamWinProbability"].values[0] if not pd.isna(game_context["preSnapVisitorTeamWinProbability"].values[0]) else "N/A"

    ret_str = f"It is {quarter} quarter with {game_clock} left. It is {down} down with {yards_to_go} yards to go. The ball is on the {yardline_side} {yardline_number} yardline. The score is {home_team_abbr} {pre_snap_home_score} - {visitor_team_abbr} {pre_snap_visitor_score}. {possession_team} have the ball. Current win probability for {home_team_abbr} is {pre_snap_home_team_win_probability:.2f} and for {visitor_team_abbr} it is {pre_snap_visitor_team_win_probability:.2f}."

    return ret_str

def get_offensive_context(offense_context):
    offense_formation = offense_context["offenseFormation"].values[0] if not pd.isna(offense_context["offenseFormation"].values[0]) else "N/A"
    receiver_alignment = offense_context["receiverAlignment"].values[0] if not pd.isna(offense_context["receiverAlignment"].values[0]) else "N/A"

    # Remove underscores and lowercase the text
    offense_formation = offense_formation.lower().replace("_", " ")
    receiver_alignment = receiver_alignment.lower().replace("_", " ")

    ret_str = f"The offense is in {offense_formation} formation with the receivers aligned in {receiver_alignment}."

    return ret_str

def get_defensive_context(defensive_context):
    defensive_formation = defensive_context["pff_manZone"].values[0] if not pd.isna(defensive_context["pff_manZone"].values[0]) else "N/A"
    pass_coverage = defensive_context["pff_passCoverage"].values[0] if not pd.isna(defensive_context["pff_passCoverage"].values[0]) else "N/A"

    # Remove underscores and lowercase the text
    defensive_formation = defensive_formation.lower().replace("_", " ")
    pass_coverage = pass_coverage.lower().replace("_", " ")

    ret_str = f"The defense is in {defensive_formation} coverage with {pass_coverage} formation."

    return ret_str

def get_blitzer(defensive_context):
    # check all columns prefixed with xPassRush
    rushers = { 'CB': 0, 'S': 0, 'ILB': 0, 'OLB': 0, 'DE': 0, 'IDL': 0 }
    for col in defensive_context.columns:
        if "xPassRush" in col:
            xPassRush = defensive_context[col].values[0] if not pd.isna(defensive_context[col].values[0]) else 0
            if xPassRush > 0.2:
                # split the column name to get the position
                position = col.split("_")[1]
                position = position[:-1] if position[-1].isdigit() else position
                rushers[position] += 1
    ret_str = f"The defense is rushing {rushers['CB']} cornerbacks, {rushers['S']} safeties, {rushers['ILB']} inside linebackers, {rushers['OLB']} outside linebackers, {rushers['DE']} defensive ends, and {rushers['IDL']} interior defensive linemen."
    return ret_str

def get_play_description(play_context):
    # Replace NaN with a placeholder or handle it explicitly
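    # Default to False (not "N/A") so a missing flag is not treated as truthy in `if playAction:` below
    playAction = play_context["playAction"].values[0] if not pd.isna(play_context["playAction"].values[0]) else False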
    passLocationType = play_context["passLocationType"].values[0] if not pd.isna(play_context["passLocationType"].values[0]) else "N/A"
    rushLocationType = play_context["rushLocationType"].values[0] if not pd.isna(play_context["rushLocationType"].values[0]) else "N/A"
    pff_runConceptPrimary = play_context["pff_runConceptPrimary"].values[0] if not pd.isna(play_context["pff_runConceptPrimary"].values[0]) else ""
    playDescription = play_context["playDescription"].values[0] if not pd.isna(play_context["playDescription"].values[0]) else ""

    if passLocationType != "N/A":
        passLocationType = passLocationType.lower().replace("_", " ").split(" ")[0] + " box"
        
        # Get the direction and depth from the play description
        passDepth = "short" if "short" in playDescription else "deep" if "deep" in playDescription else ""
        passDirection = "left" if "left" in playDescription else "right" if "right" in playDescription else "middle"

        # In case of a disagreement in passDirection and passLocationType, use the passDirection from the play description
        if passDirection == "middle" and passLocationType != "inside box":
            passLocationType = "inside box"

        if playAction:
            ret_val = f"play action {passDepth} {passLocationType} pass"
        else:
            ret_val = f"{passDepth} {passLocationType} pass"
    else:
        pff_runConceptPrimary = pff_runConceptPrimary.lower().replace("_", " ")
        # Simplify the run concept
        if pff_runConceptPrimary in run_concept_mapping:
            pff_runConceptPrimary = run_concept_mapping[pff_runConceptPrimary]
        rushLocationType = rushLocationType.lower().replace("_", " ")
        ret_val = f"{pff_runConceptPrimary} run"

    ret_val = "The offense play call is a " + ret_val + "."

    return remove_spaces(ret_val)

def get_xWinProb_outcome(outcome_context):
    home_team_abbr = outcome_context["homeTeamAbbr"].values[0] if not pd.isna(outcome_context["homeTeamAbbr"].values[0]) else "N/A"
    visitor_team_abbr = outcome_context["visitorTeamAbbr"].values[0] if not pd.isna(outcome_context["visitorTeamAbbr"].values[0]) else "N/A"
    possession_team = outcome_context["possessionTeam"].values[0] if not pd.isna(outcome_context["possessionTeam"].values[0]) else "N/A"
    yards_gained = outcome_context["yardsGained"].values[0] if not pd.isna(outcome_context["yardsGained"].values[0]) else "N/A"
    home_team_win_probability_added = outcome_context["homeTeamWinProbabilityAdded"].values[0] if not pd.isna(outcome_context["homeTeamWinProbabilityAdded"].values[0]) else "N/A"
    visitor_team_win_probability_added = outcome_context["visitorTeamWinProbilityAdded"].values[0] if not pd.isna(outcome_context["visitorTeamWinProbilityAdded"].values[0]) else "N/A"

    # Find out if the possession team is the home or visitor team
    if possession_team == home_team_abbr:
        win_probability_added = home_team_win_probability_added
    else:
        win_probability_added = visitor_team_win_probability_added

    return win_probability_added

def get_xYardsGained_outcome(outcome_context):
    yards_gained = outcome_context["yardsGained"].values[0] if not pd.isna(outcome_context["yardsGained"].values[0]) else "N/A"

    return yards_gained
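
`get_blitzer` above recovers the position from a column name by dropping the last character of the suffix, which works as long as each position counter is a single digit, and it counts rushers in a dict with six hard-coded keys. A slightly more defensive variant could look like the sketch below. `position_from_column` and `count_rushers` are hypothetical names, the row is represented as a plain dict rather than a dataframe, and the 0.2 threshold is copied from the function above.

```python
import re
from collections import Counter


def position_from_column(column):
    """Extract the position group from a name such as 'xPassRush_OLB2'."""
    suffix = column.split("_", 1)[1]    # e.g. 'OLB2'
    return re.sub(r"\d+$", "", suffix)  # strip all trailing digits, e.g. 'OLB'


def count_rushers(row, threshold=0.2):
    """Count players per position whose xPassRush exceeds `threshold`."""
    rushers = Counter()
    for column, value in row.items():
        if not column.startswith("xPassRush_") or value is None:
            continue
        # NaN compares False against any threshold, so missing players are skipped.
        if value > threshold:
            rushers[position_from_column(column)] += 1
    return rushers


example_row = {
    "xPassRush_OLB1": 0.83,
    "xPassRush_OLB2": 0.90,
    "xPassRush_ILB1": float("nan"),
    "xPassRush_S1": 0.10,
    "down": 1,
}
rushers = count_rushers(example_row)
print(rushers["OLB"], rushers["S"])
```

A `Counter` returns 0 for positions that never appear, so the sentence template can still index it by position, and a position outside the six hard-coded groups does not raise a `KeyError`.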

In [74]:
# Take sample play
sample_play = df_reduced

# Extract the game context using the game_context_columns
game_context = sample_play[game_context_columns]

# Extract the offensive formation, receiver alignment, and pre-snap win probabilities using the offense_columns
offense_context = sample_play[offense_columns]

# Extract the defensive formation, pass coverage, and pass rush using the defensive_columns
defense_context = sample_play[defensive_columns]

# Extract the play description, play action, pass location, rush location, and run concept using the play_columns
play_context = sample_play[play_columns]

# Extract the yards gained, event, and win probability added using the outcome_columns
outcome_columns += ["homeTeamAbbr", "visitorTeamAbbr", "possessionTeam"]
outcome_context = sample_play[outcome_columns]

# Get the prompts for each of the rows in the dataframe
prompts = []
labels = []
desired_label = "yardsGained"
for i in range(len(sample_play)):
    row = sample_play.iloc[[i]]

    game_context_str = get_game_context(row[game_context_columns])
    offensive_context_str = get_offensive_context(row[offense_columns])
    defensive_context_str = get_defensive_context(row[defensive_columns])
    blitzer_str = get_blitzer(row[defensive_columns])
    simple_play_call = get_play_description(row[play_columns])

    prompt = f"Play {i + 1}: {game_context_str} {offensive_context_str} {defensive_context_str} {blitzer_str} {simple_play_call}"
    prompts.append(prompt)

    win_probability_added = get_xWinProb_outcome(row[outcome_columns])
    yards_gained = get_xYardsGained_outcome(row[outcome_columns])

    if desired_label == "winProbabilityAdded":
        labels.append(round(win_probability_added, 4))
    else:
        labels.append(round(yards_gained, 4))

In [75]:
# Load the tokenizer directly from the Hugging Face model hub
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/SportsBERT")

In [76]:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Your data
data = {
    "prompt": prompts,
    "labels": labels
}

# Convert labels to float
data["labels"] = [float(label) for label in data["labels"]]

# Create a dataset
dataset = Dataset.from_dict(data)

# Convert dataset to Pandas DataFrame
df = dataset.to_pandas()

# Split into train, validation, and test sets
train_val, test = train_test_split(df, test_size=0.1, random_state=42)
train, val = train_test_split(train_val, test_size=0.1111, random_state=42)  # 10% of total goes to val

# Convert splits back to Dataset format
dataset = DatasetDict({
    "train": Dataset.from_pandas(train),
    "validation": Dataset.from_pandas(val),
    "test": Dataset.from_pandas(test)
})

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["prompt"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize_function, batched=True)

# Drop the raw text and the pandas index column; only the tokenized fields and labels are needed
dataset = dataset.remove_columns(["prompt", "__index_level_0__"])
dataset.set_format(type="torch")

print(dataset)

Map: 100%|██████████| 10317/10317 [00:02<00:00, 3580.51 examples/s]
Map: 100%|██████████| 1290/1290 [00:00<00:00, 3910.26 examples/s]
Map: 100%|██████████| 1290/1290 [00:00<00:00, 3868.52 examples/s]

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 10317
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1290
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1290
    })
})
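
The second split uses `test_size=0.1111` because it is applied to the 90% of rows that remain after the test split, so the validation set ends up at roughly 10% of the total. The arithmetic can be checked against the row counts printed above; the sketch assumes scikit-learn rounds the held-out size up to the next integer.

```python
import math

total = 10317 + 1290 + 1290  # rows reported in the DatasetDict above

n_test = math.ceil(0.1 * total)          # first split: 10% of all rows
n_train_val = total - n_test
n_val = math.ceil(0.1111 * n_train_val)  # second split: 11.11% of the remainder
n_train = n_train_val - n_val

print(total, n_train, n_val, n_test)
print(round(0.9 * 0.1111, 4))  # validation share of the total
```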





We can train a regressor head on top of BERT/SportsBERT to get an xYards or xWPA prediction.

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification

# The base BERT checkpoint is published on the Hub as "bert-base-uncased" (no "microsoft/" prefix)
base_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results/bert-base-uncased",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Define Trainer
trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)

# Train the model
trainer.train()

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification

sports_model = AutoModelForSequenceClassification.from_pretrained("microsoft/SportsBERT", num_labels=1)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results/sportsbert",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Define Trainer
trainer = Trainer(
    model=sports_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)

# Train the model
trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/SportsBERT and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 59.100998 |
| 2 | No log | 58.250061 |
| 3 | No log | 57.515442 |
| 4 | No log | 58.084133 |
| 5 | No log | 58.022552 |




TrainOutput(global_step=405, training_loss=74.28382040895062, metrics={'train_runtime': 540.1663, 'train_samples_per_second': 95.498, 'train_steps_per_second': 0.75, 'total_flos': 1.357246192799232e+16, 'train_loss': 74.28382040895062, 'epoch': 5.0})
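
With `num_labels=1` and float labels, the Hugging Face sequence-classification head is treated as a regression problem and trained with mean squared error. Assuming the labels are `yardsGained`, as set by `desired_label` earlier, the validation loss is therefore in squared yards, and its square root gives a more interpretable error in yards. A quick check using the lowest validation loss reported above:

```python
import math

best_val_mse = 57.515442  # lowest validation loss reported above (epoch 3)
rmse_yards = math.sqrt(best_val_mse)

print(f"Validation RMSE: {rmse_yards:.2f} yards")
```

Under those assumptions the typical validation error is on the order of seven to eight yards per play, which is useful context when reading a single-play prediction.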

In [None]:
import torch

# Gather a sample prompt and label
sample_prompt = prompts[0]
sample_label = labels[0]

print(sample_prompt)

# Encode the prompt
encoded_prompt = tokenizer(sample_prompt, padding="max_length", truncation=True, return_tensors="pt")

# Move the encoded prompt to the device the trainer used (GPU if available)
encoded_prompt = {key: value.to(trainer.args.device) for key, value in encoded_prompt.items()}

# Make a prediction
base_model.eval()
sports_model.eval()
with torch.no_grad():
    base_logits = base_model(**encoded_prompt).logits
    sports_logits = sports_model(**encoded_prompt).logits

# Convert logits to a prediction
base_prediction = base_logits[0].item()
sports_prediction = sports_logits[0].item()

# Print the prediction and the actual label
print(f"Base Model Prediction: {base_prediction}")
print(f"SportsBERT Prediction: {sports_prediction}")
print(f"Actual: {sample_label}")

Play 1: It is 1 quarter with 15:00 left. It is 1 down with 10 yards to go. The ball is on the BUF 25 yardline. The score is LA 0 - BUF 0. BUF have the ball. Current win probability for LA is 0.41 and for BUF it is 0.59. The offense is in shotgun formation with the receivers aligned in 2x2. The defense is in zone coverage with cover 6-left formation. The defense is rushing 0 cornerbacks, 1 safeties, 0 inside linebackers, 0 outside linebackers, 1 defensive ends, and 3 interior defensive linemen. The offense play call is a short inside box pass.
Prediction: 6.790478706359863
Actual: 6
