# Formula 1 Radio Sentiment Analysis with BERT

The original plan for this notebook was to implement two BERT-style models:

1. A **BERT model trained from scratch**.
2. A **pre-trained BERT model**, downloaded and fine-tuned on my team radio messages.

After some research, I decided **not to implement the first option**. The explanation follows:

---

## Sentiment Analysis for F1 Radio Messages: RoBERTa-base with Fine-tuning

### Why RoBERTa-base?

RoBERTa-base is our selected model for sentiment analysis of Formula 1 radio communications for the following reasons:

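1. **Superior performance on sentiment classification**: RoBERTa generally outperforms the original BERT on text classification benchmarks, thanks to its improved training methodology (dynamic masking, no next-sentence prediction objective, larger batches) and longer pretraining on more data.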

2. **Strong contextual understanding**: F1 radio messages contain technical jargon, specialized terminology, and driver/team codes that require deep semantic understanding to interpret correctly.

3. **Effective with specialized communications**: F1 radio messages typically feature:
   - Short, direct communications with minimal context
   - Technical terminology and abbreviations (e.g., "box this lap", "DRS enabled")
   - High-pressure communications with non-standard sentence structures
   - Team-specific codes and references

4. **Better handling of domain-specific language**: RoBERTa's robust pretraining makes it more adaptable to specialized domains like motorsport communications.

---

### Why Fine-tuning Instead of Training from Scratch?

Training RoBERTa from scratch is impractical for this project because of:

- **Dataset size limitation**: Our dataset contains only 530 labeled examples, whereas pretraining a language model requires a very large text corpus (RoBERTa was pre-trained on ~160GB of text).

- **Computational requirements**: As a rough estimate, training RoBERTa-base from scratch would require:
  - 8+ high-end GPUs (NVIDIA V100/A100).
  - 1-2 weeks of continuous training.
  - Approximately 200-400GB of GPU memory.
  - Estimated cost: $5,000-$10,000 on cloud providers.

- **Transfer learning advantage**: Fine-tuning leverages knowledge already encoded in pre-trained weights, requiring only ~2-4 hours on a single GPU to adapt to our F1 domain.

By fine-tuning RoBERTa-base, we reuse the model's pre-existing understanding of language structure and sentiment, which is particularly valuable given our specialized domain and limited labeled data.

In [3]:
# Cell 1: Import necessary libraries
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of PyTorch's
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report



# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


---

### Loading the RoBERTa model and tokenizer

First, we need to download the RoBERTa model together with its corresponding tokenizer.

In [None]:
# Cell 2: Download the RoBERTa model and tokenizer
model_name = "roberta-base"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,  # Assuming we have positive, negative, and neutral sentiments
    output_attentions=False,
    output_hidden_states=False,
)



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Move model to the appropriate device (GPU if available)
model.to(device)
print(f"Model {model_name} loaded successfully")

Model roberta-base loaded successfully

---

### Loading and preparing the data

In [6]:
# Cell 3: Load and prepare the data
# Load the CSV file
df = pd.read_csv('../../outputs/week4/radio_clean/radio_labeled_data.csv')

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())



Dataset shape: (530, 3)

First few rows:
   driver                                      radio_message sentiment
0       1  So don't forget Max, use your head please. Are...   neutral
1       1  Okay Max, we're expecting rain in about 9 or 1...   neutral
2       1  You might find this lap that you meet a little...   neutral
3       1  Just another two or three minutes to get throu...   neutral
4       1   So settle into standard race management now Max.   neutral


In [7]:
# Check sentiment distribution
print("\nSentiment distribution:")
print(df['sentiment'].value_counts())


Sentiment distribution:
sentiment
neutral     379
negative    101
positive     50
Name: count, dtype: int64
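
The counts above show a clear imbalance towards `neutral`. As a rough sketch (the counts are copied by hand from the output above, and the variable names are illustrative only), the class shares and the accuracy of a trivial always-predict-the-majority baseline can be computed like this:

```python
# Class counts copied from the value_counts() output above
counts = {"neutral": 379, "negative": 101, "positive": 50}
total = sum(counts.values())

# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")

# A model that always predicts the most frequent class reaches this accuracy,
# so it is a useful floor when reading the fine-tuned model's results
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total
print(f"Majority baseline ({majority_label}): {majority_baseline:.1%}")
```

Any accuracy reported later should be read against this floor of roughly 71.5%; per-class metrics such as those from `classification_report` say more about the two minority classes.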


---

### Creating numeric labels

The model expects integer class ids, so the sentiment strings are mapped to numbers as follows:

1. *Positive* becomes 0.
2. *Neutral* becomes 1.
3. *Negative* becomes 2.

In [8]:
# Create numeric labels based on sentiment values
sentiment_mapping = {
    'positive': 0, 
    'neutral': 1, 
    'negative': 2
}

# Apply mapping (adjust this based on your actual sentiment values)
df['label'] = df['sentiment'].map(sentiment_mapping)

# Check if we need to handle any missing mappings
if df['label'].isna().sum() > 0:
    print(f"\nWarning: {df['label'].isna().sum()} rows couldn't be mapped. Unique values in 'sentiment':")
    print(df['sentiment'].unique())
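
Model predictions come back as integer class ids, so it helps to keep the reverse mapping at hand. A minimal sketch, where `id2label` and `decode_predictions` are illustrative names that do not appear elsewhere in this notebook:

```python
# Same mapping as in the cell above
sentiment_mapping = {"positive": 0, "neutral": 1, "negative": 2}

# Invert it so that predicted class ids can be turned back into sentiment names
id2label = {idx: name for name, idx in sentiment_mapping.items()}


def decode_predictions(pred_ids):
    """Translate a sequence of integer class ids into sentiment strings."""
    return [id2label[int(i)] for i in pred_ids]


print(decode_predictions([1, 2, 0]))  # ['neutral', 'negative', 'positive']
```

Deriving the reverse mapping from `sentiment_mapping`, rather than writing it out a second time, keeps both directions consistent if the label order ever changes.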


---

### Creating the training, validation and test split

I'll follow the same splitting scheme used throughout the project, producing training, validation and test sets:

* *Train*: 70% of the dataset.
* *Validation*: 15% of the dataset.
* *Test*: 15% of the dataset.

In [None]:
# Cell 4: Create the train/validation/test split
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    df['radio_message'].values, 
    df['label'].values, 
    test_size=0.3,  # 30% for val+test
    random_state=42, 
    stratify=df['label']
)

val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts,
    temp_labels,
    test_size=0.5,  # 50% of the 30% (so 15% of total)
    random_state=42,
    stratify=temp_labels
)


In [None]:
print(f"Training set: {len(train_texts)} samples")
print(f"Validation set: {len(val_texts)} samples")
print(f"Test set: {len(test_texts)} samples")

Training set: 371 samples
Validation set: 79 samples
Test set: 80 samples
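
The reported sizes can be checked by hand. The sketch below assumes that `train_test_split` rounds the held-out share up to the next integer and gives the remainder to the other part, which is consistent with the sizes printed in this notebook; `split_sizes` is a hypothetical helper used only for this check:

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part gets ceil(test_size * n_samples) samples
    and the first part gets whatever is left.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


# First split: 70% train, 30% held out for validation + test
n_train, n_temp = split_sizes(530, 0.3)

# Second split: the held-out part is divided in half
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)
```

Because the 159 held-out samples cannot be halved evenly, the validation and test sets differ by one sample (79 and 80).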