# Data labeling

To use BERT and compare its results with VADER's, I'll label the dataset in this notebook. I will also remove the end-of-race radios to keep the model focused on in-race communication. If the dataset becomes too small, I will flag the need to extract more radio messages.

Then I'll need to rerun the VADER notebook to get its results on the cleaned data.

In [178]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

### First, loading the CSV

First, I need to load `radios_raw.csv`.

In [179]:
# Load radios_raw.csv into a DataFrame
df = pd.read_csv("../../outputs/week4/radios_raw.csv")
df = df.reset_index(drop=True)
# Displaying basic information
print(f"Loaded {len(df)} radio messages")
df.head()

Loaded 684 radio messages


Unnamed: 0,driver,filename,file_path,text,duration
0,1,"driver_(1,)_belgium_radio_39.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...","So don't forget Max, use your head please. Are...",15.168
1,1,"driver_(1,)_belgium_radio_40.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...","Okay Max, we're expecting rain in about 9 or 1...",15.576
2,1,"driver_(1,)_belgium_radio_60.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",FREDER Reggie,5.424
3,1,"driver_(1,)_belgium_radio_62.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",You might find this lap that you meet a little...,5.088
4,1,"driver_(1,)_belgium_radio_63.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",Just another two or three minutes to get throu...,5.712


## Filtering Out Post-Race Radio Messages

I need to remove the post-race radio messages, which can bias the behaviour of the downstream models.

### Identifying post-race messages

The following cues help identify post-race messages:

* Congratulatory messages.
* Race result discussions.
* Thank you messages to the team.
* References to "the race" in the past tense.
* Cool-down lap instructions.

--- 

### Code to filter messages

I will add a simple function that tries to detect these messages automatically. After this first pass, I will review the results and delete the post-race radio messages manually.


In [180]:
# First, let's examine some examples that might be post-race messages
# Look for rows containing certain keywords
post_race_keywords = [
    'race is over', 'good job',  'great race', 'next race',
    'cool down', 'cool-down', 'congratulations', 'result', 'finished'
]

# Create a function to check if a message appears to be post-race
def is_post_race(text):
    # Treat missing or non-string transcriptions as not post-race
    if not isinstance(text, str):
        return False
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in post_race_keywords)

# Add a new column indicating if the message appears to be post-race
df['is_post_race'] = df['text'].apply(is_post_race)





In [181]:
# Count potential post-race messages
post_race_count = df['is_post_race'].sum()
print(f"Identified {post_race_count} potential post-race messages out of {len(df)} total")

Identified 37 potential post-race messages out of 684 total
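
The substring test above also fires when a keyword appears inside a longer word (for example `result` inside `resulting`). A minimal sketch of a stricter variant, assuming whole-word matching is what is wanted; `POST_RACE_PATTERN` and `is_post_race_strict` are hypothetical names that are not part of this notebook:

```python
import re

# Same keyword list as in the cell above
post_race_keywords = [
    'race is over', 'good job', 'great race', 'next race',
    'cool down', 'cool-down', 'congratulations', 'result', 'finished'
]

# One compiled pattern; \b restricts matches to whole words or phrases
POST_RACE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in post_race_keywords) + r")\b",
    flags=re.IGNORECASE,
)

def is_post_race_strict(text):
    """Return True if text contains any keyword as a whole word or phrase."""
    # Non-string values (e.g. NaN from an empty transcription) are never flagged
    if not isinstance(text, str):
        return False
    return POST_RACE_PATTERN.search(text) is not None
```

Whole-word matching trades recall for precision: `results` no longer matches `result`, so plural forms would have to be added to the keyword list if they should be caught.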


In [182]:
# Preview some of the identified post-race messages
print("\nSample post-race messages:")
for idx, row in df[df['is_post_race']].head(50).iterrows():
    print(f"Driver {row['driver']}: {row['text']}")
    print("-" * 50)




Sample post-race messages:
Driver 1: Yeah, I gave it all. I was a bit unlucky. Still I think some okay points I guess after difficult weekend. Yeah, well done Max. That was a really strong drive today. As GP said, we got had over by the safety car, but your comeback after the stop was very, very strong. So without the safety car, I think it could have been a better race for us today, but good job. Yeah, it was quite good. Quite cool. It's been a hell of a run and yeah, it was always going to come to a close at one point, but brush yourself down and go again next weekend. Yeah, it's okay. They can take one. We'll go again next week.
--------------------------------------------------
Driver 10: Yeah, I had this dust on my eye, like I've been crying the last 20 laps driving one, with one eye. Not the easiest. Not bad for one eye though. Someone will give you some drops. If you see me crying, it's not because I'm emotional for P7. Alright, alright, we won't tell anyone. Good step guys. Go

In [183]:
# Manual verification step (important!)
print("\nNOTE: This automatic detection is just a first pass.")
print("\n Now I need to review manually")

# DataFrame.to_csv returns None when given a path, so the result is not assigned
df.to_csv("../../outputs/week4/not_filtered.csv")


NOTE: This automatic detection is just a first pass.

 Now I need to review manually


### Manual Review and Filtering

After manual review, 49 radios will be deleted. I therefore think it is a good idea to **add more radios, from at least 3 more races, to keep the dataset large enough**. After that, I will repeat the manual review and remove the post-race radios again.

In [184]:
# # After review, create a filtered dataset without post-race messages
# # Option 1: Use the automated detection (after verifying it's accurate)
# df_filtered = df[df['is_post_race'] == False].copy()


In [185]:
#Manual filtering using message indices


post_race_indices = [2,28, 44, 54, 58, 63, 67, 
                     78,79,83,87,91,96,97,
                     104,114,119,124,133,144,146,
                     159,160,178,182,187,
                     207,211,215,216,240, 245,251,254,
                     259,260,265,267,
                     274,275,277,279,290,303,308,
                     315,326,353,354,360,
                     362,370,372,379,390,
                     401,411,416,417,424,431,441,
                     451,462,471,477,481,486,
                     513,531,549,553,578,584,
                     592,613,626,641,644,646, 652
                     ]
df_filtered = df[~df.index.isin(post_race_indices)].copy()

print(f"Filtered dataset contains {len(df_filtered)} messages")



Filtered dataset contains 603 messages
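
To see how far the keyword heuristic agrees with the manual selection, the two sets of row indices can be compared. This is only a sketch: `compare_flags` is a hypothetical helper and the indices in the example are toy values, not results from this dataset.

```python
def compare_flags(auto_indices, manual_indices):
    """Compare automatically flagged row indices with manually chosen ones."""
    auto, manual = set(auto_indices), set(manual_indices)
    overlap = auto & manual
    return {
        # Share of keyword hits that were also removed manually
        "precision": len(overlap) / len(auto) if auto else 0.0,
        # Share of manually removed rows that the keywords found
        "recall": len(overlap) / len(manual) if manual else 0.0,
        "missed_by_keywords": sorted(manual - auto),
        "false_alarms": sorted(auto - manual),
    }

# Toy indices for illustration only; with the notebook's data this could be
# compare_flags(df.index[df['is_post_race']], post_race_indices)
report = compare_flags(auto_indices=[2, 28, 40], manual_indices=[2, 28, 44, 54])
print(report)
```

Because both inputs are converted to sets, repeated entries in the manual index list do not affect the comparison.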


### More steps: removing columns that are irrelevant for data labeling

Columns such as **file_path, duration, filename and is_post_race** are not relevant for sentiment analysis. I will therefore drop them from the DataFrame and generate a CSV with the filtered radios.

I will also rename the column "text" to "radio_message".

In [186]:
# Now this filtered dataset is ready for sentiment labeling
# I can now eliminate the columns
columns_to_drop = ["is_post_race", "file_path", "duration", "filename"]
df_filtered = df_filtered.drop(columns=columns_to_drop)

df_filtered = df_filtered.rename(columns={'text': 'radio_message'})


df_filtered.to_csv("../../outputs/week4/radios_filtered.csv")

--- 

## Sentiment Labeling

#### Creating the sentiment column (empty)

In [187]:
df_filtered["sentiment"] = ""

df_filtered.head()

Unnamed: 0,driver,radio_message,sentiment
0,1,"So don't forget Max, use your head please. Are...",
1,1,"Okay Max, we're expecting rain in about 9 or 1...",
3,1,You might find this lap that you meet a little...,
4,1,Just another two or three minutes to get throu...,
5,1,So settle into standard race management now Max.,


### Labeling the data manually

The following function helps me label the data manually, using:
* n as negative.
* p as positive.
* x as neutral.
* s as skip.

In [None]:
# Run this cell only once at the start
if 'current_idx' not in globals():
    current_idx = 0

In [None]:
# Create the working copy only once, so that re-running this cell for the
# next batch does not reset labels that were already entered
if 'df_labeling' not in globals():
    df_labeling = df_filtered.copy()
    df_labeling = df_labeling.reset_index(drop=True)
    # Set neutral as the default sentiment (skipped messages keep this value)
    df_labeling['sentiment'] = 'neutral'

# Function to display message and get input
def label_sentiment(row_idx):
    message = df_labeling.loc[row_idx, 'radio_message']
    driver = df_labeling.loc[row_idx, 'driver']  # Assuming you have this column
    
    print(f"Message #{row_idx} from Driver {driver}:")
    print(f"\"{message}\"")
    
    valid_input = False
    while not valid_input:
        label = input("Enter sentiment (p=positive, n=negative, x=neutral, s=skip): ").lower()
        
        if label == 'p':
            df_labeling.loc[row_idx, 'sentiment'] = 'positive'
            valid_input = True
        elif label == 'n':
            df_labeling.loc[row_idx, 'sentiment'] = 'negative'
            valid_input = True
        elif label == 'x':
            df_labeling.loc[row_idx, 'sentiment'] = 'neutral'
            valid_input = True
        elif label == 's':
            print("Skipping to next message...")
            valid_input = True
        else:
            print("Invalid input. Please use p, n, x, or s.")
    
    return df_labeling.loc[row_idx, 'sentiment']

# Example usage - can be run cell by cell
batch_size = 100

# Label messages in batches
for i in range(current_idx, min(current_idx + batch_size, len(df_labeling))):
    label = label_sentiment(i)
    print(f"Labeled as: {label}\n{'-'*50}")

# Save progress
df_labeling.to_csv('../../outputs/week4/labeled_data_progress.csv', index=False)

# Update global counter, without running past the end of the dataset
current_idx = min(current_idx + batch_size, len(df_labeling))
print(f"\nCompleted up to message {current_idx}. The next run will start from here.")


Message #0 from Driver 1:
"So don't forget Max, use your head please. Are we both doing it or what? You just follow my instruction. No, I want to know if both cars do it. Max, please follow my instruction and trust it. Thank you."
Labeled as: neutral
--------------------------------------------------
Message #1 from Driver 1:
"Okay Max, we're expecting rain in about 9 or 10 minutes. What are your thoughts? That you can get there or should we box? We'd need to box this lap to cover Leclerc. I can't see the weather, can I? I don't know."
Labeled as: neutral
--------------------------------------------------
Message #2 from Driver 1:
"You might find this lap that you meet a little bit more water."
Labeled as: neutral
--------------------------------------------------
Message #3 from Driver 1:
"Just another two or three minutes to get through this."
Labeled as: neutral
--------------------------------------------------
Message #4 from Driver 1:
"So settle into standard race management now 

In [189]:
import os

# Rename the progress file to final file
os.rename('../../outputs/week4/labeled_data_progress.csv', '../../outputs/week4/radio_labeled_data.csv')
print("File renamed successfully to radio_labeled_data.csv")

# Load the final data to check it
final_df = pd.read_csv('../../outputs/week4/radio_labeled_data.csv')
print(f"Total messages: {len(final_df)}")
print(f"Labeled messages: {final_df['sentiment'].notna().sum()}")
print(f"Sentiment distribution: {final_df['sentiment'].value_counts()}")

File renamed successfully to radio_labeled_data.csv
Total messages: 603
Labeled messages: 603
Sentiment distribution: sentiment
neutral     602
negative      1
Name: count, dtype: int64
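
A distribution with 602 of 603 messages in a single class is worth double-checking before training, since it may mean that most messages only carry the default label. The following sketch shows one possible sanity check; `dominant_share` is a hypothetical helper and the 0.95 threshold is an arbitrary choice for illustration:

```python
from collections import Counter

def dominant_share(labels):
    """Return (most common label, its share of all labels); labels must be non-empty."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Counts taken from the output above: 602 neutral, 1 negative
labels = ["neutral"] * 602 + ["negative"]
label, share = dominant_share(labels)

# 0.95 is an arbitrary threshold chosen for this sketch
if share > 0.95:
    print(f"Warning: '{label}' covers {share:.1%} of the labels; "
          "check whether earlier batches were overwritten or skipped.")
```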
