# Data labeling

To use BERT and compare its results with VADER, I'll label the dataset in this notebook. I will also remove the end-of-race radios to keep the model more focused. If the dataset becomes too small, I will flag the need to extract more radio messages.

Then, I'll need to rerun the VADER notebook to get its results on the cleaned data.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

### First, loading the CSV

First I need to load `radios_raw.csv`.

In [2]:
# Load radios_raw.csv into a DataFrame.
df = pd.read_csv("../../outputs/week4/radios_raw.csv")
df = df.reset_index(drop=True)
# Displaying basic information
print(f"Loaded {len(df)} radio messages")
df.head()

Loaded 684 radio messages


| Unnamed: 0 | driver | filename | file_path | text | duration |
|---|---|---|---|---|---|
| 0 | 1 | `driver_(1,)_belgium_radio_39.mp3` | `..\..\f1-strategy\data\audio\driver_(1,)\drive...` | So don't forget Max, use your head please. Are... | 15.168 |
| 1 | 1 | `driver_(1,)_belgium_radio_40.mp3` | `..\..\f1-strategy\data\audio\driver_(1,)\drive...` | Okay Max, we're expecting rain in about 9 or 1... | 15.576 |
| 2 | 1 | `driver_(1,)_belgium_radio_60.mp3` | `..\..\f1-strategy\data\audio\driver_(1,)\drive...` | FREDER Reggie | 5.424 |
| 3 | 1 | `driver_(1,)_belgium_radio_62.mp3` | `..\..\f1-strategy\data\audio\driver_(1,)\drive...` | You might find this lap that you meet a little... | 5.088 |
| 4 | 1 | `driver_(1,)_belgium_radio_63.mp3` | `..\..\f1-strategy\data\audio\driver_(1,)\drive...` | Just another two or three minutes to get throu... | 5.712 |


## Filtering Out Post-Race Radio Messages

I need to eliminate the post-race radio messages that can bias the behaviour of our next models. 

### Identifying post-race messages

To identify the post-race messages, these are some cues that help detect them:

* Congratulatory messages.
* Race result discussions.
* Thank you messages to the team.
* References to "the race" in past tense.
* Cool-down lap instructions.

--- 

### Code to filter messages

I will add a simple function to try to detect these messages automatically. Then, after this first pass, I can review the flagged messages and delete the post-race radios manually.


In [3]:
# First, let's examine some examples that might be post-race messages
# Look for rows containing certain keywords
post_race_keywords = [
    'race is over', 'good job',  'great race', 'next race',
    'cool down', 'cool-down', 'congratulations', 'result', 'finished'
]

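# Check whether a message appears to be post-race (case-insensitive substring match)
def is_post_race(text):
    # Missing transcriptions (NaN) are not strings; treat them as not post-race
    if not isinstance(text, str):
        return False
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in post_race_keywords)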

# Add a new column indicating if the message appears to be post-race
df['is_post_race'] = df['text'].apply(is_post_race)





In [4]:
# Count potential post-race messages
post_race_count = df['is_post_race'].sum()
print(f"Identified {post_race_count} potential post-race messages out of {len(df)} total")

Identified 37 potential post-race messages out of 684 total
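
Since the flagged messages still need a manual pass, it can help to record which keyword triggered each flag. The following is a minimal sketch on a toy DataFrame: the helper `matched_keywords`, the `matched` column and the example sentences are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Same keyword list as in the cell above
post_race_keywords = [
    'race is over', 'good job', 'great race', 'next race',
    'cool down', 'cool-down', 'congratulations', 'result', 'finished'
]

def matched_keywords(text):
    """Return the keywords found in a message (hypothetical helper)."""
    if not isinstance(text, str):
        return []
    text_lower = text.lower()
    return [kw for kw in post_race_keywords if kw in text_lower]

# Toy examples, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({'text': [
    "Good job today, great race.",
    "Box this lap, box this lap.",
    "Have you finished the switch change?",
]})
toy['matched'] = toy['text'].apply(matched_keywords)
toy['is_post_race'] = toy['matched'].apply(bool)
print(toy[['text', 'matched', 'is_post_race']])
```

The third toy message is flagged only because it contains "finished", which is the kind of false positive that makes the manual review necessary.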


In [5]:
# Preview some of the identified post-race messages
print("\nSample post-race messages:")
for idx, row in df[df['is_post_race']].head(50).iterrows():
    print(f"Driver {row['driver']}: {row['text']}")
    print("-" * 50)




Sample post-race messages:
Driver 1: Yeah, I gave it all. I was a bit unlucky. Still I think some okay points I guess after difficult weekend. Yeah, well done Max. That was a really strong drive today. As GP said, we got had over by the safety car, but your comeback after the stop was very, very strong. So without the safety car, I think it could have been a better race for us today, but good job. Yeah, it was quite good. Quite cool. It's been a hell of a run and yeah, it was always going to come to a close at one point, but brush yourself down and go again next weekend. Yeah, it's okay. They can take one. We'll go again next week.
--------------------------------------------------
Driver 10: Yeah, I had this dust on my eye, like I've been crying the last 20 laps driving one, with one eye. Not the easiest. Not bad for one eye though. Someone will give you some drops. If you see me crying, it's not because I'm emotional for P7. Alright, alright, we won't tell anyone. Good step guys. Go

In [6]:
# Manual verification step (important!)
print("\nNOTE: This automatic detection is just a first pass.")
print("\n Now I need to review manually")

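# Save the flagged dataset for manual review
# (to_csv returns None when given a path, so there is nothing to assign)
df.to_csv("../../outputs/week4/not_filtered.csv")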


NOTE: This automatic detection is just a first pass.

 Now I need to review manually


### Manual Review and Filtering

After manual review, 49 radios will be deleted. Since this shrinks the dataset, I think it is a good idea to **add more radios, from at least 3 more races, to have a good amount of radios**. After that, I will repeat the manual analysis and remove the post-race radios again.
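
As a rough check of how much the manual review shrinks the dataset, the figures above can be plugged into a small calculation. This is only a sketch: `MIN_MESSAGES` is a hypothetical threshold chosen for illustration, not a value used in this project.

```python
# Figures reported in this notebook
total_messages = 684   # rows loaded from radios_raw.csv
to_delete = 49         # radios marked for deletion in the manual review

remaining = total_messages - to_delete
removed_share = to_delete / total_messages

# Hypothetical threshold (assumption): below it, more races should be extracted
MIN_MESSAGES = 1000

print(f"Remaining messages: {remaining} ({removed_share:.1%} removed)")
if remaining < MIN_MESSAGES:
    print("Flag: dataset is small, consider extracting radios from more races")
```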

### More steps: removing columns that are irrelevant for data labeling

Columns like **file_path, duration, filename and is_post_race** are not relevant for sentiment analysis. Therefore, I will drop them from the dataframe and generate a csv with the remaining columns.

Moreover, I will rename the column "text" to "radio_message".

In [7]:
# Now this filtered dataset is ready for sentiment labeling
# I can now eliminate the columns
columns_to_drop = ["is_post_race", "file_path", "duration", "filename"]
df_filtered = df.drop(columns=columns_to_drop)

df_filtered = df_filtered.rename(columns={'text': 'radio_message'})


df_filtered.to_csv("../../outputs/week4/radios_2_columns.csv")

--- 

## Sentiment Labeling

### Creating the (empty) sentiment column

In [8]:
df_filtered["sentiment"] =""

df_filtered.head()

| Unnamed: 0 | driver | radio_message | sentiment |
|---|---|---|---|
| 0 | 1 | So don't forget Max, use your head please. Are... | |
| 1 | 1 | Okay Max, we're expecting rain in about 9 or 1... | |
| 2 | 1 | FREDER Reggie | |
| 3 | 1 | You might find this lap that you meet a little... | |
| 4 | 1 | Just another two or three minutes to get throu... | |


### Labeling the data manually


In [9]:
from ipywidgets import widgets
from IPython.display import display
import pandas as pd

# Prepare the dataframe
df_labeling = df_filtered.copy()
df_labeling = df_labeling.reset_index(drop=True)

# Create a copy for the filtered dataset (without the deleted messages)
df_filtered_messages = df_labeling.copy()

# Make sure the sentiment column exists
if 'sentiment' not in df_labeling.columns:
    df_labeling['sentiment'] = None

# Valid labels; anything else (None, NaN or the empty-string placeholder) counts as unlabeled
VALID_LABELS = ['positive', 'negative', 'neutral']

# Global counter to track the current message
current_idx = 0
# Global variable to count deleted messages
deleted_count = 0

# Create widgets
message_text = widgets.HTML()
driver_text = widgets.HTML()
index_text = widgets.HTML()

# Sentiment buttons
positive_button = widgets.Button(description='Positive', button_style='success')
negative_button = widgets.Button(description='Negative', button_style='danger')
neutral_button = widgets.Button(description='Neutral', button_style='info')

# Delete button
delete_button = widgets.Button(description='Delete message', button_style='warning')

# Navigation buttons
prev_button = widgets.Button(description='Previous')
next_button = widgets.Button(description='Next')
save_button = widgets.Button(description='Save', button_style='warning')

# Status text
status_text = widgets.HTML()
deleted_text = widgets.HTML(value="Deleted messages: 0")

# Function to update the display
def update_display():
    # Get the current message
    message = df_labeling.loc[current_idx, 'radio_message']
    driver = df_labeling.loc[current_idx, 'driver']
    
    # Update widgets
    message_text.value = f"<b>Message:</b> \"{message}\""
    driver_text.value = f"<b>Driver:</b> {driver}"
    index_text.value = f"<b>Message {current_idx+1} of {len(df_labeling)}</b>"
    
    # Check whether this message has been marked as deleted
    is_deleted = current_idx not in df_filtered_messages.index
    
    # Show the current label if there is one
    current_sentiment = df_labeling.loc[current_idx, 'sentiment']
    if is_deleted:
        status_text.value = "<b style='color:red'>DELETED</b>"
    elif current_sentiment in VALID_LABELS:
        status_text.value = f"<b>Current label:</b> {current_sentiment}"
    else:
        status_text.value = "<b>Unlabeled</b>"
    
    # Update the deleted counter
    deleted_text.value = f"Deleted messages: {deleted_count}"

# Shared logic for the three sentiment buttons
def label_current(label):
    global current_idx
    df_labeling.loc[current_idx, 'sentiment'] = label
    status_text.value = f"<b>Labeled as {label.upper()}</b>"
    
    # Move to the next message automatically
    if current_idx < len(df_labeling) - 1:
        current_idx += 1
        update_display()

# Event handlers
def on_positive_clicked(b):
    label_current('positive')

def on_negative_clicked(b):
    label_current('negative')

def on_neutral_clicked(b):
    label_current('neutral')

def on_delete_clicked(b):
    global deleted_count
    
    # Remove the current row from the filtered dataframe
    if current_idx in df_filtered_messages.index:
        df_filtered_messages.drop(current_idx, inplace=True)
        deleted_count += 1
        status_text.value = "<b style='color:red'>MESSAGE DELETED</b>"
        update_display()

def on_prev_clicked(b):
    global current_idx
    if current_idx > 0:
        current_idx -= 1
        update_display()

def on_next_clicked(b):
    global current_idx
    if current_idx < len(df_labeling) - 1:
        current_idx += 1
        update_display()

def on_save_clicked(b):
    # Save both files
    # 1. File with all the messages and their labels
    df_labeling.to_csv('../../outputs/week4/radio_labeled_data.csv', index=False)
    
    # 2. File with only the non-deleted messages
    #    (rows are taken from df_labeling so that the labels are included)
    kept = df_labeling.loc[df_filtered_messages.index].reset_index(drop=True)
    kept.to_csv('../../outputs/week4/radio_filtered.csv', index=False)
    
    # Count labeled messages (empty placeholders do not count)
    labeled_count = df_labeling['sentiment'].isin(VALID_LABELS).sum()
    total = len(df_labeling)
    percent = (labeled_count / total) * 100
    
    # Compute statistics
    stats = df_labeling['sentiment'].value_counts()
    pos = stats.get('positive', 0)
    neg = stats.get('negative', 0)
    neu = stats.get('neutral', 0)
    
    # Update status
    status_text.value = (f"<b>Saved!</b> {labeled_count}/{total} messages labeled ({percent:.1f}%)<br>"
                         f"Positive: {pos}, Negative: {neg}, Neutral: {neu}<br>"
                         f"<b>Files saved:</b><br>"
                         f"- radio_labeled_data.csv (all messages)<br>"
                         f"- radio_filtered.csv (without the {deleted_count} deleted)")

# Connect events to buttons
positive_button.on_click(on_positive_clicked)
negative_button.on_click(on_negative_clicked)
neutral_button.on_click(on_neutral_clicked)
delete_button.on_click(on_delete_clicked)
prev_button.on_click(on_prev_clicked)
next_button.on_click(on_next_clicked)
save_button.on_click(on_save_clicked)

# Show widgets
display(index_text)
display(driver_text)
display(message_text)
display(widgets.HBox([positive_button, negative_button, neutral_button]))
display(delete_button)
display(widgets.HBox([prev_button, next_button, save_button]))
display(status_text)
display(deleted_text)

# Initialize the display
update_display()

HTML(value='')

HTML(value='')

HTML(value='')

HBox(children=(Button(button_style='success', description='Positive', style=ButtonStyle()), Button(button_styl…



HBox(children=(Button(description='Previous', style=ButtonStyle()), Button(description='Next', style=ButtonSty…

HTML(value='')

HTML(value='Deleted messages: 0')

In [10]:
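# Load the labeled dataset
# Note: the labeling widget saves to outputs/week4/, while this cell reads from
# outputs/week4/radio_clean/, so the file is assumed to have been moved there first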
df_radio_labeled = pd.read_csv("../../outputs/week4/radio_clean/radio_labeled_data.csv")

# Check original size
original_length = len(df_radio_labeled)
print(f"Original dataset size: {original_length} rows")

# Count empty values in the sentiment column
null_count = df_radio_labeled['sentiment'].isna().sum()
print(f"Empty values in the 'sentiment' column: {null_count}")

# Remove rows with empty values in sentiment
df_radio_labeled_clean = df_radio_labeled.dropna(subset=['sentiment'])

# Verify how many rows were removed
new_length = len(df_radio_labeled_clean)
removed_count = original_length - new_length
print(f"Rows removed: {removed_count}")
print(f"New dataset size: {new_length} rows")

# Save the cleaned dataset, overwriting the original file
df_radio_labeled_clean.to_csv("../../outputs/week4/radio_clean/radio_labeled_data.csv", index=False)
print("\n✅ File successfully saved at:")
print("../../outputs/week4/radio_clean/radio_labeled_data.csv")


Original dataset size: 684 rows
Empty values in the 'sentiment' column: 154
Rows removed: 154
New dataset size: 530 rows

✅ File successfully saved at:
../../outputs/week4/radio_clean/radio_labeled_data.csv
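
Before moving on to BERT, it may be worth checking how balanced the labels are and mapping them to integer ids. The sketch below uses a small made-up DataFrame in place of `radio_labeled_data.csv`; the `label2id` mapping and the `label_id` column are assumptions for illustration, and any consistent mapping would work.

```python
import pandas as pd

# Toy stand-in for the cleaned labeled data (messages are placeholders)
toy_labeled = pd.DataFrame({
    'radio_message': ["msg a", "msg b", "msg c", "msg d"],
    'sentiment': ['neutral', 'positive', 'neutral', 'negative'],
})

# Class distribution as proportions, to spot imbalance between labels
distribution = toy_labeled['sentiment'].value_counts(normalize=True)
print(distribution)

# Hypothetical label-to-id mapping (assumption, not defined in this notebook)
label2id = {'negative': 0, 'neutral': 1, 'positive': 2}
toy_labeled['label_id'] = toy_labeled['sentiment'].map(label2id)
print(toy_labeled[['sentiment', 'label_id']])
```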


## Conclusions of this notebook

I created:
- `radio_labeled_data.csv` with the 530 labeled radio messages (rows without a label were removed).
- `radio_filtered.csv` without the post-race radios.