# Data labeling

For using BERT and comparing results of VADER, I´ll label the dataset in this notebook. Moreover, I will eliminate the end-of-race radios for having a more concise model. If the dataset gets very small, I will raise a flag for extracting more radio messages.

Then, I´ll need to rerun the Vader notebook to have the results on our cleaned data.

In [46]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

### First, uploading the csv 

First I need to upload the `radios_raw.csv`.

In [47]:
# Load the radios_raw.csv into a Dataframe. 
df = pd.read_csv("../../outputs/week4/radios_raw.csv")
df = df.reset_index(drop=True)
# Displaying basic information
print(f"Loaded {len(df)} radio messages")
df.head()

Loaded 210 radio messages


Unnamed: 0,driver,filename,file_path,text,duration
0,1,"driver_(1,)_belgium_radio_39.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...","So don't forget Max, use your head please. Are...",15.168
1,1,"driver_(1,)_belgium_radio_40.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...","Okay Max, we're expecting rain in about 9 or 1...",15.576
2,1,"driver_(1,)_belgium_radio_60.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",Semi- This was awesome!,5.424
3,1,"driver_(1,)_belgium_radio_62.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",You might find this lap that you meet a little...,5.088
4,1,"driver_(1,)_belgium_radio_63.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",Just another two or three minutes to get throu...,5.712


## Filtering Out Post-Race Radio Messages

I need to eliminate the post-race radio messages that can bias the behaviour of our next models. 

### Identifying post-race messages

For identifying the post-race messages, these are some examples that can help detecting them:

* Congratulatory messages.
* Race result discussions.
* Thank you messages to the team.
* references to "the race" in past tense.
* Cool-down lap instructions.

--- 

### Code to filter messages

I will add a simple code function to try to detect this messages automatically. Then, after this test, I can proceed to delete the radio messages manually.


In [48]:
# First, let's examine some examples that might be post-race messages
# Look for rows containing certain keywords
post_race_keywords = [
    'race is over', 'good job',  'great race', 'next race',
    'cool down', 'cool-down', 'congratulations', 'result', 'finished'
]

# Create a function to check if a message appears to be post-race
def is_post_race(text):
    text_lower = text.lower()
    for keyword in post_race_keywords:
        if keyword in text_lower:
            return True
    return False

# Add a new column indicating if the message appears to be post-race
df['is_post_race'] = df['text'].apply(is_post_race)





In [49]:
# Count potential post-race messages
post_race_count = df['is_post_race'].sum()
print(f"Identified {post_race_count} potential post-race messages out of {len(df)} total")

Identified 14 potential post-race messages out of 210 total


In [50]:
# Preview some of the identified post-race messages
print("\nSample post-race messages:")
for idx, row in df[df['is_post_race'] == True].head(50).iterrows():
    print(f"Driver {row['driver']}: {row['text']}")
    print("-" * 50)




Sample post-race messages:
Driver 1: Yeah, I gave it all. I was a bit unlucky. Still I think some okay points I guess after difficult weekend. Yeah, well done Max. That was a really strong drive today. As GP said, we got had over by the safety car, but your comeback after the stop was very, very strong. So without the safety car, I think it could have been a better race for us today, but good job. Yeah, it was quite good. Quite cool. It's been a hell of a run and yeah, it was always going to come to a close at one point, but brush yourself down and go again next weekend. Yeah, it's okay. They can take one. We'll go again next week.
--------------------------------------------------
Driver 14: Good job. Thank you guys.
--------------------------------------------------
Driver 16: Good job. Lieutenant, good job. Well managed.
--------------------------------------------------
Driver 16: Good job. You've done a good job. Well managed. And P3, P3, reminder, no in-lap. No in-lap.
---------

In [51]:
# Manual verification step (important!)
print("\nNOTE: This automatic detection is just a first pass.")
print("\n Now I need to review manually")

df_not_cleaned = df.to_csv("../../outputs/week4/not_filtered.csv")


NOTE: This automatic detection is just a first pass.

 Now I need to review manually


### Manual Review and Filtering

After manual review, 49 radios are going to be deleted. Therefore, I think it is a good idea to **add more radios, at least 3 races more, to have a good amount of radios**. After that, I will proceed to make again a manual analysis and eliminate the radios.

In [None]:
# # After review, create a filtered dataset without post-race messages
# # Option 1: Use the automated detection (after verifying it's accurate)
# df_filtered = df[df['is_post_race'] == False].copy()

# Option 2: Manual filtering using message indices
post_race_indices = [list of confirmed post-race message indices]
df_filtered = df[~df.index.isin(post_race_indices)].copy()

print(f"Filtered dataset contains {len(df_filtered)} messages")



# Now this filtered dataset is ready for sentiment labeling

df_filtered.head()



Filtered dataset contains 196 messages


Unnamed: 0,driver,filename,file_path,text,duration,is_post_race
0,1,"driver_(1,)_belgium_radio_39.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...","So don't forget Max, use your head please. Are...",15.168,False
1,1,"driver_(1,)_belgium_radio_40.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...","Okay Max, we're expecting rain in about 9 or 1...",15.576,False
2,1,"driver_(1,)_belgium_radio_60.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",Semi- This was awesome!,5.424,False
3,1,"driver_(1,)_belgium_radio_62.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",You might find this lap that you meet a little...,5.088,False
4,1,"driver_(1,)_belgium_radio_63.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",Just another two or three minutes to get throu...,5.712,False


## Sentiment Labeling

#### Creating the sentiment column (empty)

In [53]:
df["sentiment"] =""

df.head()

Unnamed: 0,driver,filename,file_path,text,duration,is_post_race,sentiment
0,1,"driver_(1,)_belgium_radio_39.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...","So don't forget Max, use your head please. Are...",15.168,False,
1,1,"driver_(1,)_belgium_radio_40.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...","Okay Max, we're expecting rain in about 9 or 1...",15.576,False,
2,1,"driver_(1,)_belgium_radio_60.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",Semi- This was awesome!,5.424,False,
3,1,"driver_(1,)_belgium_radio_62.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",You might find this lap that you meet a little...,5.088,False,
4,1,"driver_(1,)_belgium_radio_63.mp3","..\..\f1-strategy\data\audio\driver_(1,)\drive...",Just another two or three minutes to get throu...,5.712,False,
