# Assignment 1
**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Sexism Detection, Multi-class Classification, RNNs, Transformers, Huggingface



# Contact
For any doubts, questions, or issues, you can contact us at the following email addresses:

Teaching Assistants:

- Federico Ruggeri -> federico.ruggeri6@unibo.it
- Eleonora Mancini -> e.mancini@unibo.it

Professor:
- Paolo Torroni -> p.torroni@unibo.it

# Introduction
You are asked to address the [EXIST 2023 Task 1](https://clef2023.clef-initiative.eu/index.php?page=Pages/labs.html#EXIST) on sexism detection.

## Problem Definition
The systems have to decide whether or not a given tweet contains or describes sexist expressions or behaviors (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behavior).

### Examples:

**Text**: *Can’t go a day without women womening*

**Label**: Sexist

**Text**: *''Society's set norms! Happy men's day though!#weareequal''*

**Label**: Not sexist

# [Task 1 - 1.0 points] Corpus

We have prepared a small version of the EXIST dataset in our dedicated [GitHub repository](https://github.com/lt-nlp-lab-unibo/nlp-course-material/tree/main/2024-2025/Assignment%201/data).

Check the `A1/data` folder. It contains 3 `.json` files representing `training`, `validation` and `test` sets.

The three sets are slightly unbalanced, with a bias toward the `Non-sexist` class.



### Dataset Description
- The dataset contains tweets in both English and Spanish.
- There are labels for multiple tasks, but we are focusing on **Task 1**.
- For Task 1, soft labels are assigned by six annotators.
- The labels for Task 1 represent whether the tweet is sexist ("YES") or not ("NO").
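
To make the notion of a soft label concrete, here is a minimal sketch that turns one tweet's six annotator votes into a fraction of "YES" votes. The helper name `soft_label` is hypothetical and not part of the assignment; the votes are those of tweet 203260 from this notebook's example record.

```python
from typing import List


def soft_label(votes: List[str]) -> float:
    """Return the fraction of annotators who voted "YES" (hypothetical helper)."""
    if not votes:
        raise ValueError("votes must not be empty")
    return votes.count("YES") / len(votes)


# Annotator votes for tweet 203260 (labels_task1)
votes = ["YES", "YES", "YES", "NO", "YES", "YES"]
p_yes = soft_label(votes)
print(f"P(YES) = {p_yes:.3f}, P(NO) = {1 - p_yes:.3f}")
```

The assignment itself works with hard labels, so this value is only useful for inspecting annotator agreement.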







### Example


    "203260": {
        "id_EXIST": "203260",
        "lang": "en",
        "tweet": "ik when mandy says “you look like a whore” i look cute as FUCK",
        "number_annotators": 6,
        "annotators": ["Annotator_473", "Annotator_474", "Annotator_475", "Annotator_476", "Annotator_477", "Annotator_27"],
        "gender_annotators": ["F", "F", "M", "M", "M", "F"],
        "age_annotators": ["18-22", "23-45", "18-22", "23-45", "46+", "46+"],
        "labels_task1": ["YES", "YES", "YES", "NO", "YES", "YES"],
        "labels_task2": ["DIRECT", "DIRECT", "REPORTED", "-", "JUDGEMENTAL", "REPORTED"],
        "labels_task3": [
          ["STEREOTYPING-DOMINANCE"],
          ["OBJECTIFICATION"],
          ["SEXUAL-VIOLENCE"],
          ["-"],
          ["STEREOTYPING-DOMINANCE", "OBJECTIFICATION"],
          ["OBJECTIFICATION"]
        ],
        "split": "TRAIN_EN"
      }
    }

In [117]:
#0. IMPORTS
# file management
import sys
import shutil
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
import tarfile
from pathlib import Path

#zip file
import zipfile

# dataframe management
import pandas as pd

# data manipulation
import numpy as np

# for readability
from typing import Iterable

# viz
from tqdm import tqdm

from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 1024,
        'height': 768,
        'scroll': True,
})

{'width': 1024, 'height': 768, 'scroll': True}

### Instructions
1. **Download** the `A1/data` folder.
2. **Load** the three JSON files and encode them as pandas dataframes.
3. **Generate hard labels** for Task 1 using majority voting and store them in a new dataframe column called `hard_label_task1`. Items without a clear majority will be removed from the dataset.
4. **Filter the DataFrame** to keep only rows where the `lang` column is `'en'`.
5. **Remove unwanted columns**: Keep only `id_EXIST`, `lang`, `tweet`, and `hard_label_task1`.
6. **Encode the `hard_label_task1` column**: Use 1 to represent "YES" and 0 to represent "NO".
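
The steps above can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the records in `toy` are invented (they merely imitate the structure of the JSON files), and the function name `preprocess` is hypothetical.

```python
import pandas as pd

# Invented records that imitate the layout of the JSON files
toy = {
    "1": {"id_EXIST": "1", "lang": "en", "tweet": "a",
          "labels_task1": ["YES"] * 4 + ["NO"] * 2, "split": "TRAIN_EN"},
    "2": {"id_EXIST": "2", "lang": "en", "tweet": "b",
          "labels_task1": ["YES"] * 3 + ["NO"] * 3, "split": "TRAIN_EN"},
    "3": {"id_EXIST": "3", "lang": "es", "tweet": "c",
          "labels_task1": ["NO"] * 6, "split": "TRAIN_ES"},
    "4": {"id_EXIST": "4", "lang": "en", "tweet": "d",
          "labels_task1": ["NO"] * 5 + ["YES"], "split": "TRAIN_EN"},
}


def vote(labels):
    """Majority vote over YES/NO; a tie yields None."""
    yes, no = labels.count("YES"), labels.count("NO")
    if yes > no:
        return "YES"
    if no > yes:
        return "NO"
    return None


def preprocess(records: dict) -> pd.DataFrame:
    # Step 2: one row per tweet
    df = pd.DataFrame.from_dict(records, orient="index")
    # Step 3: hard labels by majority vote, ties removed
    df["hard_label_task1"] = df["labels_task1"].apply(vote)
    df = df.dropna(subset=["hard_label_task1"])
    # Step 4: keep English tweets only
    df = df[df["lang"] == "en"]
    # Step 5: keep only the required columns
    df = df[["id_EXIST", "lang", "tweet", "hard_label_task1"]].copy()
    # Step 6: encode YES as 1 and NO as 0
    df["hard_label_task1"] = df["hard_label_task1"].map({"YES": 1, "NO": 0})
    return df


result = preprocess(toy)
print(result)
```

Record "2" is dropped because of the 3-3 tie and record "3" because it is Spanish, so two rows remain.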

### 1. Download the folder

In [118]:
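#1. Download the A1/data folder
class DownloadProgressBar(tqdm):
    # Progress bar whose update_to method matches the reporthook signature of urlretrieve
    def update_to(self, b=1, bsize=1, tsize=None):
        # b: blocks transferred so far, bsize: block size in bytes, tsize: total size in bytes
        if tsize is not None:
            self.total = tsize
        # tqdm.update expects an increment, so subtract the bytes already counted (self.n)
        self.update(b * bsize - self.n)

def download_url(download_path: Path, url: str):
    # Download the file at `url` to `download_path` while showing a progress bar
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)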

In [119]:
def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")


def extract_dataset(download_path: Path, extract_path: Path):
    print("Extracting dataset... (it may take a while...)")
    # Check if the file is a ZIP file
    if zipfile.is_zipfile(download_path):
        with zipfile.ZipFile(download_path, 'r') as zip_ref:
            zip_ref.extractall(extract_path)
        print("Extraction completed!")
    else:
        print("Error: The downloaded file is not a ZIP file.")

In [120]:
# Define paths and URL
url = "https://github.com/nlp-unibo/nlp-course-material/archive/refs/heads/main.zip"
dataset_name = "Exist"
dataset_folder = Path.cwd().joinpath("Datasets")
dataset_folder.mkdir(exist_ok=True)  # Create folder if it doesn't exist
download_path = dataset_folder.joinpath(f"{dataset_name}.zip")
extract_path = dataset_folder

# Download and extract
download_dataset(download_path, url)
extract_dataset(download_path, extract_path)


Downloading dataset...


main.zip: 4.58MB [00:04, 1.08MB/s]

Download complete!
Extracting dataset... (it may take a while...)
Extraction completed!
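
The cell above extracts the whole repository archive, although only the `data` folder is needed. As a possible alternative, the sketch below extracts only the members under a given prefix by using the `members` argument of `ZipFile.extractall`. The helper name `extract_subfolder` and the demo archive layout (`repo-main/...`) are assumptions made for illustration; the demo builds its own tiny archive so that it runs without network access.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_subfolder(zip_path: Path, extract_path: Path, subfolder: str) -> list:
    """Extract only the members whose path starts with `subfolder` (hypothetical helper)."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        members = [m for m in zip_ref.namelist() if m.startswith(subfolder)]
        zip_ref.extractall(extract_path, members=members)
    return members


# Self-contained demo: build a small archive that imitates a repository layout
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    archive = tmp / "demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("repo-main/README.md", "readme")
        zf.writestr("repo-main/2024-2025/Assignment 1/data/training.json", "{}")

    extracted = extract_subfolder(
        archive, tmp / "out", "repo-main/2024-2025/Assignment 1/data/"
    )
    data_dir = tmp / "out" / "repo-main" / "2024-2025" / "Assignment 1" / "data"
    training_exists = (data_dir / "training.json").is_file()
    readme_exists = (tmp / "out" / "repo-main" / "README.md").exists()
    print(extracted, training_exists, readme_exists)
```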





In [121]:
import os

# Define the path where the ZIP file should have been extracted
# (it should match where `extract_dataset` extracted the files)
extracted_path = dataset_folder  # or dataset_folder / 'nlp-course-material-main' if extracted to subfolder

# Check and print the directory structure
print("Contents of the extracted dataset folder:")
for root, dirs, files in os.walk(extracted_path):
    level = root.replace(str(extracted_path), '').count(os.sep)
    indent = ' ' * 4 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = ' ' * 4 * (level + 1)
    for f in files:
        print(f"{subindent}{f}")


Contents of the extracted dataset folder:
Datasets/
    Exist.zip
    nlp-course-material-main/
        README.md
        .gitignore
        LICENSE
        .DS_Store
        .idea/
            misc.xml
            vcs.xml
            .gitignore
            modules.xml
            nlp-course-material.iml
            inspectionProfiles/
                Project_Default.xml
                profiles_settings.xml
        2023-2024/
            Tutorial 2/
                tutorial2-2324.ipynb
            Tutorial 1/
                tutorial1-2324.ipynb
                images/
                    inputs_outputs.png
                    confusion_matrix.png
                    inputs_outputs_features.png
            Standard Project/
                Standard Project.pdf
            Tutorial 3/
                Tutorial3-2324.ipynb
                images/
                    collator.png
            Assignment 2/
                Assignment2.ipynb
                images/
                    input_

### 2. Load the three JSON files and encode them as pandas dataframes.

In [122]:
#2. Load the three JSON files and encode them as pandas dataframes

import pandas as pd
import json

# Define the path to the dataset folder and select a JSON file (e.g., training.json)
data_folder = dataset_folder.joinpath("nlp-course-material-main", "2024-2025", "Assignment 1", "data")
training_file = data_folder.joinpath("training.json")

# Load the JSON file and inspect its structure
if training_file.is_file():
    with training_file.open(mode='r', encoding='utf-8') as file:
        data = json.load(file)
        print("Loaded data structure:", type(data))  # Check if it's a list or dict
        print("Number of entries:", len(data))       # Check number of entries
        #print("First entry keys:", data[0].keys() if isinstance(data, list) else data.keys())
else:
    print(f"File {training_file} does not exist.")

Loaded data structure: <class 'dict'>
Number of entries: 6920


In [123]:
import pandas as pd
from pathlib import Path

# Define the path to the data folder and load JSON files as DataFrames
data_folder = dataset_folder.joinpath("nlp-course-material-main", "2024-2025", "Assignment 1", "data")

# Load each JSON file as a DataFrame
training_file = data_folder.joinpath("training.json")
test_file = data_folder.joinpath("test.json")
validation_file = data_folder.joinpath("validation.json")

# Load data into DataFrames if files exist
if training_file.is_file():
    training_df = pd.read_json(training_file)
    print("Training DataFrame loaded successfully!")
    print("Training DataFrame shape:", training_df.shape)
else:
    print("Training file not found.")

if test_file.is_file():
    test_df = pd.read_json(test_file)
    print("Test DataFrame loaded successfully!")
    print("Test DataFrame shape:", test_df.shape)
else:
    print("Test file not found.")

if validation_file.is_file():
    validation_df = pd.read_json(validation_file)
    print("Validation DataFrame loaded successfully!")
    print("Validation DataFrame shape:", validation_df.shape)
else:
    print("Validation file not found.")


Training DataFrame loaded successfully!
Training DataFrame shape: (11, 6920)
Test DataFrame loaded successfully!
Test DataFrame shape: (11, 312)
Validation DataFrame loaded successfully!
Validation DataFrame shape: (11, 726)
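
The shapes above have 11 rows and one column per tweet because the top-level JSON keys (the tweet ids) are interpreted as columns. The sketch below, which uses two abbreviated records with only three of the eleven fields, shows the difference between the column-wise and row-wise interpretation. `pd.DataFrame.from_dict(..., orient="index")` builds one row per tweet directly; `pd.read_json` also accepts an `orient="index"` argument, which should have the same effect, but that call is not exercised here.

```python
import json

import pandas as pd

# Two abbreviated records in the same layout as the JSON files
raw = """
{
  "100001": {"id_EXIST": "100001", "lang": "es",
             "labels_task1": ["YES", "YES", "NO", "YES", "YES", "YES"]},
  "203260": {"id_EXIST": "203260", "lang": "en",
             "labels_task1": ["YES", "YES", "YES", "NO", "YES", "YES"]}
}
"""
records = json.loads(raw)

by_column = pd.DataFrame(records)                          # tweet ids become columns
by_row = pd.DataFrame.from_dict(records, orient="index")   # tweet ids become rows
print(by_column.shape, by_row.shape)
```

Loading row-wise would make a later `transpose()` unnecessary.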


In [124]:
# Display the first few rows of the training DataFrame
if 'training_df' in locals():
    print("\nFirst few rows of the Training DataFrame:")
    print(training_df.head())

    # Display DataFrame information (columns and data types)
    print("\nTraining DataFrame Info:")
    print(training_df.info())

    # Print a random entry
    random_entry = training_df.sample(1)
    print("\nRandom entry from Training DataFrame:")
    print(random_entry)
else:
    print("Training DataFrame not loaded.")



First few rows of the Training DataFrame:
                                                              100001  \
id_EXIST                                                      100001   
lang                                                              es   
tweet              @TheChiflis Ignora al otro, es un capullo.El p...   
number_annotators                                                  6   
annotators         [Annotator_1, Annotator_2, Annotator_3, Annota...   

                                                              100002  \
id_EXIST                                                      100002   
lang                                                              es   
tweet              @ultimonomada_ Si comicsgate se parece en algo...   
number_annotators                                                  6   
annotators         [Annotator_7, Annotator_8, Annotator_9, Annota...   

                                                              100003  \
id_EXIST           

In [125]:
training_df.head()

Unnamed: 0,100001,100002,100003,100004,100005,100006,100007,100008,100009,100010,...,203251,203252,203253,203254,203255,203256,203257,203258,203259,203260
id_EXIST,100001,100002,100003,100004,100005,100006,100007,100008,100009,100010,...,203251,203252,203253,203254,203255,203256,203257,203258,203259,203260
lang,es,es,es,es,es,es,es,es,es,es,...,en,en,en,en,en,en,en,en,en,en
tweet,"@TheChiflis Ignora al otro, es un capullo.El p...",@ultimonomada_ Si comicsgate se parece en algo...,"@Steven2897 Lee sobre Gamergate, y como eso ha...",@Lunariita7 Un retraso social bastante lamenta...,@novadragon21 @icep4ck @TvDannyZ Entonces como...,@yonkykong Aaah sí. Andrew Dobson. El que se d...,@glutamatom @JoaquinAdolfoC Estaba del lado de...,@BestKabest Esta gringa sigue llorando por el ...,.¿Conoces la #DECORACION #estilo #GAMER para #...,CES 2022 ASUS ROG Rise of Gamers Evento de lan...,...,"""you look like a whore"" I'm literally wearing ...",“You look like a whore” if you think I’m cute ...,Who fucking lied to you? You look like a whore...,@ShefVaidya Ma'am if I say that you look like ...,I forgot I have a m*d that changes the drachen...,idk why y’all bitches think having half your a...,This has been a part of an experiment with @Wo...,"""Take me already"" ""Not yet. You gotta be ready...",@clintneedcoffee why do you look like a whore?...,ik when mandy says “you look like a whore” i l...
number_annotators,6,6,6,6,6,6,6,6,6,6,...,6,6,6,6,6,6,6,6,6,6
annotators,"[Annotator_1, Annotator_2, Annotator_3, Annota...","[Annotator_7, Annotator_8, Annotator_9, Annota...","[Annotator_7, Annotator_8, Annotator_9, Annota...","[Annotator_13, Annotator_14, Annotator_15, Ann...","[Annotator_19, Annotator_20, Annotator_21, Ann...","[Annotator_25, Annotator_26, Annotator_27, Ann...","[Annotator_25, Annotator_26, Annotator_27, Ann...","[Annotator_25, Annotator_26, Annotator_27, Ann...","[Annotator_31, Annotator_32, Annotator_33, Ann...","[Annotator_37, Annotator_38, Annotator_39, Ann...",...,"[Annotator_473, Annotator_474, Annotator_475, ...","[Annotator_617, Annotator_618, Annotator_619, ...","[Annotator_617, Annotator_618, Annotator_619, ...","[Annotator_668, Annotator_669, Annotator_670, ...","[Annotator_674, Annotator_675, Annotator_676, ...","[Annotator_478, Annotator_479, Annotator_480, ...","[Annotator_668, Annotator_669, Annotator_670, ...","[Annotator_467, Annotator_468, Annotator_469, ...","[Annotator_674, Annotator_675, Annotator_676, ...","[Annotator_473, Annotator_474, Annotator_475, ..."


In [126]:
test_df.head()

Unnamed: 0,400178,400179,400180,400181,400182,400183,400184,400185,400186,400187,...,400480,400481,400482,400483,400484,400485,400486,400487,400488,400489
id_EXIST,400178,400179,400180,400181,400182,400183,400184,400185,400186,400187,...,400480,400481,400482,400483,400484,400485,400486,400487,400488,400489
lang,en,en,en,en,en,en,en,en,en,en,...,en,en,en,en,en,en,en,en,en,en
tweet,1st day at the pool on a beautiful Sunday in N...,“I like your outfit too except when i dress up...,"@KNasFanFic 🥺💖 same, though!!! the angst just ...",@themaxburns @GOP Fuck that cunt. Tried to vot...,@ultshunnie u gotta say some shit like “i’ll f...,"@FernsLewds “Mmm, fuck that cunt is hungry for...",Suika gets gangbanged to death https://t.co/vn...,"“I mean I do, but… wouldn’t it be fun to get g...",@Fox_x_Gods Maybe something involving her gett...,@LogicalFallac18 @LBC Sex* as in 'gender' hara...,...,i hate liars😂. specifically those who lie bc t...,"Uh, they're allowed to? calling this out makes...",This genration are no more then selfish cunts....,@JetCelestial @polgara28951124 @CharlotteEmmaU...,@TheRealLyna Bitch your boyfriend Anthony flim...,@YesReallyAngel “Don’t wear a black bra with a...,""" get changed , you look like a prostitute . ""...",made this top and my mom gave me the “you look...,@DawnAnd91320913 I haven't seen anything that ...,@ElDukemane You look like a whore in ur new pi...
number_annotators,6,6,6,6,6,6,6,6,6,6,...,6,6,6,6,6,6,6,6,6,6
annotators,"[Annotator_764, Annotator_765, Annotator_766, ...","[Annotator_805, Annotator_426, Annotator_806, ...","[Annotator_795, Annotator_796, Annotator_797, ...","[Annotator_795, Annotator_796, Annotator_797, ...","[Annotator_770, Annotator_771, Annotator_772, ...","[Annotator_776, Annotator_777, Annotator_195, ...","[Annotator_780, Annotator_781, Annotator_782, ...","[Annotator_785, Annotator_786, Annotator_787, ...","[Annotator_770, Annotator_771, Annotator_772, ...","[Annotator_791, Annotator_122, Annotator_396, ...",...,"[Annotator_776, Annotator_777, Annotator_195, ...","[Annotator_805, Annotator_426, Annotator_806, ...","[Annotator_776, Annotator_777, Annotator_195, ...","[Annotator_801, Annotator_182, Annotator_802, ...","[Annotator_801, Annotator_182, Annotator_802, ...","[Annotator_801, Annotator_182, Annotator_802, ...","[Annotator_801, Annotator_182, Annotator_802, ...","[Annotator_795, Annotator_796, Annotator_797, ...","[Annotator_776, Annotator_777, Annotator_195, ...","[Annotator_776, Annotator_777, Annotator_195, ..."


In [127]:
validation_df.head()

Unnamed: 0,300001,300002,300003,300004,300005,300006,300007,300008,300009,300010,...,400168,400169,400170,400171,400172,400173,400174,400175,400176,400177
id_EXIST,300001,300002,300003,300004,300005,300006,300007,300008,300009,300010,...,400168,400169,400170,400171,400172,400173,400174,400175,400176,400177
lang,es,es,es,es,es,es,es,es,es,es,...,en,en,en,en,en,en,en,en,en,en
tweet,@Fichinescu La comunidad gamer es un antro de ...,@anacaotica88 @MordorLivin No me acuerdo de lo...,@cosmicJunkBot lo digo cada pocos dias y lo re...,Also mientras les decia eso la señalaba y deci...,"And all people killed, attacked, harassed by ...",On this #WorldPressFreedomDay I’m thinking of ...,@DavidGR18 @pppbernat @abc_es @agarzon @IreneM...,@DavidArranzVox @AnabelAlonso_of Uyyy a q huel...,Con 25 Leones🦁 y 500 más en las gradas🗣!!#EkoF...,@kokreto84 @Play87834898 @venusoncrack Me gust...,...,"I'm debating doing a ""feminization"" clip serie...",I'm looking for a girl I spoke to the day befo...,@parker__farquer Three for a girl.You're going...,Foreigner - Waiting for a Girl Like You [Lyric...,@leesu44 @elishabroadway @markbann57 @SeaeyesT...,Amazing that the GOP is trying to take away ou...,It is is impossible for a man to become a woma...,If Gaga decided to sing 18 versions of Free Wo...,This is your reminder that you can be child-fr...,"just completed my last final, i’m officially a..."
number_annotators,6,6,6,6,6,6,6,6,6,6,...,6,6,6,6,6,6,6,6,6,6
annotators,"[Annotator_726, Annotator_727, Annotator_357, ...","[Annotator_731, Annotator_732, Annotator_315, ...","[Annotator_735, Annotator_736, Annotator_345, ...","[Annotator_259, Annotator_739, Annotator_291, ...","[Annotator_731, Annotator_732, Annotator_315, ...","[Annotator_735, Annotator_736, Annotator_345, ...","[Annotator_731, Annotator_732, Annotator_315, ...","[Annotator_742, Annotator_743, Annotator_195, ...","[Annotator_742, Annotator_743, Annotator_195, ...","[Annotator_744, Annotator_745, Annotator_746, ...",...,"[Annotator_805, Annotator_426, Annotator_806, ...","[Annotator_780, Annotator_781, Annotator_782, ...","[Annotator_785, Annotator_786, Annotator_787, ...","[Annotator_764, Annotator_765, Annotator_766, ...","[Annotator_780, Annotator_781, Annotator_782, ...","[Annotator_805, Annotator_426, Annotator_806, ...","[Annotator_770, Annotator_771, Annotator_772, ...","[Annotator_764, Annotator_765, Annotator_766, ...","[Annotator_795, Annotator_796, Annotator_797, ...","[Annotator_770, Annotator_771, Annotator_772, ..."


In [128]:
training_df = training_df.transpose()

In [129]:
test_df = test_df.transpose()

In [130]:
validation_df = validation_df.transpose()

In [131]:
training_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split
100001,100001,es,"@TheChiflis Ignora al otro, es un capullo.El p...",6,"[Annotator_1, Annotator_2, Annotator_3, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[YES, YES, NO, YES, YES, YES]","[REPORTED, JUDGEMENTAL, -, REPORTED, JUDGEMENT...","[[OBJECTIFICATION], [OBJECTIFICATION, SEXUAL-V...",TRAIN_ES
100002,100002,es,@ultimonomada_ Si comicsgate se parece en algo...,6,"[Annotator_7, Annotator_8, Annotator_9, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, NO, NO, YES, NO]","[-, -, -, -, DIRECT, -]","[[-], [-], [-], [-], [OBJECTIFICATION], [-]]",TRAIN_ES
100003,100003,es,"@Steven2897 Lee sobre Gamergate, y como eso ha...",6,"[Annotator_7, Annotator_8, Annotator_9, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",TRAIN_ES
100004,100004,es,@Lunariita7 Un retraso social bastante lamenta...,6,"[Annotator_13, Annotator_14, Annotator_15, Ann...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, YES, NO, YES, YES]","[-, -, DIRECT, -, REPORTED, REPORTED]","[[-], [-], [IDEOLOGICAL-INEQUALITY], [-], [IDE...",TRAIN_ES
100005,100005,es,@novadragon21 @icep4ck @TvDannyZ Entonces como...,6,"[Annotator_19, Annotator_20, Annotator_21, Ann...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[YES, NO, YES, NO, YES, YES]","[REPORTED, -, JUDGEMENTAL, -, JUDGEMENTAL, DIR...","[[STEREOTYPING-DOMINANCE, OBJECTIFICATION], [-...",TRAIN_ES


In [132]:
test_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split
400178,400178,en,1st day at the pool on a beautiful Sunday in N...,6,"[Annotator_764, Annotator_765, Annotator_766, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_EN
400179,400179,en,“I like your outfit too except when i dress up...,6,"[Annotator_805, Annotator_426, Annotator_806, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, YES, YES, YES, NO]","[JUDGEMENTAL, DIRECT, REPORTED, DIRECT, REPORT...","[[OBJECTIFICATION], [OBJECTIFICATION, MISOGYNY...",DEV_EN
400180,400180,en,"@KNasFanFic 🥺💖 same, though!!! the angst just ...",6,"[Annotator_795, Annotator_796, Annotator_797, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_EN
400181,400181,en,@themaxburns @GOP Fuck that cunt. Tried to vot...,6,"[Annotator_795, Annotator_796, Annotator_797, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, DIRECT, JUDGEMENTAL, DIRECT, DIRECT, DIRECT]","[[-], [IDEOLOGICAL-INEQUALITY, MISOGYNY-NON-SE...",DEV_EN
400182,400182,en,@ultshunnie u gotta say some shit like “i’ll f...,6,"[Annotator_770, Annotator_771, Annotator_772, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, YES, YES, YES, YES]","[DIRECT, REPORTED, DIRECT, DIRECT, JUDGEMENTAL...","[[OBJECTIFICATION, SEXUAL-VIOLENCE], [SEXUAL-V...",DEV_EN


In [133]:
validation_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split
300001,300001,es,@Fichinescu La comunidad gamer es un antro de ...,6,"[Annotator_726, Annotator_727, Annotator_357, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, NO, YES, NO]","[-, JUDGEMENTAL, JUDGEMENTAL, -, REPORTED, -]","[[-], [MISOGYNY-NON-SEXUAL-VIOLENCE], [MISOGYN...",DEV_ES
300002,300002,es,@anacaotica88 @MordorLivin No me acuerdo de lo...,6,"[Annotator_731, Annotator_732, Annotator_315, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, NO, YES, YES, YES]","[JUDGEMENTAL, REPORTED, -, JUDGEMENTAL, JUDGEM...","[[IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINAN...",DEV_ES
300003,300003,es,@cosmicJunkBot lo digo cada pocos dias y lo re...,6,"[Annotator_735, Annotator_736, Annotator_345, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_ES
300004,300004,es,Also mientras les decia eso la señalaba y deci...,6,"[Annotator_259, Annotator_739, Annotator_291, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, REPORTED, REPORTED, REPORTED, JUDGEMENTAL,...","[[-], [SEXUAL-VIOLENCE], [SEXUAL-VIOLENCE], [S...",DEV_ES
300005,300005,es,"And all people killed, attacked, harassed by ...",6,"[Annotator_731, Annotator_732, Annotator_315, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, NO, NO, NO, NO]","[-, DIRECT, -, -, -, -]","[[-], [STEREOTYPING-DOMINANCE], [-], [-], [-],...",DEV_ES


### 3. Generate hard labels for Task 1 using majority voting and store them in a new dataframe column called `hard_label_task1`. Items without a clear majority will be removed from the dataset.

In [134]:
def majority_vote(labels):
    # Count the occurrences of 'YES' and 'NO'
    yes_count = labels.count('YES')
    no_count = labels.count('NO')

    # Check if there is a clear majority
    if yes_count > no_count:
        return 'YES'
    elif no_count > yes_count:
        return 'NO'
    else:
        return None  # No clear majority, return None to indicate a tie

# Apply the majority_vote function to the 'labels_task1' column
training_df['hard_label_task1'] = training_df['labels_task1'].apply(majority_vote)
test_df['hard_label_task1'] = test_df['labels_task1'].apply(majority_vote)
validation_df['hard_label_task1'] = validation_df['labels_task1'].apply(majority_vote)

# Display the result (a notebook cell renders only its last expression)
validation_df.head()


Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split,hard_label_task1
300001,300001,es,@Fichinescu La comunidad gamer es un antro de ...,6,"[Annotator_726, Annotator_727, Annotator_357, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, NO, YES, NO]","[-, JUDGEMENTAL, JUDGEMENTAL, -, REPORTED, -]","[[-], [MISOGYNY-NON-SEXUAL-VIOLENCE], [MISOGYN...",DEV_ES,
300002,300002,es,@anacaotica88 @MordorLivin No me acuerdo de lo...,6,"[Annotator_731, Annotator_732, Annotator_315, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, NO, YES, YES, YES]","[JUDGEMENTAL, REPORTED, -, JUDGEMENTAL, JUDGEM...","[[IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINAN...",DEV_ES,YES
300003,300003,es,@cosmicJunkBot lo digo cada pocos dias y lo re...,6,"[Annotator_735, Annotator_736, Annotator_345, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_ES,NO
300004,300004,es,Also mientras les decia eso la señalaba y deci...,6,"[Annotator_259, Annotator_739, Annotator_291, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, REPORTED, REPORTED, REPORTED, JUDGEMENTAL,...","[[-], [SEXUAL-VIOLENCE], [SEXUAL-VIOLENCE], [S...",DEV_ES,YES
300005,300005,es,"And all people killed, attacked, harassed by ...",6,"[Annotator_731, Annotator_732, Annotator_315, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, NO, NO, NO, NO]","[-, DIRECT, -, -, -, -]","[[-], [STEREOTYPING-DOMINANCE], [-], [-], [-],...",DEV_ES,NO
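
The empty `hard_label_task1` cell for tweet 300001 in the output above is a tie: three annotators voted "YES" and three voted "NO". The following self-contained sketch replays the same voting rule on the first three validation rows so that the tie case can be checked in isolation.

```python
import pandas as pd


def majority_vote(labels):
    """Return 'YES' or 'NO' for a strict majority, None for a tie."""
    yes_count = labels.count("YES")
    no_count = labels.count("NO")
    if yes_count > no_count:
        return "YES"
    if no_count > yes_count:
        return "NO"
    return None


# labels_task1 values copied from the first three validation rows
df = pd.DataFrame(
    {
        "labels_task1": [
            ["NO", "YES", "YES", "NO", "YES", "NO"],     # 300001: 3 vs 3, tie
            ["YES", "YES", "NO", "YES", "YES", "YES"],   # 300002: 5 vs 1
            ["NO", "NO", "NO", "NO", "NO", "NO"],        # 300003: 0 vs 6
        ]
    },
    index=[300001, 300002, 300003],
)
hard = df["labels_task1"].apply(majority_vote)
n_ties = int(hard.isna().sum())
print(hard)
print("ties:", n_ties)
```

Counting ties before dropping them is a cheap way to see how many items each split loses.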


In [135]:
training_df = training_df.dropna()
training_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split,hard_label_task1
100001,100001,es,"@TheChiflis Ignora al otro, es un capullo.El p...",6,"[Annotator_1, Annotator_2, Annotator_3, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[YES, YES, NO, YES, YES, YES]","[REPORTED, JUDGEMENTAL, -, REPORTED, JUDGEMENT...","[[OBJECTIFICATION], [OBJECTIFICATION, SEXUAL-V...",TRAIN_ES,YES
100002,100002,es,@ultimonomada_ Si comicsgate se parece en algo...,6,"[Annotator_7, Annotator_8, Annotator_9, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, NO, NO, YES, NO]","[-, -, -, -, DIRECT, -]","[[-], [-], [-], [-], [OBJECTIFICATION], [-]]",TRAIN_ES,NO
100003,100003,es,"@Steven2897 Lee sobre Gamergate, y como eso ha...",6,"[Annotator_7, Annotator_8, Annotator_9, Annota...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",TRAIN_ES,NO
100005,100005,es,@novadragon21 @icep4ck @TvDannyZ Entonces como...,6,"[Annotator_19, Annotator_20, Annotator_21, Ann...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[YES, NO, YES, NO, YES, YES]","[REPORTED, -, JUDGEMENTAL, -, JUDGEMENTAL, DIR...","[[STEREOTYPING-DOMINANCE, OBJECTIFICATION], [-...",TRAIN_ES,YES
100006,100006,es,@yonkykong Aaah sí. Andrew Dobson. El que se d...,6,"[Annotator_25, Annotator_26, Annotator_27, Ann...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 46+, 23-45, 18-22]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",TRAIN_ES,NO


In [136]:
test_df = test_df.dropna()
test_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split,hard_label_task1
400178,400178,en,1st day at the pool on a beautiful Sunday in N...,6,"[Annotator_764, Annotator_765, Annotator_766, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_EN,NO
400179,400179,en,“I like your outfit too except when i dress up...,6,"[Annotator_805, Annotator_426, Annotator_806, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, YES, YES, YES, NO]","[JUDGEMENTAL, DIRECT, REPORTED, DIRECT, REPORT...","[[OBJECTIFICATION], [OBJECTIFICATION, MISOGYNY...",DEV_EN,YES
400180,400180,en,"@KNasFanFic 🥺💖 same, though!!! the angst just ...",6,"[Annotator_795, Annotator_796, Annotator_797, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_EN,NO
400181,400181,en,@themaxburns @GOP Fuck that cunt. Tried to vot...,6,"[Annotator_795, Annotator_796, Annotator_797, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, DIRECT, JUDGEMENTAL, DIRECT, DIRECT, DIRECT]","[[-], [IDEOLOGICAL-INEQUALITY, MISOGYNY-NON-SE...",DEV_EN,YES
400182,400182,en,@ultshunnie u gotta say some shit like “i’ll f...,6,"[Annotator_770, Annotator_771, Annotator_772, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, YES, YES, YES, YES]","[DIRECT, REPORTED, DIRECT, DIRECT, JUDGEMENTAL...","[[OBJECTIFICATION, SEXUAL-VIOLENCE], [SEXUAL-V...",DEV_EN,YES


In [137]:
validation_df = validation_df.dropna()
validation_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split,hard_label_task1
300002,300002,es,@anacaotica88 @MordorLivin No me acuerdo de lo...,6,"[Annotator_731, Annotator_732, Annotator_315, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, NO, YES, YES, YES]","[JUDGEMENTAL, REPORTED, -, JUDGEMENTAL, JUDGEM...","[[IDEOLOGICAL-INEQUALITY, STEREOTYPING-DOMINAN...",DEV_ES,YES
300003,300003,es,@cosmicJunkBot lo digo cada pocos dias y lo re...,6,"[Annotator_735, Annotator_736, Annotator_345, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_ES,NO
300004,300004,es,Also mientras les decia eso la señalaba y deci...,6,"[Annotator_259, Annotator_739, Annotator_291, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, REPORTED, REPORTED, REPORTED, JUDGEMENTAL,...","[[-], [SEXUAL-VIOLENCE], [SEXUAL-VIOLENCE], [S...",DEV_ES,YES
300005,300005,es,"And all people killed, attacked, harassed by ...",6,"[Annotator_731, Annotator_732, Annotator_315, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, NO, NO, NO, NO]","[-, DIRECT, -, -, -, -]","[[-], [STEREOTYPING-DOMINANCE], [-], [-], [-],...",DEV_ES,NO
300006,300006,es,On this #WorldPressFreedomDay I’m thinking of ...,6,"[Annotator_735, Annotator_736, Annotator_345, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, YES]","[-, -, -, -, -, JUDGEMENTAL]","[[-], [-], [-], [-], [-], [STEREOTYPING-DOMINA...",DEV_ES,NO
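
Note that `dropna()` without arguments removes every row that has a missing value in any column, not only rows whose vote ended in a tie. Whether other columns of the real data contain missing values is not shown in this notebook, so the following is a precaution and not a confirmed problem. The toy frame and its `extra` column are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "tweet": ["a", "b", "c"],
        "extra": [np.nan, "x", "y"],              # invented column with an unrelated missing value
        "hard_label_task1": ["YES", None, "NO"],  # None marks a tie
    }
)

dropped_any = df.dropna()                               # drops rows 0 and 1
dropped_ties = df.dropna(subset=["hard_label_task1"])   # drops only row 1
print(len(dropped_any), len(dropped_ties))
```

Restricting the check with `subset` keeps the removal tied to the stated rule: only items without a clear majority are discarded.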


### 4. Keep only rows where `lang` is `'en'`

In [138]:
training_df = training_df[training_df['lang'] == 'en']

training_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split,hard_label_task1
200002,200002,en,Writing a uni essay in my local pub with a cof...,6,"[Annotator_391, Annotator_392, Annotator_393, ...","[F, F, M, M, M, F]","[18-22, 23-45, 18-22, 23-45, 46+, 46+]","[YES, YES, YES, NO, YES, YES]","[REPORTED, DIRECT, REPORTED, -, REPORTED, REPO...","[[STEREOTYPING-DOMINANCE, OBJECTIFICATION, MIS...",TRAIN_EN,YES
200003,200003,en,@UniversalORL it is 2021 not 1921. I dont appr...,6,"[Annotator_397, Annotator_398, Annotator_399, ...","[F, F, M, M, M, F]","[18-22, 23-45, 18-22, 23-45, 46+, 46+]","[YES, YES, NO, YES, NO, YES]","[REPORTED, REPORTED, -, REPORTED, -, JUDGEMENTAL]","[[OBJECTIFICATION, SEXUAL-VIOLENCE], [STEREOTY...",TRAIN_EN,YES
200006,200006,en,According to a customer I have plenty of time ...,6,"[Annotator_409, Annotator_410, Annotator_411, ...","[F, F, M, M, M, F]","[18-22, 23-45, 18-22, 23-45, 46+, 46+]","[YES, YES, YES, YES, YES, YES]","[REPORTED, REPORTED, REPORTED, REPORTED, REPOR...","[[STEREOTYPING-DOMINANCE, OBJECTIFICATION], [S...",TRAIN_EN,YES
200007,200007,en,"So only 'blokes' drink beer? Sorry, but if you...",6,"[Annotator_415, Annotator_416, Annotator_417, ...","[F, F, M, M, M, F]","[18-22, 23-45, 18-22, 23-45, 46+, 46+]","[YES, YES, YES, YES, YES, YES]","[JUDGEMENTAL, REPORTED, REPORTED, DIRECT, DIRE...","[[STEREOTYPING-DOMINANCE], [STEREOTYPING-DOMIN...",TRAIN_EN,YES
200008,200008,en,New to the shelves this week - looking forward...,6,"[Annotator_420, Annotator_296, Annotator_421, ...","[F, F, M, M, M, F]","[18-22, 23-45, 18-22, 23-45, 46+, 46+]","[NO, NO, NO, YES, NO, NO]","[-, -, -, JUDGEMENTAL, -, -]","[[-], [-], [-], [IDEOLOGICAL-INEQUALITY], [-],...",TRAIN_EN,NO


In [139]:
test_df = test_df[test_df['lang'] == 'en']

test_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split,hard_label_task1
400178,400178,en,1st day at the pool on a beautiful Sunday in N...,6,"[Annotator_764, Annotator_765, Annotator_766, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_EN,NO
400179,400179,en,“I like your outfit too except when i dress up...,6,"[Annotator_805, Annotator_426, Annotator_806, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, YES, YES, YES, NO]","[JUDGEMENTAL, DIRECT, REPORTED, DIRECT, REPORT...","[[OBJECTIFICATION], [OBJECTIFICATION, MISOGYNY...",DEV_EN,YES
400180,400180,en,"@KNasFanFic 🥺💖 same, though!!! the angst just ...",6,"[Annotator_795, Annotator_796, Annotator_797, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, NO, NO]","[-, -, -, -, -, -]","[[-], [-], [-], [-], [-], [-]]",DEV_EN,NO
400181,400181,en,@themaxburns @GOP Fuck that cunt. Tried to vot...,6,"[Annotator_795, Annotator_796, Annotator_797, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, DIRECT, JUDGEMENTAL, DIRECT, DIRECT, DIRECT]","[[-], [IDEOLOGICAL-INEQUALITY, MISOGYNY-NON-SE...",DEV_EN,YES
400182,400182,en,@ultshunnie u gotta say some shit like “i’ll f...,6,"[Annotator_770, Annotator_771, Annotator_772, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, YES, YES, YES, YES]","[DIRECT, REPORTED, DIRECT, DIRECT, JUDGEMENTAL...","[[OBJECTIFICATION, SEXUAL-VIOLENCE], [SEXUAL-V...",DEV_EN,YES


In [140]:
validation_df = validation_df[validation_df['lang'] == 'en']

validation_df.head()

Unnamed: 0,id_EXIST,lang,tweet,number_annotators,annotators,gender_annotators,age_annotators,labels_task1,labels_task2,labels_task3,split,hard_label_task1
400001,400001,en,"@Mike_Fabricant “You should smile more, love. ...",6,"[Annotator_764, Annotator_765, Annotator_766, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, NO, NO, NO, YES, YES]","[-, -, -, -, REPORTED, DIRECT]","[[-], [-], [-], [-], [IDEOLOGICAL-INEQUALITY, ...",DEV_EN,NO
400002,400002,en,@BBCWomansHour @LabWomenDec @EverydaySexism Sh...,6,"[Annotator_770, Annotator_771, Annotator_772, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, YES, NO, YES, YES, NO]","[REPORTED, JUDGEMENTAL, -, REPORTED, REPORTED, -]","[[IDEOLOGICAL-INEQUALITY], [OBJECTIFICATION], ...",DEV_EN,YES
400003,400003,en,#everydaysexism Some man moving my suitcase in...,6,"[Annotator_776, Annotator_777, Annotator_195, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[NO, YES, YES, YES, YES, YES]","[-, REPORTED, REPORTED, REPORTED, REPORTED, JU...","[[-], [STEREOTYPING-DOMINANCE], [OBJECTIFICATI...",DEV_EN,YES
400004,400004,en,@KolHue @OliverJia1014 lol gamergate the go to...,6,"[Annotator_780, Annotator_781, Annotator_782, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, NO, NO, NO, NO, NO]","[DIRECT, -, -, -, -, -]","[[STEREOTYPING-DOMINANCE, OBJECTIFICATION, MIS...",DEV_EN,NO
400005,400005,en,@ShelfStoriesGBL To me this has the same negat...,6,"[Annotator_780, Annotator_781, Annotator_782, ...","[F, F, F, M, M, M]","[18-22, 23-45, 46+, 18-22, 23-45, 46+]","[YES, NO, NO, YES, NO, NO]","[JUDGEMENTAL, -, -, REPORTED, -, -]","[[IDEOLOGICAL-INEQUALITY], [-], [-], [MISOGYNY...",DEV_EN,NO


### 5. Keep only relevant columns

In [141]:
training_df = training_df.drop(columns=['number_annotators', 'annotators', 'gender_annotators', 'age_annotators', 'labels_task1', 'labels_task2', 'labels_task3', 'split'])
training_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
200002,200002,en,Writing a uni essay in my local pub with a cof...,YES
200003,200003,en,@UniversalORL it is 2021 not 1921. I dont appr...,YES
200006,200006,en,According to a customer I have plenty of time ...,YES
200007,200007,en,"So only 'blokes' drink beer? Sorry, but if you...",YES
200008,200008,en,New to the shelves this week - looking forward...,NO


In [142]:
test_df = test_df.drop(columns=['number_annotators', 'annotators', 'gender_annotators', 'age_annotators', 'labels_task1', 'labels_task2', 'labels_task3', 'split'])
test_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
400178,400178,en,1st day at the pool on a beautiful Sunday in N...,NO
400179,400179,en,“I like your outfit too except when i dress up...,YES
400180,400180,en,"@KNasFanFic 🥺💖 same, though!!! the angst just ...",NO
400181,400181,en,@themaxburns @GOP Fuck that cunt. Tried to vot...,YES
400182,400182,en,@ultshunnie u gotta say some shit like “i’ll f...,YES


In [143]:
validation_df = validation_df.drop(columns=['number_annotators', 'annotators', 'gender_annotators', 'age_annotators', 'labels_task1', 'labels_task2', 'labels_task3', 'split'])
validation_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
400001,400001,en,"@Mike_Fabricant “You should smile more, love. ...",NO
400002,400002,en,@BBCWomansHour @LabWomenDec @EverydaySexism Sh...,YES
400003,400003,en,#everydaysexism Some man moving my suitcase in...,YES
400004,400004,en,@KolHue @OliverJia1014 lol gamergate the go to...,NO
400005,400005,en,@ShelfStoriesGBL To me this has the same negat...,NO


### 6. YES = 1, NO = 0

In [144]:
training_df['hard_label_task1'] = training_df['hard_label_task1'].map({'YES': 1, 'NO': 0})
training_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
200002,200002,en,Writing a uni essay in my local pub with a cof...,1
200003,200003,en,@UniversalORL it is 2021 not 1921. I dont appr...,1
200006,200006,en,According to a customer I have plenty of time ...,1
200007,200007,en,"So only 'blokes' drink beer? Sorry, but if you...",1
200008,200008,en,New to the shelves this week - looking forward...,0


In [145]:
test_df['hard_label_task1'] = test_df['hard_label_task1'].map({'YES': 1, 'NO': 0})
test_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
400178,400178,en,1st day at the pool on a beautiful Sunday in N...,0
400179,400179,en,“I like your outfit too except when i dress up...,1
400180,400180,en,"@KNasFanFic 🥺💖 same, though!!! the angst just ...",0
400181,400181,en,@themaxburns @GOP Fuck that cunt. Tried to vot...,1
400182,400182,en,@ultshunnie u gotta say some shit like “i’ll f...,1


In [146]:
validation_df['hard_label_task1'] = validation_df['hard_label_task1'].map({'YES': 1, 'NO': 0})
validation_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
400001,400001,en,"@Mike_Fabricant “You should smile more, love. ...",0
400002,400002,en,@BBCWomansHour @LabWomenDec @EverydaySexism Sh...,1
400003,400003,en,#everydaysexism Some man moving my suitcase in...,1
400004,400004,en,@KolHue @OliverJia1014 lol gamergate the go to...,0
400005,400005,en,@ShelfStoriesGBL To me this has the same negat...,0
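
Steps 4 to 6 apply the same three operations to each split. As a minimal sketch, they could be bundled into one function. The helper name `prepare_split` and the toy rows are illustrative assumptions, not part of the original notebook; the column names are the ones shown in the dataframes above.

```python
import pandas as pd

# Columns kept after step 5 (names taken from the dataframes above)
KEEP_COLUMNS = ['id_EXIST', 'lang', 'tweet', 'hard_label_task1']

def prepare_split(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep English rows, keep relevant columns, encode labels."""
    out = df.loc[df['lang'] == 'en', KEEP_COLUMNS].copy()
    out['hard_label_task1'] = out['hard_label_task1'].map({'YES': 1, 'NO': 0})
    return out

# Toy rows, made up for illustration only
toy_df = pd.DataFrame({
    'id_EXIST': ['1', '2', '3'],
    'lang': ['en', 'es', 'en'],
    'tweet': ['first tweet', 'segundo tweet', 'third tweet'],
    'split': ['TRAIN_EN', 'TRAIN_ES', 'TRAIN_EN'],
    'hard_label_task1': ['YES', 'NO', 'NO'],
})
prepared = prepare_split(toy_df)
print(prepared)
```

Taking an explicit `.copy()` of the filtered frame is a design choice here: it keeps the later column assignment from touching a view of the original dataframe.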


# [Task 2 - 0.5 points] Data Cleaning
In the context of tweets, we have noisy and informal data that often includes unnecessary elements like emojis, hashtags, mentions, and URLs. These elements may interfere with the text analysis.



### Instructions
- **Remove emojis** from the tweets.
- **Remove hashtags** (e.g., `#example`).
- **Remove mentions** such as `@user`.
- **Remove URLs** from the tweets.
- **Remove special characters and symbols**.
- **Remove specific quote characters** (e.g., curly quotes).
- **Perform lemmatization** to reduce words to their base form.

### 0. Define the symbol patterns and stop words (not requested, but included because it was done in class)

In [147]:
import re
from functools import reduce
import nltk
from nltk.corpus import stopwords

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string avoids invalid-escape warnings
GOOD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
try:
    STOPWORDS = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    STOPWORDS = set(stopwords.words('english'))

### 1. Remove emojis

This step relies on a regular expression that covers the Unicode emoji ranges. The pattern below is adapted from an online source rather than written from scratch.

In [148]:
def lower(text: str) -> str:
    """
    Transforms given text to lower case.
    """
    return text.lower()

In [149]:
import re

def remove_emojis(text):
    # Regex pattern to match emojis
    emoji_pattern = re.compile(
        "["                             # Begin a character class
        "\U0001F600-\U0001F64F"          # Emoticons
        "\U0001F300-\U0001F5FF"          # Symbols & Pictographs
        "\U0001F680-\U0001F6FF"          # Transport & Map Symbols
        "\U0001F700-\U0001F77F"          # Alchemical Symbols
        "\U0001F780-\U0001F7FF"          # Geometric Shapes Extended
        "\U0001F800-\U0001F8FF"          # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"          # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"          # Chess Symbols
        "\U0001FA70-\U0001FAFF"          # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"          # Dingbats
        "\U000024C2-\U0001F251"          # Enclosed Characters (very wide range: also matches most code points above U+24C2, e.g. CJK text)
        "]", re.UNICODE)

    # Substitute emojis with an empty string
    return emoji_pattern.sub(r'', text)
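
The last range in the pattern above, U+24C2 to U+1F251, spans most of the Basic Multilingual Plane above U+24C2, so it also strips non-emoji characters such as CJK text. A minimal sketch of a narrower alternative follows. The name `remove_emojis_narrow` and the chosen ranges are assumptions made for illustration; the list is not an exhaustive inventory of emoji code points.

```python
import re

# Hypothetical narrower pattern: explicit blocks instead of one very wide range
NARROW_EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"   # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"   # regional indicator symbols (flags)
    "\U0000FE0F"              # variation selector-16
    "\U0000200D"              # zero-width joiner
    "]+"
)

def remove_emojis_narrow(text: str) -> str:
    """Removes characters in the ranges above and leaves all other text untouched."""
    return NARROW_EMOJI_RE.sub("", text)

cleaned = remove_emojis_narrow("good luck 💜💜💜💜🤍🤍")
untouched = remove_emojis_narrow("日本語 ok")
print(repr(cleaned), repr(untouched))
```

For English-only tweets the two patterns should behave similarly on typical input; the difference matters mainly if non-Latin text has to survive cleaning.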

### 2. Remove hashtags

In [150]:
def remove_hashtags(text):
    # Regex pattern to match hashtags
    hashtag_pattern = r'#\w+'  # Matches words starting with # followed by any alphanumeric characters or underscores

    # Substitute hashtags with an empty string
    return re.sub(hashtag_pattern, '', text)

### 3. Remove mentions

In [151]:
def remove_mentions(text):
    # Regex pattern to match Twitter mentions (e.g., @username)
    mention_pattern = r'@\w+'  # Matches @ followed by any alphanumeric characters or underscores

    # Substitute mentions with an empty string
    return re.sub(mention_pattern, '', text)

### 4. Remove urls

In [152]:
def remove_urls(text):
    # Regex pattern to match URLs starting with http:// or https://
    url_pattern = r'https?://\S+'  # Scheme followed by any run of non-whitespace characters

    # Substitute URLs with an empty string
    return re.sub(url_pattern, '', text)

### 5. Remove special characters

In [153]:
def replace_special_characters(text: str) -> str:
    """
    Replaces special characters, such as parentheses, with a space character.
    """
    return REPLACE_BY_SPACE_RE.sub(' ', text)

### 6. Remove quotation characters

In [154]:
def remove_quotations(text):
    # Regex pattern to match straight and curly quotes
    quote_pattern = r"[\"'‘’“”]"  # Matches straight single/double quotes and their curly variants

    # Substitute quotation marks with an empty string
    return re.sub(quote_pattern, '', text)

Before moving on to lemmatization, the cleaning functions above are applied to the whole tweet column of each split.

In [155]:
# typing
from typing import List, Callable, Dict
from collections import OrderedDict

PREPROCESSING_PIPELINE = [
                          lower,
                          remove_emojis,
                          remove_hashtags,
                          remove_mentions,
                          remove_urls,
                          replace_special_characters,
                          remove_quotations
                          ]

def text_prepare(text: str,
                 filter_methods: List[Callable[[str], str]] = None) -> str:
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """
    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE
    return reduce(lambda txt, f: f(txt), filter_methods, text)
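
To make the `reduce`-based chaining concrete, here is a small self-contained sketch. The function names `strip_mentions`, `squeeze_spaces` and `apply_pipeline` are hypothetical stand-ins; in particular, whitespace squeezing is an extra step that is not part of the pipeline above, shown only because removing mentions tends to leave stray spaces behind.

```python
import re
from functools import reduce

def lower(text: str) -> str:
    return text.lower()

def strip_mentions(text: str) -> str:
    # Same idea as remove_mentions: drop @user tokens
    return re.sub(r'@\w+', '', text)

def squeeze_spaces(text: str) -> str:
    # Hypothetical extra step: collapse runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

def apply_pipeline(text, steps):
    # Each step receives the output of the previous one, left to right
    return reduce(lambda txt, f: f(txt), steps, text)

result = apply_pipeline("@User  Hello   WORLD", [lower, strip_mentions, squeeze_spaces])
print(repr(result))
```

Because each function sees the output of the one before it, placing a whitespace step last cleans up the gaps left by the earlier removals.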

In [156]:
print('Pre-processing text...')

print()
print(f'[Debug train] Before:\n{training_df.tweet.values[50]}')
print(f'[Debug test] Before:\n{test_df.tweet.values[50]}')
print(f'[Debug validation] Before:\n{validation_df.tweet.values[50]}')
print()

# Replace each sentence with its pre-processed version
training_df['tweet'] = training_df['tweet'].apply(text_prepare)
test_df['tweet'] = test_df['tweet'].apply(text_prepare)
validation_df['tweet'] = validation_df['tweet'].apply(text_prepare)


print(f'[Debug train] After:\n{training_df.tweet.values[50]}')
print(f'[Debug test] After:\n{test_df.tweet.values[50]}')
print(f'[Debug validation] After:\n{validation_df.tweet.values[50]}')
print()

print("Pre-processing completed!")

Pre-processing text...

[Debug train] Before:
@LibertyAnders I get it.. kind of. 80% of women are going after about 20% of men and ignoring the rest according to statistics of 3 dating apps. so that’s a pain point for some men, but Is hating women really a main component to being an incel? And what’s the difference between MGTOW and Incel?
[Debug test] Before:
@HTFCirno2000 I have an old receipt printer thing but I have no idea how to use it. I would like to fuck with it like you did. https://t.co/UpXFySfBwB
[Debug validation] Before:
@motahedoon Alrubaye bint moawad Good luck 💜💜💜💜🤍🤍#motahedoon_challeng

[Debug train] After:
 i get it.. kind of. 80% of women are going after about 20% of men and ignoring the rest according to statistics of 3 dating apps. so thats a pain point for some men  but is hating women really a main component to being an incel? and whats the difference between mgtow and incel?
[Debug test] After:
 i have an old receipt printer thing but i have no idea how to use 

### 7. Perform lemmatization

In [157]:
#creating lemmatizer object

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize, sent_tokenize, WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()

nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [158]:
from tqdm import tqdm  # progress bar; harmless if already imported in an earlier cell

def get_wordnet_key(tag: str) -> str:
    """
    Maps a Penn Treebank POS tag to the matching WordNet POS constant.
    Falls back to noun for every other tag.
    """
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lem_text(text: str) -> str:
    # Tokenize on whitespace, tag each token, then lemmatize it with its POS
    tokens = tokenizer.tokenize(text)
    tagged = pos_tag(tokens)
    words = [lemmatizer.lemmatize(word, get_wordnet_key(tag)) for word, tag in tagged]
    return " ".join(words)

training_df['tweet'] = [lem_text(text) for text in tqdm(training_df['tweet'], leave=True, position=0)]
test_df['tweet'] = [lem_text(text) for text in tqdm(test_df['tweet'], leave=True, position=0)]
validation_df['tweet'] = [lem_text(text) for text in tqdm(validation_df['tweet'], leave=True, position=0)]


100%|██████████| 2870/2870 [00:06<00:00, 471.17it/s]
100%|██████████| 286/286 [00:00<00:00, 472.08it/s]
100%|██████████| 158/158 [00:00<00:00, 464.30it/s]


In [159]:
training_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
200002,200002,en,write a uni essay in my local pub with a coffe...,1
200003,200003,en,it be 2021 not 1921. i dont appreciate that on...,1
200006,200006,en,accord to a customer i have plenty of time to ...,1
200007,200007,en,so only blokes drink beer? sorry but if you ar...,1
200008,200008,en,new to the shelf this week - look forward to r...,0


In [160]:
test_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
400178,400178,en,1st day at the pool on a beautiful sunday in n...,0
400179,400179,en,i like your outfit too except when i dress up ...,1
400180,400180,en,same though!!! the angst just come and goes. l...,0
400181,400181,en,fuck that cunt. try to vote her out multiple time,1
400182,400182,en,u gotta say some shit like ill fuck that cunt ...,1


In [161]:
validation_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1
400001,400001,en,you should smile more love. just pretend youre...,0
400002,400002,en,she be right but the push be all in the opposi...,1
400003,400003,en,some man move my suitcase in the overhead lugg...,1
400004,400004,en,lol gamergate the go to boogieman maybe if the...,0
400005,400005,en,to me this have the same negativity a gamergat...,0


In [162]:
print(f'[Debug] After:\n{training_df.tweet.values[50]}')
print()

[Debug] After:
i get it.. kind of. 80% of woman be go after about 20% of men and ignore the rest accord to statistic of 3 date apps. so thats a pain point for some men but be hat woman really a main component to be an incel? and whats the difference between mgtow and incel?



# [Task 3 - 0.5 points] Text Encoding
To train a neural sexism classifier, you first need to encode text into numerical format.




### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.





### Note: What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe **must** be added to the vocabulary.
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **special token** (e.g., [UNK]) and a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)



### More about OOV

For a given token:

* **If in train set**: add to vocabulary and assign an embedding (use GloVe if token in GloVe, custom embedding otherwise).
* **If in val/test set**: assign special token if not in vocabulary and assign custom embedding.

Your vocabulary **should**:

* Contain all tokens in train set; or
* Union of tokens in train set and in GloVe $\rightarrow$ we make use of existing knowledge!
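
The OOV rules above can be sketched with token ids and an embedding matrix. This is a minimal illustration under stated assumptions: the names `build_vocab_and_matrix` and `encode`, the reserved `[PAD]` id 0, the random range for training tokens missing from GloVe, and the mean-vector `[UNK]` embedding are all choices made for this sketch, and the toy GloVe dictionary is made up.

```python
import numpy as np

def build_vocab_and_matrix(train_tokens, glove, dim, seed=0):
    """Hypothetical helper: ids for [PAD], [UNK] and every training token."""
    rng = np.random.default_rng(seed)
    token_to_id = {"[PAD]": 0, "[UNK]": 1}
    vectors = [np.zeros(dim, dtype="float32"), None]  # [UNK] is filled in below
    for token in sorted(set(train_tokens)):
        token_to_id[token] = len(vectors)
        if token in glove:
            # Training token known to GloVe: reuse the pretrained vector
            vectors.append(np.asarray(glove[token], dtype="float32"))
        else:
            # Training token missing from GloVe: custom (random) vector
            vectors.append(rng.uniform(-0.5, 0.5, dim).astype("float32"))
    # Static [UNK] embedding: mean of all training-token vectors
    vectors[1] = np.mean(vectors[2:], axis=0).astype("float32")
    return token_to_id, np.stack(vectors)

def encode(tokens, token_to_id):
    """Maps tokens to ids; anything outside the vocabulary becomes [UNK]."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(token, unk_id) for token in tokens]

# Toy GloVe table with made-up 2-dimensional vectors
toy_glove = {"women": [0.1, 0.2], "men": [0.3, 0.4]}
token_to_id, matrix = build_vocab_and_matrix(["women", "incel", "men"], toy_glove, dim=2)
val_ids = encode(["women", "unseen"], token_to_id)
print(token_to_id, matrix.shape, val_ids)
```

The resulting matrix has one row per id, so it could be used to initialize an embedding layer, with row 0 left for padding.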

This first section downloads and loads the GloVe embeddings.

In [163]:
import os
import requests
import numpy as np

def download_glove_embeddings(glove_url, download_path, embedding_dim=50):
    """
    Download the GloVe embeddings from a URL if not already downloaded.

    :param glove_url: URL to download the GloVe file
    :param download_path: Local path to store the downloaded file
    :param embedding_dim: Dimensionality of the GloVe embeddings
    """
    # Check if the file already exists
    if not os.path.exists(download_path):
        print(f"Downloading GloVe embeddings from {glove_url}...")
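        # Stream the download so the large archive is never held fully in memory
        with requests.get(glove_url, stream=True) as response:
            response.raise_for_status()  # fail early on HTTP errors
            # Save the file to the specified path, chunk by chunk
            with open(download_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=1 << 20):
                    f.write(chunk)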
        print(f"Download completed. Saved to {download_path}")
    else:
        print(f"GloVe file already exists at {download_path}")

def load_glove_embeddings(glove_file_path, embedding_dim=50):
    """
    Load GloVe embeddings from a file into a dictionary.

    :param glove_file_path: Path to the GloVe file (e.g., 'glove.6B.50d.txt')
    :param embedding_dim: Dimensionality of the embedding (default is 50)
    :return: A dictionary mapping words to their embedding vectors
    """
    embeddings = {}

    with open(glove_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Split each line into word and its embedding
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector

    return embeddings



Next, we define the function that builds the vocabulary.

In [164]:
# Initialize vocabulary with GloVe and custom embeddings
def build_vocabulary(train_tokens, glove_embeddings, embedding_dim=50):
    vocabulary = {}
    for token in set(train_tokens):
        if token in glove_embeddings:
            vocabulary[token] = glove_embeddings[token]
        else:
            # Assign random vector for OOV tokens in the train set
            vocabulary[token] = np.random.uniform(-0.5, 0.5, embedding_dim)
    return vocabulary


Then we define the [UNK] embedding for OOV tokens and the function that converts a text into a sequence of embeddings.

In [165]:
# Define the [UNK] embedding as the average of all embeddings
def define_unk_embedding(vocabulary):
    return np.mean(list(vocabulary.values()), axis=0)

# Convert tokens in text to embeddings
def text_to_embeddings(text, vocabulary, unk_embedding, embedding_dim=50):
    embeddings = []
    for token in text.split():
        token = token.lower()
        if token in vocabulary:
            embeddings.append(vocabulary[token])
        else:
            embeddings.append(unk_embedding)  # Use [UNK] embedding for unknown tokens
    return np.array(embeddings)

In [166]:
# Main code to prepare embeddings
import zipfile  # needed for the extraction step below

# Step 1: Download and extract GloVe embeddings
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
download_path = 'glove.6B.zip'
download_glove_embeddings(glove_url, download_path)

with zipfile.ZipFile(download_path, 'r') as zip_ref:
    zip_ref.extractall('glove_embeddings')

# Load GloVe embeddings (adjust the dimension as needed)
embedding_dim = 50
glove_file_path = 'glove_embeddings/glove.6B.50d.txt'
glove_embeddings = load_glove_embeddings(glove_file_path, embedding_dim)

GloVe file already exists at glove.6B.zip
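
The cell above extracts the whole archive on every run. As an alternative sketch, a single file can be parsed straight from the zip with the standard `zipfile` and `io` modules. The helper name `load_glove_from_zip` is hypothetical, and the toy archive with made-up vectors exists only so the example runs without the real download.

```python
import io
import os
import tempfile
import zipfile

import numpy as np

def load_glove_from_zip(zip_path, member_name):
    """Hypothetical helper: parse one GloVe text file directly from the archive."""
    embeddings = {}
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as raw:
            # Wrap the binary stream so lines are decoded as UTF-8 text
            for line in io.TextIOWrapper(raw, encoding='utf-8'):
                values = line.split()
                embeddings[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings

# Toy archive with made-up vectors, created in a temporary directory
toy_dir = tempfile.mkdtemp()
toy_zip = os.path.join(toy_dir, 'toy_glove.zip')
with zipfile.ZipFile(toy_zip, 'w') as archive:
    archive.writestr('toy.2d.txt', 'the 0.1 0.2\ncat 0.3 0.4\n')

toy_embeddings = load_glove_from_zip(toy_zip, 'toy.2d.txt')
print(sorted(toy_embeddings))
```

With the real archive this would presumably be called as `load_glove_from_zip('glove.6B.zip', 'glove.6B.50d.txt')`, assuming the member name matches the file extracted above.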


In [167]:
# Step 2: Tokenize and build vocabulary using only training tokens
train_tokens = [word.lower() for tweet in training_df['tweet'] for word in tweet.split()]
vocabulary = build_vocabulary(train_tokens, glove_embeddings, embedding_dim)

# Step 3: Define [UNK] embedding
unk_embedding = define_unk_embedding(vocabulary)

# Step 4: Embed the 'tweet' column in each dataframe
def embed_tweets(df, vocabulary, unk_embedding, embedding_dim):
    df['tweet_embedding'] = df['tweet'].apply(lambda tweet: text_to_embeddings(tweet, vocabulary, unk_embedding, embedding_dim))
    return df

# Apply to each dataframe
training_df = embed_tweets(training_df, vocabulary, unk_embedding, embedding_dim)
test_df = embed_tweets(test_df, vocabulary, unk_embedding, embedding_dim)
validation_df = embed_tweets(validation_df, vocabulary, unk_embedding, embedding_dim)

In [168]:
training_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1,tweet_embedding
200002,200002,en,write a uni essay in my local pub with a coffe...,1,"[[-0.06561099737882614, 0.18877999484539032, 0..."
200003,200003,en,it be 2021 not 1921. i dont appreciate that on...,1,"[[0.6118299961090088, -0.22071999311447144, -0..."
200006,200006,en,accord to a customer i have plenty of time to ...,1,"[[0.5339900255203247, 0.39184001088142395, -0...."
200007,200007,en,so only blokes drink beer? sorry but if you ar...,1,"[[0.6030799746513367, -0.32023999094963074, 0...."
200008,200008,en,new to the shelf this week - look forward to r...,0,"[[0.19511, 0.50739, 0.0014709, 0.041914, -0.16..."


In [169]:
validation_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1,tweet_embedding
400001,400001,en,you should smile more love. just pretend youre...,0,"[[-0.001091900048777461, 0.33324000239372253, ..."
400002,400002,en,she be right but the push be all in the opposi...,1,"[[0.06038200110197067, 0.37821000814437866, -0..."
400003,400003,en,some man move my suitcase in the overhead lugg...,1,"[[0.9287099838256836, -0.10834000259637833, 0...."
400004,400004,en,lol gamergate the go to boogieman maybe if the...,0,"[[-0.5428900122642517, 0.05374300107359886, -0..."
400005,400005,en,to me this have the same negativity a gamergat...,0,"[[0.6804699897766113, -0.03926299884915352, 0...."


In [170]:
test_df.head()

Unnamed: 0,id_EXIST,lang,tweet,hard_label_task1,tweet_embedding
400178,400178,en,1st day at the pool on a beautiful sunday in n...,0,"[[-0.4002000093460083, 0.188060000538826, 0.19..."
400179,400179,en,i like your outfit too except when i dress up ...,1,"[[0.11890999972820282, 0.15254999697208405, -0..."
400180,400180,en,same though!!! the angst just come and goes. l...,0,"[[0.24132999777793884, 0.3948900103569031, -0...."
400181,400181,en,fuck that cunt. try to vote her out multiple time,1,"[[-0.9193900227546692, -0.8000199794769287, -0..."
400182,400182,en,u gotta say some shit like ill fuck that cunt ...,1,"[[-0.25676, 0.8549, 1.1003, 0.95363, 0.36585, ..."


# [Task 4 - 1.0 points] Model definition

You are now tasked to define your sexism classifier.




### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.

### Token to embedding mapping

You can follow two approaches for encoding tokens in your classifier.

### Work directly with embeddings

- Compute the embedding of each input token
- Feed the mini-batches of shape (batch_size, # tokens, embedding_dim) to your model

### Work with Embedding layer

- Encode input tokens to token ids
- Define an Embedding layer as the first layer of your model
- Compute the embedding matrix of all known tokens (i.e., tokens in your vocabulary)
- Initialize the Embedding layer with the computed embedding matrix
- You are **free** to set the Embedding layer trainable or not

In [171]:
import tensorflow as tf

In [172]:
vocab_size = len(vocabulary)
embedding_dimension = embedding_dim
embedding_matrix = np.array(list(vocabulary.values()))  # one row per token, in vocabulary order
embedding = tf.keras.layers.Embedding(input_dim=vocab_size,
                                      output_dim=embedding_dimension,
                                      weights=[embedding_matrix],
                                      mask_zero=True,                   # automatically masks padding tokens
                                      name='encoder_embedding')

### Padding

Pay attention to padding tokens!

Your model **should not** be penalized on those tokens.

#### How to?

There are two main ways.

However, their implementation depends on the neural library you are using.

- Embedding layer
- Custom loss to compute average cross-entropy on non-padding tokens only

**Note**: This is a **recommendation**, but we **do not penalize** for missing workarounds.
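
As a library-independent illustration of the padding idea, the sketch below pads sequences of token ids to a fixed length and builds the boolean mask that marks real tokens. The helper name `pad_ids` and the convention that id 0 is reserved for padding are assumptions of this sketch; a masking-aware layer or a custom loss would use such a mask to ignore padded positions.

```python
import numpy as np

def pad_ids(sequences, max_len, pad_id=0):
    """Hypothetical helper: post-pad / post-truncate id sequences and build a mask."""
    batch = np.full((len(sequences), max_len), pad_id, dtype="int64")
    for row, seq in enumerate(sequences):
        trimmed = seq[:max_len]              # truncate sequences that are too long
        batch[row, :len(trimmed)] = trimmed  # real ids first, padding after them
    mask = batch != pad_id                   # True on real tokens, False on padding
    return batch, mask

batch, mask = pad_ids([[5, 6, 7], [8], [1, 2, 3, 4, 9]], max_len=4)
print(batch)
print(mask.sum(axis=1))  # number of real tokens per sequence
```

The mask is only reliable if no real token shares the padding id, which is why id 0 is kept out of the vocabulary in this sketch.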

In [173]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Ensure that each tweet embedding is padded to a consistent length
sequence_length = 100  # Define based on your needs or average embedding length in data

# Apply padding directly on the tweet_embedding column
X_train = pad_sequences(training_df['tweet_embedding'].tolist(), maxlen=sequence_length, dtype='float32', padding='post', truncating='post')
X_val = pad_sequences(validation_df['tweet_embedding'].tolist(), maxlen=sequence_length, dtype='float32', padding='post', truncating='post')

# Convert labels to numpy arrays
y_train = training_df['hard_label_task1'].values
y_val = validation_df['hard_label_task1'].values


In [174]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Bidirectional

def create_baseline_model(input_shape, lstm_units=128):
    model = Sequential()
    model.add(Bidirectional(LSTM(units=lstm_units, return_sequences=False), input_shape=input_shape))
    model.add(Dense(1, activation='sigmoid'))  # Sigmoid for binary classification
    return model


In [175]:
def create_model_1(input_shape, lstm_units=128):
    model = Sequential()
    model.add(Bidirectional(LSTM(units=lstm_units, return_sequences=True), input_shape=input_shape))  # First LSTM layer
    model.add(Bidirectional(LSTM(units=lstm_units, return_sequences=False)))  # Second LSTM layer
    model.add(Dense(1, activation='sigmoid'))
    return model


# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline and Model 1.



### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.
* Evaluate your models using macro F1-score.
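
The two evaluation ideas in the list above, macro F1 and aggregation over several seeds, can be sketched in plain NumPy. The function names `macro_f1` and `summarize_seeds` are hypothetical, and the per-seed scores are made-up placeholders, not results from this notebook.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def summarize_seeds(scores_by_seed):
    """Mean and standard deviation of a metric across seeds."""
    values = np.array(list(scores_by_seed.values()), dtype=float)
    return float(values.mean()), float(values.std())

toy_f1 = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])

# Placeholder scores for three seeds, invented for illustration only
mean_f1, std_f1 = summarize_seeds({0: 0.70, 1: 0.72, 2: 0.74})
print(f"toy macro F1 = {toy_f1:.4f}; seeds: {mean_f1:.3f} +/- {std_f1:.3f}")
```

In a Keras setup each run would typically start by fixing the seed, for example with `tf.keras.utils.set_random_seed(seed)`, before building and training the model; comparing the mean and spread across seeds gives a sturdier basis for picking the best model than a single run.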

In [176]:
# input_shape = (sequence_length, X_train.shape[2])  # Shape of each input sample: (sequence_length, embedding_dim)

# # Baseline model
# baseline_model = create_baseline_model(input_shape=input_shape, lstm_units=128)
# baseline_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# # Train the baseline model
# history_baseline = baseline_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)



In [177]:

# # Model 1
# model_1 = create_model_1(input_shape=input_shape, lstm_units=128)
# model_1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# # Train Model 1
# history_model_1 = model_1.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)

In [178]:
from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import Callback, EarlyStopping
import numpy as np

# Custom Callback to compute F1-score
class F1ScoreCallback(Callback):
    def __init__(self, val_data):
        super().__init__()  # initialize the Keras Callback base class
        self.val_data = val_data

    def on_epoch_end(self, epoch, logs=None):
        val_x, val_y = self.val_data
        val_predictions = (self.model.predict(val_x) > 0.5).astype(int)  # Binarize predictions
        f1 = f1_score(val_y, val_predictions, average="macro")
        print(f"Epoch {epoch + 1}: Macro F1-Score = {f1:.4f}")
        if logs is not None:
            logs['val_f1'] = f1  # Add F1 to logs for history tracking

# Early stopping to avoid overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)



In [179]:
input_shape = (sequence_length, X_train.shape[2])  # Shape of each input sample: (sequence_length, embedding_dim)

#### Training the baseline model on the train set

In [180]:
# Create baseline model
baseline_model = create_baseline_model(input_shape=input_shape, lstm_units=128)
baseline_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train baseline model
f1_callback = F1ScoreCallback(val_data=(X_val, y_val))
history_baseline = baseline_model.fit(X_train, y_train,
                                      validation_data=(X_val, y_val),
                                      epochs=10,
                                      batch_size=32,
                                      callbacks=[f1_callback, early_stopping])

# Evaluate baseline model
baseline_predictions = (baseline_model.predict(X_val) > 0.5).astype(int)
baseline_f1 = f1_score(y_val, baseline_predictions, average="macro")
print(f"Baseline Model - Macro F1-Score: {baseline_f1:.4f}")

Epoch 1/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 64ms/step
Epoch 1: Macro F1-Score = 0.5797
90/90 ━━━━━━━━━━━━━━━━━━━━ 4s 22ms/step - accuracy: 0.6183 - loss: 0.6532 - val_accuracy: 0.6646 - val_loss: 0.6340 - val_f1: 0.5797
Epoch 2/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
Epoch 2: Macro F1-Score = 0.7071
90/90 ━━━━━━━━━━━━━━━━━━━━ 2s 19ms/step - accuracy: 0.7049 - loss: 0.5775 - val_accuracy: 0.7342 - val_loss: 0.5555 - val_f1: 0.7071
Epoch 3/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
Epoch 3: Macro F1-Score = 0.7256
90/90 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - accuracy: 0.7325 - loss: 0.5511 - val_accuracy: 0.7468 - val_loss: 0.5506 - val_f1: 0.7256
Epoch 4/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step
Epoch 4: Macro F1-Score = 0.7295
90/90 ━━━━━━━━━━━━━━━━

#### Evaluating the model on the validation set

In [181]:
from sklearn.metrics import f1_score, classification_report

# Predict on the validation set
val_predictions = (baseline_model.predict(X_val) > 0.5).astype(int)  # Binarize predictions

# Compute macro F1-score
macro_f1 = f1_score(y_val, val_predictions, average='macro')
print(f"Macro F1-Score on Validation Set: {macro_f1:.4f}")

# Detailed metrics (optional)
print(classification_report(y_val, val_predictions, target_names=['Not Sexist', 'Sexist']))


5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Macro F1-Score on Validation Set: 0.7295
              precision    recall  f1-score   support

  Not Sexist       0.73      0.88      0.80        90
      Sexist       0.78      0.57      0.66        68

    accuracy                           0.75       158
   macro avg       0.76      0.73      0.73       158
weighted avg       0.75      0.75      0.74       158
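
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows, which makes the effect of class imbalance visible. This is a small worked check, assuming the rounded F1 values and supports printed in the report above.

```python
# Per-class F1 and support copied from the classification report above
report = {
    "Not Sexist": {"f1": 0.80, "support": 90},
    "Sexist": {"f1": 0.66, "support": 68},
}

total = sum(row["support"] for row in report.values())

# Macro: every class counts equally, regardless of how many samples it has
macro_avg = sum(row["f1"] for row in report.values()) / len(report)

# Weighted: each class counts in proportion to its support
weighted_avg = sum(row["f1"] * row["support"] for row in report.values()) / total

print(f"macro avg F1:    {macro_avg:.2f}")
print(f"weighted avg F1: {weighted_avg:.2f}")
```

The weighted average is slightly higher because the larger `Not Sexist` class also has the higher F1, which is why the assignment asks for the macro variant.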



#### Training with multiple random seeds

In [182]:
import tensorflow as tf
import numpy as np
import random

def set_seeds(seed):
    tf.random.set_seed(seed)
    np.random.seed(seed)
    random.seed(seed)


In [183]:
seeds = [42, 123, 789]  # Example seeds
f1_scores = []

for seed in seeds:
    print(f"Training with seed {seed}")
    set_seeds(seed)

    # Create and compile the baseline model
    baseline_model = create_baseline_model(input_shape=input_shape, lstm_units=128)
    baseline_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model
    baseline_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32, verbose=0)

    # Evaluate on validation set
    val_predictions = (baseline_model.predict(X_val) > 0.5).astype(int)
    macro_f1 = f1_score(y_val, val_predictions, average='macro')
    f1_scores.append(macro_f1)
    print(f"Seed {seed} - Macro F1-Score: {macro_f1:.4f}")


Training with seed 42
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step
Seed 42 - Macro F1-Score: 0.7269
Training with seed 123
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 71ms/step
Seed 123 - Macro F1-Score: 0.7353
Training with seed 789
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 63ms/step
Seed 789 - Macro F1-Score: 0.7457


#### Pick the best performing model

In [184]:
best_seed = seeds[np.argmax(f1_scores)]
best_f1 = max(f1_scores)
print(f"Best Seed: {best_seed} with Macro F1-Score: {best_f1:.4f}")


Best Seed: 789 with Macro F1-Score: 0.7457
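
Besides picking the best seed, the per-seed scores can be summarized as a mean and standard deviation, which is what the robust evaluation asks for. A minimal sketch using only the standard library, with the scores copied from the run output above (the dictionary name `f1_by_seed` is an assumption):

```python
import statistics

# Macro F1 per seed, copied from the run output above
f1_by_seed = {42: 0.7269, 123: 0.7353, 789: 0.7457}

scores = list(f1_by_seed.values())
mean_f1 = statistics.mean(scores)
std_f1 = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

print(f"Macro F1 over {len(scores)} seeds: {mean_f1:.4f} +/- {std_f1:.4f}")
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`), so it would report a slightly smaller spread than `statistics.stdev` on the same three values; state which one you use in the report.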


#### Evaluate best performing model

In [185]:
set_seeds(best_seed)
best_model = create_baseline_model(input_shape=input_shape, lstm_units=128)
best_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Retrain the best model
best_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)

# Final Evaluation
final_val_predictions = (best_model.predict(X_val) > 0.5).astype(int)
final_macro_f1 = f1_score(y_val, final_val_predictions, average='macro')
print(f"Final Model (Seed {best_seed}) - Macro F1-Score: {final_macro_f1:.4f}")


Epoch 1/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 5s 24ms/step - accuracy: 0.6287 - loss: 0.6462 - val_accuracy: 0.7089 - val_loss: 0.6135
Epoch 2/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 2s 19ms/step - accuracy: 0.7102 - loss: 0.5815 - val_accuracy: 0.7025 - val_loss: 0.5926
Epoch 3/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - accuracy: 0.7275 - loss: 0.5579 - val_accuracy: 0.7532 - val_loss: 0.5366
Epoch 4/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 1s 15ms/step - accuracy: 0.7349 - loss: 0.5363 - val_accuracy: 0.7342 - val_loss: 0.5340
Epoch 5/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - accuracy: 0.7547 - loss: 0.5180 - val_accuracy: 0.7405 - val_loss: 0.5226
Epoch 6/10
90/90 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - accuracy: 0.7649 - loss: 0.4969 - val_accuracy: 0.7278 - val_loss: 0.5347
Epoch 7/10
90/90 ━━━━

# [Task 6 - 1.0 points] Transformers

In this section, you will use a transformer model specifically trained for hate speech detection, namely [Twitter-roBERTa-base for Hate Speech Detection](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate).




### Relevant Material
- Tutorial 3

### Instructions
1. **Load the Tokenizer and Model**

2. **Preprocess the Dataset**:
   You will need to preprocess your dataset to prepare it for input into the model. Tokenize your text data using the appropriate tokenizer and ensure it is formatted correctly.

   **Note**: Feed the model's tokenizer the plain tweet text produced by the initial cleaning step, not the token sequences you built for the earlier models. The transformer's own tokenizer must perform the tokenization.

3. **Train the Model**:
   Use the `Trainer` to train the model on your training data.

4. **Evaluate the Model on the Test Set** using F1-macro.
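
Step 2 relies on padding and truncation so that every example has the same length. The sketch below illustrates what fixed-length padding with an attention mask looks like. It is a simplified stand-in, not the Hugging Face tokenizer: the function name, the token ids and the `pad_id` value are all assumptions, and real tokenizers also preserve special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return fixed-length input_ids and the matching attention_mask."""
    ids = token_ids[:max_length]          # truncation: drop tokens past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask marks padding positions the model should ignore
    return ids + [pad_id] * padding, mask + [0] * padding


# Toy token ids and pad id are assumptions, not real tokenizer output
input_ids, attention_mask = pad_or_truncate([0, 713, 16, 2], max_length=6, pad_id=1)
print(input_ids)
print(attention_mask)
```

With `padding="max_length"` every example is padded to the same `max_length`, which keeps batching simple at the cost of extra computation on short tweets.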

#### 6.1. Tokenization

In [186]:
from transformers import AutoTokenizer

model_card = 'cardiffnlp/twitter-roberta-base-hate'
tokenizer = AutoTokenizer.from_pretrained(model_card)

#### 6.2. Model definition

In [187]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_card, num_labels=2)

#### 6.3. Preprocess the Dataset

In [188]:
from datasets import Dataset
import torch

# Convert DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(training_df[['tweet', 'hard_label_task1']])
val_dataset = Dataset.from_pandas(validation_df[['tweet', 'hard_label_task1']])
test_dataset = Dataset.from_pandas(test_df[['tweet', 'hard_label_task1']])

# Tokenize datasets
def tokenize_function(examples):
    return tokenizer(examples['tweet'], padding="max_length", truncation=True, max_length=128)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

In [189]:
print(tokenized_train)
print(tokenized_test)
print(tokenized_val)

Dataset({
    features: ['tweet', 'hard_label_task1', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 2870
})
Dataset({
    features: ['tweet', 'hard_label_task1', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 286
})
Dataset({
    features: ['tweet', 'hard_label_task1', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 158
})


In [190]:
# Prepare datasets for PyTorch
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "hard_label_task1"])
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "hard_label_task1"])
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask", "hard_label_task1"])

In [191]:
# Separate features and labels
train_features = {
    'input_ids': tokenized_train['input_ids'],
    'attention_mask': tokenized_train['attention_mask']
}
train_labels = tokenized_train['hard_label_task1']

val_features = {
    'input_ids': tokenized_val['input_ids'],
    'attention_mask': tokenized_val['attention_mask']
}
val_labels = tokenized_val['hard_label_task1']

#### 6.4. Train the Model

In [192]:
from sklearn.metrics import f1_score, accuracy_score

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)

    f1 = f1_score(y_pred=predictions, y_true=labels, average='macro')
    acc = accuracy_score(y_pred=predictions, y_true=labels)
    return {'f1': f1, 'acc': acc}

In [193]:
import evaluate

acc_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(output_info):
    predictions, labels = output_info
    predictions = np.argmax(predictions, axis=-1)

    f1 = f1_metric.compute(predictions=predictions, references=labels, average='macro')
    acc = acc_metric.compute(predictions=predictions, references=labels)
    return {**f1, **acc}


In [197]:
tokenized_train = tokenized_train.rename_column("hard_label_task1", "labels")
tokenized_val = tokenized_val.rename_column("hard_label_task1", "labels")
tokenized_test = tokenized_test.rename_column("hard_label_task1", "labels")

In [198]:
from transformers import Trainer, TrainingArguments


training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,           # Limit stored checkpoints (the best one is always retained)
    load_best_model_at_end=True,  # Load the best model at the end
    metric_for_best_model="f1",
    logging_dir="./logs",
    logging_steps=10
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [199]:
# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss | F1 | Accuracy |
|---|---|---|---|---|
| 1 | 0.4339 | 0.339937 | 0.855552 | 0.860759 |
| 2 | 0.2173 | 0.360852 | 0.874932 | 0.879747 |
| 3 | 0.1114 | 0.445718 | 0.875554 | 0.879747 |


TrainOutput(global_step=540, training_loss=0.2728812508009098, metrics={'train_runtime': 250.4608, 'train_samples_per_second': 34.377, 'train_steps_per_second': 2.156, 'total_flos': 566346546662400.0, 'train_loss': 0.2728812508009098, 'epoch': 3.0})
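
The reported `global_step=540` can be sanity-checked from the dataset size and the training arguments. This is a small worked calculation, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

num_train_examples = 2870  # rows in tokenized_train
batch_size = 16            # per_device_train_batch_size
num_epochs = 3             # num_train_epochs

# One optimizer step per batch; the final, smaller batch still counts as a step
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

If the number printed by the trainer differs from this estimate, one of the assumptions (device count, accumulation, dropped last batch) does not hold for that run.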

#### 6.5. Evaluate the model

In [200]:
# Evaluate on the test set
test_results = trainer.evaluate(eval_dataset=tokenized_test)

print("Test Results:")
print(test_results)

Test Results:
{'eval_loss': 0.5188692808151245, 'eval_f1': 0.823482940798894, 'eval_accuracy': 0.8251748251748252, 'eval_runtime': 1.7811, 'eval_samples_per_second': 160.578, 'eval_steps_per_second': 10.106, 'epoch': 3.0}


# [Task 7 - 0.5 points] Error Analysis

### Instructions

After evaluating the model, perform a brief error analysis:

 - Review the results and identify common errors.

 - Summarize your findings regarding the errors and their impact on performance (e.g., but not limited to, Out-of-Vocabulary (OOV) words, data imbalance, and performance differences between the custom model and the transformer).
 - Suggest possible solutions to address the identified errors.
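
One possible starting point for the error analysis is to count the four outcome types and collect the misclassified tweets for manual inspection. The sketch below uses only the standard library; the helper names and the toy data are hypothetical, and labels are assumed to be 0 (not sexist) and 1 (sexist).

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def misclassified(texts, y_true, y_pred):
    """Return (text, true_label, predicted_label) for every wrong prediction."""
    return [(text, t, p) for text, t, p in zip(texts, y_true, y_pred) if t != p]


# Toy data (assumed): replace with the real tweets, labels and predictions
texts = ["tweet a", "tweet b", "tweet c", "tweet d"]
y_true = [0, 1, 1, 0]
y_pred = [0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
print("TN:", counts[(0, 0)], "FP:", counts[(0, 1)],
      "FN:", counts[(1, 0)], "TP:", counts[(1, 1)])

errors = misclassified(texts, y_true, y_pred)
for text, true_label, predicted_label in errors:
    print(f"true={true_label} pred={predicted_label}: {text}")
```

Reading the false negatives and false positives separately usually makes recurring patterns easier to spot than looking at aggregate scores alone.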



# [Task 8 - 0.5 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.


# Submission

* **Submit** your report in PDF format.
* **Submit** your Python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc.
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Execution Order

You are **free** to address tasks in any order (if multiple orderings are available).

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).
However, you are **free** to play with their hyper-parameters.


### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc.).

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Robust Evaluation

Each model is trained with at least 3 random seeds.

Task 4 requires you to compute the average performance over the 3 seeds and its corresponding standard deviation.

### Model Selection for Analysis

To carry out the error analysis you are **free** to either

* Pick examples or perform comparisons with an individual seed run model (e.g., Baseline seed 1337)
* Perform ensembling via, for instance, majority voting to obtain a single model.
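
The majority-voting option can be sketched as follows. This is an illustrative implementation, not a required one: the function name `majority_vote` and the per-seed predictions are hypothetical, and an odd number of models with binary labels is assumed so that ties cannot occur.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by per-sample majority."""
    ensembled = []
    for votes in zip(*predictions_per_model):  # votes for one sample across models
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled


# Hypothetical hard predictions of the same model trained with three seeds
preds_by_seed = {
    42: [0, 1, 1, 0],
    123: [0, 1, 0, 0],
    789: [1, 1, 0, 0],
}

ensemble = majority_vote(list(preds_by_seed.values()))
print(ensemble)
```

The ensembled predictions can then be scored with macro F1 in the same way as the predictions of any single run.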

### Error Analysis

Some topics for discussion include:
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.
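
As a sketch of the first topic, a precision/recall curve is obtained by sweeping the decision threshold over the predicted probabilities. The example below is plain Python with hypothetical scores; returning a precision of 1.0 when nothing is predicted positive is a convention chosen here, not a fixed rule.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall of the positive class at a given threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when no positives are predicted
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and predicted probabilities of the "sexist" class
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.4, 0.7, 0.9]

curve = [(t, *precision_recall_at(y_true, scores, t)) for t in (0.3, 0.5, 0.8)]
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold trades recall for precision, which helps when discussing whether the fixed 0.5 cut-off suits an imbalanced dataset.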

### Bonus Points
Bonus points are arbitrarily assigned based on significant contributions such as:
- Outstanding error analysis
- Masterclass code organization
- Suitable extensions

Note that bonus points are only assigned if all task points are attributed (i.e., 6/6).

**Possible Extensions/Explorations for Bonus Points:**
- **Try other preprocessing strategies**: e.g., but not limited to, explore techniques tailored specifically for tweets or methods that are common in social media text.
- **Experiment with other custom architectures or models from HuggingFace**
- **Explore Spanish tweets**: e.g., but not limited to, leverage multilingual models to process Spanish tweets and assess their performance compared to monolingual models.







# The End