# TO DO

- Weight samples to counteract imbalance
- Error analysis

# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Human Value Detection, Multi-label classification, Transformers, BERT


# Contact

For any doubts, questions, or issues, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

You are tasked with addressing the [Human Value Detection challenge](https://aclanthology.org/2022.acl-long.306/).

## Problem definition

Arguments are paired with their conveyed human values.

Arguments are in the form of **premise** $\rightarrow$ **conclusion**.

### Example:

**Premise**: *"fast food should be banned because it is really bad for your health and is costly"*

**Conclusion**: *"We should ban fast food"*

**Stance**: *in favor of*

<center>
    <img src="images/human_values.png" alt="human values" />
</center>

### 0.1 Imports

Calling `enable_custom_widget_manager()` lets the notebook render custom widgets on Colab. To use Plotly's `FigureWidget` there, we also downgrade Plotly to version 5.10 for compatibility reasons.

In [66]:
import sys

if 'google.colab' in sys.modules:
  from google.colab import output
  output.enable_custom_widget_manager()

  !pip install plotly==5.10



In [None]:
import pandas as pd
import os, random
import urllib.request
from tqdm import tqdm
from IPython.display import display

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import plot
import plotly.offline as pyo

import numpy as np

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

from transformers import AutoTokenizer, AutoModel, logging

In [None]:
def set_reproducibility(seed = 42) -> None:
    random.seed(seed)
    torch.manual_seed(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_reproducibility()
device = (
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using {device} device")

# [Task 1 - 0.5 points] Corpus

Check the official page of the challenge [here](https://touche.webis.de/semeval23/touche23-web/).

The challenge offers several corpora for evaluation and testing.

You are going to work with the standard training, validation, and test splits.

#### Arguments
* arguments-training.tsv
* arguments-validation.tsv
* arguments-test.tsv

#### Human values
* labels-training.tsv
* labels-validation.tsv
* labels-test.tsv

### Example

#### arguments-*.tsv
```

Argument ID    A01005

Conclusion     We should ban fast food

Stance         in favor of

Premise        fast food should be banned because it is really bad for your health and is costly.
```

#### labels-*.tsv

```
Argument ID                A01005

Self-direction: thought    0
Self-direction: action     0
...
Universalism: objectivity  0
```

### Splits

The standard splits contain

   * **Train**: 5393 arguments
   * **Validation**: 1896 arguments
   * **Test**: 1576 arguments

### Annotations

In this assignment, you are tasked with addressing a multi-label classification problem.

You are going to consider **level 3** categories:

* Openness to change
* Self-enhancement
* Conservation
* Self-transcendence

**How to do that?**

You have to merge (**logical OR**) annotations of level 2 categories belonging to the same level 3 category.

**Pay attention to shared level 2 categories** (e.g., Hedonism). $\rightarrow$ [see Table 1 in the original paper.](https://aclanthology.org/2022.acl-long.306/)

#### Example

```
Self-direction: thought:    0
Self-direction: action:     1
Stimulation:                0
Hedonism:                   1

Openness to change          1
```
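
As a minimal sketch of the merge described above, the logical OR over a group of level 2 columns can be computed row by row with pandas. The two rows below are made-up values used only for illustration, and `openness_to_change` is a hypothetical variable name.

```python
import pandas as pd

# Toy level 2 annotations for two arguments (made-up values, not taken from the corpus).
level_2 = pd.DataFrame({
    "Self-direction: thought": [0, 0],
    "Self-direction: action": [1, 0],
    "Stimulation": [0, 0],
    "Hedonism": [1, 0],
})

# Logical OR across the level 2 columns: 1 if at least one of them is 1, else 0.
openness_to_change = level_2.any(axis=1).astype(int)
print(openness_to_change.tolist())
```

The first row reproduces the example above: two of the four level 2 categories are active, so the level 3 category is set to 1.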

### Instructions

* **Download** the specified training, validation, and test files.
* **Encode** split files into a pandas.DataFrame object.
* For each split, **merge** the arguments and labels dataframes into a single dataframe.
* **Merge** level 2 annotations to level 3 categories.
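
The argument/label merge step could look roughly like the sketch below. It assumes both tables share an `Argument ID` column, as in the file examples above. The rows are made up, and the `validate="one_to_one"` check is an optional safeguard added here, not something the assignment requires.

```python
import pandas as pd

# Toy stand-ins for one split (made-up rows; the real files have more columns).
arguments = pd.DataFrame({
    "Argument ID": ["A01002", "A01005"],
    "Conclusion": ["We should ban human cloning", "We should ban fast food"],
})
labels = pd.DataFrame({
    "Argument ID": ["A01005", "A01002"],  # deliberately in a different order
    "Hedonism": [0, 1],
})

# Join on the shared key; validate raises if an ID is duplicated on either side.
merged = pd.merge(arguments, labels, on="Argument ID", how="inner", validate="one_to_one")

# Every argument should have found its label row.
assert len(merged) == len(arguments)
print(merged)
```

Joining on the key rather than concatenating column-wise keeps the result correct even when the two files list arguments in a different order.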

### 1.0 Variables and functions

In [69]:
files = [
    "arguments-training.tsv",
    "arguments-validation.tsv",
    "arguments-test.tsv",
    "labels-training.tsv",
    "labels-validation.tsv",
    "labels-test.tsv"
]

def showcase_dict_dataframes(dfs_dict: dict, n: int = 2) -> None:
    """
    Prints information about DataFrames in a dictionary.

    Args:
        dfs_dict (dict): Dictionary containing DataFrames.
        n (int): Numbers of rows to show. Default 2.
    Returns:
        None
    """
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)
    for name, dataframe in dfs_dict.items():
        print(f"DataFrame Name: {name} | Shape: {dataframe.shape}")
        display(dataframe.head(n))
        print()

def configure_plotly_browser_state():
    """
    Configures Plotly to display graphs in Colab.
    """
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))

### 1.1 Download corpus

In [70]:
def download_corpus(files: list, path: str = "corpus") -> None:
    """
    Downloads corpus files from Zenodo.

    Args:
        files (list): List of file names to download.
        path (str, optional): Path to establish where to download the files. Defaults to "corpus".

    Returns:
        None
    """
    print("- Starting download...\n")
    if not os.path.exists(path):
        os.makedirs(path)

    for file_name in files:
        file_path = os.path.join(path, file_name)
        if os.path.exists(file_path):
            print(f"\t@ {file_name} already exists. Skipping download.\n")
        else:
            download_link = f"https://zenodo.org/records/10564870/files/{file_name}?download=1"
            print(f"\t@ Downloading {file_name}...")
            urllib.request.urlretrieve(download_link, file_path)
            print(f"\t@ {file_name} downloaded successfully!\n")

    print("- All files downloaded successfully.")

In [71]:

download_corpus(files)

- Starting download...

	@ arguments-training.tsv already exists. Skipping download.

	@ arguments-validation.tsv already exists. Skipping download.

	@ arguments-test.tsv already exists. Skipping download.

	@ labels-training.tsv already exists. Skipping download.

	@ labels-validation.tsv already exists. Skipping download.

	@ labels-test.tsv already exists. Skipping download.

- All files downloaded successfully.


### 1.2 Encode into a dataframe

In [72]:
def files_to_dataframe(files: list, path: str = "corpus") -> dict:
    """
    Reads multiple files into DataFrames and returns a dictionary.

    Args:
        files (list): List of file names to be read.
        path (str, optional): Path to the directory containing the files. Defaults to "corpus".

    Returns:
        dict: Dictionary containing DataFrames, where keys are file names and values are DataFrames.
    """
    dfs_dict = {}

    for file in files:
        file_path = os.path.join(path, file)
        dfs_dict[file] = pd.read_csv(file_path, sep='\t', header=0)

    return dfs_dict

In [73]:
dfs_dict_1 = files_to_dataframe(files)

In [74]:
showcase_dict_dataframes(dfs_dict_1)

DataFrame Name: arguments-training.tsv | Shape: (5393, 4)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise
0,A01002,We should ban human cloning,in favor of,we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same.
1,A01005,We should ban fast food,in favor of,fast food should be banned because it is really bad for your health and is costly.



DataFrame Name: arguments-validation.tsv | Shape: (1896, 4)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise
0,A01001,Entrapment should be legalized,in favor of,"if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal?"
1,A01012,The use of public defenders should be mandatory,in favor of,the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't



DataFrame Name: arguments-test.tsv | Shape: (1576, 4)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise
0,A26004,We should end affirmative action,against,affirmative action helps with employment equity.
1,A26010,We should end affirmative action,in favor of,affirmative action can be considered discriminatory against poor whites



DataFrame Name: labels-training.tsv | Shape: (5393, 21)


Unnamed: 0,Argument ID,Self-direction: thought,Self-direction: action,Stimulation,Hedonism,Achievement,Power: dominance,Power: resources,Face,Security: personal,Security: societal,Tradition,Conformity: rules,Conformity: interpersonal,Humility,Benevolence: caring,Benevolence: dependability,Universalism: concern,Universalism: nature,Universalism: tolerance,Universalism: objectivity
0,A01002,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,A01005,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0



DataFrame Name: labels-validation.tsv | Shape: (1896, 21)


Unnamed: 0,Argument ID,Self-direction: thought,Self-direction: action,Stimulation,Hedonism,Achievement,Power: dominance,Power: resources,Face,Security: personal,Security: societal,Tradition,Conformity: rules,Conformity: interpersonal,Humility,Benevolence: caring,Benevolence: dependability,Universalism: concern,Universalism: nature,Universalism: tolerance,Universalism: objectivity
0,A01001,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,A01012,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0



DataFrame Name: labels-test.tsv | Shape: (1576, 21)


Unnamed: 0,Argument ID,Self-direction: thought,Self-direction: action,Stimulation,Hedonism,Achievement,Power: dominance,Power: resources,Face,Security: personal,Security: societal,Tradition,Conformity: rules,Conformity: interpersonal,Humility,Benevolence: caring,Benevolence: dependability,Universalism: concern,Universalism: nature,Universalism: tolerance,Universalism: objectivity
0,A26004,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0
1,A26010,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1





### 1.3 Merge arguments and labels

In [75]:
def merge_arguments_labels(dfs_dict: dict) -> dict:
    """
    Merges arguments and labels DataFrames for each split and returns a new dictionary.

    Args:
        dfs_dict (dict): Dictionary containing DataFrames.

    Returns:
        dict: Dictionary containing merged DataFrames, where keys are split names and values are merged DataFrames.
    """
    merged_dfs_dict = {}

    for split in ["training", "validation", "test"]:
        merged_dfs_dict[split] = pd.merge(dfs_dict[f"arguments-{split}.tsv"], dfs_dict[ f"labels-{split}.tsv"], on="Argument ID")

    return merged_dfs_dict

In [76]:
dfs_dict_2 = merge_arguments_labels(dfs_dict_1)

In [77]:
showcase_dict_dataframes(dfs_dict_2)

DataFrame Name: training | Shape: (5393, 24)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Self-direction: thought,Self-direction: action,Stimulation,Hedonism,Achievement,Power: dominance,Power: resources,Face,Security: personal,Security: societal,Tradition,Conformity: rules,Conformity: interpersonal,Humility,Benevolence: caring,Benevolence: dependability,Universalism: concern,Universalism: nature,Universalism: tolerance,Universalism: objectivity
0,A01002,We should ban human cloning,in favor of,we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same.,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,A01005,We should ban fast food,in favor of,fast food should be banned because it is really bad for your health and is costly.,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0



DataFrame Name: validation | Shape: (1896, 24)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Self-direction: thought,Self-direction: action,Stimulation,Hedonism,Achievement,Power: dominance,Power: resources,Face,Security: personal,Security: societal,Tradition,Conformity: rules,Conformity: interpersonal,Humility,Benevolence: caring,Benevolence: dependability,Universalism: concern,Universalism: nature,Universalism: tolerance,Universalism: objectivity
0,A01001,Entrapment should be legalized,in favor of,"if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal?",0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,A01012,The use of public defenders should be mandatory,in favor of,the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0



DataFrame Name: test | Shape: (1576, 24)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Self-direction: thought,Self-direction: action,Stimulation,Hedonism,Achievement,Power: dominance,Power: resources,Face,Security: personal,Security: societal,Tradition,Conformity: rules,Conformity: interpersonal,Humility,Benevolence: caring,Benevolence: dependability,Universalism: concern,Universalism: nature,Universalism: tolerance,Universalism: objectivity
0,A26004,We should end affirmative action,against,affirmative action helps with employment equity.,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0
1,A26010,We should end affirmative action,in favor of,affirmative action can be considered discriminatory against poor whites,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1





### 1.4 Merge level 2 annotations to level 3 categories

In [78]:
def merge_subcategories(dfs_dict: dict, level_2_to_level_3: dict) -> dict:
    """
    Performs the aggregation of level 2 categories into level 3 categories for each DataFrame in the input dictionary.

    Args:
        dfs_dict (dict): Dictionary containing DataFrames.
        level_2_to_level_3 (dict): Dictionary mapping each level 3 category to the names of the first and last level 2 columns of its (contiguous) column range.

    Returns:
        dict: Dictionary containing DataFrames with level 3 categories.
    """
    level3_dfs_dict = {}

    first_column, last_column = list(level_2_to_level_3.values())[0][0], list(level_2_to_level_3.values())[-1][1]

    for name, df in dfs_dict.items():
        new_df = df.copy()
        for level_3_category, (start_column, end_column) in level_2_to_level_3.items():
            # Logical OR over the level 2 columns in the range (vectorized)
            new_df[level_3_category] = new_df.loc[:, start_column:end_column].any(axis=1).astype(int)

        new_df.drop(df.loc[:, first_column:last_column].columns, axis=1, inplace=True)

        level3_dfs_dict[name] = new_df

    return level3_dfs_dict

# Maps each level 3 category to the first and last level 2 columns of its range.
# The mapping is not unique: ranges overlap on the shared level 2 categories (Hedonism, Face, Humility).
level_2_to_level_3 = {
    "Openness to change": ["Self-direction: thought", "Hedonism"],
    "Self-enhancement": ["Hedonism", "Face"],
    "Conservation": ["Face", "Humility"],
    "Self-transcendence": ["Humility", "Universalism: objectivity"]
}

In [79]:
dfs_dict_3 = merge_subcategories(dfs_dict_2, level_2_to_level_3)

In [80]:
showcase_dict_dataframes(dfs_dict_3)

DataFrame Name: training | Shape: (5393, 8)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Openness to change,Self-enhancement,Conservation,Self-transcendence
0,A01002,We should ban human cloning,in favor of,we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same.,0,0,1,0
1,A01005,We should ban fast food,in favor of,fast food should be banned because it is really bad for your health and is costly.,0,0,1,0



DataFrame Name: validation | Shape: (1896, 8)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Openness to change,Self-enhancement,Conservation,Self-transcendence
0,A01001,Entrapment should be legalized,in favor of,"if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal?",0,0,1,0
1,A01012,The use of public defenders should be mandatory,in favor of,the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't,0,0,0,1



DataFrame Name: test | Shape: (1576, 8)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Openness to change,Self-enhancement,Conservation,Self-transcendence
0,A26004,We should end affirmative action,against,affirmative action helps with employment equity.,0,1,1,1
1,A26010,We should end affirmative action,in favor of,affirmative action can be considered discriminatory against poor whites,0,1,0,1





### 1.5 Data analysis

#### 1.5.1 Level three categories

In [81]:
def pie_plot(dfs_dict: dict, keys: list, title: str) -> None:
    """
    Plots the distribution of a set of keys for each split. Used to visualize keys distributions.

    Args:
        dfs_dict (dict): Dictionary containing DataFrames.
        keys (list): List of keys to plot.
        title (str): Title of the plot.

    Returns:
        None
    """
    subplots = make_subplots(rows=1, cols=len(dfs_dict), specs=[[{"type": "pie"}] * len(dfs_dict)],
                        subplot_titles=list(dfs_dict.keys()))
    fig = go.FigureWidget(subplots)

    for i, (name, df) in enumerate(dfs_dict.items()):
        fig.add_trace(go.Pie(labels=keys, values=[df[key].sum() for key in keys],
                              marker=dict(colors=px.colors.qualitative.Set1), hole=.3),
                      row=1, col=i + 1)

    fig.update_layout(title_text=title, legend=dict(traceorder='reversed'))
    fig.show()

def bar_plot(dfs_dict: dict, keys: list, title: str) -> None:
    """
    Plots the distribution of a set of keys for each split. Used to visualize values distributions.

    Args:
        dfs_dict (dict): Dictionary containing DataFrames.
        keys (list): List of keys to plot.
        title (str): Title of the plot.

    Returns:
        None
    """
    subplots = make_subplots(rows=1, cols=len(dfs_dict), subplot_titles=list(dfs_dict.keys()))
    fig = go.FigureWidget(subplots)

    for i, (name, df) in enumerate(dfs_dict.items()):
        values_list = [df[key].value_counts().sort_index() for key in keys]
        x_labels = list(values_list[0].index)

        for j, values in enumerate(values_list):
                fig.add_trace(go.Bar(x=x_labels, y=values.values, name=keys[j],
                                      showlegend=False if keys[j] in [k['name'] for k in fig.data[:]] else True,
                                      marker_color=px.colors.qualitative.Set1[j]),
                               row=1, col=i + 1)

    fig.update_layout(title_text=title, barmode='group')
    fig.show()

def heatmap_plot(dfs_dict: dict, keys: list, group_column: str, title: str, colorscale: str, rows: bool = False) -> None:
    """
    Plots a heatmap showing the occurrence of pairs of values in DataFrames, along with the percentages.

    Args:
        dfs_dict (dict): Dictionary containing DataFrames.
        keys (list): List of keys to plot.
        group_column (str): Column to group by for counting occurrences.
        title (str): Title of the plot.
        colorscale (str): The colorscale for the heatmap.
        rows (bool): If True, arrange subplots in rows; otherwise, arrange in columns. Defaults to False.

    Returns:
        None
    """
    n_rows, n_cols = (len(dfs_dict), 1) if rows else (1, len(dfs_dict))

    subplots = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=list(dfs_dict.keys()))
    fig = go.FigureWidget(subplots)

    for i, (name, df) in enumerate(dfs_dict.items()):
        if group_column:
            # Select the label columns before summing so text columns are never aggregated
            df_new = df.groupby(group_column)[keys].sum()
            df_new = df_new.div(df_new.sum(axis=0), axis=1)
            x_values = df_new.columns.tolist()
            y_values = df_new.index.tolist()
        else:
            # Compute the co-occurrence matrix between the values in keys
            co_occurrence_matrix = df[keys].T.dot(df[keys])
            np.fill_diagonal(co_occurrence_matrix.values, 0)  # Exclude diagonal values
            df_new = co_occurrence_matrix.div(co_occurrence_matrix.sum(axis=0), axis=1)
            x_values = keys
            y_values = keys

        values = [[f'{value:.3f}' for value in row] for row in df_new.values.tolist()]

        fig.add_trace(go.Heatmap(z=df_new.values.tolist(),
                                 x=x_values,
                                 y=y_values,
                                 colorscale=colorscale,
                                 zmin=0,
                                 zmax=1,
                                 showlegend=False,
                                 text=values,
                                 texttemplate="%{text}",
                                 textfont={"size": 12}),
                      row=i // n_cols + 1, col=i % n_cols + 1)

    fig.update_layout(title_text=title)
    fig.show()

In [82]:
level_3_labels = list(dfs_dict_3['training'].columns[4:])

pie_plot(dfs_dict_3, level_3_labels, "Level 3 Categories human values distribution")

In [83]:
bar_plot(dfs_dict_3, level_3_labels, "Level 3 Categories human values distribution")

The dataset appears to be imbalanced: arguments most commonly concern human values such as `Conservation` and `Self-transcendence`, while `Openness to change` and `Self-enhancement` are less represented.
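
One possible way to counteract such an imbalance is to weight the positive class of each category by its negative-to-positive ratio. The sketch below computes those ratios on a made-up label matrix; `pos_weight` is a hypothetical variable name. A vector like this could, for instance, be passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`, though that is only one option and not part of this notebook.

```python
import numpy as np

# Toy binary label matrix: 6 samples x 4 categories (made-up values).
labels = np.array([
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

positives = labels.sum(axis=0)           # positive samples per category
negatives = labels.shape[0] - positives  # negative samples per category

# Rare categories get a weight above 1, frequent ones a weight below 1.
# np.maximum guards against division by zero for categories with no positives.
pos_weight = negatives / np.maximum(positives, 1)
print(pos_weight)
```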

#### 1.5.2 Co-occurrences

In [84]:
heatmap_plot(dfs_dict=dfs_dict_3, keys=level_3_labels, group_column='Stance', title="Stance and human values co-occurrence", colorscale = 'RdBu', rows=True)

The distribution of different `Stance` values is balanced throughout the datasets, meaning that people tend to have different perspectives on arguments. This suggests that considering this field in our models' input may not improve their discrimination capability.
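
To make the normalization behind this heatmap concrete, here is a small sketch on made-up data. It mirrors the approach of summing label columns per `Stance` value and dividing each column by its total, so that every column sums to 1. The variable names `counts` and `shares` are illustrative.

```python
import pandas as pd

# Toy split with a Stance column and two label columns (made-up values).
df = pd.DataFrame({
    "Stance": ["in favor of", "against", "in favor of", "against"],
    "Conservation": [1, 1, 0, 1],
    "Self-transcendence": [1, 0, 1, 0],
})
keys = ["Conservation", "Self-transcendence"]

# Number of positive annotations per stance value and category.
counts = df.groupby("Stance")[keys].sum()

# Share of each category's positives that falls under each stance value.
shares = counts.div(counts.sum(axis=0), axis=1)
print(shares)
```

A column whose shares are close to 0.5 for both stance values indicates that the stance carries little information about that category.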

In [85]:
heatmap_plot(dfs_dict=dfs_dict_3, keys=level_3_labels, group_column=None, title="Values co-occurrence", colorscale = 'Aggrnyl',rows=True)

`Self-transcendence` and `Conservation` co-occur often, even though `Humility`, the level 2 category they share, is rarely observed in the datasets. This suggests that there could be a hidden dependence in the samples.
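
For reference, a co-occurrence count between label columns can be obtained as the product of the transposed label matrix with itself. The sketch below uses made-up labels and hypothetical variable names. It zeroes the diagonal on an explicit copy, which is an assumption made here to stay safe with pandas versions where `.values` may be read-only.

```python
import numpy as np
import pandas as pd

# Toy binary labels for four arguments (made-up values).
labels = pd.DataFrame({
    "Conservation": [1, 1, 0, 1],
    "Self-transcendence": [1, 1, 1, 0],
    "Openness to change": [0, 1, 0, 0],
})

# Entry (i, j) counts the arguments annotated with both category i and category j.
co_occurrence = labels.T.dot(labels)

# Zero the diagonal (a category trivially co-occurs with itself) on a copy.
matrix = co_occurrence.to_numpy(copy=True)
np.fill_diagonal(matrix, 0)
off_diagonal = pd.DataFrame(matrix, index=co_occurrence.index, columns=co_occurrence.columns)
print(off_diagonal)
```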

### 1.6 Data pre-processing

In [86]:
def preprocess_dataframes(dfs_dict: dict) -> dict:
    """
    Preprocesses each dataframe in the given dictionary and returns a new dictionary with preprocessed dataframes.

    Args:
        dfs_dict (dict): Dictionary containing DataFrames.

    Returns:
        dict: Dictionary containing preprocessed DataFrames.
    """
    preprocessed_dfs = {}

    for name, df in dfs_dict.items():
        preprocessed_df = df.copy()

        # Assign the result back instead of using a chained in-place replace
        preprocessed_df["Stance"] = preprocessed_df["Stance"].replace({'in favor of': 1, 'against': 0})

        preprocessed_dfs[name] = preprocessed_df

    return preprocessed_dfs

In [87]:
dfs_dict_4 = preprocess_dataframes(dfs_dict_3)

In [88]:
showcase_dict_dataframes(dfs_dict_4)

DataFrame Name: training | Shape: (5393, 8)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Openness to change,Self-enhancement,Conservation,Self-transcendence
0,A01002,We should ban human cloning,1,we should ban human cloning as it will only cause huge issues when you have a bunch of the same humans running around all acting the same.,0,0,1,0
1,A01005,We should ban fast food,1,fast food should be banned because it is really bad for your health and is costly.,0,0,1,0



DataFrame Name: validation | Shape: (1896, 8)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Openness to change,Self-enhancement,Conservation,Self-transcendence
0,A01001,Entrapment should be legalized,1,"if entrapment can serve to more easily capture wanted criminals, then why shouldn't it be legal?",0,0,1,0
1,A01012,The use of public defenders should be mandatory,1,the use of public defenders should be mandatory because some people don't have money for a lawyer and this would help those that don't,0,0,0,1



DataFrame Name: test | Shape: (1576, 8)


Unnamed: 0,Argument ID,Conclusion,Stance,Premise,Openness to change,Self-enhancement,Conservation,Self-transcendence
0,A26004,We should end affirmative action,0,affirmative action helps with employment equity.,0,1,1,1
1,A26010,We should end affirmative action,1,affirmative action can be considered discriminatory against poor whites,0,1,0,1





# [Task 2 - 2.0 points] Model definition

You are tasked with defining several neural models for multi-label classification.

<center>
    <img src="images/model_schema.png" alt="model_schema" />
</center>

### Instructions

* **Baseline**: implement a random uniform classifier (an individual classifier per category).
* **Baseline**: implement a majority classifier (an individual classifier per category).

<br/>

* **BERT w/ C**: define a BERT-based classifier that receives an argument **conclusion** as input.
* **BERT w/ CP**: add argument **premise** as an additional input.
* **BERT w/ CPS**: add argument premise-to-conclusion **stance** as an additional input.

### Notes

**Do not mix models**. Each model has its own instructions.

You are **free** to select the BERT-based model card from huggingface.

#### Examples

```
bert-base-uncased
prajjwal1/bert-tiny
distilbert-base-uncased
roberta-base
```

### 2.1 Model and data preparation

See [here](https://visualstudiomagazine.com/articles/2020/12/15/pytorch-network.aspx).

We have chosen the `prajjwal1/bert-tiny` model due to its lightweight nature, which makes it suitable for resource-constrained environments; we then employ the `AutoTokenizer` to automatically load the tokenizer associated with the chosen model.

In [89]:
model_name = 'prajjwal1/bert-tiny'
tokenizer = AutoTokenizer.from_pretrained(model_name)

`PyTorch` lets us create a custom dataset for our files by subclassing `Dataset` and implementing three methods: `__init__`, `__len__`, and `__getitem__`.

In [90]:
class HumanValuesDataset(Dataset):
    def __init__(self, dataframe: pd.DataFrame, tokenizer, text_columns: dict, numerical_columns: list, label_columns: list):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.text_columns = text_columns
        self.numerical_columns = numerical_columns
        self.label_columns = label_columns

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> dict:
        row = {}
        labels = torch.tensor(self.data.iloc[idx][self.label_columns].values.astype(np.float32))

        # Text
        for col, max_length in self.text_columns.items():
            text = str(self.data.iloc[idx][col])
            encoding = self.tokenizer(text,
                                       truncation=True,
                                       max_length=max_length,
                                       padding='max_length',
                                       return_tensors='pt')
            row[col.lower()] = {
                'input_ids': encoding['input_ids'].squeeze(0),
                'attention_mask': encoding['attention_mask'].squeeze(0),
                'token_type_ids': encoding["token_type_ids"].squeeze(0)
            }

        # Numerical
        for col in self.numerical_columns:
            # Positional access, consistent with the iloc-based lookups above
            row[col.lower()] = torch.tensor([self.data[col].iloc[idx]])

        row['labels'] = labels
        return row

Finally, we load the datasets using the `DataLoader` class, which iterates through a dataset as needed. Each iteration returns a batch of `batch_size=32` samples together with their labels. Because we specify `shuffle=True`, the data is reshuffled once all batches have been iterated over.
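
The batching and seeded shuffling behaviour can be checked on a tiny stand-in dataset. The sketch below is illustrative only: the `TensorDataset` contents and the helper `first_epoch_order` are made up for the example, and the batch size of 4 is chosen so that the last, smaller batch is visible.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten toy samples (features are just their own index) with dummy 4-dimensional labels.
dataset = TensorDataset(torch.arange(10).unsqueeze(1), torch.zeros(10, 4))

def first_epoch_order(seed: int) -> list:
    """Returns the sample indices of each batch in the first epoch for a given seed."""
    g = torch.Generator()
    g.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=g)
    return [features.squeeze(1).tolist() for features, _ in loader]

order_a = first_epoch_order(42)
order_b = first_epoch_order(42)

# Same seed, same shuffling; the last batch holds the 2 leftover samples.
print(order_a == order_b, [len(batch) for batch in order_a])
```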

From `HuggingFace` we learn that the maximum sequence length that our model can handle is `512` tokens.
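
Since this limit is expressed in tokens rather than characters, one option is to measure lengths after tokenization. The sketch below is a hedged illustration: `longest_token_length` is a hypothetical helper, and `toy_tokenize` is a whitespace stand-in used so that the example runs without downloading a model.

```python
from typing import Callable, Iterable, List

def longest_token_length(texts: Iterable[str],
                         tokenize: Callable[[str], List],
                         limit: int = 512) -> int:
    """Returns the longest tokenized length among texts, capped at limit."""
    longest = max(len(tokenize(text)) for text in texts)
    return min(longest, limit)

def toy_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption): whitespace split plus two special tokens.
    return ["[CLS]"] + text.split() + ["[SEP]"]

texts = ["We should ban fast food", "We should ban human cloning"]
print(longest_token_length(texts, toy_tokenize))

# With a Hugging Face tokenizer, a callable such as
# `lambda t: tokenizer(t)["input_ids"]` could be passed instead.
```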

In [91]:
numerical_columns = ['Stance']

text_columns_names = ['Conclusion', 'Premise']
text_columns = {col:0 for col in text_columns_names}

for col in text_columns.keys():
    idx =  dfs_dict_4['training'][col].str.len().idxmax()
    txt = dfs_dict_4['training'][col][idx]
    text_columns[col] = len(txt)

    if text_columns[col] < 512:
        print(f"Longest '{col}' in training split:\n\t'{txt}'\n\tLength:{text_columns[col]}\n")
    else:
        print(f"Longest '{col}' in training split:\n\t'{txt}'\n\tLength:{text_columns[col]} (Truncated to 512)\n")
        text_columns[col] = 512


Longest 'Conclusion' in training split:
	'The best way to save the world from climate change and protect the environment is to encourage everyone to start with themselves and look at what things they can do to help with this problem'
	Length:190

Longest 'Premise' in training split:
	'According to the United Nations Convention on the rights of people with disabilities, the European Union “shall closely consult with and actively involve persons with disabilities” on political decisions that concern them. Meanwhile the “European Strategy for the Rights of Persons with Disabilities 2021-2030” hardly mentioned people with intellectual and neurological disabilities.  Now the Conference for the Future of Europe wants to be an inclusive citizen consultation but is still not making it a priority to include citizens with Trisomy 21, autism, or members of other neurodivergent communities. Ableism keeps getting perpetuated in the EU and it needs to stop.  We want more representation! We want peop

In [92]:
train_dataset = HumanValuesDataset(dfs_dict_4['training'], tokenizer, text_columns=text_columns,\
                                      numerical_columns=numerical_columns,\
                                      label_columns=level_3_labels)

val_dataset = HumanValuesDataset(dfs_dict_4['validation'], tokenizer, text_columns=text_columns,\
                                        numerical_columns=numerical_columns,\
                                        label_columns=level_3_labels)

test_dataset = HumanValuesDataset(dfs_dict_4['test'], tokenizer, text_columns=text_columns,\
                                        numerical_columns=numerical_columns,\
                                        label_columns=level_3_labels)

In [93]:
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

batch_size = 32
g = torch.Generator()
g.manual_seed(42)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)

val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)

test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, worker_init_fn=seed_worker, generator=g)

Let us explore a sample to ensure the correctness of our methodology.

In [94]:
x = next(iter(train_loader))

for item, item_value in x.items():
    print(f"{item}:")
    try:
        for el, value in item_value.items():
            print(f"\t{el} [shape: {value.shape}]")
    except AttributeError:  # plain tensors (stance, labels) have no .items()
        print(f"\tshape: {item_value.shape}")
    print()

conclusion:
	input_ids [shape: torch.Size([32, 190])]
	attention_mask [shape: torch.Size([32, 190])]
	token_type_ids [shape: torch.Size([32, 190])]

premise:
	input_ids [shape: torch.Size([32, 512])]
	attention_mask [shape: torch.Size([32, 512])]
	token_type_ids [shape: torch.Size([32, 512])]

stance:
	shape: torch.Size([32, 1])

labels:
	shape: torch.Size([32, 4])



### 2.2 Baselines

We are now going to define the two requested baselines, respectively a `random uniform classifier` and a `majority classifier`.

In [95]:
class RandomUniformClassifier(nn.Module):
    def __init__(self, num_labels):
        """
        Initializes a new instance of the RandomUniformClassifier class.

        Args:
            num_labels (int): The number of labels/classes in the classification task.

        Returns:
            None
        """
        super(RandomUniformClassifier, self).__init__()
        self.num_labels = num_labels

    def forward(self, x):
        """
        Defines the forward pass of the random uniform classifier.

        Args:
            x (torch.Tensor): Input tensor representing the features.

        Returns:
            torch.Tensor: A tensor representing the randomly generated predictions for each sample in the input batch.
        """
        return torch.randint(0, 2, (x['conclusion']['input_ids'].shape[0], self.num_labels)).float()

In [96]:
class MajorityClassifier(nn.Module):
    def __init__(self):
        """
        Initializes a new instance of the MajorityClassifier class.

        Returns:
            None
        """
        super(MajorityClassifier, self).__init__()

    def forward(self, x):
        """
        Defines the forward pass of the majority classifier.

        Args:
            x (torch.Tensor): Input tensor representing the features.

        Returns:
            torch.Tensor: A tensor representing the majority labels repeated for each sample in the batch.
        """
        return self.results.repeat(x['conclusion']['input_ids'].size(0), 1)

    def fit(self, dataset):
        """
        Fits the majority classifier to the given dataset.

        Args:
            dataset (torch.utils.data.Dataset): The dataset containing labeled samples.

        Returns:
            None
        """
        labels = torch.stack([sample['labels'] for sample in dataset])

        self.results = (torch.count_nonzero(labels, dim=0) > len(dataset) / 2).float()

### 2.3 BERT models

We are now going to define the BERT models according to what the task asks.

### Input concatenation

<center>
    <img src="images/input_merging.png" alt="Input merging" />
</center>

In [97]:
logging.set_verbosity_error()

class BERTModule(nn.Module):
    def __init__(self, model_name, num_labels):
        super(BERTModule, self).__init__()
        self.model = AutoModel.from_pretrained(model_name)

    def forward(self, x):
        return self.model(input_ids=x['input_ids'], \
                    attention_mask=x['attention_mask'], \
                    token_type_ids=x['token_type_ids']).pooler_output

class ClassificationHead(nn.Module):
    def __init__(self, input_size, num_labels):
        super(ClassificationHead, self).__init__()
        self.dropout = torch.nn.Dropout(p=0.2)
        self.fc = nn.Linear(input_size, num_labels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.dropout(x)
        x = self.fc(x)
        x = self.sigmoid(x)
        return x

### BERT w/ C

<center>
    <img src="images/bert_c.png" alt="BERT w/ C" />
</center>

In [98]:
class CClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super(CClassifier, self).__init__()
        self.conclusion_module = BERTModule(model_name, num_labels)

        size = self.conclusion_module.model.config.hidden_size
        self.head = ClassificationHead(size, num_labels)

    def forward(self, x):
        h = self.conclusion_module(x['conclusion'])
        y = self.head(h)

        return y

### BERT w/ CP

<center>
    <img src="images/bert_cp.png" alt="BERT w/ CP" />
</center>

In [99]:
class CPClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super(CPClassifier, self).__init__()
        self.conclusion_module = BERTModule(model_name, num_labels)
        self.premise_module = BERTModule(model_name, num_labels)

        size = self.premise_module.model.config.hidden_size + \
                self.conclusion_module.model.config.hidden_size
        self.head = ClassificationHead(size, num_labels)

    def forward(self, x):
        h_1 = self.conclusion_module(x['conclusion'])
        h_2 = self.premise_module(x['premise'])
        y = self.head(torch.cat((h_1, h_2), dim=-1))

        return y

### BERT w/ CPS

<center>
    <img src="images/bert_cps.png" alt="BERT w/ CPS" />
</center>

In [100]:
class CPSClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super(CPSClassifier, self).__init__()
        self.conclusion_module = BERTModule(model_name, num_labels)
        self.premise_module = BERTModule(model_name, num_labels)

        size = self.premise_module.model.config.hidden_size + \
                self.conclusion_module.model.config.hidden_size + 1
        self.head = ClassificationHead(size, num_labels)

    def forward(self, x):
        h_1 = self.conclusion_module(x['conclusion'])
        h_2 = self.premise_module(x['premise'])
        y = self.head(torch.cat((h_1, h_2, x['stance']), dim=-1))

        return y

### Notes

The **stance** input has to be encoded into a numerical format.

You **should** use the same model instance to encode **premise** and **conclusion** inputs.
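As an illustration of the first note, the stance can be mapped to a single float before being concatenated with the encoder outputs. The sketch below is only one possible encoding: the mapping `STANCE_TO_FLOAT`, the helper `encode_stance` and the exact stance strings are assumptions, and should be adapted to the values actually found in the dataset.

```python
# Hypothetical mapping: adapt the keys to the stance strings found in the dataset.
STANCE_TO_FLOAT = {
    "in favor of": 1.0,
    "in favour of": 1.0,
    "against": 0.0,
}

def encode_stance(stance: str) -> float:
    """Map a stance string to a float, ignoring case and surrounding spaces."""
    key = stance.strip().lower()
    if key not in STANCE_TO_FLOAT:
        raise ValueError(f"Unknown stance: {stance!r}")
    return STANCE_TO_FLOAT[key]

encoded = [encode_stance(s) for s in ["in favor of", "against", " In favour of "]]
print(encoded)
```

Raising on unknown values, rather than silently defaulting to one class, makes unexpected spellings visible early.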

# [Task 3 - 0.5 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using per-category binary F1-score.
* Compute the average binary F1-score over all categories (macro F1-score).

### Example

You start with individual predictions ($\rightarrow$ samples).

```
Openness to change:   0 0 1 0 1 1 0 ...
Self-enhancement:     1 0 0 0 1 0 1 ...
Conservation:         0 0 0 1 1 0 1 ...
Self-transcendence:   1 1 0 1 0 1 0 ...
```

You compute per-category binary F1-score.

```
Openness to change F1:   0.35
Self-enhancement F1:     0.55
Conservation F1:         0.80
Self-transcendence F1:   0.21
```

You then average per-category scores.
```
Average F1: ~0.48
```

### 3.1 Metrics

We opted to use `sklearn.metrics.f1_score`.

Its `average` parameter accepts a `binary` value, which computes the score for the class specified by `pos_label`; on binary targets, this gives the binary F1-score of a single category. On multi-label targets such as ours, passing `average=None` returns the binary F1-score of each category.

The `macro` value, instead, calculates the metric for each label and returns their unweighted mean, so every category counts equally, regardless of label imbalance.
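As a worked example, the sketch below recomputes the per-category binary F1-score and the macro F1-score in plain Python on made-up predictions. The helper `binary_f1` and the toy matrices are illustrative assumptions; in the notebook itself `sklearn.metrics.f1_score` does this work.

```python
def binary_f1(y_true, y_pred):
    """Binary F1 for one category; returns 0.0 when there is nothing to score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data: rows are samples, columns are the four categories.
y_true = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 1, 1]]
y_pred = [[1, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]

n_categories = len(y_true[0])
per_category = [
    binary_f1([row[i] for row in y_true], [row[i] for row in y_pred])
    for i in range(n_categories)
]
macro_f1 = sum(per_category) / n_categories

print("Per-category F1:", [round(score, 3) for score in per_category])
print("Macro F1:", round(macro_f1, 3))
```

The second category is never predicted correctly, so its F1 of 0 pulls the macro average down as much as any other category would.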

In [101]:
from sklearn.metrics import f1_score

Our data analysis revealed a highly imbalanced dataset. While various techniques address this issue, calculating class weights remains a prevalent and effective approach.

By emphasizing the error (loss values) for under-represented classes, we encourage the model to focus on learning these classes effectively. This is achieved by computing class weights based on the **inverse frequency** of each class.

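Given that the loss function $\mathcal{L}$, the weighted binary cross-entropy computed by `nn.BCELoss` with its default mean reduction, is defined as:

$$ \mathcal{L} = -\frac{1}{NK} \sum_{n=1}^{N} \sum_{i=1}^{K} w_i \left[ t_n^i \log(y_n^i) + (1 - t_n^i) \log(1 - y_n^i) \right] $$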

where:

* $N$ is the total number of samples.
* $K$ is the total number of classes.
* $w_i$ is the weight for class $i$.
* $t_n^i$ is the true label of sample $n$ for class $i$ (either 0 or 1).
* $y_n^i$ is the predicted probability of sample $n$ belonging to class $i$.
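To make the role of $w_i$ concrete, here is a minimal plain-Python sketch of a weighted binary cross-entropy averaged over all (sample, class) entries, which is what `nn.BCELoss(weight=...)` computes with its default mean reduction. The function name `weighted_bce` and the toy values are assumptions, and the sketch expects probabilities strictly between 0 and 1.

```python
import math

def weighted_bce(y_pred, y_true, weights):
    """Mean over all (sample, class) entries of -w_i * [t*log(y) + (1-t)*log(1-y)]."""
    total, count = 0.0, 0
    for probs, targets in zip(y_pred, y_true):
        for w, y, t in zip(weights, probs, targets):
            total -= w * (t * math.log(y) + (1 - t) * math.log(1 - y))
            count += 1
    return total / count

# Toy batch: two samples, two classes.
y_true = [[1, 0], [0, 1]]
y_pred = [[0.9, 0.2], [0.4, 0.6]]

unweighted = weighted_bce(y_pred, y_true, [1.0, 1.0])
reweighted = weighted_bce(y_pred, y_true, [2.0, 0.5])
print(f"unweighted: {unweighted:.4f}, class 0 up-weighted: {reweighted:.4f}")
```

Raising the weight of a class scales every error made on that class, so the optimiser has more to gain from fixing it.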

Classes with higher weights contribute more significantly to the overall loss. To counteract the model's bias towards frequent classes, we calculate class weights inversely proportional to the class frequencies:

$$ w_i = \frac{N}{K \sum_{n=1}^{N} t_n^i} $$

This formula ensures that classes with fewer samples receive higher weights, balancing the influence of each class on the training process.
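The formula can be checked by hand on a tiny label matrix. The sketch below is dependency-free and uses made-up labels; `inverse_frequency_weights` is a hypothetical helper name, and it assumes that every class has at least one positive sample (otherwise the division fails).

```python
def inverse_frequency_weights(labels):
    """Return w_i = N / (K * number of positive samples of class i)."""
    n_samples = len(labels)
    n_classes = len(labels[0])
    positives = [sum(row[i] for row in labels) for i in range(n_classes)]
    return [n_samples / (n_classes * count) for count in positives]

# Toy labels: class 0 appears once, class 1 appears in every sample.
labels = [[1, 1], [0, 1], [0, 1], [0, 1]]
weights = inverse_frequency_weights(labels)
print(weights)
```

With $N = 4$ and $K = 2$, the rare class gets $4 / (2 \cdot 1) = 2$ and the frequent one $4 / (2 \cdot 4) = 0.5$.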

In [102]:
def compute_weights(dataset):
    """
    Compute class weights based on the inverse of class frequencies in the dataset.

    Args:
      dataset (Dataset): The dataset containing samples with 'labels' attribute.

    Returns:
      torch.Tensor: A tensor containing the computed class weights.
    """
    n_samples = len(dataset)  # Total number of samples
    n_classes = len(dataset[0]['labels'])  # Total number of classes

    # Count the number of samples for each class
    n_samples_per_class = torch.count_nonzero(dataset[:]['labels'], dim=0)

    # Compute weights
    weights = torch.stack([n_samples / (n_classes * n_samples_j) for n_samples_j in n_samples_per_class])

    return weights

class_weights = compute_weights(train_dataset)
print("Weights:", class_weights)

Weights: tensor([0.6813, 0.5417, 0.3283, 0.3284])


# [Task 4 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate **all** defined models.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Pick **at least** three seeds for robust estimation.
* Compute metrics on the validation set.
* Report **per-category** and **macro** F1-score for comparison.

In [103]:
def move_to_device(data, device: str):
    """
    Move data to the specified device.

    Args:
      data (Union[dict, torch.Tensor, Any]): Input data to be moved to the device.
      device (str): The device to move the data to (e.g., 'cuda' or 'cpu').

    Returns:
      Union[dict, torch.Tensor, Any]: Data moved to the specified device.
    """
    if isinstance(data, dict):
        return {key: move_to_device(value, device) for key, value in data.items()}
    elif isinstance(data, torch.Tensor):
        return data.to(device)
    else:
        return data

def evaluate(model: nn.Module, loader: DataLoader, criterion: nn.Module, device: str, seed: int = 42):
    """
    Evaluate the model on the given loader using the specified criterion.

    Args:
      model (nn.Module): The model to evaluate.
      loader (DataLoader): The data loader for evaluation.
      criterion (nn.Module): The criterion used for evaluation.
      device (str): The device to run the evaluation on (e.g., 'cuda' or 'cpu').
      seed (int): Random seed for reproducibility.

    Returns:
      Tuple[float, float, np.ndarray, List[np.ndarray], List[np.ndarray], List[np.ndarray]]: A tuple containing the loss, macro F1 score, per-category F1 scores, ground truth labels, predicted labels, and prediction scores.
    """
    set_reproducibility(seed)

    model.to(device)
    model.eval()

    criterion.to(device)

    total_loss = 0.0
    with torch.no_grad():
        gts, preds, scores = [], [], []
        for batch in loader:
            batch = move_to_device(batch, device)

            outputs = model(batch)
            loss = criterion(outputs.to(device), batch['labels'])

            total_loss += loss.item()

            gts.extend(batch['labels'].cpu().numpy())
            preds.extend(outputs.cpu().detach().numpy() > 0.5)
            scores.extend(outputs.cpu().detach().numpy())

    loss = total_loss / len(loader)
    f1 = f1_score(gts, preds, average='macro')
    per_category_f1 = f1_score(gts, preds, average=None)

    return loss, f1, per_category_f1, gts, preds, scores

In [104]:
def train(model: nn.Module, train_loader: DataLoader, criterion: nn.Module, optimizer: torch.optim.Optimizer, device: str, epochs: int, seed: int, val_loader = None, verbose: bool = True):
    """
    Train the given model using the specified criterion and optimizer.

    Args:
      model (nn.Module): The model to train.
      train_loader (DataLoader): The data loader for training.
      criterion (nn.Module): The criterion used for training.
      optimizer (torch.optim.Optimizer): The optimizer used for training.
      device (str): The device to run the training on (e.g., 'cuda' or 'cpu').
      epochs (int): The number of epochs for training.
      seed (int): Random seed for reproducibility.
      val_loader (Optional[DataLoader]): The data loader for validation (default: None).
      verbose (bool): Whether to print training progress (default: True).

    Returns:
      Tuple[float, nn.Module]: A tuple containing the best F1 score and the best model.
    """
    set_reproducibility(seed)
    model.to(device)
    criterion.to(device)

    best_f1, best_epoch, best_model = -1, None, None
    train_losses, train_f1_scores = [], []
    val_losses, val_f1_scores = [], []

    save_path = os.path.join('checkpoints', f'{model.__class__.__name__}', str(seed))
    os.makedirs(save_path, exist_ok=True)

    subplots = make_subplots(rows=1, cols=2, subplot_titles=('Loss', 'F1 Score'))
    fig = go.FigureWidget(subplots)
    fig.update_layout(title_text=f'{model.__class__.__name__} - Seed [{seed}]')

    display(fig)

    # Training
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        gts, preds = [], []

        tqdm_loader = tqdm(train_loader, desc=f'Epoch {epoch + 1}/{epochs}', leave=False)
        for batch_idx, batch in enumerate(tqdm_loader):
            batch = move_to_device(batch, device)

            # Train step
            optimizer.zero_grad()

            outputs = model(batch)
            loss = criterion(outputs.to(device), batch['labels'])

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

            gts.extend(batch['labels'].cpu().numpy())
            preds.extend(outputs.cpu().detach().numpy() > 0.5)

            tqdm_loader.set_postfix({'loss': running_loss / (batch_idx + 1)})

        # Compute F1 score for training set
        train_loss = running_loss / len(train_loader)
        f1_train = f1_score(gts, preds, average='macro')

        train_losses.append(train_loss)
        train_f1_scores.append(f1_train)

        # Validation
        val_loss, f1_val, _, _, _, _ = evaluate(model, val_loader, criterion, device, seed = seed)

        val_losses.append(val_loss)
        val_f1_scores.append(f1_val)

        # Check if current F1 score is the best seen so far
        if f1_val > best_f1:
            best_f1 = f1_val
            best_epoch = epoch + 1
            torch.save(model.state_dict(), os.path.join(save_path, 'best_model.pth'))

        train_loss_trace = go.Scatter(x=np.arange(len(train_losses)) + 1, y=train_losses, mode='lines+markers', name='Train Loss', line=dict(color='blue'), showlegend=True if epoch==0 else False)
        val_loss_trace = go.Scatter(x=np.arange(len(val_losses)) + 1, y=val_losses, mode='lines+markers', name='Validation Loss', line=dict(color='red'), showlegend=True if epoch==0 else False)
        train_f1_trace = go.Scatter(x=np.arange(len(train_f1_scores)) + 1, y=train_f1_scores, mode='lines+markers', name='Train F1 Score', line=dict(color='green'), showlegend=True if epoch==0 else False)
        val_f1_trace = go.Scatter(x=np.arange(len(val_f1_scores)) + 1, y=val_f1_scores, mode='lines+markers', name='Validation F1 Score', line=dict(color='orange'), showlegend=True if epoch==0 else False)

        fig.add_trace(train_loss_trace, row=1, col=1)
        fig.add_trace(val_loss_trace, row=1, col=1)
        fig.add_trace(train_f1_trace, row=1, col=2)
        fig.add_trace(val_f1_trace, row=1, col=2)

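    # Reload the best checkpoint so that the returned model matches best_f1
    if best_epoch is not None:
        model.load_state_dict(torch.load(os.path.join(save_path, 'best_model.pth'), map_location=device))
        best_model = model

    print(f'Saved model with best F1 score ({best_f1:.3f} at epoch {best_epoch}) - {save_path}\n')

    return best_f1, best_model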

def train_models(model_class: nn.Module, seeds: list, model_name: str, level_3_labels: list, train_loader: DataLoader, val_loader: DataLoader, device: str, class_weights: torch.Tensor = class_weights, epochs: int = 15):
    """
    Train multiple models with different random seeds and return the best models.

    Args:
      model_class (Type[nn.Module]): The class of the model to train.
      seeds (List[int]): List of random seeds for reproducibility.
      model_name (str): Name of the model.
      level_3_labels (List[str]): List of class labels.
      train_loader (DataLoader): The data loader for training.
      val_loader (DataLoader): The data loader for validation.
      device (str): The device to run the training on (e.g., 'cuda' or 'cpu').
      class_weights (torch.Tensor): Class weights for training (default: class_weights).
      epochs (int): Number of epochs for training (default: 15).

    Returns:
      None: The best F1 score of each seed is collected, and their average is printed. The best checkpoint of each seed is saved to disk by `train`.
    """
    best_models = {}
    best_model = None

    for seed in seeds:
        best_models[seed] = {}
        model = model_class(model_name, len(level_3_labels))

        # Freeze layers except for head and dense layers
        for name, param in model.named_parameters():
            if not ("head" in name or "dense" in name):
                param.requires_grad = False

        criterion = nn.BCELoss(weight=class_weights)
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

        # Train the model
        best_f1, best_model = train(model, train_loader, criterion, optimizer, device, epochs=epochs, seed=seed, val_loader=val_loader)

        best_models[seed] = {'f1': best_f1, 'model': best_model}

    # Calculate and print the average F1 score over all seeds
    average_best_f1 = sum(result['f1'] for result in best_models.values()) / len(best_models)
    print("Average of best F1 scores:", average_best_f1)

In [105]:
seeds = [7, 42, 99]

In [41]:
train_models(CClassifier, seeds, model_name, level_3_labels, train_loader, val_loader, device)




Saved model with best F1 score (0.731 at epoch 12) - checkpoints/CClassifier/7






Saved model with best F1 score (0.686 at epoch 12) - checkpoints/CClassifier/42






Saved model with best F1 score (0.686 at epoch 14) - checkpoints/CClassifier/99

Average of best F1 scores: 0.7010425675262626


In [42]:
train_models(CPClassifier, seeds, model_name, level_3_labels, train_loader, val_loader, device)




Saved model with best F1 score (0.727 at epoch 8) - checkpoints/CPClassifier/7






Saved model with best F1 score (0.739 at epoch 12) - checkpoints/CPClassifier/42






Saved model with best F1 score (0.731 at epoch 2) - checkpoints/CPClassifier/99

Average of best F1 scores: 0.7320822783228161


In [106]:
train_models(CPSClassifier, seeds, model_name, level_3_labels, train_loader, val_loader, device)




Saved model with best F1 score (0.728 at epoch 11) - checkpoints/CPSClassifier/7






Saved model with best F1 score (0.725 at epoch 6) - checkpoints/CPSClassifier/42






Saved model with best F1 score (0.751 at epoch 9) - checkpoints/CPSClassifier/99

Average of best F1 scores: 0.7348913669756808
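
To put the differences between models in perspective, the per-seed scores printed above can be summarised as mean and standard deviation. The sketch below uses the rounded values from the training logs (seeds 7, 42 and 99, in this order), so its means can differ from the printed averages in the third decimal; the dictionary `best_val_f1` is just a manual transcription of those logs.

```python
from statistics import mean, stdev

# Best validation macro F1 per seed, as rounded in the training logs above.
best_val_f1 = {
    "CClassifier": [0.731, 0.686, 0.686],
    "CPClassifier": [0.727, 0.739, 0.731],
    "CPSClassifier": [0.728, 0.725, 0.751],
}

# Sample standard deviation across seeds gives a rough idea of seed variance.
summary = {name: (mean(scores), stdev(scores)) for name, scores in best_val_f1.items()}

for name, (avg, std) in summary.items():
    print(f"{name}: {avg:.3f} +/- {std:.3f}")
```

With only three seeds the standard deviation is a coarse estimate, but it helps to judge whether a gap between two models is larger than the seed-to-seed variation.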


We will now evaluate the saved checkpoints, first on the validation set and then on the test set, to choose the best model to use for error analysis.

In [44]:
def evaluation_charts(plots: dict, rows: int = 2, cols: int = 2):
    """
    Generate evaluation charts for the given plots.

    Args:
      plots (dict): A dictionary mapping an identifier (e.g. the seed) to the data of one plot.
                    Plot data should be in the format {'title': str, 'labels': [...], 'data': [...]}.
      rows (int): Number of rows in the subplot grid.
      cols (int): Number of columns in the subplot grid.

    Returns:
      None
    """
    subplots = make_subplots(rows=rows, cols=cols, subplot_titles=[plot['title'] for plot in plots.values()])
    fig = go.FigureWidget(subplots)

    row, col = 1, 1
    for title, plot_data in plots.items():
        colors = ['rgb(228, 26, 28)', 'rgb(55, 126, 184)', 'rgb(77, 175, 74)', 'rgb(152, 78, 163)', 'rgb(255, 127, 0)']

        data = go.Bar(
            x=plot_data['labels'],
            y=plot_data['data'],
            marker=dict(color=colors)
        )

        fig.add_trace(data, row=row, col=col)
        fig.update_yaxes(title_text="F1 Score", row=row, col=col)

        for i, val in enumerate(plot_data['data']):
            fig.add_annotation(
                x=plot_data['labels'][i],
                y=val,
                text=str(round(val, 3)),
                showarrow=False,
                font=dict(color='white', size=11),
                xanchor='center',
                yanchor='middle',
                row=row,
                col=col,
                yshift=-10
            )

        col += 1
        if col > cols:
            col = 1
            row += 1

    fig.update_layout(title="Evaluation Results", showlegend=False, margin=dict(l=60, r=60, t=60, b=40))
    pyo.iplot(fig)

In [45]:
def evaluate_models(model_class, seeds: list, model_name: str, level_3_labels: list, loader: DataLoader, device: str, class_weights=class_weights, rows: int = 2, cols: int = 2):
    """
    Evaluate models with given parameters and visualize evaluation results.

    Args:
      model_class: The class of the model to evaluate.
      seeds (list): List of seed values for reproducibility.
      model_name (str): Name of the model.
      level_3_labels (list): List of level 3 labels.
      loader: Data loader for evaluation.
      device: Device to run the evaluation on (e.g., 'cuda' or 'cpu').
      class_weights: Weights for balancing class distribution.
      rows (int): Number of rows in the evaluation charts subplot.
      cols (int): Number of columns in the evaluation charts subplot.

    Returns:
      dict: Dictionary containing evaluation results including the best model, its F1 score, and average F1 scores.
    """
    plots, results = {}, {}
    best_model = None
    best_f1, avg_loss, avg_f1 = -1, 0, 0
    avg_per_category_f1 = [0] * len(level_3_labels)

    for seed in seeds:
        if model_class == RandomUniformClassifier:
            model = model_class(len(level_3_labels))
        elif model_class == MajorityClassifier:
            model = model_class()
            model.fit(train_dataset)
        else:
            model = model_class(model_name, len(level_3_labels))
            checkpoint_path = os.path.join('checkpoints', f'{model.__class__.__name__}', str(seed))
            model.load_state_dict(torch.load(os.path.join(checkpoint_path, 'best_model.pth'), map_location=device), strict=False)

        criterion = nn.BCELoss(weight=class_weights)

        # Evaluate the model
        _, f1, per_category_f1, _, _, _ = evaluate(model, loader, criterion, device, seed)

        # Update best model if current model has higher average F1 score
        if f1 > best_f1:
            best_model = model
            best_f1 = f1
            best_seed = seed

        avg_f1 += f1
        avg_per_category_f1 = [x + y for x, y in zip(avg_per_category_f1, per_category_f1)]

        plots[seed] = {'title': f"Seed {seed} Evaluation Results", 'labels': level_3_labels + ["Macro"], 'data': list(per_category_f1) + [f1]}

    # Calculate average F1 score, and per-category F1 score
    avg_f1 /= len(seeds)
    avg_per_category_f1 = [x / len(seeds) for x in avg_per_category_f1]

    if rows > 1 or cols > 1:
        plots['Average'] = {'title': "Average Evaluation Results", 'labels': level_3_labels + ["Macro"], 'data': list(avg_per_category_f1) + [avg_f1]}

    evaluation_charts(plots, rows, cols)

    # Prepare results dictionary
    results = {
        'model_class': model_class.__name__,
        'best_model': best_model,
        'best_f1': best_f1,
        'best_seed': best_seed,
        'avg_f1': avg_f1,
        'avg_per_category_f1': avg_per_category_f1
    }

    return results


In [107]:
evaluations_val = []

In [108]:
evaluations_val.append(evaluate_models(CClassifier, seeds, model_name, level_3_labels, val_loader, device))

In [109]:
evaluations_val.append(evaluate_models(CPClassifier, seeds, model_name, level_3_labels, val_loader, device))

In [110]:
evaluations_val.append(evaluate_models(CPSClassifier, seeds, model_name, level_3_labels, val_loader, device))

In [None]:
best_val_model = max(evaluations_val, key=lambda x: x['avg_f1'])

print(f"The best performing model on the validation set is {best_val_model['model_class']}.")
print(f"\tMacro F1 score [{best_val_model['avg_f1']:.3f}]")
print(f"\tBest seed: {best_val_model['best_seed']}.")

The best performing model is the third one (`CPSClassifier`), which includes `Stance` among its inputs; its average F1 score is only slightly higher than that of the `CP` model.

The similar performance of the `CP` model suggests that the difference could be attributed to chance, confirming our intuition that the `Stance` values, being equally distributed among samples, do not contribute to improving the discriminative performance.

The ability to correctly predict `Openness to Change` contributes the most to the score differences, followed by `Self-Enhancement`; these are in fact the least represented classes. We will investigate further in the following section.

# [Task 5 - 1.0 points] Error Analysis

You are tasked to discuss your results.

### Instructions

* **Compare** classification performance of BERT-based models with respect to baselines.
* Discuss **difference in prediction** between the best performing BERT-based model and its variants.

### Notes

You can check the [original paper](https://aclanthology.org/2022.acl-long.306/) for suggestions on how to perform comparisons (e.g., plots, tables, etc...).

## 5.1 Best model on test set

First of all, we are going to pick the model that performs best on the test set.

In [112]:
evaluations_test = []

In [113]:
evaluations_test.append(evaluate_models(CClassifier, seeds, model_name, level_3_labels, test_loader, device))

In [114]:
evaluations_test.append(evaluate_models(CPClassifier, seeds, model_name, level_3_labels, test_loader, device))

In [115]:
evaluations_test.append(evaluate_models(CPSClassifier, seeds, model_name, level_3_labels, test_loader, device))

In [116]:
best_test_model = max(evaluations_test, key=lambda x: x['avg_f1'])

print(f"The best performing model on the test set is {best_test_model['model_class']}.")
print(f"\tMacro F1 score [{best_test_model['avg_f1']:.3f}]")
print(f"\tBest seed: {best_test_model['best_seed']}.")

The best performing model on the test set is CPSClassifier.
	Macro F1 score [0.698]
	Best seed: 99.


The classifier leveraging `Conclusion`, `Premise` and `Stance` performs slightly better on the test set as well. This is not surprising, as the two sets come from the same distribution.

## 5.2 Comparison with baselines

Now let's see how the baselines perform. This time, we will evaluate them using only the best seed, and employ `sklearn.metrics.classification_report`, `sklearn.metrics.precision_recall_curve` and `sklearn.metrics.average_precision_score`.

In [117]:
from sklearn.metrics import classification_report,  precision_recall_curve, average_precision_score

In [118]:
random_uniform_result = evaluate_models(RandomUniformClassifier, [best_test_model['best_seed']], model_name, level_3_labels, test_loader, device, rows = 1, cols = 1)

In [119]:
majority_result = evaluate_models(MajorityClassifier, [best_test_model['best_seed']], model_name, level_3_labels, test_loader, device, rows = 1, cols = 1)

In [120]:
def plot_classification_reports(reports: list):
    """
    Plot multiple classification reports as subplots with heatmaps.

    Args:
      reports (list): A list of dictionaries, where each dictionary contains keys 'model_class' and 'report'.

    Returns:
      None
    """
    subplots = make_subplots(rows=1, cols=len(reports), subplot_titles=[report['model_class'] for report in reports])
    fig = go.FigureWidget(subplots)

    for i, report in enumerate(reports, start=1):
        report_dict = report['report']
        classes = list(report_dict.keys())
        header = list(report_dict[classes[0]].keys())[:-1]

        values = [[round(report_dict[class_][metric], 3) for metric in header] for class_ in classes]

        fig.add_trace(
            go.Heatmap(z=values, x=header, y=classes, colorscale='Magenta', text=values, texttemplate="%{text}", textfont={"size": 12}, hoverinfo='text'),
            row=1, col=i
        )

    fig.update_layout(title='Classification Reports')
    fig.update_traces(showscale=False)

    fig.show()

In [121]:
# Evaluate BERT model
_, _, _, gts_bert, preds_bert, scores_bert = evaluate(best_test_model['best_model'], test_loader, nn.BCELoss(weight=class_weights), device, best_test_model['best_seed'])
# classification_report expects (y_true, y_pred), in this order; swapping them swaps precision and recall
bert_report = classification_report(gts_bert, preds_bert, zero_division=0.0, output_dict=True, target_names=level_3_labels)

# Evaluate Random Uniform Classifier
_, _, _, gts_rand, preds_rand, scores_rand = evaluate(random_uniform_result['best_model'], test_loader, nn.BCELoss(weight=class_weights), device, best_test_model['best_seed'])
random_uniform_report = classification_report(gts_rand, preds_rand, zero_division=0.0, output_dict=True, target_names=level_3_labels)

# Evaluate Majority Classifier
_, _, _, gts_maj, preds_maj, scores_maj = evaluate(majority_result['best_model'], test_loader, nn.BCELoss(weight=class_weights), device, best_test_model['best_seed'])
majority_report = classification_report(gts_maj, preds_maj, zero_division=0.0, output_dict=True, target_names=level_3_labels)

In [122]:
reports_list = [
    {'model_class': f"{best_test_model['model_class']}", 'report': bert_report},
    {'model_class': f"{random_uniform_result['model_class']}", 'report': random_uniform_report},
    {'model_class': f"{majority_result['model_class']}", 'report': majority_report}
]
plot_classification_reports(reports_list)

Here is what we can deduce from this comparison:

- The BERT model generally outperforms both the random classifier and the majority classifier in terms of precision, recall, and F1-score.

- The random classifier's precision is lower for the less represented classes. Its predictions are independent of the input, so it is expected to flag about half of the positive instances of every class (recall close to 0.5), while the share of its positive predictions that are correct follows the frequency of each class.

- As expected, the majority classifier achieves full recall for the `Conservation` and `Self-transcendence` labels, because it always predicts the most represented labels as positive, and it never detects the other two. Its micro results stress the importance of using macro metrics in tasks like this one, otherwise its failure on the less represented classes could remain hidden.

- Macro F1 score is confirmed as the best metric to perform the comparison, as it is insensitive to the imbalance of the classes and treats them all as equal.
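
The gap between micro and macro averaging can be reproduced on a toy example. The sketch below is a plain-Python illustration with made-up labels (one rare and one frequent category) and a constant predictor in the style of the majority baseline; the helper names `f1_from_counts` and `label_counts` are assumptions.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def label_counts(y_true, y_pred, label):
    """Count (tp, fp, fn) for one label column."""
    pairs = [(t[label], p[label]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn

# Column 0 is rare (2 positives out of 10), column 1 is frequent (8 out of 10).
y_true = [[1, 1]] * 2 + [[0, 1]] * 6 + [[0, 0]] * 2
# Constant predictor: never predicts the rare label, always predicts the frequent one.
y_pred = [[0, 1]] * 10

per_label = [label_counts(y_true, y_pred, i) for i in range(2)]
macro_f1 = sum(f1_from_counts(*c) for c in per_label) / len(per_label)
# Micro averaging pools the counts of all labels before computing F1.
micro_f1 = f1_from_counts(*(sum(c[k] for c in per_label) for k in range(3)))

print(f"micro F1: {micro_f1:.3f}, macro F1: {macro_f1:.3f}")
```

The pooled counts are dominated by the frequent label, so the micro score stays high even though the rare label is never detected, whereas the macro score drops sharply.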

## 5.3 Precision-recall curves

Let us now plot the precision-recall curves. The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).
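
To see where the points of such a curve come from, the sketch below computes precision and recall at a few thresholds in plain Python. The scores are made up and the helper `precision_recall_at` is a hypothetical name; the plots in this notebook rely on `sklearn.metrics.precision_recall_curve` instead.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are predicted as positive."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention: precision is 1 when nothing is predicted as positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy scores for a single category.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One curve point per distinct score used as a threshold.
curve = [(thr, *precision_recall_at(y_true, y_score, thr)) for thr in sorted(set(y_score))]
for thr, precision, recall in curve:
    print(f"threshold={thr:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold removes positive predictions, so recall can only stay the same or decrease, while precision may move in either direction.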

In [123]:
def plot_pr_curves(y_true: list, y_scores: list, model_name: str, class_labels:list, rows: int = 2, cols: int = 2):
    """
    Plot Precision-Recall curves for each class using Plotly with subplots.

    Args:
      y_true (list): True binary labels.
      y_scores (list): Target scores, can either be probability estimates of the positive class or confidence values.
      model_name (str): Name of the model for plot title.
      class_labels (list): List of class labels.
      rows (int): Number of rows in the subplots.
      cols (int): Number of columns in the subplots.

    Returns:
      None
    """
    subplots = make_subplots(rows=rows, cols=cols, subplot_titles=class_labels)
    fig = go.FigureWidget(subplots)

    for i, label in enumerate(class_labels):
        # Subplot position (1-indexed), filling the grid row by row
        row, col = i // cols + 1, i % cols + 1

        # Ground truths and scores of the current class across all samples
        y_class_true = [el[i] for el in y_true]
        y_class_score = [el[i] for el in y_scores]

        precision, recall, _ = precision_recall_curve(y_class_true, y_class_score)
        ap_score = average_precision_score(y_class_true, y_class_score)

        name = f"{label} (AP={ap_score:.2f})"
        fig.add_trace(go.Scatter(x=recall, y=precision, name=name, mode='lines'), row=row, col=col)

    # Axis titles are shared by all subplots, so set them once
    fig.update_xaxes(title_text='Recall')
    fig.update_yaxes(title_text='Precision')

    fig.update_layout(
        title=f'Precision-Recall Curves for {model_name}',
        hovermode='closest'
    )

    fig.show()

In [124]:
plot_pr_curves(gts_bert, scores_bert, best_test_model['model_class'], level_3_labels)

Reviewing both precision and recall is valuable when the classes are imbalanced, with many instances of class 0 and only a few of class 1, as is the case here. Neither metric counts true negatives, so the large number of class 0 examples cannot inflate the scores: we are less interested in the model's skill at predicting class 0 correctly.

The precision-recall curve plots precision (y-axis) against recall (x-axis) for different thresholds. A no-skill classifier predicts a random or constant class; its baseline is a horizontal line whose height equals the fraction of positive samples, which is 0.5 for a balanced dataset and lower when positives are rare.
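The no-skill baseline of each class can be estimated directly from the ground truths. The following is a small sketch under the assumption that labels are stored as a list of 0/1 rows, one column per class; the toy rows and the helper name `no_skill_baselines` are hypothetical.

```python
def no_skill_baselines(y_true):
    """Per-class fraction of positives, i.e. the precision of a no-skill classifier."""
    n_samples = len(y_true)
    return [sum(column) / n_samples for column in zip(*y_true)]


# Toy multi-label ground truths (hypothetical): class 0 frequent, class 1 rare
y_true = [[1, 0]] * 5 + [[1, 1]] * 1 + [[0, 0]] * 4

baselines = no_skill_baselines(y_true)
print(baselines)
# [0.6, 0.1]
```

A curve should be judged against the baseline of its own class: an average precision of 0.6 is no better than chance for the first toy class, but far above chance for the second.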

A model with perfect skill reaches the point (1,1), while a skilful model produces a curve that bows towards (1,1) and stays above the no-skill line. Notice how the curves for the most represented classes, `Conversation` and `Self-transcendence`, bend later than those of the other two classes, `Openness to change` and `Self-enhancement`.

Composite scores such as the F1 score (the harmonic mean of precision and recall) summarize this tradeoff at a single threshold, which is why they are used in tasks like this one.
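A short sketch of why the harmonic mean is used: it rewards classifiers only when precision and recall are both high. The helper `f1_score_from` and the input values are illustrative assumptions, not results from the models above.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical operating points with the same arithmetic mean (0.5)
balanced = f1_score_from(0.5, 0.5)
lopsided = f1_score_from(0.9, 0.1)
print(f"balanced: F1={balanced:.2f}, lopsided: F1={lopsided:.2f}")
# balanced: F1=0.50, lopsided: F1=0.18
```

Both points would look identical under a plain average, whereas F1 penalizes the one that sacrifices recall for precision.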


## 5.4 Predictions with different model variants

We are now going to examine some predictions and check how models with different inputs behave.

In [125]:
def predict(sample: dict, model: nn.Module, tokenizer, device: str):
    """
    Predict labels for an input sample.

    Args:
      sample (dict): Input sample containing keys 'conclusion', 'premise', 'stance', and 'labels'.
      model: The trained model for prediction.
      tokenizer: The tokenizer used for tokenization.
      device: Device to run the model on (e.g., 'cuda' or 'cpu').

    Returns:
      tuple: A tuple containing the untokenized conclusion and premise, stance, predicted labels and ground truth labels.
    """
    # Decode the tokenized conclusion and premise back to text
    conclusion = tokenizer.decode(sample['conclusion']['input_ids'][0], skip_special_tokens=True)
    premise = tokenizer.decode(sample['premise']['input_ids'][0], skip_special_tokens=True)
    stance = sample['stance'][0][0]

    # Move data to device
    sample = move_to_device(sample, device)

    # Get model predictions
    with torch.no_grad():
        outputs = model(sample)

    # Decode predicted labels and ground truth labels
    label_names = ['Conversation', 'Openess to change', 'Self-enhancement', 'Self-transcendence']
    preds = [{f"{label_names[i]}":pred > 0.5} for i, pred in enumerate(outputs.cpu().detach().numpy()[0])]
    gts = [{f"{label_names[i]}":gt > 0.5} for i, gt in enumerate(sample['labels'].cpu().numpy()[0])]

    return conclusion, premise, stance, preds, gts


In [126]:
single_test_loader = DataLoader(test_dataset, batch_size=1, shuffle=True, worker_init_fn=seed_worker, generator=g)

n_samples = 4
for idx, sample in enumerate(single_test_loader):
  print(f"- Sample {idx+1}")
  for variant in evaluations_test:
    conclusion, premise, stance, preds, labels = predict(sample, variant['best_model'], tokenizer, device)
    print(f"\tPredictions {variant['model_class']}:\n\t\t{preds}")

  print(f"\tGround truth labels:\n\t\t{labels}")
  print(f"\n\tPremise: {premise}")
  print(f"\tConclusion: {conclusion}")
  print(f"\tStance: {stance}")
  print()

  if idx == n_samples-1:
    break

- Sample 1
	Predictions CClassifier:
		[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
	Predictions CPClassifier:
		[{'Conversation': False}, {'Openess to change': True}, {'Self-enhancement': True}, {'Self-transcendence': True}]
	Predictions CPSClassifier:
		[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
	Ground truth labels:
		[{'Conversation': False}, {'Openess to change': True}, {'Self-enhancement': True}, {'Self-transcendence': True}]

	Premise: guantanamo bay has become a symbol for terrorist organizations that helps to promote the ideology and cause
	Conclusion: we should close guantanamo bay detention camp
	Stance: 1

- Sample 2
	Predictions CClassifier:
		[{'Conversation': False}, {'Openess to change': False}, {'Self-enhancement': True}, {'Self-transcendence': True}]
	Predictions CPClassifier:
		[{'Conversation': False}, {'Openess to change': Fals

We can notice that the `CClassifier` generally performs worse than the other two variants, which instead tend to output the same predictions.

When one of the highly represented classes is `True`, the models rarely fail to predict it correctly; this gives us a sense of the further imbalance between positive and negative samples.

# [Task 6 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Model card

You are **free** to choose the BERT-base model card you like from Hugging Face.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with its hyper-parameters.

### Model Training

You are **free** to choose training hyper-parameters for BERT-based models (e.g., number of epochs, etc...).

### Neural Libraries

You are **free** to use any library of your choice to address the assignment (e.g., Keras, TensorFlow, PyTorch, JAX, etc...).

### Error Analysis

Some topics for discussion include:
   * Model performance on most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

# The End