# Assignment 2

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: Human Value Detection, Multi-label classification, Transformers, BERT


# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

You are tasked to address the [Human Value Detection challenge](https://aclanthology.org/2022.acl-long.306/).

## Problem definition

Arguments are paired with their conveyed human values.

Arguments are in the form of **premise** $\rightarrow$ **conclusion**.

### Example:

**Premise**: *``fast food should be banned because it is really bad for your health and is costly''*

**Conclusion**: *``We should ban fast food''*

**Stance**: *in favour of*

<center>
    <img src="images/human_values.png" alt="human values" />
</center>

# [Task 1 - 0.5 points] Corpus

Check the official page of the challenge [here](https://touche.webis.de/semeval23/touche23-web/).

The challenge offers several corpora for evaluation and testing.

You are going to work with the standard training, validation, and test splits.

#### Arguments
* arguments-training.tsv
* arguments-validation.tsv
* arguments-test.tsv

#### Human values
* labels-training.tsv
* labels-validation.tsv
* labels-test.tsv

### Example

#### arguments-*.tsv
```

Argument ID    A01005

Conclusion     We should ban fast food

Stance         in favor of

Premise        fast food should be banned because it is really bad for your health and is costly.
```

#### labels-*.tsv

```
Argument ID                A01005

Self-direction: thought    0
Self-direction: action     0
...
Universalism: objectivity: 0
```

### Splits

The standard splits contain

   * **Train**: 5393 arguments
   * **Validation**: 1896 arguments
   * **Test**: 1576 arguments

### Annotations

In this assignment, you are tasked to address a multi-label classification problem.

You are going to consider **level 3** categories:

* Openness to change
* Self-enhancement
* Conversation
* Self-transcendence

**How to do that?**

You have to merge (**logical OR**) annotations of level 2 categories belonging to the same level 3 category.

**Pay attention to shared level 2 categories** (e.g., Hedonism). $\rightarrow$ [see Table 1 in the original paper.](https://aclanthology.org/2022.acl-long.306/)

#### Example

```
Self-direction: thought:    0
Self-direction: action:     1
Stimulation:                0
Hedonism:                   1

Openess to change           1
```

### Instructions

* **Download** the specificed training, validation, and test files.
* **Encode** split files into a pandas.DataFrame object.
* For each split, **merge** the arguments and labels dataframes into a single dataframe.
* **Merge** level 2 annotations to level 3 categories.

# [Task 2 - 2.0 points] Model definition

You are tasked to define several neural models for multi-label classification.

<center>
    <img src="images/model_schema.png" alt="model_schema" />
</center>

### Instructions

* **Baseline**: implement a random uniform classifier (an individual classifier per category).
* **Baseline**: implement a majority classifier (an individual classifier per category).

<br/>

* **BERT w/ C**: define a BERT-based classifier that receives an argument **conclusion** as input.
* **BERT w/ CP**: add argument **premise** as an additional input.
* **BERT w/ CPS**: add argument premise-to-conclusion **stance** as an additional input.

In [1]:
!pip install --upgrade pip

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.0


In [2]:
!pip install datasets transformers

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m15.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading xxhas

In [3]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl.metadata (18 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.10.0->accelerate)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch>=1.10.0->accel

In [4]:
!pip install transformers[torch]

[0m

In [5]:
import pandas as pd
import numpy as np
import torch
import transformers
import torch.nn as nn
from torch.utils.data import DataLoader
import accelerate
from datasets import *
from transformers import AutoTokenizer, BertForSequenceClassification, PreTrainedTokenizerFast, DefaultDataCollator, Trainer, TrainingArguments
from transformers.optimization import AdamW
import random
import matplotlib.pyplot as plt
from transformers.optimization import get_scheduler
import matplotlib.colors as mcolors
from sklearn.metrics import confusion_matrix, classification_report, f1_score,  precision_recall_fscore_support
from sklearn.utils import class_weight
import os
import warnings
warnings.filterwarnings("ignore")
import tqdm.notebook as tq

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

#HYPERPARAMETERS
model_checkpoint = 'prajjwal1/bert-tiny'
#MAX_LEN = 200
NUM_LABELS = 4
TRAIN_BATCH_SIZE = 64
VALID_BATCH_SIZE = 64
EPOCHS = 10
LEARNING_RATE = 1e-05
DROPOUT = 0.2
WEIGHT_DECAY = 0.01

Using cuda device


In [7]:
def set_reproducibility(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    transformers.set_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

    os.environ['TF_DETERMINISTIC_OPS'] = '1'

In [8]:
# Load the tokenizer
print(f'Loading tokenizer ...')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
assert isinstance(tokenizer, PreTrainedTokenizerFast)
print('Tokenizer loaded.')

Loading tokenizer ...


config.json:   0%|          | 0.00/285 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Tokenizer loaded.


In [9]:
#for colab
from google.colab import files
import zipfile


uploaded = files.upload()


zip_filename = next(iter(uploaded))


with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
    zip_ref.extractall('/content/')



Saving files_ass2.zip to files_ass2.zip


In [10]:
#for colab
print(f"Loading data...")
path_labels_data = "./labels_data"
path_arguments_data = "./arguments_data"


def load_file():
    '''Load the file and return the data and labels for training, testing and validation'''

    train_data = pd.read_csv(os.path.join(path_arguments_data, 'arguments-training.tsv'), sep='\t')
    test_data = pd.read_csv(os.path.join(path_arguments_data, 'arguments-test.tsv'), sep='\t')
    valid_data = pd.read_csv(os.path.join(path_arguments_data, 'arguments-validation.tsv'), sep='\t')
    train_labels = pd.read_csv(os.path.join(path_labels_data, 'labels-training.tsv'),sep='\t')
    test_labels = pd.read_csv(os.path.join(path_labels_data, 'labels-test.tsv'),sep='\t')
    valid_labels = pd.read_csv(os.path.join(path_labels_data, 'labels-validation.tsv'),sep='\t')

    return (train_data,train_labels), (test_data,test_labels), (valid_data,valid_labels)



(train_data,train_labels), (test_data,test_labels), (valid_data,valid_labels) = load_file()
print("Data loaded.")

Loading data...
Data loaded.


In [None]:
print(f"Loading data...")
def load_file():
    '''Load the file and return the data and labels for training, testing and validation'''
    train_data = pd.read_csv('arguments-training.tsv', sep='\t')
    test_data = pd.read_csv('arguments-test.tsv', sep='\t')
    valid_data = pd.read_csv('arguments-validation.tsv', sep='\t')
    train_labels = pd.read_csv('labels-training.tsv',sep='\t')
    test_labels = pd.read_csv('labels-test.tsv',sep='\t')
    valid_labels = pd.read_csv('labels-validation.tsv',sep='\t')
    return (train_data,train_labels), (test_data,test_labels), (valid_data,valid_labels)
(train_data,train_labels), (test_data,test_labels), (valid_data,valid_labels) = load_file()
print("Data loaded.")

Loading data...
Data loaded.


In [11]:
# Merge the data and labels
def merge_df(data, labels):
    return pd.merge(data, labels, on='Argument ID')

train_df_merged = merge_df(train_data, train_labels)
val_df_merged = merge_df(valid_data, valid_labels)
test_df_merged = merge_df(test_data, test_labels)

level2_to_level3 = {
    'openness_to_change':["Self-direction: thought", "Self-direction: action","Stimulation", "Hedonism"],
    'self_enhancement':['Hedonism','Achievement', 'Power: dominance', 'Power: resources','Face'],
    'conservation': ['Face', 'Security: personal', 'Security: societal', 'Tradition', 'Conformity: rules', 'Conformity: interpersonal','Humility'],
    'self_transcendence':['Humility', 'Benevolence: caring', 'Benevolence: dependability','Universalism: concern', 'Universalism: nature','Universalism: tolerance', 'Universalism: objectivity']
}

def lv3_labels(df):
    """
    Function to aggregate specified columns by taking the logical OR between their values.

    Parameters:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: A new DataFrame with aggregated values.
    """

    # Create a new DataFrame to store aggregated values
    new_df = pd.DataFrame()

    # Iterate over the mapping and compute logical OR for each set of columns
    for new_column, columns_to_aggregate in level2_to_level3.items():
        new_df[new_column] = df[columns_to_aggregate].apply(lambda row: any(row), axis=1).astype(int)  # Convert boolean to int

    # Drop the original columns used for aggregation
    df.drop(columns=[col for cols in level2_to_level3.values() for col in cols], inplace=True)

    # Concatenate new DataFrame with the remaining columns of the original DataFrame
    new_df = pd.concat([df, new_df], axis=1)

    return new_df

df_train = lv3_labels(train_df_merged)
df_val = lv3_labels(val_df_merged)
df_test = lv3_labels(test_df_merged)

In [12]:
def refine_df(df):
    new_df = df.copy()
    new_df["Stance"] = df['Stance'].replace({'in favor of': 1, 'against': 0})
    new_df.drop(df.columns[4:], axis=1, inplace=True)
    new_df['labels'] = df.iloc[:, 4:].apply(lambda x: np.array(list(x)), axis=1)
    return new_df

df_train = refine_df(df_train)
df_test = refine_df(df_test)
df_val = refine_df(df_val)

In [13]:
featurez = ['Conclusion', 'Premise', 'Stance', 'labels']

# Dataframes to Datasets
train_df_to_ds = df_train[featurez]
val_df_to_ds = df_val[featurez]
test_df_to_ds = df_test[featurez]

train_df_to_ds = train_df_to_ds.rename(columns={'Conclusion': 'conclusion', 'Premise': 'premise', 'Stance': 'stance'})
val_df_to_ds = val_df_to_ds.rename(columns={'Conclusion': 'conclusion', 'Premise': 'premise', 'Stance': 'stance'})
test_df_to_ds = test_df_to_ds.rename(columns={'Conclusion': 'conclusion', 'Premise': 'premise', 'Stance': 'stance'})


In [14]:
tclengths = [len(tokenizer(x)["input_ids"]) for x in train_df_to_ds['conclusion']]
tplengths = [len(tokenizer(x)["input_ids"]) for x in train_df_to_ds['premise']]

vclengths = [len(tokenizer(x)["input_ids"]) for x in val_df_to_ds['conclusion']]
vplengths = [len(tokenizer(x)["input_ids"]) for x in val_df_to_ds['premise']]

WC_MAX_LEN = max(tclengths+vclengths)
WCP_MAX_LEN = WC_MAX_LEN + max(tplengths+tplengths) + 1
WCPS_MAX_LEN = WCP_MAX_LEN + 2

print(WC_MAX_LEN)
print(WCP_MAX_LEN)
print(WCPS_MAX_LEN)

38
198
200


In [68]:
#define BertDataset:

class MultilabelDataset(torch.utils.data.Dataset):
    def __init__(self,df:pd.DataFrame, tokenizer:AutoTokenizer, max_len:int):
        self.tokenizer = tokenizer
        self.df = df
        self.conclusion = df.conclusion
        self.premise = df.premise
        self.stance = torch.tensor(df['stance'].values.astype(float), dtype=torch.float)
        self.max_len = max_len
        self.labels = torch.tensor(np.stack(df['labels'].tolist()), dtype=torch.long)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):

        conclusion = self.conclusion[index]
        premise = self.premise[index]

        inputs_c = self.tokenizer.encode_plus(
            conclusion,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True,
            return_tensors='pt'
        )

        inputs_p = self.tokenizer.encode_plus(
            premise,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True,
            truncation=True,
            return_tensors='pt'
        )

        ids_c = inputs_c['input_ids'].flatten()
        mask_c = inputs_c['attention_mask'].flatten()
        token_type_ids_c = inputs_c['token_type_ids'].flatten()

        ids_p = inputs_p['input_ids'].flatten()
        mask_p = inputs_p['attention_mask'].flatten()
        token_type_ids_p = inputs_p['token_type_ids'].flatten()

        '''
        return {'ids_c' : torch.tensor(ids_c, dtype=torch.long),
                'mask_c' : torch.tensor(mask_c, dtype=torch.long),
                'token_type_ids_c' : torch.tensor(token_type_ids_c, dtype=torch.long),
                'ids_p':torch.tensor(ids_p, dtype=torch.long),
                'mask_p':torch.tensor(mask_p, dtype=torch.long),
                'token_type_ids_p': torch.tensor(token_type_ids_p, dtype=torch.long),
                'stance' : self.stance[index],
                'labels' : self.labels[index]}
        '''
        return {'ids_c' : ids_c,
                'mask_c' : mask_c,
                'token_type_ids_c' : token_type_ids_c,
                'ids_p':ids_p,
                'mask_p':mask_p,
                'token_type_ids_p': token_type_ids_p,
                'stance' : self.stance[index],
                'labels' : self.labels[index]}

wc_datasets = DatasetDict()
wc_datasets['train'] = MultilabelDataset(train_df_to_ds, tokenizer, WC_MAX_LEN)
wc_datasets['validation'] = MultilabelDataset(val_df_to_ds, tokenizer, WC_MAX_LEN)
wc_datasets['test'] = MultilabelDataset(test_df_to_ds, tokenizer, WC_MAX_LEN)


In [88]:
train_loader = DataLoader(wc_datasets['train'], batch_size=TRAIN_BATCH_SIZE, shuffle=True)
val_loader = DataLoader(wc_datasets['validation'], batch_size=VALID_BATCH_SIZE, shuffle=False)


In [89]:
batch = next(iter(train_loader))
print(batch.keys())

dict_keys(['ids_c', 'mask_c', 'token_type_ids_c', 'ids_p', 'mask_p', 'token_type_ids_p', 'stance', 'labels'])


In [17]:
# define f1-score metric
def compute_metrics(eval_predictions):
    predictions, labels = eval_predictions.predictions, eval_predictions.label_ids
    macro_f1_score = compute_f1_macro(labels, predictions)
    return {"macro_f1": macro_f1_score}

def compute_f1_macro(labels, predictions, threshold=0.5, print_per_category=False):
    binary_predictions = (predictions > threshold).astype(int)
    f1_macro = f1_score(y_true=labels, y_pred=binary_predictions, average='macro')
    if print_per_category:
        per_category_scores = precision_recall_fscore_support(labels, binary_predictions, average=None)
        print("Per-category F1 scores:")
        for i, score in enumerate(per_category_scores[2]):
            print(f"Category {i+1}: {score}")
    return f1_macro

In [18]:
# Class Weights
def compute_class_weights_from_df(df):
    # Extracting labels
    labels = df['labels'].tolist()

    # Convert labels to 2D array
    labels_array = np.array([np.array(label) for label in labels])

    # Compute class frequencies
    class_frequencies = np.sum(labels_array, axis=0) / len(labels_array)

    # Compute inverse class frequencies
    inverse_class_frequencies = 1 / class_frequencies

    print("Inverse Class Frequencies:", inverse_class_frequencies)

    return inverse_class_frequencies

In [75]:
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification,BertModel

class MultilabelClassifier(nn.Module):
    def __init__(self, num_labels, drop_out=0.1, model_type="base"):
        super(MultilabelClassifier, self).__init__()
        self.num_labels = num_labels
        self.model_type = model_type

        self.bert_conclusion = BertForSequenceClassification.from_pretrained(model_checkpoint, num_labels=self.num_labels,output_hidden_states=True)

        if model_type != "base":
            self.bert_premise = BertForSequenceClassification.from_pretrained(model_checkpoint, num_labels=self.num_labels,output_hidden_states=True)

        if model_type == "bert_w_cp":
            self.hidden_dim = self.bert_conclusion.config.hidden_size * 2

        elif model_type == "bert_w_cps":
            self.hidden_dim = (self.bert_conclusion.config.hidden_size * 2) + 1

        else:
            self.hidden_dim = self.bert_conclusion.config.hidden_size


        self.dropout = nn.Dropout(drop_out)
        self.dense = nn.Linear(self.hidden_dim, num_labels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, inputs):
        # Extracting data for conclusion
        ids_c = inputs['ids_c'].to(device)
        mask_c = inputs['mask_c'].to(device)
        token_type_ids_c = inputs['token_type_ids_c'].to(device)

        # Extracting data for premise
        ids_p = inputs['ids_p'].to(device)
        mask_p = inputs['mask_p'].to(device)
        token_type_ids_p = inputs['token_type_ids_p'].to(device)

        conclusion_outputs = self.bert_conclusion(input_ids=ids_c, attention_mask=mask_c, token_type_ids=token_type_ids_c)
        pooled_output_c = conclusion_outputs.hidden_states[-1][:, 0, :]


        if self.model_type != "base":
            premise_outputs = self.bert_premise(input_ids=ids_p, attention_mask=mask_p, token_type_ids=token_type_ids_p)
            pooled_output_cp = premise_outputs.hidden_states[-1][:, 0, :]

        if self.model_type == "bert_w_cp":
            output = torch.cat((pooled_output_c, pooled_output_cp), dim=1)
        elif self.model_type == "bert_w_cps":
            stance = inputs['stance']
            output = torch.cat((pooled_output_c, pooled_output_cp, stance), dim=1)
        else:
            output = pooled_output_c


        output = self.dropout(output)
        output = self.dense(output)
        output = self.sigmoid(output)
        return output


In [85]:
from transformers import Trainer

class CustomTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = torch.tensor(class_weights, dtype=torch.float32) if class_weights is not None else None

    def compute_loss(self, model, inputs, return_outputs=False):
        print("INPUTS",inputs)
        labels = inputs.pop("labels").to(device)
        outputs = model(**inputs)
        logits = outputs.logits
        loss = self.weighted_loss(logits, labels, self.class_weights)
        return (loss, outputs) if return_outputs else loss

    @staticmethod
    def weighted_loss(logits, labels, class_weights):
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels, weight=class_weights.to(device))
        return loss


    def _prepare_inputs(self, inputs):
        print("INPUTS",inputs)
        print("INPUTS KEYS:",inputs.keys())
        return {
            'input_ids': inputs['ids_c'].to(self.args.device),
            'attention_mask': inputs['mask_c'].to(self.args.device),
            'token_type_ids': inputs['token_type_ids_c'].to(self.args.device),
            'input_ids_p': inputs['ids_p'].to(self.args.device),
            'attention_mask_p': inputs['mask_p'].to(self.args.device),
            'token_type_ids_p': inputs['token_type_ids_p'].to(self.args.device),
            'stance': inputs['stance'].to(self.args.device)
        }



In [90]:
def train(epoch, model,datasets,load_model:False):
    model.to(device)
    model.train()
    data_collator = DefaultDataCollator()

    class_weights = compute_class_weights_from_df(train_df_to_ds)

    training_args = TrainingArguments(
        output_dir='./Model_Checkpoints',
        overwrite_output_dir=True,
        evaluation_strategy="epoch",
        per_device_train_batch_size=TRAIN_BATCH_SIZE,
        per_device_eval_batch_size=VALID_BATCH_SIZE,
        fp16=True if device == 'cuda' else False,
        num_train_epochs=EPOCHS,
        logging_steps=100,
        logging_dir='./logs',
        learning_rate=LEARNING_RATE,
        save_strategy='epoch',
        metric_for_best_model='macro_f1',
        load_best_model_at_end=True,
        save_total_limit=1,
        )


    print("OPTIMIZER AND SCHEDULER INSTANTIATION")
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=len(datasets['train']) // TRAIN_BATCH_SIZE * EPOCHS,
    )
    print("Done !")
    print()



    trainer = CustomTrainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=datasets['train'],
        eval_dataset=datasets['validation'],
        data_collator=data_collator,
        optimizers=(optimizer, scheduler),
        class_weights=class_weights,
        )

    if not load_model:
        # Start the training
        trainer.train()

In [91]:
SEED = [0,42,121]

for seed in SEED:
      print("-"*100)
      print(f"########################### Run with seed: {seed} ########################################")

      print(f"Setting seed for reproducibility")
      set_reproducibility(seed)
      print("Done")
      print()

      print("SELECTED HYPERPARAMETERS")
      hyp_str = f"run_num: {seed}  epochs: {EPOCHS}, batch_size: {TRAIN_BATCH_SIZE}, dropout: {DROPOUT},lr : {LEARNING_RATE},"
      print(hyp_str)
      print()

      print("MODEL CREATION")
      model = MultilabelClassifier(drop_out=DROPOUT, num_labels = NUM_LABELS, model_type="base")
      model.to(device)
      print("Done")
      print()

      print("START TRAINING")
      print()
      train(EPOCHS,model,wc_datasets,load_model=False)
      print("Cleaning GPU memory...")
      print()
      del model
      torch.cuda.empty_cache()
      print("Cleaning Done!")


      print(f"############################## Training {seed} done. #######################################################")


----------------------------------------------------------------------------------------------------
########################### Run with seed: 0 ########################################
Setting seed for reproducibility
Done

SELECTED HYPERPARAMETERS
run_num: 0  epochs: 10, batch_size: 64, dropout: 0.2,lr : 1e-05,

MODEL CREATION


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Done

START TRAINING

Inverse Class Frequencies: [2.72511369 2.16673363 1.31312393 1.3137637 ]
OPTIMIZER AND SCHEDULER INSTANTIATION
Done !



ValueError: The batch received was empty, your model won't be able to train on it. Double-check that your training dataset contains keys expected by the model: inputs,label_ids,label.

### Notes

**Do not mix models**. Each model has its own instructions.

You are **free** to select the BERT-based model card from huggingface.

#### Examples

```
bert-base-uncased
prajjwal1/bert-tiny
distilbert-base-uncased
roberta-base
```

### BERT w/ C

<center>
    <img src="images/bert_c.png" alt="BERT w/ C" />
</center>

### BERT w/ CP

<center>
    <img src="images/bert_cp.png" alt="BERT w/ CP" />
</center>

### BERT w/ CPS

<center>
    <img src="images/bert_cps.png" alt="BERT w/ CPS" />
</center>

### Input concatenation

<center>
    <img src="images/input_merging.png" alt="Input merging" />
</center>

### Notes

The **stance** input has to be encoded into a numerical format.

You **should** use the same model instance to encode **premise** and **conclusion** inputs.

# [Task 3 - 0.5 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using per-category binary F1-score.
* Compute the average binary F1-score over all categories (macro F1-score).

### Example

You start with individual predictions ($\rightarrow$ samples).

```
Openess to change:    0 0 1 0 1 1 0 ...
Self-enhancement:     1 0 0 0 1 0 1 ...
Conversation:         0 0 0 1 1 0 1 ...
Self-transcendence:   1 1 0 1 0 1 0 ...
```

You compute per-category binary F1-score.

```
Openess to change F1:    0.35
Self-enhancement F1:     0.55
Conversation F1:         0.80
Self-transcendence F1:   0.21
```

You then average per-category scores.
```
Average F1: ~0.48
```

# [Task 4 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate **all** defined models.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Pick **at least** three seeds for robust estimation.
* Compute metrics on the validation set.
* Report **per-category** and **macro** F1-score for comparison.

# [Task 5 - 1.0 points] Error Analysis

You are tasked to discuss your results.

### Instructions

* **Compare** classification performance of BERT-based models with respect to baselines.
* Discuss **difference in prediction** between the best performing BERT-based model and its variants.

### Notes

You can check the [original paper](https://aclanthology.org/2022.acl-long.306/) for suggestions on how to perform comparisons (e.g., plots, tables, etc...).

# [Task 6 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check this frequently asked questions before contacting us

### Model card

You are **free** to choose the BERT-base model card you like from huggingface.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Model Training

You are **free** to choose training hyper-parameters for BERT-based models (e.g., number of epochs, etc...).

### Neural Libraries

You are **free** to use any library of your choice to address the assignment (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Error Analysis

Some topics for discussion include:
   * Model performance on most/less frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.

# The End