<a href="https://colab.research.google.com/github/danielsaggau/IR_LDC/blob/main/model/ECTHR/ECTHR_SIMCSE_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install and Load packages

In [None]:
!git clone https://github.com/danielsaggau/IR_LDC.git

In [None]:
%cd IR_LDC

In [None]:
!pip install -r requirements.txt

# Load datasets

In [10]:
from datasets import load_dataset
dataset = load_dataset("lex_glue", "ecthr_b")

Downloading and preparing dataset lex_glue/ecthr_b to /root/.cache/huggingface/datasets/lex_glue/ecthr_b/1.0.0/8a66420941bf6e77a7ddd4da4d3bfb7ba88ef48c1d55302a568ac650a095ca3a...


Generating train split:   0%|          | 0/9000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset lex_glue downloaded and prepared to /root/.cache/huggingface/datasets/lex_glue/ecthr_b/1.0.0/8a66420941bf6e77a7ddd4da4d3bfb7ba88ef48c1d55302a568ac650a095ca3a. Subsequent calls will reuse this data.


# Connect to Huggingface
Alternative 1: log in via the pop-up window and enter your access token.

In [None]:
#from huggingface_hub import notebook_login
#notebook_login()
#access code:

Alternative 2: save the token directly with a shell command.

In [4]:
!python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('XXXX')"

# Set labels 
We set the number of labels to 10 and also pass this value to ```AutoModelForSequenceClassification.from_pretrained```.

In [5]:
label_list = list(range(10))
num_labels = len(label_list)
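
The label list is used later in `preprocess_function` to turn each case's list of label IDs into a multi-hot float vector. A minimal sketch of that encoding follows; the helper name `to_multi_hot` and the example label IDs are made up for illustration and are not taken from the dataset.

```python
label_list = list(range(10))

def to_multi_hot(labels, label_list):
    """Return a float vector with 1.0 at every position whose label is present."""
    present = set(labels)
    return [1.0 if label in present else 0.0 for label in label_list]

# Hypothetical case tagged with labels 2 and 7.
example = to_multi_hot([2, 7], label_list)
print(example)

# A case without any label maps to an all-zero vector.
empty = to_multi_hot([], label_list)
print(empty)
```

The all-zero case matters later: the metric function adds an extra column for exactly these examples.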

We instantiate the model and the tokenizer from our pre-trained checkpoint, which was pre-trained in a way similar to `SIMCSE`. We also pass ``use_fast=True`` to load the fast tokenizer.

In [6]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("danielsaggau/simcse_longformer_ecthr_b", use_auth_token=True, use_fast=True)
# tokenizer =  AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForSequenceClassification.from_pretrained("danielsaggau/simcse_longformer_ecthr_b", num_labels=10, problem_type='multi_label_classification')
# model = AutoModelForSequenceClassification.from_pretrained("allenai/longformer-base-4096", num_labels=10)

Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at danielsaggau/simcse_longformer_ecthr_b and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Data Collator 
Set the collator's ``pad_to_multiple_of`` to 8 for efficiency with FP16.

In [7]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) # fp16
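
As a rough illustration of what ``pad_to_multiple_of=8`` means, the sketch below rounds a sequence length up to the next multiple of 8. The helper `padded_length` is hypothetical and only mirrors the arithmetic, not the collator's code. In this notebook the inputs are already padded to 4096 tokens during preprocessing, which is a multiple of 8, so the setting should not change the batch shapes here.

```python
def padded_length(longest: int, multiple: int = 8) -> int:
    """Round `longest` up to the next multiple of `multiple`."""
    return ((longest + multiple - 1) // multiple) * multiple

# A few toy lengths, plus the 4096 used in this notebook.
for n in (13, 16, 509, 4096):
    print(n, "->", padded_length(n))
```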

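Here we add a global attention mask for Longformer, whose attention mechanism is twofold: local sliding-window attention for every token, plus global attention for selected tokens (here, the first token).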

In [8]:
import numpy as np

def preprocess_function(examples):
    padding = "max_length"
    max_seq_length = 4096
    # Join the facts of each case into one string, separated by the SEP token
    cases = [f' {tokenizer.sep_token} '.join(case) for case in examples['text']]
    # Tokenize, padding or truncating every case to max_seq_length
    batch = tokenizer(cases, padding=padding, max_length=max_seq_length, truncation=True)
    # Use global attention on the first (CLS) token only
    global_attention_mask = np.zeros((len(cases), max_seq_length), dtype=np.int32)
    global_attention_mask[:, 0] = 1
    batch['global_attention_mask'] = list(global_attention_mask)
    # Multi-hot encode the labels as floats
    batch["labels"] = [[1.0 if label in labels else 0.0 for label in label_list] for labels in examples["labels"]]
    return batch
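
To make the shape of the global attention mask concrete, here is a small sketch with toy sizes. It uses plain Python lists instead of NumPy so that it runs without dependencies; the helper name `build_global_attention_mask` is hypothetical.

```python
def build_global_attention_mask(num_cases: int, seq_len: int):
    """One row per case: 1 at position 0 (the first token), 0 elsewhere."""
    return [[1] + [0] * (seq_len - 1) for _ in range(num_cases)]

# Toy sizes for readability; the notebook uses seq_len=4096.
mask = build_global_attention_mask(num_cases=2, seq_len=8)
for row in mask:
    print(row)
```

Only the first token gets global attention, so it can attend to the whole document while all other tokens use the local window.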

In [11]:
tokenized_data = dataset.map(preprocess_function, batched=True,remove_columns=['text'])

In [12]:
# cast label IDs to floats
import torch 
tokenized_data.set_format("torch")
tokenized_data = (tokenized_data
          .map(lambda x : {"float_labels": x["labels"].to(torch.float)}, remove_columns=["labels"])
          .rename_column("float_labels", "labels"))

In [13]:
from transformers import EvalPrediction
from scipy.special import expit
from sklearn.metrics import f1_score

def compute_metrics(p: EvalPrediction):
    # Gold labels: append a "no label" column that is 1 for cases without any label
    y_true = np.zeros((p.label_ids.shape[0], p.label_ids.shape[1] + 1), dtype=np.int32)
    y_true[:, :-1] = p.label_ids
    y_true[:, -1] = (np.sum(p.label_ids, axis=1) == 0).astype('int32')
    # Predictions: apply a sigmoid, threshold at 0.5, and append the same "no label" column
    logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = (expit(logits) > 0.5).astype('int32')
    y_pred = np.zeros((p.label_ids.shape[0], p.label_ids.shape[1] + 1), dtype=np.int32)
    y_pred[:, :-1] = preds
    y_pred[:, -1] = (np.sum(preds, axis=1) == 0).astype('int32')
    # Compute macro- and micro-averaged F1 over all columns
    macro_f1 = f1_score(y_true=y_true, y_pred=y_pred, average='macro', zero_division=0)
    micro_f1 = f1_score(y_true=y_true, y_pred=y_pred, average='micro', zero_division=0)
    return {'macro-f1': macro_f1, 'micro-f1': micro_f1}
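
The following dependency-free sketch walks through the same steps on a toy example with three labels: threshold the sigmoid at 0.5, append the "no label" column, and compute micro-F1 from the pooled counts. The data and helper names are invented for illustration; in the notebook itself `f1_score` from scikit-learn does the scoring.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def binarize(logits, threshold=0.5):
    """Turn raw logits into 0/1 predictions."""
    return [[int(sigmoid(v) > threshold) for v in row] for row in logits]

def add_no_label_column(rows):
    """Append a column that is 1 when a row has no positive label."""
    return [list(row) + [int(sum(row) == 0)] for row in rows]

def micro_f1(y_true, y_pred):
    """Micro-F1: pool true/false positives and false negatives over all cells."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data (three labels, three cases); the second case has no label.
gold = [[1, 0, 0], [0, 0, 0], [0, 1, 1]]
logits = [[2.0, -1.0, -3.0], [-2.0, -0.5, -1.0], [-1.0, 3.0, -0.2]]

y_true = add_no_label_column(gold)
y_pred = add_no_label_column(binarize(logits))
score = micro_f1(y_true, y_pred)
print(y_pred, round(score, 4))
```

Without the extra column, a correctly predicted label-free case would contribute nothing to the score; with it, that case counts as a true positive.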

# Specify the training arguments
Here we use a per-device batch size of 6, as we are using Longformer. We use a learning rate of 3e-5, as done by Chalkidis et al. in LexGLUE.
We evaluate and save a checkpoint every 500 steps. The metric for selecting the best model is micro-F1. Since a higher score is better, we set `greater_is_better=True`, and we load the best model at the end of training.
To potentially improve performance further, increase the number of epochs.
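
As a simplified illustration of best-model selection (not the `Trainer`'s actual implementation), the sketch below picks the step with the highest micro-F1 from the values logged during the training run in this notebook. The helper `best_checkpoint` is hypothetical.

```python
# (step, micro-F1) pairs copied from the training log in this notebook.
history = [
    (500, 0.625095), (1000, 0.719033), (1500, 0.731597), (2000, 0.737791),
    (2500, 0.784973), (3000, 0.781669), (3500, 0.764331), (4000, 0.799706),
    (4500, 0.792913), (5000, 0.800147),
]

def best_checkpoint(history, greater_is_better=True):
    """Return the (step, score) pair with the best score."""
    pick = max if greater_is_better else min
    return pick(history, key=lambda item: item[1])

best_step, best_score = best_checkpoint(history)
print(best_step, best_score)
```

For a loss-like metric one would pass `greater_is_better=False`, which selects the smallest value instead.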

In [14]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/ecthr_b_classsification",
    learning_rate=3e-5,
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    num_train_epochs=20,
    weight_decay=0.01,
    fp16=True,
    eval_steps=500,
    save_strategy="steps",
    evaluation_strategy="steps",
    logging_steps=500,
    logging_dir='./logs',  
    logging_first_step = True,
    logging_strategy = 'steps',
    #push_to_hub=True,
    metric_for_best_model="micro-f1",
    greater_is_better=True,
    load_best_model_at_end = True
)
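
The step counts reported in the training log can be reproduced from these settings. The sketch below assumes a single device and no gradient accumulation, which matches the logged total train batch size of 6; the variable names are chosen here for illustration.

```python
import math

num_train_examples = 9000          # size of the train split
per_device_train_batch_size = 6
num_train_epochs = 20
num_devices = 1                    # assumption: one GPU
gradient_accumulation_steps = 1    # assumption: library default

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

def epoch_at(step: int) -> float:
    """Fractional epoch reached after `step` optimization steps."""
    return step / steps_per_epoch

print(steps_per_epoch, total_steps, round(epoch_at(6500), 2))
```

With `eval_steps=500`, this corresponds to three evaluations per epoch.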

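Specify the trainer. The multi-label classification loss is selected through `problem_type='multi_label_classification'` on the model, so the standard `Trainer` is used.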
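
For multi-label classification, the loss is a binary cross-entropy applied independently to each label's logit; with `problem_type='multi_label_classification'`, the Transformers classification head is expected to use `torch.nn.BCEWithLogitsLoss` for this. The following is a plain-Python sketch of that formula on made-up numbers, not the library code.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all label positions, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Toy example with three labels.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Each label is treated as its own yes/no decision, which is why the labels must be floats and why several labels can be active for one case.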

In [15]:
from torch import cuda
cuda.empty_cache()

In [16]:
from transformers import EarlyStoppingCallback
from transformers import Trainer
hist = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    eval_dataset=tokenized_data['validation'],
    train_dataset=tokenized_data['train'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=5)])
hist.train()

Using cuda_amp half precision backend
***** Running training *****
  Num examples = 9000
  Num Epochs = 20
  Instantaneous batch size per device = 6
  Total train batch size (w. parallel, distributed & accumulation) = 6
  Gradient Accumulation steps = 1
  Total optimization steps = 30000
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Macro-f1 | Micro-f1 |
|---|---|---|---|---|
| 500 | 0.2586 | 0.22316 | 0.416768 | 0.625095 |
| 1000 | 0.17 | 0.18816 | 0.511461 | 0.719033 |
| 1500 | 0.159 | 0.180757 | 0.553979 | 0.731597 |
| 2000 | 0.1329 | 0.172609 | 0.549759 | 0.737791 |
| 2500 | 0.1307 | 0.165109 | 0.71096 | 0.784973 |
| 3000 | 0.1304 | 0.153277 | 0.65808 | 0.781669 |
| 3500 | 0.105 | 0.165728 | 0.703103 | 0.764331 |
| 4000 | 0.11 | 0.152246 | 0.688728 | 0.799706 |
| 4500 | 0.1062 | 0.153134 | 0.696974 | 0.792913 |
| 5000 | 0.0837 | 0.155447 | 0.735737 | 0.800147 |


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6
Saving model checkpoint to /ecthr_b_classsification/checkpoint-500
Configuration saved in /ecthr_b_classsification/checkpoint-500/config.json
Model weights saved in /ecthr_b_classsification/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /ecthr_b_classsification/checkpoint-500/tokenizer_config.json
Special tokens file saved in /ecthr_b_classsification/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6
Saving model checkpoint to /ecthr_b_classsification/checkpoint-1000
Configuration saved in /ecthr_b_classsification/checkpoint-1000/config.json
Model weights saved in /ecthr_b_classsification/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /ecthr_b_classsification/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /ecthr_b_classsification/checkpoint-1000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6
Saving model checkpoint to /ecthr_b_classsification/checkpoint-6500
Configuration saved in /ecthr_b_classsification/checkpoint-6500/config.json
Model weights saved in /ecthr_b_classsification/checkpoint-6500/pytorch_model.bin
tokenizer config file saved in /ecthr_b_classsification/checkpoint-6500/tokenizer_config.json
Special tokens file saved in /ecthr_b_classsification/checkpoint-6500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from /ecthr_b_classsification/checkpoint-5000 (score: 0.8001474926253687).


TrainOutput(global_step=6500, training_loss=0.12541871590797718, metrics={'train_runtime': 5856.5242, 'train_samples_per_second': 30.735, 'train_steps_per_second': 5.122, 'total_flos': 2.2917757943808e+16, 'train_loss': 0.12541871590797718, 'epoch': 4.33})

In [17]:
hist.evaluate(eval_dataset=tokenized_data['validation'])

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6


{'eval_loss': 0.15544691681861877,
 'eval_macro-f1': 0.7357372124149993,
 'eval_micro-f1': 0.8001474926253687,
 'eval_runtime': 39.8864,
 'eval_samples_per_second': 25.071,
 'eval_steps_per_second': 4.187,
 'epoch': 4.33}

# Continued training with frozen layers

In [None]:
for name, param in model.named_parameters():
    if name.startswith("longformer."):  # freeze the encoder; change the prefix to freeze other parts
        param.requires_grad = False
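
To show which parameters remain trainable after this prefix-based freeze, here is a dependency-free sketch on a hand-written list of parameter names. The classifier names come from the "newly initialized" warning earlier in the notebook; the two encoder names are illustrative, and the helper `trainable_after_freezing` is hypothetical.

```python
# Illustrative subset of parameter names, not the full model.
param_names = [
    "longformer.embeddings.word_embeddings.weight",
    "longformer.encoder.layer.0.attention.self.query.weight",
    "classifier.dense.weight",
    "classifier.dense.bias",
    "classifier.out_proj.weight",
    "classifier.out_proj.bias",
]

def trainable_after_freezing(names, frozen_prefix="longformer."):
    """Names that keep requires_grad=True when everything under the prefix is frozen."""
    return [n for n in names if not n.startswith(frozen_prefix)]

trainable = trainable_after_freezing(param_names)
print(trainable)
```

On the real model, `sum(p.numel() for p in model.parameters() if p.requires_grad)` gives the number of weights that are still updated.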

In [None]:
model

In [None]:
for name, param in model.named_parameters():
    print(name, param.requires_grad)

In [None]:
from transformers import EarlyStoppingCallback
from transformers import Trainer
froz = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    eval_dataset=tokenized_data['validation'],
    train_dataset=tokenized_data['train'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=5)])
froz.train()

In [None]:
froz.evaluate(eval_dataset=tokenized_data['validation'])

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6


{'eval_loss': 0.15733712911605835,
 'eval_macro-f1': 0.7414001299967447,
 'eval_micro-f1': 0.8,
 'eval_runtime': 38.2757,
 'eval_samples_per_second': 26.126,
 'eval_steps_per_second': 4.363,
 'epoch': 2.5}

# ECtHR Task A

In [None]:
dataset = load_dataset("lex_glue", "ecthr_a")

In [None]:
tokenizer = AutoTokenizer.from_pretrained("danielsaggau/simcse_longformer_ecthr_b", use_auth_token=True, use_fast=True)
# tokenizer =  AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForSequenceClassification.from_pretrained("danielsaggau/simcse_longformer_ecthr_b", num_labels=10, problem_type='multi_label_classification')

In [None]:
for name, param in model.named_parameters():
    if name.startswith("longformer."):  # freeze the encoder; change the prefix to freeze other parts
        param.requires_grad = False

In [None]:
tokenized_data = dataset.map(preprocess_function, batched=True,remove_columns=['text'])

In [None]:
# cast label IDs to floats
import torch 
tokenized_data.set_format("torch")
tokenized_data = (tokenized_data
          .map(lambda x : {"float_labels": x["labels"].to(torch.float)}, remove_columns=["labels"])
          .rename_column("float_labels", "labels"))

In [None]:
from transformers import EarlyStoppingCallback
from transformers import Trainer
froz_A = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    eval_dataset=tokenized_data['validation'],
    train_dataset=tokenized_data['train'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=5)])
froz_A.train()

Using cuda_amp half precision backend
***** Running training *****
  Num examples = 9000
  Num Epochs = 20
  Instantaneous batch size per device = 6
  Total train batch size (w. parallel, distributed & accumulation) = 6
  Gradient Accumulation steps = 1
  Total optimization steps = 30000


| Step | Training Loss | Validation Loss | Macro-f1 | Micro-f1 |
|---|---|---|---|---|
| 250 | 0.3993 | 0.299194 | 0.079324 | 0.282774 |
| 500 | 0.2626 | 0.279984 | 0.082493 | 0.299776 |
| 750 | 0.2445 | 0.265523 | 0.083184 | 0.292487 |
| 1000 | 0.2255 | 0.254524 | 0.109528 | 0.318283 |
| 1250 | 0.2189 | 0.246292 | 0.143195 | 0.363313 |
| 1500 | 0.2088 | 0.236498 | 0.18359 | 0.395944 |
| 1750 | 0.2009 | 0.230766 | 0.229706 | 0.441941 |
| 2000 | 0.1991 | 0.227964 | 0.2337 | 0.440781 |
| 2250 | 0.1945 | 0.222225 | 0.251685 | 0.464716 |
| 2500 | 0.1905 | 0.220647 | 0.259803 | 0.47664 |


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6
Saving model checkpoint to /slbert_ecthr_b_classsification/checkpoint-500
Configuration saved in /slbert_ecthr_b_classsification/checkpoint-500/config.json
Model weights saved in /slbert_ecthr_b_classsification/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /slbert_ecthr_b_classsification/checkpoint-500/tokenizer_config.json
Special tokens file saved in /slbert_ecthr_b_classsification/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 6
Saving model checkpoint to /slbert_ecthr_b_classsification/checkpoint-1000
Configuration saved in /slbert_ecthr_b_classsification/checkpoint-1000/config.json
Model weights saved in /slbert_ecthr_b_classsification/checkpoint-1000/pytorch_model.bin
tokenizer config file saved i