<a href="https://colab.research.google.com/github/danielsaggau/IR_LDC/blob/main/model/SCOTUS/scotus_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers datasets
import torch

In [10]:
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EvalPrediction,
    HfArgumentParser,
    TrainingArguments,
    default_data_collator,
    set_seed,
    EarlyStoppingCallback,
    Trainer
)

In [11]:
from transformers import TrainerCallback
from datasets import load_metric
import numpy as np
import torch

In [12]:
from datasets import load_dataset
dataset = load_dataset("lex_glue", "scotus")

Downloading and preparing dataset lex_glue/scotus to /root/.cache/huggingface/datasets/lex_glue/scotus/1.0.0/8a66420941bf6e77a7ddd4da4d3bfb7ba88ef48c1d55302a568ac650a095ca3a...


Generating train split:   0%|          | 0/5000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1400 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1400 [00:00<?, ? examples/s]

Dataset lex_glue downloaded and prepared to /root/.cache/huggingface/datasets/lex_glue/scotus/1.0.0/8a66420941bf6e77a7ddd4da4d3bfb7ba88ef48c1d55302a568ac650a095ca3a. Subsequent calls will reuse this data.


In [13]:
!python -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('XXXX')"

In [14]:
tokenizer = AutoTokenizer.from_pretrained(
    'danielsaggau/longformer_simcse_scotus', use_auth_token=True, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(
    'danielsaggau/longformer_simcse_scotus', use_auth_token=True, num_labels=14)

Some weights of LongformerForSequenceClassification were not initialized from the model checkpoint at danielsaggau/longformer_simcse_scotus and are newly initialized: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [16]:
tokenized_data = dataset.map(preprocess_function, batched=True)



In [35]:
def compute_metrics(eval_pred):
    """Return micro-F1, macro-F1 and accuracy for one evaluation pass."""
    f1_metric = load_metric("f1")
    accuracy_metric = load_metric("accuracy")
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    micro_f1 = f1_metric.compute(predictions=predictions, references=labels, average="micro")["f1"]
    macro_f1 = f1_metric.compute(predictions=predictions, references=labels, average="macro")["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    return {"f1-micro": micro_f1, "f1-macro": macro_f1, "accuracy": accuracy}
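
For a single-label multiclass task like this one, micro-averaged F1 and accuracy are the same number, which is why the F1-micro and Accuracy columns in the training log coincide. The following is a minimal pure-Python sketch of that relationship. The labels are made up (not taken from the SCOTUS data), the helper name `f1_scores` is illustrative, and this is not the implementation behind `load_metric`.

```python
from collections import Counter

def f1_scores(predictions, references):
    """Return (micro_f1, macro_f1, accuracy) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, ref in zip(predictions, references):
        if pred == ref:
            tp[ref] += 1
        else:
            # One wrong prediction is a false positive for the predicted
            # class and a false negative for the true class.
            fp[pred] += 1
            fn[ref] += 1

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    # Micro-average: pool the counts over all classes, then compute F1 once.
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)

    # Macro-average: compute F1 per class, then take the unweighted mean.
    classes = sorted(set(predictions) | set(references))
    per_class = [2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes]
    macro = sum(per_class) / len(per_class)

    accuracy = total_tp / len(references)
    return micro, macro, accuracy

# Toy data, purely illustrative.
toy_references = [0, 0, 1, 1, 2, 2]
toy_predictions = [0, 0, 1, 0, 2, 1]
micro, macro, accuracy = f1_scores(toy_predictions, toy_references)
print(micro, macro, accuracy)
```

Because every error adds exactly one false positive and one false negative, the pooled counts reduce to correct / total. Macro-F1 weights each class equally, so it drops when rare classes are predicted poorly.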

In [41]:
training_args = TrainingArguments(
    output_dir="/slbert_scotus_classsification_scotus_pretrain_frozen",
    learning_rate=3e-5,
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    num_train_epochs=8,
    weight_decay=0.01,
    save_strategy="steps",
    evaluation_strategy="steps",
    push_to_hub=True,
    fp16=True,
    eval_steps=250,
    metric_for_best_model="f1-micro",
    save_total_limit=5,
    greater_is_better=True,
    load_best_model_at_end=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [18]:
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) # fp16
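
`pad_to_multiple_of=8` pads every batch to its longest sequence, rounded up to the next multiple of 8, which suits fp16 Tensor Core kernels. The sketch below mimics that behaviour in plain Python. It is not the `DataCollatorWithPadding` implementation; `pad_batch` is a hypothetical helper and the pad id of 1 is an illustrative assumption.

```python
def pad_batch(sequences, pad_id=1, multiple=8):
    """Pad token-id lists to the longest length, rounded up to `multiple`."""
    longest = max(len(seq) for seq in sequences)
    # Ceiling division, then scale back up to the next multiple.
    target = -(-longest // multiple) * multiple
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of length 4 and 10 (token ids are made up).
batch = pad_batch([[0, 11, 12, 2], [0, 21, 22, 23, 24, 25, 26, 27, 28, 2]])
print([len(ids) for ids in batch["input_ids"]])
```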

In [39]:
for name, param in model.named_parameters():
    # Freeze the Longformer encoder; only the classification head stays trainable.
    if name.startswith("longformer."):
        param.requires_grad = False
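
With the encoder frozen, only the classification head is trained. The training log reports 269838 trainable parameters, which is consistent with a head made of a `dense` layer and an `out_proj` layer (the names in the model-loading warning) over a hidden size of 512 and 14 labels. The hidden size is inferred from the count rather than stated in the notebook, so treat it as an assumption. A quick arithmetic check:

```python
def classification_head_params(hidden_size, num_labels):
    """Weights plus biases of a dense (hidden -> hidden) and out_proj (hidden -> labels) layer."""
    dense = hidden_size * hidden_size + hidden_size
    out_proj = hidden_size * num_labels + num_labels
    return dense + out_proj

ASSUMED_HIDDEN_SIZE = 512  # assumption: inferred from the logged count, not read from the config
NUM_LABELS = 14            # matches num_labels=14 used when loading the model

head_params = classification_head_params(ASSUMED_HIDDEN_SIZE, NUM_LABELS)
print(head_params)  # 269838
```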

In [42]:
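trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    # The test split drives periodic evaluation, early stopping and best-model
    # selection here; the validation split is evaluated separately after training.
    eval_dataset=tokenized_data["test"],
    train_dataset=tokenized_data["train"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()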

/slbert_scotus_classsification_scotus_pretrain_frozen is already a clone of https://huggingface.co/danielsaggau/slbert_scotus_classsification_scotus_pretrain_frozen. Make sure you pull the latest changes with `repo.git_pull()`.
Using cuda_amp half precision backend
***** Running training *****
  Num examples = 5000
  Num Epochs = 8
  Instantaneous batch size per device = 6
  Total train batch size (w. parallel, distributed & accumulation) = 6
  Gradient Accumulation steps = 1
  Total optimization steps = 6672
  Number of trainable parameters = 269838
The following columns in the training set don't have a corresponding argument in `LongformerForSequenceClassification.forward` and have been ignored: text. If text are not expected by `LongformerForSequenceClassification.forward`,  you can safely ignore this message.
Initializing global attention on CLS token...


Step,Training Loss,Validation Loss,F1-micro,F1-macro,Accuracy
250,No log,1.80928,0.726429,0.582726,0.726429
500,0.075700,1.769787,0.728571,0.584886,0.728571
750,0.075700,1.66356,0.73,0.585309,0.73
1000,0.327800,1.596937,0.728571,0.583179,0.728571
1250,0.327800,1.596879,0.731429,0.585849,0.731429
1500,0.164800,1.583458,0.734286,0.592274,0.734286
1750,0.164800,1.50748,0.732143,0.590091,0.732143
2000,0.344400,1.493603,0.734286,0.596876,0.734286
2250,0.344400,1.481413,0.735714,0.595304,0.735714
2500,0.277000,1.455464,0.735714,0.597343,0.735714


Streaming output truncated to the last 5000 lines.

TrainOutput(global_step=4000, training_loss=0.2500329294204712, metrics={'train_runtime': 2206.1988, 'train_samples_per_second': 18.131, 'train_steps_per_second': 3.024, 'total_flos': 1.4089988001191808e+16, 'train_loss': 0.2500329294204712, 'epoch': 4.8})
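
The step counts in the log follow from the dataset size and the batch size. Here is a small sketch of that arithmetic, assuming a single device and no gradient accumulation (both as logged). The function names are illustrative and not part of the `Trainer` API.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch; the last, smaller batch still counts as a step."""
    return math.ceil(num_examples / batch_size)

def total_steps(num_examples, batch_size, num_epochs):
    return steps_per_epoch(num_examples, batch_size) * num_epochs

per_epoch = steps_per_epoch(5000, 6)   # 834
planned = total_steps(5000, 6, 8)      # 6672, as in "Total optimization steps"
epoch_at_stop = 4000 / per_epoch       # about 4.8, as in the TrainOutput
print(per_epoch, planned, round(epoch_at_stop, 2))
```

Training ended at step 4000 of the planned 6672, that is, during the fifth epoch rather than after all eight.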

In [43]:
eval_dataset = tokenized_data["validation"]
trainer.evaluate(eval_dataset=eval_dataset)

***** Running Evaluation *****
  Num examples = 1400
  Batch size = 6
The following columns in the evaluation set don't have a corresponding argument in `LongformerForSequenceClassification.forward` and have been ignored: text. If text are not expected by `LongformerForSequenceClassification.forward`,  you can safely ignore this message.
Initializing global attention on CLS token...



{'eval_loss': 1.2117364406585693,
 'eval_f1-micro': 0.7585714285714287,
 'eval_f1-macro': 0.6772390142590096,
 'eval_accuracy': 0.7585714285714286,
 'eval_runtime': 63.7377,
 'eval_samples_per_second': 21.965,
 'eval_steps_per_second': 3.671,
 'epoch': 4.8}