# Lightweight Fine-Tuning Project

* PEFT technique: LoRA
* Model: distilbert-base-german-cased
* Evaluation approach: Accuracy
* Fine-tuning dataset: SH108/german-court-decisions

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [11]:
# Fix - See https://knowledge.udacity.com/questions/1060123

!pip install -U datasets

!pip install -q "datasets==2.15.0"

# Restart the kernel after this cell so the pinned datasets version is the one that gets imported

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Collecting datasets
  Using cached datasets-3.3.2-py3-none-any.whl (485 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.15.0
    Uninstalling datasets-2.15.0:
      Successfully uninstalled datasets-2.15.0
Successfully installed datasets-3.3.2



In [12]:
# Alternative configuration (English SMS spam), kept for reference:
#dataset_name = "sms_spam"
#text_field = "sms"
#limit_data = "" #unlimited
#label_key = "label"
#remove_columns = []
#id2label = {0: "not spam", 1: "spam"}
#label2id = {"not spam": 0, "spam": 1}
#model_name = "distilbert-base-uncased"

# =========================================

dataset_name="SH108/german-court-decisions"
text_field="decision"
limit_data="[:10000]"
label_key = "convicted"
remove_columns = ['court', 'state', 'costs to defendant', 'dismissed', 'date', 'costs to plaintiff', 'offense']
id2label={0: "not convicted", 1: "convicted"}
label2id={"not convicted": 0, "convicted": 1}
model_name="distilbert-base-german-cased"


In [13]:
# Load the dataset
from datasets import load_dataset

# The dataset ships only a "train" split, so create the train/test split here
ds = load_dataset(dataset_name, split=f"train{limit_data}").train_test_split(
    test_size=0.2, shuffle=True, seed=42
)

# Remove some columns
ds = ds.remove_columns(remove_columns)

# Rename the target column to "label", the column name the Trainer expects
if label_key is not None and label_key != "label":
    ds = ds.rename_column(label_key, "label")

splits = ["train", "test"]

ds

DatasetDict({
    train: Dataset({
        features: ['decision', 'label'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['decision', 'label'],
        num_rows: 2000
    })
})
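The 80/20 split above can be pictured as a seeded shuffle of row indices followed by a cut. The sketch below is a plain-Python illustration, not the algorithm `datasets` uses internally; `split_indices` is a hypothetical helper, and its shuffle order will differ from `train_test_split` even with the same seed.

```python
import random

def split_indices(num_rows, test_size=0.2, seed=42):
    """Return (train, test) index lists from a seeded shuffle (illustrative only)."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    n_test = round(num_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10_000)
print(len(train_idx), len(test_idx))  # 8000 2000
```

Fixing the seed is what makes the later comparison between training runs meaningful, since every run then sees the same held-out rows.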

### Preprocess the data

In [14]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

ds_tokenized = {}
for split in splits:
    ds_tokenized[split] = ds[split].map(
        lambda x: tokenizer(x[text_field], truncation=True, padding="max_length"), batched=True
    )




In [15]:
ds_tokenized

{'train': Dataset({
     features: ['decision', 'label', 'input_ids', 'attention_mask'],
     num_rows: 8000
 }),
 'test': Dataset({
     features: ['decision', 'label', 'input_ids', 'attention_mask'],
     num_rows: 2000
 })}
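`truncation=True, padding="max_length"` brings every example to the same length and adds an `attention_mask` that marks real tokens with 1 and padding with 0. The sketch below mimics that on bare token-id lists. `pad_or_truncate` is a hypothetical helper, the ids, `pad_id=0`, and `max_length=8` are illustrative values, and the real tokenizer additionally keeps its special tokens in place when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad a token-id list to exactly max_length; return ids and mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)          # 1 marks a real token
    n_pad = max_length - len(ids)  # 0 when the input was truncated
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every row is already padded to the maximum length during `map`, the `DataCollatorWithPadding` used later has nothing left to pad. Dropping `padding="max_length"` here would let the collator pad each batch only to its longest member, which usually shortens batches.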

### Setup the model

In [16]:
from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    force_download=False, #Suppress FutureWarning
)

# Freeze the base model so that only the classification head is trained
for param in model.base_model.parameters():
    param.requires_grad = False

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Define metric: accuracy over the argmax of the logits
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

In [18]:
# Factory for Training Arguments
from transformers import TrainingArguments
def create_training_args(output_dir):
    return TrainingArguments(
        output_dir=output_dir,
        learning_rate=1e-4,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=2,
        weight_decay=0.01,
        load_best_model_at_end=True,
        #remove_unused_columns=False
    )

In [19]:
# Training without PEFT / LoRA
import numpy as np

from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=create_training_args(f"./data/{model_name}_{dataset_name}"),
    train_dataset=ds_tokenized["train"],
    eval_dataset=ds_tokenized["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [20]:
# Evaluate before any training (the classification head is still randomly initialized)
performance_pretrained = trainer.evaluate()
performance_pretrained

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.750350832939148,
 'eval_accuracy': 0.1035,
 'eval_runtime': 34.1987,
 'eval_samples_per_second': 58.482,
 'eval_steps_per_second': 3.655}
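An accuracy of 0.1035 on a two-class problem is well below what balanced classes would give by chance. One plausible reading (an assumption, since the notebook does not show the label distribution) is that the classes are imbalanced and the untrained head mostly predicts the rarer one. A majority-class baseline is a useful reference point in that case. The sketch below runs on made-up labels; `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

toy_labels = [1] * 9 + [0]  # illustrative only, not the real dataset
label, accuracy = majority_baseline(toy_labels)
print(label, accuracy)  # 1 0.9
```

On the real data the same helper could be called as `majority_baseline(ds["test"]["label"])`. Any fine-tuned accuracy is only informative if it clearly beats that number.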

In [21]:
# Train
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 1 | 0.2301 | 0.167689 | 0.9245 |
| 2 | 0.1768 | 0.153241 | 0.932 |


Checkpoint destination directory ./data/distilbert-base-german-cased_SH108/german-court-decisions/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/distilbert-base-german-cased_SH108/german-court-decisions/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1000, training_loss=0.20341574096679688, metrics={'train_runtime': 352.8866, 'train_samples_per_second': 45.34, 'train_steps_per_second': 2.834, 'total_flos': 2119478378496000.0, 'train_loss': 0.20341574096679688, 'epoch': 2.0})
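The `global_step=1000` and the `checkpoint-500` / `checkpoint-1000` directories follow from the split size and the training arguments. The arithmetic below assumes a single device and no gradient accumulation (neither is set explicitly in the notebook); `total_steps` is a hypothetical helper.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

steps = total_steps(num_examples=8000, batch_size=16, epochs=2)
# Throughput check against train_runtime from the output above
samples_per_second = 8000 * 2 / 352.8866
print(steps, round(samples_per_second, 2))  # 1000 45.34
```

With `save_strategy="epoch"`, one checkpoint is written per 500 steps, which matches the two directory names in the warning.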

In [22]:
performance_trained = trainer.evaluate()
performance_trained

{'eval_loss': 0.1532410830259323,
 'eval_accuracy': 0.932,
 'eval_runtime': 33.7886,
 'eval_samples_per_second': 59.192,
 'eval_steps_per_second': 3.699,
 'epoch': 2.0}

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.
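As background for the cells below: LoRA leaves a weight matrix W frozen and learns a low-rank update, so a projection computes W x + (alpha / r) * B (A x), where A has shape r x d_in and B has shape d_out x r. The toy sketch below uses plain Python lists and rank 1 to keep the numbers readable; `matvec` and `lora_forward` are hypothetical helpers and this is not how PEFT implements the layer.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen projection plus scaled low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: d_in = d_out = 3, rank r = 1 (the notebook uses r=8, lora_alpha=32)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, 0.5, 0.0]]              # r x d_in
B_zero = [[0.0], [0.0], [0.0]]     # d_out x r, zero at initialization
B_trained = [[1.0], [0.0], [0.0]]  # made-up values standing in for a trained B
x = [2.0, 4.0, 6.0]

print(lora_forward(W, A, B_zero, x, alpha=4, r=1))     # [2.0, 4.0, 6.0]
print(lora_forward(W, A, B_trained, x, alpha=4, r=1))  # [14.0, 4.0, 6.0]
```

With B at zero the adapted layer reproduces the frozen layer exactly, so training starts from the pre-trained behavior. The ratio alpha / r (32 / 8 = 4 in the configuration below) scales how strongly the learned update contributes.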

In [23]:
from peft import LoraConfig, TaskType

from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model_lora_base = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    force_download=False, #Suppress FutureWarning
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)


# Create a PEFT Config for LoRA
# See https://knowledge.udacity.com/questions/1027860
lora_config = LoraConfig(
    r=8, # Rank
    lora_alpha=32,
    target_modules=['v_lin'],
    lora_dropout=0.1,
    task_type=TaskType.SEQ_CLS
)

# Get trainable PEFT model
from peft import get_peft_model
model_lora = get_peft_model(model_lora_base, lora_config)

model_lora.print_trainable_parameters()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,257,988 || all params: 68,066,308 || trainable%: 1.8481801598523604
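The printed trainable-parameter count can be reproduced by hand. The arithmetic below assumes the standard DistilBERT base dimensions (hidden size 768, six transformer blocks) and that the classification head is counted twice, once for the original module and once for the trainable copy kept for saving. That double counting is an inference from the numbers, not something the output states.

```python
hidden = 768      # DistilBERT hidden size (assumed from the base architecture)
layers = 6        # DistilBERT transformer blocks (assumed)
rank = 8          # r in LoraConfig
num_labels = 2

lora_per_layer = rank * hidden + hidden * rank   # A (r x d) plus B (d x r) on v_lin
lora_total = lora_per_layer * layers
pre_classifier = hidden * hidden + hidden        # weight + bias
classifier = hidden * num_labels + num_labels    # weight + bias
head = pre_classifier + classifier

# Assumption: the head is counted twice (original module plus trainable copy)
trainable = lora_total + 2 * head
print(lora_total, head, trainable)  # 73728 592130 1257988
```

Under these assumptions the LoRA matrices themselves account for only 73,728 parameters; most of the reported 1.85% comes from the classification head.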


In [24]:
import numpy as np

from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

trainer_lora = Trainer(
    model=model_lora,
    args=create_training_args(f"./data/{model_name}_{dataset_name}_lora"),
    train_dataset=ds_tokenized["train"],
    eval_dataset=ds_tokenized["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer_lora.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 1 | 0.1742 | 0.079775 | 0.9685 |
| 2 | 0.0764 | 0.058249 | 0.9735 |


Checkpoint destination directory ./data/distilbert-base-german-cased_SH108/german-court-decisions_lora/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/distilbert-base-german-cased_SH108/german-court-decisions_lora/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1000, training_loss=0.1252991943359375, metrics={'train_runtime': 770.9133, 'train_samples_per_second': 20.755, 'train_steps_per_second': 1.297, 'total_flos': 2152206630912000.0, 'train_loss': 0.1252991943359375, 'epoch': 2.0})

In [25]:
performance_lora = trainer_lora.evaluate()
performance_lora

{'eval_loss': 0.05824864283204079,
 'eval_accuracy': 0.9735,
 'eval_runtime': 41.7284,
 'eval_samples_per_second': 47.929,
 'eval_steps_per_second': 2.996,
 'epoch': 2.0}

In [26]:
model_lora.save_pretrained(model_name + "-lora")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [27]:
from peft import AutoPeftModelForSequenceClassification
lora_model = AutoPeftModelForSequenceClassification.from_pretrained(model_name + "-lora", force_download=False)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-german-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained(model_name,force_download=False)

decision="""
1.
Die Beklagte wird verurteilt, an die Klägerin 1.822,96 € für das vorgerichtliche Abmahnschreiben zuzüglich Zinsen in Höhe von 5 Prozentpunkten über dem jeweiligen Basiszinssatz seit dem 13.10.2017 zu zahlen.
2.
Wegen der Zahlungsmehrforderung wird die Klage abgewiesen.
3.
Die Beklagte hat die Kosten des Rechtsstreits zu tragen.
4.
Das Urteil ist für die Klägerin vorläufig vollstreckbar gegen Sicherheitsleistung in Höhe von 110 % des jeweils zu vollstreckenden Betrages.
Beschluss
Der Streitwert wird auf 60.000,00 € festgesetzt.
"""
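# Truncate to the tokenizer's maximum length; longer decisions would exceed the model's position embeddings
inputs = tokenizer(decision, return_tensors="pt", truncation=True)

# Make prediction
with torch.no_grad():
    outputs = lora_model(**inputs).logits
    probabilities = torch.nn.functional.softmax(outputs, dim=1)
    predicted_class = torch.argmax(probabilities, dim=1).item()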
    
# Print result
print(f"Classification: {id2label[int(predicted_class)]}")

Classification: convicted
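The prediction step turns two logits into probabilities with a softmax and picks the larger one. The dependency-free sketch below shows that mapping on hypothetical logits (the values are not taken from the model); `softmax` is written out by hand for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "not convicted", 1: "convicted"}
logits = [-1.2, 2.3]  # hypothetical values, not taken from the model
probabilities = softmax(logits)
predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
print(id2label[predicted], [round(p, 3) for p in probabilities])
```

Softmax preserves the ordering of the logits, so the predicted class would be the same without it. The probabilities are mainly useful for reporting how confident the model is.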


## Comparison

In [29]:
import pandas as pd
comparison = pd.DataFrame.from_dict(
    {"pretrained": performance_pretrained,
     "normal": performance_trained,
     "LoRA": performance_lora}
)
comparison

| Metric | pretrained | normal | LoRA |
| --- | --- | --- | --- |
| eval_loss | 0.750351 | 0.153241 | 0.058249 |
| eval_accuracy | 0.1035 | 0.932 | 0.9735 |
| eval_runtime | 34.1987 | 33.7886 | 41.7284 |
| eval_samples_per_second | 58.482 | 59.192 | 47.929 |
| eval_steps_per_second | 3.655 | 3.699 | 2.996 |
| epoch | | 2.0 | 2.0 |
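Accuracies close to 1 are easier to compare as error rates. The short calculation below uses only the `eval_accuracy` values from the table above.

```python
accuracy = {"pretrained": 0.1035, "normal": 0.932, "LoRA": 0.9735}  # from the table above

error = {name: round(1 - acc, 4) for name, acc in accuracy.items()}
relative_reduction = round(1 - error["LoRA"] / error["normal"], 3)
print(error)               # {'pretrained': 0.8965, 'normal': 0.068, 'LoRA': 0.0265}
print(relative_reduction)  # 0.61
```

The LoRA run makes roughly 61% fewer errors than the "normal" run on this test split. The two runs differ in more than the adapter, though: the "normal" run froze the entire base model and trained only the classification head, while the LoRA run loaded a 4-bit quantized base and also adapted the `v_lin` projections, so the gap is not a pure LoRA versus full fine-tuning comparison.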


In [30]:
!tar chvfz distilbert-base-german-cased-lora.tar.gz distilbert-base-german-cased-lora


distilbert-base-german-cased-lora/
distilbert-base-german-cased-lora/adapter_model.bin
distilbert-base-german-cased-lora/adapter_config.json
distilbert-base-german-cased-lora/README.md
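The same archive can be built without shelling out, which also avoids the tokenizers fork warning that `!tar` triggers. The sketch below uses the standard-library `tarfile` module with `dereference=True` to mirror the `h` flag of the `tar` command; `archive_directory` is a hypothetical helper, demonstrated on a throwaway directory with placeholder files rather than the real adapter weights.

```python
import tarfile
import tempfile
from pathlib import Path

def archive_directory(source_dir, archive_path):
    """Create a gzip-compressed tar of source_dir, following symlinks like `tar -h`."""
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz", dereference=True) as tar:
        tar.add(str(source), arcname=source.name)
    # Re-open the archive to list what was actually written
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sorted(tar.getnames())

# Demo on a throwaway directory with placeholder files (not real adapter weights)
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = Path(tmp) / "demo-lora"
    adapter_dir.mkdir()
    (adapter_dir / "adapter_config.json").write_text("{}")
    (adapter_dir / "README.md").write_text("placeholder")
    names = archive_directory(adapter_dir, Path(tmp) / "demo-lora.tar.gz")
    print(names)
```

For the notebook's adapter the call would look like `archive_directory(model_name + "-lora", model_name + "-lora.tar.gz")`, assuming the adapter directory sits in the working directory as it does above.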
