# BERT models for Swedish sentence classification
This notebook briefly presents the models used for sentence classification.
Training is run in subsequent notebooks.

## The models
The main model is a BERT language model pretrained on Swedish.
It is the first result deemed suitable in [this search for models pretrained on Swedish](https://huggingface.co/models?language=sv&sort=downloads&search=bert+swedish).

The second model is based on the first, but is set up for fine-tuning with LoRA.
LoRA was chosen out of Edvard's personal fascination with the technique, as it
- is often, due to memory constraints, the only feasible option for training models beyond a certain size
- can save a lot of memory if fine-tuning many models using the same base
- is based on a beautifully simple idea
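
The idea in the last bullet can be sketched in one line, following the formulation in the LoRA paper. The symbols below ($W_0$, $A$, $B$, $r$, $\alpha$, $d$, $k$) are notation chosen here for illustration and do not appear in this repository's code.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k}, \quad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)
$$

The pretrained weight $W_0$ stays frozen and only $A$ and $B$ are trained, so each adapted matrix contributes $r(d + k)$ trainable parameters instead of $dk$.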

In [1]:
from models import swedish_classifier

swedish_classifier.model??

Signature: swedish_classifier.model() -> transformers.models.bert.modeling_bert.BertForSequenceClassification
Source:
def model() -> BertForSequenceClassification:
    """BERT-based swedish classifier for fine-tuning."""
    return BertForSequenceClassification.from_pretrained(
        "KB/bert-base-swedish-cased", num_labels=2
    )
File:      ~/model-deployment-starter/models/swedish_classifier.py
Type:      function

In [2]:
swedish_classifier.lora_model??

Signature: swedish_classifier.lora_model(lora_r: int = 4) -> peft.peft_model.PeftModel
Source:
def lora_model(lora_r: int = 4) -> PeftModel:
    """BERT-based swedish classifier configured for LoRA fine-tuning."""
    return get_peft_model(
        model(),
        LoraConfig(
            "SEQ_CLS",
            target_modules=["query", "value"],
            r=lora_r,
            lora_alpha=2 * lora_r,
            lora_dropout=0.2,
            modules_to_save=["classifier"],
        )
    )
File:      ~/model-deployment-starter/models/swedish_classifier.py
Type:      function

## Training notes
The training routine in `models.swedish_classifier.train` uses early stopping: it halts once two epochs pass without an improvement in validation loss.
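
The stopping criterion can be illustrated with a minimal sketch. This is not the code in `models.swedish_classifier.train`; the function name `early_stopping_epoch`, the strict "lower than the best so far" comparison, and the loss values are all assumptions made for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` epochs in a row fail to improve on
    the best validation loss seen so far (assumed criterion).
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, made up for this example.
hypothetical_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.49]
stop_epoch = early_stopping_epoch(hypothetical_losses)
print(stop_epoch)
```

In this made-up run the best loss occurs at epoch 3, so the later improvement at epoch 6 is never reached. That is the usual trade-off of a short patience.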

Below are some notes on hyperparameter settings.
- Batch size is left at the `transformers` default of 8.
- Dropout probability in the main model is left at the default of 0.1.
- Learning rates were set experimentally to the largest value that still showed decreasing validation loss in the initial epochs, which turned out to be `1e-5` for the main model and `1e-4` for the LoRA one.
- For LoRA rank, the [original paper](https://arxiv.org/pdf/2106.09685) shows anything between 1 and 64 to be effective. The present code uses 4.
- For LoRA alpha, `2 * lora_r` is [common practice](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide); it keeps the adapter scaling factor `lora_alpha / r` fixed at 2 whatever the rank.
- LoRA dropout probability is set to 0.2, higher than the default, based on a hunch that the extra regularization is useful given the small amount of data.
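
The learning-rate rule in the list above can be written down as a small selection function. This is only a sketch of the stated rule, not the procedure actually used: the function name `largest_stable_lr`, the three-epoch window, and the loss curves are hypothetical.

```python
def largest_stable_lr(initial_val_losses, n_epochs=3):
    """Pick the largest learning rate whose validation loss keeps
    falling over the first `n_epochs` epochs, or None if none does."""
    stable = [
        lr
        for lr, losses in initial_val_losses.items()
        if len(losses) >= n_epochs
        # Every epoch in the window must improve on the previous one.
        and all(b < a for a, b in zip(losses[:n_epochs], losses[1:n_epochs]))
    ]
    return max(stable) if stable else None


# Hypothetical validation-loss curves per learning rate (not measured data).
hypothetical_runs = {
    1e-6: [0.69, 0.68, 0.67],
    1e-5: [0.69, 0.55, 0.48],
    1e-4: [0.69, 0.75, 0.90],
}
chosen_lr = largest_stable_lr(hypothetical_runs)
print(chosen_lr)
```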

## Percentage of parameters tuned in the LoRA model
The LoRA setup tunes only 0.12% of the parameters, which nearly eliminates the extra memory that training needs for gradients and optimizer state (about 3x the size of the model itself under full fine-tuning, assuming an Adam-style optimizer).
Look for `lora_A` and `lora_B` in the layers printed at the bottom of this notebook to understand how the fraction can be so small.
Full fine-tuning optimizes every layer, including weight matrices of size $768 \times 768$ and $3072 \times 768$ (hundreds of thousands to millions of parameters each). The LoRA setup freezes those and instead optimizes small adapter matrices of size $768 \times 4$ (a few thousand parameters each), added alongside the query and value projections, plus the classifier head.

In [3]:
swedish_classifier.lora_model().print_trainable_parameters()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at KB/bert-base-swedish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 148,994 || all params: 124,841,476 || trainable%: 0.1193
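
The trainable count printed above can be reproduced by hand from the LoRA configuration and the layer sizes shown in this notebook. The sketch below assumes that the adapter matrices carry no bias terms and that the classifier head (weight and bias) is trained in full. The total parameter count is copied from the output rather than derived.

```python
# Dimensions taken from the LoRA config and the layer printout.
hidden = 768            # in/out features of the query and value projections
rank = 4                # lora_r
num_layers = 12         # number of BertLayer blocks
targets_per_layer = 2   # "query" and "value"
num_labels = 2

# Each adapted projection gets lora_A (rank x hidden) and lora_B (hidden x rank).
lora_per_module = rank * hidden + hidden * rank
lora_total = num_layers * targets_per_layer * lora_per_module

# Classifier head listed in modules_to_save: weight plus bias.
classifier = hidden * num_labels + num_labels

trainable = lora_total + classifier
all_params = 124_841_476  # copied from the printed output, not derived here

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

The adapters account for 147,456 of the trainable parameters and the classifier head for the remaining 1,538.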


## Neural network layers

In [4]:
swedish_classifier.model()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at KB/bert-base-swedish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(50325, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [5]:
swedish_classifier.lora_model()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at KB/bert-base-swedish-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): BertForSequenceClassification(
      (bert): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(50325, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0-11): 12 x BertLayer(
              (attention): BertAttention(
                (self): BertSdpaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.2, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default