# Continued Pretraining Notebook

Continued pretraining is the process of taking a pre-existing language model and further training it on a domain-specific corpus to adapt its knowledge to a particular context.

### Notebook Overview

In this notebook, we continue pretraining SmolLM2 (135M) on SwissProt protein sequences to improve its grasp of protein language and its performance on downstream protein-related tasks.

- Model Card: [HuggingFaceTB/SmolLM2-135M](https://huggingface.co/HuggingFaceTB/SmolLM2-135M)
- Dataset Link: [khairi/uniprot-swissprot](https://huggingface.co/datasets/khairi/uniprot-swissprot)

In [33]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from transformers.trainer import Trainer
from transformers.training_args import TrainingArguments, OptimizerNames
from peft import get_peft_model, LoraConfig, TaskType

In [None]:
# uncomment to use wandb logging
# !wandb login

In [None]:
# uncomment to enable pushing to the HuggingFace Hub
# !huggingface-cli login

In [42]:
model_id = "HuggingFaceTB/SmolLM2-135M" # use a smaller model for testing
dataset_id = "khairi/uniprot-swissprot"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # Set pad token to eos token for open-ended generation

loading file vocab.json from cache at /home/khairi/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M/snapshots/93efa2f097d58c2a74874c7e644dbc9b0cee75a2/vocab.json
loading file merges.txt from cache at /home/khairi/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M/snapshots/93efa2f097d58c2a74874c7e644dbc9b0cee75a2/merges.txt
loading file tokenizer.json from cache at /home/khairi/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M/snapshots/93efa2f097d58c2a74874c7e644dbc9b0cee75a2/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /home/khairi/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M/snapshots/93efa2f097d58c2a74874c7e644dbc9b0cee75a2/special_tokens_map.json
loading file tokenizer_config.json from cache at /home/khairi/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M/snapshots/93efa2f097d58c2a74874c7e644dbc9b0cee75a2/tokenizer_config.json
loading file chat_template.jinja f

In [19]:
model = AutoModelForCausalLM.from_pretrained(model_id)

model

loading configuration file config.json from cache at /home/khairi/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M/snapshots/93efa2f097d58c2a74874c7e644dbc9b0cee75a2/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 576,
  "initializer_range": 0.041666666666666664,
  "intermediate_size": 1536,
  "is_llama_config": true,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 9,
  "num_hidden_layers": 30,
  "num_key_value_heads": 3,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_interleaved": false,
  "rope_scaling": null,
  "rope_theta": 100000,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.0",
  "use_cache": true,
  "vocab_size": 49152
}

loading weights file model.safetensors from

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb): LlamaRotaryEm

#### Efficient finetuning with LoRA
##### Full Fine-tuning vs LoRA:
Full fine-tuning updates all the parameters of a pre-trained model, which is computationally expensive and memory-intensive, especially for large models. In contrast, LoRA (Low-Rank Adaptation) introduces small adapter layers into the model and only trains these new weights, leaving the original model parameters frozen. This drastically reduces compute and storage requirements while still allowing the model to adapt to the new domain.

##### Our Approach:
Instead of fine-tuning the entire SmolLM2 model, we train only the LoRA adapters. Because protein sequences differ sharply from the text seen during the original pretraining, we also attach adapters to the input embeddings (`embed_tokens`) and the output head (`lm_head`), so the model can adapt its representations of amino-acid tokens.
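
The low-rank update described above can be illustrated with a minimal pure-Python sketch. It is not PEFT's implementation: the helper names `matvec` and `lora_forward` and the toy matrices are hypothetical, chosen only to show that the frozen weight `W` is left untouched while the adapter adds `(alpha / r) * B(Ax)` on top. The toy values `alpha=4, r=1` give the same scaling factor (4.0) as the `lora_alpha=32`, `r=8` setting used in this notebook.

```python
# Minimal sketch of a LoRA-adapted linear layer (illustrative, not the PEFT code).
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x); W is frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy shapes: 3 inputs -> 2 outputs, rank 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight (2 x 3)
A = [[0.5, 0.5, 0.5]]                     # down-projection (r x 3)
B_zero = [[0.0], [0.0]]                   # up-projection (2 x r), zero-initialised
x = [1.0, 2.0, 3.0]

y_base = matvec(W, x)
y_init = lora_forward(W, A, B_zero, x, alpha=4, r=1)
print(y_base, y_init)
```

With `B` initialised to zero, the adapted layer reproduces the base model's output exactly, so training starts from the pretrained behaviour and only gradually departs from it.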

In [20]:
lora_rank = 8
lora_alpha = 32
lora_dropout = 0.1
modules = [
    "q_proj", "v_proj", "k_proj", "o_proj", # attention layer
    "up_proj", "down_proj", "gate_proj", # feedforward layer
    "lm_head", "embed_tokens" # update embeddings when learning a new language
]

In [21]:
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=lora_rank,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    target_modules=modules
)

lora_config

LoraConfig(task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, inference_mode=False, r=8, target_modules={'gate_proj', 'lm_head', 'down_proj', 'up_proj', 'embed_tokens', 'k_proj', 'v_proj', 'o_proj', 'q_proj'}, exclude_modules=None, lora_alpha=32, lora_dropout=0.1, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', trainable_token_indices=None, loftq_config={}, eva_config=None, corda_config=None, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False), lora_bias=False)

In [22]:
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

trainable params: 3,237,888 || all params: 137,752,896 || trainable%: 2.3505
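
The trainable-parameter count above can be reproduced by hand. The sketch below assumes each adapted module of shape `(in, out)` contributes `r * (in + out)` parameters (one `r x in` matrix and one `out x r` matrix), and that `lm_head` maps the 576-dimensional hidden state to the 49,152-token vocabulary, which is inferred from the config rather than shown in the truncated printout. The constant and helper names are ours.

```python
# Shapes read from the model printout: (in_features, out_features) per module.
HIDDEN, KV_DIM, MLP_DIM, VOCAB = 576, 192, 1536, 49152
NUM_LAYERS, RANK = 30, 8

per_layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_dim, out_dim, rank=RANK):
    """Parameters in one adapter: A is (rank x in_dim), B is (out_dim x rank)."""
    return rank * (in_dim + out_dim)


decoder_total = NUM_LAYERS * sum(lora_params(i, o) for i, o in per_layer_shapes.values())
embed_total = lora_params(VOCAB, HIDDEN)    # adapter on embed_tokens
lm_head_total = lora_params(HIDDEN, VOCAB)  # adapter on lm_head (assumed shape)
trainable = decoder_total + embed_total + lm_head_total

all_params = 137_752_896  # "all params" as reported by print_trainable_parameters()
print(f"trainable params: {trainable:,} ({100 * trainable / all_params:.4f}%)")
# trainable params: 3,237,888 (2.3505%)
```

Roughly a quarter of the adapter parameters (795,648 of 3,237,888) sit on the embedding and output layers, which is the cost of including `embed_tokens` and `lm_head` in `target_modules`.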


In [23]:
dataset = load_dataset(dataset_id)

dataset

DatasetDict({
    train: Dataset({
        features: ['Entry', 'Sequence'],
        num_rows: 455692
    })
    validation: Dataset({
        features: ['Entry', 'Sequence'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['Entry', 'Sequence'],
        num_rows: 10000
    })
})

### Dataset Processing Steps

##### Filter by Length
Keep only sequences of at most 512 amino acids so that every training example stays within a bounded length.

##### Format Protein Sequences
- Separate each amino acid with a space so that every amino acid is treated as an individual token.
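- Enclose the full sequence in the markers `<protein>` at the start and `</protein>` at the end. These markers are not registered as special tokens in this notebook, so the tokenizer splits them into ordinary sub-word tokens.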

Example:
```
Original:  MIGMLESLQH
Formatted: <protein> M I G M L E S L Q H </protein>
```
##### Tokenization
Convert the formatted sequences into input IDs using the model’s tokenizer, producing the final input ready for training.

In [26]:
def filter_sequence_by_length(example):
    """
    Keep sequences of at most 512 amino acids (characters, not tokens).

    Args:
        example (Dict[str, Any]): row from the dataset

    Returns:
        bool: True if the sequence length is less than or equal to 512, False otherwise
    """
    return len(example['Sequence']) <= 512

In [None]:
eos_token = tokenizer.eos_token
def format_sequence(example):
    """
    Format the sequence by adding spaces between each character.
    Each amino acid is treated as a separate token.
    Wrap the sequence in the <protein> and </protein> markers.
    Append the eos token to the end of the sequence.

    Args:
        example (Dict[str, Any]): row from the dataset

    Returns:
        Dict[str, Any]: row with formatted sequence
    """
    sequence = ' '.join(example['Sequence'])  # a str already iterates character by character
    sequence = f'<protein> {sequence} </protein> {eos_token}'
    
    return {'text': sequence}
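
As a quick sanity check, the filtering and formatting steps can be exercised on toy rows without loading the tokenizer. This is a standalone sketch: `EOS_TOKEN` is hard-coded to `<|endoftext|>` (the notebook reads it from `tokenizer.eos_token`), and the helper names `keep_row` and `format_row` and the two toy entries are hypothetical.

```python
EOS_TOKEN = "<|endoftext|>"  # assumed here; the notebook uses tokenizer.eos_token
MAX_RESIDUES = 512


def keep_row(row):
    """Mirror of the length filter: count amino acids, not tokens."""
    return len(row["Sequence"]) <= MAX_RESIDUES


def format_row(row):
    """Mirror of the formatting step: space-separate residues and add markers."""
    spaced = " ".join(row["Sequence"])
    return {"text": f"<protein> {spaced} </protein> {EOS_TOKEN}"}


rows = [
    {"Entry": "toy-short", "Sequence": "MIGMLESLQH"},
    {"Entry": "toy-long", "Sequence": "M" * 600},  # dropped by the filter
]
formatted = [format_row(row) for row in rows if keep_row(row)]
print(formatted[0]["text"])
# <protein> M I G M L E S L Q H </protein> <|endoftext|>
```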

In [36]:
def tokenize_input(example):
    """
    Tokenize the input sequence.

    Args:
        example (Dict[str, Any]): row from the dataset

    Returns:
        Dict[str, Any]: row with tokenized input_ids and attention_mask
    """
    tokenized = tokenizer(
        example['text'],
        # Pads every sequence in the mapped batch to that batch's longest sequence.
        padding=True,
    )
    
    return {
        'input_ids': tokenized['input_ids'],
        'attention_mask': tokenized['attention_mask']
    }

In [40]:
train_data = dataset['train'] \
    .filter(filter_sequence_by_length) \
    .map(format_sequence) \
    .map(tokenize_input, batched=True, batch_size=1024) \
    .select_columns(['input_ids', 'attention_mask'])

train_data

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 455692
})

In [41]:
valid_data = dataset['validation'] \
    .filter(filter_sequence_by_length) \
    .map(format_sequence) \
    .map(tokenize_input, batched=True, batch_size=1024) \
    .select_columns(['input_ids', 'attention_mask'])
    
valid_data

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 2000
})

In [34]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

data_collator

DataCollatorForLanguageModeling(tokenizer=GPT2TokenizerFast(name_or_path='HuggingFaceTB/SmolLM2-135M', vocab_size=49152, model_max_length=8192, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|endoftext|>', '<|im_start|>', '<|im_end|>', '<repo_name>', '<reponame>', '<file_sep>', '<filename>', '<gh_stars>', '<issue_start>', '<issue_comment>', '<issue_closed>', '<jupyter_start>', '<jupyter_text>', '<jupyter_code>', '<jupyter_output>', '<jupyter_script>', '<empty_output>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|im_end|>", rstrip=False, lstrip=False, sin
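
To make the collator's role concrete, here is a simplified pure-Python stand-in for what a causal-LM collator with `mlm=False` does to a batch: pad to a common length, copy the inputs into `labels`, and replace every label equal to the pad id with `-100` so the loss ignores it. This is a sketch of the behaviour as we understand it, not the library's code; the function name and token ids are hypothetical. The one-position shift between inputs and targets happens inside the model, not here.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def collate_causal_lm(batch, pad_token_id):
    """Right-pad to the longest sequence and build labels for causal LM training."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; every position equal to the pad id is ignored.
        labels.append([IGNORE_INDEX if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Hypothetical token ids; 0 plays the role of <|endoftext|> (pad == eos here).
batch = collate_causal_lm([[11, 12, 13, 0], [21, 0]], pad_token_id=0)
print(batch["labels"])
# [[11, 12, 13, -100], [21, -100, -100, -100]]
```

Because the pad token was set to the eos token earlier, the genuine end-of-sequence token is masked along with the padding in this sketch, so the model receives no loss signal for predicting where a protein ends. Using a dedicated pad token would avoid that, at the cost of adding a token to the vocabulary.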

In [None]:
output_dir = "/tmp/pilm-builder-results"
batch_size = 4
max_epochs = 3
learning_rate = 2e-4
weight_decay = 0.01
warmup_ratio = 0.15
use_fp16 = torch.cuda.is_available()
run_name = "cpt-smollm-swissprot-lora"
gradient_accumulation_steps = 2  # to simulate a larger batch size
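
The hyperparameters above determine the effective batch size and the length of the learning-rate warmup. The arithmetic can be sketched as follows, assuming a single device and the 1,000-example training subset used later in this notebook; the `schedule` helper is ours, and the exact rounding at epoch boundaries may differ between library versions.

```python
import math


def schedule(num_examples, per_device_batch, grad_accum, epochs, warmup_ratio):
    """Estimate optimizer steps and warmup steps for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = updates_per_epoch * epochs
    return {
        "effective_batch": per_device_batch * grad_accum,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


with_accumulation = schedule(1000, 4, 2, 3, 0.15)
without_accumulation = schedule(1000, 4, 1, 3, 0.15)
print(with_accumulation)
print(without_accumulation)
```

With `gradient_accumulation_steps = 2` this gives an effective batch of 8 and 375 optimizer steps. With accumulation set to 1, the same arithmetic gives the 750 steps reported in the training log further down.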

In [None]:
training_args = TrainingArguments(
    do_train=True,
    do_eval=True,
    per_device_eval_batch_size=batch_size,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    # Logging & evaluation config
    output_dir=output_dir,
    eval_strategy="steps",
    save_strategy="steps",
    eval_steps=100,
    save_steps=100,
    logging_steps=50,
    save_total_limit=3,
    num_train_epochs=max_epochs,
    # Optimizer config
    optim=OptimizerNames.ADAMW_TORCH,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    warmup_ratio=warmup_ratio,
    # Mixed precision (fp16 when a CUDA device is available)
    fp16=use_fp16,
    run_name=run_name,
    report_to="none",  # change to "wandb" to enable logging to Weights & Biases
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    # Uncomment the following lines to enable pushing to the HuggingFace Hub
    # push_to_hub=True,
    # hub_model_id="<your-username>/cpt-smollm-swissprot-lora",
)

PyTorch: setting up devices


In [None]:
# Comment out the following lines to train on the full dataset
train_data = train_data.select(range(1000))  # use a smaller subset for testing
valid_data = valid_data.select(range(100))  # use a smaller subset for testing

In [48]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=valid_data,
    data_collator=data_collator,
)

Using auto half precision backend
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [49]:
training_stats = trainer.train()

***** Running training *****
  Num examples = 1,000
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 750
  Number of trainable parameters = 3,237,888


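| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 2.8694        | 2.836969        |
| 200  | 2.827         | 2.825628        |
| 300  | 2.8235        | 2.816247        |
| 400  | 2.8069        | 2.813277        |
| 500  | 2.813         | 2.811435        |
| 600  | 2.807         | 2.810436        |
| 700  | 2.8086        | 2.80846         |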
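
One way to read these loss values is to convert them to perplexity and compare against a uniform guess over the 20 standard amino acids. The sketch below is only a rough yardstick: the reported loss is averaged over every token in the formatted text, including the marker tokens, so the two numbers are not strictly comparable.

```python
import math

final_eval_loss = 2.80846    # validation loss at step 700 in the table above
uniform_loss = math.log(20)  # cross-entropy of a uniform guess over 20 amino acids

model_perplexity = math.exp(final_eval_loss)
uniform_perplexity = math.exp(uniform_loss)
print(f"model: loss={final_eval_loss:.4f}, perplexity={model_perplexity:.2f}")
print(f"uniform: loss={uniform_loss:.4f}, perplexity={uniform_perplexity:.2f}")
```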



***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /tmp/pilm-builder-results/checkpoint-100
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in /tmp/pilm-builder-results/checkpoint-100/tokenizer_config.json
Special tokens file saved in /tmp/pilm-builder-results/checkpoint-100/special_tokens_map.json

***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /tmp/pilm-builder-results/checkpoint-200
Saving Trainer.data_collator.tokenizer by default as Trainer.processing_class is `None`
tokenizer config file saved in /tmp/pilm-builder-results/checkpoint-200/tokenizer_config.json
Special tokens file saved in /tmp/pilm-builder-results/checkpoint-200/special_tokens_map.json

***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /tmp/pilm-builder-results/checkpoint-300
Saving Trainer.data_collator.to

In [52]:
for key, value in training_stats.metrics.items():
    print(f"{key}: {value}")

train_runtime: 1037.9977
train_samples_per_second: 2.89
train_steps_per_second: 0.723
total_flos: 1022401035648000.0
train_loss: 2.841126180013021
epoch: 3.0
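
The throughput figures above follow directly from the runtime, which makes them easy to cross-check. A small sketch, using the 1,000 training examples, 3 epochs and 750 optimization steps reported in the training log:

```python
train_runtime = 1037.9977  # seconds, from the metrics above
num_examples, epochs = 1000, 3
total_steps = 750          # from the training log

samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime
print(round(samples_per_second, 2), round(steps_per_second, 3))
# 2.89 0.723
```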


In [None]:
# Uncomment the following lines to save the model locally and/or push to the HuggingFace Hub
# model.save_pretrained("/path/to/save/model")
# model.push_to_hub("<username>/<model-id>", private=True)