# Continued Pretraining of Unsloth Models On Swissprot Sequences
Author: [Khairi Abidi](https://github.com/abidikhairi/)

This notebook demonstrates continued pretraining for protein sequence modeling.

Key Features:

- Memory efficient: a 4-bit quantized base model with LoRA adapters, sized for consumer GPUs (this run uses a Tesla T4).

The model learns to generate functional protein sequences.

## Installation and Setup
Set the Weights & Biases project name for this run. The cell below does not install anything; Unsloth and its dependencies are assumed to be available in the environment already.

In [1]:
%env WANDB_PROJECT=Unsloth-CPT

env: WANDB_PROJECT=Unsloth-CPT


## Connect to Third-Party Services

- **WandB**: for experiment tracking.
- **HuggingFace Hub**: for uploading model checkpoints.

In [2]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HUGGING_FACE_TOKEN")
wandb_token = user_secrets.get_secret("WANDB_API_KEY")

In [3]:
!wandb login {wandb_token}
!huggingface-cli login --token {hf_token}

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: write).
The token `KAGGLE_TOKEN` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `KAGGLE_TOKEN`


## GPU Environment Detection
Verify GPU availability and display hardware specifications for optimal training configuration.

In [4]:
import torch

# Verify CUDA availability and display GPU specifications
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    # Display current GPU details for training optimization
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    # Provide guidance for enabling GPU in Colab
    print("⚠️  No GPU available. This notebook requires a GPU for efficient training.")
    print("In Colab: Runtime → Change runtime type → Hardware accelerator → GPU")

CUDA available: True
Number of GPUs: 2
Current GPU: 0
GPU name: Tesla T4
GPU memory: 15.8 GB


## Core Library Imports
Import essential libraries for pre-training, model configuration, and experiment tracking.

In [5]:
# Model and tokenization
from unsloth import FastLanguageModel

# Training and Setup
from unsloth import (
    UnslothTrainer,
    UnslothTrainingArguments,
    is_bfloat16_supported
)

# Dataset handling
from datasets import load_dataset

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-09-26 06:38:12.559492: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1758868692.808241     122 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1758868692.882312     122 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


🦥 Unsloth Zoo will now patch everything to make training faster!


## Tested Models
- Gemma3-1B (unsloth/gemma-3-1b-pt): loss oscillates between 5.4 and 5.6.
- Qwen3-1.7B (unsloth/Qwen3-1.7B): the model used in the rest of this notebook.

In [6]:
model_name = 'unsloth/Qwen3-1.7B'
max_seq_len = 2048
dtype = torch.float16
load_in_4bit = True

print(f'Loading model: {model_name}')
print(f'Max input length: {max_seq_len}')
print(f'Model dtype: {dtype}')
print(f'Load in 4-bit: {load_in_4bit}')

Loading model: unsloth/Qwen3-1.7B
Max input length: 2048
Model dtype: torch.float16
Load in 4-bit: True


In [7]:
# Load the 4-bit quantized base model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_len,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Ensure tokenizer has proper padding token for batch processing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

==((====))==  Unsloth 2025.9.7: Fast Qwen3 patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [8]:
print(f"✅ Model loaded successfully!")
print(f"📊 Model parameters: ~{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
print(f"🧮 Quantized parameters: ~{sum(p.numel() for p in model.parameters() if hasattr(p, 'quant_type')) / 1e6:.1f}M")

✅ Model loaded successfully!
📊 Model parameters: ~1034.8M
🧮 Quantized parameters: ~685.8M


In [9]:
def compute_model_size(model):
    """Return the in-memory size of all parameters and buffers, in GiB (1024**3 bytes)."""
    n_bytes = 0
    for p in model.parameters():
        n_bytes += p.nelement() * p.element_size()
    for b in model.buffers():
        n_bytes += b.nelement() * b.element_size()

    return n_bytes / (1024 ** 3)

print(f"📊 Model size : {compute_model_size(model):.2f} GB")

📊 Model size : 1.29 GB


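## PEFT Configuration
Attach LoRA adapters to the base model. Because `embed_tokens` and `lm_head` are listed in `target_modules`, Unsloth trains the embedding and output matrices in full rather than through low-rank adapters, which is why the trainable-parameter share reported after this cell is large for a LoRA setup.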

In [10]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], 
    lora_alpha = 64,
    lora_dropout = 0,
    bias = "none",    
    use_gradient_checkpointing = "unsloth",
    use_rslora = True,
    loftq_config = None,
)

model.print_trainable_parameters()
print(f"📊 Model size : {compute_model_size(model):.2f} GB")

Unsloth: Offloading input_embeddings to disk to save VRAM
Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2025.9.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
trainable params: 761,790,464 || all params: 2,793,530,368 || trainable%: 27.2698
📊 Model size : 4.71 GB
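
The trainable-parameter count printed above can be reproduced by hand. The sketch below is only an illustration: the layer dimensions are assumed Qwen3-1.7B values (hidden size 2048, MLP size 6144, 28 layers, 16 query heads and 8 key/value heads of dimension 128, vocabulary 151,936) and are not stated anywhere in this notebook. It also assumes that the seven projection matrices get rank-128 adapters while `embed_tokens` and `lm_head` are trained in full.

```python
# Assumed Qwen3-1.7B dimensions (illustrative; not read from the notebook).
HIDDEN = 2048
INTERMEDIATE = 6144
NUM_LAYERS = 28
Q_DIM = 16 * 128    # 16 query heads of dimension 128
KV_DIM = 8 * 128    # 8 key/value heads of dimension 128
VOCAB = 151_936
RANK = 128          # matches r = 128 in get_peft_model


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """A LoRA adapter adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


def count_trainable() -> tuple[int, int]:
    """Return (adapter parameters, fully trained embedding/head parameters)."""
    per_layer = (
        lora_params(HIDDEN, Q_DIM)            # q_proj
        + lora_params(HIDDEN, KV_DIM)         # k_proj
        + lora_params(HIDDEN, KV_DIM)         # v_proj
        + lora_params(Q_DIM, HIDDEN)          # o_proj
        + lora_params(HIDDEN, INTERMEDIATE)   # gate_proj
        + lora_params(HIDDEN, INTERMEDIATE)   # up_proj
        + lora_params(INTERMEDIATE, HIDDEN)   # down_proj
    )
    adapters = NUM_LAYERS * per_layer
    full_matrices = 2 * VOCAB * HIDDEN        # embed_tokens + lm_head
    return adapters, full_matrices


adapters, full_matrices = count_trainable()
print(f"LoRA adapters:        {adapters:,}")
print(f"embed_tokens+lm_head: {full_matrices:,}")
print(f"total trainable:      {adapters + full_matrices:,}")
```

Under these assumptions the total comes to 761,790,464, the figure reported by `print_trainable_parameters()`, and about four fifths of it comes from the two vocabulary-sized matrices rather than from the adapters.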


## Swissprot Dataset Setup
Configure the Swissprot sequences dataset.

In [12]:
# Markers that delimit a protein sequence in the training text
protein_start = "<start_protein>"   # Begin protein sequence
protein_end = "<end_protein>"       # End protein sequence
eos_token = tokenizer.eos_token     # EOS so that generation does not go on forever

In [17]:
def filter_dataset_example(example):
    return len(example['Sequence']) <= 256

print("✅ Dataset filtering functions defined")

✅ Dataset filtering functions defined


In [20]:
def process_dataset_example(example):
    """Convert a Swissprot example into a formatted training string."""
    sequence = example["Sequence"]

    # Space-separate the residues so each amino acid becomes its own token.
    # (Alternative experiment: pass the raw string and let the tokenizer decide the splits.)
    sequence = ' '.join(sequence)
    text = f'{protein_start} {sequence} {protein_end} {eos_token}'

    return {
        "text": text,
    }

print("✅ Dataset processing functions defined")

✅ Dataset processing functions defined
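
To make the filtering and formatting steps concrete without downloading the dataset, here is a minimal stand-alone sketch of the same logic applied to a toy fragment. The names `keep_sequence` and `format_protein` are hypothetical stand-ins for the notebook's functions, and the EOS string `<|im_end|>` is hard-coded on the assumption that it matches `tokenizer.eos_token` for this model.

```python
PROTEIN_START = "<start_protein>"
PROTEIN_END = "<end_protein>"
EOS_TOKEN = "<|im_end|>"   # assumed value of tokenizer.eos_token
MAX_RESIDUES = 256         # same threshold as filter_dataset_example


def keep_sequence(sequence: str, max_residues: int = MAX_RESIDUES) -> bool:
    """Keep only proteins with at most `max_residues` amino acids."""
    return len(sequence) <= max_residues


def format_protein(sequence: str, eos_token: str = EOS_TOKEN) -> str:
    """Space-separate residues and wrap them in start/end markers plus EOS."""
    spaced = " ".join(sequence)
    return f"{PROTEIN_START} {spaced} {PROTEIN_END} {eos_token}"


toy_sequence = "MRSLAIL"  # short illustrative fragment
if keep_sequence(toy_sequence):
    print(format_protein(toy_sequence))
```

Note that the filter counts residues, not tokens: after space-separation a 256-residue protein yields roughly one token per residue plus a handful for the markers, which stays far below the 2048-token `max_seq_len`.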


In [22]:
# Load and preprocess Swissprot training dataset
print("🔄 Loading Swissprot sequences dataset...")
dataset = load_dataset("khairi/uniprot-swissprot")

# Apply filtering and formatting to all rows
# 1. Keep only sequences of at most 256 residues
# 2. Wrap each sequence in the start/end markers and append EOS
dataset = dataset \
    .filter(filter_dataset_example) \
    .map(process_dataset_example)

train_data = dataset['train']
valid_data = dataset['validation'].select(range(128)) # Pick 128 proteins for evaluation

print(f"✅ Dataset loaded and processed!")
print(f"📊 Training examples: {len(train_data):,}")
print(f"📊 Validation examples: {len(valid_data):,}")
print(f"🎯 Sample protein: {train_data[0]['text']}")
print(f"🎯 Sample protein (tokenized): {' '.join(tokenizer.convert_ids_to_tokens(tokenizer.encode(train_data[0]['text'])))}")

🔄 Loading Swissprot sequences dataset...
✅ Dataset loaded and processed!
📊 Training examples: 239,381
📊 Validation examples: 128
🎯 Sample protein: <start_protein> M R S L A I L T T L L A G H A F A Y P K P A P Q S V N R R D W P S I N E F L S E L A K V M P I G D T I T A A C D L I S D G E D A A A S L F G I S E T E N D P C G D V T V L F A R G T C D P G N V G V L V G P W F F D S L Q T A L G S R T L G V K G V P Y P A S V Q D F L S G S V Q N G I N M A N Q I K S V L Q S C P N T K L V L G G Y S Q G S M V V H N A A S N L D A A T M S K I S A V V L F G D P Y Y G K P V A N F D A A K T L V V C H D G D N I C Q G G D I I L L P H L T Y A E D A D T A A A F V V P L V S <end_protein> <|im_end|>
🎯 Sample protein (tokenized): < start _pro tein > ĠM ĠR ĠS ĠL ĠA ĠI ĠL ĠT ĠT ĠL ĠL ĠA ĠG ĠH ĠA ĠF ĠA ĠY ĠP ĠK ĠP ĠA ĠP ĠQ ĠS ĠV ĠN ĠR ĠR ĠD ĠW ĠP ĠS ĠI ĠN ĠE ĠF ĠL ĠS ĠE ĠL ĠA ĠK ĠV ĠM ĠP ĠI ĠG ĠD ĠT ĠI ĠT ĠA ĠA ĠC ĠD ĠL ĠI ĠS ĠD ĠG ĠE ĠD ĠA ĠA ĠA ĠS ĠL ĠF ĠG ĠI ĠS ĠE ĠT ĠE ĠN ĠD ĠP ĠC ĠG ĠD ĠV ĠT ĠV ĠL ĠF ĠA ĠR ĠG

## Training Setup
Configure training hyperparameters for teaching the model a new "language" (amino-acid sequences) under tight GPU memory constraints.

In [23]:
training_args = UnslothTrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 8,

    # Use warmup_ratio and num_train_epochs for longer runs!
    max_steps = 5000,
    # warmup_steps = 10,
    warmup_ratio = 0.1,
    # num_train_epochs = 1,

    learning_rate = 5e-5,
    embedding_learning_rate = 2e-5,

    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 200,
    eval_steps = 200,
    save_steps = 200,
    eval_strategy = 'steps',
    save_total_limit = 3,
    load_best_model_at_end = True,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "cosine",
    
    output_dir = "/tmp/outputs",
    run_name = 'qwen-3-cpt-swissport',
    report_to = "wandb", # Use this for WandB etc

    # Push to Hub, set true in production
    push_to_hub=True,
    hub_model_id='khairi/Hisoka-1B'
)
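
The schedule implied by these arguments is easy to work out by hand. The sketch below is plain arithmetic on the values in the cell above, plus two numbers taken from elsewhere in this notebook: the 239,381 training examples reported after filtering, and the single data-parallel GPU shown in the training banner. Rounding the warmup length with `round` is an assumption; the trainer may round differently.

```python
# Values copied from the training arguments and the notebook's logs.
per_device_batch = 2
grad_accum_steps = 8
data_parallel_gpus = 1      # the training banner reports one GPU in use
max_steps = 5000
warmup_ratio = 0.1
eval_steps = 200
train_examples = 239_381    # after length filtering

# Sequences consumed per optimizer step.
effective_batch = per_device_batch * grad_accum_steps * data_parallel_gpus

# Total sequences seen and the fraction of one epoch that represents.
sequences_seen = max_steps * effective_batch
epoch_fraction = sequences_seen / train_examples

# Length of the linear warmup and the number of evaluation passes.
warmup_steps = round(max_steps * warmup_ratio)
num_evals = max_steps // eval_steps

print(f"effective batch size: {effective_batch}")
print(f"sequences seen:       {sequences_seen:,}")
print(f"fraction of an epoch: {epoch_fraction:.3f}")
print(f"warmup steps:         {warmup_steps}")
print(f"evaluation passes:    {num_evals}")
```

So a 5,000-step run covers only about a third of the filtered training set; a full pass would need roughly 15,000 steps at this batch size.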

In [24]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_data,
    eval_dataset = valid_data,
    dataset_text_field = "text",
    max_seq_length = max_seq_len,
    dataset_num_proc = 2,
    args = training_args,
)


In [None]:
# Execute CPT
print("🚀 Starting CPT...")

# Run the training process
trainer.train()

print("✅ Training completed successfully!")
print(f"💾 Model saved to: {training_args.output_dir}")

🚀 Starting CPT...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 239,381 | Num Epochs = 1 | Total steps = 5,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 761,790,464 of 2,793,530,368 (27.27% trained)
[34m[1mwandb[0m: Currently logged in as: [33mflursky[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


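| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200  | 2.7299 | 2.659739 |
| 400  | 2.6562 | 2.647046 |
| 600  | 2.6471 | 2.634559 |
| 800  | 2.6339 | 2.627264 |
| 1000 | 2.6158 | 2.608304 |
| 1200 | 2.5923 | 2.585157 |
| 1400 | 2.5665 | 2.542497 |
| 1600 | 2.5138 | 2.512151 |
| 1800 | 2.4886 | 2.48721 |
| 2000 | 2.4544 | 2.451288 |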
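
The logged losses are easier to interpret as perplexities. The sketch below assumes the reported loss is the mean cross-entropy in nats per token, and uses a uniform guess over the 20 standard amino acids as a rough reference point. The comparison is only approximate, because the loss also averages over the marker and EOS tokens.

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)


# Loss of a model that guesses uniformly among 20 amino acids (rough reference).
UNIFORM_20_LOSS = math.log(20)

# A few validation losses copied from the training log.
validation_losses = {200: 2.659739, 1000: 2.608304, 2000: 2.451288}

for step, loss in validation_losses.items():
    print(f"step {step:>4}: loss {loss:.3f} -> perplexity {perplexity(loss):.2f}")

print(f"uniform reference: loss {UNIFORM_20_LOSS:.3f} -> perplexity 20.00")
```

Read this way, the validation loss at step 2000 corresponds to a perplexity of roughly 11.6 per token, compared with 20 for uniform guessing over residues.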


Unsloth: Not an error, but Qwen3ForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


In [25]:
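@torch.no_grad()
def generate_protein():
    # Switch Unsloth to inference mode before calling generate()
    FastLanguageModel.for_inference(model)

    # Place the inputs on the GPU explicitly. The original version used model.device and failed with:
    #   RuntimeError: Expected all tensors to be on the same device, but got index is on cpu,
    #   different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)
    # A likely cause is that model.device resolved to the CPU after the embeddings were offloaded.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    inputs = tokenizer(protein_start, return_tensors='pt').to(device)

    output = model.generate(**inputs, max_new_tokens=512, top_k=250, do_sample=True)
    print(tokenizer.decode(output[0]))

generate_protein()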
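
Decoded generations still contain the markers and the spaces between residues, so some post-processing is needed before a sample can be used as a protein sequence. The following is a minimal sketch of one way to do that. The helper names are hypothetical, and restricting the check to the 20 standard amino-acid letters is an assumption (Swissprot entries can contain other codes such as `X` or `U`).

```python
import re

PROTEIN_START = "<start_protein>"
PROTEIN_END = "<end_protein>"
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumed alphabet


def extract_sequence(decoded: str):
    """Return the residues between the markers with whitespace removed, or None."""
    pattern = re.escape(PROTEIN_START) + r"(.*?)" + re.escape(PROTEIN_END)
    match = re.search(pattern, decoded, flags=re.DOTALL)
    if match is None:
        return None  # generation never closed the protein
    return "".join(match.group(1).split())


def is_standard_protein(sequence: str) -> bool:
    """True if the sequence is non-empty and uses only the 20 standard letters."""
    return len(sequence) > 0 and set(sequence) <= STANDARD_AMINO_ACIDS


sample = "<start_protein> M R S L A I L <end_protein> <|im_end|>"
sequence = extract_sequence(sample)
print(sequence, is_standard_protein(sequence))
```

A sample that never emits `<end_protein>` within `max_new_tokens` returns `None` here, which is a cheap signal that the model has not yet learned to terminate sequences.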

In [None]:
import random

@torch.no_grad()
def causal_lm():
    # Sample a continuation for a natural-language prompt
    prompt = random.choice([
        'Hello World!',
        'Once upon a time',
        'BRAC5 is',
    ])
    print(f"prompt >> {prompt}")

    # Place the inputs on the GPU explicitly instead of relying on model.device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    inputs = tokenizer(prompt, return_tensors='pt').to(device)

    output = model.generate(**inputs, max_new_tokens=256, do_sample=True)
    print(tokenizer.decode(output[0]))

causal_lm()