Continued Pretraining of Gemma-3-1B on Swissprot Sequences
Author: [Khairi Abidi](https://github.com/abidikhairi/)

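This notebook demonstrates continued pretraining (CPT) of a causal language model on protein sequences from Swissprot.

Key Features:

- Memory Efficient: a 4-bit quantized base model with LoRA adapters, so training fits on consumer GPUs.

The goal is for the model to learn to generate plausible protein sequences.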

## Installation and Setup
Set the Weights & Biases project for this run. The packages used below (`unsloth`, `datasets`, `wandb`) must already be installed in the environment.

In [1]:
%env WANDB_PROJECT=Unsloth-CPT

env: WANDB_PROJECT=Unsloth-CPT
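Outside a notebook, where the `%env` magic is not available, the same setting can be applied from plain Python. A minimal sketch, assuming only that `wandb` reads the `WANDB_PROJECT` environment variable when a run starts:

```python
import os

# Equivalent of the `%env WANDB_PROJECT=Unsloth-CPT` magic for plain scripts.
# Set it before the trainer starts a run so the project name is picked up.
os.environ["WANDB_PROJECT"] = "Unsloth-CPT"

print(f"env: WANDB_PROJECT={os.environ['WANDB_PROJECT']}")
```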


## Connect to 3rd party services

- **WandB**: for experiment tracking.
- **HuggingFace Hub**: for model checkpoints uploading.

In [2]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HUGGING_FACE_TOKEN")
wandb_token = user_secrets.get_secret("WANDB_API_KEY")

In [3]:
!wandb login {wandb_token}
!huggingface-cli login --token {hf_token}

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: write).
The token `KAGGLE_TOKEN` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `KAGGLE_TOKEN`


## GPU Environment Detection
Verify GPU availability and display hardware specifications for optimal training configuration.

In [4]:
import torch

# Verify CUDA availability and display GPU specifications
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    # Display current GPU details for training optimization
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    # Provide guidance for enabling GPU in Colab
    print("⚠️  No GPU available. This notebook requires a GPU for efficient training.")
    print("In Colab: Runtime → Change runtime type → Hardware accelerator → GPU")

CUDA available: True
Number of GPUs: 2
Current GPU: 0
GPU name: Tesla T4
GPU memory: 15.8 GB
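The Tesla T4 reported above is a Turing GPU (compute capability 7.5), which matters for the precision flags set later in the training arguments. As a rough illustration, the sketch below picks `fp16`/`bf16` flags from a compute capability tuple. The helper name `precision_flags` and the rule of thumb that native bfloat16 needs compute capability 8.0 or higher are assumptions for illustration; the notebook itself relies on Unsloth's `is_bfloat16_supported()`.

```python
def precision_flags(compute_capability):
    """Return mixed-precision flags for a (major, minor) compute capability.

    Assumption: native bfloat16 is available from compute capability 8.0 on.
    """
    major, _minor = compute_capability
    bf16 = major >= 8
    return {"bf16": bf16, "fp16": not bf16}

print(precision_flags((7, 5)))  # Tesla T4 (Turing)
print(precision_flags((8, 0)))  # e.g. an Ampere-class GPU
```

On a real machine the tuple can be read with `torch.cuda.get_device_capability()`.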


## Core Library Imports
Import essential libraries for pre-training, model configuration, and experiment tracking.

In [18]:
# Model and tokenization
from unsloth import FastLanguageModel

# Training and Setup
from unsloth import (
    UnslothTrainer,
    UnslothTrainingArguments,
    is_bfloat16_supported
)

# Dataset handling
from datasets import load_dataset

In [6]:
model_name = 'unsloth/gemma-3-1b-pt'
max_seq_len = 1024
dtype = torch.float16
load_in_4bit = True

print(f'Loading model: {model_name}')
print(f'Max input length: {max_seq_len}')
print(f'Model dtype: {dtype}')
print(f'Is 4bit quantization supported: {load_in_4bit}')

Loading model: unsloth/gemma-3-1b-pt
Max input length: 1024
Model dtype: torch.float16
Is 4bit quantization supported: True


In [8]:
# Load model with automatic device mapping
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_len,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Ensure tokenizer has proper padding token for batch processing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

==((====))==  Unsloth 2025.9.7: Fast Gemma3 patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.


In [9]:
print(f"✅ Model loaded successfully!")
print(f"📊 Model parameters: ~{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
print(f"🧮 Quantized parameters: ~{sum(p.numel() for p in model.parameters() if hasattr(p, 'quant_type')) / 1e6:.1f}M")

✅ Model loaded successfully!
📊 Model parameters: ~662.9M
🧮 Quantized parameters: ~336.9M


In [13]:
def compute_model_size(model):
    """Return the in-memory size of all parameters and buffers in GiB."""
    n_bytes = 0
    for p in model.parameters():
        n_bytes += p.nelement() * p.element_size()
    for b in model.buffers():
        n_bytes += b.nelement() * b.element_size()

    return n_bytes / (1024 ** 3)

print(f"📊 Model size : {compute_model_size(model):.2f} GB")

📊 Model size : 0.92 GB
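The parameter counts and the size printed above can be reconciled with a little arithmetic. The sketch below assumes that the 4-bit weights are stored packed, two weights per one-byte element, and that the remaining parameters use 2 bytes each. Both are assumptions about the storage layout, not something the notebook prints, and small quantization buffers are ignored. Note also that `compute_model_size` divides by `1024 ** 3`, so its "GB" is strictly GiB.

```python
# Figures reported by the cells above (in millions of tensor elements)
total_elements_m = 662.9        # sum of p.numel() over all parameters
packed_4bit_elements_m = 336.9  # elements that carry a `quant_type` attribute

# Assumptions (not printed by the notebook):
WEIGHTS_PER_PACKED_ELEMENT = 2  # two 4-bit weights per stored byte
BYTES_PER_PACKED_ELEMENT = 1
BYTES_PER_OTHER_ELEMENT = 2     # 16-bit storage for unquantized tensors

other_elements_m = total_elements_m - packed_4bit_elements_m

# Logical weight count once the packed elements are "unpacked"
logical_params_m = other_elements_m + packed_4bit_elements_m * WEIGHTS_PER_PACKED_ELEMENT

# Memory footprint under the byte-width assumptions above
size_bytes = (other_elements_m * BYTES_PER_OTHER_ELEMENT
              + packed_4bit_elements_m * BYTES_PER_PACKED_ELEMENT) * 1e6
size_gib = size_bytes / (1024 ** 3)

print(f"Logical parameters: ~{logical_params_m:.1f}M")
print(f"Estimated size: {size_gib:.2f} GiB")
```

Under these assumptions the estimate comes out at roughly 1B logical parameters and about 0.92 GiB, in line with the model name and the printed size.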


## PEFT Configuration
Attach LoRA adapters (rank 64, alpha 128, rank-stabilized scaling) to the attention, MLP, and embedding modules of the base model.

In [14]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], 
    lora_alpha = 128,
    lora_dropout = 0.1,
    bias = "none",    
    use_gradient_checkpointing = "unsloth",
    use_rslora = True,
    loftq_config = None,
)

model.print_trainable_parameters()
print(f"📊 Model size : {compute_model_size(model):.2f} GB")

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.1.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.


Unsloth: Making `model.base_model.model.model.embed_tokens` require gradients
trainable params: 69,033,984 || all params: 1,085,770,880 || trainable%: 6.3581
📊 Model size : 1.24 GB
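The trainable parameter count reported above can be reproduced by hand. A LoRA adapter on a linear layer of shape `d_in × d_out` adds `r * (d_in + d_out)` parameters. The sketch below applies that formula using layer dimensions that are assumptions taken from the published Gemma 3 1B configuration (hidden size 1152, MLP size 6912, 26 layers, 4 query heads and 1 key/value head of dimension 256, vocabulary 262,144). It further assumes that the embedding receives a single rank-64 adapter and that the tied `lm_head` adds nothing extra.

```python
R = 64  # LoRA rank from the configuration above

# Assumed Gemma 3 1B dimensions (not printed by the notebook)
HIDDEN, MLP, LAYERS = 1152, 6912, 26
Q_DIM, KV_DIM = 4 * 256, 1 * 256
VOCAB = 262_144

def lora_params(d_in, d_out, r=R):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

per_layer = (
    lora_params(HIDDEN, Q_DIM)      # q_proj
    + lora_params(HIDDEN, KV_DIM)   # k_proj
    + lora_params(HIDDEN, KV_DIM)   # v_proj
    + lora_params(Q_DIM, HIDDEN)    # o_proj
    + lora_params(HIDDEN, MLP)      # gate_proj
    + lora_params(HIDDEN, MLP)      # up_proj
    + lora_params(MLP, HIDDEN)      # down_proj
)
embedding = lora_params(VOCAB, HIDDEN)  # embed_tokens (assumed single adapter)

trainable = per_layer * LAYERS + embedding
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / 1_085_770_880:.4f}")
```

With these assumptions the total matches the 69,033,984 reported by `print_trainable_parameters()`. Separately, with `use_rslora = True` the adapter output is scaled by `lora_alpha / sqrt(r)` (here 128 / 8 = 16) rather than the standard `lora_alpha / r` (here 2).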


## Swissprot Dataset Setup
Configure the Swissprot sequences dataset.

In [15]:
# Define the marker strings that delimit each protein sequence
protein_start = "<start_protein>"   # Begin protein sequence
protein_end = "<end_protein>"       # End protein sequence
eos_token = tokenizer.eos_token     # EOS so that generation does not go on forever

In [16]:
def process_dataset_example(example):
    """Wrap a Swissprot sequence in start/end markers and append EOS."""
    sequence = example["Sequence"]

    # Experiment: let the tokenizer decide how to split the sequence
    # sequence = ' '.join(list(sequence)) # Amino acid level tokenization
    text = f'{protein_start} {sequence} {protein_end} {eos_token}'
    
    return {
        "text": text,
    }

print("✅ Dataset processing functions defined")

✅ Dataset processing functions defined
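The two formatting options mentioned in `process_dataset_example` (raw sequence versus space-separated amino acids) can be compared on a toy input. This standalone sketch uses a hypothetical helper `format_protein` and a literal `<eos>` string in place of `tokenizer.eos_token`, so it runs without a model:

```python
PROTEIN_START = "<start_protein>"
PROTEIN_END = "<end_protein>"
EOS = "<eos>"  # stand-in for tokenizer.eos_token

def format_protein(sequence, per_residue=False):
    """Wrap a sequence in markers; optionally separate residues with spaces."""
    if per_residue:
        # Amino-acid-level variant: one whitespace-delimited symbol per residue
        sequence = " ".join(sequence)
    return f"{PROTEIN_START} {sequence} {PROTEIN_END} {EOS}"

print(format_protein("MSLEQ"))
print(format_protein("MSLEQ", per_residue=True))
```

The first form leaves subword splitting to the tokenizer, which tends to give shorter token sequences. The second is intended to push the tokenizer toward one symbol per residue, at the cost of longer inputs.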


In [56]:
# Load and preprocess Swissprot training dataset
print("🔄 Loading Swissprot sequences dataset...")
dataset = load_dataset("khairi/uniprot-swissprot")

# Apply protein formatting to all examples
dataset = dataset.map(process_dataset_example)

train_data = dataset['train']
valid_data = dataset['validation'].select(range(128)) # Pick 128 proteins for evaluation

print(f"✅ Dataset loaded and processed!")
print(f"📊 Training examples: {len(train_data):,}")
print(f"📊 Validation examples: {len(valid_data):,}")
print(f"🎯 Sample protein: {train_data[0]['text']}")
print(f"🎯 Sample protein (tokenized): {' '.join(tokenizer.convert_ids_to_tokens(tokenizer.encode(train_data[0]['text'])))}")

🔄 Loading Swissprot sequences dataset...
✅ Dataset loaded and processed!
📊 Training examples: 455,692
📊 Validation examples: 128
🎯 Sample protein: <start_protein> MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENARIQSKLSDLQKKKIDIDNKLLKEKQNLIKEEILERKKLEVLTKKQQKDEIEHQKKLKREIDAIKASTQYITDVSISSYNNTIPETEPEYDLFISHASEDKEDFVRPLAETLQQLGVNVWYDEFTLKVGDSLRQKIDSGLRNSKYGTVVLSTDFIKKDWTNYELDGLVAREMNGHKMILPIWHKITKNDVLDYSPNLADKVALNTSVNSIEEIAHQLADVILNR <end_protein> <eos>
🎯 Sample protein (tokenized): <bos> < start _ protein > ▁MS LE Q KK G ADI ISK IL Q IQ NS IG KT TSP STL KT KL SE ISR KE Q EN ARI Q SK L SDL Q KK K ID ID NK LL KE KQ NL IK EE ILER KK LEV LT KK QQ K DE IE HQ K KL K RE ID AI K AST Q Y IT D VS ISS Y NN TIP ETE PE Y DL FISH ASE DK ED F VR PLA ET LQ QL GV NV WY DE FT L KV G DS LR QK IDS GL R NS KY G TV VL ST DF IK KD WT NY ELD GL VA REM NG HK MIL PI WH KIT K ND V LD Y SP NL AD K VAL N TS VN SI EE IA H QL ADV IL NR ▁< end _ protein > ▁ <eos>


## Training Setup
Configure training parameters for learning a new "language" (protein sequences) under memory constraints.

In [57]:
training_args = UnslothTrainingArguments(
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 8,

    # Use warmup_ratio and num_train_epochs for longer runs!
    # max_steps = 120,
    # warmup_steps = 10,
    warmup_ratio = 0.1,
    num_train_epochs = 1,

    learning_rate = 5e-5,
    embedding_learning_rate = 2e-5,

    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 400,
    eval_steps = 400,
    save_steps = 400,
    eval_strategy = 'steps',
    save_total_limit = 3,
    load_best_model_at_end = True,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "cosine",
    
    output_dir = "/tmp/outputs",
    run_name = 'gemma2-cpt-swissport',
    report_to = "wandb", # Use this for WandB etc

    # Push to Hub, set true in production
    push_to_hub=True,
    hub_model_id='khairi/Hisoka-1B'
)
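For a sense of scale, the step count and learning-rate schedule implied by these arguments can be worked out by hand. The sketch below assumes the 455,692 training examples reported earlier, a single data-parallel GPU, and the common linear-warmup-then-cosine-decay formula. The exact rounding of warmup steps inside the trainer may differ.

```python
import math

num_examples = 455_692   # training examples reported by the dataset cell
per_device_batch = 4
grad_accum = 8
num_gpus = 1             # assumption: single data-parallel device
base_lr = 5e-5
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum * num_gpus
total_steps = math.ceil(num_examples / effective_batch)
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"Effective batch size: {effective_batch}")
print(f"Total steps: {total_steps:,}")
print(f"Warmup steps: {warmup_steps:,}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

The computed 14,241 optimizer steps agree with the total reported in the Unsloth training banner, so checkpoints and evaluations every 400 steps happen roughly 35 times over the epoch.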

In [58]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_data,
    eval_dataset = valid_data,
    dataset_text_field = "text",
    max_seq_length = max_seq_len,
    dataset_num_proc = 2,
    args = training_args,
)

Unsloth: Switching to float32 training since model cannot work with float16


In [None]:
# Execute CPT
print("🚀 Starting CPT...")

# Run the training process
trainer.train()

print("✅ Training completed successfully!")
print(f"💾 Model saved to: {training_args.output_dir}")

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 455,692 | Num Epochs = 1 | Total steps = 14,241
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
 "-____-"     Trainable parameters = 69,033,984 of 1,085,770,880 (6.36% trained)


🚀 Starting CPT...


[34m[1mwandb[0m: Currently logged in as: [33mflursky[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss,Validation Loss


In [None]:
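def generate_protein():
    """Sample one sequence from the model, prompted with the start marker."""
    inputs = tokenizer(protein_start, return_tensors='pt')

    # Move the input tensors to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    # Top-k sampling; generation stops at EOS or after 512 new tokens
    output = model.generate(**inputs, max_new_tokens=512, top_k=250, do_sample=True)
    print(tokenizer.decode(output[0]))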

generate_protein()
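Generated text still contains the marker strings and may include characters that are not amino acids. A small post-processing step can extract and sanity-check the sequence. The helpers `extract_sequence` and `is_valid_protein` below are hypothetical and only check membership in the 20 standard one-letter amino-acid codes; they say nothing about whether a sequence folds or functions.

```python
import re

STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def extract_sequence(decoded, start="<start_protein>", end="<end_protein>"):
    """Return the text between the markers with whitespace removed, or None."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, decoded, flags=re.DOTALL)
    if match is None:
        return None  # the end marker was never generated
    return re.sub(r"\s+", "", match.group(1))

def is_valid_protein(sequence):
    """True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS

sample = "<bos><start_protein> MSLEQKKGADIISKILQ <end_protein> <eos>"
seq = extract_sequence(sample)
print(seq, is_valid_protein(seq))
```

Real Swissprot entries occasionally use additional letters such as `U` or `X`, so a looser alphabet may be appropriate depending on the use case.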

In [None]:
def causal_lm():
    """Sample a continuation for a random natural-language prompt."""
    import random
    prompt = random.choice([
        'Hello World!',
        'Once upon a time',
        'BRAC5 is',
    ])
    print(f"prompt >> {prompt}")
    inputs = tokenizer(prompt, return_tensors='pt')

    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    output = model.generate(**inputs, max_new_tokens=256, top_k=250, do_sample=True)
    print(tokenizer.decode(output[0]))

causal_lm()