# Continued Pretraining of Qwen-0.5B On Swissprot Sequences

Author: [Khairi Abidi](https://github.com/abidikhairi/)

This notebook demonstrates continued pretraining for protein sequence modeling.

Key Features:

- Memory Efficient: FP16 weights, small per-device batches, and gradient accumulation for consumer GPUs

The model learns to model protein sequences and to generate plausible, potentially functional ones.

## Installation and Setup
Install the required packages for continued pretraining with memory-efficient techniques.

In [9]:
%env CUDA_VISIBLE_DEVICES=0,1

env: CUDA_VISIBLE_DEVICES=0,1


In [10]:
%%capture
!pip install --quiet transformers datasets trl bitsandbytes peft trackio

## Connect to 3rd party services

- **WandB**: for experiment tracking.
- **HuggingFace Hub**: for uploading model checkpoints.

In [11]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HUGGING_FACE_TOKEN")
wandb_token = user_secrets.get_secret("WANDB_API_KEY")

In [12]:
%env WANDB_PROJECT=Qwen-CPT

env: WANDB_PROJECT=Qwen-CPT


In [13]:
!wandb login {wandb_token}

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin


In [14]:
!huggingface-cli login --token {hf_token}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: write).
The token `KAGGLE_TOKEN` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `KAGGLE_TOKEN`


## GPU Environment Detection
Verify GPU availability and display hardware specifications for optimal training configuration.

In [15]:
import torch

# Verify CUDA availability and display GPU specifications
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    # Display current GPU details for training optimization
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    # Provide guidance for enabling GPU in Colab
    print("⚠️  No GPU available. This notebook requires a GPU for efficient training.")
    print("In Colab: Runtime → Change runtime type → Hardware accelerator → GPU")

CUDA available: True
Number of GPUs: 2
Current GPU: 0
GPU name: Tesla T4
GPU memory: 15.8 GB


## Core Library Imports
Import essential libraries for pre-training, model configuration, and experiment tracking.

In [16]:
# Model and tokenization
from transformers import (
    AutoModelForCausalLM,            # Causal language model loading
    AutoTokenizer,                   # Text tokenization
    DataCollatorForLanguageModeling, # Batch inputs handling
    BitsAndBytesConfig,              # Quantization configuration
)

# Model optimization
from torch.optim import AdamW
from transformers import get_scheduler

# Training and Setup
from transformers import (
    Trainer,
    TrainingArguments
)

# Dataset handling
from datasets import load_dataset

# Logging configuration
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [17]:
# Select the base (non-instruct) Qwen2.5 checkpoint for continued pretraining
model_name = "Qwen/Qwen2.5-0.5B"          # 0.5B parameter model balances capability and memory usage
max_seq_length = 768                      # Token limit for protein sequences (reduce if OOM)

print(f"Loading model: {model_name}")
print(f"Max sequence length: {max_seq_length}")

Loading model: Qwen/Qwen2.5-0.5B
Max sequence length: 768


In [18]:
# Load model with automatic device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                    # Auto-distribute across available GPUs/CPU
    trust_remote_code=True,               # Allow custom model code execution
    dtype=torch.float16,                  # Load the weights in FP16 (2 bytes per parameter)
)

In [19]:
# Load corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True               # Allow custom tokenizer code
)

# Ensure tokenizer has proper padding token for batch processing
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [20]:
print(f"✅ Model loaded successfully!")
print(f"📊 Model parameters: ~{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
print(f"🧮 Quantized parameters: ~{sum(p.numel() for p in model.parameters() if hasattr(p, 'quant_type')) / 1e6:.1f}M")

✅ Model loaded successfully!
📊 Model parameters: ~494.0M
🧮 Quantized parameters: ~0.0M


In [21]:
def compute_model_size(model):
    """Return the in-memory size of parameters and buffers in GiB."""
    n_bytes = 0
    for p in model.parameters():
        n_bytes += p.nelement() * p.element_size()
    for b in model.buffers():
        n_bytes += b.nelement() * b.element_size()

    return n_bytes / (1024 ** 3)

print(f"📊 Model size : {compute_model_size(model):.2f} GB")

📊 Model size : 0.92 GB
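
As a sanity check on the figure above, the size follows directly from the parameter count and the bytes per element. This is a minimal sketch that uses the rounded ~494.0M count printed earlier and ignores buffers; `estimate_size_gib` is a hypothetical helper, not part of the notebook.

```python
def estimate_size_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate weight memory in GiB (1 GiB = 1024**3 bytes)."""
    return n_params * bytes_per_param / (1024 ** 3)

n_params = 494.0e6  # rounded parameter count reported above

fp16_gib = estimate_size_gib(n_params, 2)  # FP16: 2 bytes per parameter
fp32_gib = estimate_size_gib(n_params, 4)  # FP32: 4 bytes per parameter

print(f"FP16 weights: {fp16_gib:.2f} GiB")
print(f"FP32 weights: {fp32_gib:.2f} GiB")
```

The FP16 estimate agrees with the 0.92 GB reported by `compute_model_size`, which divides by 1024³ and therefore reports GiB.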


## Swissprot Dataset Setup
Configure the Swissprot sequences dataset.

In [22]:
# Define delimiter tokens for the protein text format
protein_start = "<start_protein>"   # Begin protein sequence
protein_end = "<end_protein>"       # End protein sequence
eos_token = tokenizer.eos_token     # EOS so that generation does not go on forever

In [23]:
def process_dataset_example(example):
    """Convert Swissprot example to formatted protein"""
    sequence = example["Sequence"]

    # Experiment: space-separate residues instead of letting the tokenizer decide the splits
    sequence = ' '.join(list(sequence)) # Amino acid level tokenization
    text = f'{protein_start} {sequence} {protein_end} {eos_token}'
    
    return {
        "text": text,
    }

print("✅ Dataset processing functions defined")

✅ Dataset processing functions defined
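
To make the text format concrete, the sketch below applies the same formatting to a short, made-up sequence without needing the tokenizer. `format_protein` is a hypothetical standalone version of `process_dataset_example`, and the EOS string is an assumption about what `tokenizer.eos_token` returns for this checkpoint.

```python
PROTEIN_START = "<start_protein>"
PROTEIN_END = "<end_protein>"
EOS_TOKEN = "<|endoftext|>"  # assumed value of tokenizer.eos_token

def format_protein(sequence: str) -> str:
    """Space-separate residues and wrap them in the start/end markers."""
    residues = " ".join(sequence)  # one residue per whitespace-delimited piece
    return f"{PROTEIN_START} {residues} {PROTEIN_END} {EOS_TOKEN}"

print(format_protein("MSLEQ"))
# <start_protein> M S L E Q <end_protein> <|endoftext|>
```

Note that the two markers are plain strings here; unless they are registered as special tokens, the tokenizer will split them into ordinary subword pieces.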


In [24]:
def tokenize_dataset_example(examples):
    # Truncate to max_seq_length; the data collator pads each batch later
    return tokenizer(examples['text'], truncation=True, max_length=max_seq_length)

print("✅ Dataset tokenization functions defined (batch mode)")

✅ Dataset tokenization functions defined (batch mode)


In [25]:
# Load and preprocess Swissprot training dataset
print("🔄 Loading Swissprot sequences dataset...")
dataset = load_dataset("khairi/uniprot-swissprot")

# Apply protein formatting and tokenization to all examples
dataset = dataset.map(process_dataset_example) \
    .map(tokenize_dataset_example, batched=True, batch_size=32)

train_data = dataset['train']
valid_data = dataset['validation'].select(range(128)) # Pick 128 proteins for evaluation

print(f"✅ Dataset loaded and processed!")
print(f"📊 Training examples: {len(train_data):,}")
print(f"📊 Validation examples: {len(valid_data):,}")
print(f"🎯 Sample protein: {train_data[0]['text']}")

🔄 Loading Swissprot sequences dataset...


Generating train split:   0%|          | 0/455692 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

✅ Dataset loaded and processed!
📊 Training examples: 455,692
📊 Validation examples: 128
🎯 Sample protein: <start_protein> M S L E Q K K G A D I I S K I L Q I Q N S I G K T T S P S T L K T K L S E I S R K E Q E N A R I Q S K L S D L Q K K K I D I D N K L L K E K Q N L I K E E I L E R K K L E V L T K K Q Q K D E I E H Q K K L K R E I D A I K A S T Q Y I T D V S I S S Y N N T I P E T E P E Y D L F I S H A S E D K E D F V R P L A E T L Q Q L G V N V W Y D E F T L K V G D S L R Q K I D S G L R N S K Y G T V V L S T D F I K K D W T N Y E L D G L V A R E M N G H K M I L P I W H K I T K N D V L D Y S P N L A D K V A L N T S V N S I E E I A H Q L A D V I L N R <end_protein> <|endoftext|>


## Training Setup
Configure training parameters optimized for learning a new language with memory constraints.

In [26]:
# Prepare data collator
data_collator = DataCollatorForLanguageModeling(mlm=False, tokenizer=tokenizer)
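
With `mlm=False`, the collator pads each batch and builds the labels as a copy of `input_ids`, with every position equal to the pad token id replaced by `-100` so the loss ignores it. The pure-Python sketch below imitates that behavior on toy token ids, assuming right padding; `collate_causal_lm` is a hypothetical stand-in, not the library implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def collate_causal_lm(batch, pad_token_id):
    """Right-pad a batch of token-id lists and derive causal-LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Every position holding the pad id is masked, whether it is real or padding
        labels.append([IGNORE_INDEX if t == pad_token_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Toy batch where id 0 serves as both the EOS token and the pad token
batch = collate_causal_lm([[5, 6, 7, 0], [8, 9, 0]], pad_token_id=0)
print(batch["labels"])
```

In the toy batch the trailing `0` of each sequence is a genuine EOS, yet it is masked along with the padding. The same thing can happen in this notebook whenever the pad token is the EOS token (the fallback used in the tokenizer cell), in which case the model gets no loss signal for emitting EOS.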

In [27]:
# Configure CPT training parameters for sequence modeling
training_args = TrainingArguments(
    # Memory-efficient batch configuration
    per_device_train_batch_size=4,   # Small batch for GPU memory constraints
    gradient_accumulation_steps=8,   # Effective batch size = 8 * 4 = 32

    # Training duration and monitoring
    max_steps=1000,                  # Short demo run (increase for production)
    logging_steps=25,                # Log metrics every 25 steps
    save_steps=25,                   # Save a checkpoint every 25 steps
    eval_steps=25,                   # Evaluate every 25 steps
    eval_strategy='steps',

    # Stability and output configuration
    output_dir="./cpt_outputs",
    max_grad_norm=0.1,               # Aggressive gradient clipping for stable training
    report_to="wandb",               # Use wandb for experiment tracking
    run_name='qwen0.5B-cpt-swissprot',
    fp16=True,                       # Mixed-precision (FP16) training

    # Push checkpoints to the Hub (set to False to disable)
    push_to_hub=True,
    hub_model_id='khairi/Shizuku-0.5B',
)
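
The batch settings above determine how much of the dataset a run actually sees. The following back-of-the-envelope sketch works that out from the values in this notebook, under the assumption that there is a single data-parallel replica (`n_replicas = 1`), so the batch is not multiplied by the GPU count.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
max_steps = 1000
n_train_examples = 455_692  # training split size reported above
n_replicas = 1              # assumption: no data-parallel replication

# Sequences contributing to each optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_replicas
# Sequences consumed over the whole run
sequences_seen = effective_batch * max_steps
epoch_fraction = sequences_seen / n_train_examples

print(f"Effective batch size: {effective_batch}")
print(f"Sequences seen: {sequences_seen:,}")
print(f"Fraction of one epoch: {epoch_fraction:.1%}")
```

Under these assumptions, 1,000 steps cover roughly 7% of the training split, which is consistent with treating this configuration as a short demo run.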

### Optimizer and Scheduler Setup

Configure an **AdamW optimizer** with separate learning rates for the embedding and transformer-layer parameters of the model.  
Then set up a **cosine learning rate scheduler** with a short warmup (50 of the 1,000 training steps) followed by a slow cosine decay.

In [29]:
print("🚀 Initializing optimizer...")
print("🌙 Setting up cosine LR scheduler...")

optimizer = AdamW([
        {'params': model.model.embed_tokens.parameters(), 'lr': 1e-4},
        {'params': model.model.layers.parameters(), 'lr': 3e-4}
    ],
    betas=(0.99, 0.98),
    weight_decay=0.01
)

lr_scheduler = get_scheduler(
    name='cosine',
    optimizer=optimizer,
    num_warmup_steps=50,
    num_training_steps=1000
)

print("✅ Optimizer ready!")
print("✨ LR scheduler ready!")

🚀 Initializing optimizer...
🌙 Setting up cosine LR scheduler...
✅ Optimizer ready!
✨ LR scheduler ready!
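
To see what this schedule does to the two learning rates, the sketch below reproduces the multiplier in plain Python. It assumes the usual shape for a cosine schedule with warmup: a linear ramp from 0 to 1 over the warmup steps, then half a cosine period decaying to 0. `cosine_lr_multiplier` is a hypothetical helper written for illustration.

```python
import math

def cosine_lr_multiplier(step: int, warmup_steps: int = 50, total_steps: int = 1000) -> float:
    """Factor applied to each parameter group's base learning rate at `step`."""
    if step < warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine period: 1 at the end of warmup, 0 at the final step
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in (0, 25, 50, 525, 1000):
    mult = cosine_lr_multiplier(step)
    print(f"step {step:4d}: embeddings lr={1e-4 * mult:.2e}, layers lr={3e-4 * mult:.2e}")
```

Both parameter groups share the same multiplier, so the 1:3 ratio between the embedding and layer learning rates is preserved throughout training.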


In [30]:
trainer = Trainer(
    model=model,                           #  Qwen0.5B model
    train_dataset=train_data,              # Training dataset
    eval_dataset=valid_data,               # Evaluation dataset
    args=training_args,                    # Training configuration
    data_collator=data_collator,           # Batch handling
    optimizers=(optimizer, lr_scheduler)   # Optimization
)

In [31]:
# Execute CPT
print("🚀 Starting CPT...")

# Run the training process
trainer.train()

print("✅ Training completed successfully!")
print(f"💾 Model saved to: {training_args.output_dir}")

🚀 Starting CPT...


[34m[1mwandb[0m: Currently logged in as: [33mflursky[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


ValueError: Attempting to unscale FP16 gradients.
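
This error comes from PyTorch's gradient scaler. `fp16=True` enables mixed-precision training, which expects FP32 master weights, but the model here was loaded with `dtype=torch.float16`, so its gradients are FP16 and the scaler refuses to unscale them. One common way around this is to load the weights in FP32 and keep `fp16=True`. The sketch below assumes the FP32 copy (about twice the 0.92 GB measured earlier) fits in GPU memory; `load_model_for_amp` is a hypothetical helper, and this variant has not been run in this notebook.

```python
def load_model_for_amp(model_name: str = "Qwen/Qwen2.5-0.5B"):
    """Load FP32 weights so that fp16=True mixed precision can scale gradients."""
    # Imported lazily so the helper can be defined without a GPU environment
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        dtype=torch.float32,  # FP32 master weights; autocast still computes in FP16
    )

# Hypothetical usage: reload, then rebuild the optimizer and Trainer on the new model
# model = load_model_for_amp()
```

Because the optimizer holds references to the old parameters, it and the scheduler would need to be recreated after reloading the model.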

In [None]:
with torch.no_grad():
    print(tokenizer.decode(model.generate(tokenizer.encode("Hello World!", return_tensors="pt").to("cuda:0"))[0]))

In [None]:
with torch.no_grad():
    inputs = tokenizer(f"{protein_start} ", return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=16, top_k=250, do_sample=True)  # sampling options belong to generate(), not decode()
    print(tokenizer.decode(output_ids[0]))
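
Generated text follows the training format, with space-separated residues between the two markers, so it needs a small post-processing step before it can be used as a sequence. The sketch below shows one way to do that; `extract_sequence` is a hypothetical helper and the input strings are made up for illustration.

```python
def extract_sequence(generated_text: str,
                     start: str = "<start_protein>",
                     end: str = "<end_protein>") -> str:
    """Return the residues between the markers as a compact string."""
    body = generated_text.split(start, 1)[-1]  # drop everything up to the start marker
    body = body.split(end, 1)[0]               # drop the end marker and what follows
    return "".join(body.split())               # remove the spaces between residues

print(extract_sequence("<start_protein> M S L E Q <end_protein> <|endoftext|>"))
# MSLEQ
```

A generation cut off by `max_new_tokens` before the end marker is emitted is still handled: the helper simply returns the residues produced so far.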