<a href="https://colab.research.google.com/github/ashivashankars/CMPE255_Assignments/blob/main/UnSlothai_continued_pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Colab 5: Continued Pretraining (Domain Adaptation)

**Goal:** Continue pretraining an LLM on domain text (e.g., legal snippets) to adapt vocabulary/knowledge. Then do a short SFT step for instruction-following.

## Overview
First, we feed raw **domain documents** so the model learns the jargon. Then we do a short instruction fine-tune so it responds helpfully about that domain. You'll save an adapter and reload it later.

In [1]:
# %%capture
!pip -q install --upgrade pip
# Core libs
!pip -q install "unsloth>=2025.10.0" "transformers>=4.45.0" "datasets>=2.19.0" "accelerate>=1.0.0" "trl>=0.9.6" "peft>=0.13.0" "bitsandbytes>=0.44.0" "evaluate>=0.4.3" "scikit-learn>=1.5.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.

In [2]:
import os, random, numpy as np, torch, platform
from datetime import datetime
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
print("Timestamp:", datetime.now())
print("Python:", platform.python_version())
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Capability:", torch.cuda.get_device_capability(0))
else:
    print("⚠️ GPU not found. Colab > Runtime > Change runtime type > GPU is recommended.")

Timestamp: 2025-11-09 05:59:36.023549
Python: 3.12.12
Torch: 2.8.0+cu126
CUDA available: True
Device: NVIDIA A100-SXM4-40GB
Capability: (8, 0)


## Tiny inline domain corpus (always available)

In [3]:
import os, textwrap, json
os.makedirs("data", exist_ok=True)
legal_corpus = textwrap.dedent('''
SECTION 1. Definitions. "Agreement" means a written contract between parties.
SECTION 2. Termination. Either party may terminate upon material breach after notice.
SECTION 3. Confidentiality. Receiving Party shall protect Confidential Information using reasonable care.
''').strip()
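# Use context managers so both file handles are closed deterministically.
with open("data/legal_corpus.txt", "w") as f:
    f.write(legal_corpus)
with open("data/legal_corpus.txt") as f:
    print(f.read())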

SECTION 1. Definitions. "Agreement" means a written contract between parties.
SECTION 2. Termination. Either party may terminate upon material breach after notice.
SECTION 3. Confidentiality. Receiving Party shall protect Confidential Information using reasonable care.
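
The corpus above is a single string, so wrapping the whole file in one record gives the trainer exactly one example. A minimal sketch of an alternative, assuming each non-empty line can be treated as an independent section (the helper name `split_sections` is hypothetical and not part of the notebook):

```python
import textwrap

# Same three sections as the inline corpus above.
legal_corpus = textwrap.dedent('''
SECTION 1. Definitions. "Agreement" means a written contract between parties.
SECTION 2. Termination. Either party may terminate upon material breach after notice.
SECTION 3. Confidentiality. Receiving Party shall protect Confidential Information using reasonable care.
''').strip()


def split_sections(corpus: str) -> list[dict]:
    """Return one {"text": ...} record per non-empty line (hypothetical helper)."""
    return [{"text": line.strip()} for line in corpus.splitlines() if line.strip()]


records = split_sections(legal_corpus)
print(len(records))  # 3
```

A list shaped like `records` can be passed directly to `datasets.Dataset.from_list`.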


## Load base model (QLoRA)

In [4]:
from unsloth import FastLanguageModel
import torch
dtype = torch.bfloat16 if torch.cuda.is_available() else None
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-unsloth-bnb-4bit",
    max_seq_length=2048,
    dtype=dtype,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.11.2 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
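
The configuration above attaches rank-16 adapters to seven projections in every decoder layer. As a worked check, the number of trainable parameters can be derived by hand: each adapted linear layer gains `r * (in_features + out_features)` weights. The layer shapes below are assumptions taken from the public Llama-3.1-8B configuration rather than from this notebook; with them, the result agrees with the `Trainable parameters = 41,943,040` figure that Unsloth prints when training starts.

```python
# Layer shapes are assumptions from the public Llama-3.1-8B config:
# hidden size 4096, MLP size 14336, 32 layers, 8 KV heads of dimension 128.
HIDDEN, MLP, LAYERS, KV_DIM, RANK = 4096, 14336, 32, 8 * 128, 16

# (in_features, out_features) of each targeted projection in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters.
per_layer = sum(RANK * (n_in + n_out) for n_in, n_out in shapes.values())
lora_params = per_layer * LAYERS
print(lora_params)  # 41943040

# Share of the 8,072,204,288 total parameters reported in the training banner.
print(round(100 * lora_params / 8_072_204_288, 2))  # 0.52
```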


## Continued pretraining (text completion objective)

In [5]:
from datasets import Dataset
raw = Dataset.from_list([{"text": open("data/legal_corpus.txt").read()}])
def tok(b):
    out = tokenizer(b["text"], truncation=True, padding="max_length", max_length=1024)
    out["labels"] = out["input_ids"].copy()
    return out
tokenized = raw.map(tok, batched=True, remove_columns=["text"])

from unsloth import UnslothTrainer
from transformers import TrainingArguments
args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    max_steps=80,
    logging_steps=10,
    bf16=True,
    output_dir="out_cpt",
    save_strategy="no",
    report_to="none",
)
trainer = UnslothTrainer(model=model, args=args, train_dataset=tokenized, tokenizer=tokenizer)
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1 | Num Epochs = 80 | Total steps = 80
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


| Step | Training Loss |
|------|---------------|
| 10 | 9.0755 |
| 20 | 6.964 |
| 30 | 6.2011 |
| 40 | 5.9367 |
| 50 | 5.8214 |
| 60 | 5.7577 |
| 70 | 5.7201 |
| 80 | 5.7063 |


Unsloth: Will smartly offload gradients to save VRAM!


TrainOutput(global_step=80, training_loss=6.397830009460449, metrics={'train_runtime': 56.7956, 'train_samples_per_second': 22.537, 'train_steps_per_second': 1.409, 'total_flos': 3709436417802240.0, 'train_loss': 6.397830009460449, 'epoch': 80.0})
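
In the cell above, `labels` is a straight copy of `input_ids`, so with `padding="max_length"` the padded positions may also contribute to the loss, depending on the data collator in use. A common remedy is to set those labels to `-100`, the index that PyTorch's cross-entropy loss ignores by default. A minimal sketch on plain Python lists (the helper `mask_padding_labels` and the toy token ids are illustrative, not from the notebook):

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: one sequence of three real tokens padded to length five.
batch = {"input_ids": [[11, 12, 13, 0, 0]], "attention_mask": [[1, 1, 1, 0, 0]]}
batch["labels"] = mask_padding_labels(batch["input_ids"], batch["attention_mask"])
print(batch["labels"])  # [[11, 12, 13, -100, -100]]
```

Inside `tok`, a call like this could take the place of `out["input_ids"].copy()`, since a batched tokenizer call without `return_tensors` returns both `input_ids` and `attention_mask` as lists of lists.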

## Short SFT on domain QA (embedded)

In [7]:
qa = [
  {"q":"What triggers termination under the Agreement?", "a":"A material breach after notice."},
  {"q":"How should confidential information be protected?", "a":"Using reasonable care by the receiving party."},
]
from datasets import Dataset
train_ds = Dataset.from_list([{"messages":[
    {"role":"system","content":"You are a helpful legal assistant."},
    {"role":"user","content":x["q"]},
    {"role":"assistant","content":x["a"]},
]} for x in qa])

# Fix: Set the chat_template for the tokenizer
tokenizer.chat_template = "{% for message in messages %}{% if message['role'] == 'user' %}{{ '\n' + 'USER: ' + message['content'] + '\n' }}{% elif message['role'] == 'system' %}{{ '\n' + 'SYSTEM: ' + message['content'] + '\n' }}{% elif message['role'] == 'assistant' %}{{ '\n' + 'ASSISTANT: ' + message['content'] + '\n' }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '\n' + 'ASSISTANT: ' }}{% endif %}"

def to_tokens(batch):
    # Render each conversation to a single string with the chat template.
    texts = [
        tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=False)
        for m in batch["messages"]
    ]
    t = tokenizer(texts, truncation=True, padding="max_length", max_length=512)
    t["labels"] = t["input_ids"].copy()
    return t
tokenized = train_ds.map(to_tokens, batched=True, remove_columns=["messages"])

args2 = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=80,
    logging_steps=10,
    bf16=True,
    output_dir="out_cpt_sft",
    save_strategy="no",
    report_to="none",
)
trainer2 = UnslothTrainer(model=model, args=args2, train_dataset=tokenized, tokenizer=tokenizer)
trainer2.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2 | Num Epochs = 80 | Total steps = 80
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


| Step | Training Loss |
|------|---------------|
| 10 | 6.3013 |
| 20 | 0.0 |
| 30 | 0.0 |
| 40 | 0.0 |
| 50 | 0.0 |
| 60 | 0.0 |
| 70 | 0.0 |
| 80 | 0.0 |


TrainOutput(global_step=80, training_loss=0.7876677513122559, metrics={'train_runtime': 49.1984, 'train_samples_per_second': 13.009, 'train_steps_per_second': 1.626, 'total_flos': 3709436417802240.0, 'train_loss': 0.7876677513122559, 'epoch': 80.0})
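
The custom `chat_template` in the cell above wraps each message as a `ROLE: content` line surrounded by newlines. The plain-Python function below mirrors that Jinja template so the training text can be inspected without loading a tokenizer; `render_chat` is a hypothetical stand-in, not a tokenizer API.

```python
def render_chat(messages, add_generation_prompt=False):
    """Mimic the notebook's Jinja chat template with plain string formatting."""
    labels = {"system": "SYSTEM", "user": "USER", "assistant": "ASSISTANT"}
    parts = []
    for message in messages:
        label = labels.get(message["role"])
        if label is not None:  # the template silently skips any other role
            parts.append("\n" + label + ": " + message["content"] + "\n")
    if add_generation_prompt:
        parts.append("\nASSISTANT: ")
    return "".join(parts)


example = [
    {"role": "system", "content": "You are a helpful legal assistant."},
    {"role": "user", "content": "What triggers termination under the Agreement?"},
    {"role": "assistant", "content": "A material breach after notice."},
]
print(render_chat(example))
```

The template itself emits no end-of-sequence marker after the assistant turn. If the tokenizer does not append one either, nothing in these examples shows the model where an answer stops; appending `tokenizer.eos_token` to each assistant message is one common option.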

## Inference

In [11]:
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model)
prompt="Explain in one sentence what 'material breach' means in a contract."

# Ensure pad_token_id is set for the tokenizer
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# 1. Format the prompt string using the chat template
formatted_prompt = tokenizer.apply_chat_template(
    [{"role":"user","content":prompt}],
    tokenize=False, # Get string output
    add_generation_prompt=True
)

# 2. Tokenize the formatted prompt string to get input_ids and attention_mask
tokenized_input = tokenizer(
    formatted_prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=2048 # Use max_seq_length from model loading
).to(model.device)

# Pass input_ids, attention_mask, and explicitly pad_token_id/eos_token_id to generate
y = model.generate(
    input_ids=tokenized_input["input_ids"],
    attention_mask=tokenized_input["attention_mask"],
    max_new_tokens=96,
    do_sample=False,
    pad_token_id=tokenizer.pad_token_id, # Explicitly pass pad_token_id
    eos_token_id=tokenizer.eos_token_id  # Explicitly pass eos_token_id
)
print(tokenizer.decode(y[0], skip_special_tokens=True))


USER: Explain in one sentence what'material breach' means in a contract.

ASSISTANT:!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
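
The completion above is a single character repeated until the token budget runs out, and the SFT log reports a loss of exactly 0.0 from step 20 onward. Both are commonly signs that training broke down rather than converged, although the notebook does not diagnose the cause. A small sanity check along these lines could flag such runs automatically (both helper names are hypothetical):

```python
import math


def loss_looks_broken(losses):
    """Flag NaN/inf values or a loss that is exactly zero."""
    return any((not math.isfinite(x)) or x == 0.0 for x in losses)


def completion_is_degenerate(completion, min_length=8):
    """Flag completions that consist of one repeated character."""
    text = completion.strip()
    return len(text) >= min_length and len(set(text)) == 1


cpt_losses = [9.0755, 6.964, 6.2011, 5.9367, 5.8214, 5.7577, 5.7201, 5.7063]
sft_losses = [6.3013, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(loss_looks_broken(cpt_losses))  # False
print(loss_looks_broken(sft_losses))  # True
print(completion_is_degenerate("!" * 40))  # True
```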


## Save adapter

In [12]:
model.save_pretrained("llama31_cpt_adapter")
tokenizer.save_pretrained("llama31_cpt_adapter")
print("Saved to llama31_cpt_adapter")

Saved to llama31_cpt_adapter
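
The overview mentions reloading the adapter later, but the notebook stops at saving it. A minimal sketch of the reload step, assuming Unsloth accepts the saved adapter directory as `model_name` and resolves the base model from the adapter config. The wrapper `reload_adapter` is hypothetical, and the import is deferred so the function can be defined on a machine without a GPU:

```python
def reload_adapter(adapter_dir="llama31_cpt_adapter", max_seq_length=2048):
    """Load the saved LoRA adapter on top of its 4-bit base model (sketch)."""
    # Deferred import: unsloth is only required when the function is called.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,  # directory written by save_pretrained above
        max_seq_length=max_seq_length,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # switch to inference mode
    return model, tokenizer
```

In a fresh Colab session, `model, tokenizer = reload_adapter()` would then stand in for the load-and-train cells before running the inference cell.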


In [15]:
from google.colab import files
files.download("/content/data/legal_corpus.txt")
files.download("/content/huggingface_tokenizers_cache/models--unsloth--llama-3.1-8b-unsloth-bnb-4bit")
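
`files.download` is meant for individual files, so a directory such as the adapter folder is usually archived first. A minimal sketch using only the standard library; the temporary directory stands in for `llama31_cpt_adapter`, and the Colab call is left as a comment because it only works inside Colab:

```python
import os
import shutil
import tempfile

# Stand-in for the saved adapter directory (assumption for illustration).
workdir = tempfile.mkdtemp()
adapter_dir = os.path.join(workdir, "llama31_cpt_adapter")
os.makedirs(adapter_dir)
with open(os.path.join(adapter_dir, "adapter_config.json"), "w") as f:
    f.write("{}")

# make_archive appends ".zip" to the base name and returns the archive path.
zip_path = shutil.make_archive(
    os.path.join(workdir, "llama31_cpt_adapter"), "zip", adapter_dir
)
print(os.path.basename(zip_path))  # llama31_cpt_adapter.zip

# In Colab, the archive could then be downloaded with:
# from google.colab import files
# files.download(zip_path)
```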

