<a href="https://colab.research.google.com/github/akshatamadavi/data_mining/blob/main/unsloth_ai/05_continued_pretraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 05_continued_pretraining.ipynb: Unsloth Continued Pretraining (Causal LM)

**Goal:** Continue pretraining a small Unsloth model on **unlabeled text** (new language/domain) using a **causal language modeling** objective.

**What this notebook covers**
1. Environment & GPU check
2. Install dependencies
3. Load a tiny Unsloth model in **4-bit** and attach **LoRA** adapters for efficient updates
4. Build a small corpus (or load your own `.txt` files / JSONL)
5. Tokenize into contiguous blocks and train with `Trainer` plus `DataCollatorForLanguageModeling(mlm=False)`
6. Save checkpoint + quick inference demo

> Replace the toy corpus with your real data to teach the model a **new language or domain**.

In [None]:
#@title ⏱️ Setup: GPU check
import torch
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("Running on CPU; switch the Colab runtime to GPU for faster training.")

In [None]:
#@title 📦 Install libraries (Colab-friendly)
!pip -q install -U unsloth transformers datasets bitsandbytes peft accelerate sentencepiece
import os
os.environ["BITSANDBYTES_NOWELCOME"] = "1"

In [None]:
#@title 🔧 Config: model & training params

BASE_MODEL = "unsloth/SmolLM2-135M-Instruct-bnb-4bit"  # tiny + fast; swap for Gemma/Llama/Mistral if you have VRAM
OUTPUT_DIR = "outputs_continued_pretrain"

# Training
MAX_SEQ_LEN = 1024
BATCH_PER_DEVICE = 4
GRAD_ACCUM = 8
EPOCHS = 2  # increase for real training
LR = 2e-5
SEED = 42

print(dict(BASE_MODEL=BASE_MODEL, OUTPUT_DIR=OUTPUT_DIR, MAX_SEQ_LEN=MAX_SEQ_LEN,
           BATCH_PER_DEVICE=BATCH_PER_DEVICE, GRAD_ACCUM=GRAD_ACCUM,
           EPOCHS=EPOCHS, LR=LR, SEED=SEED))
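
The batch settings above combine into an effective batch size per optimizer update. As a rough sketch (plain arithmetic on the config values, assuming a single GPU and that every sequence is a full `MAX_SEQ_LEN` block; the helper name `tokens_per_optimizer_step` is ours, not part of any library):

```python
def tokens_per_optimizer_step(batch_per_device, grad_accum, max_seq_len, num_devices=1):
    """Return (sequences, tokens) consumed per optimizer update."""
    sequences = batch_per_device * grad_accum * num_devices
    return sequences, sequences * max_seq_len

# Values copied from the config cell above
seqs, toks = tokens_per_optimizer_step(batch_per_device=4, grad_accum=8, max_seq_len=1024)
print(f"{seqs} sequences / {toks} tokens per optimizer step")
```

Halving `BATCH_PER_DEVICE` while doubling `GRAD_ACCUM` keeps both numbers unchanged, which is the usual way to trade speed for memory.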

## 1) Load model in 4-bit and attach LoRA for continued pretraining
- LoRA adapters train a small set of added low-rank weights while the base weights stay frozen (efficient)
- 4-bit quantization keeps VRAM low
- We train with **causal LM** objective (next-token prediction)
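
To get a feel for why LoRA is cheap, the sketch below counts the extra trainable parameters for a single linear layer. The layer size is an illustrative assumption, not a value read from the model config, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer.

    LoRA learns two small matrices, A (r x d_in) and B (d_out x r);
    the frozen base weight is left untouched.
    """
    return r * d_in + d_out * r

# Illustrative layer size (an assumption, not taken from the model)
d_in, d_out, r = 1024, 1024, 32
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full weight: {full:,} params, LoRA r={r}: {added:,} params ({added / full:.2%})")
```

The count grows linearly with `r`, so doubling the rank doubles the adapter size.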

In [None]:
from unsloth import FastLanguageModel, is_bfloat16_supported

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = BASE_MODEL,
    max_seq_length = MAX_SEQ_LEN,
    dtype = None,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = SEED,
    max_seq_length = MAX_SEQ_LEN,
)
print("Model & tokenizer ready.")

## 2) Build / load your corpus
You can provide a folder of `.txt` files or a JSONL of `{"text": ...}` rows.

Below we create a **toy corpus** with a few sentences in a hypothetical new language/domain. Replace with your own.
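
For reference, a JSONL corpus is one JSON object per line, each with a `text` field. A minimal round-trip sketch using only the standard library (the temporary path and the two sample documents are placeholders):

```python
import json
import os
import tempfile

docs = ["First document.", "Second document, with ünïcode."]

# Hypothetical location; in Colab you would write to JSONL_PATH instead
path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")

# Write: one {"text": ...} object per line, keeping non-ASCII characters readable
with open(path, "w", encoding="utf-8") as f:
    for text in docs:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")

# Read back, skipping blank lines
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(len(rows), "rows; first:", rows[0]["text"])
```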

In [None]:
import os, json, glob
from datasets import Dataset

#@title 👉 Choose your corpus source
USE_TXT_FOLDER = False #@param {type:"boolean"}
TXT_FOLDER = "/content/corpus_txt" #@param {type:"string"}
JSONL_PATH = "/content/corpus.jsonl" #@param {type:"string"}

def ensure_toy_corpus():
    toy = [
        {"text": "Nolori safi tem. Vairu melek tora; kivar duneh. (NewLang)"},
        {"text": "Data sciencia praxi: version control, reproducibilis, experimentum tracking."},
        {"text": "Guidelines: tokens segmente, contextus longus, regulae syntaxicae novas."},
        {"text": "Conversatio: Q: 'Salve?' A: 'Pax et lumen!'"},
        {"text": "Domaino-medicus: symptomata, anamnesis, differentialis, consilium therapiae."}
    ]
    with open(JSONL_PATH, "w", encoding="utf-8") as f:
        for r in toy:
            f.write(json.dumps(r, ensure_ascii=False)+"\n")

def load_corpus_dataset():
    if USE_TXT_FOLDER and os.path.isdir(TXT_FOLDER):
        texts = []
        for p in glob.glob(os.path.join(TXT_FOLDER, "**/*.txt"), recursive=True):
            with open(p, "r", encoding="utf-8", errors="ignore") as f:
                txt = f.read().strip()
                if txt:
                    texts.append({"text": txt})
        if not texts:
            ensure_toy_corpus()
            print("No .txt files found; using toy corpus.")
            return Dataset.from_json(JSONL_PATH)
        return Dataset.from_list(texts)
    else:
        if not os.path.exists(JSONL_PATH):
            ensure_toy_corpus()
            print("Created toy JSONL corpus at", JSONL_PATH)
        return Dataset.from_json(JSONL_PATH)

train_raw = load_corpus_dataset()
train_raw

## 3) Tokenize into contiguous blocks for CLM
- We **concatenate** texts and split into blocks of `MAX_SEQ_LEN` tokens
- Use `DataCollatorForLanguageModeling` with `mlm=False` (causal LM)
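
The concatenate-and-split step can be illustrated on plain integer lists. This is a toy sketch with a block size of 4 and made-up token IDs; `group_into_blocks` is a hypothetical name, not the notebook's function.

```python
from itertools import chain

def group_into_blocks(tokenized_docs, block_size):
    """Concatenate token lists, then cut into full blocks; the tail is dropped."""
    flat = list(chain.from_iterable(tokenized_docs))
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]  # three "documents" of token IDs
blocks = group_into_blocks(docs, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]  (tokens 9 and 10 are dropped)
```

Note that a corpus with fewer tokens than one block yields no blocks at all under this scheme, which matters for very small test corpora.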

In [None]:
import itertools

# Reuse EOS as the pad token if the tokenizer does not define one
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(batch):
    # Append EOS to each document so document boundaries survive concatenation
    texts = [t + tokenizer.eos_token for t in batch["text"]]
    return tokenizer(texts, add_special_tokens=False)

# Drop the raw columns; only the tokenizer outputs are needed downstream
tokenized = train_raw.map(tokenize_function, batched=True, remove_columns=train_raw.column_names)

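def group_texts(examples):
    # Concatenate all token IDs in this batch into one long list
    concatenated = list(itertools.chain.from_iterable(examples["input_ids"]))
    total_length = (len(concatenated) // MAX_SEQ_LEN) * MAX_SEQ_LEN
    if total_length == 0:
        # Fewer tokens than one block (e.g. the toy corpus): keep a single
        # short block instead of producing an empty dataset
        total_length = len(concatenated)
    concatenated = concatenated[:total_length]
    # Split into blocks of at most MAX_SEQ_LEN tokens
    result = {
        "input_ids": [concatenated[i:i+MAX_SEQ_LEN] for i in range(0, total_length, MAX_SEQ_LEN)]
    }
    # For causal LM the labels are the inputs; the model shifts them internally
    result["labels"] = [ids.copy() for ids in result["input_ids"]]
    return result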

lm_ds = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)
lm_ds

In [None]:
#@title 4) Train with Trainer (causal LM)
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer
from unsloth import is_bfloat16_supported

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir = OUTPUT_DIR,
    per_device_train_batch_size = BATCH_PER_DEVICE,
    gradient_accumulation_steps = GRAD_ACCUM,
    num_train_epochs = EPOCHS,
    learning_rate = LR,
    warmup_ratio = 0.1,
    logging_steps = 10,
    save_steps = 200,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    optim = "adamw_8bit",
    seed = SEED,
)

trainer = Trainer(
    model = model,
    args = args,
    train_dataset = lm_ds,
    data_collator = data_collator,
)

trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print("Saved to:", OUTPUT_DIR)
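
To sanity-check `warmup_ratio`, `logging_steps`, and `save_steps` against your corpus size, it can help to estimate how many optimizer steps a run will take. The sketch below is an approximation under stated assumptions: a single GPU, a made-up corpus of 10,000 blocks, and ceiling rounding (the exact rounding inside `transformers` may differ by version). `estimate_schedule` is a hypothetical helper.

```python
import math

def estimate_schedule(num_blocks, batch_per_device, grad_accum, epochs, warmup_ratio):
    """Approximate (total optimizer steps, warmup steps) for a single-GPU run."""
    steps_per_epoch = max(1, math.ceil(num_blocks / (batch_per_device * grad_accum)))
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps

# 10,000 blocks is a made-up corpus size; the other values match the config cell
total, warmup = estimate_schedule(10_000, batch_per_device=4, grad_accum=8,
                                  epochs=2, warmup_ratio=0.1)
print(f"~{total} optimizer steps, ~{warmup} of them warmup")
```

With a very small corpus the total can fall below `logging_steps` or `save_steps`, in which case no intermediate logs or checkpoints appear.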

## 5) Quick inference sanity check
Generate text in the **new language/domain** to see if the model picked up patterns.

In [None]:
import torch

def generate(prompt, max_new_tokens=120, temperature=0.7, top_p=0.9):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True,
                             temperature=temperature, top_p=top_p)
    return tokenizer.decode(out[0], skip_special_tokens=True)

test_prompt = "Nolori safi tem:"
print("\n=== Sample Generation ===\n")
print(generate(test_prompt))
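
The `temperature` and `top_p` arguments control how the next token is sampled. The following is a simplified pure-Python sketch of the idea, not the exact implementation used by `model.generate`; the logits are made up and `top_p_filter` is a hypothetical helper.

```python
import math

def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p sampling."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract max for numerical stability
    z = sum(exps)
    probs = [e / z for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; sampling would draw from this distribution
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

nucleus = top_p_filter([2.0, 1.0, 0.1, -1.0])
print(nucleus)
```

Lower `temperature` or `top_p` values make the output more deterministic, which is useful when checking whether the model reproduces corpus patterns.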

## 6) (Optional) Export to Ollama (manual Modelfile)
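Ollama can apply a LoRA adapter on top of a base model through the `ADAPTER` instruction in a **Modelfile**. The simplest path is to export the PEFT adapter and point Ollama at the base model plus that adapter.

Below we just create a skeleton `Modelfile`. `FROM` must name a base model that Ollama can load (an Ollama model name or a local model path), so you will likely need to adjust it and the adapter path on your machine.


In [None]:
from pathlib import Path
modelfile_path = Path(OUTPUT_DIR)/"Modelfile"
modelfile_text = f"""
# Example Modelfile for Ollama (adjust paths on your system)
# FROM must point at a base model Ollama can load, not necessarily the Hugging Face ID below
FROM {BASE_MODEL}
# ADAPTER takes the path to the saved LoRA adapter
ADAPTER {Path(OUTPUT_DIR).resolve()}
SYSTEM "You are a helpful assistant."
"""
modelfile_path.write_text(modelfile_text)
print("Wrote Modelfile to:", modelfile_path)
print("Next on your machine:  ollama create my-continued-model -f", modelfile_path)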


---
### Notes & Tips
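- For **larger corpora**, tune `EPOCHS` and batch/grad-accum (one or two passes is often enough once the data is large). If you hit OOM, reduce `BATCH_PER_DEVICE` (raising `GRAD_ACCUM` to keep the effective batch size) or `MAX_SEQ_LEN`.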
- To use a different base (e.g., `unsloth/gemma-3-1b-it-bnb-4bit`), just change `BASE_MODEL` and re-run.
- If your tokenizer has no pad token, set `tokenizer.pad_token = tokenizer.eos_token` (already handled above). Tokenizers with custom BOS/EOS tokens may need further handling.
- Keep text **clean & UTF-8**. For multilingual corpora, ensure the base model supports that script/tokenization reasonably well.
- For full finetuning (no LoRA), load in higher precision and skip `get_peft_model` (requires more VRAM).
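
As one possible way to keep text clean and UTF-8, the sketch below decodes raw bytes, normalizes to NFC, and strips control characters. It is a minimal example, not a complete cleaning pipeline; `clean_text` is a hypothetical helper, and unlike the loader above it replaces undecodable bytes with a marker instead of ignoring them.

```python
import unicodedata

def clean_text(raw: bytes) -> str:
    """Decode bytes as UTF-8, normalize to NFC, and drop control characters."""
    # Undecodable bytes become U+FFFD so they stay visible for inspection
    text = raw.decode("utf-8", errors="replace")
    # NFC merges base letters with combining marks where a composed form exists
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (category Cc) except newline and tab
    text = "".join(ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    return text.strip()

sample = "Cafe\u0301 \x00menu\r\n".encode("utf-8")
print(repr(clean_text(sample)))
```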

Happy continued pretraining! 🦥