# Install Libraries

In [None]:
%%capture
!pip install unsloth "xformers==0.0.28.post2"
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Load Llama 3.2 3B Instruct

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

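We now add LoRA adapters. LoRA on its own typically means updating only 1 to 10% of all parameters. Because this configuration also trains embed_tokens and lm_head for continued pretraining, the trainable share is larger: the trainer later reports 982,515,712 trainable parameters.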

### get_peft_model()

FastLanguageModel.get_peft_model() takes the base model loaded above, together with the LoRA settings (rank, target modules, alpha, dropout, and so on) passed as keyword arguments, and returns the model configured for training with LoRA.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,   # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Offloading input_embeddings to disk to save VRAM


  offloaded_W = torch.load(filename, map_location = "cpu", mmap = True)


Unsloth: Offloading output_embeddings to disk to save VRAM


Unsloth 2025.2.15 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


Unsloth: Training embed_tokens in mixed precision to save VRAM
Unsloth: Training lm_head in mixed precision to save VRAM
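The cell above pairs `r = 128` with `lora_alpha = 32` and `use_rslora = True`. As a rough sketch of why that combination matters: standard LoRA scales the adapter update by `lora_alpha / r`, while rank-stabilized LoRA scales it by `lora_alpha / sqrt(r)`. The helper name `lora_scaling` is made up for illustration and is not part of Unsloth or PEFT.

```python
import math

def lora_scaling(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update (hypothetical helper)."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r                 # standard LoRA

standard = lora_scaling(32, 128, use_rslora=False)
rank_stabilized = lora_scaling(32, 128, use_rslora=True)
print(f"standard LoRA scaling: {standard:.3f}")
print(f"rsLoRA scaling:        {rank_stabilized:.3f}")
```

With these settings the rsLoRA multiplier is about 11 times larger than the standard one, so learning rates tuned for plain LoRA may not transfer directly.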


## Example output before training

In [None]:
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(

        "What are the symptoms of schizophrenia?", # instruction
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the symptoms of schizophrenia?\n\n### Response:\nSchizophrenia is a chronic and severe mental disorder that affects how a person thinks, feels, and behaves. The symptoms of schizophrenia can vary widely from person to person, but common symptoms include:\n\n*   **Hallucinations**: Seeing, hearing, or feeling things that are not there\n*   **Delusions**:']

# Continued Pretraining

## Load Markdown Data

In [None]:
import re
import os

# Define the maximum word limit for each section
MAX_WORDS = 500

# Define EOS_TOKEN
EOS_TOKEN = tokenizer.eos_token

# Function to split text into chunks of a specific word limit
def split_into_chunks(text, max_words):
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_words):
        chunks.append(" ".join(words[i:i + max_words]))
    return chunks

# Function to process a single Markdown document
def process_markdown(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Extract the title from the first `#` header
    title_match = re.search(r"^#\s+(.*)$", content, re.MULTILINE)
    title = title_match.group(1).strip() if title_match else "Untitled"

    # Extract sections starting with `##`
    sections = re.split(r"^##\s+(.*)$", content, flags=re.MULTILINE)

    # The first part is the content before the first `##`, ignore it
    sections = sections[1:] if len(sections) > 1 else []

    documents = []

    # Process each section
    for i in range(0, len(sections), 2):
        section_title = sections[i].strip()
        section_text = sections[i + 1].strip() if i + 1 < len(sections) else ""

        # Split the section text into chunks if it exceeds the word limit
        chunks = split_into_chunks(section_text, MAX_WORDS)

        for chunk in chunks:
            documents.append({
                "title": title,
                "text": f"### {section_title}\n{chunk}{EOS_TOKEN}"
            })

    return documents

# Function to process multiple Markdown files
def process_markdown_files(folder_path):
    all_documents = []
    for file_name in sorted(os.listdir(folder_path)):  # sorted for a reproducible document order
        if file_name.endswith('.md'):
            file_path = os.path.join(folder_path, file_name)
            all_documents.extend(process_markdown(file_path))
    return all_documents

# Example usage
if __name__ == "__main__":
    folder_path = "/content/datasets"  # Replace with your folder path
    documents = process_markdown_files(folder_path)

    # Print the processed documents
    for doc in documents:
        print(doc)

    # Optionally, save the results to a file
    import json
    with open("processed_documents.json", "w", encoding="utf-8") as output_file:
        json.dump(documents, output_file, indent=4, ensure_ascii=False)


{'title': 'SCHIZOPHRENIA', 'text': '### What is schizophrenia?\nSchizophrenia is a chronic and severe disorder that affects how a person thinks, feels, and acts. Although schizophrenia is not as common as other mental disorders, it can be very disabling. Approximately 7 or 8 individuals out of 1,000 will have schizophrenia in their lifetime. People with the disorder may hear voices or see things that aren’t there. They may believe other people are reading their minds, controlling their thoughts, or plotting to harm them. This can be scary and upsetting to people with the illness and make them withdrawn or extremely agitated. It can also be scary and upsetting to the people around them. People with schizophrenia may sometimes talk about strange or unusual ideas, which can make it difficult to carry on a conversation. They may sit for hours without moving or talking. Sometimes people with schizophrenia seem perfectly fine until they talk about what they are really thinking. Families and 
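To make the splitting logic above easier to follow, here is a small self-contained sketch that runs the same regular expression and word-based chunking on an in-memory string. The sample Markdown and the chunk size of 2 are invented for illustration; the notebook itself reads `.md` files and uses `MAX_WORDS = 500`.

```python
import re

# Invented sample document, used only to illustrate the splitting behaviour.
sample = """# Demo Title

Intro text before the first section is ignored.

## First section
one two three four five

## Second section
six seven
"""

# re.split with a capturing group keeps the captured headings in the result,
# so after dropping the preamble the list alternates: heading, body, heading, body.
parts = re.split(r"^##\s+(.*)$", sample, flags=re.MULTILINE)[1:]
sections = [(parts[i].strip(), parts[i + 1].strip()) for i in range(0, len(parts), 2)]

def split_into_chunks(text, max_words):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

chunked = {title: split_into_chunks(body, 2) for title, body in sections}
print(chunked)
```

Note that `text.split()` collapses all whitespace, so line breaks and list formatting inside a section are not preserved in the chunks.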

## Convert the documents into a PyArrow-backed dataset

In [None]:
from datasets import Dataset
import pyarrow as pa

def convert_to_arrow_dataset(documents):
    # Convert the list of dictionaries to a Dataset
    arrow_table = pa.table({
        "title": [doc["title"] for doc in documents],
        "text": [doc["text"] for doc in documents],
    })
    return Dataset(arrow_table)

In [None]:
dataset = convert_to_arrow_dataset(documents)

## Use UnslothTrainer to configure and run continued pretraining (CPT)

In [None]:
from unsloth import is_bfloat16_supported
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 60,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Converting train dataset to ChatML (num_proc=2):   0%|          | 0/32 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/32 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/32 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/32 [00:00<?, ? examples/s]
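The trainer above combines `warmup_steps = 10`, `max_steps = 60`, and `lr_scheduler_type = "linear"`. The following is a minimal sketch of the usual linear warmup and decay shape. The function name is hypothetical, and applying the same multiplier to `embedding_learning_rate` is an assumption about how the two learning rates are scheduled, not something the notebook confirms.

```python
def linear_schedule_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to 1, then linear decay from 1 back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 5e-5        # learning_rate from the cell above
EMBEDDING_LR = 1e-5   # embedding_learning_rate from the cell above
WARMUP_STEPS, TOTAL_STEPS = 10, 60

for step in (0, 5, 10, 35, 60):
    m = linear_schedule_multiplier(step, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>2}: lr = {BASE_LR * m:.2e}, embedding lr = {EMBEDDING_LR * m:.2e}")
```

Under this shape the peak learning rate is reached at step 10 and has halved again by step 35.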

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.357 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 32 | Num Epochs = 30
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 8
\        /    Total batch size = 16 | Total steps = 60
 "-____-"     Number of trainable parameters = 982,515,712


| Step | Training Loss |
|------|---------------|
| 1    | 2.1117        |
| 2    | 2.0526        |
| 3    | 2.0489        |
| 4    | 1.9935        |
| 5    | 1.9548        |
| 6    | 1.7647        |
| 7    | 1.6731        |
| 8    | 1.627         |
| 9    | 1.4857        |
| 10   | 1.3512        |
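The training banner reports 982,515,712 trainable parameters. As a back-of-the-envelope check, the sketch below reproduces that figure under two assumptions that do not come from this notebook: the layer dimensions are taken from the public Llama 3.2 3B configuration, and `embed_tokens` and `lm_head` are counted as fully trained matrices rather than low-rank adapters.

```python
# Assumed architecture values for Llama 3.2 3B (public config, not stated in this notebook).
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 28        # consistent with "patched 28 layers" in the log
KV_DIM = 8 * 128       # 8 key/value heads of dimension 128
VOCAB = 128_256
RANK = 128             # r passed to get_peft_model

# (in_features, out_features) of each adapted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is rank x in_features and B is out_features x rank."""
    return rank * (in_features + out_features)

adapter_params = NUM_LAYERS * sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
embedding_params = 2 * VOCAB * HIDDEN  # embed_tokens + lm_head, counted in full (assumption)
total = adapter_params + embedding_params
print(f"adapters:   {adapter_params:,}")
print(f"embeddings: {embedding_params:,}")
print(f"total:      {total:,}")  # 982,515,712
```

Under these assumptions roughly 80% of the trainable parameters sit in the two embedding matrices, which is why the trainable share is far above what LoRA adapters alone would give.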


## Inference after Training

In [None]:
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(

        "What are the symptoms of schizophrenia?", # instruction
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)


['<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the symptoms of schizophrenia?\n\n### Response:\nSymptoms of schizophrenia include a range of problems. Sometimes people with schizophrenia have positive symptoms. These are symptoms that people with schizophrenia do not have, but they should. These symptoms include the following: Hallucinations are sensory experiences that occur in the absence of a stimulus. These experiences can occur in any of the five senses']

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(

        "What are some of the symptoms of bipolar disorder?", # instruction
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 100, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are some of the symptoms of bipolar disorder?\n\n### Response:\nPeople with bipolar disorder go through unusual mood changes. Sometimes they feel very happy and “up,” and are much more energetic and active than usual. This is called a “manic episode.” Sometimes people with bipolar disorder feel very sad and “down,” have low energy, and are much less active. This is called depression or a “depressive episode.” Mood swings that are severe and persistent are symptoms of bipolar disorder. Other symptoms may include:<|eot_id|>']
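`tokenizer.batch_decode(outputs)` returns the prompt, the generated text, and special tokens in one string, as the outputs above show. A minimal sketch for isolating just the generated answer with plain string handling follows. The helper name `extract_response` is hypothetical, and the decoded string is a shortened copy of the output above.

```python
RESPONSE_MARKER = "### Response:\n"
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|eot_id|>")  # tokens visible in the outputs above

def extract_response(decoded: str) -> str:
    """Return only the text generated after the response marker (hypothetical helper)."""
    _, _, response = decoded.partition(RESPONSE_MARKER)
    for token in SPECIAL_TOKENS:
        response = response.replace(token, "")
    return response.strip()

# Shortened copy of a decoded output, for illustration only.
decoded = (
    "<|begin_of_text|>Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWhat are some of the symptoms of bipolar disorder?\n\n"
    "### Response:\nPeople with bipolar disorder go through unusual mood changes.<|eot_id|>"
)
answer = extract_response(decoded)
print(answer)
```

If the marker is missing from the text, `str.partition` leaves the third element empty, so the helper returns an empty string rather than raising.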

# Instruction Fine-Tuning (IFT)

In [None]:

alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

# Load JSONL dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="/content/datasets/instruct_prompts.jsonl", split="train")
EOS_TOKEN = tokenizer.eos_token

# Formatting function adjusted for 'prompt' and 'completion'
def formatting_prompts_func(examples):
    prompts = examples["prompt"]
    completions = examples["completion"]
    texts = []
    for prompt, completion in zip(prompts, completions):
        # Format prompt and completion with EOS token
        text = alpaca_prompt.format(prompt, completion) + EOS_TOKEN
        texts.append(text)
    return { "text": texts }

# Apply the formatting function
dataset = dataset.map(formatting_prompts_func, batched=True)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/40 [00:00<?, ? examples/s]
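To show what a single formatted training example looks like, here is a standalone sketch of the same formatting step. The record contents are invented, and `"<EOS>"` is a placeholder for `tokenizer.eos_token`, which is only available once the tokenizer is loaded.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token

def formatting_prompts_func(examples):
    """Build one training string per (prompt, completion) pair in a batch."""
    texts = [
        ALPACA_PROMPT.format(prompt, completion) + EOS_TOKEN
        for prompt, completion in zip(examples["prompt"], examples["completion"])
    ]
    return {"text": texts}

# Invented batch in the column-oriented shape that dataset.map(batched=True) passes in.
batch = {
    "prompt": ["What is schizophrenia?"],
    "completion": ["A chronic mental disorder."],
}
formatted = formatting_prompts_func(batch)["text"][0]
print(formatted)
```

Appending the EOS token matters here: without it the model has no signal for where a completion ends.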

## Using SFTTrainer for IFT

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Converting train dataset to ChatML (num_proc=2):   0%|          | 0/40 [00:00<?, ? examples/s]

Applying chat template to train dataset (num_proc=2):   0%|          | 0/40 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/40 [00:00<?, ? examples/s]

Tokenizing train dataset (num_proc=2):   0%|          | 0/40 [00:00<?, ? examples/s]

In [None]:

#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
14.578 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 40 | Num Epochs = 12
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 982,515,712


| Step | Training Loss |
|------|---------------|
| 1    | 3.47          |
| 2    | 3.5448        |
| 3    | 2.4135        |
| 4    | 1.5681        |
| 5    | 1.1271        |
| 6    | 0.6415        |
| 7    | 0.5675        |
| 8    | 0.5697        |
| 9    | 0.7558        |
| 10   | 0.7063        |
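Both training banners report a total batch size and an epoch count that follow from the dataset size and the trainer arguments. The sketch below reproduces those numbers. The function name is hypothetical, and the rounding used is an assumption that happens to match both logs.

```python
import math

def epochs_for_run(num_examples, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Return (effective batch size, optimizer steps per epoch, epochs needed)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, math.ceil(max_steps / steps_per_epoch)

cpt = epochs_for_run(num_examples=32, per_device_batch=2, grad_accum=8, max_steps=60)
ift = epochs_for_run(num_examples=40, per_device_batch=2, grad_accum=4, max_steps=60)
print("CPT:", cpt)  # (16, 2, 30)
print("IFT:", ift)  # (8, 5, 12)
```

With 30 and 12 passes over such small datasets, the falling training loss likely reflects repeated exposure to the same examples as much as generalization.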


## Inference after IFT

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "What are some of the symptoms of bipolar disorder?",  # Instruction
            ""  # Leave output blank for generation
        )
    ],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 150, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are some of the symptoms of bipolar disorder?\n\n### Response:\nBipolar Disorder symptoms include mood episodes of mania or depression.{"type": "symptoms", "context": "Bipolar Disorder", "symptoms": ["Feeling unusually upbeat or irritable", "Increased energy or activity", "Decreased need for sleep", "Racing thoughts", "Risky behaviors"]}{"type": "symptoms", "context": "Bipolar Disorder", "symptoms": ["Feeling unusually upbeat or irritable", "Increased energy or activity", "Decreased need for sleep", "Racing thoughts", "Risky behaviors"]}{"type": "symptoms", "context": "Bipolar Disorder", "symptoms": ["Feeling unusually upbeat or irritable", "Increased energy']
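In the output above, the response is a sentence followed by the same JSON object repeated until `max_new_tokens` runs out. One possible way to post-process such text, sketched here with the standard library only, is to read just the first complete JSON value with `json.JSONDecoder.raw_decode`. The helper name is hypothetical, and the response string is a shortened copy of the output above.

```python
import json

def first_json_object(text: str):
    """Split text into the prose before the first '{' and the first JSON value after it."""
    start = text.find("{")
    if start == -1:
        return text, None
    # raw_decode parses one JSON value starting at `start` and ignores trailing text.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return text[:start], obj

# Shortened copy of the generated response, including a truncated repeat at the end.
response = (
    "Bipolar Disorder symptoms include mood episodes of mania or depression."
    '{"type": "symptoms", "context": "Bipolar Disorder", '
    '"symptoms": ["Racing thoughts", "Risky behaviors"]}'
    '{"type": "symptoms", "context": "Bipolar Disorder", "symptoms": ["Racing thoughts"'
)
prose, payload = first_json_object(response)
print(prose)
print(payload["symptoms"])
```

If the first object is itself truncated, `raw_decode` raises `json.JSONDecodeError`, so callers may want to catch that case.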

# Saving LoRA adapters and .gguf files

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.2G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.1 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 50%|█████     | 14/28 [00:00<00:00, 19.48it/s]
We will save to Disk and not RAM now.
100%|██████████| 28/28 [00:57<00:00,  2.06s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at model into f16 GGUF format.
The output location will be /content/model/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin

In [None]:
!zip -r /content/model.zip /content/model

  adding: content/model/ (stored 0%)
  adding: content/model/pytorch_model-00002-of-00002.bin (deflated 8%)
  adding: content/model/unsloth.Q4_K_M.gguf (deflated 2%)
  adding: content/model/tokenizer_config.json (deflated 94%)
  adding: content/model/special_tokens_map.json (deflated 71%)
  adding: content/model/tokenizer.json (deflated 85%)
  adding: content/model/pytorch_model-00001-of-00002.bin