<a href="https://colab.research.google.com/github/arminwitte/mistral-peft/blob/main/unsloth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Domain Adaptation using QLoRA

This notebook is set to demonstrate the fine-tuning of the language model mistral 7b to data from my dissertation

[Witte, A. (2018). Dynamics of unsteady heat transfer and skin friction in pulsating flow across a cylinder (Doctoral dissertation, Technische Universität München).](https://mediatum.ub.tum.de/doc/1436573/document.pdf)

Training is achieved in two parts. In part 1 the LoRA is pre-trained with self-supervised training (CLM). Afterwards, in part 2, the adapter is fine-tuned for instructions using supervised training.

**Notes:**
- Data preparation is very rudimentary and might impair the quality of the response.
- Instruction data for supervised training was crated using [alpacarizer](https://github.com/arminwitte/alpacarizer)
- The notebook is set to run in Google Colab on a T4 GPU

In [1]:
# Clone the git repo to access the utilities
!git clone https://github.com/arminwitte/mistral-peft mistralpeft

fatal: destination path 'mistralpeft' already exists and is not an empty directory.


In [None]:
# Make sure to be on the repo directory and hard reset to the latest commit
import os
from pathlib import Path
git_path = Path("/content/mistralpeft")
if not os.getcwd() == git_path:
    os.chdir(git_path)
!pwd
!git fetch --all
!git reset --hard origin/main

/content/mistralpeft
Fetching origin
HEAD is now at af14699 added instruction data created with alpacarizer


In [3]:
# Access Google Drive to permanently save e.g. checkpoints
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Set the path to the LoRA weights
path_to_lora_weights = Path("/content/drive/MyDrive/mistral-peft/lora_weights")

In [None]:
# Install the required packages from pypi or github

# https://github.com/unslothai/unsloth/issues/1181#issuecomment-2439923851
!pip install pip3-autoremove
# !pip-autoremove torch torchvision torchaudio -y
!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu124
!pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

!pip install -r requirements.txt

Looking in indexes: https://download.pytorch.org/whl/cu124
Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-9ohdvf_1/unsloth_ca5fbf9c57f144d0b003007655499d46
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-9ohdvf_1/unsloth_ca5fbf9c57f144d0b003007655499d46
  Resolved https://github.com/unslothai/unsloth.git to commit 2b5d81d75281c02480927cf3ca0dea7c8e98d484
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [None]:
# Load packages
import unsloth
from transformers import Trainer, TrainingArguments
from huggingface_hub import login
from peft import PeftModel
from datasets import Dataset
import json
from google.colab import userdata
from unsloth import FastLanguageModel
import torch
import matplotlib.pyplot as plt
import json
import gc
from trl import SFTTrainer

from mistralpeft.utils import TextExtractor, CLMPreprocessor

In [None]:
# Login to HuggingFace using Colab's secrets to be able to download models
login(userdata.get("HF_TOKEN"))

In [7]:
bananas

NameError: name 'bananas' is not defined

## Part 1: Self-supervised pretraining

### 1. Extract and transform text from dissertation (PDF)

In [None]:
# TextExtractor is a simple ET(L) class to acquire a text corpus
pdf_files = [
    "Dissertation.pdf",
]

pdf_urls = [
    "https://mediatum.ub.tum.de/doc/1360567/1360567.pdf",
    "https://mediatum.ub.tum.de/doc/1601190/1601190.pdf",
    "https://mediatum.ub.tum.de/doc/1597610/1597610.pdf"
    "https://mediatum.ub.tum.de/doc/1584750/1584750.pdf",
    "https://mediatum.ub.tum.de/doc/1484812/1484812.pdf",
    "https://mediatum.ub.tum.de/doc/1335646/1335646.pdf",
    "https://mediatum.ub.tum.de/doc/1326486/1326486.pdf",
    "https://mediatum.ub.tum.de/doc/1306410/1306410.pdf",
    "https://mediatum.ub.tum.de/doc/1444929/1444929.pdf",
]

data_path = Path("data/processed_documents.json")
if True: # not data_path.is_file():
    with TextExtractor("data/processed_documents.json") as extractor:
        # Process local files
        extractor.process_documents(pdf_files)

        # Process URLs
        #extractor.process_documents(pdf_urls, url_list=True)
        pass

### 2. Load and Prepare Model

In [None]:
#Load model and tokenizer from unsloth
max_seq_length = 2*2*2048 # Choose any! We auto support RoPE Scaling internally!
dtype = torch.float16 # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2025.3.9: Fast Mistral patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


KeyboardInterrupt: 

In [None]:
# Set the LoRA
peft_model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # 0 is optimized
    bias = "none",    # "none" is optimized
    use_gradient_checkpointing = "unsloth",
    random_state = 4711,
    use_rslora = False,  # rank stabilized LoRA
    loftq_config = None, # LoftQ
)

### 3. Prepare CLM Training Data

In [None]:
# Preprocess the corpus for Causal Language Modeling (CLM)
json_file_paths = ["data/processed_documents.json"]
preprocessor = CLMPreprocessor(json_file_paths, tokenizer)
dataset = preprocessor.preprocess()

In [None]:
# Split into training and test set
train_test_set = dataset.train_test_split(test_size=0.1)
print(f"Created {len(train_test_set['train'])} training examples and {len(train_test_set['test'])} test examples")

# Preview a training example
example = train_test_set["train"][0]
print("Example input:")
print(preprocessor.tokenizer.decode(example['input_ids'][:256]))

In [None]:
# Show a sequence from the tokenized input and labels.
# For CLM the labels should be shifted to the left by 1.
print(train_test_set["train"]["input_ids"][0][16:33])
print(train_test_set["train"]["labels"][0][16:33])

### 4. Train the Model

In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/mistral-peft/results/2025-03-12",
    num_train_epochs=4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=3, # Creates a virtual batch size of 3
    learning_rate=3e-4,
    fp16=True, # numerical precision of adapter is float16
    logging_steps=1,
    eval_strategy="steps",
    eval_steps=10,
    save_strategy="steps",
    save_steps=10,
    load_best_model_at_end=True,
    optim="paged_adamw_8bit", # Memory efficient optimizer
    log_level="info",
    report_to="none",
)

# Initialize trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_test_set['train'],
    eval_dataset=train_test_set['test']
)

In [None]:
# Start training
trainer.train(resume_from_checkpoint=False)#"/content/drive/MyDrive/mistral-peft/results/2025-02-26/checkpoint-264")

In [None]:
# Save the LoRA adapter weights
peft_model.save_pretrained(path_to_lora_weights)

In [None]:
# Plot the training loss
import glob
paths = glob.glob("/content/drive/MyDrive/mistral-peft/results/2025-03-12/checkpoint-*")
train_info_path = Path(sorted(paths)[-1]) / "trainer_state.json"
with open(train_info_path, "r") as f:
    train_info = json.load(f)

print(train_info.keys())
train_steps = [int(d["step"]) for d in train_info["log_history"] if "loss" in d]
train_loss = [float(d["loss"]) for d in train_info["log_history"] if "loss" in d]
eval_steps = [int(d["step"]) for d in train_info["log_history"] if "eval_loss" in d]
eval_loss = [float(d["eval_loss"]) for d in train_info["log_history"] if "eval_loss" in d]
fig, ax = plt.subplots()
ax.plot(train_steps, train_loss)
ax.plot(eval_steps, eval_loss)
ax.set_xlabel("Steps")
ax.set_ylabel("Training Loss")
ax.grid(True)
ax.title.set_text("Loss over steps")
plt.show()

### 5. Test the Model

In [None]:
def test_continuation(model_token_list):# Example queries

    for model, tokenizer in model_token_list:
        # model.generation_config.pad_token_id = tokenizer.pad_token_id
        FastLanguageModel.for_inference(model)

    orig = ["SI is used when processes are either too complex to gain insight using first principles, i.e. physical laws, or the calculation is too costly in terms of time or resources. Its goal is to obtain an efficient and accurate model for a process from acquired data. The properties of this model can be analyzed to get insight in the process, or the model may be used for further calculations.",
            "In pulsating or oscillating flow, heat transfer can damp, but also drive instabilities. The first observations of this thermoacoustic effect caused by heat transfer were reported by Sondhauß and Rijke.",
            "the unit step response gives the cumulated contribution at each instant in time. The step response is the output acquired in answer to a Heaviside step function"
            ]

    queries = [
        "SI is used when processes are either too complex to gain insight using first principles, i.e. physical laws, or the calculation is too costly in terms of time or resources. Its goal is to",
        "In pulsating or oscillating flow, heat transfer can damp, but also",
        "the unit step response gives the cumulated contribution at",
    ]

    # Generate responses
    for i, query in enumerate(queries):

        print(f"\nOriginal:\t\t {orig[i]}")
        print(f"\nQuery:\t\t\t {query}")
        j = 0
        for model, tokenizer in model_token_list:
            inputs = tokenizer(query, return_tensors='pt').to("cuda")
            outputs = model.generate(
                input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=50,
                num_return_sequences=1,
                temperature=0.3,  # Using Ollama's likely default
                top_p=0.9,
                top_k=10,
                repetition_penalty=1.2,
                do_sample=True,
            )
            text = tokenizer.decode(outputs[0], skip_special_tokens=True)
            print(f"\nResponse model {j}:\t {text}")
            j += 1
        print("-" * 80)

In [None]:
# Run the test in the original model and the fine-tuned model
test_continuation([(model, tokenizer), (peft_model, tokenizer)])

## Part 2: Supervised instruction finetuning

In [None]:
# clean up
gc.collect()
torch.cuda.empty_cache()

### 1. Load and prepare model

In [None]:
# Get the quantized instruct model and the corresponding tokenizer
instruct_model, instruct_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

==((====))==  Unsloth 2025.3.9: Fast Mistral patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
# Prepare the instruct model with the adapter
instruct_model.enable_input_require_grads()
peft_model = PeftModel.from_pretrained(instruct_model, path_to_lora_weights, is_trainable=True)
peft_model._mark_only_adapters_as_trainable(peft_model)

### 2. Prepare instruction data

In [None]:
# load data from json file
with open("data/instruction_data.json", "r") as f:
    instruction_data = json.load(f)

# Create a dataset from the list
dataset = Dataset.from_list(instruction_data)

In [None]:
# Format the dataset in alpaca-style
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input, output)
        texts.append(text)
    return { "text" : texts, }
pass

dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/915 [00:00<?, ? examples/s]

In [None]:
# Split into training and test set
train_test_set = dataset.train_test_split(test_size=0.1)
print(f"Created {len(train_test_set['train'])} training examples and {len(train_test_set['test'])} test examples")

### 3. Fine-tuning of the adapter

In [None]:
# Set the training arguments for supervised training
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/mistral-peft/results/2025-03-12/instruct/",
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 4,
    warmup_steps = 5,
    num_train_epochs=4,
    learning_rate = 2e-4,
    weight_decay = 0.01,
    fp16 = True,
    bf16 = False,
    save_steps=10,
    logging_steps=10,
    optim = "adamw_8bit",
    lr_scheduler_type = "linear",
    seed = 815,
    report_to = "none"
)

# Initialize the trainer
trainer = SFTTrainer(
    model = peft_model,
    args=training_args,
    train_dataset=train_test_set['train'],
    eval_dataset=train_test_set['test'],
    tokenizer = instruct_tokenizer,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    formatting_func=formatting_prompts_func,
)

Map (num_proc=2):   0%|          | 0/915 [00:00<?, ? examples/s]

In [None]:
# Execute training
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 915 | Num Epochs = 4 | Total steps = 912
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 41,943,040/3,800,305,664 (1.10% trained)


Step,Training Loss
10,2.3966
20,1.2727
30,1.324
40,1.1693
50,1.0163
60,0.9977
70,1.0923
80,0.8848
90,1.0613
100,0.9295




TrainOutput(global_step=912, training_loss=0.3987302373964013, metrics={'train_runtime': 3762.2081, 'train_samples_per_second': 0.973, 'train_steps_per_second': 0.242, 'total_flos': 3.892617183483494e+16, 'train_loss': 0.3987302373964013, 'epoch': 3.9836065573770494})

In [None]:
# save the trained adapter
peft_model.save_pretrained(Path(path_to_lora_weights,"instruct"), save_embedding_layers=True)

In [None]:
# Plot the training loss
import glob
paths = glob.glob("/content/drive/MyDrive/mistral-peft/results/2025-03-12/instruct/checkpoint-*")
train_info_path = Path(sorted(paths)[-1]) / "trainer_state.json"
with open(train_info_path, "r") as f:
    train_info = json.load(f)

print(train_info.keys())
train_steps = [int(d["step"]) for d in train_info["log_history"] if "loss" in d]
train_loss = [float(d["loss"]) for d in train_info["log_history"] if "loss" in d]
eval_steps = [int(d["step"]) for d in train_info["log_history"] if "eval_loss" in d]
eval_loss = [float(d["eval_loss"]) for d in train_info["log_history"] if "eval_loss" in d]
fig, ax = plt.subplots()
ax.plot(train_steps, train_loss)
ax.plot(eval_steps, eval_loss)
ax.set_xlabel("Steps")
ax.set_ylabel("Training Loss")
ax.grid(True)
ax.title.set_text("Loss over steps")
plt.show()

### 4. Test the fine-tuned instruction model

In [None]:
def test_instruct(model_token_list):
    for model, tokenizer in model_token_list:
        model.generation_config.pad_token_id = tokenizer.pad_token_id

    questions = ["Explain System Identification",
                 "How is heat transfer influenced by pulsating or oscillating flow?",
                 "What is a unit step response?"]

    for i, question in enumerate(questions):
        print(f"\nQuestion:\t\t\t {question}")
        j = 0
        for model, tokenizer in model_token_list:
            message = [
                {"role": "user", "content": question}
            ]

            instruct_tokenizer.pad_token_id = instruct_tokenizer.eos_token_id
            formatted_prompt = instruct_tokenizer.apply_chat_template(message, return_tensors="pt", tokenize=False)
            inputs = instruct_tokenizer(formatted_prompt, return_tensors="pt", padding=True).to("cuda")

            outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True,)
            response = tokenizer.batch_decode(outputs)
            print(f"\nResponse model {j}:\t {response}")
            j += 1
        print("-" * 80)

In [None]:
peft_instruct_model = PeftModel.from_pretrained(instruct_model, Path(path_to_lora_weights,"instruct"))

In [38]:
test_instruct([(instruct_model, instruct_tokenizer), (peft_instruct_model, instruct_tokenizer)])


Question:			 Explain System Identification

Response model 0:	 ["<s><s>[INST] Explain System Identification[/INST] System Identification (SI) is a process of acquiring a model for a system from acquired data, rather than deriving it from first principles or using a known analogy. It involves two main tasks: building a model and validating the model to ensure it accurately represents the behavior of the real system.\n\nThere are several types of systems that can be identified, including deterministic systems, stochastic systems, and hybrid systems. Deterministic systems always produce the same output for a given input, whereas stochastic systems exhibit random behavior. Hybrid systems combine both deterministic and stochastic components.\n\nThe identified model takes the form of an equation or a set of equations that relates the input(s) to the output(s). This model can be in various forms, such as a difference equation, a finite impulse response (FIR) model, or a partial differential 