# Domain Adaptation using QLoRA

This notebook demonstrates how to:
1. Extract text from technical PDFs
2. Prepare training data for causal language modelling (CLM)
3. Fine-tune a base language model (Mistral 7B v0.3) using QLoRA
4. Load the learned LoRA adapter weights on top of the base model
5. Generate continuations of some test prompts with and without the adapter

In [1]:
# Clone the git repo to access the utilities
!git clone https://github.com/arminwitte/mistral-peft mistralpeft

Cloning into 'mistralpeft'...
remote: Enumerating objects: 204, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 204 (delta 4), reused 14 (delta 2), pack-reused 183 (from 3)
Receiving objects: 100% (204/204), 641.52 MiB | 42.42 MiB/s, done.
Resolving deltas: 100% (98/98), done.
Updating files: 100% (62/62), done.


In [2]:
# Make sure we are in the repo directory, then sync it with the remote main branch
import os
if os.getcwd() != "/kaggle/working/mistralpeft":
    os.chdir("/kaggle/working/mistralpeft")
!pwd
!git fetch --all
!git reset --hard origin/main

/kaggle/working/mistralpeft
Fetching origin
HEAD is now at 0224574 Merge branch 'main' of https://github.com/arminwitte/mistral-peft


In [3]:
# Install the required packages from pypi
!pip install -r requirements.txt --quiet


In [4]:
# Load packages
import torch
from pathlib import Path

from datasets import Dataset
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
)

from mistralpeft.utils import TextExtractor, CLMPreprocessor

In [5]:
# Log in to Hugging Face with the token stored in Kaggle's secrets so that models can be downloaded
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("huggingface")
login(hf_token)

## 1. Extract Text from PDFs
Several PDFs from my former research group at university (Thermo-Fluiddynamics Group, Prof. Polifke) were chosen to form the corpus.

In [6]:
# TextExtractor is a simple ETL class to acquire a text corpus
pdf_files = [
    "Dissertation.pdf",
]

pdf_urls = [
    "https://mediatum.ub.tum.de/doc/1360567/1360567.pdf",
    "https://mediatum.ub.tum.de/doc/1601190/1601190.pdf",
    "https://mediatum.ub.tum.de/doc/1597610/1597610.pdf",
    "https://mediatum.ub.tum.de/doc/1584750/1584750.pdf",
    "https://mediatum.ub.tum.de/doc/1484812/1484812.pdf",
    "https://mediatum.ub.tum.de/doc/1335646/1335646.pdf",
    "https://mediatum.ub.tum.de/doc/1326486/1326486.pdf",
    "https://mediatum.ub.tum.de/doc/1306410/1306410.pdf",
    "https://mediatum.ub.tum.de/doc/1444929/1444929.pdf",
]

# Only run the (slow) extraction if the processed corpus does not exist yet
data_path = Path("data/processed_documents.json")
if not data_path.is_file():
    with TextExtractor(str(data_path)) as extractor:
        # Process local files
        extractor.process_documents(pdf_files)

        # Process URLs
        extractor.process_documents(pdf_urls, url_list=True)

## 2. Prepare CLM Training Data

In [7]:
# Specify the model and load the tokenizer
# Mistral 7B v0.3 base model with approx. 7 billion parameters
model_name = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    # The tokenizer defines no padding token, so reuse the end-of-sequence token
    tokenizer.pad_token_id = tokenizer.eos_token_id


In [8]:
# Preprocess the corpus for Causal Language Modeling (CLM)
json_file_paths = ["data/processed_documents.json"]
preprocessor = CLMPreprocessor(json_file_paths, tokenizer)
dataset = preprocessor.preprocess()
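`CLMPreprocessor` lives in the repository and its internals are not shown here. As a point of reference, a common way to prepare CLM data (the concatenate-and-chunk approach used in Hugging Face's causal language modelling examples) can be sketched in plain Python. `group_texts` and `block_size` are illustrative names, not the repository's API, and the sketch is not a claim about how `CLMPreprocessor` works internally.

```python
def group_texts(token_id_lists, block_size):
    """Concatenate tokenized documents and split them into CLM examples of block_size tokens."""
    concatenated = [tok for ids in token_id_lists for tok in ids]
    # Drop the incomplete tail so that every example has exactly block_size tokens
    usable = (len(concatenated) // block_size) * block_size
    examples = []
    for start in range(0, usable, block_size):
        input_ids = concatenated[start:start + block_size]
        # Labels are a copy of the inputs; Hugging Face causal LM models shift them
        # internally so that position t is trained to predict token t + 1
        examples.append({"input_ids": input_ids, "labels": list(input_ids)})
    return examples


docs = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10]]  # toy token IDs standing in for tokenized documents
for example in group_texts(docs, block_size=4):
    print(example["input_ids"])
# [1, 2, 3, 4]
# [5, 6, 7, 8]
```

Concatenating first means no padding is wasted and documents can span example boundaries; pipelines often also append an end-of-sequence token to each document before concatenation so the model sees where one text ends.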

In [9]:
# Split into training and test set
train_test_set = dataset.train_test_split(test_size=0.1)
print(f"Created {len(train_test_set['train'])} training examples and {len(train_test_set['test'])} test examples")

# Preview a training example
example = train_test_set["train"][0]
print("\nExample input:")
print(preprocessor.tokenizer.decode(example['input_ids'][:256]))

Created 264 training examples and 30 test examples

Example input:
effort should be spent to overcome these limitations. As the computational effort of the hybrid models is still considerable, methods to build nonlinear low-order models of the ﬂame dynamics in a general and con- sistent way are required. In the scope of this thesis, in P APER -ANN, artiﬁcial neural networks have been used to extend the CFD/SI approach to the nonlin- ear regime. Unfortunately, a high uncertainty of amplitudes of thermoacoustic oscillations predicted was observed. Hence, more sophisticated methods are re- quired. One way to improve the results are white- or grey-box models, which account for the physics of the ﬂame more accurately. Another idea is to use not only the time series of the input and output signal to identify the model, but also 12 Hybrid Reduced Order / LES Models of self-e Xcited Combustion Instabilities in Multi-Burner Systems ﬁeld data. This allows to use more information to build the mod
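The preview shows typical PDF-extraction artifacts: ligatures such as "ﬂ" and "ﬁ", words split at former line breaks ("con- sistent", "nonlin- ear") and broken small caps ("P APER -ANN"). Their effect on fine-tuning is not evaluated here, but the first two are easy to repair. The following is a minimal cleanup sketch; `clean_pdf_text` is a hypothetical helper, not part of the repository's `TextExtractor`. The hyphen rule is a heuristic: it would also wrongly merge a suspended hyphen such as "white- or grey-box" into "whiteor", and NFKC normalization also rewrites other compatibility characters (for example superscript digits).

```python
import re
import unicodedata


def clean_pdf_text(text):
    """Repair ligatures and line-break hyphenation left over from PDF extraction (heuristic)."""
    # NFKC normalization maps compatibility ligatures such as U+FB01 ("ﬁ") to plain "fi"
    text = unicodedata.normalize("NFKC", text)
    # Re-join lowercase words split by a hyphen plus space at a former line break ("con- sistent")
    return re.sub(r"(?<=[a-z])- (?=[a-z])", "", text)


sample = "arti\ufb01cial neural networks, \ufb02ame dynamics, con- sistent way, self-excited"
print(clean_pdf_text(sample))
# artificial neural networks, flame dynamics, consistent way, self-excited
```

Genuine hyphenated compounds without a following space ("self-excited") are left untouched.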

## 3. Load and Prepare Model

In [10]:
# Load the base model
# 4-bit NF4 quantization of the base model (as used by QLoRA) is done by bitsandbytes. It requires CUDA!
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # quantize the frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",  # use the 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.float16,  # de-quantize to float16 for the matrix multiplications
    bnb_4bit_use_double_quant=False,  # do not additionally quantize the quantization constants
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config,
)
if base_model.config.pad_token_id is None:
    base_model.config.pad_token_id = base_model.config.eos_token_id
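QLoRA keeps the frozen base weights in 4-bit form and de-quantizes them on the fly for each matrix multiplication. The core idea, blockwise absmax quantization, can be sketched in plain Python. This is a simplified illustration, not bitsandbytes' implementation: it uses 16 evenly spaced levels, whereas NF4 places its 16 levels at quantiles of a normal distribution, and the block size of 64 is an assumption (the value described in the QLoRA paper).

```python
import random


def quantize_blockwise(weights, levels, block_size=64):
    """Map each weight to the index of the nearest level after per-block absmax scaling."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
        scales.append(absmax)  # one scaling constant stored per block
        for w in block:
            x = w / absmax  # normalized to [-1, 1]
            codes.append(min(range(len(levels)), key=lambda i: abs(levels[i] - x)))
    return codes, scales


def dequantize_blockwise(codes, scales, levels, block_size=64):
    """Reconstruct approximate weights from 4-bit codes and per-block scales."""
    return [levels[c] * scales[i // block_size] for i, c in enumerate(codes)]


# 16 evenly spaced levels in [-1, 1], i.e. 4 bits per weight (NF4 uses normal-quantile levels instead)
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy stand-in for a row of a weight matrix
codes, scales = quantize_blockwise(weights, LEVELS)
restored = dequantize_blockwise(codes, scales, LEVELS)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"{len(scales)} blocks, max abs error {max_error:.5f}")
```

Each block stores one scaling constant next to its 4-bit codes; setting `bnb_4bit_use_double_quant=True` would quantize those constants as well to save a little more memory.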


In [11]:
# Prepare the 4-bit model for training and configure the (Q)LoRA adapter with rank r=8
model = prepare_model_for_kbit_training(base_model)
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],  # optionally also "o_proj", "up_proj", "down_proj", "gate_proj"
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)

## 4. Train the Model

The LoRA adapter has about 4.7M trainable parameters (4,718,592 according to the training log), compared to about 7B parameters of the full LLM.
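LoRA freezes each adapted weight matrix $W_0$ and learns a low-rank update, so an adapted projection computes $h = W_0 x + \frac{\alpha}{r} B A x$ with $A \in \mathbb{R}^{r \times d_\text{in}}$ and $B \in \mathbb{R}^{d_\text{out} \times r}$. Each adapted projection therefore adds $r(d_\text{in} + d_\text{out})$ trainable parameters. The sketch reproduces the count from the training log using the dimensions in the MistralConfig printed when saving the adapter (hidden size 4096, 32 query heads and 8 key/value heads of dimension 128, 32 layers); the helper name is illustrative.

```python
def lora_trainable_params(r, projection_shapes, num_layers):
    """Count LoRA parameters: A (r x d_in) plus B (d_out x r) for every adapted projection."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in projection_shapes.values())
    return num_layers * per_layer


hidden_size, head_dim, num_heads, num_kv_heads, num_layers = 4096, 128, 32, 8, 32
projection_shapes = {  # (d_in, d_out) of the target_modules in the LoraConfig
    "q_proj": (hidden_size, num_heads * head_dim),
    "k_proj": (hidden_size, num_kv_heads * head_dim),
    "v_proj": (hidden_size, num_kv_heads * head_dim),
}
r, lora_alpha = 8, 16
n_trainable = lora_trainable_params(r, projection_shapes, num_layers)
print(f"{n_trainable:,} trainable parameters")  # 4,718,592 trainable parameters
print(f"scaling alpha / r = {lora_alpha / r}")   # scaling alpha / r = 2.0
```

Because of grouped-query attention, `k_proj` and `v_proj` map to only 1024 dimensions, so they contribute fewer adapter parameters than `q_proj`. Adding the commented-out target modules would raise the count accordingly.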

In [12]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 1 x 4 = 4
    learning_rate=1e-4,
    fp16=True,  # mixed-precision training in float16
    logging_steps=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    optim="paged_adamw_8bit",  # memory-efficient paged 8-bit AdamW optimizer
    log_level="info",
    report_to="none",
)

# Initialize trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_test_set["train"],
    eval_dataset=train_test_set["test"],
)

# Start training
# This resumes from a checkpoint of a previous run. That checkpoint already completed the
# single epoch, so all steps are skipped (hence training_loss=0.0 in the output below).
# Call trainer.train() without arguments to train from scratch.
trainer.train(resume_from_checkpoint="results/checkpoint-66")

Using auto half precision backend
Loading model from results/checkpoint-66.
***** Running training *****
  Num examples = 264
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 4
  Total optimization steps = 66
  Number of trainable parameters = 4,718,592
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 1
  Continuing training from global step 66
  Will skip the first 1 epochs then the first 0 batches in the first epoch.


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-66 (score: 3.9107666015625).




TrainOutput(global_step=66, training_loss=0.0, metrics={'train_runtime': 0.0611, 'train_samples_per_second': 4319.656, 'train_steps_per_second': 1079.914, 'total_flos': 4.618544199657062e+16, 'train_loss': 0.0, 'epoch': 1.0})
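The numbers in the training log fit together: with a per-device batch size of 1 and 4 gradient-accumulation steps, each optimizer step sees 4 examples, so one epoch over the 264 training examples takes 66 optimizer steps, which is also why the checkpoint directory is named `checkpoint-66`. A small sanity check, assuming a single device as the log indicates:

```python
import math

n_train = 264  # "Num examples" in the training log
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
# 264 is divisible by 4, so rounding up or down gives the same result here
steps_per_epoch = math.ceil(n_train / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(effective_batch_size, total_steps)  # 4 66
```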

In [13]:
# Save the LoRA adapter weights:
lora_save_path = "lora_weights" 
peft_model.save_pretrained(lora_save_path)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.3/snapshots/d8cadc02ac76bd617a919d50b092e59d2d110aff/config.json
Model config MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.47.0",
  "use_cache": true,
  "vocab_size": 32768
}
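The MistralConfig printed above is enough for a back-of-the-envelope check of the model size. The sketch counts parameters from those values (Mistral uses no bias terms, and the embeddings are untied, as `tie_word_embeddings: false` shows) and estimates the 4-bit footprint. The memory estimate rests on assumptions: only the linear layers inside the decoder blocks are quantized, each block of 64 weights stores one 32-bit scaling constant (no double quantization, matching `bnb_4bit_use_double_quant=False`), everything else stays in 16-bit, and activations, the LoRA adapter and optimizer state are ignored.

```python
def mistral_param_counts(vocab_size, hidden_size, intermediate_size, num_layers,
                         num_heads, num_kv_heads, head_dim, tie_word_embeddings=False):
    """Return (parameters in decoder linear layers, all other parameters) for a Mistral-style model."""
    attention = 2 * hidden_size * num_heads * head_dim      # q_proj and o_proj
    attention += 2 * hidden_size * num_kv_heads * head_dim  # k_proj and v_proj (grouped-query attention)
    mlp = 3 * hidden_size * intermediate_size               # gate_proj, up_proj, down_proj
    linear = num_layers * (attention + mlp)
    norms = num_layers * 2 * hidden_size + hidden_size      # two RMSNorms per layer plus the final one
    embeddings = vocab_size * hidden_size * (1 if tie_word_embeddings else 2)  # embed_tokens (+ lm_head)
    return linear, norms + embeddings


# Values from the MistralConfig above
linear, other = mistral_param_counts(vocab_size=32768, hidden_size=4096, intermediate_size=14336,
                                     num_layers=32, num_heads=32, num_kv_heads=8, head_dim=128)
total = linear + other

bf16_gb = total * 2 / 1e9             # 2 bytes per parameter
block_size = 64                       # assumed quantization block size
nf4_gb = (linear * 0.5                # 4 bits per quantized weight
          + linear / block_size * 4   # one float32 absmax constant per block
          + other * 2) / 1e9          # non-quantized parameters kept in 16-bit

print(f"{total:,} parameters")        # 7,248,023,552 parameters
print(f"16-bit: {bf16_gb:.2f} GB")    # 16-bit: 14.50 GB
print(f"NF4 estimate: {nf4_gb:.2f} GB")  # NF4 estimate: 4.46 GB
```

The 16-bit size agrees with the three downloaded safetensors shards (4.95 + 5.00 + 4.55 GB), which supports the parameter count; the NF4 figure is only a rough lower bound for GPU memory during training.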



## 6. Test the Model


In [14]:
peft_model = PeftModel.from_pretrained(base_model, lora_save_path)

In [15]:
def test_continuation(model, tokenizer):
    """Print the continuation the model generates for a few example prompts."""
    queries = [
        "SI is used when processes are either too complex to gain insight using first principles, i.e. physical laws, or the calculation is too costly in terms of time or resources. Its goal is to",
        "In pulsating or oscillating flow, heat transfer can damp, but also drive instabilities.",
        "The unit impulse response is a time domain model. It shows",
    ]

    # Generate responses
    for query in queries:
        inputs = tokenizer(query, return_tensors="pt", padding=True, truncation=True).to(model.device)

        outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)

        # The decoded text contains the prompt followed by the generated continuation
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        print(f"\nQuery: {query}")
        print(f"\nResponse: {text}")
        print("-" * 80)

### 6.1 and 6.2 Answers to the test questions without and with LoRA

The same three prompts are completed first with `base_model`, intended as the model without LoRA (In [17]), and then with the adapted `peft_model` (In [18]).

In [17]:
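# Caveat: get_peft_model() injected the LoRA layers into base_model in place, and
# PeftModel.from_pretrained() loaded the saved adapter into that same object. This call
# therefore most likely still runs with the adapter, which would explain why its output
# matches that of In [18]. Running it inside `with peft_model.disable_adapter():`
# would switch the adapter off for the comparison.
test_continuation(base_model, tokenizer)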

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Query: SI is used when processes are either too complex to gain insight using first principles, i.e. physical laws, or the calculation is too costly in terms of time or resources. Its goal is to

Response: SI is used when processes are either too complex to gain insight using first principles, i.e. physical laws, or the calculation is too costly in terms of time or resources. Its goal is to gain insight into the system and to predict its behavior.

The is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of eq

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Query: In pulsating or oscillating flow, heat transfer can damp, but also drive instabilities.

Response: In pulsating or oscillating flow, heat transfer can damp, but also drive instabilities. The of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of
--------------------------------------------------------------------------------

Query: The unit impulse response is a time domain model. It shows

Response: The unit impulse response is a time domain model. It shows the response of a system to a unit impulse. The unit impulse is a signal that is zero everywhere except at t=0 where it is 1. The unit i

In [18]:
test_continuation(peft_model, tokenizer)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Query: SI is used when processes are either too complex to gain insight using first principles, i.e. physical laws, or the calculation is too costly in terms of time or resources. Its goal is to

Response: SI is used when processes are either too complex to gain insight using first principles, i.e. physical laws, or the calculation is too costly in terms of time or resources. Its goal is to gain insight into the system and to predict its behavior.

The is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of equations that describe the system. is a set of eq

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Query: In pulsating or oscillating flow, heat transfer can damp, but also drive instabilities.

Response: In pulsating or oscillating flow, heat transfer can damp, but also drive instabilities. The of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of
--------------------------------------------------------------------------------

Query: The unit impulse response is a time domain model. It shows

Response: The unit impulse response is a time domain model. It shows the response of a system to a unit impulse. The unit impulse is a signal that is zero everywhere except at t=0 where it is 1. The unit i
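Both runs drift into repeating the same phrase or token. To compare the models on more than eyeballing, a simple repetition measure such as distinct-n (the share of unique n-grams in a text) could be computed on the generated part of each response. `distinct_n` is a hypothetical helper using plain whitespace tokenization. On the generation side, the `repetition_penalty` and `no_repeat_ngram_size` arguments of `generate()` are common ways to suppress such loops.

```python
def distinct_n(text, n=1):
    """Share of unique word n-grams in `text`; values near 0 indicate heavy repetition."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


looping = "The of of of of of of of"
normal = "the response of a system to a unit impulse"
print(distinct_n(looping, 1), distinct_n(normal, 1))
```

A value close to 1 means almost no repeated words; the looping continuation above scores 0.25.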