# Domain Adaptation using QLoRA

This notebook demonstrates how to:
1. Extract text from technical PDFs
2. Prepare training data for causal language modelling (CLM)
3. Fine-tune a base language model (Llama 3.2 3B) using QLoRA
4. Apply the learned LoRA adapter weights to the instruct model
5. Answer some test questions

In [1]:
# Clone the git repo to access the utilities
!git clone https://github.com/arminwitte/mistral-peft mistralpeft

fatal: destination path 'mistralpeft' already exists and is not an empty directory.


In [2]:
# Make sure we are in the repo directory, then pull the latest version
import os
if os.getcwd() != "/kaggle/working/mistralpeft":
    os.chdir("/kaggle/working/mistralpeft")
!pwd
!git fetch --all
!git reset --hard origin/main

/kaggle/working/mistralpeft
Fetching origin
remote: Enumerating objects: 17, done.
remote: Counting objects: 100% (7/7), done.
remote: Total 17 (delta 7), reused 7 (delta 7), pack-reused 10 (from 1)
Unpacking objects: 100% (17/17), 341.66 MiB | 13.71 MiB/s, done.
From https://github.com/arminwitte/mistral-peft
   88d57ca..3fddfb9  main       -> origin/main
Updating files: 100% (37/37), done.
HEAD is now at 3fddfb9 update r=16


In [3]:
# Install the required packages from pypi
!pip install -r requirements.txt --quiet


In [4]:
# Load packages
from pathlib import Path

import torch
from datasets import Dataset
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    Trainer,
    TrainingArguments,
    pipeline,
)

from mistralpeft.utils import TextExtractor, CLMPreprocessor

In [5]:
# Log in to Hugging Face using Kaggle secrets so that gated models can be downloaded
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("huggingface")
login(hf_token)

## 1. Extract Sentences from PDF
Several PDFs from my former university research group (Thermo-Fluiddynamics Group, Prof. Polifke) form the corpus.

In [7]:
# TextExtractor is a simple ETL class to acquire a text corpus
pdf_files = [
    "Dissertation.pdf",
]
    
pdf_urls = [
    "https://mediatum.ub.tum.de/doc/1360567/1360567.pdf",
    "https://mediatum.ub.tum.de/doc/1601190/1601190.pdf",
    "https://mediatum.ub.tum.de/doc/1597610/1597610.pdf",
    "https://mediatum.ub.tum.de/doc/1584750/1584750.pdf",
    "https://mediatum.ub.tum.de/doc/1484812/1484812.pdf",
    "https://mediatum.ub.tum.de/doc/1335646/1335646.pdf",
    "https://mediatum.ub.tum.de/doc/1326486/1326486.pdf",
    "https://mediatum.ub.tum.de/doc/1306410/1306410.pdf",
    "https://mediatum.ub.tum.de/doc/1444929/1444929.pdf",
]

data_path = Path("data/processed_documents.json")
if not data_path.is_file():
    with TextExtractor("data/processed_documents.json") as extractor:
        # Process local files
        extractor.process_documents(pdf_files)
            
        # Process URLs
        extractor.process_documents(pdf_urls, url_list=True)
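
A list literal like `pdf_urls` is easy to get subtly wrong: Python silently concatenates adjacent string literals, so a single missing comma fuses two URLs into one invalid entry. The following is a minimal sketch of a sanity check. `find_fused_urls` is a hypothetical helper, not part of `mistralpeft`, and the `example.org` URLs are placeholders.

```python
def find_fused_urls(urls):
    """Return entries that do not contain exactly one URL scheme.

    Hypothetical helper: an entry with two "https://" occurrences is most
    likely the result of a missing comma between two string literals.
    """
    return [u for u in urls if u.count("http://") + u.count("https://") != 1]


urls = [
    "https://example.org/a.pdf",
    "https://example.org/b.pdf"  # missing comma: fused with the next literal
    "https://example.org/c.pdf",
]
print(len(urls))  # 2
print(find_fused_urls(urls))
```

Running such a check before downloading avoids a failed request for a URL that never existed.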

## 2. Prepare CLM Training Data

In [8]:
# Specify the model and load the tokenizer
# Llama 3.2 3B with approx. 3 billion parameters
model_name = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

In [9]:
# Preprocess the corpus for Causal Language Modeling (CLM)
json_file_paths = ["data/processed_documents.json"]
preprocessor = CLMPreprocessor(json_file_paths, tokenizer)
dataset = preprocessor.preprocess()

Token indices sequence length is longer than the specified maximum sequence length for this model (152682 > 131072). Running this sequence through the model will result in indexing errors
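
The warning above appears because a whole document is tokenized in one pass. For CLM training, the resulting token stream is then cut into fixed-length blocks. The internals of `CLMPreprocessor` are not shown in this notebook, so the following is only a sketch of a common approach; the function name and `block_size` are chosen for illustration.

```python
def chunk_for_clm(token_ids, block_size):
    """Split one long token sequence into fixed-length CLM training examples.

    The trailing remainder shorter than block_size is dropped. Labels are a
    copy of the inputs; Hugging Face causal LM models shift them internally
    so that each position predicts the next token.
    """
    examples = []
    usable = len(token_ids) - len(token_ids) % block_size
    for start in range(0, usable, block_size):
        block = token_ids[start:start + block_size]
        examples.append({
            "input_ids": block,
            "attention_mask": [1] * block_size,
            "labels": list(block),
        })
    return examples


examples = chunk_for_clm(list(range(10)), block_size=4)
print(len(examples))             # 2
print(examples[1]["input_ids"])  # [4, 5, 6, 7]
```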


In [10]:
# Split into training and test set
train_test_set = dataset.train_test_split(test_size=0.1)
print(f"Created {len(train_test_set['train'])} training examples and {len(train_test_set['test'])} test examples")

# Preview a training example
example = train_test_set["train"][0]
print("\nExample input:")
print(preprocessor.tokenizer.decode(example['input_ids'][:256]))

Created 232 training examples and 26 test examples

Example input:
2. ISSN 00457825. doi: 10.01016/0045-7825(82)90071-8. T. J. Hughes, L. P. Franca, and G. M. Hulbert. “A new ﬁnite element formulation for com- putational ﬂuid dynamics: VIII. The Galerkin/least-squares method for advective-diffusive equations”. Computer Methods in Applied Mechanics and Engineering, 73(2):173–189, May 1989. ISSN 00457825. doi: 10.01016/0045-7825(89)90111-4. T. Hofmeister, T. Hummel, B. Schuermans, and T. Sattelmayer. “Quantiﬁcation of Energy Transformation Processes Between Acoustic and Hydrodynamic Modes in Non-Compact Thermoacoustic Systems via a Helmholtz-Hodge Decomposition Approach”. In Volume 4 A: Combustion, Fuels, and Emissions, page V 04 AT 04 A 013, Phoenix, Arizona, USA, June 2019. American Society of Mechanical Engineers. ISBN 978-0-7918-586
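
The example above still contains typical PDF extraction residue: typographic ligatures ("ﬁ", "ﬂ") and words split by end-of-line hyphenation ("com- putational"). A minimal cleanup sketch, assuming such artifacts should be normalized before tokenization, could look as follows. `clean_extracted_text` is a hypothetical helper, and the hyphen rule is a heuristic that would also merge legitimate constructs such as "pre- and post-processing".

```python
import re
import unicodedata


def clean_extracted_text(text):
    """Heuristic cleanup for PDF-extracted text (illustrative only)."""
    # NFKC normalization expands typographic ligatures such as "ﬁ" and "ﬂ"
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by end-of-line hyphenation ("com- putational")
    text = re.sub(r"([a-z])- ([a-z])", r"\1\2", text)
    return text


sample = "A new ﬁnite element formulation for com- putational ﬂuid dynamics"
print(clean_extracted_text(sample))
# A new finite element formulation for computational fluid dynamics
```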


## 3. Load and Prepare Model

In [11]:
# Load the base model
# The base model is quantized to 4 bits (NF4) with bitsandbytes. This requires CUDA!
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",  # use 4-bit NormalFloat (NF4)
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,  # do not quantize the scaling factors
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config,
    )
if base_model.config.pad_token_id is None:
    base_model.config.pad_token_id = base_model.config.eos_token_id

config.json:   0%|          | 0.00/844 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]
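
To get a feeling for what 4-bit loading saves, the weight memory can be estimated from the checkpoint size shown above. This is a rough sketch under stated assumptions: the checkpoint stores 2 bytes per parameter, NF4 uses blocks of 64 weights with one 32-bit scaling factor each (no double quantization), and every weight is quantized. In practice some layers, such as the embeddings, typically stay in higher precision, so the real footprint is larger.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


checkpoint_gb = 4.97 + 1.46         # the two safetensors shards downloaded above
n_params = checkpoint_gb * 1e9 / 2  # assumes 2 bytes per parameter

fp16_gb = weight_memory_gb(n_params, 16)
# 4 bits per weight plus one 32-bit scaling factor per block of 64 weights
nf4_gb = weight_memory_gb(n_params, 4 + 32 / 64)

print(f"{n_params / 1e9:.2f}B parameters")
print(f"float16: {fp16_gb:.2f} GB, NF4: {nf4_gb:.2f} GB")
```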

In [11]:
# Configure the (Q)LoRA adapter to use a rank of r=16
model = prepare_model_for_kbit_training(base_model)
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)

## 4. Train the Model

The LoRA adapter has about 24M trainable parameters (compared to roughly 3B parameters in the full LLM).
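
The adapter size can be checked by hand: each adapted linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` trainable parameters. The sketch below assumes the layer dimensions of Llama 3.2 3B (hidden size 3072, key/value projection size 1024, MLP size 8192, 28 layers); these values are not printed anywhere in this notebook, but with r=16 they reproduce the 24,313,856 trainable parameters reported in the training log.

```python
hidden, kv_dim, intermediate, n_layers = 3072, 1024, 8192, 28  # assumed model dimensions

# (in_features, out_features) of each targeted linear layer
target_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_param_count(rank):
    """Trainable LoRA parameters over all layers for a given rank."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in target_shapes.values())
    return per_layer * n_layers


print(lora_param_count(16))  # 24313856
print(lora_param_count(4))   # 6078464
```

The count scales linearly with the rank, so r=4 would give about 6M parameters and r=16 about 24M.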

In [None]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=3,  # effective batch size of 3
    learning_rate=3e-4,
    fp16=True,  # train with float16 mixed precision
    logging_steps=1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    optim="paged_adamw_8bit", # Memory efficient optimizer
    log_level="info",
    report_to="none",
)

# Initialize trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_test_set['train'],
    eval_dataset=train_test_set['test']
)

# Start training
trainer.train(resume_from_checkpoint=False)  # pass a path such as "results/checkpoint-231" to resume

Using auto half precision backend
***** Running training *****
  Num examples = 232
  Num Epochs = 10
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 3
  Gradient Accumulation steps = 3
  Total optimization steps = 770
  Number of trainable parameters = 24,313,856
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss
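
The step count in the log follows from the training arguments. A short sketch of the arithmetic, assuming a single device and that incomplete accumulation groups are rounded down (an assumption that reproduces the logged total of 770):

```python
num_examples = 232
per_device_batch_size = 1
gradient_accumulation_steps = 3
num_epochs = 10

batches_per_epoch = num_examples // per_device_batch_size
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 77 770
print(231 / steps_per_epoch)         # 3.0
```

With one checkpoint saved per epoch, `checkpoint-231` therefore corresponds to the end of the third epoch.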


In [None]:
# Save the LoRA adapter weights:
lora_save_path = "lora_weights" 
peft_model.save_pretrained(lora_save_path)

## 5. Apply the learned weights to the instruct model

In [19]:
# Load the instruct model (in float16; the 4-bit config below is defined but currently disabled)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",  # use 4-bit NormalFloat (NF4)
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,  # do not quantize the scaling factors
)
model_name = "meta-llama/Llama-3.2-3B-Instruct"
instruct_tokenizer = AutoTokenizer.from_pretrained(model_name)
instruct_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    # quantization_config=quantization_config,
)
if instruct_tokenizer.pad_token_id is None:
    instruct_tokenizer.pad_token_id = instruct_tokenizer.eos_token_id
if instruct_model.config.pad_token_id is None:
    instruct_model.config.pad_token_id = instruct_model.config.eos_token_id

# Load the LoRA configuration and weights
# lora_weights_path = "lora_weights"
lora_weights_path = "results/checkpoint-231"
# peft_config = PeftConfig.from_pretrained(lora_weights_path)

# Apply LoRA adapter to the instruct model
lora_instruct_model = PeftModel.from_pretrained(instruct_model, lora_weights_path)  

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
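
Conceptually, a LoRA adapter stores a low-rank update `(lora_alpha / r) * B @ A` for each target layer, and loading it onto another model adds that same update to that model's weights. The toy sketch below uses plain Python lists and made-up 2x2 matrices; it does not use the `peft` API. Whether an update learned on the base model transfers well to the instruct model is an empirical question that the tests in the next section probe.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_A, lora_B, lora_alpha, r):
    """Return weight + (lora_alpha / r) * lora_B @ lora_A."""
    scale = lora_alpha / r
    delta = matmul(lora_B, lora_A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy rank-1 adapter (values are made up for illustration)
lora_A = [[1.0, 0.0]]    # shape (r, d_in)
lora_B = [[0.5], [0.0]]  # shape (d_out, r)

base_weight = [[1.0, 0.0], [0.0, 1.0]]
instruct_weight = [[1.0, 1.0], [0.0, 1.0]]

# The same update is added regardless of which weights it is applied to
adapted_base = apply_lora(base_weight, lora_A, lora_B, lora_alpha=2, r=1)
adapted_instruct = apply_lora(instruct_weight, lora_A, lora_B, lora_alpha=2, r=1)
print(adapted_base)      # [[2.0, 0.0], [0.0, 1.0]]
print(adapted_instruct)  # [[2.0, 1.0], [0.0, 1.0]]
```

With the configuration used in this notebook (`lora_alpha=32`, `r=16`), the scaling factor is also 2.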

## 6. Test the Model

In [20]:
def test_qanda(model, tokenizer):
    # Example queries
    queries = [
        "Explain the CFD/SI method.",
        "What is a Rijke tube?",
        "How is a finite impulse response computed?",
        "When analyzing heat transfer using system identification, what key criteria were used to validate the identified models, and what minimum performance threshold was considered acceptable?",
        "For a heated cylinder in pulsating cross-flow at Reynolds numbers between 0.4-40, what explains the appearance of amplitude peaks in the frequency response at Strouhal numbers between 0-1, and how does this behavior change with Reynolds number?",
    ]

    # Create pipeline for inference
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    
    # Generate responses
    for query in queries:
        messages = [{"role": "user", "content": query}]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

        outputs = pipe(prompt, max_new_tokens=120, do_sample=True)
        
        print(f"\nQuery: {query}")
        print(f"\nResponse: {outputs[0]['generated_text']}")
        print("-" * 80)
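
By default the text-generation pipeline returns the prompt together with the completion, so the printed response contains the full chat template. A minimal sketch of a post-processing step, assuming the Llama 3 header tokens that appear in the outputs of this notebook; `extract_assistant_reply` is a hypothetical helper and the sample string is made up.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_assistant_reply(generated_text):
    """Return the text after the last assistant header, without end markers."""
    reply = generated_text.rsplit(ASSISTANT_HEADER, 1)[-1]
    return reply.replace("<|eot_id|>", "").strip()


sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is a Rijke tube?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "An example answer.<|eot_id|>"
)
print(extract_assistant_reply(sample))  # An example answer.
```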

In [13]:
def test_qanda2(model, tokenizer):
    # Example queries
    queries = [
        "Explain the CFD/SI method.",
        "What is a Rijke tube?",
        "How is a finite impulse response computed?",
        "When analyzing heat transfer using system identification, what key criteria were used to validate the identified models, and what minimum performance threshold was considered acceptable?",
        "For a heated cylinder in pulsating cross-flow at Reynolds numbers between 0.4-40, what explains the appearance of amplitude peaks in the frequency response at Strouhal numbers between 0-1, and how does this behavior change with Reynolds number?",
    ]
    
    # Generate responses
    for query in queries:
        messages = [{"role": "user", "content": query}]
    
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            
        inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to(model.device)
        
        outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)
        
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        print(f"\nQuery: {query}")
        print(f"\nResponse: {text.split('assistant', 1)[1]}")
        print("-" * 80)

### 6.1 Answers to the test questions by the instruct model w/o LoRA

In [21]:
test_qanda(instruct_model, instruct_tokenizer)

Device set to use cuda:0



Query: Explain the CFD/SI method.

Response: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 23 Feb 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

Explain the CFD/SI method.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

CF/S ( Finite Element/ Finite) is a numerical technique used to solve partial differential equations (DE) in fields such as fluid, heat, and mass transfer. is a method for solving partial differential equations using finite elements and finite difference. It is widely used in fields such as engineering, physics, and.
--------------------------------------------------------------------------------

Query: What is a Rijke tube?

Response: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 23 Feb 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

What is a Rijke tube?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



In [22]:
test_qanda2(instruct_model, instruct_tokenizer)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Query: Explain the CFD/SI method.

Response: 

CF/S ( C/S stands forcompress and) is a numerical method used solvecompress fluid flow problems In, it combines compress flow simulations theacics,acics and dynamics. is based on Nav-Stu s equations the, is the, and, the,. is a of in linear and linear,, of and, the. is a of the, and the,, the, the, the, and the. is a of the, and the, and the, the, and the, the, and. is of the, and the, the, the, and the, the and the, the, the, and the. is a of the, the, the, the and the, the, the and, the
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Query: What is a Rijke tube?

Response: 

A Rijke tube is a device used in physics to create a vacuum through a process known as Joule's law of, the of expansion of gases. It was invented by Dutch physicist D. Rijke in 1850. The tube consists of a metal tube with a end, one end closed and other end open. The tube is heated from one end, the heated end is then rotated at a speed so the heated air inside the is forced to expand. The expansion of air creates a pressure difference between ends of tube, creating a pressure gradient. the pressure difference is sufficient create a vacuum the other end the tube
--------------------------------------------------------------------------------


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



Query: How is a finite impulse response computed?

Response: 

A finite impulse response (IR) is a mathematical that represents the output of a system when a impulse is applied to it. is computed by taking the convolution of input signal with impulse. convolution is defined as:

(x) = () �� � �� �� ������������������������������������������������������������������������������������������������
--------------------------------------------------------------------------------


KeyboardInterrupt: 

### 6.2 Answers to the test questions by the adapted model

In [23]:
test_qanda(lora_instruct_model, instruct_tokenizer)

Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'Mamba2ForCausalLM', 'MarianFor


Query: Explain the CFD/SI method.

Response: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 23 Feb 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

Explain the CFD/SI method.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

CF/S ( Finite Element/S Method is a numerical technique used solve partial differential equations (DEs) by discretizing domain into smaller, elements. elements are typically rectangular or triangular. is a method used solve partial differential equations (s) by discret the domain into smaller, elements typically rectangular triangular. is a method used solve partial equations by discret the domain smaller, elements typically rectangular triangular. is numerical technique solve partial equations discret domain smaller elements typically rectangular triangular The C/S method a numerical technique solve partial equations discret domain smaller elements typically rectangular triangular
----