# Domain Adaptation using QLoRA

This notebook demonstrates how to:
1. Extract sentences from a technical PDF
2. Prepare MLM training data
3. Fine-tune a language model using QLoRA

In [1]:
!git clone https://github.com/arminwitte/mistral-peft mistralpeft

Cloning into 'mistralpeft'...
remote: Enumerating objects: 92, done.
remote: Counting objects: 100% (92/92), done.
remote: Compressing objects: 100% (91/91), done.
remote: Total 92 (delta 53), reused 6 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (92/92), 7.58 MiB | 38.45 MiB/s, done.
Resolving deltas: 100% (53/53), done.


In [2]:
import os

# Work from inside the cloned repository and make sure it is up to date
if os.getcwd() != "/kaggle/working/mistralpeft":
    os.chdir("/kaggle/working/mistralpeft")
!pwd
!git pull

/kaggle/working/mistralpeft
Already up to date.


In [3]:
!pip install -r requirements.txt --quiet

In [4]:
from mistralpeft.utils import TextExtractor, prepare_for_training, generate_response, CLAPreprocessor 
from transformers import Trainer, TrainingArguments, AutoTokenizer
from pathlib import Path

In [5]:
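from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

# Read the Hugging Face access token from Kaggle's secret store;
# the gated Llama checkpoint used below requires an authenticated download.
user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("huggingface")
login(hf_token)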


## 1. Extract Sentences from PDF

First, we'll extract and clean sentences from the dissertation PDF and from two further PDFs fetched by URL.
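To make the cleaning step concrete, here is a minimal sketch of how text pulled from a PDF could be repaired and split into sentences. It is not the implementation of `TextExtractor` (whose source is not shown in this notebook); the function names `clean_text` and `split_sentences`, the regular expressions, and the three-word minimum are assumptions chosen for illustration.

```python
import re

def clean_text(raw: str) -> str:
    """Undo common PDF extraction damage: hyphenated line breaks and stray whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)  # re-join words split across lines
    text = re.sub(r"\s+", " ", text)              # collapse newlines and runs of spaces
    return text.strip()

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    # Drop fragments shorter than three words (captions, page numbers, and so on)
    return [p for p in parts if len(p.split()) >= 3]

raw = "Heat trans-\nfer in pulsating flow is un-\nsteady.  Skin friction responds\ndifferently. Fig. 3"
sentences = split_sentences(clean_text(raw))
print(sentences)
```

A rule-based splitter like this will mis-split abbreviations such as "Prof." or "Dr.-Ing.", so a real pipeline would likely need a more careful tokenizer.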

In [6]:
# Local PDF files to process
pdf_files = [
    "Dissertation.pdf",
]

# PDFs to download and process
pdf_urls = [
    "https://mediatum.ub.tum.de/doc/1360567/1360567.pdf",
    "https://mediatum.ub.tum.de/doc/1601190/1601190.pdf",
]

# The extractor collects all processed documents in one JSON file
with TextExtractor("output/processed_documents.json") as extractor:
    extractor.process_documents(pdf_files)
    extractor.process_documents(pdf_urls, url_list=True)

Processing documents: 100%|██████████| 1/1 [00:09<00:00,  9.76s/it]
Processing documents: 100%|██████████| 2/2 [00:28<00:00, 14.33s/it]


## 2. Prepare MLM Training Data

Now we'll tokenize the extracted text and build the training examples for the language model.
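For a decoder-only model such as the Llama checkpoint used in this notebook, training examples are commonly built by cutting the token stream into fixed-length blocks and using the inputs themselves as labels. The sketch below shows that idea on plain Python lists. It is an assumption about the general technique, not a description of what `CLAPreprocessor` does internally; `chunk_token_ids` and `block_size` are hypothetical names, and 128000 is the `bos_token_id` from the model config.

```python
def chunk_token_ids(token_ids, block_size, bos_id=None):
    """Split a long token sequence into blocks of at most block_size tokens."""
    # Reserve one position per block for the BOS token, if one is requested
    body = block_size - (1 if bos_id is not None else 0)
    examples = []
    for start in range(0, len(token_ids), body):
        ids = token_ids[start:start + body]
        if bos_id is not None:
            ids = [bos_id] + ids
        examples.append({
            "input_ids": ids,
            "attention_mask": [1] * len(ids),  # no padding, so every position is attended
            "labels": list(ids),               # the model shifts labels internally
        })
    return examples

token_ids = list(range(10))  # stand-in for real tokenizer output
examples = chunk_token_ids(token_ids, block_size=4, bos_id=128000)
print(len(examples), examples[0]["input_ids"], examples[-1]["input_ids"])
```

Chunking after tokenizing a whole document in one call is also a plausible reason for a tokenizer warning about sequences longer than the model maximum: the warning refers to the unsplit sequence, not to the blocks that are eventually fed to the model.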

In [7]:
model_name = "meta-llama/Llama-3.2-3B"  # Or the specific quantized version if you are using one.
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [8]:
json_file_paths = ["output/processed_documents.json"]
preprocessor = CLAPreprocessor(json_file_paths, tokenizer)
dataset = preprocessor.preprocess()

Token indices sequence length is longer than the specified maximum sequence length for this model (152916 > 131072). Running this sequence through the model will result in indexing errors


In [9]:
print(f"Created {len(dataset)} examples")

# Preview a training example
example = dataset[0]
print("\nExample input:")
print(preprocessor.tokenizer.decode(example['input_ids']))

Created 105 examples

Example input:
<|begin_of_text|>Technische Universität München Institut für Energietechnik Professur für Thermofluiddynamik Dynamics of Unsteady Heat Transfer and Skin Friction in Pulsating Flow Across a Cylinder Armin Witte Vollständiger Abdruck der von der Fakultät für Maschinenwesen der Technischen Universität München zur Erlangung des akademischen Grades eines DOKTOR – INGENIEURS genehmigten Dissertation. Vorsitzender: Prof. Dr.-Ing. Harald Klein Prüfer der Dissertation: 1. Prof. Wolfgang Polifke, Ph.D. 2. Prof. Dr.-Ing. Jens von Wolfersdorf Die Dissertation wurde am 26.004.02018 bei der Technischen Universität München eingereicht und durch die Fakultät für Maschinenwesen am 09.010.02018 angenommen. Acknowledgments This thesis was conceived at the Thermo-Fluid Dynamics Group of the Technical University of Munich during my time as a research assistant. Financial support was provided by Deutsche Forschungsgemeinschaft (DFG), project PO 710/15-1. First of all, I 

## 3. Load and Prepare Model

We'll now load the base model and prepare it for QLoRA fine-tuning.
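A quick back-of-the-envelope calculation shows why 4-bit loading matters on a single Kaggle GPU. The parameter count of roughly 3.21 billion is taken from the Llama 3.2 model card rather than from this notebook. The result is a lower bound: in practice some layers stay in higher precision and the quantization constants add overhead.

```python
n_params = 3.21e9  # approximate parameter count of Llama-3.2-3B (from the model card)

def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * (bits_per_weight / 8) / 1e9

weights_16bit_gb = weight_memory_gb(n_params, 16)
weights_4bit_gb = weight_memory_gb(n_params, 4)

print(f"16-bit weights: {weights_16bit_gb:.2f} GB")
print(f" 4-bit weights: {weights_4bit_gb:.2f} GB (lower bound)")
```

Activations, LoRA parameters, gradients and optimizer state come on top of this, which is why the training cell further down uses a batch size of 1.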

In [10]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

In [11]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use load_in_8bit=True for 8-bit quantization
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config,
)
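With only `load_in_4bit=True`, `BitsAndBytesConfig` falls back to its defaults, which use the `fp4` quantization type. The QLoRA recipe is usually described with NF4 quantization and double quantization instead. The keyword arguments below are a sketch of that variant; they were not used for the run recorded in this notebook, and the variable name `qlora_quant_kwargs` is ours.

```python
# Keyword arguments for transformers.BitsAndBytesConfig (a sketch, not the
# settings used for the results shown in this notebook).
qlora_quant_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",         # NormalFloat4 instead of the default "fp4"
    "bnb_4bit_use_double_quant": True,    # also quantize the quantization constants
    "bnb_4bit_compute_dtype": "float16",  # dtype used for matmuls; matches fp16 training
}

# In the notebook this would replace the config above:
#   quantization_config = BitsAndBytesConfig(**qlora_quant_kwargs)
print(qlora_quant_kwargs)
```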

In [12]:
# Hold out 10% of the examples for evaluation
train_test_set = dataset.train_test_split(test_size=0.1)

In [13]:
train_test_set 

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 94
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 11
    })
})
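The 94/11 split of the 105 examples follows from rounding: 10% of 105 is 10.5. The sketch below reproduces the sizes under the assumption that the test size is rounded up and the train size rounded down, which is the scikit-learn convention and is consistent with the output above.

```python
import math

n_examples = 105
test_size = 0.1

n_test = math.ceil(test_size * n_examples)          # 10.5 rounds up
n_train = math.floor((1 - test_size) * n_examples)  # 94.5 rounds down

print(n_train, n_test)

# The split itself is random. Passing a seed makes it reproducible between runs:
#   train_test_set = dataset.train_test_split(test_size=0.1, seed=42)
```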

In [14]:
# Prepare for LoRA training
model = prepare_for_training(
    base_model,
    lora_r=4,
    lora_alpha=16,
    lora_dropout=0.05
)
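The training log in the next section reports 2,293,760 trainable parameters. A LoRA adapter of rank `r` on a linear layer adds `r * (in_features + out_features)` parameters, so the count can be checked by hand. The sketch below assumes that `prepare_for_training` attaches adapters to the four attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`); that choice is not shown in this notebook, but it reproduces the logged number with the dimensions from the model config.

```python
# Dimensions from the Llama-3.2-3B config printed in the training log
hidden_size = 3072
num_layers = 28
head_dim = 128
num_heads = 24
num_kv_heads = 8
lora_r = 4

# (in_features, out_features) of each attention projection.
# Assumption: these four modules are the LoRA targets.
projections = {
    "q_proj": (hidden_size, num_heads * head_dim),
    "k_proj": (hidden_size, num_kv_heads * head_dim),
    "v_proj": (hidden_size, num_kv_heads * head_dim),
    "o_proj": (num_heads * head_dim, hidden_size),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, lora_r) for i, o in projections.values())
total = per_layer * num_layers
print(f"{total:,} trainable parameters")
```

Because the count scales linearly with the rank, doubling `lora_r` to 8 would double the number of trainable parameters.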

## 4. Train the Model

Now we'll fine-tune the model on our domain-specific data.
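Before launching a run it can help to estimate how many optimizer steps it will take. The sketch below does this for the settings used in this notebook (94 training examples, batch size 1, no gradient accumulation, 3 epochs). The single-device assumption and the variable names are ours.

```python
import math

num_train_examples = 94
per_device_batch_size = 1
gradient_accumulation_steps = 1
num_devices = 1  # assumption: a single GPU
num_epochs = 3

# Batches the dataloader yields per epoch
batches_per_epoch = math.ceil(num_train_examples / (per_device_batch_size * num_devices))
# One optimizer update every gradient_accumulation_steps batches
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_epochs

print(f"{updates_per_epoch} steps per epoch, {total_steps} steps in total")
```

With `save_strategy="epoch"`, the first checkpoint is therefore expected after step 94.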

In [15]:
# Disable Weights & Biases logging (newer transformers versions prefer report_to="none")
os.environ["WANDB_DISABLED"] = "true"

In [16]:
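# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    fp16=True,                  # mixed-precision training
    logging_steps=1,
    eval_strategy="epoch",      # named evaluation_strategy in transformers < 4.41
    save_strategy="epoch",      # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
    optim="paged_adamw_8bit",   # paged 8-bit AdamW from bitsandbytes
    report_to="none",           # no external logging integrations
    log_level="info"
)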

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_test_set['train'],
    eval_dataset=train_test_set['test']
)

# Start training
trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using auto half precision backend
***** Running training *****
  Num examples = 94
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 282
  Number of trainable parameters = 2,293,760
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.6966 | 3.832937 |
| 2 | 3.8142 | 3.394962 |
| 3 | 3.7934 | 3.267214 |


***** Running Evaluation *****
  Num examples = 11
  Batch size = 1
Saving model checkpoint to ./results/checkpoint-94
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-3B/snapshots/13afe5124825b4f3751f836b40dafda64c1ed062/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 24,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rop

TrainOutput(global_step=282, training_loss=4.083654373250109, metrics={'train_runtime': 5091.5375, 'train_samples_per_second': 0.055, 'train_steps_per_second': 0.055, 'total_flos': 1.9551033873137664e+16, 'train_loss': 4.083654373250109, 'epoch': 3.0})
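The loss values are easier to interpret as perplexity. Assuming the reported validation loss is the mean per-token cross-entropy in nats (the usual convention for causal language models in `transformers`), perplexity is simply its exponential. The same sketch also derives the time per optimizer step from the `TrainOutput` above.

```python
import math

# Validation loss per epoch, copied from the training output
val_loss = {1: 3.832937, 2: 3.394962, 3: 3.267214}

# Perplexity = exp(mean cross-entropy per token)
perplexity = {epoch: math.exp(loss) for epoch, loss in val_loss.items()}
for epoch, ppl in perplexity.items():
    print(f"epoch {epoch}: validation perplexity {ppl:.1f}")

# Throughput from train_runtime and global_step
seconds_per_step = 5091.5375 / 282
print(f"{seconds_per_step:.1f} s per optimizer step")
```

The validation perplexity falls from epoch to epoch, while the logged training loss (a single-step value, since `logging_steps=1`) is too noisy to show a trend.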

## 5. Test the Model

Let's test the fine-tuned model with some domain-specific queries.

In [17]:
def test_qanda(model):
    # Example queries about heat transfer and fluid dynamics
    queries = [
        "Explain the relationship between pulsating crossflow and heat transfer efficiency.",
        "What are the key factors affecting skin friction in the experimental setup?",
        "Summarize the main findings regarding heat transfer dynamics in the study.",
        "How does the Reynolds number influence the observed phenomena?"
    ]
    
    # Generate responses
    for query in queries:
        print(f"\nQuery: {query}")
        response = generate_response(
            model,
            tokenizer,
            query,
            max_new_tokens=512,
            temperature=0.7
        )
        print(f"Response: {response}")
        print("-" * 80)
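The `temperature=0.7` argument is presumably forwarded to the sampling step of generation (the body of `generate_response` is not shown here). As a minimal sketch of what temperature does, the function below divides the logits by the temperature before applying softmax; `softmax_with_temperature` and the example logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up next-token scores
p_default = softmax_with_temperature(logits, temperature=1.0)
p_sharp = softmax_with_temperature(logits, temperature=0.7)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

Temperatures below 1 concentrate probability on the highest-scoring tokens. That makes output more deterministic, but on its own it does nothing to prevent repetition loops. The `generate` method in `transformers` accepts options such as `repetition_penalty` and `no_repeat_ngram_size` for that purpose; whether `generate_response` passes them through is not shown in this notebook.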

In [18]:
test_qanda(model)


Query: Explain the relationship between pulsating crossflow and heat transfer efficiency.
Response: . 2. the of heat in flow..3 the heat in cross..4 the cross in flow 5 the in.6 the of in  the in 7 in8 cross cross cross flow heat heat heat heat heat heat heat flow flow flow flow flow flow flow flow heat heat heat heat heat heat. the the the the the the the the the the the the the the the of of of of of of of in in in in in in in in in in in in in. the the the the the the the the the the the the the the of of of of of of of in in in in in in in in in in in in. the the the the the the the the the the the the the of of of of of of in in in in in in in in in in in. the the the the the the the the the the the the of of of of of of in in in in in in in in in in in. the the the the the the the the the the of of of of of in in in in in in in in in in. the the the the the the the the the the of of of of in in in in in in in in. the the the the the the of of of of in in in in in in in in. the t

In [19]:
test_qanda(base_model)


Query: Explain the relationship between pulsating crossflow and heat transfer efficiency.
Response: what is effect the on efficiency puls cross and puls in heat.
Explain the relationship puls cross and in heat transfer what effect the efficiency puls cross and in.
Explain relationship puls cross heat and., explain the relationship puls cross heat. the between puls cross heat. explain relationship cross in transfer heat. relationship cross in heat. cross heat relationship transfer. cross heat transfer relationship explain the relationship cross. transfer heat cross explain relationship. explain cross heat relationship. heat cross transfer explain. cross explain relationship heat. cross transfer relationship heat. transfer heat explain. explain cross. heat transfer relationship. heat cross. cross relationship heat. cross heat. heat cross. cross. heat. cross heat cross heat transfer explain relationship. heat cross heat transfer. cross transfer heat. cross relationship. transfer heat cro
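One caveat about this comparison: if `prepare_for_training` wraps the model with PEFT's `get_peft_model` (an assumption, since the helper's source is not shown here), the LoRA layers are injected into `base_model` in place. In that case `test_qanda(base_model)` may still run through the adapters and is not a true baseline. PEFT models provide a `disable_adapter()` context manager for this situation. The helper below is a sketch that uses it when available; `adapter_free` and `FakePeftModel` are hypothetical names, and the fake class exists only so the example runs without loading a model.

```python
from contextlib import contextmanager, nullcontext

def adapter_free(model):
    """Context manager that turns adapters off if the model supports it."""
    disable = getattr(model, "disable_adapter", None)
    return disable() if callable(disable) else nullcontext()

class FakePeftModel:
    """Stand-in that mimics the disable_adapter() interface of a PEFT model."""
    def __init__(self):
        self.adapters_active = True

    @contextmanager
    def disable_adapter(self):
        self.adapters_active = False
        try:
            yield
        finally:
            self.adapters_active = True

fake = FakePeftModel()
with adapter_free(fake):
    inside = fake.adapters_active  # adapters are off inside the block
after = fake.adapters_active       # and restored afterwards
print(inside, after)
```

In the notebook, a baseline run would then look like `with adapter_free(model): test_qanda(model)`.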

## 6. Save the Model

Finally, let's save our fine-tuned model for later use.

In [20]:
# Save the fine-tuned model
output_dir = Path("./final_model")
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model saved to {output_dir}")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-3B/snapshots/13afe5124825b4f3751f836b40dafda64c1ed062/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 24,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.47.0",
  "u

Model saved to final_model
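If `model` is a PEFT-wrapped model, `save_pretrained` stores only the adapter weights and an `adapter_config.json`, not the full base model. The sketch below checks a directory for that file and outlines, in comments, how the adapter could be loaded again. `is_adapter_checkpoint` is a hypothetical helper, and the temporary directory stands in for `./final_model`.

```python
import tempfile
from pathlib import Path

def is_adapter_checkpoint(directory) -> bool:
    """True if the directory looks like a saved PEFT adapter."""
    return (Path(directory) / "adapter_config.json").is_file()

with tempfile.TemporaryDirectory() as tmp:
    empty_result = is_adapter_checkpoint(tmp)
    (Path(tmp) / "adapter_config.json").write_text("{}")  # simulate a saved adapter
    adapter_result = is_adapter_checkpoint(tmp)

print(empty_result, adapter_result)

# Reloading later (not executed here): load the quantized base model again,
# then attach the saved adapter.
#   from peft import PeftModel
#   base = AutoModelForCausalLM.from_pretrained(
#       model_name, quantization_config=quantization_config, device_map="auto"
#   )
#   model = PeftModel.from_pretrained(base, "./final_model")
```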
