# Domain Adaptation using QLoRA

This notebook demonstrates how to:
1. Extract sentences from a technical PDF
2. Prepare MLM and causal language modeling training data
3. Fine-tune a language model using QLoRA

In [1]:
!git clone https://github.com/arminwitte/mistral-peft mistralpeft

Cloning into 'mistralpeft'...
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 40 (delta 18), reused 6 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (40/40), 7.54 MiB | 39.00 MiB/s, done.
Resolving deltas: 100% (18/18), done.


In [2]:
import os
if not os.getcwd() == "/kaggle/working/mistralpeft":
    os.chdir("/kaggle/working/mistralpeft")
!pwd
!git pull 

/kaggle/working/mistralpeft
Already up to date.


In [3]:
!pip install -r requirements.txt --quiet


In [4]:
from mistralpeft.utils import TextExtractor, MLMPreprocessor, load_base_model, prepare_for_training, generate_response, CLAPreprocessor 
from transformers import Trainer, TrainingArguments, AutoTokenizer
from pathlib import Path


[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [5]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("huggingface")

from huggingface_hub import login
login(secret_value_0) 




## 1. Extract Sentences from PDF

First, we'll extract and clean sentences from the dissertation PDF.

In [6]:
# Initialize extractor
extractor = TextExtractor(language="en")  # or "de" for German

# Process the PDF
pdf_path = "Dissertation.pdf"
extraction_result = extractor.process_document(
    pdf_path,
    output_path="extracted_sentences.json"
)

print(f"Extracted {extraction_result['metadata']['num_sentences']} sentences")

# Preview some sentences
print("\nExample sentences:")
for sentence in extraction_result['sentences'][:3]:
    print(f"- {sentence}")

Reading PDF: 100%|██████████| 315/315 [00:07<00:00, 40.95it/s]

Raw text length: 575529
Cleaned text length: 575529
Sentences: 140
Extracted 140 sentences

Example sentences:
- Technische Universität München Institut für Energietechnik Professur für Thermofluiddynamik Dynamics of Unsteady Heat Transfer and Skin Friction in Pulsating Flow Across a Cylinder Armin Witte Vollständiger Abdruck der von der Fakultät für Maschinenwesen der Technischen Universität München zur Erlangung des akademischen Grades eines DOKTOR – INGENIEURS genehmigten Dissertation. Vorsitzender: Prof. Dr.-Ing. Harald Klein Prüfer der Dissertation: 1. Prof. Wolfgang Polifke, Ph.D. 2. Prof. Dr.-Ing. Jens von Wolfersdorf Die Dissertation wurde am 26.04.2018 bei der Technischen Universität München eingereicht und durch die Fakultät für Maschinenwesen am 09.10.2018 angenommen. Acknowledgments This thesis was conceived at the Thermo-Fluid Dynamics Group of the Technical University of Munich during my time as a research assistant. Financial support was provided by Deutsche Forschungsge
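The internals of `TextExtractor` are not shown in this notebook, and the nltk log above suggests it relies on the punkt tokenizer. As a rough illustration only, the sketch below shows one way cleaned PDF text could be split into sentences with a regular expression. `clean_text` and `split_sentences` are hypothetical helper names, not part of `mistralpeft`.

```python
import re

def clean_text(raw: str) -> str:
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()

def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])", clean_text(text))
    return [p for p in parts if p]

sample = "Heat transfer is unsteady.\nSkin friction   varies too! Why does it matter?"
for sentence in split_sentences(sample):
    print(f"- {sentence}")
```

A naive splitter like this would also break after abbreviations such as "Prof." or "Dr.-Ing.", which is one reason a trained sentence tokenizer is usually preferred for academic text.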




## 2. Prepare MLM Training Data

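Now we'll create masked language modeling (MLM) examples, masking 15% of the tokens and holding out 10% of the examples for validation. The `CLAPreprocessor` cell that follows builds the dataset that is actually passed to the trainer.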

In [7]:
model_name = "mistralai/Mistral-7B-v0.3"  # Or the specific quantized version if you are using one.
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/137k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

In [8]:
# Initialize preprocessor
preprocessor = MLMPreprocessor(
    tokenizer_name="mistralai/Mistral-7B-v0.3",
    mask_probability=0.15,
    max_length=512,
    seed=42
)

# Create datasets
datasets = preprocessor.process_files(
    ["extracted_sentences.json"],
    output_dir="processed_data",
    train_split=0.9
)

print(f"Created {len(datasets['train'])} training and {len(datasets['val'])} validation examples")

# Preview a training example
example = datasets['train'][0]
print("\nExample input:")
print(preprocessor.tokenizer.decode(example['input_ids']))

Creating MLM examples: 100%|██████████| 140/140 [00:00<00:00, 317.70it/s]


Saving the dataset (0/1 shards):   0%|          | 0/126 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/14 [00:00<?, ? examples/s]

Created 126 training and 14 validation examples

Example input:
<s><unk>ische Universität München Institut für<unk>ner<unk>iete<unk>ik Professur für Therm<unk>luiddynamik Dynamics<unk><unk>steady Heat Transfer and Skin Friction in Pulsating<unk> Across<unk> C<unk>inder Armin Witte Vollständiger Abdruck der von der Fakultät für<unk>chinenwesen der Technischen Universität München zur Erlangung des akademischen Grades eines<unk><unk><unk>OR – IN<unk>IEURS genehm<unk> Dissertation. V<unk>itzender: Prof. Dr.-Ing. Harald Klein Prüfer der<unk>sert<unk>: 1.<unk>. Wol<unk>ang<unk>ifke<unk> Ph.D. 2. Prof. Dr.-Ing.<unk>ens von Wolfersdorf<unk> Dissertation wurde<unk> 26<unk>04.201<unk> bei der Technischen Universität München<unk>ereicht und durch die Fakultät für Maschinen<unk>esen am 09.10.<unk>018 angenommen. Acknowledgments This thesis was con<unk> at the Thermo-<unk>uid Dynamics Group<unk> the Technical University of Munich during my time as a research assistant. Financial support<unk> provid
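In the preview above, `<unk>` appears wherever a token was masked, presumably because the Mistral tokenizer has no dedicated mask token. The following is a minimal sketch of random token masking at a given probability, not the actual `MLMPreprocessor` code. The function name `mask_tokens`, the toy token ids, and the use of id `0` as the mask are assumptions made for illustration; the label value `-100` is the index that Hugging Face loss functions ignore.

```python
import random

IGNORE_INDEX = -100  # label value that the loss skips

def mask_tokens(token_ids, mask_id, mask_probability=0.15, seed=42, special_ids=frozenset()):
    """Replace each non-special token with mask_id with the given probability.

    Returns (input_ids, labels); labels hold the original id at masked
    positions and IGNORE_INDEX everywhere else.
    """
    rng = random.Random(seed)
    input_ids, labels = [], []
    for tok in token_ids:
        if tok not in special_ids and rng.random() < mask_probability:
            input_ids.append(mask_id)
            labels.append(tok)
        else:
            input_ids.append(tok)
            labels.append(IGNORE_INDEX)
    return input_ids, labels

tokens = [1] + list(range(100, 140))  # toy ids; 1 stands in for <s>
masked, labels = mask_tokens(tokens, mask_id=0, special_ids={1})
print(sum(l != IGNORE_INDEX for l in labels), "of", len(tokens), "tokens masked")
```

Seeding the generator keeps the masking reproducible across runs, which matches the `seed=42` argument passed to the preprocessor.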

In [9]:
json_file_paths = ["extracted_sentences.json"]
preprocessor = CLAPreprocessor(json_file_paths, tokenizer)
dataset = preprocessor.preprocess()

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
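The warning above indicates that no maximum length was set for truncation. As an illustration of what a causal language modeling example typically looks like, here is a small sketch that truncates and pads a token sequence and copies the inputs into `labels` (Hugging Face causal models shift the labels internally). `build_causal_example` and its arguments are hypothetical and do not reflect the actual `CLAPreprocessor` implementation.

```python
def build_causal_example(token_ids, max_length, pad_id):
    """Truncate to max_length, pad on the right, and ignore padding in the loss."""
    ids = list(token_ids)[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        "labels": ids + [-100] * n_pad,  # -100 is skipped by the loss
    }

example = build_causal_example([1, 523, 88, 904], max_length=6, pad_id=2)
print(example)
```

The three keys match the features (`input_ids`, `attention_mask`, `labels`) that the dataset reports further down in the notebook.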


## 3. Load and Prepare Model

We'll now load the base model and prepare it for QLoRA fine-tuning.

In [10]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

In [11]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use load_in_8bit=True for 8-bit quantization
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quantization_config,
)

config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
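To see why 4-bit loading matters here, the weight count can be estimated from the `MistralConfig` values printed in the training log of this notebook. The sketch below is back-of-the-envelope arithmetic that assumes the standard Mistral layer layout (bias-free attention and gated MLP projections, two RMSNorm vectors per layer, untied embeddings). The 4-bit figure is only approximate, since quantization adds some overhead and not every module is necessarily quantized.

```python
# Values copied from the MistralConfig dump in this notebook's training log
hidden, inter, layers = 4096, 14336, 32
heads, kv_heads, head_dim, vocab = 32, 8, 128, 32768

attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim  # q, o + k, v
mlp = 3 * hidden * inter   # gate, up and down projections
norms = 2 * hidden         # two RMSNorm weight vectors per layer
per_layer = attn + mlp + norms
total = layers * per_layer + 2 * vocab * hidden + hidden  # + embeddings, lm_head, final norm

print(f"parameters: {total:,}")
print(f"16-bit weights: {total * 2 / 1e9:.1f} GB")
print(f"4-bit weights:  {total * 0.5 / 1e9:.1f} GB (approximate)")
```

The 16-bit estimate can be compared with the three downloaded shards above, whose sizes add up to about 14.5 GB.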

In [12]:
from datasets import Dataset
train_test_set = dataset.train_test_split(test_size=0.1, seed=42)  # fixed seed for a reproducible 90/10 split

In [13]:
train_test_set 

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 126
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 14
    })
})

In [14]:
# The base model and tokenizer were already loaded above, so load_base_model() is skipped
#model, tokenizer = load_base_model()

# Prepare for LoRA training
model = prepare_for_training(
    model,
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.05
)
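The notebook does not show which modules `prepare_for_training` adapts. As a worked check, the sketch below counts LoRA parameters under the assumption that rank-8 adapters are attached to the four attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) in every decoder layer. Under that assumption the result equals the 6,815,744 trainable parameters reported in the training log, although other configurations could in principle produce the same number.

```python
hidden, layers, heads, kv_heads, head_dim = 4096, 32, 32, 8, 128
lora_r = 8

# (in_features, out_features) of each attention projection in one decoder layer
projections = {
    "q_proj": (hidden, heads * head_dim),
    "k_proj": (hidden, kv_heads * head_dim),
    "v_proj": (hidden, kv_heads * head_dim),
    "o_proj": (heads * head_dim, hidden),
}

def lora_params(in_features, out_features, r):
    """LoRA adds two matrices: A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, lora_r) for i, o in projections.values())
trainable = layers * per_layer
print(f"trainable LoRA parameters: {trainable:,}")
```

With `lora_alpha=16` and `lora_r=8`, the adapter update is scaled by `lora_alpha / lora_r = 2` in the standard LoRA formulation.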

## 4. Train the Model

Now we'll fine-tune the model on our domain-specific data.

In [15]:
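# Disable Weights & Biases logging. This variable is deprecated;
# the warning below recommends report_to="none" instead.
os.environ["WANDB_DISABLED"] = "true"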

In [16]:
# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=3e-4,
    fp16=True,
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    optim="paged_adamw_8bit",
    log_level="debug"
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_test_set['train'],
    eval_dataset=train_test_set['test']
)

# Start training

trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using auto half precision backend
Currently training with a batch size of: 1
***** Running training *****
  Num examples = 126
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 4
  Total optimization steps = 93
  Number of trainable parameters = 6,815,744
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 1.7615 | 1.560666 |
| 2 | 1.2284 | 1.25287 |



***** Running Evaluation *****
  Num examples = 14
  Batch size = 1
Saving model checkpoint to ./results/checkpoint-32
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.3/snapshots/d8cadc02ac76bd617a919d50b092e59d2d110aff/config.json
Model config MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.47.0",
  "use_cache": true,
  "vocab_size": 32768
}


TrainOutput(global_step=93, training_loss=2.121923282582273, metrics={'train_runtime': 11863.5799, 'train_samples_per_second': 0.032, 'train_steps_per_second': 0.008, 'total_flos': 5.243397862785024e+16, 'train_loss': 2.121923282582273, 'epoch': 2.9206349206349205})
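The step count in the log follows from the dataset size and the batch settings. The sketch below reproduces it, assuming a single device and assuming that update steps per epoch are obtained by floor division of the number of batches by `gradient_accumulation_steps`. That rounding is consistent with the 93 steps logged in this run, but other library versions may count differently.

```python
import math

num_examples, epochs = 126, 3
per_device_batch, grad_accum = 1, 4

effective_batch = per_device_batch * grad_accum  # single device assumed
batches_per_epoch = math.ceil(num_examples / per_device_batch)
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_steps = math.ceil(epochs * updates_per_epoch)

train_runtime = 11863.5799  # seconds, taken from TrainOutput above
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps}")
print(f"seconds per step: {train_runtime / total_steps:.1f}")
```

At roughly two minutes per optimizer step, the full run took about 3.3 hours, which matches the reported `train_steps_per_second` of 0.008.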

## 5. Test the Model

Let's test the fine-tuned model with some domain-specific queries.

In [17]:
# Example queries about heat transfer and fluid dynamics
queries = [
    "Explain the relationship between pulsating crossflow and heat transfer efficiency.",
    "What are the key factors affecting skin friction in the experimental setup?",
    "Summarize the main findings regarding heat transfer dynamics in the study.",
    "How does the Reynolds number influence the observed phenomena?"
]

# Generate responses
for query in queries:
    print(f"\nQuery: {query}")
    response = generate_response(
        model,
        tokenizer,
        query,
        max_new_tokens=512,
        temperature=0.7
    )
    print(f"Response: {response}")
    print("-" * 80)




Query: Explain the relationship between pulsating crossflow and heat transfer efficiency.
Response: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
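The `temperature=0.7` argument is passed to `generate_response`, whose internals are not shown here. Assuming it is forwarded to sampling in the usual way, temperature divides the logits before the softmax, so values below 1 concentrate probability on the highest-scoring tokens. The sketch below illustrates this with toy logits; `softmax_with_temperature` is a hypothetical helper, not a `mistralpeft` function.

```python
import math

def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate tokens
for t in (1.0, 0.7):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures make generation more deterministic, so a model that already favors one token strongly (as with the repeated "." above) will tend to keep repeating it.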

## 6. Save the Model

Finally, let's save our fine-tuned model for later use.

In [18]:
# Save the fine-tuned model
output_dir = Path("./final_model")
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model saved to {output_dir}")

tokenizer config file saved in final_model/tokenizer_config.json
Special tokens file saved in final_model/special_tokens_map.json


Model saved to final_model
