# Domain Adaptation using QLoRA

This notebook demonstrates how to:
1. Extract sentences from a technical PDF
2. Prepare MLM training data
3. Fine-tune a language model using QLoRA
4. Test and save the fine-tuned model

In [1]:
from utils import (
    SentenceExtractor,
    MLMPreprocessor,
    load_base_model,
    prepare_for_training,
    generate_response,
)
from transformers import Trainer, TrainingArguments, AutoTokenizer
from pathlib import Path

In [2]:
# from huggingface_hub import login
# login()

## 1. Extract Sentences from PDF

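First, we'll extract and clean sentences from the dissertation PDF. The cell below is commented out; uncomment and run it if `extracted_sentences.json`, which step 2 reads, has not been generated yet.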

In [3]:
# # Initialize extractor
# extractor = SentenceExtractor(language="en")  # or "de" for German

# # Process the PDF
# pdf_path = "Dissertation.pdf"
# extraction_result = extractor.process_document(
#     pdf_path,
#     output_path="extracted_sentences.json"
# )

# print(f"Extracted {extraction_result['metadata']['num_sentences']} sentences")

# # Preview some sentences
# print("\nExample sentences:")
# for sentence in extraction_result['sentences'][:3]:
#     print(f"- {sentence}")
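
The internals of `SentenceExtractor` are not shown in this notebook. As a rough illustration of the extract-and-clean step, here is a minimal sketch that works on raw text already pulled out of a PDF. The function name `extract_sentences`, the `min_words` filter, and the regex-based splitting are assumptions made for this example; only the shape of the returned dictionary (`sentences` plus `metadata['num_sentences']`) mirrors how the result is accessed in the cell above.

```python
import re


def extract_sentences(raw_text, min_words=3):
    """Clean raw PDF text and split it into sentences (naive, regex-based)."""
    # Rejoin words hyphenated across line breaks ("trans-\nfer" -> "transfer").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    # Collapse every run of whitespace, including newlines, into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Split after ., ! or ? when the next word starts with an uppercase letter.
    candidates = re.split(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])", text)
    # Drop very short fragments such as stray figure labels.
    sentences = [s for s in candidates if len(s.split()) >= min_words]
    return {
        "sentences": sentences,
        "metadata": {"num_sentences": len(sentences)},
    }


sample = "Heat trans-\nfer depends on the flow.   Skin friction\nvaries with Reynolds number. Fig. 3."
result = extract_sentences(sample)
for sentence in result["sentences"]:
    print(f"- {sentence}")
```

A dedicated sentence tokenizer would handle abbreviations and citations more reliably than this regex; the sketch only shows the kind of cleanup involved.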

## 2. Prepare MLM Training Data

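Now we'll create masked language modeling (MLM) examples for training: each token is masked with a probability of 15%, and 90% of the examples go to the training split.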

In [4]:
model_name = "mistralai/Mistral-7B-v0.3"  # Or the specific quantized version if you are using one.
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [5]:
# Initialize preprocessor
preprocessor = MLMPreprocessor(
    tokenizer_name=model_name,  # reuse the checkpoint name defined in the previous cell
    mask_probability=0.15,
    max_length=512,
    seed=42
)

# Create datasets
datasets = preprocessor.process_files(
    ["extracted_sentences.json"],
    output_dir="processed_data",
    train_split=0.9
)

print(f"Created {len(datasets['train'])} training and {len(datasets['val'])} validation examples")

# Preview a training example
example = datasets['train'][0]
print("\nExample input:")
print(preprocessor.tokenizer.decode(example['input_ids']))

Created 9306 training and 1034 validation examples

Example input:
<s><unk>ische Universität München Institut für<unk>ner<unk>iete<unk>ik Professur für Therm<unk>luiddynamik Dynamics<unk><unk>steady Heat Transfer and Skin Friction in Pulsating<unk> Across<unk> C<unk>inder Armin Witte Vollständiger Abdruck der von der Fakultät für<unk>chinenwesen der Technischen Universität München zur Erlangung des akademischen Grades eines<unk><unk><unk>OR – IN<unk>IEURS genehm<unk> Dissertation.
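
In the decoded example, `<unk>` appears where tokens were masked, which suggests the preprocessor uses the unknown token as a mask placeholder. How `MLMPreprocessor` actually does this is not shown here. The sketch below is one simple way to build such an example, under these assumptions: the function name `mask_tokens` is hypothetical, id 0 is the placeholder, ids 1 and 2 are special tokens that are never masked, and unmasked positions get the label `-100` so that a PyTorch cross-entropy loss ignores them.

```python
import random


def mask_tokens(token_ids, mask_id=0, mask_probability=0.15,
                special_ids=(1, 2), seed=42):
    """Replace random tokens with mask_id and build matching labels."""
    rng = random.Random(seed)
    input_ids, labels = [], []
    for tok in token_ids:
        if tok not in special_ids and rng.random() < mask_probability:
            input_ids.append(mask_id)  # hide the token from the model
            labels.append(tok)         # the model must predict the original id
        else:
            input_ids.append(tok)      # keep the token visible
            labels.append(-100)        # ignored by the loss
    return input_ids, labels


original = [1, 4519, 25058, 25130, 21697, 4214, 1847, 2064, 29474, 1617]
masked, labels = mask_tokens(original)
print(masked)
print(labels)
```

This is deliberately simpler than the BERT recipe, which also leaves some selected tokens unchanged or swaps them for random tokens.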


In [6]:
example["input_ids"]

[1, 0, 4519, 25058, 25130, 21697, 4214, 0, 1847, 0,
 2064, 29474, 0, 1617, 8797, 1092, 4214, 1310, 1626, 0,
 9840, 3326, 4560, 1617, 1152, 26473, 0, 0, 29045, 29492,
 24959, 25737, 1072, 4659, 1030, 2129, 3801, 1065, 1135, 8318,
 1845, 0, 5636, 2324, 0, 1102, 0, 5820, 1778, 2008,
 1162, 26823, 1318, 1561, 22359, 5654, 19154, 1319, 1374, 1659,
 2576, 1659, 1169, 1259, 1285, 6059, 4214, 0, 1106, 10052,
 29495, 24436, 1659, 7540, 3864, 25058, 25130, 6924, 3620, 5498,
 1737, 1402, 13835, 5015, 3864, 2546, 3318, 13299, 0, 0,
 0, 1785, 1532, 3461, 0, 8221, 2758, 29503, 17966, 17681,
 0, 4201, 1889, 1120, 29491]
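
As a quick sanity check, the ids above can be used to confirm that the share of masked tokens is close to `mask_probability=0.15`. The sketch assumes that id 0 (decoded as `<unk>`) always marks a masked position and that id 1 is the `<s>` start token, which matches the decoded example.

```python
# Token ids of the first training example, copied from the cell output above.
input_ids = [
    1, 0, 4519, 25058, 25130, 21697, 4214, 0, 1847, 0,
    2064, 29474, 0, 1617, 8797, 1092, 4214, 1310, 1626, 0,
    9840, 3326, 4560, 1617, 1152, 26473, 0, 0, 29045, 29492,
    24959, 25737, 1072, 4659, 1030, 2129, 3801, 1065, 1135, 8318,
    1845, 0, 5636, 2324, 0, 1102, 0, 5820, 1778, 2008,
    1162, 26823, 1318, 1561, 22359, 5654, 19154, 1319, 1374, 1659,
    2576, 1659, 1169, 1259, 1285, 6059, 4214, 0, 1106, 10052,
    29495, 24436, 1659, 7540, 3864, 25058, 25130, 6924, 3620, 5498,
    1737, 1402, 13835, 5015, 3864, 2546, 3318, 13299, 0, 0,
    0, 1785, 1532, 3461, 0, 8221, 2758, 29503, 17966, 17681,
    0, 4201, 1889, 1120, 29491,
]

UNK_ID = 0  # assumed mask placeholder, decoded as <unk>
BOS_ID = 1  # start-of-sequence token, decoded as <s>

# Exclude the start token, which is not a candidate for masking here.
content = [t for t in input_ids if t != BOS_ID]
num_masked = sum(1 for t in content if t == UNK_ID)
mask_rate = num_masked / len(content)
print(f"{num_masked} of {len(content)} tokens masked ({mask_rate:.1%})")
```

For this example, 16 of 104 tokens are masked, about 15.4%.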

## 3. Load and Prepare Model

We'll now load the base model and prepare it for QLoRA fine-tuning with LoRA rank 8, alpha 16, and dropout 0.05.

In [7]:
# Load base model and tokenizer
model, tokenizer = load_base_model()

# Prepare for LoRA training
model = prepare_for_training(
    model,
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.05
)

Some parameters are on the meta device because they were offloaded to the cpu and disk.
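
To get a feel for what `lora_r=8` and `lora_alpha=16` mean, the arithmetic below counts the trainable parameters LoRA adds to a single weight matrix. It assumes a square 4096 × 4096 projection (4096 is the hidden size of Mistral-7B) and the standard LoRA scaling factor `alpha / r`. Which layers `prepare_for_training` actually adapts is not shown in this notebook, and the helper name `lora_param_count` is made up for this example.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns two low-rank factors: A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)


hidden_size = 4096  # hidden size of Mistral-7B
lora_r, lora_alpha = 8, 16

full_params = hidden_size * hidden_size                        # frozen weights
added_params = lora_param_count(hidden_size, hidden_size, lora_r)
scaling = lora_alpha / lora_r                                  # multiplies B @ A

print(f"Frozen weights in the matrix: {full_params:,}")
print(f"Trainable LoRA parameters:    {added_params:,}")
print(f"Trainable share:              {added_params / full_params:.2%}")
print(f"LoRA scaling factor:          {scaling}")
```

For such a matrix the adapter trains 65,536 parameters, roughly 0.39% of the 16,777,216 frozen weights.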


## 4. Train the Model

Now we'll fine-tune the model on our domain-specific data.

In [None]:
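# Set up training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size per device: 4 * 4 = 16
    learning_rate=3e-4,
    fp16=True,  # mixed-precision training
    logging_steps=10,
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",  # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
    optim="paged_adamw_8bit"  # paged 8-bit AdamW from bitsandbytes
)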

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=datasets['train'],
    eval_dataset=datasets['val']
)

# Start training
trainer.train()
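
The training arguments translate into a rough step budget. The estimate below uses the 9306 training examples reported earlier and assumes a single device. The exact number of optimizer steps reported by `Trainer` can differ slightly, because its rounding of the last partial batch depends on the library version.

```python
import math

num_train_examples = 9306        # size of datasets['train'] reported above
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3
num_devices = 1                  # assumption: single-GPU training

# Examples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
# Approximate optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(f"Effective batch size: {effective_batch_size}")
print(f"Approx. optimizer steps per epoch: {steps_per_epoch}")
print(f"Approx. optimizer steps in total:  {total_steps}")
```

With `logging_steps=10`, this corresponds to roughly 170 logged loss values over the run.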



## 5. Test the Model

Let's test the fine-tuned model with some domain-specific queries.

In [None]:
# Example queries about heat transfer and fluid dynamics
queries = [
    "Explain the relationship between pulsating crossflow and heat transfer efficiency.",
    "What are the key factors affecting skin friction in the experimental setup?",
    "Summarize the main findings regarding heat transfer dynamics in the study.",
    "How does the Reynolds number influence the observed phenomena?"
]

# Generate responses
for query in queries:
    print(f"\nQuery: {query}")
    response = generate_response(
        model,
        tokenizer,
        query,
        max_new_tokens=512,
        temperature=0.7
    )
    print(f"Response: {response}")
    print("-" * 80)
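
The `temperature=0.7` argument only has an effect if `generate_response` samples from the model's output distribution, which is an assumption here because the helper's code is not shown. The sketch below illustrates temperature scaling on made-up logits: dividing the logits by a temperature below 1 sharpens the distribution, so likely tokens become even more likely.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.7):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```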

## 6. Save the Model

Finally, let's save our fine-tuned model for later use.

In [None]:
# Save the fine-tuned model
output_dir = Path("./final_model")
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model saved to {output_dir}")
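
If `prepare_for_training` wraps the model with PEFT, which is an assumption here, `save_pretrained` stores only the adapter weights and their config (typically `adapter_config.json` and `adapter_model.safetensors`) rather than the full base model. The hypothetical helper below lists what was actually written, so this can be checked.

```python
from pathlib import Path


def list_saved_files(model_dir):
    """Return {relative file name: size in bytes} for all files under model_dir."""
    root = Path(model_dir)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


saved_dir = Path("./final_model")
if saved_dir.exists():
    for name, size in list_saved_files(saved_dir).items():
        print(f"{name}: {size / 1e6:.2f} MB")
```

A directory holding only a few tens of megabytes indicates that only the adapter was saved, in which case the base model has to be loaded again before the adapter can be applied.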