# The Goal
The goal is to fine-tune an LLM so that it generates a domain-name suggestion for a website based on a prompt.

# The setup
## LLM
As the open-source base LLM, Mistral-7B is used (`unsloth/mistral-7b`; the code below loads `mistralai/Mistral-7B-v0.1`). It is a rather small model, which fits this highly specific task well. Note that this checkpoint is the base model rather than an instruct variant, and that it is text-only, not multimodal.

## Library
For fine-tuning, the open-source framework Unsloth is used.

## General settings


For the sake of simplicity, synthetic data is generated with an LLM. It is common (for example in distillation)
to use the same LLM that is later fine-tuned to generate the training input for it. We will use Mistral AI here.
Mistral AI models are openly available, come in small sizes and are powerful (Llama/GPT level).

For usage, thinking/chain-of-thought (CoT) is enabled, mainly so that the LLM itself filters out poor suggestions ("we-are-trustworthy-money.com", for example, would be valid but silly).

The training data was requested with the following prompt:

Generate 300 domain names from different business fields and provide a related prompt for each one. Send me your structure in JSON format, as is common for LLM fine-tuning ("text": "<PROMPT>###<OUTPUT>").

Bear in mind that users may ask in strange ways, but some ask directly and seriously. Vary the prompts in structure, wording, etc.
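
The requested `"text": "<PROMPT>###<OUTPUT>"` structure has to be split back into a prompt and a domain before it can be used for training. A minimal sketch of that step, assuming the generator really returns a JSON list of such objects and uses `###` as the separator; the helper name `split_records` and the sample record are illustrative and not taken from the actual dataset:

```python
import json

def split_records(raw_json: str, sep: str = "###") -> list:
    """Turn [{"text": "<PROMPT>###<OUTPUT>"}, ...] into input/output pairs."""
    records = []
    for item in json.loads(raw_json):
        prompt, found, output = item["text"].partition(sep)
        # Skip malformed rows instead of guessing where the prompt ends
        if not found or not prompt.strip() or not output.strip():
            continue
        records.append({"input": prompt.strip(), "output": output.strip()})
    return records

# Hypothetical sample in the requested format
sample = json.dumps([
    {"text": "I need a domain for my bakery in Berlin###berlin-backstube.de"},
    {"text": "row without separator"},
])
pairs = split_records(sample)
print(pairs)
```

The resulting `input` and `output` keys are the ones the loading cell reads from `prompts.json`.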

In [2]:
!pip install unsloth

Collecting unsloth
  Downloading unsloth-2025.7.11-py3-none-any.whl.metadata (47 kB)
Collecting unsloth_zoo>=2025.7.11 (from unsloth)
  Downloading unsloth_zoo-2025.7.11-py3-none-any.whl.metadata (8.1 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.31.post1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (1.1 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.27-py3-none-any.whl.metadata (11 kB)
Collecting datasets<4.0.0,>=3.4.1 (from unsloth)
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting trl!=0.15.0,!=0.19.0,!=0.9.0,!=0.9.1,!=0.9.2,!=0.9.3,>=0.7.9 (from unsloth)
 

In [3]:
from unsloth import FastLanguageModel
import torch

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [4]:
import json
from datasets import Dataset

with open("prompts.json") as f:
    data = json.load(f)

mistral_data = Dataset.from_list([
    {"text": f"<s>[INST] {item['input']} [/INST] {item['output']}</s>"}
    for item in data
])



In [5]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mistralai/Mistral-7B-v0.1",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

==((====))==  Unsloth 2025.7.11: Fast Mistral patching. Transformers: 4.54.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.7.11 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
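
The LoRA settings above determine how many parameters are actually trained. The following sketch recomputes that number by hand, assuming the published Mistral-7B layer shapes (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 decoder layers); the result can be compared with the trainable-parameter count that Unsloth prints in its training banner.

```python
# Mistral-7B layer shapes (assumed from the published model configuration)
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
RANK = 16  # same value as r in get_peft_model

# (input features, output features) of every adapted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds two matrices next to the frozen weight: A (r x d_in) and B (d_out x r)
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(per_layer, total)  # 1310720 41943040
```

Doubling `r` doubles the number of trainable parameters, which is worth keeping in mind when the rank is included in a hyperparameter search.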


In [6]:
# Hold out 10% for validation; the fixed seed makes the split reproducible
mistral_data = mistral_data.train_test_split(test_size=0.1, seed=3407)
train_data = mistral_data['train']
val_data = mistral_data['test']


In [19]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_data,
    eval_dataset=val_data,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        logging_strategy="steps",
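        # With the default evaluation strategy ("no") in transformers, eval_steps
        # alone does not trigger evaluation; the log below shows training loss only
        eval_steps=1,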
        save_strategy="no",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)


Unsloth: Tokenizing ["text"]:   0%|          | 0/285 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"]:   0%|          | 0/32 [00:00<?, ? examples/s]

In [13]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 285 | Num Epochs = 2 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 7,283,675,136 (0.58% trained)


Step,Training Loss
1,0.4184
2,0.5385
3,0.4045
4,0.5024
5,0.4269
6,0.4172
7,0.5919
8,0.4932
9,0.5681
10,0.533


In [20]:
trainer.save_model("reference");

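Apparently, the model is converging: the training loss stays between roughly 0.40 and 0.59 over the ten logged steps (the validation data has not been checked so far). However, there might be room for improvement, because the loss fluctuates from step to step without a clear downward trend, so the training appears unstable.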
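
To make the impression of instability a bit more concrete, the logged losses can be summarised by their mean, their spread and a least-squares slope per step. This is only a rough sketch over the ten values shown in the log above; the helper `loss_summary` is illustrative, and ten steps are too few for a firm conclusion.

```python
from statistics import mean, stdev

def loss_summary(losses):
    """Return mean, standard deviation and least-squares slope per step."""
    steps = range(1, len(losses) + 1)
    step_mean, loss_mean = mean(steps), mean(losses)
    slope = (
        sum((s - step_mean) * (l - loss_mean) for s, l in zip(steps, losses))
        / sum((s - step_mean) ** 2 for s in steps)
    )
    return loss_mean, stdev(losses), slope

# Training losses of steps 1 to 10, copied from the log of the first run
first_run = [0.4184, 0.5385, 0.4045, 0.5024, 0.4269,
             0.4172, 0.5919, 0.4932, 0.5681, 0.533]

avg, spread, slope = loss_summary(first_run)
print(f"mean={avg:.4f} std={spread:.4f} slope={slope:+.4f}")
```

A slope that is not negative means the loss is not yet trending downward over these steps, which supports looking at the validation loss before drawing conclusions.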

In [29]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "mistralai/Mistral-7B-v0.1",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_data,
    eval_dataset=val_data,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=1e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        logging_strategy="steps",
        eval_steps=1,
        save_strategy="no",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs",
    ),
)
trainer_stats = trainer.train()

==((====))==  Unsloth 2025.7.11: Fast Mistral patching. Transformers: 4.54.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth: Tokenizing ["text"]:   0%|          | 0/285 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"]:   0%|          | 0/32 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 285 | Num Epochs = 2 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 7,283,675,136 (0.58% trained)


Step,Training Loss
1,4.291
2,4.7065
3,4.1845
4,3.9922
5,3.6878
6,3.2844
7,2.8827
8,2.4339
9,2.338
10,2.1023


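Apparently, this tuning did not work: with the lower learning rate and the cosine schedule, the training loss starts at 4.29 and is still at 2.10 after ten steps, well above the 0.40 to 0.59 of the first run. Choosing concrete hyperparameters would require a systematic hyperparameter search, for example with the Ray Tune framework (see below).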

In [None]:
!pip install "ray[tune]"

In [None]:
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer
import ray
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune import CLIReporter
import os
import torch  # needed for the bf16 support check in the training arguments

# Initialize Ray
ray.init()

# Define train function for Ray Tune
def train_mistral(config):
    # Load the base model inside each trial so that every trial starts from
    # fresh weights instead of re-wrapping one shared, already adapted model
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="mistralai/Mistral-7B-v0.1",
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=True,
    )

    # Apply LoRA with configurable parameters
    model_peft = FastLanguageModel.get_peft_model(
        model,
        r=config["r"],
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj"
        ],
        lora_alpha=config["lora_alpha"],
        lora_dropout=config["lora_dropout"],
        bias="none",
        use_gradient_checkpointing=True,
        random_state=3407,
        use_rslora=False,
        loftq_config=None,
    )

    # Training arguments with configurable params
    training_args = TrainingArguments(
        per_device_train_batch_size=config["batch_size"],
        gradient_accumulation_steps=config["gradient_accumulation_steps"],
        warmup_steps=config["warmup_steps"],
        max_steps=60,
        learning_rate=config["lr"],
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        logging_strategy="steps",
        eval_steps=1,
        save_strategy="no",
        optim="adamw_8bit",
        weight_decay=config["weight_decay"],
        lr_scheduler_type=config["scheduler"],
        seed=3407,
        output_dir=os.path.join(config["output_dir"], tune.get_trial_dir()),
    )

    # Create trainer
    trainer = SFTTrainer(
        model=model_peft,
        tokenizer=tokenizer,
        train_dataset=train_data,
        eval_dataset=val_data,
        dataset_text_field="text",
        max_seq_length=2048,
        dataset_num_proc=2,
        packing=True,
        args=training_args,
    )

    # Start training
    trainer.train()

    # Evaluate and report metrics to Tune
    eval_metrics = trainer.evaluate()
    eval_loss = float(eval_metrics["eval_loss"])
    tune.report(
        eval_loss=eval_loss,
        # Perplexity is the exponential of the mean cross-entropy loss
        perplexity=float(torch.exp(torch.tensor(eval_loss))),
    )

# Define search space
config = {
    "lr": tune.loguniform(1e-5, 1e-3),
    "batch_size": tune.choice([2, 4, 8]),
    "r": tune.choice([8, 16, 32]),
    "lora_alpha": tune.choice([8, 16, 32]),
    "lora_dropout": tune.uniform(0.0, 0.2),
    "weight_decay": tune.uniform(0.0, 0.1),
    "gradient_accumulation_steps": tune.choice([2, 4, 8]),
    "warmup_steps": tune.choice([5, 10, 20]),
    "scheduler": tune.choice(["linear", "cosine", "constant"]),
    "output_dir": "./ray_results",
}

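# PBT Scheduler
# Note: PBT acts between reported iterations and clones checkpoints of
# well-performing trials; train_mistral reports once at the end and saves no
# checkpoint, so the scheduler has little to act on in this setup.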
scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_loss",
    mode="min",
    perturbation_interval=1,
    hyperparam_mutations={
        "lr": tune.loguniform(1e-5, 1e-3),
        "batch_size": [2, 4, 8],
        "lora_dropout": tune.uniform(0.0, 0.2),
        "weight_decay": tune.uniform(0.0, 0.1),
    }
)

# CLI Reporter
reporter = CLIReporter(
    parameter_columns={
        "lr": "lr",
        "batch_size": "bs",
        "r": "r",
        "lora_alpha": "alpha",
        "weight_decay": "wd",
    },
    metric_columns=["eval_loss", "perplexity", "training_iteration"],
)

# Run Tune experiment
analysis = tune.run(
    train_mistral,
    resources_per_trial={"gpu": 1},
    config=config,
    num_samples=8,  # Number of trials
    scheduler=scheduler,
    progress_reporter=reporter,
    local_dir="./ray_results",
    name="mistral_tuning",
)

print("Best config:", analysis.get_best_config(metric="eval_loss", mode="min"))

# Dataset augmentation
- I would reuse the same inputs in other languages (German, French) and occasionally inject spelling errors, to make the model robust against misspellings.
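
The spelling-error part of this idea can be sketched without any external service. The example below swaps adjacent letters in the prompt with a small probability and leaves the target domain untouched. The function names, the swap strategy and the sample record are assumptions for illustration; translation into German or French would need a separate translation model and is not shown.

```python
import random

def inject_typos(text: str, rate: float, rng: random.Random) -> str:
    """Randomly swap adjacent letters to imitate typing errors."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # do not move the same character twice
        else:
            i += 1
    return "".join(chars)

def augment(record: dict, rate: float = 0.05, seed: int = 3407) -> dict:
    # Only the prompt is perturbed; the expected domain stays as it is
    rng = random.Random(seed)
    return {
        "input": inject_typos(record["input"], rate, rng),
        "output": record["output"],
    }

# Hypothetical record in the input/output format used for training
example = {"input": "Suggest a domain for my bakery", "output": "freshbakery.com"}
noisy = augment(example, rate=0.2)
print(noisy)
```

Keeping the original record alongside the noisy copy lets the model see both the clean and the misspelled variant of the same request.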


# LLM-as-a-Judge Evaluation Framework

I have no access to an LLM API that is powerful enough for this, so I will only describe how I would do it.

The following three judgments could serve as criteria for the judge LLM:
1. Expectation vs. answer: Which domain would the judge LLM itself suggest for the description that the fine-tuned LLM received? Is the fine-tuned model's answer close enough to it?
2. Seriousness: Does the given domain come across as serious or unserious? The judge LLM should be able to assess this.
3. Usability: Is the domain name meaningful or nonsense?

In [None]:
from openai import OpenAI


client = OpenAI(api_key="<DeepSeek API Key>", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system",
         "content": (
             "You are a domain judge. "
             "Check whether the domain fits the description well and "
             "whether you would have suggested a similar domain yourself. "
             "Also check whether the domain is serious or nonsense. "
             "Your feedback is a score, where 3/3 means every category is fulfilled."
         )},
        # Placeholder message; a real call would pass the description and the suggested domain
        {"role": "user", "content": "Hello"},
    ],
    stream=False
)
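
Around such a call, two small helpers would be needed: one that builds the user message from a description and a suggested domain, and one that extracts the score from the judge's reply. The sketch below makes no API call; the message layout, the `N/3` reply format and both function names are assumptions, not part of any judge API.

```python
import re
from typing import Optional

def build_judge_message(description: str, domain: str) -> str:
    """User message carrying the business description and the suggestion."""
    return (
        f"Description: {description}\n"
        f"Suggested domain: {domain}\n"
        "Answer with a score in the form N/3."
    )

def parse_score(reply: str) -> Optional[int]:
    """Extract N from the first 'N/3' in the judge reply, if present."""
    match = re.search(r"\b([0-3])\s*/\s*3\b", reply)
    return int(match.group(1)) if match else None

print(build_judge_message("Organic bakery in Hamburg", "hamburg-bio-baeckerei.de"))
print(parse_score("Score: 2/3, the name is serious but hard to spell."))
print(parse_score("I cannot rate this."))
```

Returning `None` for replies without a score makes it possible to count and inspect unparseable judgments instead of silently treating them as zero.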

# Edge Case Discovery and Analysis
Possible edge cases and failure modes arise where the given context does not make the business clear (for example, a rocket could be military or space-related). Language mixing adds further uncertainty, because it is not clear which language the domain should be in. If the user writes in a language with a non-Latin script, this can be difficult for the LLM, because it may assume that the domain should be in that language too.

The main problem categories are:
1. Ambiguity: Too little context leaves the business or its meaning open to interpretation.
2. Language misunderstanding: Mixing languages in the same text can cause confusion about the target language.

One solution is to train the model on exactly such cases. For example, it would be good to add Chinese prompts and even mixed prompts (Chinese with English terms).
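
To find out how many such prompts a dataset already contains, prompts can be bucketed by the scripts they use. The following sketch relies on Unicode character names from the standard library; the labels `latin`, `non-latin` and `mixed` and both function names are illustrative choices, and the approach is deliberately coarse (it does not distinguish languages that share a script).

```python
import unicodedata

def scripts_in(text: str) -> set:
    """Coarse script labels taken from Unicode character names."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        # The first word of the name identifies the script for most letters,
        # e.g. "LATIN SMALL LETTER A" or "CJK UNIFIED IDEOGRAPH-4E2D"
        found.add(name.split(" ")[0] if name else "UNKNOWN")
    return found

def classify_prompt(text: str) -> str:
    scripts = scripts_in(text)
    if not scripts:
        return "empty"
    if len(scripts) > 1:
        return "mixed"
    return "latin" if scripts == {"LATIN"} else "non-latin"

for prompt in ["Domain for my bakery", "我想要一个域名", "我想要一个 online shop 的域名"]:
    print(classify_prompt(prompt), "|", prompt)
```

Counting the buckets shows whether non-Latin and mixed prompts are represented at all before new examples are added.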

# Safety Guardrails

To guard against harmful requests, the LLM could be trained on prompts asking for illegal content, for which the expected output is a warning instead of a domain.
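
Such refusal examples could be added to the training set in the same text template as the regular examples. The sketch below only shows the data formatting; the refusal wording, the placeholder prompt and the function names are hypothetical, and real unsafe prompts would have to be collected and reviewed separately.

```python
# Hypothetical refusal text; the actual wording is a design decision
REFUSAL = "I cannot suggest a domain for this request."

def to_training_text(prompt: str, output: str) -> str:
    # Same [INST] template as the regular training examples
    return f"<s>[INST] {prompt} [/INST] {output}</s>"

def build_refusal_examples(unsafe_prompts):
    """Pair every unsafe prompt with the fixed refusal instead of a domain."""
    return [{"text": to_training_text(p, REFUSAL)} for p in unsafe_prompts]

# Placeholder entry, to be replaced by curated examples
unsafe_prompts = ["<prompt describing an illegal business>"]
refusals = build_refusal_examples(unsafe_prompts)
print(refusals[0]["text"])
```

The resulting list has the same `text` field as the regular dataset, so it could be concatenated with it before the train/validation split.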