## 1 - Install requirements

Change the pinned torch version if your environment already uses a different one.

In [1]:
%pip install --upgrade pip
%pip install --disable-pip-version-check torch==2.6.0 transformers accelerate evaluate rouge_score loralib peft tqdm

Collecting pip
  Using cached pip-25.2-py3-none-any.whl.metadata (4.7 kB)
Using cached pip-25.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.3.1
    Uninstalling pip-24.3.1:
      Successfully uninstalled pip-24.3.1
Successfully installed pip-25.2
Note: you may need to restart the kernel to use updated packages.
Collecting evaluate
  Using cached evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge_score
  Using cached rouge_score-0.1.2-py3-none-any.whl
Collecting loralib
  Using cached loralib-0.1.2-py3-none-any.whl.metadata (15 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Using cached datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting dill (from evaluate)
  Using cached dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting pandas (from evaluate)
  Using cached pandas-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Coll

Import necessary packages

In [1]:
import json
import random
import re
import time
from typing import List, Tuple, Dict, Any

import numpy as np
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from rouge_score import rouge_scorer
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, \
    DataCollatorForLanguageModeling, GenerationConfig



## 2 - Generating domain name suggestions using Prompt Engineering

For this task I chose Qwen3-0.6B because it is small enough to fit entirely into my machine's VRAM.
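
As a rough sanity check on the VRAM claim, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes about 0.6 billion parameters (the nominal size in the model name, not a measured value) and standard dtype widths. It ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: ~0.6e9 parameters, taken from the model name, not measured.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GiB for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(0.6e9, dtype):.2f} GiB")
```

Under these assumptions the weights alone need a little over 1 GiB in 16-bit precision, which leaves room for generation on most consumer GPUs.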

In [2]:
model_name = 'Qwen/Qwen3-0.6B'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype='auto',
    device_map='auto'
)

`torch_dtype` is deprecated! Use `dtype` instead!


First, I want to test how this model generates domain names without any prompt engineering.

In [51]:
prompt = "Give me domain name suggestion based on this business description - My business is a specialized online marketplace dedicated to showcasing and selling small Japanese board games"
messages = [
    {'role': 'user', 'content': prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=500
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

print(content)

Here are some domain name suggestions based on your business description:

1. **JagameMarket.com**  
2. **JagameShop.com**  
3. **JagameOnline.com**  
4. **JagameLabs.com**  
5. **JagameTales.com**  
6. **JagameGames.com**  
7. **JagameGaming.com**  
8. **JagameWorld.com**  
9. **JagameWorldOnline.com**  
10. **JagameJunk.com**  

These names are all related to Japanese board games and could work well for your online marketplace. Let me know if you'd like more options!


"Jagame" is repeated in every suggested domain name, which shows no creativity. The output could be improved by returning only the domain names, without any explanation, in a more structured format, and with more creative names.

### 2.1 - Zero Shot Inference
I will test how the model generates domain names when given clear instructions about what I want to see in the model's output: 10 domains in a list.

In [52]:
business_description = "My business is a specialized online marketplace dedicated to showcasing and selling small Japanese board games"
prompt = """You are a domain name generator. Suggest 10 creative, brandable, and easy-to-remember domain names based on the business description. The names should be:
- Short, catchy, and easy to spell
- Available in common domain extensions like .com, .store, or .shop
- Avoid using hyphens or numbers unless essential

Respond ONLY with a list in the following structure:
1. domain_name.com
2. domain_name.com
...

Do not include any additional text outside of the list.
Here is the business description:
"""
messages = [
    {'role': 'user', 'content': prompt + business_description}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=500
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

print(content)

1. boardgamehub.com  
2. gameshow.com  
3. japanboardgames.com  
4. jpopgames.com  
5. gamezone.com  
6. jgamestore.com  
7. japangames.com  
8. gamestore.com  
9. boardgame.com  
10. japangames.com


This is better because of the output structure, which can later be parsed easily. However, the names could be more creative and use different TLDs, because these domains are most likely already taken. `japangames.com` even appears twice (items 7 and 10).
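
To illustrate why the numbered-list structure is easy to parse, here is a minimal sketch of a parser. The function name `parse_domain_list` and the regular expression are my own assumptions, not part of the notebook. It handles the plain `N. name.tld` format only and would not strip markdown emphasis such as `**name.com**`.

```python
import re
from typing import List

# Matches lines such as "1. boardgamehub.com" and captures the domain part.
LIST_ITEM = re.compile(r"^\s*\d+\.\s*(\S+)\s*$")


def parse_domain_list(text: str) -> List[str]:
    """Extract domain names from a numbered list, skipping any other lines."""
    domains = []
    for line in text.splitlines():
        match = LIST_ITEM.match(line)
        if match:
            domains.append(match.group(1).lower())
    return domains


# First three lines of the zero-shot output, including trailing spaces.
sample = "1. boardgamehub.com  \n2. gameshow.com  \n3. japanboardgames.com"
print(parse_domain_list(sample))
```

Lines that do not match the pattern (for example an introductory sentence) are silently skipped, so the same function can also be used to check whether the model followed the format at all.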

### 2.2 - One Shot Inference
I will add one example to the model's input to see if it helps produce different TLDs.

In [57]:
business_description_example = """This business provides AI-powered retail crime prevention solutions for self-checkout systems
1. scanshield.ai
2. checkoutguard.eu
3. retailsentinel.com
4. shopsecure.ai
5. fraudblocker.tech
6. safescantech.top
7. retailwatchpro.store
8. smartcartsecurity.info
9. visioncheckout.org
10. guardpointai.net

Here is the business description:
"""
messages = [
    {'role': 'user', 'content': prompt + business_description_example + business_description}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=500
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

print(content)

1. japanboard.com  
2. gamehaven.store  
3. boardgamehub.net  
4. gamesworld.io  
5. japangames.com  
6. boardgamesstore.net  
7. japangamesinfo.com  
8. boardgamearena.org  
9. gamescape.io  
10. japangamesstore.com


The output has the same format as in zero-shot inference, but the generated names now use different TLDs, which is what I wanted. However, most of the names are generic rather than creative, so they would most likely already be taken.
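
Judgements such as "different TLDs" and "repeated names" can be made a little more concrete with a small counting helper. This is only a sketch: `summarize_domains` is a hypothetical name, and the list below is copied from the one-shot output above.

```python
from collections import Counter
from typing import Dict, List


def summarize_domains(domains: List[str]) -> Dict[str, object]:
    """Count TLDs and list names that occur more than once."""
    tlds = Counter(d.rsplit(".", 1)[-1] for d in domains)
    counts = Counter(domains)
    duplicates = sorted(d for d, n in counts.items() if n > 1)
    return {"tld_counts": dict(tlds), "duplicates": duplicates}


# The one-shot output shown above.
one_shot = [
    "japanboard.com", "gamehaven.store", "boardgamehub.net", "gamesworld.io",
    "japangames.com", "boardgamesstore.net", "japangamesinfo.com",
    "boardgamearena.org", "gamescape.io", "japangamesstore.com",
]
print(summarize_domains(one_shot))
```

Running the same helper on the outputs of the other prompting strategies gives a quick, if crude, way to compare them side by side.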

### 2.3 - Few Shot Inference
I will add more examples to the model's input to see if it helps produce a wider variety of TLDs and more creative names.

In [58]:
business_description_example_2 = """Digital marketing agency focused on helping small businesses grow online
1. growthdigital.eu
2. digitalreach.io
3. smallbizboom.co
4. marketgrowpro.com
5. digitalboostlab.top
6. onlinebizhelp.store
7. growthmarketing.info
8. digitalscalehub.zone
9. bizgrowthpartners.link
10. marketinglaunchpad.biz


Here is the business description:
"""
business_description_example_3 = """Fitness studio offering yoga, pilates, and meditation classes
1. zenfitstudio.org
2. mindfulmovement.co
3. yogabliss.net
4. pilatespure.com
5. serenityworkout.top
6. balancestudiohub.link
7. flowfitnesscenter.biz
8. tranquilbody.info
9. holisticfitlab.zone
10. peacefulpilates.store


Here is the business description:
"""

messages = [
    {'role': 'user',
     'content': prompt + business_description_example + business_description_example_2 + business_description_example_3 + business_description}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=500
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

print(content)

1. boardgamehub.com  
2. gamestore.co  
3. japan_games.info  
4. johokagame.com  
5. boardgameshop.com  
6. jisagame.net  
7. boardgamesupply.com  
8. japanboardgames.store  
9. japan-games.net  
10. boardgamesmall.com


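With more examples, the model now generates domains with different TLDs, names relevant to the business description, and slightly more creative names. Creativity could still be improved, because the output is mostly made of generic words. Note also that `japan_games.info` contains an underscore, which is not allowed in a domain name, and `japan-games.net` uses a hyphen despite the instructions.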
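
The few-shot prompt above is assembled by concatenating strings by hand. A small helper can make the structure explicit. This is a sketch under my own assumptions: `build_few_shot_prompt` is a hypothetical function, and `instructions` stands for the instruction text without its trailing "Here is the business description:" line.

```python
from typing import List, Tuple

Example = Tuple[str, List[str]]  # (business description, domain names)

HEADER = "Here is the business description:\n"


def build_few_shot_prompt(instructions: str, examples: List[Example], description: str) -> str:
    """Concatenate instructions, worked examples, and the target description."""
    parts = [instructions.rstrip("\n") + "\n"]
    for example_description, domains in examples:
        numbered = "\n".join(f"{i}. {d}" for i, d in enumerate(domains, 1))
        parts.append(f"{HEADER}{example_description}\n{numbered}\n\n")
    parts.append(HEADER + description)
    return "".join(parts)


# Shortened example list for illustration; the notebook uses 10 names per example.
examples = [
    ("Fitness studio offering yoga, pilates, and meditation classes",
     ["zenfitstudio.org", "mindfulmovement.co"]),
]
description = "Specialized online marketplace for small Japanese board games"
print(build_few_shot_prompt("Suggest 10 domain names.", examples, description))
```

Keeping the examples in a list makes it easy to switch between zero-shot (empty list), one-shot, and few-shot prompts without editing string literals.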

### 2.4 - Parameter tuning
I will now test generation parameters such as `temperature`, as well as thinking mode.

The default temperature setting for this model is `temperature=0.6`.

In [55]:
generation_config = GenerationConfig(do_sample=True, temperature=0.1)
messages = [
    {'role': 'user',
     'content': prompt + business_description_example + business_description_example_2 + business_description_example_3 + business_description}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    generation_config=generation_config,
    max_new_tokens=500
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

print(content)

1. boardgamestore.com  
2. jumigamed.com  
3. gamehub.net  
4. jumigam.com  
5. boardgameshop.com  
6. jumigamstore.com  
7. gamezone.com  
8. boardgames.net  
9. jumigamstore.org  
10. boardgamesmall.com


With `do_sample=True` and `temperature=0.1`, the domain names are not creative: most of them are near-repeats built on "jumigam" and "boardgames", and most use the `.com` extension.

In [59]:
generation_config = GenerationConfig(do_sample=True, temperature=0.9)
messages = [
    {'role': 'user',
     'content': prompt + business_description_example + business_description_example_2 + business_description_example_3 + business_description}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    generation_config=generation_config,
    max_new_tokens=500
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

print(content)

1. gameshop.com  
2. boardgames.pro  
3. jokobigame.top  
4. nihongo_boardgames.org  
5. jokobigames.com  
6. boardgamestore.io  
7. jokobigames.store  
8. nihongo-boardgames.com  
9. boardgames.com  
10. gameshop.net


With `do_sample=True` and `temperature=0.9`, the outputs are more creative, but some of them are irrelevant to the business description: the model simply inserts seemingly random Japanese words into the domain name. This is too random.
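
The contrast between the two runs follows from how temperature rescales the token distribution before sampling. The toy sketch below uses made-up logits for three candidate tokens (not values from the model) and shows the standard temperature-scaled softmax.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.6, 0.9):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At `temperature=0.1` almost all probability mass lands on the top token, so sampling behaves nearly like greedy decoding and tends to repeat itself. At `temperature=0.9` the lower-ranked tokens keep a noticeable share, which is where the more unusual names come from.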

In [60]:
generation_config = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.8, top_k=20, min_p=0)
start_time = time.time()
messages = [
    {'role': 'user',
     'content': prompt + business_description_example + business_description_example_2 + business_description_example_3 + business_description}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    generation_config=generation_config,
    max_new_tokens=500
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

print(content)
print('Time taken:', time.time() - start_time)

1. boardgamez.com  
2. gameguru.store  
3. jokoboard.io  
4. jumiboard.top  
5. boardgameshop.org  
6. boardgames.net  
7. gamestore24.com  
8. jokoboard.info  
9. boardgamehub.com  
10. boardgamezone.org
Time taken: 2.9761745929718018


Here I used the parameters suggested on the model's Hugging Face page for non-thinking mode.

The domain names are mostly relevant to the description, but parts of names are repeated, such as "boardgame", and made-up words such as "joko" and "jumi" appear, which do not help brandability.
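
For readers unfamiliar with `top_k` and `top_p`, the sketch below shows the general idea on a toy distribution. It is a simplified illustration with made-up probabilities, not the `transformers` implementation, which operates on logits; `min_p` is ignored here because the notebook sets it to 0.

```python
from typing import Dict


def filter_top_k_top_p(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(token, p / mass) for token, p in ranked]  # renormalize after top-k

    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy_probs = {"board": 0.5, "game": 0.3, "joko": 0.15, "jumi": 0.05}  # made-up values
print(filter_top_k_top_p(toy_probs, top_k=3, top_p=0.8))
```

With these toy numbers only the two most likely tokens survive, which matches the intuition that a lower `top_p` trims the unlikely tail and makes the output less random.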

In [61]:
generation_config = GenerationConfig(do_sample=True, temperature=0.6, top_p=0.95, top_k=20, min_p=0)
messages = [
    {'role': 'user',
     'content': prompt + business_description_example + business_description_example_2 + business_description_example_3 + business_description}
]
start_time = time.time()
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    generation_config=generation_config,
    max_new_tokens=1000
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip('\n')
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip('\n')

print('thinking content:', thinking_content)
print('content:', content)
print('Time taken:', time.time() - start_time)

thinking content: <think>
Okay, let's see. The user wants me to generate 10 domain names based on a business description. The business is a specialized online marketplace for small Japanese board games. The domain names need to be short, catchy, easy to spell, available in common extensions like .com, .store, .shop, and avoid hyphens or numbers unless necessary.

First, I need to make sure each domain name is unique and fits the theme. The user provided some examples, so I should check if they already have those. The first business description is about AI for retail, so the next one is about digital marketing, then fitness studio, and finally Japanese board games.

Starting with the Japanese board games. Words like "zen," "yog" (yoga), "mindful," "serenity," "holy," "peaceful," "tranquil," "holistic," "peaceful." Maybe combine these with common extensions. For example, "zenboardstore.org" uses "zen" and "store," which is a common extension. "mindfulgames.com" uses "mindful" and "games.

Here I used thinking mode with the parameters suggested in the Hugging Face repo for this model: `temperature=0.6`, `top_p=0.95`, `top_k=20`, `min_p=0`.

These results are not good: the model takes too much inspiration from the few-shot examples, and inference takes about 8 times longer. I would say thinking mode should not be used for this task, especially because the results are much worse. I suspect this is because too much context is provided to a model this small (0.6B), so it starts to hallucinate.

In conclusion, the best prompt engineering results came from few-shot inference with the default parameters for Qwen3-0.6B and thinking mode disabled, but the results could be more creative and less random.

## 3 - Fine-tuning an open-source LLM for domain name suggestion

I will fine-tune the Qwen3-0.6B model using LoRA on a dataset I gather myself, in the format (business description, relevant domain names).

### 3.1 - Creating dataset
I will create the dataset by manually prompting OpenAI's GPT-5 model and processing the output into a usable format with a Python script.

Prompt used:

Create a dataset which I would be able to use to train LLM. Dataset should have these pairs - (business description, relevant domain name). An example of pair - (Fitness studio offering yoga, pilates, and meditation classes, 1. zenfitstudio.org
2. mindfulmovement.co
3. yogabliss.net
4. pilatespure.com
5. serenityworkout.top
6. balancestudiohub.link
7. flowfitnesscenter.biz
8. tranquilbody.info
9. holisticfitlab.zone
10. peacefulpilates.store)

In [6]:
def parse_pairs_file(file_path: str) -> List[Tuple[str, List[str]]]:
    """
    Parse the pairs file and extract business description with all domain suggestions.
    Each business description gets paired with all of its domain suggestions as a list.
    """
    pairs = []

    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Split by business entries (separated by ---)
    entries = content.split('---')

    for entry in entries:
        if not entry.strip():
            continue

        # Extract business description
        desc_match = re.search(r'\*\*Business Description:\*\* (.+?)(?=\n\n|\n\*\*|$)', entry, re.DOTALL)
        if not desc_match:
            continue

        business_desc = desc_match.group(1).strip()

        # Extract domain suggestions
        domain_section = re.search(r'\*\*Domain Suggestions:\*\*.*?(?=\n\n|$)', entry, re.DOTALL)
        if not domain_section:
            continue

        # Extract individual domain names
        domain_lines = domain_section.group(0).split('\n')[1:]  # Skip the header
        domains = []
        for line in domain_lines:
            if line.strip() and re.match(r'\d+\.\s+', line):
                domain = re.search(r'\d+\.\s+(.+)', line).group(1).strip()
                if domain:
                    domains.append(domain)

        if domains:
            pairs.append((business_desc, domains))

    return pairs


def create_training_format(pairs: List[Tuple[str, List[str]]]) -> List[dict]:
    training_data = []

    for business_desc, domains in pairs:
        # Create a prompt-completion pair with multiple domain suggestions
        prompt = f'Here is the business description:\n{business_desc}'

        # Format completion with numbered domain suggestions
        completion_lines = []
        for i, domain in enumerate(domains, 1):
            completion_lines.append(f'{i}. {domain}')

        completion = '\n'.join(completion_lines)

        training_data.append({
            'prompt': prompt,
            'completion': completion
        })

    return training_data


def split_data(pairs: List[Tuple[str, List[str]]], train_ratio: float = 0.8, random_seed: int = 42) -> Tuple[
    List[Tuple[str, List[str]]], List[Tuple[str, List[str]]]]:
    """
    Split the data into training and testing sets.

    Args:
        pairs: List of (business_description, domain_suggestions) tuples
        train_ratio: Ratio of data to use for training (default: 0.8)
        random_seed: Random seed for reproducibility (default: 42)

    Returns:
        Tuple of (training_pairs, testing_pairs)
    """
    # Set random seed for reproducibility
    random.seed(random_seed)

    # Shuffle the pairs
    shuffled_pairs = pairs.copy()
    random.shuffle(shuffled_pairs)

    # Calculate split point
    split_index = int(len(shuffled_pairs) * train_ratio)

    # Split the data
    training_pairs = shuffled_pairs[:split_index]
    testing_pairs = shuffled_pairs[split_index:]

    return training_pairs, testing_pairs


def save_training_data(training_data: List[dict], output_path: str):
    """Save training data to JSON file."""
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(training_data, f, indent=2, ensure_ascii=False)


# Parse the pairs file
pairs = parse_pairs_file('data/pairs')
print(f"Found {len(pairs)} business descriptions with domain suggestions")

# Split into training and testing sets
training_pairs, testing_pairs = split_data(pairs, train_ratio=0.8, random_seed=42)
print(f"Training set: {len(training_pairs)} examples")
print(f"Testing set: {len(testing_pairs)} examples")

# Create training data
training_data = create_training_format(training_pairs)
testing_data = create_training_format(testing_pairs)

# Save both datasets
save_training_data(training_data, 'data/training_dataset.json')
save_training_data(testing_data, 'data/testing_dataset.json')

print('\nExample training pairs:')
for i, example in enumerate(training_data[:2]):
    print(f'\nExample {i + 1}:')
    print(f'Prompt: {example["prompt"]}')
    print(f'Completion: {example["completion"]}')
    print('-' * 50)

print('\nExample testing pairs:')
for i, example in enumerate(testing_data[:2]):
    print(f'\nExample {i + 1}:')
    print(f'Prompt: {example["prompt"]}')
    print(f'Completion: {example["completion"]}')
    print('-' * 50)

Found 25 business descriptions with domain suggestions
Training set: 20 examples
Testing set: 5 examples

Example training pairs:

Example 1:
Prompt: Here is the business description:
Food truck serving gourmet burgers made with locally sourced ingredients
Completion: 1. gourmetburgertruck.com
2. localingredientsburgers.food
3. foodtruckgourmet.net
4. artisanburgermobile.org
5. farmtotablburgers.co
6. mobilegourmetburgers.biz
7. localburgertruck.info
8. gourmetstreetburgers.truck
9. farmfreshburgertruck.mobile
10. artisanburgerco.food
--------------------------------------------------

Example 2:
Prompt: Here is the business description:
Yoga instructor offering private classes and corporate wellness programs
Completion: 1. privateyogainstructor.com
2. corporatewellnessyoga.org
3. personalyogateacher.net
4. workplacewellnessyoga.co
5. mindfulnessinstructor.biz
6. yogaforcompanies.info
7. privateyogasessions.pro
8. corporateyogateacher.wellness
9. customyogaprograms.services
10. wellnes

The output of the GPT-5 model is stored in the `data/pairs` file. It was processed with Python into two separate JSON files, one for training and one for testing the fine-tuned model.
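
The layout of `data/pairs` is not shown in this notebook. Judging from the regular expressions in `parse_pairs_file`, each entry presumably looks like the sample below, with entries separated by `---`. The sample text is hypothetical (it reuses the fitness example from the prompt), and the snippet applies the same patterns to a single entry.

```python
import re

# Hypothetical entry, reconstructed from the patterns used in parse_pairs_file.
sample_entry = """**Business Description:** Fitness studio offering yoga, pilates, and meditation classes

**Domain Suggestions:**
1. zenfitstudio.org
2. mindfulmovement.co
3. yogabliss.net
"""

description = re.search(
    r'\*\*Business Description:\*\* (.+?)(?=\n\n|\n\*\*|$)', sample_entry, re.DOTALL
).group(1).strip()

section = re.search(
    r'\*\*Domain Suggestions:\*\*.*?(?=\n\n|$)', sample_entry, re.DOTALL
).group(0)

# Skip the header line and keep only numbered items.
domains = [
    re.search(r'\d+\.\s+(.+)', line).group(1).strip()
    for line in section.split('\n')[1:]
    if re.match(r'\d+\.\s+', line)
]

print(description)
print(domains)
```

One consequence of these patterns is that the numbered list has to follow the `**Domain Suggestions:**` header directly. If a blank line separated them, the section match would stop at the header and the entry would be skipped.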

In [7]:
def load_training_data(tokenizer, file_path: str):
    """Load the training dataset from JSON file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    texts = []
    for item in data:
        # Create chat messages
        messages = [
            {'role': 'user', 'content': item['prompt']},
            {'role': 'assistant', 'content': item['completion']}
        ]

        # Apply chat template to create the training text
        # This will format it exactly like the model expects during inference
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)

    return texts


def tokenize_function(examples, tokenizer, max_length=512):
    tokenized = tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=max_length,
        return_tensors='pt'
    )

    tokenized['labels'] = tokenized['input_ids'].clone()

    return tokenized

In [8]:
model_name = 'Qwen/Qwen3-0.6B'

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype='auto',
    device_map='auto'
)

print('Loading training dataset...')
training_texts = load_training_data(tokenizer, 'data/training_dataset.json')

dataset = Dataset.from_dict({'text': training_texts})
print(f'Dataset size: {len(dataset)} examples')

tokenized_dataset = dataset.map(
    lambda x: tokenize_function(x, tokenizer),
    batched=True,
    remove_columns=dataset.column_names
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

output_dir = 'fine-tune-model'

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-3,
    num_train_epochs=5,
    logging_steps=5,
    save_steps=50,
    save_total_limit=3,
    warmup_steps=50,
    lr_scheduler_type='cosine',
    dataloader_pin_memory=False,
    remove_unused_columns=False,
    report_to='none',  # the string 'none' explicitly turns off logging integrations
    load_best_model_at_end=False,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

print('Starting training...')
trainer.train()

print(f'Saving model to {output_dir}')
trainer.save_model()
tokenizer.save_pretrained(output_dir)

Loading training dataset...
Dataset size: 20 examples


Map: 100%|██████████| 20/20 [00:00<00:00, 1462.68 examples/s]


trainable params: 10,092,544 || all params: 606,142,464 || trainable%: 1.6650
Starting training...


Step,Training Loss
5,3.5651
10,2.6705
15,1.8126


Saving model to fine-tune-model


('fine-tune-model/tokenizer_config.json',
 'fine-tune-model/special_tokens_map.json',
 'fine-tune-model/chat_template.jinja',
 'fine-tune-model/vocab.json',
 'fine-tune-model/merges.txt',
 'fine-tune-model/added_tokens.json',
 'fine-tune-model/tokenizer.json')
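
The `trainable params: 10,092,544` figure printed above can be checked by hand. Each LoRA adapter adds `r * (in_features + out_features)` parameters to a targeted linear layer. The layer dimensions in the sketch below are assumptions taken from my reading of the Qwen3-0.6B config, not from this notebook, so treat them as such.

```python
# Assumed Qwen3-0.6B dimensions (from the model config as I understand it,
# not from this notebook).
HIDDEN = 1024          # hidden_size
Q_OUT = 16 * 128       # num_attention_heads * head_dim
KV_OUT = 8 * 128       # num_key_value_heads * head_dim
INTERMEDIATE = 3072    # intermediate_size
NUM_LAYERS = 28        # num_hidden_layers
RANK = 16              # r in the LoraConfig above

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")
```

Under these assumed dimensions the total comes to 10,092,544, the same number that `print_trainable_parameters()` reported, or about 1.67% of all parameters.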

If you get an out-of-memory error, try setting `fp16=True` in `TrainingArguments`. With the `TrainingArguments` above, training used ~4.6 GB of VRAM and took about 2 minutes on my machine.
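
It is also worth working out how many optimizer steps this configuration actually performs. The sketch below is plain arithmetic on the values from `TrainingArguments`. It assumes a single device, that a partial accumulation window at the end of an epoch still counts as a step (consistent with the logged step 15), and that warmup is linear, as I understand the cosine-with-warmup schedule to work.

```python
import math

num_examples = 20        # "Dataset size: 20 examples"
per_device_batch = 1     # per_device_train_batch_size
grad_accum = 8           # gradient_accumulation_steps
epochs = 5               # num_train_epochs
warmup_steps = 50        # warmup_steps
peak_lr = 1e-3           # learning_rate

# Assumption: one device, and a partial accumulation window still counts as a step.
steps_per_epoch = math.ceil(num_examples / (per_device_batch * grad_accum))
total_steps = steps_per_epoch * epochs

# Assumption: linear warmup, so the rate scales with step / warmup_steps.
final_lr = peak_lr * min(1.0, total_steps / warmup_steps)

print(total_steps, final_lr)
```

If these assumptions hold, training stops after about 15 steps while the schedule is still warming up, so the learning rate only reaches roughly 30% of the configured `1e-3` and the cosine decay never starts. With a dataset this small, a much lower `warmup_steps` value may be worth trying.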

In [13]:
model_name = 'Qwen/Qwen3-0.6B'

tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype='auto',
    device_map='auto'
)

fine_tuned_model = PeftModel.from_pretrained(base_model,
                                             'fine-tune-model/',
                                             torch_dtype='auto',
                                             is_trainable=False)

In [14]:
prompt = 'Here is the business description:\nSpecialized online marketplace dedicated to showcasing and selling small Japanese board games'
messages = [
    {'role': 'user', 'content': prompt}
]

start_time = time.time()
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors='pt').to(fine_tuned_model.device)

generation_config = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.8, top_k=20, min_p=0)
# conduct text completion
generated_ids = fine_tuned_model.generate(
    **model_inputs,
    generation_config=generation_config,
    max_new_tokens=500
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

print(content)
print('Time taken:', time.time() - start_time)

1. boardgamehub.jp
2. gamesandgarden.net
3. japanesegamedev.co
4. boardgamepro.info
5. onlinegamedev.org
6. boardgameclub.biz
7. gamegamedesign.jp
8. onlinegamedesign.co
9. boardgamearena.info
10. gamesandgamedev.biz
Time taken: 5.0345635414123535


With the fine-tuned model and a simple prompt, I now get domain names that are relevant to the business description and use a variety of extensions. The names could still be more creative, but compared to the few-shot results I would say these are more creative.
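
The next cell scores generations with ROUGE, so it helps to see what the metric measures on this kind of text. The sketch below is a simplified re-implementation of the ROUGE-1 F-measure (unigram overlap, no stemming) for illustration only. It is not the `rouge_score` code, and the two short lists are made up.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F-measure between a reference and a prediction."""
    ref, pred = Counter(tokenize(reference)), Counter(tokenize(prediction))
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up lists: completely different names, same numbering and TLDs.
reference = "1. zenfitstudio.org\n2. yogabliss.net"
prediction = "1. calmyoga.org\n2. pilateshub.net"
print(rouge1_f(reference, prediction))
```

In this toy case the two lists share no names at all, yet the score is about 0.67, because the list numbers and TLDs count as matching tokens. ROUGE on this task therefore mostly reflects format and TLD overlap rather than name quality, which is worth keeping in mind when reading the results.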

In [17]:
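# Note: this rebinds the imported `rouge_scorer` module name to a scorer instance,
# so re-running this cell requires re-importing the module first.
rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)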


def generate_response(model, prompt: str) -> str:
    """
    Generate response from a model given a prompt.

    Args:
        model: The model to generate from
        prompt: Input prompt

    Returns:
        Generated response
    """
    messages = [{'role': 'user', 'content': prompt}]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False
    )

    model_inputs = tokenizer([text], return_tensors='pt').to(model.device)

    with torch.no_grad():
        generated_ids = model.generate(
            **model_inputs,
            generation_config=generation_config,
            max_new_tokens=500
        )

    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
    content = tokenizer.decode(output_ids, skip_special_tokens=True).strip('\n')

    return content


def calculate_rouge_scores(reference: str, prediction: str) -> Dict[str, float]:
    """
    Calculate ROUGE scores between reference and prediction.

    Args:
        reference: Reference text
        prediction: Predicted text

    Returns:
        Dictionary with ROUGE scores
    """
    scores = rouge_scorer.score(reference, prediction)

    return {
        'rouge1': scores['rouge1'].fmeasure,
        'rouge2': scores['rouge2'].fmeasure,
        'rougeL': scores['rougeL'].fmeasure
    }


def evaluate_models(dataset_path: str, max_samples: int = None) -> Dict[str, Any]:
    """
    Evaluate both models on the dataset.

    Args:
        dataset_path: Path to the JSON dataset
        max_samples: Maximum number of samples to evaluate (None for all)

    Returns:
        Dictionary with evaluation results
    """
    # Load dataset
    print(f"Loading dataset from {dataset_path}...")
    with open(dataset_path, 'r', encoding='utf-8') as f:
        dataset = json.load(f)

    if max_samples:
        dataset = dataset[:max_samples]

    print(f"Evaluating on {len(dataset)} samples...")

    base_scores = []
    fine_tuned_scores = []
    base_times = []
    fine_tuned_times = []

    for i, sample in enumerate(tqdm(dataset, desc="Evaluating")):
        prompt = sample['prompt']
        reference = sample['completion']

        # Generate with base model
        start_time = time.time()
        base_prediction = generate_response(base_model, prompt)
        base_time = time.time() - start_time
        base_times.append(base_time)

        # Generate with fine-tuned model
        start_time = time.time()
        fine_tuned_prediction = generate_response(fine_tuned_model, prompt)
        fine_tuned_time = time.time() - start_time
        fine_tuned_times.append(fine_tuned_time)

        # Calculate ROUGE scores
        base_rouge = calculate_rouge_scores(reference, base_prediction)
        fine_tuned_rouge = calculate_rouge_scores(reference, fine_tuned_prediction)

        base_scores.append(base_rouge)
        fine_tuned_scores.append(fine_tuned_rouge)

        # Print sample comparison every 10 samples
        if (i + 1) % 10 == 0:
            print(f"\nSample {i + 1}:")
            print(f"Prompt: {prompt[:100]}...")
            print(f"Reference: {reference[:100]}...")
            print(f"Base: {base_prediction[:100]}...")
            print(f"Fine-tuned: {fine_tuned_prediction[:100]}...")
            print(f"Base ROUGE: {base_rouge}")
            print(f"Fine-tuned ROUGE: {fine_tuned_rouge}")
            print("-" * 50)

    # Calculate average scores
    avg_base_scores = {
        'rouge1': np.mean([s['rouge1'] for s in base_scores]),
        'rouge2': np.mean([s['rouge2'] for s in base_scores]),
        'rougeL': np.mean([s['rougeL'] for s in base_scores])
    }

    avg_fine_tuned_scores = {
        'rouge1': np.mean([s['rouge1'] for s in fine_tuned_scores]),
        'rouge2': np.mean([s['rouge2'] for s in fine_tuned_scores]),
        'rougeL': np.mean([s['rougeL'] for s in fine_tuned_scores])
    }

    # Calculate average times
    avg_base_time = np.mean(base_times)
    avg_fine_tuned_time = np.mean(fine_tuned_times)

    results = {
        'base_model': {
            'scores': avg_base_scores,
            'avg_time': avg_base_time,
            'individual_scores': base_scores
        },
        'fine_tuned_model': {
            'scores': avg_fine_tuned_scores,
            'avg_time': avg_fine_tuned_time,
            'individual_scores': fine_tuned_scores
        },
        'improvement': {
            'rouge1': avg_fine_tuned_scores['rouge1'] - avg_base_scores['rouge1'],
            'rouge2': avg_fine_tuned_scores['rouge2'] - avg_base_scores['rouge2'],
            'rougeL': avg_fine_tuned_scores['rougeL'] - avg_base_scores['rougeL']
        }
    }

    return results


def print_results(results: Dict[str, Any]):
    """
    Print evaluation results in a formatted way.

    Args:
        results: Evaluation results dictionary
    """
    print("\n" + "=" * 60)
    print("ROUGE EVALUATION RESULTS")
    print("=" * 60)

    print(f"\nBase Model ({model_name}):")
    print(f"  ROUGE-1: {results['base_model']['scores']['rouge1']:.4f}")
    print(f"  ROUGE-2: {results['base_model']['scores']['rouge2']:.4f}")
    print(f"  ROUGE-L: {results['base_model']['scores']['rougeL']:.4f}")
    print(f"  Avg Time: {results['base_model']['avg_time']:.3f}s")

    print("\nFine-tuned Model:")
    print(f"  ROUGE-1: {results['fine_tuned_model']['scores']['rouge1']:.4f}")
    print(f"  ROUGE-2: {results['fine_tuned_model']['scores']['rouge2']:.4f}")
    print(f"  ROUGE-L: {results['fine_tuned_model']['scores']['rougeL']:.4f}")
    print(f"  Avg Time: {results['fine_tuned_model']['avg_time']:.3f}s")

    print("\nImprovement:")
    print(f"  ROUGE-1: {results['improvement']['rouge1']:+.4f}")
    print(f"  ROUGE-2: {results['improvement']['rouge2']:+.4f}")
    print(f"  ROUGE-L: {results['improvement']['rougeL']:+.4f}")

    print("=" * 60)


# Evaluate on testing dataset
dataset_path = 'data/testing_dataset.json'

results = evaluate_models(dataset_path)

print_results(results)

Loading dataset from data/testing_dataset.json...
Evaluating on 5 samples...


Evaluating: 100%|██████████| 5/5 [00:54<00:00, 10.90s/it]


ROUGE EVALUATION RESULTS

Base Model (Qwen/Qwen3-0.6B):
  ROUGE-1: 0.4733
  ROUGE-2: 0.0414
  ROUGE-L: 0.3733
  Avg Time: 5.477s

Fine-tuned Model:
  ROUGE-1: 0.4867
  ROUGE-2: 0.0345
  ROUGE-L: 0.3667
  Avg Time: 5.424s

Improvement:
  ROUGE-1: +0.0133
  ROUGE-2: -0.0069
  ROUGE-L: -0.0067





The ROUGE scores of the base and fine-tuned models are essentially the same: on the 5 test samples the differences are below 0.015 for every metric. That is not surprising, because ROUGE is not designed for this task. It measures n-gram overlap between the reference completion from the test set and the model's output, so the score rises only when the model reuses the same words as the reference. Domain name suggestion is a creative task, so a good suggestion can share few or no words with the reference and still be useful, which keeps ROUGE low.
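To make the overlap argument concrete, here is a minimal sketch of a ROUGE-1 style F-measure in plain Python. It is a simplified stand-in written for illustration, not the `rouge_score` implementation used above (no stemming, and the tokenizer is an assumption), and the two example strings are made up rather than taken from the test set.

```python
import re
from collections import Counter


def simple_rouge1_f(reference: str, prediction: str) -> float:
    """Simplified ROUGE-1 F-measure: unigram overlap between two strings."""
    # Lowercase and keep only runs of letters/digits,
    # so "brewhub.com" becomes ["brewhub", "com"].
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    if not ref_tokens or not pred_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical strings for illustration only.
reference = "brewhub.com, coffeecorner.io"
prediction = "beanstreet.com, dailygrind.io"

score = simple_rouge1_f(reference, prediction)
print(f"Simplified ROUGE-1 F: {score:.2f}")
```

In this made-up pair the names are completely different, yet the score is not zero, because the TLD tokens `com` and `io` match. Under this simplified tokenization, shared TLDs and formatting alone can produce a sizeable unigram score, which is one more reason to read the ROUGE numbers here with caution.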

## 4 - Analysis

Comparison of the prompt engineering results with the fine-tuned model's results:

| Factor                    | **Prompt Engineering / Few-Shot**                   | **Fine-Tuned Model**                                       |
|---------------------------|-----------------------------------------------------|------------------------------------------------------------|
| **Setup Cost**            | Low - just write prompts                            | Higher - requires dataset prep & training                  |
| **Scalability**           | More manual work, need to think of the best prompt  | You can provide more examples for training                 |
| **Creativity**            | Highly dependent on temperature and example quality | Learns patterns, moderate creativity without manual tuning |
| **Relevance**             | Consistent                                          | Consistent                                                 |
| **Diversity**             | Parameter-tuned prompting gave more diversity       | More TLD diversity by default                              |
| **Control**               | Easy to tweak by just changing prompt               | Harder to tweak, because needs retraining                  |
| **Inference Speed**       | \~5.5s                                              | \~5.4s                                                     |
| **Metric Scores (ROUGE)** | ROUGE-1 0.4733, ROUGE-2 0.0414, ROUGE-L 0.3733      | ROUGE-1 0.4867, ROUGE-2 0.0345, ROUGE-L 0.3667             |
| **Cost**                  | Free                                                | Free                                                       |

### Overall:

Fine-tuning brings a slight ROUGE-1 improvement (+0.0133), although ROUGE-2 and ROUGE-L are marginally lower, and it adds stylistic consistency (more diverse TLDs and more creative word combinations). It does not dramatically outperform prompting on raw metrics, which is common for creative tasks where "correctness" is subjective.
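With only 5 test samples, averages can hide how consistent the difference is. The sketch below summarises per-sample differences between the two models. In the notebook the inputs could be built from `results['base_model']['individual_scores']` and `results['fine_tuned_model']['individual_scores']`. The numbers used here are hypothetical placeholders, not the actual per-sample scores, and `paired_summary` is a helper name introduced only for this example.

```python
from statistics import mean
from typing import Dict, List


def paired_summary(base: List[float], tuned: List[float]) -> Dict[str, float]:
    """Summarise per-sample score differences (fine-tuned minus base)."""
    if len(base) != len(tuned):
        raise ValueError("both score lists must have the same length")

    deltas = [t - b for b, t in zip(base, tuned)]
    return {
        "mean_delta": mean(deltas),
        "min_delta": min(deltas),
        "max_delta": max(deltas),
        "wins": sum(d > 0 for d in deltas),    # samples where fine-tuned scored higher
        "losses": sum(d < 0 for d in deltas),  # samples where base scored higher
        "ties": sum(d == 0 for d in deltas),
    }


# Hypothetical per-sample ROUGE-1 values, for illustration only.
base_rouge1 = [0.50, 0.40, 0.45, 0.55, 0.40]
tuned_rouge1 = [0.45, 0.50, 0.45, 0.60, 0.40]

summary = paired_summary(base_rouge1, tuned_rouge1)
for key, value in summary.items():
    print(f"{key}: {value}")
```

If wins and losses are roughly balanced and the per-sample spread is much larger than the mean difference, the average gain should be treated as noise rather than as evidence that one model is better.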

Few-shot prompting with parameter tuning is surprisingly strong: with a curated few-shot set and tuned temperature/top-p, I get nearly the same creativity without the training overhead. However, because I chose a small model, providing even more examples in the prompt would make it start to hallucinate and output irrelevant information.

For a production environment, fine-tuning is the approach to pursue further, because it generates a wider variety of TLDs and, hopefully, more creative names.

### Suggestions for the future

User data on which domain names people prefer, or parsed data on which kinds of businesses chose which domain names, would make it easy to build new datasets and fine-tune the model further. This should yield more unique, creative, and relevant domain name suggestions.

In the future, domain names and TLDs could also be split into separate suggestions, so that each suggested name comes with a list of relevant TLDs. The Domain team could then check which suggested domain names are already taken. To deal with taken domains, the model could also produce many more than the current 10 suggestions: the more numerous and creative the suggestions, the more of them should be available.
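A rough sketch of that post-processing step might look like the following. The function names, the example suggestions, and the `taken` set are all hypothetical. A real availability check would query a registrar or a WHOIS/RDAP service instead of a hard-coded set.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def filter_available(domains: Iterable[str], taken: Set[str]) -> List[str]:
    """Drop suggestions that appear in the set of already taken domains."""
    taken_normalised = {d.strip().lower() for d in taken}
    return [d for d in domains if d.strip().lower() not in taken_normalised]


def group_by_name(domains: Iterable[str]) -> Dict[str, List[str]]:
    """Group full domains by name, collecting the suggested TLDs for each name."""
    grouped = defaultdict(list)
    for domain in domains:
        # Split at the first dot: "brewhub.com" -> ("brewhub", ".", "com").
        name, sep, tld = domain.strip().lower().partition(".")
        if not sep or not name or not tld:
            continue  # skip anything that is not of the form name.tld
        if tld not in grouped[name]:
            grouped[name].append(tld)
    return dict(grouped)


# Hypothetical model output and a hypothetical set of taken domains.
suggestions = ["brewhub.com", "brewhub.io", "beanstreet.coffee", "brewhub.io"]
taken = {"brewhub.com"}

available = filter_available(suggestions, taken)
grouped = group_by_name(available)
print(grouped)
```

Filtering before grouping means a name stays in the result as long as at least one of its suggested TLDs is still free, which matches the idea of generating more candidates than needed and keeping only the available ones.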