# NameForge: AI Domain Name Generator Homework

**Author:** Ferdinand Koenig
**Date:** Sep 2025

---

## Introduction

This notebook documents the AI Engineer homework assignment for building a domain name generator.
Objectives:

- Generate synthetic dataset of business descriptions → domain names
- Build a baseline domain generator (mock / open-source LLM)
- Implement evaluation framework (LLM-as-a-judge / safety checks)
- Analyze edge cases and iteratively improve
- Ensure safety guardrails for inappropriate content


In [1]:
import pandas as pd

## 1️⃣ Step: Synthetic Dataset Creation

### 1.1 Methodology

**Objective:**
Generate a synthetic dataset of business descriptions mapped to domain names, with diversity in business types, complexity levels, and edge cases, while ensuring safety and reproducibility.

**Steps Taken:**

1. **Vocabulary Selection**
   - **Business types:** cafe, restaurant, tech startup, online store, boutique, law firm, travel agency, bookstore, etc.
   - **Adjectives / descriptors:** organic, eco-friendly, bright, cozy, modern, smart, fresh, premium, global, innovative
   - **Nouns / themes:** hub, shop, store, lab, studio, solutions, works, spot, corner
   - **TLDs:** .com, .net, .org, .io, .co
   - Vocabulary is stored externally in `src/vocab.py` for maintainability and easy updates.

2. **Complexity Levels**
   - **Simple:** Short, straightforward descriptions → short domains (e.g., “Organic cafe”)
   - **Medium:** Include location or moderate complexity (e.g., “Organic cafe in downtown area”)
   - **Complex:** Long or multi-purpose descriptions (e.g., “Premium organic cafe in downtown area offering community events for busy professionals”)

3. **Edge Cases**
   - ~5% of entries include unusual or extreme cases:
     - Extremely long business descriptions
     - Very short or ambiguous descriptions
     - Uncommon characters (e.g., symbols @$%^)
   - These cases test model robustness and evaluation coverage.

4. **Safety Guardrails**
   - Forbidden words: `adult`, `nude`, `porn`, `illegal`
   - Any generated domain containing forbidden words is replaced with `blocked.com`

5. **Dataset Generation Process**
   - Randomly combine adjectives, nouns, and business types according to complexity
   - Assign complexity distribution: 40% simple, 40% medium, 20% complex
   - Randomly insert edge cases
   - Save datasets as CSV with fields: `business_description`, `domain_name`, `complexity`

6. **Train/Test Split**
   - Train dataset: 2,700 entries
   - Test dataset: 100 entries
   - Stored in `data/raw/` as `train_dataset.csv` and `test_dataset.csv`
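
The steps above can be sketched roughly as follows. This is a minimal illustration and not the contents of `src/data_utils.py`: the vocabulary lists are abbreviated copies of the ones named in step 1, while the function names (`safe_domain`, `make_entry`, `generate_dataset`), the fixed seed, the location and purpose phrases, and the edge-case strings are assumptions made for the example.

```python
import random

# Abbreviated vocabulary copied from step 1; the real lists live in src/vocab.py.
BUSINESS_TYPES = ["cafe", "restaurant", "bookstore", "law firm", "travel agency"]
ADJECTIVES = ["organic", "cozy", "modern", "smart", "fresh", "premium"]
NOUNS = ["hub", "shop", "lab", "studio", "works", "corner"]
TLDS = [".com", ".net", ".org", ".io", ".co"]
FORBIDDEN = {"adult", "nude", "porn", "illegal"}

# Hypothetical edge cases: very short, symbol-heavy, and extremely long.
EDGE_CASES = ["x", "cafe @$%^", "premium organic cafe " * 20]


def safe_domain(domain: str) -> str:
    """Replace any domain containing a forbidden word with blocked.com."""
    return "blocked.com" if any(word in domain for word in FORBIDDEN) else domain


def make_entry(rng: random.Random) -> dict:
    """Create one description/domain pair with a 40/40/20 complexity split."""
    complexity = rng.choices(["simple", "medium", "complex"], weights=[40, 40, 20])[0]
    adjective, business = rng.choice(ADJECTIVES), rng.choice(BUSINESS_TYPES)
    description = f"{adjective} {business}"
    if complexity in ("medium", "complex"):
        description += " in downtown area"
    if complexity == "complex":
        description += " offering community events for busy professionals"
    if rng.random() < 0.05:  # roughly 5% of entries become edge cases
        description = rng.choice(EDGE_CASES)
    domain = adjective + business.replace(" ", "") + rng.choice(NOUNS) + rng.choice(TLDS)
    return {
        "business_description": description,
        "domain_name": safe_domain(domain),
        "complexity": complexity,
    }


def generate_dataset(size: int, seed: int = 42) -> list[dict]:
    """Generate `size` entries; the same seed always yields the same rows."""
    rng = random.Random(seed)
    return [make_entry(rng) for _ in range(size)]
```

Passing a dedicated `random.Random(seed)` instance around, instead of relying on the module-level generator, is one way to keep the train and test splits reproducible independently of each other.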

### 1.2 Practical Considerations

- **Reflecting client needs:**
  - The synthetic dataset is designed to mimic the examples provided in the homework task, ensuring generated domain names are relevant to realistic business descriptions.

- **Resource efficiency:**
  - No fine-tuned LLM is used at this stage due to GPU requirements.
  - No external API calls (OpenAI, Claude, etc.) are used because free accounts may not have access or sufficient credits.

- **Reproducibility:**
  - Dataset generation relies on deterministic Python code with controlled randomness (`random` module).
  - Vocabulary is externalized for maintainability (`src/vocab.py`).
  - The process can be fully run on a standard laptop without specialized hardware.

- **Edge cases and safety:**
  - ~5% of entries are extreme or unusual to test model robustness.
  - Safety guardrails ensure forbidden words are blocked.



In [1]:
from src.data_utils import generate_train_test

generate_train_test(train_size=2_700, test_size=100)

Dataset generated at data/raw/train_dataset.csv (2700 entries)
Dataset generated at data/raw/test_dataset.csv (100 entries)


# LoRA Fine-Tuning of Mistral-7B-Instruct (4-bit)

## 1. Overview

We performed **parameter-efficient fine-tuning** of a large language model (Mistral-7B-Instruct) using **LoRA (Low-Rank Adaptation)** in combination with **4-bit quantization** and Hugging Face’s Trainer.

- **Dataset:** CSV of business descriptions → domain names
- **Prompting:** YAML template for structured input
- **Target:** Fine-tune the model for domain-specific naming suggestions

---

## 2. Why LoRA?

- Traditional full fine-tuning of a 7B-parameter model is **compute- and memory-intensive**.
- LoRA introduces **small trainable adapters** into existing weights:
  - Adds low-rank matrices `A` and `B` to frozen weights:
    \[
    W' = W + BA
    \]
  - `W` is frozen (pre-trained weight)
  - `BA` is trainable, tiny (~0.1% of total params)
- **Advantages:**
  - **Memory-efficient**: Only a few million parameters trainable instead of billions.
  - **Fast training**: Lightweight updates.
  - **Task-effective**: Adapts attention patterns without altering most of the model.

---

## 3. Why 4-bit Quantization?

- Reduces VRAM usage for a 7B-parameter model from ~28 GB (FP32 weights; ~14 GB in FP16) to 4–5 GB in practice for inference/training.
- Makes fine-tuning feasible on a single GPU (48 GB in our case) while keeping speed reasonable.
- Works well with LoRA because **most parameters remain frozen**, so low-precision weights are sufficient.
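
The memory figures can be checked with back-of-the-envelope arithmetic. The sketch below counts only the bytes needed to store the weights, using the 7.25B parameter count quoted later in this notebook; activations, quantization constants, LoRA adapters, and optimizer state are ignored, which is why observed usage (4–5 GB) sits somewhat above the raw 4-bit number. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7.25e9  # total parameter count quoted in this notebook

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

This gives roughly 29 GB for FP32, 14.5 GB for 16-bit, and about 3.6 GB for 4-bit weights.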

---

## 4. GPU Capacity and Utilization

- **GPU:** 48 GB VRAM
- **Current usage:** 4–5 GB VRAM, ~40% compute utilization
- **Why low?**
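  - LoRA trains only 6.8M parameters out of 7.25B, so weight gradients and optimizer state are needed only for the adapters.
  - 4-bit weights + gradient checkpointing lower memory use rather than compute; checkpointing in fact recomputes activations during the backward pass.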
  - Small batch size / sequence length also limit utilization.
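
The 6.8M figure can be reproduced from the model dimensions. The sketch below uses the published Mistral-7B layer sizes (hidden size 4096, 32 layers, grouped-query attention with 8 key/value heads of dimension 128). The LoRA rank is not stated in this notebook; `RANK = 16` is an assumption, chosen because it reproduces the quoted count exactly.

```python
# Published Mistral-7B dimensions (grouped-query attention: 32 query heads, 8 KV heads).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
Q_OUT = 4096  # q_proj maps 4096 -> 4096
V_OUT = 1024  # v_proj maps 4096 -> 8 KV heads * 128 head dim
RANK = 16     # ASSUMPTION: the rank is not stated in the notebook
TOTAL_PARAMS = 7.25e9  # total parameter count quoted above


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Adapter size for one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = lora_params(HIDDEN_SIZE, Q_OUT, RANK) + lora_params(HIDDEN_SIZE, V_OUT, RANK)
total = per_layer * NUM_LAYERS

print(f"trainable LoRA parameters: {total:,}")
print(f"share of all parameters:   {total / TOTAL_PARAMS:.3%}")
```

Under these assumptions the adapters hold 6,815,744 parameters, a little under 0.1% of the model.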

---

## 5. Strategies to Increase GPU Utilization

1. **Increase batch size**
   - Current: 8 × 4 = 32 effective batch
   - Can go higher if VRAM allows → fills GPU cores better.

2. **Increase sequence length**
   - Current: `MAX_LENGTH = 512`
   - Longer sequences increase compute per batch → higher utilization.

3. **Train more LoRA layers**
   - Currently only `q_proj` and `v_proj` in attention.
   - Including MLP layers or `k_proj`/`o_proj` increases trainable parameters → more GPU compute.

4. **Optimize data loading**
   - `dataloader_num_workers` = 32
   - Pin memory = True
   - Ensures GPU is not idle waiting for CPU.

5. **Multi-GPU (optional)**
   - Training across multiple GPUs raises total throughput; since the 4-bit model already fits on one card, this mainly helps through data parallelism.

---

## 6. Why Q and V Matrices in Attention?

- **Attention mechanism:**
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V
\]

- **Q (Query)** → determines *which tokens to focus on*
- **V (Value)** → determines *what information is propagated*

**Choosing Q+V for LoRA:**

- High leverage: small adapters in these matrices produce **significant task-specific adaptation**.
- Minimal parameters needed → efficient training.
- K and O, or MLP layers, can be adapted later for more aggressive fine-tuning, but Q+V gives most effect per parameter.
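
To make the roles of Q and V concrete, the attention formula above can be traced on a toy example. This is a dependency-free sketch for illustration only (real implementations use batched tensor operations); the matrices are made-up values chosen so that each query clearly matches one key.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d = len(K[0])
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]  # which tokens each query focuses on
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]  # weighted mix of the value vectors
    return output, weights


Q = [[10.0, 0.0], [0.0, 10.0]]  # toy queries
K = [[1.0, 0.0], [0.0, 1.0]]    # toy keys
V = [[1.0, 2.0], [3.0, 4.0]]    # toy values

output, weights = attention(Q, K, V)
```

Here the first query puts almost all of its weight on the first key, so the first output row is close to the first value vector `[1.0, 2.0]`. Changing Q shifts where the weight goes; changing V shifts what is returned once the weight is assigned.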

---

## 7. How LoRA Works (Simplified)

1. Identify target matrices in the model (Q and V projections).
2. Freeze original weights `W`.
3. Add **low-rank matrices** `A` and `B` such that `ΔW = BA`.
4. Train `BA` only, keeping all other parameters frozen.
5. During forward pass:
   - `W' = W + BA` is used in attention computations
   - Only `BA` gradients are computed → lightweight backward pass

**Effect:** The model “learns” new behavior efficiently, without touching billions of frozen parameters.
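
The forward pass in step 5 can be illustrated with plain Python lists. This is a toy sketch, not how PEFT implements it: the matrices are tiny made-up values, and the `scale` argument is an addition that is not part of the formula above (LoRA implementations commonly multiply the update by `alpha / r`). The adapter path is computed as `B(Ax)`, so the full matrix `W + BA` never has to be materialized.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute (W + scale * B A) x as W x + scale * B (A x)."""
    base = matvec(W, x)              # frozen pre-trained path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    return [b + scale * d for b, d in zip(base, delta)]


W = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]  # frozen 3x3 weight (toy values)
A = [[1, 0, -1]]                       # rank-1 down-projection, shape (1 x 3)
B_init = [[0], [0], [0]]               # B starts at zero, so BA = 0 initially
B_trained = [[0.5], [0], [1]]          # hypothetical values after training
x = [1, 2, 3]

before = lora_forward(W, A, B_init, x)
after = lora_forward(W, A, B_trained, x)
```

With `B` initialized to zero the adapted layer reproduces the frozen layer exactly (`before` equals `W x`), so fine-tuning starts from the pre-trained behavior and only drifts as `A` and `B` are updated.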

---

## 8. Summary

| Feature                  | Choice / Value                        | Reason                                                                 |
|---------------------------|--------------------------------------|------------------------------------------------------------------------|
| Model                     | Mistral-7B-Instruct-v0.3             | Large LLM, strong instruction-following capabilities                 |
| Fine-tuning method        | LoRA                                  | Parameter-efficient, memory-friendly                                   |
| Quantization              | 4-bit NF4                             | Reduce VRAM, maintain speed                                           |
| Target LoRA layers        | `q_proj` + `v_proj`                   | Max effect on attention with minimal parameters                        |
| Batch size                | 8 × 4 = 32 effective                  | Fits GPU memory, can be increased for better utilization               |
| Gradient checkpointing    | Enabled                               | Saves memory during training                                           |
| GPU utilization           | ~40%, 4–5GB / 48GB                    | Expected for tiny trainable adapter on frozen 7B model                 |
| Improvements to utilization | longer sequences, more LoRA layers, larger batches | More GPU compute, still memory safe                                  |
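
The choices in the table map onto Hugging Face configuration objects roughly as follows. This is a configuration sketch only, not the training script used here. Several values are assumptions that the notebook does not state: the LoRA rank, `lora_alpha`, `lora_dropout`, the compute dtype, the output directory, and the reading of "8 × 4" as per-device batch size 8 with 4 gradient-accumulation steps.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization of the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # ASSUMPTION: compute dtype not stated
)

# LoRA adapters on the attention query and value projections only.
lora_config = LoraConfig(
    r=16,               # ASSUMPTION: rank not stated in the notebook
    lora_alpha=32,      # ASSUMPTION
    lora_dropout=0.05,  # ASSUMPTION
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Trainer settings matching the batch size and data-loading choices above.
training_args = TrainingArguments(
    output_dir="outputs/lora-mistral",  # hypothetical path
    per_device_train_batch_size=8,      # ASSUMPTION: 8 x 4 = 32 effective batch
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    dataloader_num_workers=32,
    dataloader_pin_memory=True,
)
```

Adding `"k_proj"`, `"o_proj"`, or MLP projection names to `target_modules` would be the place to try the "more LoRA layers" improvement listed in the table.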

---

**Takeaway:**

- **LoRA + 4-bit quantization** allows fine-tuning **large models efficiently**.
- Targeting **attention Q+V matrices** gives maximal effect for minimal compute.
- GPU underutilization is normal due to **tiny trainable portion**; utilization can be increased with **larger sequences, batch size, or additional adapters**.



> **TODO** add the things from README

Result:

  You are a domain name generator AI. You generate **safe, creative, and memorable domain names** based on a business description.
  - Do not generate inappropriate content.
  - Use only alphanumeric characters and standard TLDs (.com, .net, .org, etc.).
  - Provide 2–3 suggestions per description.
  - If the description is unsafe, output exactly: "__BLOCKED__".

  --- EXAMPLE START ---
  Business Description:
  "organic coffee shop in downtown area"

  Output:
  ["organicbeanscafe.com", "downtowncoffee.org", "freshbreworganic.net"]
  --- EXAMPLE END ---

  --- TASK START ---
  Business Description:
  A local farmer offering fresh eggs

  Output only a JSON array of 2–3 domain names: ["freshstorelab.net", "freshlablab.com", "freshlabshop.com"]



  You are a domain name generator AI. You generate **safe, creative, and memorable domain names** based on a business description.
  - Do not generate inappropriate content.
  - Use only alphanumeric characters and standard TLDs (.com, .net, .org, etc.).
  - Provide 2–3 suggestions per description.
  - If the description is unsafe, output exactly: "__BLOCKED__".

  --- EXAMPLE START ---
  Business Description:
  "organic coffee shop in downtown area"

  Output:
  ["organicbeanscafe.com", "downtowncoffee.org", "freshbreworganic.net"]
  --- EXAMPLE END ---

  --- TASK START ---
  Business Description:
  adult website

  Output only a JSON array of 2–3 domain names: ["adultstorebeans.org", "adultstorecorner.io", "__BLOCKED__"]

  Business Description:
  online store in neighborhood

  Output only a JSON array of 2–3 domain names:
  ["onlestorecorner.co", "onlestoreworks.co", "onlestoreworks.io"]

  Business Description:
  fitness studio

  Output only a JSON array of 2–3 domain names:
  ["fitnessstorehub.net", "__BLOCKED__", "__BLOCKED__"]

  Business Description:
  travel agency

  Output only a JSON array of 2–3 domain names:
  ["travelstorehub.co", "__BLOCKED__", "__BLOCKED__"]

  --- TASK END ---

Lessons:
- The synthetic data generator is too repetitive.
- `__BLOCKED__` should be repeated (one entry per suggestion slot) so that blocked outputs keep the JSON array format.
- The training data needs more variability.
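
Because the model mixes real suggestions with `__BLOCKED__` markers, the raw reply needs post-processing before it can be evaluated. The sketch below shows one possible policy; it is an assumption, not part of the current pipeline. It extracts the first JSON array, keeps only well-formed domains without forbidden words, and reports the reply as blocked only when the marker is present and no valid domain remains. The function name `parse_suggestions` and the domain regex are hypothetical.

```python
import json
import re

BLOCKED = "__BLOCKED__"
FORBIDDEN = {"adult", "nude", "porn", "illegal"}
# Alphanumeric label plus an alphabetic TLD, e.g. "freshstorelab.net".
DOMAIN_RE = re.compile(r"^[a-z0-9]+\.[a-z]{2,}$")


def parse_suggestions(raw: str):
    """Return (valid_domains, is_blocked) for one model reply."""
    match = re.search(r"\[.*?\]", raw, flags=re.DOTALL)
    if match is None:
        # No array at all: treat a bare marker as a blocked reply.
        return [], BLOCKED in raw
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return [], False
    saw_marker = BLOCKED in items
    domains = [
        item
        for item in items
        if isinstance(item, str)
        and DOMAIN_RE.match(item)
        and not any(word in item for word in FORBIDDEN)
    ]
    return domains, saw_marker and not domains
```

Applied to the replies shown above, the "fitness studio" output keeps its single valid domain, while the "adult website" output is reduced to an empty list and flagged as blocked, since both suggested domains contain a forbidden word.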

Prompt for ChatGPT to produce a dataset of size 300 (50 examples per response):


You are a data generation assistant. Your task is to produce **realistic, human-like business descriptions** along with **3 plausible domain names** and a **complexity level**. The output must be in **CSV format** with the following columns:

business_description, domain_names, complexity

Follow these rules:

1. **business_description**:
   - Should be human-like, natural-sounding.
   - Include adjectives, business types, locations, and optional purposes.
   - Include a mix of simple, medium, and complex descriptions:
       - simple: 1–2 words + business type, e.g., "fresh bakery".
       - medium: add location or modifier, e.g., "cozy coffee shop in downtown".
       - complex: longer descriptions with adjectives, nouns, business types, locations, purposes, e.g., "vibrant educational platform for creative learning in city center for busy professionals".
   - Avoid unsafe content (adult, drugs, piracy, violence, gambling).

2. **domain_names**:
   - Must be a JSON array of 3 plausible domain names.
   - Follow grammar patterns: [adjective][noun][suffix][tld] or [adjective][business_type][suffix][tld].
   - Use realistic top-level domains: .com, .net, .org, .io, .co.
   - Domain names must be alphanumeric; remove spaces and special characters.
   - Each description should have **different, meaningful domains**.
   - Use **rare adjectives or nouns occasionally** for variety.
   - Include generic business types occasionally (e.g., "platform", "store", "service") to increase diversity.

3. **complexity**:
   - Must be one of: simple, medium, complex.
   - Reflect the complexity of the business description.

4. **Quantity**:
   - Generate exactly 50 examples per response.
   - Ensure diversity; no duplicates of business descriptions or domains.

5. **Output format**:
   - CSV, comma-separated.
   - Example row:
     "cozy coffee shop in downtown offering organic blends","['cozycoffeeblend.com','downtowncoffeecorner.net','organiccoffeestudio.org']",medium

---

**Instructions for you**:

- Start generating realistic human-like data immediately.
- Include variety in adjectives, nouns, business types, locations, and purposes.
- Ensure all domains are plausible and match the description.
- Make the output ready to append to an existing CSV dataset.
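
Note that the example row in the prompt writes the domain list with single quotes, which is a Python-style list and not valid JSON. When loading the generated CSV, a tolerant parser helps. The sketch below is one way to do this with the standard library; the function names `parse_domain_list` and `read_rows` are hypothetical, and it assumes the three-column layout described in the prompt.

```python
import ast
import csv
import io
import json


def parse_domain_list(cell: str) -> list[str]:
    """Parse the domain_names cell; accept JSON or a Python-style list."""
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        value = ast.literal_eval(cell)  # handles single-quoted lists
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"not a list of strings: {cell!r}")
    return value


def read_rows(csv_text: str) -> list[dict]:
    """Read business_description, domain_names, complexity rows from CSV text."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or row[0].strip() == "business_description":
            continue  # skip blank lines and an optional header row
        description, domains, complexity = row
        rows.append(
            {
                "business_description": description,
                "domain_names": parse_domain_list(domains),
                "complexity": complexity.strip(),
            }
        )
    return rows


EXAMPLE = (
    '"cozy coffee shop in downtown offering organic blends",'
    "\"['cozycoffeeblend.com','downtowncoffeecorner.net','organiccoffeestudio.org']\","
    "medium\n"
)
rows = read_rows(EXAMPLE)
```

Rows that fail to parse raise an error instead of being silently dropped, which makes malformed responses easy to spot before they are appended to the dataset.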
