# Project Overview

> Before running the cells, make sure to create a virtual environment and add an API key. 

The aim is to develop and compare two approaches for generating domain name suggestions from business descriptions using Large Language Models (LLMs): **Prompt Engineering** and **Fine-Tuning**.

In [None]:
!pip install openai pandas pydantic numpy

In [None]:
!pip install llama-cpp-python

In [None]:
!pip install uvicorn

In [126]:
# `!export` only affects a throwaway subshell; %env sets the variable for this kernel
%env OPENAI_API_KEY=[KEY HERE]

In [None]:
# if you want to check if it is there or not
!echo $OPENAI_API_KEY

In [71]:
from openai import OpenAI
import pandas as pd
from pydantic import BaseModel
from typing import List
import numpy as np
import math

In [72]:
NUM_OF_SUGGESTIONS = 5

## 1. Prompt Engineering Approaches

### 1.1. Prompt A: Baseline (Zero-Shot)

In [73]:
prompt_A = f"""
Suggest {NUM_OF_SUGGESTIONS} domain names for the business described by the user. Every domain should include ONLY latin characters.
"""

### 1.2. Prompt B: Constrained & Brand-Focused (Persona, Few-Shot)

In [74]:
prompt_B = f"""
You are a world-class Brand Strategist and Naming Specialist (think Lexicon Branding or Pentagram).
Your task is to generate {NUM_OF_SUGGESTIONS} high-potential domain name ideas for a business.

### BRAND STRATEGY LOGIC:
1. **The Radio Test:** If heard once, the spelling must be intuitive. No intentional misspellings (e.g., no 'Kleen' for 'Clean').
2. **Outcome-Centric:** Prioritize the *feeling* or *benefit* (e.g., 'Swift' vs. 'FastDelivery').
3. **Phonosemantics:** Use hard consonants (k, t, p) for tech/efficiency; soft vowels (o, a, l) for wellness/luxury.
4. **Syllable Economy:** Maximum 3 syllables. 1-2 is the "Gold Standard."
5. **Alphabet Check:** If there are non-latin characters, convert them to latin. Suggestions should STRICTLY include ONLY latin characters. 

### NAMING ARCHETYPES (Provide a mix):
- **Evocative:** Uses a real word that captures a vibe (e.g., 'Patagonia', 'Slack').
- **Compound:** Two short words joined (e.g., 'DoorDash', 'YouTube').
- **Abstract/Blended:** Unique, brandable sounds or prefixes (e.g., 'Zillow', 'Vanta').
- **Oxymoronic:** If the business has conflicting goals (e.g., 'CheapLuxury'), create a name that bridges the gap (e.g., 'GrandLite').

### EXAMPLES OF HIGH QUALITY DOMAINS:

1. Business: AI-driven logistics platform that makes shipping invisible and effortless.
   Strategy: Focus on the "Outcome" (Benefit over Feature).
   Suggestions: ['EasyCargo', 'VanishShipping', 'PackageArrived']

2. Business: High-end organic skincare that uses ancient volcanic minerals.
   Strategy: Use "Phonosemantics" (Soft vowels for luxury, hard roots for minerals).
   Suggestions: ['MineralSkin', 'Vitre', 'AshLuxe', 'RelicCare']

3. Business: A budget airline that feels like a private club.
   Strategy: "Oxymoronic Branding" (Bridging high-end vibes with low-cost reality).
   Suggestions: ['WingPrive', 'Goldjet', 'ApexAir']

4. Business: Professional-grade coding tools for children/beginners.
   Strategy: "The Radio Test" (Short, punchy, easy to spell).
   Suggestions: ['Koda', 'Codio', 'FableCode']

5. Business: A neighborhood bakery in Brooklyn using traditional Polish recipes.
   Strategy: "Evocative/Local" (Hinting at heritage without being a literal map).
   Suggestions: ['CrustPL', 'Cracow', 'BabkaBakery']
"""

### 1.3. Prompt C: Structured (Chain-of-Thought)

In [75]:
prompt_C = f"""
You are a world-class Brand Strategist and Naming Specialist with expertise in linguistics, phonosemantics, and brand architecture.

## OBJECTIVE
Generate {NUM_OF_SUGGESTIONS} domain name ideas that are highly strategic, linguistically optimized, and fully compliant with the constraints provided by the user.
---

## CORE BRAND STRATEGY PRINCIPLES

1. **Radio Test Compliance:** Names must be instantly spellable when heard aloud. No ambiguous spellings.  
2. **Outcome-Centric Positioning:** Names should evoke the emotional transformation or benefit, not the functional mechanism.  
3. **Phonosemantic Intentionality:**  
   - Hard consonants (K, T, P, D) → speed, precision, authority  
   - Soft consonants (L, M, F, V) → trust, luxury, flow  
   - Open vowels (A, O) → warmth, approachability  
4. **Syllable Economy:** Prefer 2 syllables; maximum 3.  
5. **Constraint Absolutism:** No exceptions—rules must be strictly followed.  
6. **Structural Sophistication:** Avoid "Generic Compound" names (e.g., CarInsurance.com). 
    If using a compound, ensure it creates a new "concept" (e.g., PayPal, DoorDash).
7. **Non-Descriptive Rule:** Names should convey *vibe*, not function.  
   - Example: Use “Bolt” instead of “FastDrop.”  
8. **Anti-Generic Filter:** Avoid feature-like names. If it sounds like a product description, it fails.  
9. **Morpheme Variety:** Prefer subtle Latinate or Greek roots to create “new-real” words (e.g., Aura + ly → Auraly).  
10. **Verb Potential (“Bar Test”):** Could users naturally say “I’ll [Brand Name] it”? Names that are too long or clunky fail.
---

## TWO-STAGE NAMING PROCESS

### STAGE 1 — STRATEGIC DEEP DIVE
1.1 Semantic Deconstruction: Identify core emotional outcomes and 3-5 brand adjectives.
1.2 Ideation: Generate a "Long List" of 15-20 raw ideas across different archetypes.
1.3 Validation: Select the best candidates and run them through this checklist:
- Name: [Candidate]
- Syllable Count: [X] 
- Radio Test: [Pass/Fail]
- Phonosemantic Alignment: [Brief Logic]
- Verdict: [Keep/Discard]

### STAGE 2 — FINAL DELIVERABLES
Present the top {NUM_OF_SUGGESTIONS} names.
**[Name]**
- **Logic:** [Why it works phonetically and strategically]

---

## NAMING ARCHETYPE GUIDE

**Evocative Names:** Emotional resonance, analogies, or metaphorical associations.  
*Examples:* Stripe (movement), Asana (flow), Notion (idea spark)  

**Compound Names:** Fuse two concrete concepts for clarity.  
*Examples:* PayPal (payment + friend), Snapchat (instant + conversation)  

**Abstract/Invented Names:** Fully original linguistic constructs.  
*Examples:* Figma, Canva, Airtable  

**Oxymoronic/Paradox Names:** Combine opposites to create distinctive positioning.  
*Examples:* LegalZoom (law + speed), QuickBooks (accounting + ease)

---

**INSTRUCTION:** Begin Stage 1 analysis with full chain-of-thought reasoning. Then proceed to Stage 2 for final name recommendations.
"""


## 2. Evaluation Methodology

### 2.1. Test Cases

To start experimenting, let's create some categories for business descriptions. For the sake of keeping the experiments manageable, **6 categories** were created:

1. **Clear/Specific:** Straightforward, focused, and simple business descriptions.

2. **Abstract/Creative:** Descriptions that are metaphorical or use "catchy" imagery.

3. **Niche/Technical:** Descriptions using domain-specific terminology or niche concepts.

4. **Local:** Descriptions with specific geographic or cultural anchors.

5. **Mixed Language:** Descriptions primarily in English but containing non-English terms or names.

6. **Ambiguous:** Descriptions that are too messy (multiple industries), too thin (no details), or internally contradictory (e.g., "High-end Budget Fashion").

>Creating categories ensures coverage across different types of user input. These categories were used solely to ensure diversity in the evaluation set, not as separate evaluation dimensions.

---

Four descriptions were created for each category, for a total of 24 unique business descriptions. Each category contains:
* 1 ChatGPT generated (GPT-5)
* 1 Claude generated (Sonnet 4.5)
* 1 Gemini generated (Gemini 3)
* 1 Human input

In [76]:
df_businesses = pd.read_csv("descriptions.csv", delimiter=";")
pd.set_option('max_colwidth', 400)
df_businesses

Unnamed: 0,category,business_description
0,Clear/Specific,A mobile app development company building iOS and Android apps for small businesses.
1,Abstract/Creative,"A bookstore that breathes magic into pages, where every corner whispers untold stories."
2,Niche/Technical,A biotech firm specializing in CRISPR gene-editing therapies for rare hereditary diseases.
3,Local,A family-run pizzeria in Naples using century-old recipes and locally milled flour.
4,Mixed Language,A fashion boutique offering haute couture dresses con detalles inspirados en la cultura mexicana.
5,Ambiguous,"A fitness studio-café-co-working space that offers pilates classes, smoothies, and freelance workstations."
6,Clear/Specific,Family-owned bakery specializing in fresh sourdough bread and pastries made daily from organic ingredients
7,Abstract/Creative,Where threads of imagination weave tomorrow's stories into tangible experiences
8,Niche/Technical,B2B SaaS platform providing real-time ETL pipelines for enterprise data warehousing with Kubernetes orchestration
9,Local,Traditional izakaya serving Hokkaido-style grilled fish near Shibuya Crossing in Tokyo


In [77]:
business_descriptions = (
    df_businesses
    .assign(category=df_businesses["category"].str.lower())
    .groupby("category")["business_description"]
    .apply(list)
    .to_dict()
)

> For each business description, 5 domain suggestions will be generated, for a total of **120 suggestions** per prompt.
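
The expected volume can be checked before spending any API calls. The sketch below assumes a dictionary shaped like `business_descriptions` (category mapped to a list of descriptions); the helper name `expected_suggestion_count` and the toy data are illustrative and not part of the notebook.

```python
NUM_OF_SUGGESTIONS = 5  # same value as the constant defined earlier in the notebook


def expected_suggestion_count(descriptions_by_category, per_description=NUM_OF_SUGGESTIONS):
    """Return how many suggestions one prompt should yield over the whole test set."""
    total_descriptions = sum(len(items) for items in descriptions_by_category.values())
    return total_descriptions * per_description


# Toy stand-in for the real test set: 6 categories with 4 placeholder descriptions each
toy_descriptions = {
    category: [f"{category} description {i}" for i in range(4)]
    for category in ["clear/specific", "abstract/creative", "niche/technical",
                     "local", "mixed language", "ambiguous"]
}

print(expected_suggestion_count(toy_descriptions))  # 6 * 4 * 5 = 120
```

Comparing this number with the length of each generated DataFrame is a quick way to spot requests that returned fewer names than requested.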

### 2.2. Evaluation Design

Evaluation was performed by a single human rater using predefined criteria to ensure consistency. The setup consists of:

- 24 business descriptions
- 5 domains per prompt per business
- 2 metrics: Relevance (1–5), Brandability (1–5)
- Single evaluator

---

**Metric 1: Relevance (1–5)**
>How well does this domain fit the business description?

Scale:
* 1 = unrelated
* 3 = somewhat relevant
* 5 = very well aligned

**Metric 2: Brandability (1–5)**
>Does this sound like a brand someone would actually use?

Scale:
* 1 = awkward / generic / spammy
* 3 = acceptable
* 5 = catchy, clean, memorable
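
Because the scores are entered by hand, a small check for out-of-range or missing values may be useful before aggregating. This is a minimal sketch under the assumption that scores are whole numbers from 1 to 5; `invalid_scores` and the sample rows are hypothetical and not part of the notebook.

```python
def invalid_scores(rows, metrics=("relevance", "brandability"), low=1, high=5):
    """Return (row_index, metric, value) for every score that is not an integer in [low, high]."""
    problems = []
    for index, row in enumerate(rows):
        for metric in metrics:
            value = row.get(metric)
            # bool is a subclass of int, so exclude it explicitly
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not low <= value <= high:
                problems.append((index, metric, value))
    return problems


# Hypothetical rated rows; the second one has an out-of-range and a missing score
sample = [
    {"domain": "ExampleOne", "relevance": 4, "brandability": 5},
    {"domain": "ExampleTwo", "relevance": 6, "brandability": None},
]
print(invalid_scores(sample))  # [(1, 'relevance', 6), (1, 'brandability', None)]
```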

## 3. Prompt Experiments

In [83]:
# a structure the LLM will use to reply with
class DomainName(BaseModel):
    name: str
    logic: str # why is this domain suggested?
    
class DomainSuggestions(BaseModel):
    domains: List[DomainName]

In [84]:
client = OpenAI()

In [85]:
def query(prompt, userinput):
    response = client.responses.parse(
        model="gpt-4o-mini",
        instructions=prompt,
        input=userinput,
        text_format=DomainSuggestions,
    )
    # find speed using internal timestamps
    api_speed_sec = response.completed_at - response.created_at
    input_tokens = response.usage.input_tokens
    output_tokens = response.usage.output_tokens
    suggestions_list = response.output_parsed.domains
        
    return suggestions_list, input_tokens, output_tokens, api_speed_sec
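
The timing above relies on timestamps carried by the response object. If end-to-end latency as seen by the caller is of interest instead, it could be measured on the client side. The sketch below is an alternative under stated assumptions, not what the notebook does: `timed_call` is a hypothetical helper, and `fake_query` stands in for `query` so the example runs without an API key.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_query(prompt, userinput):
    """Stand-in for the notebook's query(); sleeps briefly instead of calling the API."""
    time.sleep(0.05)
    return ["ExampleName"]


suggestions, seconds = timed_call(fake_query, prompt="...", userinput="A small bakery")
print(suggestions, round(seconds, 2))
```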

In [86]:
def generate_suggestions(descriptions, prompt, promptname):
    rows = []
    for category, category_descriptions in descriptions.items():
        for description in category_descriptions:
            rows.append({
                "category": category,
                "description": description
            })
    results = []
    for row in rows:
        suggestions, input_tokens, output_tokens, api_speed_sec = query(
            prompt=prompt,
            userinput=row["description"]
        )
    
        for s in suggestions:
            results.append({
                "category": row["category"].lower(),
                "description": row["description"],
                "prompt_used": promptname,
                "domain": s.name,
                "reason": s.logic,
                "speed_sec": round(api_speed_sec, 2),
                "input_tokens": input_tokens,
                "output_tokens": output_tokens
            })
    return results

In [87]:
def remove_TLD(df):
    df_temp = df.copy()
    
    # Remove TLD
    df_temp["domain"] = df_temp["domain"].str.split(".").str[0]
    df_TLD_removed = df_temp.sort_values(by="category").reset_index(drop=True)

    return df_TLD_removed

In [132]:
# for loading results in later runs
df_promptA = pd.read_csv('/result_files/promptA_generated_domains.csv')
df_promptB = pd.read_csv('/result_files/promptB_generated_domains.csv')
df_promptC = pd.read_csv('/result_files/promptC_generated_domains.csv')

In [None]:
# !!!! REGENERATES RESULTS !!!!
promptA_results = generate_suggestions(business_descriptions, prompt_A, "prompt_A")
df_promptA = pd.DataFrame(promptA_results)
df_promptA = remove_TLD(df_promptA)
df_promptA.to_csv("/result_files/promptA_generated_domains.csv", index=False)

In [None]:
# !!!! REGENERATES RESULTS !!!!
promptB_results = generate_suggestions(business_descriptions, prompt_B, "prompt_B")
df_promptB = pd.DataFrame(promptB_results)
df_promptB = remove_TLD(df_promptB)
df_promptB.to_csv("/result_files/promptB_generated_domains.csv", index=False)

In [77]:
# !!!! REGENERATES RESULTS !!!!
promptC_results = generate_suggestions(business_descriptions, prompt_C, "prompt_C")
df_promptC = pd.DataFrame(promptC_results)
df_promptC = remove_TLD(df_promptC)
df_promptC.to_csv("/result_files/promptC_generated_domains.csv", index=False)

## 4. Prompt Engineering Results

In [91]:
def export_for_scoring(df, filename):
    df_prompt_eval = df.copy()
    df_prompt_eval['relevance'] = ''
    df_prompt_eval['brandability'] = ''
    df_prompt_eval_sort = df_prompt_eval.sort_values(by=['category', 'description'])
    df_prompt_eval_sort.to_csv(filename, index=False)

In [124]:
export_for_scoring(df_promptA, "promptA_scores.csv")
export_for_scoring(df_promptB, "promptB_scores.csv")
export_for_scoring(df_promptC, "promptC_scores.csv")

After the domains were exported with empty relevance and brandability columns, the suggestions were rated manually and the scored files were loaded back.

In [107]:
df_promptA_scores = pd.read_csv('/result_files/promptA_scored.csv', delimiter=";")
df_promptB_scores = pd.read_csv('/result_files/promptB_scored.csv', delimiter=";")
df_promptC_scores = pd.read_csv('/result_files/promptC_scored.csv', delimiter=";")

In [108]:
def prepare_for_evaluation(df):
    # Collapse the per-domain rows into one row per (category, description, prompt)
    df_eval = (
        df
        .groupby(['category', 'description', 'prompt_used'], as_index=False)
        .agg({
            'domain': list,
            'reason': list,
            'relevance': 'mean',
            'brandability': 'mean',
            'speed_sec': 'mean',
            'input_tokens': lambda x: int(x.mean()),
            'output_tokens': lambda x: int(x.mean()),
        })
    )
    return df_eval

In [109]:
df_promptA_eval = prepare_for_evaluation(df_promptA_scores)
df_promptB_eval = prepare_for_evaluation(df_promptB_scores)
df_promptC_eval = prepare_for_evaluation(df_promptC_scores)

In [110]:
def evaluate(df1, df2, df3):
    INPUT_PRICE_PER_M_CENTS = 15   # 15 cents per 1M input tokens
    OUTPUT_PRICE_PER_M_CENTS = 60  # 60 cents per 1M output tokens

    df_eval = pd.concat([df1, df2, df3], ignore_index=True)

    avg_by_prompt = (
        df_eval
        .groupby('prompt_used', as_index=False)
        .agg({
            'relevance': 'mean',
            'brandability': 'mean',
            'speed_sec': 'sum',
            'input_tokens': 'sum',
            'output_tokens': 'sum'
        })
    )

    # Calculate costs in cents
    avg_by_prompt['input_cost(cents)'] = (
        avg_by_prompt['input_tokens'] / 1_000_000 * INPUT_PRICE_PER_M_CENTS
    )
    avg_by_prompt['output_cost(cents)'] = (
        avg_by_prompt['output_tokens'] / 1_000_000 * OUTPUT_PRICE_PER_M_CENTS
    )
    avg_by_prompt['total_cost(cents)'] = (
        avg_by_prompt['input_cost(cents)'] + avg_by_prompt['output_cost(cents)']
    )

    # Round selectively
    avg_by_prompt = avg_by_prompt.round({
        'relevance': 1,
        'brandability': 1,
        'speed_sec': 0,
        'input_tokens': 0,
        'output_tokens': 0,
        'input_cost(cents)': 2,
        'output_cost(cents)': 2,
        'total_cost(cents)': 2
    })

    return avg_by_prompt


In [111]:
df_avg_by_prompt = evaluate(df_promptA_eval,
                            df_promptB_eval,
                            df_promptC_eval)
df_avg_by_prompt

Unnamed: 0,prompt_used,relevance,brandability,speed_sec,input_tokens,output_tokens,input_cost(cents),output_cost(cents),total_cost(cents)
0,prompt_A,3.8,2.7,59.0,3283,2827,0.05,0.17,0.22
1,prompt_B,3.8,3.5,64.0,17035,3222,0.26,0.19,0.45
2,prompt_C,3.5,3.1,103.0,20203,4677,0.3,0.28,0.58


**Relevance**

* **Prompts A & B:** Both prompts generated domains that align closely with the business descriptions.
* **Prompt C:** Tended toward over-creativity, often losing the core essence of the business. For example, using `PulseEdge` for an FPGA acceleration service created ambiguity, potentially being mistaken for a healthcare brand.

**Brandability**

* **Prompt A:** Produced relevant but generic and less memorable names.
* **Prompt B:** Adding specific branding constraints significantly improved catchiness without losing context.
* **Prompt C:** Leaned too heavily into cliches (e.g., `LoafLuxe`, `YeastJoy`), which felt forced rather than professional.

**Speed and Performance**

* Increased prompt complexity (Prompt C) led to higher latency.
* Prompt C did not produce better results, so its extra generation time (103 s in total across the 24 requests, versus 59 s for Prompt A and 64 s for Prompt B) is not justified.

**Token Usage and Cost**

* **Prompt A:** Most cost-effective.
* **Prompt B:** Moderate cost increase, but justified by the improved brandability.
* **Prompt C:** Most expensive with the lowest return on quality.
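
The cost figures follow directly from the token totals and the per-million-token prices used in `evaluate`. As a worked example for Prompt B (the 1,000,000-request projection is a hypothetical scenario that assumes the average cost per request stays the same):

```python
# Prices used in the notebook's evaluate(): cents per one million tokens
INPUT_PRICE_PER_M_CENTS = 15
OUTPUT_PRICE_PER_M_CENTS = 60


def total_cost_cents(input_tokens, output_tokens):
    """Cost in cents for a given number of input and output tokens."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_M_CENTS
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M_CENTS
    return input_cost + output_cost


# Token totals for Prompt B, taken from the results table above
prompt_b_cost = total_cost_cents(17035, 3222)
print(round(prompt_b_cost, 2))  # 0.45 (cents, for all 24 requests)

# Hypothetical projection: the same average cost per request at 1,000,000 requests
REQUESTS_IN_TEST = 24
projected_cents = prompt_b_cost / REQUESTS_IN_TEST * 1_000_000
print(round(projected_cents / 100, 2))  # projected cost in dollars
```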

---

#### **Prompt Comparison Summary**

| Metric | Prompt A | Prompt B | Prompt C |
| --- | --- | --- | --- |
| **Relevance** | High (3.8) | High (3.8) | Lower (3.5) |
| **Brandability** | Low (2.7) | High (3.5) | Medium (3.1, cliche) |
| **Speed** | Fast (59 s) | Moderate (64 s) | Slow (103 s) |
| **Cost** | Lowest (0.22 cents) | Moderate (0.45 cents) | Highest (0.58 cents) |

---

#### **Final Recommendation**

**Prompt B** offers the optimal balance of relevance, brandability, and cost-efficiency. While Prompt A is a viable baseline for high-speed/low-cost requirements, it lacks the creative edge needed for branding. Prompt C is over-engineered and inefficient for this use case.

> **Model Selection Note:**
> During testing, **gpt-5-nano** was evaluated but discarded due to high latency (~15–20 seconds per request). **gpt-4o-mini** was selected as the production model for its superior speed and reliability.


## 5. Fine-Tuning Experiment

### 5.1. Data Construction

> Data construction was done using AI tools and manual input. The final synthetic dataset includes **54 (business description, domain) pairs**.

In [180]:
json_file_path = 'dataset_finetune.json' 
df_finetune = pd.read_json(json_file_path)
df_finetune

Unnamed: 0,description,domain
0,A cafe specializing in organic coffee and pastries,Groundfolk.com
1,Mobile app for guided meditation and mindfulness,Stillpoint.com
2,Subscription box for eco-friendly household products,Earthwell.com
3,Boutique selling handmade jewelry,Glintwell.com
4,Online marketplace for custom pet accessories,Tailmark.com
5,App for tracking daily water intake,Flowza.com
6,Delivery service for fresh farm produce,FarmDash.com
7,Platform connecting freelance creative professionals with the businesses,Freelanced.com
8,Boutique fitness studio offering yoga and pilates,Kinfolk.com
9,Digital marketing agency for small businesses,Brandkit.com


### 5.2. Model & Training Setup

The model and the training environment were chosen to fit the available hardware resources.

#### **Technical Specifications**

| Component | Detail |
| --- | --- |
| **Base Model** | [Llama 3.2 (3B)](https://huggingface.co/meta-llama/Llama-3.2-3B) by Meta |
| **Training Technique** | **QLoRA** (Quantized Low-Rank Adaptation) |
| **Environment** | Google Colab |

---

#### **Development Environment**

The fine-tuning process was executed via Google Colab. The notebook used has been specifically adjusted to fit this use case.

* **Active Notebook:** [View on Google Colab](https://colab.research.google.com/drive/1RiIQB9_-7ggQ_7MkOW4okGkUpO2UPPJt?usp=sharing)
* **Original Source:** This work is based on the [Unsloth Synthetic Data Notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Meta_Synthetic_Data_Llama3_2_%283B%29.ipynb).

---

#### **Model Weights & Distribution**

The trained model weights have been uploaded and are publicly available on Hugging Face:

> **Hugging Face Repository:** [eylulguleryuz/domain-suggester-gguf](https://huggingface.co/eylulguleryuz/domain-suggester-gguf)


In [None]:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="eylulguleryuz/domain-suggester-gguf",
    filename="llama-3.2-3b-instruct.Q4_K_M.gguf",
)

In [64]:
def query_local(business_description):
    answer = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a helpful assistant that generates domain names from business descriptions."},
            {"role": "user", "content": f"Give me 5 domain names in a list. {business_description}"},
        ]
    )
    return list_domains(answer)

In [65]:
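def list_domains(llm_answer):
    """Extract the numbered items ("1. Name") from a chat completion."""
    text = llm_answer['choices'][0]['message']['content']
    # Normalise escaped newlines so every list item sits on its own line
    text = text.replace("\\n", "\n")

    items = []
    for line in text.split("\n"):
        line = line.strip()
        # Keep lines shaped like "1. Name"; skip numbered lines without ". "
        if line.startswith(tuple("123456789")) and ". " in line:
            items.append(line.split(". ", 1)[1])

    return items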

In [79]:
def generate_suggestions_local(descriptions):
    rows = []
    for category, category_descriptions in descriptions.items():
        for description in category_descriptions:
            rows.append({
                "category": category,
                "description": description
            })
    results = []
    for row in rows:
        suggestions = query_local(row["description"])
    
        for s in suggestions:
            results.append({
                "category": row["category"].lower(),
                "description": row["description"],
                "domain": s,
            })
    return results

## 6. Fine-Tuning Results

In [None]:
results_local = generate_suggestions_local(business_descriptions)

In [122]:
df_local = pd.DataFrame(results_local)
df_local = remove_TLD(df_local)
df_local.to_csv("/result_files/finetuned_generated_domains.csv", index=False)
df_local.head(5)

Unnamed: 0,category,description,domain
0,abstract/creative,"We are the lighthouse in a sea of data, guiding your brand toward the shores of untapped potential.",Lighthouse
1,abstract/creative,We are your north star in overcoming all the peculiarities of becoming a vlogger.,Frameup
2,abstract/creative,We are your north star in overcoming all the peculiarities of becoming a vlogger.,Contentcraft
3,abstract/creative,We are your north star in overcoming all the peculiarities of becoming a vlogger.,Streamwell
4,abstract/creative,We are your north star in overcoming all the peculiarities of becoming a vlogger.,Vlogify


In [92]:
export_for_scoring(df_local, "finetuned_domains_scores.csv")

> **Only 17** business descriptions returned suggestions. For the rest, no suggestions were generated.

This is likely due to the model's complexity: it may not always be able to understand the user's query. Penalizing the model for queries it could not understand would not be fair, so the metrics are calculated only on what the model was able to provide (85 suggestions).
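
Since the averages are computed on answered queries only, the coverage rate is worth keeping next to the quality scores. A short worked calculation from the figures above (the constant names are illustrative):

```python
TOTAL_DESCRIPTIONS = 24      # size of the test set
ANSWERED_DESCRIPTIONS = 17   # descriptions for which the model returned suggestions
SUGGESTIONS_PER_ANSWER = 5

coverage = ANSWERED_DESCRIPTIONS / TOTAL_DESCRIPTIONS
scored_suggestions = ANSWERED_DESCRIPTIONS * SUGGESTIONS_PER_ANSWER

print(f"coverage: {coverage:.0%}")                  # coverage: 71%
print(f"scored suggestions: {scored_suggestions}")  # scored suggestions: 85
```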

In [None]:
df_finetuned_scores = pd.read_csv('/result_files/finetuned_domains_scored.csv', delimiter=";")

In [123]:
df_finetuned_eval = (
    df_finetuned_scores
    .groupby(['category', 'description'], as_index=False)
    .agg({
        'domain': list,  
        'relevance': 'mean',
        'brandability': 'mean',
    })
    )
df_finetuned_eval.head(5)

Unnamed: 0,category,description,domain,relevance,brandability
0,abstract/creative,"We are the lighthouse in a sea of data, guiding your brand toward the shores of untapped potential.","[Lighthouse, Chartwell, Guidinglight, Beaconwise, Safehaven]",2.6,2.8
1,abstract/creative,We are your north star in overcoming all the peculiarities of becoming a vlogger.,"[Frameup, Contentcraft, Streamwell, Vlogify, Viewpoint]",2.2,3.6
2,ambiguous,Service where you can outsource random tasks you don't have time for and also offers language lessons.,"[Assistify, OutsourceHub, LinguaTask, TaskMate, TaskLingua]",3.8,3.2
3,ambiguous,Exclusive affordable luxury tax consulting and used car tire sales with no-strings-attached lifetime commitments.,"[RoadsideTax, DealMark, Wheelmark, TireMark, TaxTirePro]",2.8,3.6
4,ambiguous,High-end budget fashion and luxury electronics with occasional catering services,"[GourmetGadget, StyleBoutique, HauteTable, LuxeBite, CoutureCircuit]",2.4,2.6


In [105]:
overall_avg = df_finetuned_eval[['relevance', 'brandability']].mean().round(1)
print("Fine-tuned LLM Scores")
overall_avg

Fine-tuned LLM Scores


relevance       3.1
brandability    2.9
dtype: float64

## 7. Discussion & Trade-offs

In [118]:
df_avg_prompt = df_avg_by_prompt[['prompt_used', 'relevance', 'brandability']]

df_avg_finetune = (
    df_finetuned_eval[['relevance', 'brandability']]
    .mean()
    .round(1)
    .to_frame()
    .T
)

df_avg_finetune['prompt_used'] = 'finetuned'

df_eval_full = pd.concat(
    [df_avg_prompt, df_avg_finetune],
    ignore_index=True
)

df_eval_full


Unnamed: 0,prompt_used,relevance,brandability
0,prompt_A,3.8,2.7
1,prompt_B,3.8,3.5
2,prompt_C,3.5,3.1
3,finetuned,3.1,2.9


#### **Prompt Engineering**

**Pros**

* High relevance across all categories.
* Best overall balance of relevance and brandability (Prompt B).
* No training data required.
* Fast iteration by adjusting prompts.

**Cons**

* Brandability depends heavily on prompt design.
* Creativity may stall or go overboard without explicit constraints.

**Performance**

* Relevance: highest (3.8).
* Brandability: highest with Prompt B (3.5).

**Scalability**

* Scales well across domains.
* Prompt changes do not require retraining.

**Cost**

* Low development cost.
* Token usage needs to be monitored. For the best-performing prompt (Prompt B), generating suggestions for 24 user queries cost about 0.45 cents. This could increase significantly in a production environment.

---

#### **Fine-Tuning**

**Pros**

* Slightly improved brandability over a baseline prompt.
* Learns creative patterns from curated examples.

**Cons**

* Lower relevance due to limited training data.
* Poor generalization outside training distribution.
* Requires comprehensive dataset creation and maintenance.
* Harder to iterate and debug.

**Performance**

* Relevance: lowest (3.1).
* Brandability: moderate (2.9).

**Scalability**

* Does not scale well without more data.
* Requires retraining for new domain ideas or constraints.

**Cost**

* Higher upfront cost (data preparation + training).
* Additional infrastructure and maintenance overhead.

---


## 8. Recommendations & Next Steps

#### **Recommendation**

**Use Prompt B as the MVP solution.**

Because:

* Highest brandability (3.5)
* Tied highest relevance (3.8)
* Lower cost and latency than Prompt C
* Outperforms the fine-tuned model on both metrics

---

#### **Next Steps**

#### 1. Production readiness

* Add domain availability filtering as a post-processing step
* Cache frequent queries to reduce cost and latency
* Filter out domain names containing special (non-Latin) characters as a post-processing step, since the LLM does not follow this constraint consistently.
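
The character check could look roughly like the sketch below. It assumes that "Latin only" means ASCII letters, digits and inner hyphens within the 63-character DNS label limit; `keep_latin_only` and the sample names are hypothetical and not taken from `api.py`.

```python
import re

# Assumption: a valid suggestion is a single label of ASCII letters, digits and
# inner hyphens, at most 63 characters (the DNS label limit)
LABEL_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def keep_latin_only(names):
    """Return the names that consist solely of ASCII letters, digits and inner hyphens."""
    return [name for name in names if LABEL_PATTERN.fullmatch(name)]


# Hypothetical model output mixing valid and invalid suggestions
raw = ["Stillpoint", "Café-Luz", "居酒屋", "Farm-Dash", "-Broken", "Flowza2"]
print(keep_latin_only(raw))  # ['Stillpoint', 'Farm-Dash', 'Flowza2']
```

Dropping invalid names is the simplest policy; transliterating them instead (for example `é` to `e`) would preserve more suggestions at the cost of extra logic.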

#### 2. User-in-the-loop feedback

* Track which suggested domains users click or register
* Use this feedback to refine prompt constraints over time
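
One possible shape for this feedback loop is a tally of impressions and clicks per suggested domain. The sketch below keeps everything in memory for illustration; `FeedbackLog` and its method names are hypothetical, and a real service would presumably persist these events.

```python
from collections import Counter


class FeedbackLog:
    """In-memory tally of how often each suggested domain was shown and clicked."""

    def __init__(self):
        self.shown = Counter()
        self.clicked = Counter()

    def record_shown(self, domains):
        self.shown.update(domains)

    def record_click(self, domain):
        self.clicked[domain] += 1

    def click_through_rate(self, domain):
        """Clicks divided by impressions; 0.0 if the domain was never shown."""
        if self.shown[domain] == 0:
            return 0.0
        return self.clicked[domain] / self.shown[domain]


log = FeedbackLog()
log.record_shown(["Stillpoint", "Flowza"])
log.record_shown(["Stillpoint"])
log.record_click("Stillpoint")
print(log.click_through_rate("Stillpoint"))  # 0.5
print(log.click_through_rate("Flowza"))      # 0.0
```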

#### 3. Targeted fine-tuning (if needed)

* Revisit fine-tuning if:

  * Sufficient real user data is collected
  * Enough resources can be dedicated to fine-tune a larger model

---

## 9. Serving Prompt B through API


The project directory contains the `api.py` and `client.py` files. Run these separately from the notebook to test the API.

The API will return a result with this schema:

```
{
  "suggestions": [
    {
      "name": "",
      "logic": ""
    },
    {
      "name": "",
      "logic": ""
    },
    ...
  ],
  "input_tokens": 0,
  "output_tokens": 0,
  "api_speed_sec": 0.0
}
```
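
On the client side, a response could be checked against this schema before it is used. The following is a minimal standard-library sketch; `validate_response` and the example payload are hypothetical and do not reflect how `client.py` is actually implemented.

```python
def validate_response(payload):
    """Raise ValueError if payload does not match the response schema shown above."""
    expected_types = {
        "suggestions": list,
        "input_tokens": int,
        "output_tokens": int,
        "api_speed_sec": (int, float),
    }
    for key, expected in expected_types.items():
        if key not in payload:
            raise ValueError(f"missing field: {key}")
        if not isinstance(payload[key], expected):
            raise ValueError(f"wrong type for field: {key}")
    for item in payload["suggestions"]:
        if not isinstance(item, dict):
            raise ValueError("each suggestion must be an object")
        for key in ("name", "logic"):
            if not isinstance(item.get(key), str):
                raise ValueError(f"suggestion field must be a string: {key}")
    return payload


# Hypothetical response body in the shape returned by the API
example = {
    "suggestions": [{"name": "ExampleName", "logic": "Short and easy to spell."}],
    "input_tokens": 700,
    "output_tokens": 120,
    "api_speed_sec": 2.5,
}
print(len(validate_response(example)["suggestions"]))  # 1
```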
