# Guide: Choosing the right model for your use case

As an LLM developer, you often have to decide which model best fits a specific use case.

In this notebook we present a **structured way to approach picking the right model for your use case**.

**Use case**: We take **Chat** as an example use case, where we build a playful and helpful assistant that is cost-effective. We evaluate the performance of two popular models against the three criteria we care about: clarity, objectivity, and tone.

We demonstrate how LLM-as-a-Judge (LLMJ) can be used to guide the decision.

* **Models compared**: Llama-3.1-8B-Instruct, GPT-4o
* **Evaluator model**: [Atla Selene](https://www.atla-ai.com/post/selene-1) - 70B General Purpose Evaluator model
* **Dataset**: We subsample 50 chat examples from [RewardBench](https://huggingface.co/datasets/allenai/reward-bench) and generate responses using the foundation models

The notebook has 3 sections:
1. Initial setup
2. Run evaluations against the criteria
3. Results & Insights

# 1. Initial Setup

### Install dependencies

In [1]:
%%capture
!pip install atla==0.5.0 plotly==5.24.1 datasets==3.4.1

### Setup Atla API key & client

You can get your Atla API key for free [here](https://www.atla-ai.com/sign-up).

In [2]:
from atla import AsyncAtla
import os

os.environ["ATLA_API_KEY"] = "<<your API key>>"
client = AsyncAtla()
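
Hardcoding the key in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key may already be set in the environment and prompting for it otherwise. `resolve_api_key` is a hypothetical helper written for this example, not part of the Atla SDK:

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
    name: str = "ATLA_API_KEY",
) -> str:
    """Return the key from the environment, or prompt for it if missing."""
    key = env.get(name, "").strip()
    if not key:
        # getpass hides the typed value so it does not end up in cell output.
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"{name} is empty")
    return key
```

One possible usage is `os.environ["ATLA_API_KEY"] = resolve_api_key()` before creating the client.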

### Load the dataset

The dataset has 50 data points, each with the following fields:
* user_input
* Llama-3.1-8B-Instruct output
* GPT-4o output

In [3]:
from datasets import load_dataset

df = load_dataset("AtlaAI/demo-chat-responses", split="train").to_pandas()
df.head(5)


| | user_input | Llama-3.1-8B-Instruct | GPT-4o |
|---|---|---|---|
| 0 | How do I detail a car? | Detailing a car - it's like giving your vehicl... | Detailing a car involves thoroughly cleaning a... |
| 1 | Who created the Superman cartoon character? | You want to know about the Man of Steel, huh? ... | Superman was created by writer Jerry Siegel an... |
| 2 | What is Atlantis? | Atlantis - the ultimate mystery, the ultimate ... | Atlantis is a legendary island and civilizatio... |
| 3 | Hi, I'm in the mood for a Bloody Mary. Can you... | A Bloody Mary, the ultimate brunch drink for t... | Certainly! A Bloody Mary is a classic cocktail... |
| 4 | what are some good ways to spread ashes? | Spreading ashes - the ultimate way to scatter ... | Spreading the ashes of a loved one can be a me... |


In [4]:
# Extract the responses from the foundation models
assistant_responses = df.drop(["user_input"], axis=1)
assistant_response_columns = assistant_responses.columns

# 2. Run evaluations against the criteria

### The evaluation criteria

We have chosen the following 3 criteria that are relevant to chat applications.
1. Clarity - Is the chat output clear?
2. Tone - Does the chat output match the tone I want?
3. Objectivity - Is the chat output unbiased?

In [5]:
evaluation_metrics = {
    "clarity": """
        Rate the response's clarity on a scale of 1 (high school student) to 5 (Nobel laureate).
    """,
    "tone": """
        I want my assistant to have a great sense of humor.
        Rate how funny the response is on a scale of 1 (boring) to 5 (funny).
    """,
    "objectivity": """
        I want my assistant to be unbiased. Its responses should only portray facts and never include opinions or other subjective statements.
        Rate the response's objectivity on a scale of 1 to 5:
        - A score of 1 means the response is completely subjective.
        - A score of 2 means the response is mainly subjective, but contains some objective statements.
        - A score of 3 means the response is a mix of subjective and objective statements.
        - A score of 4 means the response is mainly objective, but contains some subjective statements.
        - A score of 5 means the response is completely objective.
    """,
}

### Running the evaluations using Atla Selene

In [6]:
import pandas as pd

evaluation_results = pd.DataFrame(
    index=assistant_response_columns,
    columns=evaluation_metrics.keys(),
)
evaluation_results

| | clarity | tone | objectivity |
|---|---|---|---|
| Llama-3.1-8B-Instruct | NaN | NaN | NaN |
| GPT-4o | NaN | NaN | NaN |


In [7]:
import asyncio

async def eval_fn(row: pd.Series, model_output: str, evaluation_criteria: str) -> int:
    response = await client.evaluation.create(
        model_id="atla-selene",
        model_input=row.user_input,
        model_output=model_output,
        evaluation_criteria=evaluation_criteria,
    )
    return int(response.result.evaluation.score)
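
`eval_fn` casts the returned score straight to `int`, which raises if the score is ever missing or not a clean integer. A defensive sketch, assuming valid scores fall in the 1 to 5 range that the criteria above ask for. `parse_score` is a hypothetical helper, not part of the Atla SDK:

```python
from typing import Optional


def parse_score(raw: object, low: int = 1, high: int = 5) -> Optional[int]:
    """Convert a raw score to int, returning None if it is unusable."""
    try:
        score = int(str(raw).strip())
    except ValueError:
        return None
    # Reject values outside the rubric's scale instead of averaging them in.
    return score if low <= score <= high else None
```

Returning `None` rather than raising lets the caller decide whether to skip, retry, or fail on a bad evaluation.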

In [8]:
for model in evaluation_results.index:
    print(f"Evaluating {model}...")

    for metric in evaluation_results.columns:
        eval_tasks = [
            eval_fn(
                row=row,
                model_output=row[model],
                evaluation_criteria=evaluation_metrics[metric],
            )
            for _, row in df.iterrows()
        ]
        scores = await asyncio.gather(*eval_tasks)

        avg_score = sum(scores) / len(scores)
        evaluation_results.at[model, metric] = avg_score

Evaluating Llama-3.1-8B-Instruct...
Evaluating GPT-4o...
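
The loop above launches all 50 requests for a metric at once via `asyncio.gather`. If the API applies rate limits, one option is to cap concurrency with a semaphore. A minimal sketch, in which `fake_eval` is a hypothetical stand-in for `eval_fn` and the limit of 5 is an arbitrary assumption:

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def gather_limited(
    items: Sequence[str],
    worker: Callable[[str], Awaitable[int]],
    limit: int = 5,
) -> List[int]:
    """Run worker over items with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: str) -> int:
        async with semaphore:
            return await worker(item)

    # gather preserves input order, so scores line up with items.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_eval(text: str) -> int:
    """Hypothetical stand-in for eval_fn: scores by length, capped at 5."""
    await asyncio.sleep(0)
    return min(len(text), 5)
```

In a notebook this can be called as `scores = await gather_limited(responses, fake_eval)`; in a plain script, wrap the call in `asyncio.run(...)`.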


# 3. Results & Insights

### Evaluation scores

Each Selene evaluation run outputs:
* Score (integer)
* Chain of Thought critique (string)

For the purposes of this notebook, we only look at the scores. We calculate the average score per criterion for each model.
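
An average alone hides how much the scores vary across the 50 examples. A small sketch of reporting the spread alongside the mean, using made-up scores rather than the notebook's actual per-example results. `summarize` is a hypothetical helper:

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(scores: List[int]) -> Dict[str, float]:
    """Return mean and sample standard deviation, rounded to 2 places."""
    return {
        "mean": round(mean(scores), 2),
        # stdev needs at least two data points.
        "stdev": round(stdev(scores), 2) if len(scores) > 1 else 0.0,
    }


# Illustrative scores only; not taken from the evaluation above.
example_scores = [4, 5, 3, 4, 4]
print(summarize(example_scores))
```

Two models with similar averages but very different spreads may behave quite differently in production, so it can be worth checking both.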

In [9]:
evaluation_results

| | clarity | tone | objectivity |
|---|---|---|---|
| Llama-3.1-8B-Instruct | 3.38 | 3.92 | 2.54 |
| GPT-4o | 4.44 | 1.18 | 4.3 |


### Visualizing model performance

We now use a simple spider (radar) chart to visualize how the two foundation models performed on the chat dataset across the three chosen criteria.

In [10]:
import plotly.express as px

combined_results = evaluation_results.reset_index().melt(id_vars="index", var_name="Metric", value_name="Score")
fig = px.line_polar(combined_results, r="Score", theta="Metric", color="index", line_close=True, range_r=[0, 5])
fig.show()

### Model Selection

🏅 **Our priorities**:

We want to deploy an AI assistant that is cost-effective and engages with our end users in a playful, humorous manner.

Clarity and objectivity are nice to have, but they matter less to us than tone.

👀 **Selene evaluation results**:

Selene's evaluation shows that Llama 3.1 8B scores far higher on the humorous tone we are aiming for (3.92 vs. 1.18 for GPT-4o), but at the cost of objectivity (2.54 vs. 4.3) and clarity (3.38 vs. 4.44).

**Our choice**:  
➡️ We select **Llama 3.1 8B Instruct**, which can give us a cost-effective, playful, and humorous assistant, though there is still room for improvement.  
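
One way to make this trade-off explicit is a weighted sum of the per-criterion averages. A sketch using the scores from the results table above; the weights are assumptions chosen to reflect "tone first" and do not come from the notebook:

```python
from typing import Dict

# Average scores copied from the results table above.
results: Dict[str, Dict[str, float]] = {
    "Llama-3.1-8B-Instruct": {"clarity": 3.38, "tone": 3.92, "objectivity": 2.54},
    "GPT-4o": {"clarity": 4.44, "tone": 1.18, "objectivity": 4.30},
}

# Hypothetical weights: tone matters most, the rest are nice-to-haves.
weights: Dict[str, float] = {"clarity": 0.2, "tone": 0.6, "objectivity": 0.2}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-criterion scores."""
    return sum(scores[name] * weight for name, weight in weights.items())


# Rank models from highest to lowest weighted score.
ranking = sorted(
    results, key=lambda m: weighted_score(results[m], weights), reverse=True
)
for model in ranking:
    print(f"{model}: {weighted_score(results[model], weights):.2f}")
```

With these particular weights, Llama 3.1 8B ranks first (roughly 3.5 versus 2.5). Different weights can flip the ranking, and cost is not part of this score, so it would need to be weighed separately.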

👉 The next step: apply prompt engineering to improve its objectivity. Check out our cookbook on prompt engineering to learn how.