# Can we use traditional ML metrics to evaluate our model?

In some cases, yes. When we have a labelled dataset and the LLM is used as a classifier, its outputs can be compared directly with the expected labels, so we can use traditional ML metrics to evaluate our model.

Imagine we have a dataset of articles and we want to classify each one as being about *__AI__*, *__music__*, *__food__*, or *__education__*, or as *__unknown__* if none of these apply.

In [23]:
from common import get_response
import json

In [24]:
with open("test_classification_cases.json", "r") as f:
    test_cases = json.load(f)

test_cases["tests"][2]

{'id': 'article_about_education',
 'name': 'Article about education',
 'article': 'Ofsted has announced changes to how school inspections are carried out to reduce pressures on teachers and school leaders.\nOfsted provides independent, up to date evaluations on the quality of education, behaviour, personal development, safeguarding, and leadership, which schools and parents value. \nWe want the inspection system to be as helpful as possible for everyone, which is why we’ve listened to calls from teachers and school leaders and are working with Ofsted to make improvements to the process. \nHere we tell you everything you need to know about the changes by Ofsted and why they’re happening.',
 'category': 'education'}

In [25]:
expected_categories = [test_case["category"] for test_case in test_cases["tests"]]
expected_categories

['AI', 'music', 'education', 'food', 'food', 'unknown']

In [29]:
instruction = """Classify the following article into one of the following categories:
 - AI
 - music
 - food
 - education

If the article is not about any of the categories above, classify it as "unknown".

Provide no explanation, or any other text, just the category.
"""

In [30]:
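classified_categories = []

# Send each article to the model with the same instruction and collect its answers
for test_case in test_cases["tests"]:
    article = test_case["article"]
    # The instruction comes first, followed by the article to classify
    full_prompt = f"{instruction}\n\n{article}"
    response = get_response(full_prompt)
    classified_categories.append(response)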
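
The prompt asks for the category and nothing else, but a model may still return extra whitespace, quotes, or a trailing full stop, and any of these would count as a miss in an exact string comparison. The following is a minimal sketch of a clean-up step, assuming `get_response` returns a plain string. `ALLOWED_CATEGORIES`, `INVALID`, and `normalize_category` are hypothetical names introduced here for illustration and are not part of `common`.

```python
# Hypothetical helper: not part of the original notebook or the `common` module.
ALLOWED_CATEGORIES = ["AI", "music", "food", "education", "unknown"]
INVALID = "invalid"  # assumed sentinel for answers that match no label


def normalize_category(response: str) -> str:
    """Map a raw model response onto one of the allowed labels."""
    # Drop surrounding whitespace, quotes and full stops, then compare
    # case-insensitively so that " Food." and '"food"' both become "food".
    cleaned = response.strip().strip("\"'.").strip().lower()
    for category in ALLOWED_CATEGORIES:
        if cleaned == category.lower():
            return category
    # Anything else is kept apart from "unknown" so it always counts as a miss.
    return INVALID
```

Returning a separate sentinel rather than `"unknown"` is a design choice of this sketch: it stops malformed answers from being scored as correct on articles whose expected label is `unknown`.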


In [31]:
classified_categories

['AI', 'education', 'education', 'food', 'education', 'unknown']
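
Before reducing the results to a single number, it can help to see which test cases were misclassified. The sketch below pairs each expected label with the predicted one; the two lists are copied from the outputs above so that the snippet runs on its own, and `mismatches` is a name introduced here for illustration.

```python
# Values copied from the cell outputs above so the snippet is self-contained.
expected_categories = ["AI", "music", "education", "food", "food", "unknown"]
classified_categories = ["AI", "education", "education", "food", "education", "unknown"]

# Keep the position, expected label and predicted label of every miss.
mismatches = [
    (index, expected, predicted)
    for index, (expected, predicted) in enumerate(
        zip(expected_categories, classified_categories)
    )
    if expected != predicted
]

for index, expected, predicted in mismatches:
    print(f"test {index}: expected {expected!r}, got {predicted!r}")
# test 1: expected 'music', got 'education'
# test 4: expected 'food', got 'education'
```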

## With these results, we can calculate accuracy.

In [32]:
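def accuracy_score(y_true: list[str], y_pred: list[str]) -> float:
    """Return the fraction of predictions that exactly match the expected labels."""
    # zip() would silently truncate mismatched lists, so check the lengths first
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

accuracy_score(expected_categories, classified_categories)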


0.6666666666666666
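
Accuracy alone does not show which categories the model struggles with. As an illustration of another traditional metric, the sketch below computes per-class precision and recall in plain Python. The function `precision_recall_per_class` is a hypothetical helper, the two lists are copied from the outputs above, and a score with a zero denominator is reported as `0.0` by assumption.

```python
from collections import Counter

# Values copied from the cell outputs above so the snippet is self-contained.
expected_categories = ["AI", "music", "education", "food", "food", "unknown"]
classified_categories = ["AI", "education", "education", "food", "education", "unknown"]


def precision_recall_per_class(
    y_true: list[str], y_pred: list[str]
) -> dict[str, dict[str, float]]:
    """Compute precision and recall for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1  # correct prediction for label t
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was expected but was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]  # times the label was predicted
        actual = tp[label] + fn[label]     # times the label was expected
        report[label] = {
            # Assumption: report 0.0 when the denominator is zero.
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
        }
    return report


report = precision_recall_per_class(expected_categories, classified_categories)
for label, scores in report.items():
    print(f"{label}: precision={scores['precision']:.2f}, recall={scores['recall']:.2f}")
```

On these six test cases, *education* has perfect recall but a precision of only one third, because two articles from other categories were labelled as education. With so few examples these figures are only indicative.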