# Evaluating LLMs on a Content Generation Task

* docs: https://docs.evidentlyai.com/introduction
* repo: https://github.com/evidentlyai/evidently/

In [None]:
#! pip install openai

In [1]:
from openai import OpenAI
import pandas as pd

In [2]:
my_topics = [
    "testing in AI engineering is as important as in development",
    "CI/CD is applicable in AI",
    "Collaboration of subject matter experts and AI engineers improves product",
    "Start LLM apps development from test cases generation",
    "evidently is agreat for LLM testing, use it"
]

In [3]:
OA_client = OpenAI()

## Basic content generation

In [4]:
response = OA_client.responses.create(
    model="gpt-4o-mini",
    input=f"Write a paragraph about {my_topics[0]}",
)

text = response

In [6]:
text.output_text

'Testing in AI engineering is as critical as in traditional software development, as it ensures the reliability, accuracy, and fairness of AI models. In AI systems, where complex algorithms interact with vast datasets, rigorous testing helps identify biases, mitigate errors, and validate performance against intended outcomes. Just as traditional software undergoes unit and integration testing, AI models require specialized approaches, such as validation of training data, performance assessments on unseen datasets, and ethical evaluations. This thorough testing process is essential not only to enhance the functionality of AI applications but also to foster trust among users, ensuring that AI technologies are safe, responsible, and effective in real-world scenarios.'

In [7]:
def basic_tweet_generation(model, topic):
    response = OA_client.responses.create(
        model=model,
        input=f"Write a paragraph about {topic}"
    )

    text = response.output_text
    return text
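
Because `basic_tweet_generation` reads the module-level `OA_client`, it cannot be exercised without an API key. The sketch below shows one way to make the same logic testable offline, assuming it is acceptable to pass the client in as an argument. `FakeClient`, `FakeResponses`, and `generate_paragraph` are hypothetical names introduced for illustration; they only mimic the `responses.create(...).output_text` shape used above and are not part of the OpenAI SDK.

```python
from types import SimpleNamespace


class FakeResponses:
    """Hypothetical stand-in for `client.responses`: records calls, returns canned text."""

    def __init__(self):
        self.calls = []

    def create(self, model, input, instructions=None):
        self.calls.append({"model": model, "input": input, "instructions": instructions})
        # Mimic only the attribute the notebook reads: `.output_text`.
        return SimpleNamespace(output_text=f"[{model}] {input}")


class FakeClient:
    """Hypothetical stand-in for `OpenAI()`, exposing just `.responses`."""

    def __init__(self):
        self.responses = FakeResponses()


def generate_paragraph(client, model, topic):
    # Same prompt as in the cell above, but the client is injected.
    response = client.responses.create(
        model=model,
        input=f"Write a paragraph about {topic}",
    )
    return response.output_text


fake = FakeClient()
draft = generate_paragraph(fake, "gpt-4o-mini", "CI/CD is applicable in AI")
print(draft)
```

Passing the real `OA_client` instead of `fake` should behave like the original function, since only the source of the client changes.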

In [8]:
basic_tweet_generation("gpt-3.5-turbo", my_topics[-1])

"Evidently is a powerful tool for conducting LLM testing due to its comprehensive features and user-friendly interface. With Evidently, users can easily upload their machine learning models, evaluate their performance, and gain insights into the model's behavior. The platform offers a range of statistical tests and visualization tools that allow users to thoroughly analyze the model's predictions and make informed decisions about its optimization. Moreover, Evidently provides detailed reports that highlight key metrics and performance indicators, making it an essential resource for researchers, data scientists, and developers looking to enhance their machine learning models."

## Tracing and Evaluation

In [None]:
#! pip install tracely

In [None]:
#! pip install evidently

In [10]:
from evidently.ui.workspace import CloudWorkspace

from evidently import Dataset, DataDefinition, Report
from evidently.descriptors import LLMEval, Sentiment, TextLength
from evidently.presets import TextEvals
from evidently.llm.templates import BinaryClassificationPromptTemplate

In [9]:
from tracely import init_tracing
from tracely import trace_event

In [11]:
client = CloudWorkspace(url="https://app.evidently.cloud/")

In [12]:
project = client.create_project("[LLM Course] Content Generation", org_id="fd72dd71-e4ef-46dd-87cf-da6bb75b825e")

In [13]:
init_tracing(
    project_id=str(project.id), # Project ID from Evidently Cloud
    export_name="content generation",
)

<opentelemetry.sdk.trace.TracerProvider at 0x304656fd0>

In [16]:
@trace_event()
def basic_tweet_generation(topic, model="gpt-3.5-turbo", instructions=""):
    response = OA_client.responses.create(
        instructions=instructions,
        model=model,
        input=f"Write a paragraph about {topic}"
    )

    text = response.output_text
    return text

In [17]:
basic_tweets = [basic_tweet_generation(topic, model="gpt-3.5-turbo", instructions="") for topic in my_topics]

In [18]:
basic_tweets

['Testing in AI engineering is just as crucial as in development, if not more so. The complexity of AI systems and their reliance on massive amounts of data make them prone to errors and unexpected outcomes. Testing helps ensure the reliability, accuracy, and safety of AI models before they are deployed in real-world applications. By conducting thorough testing, AI engineers can identify and correct any issues or biases in the models, improve their performance, and build trust among users. In addition, effective testing in AI engineering helps streamline the development process, minimize risks, and ultimately enhance the overall quality of AI systems.',
 'Continuous Integration/Continuous Deployment (CI/CD) is highly applicable in the field of Artificial Intelligence (AI) due to the iterative nature of AI development. AI models require constant updates, enhancements, and optimizations based on new data and feedback. Implementing CI/CD practices in AI development ensures that changes an

In [19]:
dataset = client.load_dataset("0196e7e5-948b-71d9-b21d-e9213560262f")

In [20]:
dataset.data_definition

DataDefinition(id_column=None, timestamp='timestamp', service_columns=ServiceColumns(trace_link='_evidently_trace_link'), numerical_columns=[], categorical_columns=['basic_tweet_generation.model', 'basic_tweet_generation.instructions'], text_columns=['id', 'service_name', 'basic_tweet_generation.topic', 'basic_tweet_generation.result'], datetime_columns=[], classification=None, regression=None, llm=None, numerical_descriptors=[], categorical_descriptors=[], test_descriptors=None, ranking=None)
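
The column names in this data definition follow a `<function name>.<argument>` pattern, with the return value stored under `<function name>.result`. The toy decorator below reproduces that naming scheme so the mapping is easy to see. It is only an illustration: `record_event`, `captured_events`, and `demo_generation` are hypothetical names, and this is not how `tracely` is implemented (the real `trace_event` exports OpenTelemetry spans).

```python
import functools
import inspect

captured_events = []  # stands in for the trace export; purely illustrative


def record_event(func):
    """Toy decorator: store each call's arguments and result under `<function>.<name>` keys."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()  # include defaults such as model="gpt-3.5-turbo"
        result = func(*args, **kwargs)
        row = {f"{func.__name__}.{name}": value for name, value in bound.arguments.items()}
        row[f"{func.__name__}.result"] = result
        captured_events.append(row)
        return result

    return wrapper


@record_event
def demo_generation(topic, model="gpt-3.5-turbo", instructions=""):
    # No API call here: return a placeholder so the sketch runs offline.
    return f"paragraph about {topic}"


demo_generation("CI/CD is applicable in AI")
print(sorted(captured_events[0]))
```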

In [21]:
dataset.as_dataframe()

Unnamed: 0,id,service_name,timestamp,_evidently_trace_link,basic_tweet_generation.topic,basic_tweet_generation.model,basic_tweet_generation.instructions,basic_tweet_generation.result
0,3245b5a5-b901-6c54-ad25-7de79cb3e6bc,unknown_service,2025-05-19 09:38:43.723715,0196e7e5-948b-71d9-b21d-e9213560262f/3245b5a5-...,Collaboration of subject matter experts and AI...,gpt-3.5-turbo,,Collaboration between subject matter experts a...
1,6bce4626-b31e-2dc6-fd9d-4500e0406b64,,2025-05-19 09:38:45.886141,0196e7e5-948b-71d9-b21d-e9213560262f/6bce4626-...,Start LLM apps development from test cases gen...,gpt-3.5-turbo,,Start LLM Apps Development always begins with ...
2,772a561a-afb0-1dd8-9e3d-18967520d04b,,2025-05-19 09:38:47.938784,0196e7e5-948b-71d9-b21d-e9213560262f/772a561a-...,"evidently is agreat for LLM testing, use it",gpt-3.5-turbo,,Evidently is a fantastic tool for conducting L...
3,a95e9e1d-c1ef-a1c2-236c-cae8818d0c6e,,2025-05-19 09:38:40.504285,0196e7e5-948b-71d9-b21d-e9213560262f/a95e9e1d-...,CI/CD is applicable in AI,gpt-3.5-turbo,,Continuous Integration/Continuous Deployment (...
4,cc1fb1c7-d989-3aa8-42d1-bc5f4c13a7cc,,2025-05-19 09:38:38.687509,0196e7e5-948b-71d9-b21d-e9213560262f/cc1fb1c7-...,testing in AI engineering is as important as i...,gpt-3.5-turbo,,Testing in AI engineering is just as crucial a...


In [22]:
tweet_quality = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are evaluating the quality of tweets")],
    criteria="""
        Text is ENGAGING if it meets at least one of the following:
            • Contains a strong hook (e.g. question, surprise, bold statement)
            • Uses emotion, humor, or opinion
            • Encourages interaction (calls to action, second-person voice like “you”)
            • Demonstrates personality or a distinct tone
            • Includes vivid language, metaphors, or emojis
            • Sparks curiosity or gives a new insight

        Text is NEUTRAL if it:
            • Merely states a fact or observation without emotion or opinion
            • Lacks clear personality, tone, or call to action
            • Uses generic language with no rhetorical style
            • Reads like an internal note, report, or placeholder
        """,
    target_category="ENGAGING",
    non_target_category="NEUTRAL",
    uncertainty="non_target",
    include_reasoning=True,
)
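
Two of the ENGAGING signals in the criteria above (a question used as a hook, and second-person voice) can be detected with plain string checks. The sketch below is a crude surface check that might be used to sanity-check the LLM judge on obvious cases. It is not a substitute for the judge, and `looks_engaging` is a hypothetical helper, not an Evidently descriptor.

```python
import re


def looks_engaging(text):
    """Return the names of the surface-level ENGAGING signals found in `text`."""
    signals = {
        # "Contains a strong hook (e.g. question, ...)"
        "question": "?" in text,
        # "second-person voice like 'you'"
        "second_person": re.search(r"\b(you|your)\b", text, re.IGNORECASE) is not None,
    }
    return [name for name, hit in signals.items() if hit]


print(looks_engaging("Would you ship code without tests?"))
print(looks_engaging("Testing is an important part of AI engineering."))
```

An empty list does not prove a text is NEUTRAL. Most of the criteria (tone, humor, insight) need the judge model.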

In [23]:
descriptors = [
    TextLength("basic_tweet_generation.result", alias="Length"),
    Sentiment("basic_tweet_generation.result", alias="Sentiment"),
    LLMEval("basic_tweet_generation.result", template=tweet_quality,
           provider="openai", model="gpt-4o-mini",
           alias="Tweet quality")
]

In [24]:
dataset.add_descriptors(descriptors=descriptors)

In [25]:
dataset.as_dataframe()

Unnamed: 0,id,service_name,timestamp,_evidently_trace_link,basic_tweet_generation.topic,basic_tweet_generation.model,basic_tweet_generation.instructions,basic_tweet_generation.result,Length,Sentiment,Tweet quality,Tweet quality reasoning
0,3245b5a5-b901-6c54-ad25-7de79cb3e6bc,unknown_service,2025-05-19 09:38:43.723715,0196e7e5-948b-71d9-b21d-e9213560262f/3245b5a5-...,Collaboration of subject matter experts and AI...,gpt-3.5-turbo,,Collaboration between subject matter experts a...,769,0.987,NEUTRAL,The text primarily states facts about collabor...
1,6bce4626-b31e-2dc6-fd9d-4500e0406b64,,2025-05-19 09:38:45.886141,0196e7e5-948b-71d9-b21d-e9213560262f/6bce4626-...,Start LLM apps development from test cases gen...,gpt-3.5-turbo,,Start LLM Apps Development always begins with ...,793,0.9565,NEUTRAL,The text merely states the facts about the imp...
2,772a561a-afb0-1dd8-9e3d-18967520d04b,,2025-05-19 09:38:47.938784,0196e7e5-948b-71d9-b21d-e9213560262f/772a561a-...,"evidently is agreat for LLM testing, use it",gpt-3.5-turbo,,Evidently is a fantastic tool for conducting L...,817,0.9774,NEUTRAL,The text primarily provides a factual descript...
3,a95e9e1d-c1ef-a1c2-236c-cae8818d0c6e,,2025-05-19 09:38:40.504285,0196e7e5-948b-71d9-b21d-e9213560262f/a95e9e1d-...,CI/CD is applicable in AI,gpt-3.5-turbo,,Continuous Integration/Continuous Deployment (...,747,0.8658,NEUTRAL,The text primarily presents factual informatio...
4,cc1fb1c7-d989-3aa8-42d1-bc5f4c13a7cc,,2025-05-19 09:38:38.687509,0196e7e5-948b-71d9-b21d-e9213560262f/cc1fb1c7-...,testing in AI engineering is as important as i...,gpt-3.5-turbo,,Testing in AI engineering is just as crucial a...,657,0.9371,NEUTRAL,The text primarily states facts and observatio...
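
The `Length` column points at a mismatch: the function is named for tweets, but the prompt asks for a paragraph, and every draft is far longer than the 280-character limit that the chained prompt later in this notebook targets. The sketch below checks this using the values copied from the table above. `TWEET_LIMIT` is a name assumed here for illustration.

```python
# Length values copied from the `Length` column in the output above.
lengths = [769, 793, 817, 747, 657]

TWEET_LIMIT = 280  # character limit used by the chained prompt later on

over_limit = [n for n in lengths if n > TWEET_LIMIT]
share_over_limit = len(over_limit) / len(lengths)
shortest_excess = min(lengths) - TWEET_LIMIT

print(f"{share_over_limit:.0%} of drafts exceed {TWEET_LIMIT} characters")
print(f"the shortest draft is {shortest_excess} characters over the limit")
```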


In [26]:
report = Report(
    metrics=[TextEvals()]
)

In [27]:
basic_tweets_eval = report.run(dataset, tags=["gpt-3.5-turbo", "basic generation"])

In [28]:
client.add_run(project.id, basic_tweets_eval, include_data=True)

UUID('0196e7ef-a820-7d83-9ae3-7d4d7b1e56c6')

## Better content generation

In [29]:
instruction="""You are a chief editor with 10 years of experience in technical writing. 
        You specialize in creating concise, engaging, and to-the-point content for engineers. 
        Your style is clear, direct, and focused on delivering technical value without fluff."""

In [30]:
better_tweets = [basic_tweet_generation(topic, model="gpt-4o-mini", instructions=instruction) for topic in my_topics]

In [31]:
updated_dataset = client.load_dataset("0196e7e5-948b-71d9-b21d-e9213560262f")

In [32]:
updated_dataset.as_dataframe()

Unnamed: 0,id,service_name,timestamp,_evidently_trace_link,basic_tweet_generation.topic,basic_tweet_generation.model,basic_tweet_generation.instructions,basic_tweet_generation.result
0,3245b5a5-b901-6c54-ad25-7de79cb3e6bc,unknown_service,2025-05-19 09:38:43.723715,0196e7e5-948b-71d9-b21d-e9213560262f/3245b5a5-...,Collaboration of subject matter experts and AI...,gpt-3.5-turbo,,Collaboration between subject matter experts a...
1,3b91742b-b1d3-7294-fe49-d94f2f54afb7,,2025-05-19 09:47:00.646931,0196e7e5-948b-71d9-b21d-e9213560262f/3b91742b-...,Start LLM apps development from test cases gen...,gpt-4o-mini,You are a chief editor with 10 years of experi...,Starting LLM (Large Language Model) applicatio...
2,6bce4626-b31e-2dc6-fd9d-4500e0406b64,,2025-05-19 09:38:45.886141,0196e7e5-948b-71d9-b21d-e9213560262f/6bce4626-...,Start LLM apps development from test cases gen...,gpt-3.5-turbo,,Start LLM Apps Development always begins with ...
3,6d143f05-c27a-833e-949c-ab1cdde781f1,,2025-05-19 09:46:56.858116,0196e7e5-948b-71d9-b21d-e9213560262f/6d143f05-...,Collaboration of subject matter experts and AI...,gpt-4o-mini,You are a chief editor with 10 years of experi...,The collaboration between subject matter exper...
4,772a561a-afb0-1dd8-9e3d-18967520d04b,,2025-05-19 09:38:47.938784,0196e7e5-948b-71d9-b21d-e9213560262f/772a561a-...,"evidently is agreat for LLM testing, use it",gpt-3.5-turbo,,Evidently is a fantastic tool for conducting L...
5,a95e9e1d-c1ef-a1c2-236c-cae8818d0c6e,,2025-05-19 09:38:40.504285,0196e7e5-948b-71d9-b21d-e9213560262f/a95e9e1d-...,CI/CD is applicable in AI,gpt-3.5-turbo,,Continuous Integration/Continuous Deployment (...
6,cc1fb1c7-d989-3aa8-42d1-bc5f4c13a7cc,,2025-05-19 09:38:38.687509,0196e7e5-948b-71d9-b21d-e9213560262f/cc1fb1c7-...,testing in AI engineering is as important as i...,gpt-3.5-turbo,,Testing in AI engineering is just as crucial a...
7,e3e2a58b-7ce6-00a7-95db-98529ca03b9b,,2025-05-19 09:46:53.222711,0196e7e5-948b-71d9-b21d-e9213560262f/e3e2a58b-...,CI/CD is applicable in AI,gpt-4o-mini,You are a chief editor with 10 years of experi...,Continuous Integration and Continuous Delivery...
8,fbb4def5-15a6-f93c-1979-a883f2579adf,,2025-05-19 09:47:05.751717,0196e7e5-948b-71d9-b21d-e9213560262f/fbb4def5-...,"evidently is agreat for LLM testing, use it",gpt-4o-mini,You are a chief editor with 10 years of experi...,Evidently is an excellent tool for testing lar...
9,fd889ff1-ab08-b867-babb-bae1e1f36040,,2025-05-19 09:46:49.489396,0196e7e5-948b-71d9-b21d-e9213560262f/fd889ff1-...,testing in AI engineering is as important as i...,gpt-4o-mini,You are a chief editor with 10 years of experi...,Testing in AI engineering is as crucial as in ...


In [33]:
tmp_frame = updated_dataset.as_dataframe()
better_tweets = tmp_frame[tmp_frame["basic_tweet_generation.model"] == "gpt-4o-mini"]

In [34]:
better_tweets

Unnamed: 0,id,service_name,timestamp,_evidently_trace_link,basic_tweet_generation.topic,basic_tweet_generation.model,basic_tweet_generation.instructions,basic_tweet_generation.result
1,3b91742b-b1d3-7294-fe49-d94f2f54afb7,,2025-05-19 09:47:00.646931,0196e7e5-948b-71d9-b21d-e9213560262f/3b91742b-...,Start LLM apps development from test cases gen...,gpt-4o-mini,You are a chief editor with 10 years of experi...,Starting LLM (Large Language Model) applicatio...
3,6d143f05-c27a-833e-949c-ab1cdde781f1,,2025-05-19 09:46:56.858116,0196e7e5-948b-71d9-b21d-e9213560262f/6d143f05-...,Collaboration of subject matter experts and AI...,gpt-4o-mini,You are a chief editor with 10 years of experi...,The collaboration between subject matter exper...
7,e3e2a58b-7ce6-00a7-95db-98529ca03b9b,,2025-05-19 09:46:53.222711,0196e7e5-948b-71d9-b21d-e9213560262f/e3e2a58b-...,CI/CD is applicable in AI,gpt-4o-mini,You are a chief editor with 10 years of experi...,Continuous Integration and Continuous Delivery...
8,fbb4def5-15a6-f93c-1979-a883f2579adf,,2025-05-19 09:47:05.751717,0196e7e5-948b-71d9-b21d-e9213560262f/fbb4def5-...,"evidently is agreat for LLM testing, use it",gpt-4o-mini,You are a chief editor with 10 years of experi...,Evidently is an excellent tool for testing lar...
9,fd889ff1-ab08-b867-babb-bae1e1f36040,,2025-05-19 09:46:49.489396,0196e7e5-948b-71d9-b21d-e9213560262f/fd889ff1-...,testing in AI engineering is as important as i...,gpt-4o-mini,You are a chief editor with 10 years of experi...,Testing in AI engineering is as crucial as in ...


In [35]:
better_dataset = Dataset.from_pandas(
    better_tweets,
    data_definition=dataset.data_definition,
    descriptors=descriptors,
)

In [36]:
better_dataset.as_dataframe()

Unnamed: 0,id,service_name,timestamp,_evidently_trace_link,basic_tweet_generation.topic,basic_tweet_generation.model,basic_tweet_generation.instructions,basic_tweet_generation.result,Length,Sentiment,Tweet quality,Tweet quality reasoning
1,3b91742b-b1d3-7294-fe49-d94f2f54afb7,,2025-05-19 09:47:00.646931,0196e7e5-948b-71d9-b21d-e9213560262f/3b91742b-...,Start LLM apps development from test cases gen...,gpt-4o-mini,You are a chief editor with 10 years of experi...,Starting LLM (Large Language Model) applicatio...,796,0.95,NEUTRAL,The text primarily consists of factual stateme...
3,6d143f05-c27a-833e-949c-ab1cdde781f1,,2025-05-19 09:46:56.858116,0196e7e5-948b-71d9-b21d-e9213560262f/6d143f05-...,Collaboration of subject matter experts and AI...,gpt-4o-mini,You are a chief editor with 10 years of experi...,The collaboration between subject matter exper...,718,0.9422,NEUTRAL,The text primarily states facts and observatio...
7,e3e2a58b-7ce6-00a7-95db-98529ca03b9b,,2025-05-19 09:46:53.222711,0196e7e5-948b-71d9-b21d-e9213560262f/e3e2a58b-...,CI/CD is applicable in AI,gpt-4o-mini,You are a chief editor with 10 years of experi...,Continuous Integration and Continuous Delivery...,709,0.8481,NEUTRAL,The text presents factual information about CI...
8,fbb4def5-15a6-f93c-1979-a883f2579adf,,2025-05-19 09:47:05.751717,0196e7e5-948b-71d9-b21d-e9213560262f/fbb4def5-...,"evidently is agreat for LLM testing, use it",gpt-4o-mini,You are a chief editor with 10 years of experi...,Evidently is an excellent tool for testing lar...,613,0.8658,NEUTRAL,The text merely states facts about the tool Ev...
9,fd889ff1-ab08-b867-babb-bae1e1f36040,,2025-05-19 09:46:49.489396,0196e7e5-948b-71d9-b21d-e9213560262f/fd889ff1-...,testing in AI engineering is as important as i...,gpt-4o-mini,You are a chief editor with 10 years of experi...,Testing in AI engineering is as crucial as in ...,728,0.9169,NEUTRAL,The text presents factual information regardin...
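
With descriptors computed for both runs, a quick side-by-side summary shows whether the system prompt changed anything. The sketch below uses only the values copied from the two descriptor tables in this notebook. `summarize` is a hypothetical helper, and the same numbers could be obtained from the dataframes directly.

```python
from statistics import mean

# Values copied from the two descriptor tables in this notebook.
basic = {
    "Length": [769, 793, 817, 747, 657],
    "Sentiment": [0.987, 0.9565, 0.9774, 0.8658, 0.9371],
    "Tweet quality": ["NEUTRAL"] * 5,
}
better = {
    "Length": [796, 718, 709, 613, 728],
    "Sentiment": [0.95, 0.9422, 0.8481, 0.8658, 0.9169],
    "Tweet quality": ["NEUTRAL"] * 5,
}


def summarize(run):
    """Aggregate one run's descriptor columns into three headline numbers."""
    labels = run["Tweet quality"]
    return {
        "mean_length": mean(run["Length"]),
        "mean_sentiment": round(mean(run["Sentiment"]), 4),
        "engaging_share": labels.count("ENGAGING") / len(labels),
    }


basic_summary = summarize(basic)
better_summary = summarize(better)
print(basic_summary)
print(better_summary)
```

On these ten rows the system prompt trims the mean length from 756.6 to 712.8 characters, while the share of ENGAGING verdicts stays at zero in both runs.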


In [37]:
better_tweets_eval = report.run(better_dataset, tags=["gpt-4o-mini", "basic generation"])

In [38]:
client.add_run(project.id, better_tweets_eval, include_data=True)

UUID('0196e7f3-edad-7477-8195-aebefddae614')

## Chained generation

In [39]:
init_tracing(
    project_id=str(project.id), # Project ID from Evidently Cloud
    export_name="content generation: chained",
    as_global=False
)

<opentelemetry.sdk.trace.TracerProvider at 0x30d741610>

In [40]:
instruction = """You are an engineer at Evidently AI. 
        You also have 10 years of experience in technical writing.
        You specialize in creating fun, concise, engaging, and to-the-point blogs for engineers.
        Your style is motivational, clear, direct, and focuses on delivering technical value without fluff.
        """

In [41]:
@trace_event()
def chained_tweet_generation(tweet, model="gpt-3.5-turbo", instructions=""):
    response = OA_client.responses.create(
        instructions=instructions,
        model=model,
        input=(
            "Make this tweet more engaging, fun. Make it more personal. "
            "Make this tweet fit under 280 characters. Use a persuasive tone. "
            f"Do not use any emojis or hashtags.\nTweet: {tweet}"
        ),
    )

    refined_tweet = response.output_text
    return refined_tweet
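
The chained prompt asks for output under 280 characters with no hashtags, and both constraints can be verified deterministically before any LLM-based evaluation. The sketch below is one way to do that. `constraint_violations` and `TWEET_LIMIT` are hypothetical names, and the hashtag pattern (`#` followed by word characters) is a simplifying assumption.

```python
import re

TWEET_LIMIT = 280


def constraint_violations(tweet):
    """Return the prompt constraints that `tweet` breaks (an empty list means it passes)."""
    problems = []
    # "fit under 280 characters" is read strictly here: 280 itself is flagged.
    if len(tweet) >= TWEET_LIMIT:
        problems.append(f"too long: {len(tweet)} >= {TWEET_LIMIT}")
    if re.search(r"#\w+", tweet):
        problems.append("contains a hashtag")
    return problems


print(constraint_violations("Ship tests before prompts. Your future self will thank you."))
print(constraint_violations("Test your LLM app #AI"))
```

Running this over `refined_tweets` would flag the rows that need another refinement pass.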

In [42]:
refined_tweets = [chained_tweet_generation(tweet, "gpt-4.1", instruction) for tweet in better_tweets["basic_tweet_generation.result"]]

In [43]:
chain_dataset = client.load_dataset("0196e7f4-d756-7703-826a-fc43b1b275e2")

In [44]:
chain_dataset.data_definition

DataDefinition(id_column=None, timestamp='timestamp', service_columns=ServiceColumns(trace_link='_evidently_trace_link'), numerical_columns=[], categorical_columns=['chained_tweet_generation.model', 'chained_tweet_generation.instructions'], text_columns=['id', 'service_name', 'chained_tweet_generation.tweet', 'chained_tweet_generation.result'], datetime_columns=[], classification=None, regression=None, llm=None, numerical_descriptors=[], categorical_descriptors=[], test_descriptors=None, ranking=None)

In [45]:
chain_descriptors = [
    TextLength("chained_tweet_generation.result", alias="Length"),
    Sentiment("chained_tweet_generation.result", alias="Sentiment"),
    LLMEval("chained_tweet_generation.result", template=tweet_quality,
           provider="openai", model="gpt-4o-mini",
           alias="Tweet quality")
]

In [46]:
chain_dataset.add_descriptors(descriptors=chain_descriptors)

In [47]:
chain_tweets_eval = report.run(chain_dataset, tags=["gpt-4.1", "chained generation"])

In [48]:
client.add_run(project.id, chain_tweets_eval, include_data=True)

UUID('0196e7f8-2a31-7993-93a1-930645018c89')