# Fiddler LLM Evaluation Quick Start Guide

Fiddler is the pioneer in enterprise AI Observability, offering a unified platform that lets all model stakeholders monitor model performance and investigate the source of model degradation. The platform supports both traditional ML models and generative AI applications, and it can also help teams evaluate candidate LLMs before they build an application. This guide explains how to compare the outputs of different LLMs, such as GPT-3.5 and Claude, to help determine the most suitable model for your application.

---

You can start using Fiddler ***in minutes*** by following these 6 quick steps:

1. Connect to Fiddler
2. Create or Retrieve a Fiddler Project
3. Load Data Samples
4. Enable Specific Fiddler LLM Enrichments
5. Provide Information About the LLM Project
6. Publish Datasets for Model Comparison

Get insights!

## 0. Imports

In [1]:
%pip install -q fiddler-client

import numpy as np
import pandas as pd
import fiddler as fdl

print(f"Running Fiddler Python client version {fdl.__version__}")

Note: you may need to restart the kernel to use updated packages.
Running Fiddler Python client version 3.7.1


## 1. Connect to Fiddler

Before you can add information about your LLM datasets to Fiddler, you'll need to connect using the Fiddler Python client.


---

**We need a couple of pieces of information to get started.**
1. The URL you're using to connect to Fiddler
2. Your authorization token

Your authorization token can be found by navigating to the **Credentials** tab on the **Settings** page of your Fiddler environment.

In [2]:
URL = ''  # Make sure to include the full URL (including https:// e.g. 'https://your_company_name.fiddler.ai').
TOKEN = ''
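
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads both values from environment variables. The names `FIDDLER_URL` and `FIDDLER_TOKEN` and the helper `load_fiddler_credentials` are assumptions made for this example; they are not defined by the Fiddler client.

```python
import os


def load_fiddler_credentials(env=None):
    """Return (url, token) read from environment variables.

    FIDDLER_URL and FIDDLER_TOKEN are hypothetical names chosen for this
    sketch; they are not defined by the Fiddler client.
    """
    env = os.environ if env is None else env
    url = env.get('FIDDLER_URL', '').strip().rstrip('/')
    token = env.get('FIDDLER_TOKEN', '').strip()
    if not url.startswith('https://'):
        raise ValueError('FIDDLER_URL must be the full URL, including https://')
    if not token:
        raise ValueError('FIDDLER_TOKEN is empty')
    return url, token
```

With both variables set in your shell, `URL, TOKEN = load_fiddler_credentials()` could replace the two assignments above.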

These are the constants for this example notebook. Change them as needed to create your own versions.

In [3]:
PROJECT_NAME = 'ash_quickstart_examples'  # If the project already exists, the notebook will create the model under the existing project.
MODEL_NAME = 'fiddler_llm_evaluation'

GPT_NAME = 'gpt3.5_dataset'
CLAUDE_NAME = 'claude_dataset'

# Sample data hosted on GitHub
PATH_TO_SAMPLE_GPT_CSV = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/refs/heads/main/quickstart/data/chat_sample_part1.csv'
PATH_TO_SAMPLE_CLAUDE_CSV = 'https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/refs/heads/main/quickstart/data/chat_sample_part2.csv'

Now just run the following code block to connect to the Fiddler API!

In [4]:
fdl.init(url=URL, token=TOKEN)

250111T01:42:34.997Z     INFO| attached stderr handler to logger: auto_attach_log_handler=True, and root logger not configured 
250111T01:42:34.998Z     INFO| http: https://preprod.cloud.fiddler.ai/v3/server-info GET -- emit req (0 B, timeout: (5, 15)) 
250111T01:42:35.155Z     INFO| http: https://preprod.cloud.fiddler.ai/v3/server-info GET -- resp code: 200, took 0.157 s, resp/req body size: (841 B, 0 B) 
250111T01:42:35.157Z     INFO| http: https://preprod.cloud.fiddler.ai/v3/version-compatibility GET -- emit req (0 B, timeout: (5, 15)) 
250111T01:42:35.205Z     INFO| http: https://preprod.cloud.fiddler.ai/v3/version-compatibility GET -- resp code: 200, took 0.048 s, resp/req body size: (2 B, 0 B) 


## 2. Create a Fiddler Project

Once you connect, you can create a new project by passing a unique project name as the `name` parameter to either `Project.create()` or `Project.get_or_create()`. If the project already exists, `get_or_create()` returns the existing project instead. This is helpful when you run this notebook multiple times or use an existing project to house Fiddler examples.

*Note: get_or_create() requires Fiddler Python client 3.7+.*
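
Because of this version requirement, it may be worth checking the installed client version before calling `get_or_create()`. The sketch below compares the major and minor parts of a version string against 3.7. The helper names `version_tuple` and `supports_get_or_create` are hypothetical and not part of the Fiddler client.

```python
import re


def version_tuple(version):
    """Return (major, minor) parsed from a version string such as '3.7.1'."""
    numbers = []
    for piece in version.split('.')[:2]:
        # Keep only the leading digits so suffixes like 'rc1' are ignored.
        match = re.match(r'\d+', piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)


def supports_get_or_create(version):
    """True when the client version is 3.7 or newer, per the note above."""
    return version_tuple(version) >= (3, 7)
```

In the notebook this could be called as `supports_get_or_create(fdl.__version__)`, falling back to `Project.create()` on older clients.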

In [5]:
project = fdl.Project.get_or_create(name=PROJECT_NAME)

# get_or_create() returns a project with an ID whether it was just created or already existed
print(f'Using project with id = {project.id} and name = {project.name}')


250111T01:42:35.213Z     INFO| http: https://preprod.cloud.fiddler.ai/v3/projects GET -- emit req (0 B, timeout: (5, 100)) 
250111T01:42:35.289Z     INFO| http: https://preprod.cloud.fiddler.ai/v3/projects GET -- resp code: 200, took 0.075 s, resp/req body size: (424 B, 0 B) 


Using project with id = fb5cab59-ed67-47da-b109-659256697af4 and name = ash_quickstart_examples


## 3. Load Dataset Samples

In [6]:
gpt_df = pd.read_csv(PATH_TO_SAMPLE_GPT_CSV)
claude_df = pd.read_csv(PATH_TO_SAMPLE_CLAUDE_CSV)

gpt_df

Unnamed: 0,row_id,comment,completion_tokens,duration,feedback,feedback2,model_name,prompt,prompt_tokens,response,run_id,session_id,source_doc0,source_doc1,source_doc2,total_tokens,ts
0,0122875f-58d6-4653-a63f-a479aedba732,Does Fiddler support explainations for object ...,51,5.6889,0,,gpt-3.5-turbo,I asked for an example,5,Sure! Point explanations in Fiddler are genera...,0122875f-58d6-4653-a63f-a479aedba732,a07fdcc2-8b5b-4910-a7d5-fc7801cc56ba,"---\ntitle: ""Point Explainability""\nslug: ""po...","---\ntitle: ""Point Explainability""\nslug: ""po...","---\ntitle: ""Point Explanations""\nslug: ""poin...",56,2024-04-20 0:00:00
1,321af934-6336-4a62-b881-cab8eaf37c3e,,56,3.7242,0,,gpt-3.5-turbo,Can you give me an example of how point explai...,12,Point explanations are generated using additiv...,321af934-6336-4a62-b881-cab8eaf37c3e,a07fdcc2-8b5b-4910-a7d5-fc7801cc56ba,"---\ntitle: ""Point Explainability""\nslug: ""po...","---\ntitle: ""Point Explainability""\nslug: ""po...","slug: ""point-explainability"" _The first numb...",68,2024-04-19 23:01:49
2,7718e0d7-be31-4eb0-9d99-31ff8c74c0c8,,70,3.3897,0,,gpt-3.5-turbo,Do I have to provide my gradients?,7,"No, it is not necessary for you to provide you...",7718e0d7-be31-4eb0-9d99-31ff8c74c0c8,a07fdcc2-8b5b-4910-a7d5-fc7801cc56ba,"slug: ""global-explainability"" https://www.fid...","slug: ""global-explainability"" https://www.fid...","slug: ""point-explanations"" space where disti...",77,2024-04-19 22:03:38
3,0438a48b-336b-4bb8-8fc4-c351049dc4f7,,163,4.3193,1,like,gpt-3.5-turbo,What is explainability? What kind of explainat...,10,Explainability refers to the ability to unders...,0438a48b-336b-4bb8-8fc4-c351049dc4f7,a07fdcc2-8b5b-4910-a7d5-fc7801cc56ba,"---\ntitle: ""Point Explainability""\nslug: ""po...","---\ntitle: ""Point Explainability""\nslug: ""po...","---\ntitle: ""Point Explainability""\nslug: ""po...",173,2024-04-19 21:05:27
4,4cfda560-8ea0-4ee1-9c7e-e2ec6e8663ce,,69,0.4639,1,like,gpt-3.5-turbo,Are there example notebooks to get started wit...,10,"Yes, there are example notebooks available to ...",4cfda560-8ea0-4ee1-9c7e-e2ec6e8663ce,3ec4ea93-7ca7-46b6-8b6e-c376521ef7dd,"---\ntitle: ""Simple Monitoring""\nslug: ""quick...","---\ntitle: ""Simple Monitoring""\nslug: ""quick...","---\ntitle: ""NLP Monitoring""\nslug: ""simple-n...",79,2024-04-19 20:07:16
5,6f70fb8f-0768-45ea-8dea-d7d106b4a12f,,58,3.6482,1,like,gpt-3.5-turbo,is it possible to set alerts on metrics such a...,14,"Yes, alerts can be set on metrics such as accu...",6f70fb8f-0768-45ea-8dea-d7d106b4a12f,3ec4ea93-7ca7-46b6-8b6e-c376521ef7dd,"slug: ""fraud-detection"" \n2. Monitoring vario...","slug: ""fraud-detection"" \n2. Monitoring vario...",| Model Task Type | Metric ...,72,2024-04-19 19:09:05
6,2a557053-cf7a-40af-a017-fddf19e92c30,,34,4.2104,0,,gpt-3.5-turbo,How many models I can add per project?,8,There is no specific mention of the maximum nu...,2a557053-cf7a-40af-a017-fddf19e92c30,3ec4ea93-7ca7-46b6-8b6e-c376521ef7dd,"---\ntitle: ""About Models""\nslug: ""about-mode...","---\ntitle: ""About Models""\nslug: ""about-mode...","slug: ""project-structure"" \n\nYou can collate...",42,2024-04-19 18:10:55
7,2d89ff99-b9f2-463c-b03f-cfd7737a9164,,28,1.4583,-1,dislike,gpt-3.5-turbo,How frequently I can publish events?,6,The documentation does not provide information...,2d89ff99-b9f2-463c-b03f-cfd7737a9164,3ec4ea93-7ca7-46b6-8b6e-c376521ef7dd,"---\ntitle: ""Publishing Production Data""\nslu...","---\ntitle: ""Publishing Production Data""\nslu...","slug: ""quick-start"" absolute** or **relative*...",34,2024-04-19 17:12:44
8,74eb3163-1fdc-4923-be0c-3c0c7331a30a,,33,1.7937,1,like,gpt-3.5-turbo,Which model format types Fiddler API supports?,7,The Fiddler API supports the following model f...,74eb3163-1fdc-4923-be0c-3c0c7331a30a,3ec4ea93-7ca7-46b6-8b6e-c376521ef7dd,"slug: ""product-tour"" )\n\n**Projects** repres...","slug: ""product-tour"" )\n\n**Projects** repres...","slug: ""fdlmodelinfo"" ""None"",\n ""12-3"": ""A...",40,2024-04-19 16:14:33
9,3f8fa9c3-f32a-4594-98ec-adfffd75d2ba,,70,5.2251,0,,gpt-3.5-turbo,how do you support llmops,5,Fiddler supports MLOps by providing monitoring...,3f8fa9c3-f32a-4594-98ec-adfffd75d2ba,3b4e843a-4d70-4fe4-b310-50f81535e250,"---\ntitle: ""ML Algorithms In Fiddler""\nslug:...","---\ntitle: ""ML Algorithms In Fiddler""\nslug:...","slug: ""global-explainability"" https://www.fid...",75,2024-04-19 15:16:22
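
The preview above shows columns named `prompt`, `model_name`, `ts`, and `source_doc0` through `source_doc2`, while the enrichment and `ModelSpec` cells later in this guide refer to `question`, `model`, `timestamp`, and `source_docs`. If your copy of the sample data uses the names from the preview, a sketch like the following could align them. The mapping, the helper name `align_columns`, and the choice to join the three source documents with newlines are all assumptions; this guide does not specify them.

```python
import pandas as pd

# Assumed mapping from the column names in the preview to the names used
# by the enrichment and ModelSpec cells.
RENAMES = {'prompt': 'question', 'model_name': 'model', 'ts': 'timestamp'}
SOURCE_DOC_COLUMNS = ['source_doc0', 'source_doc1', 'source_doc2']


def align_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with renamed columns and a combined source_docs column."""
    out = df.rename(columns=RENAMES)
    present = [c for c in SOURCE_DOC_COLUMNS if c in out.columns]
    if present and 'source_docs' not in out.columns:
        # Join the retrieved documents into one text field, skipping missing values.
        out['source_docs'] = out[present].apply(
            lambda row: '\n'.join(str(v) for v in row if pd.notna(v)), axis=1
        )
    return out
```

Applied to this notebook, that would be `gpt_df = align_columns(gpt_df)` and `claude_df = align_columns(claude_df)`.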


## 4. Enable Fiddler LLM Enrichments

After picking a sample of our chatbot's prompts and responses, we can ask Fiddler to run a series of enrichment services that "score" them for a variety of insights. These services can detect AI safety issues such as PII leakage, hallucinations, and toxicity. We can also opt in to enrichments like embedding generation, which lets us track outliers and drift in prompts and responses. See the [enrichments documentation](https://docs.fiddler.ai/platform-guide/llm-monitoring/enrichments-private-preview) for a full description.

---
Define a list of Fiddler AI backend enrichments for various aspects of the model's input and output, including text embeddings, sentiment analysis, and PII detection. Each enrichment is represented by an appropriate Fiddler API enrichment object, such as TextEmbedding or Enrichment, with associated configuration.

In [7]:
fiddler_backend_enrichments = [
    # Generate text embeddings for the prompt (question) column
    fdl.TextEmbedding(
        name='Prompt TextEmbedding',
        source_column='question',
        column='Enrichment Prompt Embedding',
        n_tags=10,
    ),
    # Generate text embeddings for the response column
    fdl.TextEmbedding(
        name='Response TextEmbedding',
        source_column='response',
        column='Enrichment Response Embedding',
        n_tags=10,
    ),
    # Generate text embeddings for the source documents (rag documents) column
    fdl.TextEmbedding(
        name='Source Docs TextEmbedding',
        source_column='source_docs',
        column='Enrichment Source Docs Embedding',
        n_tags=10,
    ),
    # Enrichment to assess response faithfulness using source docs and the response
    fdl.Enrichment(
        name='Faithfulness',
        enrichment='ftl_response_faithfulness',
        columns=['source_docs', 'response'],
        config={'context_field': 'source_docs', 'response_field': 'response'},
    ),
    # Perform sentiment analysis on the question and response columns
    fdl.Enrichment(
        name='Enrichment QA Sentiment',
        enrichment='sentiment',
        columns=['question', 'response'],
    ),
    # Detect personally identifiable information (PII) in the question column
    fdl.Enrichment(
        name='Rag PII', enrichment='pii', columns=['question'], allow_list=['fiddler']
    ),
]


## 5.  Provide Information About the LLM Project

Now it's time to onboard information about our LLM dataset to Fiddler.  We do this by defining a `ModelSpec` object.


---


The `ModelSpec` object will contain some **information about how your LLM datasets are structured**.
  
*Just include:*
1. The **input/output** columns.  These are just the raw inputs and outputs tracked in our LLM dataset.
2. Any **metadata** columns. Make sure to include the 'model' column, which records the LLM that produced each row.
3. The **custom features** which contain the configuration of the enrichments we opted for.

We'll also want to set the **task** to LLM, since these datasets are generated from LLMs.


In [8]:
model_spec = fdl.ModelSpec(
    inputs=['question', 'response', 'source_docs'],
    metadata=['session_id', 'comment', 'timestamp', 'feedback', 'model'],
    custom_features=fiddler_backend_enrichments,
)

model_task = fdl.ModelTask.LLM
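
Before onboarding, it can help to confirm that every column named in the spec actually exists in the data. The sketch below does this with plain Python lists. The helper name `missing_columns` is hypothetical and not part of the Fiddler client; the column names are copied from the `ModelSpec` cell above.

```python
def missing_columns(required, available):
    """Return the required column names that are absent from available, in order."""
    available_set = set(available)
    seen = set()
    missing = []
    for name in required:
        if name not in available_set and name not in seen:
            missing.append(name)
            seen.add(name)
    return missing


# Column names taken from the ModelSpec cell in this guide.
spec_columns = [
    'question', 'response', 'source_docs',
    'session_id', 'comment', 'timestamp', 'feedback', 'model',
]
```

Calling `missing_columns(spec_columns, gpt_df.columns)` should return an empty list when the data matches the spec; any names it returns need to be added or renamed first.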

Set this up in Fiddler by configuring a Model object to represent your LLM evaluation project.

In [None]:
llm_project = fdl.Model.from_data(
    source=gpt_df,
    name=MODEL_NAME,
    project_id=project.id,
    spec=model_spec,
    task=model_task,
    max_cardinality=5,
)


Now call the create method to create it in Fiddler.

In [None]:
llm_project.create()
print(
    f'New model created with id = {llm_project.id} and name = {llm_project.name}'
)

## 6. Publish Data for Comparison

Information about the LLM datasets is now onboarded to Fiddler. It's time to actually start adding the data itself to the preproduction environment for comparison!

  
Let's start by publishing the sample data (prompts and responses) from our GPT dataset.

In [None]:
publish_job_gpt = llm_project.publish(
    source=gpt_df,
    environment=fdl.EnvType.PRE_PRODUCTION,
    dataset_name=GPT_NAME,
)

# Print the Job ID for tracking
print(f'Initiated pre-production environment data upload with Job ID = {publish_job_gpt.id}')

Finally, publish the Claude dataset so it can be compared with the first.

In [None]:
publish_job_claude = llm_project.publish(
    source=claude_df,
    environment=fdl.EnvType.PRE_PRODUCTION,
    dataset_name=CLAUDE_NAME,
)

# Print the Job ID for tracking
print(f'Initiated pre-production environment data upload with Job ID = {publish_job_claude.id}')
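
While the publish jobs run, a quick local summary of the two samples can serve as a sanity check on the data being compared. This sketch assumes the `duration`, `total_tokens`, and `feedback` columns shown in the preview in step 3. The helper name `summarize_by_dataset` is hypothetical.

```python
import pandas as pd


def summarize_by_dataset(frames: dict) -> pd.DataFrame:
    """Return per-dataset row count, mean duration, mean total tokens, and mean feedback."""
    rows = []
    for name, df in frames.items():
        rows.append({
            'dataset': name,
            'rows': len(df),
            'mean_duration': df['duration'].mean(),
            'mean_total_tokens': df['total_tokens'].mean(),
            'mean_feedback': df['feedback'].mean(),
        })
    return pd.DataFrame(rows).set_index('dataset')
```

In this notebook it could be called as `summarize_by_dataset({GPT_NAME: gpt_df, CLAUDE_NAME: claude_df})`.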


## Get insights

**You're all done!**

You can now head to your Fiddler environment and start comparing your Claude and GPT-3.5 datasets using metric cards.

<table>
    <tr>
        <td>
            <img src="https://raw.githubusercontent.com/fiddler-labs/fiddler-examples/main/quickstart/images/LLM_evaluation_metric_cards.png" />
        </td>
    </tr>
</table>

**What's Next?**

Try the [ML Monitoring - Quick Start Guide](https://docs.fiddler.ai/quickstart-notebooks/quick-start)

---


**Questions?**  
  
Check out [our docs](https://docs.fiddler.ai/) for a more detailed explanation of what Fiddler has to offer.

Join our [community Slack](http://fiddler-community.slack.com/) to ask any questions!

If you're still looking for answers, submit a ticket on [our support page](https://fiddlerlabs.zendesk.com/) and we'll get back to you shortly.