
Note: This sample notebook gives hands-on experience with some of the features of the DEP AI deval package.

# Introduction: the deval package and its purpose
Evaluating Large Language Models (LLMs) is crucial for optimizing their performance and ensuring their reliability.
This evaluation typically occurs in two primary settings.
The first is data-driven experimentation, where a set of data points is used to optimize solution parameters, prompts, and other variables.
This is usually a repeatable, batch workflow where a large dataset is processed, and aggregated results are expected after a certain period.
The second setting is an online environment, where results are instantly available for the current state of the solution and a small data sample.
This online setting is useful for early experimentation when a larger dataset is not yet available or for online monitoring of a production solution to detect drift or other real-time metrics.

The focus of the deval package is to provide quality evaluation, assessing the quality of a model's response based on user input and additional context.
Since model responses are unstructured text, models themselves must evaluate the output, as there is no single correct answer and quality must be carefully defined.
The models that assess the output of other LLMs are known as LLM-as-a-Judge.
These are usually general-purpose models used in specific workflows to generate evaluation metrics.
These workflows are not instantaneous, and using the models can be costly on a large scale.
Therefore, LLMs-as-a-Judge are typically employed for batch processing to optimize solution parameters and detect drift, as well as for early online experimentation.

Online monitoring of a production solution is generally expensive and only used for special applications.
In such cases, other inexpensive, classic metrics like user-input length, model output length, latency, compute usage, etc., are more relevant.
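
The classic metrics mentioned above are cheap because they need no judge model. A minimal sketch of how they could be collected around a single call is shown here; `cheap_metrics` and `fake_solution` are hypothetical names used only for illustration and are not part of deval.

```python
import time
from typing import Callable, Dict


def cheap_metrics(user_input: str, generate: Callable[[str], str]) -> Dict[str, float]:
    """Call the solution once and record inexpensive, classic metrics."""
    start = time.perf_counter()
    output = generate(user_input)
    latency_s = time.perf_counter() - start
    return {
        "input_chars": len(user_input),
        "output_chars": len(output),
        "latency_s": latency_s,
    }


# Stand-in for a real GenAI solution (hypothetical, for illustration only).
def fake_solution(prompt: str) -> str:
    return prompt.upper()


metrics = cheap_metrics("How is the weather today", fake_solution)
print(metrics)
```

In a production setting, values like these would typically be sent to a monitoring tool rather than printed.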

Below, we first show experimentation against a dataset to test and optimize a solution.
We then use deval for online monitoring, where a response is available immediately.
Online monitoring is also instrumental in identifying LLM-as-a-Judge metrics and customizing them to meet project needs.

## Installation
In this Codespace, the DEval Python SDK is already installed from the Deloitte Artifactory. There are four main ways to install the DEval SDK:
- Through Deloitte Artifactory
- Download the latest release as zip: https://github.com/Deloitte-US-Engineering/devals/releases
- Clone the source code repo: https://github.com/Deloitte-US-Engineering/devals
- Build a wheel file from the source code (download the release code or clone the repo, then use a tool such as setuptools: https://pypi.org/project/setuptools/)

## Minimal Example

### Getting an LLM as a Judge and embedding model

TL;DR

To evaluate a GenAI solution, an LLM-as-a-Judge scores its unstructured output: a general-purpose LLM generates a score, where a higher score indicates better quality.

We use a general-purpose LLM as the judge; the code below configures `gemini-2.5-flash`. The model is served through LiteLLM, which is integrated into the DEP AI LLM Gateway accelerator. LiteLLM runs in a container in this Codespace and is reachable locally on port 4000. Evaluation frameworks provide specialized workflows that use the judge to generate scores for specific metrics.

Some metrics use semantic distance, which requires an embedding model. In this example, the embedding model (`gemini-embedding-001` in the code below) is also served through LiteLLM inside the LLM Gateway.

In [None]:
from langchain_openai import ChatOpenAI
my_llm = ChatOpenAI(
    model="gemini-2.5-flash",
    api_key="sk-TE5BPNfSh4IOCNpW3I5EDQ",  # if you prefer to pass api key in directly instead of using env vars
    base_url="http://0.0.0.0:4000"
)

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(api_key="sk-TE5BPNfSh4IOCNpW3I5EDQ",
    base_url="http://0.0.0.0:4000",
    model="gemini-embedding-001")
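
To illustrate what "semantic distance" means for embedding-based metrics, here is a minimal sketch of cosine distance between two vectors. That deval's metrics use exactly this distance is an assumption; the helper `cosine_distance` and the toy vectors are hypothetical and stand in for real embeddings.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; values near 0.0 mean the same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
# With the model above, real vectors could be obtained via embeddings.embed_query(text).
answer_vec = [1.0, 0.0, 1.0]
reference_vec = [1.0, 0.0, 1.0]
unrelated_vec = [0.0, 1.0, 0.0]

print(cosine_distance(answer_vec, reference_vec))
print(cosine_distance(answer_vec, unrelated_vec))
```

Identical directions give a distance close to zero, while orthogonal vectors give a distance close to one.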

### Defining the evaluation data

TL;DR
- deval provides a loose data structure that defines an atomic request

Data for evaluating an LLM can always be abstracted to a question, an answer, a context, and a ground truth. The context can be many things: output from a RAG pipeline, search engine results, chat history, or a full text. The ground truth is not a strict structure either; it can be an entity for classification, a JSON document for extraction tasks, or a set of possible answers.

In deval we use the TypedDict EvaluatorRequest, which defines the question, answer, context, and ground truth structure for one atomic request.

In [None]:
from deval.evaluators.models import EvaluatorRequest
#### EvaluatorRequest is our standard data format; it is a TypedDict, so its contents are not validated at runtime ####
#help(EvaluatorRequest) # Execute me for more information

my_evaluation_data: EvaluatorRequest = {
    "question": "How is the weather today",
    "answer": "Raining cats and dogs",
    "context": "Currently a storm system is present rotating around an area of low pressure, which produces strong winds and heavy rain."
}
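
Because a TypedDict is not validated at runtime, malformed requests pass silently unless they are checked explicitly. The sketch below illustrates this with a hypothetical stand-in class, `RequestSketch`, and a hypothetical helper, `missing_fields`; neither is part of deval, and the real field set of EvaluatorRequest may differ.

```python
from typing import TypedDict


# Hypothetical stand-in for deval's EvaluatorRequest; the real field set may differ.
class RequestSketch(TypedDict, total=False):
    question: str
    answer: str
    context: str


def missing_fields(request: dict, required: tuple) -> list:
    """Return the required keys that are absent or empty in the request."""
    return [key for key in required if not request.get(key)]


# At runtime a TypedDict is a plain dict, so this wrong value type raises no error.
bad_request: RequestSketch = {"question": 42}  # type: ignore[typeddict-item]

print(type(bad_request) is dict)
print(missing_fields(bad_request, ("question", "answer", "context")))
```

A static type checker such as mypy would flag the wrong value type; at runtime, only an explicit check like `missing_fields` notices incomplete data.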

### Using a pre-defined evaluation workflow

TL;DR
- In deval, a “preset” is a predefined workflow for evaluating LLM outputs.
- Presets combine a JSON-like structure for metrics and options, and can include custom Python code.
- Presets are portable and can be shared or executed anywhere.

In deval, a preset is a predefined workflow used to evaluate LLM outputs. Each preset consists of two main components:

1. A JSON-like structure that defines the metrics to be used and basic customization options.
2. Optional Python code for custom instructions to be executed at runtime.

A preset contains all the necessary information to evaluate an LLM output, making it easy to create once and reuse anywhere, at any time. Users are free to share presets in any format they choose; deval does not enforce a specific sharing method.

For example, when evaluating question-answer scenarios, we use the “question answer” preset, which includes four metrics: groundedness, response_drift_embedding, response_drift, and coherence_measure. In our minimal example, the response drift metrics cannot be calculated because they require ground truth data, which is not present in our input. These metrics will be automatically skipped during evaluation.

In [None]:
import pprint
from deval.presets.qa_preset import QAPreset
#### The QAPreset is a pre-defined workflow ####
#help(QAPreset()) # Execute me for more information

evaluation_workflow = QAPreset()
evaluation_workflow.set_llm_as_a_judge(my_llm)
evaluation_workflow.set_embedding_model(embeddings)

#### You can get the metrics inside the preset with get_metrics() ####
print("These are the names of the metrics inside the workflow.\n"
      "We have standardized the names of the metrics based on this work: https://kx.deloitte/documents/view/92541 \n"
      "Nevertheless, the native names of the evaluation frameworks are also available, so users can decide which naming convention they prefer.\n")
pprint.pprint(["Framework Name(s): {} \nDeloitte Name: {} \n {}".format(x['name'], x['name_deloitte'], "-"*10) for x in evaluation_workflow.get_metrics()])

# pprint.pprint([x for x in evaluation_workflow.get_metrics()]) # Execute me for more detailed information about the metrics

### Execution

TL;DR
- If the data does not meet a metric’s requirements within a preset, that metric will be skipped.
- Langfuse integration is enabled by default in Deval but can be explicitly disabled.
- Deval returns results in a standardized format, consistent across different evaluation frameworks.

All Deval metrics include metadata that indicates whether a metric can be calculated. If a metric cannot be calculated, usually due to insufficient data, it will be skipped and a warning will be logged.
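
The skip behavior described above can be pictured roughly as follows. This is a minimal sketch, not deval's implementation: the `REQUIRED_FIELDS` table, the `runnable_metrics` helper, the `ground_truth` key, and the fields required per metric are all assumptions made for illustration.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger("metric_skip_sketch")

# Hypothetical metadata: metric name -> request fields the metric needs.
REQUIRED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "groundedness": ("answer", "context"),
    "response_drift": ("answer", "ground_truth"),
}


def runnable_metrics(request: Dict[str, str]) -> List[str]:
    """Return the metrics whose required fields are all present and non-empty."""
    runnable = []
    for name, fields in REQUIRED_FIELDS.items():
        missing = [field for field in fields if not request.get(field)]
        if missing:
            # Mirror the described behavior: skip the metric and log a warning.
            logger.warning("Skipping %s: missing %s", name, ", ".join(missing))
        else:
            runnable.append(name)
    return runnable


request = {
    "question": "How is the weather today",
    "answer": "Raining cats and dogs",
    "context": "A storm system produces strong winds and heavy rain.",
}
print(runnable_metrics(request))
```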

For monitoring, Deval is automatically integrated with Langfuse. This integration can be disabled by setting the environment variable DEVAL_LANGFUSE to "0" or "False". Langfuse integration will also be deactivated if the Python Langfuse SDK is not installed or if the Langfuse API key is missing from the environment variables.
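
The three conditions in the paragraph above can be read as a simple decision function. The sketch below is an illustration under stated assumptions, not deval's code: the helper name `langfuse_enabled`, the case-insensitive handling of the flag, and the use of `LANGFUSE_SECRET_KEY` as the API key variable are all assumptions.

```python
import importlib.util
import os
from typing import Mapping


def langfuse_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Decide whether tracking should be active, following the three rules above."""
    # Rule 1: the flag DEVAL_LANGFUSE explicitly disables the integration.
    flag = env.get("DEVAL_LANGFUSE", "1").strip().lower()
    if flag in {"0", "false"}:
        return False
    # Rule 2: the Langfuse Python SDK must be importable.
    if importlib.util.find_spec("langfuse") is None:
        return False
    # Rule 3: an API key must be present (variable name is an assumption).
    return bool(env.get("LANGFUSE_SECRET_KEY"))


print(langfuse_enabled({"DEVAL_LANGFUSE": "0"}))
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.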

Deval outputs results as a standardized dictionary, ensuring consistency across all supported evaluation frameworks. Additionally, Deval can automatically track results in monitoring tools like Langfuse.

In [None]:
#### Explicitly deactivate langfuse inside devals to keep it minimal ####
import os
os.environ['DEVAL_LANGFUSE'] = "0"
result = evaluation_workflow.calc_metrics(my_evaluation_data)

#### The return value is a list with one dictionary per metric. The metrics 'response_drift_embedding' and 'response_drift' require a ground truth, which was not given, so they are skipped ####
print(["{} : {}".format(x["name"], x["score"]) for x in result])

#### Warning ####
# Some metrics may be skipped because the LLM does not respond in the format the evaluation framework expects.
# This can have several causes, but usually the prompts inside the evaluation framework were designed for one specific LLM type and version.
# This underlines the importance of defining strict presets, including the LLM definition and settings, to ensure reproducibility.
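
For the batch setting described in the introduction, per-request results are usually aggregated. The sketch below averages scores per metric across several result lists, assuming only the `name` and `score` keys used above. The helper `mean_scores` is hypothetical, and the sample values are made up for illustration; they are not real evaluation output.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional


def mean_scores(batch_results: List[List[Dict[str, Optional[float]]]]) -> Dict[str, float]:
    """Average each metric's score over all requests, ignoring missing scores."""
    collected = defaultdict(list)
    for result in batch_results:
        for metric in result:
            score = metric.get("score")
            if score is not None:
                collected[metric["name"]].append(score)
    return {name: mean(scores) for name, scores in collected.items()}


# Made-up results for two requests (illustrative values only).
batch = [
    [{"name": "groundedness", "score": 1.0}, {"name": "coherence_measure", "score": 0.8}],
    [{"name": "groundedness", "score": 0.5}, {"name": "coherence_measure", "score": None}],
]
summary = mean_scores(batch)
print(summary)
```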



For more examples, see https://github.com/Deloitte-US-Engineering/dep-devals/tree/main/deval/deval/examples