Replies: 1 comment 1 reply
-
|
Are we generating one new log table and one new metric? Aything else? |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
SWIP-16 Support LLM-as-Judge on Top of GenAI Observability
Motivation
SkyWalking already provides GenAI observability capabilities. Based on the existing GenAI semantic conventions and analysis pipeline, SkyWalking can recognize GenAI spans from multiple data sources, including SkyWalking native traces, OTLP, and Zipkin, extract GenAI-related attributes, build Virtual GenAI entities, and display runtime metrics such as traffic, latency, token usage, TTFT, TPOT, and estimated cost in the GenAI dashboard.
These capabilities answer "what happened during the invocation" and "how are performance and cost," but they do not yet answer "what is the quality of the model output." For GenAI applications in production, users usually also need to continuously observe quality signals such as:
Today, such evaluation usually depends on external evaluation platforms, custom business-side scripts, or manual sampling workflows. This causes several problems:
This SWIP proposes introducing
LLM-as-Judgeinto SkyWalking OAP. On top of the existing GenAI observability capabilities, the whole evaluation feature reuses the current trace ingestion and GenAI span analysis pipeline and supports SkyWalking native traces, OTLP, and Zipkin as input sources. OAP samples runtime GenAI spans, extracts evaluation inputs, invokes a configurable judge model, and writes evaluation results into SkyWalking structured records. ForSCORE-type evaluation results, OAP further generates metrics and displays them in the GenAI dashboard. Meanwhile, the UI adds an evaluation result page for detailed result browsing and supports jumping from a single evaluation result to the related trace, so that quality observability is integrated into the existing GenAI observability system.Architecture Graph
Proposed Changes
1. Introduce an independent AI evaluation module
OAP adds a dedicated
ai-evaluationmodule to host runtime AI evaluation capabilities. In the current implementation, evaluation data is decoupled through an asynchronous local in-memory queue so that judge invocation is not executed synchronously on the trace analysis critical path. This module is responsible for:This keeps evaluation logic modular and avoids coupling judge-related logic directly into the existing analyzer core.
2. Reuse the existing GenAI observability analysis entry
This capability does not introduce a new parallel collection pipeline. Instead, it is built on top of the existing GenAI observability capability and directly reuses the current trace ingestion paths, including SkyWalking native traces, OTLP, and Zipkin.
When OAP parses spans, the existing pipeline already recognizes GenAI spans. On top of that, the new analysis listener reuses runtime context and extracts the following information:
traceIdspanIdserviceNameserviceInstanceNameoperationNameproviderNamemodelNamestartTimeMillisendTimeMilliserrorGenAI-related tagsThis information is packaged as the evaluation context and passed to the AI evaluation service, making evaluation results a natural extension of the existing GenAI observability model.
3. Introduce task-based LLM-as-Judge evaluation
The evaluation logic is task-driven rather than hardcoded around fixed dimensions. Each task defines the following fields:
namevalueTypeinstructionThe initial default tasks include:
FaithfulnessRelevanceTaskCompletionHallucinationBased on the configured
system-prompt, extracted GenAI context, and task list, OAP builds the prompt and sends it to the external judge model. The judge model returns structured JSON, and each task result contains at least:valuereasonThis design keeps evaluation dimensions configurable rather than hardcoded in the implementation.
4. Support an OpenAI-compatible judge provider
This SWIP introduces the
JudgeModelProviderabstraction and provides the first implementation,OpenAICompatibleProvider.Runtime configuration includes:
providerendpointmodelapi-keyThis allows OAP to call judge endpoints compatible with the OpenAI API format while leaving room for future provider extensions.
5. Introduce evaluation planning, prompt building, asynchronous queueing, and result parsing
The implementation introduces a clear runtime evaluation pipeline. In the current implementation, after a GenAI span hits sampling, OAP does not synchronously invoke evaluation. Instead, it first places the evaluation task into a local asynchronous in-memory queue, and a background evaluation consumer executes the remaining steps:
EvaluationInputExtractorEvaluationPlannerEvaluationPlanEvaluationPromptBuilderlocal async in-memory queueevaluation consumerEvaluationResultParserEvaluationResultThe end-to-end flow is:
This decouples evaluation orchestration from transport and storage logic and avoids blocking external model invocation on the trace analysis critical path.
6. Use PPM sampling for runtime evaluation
Because LLM-as-Judge introduces additional model invocation cost and runtime overhead, this SWIP does not evaluate all GenAI spans. Instead, it introduces a sampling strategy.
The sampling rate uses PPM,
parts per million:1_000_000means 100% evaluation100_000means 10% evaluation10_000means 1% evaluation0means runtime evaluation is disabledThe module validates that the sampling rate is within
[0, 1_000_000]and applies the configured sampling strategy before invoking the judge model.7. Write evaluation results into SkyWalking structured records
Each evaluation result is written into SkyWalking storage through
AIEvaluationResultRecordas a structured record.The record includes the following core fields:
trace_idsegment_idspan_idspan_typetask_namevalue_typevaluereasonjudge_modelevaluation_timetime_bucketThis lets each evaluation result directly link back to existing trace and span data. In merged record storage mode, the data is written into the logical record table
ai_evaluation_result.8. Generate MAL labeled metrics from SCORE-type evaluation results
In addition to persisting evaluation results as structured records, this SWIP also proposes converting
SCORE-type evaluation results into MAL-based labeled metrics.For tasks where
valueType = SCORE, the judge returns a numeric result in[0.0, 1.0]. OAP converts the result into a MALSampleFamilyand uses a MAL rule to generate the final metric. The task name is kept as a metric label instead of being encoded into the metric name, so newly configured evaluation tasks do not require additional OAL statements or new hardcoded metrics.The initial metric is:
gen_ai_evaluation_score_ppmThe metric is attached to the Virtual GenAI service instance dimension, using
service_nameas the service key andmodel_nameas the instance key. Thetask_nameremains as a labeled value dimension, allowing the same metric to represent scores forFaithfulness,Relevance,TaskCompletion,Hallucination, or any user-defined task.Because SkyWalking MAL labeled values are stored as long values, the score is scaled before entering the MAL pipeline:
For example, a judge score of
0.86is stored as860000. Query or UI code should divide the metric value by1,000,000when displaying the original score.This labeled metric supports:
task_namein the GenAI dashboardThis means the new capability is not only a record persistence feature, but also extends the GenAI dashboard from performance and cost observability to quality observability.
9. Add an evaluation result page and trace jump capability
In addition to aggregated dashboard metrics, the UI adds an evaluation result page for displaying structured evaluation details.
The page displays at least the following fields:
traceIdsegmentIdspanIdserviceNameoperationNametaskNamevalueTypevaluereasonjudgeModelevaluationTimeUsers can filter the page by service, task name, evaluation result type, and time range, making it easier to investigate low scores, anomalies, or suspicious results.
Most importantly, each evaluation result keeps its association with the original trace. Users can click
traceIdor a jump button in the evaluation result page to open the related trace detail page directly and continue investigating the full call chain, contextual spans, and related GenAI tags.This allows SkyWalking not only to show an evaluation result, but also to connect that result with runtime trace analysis and form a closed-loop troubleshooting experience from quality signal to execution context.
Compatibility
This SWIP introduces a new OAP capability and a new record data model. The main compatibility impacts include:
ai_evaluation_resultSCORE-type evaluation results additionally generate a MAL labeled metric for dashboard displaygen_ai_evaluation_score_ppmmetric stores scores scaled by1,000,000; query and UI layers need to divide by1,000,000to display the original[0.0, 1.0]scoretraceIdGeneral usage docs
ai-evaluationmodule in OAP.SCORE-type tasks, OAP generates the MAL labeled metricgen_ai_evaluation_score_ppmfrom evaluation results.gen_ai_evaluation_score_ppmvalues by1,000,000to display the original score.Beta Was this translation helpful? Give feedback.
All reactions