### DeepEval Configuration Guide

This notebook demonstrates the basic usage of the `deepeval` library. We'll cover:

- Logging test cases  
- Running evaluations  
- Viewing and saving results locally  
- Posting DeepEval metric scores to the TRACE Metric API


In [None]:
%pip install deepeval requests

## RAG Test Case

In this example, we define a **Retrieval-Augmented Generation (RAG)** test case using `deepeval`. The goal is to evaluate how well a language model's response aligns with both the expected output and the retrieved context.

### What We're Doing

- **Input**: A user asks _"What causes seasonal color changes in leaves?"_
- **Actual Output**: The model's generated response.
- **Expected Output**: A reference answer used for comparison.
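- **Context**: The ideal (ground-truth) context for the input, which `HallucinationMetric` compares the answer against.
- **Retrieval Context**: The documents the retriever actually returned at run time, which the contextual metrics and `FaithfulnessMetric` use. In this example it is identical to the context.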

We use `LLMTestCase` to encapsulate this information, which will later be evaluated using various DeepEval metrics such as:
- `AnswerRelevancyMetric`
- `ContextualRelevancyMetric`
- `ContextualRecallMetric`
- `ContextualPrecisionMetric`
- `FaithfulnessMetric`
- `HallucinationMetric`

This setup allows us to assess factual consistency, grounding, and hallucination risk in RAG-based systems.


In [5]:
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
    FaithfulnessMetric,
    HallucinationMetric
)

# Define RAG test case with context and retrieval_context
tc = LLMTestCase(
    input="What causes seasonal color changes in leaves?",
    actual_output="Leaves change color due to reduced chlorophyll production in fall, revealing carotenoids and anthocyanins. Temperature and light changes trigger this process.",
    expected_output="Seasonal leaf color changes are primarily caused by the breakdown of chlorophyll in autumn, revealing underlying pigments like carotenoids (yellows/oranges) and anthocyanins (reds/purples), triggered by shorter days and cooler temperatures.",
    context=[
        "Photosynthesis slows in autumn due to reduced sunlight and temperature changes.",
        "Chlorophyll breaks down faster than it's produced, unmasking existing carotenoids.",
        "Anthocyanins are newly synthesized in some species as sugars become trapped in leaves.",
        "The process is influenced by both photoperiod (day length) and temperature changes."
    ],
    retrieval_context=[
        "Photosynthesis slows in autumn due to reduced sunlight and temperature changes.",
        "Chlorophyll breaks down faster than it's produced, unmasking existing carotenoids.",
        "Anthocyanins are newly synthesized in some species as sugars become trapped in leaves.",
        "The process is influenced by both photoperiod (day length) and temperature changes."
    ]
)


In [None]:
import os
OPENAI_API_KEY="Your OpenAI API Key Here"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

## Metric Evaluation

In this step, we initialize a set of evaluation metrics with custom thresholds and apply them to our RAG test case.

### Metrics Used

- `AnswerRelevancyMetric (≥ 0.7)`: Measures how relevant the model's answer (`actual_output`) is to the input question.
- `ContextualRelevancyMetric (≥ 0.8)`: Evaluates how relevant the retrieved context (`retrieval_context`) is to the input question.
- `ContextualRecallMetric (≥ 0.9)`: Measures how much of the expected output can be attributed to the retrieved context.
- `ContextualPrecisionMetric (≥ 0.85)`: Measures whether relevant retrieved documents are ranked above irrelevant ones.
- `FaithfulnessMetric (≥ 0.9)`: Checks whether the generated answer is factually consistent with the retrieved context.
- `HallucinationMetric (≤ 0.1)`: Detects content in the answer that contradicts the supplied `context`. Lower is better, so the threshold is a maximum rather than a minimum.
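
The threshold directions listed above can be sketched in plain Python. This is only an illustration: `THRESHOLDS` and `passes` are hypothetical names that are not part of `deepeval`, and the sample scores are made up for the example rather than taken from a real run.

```python
# Hypothetical helper, not part of deepeval: checks a score against its threshold.
# "min" means the score must be at least the threshold; "max" means at most.
THRESHOLDS = {
    "AnswerRelevancyMetric": (0.7, "min"),
    "ContextualRelevancyMetric": (0.8, "min"),
    "ContextualRecallMetric": (0.9, "min"),
    "ContextualPrecisionMetric": (0.85, "min"),
    "FaithfulnessMetric": (0.9, "min"),
    "HallucinationMetric": (0.1, "max"),  # lower is better
}


def passes(metric_name: str, score: float) -> bool:
    """Return True if the score satisfies the threshold for that metric."""
    threshold, direction = THRESHOLDS[metric_name]
    if direction == "min":
        return score >= threshold
    return score <= threshold


# Illustrative scores only (assumed values, not real evaluation output).
sample_scores = {
    "AnswerRelevancyMetric": 0.82,
    "FaithfulnessMetric": 0.75,
    "HallucinationMetric": 0.05,
}

verdicts = {name: passes(name, score) for name, score in sample_scores.items()}
print(verdicts)
# {'AnswerRelevancyMetric': True, 'FaithfulnessMetric': False, 'HallucinationMetric': True}
```

Keeping the direction next to each threshold avoids accidentally treating the hallucination score as "higher is better".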

### Evaluation Loop

Each metric is applied to the test case using the `measure()` method. The scores are collected in a dictionary, `metric_results`, in which:
- each key is the metric's class name (e.g., `FaithfulnessMetric`)
- each value is the computed score (`m.score`)

These results can be used for reporting, logging, or visualization.


In [None]:
# Initialize metrics with appropriate thresholds
metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.8),
    ContextualRecallMetric(threshold=0.9),
    ContextualPrecisionMetric(threshold=0.85),
    FaithfulnessMetric(threshold=0.9),
    HallucinationMetric(threshold=0.1)
]

# Evaluate all metrics
metric_results = {}
for m in metrics:
    m.measure(tc)
    metric_results[m.__class__.__name__] = m.score


In [8]:
print(metric_results)

{'AnswerRelevancyMetric': 1.0, 'ContextualRelevancyMetric': 1.0, 'ContextualRecallMetric': 1.0, 'ContextualPrecisionMetric': 1.0, 'FaithfulnessMetric': 1.0, 'HallucinationMetric': 0.0}
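
One simple way to keep these scores locally is to write the dictionary to a JSON file and read it back when needed. The sketch below is an assumption about how one might do that, not a DeepEval feature: `save_results`, `load_results`, and the file name `deepeval_results.json` are hypothetical, and the file goes to the system temp directory only so that the example runs anywhere.

```python
import json
import tempfile
from pathlib import Path

# Scores copied from the printed output above.
metric_results = {
    "AnswerRelevancyMetric": 1.0,
    "ContextualRelevancyMetric": 1.0,
    "ContextualRecallMetric": 1.0,
    "ContextualPrecisionMetric": 1.0,
    "FaithfulnessMetric": 1.0,
    "HallucinationMetric": 0.0,
}


def save_results(results, path):
    """Write the score dictionary to `path` as indented JSON and return the path."""
    out = Path(path)
    out.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return out


def load_results(path):
    """Read a score dictionary back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical location; replace with any path that suits your project.
target = Path(tempfile.gettempdir()) / "deepeval_results.json"
saved_path = save_results(metric_results, target)
reloaded = load_results(saved_path)
print(f"Saved {len(reloaded)} scores to {saved_path}")
```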


## Posting Evaluation Metrics to the TRACE Metric API

This script sends the evaluation metric results (e.g., from DeepEval) to the TRACE Metric API using an authenticated HTTP POST request.

### Authentication

- Uses an **Authorization token** (`AUTH_TOKEN`) for secure access to the API.
- Includes an **X-User-Id** header to identify the user performing the operation.

### Endpoint

- **Base URL**: `https://api.cognitiveview.com`
- **API Path**: `/cv/v1/metrics`
- **Full Endpoint**: `https://api.cognitiveview.com/cv/v1/metrics`

### Payload Structure

#### `metric_metadata`
Metadata describing the context of the evaluation:
- `application_name`: Name of the application being evaluated.
- `version`: Version of the application.
- `resource_name`: The evaluated resource (e.g., a model or endpoint).
- `resource_id`: Unique ID of the resource.
- `url`: The endpoint URL of the resource.
- `provider`: Source of the metric system (e.g., `deepeval`).
- `use_case`: The business or functional use case (e.g., `transportation`).

#### `metric_data`
Data containing the metric scores:
- `resource_id`: The ID of the instance or model run being scored.
- `resource_name`: Name of the evaluated resource.
- `deepeval`: Dictionary of computed metric scores 
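
The payload described above can also be assembled by a small helper, so that the metadata is passed in rather than hard-coded. This is a sketch under assumptions: `build_payload` is a hypothetical function, and treating all seven metadata fields as required is a choice made for this example, not something this notebook states about the API.

```python
import json

# Metadata fields listed under `metric_metadata` above.
# Assumption: this sketch treats all of them as required.
REQUIRED_METADATA = {
    "application_name", "version", "resource_name",
    "resource_id", "url", "provider", "use_case",
}


def build_payload(metric_results, metadata, resource_id, resource_name):
    """Hypothetical helper: assemble the request body from its two parts."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"metric_metadata is missing: {sorted(missing)}")
    return {
        "metric_metadata": dict(metadata),
        "metric_data": {
            "resource_id": resource_id,
            "resource_name": resource_name,
            "deepeval": dict(metric_results),
        },
    }


# Example values taken from this notebook's placeholder payload.
metadata = {
    "application_name": "chat-application",
    "version": "1.0.0",
    "resource_name": "chat-completion",
    "resource_id": "R-756",
    "url": "https://api.example.com/chat",
    "provider": "deepeval",
    "use_case": "transportation",
}
scores = {"FaithfulnessMetric": 1.0, "HallucinationMetric": 0.0}

payload = build_payload(scores, metadata, "res_123456", "chat-completion")
print(json.dumps(payload, indent=2))
```

Validating the metadata before sending makes a missing field fail locally with a clear message instead of as an API error.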

In [None]:
def post_metrics_to_TRACE_Metric_API(metric_results, auth_token, user_id):
  """
  Posts DeepEval metric results to the Trace Metric API.

  Args:
    metric_results (dict): Dictionary of computed metric scores.
    auth_token (str): Authorization token for the API.
    user_id (str): User ID sent in the X-User-Id header.

  Returns:
    dict | None: Response JSON from the API, or None if the body is not valid JSON.
  """
  import requests

  BASE_URL = "https://api.cognitiveview.com"
  url = f"{BASE_URL}/cv/v1/metrics"

  headers = {
    "Authorization": auth_token,
    "Content-Type": "application/json",
    "X-User-Id": user_id,
  }

  payload = {
    "metric_metadata": {
      "application_name": "chat-application",
      "version": "1.0.0",
      "resource_name": "chat-completion",
      "resource_id": "R-756",
      "url": "https://api.example.com/chat",
      "provider": "deepeval",
      "use_case": "transportation"
    },
    "metric_data": {
      "resource_id": "res_123456",
      "resource_name": "chat-completion",
      "deepeval": metric_results
    }
  }

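  # A timeout keeps the call from hanging indefinitely if the API is unreachable.
  response = requests.post(url, headers=headers, json=payload, timeout=30)
  print(f"Status Code: {response.status_code}")
  try:
    body = response.json()  # parse once and reuse
    print("Response JSON:", body)
    return body
  except ValueError:  # the response body was not valid JSON
    print("Response Text:", response.text)
    return None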

# Example usage:
AUTH_TOKEN = "Your-Authorization-Token-Here"  # Replace with your actual token
user_id = "user_id"  # Replace with your actual user ID
post_metrics_to_TRACE_Metric_API(metric_results, AUTH_TOKEN, user_id)



In [None]:
import requests

def fetch_report_result(report_id, auth_token, user_id):
    """
    Fetches the result of a report from the CognitiveView API.

    Args:
        report_id (str): The ID of the report to fetch.
        auth_token (str): The authorization token for the API.
        user_id (str): The user ID for the API.

    Returns:
        dict: The JSON response from the API if successful, else None.
    """
    base_url = "https://api.cognitiveview.com"
    endpoint = f"/cv/v1/metrics/{report_id}"
    url = base_url + endpoint

    headers = {
        "Authorization": auth_token,
        "Content-Type": "application/json",
        "X-User-Id": user_id,
    }

    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Failed to fetch report. Status code: {response.status_code}")
        return None

# Example usage:
AUTH_TOKEN = "Your-Authorization-Token-Here"  # Replace with your actual token
report_id = "report_id"  # Replace with the actual report ID you want to fetch
user_id = "user_id"  # Replace with your actual user ID
report = fetch_report_result(report_id, AUTH_TOKEN, user_id)
print(report)

{'report_id': 'SiKZrabyAyjX7fyGnMM4wX', 'application_id': 'DOC-gI2TC8RnekSZT3Ot', 'provider': 'evidently', 'use_case': 'financial_services', 'application_name': 'customer_service_bot', 'resource_type': 'genai', 'pillars': [{'pillar': 'performance', 'score': 0.0, 'colour': '🔴', 'metrics_count': 0}, {'pillar': 'fairness_and_bias', 'score': 0.5, 'colour': '🟠', 'metrics_count': 1}, {'pillar': 'safety_and_truthfulness', 'score': 1.0, 'colour': '🟢', 'metrics_count': 1}, {'pillar': 'task_adherence', 'score': 0.0, 'colour': '🔴', 'metrics_count': 0}, {'pillar': 'reliability', 'score': 0.0, 'colour': '🔴', 'metrics_count': 0}, {'pillar': 'robustness', 'score': 0.5, 'colour': '🟠', 'metrics_count': 1}, {'pillar': 'privacy', 'score': 0.0, 'colour': '🔴', 'metrics_count': 0}], 'metrics': [{'metric_name': 'ToxicityLLMEval', 'canonical_details': [{'name': 'safety', 'description': 'Checks for harmful, biased, or unsafe content in model responses.'}], 'common_metric_name': 'toxicity', 'common_metric_descr