## Evidently Configuration Guide

This notebook demonstrates basic usage of the `evidently` library for evaluating LLM outputs. We'll cover:

- Preparing a small test dataset  
- Running evaluations with built-in descriptors  
- Viewing the results locally  
- Sending Evidently metric scores to the TRACE Metrics API


In [None]:
!pip install sentence-transformers
!pip install -U evidently pandas

In [2]:
import evidently
print(evidently.__version__)

0.7.9


### Imports Explained

This script uses the **Evidently** library to evaluate LLM (Large Language Model) outputs using a variety of built-in descriptors. Here’s a breakdown of what each import does:


#### Evidently Core

```python
from evidently import Dataset
from evidently import DataDefinition
```

- **`Dataset`**: Represents the input data for evaluation (typically a `pandas.DataFrame`).
- **`DataDefinition`**: Defines the structure of the dataset (e.g., which column contains the output text, reference, context, etc.).

---

#### Evidently Descriptors

```python
from evidently.descriptors import (
    DeclineLLMEval,
    Sentiment,
    TextLength,
    NegativityLLMEval,
    PIILLMEval,
    BiasLLMEval,
    ToxicityLLMEval,
    ContextQualityLLMEval,
    ContextRelevance
)
```

These are **prebuilt descriptors** that evaluate specific aspects of LLM-generated responses:

- **`DeclineLLMEval`**: Detects if the model declined to answer (e.g., refused or deferred).
- **`Sentiment`**: Scores the sentiment of the output on a scale from -1 (negative) to 1 (positive).
- **`TextLength`**: Measures the length of the text in characters.
- **`NegativityLLMEval`**: Evaluates the level of negativity in the generated text.
- **`PIILLMEval`**: Detects whether Personally Identifiable Information (PII) is present.
- **`BiasLLMEval`**: Detects possible social, cultural, or political biases in responses.
- **`ToxicityLLMEval`**: Identifies toxic or harmful content in the output.
- **`ContextQualityLLMEval`**: Judges whether the provided context contains enough information to answer the question.
- **`ContextRelevance`**: Evaluates whether the response is relevant to the provided context.

---

These descriptors allow automated, explainable evaluation of LLM responses using a consistent and extensible framework.



In [None]:
import pandas as pd
from evidently import DataDefinition, Dataset
from evidently.descriptors import (
    BiasLLMEval, ContextQualityLLMEval, ContextRelevance, DeclineLLMEval,
    NegativityLLMEval, PIILLMEval, Sentiment, TextLength, ToxicityLLMEval,
)

### Preparing a Dummy Dataset for Testing

In this section, we create a sample dataset using Python lists and convert it into a `pandas.DataFrame` for testing purposes.

Each entry in the dataset contains:
- **`question`**: A user query or prompt.
- **`answer`**: The model-generated response.
- **`context`**: Supporting or reference information to evaluate against.

This dummy dataset is used to evaluate LLM responses on aspects such as context relevance, toxicity, and sentiment using Evidently descriptors.


In [5]:
data = [
    [
        "What is the chemical symbol for gold?",
        "Gold chemical symbol is Au.",
        "Gold is a chemical element with the symbol Au and atomic number 79. It is a dense, soft, yellow metal highly valued for its rarity and conductivity."
    ],
    [
        "What is the capital of Japan?",
        "The capital of Japan is Tokyo.",
        "Tokyo is the capital city of Japan and one of the most populous metropolitan areas in the world."
    ],
    [
        "Tell me a joke.",
        "Why don't programmers like nature? Too many bugs!",
        "Programmers often use the term 'bug' to describe an error in code, which is humorously extended to nature, which has literal bugs."
    ],
    [
        "When does water boil?",
        "Water's boiling point is 100 degrees Celsius.",
        "At standard atmospheric pressure (1 atm), water boils at 100 degrees Celsius or 212 degrees Fahrenheit."
    ],
    [
        "Who painted the Mona Lisa?",
        "Leonardo da Vinci painted the Mona Lisa.",
        "The Mona Lisa is a portrait painting created by the Italian Renaissance artist Leonardo da Vinci between 1503 and 1506."
    ],
    [
        "What’s the fastest animal on land?",
        "The cheetah is the fastest land animal, capable of running up to 75 miles per hour.",
        "Cheetahs are known for their speed and are the fastest land animals, reaching speeds up to 75 mph in short bursts covering distances up to 500 meters."
    ],
   
]

columns = ["question", "answer", "context"]


eval_df = pd.DataFrame(data, columns=columns)
eval_df.head()

Unnamed: 0,question,answer,context
0,What is the chemical symbol for gold?,Gold chemical symbol is Au.,Gold is a chemical element with the symbol Au ...
1,What is the capital of Japan?,The capital of Japan is Tokyo.,Tokyo is the capital city of Japan and one of ...
2,Tell me a joke.,Why don't programmers like nature? Too many bugs!,Programmers often use the term 'bug' to descri...
3,When does water boil?,Water's boiling point is 100 degrees Celsius.,"At standard atmospheric pressure (1 atm), wate..."
4,Who painted the Mona Lisa?,Leonardo da Vinci painted the Mona Lisa.,The Mona Lisa is a portrait painting created b...


In [None]:
import os
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"  # Replace with your actual OpenAI API key
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
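
Hardcoding the key in a notebook makes it easy to leak through version control or shared exports. As an alternative, the sketch below reads the key from an existing environment variable and only prompts for it when it is missing. The helper name `resolve_api_key` is hypothetical; it is not part of Evidently or the OpenAI SDK.

```python
import os
from getpass import getpass


def resolve_api_key(env_var: str = "OPENAI_API_KEY", prompt_fn=getpass) -> str:
    """Return the key from the environment, prompting only if it is unset.

    Hypothetical helper for this notebook, not a library function.
    `prompt_fn` is injectable so the logic can run without an interactive prompt.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Nothing in the environment: ask the user without echoing the input.
        key = prompt_fn(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"{env_var} is empty; LLM-based descriptors need a key.")
    # Export the key so libraries that read the environment can find it.
    os.environ[env_var] = key
    return key
```

Calling `resolve_api_key()` in place of the hardcoded assignment keeps the key out of the notebook source.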

### Creating an Evaluation Dataset with Descriptors

This section converts the dummy `pandas.DataFrame` into an `Evidently` `Dataset` and attaches multiple evaluation **descriptors** to analyze various aspects of the LLM responses.


#### Description of Key Components

- **`Dataset.from_pandas(...)`**: Converts the DataFrame into a format understood by Evidently.
- **`data_definition=DataDefinition()`**: Placeholder that can be extended to define semantic roles of columns (e.g., prediction, reference).
- **`descriptors=[...]`**: List of metrics to apply to each row in the dataset.

#### Attached Descriptors

| Descriptor | Purpose |
|-----------|---------|
| `NegativityLLMEval` | Detects negative tone or wording in the response. |
| `PIILLMEval` | Identifies personally identifiable information (PII). |
| `DeclineLLMEval` | Checks if the model declined to answer a question. |
| `BiasLLMEval` | Flags biased or unfair language in the answer. |
| `ToxicityLLMEval` | Measures toxic or harmful language. |
| `ContextQualityLLMEval` | Assesses the quality of the provided context relative to the question. |
| `ContextRelevance` | Evaluates how well the answer aligns with the context. |
| `Sentiment` | Scores sentiment from -1 (negative) to 1 (positive). |
| `TextLength` | Reports the length of the response in characters. |
| `DeclineLLMEval` (again with alias `Denials`) | Duplicate use for additional reporting under a different label. |


> `include_score=True` adds a numeric score column alongside each label, for quantitative evaluation.
>
> `alias` assigns a custom name to a descriptor's output columns, for clarity in the results.



In [13]:
eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        # LLM-based evaluators; include_score=True adds a numeric score column
        NegativityLLMEval("answer", include_score=True),
        PIILLMEval("answer", include_score=True),
        DeclineLLMEval("answer", include_score=True),
        BiasLLMEval("answer", include_score=True),
        ToxicityLLMEval("answer", include_score=True),
        ContextQualityLLMEval("context", question="question", include_score=True),
        ContextRelevance("answer", "context", alias="ContextRelevance"),
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        # Second DeclineLLMEval, reported under the "Denials" alias
        DeclineLLMEval("answer", alias="Denials", include_score=True),
    ],
)



In [9]:
eval_dataset.as_dataframe().head()


Unnamed: 0,question,answer,context,Negativity,Negativity score,Negativity reasoning,PII,PII score,PII reasoning,Decline,...,Toxicity reasoning,ContextQuality,ContextQuality score,ContextQuality reasoning,ContextRelevance,Sentiment,Length,Denials,Denials score,Denials reasoning
0,What is the chemical symbol for gold?,Gold chemical symbol is Au.,Gold is a chemical element with the symbol Au ...,UNKNOWN,0.5,The text provides factual information about th...,OK,0.0,The text provides information about the chemic...,OK,...,The text is factual and does not contain any h...,VALID,1.0,The text provides the chemical symbol for gold...,0.886829,0.0,27,OK,0.0,The text provides factual information about th...
1,What is the capital of Japan?,The capital of Japan is Tokyo.,Tokyo is the capital city of Japan and one of ...,POSITIVE,0.0,The text simply states a factual piece of info...,OK,0.0,The text provides information about the capita...,OK,...,The text simply states a factual piece of info...,VALID,1.0,The text clearly states that Tokyo is the capi...,0.928,0.0,30,OK,0.0,The statement 'The capital of Japan is Tokyo' ...
2,Tell me a joke.,Why don't programmers like nature? Too many bugs!,Programmers often use the term 'bug' to descri...,POSITIVE,0.1,The text presents a light-hearted joke rather ...,OK,0.0,The provided text is a joke about programmers ...,OK,...,The text is a light-hearted joke about program...,INVALID,0.0,"The text does not provide a joke, which is req...",0.774565,-0.3404,49,OK,0.0,The text is a light-hearted joke about program...
3,When does water boil?,Water's boiling point is 100 degrees Celsius.,"At standard atmospheric pressure (1 atm), wate...",UNKNOWN,0.5,The statement is factual and neither expresses...,OK,0.0,The text discusses a physical property of wate...,OK,...,The text presents a factual statement about wa...,VALID,1.0,The text provides clear and sufficient informa...,0.892115,0.0,45,OK,0.0,The text provides factual information about th...
4,Who painted the Mona Lisa?,Leonardo da Vinci painted the Mona Lisa.,The Mona Lisa is a portrait painting created b...,POSITIVE,0.0,The text presents a factual statement about Le...,OK,0.0,The text does not contain any identifiable per...,OK,...,The text simply states a factual statement abo...,VALID,1.0,The text clearly states that the Mona Lisa was...,0.87329,0.0,40,OK,0.0,The text states a fact about Leonardo da Vinci...
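
The column names in the output above appear to follow a pattern: each LLM-based descriptor contributes a label column named after its alias, plus `<alias> score` and `<alias> reasoning` columns, while `ContextRelevance`, `Sentiment`, and `Length` each contribute a single column. The sketch below encodes that observed pattern so a result frame can be checked before later cells rely on it. The helpers `expected_columns` and `missing_columns` are hypothetical and inferred from this notebook's output, not taken from Evidently's API.

```python
def expected_columns(alias: str, llm_based: bool = True,
                     include_score: bool = True) -> list[str]:
    """List the columns a descriptor is expected to add (pattern assumed from the output)."""
    columns = [alias]
    if llm_based:
        if include_score:
            columns.append(f"{alias} score")
        columns.append(f"{alias} reasoning")
    return columns


def missing_columns(actual: list[str], expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from `actual`."""
    present = set(actual)
    return [name for name in expected if name not in present]


# Column names copied from the output header above.
header = ["question", "answer", "context", "Length",
          "Denials", "Denials score", "Denials reasoning"]
print(missing_columns(header, expected_columns("Denials")))                 # → []
print(missing_columns(header, expected_columns("Length", llm_based=False)))  # → []
```

With a real result, `list(eval_dataset.as_dataframe().columns)` would take the place of the hand-written `header`.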


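### Extracting Metric Results
This code reads **only the first row** of `eval_dataset` for demonstration purposes and collects its numeric descriptor scores into a dictionary called `metric_results`. Text columns (labels and reasoning) are skipped. Note that `Length` is also absent from the output below, most likely because pandas stores it as a NumPy integer, which does not pass the `isinstance(val, (int, float))` check.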


In [10]:
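metric_results = {}
row = eval_dataset.as_dataframe().iloc[0]  # first row only
for col, val in row.items():
    # Keep numeric values; label and reasoning columns hold text and are skipped
    if isinstance(val, (int, float)):
        # Strip the suffix so "Negativity score" is reported as "Negativity"
        clean_col = col.replace(' score', '')
        metric_results[clean_col] = float(val)
print(metric_results)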


{'Negativity': 0.5, 'PII': 0.0, 'Decline': 0.0, 'Bias': 0.0, 'Toxicity': 0.0, 'ContextQuality': 1.0, 'ContextRelevance': 0.8868294358253479, 'Sentiment': 0.0, 'Denials': 0.0}
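
The `isinstance(val, (int, float))` check above depends on how pandas boxes each value. A slightly more general sketch, shown here on a plain dictionary standing in for the row, uses `numbers.Real` from the standard library (NumPy registers its integer and floating scalar types with these abstract classes) and skips booleans explicitly. The function name `extract_numeric_scores` and the sample row are illustrative assumptions.

```python
from numbers import Real
from typing import Any, Mapping


def extract_numeric_scores(row: Mapping[str, Any]) -> dict[str, float]:
    """Collect real-valued entries from a row, dropping a trailing ' score' suffix."""
    suffix = " score"
    scores = {}
    for col, val in row.items():
        # bool is a subclass of int, so exclude it before the numeric check.
        if isinstance(val, bool) or not isinstance(val, Real):
            continue
        name = col[: -len(suffix)] if col.endswith(suffix) else col
        scores[name] = float(val)
    return scores


# Illustrative row with a subset of the columns shown earlier.
sample_row = {
    "answer": "Gold chemical symbol is Au.",
    "Negativity": "UNKNOWN",
    "Negativity score": 0.5,
    "Sentiment": 0.0,
    "Length": 27,
}
print(extract_numeric_scores(sample_row))
# → {'Negativity': 0.5, 'Sentiment': 0.0, 'Length': 27.0}
```

Unlike the cell above, this variant also keeps integer-valued columns such as `Length`, so decide whether that is wanted before swapping it in.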


## How to Get Your Subscription Key for the TRACE Metrics API

To access the TRACE Metrics API, you need a **subscription key**. Follow these steps:

1. **Sign in to CognitiveView**  
   Go to [app.cognitiveview.com](https://app.cognitiveview.com) and log in with your account credentials.

2. **Navigate to System Settings**  
   Once logged in, find the **System Settings** option in the main menu.

3. **Generate and view your subscription key**  
   - Inside **System Settings**, look for the section labeled **API Access** or **Subscription Key**.
   - If a key has already been generated, you can copy it directly.
   - If not, click the **Generate Key** button to create a new key.

4. **Copy and save your subscription key**  
   Keep this key secure. You’ll need to include it in your API requests for authentication.

---

## Posting Evaluation Metrics to TRACE Metrics API

This script sends your evaluation metric results (here, from Evidently) to the TRACE Metrics API using an authenticated HTTP POST request.

### Authentication

- Use an **Authorization token** (`AUTH_TOKEN`) in the request header.
- Include an **X-User-Id** header to identify the user making the request.

---

### Endpoint

| Item         | Value                                                      |
|--------------|-----------------------------------------------------------|
| **Base URL** | `https://api.cognitiveview.com`                            |
| **API Path** | `/cv/v1/metrics`                                          |
| **Full URL** | `https://api.cognitiveview.com/cv/v1/metrics`             |

---

### Payload Structure

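#### `metric_metadata`

Contains information about what you’re evaluating:
- `application_name`: Name of the evaluated application.
- `version`: Version of the application or model.
- `url`: URL of the evaluated application (for example, `https://api.example.com/chat`).
- `provider`: The metric system or platform (here, `Evidently`).
- `use_case`: The business or functional domain (for example, `customer_support`).

#### `metric_data`

Contains the metric scores, keyed by provider name (here, `{"Evidently": metric_results}`).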
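
To make the structure concrete, the sketch below assembles the payload as a plain dictionary and prints it as JSON, so it can be inspected before any request is sent. The helper name `build_metrics_payload` and the sample values are illustrative; the field names mirror those used in the posting function in this notebook.

```python
import json


def build_metrics_payload(metric_results, application_name, version, url,
                          use_case, provider="Evidently"):
    """Assemble the request body; raise if a metadata field is left empty."""
    metadata = {
        "application_name": application_name,
        "version": version,
        "url": url,
        "provider": provider,
        "use_case": use_case,
    }
    missing = [key for key, value in metadata.items() if not value]
    if missing:
        raise ValueError(f"Missing metadata fields: {', '.join(missing)}")
    return {
        "metric_metadata": metadata,
        # Scores are nested under the provider name.
        "metric_data": {provider: dict(metric_results)},
    }


payload = build_metrics_payload(
    {"PII": 0.0, "Toxicity": 0.0},  # sample scores for illustration
    application_name="chat-application",
    version="1.0.0",
    url="https://api.example.com/chat",
    use_case="customer_support",
)
print(json.dumps(payload, indent=2))
```

Building the body separately from the HTTP call keeps the metadata easy to review and reuse across runs.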


In [None]:
import requests


def post_metrics_to_TRACE_Metric_API(metric_results, auth_token, user_id):
    """
    Posts Evidently metric results to the TRACE Metrics API.

    Args:
        metric_results (dict): Dictionary of computed metric scores.
        auth_token (str): Authorization token for the API.
        user_id (str): User ID sent in the X-User-Id header.

    Returns:
        dict | None: Response JSON from the API, or None if the body is not valid JSON.
    """
    BASE_URL = "https://api.cognitiveview.com"
    url = f"{BASE_URL}/cv/v1/metrics"

    headers = {
        "Authorization": auth_token,
        "Content-Type": "application/json",
        "X-User-Id": user_id,
    }

    payload = {
        "metric_metadata": {
            "application_name": "chat-application",
            "version": "1.0.0",
            "url": "https://api.example.com/chat",
            "provider": "Evidently",
            "use_case": "transportation",
        },
        "metric_data": {
            "Evidently": metric_results,
        },
    }

    # A timeout keeps the notebook from hanging if the API is unreachable.
    response = requests.post(url, headers=headers, json=payload, timeout=30)
    print(f"Status Code: {response.status_code}")
    try:
        body = response.json()
    except ValueError:
        # The body was not valid JSON; show the raw text instead.
        print("Response Text:", response.text)
        return None
    print("Response JSON:", body)
    return body


# Example usage:
AUTH_TOKEN = "Your-Authorization-Token-Here"  # Replace with your actual token
user_id = "user_id"  # Replace with your actual user ID
post_metrics_to_TRACE_Metric_API(metric_results, AUTH_TOKEN, user_id)
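
If a descriptor fails to return a score for a row, the value may show up as `None` or `NaN` in `metric_results`. `NaN` is not valid in standard JSON, so it can be worth filtering such values out before posting. The following is a minimal sketch; the helper name `drop_missing_scores` is hypothetical.

```python
import math


def drop_missing_scores(metric_results):
    """Split results into JSON-safe scores and the names that were dropped."""
    clean, dropped = {}, []
    for name, value in metric_results.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            dropped.append(name)
        else:
            clean[name] = value
    return clean, dropped


# Sample values for illustration only.
clean, dropped = drop_missing_scores({"PII": 0.0, "Bias": float("nan"), "Toxicity": None})
print(clean)    # → {'PII': 0.0}
print(dropped)  # → ['Bias', 'Toxicity']
```

Passing `clean` to the posting function, and logging `dropped`, makes it visible when a metric was skipped instead of silently sending an unusable value.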

