In [1]:
import os
from dotenv import load_dotenv
load_dotenv(encoding='utf-8')

True

# Overview

This Jupyter notebook evaluates the performance of the Cohere deployment within the Cohere toolkit using synthetic data. It prepares the testing data, evaluates the RAG system's performance, and saves the results.

**Step 1: Load Synthetic Data**

The notebook loads synthetic question-ground truth data generated by another notebook (`1-ragas-synthetic-test-data-generation.ipynb`).

**Step 2: Feed Data to System and Extract Responses**

The synthetic questions are fed into the RAG system, and the system's generated responses are extracted, in a separate notebook (`2-chat-history-extraction.ipynb`). This notebook consumes that notebook's output.

**Step 3: Prepare Testing Data**

The notebook prepares the testing data by:

1. Loading the chat history extraction data from a CSV file (`test_dataset_hr_cohere_deployment_test.csv`) using a custom function `load_dataframe_with_list_column`.
2. Loading the ground truth data from a JSON file (`test_dataset_hr.json`).
3. Merging the two dataframes by the question content.
4. Converting the dataframe to the dataset format expected by RAGAS (Retrieval Augmented Generation Assessment) using the `Dataset` class from the `datasets` library.

**Step 4: Evaluate RAG System's Performance**

The notebook evaluates the RAG system's performance using the `evaluate` function from the `ragas` library, which takes the prepared dataset and a list of metrics as input. The metrics used in this evaluation are:

1. Answer relevancy
2. Faithfulness
3. Context recall
4. Context precision

**Step 5: Save Evaluated Results**

The notebook converts the evaluated results to a dataframe with the `to_pandas` method and saves them to a CSV file (`eval_result_dataset_hr_cohere_deployment.csv`).

**Additional Features**

The notebook also includes optional code for tracing runs with LangSmith, which requires signing up for an API key.

**Custom Functions**

The notebook defines several custom functions for working with dataframes containing list columns, including:

1. `serialize_list`: serializes a list to a JSON string
2. `deserialize_list`: deserializes a JSON string back into a list
3. `save_dataframe_with_list_column`: saves a dataframe with a list column to a CSV file
4. `load_dataframe_with_list_column`: loads a dataframe from a CSV file, restoring the list structure.


# Prepare testing data

**When saving the test data in the notebook `2-chat-history-extraction.ipynb`, we serialized the `contexts` column to JSON strings. To load that file into a dataframe, we need a function to deserialize it.**

In [37]:
import json
import pandas as pd

def serialize_list(value):
    """Serializes a list to a JSON string."""
    return json.dumps(value)

def deserialize_list(value):
    """Deserializes a JSON string back into a list."""
    return json.loads(value)

def save_dataframe_with_list_column(df, filename):
    """Saves a DataFrame with a list column to a CSV file, preserving the list structure.

    Args:
        df: The DataFrame to save.
        filename: The name of the output CSV file.
    """

    # Work on a copy so the caller's DataFrame keeps its list column intact
    df = df.copy()

    # Apply the serialization function to the list column
    df['contexts'] = df['contexts'].apply(serialize_list)

    # Save the DataFrame to CSV
    df.to_csv(filename, index=False)

def load_dataframe_with_list_column(filename):
    """Loads a DataFrame from a CSV file, restoring the list structure.

    Args:
        filename: The name of the input CSV file.

    Returns:
        The loaded DataFrame.
    """

    # Load the DataFrame
    df = pd.read_csv(filename)

    # Apply the deserialization function to the list column
    df['contexts'] = df['contexts'].apply(deserialize_list)

    return df
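The reason for the JSON round trip is that a CSV cell can only hold text, so a list has to be encoded on the way out and decoded on the way back in. The following is a minimal sketch of that idea, assuming the standard-library `csv` module in place of pandas so that it runs on its own; the sample row is hypothetical and not taken from the test dataset.

```python
import csv
import io
import json

# Hypothetical sample row; the real rows come from 2-chat-history-extraction.ipynb.
row = {"question": "What is the EAP?", "contexts": ["chunk one", "chunk, with a comma"]}

# Write: the list is stored as a JSON string inside a single CSV cell.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "contexts"])
writer.writeheader()
writer.writerow({"question": row["question"], "contexts": json.dumps(row["contexts"])})

# Read: every CSV cell comes back as text, so the list must be decoded again.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
raw_cell = loaded["contexts"]      # still a string at this point
restored = json.loads(raw_cell)    # back to a Python list
```

Without the `json.loads` step, `contexts` would stay a plain string, which is what `load_dataframe_with_list_column` guards against.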

## Load the test data from the chat history extraction process

In [81]:
import pandas as pd
from from_root import from_root
file_name = "test_dataset_hr_cohere_deployment_test.csv"
df_question_answer_contexts = load_dataframe_with_list_column(os.path.join(from_root(), "data-test/test-dataset/", file_name))

## Adding ground truth

In [82]:
# Import the ground truth from the test question set 
df_ground_truth = pd.read_json(os.path.join(from_root(), "data-test/test-dataset/", "test_dataset_hr.json"))

In [83]:
# Merge the dataframe with the ground_truth by the question content
data_to_test = pd.merge(df_question_answer_contexts[['question', 'answer', 'contexts']], df_ground_truth, on='question', how='left')
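Because the merge is a left join on the exact question text, any question without an identical match in the ground truth set ends up with a missing `ground_truth` value. A small check before merging can surface such rows. The sketch below is plain Python, and `find_unmatched_questions` is a hypothetical helper, not part of the notebook or of pandas.

```python
def find_unmatched_questions(questions, ground_truth_questions):
    """Return the questions that have no exact match in the ground truth set."""
    known = set(ground_truth_questions)
    return [q for q in questions if q not in known]


# Hypothetical data: the second question carries a trailing space.
asked = ["What is the EAP?", "How is leave requested? "]
truth = ["What is the EAP?", "How is leave requested?"]

unmatched = find_unmatched_questions(asked, truth)

# Stripping whitespace on both sides before comparing resolves this case.
unmatched_after_strip = find_unmatched_questions(
    [q.strip() for q in asked],
    [q.strip() for q in truth],
)
```

In this notebook the two arguments would presumably be `df_question_answer_contexts['question']` and `df_ground_truth['question']`.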

## Convert to RAGAS data format

In [85]:
# Convert testing data to RAGAS Dataset format
from datasets import Dataset

question = list(data_to_test['question'])
answer = list(data_to_test['answer'])
contexts = list(data_to_test['contexts'])
ground_truth = list(data_to_test['ground_truth'])

data = {
    'question': question,
    'answer': answer,
    'contexts': contexts,
    'ground_truth': ground_truth
}

dataset = Dataset.from_dict(data)
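Before handing `data` to the evaluation, it can help to confirm that the four columns line up. The sketch below assumes the column layout used in this notebook (one list of context strings per question) and uses a hypothetical helper, `validate_ragas_columns`, that is not part of `ragas` or `datasets`.

```python
import math


def validate_ragas_columns(data):
    """Return a list of problems found in a column dict (hypothetical helper)."""
    required = ["question", "answer", "contexts", "ground_truth"]
    missing = [name for name in required if name not in data]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    lengths = {name: len(data[name]) for name in required}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")

    # Each row's contexts should be a list of strings.
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts is not a list of strings")

    # A left merge leaves NaN where no ground truth matched.
    for i, gt in enumerate(data["ground_truth"]):
        if gt is None or (isinstance(gt, float) and math.isnan(gt)):
            problems.append(f"row {i}: ground_truth is missing")
    return problems


# Hypothetical examples of a well-formed and a malformed column dict.
good = {"question": ["q1"], "answer": ["a1"], "contexts": [["c1", "c2"]], "ground_truth": ["g1"]}
bad = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["c1"], "not a list"],
    "ground_truth": ["g1", float("nan")],
}
good_problems = validate_ragas_columns(good)
bad_problems = validate_ragas_columns(bad)
```

An empty result means the dict passed these particular checks; it does not guarantee that every metric can be computed.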

# Evaluation

In [11]:
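# Optional: trace runs with LangSmith. Sign up for an API key here: https://smith.langchain.com.
# Skip or comment out this cell to run the evaluation without tracing.
from langsmith import Client
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"] = "Cohere_RAG_Eval"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ values must be strings, so fail early with a clear message if the key is unset
api_key = os.getenv("LANGSMITH_API_KEY")
if api_key is None:
    raise RuntimeError("Set LANGSMITH_API_KEY (for example in .env) before enabling tracing.")
os.environ["LANGSMITH_API_KEY"] = api_key
client = Client()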

In [86]:
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
result = evaluate(
    dataset,
    metrics=[
        answer_relevancy,
        faithfulness,
        context_recall,
        context_precision,
    ],
)

result

Evaluating:   0%|          | 0/28 [00:00<?, ?it/s]

{'answer_relevancy': 0.5457, 'faithfulness': 0.7483, 'context_recall': 0.3766, 'context_precision': 0.5714}
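The aggregate scores appear to be the plain averages of the per-question scores. The quick check below uses only the standard library, with per-row values copied by hand from this notebook's `result.to_pandas()` output; it is a sanity check on this run, not a statement about how `ragas` aggregates in general.

```python
from statistics import mean

# Per-question scores copied from the result.to_pandas() output of this run.
per_row = {
    "answer_relevancy": [0.0, 0.871751, 0.98487, 0.989385, 0.0, 0.974083, 0.0],
    "faithfulness": [1.0, 0.714286, 1.0, 1.0, 0.666667, 0.857143, 0.0],
    "context_recall": [0.0, 0.636364, 1.0, 0.0, 0.0, 1.0, 0.0],
    "context_precision": [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
}

# Aggregate scores as printed by `result`.
reported = {
    "answer_relevancy": 0.5457,
    "faithfulness": 0.7483,
    "context_recall": 0.3766,
    "context_precision": 0.5714,
}

averages = {name: round(mean(scores), 4) for name, scores in per_row.items()}
matches = {name: abs(averages[name] - reported[name]) < 1e-9 for name in reported}
```

With only seven questions, a single refusal ("I could not find any information") moves each average by up to about 0.14, so the aggregate numbers should be read with that in mind.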

In [87]:
df = result.to_pandas()
df

| | question | answer | contexts | ground_truth | answer_relevancy | faithfulness | context_recall | context_precision |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of adding a header image t... | I'm sorry, I could not find any information ab... | [Description#F4F5F7In a sentence or two, descr... | Adding a header image to a Confluence space en... | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | How do employee engagement and disengagement d... | Employee engagement refers to the emotional co... | [Employee Engagement?Employee engagement refer... | Employee engagement and disengagement differ i... | 0.871751 | 0.714286 | 0.636364 | 1.0 |
| 2 | What services does the Employee Assistance Pro... | The Employee Assistance Program (EAP) provides... | [take appropriate action to ensure a safe and ... | Employees can contact the Employee Assistance ... | 0.98487 | 1.0 | 1.0 | 1.0 |
| 3 | What training programs are offered in data sci... | Tech Innovators Inc. offers training programs ... | [training programs in cloud computing, data sc... | Courses covering financial analysis, budgeting... | 0.989385 | 1.0 | 0.0 | 0.0 |
| 4 | What resources should be added for new hires i... | I'm sorry, I could not find any information ab... | [Offers are extended promptly, and candidates ... | Add resources for new hires in the onboarding ... | 0.0 | 0.666667 | 0.0 | 1.0 |
| 5 | What is Tech Innovators Inc.'s approach to wor... | Tech Innovators Inc. has a zero-tolerance poli... | [is the process for handling workplace harassm... | Tech Innovators Inc. has a zero-tolerance poli... | 0.974083 | 0.857143 | 1.0 | 1.0 |
| 6 | What is the significance of identifying growth... | I'm sorry, I could not find any information ab... | [No contexts] | Identifying growth areas in self-assessment is... | 0.0 | 0.0 | 0.0 | 0.0 |


In [88]:
# Save the result data
file_name = "eval_result_dataset_hr_cohere_deployment.csv"
result.to_pandas().to_csv(os.path.join(from_root(), "data-test/test-dataset/", file_name), index=False)