# Overview

This Jupyter notebook evaluates the performance of the Cohere toolkit RAG system using RAGAS (Retrieval Augmented Generation Assessment), an evaluation framework for RAG pipelines. The evaluation is based on a test dataset of questions, answers, and contexts collected from a chat history extraction process.

**Key Steps in the Notebook**

1. **Loading Environment Variables and Dependencies**: The notebook loads environment variables from a `.env` file with `python-dotenv` and imports the packages it depends on, including `langsmith`, `ragas`, `datasets`, and `from_root`.
2. **Loading Test Data**: The notebook loads the test dataset from a CSV file, which contains questions, answers, and contexts.
3. **Data Preprocessing**: The `contexts` column is stored in the CSV as JSON strings. `load_dataframe_with_list_column` restores it to Python lists with `deserialize_list`, and `save_dataframe_with_list_column` applies `serialize_list` before writing a DataFrame back to CSV.
4. **Converting to RAGAS Format**: The notebook converts the preprocessed data to the RAGAS format using the `Dataset` class from the `datasets` library.
5. **Setting up LangSmith Client**: The notebook sets up a LangSmith client using the `Client` class from the `langsmith` library, configuring the LangChain endpoint, project, tracing flag, and API key through environment variables.
6. **Running RAGAS Evaluation**: The notebook runs the RAGAS evaluation using the `evaluate` function from the `ragas` library, specifying the metrics to evaluate, including answer relevancy and faithfulness.
7. **Saving Evaluation Results**: The notebook saves the evaluation results to a new CSV file using the `to_csv` method.

**Evaluation Metrics**

The notebook evaluates the following two metrics:

1. **Answer Relevancy**: Measures how directly the system's answer addresses the question. Incomplete or off-topic answers score lower.
2. **Faithfulness**: Measures whether the claims made in the system's answer are supported by the provided context.
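
To make the two scores more concrete, here is a minimal sketch of the arithmetic behind them, as roughly described in the RAGAS documentation: faithfulness as the fraction of answer claims supported by the context, and answer relevancy as the mean cosine similarity between the question and questions regenerated from the answer. The function names, the verdict list, and the toy vectors are assumptions made for illustration. In RAGAS itself, an LLM and an embedding model produce these inputs.

```python
import math


def faithfulness_score(claim_supported):
    """Fraction of answer claims judged as supported by the context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean cosine similarity between the question and regenerated questions."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Toy inputs (made up): three of four claims are supported by the context.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 0.75
```

Both scores fall between 0 and 1 for typical inputs, and higher is better.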

**Recommendations for Future Improvements**

1. **Increase Sample Size**: The notebook only evaluates the first 10 system responses. A larger sample would give a more representative picture of the system's performance.
2. **Add More Evaluation Metrics**: The notebook only evaluates two metrics. Adding more, such as context recall and context precision, would also cover the retrieval side of the pipeline.
3. **Refine Evaluation Metrics**: The notebook uses the predefined metrics as they are. Adapting them to the specific use case or application can give more accurate evaluation results.


In [2]:
# Load environment variables from the .env file
import os

from dotenv import load_dotenv
from from_root import from_root

load_dotenv(encoding='utf-8')

# Evaluation

In [1]:
import json
import pandas as pd

def serialize_list(value):
    """Serializes a list to a JSON string."""
    return json.dumps(value)

def deserialize_list(value):
    """Deserializes a JSON string back into a list."""
    return json.loads(value)

def save_dataframe_with_list_column(df, filename):
    """Saves a DataFrame with a list column to a CSV file, preserving the list structure.

    Args:
        df: The DataFrame to save.
        filename: The name of the output CSV file.
    """

    # Work on a copy so the caller's DataFrame keeps its list column
    df = df.copy()
    df['contexts'] = df['contexts'].apply(serialize_list)

    # Save the DataFrame to CSV
    df.to_csv(filename, index=False)

def load_dataframe_with_list_column(filename):
    """Loads a DataFrame from a CSV file, restoring the list structure.

    Args:
        filename: The name of the input CSV file.

    Returns:
        The loaded DataFrame.
    """

    # Load the DataFrame
    df = pd.read_csv(filename)

    # Apply the deserialization function to the list column
    df['contexts'] = df['contexts'].apply(deserialize_list)

    return df
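
A CSV cell can only hold text, which is why the helpers above encode the list column as JSON before saving and decode it after loading. The round trip can be sketched as follows. To stay dependency-free, this sketch uses the standard-library `csv` module and an in-memory buffer rather than pandas and a file on disk, and the sample row is made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample row; the contexts contain a comma and quotes on purpose.
rows = [{"question": "What is RAG?", "contexts": ["doc one, with a comma", 'doc "two"']}]

# Write: encode the list column as a JSON string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "contexts"])
writer.writeheader()
for row in rows:
    writer.writerow({"question": row["question"], "contexts": json.dumps(row["contexts"])})

# Read: decode the JSON string back into a list.
buffer.seek(0)
restored = [
    {"question": r["question"], "contexts": json.loads(r["contexts"])}
    for r in csv.DictReader(buffer)
]

print(restored == rows)  # True
```

JSON keeps the list boundaries unambiguous even when a context string itself contains commas or quotes.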

## Load the test data from the chat history extraction process

In [3]:
import pandas as pd
from from_root import from_root
file_name = "test_dataset_it_openai_deployment_post_prod.csv"
df_question_answer_contexts = load_dataframe_with_list_column(os.path.join(from_root(), "data-test/test-dataset/", file_name))

In [9]:
data_to_test = df_question_answer_contexts[['question', 'answer', 'contexts']]

## Convert to RAGAS format

**Let's evaluate the first 10 system responses.**

In [12]:
from datasets import Dataset
question = list(data_to_test['question'])
answer = list(data_to_test['answer'])
contexts = list(data_to_test['contexts'])

data_samples = {
    'question': question,
    'answer': answer,
    'contexts': contexts,
}

dataset = Dataset.from_dict(data_samples)
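
The cell above passes every row of `data_to_test` to the dataset. If the intent is to evaluate only the first 10 responses, the columns can be truncated before conversion. The following is a minimal sketch, not the notebook's own code: `build_samples` and its `limit` parameter are hypothetical names, and the check that each `contexts` entry is a list of strings reflects the input shape RAGAS is understood to expect.

```python
def build_samples(questions, answers, contexts, limit=None):
    """Assemble column-oriented samples, optionally keeping the first `limit` rows."""
    if not (len(questions) == len(answers) == len(contexts)):
        raise ValueError("question, answer and contexts must have the same length")
    for i, ctx in enumerate(contexts):
        # Assumption: each row's contexts must be a list of strings.
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError(f"contexts[{i}] must be a list of strings")
    end = len(questions) if limit is None else limit
    return {
        "question": list(questions[:end]),
        "answer": list(answers[:end]),
        "contexts": list(contexts[:end]),
    }


# Toy data (made up): 12 rows, of which only the first 10 are kept.
questions = [f"q{i}" for i in range(12)]
answers = [f"a{i}" for i in range(12)]
ctxs = [[f"context for q{i}"] for i in range(12)]

samples = build_samples(questions, answers, ctxs, limit=10)
print(len(samples["question"]))  # 10
```

The returned dictionary has the same shape as `data_samples`, so it could be passed to `Dataset.from_dict` in the same way. With pandas, calling `.head(10)` on the DataFrame before building the lists would have the same effect.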

In [79]:
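from langsmith import Client

# LANGCHAIN_PROJECT and LANGSMITH_API_KEY are expected to come from the .env file.
# Writing os.getenv(...) back into os.environ raises a TypeError when the
# variable is unset, so check for the variables explicitly instead.
for var in ("LANGCHAIN_PROJECT", "LANGSMITH_API_KEY"):
    if not os.getenv(var):
        raise EnvironmentError(f"{var} is not set; add it to the .env file")

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
client = Client()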

## Run RAGAS evaluation

In [14]:
from ragas import evaluate
# from ragas.integrations.langsmith import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
)

In [15]:
result = evaluate(
    dataset,
    metrics=[
        answer_relevancy,
        faithfulness,
    ],
)

Evaluating:   0%|          | 0/14 [00:00<?, ?it/s]

In [16]:
# Save the per-sample scores as a CSV file
file_name = "eval_result_post_prod_test_dataset_it_openai_deployment.csv"
csv_file_path = os.path.join(from_root(), "data-test/eval-result/", file_name)
result.to_pandas().to_csv(csv_file_path, index=False)
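
Once the per-sample scores are saved, a quick summary helps when reading them. The sketch below averages each metric while skipping missing or NaN scores, which can appear when a sample fails to evaluate. It assumes the result columns are named `answer_relevancy` and `faithfulness` after the metrics. `summarize_scores` is a hypothetical helper and the rows are made-up values, not results from this notebook.

```python
import math
from statistics import mean


def summarize_scores(rows, metrics=("answer_relevancy", "faithfulness")):
    """Return the mean of each metric, ignoring missing or NaN scores."""
    summary = {}
    for metric in metrics:
        values = []
        for row in rows:
            raw = row.get(metric)
            if raw is None or raw == "":
                continue  # score missing for this sample
            score = float(raw)
            if not math.isnan(score):
                values.append(score)
        summary[metric] = mean(values) if values else None
    return summary


# Made-up rows, shaped like records read back from the results CSV.
rows = [
    {"answer_relevancy": 0.9, "faithfulness": 1.0},
    {"answer_relevancy": 0.7, "faithfulness": float("nan")},
]
print(summarize_scores(rows))
```

The same function works on rows read with `csv.DictReader`, because string scores are converted with `float` before averaging.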