# Overview

This Jupyter notebook evaluates the OpenAI deployment within the Cohere Toolkit on a test dataset of HR-related questions. It loads the test data, merges it with the ground truth, converts the result to the RAGAS dataset format, and scores it with several RAGAS metrics.

**Initialization and Data Loading**

1. The notebook starts by loading environment variables from a `.env` file and importing necessary libraries.
2. It defines helper functions that serialize lists to JSON strings and deserialize them back, which are used to store and load the `contexts` column of the test data.
3. The notebook loads the test dataset from a CSV file using the `load_dataframe_with_list_column` function.

**Data Preprocessing**

1. The notebook imports the ground truth information from a JSON file.
2. It merges the test dataset with the ground truth information on the question content.
3. The resulting dataset is converted to the RAGAS dataset format using the `Dataset.from_dict` function.

**Evaluation**

1. The notebook scores the deployment's answers on the test dataset with RAGAS, using the following metrics:
	* Answer Relevancy: measures how relevant the generated answer is to the question.
	* Faithfulness: measures how factually consistent the generated answer is with the retrieved contexts (not with the ground truth).
	* Context Recall: measures how much of the ground truth answer can be attributed to the retrieved contexts.
	* Context Precision: measures whether the retrieved contexts that are relevant to the question are ranked near the top.
2. The evaluation results are stored in a `result` object and converted to a pandas DataFrame.

**Result Saving**

1. The evaluation results are saved to a CSV file in the `data-test/eval-result` directory.

**LangSmith Tracing (Optional)**

1. The Evaluation section contains an optional, commented-out cell for tracing runs with LangSmith. Uncomment it to enable tracing.

**Result**

The final result is a pandas DataFrame containing the per-question evaluation metrics for the test dataset. The DataFrame is saved to a CSV file for further analysis.


# Prepare testing data

In [4]:
import os
from dotenv import load_dotenv
load_dotenv(encoding='utf-8')

True
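
The `True` above shows that `load_dotenv` loaded something, not that every key the later cells need is present. The following is a minimal sketch of an explicit check. The helper `missing_env_vars` and the list `REQUIRED_VARS` are hypothetical, and `OPENAI_API_KEY` is an assumption about what the evaluation needs; the notebook itself only names `LANGSMITH_API_KEY`.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names from `names` that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Hypothetical list of required variables; adjust it to the deployment in use.
REQUIRED_VARS = ["OPENAI_API_KEY"]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running a check like this right after `load_dotenv` surfaces a missing key before the evaluation starts rather than partway through it.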

**The notebook `2-chat-history-extraction.ipynb` serialized the `contexts` column when it saved the test data. Loading that file back into a DataFrame therefore requires a function that deserializes the column.**

In [2]:
import json
import pandas as pd

def serialize_list(value):
    """Serializes a list to a JSON string."""
    return json.dumps(value)

def deserialize_list(value):
    """Deserializes a JSON string back into a list."""
    return json.loads(value)

def save_dataframe_with_list_column(df, filename):
    """Saves a DataFrame with a list column to a CSV file, preserving the list structure.

    Args:
        df: The DataFrame to save.
        filename: The name of the output CSV file.
    """

    # Serialize on a copy so the caller's DataFrame keeps its list column
    df_out = df.copy()
    df_out['contexts'] = df_out['contexts'].apply(serialize_list)

    # Save the DataFrame to CSV
    df_out.to_csv(filename, index=False)

def load_dataframe_with_list_column(filename):
    """Loads a DataFrame from a CSV file, restoring the list structure.

    Args:
        filename: The name of the input CSV file.

    Returns:
        The loaded DataFrame.
    """

    # Load the DataFrame
    df = pd.read_csv(filename)

    # Apply the deserialization function to the list column
    df['contexts'] = df['contexts'].apply(deserialize_list)

    return df
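
To illustrate why the deserialization step is needed, here is a small round trip through CSV. It is a sketch that uses only the standard library's `csv` module instead of pandas to stay dependency-free, and the sample question and contexts are made up.

```python
import csv
import io
import json

contexts = ["first retrieved chunk", "second chunk, with a comma"]

# Write one row whose 'contexts' cell holds the JSON-serialized list.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "contexts"])
writer.writeheader()
writer.writerow({"question": "example question", "contexts": json.dumps(contexts)})

# Read the row back: the cell is a plain string until it is deserialized.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_cell = row["contexts"]
restored = json.loads(raw_cell)

print(type(raw_cell).__name__, type(restored).__name__)
```

CSV has no list type, so every cell comes back as text. `json.loads` turns that text back into the original list, which is what `load_dataframe_with_list_column` does for each row.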

## Load the test data from the chat history extraction process

In [5]:
import pandas as pd
from from_root import from_root
file_name = "test_dataset_hr_openai_deployment_test.csv"
df_question_answer_contexts = load_dataframe_with_list_column(os.path.join(from_root(), "data-test/test-dataset/", file_name))

## Adding ground truth

In [6]:
# Import the ground truth from the test question set 
df_ground_truth = pd.read_json(os.path.join(from_root(), "data-test/test-dataset/", "test_dataset_hr.json"))

In [7]:
# Merge the dataframe with the ground_truth by the question content
data_to_test = pd.merge(df_question_answer_contexts[['question', 'answer', 'contexts']], df_ground_truth, on='question', how='left')
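
Because the merge matches on the exact question text and uses `how='left'`, any test question without an identical counterpart in the ground truth file ends up with a missing `ground_truth`. The sketch below lists such questions before merging. The helper `find_unmatched_questions` and the sample questions are hypothetical.

```python
def find_unmatched_questions(test_questions, truth_questions):
    """Return test questions with no exact match among the ground-truth questions."""
    truth = set(truth_questions)
    return [q for q in test_questions if q not in truth]

# Hypothetical example data: the second question differs by trailing whitespace.
test_questions = ["What is the leave policy?", "How do I contact HR? "]
truth_questions = ["What is the leave policy?", "How do I contact HR?"]

unmatched = find_unmatched_questions(test_questions, truth_questions)
print(unmatched)
```

In this notebook the helper could be called with `df_question_answer_contexts['question']` and `df_ground_truth['question']`, since it only iterates over its arguments.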

## Convert to RAGAS data format

In [9]:
# Convert testing data to RAGAS Dataset format
from datasets import Dataset

question = list(data_to_test['question'])
answer = list(data_to_test['answer'])
contexts = list(data_to_test['contexts'])
ground_truth = list(data_to_test['ground_truth'])

data = {
    'question': question,
    'answer': answer,
    'contexts': contexts,
    'ground_truth': ground_truth
}

dataset = Dataset.from_dict(data)
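
Before building the dataset, it can help to confirm that the columns have the expected shape. The sketch below assumes that the metrics expect `contexts` to be a list of strings per row and `ground_truth` to be a single string per row. That is an assumption about RAGAS, not something this notebook states, and `validate_ragas_columns` is a hypothetical helper.

```python
def validate_ragas_columns(data):
    """Return a list of problems found in a column dict bound for Dataset.from_dict."""
    problems = []
    expected = ["question", "answer", "contexts", "ground_truth"]

    missing = [name for name in expected if name not in data]
    if missing:
        return [f"missing columns: {missing}"]

    lengths = {name: len(data[name]) for name in expected}
    if len(set(lengths.values())) > 1:
        problems.append(f"columns differ in length: {lengths}")

    # Assumed shape: one list of strings per row.
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts is not a list of strings")

    # Assumed shape: one string per row (NaN from a failed merge is a float).
    for i, truth in enumerate(data["ground_truth"]):
        if not isinstance(truth, str):
            problems.append(f"row {i}: ground_truth is not a string")

    return problems

# Hypothetical sample with two typical defects in the second row.
sample = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["c1"], '["c2"]'],        # second entry was never deserialized
    "ground_truth": ["g1", float("nan")],  # second entry is missing after a merge
}
problems = validate_ragas_columns(sample)
print(problems)
```

Calling `validate_ragas_columns(data)` on the `data` dict built above would return an empty list if every row has the assumed shape.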

# Evaluation

In [16]:
# # Optional, uncomment to trace runs with LangSmith. Sign up here: https://smith.langchain.com.
# from langsmith import Client
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# os.environ["LANGCHAIN_PROJECT"] = "Cohere_RAG_Eval"
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGSMITH_API_KEY"] = os.getenv("LANGSMITH_API_KEY")
# client = Client()

In [10]:
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
result = evaluate(
    dataset,
    metrics=[
        answer_relevancy,
        faithfulness,
        context_recall,
        context_precision,
    ],
)

result

Evaluating:   0%|          | 0/28 [00:00<?, ?it/s]

{'answer_relevancy': 0.8137, 'faithfulness': 0.6284, 'context_recall': 0.4286, 'context_precision': 0.4601}
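
Context precision is sensitive to ranking, not just to how many retrieved chunks are relevant. The sketch below illustrates this, assuming the rank-weighted definition described in the RAGAS documentation (the average of precision@k over the ranks that hold a relevant chunk). In RAGAS the relevance verdicts come from an LLM judge; here they are hard-coded, hypothetical values, and `context_precision` is a stand-in function, not the library's implementation.

```python
def context_precision(relevance):
    """Rank-weighted precision over retrieved chunks.

    relevance: list of 0/1 verdicts, one per retrieved chunk, in ranked order.
    """
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0

# Hypothetical verdicts: two relevant chunks out of three, ranked differently.
print(context_precision([1, 1, 0]))
print(context_precision([0, 1, 1]))
```

Under this definition the first ranking scores higher than the second even though both retrieve two relevant chunks, because the second ranking places an irrelevant chunk first.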

In [13]:
df = result.to_pandas()
df

| | question | answer | contexts | ground_truth | answer_relevancy | faithfulness | context_recall | context_precision |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of adding a header image t... | [openai]: Adding a header image to a Confluenc... | [Description#F4F5F7In a sentence or two, descr... | Adding a header image to a Confluence space en... | 0.950839 | 0.071429 | 0.0 | 0.0 |
| 1 | How do employee engagement and disengagement d... | [openai]: Employee engagement and disengagemen... | [and disengagement are two sides of the same c... | Employee engagement and disengagement differ i... | 0.906978 | 0.727273 | 1.0 | 0.8875 |
| 2 | What is the significance of identifying growth... | [openai]: Identifying growth areas in self-ass... | [assessments are used to ensure a fair compari... | Identifying growth areas in self-assessment is... | 0.988894 | 1.0 | 0.0 | 0.0 |
| 3 | What services does the Employee Assistance Pro... | [openai]: The Employee Assistance Program (EAP... | [and reach out to HR with any questions or con... | Employees can contact the Employee Assistance ... | 0.945168 | 0.6 | 1.0 | 0.583333 |
| 4 | What training programs are offered in data sci... | [openai]: Tech Innovators Inc. offers a range ... | [experience with cloud platforms and tools.Tra... | Courses covering financial analysis, budgeting... | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | What resources should be added for new hires i... | [openai]: To enhance the onboarding process fo... | [Offers are extended promptly, and candidates ... | Add resources for new hires in the onboarding ... | 0.930121 | 0.0 | 0.0 | 0.75 |
| 6 | What is Tech Innovators Inc.'s approach to wor... | [openai]: Tech Innovators Inc. has a zero-tole... | [is the process for handling workplace harassm... | Tech Innovators Inc. has a zero-tolerance poli... | 0.974083 | 1.0 | 1.0 | 1.0 |
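
The aggregate scores printed by `result` appear to be the plain means of these per-question columns. The check below is a sketch that copies the seven rows from the table above by hand and averages them with the standard library.

```python
from statistics import mean

# Per-question scores copied from the table above.
per_question = {
    "answer_relevancy": [0.950839, 0.906978, 0.988894, 0.945168, 0.0, 0.930121, 0.974083],
    "faithfulness": [0.071429, 0.727273, 1.0, 0.6, 1.0, 0.0, 1.0],
    "context_recall": [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    "context_precision": [0.0, 0.8875, 0.0, 0.583333, 0.0, 0.75, 1.0],
}

aggregate = {name: round(mean(scores), 4) for name, scores in per_question.items()}
print(aggregate)
# {'answer_relevancy': 0.8137, 'faithfulness': 0.6284, 'context_recall': 0.4286, 'context_precision': 0.4601}
```

With only seven questions, a single row moves each average noticeably. Row 4, for example, scores 0.0 on answer relevancy and accounts for most of the gap between 0.8137 and the other rows, which all score above 0.9.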


In [15]:
# Save the result data
file_name = "eval_result_dataset_hr_openai_deployment.csv"
result.to_pandas().to_csv(os.path.join(from_root(), "data-test/eval-result/", file_name), index=False)
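
`to_csv` does not create missing parent directories, so the save above fails if `data-test/eval-result` does not exist yet. The following sketch creates the directory first. `build_output_path` is a hypothetical helper, and the temporary directory stands in for the project root returned by `from_root()`.

```python
import os
import tempfile

def build_output_path(root, subdir, file_name):
    """Create root/subdir if needed and return the full path to file_name."""
    out_dir = os.path.join(root, subdir)
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory already exists
    return os.path.join(out_dir, file_name)

# Demonstration in a temporary directory standing in for the project root.
with tempfile.TemporaryDirectory() as tmp_root:
    out_path = build_output_path(tmp_root, "data-test/eval-result", "example.csv")
    dir_created = os.path.isdir(os.path.dirname(out_path))

print(dir_created)
```

In the notebook, the returned path could be passed directly to `result.to_pandas().to_csv(..., index=False)`.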