### Evaluation: LLM as a Judge

- LLM as a judge refers to using large language models (LLMs) to evaluate chat assistants by judging their responses to open-ended questions, aiming to approximate human preferences in a scalable and explainable way

- This approach offers scalability by reducing the need for human involvement in evaluations, enabling faster iterations and benchmarks, and explainability by providing not only scores but also reasons behind those scores

- However, LLM judges face challenges such as biases, including a tendency to favor answers generated by themselves (self-enhancement bias), and limited ability in grading math and reasoning questions

- To address these limitations, methods like chain-of-thought and reference-guided judging have been proposed, aiming to improve the accuracy of LLM judges in evaluating complex questions

### Prompt design for LLM #1 to generate answers

In [None]:
qna_system_message = """<YOUR Q&A SYSTEM PROMPT HERE>"""

print(qna_system_message)

In [None]:
qna_user_message_template = """
###Context
Here are some documents that are relevant to the question mentioned below.\n
```{context}```

###Question
```{question}```

###Answer"""
print(qna_user_message_template)

### Prompt design for LLM #2 to evaluate answers generated by LLM #1

In [None]:
rater_system_message_v1 = """<YOUR RATING SYSTEM PROMPT HERE>"""

print(rater_system_message_v1)

Notice how we are providing specific instructions on how the rating should be done based on two parameters:
 - faithfulness to the context, that is, if the context is used correctly to create the response
 - relevance of response, that is, if the response is relevant to the query posed by the user

The user message for the rater is a collection of placeholders for the query, context and the response.

In [None]:
rater_user_message_template = """
###Query
{query}

###Context
{context}

###Response
{response}"""

print(rater_user_message_template)
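
To make the role of the placeholders concrete, the sketch below fills the same template with plain `str.format`. The values passed in are hypothetical stand-ins; in the notebook, the substitution is done later by `ChatPromptTemplate`, which by default uses the same `{name}` placeholder style.

```python
# Same template text as the rater user message defined above
rater_user_message_template = """
###Query
{query}

###Context
{context}

###Response
{response}"""

# Hypothetical values, used only to show what the rater will receive
filled_rater_message = rater_user_message_template.format(
    query="What is the revenue from the mining services?",
    context="<retrieved document chunks>",
    response="<answer produced by LLM #1>",
)

print(filled_rater_message)
```

One practical consequence of this placeholder style: if the rater system prompt contains literal curly braces (for example, a sample JSON object showing the expected output), they likely need to be doubled (`{{` and `}}`) so that they are not treated as placeholders.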

Now, we need to gather a collection of gold queries to evaluate the performance of the large language model (LLM). Because another LLM rates the responses generated by the focal LLM, no human-annotated reference answers are required. The gold queries should be sourced from the stakeholders who will ultimately use the retrieval system.

In [None]:
gold_queries = [
"what is the consolidated total income?",
"What is the revenue from the mining services?",
"what are the details mentioned for green hydrogen ecosystem in the earning call?",
"How many commercial blocks does the company own now and where are they located?",
"What is the attrition in this quater?"
]

Notice how the gold queries are a mix of both subjective and factual questions. We can now run the evaluation step on the prompts using the gold queries. The workflow of code that executes this step is presented in the figure below.

### Loading libraries

In [None]:
# Import the os module to interact with the operating system environment variables
import os

# Import the json module for handling JSON data
import json

# Import ChatOpenAI class from langchain_openai package for interacting with OpenAI chat models
from langchain_openai import ChatOpenAI

# Import OpenAIEmbeddings class from langchain_openai package for working with embeddings generated by OpenAI models
from langchain_openai import OpenAIEmbeddings 

# Import PineconeVectorStore class from langchain_pinecone package for managing vector storage in Pinecone
from langchain_pinecone import PineconeVectorStore 

# Import ChatPromptTemplate class from langchain_core.prompts for creating chat prompt templates
from langchain_core.prompts import ChatPromptTemplate

# Import StrOutputParser class from langchain_core.output_parsers for parsing string outputs
from langchain_core.output_parsers import StrOutputParser

# Import RunnablePassthrough class from langchain_core.runnables for passing inputs through a chain unchanged
from langchain_core.runnables import RunnablePassthrough

# Import RunnableParallel class from langchain_core.runnables for running tasks in parallel
from langchain_core.runnables import RunnableParallel

### Setting environment variables

In [None]:
# Set the LANGCHAIN_TRACING_V2 environment variable to enable tracing (e.g., logging)
os.environ["LANGCHAIN_TRACING_V2"] = "true"

# Set the LANGCHAIN_API_KEY environment variable with your LangSmith API key (used to authenticate tracing)
os.environ["LANGCHAIN_API_KEY"] = "<YOUR API KEY HERE>"

# Set the LANGCHAIN_PROJECT environment variable to specify the project name or identifier
os.environ["LANGCHAIN_PROJECT"] = "<YOUR PROJECT NAME HERE>"

# Set the OPENAI_API_KEY environment variable with your OpenAI API key for accessing OpenAI services
os.environ["OPENAI_API_KEY"] = "<YOUR API KEY HERE>"

# Set the PINECONE_API_KEY environment variable with your Pinecone API key for accessing Pinecone services
os.environ["PINECONE_API_KEY"] = "<YOUR API KEY HERE>"

### Creating RAG pipeline

In [None]:
# Define the index name for the Pinecone vector store
INDEX_NAME = '<YOUR CODE HERE>'

# Initialize the OpenAIEmbeddings object with the specified model for generating text embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Initialize the ChatOpenAI object with GPT-3.5 Turbo model and set the temperature parameter
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create a PineconeVectorStore instance using the specified index name and embeddings
docsearch = PineconeVectorStore(index_name=INDEX_NAME, embedding=embeddings)

# Convert the PineconeVectorStore instance into a retriever object with custom search parameters
retriever = <YOUR CODE HERE>

# Create a ChatPromptTemplate from predefined message templates for Q&A interactions
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system", qna_system_message),
        ("human", qna_user_message_template),
    ]
)

In [None]:
# Define a function to format document content by joining page contents with newlines
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Construct a runnable chain that formats documents, applies a chat template, processes through a language model,
# and parses the output as a string
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))  # Format document content
    | chat_template
    | llm
    | StrOutputParser()
)

# Create a parallel runnable chain that retrieves documents based on search criteria and formats them,
# then combines the formatted documents with a question to process through the language model and parse the output
rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)
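
The data flow of `rag_chain_with_source` can be easier to follow in plain Python. The sketch below mimics it with hypothetical stand-ins (`fake_retriever`, `fake_llm`) and plain strings instead of LangChain `Document` objects; it is an illustration of the shape of the result, not the library's implementation.

```python
def fake_retriever(question):
    # Hypothetical stand-in for the vector-store retriever: returns chunk texts
    return ["chunk A", "chunk B"]


def fake_llm(prompt):
    # Hypothetical stand-in for the chat model
    return f"answer based on a prompt of {len(prompt)} characters"


def format_chunks(chunks):
    # Same idea as format_docs: join chunk texts with blank lines
    return "\n\n".join(chunks)


def rag_with_source(question):
    # Step 1 (RunnableParallel): the same input feeds both branches
    state = {"context": fake_retriever(question), "question": question}

    # Step 2 (.assign(answer=...)): compute the answer from the current state
    # and add it as a new key, keeping "context" and "question" untouched
    prompt = (
        f"Context:\n{format_chunks(state['context'])}\n\n"
        f"Question: {state['question']}"
    )
    state["answer"] = fake_llm(prompt)
    return state


result = rag_with_source("What is the revenue from the mining services?")
print(sorted(result.keys()))
```

Because the raw retrieved chunks stay under `"context"`, the same result can later supply both the answer and the evidence that the rater needs.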

### Generating responses

In [None]:
# Initialize lists to hold the predictions for the gold queries and the contexts used to produce them
predictions_gold_queries, context_for_predictions = [], []

# Iterate over each query in the gold standard queries
for query in gold_queries:
    try:
        # Invoke the runnable chain with the current query to get a response
        response = rag_chain_with_source.invoke(query)
        
        # Append the joined page contents of the retrieved documents as the context for predictions
        context_for_predictions.append("\n\n".join(doc.page_content for doc in response.get("context")))
        
        # Append the answer from the response to the list of predictions
        predictions_gold_queries.append(
            response.get("answer")
        )
        
    except Exception as e:
        # Print the exception if an error occurs during processing
        print(e)
        # Continue with the next iteration to avoid stopping the loop due to an exception
        continue
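
One caveat with the loop above: a query that raises an exception is skipped, so the two lists can end up shorter than `gold_queries`, and a later `zip` over all three would pair queries with the wrong contexts and answers. A minimal sketch of one way to keep everything aligned is to store one record per query. The helper `collect_predictions` and the stub `flaky_chain` are hypothetical names used only for illustration.

```python
def collect_predictions(queries, answer_fn):
    """Return one record per query, including queries that failed."""
    records = []
    for query in queries:
        try:
            result = answer_fn(query)
            records.append({
                "query": query,
                "context": result["context"],
                "answer": result["answer"],
                "error": None,
            })
        except Exception as exc:
            # Keep a placeholder so positions still match the input queries
            records.append({
                "query": query,
                "context": None,
                "answer": None,
                "error": str(exc),
            })
    return records


def flaky_chain(query):
    # Hypothetical stand-in for the RAG chain that fails on one query
    if "attrition" in query:
        raise RuntimeError("simulated failure")
    return {"context": "some context", "answer": "some answer"}


records = collect_predictions(
    ["what is the consolidated total income?", "what is the attrition?"],
    flaky_chain,
)
print([record["error"] for record in records])
```

Records with a non-empty `error` can then be filtered out or retried before rating, without disturbing the pairing of the remaining queries.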

Let us observe the predictions to the gold queries.

In [None]:
predictions_gold_queries

### Creating evaluation pipeline

With these responses in place, we can now present all the three components - query, context and response - to the rating LLM and collect the ratings.

In [None]:
# Initialize an empty list to store rating information
ratings = []

# Iterate over each query, context, and prediction simultaneously using zip
for query, context, prediction in zip(gold_queries, context_for_predictions, predictions_gold_queries):
    # Create a ChatPromptTemplate for rating purposes with predefined system and user messages
    prompt_for_rating = ChatPromptTemplate.from_messages(
        [
            ("system", rater_system_message_v1),
            ("human", rater_user_message_template),
        ]
    )
    
    # Create a pipeline starting with the rating prompt and ending with the language model
    pipe = prompt_for_rating | llm
    
    try:
        # Invoke the pipeline with the query, context, and prediction to get a response
        response = pipe.invoke({"query": query, "context": context, "response": prediction})
        
        # Convert the JSON string response content to a Python dictionary
        response_json = json.loads(response.content)
        
        # Append the rating information including query, prediction, rating score, and rationale to the ratings list
        ratings.append({
            'query': query,
            'answer': prediction,
            'rating': response_json['Rating'],
            'rationale': response_json['Rationale']
        })
        
    except Exception as e:
        # Print the exception if an error occurs during processing
        print(e)
        # Continue with the next iteration to avoid stopping the loop due to an exception
        continue
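
The loop above assumes the rater replies with a bare JSON object containing `Rating` and `Rationale` keys. Chat models sometimes wrap JSON in a Markdown code fence or omit a key, in which case the query is silently dropped. The helper below is a hedged sketch of a more tolerant parser; `parse_rating` is a hypothetical name, not part of LangChain.

```python
import json

# Three backticks, built this way so the snippet itself is safe to paste into Markdown
FENCE = "`" * 3


def parse_rating(raw_text):
    """Parse a rater reply into a dict with 'Rating' and 'Rationale' keys."""
    text = raw_text.strip()

    # Unwrap a surrounding Markdown code fence, if the model added one
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[len("json"):]

    parsed = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(parsed, dict):
        raise ValueError("rater reply is not a JSON object")

    missing = {"Rating", "Rationale"} - set(parsed)
    if missing:
        raise ValueError(f"rater reply is missing keys: {sorted(missing)}")
    return parsed


plain_reply = '{"Rating": 5, "Rationale": "Grounded in the context."}'
fenced_reply = FENCE + "json\n" + plain_reply + "\n" + FENCE

print(parse_rating(plain_reply)["Rating"])
print(parse_rating(fenced_reply)["Rationale"])
```

Since every failure surfaces as a `ValueError`, the calling loop can catch that one exception type and record which query produced an unusable rating.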

In [None]:
ratings

Since the ratings are collected as a list of dictionaries, we can inspect them by converting the list to a DataFrame.

In [None]:
import pandas as pd
pd.DataFrame(ratings)

We can compute the mean rating for the gold queries like so:

In [None]:
pd.DataFrame(ratings).rating.mean()

## Debugging for bias & rating inaccuracies

To debug these ratings for bias or inaccuracies, let us first pick queries for which the rater LLM gave a low rating. We look at the context, query, response and the rationale given by the rating LLM to decide whether we need to amend the prompt used for the task (i.e., the `qna_system_message`) or the rater's own system message.
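
Picking the low-rated items can be sketched as a simple filter over the `ratings` list. The sample records, the numeric rating scale, and the threshold below are all hypothetical; the actual scale depends on what the rater system prompt asks for.

```python
def low_rated(ratings, threshold):
    """Return rated items at or below the threshold, lowest rating first."""
    flagged = [item for item in ratings if item["rating"] <= threshold]
    return sorted(flagged, key=lambda item: item["rating"])


# Hypothetical ratings in the same shape as the notebook's `ratings` list
sample_ratings = [
    {"query": "q1", "answer": "a1", "rating": 5, "rationale": "grounded"},
    {"query": "q2", "answer": "I don't know", "rating": 1, "rationale": "no answer"},
    {"query": "q3", "answer": "a3", "rating": 2, "rationale": "partly relevant"},
]

for item in low_rated(sample_ratings, threshold=2):
    print(item["query"], item["rating"], item["rationale"])
```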

For example, look at the following query:

In [None]:
user_query = "What is the attrition in this quater?"

This is an irrelevant query (bordering on adversarial); the rater should have rated the response highly, since it is in line with the system message.

In this case, the rater system message needs to be amended so that responses to such adversarial queries are not rated badly. Look at the revised rater system message below, which corrects for this mistake.


In [None]:
rater_system_message_v1 = """<YOUR RATING SYSTEM PROMPT V1 HERE>"""

To make the rater model aware that the prediction model was asked to output "I don't know", we add the following line to `rater_system_message_v1` to obtain `rater_system_message_v2`: `Note that the AI system was asked to respond with "I don't know" if the answer to the query was not found in the context.`

In [None]:
rater_system_message_v2 = """<YOUR RATING SYSTEM PROMPT V2 HERE>"""
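
One way to keep the two rater prompt versions in sync is to derive the second from the first rather than maintaining two separate strings. The sketch below assumes a stand-in for the v1 prompt (the real text is whatever you wrote above) and uses hypothetical variable names so that it does not overwrite the notebook's variables.

```python
# The extra instruction quoted in the text above
RATER_NOTE = (
    'Note that the AI system was asked to respond with "I don\'t know" '
    "if the answer to the query was not found in the context."
)

# Hypothetical stand-in for the v1 rater system prompt
rater_v1_stand_in = "<rating instructions go here>"

# v2 is v1 plus the note, separated by a blank line
rater_v2_stand_in = f"{rater_v1_stand_in}\n\n{RATER_NOTE}"

print(rater_v2_stand_in)
```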

In [None]:
rater_user_message_template = """
###Query
{query}

###Context
{context}

###Response
{response}
"""

In [None]:
# Define a ChatPromptTemplate for rating purposes using predefined system and user messages
prompt_for_rating = ChatPromptTemplate.from_messages(
    [
        ("system", rater_system_message_v2),  # System message for setting up the rating scenario
        ("human", rater_user_message_template),  # User message template for inputting a rating
    ]
)

# Create a pipeline starting with the rating prompt and ending with the language model
pipe = prompt_for_rating | llm  # The pipe will take the prompt and pass its output to the language model for processing

In [None]:
# Retrieving relevant document chunks for the given user query
relevant_document_chunks = retriever.invoke(user_query)  # invoke() replaces the deprecated get_relevant_documents()

# Extracting page content from each relevant document chunk
context_list = [d.page_content for d in relevant_document_chunks]

# Joining the extracted page contents into a single string to form the context for the query
context_for_query = ". ".join(context_list)

# Generating a rating response with the updated rater prompt, passing the user query, the constructed context, 
# and a placeholder response
response = pipe.invoke({"query": user_query, "context": context_for_query, "response": "I don't know"})

# Converting the JSON string response content to a Python dictionary
response_json = json.loads(response.content)

In [None]:
# printing the rating and the rationale
response_json['Rating'], response_json['Rationale']

We can see that the rating LLM now correctly reasons that the question was out of context and rates the response highly.

Let us look at another query that was rated badly.

In [None]:
user_query = "what is the consolidated total income?"

In this case, the answer "I don't know" points to the absence of relevant documents in the context. Let us look at the context extracted from the database for this query.

In [None]:
# Retrieve relevant document chunks based on the user's query
relevant_document_chunks = retriever.invoke(user_query)  # invoke() replaces the deprecated get_relevant_documents()

# Extract page content from each relevant document chunk
context_list = [d.page_content for d in relevant_document_chunks]

# Concatenate the extracted page contents into a single string, separated by periods and spaces, to serve as the context for the query
context_for_query = ". ".join(context_list)

In [None]:
for document in relevant_document_chunks:
    print(document.page_content.replace('\t', ' '))
    print("---")

In [None]:
# Invoke the pipeline with the user query, the constructed context, and the first answer from the ratings list
response = pipe.invoke({"query": user_query, 
                        "context": context_for_query, 
                        "response": ratings[0]['answer']})

# Load the JSON string response content into a Python dictionary
response_json = json.loads(response.content)

In [None]:
ratings[0]['answer']

In [None]:
response_json['Rating'], response_json['Rationale']

Let us now consider a query that had a high rating:

In [None]:
user_query = "what are the details mentioned for green hydrogen ecosystem in the earning call?"

In [None]:
ratings[2]['answer']

In [None]:
relevant_document_chunks = retriever.invoke(user_query)  # invoke() replaces the deprecated get_relevant_documents()

for document in relevant_document_chunks:
    print(document.page_content.replace('\t', ' '))
    print("---")

The answer to this query is perfect; the context was appropriately retrieved and the LLM picked the correct answer from the chunks that were retrieved.

In debugging a RAG pipeline, we begin first with the rating model. It is important to align the rating model with human input so we are not flagging false negatives or false positives. We do this by assembling a sample of low and high ratings, observing ratings and amending the rating system prompt as required.

Once the rater is aligned, we should look at the retrieved context to check whether the vector database index is returning the relevant chunks. Finally, we look at the task prompts presented to the LLM to see if there are gaps in their formulation.
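
For the retrieval check, a quick test is whether a fact you know to be in the source document appears in any retrieved chunk. The sketch below is one simple way to do that with a case- and whitespace-insensitive substring search; `chunks_containing` is a hypothetical helper and the chunk texts are illustrative.

```python
def chunks_containing(chunks, expected_text):
    """Return the indices of chunks that contain the expected text."""
    # Normalize case and collapse whitespace so line breaks and tabs do not matter
    needle = " ".join(expected_text.lower().split())
    hits = []
    for index, chunk in enumerate(chunks):
        haystack = " ".join(chunk.lower().split())
        if needle in haystack:
            hits.append(index)
    return hits


# Illustrative chunk texts; in the notebook these would be the page_content values
sample_chunks = [
    "The revenue from mining\tservices stood at Rs.608 crores.",
    "Land for solar and wind plants has been identified.",
]

print(chunks_containing(sample_chunks, "Rs.608 crores"))
print(chunks_containing(sample_chunks, "attrition"))
```

An empty result for a fact that does exist in the source document points towards the index or the retriever settings rather than the task prompt.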

### Using LangSmith

#### Creating a Dataset

To begin testing the RAG chain, we first need to create a sample dataset of question-and-answer pairs.

In [None]:
# Import the Client class from the langsmith library to create datasets and examples
from langsmith import Client

# Define a list of example inputs consisting of questions and their corresponding answers
example_inputs = [
    ('what is the consolidated total income?', 
     'The Consolidated income was at Rs.25,810 crores.'),
    
    ('What is the revenue from the mining services?', 
     'The revenue from mining services stood at Rs.608 crores.'),
    
    ('what are the details mentioned for green hydrogen ecosystem in the earning call?', 
     """The progress on the green hydrogen ecosystem includes the following updates:\n- Agreements for technologies related to 
     electrolyzers are in place.\n- Development and construction work on an integrated facility for electrolyzer manufacturing 
     is expected to start towards the end of the quarter or early in the third quarter.\n- Land for solar and wind plants has 
     been identified, and site evaluation and work are ongoing.\n- Updates are expected over the next 6 to 9 months on various 
     aspects such as green methanol, green ammonia, green fertilizers, and other products in the ancillary and product 
     system."""),
    
    ('How many commercial blocks does the company own now and where are they located?', """The Company now has 7 commercial 
    blocks located in the state of Chhattisgarh, Maharashtra, Madhya Pradesh, Odisha, and Jharkhand."""),
    
    ('What is the attrition in this quater?', "There are no references to attrition mentioned.")
]

# Instantiate the Client object to interact with the LangSmith platform
client = Client()

# Specify the name of the dataset to be created
dataset_name = "qa_adani_enterprise_fy24_q1"

# Create a new dataset on the LangSmith platform with a description
dataset = client.create_dataset(
    dataset_name=dataset_name, description="Questions and answers related to Adani Enterprise FY24 Q1."
)

# Loop through the example inputs and add each as an example in the newly created dataset
for input_prompt, output_answer in example_inputs:
    client.create_example(
        inputs={"question": input_prompt},  # Add the question as input
        outputs={"answer": output_answer},  # Add the answer as output
        metadata={"source": "Self Made"},  # Include metadata indicating the source of the example
        dataset_id=dataset.id,  # Reference the ID of the dataset being populated
    )

We create a `predict` function that the `evaluate` function will call to get the output for each example.

In [None]:
# Define a function named `predict` that takes a dictionary of inputs as argument and returns a dictionary
def predict(inputs: dict) -> dict:
    # Invoke the runnable chain with the question from the inputs dictionary to get a response
    response = rag_chain_with_source.invoke(inputs["question"])
    
    # Return a dictionary containing the output, specifically the answer from the response
    return {"output": response.get("answer")}

#### Defining the Evaluators and Running the Experiment

- The `qa` evaluator instructs an LLM to directly grade a response as "correct" or "incorrect" based on the reference answer.
- The `context_qa` evaluator instructs the LLM chain to use reference "context" (provided through the example outputs) in determining correctness. This is useful if you have a larger corpus of grounding documents but no ground-truth answers to a query.
- The `cot_qa` evaluator is similar to the `context_qa` evaluator, except that it instructs the LLM chain to use chain-of-thought "reasoning" before determining a final verdict. This tends to lead to responses that better correlate with human labels, for a slightly higher token and runtime cost.


In [None]:
# Import the Client class from the langsmith library for interacting with the LangSmith platform
from langsmith import Client

# Import the LangChainStringEvaluator class and the evaluate function from langsmith.evaluation for evaluating predictions
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Initialize a LangChainStringEvaluator for QA evaluations with a specific configuration
qa_evaluator = LangChainStringEvaluator("qa", config={
    "llm": ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
})

# Instantiate the Client object to interact with the LangSmith platform
client = Client()

# Evaluate the predict function against the specified dataset using the initialized evaluator
evaluate(
    predict,  # Function to evaluate
    data=dataset_name,  # Name of the dataset to evaluate against
    evaluators=[qa_evaluator],  # List of evaluators to use for the evaluation
    metadata={"revision_id": "v3"}  # Metadata for the evaluation, specifying a revision ID
)
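
Alongside LLM-based evaluators, a cheap deterministic check can help spot disagreements with the LLM judge. The sketch below assumes that `evaluate` also accepts a plain function taking `(run, example)` and returning a dictionary with `key` and `score`; the metric itself (`reference_token_recall`, the fraction of reference-answer tokens found in the prediction) is a hypothetical illustration, not a built-in LangSmith evaluator. It reads the `"output"` key returned by `predict` and the `"answer"` key stored in the dataset.

```python
from types import SimpleNamespace


def reference_token_recall(run, example):
    """Fraction of whitespace-separated reference tokens present in the prediction."""
    prediction = (run.outputs or {}).get("output") or ""
    reference = (example.outputs or {}).get("answer") or ""

    reference_tokens = set(reference.lower().split())
    prediction_tokens = set(prediction.lower().split())

    if not reference_tokens:
        score = 0.0
    else:
        score = len(reference_tokens & prediction_tokens) / len(reference_tokens)
    return {"key": "reference_token_recall", "score": score}


# Local check with stand-in objects that only mimic the `.outputs` attribute
sample_example = SimpleNamespace(
    outputs={"answer": "The revenue from mining services stood at Rs.608 crores."}
)
matching_run = SimpleNamespace(
    outputs={"output": "The revenue from mining services stood at Rs.608 crores."}
)
abstaining_run = SimpleNamespace(outputs={"output": "I don't know"})

print(reference_token_recall(matching_run, sample_example))
print(reference_token_recall(abstaining_run, sample_example))
```

Under that assumption, the function could be passed next to the LLM evaluator, for example `evaluators=[qa_evaluator, reference_token_recall]`. Token overlap is a crude signal for free-text answers, so it is better suited to flagging cases for manual review than to replacing the LLM judge.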