<a href="https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/langchain_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating RAG pipelines with LangChain

This notebook demonstrates how to evaluate a RAG pipeline using LangChain's [QA Evaluator](https://docs.smith.langchain.com/evaluation/evaluator-implementations#correctness-qa-evaluation). This evaluator measures the correctness of a response given some context, which makes it well suited to evaluating a RAG pipeline. By the end of this notebook, you will have a RAG-based QA pipeline whose answers you can score against reference answers.

In this tutorial, you will use an Astra DB vector store, an OpenAI embedding model, an OpenAI LLM, LangChain, and LangSmith.

## Prerequisites

You will need a vector-enabled Astra database, an OpenAI account, and a LangSmith account.

* Create an [Astra vector database](https://docs.datastax.com/en/astra-serverless/docs/getting-started/create-db-choices.html).
* Create an [OpenAI account](https://openai.com/).
* Within your database, create an [Astra DB Access Token](https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html) with Database Administrator permissions.
* Get your Astra DB Endpoint:
  * `https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com`
* Create a [LangSmith account](https://docs.smith.langchain.com/).

See the [Prerequisites](https://docs.datastax.com/en/ragstack/docs/prerequisites.html) page for more details.

## Setup
`ragstack-ai` includes all the packages you need to build a RAG pipeline. 

In [3]:
! pip install -q ragstack-ai

In [2]:
import os
from getpass import getpass

# Enter your settings for Astra DB and OpenAI:
os.environ["ASTRA_DB_API_ENDPOINT"] = input("Enter your Astra DB API Endpoint: ")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra DB Token: ")
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")
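
Before moving on, it can help to confirm that the three settings are present. The following is a minimal sketch, not part of `ragstack-ai`: `check_settings` and `REQUIRED_SETTINGS` are hypothetical names, and the endpoint check assumes only the `https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com` pattern shown in the prerequisites.

```python
# Hypothetical helper: names and checks are assumptions for illustration.
REQUIRED_SETTINGS = (
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
    "OPENAI_API_KEY",
)


def check_settings(env):
    """Return a list of human-readable problems found in a mapping of settings."""
    problems = [
        f"{name} is not set"
        for name in REQUIRED_SETTINGS
        if not env.get(name, "").strip()
    ]

    # Loose format check based on the endpoint pattern from the prerequisites.
    endpoint = env.get("ASTRA_DB_API_ENDPOINT", "").strip()
    if endpoint and not (
        endpoint.startswith("https://")
        and endpoint.endswith(".apps.astra.datastax.com")
    ):
        problems.append("ASTRA_DB_API_ENDPOINT does not match the expected pattern")

    return problems


# Example with placeholder values (not real credentials).
example_env = {
    "ASTRA_DB_API_ENDPOINT": "https://my-db-id-us-east1.apps.astra.datastax.com",
    "ASTRA_DB_APPLICATION_TOKEN": "placeholder-token",
}
for problem in check_settings(example_env):
    print(problem)
```

In the notebook, you might call `check_settings(os.environ)` and stop if the returned list is not empty.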

## Create RAG Pipeline

### Embedding Model and Vector Store

In [5]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_astradb import AstraDBVectorStore
import os

# Configure your embedding model and vector store
embedding = OpenAIEmbeddings()
vstore = AstraDBVectorStore(
    collection_name="lc",
    embedding=embedding,
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
)
print("Astra vector store configured")

Astra vector store configured


In [6]:
# Retrieve the text of a short story that will be indexed in the vector store
! curl https://raw.githubusercontent.com/CassioML/cassio-website/main/docs/frameworks/langchain/texts/amontillado.txt --output amontillado.txt
SAMPLEDATA = ["amontillado.txt"]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13022  100 13022    0     0  29120      0 --:--:-- --:--:-- --:--:-- 29395


In [None]:
# Alternatively, provide your own file. However, you will want to update your queries to match the content of your file.

# Upload sample file (Note: this cell assumes you are on Google Colab).
# In a local Jupyter notebook, you can instead provide the file path directly by uncommenting and running just the next line.
# SAMPLEDATA = ["<path_to_file>"]

from google.colab import files

print("Please upload your own sample file:")
uploaded = files.upload()
if uploaded:
    SAMPLEDATA = uploaded
else:
    raise ValueError("Cannot proceed without Sample Data. Please re-run the cell.")

print("Please make sure to change your queries to match the contents of your file!")

In [7]:
import os
from langchain.document_loaders import TextLoader
from langchain.document_loaders import PyPDFLoader

# Loop through each file and load it into our vector store
documents = []
for filename in SAMPLEDATA:
    path = os.path.join(os.getcwd(), filename)

    # Supported file types are pdf and txt
    if filename.endswith(".pdf"):
        loader = PyPDFLoader(path)
        new_docs = loader.load_and_split()
        print(f"Processed pdf file: {filename}")
    elif filename.endswith(".txt"):
        loader = TextLoader(path)
        new_docs = loader.load_and_split()
        print(f"Processed txt file: {filename}")
    else:
        print(f"Unsupported file type: {filename}")
        continue  # nothing was loaded for this file, so skip to the next one

    if len(new_docs) > 0:
        documents.extend(new_docs)

# empty the list of file names in case this cell is run multiple times
SAMPLEDATA = []

print("\nProcessing done.")

Processed txt file: amontillado.txt

Processing done.


In [9]:
# Create embeddings by inserting your documents into the vector store.
inserted_ids = vstore.add_documents(documents)
print(f"\nInserted {len(inserted_ids)} documents.")


Inserted 4 documents.
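
One short story became several documents because `load_and_split()` breaks the text into chunks before it is embedded. As a rough illustration of the idea only, the sketch below cuts text into fixed-size windows with a small overlap. `chunk_text` and its default sizes are assumptions made for this example; LangChain's own splitter also tries to break on separators such as paragraph boundaries, so its chunks will differ.

```python
def chunk_text(text, chunk_size=4000, chunk_overlap=200):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in at least one chunk.
    The sizes are illustrative assumptions, not values taken from the notebook.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic text roughly the size of the downloaded story (13,022 bytes).
sample_text = "".join(str(i % 10) for i in range(13022))
sample_chunks = chunk_text(sample_text)
print(f"{len(sample_chunks)} chunks, sizes: {[len(c) for c in sample_chunks]}")
```

Smaller chunks give the retriever more precise passages to return, at the cost of more documents to embed and store.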


In [None]:
# Check your collection to verify that the documents were embedded.
print(vstore.astra_db.collection("lc").find())

### Basic Retrieval

Retrieve context from your vector database, and pass it to the model with a prompt.

In [11]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

retriever = vstore.as_retriever(search_kwargs={"k": 3})

prompt_template = """
Answer the question based only on the supplied context. If you don't know the answer, say you don't know the answer.
Context: {context}
Question: {question}
Your answer:
"""
prompt = ChatPromptTemplate.from_template(prompt_template)
model = ChatOpenAI()

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

chain.invoke(
    "In the given context, what motivates the narrator, Montresor, to seek revenge against Fortunato?"
)

'The narrator, Montresor, seeks revenge against Fortunato because Fortunato insulted him.'
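
To make the retrieval step less opaque, here is a toy sketch of what "return the `k` most similar documents" means. It compares a query vector with each stored vector by cosine similarity and keeps the best matches. The helper names and the two-dimensional vectors are invented for illustration; a real vector store such as Astra DB runs the search server-side over high-dimensional embeddings, and this exhaustive loop is not how it is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=3):
    """Return the ids of the k documents most similar to the query, best first."""
    ranked = sorted(
        doc_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
toy_docs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
print(top_k([1.0, 0.1], toy_docs, k=2))
```

The retrieved documents are then substituted for `{context}` in the prompt template, and the original question for `{question}`.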

## Evaluation

LangChain offers several [built-in evaluators](https://docs.smith.langchain.com/evaluation/evaluator-implementations) that you can use to test the efficacy of your RAG pipeline. Since you've created a RAG pipeline, the QA Evaluator is a good fit. 

Remember that LLMs are probabilistic: responses will not be exactly the same for each invocation. Evaluation results will differ between invocations, and they may be imperfect. Use these metrics as one part of a larger, holistic testing strategy for your RAG application.
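
One way to account for this variability is to repeat an evaluation several times and look at the spread of scores instead of a single number. The sketch below is an assumption-laden illustration: `summarize_runs` is a hypothetical helper, and the scores are made-up placeholders in which each inner list holds the 0/1 grades from one evaluation run.

```python
from statistics import mean, pstdev


def summarize_runs(scores_by_run):
    """Summarize accuracy across repeated evaluation runs.

    scores_by_run: list of runs, each a non-empty list of 0/1 grades
    (1 = graded correct, 0 = graded incorrect).
    """
    accuracy_per_run = [mean(run) for run in scores_by_run]
    return {
        "mean": mean(accuracy_per_run),
        "stdev": pstdev(accuracy_per_run),
        "min": min(accuracy_per_run),
        "max": max(accuracy_per_run),
    }


# Placeholder grades for three hypothetical runs over four questions.
placeholder_runs = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
print(summarize_runs(placeholder_runs))
```

A wide gap between `min` and `max` suggests that a single run is not a reliable basis for comparing two pipeline variants.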

### Setup

LangSmith is required to run evaluation using the built-in LangChain tools. 

In [12]:
import os
from getpass import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

langsmith_api_key = "LANGCHAIN_API_KEY"
if langsmith_api_key not in os.environ:
    os.environ[langsmith_api_key] = getpass(f"Enter {langsmith_api_key}: ")

In [13]:
os.environ["LANGCHAIN_PROJECT"] = input(
    "Project: "
)  # if not specified, defaults to "default"

In [14]:
eval_questions = [
    "What motivates the narrator, Montresor, to seek revenge against Fortunato?",
    "What are the major themes in this story?",
    "What is the significance of the story taking place during the carnival season?",
    "What literary techniques does Poe use to create suspense and tension in the story?",
]

eval_answers = [
    "Montresor is insulted by Lenora and seeks revenge.",  # Incorrect Answer
    "The major themes are happiness and trustworthiness.",  # Incorrect Answer
    "The carnival season is a time of celebration and merriment, which contrasts with the sinister events of the story.",
    "Poe uses foreshadowing, irony, and symbolism to create suspense and tension.",
]

# Materialize the pairs in a list; a bare zip() iterator can only be consumed once.
examples = list(zip(eval_questions, eval_answers))

In [15]:
# Create your dataset in LangSmith
from langsmith import Client
from langsmith.utils import LangSmithError

client = Client()
dataset_name = "test_eval_dataset"

try:
    # Check if dataset exists
    dataset = client.read_dataset(dataset_name=dataset_name)
    print("using existing dataset: ", dataset.name)
except LangSmithError:
    # If not, create a new one with the eval questions
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="sample evaluation dataset",
    )
    for question, answer in examples:
        client.create_example(
            inputs={"input": question},
            outputs={"answer": answer},
            dataset_id=dataset.id,
        )

    print("Created a new dataset: ", dataset.name)

using existing dataset:  test_eval_dataset


In [16]:
from langchain.chains import RetrievalQA

# Since chains and agents can be stateful (they can have memory),
# create a constructor to pass in to the run_on_dataset method.
# This is so any state in the chain is not reused when evaluating individual examples.
def create_qa_chain(llm, vstore, return_context=True):
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=vstore.as_retriever(),
        return_source_documents=return_context,
    )
    return qa_chain
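
The reason for using a constructor can be shown with a toy example that involves no LangChain code. `ToyStatefulChain` and `evaluate` are invented stand-ins: the first mimics a chain that remembers earlier inputs, the second mimics an evaluation loop that calls a factory once per example.

```python
class ToyStatefulChain:
    """Stand-in for a chain with memory: it remembers every question it has seen."""

    def __init__(self):
        self.history = []

    def invoke(self, question):
        self.history.append(question)
        return f"seen {len(self.history)} question(s)"


def evaluate(chain_factory, questions):
    """Call the factory once per question, so each example gets its own chain."""
    return [chain_factory().invoke(question) for question in questions]


questions = ["first question", "second question"]

# A real factory builds a fresh chain each time: no state carries over.
print(evaluate(ToyStatefulChain, questions))

# A "factory" that keeps returning one shared instance leaks state between examples.
shared_chain = ToyStatefulChain()
print(evaluate(lambda: shared_chain, questions))
```

With the shared instance, the second example is answered by a chain that has already seen the first question, which is the kind of cross-contamination a factory avoids.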

### Run Evaluation

Now that you have a dataset in LangSmith, you can run evaluation over it using LangChain's built-in evaluators.

In [17]:
from langsmith import Client
from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    # LangChain offers several QA Evaluator types
    evaluators=[
        "qa",  # grades a response as correct or incorrect based on the reference answer
        "context_qa",  # uses reference context to determine correctness
        # "cot_qa", # similar to context_qa, but uses chain-of-thought
    ],
    prediction_key="result",
)

client = Client()
run_on_dataset(
    dataset_name=dataset_name,
    # Pass a factory (not a chain instance) so each example is run with a fresh chain
    llm_or_chain_factory=lambda: create_qa_chain(llm=model, vstore=vstore),
    client=client,
    evaluation=evaluation_config,
    verbose=False,
)

View the evaluation results for project 'puzzled-money-58' at:
https://smith.langchain.com/o/c848b3f1-1464-5ef1-bd87-f03d2ce8baa7/datasets/1755e594-2ad8-42ab-9995-d8eac00b07c2/compare?selectedSessions=71b02351-d931-4539-b0e0-be2250d300a8

View all tests for Dataset test_eval_dataset at:
https://smith.langchain.com/o/c848b3f1-1464-5ef1-bd87-f03d2ce8baa7/datasets/1755e594-2ad8-42ab-9995-d8eac00b07c2
[------------------------------------------------->] 4/4

{'project_name': 'puzzled-money-58',
 'results': {'cb519748-8c64-4dad-a258-56cb5e6b08c5': {'input': {'input': 'What literary techniques does Poe use to create suspense and tension in the story?'},
   'feedback': [EvaluationResult(key='correctness', score=1, value='CORRECT', comment='CORRECT', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('0e777c93-14d0-4e7f-9e53-4680c371eafd'))}, source_run_id=None, target_run_id=None),
    EvaluationResult(key='Contextual Accuracy', score=1, value='CORRECT', comment='CORRECT', correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('0ed63220-bcd4-45ac-8975-42510018bfcb'))}, source_run_id=None, target_run_id=None)],
   'execution_time': 36.841969,
   'run_id': 'd38cc7d0-c6d4-46f0-926f-257efb000a0f',
   'output': {'query': 'What literary techniques does Poe use to create suspense and tension in the story?',
    'result': 'Poe uses several literary techniques to create suspense and tension in the story. \n\n1. Foreshadowing: Poe hin

Open the link to view your evaluation results. Note that the first two queries should have "incorrect" results, as the dataset purposely contained incorrect answers for those. 

For more details, see [LangSmith Testing and Evaluation](https://docs.smith.langchain.com/category/testing--evaluation).
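
If you prefer a quick summary inside the notebook, you could assign the return value of `run_on_dataset` to a variable and average the scores per evaluator. The sketch below assumes the shape visible in the truncated output above: a `results` mapping whose entries hold a `feedback` list of objects with `key` and `score` attributes. `mean_scores` is a hypothetical helper, and the sample data uses stand-in objects with placeholder scores, not real evaluation results.

```python
from collections import defaultdict
from types import SimpleNamespace


def mean_scores(run_results):
    """Average the feedback scores per evaluator key across all examples."""
    scores = defaultdict(list)
    for example in run_results["results"].values():
        for feedback in example["feedback"]:
            if feedback.score is not None:  # skip feedback that has no numeric score
                scores[feedback.key].append(feedback.score)
    return {key: sum(values) / len(values) for key, values in scores.items()}


# Stand-in data mimicking the assumed structure; the scores are placeholders.
sample_results = {
    "project_name": "example-project",
    "results": {
        "example-1": {
            "feedback": [
                SimpleNamespace(key="correctness", score=1),
                SimpleNamespace(key="Contextual Accuracy", score=1),
            ]
        },
        "example-2": {
            "feedback": [
                SimpleNamespace(key="correctness", score=0),
                SimpleNamespace(key="Contextual Accuracy", score=1),
            ]
        },
    },
}
print(mean_scores(sample_results))
```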

### Cleanup

In [None]:
# WARNING: This will delete the collection and all documents in the collection
# vstore.delete_collection()

## What's Next

Now that you've set up a RAG pipeline and run evaluation over it, you can try:
* More advanced queries over your dataset
* Using your internal documentation and evaluating responses over those
* Implementing more advanced RAG techniques to compare evaluation scores
* Evaluating with external evaluation tools 