# Evaluating LangChain QA Chains

LangChain is a framework for developing applications powered by language models. It can also be used to build RAG systems (or QA systems, as they are referred to in LangChain). To learn more about creating RAG systems with LangChain, see the [docs](https://python.langchain.com/docs/use_cases/question_answering/).

With this integration, you can easily evaluate your QA chains using the metrics offered in ragas.

In [1]:
%load_ext autoreload
%autoreload 2

First, let's load the dataset. We are going to build a generic QA system over the [NYC Wikipedia page](https://en.wikipedia.org/wiki/New_York_City). Load the dataset and create the `VectorstoreIndex` and the `RetrievalQA` chain from it.

In [2]:
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

loader = TextLoader("./nyc_wikipedia/nyc_text.txt")
index = VectorstoreIndexCreator().from_loaders([loader])


llm = ChatOpenAI()
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True
)

Testing it out

In [3]:

question = "How did New York City get its name?"
result = qa_chain({"query": question})
result["result"]

'New York City was named in 1664 in honor of the Duke of York, who later became King James II of England. It was named after him because King Charles II of England appointed the Duke as proprietor of the former territory of New Netherland, including the city of New Amsterdam, which was seized from Dutch control by England.'

To evaluate the QA system, we generated a few relevant questions.

In [5]:
eval_questions = [
    'What is the population of New York City as of 2020?',
    'Which borough of New York City has the highest population?',
    'What is the economic significance of New York City?',
    'How did New York City get its name?',
    'What is the significance of the Statue of Liberty in New York City?'
]

queries = [{"query": q} for q in eval_questions]

### Introducing `RagasEvaluatorChain`

Now comes the fun part. To evaluate QA chains built with LangChain, ragas provides a `RagasEvaluatorChain`. The `RagasEvaluatorChain` takes any `Metric` in ragas and makes an evaluation chain out of it.

The evaluator chain exposes the following APIs:

- `__call__()`
- `evaluate()`
- `evaluate_run()`
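
To make the calling pattern of the first two entry points concrete, here is a minimal, self-contained sketch. `ToyEvaluatorChain` and `word_overlap` are hypothetical stand-ins written for illustration only: they are not part of ragas, the real metrics are LLM-based, and real source documents are LangChain document objects, not plain strings. `evaluate_run()` is left out of the sketch.

```python
from typing import Callable, Dict, List


def word_overlap(answer: str, contexts: List[str]) -> float:
    """Hypothetical stand-in metric: share of answer words found in the contexts."""
    answer_words = set(answer.lower().split())
    context_words = set(" ".join(contexts).lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)


class ToyEvaluatorChain:
    """Illustrative wrapper around a metric; NOT the ragas implementation."""

    def __init__(self, name: str, metric: Callable[[str, List[str]], float]):
        self.name = name
        self.metric = metric

    def __call__(self, result: Dict) -> Dict[str, float]:
        # Score one QA result dict (assumed keys: "result", "source_documents").
        score = self.metric(result["result"], result["source_documents"])
        return {f"{self.name}_score": score}

    def evaluate(self, examples: List[Dict], predictions: List[Dict]) -> List[Dict[str, float]]:
        # Score a batch: one prediction per example, in the same order.
        if len(examples) != len(predictions):
            raise ValueError("examples and predictions must have the same length")
        return [self(prediction) for prediction in predictions]


toy_chain = ToyEvaluatorChain("overlap", word_overlap)

# Single result, mirroring the __call__() usage.
single = toy_chain({"result": "new york", "source_documents": ["New York City"]})

# Batch of results, mirroring the evaluate() usage.
batch = toy_chain.evaluate(
    [{"query": "q1"}, {"query": "q2"}],
    [
        {"result": "new york", "source_documents": ["New York City"]},
        {"result": "boston", "source_documents": ["New York City"]},
    ],
)
```

The point of the sketch is the shape of the interface: a single result dict goes in and a `{"<metric>_score": value}` dict comes out, while the batch call returns one such dict per prediction.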

In [6]:
from ragas.langchain.evalchain import RagasEvaluatorChain
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

# create evaluation chains
faithfulness_chain = RagasEvaluatorChain(metric=faithfulness)
answer_rel_chain = RagasEvaluatorChain(metric=answer_relevancy)
context_rel_chain = RagasEvaluatorChain(metric=context_relevancy)

`__call__()`

...

In [7]:
eval_result = faithfulness_chain(result)
eval_result['faithfulness_score']

0.6666666666666667
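
The score above is easier to read with a rough model of the metric in mind. As a hedged sketch, assuming faithfulness is computed as the fraction of statements in the answer that are supported by the retrieved context, a value near 0.67 would correspond to two out of three statements being supported. The function name and the verdicts below are hypothetical; in ragas the verdicts come from an LLM, not from a hand-written list.

```python
from typing import List


def faithfulness_from_verdicts(verdicts: List[bool]) -> float:
    """Fraction of answer statements judged to be supported by the context."""
    if not verdicts:
        raise ValueError("at least one statement is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts: two of the three statements are supported.
score = faithfulness_from_verdicts([True, True, False])
print(round(score, 2))
```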

`evaluate()`

...

In [9]:
# run the queries as a batch for efficiency
predictions = qa_chain.batch(queries)

# evaluate
print("evaluating...")
r = faithfulness_chain.evaluate(queries, predictions)
r

evaluating...


100%|████████████████████████████████████████████████████████████| 1/1 [00:28<00:00, 28.98s/it]


[{'faithfulness_score': 1.0},
 {'faithfulness_score': 0.0},
 {'faithfulness_score': 1.0},
 {'faithfulness_score': 1.0},
 {'faithfulness_score': 1.0}]
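
A list of per-question scores is easier to act on once it is paired back with the questions. The sketch below copies the questions and scores shown above and assumes that `evaluate()` returns results in the same order as the input queries; the variable names are illustrative.

```python
from statistics import mean

eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

# Scores copied from the output above (assumed to be in query order).
scores = [
    {"faithfulness_score": 1.0},
    {"faithfulness_score": 0.0},
    {"faithfulness_score": 1.0},
    {"faithfulness_score": 1.0},
    {"faithfulness_score": 1.0},
]

# Pair every question with its score.
paired = [(q, s["faithfulness_score"]) for q, s in zip(eval_questions, scores)]

# Average score over the whole set and the weakest question.
average = mean(score for _, score in paired)
worst_question, worst_score = min(paired, key=lambda pair: pair[1])

print(f"mean faithfulness: {average:.2f}")
print(f"lowest score ({worst_score}): {worst_question}")
```

Under that ordering assumption, the question with the lowest score is the natural place to start inspecting the retrieved context and the generated answer.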

## Evaluate with LangSmith

In [10]:
# dataset creation

from langsmith import Client
from langsmith.utils import LangSmithError

client = Client()
dataset_name = "NYC test"

try:
    # check if dataset exists
    dataset = client.read_dataset(dataset_name=dataset_name)
    print("using existing dataset: ", dataset.name) 
except LangSmithError:
    # if not create a new one with the generated query examples
    dataset = client.create_dataset(
        dataset_name=dataset_name, description="NYC test dataset"
    )
    for q in eval_questions:
        client.create_example(
            inputs={"query": q},
            dataset_id=dataset.id,
        )
    
    print("Created a new dataset: ", dataset.name)

using existing dataset:  NYC test


![](./assets/langsmith-dataset.png)

In [11]:
def create_qa_chain(return_context=True):
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=index.vectorstore.as_retriever(),
        return_source_documents=return_context
    )
    return qa_chain

In [12]:
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    custom_evaluators=[
        faithfulness_chain,
        answer_rel_chain,
        context_rel_chain
    ],
    prediction_key="result"
)

result = run_on_dataset(
    client,
    dataset_name,
    create_qa_chain,
    evaluation=evaluation_config,
    input_mapper=lambda x: x
)

View the evaluation results for project '2023-08-18-23-24-58-RetrievalQA' at:
https://smith.langchain.com/projects/p/838cb050-9f13-408e-8fd8-82cb43dd1e03?eval=true


![](./assets/langsmith-evaluation.png)

Evaluation Result

If you want to dig into the reasons behind the scores and how to improve them, inspect the feedback and the ragas chain trace in LangSmith:

![](./assets/langsmith-feedback.png)


![](./assets/langsmith-ragas-chain-trace.png)
