# Build a Baseline with LangChain

Having prepared a test dataset, we are now equipped to conduct experiments and iteratively enhance our RAG pipelines.

## Introducing Metrics Driven Development (MDD)

A key challenge is developing a systematic approach to measure and refine our RAG pipeline. To address this, we propose `Metrics Driven Development` (MDD), inspired by the popular Test Driven Development methodology. MDD advocates for employing various metrics to assess different facets of the LLM application and conducting targeted experiments for improvement based on specific use cases.

Metrics Driven Development (MDD) offers a structured framework to tackle the complexities involved in optimizing RAG applications. Below is a mind map created by an engineer at AWS, illustrating potential strategies for enhancing a RAG pipeline.

![](https://media.licdn.com/dms/image/D4D22AQEgjWxKXokOPA/feedshare-shrink_800/0/1708498751086?e=1711584000&v=beta&t=xaT95vKS8m4qTybofpKqQfXOGoFs8lQXBuOk2Fr45AE)

The [original Miro mind map](https://miro.com/app/board/uXjVNvklNmc=/) is accessible for further exploration. MDD serves as a guiding light through the complexity of enhancing RAG applications.

Let's proceed to see MDD in action.

In [2]:
# only for reloading code that changes locally
%load_ext autoreload
%autoreload 2

## Building a baseline

If you remember, in the [last notebook](./dataset.ipynb) we outlined the following steps and completed the first three:

1. Load the data as documents. ✅
2. Generate the test set from these documents. ✅
3. Upload and verify the test set with LangSmith. ✅
4. Formulate experiments to improve your RAG pipeline. ⏳
5. Choose the right metrics to evaluate the experiment. ⏳
6. Analyze the results using the LangSmith dashboard. ⏳

So let's continue where we left off.

## 4. Formulate experiments to improve your RAG pipeline

For the baseline, the first thing we want to know is how effective vanilla GPT-3.5 is compared to a RAG-based model. RAG should be superior because it has access to company-specific information, but how large is the difference exactly?

> Is RAG better than just using an LLM for our case?

To compare the two, we need to create 2 chains:
1. Just the LLM (gpt-3.5)
2. LLM + Retriever

### Building the RAG

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

To build the RAG, let's load the data, chunk it, and add it to a vector store for retrieval. If you want more info on how to build RAG systems with LangChain, check the [docs](https://python.langchain.com/docs/modules/data_connection/).

In [3]:
# load the documents
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./data/")
documents = loader.load()

# add filename as metadata
for document in documents:
    document.metadata['file_name'] = document.metadata['source']

# how many docs do we have?
docs = documents
len(docs)

26

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# create the vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
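
To build intuition for the `chunk_size=1000` and `chunk_overlap=200` settings, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line boundaries), and the function name `sliding_chunks` is made up for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 2500  # placeholder text, 2500 characters long
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.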

In [14]:
# get one example question from the dataset for testing
from langsmith import Client

client = Client()
examples = list(client.list_examples(dataset_name="basecamp"))

q = examples[0].inputs
q

{'question': 'What does the 37signals Employee Handbook provide for new hires?'}

In [11]:
from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain import hub

In [12]:
# Retrieve and generate using the relevant snippets from the docs
vectorstore_retriever = vectorstore.as_retriever()
# load a RAG prompt from the LangChain Hub
prompt = hub.pull("rlm/rag-prompt")
# our llm of choice
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
    
def ragas_output_parser(docs):
    return [doc.page_content for doc in docs]
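
The two helpers differ only in the shape of what they return. The sketch below shows this with a hypothetical `FakeDoc` stand-in for a LangChain document (only `page_content` is modelled) and made-up text.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def ragas_output_parser(docs):
    return [doc.page_content for doc in docs]


docs = [FakeDoc("Benefits overview."), FakeDoc("How we work.")]

prompt_context = format_docs(docs)          # one string, pasted into the prompt
ragas_contexts = ragas_output_parser(docs)  # list of strings, kept for evaluation

print(repr(prompt_context))  # 'Benefits overview.\n\nHow we work.'
print(ragas_contexts)        # ['Benefits overview.', 'How we work.']
```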

Now let's string all the components together to make the RAG pipeline.

In [13]:
from langchain_core.runnables import RunnableParallel

generator = (
    prompt
    | llm
    | StrOutputParser()
)

retriever = RunnableParallel({
    "context": vectorstore_retriever | format_docs, 
    "question": RunnablePassthrough(),
})

filter_langsmith_dataset = RunnableLambda(lambda x: x["question"] if isinstance(x, dict) else x)

rag_chain = RunnableParallel({
    "question": filter_langsmith_dataset,
    "answer": filter_langsmith_dataset | retriever | generator,
    "contexts": filter_langsmith_dataset | vectorstore_retriever | ragas_output_parser,
})
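
The data flow of `rag_chain` can be mimicked in plain Python. This is only a sketch: `fake_retrieve` and `fake_generate` are hypothetical stubs standing in for the vector store retriever and the `prompt | llm | StrOutputParser()` generator, and no LangChain code is involved.

```python
def extract_question(x):
    """Same logic as the lambda in filter_langsmith_dataset: accept a dict or a bare string."""
    return x["question"] if isinstance(x, dict) else x


def fake_retrieve(question):
    # Hypothetical stub: a real retriever would query the vector store.
    return ["chunk about onboarding", "chunk about benefits"]


def fake_generate(question, context):
    # Hypothetical stub: a real generator would call the LLM.
    return f"Answer to {question!r} using {len(context)} characters of context"


def rag_like(x):
    """Plain-Python analogue of the three-branch RunnableParallel."""
    question = extract_question(x)
    contexts = fake_retrieve(question)
    return {
        "question": question,
        "answer": fake_generate(question, "\n\n".join(contexts)),
        "contexts": contexts,
    }


from_dataset = rag_like({"question": "What is in the handbook?"})
from_string = rag_like("What is in the handbook?")
print(from_dataset == from_string)  # True
```

The dict form is what a LangSmith example passes in, while the bare string is convenient for manual testing. One difference from the real chain: there, the `answer` and `contexts` branches each invoke the retriever, so retrieval runs twice per question, whereas the sketch retrieves once.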

In [16]:
# check with the example question to see if everything is working
get_answer = RunnableLambda(lambda x: x["answer"])
resp = (rag_chain | get_answer).invoke(q)
resp


'The 37signals Employee Handbook provides new hires with information on life insurance, retirement plans, company culture, and benefits. It also includes sections on what the company stands for, how they work, and employee perks. New employees are encouraged to question things and take advantage of their new perspective.'

Voilà! We have our RAG working with Langchain. Go on and try a few questions yourself from the `examples` we generated.

Now, let's build the LLM-only chain.

### Just the LLM

Setting this up is much easier, as you might imagine: all you need is the prompt.

In [18]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

Question: {question}

Helpful Answer:"""
llm_prompt = PromptTemplate.from_template(template)

just_llm = (
    {"question": RunnablePassthrough()}
    | llm_prompt
    | llm
    | StrOutputParser()
    | RunnableParallel({
        "answer": RunnablePassthrough(),
        "contexts": RunnableLambda(lambda _: [""]),
    })
)

In [19]:
resp = (just_llm | get_answer).invoke(q)
resp

"The 37signals Employee Handbook provides information on the company's values, culture, and expectations for new hires. It also outlines policies and procedures to help employees navigate their roles within the organization. Thanks for asking!"

Try out a few `examples` with this chain as well, and see if you can spot any differences in performance by eyeballing the results.

## 5. Choose the right metrics to evaluate the experiment

Ragas provides a range of metrics that you can use to measure the different components of your RAG pipeline. You can see the entire list in the [docs](https://docs.ragas.io/en/latest/concepts/metrics/index.html).

For this experiment we are going to choose [Answer Correctness](https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html). `Answer Correctness` is an end-to-end metric that measures the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness. Do check out the docs to learn more about how it works internally.
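
As a rough illustration of how such a score can be assembled, the sketch below combines a factual-overlap F1 with a semantic-similarity term through a weighted average, which is how the Ragas docs describe the metric. The weights of 0.75 and 0.25 are an assumption here, the function names are made up, and the statement counts and similarity value are illustrative inputs; in Ragas these quantities come from LLM and embedding calls.

```python
def f1_from_counts(tp, fp, fn):
    """F1 over statements: tp = correct, fp = unsupported, fn = missing."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness_sketch(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted average of factual F1 and semantic similarity (weights are assumed)."""
    factual_w, semantic_w = weights
    f1 = f1_from_counts(tp, fp, fn)
    return (factual_w * f1 + semantic_w * semantic_similarity) / (factual_w + semantic_w)


# Illustrative numbers only: 2 correct statements, 1 unsupported, 1 missing.
score = answer_correctness_sketch(tp=2, fp=1, fn=1, semantic_similarity=0.9)
print(round(score, 4))  # 0.725
```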

To make evaluation of LangChain chains on LangSmith easier, Ragas provides 2 utilities:
1. `EvaluatorChain`: a LangChain chain that takes a Ragas metric and creates a `Chain` which outputs the score.
2. `evaluate()`: a utility function for LangSmith that takes a dataset name, a chain, and metrics, and runs the evaluations.

Let's take a look at both of them.

### `EvaluatorChain`

Let's import it together with the `Answer Correctness` metric that we will use to evaluate both of the baselines we created.

In [None]:
from ragas.integrations.langchain import EvaluatorChain

# the metric we will be using
from ragas.metrics import answer_correctness

### `evaluate()` Langsmith Dataset

This utility function takes the LangSmith dataset name, the RAG chain, and the Ragas metrics you choose, and runs the evaluations for you.

In [22]:
from ragas.integrations.langsmith import evaluate

Let's evaluate the `rag_chain` first.

In [24]:
dataset_name = "basecamp"
# evaluate the RAG chain
run = evaluate(
    dataset_name=dataset_name, 
    llm_or_chain_factory=rag_chain, 
    experiment_name="rag_chain_1",
    metrics=[answer_correctness],
    verbose=True
)

View the evaluation results for project 'rag_chain_1' at:
https://smith.langchain.com/o/9bfbddc5-b88e-41e5-92df-2a62f0c64b4b/datasets/e9dc7bc8-9d47-4efd-8f4c-678a18a7aef5/compare?selectedSessions=e1c7b451-6331-476d-b360-7f488326df50

View all tests for Dataset basecamp at:
https://smith.langchain.com/o/9bfbddc5-b88e-41e5-92df-2a62f0c64b4b/datasets/e9dc7bc8-9d47-4efd-8f4c-678a18a7aef5
[------------------------------------------------->] 50/50

| statistic | feedback.answer_correctness | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.0 | 0.0 | 50.0 | 50 |
| unique | | 0.0 | | 50 |
| top | | | | 5b518dc8-243a-4098-a6d1-7aecf35ac98a |
| freq | | | | 1 |
| mean | 0.533543 | | 1.664292 | |
| std | 0.135102 | | 0.371598 | |
| min | 0.180216 | | 1.051612 | |
| 25% | 0.48566 | | 1.341244 | |
| 50% | 0.535136 | | 1.652268 | |
| 75% | 0.616268 | | 1.96562 | |


Now let's evaluate the LLM-only baseline.

In [26]:
# evaluate the LLM-only chain
run = evaluate(
    dataset_name=dataset_name,
    llm_or_chain_factory=just_llm, 
    experiment_name="just_llm_1",
    metrics=[answer_correctness], 
    verbose=True
)

View the evaluation results for project 'just_llm_1' at:
https://smith.langchain.com/o/9bfbddc5-b88e-41e5-92df-2a62f0c64b4b/datasets/e9dc7bc8-9d47-4efd-8f4c-678a18a7aef5/compare?selectedSessions=91e9d8cd-313b-4a29-892a-0d5f8c6a28e8

View all tests for Dataset basecamp at:
https://smith.langchain.com/o/9bfbddc5-b88e-41e5-92df-2a62f0c64b4b/datasets/e9dc7bc8-9d47-4efd-8f4c-678a18a7aef5
[------------------------------------------------->] 50/50

| statistic | feedback.answer_correctness | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.0 | 0.0 | 50.0 | 50 |
| unique | | 0.0 | | 50 |
| top | | | | e36cae5a-d19a-4917-9bb2-1bce78167127 |
| freq | | | | 1 |
| mean | 0.465834 | | 0.936587 | |
| std | 0.177573 | | 0.206522 | |
| min | 0.179124 | | 0.41259 | |
| 25% | 0.270415 | | 0.80366 | |
| 50% | 0.500165 | | 0.925462 | |
| 75% | 0.597824 | | 1.060094 | |
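
For a quick side-by-side of the two runs, the mean `answer_correctness` values reported in the two summaries can be compared directly. This is only a back-of-the-envelope sketch: with 50 examples and standard deviations of roughly 0.14 and 0.18, the gap should be read as indicative rather than conclusive.

```python
# Mean answer_correctness copied from the two evaluation summaries.
rag_mean = 0.533543  # experiment "rag_chain_1"
llm_mean = 0.465834  # experiment "just_llm_1"

absolute_gain = rag_mean - llm_mean
relative_gain = absolute_gain / llm_mean

print(f"absolute gain: {absolute_gain:.4f}")  # absolute gain: 0.0677
print(f"relative gain: {relative_gain:.1%}")  # relative gain: 14.5%
```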


Now you can check your LangSmith dataset dashboard to view and analyze the results.