In [1]:
%load_ext autoreload
%autoreload 2

# The Problem

Figure out a metrics-driven approach to make sense of this:
![](https://media.licdn.com/dms/image/D4D22AQEgjWxKXokOPA/feedshare-shrink_800/0/1708498751086?e=1711584000&v=beta&t=xaT95vKS8m4qTybofpKqQfXOGoFs8lQXBuOk2Fr45AE)

You can also open the [original Miro mind map](https://miro.com/app/board/uXjVNvklNmc=/).

## Our Solution: `Metrics Driven Development with Ragas`

![](https://docs.ragas.io/en/latest/_static/imgs/component-wise-metrics.png)

In [3]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

## VectorStore

In [1]:
# load the documents
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("./data/")
documents = loader.load()

# add filename as metadata
for document in documents:
    document.metadata['file_name'] = document.metadata['source']

# how many docs do we have?
docs = documents
len(docs)

26

In [4]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# create the vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
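
To get a feel for what `chunk_size=1000` and `chunk_overlap=200` mean, here is a minimal fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); `sliding_chunks` is a hypothetical helper that only illustrates the size/overlap arithmetic.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of dummy text
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # neighbouring chunks share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.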

In [10]:
import pandas as pd

eval_df = pd.read_csv("./eval_dataset.csv").dropna(ignore_index=True)
eval_df.head()

| | input_question | output_ground_truth |
|---|---|---|
| 0 | What are the expectations for behavior that co... | Using courteous language, being respectful and... |
| 1 | How does the concept of a presentable scope of... | The concept of a presentable scope of work hel... |
| 2 | What is the importance of levelheadedness in t... | Levelheadedness is important in the company's ... |
| 3 | What is the process for formal performance rev... | You'll meet with your manager for formal perfo... |
| 4 | What are the qualifications and responsibiliti... | The qualifications and responsibilities of a S... |


Now you have a nice way to view them and select the ones you want.

## Baselines

Now let's build two baselines and compare them with the metrics available through Ragas. The first metric will be `AnswerCorrectness`.

The question I'm curious about is
> Is RAG actually better than just an LLM for this data distribution?

Well, let's compare, shall we? We'll take two examples:
1. Vanilla RAG from LangChain
2. GPT-3.5 and my humble prompts

First, create both.

### RAG

In [5]:
from operator import itemgetter

from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

In [6]:
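# Build the retriever, prompt, and LLM for the RAG baseline.
# Note: this FAISS index is built from the unsplit `documents` and
# replaces the Chroma store assigned to `vectorstore` earlier.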
vectorstore = FAISS.from_documents(documents, embedding=OpenAIEmbeddings())
vectorstore_retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
    
def ragas_output_parser(docs):
    return [doc.page_content for doc in docs]
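
The two helpers above look similar but serve different consumers. The sketch below runs them on a hypothetical `FakeDoc` stand-in (not a LangChain class; it only carries `page_content`) to show the difference in shape: one string for the prompt, a list of strings for Ragas.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document; only page_content is used."""
    page_content: str


def format_docs(docs):
    # Same logic as the notebook helper: join passages into one prompt-ready string.
    return "\n\n".join(doc.page_content for doc in docs)


def ragas_output_parser(docs):
    # Same logic as the notebook helper: keep passages separate, one string each.
    return [doc.page_content for doc in docs]


retrieved = [FakeDoc("Be respectful."), FakeDoc("Assume good intentions.")]

prompt_context = format_docs(retrieved)
ragas_contexts = ragas_output_parser(retrieved)

print(repr(prompt_context))
print(ragas_contexts)
```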

In [7]:
from langchain_core.runnables import RunnableParallel

generator = (
    prompt
    | llm
    | StrOutputParser()
)

retriever = RunnableParallel({
    "context": vectorstore_retriever | format_docs, 
    "question": RunnablePassthrough(),
})

filter_langsmith_dataset = RunnableLambda(lambda x: x["question"] if isinstance(x, dict) else x)

rag_chain = RunnableParallel({
    "question": filter_langsmith_dataset,
    "answer": filter_langsmith_dataset | retriever | generator,
    "contexts": filter_langsmith_dataset | vectorstore_retriever | ragas_output_parser,
})
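
As a plain-Python analogy of the data flow in `rag_chain` (not LangChain code), the sketch below shows how one input fans out into the `question`, `answer`, and `contexts` keys, and why `filter_langsmith_dataset` accepts both a bare string and a `{"question": ...}` dict. `fake_retrieve` and `fake_generate` are hypothetical stubs standing in for the vector store and the LLM.

```python
def normalise_input(x):
    """Mirror of filter_langsmith_dataset: accept {"question": ...} or a bare string."""
    return x["question"] if isinstance(x, dict) else x


def fake_retrieve(question):
    # Hypothetical retriever stub: canned passages instead of a vector store lookup.
    return ["Be respectful and empathetic.", "Assume good intentions."]


def fake_generate(question, context):
    # Hypothetical generator stub: a real chain would call the LLM here.
    return f"Answer to {question!r} using {len(context)} characters of context"


def rag_chain_sketch(x):
    question = normalise_input(x)
    passages = fake_retrieve(question)
    return {
        "question": question,
        "answer": fake_generate(question, "\n\n".join(passages)),
        "contexts": passages,
    }


from_string = rag_chain_sketch("What does the Code of Conduct expect?")
from_dict = rag_chain_sketch({"question": "What does the Code of Conduct expect?"})
print(from_string == from_dict)
print(sorted(from_string))
```

One difference worth knowing: the real chain invokes the retriever twice, once inside the `answer` branch and once inside the `contexts` branch, whereas this sketch retrieves once and reuses the passages.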

In [19]:
q = eval_df.input_question[0]
print("Q: ", q)

Q:  What are the expectations for behavior that contributes to a healthy and friendly work environment according to the 37signals Code of Conduct?


In [21]:
get_answer = RunnableLambda(lambda x: x["answer"])
resp = (rag_chain | get_answer).invoke(q)
resp


'The expectations for behavior that contributes to a healthy and friendly work environment according to the 37signals Code of Conduct include using courteous language, being respectful and empathetic, accepting constructive criticism, and assuming good intentions. Unacceptable behavior includes the use of sexualized or violent language, making unwelcome sexual advances, and any form of discrimination or harassment. Employees are expected to report any violations of the Code of Conduct to their manager or leadership for review and investigation.'

### LLM

In [22]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

Question: {question}

Helpful Answer:"""
llm_prompt = PromptTemplate.from_template(template)

just_llm = (
    {"question": RunnablePassthrough()}
    | llm_prompt
    | llm
    | StrOutputParser()
    | RunnableParallel({
        "answer": RunnablePassthrough(),
        "contexts": RunnableLambda(lambda _: [""]),
    })
)

In [23]:
q = eval_df.input_question[0]
print("Q: ", q)



Q:  What are the expectations for behavior that contributes to a healthy and friendly work environment according to the 37signals Code of Conduct?


In [25]:
resp = (just_llm | get_answer).invoke(q)
resp



'The 37signals Code of Conduct expects employees to be respectful, open-minded, and collaborative in order to contribute to a healthy and friendly work environment. It also emphasizes the importance of communication, empathy, and a positive attitude towards colleagues. Thanks for asking!'

# Evaluate: just_llm vs rag_chain

Let's evaluate and compare `just_llm` and `rag_chain`. We have some utility functions to help us with this.

The first is `EvaluatorChain`, which wraps a Ragas metric so that it can be used as a LangSmith evaluator on the output of a LangChain runnable.

The second is `evaluate`, which runs the evaluations for you.

In [26]:
from ragas.integrations.langsmith import evaluate

from ragas.metrics import answer_correctness
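
To make the scores below easier to read, here is a rough sketch of the arithmetic behind an answer-correctness style score. It assumes the scheme described in the Ragas documentation: an F1 score over statements that an LLM judge has labelled true positive, false positive, or false negative, blended with a semantic-similarity score using default weights of 0.75 and 0.25. The function names, the counts, and the similarity value are all made up for illustration; the real metric obtains them from LLM and embedding calls.

```python
def f1_from_statements(tp: int, fp: int, fn: int) -> float:
    """F1 over statement counts: TP / (TP + 0.5 * (FP + FN))."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness_sketch(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted average of factual F1 and semantic similarity (weights are an assumption)."""
    w_factual, w_semantic = weights
    factual = f1_from_statements(tp, fp, fn)
    return (w_factual * factual + w_semantic * semantic_similarity) / (w_factual + w_semantic)


# Hypothetical example: 2 statements supported by the ground truth, 1 unsupported
# statement in the answer, 1 ground-truth statement the answer missed.
score = answer_correctness_sketch(tp=2, fp=1, fn=1, semantic_similarity=0.9)
print(round(score, 3))
```

Under these assumptions, an answer that sounds similar to the ground truth but gets the facts wrong is still penalised heavily, because the factual term carries most of the weight.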

In [30]:
dataset_name = "basecamp"
# evaluate the LLM-only baseline
run = evaluate(
    dataset_name=dataset_name, 
    llm_or_chain_factory=just_llm, 
    experiment_name="just_llm",
    metrics=[answer_correctness],
    verbose=True
)

View the evaluation results for project 'just_llm' at:
https://smith.langchain.com/o/9bfbddc5-b88e-41e5-92df-2a62f0c64b4b/datasets/8f267706-24b2-47fb-84ee-3ea3cfc5a0c0/compare?selectedSessions=ec853c9b-906e-43b0-8b61-676f4348fdea

View all tests for Dataset basecamp at:
https://smith.langchain.com/o/9bfbddc5-b88e-41e5-92df-2a62f0c64b4b/datasets/8f267706-24b2-47fb-84ee-3ea3cfc5a0c0
[------------------------------------------------->] 7/7

| | feedback.answer_correctness | error | execution_time | run_id |
|---|---|---|---|---|
| count | 7.0 | 0.0 | 7.0 | 7 |
| unique | | 0.0 | | 7 |
| top | | | | 960bed4f-21e3-43b1-9498-dbbb98bb6781 |
| freq | | | | 1 |
| mean | 0.499785 | | 1.238861 | |
| std | 0.132672 | | 0.242346 | |
| min | 0.218822 | | 0.958083 | |
| 25% | 0.49614 | | 0.996293 | |
| 50% | 0.539284 | | 1.353536 | |
| 75% | 0.567302 | | 1.447945 | |


In [31]:
# evaluate rag_chain
run = evaluate(
    dataset_name=dataset_name,
    llm_or_chain_factory=rag_chain, 
    experiment_name="rag_chain",
    metrics=[answer_correctness], 
    verbose=True
)

View the evaluation results for project 'rag_chain' at:
https://smith.langchain.com/o/9bfbddc5-b88e-41e5-92df-2a62f0c64b4b/datasets/8f267706-24b2-47fb-84ee-3ea3cfc5a0c0/compare?selectedSessions=077a232f-e162-4635-82a4-4db28328fa66

View all tests for Dataset basecamp at:
https://smith.langchain.com/o/9bfbddc5-b88e-41e5-92df-2a62f0c64b4b/datasets/8f267706-24b2-47fb-84ee-3ea3cfc5a0c0
[------------------------------------------------->] 7/7

| | feedback.answer_correctness | error | execution_time | run_id |
|---|---|---|---|---|
| count | 7.0 | 0.0 | 7.0 | 7 |
| unique | | 0.0 | | 7 |
| top | | | | 7d1019e7-7e9f-4081-b7bb-93c7a02188fb |
| freq | | | | 1 |
| mean | 0.623937 | | 10.036328 | |
| std | 0.093188 | | 18.716885 | |
| min | 0.493509 | | 2.230393 | |
| 25% | 0.557318 | | 2.622856 | |
| 50% | 0.669444 | | 3.157579 | |
| 75% | 0.688257 | | 3.577293 | |
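
To put the two summary tables side by side, the short sketch below copies the reported means, standard deviations, and median execution times and computes the gap. The numbers come straight from the outputs above; the variable names are only for illustration. Medians are used for timing because the `rag_chain` mean (about 10 s, with a standard deviation near 18.7 s) looks dominated by one slow run.

```python
# Summary statistics copied from the two evaluation outputs (7 examples each).
just_llm_stats = {"mean": 0.499785, "std": 0.132672, "median_time_s": 1.353536}
rag_chain_stats = {"mean": 0.623937, "std": 0.093188, "median_time_s": 3.157579}

abs_gain = rag_chain_stats["mean"] - just_llm_stats["mean"]
rel_gain = abs_gain / just_llm_stats["mean"]
slowdown = rag_chain_stats["median_time_s"] / just_llm_stats["median_time_s"]

print(f"answer_correctness: +{abs_gain:.3f} absolute ({rel_gain:.1%} relative)")
print(f"median execution time: {slowdown:.1f}x slower")
```

With only 7 examples and standard deviations of roughly 0.09 to 0.13, the gap in favour of `rag_chain` is suggestive rather than conclusive; a larger evaluation set would be needed to pin it down.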


Now you can check your LangSmith dataset dashboard to view and analyse the results.