<a href="https://colab.research.google.com/github/bekingcn/colab-archive/blob/main/Synthetic_Data_Generation_RAGAS_%26_LangSmith.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Synthetic Data Generation Using RAGAS - RAG Evaluation with LangSmith

In the following notebook we'll explore a use-case for RAGAS' synthetic testset generation workflow, inspired by the [Evol Instruct](https://arxiv.org/abs/2304.12244) paper.

We will use this pipeline to:

1. Generate synthetic Question/Ground Truth Pairs
2. Load them into a LangSmith Dataset
3. Evaluate our RAG chain against the synthetic test data
4. Make changes to our pipeline
5. Evaluate the modified pipeline

Synthetic Data Generation (SDG) is a critical piece of the puzzle, especially for early iteration! Without it, getting a high-quality early signal on our application's performance would be much harder.

Let's dive in!

## Dependencies and API Keys

We'll need to install a number of dependencies and provide a few API keys, since we'll be leveraging a number of great technologies for this pipeline!

1. OpenAI's endpoints to handle the Synthetic Data Generation
2. TogetherAI's Endpoints for our RAG pipeline and LangSmith evaluation
3. Qdrant as our vectorstore
4. LangSmith as our evaluation coordinator!

Let's install and provide all the required information below!

In [None]:
!pip install -qU langsmith langchain-together langchain-core langchain-community langchain-openai langchain-qdrant


In [None]:
!pip install -qU pymupdf ragas

We'll need to provide our LangSmith API key, and set tracing to "true".

In [None]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")



We'll also want to set a project name to make things easier for ourselves.

In [None]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

Next, we'll need a Together AI API key. You can follow the process outlined at step 1 [here](https://docs.together.ai/docs/quickstart#1-register-for-an-account) to obtain one.

> NOTE: This notebook can be executed with the free \$5 given to new Together AI accounts. This notebook will consume ~\$0.50 in credits total. Details about pricing are available [here](https://www.together.ai/pricing).

In [None]:
os.environ["TOGETHER_API_KEY"] = getpass.getpass("Together API Key:")



Finally, we'll provide our OpenAI API key, which is used for the Synthetic Data Generation step.

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")



## Loading Source Documents

In order to create a synthetic dataset, we must first load our source documents!

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

documents = PyMuPDFLoader(file_path="https://s2.q4cdn.com/470004039/files/doc_earnings/2024/q3/filing/_10-Q-Q3-2024-As-Filed.pdf").load()

Creating our Synthetic Dataset is as simple as running the following cell.

You'll notice that we're declaring a `distributions` mapping below - this controls what *kinds* of questions are created, and in what proportion. More information is available [here](https://docs.ragas.io/en/latest/concepts/testset_generation.html#in-depth-evolution).

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-4o-mini")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
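
The weights in `distributions` act as proportions over question types, so they should add up to 1. Here is a minimal sketch of a sanity check, using plain strings in place of the RAGAS evolution objects. `expected_counts` is a hypothetical helper written for this illustration, and the simple rounding is an assumption - RAGAS may allocate questions differently internally.

```python
import math


def expected_counts(distribution, test_size):
    """Estimate how many questions of each type a weight mapping implies.

    Hypothetical helper for illustration only; not part of RAGAS.
    """
    total = sum(distribution.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights must sum to 1, got {total}")
    # Assumption: plain rounding of weight * test_size.
    return {name: round(weight * test_size) for name, weight in distribution.items()}


counts = expected_counts(
    {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1},
    test_size=20,
)
print(counts)
```

With the weights used in this notebook and a test size of 20, this works out to roughly 10 simple, 8 multi-context, and 2 reasoning questions.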

Let's generate!

> NOTE: This cell will take some time, and also make a lot of calls to OpenAI's endpoints! You may run into rate-limits during this cell!

In [None]:
testset = generator.generate_with_langchain_docs(documents, 20, distributions)
testset.to_pandas()


| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the purpose of the commercial paper pr... | [Note 6 – Income Taxes\nEuropean Commission St... | The purpose of the commercial paper program fo... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 1 | What instruments does the Company use as cash ... | [ of reasons, including accounting considerati... | The Company uses forwards, options, and other ... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 2 | What impact do macroeconomic conditions have o... | [Item 2. \nManagement’s Discussion and Analysi... | Macroeconomic conditions, including inflation,... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 3 | What are the responsibilities of the certifyin... | [Exhibit 31.2\nCERTIFICATION\nI, Luca Maestri,... | The certifying officer's responsibilities rega... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 4 | What does the term 'fair value' refer to in th... | [September 30, 2023\nAdjusted\nCost\nUnrealize... | The term 'fair value' in the context of financ... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 5 | What factors can materially and adversely affe... | [ ended March 30, 2024 (the “second quarter 20... | The context mentions that the Company's busine... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 6 | What was the total number of shares purchased ... | [Item 2. \nUnregistered Sales of Equity Securi... | The total number of shares purchased through o... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 7 | What are the key components of the Company's c... | [Selling, General and Administrative\nSelling,... | The key components of the Company's capital re... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 8 | What factors contributed to the increase in se... | [Products and Services Performance\nThe follow... | The increase in services net sales during the ... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |
| 9 | What is the significance of the Sarbanes-Oxley... | [Exhibit 32.1\nCERTIFICATIONS OF CHIEF EXECUTI... | The Sarbanes-Oxley Act of 2002 is significant ... | simple | [{'source': 'https://s2.q4cdn.com/470004039/fi... | True |


## LangSmith Dataset

Now we can move on to creating a dataset for LangSmith!

First, we'll need to create a dataset on LangSmith using the `Client`!

We'll name our Dataset to make it easy to work with later.

In [None]:
from langsmith import Client

client = Client()

dataset_name = "Apple 10-Q Filing Questions - v1"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Questions about Apple's 10-Q Filing"
)

We'll iterate through the RAGAS created dataframe - and add each example to our created dataset!

> NOTE: We need to conform the outputs to the expected format - which in this case is: `question` and `answer`.

In [None]:
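# iterrows() yields (index, row) pairs; only the row is needed here
for _, row in testset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": row["question"]
      },
      outputs={
          "answer": row["ground_truth"]
      },
      metadata={
          # store the source contexts for the question, not the row index
          "context": row["contexts"]
      },
      dataset_id=dataset.id
  )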
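
To make the expected format concrete, here is a small sketch of the mapping from one testset row to the `inputs`, `outputs`, and `metadata` fields of a LangSmith example. `row_to_example` is a hypothetical helper, and the sample row holds made-up placeholder values rather than real generated data. Keeping the mapping in a pure function like this makes it easy to check without calling LangSmith.

```python
def row_to_example(row):
    """Map one testset row to LangSmith example fields.

    Hypothetical helper for illustration; `row` is any mapping with
    "question", "ground_truth", and "contexts" keys.
    """
    return {
        "inputs": {"question": row["question"]},
        "outputs": {"answer": row["ground_truth"]},
        "metadata": {"context": row["contexts"]},
    }


# Placeholder values, not real generated data.
sample_row = {
    "question": "placeholder question",
    "ground_truth": "placeholder answer",
    "contexts": ["placeholder context"],
}

example_fields = row_to_example(sample_row)
print(example_fields)
```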

## Basic RAG Chain

Time for some RAG!

We'll use the Apple 10-Q filing as our data source today!


In [None]:
rag_documents = PyMuPDFLoader(file_path="https://s2.q4cdn.com/470004039/files/doc_earnings/2024/q3/filing/_10-Q-Q3-2024-As-Filed.pdf").load()

To keep things simple, we'll just use LangChain's recursive character text splitter!


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)
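
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch of fixed-size character chunking with overlap. This is not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to split on separators such as paragraph and line breaks); `chunk_text` is a hypothetical function for illustration only.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration; the real splitter also respects separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk shares its first character with the end of the previous one, which is the role the 50-character overlap plays for the 500-character chunks above.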

We'll create our vectorstore using a TogetherAI [embedding model](https://docs.together.ai/docs/embedding-models)!

We'll specifically use:

- `togethercomputer/m2-bert-80M-8k-retrieval` - this embedding model is 80M parameters, with 768 as the embedding dimension.

In [None]:
from langchain_together.embeddings import TogetherEmbeddings

embeddings = TogetherEmbeddings(model="togethercomputer/m2-bert-80M-8k-retrieval")

As usual, we will power our RAG application with Qdrant!

In [None]:
from langchain_community.vectorstores import Qdrant

vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Apple 10-Q"
)

In [None]:
retriever = vectorstore.as_retriever()
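
Conceptually, the retriever embeds the question and returns the chunks whose vectors are most similar to it. The following is a minimal pure-Python sketch of that idea, assuming cosine similarity and non-zero vectors. `cosine_similarity` and `top_k` are hypothetical helpers with toy 2-dimensional vectors; the actual search is carried out by Qdrant.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(chunk_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


# Toy vectors standing in for real chunk embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k([1.0, 0.1], toy_vectors, k=2)
print(best)
```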

To get the "A" in RAG, we'll provide a prompt.

In [None]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

For our LLM, we will be using TogetherAI's endpoints as well!

We're going to be using Meta Llama 3.1 70B Instruct Turbo - a capable model that should give us strong results!

In [None]:
from langchain_together import ChatTogether

llm = ChatTogether(model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo")

Finally, we can set-up our RAG LCEL chain!

In [None]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)
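
If the LCEL pipe syntax is unfamiliar, the data flow of the chain can be sketched in plain Python: take the question, retrieve context for it, fill the prompt, and pass the prompt to the model. Everything below is a stand-in - `build_chain`, `fake_retrieve`, and `fake_llm` are hypothetical names, and no LangChain objects are involved.

```python
def build_chain(retrieve, format_prompt, call_llm):
    """Compose retrieval, prompt formatting, and generation into one callable."""
    def chain(inputs):
        question = inputs["question"]
        context = retrieve(question)  # analogous to: itemgetter("question") | retriever
        prompt = format_prompt(context=context, question=question)
        return call_llm(prompt)  # analogous to: llm | StrOutputParser()
    return chain


def fake_retrieve(question):
    # Stand-in retriever: always returns the same stub chunk.
    return ["stub chunk"]


def fake_llm(prompt):
    # Stand-in model: echoes the prompt it received.
    return f"echo: {prompt}"


template = "Context: {context}\nQuestion: {question}"
sketch_chain = build_chain(fake_retrieve, template.format, fake_llm)
result = sketch_chain({"question": "stub question"})
print(result)
```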

In [None]:
rag_chain.invoke({"question" : "Does Apple seem to be in good financial health?"})

"Yes, based on the context, Apple's financial health appears to be good. The provided documents show that the company's net sales have increased or remained relatively flat in various categories, such as Mac, iPad, and Services, compared to the same periods in 2023. Additionally, the total net sales have increased by 5% in the third quarter and 1% in the first nine months of 2024 compared to the same periods in 2023. These numbers suggest a stable and growing financial performance for Apple."

## LangSmith Evaluation Set-up

We'll use TogetherAI's Llama 3.1 405B Instruct Turbo as our evaluation LLM for our base Evaluators.

In [None]:
eval_llm = ChatTogether(model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo")

We'll be using a number of evaluators - from LangSmith provided evaluators, to a few custom evaluators!

In [None]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})
context_qa_evaluator = LangChainStringEvaluator("context_qa", config={"llm" : eval_llm})
cot_qa_evaluator = LangChainStringEvaluator("cot_qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        }
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["output"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

dope_or_nope_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "dopeness": "Is this submission dope, lit, or cool?",
        }
    }
)
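
The `prepare_data` callable used for the labeled helpfulness evaluator simply picks the prediction, reference, and input out of the run and the dataset example. Here is a sketch of that mapping written as a named function and exercised with stand-in objects; `SimpleNamespace` substitutes for the real LangSmith run and example objects, and all values are placeholders.

```python
from types import SimpleNamespace


def prepare_data(run, example):
    """Select the fields a labeled evaluator needs from a run and its example."""
    return {
        "prediction": run.outputs["output"],     # what the chain produced
        "reference": example.outputs["answer"],  # the ground-truth answer
        "input": example.inputs["question"],     # the original question
    }


# Stand-ins for LangSmith objects, with placeholder values.
stub_run = SimpleNamespace(outputs={"output": "stub prediction"})
stub_example = SimpleNamespace(
    inputs={"question": "stub question"},
    outputs={"answer": "stub reference"},
)

fields = prepare_data(stub_run, stub_example)
print(fields)
```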

## LangSmith Evaluation

In [None]:
evaluate(
    rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        context_qa_evaluator,
        cot_qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "default_chain"},
)

## Dope-ifying Our Application

We'll be making a few changes to our RAG chain to increase its performance on our SDG evaluation test dataset!

- Include a "dope" prompt augmentation
- Use larger chunks
- Switch the embedding model used for retrieval to: `BAAI/bge-large-en-v1.5`

Let's see how this changes our evaluation!

In [None]:
DOPE_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the questions in a dope way, be cool!

Context: {context}
Question: {question}
"""

dope_rag_prompt = ChatPromptTemplate.from_template(DOPE_RAG_PROMPT)

In [None]:
rag_documents = PyMuPDFLoader(file_path="https://s2.q4cdn.com/470004039/files/doc_earnings/2024/q3/filing/_10-Q-Q3-2024-As-Filed.pdf").load()

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(rag_documents)

In [None]:
from langchain_together.embeddings import TogetherEmbeddings

embeddings = TogetherEmbeddings(model="BAAI/bge-large-en-v1.5")

In [None]:
vectorstore = Qdrant.from_documents(
    documents=rag_documents,
    embedding=embeddings,
    location=":memory:",
    collection_name="Apple 10-Q (Augmented)"
)

In [None]:
retriever = vectorstore.as_retriever()

In [None]:
dope_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | dope_rag_prompt | llm | StrOutputParser()
)

In [None]:
dope_rag_chain.invoke({"question" : "Does Apple seem to be in good financial health?"})

"Yaaas, Apple's financials are lookin' straight fire! They're rockin' a total of $331,612 in assets, with a significant chunk of that comin' from marketable securities and cash. Their net sales are also on point, with a total of $296,105 for the nine months ended June 29, 2024. And let's not forget about that gross margin, baby - it's a whoppin' $136,804! They're also keepin' their operating expenses in check, with a total of $43,179 for the nine months ended June 29, 2024. All in all, Apple's financials are lookin' healthy and strong, homie!"

In [None]:
evaluate(
    dope_rag_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        context_qa_evaluator,
        cot_qa_evaluator,
        labeled_helpfulness_evaluator,
        dope_or_nope_evaluator
    ],
    metadata={"revision_id": "dope_chain"},
)