**Step 0: Imports, constants, and API Keys!**

In [1]:
!pip install -q langchain==0.2.16 langchain_core==0.2.38 langchain_community==0.2.16 pymupdf openai 
!pip install -q langchain_openai==0.1.23 langchain-qdrant qdrant_client asyncio ragas==0.1.14 pandas
!pip install -q langsmith

In [8]:
# RAG constants
CHUNK_SIZE = 1500
OVERLAP = 150
BASELINE_EMBEDDING_MODEL = "text-embedding-3-small"
BASELINE_CHAT_MODEL = "gpt-4o-mini-2024-07-18"

# RAGAS constants
RAGAS_CHUNK_SIZE = 750
RAGAS_OVERLAP = 75
GENERATOR_LLM = "gpt-4o-mini-2024-07-18"
CRITIC_LLM = "gpt-4o-2024-08-06"
N_EVAL_QUESTIONS = 30 # IRL, we'd want more, and maybe a test and validation set. But set it low to accommodate low rate limits.
TEST_DATASET_FILE = f"test_dataset_{N_EVAL_QUESTIONS}.csv"

# Dataset
PDFS = [
    "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
    "https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf"
]

In [3]:
import os
import openai
from getpass import getpass

# collect OpenAI key
openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

**Step 1: Download and chunk the data**

We are going to use the following docs as our knowledge base:
1. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People (PDF)
2. National Institute of Standards and Technology (NIST) Artificial Intelligent Risk Management Framework 

Let's start with a simple fixed chunking strategy as a baseline, and later evaluate parent-doc retrieval if we have time

In [4]:
import importlib
import vanilla_rag

importlib.reload(vanilla_rag)
for pdf in PDFS:
    chunks = await vanilla_rag.load_and_chunk_pdf(pdf,CHUNK_SIZE,OVERLAP)


Loading https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf...
Chunking...
Loading https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf...
Chunking...


**Step 2: Basic RAG Pipeline**

In [5]:
importlib.reload(vanilla_rag)
rag_chain = await vanilla_rag.vanilla_openai_rag_chain(texts=chunks, 
                                            openai_key=openai.api_key, 
                                            embedding_model=BASELINE_EMBEDDING_MODEL,
                                            chat_model=BASELINE_CHAT_MODEL)

created qdrant client
created embeddings
populated vector db
created chain


In [19]:
from pprint import pprint
response = await rag_chain.ainvoke({"input":"What are some key risks associated with modern LLMs?"})
pprint(response)

{'context': [Document(metadata={'_id': '0c56626e4e1248b8aa06650938c75b4a', '_collection_name': 'default'}, page_content='with greater ease and scale than other technologies. LLMs have been reported to generate dangerous or \nviolent recommendations, and some models have generated actionable instructions for dangerous or \n \n \n9 Confabulations of falsehoods are most commonly a problem for text-based outputs; for audio, image, or video \ncontent, creative generation of non-factual content can be a desired behavior.  \n10 For example, legal confabulations have been shown to be pervasive in current state-of-the-art LLMs. See also, \ne.g.,'),
             Document(metadata={'_id': '75e5f9bf15df4824a59305d8ed7d9e11', '_collection_name': 'default'}, page_content='development, production, or use of CBRN weapons or other dangerous materials or agents. While \nrelevant biological and chemical threat knowledge and information is often publicly accessible, LLMs \ncould facilitate its analysis or

**Step 3: Generate synthetic data**

In [6]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model=GENERATOR_LLM)
critic_llm = ChatOpenAI(model=CRITIC_LLM)
embeddings = OpenAIEmbeddings()

# Initialize data generator and set up distributions
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.3,
    reasoning: 0.1,
    conditional: 0.1
}

In [7]:
# re-chunk the data using a different size, then generate the synthetic test set
importlib.reload(vanilla_rag)
for pdf in PDFS:
    ragas_chunks = await vanilla_rag.load_and_chunk_pdf(pdf,RAGAS_CHUNK_SIZE,RAGAS_OVERLAP)

testset = generator.generate_with_langchain_docs(ragas_chunks, N_EVAL_QUESTIONS, distributions, with_debugging_logs=True)

Loading https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf...
Chunking...
Loading https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf...
Chunking...


embedding nodes:   0%|          | 0/520 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/30 [00:00<?, ?it/s]

[ragas.testset.filters.DEBUG] context scoring: {'clarity': 1, 'depth': 1, 'structure': 1, 'relevance': 1, 'score': 1.0}
[ragas.testset.evolutions.INFO] retrying evolution: 0 times
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 1, 'depth': 2, 'structure': 1, 'relevance': 2, 'score': 1.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['AI life cycle', 'Harmful Bias', 'Fact-checking techniques', 'GAI systems', 'Information Integrity']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 1, 'depth': 2, 'structure': 1, 'relevance': 2, 'score': 1.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Organizational risk tolerance', 'GAI system outputs', 'Safety and validity review', 'Information integrity', 'Security anomalies']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'structure': 2, 'relevance': 3, 'score': 2.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Sensitive information', 'Adversarial attack

In [9]:
import pandas as pd
# Generating the test data costs money, time, and compute, so make sure to save it for later re-use
test_df = testset.to_pandas().to_csv(TEST_DATASET_FILE,index=False)


**Step 4: Evaluate baseline RAG system**

In [28]:
# Load the dataset and run the RAG pipeline
import pandas as pd
from tqdm.asyncio import tqdm_asyncio

test_df = pd.read_csv(TEST_DATASET_FILE)

test_questions = test_df["question"].to_list()
test_gt = test_df["ground_truth"].to_list()

answers = []
contexts = []

for question in tqdm_asyncio(test_questions,desc="Processing Questions"):
  response = await rag_chain.ainvoke({"input" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Processing Questions:  13%|█▎        | 4/30 [00:20<02:15,  5.23s/it]

In [33]:
from datasets import Dataset

# Put in huggingface dataset format
response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_gt
})
response_dataset.save_to_disk(f"baseline_response_dataset_{N_EVAL_QUESTIONS}")


Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]

In [34]:
# Use ragas to evaluate
from datasets import load_from_disk
response_dataset = load_from_disk("baseline_response_dataset")

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

metrics = [ faithfulness, answer_relevancy, context_precision, context_recall ]
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/400 [00:00<?, ?it/s]

Exception raised in Job[377]: APIConnectionError(Connection error.)
Exception raised in Job[383]: APIConnectionError(Connection error.)
Exception raised in Job[397]: APIConnectionError(Connection error.)
Exception raised in Job[399]: APIConnectionError(Connection error.)
Exception raised in Job[332]: APIConnectionError(Connection error.)
Exception raised in Job[328]: APIConnectionError(Connection error.)
Exception raised in Job[390]: APIConnectionError(Connection error.)
Exception raised in Job[324]: APIConnectionError(Connection error.)
Exception raised in Job[326]: APIConnectionError(Connection error.)
Exception raised in Job[380]: APIConnectionError(Connection error.)
Exception raised in Job[317]: APIConnectionError(Connection error.)
Exception raised in Job[319]: APIConnectionError(Connection error.)
Exception raised in Job[330]: APIConnectionError(Connection error.)
Exception raised in Job[334]: APIConnectionError(Connection error.)
Exception raised in Job[381]: APIConnectionError

KeyboardInterrupt: 

In [32]:
# Check out the results
print(results)

{'faithfulness': nan, 'answer_relevancy': nan, 'context_precision': nan, 'context_recall': nan}
