### Task 1: Dealing with the Data
You identify the following important documents that, if used for context, you believe will help people understand what’s happening now:

https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf

https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf


Your boss, the SVP of Technology, green-lighted this project to drive the adoption of AI throughout the enterprise.  It will be a nice showpiece for the upcoming conference and the big AI initiative announcement the CEO is planning.

In [7]:
%pip install -qU langgraph langchain langchain_openai langchain_experimental

Note: you may need to restart the kernel to use updated packages.


In [4]:
%pip install -qU --disable-pip-version-check qdrant-client pymupdf tiktoken

Note: you may need to restart the kernel to use updated packages.


### RAGAS DEPENDENCIES

In [9]:
%pip install -qU langsmith langchain-qdrant ragas
# Quote the version specifiers so the shell does not treat < and > as redirections.
%pip install -q "langchain-community>=0.3.0,<0.4.0"
%pip install -q "langchain-core>=0.3.0,<0.4.0"

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [5]:
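import os
import getpass
from dotenv import load_dotenv

# Load variables from a local .env file, if one exists.
load_dotenv()

# Fall back to an interactive prompt when the key is not already set.
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]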

In [13]:
from langchain_community.document_loaders import PyMuPDFLoader

# PyMuPDFLoader returns one Document per PDF page, so the count below is a page count.
docB = PyMuPDFLoader("https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf").load()
docN = PyMuPDFLoader("https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf").load()

documents = docB + docN
print(f"Loaded {len(documents)} documents")
print(f"Loaded {documents[:1]} documents")

Loaded 137 documents
Loaded [Document(metadata={'source': 'https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf', 'file_path': 'https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf', 'page': 0, 'total_pages': 73, 'format': 'PDF 1.6', 'title': 'Blueprint for an AI Bill of Rights', 'author': '', 'subject': '', 'keywords': '', 'creator': 'Adobe Illustrator 26.3 (Macintosh)', 'producer': 'iLovePDF', 'creationDate': "D:20220920133035-04'00'", 'modDate': "D:20221003104118-04'00'", 'trapped': ''}, page_content=' \n \n \n \n \n \n \n \n \n \nBLUEPRINT FOR AN \nAI BILL OF \nRIGHTS \nMAKING AUTOMATED \nSYSTEMS WORK FOR \nTHE AMERICAN PEOPLE \nOCTOBER 2022 \n')] documents
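
Because each loaded `Document` is a single page, the total of 137 mixes pages from both PDFs. A minimal sketch of tallying pages per source file, using plain dictionaries as stand-ins for `Document.metadata` (the shortened `source` strings and the `pages_per_source` helper are illustrative assumptions, not part of LangChain):

```python
from collections import Counter

# Stand-ins for Document.metadata; a real run would use [d.metadata for d in documents].
toy_metadata = [
    {"source": "blueprint.pdf", "page": 0},
    {"source": "blueprint.pdf", "page": 1},
    {"source": "nist.pdf", "page": 0},
]


def pages_per_source(metadata_list):
    """Count how many page-level records come from each source file."""
    return Counter(meta["source"] for meta in metadata_list)


print(pages_per_source(toy_metadata))
```

On the real data, the same tally would show how the 137 pages divide between the two documents.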


In [14]:
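import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Look up the tokenizer once instead of on every call.
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

def tiktoken_len(text):
    """Return the number of tokens in text under the gpt-4o-mini tokenizer."""
    return len(encoding.encode(text))

# chunk_size is measured by length_function, so 300 means 300 tokens, not characters.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 0,
    length_function = tiktoken_len,
)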

split_chunks = text_splitter.split_documents(documents)

print(f"Split {len(split_chunks)} chunks")

Split 363 chunks
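
The splitter measures `chunk_size` with `length_function`, so the limit of 300 counts tokens rather than characters. The idea can be sketched with a much simpler greedy packer. This is not the `RecursiveCharacterTextSplitter` algorithm, and whitespace-separated words stand in for tiktoken tokens; `word_len` and `greedy_chunks` are hypothetical names used only for illustration:

```python
def word_len(text: str) -> int:
    """Toy length function: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def greedy_chunks(paragraphs, chunk_size, length_function=word_len):
    """Pack paragraphs into chunks whose measured length stays within chunk_size.

    A paragraph that is longer than chunk_size on its own becomes its own chunk.
    """
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and length_function(candidate) > chunk_size:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["one two three", "four five", "six seven eight nine"]
print(greedy_chunks(paragraphs, chunk_size=5))
```

Swapping `word_len` for a tokenizer-based function changes what "size" means without touching the packing logic, which is the same separation the real splitter relies on.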


In [15]:
from langchain_openai import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

In [16]:
from langchain_community.vectorstores import Qdrant

qdrant_vectorstore = Qdrant.from_documents(
    split_chunks,
    embedding_model,
    location=":memory:",
    collection_name="extending_context_window_llama_3",
)



In [18]:
qdrant_retriever = qdrant_vectorstore.as_retriever()
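
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors are most similar to it. A minimal sketch of that ranking step with hand-made 2-D vectors follows; the vectors, texts, and `top_k` helper are illustrative assumptions, and Qdrant's actual index and distance configuration are not shown:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (text, embedding) pairs with made-up embeddings.
store = [
    ("chunk about privacy", [1.0, 0.0]),
    ("chunk about testing", [0.0, 1.0]),
    ("chunk about both", [1.0, 1.0]),
]
print(top_k([1.0, 0.1], store, k=2))
```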

In [19]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """
CONTEXT:
{context}

QUERY:
{question}

You are a helpful assistant. Use only the available context to answer the question. If the answer is not in the context, say 'I don't know brah!'.
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [20]:
from langchain_openai import ChatOpenAI

openai_chat_model = ChatOpenAI(model="gpt-4o-mini")

In [21]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

# "context": the question is sent to the Qdrant retriever, which returns the matching chunks.
# "question": the question is passed through unchanged.
# Both values fill the RAG prompt, which goes to the OpenAI chat model; the reply is parsed to a string.
rag_chain = (
    {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}
    | rag_prompt | openai_chat_model | StrOutputParser()
)
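
The dictionary at the head of the chain fans the single input out into two prompt variables: `context` (the retriever's results for the question) and `question` (the question itself). A plain-Python sketch of the same data flow, with hypothetical stub functions standing in for the retriever and the chat model so that it runs without any API calls:

```python
PROMPT = "CONTEXT:\n{context}\n\nQUERY:\n{question}\n"


def stub_retriever(question):
    """Stand-in for qdrant_retriever: returns canned text instead of searching."""
    return ["CBRN stands for chemical, biological, radiological, and nuclear."]


def stub_model(prompt_text):
    """Stand-in for the chat model: reports the prompt size instead of answering."""
    return f"(model saw {len(prompt_text)} characters)"


def run_chain(inputs, retriever=stub_retriever, model=stub_model):
    # Step 1: build both prompt variables from the single "question" input.
    variables = {
        "context": retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: fill the template. Step 3: call the model on the filled prompt.
    prompt_text = PROMPT.format(**variables)
    return model(prompt_text)


print(run_chain({"question": "What does CBRN stand for?"}))
```

The `|` operator in the real chain composes these same three steps; the sketch only makes the intermediate values explicit.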

In [22]:
rag_chain.invoke({"question" : "tell about CBRN Information or Capabilities?"})

'CBRN Information or Capabilities refer to information and capabilities related to Chemical, Biological, Radiological, and Nuclear threats. The context highlights the need to periodically evaluate whether models may misuse CBRN information or capabilities, as well as the importance of governance and oversight concerning dangerous, violent, or hateful content associated with these capabilities. Additionally, it emphasizes establishing policies and procedures for risk measurement related to CBRN information within structured frameworks.'

### TASK 3: RAGAS FRAMEWORK - GENERATING SYNTHETIC DATA


In [23]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
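
The weights assign half of the generated questions to the `simple` evolution, 40% to `multi_context`, and 10% to `reasoning`. A small sketch of a sanity check that such weights form a valid distribution; string keys stand in for the ragas evolution objects, and `check_distribution` is a hypothetical helper, not a ragas function:

```python
import math


def check_distribution(weights):
    """Raise ValueError unless weights are non-negative and sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return True


toy_distribution = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
print(check_distribution(toy_distribution))
```

With the test size of 5 used in the next cell, a weight of 0.1 corresponds to half a question, so a small test set may contain no `reasoning` rows at all.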

In [24]:
testset = generator.generate_with_langchain_docs(documents, 5, distributions, with_debugging_logs=True)

embedding nodes:   0%|          | 0/284 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/5 [00:00<?, ?it/s]

[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'structure': 2, 'relevance': 3, 'score': 2.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Human subjects', 'Content provenance data', 'Data privacy', 'AI system performance', 'Pre-deployment testing']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 1, 'depth': 1, 'structure': 2, 'relevance': 2, 'score': 1.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Technical companion', 'AI Bill of Rights', 'Algorithmic discrimination protections', 'Data privacy', 'Human alternatives']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 2, 'depth': 3, 'structure': 2, 'relevance': 3, 'score': 2.5}
[ragas.testset.evolutions.DEBUG] keyphrases in merged node: ['Information sharing', 'Feedback mechanisms', 'Negative impact', 'GAI systems', 'AI risks']
[ragas.testset.filters.DEBUG] context scoring: {'clarity': 1, 'depth': 1, 'structure': 2, 'relevance': 2, 'score': 1.5}
[ragas.te

In [25]:
testset.to_pandas()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | How do GAI value chains involve third-party co... | [ \n12 \nCSAM. Even when trained on “clean” da... | GAI value chains involve many third-party comp... | simple | [{'source': 'https://nvlpubs.nist.gov/nistpubs... | True |
| 1 | How can organizations verify information shari... | [ \n20 \nGV-4.3-003 \nVerify information shari... | Organizations can verify information sharing a... | simple | [{'source': 'https://nvlpubs.nist.gov/nistpubs... | True |
| 2 | How to measure AI risks in GAI systems, includ... | [ \n28 \nMAP 5.2: Practices and personnel for ... | The answer to given question is not present in... | multi_context | [{'source': 'https://nvlpubs.nist.gov/nistpubs... | True |
| 3 | What does the Executive Order on Advancing Rac... | [ \n \n \n \nENDNOTES\n1.The Executive Order O... | The Executive Order on Advancing Racial Equity... | multi_context | [{'source': 'https://www.whitehouse.gov/wp-con... | True |
| 4 | What are the key components of testing automat... | [ \n \n \n \n \n \n \nSAFE AND EFFECTIVE \nSYS... | Systems should undergo extensive testing befor... | simple | [{'source': 'https://www.whitehouse.gov/wp-con... | True |
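
The ground truth for row 2 begins "The answer to given question is not present in", which suggests the generator could not answer its own question from the supplied context. A sketch of screening such rows out before evaluation, using toy dictionaries in place of the DataFrame rows; the marker is copied from the truncated cell, the toy answers are placeholders, and `answerable` is a hypothetical helper:

```python
# Prefix copied from the truncated ground_truth cell; the full wording is not known.
NOT_PRESENT_MARKER = "The answer to given question is not present in"

# Placeholder rows for illustration only, not the generated test set.
toy_rows = [
    {"question": "toy question A", "ground_truth": "A toy answer."},
    {"question": "toy question B", "ground_truth": NOT_PRESENT_MARKER + " the toy context."},
]


def answerable(rows, marker=NOT_PRESENT_MARKER):
    """Keep only rows whose ground_truth does not start with the marker."""
    return [row for row in rows if not row["ground_truth"].startswith(marker)]


print([row["question"] for row in answerable(toy_rows)])
```

On the real DataFrame, the equivalent filter would be a boolean mask built with `Series.str.startswith` on the `ground_truth` column.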
