**Step 1: Download and chunk the data**

We are going to use the following docs as our knowledge base:
1. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People (PDF)
2. National Institute of Standards and Technology (NIST) Artificial Intelligence Risk Management Framework (PDF)

Let's start with a simple fixed-size chunking strategy as a baseline and, time permitting, evaluate parent-document retrieval later.
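
For intuition, here is a minimal sketch of fixed-size chunking with overlap. It is not the code in `chunk_and_load.py` (that module is not shown here and may well use a library text splitter that respects separators); `chunk_text` is a hypothetical helper that slices purely by character count.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share
    `overlap` characters with the previous window.

    Hypothetical helper for illustration only; it ignores sentence
    and paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so we don't emit a trailing chunk made only of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example: windows of 4 characters, each sharing 1 with its neighbour.
print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

With a chunk size of 1500 and an overlap of 150, each window advances 1350 characters, so the last 150 characters of one chunk reappear at the start of the next. That redundancy helps keep a sentence that straddles a boundary retrievable from at least one chunk.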

In [1]:
!pip install -qU -r requirements.txt 
!pip install -qU langchain-openai langchain-qdrant



In [2]:
# define constants
CHUNK_SIZE = 1500
OVERLAP = 150

PDFS = [
    "https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf",
    "https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf"
]

In [3]:
import os
import openai
from getpass import getpass

# collect OpenAI key
openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

In [29]:
import importlib
import chunk_and_load

importlib.reload(chunk_and_load)


<module 'chunk_and_load' from '/Users/Angela/Desktop/ai_makerspace/code/ai-risk-bot/chunk_and_load.py'>

In [30]:
# Load and chunk our pdfs
chunks = []
for pdf in PDFS:
    chunks.extend(chunk_and_load.load_and_chunk_pdf(pdf, CHUNK_SIZE, OVERLAP))

print(f"Loaded {len(chunks)} chunks")

Loading https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf...
Chunking...
Loading https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf...
Chunking...
Loaded 338 chunks


In [31]:
print(chunks[100])

page_content='correcting data. Entities should conduct regular, independent audits and take prompt corrective measures to 
maintain accurate, timely, and complete data. 
Limit access to sensitive data and derived data. Sensitive data and derived data should not be sold, 
shared, or made public as part of data brokerage or other agreements. Sensitive data includes data that can be 
used to infer sensitive information; even systems that are not directly marketed as sensitive domain technologies 
are expected to keep sensitive data private. Access to such data should be limited based on necessity and based 
on a principle of local control, such that those individuals closest to the data subject have more access while 
those who are less proximate do not (e.g., a teacher has access to their students’ daily progress data while a 
superintendent does not). 
Reporting. In addition to the reporting on data privacy (as listed above for non-sensitive data), entities devel-
oping technologies rel

In [32]:
import vanilla_rag
importlib.reload(vanilla_rag)

rag_chain = await vanilla_rag.vanilla_rag(chunks, openai.api_key, "AI-Risk")

created qdrant client
populated vector db
created chain
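
The `vanilla_rag` module is not shown here, but the log lines above suggest the usual pattern: embed every chunk, store the vectors, then retrieve the most similar chunks for each question. The toy sketch below illustrates only that retrieval idea. It is an assumption-laden stand-in: word counts replace real embeddings, a plain list replaces Qdrant, and `embed`, `cosine`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


docs = [
    "sensitive data should be kept private",
    "generative models can confabulate false citations",
    "audits should be regular and independent",
]
print(retrieve("which data is sensitive", docs, k=1))
```

In a real chain the retrieved chunks would be placed into the prompt as context before the question is sent to the LLM.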


In [22]:
response = await rag_chain.ainvoke({"input": "What are some key risks associated with modern LLMs?"})
print(response)

{'response': AIMessage(content='Some key risks associated with modern LLMs include:\n\n1. **Dangerous or Violent Recommendations**: LLMs have been reported to generate content that incites violence or provides dangerous recommendations.\n\n2. **Confabulations of Falsehoods**: LLMs can produce incorrect information or logical steps that mislead users, potentially leading to harmful decisions, especially in critical areas like healthcare.\n\n3. **Deceptive Outputs**: LLMs may falsely assert human-like traits, deceiving users into believing they are interacting with a human.\n\n4. **Facilitation of Dangerous Knowledge**: LLMs could assist individuals without formal training in analyzing or synthesizing information related to chemical and biological threats.\n\n5. **Generation of Hateful Content**: LLMs can produce content that glorifies violence or promotes radicalization.\n\n6. **Trust Issues**: The potential for users to be misled by confabulated logic or citations can undermine trust i