# Midterm Notebook

## Dependencies

In [1]:
!pip install -qU langchain langchain-core langchain-community langchain-openai

In [2]:
!pip install -qU qdrant-client

In [3]:
!pip install -qU tiktoken pymupdf

## Environment Variables

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Chunking

We'll use the `RecursiveCharacterTextSplitter` to create our toy example.

It will split based on the following rules:

- Each chunk has a maximum size of 300 tokens (counted with `tiktoken`), with no overlap between chunks
- It first tries to split on `\n\n`, then on `\n`, then on the `<SPACE>` character, and finally on individual characters.

Let's implement it and see the results!

In [10]:
from langchain_community.document_loaders import PyMuPDFLoader

doc1 = PyMuPDFLoader("Blueprint-for-an-AI-Bill-of-Rights.pdf").load()
doc2 = PyMuPDFLoader("NIST.AI.600-1.pdf").load()

In [11]:
import tiktoken
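from langchain_text_splitters import RecursiveCharacterTextSplitter

# Build the encoder once instead of looking it up on every call.
encoding = tiktoken.encoding_for_model("gpt-4o")

def tiktoken_len(text):
    """Return the number of gpt-4o tokens in `text`."""
    return len(encoding.encode(text))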

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 0,
    length_function = tiktoken_len,
)

In [12]:
split_chunks1 = text_splitter.split_documents(doc1)
len(split_chunks1)

196

In [13]:
split_chunks2 = text_splitter.split_documents(doc2)
len(split_chunks2)

167
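
To make the splitting rules above concrete, here is a minimal pure-Python sketch of the "try coarse separators first, then finer ones" idea. It is not LangChain's implementation: `count_tokens` is a stand-in that counts whitespace-separated words rather than `tiktoken` tokens, `recursive_split` is a hypothetical helper, and the merging behavior is simplified.

```python
def count_tokens(text):
    # Stand-in length function (assumption): words, not real tiktoken tokens.
    return len(text.split())

def recursive_split(text, max_len, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_len units, preferring coarse separators."""
    if count_tokens(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut on words.
        words = text.split()
        return [" ".join(words[i:i + max_len]) for i in range(0, len(words), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if count_tokens(candidate) <= max_len:
            current = candidate  # keep merging small neighbours together
        else:
            if current:
                chunks.append(current)
                current = ""
            if count_tokens(part) <= max_len:
                current = part
            else:
                # This part alone is too large: retry it with finer separators.
                chunks.extend(recursive_split(part, max_len, rest))
    if current:
        chunks.append(current)
    return chunks

toy_text = "a b c\n\nd e f g h\n\ni"
toy_chunks = recursive_split(toy_text, max_len=3)
print(toy_chunks)
```

The middle paragraph is too long for the limit, so it is the only one that falls through to the space separator; the short paragraphs survive intact.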

## Qdrant Vector Store for Embeddings

In [15]:
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

qdrant_vectorstore = Qdrant.from_documents(
    documents=split_chunks1 + split_chunks2,
    embedding=embedding_model,
    location=":memory:"
)

In [16]:
qdrant_retriever = qdrant_vectorstore.as_retriever()

In [18]:
query = "What kind of protections should AI systems provide?"
query_vector = embedding_model.embed_query(query)
print(f"Vector of Size: {len(query_vector)}")

Vector of Size: 1536
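
Conceptually, the retriever embeds the query like this and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that ranking with 3-dimensional toy vectors instead of 1536-dimensional embeddings. It assumes cosine similarity as the distance measure and `k=4` as the default number of results, and `cosine_similarity` and `top_k` are hypothetical helpers, not Qdrant or LangChain functions.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=4):
    # Score every stored vector against the query and keep the k best indices.
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings" (made-up values for illustration only).
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

ranked = top_k(toy_query, toy_docs, k=2)
print(ranked)
```

A real vector store uses an index rather than scoring every vector, but the ranking criterion is the same idea.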


## RAG Chain with LCEL

### General-Purpose RAG Base Prompt

In [19]:
from langchain_core.prompts import ChatPromptTemplate

base_rag_prompt_template = """\
Use the provided context to answer the provided user question. Only use the provided context to answer the question. If you do not know the answer, respond with "I don't know".

Context:
{context}

Question:
{question}
"""

base_rag_prompt = ChatPromptTemplate.from_template(base_rag_prompt_template)
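
As a rough illustration of what happens when the template is filled, the sketch below uses plain `str.format` on a shortened stand-in template. `toy_template` and `format_docs` are hypothetical names for illustration. The notebook's chain passes the retrieved documents directly as `context`, whereas this sketch assumes you might join only their text first.

```python
# Shortened stand-in for the notebook's template (assumption, not the full prompt).
toy_template = "Context:\n{context}\n\nQuestion:\n{question}\n"

def format_docs(page_contents):
    # Hypothetical helper: join chunk texts with a blank line between them.
    return "\n\n".join(page_contents)

filled = toy_template.format(
    context=format_docs(["Chunk one text.", "Chunk two text."]),
    question="What kind of protections should AI systems provide?",
)
print(filled)
```

Joining only the chunk text keeps metadata such as file paths out of the prompt, which may or may not be desirable depending on whether the model should cite sources.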

### Base LLM

In [20]:
from langchain_openai.chat_models import ChatOpenAI

base_llm = ChatOpenAI(model="gpt-4o-mini", tags=["base_llm"])

### Simple RAG Chain Definition

In [21]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "context"  : populated by passing the value of the "question" key to qdrant_retriever
    # "question" : populated by getting the value of the "question" key
    {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes the previous dict through and re-sets "context" to its
    # existing value, so this step leaves the dict effectively unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values format the prompt, which is piped
    #              into the LLM; the result is stored under the "response" key
    # "context"  : carried over from the previous step so the retrieved chunks are returned too
    | {"response": base_rag_prompt | base_llm, "context": itemgetter("context")}
)
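
To see the data flow without calling any model, here is a minimal sketch that mirrors the three steps of the chain with plain Python functions. `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical stand-ins, not LangChain objects. Only the shape of the dictionaries is meant to match.

```python
def fake_retriever(question):
    # Stand-in for qdrant_retriever: returns a list of "chunks".
    return [f"chunk relevant to: {question}"]

def fake_llm(prompt_text):
    # Stand-in for base_llm: reports how much prompt text it received.
    return f"answer based on {len(prompt_text)} prompt characters"

def run_chain(inputs):
    # Step 1: build "context" from the question and keep "question" as-is.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: pass the dict through, re-assigning "context" to itself.
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and return context alongside it.
    prompt_text = f"Context:\n{state['context']}\n\nQuestion:\n{state['question']}\n"
    return {"response": fake_llm(prompt_text), "context": state["context"]}

result = run_chain({"question": "Should people know when an AI system is being used?"})
print(sorted(result))
```

Returning `context` next to `response` is what makes it possible to inspect the retrieved chunks after each query.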

In [22]:
print(retrieval_augmented_qa_chain.get_graph().draw_ascii())

          +---------------------------------+      
          | Parallel<context,question>Input |      
          +---------------------------------+      
                    **            **               
                  **                **             
                **                    **           
         +--------+                     **         
         | Lambda |                      *         
         +--------+                      *         
              *                          *         
              *                          *         
              *                          *         
  +----------------------+          +--------+     
  | VectorStoreRetriever |          | Lambda |     
  +----------------------+          +--------+     
                    **            **               
                      **        **                 
                        **    **                   
          +----------------------------------+     
          | 

### Sample Queries

In [23]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What kind of protections should AI systems provide?"})
response["response"]

In [25]:
response = retrieval_augmented_qa_chain.invoke({"question" : "Should people know when an AI system is being used?"})
response["response"].content

### Dig into the Context

In [28]:
for context in response["context"]:
    print("Context:")
    print(context)
    print("----")

Context:
page_content='should be kept up-to-date and people impacted by the system should be notified of significant use case or key 
functionality changes. You should know how and why an outcome impacting you was determined by an 
automated system, including when the automated system is not the sole input determining the outcome. 
Automated systems should provide explanations that are technically valid, meaningful and useful to you and to 
any operators or others who need to understand the system, and calibrated to the level of risk based on the 
context. Reporting that includes summary information about these automated systems in plain language and 
assessments of the clarity and quality of the notice and explanations should be made public whenever possible. 
6' metadata={'source': 'Blueprint-for-an-AI-Bill-of-Rights.pdf', 'file_path': 'Blueprint-for-an-AI-Bill-of-Rights.pdf', 'page': 5, 'total_pages': 73, 'format': 'PDF 1.6', 'title': 'Blueprint for an AI Bill of Rights', 'author': 