# Introduction
This notebook builds a searchable vectorstore from two documents, [Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People](https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf) and [National Institute of Standards and Technology (NIST) Artificial Intelligence Risk Management Framework](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf), to help allay the concerns of people who are anxious about the state of AI.

We'll start with a set of imports to get ready for indexing documents.

In [1]:
import tiktoken
import os
from qdrant_client import QdrantClient
from langchain_qdrant import QdrantVectorStore
from qdrant_client.http.models import Distance, VectorParams
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader, UnstructuredHTMLLoader
# OpenAIEmbeddings now lives in langchain_openai; the langchain.embeddings path is deprecated.
from langchain_openai import OpenAIEmbeddings

In [2]:
from dotenv import load_dotenv; _ = load_dotenv()

In [3]:
pdf_loader = PyMuPDFLoader('Blueprint-for-an-AI-Bill-of-Rights.pdf')
docs_bill_of_rights = pdf_loader.load()

In [4]:
pdf_loader = PyMuPDFLoader('NIST.AI.600-1.pdf')
docs_nist = pdf_loader.load()

In [51]:
documents = docs_bill_of_rights + docs_nist

In [52]:
len(documents)

137

## Text splitting
Next we split these documents into chunks for retrieval from a vectorstore. Without knowing much about these documents, a `RecursiveCharacterTextSplitter` is the most obvious choice: the LangChain website recommends this strategy when the data is mostly unstructured (as these PDF documents are) and no additional structure is known. We'll use `text-embedding-3-small` as the default OpenAI embedding model. It belongs to OpenAI's highest-performing family of embedding models, and the `small` variant trades some performance for lower cost.

Since we're using the `text-embedding-3-small` model, we'll also define a length function that measures text in tokens for this model, so that chunk sizes are counted in tokens rather than characters.

In the future, we could test the experimental `SemanticChunker`, which combines sentences that are semantically similar. This is a riskier choice, so we'll stick with the default for now and may use MDD later to determine whether `SemanticChunker` is better. We can also try the `text-embedding-3-large` model with this strategy.

In [7]:
def tiktoken_len(text, model='text-embedding-3-small'):
    # Count tokens using the tokenizer that matches the embedding model.
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=tiktoken_len
)
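
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size windowing with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter recursively splits on separators before merging pieces); `window_chunks` is a hypothetical helper, and placeholder strings stand in for model tokens.

```python
def window_chunks(tokens, chunk_size=500, chunk_overlap=200):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Illustrative input: 1000 placeholder "tokens".
words = [f"w{i}" for i in range(1000)]
chunks = window_chunks(words, chunk_size=500, chunk_overlap=200)
print([len(c) for c in chunks])
```

With these settings each window advances by 300 tokens, so neighbouring chunks repeat 200 tokens of context.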

In [54]:
split_documents = text_splitter.split_documents(documents)

In [55]:
len(split_documents)

279

## Build Vectorstore from Embeddings
Next we take these split documents and build a vectorstore using `Qdrant`, a high-performance and flexible vectorstore. We'll continue to use `text-embedding-3-small` as the embedding function to store documents.

In [31]:
from dotenv import load_dotenv; load_dotenv()

True

In [32]:
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

In [None]:
client = QdrantClient(
    url=os.environ.get('QDRANT_DB'),
    api_key=os.environ.get('QDRANT_API_KEY'),
)

client.create_collection(
    collection_name="ai_ethics_te3_small",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
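
`Distance.COSINE` asks Qdrant to compare vectors by cosine similarity. As a rough illustration of that measure, here is a pure-Python sketch; `cosine_similarity` is an illustrative helper rather than part of the Qdrant client, and the toy vectors have 3 dimensions instead of 1536.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Orthogonal vectors score 0; vectors pointing the same way score 1.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
```

The length check mirrors why `VectorParams(size=...)` has to match the embedding model's output dimensionality.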

In [44]:
vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_ethics_te3_small",
    embedding=embeddings,
)

In [17]:
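def add_documents(store, documents):
    # Upload in batches of 10 to keep each request small.
    for i in range(0, len(documents), 10):
        batch = documents[i:i+10]
        # Use the store passed in, not the global vector_store.
        store.add_documents(
            documents=batch,
        )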

In [None]:
add_documents(vector_store, split_documents)
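
The slicing pattern used for batching can be isolated as a small generator. The sketch below uses a hypothetical `batched` helper and plain integers in place of documents, with 279 items to match the chunk count above.

```python
def batched(items, batch_size=10):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# 279 stand-in items, mirroring the 279 chunks produced earlier.
sizes = [len(batch) for batch in batched(list(range(279)), batch_size=10)]
print(len(sizes), sizes[-1])
```

The last batch is simply shorter than the others, so no padding or special casing is needed.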

## Also add a store for text-embedding-3-large
Based on the evaluation in the notebook `Test Data and RAGAS Evaluation.ipynb`, a split and indexing strategy based on the `text-embedding-3-large` model performs slightly better on some key metrics than one based on the `text-embedding-3-small` model. As such, we'll also create a vectorstore based on this embedding.

`SemanticChunker` didn't appear to make much of a difference, so we don't use it here.

In [57]:
from functools import partial
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=partial(tiktoken_len, model='text-embedding-3-large')
)
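
`functools.partial` is used here because the splitter calls `length_function` with the text only, so the `model` argument has to be bound in advance. A toy sketch of that binding, where `count_words` is a hypothetical stand-in for `tiktoken_len` and the model names are placeholders:

```python
from functools import partial


def count_words(text, model="model-a"):
    """Toy stand-in for tiktoken_len: reports which model was requested
    alongside a whitespace word count."""
    return (model, len(text.split()))


# Bind model once; the result can be called with the text alone.
count_for_b = partial(count_words, model="model-b")
print(count_for_b("split this text"))
```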

In [58]:
split_documents = text_splitter.split_documents(documents)

In [59]:
client.create_collection(
    collection_name="ai_ethics_te3_large",
    vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
)

True

In [60]:
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')
vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_ethics_te3_large",
    embedding=embeddings,
)

In [61]:
add_documents(vector_store, split_documents)

# Updating Vectorstore with Policy Updates

Since these documents were published, further documents have updated the Government's policy stance on AI systems. This part of the notebook updates the vectorstore with the executive order on [Safe, Secure and Trustworthy AI](https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/) as well as the [270 day update](https://www.whitehouse.gov/briefing-room/statements-releases/2024/07/26/fact-sheet-biden-harris-administration-announces-new-ai-actions-and-receives-additional-major-voluntary-commitment-on-ai/) on the same Executive Order.

Any new policy documents can be ingested into our vectorstore(s) in the same way.
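
The ingestion steps repeated throughout this notebook (load, split, store in batches) can be summarised as one function. This is only a sketch: `ingest` and its callable parameters are hypothetical names, and the toy usage substitutes strings and lambdas for the real loaders, splitter, and vectorstore.

```python
def ingest(sources, load, split, store_batch, batch_size=10):
    """Load each source, split it into chunks, and store the chunks in batches.

    Returns the total number of chunks stored.
    """
    chunks = []
    for source in sources:
        chunks.extend(split(load(source)))
    for i in range(0, len(chunks), batch_size):
        store_batch(chunks[i:i + batch_size])
    return len(chunks)


# Toy usage: the "documents" are strings and the "store" is a list of batches.
stored = []
n = ingest(
    sources=["alpha beta gamma", "delta epsilon"],
    load=lambda s: s,
    split=lambda text: text.split(),
    store_batch=stored.append,
    batch_size=2,
)
print(n, stored)
```

Passing the steps in as callables keeps the same flow usable for PDF or HTML sources and for any of the collections.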

In [2]:
!pip install -qU unstructured

In [4]:
import tiktoken
import os
from qdrant_client import QdrantClient
from langchain_qdrant import QdrantVectorStore
from qdrant_client.http.models import Distance, VectorParams
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredHTMLLoader
# OpenAIEmbeddings now lives in langchain_openai; the langchain.embeddings path is deprecated.
from langchain_openai import OpenAIEmbeddings

In [5]:
from dotenv import load_dotenv; load_dotenv()

True

In [5]:
eo_link = 'https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence'
eo_update_link = 'https://www.whitehouse.gov/briefing-room/statements-releases/2024/07/26/fact-sheet-biden-harris-administration-announces-new-ai-actions-and-receives-additional-major-voluntary-commitment-on-ai'

In [6]:
import requests
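def download_html(link, path=None):
    # Default to the last URL segment as the file name, with an .html suffix.
    if not path:
        path = link.rstrip('/').split('/')[-1]
        if not path.endswith('.html'):
            path += '.html'

    # Stream the response to disk and fail early on HTTP errors.
    response = requests.get(link, stream=True)
    response.raise_for_status()
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)
    return path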
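
Deriving a file name from a URL has a few edge cases (trailing slashes, query strings, a bare domain). A standard-library sketch of one way to handle them is shown below; `filename_from_url` and the `index` fallback are assumptions for illustration, not part of the notebook's pipeline.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_url(link, suffix=".html"):
    """Use the last path segment of the URL as a file name, adding suffix if missing."""
    path = urlparse(link).path.rstrip("/")    # drops query string and trailing slash
    name = PurePosixPath(path).name or "index"  # fall back when the path is empty
    return name if name.endswith(suffix) else name + suffix


print(filename_from_url("https://example.com/a/b/page/"))
print(filename_from_url("https://example.com/x.html?y=1"))
print(filename_from_url("https://example.com/"))
```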

## Update Vectorstore of `text-embedding-3-small` embeddings
First, update the vectorstore indexed with the `text-embedding-3-small` model.

In [16]:
eo = download_html(eo_link)
eo_documents = UnstructuredHTMLLoader(eo).load()

In [18]:
eo_update = download_html(eo_update_link)
eo_update_documents = UnstructuredHTMLLoader(eo_update).load()

In [19]:
all_eo_documents = eo_documents + eo_update_documents

In [20]:
from functools import partial
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=partial(tiktoken_len, model='text-embedding-3-small')
)
eo_split_documents = text_splitter.split_documents(all_eo_documents)

In [22]:
embedding = OpenAIEmbeddings(model='text-embedding-3-small')
client = QdrantClient(
    url=os.environ.get('QDRANT_DB'),
    api_key=os.environ.get('QDRANT_API_KEY'),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_ethics_te3_small",
    embedding=embedding,
)


In [24]:
add_documents(vector_store, eo_split_documents)

## Update Vectorstore of `text-embedding-3-large` embeddings
Also update the vectorstore indexed with the `text-embedding-3-large` model.

In [25]:
from functools import partial
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=partial(tiktoken_len, model='text-embedding-3-large')
)
eo_split_documents = text_splitter.split_documents(all_eo_documents)

In [26]:
embedding = OpenAIEmbeddings(model='text-embedding-3-large')
client = QdrantClient(
    url=os.environ.get('QDRANT_DB'),
    api_key=os.environ.get('QDRANT_API_KEY'),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_ethics_te3_large",
    embedding=embedding,
)

In [27]:
add_documents(vector_store, eo_split_documents)

# Add a vectorstore for `nomic-embed-text-v1` finetuned model
The notebook `Fine_Tuning_nomic_embed_text_v1_on_AI_Ethics_Docs.ipynb` further finetunes a [nomic-ai/nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) model. This finetuned model outperforms the default models on answer correctness, so we finally create a vectorstore indexed with our finetuned embedding model and containing all the documents above.

In [8]:
pdf_loader = PyMuPDFLoader('Blueprint-for-an-AI-Bill-of-Rights.pdf')
docs_bill_of_rights = pdf_loader.load()
pdf_loader = PyMuPDFLoader('NIST.AI.600-1.pdf')
docs_nist = pdf_loader.load()

In [7]:
eo_documents = UnstructuredHTMLLoader('executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence.html').load()
eo_update_documents = UnstructuredHTMLLoader('fact-sheet-biden-harris-administration-announces-new-ai-actions-and-receives-additional-major-voluntary-commitment-on-ai.html').load()

In [8]:
documents = docs_bill_of_rights + docs_nist + eo_documents + eo_update_documents

In [9]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer
embedding = HuggingFaceEmbeddings(model_name="deman539/nomic-embed-text-v1", model_kwargs={'trust_remote_code': True})
tokenizer = AutoTokenizer.from_pretrained("deman539/nomic-embed-text-v1")

def nomic_len_function(text):
    # Count tokens with the finetuned model's own tokenizer.
    inputs = tokenizer(text)
    return len(inputs.input_ids)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=nomic_len_function
)

<All keys matched successfully>


In [14]:
split_documents = text_splitter.split_documents(documents)

In [19]:
client = QdrantClient(
    url=os.environ.get('QDRANT_DB'),
    api_key=os.environ.get('QDRANT_API_KEY'),
)

client.create_collection(
    collection_name="ai_ethics_nomicv1_finetuned",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

True
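
Each collection in this notebook is created with a vector size that matches its embedding model. The sketch below gathers those three sizes in one place and checks a vector against them before upload; `COLLECTION_DIMS` and `check_vector` are hypothetical helpers, not part of the Qdrant client.

```python
# Vector sizes used when creating each collection in this notebook.
COLLECTION_DIMS = {
    "ai_ethics_te3_small": 1536,
    "ai_ethics_te3_large": 3072,
    "ai_ethics_nomicv1_finetuned": 768,
}


def check_vector(collection, vector):
    """Raise ValueError if the vector length does not match the collection's size."""
    expected = COLLECTION_DIMS[collection]
    if len(vector) != expected:
        raise ValueError(
            f"{collection} expects {expected} dimensions, got {len(vector)}"
        )
    return True


print(check_vector("ai_ethics_nomicv1_finetuned", [0.0] * 768))
```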

In [20]:
vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_ethics_nomicv1_finetuned",
    embedding=embedding,
)

In [21]:
add_documents(vector_store, split_documents)