# Introduction
This notebook builds a searchable vector store from two documents, [Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People](https://www.whitehouse.gov/wp-content/uploads/2022/10/Blueprint-for-an-AI-Bill-of-Rights.pdf) and the [National Institute of Standards and Technology (NIST) Artificial Intelligence Risk Management Framework](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf), to help allay the concerns of people who are anxious about the state of AI.

We'll start with a set of imports to get ready for indexing the documents.

In [48]:
import tiktoken
import os
from qdrant_client import QdrantClient
from langchain_qdrant import QdrantVectorStore
from qdrant_client.http.models import Distance, VectorParams
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import Qdrant
from langchain_openai.llms import OpenAI
from langchain_openai.chat_models import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings

In [49]:
pdf_loader = PyMuPDFLoader('Blueprint-for-an-AI-Bill-of-Rights.pdf')
docs_bill_of_rights = pdf_loader.load()

In [50]:
pdf_loader = PyMuPDFLoader('NIST.AI.600-1.pdf')
docs_nist = pdf_loader.load()

In [51]:
documents = docs_bill_of_rights + docs_nist

In [52]:
len(documents)

137

## Text splitting
Next we split these documents into chunks for easy retrieval from a vector store. Without knowing much about these documents, a `RecursiveCharacterTextSplitter` is the most obvious choice: the LangChain documentation recommends this strategy when the data is mostly unstructured (which these PDF documents are) and no additional structure is known. We'll use `text-embedding-3-small` as the default OpenAI embedding model. It belongs to OpenAI's highest-performing family of embedding models, and the `small` variant trades some performance for lower cost.

Since we're using the `text-embedding-3-small` model, we'll also define a length function that counts tokens with that model's tokenizer, so chunk sizes are measured in tokens rather than characters.
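For intuition about what recursive splitting does, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores overlap, and it measures length in characters rather than tokens. The idea is to split on the coarsest separator first and fall back to finer separators only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive splitting (no overlap, character lengths)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they fit in one chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


example_chunks = recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7)
```

Here the first paragraph is too long for one chunk, so it is split again on spaces, while the second paragraph fits and stays whole.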

In the future, we could test the experimental `SemanticChunker`, which combines adjacent sentences when they are semantically similar. That is a riskier choice, so we'll stick with the default for now and may use MDD later to determine whether `SemanticChunker` performs better.

In [53]:
def tiktoken_len(text, model='text-embedding-3-small'):
    """Return the number of tokens in `text` under the tokenizer for `model`."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=tiktoken_len
)
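To see what `chunk_size=500` and `chunk_overlap=200` imply, the sketch below slides a fixed window over a list of token ids. This is an idealized picture, not what the splitter does internally: the real splitter cuts at separator boundaries, so chunks are usually shorter than 500 tokens and the overlap is at most 200. The helper name `sliding_windows` is hypothetical.

```python
def sliding_windows(tokens, chunk_size=500, chunk_overlap=200):
    """Cut `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # new tokens introduced by each window
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += stride
    return windows


fake_tokens = list(range(1200))          # stand-in for real token ids
windows = sliding_windows(fake_tokens)
shared = set(windows[0]) & set(windows[1])
```

With these settings each window advances by 300 tokens, so a 200-token overlap means roughly 40% of every chunk repeats text from its neighbour. That costs extra embedding calls but reduces the chance that a passage is cut off from its context.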

In [54]:
split_documents = text_splitter.split_documents(documents)

In [55]:
len(split_documents)

279

## Build Vectorstore from Embeddings
Next we take these split documents and build a vector store using `Qdrant`, a high-performance and flexible vector database. We'll continue to use `text-embedding-3-small` as the embedding function to store documents.

In [31]:
from dotenv import load_dotenv; load_dotenv()

True

In [32]:
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')

In [None]:
client = QdrantClient(
    url=os.environ.get('QDRANT_DB'),
    api_key=os.environ.get('QDRANT_API_KEY'),
)

client.create_collection(
    collection_name="ai_ethics_te3_small",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
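The collection above uses `Distance.COSINE` with a vector size of 1536, which matches the default output dimension of `text-embedding-3-small`. As a rough illustration of what cosine scoring measures (this is not Qdrant's implementation, and `cosine_similarity` is a hypothetical helper), the sketch below compares two plain Python vectors by the angle between them, ignoring their magnitudes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b` (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The dimension check mirrors why the `size` in `VectorParams` has to agree with the embedding model: vectors of a different length cannot be compared against the stored ones.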

In [44]:
vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_ethics_te3_small",
    embedding=embeddings,
)

In [47]:
def add_documents(store, documents, batch_size=10):
    # Embed and upload the documents in batches of `batch_size`.
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        store.add_documents(
            documents=batch,
        )

In [None]:
add_documents(vector_store, split_documents)
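The batching pattern used for the upload can be checked in isolation. The sketch below is a generic generator (the name `batched` is a hypothetical helper, not part of LangChain or Qdrant) that yields consecutive slices of at most `batch_size` items, which is the same slicing the upload loop relies on.

```python
def batched(items, batch_size=10):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batch_sizes = [len(batch) for batch in batched(list(range(25)), batch_size=10)]
```

Every item lands in exactly one batch, and only the last batch can be shorter than `batch_size`.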

### Also add a store for text-embedding-3-large

In [57]:
from functools import partial
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=partial(tiktoken_len, model='text-embedding-3-large')
)

In [58]:
split_documents = text_splitter.split_documents(documents)

In [59]:
client.create_collection(
    collection_name="ai_ethics_te3_large",
    vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
)

True

In [60]:
embeddings = OpenAIEmbeddings(model='text-embedding-3-large')
vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_ethics_te3_large",
    embedding=embeddings,
)

In [61]:
add_documents(vector_store, split_documents)