# VectorDB Augmented with Leapfrog ai


In [1]:
# Helper function for printing docs

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

## Using a vanilla vector store retriever
Let's start by initializing a simple vector store retriever and storing the 2023 State of the Union speech (in chunks). We can see that given an example question our retriever returns one or two relevant docs and a few irrelevant docs. And even the relevant docs have a lot of irrelevant information in them.

In [8]:


import openai
# Put anything you want in `API key`
openai.api_key = 'Free the models'

# Point to leapfrogai
openai.api_base = "https://leapfrogai.tom.bigbang.dev"
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key="foobar",document_model_name="text-embedding-ada-002")

print(embeddings)

response = openai.Completion.create(
  model="llama-7B-4b",
  prompt="The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.\n\nHuman: Hello, who are you?\nAI: I am an AI created by OpenAI. How can I help you today?\nHuman: I'd like to cancel my subscription.\nAI:",
  temperature=0.9,
  max_tokens=150,
  top_p=1,
  frequency_penalty=0.0,
  presence_penalty=0.6,
  stop=[" Human:", " AI:"]
)

client=<class 'openai.api_resources.embedding.Embedding'> model='text-embedding-ada-002' document_model_name='text-embedding-ada-002' query_model_name='text-embedding-ada-002' embedding_ctx_length=8191 openai_api_key='foobar' openai_organization=None allowed_special=set() disallowed_special='all' chunk_size=1000 max_retries=6


In [4]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
from langchain.vectorstores import FAISS

documents = TextLoader('state_of_the_union.txt').load()

text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
contents = [d.page_content for d in texts]
metadatas = [d.metadata for d in texts]


# embeds = [ openai.Embedding.create(input=content,model="text-embedding-ada-002") for content in contents ]

print(len(metadatas))


183


In [5]:
from langchain.vectorstores import Chroma

persist_directory = 'db'

vectordb = Chroma(embedding_function=embeddings)
_ = vectordb.add_texts(texts=contents, metadatas=metadatas)
# texts = [doc.page_content for doc in documents]
#         metadatas = [doc.metadata for doc in documents]
#         return cls.from_texts(
#             texts=texts,
#             embedding=embedding,
#             metadatas=metadatas,
#             ids=ids,
#             collection_name=collection_name,
#             persist_directory=persist_directory,
#             client_settings=client_settings,
#         )

# vectordb = Chroma.from_documents(documents=documents, embedding=embeddings,persist_directory=persist_directory)
# vectordb = Chroma.from_texts(documents=documents, texts=texts, embedding=embeddings, persist_directory=persist_directory)
   

Using embedded DuckDB without persistence: data will be transient


In [6]:
# add some nist documents
import os
folder_path = "../../wget"  # replace with the path to your folder
from langchain.document_loaders import PyPDFLoader

pdf_files = []
pdf_text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=200)

for root, dirs, files in os.walk(folder_path):
    for file_name in files:
        if file_name.endswith('.pdf'):
            pdf_files.append(os.path.join(root, file_name))


for file_name in pdf_files:
    try:
        print(file_name)
        loader = PyPDFLoader(file_path=file_name)
        pages = loader.load_and_split()
        splits = pdf_text_splitter.split_documents(pages)
        contents = [d.page_content for d in splits]
        metadatas = [d.metadata for d in splits]
        _ = vectordb.add_texts(texts=contents, metadatas=metadatas)
    except:
        print("An error")
    finally:
        print("Done!")

../../wget/istio.io/latest/talks/istio_talk_gluecon_2017.pdf
Done!
../../wget/istio.io/latest/blog/2023/ada-logics-security-assessment/Istio audit report - ADA Logics - 2023-01-30 - v1.0.pdf
Done!
../../wget/istio.io/latest/blog/2021/ncc-security-assessment/NCC_Group_Google_GOIST2005_Report_2020-08-06_v1.1.pdf
Done!
../../wget/www.fedramp.gov/assets/resources/training/200-D-FedRAMP-Training-Continuous-Monitoring-ConMon-Overview.pdf
Done!
../../wget/www.fedramp.gov/assets/resources/training/200-A-FedRAMP-Training-FedRAMP-System-Security-Plan-SSP-Required-Documents.pdf
Done!
../../wget/www.fedramp.gov/assets/resources/training/200-C-FedRAMP-Training-Security-Assessment-Report-SAR.pdf
Done!
../../wget/www.fedramp.gov/assets/resources/training/200-B-FedRAMP-Training-Security-Assessment-Plan-SAP.pdf
Done!
../../wget/www.fedramp.gov/assets/resources/training/201-B-FedRAMP-Training-How-to-Write-a-Control.pdf
Done!
../../wget/www.fedramp.gov/assets/resources/templates/FedRAMP-Agency-Authorizat

Created a chunk of size 841, which is longer than the specified 400


Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/8144/draft/documents/nistir8144_draft.pdf
Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/7275/rev-4/final/documents/nistir-7275r4_updated-march-2012_clean.pdf
Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/7275/rev-4/final/documents/nistir-7275r4_updated-march-2012_markup.pdf
Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/8214b/draft/documents/NISTIR-8214B-ipd-public-feedback.pdf
Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/8085/draft/documents/nistir_8085_draft.pdf
Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/8193/draft/documents/nistir8193-draft.pdf
Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/8105/final/documents/nistir-8105-public-comments-mar2016.pdf
Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/7977/final/documents/nistir_7977_second_draft.pdf
Done!
../../wget/csrc.nist.gov/CSRC/media/Publications/nistir/7977

In [7]:

query = "What did the president say about Ketanji Brown Jackson"
results = vectordb.similarity_search_with_score(query=query)
for r in results:
    print(f"Score: { r[1] }")
    print("Contents:")
    print(f"{r[0].page_content}")
    print("------------------")
    




Score: 1.6240897178649902
Contents:
Resolution : Used the suggested sentence to replace the commented sentence. (Note KI 
is changed  to KIN in the final published SP 800 -108 Rev.1)  
3) When  presenting the pseudocode for  the various KDFs, the description of what 
is output  currently has this simple form:   Output : KO. 
What "should" be there (to accommodate the cases where an error occurs) is something like 
this:   
Output : KO (or an error indicator).  
“Omitting the parenthetical remark presents a problem in following the pseudocode  in the cases 
where an error is encountered, because no  KO value is  actually computed.”  
Resolution : Correc ted. (Note KO is changed  to KOUT in the final published SP 800 -108 
Rev.1)  
Robert Moskowitz  
To the SP800 -108r1 group and particularly to Lily Chen,  
 
Thank you for providing an update to 108, particularly including current standards with the Keccak 
function:   KMAC from SP800 -185.  
 This seems to be the next step to an update

In [71]:
# Example from here: https://github.com/mayooear/gpt4-pdf-chatbot-langchain/blob/f1ee996782d107a9a5e4123199a57e008151c2de/utils/makechain.ts#L5
prompt = "You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.\n"
prompt += "If you don't know the answer, just say you don't know. DO NOT try to make up an answer.\n"
prompt += "If the question is not related to the context, politely respond that you are tuned to only answer questions that are related to the context.\n"
prompt +="\n{ context }\n"
prompt +="Question: { question }\n"
prompt += "Helpful answer:"


In [74]:
def buildPrompt(context, query):
    prompt = f"You are a helpful AI assistant. Use the following pieces of context to answer the question at the end."
    prompt += f"If you don't know the answer, just say you don't know. DO NOT try to make up an answer."
    prompt += f"If the question is not related to the context, politely respond that you are tuned to only answer questions that are related to the context."
    prompt += f"\n { context } "
    prompt += f"Question: { query }"
    prompt += f"Helpful answer in markdown:"
    return prompt

In [83]:
def ask(question, k=1):
    related_documents = vectordb.similarity_search_with_score(query=question, k=k)
    # prompt = f"###Instruction:\n"
    # prompt += f"{ query }"
    # prompt = f"###Instruction:\nYou are an assistant that is helpful, creative, clever, and very friendly. The AI Assistant has access to the following data: { related_documents[0][0].page_content}]\n\n### Input:\n{ question }\n### Response:\n"
    context = ""
    for r in related_documents:
        context += r[0].page_content + "\n"
    prompt = buildPrompt(context, query=question)
    response = openai.Completion.create(
        model="llama-7B-4b",
        prompt=prompt,
        temperature=0.9,
        max_tokens=150,
        top_p=1,
        frequency_penalty=0.0,
        presence_penalty=0.6,
        stop=[" Human:", " AI:"],
        timeout=300
        )
    # print(response)
    answer = response.choices[0].text
    # answer = answer.replace(prompt,"",1)[1:]
    answer = answer[len(prompt):]
    # for r in related_documents:
    #     print(r[0].metadata)
    print(answer)

def askMany(question):
    ask(question=question,k=5)

In [89]:
ask("What did the president say about Ketanji Brown Jackson?")

: 

The president said that Ketanji Brown Jackson is one of our nation's top legal minds and will continue Justice Breyer's legacy of excellence.


In [88]:
ask("What is the biggest crisis the United States is facing?")

: 

The biggest crisis the United States is facing is the rising cost of living.


In [87]:
ask("Who is the president?")

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 



In [90]:
ask("What is a FedRAMP boundary?")

:
A FedRAMP boundary is the boundary between a cloud service provider (CSP) and a federal agency. It defines the roles and responsibilities of the CSP and the federal agency in the FedRAMP process.


In [84]:
askMany("What is a FedRAMP boundary?")

 A FedRAMP boundary is the boundary of a FedRAMP-certified system. It is the boundary of the system that is protected by the FedRAMP certification. It is the boundary of the system that is protected by the FedRAMP certification. It is the boundary of the system that is protected by the FedRAMP certification. It is the boundary of the system that is protected by the FedRAMP certification. It is the boundary of the system that is protected by the FedRAMP certification. It is the boundary of the system that is protected by the FedRAMP certification. It is the boundary of the system that is protected by the FedRAMP certification. It is the boundary of the


In [86]:
ask("Can you create a for loop in bash?")

:
# Create a for loop in bash
for i in 1 2 3 4 5 6 7 8 9 10
echo $i
# Output: 1 2 3 4 5 6 7 8 9 10


In [85]:
askMany("How can istio help create a FedRAMP boundary?")

an help create a FedRAMP boundary by providing a secure and scalable microservices architecture. Istio's security features include authentication, authorization, encryption, and access control. Additionally, Istio's observability and telemetry features can be used to monitor the health and performance of the microservices.
