# <font color=red>LangChain:  Vector DBs</font>
- https://docs.langchain.com/docs

## What Does LangChain Provide?
+ Models
  + embedding
  + LLM (e.g. OpenAI)
+ Prompts
  + prompt templates
  + few-shot
  + example-selectors
  + output parsers
+ Chains (a multi-step workflow composed of <em>links</em>)
  + Links (one of: prompt, model, another chain)
<span style="font-family:'Comic Sans MS', cursive, sans-serif;"><font color=orange>
+ Vector Database Access
  + Document Loaders
  + Text Splitting 
</font></span>
+ Memories (to facilitate chatbots or other 'iterative' sorts of apps)
+ Agents (loop over Thought, Act, Observe)
  + Tools
    + math
    + web search
    + custom (user-defined)

<span style="font-family:'Comic Sans MS', cursive, sans-serif;"><font color=orange>
## Vector Database Access
</font></span>
LangChain provides wrappers for many vector DBs.</br>
Perhaps the best-known commercial one is Pinecone, which requires setting up an account at their site.</br>
We will use two free ones here: FAISS and Chroma.
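
To make concrete what a vector DB does at query time, here is a minimal sketch in plain Python. It assumes tiny hand-made 3-dimensional vectors in place of real embeddings, and the helpers `cosine_similarity` and `top_k` are illustrative names, not part of LangChain, FAISS, or Chroma. Real stores add persistence, metadata handling, and usually faster index structures than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical "embeddings": hand-made vectors, for illustration only.
store = [
    ("Gettysburg Address", [0.9, 0.1, 0.0]),
    ("Cuban Missile Crisis", [0.1, 0.9, 0.1]),
    ("Pearl Harbor speech", [0.2, 0.3, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend this is the embedded question

results = top_k(query_vec, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The `k` argument plays the same role as the `k` passed to the retrievers in the demos that follow: it caps how many nearest neighbours come back.
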
<span style="font-family:'Comic Sans MS', cursive, sans-serif;"><font color=orange>
### Document Loaders and Text Splitting
</font></span>
These examples are somewhat longer because they not only demonstrate vector DBs but also use chains to show</br>
the DBs retrieving useful information.
<font color=green>These examples read a set of *.txt and *.pdf files from sub-directories named txt and pdf,</br>
which are supplied with this notebook.</font>
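
Before documents are embedded they are cut into overlapping chunks. The following is a minimal sketch of fixed-size character chunking with overlap; `split_with_overlap` is a hypothetical helper written for this illustration. LangChain's RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and line breaks, which this sketch does not attempt.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into chunks of at most chunk_size characters,
    each sharing chunk_overlap characters with the previous chunk."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.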

##### Demo loading txt files (no pdfs)
This demo uses the popular open-source <font color=green>Chroma</font> vector DB.</br>
We also save (persist) the DB to disk.</br>
We use a RetrievalQA chain to show that the DB can retrieve document content and information.
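
A RetrievalQA chain with chain_type="stuff" places all retrieved chunks into a single prompt. The sketch below shows that idea in plain Python; `build_stuff_prompt` and the prompt wording are assumptions made for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block
    and wrap it, together with the question, in a single prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Stand-ins for the chunks a retriever would return (k=2).
retrieved = [
    "Four score and seven years ago our fathers brought forth on this continent, a new nation,",
    "conceived in Liberty, and dedicated to the proposition that all men are created equal.",
]
prompt = build_stuff_prompt(
    "What did our fathers bring forth on this continent?", retrieved
)
print(prompt)
```

Because every chunk lands in the same prompt, the value of k and the chunk size together decide how much of the model's context window is used.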

In [None]:
!pip install chromadb   ## you may have to do this if not already installed

In [None]:
from langchain.vectorstores import Chroma   ## use chroma vector DB
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader

loader = DirectoryLoader('./', glob="./txt/*.txt", loader_cls=TextLoader)
# loader = TextLoader('./one_file.txt')

documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory="chroma_db")

## first, prove we can obtain the relevant docs
retriever = vectordb.as_retriever()
docs = retriever.get_relevant_documents( "What did Abraham Lincoln say our fathers had brought forth on this continent?" )
print("RELEVANT DOCS")
print(docs)

# k docs to return, default 4
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
print("SEARCH TYPE",retriever.search_type)
print("SEARCH KWARGS",retriever.search_kwargs)

llm = ChatOpenAI(temperature=0.0, model_name='gpt-4')

qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)

# get the sources from the response
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

# full example
query = "What did Abraham Lincoln say our fathers had brought forth on this continent?"
llm_response = qa_chain(query)
print("LLM_RESPONSE")
print(llm_response)
print("PROCESSED OUTPUT")
process_llm_response(llm_response)

#### A simpler Chroma example using higher-level LangChain operations

In [None]:
import sys, os

from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

to_summarize = "./txt/cuban.txt"

# build one TextLoader per .txt file (skip anything else in the directory)
loaders = []
for fn in sorted(os.listdir("./txt")):
    if not fn.endswith(".txt"):
        continue
    loaders.append(TextLoader("./txt/" + fn))
index = VectorstoreIndexCreator().from_loaders(loaders)

print("INDEX_VECTORSTORE",index.vectorstore)
print("AS_RETRIEVER",index.vectorstore.as_retriever())
print()

query = "What did Lincoln say that our fathers had brought forth on this continent?"
result = index.query(query)
print(result,"\n")

query = "What did Lincoln say that our fathers had brought forth on this continent?"
result = index.query_with_sources(query)
# print(result,"\n")
print(result["answer"])
print(result["sources"],"\n")

query = "What happened on December 7, 1941?"  # Dec. is abbreviated in the doc
result = index.query_with_sources(query)
# print(result,"\n")
print(result["answer"])
print(result["sources"],"\n")

result = index.query("Summarize the general content of this document.",
                     retriever_kwargs={"search_kwargs": {"filter": {"source": to_summarize}}})
print(result,"\n")

##### Demo loading pdf files (not txt)
This demo uses the <font color=green>FAISS</font> vector DB.</br> 
We use a plain LLMChain to show that the DB can retrieve document content and information.
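
Because this demo concatenates the retrieved chunks into the prompt, their combined size has to stay under the model's context limit. The sketch below keeps chunks in ranked order until a rough token budget is used up. It assumes about 4 characters per token, which is only a crude heuristic for English text; a real tokenizer would give exact counts. `estimate_tokens` and `fit_to_budget` are hypothetical helpers.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fit_to_budget(chunks, max_tokens):
    """Keep chunks in ranked order until the next one would exceed max_tokens."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used

ranked_chunks = ["a" * 400, "b" * 400, "c" * 400]   # roughly 100 tokens each
kept, used = fit_to_budget(ranked_chunks, max_tokens=250)
print(len(kept), used)
```

In practice the budget should also leave room for the system message, the question, and the model's answer.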

In [None]:
import sys, os, openai, textwrap
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)


pdf_filenames = ["./pdf/"+f for f in os.listdir('./pdf') if f.endswith(".pdf")]

embeddings = OpenAIEmbeddings()
all_pages = []
for pdf_filename in pdf_filenames:
    loader = PyPDFLoader(pdf_filename)
    pages = loader.load_and_split()
    all_pages.extend(pages)
faiss_index = FAISS.from_documents(all_pages, embeddings)

query = "What is a generative agent?"  ## NOTE: voyager_gpt4 pdf paper answers this

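# gpt-4 can handle up to 8192 tokens, so the k retrieved chunks plus the prompt must fit in that budget.
# NOTE: load_and_split() above uses its default splitter; pass text_splitter=... to control the chunk size.
docs = faiss_index.similarity_search(query, k=8)  # k=8 here; the default is k=4
docs_page_content = " ".join([d.page_content for d in docs])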

chat = ChatOpenAI(model_name="gpt-4", temperature=0.1)

system_msg_template = """
    You are a helpful assistant that can answer questions about content
    obtained from pdf documents: {docs}
    Only use the factual information from the documents to answer the question.
    If you don't have enough information to answer the question, simply say "I don't know".
    Your answer should be concise but provide sufficient detail to fully answer.
"""
system_msg_prompt = SystemMessagePromptTemplate.from_template(system_msg_template)

###### NOTE: the human_template determines whether you ask which document is relevant to the
######       question, or if you ask for the actual answer to the question.
######       To ask for the document instead, uncomment the first line and comment out the second.
# human_template = "Which document provides the best answer to this question: {question}"
human_template = "Answer the following question: {question}"
human_msg_prompt = HumanMessagePromptTemplate.from_template(human_template)
chat_prompt = ChatPromptTemplate.from_messages( [system_msg_prompt,human_msg_prompt])

chain = LLMChain(llm=chat,prompt=chat_prompt)
response = chain.run(question=query, docs=docs_page_content)  # source info is printed from docs below
response = response.replace("\n", "")
print(f"\nanswer:\n    {textwrap.fill(response, width=70)}")

print("\nsource page info:")
for doc in docs:
    print(f"    {doc.metadata}")

### Contextual Compression With Documents Example 
From the LangChain documentation:</br>
The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.
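
The two halves of that idea, trimming each document and dropping irrelevant ones, can be sketched without an LLM. The toy below uses simple keyword overlap as a stand-in for the LLM-based extraction that LLMChainExtractor performs; `compress_docs` is a hypothetical helper and the sample documents are invented for illustration.

```python
import re

def words(text):
    """Lowercased word set, ignoring very short words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}

def compress_docs(query, docs):
    """Keep only sentences that share a word with the query;
    drop documents in which no sentence qualifies."""
    query_words = words(query)
    compressed = []
    for doc in docs:
        sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
        relevant = [s for s in sentences if words(s) & query_words]
        if relevant:
            compressed.append(" ".join(relevant))
    return compressed

# Invented toy documents standing in for retrieved chunks.
docs = [
    "Generative agents simulate believable human behavior. The paper was printed on A4.",
    "Minecraft is a sandbox game. It was released in 2011.",
]
compressed = compress_docs("What are generative agents?", docs)
print(compressed)
```

The first document is shortened to its one relevant sentence and the second is filtered out entirely, which is the behaviour the quoted description calls "compressing".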

In [None]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    for (i,doc) in enumerate(docs):
        print("-"*70)
        print(f"Document {i+1}:\n")
        print(doc.page_content)


## first, get the relevant docs in the "usual" way

loader = PyPDFLoader("./pdf/voyager_minecraft_gpt4.pdf")
pages = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000,chunk_overlap=0)
texts = text_splitter.split_documents(pages)
embeddings = OpenAIEmbeddings()
retriever = FAISS.from_documents(texts,embeddings).as_retriever()

docs = retriever.get_relevant_documents("What are generative agents?")
pretty_print_docs(docs)

## then, get compressed relevant docs using compression retriever
##   then go ahead and use the llm to answer the question

template = """
You are a useful assistant.
Please answer the question within the given context.
Context: {context}
Question: {question}
"""

prompt = PromptTemplate.from_template(template)

MODEL_NAME = "gpt-3.5-turbo"  # "gpt-4"
llm = ChatOpenAI(model_name=MODEL_NAME,temperature=0.0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=retriever)

question = "What are generative agents?"

compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

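llm_chain = LLMChain(prompt=prompt, llm=llm)
# pass the text of the compressed docs, not the Document objects themselves
context = "\n\n".join(doc.page_content for doc in compressed_docs)
result = llm_chain.predict(question=question, context=context)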
print("-" * 70)
print(f"Question: {question} \nAnswer: {result}")

### Pinecone example code 
This code <font color=red>probably will not run for you</font>.  It depends on having a Pinecone account
set up and usable via their API.</br>
If you do have an account, you should be able to modify the code to work. 
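
Even without an account, the data flow can be followed offline: split the text into word-based passages, embed them in batches, and pair each vector with a string id and its passage as metadata. The sketch below mirrors that flow. `fake_embed` is a placeholder assumed here in place of a real embedding API call, so its vectors carry no meaning, and `chunk_words` and `build_records` are hypothetical helper names.

```python
def chunk_words(text, words_per_chunk):
    """Split text into passages of at most words_per_chunk words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def fake_embed(batch):
    """Placeholder for a real embedding API call (assumption):
    returns one small, meaningless vector per passage."""
    return [[float(len(p)), float(len(p.split()))] for p in batch]

def build_records(passages, batch_size):
    """Embed passages batch by batch and pair each with an id and metadata."""
    vectors = []
    for i in range(0, len(passages), batch_size):
        vectors.extend(fake_embed(passages[i:i + batch_size]))
    ids = [str(i) for i in range(len(passages))]   # ids as strings
    meta = [{"text": p} for p in passages]
    return list(zip(ids, vectors, meta))

text = "Four score and seven years ago our fathers brought forth on this continent a new nation"
passages = chunk_words(text, words_per_chunk=5)
records = build_records(passages, batch_size=2)
for rec in records:
    print(rec)
```

Storing the passage text as metadata is what lets a later query return readable text rather than only vector ids.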

In [None]:
#### This program does not ask OpenAI for completions; it only gets embeddings from OpenAI
#### and places them into a Pinecone vector DB along with the passage for each embedding.

import os, re, time

# read the speech and join its lines into one space-separated string
with open("txt/gettysburg.txt") as f:
    content = " ".join(line.strip() for line in f)

# --------

import openai
import pinecone  # pip install pinecone-client
import pinecone.info
import numpy as np

EMBED_MODEL = "text-embedding-ada-002"

# openai.organization = os.getenv("OPENAI_ORG")
with open("openaiorg.txt") as f:
    openai.organization = f.read().strip()

# pinecone_api_key = os.getenv("PINECONE_API_KEY")
with open("pineconekey.txt") as f:
    pinecone_api_key = f.read().strip()
pinecone.init(api_key=pinecone_api_key, environment='us-west1-gcp')
version_info = pinecone.info.version()
server_version = ".".join(version_info.server.split(".")[:2])
client_version = ".".join(version_info.client.split(".")[:2])
print(server_version,client_version)
## assert client_version == server_version, "Please upgrade pinecone-client."

passages = []
num_words_per_chunk = 100
words = content.split()
for i in range(0, len(words), num_words_per_chunk):
    chunk = " ".join(words[i:i+num_words_per_chunk])
    passages.append(chunk)

batch_size = 32
embeddings_all = []
embeds_as_arrays = []  # need list of arrays to create index
print("NUM_PASSAGES",len(passages),"APPROX_NUM_BATCHES",len(passages)//batch_size)
for i in range(0, len(passages), batch_size):
    batch = passages[i : i+batch_size]
    res = openai.Embedding.create(input=batch, engine=EMBED_MODEL)
    embeds = [record['embedding'] for record in res['data']]
    embeddings_all.extend(embeds)
print(len(passages),len(embeddings_all))

dim = len(embeddings_all[0])
index_name = "gettysburg"
if index_name in pinecone.list_indexes():
    print("DEL INDEX")
    pinecone.delete_index(index_name)
    print("DEL DONE")
    # pass
ctime = time.time()
print("CREATE INDEX")
pinecone.create_index(name=index_name, dimension=dim, metric="cosine")
print("CREATE DONE",time.time()-ctime)
index = pinecone.Index(index_name=index_name)
vecIDs = [ str(i) for i in range(len(embeddings_all)) ]   # ids should be str
meta = [{'text': passage} for passage in passages]
rc = index.upsert(vectors=zip(vecIDs, embeddings_all, meta))
print(rc)
print( index.describe_index_stats() )
len_embeds = len(embeddings_all[0])

stime = time.time()

query = "Which speech began with 'Four score and seven years ago'?"
res = openai.Embedding.create (
    input=[query], engine=EMBED_MODEL
)
q_embed = res['data'][0]['embedding']
###### q_embed = np.array(q_embed).reshape( (1,len(q_embed)) )
###### print("QEMBED",q_embed.shape)

rc = index.query(
        vector=q_embed,
        top_k=1,  # just going for 1 in this tiny demo
        include_metadata=True,
        include_values=True)
for x in rc["matches"]:
    print(x["id"])
    print(x["metadata"])

### Ensemble RAG example
This code demos an ensemble that combines vector DB search with keyword search.  The keyword search is based on BM25, a ranking function that is</br>
widely used, e.g. by Elasticsearch.
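
One common way to merge the ranked lists from two retrievers is weighted Reciprocal Rank Fusion (RRF), sketched below. To my understanding LangChain's EnsembleRetriever uses a weighted RRF of this kind, but treat this as an illustration rather than its exact implementation. The function name `weighted_rrf`, the constant c=60, and the document ids are assumptions.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids.
    Each list contributes weight / (c + rank) for every id it contains."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

bm25_ranking = ["cuban", "pearl_harbor"]   # keyword retriever, best first
faiss_ranking = ["gettysburg", "cuban"]    # vector retriever, best first

fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5])
print(fused)
```

A document found by both retrievers accumulates score from each list, so it rises to the top. The fused list can also be longer than either input, which is why the ensemble may return more than k documents.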

In [None]:
import sys, os, time

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

embedding = OpenAIEmbeddings()

loader = DirectoryLoader("./", glob="./txt/*.txt", loader_cls=TextLoader)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
texts = [ doc.page_content for doc in chunks ]   # index the chunks, not the whole documents

bm25_retriever = BM25Retriever.from_texts(texts)
bm25_retriever.k = 2

result_bm25 = bm25_retriever.get_relevant_documents("soviet missiles")
print("RESBM25",len(result_bm25))
for doc in result_bm25:
    print(len(doc.page_content))
print("-" * 50)

faiss_vectorstore = FAISS.from_texts(texts, embedding)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

result_faiss = faiss_retriever.get_relevant_documents("soviet missiles")
print("RESFAISS",len(result_faiss))
for doc in result_faiss:
    print(len(doc.page_content))
print("-" * 50)

ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, faiss_retriever],
                                       weights=[0.5, 0.5])

ensemble_docs = ensemble_retriever.get_relevant_documents("soviet missiles")
print("RESENSEM",len(ensemble_docs))
for doc in ensemble_docs:
    print(len(doc.page_content))
print("-" * 50)

#### IMPORTANT NOTE ######## ****************
####   If the total size of the page_content in the ensemble_docs is too large,
####   you will get an error back from OpenAI because you have exceeded the token
####   limit.  So, you may have to reduce it before using in the query.
query = "what was the significance of 'soviet missiles'?"
llm = ChatOpenAI(model_name="gpt-4", temperature=0.7, max_tokens=128)
chain = load_qa_chain(llm,chain_type="stuff") # stuff all in at once
response = chain.run(input_documents=ensemble_docs,question=query)
print(response)