# Asking LLMs Questions on Private Data

My notes on using LLMs to query private data, in particular using [LangChain](https://python.langchain.com/en/latest/index.html) to create embeddings, persist them in a vector store, and use them with an LLM.


## LLMs and constraints

LLMs are trained on large amounts of unstructured data and are great at general text generation. Off-the-shelf pre-trained LLMs have a few limitations:
* They’re usually trained offline, so the model is unaware of the latest information.
* They make predictions using only the information stored in their parameters, which makes their answers hard to interpret or trace back to a source.
* They’re mostly trained on general-domain corpora, making them less effective on domain-specific tasks.
* There are scenarios where you want the model to generate text based on specific data rather than generic data.


## LangChain

[LangChain](https://python.langchain.com/en/latest/) is an open-source framework designed to simplify building applications with LLMs. It provides a standard interface for interacting with LLMs and lets you chain different components together to build more advanced use cases.


Key Modules:
* **Prompt templates**
* **LLMs**: OpenAI, Hugging Face. See the supported [integrations](https://langchain.com/integrations.html)
* **Agents**: allow interaction with tools such as web search, external APIs, ...
* **Memory**: short-term memory (chat history) and long-term memory (vector stores)



# Demo


Run from `~/projects/openai/langchain`:
```
streamlit run app.py
```


In [None]:
! streamlit run app.py

-----
## Environment setup



In [None]:
!pip install langchain --upgrade
!pip install openai --upgrade
!pip install pypdf --upgrade
!pip install chromadb --upgrade
!pip install pinecone-client --upgrade
!pip install ipywidgets --upgrade
!pip install tiktoken --upgrade


In [None]:
import os

# Read the API keys from environment variables; if a variable is not set, fall back to the value you put below
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY', 'YourAPIKey')
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', 'YourAPIKey')
PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV', 'us-west4-gcp') # Replace with your own Pinecone environment if it differs
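
Because the cell above silently falls back to the placeholder string, a missing key only shows up later as an authentication error. A minimal sketch of a fail-fast check, assuming the same `'YourAPIKey'` placeholder; the helper name `require_key` and its `env` parameter are hypothetical and not part of LangChain or the OpenAI client:

```python
import os

PLACEHOLDER = "YourAPIKey"  # same placeholder string as in the cell above


def require_key(name, env=os.environ):
    """Return env[name]; raise if it is missing, empty, or still the placeholder."""
    value = env.get(name, PLACEHOLDER)
    if not value or value == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Example with an explicit mapping instead of the real environment
print(require_key("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

Passing the mapping as a parameter keeps the check easy to try without touching real credentials.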



---------
# Data Ingestion


<img src='./data/data_ingestion.png' width='800'>



--> Point to a data source and load multiple documents (PDF/Word/HTML/Chat...). [Document Loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html)

--> **Chunk** the documents into smaller parts. [Text Splitters](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html)
  * Optimize for the smallest chunk size that does not lose context
  * Consider adding meaningful global metadata to every chunk, so that each embedded chunk carries some global context
  * Use `chunk_overlap` to maintain some local context
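
To make the effect of `chunk_overlap` concrete, here is a simplified sketch that slides a fixed-size character window over the text. It is an illustration only: LangChain's text splitters split on separators such as paragraphs and sentences rather than at fixed offsets, and the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij"
chunks = split_with_overlap(sample, chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk shares its first two characters with the end of the previous chunk, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.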
  
--> Create an **embedding** vector for each chunk using an embedding model. [Text Embedding Models](https://python.langchain.com/en/latest/modules/models/text_embedding.html)
  * An embedding is a vector (list) of floating point numbers

--> Raw data --> Embedding Model --> Vector Embedding --> Store embedding + metadata in [Vectorstores](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html)

  * **Vector stores**:
    * [Pinecone](https://docs.pinecone.io/docs/overview): Managed vector store. Pinecone vector search index settings: dimension 1536, metric one of cosine, dot product, or Euclidean.
    * [Chroma](https://docs.trychroma.com/): Open-source, locally managed vector store.

--> **Semantic search** retrieves relevant information by measuring the distance between two vectors, i.e. how similar they are.
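
The distance measures mentioned above can be sketched in plain Python. This is an illustration under simplifying assumptions, not how Pinecone or Chroma implement search: the three-dimensional vectors and the `top_k` helper are hypothetical, and real embeddings have far more dimensions (1536 in the Pinecone index settings above).

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, store, k):
    """Return the k texts whose vectors are most cosine-similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for a vector store
store = [
    ("baggage allowance", [1.0, 0.0, 0.0]),
    ("hotel after cancellation", [0.0, 1.0, 0.0]),
    ("ticket validity", [0.0, 0.0, 1.0]),
]
query_vec = [0.1, 0.9, 0.2]
print(top_k(query_vec, store, k=2))  # ['hotel after cancellation', 'ticket validity']
```

A real vector store uses an index to avoid comparing the query against every stored vector, but the ranking idea is the same.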
  


            

In [None]:
# Set some global variables used by the following cells

persist_chroma_directory = '.chroma_db'
pdf_folder = './data/pdf'

os.listdir(pdf_folder)



In [None]:
from langchain.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    UnstructuredPDFLoader,
    TextLoader,
)

loader = DirectoryLoader(pdf_folder, glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# With PyPDFLoader, each entry in documents is one page of a PDF.
print(f'{len(documents)} pages loaded')


In [None]:
documents[:1]

In [None]:
print(documents[0].page_content)

## Create document embeddings



In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings


# use OpenAI embedding
embedding = OpenAIEmbeddings()
persist_chroma_directory = '.chroma_db'


In [None]:
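from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded pages into overlapping chunks before embedding (sizes are illustrative; tune them for your data)
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(documents)
chroma_store = Chroma.from_documents(documents=chunks, embedding=embedding, persist_directory=persist_chroma_directory)

# Persist the database to disk --> persist() must be called explicitly when using Jupyter
chroma_store.persist()
chroma_store = None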


## Or load previously stored embeddings

In [None]:
# Now we can load the persisted database from disk, and use it as normal. 
chroma_store = Chroma(embedding_function=embedding, persist_directory=persist_chroma_directory)


----
# Query LLM


<img src='./data/RAG.jpg' width='800'>



In [None]:
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA

# Load the persisted database from disk if it is not already in memory
if chroma_store is None:
    chroma_store = Chroma(embedding_function=embedding, persist_directory=persist_chroma_directory)

# Create the chain
#   The retriever fetches the k=3 most similar chunks from the embeddings previously stored in Chroma
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=chroma_store.as_retriever(search_kwargs={'k':3}), verbose=True)


query = "Provide details of my hotel entitlement if my flight is cancelled?"
#query = "How much liquid can I bring on a flight?"
#query = "how long is my ticket valid for?"
result = qa.run(query)

print(result)
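
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. A minimal sketch of that idea, assuming the prompt wording and the helper name `build_stuff_prompt`, both of which are hypothetical and not LangChain's actual template:

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Join every retrieved chunk into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for the k=3 results returned by the retriever
retrieved = [
    "Chunk about hotel accommodation.",
    "Chunk about meal vouchers.",
    "Chunk about rebooking.",
]
prompt = build_stuff_prompt("What am I entitled to if my flight is cancelled?", retrieved)
print(prompt)
```

Since every chunk ends up in one prompt, the chunk size and `k` together have to fit within the model's context window.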
