# LLMs asking Questions on Private Data

My notes on using LLMs to query private data, in particular using [LangChain](https://python.langchain.com/en/latest/index.html) to create embeddings, persist them in a vector store, and use them with an LLM.


## LLMs and constraints

LLMs are trained on large amounts of unstructured data and are great at general text generation. Off-the-shelf pre-trained LLMs have a few limitations:
* They’re usually trained offline, so the model is unaware of information published after training.
* They make predictions using only the information stored in their parameters, which makes answers hard to interpret or trace back to a source.
* They’re mostly trained on general-domain corpora, making them less effective on domain-specific tasks.
* In some scenarios you want the model to generate text based on specific data rather than generic data.






## Retrieval Augmented Generation (RAG)


<img src='./data/RAG2.png' width='1000'>

RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context.

The idea behind the [Retrieval Augmented Generation (RAG)](https://huggingface.co/docs/transformers/model_doc/rag) workflow is simple. Instead of sending the question directly to the LLM, the process first uses the user's question to search the internal dataset for relevant documents, and then passes these documents to the LLM together with the question. With this additional context, the LLM can answer as though it had been trained on the internal dataset.
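
The retrieve-then-augment flow can be sketched in plain Python. This is only an illustration under stated assumptions: the documents are invented, and the word-overlap scoring in `retrieve` is a stand-in for the embedding-based search used later in these notes. The function names `tokenize`, `retrieve`, and `build_prompt` are hypothetical and not part of LangChain.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Rank documents by how many words they share with the question (toy scoring)."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical internal documents, invented for this illustration
internal_docs = [
    "Passengers on cancelled flights are offered a refund or rebooking.",
    "Each passenger may carry one cabin bag of up to 10 kg.",
    "Liquids in cabin bags must be in containers of 100 ml or less.",
]

question = "What happens if my flight is cancelled?"
top_docs = retrieve(question, internal_docs, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in place of the bare question.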



## LangChain

[LangChain](https://python.langchain.com/en/latest/) is an open-source framework designed to simplify the creation of applications that use LLMs. It provides a standard interface for interacting with LLMs and lets you chain different components together to build more advanced use cases around them.


Key Modules:
* **LLMs**: OpenAI, Hugging Face, and other supported [integrations](https://langchain.com/integrations.html)
* **Prompt templates**
* **Agents**: allow interaction with tools such as web search and external APIs
* **Memory**: short-term memory (chat history) and long-term memory (vector stores)

# Demo


From `~/projects/llm`:

```
streamlit run app.py
```


In [None]:
! streamlit run app.py

-----
## Environment setup



In [None]:
!pip install langchain --upgrade
!pip install openai --upgrade
!pip install pypdf --upgrade
!pip install chromadb --upgrade
!pip install pinecone-client --upgrade
!pip install ipywidgets --upgrade
!pip install tiktoken --upgrade
!pip install langflow --upgrade


In [None]:
import os

# Read the API keys from environment variables; fall back to the placeholder values below if they are not set
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY', 'YourAPIKey')
PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', 'YourAPIKey')
PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV', 'us-west4-gcp') # Replace with your own Pinecone environment if it differs
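
With the fallback pattern above, a forgotten key only shows up later as an authentication error. A minimal sketch of failing early instead is shown here. `require_key` and `DEMO_API_KEY` are hypothetical names used for illustration, and the placeholder string is assumed to match the one in the cell above.

```python
import os

PLACEHOLDER = "YourAPIKey"  # assumed to match the placeholder used in the cell above


def require_key(name):
    """Return the environment variable `name`, or raise if it is missing or still a placeholder."""
    value = os.environ.get(name, PLACEHOLDER)
    if not value or value == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Demo only: set a fake key so the example runs without real credentials
os.environ.setdefault("DEMO_API_KEY", "sk-demo-value")
print(require_key("DEMO_API_KEY"))
```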



---------
# Data Ingestion


<img src='./data/data_ingestion.png' width='800'>



--> Point to the data source and load multiple documents (PDF/Word/HTML/Chat...). [Document Loaders](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html)

--> **Chunk** the documents into smaller parts. [Text Splitters](https://python.langchain.com/en/latest/modules/indexes/text_splitters.html)
  * Optimize for the smallest chunk size that does not lose context
  * Consider adding meaningful global metadata to every chunk, so each embedded chunk carries some global context
  * Use ```chunk_overlap``` to maintain some local context
  
--> Create an **embedding** vector for each chunk using an embedding model. [Text Embedding Models](https://python.langchain.com/en/latest/modules/models/text_embedding.html)
  * An embedding is a vector (list) of floating point numbers
        
--> Raw data --> Embedding Model --> Vector Embedding --> Store embedding + metadata in [Vector stores](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html)

  * **Vector stores**:
    * [Pinecone](https://docs.pinecone.io/docs/overview): Managed vector store. The Pinecone vector search index used here has dimension 1536
    * [Chroma](https://docs.trychroma.com/): Open source, locally managed vector store.
    
--> **Semantic search** retrieves relevant information by measuring the similarity between two vectors, typically with the **cosine, dot product, or Euclidean** metric
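
The three metrics named above can be written out in a few lines of plain Python. This is a sketch for intuition only: the three-dimensional vectors are made up, whereas real embeddings have many more dimensions, and vector stores compute these metrics with optimized code.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between the two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their lengths (ignores magnitude)."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


# Made-up vectors for illustration
query_vec = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]
doc_orthogonal = [0.0, 3.0, 0.0]

for name, vec in [("same direction", doc_same_direction), ("orthogonal", doc_orthogonal)]:
    print(
        name,
        round(cosine_similarity(query_vec, vec), 3),
        dot_product(query_vec, vec),
        round(euclidean_distance(query_vec, vec), 3),
    )
```

Note that cosine similarity treats `doc_same_direction` as a perfect match even though it is twice as long as the query, while Euclidean distance does not.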

            

-----
## Load documents and chunk



In [None]:
# Set variables used globally by the following cells

persist_chroma_directory = '.chroma_db'
pdf_folder = './data/pdf'

os.listdir(pdf_folder)



In [None]:
from langchain.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    UnstructuredPDFLoader,
    TextLoader,
)

# Load every PDF under pdf_folder, including subfolders
loader = DirectoryLoader(pdf_folder, glob='**/*.pdf', loader_cls=PyPDFLoader)
documents = loader.load()

# With PyPDFLoader, each item in documents is one page of a PDF
print(f'{len(documents)} pages loaded')


In [None]:
documents[0]

In [None]:
print(documents[0].page_content)

----
## Split into smaller chunks

* Split the text into small, semantically meaningful chunks.
* Most LLMs limit the number of tokens they accept, so passing in an entire document, or several pages plus the prompt, may exceed the LLM's token limit.
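
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is a sketch only: `split_with_overlap` is a hypothetical helper, and it cuts purely by character position, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    parts = []
    for start in range(0, len(text), step):
        parts.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return parts


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each piece starts with the last three characters of the previous one, which is how overlap preserves some local context across chunk boundaries.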



In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunk loaded documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

print(f'{len(chunks)} chunks created')

In [None]:
chunks[:4]


In [None]:
print(chunks[0].page_content)

----
## Chroma: Create document embeddings

[Chroma](https://docs.trychroma.com/): Open source locally managed vector store.
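
Conceptually, a vector store keeps each embedding alongside its text and metadata and returns the nearest entries for a query vector. The toy class below sketches that idea with a linear scan. It is not how Chroma is implemented, `ToyVectorStore` is a hypothetical name, and the two-dimensional vectors and page numbers are invented.

```python
import math


class ToyVectorStore:
    """In-memory store that ranks entries by cosine similarity (illustration only)."""

    def __init__(self):
        self._rows = []  # each row is (vector, text, metadata)

    def add(self, vector, text, metadata=None):
        self._rows.append((vector, text, metadata or {}))

    def query(self, query_vector, k=3):
        """Return the k most similar rows as (score, text, metadata) tuples."""
        scored = [
            (self._cosine(query_vector, vector), text, metadata)
            for vector, text, metadata in self._rows
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


store = ToyVectorStore()
store.add([0.9, 0.1], "Refund policy for cancelled flights", {"page": 4})
store.add([0.1, 0.9], "Cabin baggage allowance", {"page": 9})
store.add([0.8, 0.3], "Hotel vouchers during long delays", {"page": 5})

hits = store.query([1.0, 0.0], k=2)
for score, text, metadata in hits:
    print(round(score, 3), text, metadata)
```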



In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings


persist_chroma_directory = '.chroma_db'

# Use the OpenAI embedding model
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY,
                             model='text-embedding-ada-002')



In [None]:
chroma_store = Chroma.from_documents(documents=chunks, embedding=embedding, persist_directory=persist_chroma_directory)

# Persist the database to disk; persist() must be called explicitly when running in Jupyter
chroma_store.persist()
chroma_store = None



### Or load previously stored embeddings

In [None]:
# Now we can load the persisted database from disk, and use it as normal. 
chroma_store = Chroma(embedding_function=embedding, persist_directory=persist_chroma_directory)


### Similarity search in Chroma



In [None]:
# Access the local Chroma store created by LangChain directly, instead of connecting to a Docker container

import chromadb
from chromadb.config import Settings

#persist_chroma_directory = '.chroma_db'

chroma_client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",
                                    persist_directory=persist_chroma_directory
                                    ))


In [None]:
chroma_client.list_collections()


In [None]:
from chromadb.utils import embedding_functions

openai_embedding = embedding_functions.OpenAIEmbeddingFunction(
                api_key = OPENAI_API_KEY,
                model_name = "text-embedding-ada-002"
                )

collection = chroma_client.get_collection(name="langchain", embedding_function=openai_embedding)

collection.count()


In [None]:
query = "Can I get hotel accommodation if my flight is cancelled?"

# Use a separate name so the document chunks created earlier are not overwritten
results = collection.query(query_texts=[query], n_results=3)


In [None]:
results

-----
## Pinecone: Create document embeddings

[Pinecone](https://app.pinecone.io/) is a fully managed vector database.



Pinecone [CRUD](https://towardsdatascience.com/crud-with-pinecone-ee6b6f8b54e8) operations



In [None]:
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone


index_name = "langchaintest" 

pinecone.init(
    api_key=PINECONE_API_KEY, 
    environment=PINECONE_API_ENV
)

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536    # text-embedding-ada-002 returns 1536-dimensional vectors
    )

embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY,
                             model='text-embedding-ada-002')



In [None]:
# Create embedding and store in Pinecone

pinecone_store = Pinecone.from_documents(documents=chunks, embedding=embedding, index_name=index_name)


### Or load previously stored embeddings



In [None]:
# Alternatively, use a previously created Pinecone index

embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

pinecone_store = Pinecone.from_existing_index(embedding=embedding, index_name=index_name)

### Similarity search in Pinecone



In [None]:
# Semantic search in Pinecone

query = "Can I get hotel accommodation if my flight is cancelled?"
# Use a separate name so the document chunks created earlier are not overwritten
matches = pinecone_store.similarity_search(query, k=3)


In [None]:
matches

----
# Query LLM


<img src='./data/RAG.jpg' width='800'>



In [None]:
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain

# Load the persisted database from disk if it is not already in memory
if chroma_store is None:
    chroma_store = Chroma(embedding_function=embedding, persist_directory=persist_chroma_directory)

# Create the chain. The retriever fetches the 3 most similar chunks from the embeddings stored in Chroma
qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=OpenAI(openai_api_key=OPENAI_API_KEY),
    chain_type="stuff",
    retriever=chroma_store.as_retriever(search_kwargs={'k': 3}),
    verbose=True,
)


query = "Provide details of compensation if my flight is cancelled? Output the results in bullet points"
#query = "How much liquid can I bring on a flight?"
#query = "how long is my ticket valid for?"

result = qa_with_sources(query)

print(result['answer'])


In [None]:
print(f'Question: {result["question"]}\n')

print(f'Answer: {result["answer"]}')

# The chain returns the sources as one comma-separated string
for index, doc in enumerate(result['sources'].split(',')):
    print(f'Source {index}: {doc.strip()}')


# BONUS: Low-code LLM app builders



* [FlowiseAI](https://flowiseai.com)

* [LangFlow](https://github.com/logspace-ai/langflow) is a low-code GUI builder for LLM applications, built on top of LangChain. It can be installed locally, or you can use the hosted version on [Hugging Face](https://huggingface.co/spaces/Logspace/LangFlow)

    ```
    pip install langflow
    python -m langflow
    ```









In [None]:
!python3 -m langflow

In [None]:
!python3 mylangflow.py