# LangChain tests - Vector DB - ChromaDB

## This notebook

This notebook collects Python examples using LangChain and advanced ChromaDB features.

This notebook was tested in June 2023 on AWS SageMaker using the Data Science 3.0 image.

Test environment:
> - AWS SageMaker Studio's notebook 
>> - Kernel image Data Science 3.0
>> - t3.medium 2CPU - 4GB
>> - Python 3.9.15
>> - Linux default 4.14.304-226.531.amzn2.x86_64

More information about LangChain and dedicated examples can be found in the other notebooks in the same folder.

---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">NOTEBOOK SETUP</div>



**Instructions**

All setup steps are at the top of the notebook, so you can run this whole section to initialize the notebook.

Before running the setup, you may need to create the following resources:
- An OpenAI API key. OpenAI APIs are not free.

Refer to the setup sections for instructions on how to create those resources.

---
## API keys and environment

LangChain gets the API keys from environment variables or function parameters.

**Instructions**

- Never show the keys in shared notebooks, whether as part of the code or in a log. A simple way to avoid key leakage is to use environment variables. You can set the environment variable in the terminal or in some local configuration. If you do, you do not have to set the key here.

- If it is easier for you to set the key here by assigning the value, do not forget to empty the string right after you run this block. The environment will be kept in memory as long as the kernel runs.

- Be careful when printing the keys. Ensure that you remove the outputs. 

- Before sharing, check that the keys are not printed out by some features of the libraries. Avoid printing library objects. They often hold the API key as a property and may disclose its value.
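
As an illustration of the precautions above, here is a minimal sketch that reads a key from the environment and prints only a masked form of it. The `mask_secret` helper and the `EXAMPLE_API_KEY` variable name are hypothetical and are not part of LangChain or OpenAI.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return a masked form of a secret that is safer to print."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        # Too short to reveal anything: mask everything.
        return "*" * len(value)
    # Keep only the last few characters so the key can still be recognized.
    return "*" * (len(value) - visible) + value[-visible:]


# Hypothetical placeholder value, set only so that the example runs end to end.
os.environ.setdefault("EXAMPLE_API_KEY", "sk-example-not-a-real-key")

api_key = os.environ.get("EXAMPLE_API_KEY", "")
print(f"EXAMPLE_API_KEY = {mask_secret(api_key)}")
```

Printing the masked value is usually enough to confirm that the variable is set without leaving the key in the cell output.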


I store API keys and configuration information in AWS Secrets Manager. The code below retrieves the secret holding the keys. The secret is a JSON string consisting of key/value pairs. It is used later to set various environment variables.

When using notebooks in SageMaker, do not forget to give the SageMaker execution role permission to read this secret.
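
Because the secret is a JSON string of key/value pairs, it can be parsed with the standard `json` module and copied into environment variables. The sketch below assumes a made-up payload; the names `EXAMPLE_API_KEY` and `EXAMPLE_REGION` are hypothetical and stand in for whatever the real secret contains.

```python
import json
import os

# Hypothetical payload: in the notebook the real value comes from Secrets Manager.
secret_string = '{"EXAMPLE_API_KEY": "example-value", "EXAMPLE_REGION": "eu-west-1"}'

# json.loads parses the text as data and never executes it as code.
config = json.loads(secret_string)

# Copy every key/value pair into the process environment.
for name, value in config.items():
    os.environ[name] = str(value)

# Print the key names only, never the values.
print(sorted(config))
```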

In [26]:
!apt-get update && apt-get install -y jq 1>/dev/null

Hit:1 http://deb.debian.org/debian bullseye InRelease
Hit:2 http://deb.debian.org/debian bullseye-updates InRelease
Hit:3 http://security.debian.org/debian-security bullseye-security InRelease
Reading package lists... Done


In [27]:
%%bash --out secrets 
# Use AWS Secrets Manager to store the keys.
# Grab the keys and store them in a Python variable.
export RESPONSE=$(aws secretsmanager get-secret-value --secret-id 'salvia/labbench/tests' )
export SECRETS=$( echo $RESPONSE | jq '.SecretString | fromjson')

echo $SECRETS

---
## pip upgrade

In [28]:
!pip install --upgrade pip  1>/dev/null


---
## LangChain Setup

**Resources**
> - [LangChain GetStarted](https://python.langchain.com/docs/get_started/quickstart)

In [29]:
!pip install langchain==0.0.230 1>/dev/null


---
## OpenAI Setup

**Resources**
> - [OpenAI tutorial on API keys](https://platform.openai.com/docs/quickstart)
> - [OpenAI package on Pypi](https://pypi.org/project/openai/)

In [30]:
import json
import os

# Parse the JSON secret instead of calling eval() on it, so no code is executed
os.environ["OPENAI_API_KEY"] = json.loads(secrets)["OPENAI_API_KEY"]


In [31]:
!pip install openai==0.27.8 1>/dev/null


---
## Setup ChromaDB as a vector database

Install the GCC C++ compiler.

In [32]:
!apt-get update && apt-get install -y build-essential 1>/dev/null

Hit:1 http://security.debian.org/debian-security bullseye-security InRelease
Hit:2 http://deb.debian.org/debian bullseye InRelease
Hit:3 http://deb.debian.org/debian bullseye-updates InRelease
Reading package lists... Done


Pin the version to work around an issue with the Pydantic version.

In [33]:
pip install chromadb==0.3.26 1>/dev/null

Note: you may need to restart the kernel to use updated packages.


## Setup additional text management tools

When working with embeddings, additional packages are required.

- tiktoken, as an encoder and tokenizer

**Resources**
> - [Tiktoken package on Pypi](https://pypi.org/project/tiktoken/)

 

In [34]:
!pip install tiktoken==0.4.0 1>/dev/null


---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">
    INITIALIZATION
</div>


---
## Download example documents once

Download a book from the Internet Archive.

In [41]:
!mkdir -p work/chromadb/data

In [42]:
!curl  https://ia800307.us.archive.org/17/items/aliceinwonderlan00carriala/aliceinwonderlan00carriala_djvu.txt \
 -o work/chromadb/data/alice_in_wonderland.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29923  100 29923    0     0  38208      0 --:--:-- --:--:-- --:--:-- 38167


In [43]:
!ls -l work/chromadb/data

total 32
-rw-r--r-- 1 root root 29923 Jul 16 19:37 alice_in_wonderland.txt


---
## Common initializations

In [44]:
notebook_folder = "work/chromadb"
documents_folder = f"{notebook_folder}/data"

print(documents_folder)

work/chromadb/data


---
# Document loaders

Loaders are an easy way to import documents from other sources 
and make them available for use in your language models. There are many loader types.

**Resources**
> - Document Loaders: https://python.langchain.com/docs/modules/data_connection/document_loaders
> - List of loaders: https://github.com/hwchase17/langchain/tree/master/langchain/document_loaders
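
To make the idea concrete, here is a rough, library-free sketch of what a plain-text loader produces: a list of documents, each holding the text and some metadata. `SimpleDocument` and `load_text_file` are hypothetical stand-ins written for this illustration. They are not LangChain classes, although they mimic the `page_content` attribute that this notebook reads from loaded documents.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[SimpleDocument]:
    """Read a whole text file and wrap it in a single document."""
    text = Path(path).read_text(encoding="utf-8")
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a small sample file so that the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("Alice was beginning to get very tired.", encoding="utf-8")
    documents = load_text_file(sample_path)

print(len(documents), documents[0].metadata["source"])
```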

In [45]:
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader

# Note, the default model is already 'text-davinci-003' 
# temperature 0 means no randomness
llm = OpenAI(temperature=0, model_name='text-davinci-003')

# This is the source document.    
document_path = f"{documents_folder}/alice_in_wonderland.txt"
 
# Setup a text loader
loader = TextLoader(document_path)
alice_documents = loader.load()

text = alice_documents[0].page_content

# check the length
length = len(text)
print(f"{length=}")

# check the number of tokens
num_tokens = llm.get_num_tokens(text)
print(f"{num_tokens=}")

length=29258
num_tokens=20373


In [46]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Get your splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
 
# Split your docs into texts
texts = text_splitter.split_documents(alice_documents)

# check texts
num_texts = len(texts)
print(f"{num_texts=}")


num_texts=33
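
The effect of `chunk_size` and `chunk_overlap` can be pictured with a simplified fixed-window splitter. This is only a sketch: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to cut on separators such as paragraph and line breaks, so its chunks will generally differ from these.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is what the overlap is for: a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.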


---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">
BASIC CONCEPTS
</div>

---
## Question answering leveraging a similarity threshold

**Resources**
> - https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/chroma

In [47]:
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings


# Get embedding engine ready
embeddings = OpenAIEmbeddings()
 
# Embed your texts and store them in the vector database.
# The database is in memory. It can be saved to a file and loaded later on.
chroma_db = Chroma.from_documents(texts, embeddings)

# Init a retriever for this db
# to look up the relevant parts
retriever = chroma_db.as_retriever(search_type="similarity_score_threshold", 
                                    search_kwargs={"k":5,
                                           "score_threshold": 0.7
                                          })
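
The two search parameters can be read as "return at most `k` chunks, and only those whose score reaches `score_threshold`". The following library-free sketch shows that logic on toy two-dimensional vectors. It assumes cosine similarity as the score; the way Chroma and LangChain actually compute and normalize relevance scores may differ, and the helper names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, indexed, k, score_threshold):
    """Return up to k (score, text) pairs whose score is at least score_threshold."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in indexed]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return kept[:k]


# Toy "embeddings": real ones have hundreds of dimensions.
indexed = [
    ("chunk about the rabbit", [1.0, 0.0]),
    ("chunk about the hatter", [0.0, 1.0]),
    ("chunk about both", [1.0, 1.0]),
]

results = retrieve([1.0, 0.0], indexed, k=5, score_threshold=0.7)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With a threshold, the number of returned chunks depends on the query, which is why a retriever configured with `k=5` can return fewer than five documents.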

In [48]:
# a query
query = "who is the White Rabbit?"

# retrieve and count indexed documents relevant for the query
docs = retriever.get_relevant_documents(query)
print(f"\nFound {len(docs)} relevant document(s)")


# create a chain to answer questions 
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True)

response = qa({"query": query})
print(response['result'])


Found 5 relevant document(s)
 The White Rabbit is a character from Lewis Carroll's Alice's Adventures in Wonderland. He is a magical rabbit who Alice follows down a rabbit hole into a fantastical world. He is late for an important date and is in a rush throughout the story.


In [None]:
# a query
query = "who is the Mad Hatter?"

# retrieve and count indexed documents relevant for the query
docs = retriever.get_relevant_documents(query)
print(f"\nFound {len(docs)} relevant document(s)")


# create a chain to answer questions 
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True)

response = qa({"query": query})
print(response['result'])


Found 3 relevant document(s)


Note that the Mad Hatter query matches fewer documents above the threshold than the White Rabbit query (3 versus 5).

---
<div style="background-color:green;color:black;text-align:center;padding:1rem;font-size:1.5rem;">
ADVANCED CONCEPTS
</div>

---
## Chroma can use metadata

<div class="alert alert-block alert-warning"> 
    TODO
</div>

Load metadata from the metadata file and look up documents by metadata.
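
Until this section is written, the idea of a metadata lookup can be sketched in plain Python: each chunk carries a metadata dictionary, and a lookup keeps only the chunks whose metadata matches the requested values. Everything here is hypothetical (the `chapter` field, the sample records, and the `filter_by_metadata` helper), and it only covers exact-match filtering.

```python
# Hypothetical chunks, each with a metadata dictionary attached.
records = [
    {"text": "Down the rabbit hole.", "metadata": {"chapter": 1, "source": "alice_in_wonderland.txt"}},
    {"text": "The pool of tears.", "metadata": {"chapter": 2, "source": "alice_in_wonderland.txt"}},
    {"text": "A mad tea-party.", "metadata": {"chapter": 7, "source": "alice_in_wonderland.txt"}},
]


def filter_by_metadata(items: list[dict], where: dict) -> list[dict]:
    """Keep the items whose metadata equals every key/value pair in `where`."""
    return [
        item
        for item in items
        if all(item["metadata"].get(key) == value for key, value in where.items())
    ]


matches = filter_by_metadata(records, {"chapter": 7})
print([item["text"] for item in matches])
```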


---
## Chroma can persist the database
<div class="alert alert-block alert-warning"> 
    TODO 
</div>

```python
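# build a store that is saved to disk, from the chunks and embeddings created above
db2 = Chroma.from_documents(texts, embeddings, persist_directory="./chroma_db")
db2.persist()
docs = db2.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
docs = db3.similarity_search(query)
print(docs[0].page_content)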
```

---
<div class="alert alert-block alert-warning"> TODO</div>

Vector store-backed retriever:

https://python.langchain.com/docs/modules/data_connection/retrievers/how_to/vectorstore

---

In [None]:
print('DONE')