### Simple RAG:

In this tutorial, we will cover the following topics:
- What a vector database is and how to work with one.
- What a chunk is and how to split text into chunks.
- What an embedding is, and how to embed chunks and upsert them into a vector database.
- How to query a vector database collection and pass the retrieved chunks to an LLM to generate an answer.

---
Tools:
* [anthropic](https://github.com/anthropics/anthropic-sdk-python)
* [chromadb](https://github.com/chroma-core/chroma)
* [sentence_transformers](https://sbert.net/)

---
Let's load our API key into the project:
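
When confirming that the key loaded, it is safer not to echo the full value, because notebook outputs are saved with the file. A minimal sketch of one way to do that follows. `mask_secret` is a hypothetical helper written for this tutorial, not part of any library, and the value passed to it is made up.

```python
def mask_secret(secret, visible=4):
    """Return a masked copy of a secret, keeping only the last `visible` characters.

    Hypothetical helper for illustration; not part of dotenv or anthropic.
    """
    if not secret:
        return "<missing>"
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Example with a made-up value (not a real key)
print(mask_secret("sk-example-1234"))
```

In the cell below, the same helper could wrap `api_key` before printing.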

In [None]:
import os
from dotenv import load_dotenv

# load .env
load_dotenv(dotenv_path='../.env')

# get API key
api_key = os.getenv('ANTHROPIC_API_KEY')

# confirm the key was found without printing the secret itself
print("api_key loaded -> ", api_key is not None)

In [10]:
from anthropic import Anthropic

client = Anthropic(api_key=api_key)

---

#### **Create collection:**
Chroma lets you manage collections of embeddings, using the collection primitive.

Chroma collections are created with a name and an optional embedding function. If you supply an embedding function, you must supply it every time you get the collection.

---
#### **Embedding function:**

- model: all-MiniLM-L6-v2
- maximum input length: 256 tokens (longer input is truncated)
- embedding dimension: 384

An embedding is a fixed-length vector of numbers; inputs with similar meaning map to nearby vectors. Embeddings are the AI-native way to represent any kind of data, which makes them a good fit for AI-powered tools and algorithms. They can represent text, images, and soon audio and video. There are many options for creating embeddings, whether locally with an installed library or by calling an API.
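
To make "similar inputs get similar vectors" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three vectors are made-up 3-dimensional stand-ins for real 384-dimensional embeddings, and cosine similarity is only one common choice of measure; it is not necessarily the distance metric your Chroma collection is configured to use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors (assumption): two related concepts and one unrelated concept
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# The related pair should score higher than the unrelated pair
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```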

Chroma provides lightweight wrappers around popular embedding providers, making it easy to use them in your apps. You can set an embedding function when you create a Chroma collection, which will be used automatically, or you can call them directly yourself.

---
### 🍜 New ingredients!

- `import chromadb`: the vector database, see the [documentation](https://docs.trychroma.com/reference/py-client)
- `from chromadb.utils import embedding_functions`: [read more](https://docs.trychroma.com/guides/embeddings#default:-all-minilm-l6-v2)
- [Voyage embedding](https://docs.anthropic.com/en/docs/build-with-claude/embeddings)
- [sentence_transformers](https://sbert.net/)
- [Gemini text Embedding](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings#generative-ai-get-text-embedding-python_vertex_ai_sdk)

In [None]:
import chromadb
import pprint
from chromadb.utils import embedding_functions

# declare default embedding function [all-MiniLM-L6-v2]
default_embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

collection_name = "my_first_collection"

chroma_client = chromadb.PersistentClient(path="./chromadb/")
# declare ChromaDB collection
collection = chroma_client.get_or_create_collection(
    name=collection_name,
    embedding_function=default_embedding_function
    )

result = collection.get()

print(f"Collection {collection_name} created successfully")
pprint.pprint(result)

---
### Load txt from dir:
For now we process only plain-text files.

In a real-world project, we would usually need to parse and clean the source documents first, in other words, prepare them for our pipeline.
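
As one small example of such preparation, the sketch below collapses stray whitespace before the text is chunked. `normalize_whitespace` is a hypothetical helper and the sample string is made up; real pipelines usually need format-specific parsing as well.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and allow at most one blank line in a row."""
    text = text.replace("\r\n", "\n")       # normalize Windows line endings
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> one blank line
    return text.strip()


raw = "LEGO   City \r\n\r\n\r\n\r\nPolice \t Station  "
print(repr(normalize_whitespace(raw)))
```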



In [None]:
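import os
import pprint

def load_txt_from_dir(dir_path):
    """Read every .txt file in dir_path and return a list of {"text": ...} dicts."""
    documents = []
    # sort filenames so that file ids stay stable between runs
    for filename in sorted(os.listdir(dir_path)):
        if filename.endswith(".txt"):
            with open(os.path.join(dir_path, filename), "r", encoding="utf-8") as file:
                documents.append({"text": file.read()})
    return documents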

#--------------------------------------------#

directory_path = "../files/txt"

# load documents from directory
txt_files = load_txt_from_dir(directory_path)

print(f" {len(txt_files)} files loaded")
pprint.pprint(txt_files)

---
### Split text into chunks:

[Advanced Chunking Techniques for LLM](https://www.rungalileo.io/blog/mastering-rag-advanced-chunking-techniques-for-llm-applications)


[Unstructured](https://unstructured.io/)
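
The `split_text` function below cuts on raw character offsets, so a chunk can start or end in the middle of a word. A minimal sketch of one simple alternative, splitting on whitespace so that chunks end at word boundaries, is shown here. `split_words`, `chunk_words` and `overlap_words` are hypothetical names introduced for illustration, and sizes are counted in words, which is still only an approximation of model tokens.

```python
def split_words(text, chunk_words=50, overlap_words=5):
    """Split text into chunks of at most `chunk_words` words.

    Consecutive chunks share `overlap_words` words. Hypothetical helper,
    shown only to illustrate word-boundary chunking.
    """
    if not 0 <= overlap_words < chunk_words:
        raise ValueError("overlap_words must be >= 0 and smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in split_words(sample, chunk_words=4, overlap_words=1):
    print(chunk)
```

With `chunk_words=4` and `overlap_words=1`, each chunk begins with the last word of the previous one, which is the same overlap idea the character-based splitter uses.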

In [None]:
import pprint

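def split_text(
    text,
    chunk_size=256,
    chunk_overlap=20
    ):
    """Split text into overlapping chunks. Sizes are in characters, not tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= text_length:
            break  # last chunk reached the end; avoid a trailing overlap-only chunk
        start = end - chunk_overlap
    return chunks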

# split text into chunks
chunked_txt = []

for file_id, txt_file in enumerate(txt_files):
    chunks = split_text(txt_file["text"])
    for chunk_id, chunk in enumerate(chunks):
        chunked_txt.append(
            {
                'id': f"{file_id}-{chunk_id}", 
                'text': chunk,
            }
        )

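print(f"Split into {len(chunked_txt)} chunks\n\n")

# preview one chunk (the 14th if there are enough, otherwise the last one)
if chunked_txt:
    pprint.pprint(chunked_txt[min(13, len(chunked_txt) - 1)])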

---
### Upsert data to ChromaDB:

If Chroma is passed a list of documents, it will automatically tokenize and embed them with the collection's embedding function (the default will be used if none was supplied at collection creation). Chroma will also store the documents themselves. If the documents are too large to embed using the chosen embedding function, an exception will be raised.

Each document must have a unique associated id. Trying to `.add` the same ID twice will result in only the initial value being stored; `.upsert`, used in the cell below, instead updates an existing ID and inserts a new one. An optional list of metadata dictionaries can be supplied for each document, to store additional information and enable filtering.

Alternatively, you can supply a list of document-associated embeddings directly, and Chroma will store the associated documents without embedding them itself.
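
The difference between adding and upserting can be sketched with a plain dictionary. `ToyCollection` is a hypothetical stand-in written for this tutorial; it is not Chroma's implementation and does no embedding, it only mimics how ids are handled.

```python
class ToyCollection:
    """Dictionary-backed stand-in for a collection (illustration only)."""

    def __init__(self):
        self.documents = {}

    def add(self, ids, documents):
        # An id that already exists keeps its first value
        for doc_id, doc in zip(ids, documents):
            self.documents.setdefault(doc_id, doc)

    def upsert(self, ids, documents):
        # Insert new ids and overwrite existing ones
        for doc_id, doc in zip(ids, documents):
            self.documents[doc_id] = doc


toy = ToyCollection()
toy.add(ids=["0-0"], documents=["first version"])
toy.add(ids=["0-0"], documents=["second version"])
print(toy.documents["0-0"])  # first version

toy.upsert(ids=["0-0"], documents=["third version"])
print(toy.documents["0-0"])  # third version
```

Because the notebook may be re-run against the same persistent database, upserting keeps the stored chunks in sync with the current files.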


In [None]:
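# upsert all chunks in one call; Chroma embeds the documents with the collection's embedding function
# (for a very large corpus, upsert in smaller batches instead)
if chunked_txt:
    collection.upsert(
        ids=[chunk['id'] for chunk in chunked_txt],
        documents=[chunk['text'] for chunk in chunked_txt],
    )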


print(f"Collection {collection_name} has {collection.count()} documents")

---
### Query collection:

You can query a collection with one or more **query_texts**.
 
Chroma will first embed each **query_text** with the collection's embedding function, and then perform the query with the generated embedding.
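
`collection.query` returns one inner list per query text, so even a single question yields nested lists. The dictionary below is a hand-written mock with made-up ids, texts and distances (an assumption for illustration, not real Chroma output); it shows the shape of the result and how to unpack it.

```python
# Hand-written mock of a query result for ONE query text and n_results=2
mock_results = {
    "ids": [["0-3", "1-0"]],
    "documents": [["chunk about bricks", "chunk about minifigures"]],
    "distances": [[0.42, 0.57]],
}

# Pair each retrieved chunk with its id and distance for the first (only) query
first_query = zip(
    mock_results["ids"][0],
    mock_results["documents"][0],
    mock_results["distances"][0],
)
ranked = [{"id": i, "text": t, "distance": d} for i, t, d in first_query]

# Flatten the documents of all queries into a single list of chunks
relevant_chunks = [text for per_query in mock_results["documents"] for text in per_query]

print(ranked[0]["id"], relevant_chunks)
```

A smaller distance means a closer match, so the first entry of each inner list is the best hit for that query.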


In [8]:
import pprint

# function to query collection
def query_collection(question, n_results=5):
    results = collection.query(
        query_texts=question,
        n_results=n_results,
        # include=['embeddings', 'documents', 'distances']
    )
    # pprint.pprint(results)
    
    # extract relevant chunks
    relevant_chunks = [txt for sublist in results["documents"] for txt in sublist]
    # pprint.pprint(relevant_chunks)
    # for idx, txt in enumerate(results["documents"][0]):
    #     txt_id = results["ids"][0][idx]
    #     distance = results["distances"][0][idx]
    #     print("Chunks found:")
    #     print(f"document id: {txt_id}")
    #     print(f"text found:  {txt}")
    #     print(f"distance:    {distance}\n\n")

    return relevant_chunks


# function to generate a response with the Anthropic API
def api_response(query, relevant_chunks):
    
    context = "\n\n".join(relevant_chunks)
    
    user_prompt = (f"""
            You have been tasked with helping us to answer the following query: 
            <query>
            {query}
            </query>
            You have access to the following documents which are meant to provide context as you answer the query:
            <documents>
            {context}
            </documents>
            Please remain faithful to the underlying context, and only deviate from it if you are 100% sure that you know the answer already. 
            Answer the question now, and avoid providing preamble such as 'Here is the answer', etc
            """
            )
    
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=2048,
        messages=[{
                "role": "user", 
                "content": user_prompt
                }],
        temperature=0

    )
    return response.content[0].text


---
# Finally:

In [None]:
# query collection

question = "What Lego products do we have?"
relevant_chunks = query_collection(question)
answer = api_response(question, relevant_chunks)
print("\n------------------------------------\n")
print("answer\n", answer)

---

## Well done!

---

#### Now let's clean up the database

---
### List collections
Here we check which collections exist.

In [None]:
list_collections = chroma_client.list_collections()

print(list_collections)

---
### Delete collection
Here we delete a specific collection.

In [None]:
chroma_client.delete_collection(collection_name)

list_collections = chroma_client.list_collections()

print(list_collections)

---
made with <3 by 
[dima dem](https://github.com/dimadem/) |42London