# RAG Index & Store

### Indexing

- Load: First, we need to load our data.

- Split: Text splitters break large Documents into smaller chunks. This helps both when indexing data and when passing it into a model, since large chunks are harder to search over and may not fit in a model's finite context window.

- Store: We need somewhere to store and index our splits so that they can be searched later. This is often done using a VectorStore and an Embeddings model.

- Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.

- Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.
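
To make the Split step concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is not the splitter used by `DocLoader` (whose internals are not shown in this notebook); the function name `split_text` is made up for illustration, and the parameter names simply mirror the `chunk_size` and `chunk_overlap` arguments used later.

```python
def split_text(text: str, chunk_size: int = 100, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Hypothetical helper for illustration; real splitters usually also respect
    sentence or heading boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap means neighbouring chunks share some text, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing more data.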

In [1]:
# !pip install chromadb python-dotenv tqdm openai
# uuid, hashlib, and os are part of the Python standard library and need no install.

In [2]:
import chromadb
import uuid
import hashlib
import os
from dotenv import load_dotenv
import chromadb.utils.embedding_functions as embedding_functions
from rag_utils import DocLoader
from tqdm import tqdm

In [3]:
class StoreIndex:
    """
    StoreIndex is a class for managing a collection of documents with optional embedding functions.
    
    Attributes:
        db_path (str): Path to the database.
        collection_name (str): Name of the collection.
        extra_exembedding (bool): If True, use the OpenAI embedding function configured through environment variables; otherwise use ChromaDB's default embedding function.
        hash_text_uuid (bool): Flag to hash text to UUID.
        enable_logging (bool): Flag to show progress of adding documents.
        client (chromadb.PersistentClient): Persistent client for the database.
        collection (chromadb.Collection): Collection of documents.
    """
    
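    def __init__(self, db_path, collection_name, extra_exembedding=True, hash_text_uuid=True, enable_logging=True):
        """
        Initialize the StoreIndex with the given parameters.
        
        Args:
            db_path (str): Path to the database.
            collection_name (str): Name of the collection.
            extra_exembedding (bool): If True, use the OpenAI embedding function (reads OPENAI_API_KEY,
                OPENAI_BASE_URL, and OPENAI_EMBEDDING_NAME from the environment).
            hash_text_uuid (bool): If True, use the SHA-256 hash of each text as its ID; otherwise use a random UUID.
            enable_logging (bool): Flag to show progress of adding documents.
        """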
        load_dotenv()
        self.db_path = db_path
        self.collection_name = collection_name
        self.extra_exembedding = extra_exembedding
        self.hash_text_uuid = hash_text_uuid
        self.enable_logging = enable_logging
        if not os.path.exists(self.db_path):
            print(f"\nCreating database at {self.db_path}.")
        self.client = chromadb.PersistentClient(path=self.db_path)
        self.collection = self._create_collection()
        if self.enable_logging:
            print(f"\nConnected to collection {self.collection_name} in database {self.db_path} for indexing.")
        
    def _create_collection(self):
        """
        Create or get the collection with optional embedding functions.
        
        Returns:
            chromadb.Collection: The created or retrieved collection.
        """
        if self.extra_exembedding:
            
            api_key = os.environ.get("OPENAI_API_KEY")
            base_url = os.environ.get("OPENAI_BASE_URL")
            embedding_model = os.environ.get("OPENAI_EMBEDDING_NAME", "text-embedding-3-small")
            
            if not api_key:
                raise ValueError("API key must be set in environment variables.")
            
            if base_url:
                openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                    api_key=api_key,
                    api_base=base_url,
                    model_name=embedding_model
                )
            else:
                openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                    api_key=api_key,
                    model_name=embedding_model
                )
            if not self.exists(self.db_path, self.collection_name):
                print("\nCreating collection {} in the database {}.".format(self.collection_name, self.db_path))
            return self.client.get_or_create_collection(name=self.collection_name, embedding_function=openai_ef)
        return self.client.get_or_create_collection(name=self.collection_name)

    def add(self, documents, metadatas=None):
        """
        Add documents and their corresponding metadata to the collection.
        
        Args:
            documents (Union[str, List[str]]): Documents to be added. Can be a single document or a list of documents.
            metadatas (Union[Mapping[str, Union[str, int, float, bool]], List[Mapping[str, Union[str, int, float, bool]]]]): 
                Metadata for the documents. Can be a single mapping or a list of mappings.
        
        Documents whose ID is already in the collection are skipped.
        
        Raises:
            ValueError: If the lengths of metadatas and documents do not match.
        """
        documents = documents if isinstance(documents, list) else [documents]
        if metadatas:
            metadatas = metadatas if isinstance(metadatas, list) else [metadatas]
            if len(metadatas) != len(documents):
                raise ValueError("metadatas and documents should have the same length")
        else:
            metadatas = None
        
        if self.hash_text_uuid:
            ids = [self._generate_sha256_hash_from_text(doc) for doc in documents]
        else:
            ids = [str(uuid.uuid4()) for _ in documents]
        
        # Look up which IDs are already stored. `ids` itself is never reordered
        # or deduplicated, so it stays aligned with `documents` and `metadatas`.
        existing_ids = set(self.collection.get(ids=list(set(ids)))["ids"])
        
        # Keep the first occurrence of each new ID with its own document and metadata.
        seen = set(existing_ids)
        new_documents, new_metadatas, new_ids = [], [], []
        for i, (doc, doc_id) in enumerate(zip(documents, ids)):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            new_documents.append(doc)
            new_ids.append(doc_id)
            if metadatas is not None:
                new_metadatas.append(metadatas[i])
        
        if not new_ids:
            if self.enable_logging:
                print("All documents already exist in the collection.")
            return
        if self.enable_logging:
            print(f"Total {len(documents)} | Adding {len(new_ids)} new documents to the collection.")
        
        if metadatas is not None:
            self.collection.add(documents=new_documents, ids=new_ids, metadatas=new_metadatas)
        else:
            self.collection.add(documents=new_documents, ids=new_ids)
        
    def index_from_doc_loader(self, doc_loader: DocLoader):
        """
        Index documents from a DocLoader object.
        
        Args:
            doc_loader (DocLoader): DocLoader object to load documents from.
        """        
        docs = list(doc_loader.docs)  # Convert generator to list
        for file, documents, metadatas in tqdm(docs, desc="Indexing documents"):
            if self.enable_logging:
                print(f"\nIndexing documents from {file}.")
            self.add(documents=documents, metadatas=metadatas)
        if self.enable_logging:
            print("\nIndexing completed, total files indexed: {}.".format(len(docs)))
            
        
    def peek(self, n=10):
        """
        Get the first n items in the collection.
        
        Args:
            n (int): Number of items to return.
        
        Returns:
            dict: ChromaDB's result for the first n items, with keys such as 'ids', 'documents', and 'metadatas'.
        """
        return self.collection.peek(n)
    
    def clear(self):
        """
        Clear the collection and create a new one.
        """
        print(f"Deleting collection {self.collection_name} in the database {self.db_path}.")
        self.client.delete_collection(name=self.collection_name)
        self.collection = self._create_collection()

    def delete(self):
        """
        Delete the collection.
        """
        print(f"Deleting collection {self.collection_name} in the database {self.db_path}.")
        self.client.delete_collection(name=self.collection_name)

    @staticmethod
    def exists(db_path: str, collection_name: str) -> bool:
        """
        Check whether the database and the collection exist.
        
        Args:
            db_path (str): Path to the database.
            collection_name (str): Name of the collection.
        
        Returns:
            bool: True if the collection exists, False otherwise.
        """
        
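        if not os.path.exists(db_path):
            return False
        client = chromadb.PersistentClient(path=db_path)
        try:
            client.get_collection(name=collection_name)
            return True
        except Exception:
            # get_collection raises when the collection is missing; the exact
            # exception type may differ between ChromaDB versions.
            return False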

    @staticmethod
    def _generate_sha256_hash_from_text(text: str) -> str:
        """
        Generate a SHA-256 hash from the given text.
        
        Args:
            text (str): The text to hash.
        
        Returns:
            str: The SHA-256 hash of the text.
        """
        return hashlib.sha256(text.encode('utf-8')).hexdigest()
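
The `hash_text_uuid` option derives each ID from the document text, which is what makes repeated `add` calls skip text that is already stored. The sketch below shows the idea without ChromaDB: a plain dict stands in for the collection, and `content_id` and `add_once` are hypothetical names used only here.

```python
import hashlib
import uuid


def content_id(text: str) -> str:
    """Deterministic ID: the same text always maps to the same SHA-256 hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


store: dict[str, str] = {}  # stand-in for a collection: id -> document


def add_once(text: str) -> bool:
    """Store text under its content ID; return False if it was already present."""
    doc_id = content_id(text)
    if doc_id in store:
        return False
    store[doc_id] = text
    return True


first = add_once("hello")
second = add_once("hello")  # same text, same ID, so nothing new is stored
random_ids_differ = uuid.uuid4() != uuid.uuid4()  # random UUIDs give no such guarantee

print(first, second, len(store), random_ids_differ)
```

With random UUIDs, indexing the same file twice would store every chunk twice, because nothing links the second ID to the first.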


In [8]:
si = StoreIndex(db_path='db',collection_name='test',enable_logging=False)
# si.client.list_collections()

In [13]:
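si.add(documents=['hello','world'],metadatas=[{'tag':'2'},{'tag':'3'}])
# Chinese samples: "Hello there", "World", "This is a note about mathematics",
# "Calculus is the branch of mathematics that studies change"
si.add(documents=['你好啊','世界','这是一条关于数学的笔记','微积分是研究变化的数学分支'])
# Chinese samples: "Test 1", "Test 2"
si.add(documents=['测试1','测试2'],metadatas=[None,{'tag':'2'}])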

In [8]:
si.collection.query(
    query_texts=["世界"],
    n_results=3,
)

{'ids': [['33650a369521ec29f2e26c43d25967535bcb26436755f536735d1ef6e84a1ec5',
   '486ea46224d1bb4fb680f34f7c9ad96a8f24ec88be73ea8e5a6c65260e9cb8a7',
   '2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824']],
 'distances': [[0.0, 0.7137184825340257, 1.4609405712453656]],
 'metadatas': [[None, {'tag': '3'}, {'tag': '2'}]],
 'embeddings': None,
 'documents': [['世界', 'world', 'hello']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}
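
The query result is a dict of parallel lists, with one inner list per query text. The sketch below copies the relevant fields from the output above and zips them into one row per hit; `rows_for_query` is a hypothetical helper name, not part of ChromaDB.

```python
# Fields copied from the query output above (one inner list per query text).
result = {
    "ids": [[
        "33650a369521ec29f2e26c43d25967535bcb26436755f536735d1ef6e84a1ec5",
        "486ea46224d1bb4fb680f34f7c9ad96a8f24ec88be73ea8e5a6c65260e9cb8a7",
        "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824",
    ]],
    "distances": [[0.0, 0.7137184825340257, 1.4609405712453656]],
    "metadatas": [[None, {"tag": "3"}, {"tag": "2"}]],
    "documents": [["世界", "world", "hello"]],
}


def rows_for_query(result: dict, query_index: int = 0) -> list[dict]:
    """Combine the parallel lists for one query into a list of per-hit dicts."""
    return [
        {"id": doc_id, "document": doc, "metadata": meta, "distance": dist}
        for doc_id, doc, meta, dist in zip(
            result["ids"][query_index],
            result["documents"][query_index],
            result["metadatas"][query_index],
            result["distances"][query_index],
        )
    ]


rows = rows_for_query(result)
for row in rows:
    print(f"{row['distance']:.3f}  {row['document']!r}  {row['metadata']}")
```

A smaller distance means a closer match: the exact text "世界" comes back at distance 0.0, its English counterpart "world" is next, and "hello" is furthest away.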

#### Load Documents and Index Them

In [15]:
dl = DocLoader(path='doc',chunk_size=100,chunk_overlap=50)
si = StoreIndex(db_path='db',collection_name='play',enable_logging=False)
si.index_from_doc_loader(dl)


Loading files from doc

Processing PDFs
Skipping doc/example_doc.pdf as it has already been converted to markdown
Skipping doc/subdir/subdir_example_doc.pdf as it has already been converted to markdown


Indexing documents: 4it [00:00, 498.08it/s]
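
`index_from_doc_loader` relies only on `doc_loader.docs` yielding `(file, documents, metadatas)` tuples, one per file. Since `rag_utils.DocLoader` is not shown in this notebook, the stand-in below is hypothetical: the class name `StubDocLoader` and the metadata keys `source` and `chunk` are assumptions made only to illustrate the expected shape.

```python
from typing import Iterator


class StubDocLoader:
    """Hypothetical stand-in for rag_utils.DocLoader; only the shape of `docs` matters."""

    def __init__(self, files: dict[str, list[str]]):
        self._files = files  # maps a file path to its already-split chunks

    @property
    def docs(self) -> Iterator[tuple[str, list[str], list[dict]]]:
        # Yield one (file, documents, metadatas) tuple per file.
        for path, chunks in self._files.items():
            metadatas = [{"source": path, "chunk": i} for i in range(len(chunks))]
            yield path, chunks, metadatas


loader = StubDocLoader({
    "doc/a.md": ["first chunk", "second chunk"],
    "doc/b.md": ["only chunk"],
})
batches = list(loader.docs)
total_chunks = sum(len(chunks) for _, chunks, _ in batches)
print(len(batches), "files,", total_chunks, "chunks")
```

Each tuple becomes one `add` call, so the progress bar counts files rather than individual chunks.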


In [10]:
res = si.collection.query(
    query_texts=["how to do Meditate"],
    n_results=2,
)['documents']

for r in res[0]:
    print(r)
    print('-'*80)

### Basic Steps to Meditate  
1. **Choose a Comfortable Position**: Sit or lie down in a comfortable position. Keep your back straight to promote alertness.  
2. **Close Your Eyes**: Gently close your eyes or lower your gaze to minimize distractions.  
3. **Focus on Your Breath**: Take a few deep breaths, inhaling through your nose and exhaling through your mouth. Pay attention to the sensation of your breath entering and leaving your body.  
4. **Acknowledge Thoughts**: As thoughts arise, acknowledge them without judgment and gently return your focus to your breath.
--------------------------------------------------------------------------------
## A Simple Meditation Practice: 10-Minute Guided Session  
1. **Preparation (1 minute)**: Sit comfortably, close your eyes, and take a few deep breaths to center yourself.  
2. **Breath Awareness (5 minutes)**: Focus on your breath. Inhale deeply, hold for a moment, and exhale slowly. If your mind wanders, gently bring it back to your breath.

In [11]:
res = si.collection.query(
    query_texts=["what is Meditate"],
    n_results=2,
)['documents']

for r in res[0]:
    print(r)
    print('-'*80)

## Understanding Meditation  
At its core, meditation is a practice that involves focusing the mind and eliminating distractions to achieve a state of heightened awareness and tranquility. While it may seem simple, meditation encompasses a wide range of techniques and approaches, each designed to foster mindfulness and self-awareness.
--------------------------------------------------------------------------------
### Basic Steps to Meditate  
1. **Choose a Comfortable Position**: Sit or lie down in a comfortable position. Keep your back straight to promote alertness.  
2. **Close Your Eyes**: Gently close your eyes or lower your gaze to minimize distractions.  
3. **Focus on Your Breath**: Take a few deep breaths, inhaling through your nose and exhaling through your mouth. Pay attention to the sensation of your breath entering and leaving your body.  
4. **Acknowledge Thoughts**: As thoughts arise, acknowledge them without judgment and gently return your focus to your breath.
--------------------------------------------------------------------------------