## Recursive Abstractive Processing for Tree Organized Retrieval 
RAPTOR introduces a novel approach to retrieval-augmented language models by constructing a recursive tree structure from documents. This enables more efficient and context-aware information retrieval from large texts, addressing key limitations of traditional language models.

The RAPTOR paper proposes the following approach to indexing and retrieving documents:
* The leaves are the set of starting documents.
* The leaves are embedded and clustered.
* Each cluster is then summarized into a higher-level (more abstract) consolidation of information from related documents.
* This is repeated recursively, producing a "tree" that runs from raw documents (leaves) up to increasingly abstract summaries.


This tree structure is central to RAPTOR because it captures both high-level and detailed aspects of the text, which is especially helpful for thematic questions and multi-step reasoning in question-answering tasks.

Documents are segmented into shorter texts known as chunks, which are then embedded using an embedding model. A clustering method is then used to group these embeddings together. After clusters are formed, the text linked with each cluster is summarized using an LLM.

The summaries are created as nodes in a tree, with higher-level nodes delivering more abstract summaries.
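
As a rough illustration of the recursion described above, the sketch below builds the levels of such a tree. The `cluster` and `summarize` callables are hypothetical stand-ins (simple pairing and string joining) for the embedding-based clustering and LLM summarization used later in this notebook; this is a toy, not the RAPTOR implementation itself.

```python
from typing import Callable, Dict, List


def build_tree(
    leaves: List[str],
    cluster: Callable[[List[str]], List[List[str]]],
    summarize: Callable[[List[str]], str],
    max_levels: int = 3,
) -> Dict[int, List[str]]:
    """Return {level: node texts}; level 0 holds the leaf chunks."""
    tree = {0: list(leaves)}
    current = list(leaves)
    for level in range(1, max_levels + 1):
        if len(current) <= 1:
            break  # a single node is already the root
        groups = cluster(current)
        current = [summarize(group) for group in groups]
        tree[level] = current
    return tree


# Toy stand-ins (assumptions): pair up neighbours and "summarize" by joining.
def pair_cluster(texts: List[str]) -> List[List[str]]:
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]


def join_summary(group: List[str]) -> str:
    return "(" + " + ".join(group) + ")"


tree = build_tree(["a", "b", "c", "d"], pair_cluster, join_summary)
for level, nodes in tree.items():
    print(level, nodes)
```

Each level has fewer nodes than the one below it, and the loop stops once a single root summary remains or `max_levels` is reached.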


Here's the paper link: [RAPTOR](https://arxiv.org/html/2401.18059v1)

## Install the Libraries

In [1]:
# !pip install langchain torch sentence_transformers pypdf umap-learn langchain-community langchain-cohere tiktoken langchain-huggingface langchain-groq

In [2]:
# %pip install --upgrade --quiet  langchain_milvus
# !pip uninstall -y grpcio pymilvus
# !pip install grpcio==1.60.1 pymilvus

In [1]:
import os
import uuid
import umap
import base64
import tiktoken
import re
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from typing import Dict, List, Optional, Tuple

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain_groq import ChatGroq

## Define the LLM Model:
* Here we use the Groq API to access the open-source Llama 3 model.
* Groq, known for its high-performance AI accelerators, provides an efficient and scalable platform for running LLM inference.
* Llama 3 is the language model that writes the cluster summaries in this notebook.

In [3]:
import os
os.environ['GROQ_API_KEY'] = 'YOUR_GROQ_API_KEY'

model = ChatGroq(model_name="Llama3-8b-8192")

### Load the data
Here, we use legal books as input data. Each PDF file is over 300 pages. Links to the PDFs are provided below:
* [Family Law](https://lawfaculty.du.ac.in/userfiles/downloads/LLBCM/Ist%20Term_Family%20Law-%20I_LB105_2023.pdf)
* [Administrative Law](https://lawfaculty.du.ac.in/userfiles/downloads/LLBCM/IVth%20Term_Administrative%20Law_LB%20402_2023.pdf)
* [Labour Law](https://www.icsi.edu/media/webmodules/Labour_Laws&_Practice.pdf)

In [4]:
# Load PDF files from a directory
loader = PyPDFDirectoryLoader("/content/drive/MyDrive/rag_with_raptor/books")
docs = loader.load()



In [5]:
# Extract text content from the documents
texts = [doc.page_content for doc in docs]

In [6]:
## Display the content of the texts
texts[:10]

['     \n    LL.B. I Term    LB – 105: Family Law – I   Cases Selected and Edited by Usha Tandon Kiran Gupta Vandana Manju Relan P.B. Pankaja Pinki Sharma Neha Aneja     FACULTY OF LAW UNIVERSITY OF DELHI, DELHI-110007 January, 2023  (For private use only in the course of instruction) \n',
 '   Semester- First Course Name- Family Law-I Course Code- LB-105 Core course: 5 credits; Classes – 64 (4 Classes/week + Tutorial) Course Objectives: 1. To create awareness and educate the students about rights and duties of members of family towards each other, with special reference to spousal relationship. 2. To give overview to the students and enhance their understanding on the current laws on marriage, divorce, maintenance, adoption and guardianship. 3. To give practical exposure to students by field visit of Family Courts, Mediation and Conciliation Centres etc.  Course Learning Outcomes:  1. Students will be able to practice in Law Courts as a specialized Matrimonial Lawyer. 2. Students will

## Text Cleaning Steps:
* Convert Text to Lowercase
* Remove Punctuation and Special Characters
* Tokenize Text into Words
* Remove Stopwords
* Lemmatize Words

In [8]:
import re
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk

# Make sure to download the required NLTK data files
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Remove everything except letters and whitespace (this also drops digits)
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize text
    words = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Apply lemmatization
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    # Join words back to a single string
    processed_text = ' '.join(words)

    return processed_text


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [9]:
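# Clean every page. Note: this rebinds the name `preprocess_text` from the
# function to the list of cleaned texts, so re-running this cell raises a TypeError.
preprocess_text=[preprocess_text(doc) for doc in texts]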

In [10]:
preprocess_text[:10]

['llb term lb family law case selected edited usha tandon kiran gupta vandana manju relan pb pankaja pinki sharma neha aneja faculty law university delhi delhi january private use course instruction',
 'semester first course name family lawi course code lb core course credit class classesweek tutorial course objective create awareness educate student right duty member family towards special reference spousal relationship give overview student enhance understanding current law marriage divorce maintenance adoption guardianship give practical exposure student field visit family court mediation conciliation centre etc course learning outcome student able practice law court specialized matrimonial lawyer student able join research house especially issue relating woman child domestic international level unit marriage hindu law concept marriage general nature hindu marriage applicability legislation section hma condition validity marriage section hma solemnisation marriage special reference 

## Create reference document chunks
Typically for RAG, large texts are broken down into smaller chunks at ingest time. Given a user query, only the most relevant chunks are retrieved and passed to the LLM as context. So, as a next step, we chunk our reference texts before embedding them and ingesting them into the vector store.


We use the `from_tiktoken_encoder` method of the `RecursiveCharacterTextSplitter` class in LangChain. The text is split recursively on character separators, and the resulting pieces are merged into chunks whose length, measured in tokens by the tiktoken tokenizer, stays within the specified `chunk_size`. Some overlap between chunks has been shown to improve retrieval, so we set `chunk_overlap` to 30. With this constructor, both `chunk_size` and `chunk_overlap` are counted in tokens, not characters.
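
To make the roles of the chunk size and overlap concrete, here is a minimal sliding-window sketch. It assumes whitespace-separated words as stand-in tokens (the notebook itself counts `cl100k_base` tokens) and ignores the separator-aware splitting that `RecursiveCharacterTextSplitter` performs, so treat it as an illustration only. The function name `chunk_tokens` is hypothetical.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Assumption: words stand in for tokens.
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Consecutive chunks share `chunk_overlap` tokens, so a sentence cut at a chunk boundary still appears with some of its context in the next chunk.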

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split text by tokens using the tiktoken tokenizer
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", keep_separator=False, chunk_size=400,   chunk_overlap=30
)

def split_texts(texts):
    chunked_texts = []
    for text in texts:
        chunks = text_splitter.create_documents([text])
        chunked_texts.extend([chunk.page_content for chunk in chunks])
    return chunked_texts

In [12]:
# Split the preprocessed texts into chunks
docs_texts = split_texts(preprocess_text)

In [13]:
docs_texts[:5]

['llb term lb family law case selected edited usha tandon kiran gupta vandana manju relan pb pankaja pinki sharma neha aneja faculty law university delhi delhi january private use course instruction',
 'semester first course name family lawi course code lb core course credit class classesweek tutorial course objective create awareness educate student right duty member family towards special reference spousal relationship give overview student enhance understanding current law marriage divorce maintenance adoption guardianship give practical exposure student field visit family court mediation conciliation centre etc course learning outcome student able practice law court specialized matrimonial lawyer student able join research house especially issue relating woman child domestic international level unit marriage hindu law concept marriage general nature hindu marriage applicability legislation section hma condition validity marriage section hma solemnisation marriage special reference 

## Embedding model: we use SBERT

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, they require that both sentences be fed into the network together, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (10,000 × 9,999 / 2 pairs, roughly 65 hours) with BERT. This makes BERT unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.

Sentence-BERT (SBERT) is a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort of finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.
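
Because each sentence is embedded once and comparisons happen on the vectors, ranking candidates reduces to cheap cosine-similarity computations. The sketch below shows that comparison on hypothetical 3-dimensional vectors; real SBERT embeddings have hundreds of dimensions, and the values here are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical "sentence embeddings" (assumed values, not model output).
query = [1.0, 0.0, 1.0]
candidates = {
    "similar sentence": [0.9, 0.1, 1.1],
    "unrelated sentence": [0.0, 1.0, 0.0],
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```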

In [14]:
embd = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-MiniLM-L3-v2")



In [15]:
from langchain.prompts import PromptTemplate
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.chains import RetrievalQA

## Define every RAPTOR phase.

1. **Global Clustering with UMAP**: Reduces the dimensionality of the input embeddings globally using UMAP (Uniform Manifold Approximation and Projection). Returns a numpy array of the embeddings reduced to the specified dimensionality.
2. **Local Clustering with UMAP**: Performs local dimensionality reduction on the embeddings using UMAP after global clustering. Returns a numpy array of the embeddings reduced to the specified dimensionality.
3. **Determine Optimal Number of Clusters**: Determines the optimal number of clusters using the Bayesian Information Criterion (BIC) with a Gaussian Mixture Model. Returns an integer representing the optimal number of clusters found.
4. **Gaussian Mixture Model Clustering**: Clusters embeddings using a Gaussian Mixture Model (GMM) based on a probability threshold. Returns a tuple containing the cluster labels and the number of clusters determined.
5. **Perform Clustering**: Performs clustering by first reducing dimensionality globally, clustering with GMM, and then performing local clustering within each global cluster. Returns a list of numpy arrays, where each array contains the cluster IDs for each embedding.
6. **Generate Embeddings for Texts**: Generates embeddings for a list of text documents. Returns a numpy array of embeddings for the given text documents.
7. **Embed and Cluster Texts**: Embeds a list of texts and clusters them. Returns a DataFrame containing the original texts, their embeddings, and the assigned cluster labels.
8. **Format Texts for Summarization**: Formats the text documents in a DataFrame into a single string. Returns a single string where all text documents are joined by a specific delimiter.
9. **Embed, Cluster, and Summarize Texts**: Embeds, clusters, and summarizes a list of texts. Returns a tuple of two DataFrames: one with clusters and one with summaries.
10. **Recursive Embed, Cluster, and Summarize Texts**: Recursively embeds, clusters, and summarizes texts up to a specified level or until the number of unique clusters becomes 1. Returns a dictionary where keys are the recursion levels and values are tuples containing the clusters DataFrame and summaries DataFrame at that level.
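
Step 4 uses soft assignment: a text can belong to more than one cluster whenever its membership probability exceeds the threshold. The sketch below isolates that thresholding step on a hand-written probability matrix; in the notebook, the probabilities come from `GaussianMixture.predict_proba`. The helper name `assign_clusters` and the probability values are assumptions for illustration.

```python
from typing import List


def assign_clusters(probs: List[List[float]], threshold: float) -> List[List[int]]:
    """For each point, keep every cluster whose membership probability exceeds `threshold`."""
    return [
        [cluster_id for cluster_id, p in enumerate(row) if p > threshold]
        for row in probs
    ]


# Hypothetical membership probabilities for 3 points over 3 clusters.
probs = [
    [0.95, 0.03, 0.02],  # clearly cluster 0
    [0.45, 0.50, 0.05],  # straddles clusters 0 and 1
    [0.05, 0.05, 0.90],  # clearly cluster 2
]
labels = assign_clusters(probs, threshold=0.1)
print(labels)  # [[0], [0, 1], [2]]
```

The second point lands in two clusters, which is why the later summarization code expands each row into one document-cluster pair per label.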

In [16]:
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import umap
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from sklearn.mixture import GaussianMixture

RANDOM_SEED = 224  # Fixed seed for reproducibility

### --- Code from citations referenced above (added comments and docstrings) --- ###


def global_cluster_embeddings(
    embeddings: np.ndarray,
    dim: int,
    n_neighbors: Optional[int] = None,
    metric: str = "cosine",
) -> np.ndarray:
    """
    Perform global dimensionality reduction on the embeddings using UMAP.

    Parameters:
    - embeddings: The input embeddings as a numpy array.
    - dim: The target dimensionality for the reduced space.
    - n_neighbors: Optional; the number of neighbors to consider for each point.
                   If not provided, it defaults to the square root of the number of embeddings.
    - metric: The distance metric to use for UMAP.

    Returns:
    - A numpy array of the embeddings reduced to the specified dimensionality.
    """
    if n_neighbors is None:
        n_neighbors = int((len(embeddings) - 1) ** 0.5)
    return umap.UMAP(
        n_neighbors=n_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)


def local_cluster_embeddings(
    embeddings: np.ndarray, dim: int, num_neighbors: int = 10, metric: str = "cosine"
) -> np.ndarray:
    """
    Perform local dimensionality reduction on the embeddings using UMAP, typically after global clustering.

    Parameters:
    - embeddings: The input embeddings as a numpy array.
    - dim: The target dimensionality for the reduced space.
    - num_neighbors: The number of neighbors to consider for each point.
    - metric: The distance metric to use for UMAP.

    Returns:
    - A numpy array of the embeddings reduced to the specified dimensionality.
    """
    return umap.UMAP(
        n_neighbors=num_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)


def get_optimal_clusters(
    embeddings: np.ndarray, max_clusters: int = 50, random_state: int = RANDOM_SEED
) -> int:
    """
    Determine the optimal number of clusters using the Bayesian Information Criterion (BIC) with a Gaussian Mixture Model.

    Parameters:
    - embeddings: The input embeddings as a numpy array.
    - max_clusters: The maximum number of clusters to consider.
    - random_state: Seed for reproducibility.

    Returns:
    - An integer representing the optimal number of clusters found.
    """
    max_clusters = min(max_clusters, len(embeddings))
    n_clusters = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters:
        gm = GaussianMixture(n_components=n, random_state=random_state)
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    return n_clusters[np.argmin(bics)]


def GMM_cluster(embeddings: np.ndarray, threshold: float, random_state: int = 0):
    """
    Cluster embeddings using a Gaussian Mixture Model (GMM) based on a probability threshold.

    Parameters:
    - embeddings: The input embeddings as a numpy array.
    - threshold: The probability threshold for assigning an embedding to a cluster.
    - random_state: Seed for reproducibility.

    Returns:
    - A tuple containing the cluster labels and the number of clusters determined.
    """
    n_clusters = get_optimal_clusters(embeddings)
    gm = GaussianMixture(n_components=n_clusters, random_state=random_state)
    gm.fit(embeddings)
    probs = gm.predict_proba(embeddings)
    labels = [np.where(prob > threshold)[0] for prob in probs]
    return labels, n_clusters


def perform_clustering(
    embeddings: np.ndarray,
    dim: int,
    threshold: float,
) -> List[np.ndarray]:
    """
    Perform clustering on the embeddings by first reducing their dimensionality globally, then clustering
    using a Gaussian Mixture Model, and finally performing local clustering within each global cluster.

    Parameters:
    - embeddings: The input embeddings as a numpy array.
    - dim: The target dimensionality for UMAP reduction.
    - threshold: The probability threshold for assigning an embedding to a cluster in GMM.

    Returns:
    - A list of numpy arrays, where each array contains the cluster IDs for each embedding.
    """
    if len(embeddings) <= dim + 1:
        # Avoid clustering when there's insufficient data
        return [np.array([0]) for _ in range(len(embeddings))]

    # Global dimensionality reduction
    reduced_embeddings_global = global_cluster_embeddings(embeddings, dim)
    # Global clustering
    global_clusters, n_global_clusters = GMM_cluster(
        reduced_embeddings_global, threshold
    )

    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
    total_clusters = 0

    # Iterate through each global cluster to perform local clustering
    for i in range(n_global_clusters):
        # Extract embeddings belonging to the current global cluster
        global_cluster_embeddings_ = embeddings[
            np.array([i in gc for gc in global_clusters])
        ]

        if len(global_cluster_embeddings_) == 0:
            continue
        if len(global_cluster_embeddings_) <= dim + 1:
            # Handle small clusters with direct assignment
            local_clusters = [np.array([0]) for _ in global_cluster_embeddings_]
            n_local_clusters = 1
        else:
            # Local dimensionality reduction and clustering
            reduced_embeddings_local = local_cluster_embeddings(
                global_cluster_embeddings_, dim
            )
            local_clusters, n_local_clusters = GMM_cluster(
                reduced_embeddings_local, threshold
            )

        # Assign local cluster IDs, adjusting for total clusters already processed
        for j in range(n_local_clusters):
            local_cluster_embeddings_ = global_cluster_embeddings_[
                np.array([j in lc for lc in local_clusters])
            ]
            indices = np.where(
                (embeddings == local_cluster_embeddings_[:, None]).all(-1)
            )[1]
            for idx in indices:
                all_local_clusters[idx] = np.append(
                    all_local_clusters[idx], j + total_clusters
                )

        total_clusters += n_local_clusters

    return all_local_clusters


### --- Our code below --- ###


def embed(texts):
    """
    Generate embeddings for a list of text documents.

    This function assumes the existence of an `embd` object with a method `embed_documents`
    that takes a list of texts and returns their embeddings.

    Parameters:
    - texts: List[str], a list of text documents to be embedded.

    Returns:
    - numpy.ndarray: An array of embeddings for the given text documents.
    """
    text_embeddings = embd.embed_documents(texts)
    text_embeddings_np = np.array(text_embeddings)
    return text_embeddings_np


def embed_cluster_texts(texts):
    """
    Embeds a list of texts and clusters them, returning a DataFrame with texts, their embeddings, and cluster labels.

    This function combines embedding generation and clustering into a single step. It assumes the existence
    of a previously defined `perform_clustering` function that performs clustering on the embeddings.

    Parameters:
    - texts: List[str], a list of text documents to be processed.

    Returns:
    - pandas.DataFrame: A DataFrame containing the original texts, their embeddings, and the assigned cluster labels.
    """
    text_embeddings_np = embed(texts)  # Generate embeddings
    cluster_labels = perform_clustering(
        text_embeddings_np, 10, 0.1
    )  # Perform clustering on the embeddings
    df = pd.DataFrame()  # Initialize a DataFrame to store the results
    df["text"] = texts  # Store original texts
    df["embd"] = list(text_embeddings_np)  # Store embeddings as a list in the DataFrame
    df["cluster"] = cluster_labels  # Store cluster labels
    return df


def fmt_txt(df: pd.DataFrame) -> str:
    """
    Formats the text documents in a DataFrame into a single string.

    Parameters:
    - df: DataFrame containing the 'text' column with text documents to format.

    Returns:
    - A single string where all text documents are joined by a specific delimiter.
    """
    unique_txt = df["text"].tolist()
    return "--- --- \n --- --- ".join(unique_txt)


def embed_cluster_summarize_texts(
    texts: List[str], level: int
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Embeds, clusters, and summarizes a list of texts. This function first generates embeddings for the texts,
    clusters them based on similarity, expands the cluster assignments for easier processing, and then summarizes
    the content within each cluster.

    Parameters:
    - texts: A list of text documents to be processed.
    - level: An integer parameter that could define the depth or detail of processing.

    Returns:
    - Tuple containing two DataFrames:
      1. The first DataFrame (`df_clusters`) includes the original texts, their embeddings, and cluster assignments.
      2. The second DataFrame (`df_summary`) contains summaries for each cluster, the specified level of detail,
         and the cluster identifiers.
    """

    # Embed and cluster the texts, resulting in a DataFrame with 'text', 'embd', and 'cluster' columns
    df_clusters = embed_cluster_texts(texts)

    # Prepare to expand the DataFrame for easier manipulation of clusters
    expanded_list = []

    # Expand DataFrame entries to document-cluster pairings for straightforward processing
    for index, row in df_clusters.iterrows():
        for cluster in row["cluster"]:
            expanded_list.append(
                {"text": row["text"], "embd": row["embd"], "cluster": cluster}
            )

    # Create a new DataFrame from the expanded list
    expanded_df = pd.DataFrame(expanded_list)

    # Retrieve unique cluster identifiers for processing
    all_clusters = expanded_df["cluster"].unique()

    print(f"--Generated {len(all_clusters)} clusters--")

    # Summarization
    template = """Please summarize the paragraph without changing the context.
     If the solution is not available in the text,
     explain that you are not sure. Do not make up any information...

    Documentation:
    {context}
    """
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model | StrOutputParser()

    # Format text within each cluster for summarization
    summaries = []
    for i in all_clusters:
        df_cluster = expanded_df[expanded_df["cluster"] == i]
        formatted_txt = fmt_txt(df_cluster)
        summaries.append(chain.invoke({"context": formatted_txt}))

    # Create a DataFrame to store summaries with their corresponding cluster and level
    df_summary = pd.DataFrame(
        {
            "summaries": summaries,
            "level": [level] * len(summaries),
            "cluster": list(all_clusters),
        }
    )

    return df_clusters, df_summary


def recursive_embed_cluster_summarize(
    texts: List[str], level: int = 1, n_levels: int = 3
) -> Dict[int, Tuple[pd.DataFrame, pd.DataFrame]]:
    """
    Recursively embeds, clusters, and summarizes texts up to a specified level or until
    the number of unique clusters becomes 1, storing the results at each level.

    Parameters:
    - texts: List[str], texts to be processed.
    - level: int, current recursion level (starts at 1).
    - n_levels: int, maximum depth of recursion.

    Returns:
    - Dict[int, Tuple[pd.DataFrame, pd.DataFrame]], a dictionary where keys are the recursion
      levels and values are tuples containing the clusters DataFrame and summaries DataFrame at that level.
    """
    results = {}  # Dictionary to store results at each level

    # Perform embedding, clustering, and summarization for the current level
    df_clusters, df_summary = embed_cluster_summarize_texts(texts, level)

    # Store the results of the current level
    results[level] = (df_clusters, df_summary)

    # Determine if further recursion is possible and meaningful
    unique_clusters = df_summary["cluster"].nunique()
    if level < n_levels and unique_clusters > 1:
        # Use summaries as the input texts for the next level of recursion
        new_texts = df_summary["summaries"].tolist()
        next_level_results = recursive_embed_cluster_summarize(
            new_texts, level + 1, n_levels
        )

        # Merge the results from the next level into the current results dictionary
        results.update(next_level_results)

    return results

## Build Tree

In [17]:
leaf_texts = docs_texts
results = recursive_embed_cluster_summarize(leaf_texts, level=1, n_levels=7)

--Generated 316 clusters--
--Generated 61 clusters--
--Generated 11 clusters--
--Generated 1 clusters--


## Generate final summaries

1. **Tree Traversal Retrieval:** Tree traversal starts at the root level of the tree and selects the top-k nodes by cosine similarity between the query embedding and the node embeddings. At each subsequent level, it selects the top-k nodes from among the children of the previously selected nodes.
2. **Collapsed Tree Retrieval:** Collapsed tree retrieval is a much simpler method. It collapses the whole tree into a single layer and retrieves nodes, ranked by cosine similarity to the query vector, until a threshold number of tokens is reached.

The next cell prepares the collapsed-tree variant by gathering the leaf chunks and the summaries from every level into one flat list.
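
A minimal sketch of collapsed tree retrieval under simplifying assumptions: nodes are `(text, embedding)` pairs with made-up 2-dimensional vectors, and whitespace-separated words stand in for tokens. The function name `collapsed_tree_retrieve` is hypothetical. In the notebook, the vector store's similarity search plays this role.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def collapsed_tree_retrieve(
    query_vec: Sequence[float],
    nodes: List[Tuple[str, Sequence[float]]],
    max_tokens: int,
) -> List[str]:
    """Rank all nodes (leaves and summaries alike) and take them until the token budget is hit."""
    ranked = sorted(nodes, key=lambda node: cosine(query_vec, node[1]), reverse=True)
    selected, used = [], 0
    for text, _ in ranked:
        n_tokens = len(text.split())  # assumption: words stand in for tokens
        if used + n_tokens > max_tokens:
            break
        selected.append(text)
        used += n_tokens
    return selected


# Hypothetical nodes from different tree levels, flattened into one list.
nodes = [
    ("leaf: article 14 guarantees equality before law", [1.0, 0.0]),
    ("leaf: minimum wage act sets wage floors", [0.0, 1.0]),
    ("summary: fundamental rights overview", [0.8, 0.2]),
]
selected = collapsed_tree_retrieve([1.0, 0.1], nodes, max_tokens=12)
print(selected)
```

Because leaves and summaries compete in the same ranking, a query can be answered by a detailed chunk, an abstract summary, or a mix of both.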

In [18]:
all_texts= leaf_texts.copy()

# Iterate through the results to extract summaries from each level and add them to all_texts
for level in sorted(results.keys()):
    # Extract summaries from the current level's DataFrame
    summaries = results[level][1]["summaries"].tolist()
    # Extend all_texts with the summaries from the current level
    all_texts.extend(summaries)

In [33]:
import ast


# Write list to a file
with open('summary.txt', 'w') as file:
    file.write(str(all_texts))
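
The cell above imports `ast` without using it. One plausible use (an assumption, not something the notebook shows) is reading the saved list back with `ast.literal_eval`, which safely parses the Python literal that `str(all_texts)` wrote. The sketch below round-trips a small sample list through a temporary file; `sample_texts` and the temporary path are illustrative.

```python
import ast
import os
import tempfile

# Sample data standing in for `all_texts` (assumption).
sample_texts = ["first chunk", "a summary with 'quotes' and a\nnewline"]

# Write the list the same way the notebook does: as its string representation.
path = os.path.join(tempfile.mkdtemp(), "summary.txt")
with open(path, "w", encoding="utf-8") as file:
    file.write(str(sample_texts))

# Parse the literal back into a Python list without executing arbitrary code.
with open(path, "r", encoding="utf-8") as file:
    restored = ast.literal_eval(file.read())

print(restored == sample_texts)
```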


## Load the texts into vectorstore
* To store the vectors, we use the [Milvus](https://milvus.io/docs/) database.
* Milvus is a vector database designed specifically for storing and querying large amounts of vector data.
* It stands out for its performance and scalability, making it well suited to machine learning, deep learning, similarity search, and recommendation systems.

In [19]:
from langchain_milvus.vectorstores import Milvus
URI = "/content/drive/MyDrive/rag_with_raptor/database/milvus_rag.db"

vector_db = Milvus.from_texts(
    texts= all_texts,
    embedding=embd,
    connection_args={"uri": URI},
    # metadatas= chunks.metadatas
)

In [23]:
# Perform a similarity search
query = "Explain Article 14 of the Indian Constitution?"
docs = vector_db.similarity_search(query)
for doc in docs:
  print(doc.page_content)

The paragraph appears to be a collection of notes and points related to the Indian Constitution, industrial law, and labor rights. It discusses various topics such as the fundamental rights of citizens, the concept of socioeconomic justice, the importance of labor laws, and the role of the International Labor Organization (ILO) in promoting workers' rights.

The text mentions several articles and sections of the Indian Constitution, including Articles 19, 21, 23, and 24, as well as the Directive Principles of State Policy in Part IV of the Constitution. It also references various labor laws and regulations, such as the Factories Act, the Employee's State Insurance Act, and the Minimum Wage Act.

The paragraph also touches on the importance of social justice, the need for fair wages, and the right to collective bargaining and strike. It quotes from several court judgments and references international labor standards, including the ILO's Declaration on Fundamental Principles and Rights a

In [28]:
%pip install --upgrade --quiet pymilvus[model]

## Retrieval Techniques:

* Milvus Hybrid Search retriever: combines the advantages of dense and sparse vector searches.
* BM25 Retriever: BM25 (BM stands for "best matching") is a ranking function used by search engines to estimate the relevance of documents to a given search query.
* Dense Passage Retrieval (DPR): a set of tools and models for state-of-the-art open-domain Q&A research, based on the DPR paper.

### Milvus Hybrid Search retriever, which combines the strengths of both dense and sparse vector search
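
One common way to combine the two result lists is a weighted sum of normalized scores. The sketch below illustrates that idea only; it is not a description of how Milvus's `WeightedRanker` is implemented, and the scores, weights, and function names are made up.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    dense_weight: float,
    sparse_weight: float,
) -> List[str]:
    """Return document ids ordered by the weighted sum of normalized scores."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    combined = {
        doc_id: dense_weight * dense_n.get(doc_id, 0.0)
        + sparse_weight * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical scores: doc_a wins on semantics, doc_b wins on keyword overlap.
dense_scores = {"doc_a": 0.9, "doc_b": 0.6, "doc_c": 0.3}
sparse_scores = {"doc_a": 2.0, "doc_b": 12.0, "doc_c": 7.0}

print(weighted_fusion(dense_scores, sparse_scores, 0.5, 0.5))
print(weighted_fusion(dense_scores, sparse_scores, 0.9, 0.1))
```

Shifting weight toward the dense scores favours semantic matches, while shifting it toward the sparse scores favours exact keyword matches.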

In [29]:
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    WeightedRanker,
    connections,
)

In [30]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever
from langchain_milvus.utils.sparse import BM25SparseEmbedding
# from langchain_openai import ChatOpenAI, OpenAIEmbeddings

In [31]:
CONNECTION_URI = "/content/drive/MyDrive/rag_with_raptor/database/milvus_hybrid_search.db"

In [32]:
# dense_embedding_func = OpenAIEmbeddings()
dense_dim = len(embd.embed_query(all_texts[1]))
dense_dim

384

Note that the output of the sparse embedding is a set of sparse vectors, each mapping the index of a keyword in the input text to its weight.
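
To show what such an index-to-weight mapping can look like, here is a sketch that builds sparse vectors with the textbook Okapi BM25 term weight. It assumes whitespace tokenization and default `k1` and `b` values, and it is not necessarily what `BM25SparseEmbedding` computes internally. The function name `bm25_sparse_vectors` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bm25_sparse_vectors(
    corpus: List[str], k1: float = 1.5, b: float = 0.75
) -> Tuple[Dict[str, int], List[Dict[int, float]]]:
    """Return (vocabulary, one {term_index: BM25 weight} dict per document)."""
    if not corpus:
        return {}, []
    docs = [doc.lower().split() for doc in corpus]  # assumption: whitespace tokens
    vocab = {term: idx for idx, term in enumerate(sorted({t for d in docs for t in d}))}
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))

    vectors = []
    for d in docs:
        vec = {}
        for term, freq in Counter(d).items():
            # Rare terms get a high IDF; terms found in every document get a low one.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            tf_part = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(d) / avg_len))
            vec[vocab[term]] = idf * tf_part
        vectors.append(vec)
    return vocab, vectors


corpus = ["family law marriage", "labour law wages", "administrative law"]
vocab, vectors = bm25_sparse_vectors(corpus)
print(vocab)
print(vectors[0])
```

Only the terms that occur in a document get an entry, and a word such as "law" that appears in every document receives a much smaller weight than a distinctive word such as "family".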

In [34]:
sparse_embedding_func = BM25SparseEmbedding(corpus=all_texts)
sparse_embedding_func.embed_query(all_texts[1])

{2: 6.510754,
 3: 8.982522,
 4: 1.5737364,
 5: 0.06933469,
 27: 13.536732,
 29: 5.9989367,
 30: 2.0308025,
 31: 2.338879,
 32: 7.6103578,
 33: 2.4285266,
 34: 5.409163,
 35: 4.55588,
 36: 2.6417124,
 37: 7.6103578,
 38: 7.0992017,
 39: 2.0968235,
 40: 3.1927638,
 41: 4.352872,
 42: 3.274152,
 43: 18.542564,
 44: 1.2016146,
 45: 2.1381311,
 46: 2.2027373,
 47: 3.6011338,
 48: 7.4210925,
 49: 2.9137743,
 50: 7.6103578,
 51: 6.3681736,
 52: 3.804253,
 53: 4.55588,
 54: 4.524294,
 55: 3.5036724,
 56: 4.7303886,
 57: 15.917902,
 58: 2.376485,
 59: 2.4153028,
 60: 2.6740603,
 61: 4.3015594,
 62: 5.494893,
 63: 5.7618856,
 64: 3.8767748,
 65: 4.809526,
 66: 0.41468054,
 67: 5.7618856,
 68: 3.9814482,
 69: 4.406886,
 70: 2.5999458,
 71: 2.086702,
 72: 4.55588,
 73: 7.151829,
 74: 3.0236313,
 75: 4.6569633,
 76: 3.8281357,
 77: 4.5884743,
 78: 2.7769804,
 79: 5.207497,
 80: 1.7268566,
 81: 0.9344973,
 82: 2.3024037,
 83: 2.306403,
 84: 3.9632545,
 85: 3.1089575,
 86: 3.4037583,
 87: 3.4581056,


## Create Milvus Collection and load data

In [35]:
connections.connect(uri=CONNECTION_URI)

### Define field names and their data types

In [36]:
pk_field = "doc_id"
dense_field = "dense_vector"
sparse_field = "sparse_vector"
text_field = "text"
fields = [
    FieldSchema(
        name=pk_field,
        dtype=DataType.VARCHAR,
        is_primary=True,
        auto_id=True,
        max_length=100,
    ),
    FieldSchema(name=dense_field, dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
    FieldSchema(name=sparse_field, dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name=text_field, dtype=DataType.VARCHAR, max_length=65_535),
]

### Create a collection with the defined schema

In [37]:
schema = CollectionSchema(fields=fields, enable_dynamic_field=False)
collection = Collection(
    name="BriefSummaryofLaws", schema=schema, consistency_level="Strong"
)

### Define index for dense and sparse vectors

In [38]:
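# FLAT performs an exhaustive search; IP ranks results by inner product.
dense_index = {"index_type": "FLAT", "metric_type": "IP"}
collection.create_index(dense_field, dense_index)
# Sparse vectors use an inverted index, also scored by inner product.
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
collection.create_index(sparse_field, sparse_index)
collection.flush()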
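
For intuition, a `FLAT` index with the `IP` metric amounts to an exhaustive search: every stored vector is scored against the query by inner product and the highest scores win. The following is a minimal pure-Python sketch of that idea, not Milvus code; `flat_ip_search` and the toy vectors are hypothetical.

```python
def flat_ip_search(query, vectors, top_k=3):
    """Exhaustive inner-product search: score every vector, keep the best top_k."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    # A higher inner product means a closer match under the IP metric.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


# Hypothetical 2-dimensional vectors; the notebook's dense vectors have 384 dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = flat_ip_search([1.0, 0.0], toy_vectors, top_k=2)
print(hits)
```

Exhaustive search is exact but scales linearly with the collection size, which is acceptable for a small corpus like this one.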

### Insert entities into the collection and load the collection

In [39]:
entities = []
for text in all_texts:
    entity = {
        dense_field: embd.embed_documents([text])[0],
        sparse_field: sparse_embedding_func.embed_documents([text])[0],
        text_field: text,
    }
    entities.append(entity)
collection.insert(entities)
collection.load()

## Build RAG chain with Retriever

### Create the Retriever
#### Define search parameters for sparse and dense fields, and create a retriever

In [40]:
sparse_search_params = {"metric_type": "IP"}
dense_search_params = {"metric_type": "IP", "params": {}}
retriever = MilvusCollectionHybridSearchRetriever(
    collection=collection,
    rerank=WeightedRanker(0.5, 0.5),
    anns_fields=[dense_field, sparse_field],
    field_embeddings=[embd, sparse_embedding_func],
    field_search_params=[dense_search_params, sparse_search_params],
    top_k=3,
    text_field=text_field,
)

This retriever takes a dense embedding function and a sparse embedding function, runs a hybrid search over the two vector fields of the collection, and reranks the merged candidates with `WeightedRanker`, here weighting the dense and sparse scores equally (0.5 each). The `top_k=3` setting requests the three highest-ranked Documents.
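
The idea behind weighted reranking can be sketched as a weighted sum of normalized scores from each search route. Milvus applies its own score normalization internally, which is not reproduced here: the min-max scaling, the function `weighted_fusion`, and the toy scores below are assumptions chosen only to show the principle.

```python
def weighted_fusion(result_lists, weights, top_k=3):
    """Fuse several {doc_id: score} result sets by a weighted sum of normalized scores."""
    fused = {}
    for results, weight in zip(result_lists, weights):
        if not results:
            continue
        lo, hi = min(results.values()), max(results.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        for doc_id, score in results.items():
            # Min-max scaling is an assumption of this sketch, not Milvus's method.
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * (score - lo) / span
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical scores: dense and sparse routes live on very different scales.
dense_hits = {"a": 0.9, "b": 0.7, "c": 0.1}
sparse_hits = {"b": 12.0, "c": 9.0, "d": 3.0}
fused_hits = weighted_fusion([dense_hits, sparse_hits], weights=[0.5, 0.5], top_k=3)
print([doc_id for doc_id, _ in fused_hits])
```

Document `b` ranks first because it scores well on both routes, even though `a` has the best dense score on its own.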

In [41]:
retriever.invoke("Explain Labour Law?")

[Document(metadata={'doc_id': '451298459402371835'}, page_content='lesson n constitution labour law'),
 Document(metadata={'doc_id': '451298459402372918'}, page_content='lesson n industrial labour law audit'),
 Document(metadata={'doc_id': '451298459402371881'}, page_content='lesson n international labour organisation'),
 Document(metadata={'doc_id': '451298459402373793'}, page_content="The text appears to be a study material for a professional programme on labour law, focusing on the importance of labour audits and compliance with labour legislation. It explains that a labour audit is a process to ensure sound corporate governance and detect non-compliance with various labour laws applicable to an organization. The text highlights the benefits of labour audits, including boosting morale among workers, increasing productivity, and promoting good corporate governance. It also mentions the importance of compulsory labour audits to ensure compliance with past defaults and reduce the risk 

## Reranking with CohereRerank

Now let's wrap our base retriever with a `ContextualCompressionRetriever`. We'll add a `CohereRerank` compressor, which uses the Cohere rerank endpoint to rerank the returned results. Note that specifying the model name in `CohereRerank` is mandatory.
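
Conceptually, this wrapping is a two-stage pipeline: the base retriever fetches candidates, then a second scorer reorders them and keeps the best few. The sketch below mimics that flow in plain Python. It does not call Cohere or LangChain; `retrieve_then_rerank`, `toy_base_retrieve`, and `toy_overlap_score` are hypothetical stand-ins, and the word-overlap score is only a placeholder for a real rerank model.

```python
def retrieve_then_rerank(query, base_retrieve, score_fn, top_n=2):
    """Fetch candidates from a base retriever, then reorder them with a second scorer."""
    candidates = base_retrieve(query)
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]


def toy_base_retrieve(query):
    # Hypothetical stand-in for the hybrid retriever: returns fixed candidates.
    return [
        "lesson on international labour organisation",
        "labour law audit ensures compliance with labour law",
        "constitution and fundamental rights",
    ]


def toy_overlap_score(query, doc):
    # Hypothetical stand-in for a rerank model: counts query-word occurrences.
    query_terms = set(query.lower().split())
    return sum(1 for token in doc.lower().split() if token in query_terms)


reranked = retrieve_then_rerank("labour law", toy_base_retrieve, toy_overlap_score)
print(reranked)
```

The first stage favors recall and the second favors precision, so the reranker only has to score a handful of candidates rather than the whole collection.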

In [43]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

In [44]:
os.environ['COHERE_API_KEY'] = 'Your_COHERE_API_KEY'

In [45]:
compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

In [46]:
from langchain.chains import RetrievalQA

chain = RetrievalQA.from_chain_type(
    llm=model, retriever=compression_retriever
)

In [47]:
chain.invoke("Explain Matrimonial Remedies under Hindu Law?")

{'query': 'Explain Matrimonial Remedies under Hindu Law?',
 'result': 'Based on the provided context, here is an explanation of Matrimonial Remedies under Hindu Law:\n\nUnder Hindu Law, a wife may choose to live separately from her husband due to various reasons, including employment, financial difficulties, or personal reasons. In such cases, the wife may seek various matrimonial remedies to resolve the issue.\n\nSome of the matrimonial remedies available to a wife under Hindu Law include:\n\n1. Judicial Separation: This is a remedy where the court grants a decree of separation, allowing the wife to live separately from her husband without dissolving the marriage.\n2. Divorce: This is a remedy where the court dissolves the marriage, allowing the wife to remarry.\n3. Maintenance: This is a remedy where the court grants the wife a regular payment of money to maintain her livelihood, especially in cases where the husband has failed to provide for her.\n\nIn order to avail these remedies,

In [48]:
chain.invoke("Explain Article 14 of the Indian Constitution.?")

{'query': 'Explain Article 14 of the Indian Constitution.?',
 'result': 'Article 14 of the Indian Constitution is the "Equality before the Law" clause. It states:\n\n"Equality before the law is a constitutional principle that no person shall be denied the equal protection of the laws nor shall there be any discrimination by the State on the ground of religion, race, caste, sex, place of birth or any of them."\n\nIn simpler terms, Article 14 ensures that every individual is treated equally by the law and that there is no discrimination on the basis of certain grounds specified in the Constitution. This means that the State cannot discriminate against any person or group of people in the exercise of its powers or functions, such as in the allocation of resources, provision of services, or enforcement of laws.\n\nArticle 14 has been interpreted by the Supreme Court of India to mean that all persons, regardless of their background or characteristics, have an equal claim to the protection o

In [49]:
chain.invoke("Explain law of wages")

{'query': 'Explain law of wages',
 'result': 'Based on the provided text, I will explain the concept of "minimum rate wage" and "payment of wages" under Indian labor laws.\n\nIn India, the concept of minimum rate wage is defined under the Minimum Wages Act, 1948. The Act provides for the fixing of minimum rates of wages in certain employments and the payment of wages to employees at such rates. The minimum rate wage is the lowest rate that an employer is required to pay to an employee for their work.\n\nAccording to the Act, the minimum rate wage is fixed by the state governments, taking into account the standard of living, cost of living, and other factors relevant to the workers in that state. The minimum rate wage is applicable to all employees employed in scheduled employment, which includes industries such as manufacturing, mining, and construction.\n\nThe payment of wages to employees is governed by the Payment of Wages Act, 1936. This Act provides for the payment of wages to emp