# RAG Cookbook

In this notebook, we show how to use the CAMEL `RetrievalModule` in both a customized way and the default automatic way. We also show how to combine `RetrievalModule` with `RolePlaying` by using `FunctionCalling`.

## Load Data

Let's first load the CAMEL paper from https://arxiv.org/pdf/2303.17760.pdf. This will be our local example data.

In [14]:
!mkdir -p 'local_data/'
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2303.17760.pdf" -O "local_data/camel paper.pdf"

--2024-01-06 20:35:34--  https://arxiv.org/pdf/2303.17760.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.3.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7536497 (7.2M) [application/pdf]
Saving to: ‘local_data/camel paper2.pdf’


2024-01-06 20:39:59 (28.9 KB/s) - ‘local_data/camel paper2.pdf’ saved [7536497/7536497]



## 1. Customized RAG
In this section we will set up our customized RAG pipeline.


First set the embedding model. We will use `OpenAIEmbedding`, so we need to set `OPENAI_API_KEY` below.

In [2]:
import os

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # never publish a real key
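
Hard-coding a key in a notebook cell makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reads the key from an environment-like mapping and fails early if it is missing. The name `resolve_api_key` is hypothetical and not part of CAMEL; the demo uses a stand-in dictionary so that it runs without a real key.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "OPENAI_API_KEY",
) -> str:
    """Return the key stored under `name`, or raise if it is missing or blank."""
    key: Optional[str] = env.get(name)
    if not key or not key.strip():
        raise RuntimeError(
            f"{name} is not set; export it in your shell before starting the notebook."
        )
    return key.strip()


# Demo with a stand-in mapping (hypothetical value, not a real key).
demo_env = {"OPENAI_API_KEY": "  sk-placeholder  "}
print(resolve_api_key(demo_env))
```

Calling `resolve_api_key()` with no arguments checks the real process environment, so the key can be exported in the shell and never appears in the notebook itself.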

Import and set the embedding instance:

In [3]:
from camel.embeddings import OpenAIEmbedding

embedding_model_instance = OpenAIEmbedding()

Import and set the vector storage instance:

In [4]:
from camel.storages.vectordb_storages import QdrantStorage

vector_storage_instance = QdrantStorage(
    vector_dim=embedding_model_instance.get_output_dim(),
    collection="my first collection",
    create_collection=True,  # set this to True to create the collection on the first run
    path="storage_customized_run",
)

Import and set the retrieval instance:

In [5]:
from camel.functions.retrieval_function import RetrievalModule

retrieval_instance = RetrievalModule(embedding_model=embedding_model_instance)

We use the integrated `Unstructured Module` to split the content into small chunks. The content is split automatically with its `chunk_by_title` function, and each chunk holds at most 500 characters, which is a suitable length for `OpenAIEmbedding`. All the text in the chunks is then embedded and stored in the vector storage instance. This takes some time, so please wait.

In [6]:
retrieval_instance.embed_and_store_chunks(
    content_input_path="local_data/camel paper.pdf",
    vector_storage=vector_storage_instance,
)

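
To give a feel for what a 500-character chunk limit means, here is a minimal sketch that greedily packs paragraphs into chunks. It is a simplified stand-in, not the `chunk_by_title` implementation: the function name `chunk_text` and the paragraph-based splitting rule are assumptions made for illustration only.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack paragraphs into chunks of at most `max_chars` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        # Hard-split any single paragraph that is longer than the limit.
        pieces = [
            paragraph[i:i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # the piece still fits in the open chunk
            else:
                chunks.append(current)  # close the open chunk and start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Title\n\n" + "a" * 300 + "\n\n" + "b" * 300 + "\n\n" + "c" * 1200
chunks = chunk_text(sample, max_chars=500)
print([len(chunk) for chunk in chunks])
```

Each resulting chunk would then be passed to the embedding model and written to the vector storage as one record.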


Now we can retrieve information from the vector storage by giving a query. By default it returns the text content of the top 1 chunk with the highest cosine similarity, and the similarity must be higher than 0.75 to ensure the retrieved text is relevant to the query. You can change the `top_k` and `similarity_threshold` values to suit your needs.

The returned string list includes:
- similarity score
- content path
- metadata
- text

In [7]:
retrieved_info = retrieval_instance.query_and_compile_results(
    query="What is CAMEL?",
    vector_storage=vector_storage_instance,
    top_k=2,
)

print(retrieved_info)

{'similarity score': '0.8369745249499829', 'content path': 'local_data/camel paper.pdf', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2024-01-06T17:16:57', 'page_number': 1}, 'text': 'CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society\n\nhttps://www.camel-ai.org\n\nGuohao Li⇤ Hasan Abed Al Kader Hammoud*\n\nHani Itani*\n\nDmitrii Khizbullin\n\nBernard Ghanem\n\nKing Abdullah University of Science and Technology (KAUST)\n\nAbstract'}
{'similarity score': '0.8077271901798347', 'content path': 'local_data/camel paper.pdf', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2024-01-06T17:16:57', 'page_number': 1}, 'text': 'rk, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond. The GitHub repository of this project is made publicly available

Let's try an irrelevant query:

In [8]:
retrieved_info_irrelevant = retrieval_instance.query_and_compile_results(
    query="Compared with dumpling and rice, which should I take for dinner?",
    vector_storage=vector_storage_instance,
    top_k=2,
)

print(retrieved_info_irrelevant)

No suitable information retrieved from local_data/camel paper.pdf with similarity_throthold = 0.75.
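
The two queries above show both outcomes of the `top_k` and `similarity_threshold` settings. The following sketch reproduces that selection logic on toy two-dimensional vectors. It is an illustration under stated assumptions, not CAMEL's code: the function names, the strict `>` comparison, and the toy vectors are all made up for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_above_threshold(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    top_k: int = 1,
    similarity_threshold: float = 0.75,
) -> List[Tuple[float, str]]:
    """Score every chunk, drop those at or below the threshold, keep the best k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    relevant = [item for item in scored if item[0] > similarity_threshold]
    relevant.sort(key=lambda item: item[0], reverse=True)
    return relevant[:top_k]


# Toy "embeddings" (hypothetical values chosen by hand).
toy_chunks = [
    ("CAMEL is a multi-agent framework.", [1.0, 0.0]),
    ("Role-playing uses two agents.", [0.8, 0.6]),
    ("Dumplings are tasty.", [0.0, 1.0]),
]

relevant_hits = top_k_above_threshold([1.0, 0.1], toy_chunks, top_k=2)
irrelevant_hits = top_k_above_threshold([-1.0, 0.0], toy_chunks, top_k=2)
print(relevant_hits)
print(irrelevant_hits)  # empty: nothing passes the threshold
```

An empty result corresponds to the "No suitable information retrieved" message: every chunk was scored, but none cleared the threshold.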


## 2. Default RAG
In this section we will run the auto RAG pipeline with default settings. It uses `OpenAIEmbedding` as the default embedding model and `Qdrant` as the default vector storage.

What you need to do is:
- Set content input paths, which can be local paths or remote URLs
- Set the local storage path, or the remote URL and API key, for Qdrant
- Give a query

The default RAG pipeline creates a collection for each given content input path. The collection name is set automatically based on the content input path name. If the collection already exists, the pipeline runs the retrieval directly.

In [12]:
retrieval_instance_default = RetrievalModule()
retrieved_info = retrieval_instance_default.run_default_retrieval(
    content_input_paths=[
        "local_data/camel paper.pdf",  # example local path
        "https://www.camel-ai.org/",  # example remote url
    ],
    vector_storage_local_path="storage_default_run",
    query="What is CAMEL-AI",
)

print(retrieved_info)

Original Query:
{What is CAMEL-AI}
Retrieved Context:

{'similarity score': '0.8322398642543604', 'content path': 'local_data/camel paper.pdf', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2024-01-06T17:16:57', 'page_number': 1}, 'text': 'CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society\n\nhttps://www.camel-ai.org\n\nGuohao Li⇤ Hasan Abed Al Kader Hammoud*\n\nHani Itani*\n\nDmitrii Khizbullin\n\nBernard Ghanem\n\nKing Abdullah University of Science and Technology (KAUST)\n\nAbstract'}
{'similarity score': '0.8380730666554639', 'content path': 'https://www.camel-ai.org/', 'metadata': {'emphasized_text_contents': ['Mission', 'CAMEL-AI.org', 'is an open-source community dedicated to the study of autonomous and communicative agents. We believe that studying these agents on a large scale offers valuable insights into their behaviors, capabilities, and potential risks. To facilitate research in this field, we prov
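
The create-or-reuse behaviour described at the start of this section can be sketched as follows. This is a hypothetical illustration: the naming rule in `collection_name_for` and the in-memory dictionary standing in for Qdrant are assumptions, and the actual rule used by `run_default_retrieval` may differ.

```python
import re
from typing import Dict, List


def collection_name_for(content_path: str) -> str:
    """Turn a local path or URL into a collection name (hypothetical rule)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", content_path).strip("_").lower()


def get_or_create_collection(store: Dict[str, List[str]], content_path: str) -> bool:
    """Return True if a new collection was created, False if one was reused."""
    name = collection_name_for(content_path)
    if name in store:
        return False  # collection exists: go straight to retrieval
    store[name] = []  # a real pipeline would chunk, embed, and insert here
    return True


store: Dict[str, List[str]] = {}
paths = [
    "local_data/camel paper.pdf",
    "https://www.camel-ai.org/",
    "local_data/camel paper.pdf",  # second run on the same file
]
for path in paths:
    created = get_or_create_collection(store, path)
    print(collection_name_for(path), "created" if created else "reused")
```

Deriving the name from the path is what makes repeated runs cheap: the expensive embedding step only happens the first time a given input is seen.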

## 3. Role-playing with RAG
In this section we will show how to combine the default `RetrievalModule` with `RolePlaying` by applying `OpenAI Function Calling`.


First, we need to define a retrieval function with a well-written docstring so that the LLM understands what the function is for. The main code is the same as in the default RAG section.

In [10]:
from typing import List
from camel.functions import OpenAIFunction
from camel.functions.retrieval_function import RetrievalModule

def local_retrieval(query: str) -> str:
    r"""Performs a default local retrieval for information. Given a query,
    this function will retrieve the information from the local vector storage,
    and return the retrieved information back. It is useful for information
    retrieval.

    Args:
        query (str): Question you want to be answered.

    Returns:
        str: Aggregated information retrieved in response to the query.

    Example:
        local_retrieval(query = "what is camel?")
    """
    retrieval_instance = RetrievalModule()
    retrieved_info = retrieval_instance.run_default_retrieval(
        content_input_paths=[
            "local_data/camel paper.pdf",
            "https://www.camel-ai.org/",
        ],
        vector_storage_local_path="storage_default_run",
        query=query,
    )
    return retrieved_info

# add the function to OpenAIFunction list
RETRIEVAL_FUNCS: List[OpenAIFunction] = [
    OpenAIFunction(func)
    for func in [
        local_retrieval,
        ]
]
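
To see why the docstring and type hints matter, it helps to look at how a function can be turned into a description that a model can read. The sketch below builds a JSON-schema-like dictionary from a function's signature and docstring using only the standard library. It is not how `OpenAIFunction` is implemented; `describe_function` and the example `lookup` function are hypothetical names used for illustration.

```python
import inspect
import json
from typing import Any, Callable, Dict, List

# Minimal mapping from Python annotations to JSON schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_function(func: Callable[..., Any]) -> Dict[str, Any]:
    """Build a function description from the signature and docstring."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    properties: Dict[str, Dict[str, str]] = {}
    required: List[str] = []
    for name, param in signature.parameters.items():
        # Unknown or missing annotations fall back to "string".
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        # The first docstring paragraph becomes the description.
        "description": doc.split("\n\n")[0].replace("\n", " "),
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def lookup(query: str, top_k: int = 1) -> str:
    """Search the local store for text relevant to the query.

    Args:
        query (str): Question to answer.
        top_k (int): Number of chunks to return.
    """
    return f"results for {query!r} (top_k={top_k})"


schema = describe_function(lookup)
print(json.dumps(schema, indent=2))
```

The model never sees the function body, only a description like this one, so a vague docstring gives it little basis for deciding when to call the function.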

Run the role-playing with defined retrieval function:

In [11]:
from camel.societies import role_playing_with_function
from camel.functions import MATH_FUNCS # import another function from camel

role_playing_with_function(
    task_prompt=("What is Role-playing Framework, how many assistant roles did camel use, what's the value plus 100?"),
    function_list=[*RETRIEVAL_FUNCS, *MATH_FUNCS]
    )

AI Assistant sys message:
BaseMessage(role_name='Searcher', role_type=<RoleType.ASSISTANT: 'assistant'>, meta_dict={'task': "What is Role-playing Framework, how many assistant roles did camel use, what's the value plus 100?", 'assistant_role': 'Searcher', 'user_role': 'Professor'}, content="===== RULES OF ASSISTANT =====\nNever forget you are a Searcher and I am a Professor. Never flip roles! Never instruct me!\nWe share a common interest in collaborating to successfully complete a task.\nYou must help me to complete the task.\nHere is the task: What is Role-playing Framework, how many assistant roles did camel use, what's the value plus 100?. Never forget our task!\nI must instruct you based on your expertise and my needs to complete the task.\n\nI must give you one instruction at a time.\nYou must write a specific solution that appropriately solves the requested instruction and explain your solutions.\nYou must decline my instruction honestly if you cannot perform the instructio