<a href="https://colab.research.google.com/github/aritrasen87/smolagents/blob/main/2_hf_smolagents_agentic_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installations

In [1]:
! pip -q install smolagents
! pip -q install litellm
! pip install langchain langchain-community rank_bm25 --upgrade -q
! pip install sentence-transformers -q
! pip install datasets -q
! pip install chromadb -q



### Credentials

In [2]:
import os
from google.colab import userdata
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_KEY')

### Testing smolagents

In [3]:
from smolagents import CodeAgent, HfApiModel

model = HfApiModel()

agent = CodeAgent(tools=[], model=model)

agent.run('What is 24*365?')

8760

### Vanilla RAG has limitations, most importantly these two:

1. It performs only one retrieval step: if the retrieved documents are poor, the generated answer will be poor too.

2. The user query is often phrased as a question, while the document containing the true answer is written in the affirmative. That document's similarity score is therefore lower than the scores of documents phrased as questions, so the relevant information risks being missed.

This Agent will:
- ✅ Formulate the query itself
- ✅ Critique the retrieved results and re-retrieve if needed.
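
The two behaviours listed above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not how smolagents implements it: the corpus, the word-overlap `retrieve` function, the `is_good_enough` critique, and the hard-coded list of reformulated queries are all hypothetical stand-ins for the vector store and the LLM.

```python
import re

# Hypothetical toy corpus standing in for the indexed documentation chunks.
CORPUS = [
    "The backward pass computes gradients for every parameter.",
    "Tokenizers split text into subword units.",
    "Checkpoints store model weights on disk.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, k=1):
    """Rank corpus entries by word overlap with the query (stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(CORPUS, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def is_good_enough(query, docs, min_overlap=2):
    """Toy critique: accept if any document shares at least min_overlap words with the query."""
    q = tokenize(query)
    return any(len(q & tokenize(doc)) >= min_overlap for doc in docs)


def agentic_retrieve(candidate_queries, max_steps=4):
    """Try each reformulated query in turn until the critique passes or steps run out."""
    docs = []
    for step, query in enumerate(candidate_queries[:max_steps], start=1):
        docs = retrieve(query)
        if is_good_enough(query, docs):
            return step, query, docs
    return None, None, docs


# In a real agent the LLM would write these reformulations; here they are hard-coded.
step, query, docs = agentic_retrieve(
    ["Which one is slower?", "The backward pass computes gradients"]
)
print(step, query, docs)
```

The first query is phrased as a question and shares no words with any chunk, so the critique rejects it; the affirmative reformulation passes on the second step.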

### Indexing data into Chroma

In [4]:
import datasets
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.docstore.document import Document
#from langchain_community.retrievers import BM25Retriever

knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
knowledge_base = knowledge_base.filter(lambda row: row["source"].startswith("huggingface/transformers"))

source_docs = [
    Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
    for doc in knowledge_base
]

### Creating Chunks using RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],
)
new_docs = text_splitter.split_documents(documents=source_docs)

### BGE Embeddings

from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cuda"}  # assumes a GPU runtime; switch to "cpu" if none is available
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)
### Populate Vector DB

db = Chroma.from_documents(new_docs, embeddings)
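
To make the `chunk_size`, `chunk_overlap`, and `add_start_index` settings above more concrete, here is a simplified sketch of overlapping chunks. It uses a plain fixed-width sliding window, which is an assumption made for illustration: it is not the separator-based recursive algorithm that `RecursiveCharacterTextSplitter` applies, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-width chunks that share chunk_overlap characters.

    Each chunk records the character offset where it starts, similar in
    spirit to the start_index metadata seen in the retriever output.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 1200-character text so the offsets are easy to follow.
sample = "".join(str(i % 10) for i in range(1200))
for chunk in sliding_window_chunks(sample):
    print(chunk["start_index"], len(chunk["text"]))
```

With a 500-character window and a 50-character overlap, each chunk starts 450 characters after the previous one, so the tail of one chunk reappears at the head of the next.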


In [11]:
retriever = db.as_retriever(search_kwargs={"k": 4})
retriever.invoke('forward pass in transformer')

[Document(metadata={'source': 'transformers', 'start_index': 7510}, page_content='4.  [ ] Created script that successfully runs forward pass using\n    original repository and checkpoint\n\n5.  [ ] Successfully opened a PR and added the model skeleton to Transformers\n\n6.  [ ] Successfully converted original checkpoint to Transformers\n    checkpoint\n\n7.  [ ] Successfully ran forward pass in Transformers that gives\n    identical output to original checkpoint\n\n8.  [ ] Finished model tests in Transformers\n\n9.  [ ] Successfully added Tokenizer in Transformers'),
 Document(metadata={'source': 'transformers', 'start_index': 7863}, page_content='4.  [ ] Created script that successfully runs forward pass using\n    original repository and checkpoint\n\n5.  [ ] Successfully opened a PR and added the model skeleton to Transformers\n\n6.  [ ] Successfully converted original checkpoint to Transformers\n    checkpoint\n\n7.  [ ] Successfully ran forward pass in Transformers that gives\n   
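
Because `normalize_embeddings` is set to `True`, every embedding has unit length. For unit vectors the squared Euclidean distance equals 2 minus twice the dot product, so ranking by dot product, cosine similarity, or Euclidean distance gives the same order. The sketch below checks that identity and performs a top-k lookup on toy 2-D vectors; the helper names and the vectors are hypothetical and only stand in for the real embedding model and vector store.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k documents most similar to the query (cosine similarity)."""
    q = normalize(query_vec)
    scored = [(dot(q, normalize(d)), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" (hypothetical); real ones would come from the BGE model.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [2.0, 0.5]

print(top_k(query_vec, doc_vecs, k=2))

# For unit vectors: squared distance == 2 - 2 * dot product.
q = normalize(query_vec)
for d in doc_vecs:
    n = normalize(d)
    assert abs(squared_l2(q, n) - (2 - 2 * dot(q, n))) < 1e-9
```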

### Creation of Retriever Tool

In [12]:
from smolagents import Tool

class RetrieverTool(Tool):
    name = "retriever"
    description = "Uses semantic search to retrieve the parts of transformers documentation that could be most relevant to answer your query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.retriever = db.as_retriever(search_kwargs={"k": 4})

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.retriever.invoke(
            query,
        )
        return "\nRetrieved documents:\n" + "".join(
            [
                f"\n\n===== Document {i} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )

retriever_tool = RetrieverTool()
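
To see what string the agent receives from this tool, the formatting step in `forward` can be reproduced without the vector store. This is only an illustrative sketch: `StubDocument` and `format_retrieved` are hypothetical names, and the stub objects merely mimic the `page_content` attribute of the retrieved documents.

```python
from dataclasses import dataclass


@dataclass
class StubDocument:
    """Minimal stand-in for a retrieved document; only page_content is used."""
    page_content: str


def format_retrieved(docs):
    """Join documents into one string using the same layout as RetrieverTool.forward."""
    return "\nRetrieved documents:\n" + "".join(
        f"\n\n===== Document {i} =====\n" + doc.page_content
        for i, doc in enumerate(docs)
    )


formatted = format_retrieved(
    [StubDocument("first chunk"), StubDocument("second chunk")]
)
print(formatted)
```

Each chunk is introduced by a numbered separator line, which gives the model a way to tell the retrieved passages apart.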

### Agent Initialization

In [13]:
from smolagents import HfApiModel, CodeAgent

agent = CodeAgent(
    tools=[retriever_tool], model=HfApiModel(), max_steps=4, verbose=True
)

In [14]:
agent_output = agent.run("For a transformers model training, which is slower, the forward or the backward pass?")

print("Final output:")
print(agent_output)

Final output:
ChatCompletionOutputMessage(role='assistant', content='In the context of training transformer models, the backward pass is generally slower than the forward pass. This is due to several reasons:\n\n1. **Complexity of Operations**: The backward pass involves computing gradients for each parameter in the model, which requires additional computations such as the chain rule in backpropagation. This results in more complex operations compared to the forward pass, which is primarily focused on computing the output.\n\n2. **Memory Usage**: The backward pass requires storing intermediate activations from the forward pass to compute gradients correctly. This additional memory usage can slow down the backward pass, especially for large models and long sequences.\n\n3. **Parallelization**: The forward pass can often be more easily parallelized across multiple GPUs or TPUs, whereas the backward pass can be more challenging to parallelize efficiently due to the dependencies between co

In [15]:
print(agent_output.content)

In the context of training transformer models, the backward pass is generally slower than the forward pass. This is due to several reasons:

1. **Complexity of Operations**: The backward pass involves computing gradients for each parameter in the model, which requires additional computations such as the chain rule in backpropagation. This results in more complex operations compared to the forward pass, which is primarily focused on computing the output.

2. **Memory Usage**: The backward pass requires storing intermediate activations from the forward pass to compute gradients correctly. This additional memory usage can slow down the backward pass, especially for large models and long sequences.

3. **Parallelization**: The forward pass can often be more easily parallelized across multiple GPUs or TPUs, whereas the backward pass can be more challenging to parallelize efficiently due to the dependencies between computations.

4. **Gradient Accumulation**: In practice, the backward pass i

In [16]:
agent.run("For a transformers model training, What is the role of scaled dot product?")

ChatCompletionOutputMessage(role='assistant', content="The scaled dot product is a key component of the self-attention mechanism in transformer models, which plays a crucial role in how the model processes and understands the input data, particularly in natural language processing tasks.\n\nIn the context of transformers, the self-attention mechanism allows the model to weigh the importance of different words in a sentence relative to each other. This is done by computing a score for each pair of words, which indicates how relevant one word is to another. The scaled dot product is a method used to compute these scores efficiently.\n\nHere's a step-by-step breakdown of how the scaled dot product works in the self-attention mechanism:\n\n1. **Query, Key, and Value Vectors**: Each word in the input sequence is transformed into three vectors: a query vector, a key vector, and a value vector. These vectors are typically produced by applying linear transformations (using weight matrices) to 