# 💬 Navigate through docs using a chatbot with Argilla and LlamaIndex

In this tutorial, you'll learn how Argilla integrates with LlamaIndex by building a multi-document agent over the Argilla documentation.

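This tutorial includes the following steps:

- Install the dependencies and import the required modules.
- Download the Argilla documentation and load it as LlamaIndex documents.
- Load the LLM and set up the Argilla handler.
- Build a document agent for each page and a top-level agent over them.
- Run example queries.
- Visualize the results in Argilla.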

You can use the `Open in Colab` button at the top of this page to run the notebook directly on Google Colab. The models used here are called through the OpenAI API, so a GPU runtime is not required.

## Introduction

Let's start by installing the required dependencies.

### Downloads

In [2]:
# !pip install llama-index-agent-openai
# !pip install llama-index-readers-file
# !pip install llama-index-postprocessor-cohere-rerank
# !pip install llama-index-llms-openai
# !pip install llama-index-embeddings-openai
!pip install unstructured
# !pip install partition_image
!pip install argilla-llama-index
!pip install llama-index-callbacks-argilla
!pip install argilla



### Imports

In [3]:
import os
import nest_asyncio
from pathlib import Path
from tqdm.notebook import tqdm
import pickle

from argilla_llama_index import ArgillaCallbackHandler

from llama_index.core import Document, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.readers.file import UnstructuredReader
from llama_index.core import set_global_handler

from llama_index.agent.openai import OpenAIAgent
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.node_parser import SentenceSplitter



## Download Data

In [3]:
domain = "docs.argilla.io"
docs_url = "https://docs.argilla.io/en/latest/"
!wget -e robots=off --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {docs_url}

Both --no-clobber and --convert-links were specified, only --convert-links will be used.
--2024-02-14 12:43:19--  https://docs.argilla.io/en/latest/
Resolving docs.argilla.io (docs.argilla.io)... 104.17.33.82, 104.17.32.82
Connecting to docs.argilla.io (docs.argilla.io)|104.17.33.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘docs.argilla.io/en/latest/index.html’

docs.argilla.io/en/     [ <=>                ]  73,07K  --.-KB/s    in 0,001s  

2024-02-14 12:43:19 (62,9 MB/s) - ‘docs.argilla.io/en/latest/index.html’ saved [74827]

--2024-02-14 12:43:19--  https://docs.argilla.io/en/latest/genindex.html
Reusing existing connection to docs.argilla.io:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘docs.argilla.io/en/latest/genindex.html’

docs.argilla.io/en/     [ <=>                ] 147,87K  --.-KB/s    in 0,02s   

2024-02-14 12:43:19 (5,82 MB/s) - ‘docs.argilla.io/en/lates

In [4]:
reader = UnstructuredReader()

all_files_gen = Path("./docs.argilla.io/").rglob("*")
all_files = [f.resolve() for f in all_files_gen]

all_html_files = [f for f in all_files if f.suffix.lower() == ".html"]

print(f"Number of documents downloaded: {len(all_html_files)}")

Number of documents downloaded: 169


[nltk_data] Downloading package punkt to /Users/ignacio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ignacio/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [5]:
docs = []

for idx, f in enumerate(all_html_files):

    # The first three elements of the docs are indexes, so we can skip them.
    if idx < 3:
        continue
    
    # We also need to handle exceptions when images are found among the html files.
    try:
        loaded_docs = reader.load_data(file=f, split_documents=True)
    except NameError:
        print(f"Error loading {f}, was a image. Skipping.")
        continue
    
    loaded_doc = Document(
        text="\n\n".join([d.get_content() for d in loaded_docs]),
        metadata={"path": str(f)},
    )
    docs.append(loaded_doc)

Error loading /Users/ignacio/Documents/recognai/argilla-llama-index/docs/tutorials/docs.argilla.io/en/latest/_images/token_length_plot.png.html, was a image. Skipping.
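
Skipping the first three files by position depends on the order in which `rglob` yields paths, which can differ between machines. A name-based filter may be more robust. The sketch below is one possible approach, not part of the original notebook: `select_content_pages` is a hypothetical helper, `genindex.html` and the `_images` directory appear in the output above, and the other skipped names are assumptions based on a typical Sphinx build.

```python
from pathlib import Path

# Assumed names of auto-generated index pages; adjust them to your download.
SKIPPED_NAMES = {"genindex.html", "search.html", "py-modindex.html"}
# Directories holding HTML wrappers around assets rather than documentation pages.
SKIPPED_DIRS = {"_images"}


def select_content_pages(paths):
    """Return the HTML pages worth indexing, sorted for a stable order."""
    selected = []
    for path in paths:
        path = Path(path)
        if path.suffix.lower() != ".html":
            continue
        if path.name in SKIPPED_NAMES:
            continue
        if SKIPPED_DIRS.intersection(path.parts):
            continue
        selected.append(path)
    return sorted(selected)
```

With this helper, `select_content_pages(all_files)` could replace both the `.html` suffix check and the positional skip, and the image wrappers would be excluded before the reader is called.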


In [6]:
# Print each field and its content for one element of the docs list.
for elem in docs[10].__dict__:
    print(f"{elem}: {docs[10].__dict__[elem]}")


id_: d83c68fc-ed91-4dca-89d6-befae4b04b46
embedding: None
metadata: {'path': '/Users/ignacio/Documents/recognai/argilla-llama-index/docs/tutorials/docs.argilla.io/en/latest/practical_guides/practical_guides.html'}
excluded_embed_metadata_keys: []
excluded_llm_metadata_keys: []
relationships: {}
text: Hide navigation sidebar

Hide table of contents sidebar

Hide search

Toggle site navigation sidebar

Toggle search

Hide search

Toggle Light / Dark / Auto color theme

Join

Getting Started

What is Argilla?

🚀 QuickstartToggle navigation of 🚀 Quickstart
Installation
Workflow Feedback Dataset
Workflow of Other Datasets

🎼 Cheatsheet

🔧 InstallationToggle navigation of 🔧 Installation
Python
Docker
Docker Quickstart
Docker-compose
Cloud Providers and Kubernetes
Hugging Face Spaces
Google Colab

⚙️ ConfigurationToggle navigation of ⚙️ Configuration
Elasticsearch
Server configuration
User Management
Workspace and Dataset Management
Database Migrations
Image Support

Conceptual Guides

Argill

## Load the LLM and the Argilla handler

In [7]:
# If you don't have your key in an environment variable, you can fill this constant with your API KEY.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

nest_asyncio.apply()
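
Reading `os.environ["OPENAI_API_KEY"]` raises a `KeyError` when the variable is not set. As a minimal sketch, assuming you would rather be prompted than hard-code the key in the notebook, the hypothetical helper `resolve_api_key` below falls back to `getpass`, which does not echo the typed value.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only when it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value, so the key does not end up in the cell output.
        key = prompt(f"Enter {env_var}: ")
        # Store it so that libraries reading the environment can find it too.
        os.environ[env_var] = key
    return key
```

Calling `OPENAI_API_KEY = resolve_api_key()` would then work whether or not the variable was exported beforehand. The `prompt` parameter exists only so the function can be exercised without user input.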


In [2]:
set_global_handler("argilla", dataset_name="query_model")

In [None]:
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = Settings.llm

## Building Multi-Document Agents

In [None]:

async def build_agent_per_doc(nodes, file_base):
    print(file_base)

    vi_out_path = f"./data/argilla_docs/{file_base}"
    summary_out_path = f"./data/argilla_docs/{file_base}_summary.pkl"
    if not os.path.exists(vi_out_path):
        Path("./data/argilla_docs/").mkdir(parents=True, exist_ok=True)
        # build vector index
        vector_index = VectorStoreIndex(nodes)
        vector_index.storage_context.persist(persist_dir=vi_out_path)
    else:
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=vi_out_path),
        )

    # build summary index
    summary_index = SummaryIndex(nodes)

    # define query engines
    vector_query_engine = vector_index.as_query_engine(llm=llm)
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize", llm=llm
    )

    # extract a summary
    if not os.path.exists(summary_out_path):
        Path(summary_out_path).parent.mkdir(parents=True, exist_ok=True)
        summary = str(
            await summary_query_engine.aquery(
                "Extract a concise 1-2 line summary of this document"
            )
        )
        with open(summary_out_path, "wb") as f:
            pickle.dump(summary, f)
    else:
        with open(summary_out_path, "rb") as f:
            summary = pickle.load(f)

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name=f"vector_tool_{file_base}",
                description="Useful for questions related to specific facts",
            ),
        ),
        QueryEngineTool(
            query_engine=summary_query_engine,
            metadata=ToolMetadata(
                name=f"summary_tool_{file_base}",
                description="Useful for summarization questions",
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-4")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
        system_prompt=f"""\
You are a specialized agent designed to answer queries about the `{file_base}.html` part of the Argilla docs.
You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\
""",
    )

    return agent, summary


async def build_agents(docs):
    node_parser = SentenceSplitter()

    # Build agents dictionary
    agents_dict = {}
    extra_info_dict = {}

    # # this is for the baseline
    # all_nodes = []

    for idx, doc in enumerate(tqdm(docs)):
        nodes = node_parser.get_nodes_from_documents([doc])
        # all_nodes.extend(nodes)

        # The ID is the parent directory name plus the file stem
        file_path = Path(doc.metadata["path"])
        file_base = str(file_path.parent.stem) + "_" + str(file_path.stem)
        agent, summary = await build_agent_per_doc(nodes, file_base)

        agents_dict[file_base] = agent
        extra_info_dict[file_base] = {"summary": summary, "nodes": nodes}

    return agents_dict, extra_info_dict
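
`build_agent_per_doc` stores each summary on disk so that re-running the notebook does not repeat the same LLM call. The same compute-or-load pattern is shown in isolation below, as a sketch only: `load_or_compute` is a hypothetical helper, and the synchronous `compute` callable stands in for the awaited summary query used in the tutorial.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached value at cache_path, computing and storing it on a miss."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: reuse the stored value and skip the expensive call.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    value = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(value, f)
    return value
```

Since the cache is keyed only by file name, a stale summary survives changes to the underlying page. Deleting the `./data/argilla_docs/` directory forces everything to be rebuilt.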

In [None]:
agents_dict, extra_info_dict = await build_agents(docs)
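
The `file_base` identifier ends up inside the tool names (`vector_tool_...`, `summary_tool_...`). The sketch below assumes that the OpenAI function-calling API only accepts names made of letters, digits, underscores, and hyphens, up to 64 characters; check this against the current API reference. It shows how the identifier is derived from a path and how it could be sanitized. `make_file_base` and its `max_length` default are hypothetical, not part of the original notebook.

```python
import re
from pathlib import Path


def make_file_base(path, max_length=50):
    """Build a '<parent>_<stem>' identifier restricted to letters, digits, '_' and '-'."""
    path = Path(path)
    raw = f"{path.parent.stem}_{path.stem}"
    # Replace every character outside the assumed allowed set with an underscore.
    cleaned = re.sub(r"[^a-zA-Z0-9_-]", "_", raw)
    # Leave room for the 'summary_tool_' prefix within the assumed 64-character limit.
    return cleaned[:max_length]
```

For `.../practical_guides/practical_guides.html` this yields `practical_guides_practical_guides`, the same value the loop above computes. Truncation can make two long names collide, so a real implementation might append a short hash.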

### Building a Document Agent for each document

### Building a top-level agent

## Running example queries

## Visualizing in Argilla