# Multidoc Autoretrieval Pack

This is the LlamaPack version of our structured hierarchical retrieval guide in the [core repo](https://docs.llamaindex.ai/en/stable/examples/query_engine/multi_doc_auto_retrieval/multi_doc_auto_retrieval.html).

## Setup and Download Data

In this section, we load GitHub issues from the LlamaIndex repository.

In [1]:
import nest_asyncio

nest_asyncio.apply()

In [2]:
import os

# Use your own GitHub personal access token here; never commit a real token.
os.environ["GITHUB_TOKEN"] = "<your-github-token>"

In [3]:
import os

from llama_hub.github_repo_issues import (
    GitHubRepositoryIssuesReader,
    GitHubIssuesClient,
)

github_client = GitHubIssuesClient()
loader = GitHubRepositoryIssuesReader(
    github_client,
    owner="run-llama",
    repo="llama_index",
    verbose=True,
)

orig_docs = loader.load_data()

limit = 100
# limit = 10

# Keep the first `limit` issues and record each document's id in its metadata.
docs = []
for doc in orig_docs[:limit]:
    doc.metadata["index_id"] = doc.id_
    docs.append(doc)

Found 100 issues in the repo page 1
Resulted in 100 documents
Found 100 issues in the repo page 2
Resulted in 200 documents
Found 100 issues in the repo page 3
Resulted in 300 documents
Found 9 issues in the repo page 4
Resulted in 309 documents
No more issues found, stopping


In [4]:
from copy import deepcopy
import asyncio
from tqdm.asyncio import tqdm_asyncio
from llama_index import SummaryIndex, Document, ServiceContext
from llama_index.llms import OpenAI
from llama_index.async_utils import run_jobs


async def aprocess_doc(doc, include_summary: bool = True):
    """Process doc."""
    print(f"Processing {doc.id_}")
    metadata = doc.metadata

    date_tokens = metadata["created_at"].split("T")[0].split("-")
    year = int(date_tokens[0])
    month = int(date_tokens[1])
    day = int(date_tokens[2])

    assignee = doc.metadata.get("assignee", "")
    # Size labels look like "size:L"; use the first one if present.
    size_labels = [label for label in doc.metadata["labels"] if "size:" in label]
    size = size_labels[0].split(":")[1] if size_labels else ""
    new_metadata = {
        "state": metadata["state"],
        "year": year,
        "month": month,
        "day": day,
        "assignee": assignee,
        "size": size,
        "index_id": doc.id_,
    }

    # Summarize the issue with an LLM; the summary becomes the text of the metadata document.
    summary_index = SummaryIndex.from_documents([doc])
    query_str = "Give a one-sentence concise summary of this issue."
    query_engine = summary_index.as_query_engine(
        service_context=ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"))
    )
    # Use the async query API so concurrent jobs do not block the event loop.
    summary_txt = str(await query_engine.aquery(query_str))

    new_doc = Document(text=summary_txt, metadata=new_metadata)
    return new_doc


async def aprocess_docs(docs):
    """Process metadata on docs."""

    # One coroutine per issue; run_jobs awaits them with up to `workers` running at once.
    tasks = [aprocess_doc(doc) for doc in docs]
    new_docs = await run_jobs(tasks, show_progress=True, workers=5)

    return new_docs
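
The `workers=5` argument caps how many summaries are in flight at once. As a rough illustration of that pattern (a sketch, not necessarily how `run_jobs` is implemented), bounded concurrency can be written with `asyncio.Semaphore`. The names `run_bounded`, `fake_summarize`, and `demo` are hypothetical and exist only for this example.

```python
import asyncio


async def run_bounded(coros, workers=5):
    """Await all coroutines, at most `workers` at a time, preserving input order."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(coro):
        # Each coroutine waits for a free slot before it starts running.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_summarize(issue_id):
    # Hypothetical stand-in for the per-issue LLM call.
    await asyncio.sleep(0.01)
    return f"summary of {issue_id}"


def demo():
    # In a plain script use asyncio.run; in a notebook, `await run_bounded(...)` directly.
    return asyncio.run(run_bounded([fake_summarize(i) for i in (1, 2, 3)], workers=2))
```

Results come back in the same order as the input coroutines, which is what lets the summaries line up with their source issues.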

In [5]:
new_docs = await aprocess_docs(docs)

  0%|                                                                           | 0/100 [00:00<?, ?it/s]

Processing 9244
Processing 9417
Processing 9618
Processing 9491
Processing 9408
Processing 9611
Processing 9627
Processing 9372
Processing 9623
Processing 9415
Processing 9620
Processing 9414
Processing 9097
Processing 9525
Processing 9339
Processing 9427
Processing 9398
Processing 9613
Processing 9353
Processing 9612
Processing 8832
Processing 9348
Processing 9609
Processing 9604
Processing 7457
Processing 9426
Processing 9383
Processing 9664
Processing 9425
Processing 9419
Processing 9405
Processing 9684
Processing 9373
Processing 9546
Processing 9565
Processing 9488
Processing 9560
Processing 9269
Processing 8802
Processing 9510
Processing 9343
Processing 9523
Processing 9416
Processing 9421
Processing 9522
Processing 9653
Processing 9520
Processing 9435
Processing 9571
Processing 9358
Processing 9385
Processing 9685
Processing 9380
Processing 9352
Processing 9477
Processing 9626
Processing 9368
Processing 8893
Processing 9543
Processing 9638
Processing 9312
Processing 8551
[output truncated]

100%|█████████████████████████████████████████████████████████████████| 100/100 [01:48<00:00,  1.09s/it]


In [6]:
new_docs[5].metadata

{'state': 'open',
 'year': 2023,
 'month': 12,
 'day': 21,
 'assignee': '',
 'size': 'L',
 'index_id': '9658'}
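
The metadata shown above comes from the flattening step inside `aprocess_doc`. Pulled out as a standalone function, without the LLM summary, that step might look like the sketch below. `extract_issue_metadata` is a hypothetical helper, and the sample record is made up for illustration.

```python
def extract_issue_metadata(metadata, index_id):
    """Flatten raw issue metadata into the fields used for filtering."""
    # "2023-12-21T15:55:48Z" -> (2023, 12, 21)
    date_part = metadata["created_at"].split("T")[0]
    year, month, day = (int(token) for token in date_part.split("-"))

    # Labels such as "size:L" carry the issue size; take the first match, if any.
    size_labels = [label for label in metadata.get("labels", []) if "size:" in label]
    size = size_labels[0].split(":")[1] if size_labels else ""

    return {
        "state": metadata["state"],
        "year": year,
        "month": month,
        "day": day,
        "assignee": metadata.get("assignee", ""),
        "size": size,
        "index_id": index_id,
    }


# Made-up sample record, shaped like the reader's issue metadata.
raw = {
    "state": "open",
    "created_at": "2023-12-21T15:55:48Z",
    "labels": ["enhancement", "size:L"],
}
flat = extract_issue_metadata(raw, "1234")
```

Storing `year`, `month`, and `day` as separate integers is what makes equality filters on each of them possible later.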

## Setup Weaviate Indices

In [7]:
from llama_index.vector_stores import WeaviateVectorStore
from llama_index.storage import StorageContext
from llama_index import VectorStoreIndex

In [8]:
import weaviate

# cloud
# Use your own Weaviate API key here; never commit a real key.
auth_config = weaviate.AuthApiKey(api_key="<your-weaviate-api-key>")
client = weaviate.Client(
    "https://jerry-cluster-gk9v5ken.weaviate.network",
    auth_client_secret=auth_config,
)

doc_metadata_index_name = "LlamaIndex_auto"
doc_chunks_index_name = "LlamaIndex_AutoDoc"

In [9]:
# optional: delete schema
client.schema.delete_class(doc_metadata_index_name)
client.schema.delete_class(doc_chunks_index_name)

### Setup Metadata Schema

The auto-retriever requires this schema. It is included in the LLM prompt so that the model can infer metadata filters from a natural-language query.

In [10]:
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo


vector_store_info = VectorStoreInfo(
    content_info="Github Issues",
    metadata_info=[
        MetadataInfo(
            name="state",
            description="Whether the issue is `open` or `closed`",
            type="string",
        ),
        MetadataInfo(
            name="year",
            description="The year issue was created",
            type="integer",
        ),
        MetadataInfo(
            name="month",
            description="The month issue was created",
            type="integer",
        ),
        MetadataInfo(
            name="day",
            description="The day issue was created",
            type="integer",
        ),
        MetadataInfo(
            name="assignee",
            description="The assignee of the ticket",
            type="string",
        ),
        MetadataInfo(
            name="size",
            description="How big the issue is (XS, S, M, L, XL, XXL)",
            type="string",
        ),
    ],
)
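
Because the LLM infers filters from this schema, it helps if the document metadata actually matches the declared names and types. A quick sanity check could look like the sketch below. It assumes a plain dictionary, `SCHEMA`, that mirrors the `MetadataInfo` entries above, and `schema_problems` is a hypothetical helper, not part of LlamaIndex.

```python
# Declared schema, mirroring the MetadataInfo entries (name -> Python type).
SCHEMA = {
    "state": str,
    "year": int,
    "month": int,
    "day": int,
    "assignee": str,
    "size": str,
}


def schema_problems(metadata, schema=SCHEMA):
    """Return a list of human-readable mismatches between metadata and schema."""
    problems = []
    for name, expected in schema.items():
        if name not in metadata:
            problems.append(f"missing field: {name}")
        elif not isinstance(metadata[name], expected):
            actual = type(metadata[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


good = {"state": "open", "year": 2023, "month": 12, "day": 21,
        "assignee": "", "size": "L", "index_id": "9658"}
bad = {**good, "month": "12"}  # month stored as a string by mistake
```

Running `schema_problems` over each entry of `new_docs` before indexing would flag fields that an integer filter such as `('month', '==', 12)` could never match.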

## Download LlamaPack

In [11]:
# from llama_index.llama_pack import download_llama_pack

# MultiDocAutoRetrieverPack = download_llama_pack(
#     "MultiDocAutoRetrieverPack",
#     "./multidoc_autoretriever_pack",
#     llama_hub_url="https://raw.githubusercontent.com/run-llama/llama-hub/jerry/add_multi_doc_autoretrieval_pack/llama_hub"
# )

from llama_hub.llama_packs.multidoc_autoretrieval.base import MultiDocAutoRetrieverPack

In [12]:
pack = MultiDocAutoRetrieverPack(
    client,
    doc_metadata_index_name,
    doc_chunks_index_name,
    new_docs,
    docs,
    vector_store_info,
    auto_retriever_kwargs={
        "verbose": True,
        "similarity_top_k": 2,
        "empty_query_top_k": 10,
    },
    verbose=True,
)

Indexed metadata nodes.
Indexed source document nodes.
Setup autoretriever over metadata.
Setup per-document retriever.
Setup recursive retriever.


## Run LlamaPack

Now let's try the LlamaPack on some queries! 

In [13]:
response = pack.run("Tell me about some issues on 12/11")
print(str(response))

Retrieving with query id None: Tell me about some issues on 12/11
Using query str: issues
Using filters: [('month', '==', 12), ('day', '==', 11)]
Retrieved node with id, entering: 9425
Retrieving with query id 9425: Tell me about some issues on 12/11
Retrieving text node: [Feature Request]: Make llama-index compartible with models finetuned and hosted on modal.com
### Feature Description

Modal.com is a cloud computing service that allows you to finetune and host models on their workers. They provide inference points for any models finetuned on their platform.

### Reason

I have not tried implementing the feature. I just read about the capabilities on modal.com and thought it would be a good integration feature for llama-index to allow for more configuration.

### Value of Feature

An integration feature to allow users who host their models on modal to use llama-index for their RAG and prompt engineering pipelines.

In [14]:
response = pack.run("Tell me about some open issues related to agents")
print(str(response))

Retrieving with query id None: Tell me about some open issues related to agents
Using query str: agents
Using filters: [('state', '==', 'open')]
Retrieved node with id, entering: 9472
Retrieving with query id 9472: Tell me about some open issues related to agents
Retrieving text node: [Feature Request]: Add stop words to ReAct agent
### Feature Description

The ReAct agent does not use any stop words and the current API does not allow these to be passed to the LLM API.
When using the ReAct agent chat abstraction the LLM often will generate an entire conversation before this output is collected by llama-index and then trimmed to the first `Thought:`, `Action:` set.

This is very, very slow for some models.

A better approach would be to use any available stop word setting in the APIs llama-index calls, or to instead use a streaming approach and implement stop words when possible this way.

Additionally stop words should be plum
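
In the logs above, the auto-retriever turns each question into a short query string plus a list of metadata filters such as `[('month', '==', 12), ('day', '==', 11)]`. In the pack, the vector store applies those filters. The sketch below only illustrates the idea in plain Python, assuming the filters are combined with AND; `OPS`, `matches`, and the sample records are hypothetical.

```python
import operator

# Map the operator strings seen in the logs to Python comparison functions.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(metadata, filters):
    """True if metadata satisfies every (key, op, value) filter (AND semantics)."""
    return all(
        key in metadata and OPS[op](metadata[key], value)
        for key, op, value in filters
    )


# Made-up metadata records for illustration.
issues = [
    {"index_id": "a", "month": 12, "day": 11},
    {"index_id": "b", "month": 12, "day": 21},
    {"index_id": "c", "month": 11, "day": 11},
]
filters = [("month", "==", 12), ("day", "==", 11)]
hits = [m["index_id"] for m in issues if matches(m, filters)]
```

Only the candidates that pass the filters are then ranked by similarity to the query string, which is why a date-only question can still return relevant issues.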

### Retriever-only

We can also get the retriever module and just run that.

In [16]:
retriever = pack.get_modules()["recursive_retriever"]
nodes = retriever.retrieve("Tell me about some open issues related to agents")
print(f"Number of source nodes: {len(nodes)}")
nodes[0].node.metadata

Retrieving with query id None: Tell me about some open issues related to agents
Using query str: agents
Using filters: [('state', '==', 'open')]
Retrieved node with id, entering: 9653
Retrieving with query id 9653: Tell me about some open issues related to agents
Retrieving text node: [Feature Request]: Add multi-filter single key solution
### Feature Description

This is following up on what the bot suggested in this ticket: [https://github.com/run-llama/llama_index/issues/9627](https://github.com/run-llama/llama_index/issues/9627).

I need functionality for multi-filter, single-key with chroma in particular. Something like this, for example:
key="Month", value=["September", "October"] with an OR filter condition and an IN operator
As far as I understand, this is not currently supported.

I have the following working solution and am hoping this or something similar could be merged into the repo:

1) vector_stores/types.py

``

{'state': 'open',
 'created_at': '2023-12-21T15:55:48Z',
 'url': 'https://api.github.com/repos/run-llama/llama_index/issues/9653',
 'source': 'https://github.com/run-llama/llama_index/issues/9653',
 'labels': ['enhancement', 'triage'],
 'index_id': '9653'}