# Indexing API Example

This notebook demonstrates how to use the Indexing API of the Dial RAG application.

Dial RAG can automatically index new documents it sees in RAG or retrieval requests. However, you can also index documents manually using the Indexing API. This is useful if you know the document in advance and do not want to wait for automatic indexing during the first request, or if you want to attribute the indexing costs to a separate Dial API key. The Indexing API also lets you manage the path to the index storage yourself, which helps if you have a custom pipeline that copies or moves documents after indexing.


This notebook provides a step-by-step example that covers uploading files to Dial, calling the Dial RAG Indexing API to index the files, interpreting the indexing results, and using custom paths to the index files in the Dial storage.

## Setup

Install the dependencies we are going to use in this example:

In [None]:
!pip install -q openai==1.68.2
!pip install -q requests==2.32.3
!pip install -q python-dotenv==1.0.1

Import the packages we are going to use:

In [1]:
import json
import os

import requests
from dotenv import load_dotenv
from openai import AzureOpenAI

Load `DIAL_URL` and `DIAL_API_KEY` from the `.env` file to access the Dial RAG API. We need `DIAL_URL` to connect to the Dial RAG API and `DIAL_API_KEY` to authenticate our requests.

**Note**: in this example we use a pre-defined API key stored in the `.env` file. For production applications, it is recommended to use the per-request key that Dial Core issues for a single request to your application. For more details, see [Per-request key](https://docs.dialx.ai/platform/core/per-request-keys).

In [2]:
load_dotenv(override=True)

dial_url = os.environ["DIAL_URL"]
api_key = os.environ["DIAL_API_KEY"]
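
If one of the variables is missing, `os.environ[...]` raises a bare `KeyError` for the first one only. A minimal sketch of a friendlier check is shown below; `require_env` is a hypothetical helper name introduced for illustration and is not part of Dial or python-dotenv.

```python
import os
from typing import Mapping


def require_env(
    names: list[str], environ: Mapping[str, str] = os.environ
) -> dict[str, str]:
    """Return the requested variables or raise one error listing all missing ones."""
    # Treat empty values as missing, since an empty URL or key is unusable
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}
```

After `load_dotenv(override=True)`, it could be called as `config = require_env(["DIAL_URL", "DIAL_API_KEY"])`.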

Helper function to upload files to Dial. See [Uploading file to the DIAL file storage](https://docs.dialx.ai/tutorials/developers/apps-development/multimodality/dial-cookbook/examples/how_to_call_image_to_text_applications#uploading-file-to-the-dial-file-storage) for more details.

In [3]:
bucket = requests.get(
    f"{dial_url}/v1/bucket",
    headers={"Api-Key": api_key},
    timeout=30,
).json()["bucket"]


def upload_file_to_dial(bucket: str, filename: str, content_type: str):
    with open(filename, "rb") as file:
        metadata = requests.put(
            f"{dial_url}/v1/files/{bucket}/data/{os.path.basename(filename)}",
            headers={"Api-Key": api_key},
            files={"file": (os.path.basename(filename), file, content_type)},
            timeout=30,
        ).json()
    return {"type": metadata["contentType"], "url": metadata["url"]}


print(f"Bucket: {bucket}")

Bucket: 6iTkeGUs2CvUehhYLmMYXB
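
`upload_file_to_dial` takes the content type as an explicit argument. When uploading many files, it may be convenient to guess it from the file extension instead. The sketch below uses the standard-library `mimetypes` module; `guess_content_type` is a hypothetical helper name, and whether Dial RAG can index a file uploaded with the fallback type is an assumption not verified here.

```python
import mimetypes


def guess_content_type(
    filename: str, default: str = "application/octet-stream"
) -> str:
    """Guess a MIME type from the file extension, with a generic fallback."""
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type or default


print(guess_content_type("alps_wiki.pdf"))
```

The result can be passed as the `content_type` argument of `upload_file_to_dial`.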


## Indexing

For this example we will use two files from the test data:
- alps_wiki.pdf - a PDF file with the Wikipedia article about the Alps. Dial RAG indexes this file successfully, and we will use it to demonstrate Dial RAG requests.
- test_file.csv - an unsupported CSV file with some test data. This CSV file has a different number of columns in each row and cannot be parsed as a single table, which Dial RAG does not currently support. This file helps to demonstrate error handling.

The files must be uploaded to Dial before we can use them in the Dial RAG Indexing API. Let's upload them:

In [4]:
pdf_file_attachment = upload_file_to_dial(
    bucket, "../tests/data/alps_wiki.pdf", content_type="application/pdf"
)
unsupported_csv_file_attachment = upload_file_to_dial(
    bucket, "../tests/data/test_file.csv", content_type="text/csv"
)

print(f"PDF file uploaded: {pdf_file_attachment}")
print(f"Unsupported CSV file uploaded: {unsupported_csv_file_attachment}")

PDF file uploaded: {'type': 'application/pdf', 'url': 'files/6iTkeGUs2CvUehhYLmMYXB/data/alps_wiki.pdf'}
Unsupported CSV file uploaded: {'type': 'text/csv', 'url': 'files/6iTkeGUs2CvUehhYLmMYXB/data/test_file.csv'}


We can use the `AzureOpenAI` client to make a request to the Dial RAG Indexing API.
For other ways to call Dial RAG, you can follow the examples for calling Dial applications in [How to call image-to-text applications](https://docs.dialx.ai/tutorials/developers/apps-development/multimodality/dial-cookbook/examples/how_to_call_image_to_text_applications) and adapt the code accordingly.

To make an indexing request, you need to specify `request.type` as `indexing` in the `custom_fields.configuration`.

In [5]:
# Initialize the Azure OpenAI client
client = AzureOpenAI(
    azure_endpoint=dial_url,
    api_key=api_key,
    api_version="2024-10-21",
)


# Send an indexing request
response = client.chat.completions.create(
    model="dial-rag",
    messages=[
        {
            "role": "user",
            "custom_content": {
                "attachments": [
                    pdf_file_attachment,
                    unsupported_csv_file_attachment,
                ],
            },
        }
    ],
    extra_body={
        "custom_fields": {
            "configuration": {
                "request": {
                    "type": "indexing",
                }
            }
        }
    },
    stream=False,  # We can set stream=True, but we will need to assemble the response JSON from streamed chunks
)
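
The nested `extra_body` dictionary is the only part of the call that marks it as an indexing request. Below is a minimal sketch of a helper that builds it. The name `make_request_config` is hypothetical (it is not part of Dial RAG or the OpenAI client); the `retrieval` request type used later in this notebook has the same shape.

```python
from typing import Any


def make_request_config(request_type: str) -> dict[str, Any]:
    """Build the `extra_body` payload that selects a Dial RAG request type."""
    return {
        "custom_fields": {
            "configuration": {
                "request": {"type": request_type},
            }
        }
    }


indexing_extra_body = make_request_config("indexing")
print(indexing_extra_body)
```

The result can be passed as `extra_body=indexing_extra_body` to `client.chat.completions.create`.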

The response will contain attachments with `type` equal to `application/x.aidial-rag-index.v0` pointing to the resulting index files. The `reference_url` field in each of these attachments contains the URL of the original document file.

In [6]:
response_attachments = response.choices[0].message.custom_content["attachments"]

pdf_index_attachment = next(
    (
        attachment
        for attachment in response_attachments
        if attachment["type"] == "application/x.aidial-rag-index.v0"
    ),
    None,
)
pdf_index_attachment

{'type': 'application/x.aidial-rag-index.v0',
 'url': 'files/DaEaz3wZq4g26J5J7BPmGC/dial-rag-index/cbfce81c/02c84941/80408552/957d0c77/dbad09f5/85ab2c5c/56b3acbc/5fd03e99/index.bin',
 'reference_url': 'files/6iTkeGUs2CvUehhYLmMYXB/data/alps_wiki.pdf'}
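
The cell above picks only the first index attachment. When several documents are indexed in one request, it may be more convenient to map every document to its index file. The sketch below assumes the attachment shape shown in the output above; `index_urls_by_document` is a hypothetical helper name, and the sample URLs are placeholders.

```python
INDEX_TYPE = "application/x.aidial-rag-index.v0"


def index_urls_by_document(attachments: list[dict]) -> dict[str, str]:
    """Map each document URL (`reference_url`) to the URL of its index file."""
    return {
        attachment["reference_url"]: attachment["url"]
        for attachment in attachments
        if attachment.get("type") == INDEX_TYPE
    }


# Sample data shaped like the response attachments (placeholder URLs)
sample_attachments = [
    {
        "type": INDEX_TYPE,
        "url": "files/my-bucket/dial-rag-index/example/index.bin",
        "reference_url": "files/my-bucket/data/alps_wiki.pdf",
    },
    {
        "type": "application/x.aidial-rag.indexing-response+json",
        "data": '{"indexing_result":{}}',
    },
]
print(index_urls_by_document(sample_attachments))
```

In the notebook, the same function could be applied to `response_attachments`.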

The response also contains the indexing result in the `application/x.aidial-rag.indexing-response+json` attachment, which can be used to get the error message if indexing fails for some of the documents.

In [7]:
indexing_result_attachment = next(
    (
        attachment
        for attachment in response_attachments
        if attachment["type"]
        == "application/x.aidial-rag.indexing-response+json"
    ),
    None,
)
print(json.loads(indexing_result_attachment["data"]))

{'indexing_result': {'files/6iTkeGUs2CvUehhYLmMYXB/data/test_file.csv': {'errors': [{'message': 'Unable to load document content. Try another document format.'}]}}}
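
To act on failures programmatically, the error messages can be collected per document. The sketch below assumes the JSON structure shown in the output above (`indexing_result` keyed by document URL, each with an `errors` list of objects that have a `message` field); `indexing_errors` is a hypothetical helper name.

```python
import json

INDEXING_RESPONSE_TYPE = "application/x.aidial-rag.indexing-response+json"


def indexing_errors(attachments: list[dict]) -> dict[str, list[str]]:
    """Return error messages per document URL; an empty dict means no errors were reported."""
    errors: dict[str, list[str]] = {}
    for attachment in attachments:
        if attachment.get("type") != INDEXING_RESPONSE_TYPE:
            continue
        result = json.loads(attachment["data"]).get("indexing_result", {})
        for document_url, details in result.items():
            messages = [error["message"] for error in details.get("errors", [])]
            if messages:
                errors[document_url] = messages
    return errors


# Sample attachment with the same structure as the output above
sample = [
    {
        "type": INDEXING_RESPONSE_TYPE,
        "data": json.dumps(
            {
                "indexing_result": {
                    "files/my-bucket/data/test_file.csv": {
                        "errors": [{"message": "Unable to load document content."}]
                    }
                }
            }
        ),
    }
]
for document_url, messages in indexing_errors(sample).items():
    print(f"{document_url}: {'; '.join(messages)}")
```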


## Custom path for indexing files

You can use the same index attachment format to specify a custom path for the index files. This is useful if your application needs to manage the index files itself, for example, if you want to pre-index a document and then copy the index file and the document to another location (such as during the publishing of your application).

To do this, the attachment should have `type` equal to `application/x.aidial-rag-index.v0`, `reference_url` pointing to the URL of the original document file, and `url` pointing to the desired relative URL for the index files.

**Note**: Dial RAG must have write access to the index file path specified in the request. For this example we use `appdata/dial-rag` in our bucket, which is available to Dial RAG with write access. See [Files sharing](https://docs.dialx.ai/platform/core/per-request-keys#files-sharing) for more details on how to share access to files in Dial.

In [8]:
pdf_index_with_custom_path = {
    "type": "application/x.aidial-rag-index.v0",
    "reference_url": pdf_file_attachment["url"],
    "url": f"files/{bucket}/appdata/dial-rag/custom_index.bin",
}

response = client.chat.completions.create(
    model="dial-rag",
    messages=[
        {
            "role": "user",
            "custom_content": {
                "attachments": [
                    pdf_file_attachment,
                    pdf_index_with_custom_path,
                ],
            },
        }
    ],
    extra_body={
        "custom_fields": {
            "configuration": {
                "request": {
                    "type": "indexing",
                }
            }
        }
    },
    stream=False,
)

In [9]:
response_attachments = response.choices[0].message.custom_content["attachments"]
response_attachments

[{'type': 'application/x.aidial-rag-index.v0',
  'url': 'files/6iTkeGUs2CvUehhYLmMYXB/appdata/dial-rag/custom_index.bin',
  'reference_url': 'files/6iTkeGUs2CvUehhYLmMYXB/data/alps_wiki.pdf'},
 {'type': 'application/x.aidial-rag.indexing-response+json',
  'title': 'Indexing results',
  'data': '{"indexing_result":{}}'}]
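
The response echoes the requested index URL, so a pipeline can confirm that the custom path was accepted before copying files elsewhere. The sketch below assumes the attachment fields shown above; `make_index_attachment` and `has_index_attachment` are hypothetical helper names, and the bucket name is a placeholder.

```python
INDEX_TYPE = "application/x.aidial-rag-index.v0"


def make_index_attachment(document_url: str, index_url: str) -> dict[str, str]:
    """Build an index attachment that ties a document to a custom index path."""
    return {"type": INDEX_TYPE, "reference_url": document_url, "url": index_url}


def has_index_attachment(response_attachments: list[dict], requested: dict) -> bool:
    """Check that the response contains the index attachment that was requested."""
    return any(
        attachment.get("type") == INDEX_TYPE
        and attachment.get("url") == requested["url"]
        and attachment.get("reference_url") == requested["reference_url"]
        for attachment in response_attachments
    )


requested = make_index_attachment(
    "files/my-bucket/data/alps_wiki.pdf",
    "files/my-bucket/appdata/dial-rag/custom_index.bin",
)
# Sample response shaped like the output above
sample_response = [
    dict(requested),
    {
        "type": "application/x.aidial-rag.indexing-response+json",
        "title": "Indexing results",
        "data": '{"indexing_result":{}}',
    },
]
print(has_index_attachment(sample_response, requested))
```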

## Using custom index path in requests

You can use the same index attachment format to specify a custom path to the index files in RAG requests.

In [10]:
rag_response = client.chat.completions.create(
    model="dial-rag",
    messages=[
        {
            "role": "user",
            "content": "What is the highest mountain in the Alps?",
            "custom_content": {
                "attachments": [
                    pdf_file_attachment,
                    pdf_index_with_custom_path,
                ],
            },
        }
    ],
    stream=False,
)

print(rag_response.choices[0].message.content)

The highest mountain in the Alps is Mont Blanc, which spans the French-Italian border and has an elevation of 4,810 meters (15,781 feet) [1].


You can use the same index attachment format for retrieval requests as well:

In [11]:
retrieval_response = client.chat.completions.create(
    model="dial-rag",
    messages=[
        {
            "role": "user",
            "content": "What is the highest mountain in the Alps?",
            "custom_content": {
                "attachments": [
                    pdf_file_attachment,
                    pdf_index_with_custom_path,
                ],
            },
        }
    ],
    extra_body={
        "custom_fields": {
            "configuration": {
                "request": {
                    "type": "retrieval",
                }
            }
        },
    },
    stream=False,  # We can set stream=True, but we will need to assemble the response JSON from streamed chunks
)

retrieval_attachments = retrieval_response.choices[0].message.custom_content[
    "attachments"
]
print(json.loads(retrieval_attachments[0]["data"])["chunks"][0])

{'attachment_url': 'files/6iTkeGUs2CvUehhYLmMYXB/data/alps_wiki.pdf', 'source': {'url': 'files/6iTkeGUs2CvUehhYLmMYXB/data/alps_wiki.pdf#page=1', 'display_name': 'data/alps_wiki.pdf'}, 'text': 'cantly from the current revision.\n\nTemplate:Lang-it The Template:IPA-it; Template:Lang-fr Template:IPA- fr; Template:IPA-de; Template:Lang-sl Template:IPA-sl) are the highest and most extensive mountain range system that lies entirely in Europe,[1] stretching approximately 1,200 kilometres (750 mi) across eight Alpine Italy, countries: Austria, France, Germany, Liechtenstein, and Slovenia, Switzerland.[2] The Caucasus Mountains are higher, and the Urals longer, but both lie partly in Asia. The mountains were formed over tens of millions of years as the African and Eurasian tectonic plates collided. Extreme shortening caused by in marine sedimentary rocks rising by thrusting and folding into high mountain peaks such as Mont Blanc and the Matterhorn. Mont Blanc spans the French– Italian border, 