# Configuring Chunking Settings For Inference Endpoints

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/document-chunking/configuring-chunking-settings-for-inference-endpoints.ipynb)


Learn how to configure [chunking settings](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-apis.html#infer-chunking-config) for [Inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-apis.html) endpoints.

# 🧰 Requirements

For this example, you will need:

- An Elastic deployment:
   - We'll be using [Elastic serverless](https://www.elastic.co/docs/current/serverless) for this example (available with a [free trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook))

- Elasticsearch 8.16 or above.

# Create Elastic Cloud deployment or serverless project

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

# Install packages and connect with Elasticsearch Client

To get started, we'll need to connect to our Elastic deployment using the Python client (version 8.12.0 or above).
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to `pip` install the following packages:

- `elasticsearch`

In [6]:
!pip install elasticsearch




Next, we import the modules we need. 🔐 NOTE: `getpass` lets us prompt the user for credentials without echoing them to the terminal or hard-coding them in the notebook.

In [7]:
from elasticsearch import Elasticsearch
from getpass import getpass

Now we can instantiate the Python Elasticsearch client.

First we prompt the user for their Cloud ID and API key.
Then we create a `client` object, which is an instance of the `Elasticsearch` class.

In [8]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Create the client instance
client = Elasticsearch(
    # For local development
    #hosts=["http://localhost:9200"],
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
    request_timeout=120,
    max_retries=10,
    retry_on_timeout=True,
)

### Test the Client
Before you continue, confirm that the client has connected with this test.

In [9]:
print(client.info())

{'name': 'runTask-0', 'cluster_name': 'runTask', 'cluster_uuid': 'P0mSKeG7Qxe0PQszKQUhOA', 'version': {'number': '9.1.0-SNAPSHOT', 'build_flavor': 'default', 'build_type': 'tar', 'build_hash': 'c4dcdee9f8e42987e9d09a667e6b5ebcecc00fa9', 'build_date': '2025-03-03T19:12:08.271285Z', 'build_snapshot': True, 'lucene_version': '10.1.0', 'minimum_wire_compatibility_version': '8.19.0', 'minimum_index_compatibility_version': '8.0.0'}, 'tagline': 'You Know, for Search'}


Refer to [the documentation](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect to a self-managed deployment.

Read [this page](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect using API keys.

<a name="create-the-inference-endpoint"></a>
## Create the inference endpoint object

Let's create the inference endpoint by using the [Create Inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html#put-inference-api-desc).

In this example, you'll create an inference endpoint for the [ELSER integration](https://www.elastic.co/guide/en/elasticsearch/reference/current/infer-service-elser.html), which deploys Elastic's [ELSER model](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) within your cluster. Chunking settings are configurable for any inference endpoint with an embedding task type. A full list of available integrations can be found in the [Create Inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html#put-inference-api-desc) documentation.

To configure chunking settings, the request body must contain a `chunking_settings` map with a `strategy` value, along with any values required by the selected chunking strategy. For this example, you'll configure the `sentence` strategy with a maximum chunk size of 25 words (`max_chunk_size`) and an overlap of 1 sentence between chunks (`sentence_overlap`). For more information on available chunking strategies and their configurable values, see the [chunking strategies documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-apis.html#_chunking_strategies).

In [10]:
client.inference.put(
    task_type="sparse_embedding",
    inference_id="my_elser_endpoint",
    body={
        "service": "elasticsearch",
        "service_settings": {
            "num_allocations": 1,
            "num_threads": 1,
            "model_id": ".elser_model_2",
        },
        # Split on sentence boundaries: at most 25 words per chunk,
        # with 1 sentence of overlap between consecutive chunks
        "chunking_settings": {
            "strategy": "sentence",
            "max_chunk_size": 25,
            "sentence_overlap": 1,
        },
    },
)

ObjectApiResponse({'inference_id': 'my_elser_endpoint', 'task_type': 'sparse_embedding', 'service': 'elasticsearch', 'service_settings': {'num_allocations': 1, 'num_threads': 1, 'model_id': '.elser_model_2'}, 'chunking_settings': {'strategy': 'sentence', 'max_chunk_size': 25, 'sentence_overlap': 1}})
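
To build intuition for what `max_chunk_size` and `sentence_overlap` control, here is a minimal, simplified sketch of greedy sentence chunking in plain Python. It is not Elasticsearch's implementation: the function name `sentence_chunks`, the regex-based sentence splitter, and the choice not to count the overlapping sentence against the word limit are all assumptions made for illustration, so the chunks the endpoint produces may differ.

```python
import re


def sentence_chunks(text, max_chunk_size=25, sentence_overlap=1):
    """Illustrative greedy chunker (hypothetical helper, not the Elasticsearch algorithm)."""
    # Naive split after ., ! or ? followed by whitespace; real splitters are locale aware.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > max_chunk_size:
            # Close the current chunk and carry its last sentence(s) forward as overlap.
            chunks.append(" ".join(current))
            current = current[-sentence_overlap:] if sentence_overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "This is some sample document data. "
    "The data is being used to demonstrate the configurable chunking settings feature. "
    "The configured chunking settings will determine how this text is broken down "
    "into chunks to help increase inference accuracy."
)

for chunk in sentence_chunks(sample, max_chunk_size=25, sentence_overlap=1):
    print(len(chunk.split()), "words:", chunk)
```

In this sketch the carried-over sentence can push a chunk past the word limit; how the real chunker handles that case is described in the chunking strategies documentation.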

<a name="create-the-index"></a>
## Create the index

To see the chunking settings you've configured in action, you'll need to ingest a document into a `semantic_text` field of an index. Let's create an index with a `semantic_text` field linked to the inference endpoint created in the previous step.

In [11]:
client.indices.create(
    index="my_index",
    mappings={
        "properties": {
            "infer_field": {
                "type": "semantic_text",
                "inference_id": "my_elser_endpoint",
            }
        }
    },
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'my_index'})

<a name="ingest-a-document"></a>
## Ingest a document

Now let's ingest a document into the index created in the previous step.

In [12]:
client.index(
    index="my_index",
    document={
        "infer_field": "This is some sample document data. The data is being used to demonstrate the configurable chunking settings feature. The configured chunking settings will determine how this text is broken down into chunks to help increase inference accuracy."
    },
)

ObjectApiResponse({'_index': 'my_index', '_id': '_Fh3XZUBHKE836hZgHKV', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1})

<a name="view-the-chunks"></a>
## View the chunks

The generated chunks and their corresponding inference results are stored in the document under the `chunks` key within the `_inference_fields` metafield. Each chunk is recorded as a pair of character offsets (`start_offset` and `end_offset`) into the original text, alongside its embeddings. Let's look at the chunks generated when ingesting the document in the previous step.

In [13]:
client.search(
    index="my_index",
    body={
        "size": 100,
        "query": {"match_all": {}},
        "fields": ["_inference_fields"],
    },
)

ObjectApiResponse({'took': 32, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 1, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'my_index', '_id': '_Fh3XZUBHKE836hZgHKV', '_score': 1.0, '_source': {'infer_field': 'This is some sample document data. The data is being used to demonstrate the configurable chunking settings feature. The configured chunking settings will determine how this text is broken down into chunks to help increase inference accuracy.', '_inference_fields': {'infer_field': {'inference': {'inference_id': 'my_elser_endpoint', 'model_settings': {'task_type': 'sparse_embedding'}, 'chunks': {'infer_field': [{'start_offset': 0, 'end_offset': 117, 'embeddings': {'##able': 0.73828125, '##e': 0.011505127, '##fi': 1.0898438, '##gur': 1.2460938, '##ing': 1.1835938, '##u': 0.015289307, 'above': 0.28320312, 'algorithm': 0.5683594, 'apache': 0.5839844, 'api': 0.038208008, 'application': 0.041137695, 'ar
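
Because the chunks are stored as offsets rather than text, it can help to slice the original field value to see what each chunk contains. The following is a small sketch that assumes the response shape shown in the output above; `chunk_texts` is a hypothetical helper, and `sample_hit` is a hand-built stand-in for one search hit that keeps only the first chunk's offsets from that output and omits the embeddings.

```python
def chunk_texts(hit, field):
    """Return the text of each chunk by slicing the source text with the stored offsets."""
    source = hit["_source"]
    chunks = source["_inference_fields"][field]["inference"]["chunks"][field]
    text = source[field]
    return [text[c["start_offset"]:c["end_offset"]] for c in chunks]


# Trimmed stand-in for a hit from the search response above (embeddings omitted).
sample_hit = {
    "_source": {
        "infer_field": (
            "This is some sample document data. "
            "The data is being used to demonstrate the configurable chunking settings feature. "
            "The configured chunking settings will determine how this text is broken down "
            "into chunks to help increase inference accuracy."
        ),
        "_inference_fields": {
            "infer_field": {
                "inference": {
                    "inference_id": "my_elser_endpoint",
                    "chunks": {
                        "infer_field": [{"start_offset": 0, "end_offset": 117}]
                    },
                }
            }
        },
    }
}

for i, chunk in enumerate(chunk_texts(sample_hit, "infer_field")):
    print(f"chunk {i}: {chunk!r}")
```

With a live response, the same helper could be applied to each hit, for example `chunk_texts(response["hits"]["hits"][0], "infer_field")`, where `response` is the value returned by `client.search`.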

<a name="conclusion"></a>
## Conclusion

You've now learned how to configure chunking settings for an inference endpoint! For more information about configurable chunking, see the [configuring chunking](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-apis.html#infer-chunking-config) documentation.