From 1573b58e79352150bcf443c86c84b73176dd1e90 Mon Sep 17 00:00:00 2001 From: kosabogi Date: Thu, 22 May 2025 10:40:33 +0200 Subject: [PATCH 1/9] Adds information about the importance of adaptive allocations --- .../elastic-inference/inference-api.md | 23 +++++++++++++------ 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index 20a0b35f49..3fae57f80e 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -9,15 +9,18 @@ products: - id: kibana --- -# Integrate with third-party services +# Inference integrations -{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints to integrate with machine learning models provide by popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more. +{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more. -Learn how to integrate with specific services in the subpages of this section. +You can create a new inference endpoint: + +- using the [Create an inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-put-1) +- through the [Inference endpoints UI](#add-inference-endpoints). ## Inference endpoints UI [inference-endpoints] -You can also manage inference endpoints using the UI. +You can manage inference endpoints using the UI. The **Inference endpoints** page provides an interface for managing inference endpoints. @@ -33,7 +36,7 @@ Available actions: * Copy the inference endpoint ID * Delete endpoints -## Add new inference endpoint +## Add new inference endpoint [add-inference-endpoints] To add a new interference endpoint using the UI: @@ -42,18 +45,24 @@ To add a new interference endpoint using the UI: 1. Provide the required configuration details. 1. Select **Save** to create the endpoint. +If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to reduce resource usage and save costs. + + ## Adaptive allocations [adaptive-allocations] Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load. +This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services like Alibaba Cloud, Cohere, or OpenAI. When adaptive allocations are enabled: * The number of allocations scales up automatically when the load increases. * Allocations scale down to a minimum of 0 when the load decreases, saving resources. -For more information about adaptive allocations and resources, refer to the trained model autoscaling documentation. 
+::::{warning} +If you don't use adaptive allocations, the deployment will always use a fixed amount of resources, which can lead to unnecessary usage and higher costs. +:::: -% TO DO: Add a link to trained model autoscaling when the page is available.% +For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling) documentation. ## Default {{infer}} endpoints [default-enpoints] From b142d842fae14d3f15863b8f5a5eb2517ff1871e Mon Sep 17 00:00:00 2001 From: kosabogi Date: Thu, 22 May 2025 11:04:22 +0200 Subject: [PATCH 2/9] Fixes links --- explore-analyze/elastic-inference/inference-api.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index 3fae57f80e..0e73aadc60 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -11,7 +11,7 @@ products: # Inference integrations -{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more. +{{es}} provides a machine learning [inference API](https://www.elastic.co/docs/api/doc/elasticsearch/v8/operation/operation-inference-get-1) to create and manage inference endpoints that integrate with services such as Elasticsearch (for built-in NLP models like [ELSER](/explore-analyze/machine-learning/nlp/ml-nlp-elser.md) and [E5](/explore-analyze/machine-learning/nlp/ml-nlp-e5.md)), as well as popular third-party services like Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, Hugging Face, and more. You can create a new inference endpoint: @@ -62,7 +62,7 @@ When adaptive allocations are enabled: If you don't use adaptive allocations, the deployment will always use a fixed amount of resources, which can lead to unnecessary usage and higher costs. :::: -For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling) documentation. +For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation. 
## Default {{infer}} endpoints [default-enpoints] From df9f37c2f266a351e5165630a61376d499b7d05d Mon Sep 17 00:00:00 2001 From: kosabogi <105062005+kosabogi@users.noreply.github.com> Date: Thu, 22 May 2025 13:51:34 +0200 Subject: [PATCH 3/9] Update explore-analyze/elastic-inference/inference-api.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: István Zoltán Szabó --- explore-analyze/elastic-inference/inference-api.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index 0e73aadc60..0c56e96ea5 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -59,7 +59,7 @@ When adaptive allocations are enabled: * Allocations scale down to a minimum of 0 when the load decreases, saving resources. ::::{warning} -If you don't use adaptive allocations, the deployment will always use a fixed amount of resources, which can lead to unnecessary usage and higher costs. +If you don't use adaptive allocations, the deployment will always consume a fixed amount of resources, regardless of actual usage. This can lead to inefficient resource utilization and higher costs. :::: For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation. From 3acb326d5e108527ebdf4f1071c9f7de4387db0f Mon Sep 17 00:00:00 2001 From: kosabogi Date: Thu, 22 May 2025 14:04:20 +0200 Subject: [PATCH 4/9] Applies suggestions --- explore-analyze/elastic-inference/inference-api.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index 0c56e96ea5..dc7497c8f6 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -45,13 +45,12 @@ To add a new interference endpoint using the UI: 1. Provide the required configuration details. 1. Select **Save** to create the endpoint. -If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to reduce resource usage and save costs. - +If your inference endpoint uses a model deployed in Elastic’s infrastructure, such as ELSER, E5, or a model uploaded through Eland, you can configure [adaptive allocations](#adaptive-allocations) to dynamically adjust resource usage based on the current demand. ## Adaptive allocations [adaptive-allocations] Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load. -This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services like Alibaba Cloud, Cohere, or OpenAI. +This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services like Alibaba Cloud, Cohere, or OpenAI, because those models are hosted externally and not deployed within your Elasticsearch cluster. 
When adaptive allocations are enabled: From b5b0340dc3c7adb62e4225c5319b06cf1f040017 Mon Sep 17 00:00:00 2001 From: kosabogi Date: Mon, 26 May 2025 11:06:39 +0200 Subject: [PATCH 5/9] Additional information --- .../elastic-inference/inference-api.md | 27 +++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index dc7497c8f6..7a2a47c8de 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -57,6 +57,33 @@ When adaptive allocations are enabled: * The number of allocations scales up automatically when the load increases. * Allocations scale down to a minimum of 0 when the load decreases, saving resources. +### Scaling behavior across configurations [adaptive-allocations-behavior] + +Depending on how adaptive allocations are configured, the actual behavior may vary. + +::::{tab-set} + +:::{tab-item} Configure from the UI +If adaptive resources are [enabled from the UI](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-in-kibana-adaptive-resources), whether the model can scale down to zero allocations depends on the following factors: + - The selected usage level (low, medium, high) + - Whether the model is optimized for [search](/deploy-manage/autoscaling/trained-model-autoscaling.md#search-optimized) or [ingest](/deploy-manage/autoscaling/trained-model-autoscaling.md#ingest-optimized) + - The platform type (for example, Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless) + +In some configurations, the model may not scale down to zero, even if the load is low. + +If adaptive resources are disabled from the UI, the deployment always maintains at least 1 or 2 allocations, depending on the usage level and optimization setting. + +::: + +:::{tab-item} Configure through API +When adaptive allocations are [enabled via API](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-through-apis-adaptive-allocations), the system can scale down to 0 allocations if load is low, unless the `min_number_of_allocations` greater than 0 is explicitly set. + +If adaptive allocations are disabled from the API, the number of model allocations is fixed and set explicitly using the `num_allocations` parameter. + +::: + +:::: + ::::{warning} If you don't use adaptive allocations, the deployment will always consume a fixed amount of resources, regardless of actual usage. This can lead to inefficient resource utilization and higher costs. :::: From d0fdead343fa7a7820e96788b26f5352cb5f21a2 Mon Sep 17 00:00:00 2001 From: kosabogi Date: Mon, 26 May 2025 13:47:47 +0200 Subject: [PATCH 6/9] Applying suggestions --- .../elastic-inference/inference-api.md | 74 +++++++++++++++---- 1 file changed, 61 insertions(+), 13 deletions(-) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index 7a2a47c8de..b6eeb86d2e 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -57,33 +57,81 @@ When adaptive allocations are enabled: * The number of allocations scales up automatically when the load increases. * Allocations scale down to a minimum of 0 when the load decreases, saving resources. 
-### Scaling behavior across configurations [adaptive-allocations-behavior] +### Allocation scaling behavior -Depending on how adaptive allocations are configured, the actual behavior may vary. +The behavior of allocations depends on several factors: + +- Platform (Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless) +- Usage level (low, medium, or high) +- Optimization type (ingest or search) + +The tables below apply when adaptive resource settings are [configured through the UI](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-in-kibana-adaptive-resources). + +#### Adaptive resources enabled ::::{tab-set} -:::{tab-item} Configure from the UI -If adaptive resources are [enabled from the UI](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-in-kibana-adaptive-resources), whether the model can scale down to zero allocations depends on the following factors: - - The selected usage level (low, medium, high) - - Whether the model is optimized for [search](/deploy-manage/autoscaling/trained-model-autoscaling.md#search-optimized) or [ingest](/deploy-manage/autoscaling/trained-model-autoscaling.md#ingest-optimized) - - The platform type (for example, Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless) +:::{tab-item} ECH, ECE +| Usage level | Optimization | Allocations | +|-------------|--------------|-------------------------------| +| Low | Ingest | 0 to 2 if available, dynamically | +| Medium | Ingest | 1 to 32 dynamically | +| High | Ingest | 1 to limit set in the Cloud console*, dynamically | +| Low | Search | 1 | +| Medium | Search | 1 to 2 (if threads=16), dynamically | +| High | Search | 1 to limit set in the Cloud console*, dynamically | -In some configurations, the model may not scale down to zero, even if the load is low. +\* The Cloud console doesn’t directly set an allocations limit; it only sets a vCPU limit. This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads. -If adaptive resources are disabled from the UI, the deployment always maintains at least 1 or 2 allocations, depending on the usage level and optimization setting. +::: + +:::{tab-item} Serverless +| Usage level | Optimization | Allocations | +|-------------|--------------|-------------------------------| +| Low | Ingest | 0 to 2 dynamically | +| Medium | Ingest | 1 to 32 dynamically | +| High | Ingest | 1 to 512 for Search
1 to 128 for Security and Observability | +| Low | Search | 0 to 1 dynamically | +| Medium | Search | 1 to 2 (if threads=16), dynamically | +| High | Search | 1 to 32 (if threads=16), dynamically
1 to 128 for Security and Observability | ::: -:::{tab-item} Configure through API -When adaptive allocations are [enabled via API](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-through-apis-adaptive-allocations), the system can scale down to 0 allocations if load is low, unless the `min_number_of_allocations` greater than 0 is explicitly set. +:::: + +#### Adaptive resources disabled + +::::{tab-set} + +:::{tab-item} ECH, ECE +| Usage level | Optimization | Allocations | +|-------------|--------------|-------------------------------| +| Low | Ingest | 2 if available, otherwise 1, statically | +| Medium | Ingest | The smaller of 32 or the limit set in the Cloud console*, statically | +| High | Ingest | Maximum available set in the Cloud console*, statically | +| Low | Search | 1 if available, statically | +| Medium | Search | 2 (if threads=16) statically | +| High | Search | Maximum available set in the Cloud console*, statically | + +\* The Cloud console doesn’t directly set an allocations limit; it only sets a vCPU limit. This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads. -If adaptive allocations are disabled from the API, the number of model allocations is fixed and set explicitly using the `num_allocations` parameter. - +::: + +:::{tab-item} Serverless +| Usage level | Optimization | Allocations | +|-------------|--------------|-------------------------------| +| Low | Ingest | Exactly 32 | +| Medium | Ingest | 1 to 32 dynamically | +| High | Ingest | 512 for Search
No static allocations for Security and Observability | +| Low | Search | 1 statically | +| Medium | Search | 2 statically (if threads=16) | +| High | Search | 32 statically (if threads=16) for Search
No static allocations for Security and Observability | ::: :::: +You can also configure adaptive allocations via the API using parameters like `num_allocations`, `min_number_of_allocations`, and `threads_per_allocation`. Refer to [Enable autoscaling through APIs](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-through-apis-adaptive-allocations) for details. + ::::{warning} If you don't use adaptive allocations, the deployment will always consume a fixed amount of resources, regardless of actual usage. This can lead to inefficient resource utilization and higher costs. :::: From 05a9a12bfaf402c05284825c19ac9fa2eb34e066 Mon Sep 17 00:00:00 2001 From: kosabogi <105062005+kosabogi@users.noreply.github.com> Date: Tue, 27 May 2025 07:52:05 +0200 Subject: [PATCH 7/9] Update explore-analyze/elastic-inference/inference-api.md Co-authored-by: Arianna Laudazzi <46651782+alaudazzi@users.noreply.github.com> --- explore-analyze/elastic-inference/inference-api.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index b6eeb86d2e..d6e7aaf7e4 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -20,8 +20,6 @@ You can create a new inference endpoint: ## Inference endpoints UI [inference-endpoints] -You can manage inference endpoints using the UI. - The **Inference endpoints** page provides an interface for managing inference endpoints. :::{image} /explore-analyze/images/kibana-inference-endpoints-ui.png From 4112605765c24cd559103f275edee7c83a3ff62e Mon Sep 17 00:00:00 2001 From: kosabogi Date: Tue, 3 Jun 2025 12:32:06 +0200 Subject: [PATCH 8/9] Modifications based on feedback --- .../elastic-inference/inference-api.md | 79 ++----------------- 1 file changed, 7 insertions(+), 72 deletions(-) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index d6e7aaf7e4..722d00a0a1 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -48,7 +48,7 @@ If your inference endpoint uses a model deployed in Elastic’s infrastructure, ## Adaptive allocations [adaptive-allocations] Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load. -This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services like Alibaba Cloud, Cohere, or OpenAI, because those models are hosted externally and not deployed within your Elasticsearch cluster. +This feature is only supported for models deployed in Elastic’s infrastructure, such as ELSER, E5, or models uploaded through Eland. It is not available for third-party services (for example, Alibaba Cloud, Cohere, or OpenAI), because those models are hosted externally and not deployed within your Elasticsearch cluster. 
When adaptive allocations are enabled: @@ -59,83 +59,18 @@ When adaptive allocations are enabled: The behavior of allocations depends on several factors: -- Platform (Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless) +- Deployment type (Elastic Cloud Hosted, Elastic Cloud Enterprise, or Serverless) - Usage level (low, medium, or high) -- Optimization type (ingest or search) +- Optimization type ([ingest](/deploy-manage/autoscaling/trained-model-autoscaling.md#ingest-optimized) or [search](/deploy-manage/autoscaling/trained-model-autoscaling.md#search-optimized)) -The tables below apply when adaptive resource settings are [configured through the UI](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-in-kibana-adaptive-resources). - -#### Adaptive resources enabled - -::::{tab-set} - -:::{tab-item} ECH, ECE -| Usage level | Optimization | Allocations | -|-------------|--------------|-------------------------------| -| Low | Ingest | 0 to 2 if available, dynamically | -| Medium | Ingest | 1 to 32 dynamically | -| High | Ingest | 1 to limit set in the Cloud console*, dynamically | -| Low | Search | 1 | -| Medium | Search | 1 to 2 (if threads=16), dynamically | -| High | Search | 1 to limit set in the Cloud console*, dynamically | - -\* The Cloud console doesn’t directly set an allocations limit; it only sets a vCPU limit. This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads. - -::: - - -:::{tab-item} Serverless -| Usage level | Optimization | Allocations | -|-------------|--------------|-------------------------------| -| Low | Ingest | 0 to 2 dynamically | -| Medium | Ingest | 1 to 32 dynamically | -| High | Ingest | 1 to 512 for Search
1 to 128 for Security and Observability | -| Low | Search | 0 to 1 dynamically | -| Medium | Search | 1 to 2 (if threads=16), dynamically | -| High | Search | 1 to 32 (if threads=16), dynamically
1 to 128 for Security and Observability | -::: - -:::: - -#### Adaptive resources disabled - -::::{tab-set} - -:::{tab-item} ECH, ECE -| Usage level | Optimization | Allocations | -|-------------|--------------|-------------------------------| -| Low | Ingest | 2 if available, otherwise 1, statically | -| Medium | Ingest | The smaller of 32 or the limit set in the Cloud console*, statically | -| High | Ingest | Maximum available set in the Cloud console*, statically | -| Low | Search | 1 if available, statically | -| Medium | Search | 2 (if threads=16) statically | -| High | Search | Maximum available set in the Cloud console*, statically | - -\* The Cloud console doesn’t directly set an allocations limit; it only sets a vCPU limit. This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads. - -::: - -:::{tab-item} Serverless -| Usage level | Optimization | Allocations | -|-------------|--------------|-------------------------------| -| Low | Ingest | Exactly 32 | -| Medium | Ingest | 1 to 32 dynamically | -| High | Ingest | 512 for Search
No static allocations for Security and Observability | -| Low | Search | 1 statically | -| Medium | Search | 2 statically (if threads=16) | -| High | Search | 32 statically (if threads=16) for Search
No static allocations for Security and Observability | -::: - -:::: +For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation. -You can also configure adaptive allocations via the API using parameters like `num_allocations`, `min_number_of_allocations`, and `threads_per_allocation`. Refer to [Enable autoscaling through APIs](/deploy-manage/autoscaling/trained-model-autoscaling.md#enabling-autoscaling-through-apis-adaptive-allocations) for details. +::::{note} +If you enable adaptive allocations and set the `min_number_of_allocations` to a value greater than `0`, you will be charged for the machine learning resources associated with your inference endpoint, even if no inference requests are sent. -::::{warning} -If you don't use adaptive allocations, the deployment will always consume a fixed amount of resources, regardless of actual usage. This can lead to inefficient resource utilization and higher costs. +However, enabling adaptive allocations with a `min_number_of_allocations` greater than `0` helps ensure that the model remains available at all times, without delays due to scaling. This configuration may lead to higher resource usage and associated costs. Consider your workload and availability requirements when choosing the appropriate settings. :::: -For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation. - ## Default {{infer}} endpoints [default-enpoints] Your {{es}} deployment contains preconfigured {{infer}} endpoints which makes them easier to use when defining `semantic_text` fields or using {{infer}} processors. The following list contains the default {{infer}} endpoints listed by `inference_id`: From d4edf6af18dd6426cf2fc0b3ef85963c8cdcc6c3 Mon Sep 17 00:00:00 2001 From: kosabogi Date: Wed, 4 Jun 2025 10:55:37 +0200 Subject: [PATCH 9/9] Applying suggestions --- explore-analyze/elastic-inference/inference-api.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/explore-analyze/elastic-inference/inference-api.md b/explore-analyze/elastic-inference/inference-api.md index 722d00a0a1..039aab445a 100644 --- a/explore-analyze/elastic-inference/inference-api.md +++ b/explore-analyze/elastic-inference/inference-api.md @@ -63,13 +63,13 @@ The behavior of allocations depends on several factors: - Usage level (low, medium, or high) - Optimization type ([ingest](/deploy-manage/autoscaling/trained-model-autoscaling.md#ingest-optimized) or [search](/deploy-manage/autoscaling/trained-model-autoscaling.md#search-optimized)) -For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation. +::::{important} +If you enable adaptive allocations and set the `min_number_of_allocations` to a value greater than `0`, you will be charged for the machine learning resources, even if no inference requests are sent. -::::{note} -If you enable adaptive allocations and set the `min_number_of_allocations` to a value greater than `0`, you will be charged for the machine learning resources associated with your inference endpoint, even if no inference requests are sent. +However, setting the `min_number_of_allocations` to a value greater than `0` keeps the model always available without scaling delays. 
+Choose the configuration that best fits your workload and availability needs.
+::::
 
-However, enabling adaptive allocations with a `min_number_of_allocations` greater than `0` helps ensure that the model remains available at all times, without delays due to scaling. This configuration may lead to higher resource usage and associated costs. Consider your workload and availability requirements when choosing the appropriate settings.
-::::
+For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation.
 
 ## Default {{infer}} endpoints [default-enpoints]
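
The following sketch illustrates the settings discussed in these patches (`adaptive_allocations`, `min_number_of_allocations`, and the fixed `num_allocations` alternative). It creates an ELSER endpoint deployed on Elastic's infrastructure with adaptive allocations enabled; the endpoint ID and allocation counts are placeholder values rather than anything prescribed by the patches above.

```console
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}
```

With `min_number_of_allocations` set to `1` the model stays available without scaling delays, but, as the final patch notes, you are charged for those resources even when no inference requests are sent; setting it to `0` lets the deployment scale down to zero allocations when the load is low. If adaptive allocations are disabled, a fixed `num_allocations` value is set instead.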