
Conversation


@kosabogi kosabogi commented May 22, 2025

📸 Preview

Description

This PR updates the Inference integration documentation to:

  • Clearly state that not enabling adaptive allocations can result in unnecessary resource usage and higher costs.
  • Expand the scope of the page to cover not only third-party service integrations, but also the Elasticsearch service.

Related issue: #1393

@kosabogi kosabogi requested a review from szabosteve May 22, 2025 08:53
@kosabogi kosabogi requested review from a team as code owners May 22, 2025 08:53

@szabosteve szabosteve left a comment


It looks great! Left a couple of comments and suggestions.

kosabogi and others added 2 commits May 22, 2025 13:51
Co-authored-by: István Zoltán Szabó <szabosteve@gmail.com>
@kosabogi

> It looks great! Left a couple of comments and suggestions.

Thank you! I applied your suggestions in my latest commit.

@kosabogi kosabogi requested a review from ppf2 May 22, 2025 14:01
@ppf2

ppf2 commented May 23, 2025

Thanks! I think there are a few different aspects to this we will want to cover (cc: @arisonl @shubhaat )

Adaptive resources enabled (from the UI):

  • Depending on the selected usage level, whether the deployment is optimized for search or ingest, and the platform type (ECH/ECE vs. Serverless), it may or may not autoscale down to 0 allocations when the load is low.

Adaptive resources disabled (from the UI):

  • Even at the low usage level, there will still be at least 1 or 2 allocations, depending on whether the deployment is optimized for search or ingest.

Adaptive allocations enabled (from the API):

  • If enabled, model allocations can scale down to 0 when the load is low, unless the user has explicitly specified a min_number_of_allocations setting greater than 0.

Adaptive allocations disabled (from the API):

  • The user defines the num_allocations used by the model.
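The API-side scenarios above could be sketched as a single request, assuming the `elasticsearch` inference service on a recent release; the endpoint name, model ID, and allocation counts below are placeholders:

```console
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    }
  }
}
```

With `enabled: true` and `min_number_of_allocations: 0`, allocations can scale down to 0 when the load is low; with adaptive allocations disabled (or the object omitted), a fixed `num_allocations` would be set in `service_settings` instead.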

@leemthompo

Some things struck me here:

  • the separation between UI and API tabbed sections seems somewhat arbitrary since both are constrained by the same platform-specific infrastructure realities
  • the format forces readers to mentally cross-reference three variables (usage level, optimization type, platform) across multiple paragraphs
  • perhaps we could replace the entire tabbed prose section with a single table?
    • some of the prose is vague and requires guesswork; it might be better defined explicitly

Please disregard if the linked page contains the full details and we're happy to have a general overview here :)

@kosabogi

> Some things struck me here:
>
>   • the separation between UI and API tabbed sections seems somewhat arbitrary since both are constrained by the same platform-specific infrastructure realities
>   • the format forces readers to mentally cross-reference three variables (usage level, optimization type, platform) across multiple paragraphs
>   • perhaps we could replace the entire tabbed prose section with a single table?
>     • some of the prose is vague and requires guesswork; it might be better defined explicitly
>
> Please disregard if the linked page contains the full details and we're happy to have a general overview here :)

Thank you @ppf2 and @leemthompo for all of your suggestions!
I've updated the Adaptive allocations section by rewriting the content as a table to make it easier to scan and compare configurations across platform, usage level, and optimization type.
Let me know what you think!


@alaudazzi alaudazzi left a comment


I left a minor suggestion, otherwise LGTM.

Co-authored-by: Arianna Laudazzi <46651782+alaudazzi@users.noreply.github.com>
@leemthompo

Thanks @kosabogi, might be nice to get a final 👀 from @ppf2 and @shubhaat before merging :)

@kosabogi

kosabogi commented Jun 3, 2025

Based on my conversation with Pius, I’ve updated my PR as follows:

  • Removed the tables to avoid duplicating the content from the Trained model autoscaling page. Instead, I added links to the relevant sections so users can easily access the autoscaling settings guidance.
  • Added a note to inform users about the implications of setting min_number_of_allocations greater than 0 when adaptive allocations are enabled.

@shubhaat @arisonl I’d love to hear your thoughts - is this a good way to communicate how adaptive allocations work at this point in the guide?

cc @ppf2
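As a sketch of the configuration the new note warns about (field values are placeholders, assuming the `elasticsearch` inference service settings shape):

```console
"adaptive_allocations": {
  "enabled": true,
  "min_number_of_allocations": 1,
  "max_number_of_allocations": 4
}
```

With a minimum of 1, at least one allocation stays up, and is billed, even when the system is idle.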


> For more information about adaptive allocations and resources, refer to the [trained model autoscaling](/deploy-manage/autoscaling/trained-model-autoscaling.md) documentation.
>
> ::::{note}


If we're delegating to the trained model page, should this note still live here or would it make sense to move that too?


@kosabogi kosabogi Jun 3, 2025


Where do you think it would make more sense to place it?
IMO it should live here because it links to the Trained model autoscaling page, where the adaptive allocations information is located. But maybe I'm misunderstanding your comment? 🤔


  • I agree it's good to have a cost warning here. But I'd also expect this to be on the main page we're linking to, so I was just wondering if that duplication is OK.
  • I'm not sure about the current placement of the note.
    • Here's my thinking: the ::::{note} goes into detail about pricing implications, but we've already sent readers to another page with "For more information about adaptive allocations and resources, refer to the trained model autoscaling documentation." So the flow feels wrong. I'd argue for moving the cost note before the link. Perhaps it should also be an important admonition, given it deals with real cost implications.
  • Nit: I think the wording could be more concise too.
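If the note were promoted as suggested, the change might look like the following, assuming the same admonition syntax already used on the page; the wording here is only illustrative:

```
::::{important}
If you set `min_number_of_allocations` to a value greater than 0, you will be charged for those allocations even if the system is not using those resources.
::::
```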

@kosabogi

Sorry, I thought you were referring to the sentence before the note - but it totally makes sense now.

My point is that we should include it here because the support ticket that raised this issue was about customers using inference services without realizing the cost implications.
Part of this is already mentioned on the trained model autoscaling page, specifically:

> Note: If you set the minimum number of allocations to 1, you will be charged even if the system is not using those resources.

In this case, I think the duplication is fine, as it's important to warn users about potential costs (according to the issue, this information was missing from this page).

You're absolutely right that placing the note before that sentence interrupts the flow - I’ll move it and rework the wording a bit.

Thanks so much for your suggestions!

@shubhaat

shubhaat commented Jun 3, 2025

I think these changes are a good first step; eventually we want to hide some of these settings on serverless completely and simplify the UI and choices for customers. Very few users on serverless see the Trained models UI, so my guess is very few read the docs as well.


@leemthompo leemthompo left a comment


👍

@kosabogi kosabogi merged commit 47acfae into main Jun 6, 2025
6 checks passed
@kosabogi kosabogi deleted the inference-adaptive-allocations branch June 6, 2025 08:00