[Inference API] Add Azure AI Studio Embeddings and Chat Completion Support #108472
Conversation
Pinging @elastic/ml-core (Team:ML)
Pinging @elastic/ent-search-eng (Team:Enterprise Search)
Great work, just posting the comments I had so far. I'll post another round shortly.
    ActionListener<List<ChunkedInferenceServiceResults>> listener
) {
    ActionListener<InferenceServiceResults> inferListener = listener.delegateFailureAndWrap(
        (delegate, response) -> delegate.onResponse(translateToChunkedResults(input, response))
I think it might have been an accident that we didn't implement the word boundary chunker for the azure service 🤔. An example of the chunker is here: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/cohere/CohereService.java#L227-L239
I think we only support chunking for text embedding. It doesn't look like we have logic in OpenAiService to throw an exception or anything if we try to use it for other model types, though.
@davidkyle do we want chunking support for azure openai and azure studio?
If so, it's ok with me if you want to do those changes in a separate PR @markjhoy.
Good catch @jonathan-buttner - I'll follow up with that change.
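For readers following along, here is a minimal, self-contained sketch of what word-boundary chunking amounts to: split the input into fixed-size word windows so each window can be embedded separately. This is a simplified stand-in written for this discussion, not the chunker in the linked CohereService code.

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a word-boundary chunker: splits an input string into
// chunks of at most maxWords words so each chunk can be embedded on its own.
// Not the actual Elasticsearch implementation referenced above.
final class SimpleWordBoundaryChunker {
    static List<String> chunk(String input, int maxWords) {
        String[] words = input.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            chunks.add(String.join(" ", List.of(words).subList(start, end)));
        }
        return chunks;
    }
}

In the real service, each chunk's embedding would then be folded back into the ChunkedInferenceServiceResults list that the listener shown above expects.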
    protected final AzureAiStudioEndpointType endpointType;
    protected final RateLimitSettings rateLimitSettings;

    protected static final RateLimitSettings DEFAULT_RATE_LIMIT_SETTINGS = new RateLimitSettings(1_440);
Hmm, this rate is probably going to be different depending on the target used 🤔. I tried looking for docs on rate limits for Azure AI Studio as a whole but didn't find much. Have you seen anything?
Maybe we should have the child classes (embedding and chat completion) pass in a default, and those classes can guesstimate the default to use based on the provider?
Or I suppose we could set a fairly low limit (I think the lowest so far is around 240 requests per minute, from the Azure OpenAI chat completions work Tim did) and just document that the user should change this as needed.
If/once we have dynamic rate limiting, I suppose this won't be an issue.
What do you all think @maxhniebergall @davidkyle?
Personally, I would say that we should pick a low limit and make it clear that this is something users should change. As long as the error message is clear, they will understand.
Have you seen anything?
Unfortunately no - and I would assume it's provider-specific as well... For the "realtime" deployments, I suspect there are no limits, since the VM is hosted by the user and throughput would be whatever the VM size can handle...
Or I suppose we could set a fairly low limit
I have a feeling this will probably be the best way to start, as long as we let the user know and have them change it if they need to, as you mention.
A low limit sounds good to me, and we can make it clear in the docs that it needs to be adjusted by the user 👍
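A minimal sketch of the "child classes pass in a default" option from this thread, assuming the base settings class accepts a RateLimitSettings in its constructor. The constructor shape and the 240 requests-per-minute figure (the low starting point mentioned above) are illustrative, not the PR's actual code.

// Illustrative only: each task-type settings class supplies its own conservative
// default rate limit and passes it up to the shared base class.
public class AzureAiStudioChatCompletionServiceSettings extends AzureAiStudioServiceSettings {

    // Deliberately low default (requests per minute); users are expected to raise it as needed.
    private static final RateLimitSettings DEFAULT_RATE_LIMIT_SETTINGS = new RateLimitSettings(240);

    public AzureAiStudioChatCompletionServiceSettings(String target, AzureAiStudioProvider provider, AzureAiStudioEndpointType endpointType) {
        super(target, provider, endpointType, DEFAULT_RATE_LIMIT_SETTINGS);
    }
}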
            }
            dimensionsSetByUser = dims != null;
        }
        case PERSISTENT -> {
Just a heads up: we've run into issues with not correctly adding backwards-compatible logic for parsing the persistent configuration, so I think we're going to stop validating the configuration when parsing it from storage. I don't think you need to change anything at the moment, but we'll probably be going through and removing the validation checks for the persistent code path.
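As a rough sketch of that direction (the method shape and the parseCommonFields helper are assumptions for illustration, not the PR's code), the idea is to run the ValidationException checks only for the REQUEST parse context and accept stored configurations as-is:

// Hypothetical sketch: validate only when parsing a user request, so that stored
// (persistent) configurations keep loading even if validation rules change later.
static AzureAiStudioEmbeddingsServiceSettings fromMap(Map<String, Object> map, ConfigurationParseContext context) {
    ValidationException validationException = new ValidationException();
    var settings = parseCommonFields(map, validationException); // hypothetical helper

    if (context == ConfigurationParseContext.REQUEST && validationException.validationErrors().isEmpty() == false) {
        throw validationException;
    }
    return settings;
}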
@elasticmachine test this please
After a successful PUT, Mistral failed with a 405 status code on inference
✅ Completions with the databricks provider. The default value of
That's interesting... About what would you say was the default from Databricks' side? And I'd opt to add this to the docs to tell the user they may need to increase the
Huh - that is really odd. I just tried it myself without any issues... although, like Databricks, the returned tokens were really low:
Create model:
Test infer:
Response:
@davidkyle - just to be sure - when you created (PUT) your model, did you include the
I take that back... from more testing with Meta and my Mistral tests, the default number of tokens seems low, so we may want to add a default... do you have a suggestion for what is a good value? (And btw, ✅ Meta works for chat completions.)
I can confirm as well that ✅ Microsoft Phi works as expected... and again, I think we do need to set a default max num tokens... the default from this yielded 12 terms in the output... :(
buildkite test this
I will test again, likely a user error
++
From some tests - I think perhaps 64 is a decent number... thoughts?
@davidkyle, @jonathan-buttner - FYI - I added in a default
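For context, a minimal sketch of how a default like the 64 suggested above could be applied when the user leaves the value unset; the class and field names here are illustrative rather than the exact code in the commit.

// Illustrative sketch: fall back to a default max token count when none is provided.
public class AzureAiStudioChatCompletionTaskSettings {

    // 64 is the value floated in this thread; it can be overridden per request.
    static final int DEFAULT_MAX_NEW_TOKENS = 64;

    private final int maxNewTokens;

    public AzureAiStudioChatCompletionTaskSettings(Integer maxNewTokens) {
        this.maxNewTokens = maxNewTokens == null ? DEFAULT_MAX_NEW_TOKENS : maxNewTokens;
    }

    public int maxNewTokens() {
        return maxNewTokens;
    }
}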
FYI - I still can't get a Snowflake deployment due to quota issues, and I have no idea how to get this working... I'm confident in my implementation as written, based on the input/output on the model card, and it seems to be the same as the others... I'm cool with (a) going forward with this, or (b) omitting Snowflake from this - let me know your thoughts.
        ValidationException validationException = new ValidationException();

        Double temperature = extractOptionalDouble(map, TEMPERATURE_FIELD);
I was doing some testing and it looks like we allow temperature and topP to be negative values, but it results in:
{
"completion": [
{
"result": "None"
}
]
}
Should we validate that they're in the correct range? I suppose that could be problematic if the allowable ranges change in the future 🤔. I wonder if we're getting an error response back but not passing it along.
Should we validate that they're in the correct range?
Good question... I'd say yes - but there are no direct docs I can see for the valid ranges for any of these parameters... the only thing that comes close is the AzureOpenAI .DLL / SDK documentation:
- Temperature: valid range of 0.0 to 2.0
- Top P (aka nucleus sampling): no range specified, but I'd assume it's the same as temperature
- Max Tokens: minimum of 0
I wonder if we're getting an error response back but not passing it along.
This I doubt, as we're still getting a 200 response - however, I can certainly see that if all the probabilities are negative it might only consider those > 0.0... but 🤷 I don't know for certain and will do a bit of manual testing...
I wonder if we're getting an error response back but not passing it along.
This I doubt as we're still getting a 200 response - however, I can certainly see if all the probabilities are negative it might only consider those > 0.0... but 🤷 I don't know for certain and will do a bit of testing manually...
FYI - I ran a manual test - and yep, no error. Calling with temperature of -2.0 directly to the .../score endpoint yields a 200 response with the following:
{
"output": "None"
}
✅ Microsoft Phi: phi-3-mini-128k
@jonathan-buttner - FYI - just pushed up a commit that constrains top_p and temperature to the 0.0 to 2.0 range. Max new tokens was already set up to only accept positive integers. 👍
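For illustration, a hedged sketch of what such a constraint can look like, in the spirit of the ServiceUtils helpers touched in this PR; the method name and error wording below are assumptions, not the committed code.

// Hypothetical range-checked variant of extractOptionalDouble: records a
// validation error when the value falls outside [min, max].
static Double extractOptionalDoubleInRange(
    Map<String, Object> map,
    String settingName,
    double min,
    double max,
    ValidationException validationException
) {
    Double value = extractOptionalDouble(map, settingName);
    if (value != null && (value < min || value > max)) {
        validationException.addValidationError(
            "[" + settingName + "] must be between [" + min + "] and [" + max + "]; found [" + value + "]"
        );
        return null;
    }
    return value;
}

// Illustrative usage:
// Double temperature = extractOptionalDoubleInRange(map, TEMPERATURE_FIELD, 0.0, 2.0, validationException);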
Regarding Mistral, I tried again with and without
I hit the same quota problem for Snowflake
LGTM
Just did a ✅ using Meta Llama 7B - and all looks good there - @jonathan-buttner - did you get a chance to test yet?
This PR adds support for Azure AI Studio integration into the Inference API. Currently this supports text_embedding and completion task types.
Prerequisites to Model Creation
You must have an Azure subscription with Azure AI Studio access
You must have a deployed model, either Chat Completion or Embeddings
Model Creation:
Valid {tasktype} types are: [text_embedding, completion]
Required Service Settings:
Embeddings Service Settings
Embeddings Task Settings
(this is also overridable in the inference request)
Completion Service Settings
(no additional service settings)
Completion Task Settings
(these are all optional and can be overridden in the inference request)
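As an illustrative example (the endpoint id and placeholder values below are made up, and the field names should be confirmed against the final docs), creating a text_embedding endpoint looks roughly like this:

PUT _inference/text_embedding/azure_ai_studio_embeddings
{
    "service": "azureaistudio",
    "service_settings": {
        "api_key": "<api_key>",
        "target": "<deployment_target_uri>",
        "provider": "openai",
        "endpoint_type": "token"
    }
}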
Text Embedding Inference
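For example, against the hypothetical endpoint created above:

POST _inference/text_embedding/azure_ai_studio_embeddings
{
    "input": "The quick brown fox jumps over the lazy dog"
}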
Chat Completion Inference
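And an illustrative completion request with a made-up endpoint id; the optional task settings discussed above (e.g. temperature, max_new_tokens) can be overridden here:

POST _inference/completion/azure_ai_studio_completion
{
    "input": "What is the capital of France?",
    "task_settings": {
        "temperature": 0.7,
        "max_new_tokens": 64
    }
}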