[Inference API] Add Azure AI Studio Embeddings and Chat Completion Support #108472
Conversation
Pinging @elastic/ml-core (Team:ML)
Pinging @elastic/ent-search-eng (Team:Enterprise Search)
Great work, just posting the comments I had so far. I'll post another round shortly.
    ActionListener<List<ChunkedInferenceServiceResults>> listener
) {
    ActionListener<InferenceServiceResults> inferListener = listener.delegateFailureAndWrap(
        (delegate, response) -> delegate.onResponse(translateToChunkedResults(input, response))
I think it might have been an accident that we didn't implement the word boundary chunker for the azure service 🤔. An example of the chunker is here: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/services/cohere/CohereService.java#L227-L239
I think we only support chunking for text embedding. It doesn't look like we have logic in OpenAiService to throw an exception or anything if we try to use it for other model types, though.
@davidkyle do we want chunking support for azure openai and azure studio?
If so, it's ok with me if you want to do those changes in a separate PR @markjhoy.
Good catch @jonathan-buttner - I'll follow up with that change.
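For readers following along, here is a minimal, self-contained sketch of what word-boundary chunking amounts to: split the input into fixed-size word windows so each window can be embedded separately. This is a simplified stand-in written for this discussion, not the chunker in the linked CohereService code.

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a word-boundary chunker: splits an input string into
// chunks of at most maxWords words so each chunk can be embedded on its own.
// Not the actual Elasticsearch implementation referenced above.
final class SimpleWordBoundaryChunker {
    static List<String> chunk(String input, int maxWords) {
        String[] words = input.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            chunks.add(String.join(" ", List.of(words).subList(start, end)));
        }
        return chunks;
    }
}

In the real service, each chunk's embedding would then be folded back into the ChunkedInferenceServiceResults list that the listener shown above expects.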
    protected final AzureAiStudioEndpointType endpointType;
    protected final RateLimitSettings rateLimitSettings;

    protected static final RateLimitSettings DEFAULT_RATE_LIMIT_SETTINGS = new RateLimitSettings(1_440);
Hmm, this rate is probably going to be different depending on the target used 🤔. I tried looking for docs on rate limits for Azure AI Studio as a whole but didn't find much. Have you seen anything?
Maybe we should have the child classes (embedding and chat completion) pass in a default, and those classes can guesstimate the default to use based on the provider?
Or I suppose we could set a fairly low limit (I think the lowest so far is around 240 requests per minute, from the Azure OpenAI chat completions work Tim did) and just document that the user should change this as needed.
If/once we have dynamic rate limiting, I suppose this won't be an issue.
What do you all think @maxhniebergall @davidkyle?
Personally, I would say that we should pick a low limit and make it clear that this is something users should change. As long as the error message is clear, they will understand.
Have you seen anything?
Unfortunately no - and I would assume it's provider-specific as well... For the "realtime" deployments, I suspect there are no limits, since the VM is hosted by the user and throughput would be whatever the VM size can handle...
Or I suppose we could set a fairly low limit
I have a feeling this will probably be the best way to start, as long as we let the user know and have them change it if they need to, as you mention.
A low limit sounds good to me, and we can make it clear in the docs that it needs to be adjusted by the user 👍
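A minimal sketch of the "child classes pass in a default" option from this thread, assuming the base settings class accepts a RateLimitSettings in its constructor. The constructor shape and the 240 requests-per-minute figure (the low starting point mentioned above) are illustrative, not the PR's actual code.

// Illustrative only: each task-type settings class supplies its own conservative
// default rate limit and passes it up to the shared base class.
public class AzureAiStudioChatCompletionServiceSettings extends AzureAiStudioServiceSettings {

    // Deliberately low default (requests per minute); users are expected to raise it as needed.
    private static final RateLimitSettings DEFAULT_RATE_LIMIT_SETTINGS = new RateLimitSettings(240);

    public AzureAiStudioChatCompletionServiceSettings(String target, AzureAiStudioProvider provider, AzureAiStudioEndpointType endpointType) {
        super(target, provider, endpointType, DEFAULT_RATE_LIMIT_SETTINGS);
    }
}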
            }
            dimensionsSetByUser = dims != null;
        }
        case PERSISTENT -> {
Just a heads up: we've run into issues with not correctly adding backwards-compatible logic for parsing the persistent configuration, so I think we're going to stop validating the configuration when parsing it from storage. I don't think you need to change anything at the moment, but we'll probably be going through and removing the validation checks for the persistent code path.
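As a rough sketch of that direction (the method shape and the parseCommonFields helper are assumptions for illustration, not the PR's code), the idea is to run the ValidationException checks only for the REQUEST parse context and accept stored configurations as-is:

// Hypothetical sketch: validate only when parsing a user request, so that stored
// (persistent) configurations keep loading even if validation rules change later.
static AzureAiStudioEmbeddingsServiceSettings fromMap(Map<String, Object> map, ConfigurationParseContext context) {
    ValidationException validationException = new ValidationException();
    var settings = parseCommonFields(map, validationException); // hypothetical helper

    if (context == ConfigurationParseContext.REQUEST && validationException.validationErrors().isEmpty() == false) {
        throw validationException;
    }
    return settings;
}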
@elasticmachine test this please
After a successful PUT, Mistral failed with a 405 status code on inference
✅ Completions with the databricks provider. The default value of
That's interesting... About what would you say was the default from Databricks' side? And I'd opt to add this to the docs to tell the user they may need to increase the
Huh - that is really odd. I just tried it myself without any issues... although, like Databricks, the returned tokens were really low:
Create model:
Test infer:
Response:
@davidkyle - just to be sure - when you created (PUT) your model, did you include the
I take that back... from more testing with Meta and my Mistral tests, the default number of tokens seems low, so we may want to add a default... do you have a suggestion for what is a good value? (And btw, ✅ Meta works for chat completions.)
I can confirm as well that ✅ Microsoft Phi works as expected... and again, I think we do need to set a default max num tokens... the default from this yielded 12 terms in the output... :(
buildkite test this
I will test again, likely a user error
++
From some tests - I think perhaps 64 is a decent number... thoughts?
@davidkyle, @jonathan-buttner - FYI - I added in a default
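For context, a minimal sketch of how a default like the 64 suggested above could be applied when the user leaves the value unset; the class and field names here are illustrative rather than the exact code in the commit.

// Illustrative sketch: fall back to a default max token count when none is provided.
public class AzureAiStudioChatCompletionTaskSettings {

    // 64 is the value floated in this thread; it can be overridden per request.
    static final int DEFAULT_MAX_NEW_TOKENS = 64;

    private final int maxNewTokens;

    public AzureAiStudioChatCompletionTaskSettings(Integer maxNewTokens) {
        this.maxNewTokens = maxNewTokens == null ? DEFAULT_MAX_NEW_TOKENS : maxNewTokens;
    }

    public int maxNewTokens() {
        return maxNewTokens;
    }
}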
FYI - I still can't get a Snowflake deployment due to quota issues, and I have no idea how to get this working... I'm confident in my implementation as written, based on the input/output on the model card, and it seems to be the same as the others... I'm cool with (a) going forward with this, or (b) omitting Snowflake from this - let me know your thoughts.
        ValidationException validationException = new ValidationException();

        Double temperature = extractOptionalDouble(map, TEMPERATURE_FIELD);
I was doing some testing and it looks like we allow temperature and topP to be negative values, but it results in:
{
"completion": [
{
"result": "None"
}
]
}
Should we validate that they're in the correct range? I suppose that could be problematic if the allowable ranges change in the future 🤔. I wonder if we're getting an error response back but not passing it along.
Should we validate that they're in the correct range?
Good question... I'd say yes - but there are no direct docs I can see for the valid ranges for any of these parameters... the only thing that comes close is the AzureOpenAI .DLL / SDK documentation:
- Temperature: valid range of 0.0 to 2.0
- Top P (aka nucleus sampling): no range specified, but I'd assume it's the same as temperature
- Max Tokens: minimum of 0
I wonder if we're getting an error response back but not passing it along.
This I doubt, as we're still getting a 200 response - however, I can certainly see that if all the probabilities are negative it might only consider those > 0.0... but 🤷 I don't know for certain and will do a bit of manual testing...
I wonder if we're getting an error response back but not passing it along.
This I doubt as we're still getting a 200 response - however, I can certainly see if all the probabilities are negative it might only consider those > 0.0... but 🤷 I don't know for certain and will do a bit of testing manually...
FYI - I ran a manual test - and yep, no error. Calling with temperature of -2.0 directly to the .../score endpoint yields a 200 response with the following:
{
"output": "None"
}
✅ Microsoft Phi: phi-3-mini-128k
@jonathan-buttner - FYI - just pushed up a commit that constrains top_p and temperature to the 0.0 to 2.0 range. Max new tokens was already set up to only accept positive integers. 👍
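For illustration, a hedged sketch of what such a constraint can look like, in the spirit of the ServiceUtils helpers touched in this PR; the method name and error wording below are assumptions, not the committed code.

// Hypothetical range-checked variant of extractOptionalDouble: records a
// validation error when the value falls outside [min, max].
static Double extractOptionalDoubleInRange(
    Map<String, Object> map,
    String settingName,
    double min,
    double max,
    ValidationException validationException
) {
    Double value = extractOptionalDouble(map, settingName);
    if (value != null && (value < min || value > max)) {
        validationException.addValidationError(
            "[" + settingName + "] must be between [" + min + "] and [" + max + "]; found [" + value + "]"
        );
        return null;
    }
    return value;
}

// Illustrative usage:
// Double temperature = extractOptionalDoubleInRange(map, TEMPERATURE_FIELD, 0.0, 2.0, validationException);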
Regarding Mistral, I tried again with and without
I hit the same quota problem for Snowflake
LGTM
Just did a ✅ using Meta Llama 7B - and all looks good there - @jonathan-buttner - did you get a chance to test yet?
This PR adds support for Azure AI Studio integration into the Inference API. Currently this supports text_embedding and completion task types.
Prerequisites to Model Creation
You must have an Azure subscription with Azure AI Studio access
You must have a deployed model, either Chat Completion or Embeddings
Model Creation:
Valid {tasktype} types are: [text_embedding, completion]
Required Service Settings:
Embeddings Service Settings
Embeddings Task Settings
(this is also overridable in the inference request)
Completion Service Settings
(no additional service settings)
Completion Task Settings
(these are all optional and can be overridden in the inference request)
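As an illustrative example (the endpoint id and placeholder values below are made up, and the field names should be confirmed against the final docs), creating a text_embedding endpoint looks roughly like this:

PUT _inference/text_embedding/azure_ai_studio_embeddings
{
    "service": "azureaistudio",
    "service_settings": {
        "api_key": "<api_key>",
        "target": "<deployment_target_uri>",
        "provider": "openai",
        "endpoint_type": "token"
    }
}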
Text Embedding Inference
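For example, against the hypothetical endpoint created above:

POST _inference/text_embedding/azure_ai_studio_embeddings
{
    "input": "The quick brown fox jumps over the lazy dog"
}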
Chat Completion Inference
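And an illustrative completion request with a made-up endpoint id; the optional task settings discussed above (e.g. temperature, max_new_tokens) can be overridden here:

POST _inference/completion/azure_ai_studio_completion
{
    "input": "What is the capital of France?",
    "task_settings": {
        "temperature": 0.7,
        "max_new_tokens": 64
    }
}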