# Google Vertex AI Chat completion with Elastic

This notebook shows how to use the Elastic Inference API with Google Vertex AI models to perform chat completion tasks.

You will need access to a Google Cloud project with the Vertex AI APIs enabled. The GCP console guides you through enabling them. Note that using Vertex AI may incur costs.

For more information, refer to:

https://cloud.google.com/vertex-ai

https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-googlevertexai

# Install dependencies

**Install Python dependencies**

We will use the `elasticsearch` Python library to create the inference endpoint and the `requests` library to make HTTP calls to the Elastic API.

In [None]:
!pip install elasticsearch requests

**Import Required Libraries**

Now import the necessary modules, including `requests` for making HTTP calls, `json` for manipulating JSON payloads, and `getpass` for secure input of usernames, passwords, and API keys.

**In production, use a secure secret management system to handle sensitive data such as usernames, passwords, and API keys.**


In [None]:
from elasticsearch import Elasticsearch, helpers
from urllib.request import urlopen
from getpass import getpass
import json
import time
import requests
from base64 import b64encode
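As a small step toward the production advice above, the sketch below reads a secret from an environment variable and only falls back to an interactive prompt when the variable is unset. The helper name `read_secret` and the variable names `ELASTIC_USER` and `ELASTIC_PASSWORD` are assumptions made for illustration; a dedicated secret manager is still the better choice for real deployments.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional


def read_secret(
    name: str,
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the secret called `name`, preferring the environment over a prompt.

    `read_secret` is a hypothetical helper, not part of any library.
    """
    env = os.environ if env is None else env
    value = env.get(name, "")
    if value:
        return value
    # Fall back to an interactive prompt when the variable is unset or empty.
    return prompt(f"{name}: ")
```

With this in place, `ELASTIC_USER = read_secret("ELASTIC_USER")` behaves like the `getpass` call in the next section whenever the variable is not defined.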

# Create Elastic client and Inference endpoint

**Instantiate the Elasticsearch Client**

This section sets up your Elasticsearch client. For demonstration purposes, the code prompts for a username and password and connects to a local Elasticsearch instance over plain HTTP. Adjust the host, scheme, and authentication settings for your environment.

In [None]:
ELASTIC_USER = getpass("ELASTIC USER: ")
ELASTIC_PASSWORD = getpass("ELASTIC PASSWORD: ")
host = "" # use your own host

client = Elasticsearch(
    hosts=[f"http://{host}/"],
    basic_auth=(ELASTIC_USER, ELASTIC_PASSWORD),
)

Confirm the client connected by getting its metadata:


In [None]:
print(client.info())

**Create an Inference Endpoint using Gemini**

In this step we create the inference endpoint that will serve chat completion requests.

For this you need the service account key file from GCP.


**Get the service account credentials**

You will need a service account (SA) and its credentials so that the Elasticsearch server can access Vertex AI.

Go to https://console.cloud.google.com/iam-admin/serviceaccounts

  1.   Click the button Create service account
  2.   Enter a name that suits you.
  3.   Click Create and continue
  4.   Grant the role Vertex AI User.
  5.   Click `Add another role` and then grant the role Service account token creator. This role allows the SA to generate the necessary access tokens.
  6.   Click Done.

After creating the Service account you need to get the JSON key file:

Go to https://console.cloud.google.com/iam-admin/serviceaccounts and click on the SA just created.

Go to the Keys tab and click Add key -> Create new key -> JSON -> Create.

If you get the error message "Service account key creation is disabled", your administrator needs to change the organization policy *iam.disableServiceAccountKeyCreation* or grant an exception.
Otherwise, the service account key file is downloaded to your computer automatically.

Once you download the JSON file, open it with your favorite editor and copy its contents. Paste the contents when prompted in the step below.
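Because the key is pasted by hand, a truncated or wrong paste is easy to miss. The following is a minimal sketch of a sanity check that could be run on the pasted text before it is sent to Elasticsearch. The helper name `find_key_problems` is hypothetical, and the list of checked fields is an assumption based on the usual layout of a downloaded service account key file; it is not a full validation of the key.

```python
import json
from typing import List

# Fields that a downloaded GCP service account key file normally contains.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def find_key_problems(raw: str) -> List[str]:
    """Return a list of problems found in the pasted key (empty if none)."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["expected a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)]
    # Other credential types (for example user credentials) use a different "type".
    if key.get("type") and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']}")
    return problems
```

An empty list means the paste at least looks like a service account key; anything else is worth fixing before creating the endpoint.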


---

**IMPORTANT**

Note that using this service account may affect your GCP billing.

Service account keys can be vulnerable. Remember to always:

**KEEP SA KEYS SAFE**

**ENFORCE LEAST PRIVILEGE**




In [None]:
GOOGLE_API_KEY = getpass("Paste the Google service account JSON key: ")  # full contents of the key file
inference_id = "chat_completion-notebook-test1" # set the inference ID for the endpoint
project_id = "" # use your GCP project
location = "" # set the location in which Vertex AI models live

model_id = "gemini-2.5-flash-preview-05-20" # choose the model, you could use any model from your Vertex AI.

**Create the Chat Completion Inference Endpoint**

Using the Elasticsearch client, send a PUT request to the inference API to create the endpoint for the chat completion task.

In [None]:
client.inference.put(
    task_type="chat_completion",
    inference_id=inference_id,
    body={
        "service": "googlevertexai",
        "service_settings": {
            "service_account_json": GOOGLE_API_KEY,
            "model_id": model_id,
            "location": location,
            "project_id": project_id,
        },
    },
)
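Since `project_id` and `location` start out as empty strings in the configuration cell, it can help to catch unset values before the request is sent. The sketch below builds the same request body as the cell above and fails early if a setting is empty. The function name `build_vertex_chat_body` is hypothetical; the body fields are the ones already used in this notebook.

```python
def build_vertex_chat_body(
    service_account_json: str, model_id: str, location: str, project_id: str
) -> dict:
    """Build the endpoint body, raising ValueError if any setting is empty."""
    settings = {
        "service_account_json": service_account_json,
        "model_id": model_id,
        "location": location,
        "project_id": project_id,
    }
    empty = [name for name, value in settings.items() if not value.strip()]
    if empty:
        raise ValueError(f"Empty service settings: {', '.join(empty)}")
    return {"service": "googlevertexai", "service_settings": settings}
```

The result can be passed as `body=build_vertex_chat_body(GOOGLE_API_KEY, model_id, location, project_id)` in the `client.inference.put` call.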

# Call the Inference API for Chat Completion

In [None]:
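# b64encode returns bytes; decode to str so the header reads "Basic <token>" and not "Basic b'<token>'".
api_key = b64encode(f"{ELASTIC_USER}:{ELASTIC_PASSWORD}".encode()).decode("ascii")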

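def extract_content(json_data: str) -> str:
    """Return the text delta of the first choice, or "" when there is none."""
    try:
        data = json.loads(json_data)
        if "choices" in data and len(data["choices"]) > 0:
            choice = data["choices"][0]
            if "delta" in choice and "content" in choice["delta"]:
                return choice["delta"]["content"] or ""
    except (ValueError, TypeError, KeyError):
        # Malformed or non-JSON payloads (for example "[DONE]") carry no content.
        pass
    return ""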

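def extract_content_sse(chunk: bytes) -> str:
    """Concatenate the content of every `data:` line found in one chunk."""
    chunk_str: str = chunk.decode("utf-8")
    parts = []
    # A chunk may hold zero, one, or several "data: " lines.
    for line in chunk_str.splitlines():
        if line.startswith("data: "):
            parts.append(extract_content(line[len("data: "):]))
    return "".join(parts)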

In [None]:
url = f"http://{host}/_inference/chat_completion/{inference_id}/_stream"
headers = {
    "Authorization": f"Basic {api_key}",
    "Content-Type": "application/json",
}
data = {
    "model": "gemini-2.5-flash-preview-05-20",
    "messages": [{"role": "user", "content": "What is Elastic?"}],
}

post_response = requests.post(url, headers=headers, json=data, stream=True)

for chunk in post_response.iter_content(chunk_size=None):
    # Print each content delta as soon as it arrives.
    print(extract_content_sse(chunk), end="")
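Network chunks do not necessarily line up with server-sent event boundaries, so a single event can arrive split across two chunks. The sketch below is one way to handle that: it buffers bytes until a full line is available and only then parses it. The function names are hypothetical. Treating `[DONE]` as an end-of-stream marker, and assuming each event carries its JSON on a single `data:` line, are assumptions about the stream format, not guarantees.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each `data:` line, buffering partial lines across chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are processed; an unfinished tail stays in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if text.startswith("data:"):
                yield text[len("data:"):].strip()
    # Flush a final line that was not terminated by a newline.
    tail = buffer.decode("utf-8").strip()
    if tail.startswith("data:"):
        yield tail[len("data:"):].strip()


def iter_content(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield the text deltas contained in a stream of raw SSE chunks."""
    for payload in iter_sse_data(chunks):
        if payload == "[DONE]":  # assumed end-of-stream marker
            break
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip payloads that are not JSON
        if not isinstance(event, dict):
            continue
        for choice in event.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                yield content
```

It can replace the loop above with `for piece in iter_content(post_response.iter_content(chunk_size=None)): print(piece, end="")`.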

**Call the Inference using Tools**

Chat completion requests can also declare tools (functions) that the model may choose to call.

In [None]:
url = f"http://{host}/_inference/chat_completion/{inference_id}/_stream"
headers = {
    "Authorization": f"Basic {api_key}",
    "Content-Type": "application/json",
}
data = {
    "model": "gemini-2.5-flash-preview-05-20",
    "messages": [{"role": "user", "content": "What is the weather like in Boston today?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                        },
                    },
                    "required": ["location"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

post_response = requests.post(url, headers=headers, json=data, stream=True)

print("Post inference response:")
for chunk in post_response.iter_content(chunk_size=None):
    print(chunk.decode("utf-8"), end="")
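The raw stream printed above still has to be turned into an actual function call. The sketch below merges streamed tool-call fragments into complete calls and dispatches one to a local function. It assumes the parsed events follow the OpenAI-style delta layout (`choices[].delta.tool_calls[]` with `index`, `id`, and `function.name` / `function.arguments`), which should be checked against the real output. `collect_tool_calls`, `run_tool_call`, and the local `get_current_weather` stand-in are all hypothetical.

```python
import json
from typing import Dict, Iterable, List


def collect_tool_calls(events: Iterable[dict]) -> List[dict]:
    """Merge streamed tool-call fragments into complete calls, ordered by index."""
    calls: Dict[int, dict] = {}
    for event in events:
        for choice in event.get("choices", []):
            delta = choice.get("delta") or {}
            for fragment in delta.get("tool_calls") or []:
                call = calls.setdefault(
                    fragment.get("index", 0), {"id": None, "name": "", "arguments": ""}
                )
                if fragment.get("id"):
                    call["id"] = fragment["id"]
                function = fragment.get("function") or {}
                if function.get("name"):
                    call["name"] = function["name"]
                # Arguments may arrive as several string fragments of one JSON document.
                call["arguments"] += function.get("arguments") or ""
    return [calls[index] for index in sorted(calls)]


def get_current_weather(location: str, unit: str = "celsius") -> dict:
    """Hypothetical local stand-in for the tool declared in the request."""
    return {"location": location, "unit": unit, "forecast": "unknown"}


def run_tool_call(call: dict) -> dict:
    """Dispatch a completed call to the matching local function."""
    if call["name"] != "get_current_weather":
        raise ValueError(f"Unknown tool: {call['name']}")
    return get_current_weather(**json.loads(call["arguments"] or "{}"))
```

The events passed to `collect_tool_calls` would be the JSON objects parsed from the `data:` lines of the stream.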


**Call the Inference with System Messages**

System messages can be included in the `messages` payload to give the model more context about the conversation.

In [None]:
url = f"http://{host}/_inference/chat_completion/{inference_id}/_stream"
headers = {
    "Authorization": f"Basic {api_key}",
    "Content-Type": "application/json",
}
data = {
    "model": "gemini-2.5-flash-preview-05-20",
    "messages": [
        {
            "role": "system",
            "content": "You are an AI travel assistant that can read images, call functions, and interpret structured data. Be helpful and accurate.",
        },
        {
            "role": "user",
            "content": "When is the best time to visit Japan?",
        },
    ],
}

post_response = requests.post(url, headers=headers, json=data, stream=True)

print("Post inference response:")
for chunk in post_response.iter_content(chunk_size=None):
    print(chunk.decode("utf-8"), end="")
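When the same system message is reused across several questions, a small helper can keep the payload consistent. This is only a sketch: `build_messages` is a hypothetical name, and carrying earlier turns in a `history` list is an assumption about how a multi-turn conversation might be tracked on the client side.

```python
from typing import List, Optional


def build_messages(
    user_prompt: str,
    system_prompt: Optional[str] = None,
    history: Optional[List[dict]] = None,
) -> List[dict]:
    """Return a messages list: optional system message, prior turns, then the new user turn."""
    messages: List[dict] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # Earlier turns are copied in order so the caller's list is not modified.
    messages.extend(history or [])
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

For example, `data["messages"] = build_messages("When is the best time to visit Japan?", system_prompt="You are an AI travel assistant.")` produces the same shape as the payload above.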