# Google Vertex AI Completion with Elastic

This notebook shows how to use the Elastic API to interact with Google Vertex AI models to perform completion tasks.

You will need access to a Google Cloud project with the Vertex AI APIs enabled; the GCP console will guide you through enabling them. Please note that using Vertex AI may incur costs.

For more information, refer to:
https://cloud.google.com/vertex-ai

https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-googlevertexai


# Install dependencies


**Install Python dependencies**

We will use the `elasticsearch` Python library to create the inference endpoint and the `requests` library to make HTTP calls to the Elasticsearch API.

You may choose a different HTTP library.


In [None]:
!pip install elasticsearch requests

**Import Required Libraries**

Now import the necessary modules, including `requests` for making HTTP calls, `json` for manipulating JSON payloads, and `getpass` for secure input of the username, password, and API keys.

**In production, use a secure secret management solution to handle sensitive data such as usernames, passwords, and API keys.**
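As a minimal sketch of the idea, the helper below reads a secret from an environment variable (which a secret manager could populate) and falls back to an interactive prompt only when the variable is unset. The helper name `read_secret` and the environment variable names are hypothetical and not part of this notebook.

```python
import os
from getpass import getpass


def read_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var if set and non-empty, otherwise prompt interactively."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass(prompt)


# Hypothetical variable names; use whatever your secret manager exports.
# ELASTIC_USER = read_secret("ELASTIC_USER", "ELASTIC USER: ")
# ELASTIC_PASSWORD = read_secret("ELASTIC_PASSWORD", "ELASTIC PASSWORD: ")
```
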


In [None]:
from elasticsearch import Elasticsearch, helpers
from urllib.request import urlopen
from getpass import getpass
import json
import time
import requests
from base64 import b64encode

# Create Elastic client and Inference endpoint


**Instantiate the Elasticsearch Client**

This section sets up your Elasticsearch client. The cell below prompts for a username and password and connects over plain HTTP to the address you set in `host`, which suits a local demo instance. Adjust these settings for your specific environment.


In [None]:
ELASTIC_USER = getpass("ELASTIC USER: ")
ELASTIC_PASSWORD = getpass("ELASTIC PASSWORD: ")
host = ""  # use your Elastic API host
client = Elasticsearch(
    hosts=[f"http://{host}/"],
    basic_auth=(ELASTIC_USER, ELASTIC_PASSWORD),
)

Confirm the Elastic client is connected by looking at its metadata:


In [None]:
print(client.info())

**Create an Inference Endpoint using Gemini**

In this step we create the Inference endpoint to allow calling for Completion tasks.

For this you will need to get the Service account key file from GCP.

**Get the service account credentials**

You will need a service account (SA) and its credentials so the Elasticsearch server can access Vertex AI.

Go to https://console.cloud.google.com/iam-admin/serviceaccounts

1.  Click the `Create service account` button.
2.  Enter a name that suits you.
3.  Click `Create and continue`.
4.  Grant the role Vertex AI User.
5.  Click `Add another role` and then grant the role Service Account Token Creator. This role allows the SA to generate the necessary access tokens.
6.  Click `Done`.

After creating the Service account you need to get the JSON key file:

Go to https://console.cloud.google.com/iam-admin/serviceaccounts and click on the SA just created.

Go to the Keys tab and click Add key -> Create new key -> JSON -> Create. The service account key file should download to your computer automatically.

If you get the error message "Service account key creation is disabled", your administrator needs to change the organization policy _iam.disableServiceAccountKeyCreation_ or grant an exception.

Once you download the JSON file, open it with your favorite editor and copy its contents. Paste the contents when prompted in the step below.
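Before sending the pasted key to Elasticsearch, it can help to confirm that it parses as JSON and looks like a service account key. The following is a minimal sketch, not an official validator: the function name `check_service_account_json` is hypothetical, and the list of required fields is an assumption based on the usual layout of Google service account key files.

```python
import json

# Fields assumed to be present in a Google service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_json(raw: str) -> dict:
    """Parse the pasted key; raise ValueError if it does not look like a service account key."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Not valid JSON: {exc}") from exc
    if not isinstance(key, dict):
        raise ValueError("Expected a JSON object")
    # Collect every required field that is absent or empty.
    missing = [field for field in REQUIRED_FIELDS if not key.get(field)]
    if missing:
        raise ValueError(f"Missing fields: {', '.join(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"Unexpected key type: {key['type']!r}")
    return key
```

The function never prints the key, so the private key does not end up in the notebook output.
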

---

**IMPORTANT**

Note that the use of this service account may affect your GCP billing.

Service account keys are sensitive credentials. Remember to always:

**KEEP SA KEYS SAFE**

**ENFORCE LEAST PRIVILEGE**


In [None]:
# Holds the full contents of the service account JSON key file
GOOGLE_API_KEY = getpass("Paste the Google service account JSON key: ")
inference_id = "completion-notebook-test-1"  # set the inference ID for the endpoint
project_id = ""  # use your GCP project
location = "us-central1"  # set the region in which the model lives

model_id = "gemini-2.5-flash-preview-05-20"  # choose the model; you can use any model available in your Vertex AI project
# Available models per region are listed here: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/locations#google_model_endpoint_locations

**Register the Inference Endpoint**

Using the Elasticsearch client, send a PUT request to the Elastic Inference API to create the endpoint for the Completion task.


In [None]:
client.inference.put(
    task_type="completion",
    inference_id=inference_id,
    body={
        "service": "googlevertexai",
        "service_settings": {
            "service_account_json": GOOGLE_API_KEY,
            "model_id": model_id,
            "location": location,
            "project_id": project_id,
        },
    },
)
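
Because `project_id` is left empty in the settings cell, it is easy to send an incomplete request. As a minimal sketch, the hypothetical helper below builds the same request body as the call above and raises an error if any service setting is empty. It only checks for empty values and makes no claim about what the server itself validates.

```python
def build_vertex_completion_config(service_account_json, model_id, location, project_id):
    """Return the body for a googlevertexai completion endpoint, rejecting empty settings."""
    settings = {
        "service_account_json": service_account_json,
        "model_id": model_id,
        "location": location,
        "project_id": project_id,
    }
    # Names of settings that are None or an empty string.
    empty = [name for name, value in settings.items() if not value]
    if empty:
        raise ValueError(f"Empty service settings: {', '.join(empty)}")
    return {"service": "googlevertexai", "service_settings": settings}
```

The returned dictionary can be passed as the `body` argument of `client.inference.put`.
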

# Call the Inference API for Completion


In [None]:
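# Basic auth token; decode to str so the header does not contain a bytes repr (b'...')
api_key = b64encode(f"{ELASTIC_USER}:{ELASTIC_PASSWORD}".encode()).decode("ascii")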


def extract_content(json_data) -> str:
    try:
        data = json.loads(json_data)
        if "choices" in data and len(data["choices"]) > 0:
            choice = data["choices"][0]
            if "delta" in choice and "content" in choice["delta"]:
                return choice["delta"]["content"]
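    except (ValueError, TypeError, KeyError):
        # Malformed or unexpected payload: fall through and return an empty string
        pass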
    return ""


def extract_content_sse(chunk):
    """
    Extracts the 'delta' content from an SSE chunk with the specific
    {"completion":[{"delta":"..."}]} structure.
    Handles 'data: [DONE]' messages.
    """
    try:
        chunk_str = chunk.decode("utf-8")
        lines = chunk_str.split("\n")
        extracted_deltas = []

        for line in lines:
            line = line.strip()
            if not line:
                continue

            if line.startswith("data:"):
                json_data_str = line[len("data:") :].strip()
                if json_data_str == "[DONE]":
                    # End of stream: stop parsing but keep deltas already collected
                    break
                try:
                    data_obj = json.loads(json_data_str)
                    if "completion" in data_obj and isinstance(
                        data_obj["completion"], list
                    ):
                        for item in data_obj["completion"]:
                            if "delta" in item:
                                extracted_deltas.append(item["delta"])
                except json.JSONDecodeError:
                    pass
            elif line.startswith("event: message"):
                pass

        return "".join(extracted_deltas)

    except Exception as e:
        return ""
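
The requests in the following cells authenticate with HTTP Basic auth, where the header value is the word `Basic` followed by the base64 encoding of `user:password`. Since `b64encode` returns `bytes`, the token has to be decoded to text before it is placed in the header. A minimal sketch, using a hypothetical helper name:

```python
from base64 import b64encode


def basic_auth_header(user: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value from a username and password."""
    token = b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"
```

It could be used as `headers = {"Authorization": basic_auth_header(ELASTIC_USER, ELASTIC_PASSWORD), "content-type": "application/json"}`.
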

In [None]:
url_completion = f"http://{host}/_inference/completion/{inference_id}"
headers = {"Authorization": f"Basic {api_key}", "content-type": "application/json"}
data_completion = {"input": "What is elastic?"}

try:
    response = requests.post(url_completion, headers=headers, json=data_completion)
    response.raise_for_status()

    print(f"Status Code: {response.status_code}")
    print("Response Body:")
    print(json.dumps(response.json(), indent=2))

except requests.exceptions.RequestException as e:
    print(f"Error during regular completion request: {e}")
    if hasattr(e, "response") and e.response is not None:
        print(f"Response content: {e.response.text}")
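
To print only the generated text instead of the whole JSON body, the result can be pulled out of the parsed response. The sketch below assumes the non-streaming response has the shape `{"completion": [{"result": "..."}]}`; verify this against the output printed by the cell above before relying on it. The helper name `completion_text` is hypothetical.

```python
def completion_text(payload: dict) -> str:
    """Join the 'result' strings of a {"completion": [{"result": "..."}]} response body."""
    items = payload.get("completion", [])
    if not isinstance(items, list):
        return ""
    return "".join(
        item["result"]
        for item in items
        if isinstance(item, dict) and isinstance(item.get("result"), str)
    )
```

For example, `print(completion_text(response.json()))` would print the text alone, or an empty string if the shape differs.
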

**Call the Inference using Streaming**

The API will stream the LLM response.


In [None]:
url_stream_completion = f"http://{host}/_inference/completion/{inference_id}/_stream"
headers_stream = {
    "Authorization": f"Basic {api_key}",
    "content-type": "application/json",
}
data_stream_completion = {"input": "What is Elastic? (use spongebob lore to explain)"}

try:

    post_response_stream = requests.post(
        url_stream_completion,
        headers=headers_stream,
        json=data_stream_completion,
        stream=True,
    )
    post_response_stream.raise_for_status()

    print(f"Status Code (Stream): {post_response_stream.status_code}")
    print("Streaming Response:")

    for chunk in post_response_stream.iter_content(chunk_size=None):
        print(extract_content_sse(chunk), end="")

    print("\n--- End of Stream ---")

except requests.exceptions.RequestException as e:
    print(f"Error during streaming completion request: {e}")
    if hasattr(e, "response") and e.response is not None:
        print(f"Response content: {e.response.text}")
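
`iter_content(chunk_size=None)` yields data as it arrives, so a single SSE line can be split across two chunks and then fail to parse. One alternative, sketched below under the same `{"completion":[{"delta":"..."}]}` payload assumption as `extract_content_sse`, is to parse complete lines as produced by `requests`' `iter_lines()`. The generator name `iter_sse_deltas` is hypothetical.

```python
import json


def iter_sse_deltas(lines):
    """Yield 'delta' strings from an iterable of complete SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and 'event:' fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream marker
        try:
            obj = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(obj, dict):
            continue
        completion = obj.get("completion")
        if not isinstance(completion, list):
            continue
        for item in completion:
            if isinstance(item, dict) and "delta" in item:
                yield item["delta"]
```

With the streaming response above, this could be used as `for delta in iter_sse_deltas(post_response_stream.iter_lines()): print(delta, end="")`.
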