# Databricks SDK

This page covers the details of working with the Databricks SDK.

In [27]:
from databricks.sdk import WorkspaceClient

## Workspace client

The most common way to communicate with a Databricks workspace is through a `databricks.sdk.WorkspaceClient`. To create it, you must set up Databricks authentication in one of two ways:

- By creating a `~/.databrickscfg` file.
- By defining environment variables. The most common are:
    - `DATABRICKS_HOST`: your Databricks host.
    - `DATABRICKS_TOKEN`: your access token.

A `.databrickscfg` file with only the default profile may look like this:

```
[DEFAULT]
host = https://dbc-<unique workspace ID>.cloud.databricks.com
token = <here is your token>
```

- The profile name `DEFAULT` matters: this profile is used when no other profile is specified. You can also define profiles under other names.
- Copy the `host` from the browser address bar (only the scheme and host, without the path).
- Create the `token` in the Databricks UI: Settings -> Developer -> Access tokens -> Manage.
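
Because the file uses an INI-like layout, its content can be inspected with the standard-library `configparser`. The following is a minimal sketch for checking what a profile contains, not the SDK's own loading logic; the function name `read_profile`, the `DEV` profile, and all sample values are hypothetical.

```python
import configparser


def read_profile(cfg_text: str, profile: str = "DEFAULT") -> dict:
    """Return the key/value pairs of one profile from .databrickscfg-style text.

    Raises KeyError if the profile does not exist.
    """
    # interpolation=None keeps '%' characters in tokens from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return dict(parser[profile])


# Hypothetical file content; a real file lives at ~/.databrickscfg.
sample = """
[DEFAULT]
host = https://dbc-example.cloud.databricks.com
token = not-a-real-token

[DEV]
host = https://dbc-dev-example.cloud.databricks.com
token = another-fake-token
"""

print(read_profile(sample)["host"])
print(read_profile(sample, "DEV")["host"])
```

Note that `configparser` lets other sections inherit keys from `DEFAULT`, which is a property of this parser and is not claimed here to match how the SDK resolves profiles.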

**Note.** If you have problems with authentication, check the environment variables. Some tools, such as the VSCode Databricks extension, may define default values for variables starting with `DATABRICKS_...`. Also check `~/.ipython/profile_default/startup` for startup scripts that can silently change the behavior of IPython.
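
A quick way to see which of these variables are set in the current process is to list everything that starts with `DATABRICKS_`. This is a small diagnostic sketch; the helper name `databricks_env` and the masking rule (hide any variable whose name contains `TOKEN` or `SECRET`) are assumptions made for illustration.

```python
import os


def databricks_env(environ=None) -> dict:
    """Collect DATABRICKS_* variables, masking the values of secret-looking ones."""
    environ = os.environ if environ is None else environ
    found = {}
    for name, value in environ.items():
        if not name.startswith("DATABRICKS_"):
            continue
        # Never print tokens or secrets in full.
        is_secret = "TOKEN" in name or "SECRET" in name
        found[name] = "***" if is_secret else value
    return found


# Print whatever is defined in the current process (possibly nothing).
for name, value in sorted(databricks_env().items()):
    print(f"{name}={value}")
```

Running this inside the notebook kernel, rather than in a separate terminal, shows the values that the kernel itself sees.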

---

If everything is configured correctly, the following cell should run without any issues:

In [None]:
w = WorkspaceClient()
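
Creating the client may succeed even when the token is later rejected, since a request is typically only sent on the first API call (an assumption worth verifying for your authentication type). A small authenticated call such as `current_user.me()` can serve as a check. In the sketch below the helper name `current_user_name` is hypothetical, and the client is passed in as a parameter so the function does not create one itself.

```python
def current_user_name(client) -> str:
    """Return the user name the given client is authenticated as.

    `client` is expected to be a `databricks.sdk.WorkspaceClient`. The call
    performs a real API request, so it fails if the credentials are wrong.
    """
    return client.current_user.me().user_name


# Usage with a real client (requires configured authentication):
#   from databricks.sdk import WorkspaceClient
#   print(current_user_name(WorkspaceClient()))
```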

## Spark session

You can get a Spark session with access to your Databricks workspace by using the `databricks.connect.DatabricksSession.builder.remote().getOrCreate()` method.

- You cannot create a `DatabricksSession` if a regular `pyspark` is installed in the same Python environment. Run this code from a separate Python environment.
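
To see whether the current environment is affected, you can look up the installed distributions with the standard-library `importlib.metadata`. This is a minimal sketch, assuming the conflict comes from a separately installed `pyspark` distribution next to `databricks-connect`; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


pyspark_version = installed_version("pyspark")
connect_version = installed_version("databricks-connect")
print("pyspark:", pyspark_version)
print("databricks-connect:", connect_version)

if pyspark_version is not None and connect_version is not None:
    print("Both are installed; consider using a separate environment.")
```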

---

The following cell creates a Spark session attached to a Databricks environment running in "serverless" mode.

In [1]:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.remote(serverless=True).getOrCreate()

The following cell displays the list of the tables that are available in my Databricks workspace.

In [2]:
spark.sql("SHOW TABLES").show()

+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|  telco_churn_bronze|      false|
| default|telco_churn_features|      false|
+--------+--------------------+-----------+



## Feature engineering

The `databricks.feature_engineering` module lets you work with the feature store in Databricks.

The `databricks.feature_engineering.FeatureEngineeringClient` object provides the following methods:

| Method                                                                       | Description                                                                                                                           |
| ---------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `create_feature_table(...)`                                                  | Creates a feature table in Unity Catalog, defining its primary keys, schema, timestamp column, and metadata.                          |
| `create_training_set(...)`                                                   | Joins features (via `FeatureLookup` or `FeatureSpec`) to a DataFrame to form a training set with metadata.                            |
| `log_model(...)`                                                             | Logs an MLflow model together with feature metadata so the required features can be fetched automatically at inference.               |
| `score_batch(...)`                                                           | Runs batch inference: given a model URI and a DataFrame, automatically fetches missing features, joins them, and returns predictions. |
| `create_feature_spec(...)`                                                   | Defines a feature spec (collection of `FeatureLookup`/`FeatureFunction`) for use in training sets or feature serving.                 |
| `create_feature_serving_endpoint(...)`                                       | Creates an endpoint for real-time / online feature serving.                                                                           |
| `get_feature_serving_endpoint(...)` / `delete_feature_serving_endpoint(...)` | Manage (retrieve or delete) feature serving endpoints.                                                                                |
| `publish_table(...)`                                                         | Publishes an offline feature table to an online store for low-latency feature access.                                                 |
| `read_table(...)`                                                            | Reads the contents of a feature table into a Spark DataFrame.                                                                         |
| `write_table(...)`                                                           | Inserts or upserts data into a feature table; supports streaming DataFrames.                                                          |
| `set_feature_table_tag(...)` / `delete_feature_table_tag(...)`               | Manage tags (set or delete) on feature tables for governance and organization.                                                        |
| `drop_online_table(...)`                                                     | Removes a published feature table from an online store.                                                                               |

For more details and examples, see the [Feature engineering](databricks_sdk/feature_engineering.ipynb) page.

## OpenAI client

The `WorkspaceClient.serving_endpoints.get_open_ai_client` method returns an `openai.OpenAI` client, which you can use to query served models.

---

The following cell creates the `open_ai_client` and shows that it is indeed an `openai.OpenAI` client.

In [None]:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()

open_ai_client = w.serving_endpoints.get_open_ai_client()
type(open_ai_client)

openai.OpenAI

The following cell illustrates the invocation of the embedding model.

In [None]:

embedding = open_ai_client.embeddings.create(
   model="databricks-gte-large-en",
   input="hello"
)
type(embedding)

openai.types.create_embedding_response.CreateEmbeddingResponse

The result is an `openai` embedding response object.

In [None]:
embedding.data[0].embedding[:20]

[-0.9521484375,
 -0.7998046875,
 -0.79931640625,
 -0.138427734375,
 -0.79150390625,
 -0.31787109375,
 -0.55810546875,
 0.392333984375,
 -0.36767578125,
 0.4013671875,
 -0.0791015625,
 -0.78515625,
 -0.4599609375,
 0.4189453125,
 0.418212890625,
 -0.36767578125,
 -0.587890625,
 -0.466796875,
 0.159423828125,
 -0.359130859375]
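
Vectors like this are commonly compared with cosine similarity. The following sketch uses plain Python lists; the helper name `cosine_similarity` is hypothetical, and the two-element toy vectors only stand in for real `embedding.data[i].embedding` values.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, not real model output.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```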

## Serving endpoint

With Databricks, you can launch an endpoint that serves a registered model. You can do this through the Databricks UI, but here we show how to do it with the Python SDK.

---

The following cell logs a simple function as an MLflow model and registers it under the name `model_name`.

In [23]:
import mlflow

mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

experiment_name = "/Users/fedor.kobak@innowise.com/serving_tests"
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
else:
    experiment_id = experiment.experiment_id

mlflow.set_experiment(experiment_id=experiment_id)

@mlflow.pyfunc.utils.pyfunc
def model(model_input: list[float]) -> list[float]:
    return [x * 2 for x in model_input]

model_name = "workspace.knowledge.serving_example"

with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        name="model",
        python_model=model,
        pip_requirements=[],
        registered_model_name=model_name
    )

Registered model 'workspace.knowledge.serving_example' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/5 [00:00<?, ?it/s]

Created version '2' of model 'workspace.knowledge.serving_example'.


🏃 View run whimsical-smelt-614 at: https://dbc-6bc9e7c2-e867.cloud.databricks.com/ml/experiments/2555847948754149/runs/f2ecb60bae784c7f8fea0e9bf1c6c456
🧪 View experiment at: https://dbc-6bc9e7c2-e867.cloud.databricks.com/ml/experiments/2555847948754149


The following cell defines the endpoint configuration and endpoint name.

In [None]:
from databricks.sdk.service.serving import EndpointCoreConfigInput
config = EndpointCoreConfigInput.from_dict({
    "served_models": [
        {
            "model_name": model_name,
            "model_version": 1,
            "scale_to_zero_enabled": True,
            "workload_size": "Small"
        }
    ]
})

endpoint_name = "serving-example"

EndpointCoreConfigInput(auto_capture_config=None, name=None, served_entities=[], served_models=[ServedModelInput(scale_to_zero_enabled=True, model_name='workspace.knowledge.serving_example', model_version=1, environment_vars=None, instance_profile_arn=None, max_provisioned_concurrency=None, max_provisioned_throughput=None, min_provisioned_concurrency=None, min_provisioned_throughput=None, name=None, provisioned_model_units=None, workload_size='Small', workload_type=None)], traffic_config=None)

Use the `WorkspaceClient.serving_endpoints.create_and_wait` method to create the endpoint, as shown in the following cell.

**Note.** This cell may take a while to run (about 10 minutes).

In [None]:
w = WorkspaceClient()
w.serving_endpoints.create_and_wait(
    name=endpoint_name,
    config=config
)

ServingEndpointDetailed(ai_gateway=None, budget_policy_id=None, config=EndpointCoreConfigOutput(auto_capture_config=None, config_version=1, served_entities=[ServedEntityOutput(creation_timestamp=1759322561000, creator='fedor.kobak@innowise.com', entity_name='workspace.knowledge.serving_example', entity_version='1', environment_vars=None, external_model=None, foundation_model=None, instance_profile_arn=None, max_provisioned_concurrency=None, max_provisioned_throughput=None, min_provisioned_concurrency=None, min_provisioned_throughput=None, name='serving_example-1', provisioned_model_units=None, scale_to_zero_enabled=True, state=ServedModelState(deployment=<ServedModelStateDeployment.DEPLOYMENT_READY: 'DEPLOYMENT_READY'>, deployment_state_message=''), workload_size='Small', workload_type=<ServingModelWorkloadType.CPU: 'CPU'>)], served_models=[ServedModelOutput(creation_timestamp=1759322561000, creator='fedor.kobak@innowise.com', environment_vars=None, instance_profile_arn=None, max_provi

After that, your endpoint is available over the internet. The following cell sends a request to it with `curl`.

To run it, you must define the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`.

In [2]:
%%bash
curl -s \
  -u "token:$DATABRICKS_TOKEN" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": [5.0, 10.0]}' \
  "$DATABRICKS_HOST/serving-endpoints/serving-example/invocations"

{"predictions": [10.0, 20.0]}

The outputs are twice the inputs, exactly as defined in `model`.
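
The same request can be prepared from Python with only the standard library. The sketch below builds the request without sending it; the helper name `build_invocation_request` and the placeholder host and token are hypothetical, and it uses an `Authorization: Bearer` header instead of the `-u token:...` form of the `curl` call.

```python
import json
import urllib.request


def build_invocation_request(host, token, endpoint_name, inputs):
    """Build (but do not send) a POST request to a serving endpoint."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumption: bearer-token auth rather than curl's basic auth.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; in practice read DATABRICKS_HOST and DATABRICKS_TOKEN
# from the environment.
request = build_invocation_request(
    "https://dbc-example.cloud.databricks.com",
    "not-a-real-token",
    "serving-example",
    [5.0, 10.0],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; nothing is sent here, so the block can be run without access to a workspace.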