From 83167bf01f70bf58410be7010cca8cc88890b4d1 Mon Sep 17 00:00:00 2001
From: Paul Cornell
Date: Mon, 3 Feb 2025 15:30:38 -0800
Subject: [PATCH 1/2] Databricks Volumes and Delta Tables connectors: New how-to video links, more links to 3rd-party docs

---
 .../databricks-delta-table.mdx | 111 ++++++++++++++----
 .../databricks-volumes.mdx | 94 ++++++++-------
 2 files changed, 139 insertions(+), 66 deletions(-)

diff --git a/snippets/general-shared-text/databricks-delta-table.mdx b/snippets/general-shared-text/databricks-delta-table.mdx
index 1063dc94..b4c82e80 100644
--- a/snippets/general-shared-text/databricks-delta-table.mdx
+++ b/snippets/general-shared-text/databricks-delta-table.mdx
@@ -9,10 +9,35 @@
- A SQL warehouse for [AWS](https://docs.databricks.com/compute/sql-warehouse/create.html), [Azure](https://learn.microsoft.com/azure/databricks/compute/sql-warehouse/create), or [GCP](https://docs.gcp.databricks.com/compute/sql-warehouse/create.html).
+
+ The following video shows how to create a SQL warehouse, get its **Server Hostname** and **HTTP Path** values, and set permissions for someone other than the warehouse's owner to use it:
+
+
+
+
- An all-purpose cluster for [AWS](https://docs.databricks.com/compute/use-compute.html), [Azure](https://learn.microsoft.com/azure/databricks/compute/use-compute), or [GCP](https://docs.gcp.databricks.com/compute/use-compute.html).
+ The following video shows how to create an all-purpose cluster, get its **Server Hostname** and **HTTP Path** values, and set permissions for someone other than the cluster's owner to use it:
+
+
+
- The SQL warehouse's or cluster's **Server Hostname** and **HTTP Path** values for [AWS](https://docs.databricks.com/integrations/compute-details.html), [Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details), or [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).
@@ -25,7 +50,7 @@
 for [AWS](https://docs.databricks.com/catalogs/create-catalog.html),
 [Azure](https://learn.microsoft.com/azure/databricks/catalogs/create-catalog), or
 [GCP](https://docs.gcp.databricks.com/catalogs/create-catalog.html).
- - A schema
+ - A schema (formerly known as a database)
 for [AWS](https://docs.databricks.com/schemas/create-schema.html),
 [Azure](https://learn.microsoft.com/azure/databricks/schemas/create-schema), or
 [GCP](https://docs.gcp.databricks.com/schemas/create-schema.html)
@@ -34,7 +59,19 @@
 for [AWS](https://docs.databricks.com/tables/managed.html),
 [Azure](https://learn.microsoft.com/azure/databricks/tables/managed), or
 [GCP](https://docs.gcp.databricks.com/tables/managed.html)
- within that schema.
+ within that schema (formerly known as a database).
+
+ The following video shows how to create a catalog, schema (formerly known as a database), and a table in Unity Catalog, and set privileges for someone other than their owner to use them:
+
+
 This table must contain the following column names and their data types:
@@ -86,22 +123,36 @@
 );
 ```
+
+ In Databricks, a table's _schema_ (its column names and data types) is different from a _schema_ (formerly known as a database) in the catalog-schema object hierarchy in Unity Catalog.
+
+
- Within Unity Catalog, a volume for
 [AWS](https://docs.databricks.com/volumes/utility-commands.html),
 [Azure](https://learn.microsoft.com/azure/databricks/volumes/utility-commands), or
 [GCP](https://docs.gcp.databricks.com/volumes/utility-commands.html)
- within the same schema as the table.
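+ If you prefer to set up these Unity Catalog objects with SQL instead of the UI, the following is a minimal sketch. The names `my_catalog`, `my_schema`, and `my_volume` are placeholders to replace with your own, and the statements assume you run them from a notebook or SQL editor attached to compute with sufficient privileges:
+
+ ```sql
+ -- Create the catalog and the schema (formerly known as a database).
+ CREATE CATALOG IF NOT EXISTS my_catalog;
+ CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema;
+
+ -- Create a managed volume in that schema, alongside the table.
+ CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.my_volume;
+ ```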
-- For Databricks personal access token authentication to the workspace, the
- Databricks personal access token value for
- [AWS](https://docs.databricks.com/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users),
- [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat#azure-databricks-personal-access-tokens-for-workspace-users), or
- [GCP](https://docs.gcp.databricks.com/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users).
- This token must be for the workspace user who
- has the appropriate access permissions to the catalog, schema, table, volume, and cluster or SQL warehouse,
+ within the same schema (formerly known as a database) as the table.
+
+ The following video shows how to create a catalog, schema (formerly known as a database), and a volume in Unity Catalog, and set privileges for someone other than their owner to use them:
+
+
+ If you already have a table that you want to use, then use the table's existing catalog and schema (formerly known as a database). Do not create a new catalog or schema (formerly known as a database). Just create the new volume. The volume and table must be within the same schema (formerly known as a database) and catalog.
+
+
+
+
- For Databricks managed service principal authentication (using Databricks OAuth M2M) to the workspace:
- A Databricks managed service principal.
- This service principal must have the appropriate access permissions to the catalog, schema, table, volume, and cluster or SQL warehouse.
+ This service principal must have the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.
- The service principal's **UUID** (or **Client ID** or **Application ID**) value.
- The OAuth **Secret** value for the service principal.
@@ -110,7 +161,7 @@
 [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html).
- For Azure Databricks, this connector only supports Databricks managed service principals.
+ For Azure Databricks, this connector only supports Databricks managed service principals for authentication.
 Microsoft Entra ID managed service principals are not supported.
@@ -126,6 +177,26 @@
 allowfullscreen
 >
+- For Databricks personal access token authentication to the workspace, the
+ Databricks personal access token value for
+ [AWS](https://docs.databricks.com/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users),
+ [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat#azure-databricks-personal-access-tokens-for-workspace-users), or
+ [GCP](https://docs.gcp.databricks.com/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspace-users).
+ This token must be for the workspace user who
+ has the appropriate access permissions to the catalog, schema (formerly known as a database), table, volume, and cluster or SQL warehouse.
+
+ The following video shows how to create a Databricks personal access token:
+
+
+
- The Databricks workspace user or Databricks managed service principal must have the following _minimum_ set of permissions and privileges to write to an existing volume or table in Unity Catalog:
@@ -140,7 +211,7 @@
 - To access a Unity Catalog volume, the following privileges:
 - `USE CATALOG` on the volume's parent catalog in Unity Catalog.
- - `USE SCHEMA` on the volume's parent schema in Unity Catalog.
+ - `USE SCHEMA` on the volume's parent schema (formerly known as a database) in Unity Catalog.
- `READ VOLUME` and `WRITE VOLUME` on the volume.
 Learn how to check and set Unity Catalog privileges for
@@ -148,22 +219,10 @@
 [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/manage-privileges/#grant), or
 [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges).
- The following videos shows how to grant a Databricks managed service principal privileges to a Unity Catalog volume:
-
-
- To access a Unity Catalog table, the following privileges:
- `USE CATALOG` on the table's parent catalog in Unity Catalog.
- - `USE SCHEMA` on the tables's parent schema in Unity Catalog.
+ - `USE SCHEMA` on the table's parent schema (formerly known as a database) in Unity Catalog.
- `MODIFY` and `SELECT` on the table.
 Learn how to check and set Unity Catalog privileges for
diff --git a/snippets/general-shared-text/databricks-volumes.mdx b/snippets/general-shared-text/databricks-volumes.mdx
index 4c3c6f5a..8ccabdde 100644
--- a/snippets/general-shared-text/databricks-volumes.mdx
+++ b/snippets/general-shared-text/databricks-volumes.mdx
@@ -1,19 +1,10 @@
-
-
-The preceding video shows how to use Databricks personal access tokens (PATs), which are supported only for [Unstructured Ingest](/ingestion/overview).
-
-To learn how to use Databricks managed service principals, which are supported by both the [Unstructured Platform](/platform/overview) and Unstructured Ingest,
-see the additional video later on this page.
-
-- The Databricks workspace URL. Get the workspace URL for
+- A Databricks account on [AWS](https://docs.databricks.com/getting-started/free-trial.html),
+ [Azure](https://learn.microsoft.com/azure/databricks/getting-started/), or
+ [GCP](https://docs.gcp.databricks.com/getting-started/index.html).
+- A workspace within the Databricks account for [AWS](https://docs.databricks.com/admin/workspace/index.html),
+ [Azure](https://learn.microsoft.com/azure/databricks/admin/workspace/), or
+ [GCP](https://docs.gcp.databricks.com/admin/workspace/index.html).
+- The workspace's URL. Get the workspace URL for
 [AWS](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids),
 [Azure](https://learn.microsoft.com/azure/databricks/workspace/workspace-details#workspace-instance-names-urls-and-ids), or
 [GCP](https://docs.gcp.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids).
@@ -29,6 +20,13 @@ see the additional video later on this page.
 [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/), or
 [GCP](https://docs.gcp.databricks.com/dev-tools/auth/index.html).
+ For the [Unstructured Platform](/platform/overview), only Databricks OAuth machine-to-machine (M2M) authentication is supported for
+ [AWS](https://docs.databricks.com/dev-tools/auth/oauth-m2m.html),
+ [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/oauth-m2m), and
+ [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html).
+ You will need the **Client ID** (or **UUID** or **Application ID**) and OAuth **Secret** (client secret) values for the corresponding service principal.
+ Note that for Azure, only Databricks managed service principals are supported. Microsoft Entra ID managed service principals are not supported.
+ The following video shows how to create a Databricks managed service principal:
-
- For the [Unstructured Platform](/platform/overview), only Databricks OAuth machine-to-machine (M2M) authentication is supported for AWS, Azure, and GCP.
- You will need the the **Client ID** (or **UUID** or **Application** ID) and OAuth **Secret** (client secret) values for the corresponding service principal.
- Note that for Azure, only Databricks managed service principals are supported. Microsoft Entra ID managed service principals are not supported.
 For [Unstructured Ingest](/ingestion/overview), the following Databricks authentication types are supported:
- - For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token's value.
- - For username and password (basic) authentication (AWS only): The user's name and password values.
- - For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
- - For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
- - For Azure managed identities (MSI) authentication (Azure only): The client ID value for the corresponding managed identity.
- - For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
- - For Azure CLI authentication (Azure only): No additional values.
- - For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
- - For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account's credentials file.
- - For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account's email address.
+ - For Databricks personal access token authentication for
+ [AWS](https://docs.databricks.com/dev-tools/auth/pat.html),
+ [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/pat), or
+ [GCP](https://docs.gcp.databricks.com/dev-tools/auth/pat.html): The personal access token's value.
+
+ The following video shows how to create a Databricks personal access token:
+
+
+
+ - For username and password (basic) authentication ([AWS](https://docs.databricks.com/archive/dev-tools/basic.html) only): The user's name and password values.
+ - For OAuth machine-to-machine (M2M) authentication ([AWS](https://docs.databricks.com/dev-tools/auth/oauth-m2m.html),
+ [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/oauth-m2m), and
+ [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html)): The client ID and OAuth secret values for the corresponding service principal.
+ - For OAuth user-to-machine (U2M) authentication ([AWS](https://docs.databricks.com/dev-tools/auth/oauth-u2m.html),
+ [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/oauth-u2m), and
+ [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-u2m.html)): No additional values.
+ - For Azure managed identities (formerly Managed Service Identities (MSI)) authentication ([Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/azure-mi) only): The client ID value for the corresponding managed identity.
+ - For Microsoft Entra ID service principal authentication ([Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/azure-sp) only): The tenant ID, client ID, and client secret values for the corresponding service principal.
+ - For Azure CLI authentication ([Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/azure-cli) only): No additional values.
+ - For Microsoft Entra ID user authentication ([Azure](https://learn.microsoft.com/azure/databricks/dev-tools/user-aad-token) only): The Entra ID token for the corresponding Entra ID user.
+ - For Google Cloud Platform credentials authentication ([GCP](https://docs.gcp.databricks.com/dev-tools/auth/gcp-creds.html) only): The local path to the corresponding Google Cloud service account's credentials file.
+ - For Google Cloud Platform ID authentication ([GCP](https://docs.gcp.databricks.com/dev-tools/auth/gcp-id.html) only): The Google Cloud service account's email address.
- The name of the parent catalog in Unity Catalog for
 [AWS](https://docs.databricks.com/catalogs/create-catalog.html),
 [Azure](https://learn.microsoft.com/azure/databricks/catalogs/create-catalog), or
 [GCP](https://docs.gcp.databricks.com/catalogs/create-catalog.html)
 for the volume.
-- The name of the parent schema in Unity Catalog for
+- The name of the parent schema (formerly known as a database) in Unity Catalog for
 [AWS](https://docs.databricks.com/schemas/create-schema.html),
 [Azure](https://learn.microsoft.com/azure/databricks/schemas/create-schema), or
 [GCP](https://docs.gcp.databricks.com/schemas/create-schema.html)
 for the volume.
@@ -73,22 +87,22 @@
 existing volume in Unity Catalog:
 - `USE CATALOG` on the volume's parent catalog in Unity Catalog.
- - `USE SCHEMA` on the volume's parent schema in Unity Catalog.
+ - `USE SCHEMA` on the volume's parent schema (formerly known as a database) in Unity Catalog.
 - `READ VOLUME` and `WRITE VOLUME` on the volume.
- Learn how to check and set Unity Catalog privileges for
- [AWS](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges),
- [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/manage-privileges/#grant), or
- [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges).
+ The following video shows how to create and set privileges for a catalog, schema (formerly known as a database), and volume in Unity Catalog:
- The following videos shows how to grant a Databricks managed service principal privileges to a Unity Catalog volume:
-
\ No newline at end of file
+ >
+
+ Learn more about how to check and set Unity Catalog privileges for
+ [AWS](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges),
+ [Azure](https://learn.microsoft.com/azure/databricks/data-governance/unity-catalog/manage-privileges/#grant), or
+ [GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/manage-privileges/index.html#show-grant-and-revoke-privileges).
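+ As a sketch of what those grants can look like in SQL (the catalog, schema, and volume names below are placeholders, and the backquoted UUID stands in for your service principal's **UUID**/**Client ID** value):
+
+ ```sql
+ -- Grant the minimum privileges listed above to a Databricks managed
+ -- service principal, identified by its application (client) ID.
+ GRANT USE CATALOG ON CATALOG my_catalog TO `12345678-1234-1234-1234-123456789abc`;
+ GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `12345678-1234-1234-1234-123456789abc`;
+ GRANT READ VOLUME, WRITE VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `12345678-1234-1234-1234-123456789abc`;
+
+ -- Verify the grants.
+ SHOW GRANTS ON VOLUME my_catalog.my_schema.my_volume;
+ ```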
From 29109910013162fda7358475dc6c983b3c187ea8 Mon Sep 17 00:00:00 2001
From: Paul Cornell
Date: Tue, 4 Feb 2025 08:30:54 -0800
Subject: [PATCH 2/2] Tables and volumes can be in the same schema or in different ones

---
 ...atabricks-delta-table-api-placeholders.mdx | 10 +++++++--
 .../databricks-delta-table-cli-api.mdx | 15 ++++++++-----
 .../databricks-delta-table-platform.mdx | 10 +++++++--
 .../databricks-delta-table.mdx | 21 ++++++++-----------
 4 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/snippets/general-shared-text/databricks-delta-table-api-placeholders.mdx b/snippets/general-shared-text/databricks-delta-table-api-placeholders.mdx
index 06c1bfec..2c7c95da 100644
--- a/snippets/general-shared-text/databricks-delta-table-api-placeholders.mdx
+++ b/snippets/general-shared-text/databricks-delta-table-api-placeholders.mdx
@@ -4,9 +4,15 @@
- `` (_required_ for PAT authentication): For Databricks personal access token (PAT) authentication, the target Databricks user's PAT value.
- `` and `` (_required_ for OAuth authentication): For Databricks OAuth machine-to-machine (M2M) authentication, the Databricks managed service principal's **UUID** (or **Client ID** or **Application ID**) and OAuth **Secret** (client secret) values.
- `` (_required_): The name of the catalog in Unity Catalog for the target volume and table in the Databricks workspace.
-- ``: The name of the database in Unity Catalog for the target volume and table. The default is `default` if not otherwise specified.
+- ``: The name of the schema (formerly known as a database) in Unity Catalog for the target table. The default is `default` if not otherwise specified.
+
+ If the target table and volume are in the same schema (formerly known as a database), then `` and `` will have the same values.
+
- `` (_required_): The name of the target table in Unity Catalog.
-- ``: The name of the schema in Unity Catalog for the target volume and table. The default is `default` if not otherwise specified.
+- ``: The name of the schema (formerly known as a database) in Unity Catalog for the target volume. The default is `default` if not otherwise specified.
+
+ If the target volume and table are in the same schema (formerly known as a database), then `` and `` will have the same values.
+
- `` (_required_): The name of the target volume in Unity Catalog.
- ``: Any target folder path inside of the volume to use instead of the volume's root. If not otherwise specified, processing occurs at the volume's root.

diff --git a/snippets/general-shared-text/databricks-delta-table-cli-api.mdx b/snippets/general-shared-text/databricks-delta-table-cli-api.mdx
index cd54ac71..ff71233e 100644
--- a/snippets/general-shared-text/databricks-delta-table-cli-api.mdx
+++ b/snippets/general-shared-text/databricks-delta-table-cli-api.mdx
@@ -16,8 +16,11 @@ The following environment variables:
- `DATABRICKS_CLIENT_ID` - For Databricks managed service principal authentication, the service principal's **UUID** (or **Client ID** or **Application ID**) value, represented by `--client-id` (CLI) or `client_id` (Python).
- `DATABRICKS_CLIENT_SECRET` - For Databricks managed service principal authentication, the service principal's OAuth **Secret** value, represented by `--client-secret` (CLI) or `client_secret` (Python).
- `DATABRICKS_CATALOG` - The name of the catalog in Unity Catalog, represented by `--catalog` (CLI) or `catalog` (Python).
-- `DATABRICKS_DATABASE` - The name of the schema (database) inside of the catalog, represented by `--database` (CLI) or `database` (Python). The default is `default` if not otherwise specified.
-- `DATABRICKS_TABLE` - The name of the table inside of the schema (database), represented by `--table-name` (CLI) or `table_name` (Python). The default is `elements` if not otherwise specified.
+- `DATABRICKS_DATABASE` - The name of the schema (formerly known as a database) inside of the catalog for the target table, represented by `--database` (CLI) or `database` (Python). The default is `default` if not otherwise specified.
+
+ If you are also using a volume, and the target table and volume are in the same schema (formerly known as a database), then `DATABRICKS_DATABASE` and `DATABRICKS_SCHEMA` will have the same values.
+
+- `DATABRICKS_TABLE` - The name of the table inside of the schema (formerly known as a database), represented by `--table-name` (CLI) or `table_name` (Python). The default is `elements` if not otherwise specified.

 For the SQL-based implementation, add these environment variables:

@@ -26,7 +29,9 @@ For the volume-based implementation, add these environment variables:
-- `DATABRICKS_SCHEMA` - The name of the schema (database) inside of the catalog, represented by `--schema` (CLI) or `schema` (Python). This name of this database (schema) must be the same as
- the value of the `DATABRICKS_DATABASE` environment variable and is required for compatiblity. The default is `default` if not otherwise specified.
-- `DATABRICKS_VOLUME` - The name of the volume inside of the schema (database), represented by `--volume` (CLI) or `volume` (Python).
+- `DATABRICKS_SCHEMA` - The name of the schema (formerly known as a database) inside of the catalog for the target volume, represented by `--schema` (CLI) or `schema` (Python). The default is `default` if not otherwise specified.
+
+ If the target volume and table are in the same schema (formerly known as a database), then `DATABRICKS_SCHEMA` and `DATABRICKS_DATABASE` will have the same values.
+
+- `DATABRICKS_VOLUME` - The name of the volume inside of the schema (formerly known as a database), represented by `--volume` (CLI) or `volume` (Python).
- `DATABRICKS_VOLUME_PATH` - Optionally, a specific path inside of the volume that you want to start accessing from, starting from the volume's root, represented by `--volume-path` (CLI) or `volume_path` (Python). The default is to start accessing from the volume's root if not otherwise specified.

diff --git a/snippets/general-shared-text/databricks-delta-table-platform.mdx b/snippets/general-shared-text/databricks-delta-table-platform.mdx
index 3a99fa09..4805157a 100644
--- a/snippets/general-shared-text/databricks-delta-table-platform.mdx
+++ b/snippets/general-shared-text/databricks-delta-table-platform.mdx
@@ -6,8 +6,14 @@ Fill in the following fields:
- **Token** (_required_ for PAT authentication): For Databricks personal access token (PAT) authentication, the target Databricks user's PAT value.
- **UUID** and **OAuth Secret** (_required_ for OAuth authentication): For Databricks OAuth machine-to-machine (M2M) authentication, the Databricks managed service principal's **UUID** (or **Client ID** or **Application ID**) and OAuth **Secret** (client secret) values.
- **Catalog** (_required_): The name of the catalog in Unity Catalog for the target volume and table in the Databricks workspace.
-- **Database**: The name of the database in Unity Catalog for the target volume and table. The default is `default` if not otherwise specified. +- **Database**: The name of the schema (formerly known as a database) in Unity Catalog for the target table. The default is `default` if not otherwise specified. + + If the target table and volume are in the same schema (formerly known as a database), then **Database** and **Schema** will have the same names. + - **Table Name** (_required_): The name of the target table in Unity Catalog. -- **Schema**: The name of the schema in Unity Catalog for the target volume and table. The default is `default` if not otherwise specified. +- **Schema**: The name of the schema (formerly known as a database) in Unity Catalog for the target volume. The default is `default` if not otherwise specified. + + If the target volume and table are in the same schema (formerly known as a database), then **Schema** and **Database** will have the same names. + - **Volume** (_required_): The name of the target volume in Unity Catalog. - **Volume Path**: Any target folder path inside of the volume to use instead of the volume's root. If not otherwise specified, processing occurs at the volume's root. diff --git a/snippets/general-shared-text/databricks-delta-table.mdx b/snippets/general-shared-text/databricks-delta-table.mdx index b4c82e80..2a166f84 100644 --- a/snippets/general-shared-text/databricks-delta-table.mdx +++ b/snippets/general-shared-text/databricks-delta-table.mdx @@ -10,7 +10,7 @@ [Azure](https://learn.microsoft.com/azure/databricks/compute/sql-warehouse/create), or [GCP](https://docs.gcp.databricks.com/compute/sql-warehouse/create.html). - The following video shows how to create a SQL warehouse, get its **Server Hostname** and **HTTP Path** values, and set permissions for someone other than the warehouse's owner to use it: + The following video shows how to create a SQL warehouse if you do not already have one available, get its **Server Hostname** and **HTTP Path** values, and set permissions for someone other than the warehouse's owner to use it: