diff --git a/mint.json b/mint.json
index c3911717..ea052af2 100644
--- a/mint.json
+++ b/mint.json
@@ -449,6 +449,7 @@
       "pages": [
         "platform/sources/overview",
         "platform/sources/azure-blob-storage",
+        "platform/sources/databricks-volumes",
         "platform/sources/google-cloud",
         "platform/sources/s3",
         "platform/sources/sharepoint"
@@ -460,6 +461,7 @@
         "platform/destinations/overview",
         "platform/destinations/astradb",
         "platform/destinations/azure-cognitive-search",
+        "platform/destinations/databricks-volumes",
         "platform/destinations/delta-table",
         "platform/destinations/google-cloud",
         "platform/destinations/milvus",
diff --git a/platform/destinations/databricks.mdx b/platform/destinations/databricks-volumes.mdx
similarity index 57%
rename from platform/destinations/databricks.mdx
rename to platform/destinations/databricks-volumes.mdx
index 7a9402aa..8ee26865 100644
--- a/platform/destinations/databricks.mdx
+++ b/platform/destinations/databricks-volumes.mdx
@@ -6,9 +6,9 @@
 Send processed data from Unstructured to Databricks Volumes.
 
 You'll need:
 
-import DatabricksPrerequisites from '/snippets/general-shared-text/databricks-volumes.mdx';
+import DatabricksVolumesPrerequisites from '/snippets/general-shared-text/databricks-volumes.mdx';
 
-<DatabricksPrerequisites />
+<DatabricksVolumesPrerequisites />
 
 To create the destination connector:
@@ -16,11 +16,11 @@
 2. Click **Destinations**.
 3. Click **Add new**.
 4. Give the connector some unique **Name**.
-5. In the **Provider** area, click **Databricks**.
+5. In the **Provider** area, click **Databricks Volumes**.
 6. Click **Continue**.
 7. Follow the on-screen instructions to fill in the fields as described later on this page.
 8. Click **Save and Test**.
 
-import DatabricksFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';
+import DatabricksVolumesFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';
 
-<DatabricksFields />
\ No newline at end of file
+<DatabricksVolumesFields />
\ No newline at end of file
diff --git a/platform/sources/databricks-volumes.mdx b/platform/sources/databricks-volumes.mdx
new file mode 100644
index 00000000..f577321b
--- /dev/null
+++ b/platform/sources/databricks-volumes.mdx
@@ -0,0 +1,26 @@
+---
+title: Databricks Volumes
+---
+
+Ingest your files into Unstructured from Databricks Volumes.
+
+You'll need:
+
+import DatabricksVolumesPrerequisites from '/snippets/general-shared-text/databricks-volumes.mdx';
+
+<DatabricksVolumesPrerequisites />
+
+To create the source connector:
+
+1. On the sidebar, click **Connectors**.
+2. Click **Sources**.
+3. Click **Add new**.
+4. Give the connector some unique **Name**.
+5. In the **Provider** area, click **Databricks Volumes**.
+6. Click **Continue**.
+7. Follow the on-screen instructions to fill in the fields as described later on this page.
+8. Click **Save and Test**.
+
+import DatabricksVolumesFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';
+
+<DatabricksVolumesFields />
\ No newline at end of file
diff --git a/snippets/general-shared-text/databricks-volumes-platform.mdx b/snippets/general-shared-text/databricks-volumes-platform.mdx
index a1f45588..b6b44fa7 100644
--- a/snippets/general-shared-text/databricks-volumes-platform.mdx
+++ b/snippets/general-shared-text/databricks-volumes-platform.mdx
@@ -2,35 +2,25 @@
 Fill in the following fields:
 
 - **Name** (_required_): A unique name for this connector.
 - **Host** (_required_): The Databricks workspace host URL.
-- **Cluster ID** : The Databricks cluster ID.
 - **Catalog** (_required_): The name of the catalog to use.
 - **Schema** : The name of the associated schema. If not specified, **default** is used.
 - **Volume** (_required_): The name of the associated volume.
 - **Volume Path** : Any optional path to access within the volume.
-- **Overwrite** Check this box if existing data should be overwritten.
-- **Encoding** : Any encoding to be applied to the data in the volume. If not specified, **utf-8**, is used.
+- **Client ID** (_required_): The application ID value for the Databricks-managed service principal that has access to the volume.
+- **Client Secret** (_required_): The associated OAuth secret value for the Databricks-managed service principal that has access to the volume.
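+
+To sanity-check these values before you save the connector, you can try them outside of Unstructured. The following is a minimal sketch, not Unstructured tooling; it assumes the Databricks SDK for Python (`pip install databricks-sdk`), and every value shown is a placeholder:
+
+```python
+# Verify that the service principal's OAuth M2M credentials can reach the
+# target volume. All values are placeholders; substitute your own.
+from databricks.sdk import WorkspaceClient
+
+w = WorkspaceClient(
+    host="https://<your-workspace-host>",    # Host
+    client_id="<application-id>",            # Client ID
+    client_secret="<oauth-secret>",          # Client Secret
+)
+
+# List the files at the volume path that the connector will use.
+for entry in w.files.list_directory_contents("/Volumes/<catalog>/<schema>/<volume>"):
+    print(entry.path)
+```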
 
-Also fill in the following fields based on your authentication type, depending on your cloud provider:
+To learn how to create a Databricks-managed service principal, get its application ID, and generate an associated OAuth secret,
+see the documentation for
+[AWS](https://docs.databricks.com/dev-tools/auth/oauth-m2m.html),
+[Azure](https://learn.microsoft.com/databricks/dev-tools/auth/oauth-m2m),
+or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html).
 
-- For Databricks personal access token authentication (AWS, Azure, and GCP):
+For Azure, only Databricks-managed service principals are supported. Microsoft Entra ID-managed service principals are not supported.
 
-  - **Token** : The Databricks personal access token value.
-
-- For username and password (basic) authentication (AWS only):
-
-  - **Username** : The Databricks username value.
-  - **Password** : The associated Databricks password value.
-
-The following authentication types are currently not supported:
-
-- OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP).
-- OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP).
-- Azure managed identities (MSI) authentication (Azure only).
-- Microsoft Entra ID service principal authentication (Azure only).
-- Azure CLI authentication (Azure only).
-- Microsoft Entra ID user authentication (Azure only).
-- Google Cloud Platform credentials authentication (GCP only).
-- Google Cloud Platform ID authentication (GCP only).
+To learn how to grant a Databricks-managed service principal access to a volume, see the documentation for
+[AWS](https://docs.databricks.com/volumes/utility-commands.html#change-permissions-on-a-volume),
+[Azure](https://learn.microsoft.com/azure/databricks/volumes/utility-commands#change-permissions-on-a-volume),
+or [GCP](https://docs.gcp.databricks.com/volumes/utility-commands.html#change-permissions-on-a-volume).
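+
+The grant itself can also be scripted. The following is a minimal sketch that assumes the Databricks SDK for Python and admin credentials available in your environment; all names are placeholders. It is roughly equivalent to the SQL statement `GRANT READ VOLUME, WRITE VOLUME ON VOLUME <catalog>.<schema>.<volume> TO <principal>`:
+
+```python
+# Grant the service principal read and write access to the target volume
+# through the Unity Catalog grants API. All values are placeholders.
+from databricks.sdk import WorkspaceClient
+from databricks.sdk.service import catalog
+
+w = WorkspaceClient()  # admin credentials picked up from the environment
+
+w.grants.update(
+    securable_type=catalog.SecurableType.VOLUME,
+    full_name="<catalog>.<schema>.<volume>",
+    changes=[
+        catalog.PermissionsChange(
+            principal="<application-id>",  # the service principal's application ID
+            add=[catalog.Privilege.READ_VOLUME, catalog.Privilege.WRITE_VOLUME],
+        )
+    ],
+)
+```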
diff --git a/snippets/general-shared-text/databricks-volumes.mdx b/snippets/general-shared-text/databricks-volumes.mdx
index d8520bed..60a3c4f4 100644
--- a/snippets/general-shared-text/databricks-volumes.mdx
+++ b/snippets/general-shared-text/databricks-volumes.mdx
@@ -10,6 +10,11 @@
   allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
   allowfullscreen
 >
 
+The preceding video shows how to use Databricks personal access tokens (PATs), which are supported only for [Unstructured Ingest](/ingestion/overview).
+
+To learn how to use Databricks-managed service principals, which are supported by both the [Unstructured Platform](/platform/overview) and Unstructured Ingest,
+see the additional videos later on this page.
+
 - The Databricks workspace URL. Get the workspace URL for [AWS](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids), [Azure](https://learn.microsoft.com/azure/databricks/workspace/workspace-details#workspace-instance-names-urls-and-ids), or [GCP](https://docs.gcp.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids).
 
@@ -21,17 +26,39 @@
   - Azure: `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`
   - GCP: `https://<workspace-id>.<random-number>.gcp.databricks.com`
 
-- The Databricks compute resource's ID. Get the compute resource ID for [AWS](https://docs.databricks.com/integrations/compute-details.html), [Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details), or [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).
-
 - The Databricks authentication details. For more information, see the documentation for [AWS](https://docs.databricks.com/dev-tools/auth/index.html), [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/), or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/index.html).
 
-  More specifically, you will need:
+  The following videos show how to create a Databricks-managed service principal and then grant it access to a Databricks volume:
+
+  {/* Embedded video: create a Databricks-managed service principal. */}
+
+  {/* Embedded video: grant the service principal access to a volume. */}
+
+  For the [Unstructured Platform](/platform/overview), only the following Databricks authentication type is supported:
+
+  - For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
+    Note that for Azure, only Databricks-managed service principals are supported. Microsoft Entra ID-managed service principals are not supported.
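+
+  If you prefer to script these steps, the following is a minimal sketch (not an official recipe) that assumes the Databricks SDK for Python and account-admin credentials; the account console host shown is for AWS, and all other values are placeholders:
+
+  ```python
+  # Create a Databricks-managed service principal at the account level and
+  # generate an OAuth secret for it. All values are placeholders.
+  from databricks.sdk import AccountClient
+
+  a = AccountClient(
+      host="https://accounts.cloud.databricks.com",
+      account_id="<your-databricks-account-id>",
+      client_id="<account-admin-client-id>",
+      client_secret="<account-admin-oauth-secret>",
+  )
+
+  sp = a.service_principals.create(display_name="unstructured-connector")
+  secret = a.service_principal_secrets.create(service_principal_id=int(sp.id))
+
+  print(sp.application_id)  # use as the client ID
+  print(secret.secret)      # use as the OAuth secret; retrievable only at creation
+  ```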
+
+  For [Unstructured Ingest](/ingestion/overview), the following Databricks authentication types are supported:
 
 - For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token's value.
 - For username and password (basic) authentication (AWS only): The user's name and password values.
@@ -44,6 +71,10 @@ allowfullscreen
 - For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account's credentials file.
 - For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account's email address.
 
-- The Databricks catalog name for the Volume. Get the catalog name for [AWS](https://docs.databricks.com/catalogs/manage-catalog.html), [Azure](https://learn.microsoft.com/azure/databricks/catalogs/manage-catalog), or [GCP](https://docs.gcp.databricks.com/catalogs/manage-catalog.html).
-- The Databricks schema name for the Volume. Get the schema name for [AWS](https://docs.databricks.com/schemas/manage-schema.html), [Azure](https://learn.microsoft.com/azure/databricks/schemas/manage-schema), or [GCP](https://docs.gcp.databricks.com/schemas/manage-schema.html).
-- The Databricks Volume name, and optionally any path in that Volume that you want to access directly. Get the Volume information for [AWS](https://docs.databricks.com/files/volumes.html), [Azure](https://learn.microsoft.com/azure/databricks/files/volumes), or [GCP](https://docs.gcp.databricks.com/files/volumes.html).
\ No newline at end of file
+- The Databricks catalog name for the volume. Get the catalog name for [AWS](https://docs.databricks.com/catalogs/manage-catalog.html), [Azure](https://learn.microsoft.com/azure/databricks/catalogs/manage-catalog), or [GCP](https://docs.gcp.databricks.com/catalogs/manage-catalog.html).
+- The Databricks schema name for the volume. Get the schema name for [AWS](https://docs.databricks.com/schemas/manage-schema.html), [Azure](https://learn.microsoft.com/azure/databricks/schemas/manage-schema), or [GCP](https://docs.gcp.databricks.com/schemas/manage-schema.html).
+- The Databricks volume name, and optionally any path in that volume that you want to access directly. Get the volume information for [AWS](https://docs.databricks.com/files/volumes.html), [Azure](https://learn.microsoft.com/azure/databricks/files/volumes), or [GCP](https://docs.gcp.databricks.com/files/volumes.html).
+- Make sure that the target user or service principal has access to the target volume. To learn more, see the documentation for
+  [AWS](https://docs.databricks.com/volumes/utility-commands.html#change-permissions-on-a-volume),
+  [Azure](https://learn.microsoft.com/azure/databricks/volumes/utility-commands#change-permissions-on-a-volume),
+  or [GCP](https://docs.gcp.databricks.com/volumes/utility-commands.html#change-permissions-on-a-volume).
\ No newline at end of file