2 changes: 2 additions & 0 deletions mint.json
@@ -449,6 +449,7 @@
"pages": [
"platform/sources/overview",
"platform/sources/azure-blob-storage",
"platform/sources/databricks-volumes",
"platform/sources/google-cloud",
"platform/sources/s3",
"platform/sources/sharepoint"
@@ -460,6 +461,7 @@
"platform/destinations/overview",
"platform/destinations/astradb",
"platform/destinations/azure-cognitive-search",
"platform/destinations/databricks-volumes",
"platform/destinations/delta-table",
"platform/destinations/google-cloud",
"platform/destinations/milvus",
platform/destinations/databricks-volumes.mdx
@@ -6,21 +6,21 @@ Send processed data from Unstructured to Databricks Volumes.

You'll need:

import DatabricksPrerequisites from '/snippets/general-shared-text/databricks-volumes.mdx';
import DatabricksVolumesPrerequisites from '/snippets/general-shared-text/databricks-volumes.mdx';

<DatabricksPrerequisites />
<DatabricksVolumesPrerequisites />

To create the destination connector:

1. On the sidebar, click **Connectors**.
2. Click **Destinations**.
3. Click **Add new**.
4. Enter a unique **Name** for this connector.
5. In the **Provider** area, click **Databricks**.
5. In the **Provider** area, click **Databricks Volumes**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import DatabricksFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';
import DatabricksVolumesFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';

<DatabricksFields />
<DatabricksVolumesFields />
26 changes: 26 additions & 0 deletions platform/sources/databricks-volumes.mdx
@@ -0,0 +1,26 @@
---
title: Databricks Volumes
---

Ingest your files into Unstructured from Databricks Volumes.

You'll need:

import DatabricksVolumesPrerequisites from '/snippets/general-shared-text/databricks-volumes.mdx';

<DatabricksVolumesPrerequisites />

To create the source connector:

1. On the sidebar, click **Connectors**.
2. Click **Sources**.
3. Click **Add new**.
4. Enter a unique **Name** for this connector.
5. In the **Provider** area, click **Databricks Volumes**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import DatabricksVolumesFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';

<DatabricksVolumesFields />
34 changes: 12 additions & 22 deletions snippets/general-shared-text/databricks-volumes-platform.mdx
@@ -2,35 +2,25 @@ Fill in the following fields:

- **Name** (_required_): A unique name for this connector.
- **Host** (_required_): The Databricks workspace host URL.
- **Cluster ID**: The Databricks cluster ID.
- **Catalog** (_required_): The name of the catalog to use.
- **Schema**: The name of the associated schema. If not specified, **default** is used.
- **Volume** (_required_): The name of the associated volume.
- **Volume Path**: Any optional path to access within the volume.
- **Overwrite**: Check this box if existing data should be overwritten.
- **Encoding**: Any encoding to be applied to the data in the volume. If not specified, **utf-8** is used.
- **Client ID** (_required_): The application ID value for the Databricks-managed service principal that has access to the volume.
- **Client Secret** (_required_): The associated OAuth secret value for the Databricks-managed service principal that has access to the volume.

Also fill in the following fields based on your authentication type, depending on your cloud provider:
To learn how to create a Databricks-managed service principal, get its application ID, and generate an associated OAuth secret,
see the documentation for
[AWS](https://docs.databricks.com/dev-tools/auth/oauth-m2m.html),
[Azure](https://learn.microsoft.com/databricks/dev-tools/auth/oauth-m2m),
or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/oauth-m2m.html).

- For Databricks personal access token authentication (AWS, Azure, and GCP):
For Azure, only Databricks-managed service principals are supported. Microsoft Entra ID-managed service principals are not supported.
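If you want to sanity-check the service principal's credentials before saving the connector, the following is a minimal sketch using the Databricks SDK for Python, assuming the `databricks-sdk` package is installed. The host, client ID, and secret values are placeholders; the SDK itself is not required by the Unstructured Platform.

```python
# A minimal sketch, assuming databricks-sdk is installed
# (pip install databricks-sdk). All credential values are placeholders.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    host="https://<workspace-host>",          # the Host field
    client_id="<service-principal-app-id>",   # the Client ID field
    client_secret="<oauth-secret>",           # the Client Secret field
)

# If OAuth machine-to-machine (M2M) authentication succeeds, this prints
# the service principal's identity as the workspace sees it.
print(w.current_user.me().user_name)
```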

- **Token**: The Databricks personal access token value.

- For username and password (basic) authentication (AWS only):

- **Username**: The Databricks username value.
- **Password**: The associated Databricks password value.

The following authentication types are currently not supported:

- OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP).
- OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP).
- Azure managed identities (MSI) authentication (Azure only).
- Microsoft Entra ID service principal authentication (Azure only).
- Azure CLI authentication (Azure only).
- Microsoft Entra ID user authentication (Azure only).
- Google Cloud Platform credentials authentication (GCP only).
- Google Cloud Platform ID authentication (GCP only).
To learn how to grant a Databricks-managed service principal access to a volume, see the documentation for
[AWS](https://docs.databricks.com/volumes/utility-commands.html#change-permissions-on-a-volume),
[Azure](https://learn.microsoft.com/azure/databricks/volumes/utility-commands#change-permissions-on-a-volume),
or [GCP](https://docs.gcp.databricks.com/volumes/utility-commands.html#change-permissions-on-a-volume).
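Volume permissions can also be granted programmatically. The following is a sketch using the Unity Catalog grants API in the Databricks SDK for Python, assuming the caller has owner or manage rights on the volume; the catalog, schema, volume, and principal names are placeholders. Note that the principal also needs `USE CATALOG` and `USE SCHEMA` on the parent catalog and schema.

```python
# A minimal sketch, assuming databricks-sdk is installed and the caller
# has sufficient rights on the volume. All names are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

w = WorkspaceClient()  # picks up credentials from the environment

# Grant the service principal read and write access to the volume.
w.grants.update(
    securable_type=catalog.SecurableType.VOLUME,
    full_name="my_catalog.my_schema.my_volume",
    changes=[
        catalog.PermissionsChange(
            principal="<service-principal-app-id>",
            add=[
                catalog.Privilege.READ_VOLUME,
                catalog.Privilege.WRITE_VOLUME,
            ],
        )
    ],
)
```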

49 changes: 40 additions & 9 deletions snippets/general-shared-text/databricks-volumes.mdx
@@ -10,6 +10,11 @@ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; pic
allowfullscreen
></iframe>

The preceding video shows how to use Databricks personal access tokens (PATs), which are supported only for [Unstructured Ingest](/ingestion/overview).

To learn how to use Databricks-managed service principals, which are supported by both the [Unstructured Platform](/platform/overview) and Unstructured Ingest,
see the additional videos later on this page.

- The Databricks workspace URL. Get the workspace URL for
[AWS](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids),
[Azure](https://learn.microsoft.com/azure/databricks/workspace/workspace-details#workspace-instance-names-urls-and-ids),
@@ -21,17 +26,39 @@
- Azure: `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`
- GCP: `https://<workspace-id>.<random-number>.gcp.databricks.com`

- The Databricks compute resource's ID. Get the compute resource ID for
[AWS](https://docs.databricks.com/integrations/compute-details.html),
[Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details),
or [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).

- The Databricks authentication details. For more information, see the documentation for
[AWS](https://docs.databricks.com/dev-tools/auth/index.html),
[Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/),
or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/index.html).

More specifically, you will need:
The following videos show how to create a Databricks-managed service principal and then grant it access to a Databricks volume:

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/wBmqv5DaA1E"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

<iframe
width="560"
height="315"
src="https://www.youtube.com/embed/DykQRxgh2aQ"
title="YouTube video player"
frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen
></iframe>

For the [Unstructured Platform](/platform/overview), only the following Databricks authentication type is supported:

- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
Note that for Azure, only Databricks-managed service principals are supported. Microsoft Entra ID-managed service principals are not supported.

For [Unstructured Ingest](/ingestion/overview), the following Databricks authentication types are supported:

- For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token's value.
- For username and password (basic) authentication (AWS only): The user's name and password values.
@@ -44,6 +71,10 @@
- For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account's credentials file.
- For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account's email address.
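For comparison with the OAuth M2M example above, here is what personal access token authentication (the type shown in the first video on this page) looks like with the Databricks SDK for Python. This is only a sketch, and the host and token values are placeholders.

```python
# A minimal sketch of PAT-based authentication, assuming databricks-sdk
# is installed. Host and token values are placeholders.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient(
    host="https://<workspace-host>",
    token="<personal-access-token>",
)

# A quick smoke test: list the catalogs that the token's user can see.
for c in w.catalogs.list():
    print(c.name)
```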

- The Databricks catalog name for the Volume. Get the catalog name for [AWS](https://docs.databricks.com/catalogs/manage-catalog.html), [Azure](https://learn.microsoft.com/azure/databricks/catalogs/manage-catalog), or [GCP](https://docs.gcp.databricks.com/catalogs/manage-catalog.html).
- The Databricks schema name for the Volume. Get the schema name for [AWS](https://docs.databricks.com/schemas/manage-schema.html), [Azure](https://learn.microsoft.com/azure/databricks/schemas/manage-schema), or [GCP](https://docs.gcp.databricks.com/schemas/manage-schema.html).
- The Databricks Volume name, and optionally any path in that Volume that you want to access directly. Get the Volume information for [AWS](https://docs.databricks.com/files/volumes.html), [Azure](https://learn.microsoft.com/azure/databricks/files/volumes), or [GCP](https://docs.gcp.databricks.com/files/volumes.html).
- The Databricks catalog name for the volume. Get the catalog name for [AWS](https://docs.databricks.com/catalogs/manage-catalog.html), [Azure](https://learn.microsoft.com/azure/databricks/catalogs/manage-catalog), or [GCP](https://docs.gcp.databricks.com/catalogs/manage-catalog.html).
- The Databricks schema name for the volume. Get the schema name for [AWS](https://docs.databricks.com/schemas/manage-schema.html), [Azure](https://learn.microsoft.com/azure/databricks/schemas/manage-schema), or [GCP](https://docs.gcp.databricks.com/schemas/manage-schema.html).
- The Databricks volume name, and optionally any path in that volume that you want to access directly. Get the volume information for [AWS](https://docs.databricks.com/files/volumes.html), [Azure](https://learn.microsoft.com/azure/databricks/files/volumes), or [GCP](https://docs.gcp.databricks.com/files/volumes.html).
- Make sure that the target user or service principal has access to the target volume. To learn more, see the documentation for
[AWS](https://docs.databricks.com/volumes/utility-commands.html#change-permissions-on-a-volume),
[Azure](https://learn.microsoft.com/azure/databricks/volumes/utility-commands#change-permissions-on-a-volume),
or [GCP](https://docs.gcp.databricks.com/volumes/utility-commands.html#change-permissions-on-a-volume).
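Volume contents are addressed with Unity Catalog's three-level namespace under the `/Volumes` root, that is, `/Volumes/<catalog>/<schema>/<volume>/<path>`. As a sketch, assuming a recent `databricks-sdk` with authentication configured via the environment and placeholder names, you can confirm that the target user or service principal has access by listing and writing to that path:

```python
# A minimal access check, assuming databricks-sdk is installed and
# authentication is configured via the environment. Names are placeholders.
import io

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
volume_path = "/Volumes/my_catalog/my_schema/my_volume"

# READ VOLUME access is enough for this call to succeed.
for entry in w.files.list_directory_contents(volume_path):
    print(entry.path)

# WRITE VOLUME access is required for this one.
w.files.upload(
    f"{volume_path}/unstructured-connectivity-check.txt",
    io.BytesIO(b"connectivity check"),
    overwrite=True,
)
```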