2 changes: 2 additions & 0 deletions platform/connectors.mdx
@@ -12,6 +12,7 @@ The Unstructured Platform supports connecting to the following source and destination
## Sources

- [Azure](/platform/sources/azure-blob-storage)
- [Databricks Volumes](/platform/sources/databricks)
- [S3](/platform/sources/s3)

If your source is not listed here, you might still be able to connect Unstructured to it through scripts or code by using the
@@ -22,6 +23,7 @@ If your source is not listed here, you might still be able to connect Unstructured
## Destinations

- [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
- [Databricks Volumes](/platform/destinations/databricks)
- [Pinecone](/platform/destinations/pinecone)
- [S3](/platform/destinations/s3)

14 changes: 8 additions & 6 deletions platform/destinations/databricks.mdx
@@ -12,12 +12,14 @@ import DatabricksPrerequisites from '/snippets/general-shared-text/databricks-vo

To create the destination connector:

-1. On the sidebar, click **Destinations**.
-2. Click **New Destination**.
-3. In the **Type** drop-down list, select **Databricks**.
-4. Fill in the fields as described later on this page.
-5. Click **Save and Test**.
-6. Click **Close**.
+1. On the sidebar, click **Connectors**.
+2. Click **Destinations**.
+3. Click **Add new**.
+4. Give the connector a unique **Name**.
+5. In the **Provider** area, click **Databricks**.
+6. Click **Continue**.
+7. Follow the on-screen instructions to fill in the fields as described later on this page.
+8. Click **Save and Test**.

import DatabricksFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';

1 change: 1 addition & 0 deletions platform/destinations/overview.mdx
@@ -15,6 +15,7 @@ To create a destination connector:
4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:

- [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
- [Databricks Volumes](/platform/destinations/databricks)
- [Pinecone](/platform/destinations/pinecone)
- [S3](/platform/destinations/s3)

26 changes: 26 additions & 0 deletions platform/sources/databricks.mdx
@@ -0,0 +1,26 @@
---
title: Databricks Volumes
---

Ingest your files into Unstructured from Databricks Volumes.

You'll need:

import DatabricksVolumesPrerequisites from '/snippets/general-shared-text/databricks-volumes.mdx';

<DatabricksVolumesPrerequisites />

To create the source connector:

1. On the sidebar, click **Connectors**.
2. Click **Sources**.
3. Click **Add new**.
4. Give the connector a unique **Name**.
5. In the **Provider** area, click **Databricks**.
6. Click **Continue**.
7. Follow the on-screen instructions to fill in the fields as described later on this page.
8. Click **Save and Test**.

import DatabricksVolumesFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';

<DatabricksVolumesFields />
1 change: 1 addition & 0 deletions platform/sources/overview.mdx
@@ -16,6 +16,7 @@ To create a source connector:
4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:

- [Azure](/platform/sources/azure-blob-storage)
- [Databricks Volumes](/platform/sources/databricks)
- [S3](/platform/sources/s3)

5. Click **Save and Test**.
@@ -11,7 +11,6 @@ import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-d
The following environment variables:

- `DATABRICKS_HOST` - The Databricks host URL, represented by `--host` (CLI) or `host` (Python).
-- `DATABRICKS_CLUSTER_ID` - The Databricks compute resource ID, represented by `--cluster-id` (CLI) or `cluster_id` (Python).
- `DATABRICKS_CATALOG` - The Databricks catalog name for the Volume, represented by `--catalog` (CLI) or `catalog` (Python).
- `DATABRICKS_SCHEMA` - The Databricks schema name for the Volume, represented by `--schema` (CLI) or `schema` (Python). If not specified, `default` is used.
- `DATABRICKS_VOLUME` - The Databricks Volume name, represented by `--volume` (CLI) or `volume` (Python).
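
Taken together, these variables identify a Unity Catalog volume. Below is a minimal sketch of how they might be set and combined in Python before an ingest run; the host and volume values are hypothetical placeholders, and `DATABRICKS_CLUSTER_ID` is intentionally absent, matching its removal above:

```python
import os

# Hypothetical example values; substitute your own workspace details.
os.environ["DATABRICKS_HOST"] = "https://dbc-1234567890abcdef.cloud.databricks.com"
os.environ["DATABRICKS_CATALOG"] = "main"
os.environ["DATABRICKS_SCHEMA"] = "default"  # optional; `default` is used if unset
os.environ["DATABRICKS_VOLUME"] = "ingest-docs"

# Unity Catalog volumes resolve to paths of the form /Volumes/<catalog>/<schema>/<volume>.
volume_path = "/Volumes/{catalog}/{schema}/{volume}".format(
    catalog=os.environ["DATABRICKS_CATALOG"],
    schema=os.environ["DATABRICKS_SCHEMA"],
    volume=os.environ["DATABRICKS_VOLUME"],
)
print(volume_path)  # -> /Volumes/main/default/ingest-docs
```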
29 changes: 8 additions & 21 deletions snippets/general-shared-text/databricks-volumes-platform.mdx
@@ -2,35 +2,22 @@ Fill in the following fields:

- **Name** (_required_): A unique name for this connector.
- **Host** (_required_): The Databricks workspace host URL.
-- **Cluster ID** : The Databricks cluster ID.
- **Catalog** (_required_): The name of the catalog to use.
**Comment on lines 4 to 5** (Collaborator, Author): Things seemed to work for me without needing to specify a cluster ID.
- **Schema** : The name of the associated schema. If not specified, **default** is used.
- **Volume** (_required_): The name of the associated volume.
- **Volume Path** : Any optional path to access within the volume.
- **Overwrite** : Check this box if existing data should be overwritten.
-- **Encoding** : Any encoding to be applied to the data in the volume. If not specified, **utf-8**, is used.
+- **Encoding** : Any encoding to be applied to the data in the volume. If not specified, **utf-8** is used.

Also fill in the following fields based on your authentication type, depending on your cloud provider:

-- For Databricks personal access token authentication (AWS, Azure, and GCP):

-- **Token** : The Databricks personal access token value.

-- For username and password (basic) authentication (AWS only):

-- **Username** : The Databricks username value.
-- **Password** : The associated Databricks password value.

-The following authentication types are currently not supported:

-- OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP).
-- OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP).
-- Azure managed identities (MSI) authentication (Azure only).
-- Microsoft Entra ID service principal authentication (Azure only).
-- Azure CLI authentication (Azure only).
-- Microsoft Entra ID user authentication (Azure only).
-- Google Cloud Platform credentials authentication (GCP only).
-- Google Cloud Platform ID authentication (GCP only).
+- For Databricks personal access token authentication (AWS, Azure, and GCP) or for
+Microsoft Entra ID user authentication (Azure only):

+- **Token** : The Databricks personal access token value (for AWS, Azure, and GCP) or the
+Microsoft Entra ID token value (Azure only).

+- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP):

+- **Client ID** : The service principal's client (application) ID value.
+- **Client Secret** : The associated Databricks OAuth client secret value.
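
For readers who want to sanity-check OAuth M2M credentials before entering them in the UI, here is a minimal sketch using the Databricks SDK for Python; the host, client ID, and secret are hypothetical placeholders, and this SDK call is offered as an illustration under those assumptions, not as part of the connector setup itself:

```python
# Requires: pip install databricks-sdk
from databricks.sdk import WorkspaceClient

# Hypothetical service principal credentials; replace with your own.
w = WorkspaceClient(
    host="https://adb-1234567890123456.7.azuredatabricks.net",
    client_id="00000000-0000-0000-0000-000000000000",
    client_secret="dose...",  # Databricks OAuth client secret
    auth_type="oauth-m2m",    # force OAuth machine-to-machine authentication
)

# Any successful authenticated call confirms the credentials work.
print(w.current_user.me().user_name)
```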
27 changes: 19 additions & 8 deletions snippets/general-shared-text/databricks-volumes.mdx
@@ -1,5 +1,15 @@
The Databricks Volumes prerequisites:

+<iframe
+width="560"
+height="315"
+src="https://www.youtube.com/embed/rNZpwa1-g7M"
+title="YouTube video player"
+frameborder="0"
+allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+allowfullscreen
+></iframe>

- The Databricks workspace URL. Get the workspace URL for
[AWS](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids),
[Azure](https://learn.microsoft.com/azure/databricks/workspace/workspace-details#workspace-instance-names-urls-and-ids),
@@ -11,26 +21,27 @@ The Databricks Volumes prerequisites:
- Azure: `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`
- GCP: `https://<workspace-id>.<random-number>.gcp.databricks.com`

-- The Databricks compute resource's ID. Get the compute resource ID for
-[AWS](https://docs.databricks.com/integrations/compute-details.html),
-[Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details),
-or [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).

- The Databricks authentication details. For more information, see the documentation for
[AWS](https://docs.databricks.com/dev-tools/auth/index.html),
[Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/),
or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/index.html).

-More specifically, you will need:
+More specifically, you will need the following authentication details.

+The following authentication types are supported by both [Unstructured API services](/api-reference/api-services/overview)
+and the [Unstructured Platform](/platform/overview):

- For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token's value.
-- For username and password (basic) authentication (AWS only): The user's name and password values.
- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
+- For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.

+The following authentication types are supported only by Unstructured API services:

+- For username and password (basic) authentication (AWS only): The user's name and password values.
- For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
- For Azure managed identities (MSI) authentication (Azure only): The client ID value for the corresponding managed identity.
- For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
- For Azure CLI authentication (Azure only): No additional values.
-- For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
- For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account's credentials file.
- For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account's email address.
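
To check the most common of these credential sets outside of Unstructured, here is a small example with the Databricks SDK for Python using personal access token authentication; the host, token, catalog, and schema values are hypothetical placeholders, and the SDK usage is a sketch under those assumptions:

```python
# Requires: pip install databricks-sdk
from databricks.sdk import WorkspaceClient

# Hypothetical workspace URL and personal access token.
w = WorkspaceClient(
    host="https://dbc-1234567890abcdef.cloud.databricks.com",
    token="dapi...",  # personal access token (AWS, Azure, or GCP)
)

# Confirm the token authenticates, then confirm the volume is reachable.
print(w.current_user.me().user_name)
for volume in w.volumes.list(catalog_name="main", schema_name="default"):
    print(volume.full_name)  # e.g. main.default.ingest-docs
```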
