diff --git a/platform/connectors.mdx b/platform/connectors.mdx
index 41bef144..caeede32 100644
--- a/platform/connectors.mdx
+++ b/platform/connectors.mdx
@@ -12,6 +12,7 @@ The Unstructured Platform supports connecting to the following source and destin
 ## Sources
 
 - [Azure](/platform/sources/azure-blob-storage)
+- [Databricks Volumes](/platform/sources/databricks)
 - [S3](/platform/sources/s3)
 
 If your source is not listed here, you might still be able to connect Unstructured to it through scripts or code by using the
@@ -22,6 +23,7 @@ If your source is not listed here, you might still be able to connect Unstructur
 ## Destinations
 
 - [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
+- [Databricks Volumes](/platform/destinations/databricks)
 - [Pinecone](/platform/destinations/pinecone)
 - [S3](/platform/destinations/s3)
diff --git a/platform/destinations/databricks.mdx b/platform/destinations/databricks.mdx
index 90dea652..7a9402aa 100644
--- a/platform/destinations/databricks.mdx
+++ b/platform/destinations/databricks.mdx
@@ -12,12 +12,14 @@ import DatabricksPrerequisites from '/snippets/general-shared-text/databricks-vo
 
 To create the destination connector:
 
-1. On the sidebar, click **Destinations**.
-2. Click **New Destination**.
-3. In the **Type** drop-down list, select **Databricks**.
-4. Fill in the fields as described later on this page.
-5. Click **Save and Test**.
-6. Click **Close**.
+1. On the sidebar, click **Connectors**.
+2. Click **Destinations**.
+3. Click **Add new**.
+4. Give the connector a unique **Name**.
+5. In the **Provider** area, click **Databricks**.
+6. Click **Continue**.
+7. Follow the on-screen instructions to fill in the fields as described later on this page.
+8. Click **Save and Test**.
 
 import DatabricksFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';
diff --git a/platform/destinations/overview.mdx b/platform/destinations/overview.mdx
index c9fbf686..3a0e676a 100644
--- a/platform/destinations/overview.mdx
+++ b/platform/destinations/overview.mdx
@@ -15,6 +15,7 @@ To create a destination connector:
 4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:
 
    - [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
+   - [Databricks Volumes](/platform/destinations/databricks)
    - [Pinecone](/platform/destinations/pinecone)
    - [S3](/platform/destinations/s3)
diff --git a/platform/sources/databricks.mdx b/platform/sources/databricks.mdx
new file mode 100644
index 00000000..52cff067
--- /dev/null
+++ b/platform/sources/databricks.mdx
@@ -0,0 +1,26 @@
+---
+title: Databricks Volumes
+---
+
+Ingest your files into Unstructured from Databricks Volumes.
+
+You'll need:
+
+import DatabricksVolumesPrerequisites from '/snippets/general-shared-text/databricks-volumes.mdx';
+
+<DatabricksVolumesPrerequisites />
+
+To create the source connector:
+
+1. On the sidebar, click **Connectors**.
+2. Click **Sources**.
+3. Click **Add new**.
+4. Give the connector a unique **Name**.
+5. In the **Provider** area, click **Databricks**.
+6. Click **Continue**.
+7. Follow the on-screen instructions to fill in the fields as described later on this page.
+8. Click **Save and Test**.
+
+import DatabricksVolumesFields from '/snippets/general-shared-text/databricks-volumes-platform.mdx';
+
+<DatabricksVolumesFields />
\ No newline at end of file
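The **Catalog**, **Schema**, **Volume**, and **Volume Path** fields that the new source page (and the existing destination page) collect together identify a location in a Unity Catalog volume. A minimal sketch of how those values compose, assuming the standard `/Volumes/<catalog>/<schema>/<volume>` path scheme Databricks uses for volumes; the helper function is illustrative, not part of the connector:

```python
from posixpath import join


def volume_path(catalog: str, volume: str, schema: str = "default", sub_path: str = "") -> str:
    """Compose the Unity Catalog location a connector reads from or writes to.

    `schema` defaults to "default", matching the connector's behavior when the
    Schema field is left empty; `sub_path` mirrors the optional Volume Path field.
    """
    base = f"/Volumes/{catalog}/{schema}/{volume}"
    return join(base, sub_path) if sub_path else base


# Catalog "main", volume "landing", and volume path "raw/pdfs" resolve to:
# /Volumes/main/default/landing/raw/pdfs
print(volume_path("main", "landing", sub_path="raw/pdfs"))
```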
diff --git a/platform/sources/overview.mdx b/platform/sources/overview.mdx
index 37ac10b0..d2b55568 100644
--- a/platform/sources/overview.mdx
+++ b/platform/sources/overview.mdx
@@ -16,6 +16,7 @@ To create a source connector:
 4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:
 
    - [Azure](/platform/sources/azure-blob-storage)
+   - [Databricks Volumes](/platform/sources/databricks)
    - [S3](/platform/sources/s3)
 
 5. Click **Save and Test**.
diff --git a/snippets/general-shared-text/databricks-volumes-cli-api.mdx b/snippets/general-shared-text/databricks-volumes-cli-api.mdx
index a7c1d169..dc7f9845 100644
--- a/snippets/general-shared-text/databricks-volumes-cli-api.mdx
+++ b/snippets/general-shared-text/databricks-volumes-cli-api.mdx
@@ -11,7 +11,6 @@ import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-d
 The following environment variables:
 
 - `DATABRICKS_HOST` - The Databricks host URL, represented by `--host` (CLI) or `host` (Python).
-- `DATABRICKS_CLUSTER_ID` - The Databricks compute resource ID, represented by `--cluster-id` (CLI) or `cluster_id` (Python).
 - `DATABRICKS_CATALOG` - The Databricks catalog name for the Volume, represented by `--catalog` (CLI) or `catalog` (Python).
 - `DATABRICKS_SCHEMA` - The Databricks schema name for the Volume, represented by `--schema` (CLI) or `schema` (Python). If not specified, `default` is used.
 - `DATABRICKS_VOLUME` - The Databricks Volume name, represented by `--volume` (CLI) or `volume` (Python).
diff --git a/snippets/general-shared-text/databricks-volumes-platform.mdx b/snippets/general-shared-text/databricks-volumes-platform.mdx
index a1f45588..f4ee23b2 100644
--- a/snippets/general-shared-text/databricks-volumes-platform.mdx
+++ b/snippets/general-shared-text/databricks-volumes-platform.mdx
@@ -2,35 +2,22 @@ Fill in the following fields:
 
 - **Name** (_required_): A unique name for this connector.
 - **Host** (_required_): The Databricks workspace host URL.
-- **Cluster ID** : The Databricks cluster ID.
 - **Catalog** (_required_): The name of the catalog to use.
 - **Schema** : The name of the associated schema. If not specified, **default** is used.
 - **Volume** (_required_): The name of the associated volume.
 - **Volume Path** : Any optional path to access within the volume.
 - **Overwrite** : Check this box if existing data should be overwritten.
-- **Encoding** : Any encoding to be applied to the data in the volume. If not specified, **utf-8**, is used.
+- **Encoding** : Any encoding to be applied to the data in the volume. If not specified, **utf-8** is used.
 
 Also fill in the following fields based on your authentication type, depending on your cloud provider:
 
-- For Databricks personal access token authentication (AWS, Azure, and GCP):
-
-  - **Token** : The Databricks personal access token value.
-
-- For username and password (basic) authentication (AWS only):
-
-  - **Username** : The Databricks username value.
-  - **Password** : The associated Databricks password value.
-
-The following authentication types are currently not supported:
-
-- OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP).
-- OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP).
-- Azure managed identities (MSI) authentication (Azure only).
-- Microsoft Entra ID service principal authentication (Azure only).
-- Azure CLI authentication (Azure only).
-- Microsoft Entra ID user authentication (Azure only).
-- Google Cloud Platform credentials authentication (GCP only).
-- Google Cloud Platform ID authentication (GCP only).
+- For Databricks personal access token authentication (AWS, Azure, and GCP) or for
+  Microsoft Entra ID user authentication (Azure only):
+  - **Token** : The Databricks personal access token value (for AWS, Azure, and GCP) or the
+    Microsoft Entra ID token value (Azure only).
+- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP):
+  - **Client ID** : The service principal's client (application) ID value.
+  - **Client Secret** : The associated Databricks OAuth client secret value.
\ No newline at end of file
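The reworked `databricks-volumes-platform.mdx` snippet pairs the authentication fields into two groups: a single **Token** (a personal access token, or an Entra ID user token on Azure), or a **Client ID** plus **Client Secret** for OAuth M2M. A hypothetical sketch of client-side validation of that pairing, assuming only the field names listed above; the dataclass itself is illustrative, not an API the Platform exposes:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatabricksVolumesAuth:
    # Personal access token (AWS, Azure, GCP) or Microsoft Entra ID user token (Azure only).
    token: Optional[str] = None
    # OAuth machine-to-machine (M2M) service principal credentials (AWS, Azure, GCP).
    client_id: Optional[str] = None
    client_secret: Optional[str] = None

    def validate(self) -> None:
        has_token = self.token is not None
        has_m2m = self.client_id is not None and self.client_secret is not None
        if has_token == has_m2m:  # True==True (both given) or False==False (neither given)
            raise ValueError(
                "Provide either a token (PAT or Entra ID user token) or a "
                "client ID plus client secret (OAuth M2M), but not both."
            )


# A personal-access-token configuration passes validation:
DatabricksVolumesAuth(token="dapi-example-placeholder").validate()
```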
diff --git a/snippets/general-shared-text/databricks-volumes.mdx b/snippets/general-shared-text/databricks-volumes.mdx
index 8bed44d3..680e3580 100644
--- a/snippets/general-shared-text/databricks-volumes.mdx
+++ b/snippets/general-shared-text/databricks-volumes.mdx
@@ -1,5 +1,15 @@
 The Databricks Volumes prerequisites:
 
+
+
 - The Databricks workspace URL. Get the workspace URL for
   [AWS](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids),
   [Azure](https://learn.microsoft.com/azure/databricks/workspace/workspace-details#workspace-instance-names-urls-and-ids),
@@ -11,26 +21,27 @@ The Databricks Volumes prerequisites:
   - Azure: `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`
   - GCP: `https://<workspace-id>.<random-number>.gcp.databricks.com`
 
-- The Databricks compute resource's ID. Get the compute resource ID for
-  [AWS](https://docs.databricks.com/integrations/compute-details.html),
-  [Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details),
-  or [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).
-
 - The Databricks authentication details. For more information, see the documentation for
   [AWS](https://docs.databricks.com/dev-tools/auth/index.html),
   [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/),
   or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/index.html).
 
-  More specifically, you will need:
+  More specifically, you will need the following authentication details.
+
+  The following authentication types are supported by both [Unstructured API services](/api-reference/api-services/overview)
+  and the [Unstructured Platform](/platform/overview):
 
   - For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token's value.
-  - For username and password (basic) authentication (AWS only): The user's name and password values.
   - For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
+  - For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
+
+  The following authentication types are supported only by Unstructured API services:
+
+  - For username and password (basic) authentication (AWS only): The user's name and password values.
   - For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
   - For Azure managed identities (MSI) authentication (Azure only): The client ID value for the corresponding managed identity.
   - For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
   - For Azure CLI authentication (Azure only): No additional values.
-  - For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
   - For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account's credentials file.
   - For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account's email address.
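For the CLI and Python paths, the `databricks-volumes-cli-api.mdx` snippet above documents `DATABRICKS_HOST`, `DATABRICKS_CATALOG`, `DATABRICKS_SCHEMA`, and `DATABRICKS_VOLUME`. A preflight sketch that checks those settings plus one supported credential pair; the credential variable names (`DATABRICKS_TOKEN`, `DATABRICKS_CLIENT_ID`, `DATABRICKS_CLIENT_SECRET`) follow the Databricks SDK's unified-auth conventions and are assumptions here, not variables the snippet defines:

```python
import os


def check_databricks_env() -> None:
    """Preflight check before wiring up a Databricks Volumes connector.

    DATABRICKS_HOST/CATALOG/SCHEMA/VOLUME match the ingest variables in
    databricks-volumes-cli-api.mdx; the credential variable names are
    assumptions borrowed from the Databricks SDK's unified-auth conventions.
    """
    required = ("DATABRICKS_HOST", "DATABRICKS_CATALOG", "DATABRICKS_VOLUME")
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise SystemExit(f"Missing required settings: {', '.join(missing)}")

    has_pat = bool(os.getenv("DATABRICKS_TOKEN"))
    has_m2m = bool(os.getenv("DATABRICKS_CLIENT_ID")) and bool(os.getenv("DATABRICKS_CLIENT_SECRET"))
    if not (has_pat or has_m2m):
        raise SystemExit(
            "No supported credentials found: set DATABRICKS_TOKEN (personal "
            "access token or Entra ID user token) or DATABRICKS_CLIENT_ID and "
            "DATABRICKS_CLIENT_SECRET (OAuth M2M)."
        )

    # Mirrors the documented fallback: "default" is used when the schema is unset.
    print("Using schema:", os.getenv("DATABRICKS_SCHEMA", "default"))


if __name__ == "__main__":
    check_databricks_env()
```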