From c3ee2ed5b5c1b95f4c97386a12b2af63d7038784 Mon Sep 17 00:00:00 2001
From: Steinthor Palsson
Date: Wed, 7 Feb 2024 21:05:20 -0500
Subject: [PATCH 1/4] Databricks workspace setup docs

---
 .../dlt-ecosystem/destinations/databricks.md | 84 ++++++++++++++++++-
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/docs/website/docs/dlt-ecosystem/destinations/databricks.md b/docs/website/docs/dlt-ecosystem/destinations/databricks.md
index 120ebfb6cd..f3e4d7b728 100644
--- a/docs/website/docs/dlt-ecosystem/destinations/databricks.md
+++ b/docs/website/docs/dlt-ecosystem/destinations/databricks.md
@@ -15,7 +15,85 @@ keywords: [Databricks, destination, data warehouse]
 pip install dlt[databricks]
 ```

-## Setup Guide
+## Set up your Databricks workspace
+
+To use the Databricks destination, you need:
+
+* A Databricks workspace with a Unity Catalog metastore connected
+* A Gen 2 Azure storage account and container
+
+If you already have your Databricks workspace set up, you can skip to the [Loader setup guide](#loader-setup-guide).
+
+### 1. Create a Databricks workspace in Azure
+
+1. Create a Databricks workspace in Azure
+
+   In your Azure Portal, search for Databricks and create a new workspace. In the "Pricing Tier" section, select "Premium" to be able to use the Unity Catalog.
+
+2. Create a storage account
+
+   Search for "Storage accounts" in the Azure Portal and create a new storage account. Make sure to select "StorageV2 (general purpose v2)" as the account kind.
+
+3. Create a container in the storage account
+
+   In the storage account, create a new container. This will be used as a datastore for your Databricks catalog.
+
+4. Create an Access Connector for Azure Databricks
+
+   This will allow Databricks to access your storage account.
+   In the Azure Portal, search for "Access Connector for Azure Databricks" and create a new connector.
+
+5. Grant access to your storage container
+
+   Navigate to the storage container you created before and select "Access control (IAM)" in the left-hand menu.
+
+   Add a new role assignment and select "Storage Blob Data Contributor" as the role. Under "Members", select "Managed Identity" and add the Databricks Access Connector you created in the previous step.
+
+### 2. Set up a metastore and Unity Catalog and get your access token
+
+1. Now go to your Databricks workspace
+
+   To get there from the Azure Portal, search for "Databricks", select your Databricks resource, and click "Launch Workspace".
+
+2. In the top right corner, click on your email address and go to "Manage Account"
+
+3. Go to "Data" and click on "Create Metastore"
+
+   Name your metastore and select a region.
+   If you'd like to set up a storage container for the whole metastore, you can add your ADSL URL and Access Connector Id here. You can also do this on a granular level when creating the catalog.
+
+   In the next step, assign your metastore to your workspace.
+
+4. Go back to your workspace and click on "Compute" in the left-hand menu
+
+5. Create a new cluster
+
+   Make sure to set "Access Mode" to either "Single User" or "Shared" to be able to use the Unity Catalog.
+
+6. Once your cluster is created, go to "Catalog" on the left sidebar
+
+7. Click "+ Add" and select "Add Storage Credential"
+
+   Create a name and paste in the resource ID of the Databricks Access Connector from the Azure portal.
+   It will look something like this: `/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector_name>`
+
+8. Click "+ Add" again and select "Add external location"
+
+   Set the URL of your storage container. This should be in the form: `abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>`
+
+   Once created, you can test the connection to make sure the container is accessible from Databricks.
+
+9. Now you can create a catalog
+
+   Go to "Catalog" and click "Create Catalog". Name your catalog and select the storage location you created in the previous step.
+
+10. Create your access token
+
+    Click your email in the top right corner and go to "User Settings". Go to "Developer" -> "Access Tokens".
+    Generate a new token and save it. You will use it in your `dlt` configuration.
+
+## Loader setup guide

 **1. Initialize a project with a pipeline that loads to Databricks by running**
 ```
 dlt init chess databricks
 ```

 This will install dlt with **databricks** extra which contains Databricks Python dbapi client.

 This should have your connection parameters and your personal access token.

-It should now look like:
+You will find your server hostname and HTTP path in your cluster settings -> Advanced Options -> JDBC/ODBC.
+
+Example:

 ```toml
 [destination.databricks.credentials]

From d15e9b64e8f87adcdb373738d46687e8b12b4d07 Mon Sep 17 00:00:00 2001
From: Steinthor Palsson
Date: Wed, 7 Feb 2024 21:18:01 -0500
Subject: [PATCH 2/4] Include databricks in sidebar

---
 docs/website/sidebars.js | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js
index a906526e95..9a21ef11d5 100644
--- a/docs/website/sidebars.js
+++ b/docs/website/sidebars.js
@@ -97,6 +97,7 @@ const sidebars = {
         'dlt-ecosystem/destinations/motherduck',
         'dlt-ecosystem/destinations/weaviate',
         'dlt-ecosystem/destinations/qdrant',
+        'dlt-ecosystem/destinations/databricks',
       ]
     },
   ],
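The hunk in patch 1 is cut off by diff context at the opening line of the credentials block. For readers following the series, here is a minimal sketch of the pipeline that `dlt init chess databricks` scaffolds, once credentials are in `.dlt/secrets.toml`; the pipeline, dataset, and table names are illustrative, and the inline records stand in for a real chess source:

```python
import dlt

# Minimal sketch: run the pipeline scaffolded by `dlt init chess databricks`.
# Databricks credentials are resolved from .dlt/secrets.toml at run time,
# so nothing secret needs to appear in the script itself.
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",  # illustrative names throughout
    destination="databricks",
    dataset_name="chess_data",
)

# Two inline records stand in for a real chess API source.
players = [
    {"username": "magnuscarlsen", "rating": 2830},
    {"username": "hikaru", "rating": 2780},
]

info = pipeline.run(players, table_name="players")
print(info)
```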
Click "+ Add" again and select "Add external location" + + Set the URL of our storage container. This should be in the form: `abfss://@.dfs.core.windows.net/` + + Once created you can test the connection to make sure the container is accessible from databricks. + +9. Now you can create a catalog + + Go to "Catalog" and click "Create Catalog". Name your catalog and select the storage location you created in the previous step. + +10. Create your access token + + Click your email in the top right corner and go to "User Settings". Go to "Developer" -> "Access Tokens". + Generate a new token and save it. You will use it in your `dlt` configuration. + +## Loader setup Guide **1. Initialize a project with a pipeline that loads to Databricks by running** ``` @@ -32,7 +110,9 @@ This will install dlt with **databricks** extra which contains Databricks Python This should have your connection parameters and your personal access token. -It should now look like: +You will find your server hostname and HTTP path in the your cluster settings -> Advanced Options -> JDBC/ODBC. + +Example: ```toml [destination.databricks.credentials] From d15e9b64f8e87adcdb373738d46687e8b12b4d07 Mon Sep 17 00:00:00 2001 From: Steinthor Palsson Date: Wed, 7 Feb 2024 21:18:01 -0500 Subject: [PATCH 2/4] Include databricks in sidebar --- docs/website/sidebars.js | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/website/sidebars.js b/docs/website/sidebars.js index a906526e95..9a21ef11d5 100644 --- a/docs/website/sidebars.js +++ b/docs/website/sidebars.js @@ -97,6 +97,7 @@ const sidebars = { 'dlt-ecosystem/destinations/motherduck', 'dlt-ecosystem/destinations/weaviate', 'dlt-ecosystem/destinations/qdrant', + 'dlt-ecosystem/destinations/databricks', ] }, ], From b8c51c37096cea29e79b486a81de80812a857836 Mon Sep 17 00:00:00 2001 From: Steinthor Palsson Date: Thu, 8 Feb 2024 12:00:36 -0500 Subject: [PATCH 3/4] Cluster -> Warehouse --- .../dlt-ecosystem/destinations/databricks.md | 20 +++++++------------ 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/docs/website/docs/dlt-ecosystem/destinations/databricks.md b/docs/website/docs/dlt-ecosystem/destinations/databricks.md index f3e4d7b728..02820966c9 100644 --- a/docs/website/docs/dlt-ecosystem/destinations/databricks.md +++ b/docs/website/docs/dlt-ecosystem/destinations/databricks.md @@ -60,35 +60,29 @@ If you already have your Databricks workspace set up, you can skip to the [Loade 3. Go to "Data" and click on "Create Metastore" Name your metastore and select a region. - If you'd like to set up a storage container for the whole metastore you can add your ADSL URL and Access Connector Id here. You can also do this on a granular level when creating the catalog. + If you'd like to set up a storage container for the whole metastore you can add your ADLS URL and Access Connector Id here. You can also do this on a granular level when creating the catalog. In the next step assign your metastore to your workspace. -4. Go back to your workspace and click on "Compute" in the left-hand menu +4. Go back to your workspace and click on "Catalog" in the left-hand menu -5. Create a new cluster - - Make sure to set "Access Mode" to either "Single User" or "Shared" to be able to use the Unity Catalog. - -6. Once your cluster is created, go to "Catalog" on the left sidebar - -7. Click "+ Add" and select "Add Storage Credential" +5. Click "+ Add" and select "Add Storage Credential" Create a name and paste in the resource ID of the Databricks Access Connector from the Azure portal. 
From 7343fff7a431b3db35ff227c870042ff128c66d3 Mon Sep 17 00:00:00 2001
From: Steinthor Palsson
Date: Thu, 8 Feb 2024 12:09:40 -0500
Subject: [PATCH 4/4] Note ADLS gen 2

---
 docs/website/docs/dlt-ecosystem/destinations/databricks.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/website/docs/dlt-ecosystem/destinations/databricks.md b/docs/website/docs/dlt-ecosystem/destinations/databricks.md
index 02820966c9..ae1435b482 100644
--- a/docs/website/docs/dlt-ecosystem/destinations/databricks.md
+++ b/docs/website/docs/dlt-ecosystem/destinations/databricks.md
@@ -30,9 +30,10 @@
    In your Azure Portal, search for Databricks and create a new workspace. In the "Pricing Tier" section, select "Premium" to be able to use the Unity Catalog.

-2. Create a storage account
+2. Create an ADLS Gen 2 storage account

-   Search for "Storage accounts" in the Azure Portal and create a new storage account. Make sure to select "StorageV2 (general purpose v2)" as the account kind.
+   Search for "Storage accounts" in the Azure Portal and create a new storage account.
+   Make sure it's a Data Lake Storage Gen 2 account; you do this by enabling "hierarchical namespace" when creating the account. Refer to the [Azure documentation](https://learn.microsoft.com/en-us/azure/storage/blobs/create-data-lake-storage-account) for further info.

 3. Create a container in the storage account
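Patch 4 makes the hierarchical-namespace requirement explicit, and a quick check is possible from Python. The sketch below assumes the `azure-identity` and `azure-storage-file-datalake` packages, with the account and container names as placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: substitute your storage account and container names.
service = DataLakeServiceClient(
    account_url="https://<storage_account_name>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# The dfs endpoint is the ADLS Gen 2 endpoint; if listing paths fails with
# an endpoint error, hierarchical namespace is likely not enabled.
file_system = service.get_file_system_client("<container_name>")
for path in file_system.get_paths():
    print(path.name)
```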