
gitstartedwithdataengineering

Use this repository to git (get) started with data engineering tasks using Azure Data Factory and Azure Databricks

Step 1 Clone or Fork this GitHub Repo to get the datasets and code

git clone https://github.com/commonacumen/gitstartedwithdataengineering.git

The datasets are from the Diabetes dataset on Microsoft.com (originally from the original dataset description and original data file), plus an ageband dataset created by me.

These datasets are included in the data folder of this GitHub Repo: Datasets Here

Step 2 Create an Azure Data Factory pipeline from a local template to copy and transform datasets using ADF

Download the ADF Template zip or find it in your cloned GitHub Repo.

adftemplatezip

Open up the ADF deployed by the ARM template. Select Pipeline > Import from pipeline template

adfplfromtemplate

Click on the zip file and click Open

adfOpenLocalTemplate

It should look like this

adftemplateUserinputs

Select +New in the first User input and create an HTTP Linked Service.

adfHttpLinkedService

Make sure the URL uses the raw GitHub content host:

https://raw.githubusercontent.com/commonacumen/gitstartedwithdataengineering/main/data/
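
To confirm the base URL serves raw file content, you can fetch one of the dataset files. A quick sketch, where "diabetes.csv" is a hypothetical file name for illustration; substitute a real file from the data folder:

import urllib.request

base = "https://raw.githubusercontent.com/commonacumen/gitstartedwithdataengineering/main/data/"
# "diabetes.csv" is a hypothetical file name; use an actual file from the data folder.
with urllib.request.urlopen(base + "diabetes.csv") as resp:
    print(resp.read(200))  # the first bytes should look like plain CSV text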

Select +New in the second User input and create an Azure Data Lake Storage Gen2 Linked Service

adfAdlsLinkedService

Then click on Use this template

adfAllUserinputs

It should look like this when it is imported

adfTemplateImported

Step 3 Debug the DiabetesCopyAndTransformDataToDelta Pipeline

Click Debug, then click OK

Once the pipeline runs successfully it should look like this

adfSuccessfulRun

Check that the files have been created in Storage using Azure Storage Explorer or the Azure Portal in the browser. The files should be in the silver container at a path like diabetes/adfdelta/

adfFileInStorage
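
Later, in the Databricks notebook, you will read this Delta output. As a preview, here is a minimal sketch of such a read, assuming the silver container and path above and that storage access has already been configured (Step 4); <yourStorageAccount> is a placeholder:

# Read the Delta table written by the ADF pipeline and show a sample.
df = spark.read.format("delta").load(
    "abfss://silver@<yourStorageAccount>.dfs.core.windows.net/diabetes/adfdelta/")
display(df.limit(10))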

You can now save the pipeline by clicking on Publish all

Step 4 Import, configure, and run the Databricks notebook

Requirements

  • Databricks Runtime 8.3 or above when you create your cluster

  • Set up permissions to ADLS Gen2

  • Secrets in Key vault

Steps

Import the Databricks notebook

Open up your Databricks workspace, navigate to your user, select the dropdown, and select Import

adbworkspace

Import from file if you cloned the repo locally, or import from URL using the raw URL of the notebook in the GitHub Repo (ConnectToDeltaOnADLS.ipynb): https://raw.githubusercontent.com/commonacumen/gitstartedwithdataengineering/main/code/notebooks/ConnectToDeltaOnADLS.ipynb. Then click Import.

adbnotebookimport

You should now have a notebook that looks like this:

adbnotebook

Change the value of adlsAccountName = "" in cell 1 to the ADLS account name in your deployment

In my case my deployment has a Storage account name of cdcacceleredfkdd3zynq6k, so the first line of the cell would read:

adlsAccountName = "cdcacceleredfkdd3zynq6k"

adbrgservices

adbadlsacctname
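
The account name is used to build the ABFS path to the Delta output from Step 3. A minimal sketch of how such a path is typically assembled (the silver container and diabetes/adfdelta/ path come from Step 3; the notebook's own cells may differ):

adlsAccountName = "cdcacceleredfkdd3zynq6k"  # replace with your storage account name

# ABFS URI pointing at the Delta files the ADF pipeline wrote in Step 3.
deltaPath = "abfss://silver@" + adlsAccountName + ".dfs.core.windows.net/diabetes/adfdelta/"
print(deltaPath)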

Configure Service Principal and Permissions

Create a Service principal Reference

Create an Azure Active Directory app and service principal, which gives you a client ID and client secret.

  1. Sign in to your Azure Account through the Azure portal.

  2. Select Azure Active Directory.

  3. Select App registrations.

adbappreg

  4. Select New registration.

Name the application something like databricksSrvPrin. Select a supported account type, which determines who can use the application. After setting the values, select Register.

Note that it is a good idea to name the application with something unique to you, like your email alias (darsch in my case), because others might use similar names like databricksSrvPrin.

adbregister

  5. Copy the Directory (tenant) ID and store it; you will add it to Key vault later as the DirectoryID secret.

  6. Copy the Application (client) ID and store it; you will add it to Key vault later as the ServicePrincipalAppID secret.

adbappids

Assign Role Permissions

  1. At the storage account level, assign the app's service principal the following role on the storage account in which the input path resides:

    Storage Blob Data Contributor to access storage

adbstorageiam

Create a new application secret

  • Select Azure Active Directory.

  • From App registrations in Azure AD, select your application.

  • Select Certificates & secrets.

  • Select Client secrets -> New client secret.

  • Provide a description of the secret (e.g., AppSecret) and a duration. When done, select Add.

adbappsecret

After saving the client secret, the value of the client secret is displayed. Copy this value because you won't be able to retrieve the key later. You will provide the key value with the application ID to sign in as the application. Store the key value where your application can retrieve it.

adbappsecretval

Deploy a Key Vault and setup secrets

Create a Key Vault in the Resource group by clicking Create

Search for Key vault

adbkvsearch

Click Create

adbkvcreate

Create the Key Vault in the same Resource group and Region as your other deployed resources. Click Review and Create and then click Create

adbrevcreate

You should now have a Key vault in your resources

adbrgwithkv

Open up your Key vault and add the app secret created above

Choose Secrets and click Generate/Import

adbkvsecretgen

Enter your secret Name, paste in the app secret you created earlier, set the activation date, and click Create

adbcreatesecret

It should look like this:

adbfirstsecret

Create the rest of the secrets you need for the notebook

Create the rest of the secrets referenced in cell 5 of the notebook. The secret name is the second argument on each line, after the <EnterDatabrickSecretScopeHere> scope placeholder:

SubscriptionID = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","SubscriptionID")
DirectoryID = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","DirectoryID")
ServicePrincipalAppID = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","ServicePrincipalAppID")
ServicePrincipalSecret = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","AppSecret")
ResourceGroup = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","ResourceGroup")
BlobConnectionKey = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","Adls2-KeySecret")

Secret Names:

  • SubscriptionID

  • DirectoryID

  • ServicePrincipalAppID

  • AppSecret (already created above)

  • ResourceGroup

  • Adls2-KeySecret

The Adls2-KeySecret value is the storage account key, which you can copy from Access keys on the storage account

secrets
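
If you prefer not to click through the portal for each secret, they can also be created programmatically. A hedged sketch using the Azure SDK for Python, assuming the azure-identity and azure-keyvault-secrets packages are installed and you are signed in (e.g., via az login); <your-key-vault-name> and the secret value are placeholders:

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Vault URI from the Key vault's properties (placeholder below).
client = SecretClient(vault_url="https://<your-key-vault-name>.vault.azure.net/",
                      credential=DefaultAzureCredential())

# Create each secret the notebook expects; repeat for every name listed above.
client.set_secret("SubscriptionID", "<your-subscription-id>")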

Create an Azure Key Vault-backed secret scope using the UI Reference

Verify that you have Contributor permission on the Azure Key Vault instance that you want to use to back the secret scope.

Go to https://databricks-instance/#secrets/createScope. This URL is case sensitive; the "S" in createScope must be uppercase.

https://databricks-instance/#secrets/createScope

In my case https://adb-1558951773184856.16.azuredatabricks.net/#secrets/createScope

You can find the databricks-instance in the URL of your workspace

adbinstance

Enter a Scope Name: I chose something like databricksSrvPrin, which is what I used in the notebook

Manage Principal: All Users

DNS Name: https://xxxxxx.vault.azure.net/ (find it in the properties of the Key vault, under Vault URI)

Resource ID: find it in the properties of the Key vault. It looks something like this:

/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/databricks-rg/providers/Microsoft.KeyVault/vaults/databricksKV

adbsecretResID

Click Create

adbsecretscope
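
Once the scope exists, you can sanity-check it from any notebook cell attached to a running cluster. A small sketch, assuming the databricksSrvPrin scope name used here:

# List the secrets visible through the Key Vault-backed scope.
display(dbutils.secrets.list("databricksSrvPrin"))

# Fetching a secret works too, but notebook output shows it as [REDACTED].
print(dbutils.secrets.get(scope="databricksSrvPrin", key="SubscriptionID"))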

Create a Databricks Cluster and attach to notebook

Create a cluster using Databricks Runtime 8.3 or above

Enter a Cluster Name and Runtime Version, set Terminate after, Min Workers, and Max Workers, then click Create Cluster

adbcreatecluster

Add the Scopes into Cells 3 and 5

Change the value of EnterDatabrickSecretScopeHere in cell 3 and 5 to the Scope name you created earlier.

In my case databricksSrvPrin, so cell 3 would read:

spark.conf.set(
    "fs.azure.account.key." + adlsAccountName + ".dfs.core.windows.net",
    dbutils.secrets.get(scope="databricksSrvPrin",key="Adls2-KeySecret"))

Cell 5 would read:

SubscriptionID = dbutils.secrets.get("databricksSrvPrin","SubscriptionID")
DirectoryID = dbutils.secrets.get("databricksSrvPrin","DirectoryID")
ServicePrincipalAppID = dbutils.secrets.get("databricksSrvPrin","ServicePrincipalAppID")
ServicePrincipalSecret = dbutils.secrets.get("databricksSrvPrin","AppSecret")
ResourceGroup = dbutils.secrets.get("databricksSrvPrin","ResourceGroup")
BlobConnectionKey = dbutils.secrets.get("databricksSrvPrin","Adls2-KeySecret")
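
For reference, a common way these service-principal values end up being used is the standard OAuth configuration for ABFS; this is a sketch of that pattern, not necessarily the notebook's exact code (adlsAccountName comes from cell 1, the credential variables from cell 5):

# Configure ABFS to authenticate with the service principal via OAuth.
spark.conf.set("fs.azure.account.auth.type." + adlsAccountName + ".dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type." + adlsAccountName + ".dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id." + adlsAccountName + ".dfs.core.windows.net",
               ServicePrincipalAppID)
spark.conf.set("fs.azure.account.oauth2.client.secret." + adlsAccountName + ".dfs.core.windows.net",
               ServicePrincipalSecret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint." + adlsAccountName + ".dfs.core.windows.net",
               "https://login.microsoftonline.com/" + DirectoryID + "/oauth2/token")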

Run the notebook one cell at a time (at least the first time)

Once the cluster is started you will be able to run the code in the cells

Click on Run Cell

adbcruncell

Do this for each subsequent cell.

You can skip cell 8 the first time because nothing has been mounted yet. You may get an error like this in cell 7:

adbunmount
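
The mount-related error is usually an unmount of a path that was never mounted. A hedged sketch of the defensive pattern (the /mnt/diabetes mount point is hypothetical; use whatever path the notebook mounts):

# Unmount if already mounted; on a fresh workspace this raises an error you can safely ignore.
try:
    dbutils.fs.unmount("/mnt/diabetes")  # hypothetical mount point for illustration
except Exception as e:
    print("Nothing to unmount yet:", e)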
