Use this repository to git (get) started with data engineering tasks using Azure Data Factory and Azure Databricks.
git clone https://github.com/commonacumen/gitstartedwithdataengineering.git
The datasets come from the Diabetes dataset on Microsoft.com, originally from the original dataset description and original data file, plus an ageband dataset created by me.
These datasets are included in the data folder of this GitHub repo.
Step 2 Create an Azure Data Factory pipeline from a local template to copy and transform the datasets with ADF
Download the ADF template zip, or find it in your cloned GitHub repo.
Open the ADF instance deployed by the ARM template. Select Pipeline > Import from pipeline template.
Select the zip file and click Open.
It should look like this
Select +New in the first User input and create an HTTP Linked Service.
Make sure the URL uses the raw format:
https://raw.githubusercontent.com/commonacumen/gitstartedwithdataengineering/main/data/
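(Optional) You can sanity-check that a raw URL resolves before wiring it into the linked service. A minimal sketch in Python; the file name diabetes.csv is hypothetical, so substitute a file that actually exists in the data folder:
# Sanity check: confirm the raw GitHub URL serves a file (run locally).
# NOTE: "diabetes.csv" is a hypothetical name; use a real file from the data folder.
import urllib.request

base = "https://raw.githubusercontent.com/commonacumen/gitstartedwithdataengineering/main/data/"
with urllib.request.urlopen(base + "diabetes.csv") as resp:
    print(resp.status, resp.headers.get("Content-Type"))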
Select +New in the second User input and create an Azure Data Lake Storage Gen2 Linked Service
Then click on Use this template
It should look like this when it is imported
Click Debug, then click OK.
Once the pipeline runs successfully it should look like this
Check that the files have been created in storage, using Azure Storage Explorer or the Azure portal in the browser. The files should be in the silver container, at a path like diabetes/adfdelta/.
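If you prefer to check from code instead of the portal, here is a minimal sketch with the azure-storage-file-datalake SDK, assuming you are signed in via the Azure CLI and substituting your own account name:
# pip install azure-identity azure-storage-file-datalake
# Lists the files ADF wrote to the silver container (sketch; account name is yours).
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<adlsAccountName>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client(file_system="silver")
for p in fs.get_paths(path="diabetes/adfdelta"):
    print(p.name)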
You can now save the pipeline by clicking on Publish all
- Databricks Runtime 8.3 or above when you create your cluster
- Set up permissions to ADLS Gen2
- Secrets in Key Vault
Steps
Open up your Databricks workspace, navigate to your user, select the dropdown, and select Import.
Import from file if you cloned the repo locally, or enter the URL to the notebook ConnectToDeltaOnADLS.ipynb in the GitHub repo:
https://raw.githubusercontent.com/commonacumen/gitstartedwithdataengineering/main/code/notebooks/ConnectToDeltaOnADLS.ipynb
Then click Import.
You should now have a notebook that looks like this:
Change the value of adlsAccountName = "" in cell one to the ADLS account name in your deployment.
In my case the deployment has a storage account name of cdcacceleredfkdd3zynq6k,
so the first row of the cell would read:
adlsAccountName = "cdcacceleredfkdd3zynq6k"
Create a service principal (Reference)
Create an Azure Active Directory app and service principal, in the form of a client ID and client secret.
- Sign in to your Azure account through the Azure portal.
- Select Azure Active Directory.
- Select App registrations.
- Select New registration.
- Name the application something like databricksSrvPrin. Select a supported account type, which determines who can use the application. After setting the values, select Register. Note that it is a good idea to give the application a name unique to you, such as your email alias (darsch in my case), because others might use similar names like databricksSrvPrin.
- Copy the Directory (tenant) ID and store it; you will use it along with the application secret.
- Copy the Application (client) ID and store it; you will use it along with the application secret.
Assign Role Permissions
- At the storage account level, assign this app's service principal the Storage Blob Data Contributor role on the storage account in which the input path resides, so that it can access storage.
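If you prefer scripting this role assignment, here is a hedged sketch with the azure-mgmt-authorization SDK; the GUID shown is the well-known role definition ID for Storage Blob Data Contributor, and note that the principal ID must be the service principal's object ID, not the application (client) ID. Verify the call against your SDK version:
# pip install azure-identity azure-mgmt-authorization
# A sketch, assuming `az login` and placeholder IDs.
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient

subscription_id = "<SubscriptionID>"
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/<ResourceGroup>"
    "/providers/Microsoft.Storage/storageAccounts/<adlsAccountName>"
)
# Well-known role definition GUID for Storage Blob Data Contributor.
role_definition_id = (
    f"/subscriptions/{subscription_id}/providers/Microsoft.Authorization"
    "/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe"
)
client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)
client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # a new role assignment name must be a GUID
    {"role_definition_id": role_definition_id,
     "principal_id": "<service principal OBJECT id, not the application id>"},
)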
Create a new application secret
- Select Azure Active Directory.
- From App registrations in Azure AD, select your application.
- Select Certificates & secrets.
- Select Client secrets -> New client secret.
- Provide a description of the secret (for example AppSecret) and a duration. When done, select Add.
After saving the client secret, its value is displayed. Copy this value now, because you won't be able to retrieve it later. You will provide the secret value together with the application ID to sign in as the application, so store it where your application can retrieve it.
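These three values (Directory/tenant ID, Application/client ID, and the secret value) are everything a client needs to sign in as the service principal. A quick way to confirm they work, as a sketch with the azure-identity library and placeholder values:
# pip install azure-identity
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<Directory (tenant) ID>",
    client_id="<Application (client) ID>",
    client_secret="<client secret value>",
)
# Requesting a token for Azure Storage proves the principal can authenticate.
token = credential.get_token("https://storage.azure.com/.default")
print("Token acquired, expires at", token.expires_on)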
Create a Key Vault in the Resource group by clicking Create
Search for Key vault
Click Create
Create the Key Vault in the same Resource group and Region as your other deployed resources. Click Review + create and then click Create.
You should now have a Key vault in your resources
Open up your Key Vault and add the app secret created above.
Choose Secrets and click Generate/Import.
Enter your secret name, paste in the app secret you created earlier, set the activation date, and click Create.
It should look like this:
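If you would rather script this step than use the portal, the same secret can be written with the azure-keyvault-secrets SDK; a sketch, with the vault name as a placeholder:
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://<your-key-vault-name>.vault.azure.net/",
    credential=DefaultAzureCredential(),
)
client.set_secret("AppSecret", "<client secret value copied earlier>")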
Create the rest of the secrets you need for the notebook
Create the rest of the secrets referenced in cell 4 of the notebook. The secret name is at the end of each line, after <EnterDatabrickSecretScopeHere>:
SubscriptionID = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","SubscriptionID")
DirectoryID = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","DirectoryID")
ServicePrincipalAppID = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","ServicePrincipalAppID")
ServicePrincipalSecret = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","AppSecret")
ResourceGroup = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","ResourceGroup")
BlobConnectionKey = dbutils.secrets.get("<EnterDatabrickSecretScopeHere>","Adls2-KeySecret")
#Secret Names#
SubscriptionID
DirectoryID
ServicePrincipalAppID
AppSecret (already created above)
ResourceGroup
Adls2-KeySecret
The Adls2-KeySecret is created using the storage account key
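You can copy the key from the Access keys blade of the storage account in the portal, or fetch it with the management SDK; a sketch, assuming you are signed in via the Azure CLI and substituting your own IDs:
# pip install azure-identity azure-mgmt-storage
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<SubscriptionID>")
keys = client.storage_accounts.list_keys("<ResourceGroup>", "<adlsAccountName>")
print(keys.keys[0].value)  # save this value as the Adls2-KeySecret in Key Vault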
Create an Azure Key Vault-backed secret scope using the UI (Reference)
Verify that you have Contributor permission on the Azure Key Vault instance that you want to use to back the secret scope.
Go to https://databricks-instance/#secrets/createScope. This URL is case sensitive; the "S" in createScope must be uppercase.
In my case https://adb-1558951773184856.16.azuredatabricks.net/#secrets/createScope
You can find the databricks-instance in the URL of your workspace
Enter Scope Name: I chose something like databricksSrvPrin, which is what I used in the notebook.
Manage Principal: All Users
DNS Name: https://xxxxxx.vault.azure.net/
Find it in the properties of the Key Vault, under Vault URI.
Resource ID: Find in the properties of the Key vault. Looks something like this:
/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourcegroups/databricks-rg/providers/Microsoft.KeyVault/vaults/databricksKV
Click Create
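The same scope can also be created through the Databricks REST API (2.0/secrets/scopes/create); a sketch with the requests library, assuming you have an Azure AD access token for the Databricks resource (a personal access token is not sufficient for Key Vault-backed scopes):
import requests

workspace = "https://adb-<workspace-id>.<region>.azuredatabricks.net"
resp = requests.post(
    workspace + "/api/2.0/secrets/scopes/create",
    headers={"Authorization": "Bearer <Azure AD access token>"},
    json={
        "scope": "databricksSrvPrin",
        "scope_backend_type": "AZURE_KEYVAULT",
        "backend_azure_keyvault": {
            "resource_id": "<Key Vault Resource ID from above>",
            "dns_name": "https://<your-key-vault-name>.vault.azure.net/",
        },
        "initial_manage_principal": "users",
    },
)
resp.raise_for_status()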
Create a cluster using Runtime 8.3 or above.
Enter a Cluster Name, Runtime Version, Terminate after value, Min Workers, and Max Workers, then click Create Cluster.
Change the value of EnterDatabrickSecretScopeHere in cells 3 and 5 to the scope name you created earlier.
In my case that is databricksSrvPrin,
so cell 3 would read:
spark.conf.set(
"fs.azure.account.key." + adlsAccountName + ".dfs.core.windows.net",
dbutils.secrets.get(scope="databricksSrvPrin",key="Adls2-KeySecret"))
Cell 5 would read:
SubscriptionID = dbutils.secrets.get("databricksSrvPrin","SubscriptionID")
DirectoryID = dbutils.secrets.get("databricksSrvPrin","DirectoryID")
ServicePrincipalAppID = dbutils.secrets.get("databricksSrvPrin","ServicePrincipalAppID")
ServicePrincipalSecret = dbutils.secrets.get("databricksSrvPrin","AppSecret")
ResourceGroup = dbutils.secrets.get("databricksSrvPrin","ResourceGroup")
BlobConnectionKey = dbutils.secrets.get("databricksSrvPrin","Adls2-KeySecret")
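For orientation, the later cells presumably use these values to mount ADLS Gen2 with OAuth; a typical service principal mount looks like the sketch below (the mount point /mnt/silver is an assumption, not necessarily what the notebook uses):
# A typical ADLS Gen2 OAuth mount with a service principal (a sketch; the
# notebook's own cells may differ). Uses the variables loaded in cells 1 and 5.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": ServicePrincipalAppID,
    "fs.azure.account.oauth2.client.secret": ServicePrincipalSecret,
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/" + DirectoryID + "/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://silver@" + adlsAccountName + ".dfs.core.windows.net/",
    mount_point="/mnt/silver",  # assumed mount point
    extra_configs=configs,
)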
Once the cluster is started, you will be able to run the code in the cells.
Click Run Cell.
Do this for each subsequent cell.
You can skip cell 8 the first time, because nothing has been mounted yet. You may get an error like this in cell 7:
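One way to avoid that first-run error is to guard the unmount with a check that the mount point actually exists; a sketch, where the mount point name is an assumption:
# Only unmount if the mount point exists; avoids the first-run error in cell 7.
mount_point = "/mnt/silver"  # assumed; use the notebook's actual mount point
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)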