
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Lab: Orchestrating Jobs with Databricks

In this lab, you'll configure a multi-task job consisting of:
* A notebook that lands a new batch of data in a storage directory
* A Delta Live Tables pipeline that processes this data through a series of tables
* A notebook that queries the gold table produced by this pipeline as well as various metrics output by DLT

## Learning Objectives
By the end of this lab, you should be able to:
* Schedule a notebook as a task in a Databricks Job
* Schedule a DLT pipeline as a task in a Databricks Job
* Configure linear dependencies between tasks using the Databricks Workflows UI

In [0]:
%run ../../Includes/Classroom-Setup-05.2.1L

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineer-learning-path/v01"

Validating the locally installed datasets:
| listing local files...(1 seconds)
| completed (1 seconds total)

Creating & using the schema "hamed_vaheb_jcxq_da_delp_jobs_lab"...(0 seconds)
Predefined tables in "hamed_vaheb_jcxq_da_delp_jobs_lab":
| -none-

Predefined paths variables:
| DA.paths.working_dir:      dbfs:/mnt/dbacademy-users/hamed.vaheb@pwc.lu/data-engineer-learning-path/jobs_lab
| DA.paths.user_db:          dbfs:/mnt/dbacademy-users/hamed.vaheb@pwc.lu/data-engineer-learning-path/jobs_lab/database.db
| DA.paths.datasets:         dbfs:/mnt/dbacademy-datasets/data-engineer-learning-path/v01
| DA.paths.stream_path:      dbfs:/mnt/dbacademy-users/hamed.vaheb@pwc.lu/data-engineer-learning-path/jobs_lab/stream
| DA.paths.storage_location: dbfs:/mnt/dbacademy-users/hamed.vaheb@pwc.lu/data-engineer-learning-path/jobs_lab/storage_location

Setup completed (3 seconds)


## Land Initial Data
Seed the landing zone with some data before proceeding. 

You will re-run this command to land additional data later.

In [0]:
DA.data_factory.load()

Loading the file 01.json to the dbfs:/mnt/dbacademy-users/hamed.vaheb@pwc.lu/data-engineer-learning-path/jobs_lab/stream/01.json


## Schedule a Notebook Job

When using the Jobs UI to orchestrate a workload with multiple tasks, you'll always begin by scheduling a single task.

Before we start, run the following cell to get the values used in this step.

In [0]:
DA.print_job_config()

0,1
Job Name:,
Batch Notebook Path:,
Query Notebook Path:,


Here, we'll start by scheduling the first notebook.

Steps:
1. Click the **Workflows** button on the sidebar
1. Select the **Jobs** tab.
1. Click the blue **Create Job** button
1. Configure the task:
    1. Enter **Batch-Job** for the task name
    1. For **Type**, select **Notebook**
    1. For **Path**, select the **Batch Notebook Path** value provided in the cell above
    1. From the **Cluster** dropdown, under **Existing All Purpose Clusters**, select your cluster
    1. Click **Create**
1. In the top-left of the screen, rename the job (not the task) from **`Batch-Job`** (the default value) to the **Job Name** value provided in the cell above.
1. Click the blue **Run now** button in the top right to start the job.

<img src="https://files.training.databricks.com/images/icon_note_24.png"> **Note**: When you select your all-purpose cluster, you will get a warning that the job will be billed as all-purpose compute. Production jobs should always be scheduled against new job clusters appropriately sized for the workload, because job compute is billed at a much lower rate.
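
The single-task job configured in the steps above can also be described as a request body in the style of the Databricks Jobs API 2.1, rather than through the UI. The following is a minimal sketch and not part of the lab itself: the job name, notebook path, and cluster ID are hypothetical placeholders standing in for the values printed by `DA.print_job_config()` and for your own cluster, and the payload is only built and printed, never submitted.

```python
import json


def build_batch_job_payload(job_name, notebook_path, cluster_id):
    """Build a Jobs API 2.1 style request body for a one-task notebook job."""
    return {
        "name": job_name,
        "tasks": [
            {
                # The task name entered in the UI corresponds to task_key.
                "task_key": "Batch-Job",
                "notebook_task": {"notebook_path": notebook_path},
                # Existing all-purpose cluster, as selected in the steps above.
                "existing_cluster_id": cluster_id,
            }
        ],
    }


# Placeholder values (assumptions for illustration); substitute your own.
payload = build_batch_job_payload(
    job_name="my-jobs-lab",
    notebook_path="/path/to/batch-notebook",
    cluster_id="hypothetical-cluster-id",
)
print(json.dumps(payload, indent=2))
```

For a production job, the `existing_cluster_id` entry would typically give way to a `new_cluster` specification so that the task runs on job compute, in line with the note above.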

## Schedule a DLT Pipeline as a Task

In this step, we'll add a DLT pipeline to execute after the success of the task we configured at the start of this lesson.

So that we can focus on Jobs and not Pipelines, we'll use the following utility command to create the pipeline for us.

In [0]:
DA.create_pipeline()

0,1
Pipeline Name:,


Steps:
1. At the top left of your screen, you'll see the **Runs** tab is currently selected; click the **Tasks** tab.
1. Click the large blue circle with a **+** at the center bottom of the screen to add a new task
1. Configure the task:
    1. Enter **DLT** for the task name
    1. For **Type**, select **Delta Live Tables pipeline**
    1. For **Pipeline**, select the pipeline name provided in the cell above
    1. The **Depends on** field defaults to your previously defined task, **Batch-Job** - leave this value as-is
    1. Click the blue **Create task** button

You should now see a screen with two boxes and a downward arrow between them.

Your **`Batch-Job`** task will be at the top, leading into your **`DLT`** task.
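
The arrow between the two boxes corresponds to a `depends_on` entry on the downstream task in a Jobs API 2.1 style job definition. The sketch below illustrates this under stated assumptions: `add_pipeline_task` is a hypothetical helper written for this example, and the pipeline ID and notebook path are placeholders rather than values produced by `DA.create_pipeline()`.

```python
def add_pipeline_task(job_settings, task_key, pipeline_id, upstream_task_key):
    """Return a copy of job_settings with a DLT pipeline task appended.

    The new task runs only after upstream_task_key has succeeded.
    """
    new_task = {
        "task_key": task_key,
        "pipeline_task": {"pipeline_id": pipeline_id},
        "depends_on": [{"task_key": upstream_task_key}],
    }
    # Build new containers so the caller's settings are left unchanged.
    return {**job_settings, "tasks": [*job_settings["tasks"], new_task]}


# Placeholder job definition (assumption): one notebook task already exists.
job_settings = {
    "name": "my-jobs-lab",
    "tasks": [
        {
            "task_key": "Batch-Job",
            "notebook_task": {"notebook_path": "/path/to/batch-notebook"},
        }
    ],
}

updated = add_pipeline_task(
    job_settings, "DLT", "hypothetical-pipeline-id", "Batch-Job"
)

for task in updated["tasks"]:
    upstream = [dep["task_key"] for dep in task.get("depends_on", [])]
    print(task["task_key"], "depends on", upstream or "nothing")
```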

## Schedule an Additional Notebook Task

An additional notebook has been provided which queries some of the DLT metrics and the gold table defined in the DLT pipeline. 

We'll add this as a final task in our job.

Steps:
1. Click the large blue circle with a **+** at the center bottom of the screen to add a new task
1. Configure the task:
    1. Enter **Query-Results** for the task name
    1. For **Type**, select **Notebook**
    1. For **Path**, select the **Query Notebook Path** value provided at the start of this lesson
    1. From the **Cluster** dropdown, under **Existing All Purpose Clusters**, select your cluster
    1. The **Depends on** field defaults to your previously defined task, **DLT** - leave this value as-is.
    1. Click the blue **Create task** button
    
Click the blue **Run now** button in the top right of the screen to run this job.

From the **Runs** tab, you will be able to click on the start time for this run under the **Active runs** section and visually track task progress.

Once all your tasks have succeeded, review the contents of each task to confirm expected behavior.
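
To reason about the order in which the three tasks run, the linear dependencies can be modeled as a small graph and sorted topologically. This is only an illustrative sketch of the dependency logic, not how Databricks schedules tasks internally; it assumes the task names used in this lab and Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Task definitions mirroring the dependencies configured in the UI.
tasks = [
    {"task_key": "Batch-Job"},
    {"task_key": "DLT", "depends_on": [{"task_key": "Batch-Job"}]},
    {"task_key": "Query-Results", "depends_on": [{"task_key": "DLT"}]},
]


def run_order(tasks):
    """Return task keys in an order that respects every depends_on entry."""
    # Map each task to the set of tasks that must finish before it starts.
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in tasks
    }
    return list(TopologicalSorter(graph).static_order())


order = run_order(tasks)
print(" -> ".join(order))  # Batch-Job -> DLT -> Query-Results
```

Because each task depends on exactly one predecessor, there is only one valid order, which matches the top-to-bottom layout shown in the **Tasks** tab.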

In [0]:
DA.validate_job_config()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>