
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# CI/CD with DABs
## Overview

In this lesson, we build on our understanding of Databricks Asset Bundles (DABs) and apply it to a CI/CD workflow. You will learn about continuous deployment using DABs within a workflow that features a more complex architecture.

## Learning Objectives:
_By the end of the demonstration, you will be able to do the following:_
- Understand how to set variables for a bundle
- Perform unit and integration tests with a DLT pipeline by deploying a DAB
- Deploy across multiple environments (catalogs): development, staging, and production.

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

  - In the drop-down, select **More**.

  - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course. 

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course.

In [0]:
%run ../Includes/Classroom-Setup-06

## IMPORTANT LAB INFORMATION

Recall that your credentials are stored in a file when running [0 - REQUIRED - Course Setup and Authentication]($../0 - REQUIRED - Course Setup and Authentication).

If you end your lab or your lab session times out, your environment will be reset.

If you encounter an error regarding unavailable catalogs or if your Databricks CLI is not authenticated, you will need to rerun the [0 - REQUIRED - Course Setup and Authentication]($../0 - REQUIRED - Course Setup and Authentication) notebook to recreate the catalogs and your Databricks CLI credentials.

**Use classic compute to use the CLI through a notebook.**

Run the Databricks CLI command below to confirm the Databricks CLI is authenticated.

<br></br>
##### DATABRICKS CLI ERROR TROUBLESHOOTING:
  - If you encounter a Databricks CLI authentication error, it means you haven't created the personal access token (PAT) specified in notebook **0 - REQUIRED - Course Setup and Authentication**. You will need to set up Databricks CLI authentication as shown in that notebook.

  - If you encounter the error below, it means your `databricks.yml` file is invalid due to a modification. Even for non-DAB CLI commands, the `databricks.yml` file is still required, as it may contain important authentication details, such as the host and profile, which are utilized by the CLI commands.

![CLI Invalid YAML](../Includes/images/databricks_cli_error_invalid_yaml.png)

In [0]:
%sh
databricks catalogs list

## B. Inspect Pre-Configured YAML Files

Our goal is to deploy our project to the **development**, **stage**, and **production** environments for our CI/CD pipeline. In this example, the project is a simple workflow that contains unit tests, a DLT pipeline, and a notebook visualization.

![Workflow](./images/06_Final_Workflow_Desc.png)


**NOTE:** This advanced-level course assumes familiarity with essential DevOps concepts such as code modularization, custom Python functions, unit testing with pytest, and integration tests with DLT. For more information on these topics, you can review the Databricks course **DevOps Essentials for Data Engineering**. We will briefly explore each of these here, but will not spend much time on the fundamentals. The focus of this course is deployment with Databricks Asset Bundles.




Let's explore our project folder called **Full Project**. This folder contains all of our Databricks resources to deploy.

You will find the following in the root folder:

  - **src**
  - **resources**
  - **databricks.yml**
  - **tests**

1. In a new tab, open **databricks.yml**.

    - It begins by defining the bundle name under the **bundle** mapping.
    
    - Then it defines the resources to include under the **include** mapping. All the YAML configuration files are in the **resources** folder.

    - Under the **targets** top-level mapping, you will see three defined targets and a variety of configuration specifics for each: development, stage, and production.
        - All three targets have a `root_path`.
        - All three targets have specific configurations.
        - The target environments **stage** and **production** will have additional variables that we need to configure, such as `target_catalog` and `raw_data_path` to specify the correct data.


2. In the new tab open **resources**. 

    - Here, you will find two folders: **job** and **pipeline**.

    - Click on the YAML file named **variables.yml**. You will find pre-defined variables for the resources that are created when the bundle is deployed. These include things like the job name, notebook paths, and parameters to pass to the notebooks.

    - The **health_etl_pipeline.pipeline.yml** file (located in the **pipeline** folder) describes the DLT pipeline configuration.

    - The **dabs_workflow.job.yml** file (located in the **job** folder) describes the tasks that will be created. Notice that there are three tasks: **Unit_Tests**, **Visualization**, and **Health_ETL**. Although **Health_ETL** is listed after **Visualization**, it depends on **Unit_Tests**. The order in which the tasks are listed doesn't matter, because the `depends_on` key determines the execution order.

3. In the new tab navigate back to **src** in the root folder. This folder contains other folders and notebooks that are called from the YAML files you inspected in the previous steps. These notebooks will be chained together as part of the workflow we will deploy below.

    - **dlt_pipelines**: This folder contains two DLT notebooks (**gold_tables_dlt** and **ingests-bronze-silver_dlt**). You can inspect these notebooks to understand their role in the **Health_ETL** workflow.
    
    - **Final Visualization**: This notebook will be the final task in our workflow. It creates a stacked bar chart of cholesterol distribution by age group.

    - **helpers**: Contains a .py file with custom Python functions that transform the data in the DLT pipeline.
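
The point about task ordering in **dabs_workflow.job.yml** can be illustrated with a small Python sketch. It is not how Databricks schedules jobs internally; it only shows that an execution order can be derived from `depends_on` alone, regardless of listing order. The dependency of **Visualization** on **Health_ETL** is an assumption here, based on the visualization being the final task in the workflow.

```python
from graphlib import TopologicalSorter

# Tasks in the order they are listed in the YAML file (illustrative only).
# Assumption: Visualization depends on Health_ETL, since it is the final task.
tasks = [
    {"task_key": "Unit_Tests", "depends_on": []},
    {"task_key": "Visualization", "depends_on": ["Health_ETL"]},
    {"task_key": "Health_ETL", "depends_on": ["Unit_Tests"]},
]


def execution_order(task_list):
    """Return task keys in an order that respects every depends_on entry."""
    # Map each task to the set of tasks that must finish before it starts.
    graph = {t["task_key"]: set(t["depends_on"]) for t in task_list}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(tasks)
print(order)
```

Reversing the listing order of `tasks` yields the same result, because only the dependency graph is consulted.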

## C. Explore and Update YAML Configuration Files
We will update our YAML files to better understand how to point to the assets and variables needed to configure the bundle before validation using the Databricks CLI.

### C1: Explore the databricks.yml Configuration

Recall that to use a variable called `my_variable` in a bundle, refer to it using `${var.my_variable}`.
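
To make the `${var.<name>}` reference syntax concrete, here is a minimal Python sketch of the substitution idea. It is not the Databricks CLI's implementation; the function name `resolve_vars` and the sample catalog value are hypothetical, and the pattern only handles simple variable names.

```python
import re

# Matches references of the form ${var.some_name}.
VAR_PATTERN = re.compile(r"\$\{var\.([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_vars(text, variables):
    """Replace every ${var.name} in text with its value from variables."""

    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"Undefined bundle variable: {name}")
        return str(variables[name])

    return VAR_PATTERN.sub(lookup, text)


# Hypothetical value for illustration; your lab catalog name will differ.
variables = {"catalog_stage": "example_stage_catalog"}
resolved = resolve_vars("/Volumes/${var.catalog_stage}/default/health", variables)
print(resolved)
```

References to variables that are not defined raise an error in this sketch, so a typo in a variable name fails early rather than producing a broken path.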

#### Instructions

1. Navigate to the folder named **Full Project**.
   
2. Click on the file **databricks.yml** and explore the bundle configuration.

3. Locate the mapping **targets**. 
   - Each target is a unique collection of artifacts, Databricks workspace settings, and Databricks job or pipeline details.
   - The targets mapping consists of one or more target mappings, which must each have a unique programmatic (or logical) name.

4. Locate the **development** target and examine the configuration. Notice the following:
   - The value for `default` is set to `true`.
   - The value for `existing_cluster_id` uses the variable `cluster_id`.
   - The **tasks** are set to use our lab compute cluster.

5. Locate the **stage** target and examine the configuration. Notice the following:
   - The `target_catalog` variable uses the variable `catalog_stage`.
   - The `raw_data_path` variable uses the volume `health` in `catalog_stage`.
   - The **tasks** are set to use our lab compute cluster.

6. Locate the **production** target and examine the configuration. Notice the following:
   - The `target_catalog` variable uses the variable `catalog_prod`.
   - The `raw_data_path` variable uses the volume `health` in `catalog_prod`.
   - No compute cluster is specified for the job, so it will run on serverless compute in production.


In [0]:
print(f'Your user name: {DA.catalog_name}')

### C2: Update **variables.yml**

Next, we will update the file **variables.yml**.

#### Instructions

1. Navigate to the **resources** folder.

2. Click on the file **variables.yml**.

3. Fill in the following details for the variables:

   - **TO DO**: `username`: Add your username here. Your username can be found in the cell above for your lab.
      - Use `${workspace.current_user.short_name}`

   - **TO DO**: `my_email`: Enter your email address here. This will be used to send notifications.
      - Use your email address

   - **TO DO**: `cluster_id`: 
      - You can use the `lookup` function with your username to obtain the cluster ID value.
      - Paste the value from cell 13 for the lookup cluster id variable

---

**NOTES:**
1. The file **variables_solution.yml** contains an example solution in case you need help.

2. In the Vocareum environment, all catalogs, clusters, and usernames match and have no spaces by default. If using this method outside the Vocareum environment, be cautious.

## D. Visualizing the Bundle's Assets

Here we will look at how to manually update our YAML files to help get acquainted with the setup. Since we are bringing in a pre-configured bundle, it's worth looking at the structure of files we'll be interacting with. Below is a diagram representing how the variables for the development catalog will be used. 

![Full DLT Pipeline](./images/06_img1.png)

## E. Notebook Execution

Now that we are familiar with the various folders and files that make up our bundle, let's confirm that the Databricks CLI is installed and authenticated.


### Databricks CLI Authentication

Usually, you would need to set up authentication for the CLI. However, in this training environment, that has already been taken care of for you.

Verify the credentials by executing the following cell, which displays the contents of the `/Users` workspace directory.

In [0]:
%sh databricks workspace list /Users

### E1. Development Bundle

Here is what the configuration of our target mapping for development looks like in the databricks.yml file. 
```YAML
targets:

  development:
    mode: development
    default: true
    # In Development, we will use classic compute for our tasks 
    resources:
      jobs:
        health_etl_workflow:    
          name: health_etl_workflow_${bundle.target} 
          tasks:
            - task_key: Unit_Tests
              existing_cluster_id: ${var.cluster_id}
            - task_key: Visualization
              existing_cluster_id: ${var.cluster_id}
    workspace:
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
...
```

**NOTE:** Recall that with **development** and **stage** target environments we are using a mix of serverless and classic compute at the task level.

To validate the bundle, run the following cell. This uses all the default values from **variables.yml** (see diagram above).

In [0]:
%sh 
cd "Full Project" 
pwd;
databricks bundle validate -t development

After the development target validates, deploy the bundle to the development environment.

In [0]:
%sh
cd "Full Project" 
databricks bundle deploy -t development

To run the job using the Databricks CLI, run the following cell. The argument `health_etl_workflow` is the key of the job resource defined in the bundle, not its display name. Note that the job will appear as `[dev <username>] health_etl_workflow_<target>` within **Workflows**.

This makes sense when you refer back to the structure of the **dabs_workflow.job.yml** file located in **resources**: 
```YAML
resources:
  jobs:
    health_etl_workflow: # <----- Name of job to run
      name: health_etl_workflow_${bundle.target}
      description: Final Workflow SDK
```
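
To make the relationship between the job's resource key and its displayed name concrete, here is a minimal Python sketch. It assumes that targets with `mode: development` receive the `[dev <username>]` prefix described above and that other modes receive none; the function name `display_name` and the sample username are hypothetical.

```python
def display_name(name_template, target, mode, short_name):
    """Approximate the job name shown in Workflows (illustrative only)."""
    # Fill in the ${bundle.target} substitution used in the job's name.
    name = name_template.replace("${bundle.target}", target)
    # Assumption: only targets in development mode get the [dev ...] prefix.
    if mode == "development":
        return f"[dev {short_name}] {name}"
    return name


template = "health_etl_workflow_${bundle.target}"
dev_name = display_name(template, "development", "development", "labuser123")
prod_name = display_name(template, "production", "production", "labuser123")
print(dev_name)
print(prod_name)
```

Whatever the displayed name, `databricks bundle run` is given the resource key `health_etl_workflow`.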

In [0]:
%sh
cd "Full Project" 
databricks bundle run health_etl_workflow

#### Summary - Development
While the job is running, examine the tasks when using the **development** target. Note the following:
- Unit tests passed
- DLT pipeline ETL and integration tests passed on a small sample of 7,500 rows of dev data.
- Visualization was created using the small sample of 7,500 rows of dev data.

### E2. Staging Bundle

Here is what the configuration of our target mapping for stage looks like in the databricks.yml file.
```YAML
  ...

  stage:
    mode: development
      # In stage, we will use classic compute for our tasks 
    resources:
      jobs:
        health_etl_workflow:   
          name: health_etl_workflow_${bundle.target}  
          tasks:
            - task_key: Unit_Tests
              existing_cluster_id: ${var.cluster_id}
            - task_key: Visualization
              existing_cluster_id: ${var.cluster_id}
    workspace:
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
    variables:
      target_catalog: ${var.catalog_stage}
      raw_data_path: /Volumes/${var.catalog_stage}/default/health
```

<br></br>

Imagine you've reviewed your code, analyzed coverage, etc., and are ready to deploy and test in a staging environment. With DABs, this only requires adjusting a few parameter values. Run the following cells to validate, deploy, and run with `stage` as the target.

<br></br>

In this example, since `target_catalog` and `raw_data_path` have default values, we can override them when deploying to other targets like stage within the **targets** mapping. This specifies reading data from the staging catalog.


#### BONUS
You can also override variable values through the Databricks CLI for DABs. For example, you can run `databricks bundle validate --var="target_catalog=<username>_2_stage" -t stage`. Keep in mind this will just reproduce the same job you went through a moment ago.
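
The layering of variable values can be sketched in Python as a series of dictionary merges. This is an illustration, not the CLI's code: it assumes that a `--var` value takes precedence over a target-level `variables` entry, which in turn takes precedence over the default, and it leaves out other sources such as environment variables. All catalog names below are hypothetical placeholders.

```python
# Hypothetical defaults, standing in for the default values in variables.yml.
defaults = {
    "target_catalog": "example_dev_catalog",
    "raw_data_path": "/Volumes/example_dev_catalog/default/health",
}

# Overrides declared under a target's `variables` mapping (stage in this sketch).
stage_overrides = {
    "target_catalog": "example_stage_catalog",
    "raw_data_path": "/Volumes/example_stage_catalog/default/health",
}

# Override passed on the command line with --var="target_catalog=...".
cli_overrides = {"target_catalog": "example_cli_catalog"}


def effective_variables(defaults, target_overrides=None, cli_overrides=None):
    """Merge variable sources; later sources win over earlier ones."""
    merged = dict(defaults)
    merged.update(target_overrides or {})
    merged.update(cli_overrides or {})
    return merged


dev_vars = effective_variables(defaults)
stage_vars = effective_variables(defaults, stage_overrides)
cli_vars = effective_variables(defaults, stage_overrides, cli_overrides)
print(stage_vars["target_catalog"], cli_vars["target_catalog"])
```

In this sketch the command-line override changes only `target_catalog`; `raw_data_path` keeps the value set by the target.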

In [0]:
%sh
cd "Full Project" 
databricks bundle validate -t stage

In [0]:
%sh
cd "Full Project" 
databricks bundle deploy -t stage

In [0]:
%sh
cd "Full Project" 
databricks bundle run health_etl_workflow -t stage

#### Summary - Stage
While the job is running, examine the tasks when using the **stage** target. Note the following:
- Unit tests passed
- DLT pipeline ETL and integration tests passed on a sample of 35,000 rows of stage data.
- Visualization was created using the sample of 35,000 rows of stage data.

### E3. Production
Here is what the configuration of our target mapping for production looks like in the databricks.yml file.
```YAML
  production:
    mode: production
    workspace:
      # host: can change host if isolating by workspace
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
    variables:
      target_catalog: ${var.catalog_prod}
      raw_data_path: /Volumes/${var.catalog_prod}/default/health
```

Here, we'll repeat the same bash commands using `%sh`. However, note that all production compute will run on serverless instead of classic compute, as we're not overriding the default compute.

You can verify this by deploying the job and inspecting the tasks.


In [0]:
%sh
cd "Full Project" 
databricks bundle validate -t production

In [0]:
%sh
cd "Full Project" 
databricks bundle deploy -t production

In [0]:
%sh
cd "Full Project" 
databricks bundle run health_etl_workflow -t production
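
The validate, deploy, and run sequence repeated above for each target could also be scripted, for example as a step in a CI/CD pipeline. The following Python sketch is one possible shape, not part of the course materials: the function names `bundle_commands` and `promote` are hypothetical, and it defaults to a dry run that only prints the commands, so it can be tried without touching any environment.

```python
import subprocess

TARGETS = ["development", "stage", "production"]
JOB_KEY = "health_etl_workflow"


def bundle_commands(target, job_key=JOB_KEY):
    """Build the three CLI invocations used in this lesson for one target."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", job_key, "-t", target],
    ]


def promote(targets=TARGETS, project_dir="Full Project", dry_run=True):
    """Run (or, in a dry run, print) the commands for each target in order."""
    for target in targets:
        for cmd in bundle_commands(target):
            if dry_run:
                print(" ".join(cmd))
            else:
                # check=True stops the promotion at the first failing command.
                subprocess.run(cmd, cwd=project_dir, check=True)


promote()  # dry run: prints the nine commands without executing them
```

Because `check=True` raises on a non-zero exit code, a failed validation or test run in an earlier target would prevent deployment to later targets when `dry_run=False`.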

### E4. Destroy All Bundles

In [0]:
%sh
cd "Full Project";
databricks bundle destroy -t development --auto-approve;
databricks bundle destroy -t stage --auto-approve;
databricks bundle destroy -t production --auto-approve;

# Next Steps

![ci_cd](./images/ci_cd_overview.png)

Think about how you can use DABs to accelerate development by designing programmatic management of your workflows. With DABs you can create, manage, and deploy your different assets and artifacts in a consistent and repeatable manner for CI/CD workflows. You can continue to learn about DABs by completing the next lab, where you will have a chance to practice attaching a machine learning model to your workflow as a part of the Dev workflow.


&copy; 2025 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>