
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# 2 - Developing a Simple Pipeline

In this demonstration, we will create a simple Lakeflow Declarative Pipeline project using the new **ETL Pipeline multi-file editor** with declarative SQL.


### Learning Objectives

By the end of this lesson, you will be able to:
- Describe the SQL syntax used to create a Lakeflow Declarative Pipeline.
- Navigate the Lakeflow Declarative Pipeline ETL Pipeline multi-file editor to modify pipeline settings and ingest the raw data source file(s).
- Create, execute and monitor a Lakeflow Declarative Pipeline.

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

    - In the drop-down, select **More**.

    - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup

Run the following cell to configure your working environment for this course.

This cell also resets your `/Volumes/dbacademy/ops/labuser/` volume to its starting point, with one JSON file in each directory (**customers**, **orders**, and **status**).

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically create and reference the information needed to run the course.

In [0]:
%run ./Includes/Classroom-Setup-2

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Schema labuser11058730_1754017152.1_bronze_db already exists. No action taken.
Schema labuser11058730_1754017152.2_silver_db already exists. No action taken.
Schema labuser11058730_1754017152.3_gold_db already exists. No action taken.
----------------------------------------------------------------------------------------
Directory /Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com/customers already exists. No action taken.
Directory /Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com/orders already exists. No action taken.
Directory /Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com/status already exists. No action taken.
----------------------------------------------------------------------------------------


Searching for files in /Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com/customers/ volume to delete prior to creating files...
Deleting file: /Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com/customers/00.json

Searc

Schemas are available, lab check passed: ['1_bronze_db', '2_silver_db', '3_gold_db'].


0,1
Your catalog name variable reference: DA.catalog_name:,
"Variable reference to your source files (Python - DA.paths.working_dir, SQL - DA.paths_working_dir):",


## B. Developing and Running a Lakeflow Declarative Pipeline with the ETL Pipeline Multi-File Editor

This course includes a simple, pre-configured Lakeflow Declarative Pipeline to explore and modify. In this section, we will:

- Explore the ETL Pipeline multi-file editor and the declarative SQL syntax  
- Modify pipeline settings  
- Run the Lakeflow Declarative Pipeline and explore the streaming tables and materialized view.

1. Run the cell below and **copy the path** to your **dbacademy.ops.labuser** volume from the cell output. You will need this path when modifying your pipeline settings. 

   This volume path contains the **orders**, **status**, and **customers** directories, which hold the raw JSON files.

   **EXAMPLE PATH**: `/Volumes/dbacademy/ops/labuser1234_5678@vocareum_com`

In [0]:
%python
print(DA.paths.working_dir)

/Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com
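
The printed path follows the Unity Catalog volume layout `/Volumes/<catalog>/<schema>/<volume>/...`, which is why this notebook refers to the volume as **dbacademy.ops.labuser**. The sketch below illustrates that mapping on a path shaped like the one above. `split_volume_path` and `VolumePath` are hypothetical names introduced here for illustration; they are not part of the course files or of any Databricks library.

```python
from typing import NamedTuple


class VolumePath(NamedTuple):
    catalog: str
    schema: str
    volume: str
    relative_path: str  # path inside the volume; "" means the volume root


def split_volume_path(path: str) -> VolumePath:
    """Split '/Volumes/<catalog>/<schema>/<volume>/<rest>' into its parts.

    Hypothetical helper for illustration only.
    """
    parts = path.strip("/").split("/")
    if len(parts) < 4 or parts[0] != "Volumes":
        raise ValueError(f"Not a volume path: {path!r}")
    catalog, schema, volume = parts[1:4]
    return VolumePath(catalog, schema, volume, "/".join(parts[4:]))


# Example path in the same shape as the output above (the user ID is made up).
example = split_volume_path(
    "/Volumes/dbacademy/ops/labuser1234_5678@vocareum_com/orders"
)
print(example)
```

In this example the first three components identify the volume, and `orders` is a directory inside it.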


2. This course provides starter files for your pipeline. This demonstration uses the folder **2 - Developing a Simple Pipeline Project**. To create a pipeline and associate it with code files that already exist in your Workspace (including Git folders), complete the following:

   a. In the left navigation bar, select the **Folder** ![Folder Icon](./Includes/images/folder_icon.png) icon to open the Workspace navigation.

   b. Navigate to the **Build Data Pipelines with Lakeflow Declarative Pipelines** folder (you may already be there).

   c. **(PLEASE READ)** For ease of use, open this same notebook in a separate tab:

    - Right-click the notebook in the left navigation.

    - Select **Open in a New Tab**.

   d. In the new tab, click the **three-dot (ellipsis) icon** ![Ellipsis Icon](./Includes/images/ellipsis_icon.png) in the folder navigation bar.

   e. Select **Create** → **ETL Pipeline**.

   f. Complete the pipeline creation page with the following:

    - **Name**: `Name-your-pipeline-using-this-notebook-name-add-your-first-name` 
    - **Default catalog**: Select your **labuser** catalog  
    - **Default schema**: Select your **default** schema (database)

   g. Select **Add existing assets**. In the popup, complete the following:

    - **Pipeline root folder**: Select the **2 - Developing a Simple DLT Pipeline Project** folder (`/Workspace/Users/your-lab-user-name/build-data-pipelines-with-lakeflow-declarative-pipelines-3.0.0/Build Data Pipelines with Lakeflow Declarative Pipelines/2 - Developing a Simple DLT Pipeline Project`)

    - **Source code paths**: Within the same root folder as above, select the **orders** folder (`/Workspace/Users/your-lab-user-name/build-data-pipelines-with-lakeflow-declarative-pipelines-3.0.0/Build Data Pipelines with Lakeflow Declarative Pipelines/2 - Developing a Simple DLT Pipeline Project/orders`)

   h. Click **Add**. This creates a pipeline and associates the correct files for this demonstration.

**Example**

![Example Demo 2](./Includes/images/demo02_existing_assets.png)

3. In the new window, select the **orders_pipeline.sql** file and follow the instructions in the file. Keep this notebook open, as you will use it later.

![Orders File Directions](./Includes/images/demo02_select_orders_sql_file.png)

## C. Add a New File to Cloud Storage

1. After exploring and executing the pipeline by following the instructions in the **`orders_pipeline.sql`** file, run the cell below to add a new JSON file (**01.json**) to your volume at:  `/Volumes/dbacademy/ops/labuser-your-id/orders`.

   **NOTE:** If you receive the error `name 'DA' is not defined`, you will need to rerun the classroom setup script at the top of this notebook to create the `DA` object. This is required to correctly reference the path and successfully copy the file.

In [0]:
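%python
# Copy the first two files (00.json and 01.json) from the course source volume
# into your orders directory. Files that are already present are skipped.
copy_files('/Volumes/dbacademy_retail/v01/retail-pipeline/orders/stream_json', 
           f'{DA.paths.working_dir}/orders', 
           n=2)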


----------------Loading files to user's volume: '/Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com/orders'----------------
File number 1 - 00.json is already in the source volume "/Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com/orders". Skipping file.
File number 2 - Copying file /Volumes/dbacademy_retail/v01/retail-pipeline/orders/stream_json/01.json --> /Volumes/dbacademy/ops/labuser11058730_1754017152@vocareum_com/orders/01.json.
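
The output above suggests that `copy_files` takes the first `n` files from the source directory and skips any that already exist in the target. The course's implementation is not shown in this notebook, so the following is only a rough sketch of that behavior using the standard library and temporary local directories. `copy_first_n_files` is a hypothetical name, and the sorted-by-name ordering is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def copy_first_n_files(source_dir: str, target_dir: str, n: int) -> list:
    """Copy the first `n` files (sorted by name) from source_dir to target_dir.

    Files that already exist in target_dir are skipped. Returns the names of
    the files copied by this call. Hypothetical stand-in for `copy_files`.
    """
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    candidates = sorted(p for p in source.iterdir() if p.is_file())[:n]
    for number, src in enumerate(candidates, start=1):
        dst = target / src.name
        if dst.exists():
            print(f"File number {number} - {src.name} already exists. Skipping file.")
            continue
        shutil.copy2(src, dst)
        print(f"File number {number} - Copying file {src} --> {dst}.")
        copied.append(src.name)
    return copied


# Demo with throwaway directories instead of real volumes.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "stream_json"
    target = Path(tmp) / "orders"
    source.mkdir()
    for name in ("00.json", "01.json", "02.json"):
        (source / name).write_text("{}")

    first_run = copy_first_n_files(str(source), str(target), n=1)
    second_run = copy_first_n_files(str(source), str(target), n=2)
    final_files = sorted(p.name for p in target.iterdir())
```

In the demo, the second call copies only **01.json** because **00.json** is already in the target, which mirrors the skip message in the output above.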


2. Complete the following steps to view the new file in your volume:

   a. Select the **Catalog** icon ![Catalog Icon](./Includes/images/catalog_icon.png) from the left navigation pane.  
   
   b. Expand your **dbacademy.ops.labuser** volume.  
   
   c. Expand the **orders** directory. You should see two files in your volume: **00.json** and **01.json**.

3. Run the cell below to view the data in the new **/orders/01.json** file. Notice the following:

   - The **01.json** file contains new orders.  
   - The **01.json** file has 25 rows.


In [0]:
%python
spark.sql(f'''
  SELECT *
  FROM json.`{DA.paths.working_dir}/orders/01.json`
''').display()

customer_id,notifications,order_id,order_timestamp
23180,Y,75297,1640996822
23082,Y,75298,1641000470
23550,Y,75299,1641000707
23362,Y,75300,1641002550
23210,N,75301,1641003380
23489,N,75302,1641004093
23328,Y,75303,1641004638
23954,Y,75304,1641009857
23648,Y,75305,1641011274
23310,Y,75306,1641015380
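
The `order_timestamp` values above look like Unix epoch seconds. That is an assumption, since the pipeline's SQL is not reproduced in this notebook. Under that assumption, the sketch below converts the ten sample values shown above to UTC dates and counts orders per date, which is the kind of aggregation the name **orders_by_date_gold_demo2** suggests. The helper name `to_utc_date` is hypothetical.

```python
from collections import Counter
from datetime import date, datetime, timezone

# order_timestamp values copied from the sample output above.
sample_order_timestamps = [
    1640996822, 1641000470, 1641000707, 1641002550, 1641003380,
    1641004093, 1641004638, 1641009857, 1641011274, 1641015380,
]


def to_utc_date(epoch_seconds: int) -> date:
    """Interpret a value as Unix epoch seconds and return its UTC calendar date."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()


# Count how many sample orders fall on each UTC date.
orders_by_date = Counter(to_utc_date(ts) for ts in sample_order_timestamps)

for order_date, total in sorted(orders_by_date.items()):
    print(order_date.isoformat(), total)  # prints: 2022-01-01 10
```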


4. Go back to the **orders_pipeline.sql** file and select **Run Pipeline** to execute your ETL pipeline again with the new file (Step 13).  

   Watch the pipeline run and notice that only 25 rows are added to the bronze and silver tables. 
   
   This happens because the pipeline has already processed the first **00.json** file (174 rows), and it is now only reading the new **01.json** file (25 rows), appending the rows to the streaming tables, and recomputing the materialized view with the latest data.
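
The incremental behavior described above can be modeled with a toy example in plain Python. This is not how Lakeflow Declarative Pipelines are implemented. It only illustrates the idea that each update remembers which files it has already processed and appends rows from new files only. `IncrementalFileReader` and the placeholder rows are hypothetical, and only the row counts (174 and 25) come from this lesson.

```python
from typing import Dict, List


class IncrementalFileReader:
    """Toy model of incremental ingestion: each file is processed at most once."""

    def __init__(self) -> None:
        self.processed_files: set = set()  # stands in for streaming checkpoint state
        self.table: List[dict] = []        # stands in for the append-only bronze table

    def run_update(self, landing_zone: Dict[str, List[dict]]) -> int:
        """Append rows from files not seen before; return the rows added by this update."""
        added = 0
        for file_name in sorted(landing_zone):
            if file_name in self.processed_files:
                continue  # already ingested in an earlier update
            for row in landing_zone[file_name]:
                self.table.append({**row, "source_file": file_name})
                added += 1
            self.processed_files.add(file_name)
        return added


# First update: only 00.json exists.
landing_zone = {"00.json": [{"order_id": i} for i in range(174)]}
reader = IncrementalFileReader()
first_update = reader.run_update(landing_zone)

# Second update: 01.json has arrived; 00.json is not read again.
landing_zone["01.json"] = [{"order_id": 174 + i} for i in range(25)]
second_update = reader.run_update(landing_zone)

print(first_update, second_update, len(reader.table))  # prints: 174 25 199
```

A materialized view, by contrast, is recomputed so that it always reflects the full contents of its source tables after each update.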

## D. Exploring Your Streaming Tables

1. View the new streaming tables and materialized view in your catalog. Complete the following:

   a. Select the catalog icon ![Catalog Icon](./Includes/images/catalog_icon.png) in the left navigation pane.

   b. Expand your **labuser** catalog.

   c. Expand the schemas **1_bronze_db**, **2_silver_db**, and **3_gold_db**. Notice that the two streaming tables and materialized view are correctly placed in your schemas.

      - **labuser.1_bronze_db.orders_bronze_demo2**

      - **labuser.2_silver_db.orders_silver_demo2**

      - **labuser.3_gold_db.orders_by_date_gold_demo2**

2. Run the cell below to view the data in the **labuser.1_bronze_db.orders_bronze_demo2** table. Before you run the cell, how many rows should this streaming table have?

   Notice the following:
      - The table contains 199 rows (**00.json** had 174 rows, and **01.json** had 25 rows).
      - In the **source_file** column you can see the exact file the rows were ingested from.
      - In the **processing_time** column you can see the exact time the rows were ingested.

In [0]:
SELECT *
FROM 1_bronze_db.orders_bronze_demo2;

customer_id,notifications,order_id,order_timestamp,_rescued_data,processing_time,source_file
23094,Y,75123,1640392092,,2025-08-01T05:32:46.26Z,00.json
23457,N,75124,1640392500,,2025-08-01T05:32:46.26Z,00.json
23564,Y,75125,1640394862,,2025-08-01T05:32:46.26Z,00.json
23392,N,75126,1640396067,,2025-08-01T05:32:46.26Z,00.json
23101,Y,75127,1640399066,,2025-08-01T05:32:46.26Z,00.json
23466,N,75128,1640404853,,2025-08-01T05:32:46.26Z,00.json
23834,Y,75129,1640407272,,2025-08-01T05:32:46.26Z,00.json
23852,Y,75130,1640419989,,2025-08-01T05:32:46.26Z,00.json
23483,Y,75131,1640422131,,2025-08-01T05:32:46.26Z,00.json
23821,N,75132,1640423697,,2025-08-01T05:32:46.26Z,00.json


3. Complete the following steps to view the history of the **orders_bronze_demo2** streaming table:

   a. Select the **Catalog** icon ![Catalog Icon](./Includes/images/catalog_icon.png) in the left navigation pane.  
   
   b. Expand the **labuser.1_bronze_db** schema.  
   
   c. Click the three-dot (ellipsis) icon next to the **orders_bronze_demo2** table.  
   
   d. Select **Open in Catalog Explorer**.  
   
   e. In the Catalog Explorer, select the **History** tab. Notice an error is returned because viewing the history of a streaming table requires **SHARED_COMPUTE**. In our labs we use a **DEDICATED (formerly single user)** cluster.

   f. Above your catalogs on the left, select your compute cluster and change it to the provided **shared_warehouse**.

   ![Change Compute](./Includes/images/change_compute.png)  
   
   g. Go back and look at the last two versions of the table. Notice the following:  
   
      - In the **Operation** column, the last two updates were **STREAMING UPDATE**.  
      
      - Expand the **Operation Parameters** values for the last two updates. Notice both use `"outputMode": "Append"`.  
      
      - Find the **Operation Metrics** column. Expand the values for the last two updates. Observe the following:
      
         - It displays various metrics for the streaming update: **numRemovedFiles, numOutputRows, numOutputBytes, and numAddedFiles**.  
         
         - In the `numOutputRows` values, 174 rows were added in the first update, and 25 rows in the second.
   
   h. Close the Catalog Explorer.

## E. Viewing Lakeflow Declarative Pipelines with the Pipelines UI

After exploring and creating your pipeline using the **orders_pipeline.sql** file in the steps above, you can view the pipeline(s) you created in your workspace via the **Pipelines** UI.

1. Complete the following steps to view the pipeline you created:

   a. In the main applications navigation pane on the far left (you may need to expand it by selecting the ![Expand Navigation Pane](./Includes/images/expand_main_navigation.png) icon at the top left of your workspace), right-click **Pipelines** and select **Open Link in a New Tab**.

   b. This should take you to the pipelines you have created. You should see your **2 - Developing a Simple Pipeline Project - labuser** pipeline.

   c. Select your **2 - Developing a Simple Pipeline Project - labuser** pipeline. Here, you can use the UI to modify the pipeline.

   d. Select the **Settings** button at the top. This will take you to the settings within the UI.

   e. Select **Schedule** to view the scheduling options, then select **Cancel**. We will learn how to schedule the pipeline later.

   f. Under your pipeline name, select the drop-down that shows the date and time stamp. Here you can view the **Pipeline graph** and other metrics for each run of the pipeline.

   g. Close the pipeline UI tab you opened.

## Additional Resources

- [Lakeflow Declarative Pipelines](https://docs.databricks.com/aws/en/dlt/) documentation.


&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="blank">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy" target="blank">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use" target="blank">Terms of Use</a> | 
<a href="https://help.databricks.com/" target="blank">Support</a>