# Get Started with Databricks

This lab provides a comprehensive overview of Databricks' modern approach to data warehousing, highlighting how a data lakehouse architecture combines the strengths of traditional data warehouses with the flexibility and scalability of the cloud. You’ll learn about the AI-driven features that enhance data transformation and analysis on the Databricks Data Intelligence Platform. Designed for data warehousing practitioners, this lab provides you with the foundational information needed to begin building and managing high-performance, AI-powered data warehouses on Databricks.

This lab is designed for those starting out in data warehousing and those who would like to execute data warehousing workloads on Databricks. Participants may also include data warehousing practitioners who are familiar with traditional data warehousing techniques and concepts and are looking to expand their understanding of how data warehousing workloads are executed on Databricks.

---



# Using Delta Lake Features with Databricks SQL
In this lab, we’ll explore the powerful features of Delta Lake and demonstrate how they enhance data management in a data warehousing context. We’ll start by creating and exploring Delta tables, then dive into key features like Time Travel and Version History.

**Learning Objectives**

By the end of this lab, you will be able to:
- Explore key features of Delta Lake such as **Time Travel**, **Version History**, and metadata management.
- Use SQL commands like `DESCRIBE EXTENDED`, `DESCRIBE HISTORY`, `VERSION AS OF`, and `RESTORE TABLE`.

Delta Lake is an open-source storage layer that brings reliability, performance, and ACID transactions to data lakes. In this lab, we will:
- Create and explore a Delta table.
- Simulate data changes and view metadata and history.
- Use Time Travel to query and restore data to previous states.

## Setup

**Create the `retail_sales` table** and view the data in the table
- We will use a pre-existing dataset file located at `/Volumes/databricks_simulated_retail_customer_data/v01/source_files/sales.csv`.

In [0]:
%sql
-- Create a Delta table from the CSV file
DROP TABLE IF EXISTS retail_sales;
CREATE TABLE IF NOT EXISTS retail_sales
USING DELTA
AS
SELECT *
FROM read_files(
  '/Volumes/databricks_simulated_retail_customer_data/v01/source_files/sales.csv',
  format => 'csv',
  header => true,
  inferSchema => true
);
-- Display Data of the retail_sales table
SELECT * FROM retail_sales;

## Explore the Table

Delta Lake tables store metadata that can be queried for insights about the table structure, state, and history.

### View Extended Metadata
Use the following command to view detailed metadata about the `retail_sales` table:

In [0]:
%sql
DESCRIBE EXTENDED retail_sales;

### Describe the History of the Table
Run the following command to view the history of `retail_sales`:

In [0]:
%sql
DESCRIBE HISTORY retail_sales;

As the `retail_sales` table has not been modified yet, its version history will only show **Version 0**, which represents the initial creation of the table. Further modifications to the table will create additional versions in the history log.

The history includes detailed metadata about the table's state and modifications:
- **Version**: Indicates the specific version of the table.
- **Timestamp**: Specifies when the operation occurred.
- **Operation**: Describes the type of operation that produced the version (e.g., `CREATE TABLE AS SELECT`, `UPDATE`, `DELETE`).
- **Operation Metrics**: Includes key metrics like the number of rows added, removed, or affected, and the total data size.
- **User Information**: Tracks which user performed the operation.
- **Cluster Information**: Logs the cluster ID where the operation was executed.

## Simulating Data Changes and Time Travel Queries

Delta Lake allows for simulating data modifications and querying historical versions of the data using Time Travel.

### Simulate Data Changes
We will simulate updates and deletions to demonstrate how Delta Lake tracks changes in the table's version history.

#### Update the Table
Use the following command to update the `retail_sales` table, updating the `product_name` column for all rows where the `product_category` is `Rony`.

In [0]:
%sql
UPDATE retail_sales
   SET product_name = 'Updated Items'
   WHERE product_category = 'Rony';

Before proceeding, let's obtain a timestamp representing this particular moment in time, which will be useful in a subsequent task. After executing, copy the resulting value to the clipboard for later use.

In [0]:
%sql
SELECT current_timestamp();

#### Delete Records
Use the following command to delete specific records from the `retail_sales` table, removing all records for a specific customer:

In [0]:
%sql
DELETE FROM retail_sales
WHERE customer_name = 'VASQUEZ,  YVONNE M';

### View the Table's History

To understand the number of versions currently in the table, run the following command:

In [0]:
%sql
DESCRIBE HISTORY retail_sales;

Delta Lake supports various operations that allow robust data management:
- **OPTIMIZE**: Optimizes the storage layout of the Delta table for better query performance. This operation compacts smaller files into larger ones, reducing the number of files scanned during queries.
- **UPDATE**: Modifies existing records in the table based on a condition.
- **DELETE**: Removes specific rows from the table based on a condition.

These operations enable efficient data management and ensure the table remains up-to-date with minimal manual intervention.

### Time Travel Queries

Delta Lake's Time Travel feature allows you to query data as it existed in previous versions or at specific timestamps.

#### Query a Specific Version
Retrieve data from an early version of the table using the following command:

In [0]:
%sql
SELECT * 
  FROM retail_sales
  VERSION AS OF 0;

#### Query by Timestamp
Retrieve data as it existed at a specific timestamp. Replace the text below with the timestamp copied earlier, uncomment the following lines, and run the cell.

In [0]:
%sql
-- SELECT * 
--   FROM retail_sales
--   TIMESTAMP AS OF 'PASTE TIMESTAMP HERE';
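
Instead of pasting the timestamp by hand, the value can be captured in a Python cell and interpolated into the query. The sketch below is one possible way to do that, not part of the lab itself. `build_time_travel_query` is a hypothetical helper, the timestamp is made up, and time zone handling is left out. The commented lines assume the `spark` session that Databricks notebooks predefine.

```python
from datetime import datetime


def build_time_travel_query(table_name: str, as_of: datetime) -> str:
    """Return a SELECT that reads `table_name` as it existed at `as_of`."""
    # Format the timestamp as a string literal for TIMESTAMP AS OF.
    ts_literal = as_of.strftime("%Y-%m-%d %H:%M:%S.%f")
    return f"SELECT * FROM {table_name} TIMESTAMP AS OF '{ts_literal}'"


# Example with a fixed, made-up timestamp.
query = build_time_travel_query("retail_sales", datetime(2025, 1, 15, 10, 30, 0))
print(query)

# In a Databricks notebook, the timestamp could instead be captured right
# after the UPDATE and reused later (assumes the predefined `spark` session):
#   captured = spark.sql("SELECT current_timestamp()").first()[0]
#   display(spark.sql(build_time_travel_query("retail_sales", captured)))
```

The timestamp must fall within the table's retained history; otherwise the time travel query will fail.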

## Restore Table to a Previous Version

Delta Lake provides a powerful feature to restore a table to a previous state. This is particularly useful in scenarios where data is accidentally modified or deleted.

### View Table History
Before restoring, check the table's history to identify the version you want to restore:

In [0]:
%sql
DESCRIBE HISTORY retail_sales;

This command displays the table's operation history, including timestamps and version numbers.


### Restore the Table
Use the following command to restore the `retail_sales` table to a specific version. For this lab, we’ll restore it to **Version 2**:

In [0]:
%sql
RESTORE TABLE retail_sales TO VERSION AS OF 2;

This command reverts the table to the specified version, undoing any changes made after that version.
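
A restore does not erase the later versions from the history. It is recorded as a new version whose contents match the version that was restored. The toy model below illustrates only that idea. It keeps a full in-memory snapshot per version, which is not how Delta Lake stores data, and `VersionedTable`, the rows, and the version numbers are all hypothetical and do not correspond to the lab's table.

```python
from copy import deepcopy


class VersionedTable:
    """Toy in-memory table that keeps a full snapshot per version."""

    def __init__(self, rows):
        # Version 0 is the initial creation of the table.
        self.history = [("CREATE", deepcopy(rows))]

    @property
    def current_version(self):
        return len(self.history) - 1

    def as_of(self, version):
        """Return the rows as they existed at `version` (time travel)."""
        return deepcopy(self.history[version][1])

    def commit(self, operation, rows):
        """Record a new version produced by `operation`."""
        self.history.append((operation, deepcopy(rows)))

    def restore(self, version):
        """Restore by appending a new version; older versions are kept."""
        self.commit("RESTORE", self.as_of(version))


table = VersionedTable([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
# Version 1: an update.
table.commit("UPDATE", [{"id": 1, "name": "A"}, {"id": 2, "name": "b"}])
# Version 2: a delete.
table.commit("DELETE", [{"id": 1, "name": "A"}])
# Version 3: restore the state from version 1, which undoes the delete.
table.restore(1)

print(table.current_version)
print([op for op, _ in table.history])
```

Because the restore is itself a version, it can be inspected in the history and undone with another restore.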

### Verify the Restoration
After restoring, query the table to confirm that it has been reverted to the expected state:

In [0]:
%sql
SELECT * FROM retail_sales;

The table will reflect the data as it existed in **Version 2**. This can also be validated by reviewing the history.


In [0]:
%sql
DESCRIBE HISTORY retail_sales;

Use this feature to safeguard data integrity and recover from unintended modifications.

## Conclusion
With powerful features such as **Time Travel**, **Version History**, and **metadata management**, Delta Lake provides robust solutions for querying historical data, tracking changes, and recovering from unintended modifications. These capabilities not only ensure data reliability and consistency but also enhance scalability and performance.

# Data Ingestion Techniques

This notebook demonstrates the practical application of various data ingestion techniques in the Databricks Lakehouse, including:
- **CREATE TABLE AS SELECT** (CTAS)
- **COPY INTO** for incremental data loading
- Using the **Databricks Upload UI**
- Automating real-time ingestion with **Auto Loader**
- An introduction to **Lakeflow Connect**

**Learning Objectives**

By the end of this notebook, you should be able to:
- Create and populate Delta tables using **CREATE TABLE AS SELECT (CTAS)**.
- Incrementally load data into Delta tables using **COPY INTO**.
- Perform manual data ingestion through the **Databricks Upload UI**.
- Set up and manage real-time data ingestion pipelines with **Auto Loader**.
- Describe how **Lakeflow Connect** supports automated data ingestion pipeline creation and management.

## Querying Files
In the cell below, we are going to run a query on a directory of parquet files. These files are not currently registered as any kind of data object (i.e., a table), but we can run some kinds of queries exactly as if they were. We can run these queries on many data file types, too (CSV, JSON, etc.).

Most workflows will require users to access data from external cloud storage locations. 

In most companies, a workspace administrator will be responsible for configuring access to these storage locations. In this course, we are simply going to use data files that were already set up as part of the lab environment.


In [0]:
%sql
SELECT * FROM parquet.`/Volumes/databricks_simulated_e_commerce_clickstream_data/v01/raw/sales-historical` LIMIT 10;

We can equivalently use the `read_files()` function to read from files. The syntax is more verbose, but it lets us pass parameters to the reader, which is often required.

In [0]:
%sql
SELECT * FROM
  read_files(
    '/Volumes/databricks_simulated_retail_customer_data/v01/source_files/sales.csv',
    format => 'csv',
    header => true,
    inferSchema => true
  ) LIMIT 10;

## Create Table as Select (CTAS)

We are going to create a table that contains historical sales data from a previous point-of-sale system. This data is in the form of parquet files.

**`CREATE TABLE AS SELECT`** statements create and populate Delta tables using data retrieved from an input query. We can create the table and populate it with data at the same time.

CTAS statements automatically infer schema information from query results and do **not** support manual schema declaration. 

This means that CTAS statements are useful for external data ingestion from sources with well-defined schema, such as Parquet files and tables.

In [0]:
%sql
-- Create or replace the table 'retail_sales_bronze' using Delta format
CREATE OR REPLACE TABLE retail_sales_bronze 
  USING DELTA AS
    SELECT * FROM parquet.`/Volumes/databricks_simulated_e_commerce_clickstream_data/v01/raw/sales-historical`;

-- Describe the structure of the 'retail_sales_bronze' table
DESCRIBE retail_sales_bronze;

By running `DESCRIBE <table-name>`, we can see column names and data types. We see that the schema of this table looks correct.

## COPY INTO for Incremental Loading
**`COPY INTO`** provides an idempotent option to incrementally ingest data from external sources.

Note that this operation does have some expectations:
- Data schema should be consistent
- Duplicate records should be excluded or handled downstream

This operation is potentially much cheaper than full table scans for data that grows predictably.

We want to capture new data but not re-ingest files that have already been read. We can use `COPY INTO` to perform this action. 

The first step is to create an empty table. We can then use COPY INTO to infer the schema of our existing data and copy data from new files that were added since the last time we ran `COPY INTO`.

In [0]:
%sql
DROP TABLE IF EXISTS users_bronze;
CREATE TABLE users_bronze USING DELTA;

**COPY INTO** loads data from data files into a Delta table. This is a retriable and idempotent operation, meaning that files in the source location that have already been loaded are skipped.

The cell below demonstrates how to use COPY INTO with a parquet source, specifying:
- The path to the data.
- The FILEFORMAT of the data, in this case, parquet.
- COPY_OPTIONS -- There are a number of key-value pairs that can be used. We are specifying that we want to merge the schema of the data.

In [0]:
%sql
COPY INTO users_bronze
  FROM '/Volumes/databricks_simulated_e_commerce_clickstream_data/v01/raw/users-30m'
  FILEFORMAT = parquet
  COPY_OPTIONS ('mergeSchema' = 'true');

## COPY INTO is Idempotent
COPY INTO keeps track of the files it has ingested previously. We can run it again, and no additional data is ingested because the files in the source directory haven't changed. Let's run the `COPY INTO` command again to show this. 

The total row count is the same as the `num_inserted_rows` value from the first run above because no new data was copied into the table.

In [0]:
%sql
COPY INTO users_bronze
  FROM '/Volumes/databricks_simulated_e_commerce_clickstream_data/v01/raw/users-30m'
  FILEFORMAT = parquet
  COPY_OPTIONS ('mergeSchema' = 'true');


SELECT count(*) FROM users_bronze;
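
The skip-what-was-already-loaded behavior can be pictured with a small toy model. This is only a sketch of the idea of idempotent loading, not a description of how `COPY INTO` is implemented. `ToyCopyInto` and the file names are hypothetical.

```python
class ToyCopyInto:
    """Toy model of an idempotent loader that remembers loaded files."""

    def __init__(self):
        self.loaded_files = set()
        self.rows = []

    def copy_into(self, source_files):
        """Load rows from files not seen before; return rows inserted."""
        inserted = 0
        for file_name in sorted(source_files):
            if file_name in self.loaded_files:
                continue  # already ingested, skip
            self.rows.extend(source_files[file_name])
            self.loaded_files.add(file_name)
            inserted += len(source_files[file_name])
        return inserted


source = {"part-0.parquet": [1, 2, 3], "part-1.parquet": [4, 5]}
target = ToyCopyInto()

first_run = target.copy_into(source)   # loads both files
second_run = target.copy_into(source)  # nothing new, nothing loaded
source["part-2.parquet"] = [6]
third_run = target.copy_into(source)   # loads only the new file

print(first_run, second_run, third_run, len(target.rows))
```

Tracking is per file in this model, which is why duplicate records that arrive in a new file still need to be handled separately.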

## Built-In Functions

Databricks has a vast [number of built-in functions](https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin.html) you can use in your code.

We are going to create a table for user data generated by the previous point-of-sale system, but we need to make some changes. 

The `user_first_touch_timestamp` column is in the wrong format. It currently holds a timestamp in microseconds, so we divide it by 1e6 (1 million) to get seconds. We then use `CAST` to cast the result to a [TIMESTAMP](https://docs.databricks.com/en/sql/language-manual/data-types/timestamp-type.html), and then `CAST` again to [DATE](https://docs.databricks.com/en/sql/language-manual/data-types/date-type.html).

The syntax for `CAST` is `CAST(column AS data_type)`. To use `CAST` with `COPY INTO`, we need to use a `SELECT` clause (make sure you include the parentheses) after the word `FROM` (in the `COPY INTO`).

Our **`SELECT`** clause also uses two built-in Spark SQL features that are useful for file ingestion:
* **`current_timestamp()`** records the timestamp when the logic is executed
* **`_metadata.file_name`** records the source data file for each record in the table
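
To make the microsecond conversion concrete, here is the same arithmetic in plain Python. It is a sketch under one assumption: the date is computed in UTC, whereas the result of the SQL cast depends on the session time zone. `micros_to_date` and the sample value are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

# Reference point for epoch-based timestamps.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_date(micros: int) -> date:
    """Convert microseconds since the Unix epoch to a calendar date (UTC)."""
    return (EPOCH + timedelta(microseconds=micros)).date()


# A made-up value in microseconds, i.e. 1,600,000,000 seconds after the epoch.
print(micros_to_date(1_600_000_000_000_000))
```

Skipping the division by one million would treat the value as seconds and produce a date far in the future.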


In [0]:
%sql
DROP TABLE IF EXISTS users_bronze;
CREATE TABLE users_bronze;
COPY INTO users_bronze FROM
  (SELECT *, 
    cast(cast(user_first_touch_timestamp/1e6 AS TIMESTAMP) AS DATE) first_touch_date, 
    current_timestamp() updated,
    _metadata.file_name source_file
  FROM '/Volumes/databricks_simulated_e_commerce_clickstream_data/v01/raw/users-historical/')
  FILEFORMAT = PARQUET
  COPY_OPTIONS ('mergeSchema' = 'true');

SELECT * FROM users_bronze LIMIT 10;

## Upload UI
The add data UI allows you to manually load data into Databricks from a variety of sources.

- Download a data file. For the purposes of this exercise, you may download the **sales.csv** file by following [this link](/ajax-api/2.0/fs/files/Volumes/databricks_simulated_retail_customer_data/v01/source_files/sales.csv). This will download the CSV file to your browser's download folder.
- Upload the data file to create a table. In the [Catalog Explorer](/explore/data/workspace/default) (also available from the left sidebar), do the following:
   - In the **workspace** catalog, navigate to the **default** schema. 
   - Select **Create > Create table** from the top-right corner.
   - Drop the **sales.csv** you just downloaded into the drop zone (or use the file navigator to find the file in your downloads folder).
- Complete the following steps to create the table:
   - Under **Table name**, name the table **`retail_sales_ui`**. Note that options are available to configure additional ingestion behavior, although we do not need to change any of these for this exercise.
   - Click **Create table** at the bottom of the page to create the table.
   - Confirm the table was created successfully. Then close the Catalog Explorer tab.

Use the SHOW TABLES statement to view the available tables in your schema. Confirm that the **`retail_sales_ui`** table has been created.

In [0]:
%sql
SHOW TABLES;

Query the table to review its contents.

**NOTE**: If you did not name the table as instructed, an error will be returned.

In [0]:
%sql
SELECT * FROM retail_sales_ui LIMIT 10;

## What is Databricks Auto Loader?

<img src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/autoloader/autoloader-edited-anim.gif" style="float:right; margin-left: 10px" />

[Databricks Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) lets you scan a cloud storage folder (S3, ADLS, GS) and only ingest the new data that arrived since the previous run.

This is called **incremental ingestion**.

Auto Loader can be used in a near real-time stream or in a batch fashion, e.g., running every night to ingest daily data.

Auto Loader provides a strong guarantee when used with a Delta sink (the data will only be ingested once).

### How Auto Loader simplifies data ingestion

Ingesting data from cloud storage at scale can be hard. Auto Loader makes it easier, offering these benefits:


* **Incremental** & **cost-efficient** ingestion (removes unnecessary listing or state handling)
* **Simple** and **resilient** operation: no tuning or manual code required
* Scalable to **billions of files**
  * Using incremental listing (recommended, relies on filename order)
  * Leveraging notification + message queue (when incremental listing can't be used)
* **Schema inference** and **schema evolution** are handled out of the box for most formats (csv, json, avro, images...)


### Auto Loader basics
Let's create a new Auto Loader stream that will incrementally ingest new incoming files.

In this example we let Auto Loader infer the schema and store it at a schema location. We will also use `cloudFiles.maxFilesPerTrigger` to take one file at a time, which simulates a process adding files one by one.
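
The effect of limiting the number of files per trigger can be sketched in plain Python. This toy function only illustrates the batching idea and is not Auto Loader's implementation. `plan_micro_batches` and the file names are hypothetical.

```python
def plan_micro_batches(discovered_files, processed_files, max_files_per_trigger=1):
    """Group unprocessed files into batches of at most `max_files_per_trigger`."""
    pending = sorted(f for f in discovered_files if f not in processed_files)
    return [
        pending[i:i + max_files_per_trigger]
        for i in range(0, len(pending), max_files_per_trigger)
    ]


discovered = ["00.json", "01.json", "02.json"]
processed = {"00.json"}  # recorded by an earlier run

# With a limit of 1, each remaining file becomes its own micro-batch.
batches = plan_micro_batches(discovered, processed, max_files_per_trigger=1)
print(batches)
```

A larger limit would put several files into each batch, which is usually preferable outside of a demonstration.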


#### Visualization and Important Notes

Once the Auto Loader stream is running, click on the **display_query** link above the visualization (as shown in the image) to monitor metrics like input rate, processing rate, and batch duration.

- The **Input vs. Processing Rate** chart shows how records are being ingested and processed over time.
- The **Batch Duration** chart indicates the time taken to process each batch of records.


In [0]:
%python
# Use Auto Loader to incrementally read the JSON files in the source directory
schema_location = "/Volumes/workspace/default/v01/retail_sales_schema"

cloud_dir = "/Volumes/databricks_simulated_retail_customer_data/v01/retail-pipeline/orders/stream_json/"
retail_sales_df = (spark.readStream
                   .format("cloudFiles")
                   .option("cloudFiles.format", "json")
                   .option("cloudFiles.maxFilesPerTrigger", "1")  # Process one file per micro-batch
                   .option("cloudFiles.inferColumnTypes", "true")  # Infer column types instead of reading everything as strings
                   .option("cloudFiles.schemaLocation", schema_location)  # Where Auto Loader stores the inferred schema
                   .load(cloud_dir))  # Load the directory containing the JSON files

# Display the streaming DataFrame
checkpoint_location = "/Volumes/workspace/default/checkpoint/retail_sales_df"
display(retail_sales_df, checkpointLocation = checkpoint_location)

_**🚨Important:**_

Make sure to **interrupt the cell** after completing the lab. The streaming query will continue running until explicitly interrupted, which could result in unnecessary resource usage.


## Lakeflow Connect

**NOTE: Lakeflow Connect is an advanced feature for automated data pipeline creation.**

Lakeflow Connect simplifies the creation and management of data pipelines for efficient ingestion and transformation of data into Delta Lake.

![lakeflow_connect.png](https://www.databricks.com/sites/default/files/inline-images/lakeflow-connect-video.gif?v=1718218999)

**The key benefits of using Lakeflow Connect are:**
- **Automated Pipeline Creation**: Easily configure data ingestion pipelines from various sources into Delta Lake without extensive coding.
- **Seamless Integration**: Lakeflow Connect supports multiple data sources and formats, enabling users to unify their data ingestion workflows.
- **Built-In Transformation**: Perform data validation, schema enforcement, and enrichment directly within the pipeline configuration.
- **Scalable and Reliable**: Designed for large-scale data processing, ensuring high availability and fault tolerance for enterprise workloads.

**Lakeflow Connect enables:**
- Real-time and batch data ingestion.
- Simplified pipeline monitoring and management.
- Integration with Delta Lake and Databricks ecosystem tools for optimized data operations.

**Documentation Reference**:
Learn more about Lakeflow Connect and its capabilities in the [official Databricks documentation](https://docs.databricks.com/en/ingestion/lakeflow-connect/index.html).

**NOTE:** Lakeflow Connect is in preview and not yet generally available. Updates will be provided once it becomes widely accessible.

## Conclusion

This notebook covered key data ingestion techniques in the Databricks Lakehouse, such as **CTAS**, **COPY INTO**, the **Upload UI**, and **Auto Loader** for incremental ingestion. Additionally, we introduced **Lakeflow Connect** for automated and scalable pipeline creation. These methods ensure efficient, reliable, and consistent data ingestion workflows, meeting the diverse needs of modern data engineering tasks.


In [0]:
# Stop every streaming query that is still active in this session
for query in spark.streams.active:
    query.stop()

# Exploring Data Transformation in Databricks

This part of the notebook demonstrates the **Medallion Architecture** for data transformation, using **Materialized Views (MV)** and **Streaming Tables (ST)** in SQL. The pipeline progresses through the **Bronze**, **Silver**, and **Gold** layers, showcasing how to build efficient data pipelines in Databricks.

**Learning Objectives**

By the end of this notebook, you should be able to:
- Understand the **Medallion Architecture** and its role in data pipelines.
- Declare and configure **DLT** (Delta Live Tables) pipelines for automated data processing.
- Use **Materialized Views** and **Streaming Tables** for different data transformation workloads.
- Enforce data quality with **constraints** in DLT pipelines.
- Explore and analyze tables generated by a DLT pipeline using SQL.

## Tables as Query Results

DLT adapts standard SQL queries to combine DDL (data definition language) and DML (data manipulation language) into a unified declarative syntax.

There are two distinct types of persistent tables that can be created with DLT:

* **Materialized View**  
Materialized views are refreshed according to the update schedule of the pipeline in which they’re contained. Materialized views are powerful because they can handle any changes in the input. Each time the pipeline updates, query results are recalculated to reflect changes in upstream datasets that might have occurred because of compliance, corrections, aggregations, or general CDC.

* **Streaming Table**  
Streaming tables allow you to process a growing dataset, handling each row only once. Because most datasets grow continuously over time, streaming tables are good for most ingestion workloads. Streaming tables are optimal for pipelines that require data freshness and low latency.

Note that both of these objects are persisted as tables stored with the Delta Lake protocol (providing ACID transactions, versioning, and many other benefits). We'll talk more about the differences between materialized views and streaming tables later in the notebook.
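
The difference between the two table types can be pictured with a small toy model: a materialized view recomputes its result from the full input on each refresh, while a streaming table handles each input row only once. The classes below are hypothetical and keep everything in memory. They are a sketch of the semantics described above, not of how DLT implements them.

```python
class ToyMaterializedView:
    """Recomputes its full result from the source on every refresh."""

    def __init__(self, query):
        self.query = query
        self.result = None

    def refresh(self, source_rows):
        self.result = self.query(source_rows)


class ToyStreamingTable:
    """Processes each source row once, appending only rows not yet seen."""

    def __init__(self):
        self.rows = []
        self.offset = 0  # number of source rows already processed

    def refresh(self, source_rows):
        self.rows.extend(source_rows[self.offset:])
        self.offset = len(source_rows)


source = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]
total_mv = ToyMaterializedView(lambda rows: sum(r["amount"] for r in rows))
orders_st = ToyStreamingTable()

total_mv.refresh(source)
orders_st.refresh(source)

# A correction to an existing row plus one newly arrived row.
source[0] = {"order_id": 1, "amount": 15}
source.append({"order_id": 3, "amount": 30})

total_mv.refresh(source)   # recomputed: reflects the correction
orders_st.refresh(source)  # appends only order 3; order 1 stays as first read

print(total_mv.result)
print([r["amount"] for r in orders_st.rows])
```

In this model the materialized view picks up the corrected amount, while the streaming table keeps the value it read the first time, which matches the append-oriented ingestion workloads streaming tables are suited to.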

For both kinds of tables, DLT takes the approach of a slightly modified CTAS (create table as select) statement. Engineers just need to worry about writing queries to transform their data, and DLT handles the rest.

The basic syntax for a SQL DLT query is:

**`CREATE OR REFRESH [STREAMING] TABLE table_name`**<br/>
**`AS select_statement`**<br/>

## Explore Available Raw Files

Complete the following steps to explore the available raw data files that will be used for the DLT pipeline:

- Navigate to the available catalogs by selecting the catalog icon directly to the left of the notebook (do not select the **Catalog** text in the far left navigation bar).
- Expand the **databricks_simulated_retail_customer_data > v01 > Volumes**.
- Expand the volume that contains your **unique username**.
- Expand the **stream-source** directory. Notice that the directory contains subdirectories, including **customers** and **orders**.
- Expand each subdirectory. Notice that each contains a JSON file (00.json) with raw data. We will create a DLT pipeline that will ingest the files within this volume to create tables and materialized views for our consumers.

## DLT Pipeline: Customer Order Pipeline

Check out how to create **Streaming Tables** and **Materialized Views (MV)** for the Medallion Architecture by reviewing the **Customer Order Pipeline** notebook. Follow these steps:

- Open the [Customer Order Pipeline]($./Includes/03 - Data Ingestion and Transformation/Pipelines/Customer%20Order%20Pipeline) notebook.
   - Do not attempt to execute the code directly. It is intended to be executed within the context of the DLT pipeline workflow.
   - Examine how **Streaming Tables** and **Materialized Views** are implemented to process data through the **Bronze**, **Silver**, and **Gold** layers.
   - Observe the step-by-step creation and transformation of tables within the Medallion Architecture, including data ingestion, validation, and enrichment techniques.
- After reviewing the pipeline notebook and understanding its concepts, close the tab and return to this notebook to proceed with generating the DLT pipeline and completing additional tasks.

## Display Pipeline Configuration

We are going to manually configure a pipeline using the DLT UI. Configuring this pipeline will require parameters unique to a given user. Run the cell to print out values you'll use to configure your pipeline in subsequent steps.

In [0]:
import os

pipeline_name = 'Data Transformation Pipeline'
pipeline_root = f'{os.getcwd()}/pipelines'
catalog_name = 'workspace'
schema_name = 'default'
notebook_path = f'{os.getcwd()}/pipelines/00 - Customer Order Pipeline'
src_dir = '/Volumes/databricks_simulated_retail_customer_data/v01/retail-pipeline'


print(f"Pipeline Name:         {pipeline_name}")
print(f"Pipeline root folder:  {pipeline_root}")
print(f"Default Catalog:       {catalog_name}")
print(f"Default Schema:        {schema_name}")
print(f"Source:                {src_dir}")

## Create and Configure a Pipeline

Complete the following to configure the pipeline.

Steps:
- Open the [Pipelines user interface](/pipelines) (or use the **Jobs & Pipelines** option from the left sidebar and select the Pipelines tab).
- Click **Create** in the upper-right corner, and select **ETL pipeline** from the dropdown menu.
- Set the pipeline name to **Data Transformation Pipeline**.
- Then select **Add an existing asset**.
- Set the **Pipeline Root Folder path** and **Source code paths** using the **Pipeline root folder** value from above.
- Click on **Add** to confirm the pipeline creation.

- In the upper right bar click on the **Settings** button. Configure the pipeline as specified below. You'll need the values provided in the cell output above for this step.

| Setting | Instructions |
|--|--|
| Pipeline name | Enter the **Pipeline Name** provided above |
| Pipeline mode | Choose **Triggered** |
| Default catalog | Choose your **Default Catalog** provided above |
| Default schema | Choose the **Default Schema** provided above |
| Pipeline user mode | **Development** |

- In the settings, click **Add Configuration** and enter the Key and Value from the table below:

| Key                 | Value                                      |
| ------------------- | ------------------------------------------ |
| **`source`** | Enter the **source** provided above |


## Check Your Pipeline Configuration

- In the Databricks workspace, open the **[Jobs & Pipelines](/jobs)** UI.
- Select the **Data Transformation Pipeline** pipeline configuration created previously.
- Review the pipeline configuration settings to ensure they are correctly configured according to the provided instructions.
- Once you've confirmed that the pipeline configuration is set up correctly, proceed to the next steps for running the pipeline.

## Update the pipeline

Trigger an update of the pipeline you created by clicking the **Run Pipeline** button in the Pipeline user interface.

You should see your pipeline being executed:

![execute](./images/01.png)


## Querying Tables in the Target Database

As long as a target database is specified during DLT Pipeline configuration, tables should be available to users throughout your Databricks environment. Let's explore them now. 

Run the cell below to see the tables registered to the schema used so far. The tables were created in the **workspace** catalog, within the **default** schema, as set in the pipeline configuration.

In [0]:
%sql
SHOW TABLES

Note that the view we defined in our pipeline is absent from our tables list.

Query results from the **`order_silver`** table.

In [0]:
%sql
SELECT * FROM order_silver

Recall that **`orders_bronze`** was defined as a streaming table in DLT, but our results here are static.

Because DLT uses Delta Lake to store all tables, each time a query is executed, we will always return the most recent version of the table. But queries outside of DLT will return snapshot results from DLT tables, regardless of how they were defined.


### Show Lineage for Delta Tables in Unity Catalog


<img src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/uc/lineage/uc-lineage-slide.png" style="float:right; margin-left:10px" width="700"/>


Unity Catalog captures runtime data lineage for all **table-to-table operations** executed on Databricks clusters or SQL warehouses. Lineage works seamlessly across all languages, including **SQL, Python, Scala, and R**. It can be visualized in **Catalog Explorer** in near real-time and can also be retrieved programmatically using the **REST API**.

**Lineage Granularity Levels**

Unity Catalog supports data lineage at two levels:
- **Table-Level Lineage**:
   - Tracks the flow of data between entire tables.
   - Useful for understanding the broader context of data operations.

- **Column-Level Lineage**:
   - Tracks data transformations at the column level.
   - Ideal for use cases like **GDPR compliance** and tracking sensitive data dependencies.

**Access Control with Table ACLs**

Lineage respects the **Table ACLs** (Access Control Lists) defined in Unity Catalog:
- If a user does not have access to a table in the lineage graph, its details will be redacted.
- However, users will still see the presence of upstream or downstream dependencies, ensuring visibility into the flow of data while maintaining security.

---

**Benefits of Viewing Lineage**
- **End-to-End Data Visibility**:
   - Understand how data flows through the pipeline, from source to final output.

- **Compliance and Governance**:
   - Ensure GDPR compliance by tracking sensitive data dependencies at the column level.

- **Debugging and Optimization**:
   - Identify bottlenecks and optimize transformations for better performance.

### Steps to View Lineage in Unity Catalog

Follow these steps to view the lineage of Delta Tables in Unity Catalog:

**Step 1: Navigate to the Pipelines**
- Open the **[Jobs & Pipelines](/jobs)** UI.
- Select the **Data Transformation Pipeline** pipeline created previously.

**Step 2: Select the Materialized View**
- On the pipeline page, locate the materialized view of interest (e.g., `customer_order`).
- Click on the materialized view to open its **Details** tab. 

**Step 3: Open the Table in Catalog Explorer**
- Under the **Details** tab of the materialized view, locate the table name (e.g., `workspace.default.customer_order`).
- Click on the table name to navigate to the **Catalog Explorer**. 

**Step 4: View Lineage Tab**
- In the Catalog Explorer, select the **Lineage** tab from the menu at the top. 
- This will display a summary of the table's upstream and downstream dependencies.

**Step 5: Open the Lineage Graph**
- In the **Lineage** tab, locate the **"See Lineage Graph"** button in the top-right corner of the page.
- Click on the button to open the expanded lineage graph. 

**Step 6: Explore the Lineage Graph**
- The **Lineage Graph** will display the data flow for the table:
   - **Upstream Tables**: Represent the data sources feeding into the pipeline.
   - **Downstream Tables**: Represent the outputs or dependencies created from the table.
- Click on the **`+` icons** to expand the graph and reveal additional details about each connection.

## Conclusion

In this notebook, we explored data transformation in Databricks using the Medallion Architecture, highlighting the capabilities of DLT to build robust and efficient pipelines. We demonstrated the creation and configuration of pipelines that use Materialized Views (MV) and Streaming Tables (ST) to process and transform data across Bronze, Silver, and Gold layers.

# Data Orchestration and Querying Capabilities

In this lab, we’ll set up a **Databricks Lakeflow Job** that uses SQL tasks to automate a series of **Data Warehousing** tasks. These tasks include data ingestion, validation, transformation, and generating insights within a **Medallion Architecture** (Bronze, Silver, Gold layers). Additionally, we will explore error handling, retries, scheduling options, and integration with external tools.

![medallion_architecture](./images/medallion_architecture.png)

Steps of the Medallion Architecture:

- Ingest all CSV files from the **myfiles** volume and create a bronze table.
- Prepare the bronze table by adding new columns and create a silver table.
- Create a gold aggregated table for consumers.

**Learning Objectives**

By the end of this lab, you will learn how to:

- Create a Lakeflow Job with **SQL tasks** for a Medallion Architecture pipeline.
- Set up task dependencies and implement conditional logic to control the flow of a Lakeflow Job.
- Use the Lakeflow Job UI for **monitoring** and **data lineage visualization** to trace data transformations and dependencies.
- Configure error handling and retries for tasks.
- Run a Lakeflow Job using a manual trigger.
- Set up **notifications** for monitoring and analyze the execution history.

## Create ETL Pipelines using the Databricks Python API

Instead of using the UI, you can leverage the Databricks APIs to create or manage your Databricks assets.

In the next steps, you will create the required pipelines to continue the lab instead of using the UI.

This is very useful when you want to scale or automate your operations, giving you more flexibility.

Let's first add a helper function that wraps the use of the Databricks API to create a pipeline.

In [0]:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

client = WorkspaceClient()

workspace_url = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}"

def generate_pipeline(
    pipeline_name, 
    pipeline_root_folder, 
    pipeline_notebooks, 
    use_catalog, 
    use_schema, 
    use_configuration=None, 
    use_serverless=True, 
    use_continuous=False
):
    """
    Generates a Databricks pipeline based on the specified configuration parameters.
    """

    # Get the current notebook folder path
    # current_folder_path = dbutils.entry_point.getDbutils().notebook().getContext().notebookPath().getOrElse(None)
    # main_course_folder_path = "/Workspace" + "/".join(current_folder_path.split("/")[:-1])

    # Create paths for the specified notebooks
    notebooks_paths = [
        f"{pipeline_root_folder}/{notebook}" for notebook in pipeline_notebooks
    ]

    # Create the pipeline
    pipeline_info = client.pipelines.create(
        allow_duplicate_names=True,
        name=pipeline_name,
        catalog=use_catalog,
        target=use_schema,
        serverless=use_serverless,
        continuous=use_continuous,
        development=True,  # Development mode
        configuration=use_configuration,
        libraries=[pipelines.PipelineLibrary(notebook=pipelines.NotebookLibrary(path=path)) for path in notebooks_paths]
    )

    # Report the new pipeline and return its ID so callers can reuse it
    print(f"Successfully created the ETL pipeline '{pipeline_name}' (id: {pipeline_info.pipeline_id}, url: {workspace_url}/pipelines/{pipeline_info.pipeline_id})")
    return pipeline_info.pipeline_id


In [0]:
import os  # needed for os.getcwd() below

pipeline_list = [
    "01 - Raw Data to Bronze",
    "02 - Bronze to Silver",
    "03 - Silver to Gold"
]

for pipeline in pipeline_list:
    generate_pipeline(
        pipeline_name=pipeline,
        pipeline_root_folder=f'{os.getcwd()}/resources',
        pipeline_notebooks=[
            pipeline
        ],
        use_catalog="workspace", 
        use_schema="default",
        use_configuration={'source': '/Volumes/databricks_simulated_retail_customer_data/v01/retail-pipeline'}
    )
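
For orientation, the sketch below shows roughly what request body the helper's arguments translate to if you called the pipelines REST endpoint directly instead of the SDK. The field names follow my understanding of the pipeline-creation API and are worth verifying; `build_pipeline_payload` is a hypothetical function, and the root folder is a made-up path.

```python
import json

def build_pipeline_payload(name, root_folder, notebooks, catalog, schema, configuration=None):
    """Assemble a request body mirroring the arguments passed to generate_pipeline."""
    return {
        "name": name,
        "catalog": catalog,
        "target": schema,                 # same role as target=use_schema in the helper
        "serverless": True,
        "continuous": False,
        "development": True,
        "allow_duplicate_names": True,
        "configuration": configuration or {},
        # One notebook library entry per notebook, built the same way as notebooks_paths
        "libraries": [{"notebook": {"path": f"{root_folder}/{nb}"}} for nb in notebooks],
    }

payload = build_pipeline_payload(
    name="01 - Raw Data to Bronze",
    root_folder="/Workspace/Users/someone@example.com/lab/resources",  # hypothetical path
    notebooks=["01 - Raw Data to Bronze"],
    catalog="workspace",
    schema="default",
    configuration={"source": "/Volumes/databricks_simulated_retail_customer_data/v01/retail-pipeline"},
)
print(json.dumps(payload, indent=2))
```

Printing the payload before creating anything is a cheap way to confirm that each notebook path resolves to the folder you expect.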

## Create a Lakeflow Job in the UI

- In your Databricks workspace, click on the **Jobs & Pipelines** icon in the left sidebar.

- Click on **Create** in the upper-right corner of the **Jobs & Pipelines** page and select **Job**.

- Name the job "Serverless lakeflow job" or something similar for easy identification.

## Add Tasks to the Job

### Raw Data to Bronze DLT Pipeline

**Create Your First Task**:
   - Name the task `01_Raw_Data_to_Bronze` (spaces are not supported).
   - Set **Type** to `Pipeline`.
   - **Pipeline** should be set to `01 - Raw Data to Bronze`.
   - Notifications:
      - Add a notification to send emails on failure (e.g., `your-email@databricks.com`).
   - Click **Create Task**.

This task runs the Raw Data-to-Bronze pipeline to ingest data.

### Set Conditional Check for Bronze Pipeline

**Create Conditional Task**:
   - Click on **Add Task**
   - Set **Type** to `If/else condition`.
   - Name the task `pipeline_condition_1`.
   - Set **Depends on** to `01_Raw_Data_to_Bronze` so that this condition runs after the Bronze pipeline task finishes.
   - Condition: Set the expression to:
     - Left operand: `{{tasks.01_Raw_Data_to_Bronze.result_state}}`
     - Operator: **==**
     - Right operand: **success**
   - Click **Save Task**.

This task evaluates whether the Bronze pipeline execution succeeded or failed, determining the next steps.

### Bronze to Silver DLT Pipeline

**Create Your Second Task**:
   - Click on **Add Task**
   - Name the task `02_Bronze_to_Silver`.
   - Set **Type** to `Pipeline`.
   - **Pipeline** should be set to `02 - Bronze to Silver`.
   - Set **Depends on** to `pipeline_condition_1 (True)`.
   - Notifications:
      - Add a notification to send emails on failure (e.g., `your-email@databricks.com`).
   - Click **Create Task**.

This task processes data from Bronze to Silver.

### Set Conditional Check for Silver Pipeline

**Create Conditional Task**:
   - Click on **Add Task**
   - Set **Type** to `If/else condition`.
   - Name the task `pipeline_condition_2`.
   - Set **Depends on** to `02_Bronze_to_Silver`.
   - Condition: Set the expression to:
     - Left operand: `{{tasks.02_Bronze_to_Silver.result_state}}`
     - Operator: **==**
     - Right operand: **success**   
   - Click **Save Task**.

This task evaluates whether the Silver pipeline execution succeeded or failed.

### Silver to Gold DLT Pipeline

**Create Your Third Task**:
   - Click on **Add Task**
   - Name the task `03_Silver_to_Gold`.
   - Set **Type** to `Pipeline`.
   - **Pipeline** should be set to `03 - Silver to Gold`.
   - Set **Depends on** to:
     - `pipeline_condition_2 (True)`.
   - Notifications:
      - Add a notification to send emails on failure (e.g., `your-email@databricks.com`).
   - Click **Create Task**.

This task processes data from Silver to Gold.

### Troubleshooting Notebook

**Create Fourth Task**:
   - Click on **Add Task**
   - Set **Type** to `Notebook`.
   - Name the task `Troubleshooting`.
   - **Source** should be set to `Workspace`.
   - Set **Path** to the troubleshooting notebook (e.g., `./resources/04 - Troubleshooting`).
   - Use the same cluster as the previous tasks.
   - Set **Depends on** to both:
     - `pipeline_condition_1 (False)` and `pipeline_condition_2 (False)`.
   - Set **Run if dependencies** to "At least one succeeded" so that the task runs when either condition evaluates to false.
   - Notifications:
      - Add a notification to send emails on success (e.g., `your-email@databricks.com`).
   - Click **Create Task**.

This task runs a troubleshooting notebook to analyze and resolve pipeline issues. Example steps in the notebook include querying logs and providing remediation suggestions.
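
The task graph configured above can also be written down as data, which helps when reviewing the dependencies. The sketch below assembles it as a plain Python dictionary in the shape I understand the Jobs API to accept (`depends_on` with an `outcome`, `condition_task`, `run_if`); treat the field names as assumptions to verify. The pipeline IDs and the notebook path are placeholders, and the helper functions are hypothetical.

```python
def pipeline_task(task_key, pipeline_id, depends_on=None):
    """Describe a task that runs a pipeline and emails on failure."""
    task = {
        "task_key": task_key,
        "pipeline_task": {"pipeline_id": pipeline_id},
        "email_notifications": {"on_failure": ["your-email@databricks.com"]},
    }
    if depends_on:
        task["depends_on"] = depends_on
    return task

def condition_task(task_key, upstream_key):
    """Describe an if/else task that checks whether the upstream task succeeded."""
    return {
        "task_key": task_key,
        "depends_on": [{"task_key": upstream_key}],
        "condition_task": {
            "op": "EQUAL_TO",
            "left": "{{tasks." + upstream_key + ".result_state}}",
            "right": "success",
        },
    }

tasks = [
    pipeline_task("01_Raw_Data_to_Bronze", "<bronze-pipeline-id>"),
    condition_task("pipeline_condition_1", "01_Raw_Data_to_Bronze"),
    pipeline_task("02_Bronze_to_Silver", "<silver-pipeline-id>",
                  depends_on=[{"task_key": "pipeline_condition_1", "outcome": "true"}]),
    condition_task("pipeline_condition_2", "02_Bronze_to_Silver"),
    pipeline_task("03_Silver_to_Gold", "<gold-pipeline-id>",
                  depends_on=[{"task_key": "pipeline_condition_2", "outcome": "true"}]),
    {
        "task_key": "Troubleshooting",
        # Placeholder path; point it at your copy of the troubleshooting notebook
        "notebook_task": {"notebook_path": "/Workspace/<your-folder>/resources/04 - Troubleshooting",
                          "source": "WORKSPACE"},
        "depends_on": [
            {"task_key": "pipeline_condition_1", "outcome": "false"},
            {"task_key": "pipeline_condition_2", "outcome": "false"},
        ],
        "run_if": "AT_LEAST_ONE_SUCCESS",
        "email_notifications": {"on_success": ["your-email@databricks.com"]},
    },
]

job_settings = {"name": "Serverless lakeflow job", "tasks": tasks}

# Print each task with the tasks it waits on
for task in tasks:
    upstream = [d["task_key"] for d in task.get("depends_on", [])]
    print(f"{task['task_key']} <- {upstream}")
```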

### Enable Email Notifications

- **Set up Notifications**:
   - In the job's configuration, navigate to the **Notifications** section.
   - Enable email notifications by adding your email to receive updates on job completion.

## Trigger the Lakeflow Jobs Manually

Go to [Jobs & Pipelines](/jobs) in the Databricks UI, select the **Serverless lakeflow job**, and click **Run Now** in the top-right corner to manually trigger the job.

This will execute all tasks in the Lakeflow Job according to their dependencies and conditions.
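
The same manual trigger can be issued over REST. The sketch below only constructs the request, assuming the `run-now` endpoint of the Jobs API; the workspace URL, token, and job ID are placeholders, and `build_run_now_request` is a hypothetical helper.

```python
import json
import urllib.request

def build_run_now_request(workspace_url, token, job_id):
    """Build (but do not send) a request that triggers one run of a job."""
    url = workspace_url.rstrip("/") + "/api/2.1/jobs/run-now"
    return urllib.request.Request(
        url,
        data=json.dumps({"job_id": job_id}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

run_request = build_run_now_request(
    "https://example.cloud.databricks.com",   # placeholder workspace URL
    "<personal-access-token>",                # placeholder token
    123456789,                                # placeholder job ID
)
print(run_request.get_method(), run_request.full_url)
# To actually trigger the run: urllib.request.urlopen(run_request)
```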

## Monitor the Lakeflow Jobs Execution

- **Navigate to the Runs Tab**:
   - In the job interface, you can view active and completed executions of the job.
   - Select the current execution, which should take you to the list of tasks included in the job

- **Observe Task Execution**:
   - Each task’s status is displayed, where you can see which tasks are currently executing or have completed.
   - Click on each task to view its execution details and outputs, allowing you to troubleshoot and verify each stage.
   - Check the logs to see whether the Lakeflow Job followed the expected path based on the outcome of the `pipeline_condition_1` and `pipeline_condition_2` checks.
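
If you fetch a run's details through the Jobs API rather than the UI, a small helper can condense the per-task states into a quick overview. The sketch below assumes each task entry carries a `state` object with `life_cycle_state` and `result_state`, which is my understanding of the run-details response; the sample dictionary is hand-written for illustration and is not real output.

```python
def summarize_run(run):
    """Map each task key to its result state, or its life-cycle state if not finished."""
    summary = {}
    for task in run.get("tasks", []):
        state = task.get("state", {})
        summary[task["task_key"]] = (
            state.get("result_state") or state.get("life_cycle_state", "UNKNOWN")
        )
    return summary

# Hand-written sample shaped like a run-details response (illustrative only)
sample_run = {
    "run_id": 1,
    "tasks": [
        {"task_key": "01_Raw_Data_to_Bronze",
         "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
        {"task_key": "pipeline_condition_1",
         "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
        {"task_key": "02_Bronze_to_Silver",
         "state": {"life_cycle_state": "RUNNING"}},
        {"task_key": "03_Silver_to_Gold",
         "state": {"life_cycle_state": "BLOCKED"}},
    ],
}

for task_key, status in summarize_run(sample_run).items():
    print(f"{task_key}: {status}")
```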

## Viewing Data Lineage for a Table
Data lineage in Unity Catalog provides end-to-end visibility into how data is sourced, transformed, and consumed. With lineage information, you can:

- Understand the dependencies of your datasets.
- Identify the upstream and downstream impact of schema changes.
- Debug pipeline issues by tracing data flow through the system.
- Ensure compliance by auditing data usage and transformations.

**Benefits of Data Lineage**
- **Visibility:** Gain a comprehensive view of data flow across your pipeline.
- **Impact Analysis:** Determine how changes in one dataset affect downstream applications.
- **Governance and Compliance:** Track data transformations and usage for regulatory requirements.
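
To make impact analysis concrete, the toy sketch below walks a lineage graph breadth-first to list everything downstream of a table. The edges are hypothetical and hand-written; in practice they would come from Unity Catalog lineage, and the table names other than `order_table_gold` are invented for the example.

```python
from collections import deque

def downstream_of(edges, table):
    """Return every table reachable from `table` by following lineage edges."""
    seen, queue = [], deque([table])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

# Hypothetical lineage edges: source table -> tables derived from it
edges = {
    "orders_bronze": ["orders_silver"],
    "customers_bronze": ["customers_silver"],
    "orders_silver": ["order_table_gold"],
    "customers_silver": ["order_table_gold"],
}

print(downstream_of(edges, "orders_bronze"))
# prints ['orders_silver', 'order_table_gold']
```

A schema change to `orders_bronze` would therefore need to be checked against both listed tables before it is rolled out.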

### Viewing Lineage in Unity Catalog
The following code helps you access the lineage information for a table directly in the Databricks UI.

In [0]:
# Generate the workspace URL dynamically
workspace_url = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}"

# Define the table name for which to view the lineage
table_name = "order_table_gold"  # Replace with any other table name as needed

# Construct the URL for the data lineage page in Unity Catalog
lineage_url = f"{workspace_url}/explore/data/workspace/default/{table_name}?activeTab=lineage"

# Print a user-friendly message with the lineage URL
print(f"Access the data lineage for the table '{table_name}' using the following URL:")

# Display the URL as a clickable link in Databricks
displayHTML(f'<a href="{lineage_url}" target="_blank">Click here to view the lineage for {table_name}</a>')

## Conclusion

In this lab, you learned how to:
- Configure and execute a Databricks Lakeflow job with multiple tasks.
- Use dependencies and conditional paths to control the flow of tasks based on the conditions.
- Set up email notifications to stay updated on job execution.
- Trigger the Lakeflow job manually and monitor its execution.

This Lakeflow Jobs setup ensures robust automation for DLT pipelines with integrated troubleshooting and notification mechanisms. The conditional paths provide flexibility to handle success and failure scenarios efficiently, while monitoring and logging enhance visibility into pipeline executions.

Additionally, you explored how to leverage **data lineage** within Unity Catalog, enabling deeper insights into the relationships between datasets and transformations. This feature enhances governance, auditing, and troubleshooting across your Lakeflow Jobs.

# Data Presentation with AI/BI Dashboard

Databricks AI/BI Dashboards offer an enhanced visualization library and a streamlined configuration experience to help you quickly transform data into shareable insights.

In this lab, we will create a new dashboard and add data and visualizations to the dashboard based on table data and SQL queries.

This lab uses the following resources from  `databricks_simulated_retail_customer_data.v01`:
* **sales** table
* **customers** table

These tables contain some retail sales figures and customer order details. We'll be using them as the source data for the visualizations in the dashboard.

### Create a new Dashboard
Creating a new dashboard in Databricks is simple and straightforward.
* Navigate to **Dashboards** in the side navigation pane.
* Select **Create dashboard**. 
* At the top of the resulting screen, click on the Dashboard name and change it to **Retail Dashboard**.

You also have the option to import a dashboard if you already have one. All your existing Dashboards can be located from this area of the platform. There are also many quick create features throughout the platform that offer **Dashboards** as one of the options for creating them from other submenus.

### Adding Data Sources

With a completely new dashboard, you'll need to associate the dashboard to data before you can begin. 

At the top of the dashboard screen, you have two tabs: **Page** and **Data**.

- **Page:** The **Page** tab allows users to create visualizations and construct their dashboards. Each item on the canvas is called a widget. There are three types of widgets: visualizations, text boxes, and filters.

- **Data:** The **Data** tab allows you to define datasets that you will use in the dashboard. Datasets are bundled with dashboards when sharing, importing, or exporting them using the UI or API.

Select the **Data** tab to get started. 

There are three icons on the left side of the screen: **Dataset list**, **Catalog**, and **Assistant**. 
* **Dataset list** will present you with a list of all the Datasets and queries used for the dashboard. You can use multiple datasets in a single dashboard which can be selected from the list of available tables or created from SQL queries. 
* **Catalog** allows you to navigate the available catalogs, schemas, and tables.
* **Assistant** provides you with an AI-powered interface for asking questions in natural language to discover objects, gain insights, or get help writing queries.

The following steps walk you through adding the tables for this example dashboard.

- With the **Dataset list** icon selected, click the **+ Add Data Source** button.
- If needed, select **All**.
- From the resulting pop-up, search for `databricks_simulated_retail_customer_data.v01.sales`.
- Click **sales** to add it as a dataset, and then select the **Confirm** button. Note that it appears in your dataset list.
- Repeat these steps to add the `customers` table.

Note that each table is added to the list as an automatically populated `SELECT *` statement in the query editing panel. You can modify the SQL query to alter the dataset.

---
### Adding Visualizations

#### Counter

The first visualization we'll be adding to the dashboard is a counter visualization to display the current sales against a sales goal of $3 million.

- In the **Data** tab, select the **+ Create from SQL** option.
- Enter the following query into the query editing space:<br>
    ```
    SELECT 
    sum(total_price) AS Total_Sales, 
    3000000 AS Sales_Goal 
    FROM databricks_simulated_retail_customer_data.v01.sales;
    ```

- Click **Run** to execute the query.
- Right-click the query in the **Datasets** list and select **Rename**, or use the kebab menu, to rename the query as **Count Total Sales**.

Now, let's add our first visualization. Switch to the **Untitled Page** tab.

- At the bottom of the screen you have a toolbar for moving objects, adding a visualization, adding a text box, and adding a filter. Select **Add a visualization**.
- Move your cursor to anywhere on the screen and click to add the visualization to the canvas.
- In the **Configuration Panel** on the right, make the following selections for the settings:
    - **Dataset:** Count Total Sales
    - **Visualization:** Counter
    - **Title:** Checked
      - Click on **Widget Title** on the visualization.
      - Change it to **Sales Goal**.
    - **Value:** Total_Sales
    - **Target:** Sales_Goal

- Click on **Total_Sales** in the **Value** area of the configuration panel and select **Format** from the resulting dropdown. Make the following adjustments:
    - Change **Auto** to **Custom**
    - Set **Type** to Currency ($)
    - Set **Abbreviation** to **None**
- In the Style section, click the **+** next to **Conditional Style**. Configure it with the following settings:
    - If Value <= Target
    - Then (Color: Red)
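
The conditional style rule can be read as a simple comparison. The sketch below restates it in plain Python, with a made-up sales total standing in for the query result; `counter_style` is a hypothetical helper and is not part of the dashboard product.

```python
def counter_style(value, target):
    """Return the color the rule would pick and the progress toward the target."""
    color = "red" if value <= target else "default"
    progress = value / target
    return color, progress

# Made-up sales total compared against the $3 million goal
color, progress = counter_style(2_400_000, 3_000_000)
print(color, f"{progress:.0%}")
# prints: red 80%
```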

This is a really simple visualization, but it lets you get a feel for working with visualizations on dashboards. You can adjust the placement and size of the visualization by dragging its edges, or by clicking and holding while hovering over the visualization box.

#### Adding a Text Box

Let's add a name and a space for a text description of the dashboard to the canvas. When adding a new widget to the canvas, other widgets automatically move to accommodate your placement. You can use your mouse to move and resize widgets. To delete a widget, select it and then press the delete key. You can also manipulate the widgets through the use of the kebab menu icon in the upper right corner of each individual one.

Complete the following steps to add a text box to the dashboard:

- Click the <b>Add a text box</b> icon and drag the widget to the top of your canvas.
- Type: `# Retail organization`

    **Note:** Text boxes use markdown. The `#` character in the included text indicates that <b>Retail organization</b> is a level 1 heading. See <a href="https://www.markdownguide.org/basic-syntax/" target="_blank">this markdown guide</a> for more on basic markdown syntax.


### Publishing and Sharing

When your dashboard is complete, to share it with others, you need to publish it. 

Published dashboards can be shared with other users in your workspace and with users registered at the account level. That means that users registered to your Databricks account, even if they have not been assigned workspace access or compute resources, can be given access to your dashboards.

When you publish a dashboard, the default setting is to **embed credentials**. Embedding credentials in your published dashboard allows dashboard viewers to **use your credentials to access the data and power the queries that support it**. If you choose not to embed credentials, dashboard viewers use their own credentials to access necessary data and compute power. If a viewer does not have access to the default SQL warehouse that powers the dashboard, or if they do not have access to the underlying data, _visualizations will not render._

To publish your dashboard, complete the following steps:

- Click <b>Publish</b> in the upper-right corner of your dashboard. Read the setting and notes in the <b>Publish</b> dialog.
- Click <b>Publish</b> in the lower-right corner of the dialog. The <b>Sharing</b> dialog should open afterward. If it does not open, you can select **Share** next to **Publish** at the top of the dashboard.
    - You can use the text field to search for individual users, or share the dashboard with a preconfigured group, like <b>Admins</b> or <b>All workspace users</b>. From this window, you can grant leveled privileges like <b>Can Manage</b> or <b>Can Edit</b>. See <a href="https://docs.databricks.com/en/security/auth-authz/access-control/index.html#lakeview" target="_blank">Dashboard ACLs</a> for details on permissions.
    - The bottom of the <b>Sharing</b> dialog controls view access. Use this setting to easily share with all account users.
-  Under <b>Sharing settings</b>, choose <b>Anyone in my account can view</b> from the drop-down. Then, close the <b>Sharing</b> dialog.
-  Use the drop-down near the top of the dashboard to switch between <b>Draft</b> and <b>Published</b> versions of your dashboard.

**Note:** When you edit your draft dashboard, viewers of the published dashboards do not see your changes until you republish. The published dashboard includes visualizations that are built on queries that can be refreshed as new data arrives. Dashboards are updated with new data automatically without republishing.


### Interactive Features: Field Filters (Optional)
The dashboard you created in the lab is good for reporting, and viewers can use it to stay up-to-date on the most recent retail sales figures. However, the viewer has no controls that allow them to further explore the data. For example, if a user wants to see the data for a specific period, they would need to contact the dashboard author to request any changes.

You can create user controls that allow the viewer to filter certain data based on a field or a parameter value. Filters are widgets that allow dashboard viewers to narrow down results by filtering on specific fields or setting dataset parameters. 

Filters can be applied to fields of one or more datasets. Filters on fields allow users to focus on certain values, or ranges of values in the data. The filter applies to all visualizations built on the selected datasets.

To add a filter to the dashboard, complete the following steps:

-  Return to your dashboard if you've navigated away from it.
-  If viewing the published version, switch it to view the draft version of the dashboard.
-  Click the <b>Filter</b> icon in the toolbar near the bottom of the canvas.
- Place the widget near the top of your dashboard. You may want to add it under your text box. You can rearrange the widgets on the dashboard to organize it the way you want.

- When the filter widget is selected, the filter configuration panel appears on the right side of the screen.
  
-  Apply the following settings:
  - <b>Filter</b>: Single value
  - <b>Fields</b>: 
      - sales.total_price 
      - Count Total Sales.Sales_Goal
-  Use the checkboxes to turn on <b>Title</b>.
-  Double-click the title on the widget and change it to <b>Product category</b>
-  Use the drop-down in the filter widget to test your filter (e.g., `3000000`).

**Note:** The filter applies to each selected dataset in the filter configuration panel. All of the datasets you selected share the same range of values for product_category. A dashboard viewer can select from that list when choosing which data to filter on the dashboard.

You can also use parameters to create interactive dashboards. Parameters allow users to customize visualizations by substituting values into dataset queries at runtime. Parameters are beyond the scope of this course; see <a href="https://docs.databricks.com/en/dashboards/parameters.html" target="_blank">What are dashboard parameters?</a> to learn more.

Remember to republish the dashboard after you've made edits to it in order for the published version to reflect your new filter.


The image below is supplied as an example of how your dashboard could appear once you've finished adding visualizations and customizing the colors and features of the dashboard. 

![Dashboard_Solution](./images/05_Dashboard_Solution.png)

# Data Presentation with AI/BI Genie Spaces

In this lab, you'll be looking into Databricks AI/BI Genie and the data exploration spaces you can create based on both existing dashboards and data sets as well as new combinations of data sets connected to the workspace. 

This lab uses the following resources from  `dbacademy_retail.v01`:
* **sales** table
* **customers** table

You will also need to have created the following:
  * _The **Retail Dashboard** you created in the previous lab (Data Presentation with AI/BI Dashboard)_

**Learning Objectives**

By the end of this lab, you will be able to:
- **Understand Databricks AI/BI Genie Spaces**:
   - Explore the concept of Genie Spaces for data exploration and querying.
   - Identify the resources and datasets used to create Genie Spaces.

- **Create Genie Spaces**:
   - Set up a Genie Space from the platform UI using existing datasets.
   - Configure Genie Space settings, including table associations, default warehouse, and sample questions.

- **Leverage Genie Space Features**:
   - Utilize Genie Space functionalities, such as multiple chat threads, data exploration, and monitoring user interactions.
   - Edit Genie Space settings, review associated data tables, and monitor user queries.

- **Create Genie Spaces from Dashboards**:
   - Generate a Genie Space directly from a pre-existing dashboard.
   - Interact with the Draft Genie Space and explore how Genie answers questions about the dashboard’s data.

- **Understand Genie’s Monitoring and Preview Features**:
   - Explore how Genie tracks user questions, ratings, and interactions.
   - Recognize the limitations and evolving nature of Genie Spaces in the Public Preview phase.

### Creating a Genie Space

In this part of the lab, we'll start with creating a Genie Space directly from the given UI area. Follow the steps below to create a Genie Space. 
- Navigate to **Genie** in the left side navigation of the platform. 
- Click **+ New** in the upper right corner.
- A new pop-up will appear prompting you to Connect your data. Within this pop-up, select **All** to locate the table:
      - Catalog: dbacademy_retail
      - Schema: v01
      - Table: customers
- When finished selecting data, click **Create** at the bottom.

You will now be presented with the Genie Space UI with the chat environment on the left and the settings and details on the right. With the **Configure** button at the top selected, click on **Settings**. (By default, the configuration panel opens to **Context**.) Here you can edit the following information:
- **Title:** Basic Retail Details
- **Description:** "This Space is designed to provide a space to query the details of the customers dataset."
- **Default warehouse:** shared_warehouse
- **Sample Questions:** "How many customers do we have in CA?" (Click the **+ Add** button to add the question.)
- Click **Save** at the bottom to confirm the edits.

The Genie space's screen area is split into two sides with the chat window on the left and the configuration and settings on the right. The buttons at the top right allow you to choose among these areas:

* **+ New chat**: Allows you to create a new threaded dialogue with Genie. After you publish a Genie space to end users (who will probably have only "Can View" or "Can Run" access), this is one of the only Genie areas they will have access to.
* **History** (the icon that looks like a clock): Allows you to review the separate chat threads that you've had with Genie. After you publish a Genie space to end users (who will probably have only "Can View" or "Can Run" access), this is the other Genie area they will have access to.
* **Configure** (the icon that looks like a gear): Returns you to the edit screen for the settings of the Genie space, much like the screen you saw during the space's creation.
  Within Configure you'll have:
  * **Data** (the icon that looks like stacked blocks): Allows you to review and edit the data tables associated with the Space.
  * **Instructions** (the icon that looks like a book): Allows you to provide general instructions, in natural language, on how Genie will behave when asked a question by a user.
  * **SQL Queries** (the icon that looks like a command prompt): Allows you to add example SQL queries, specific to the associated dataset(s), for Genie to learn from.
* **Monitoring** (the icon that looks like an eye): Allows you to review what questions were asked, who asked them, and how they were rated by the user.
* **Share** (the icon that looks like a lock): Allows you to set the share permissions and share the Genie space with end users. 

Under the kebab menu, you'll find:
* **Benchmarks** (the icon that looks like a graduate's mortarboard): Allows you to define a suite of questions that you run on a recurring basis to ensure the space continues to give good answers to the most important user questions.

Additionally, this menu gives you the options to clone the Genie space or delete it by moving it to the trash.

### Creating a Genie Space from a Dashboard
Alternatively, you can create a Genie Space directly from a dashboard. 

- Navigate to **Dashboards** and select the dashboard you created in the previous lab (Retail Dashboard).
- Switch to the **Draft** view for the dashboard.
- Click on **Publish** to open up the publishing dialog box.
- From this window, select the toggle for **Genie**. (Note this feature is in Beta currently.)
    - You will be given the option to select "Auto-generate Genie space" or "Link existing Genie space."
    - For this exercise, select "Auto-generate Genie space" and click **Publish**.
- Navigate to the published version of your Dashboard. 
- Select the **Ask Genie** option in the upper left corner. This opens a pop-up chat box on top of the dashboard. You can use the kebab menu to access the settings to dock the Genie chat to the side of the screen.
- Ask the following question in the chatbox.

    > _What tables are there and how are they connected? Give me a short summary._

- Review the response provided by Genie. 


# Lab - Data Warehousing Lab

This lab will guide you through creating a complete pipeline in Databricks, leveraging Delta Lake, data ingestion techniques, transformations, dashboards, and Databricks Genie. The goal is to give you hands-on experience with the Databricks platform.

**Learning Objectives**

By the end of this lab, you will:
- Create Delta tables and explore Delta Lake features like Time Travel and Version History.
- Perform data ingestion using the upload UI.
- Clean and transform datasets into Bronze, Silver, and Gold layers.
- Visualize insights using Databricks Dashboards.
- Leverage Databricks Genie for data exploration and analysis.


## Creating Delta Tables and Exploring Delta Lake Features
In this task, you will learn how to create Delta tables and explore the advanced features of Delta Lake.


### Create the `sales_table` Delta Table
Follow these steps to create a Delta table from a CSV file and explore its features:
- Create the Delta table by reading data from the CSV file.
- Verify the table creation by selecting a sample of the data.

In [0]:
%sql
---- Drop the table if it already exists for lab purposes
DROP TABLE IF EXISTS sales_table;

---- Create a Delta table using the CSV file
CREATE TABLE sales_table USING DELTA
AS
SELECT *
FROM read_files(
  '/Volumes/databricks_simulated_retail_customer_data/v01/source_files/sales.csv',
  <FILL_IN>
);

---- Select from the newly created table
<FILL_IN>

### Enable Column Mapping and Modify the Table
In this step, you will enable column mapping on the `sales_table` Delta table and then modify its schema. Delta Lake requires column mapping before you can rename or drop columns, so it must be enabled before the column is dropped. Follow these steps:

- **Enable Column Mapping:**

  Set the table properties to enable column mapping. With column mapping enabled, you can rename and drop columns as metadata-only changes, without rewriting the underlying data files.

- **Drop Unnecessary Columns:**

  Remove the `_rescued_data` column, which `read_files` adds to capture data that does not fit the inferred schema and which is not needed for further analysis in this lab.

In [0]:
%sql
---- Enable column mapping on the Delta table
ALTER TABLE sales_table SET TBLPROPERTIES (
   'delta.minReaderVersion' = <FILL_IN>,
   'delta.minWriterVersion' = <FILL_IN>,
   'delta.columnMapping.mode' = <FILL_IN>
);

---- Drop the column after enabling column mapping
ALTER TABLE sales_table DROP COLUMNS (<FILL_IN>);

- **Add and Update a New Column:**

  Add a new column named `discount_code` to the table schema and populate it with values based on conditions. In this step:

    - Assign `Discount_20%` to rows where the `product_category` is `'Ramsung'`.
    - Assign `N/A` to all other rows.

In [0]:
%sql
---- Alter the table by adding a new column
ALTER TABLE sales_table ADD COLUMNS (discount_code STRING);

---- Update the newly added column with data
UPDATE sales_table
SET discount_code = CASE
  WHEN product_category = <FILL_IN> THEN <FILL_IN>
  ELSE 'N/A'
END;

- **View Table History:**
  
  Use the `DESCRIBE HISTORY` command to view the version history of the table.

In [0]:
%sql
---- Display the history of changes made to the sales_table
<FILL_IN>

### Restore the Table Using Time Travel

Delta Lake's time travel feature allows you to access and restore previous versions of a Delta table. This is useful for scenarios such as data recovery, debugging, or auditing changes.
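
As a rough mental model of how version history and restore fit together, the following Python sketch treats a table as an append-only list of snapshots. The `VersionedTable` class and its methods are hypothetical names invented for this illustration, not Delta Lake APIs. The sketch only mirrors two behaviours: every write creates a new version, and a restore is itself recorded as a new version, so no history is deleted.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy model: every write is kept as a numbered snapshot."""

    versions: list = field(default_factory=list)  # versions[i] = (operation, rows)

    def commit(self, operation, rows):
        # Each write appends a new snapshot; earlier snapshots are never modified.
        self.versions.append((operation, list(rows)))
        return len(self.versions) - 1  # version number of this commit

    def read(self, version=None):
        # Reading without a version returns the latest snapshot.
        index = len(self.versions) - 1 if version is None else version
        return list(self.versions[index][1])

    def restore(self, version):
        # Restoring keeps history intact: the old rows are committed as a new version.
        return self.commit(f"RESTORE to v{version}", self.read(version))

    def history(self):
        return [(number, operation) for number, (operation, _) in enumerate(self.versions)]


table = VersionedTable()
table.commit("CREATE", [{"id": 1}, {"id": 2}])                  # version 0
table.commit("UPDATE", [{"id": 1}, {"id": 2, "discount": 20}])  # version 1
restored_version = table.restore(0)                             # version 2

print(table.history())
print(table.read() == table.read(0))
```

After the restore, the toy history lists three versions and the latest snapshot matches version 0. On a real table, `DESCRIBE HISTORY` run after `RESTORE TABLE` should show a comparable extra entry for the restore operation.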

In this subtask, you will restore the `sales_table` Delta table to a specific version using the `RESTORE TABLE` command.

In [0]:
%sql
---- Restore the sales_table to a previous version
<FILL_IN>

## Data Ingestion Techniques
In this task, you will learn how to ingest data into Databricks using the UI. This includes downloading a dataset, uploading it to your schema, and creating a Delta table.

### Uploading Data and Creating a Delta Table using UI

- Download the `customers.csv` data file by following [this link](/ajax-api/2.0/fs/files/Volumes/dbacademy_retail/v01/source_files/customers.csv). This will download the CSV file to your browser's download folder.
- Using the [Catalog Explorer](/explore/data/dbacademy) user interface, create a table named *customers_ui* in your schema from the file you just downloaded.

- **Verify the Table Creation**

  After successfully creating the Delta table, you can verify its creation and view a sample of the data by following these steps:

Use the `SHOW TABLES` command to display all tables in the current schema and confirm that `customers_ui` exists.

In [0]:
%sql
---- Show all tables in the current Schema
SHOW TABLES;

Use the `SELECT` statement to retrieve and display the first 10 records from the `customers_ui` table to ensure the data has been ingested correctly.


In [0]:
%sql
---- Display the first 10 records from the customers_ui table
<FILL_IN>

### Create Table as Select (CTAS)

In this step, we create the `customers_ui_bronze` Delta table by selecting all columns from `customers_ui` and adding three derived columns: `first_touch_date`, `updated`, and `source_file`.
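
The `first_touch_date` expression in the cell that follows divides `valid_from` by `1e6` before casting it to a timestamp. Spark treats a number cast to `TIMESTAMP` as seconds since the Unix epoch, so the division suggests that `valid_from` is stored in microseconds. That unit is an inference from the code, not something this lab states. The Python sketch below reproduces the arithmetic with a hypothetical sample value; it assumes UTC, whereas Spark applies the session time zone.

```python
from datetime import datetime, timezone


def first_touch_date(valid_from_micros):
    """Convert epoch microseconds to a calendar date (UTC assumed)."""
    seconds = valid_from_micros / 1e6  # same scaling as `valid_from / 1e6` in the SQL
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


# Hypothetical sample value: midnight UTC on 2019-08-01, expressed in microseconds.
sample_valid_from = 1_564_617_600_000_000
print(first_touch_date(sample_valid_from))
```

Without the division, the same value would be read as seconds and land tens of millions of years in the future, which is a quick way to spot a unit mismatch when checking the Bronze table.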

In [0]:
%sql
---- Drop the customers_ui_bronze table if it already exists
DROP TABLE IF EXISTS customers_ui_bronze;
---- Create a new Delta table
CREATE TABLE <FILL_IN>
SELECT *, 
  CAST(CAST(valid_from / 1e6 AS TIMESTAMP) AS DATE) AS first_touch_date, 
  CURRENT_TIMESTAMP() AS updated,
  _metadata.file_name AS source_file
FROM customers_ui;

---- Verify the data in the newly created table
<FILL_IN>;

## Data Transformation
In this task, you will transform the data in your Delta tables to create the Silver and Gold tables. These transformations will clean, enrich, and join the data to provide valuable insights for analytics and reporting.

### Create the Silver Table

The Silver table represents a refined layer with cleaned and enriched data derived from the Bronze table. 

Follow these steps:
- Transform the `customers_ui_bronze` table to clean and enrich the data.
- Create a new column, `loyalty_level`, that categorizes customers based on their loyalty segment.
- Save the results as the `customers_ui_silver` table.
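
The `loyalty_level` rule used in this step can be sketched in Python as shown below. The function name is hypothetical and `None` stands in for SQL `NULL`. The point of the sketch is that the `ELSE` branch catches every value other than 1 and 2, including missing ones, because a comparison with `NULL` never evaluates to true.

```python
def loyalty_level(loyalty_segment):
    """Mirror the lab's CASE expression; None stands in for SQL NULL."""
    if loyalty_segment == 1:
        return "High"
    if loyalty_segment == 2:
        return "Medium"
    # Everything else, including 0, 3 and missing values, falls through to here.
    return "Low"


for segment in (1, 2, 3, 0, None):
    print(segment, "->", loyalty_level(segment))
```

If customers with a missing segment should not be reported as `Low`, one option would be an additional `WHEN loyalty_segment IS NULL` branch in the `CASE` expression.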

In [0]:
%sql
---- Create or replace the Silver table
CREATE OR REPLACE TABLE customers_ui_silver AS
SELECT 
  <FILL_IN> 
  loyalty_segment, ---- Selecting relevant columns from the Bronze table.
  CASE 
    WHEN loyalty_segment = 1 THEN 'High'
    WHEN loyalty_segment = 2 THEN 'Medium'
    ELSE 'Low'
  END AS loyalty_level  ---- Adding a new column, loyalty_level, based on the loyalty_segment values.
FROM customers_ui_bronze;

---- Verify the Silver table
<FILL_IN>

### Create the Gold Table

The Gold table represents a business insights layer, created by joining the Silver table with the `sales_table`.

Follow these steps:

- Join the `customers_ui_silver` table with the `sales_table` on the `customer_id` column.
- Select key metrics and dimensions required for analytics and save the result as the `customers_ui_gold` table.
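
To make the effect of the join concrete, here is a minimal Python sketch of inner-join behaviour on `customer_id`. The sample rows and values are hypothetical and far smaller than the lab tables. It shows that a customer with several sales produces several Gold rows, while customers without sales and sales without a matching customer are left out.

```python
# Hypothetical sample rows; the real tables have more columns and rows.
customers = [
    {"customer_id": 1, "customer_name": "Ada", "loyalty_level": "High"},
    {"customer_id": 2, "customer_name": "Ben", "loyalty_level": "Low"},
]
sales = [
    {"customer_id": 1, "product_name": "Phone", "total_price": 300},
    {"customer_id": 1, "product_name": "Case", "total_price": 20},
    {"customer_id": 3, "product_name": "Cable", "total_price": 10},
]

# Index customers by key so each sale can look up its customer.
customers_by_id = {c["customer_id"]: c for c in customers}

# Inner join: keep a sale only when its customer_id exists on both sides.
gold = [
    {**customers_by_id[s["customer_id"]], **s}
    for s in sales
    if s["customer_id"] in customers_by_id
]

for row in gold:
    print(row)
```

Comparing the row count of the Gold table with the row count of `sales_table` is a simple check for sales that did not find a matching customer.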

In [0]:
%sql
---- Create or replace the Gold table
CREATE OR REPLACE TABLE <FILL_IN> AS
SELECT 
  c.customer_id,
  c.customer_name,
  <FILL_IN>,
  s.product_category,
  s.product_name,
  s.total_price,
  <FILL_IN>
FROM customers_ui_silver c
JOIN sales_table s ---- Joining the customers_ui_silver table with the sales_table on the customer_id column.
ON c.customer_id = s.customer_id; ---- Selecting key attributes from both tables to create a comprehensive insights layer.

---- Verify the Gold table
<FILL_IN>

## Visualization with Dashboards
In this task, you will create a dashboard in Databricks to visualize insights derived from the Gold table. The task involves adding datasets, creating visualizations, and exploring the dashboard using Databricks Genie.

### Create a New Dashboard
Follow these steps to create a new dashboard:
* Navigate to **Dashboards** in the side navigation panel.
* Select **Create dashboard**. 
* At the top of the resulting screen, click on the Dashboard name and change it to **Customer_Sales Dashboard**.

### Adding Data to the Dashboard

To create visualizations, you need to associate datasets with the dashboard. Complete the following steps:

- Navigate to the **Data** tab in the dashboard.
- Use the **+ Select a table** button to add datasets. 
- Search for and select the **`customers_ui_gold`** table from *`workspace.default`* and click **Confirm**. The table will appear in your dataset list.

You can modify the SQL query associated with each dataset in the query editing panel to customize the data.

### Visualization - Combo Chart
Visualize the insights by creating a Combo Chart that displays total sales value and sales order counts over a three-month span.

**Steps to Create the Combo Chart:**

- In the **Data** tab, select the **+ Create from SQL** option.
- Enter and execute the following SQL query:

    ```sql
    SELECT customer_name, 
           total_price AS Total_Sales, 
           date_format(order_date, 'MM') AS Month, 
           product_category 
    FROM workspace.default.customers_ui_gold 
    WHERE order_date >= to_date('2019-08-01')
    AND order_date <= to_date('2019-10-31');
    ```

- Rename the query to **Three Month Sales** and save it.
- Switch to the **Canvas** tab and click **Add a visualization** at the bottom.
- Select the **Three Month Sales** dataset and choose the **Combo** chart as the visualization type.
- Configure the chart settings:
    - **X axis:** Month
    - **Bar:** Total_Sales (Rename to **Total Sales Value**)
    - **Line:** COUNT(`*`) (Rename to **Count of Sales Orders**)

- Enable **dual axis** from the Y-axis configuration menu.
- Change the left Y-axis format to **Currency ($)**.

This visualization lets you compare the number of sales orders with the total sales value for each month.
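
The two series in the chart correspond to two aggregations over the same rows, grouped by month. The Python sketch below illustrates this with hypothetical sample rows. It assumes the bar series sums `Total_Sales`, which the lab does not state explicitly; the line series counts rows, matching `COUNT(*)`.

```python
from collections import defaultdict

# Hypothetical rows shaped like the Three Month Sales dataset.
rows = [
    {"Month": "08", "Total_Sales": 100.0},
    {"Month": "08", "Total_Sales": 50.0},
    {"Month": "09", "Total_Sales": 75.0},
    {"Month": "10", "Total_Sales": 20.0},
    {"Month": "10", "Total_Sales": 30.0},
    {"Month": "10", "Total_Sales": 10.0},
]

total_sales_value = defaultdict(float)  # bar series (assumed to be a sum)
count_of_orders = defaultdict(int)      # line series, COUNT(*)

for row in rows:
    total_sales_value[row["Month"]] += row["Total_Sales"]
    count_of_orders[row["Month"]] += 1

for month in sorted(total_sales_value):
    print(month, total_sales_value[month], count_of_orders[month])
```

Because the summed values and the counts sit on very different scales, plotting them against separate Y axes keeps both series readable, which is why the steps enable the dual axis.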

### Creating a Genie Space from a Dashboard

Databricks Genie allows you to explore data directly from the dashboard in a conversational interface.

**Steps to Create a Genie Space:**

- Open the **Customer_Sales Dashboard** you created.
- Switch to the **Draft** view.
- Click the kebab menu (three vertical dots) in the upper-right corner and select **Open Draft Genie space**.
- In the chatbox, ask:

    > _What tables are there and how are they connected? Give me a short summary._

- Review the response provided by Genie to understand the data relationships and structure.

---

By completing this task, you have successfully created a visual dashboard to analyze business insights and leveraged Genie for exploratory analysis.

## Conclusion

Congratulations on completing the **Data Warehousing Lab**! Throughout this lab, you gained hands-on experience with Databricks by building and analyzing a complete data pipeline with Delta Lake, Databricks Dashboards, and Databricks Genie.