
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Exploring the Pipeline Event Logs

Delta Live Tables (DLT) uses the event log to store much of the information needed to manage, report on, and understand pipeline execution.

Below, we provide a number of useful queries to explore the event log and gain greater insight into your DLT pipelines.

In [0]:
%run ../Includes/Classroom-Setup-04.4

Python interpreter will be restarted.



Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineer-learning-path/v01"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| completed (3 seconds total)

Creating & using the schema "hamed_vaheb_jcxq_da_delp_pipeline_demo"...(5 seconds)
Loading batch 1 of 31...1 seconds
Predefined tables in "hamed_vaheb_jcxq_da_delp_pipeline_demo":
| __apply_changes_storage_customers_silver
| customer_counts_state
| customers_bronze
| customers_bronze_clean
| customers_silver
| orders_bronze
| orders_by_date
| orders_silver
| orderstable

Predefined paths variables:
| DA.paths.working_dir:      dbfs:/mnt/dbacademy-users/hamed.vaheb@pwc.lu/data-engineer-learning-path/pipeline_demo
| DA.paths.user_db:          dbfs:/mnt/dbacademy-users/hamed.vaheb@pwc.lu/data-engineer-learning-path/pipeline_demo/database.db
| DA.paths.datasets:         dbfs:/mnt/dbacademy-datasets/data-engineer-learning-path/v01
| DA.paths.storage_location: dbfs:/mnt/dbacademy

## Query Event Log
The event log is managed as a Delta Lake table with some of the more important fields stored as nested JSON data.
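As a rough illustration of what "nested JSON" means here, the sketch below walks a JSON string by key path in plain Python. The sample row, its values, and the helper name `get_nested` are all hypothetical; only the field names `event_type`, `timestamp`, `details`, `user_action`, `action`, and `user_name` come from the queries in this notebook.

```python
import json

# Hypothetical event-log row: top-level fields are plain values, while
# `details` arrives as a JSON string (the payload below is invented).
sample_event = {
    "event_type": "user_action",
    "timestamp": "2022-01-01T00:00:00.000Z",
    "details": '{"user_action": {"action": "START", "user_name": "someone@example.com"}}',
}


def get_nested(details_json, *keys):
    """Walk a JSON string by key path, returning None if any key is missing."""
    node = json.loads(details_json)
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


action = get_nested(sample_event["details"], "user_action", "action")
print(action)
```

This plays the same role as the `details:user_action:action` path syntax used in the SQL cells later in this notebook, just outside of Spark.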

The query below shows how simple it is to read this table and create a DataFrame and temporary view for interactive querying.

In [0]:
event_log_path = f"{DA.paths.storage_location}/system/events"

event_log = spark.read.format('delta').load(event_log_path)
event_log.createOrReplaceTempView("event_log_raw")

display(event_log)

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<command-1232511262303705> in <cell line: 3>()
      1 event_log_path = f"{DA.paths.storage_location}/system/events"
      2 
----> 3 event_log = spark.read.format('delta').load(event_log_path)
      4 event_log.createOrReplaceTempView("event_log_raw")
      5 

/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)

## Set Latest Update ID

In many cases, you may want information about the latest update (or the last N updates) to your pipeline.

We can capture the most recent update ID with a SQL query. First, we list the distinct event types recorded in the log.

In [0]:
%sql
SELECT DISTINCT event_type
FROM event_log_raw



In [0]:
latest_update_id = spark.sql("""
    SELECT origin.update_id
    FROM event_log_raw
    WHERE event_type = 'create_update'
    ORDER BY timestamp DESC LIMIT 1""").first().update_id

print(f"Latest Update ID: {latest_update_id}")

# Push the ID back into the Spark config so that later SQL cells can reference it as ${latest_update.id}.
spark.conf.set('latest_update.id', latest_update_id)
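Note that the cell above calls `.first().update_id` directly; `first()` returns `None` when the query yields no rows, so the attribute access would fail if no `create_update` event exists. The sketch below shows the same selection logic in plain Python with an explicit guard. The function name, the sample events, and their IDs are hypothetical, and it assumes timestamps are ISO-8601 strings in one consistent format.

```python
from typing import Optional


def latest_update_id(events: list) -> Optional[str]:
    """Return the update_id of the most recent create_update event, or None."""
    candidates = [e for e in events if e["event_type"] == "create_update"]
    if not candidates:
        return None
    # Assumption: ISO-8601 timestamps in one format and time zone sort
    # chronologically when compared as strings.
    newest = max(candidates, key=lambda e: e["timestamp"])
    return newest["origin"]["update_id"]


# Hypothetical events; the IDs and timestamps are invented for illustration.
events = [
    {"event_type": "create_update", "timestamp": "2022-01-01T10:00:00Z", "origin": {"update_id": "u-1"}},
    {"event_type": "flow_progress", "timestamp": "2022-01-01T10:05:00Z", "origin": {"update_id": "u-1"}},
    {"event_type": "create_update", "timestamp": "2022-01-02T09:00:00Z", "origin": {"update_id": "u-2"}},
]

print(latest_update_id(events))  # u-2
print(latest_update_id([]))      # None
```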



## Perform Audit Logging

Events related to running pipelines and editing configurations are captured as **`user_action`**.

Yours should be the only **`user_name`** for the pipeline you configured during this lesson.

In [0]:
%sql
SELECT *
FROM event_log_raw 
WHERE event_type = 'user_action'



In [0]:
%sql
SELECT timestamp, details:user_action:action, details:user_action:user_name
FROM event_log_raw 
WHERE event_type = 'user_action'



## Examine Lineage

DLT provides built-in lineage information showing how data flows through the tables in your pipeline.

While the query below only indicates the direct predecessors for each table, this information can easily be combined to trace data in any table back to the point it entered the lakehouse.

In [0]:
%sql
SELECT details:flow_definition
FROM event_log_raw 
WHERE event_type = 'flow_definition' AND 
      origin.update_id = '${latest_update.id}'



In [0]:
%sql
SELECT details:flow_definition.output_dataset, details:flow_definition.input_datasets 
FROM event_log_raw 
WHERE event_type = 'flow_definition' AND 
      origin.update_id = '${latest_update.id}'
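One way to combine the direct-predecessor information into a full upstream trace is a simple graph traversal. The sketch below assumes the query results have already been collected into a dictionary mapping each output dataset to a list of its input dataset names. That simplified shape, the function name `upstream_datasets`, and the specific edges are assumptions for illustration; the table names are borrowed from the setup output above, but their actual dependencies may differ.

```python
def upstream_datasets(direct_inputs, dataset):
    """Return every dataset reachable by following input edges from `dataset`."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in direct_inputs.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical edges (output dataset -> direct inputs), invented for illustration.
direct_inputs = {
    "orders_bronze": [],
    "orders_silver": ["orders_bronze"],
    "orders_by_date": ["orders_silver"],
    "customers_bronze": [],
    "customers_bronze_clean": ["customers_bronze"],
}

print(sorted(upstream_datasets(direct_inputs, "orders_by_date")))
# ['orders_bronze', 'orders_silver']
```

The `seen` set keeps the traversal from revisiting a dataset that feeds several downstream tables.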



## Examine Data Quality Metrics

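Finally, data quality metrics can be extremely useful for both long-term and short-term insights into your data.

Below, we first capture the metrics for each constraint recorded throughout the entire lifetime of the event log, then aggregate passing and failing record counts per expectation for the latest update.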

In [0]:
%sql
SELECT details:flow_progress:data_quality AS DataQuality
FROM event_log_raw
WHERE event_type = 'flow_progress' AND
      details:flow_progress:data_quality IS NOT NULL



In [0]:
%sql
SELECT row_expectations.dataset as dataset,
       row_expectations.name as expectation,
       SUM(row_expectations.passed_records) as passing_records,
       SUM(row_expectations.failed_records) as failing_records
FROM
  (SELECT explode(
            from_json(details:flow_progress:data_quality:expectations,
                      "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>")
          ) row_expectations
   FROM event_log_raw
   WHERE event_type = 'flow_progress' AND 
         origin.update_id = '${latest_update.id}'
  )
GROUP BY row_expectations.dataset, row_expectations.name
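To make the explode-then-aggregate pattern in the query above more concrete, here is a minimal plain-Python sketch of the same idea: each payload is a JSON array of expectation records, which is flattened and summed per (dataset, expectation name). The function name `summarize_expectations`, the expectation name `valid_id`, and the record counts are hypothetical; the four field names match the schema string passed to `from_json`.

```python
import json
from collections import defaultdict


def summarize_expectations(expectation_payloads):
    """Sum passed/failed record counts per (dataset, expectation name)."""
    totals = defaultdict(lambda: {"passing_records": 0, "failing_records": 0})
    for payload in expectation_payloads:
        # Each payload is a JSON array; iterating it mirrors SQL explode().
        for row in json.loads(payload):
            key = (row["dataset"], row["name"])
            totals[key]["passing_records"] += row["passed_records"]
            totals[key]["failing_records"] += row["failed_records"]
    return dict(totals)


# Hypothetical expectation payloads; names and counts are invented.
payloads = [
    '[{"name": "valid_id", "dataset": "customers_bronze_clean", "passed_records": 90, "failed_records": 10}]',
    '[{"name": "valid_id", "dataset": "customers_bronze_clean", "passed_records": 45, "failed_records": 5}]',
]

summary = summarize_expectations(payloads)
print(summary[("customers_bronze_clean", "valid_id")])
```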



&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>