![./ImageLab.png](./Images/ImageLab.png "./ImageLab.png")

# Data Engineering with Lakeflow, Jobs, Auto Loader, and More

## DQX for Data Quality

In [0]:
# Install the DQX library (%pip keeps the install scoped to this notebook)
%pip install databricks-labs-dqx
dbutils.library.restartPython()

In [0]:
# Initializing DQX
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
dq = DQEngine(ws)

In [0]:
# Definition of data quality rules
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import InputConfig, OutputConfig
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.rule import DQRowRule, DQDatasetRule, DQForEachColRule
import yaml

# Checks with YAML
checks = yaml.safe_load("""
- check:
    function: is_not_null_and_not_empty
    for_each_column: [customer_bk, customer_name]
  criticality: warn
  user_metadata:
    check_category: completeness
    responsible_data_steward: daniel.baraldi@databricks.com
- check:
    function: is_in_list
    for_each_column: [region]
    arguments: { allowed: ['AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS', 'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP', 'SE', 'TO'] }
  criticality: warn
  user_metadata:
    check_category: consistency
    responsible_data_steward: daniel.baraldi@databricks.com
- check:
    function: is_unique
    arguments: { columns: [customer_bk] }
  criticality: warn
  user_metadata:
    check_category: uniqueness
    responsible_data_steward: daniel.baraldi@databricks.com
""")

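# Example using Python rule classes instead of YAML metadata.
# A DQRowRule targets a single column, so DQForEachColRule is used here to
# expand the same check over several columns. Rule objects are applied with
# dq.apply_checks_and_split rather than the *_by_metadata methods.
# checks = [
#   *DQForEachColRule(
#     name="col_not_null_or_empty",
#     criticality="error",
#     check_func=check_funcs.is_not_null_and_not_empty,
#     columns=["customer_bk", "customer_name"],
#     user_metadata={
#       "check_type": "completeness",
#       "responsible_data_steward": "daniel.baraldi@databricks.com"
#     },
#   ).get_rules(),
#   # ...
# ]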



# Input and output definitions
# (named input_config to avoid shadowing Python's built-in input())
input_config = InputConfig(location="medallion.bronze.dim_customer")
output_config = OutputConfig(location="medallion.bronze.dim_customer_dq_output", mode="overwrite")
quarantine_config = OutputConfig(location="medallion.bronze.dim_customer_dq_quarantine", mode="overwrite")

# Option 1: read the input, apply the checks, and save the results manually
# df = spark.table(input_config.location)
# valid_df, invalid_df = dq.apply_checks_by_metadata_and_split(df, checks)
# dq.save_results_in_table(output_df=valid_df, quarantine_df=invalid_df,
#                          output_config=output_config, quarantine_config=quarantine_config)

# Option 2: apply the checks to the input and save the results in a single call
dq.apply_checks_by_metadata_and_save_in_table(
    checks=checks,
    input_config=input_config,
    output_config=output_config,
    quarantine_config=quarantine_config,
)
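
Before handing metadata-style checks to the engine, it can help to review what they will do. The sketch below is a plain-Python illustration and not part of DQX: `summarize_checks` and `ALLOWED_CRITICALITY` are hypothetical names, and the assumption that a missing `criticality` defaults to `error` should be confirmed against the DQX documentation. It lists each check's function, target columns, and criticality, and flags entries that look malformed.

```python
# Hypothetical helper (not a DQX API): summarize metadata-style checks
# and collect obvious problems before they reach the engine.
ALLOWED_CRITICALITY = {"warn", "error"}


def summarize_checks(checks):
    """Return (summary, problems) for a list of check dictionaries."""
    summary, problems = [], []
    for i, entry in enumerate(checks):
        check = entry.get("check", {})
        function = check.get("function")
        # Columns come from for_each_column or, failing that, arguments.columns.
        columns = check.get("for_each_column") or check.get("arguments", {}).get("columns", [])
        # Assumption: an omitted criticality is treated as "error".
        criticality = entry.get("criticality", "error")
        if not function:
            problems.append(f"check {i}: missing 'function'")
        if criticality not in ALLOWED_CRITICALITY:
            problems.append(f"check {i}: unknown criticality {criticality!r}")
        summary.append((function, tuple(columns), criticality))
    return summary, problems


# Two entries shaped like the YAML above; the second has a deliberate typo.
example_checks = [
    {"check": {"function": "is_not_null_and_not_empty",
               "for_each_column": ["customer_bk", "customer_name"]},
     "criticality": "warn"},
    {"check": {"function": "is_unique",
               "arguments": {"columns": ["customer_bk"]}},
     "criticality": "warning"},
]
summary, problems = summarize_checks(example_checks)
for row in summary:
    print(row)
print(problems)
```

The same function can be called on the `checks` list loaded with `yaml.safe_load`, since that list has the same dictionary structure.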

In addition, you can create alerts for data quality issues, as shown below.

![Alert.png](./Images/Alert.png "Alert.png")

![Alert2.png](./Images/Alert2.png "Alert2.png")
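
The condition behind such an alert can be sketched as a simple threshold on the share of flagged rows. This is an illustration only: `failure_ratio`, `should_alert`, and the 5% threshold are assumptions, not settings taken from the alerts pictured above. How the flagged rows are counted also depends on criticality. As far as I understand DQX, `warn`-level failures stay in the output table with their warnings recorded in a reporting column, while `error`-level failures are routed to the quarantine table, so the counting query should be adapted accordingly.

```python
# Hypothetical alert condition (names and threshold are illustrative).
def failure_ratio(flagged_rows, total_rows):
    """Share of rows that failed at least one check; 0.0 for an empty table."""
    return flagged_rows / total_rows if total_rows else 0.0


def should_alert(flagged_rows, total_rows, threshold=0.05):
    """True when the share of flagged rows is strictly above the threshold."""
    return failure_ratio(flagged_rows, total_rows) > threshold


# On Databricks the counts would come from the result tables, for example
# spark.table("medallion.bronze.dim_customer_dq_output").count().
print(should_alert(flagged_rows=10, total_rows=100))
print(should_alert(flagged_rows=5, total_rows=100))
```

A strict comparison is used so that a table sitting exactly at the threshold does not trigger; switch to `>=` if the opposite behavior is preferred.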