# DQX - Use as library demo

This demo shows how to define a set of quality rules in YAML and apply them to a DataFrame.

**Note.**
This notebook can be executed without any modifications when using the `VS Code Databricks Extension`.

### Install DQX

In [None]:
%pip install databricks-labs-dqx
%restart_python

### Import Required Modules

In [None]:
import yaml
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession, Row

### Configure Test Data

The result of this step is `new_users_df`, a DataFrame of new users that requires quality validation.

In [None]:
spark = SparkSession.builder.appName("DQX_demo_library").getOrCreate()

# Create a sample DataFrame representing a table of new users
new_users_sample_data = [
    Row(id=1, age=23, country='Germany'),
    Row(id=2, age=30, country='France'),
    Row(id=3, age=16, country='Germany'),    # Invalid -> age is less than 18
    Row(id=None, age=29, country='France'),  # Invalid -> id is NULL
    Row(id=4, age=29, country=''),           # Invalid -> country is empty
    Row(id=5, age=23, country='Italy'),      # Invalid -> country is not in the allowed list
    Row(id=6, age=123, country='France')     # Invalid -> age is greater than 120
]

new_users_df = spark.createDataFrame(new_users_sample_data)

### Demoing Functions
- `is_not_null_and_not_empty`
- `is_in_range`
- `is_in_list`

You can find documentation of all built-in quality checks [here](https://databrickslabs.github.io/dqx/docs/reference/quality_rules/).

The set of `Quality Checks` below is defined declaratively in YAML.

We can use `validate_checks` to verify that the definitions are well formed before applying them.

In [None]:
checks_from_yaml = yaml.safe_load("""
- check:
    function: is_not_null_and_not_empty
    for_each_column:
      - id
      - age
      - country
    criticality: error
- check:
    function: is_in_range
    for_each_column:
      - age
    criticality: warn
    arguments:
      min_limit: 18
      max_limit: 120
- check:
    function: is_in_list
    for_each_column:
      - country
    criticality: warn
    arguments:
      allowed:
        - Germany
        - France
""")

# Validate YAML checks
status = DQEngine.validate_checks(checks_from_yaml)
print(f"Checks from YAML: {status}")
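To give a feel for what validating a check definition involves, here is a minimal plain-Python sketch. It is not how DQX implements `validate_checks`: the helper `find_definition_problems`, the `KNOWN_FUNCTIONS` set, and the rules it enforces are hypothetical, chosen only to mirror the fields used in the YAML above.

```python
# Hypothetical structural validator; NOT the DQX implementation of validate_checks.
KNOWN_FUNCTIONS = {"is_not_null_and_not_empty", "is_in_range", "is_in_list"}
ALLOWED_CRITICALITY = {"error", "warn"}


def find_definition_problems(checks):
    """Return a list of human-readable problems found in the check definitions."""
    problems = []
    for index, entry in enumerate(checks):
        check = entry.get("check") if isinstance(entry, dict) else None
        if not isinstance(check, dict):
            problems.append(f"check #{index}: missing 'check' mapping")
            continue
        function = check.get("function")
        if function not in KNOWN_FUNCTIONS:
            problems.append(f"check #{index}: unknown function {function!r}")
        criticality = check.get("criticality")
        if criticality not in ALLOWED_CRITICALITY:
            problems.append(f"check #{index}: invalid criticality {criticality!r}")
        columns = check.get("for_each_column")
        if not isinstance(columns, list) or not columns:
            problems.append(f"check #{index}: 'for_each_column' must be a non-empty list")
    return problems


# A well-formed definition, shaped like the output of yaml.safe_load above.
good_checks = [
    {"check": {"function": "is_in_range", "for_each_column": ["age"],
               "criticality": "warn",
               "arguments": {"min_limit": 18, "max_limit": 120}}},
]
# Two deliberate mistakes: a misspelled function and an unsupported criticality.
bad_checks = [
    {"check": {"function": "is_in_rang", "for_each_column": ["age"],
               "criticality": "warning"}},
]

good_problems = find_definition_problems(good_checks)
bad_problems = find_definition_problems(bad_checks)
print(good_problems)
print(bad_problems)
```

Running a check like this before applying the rules surfaces typos in the YAML early, rather than as a failure in the middle of a Spark job.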

### Setup `DQEngine`

In [None]:
ws = WorkspaceClient()
dq_engine = DQEngine(ws)

### Apply Rules
`apply_checks_by_metadata` results in one `DataFrame` with `_errors` and `_warnings` metadata columns added.

In [None]:
validated_df = dq_engine.apply_checks_by_metadata(new_users_df, checks_from_yaml)
validated_df.show()
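When reading the output, it can help to know roughly which rows should be flagged. The following plain-Python approximation evaluates the three checks on the same sample rows. It is a sketch, not DQX code: the `evaluate` helper and the message strings are hypothetical, the range bounds are assumed to be inclusive (consistent with the comments on the sample data), and skipping null values in the two `warn` checks is an assumption.

```python
# Plain-Python approximation of the three checks; NOT DQX code.
rows = [
    {"id": 1, "age": 23, "country": "Germany"},
    {"id": 2, "age": 30, "country": "France"},
    {"id": 3, "age": 16, "country": "Germany"},
    {"id": None, "age": 29, "country": "France"},
    {"id": 4, "age": 29, "country": ""},
    {"id": 5, "age": 23, "country": "Italy"},
    {"id": 6, "age": 123, "country": "France"},
]


def evaluate(row):
    """Return (errors, warnings) for one row, mirroring the YAML criticalities."""
    errors, warnings = [], []
    # is_not_null_and_not_empty on id, age, country (criticality: error)
    for column in ("id", "age", "country"):
        if row[column] is None or row[column] == "":
            errors.append(f"{column} is null or empty")
    # is_in_range on age (criticality: warn); bounds assumed inclusive
    age = row["age"]
    if age is not None and not 18 <= age <= 120:
        warnings.append("age not in range [18, 120]")
    # is_in_list on country (criticality: warn)
    country = row["country"]
    if country is not None and country not in ("Germany", "France"):
        warnings.append("country not in allowed list")
    return errors, warnings


results = [evaluate(row) for row in rows]
for row, (errors, warnings) in zip(rows, results):
    print(row, "errors:", errors, "warnings:", warnings)
```

Under these assumptions, two rows carry errors (the null `id` and the empty `country`) and four carry warnings. The row with the empty `country` carries both, because an empty string is also outside the allowed list.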

### Apply Rules And Split
`apply_checks_by_metadata_and_split` results in a `tuple[DataFrame, DataFrame]` with `_errors` and `_warnings` metadata columns added. The first `DataFrame` contains valid records, and the second contains invalid/quarantined records.

In [None]:
valid_records_df, invalid_records_df = dq_engine.apply_checks_by_metadata_and_split(new_users_df, checks_from_yaml)
valid_records_df.show()

In [None]:
invalid_records_df.show()
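The split can be pictured with a small plain-Python sketch. It rests on an assumption that should be verified against the two `show()` outputs above: rows with at least one error go only to the invalid set, while rows that have only warnings appear in both sets. The `outcomes` list and the `split` helper are hypothetical and stand in for the per-row results of the checks.

```python
# Hypothetical per-row outcomes for the sample data: (id, has_error, has_warning).
outcomes = [
    (1, False, False),
    (2, False, False),
    (3, False, True),     # age below 18 -> warning
    (None, True, False),  # id is NULL -> error
    (4, True, True),      # empty country -> error and warning
    (5, False, True),     # country not in list -> warning
    (6, False, True),     # age above 120 -> warning
]


def split(outcomes):
    """Split row ids into (valid, invalid) under the assumed routing rule."""
    # Assumption: "valid" means no errors; warning-only rows are kept.
    valid = [row_id for row_id, has_error, _ in outcomes if not has_error]
    # Assumption: "invalid" means at least one error or warning.
    invalid = [
        row_id
        for row_id, has_error, has_warning in outcomes
        if has_error or has_warning
    ]
    return valid, invalid


valid_ids, invalid_ids = split(outcomes)
print("valid:", valid_ids)
print("invalid:", invalid_ids)
```

If the actual output differs, for example because warning-only rows are routed to one side only, adjust the two list comprehensions accordingly.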