# DQX - Use as library demo

This demo shows how to create and apply a set of data quality rules, defined either as a Python object or as YAML configuration.

**Note.**
This notebook runs without modification when using the `VS Code Databricks Extension`.

### Install DQX

In [None]:
%pip install databricks-labs-dqx==0.5.0
%restart_python

### Import Required Libraries

In [None]:
import yaml
from databricks.labs.dqx.engine import DQEngineCore, DQEngine
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession, Row

### Configure Test Data

This step creates `new_users_df`, a dataset of new users that requires quality validation.

In [None]:
spark = SparkSession.builder.appName("DQX_demo_library").getOrCreate()

# Create a sample DataFrame representing the new users dataset
new_users_sample_data = [
    Row(id=1, age=23, country='Germany'),
    Row(id=2, age=30, country='France'),
    Row(id=3, age=16, country='Germany'),    # Invalid: age below 18
    Row(id=None, age=29, country='France'),  # Invalid: id is NULL
    Row(id=4, age=29, country=''),           # Invalid: country is empty
    Row(id=5, age=23, country='Italy'),      # Invalid: country not in the allowed list
    Row(id=6, age=123, country='France'),    # Invalid: age above 120
]

new_users_df = spark.createDataFrame(new_users_sample_data)

### Demoing Functions
- `is_not_null_and_not_empty`
- `is_in_range`
- `is_in_list`

The built-in checks/rules are defined in [check_funcs.py](https://github.com/databrickslabs/dqx/blob/98c3ef12d7b2ce3c1cc6c2bc3f9643a27876d817/src/databricks/labs/dqx/check_funcs.py).
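
To make the intent of the three functions concrete, here is a minimal plain-Python sketch that approximates them on the sample rows. It is not the DQX implementation: the helper names (`is_missing`, `out_of_range`, `not_in_list`, `failed_checks`) are hypothetical, and it assumes inclusive range limits and that the range and list checks skip `None` values.

```python
# Plain-Python stand-ins for the three checks; NOT the DQX implementations.
rows = [
    {"id": 1, "age": 23, "country": "Germany"},
    {"id": 2, "age": 30, "country": "France"},
    {"id": 3, "age": 16, "country": "Germany"},
    {"id": None, "age": 29, "country": "France"},
    {"id": 4, "age": 29, "country": ""},
    {"id": 5, "age": 23, "country": "Italy"},
    {"id": 6, "age": 123, "country": "France"},
]


def is_missing(value):
    """True when the value is None or an empty string."""
    return value is None or value == ""


def out_of_range(value, min_limit, max_limit):
    """True when a non-null value lies outside [min_limit, max_limit] (assumed inclusive)."""
    return value is not None and not (min_limit <= value <= max_limit)


def not_in_list(value, allowed):
    """True when a non-null value is not one of the allowed values."""
    return value is not None and value not in allowed


def failed_checks(row):
    """Collect the names of the checks that this row fails."""
    failures = []
    for column in ("id", "age", "country"):
        if is_missing(row[column]):
            failures.append(f"{column}: is_not_null_and_not_empty")
    if out_of_range(row["age"], 18, 120):
        failures.append("age: is_in_range")
    if not_in_list(row["country"], ["Germany", "France"]):
        failures.append("country: is_in_list")
    return failures


# Map each row position to the checks it fails (empty list means the row is clean).
report = {index: failed_checks(row) for index, row in enumerate(rows)}
for index, failures in report.items():
    print(index, rows[index], failures or "ok")
```

Under these assumptions the first two rows are clean, and the row with an empty `country` fails two checks at once: it is empty, and the empty string is not in the allowed list.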

We demonstrate two ways to define a `Checks` array:
- **checks_from_object**: a Python list of dictionaries that can be converted to `list[DQRule]`
- **checks_from_yaml**: a YAML definition that, once parsed, can be used directly by `DQEngine`

We then call `validate_checks` on each definition to make sure the configuration is correct.
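
As a rough illustration of the kind of structural problem a validation step can catch, the sketch below inspects a list of check dictionaries in plain Python. `find_structure_errors` is a hypothetical helper, not part of DQX, and the rules it applies (a `check` mapping with a `function`, and a `criticality` of `warn` or `error` that is assumed to default to `error` when omitted) are simplifying assumptions rather than the library's actual logic.

```python
ALLOWED_CRITICALITY = {"warn", "error"}


def find_structure_errors(checks):
    """Return human-readable problems; an empty list means the structure looks fine."""
    errors = []
    for position, entry in enumerate(checks):
        check = entry.get("check")
        if not isinstance(check, dict):
            errors.append(f"entry {position}: missing 'check' mapping")
            continue
        if not check.get("function"):
            errors.append(f"entry {position}: 'check' has no 'function'")
        # Assumption: an omitted criticality is treated as "error".
        criticality = entry.get("criticality", "error")
        if criticality not in ALLOWED_CRITICALITY:
            errors.append(f"entry {position}: unknown criticality {criticality!r}")
    return errors


good_checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_in_range",
            "for_each_column": ["age"],
            "arguments": {"min_limit": 18, "max_limit": 120},
        },
    },
]

bad_checks = [
    {"criticality": "fatal", "check": {"function": "is_in_range"}},  # unknown level
    {"check": {"arguments": {}}},                                    # no function
]

print(find_structure_errors(good_checks))
print(find_structure_errors(bad_checks))
```

The real `validate_checks` call may verify more than this, so treat the sketch only as a mental model of what "well-formed" means for a check definition.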

In [None]:
checks_from_object = [
        {
            # criticality sits next to "check", as in the other entries
            "criticality": "warn",
            "check": {
                "function": "is_not_null_and_not_empty",
                "for_each_column": ["id", "age", "country"],
                "arguments": {},
            },
            "user_metadata": {"check_type": "completeness", "check_owner": "someone@email.com"},
        },
        {
            "criticality": "error",
            "check": {
              "function": "is_in_range", 
              "for_each_column": ["age"],
              "arguments": {
                "min_limit": 18,
                "max_limit": 120
                }
            },
        },
        {
            "criticality": "error",
            "check": {
              "function": "is_in_list", 
              "for_each_column": ["country"],
              "arguments": {
                "allowed": ["Germany", "France"]
                }
            },
        },
    ]

# Validate object checks
status = DQEngine.validate_checks(checks_from_object)
print(f"Checks from Object: {status}")

checks_from_yaml = yaml.safe_load("""
- criticality: error
  check:
    function: is_not_null_and_not_empty
    for_each_column:
      - id
      - age
      - country
- criticality: warn
  check:
    function: is_in_range
    for_each_column:
      - age
    arguments:
      min_limit: 18
      max_limit: 120
- criticality: warn
  check:
    function: is_in_list
    for_each_column:
      - country
    arguments:
      allowed:
        - Germany
        - France
""")

# Validate YAML checks
status = DQEngine.validate_checks(checks_from_yaml)
print(f"Checks from YAML: {status}")

### Set Up `DQEngine`

In [None]:
ws = WorkspaceClient()
dq_engine = DQEngine(ws)

### Apply Object Rules

In [None]:
object_rules = DQEngineCore.build_checks_by_metadata(checks_from_object)
object_rules_valid_rows_df, object_rules_invalid_rows_df = dq_engine.apply_checks_and_split(new_users_df, object_rules)

In [None]:
object_rules_valid_rows_df.show()

In [None]:
object_rules_invalid_rows_df.show()
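
When reading the two DataFrames, it helps to keep in mind how criticality affects the split. As the DQX documentation describes it, rows that only trigger `warn` checks stay in the valid output and are also reported in the invalid output, while rows that trigger an `error` check appear only in the invalid output. The sketch below models that routing in plain Python. `split_rows` is a hypothetical helper, and the failure lists assume the completeness check is `warn` and the range and list checks are `error`, which is what the object-based definition intends.

```python
def split_rows(rows, failed_criticalities):
    """Route rows into (valid, invalid) from the criticalities of their failed checks."""
    valid, invalid = [], []
    for row, failed in zip(rows, failed_criticalities):
        if "error" not in failed:
            valid.append(row)    # clean rows and warn-only rows remain usable
        if failed:
            invalid.append(row)  # any failure, warn or error, is reported
    return valid, invalid


# Row ids from the sample data, with the criticalities each row is expected to trigger.
ids = [1, 2, 3, None, 4, 5, 6]
failed = [
    [],                 # id=1: clean
    [],                 # id=2: clean
    ["error"],          # id=3: age out of range
    ["warn"],           # id=None: missing id
    ["warn", "error"],  # id=4: empty country, also not in the allowed list
    ["error"],          # id=5: country not in the allowed list
    ["error"],          # id=6: age out of range
]

valid_ids, invalid_ids = split_rows(ids, failed)
print("valid:", valid_ids)
print("invalid:", invalid_ids)
```

In this model the row with the missing `id` lands in both outputs, because its only failure is a warning.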

### Apply YAML Rules

In [None]:
yaml_rules_valid_rows, yaml_rules_invalid_rows = dq_engine.apply_checks_by_metadata_and_split(new_users_df, checks_from_yaml)

In [None]:
yaml_rules_valid_rows.show()

In [None]:
yaml_rules_invalid_rows.show()