# DQX - Use as library demo

This demo shows how to create and apply a set of data quality rules, defined either as Python objects or as a YAML configuration.

**Note.**
This notebook can be executed without any modifications when using the `VS Code Databricks Extension`.

### Install DQX

In [1]:
%pip install databricks-labs-dqx
%restart_python

Collecting databricks-labs-dqx
  Obtaining dependency information for databricks-labs-dqx from https://files.pythonhosted.org/packages/62/02/dd1313073e0cdcaee1371d03a98038fd7e90cd48d76be6ef50b57de72587/databricks_labs_dqx-0.5.0-py3-none-any.whl.metadata
  Downloading databricks_labs_dqx-0.5.0-py3-none-any.whl.metadata (3.4 kB)
Collecting databricks-labs-blueprint<0.10,>=0.9.1 (from databricks-labs-dqx)
  Obtaining dependency information for databricks-labs-blueprint<0.10,>=0.9.1 from https://files.pythonhosted.org/packages/73/f7/4e77bdcd83fb5e53d79526f4532dd05af53e5dcbb2c2854ae536baecf133/databricks_labs_blueprint-0.9.3-py3-none-any.whl.metadata
  Downloading databricks_labs_blueprint-0.9.3-py3-none-any.whl.metadata (55 kB)
Collecting databricks-labs-lsql<0.15,>=0.5 (from databricks-labs-dqx)
  Obtaining dependency infor


### Import Required Libraries

In [2]:
import yaml
from databricks.labs.dqx.engine import DQEngineCore, DQEngine
from databricks.sdk import WorkspaceClient
from pyspark.sql import SparkSession, Row



### Configure Test Data

This step produces `new_users_df`, a dataset of new users that requires quality validation.

In [3]:
spark = SparkSession.builder.appName("DQX_demo_library").getOrCreate()

# Create a sample DataFrame representing the new-users dataset
new_users_sample_data = [
    Row(id=1, age=23, country='Germany'),
    Row(id=2, age=30, country='France'),
    Row(id=3, age=16, country='Germany'),    # Invalid: age is less than 18
    Row(id=None, age=29, country='France'),  # Invalid: id is NULL
    Row(id=4, age=29, country=''),           # Invalid: country is empty
    Row(id=5, age=23, country='Italy'),      # Invalid: country is not in the allowed list
    Row(id=6, age=123, country='France'),    # Invalid: age is greater than 120
]

new_users_df = spark.createDataFrame(new_users_sample_data)
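
The inline comments above mark which rows are expected to fail. As a rough cross-check, the sketch below restates those expectations in plain Python, without Spark or DQX. It is not how DQX evaluates checks: the helper `expected_failures`, the inclusive age bounds, and the choice to skip the range and list tests for missing values are assumptions made for illustration. The allowed countries are taken from the YAML checks defined later in this notebook.

```python
# Plain-Python restatement of the expectations in the comments above.
# NOT DQX's implementation; it only predicts which sample rows should fail.
ALLOWED_COUNTRIES = {"Germany", "France"}  # taken from the YAML checks later in the notebook

sample_rows = [
    {"id": 1, "age": 23, "country": "Germany"},
    {"id": 2, "age": 30, "country": "France"},
    {"id": 3, "age": 16, "country": "Germany"},
    {"id": None, "age": 29, "country": "France"},
    {"id": 4, "age": 29, "country": ""},
    {"id": 5, "age": 23, "country": "Italy"},
    {"id": 6, "age": 123, "country": "France"},
]


def expected_failures(row):
    """Hypothetical helper: list the expectations this row violates."""
    failures = []
    # Every column must be present and non-empty.
    for column in ("id", "age", "country"):
        if row[column] is None or row[column] == "":
            failures.append(f"{column} is null or empty")
    # Age must lie in [18, 120]; inclusive bounds are an assumption.
    age = row["age"]
    if age is not None and not (18 <= age <= 120):
        failures.append("age out of range")
    # Country must be one of the allowed values.
    country = row["country"]
    if country is not None and country not in ALLOWED_COUNTRIES:
        failures.append("country not in allowed list")
    return failures


for row in sample_rows:
    print(row["id"], expected_failures(row))
```

Under these assumptions only the rows with `id` 1 and 2 pass, which matches the rows left unmarked in the sample data.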



### Functions Demonstrated
- `is_not_null_and_not_empty`
- `is_in_range`
- `is_in_list`

Link to the built-in quality checks: [check_funcs.py](https://github.com/databrickslabs/dqx/blob/98c3ef12d7b2ce3c1cc6c2bc3f9643a27876d817/src/databricks/labs/dqx/check_funcs.py)

There are two methods for creating a `Checks` array:
- **checks_from_object**: a Python list that can be cast to `list[DQRule]`
- **checks_from_yaml**: a YAML definition that `DQEngine` can use directly

Then we use `validate_checks` to make sure our configurations are correct.

In [17]:
checks_from_yaml = yaml.safe_load("""
- check:
    function: is_not_null_and_not_empty
    for_each_column:
      - id
      - age
      - country
    criticality: error
- check:
    function: is_in_range
    for_each_column:
      - age
    criticality: warn
    arguments:
      min_limit: 18
      max_limit: 120
- check:
    function: is_in_list
    for_each_column:
      - country
    criticality: warn
    arguments:
      allowed:
        - Germany
        - France
""")

# Validate YAML checks
status = DQEngine.validate_checks(checks_from_yaml)
print(f"Checks from YAML: {status}")

Checks from YAML: No errors found
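
To make the shape of the parsed checks easier to see, the sketch below walks the same structure in plain Python and lists one entry per check and column. The literal `parsed_checks` mirrors what `yaml.safe_load` returns for the YAML above; `describe_checks` is a hypothetical helper, not part of DQX, and it does not replace `validate_checks`. It also reports the nesting level at which `criticality` was declared. In this YAML it sits inside `check`, next to `function`; whether DQX reads it from that level is not confirmed by this notebook and is worth verifying against the DQX documentation for the installed version.

```python
# Parsed form of the YAML above, written as Python literals so the sketch has no dependencies.
parsed_checks = [
    {"check": {"function": "is_not_null_and_not_empty",
               "for_each_column": ["id", "age", "country"],
               "criticality": "error"}},
    {"check": {"function": "is_in_range",
               "for_each_column": ["age"],
               "criticality": "warn",
               "arguments": {"min_limit": 18, "max_limit": 120}}},
    {"check": {"function": "is_in_list",
               "for_each_column": ["country"],
               "criticality": "warn",
               "arguments": {"allowed": ["Germany", "France"]}}},
]


def describe_checks(checks):
    """Hypothetical helper: one summary dict per (check, column) pair."""
    summary = []
    for entry in checks:
        check = entry.get("check", {})
        # Record where the criticality key was written, if anywhere.
        if "criticality" in entry:
            declared = "top level"
        elif "criticality" in check:
            declared = "inside check"
        else:
            declared = "not declared"
        for column in check.get("for_each_column", []):
            summary.append({
                "function": check.get("function"),
                "column": column,
                "criticality": entry.get("criticality", check.get("criticality")),
                "declared": declared,
            })
    return summary


for item in describe_checks(parsed_checks):
    print(item)
```

The three YAML entries expand to five check and column pairs, all with `criticality` declared inside `check`.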

### Setup `DQEngine`

In [10]:
ws = WorkspaceClient()
dq_engine = DQEngine(ws)



### Apply YAML Rules

In [26]:
validated_df = dq_engine.apply_checks_by_metadata(new_users_df, checks_from_yaml)
validated_df.show()

+----+---+-------+--------------------+---------+
|  id|age|country|             _errors|_warnings|
+----+---+-------+--------------------+---------+
|   1| 23|Germany|                NULL|     NULL|
|   2| 30| France|                NULL|     NULL|
|   3| 16|Germany|[{age_not_in_rang...|     NULL|
|NULL| 29| France|[{id_is_null_or_e...|     NULL|
|   4| 29|       |[{country_is_null...|     NULL|
|   5| 23|  Italy|[{country_is_not_...|     NULL|
|   6|123| France|[{age_not_in_rang...|     NULL|
+----+---+-------+--------------------+---------+

In [20]:
yaml_rules_valid_rows, yaml_rules_invalid_rows = dq_engine.apply_checks_by_metadata_and_split(new_users_df, checks_from_yaml)



In [12]:
yaml_rules_valid_rows.show()

+---+---+-------+
| id|age|country|
+---+---+-------+
|  1| 23|Germany|
|  2| 30| France|
+---+---+-------+

In [13]:
yaml_rules_invalid_rows.show()

+----+---+-------+--------------------+---------+
|  id|age|country|             _errors|_warnings|
+----+---+-------+--------------------+---------+
|   3| 16|Germany|[{age_not_in_rang...|     NULL|
|NULL| 29| France|[{id_is_null_or_e...|     NULL|
|   4| 29|       |[{country_is_null...|     NULL|
|   5| 23|  Italy|[{country_is_not_...|     NULL|
|   6|123| France|[{age_not_in_rang...|     NULL|
+----+---+-------+--------------------+---------+
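
The split can be pictured as a partition on the two result columns: rows with nothing reported go to the valid set with the result columns dropped, and every other row goes to the invalid set with the result columns kept. The sketch below illustrates that idea with plain dictionaries. It is not DQX's implementation: the column names `_errors` and `_warnings`, the helper `split_rows`, and the placeholder failure messages are assumptions. Rows flagged only with warnings are treated as invalid here purely for simplicity, since the output above does not show how DQX routes them.

```python
RESULT_COLUMNS = ("_errors", "_warnings")  # assumed names of the two result columns

# A few annotated rows; the failure messages are placeholders, not DQX output.
annotated_rows = [
    {"id": 1, "age": 23, "country": "Germany", "_errors": None, "_warnings": None},
    {"id": 2, "age": 30, "country": "France", "_errors": None, "_warnings": None},
    {"id": 3, "age": 16, "country": "Germany", "_errors": ["age check failed"], "_warnings": None},
    {"id": None, "age": 29, "country": "France", "_errors": ["id check failed"], "_warnings": None},
]


def split_rows(rows):
    """Hypothetical helper: partition rows on whether any result column is populated."""
    valid, invalid = [], []
    for row in rows:
        flagged = any(row.get(column) for column in RESULT_COLUMNS)
        if flagged:
            invalid.append(row)  # keep the result columns for inspection
        else:
            # Drop the result columns from clean rows.
            valid.append({k: v for k, v in row.items() if k not in RESULT_COLUMNS})
    return valid, invalid


valid_rows, invalid_rows = split_rows(annotated_rows)
print(valid_rows)
print(invalid_rows)
```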