# Examples of use: spalah.dataset

This module contains storage- and dataset-specific helpers, such as `DeltaTableConfig` and `check_dbfs_mounts`.
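The examples below work against a Delta table stored at `/tmp/nested_schema_dataset`. A minimal sketch to create such a table with PySpark (the schema with `ID`, `Name`, `Address` matches the later constraint examples and is an assumption; any Delta table works):

```python
# A sketch, assuming an active SparkSession `spark` with Delta Lake enabled
df = spark.createDataFrame(
    [(1, "Alex", "Main St 1")],
    schema="ID int, Name string, Address string",
)
df.write.format("delta").mode("overwrite").save("/tmp/nested_schema_dataset")
```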

## DeltaTableConfig

### Retrieve delta table properties

```python
from spalah.dataset import DeltaTableConfig

dp = DeltaTableConfig(table_path="/tmp/nested_schema_dataset")

print(dp.properties)

# results in:
# {'delta.deletedFileRetentionDuration': 'interval 15 days'}
```

### Set delta table properties

```python
from spalah.dataset import DeltaTableConfig

dp = DeltaTableConfig(table_path="/tmp/nested_schema_dataset")

dp.properties = {
    "delta.logRetentionDuration": "interval 10 days",
    "delta.deletedFileRetentionDuration": "interval 15 days"
}
```

and the outcome is:

```
2023-05-20 18:21:35,155 INFO      Applying table properties on 'delta.`/tmp/nested_schema_dataset`':
2023-05-20 18:21:35,156 INFO      Checking if 'delta.logRetentionDuration = interval 10 days' is set on delta.`/tmp/nested_schema_dataset`
2023-05-20 18:21:35,534 INFO      The property has been set
2023-05-20 18:21:35,535 INFO      Checking if 'delta.deletedFileRetentionDuration = interval 15 days' is set on delta.`/tmp/nested_schema_dataset`
2023-05-20 18:21:35,837 INFO      The property has been set
```
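
The applied values can be verified by reading the properties back (a quick check, not part of the log above):

```python
print(dp.properties)
# expected to include the two properties applied above, e.g.:
# {'delta.logRetentionDuration': 'interval 10 days',
#  'delta.deletedFileRetentionDuration': 'interval 15 days'}
```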

If the existing properties should be preserved, set `keep_existing_properties=True`:

```python
from spalah.dataset import DeltaTableConfig

dp = DeltaTableConfig(table_path="/tmp/nested_schema_dataset")

dp.keep_existing_properties = True

dp.properties = {
    "delta.logRetentionDuration": "interval 10 days",
    "delta.deletedFileRetentionDuration": "interval 15 days"
}
```
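
With `keep_existing_properties=True`, properties already set on the table but not listed in the dictionary are left untouched, so a readback shows the merged result (a sketch, no captured output):

```python
print(dp.properties)
# previously set properties that are not listed above remain on the table
# alongside the two values applied here
```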

### Retrieve delta table check constraints

```python
from spalah.dataset import DeltaTableConfig

dp = DeltaTableConfig(table_path="/tmp/nested_schema_dataset")

print(dp.check_constraints)

# results in an empty dictionary, so no check constraints are set yet:
# {}
```

### Set delta table check constraints

```python
from spalah.dataset import DeltaTableConfig

dp = DeltaTableConfig(table_path="/tmp/nested_schema_dataset")

dp.check_constraints = {'id_is_not_null': 'id is not null'}
```

and the outcome is:

```
2023-05-20 18:27:42,070 INFO      Applying check constraints on 'delta.`/tmp/nested_schema_dataset`':
2023-05-20 18:27:42,071 INFO      Checking if constraint 'id_is_not_null' was already set on delta.`/tmp/nested_schema_dataset`
2023-05-20 18:27:42,433 INFO      The constraint id_is_not_null has been successfully added to 'delta.`/tmp/nested_schema_dataset`
```
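
As with the table properties, the constraint can be read back to confirm it was applied:

```python
print(dp.check_constraints)
# {'id_is_not_null': 'id is not null'}
```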

If the existing constraints should be preserved, set `keep_existing_check_constraints=True`:

```python
from spalah.dataset import DeltaTableConfig

dp = DeltaTableConfig(table_path="/tmp/nested_schema_dataset")
dp.keep_existing_check_constraints = True

dp.check_constraints = {'Name_is_not_null': 'Name is not null'}
```

Reading the check constraints back now shows both constraints, the existing one and the new one:

```python
print(dp.check_constraints)

# {'id_is_not_null': 'id is not null', 'name_is_not_null': 'Name is not null'}
```

The following insert confirms that the constraint `id_is_not_null` is in place and protects the column `ID` from being null:

```python
spark.sql(
    """
    INSERT INTO delta.`/tmp/nested_schema_dataset` (ID, Name, Address)
    VALUES (NULL, 'Alex', NULL)
    """
)
```

```
ERROR Utils: Aborting task
org.apache.spark.sql.delta.schema.DeltaInvariantViolationException: 
CHECK constraint id_is_not_null (id IS NOT NULL) violated by row with values:
 - id : null
...
```
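
For contrast, a row that satisfies the constraints is inserted without errors (a sketch; the ID value and the NULL Address are arbitrary):

```python
spark.sql(
    """
    INSERT INTO delta.`/tmp/nested_schema_dataset` (ID, Name, Address)
    VALUES (2, 'Alex', NULL)
    """
)
```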