# Quality Enforcement

One of the main motivations for using Delta Lake to store data is that you can provide guarantees on the quality of your data. While schema enforcement is automatic, additional quality checks can be helpful to ensure that only data that meets your expectations makes it into your Lakehouse.

This notebook will review a few approaches to quality enforcement. Some of these are Databricks-specific features, while others are general design principles.

## Learning Objectives
By the end of this lesson, you should be able to:
- Add check constraints to Delta tables
- Describe and implement a quarantine table
- Apply logic to add data quality tags to Delta tables

In [0]:
%run ../Includes/Classroom-Setup-4.2

## Table Constraints

Databricks allows <a href="https://docs.databricks.com/delta/delta-constraints.html" target="_blank">table constraints</a> to be set on Delta tables.

Table constraints apply boolean filters to columns within a table and prevent data that does not fulfill these constraints from being written.

Start by looking at our existing tables.

In [0]:
%sql
SHOW TABLES

If any constraints exist, they will be listed under the **`properties`** of the extended table description.

In [0]:
%sql
DESCRIBE EXTENDED heart_rate_silver

When defining a constraint, be sure to give it a human-readable name. (Note that names are not case sensitive.)

In [0]:
%sql
ALTER TABLE heart_rate_silver ADD CONSTRAINT date_within_range CHECK (time > '2017-01-01');

None of the existing data in our table violated this constraint. Both the name and the actual check are displayed in the **`properties`** field.

In [0]:
%sql
DESCRIBE EXTENDED heart_rate_silver
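
The same information can also be read programmatically. The sketch below assumes that check constraints appear as table properties whose keys start with `delta.constraints.`, and that `SHOW TBLPROPERTIES` returns `key` and `value` columns. `list_check_constraints` is a hypothetical helper name, not part of any library.

```python
def list_check_constraints(spark, table_name):
    """Return {constraint_name: expression} for the given Delta table."""
    # Assumption: check constraints are exposed as properties with this prefix.
    prefix = "delta.constraints."
    rows = spark.sql(f"SHOW TBLPROPERTIES {table_name}").collect()
    return {
        row["key"][len(prefix):]: row["value"]
        for row in rows
        if row["key"].startswith(prefix)
    }
```

In a notebook, calling `list_check_constraints(spark, "heart_rate_silver")` at this point should include an entry for `date_within_range`.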

But what happens if the conditions of the constraint aren't met?

We know that some of our devices occasionally send negative **`bpm`** recordings.

In [0]:
%sql
SELECT COUNT(*) FROM heart_rate_silver
WHERE heartrate <= 0 

Delta Lake will prevent us from applying a constraint that existing records violate.

In [0]:
from pyspark.sql.utils import AnalysisException

try:
    spark.sql("ALTER TABLE heart_rate_silver ADD CONSTRAINT validbpm CHECK (heartrate > 0)")
    # Reaching this line means the constraint was added, which we do not expect
    raise Exception("Expected failure")

except AnalysisException as e:
    print("Failed as expected...")
    print(e)

Notice below that the constraint was not applied.

In [0]:
%sql
DESCRIBE EXTENDED heart_rate_silver

How do we deal with this? 

We could manually delete offending records and then set the check constraint, or set the check constraint before processing data from our bronze table.
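
The first option might look like the following sketch. It assumes that permanently discarding the offending readings is acceptable, which is often not the case. The helper name `delete_then_constrain` is hypothetical, and the function is only defined here, not called.

```python
def delete_then_constrain(spark):
    """Delete violating rows, then add the check constraint (destructive)."""
    # Remove the records that blocked the constraint earlier.
    # Assumption: rows with a NULL heartrate are also unwanted.
    spark.sql(
        "DELETE FROM heart_rate_silver "
        "WHERE heartrate <= 0 OR heartrate IS NULL"
    )
    # With the offending rows gone, try adding the constraint again.
    spark.sql(
        "ALTER TABLE heart_rate_silver "
        "ADD CONSTRAINT validbpm CHECK (heartrate > 0)"
    )
```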

However, if we set a check constraint and a batch of data contains records that violate it, the job will fail with an error.

If our goal is to identify bad records but keep streaming jobs running, we'll need a different solution.

One idea would be to quarantine invalid records.

Note that if you need to remove a constraint from a table, the following code would be executed.

In [0]:
%sql
ALTER TABLE heart_rate_silver DROP CONSTRAINT validbpm;

## Quarantining

The idea of quarantining is that bad records will be written to a separate location.

This allows good data to be processed efficiently, while additional logic and/or manual review of erroneous records can be defined and executed away from the main pipeline.

Assuming that records can be successfully salvaged, they can be easily backfilled into the silver table they were deferred from.
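
As a rough sketch of such a backfill, the function below uses an insert-only merge keyed on `device_id` and `time`. It assumes the repaired records have already been registered as a view with the same schema as the silver table. `backfill_repaired` and the view name passed to it are hypothetical.

```python
def backfill_repaired(spark, repaired_view):
    """Merge repaired records into the silver table, skipping existing keys."""
    # Assumption: repaired_view names a view or table of salvaged records.
    query = f"""
        MERGE INTO heart_rate_silver a
        USING {repaired_view} b
        ON a.device_id = b.device_id AND a.time = b.time
        WHEN NOT MATCHED THEN INSERT *
    """
    return spark.sql(query)
```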

Here, we'll implement quarantining by performing writes to two separate tables within a **`foreachBatch`** custom writer.

Start by creating a table with the correct schema.

In [0]:
%sql
CREATE TABLE IF NOT EXISTS bpm_quarantine
    (device_id LONG, time TIMESTAMP, heartrate DOUBLE)
USING DELTA
LOCATION '${da.paths.user_db}/bpm_quarantine'

With Structured Streaming operations, writing to an additional table can be accomplished within **`foreachBatch`** logic.

Below, we'll update the logic to add filters at the appropriate locations.

For simplicity, we won't check for duplicate records as we insert data into the quarantine table.

In [0]:
sql_query = """
MERGE INTO heart_rate_silver a
USING stream_updates b
ON a.device_id=b.device_id AND a.time=b.time
WHEN NOT MATCHED THEN INSERT *
"""

class Upsert:
    def __init__(self, query, update_temp="stream_updates"):
        self.query = query
        self.update_temp = update_temp 
        
    def upsert_to_delta(self, micro_batch_df, batch):
        # Valid readings are merged into the silver table
        micro_batch_df.filter("heartrate > 0").createOrReplaceTempView(self.update_temp)
        micro_batch_df._jdf.sparkSession().sql(self.query)
        # Invalid readings are appended to the quarantine table
        micro_batch_df.filter("heartrate <= 0").write.format("delta").mode("append").saveAsTable("bpm_quarantine")
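
One way to attach this class to a stream is sketched below. The streaming DataFrame and the checkpoint path are assumptions supplied by the caller, and `start_quarantine_stream` is a hypothetical helper. Only `writeStream`, `foreachBatch`, the `checkpointLocation` option, and `start` are standard Structured Streaming calls.

```python
def start_quarantine_stream(streaming_df, upsert, checkpoint_path):
    """Start a streaming query that runs upsert.upsert_to_delta on each microbatch."""
    return (streaming_df.writeStream
            # Each microbatch is passed to the method as (DataFrame, batch id)
            .foreachBatch(upsert.upsert_to_delta)
            # Assumption: checkpoint_path is a location dedicated to this query
            .option("checkpointLocation", checkpoint_path)
            .start())
```

For example, `start_quarantine_stream(bpm_stream, Upsert(sql_query), checkpoint_path)`, where `bpm_stream` stands in for a parsed streaming read of the bronze table.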

Note that within the **`foreachBatch`** logic, the DataFrame operations treat the data in each batch as if it's static rather than streaming.

As such, we use the **`write`** syntax instead of **`writeStream`**.

This also means that our exactly-once guarantees are relaxed. In our example above, we have two ACID transactions:
1. Our SQL query executes to run an insert-only merge to avoid writing duplicate records to our silver table.
2. We write a microbatch of records with non-positive heart rates to the **`bpm_quarantine`** table.

If our job fails after our first transaction completes but before the second completes, we will re-execute the full microbatch logic on job restart.

However, because our insert-only merge already prevents duplicate records from being saved to our table, this will not result in any data corruption.
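
The quarantine append itself is not deduplicated, so a retry that happens after the second transaction commits could append the same records twice. A possible mitigation, sketched below, uses Delta Lake's idempotent-write options `txnAppId` and `txnVersion`. This assumes a runtime whose Delta Lake version supports these options. `append_idempotent` and its `app_id` argument are hypothetical names.

```python
def append_idempotent(batch_df, batch_id, table_name, app_id):
    """Append a microbatch so that a replay of the same batch id is skipped."""
    (batch_df.write
        .format("delta")
        .mode("append")
        # Assumption: app_id is a stable string that identifies this writer
        .option("txnAppId", app_id)
        # The microbatch id serves as the increasing transaction version
        .option("txnVersion", batch_id)
        .saveAsTable(table_name))
```

Inside `upsert_to_delta`, the quarantine write could then be replaced with a call such as `append_idempotent(bad_df, batch, "bpm_quarantine", "bpm_quarantine_writer")`, where `bad_df` is the filtered microbatch.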

## Flagging
To avoid multiple writes and managing multiple tables, you may choose to implement a flagging system to warn about violations while avoiding job failures.

Flagging is a low-touch solution with little overhead.

These flags can easily be leveraged by filters in downstream queries to isolate bad data.

**`case`** / **`when`** logic makes this easy.

Run the following cell to see the compiled Spark SQL from the PySpark code below.

In [0]:
from pyspark.sql import functions as F

F.when(F.col("heartrate") <= 0, "Negative BPM").otherwise("OK").alias("bpm_check")

Here, we'll just insert this logic as an additional transformation on a batch read of our bronze data to preview the output.

In [0]:
json_schema = "device_id LONG, time TIMESTAMP, heartrate DOUBLE"

deduped_df = (spark.read
                  .table("bronze")
                  .filter("topic = 'bpm'")
                  .select(F.from_json(F.col("value").cast("string"), json_schema).alias("v"))
                  .select("v.*", F.when(F.col("v.heartrate") <= 0, "Negative BPM")
                                  .otherwise("OK")
                                  .alias("bpm_check"))
                  .dropDuplicates(["device_id", "time"]))

display(deduped_df)
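
Downstream queries can then use the flag to separate clean and suspect records. This is a minimal sketch, assuming a DataFrame that carries the `bpm_check` column produced above. `split_by_flag` is a hypothetical helper.

```python
def split_by_flag(flagged_df):
    """Return (ok_df, suspect_df) based on the bpm_check flag column."""
    # Records that passed the check
    ok_df = flagged_df.filter("bpm_check = 'OK'")
    # Records carrying any other flag value
    suspect_df = flagged_df.filter("bpm_check != 'OK'")
    return ok_df, suspect_df
```

For example, `ok_df, suspect_df = split_by_flag(deduped_df)`.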

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()