# Schema Enforcement

😲 The health tracker changed how it records data, which means that the raw data schema has changed.

## Notebook Objective

In this notebook we:
1. Observe how schema enforcement deals with schema changes

## Step Configuration

In [0]:
%run ./includes/configuration

Out[3]: DataFrame[]

No running streams.


## Import Operation Functions

In [0]:
%run ./includes/main/python/operations_v2

❗️Note that we loaded our operation functions from the file `includes/main/python/operations_v2`. This updated operations file transforms the bronze table using the new schema.

The new schema has been loaded as `json_schema_v2`.

### Display the Files in the Raw Path

In [0]:
display(dbutils.fs.ls(rawPath))

path,name,size
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/health_tracker_data_2020_1.json,health_tracker_data_2020_1.json,310628
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/health_tracker_data_2020_2.json,health_tracker_data_2020_2.json,284670
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/late/,late/,0


## Start Streams

Before we add new streams, let's start the streams we have previously engineered.

We will start two named streams:

- `write_raw_to_bronze`
- `write_bronze_to_silver`

### Current Delta Architecture

Next, we demonstrate everything we have built up to this point in our
Delta Architecture.

Again, we do so with the composable functions included in the
file `includes/main/python/operations_v2`, which we loaded above.

In [0]:
rawDF = read_stream_raw(spark, rawPath)
transformedRawDF = transform_raw(rawDF)
rawToBronzeWriter = create_stream_writer(
    dataframe=transformedRawDF,
    checkpoint=bronzeCheckpoint,
    name="write_raw_to_bronze",
    partition_column="p_ingestdate",
)
rawToBronzeWriter.start(bronzePath)

bronzeDF = read_stream_delta(spark, bronzePath)
transformedBronzeDF = transform_bronze(bronzeDF)
bronzeToSilverWriter = create_stream_writer(
    dataframe=transformedBronzeDF,
    checkpoint=silverCheckpoint,
    name="write_bronze_to_silver",
    partition_column="p_eventdate",
)
bronzeToSilverWriter.start(silverPath)

Out[15]: <pyspark.sql.streaming.StreamingQuery at 0x7f4b86496c40>

## Update the Silver Table

We periodically run the `update_silver_table` function to correct the Silver table for a known issue: negative readings being ingested.

In [0]:
update_silver_table(spark, silverPath)

Out[16]: True

## Show Running Streams

In [0]:
for stream in spark.streams.active:
    print(stream.name)

write_bronze_to_silver
write_raw_to_bronze


## Retrieve Third Month of Data

Next, we use the utility function `retrieve_data` to retrieve another file.

After the file is ingested, check the streaming cells above to see how the streams respond.

In [0]:
retrieve_data(2020, 3, rawPath)

Out[18]: True

## Exercise: Write an Assertion Statement to Verify File Ingestion

### Expected File

The expected file has the following name:

In [0]:
file_2020_3 = "health_tracker_data_2020_3.json"

In [0]:

assert file_2020_3 in [item.name for item in dbutils.fs.ls(rawPath)]
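
A bare `assert` fails without saying what was actually found. As a minimal sketch, the same check can be wrapped in a helper that reports the names it saw. The helper name `assert_file_present` is hypothetical and is not part of the included operations files. The sample listing below reuses the file names displayed earlier so the block runs outside a notebook.

```python
def assert_file_present(expected_name, listed_names):
    """Fail with a descriptive message if expected_name is not among listed_names."""
    listed = list(listed_names)
    assert expected_name in listed, (
        f"{expected_name} not found; files seen: {sorted(listed)}"
    )


# Stand-in for [item.name for item in dbutils.fs.ls(rawPath)] after the third file arrives.
sample_listing = [
    "health_tracker_data_2020_1.json",
    "health_tracker_data_2020_2.json",
    "health_tracker_data_2020_3.json",
    "late/",
]

assert_file_present("health_tracker_data_2020_3.json", sample_listing)
```

In the notebook itself, the call would be `assert_file_present(file_2020_3, [item.name for item in dbutils.fs.ls(rawPath)])`.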

In [0]:
%sql

SELECT COUNT(*) FROM health_tracker_plus_bronze

count(1)
10920


In [0]:
%sql

SELECT COUNT(*) FROM health_tracker_plus_silver

count(1)
7200


In [0]:
%sql

DESCRIBE health_tracker_plus_silver

col_name,data_type,comment
device_id,int,
heartrate,double,
eventtime,timestamp,
name,string,
p_eventdate,date,
,,
# Partitioning,,
Part 0,p_eventdate,


## What Is Schema Enforcement?
Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a “reservation”), and rejects any writes with columns that aren’t on the list.
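
The "reservation list" check can be sketched in plain Python. This is a simplified illustration of the idea, not Delta Lake's actual implementation: it flags only unexpected columns and differing data types. The function name `find_schema_violations` and the extra column `extra_field` are hypothetical, since the contents of the new schema are not shown in this notebook.

```python
def find_schema_violations(table_schema, incoming_schema):
    """Return human-readable mismatches between two schemas.

    Both arguments are dicts mapping column name -> data type string.
    """
    violations = []
    for column, data_type in incoming_schema.items():
        if column not in table_schema:
            # The column has no "reservation" in the target table.
            violations.append(f"unexpected column: {column}")
        elif table_schema[column] != data_type:
            violations.append(
                f"type mismatch for {column}: "
                f"table has {table_schema[column]}, incoming has {data_type}"
            )
    return violations


# Silver table schema as reported by DESCRIBE above.
silver_schema = {
    "device_id": "int",
    "heartrate": "double",
    "eventtime": "timestamp",
    "name": "string",
    "p_eventdate": "date",
}

# Hypothetical incoming data carrying one column the table does not know about.
incoming_schema = dict(silver_schema, extra_field="string")

print(find_schema_violations(silver_schema, incoming_schema))
# → ['unexpected column: extra_field']
```

An empty result means the write would pass this simplified check. Any entry in the list corresponds to the kind of mismatch that causes a write to be rejected.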

## Show Running Streams

In [0]:
for stream in spark.streams.active:
    print(stream.name)

write_raw_to_bronze


Note that the `write_bronze_to_silver` stream has died. If you navigate back up to the cell in which we started the streams, you should see the following error:

`org.apache.spark.sql.AnalysisException: A schema mismatch detected when writing to the Delta table`.

The stream has died because the schema of the incoming data did not match the schema of the table being written to.
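
Because `spark.streams.active` lists only running streams, a terminated stream can be inspected only through the handle returned by `.start()`. The sketch below summarizes a list of such handles using the `name`, `isActive`, and `exception()` members of PySpark's `StreamingQuery`. The helper `summarize_queries` and the `FakeQuery` stand-in are hypothetical and exist only so the block runs without a Spark session.

```python
def summarize_queries(queries):
    """Return (name, is_active, error_text) for each streaming query handle."""
    summary = []
    for query in queries:
        # exception() returns None unless the query terminated with an error.
        error = query.exception()
        summary.append((query.name, query.isActive, str(error) if error else None))
    return summary


class FakeQuery:
    """Hypothetical stand-in mimicking the StreamingQuery members used above."""

    def __init__(self, name, is_active, error=None):
        self.name = name
        self.isActive = is_active
        self._error = error

    def exception(self):
        return self._error


demo_queries = [
    FakeQuery("write_raw_to_bronze", True),
    FakeQuery(
        "write_bronze_to_silver",
        False,
        "A schema mismatch detected when writing to the Delta table",
    ),
]

for name, active, error in summarize_queries(demo_queries):
    print(name, "active" if active else "stopped", error or "")
```

In the notebook, this would require keeping the handles when the streams are started, for example `silverQuery = bronzeToSilverWriter.start(silverPath)`, and then calling `summarize_queries([silverQuery])`.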

## Stop All Streams

In the next notebook, we will take a look at schema evolution with Delta Lake.

Before we do so, let's shut down all streams in this notebook.

In [0]:
stop_all_streams()


Out[22]: True

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>