# Schema Evolution

😲 The health tracker changed how it records data, so the schema of the
raw data has changed. In this notebook, we show how to build our
streams so that they merge these schema changes.

**TODO** *Discussion on what kinds of changes will work with the merge option.*
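
As a rough illustration of what merging a schema means, the sketch below models a table schema as a plain Python dict that maps column names to type names. It is not Delta Lake's implementation: `merge_schemas` is a hypothetical helper, and the v1 column set is an assumption made for illustration (the Silver table's final columns minus `device_type`). The rule shown covers only the simplest case: new columns are appended, and a type change on an existing column is rejected. Delta Lake's actual rules are more nuanced.

```python
def merge_schemas(table_schema, incoming_schema):
    """Return table_schema extended with columns found only in incoming_schema.

    Hypothetical helper for illustration; schemas are dicts of
    column name -> type name, in column order.
    """
    merged = dict(table_schema)  # copy, preserving the existing column order
    for column, dtype in incoming_schema.items():
        if column not in merged:
            # A new column is appended after the existing ones.
            merged[column] = dtype
        elif merged[column] != dtype:
            # This sketch rejects every type change on an existing column.
            raise TypeError(
                f"column {column!r}: {merged[column]} cannot be merged with {dtype}"
            )
    return merged


# Assumed v1 schema (illustrative only).
silver_v1 = {
    "device_id": "int",
    "heartrate": "double",
    "eventtime": "timestamp",
    "name": "string",
    "p_eventdate": "date",
}
# Assumed v2 schema: the same columns plus one new column.
incoming_v2 = {**silver_v1, "device_type": "string"}

merged = merge_schemas(silver_v1, incoming_v2)
print(list(merged))
```

In this model the new column lands at the end of the column list, after the columns the table already had.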

## Notebook Objective

In this notebook we:
1. Use schema evolution to deal with schema changes

## Step Configuration

In [0]:
%run ./includes/configuration

Out[3]: DataFrame[]

No running streams.


## Import Operation Functions

In [0]:
%run ./includes/main/python/operations_v2

### Display the Files in the Raw Path

In [0]:
display(dbutils.fs.ls(rawPath))

path,name,size
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/health_tracker_data_2020_1.json,health_tracker_data_2020_1.json,310628
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/health_tracker_data_2020_2.json,health_tracker_data_2020_2.json,284670
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/health_tracker_data_2020_3.json,health_tracker_data_2020_3.json,402785
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/late/,late/,0


## Start Streams

Before we add new streams, let's start the streams we have previously engineered.

We will start two named streams:

- `write_raw_to_bronze`
- `write_bronze_to_silver`

❗️Note that we have loaded our operation functions from the file `includes/main/python/operations_v2`. This updated operations file has been modified to transform the bronze table using the new schema.

The new schema has been loaded as `json_schema_v2`.

### Current Delta Architecture
**TODO**
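Next, we demonstrate everything we have built up to this point in our
Delta Architecture.

Again, we do so with composable functions, this time the ones included in the
file `includes/main/python/operations_v2`.

The one change is in the Silver table stream writer: pass it the `mergeSchema=True` argument so that the write merges the new schema into the Silver table instead of being rejected for a schema mismatch.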

In [0]:

rawDF             = read_stream_raw(spark, rawPath)
transformedRawDF  = transform_raw(rawDF)
rawToBronzeWriter = create_stream_writer(
  dataframe=transformedRawDF,
  checkpoint=bronzeCheckpoint,
  name="write_raw_to_bronze",
  partition_column="p_ingestdate"
)
rawToBronzeWriter.start(bronzePath)

bronzeDF             = read_stream_delta(spark, bronzePath)
transformedBronzeDF  = transform_bronze(bronzeDF)
bronzeToSilverWriter = create_stream_writer(
  dataframe=transformedBronzeDF,
  checkpoint=silverCheckpoint,
  name="write_bronze_to_silver",
  partition_column="p_eventdate",
  mergeSchema=True,
)
bronzeToSilverWriter.start(silverPath)

Out[16]: <pyspark.sql.streaming.StreamingQuery at 0x7f66e60ef0d0>
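
The body of `create_stream_writer` lives in the included operations file and is not shown in this notebook. As a minimal sketch of how such a helper might forward `mergeSchema` to the Delta writer, consider the function below. The name `create_stream_writer_sketch`, the `append` output mode, and the overall layout are assumptions, not the course's code. The calls it makes (`writeStream`, `format`, `outputMode`, `option`, `queryName`, `partitionBy`) are standard PySpark `DataStreamWriter` methods.

```python
def create_stream_writer_sketch(dataframe, checkpoint, name,
                                partition_column=None, mergeSchema=False):
    """Build (but do not start) a Delta stream writer for a streaming DataFrame.

    Hypothetical stand-in for the course's create_stream_writer.
    """
    writer = (
        dataframe.writeStream
        .format("delta")
        .outputMode("append")                      # assumed output mode
        .option("checkpointLocation", checkpoint)
        .queryName(name)
    )
    if mergeSchema:
        # Ask Delta to merge the incoming schema into the table schema.
        writer = writer.option("mergeSchema", "true")
    if partition_column is not None:
        writer = writer.partitionBy(partition_column)
    return writer
```

The returned writer would then be started the same way as in the cell above, for example `create_stream_writer_sketch(transformedBronzeDF, silverCheckpoint, "write_bronze_to_silver", "p_eventdate", mergeSchema=True).start(silverPath)`.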

## Show Running Streams

In [0]:
for stream in spark.streams.active:
    print(stream.name)

write_bronze_to_silver
write_raw_to_bronze
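
Rather than reading the printed names by eye, the same check could be expressed as a small helper. The sketch below is an assumption-laden example: `missing_streams` is a hypothetical function, and the demo uses `SimpleNamespace` objects as stand-ins for streaming queries so that it runs without a Spark session.

```python
from types import SimpleNamespace


def missing_streams(active_streams, expected_names):
    """Return the expected stream names that are not currently active."""
    active_names = {stream.name for stream in active_streams}
    return sorted(set(expected_names) - active_names)


# Stand-ins for the objects in spark.streams.active (illustrative only).
demo_active = [
    SimpleNamespace(name="write_bronze_to_silver"),
    SimpleNamespace(name="write_raw_to_bronze"),
]
expected = ["write_raw_to_bronze", "write_bronze_to_silver"]

print(missing_streams(demo_active, expected))
```

In the notebook itself, the call would be `missing_streams(spark.streams.active, expected)`; an empty list means both named streams are running.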


In [0]:
%sql

SELECT COUNT(*) FROM health_tracker_plus_bronze

count(1)
10920


In [0]:
%sql

SELECT COUNT(*) FROM health_tracker_plus_silver

count(1)
10920


In [0]:
%sql

DESCRIBE health_tracker_plus_silver

col_name,data_type,comment
device_id,int,
heartrate,double,
eventtime,timestamp,
name,string,
p_eventdate,date,
device_type,string,
,,
# Partitioning,,
Part 0,p_eventdate,


## Stop All Streams

In the next notebook in this course, we will take a look at schema enforcement and evolution with Delta Lake.

Before we do so, let's shut down all streams in this notebook.

In [0]:
stop_all_streams()


Out[18]: True
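
`stop_all_streams` comes from the course includes and its body is not shown here. A minimal sketch of what such a helper could do is given below. The name `stop_streams`, the explicit `spark` parameter, the timeout, and the return convention are all assumptions; `stop()` and `awaitTermination()` are standard `StreamingQuery` methods.

```python
def stop_streams(spark, timeout_seconds=60):
    """Stop every active streaming query on the given Spark session.

    Hypothetical helper; returns True if at least one stream was stopped.
    """
    stopped_any = False
    for stream in spark.streams.active:
        stream.stop()
        # Wait (up to the timeout) for the query to finish shutting down.
        stream.awaitTermination(timeout_seconds)
        stopped_any = True
    return stopped_any
```

Calling `stop_streams(spark)` at the end of a notebook would shut down both `write_raw_to_bronze` and `write_bronze_to_silver` if they are still running.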

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>