# Silver to Gold - Building Aggregate Data Marts for End Users

We will now aggregate the Silver data for one of our end users, who wants to quickly see summary statistics per device id in a dashboard in their chosen BI tool.

## Notebook Objective

In this notebook we:
1. Create aggregations on the Silver table data
1. Load the aggregate data into a Gold table

## Step Configuration

In [0]:
%run ./includes/configuration

Out[3]: DataFrame[]

No running streams.


## Import Operation Functions

In [0]:
%run ./includes/main/python/operations

### Display the Files in the Raw Paths

In [0]:
display(dbutils.fs.ls(rawPath))

path,name,size
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/health_tracker_data_2020_1.json,health_tracker_data_2020_1.json,310628
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/health_tracker_data_2020_2.json,health_tracker_data_2020_2.json,284670
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/late/,late/,0


## Make Notebook Idempotent

In [0]:
dbutils.fs.rm(goldPath, recurse=True)
dbutils.fs.rm(goldCheckpoint, recurse=True)

Out[16]: False
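The `False` returned here most likely means there was nothing to delete on this run, which is the expected result the first time the notebook executes. As a minimal local-filesystem sketch of the same clean-before-write pattern (the helper `remove_if_exists` and the temporary directory are hypothetical stand-ins for `dbutils.fs.rm` and `goldPath`, not part of the course code):

```python
import shutil
import tempfile
from pathlib import Path


def remove_if_exists(path: Path) -> bool:
    """Delete a directory tree; return True only if something was removed."""
    if not path.exists():
        return False
    shutil.rmtree(path)
    return True


# Hypothetical local stand-in for goldPath.
workspace = Path(tempfile.mkdtemp())
gold_path = workspace / "gold"

first_run = remove_if_exists(gold_path)   # nothing has been written yet
gold_path.mkdir(parents=True)             # simulate output left by an earlier run
second_run = remove_if_exists(gold_path)  # leftover output is cleared

print(first_run, second_run)
```

Either way the cell leaves the target location empty, so rerunning the notebook starts from the same state.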

## Start Streams

Before we add new streams, let's start the streams we have previously engineered.

We will start two named streams:

- `write_raw_to_bronze`
- `write_bronze_to_silver`

### Current Delta Architecture

Next, we demonstrate everything we have built up to this point in our
Delta architecture.

Again, we do so with composable functions included in the
file `includes/main/python/operations`.

In [0]:
rawDF = read_stream_raw(spark, rawPath)
transformedRawDF = transform_raw(rawDF)
rawToBronzeWriter = create_stream_writer(
    dataframe=transformedRawDF,
    checkpoint=bronzeCheckpoint,
    name="write_raw_to_bronze",
    partition_column="p_ingestdate",
)
rawToBronzeWriter.start(bronzePath)

bronzeDF = read_stream_delta(spark, bronzePath)
transformedBronzeDF = transform_bronze(bronzeDF)
bronzeToSilverWriter = create_stream_writer(
    dataframe=transformedBronzeDF,
    checkpoint=silverCheckpoint,
    name="write_bronze_to_silver",
    partition_column="p_eventdate",
)
bronzeToSilverWriter.start(silverPath)

Out[17]: <pyspark.sql.streaming.StreamingQuery at 0x7f99a74c4dc0>
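The body of `create_stream_writer` lives in `includes/main/python/operations` and is not shown in this notebook. The following is only a sketch of what such a helper might look like, assuming it configures a Delta `writeStream` and leaves the call to `start` to the caller. `RecordingWriter` and `FakeStreamingDataFrame` are hypothetical stand-ins so the example runs without a cluster, and the `mode` parameter is an assumption.

```python
class RecordingWriter:
    """Stand-in for Spark's DataStreamWriter that records each builder call."""

    def __init__(self):
        self.calls = []

    def _record(self, method, *args):
        self.calls.append((method, args))
        return self  # builder methods return the writer, as in Spark

    def format(self, source):
        return self._record("format", source)

    def outputMode(self, mode):
        return self._record("outputMode", mode)

    def option(self, key, value):
        return self._record("option", key, value)

    def queryName(self, name):
        return self._record("queryName", name)

    def partitionBy(self, *cols):
        return self._record("partitionBy", *cols)


class FakeStreamingDataFrame:
    """Stand-in for a streaming DataFrame; exposes only writeStream."""

    def __init__(self):
        self.writeStream = RecordingWriter()


def create_stream_writer(dataframe, checkpoint, name,
                         partition_column=None, mode="append"):
    """Hypothetical helper: configure, but do not start, a Delta stream writer."""
    writer = (
        dataframe.writeStream.format("delta")
        .outputMode(mode)
        .option("checkpointLocation", checkpoint)
        .queryName(name)
    )
    if partition_column is not None:
        writer = writer.partitionBy(partition_column)
    return writer


writer = create_stream_writer(
    dataframe=FakeStreamingDataFrame(),
    checkpoint="/tmp/checkpoints/silver",  # hypothetical path
    name="write_bronze_to_silver",
    partition_column="p_eventdate",
)
for method, args in writer.calls:
    print(method, args)
```

Returning the configured writer rather than a running query keeps the helper composable: the caller decides where and when to call `start`.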

## Update the Silver Table

We periodically run the `update_silver_table` function to update the table and address the known issue of negative readings being ingested.

In [0]:
update_silver_table(spark, silverPath)

Out[19]: True

## Show Running Streams

In [0]:
for stream in spark.streams.active:
    print(stream.name)

write_bronze_to_silver
write_raw_to_bronze


## Create Aggregation per User

**Exercise:** Create a read stream DataFrame on the Silver table and aggregate it, grouped by `device_id`.

Use the following aggregates:
- mean of heartrate, aliased as `mean_heartrate`
- standard deviation of heartrate, aliased as `std_heartrate`
- maximum of heartrate, aliased as `max_heartrate`

In [0]:
# Solution: aggregate heartrate statistics per device.
# Importing the module as F avoids shadowing Python's built-in max.
from pyspark.sql import functions as F

silverTableReadStream = read_stream_delta(spark, silverPath)

gold_health_tracker_data_df = (
    silverTableReadStream.groupBy("device_id")
    .agg(
        F.mean(F.col("heartrate")).alias("mean_heartrate"),
        F.stddev(F.col("heartrate")).alias("std_heartrate"),
        F.max(F.col("heartrate")).alias("max_heartrate"),
    )
)
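To make the three aggregates concrete, here is a small plain-Python sketch that computes the same statistics per device on a handful of made-up readings. The sample values are invented for illustration. Spark's `stddev` is the sample standard deviation, which corresponds to `statistics.stdev` in the standard library.

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented (device_id, heartrate) readings, for illustration only.
readings = [
    ("device_0", 60.0), ("device_0", 62.0), ("device_0", 64.0),
    ("device_1", 70.0), ("device_1", 74.0),
]

# Equivalent of groupBy("device_id"): collect heartrates per device.
by_device = defaultdict(list)
for device_id, heartrate in readings:
    by_device[device_id].append(heartrate)

# Equivalent of the agg(...) call: one summary row per device.
summary = {
    device_id: {
        "mean_heartrate": mean(values),
        "std_heartrate": stdev(values),  # sample standard deviation (n - 1)
        "max_heartrate": max(values),
    }
    for device_id, values in by_device.items()
}

for device_id, stats in summary.items():
    print(device_id, stats)
```

The result has one row per `device_id`, which is the shape the Gold table takes.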


## WRITE Stream Gold Table Aggregation

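Note that a streaming aggregation without a watermark cannot be written with outputMode "append", because already-emitted rows keep changing as new data arrives. Here we use "complete", which rewrites the full result table on every trigger.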
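The difference is easier to see in a toy simulation. The sketch below is plain Python, not Spark: it keeps running state per device across two invented micro-batches and, after each one, emits the whole result table, which is roughly what "complete" mode does. All names and values are made up for illustration.

```python
from collections import defaultdict

# Two invented micro-batches of (device_id, heartrate) readings.
micro_batches = [
    [("device_0", 60), ("device_1", 70)],
    [("device_0", 66)],
]

# Running aggregation state, kept across batches.
state = defaultdict(lambda: {"count": 0, "total": 0, "max": None})
sink_snapshots = []  # what the sink holds after each trigger

for batch in micro_batches:
    for device_id, heartrate in batch:
        s = state[device_id]
        s["count"] += 1
        s["total"] += heartrate
        s["max"] = heartrate if s["max"] is None else max(s["max"], heartrate)

    # "complete" semantics: the sink receives every row of the result table,
    # including rows for devices that had no new data in this batch.
    result = {
        device_id: {
            "mean_heartrate": s["total"] / s["count"],
            "max_heartrate": s["max"],
        }
        for device_id, s in state.items()
    }
    sink_snapshots.append(result)

for i, snapshot in enumerate(sink_snapshots, start=1):
    print(f"after batch {i}: {snapshot}")
```

The row for `device_0` changes between the two snapshots. An append-only sink has no way to revise a row it has already written, which is why this query needs a mode that can replace earlier results.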

**Exercise:** Write the aggregate DataFrame to a Gold table

In [0]:

tableName = "aggregate_heartrate"
tableCheckpoint = goldCheckpoint + tableName
tablePath = goldPath + tableName

(
  gold_health_tracker_data_df.writeStream
  .format("delta")
  .outputMode("complete")
  .option("checkpointLocation", tableCheckpoint)
  .queryName("write_silver_to_gold")
  .start(tablePath)
)

Out[23]: <pyspark.sql.streaming.StreamingQuery at 0x7f99a74c4610>

## Register Gold Table in the Metastore

In [0]:
spark.sql(
    """
DROP TABLE IF EXISTS health_tracker_gold_aggregate_heartrate
"""
)

spark.sql(
    f"""
CREATE TABLE health_tracker_gold_aggregate_heartrate
USING DELTA
LOCATION "{tablePath}"
"""
)

Out[25]: DataFrame[]

### Troubleshooting

😫 If you try to run this before the `writeStream` above has been created, you may see the following error:

`AnalysisException: Table schema is not set.  Write data into it or use CREATE TABLE to set the schema.;`

If this happens, wait a moment for the `writeStream` to instantiate and run the command again.
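Rather than rerunning the cell by hand, the wait-and-retry step could be automated. The following is a minimal sketch of a generic retry helper; `retry`, its parameters, and the `register_table` stand-in are all hypothetical and not part of the course includes.

```python
import time


def retry(action, attempts=5, delay_seconds=2.0, retry_on=(Exception,)):
    """Call action() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # give the stream time to write
    raise last_error


# Stand-in for the CREATE TABLE call: fails twice, then succeeds.
attempts_seen = []


def register_table():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise RuntimeError("Table schema is not set.")
    return "registered"


result = retry(register_table, attempts=5, delay_seconds=0)
print(result, len(attempts_seen))
```

In the notebook, `action` could wrap the `spark.sql` call that issues `CREATE TABLE`, with `retry_on` narrowed to `pyspark.sql.utils.AnalysisException` so that unrelated errors still fail immediately.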

We could now use the `health_tracker_gold_aggregate_heartrate` Delta table to define a dashboard. The query used to create the table could be issued nightly to prepare the dashboard for the following business day, or as often as needed according to SLA requirements.

## Stop All Streams

In the next notebook, you will harden the Silver to Gold step.

Before we do so, let's shut down all streams in this notebook.

In [0]:
stop_all_streams()


Out[26]: True

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>