
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Silver Table Updates

We have processed data from the Bronze table into the Silver table.

We now need to make several updates to ensure high data quality in the Silver
table. Because batch loading has no mechanism for checkpointing, we
need a way to load _only the new records_ from the Bronze table.

We also need to deal with the quarantined records.

## Notebook Objective

In this notebook we:
1. Update the `read_batch_bronze` function to read only new records
1. Fix the bad quarantined records from the Bronze table
1. Write the repaired records to the Silver table

## Step Configuration

In [0]:
%run ./includes/configuration

## Import Operation Functions

In [0]:
%run ./includes/main/python/operations

### Land More Raw Data

In [0]:
ingest_classic_data(hours=10)

### Current Delta Architecture

Next, we demonstrate everything we have built up to this point in our
Delta Architecture.

### The Raw to Bronze Pipeline

In [0]:
rawDF = read_batch_raw(rawPath)
transformedRawDF = transform_raw(rawDF)
rawToBronzeWriter = batch_writer(
    dataframe=transformedRawDF, partition_column="p_ingestdate"
)

rawToBronzeWriter.save(bronzePath)

### Purge Raw File Path

Manually purge the raw files that have already been loaded.

In [0]:
# ANSWER
dbutils.fs.rm(rawPath, recurse=True)

### The Bronze to Silver Pipeline


In the previous notebook, we ingested only the new data by running:

```
bronzeDF = (
  spark.read
  .table("health_tracker_classic_bronze")
  .filter("status = 'new'")
)
```

**EXERCISE**

Update the function `read_batch_bronze` in the
`includes/main/python/operations` file so that it reads only the new
records in the Bronze table.
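
One possible shape for the updated function is sketched below. It assumes the Bronze table is registered as `health_tracker_classic_bronze` (the name used in the snippet above) and that `read_batch_bronze` takes only the `spark` session, as in the call later in this notebook. The version in your `operations` file may differ.

```python
def read_batch_bronze(spark):
    """Return only the Bronze records that have not been processed yet."""
    # Assumption: unprocessed records carry status = 'new', matching the
    # filter used in the previous notebook.
    return (
        spark.read
        .table("health_tracker_classic_bronze")
        .filter("status = 'new'")
    )
```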

♨️ After updating the `read_batch_bronze` function, re-source the
`includes/main/python/operations` file to include your updates by running the cell below.

In [0]:
%run ./includes/main/python/operations

In [0]:
bronzeDF = read_batch_bronze(spark)
transformedBronzeDF = transform_bronze(bronzeDF)

(silverCleanDF, silverQuarantineDF) = generate_clean_and_quarantine_dataframes(
    transformedBronzeDF
)

bronzeToSilverWriter = batch_writer(
    dataframe=silverCleanDF, partition_column="p_eventdate", exclude_columns=["value"]
)
bronzeToSilverWriter.save(silverPath)

update_bronze_table_status(spark, bronzePath, silverCleanDF, "loaded")
update_bronze_table_status(spark, bronzePath, silverQuarantineDF, "quarantined")
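
The status bookkeeping in this cell is what stands in for a checkpoint. The plain-Python model below is only an illustration: the records are made up, and the quarantine rule (a missing `device_id`) is an assumption rather than the rule used by `generate_clean_and_quarantine_dataframes`. It is not how `update_bronze_table_status` is implemented, but it shows why a second run picks up nothing once every record has left the `new` state.

```python
# Toy model of the Bronze status column. The records and the quarantine
# rule (device_id is None) are illustrative assumptions.
bronze = [
    {"id": 1, "device_id": 0, "status": "new"},
    {"id": 2, "device_id": None, "status": "new"},
    {"id": 3, "device_id": 4, "status": "loaded"},  # handled by an earlier run
]


def run_batch(table):
    """Process only 'new' records, then mark each as loaded or quarantined."""
    batch = [row for row in table if row["status"] == "new"]
    for row in batch:
        row["status"] = "loaded" if row["device_id"] is not None else "quarantined"
    return len(batch)


first_run = run_batch(bronze)   # picks up records 1 and 2
second_run = run_batch(bronze)  # no record is still 'new'
statuses = [row["status"] for row in bronze]
```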

### Perform a Visual Verification of the Silver Table

In [0]:
%sql
SELECT * FROM health_tracker_classic_silver

## Handle Quarantined Records

### Step 1: Load Quarantined Records from the Bronze Table

**EXERCISE**

Load all records from the Bronze table with a status of `"quarantined"`.

In [0]:
# ANSWER

bronzeQuarantinedDF = spark.read.table("health_tracker_classic_bronze").filter(
    "status = 'quarantined'"
)
display(bronzeQuarantinedDF)

### Step 2: Transform the Quarantined Records

This applies the standard Bronze table transformations to the quarantined records.

In [0]:
bronzeQuarTransDF = transform_bronze(bronzeQuarantinedDF, quarantine=True).alias(
    "quarantine"
)
display(bronzeQuarTransDF)

### Step 3: Join Quarantined Data with User Data

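The join matches the `device_id` column of the quarantined records against `user_id` in the `health_tracker_user` table. We do this to retrieve the correct device id associated with each user.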

In [0]:
health_tracker_user_df = spark.read.table("health_tracker_user").alias("user")
repairDF = bronzeQuarTransDF.join(
    health_tracker_user_df,
    bronzeQuarTransDF.device_id == health_tracker_user_df.user_id,
)
display(repairDF)

### Step 4: Select the Correct Device from the Joined `user` DataFrame

In [0]:
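# `col` may already be in scope via the operations include; it is imported
# here so that the cell stands on its own.
from pyspark.sql.functions import col

silverCleanedDF = repairDF.select(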
    col("quarantine.value").alias("value"),
    col("user.device_id").cast("INTEGER").alias("device_id"),
    col("quarantine.steps").alias("steps"),
    col("quarantine.eventtime").alias("eventtime"),
    col("quarantine.name").alias("name"),
    col("quarantine.eventtime").cast("date").alias("p_eventdate"),
)
display(silverCleanedDF)
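
To make Steps 3 and 4 concrete, the plain-Python model below mimics the join and select on two made-up rows. The sample values are hypothetical. The only things taken from the notebook are the join condition (the quarantined `device_id` matched against `user_id`) and the choice of `device_id` from the user side, cast to an integer.

```python
# Hypothetical sample rows; the real data lives in the Bronze and user tables.
quarantine = [
    {"device_id": "user-a", "steps": 120, "name": "A"},
    {"device_id": "user-b", "steps": 75, "name": "B"},
]
users = [
    {"user_id": "user-a", "device_id": "1"},
    {"user_id": "user-b", "device_id": "2"},
]

# Index users by user_id so each quarantined row can look up its match.
# Simplification: this model assumes one device per user.
device_by_user = {u["user_id"]: u["device_id"] for u in users}

repaired = [
    # Keep the quarantined row's fields, but take device_id from the user
    # side and cast it to an integer, as the select in Step 4 does.
    {**q, "device_id": int(device_by_user[q["device_id"]])}
    for q in quarantine
    if q["device_id"] in device_by_user  # inner join: unmatched rows drop out
]
```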

### Step 5: Batch Write the Repaired (formerly Quarantined) Records to the Silver Table

After loading, this will also update the status of the quarantined records
to `loaded`.

In [0]:
bronzeToSilverWriter = batch_writer(
    dataframe=silverCleanedDF, partition_column="p_eventdate", exclude_columns=["value"]
)
bronzeToSilverWriter.save(silverPath)

update_bronze_table_status(spark, bronzePath, silverCleanedDF, "loaded")

### Display the Quarantined Records

If the update was successful, there should be no quarantined records
in the Bronze table.

In [0]:
display(bronzeQuarantinedDF)
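
A programmatic check can complement the visual one. The helper below is a hypothetical addition (its name is not part of the `operations` file). It re-reads the Bronze table, assumed here to be registered as `health_tracker_classic_bronze`, and counts the rows still marked `quarantined`.

```python
def count_quarantined(spark, table_name="health_tracker_classic_bronze"):
    """Count Bronze records whose status is still 'quarantined'."""
    return (
        spark.read
        .table(table_name)
        .filter("status = 'quarantined'")
        .count()
    )
```

In the notebook, a cell containing `assert count_quarantined(spark) == 0` would fail loudly if any quarantined record had not been repaired.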


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>