
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Raw to Bronze Pattern

## Notebook Objective

In this notebook we:
1. Ingest Raw Data
2. Augment the data with Ingestion Metadata
3. Batch write the augmented data to a Bronze Table

## Step Configuration

In [0]:
%run ./includes/configuration

### Display the Files in the Raw Path

In [0]:
display(dbutils.fs.ls(rawPath))

## Make Notebook Idempotent

In [0]:
dbutils.fs.rm(bronzePath, recurse=True)

## Ingest Raw Data

Next, we will read files from the source directory and write each line as a string to the Bronze table.

🤠 You should do this as a batch load using `spark.read`.

Read the files using the `"text"` format and the provided schema.

In [0]:
# ANSWER
kafka_schema = "value STRING"

raw_health_tracker_data_df = (
    spark.read.format("text").schema(kafka_schema).load(rawPath)
)

## Display the Raw Data

🤓 Each row here is a raw string in JSON format, as would be passed by a stream server like Kafka.

In [0]:
display(raw_health_tracker_data_df)
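
As a rough illustration of why the Bronze table keeps each record as an unparsed string, the sketch below treats one JSON line as an opaque value and parses it only afterwards, the way a later pipeline stage might. The record and its field names (`name`, `heartrate`, `device_id`) are hypothetical and are not taken from the actual dataset.

```python
import json

# Hypothetical record: the field names are illustrative assumptions,
# not the real schema of the health tracker data.
raw_value = '{"name": "Example User", "heartrate": 52.8, "device_id": 0}'

# At the Bronze stage the record is stored exactly as received: one string.
assert isinstance(raw_value, str)

# Parsing is deferred to a later stage, so a malformed record would still
# be captured in Bronze instead of failing the ingestion.
parsed = json.loads(raw_value)
print(parsed["name"], parsed["heartrate"])
```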

## Ingestion Metadata

As part of the ingestion process, we record metadata for the ingestion.

**EXERCISE:** Add metadata to the incoming raw data. You should add the following columns:

- data source (`datasource`), use `"files.training.databricks.com"`
- ingestion time (`ingesttime`)
- status (`status`), use `"new"`
- ingestion date (`ingestdate`)

In [0]:
# ANSWER
from pyspark.sql.functions import current_timestamp, lit

raw_health_tracker_data_df = raw_health_tracker_data_df.select(
    "value",
    lit("files.training.databricks.com").alias("datasource"),
    current_timestamp().alias("ingesttime"),
    lit("new").alias("status"),
    current_timestamp().cast("date").alias("ingestdate"),
)
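
To make the shape of the augmented rows concrete, here is a plain-Python analogue of the step above. It is only a sketch: the helper `add_ingestion_metadata` and its `now` parameter are hypothetical names introduced for illustration, and it uses one timestamp for the whole batch on the assumption that `current_timestamp()` returns the same value for every row within a single Spark query.

```python
from datetime import datetime, timezone


def add_ingestion_metadata(values, datasource, now=None):
    """Attach ingestion metadata to each raw string (hypothetical helper)."""
    # One timestamp per batch: every row shares the same ingesttime.
    now = now or datetime.now(timezone.utc)
    return [
        {
            "value": value,
            "datasource": datasource,
            "ingesttime": now,
            "status": "new",
            # The ingestion date is the timestamp truncated to a date.
            "ingestdate": now.date(),
        }
        for value in values
    ]


rows = add_ingestion_metadata(
    ['{"id": 1}', '{"id": 2}'],
    "files.training.databricks.com",
    now=datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(rows[0]["ingestdate"], rows[0]["status"])
```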

## Write Batch to a Bronze Table

Finally, we write to the Bronze Table.

Make sure to write the columns in the correct order (`"datasource"`, `"ingesttime"`, `"value"`, `"status"`, `"p_ingestdate"`).

Make sure to use the following options:

- the `"delta"` format
- the append mode
- partitioning by `p_ingestdate`

In [0]:
# ANSWER
from pyspark.sql.functions import col

(
    raw_health_tracker_data_df.select(
        "datasource",
        "ingesttime",
        "value",
        "status",
        col("ingestdate").alias("p_ingestdate"),
    )
    .write.format("delta")
    .mode("append")
    .partitionBy("p_ingestdate")
    .save(bronzePath)
)

In [0]:
display(dbutils.fs.ls(bronzePath))
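
The listing typically shows a `_delta_log` directory plus one directory per distinct partition value, named in the `column=value` style. The sketch below imitates that grouping in plain Python. The helper `partition_dirs` and the sample rows are hypothetical, and the exact layout on disk depends on the Delta Lake version and table settings.

```python
from collections import defaultdict
from datetime import date


def partition_dirs(rows, partition_col="p_ingestdate"):
    """Group rows under directory names of the form column=value (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        directory = f"{partition_col}={row[partition_col].isoformat()}"
        groups[directory].append(row)
    return dict(groups)


# Illustrative rows only; real rows also carry datasource, ingesttime, and status.
sample_rows = [
    {"value": '{"id": 1}', "p_ingestdate": date(2020, 1, 1)},
    {"value": '{"id": 2}', "p_ingestdate": date(2020, 1, 1)},
    {"value": '{"id": 3}', "p_ingestdate": date(2020, 1, 2)},
]

layout = partition_dirs(sample_rows)
for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

Because this notebook ingests everything in a single run, you would normally expect just one such partition directory, for the date of that run.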

## Register the Bronze Table in the Metastore

The table should be named `health_tracker_classic_bronze`.

In [0]:
# ANSWER
spark.sql(
    """
DROP TABLE IF EXISTS health_tracker_classic_bronze
"""
)

spark.sql(
    f"""
CREATE TABLE health_tracker_classic_bronze
USING DELTA
LOCATION "{bronzePath}"
"""
)

## Display Classic Bronze Table

Run this query to display the contents of the Classic Bronze Table.

In [0]:
%sql

SELECT * FROM health_tracker_classic_bronze

### Query Broken Records


Run a SQL query to display just the incoming records for "Gonzalo Valdés".

🧠 You can use the SQL operator `RLIKE`, which is short for regex `LIKE`,
to create your matching predicate.

[`RLIKE` documentation](https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#rlike)

In [0]:
%sql

SELECT * FROM health_tracker_classic_bronze WHERE value RLIKE 'Gonzalo Valdés'
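
`RLIKE` is true when the pattern matches anywhere in the string, so it works on the raw JSON text without parsing it. The closest plain-Python analogue is `re.search`, sketched below on hypothetical records. Spark uses Java regular expressions, so some pattern syntax can differ from Python's `re` module.

```python
import re

# Hypothetical raw records; the field names are illustrative assumptions.
values = [
    '{"name": "Gonzalo Valdés", "heartrate": 60.1}',
    '{"name": "Example User", "heartrate": 58.3}',
]

pattern = re.compile("Gonzalo Valdés")

# re.search succeeds if the pattern occurs anywhere in the string,
# which is how the RLIKE predicate above selects rows.
matches = [value for value in values if pattern.search(value)]
print(len(matches))
```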

### What do you notice?

### Display the User Dimension Table


Run a SQL query to display the records in `health_tracker_user`.

In [0]:
%sql

SELECT * FROM health_tracker_user

## Purge Raw File Path

We loaded the raw files with a batch read, whereas the Plus pipeline used streaming.

Because a batch read does not use checkpointing, it does not track which files have already been ingested.

We therefore need to purge the loaded raw files manually so that they are not ingested again on the next run.

In [0]:
dbutils.fs.rm(rawPath, recurse=True)
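
The effect of purging can be sketched in plain Python with a temporary directory standing in for the raw path. The helper `ingest_and_purge` and the file name `batch1.json` are hypothetical. The point is only that a second run finds no files, so nothing is appended twice.

```python
import tempfile
from pathlib import Path


def ingest_and_purge(raw_dir, bronze):
    """Append every line of every raw file to `bronze`, then delete the file."""
    # sorted() materializes the listing before any file is removed.
    for path in sorted(Path(raw_dir).iterdir()):
        if path.is_file():
            bronze.extend(path.read_text(encoding="utf-8").splitlines())
            path.unlink()
    return bronze


with tempfile.TemporaryDirectory() as raw_dir:
    Path(raw_dir, "batch1.json").write_text(
        '{"id": 1}\n{"id": 2}\n', encoding="utf-8"
    )
    bronze = ingest_and_purge(raw_dir, [])
    # Second run: the raw directory is empty, so no duplicates are appended.
    bronze = ingest_and_purge(raw_dir, bronze)

print(len(bronze))
```

Without the purge, the second run would read `batch1.json` again and the append would duplicate its records.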


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>