# Raw to Bronze Pattern

## Notebook Objective

In this notebook we:
1. Ingest Raw Data
2. Augment the data with Ingestion Metadata
3. Stream write the augmented data to a Bronze Table

## Step Configuration

In [0]:
%run ./includes/configuration

Out[25]: DataFrame[]

All streams stopped.


### Display the Files in the Raw Path

In [0]:
display(dbutils.fs.ls(rawPath))

path,name,size
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/raw/health_tracker_data_2020_1.json,health_tracker_data_2020_1.json,310628


## Make Notebook Idempotent

In [0]:
dbutils.fs.rm(bronzePath, recurse=True)
dbutils.fs.rm(bronzeCheckpoint, recurse=True)

Out[29]: True

## Ingest raw data

Next, we will stream files from the source directory and write each line as a string to the Bronze table.

In [0]:
kafka_schema = "value STRING"

raw_health_tracker_data_df = (
    spark.readStream.format("text").schema(kafka_schema).load(rawPath)
)

**Exercise:** Write an Assertion Statement to Verify the Schema of the Raw Data

At this point, we write an assertion statement to verify that our streaming DataFrame has the schema we expect.

Your assertion should make sure that the `raw_health_tracker_data_df` DataFrame has the correct schema.

🤠 The function `_parse_datatype_string` (read more [here](http://spark.apache.org/docs/2.1.2/api/python/_modules/pyspark/sql/types.html)) converts a DDL format schema string into a Spark schema.

In [0]:
from pyspark.sql.types import _parse_datatype_string
assert raw_health_tracker_data_df.schema == _parse_datatype_string(kafka_schema), "Schema of raw_health_tracker_data_df does not match kafka_schema"
print("Assertion passed.")

Assertion passed.


## Display the Raw Data

🤓 Each row here is a raw string in JSON format, as would be passed by a stream server like Kafka.

In [0]:
display(raw_health_tracker_data_df, streamName="display_raw")

value
"{""time"":1577836800.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":52.8139067501}"
"{""time"":1577840400.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":53.9078900098}"
"{""time"":1577844000.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":52.7129593616}"
"{""time"":1577847600.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":52.2880422685}"
"{""time"":1577851200.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":52.5156095386}"
"{""time"":1577854800.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":53.6280743846}"
"{""time"":1577858400.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":52.1760037066}"
"{""time"":1577862000.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":90.0456721836}"
"{""time"":1577865600.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":89.4695644522}"
"{""time"":1577869200.0,""name"":""Deborah Powell"",""device_id"":0,""heartrate"":88.1490304138}"


❗️ To prevent the `display` function from continuously streaming, run the following utility function.

In [0]:
stop_named_stream(spark, "display_raw")

Out[33]: True
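
To make the shape of each row concrete, here is a minimal plain-Python sketch that decodes one `value` string copied from the output above. The helper name `parse_raw_value` and the added `event_time` key are hypothetical, and reading `time` as seconds since the Unix epoch is an assumption. The Bronze table itself keeps the string unparsed.

```python
import json
from datetime import datetime, timezone

# One raw `value` string, copied from the displayed output above.
raw_value = '{"time":1577836800.0,"name":"Deborah Powell","device_id":0,"heartrate":52.8139067501}'


def parse_raw_value(value: str) -> dict:
    """Decode one raw JSON line into a Python dict (hypothetical helper)."""
    record = json.loads(value)
    # Assumption: `time` is seconds since the Unix epoch.
    record["event_time"] = datetime.fromtimestamp(record["time"], tz=timezone.utc)
    return record


record = parse_raw_value(raw_value)
print(record["name"], record["device_id"], record["event_time"].isoformat())
```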

## Ingestion Metadata

As part of the ingestion process, we record metadata for the ingestion. In this case, we track the data source (`datasource`), the ingestion time (`ingesttime`), and the ingest date (`ingestdate`) using the `pyspark.sql` functions `current_timestamp` and `lit`.

In [0]:
from pyspark.sql.functions import current_timestamp, lit

raw_health_tracker_data_df = raw_health_tracker_data_df.select(
    lit("files.training.databricks.com").alias("datasource"),
    current_timestamp().alias("ingesttime"),
    "value",
    current_timestamp().cast("date").alias("ingestdate"),
)
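
As a rough illustration of what the `select` above produces for each row, the following plain-Python sketch builds one augmented record. The helper `add_ingestion_metadata` is hypothetical, and the timestamp is an illustrative value; in the notebook, Spark supplies it through `current_timestamp()`.

```python
from datetime import datetime, timezone


def add_ingestion_metadata(value: str, datasource: str, ingested_at: datetime) -> dict:
    """Return one Bronze-style row as a dict (hypothetical helper, plain Python)."""
    return {
        "datasource": datasource,          # constant label, like lit(...)
        "ingesttime": ingested_at,         # full timestamp of ingestion
        "value": value,                    # the raw line, left untouched
        "ingestdate": ingested_at.date(),  # the timestamp cast to a date
    }


row = add_ingestion_metadata(
    value='{"time":1577836800.0,"name":"Deborah Powell","device_id":0,"heartrate":52.8139067501}',
    datasource="files.training.databricks.com",
    ingested_at=datetime(2022, 3, 25, 5, 44, 17, tzinfo=timezone.utc),  # illustrative
)
print(row["ingestdate"])
```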

## Write Stream to a Bronze Table

Finally, we write to the Bronze Table using Structured Streaming.

🙅🏽‍♀️ While we _can_ write directly to tables using the `.table()` notation, this will create fully managed tables by writing output to a default location on DBFS. This is not best practice and should be avoided in nearly all cases.

### Partitioning
This course uses a dataset that is extremely small relative to an actual production system. Still, we demonstrate the best practice of partitioning by date: the Bronze table is partitioned on the ingestion date, stored in the column `p_ingestdate`.

😲 Note that we have aliased the `ingestdate` column to be `p_ingestdate`. We have done this in order to inform anyone who looks at the schema for this table that it has been partitioned by the ingestion date.

In [0]:
from pyspark.sql.functions import col

(
    raw_health_tracker_data_df.select(
        "datasource", "ingesttime", "value", col("ingestdate").alias("p_ingestdate")
    )
    .writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", bronzeCheckpoint)
    .partitionBy("p_ingestdate")
    .queryName("write_raw_to_bronze")
    .start(bronzePath)
)

Out[35]: <pyspark.sql.streaming.StreamingQuery at 0x7fd2dcf39c10>

### Checkpointing

When defining a Delta Lake streaming query, one of the options that you need to specify is the location of a checkpoint directory.

`.writeStream.format("delta").option("checkpointLocation", <path-to-checkpoint-directory>) ...`

Checkpointing is a Structured Streaming feature rather than something specific to Delta Lake. The checkpoint directory stores the current state of your streaming job.

Should your streaming job stop for some reason and you restart it, it will continue from where it left off.

💀 If you do not have a checkpoint directory, when the streaming job stops, you lose all state around your streaming job and upon restart, you start from scratch.

✋🏽 Also note that every streaming job should have its own checkpoint directory: no sharing.
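
The restart behavior described above can be sketched with a toy model in plain Python. This is not how Structured Streaming lays out its checkpoint directory; the function `process_batch` and the `offset.json` file are hypothetical and only mimic the idea of persisting progress between runs.

```python
import json
import tempfile
from pathlib import Path


def process_batch(records, checkpoint_file: Path):
    """Return the records not yet processed, then persist the new offset."""
    # Resume from the saved offset if a checkpoint exists, else start from scratch.
    if checkpoint_file.exists():
        offset = json.loads(checkpoint_file.read_text())["offset"]
    else:
        offset = 0
    new_records = records[offset:]
    checkpoint_file.write_text(json.dumps({"offset": len(records)}))
    return new_records


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "offset.json"
    source = ["row-0", "row-1", "row-2"]

    first_run = process_batch(source, checkpoint)   # no checkpoint yet: all rows
    source.append("row-3")                          # new data arrives while stopped
    second_run = process_batch(source, checkpoint)  # resumes: only the new row

    checkpoint.unlink()                             # simulate a lost checkpoint
    third_run = process_batch(source, checkpoint)   # starts from scratch again

print(first_run, second_run, third_run)
```

With the checkpoint in place, the second run picks up only the new row. Once the checkpoint is deleted, the third run reprocesses everything.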

## Create a Reference to the Delta table files

In this command we create a streaming Spark DataFrame that references the Delta files in DBFS.

In [0]:
bronze_health_tracker = spark.readStream.format("delta").load(bronzePath)

### Troubleshooting

😫 If you try to run this before the `writeStream` above has been created, you may see the following error:

`AnalysisException: Table schema is not set.  Write data into it or use CREATE TABLE to set the schema.;`

If this happens, wait a moment for the `writeStream` to instantiate and run the command again.
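
Instead of rerunning the cell by hand, the wait-and-retry advice can be wrapped in a small helper. This is a minimal sketch in plain Python; `retry` and `flaky_load` are hypothetical names, and `flaky_load` only stands in for the real read. In the notebook you would presumably pass something like `lambda: spark.readStream.format("delta").load(bronzePath)` and retry on Spark's `AnalysisException`.

```python
import time


def retry(action, attempts=5, delay_seconds=1.0, retry_on=(Exception,)):
    """Call `action` until it succeeds or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds)


# Stand-in for the Delta read: fails twice, then succeeds.
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Table schema is not set.")
    return "loaded"


result = retry(flaky_load, attempts=5, delay_seconds=0.01, retry_on=(RuntimeError,))
print(result, calls["count"])
```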

## Display the files in the Delta table

These files can be viewed using the `dbutils.fs.ls` function.

In [0]:
display(dbutils.fs.ls(bronzePath))

path,name,size
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/bronze/_delta_log/,_delta_log/,0
dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/bronze/p_ingestdate=2022-03-25/,p_ingestdate=2022-03-25/,0
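
The `p_ingestdate=2022-03-25/` entry in the listing above follows a `column=value` directory naming pattern for partitions. The sketch below builds and parses such a name in plain Python; both helpers are hypothetical and exist only to illustrate the pattern visible in the listing.

```python
from datetime import date


def partition_dir(column: str, value: date) -> str:
    """Build a `column=value/` directory name for a date partition."""
    return f"{column}={value.isoformat()}/"


def parse_partition_dir(name: str):
    """Split a `column=value/` directory name into (column, value)."""
    column, _, value = name.rstrip("/").partition("=")
    return column, value


name = partition_dir("p_ingestdate", date(2022, 3, 25))
column, value = parse_partition_dir(name)
print(name, column, value)
```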


**Exercise:** Write an Assertion Statement to Verify the Schema of the Bronze Delta Table

At this point, we write an assertion statement to verify that our Bronze Delta table has the schema we expect.

Your assertion should make sure that the `bronze_health_tracker` DataFrame has the correct schema.

💪🏼 Remember, the function `_parse_datatype_string` converts a DDL format schema string into a Spark schema.

In [0]:

assert bronze_health_tracker.schema == _parse_datatype_string("datasource STRING, ingesttime TIMESTAMP, value STRING, p_ingestdate DATE"), "Schema of bronze_health_tracker does not match the expected Bronze schema"
print("Assertion passed.")



Assertion passed.


## Display Running Streams

You can use the following code to display all streams that are currently running.

In [0]:
for stream in spark.streams.active:
    print(stream.name)

write_raw_to_bronze


## Register the Bronze Table in the Metastore

Recall that a Delta table registered in the Metastore is a reference to a physical table created in object storage.

We just created a Bronze Delta table in object storage by writing data to a specific location. If we register that location with the Metastore as a table, we can query the table using SQL.

(Because we will never directly query the Bronze table, it is not strictly necessary to register this table in the Metastore, but we will do so for demonstration purposes.)

At Delta table creation, the Delta files in Object Storage define the schema, partitioning, and table properties. For this reason, it is not necessary to specify any of these when registering the table with the Metastore. Furthermore, no table repair is required. The transaction log stored with the Delta files contains all metadata needed for an immediate query.

In [0]:
spark.sql(
    """
DROP TABLE IF EXISTS health_tracker_plus_bronze
"""
)

spark.sql(
    f"""
CREATE TABLE health_tracker_plus_bronze
USING DELTA
LOCATION "{bronzePath}"
"""
)

Out[43]: DataFrame[]

In [0]:
display(
    spark.sql(
        """
  DESCRIBE DETAIL health_tracker_plus_bronze
  """
    )
)

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,445ccb7a-681d-48a0-81a6-c720a9d48b41,dbacademy_gaurav_chattree.health_tracker_plus_bronze,,dbfs:/dbacademy/gaurav_chattree/dataengineering/plus/bronze,2022-03-25T05:44:14.477+0000,2022-03-25T05:44:17.000+0000,List(p_ingestdate),1,77601,Map(),1,2


## Delta Lake Python API
Delta Lake provides programmatic APIs to examine and manipulate Delta tables.

Here, we create a reference to the Bronze table using the Delta Lake Python API.

In [0]:
from delta.tables import DeltaTable

bronzeTable = DeltaTable.forPath(spark, bronzePath)

In [0]:
display(bronzeTable.history())

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata
0,2022-03-25T05:44:17.000+0000,8047228571528786,gchattre@ur.rochester.edu,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> 733fdb5a-68da-4b0c-9198-0ed8b0802f77, epochId -> 0)",,List(3719025789001790),0325-052223-ncmvp8x,,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 3720, numOutputBytes -> 77601, numAddedFiles -> 1)",


## Stop All Streams

In the next notebook, we will stream data from the Bronze table to a Silver Delta table.

Before we do so, let's shut down all streams in this notebook.

In [0]:
stop_all_streams()


Out[47]: True

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>