# Auto Load Data to Multiplex Bronze

The chief architect has decided that rather than connecting directly to Kafka, a source system will send raw records as JSON files to cloud object storage. In this notebook, you'll build a multiplex table that will ingest these records with Auto Loader and store the entire history of this incremental feed. The initial table will store data from all of our topics and have the following schema. 

| Field | Type |
| --- | --- |
| key | BINARY |
| value | BINARY |
| topic | STRING |
| partition | LONG |
| offset | LONG |
| timestamp | LONG |
| date | DATE |
| week_part | STRING |

This single table will drive the majority of the data through the target architecture, feeding three interdependent data pipelines.
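
As a rough illustration of the multiplex idea, the sketch below keeps records from several topics in one collection and routes them to per-topic consumers by filtering on **`topic`**. It uses plain Python rather than Spark, and the topic names and payloads are hypothetical, invented only for this example.

```python
from collections import defaultdict

# Hypothetical rows shaped like a subset of the multiplex bronze schema.
# Topic names and payloads are invented for illustration only.
bronze_rows = [
    {"topic": "topic_a", "key": b"1", "value": b'{"x": 1}', "offset": 0},
    {"topic": "topic_b", "key": b"7", "value": b'{"y": "hi"}', "offset": 0},
    {"topic": "topic_a", "key": b"2", "value": b'{"x": 2}', "offset": 1},
]


def demultiplex(rows):
    """Group rows by topic, preserving arrival order within each topic."""
    by_topic = defaultdict(list)
    for row in rows:
        by_topic[row["topic"]].append(row)
    return dict(by_topic)


streams = demultiplex(bronze_rows)
print(sorted(streams))          # topics found in the shared table
print(len(streams["topic_a"]))  # rows routed to one downstream consumer
```

In the real architecture each downstream pipeline would apply the equivalent filter on the **`topic`** column of the bronze table and then parse **`value`** with a topic-specific schema.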

<img src="https://files.training.databricks.com/images/ade/ADE_arch_bronze.png" width="60%" />

**NOTE**: Details on additional configurations for connecting to Kafka are available <a href="https://docs.databricks.com/spark/latest/structured-streaming/kafka.html" target="_blank">here</a>.


## Learning Objectives

By the end of this lesson, you should be able to:
- Describe a multiplex design
- Apply Auto Loader to incrementally process records
- Configure trigger intervals
- Use "trigger-available-now" logic to execute triggered incremental loading of data

The following cell declares the paths needed throughout this notebook.

In [0]:
%run ../Includes/Classroom-Setup-3.1

<img src="https://files.training.databricks.com/images/icon_warn_24.png"> All records are being stored on the DBFS root for this training example.

Setting up separate databases and storage accounts for different layers
of data is preferred in both development and production environments.

## Examine Source Data

Data files are being written to the path specified by the variable below.

Run the following cell to examine the schema in the source data and determine if anything needs to be changed as it's being ingested.

In [0]:
spark.read.json(DA.paths.source_daily).printSchema()

## Prepare Data to Join with Date Lookup Table
The initialization script has loaded a **`date_lookup`** table that holds a number of pre-computed date values. Additional fields indicating holidays or financial quarters are often added to a table like this for later data enrichment.

Pre-computing and storing these values is especially important because we want to partition our data by year and week, using the string pattern **`YYYY-WW`**. While Spark has both **`year`** and **`weekofyear`** functions built in, the **`weekofyear`** function may not provide expected behavior for dates falling in the last week of December or <a href="https://spark.apache.org/docs/2.3.0/api/sql/#weekofyear" target="_blank">first week of January</a>, as it defines week 1 as the first week with more than 3 days.

While this edge case is specific to how Spark numbers weeks, a **`date_lookup`** table that is shared across the organization is important for making sure that data is consistently enriched with date-related details.
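
To make the year-boundary problem concrete, here is a small sketch in plain Python. It relies on **`date.isocalendar()`**, which follows the same ISO 8601 week rule (weeks start on Monday, week 1 is the first week with more than 3 days) that the Spark documentation describes for **`weekofyear`**. The helper names are hypothetical, and the ISO-year variant is only one consistent alternative, not necessarily how **`date_lookup`** was populated.

```python
from datetime import date


def naive_week_part(d: date) -> str:
    """Pair the calendar year with the ISO week number (misleading at year boundaries)."""
    iso_week = d.isocalendar()[1]
    return f"{d.year}-{iso_week:02d}"


def iso_week_part(d: date) -> str:
    """Pair the ISO week number with the ISO year it belongs to."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}-{iso_week:02d}"


for d in (date(2019, 12, 30), date(2021, 1, 1)):
    print(d, naive_week_part(d), iso_week_part(d))
# 2019-12-30 2019-01 2020-01
# 2021-01-01 2021-53 2020-53
```

With the naive pairing, the last days of December 2019 land in the same partition as the first days of January 2019, which is the kind of surprise a shared lookup table avoids.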

In [0]:
%sql

DESCRIBE date_lookup

The current table being implemented requires that we capture the accurate **`week_part`** for each **`date`**.

The following call creates the **`DataFrame`** needed for the subsequent join operation.

In [0]:
date_lookup_df = spark.table("date_lookup").select("date", "week_part")

Working with the JSON data stored in the **`DA.paths.source_daily`** location, transform the **`timestamp`** column as necessary to join it with the **`date`** column.

In [0]:
# TODO
from pyspark.sql import functions as F
json_df = spark.read.json(DA.paths.source_daily)
 
joined_df = (json_df.join(F.broadcast(date_lookup_df),
                          FILL_IN,  # Insert the matching condition
                          "left"))
 
display(joined_df)
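
The conversion behind the join key can be sketched outside Spark. The example below assumes **`timestamp`** holds epoch milliseconds (the join condition in **`process_bronze`** later in this notebook divides it by 1000) and assumes UTC, whereas Spark would use the session time zone. The lookup contents and helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_date(epoch_millis: int) -> date:
    """Convert epoch milliseconds to a calendar date (UTC is an assumption)."""
    return (EPOCH + timedelta(milliseconds=epoch_millis)).date()


# Hypothetical stand-in for the date_lookup table: date -> week_part.
date_lookup = {date(2019, 12, 1): "2019-48"}


def left_join_week_part(record: dict) -> dict:
    """Mimic a left join: keep every record, with week_part None when no date matches."""
    d = millis_to_date(record["timestamp"])
    return {**record, "date": d, "week_part": date_lookup.get(d)}


matched = left_join_week_part({"topic": "topic_a", "timestamp": 1575158400000})
unmatched = left_join_week_part({"topic": "topic_a", "timestamp": 0})
print(matched["date"], matched["week_part"])
print(unmatched["date"], unmatched["week_part"])
```

Because the join is a left join, a record whose date is missing from the lookup is still kept, just without a **`week_part`** value.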

## Define Triggered Incremental Auto Loading to Multiplex Bronze Table

Below is starter code for a function to incrementally process data from the source directory to the bronze table, creating the table during the initial write.

Fill in the missing code to:
- Configure the stream to use Auto Loader
- Configure Auto Loader to use the JSON format
- Perform a broadcast join with the **`date_lookup`** table
- Partition the data by the **`topic`** and **`week_part`** fields

In [0]:
# TODO
def process_bronze():
    query = (spark.readStream
                  .FILL_IN
                  .FILL_IN
                  .option("cloudFiles.schemaLocation", f"{DA.paths.checkpoints}/bronze_schema")
                  .load(DA.paths.source_daily)
                  .join(F.broadcast(date_lookup_df), F.to_date((F.col("timestamp")/1000).cast("timestamp")) == F.col("date"), "left")
                  .writeStream
                  .option("checkpointLocation", f"{DA.paths.checkpoints}/bronze")
                  .partitionBy(FILL_IN)
                  .trigger(availableNow=True)
                  .table("bronze"))
 
    query.awaitTermination()
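
The behavior of a triggered, checkpointed load can be modeled with a toy example. The sketch below is not how Auto Loader is implemented. It is a simplified plain-Python model, with hypothetical function and file names, of the idea that each run ingests only files it has not recorded before and then stops.

```python
import json
import tempfile
from pathlib import Path


def process_available(source_dir: Path, checkpoint: Path, sink: list) -> int:
    """Ingest every file not yet recorded in the checkpoint, then stop."""
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    new_files = sorted(p for p in source_dir.glob("*.json") if p.name not in seen)
    for path in new_files:
        # Each source file holds one JSON record per line.
        for line in path.read_text().splitlines():
            if line.strip():
                sink.append(json.loads(line))
        seen.add(path.name)
    # Persist progress so the next run skips files already ingested.
    checkpoint.write_text(json.dumps(sorted(seen)))
    return len(new_files)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    checkpoint = Path(tmp) / "checkpoint.json"
    bronze = []

    (source / "day1.json").write_text('{"offset": 0}\n{"offset": 1}\n')
    first = process_available(source, checkpoint, bronze)   # picks up day1.json

    second = process_available(source, checkpoint, bronze)  # nothing new to read

    (source / "day2.json").write_text('{"offset": 2}\n')
    third = process_available(source, checkpoint, bronze)   # picks up only day2.json

print(first, second, third, len(bronze))  # 1 0 1 3
```

Calling **`process_bronze()`** repeatedly follows the same pattern: the checkpoint lets each run resume where the previous one ended, so rerunning without new files adds no records.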

Run the cell below to process an incremental batch of data.

In [0]:
process_bronze()

Review the count of processed records.

In [0]:
%sql
SELECT COUNT(*) FROM bronze

Preview the data to ensure records are being ingested correctly.

In [0]:
%sql
SELECT * FROM bronze

The **`DA.daily_stream.load()`** call below is a helper method that lands new data in the source directory.

Executing the following cell should successfully process a new batch.

In [0]:
DA.daily_stream.load()

In [0]:
process_bronze()

Confirm the count is now higher.

In [0]:
%sql
SELECT COUNT(*) FROM bronze

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
DA.cleanup()