
<div  style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://raw.githubusercontent.com/derar-alhussein/Databricks-Certified-Data-Engineer-Professional/main/Includes/images/customers_orders.png" width="60%">
</div>

In [0]:
%run ../Includes/Copy-Datasets

Data catalog: workspace
Schema: bookstore_eng_pro


In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

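def batch_upsert(microBatchDF, batchId):
    # Rank the change records for each (order_id, customer_id) pair, newest commit first.
    window = Window.partitionBy("order_id", "customer_id").orderBy(F.col("_commit_timestamp").desc())

    # Keep only inserts and post-update images, then the top-ranked (most recent) change per key.
    (microBatchDF.filter(F.col("_change_type").isin(["insert", "update_postimage"]))
                 .withColumn("rank", F.rank().over(window))
                 .filter("rank = 1")
                 .drop("rank", "_change_type", "_commit_version")
                 .withColumnRenamed("_commit_timestamp", "processed_timestamp")
                 .createOrReplaceTempView("ranked_updates"))

    # Update a matched row only when the incoming change is newer; insert unmatched rows.
    query = """
        MERGE INTO customers_orders c
        USING ranked_updates r
        ON c.order_id=r.order_id AND c.customer_id=r.customer_id
            WHEN MATCHED AND c.processed_timestamp < r.processed_timestamp
              THEN UPDATE SET *
            WHEN NOT MATCHED
              THEN INSERT *
    """

    # Run the MERGE in the micro-batch's own Spark session, where the temp view was registered.
    microBatchDF.sparkSession.sql(query)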
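
The upsert logic in `batch_upsert` can be hard to picture without a running cluster. The following is a minimal plain-Python sketch of the same idea, not the Spark or Delta implementation: dictionaries stand in for the target table and the micro-batch, and the helper names `latest_per_key` and `merge` are hypothetical. One deliberate difference: on a timestamp tie this sketch keeps the first row it sees, whereas `F.rank()` would keep every tied row.

```python
def latest_per_key(rows):
    """Keep, for each (order_id, customer_id), the row with the newest timestamp."""
    latest = {}
    for row in rows:
        key = (row["order_id"], row["customer_id"])
        current = latest.get(key)
        if current is None or row["processed_timestamp"] > current["processed_timestamp"]:
            latest[key] = row
    return latest


def merge(target, updates):
    """Mimic the MERGE: update only when the incoming row is newer, insert when unmatched."""
    for key, row in latest_per_key(updates).items():
        existing = target.get(key)
        if existing is None:
            target[key] = row  # WHEN NOT MATCHED THEN INSERT
        elif existing["processed_timestamp"] < row["processed_timestamp"]:
            target[key] = row  # WHEN MATCHED AND older THEN UPDATE
    return target


# Illustrative data only; integer timestamps keep the example short.
target = {}
merge(target, [
    {"order_id": "1", "customer_id": "C1", "processed_timestamp": 1, "city": "A"},
    {"order_id": "1", "customer_id": "C1", "processed_timestamp": 2, "city": "B"},
])
# A late-arriving, older change must not overwrite the newer row.
merge(target, [
    {"order_id": "1", "customer_id": "C1", "processed_timestamp": 1, "city": "Z"},
])
print(target[("1", "C1")]["city"])  # B
```

The timestamp guard is what makes a replayed or out-of-order change harmless in this sketch: an older record for an existing key is simply ignored.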

In [0]:
%sql
CREATE TABLE IF NOT EXISTS customers_orders
(
  order_id STRING,
  order_timestamp TIMESTAMP,
  customer_id STRING,
  quantity BIGINT,
  total BIGINT,
  books ARRAY<STRUCT<book_id STRING, quantity BIGINT, subtotal BIGINT>>,
  email STRING,
  first_name STRING,
  last_name STRING,
  gender STRING,
  street STRING,
  city STRING,
  country STRING,
  row_time TIMESTAMP,
  processed_timestamp TIMESTAMP
)

In [0]:
def process_customers_orders():
    orders_df = spark.readStream.table("orders_silver")
    
    cdf_customers_df = (spark.readStream
                             .option("readChangeData", True)
                             .option("startingVersion", 2)
                             .table("customers_silver")
                       )

    query = (orders_df
                .join(cdf_customers_df, ["customer_id"], "inner")
                .writeStream
                    .foreachBatch(batch_upsert)
                    .option("checkpointLocation", f"{bookstore.checkpoint_path}/customers_orders")
                    .trigger(availableNow=True)
                    .start()
            )
    
    query.awaitTermination()
    
process_customers_orders()

In `process_customers_orders()`, the call `.foreachBatch(batch_upsert)` is part of the PySpark Structured Streaming API. It registers a function (here, `batch_upsert`) that Spark calls once for every micro-batch of the stream. Note that the function is passed by name, without parentheses; Spark invokes it later and supplies two arguments:

-   The first argument is the micro-batch DataFrame (`microBatchDF`), which contains the data for that batch.
-   The second argument is the batch ID (`batchId`), a number that identifies the micro-batch.

This is why no arguments appear in the code: whenever a new micro-batch is ready, Spark calls `batch_upsert(microBatchDF, batchId)` on your behalf. This is the standard pattern for using `.foreachBatch()` in PySpark streaming.
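
The calling convention described above can be illustrated without Spark. The sketch below is only an analogy: `run_micro_batches` is a hypothetical stand-in for the streaming engine, plain lists stand in for DataFrames, and the batch IDs are simply enumerated from 0 for illustration.

```python
def run_micro_batches(batches, handler):
    """Hypothetical stand-in for the engine: call handler once per batch."""
    for batch_id, batch in enumerate(batches):
        # The engine, not the user, supplies both arguments.
        handler(batch, batch_id)


seen = []


def record_batch(micro_batch, batch_id):
    """A handler with the same two-parameter shape as batch_upsert."""
    seen.append((batch_id, len(micro_batch)))


# The handler is passed by name, without parentheses, just like
# .foreachBatch(batch_upsert).
run_micro_batches([["a", "b"], ["c"]], record_batch)
print(seen)  # [(0, 2), (1, 1)]
```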

In [0]:
%sql
SELECT * FROM customers_orders

order_id,order_timestamp,customer_id,quantity,total,books,email,first_name,last_name,gender,street,city,country,row_time,processed_timestamp
4904,2022-03-14T11:07:00.000Z,C00509,2,55,"List(List(B03, 1, 35), List(B04, 1, 20))",,Evey,Feore,Female,09166 Talisman Lane,Farah,Afghanistan,2022-03-14T05:07:00.000Z,2025-12-30T14:29:35.000Z
4926,2022-03-15T18:07:00.000Z,C00509,1,33,"List(List(B07, 1, 33))",,Evey,Feore,Female,09166 Talisman Lane,Farah,Afghanistan,2022-03-14T05:07:00.000Z,2025-12-30T14:29:35.000Z
4928,2022-03-15T20:07:00.000Z,C00509,1,33,"List(List(B07, 1, 33))",,Evey,Feore,Female,09166 Talisman Lane,Farah,Afghanistan,2022-03-14T05:07:00.000Z,2025-12-30T14:29:35.000Z
5000,2022-03-20T00:07:00.000Z,C00509,0,0,List(),,Evey,Feore,Female,09166 Talisman Lane,Farah,Afghanistan,2022-03-14T05:07:00.000Z,2025-12-30T14:29:35.000Z
5025,2022-03-21T09:07:00.000Z,C00509,2,62,"List(List(B09, 1, 24), List(B11, 1, 38))",,Evey,Feore,Female,09166 Talisman Lane,Farah,Afghanistan,2022-03-14T05:07:00.000Z,2025-12-30T14:29:35.000Z
4815,2022-03-04T12:06:00.000Z,C00439,3,83,"List(List(B02, 1, 28), List(B07, 1, 33), List(B06, 1, 22))",,Tallou,Duffitt,Female,7083 Judy Center,Greda,Croatia,2022-02-25T07:06:00.000Z,2025-12-30T14:29:35.000Z
4844,2022-03-08T23:06:00.000Z,C00439,2,68,"List(List(B09, 1, 24), List(B10, 1, 44))",,Tallou,Duffitt,Female,7083 Judy Center,Greda,Croatia,2022-02-25T07:06:00.000Z,2025-12-30T14:29:35.000Z
4857,2022-03-10T04:06:00.000Z,C00439,2,73,"List(List(B09, 1, 24), List(B01, 1, 49))",,Tallou,Duffitt,Female,7083 Judy Center,Greda,Croatia,2022-02-25T07:06:00.000Z,2025-12-30T14:29:35.000Z
4889,2022-03-13T02:07:00.000Z,C00504,1,49,"List(List(B01, 1, 49))",ffeldmusco@so-net.ne.jp,Freddi,Feldmus,Female,35916 Sachs Crossing,Baoluan,China,2022-03-27T02:09:00.000Z,2025-12-30T14:29:35.000Z
4911,2022-03-14T16:07:00.000Z,C00504,2,42,"List(List(B04, 1, 20), List(B06, 1, 22))",ffeldmusco@so-net.ne.jp,Freddi,Feldmus,Female,35916 Sachs Crossing,Baoluan,China,2022-03-27T02:09:00.000Z,2025-12-30T14:29:35.000Z


In [0]:
bookstore.load_new_data()
bookstore.process_bronze()
bookstore.process_orders_silver()
bookstore.process_customers_silver()

process_customers_orders()

Loading kafka-streaming-05.json file to the bookstore dataset


25/12/30 14:52:22 Spark Server has not sent updates for Streaming Query c88194a3-ced4-424e-8149-7f1f9e4bb0f1 in60 seconds, but the query is still active. Marking query as in-progress. Spark Session ID is 46e43e03-dfca-48a8-88f8-26485c734512. This is typically not a problem.

In [0]:
%sql
SELECT count(*) FROM customers_orders

count(*)
842
