## AUTOLOADER

#### In this NOTEBOOK, we demonstrate the power of DATABRICKS AUTOLOADER to incrementally ingest data using STREAMING QUERIES.

In [0]:
%python
# List the parquet files currently in the orders-raw directory. This directory is the source for our streaming query.
files = dbutils.fs.ls(f"{dataset_bookstore}/orders-raw")
display(files)

path,name,size,modificationTime
dbfs:/mnt/demo-datasets/bookstore/orders-raw/01.parquet,01.parquet,18823,1716830624000


* To use **AUTOLOADER**, we need to use the *cloudFiles* format.

In [0]:
# AUTOLOADER is a streaming source built on SPARK STRUCTURED STREAMING. It reads files from a directory, detects new files as they arrive, and loads them incrementally.

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "dbfs:/mnt/demo/orders_checkpoint")
    .load(f"{dataset_bookstore}/orders-raw")
    .writeStream
    .option("checkpointLocation", "dbfs:/mnt/demo/orders_checkpoint")
    .table("orders_updates")
)

# The 1st SPIKE in the graphs below represents the data loaded by the command above.
# The 2nd SPIKE appears when new records are loaded into the "orders_updates" STREAMING TABLE. See code blocks 7 and 8 for the commands that add the new data.

Out[9]: <pyspark.sql.streaming.query.StreamingQuery at 0x7fc9745ecc70>
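
To make the idea of incremental ingestion concrete, here is a minimal sketch in plain Python. It is not how AUTOLOADER is implemented; it only illustrates the principle that a checkpoint remembers which files were already ingested, so each new directory listing yields only the unseen files. The class `IncrementalFileTracker` and its method `new_files` are hypothetical names introduced for this illustration.

```python
class IncrementalFileTracker:
    """Toy stand-in for a streaming checkpoint (hypothetical, for illustration only)."""

    def __init__(self):
        # Names of files that have already been ingested.
        self._seen = set()

    def new_files(self, listing):
        """Return the files in `listing` that were not ingested yet, then record them."""
        fresh = [name for name in listing if name not in self._seen]
        self._seen.update(fresh)
        return fresh


tracker = IncrementalFileTracker()

# First listing: only 01.parquet exists in the source directory.
first_batch = tracker.new_files(["01.parquet"])

# Later listing: two more files have arrived; 01.parquet is skipped.
second_batch = tracker.new_files(["01.parquet", "02.parquet", "03.parquet"])

print(first_batch)   # ['01.parquet']
print(second_batch)  # ['02.parquet', '03.parquet']
```

Each file is picked up exactly once, which mirrors why the row count of "orders_updates" grows only by the contents of newly arrived files.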

In [0]:
%sql
SELECT * FROM orders_updates LIMIT 20;

order_id,order_timestamp,customer_id,quantity,total,books,_rescued_data
6341,1657520256,C00788,1,41,"List(List(B08, 1, 41))",
6342,1657520256,C00788,1,41,"List(List(B08, 1, 41))",
6343,1657531717,C00654,1,28,"List(List(B02, 1, 28))",
6344,1657531717,C00654,1,28,"List(List(B02, 1, 28))",
6345,1657543676,C00762,1,49,"List(List(B01, 1, 49))",
6346,1657543676,C00762,1,49,"List(List(B01, 1, 49))",
6347,1657546079,C01014,1,28,"List(List(B02, 1, 28))",
6348,1657546658,C00633,1,24,"List(List(B09, 1, 24))",
6349,1657546658,C00633,1,24,"List(List(B09, 1, 24))",
6350,1657547177,C00638,1,35,"List(List(B03, 1, 35))",


In [0]:
%sql
SELECT count(*) FROM orders_updates;
-- We count the records. There are 1000 of them.

count(1)
1000


In [0]:
# Now we are going to add new data files to our source directory.
load_new_data()

Loading 02.parquet file to the bookstore dataset


In [0]:
load_new_data()

Loading 03.parquet file to the bookstore dataset


In [0]:
# Let us list the contents of our source directory
files = dbutils.fs.ls(f"{dataset_bookstore}/orders-raw")
display(files)

path,name,size,modificationTime
dbfs:/mnt/demo-datasets/bookstore/orders-raw/01.parquet,01.parquet,18823,1716830624000
dbfs:/mnt/demo-datasets/bookstore/orders-raw/02.parquet,02.parquet,18814,1716830711000
dbfs:/mnt/demo-datasets/bookstore/orders-raw/03.parquet,03.parquet,18822,1716830717000


In [0]:
%sql
SELECT count(*) FROM orders_updates;
-- After placing more data in the "orders-raw" directory, we recount the records in the "orders_updates" table. The new data has been loaded into the table automatically, which is exactly what we expect from DATABRICKS AUTOLOADER.

count(1)
3000


In [0]:
%sql
DESCRIBE HISTORY orders_updates

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
3,2024-05-27T17:25:18.000+0000,6344707903279464,arzanishyn@gmail.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> e7e21dbd-0507-4ee4-9baf-2351ce77ef6b, epochId -> 2)",,List(2567944741657932),0527-162333-a6922r5e,2.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 1000, numOutputBytes -> 19059, numAddedFiles -> 1)",,Databricks-Runtime/11.3.x-scala2.12
2,2024-05-27T17:25:13.000+0000,6344707903279464,arzanishyn@gmail.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> e7e21dbd-0507-4ee4-9baf-2351ce77ef6b, epochId -> 1)",,List(2567944741657932),0527-162333-a6922r5e,1.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 1000, numOutputBytes -> 19051, numAddedFiles -> 1)",,Databricks-Runtime/11.3.x-scala2.12
1,2024-05-27T17:24:22.000+0000,6344707903279464,arzanishyn@gmail.com,STREAMING UPDATE,"Map(outputMode -> Append, queryId -> e7e21dbd-0507-4ee4-9baf-2351ce77ef6b, epochId -> 0)",,List(2567944741657932),0527-162333-a6922r5e,0.0,WriteSerializable,True,"Map(numRemovedFiles -> 0, numOutputRows -> 1000, numOutputBytes -> 19060, numAddedFiles -> 1)",,Databricks-Runtime/11.3.x-scala2.12
0,2024-05-27T17:24:16.000+0000,6344707903279464,arzanishyn@gmail.com,CREATE TABLE,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(2567944741657932),0527-162333-a6922r5e,,WriteSerializable,True,Map(),,Databricks-Runtime/11.3.x-scala2.12
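
As a quick cross-check, the `numOutputRows` metric of each STREAMING UPDATE in the history above should add up to the row count of the table. The sketch below copies the relevant values from the output by hand into a plain Python list (the list name `history` and the dictionary layout are assumptions made for this illustration, not something returned by Databricks in this shape).

```python
# Values copied by hand from the DESCRIBE HISTORY output above.
# Version 0 (CREATE TABLE) has an empty operationMetrics map, so it carries no row count.
history = [
    {"version": 3, "operation": "STREAMING UPDATE", "metrics": {"numOutputRows": 1000}},
    {"version": 2, "operation": "STREAMING UPDATE", "metrics": {"numOutputRows": 1000}},
    {"version": 1, "operation": "STREAMING UPDATE", "metrics": {"numOutputRows": 1000}},
    {"version": 0, "operation": "CREATE TABLE", "metrics": {}},
]

# Sum the rows written by every commit; a missing metric counts as 0.
total_rows = sum(entry["metrics"].get("numOutputRows", 0) for entry in history)

# Number of micro-batches that actually wrote data.
streaming_updates = sum(1 for entry in history if entry["operation"] == "STREAMING UPDATE")

print(total_rows)         # 3000
print(streaming_updates)  # 3
```

The three streaming updates correspond to the three parquet files, one micro-batch per file, and the total matches the `count(*)` result of 3000.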


In [0]:
%sql
DROP TABLE orders_updates

In [0]:
dbutils.fs.rm("dbfs:/mnt/demo/orders_checkpoint",True)

Out[19]: True
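
The streaming query started earlier keeps running until it is stopped, so it is usually sensible to stop it before dropping the table and removing the checkpoint. The helper below is a minimal sketch under that assumption; `stop_all_streams` is a hypothetical name, and it relies only on `spark.streams.active`, `StreamingQuery.stop()`, and `StreamingQuery.id` from the PySpark Structured Streaming API.

```python
def stop_all_streams(spark):
    """Stop every active streaming query on the given SparkSession.

    Returns the ids of the queries that were stopped.
    Hypothetical helper for illustration; `spark` is the notebook's SparkSession.
    """
    stopped_ids = []
    for query in spark.streams.active:
        query.stop()  # Ask the query to stop.
        stopped_ids.append(query.id)
    return stopped_ids
```

In the notebook this would be called as `stop_all_streams(spark)` in a cell placed before the `DROP TABLE` cell.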