## What is a data stream?

- Any data source that grows over time.

Examples:

- New files landing in cloud storage
- Updates to a database captured in a CDC (change data capture) feed
- Events queued in a pub/sub messaging feed (such as Kafka)
- A Delta table

## Spark Streaming

Spark Structured Streaming is a Spark feature that lets you treat an infinite input data stream as a structured table.

With this model, new data arriving in the source results in new records appended to an **unbounded** table.

## Triggers for streaming data

**Unspecified (default): processingTime = "500ms"**

**Fixed interval**

In [0]:
# .trigger(processingTime="10 seconds")  # Process data in micro-batches at the user-specified interval.

**Triggered batch**

In [0]:
# .trigger(once=True)  # Process all available data in a single batch, then stop.

**Triggered micro-batches**

In [0]:
# .trigger(availableNow=True)  # Process all available data in multiple micro-batches, then stop.
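The difference between the two "process everything, then stop" triggers can be illustrated with a toy model. The sketch below is plain Python, not Spark code: `plan_batches` and `max_per_batch` are hypothetical names, with `max_per_batch` standing in for a rate limit such as `maxFilesPerTrigger`. It assumes that `once` ignores such limits while `availableNow` honors them.

```python
def plan_batches(pending, trigger, max_per_batch=2):
    """Toy model: split the pending items into micro-batches.

    trigger="once"          -> everything goes into a single batch
    trigger="availableNow"  -> batches of at most max_per_batch items
    """
    if not pending:
        return []
    if trigger == "once":
        return [list(pending)]
    if trigger == "availableNow":
        return [
            list(pending[i:i + max_per_batch])
            for i in range(0, len(pending), max_per_batch)
        ]
    raise ValueError(f"unknown trigger: {trigger!r}")


# Hypothetical backlog of files waiting to be processed.
pending_files = ["f1.csv", "f2.csv", "f3.csv", "f4.csv", "f5.csv"]
once_plan = plan_batches(pending_files, "once")
available_now_plan = plan_batches(pending_files, "availableNow")
print(len(once_plan), len(available_now_plan))  # 1 3
```

Both plans cover the same five files; only the number of batches differs, which is why `availableNow` tends to scale better for a large backlog.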

## Output modes for streaming

**Append**

In [0]:
# .outputMode("append")  # Default: only newly arrived rows are added to the target table.

**Complete**

In [0]:
# .outputMode("complete")  # The whole result table is recalculated and rewritten on every trigger.
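To make the two modes concrete, here is a toy model in plain Python (not the Spark API). `run_micro_batches` is a hypothetical helper, and the author names are made-up sample data. In "append" the sink only ever grows by the new rows; in "complete" the sink is replaced by the full aggregated result after each batch.

```python
from collections import Counter


def run_micro_batches(batches, output_mode):
    """Toy model of a sink fed by micro-batches of author names.

    "append"   -> each batch adds only its new rows to the sink
    "complete" -> each batch rewrites the sink with the full author counts
    """
    sink = []
    counts = Counter()
    for batch in batches:
        if output_mode == "append":
            sink.extend(batch)
        elif output_mode == "complete":
            counts.update(batch)
            sink = sorted(counts.items())  # overwrite the previous result
        else:
            raise ValueError(f"unknown output mode: {output_mode!r}")
    return sink


batches = [["Spong", "Lyons"], ["Spong"]]
appended = run_micro_batches(batches, "append")
completed = run_micro_batches(batches, "complete")
print(appended)   # ['Spong', 'Lyons', 'Spong']
print(completed)  # [('Lyons', 1), ('Spong', 2)]
```

Aggregations such as `count(*) ... group by author` change previously emitted rows, which is why the practice section below writes them with `complete`.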

## Unsupported operations

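- Sorting (allowed only on aggregated results written in complete output mode)
- Deduplication (generally requires stateful techniques such as watermarking)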
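An unsupported operation is typically rejected when the streaming query starts, not when the DataFrame is defined. The sketch below probes this for sorting. It assumes `streaming_df` is a streaming DataFrame (for example, one returned by `spark.readStream.table(...)`); the helper name `can_start_sorted` and the query name `sort_probe` are hypothetical, and the broad `except` is only there to keep the sketch free of imports.

```python
def can_start_sorted(streaming_df, column):
    """Try to start a sorted streaming query; return (ok, error_message)."""
    try:
        query = (
            streaming_df.orderBy(column)
            .writeStream
            .format("memory")          # in-memory sink, used only for probing
            .queryName("sort_probe")   # hypothetical query name
            .outputMode("append")
            .start()
        )
    except Exception as err:  # Spark normally raises an AnalysisException here
        return False, str(err)
    query.stop()  # the probe succeeded, so shut the query down again
    return True, ""
```

In a notebook this could be called as `ok, message = can_start_sorted(spark.readStream.table("external_data.books"), "author")`, where `message` would carry Spark's explanation if the sort is refused.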

## Practice

**Create the table from the CSV folder**

In [0]:
csv_file_df = spark.read.option("header", "true").option("delimiter", ";")\
    .csv("dbfs:/external_data/bookstore/books-csv/*.csv")
csv_file_df.write.mode("overwrite").saveAsTable("external_data.books")

**Read the table as a streaming source**

In [0]:
spark.readStream.table("external_data.books").createOrReplaceTempView("books_streaming_tmp_vw")

In [0]:
%sql
select *
from books_streaming_tmp_vw

In [0]:
%sql
CREATE OR REPLACE TEMP VIEW author_counts_tmp_vw as
(
  select author, count(*) as total_books
  from books_streaming_tmp_vw
  group by author
)

Both views behave as live queries and pick up any new data arriving in the source table.

**Creating a delta table from a streaming temporary view**

This option checks the source view for changes every 10 seconds and completely overwrites the target table with the updated result.

In [0]:
spark.table("author_counts_tmp_vw")\
  .writeStream\
  .trigger(processingTime = '10 seconds')\
  .option("checkpointLocation", "dbfs:/external_data/bookstore/author_counts_checkpoint")\
  .outputMode("complete")\
  .table("external_data.author_counts")
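Starting a streaming write returns a query handle that can be kept in a variable and inspected. A small sketch, assuming `query` is that handle (for instance, the value returned by the `writeStream` chain above) and that its progress report is dictionary-like; `summarize_query` is a hypothetical helper name.

```python
def summarize_query(query):
    """Collect a few basic facts about a streaming query."""
    progress = query.lastProgress  # None until the first micro-batch finishes
    return {
        "id": str(query.id),
        "active": query.isActive,
        "status": query.status.get("message"),
        "last_batch_id": progress["batchId"] if progress else None,
        "rows_in_last_batch": progress["numInputRows"] if progress else None,
    }
```

Calling `summarize_query(query)` a few seconds after each insert is one way to confirm that a new micro-batch actually picked up the rows.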

In [0]:
%sql

select *
from external_data.author_counts

In [0]:
%sql
INSERT INTO external_data.books
values ("B19", "Introduction to Modeling and Simulation", "Mark W. Spong", "Computer Science", 25),
        ("B20", "Robot Modeling and Control", "Mark W. Spong", "Computer Science", 30),
        ("B21", "Turing's Vision: The Birth of Computer Science", "Chris Bernhardt", "Computer Science", 35)

**Inserting new data**

In [0]:
%sql
INSERT INTO external_data.books
values ("B16", "Hands-On Deep Learning Algorithms with Python", "Sudharsan Ravichandiran", "Computer Science", 25),
        ("B17", "Neural Network Methods in Natural Language Processing", "Yoav Goldberg", "Computer Science", 30),
        ("B18", "Understanding digital signal processing", "Richard Lyons", "Computer Science", 35)
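If the 10-second query started earlier is still running, it should be stopped before another write reuses the same checkpoint location, since two concurrent queries sharing a checkpoint will conflict. A minimal sketch, assuming `spark` is the notebook's active session; `stop_active_streams` is a hypothetical helper name.

```python
def stop_active_streams(spark):
    """Stop every active streaming query on the session and wait for each one."""
    stopped = []
    for query in spark.streams.active:
        query.stop()
        query.awaitTermination()  # returns once the query has fully shut down
        stopped.append(query.id)
    return stopped
```

Note that this stops every active stream on the session, including any streaming display of the temporary views, so it is best suited to a notebook where that is the intent.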

**Batch Writing Mode**

This option processes everything that is available at the moment in micro-batches, writes the result to a Delta table, and then stops.

In [0]:
spark.table("author_counts_tmp_vw")\
  .writeStream\
  .trigger(availableNow=True)\
  .option("checkpointLocation", "dbfs:/external_data/bookstore/author_counts_checkpoint")\
  .outputMode("complete")\
  .table("external_data.author_counts")\
  .awaitTermination()

In [0]:
%sql
select *
from external_data.author_counts