# Structured Streaming

In this notebook you will use the Structured Streaming API to process data in microbatches.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

from pyspark.sql.types import StructType, StructField, TimestampType, LongType, IntegerType, StringType
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Streaming I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = '/'.join(base_path.split('/')[0:-3])

stream_input_path = os.path.join(project_path, 'data/questions-queue')

stream_output_path = os.path.join(project_path, 'output/streaming-output/1')

checkpoint_location = os.path.join(project_path, 'output/streaming-output/checkpoint/1')

# Task I

* First, write a batch query to read the data from `data/questions-queue`
    * The dataset contains 1000 JSON files, each holding a single record, to mimic a queue
    * It simulates the queue you might receive from Kafka, Kinesis, or another streaming system

* Then write the same query as a streaming query that takes 5 records/files per microbatch
* Use memory as the sink

#### Write the schema of the JSON data:

In [None]:
stream_schema = StructType(
    [
        StructField('question_id', LongType()),
        StructField('creation_date', TimestampType()),
        StructField('title', StringType()),
        StructField('r', IntegerType())
    ]
)

#### First write the batch query:

In [None]:
batch_query = (
    spark
    .read
    .format('json')
    .schema(stream_schema)
    .option('path', stream_input_path)
    .load()
)

In [None]:
batch_query.count()

#### Now write it as the streaming query:

Hint:
* use `maxFilesPerTrigger` option to achieve 5 files per microbatch
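
To get a feel for what `maxFilesPerTrigger` does to the 1000-file queue, the grouping can be sketched in plain Python. This is only an illustration: the function `plan_micro_batches` and the file names are hypothetical, and Spark decides the actual order and contents of each microbatch itself.

```python
def plan_micro_batches(files, max_files_per_trigger):
    """Split a list of files into consecutive groups of at most max_files_per_trigger."""
    return [
        files[i:i + max_files_per_trigger]
        for i in range(0, len(files), max_files_per_trigger)
    ]


# Hypothetical file names standing in for the 1000 JSON files in the queue
files = ['question-{}.json'.format(i) for i in range(1000)]

batches = plan_micro_batches(files, 5)

# Number of microbatches and the size of the first one
print(len(batches), len(batches[0]))  # 200 5
```

With 5 files per trigger, the 1000 files are spread over 200 microbatches, so the memory table fills up gradually rather than all at once.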

In [None]:
streaming_query = (
    spark
    .readStream
    .schema(stream_schema)
    .option('maxFilesPerTrigger', 5)
    .json(stream_input_path)
)

#### Write the stream to memory sink:

Hint:
* as `format` use memory
* as `outputMode` use append
* use `queryName` so you can query the table in memory
* use [start()](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamWriter.start.html#pyspark.sql.streaming.DataStreamWriter.start) to run the query

In [None]:
q = (
    streaming_query
    .writeStream
    .format('memory')
    .outputMode('append')
    .queryName('my_stream')
    .start()
)

#### See the memory table:

Hint:
* write some sql queries against the memory table
* use `spark.sql(...).show()`

In [None]:
spark.sql('select * from my_stream order by r desc').show()

In [None]:
spark.sql('select count(*) from my_stream').show()

#### Streaming queries management

* see active queries: `spark.streams.active`
* see last microbatch: `query.lastProgress`
* see recent microbatches: `query.recentProgress`
* see id of the query: `query.id`
* see the name of the query: `query.name`
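
The progress report can be `None` when no microbatch has completed yet, so it helps to guard against that before reading fields from it. Below is a minimal sketch, assuming the report is a plain dictionary with `batchId` and `numInputRows` keys (some Spark versions return a dedicated progress object instead). The helper `summarize_progress` and the sample data are hypothetical.

```python
def summarize_progress(progress):
    """Return a one-line summary of a progress dictionary, or a notice if there is none."""
    if progress is None:
        return 'no microbatch has completed yet'
    return 'batch {}: {} input rows'.format(
        progress['batchId'], progress['numInputRows']
    )


# Hypothetical sample shaped like a progress report; not real query output
sample_progress = {'batchId': 3, 'numInputRows': 5}

print(summarize_progress(sample_progress))
print(summarize_progress(None))
```

In the notebook you could call it as `summarize_progress(q.lastProgress)` once the query is running.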

In [None]:
spark.streams.active

In [None]:
q.lastProgress

In [None]:
q.recentProgress

In [None]:
q.id

In [None]:
q.name

#### Stop the query:

Hint:
* use `stop()`

In [None]:
q.stop()

# Task II

Write the same streaming query with a file sink and use 100 files per microbatch. (How many files will be created?)

Hint:
* as `format` use parquet
* as `outputMode` use append
* use `checkpointLocation` in `option`

In [None]:
streaming_query = (
    spark
    .readStream
    .schema(stream_schema)
    .option('maxFilesPerTrigger', 100)
    .json(stream_input_path)
)

q = (
    streaming_query
    .writeStream
    .format('parquet')
    .outputMode('append')
    .queryName('my_stream')
    .option('path', stream_output_path)
    .option('checkpointLocation', checkpoint_location)
    .start()
)

In [None]:
q.lastProgress

In [None]:
q.stop()

## Note

Each microbatch reads 100 files, so it takes 10 microbatches to process all 1000 files. If each microbatch is processed by 8 tasks (the exact number depends on the parallelism you are using), each task writes one output file per microbatch, so there will be 8 x 10 = 80 files.
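
The arithmetic in this note can be checked with a few lines of Python. This is a sketch under the note's assumptions: the function name `expected_output_files` is hypothetical, and the figure of 8 tasks per microbatch is taken from the note, so your own run may differ.

```python
import math


def expected_output_files(total_files, files_per_trigger, tasks_per_batch):
    """Estimate output files, assuming each task writes one file per microbatch."""
    batches = math.ceil(total_files / files_per_trigger)
    return batches * tasks_per_batch


# 1000 input files, 100 files per microbatch, 8 tasks per microbatch (assumed)
print(expected_output_files(1000, 100, 8))  # 80
```

To compare against an actual run, count the parquet files written under `stream_output_path`.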

In [None]:
spark.stop()