# Structured Streaming

* First, write a batch query to read data from 'data/questions-queue'.
 * This dataset simulates a queue that you might have coming from Kafka, Kinesis, and so on.
 * Each file in this dataset is one record in JSON.
* Then write it as a streaming query and take 5 files per microbatch. Use memory as the sink.
* Finally, write it as a streaming query and use the file sink with 100 files per microbatch. (How many files will be created?)

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, row_number
)
from pyspark.sql import Window
from pyspark.sql.types import StructType, StructField, TimestampType, LongType, IntegerType, StringType
import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Streaming I')
    .getOrCreate()
)

In [None]:
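base_path = os.getcwd()

# The project root is two directory levels above the current working directory.
project_path = '/'.join(base_path.split('/')[0:-2])

# Input directory that simulates the queue (one JSON record per file).
stream_input_path = os.path.join(project_path, 'data/questions-queue')

# Output directory for the file sink.
stream_output_path = os.path.join(project_path, 'output/streaming-output/1')

# Checkpoint directory for the file-sink query.
checkpoint_location = os.path.join(project_path, 'output/streaming-output/checkpoint/1')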

In [None]:
stream_schema = StructType(
    [
        StructField('question_id', LongType()),
        StructField('creation_date', TimestampType()),
        StructField('title', StringType()),
        StructField('r', IntegerType())
    ]
)

<b>First write the batch query:</b>

In [None]:
batch_query = (
    spark
    .read
    .format('json')
    .schema(stream_schema)
    .option('path', stream_input_path)
    .load()
)

In [None]:
batch_query.count()

<b>Now write it as the streaming query:</b>

In [None]:
streaming_query = (
    spark
    .readStream
    .schema(stream_schema)
    .option('maxFilesPerTrigger', 5)
    .json(stream_input_path)
)

In [None]:
q = (
    streaming_query
    .writeStream
    .format('memory')
    .outputMode('append')
    .queryName('my_stream')
    .start()
)

In [None]:
spark.sql('select * from my_stream order by r desc').show()

In [None]:
spark.sql('select count(*) from my_stream').show()

In [None]:
spark.streams.active

In [None]:
q.lastProgress

In [None]:
q.recentProgress
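
The progress entries can be long to read in full. As a minimal sketch, assuming each entry can be indexed like a dictionary and carries the `batchId` and `numInputRows` fields of the progress report, a small helper (the name `summarize_progress` is hypothetical) can reduce them to one pair per microbatch:

```python
def summarize_progress(progress_entries):
    """Return (batchId, numInputRows) pairs from a list of progress entries.

    Assumption: each entry supports dict-style access, as the entries
    returned by q.recentProgress do in PySpark 3.x.
    """
    return [(entry['batchId'], entry['numInputRows']) for entry in progress_entries]


# Hand-written sample entries, only to illustrate the shape of the result.
sample = [
    {'batchId': 0, 'numInputRows': 5},
    {'batchId': 1, 'numInputRows': 5},
]
print(summarize_progress(sample))
```

In the notebook this would be called as `summarize_progress(q.recentProgress)`. With 5 files per microbatch and one record per file, each microbatch would be expected to report about 5 input rows.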

In [None]:
q.id

In [None]:
q.stop()

<b>Now use the file-sink:</b>

In [None]:
streaming_query = (
    spark
    .readStream
    .schema(stream_schema)
    .option('maxFilesPerTrigger', 100)
    .json(stream_input_path)
)

q = (
    streaming_query
    .writeStream
    .format('parquet')
    .outputMode('append')
    .queryName('my_stream')
    .option('path', stream_output_path)
    .option('checkpointLocation', checkpoint_location)
    .start()
)

In [None]:
q.lastProgress
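
To answer the question of how many files the file sink created, the output directory can be inspected directly. The sketch below assumes that the data files sit directly under the output directory and have names starting with `part-` and ending with `.parquet`. The helper name `count_part_files` is hypothetical.

```python
import glob
import os


def count_part_files(output_dir):
    """Count parquet data files directly under output_dir.

    Assumption: data files are named 'part-*.parquet'. Metadata
    directories and other files are ignored by the pattern.
    """
    pattern = os.path.join(output_dir, 'part-*.parquet')
    return len(glob.glob(pattern))
```

In the notebook this would be called as `count_part_files(stream_output_path)`. The count keeps growing while the query is running, so it is probably best to check it once all input files have been processed, for example after calling `q.processAllAvailable()`.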

## Note

The size of each microbatch is 100 files. It is processed by 4 tasks, so each task takes approx. 25 files and merges them into one output file. With 2000 input files, there will be 2000 / 25 = 80 output files.
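
The arithmetic in the note can be sketched as a small function. It assumes, as the note does, that every microbatch is processed by a fixed number of tasks and that each task writes exactly one output file. The function name and parameters are hypothetical, and a partial final microbatch is counted as if it used all tasks.

```python
import math


def expected_output_files(total_input_files, files_per_batch, tasks_per_batch):
    """Estimate the number of output files written by the file sink.

    Assumptions: each microbatch is processed by tasks_per_batch tasks
    and each task writes one file; a partial last microbatch is counted
    as a full one.
    """
    # Number of microbatches needed to consume all input files.
    batches = math.ceil(total_input_files / files_per_batch)
    # One output file per task per microbatch.
    return batches * tasks_per_batch


# Numbers from the note: 2000 input files, 100 files per microbatch, 4 tasks.
print(expected_output_files(2000, 100, 4))  # 80
```

The result equals 2000 / 25 because 20 microbatches with 4 files each is the same as one file per 25 input files.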