## Spark Structured Streaming
Structured Streaming, introduced in Spark 2.0, gives streaming applications the same DataFrame API that batch applications use. This unified API is easier to use than the older RDD-based Spark Streaming (DStream) API, and because queries run on the Spark SQL engine they benefit from its optimizations. Structured Streaming lets the user run continuous queries over streaming data. Previously there was no simple way to run SQL-like queries on structured but continuous data; Structured Streaming closes that gap so that data analysts can run simple SQL-like queries on streams. It covers the basic features of RDD-based streaming and adds an SQL interface based on the Spark DataFrame API. In addition, Spark Structured Streaming provides features such as:
* Streaming aggregations
* Continuous window aggregations
* Stateful streaming aggregations
* Watermarks for streaming to handle late events
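
To give a feel for how windows and watermarks interact, the sketch below models them in plain Python. It is a toy model under stated assumptions, not Spark's implementation: the function name `windowed_counts` and its parameters are hypothetical, and the watermark here advances after every event, whereas Spark advances it between triggers. In PySpark the corresponding building blocks are `DataFrame.withWatermark` and `pyspark.sql.functions.window`.

```python
from collections import defaultdict

def windowed_counts(events, window_sec, delay_sec):
    """Count events per (window_start, key), dropping events that arrive too late.

    events: iterable of (event_time_sec, key) tuples in arrival order.
    """
    counts = defaultdict(int)
    max_event_time = 0
    dropped = []
    for event_time, key in events:
        # Watermark: how far back in event time we still accept data
        watermark = max_event_time - delay_sec
        window_start = (event_time // window_sec) * window_sec
        window_end = window_start + window_sec
        if window_end <= watermark:
            # The window this event belongs to is already closed
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped

# Hypothetical events: (event time in seconds, key), listed in arrival order
events = [(1, "a"), (12, "a"), (3, "a"), (25, "b"), (4, "a")]
counts, dropped = windowed_counts(events, window_sec=10, delay_sec=10)
print(counts)
print(dropped)
```

The event at time 3 arrives out of order but is still counted, because its window closes after the watermark. The event at time 4 arrives after the watermark has passed the end of its window, so it is dropped.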

In Structured Streaming, users express SQL-like operations and Spark runs them continuously over unbounded data streams. A Structured Streaming job reads from a streaming source, applies transformations to the streaming dataset, and writes the results to an external data sink. When writing to a sink, the user chooses one of the following output modes:
* **Append**: Write only the new rows added to the result table since the last trigger
* **Update**: Write only the rows that were updated since the last trigger
* **Complete**: Write the entire result table, including updated and new rows
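
To make the difference between the modes concrete, here is a toy model in plain Python of a running word count that emits a result after every micro-batch. It is only an illustration: `emit_per_batch` is a hypothetical helper, not part of Spark. The model rejects `append` because Spark does not allow append mode for a streaming aggregation without a watermark.

```python
from collections import Counter

def emit_per_batch(batches, mode):
    """Return what a running word count would write after each micro-batch."""
    if mode not in ("complete", "update"):
        # Spark rejects append mode for aggregations without a watermark
        raise ValueError("unsupported output mode for this aggregation: " + mode)
    totals = Counter()
    outputs = []
    for batch in batches:
        changed = set()
        for line in batch:
            for word in line.split():
                totals[word] += 1
                changed.add(word)
        if mode == "complete":
            # Complete: the whole result table every time
            outputs.append(dict(totals))
        else:
            # Update: only the rows touched by this batch
            outputs.append({word: totals[word] for word in changed})
    return outputs

# Hypothetical input: two micro-batches of text lines
batches = [["a b a"], ["b c"]]
complete_out = emit_per_batch(batches, "complete")
update_out = emit_per_batch(batches, "update")
print(complete_out)  # [{'a': 2, 'b': 1}, {'a': 2, 'b': 2, 'c': 1}]
print(update_out)
```

After the second batch, complete mode repeats the unchanged count for `a`, while update mode writes only `b` and `c`.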


Open a terminal and run the following command, which starts netcat listening for connections on port 9999:

`nc -lk 9999`

![nc-lk](nc-lk.png)

Run the cell below, then type lines of text into the terminal running `nc`. Each line you enter is sent to the streaming query.

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Split the lines into words
words = lines.select( explode( split(lines.value, " ") ).alias("word") )

# Generate running word count
wordCounts = words.groupBy("word").count()


# Start running the query that prints the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()

# Wait up to 2 seconds; returns False if the query is still running after the timeout
query.awaitTermination(2)

False

In [4]:
# run this cell to stop the Structured Streaming Application
query.stop()

## Joining Streaming Data with Static Data
In this example, the customer and product information are static data, while the transaction information is streamed. Each incoming transaction is joined with the static customer and product information.

In [5]:
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import sum as sum_

customersFile = "./input/customers.csv"
productsFile = "./input/products.csv"
transactionsData = "./input/transactions"

# Static data: read once as ordinary batch DataFrames
customersDF = spark.read.format("csv").option("header", "true")\
                   .option("inferSchema", "true").load(customersFile)
customersDF.show(5)

productsDF = spark.read.format("csv").option("header", "true")\
                  .option("inferSchema", "true").load(productsFile)
productsDF.show(5)

# By default, streaming file sources need an explicit schema
schema = StructType([ StructField("t_id", IntegerType(), True),
                     StructField("p_id", IntegerType(), True),
                     StructField("cust_id", IntegerType(), True) ])

# Streaming data: pick up at most one new file from the directory per trigger
transactionsDF = spark.readStream.format("csv")\
                      .option("maxFilesPerTrigger", 1).schema(schema)\
                      .load(transactionsData)

# Join each transaction with the static tables, then total the price per customer
salesPerCustomer = transactionsDF.join(customersDF, "cust_id").join(productsDF, "p_id")\
                                 .groupBy("cust_id").agg(sum_("price").alias("sales"))

# Start running the query that prints the running sales totals to the console
query = salesPerCustomer.writeStream.outputMode("complete").format("console").start()

# Wait up to 10 seconds; returns False if the query is still running after the timeout
query.awaitTermination(10)

+-------+---------------+------+---+
|cust_id|           name|gender|age|
+-------+---------------+------+---+
|      1|    Tawsha Haig|Female| 26|
|      2|Sayres Aiskrigg|  Male| 39|
|      3|    Tate Metham|  Male| 54|
|      4|   Fanya Torres|Female| 38|
|      5| Callie Perrigo|Female| 39|
+-------+---------------+------+---+
only showing top 5 rows

+----+--------------------+-----+
|p_id|        product_name|price|
+----+--------------------+-----+
|   1|Sping Loaded Cup ...|   16|
|   2|Mustard - Individ...|   16|
|   3|      Shrimp - Prawn|    9|
|   4|  Bread Country Roll|   23|
|   5|Flower - Dish Garden|   26|
+----+--------------------+-----+
only showing top 5 rows



False

Check the terminal to see each `cust_id` together with the sum of the prices of that customer's transactions.
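
The join-then-aggregate logic of this query can be modeled in plain Python. This is a rough sketch of the idea, not what Spark executes: `sales_per_customer` is a hypothetical helper, the prices come from the products output shown above, and the transactions are made up for illustration.

```python
from collections import defaultdict

# Static lookup data; prices taken from the products table printed above
PRICES = {1: 16, 2: 16, 3: 9, 4: 23, 5: 26}
CUSTOMERS = {1, 2, 3, 4, 5}

def sales_per_customer(batches, customers=CUSTOMERS, prices=PRICES):
    """Return the running sales total per customer after each micro-batch."""
    totals = defaultdict(int)
    snapshots = []
    for batch in batches:
        for t_id, p_id, cust_id in batch:
            # Inner join: skip transactions without a matching customer or product
            if cust_id not in customers or p_id not in prices:
                continue
            totals[cust_id] += prices[p_id]
        # Complete mode: emit the whole result table after every batch
        snapshots.append(dict(totals))
    return snapshots

# Hypothetical transactions as (t_id, p_id, cust_id), one list per micro-batch
batches = [
    [(1, 1, 1), (2, 3, 2)],
    [(3, 4, 1), (4, 9, 2)],  # p_id 9 has no matching product, so it is dropped
]
snapshots = sales_per_customer(batches)
for batch_id, table in enumerate(snapshots):
    print(batch_id, table)
```

The static tables are looked up for every batch, while only the running totals are carried from one batch to the next.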
