# Overview of Streaming Technologies and Spark Streaming

As part of this topic, we will see an overview of the technologies used in building streaming data pipelines. We will also take a deeper look at Spark Structured Streaming by developing a solution for a simple problem.

* Overview of Streaming Technologies
* Spark Structured Streaming – Overview
* Setup Project
* Develop logic using REPL
* Development Life Cycle – IDE
* Output Modes and Sinks
* Windowing and Handling Late Data

### Overview of Streaming Technologies
Let us go through the details about Streaming Technologies.

* Ingestion
* Real-Time Processing
* Databases
* Visualization
* Frameworks

***Ingestion***

Many technologies are used to ingest data in real time.

* Logstash
    * Can read data from different sources
    * Can apply simple row level transformations such as converting date formats, masking some of the attribute values in each message etc.
    * Can work with other technologies in building streaming pipelines such as Kafka
* Flume
    * Runs as agent
    * Each agent has a source, a channel, and a sink
    * Supports many sources and sinks
    * Can work with other technologies in building streaming pipelines such as Kafka
    * Can push data to technologies like Storm, Flink, Spark Streaming etc to run real-time streaming analytics.
* Kafka Connect and Kafka topic
    * A Kafka topic is a fault tolerant and highly reliable intermediate data streaming mechanism
    * Kafka Connect reads data from different sources and pushes messages to a Kafka topic. It also consumes messages from a Kafka topic and pushes them to supported targets.
    * Together, Kafka Connect and the topic let us move data from different types of sources to different types of sinks.
    * Can push data to technologies like Kafka Streams, Storm, Flink, Spark Streaming etc to run real-time streaming analytics.
* Kinesis Firehose and Kinesis Data Streams
    * Kinesis is an AWS service which is very similar to Kafka
    * Kinesis Firehose is similar to Kafka Connect and Kinesis Data Streams is similar to the topic
    * No need for a dedicated cluster; we are only charged for the usage.
* and more

***Real-Time processing***

As the data comes through tools like Logstash, Flume, Kafka etc, we might want to perform standard transformations such as data cleansing, standardization, lookups, joins, aggregations, sorting, ranking etc. While some of the data ingestion tools are capable of some of these transformations, they do not provide all the features. Also, applying them during ingestion might delay the ingestion and make the flow unstable. Hence we need to use tools which are built for performing transformations as the data is streamed. Here are some of the prominent tools.

* Spark Streaming or Spark Structured Streaming (a module built as part of Spark)
* Flink
* Storm
* Kafka Streams
* and more

***Databases***

Once the data is processed, we have to store data in databases to persist and build a visualization layer on top of the processed data. We can use

* RDBMS – such as Oracle, MySQL, Postgres etc
* Data Warehouses – such as Teradata, Redshift etc
* NoSQL databases – such as HBase, Cassandra, MongoDB, DynamoDB etc
* Search based databases – such as Elasticsearch

***Visualization***

Visualization is typically done as part of the application development using standard frameworks.

* d3js
* Kibana
* Standard reporting tools such as Tableau
* and more

***Frameworks***

Having discussed the different moving parts in building streaming pipelines, let us now get into frameworks. Most of these frameworks do not have visualization included.

* Kafka
    * Kafka Connect
    * Kafka Topic
    * Kafka Streams
* ELK
    * Elasticsearch (Database)
    * Logstash (streaming and processing logs)
    * Kibana (Visualization)
* HDF – Streaming services running behind NiFi
* MapR Streams – Streaming services running on MapR cluster
* AWS Services
    * DynamoDB (Database)
    * S3 (persistent storage of flat file format)
    * Kinesis (streaming and processing logs)

We have highlighted some of the popular frameworks. Almost all the top vendors such as Cloudera, Google, Microsoft Azure etc offer the services needed to build streaming pipelines.

### Spark Structured Streaming – Overview
Apache Spark is a proven distributed computing framework with modules for different purposes:

* Core APIs – Transformations and Actions
* Spark SQL and Data Frames
* Spark Streaming (legacy) and Spark Structured Streaming
* Spark MLlib
* Spark GraphX
* and more

We can use Spark Structured Streaming to apply complex business rules either by using Data Frame operations or Spark SQL. Let us review Official Documentation to understand how it is structured.

Typical batch job execution life cycle.

* Create a Spark Context (Provision Resources)
* Run the jobs or applications
    * Read Data from Source (typically from files or databases)
    * Apply Transformations
        * Row Level Transformations
        * Joining Data Sets
        * Group and Perform Aggregations
        * Sorting and Ranking
        * Deduplication
        * and more
    * Write Data to Target/Sink (typically to files or databases)
* Close Spark Context (Cleanup Resources)
* We typically schedule jobs using enterprise scheduling tools.
* This works fine if the job runs once an hour or less frequently.

Streaming Context for micro batches.

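* However, if we have to apply transformations in real time, the overhead of creating and closing the Spark Context becomes considerably high relative to the data processing itself.
* We can solve this problem by running a long-lived streaming query. We define the stream using spark.readStream and the query starts when we invoke start on df.writeStream. Instead of closing the context after each run, it keeps polling the source and processes the new data at regular intervals (micro batches). Unlike legacy Spark Streaming, Structured Streaming does not need a separate StreamingContext object.
* The query keeps running until we terminate it.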

***Important Concepts***

Let us understand some of the important concepts related to Spark Structured Streaming. We have already seen spark.read to read the data and df.write to write the data while building batch data pipelines. For streaming pipelines, we have spark.readStream to read the data and df.writeStream to write the data in streaming fashion.

* Sources
    * File
    * Kafka
    * Socket (for testing)
* Basic Operations or Transformations
    * Row Level Transformations
    * Joining Data Sets
    * Group and Perform Aggregations
    * Sorting and Ranking
    * Deduplication
    * and more
* Window Operations on Event Time (will cover later)
    * Handling late data and watermarking
* Output Modes
    * Append Mode
    * Update Mode
    * Complete Mode
* Sinks/Targets
    * Console
    * File
    * Memory
    * Kafka
    * foreach (can be used to write to Database)
* Fault Tolerance and Offset Management

### Setup Project
Let us understand how to set up a project to build applications using Spark Structured Streaming.

***Development Life Cycle***

Let us first go through the details about the Development Life Cycle.

* Make sure gen_logs is set up and data is being streamed
* Create new project StreamingDemo using IntelliJ
    * Choose scala 2.11
    * Choose sbt 0.13.x
    * Make sure JDK is chosen
* Update build.sbt. See below
* Define application properties
* Create GetStreamingDepartmentTraffic object
* Add logic to process data using Spark Structured Streaming
* Build jar file
* Ship to cluster and deploy

***Dependencies (build.sbt)***

Spark Structured Streaming requires Spark SQL dependencies.

* Add type safe config dependency so that we can externalize properties
* Add spark-core and spark-sql dependencies
* Replace build.sbt with below lines of code
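
A minimal build.sbt along these lines might look like the sketch below. The project name comes from the steps above; the version numbers (Scala 2.11.12, Spark 2.3.0, Typesafe Config 1.3.2) are assumptions picked to be consistent with Scala 2.11, so adjust them to match your cluster.

```scala
// build.sbt (sketch; all version numbers below are assumptions)
name := "StreamingDemo"

version := "0.1"

scalaVersion := "2.11.12"

// Typesafe Config, used to externalize application properties
libraryDependencies += "com.typesafe" % "config" % "1.3.2"

// Spark Core and Spark SQL (Structured Streaming ships as part of Spark SQL)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
```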


***Externalize Properties***
We need to make sure that the application can be run in different environments. It is very important to understand how to externalize properties and pass the information at run time.

* Make sure build.sbt has the dependency related to typesafe config
* Create a new directory under src/main by name resources
* Add a file called application.properties and add below entries
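
As a rough illustration, the properties can be grouped by environment and read with Typesafe Config. The keys (execution.mode, data.host, data.port) and the dev values are hypothetical; only the host gw02.itversity.com and port 9999 appear elsewhere in this topic. The sketch parses an inline string so that it runs on its own.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object PropertiesSketch {
  // Stand-in for src/main/resources/application.properties.
  // Keys and dev values are hypothetical; they only illustrate
  // grouping the properties by environment (dev, prod).
  val sampleProperties: String =
    """dev.execution.mode = local
      |dev.data.host = localhost
      |dev.data.port = 9999
      |prod.execution.mode = yarn
      |prod.data.host = gw02.itversity.com
      |prod.data.port = 9999
      |""".stripMargin

  // Returns only the properties of one environment, for example "dev".
  def propsFor(env: String): Config =
    ConfigFactory.parseString(sampleProperties).getConfig(env)

  def main(args: Array[String]): Unit = {
    val props = propsFor(if (args.nonEmpty) args(0) else "dev")
    println(props.getString("execution.mode"))
    println(props.getString("data.host"))
    println(props.getInt("data.port"))
  }
}
```

In the actual project, `ConfigFactory.load()` picks up application.properties from the classpath, so the inline string is not needed there.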


### Develop logic using REPL
Here are the steps involved in developing streaming applications.

* Make sure to redirect the output of log file to localhost using port number 9999 (<mark>tail_logs.sh|nc -lk 9999</mark>)
* Typical data processing life cycle
    * Read Data – We can read data from files as well as tools like Kafka, Flume etc.
    * Process Data – Once the data is read it can be processed using Data Frame Operations or Spark SQL.
    * Write Data – We can write Data Frame to different sinks such as File, Kafka, Console, Memory, Database etc.
* We will start with reading data from the file as well as Kafka and then look into other aspects. To validate that the read is successful, we will write into memory and run queries.

***Reading Data***

* **spark.readStream** is the higher level API to read data in streaming fashion. It is similar to **spark.read**
* We can read the data either from files or from tools like Kafka, Flume etc.
* To read data from files, either we can use APIs such as **spark.readStream.csv** and pass the path or we can use APIs such as **spark.readStream.format** where file format is passed as an argument.
* Following are the file formats supported (same as **spark.read**)
    * csv and text
    * json
    * orc
    * parquet
* When we try to read the data from files, we need to apply schema. Unlike in **spark.read**, by default schema inference is disabled. We can enable schema inference by setting **spark.sql.streaming.schemaInference** to true
* We can also read data in streaming fashion from external web services or tools like kafka, flume etc using **spark.readStream.format**. We need to pass connectivity information using the option function.
* Depending upon the format we need to set options (e.g.: host and port for socket)
* Once we pass all the information, we can invoke the load function to create Data Frame.
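
The read APIs described above can be sketched as follows. This is a minimal example, assuming a socket on localhost:9999 as in the earlier step; the object name, function names, and the CSV column names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadStreamSketch {
  // Socket source: needs host and port, yields a single string column "value".
  def readFromSocket(spark: SparkSession, host: String, port: Int): DataFrame =
    spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

  // File source: the schema has to be given explicitly unless
  // spark.sql.streaming.schemaInference is set to true.
  def readCsvFiles(spark: SparkSession, dir: String): DataFrame = {
    // Hypothetical columns, for illustration only
    val schema = StructType(Seq(
      StructField("department_name", StringType),
      StructField("visit_time", StringType)
    ))
    spark.readStream.schema(schema).csv(dir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ReadStreamSketch")
      .getOrCreate()

    val lines = readFromSocket(spark, "localhost", 9999)
    println(lines.isStreaming) // streaming Data Frames report true here
    lines.printSchema()

    spark.stop()
  }
}
```

No data flows at this point. A streaming Data Frame only describes the source, and reading begins once a query is started with writeStream.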

***Processing Data***

Let us see how we can process data using Data Frame Operations or Spark SQL. As both are covered extensively earlier, we will only see how either of the approaches can be used rather than diving deep into all aspects of Data Frame Operations or Spark SQL.

* Once data is processed we can write the output into the specified target using **format** on top of **df.writeStream**.
* We need to specify output mode (append, complete or update). By default, it is append. However, when aggregations are involved we can only use complete or update.
* To print the data on the console in real time we need to use the console as part of the format.
* The complete example reads data in real time, applies logic to get the department count, and prints the output to the console every 20 seconds.
* For use cases like moving aggregations, we can use Window Operations.
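
A sketch of such an end-to-end example is given below. It assumes the streamed lines are web server access logs in which the 7th space-separated field is the request path (for example /department/fitness/products); that layout, the object name, and the column names are assumptions rather than something fixed by Spark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, split}
import org.apache.spark.sql.streaming.Trigger

object DepartmentCountConsole {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DepartmentCountConsole")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Read lines from the socket fed by tail_logs.sh | nc -lk 9999
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Assumption: field 6 (0-based) of each log line is the request path
    val path = split(col("value"), " ").getItem(6)

    val departmentTraffic = lines
      .filter(split(path, "/").getItem(1) === "department")
      .withColumn("department_name", split(path, "/").getItem(2))
      .groupBy("department_name")
      .agg(count("value").alias("department_count"))

    // Aggregated result, so append is not allowed; use update or complete
    val query = departmentTraffic.writeStream
      .outputMode("update")
      .format("console")
      .trigger(Trigger.ProcessingTime("20 seconds"))
      .start()

    query.awaitTermination()
  }
}
```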
 
DATA FRAME OPERATIONS

Let us see how we can process data using Data Frame Operations after creating Data Frame using readStream.
* We need to use APIs such as select, withColumn to project the data.
* We need to use APIs such as the filter or where to filter the data.
* Aggregations can be performed using groupBy.
* As part of this process, we need to use functions that are available under **org.apache.spark.sql.functions**. In our case, we have used functions like split, to_timestamp while filtering as well as projecting the data.

SPARK SQL

Let us see how we can process data using Spark SQL after creating Data Frame using readStream.
* We first have to register Data Frame as a view.
* Once Data Frame is registered as a view, we can develop SQL based query and pass it to spark.sql to process the data.
* It will create a new Data Frame with processed data. We can write the Data Frame to target using relevant APIs.
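
The same department count can be sketched with Spark SQL. The view name lines and the assumed log layout (request path in the 7th space-separated field) are illustrative choices.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DepartmentCountSql {
  // Registers the streaming Data Frame as a view and queries it with SQL.
  def departmentTraffic(spark: SparkSession, lines: DataFrame): DataFrame = {
    lines.createOrReplaceTempView("lines")
    spark.sql(
      """SELECT split(split(value, ' ')[6], '/')[2] AS department_name,
        |       count(1) AS department_count
        |FROM lines
        |WHERE split(split(value, ' ')[6], '/')[1] = 'department'
        |GROUP BY split(split(value, ' ')[6], '/')[2]
        |""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DepartmentCountSql")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The result of spark.sql is again a streaming Data Frame
    val query = departmentTraffic(spark, lines).writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```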

WINDOW OPERATIONS

Let us see how we can process data using Window Operations after creating Data Frame using readStream.
* We need to use APIs such as select, withColumn to project the data.
* We need to use APIs such as the filter or where to filter the data.
* Aggregations can be performed using groupBy. We can pass the window function as one of the grouping expressions, giving it the timestamp column, the window interval, and optionally the slide interval.
* As part of this process, we need to use functions that are available under **org.apache.spark.sql.functions**. In our case, we have used functions like split, to_timestamp while filtering as well as projecting the data.
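
A windowed version might look like the sketch below. It assumes each log line carries a timestamp such as [dd/MMM/yyyy:HH:mm:ss followed by a time zone offset, which is ignored here; the regular expression, the column names, and the window and slide intervals are all illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, regexp_extract, split, to_timestamp, window}

object DepartmentCountWindowed {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DepartmentCountWindowed")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Assumption: request path is field 6 (0-based) of the log line
    val path = split(col("value"), " ").getItem(6)
    // Assumption: the text right after '[' is the event time; offset is ignored
    val visitTime = to_timestamp(
      regexp_extract(col("value"), "\\[(\\S+)", 1), "dd/MMM/yyyy:HH:mm:ss")

    val windowedTraffic = lines
      .filter(split(path, "/").getItem(1) === "department")
      .select(
        visitTime.alias("visit_time"),
        split(path, "/").getItem(2).alias("department_name"))
      // 1-minute windows that slide every 30 seconds (arbitrary values)
      .groupBy(window(col("visit_time"), "1 minute", "30 seconds"), col("department_name"))
      .agg(count("department_name").alias("department_count"))

    val query = windowedTraffic.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", false) // show the full window column
      .start()

    query.awaitTermination()
  }
}
```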

***Writing Data***

As we have seen how to read data and process it using Data Frame Operations or Spark SQL, now let us see how to write the data back to a sink.
* We can write the output to different types of sinks or targets.
    * file
    * kafka
    * memory
    * console
    * database using foreach
* We can use writeStream.format to write data into file or memory or console or external plugins like Kafka. For Databases, we need to have a custom writer where we provide logic to open connection, process data and close the connection.
* We need to specify outputMode while writing data to a sink. Valid modes are append, update and complete. complete requires aggregated results and update is typically used with them, while append is used on top of Data Frames which are processed using row level transformations.
* We might not be able to use all 3 modes with every sink. For example, we will not be able to write data to files in a complete or update mode.
* Go to this [link](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes) as part of official documentation to get the most relevant information about compatibility between output modes, sinks, and transformations.
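
To validate a read, writing to the memory sink and querying it can be sketched as follows. The query name log_lines and the 30-second wait are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

object MemorySinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MemorySinkSketch")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Row level data only (no aggregation), so append mode is valid.
    val query = lines.writeStream
      .outputMode("append")
      .format("memory")
      .queryName("log_lines") // becomes the name of the in-memory table
      .start()

    // Let the stream collect data for up to 30 seconds,
    // then query the in-memory table like any other table.
    query.awaitTermination(30000)
    spark.sql("SELECT * FROM log_lines").show(10, false)

    query.stop()
    spark.stop()
  }
}
```

The memory sink keeps all the rows on the driver, so it is suited to testing with small volumes only.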

### Development Life Cycle – IDE
As we explored Spark Structured Streaming APIs in detail, now let us understand how we can develop applications for Streaming Data Pipelines using IDE.
* Create a Scala program by choosing Scala Class and then selecting Object as the kind
* Make sure the program is named as GetStreamingDepartmentTraffic
* First, we need to import necessary APIs
* Develop necessary logic
    * Get the properties from application.properties
    * Create a spark session object by name spark
    * Create stream using spark.readStream
    * Process data using Data Frame Operations
    * Write the output to console (in actual applications we write the output to the database)
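
Put together, the program could look roughly like this. It is a sketch, not the original listing: it assumes the environment name (for example dev or prod) is passed as the first argument, that application.properties has the hypothetical keys execution.mode, data.host and data.port under each environment, and that the request path is the 7th space-separated field of each log line.

```scala
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, split}
import org.apache.spark.sql.streaming.Trigger

object GetStreamingDepartmentTraffic {
  def main(args: Array[String]): Unit = {
    // args(0) is the environment name, e.g. dev or prod (assumed convention)
    val props = ConfigFactory.load().getConfig(args(0))

    val spark = SparkSession.builder()
      .master(props.getString("execution.mode"))
      .appName("Get Streaming Department Traffic")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Create the stream using connection details from the properties
    val lines = spark.readStream
      .format("socket")
      .option("host", props.getString("data.host"))
      .option("port", props.getInt("data.port"))
      .load()

    // Assumption: field 6 (0-based) of each log line is the request path
    val path = split(col("value"), " ").getItem(6)
    val departmentTraffic = lines
      .filter(split(path, "/").getItem(1) === "department")
      .withColumn("department_name", split(path, "/").getItem(2))
      .groupBy("department_name")
      .agg(count("value").alias("department_count"))

    // Console is used here; actual applications usually write to a database
    val query = departmentTraffic.writeStream
      .outputMode("update")
      .format("console")
      .trigger(Trigger.ProcessingTime("20 seconds"))
      .start()

    query.awaitTermination()
  }
}
```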

***Build, Deploy and Run***

Let us see how we can build and run the application locally and then on the cluster.
* Right click on the project and copy path
* Go to terminal and run cd command with the path copied
* Run <mark>sbt package</mark>
* It will generate a jar file for our application
* Copy to the server where you want to deploy
* Start streaming tail_logs to web service – <mark>tail_logs.sh|nc -lk gw02.itversity.com 9999</mark>
* Run the application using spark-submit in another session on the server, passing the jar file generated above

### Output Modes and Sinks
Now let us review the details with respect to Output Modes and Sinks.
* Output Modes
    * **Complete** – The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
    * **Append** – Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
    * **Update** – Only the rows that were updated in the Result Table since the last trigger will be written to the external storage. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
* Sinks
    * file
    * kafka
    * memory
    * console
    * foreach
* Not all output modes are supported by all types of sinks and transformations. Go to this [link](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes)  as part of official documentation to get the most relevant information about compatibility between output modes, sinks, and transformations.
* We need to implement ForeachWriter to use foreach to write data into target Databases.
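
A ForeachWriter for a relational database might be sketched as below. The class name, JDBC URL, credentials, and the table department_traffic are hypothetical, and a matching JDBC driver is assumed to be on the classpath. The rows are expected to hold department_name followed by department_count.

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical writer: url, user, password and the target table are placeholders.
class DepartmentTrafficWriter(url: String, user: String, password: String)
    extends ForeachWriter[Row] {

  private var connection: Connection = _
  private var statement: PreparedStatement = _

  // Called once per partition for each trigger; return true to process its rows.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    connection = DriverManager.getConnection(url, user, password)
    statement = connection.prepareStatement(
      "INSERT INTO department_traffic (department_name, department_count) VALUES (?, ?)")
    true
  }

  // Called for every row of the partition.
  override def process(row: Row): Unit = {
    statement.setString(1, row.getString(0)) // department_name
    statement.setLong(2, row.getLong(1))     // department_count
    statement.executeUpdate()
  }

  // Called at the end, with the error if one occurred (null otherwise).
  override def close(errorOrNull: Throwable): Unit = {
    if (statement != null) statement.close()
    if (connection != null) connection.close()
  }
}
```

It would be plugged in with something like `departmentTraffic.writeStream.outputMode("update").foreach(new DepartmentTrafficWriter(url, user, password)).start()`. With update mode, a plain INSERT stores a new row every time a count changes, so a real application would more likely use an upsert.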

### Windowing and Handling Late Data
Let us see Windowing and Handling Late Data using Watermarking.
* In the previous example, the query is triggered every 20 seconds.
* Each time, it aggregates over all the data received so far, as there is no interval or window while grouping the data.
* To aggregate per interval, we need a timestamp to build windows on. We can either pass the timestamp as part of the data or add it while reading the data using <mark>option("includeTimestamp", true)</mark>
* We can also perform sliding windows such as a 10-minute window every 5 minutes.
* When we use windowing, data might come late sometimes.
* We can handle late data by using the concept of watermarking (**withWatermark**).
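
These points can be combined into one sketch: the socket source adds a timestamp column, the watermark bounds how late data may arrive, and the aggregation uses a 10-minute window sliding every 5 minutes. The 10-minute watermark delay and the assumed log layout (request path in the 7th space-separated field) are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, split, window}

object DepartmentCountWithWatermark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DepartmentCountWithWatermark")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // includeTimestamp adds a "timestamp" column next to "value"
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()

    // Assumption: field 6 (0-based) of each log line is the request path
    val path = split(col("value"), " ").getItem(6)

    val departmentTraffic = lines
      .filter(split(path, "/").getItem(1) === "department")
      .withColumn("department_name", split(path, "/").getItem(2))
      // Data arriving more than 10 minutes late may be dropped (assumed delay)
      .withWatermark("timestamp", "10 minutes")
      // 10-minute windows, sliding every 5 minutes
      .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("department_name"))
      .agg(count("value").alias("department_count"))

    val query = departmentTraffic.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", false)
      .start()

    query.awaitTermination()
  }
}
```

The watermark has to be defined on the same timestamp column that is used for the window, and before the aggregation, for the late data handling to take effect.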