<br><br><br>
<span style="color:red;font-size:60px">Spark Streaming</span>
<br><br>
<li>Enables "the internet of things"</li>
<li>New data is constantly "streaming" into an application</li>
<li>The application processes this data in real time or "near" real time</li>
<li>Streaming applications are <span style="color:darkred">Producer-Consumer</span> applications</li>
<li>Other systems: Kafka, (Twitter) Storm, Gearpump, Apex, ... </li>


<br><br><br>
<span style="color:green;font-size:xx-large">Examples of Streaming Applications</span>

<li>Uber uses a streaming application to monitor customer requests and move drivers from one part of a city to another</li>
<li>Real time A/B testing</li>
<li>Detecting fake news in real time (Facebook/Twitter)</li>


<br><br><br>
<span style="color:green;font-size:xx-large">Characteristics of Streaming Applications</span>
<li>Low latency and high throughput of incoming streams</li>
<li>Fault tolerance, usually by maintaining the current state in memory (low latency) and backing it up on a disk (fault tolerance)</li>
<ul>
    <li>Example: A system reporting the top 10 "most traded" stocks in real time</li>
    <li>A running count of each stock needs to be kept in memory</li>
    <li>The count needs to be updated with each tick</li>
    <li>The count must have a backup in case of failures (it can't start from scratch!)</li>
</ul>
<li>Interoperability with batch processing systems</li>
<ul>
    <li>Example: A real time inventory tracker</li>
    <li>A static file may contain item (number, price, location) information</li>
    <li>The streaming application contains transaction information (item, number sold, number bought)</li>
    <li>The streaming data needs to be joined with the static file to get updated inventory information</li>
</ul>
<li><span style="color:red">Exactly once</span> data guarantees</li>
<ul>
    <li>Some stream processing systems may guarantee "at least once" (data could be duplicated or read twice), others may guarantee "at most once" (data could be lost). Ideally, they should guarantee "exactly once".</li>
</ul>
<li>Dealing with uneven data arrival rates</li>
<ul>
    <li>If data is arriving from multiple sources, records may not arrive in the correct temporal sequence (see the Time section)</li>
</ul>
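
The "most traded" stocks example above can be sketched in plain Scala. This is only an illustrative model, not a Spark API: the names <span style="color:blue">TopTraded</span>, <span style="color:blue">update</span>, and <span style="color:blue">topN</span> are hypothetical, the ticks are made-up sample data, and the "backup" is simply a saved copy of the in-memory map standing in for a copy on disk.

```scala
// Hypothetical names throughout; a plain-Scala model, not a Spark API.
object TopTraded {
  type Counts = Map[String, Long]

  // Update the in-memory running count with one tick (symbol, shares traded).
  def update(counts: Counts, tick: (String, Long)): Counts = {
    val (symbol, shares) = tick
    counts.updated(symbol, counts.getOrElse(symbol, 0L) + shares)
  }

  // Report the n most traded symbols, highest volume first (ties broken by symbol).
  def topN(counts: Counts, n: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (symbol, shares) => (-shares, symbol) }.take(n)

  // Made-up sample ticks.
  val ticks: Seq[(String, Long)] =
    Seq(("AAPL", 100L), ("MSFT", 50L), ("AAPL", 25L), ("IBM", 75L))

  // Process the first two ticks, then keep a backup copy of the state.
  val backup: Counts = ticks.take(2).foldLeft(Map.empty[String, Long])(update)

  // After a simulated failure, resume from the backup instead of from scratch.
  val recovered: Counts = ticks.drop(2).foldLeft(backup)(update)
}
```

Calling <span style="color:blue">TopTraded.topN(TopTraded.recovered, 2)</span> gives the two most traded symbols. The point of the sketch is that recovery only replays the ticks that arrived after the backup was taken.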

<br><br><br>
<span style="color:green;font-size:xx-large">Time</span>
<br>
<li><span style="color:red">Event time</span>: the time the event occurs (e.g., the time a trade actually takes place). Note that the event time is usually independent of the streaming application</li>
<li><span style="color:red">Processing time</span>: the time at which the system processes the event. This may be very close to the event time (nanoseconds) or not so close (milliseconds, seconds, ...). <b>Ideally, a streaming app should report the correct status at each event time</b></li>

<img src="event time vs processing time.png">
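
A minimal sketch of the distinction between the two notions of time, in plain Scala. The <span style="color:blue">Trade</span> case class, its fields, and the timestamps are all hypothetical; times are in milliseconds.

```scala
object EventTimeDemo {
  // eventTime: when the trade happened; processingTime: when the app saw it (both in ms).
  case class Trade(symbol: String, eventTime: Long, processingTime: Long)

  // Lag between an event happening and the system processing it.
  def lagMs(t: Trade): Long = t.processingTime - t.eventTime

  // Restore event-time order for records that arrived out of sequence.
  def byEventTime(trades: Seq[Trade]): Seq[Trade] = trades.sortBy(_.eventTime)

  // Listed in arrival (processing-time) order; the MSFT trade happened first but arrived second.
  val arrivals: Seq[Trade] = Seq(
    Trade("AAPL", eventTime = 1000L, processingTime = 1005L),
    Trade("MSFT", eventTime = 900L, processingTime = 1250L),
    Trade("IBM", eventTime = 1100L, processingTime = 1300L)
  )
}
```

Sorting by <span style="color:blue">eventTime</span> recovers the order in which the trades actually took place, which is what an application needs in order to report the correct status at each event time.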

<br><br>
<span style="color:green;font-size:xx-large">Stream processing models</span>
<br><br>
<li><span style="color:red">Record-at-a-time</span> processes each piece of arriving data as soon as it arrives</li>
<ul>
    <li>low latency (immediate processing of data)</li>
    <li>difficult to deal with out of sequence data</li>
    <li>(lower throughput) relatively inefficient since the processing algorithm has to run for each data arrival</li>
</ul>
<li><span style="color:red">micro-batching</span> accumulates arriving data into small batches and processes a batch at a time</li>
<ul>
    <li>higher latency (data is processed only once the batch has been formed)</li>
    <li>easier to deal with out of sequence data (e.g., arrival order matters less within a batch)</li>
    <li>(higher throughput) relatively efficient since the processing algorithm runs a batch at a time</li>
</ul>
<li>Generally, if a stream is very busy, micro batching is preferred. If latency is an issue (e.g., detecting hacker attacks on a machine), then record-at-a-time is preferred</li>
<li>Spark uses the micro-batching model (Structured Streaming reports end-to-end latencies as low as about 100 milliseconds in this mode)</li>
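
The throughput trade-off between the two models can be illustrated with a small plain-Scala sketch. The names are hypothetical, and for simplicity batches are formed by record count rather than by time interval, which is an assumption made only for this illustration.

```scala
object ProcessingModels {
  // Record-at-a-time: the handler runs once per arriving record.
  // Returns (sum of all records, number of handler invocations).
  def recordAtATime(stream: Seq[Int]): (Int, Int) = {
    var calls = 0
    var total = 0
    stream.foreach { x => calls += 1; total += x }
    (total, calls)
  }

  // Micro-batching: records are grouped and the handler runs once per batch.
  // Returns (sum of all records, number of handler invocations).
  def microBatch(stream: Seq[Int], batchSize: Int): (Int, Int) = {
    var calls = 0
    var total = 0
    stream.grouped(batchSize).foreach { batch => calls += 1; total += batch.sum }
    (total, calls)
  }
}
```

Both functions compute the same total, but the micro-batch version invokes its handler far fewer times. The cost is that a record waits until its batch is complete before it is processed.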


<br><br><br><br>
<span style="color:green;font-size:xx-large">Windowing</span>

<li><span style="color:red">Windowing</span> is a stream processing pattern where the incoming data stream is divided into chunks based on temporal boundaries</li>
<li>For example, a high frequency trader may gather data every nanosecond but may process the data in one second chunks. The one second chunk is a window on the stream</li>

<span style="color:blue;font-size:large">Types of windows</span>
<li><span style="color:red">fixed windows</span>: Fixed windows are defined by a window length. The incoming stream is divided into chunks, each one window length long, and each data point is processed in exactly one window. For example, a traffic control system may divide the day into 5 minute chunks and process traffic in each 5 minute chunk to get a sense for the differences in commuting patterns at different times</li>
<li><span style="color:red">sliding windows</span>: Sliding windows sit on top of the data stream and are defined by a "slide factor" and a "window length". For example, a stock trading application may look at 5 minute moving averages that slide every minute. In this type of windowing system, each data point will be processed in multiple windows. A sliding window with a slide factor = window length is the same as a fixed window</li>
<li><span style="color:red">session windows</span>: Session windows coincide with the start and end of a session. For example, an analytics app on a web page may look at the sessions of each user for analysis (time spent/pages visited/ads clicked/page sequences). In this case, each session window will have its own (temporal) length</li>

<br>
<img src="fixed windows.png">

<img src="sliding windows.png">

<img src="session windowing.png">
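
The three window types can be sketched over a list of event timestamps in plain Scala. This is an illustration under stated assumptions, not how Spark implements windows: all names are hypothetical, timestamps are non-negative and the list is non-empty, sliding windows are assumed to start at multiples of the slide factor, and a session is assumed to end when the gap between consecutive events exceeds <span style="color:blue">maxGap</span> (the text above does not specify how a session's end is detected).

```scala
object Windows {
  // Fixed windows: each timestamp falls in exactly one window, keyed by window start.
  def fixed(ts: Seq[Long], length: Long): Map[Long, Seq[Long]] =
    ts.groupBy(t => (t / length) * length)

  // Sliding windows: a new window starts every `slide`; a timestamp can fall in several.
  // Empty windows are dropped. Assumes ts is non-empty and non-negative.
  def sliding(ts: Seq[Long], length: Long, slide: Long): Map[Long, Seq[Long]] = {
    val starts = 0L to ts.max by slide
    starts
      .map(s => s -> ts.filter(t => t >= s && t < s + length))
      .filter { case (_, members) => members.nonEmpty }
      .toMap
  }

  // Session windows: a gap longer than `maxGap` closes the current session,
  // so each session ends up with its own length.
  def sessions(ts: Seq[Long], maxGap: Long): Seq[Seq[Long]] =
    ts.sorted.foldLeft(Vector.empty[Vector[Long]]) { (acc, t) =>
      if (acc.nonEmpty && t - acc.last.last <= maxGap) acc.init :+ (acc.last :+ t)
      else acc :+ Vector(t)
    }
}
```

With a slide factor equal to the window length, <span style="color:blue">sliding</span> produces the same windows as <span style="color:blue">fixed</span>. With a smaller slide factor, the same timestamp appears in several windows.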

<br><br><br>
<span style="color:green;font-size:xx-large">Spark Streaming</span>
<br><br>

<li>Spark doesn't work with continuous streams</li>
<li>Rather, it monitors the stream port, collects data into small batches, and processes each batch</li>
<li>Because the smallest practical batch interval is about <span style="color:blue">500ms</span> for RDD streaming, and Structured Streaming's default micro-batch mode reaches end-to-end latencies of roughly 100 milliseconds at best, think of Spark Streaming as "Almost Streaming"</li>


<br><br>
<span style="color:blue;font-size:large">Streaming abstractions in Spark</span>



<li><span style="color:red">RDD based streaming abstraction</span>: DStream (Discretized Stream) abstraction</li>
<li><span style="color:red">Dataframe based stream abstraction</span>: Structured streaming</li>


<span style="color:blue;font-size:large">A Spark Streaming Application</span>
<img src="streaming.png">

<br><br><br><br><br>
<span style="color:red;font-size:50px">RDD Based Streaming</span>
<br><br>

<span style="color:green;font-size:xx-large">Streaming context</span>



<li>The entry point for spark streaming</li>
<li>Sets up batch size</li>
<li>Arguments: the spark context and the time interval for a micro batch</li>
<li>We will use 10 seconds to give us enough time to see stuff happening</li>

In [5]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
// sc is the SparkContext provided by the notebook; each micro batch covers 10 seconds
val ssc = new StreamingContext(sc, Seconds(10))


import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@56e89e6a


In [2]:
val ssc = new StreamingContext(sc,Seconds(10))

ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@4a2c2521


<span style="color:green;font-size:xx-large">Batch processing</span>

<li>The basic unit of a streaming application is a batch processing program</li>
<li>When a batch is ready (e.g., every 10 seconds), the batch processing program processes it to generate data</li>
<li>For example, a Twitter stream application may compute the sentiment of the tweets in each batch</li>
<li>A stock price stream application may calculate or update moving averages or other technical indicators</li>
<li>We'll start with a "socket listener"</li>
<li><span style="color:blue">ssc.socketTextStream</span> connects to the specified host and port (localhost and 4444 for us) and reads the lines of text sent by the listener</li>

<span style="color:blue;font-size:large">Create a listener</span>
<li>We'll use a local socket listener</li>
<li>Open a terminal (OSX/Linux) or cmd (Windows) window</li>
<li>OSX or linux (at the terminal prompt)</li>
<ul><li> nc -lk 4444</li></ul>
<li>Windows</li>
<ul><li>Follow instructions <a href="https://joncraton.org/blog/46/netcat-for-windows/">here</a> (make sure you read the comments if you run into trouble!) and download netcat</li></ul>
<li>Once the listener is created, start the stream</li>
<li>And start typing!</li>

<span style="color:blue;font-size:large">The program</span>
<li>Reads data coming in through the socket</li>
<li>Does a word count on each batch</li>
<li>Each batch is an RDD; the sequence of batches is a <span style="color:blue">DStream</span> object</li>

<br><br><br>
<span style="color:green;font-size:xx-large">The Spark Streaming Process</span>
<li>Create a StreamingContext from the SparkContext</li>
<li>Create a DStream from the StreamingContext</li>
<li>Create a set of transformations and actions that can be applied to each RDD in the DStream</li>
<li>Start the streaming context</li>
<ul>
    <li>Once started, you cannot change the transformations and actions</li>
    <li>There must be at least one action (an output operation such as print or foreachRDD), otherwise the StreamingContext will not start</li>
</ul>
<li>When done, stop the streaming context</li>
<ul>
    <li>Once stopped, you can't restart the streaming context</li>
    <li>You must create a new streaming context if you need to keep it running</li>
</ul>

In [17]:
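import org.apache.spark.streaming.{Seconds, StreamingContext}
// Micro-batch interval of 10 seconds
val ssc = new StreamingContext(sc, Seconds(10))

//NOTE: lines, words, pairs, wordCounts are all DStream objects, not RDDs!
// Read lines of text from the netcat listener on localhost:4444
val lines = ssc.socketTextStream("localhost", 4444)
// Split each line into words
val words = lines.flatMap(line => line.split(" "))
// Pair each word with a count of 1
val pairs = words.map(word => (word, 1))
// Sum the counts per word within each batch
val wordCounts = pairs.reduceByKey((x, y) => x + y)
// Output operation: print the first few elements of each batch
wordCounts.print()
// wordCounts.foreachRDD((rdd,time) => print(time.toString.takeRight(10),": Total  ",rdd.collect()(0)))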

import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@258ccff1
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@797b9aab
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@53aaa0ed
pairs: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.MappedDStream@f047741
wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.ShuffledDStream@4c34c217


<span style="color:blue;font-size:large">Start the stream</span>
<li>So far, we've created a program that can be applied to each microbatch</li>
<li>Once we start listening at the stream, the program will be active</li>

In [18]:
//!nc -lk 4444

<span style="color:blue">start</span> is a method that starts listening on the stream and processing batches when they arrive
<li>Any unprocessed data in the stream buffer will be picked up by the application</li>

In [19]:
ssc.start()

-------------------------------------------
Time: 1671473640000 ms
-------------------------------------------

22/12/19 13:14:01 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/19 13:14:01 WARN BlockManager: Block input-0-1671473641200 replicated to only 0 peer(s) instead of 1 peers
22/12/19 13:14:01 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/19 13:14:01 WARN BlockManager: Block input-0-1671473641600 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1671473650000 ms
-------------------------------------------
(Hi,1)
(is,1)
(name,1)
(Wei,1)
(my,1)
(Zhou,1)

22/12/19 13:14:13 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/19 13:14:13 WARN BlockManager: Block input-0-1671473653400 replicated to only 0 peer(s) instead of 1 peers
22/12/19 13:14:13 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/19 13:14

<span style="color:blue;font-size:large">When done, stop the stream</span>

In [20]:
// false: stop only the StreamingContext and keep the underlying SparkContext running
ssc.stop(false)

22/12/19 13:14:25 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
22/12/19 13:14:25 WARN SocketReceiver: Error receiving data
java.net.SocketException: Socket closed
	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:181)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
	at org.apache

<br><br><br>
<span style="color:green;font-size:xx-large">DStream: The Streaming Structured Object</span>
<li>DStream: Discretized Stream</li>
<li>A DStream object is a sequence of RDDs, one at each time interval</li>
<li>Each micro-batch is processed independently of other batches</li>
<li>micro-batch RDDs are "state independent"</li>
<li>Roughly:
    <ul>
        <li>inputs are collected from the stream</li>
        <li>at the time interval mark, they are put into an RDD</li>
        <li>the RDD is handed over to the process app for processing</li>
        <li>so, each interval corresponds to one RDD in the DStream</li>
    </ul>
</li>
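
The idea that each micro-batch is processed independently can be modelled in plain Scala. This is a sketch only: ordinary collections stand in for RDDs, and <span style="color:blue">DStreamModel</span> and its members are hypothetical names, not part of Spark.

```scala
object DStreamModel {
  // One word count per micro-batch; nothing is carried over between batches.
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.size }

  // A stream modelled as a sequence of batches, one per time interval.
  val stream: Seq[Seq[String]] = Seq(
    Seq("hello spark", "hello stream"), // batch for interval 1
    Seq("hello again")                  // batch for interval 2
  )

  // The same program is applied to every batch separately.
  val perBatch: Seq[Map[String, Int]] = stream.map(wordCount)
}
```

The word "hello" is counted separately in each interval. The count from the first batch is not added to the count in the second, which is what "state independent" means here.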


<br><br><br>
<span style="color:green;font-size:xx-large">Streaming sources</span>
<li><b>socketTextStream</b>: collects data from a socket listener</li>
<li><b>fileStream</b>: collects data from new files in a "hadoop compatible" file system</li>
<li><b>textFileStream</b>: collects data from new files as text</li>
<li><b>Other:</b> rawSocketStream, queueStream, binaryRecordsStream</li>

<span style="color:blue;font-size:large">File streaming example</span>
<li>Create a folder where the files will be added</li>
<li>A file stream looks for <span style="color:red">new</span> files</li>
<li>Thanks to the vagaries of file systems, what counts as a "new" file may not be obvious</li>
<li>For example: </li>
<ul><li>On a mac, the file must have a new id</li>
    <li>Use the cp command in terminal to make a new file</li>
</ul>
<li>Of course, in practice, the file will always be new (for example, an internet log file that gets created every few minutes)</li>

In [None]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc,Seconds(10.toLong))

In [None]:
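// Watch the "stream" folder (relative to the working directory) for new text files
val lines = ssc.textFileStream("stream")
val words = lines.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey((x, y) => x + y)
// Output operation: print the first few word counts of each batch
wordCounts.print()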




In [None]:
ssc.start()

In [None]:
ssc.stop(false)

In [None]:
//cat file1
//cp file1 file7 ->copy file 1 to file 7