<br><br><br>
<span style="color:red;font-size:60px">Windowing with DStreams</span>
<br><br>

In [None]:
import org.apache.log4j.Logger
import org.apache.log4j.Level

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)



<li>Data distributions may change over time (they are non-stationary)</li>
<li>We might want to overweight recent data and underweight older data</li>
<li>Examples:
<ul><li>Monitoring the frequency of errors in log files</li>
<li>A high-frequency market-making app that uses data from short time windows</li>
<li>Monitoring passenger flows into a railway station</li>
</ul></li>
<li>Spark Streaming allows the creation of sliding windows (and, therefore, also fixed windows, which are sliding windows whose slide interval equals the window length)</li>

<br><br><br>
<span style="color:green;font-size:xx-large">Windows in Spark</span>
<br><br>
<li>Spark DStreams implement sliding windows</li>
<li>Each window has a length (for example, 20 seconds)</li>
<li>Each window has a "sliding interval" (for example, a new window is created every 10 seconds)</li>
<li>The window length must be a multiple of the microbatch size</li>
<li>The sliding interval must be a multiple of the microbatch size</li>
<li>Both can be, at a minimum, equal to the microbatch size</li>


<span style="color:blue;font-size:large">windows, slide intervals, and microbatches</span>
<img src="streaming_windows.png">
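The relationship between microbatches, window length, and slide interval can be sketched in plain Scala, without Spark. This is only an illustration: the helper names isValid and batchesInWindow are hypothetical, and the sketch assumes that windows close at multiples of the slide interval counted from the start of the stream.

```scala
// Batch interval, window length, and slide interval in seconds (illustrative values).
val batchSec = 10
val windowSec = 30
val slideSec = 20

// Both durations must be whole multiples of the batch interval.
def isValid(batch: Int, window: Int, slide: Int): Boolean =
  window % batch == 0 && slide % batch == 0

// End times (in seconds) of the microbatches covered by the window that closes at endSec.
// Batches that would end at or before time 0 do not exist, so early windows are shorter.
def batchesInWindow(endSec: Int, batch: Int, window: Int): Seq[Int] =
  ((endSec - window + batch) to endSec by batch).filter(_ > 0)

// Assumption: a window closes at every multiple of the slide interval.
(slideSec to 60 by slideSec).foreach { end =>
  val covered = batchesInWindow(end, batchSec, windowSec)
  println(s"window closing at ${end}s covers batches ending at ${covered.mkString(", ")}s")
}
```

With these values, the first window only covers two microbatches because the stream has been running for just 20 seconds; later windows cover three.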

<br><br><br>
<span style="color:green;font-size:xx-large">Window transformations</span>

<li><b>window(window_length,slide_interval)</b>: Returns a new DStream containing all the elements in each window</li>
<li><b>countByWindow(window_length,slide_interval)</b>: Returns the number of elements in a window</li>
<li><b>countByValueAndWindow(window_length,slide_interval)</b>: Counts the occurrences of each distinct value in a window</li>
<li><b>reduceByWindow(function,window_length,slide_interval)</b>: Applies reduce, using the specified function, to the data in the window</li>
<li><b>reduceByKeyAndWindow</b>: The key-value version of reduceByWindow; it reduces the values of each key separately</li>
<li>Windowed operations that use an inverse function (including countByWindow and countByValueAndWindow) require checkpointing to be enabled</li>

<b>reduceByKeyAndWindow(reduce_function,inverse_function,window_duration,slide_duration)</b>
<ul>
<li><b>reduce_function</b>: The ordinary reduce function (in our example, add)</li>
<li><b>inverse_function</b>: The function that removes the contribution of data sliding out of the window (for addition, subtract)</li>
<li><b>window_duration</b>: The window size (a multiple of the batch interval)</li>
<li><b>slide_duration</b>: The amount the window slides by (also a multiple of the batch interval)</li>
</ul>
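To see why an inverse function is useful, here is a minimal plain-Scala sketch (no Spark involved) that compares recomputing each window from scratch with updating the previous window incrementally. The data and the names batchTotals, reduceFn, and inverseFn are illustrative assumptions; for simplicity, the window slides by one batch at a time.

```scala
// Per-batch totals (illustrative data); each window covers 3 batches and slides by 1 batch.
val batchTotals = Vector(4, 7, 1, 3, 5, 2)
val windowBatches = 3

val reduceFn: (Int, Int) => Int = _ + _   // combines values entering the window
val inverseFn: (Int, Int) => Int = _ - _  // removes values leaving the window

// Option 1: recompute every window from scratch.
val recomputed = batchTotals.sliding(windowBatches).map(_.reduce(reduceFn)).toVector

// Option 2: start from the first full window, then for each slide
// add the entering batch and take out the leaving batch.
val firstWindow = batchTotals.take(windowBatches).reduce(reduceFn)
val incremental = (windowBatches until batchTotals.length).scanLeft(firstWindow) { (acc, i) =>
  inverseFn(reduceFn(acc, batchTotals(i)), batchTotals(i - windowBatches))
}.toVector

println(recomputed)
println(incremental)
```

Both approaches produce the same window totals, but the incremental one touches only the batches that enter or leave, which matters when windows are long relative to the slide interval.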

<span style="color:blue;font-size:large">Example: countByWindow</span>

<li>Microbatch size: 10 seconds</li>
<li>Window length: 30 seconds</li>
<li>Slide interval: 20 seconds</li>
<li>Report the total number of words in each microbatch and in each window</li>



In [4]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 4444)
ssc.checkpoint("checkpoint")
val words = lines.flatMap(l => l.split(" "))

// Total words in each microbatch (10 seconds).
// This prints every 10 seconds.
words.count.print
// Per-word counts within each microbatch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey((x, y) => x + y)
wordCounts.print()

// Total number of words in every 30-second window, sliding every 20 seconds.
// The count prints each time the window slides (at 20, 40, 60, 80, ... seconds):
// at 20 seconds it covers only the first 20 seconds, since this is the first slide;
// at 40 seconds, 60 seconds, etc. it covers the last three 10-second intervals.
val wcount = words.countByWindow(Seconds(30),Seconds(20))
wcount.print

import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@490f1dd2
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@74eb4bd8
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@53486503
pairs: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.MappedDStream@4faf3938
wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.ShuffledDStream@7e7f4c92
wcount: org.apache.spark.streaming.dstream.DStream[Long] = org.apache.spark.streaming.dstream.MappedDStream@50e46d0e


In [5]:
ssc.start

-------------------------------------------
Time: 1669757220000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1669757220000 ms
-------------------------------------------

22/11/29 16:27:01 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/11/29 16:27:01 WARN BlockManager: Block input-0-1669757220800 replicated to only 0 peer(s) instead of 1 peers
22/11/29 16:27:02 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/11/29 16:27:02 WARN BlockManager: Block input-0-1669757221800 replicated to only 0 peer(s) instead of 1 peers
22/11/29 16:27:02 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/11/29 16:27:02 WARN BlockManager: Block input-0-1669757222400 replicated to only 0 peer(s) instead of 1 peers
22/11/29 16:27:03 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/11/29 16:27:03 WARN BlockManager: Block input-0-16697572

In [6]:
ssc.stop(false)

22/11/29 16:27:48 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
22/11/29 16:27:48 WARN SocketReceiver: Error receiving data
java.net.SocketException: Socket closed
	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:181)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
	at org.apache

<li>To pretty print, we can add <span style="color:blue">foreachRDD</span></li>
<li><span style="color:blue">foreachRDD</span> iterates over the RDDs that make up a DStream</li>
<li>In this example, we use it for printing</li>
<li>Since wcount is a DStream whose RDDs each hold a single element (the window count), we extract that element from each RDD</li>

<span style="color:blue;font-size:large">Example: foreachRDD</span>
<li>DStream objects consist of RDDs</li>
<li>foreachRDD applies a function to each RDD in a DStream object</li>
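The idea of visiting each RDD together with its batch time can be sketched in plain Scala. In this sketch a sequence of (time, collection) pairs stands in for a DStream[Long]; the helper formatTotal and the name fakeStream are hypothetical, and the second data point is made up for illustration.

```scala
// Hypothetical helper: formats one window total for display.
def formatTotal(timeMs: Long, total: Long): String =
  s"$timeMs ms: Total $total"

// Stand-in for a DStream[Long]: one (batch time, single-element collection) pair per slide.
val fakeStream: Seq[(Long, Seq[Long])] = Seq(
  (1670087340000L, Seq(69L)),
  (1670087360000L, Seq(12L))  // illustrative value
)

// Analogue of foreachRDD: apply a function to each batch together with its time.
fakeStream.foreach { case (time, batch) =>
  // headOption guards against an empty batch instead of indexing with (0)
  batch.headOption.foreach(total => println(formatTotal(time, total)))
}
```

In a real streaming job, a helper like this could be called from inside foreachRDD after collecting the single element of each RDD; building one string with interpolation avoids the tuple-style output that results from passing several arguments to print.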

In [1]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 4444)
ssc.checkpoint("checkpoint")
val words = lines.flatMap(l => l.split(" "))

// Total words in each microbatch (10 seconds).
// This prints every 10 seconds.
words.count.print

// Total number of words in every 30-second window, sliding every 20 seconds.
// The count prints each time the window slides (at 20, 40, 60, 80, ... seconds):
// at 20 seconds it covers only the first 20 seconds, since this is the first slide;
// at 40 seconds, 60 seconds, etc. it covers the last three 10-second intervals.
val wcount = words.countByWindow(Seconds(30),Seconds(20))
// Each RDD of wcount holds a single element, the window count, so collect()(0) extracts it.
// print receives three arguments, which Scala packs into a tuple, so the output appears as a tuple.
wcount.foreachRDD((rdd,time) => print(time.toString.takeRight(10),": Total  ",rdd.collect()(0)))

Intitializing Scala interpreter ...

Spark Web UI available at http://vickyzmbp-2:4042
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1670087311860)
SparkSession available as 'spark'


import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@175d7431
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@67ea2d09
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@59745967
wcount: org.apache.spark.streaming.dstream.DStream[Long] = org.apache.spark.streaming.dstream.MappedDStream@1fc78c96


In [2]:
ssc.start()

22/12/03 12:08:43 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/03 12:08:43 WARN BlockManager: Block input-0-1670087322800 replicated to only 0 peer(s) instead of 1 peers
22/12/03 12:08:47 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/03 12:08:47 WARN BlockManager: Block input-0-1670087327000 replicated to only 0 peer(s) instead of 1 peers
22/12/03 12:08:47 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/03 12:08:47 WARN BlockManager: Block input-0-1670087327200 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1670087330000 ms
-------------------------------------------
69

-------------------------------------------
Time: 1670087340000 ms
-------------------------------------------
0

(7340000 ms,: Total  ,69)

In [3]:
ssc.stop(false)

22/12/03 12:09:05 WARN SocketReceiver: Error receiving data
java.net.SocketException: Socket closed
	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:181)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
	at org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.getNext(SocketInputDStream.scala:121)
	at org.a

<span style="color:blue;font-size:large">Example: countByValueAndWindow</span>


<li>Microbatch size: 10 seconds</li>
<li>Window length: 30 seconds</li>
<li>Slide interval: 20 seconds</li>
<li>Calculate the total instances of each word in a window</li>
<li>We'll use countByValueAndWindow for this</li>
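What counting by value over a window computes can be sketched in plain Scala, without Spark. The three batches are the words that appear in the sample run of this example; the helper name countByValue is a hypothetical stand-in, not the Spark method.

```scala
// Words received in three consecutive 10-second microbatches.
val batches = Vector(
  Vector("jim", "james", "nacy"),
  Vector("vicky", "jim", "james", "vicky"),
  Vector("jim", "james", "nacy")
)

// Count each distinct value across all the batches that fall inside one window.
def countByValue(window: Seq[Seq[String]]): Map[String, Long] =
  window.flatten.groupBy(identity).map { case (word, occurrences) =>
    word -> occurrences.size.toLong
  }

// A 30-second window over 10-second batches covers the last three batches.
val counts = countByValue(batches.takeRight(3))
counts.toSeq.sortBy(_._1).foreach(println)
```

The resulting counts (jim 3, james 3, nacy 2, vicky 2) agree with the totals reported for the window that contains these three batches.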

In [13]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 4444)
ssc.checkpoint("checkpoint")
val words = lines.flatMap(l => l.split(" "))

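// Construct the (word,1) pairs
val pairs = words.map(word => (word, 1))
pairs.print()

// Count each distinct (word,1) pair in the window; the result elements are ((word,1),count)
val windowedWordCounts1 = pairs.countByValueAndWindow(Seconds(30),Seconds(20))
// Alternative: count the words directly, which gives (word,count) elements
// val windowedWordCounts1 = words.countByValueAndWindow(Seconds(30),Seconds(20))

windowedWordCounts1.print

// Print the batch time, then each word with its count in the window
windowedWordCounts1.foreachRDD((rdd,time) => {
    println(time.toString.takeRight(15))
    rdd.foreach(r => println(r._1._1,r._2))
})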
 

import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@6a5ae9aa
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@65fb7203
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@b05a14f
pairs: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.MappedDStream@6f4f7882
windowedWordCounts1: org.apache.spark.streaming.dstream.DStream[((String, Int), Long)] = org.apache.spark.streaming.dstream.ReducedWindowedDStream@5265db78


In [14]:
sc.setLogLevel("ERROR")

In [15]:
ssc.start()

-------------------------------------------
Time: 1669757840000 ms
-------------------------------------------

-------------------------------------------
Time: 1669757850000 ms
-------------------------------------------
(jim,1)
(james,1)
(nacy,1)

-------------------------------------------
Time: 1669757850000 ms
-------------------------------------------
((james,1),1)
((nacy,1),1)
((jim,1),1)

669757850000 ms
(james,1)
(nacy,1)
(jim,1)
-------------------------------------------
Time: 1669757860000 ms
-------------------------------------------
(vicky,1)
(jim,1)
(james,1)
(vicky,1)

-------------------------------------------
Time: 1669757870000 ms
-------------------------------------------
(jim,1)
(james,1)
(nacy,1)

-------------------------------------------
Time: 1669757870000 ms
-------------------------------------------
((vicky,1),2)
((james,1),3)
((nacy,1),2)
((jim,1),3)

669757870000 ms
(vicky,2)
(nacy,2)
(james,3)
(jim,3)


In [16]:
ssc.stop(false)

22/11/29 16:37:52 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver


<span style="color:blue;font-size:large">Example: reduceByKeyAndWindow</span>


<li>Microbatch size: 10 seconds</li>
<li>Window length: 30 seconds</li>
<li>Slide interval: 20 seconds</li>
<li>Every time the window slides, some data goes out of the window and some data enters</li>
<li>We'll score each word as follows:</li>
<ul>
    <li>Add the instances of the words that enter to the current total</li>
    <li>Subtract half the instances of the words that leave from the current total</li>
</ul>
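The scoring rule can be written as a pair of functions in plain Scala. This is a sketch of the arithmetic only, with hypothetical names (addEntering, removeHalfLeaving, slideScore) and made-up numbers; it does not model how Spark schedules the two functions.

```scala
// Reduce function: add the instances that enter the window.
val addEntering: (Double, Double) => Double = (score, entering) => score + entering
// "Inverse" function: subtract only half of the instances that leave.
val removeHalfLeaving: (Double, Double) => Double = (score, leaving) => score - 0.5 * leaving

// One slide of the window for a single word.
def slideScore(score: Double, entering: Double, leaving: Double): Double =
  removeHalfLeaving(addEntering(score, entering), leaving)

// Illustrative numbers: current score 4, 2 instances enter, 3 instances leave.
val next = slideScore(4.0, 2.0, 3.0)
println(next)
```

Because only half of the departing instances are subtracted, the second function is not a true inverse of addition, so the score is not a plain window count: words that have left the window keep part of their weight.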

In [8]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 4444)
ssc.checkpoint("checkpoint")
val words = lines.flatMap(l => l.split(" "))
val pairs = words.map(word => (word, 1.0)) //Make this double so that the division by 2 works properly

//Total words in the microbatch (10 seconds)
// pairs.count.print

//All the words in a 20 second window
val window_data = words.window(Seconds(20))

// Each word's score, in a 60-second window sliding every 20 seconds. The scores are increased by
// the counts of the new words that enter every 20 seconds and decreased by half the counts of the
// departing words (i.e., of the words that go out of the window).
// val windowedWordCounts = pairs.reduceByKeyAndWindow((x, y)=> x + y,(x,y)=> x - 0.5*y, Seconds(60), Seconds(20))


window_data.foreachRDD((r,time) => print(time.toString.takeRight(7)))
// pairs.print
// windowedWordCounts.print


import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@2013c08
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@6a1b3ba6
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@5599dee
pairs: org.apache.spark.streaming.dstream.DStream[(String, Double)] = org.apache.spark.streaming.dstream.MappedDStream@5ada69d0
window_data: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.WindowedDStream@51b0efff


In [9]:
ssc.start

0000 ms22/12/06 16:33:22 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/06 16:33:22 WARN BlockManager: Block input-0-1670362402400 replicated to only 0 peer(s) instead of 1 peers
22/12/06 16:33:26 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/06 16:33:26 WARN BlockManager: Block input-0-1670362405800 replicated to only 0 peer(s) instead of 1 peers
0000 ms0000 ms

In [10]:
ssc.stop(false)

22/12/06 16:33:43 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
22/12/06 16:33:43 WARN SocketReceiver: Error receiving data
java.net.SocketException: Socket closed
	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:181)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
	at org.apache