<br><br><br>
<span style="color:red;font-size:60px">Windowing with DStreams</span>
<br><br>

In [None]:
import org.apache.log4j.Logger
import org.apache.log4j.Level

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)



<li>Data distributions may change over time (non-stationary)</li>
<li>We might want to overweight recent data and underweight older data</li>
<li>Examples:
<ul><li>Monitoring the frequency of errors in log files</li>
<li>A high-frequency market-making app that uses data from short time windows</li>
<li>Monitoring passenger flows into a railway station</li>
</ul></li>
<li>Spark Streaming allows the creation of sliding windows (and therefore also fixed windows, which are sliding windows whose slide interval equals the window length)</li>

<br><br><br>
<span style="color:green;font-size:xx-large">Windows in Spark</span>
<br><br>
<li>Spark DStreams implement sliding windows</li>
<li>Each window has a length (for example, 20 seconds)</li>
<li>Each window has a "sliding interval" (for example, a new window is computed every 10 seconds)</li>
<li>The window length must be a multiple of the microbatch interval</li>
<li>The sliding interval must be a multiple of the microbatch interval</li>
<li>Both can be, at a minimum, equal to the microbatch interval</li>


<span style="color:blue;font-size:large">windows, slide intervals, and microbatches</span>
<img src="streaming_windows.png">
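
The relationship between windows, slide intervals, and microbatches can be sketched in plain Scala, without Spark. This is only an illustration: the object <b>WindowSketch</b>, its <b>windows</b> function, and the 10-second batch interval are hypothetical names and values chosen for the example, and the sketch assumes that the first window is emitted one slide interval after the stream starts.

```scala
// Pure-Scala sketch (no Spark needed): which microbatches does each window cover?
object WindowSketch {
  // Returns, for each slide point, (emit time in seconds, end times of the microbatches in that window).
  // All arguments are in seconds; `total` is how long the hypothetical stream runs.
  def windows(batch: Int, length: Int, slide: Int, total: Int): Seq[(Int, Seq[Int])] = {
    require(length % batch == 0, "window length must be a multiple of the batch interval")
    require(slide % batch == 0, "slide interval must be a multiple of the batch interval")
    (slide to total by slide).map { t =>
      // Microbatches ending in the interval (t - length, t]; early windows contain fewer batches.
      val ends = ((t - length + batch) to t by batch).filter(_ > 0)
      (t, ends)
    }
  }
}

// A 20 second window sliding every 10 seconds over 10 second microbatches.
WindowSketch.windows(batch = 10, length = 20, slide = 10, total = 40).foreach {
  case (t, ends) => println(s"window emitted at ${t}s covers batches ending at ${ends.mkString(", ")}s")
}
```

Calling the function with a window length or slide interval that is not a multiple of the batch interval fails the <b>require</b> checks, which mirrors the constraint listed above.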

<br><br><br>
<span style="color:green;font-size:xx-large">Window transformations</span>

<li><b>window(window_length,slide_interval)</b>: Returns a new DStream containing all the elements in each window</li>
<li><b>countByWindow(window_length,slide_interval)</b>: Returns the number of elements in each window</li>
<li><b>countByValueAndWindow(window_length,slide_interval)</b>: Counts the occurrences of each distinct value in each window</li>
<li><b>reduceByWindow(function,window_length,slide_interval)</b>: Applies reduce, using the specified function, to the data in each window</li>
<li><b>reduceByKeyAndWindow</b>: The per-key version of reduceByWindow, for DStreams of (key, value) pairs</li>
<li>Windowed operations that update incrementally with an inverse function (including countByWindow and countByValueAndWindow) require checkpointing to be enabled</li>

<b>reduceByKeyAndWindow(reduce_function,inverse_function,window_duration,slide_duration)</b>
<ul>
<li><b>reduce_function</b>: The ordinary reduce function (in our example, add)</li>
<li><b>inverse_function</b>: The function that removes the contribution of data that slides out of the window (for add, the inverse is subtract)</li>
<li><b>window_duration</b>: The window size (a multiple of the batch interval)</li>
<li><b>slide_duration</b>: The amount the window slides by (also a multiple of the batch interval)</li>
</ul>
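
The role of the inverse function can be illustrated in plain Scala, without Spark. The sketch below is not Spark's implementation: <b>InverseReduceSketch</b> and the per-batch counts are hypothetical. It only shows that, when the inverse function truly undoes the reduce function, updating the previous window value gives the same result as reducing the whole window again.

```scala
object InverseReduceSketch {
  // Recompute the window value from scratch by reducing every batch value in the window.
  def recompute(windowBatches: Seq[Int], reduce: (Int, Int) => Int): Int =
    windowBatches.reduce(reduce)

  // Incremental update: remove what left the window, then add what entered.
  def incremental(previous: Int, leaving: Seq[Int], entering: Seq[Int],
                  reduce: (Int, Int) => Int, inverse: (Int, Int) => Int): Int = {
    val afterRemoval = if (leaving.isEmpty) previous else inverse(previous, leaving.reduce(reduce))
    if (entering.isEmpty) afterRemoval else reduce(afterRemoval, entering.reduce(reduce))
  }
}

// Hypothetical per-batch counts for one key; window = 3 batches, slide = 2 batches.
val counts = Seq(4, 1, 3, 2, 5)
val add = (x: Int, y: Int) => x + y
val sub = (x: Int, y: Int) => x - y

// Previous window covers batches 0..2; after the slide it covers batches 2..4.
val previous = InverseReduceSketch.recompute(counts.slice(0, 3), add)
val updated = InverseReduceSketch.incremental(previous, counts.slice(0, 2), counts.slice(3, 5), add, sub)
val recomputed = InverseReduceSketch.recompute(counts.slice(2, 5), add)
println(s"previous = $previous, updated = $updated, recomputed = $recomputed")
```

The incremental form only touches the batches that leave and enter, which is why it pays off when the window is much longer than the slide interval.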

<span style="color:blue;font-size:large">Example: countByWindow</span>

<li>Microbatch size: 10 seconds</li>
<li>Window length: 30 seconds</li>
<li>Slide interval: 20 seconds</li>
<li>Report the total number of words in each microbatch and in each window</li>



In [None]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 4444)
ssc.checkpoint("checkpoint")
val words = lines.flatMap(l => l.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey((x, y) => x + y)
wordCounts.print()
//Total words in the microbatch (10 seconds)
//This will print every 10 seconds
words.count.print

//Total number of words in every 30 second window, sliding every 20 seconds
//The result is printed each time the window slides (20, 40, 60, 80, ...)
//At 20 seconds: total words in the first 20 seconds (the first window is not yet full)
//At 40 seconds: total words in the last three 10 second microbatches
//At 60 seconds, etc.
val wcount = words.countByWindow(Seconds(30), Seconds(20))
wcount.print

In [None]:
ssc.start

In [None]:
ssc.stop(false)

<li>To pretty print, we can use <span style="color:blue">foreachRDD</span></li>
<li><span style="color:blue">foreachRDD</span> applies a function to each RDD that the DStream generates (one per batch or window)</li>
<li>In this example, we use it for printing</li>
<li>Since wcount is a DStream whose RDDs each hold a single count, we extract that element from each RDD</li>

<span style="color:blue;font-size:large">Example: foreachRDD</span>
<li>DStream objects consist of RDDs</li>
<li>foreachRDD applies a function to each RDD in a DStream object</li>

In [None]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 4444)
ssc.checkpoint("checkpoint")
val words = lines.flatMap(l => l.split(" "))

//Total words in the microbatch (10 seconds)
//This will print every 10 seconds
words.count.print

//Total number of words in every 30 second window, sliding every 20 seconds
//The result is printed each time the window slides (20, 40, 60, 80, ...)
//At 20 seconds: total words in the first 20 seconds (the first window is not yet full)
//At 40 seconds: total words in the last three 10 second microbatches
//At 60 seconds, etc.
val wcount = words.countByWindow(Seconds(30), Seconds(20))
//Each RDD holds at most one element (the count); headOption guards against an empty RDD
wcount.foreachRDD((rdd, time) => println(s"${time.toString.takeRight(6)}: Total ${rdd.collect().headOption.getOrElse(0L)}"))

In [None]:
ssc.start()

In [None]:
ssc.stop(false)

<span style="color:blue;font-size:large">Example: countByValueAndWindow</span>


<li>Microbatch size: 10 seconds</li>
<li>Window length: 30 seconds</li>
<li>Slide interval: 20 seconds</li>
<li>Calculate the total instances of each word in a window</li>
<li>We'll use countByValueAndWindow for this</li>

In [2]:


import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 4444)
ssc.checkpoint("checkpoint")
val words = lines.flatMap(l => l.split(" "))

//Construct the (word,1) pairs
val pairs = words.map(word => (word, 1))
pairs.print()

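//countByValueAndWindow counts each distinct value; here the values are (word,1) pairs,
//so each result has the form ((word,1),count)
val windowedWordCounts1 = pairs.countByValueAndWindow(Seconds(30),Seconds(20))

windowedWordCounts1.print

//Pretty print: the batch time followed by one (word,count) line per distinct word
windowedWordCounts1.foreachRDD((rdd,time) => {
    println(time.toString.takeRight(15))
    rdd.foreach(r => println((r._1._1, r._2)))
})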
 

import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@3f907db8
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@3fe4a956
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@2438db2e
pairs: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.MappedDStream@3c5f8492
windowedWordCounts1: org.apache.spark.streaming.dstream.DStream[((String, Int), Long)] = org.apache.spark.streaming.dstream.ReducedWindowedDStream@9e531c


In [4]:
ssc.start()

-------------------------------------------
Time: 1668290350000 ms
-------------------------------------------
(John,1)
(John,1)
(Jim,1)
(Jim,1)
(Jim,1)

-------------------------------------------
Time: 1668290360000 ms
-------------------------------------------

-------------------------------------------
Time: 1668290360000 ms
-------------------------------------------
((Jim,1),3)
((John,1),2)

668290360000 ms
(Jim,3)
(John,2)
-------------------------------------------
Time: 1668290370000 ms
-------------------------------------------
(james,1)
(james,1)
(Jim,1)

-------------------------------------------
Time: 1668290380000 ms
-------------------------------------------
(John,1)
(Jim,1)
(Jim,1)
(Jim,1)

-------------------------------------------
Time: 1668290380000 ms
-------------------------------------------
((Jim,1),4)
((james,1),2)
((John,1),1)

668290380000 ms
(Jim,4)
(james,2)
(John,1)


In [3]:
sc.setLogLevel("ERROR")

In [5]:
ssc.stop(false)

22/11/12 16:59:44 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
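
The windowed counts in the run above can be checked by hand with a plain-Scala sketch that needs no Spark. <b>CountByValueSketch</b> is a hypothetical helper, and the batch contents are copied from the output shown above (the second microbatch was empty). The sketch assumes that the first window, emitted after 20 seconds, holds only the first two microbatches, and that the second window holds the last three.

```scala
object CountByValueSketch {
  // Count occurrences of each distinct value across the microbatches of one window.
  def countByValue(windowBatches: Seq[Seq[String]]): Map[String, Long] =
    windowBatches.flatten.groupBy(identity).map { case (w, ws) => (w, ws.size.toLong) }
}

// The four 10 second microbatches from the run above.
val batches = Seq(
  Seq("John", "John", "Jim", "Jim", "Jim"),
  Seq.empty[String],
  Seq("james", "james", "Jim"),
  Seq("John", "Jim", "Jim", "Jim")
)

// First window (microbatches 1-2) and second window (microbatches 2-4).
println(CountByValueSketch.countByValue(batches.slice(0, 2)))
println(CountByValueSketch.countByValue(batches.slice(1, 4)))
```

The two maps contain the same counts as the windowed output above: Jim 3 and John 2 for the first window, then Jim 4, james 2, and John 1 for the second.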


<span style="color:blue;font-size:large">Example: reduceByKeyAndWindow</span>


<li>Microbatch size: 10 seconds</li>
<li>Window length: 60 seconds</li>
<li>Slide interval: 20 seconds</li>
<li>Every time the window slides, some data goes out of the window and some data enters</li>
<li>We'll score each word as follows:</li>
<ul>
    <li>Add the instances of the words that enter to the current total</li>
    <li>Subtract half the instances of the words that leave from the current total</li>
</ul>
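
The scoring rule above can be written as a one-line update in plain Scala. <b>DecayScoreSketch</b> and the numbers used are hypothetical; this is a sketch of the arithmetic only, not of how Spark schedules the computation.

```scala
object DecayScoreSketch {
  // Score update at each slide: subtract half of what leaves, then add what enters.
  def update(current: Double, leaving: Double, entering: Double): Double =
    (current - leaving / 2) + entering
}

// A word with a current score of 6.0 loses 4 instances and gains 3 at this slide.
val newScore = DecayScoreSketch.update(current = 6.0, leaving = 4.0, entering = 3.0)
println(newScore)
```

Because only half of the departing count is subtracted, the score is not a plain windowed count: in this sketch, a word that has left the window keeps half of its old count as a residual.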

In [None]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 4444)
ssc.checkpoint("checkpoint")
val words = lines.flatMap(l => l.split(" "))
val pairs = words.map(word => (word, 1.0)) //Make this double so that the division by 2 works properly

//Total words in the microbatch (10 seconds)
pairs.count.print

//All the words in a 20 second window
val window_data = words.window(Seconds(20))

//Score for each word in a 60 second window, sliding every 20 seconds. Each score is increased by
//the count of the new words that enter every 20 seconds and decreased by half the count of the
//departing words (i.e., the words that go out of the window)
val windowedWordCounts = pairs.reduceByKeyAndWindow((x, y) => x + y, (x, y) => x - y/2, Seconds(60), Seconds(20))


window_data.print
windowedWordCounts.print


In [None]:
ssc.start

In [None]:
ssc.stop(false)