<br><br><br>
<span style="color:red;font-size:60px">DStream transformations</span>
<br><br>
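<li><b>map</b>: applies a function to every element and returns a DStream of the results</li>
<li><b>flatMap</b>: like map, but each input element can produce zero or more output elements</li>
<li><b>filter</b>: keeps only the elements for which a predicate returns true</li>
<li><b>repartition</b>: changes the number of partitions (increase or decrease) for the DStream</li>
<li><b>count</b>: returns a DStream of single-element RDDs, each holding the number of elements in the corresponding RDD of the source DStream</li>
<li><b>countByValue</b>: computes the frequency of each distinct value and returns a DStream of (value, count) pairs</li>
<li><b>union</b>: union of two DStreams</li>
<li><b>reduceByKey</b>: on a DStream of (key, value) pairs, combines the values of each key with a reduce function</li>
<li><b>updateStateByKey</b>: maintains state per key, updating each key's state from its new values and its previous state</li>
<li><b>transform</b>: applies an RDD-to-RDD function to every RDD of a DStream, producing a new DStream</li>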
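
Each of these transformations is applied batch by batch, to the RDD the DStream produces in each interval. As a rough illustration only, the sketch below mimics what several of them compute for a single batch using plain Scala collections instead of Spark. The sample lines and the value names are made up for the example.

```scala
// One batch of lines, standing in for the RDD a DStream produces in one interval.
val batch: Seq[String] = Seq("it is a good boy", "it is not a toy")

// flatMap: each line yields many words
val words: Seq[String] = batch.flatMap(_.split(" "))

// filter: keep only words longer than two characters
val longWords: Seq[String] = words.filter(_.length > 2)

// map: turn each word into a (word, 1) pair
val pairs: Seq[(String, Int)] = words.map(w => (w, 1))

// count: number of elements in the batch
val wordCount: Long = words.size.toLong

// countByValue: frequency of each distinct value
val byValue: Map[String, Long] =
  words.groupBy(identity).map { case (w, ws) => (w, ws.size.toLong) }

// reduceByKey: combine the values of each key with a reduce function
val reduced: Map[String, Int] =
  pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).reduce(_ + _)) }

println(s"count = $wordCount, longWords = $longWords")
println(byValue)
println(reduced)
```

In Spark the same operations return DStreams rather than local collections, and nothing is computed until the streaming context is started.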

<br><br><br>
<span style="color:green;font-size:xx-large">DStream union</span>
<br><br>
<li>Returns the union of two DStream objects</li>
<li>Use this to combine data arriving from two different streams</li>

In [1]:
import org.apache.log4j.Logger
import org.apache.log4j.Level

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

Intitializing Scala interpreter ...

Spark Web UI available at http://10.56.170.160:4043
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1669753030397)
SparkSession available as 'spark'


import org.apache.log4j.Logger
import org.apache.log4j.Level


In [3]:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc,Seconds(10.toLong))
val lines1 = ssc.socketTextStream("localhost", 4444)
val lines2 = ssc.socketTextStream("localhost",9999)
val lines3 = lines1.union(lines2)
val words = lines3.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey((x, y) => x + y)
wordCounts.print()


import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@50c58355
lines1: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@70b0443c
lines2: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@5494de26
lines3: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.UnionDStream@194de938
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@3cee1ede
pairs: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.MappedDStream@4...


In [4]:
sc.setLogLevel("ERROR")

In [5]:
ssc.start()

-------------------------------------------
Time: 1669753280000 ms
-------------------------------------------

-------------------------------------------
Time: 1669753290000 ms
-------------------------------------------

-------------------------------------------
Time: 1669753300000 ms
-------------------------------------------
(a,1)
(is,1)
(from,1)
(not,1)
(it,1)
(4,1)
(boy,1)
(good,1)

-------------------------------------------
Time: 1669753310000 ms
-------------------------------------------
(a,1)
(I,1)
(9,1)
(is,1)
(from,1)
(it,1)
(am,1)

-------------------------------------------
Time: 1669753320000 ms
-------------------------------------------
(a,1)
(not,1)
(good,1)

-------------------------------------------
Time: 1669753330000 ms
-------------------------------------------

-------------------------------------------
Time: 1669753340000 ms
-------------------------------------------
(a,2)
(I,1)
(9,1)
(is,1)
(from,1)
(not,1)
(it,1)
(am,1)

-----------------------------

In [6]:
ssc.stop(false)

22/11/29 15:23:16 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
22/11/29 15:23:16 ERROR ReceiverTracker: Deregistered receiver for stream 1: Stopped by driver


<br><br><br>
<span style="color:green;font-size:xx-large">Checkpointing</span>
<li>A stream runs continuously for a very long time and must be resilient to failures</li>
<li>Checkpointing is a mechanism for recovering from failures</li>
<li>After failure, a stream can be initialized from an existing checkpoint</li>
<li>A checkpoint must be created if a <span style="color:blue">stateful</span> transformation is being used</li>
<li>For a stateful transformation, checkpointing computes the value of the state RDD and saves it. Recovery then starts from this saved value</li>

<span style="color:blue;font-size:large">Stateful transformations and checkpointing</span>
<li>The value of stateful results (e.g., averages/sums) at a point in time depends on all the prior RDDs</li>
<li>Spark requires that stateful transformations be checkpointed</li>


<span style="color:blue;font-size:large">Setting the checkpoint</span>
<li>Specify the location where checkpoints will be saved</li>
<li>On HDFS, the location will be distributed and fault tolerant</li>

In [5]:
ssc.checkpoint("checkpoint") //checkpoint is the directory where checkpoint data will be saved
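
To make the "initialize from an existing checkpoint" step concrete, here is a minimal sketch using <b>StreamingContext.getOrCreate</b>. It assumes the notebook's existing SparkContext <b>sc</b>. The directory name <b>checkpoint-recovery-demo</b>, the function name <b>createContext</b>, and the word-count pipeline inside it are placeholders chosen for illustration. Recovery of closures defined in a notebook or REPL session can be fragile, so treat this as a sketch of the pattern rather than a tested recovery setup.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical directory name; on a cluster this would normally be an HDFS path.
val checkpointDir = "checkpoint-recovery-demo"

// Builds a fresh context and the whole DStream graph.
// getOrCreate only calls this when checkpointDir holds no usable checkpoint.
def createContext(): StreamingContext = {
  val newSsc = new StreamingContext(sc, Seconds(20)) // sc: the notebook's SparkContext (assumed)
  newSsc.checkpoint(checkpointDir)
  val lines = newSsc.socketTextStream("localhost", 4444)
  lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()
  newSsc
}

// Recover the context from checkpointDir if possible, otherwise build a new one.
val recoverableSsc = StreamingContext.getOrCreate(checkpointDir, () => createContext())

// recoverableSsc.start() would then begin (or resume) processing.
```

The whole pipeline is defined inside the creating function so that a recovered context and a fresh one describe the same computation.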

<br><br><br>
<span style="color:green;font-size:xx-large">DStream updateStateByKey</span>
<li>updateStateByKey maintains state information that is updated by each batch</li>
<li>Need to define a state (runningCount in example below)</li>
<li>And define a state update function (updateFunction in example below)</li>
<li>Note the ByKey part: state is kept separately for each key, so the input must be a DStream of (key, value) pairs</li>

<br><br><br>
<span style="color:blue;font-size:large">Set up the streaming context</span>

In [6]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(20))

import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@5f39a1bd


<br><br><br>
<span style="color:blue;font-size:large">Create a function that updates the state</span>
<li>We need to keep a running total for each key</li>
<li>So that's what we will update</li>
<li>A key has no running total the first time it is seen, so the state is an Option[Int]: Some(total) if the key already has a running total, None if it does not</li>

<br><br><br>
<span style="color:blue;font-size:large">Define an update function</span>
<li>Simple function. Start from 0 if the key has no running count yet</li>
<li>Then add in the new values and return the updated value</li>
<li>This function will be passed to updateStateByKey</li>
    

In [20]:
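val updateFunction = (nv: Seq[Int], rc: Option[Int]) => { // nv = new values for a key in this batch; rc = running count so far
    val uc = rc.getOrElse(0) + nv.sum  // start from 0 if rc does not exist, otherwise use the existing value, then add the new values
    Some(uc) // wrap in Some: updateStateByKey expects an Option, so returning the bare Int uc would not compile
}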

updateFunction: (Seq[Int], Option[Int]) => Some[Int] = $Lambda$3737/0x00000008010d3040@5396f126


In [18]:
updateFunction(Seq(5,7),Some(5))

res12: Some[Int] = Some(17)


In [19]:
updateFunction(Seq(),None)

res13: Some[Int] = Some(0)
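
To see how such a function turns per-batch values into a running total, the sketch below replays a few batches through it using plain Scala collections, with no Spark involved. The helper <b>step</b> and the sample batches are made up for illustration. It only imitates the idea of applying the update function to every key that has either old state or new values, and is not how Spark implements it internally.

```scala
// Same shape as the notebook's update function: new values for a key, plus its previous state.
val update: (Seq[Int], Option[Int]) => Option[Int] =
  (newValues, running) => Some(running.getOrElse(0) + newValues.sum)

// Hypothetical helper: fold one batch of (word, 1) pairs into the state map.
def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
  // Collect the new values of each key in this batch.
  val grouped: Map[String, Seq[Int]] =
    batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
  // Every key with old state or new values gets updated; a None result would drop the key.
  val keys: Set[String] = state.keySet ++ grouped.keySet
  keys.flatMap { k =>
    update(grouped.getOrElse(k, Seq.empty[Int]), state.get(k)).map(v => (k, v))
  }.toMap
}

// Made-up batches: the same words arrive twice, then an empty batch.
val batches: Seq[Seq[(String, Int)]] = Seq(
  Seq(("a", 1), ("b", 1), ("b", 1), ("c", 1)),
  Seq(("a", 1), ("b", 1), ("b", 1), ("c", 1)),
  Seq.empty[(String, Int)]
)

// State after each batch, starting from an empty state.
val states: Seq[Map[String, Int]] = batches.scanLeft(Map.empty[String, Int])(step).tail
states.foreach(println)
```

The empty third batch leaves every total unchanged, because the update function is still called for keys that already have state, with an empty sequence of new values.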


<br><br><br>
<span style="color:blue;font-size:large">Do the transformations (adding updateStateByKey)</span>

In [13]:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val updateFunction = (nv: Seq[Int], rc: Option[Int]) => { // nv = new values for a key in this batch; rc = running count so far
    val uc = rc.getOrElse(0) + nv.sum  // start from 0 if rc does not exist, otherwise use the existing value, then add the new values
    Some(uc) // return the updated count wrapped in Some
}

val ssc = new StreamingContext(sc, Seconds(20))
ssc.checkpoint("checkpoint")

val lines = ssc.socketTextStream("localhost", 4444)

val words = lines.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))

//val runningCount = pairs.updateStateByKey[Int]((a: Seq[Int],b: Option[Int]) => Some(b.getOrElse(0) + a.sum))
val runningCount = pairs.updateStateByKey[Int](updateFunction) //define a state & use update function 
runningCount.print()
ssc.start

import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@76e93f62
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@b069e4f
words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@40d162b5
pairs: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.MappedDStream@9221be0
runningCount: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.StateDStream@39382edc


-------------------------------------------
Time: 1669754200000 ms
-------------------------------------------

-------------------------------------------
Time: 1669754220000 ms
-------------------------------------------
(a,1)
(b,2)
(c,1)

-------------------------------------------
Time: 1669754240000 ms
-------------------------------------------
(a,2)
(b,4)
(c,2)

-------------------------------------------
Time: 1669754260000 ms
-------------------------------------------
(a,2)
(b,4)
(c,2)

-------------------------------------------
Time: 1669754280000 ms
-------------------------------------------
(a,2)
(b,4)
(c,2)



In [14]:
ssc.stop(false)

22/11/29 15:38:08 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver


<br><br><br>
<span style="color:green;font-size:xx-large">DStream transform</span>
<li>The <b>transform</b> function on a DStream returns a new DStream by applying an RDD-to-RDD function to each RDD in the DStream</li>
<li>It is particularly useful for joining a static RDD to the RDDs in a DStream</li>
<li>Example: Individual sales of products (product, quantity) are coming in on a stream. Calculate the revenue of each sale by multiplying the quantity by the product's price, which is held in a static RDD</li>
<pre>
Example Sales
A,7
B,2
A,11
</pre>
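
Before wiring this into Spark, it may help to see the computation for a single batch. The sketch below uses plain Scala collections to imitate what the join inside <b>transform</b> produces for one RDD of sales. The prices match the example values used in this notebook, while the extra sale for product <b>C</b> and all value names are made up to show one consequence of an inner join: a sale with no matching price is dropped.

```scala
// Static price table (stands in for the static RDD).
val prices: Map[String, Double] = Map("A" -> 10.2, "B" -> 7.4)

// One batch of raw lines (stands in for one RDD of the stream); "C,3" has no price.
val batch: Seq[String] = Seq("A,7", "B,2", "A,11", "C,3")

// Parse each line into (product, quantity).
val sales: Seq[(String, Int)] =
  batch.map(_.split(",")).map(fields => (fields(0), fields(1).toInt))

// Inner join on product: sales without a matching price disappear.
val joined: Seq[(String, (Int, Double))] =
  sales.flatMap { case (product, qty) =>
    prices.get(product).map(price => (product, (qty, price)))
  }

// Revenue per sale = quantity * price.
val revenue: Seq[(String, Double)] =
  joined.map { case (product, (qty, price)) => (product, qty * price) }

revenue.foreach(println)
```

If unmatched sales should be kept, a left outer join would be the usual alternative to the inner join.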

In [5]:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc,Seconds(10.toLong))
val prices = Array(("A",10.2),("B",7.4))
val pricesRDD = sc.parallelize(prices)

val lines = ssc.socketTextStream("localhost", 4444)
val sales = lines.map(line=>line.split(","))
        .map(l=>(l(0),l(1).toInt))
val salesPrices = sales.transform(rdd => rdd.join(pricesRDD))
val revenue = salesPrices.map(t => (t._1,t._2._1*t._2._2))
revenue.print()

import org.apache.spark.streaming.{Seconds, StreamingContext}
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@6bb3bad7
prices: Array[(String, Double)] = Array((A,10.2), (B,7.4))
pricesRDD: org.apache.spark.rdd.RDD[(String, Double)] = ParallelCollectionRDD[32] at parallelize at <console>:36
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@3ba37767
sales: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.MappedDStream@6af7abd6
salesPrices: org.apache.spark.streaming.dstream.DStream[(String, (Int, Double))] = org.apache.spark.streaming.dstream.TransformedDStream@5da96aaf
revenue: org.apache.spark.streaming.dstream.DStream[...


In [6]:
ssc.start

-------------------------------------------
Time: 1671419310000 ms
-------------------------------------------

22/12/18 22:08:30 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/18 22:08:30 WARN BlockManager: Block input-0-1671419310600 replicated to only 0 peer(s) instead of 1 peers
22/12/18 22:08:31 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/12/18 22:08:31 WARN BlockManager: Block input-0-1671419311000 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1671419320000 ms
-------------------------------------------
(A,71.39999999999999)
(A,112.19999999999999)
(B,14.8)



In [7]:
ssc.stop(false)

22/12/18 22:08:45 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
22/12/18 22:08:45 WARN SocketReceiver: Error receiving data
java.net.SocketException: Socket closed
	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:181)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
	at org.apache