<br><br><br>
<span style="color:red;font-size:60px">DStream transformations</span>
<br><br>
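<li><b>map</b>: applies a function to each element of the DStream and returns a DStream of the results</li>
<li><b>flatMap</b>: like map, but each input element can produce zero or more output elements</li>
<li><b>filter</b>: keeps only the elements for which a predicate returns true</li>
<li><b>repartition</b>: changes the number of partitions (increase or decrease) for the DStream</li>
<li><b>count</b>: returns a DStream whose RDDs each hold a single value, the number of elements in the corresponding RDD of the source DStream</li>
<li><b>countByValue</b>: computes the frequency of each distinct value and returns a DStream of (value, count) pairs</li>
<li><b>union</b>: returns a DStream containing the elements of both DStreams</li>
<li><b>reduceByKey</b>: on a DStream of (key, value) pairs, combines the values of each key within a batch using a given function</li>
<li><b>updateStateByKey</b>: maintains a state for each key across batches by applying a function to the key's previous state and its new values</li>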
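The per-batch meaning of count, countByValue and reduceByKey can be sketched with ordinary Scala collections. This is only an illustration with no Spark dependency: a batch is modeled as a `Seq`, and the names `BatchOpsSketch`, `batch` and the local `reduceByKey` helper are made up for the example.

```scala
// Plain-Scala sketch (no Spark): one batch of a DStream is modeled as a Seq.
object BatchOpsSketch {
  // Hypothetical batch of words that arrived in one interval.
  val batch: Seq[String] = Seq("a", "b", "a", "c", "a")

  // count: the number of elements in the batch.
  val count: Long = batch.size.toLong

  // countByValue: frequency of each distinct element.
  val countByValue: Map[String, Long] =
    batch.groupBy(identity).map { case (v, vs) => (v, vs.size.toLong) }

  // reduceByKey: combine the values that share a key with a binary function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // Word count for the batch: map each word to (word, 1), then sum per key.
  val wordCounts: Map[String, Int] = reduceByKey(batch.map(w => (w, 1)))(_ + _)
}
```

In Spark the same operations run once per batch interval and return DStreams rather than local collections.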

<br><br><br>
<span style="color:green;font-size:xx-large">DStream union</span>
<br><br>
<li>Returns the union of two DStream objects</li>
<li>Use this to combine data arriving from two different streams</li>
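As a rough picture of what union does within one batch interval, the sketch below merges two plain Scala sequences and counts words over the result. It does not use Spark; `UnionSketch`, `batch1` and `batch2` are hypothetical names and the input lines are invented for the illustration.

```scala
// Plain-Scala sketch (no Spark) of union followed by a word count.
object UnionSketch {
  // Hypothetical lines received from two sources in the same batch interval.
  val batch1: Seq[String] = Seq("spark streaming", "hello spark")
  val batch2: Seq[String] = Seq("hello world")

  // union: all elements of both batches; duplicates are kept.
  val merged: Seq[String] = batch1 ++ batch2

  // Word count over the merged batch.
  val wordCounts: Map[String, Int] =
    merged.flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (w, ws) => (w, ws.size) }
}
```

Words that appear in both sources ("hello" here) are counted together because the count runs on the merged data.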

In [None]:
import org.apache.log4j.Logger
import org.apache.log4j.Level

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

In [None]:
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10)) // 10-second batch interval
val lines1 = ssc.socketTextStream("localhost", 4444)
val lines2 = ssc.socketTextStream("localhost", 9999)
val lines3 = lines1.union(lines2) // lines from both sockets in one DStream
val words = lines3.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey((x, y) => x + y) // word counts per batch
wordCounts.print()


In [None]:
ssc.start()

In [None]:
ssc.stop(false) // stop streaming but keep the SparkContext (sc) running

<br><br><br>
<span style="color:green;font-size:xx-large">Checkpointing</span>
<li>A stream runs continuously for a very long time and must be resilient to failures</li>
<li>Checkpointing is a mechanism for recovering from failures</li>
<li>After failure, a stream can be initialized from an existing checkpoint</li>
<li>A checkpoint must be created if a <span style="color:blue">stateful</span> transformation is being used</li>
<li>For a stateful transformation, checkpointing computes the value of an RDD and saves it. Recovery then starts from this saved value</li>

<span style="color:blue;font-size:large">Stateful transformations and checkpointing</span>
<li>The value of stateful results (e.g., averages/sums) at a point in time depends on all the prior RDDs</li>
<li>Spark requires that stateful transformations be checkpointed</li>
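The dependence of a stateful result on all earlier batches, and the way a saved value shortens recovery, can be sketched in plain Scala. This is a simplified model rather than Spark's actual checkpoint mechanism; `CheckpointSketch`, the batch values and the choice of checkpoint position are assumptions made for the illustration.

```scala
// Plain-Scala sketch (no Spark) of recovering a running total from a saved value.
object CheckpointSketch {
  // Hypothetical per-batch values arriving on a stream.
  val batches: Vector[Int] = Vector(3, 5, 2, 7, 4)

  // Running total after every batch; entry i depends on batches 0..i.
  val runningTotals: Vector[Int] = batches.scanLeft(0)(_ + _).tail

  // Suppose the state was saved right after the batch at index 2.
  val checkpointIndex: Int = 2
  val checkpointedState: Int = runningTotals(checkpointIndex)

  // Recovery: resume from the saved state and apply only the later batches.
  val recovered: Int =
    batches.drop(checkpointIndex + 1).foldLeft(checkpointedState)(_ + _)
}
```

Without the saved value, recomputing the total would require every batch since the stream started.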


<span style="color:blue;font-size:large">Setting the checkpoint</span>
<li>Specify the location where checkpoints will be saved</li>
<li>On HDFS, the location will be distributed and fault tolerant</li>

In [None]:
ssc.checkpoint("checkpoint") //checkpoint is the directory where checkpoint data will be saved

<br><br><br>
<span style="color:green;font-size:xx-large">DStream updateStateByKey</span>
<li>updateStateByKey maintains state information that is updated by each batch</li>
<li>Need to define a state (runningCount in example below)</li>
<li>And define a state update function (updateFunction in example below)</li>
<li>Note the ByKey part: state is kept separately for each key, so the DStream must contain (key, value) pairs</li>
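To make the batch-by-batch behavior concrete, here is a plain-Scala simulation of a per-key running count. It is a sketch under simplifying assumptions, not Spark's implementation: the state is an in-memory `Map`, and `UpdateStateSketch`, `update`, `step` and the sample batches are hypothetical names and data.

```scala
// Plain-Scala sketch (no Spark) of per-key state carried across batches.
object UpdateStateSketch {
  // Same shape as an update function for updateStateByKey:
  // the key's new values in this batch, plus its previous state if any.
  val update: (Seq[Int], Option[Int]) => Option[Int] =
    (newValues, running) => Some(running.getOrElse(0) + newValues.sum)

  // Apply one batch of (key, value) pairs to the current state.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    // Every key seen so far is updated, even if this batch has no values for it.
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }

  // Hypothetical batches of (word, 1) pairs.
  val batches: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "b" -> 1, "a" -> 1),
    Seq("b" -> 1, "c" -> 1)
  )

  // State after each batch.
  val states: Seq[Map[String, Int]] =
    batches.scanLeft(Map.empty[String, Int])(step).tail
}
```

After the first batch the state holds a = 2 and b = 1; after the second, a stays at 2, b becomes 2, and c appears with 1.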

<br><br><br>
<span style="color:blue;font-size:large">Set up the streaming context</span>

In [None]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(20))

<br><br><br>
<span style="color:blue;font-size:large">Create a function that updates the state</span>
<li>We need to keep a running total for each key</li>
<li>So that's what we will update</li>
<li>We need to initialize values for each key's running total so we'll use Option[Int] for that (either it exists or it doesn't)</li>
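The Option type matters in both directions. On input, None means the key has no previous state. On output, Spark's updateStateByKey removes a key from the state when the update function returns None. The variant below is a hypothetical illustration of that second case (it forgets a key as soon as a batch brings no values for it); `ExpiringUpdateSketch` and `expiringUpdate` are names made up for the sketch.

```scala
// Plain-Scala sketch of an update function that can also drop a key.
object ExpiringUpdateSketch {
  // Hypothetical variant: forget a key when a batch has no new values for it.
  val expiringUpdate: (Seq[Int], Option[Int]) => Option[Int] =
    (newValues, running) =>
      if (newValues.isEmpty) None                      // signal that the key should be dropped
      else Some(running.getOrElse(0) + newValues.sum)  // start from 0 if there is no prior state
}
```

Calling `ExpiringUpdateSketch.expiringUpdate(Seq(5, 7), None)` starts the count from 0, while passing an empty `Seq` returns None regardless of the previous state.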

<br><br><br>
<span style="color:blue;font-size:large">Define an update function</span>
<li> Simple function. Set value to 0 if it doesn't already exist</li>
<li> then add in the new values and return the updated value</li>
<li> This function will be passed to updateStateByKey</li>
    

In [None]:
val updateFunction = (nv: Seq[Int], rc: Option[Int]) => { //nv = new values for the key in this batch; rc = running count
    val uc = rc.getOrElse(0) + nv.sum  //If rc does not exist, start from 0, otherwise use the existing value
    Some(uc) //Wrap in Some to keep the key; returning None would remove the key from the state
}

In [None]:
updateFunction(Seq(5,7),Some(5))

In [None]:
updateFunction(Seq(5,7),None)

<br><br><br>
<span style="color:blue;font-size:large">Do the transformations (adding updateStateByKey)</span>

In [None]:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(20))
ssc.checkpoint("checkpoint")

val lines = ssc.socketTextStream("localhost", 4444)

val words = lines.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))

//val runningCount = pairs.updateStateByKey[Int]((a: Seq[Int],b: Option[Int]) => Some(b.getOrElse(0) + a.sum))
val runningCount = pairs.updateStateByKey[Int](updateFunction)
runningCount.print()
ssc.start()

In [None]:
ssc.stop(false)

<br><br><br>
<span style="color:green;font-size:xx-large">DStream transform</span>
<li>The <b>transform</b> function on a DStream returns a new DStream by applying an RDD-to-RDD function to each RDD in the DStream</li>
<li>It is particularly useful for joining a static RDD to the RDDs in a DStream</li>
<li>Example: Individual sales of products are coming in on a stream. Calculate revenue on each sale by multiplying by prices in a static RDD</li>
<pre>
Example Sales
A,7
B,2
A,11
</pre>
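What the join computes for a single batch can be sketched with plain Scala collections before running it on a stream. This is an illustration with no Spark dependency: the price table mirrors the values used in this section's code cell, the batch is the example sales above, and `TransformJoinSketch` is a hypothetical name.

```scala
// Plain-Scala sketch (no Spark) of joining one batch of sales with static prices.
object TransformJoinSketch {
  // Static price table: product -> unit price.
  val prices: Map[String, Double] = Map("A" -> 10.2, "B" -> 7.4)

  // One batch of raw lines, as in the example sales.
  val lines: Seq[String] = Seq("A,7", "B,2", "A,11")

  // Parse "product,quantity" into a (product, quantity) pair.
  val sales: Seq[(String, Int)] = lines.map { line =>
    val Array(product, qty) = line.split(",")
    (product, qty.trim.toInt)
  }

  // Inner join on product, then multiply quantity by unit price.
  val revenue: Seq[(String, Double)] = for {
    (product, qty) <- sales
    price <- prices.get(product)
  } yield (product, qty * price)
}
```

A sale whose product has no entry in the price table is dropped, which matches the inner-join behavior of `rdd.join`.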

In [None]:
import org.apache.spark.streaming.{Seconds, StreamingContext}

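val ssc = new StreamingContext(sc, Seconds(10)) // 10-second batch interval
val prices = Array(("A",10.2),("B",7.4)) // static (product, unit price) pairs
val pricesRDD = sc.parallelize(prices)

val lines = ssc.socketTextStream("localhost", 4444)
// Parse "product,quantity" lines into (product, quantity) pairs
val sales = lines.map(line=>line.split(","))
        .map(l=>(l(0),l(1).trim.toInt))
// Join every batch RDD with the static prices: (product, (quantity, price))
val salesPrices = sales.transform(rdd => rdd.join(pricesRDD))
val revenue = salesPrices.map(t => (t._1,t._2._1*t._2._2)) // quantity * price
revenue.print()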

In [None]:
ssc.start()

In [None]:
ssc.stop(false)