# ***Spark Streaming Window Operation***

Spark Streaming also provides windowed computations. It allows you to apply transformations over a sliding window of data:
- Each window contains a set of batches of the input stream
- Windows can be overlapped
    - i.e., the same batch can be included in many consecutive windows
    
Windows can overlap, but a window cannot contain a fraction of a batch: each window must span a whole number of batches, i.e., its length must be a multiple of the batch interval.

Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.

Any window operation needs to specify two parameters:

- **Window length**
    - The duration of the window (3 in the example)
- **Sliding interval**
    - The interval at which the window operation is performed (2 in the example)

These two parameters must be multiples of the batch interval of the source DStream.
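
As a rough illustration of how the two parameters interact, the following plain-Python sketch (not Spark code) lists which batches fall into each window. The helper name `batches_per_window` is hypothetical, both durations are expressed as numbers of batches, and only full windows are emitted, which simplifies how the first windows of a real stream are handled.

```python
def batches_per_window(num_batches, window_length, slide_interval):
    """Return, for each full window, the indices of the batches it contains.

    Hypothetical helper: window_length and slide_interval are expressed
    as numbers of batches (i.e., already divided by the batch interval).
    """
    windows = []
    # 'end' is the index one past the last batch of the window
    for end in range(window_length, num_batches + 1, slide_interval):
        windows.append(list(range(end - window_length, end)))
    return windows


# Window length 3 and sliding interval 2, over 7 batches numbered 0..6
for window in batches_per_window(7, 3, 2):
    print(window)
# [0, 1, 2]
# [2, 3, 4]
# [4, 5, 6]
```

Batches 2 and 4 each appear in two consecutive windows, which is the overlap described above.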


## **Basic Window Transformations**

To work with windows, a set of transformations is available:

- **window(windowLength, slideInterval)**
    - Returns a new DStream which is computed based on windowed batches of the source DStream


- **countByWindow(windowLength, slideInterval)**
    - Returns a new single-element stream containing the number of elements of each window
        - The returned object is a DStream of Long objects. However, it contains only one value for each window (the number of elements of the last analyzed window)


- **reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration)**
    - Returns a new single-element stream, created by aggregating elements in the stream over a sliding interval using reduceFunc
        - The function must be associative and commutative so that it can be computed correctly in parallel
    - If invReduceFunc is not None, the reduction is done incrementally using the old window's reduced value


- **countByValueAndWindow(windowDuration, slideDuration)**
    - When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs where the value of each key K is its frequency in each window of the source DStream
    

- **reduceByKeyAndWindow(func, invFunc, windowDuration, slideDuration=None, numPartitions=None)**
    - When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window
    - The window duration (length) is specified as a parameter of this invocation (windowDuration)
    - If **slideDuration** is None, the batchDuration of the StreamingContext object is used
        - i.e., 1 batch sliding window
    - If **invFunc** is provided (is not None), the reduction is done incrementally using the old window's reduced values
        - i.e., invFunc is used to apply an inverse reduce operation by considering the old values that left the window (e.g., subtracting old counts)
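
To make the semantics of the counting transformations more concrete, here is a small plain-Python model of what **countByWindow** and **countByValueAndWindow** return for a single window. It is only a sketch of the results, not Spark's implementation: the function names and the input data are hypothetical, and a window is represented as a list of batches.

```python
from collections import Counter


def count_by_window(window_batches):
    """Model of countByWindow: total number of elements in the window."""
    return sum(len(batch) for batch in window_batches)


def count_by_value_and_window(window_batches):
    """Model of countByValueAndWindow: frequency of each value in the window."""
    counts = Counter()
    for batch in window_batches:
        counts.update(batch)
    return dict(counts)


# One window made of three batches (hypothetical input data)
window = [["a", "b"], ["a"], ["c", "a"]]
print(count_by_window(window))            # 5
print(count_by_value_and_window(window))  # {'a': 3, 'b': 1, 'c': 1}
```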
        
        
# ***Checkpoints***
A streaming application must operate 24/7 and hence must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures.

Checkpointing is enabled by using the **checkpoint(folder)** method of the StreamingContext object.


## **Word Count with Windows and Checkpoints**

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Set prefix of the output folders
outputPathPrefix = "resSparkStreamingExamples"

# Create a configuration object and
# set the name of the application
conf = SparkConf().setAppName("Streaming word count")

# Create a Spark Context object
sc = SparkContext(conf=conf)

# Create a Spark Streaming Context object
ssc = StreamingContext(sc, 5)

# Set the checkpoint folder (it is needed by some window transformations)
ssc.checkpoint("checkpointfolder")

In [None]:
# Create a (Receiver) DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Apply a chain of transformations to perform the word count task
# Each transformation returns a new DStream
words = lines.flatMap(lambda line: line.split(" "))
wordsOnes = words.map(lambda word: (word, 1))

# reduceByKeyAndWindow is used instead of reduceByKey
# The duration of the window is also specified (15 seconds)
wordsCounts = wordsOnes\
.reduceByKeyAndWindow(lambda v1, v2: v1+v2, None, 15)

**Typical Exam Question**
- The professor shows two code snippets, one with **reduceByKey** and the other with **reduceByKeyAndWindow**, and asks 'Is the output the same or not?'.

The basic idea is that, if you apply only the **window** method, you redefine the granularity of your DStream: all the subsequent operations are applied at window level, on the batches contained in the window.

**In general**, if you have more than one analysis associated with the generated window, it is better to apply **window** once and then use the specific methods without window (e.g., **reduceByKey**).

**Conversely**, if you have a single step associated with that specific window, you can use the approach of this example (e.g., **reduceByKeyAndWindow**). The same holds if different operations need windows of different sizes.
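
The difference behind the exam question can be sketched in plain Python (not Spark code). The helper `reduce_by_key` and the input pairs are hypothetical. The sketch assumes that **reduceByKey** produces one result per batch, whereas a window covering several batches produces one result for the whole window.

```python
def reduce_by_key(pairs):
    """Sum the values of each key in a list of (key, value) pairs."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


# Two batches of (word, 1) pairs (hypothetical input data)
batches = [
    [("spark", 1), ("spark", 1)],
    [("spark", 1), ("stream", 1)],
]

# reduceByKey alone: one result per batch
per_batch = [reduce_by_key(batch) for batch in batches]

# A window covering both batches: one result for the whole window
per_window = reduce_by_key([pair for batch in batches for pair in batch])

print(per_batch)   # [{'spark': 2}, {'spark': 1, 'stream': 1}]
print(per_window)  # {'spark': 3, 'stream': 1}
```

Under these assumptions the two outputs coincide only when the window contains a single batch.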

In [None]:
# Print the num. of occurrences of each word of the current window
# (only 10 of them)
wordsCounts.pprint()

# Store the output of the computation in the folders with prefix
# outputPathPrefix
wordsCounts.saveAsTextFiles(outputPathPrefix, "")

# Start the computation
ssc.start()
ssc.awaitTermination()

## **Word count - Version 2**

The first part is the same ...

In [None]:
# Create a (Receiver) DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Apply a chain of transformations to perform the word count task
# Each transformation returns a new DStream
words = lines.flatMap(lambda line: line.split(" "))
wordsOnes = words.map(lambda word: (word, 1))

# reduceByKeyAndWindow is used instead of reduceByKey
# The duration of the window is also specified (15 seconds)
wordsCounts = wordsOnes\
.reduceByKeyAndWindow(lambda v1, v2: v1+v2, \
                      lambda vnow, vold: vnow-vold, 15)
# In this solution the inverse function is also specified
# in order to compute the result incrementally

In **reduceByKeyAndWindow** you can see that there are two functions. The first one is used to combine the values: it sums the values associated with each key over the batches in the window.

The second one receives the current value and the old one (**vold** is the sum of the values of the first part of the previous window, i.e., the batches that left the window; **vnow** is the value computed considering all the data). By applying this function, you discard the contribution of the first part of the previous window.

Let's visualize it:

**first batch**
Paolo

Paolo

Paolo

Paolo

Garza

**second batch**
Paolo

Garza

**third batch**
Paolo


**first window**
If we consider the first window (first + second batch), we'll have:

(Paolo, 5)

(Garza, 2)

If you don't use the inverse function, the system re-analyzes this input data and computes the final result by combining the seven values.


**second window**
For the second window (second + third batch), if we don't use the incremental approach (i.e., the inverse function), the system will return:

(Paolo, 2)

(Garza, 1)

In order to do this, the system needs to re-analyze the entire content of the window. 

### Now, the system can do something slightly more efficient

To compute these results, the system can use another approach. During the analysis of the first window, the system can store some intermediate results. Specifically, it can store the information that in the first batch we have:

(Paolo, 4)

(Garza, 1)

While in the second batch we have:

(Paolo, 1)

(Garza, 1)

And then the output of the first window:

(Paolo, 5)

(Garza, 2)

Given these values, we can compute the second window by subtracting the first batch from the output of the first window, and then summing the result with the third batch:

**first window - first batch**

(Paolo, 5 - 4)

(Garza, 2 - 1)

=


(Paolo, 1)

(Garza, 1)

**intermediate result + third batch**

(Paolo, 1 + 1)

(Garza, 1 + 0)

=


(Paolo, 2)

(Garza, 1)

There is no need to rescan all the data.
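
The same arithmetic can be reproduced with a short plain-Python sketch. It is a simplified model of the incremental idea, not Spark's actual implementation. The helpers `merge` and `slide` are hypothetical names, and the per-batch counts are the ones from the example above.

```python
def add(v1, v2):
    return v1 + v2


def subtract(vnow, vold):
    return vnow - vold


def merge(left, right, func):
    """Combine two per-key dictionaries with func."""
    result = dict(left)
    for key, value in right.items():
        result[key] = func(result[key], value) if key in result else value
    return result


def slide(previous_window, leaving, entering, func, inv_func):
    """Remove the batch that left the window, then add the one that entered."""
    result = dict(previous_window)
    for key, value in leaving.items():
        result[key] = inv_func(result[key], value)
    return merge(result, entering, func)


# Per-batch counts from the example
batch1 = {"Paolo": 4, "Garza": 1}
batch2 = {"Paolo": 1, "Garza": 1}
batch3 = {"Paolo": 1}

first_window = merge(batch1, batch2, add)
second_window = slide(first_window, batch1, batch3, add, subtract)

print(first_window)   # {'Paolo': 5, 'Garza': 2}
print(second_window)  # {'Paolo': 2, 'Garza': 1}
```

Only the batch leaving the window and the batch entering it are touched. The second batch is never rescanned.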