# ***Streaming Data Analytics***

Streaming data analytics is the act of continuously incorporating new data to compute a result. The input data is unbounded: it has no beginning and no end. The application outputs multiple versions of the result as it runs, or writes them to storage.

Many important applications must process large streams of live data and provide results in near-real-time:
- Social network trends
- Website statistics
- Intrusion detection systems
- ...

Several frameworks have been proposed to process data streams in real time or near real time:
- Apache Spark (Streaming component)
- Apache Storm
- Amazon Kinesis Streams
- ...

All these frameworks use a cluster of servers to scale horizontally with respect to the (big) amount of data to be analyzed.

Two main “solutions”
1. **“Continuous”** computation of data streams
    - Data are processed as soon as they arrive
        - Every time a new record arrives from the input stream, it is immediately processed and a result is emitted as soon as possible
    - Real-time processing
2. **“Micro-batch”** stream processing
    - Input data are collected in micro-batches
    - Each micro-batch contains all the data received in a time window (typically less than a few seconds of data)
    - One micro-batch at a time is processed
    - Every time a micro-batch of data is ready, its entire content is processed and a result is emitted
    - Near real-time processing
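
The difference between the two approaches can be illustrated with a minimal sketch in plain Python. It assumes a hypothetical list of `(arrival_time, value)` events and a running sum as the computation; the function names and the 1-second window are illustrative choices, not part of any framework.

```python
from itertools import groupby

# Hypothetical input: (arrival_time_in_seconds, value) pairs, ordered by time.
events = [(0.2, 5), (0.9, 3), (1.4, 7), (2.1, 1), (2.8, 4)]

def continuous(events):
    """Emit an updated running sum after every single record."""
    total = 0
    for _, value in events:
        total += value
        yield total

def micro_batch(events, window_seconds):
    """Collect records per time window, then emit one result per micro-batch."""
    total = 0
    # Records whose arrival time falls in the same window form one micro-batch.
    for _, batch in groupby(events, key=lambda e: int(e[0] // window_seconds)):
        total += sum(value for _, value in batch)
        yield total

continuous_results = list(continuous(events))                        # one result per record
micro_batch_results = list(micro_batch(events, window_seconds=1.0))  # one result per window
```

Both versions end with the same final sum, but the continuous version emits a result for each of the five records, while the micro-batch version emits only three results, one per 1-second window.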
    
### **Type of processing**

**At-most-once**
- Every input element of a stream is processed once or less
- It is also called no guarantee
- The result **can be wrong/approximated** (elements can be lost)

**At-least-once**
- Every input element of a stream is processed once or more
- Input elements are replayed when there are failures
- The result **can be wrong/approximated** (replayed elements can be processed more than once)

**Exactly-once**
- Every input element of a stream is processed exactly once
- Input elements are replayed when there are failures
- If elements have been already processed they are not reprocessed
- The result is **always correct**
- Slower than the other processing approaches
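
The effect of the three guarantees on a result can be sketched with a small simulation. This is a toy model under stated assumptions: records carry a hypothetical unique id, a single failure hits the record with id `failing_id`, and exactly-once is obtained by replaying the record and skipping ids that were already processed. Real frameworks implement these guarantees differently.

```python
def deliveries(rec_id, semantics, failing_id):
    """Return how many times a record reaches the processing step."""
    if rec_id != failing_id:
        return 1
    # On failure: at-most-once loses the record, the others replay it.
    return 0 if semantics == "at-most-once" else 2

def run(records, semantics, failing_id):
    """Sum the record values under the given processing guarantee."""
    total = 0
    processed_ids = set()  # used only by exactly-once to detect replays
    for rec_id, value in records:
        for _ in range(deliveries(rec_id, semantics, failing_id)):
            if semantics == "exactly-once":
                if rec_id in processed_ids:
                    continue  # already processed: do not reprocess
                processed_ids.add(rec_id)
            total += value
    return total

# Hypothetical stream of (id, value) records; the correct sum is 60.
records = [(1, 10), (2, 20), (3, 30)]
results = {
    s: run(records, s, failing_id=2)
    for s in ("at-most-once", "at-least-once", "exactly-once")
}
```

In this run, at-most-once undercounts (40), at-least-once overcounts (80), and exactly-once yields the correct sum (60) at the cost of tracking which ids were already processed.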


# ***Spark Streaming***

Spark Streaming is a framework for large scale stream processing:
- Scales to 100s of nodes
- Can achieve second-scale latencies
- Provides a simple batch-like API for implementing complex algorithms
- Micro-batch streaming processing
- Exactly-once guarantees
- Can absorb live data streams from Kafka, Flume, ZeroMQ, Twitter, ...

Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs. It splits each input stream into “portions” and processes one portion at a time (in the incoming order). Each **portion** is called a **batch**.

Spark Streaming:
- Splits the live stream into batches of X seconds
- Treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches

## **Key Concepts**

**DStream**
- Sequence of RDDs representing a discretized version of the input stream of data
    - Input sources include Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets, ...
- One RDD for each batch of the input stream


**Transformations**
- Modify data from one DStream to another
- “Standard” RDD operations
    - map, countByValue, reduce, join, ...
- Window and stateful operations
    - window, countByValueAndWindow, ...
    

**Output Operations/Actions**
- Send data to an external entity
    - saveAsHadoopFiles, saveAsTextFiles, ...
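
The three concepts can be illustrated with a toy model in plain Python. This is not the Spark API: it assumes a DStream can be pictured as a list with one inner list per batch (where Spark would hold one RDD), and the helper names `flat_map`, `count_by_value` and `window` are hypothetical functions that only mimic the behavior of the corresponding operations.

```python
from collections import Counter

# Toy stand-in for a DStream: one list of records per batch interval.
# (In Spark each inner list would be an RDD; this is plain Python.)
batches = [
    ["spark storm", "spark"],
    ["kafka spark"],
    ["storm", "kafka kafka"],
]

def flat_map(dstream, fn):
    """Transformation: apply fn to every record of every batch and flatten."""
    return [[out for record in batch for out in fn(record)] for batch in dstream]

def count_by_value(dstream):
    """Transformation: count occurrences of each value, independently per batch."""
    return [Counter(batch) for batch in dstream]

def window(dstream, length):
    """Window operation: each output batch merges the last `length` input batches."""
    return [
        [record for batch in dstream[max(0, i - length + 1): i + 1] for record in batch]
        for i in range(len(dstream))
    ]

words = flat_map(batches, str.split)
per_batch_counts = count_by_value(words)
windowed_counts = count_by_value(window(words, length=2))

# Output operation: here the results are only printed; a real job would
# typically write them to external storage instead.
for i, counts in enumerate(windowed_counts):
    print(i, dict(counts))
```

Each transformation takes one sequence of batches and returns another, which mirrors how a transformation maps one DStream to a new one. The windowed version counts words over the current batch plus the previous one rather than over a single batch.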