# ***Spark Streaming Program***

**Define** a Spark Streaming Context object
- Define the size of the batches (in seconds) associated with the Streaming context
- Specify the input stream and define a DStream based on it
- Specify the operations to execute for each batch of data
- Use transformations and actions similar to the ones available for “standard” RDDs

**Invoke** the start method
- To start processing the input stream
- Wait until the application is killed or the timeout specified in the application expires
- If the timeout is not set and the application is not killed, the application runs forever

### **Initialize Spark Streaming Context**
The Spark Streaming Context is defined by using the **StreamingContext(sparkContext, batchDuration)** constructor of the class pyspark.streaming.StreamingContext.
The batchDuration parameter specifies the “size” of the batches in seconds.
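
As a minimal sketch of this step, assuming a working PySpark installation: the function name, the application name `StreamingDemo`, and the 10-second batch size are illustrative choices, not values prescribed by these notes.

```python
def create_streaming_context(app_name="StreamingDemo", batch_seconds=10):
    """Build a StreamingContext whose batches span `batch_seconds` seconds."""
    # Imported inside the function so this module can be loaded
    # even on a machine where PySpark is not installed.
    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    # In PySpark the second argument is the batch interval in seconds.
    return StreamingContext(sc, batch_seconds)
```

Calling `ssc = create_streaming_context()` would then give the context object used in the following steps.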

### **Input from TCP socket**
A DStream can be associated with the content emitted by a TCP socket. 
**socketTextStream(hostname, port)** is used to create a DStream based on the textual content emitted by a TCP socket.
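
A small sketch of how the socket input might be wired up. Here `ssc` is assumed to be an existing StreamingContext, and the helper name, `localhost`, and port `9999` are placeholders.

```python
def open_socket_stream(ssc, host="localhost", port=9999):
    """Return a DStream of text lines read from the TCP socket host:port."""
    # `ssc` is an already-created StreamingContext.
    # Each line of text received on the socket becomes one DStream element.
    return ssc.socketTextStream(host, port)
```

For local experiments, a tool such as `nc -lk 9999` can play the role of the server that emits the text.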

### **Input from HDFS folder**
A DStream can be associated with the content of an input (HDFS) folder.
- Every time a new file is inserted in the folder, the content of the file is “stored” in the associated DStream and processed
- Note that updating the content of an existing file does not change the content of the DStream
- **textFileStream(folder)** is used to create a DStream based on the content of the input folder

**N.B.:** data already in the folder before you start the application will not be analyzed!
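
A comparable sketch for the folder-based input. The helper name and the path `hdfs:///tmp/streaming-input` are hypothetical, and `ssc` is assumed to be an existing StreamingContext.

```python
def open_folder_stream(ssc, folder="hdfs:///tmp/streaming-input"):
    """Return a DStream whose batches hold the lines of newly added files."""
    # Only files that appear in `folder` after the application has started
    # are read; the folder path used here is a made-up example.
    return ssc.textFileStream(folder)
```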

### **Transformations**
Like standard RDDs, DStreams are characterized by a set of transformations. When applied to a DStream object, a transformation returns a new DStream object. The transformation is applied to one batch (RDD) of the input DStream at a time and returns one batch (RDD) of the new DStream, i.e., each batch (RDD) of the input DStream is associated with exactly one batch (RDD) of the returned DStream.

Many of the available transformations are the same as those available for standard RDDs.
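
The one-batch-in, one-batch-out behaviour can be illustrated with a plain-Python model. This is not the Spark API: a list of lists stands in for the sequence of batches (RDDs), and the function names are invented for the illustration.

```python
def transform_stream(batches, func):
    """Model of a map-like transformation: apply `func` batch by batch."""
    # Each inner list stands in for one batch (RDD) of the DStream.
    return [[func(x) for x in batch] for batch in batches]


def count_stream(batches):
    """Model of count(): one single-element batch per input batch."""
    return [[len(batch)] for batch in batches]


input_batches = [[1, 2, 3], [4], [10, 20]]
doubled = transform_stream(input_batches, lambda x: x * 2)
counts = count_stream(input_batches)
# Every input batch yields exactly one output batch.
```

In this model `doubled` and `counts` both contain three batches, one for each batch of `input_batches`.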

### **Basic transformations:**
- **map(func)**
    - Returns a new DStream by passing each element of the source DStream through a function func
- **flatMap(func)**
    - Returns a new DStream in which each input item of the source DStream is mapped by func to 0 or more output items
- **filter(func)**
    - Returns a new DStream by selecting only the records of the source DStream on which func returns true
- **reduce(func)**
    - Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func
         - The function must be associative and commutative so that it can be computed in parallel
    - Note that the reduce method of DStreams is a transformation, not an action as it is for RDDs
- **reduceByKey(func)**
    - When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function
- **combineByKey( createCombiner, mergeValue, mergeCombiners)**
    - When called on a DStream of (K, V) pairs, returns a new DStream of (K, W) pairs where the values for each key are aggregated using the given combine functions
- **countByValue()**
    - When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs where the value of each key is its frequency in each batch of the source DStream
    - Note that the countByValue method of DStreams is a transformation
- **count()**
    - Returns a new DStream of single-element RDDs by counting the number of elements in each batch (RDD) of the source DStream
        - i.e., it counts the number of elements in each input batch (RDD)
    - Note that the count method of DStreams is a transformation
- **union(otherStream)**
    - Returns a new DStream that contains the union of the elements in the source DStream and otherStream
- **join(otherStream)**
    - When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs with all pairs of elements for each key
- **cogroup(otherStream)**
    - When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples
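
Several of the transformations above can be chained into a per-batch word count. The sketch below assumes `lines` is a DStream of text lines (for example one obtained from a socket); the helper names are illustrative.

```python
from operator import add


def split_words(line):
    """flatMap function: one line becomes zero or more lowercase words."""
    return line.lower().split()


def is_word(token):
    """filter function: keep purely alphabetic tokens."""
    return token.isalpha()


def to_pair(word):
    """map function: emit a (word, 1) pair."""
    return (word, 1)


def word_counts(lines):
    """Return a DStream of (word, count) pairs computed for each batch."""
    return (
        lines.flatMap(split_words)
        .filter(is_word)
        .map(to_pair)
        .reduceByKey(add)  # sums the 1s per word within each batch
    )
```

Since every step is a transformation, nothing is computed until the context is started and an output operation has been attached to the result.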

### **Basic Actions**
- **pprint()**
    - Prints the first 10 elements of every batch of data in a DStream on the standard output of the driver node running the streaming application
      - Useful for development and debugging
- **saveAsTextFiles(prefix, [suffix])**
    - Saves the content of the DStream on which it is invoked as text files
      - One folder for each batch
      - The folder name at each batch interval is generated based on prefix, time of the batch (and suffix): "prefix-TIME_IN_MS[.suffix]"
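
A short sketch of attaching both output operations to a DStream. The helper name and the prefix `output/counts` are placeholders chosen for the example.

```python
def attach_outputs(stream, prefix="output/counts"):
    """Print a sample of every batch and also save each batch as text."""
    # Prints the first 10 elements of each batch on the driver.
    stream.pprint()
    # Writes one folder per batch, named "<prefix>-TIME_IN_MS.txt".
    stream.saveAsTextFiles(prefix, "txt")
    return stream
```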
      

### **Start and Run**
- The **streamingContext.start()** method is used to start the application on the input stream(s)
- The **awaitTerminationOrTimeout(timeout)** method is used to specify how long the application will run (in PySpark the timeout is expressed in seconds, while the Java/Scala API takes milliseconds)
- The **awaitTermination()** method is used to run the application forever
    - i.e., until the application is explicitly killed
- The processing can be manually stopped using **streamingContext.stop()**

**Points to remember:**
- Once a context has been started, no new streaming computations can be set up or added to it
- Once a context has been stopped, it cannot be restarted
- Only one StreamingContext per application can be active at the same time
- stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to False.
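
The start, wait, and stop steps can be put together as in the following sketch. It assumes `ssc` is a StreamingContext on which the whole computation has already been defined, and that the timeout follows the PySpark convention of seconds; the helper name and the 60-second default are illustrative.

```python
def run_for(ssc, timeout_seconds=60):
    """Start the streaming application, run it for a while, then stop it."""
    # All DStream operations must be defined before this call.
    ssc.start()
    # Blocks until the timeout expires or the application is stopped.
    ssc.awaitTerminationOrTimeout(timeout_seconds)
    # Stop only the StreamingContext and keep the SparkContext alive.
    ssc.stop(stopSparkContext=False)
```

After `run_for` returns, the same StreamingContext cannot be started again; a new one would have to be created on the surviving SparkContext.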