## Spark Streaming
### Analyzing data as it arrives in real time

## Use cases:
- Credit card Fraud detection
- Spam filtering
- Network intrusion detection
- Real-time social media analytics
- Stock market analysis

## Data sources Spark supports:
- Flat files (as they are created)
- TCP/IP
- Apache Flume
- Apache Kafka
- Amazon Kinesis
- Social media (Instagram, Twitter, Facebook)

## A StreamingContext is created from the SparkContext to enable streaming. It then creates a DStream (Discretized Stream) on which processing occurs
- A DStream is broken into micro-batches
- Data is received, accumulated as a micro-batch, and processed as a micro-batch
- Each micro-batch is an RDD
- Regular transformations and actions can be applied to these RDDs
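
The micro-batch idea can be sketched in plain Python, without Spark. This is only an illustrative model, not how Spark implements DStreams: the `discretize` function and the `(timestamp, value)` event format are assumptions made for the example.

```python
from collections import defaultdict

def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Batch k holds events with k*batch_interval <= timestamp < (k+1)*batch_interval.
    Intervals with no events still produce an empty batch, similar to the
    empty RDDs a DStream emits when no data arrives.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // batch_interval)].append(value)
    last = max(buckets)
    # .get() avoids creating new keys for the empty intervals
    return [buckets.get(k, []) for k in range(last + 1)]

# Hypothetical events: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1)
print(batches)  # [['a', 'b'], ['c'], [], ['d']]
```

Each inner list plays the role of one RDD: it can be transformed and counted on its own, independently of the other batches.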

In [1]:
import findspark
import os
import configparser
findspark.init()
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from functools import reduce
from pyspark.sql import DataFrame
import pymongo

In [2]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

In [3]:
# Configure and create the SparkContext that the StreamingContext will be built on.
# "local[9]" runs Spark locally with 9 worker threads.

from pyspark import SparkConf, SparkContext
conf = (SparkConf()
         .setMaster("local[9]")
         .setAppName("v2maestros")
         .set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)


In [4]:
# Create a StreamingContext on top of the SparkContext with a batch interval of 1 second.
ssc = StreamingContext(sc, 1)

In [7]:
sc

In [5]:
totalLines = 0

In [6]:
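# Read lines of text from a TCP socket. A server must be listening on localhost:9000
# (for example, one started with `nc -lk 9000`); otherwise every batch is empty.
lines = ssc.socketTextStream("localhost", 9000)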

In [7]:
words = lines.flatMap(lambda line: line.split(" "))

In [8]:
pairs = words.map(lambda word: (word,1))


In [9]:
wordCounts = pairs.reduceByKey(lambda x, y : x + y)


In [10]:
wordCounts.pprint(5)
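
To make the `flatMap` / `map` / `reduceByKey` chain above concrete, here is a plain-Python sketch of what it computes for a single micro-batch. The function name `word_count_batch` and the sample lines are assumptions for illustration; Spark itself runs these steps on RDDs, not on lists.

```python
def word_count_batch(lines):
    """Mimic flatMap -> map -> reduceByKey on one micro-batch of text lines."""
    # flatMap: split every line and flatten the results into one list of words
    words = [word for line in lines for word in line.split(" ")]
    # map: pair each word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts of identical words
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

batch = ["spark streaming", "spark rdd"]  # hypothetical input lines
print(word_count_batch(batch))  # {'spark': 2, 'streaming': 1, 'rdd': 1}
```

The counts are computed per micro-batch: each batch starts from zero, so the result is not a running total across the whole stream.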


In [11]:
totalLines = 0
linesCount = 0
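def computeMetrics(rdd):
    # Called once per micro-batch with that batch's RDD.
    global totalLines
    global linesCount
    # Number of lines received in this batch
    linesCount = rdd.count()
    # Running total across all batches seen so far
    totalLines += linesCount
    # collect() brings the whole batch to the driver; fine for small demos only
    print(rdd.collect())
    print("lines in RDD:" , linesCount, " TOtla:", totalLines)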

In [12]:
lines.foreachRDD(computeMetrics)

In [13]:
def windowMetrics(rdd):
    print("window RDD size:", rdd.count())

In [14]:
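# Window length of 4 seconds, sliding every 2 seconds:
# every 2 seconds, emit one RDD containing the last 4 seconds of data.
windowedRDD = lines.window(4,2)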

In [15]:
windowedRDD.foreachRDD(windowMetrics)
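
The effect of `lines.window(4,2)` with a 1-second batch interval can be modelled in plain Python. This is a simplified sketch under assumptions: `sliding_windows` is a hypothetical helper, batches are plain lists, and the window length and slide interval are taken to be whole multiples of the batch interval.

```python
def sliding_windows(batches, window_len, slide, batch_interval=1):
    """Merge micro-batches into sliding windows.

    Every `slide` seconds, emit one window holding the batches from the
    last `window_len` seconds (fewer at the start of the stream).
    """
    per_window = window_len // batch_interval  # batches covered by one window
    step = slide // batch_interval             # batches between two windows
    windows = []
    for end in range(step, len(batches) + 1, step):
        start = max(0, end - per_window)
        merged = [item for batch in batches[start:end] for item in batch]
        windows.append(merged)
    return windows

# Six hypothetical 1-second batches of lines
batches = [["a"], ["b", "c"], [], ["d"], ["e"], ["f"]]
for window in sliding_windows(batches, window_len=4, slide=2):
    print("window RDD size:", len(window))
# window RDD size: 3
# window RDD size: 4
# window RDD size: 3
```

Because the slide interval is twice the batch interval, a window is produced only on every second batch, which is why the "window RDD size" line in the output below appears every other second.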

In [16]:
ssc.start()

-------------------------------------------
Time: 2020-08-12 18:37:04
-------------------------------------------

[]
lines in RDD: 0  TOtla: 0
-------------------------------------------
Time: 2020-08-12 18:37:05
-------------------------------------------

[]
lines in RDD: 0  TOtla: 0
window RDD size: 0
-------------------------------------------
Time: 2020-08-12 18:37:06
-------------------------------------------

[]
lines in RDD: 0  TOtla: 0
-------------------------------------------
Time: 2020-08-12 18:37:07
-------------------------------------------

[]
lines in RDD: 0  TOtla: 0
window RDD size: 0
-------------------------------------------
Time: 2020-08-12 18:37:08
-------------------------------------------

[]
lines in RDD: 0  TOtla: 0
-------------------------------------------
Time: 2020-08-12 18:37:09
-------------------------------------------

[]
lines in RDD: 0  TOtla: 0
window RDD size: 0
-------------------------------------------
Time: 2020-08-12 18:37:10
---------

In [17]:
# Stop streaming. By default this also stops the underlying SparkContext.
ssc.stop()

## Reference:
- Udemy
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html