# **Data Streaming using PySpark [CN7030]**

**`Dr Amin Karami, UEL Docklands Campus, March 2022`**

`E: a.karami@uel.ac.uk`

`W: www.aminkarami.com`

---

**Checkpointing**: A streaming application must run 24/7, so it has to be resilient to failures unrelated to the application logic (e.g., system failures or JVM crashes). To make this possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system that it can recover from such failures.
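
The recover-or-create idea behind checkpointing can be illustrated outside Spark. The following is a minimal plain-Python sketch, not Spark's implementation: `get_or_create`, `save_checkpoint`, and the JSON state file are hypothetical names chosen for illustration, and the "restart" is simulated by simply reading the file back.

```python
import json
import os
import tempfile


def get_or_create(path, create):
    """Return the state saved at `path` if it exists, else build fresh state with `create()`."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return create()


def save_checkpoint(path, state):
    """Persist `state`; write to a temp file first so a crash mid-write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename over the old checkpoint


# Hypothetical checkpoint location inside a throwaway directory
workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "state.json")

# First run: no checkpoint yet, so the factory function is used
state = get_or_create(ckpt, lambda: {"count": 0})
state["count"] += 3
save_checkpoint(ckpt, state)

# Simulated restart: the saved state is recovered and the factory is not called
recovered = get_or_create(ckpt, lambda: {"count": 0})
print(recovered)
```

The factory function is only invoked when nothing has been saved yet, which mirrors how a setup function is passed to a get-or-create call rather than being executed directly.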

In [7]:
# Load Spark engine
import findspark
findspark.init()

In [8]:
import os
import sys
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession

In [9]:
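## Checkpoint directory
checkpointDir = 'checkpoint/count'


def CreateContext():
    # Called only when no checkpoint exists in checkpointDir
    ssc = StreamingContext(SparkContext(), 1)  # 1-second batch interval
    # Read integers from a local socket, reduce each modulo 10, pair it with 1,
    # then count occurrences over a 10 s window that slides every 5 s.
    # Because the pairing happens before counting, the counted values are (digit, 1) tuples.
    lines = ssc.socketTextStream("localhost", 7000) \
        .map(lambda x: int(x) % 10) \
        .map(lambda x: (x, 1)) \
        .countByValueAndWindow(10, 5)
    lines.pprint()
    lines.count().pprint()  # number of distinct values in each window

    # Set up checkpointing before the context is started
    ssc.checkpoint(checkpointDir)
    return ssc

## Recover the context from the checkpoint if one exists; otherwise build it with CreateContext
ssc = StreamingContext.getOrCreate(checkpointDir, CreateContext)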

22/03/28 14:27:12 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [10]:
ssc.start()

22/03/28 14:27:15 WARN SocketInputDStream: isTimeValid called with 1648474001000 ms whereas the last valid time is 1648474005000 ms
                                                                                

-------------------------------------------
Time: 2022-03-28 14:26:50
-------------------------------------------

-------------------------------------------
Time: 2022-03-28 14:26:50
-------------------------------------------
0

-------------------------------------------
Time: 2022-03-28 14:26:55
-------------------------------------------

-------------------------------------------
Time: 2022-03-28 14:26:55
-------------------------------------------
0

-------------------------------------------
Time: 2022-03-28 14:27:00
-------------------------------------------

-------------------------------------------
Time: 2022-03-28 14:27:00
-------------------------------------------
0

-------------------------------------------
Time: 2022-03-28 14:27:05
-------------------------------------------

-------------------------------------------
Time: 2022-03-28 14:27:05
-------------------------------------------
0

-------------------------------------------
Time: 2022-03-28 14:27:10
--

                                                                                

-------------------------------------------
Time: 2022-03-28 14:27:25
-------------------------------------------

-------------------------------------------
Time: 2022-03-28 14:27:25
-------------------------------------------
0



22/03/28 14:27:26 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/03/28 14:27:26 WARN BlockManager: Block input-0-1648474045800 replicated to only 0 peer(s) instead of 1 peers
                                                                                

-------------------------------------------
Time: 2022-03-28 14:27:30
-------------------------------------------
((0, 1), 1)

-------------------------------------------
Time: 2022-03-28 14:27:30
-------------------------------------------
1



                                                                                

-------------------------------------------
Time: 2022-03-28 14:27:35
-------------------------------------------
((0, 1), 1)

-------------------------------------------
Time: 2022-03-28 14:27:35
-------------------------------------------
1



                                                                                

-------------------------------------------
Time: 2022-03-28 14:27:40
-------------------------------------------

-------------------------------------------
Time: 2022-03-28 14:27:40
-------------------------------------------
0



                                                                                

-------------------------------------------
Time: 2022-03-28 14:27:45
-------------------------------------------

-------------------------------------------
Time: 2022-03-28 14:27:45
-------------------------------------------
0



                                                                                

-------------------------------------------
Time: 2022-03-28 14:27:50
-------------------------------------------

-------------------------------------------
Time: 2022-03-28 14:27:50
-------------------------------------------
0



                                                                                

In [11]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

22/03/28 14:28:06 WARN JobGenerator: Timed out while stopping the job generator (timeout = 10000)
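
In the output of the `ssc.start()` cell, the pair `((0, 1), 1)` is printed at 14:27:30 and 14:27:35 and is gone by 14:27:40. That pattern is what a 10-second window sliding every 5 seconds would produce for a single record. The plain-Python sketch below simulates it without Spark; `count_by_value_and_window` is a hypothetical helper, and the arrival time (second 26 of the minute) is an assumption read off the log timestamps, not a measured value.

```python
from collections import Counter


def count_by_value_and_window(events, window, slide, start, end):
    """Count values per sliding window.

    events: list of (timestamp_seconds, value) pairs.
    Returns {window_end: Counter of the values whose timestamp lies in (window_end - window, window_end]}.
    """
    results = {}
    t = start
    while t <= end:
        in_window = [value for ts, value in events if t - window < ts <= t]
        results[t] = Counter(in_window)
        t += slide
    return results


# One record, already mapped to (0, 1), assumed to arrive at second 26
events = [(26, (0, 1))]

windows = count_by_value_and_window(events, window=10, slide=5, start=25, end=40)
for end_time, counts in windows.items():
    print(end_time, list(counts.items()))
```

Under these assumptions the record falls inside the windows ending at seconds 30 and 35 only, so it is counted twice and then drops out, matching the two non-empty batches in the notebook output.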
