# **Data Streaming using PySpark**



---

**Discretized Stream (DStream)**: the basic abstraction in Spark Streaming, representing a continuous stream of data. Internally, a DStream is represented as a continuous series of RDDs, where an RDD is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains the data received during one batch interval.
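
As a rough illustration of the "discretized" idea, the plain-Python sketch below groups timestamped records into fixed-length intervals, one list per interval. It does not use Spark: the `discretize` helper and the sample records are hypothetical, and a real DStream holds one RDD per interval rather than a Python list.

```python
from collections import defaultdict


def discretize(records, batch_duration):
    """Group (timestamp, value) records into consecutive batch intervals.

    Hypothetical helper for illustration only; Spark does this internally.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        # Integer division maps a timestamp to the index of its interval
        batches[int(timestamp // batch_duration)].append(value)
    # Return the non-empty batches ordered by interval index
    return [batches[i] for i in sorted(batches)]


# Made-up records: (arrival time in seconds, value)
records = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]
for index, batch in enumerate(discretize(records, batch_duration=1)):
    print(index, batch)
```

With a batch duration of 1 second, the five records fall into three batches: two records in the first second, one in the second, and two in the third.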


In [1]:
# Load Spark engine
#!pip3 install findspark
import findspark
findspark.init()

In [1]:
import os
import random
import sys
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

os.environ['PYSPARK_DRIVER_PYTHON_OPTS'] = "notebook"
# Run the driver and the workers with the same Python interpreter as this notebook
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
os.environ['PYSPARK_PYTHON'] = sys.executable

In [2]:
#### Syntax:
# StreamingContext(sparkContext, batchDuration)
#   sparkContext  - an existing SparkContext
#   batchDuration - the time interval, in seconds, at which streaming data is divided into batches

sc = SparkContext(appName="PythonStreamingQueue")
ssc = StreamingContext(sc, 1)

In [3]:
# Build a queue of 4 RDDs, each holding 50 random integers
rddQueue = []
for i in range(4):
    # random.sample(range(1, 1000), 50) draws 50 distinct values from 1..999
    # (range excludes 1000); the second argument to parallelize splits the
    # data into 3 partitions
    rddQueue.append(sc.parallelize(random.sample(range(1, 1000), 50), 3))
    # rddQueue[0] .. rddQueue[3] each hold 50 distinct random values between 1 and 999
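
The sampling step can be checked on its own, without Spark. The sketch below fixes a seed only to make the run repeatable (the notebook itself does not set one); it shows that `random.sample` draws without replacement and that `range(1, 1000)` stops at 999.

```python
import random

random.seed(0)  # assumption: fixed seed, used here only for a repeatable illustration
values = random.sample(range(1, 1000), 50)

print(len(values))                            # number of values drawn
print(len(set(values)))                       # same number: sampling is without replacement
print(min(values) >= 1, max(values) <= 999)   # range(1, 1000) excludes 1000
```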

In [4]:
# Inspect the random numbers in the first RDD of the queue:
print(rddQueue[0].collect())

[514, 641, 578, 570, 792, 978, 234, 472, 145, 438, 777, 187, 949, 285, 268, 399, 948, 807, 598, 983, 872, 752, 601, 661, 259, 908, 15, 941, 571, 731, 970, 554, 247, 371, 567, 914, 850, 962, 22, 229, 334, 562, 50, 175, 769, 98, 437, 518, 832, 751]


In [5]:
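# Turn the queue of RDDs into a DStream (by default, one RDD is consumed per batch interval)
rdd_Stream = ssc.queueStream(rddQueue)
# Key each number by its last digit
mapped_rdd_Stream = rdd_Stream.map(lambda x: (x % 10, 1))
# Count how many numbers in each batch share the same last digit
reduced_rdd_Stream = mapped_rdd_Stream.reduceByKey(lambda x, y: x + y)
# Print the first elements of every batch (pprint shows 10 by default)
reduced_rdd_Stream.pprint()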
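
To see what this pipeline computes for a single batch, the same last-digit count can be written in plain Python. The sketch below is not Spark code: it uses the 50 values printed for `rddQueue[0]` above and `collections.Counter` as a stand-in for `map` followed by `reduceByKey`.

```python
from collections import Counter

# Values printed for rddQueue[0] in the notebook output above
first_batch = [
    514, 641, 578, 570, 792, 978, 234, 472, 145, 438,
    777, 187, 949, 285, 268, 399, 948, 807, 598, 983,
    872, 752, 601, 661, 259, 908, 15, 941, 571, 731,
    970, 554, 247, 371, 567, 914, 850, 962, 22, 229,
    334, 562, 50, 175, 769, 98, 437, 518, 832, 751,
]

# map: number -> last digit; reduceByKey: count occurrences per digit
digit_counts = Counter(x % 10 for x in first_batch)

for digit, count in sorted(digit_counts.items()):
    print((digit, count))
```

No value in this batch ends in 6, so the key 6 does not appear: a key is only emitted when at least one element maps to it. The counts should agree with the first batch printed once the stream is started, although Spark prints the keys in no particular order.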

In [6]:
ssc.start()
time.sleep(5)  # let the stream run for 5 seconds
ssc.stop(stopSparkContext=True, stopGraceFully=True)  # stop Spark Streaming and the SparkContext, finishing received data first

-------------------------------------------
Time: 2022-03-22 10:54:05
-------------------------------------------
(8, 9)
(0, 4)
(1, 8)
(9, 5)
(2, 8)
(3, 1)
(4, 5)
(5, 4)
(7, 6)

-------------------------------------------
Time: 2022-03-22 10:54:06
-------------------------------------------
(0, 5)
(8, 2)
(1, 4)
(9, 6)
(2, 9)
(3, 5)
(4, 9)
(5, 2)
(6, 4)
(7, 4)

-------------------------------------------
Time: 2022-03-22 10:54:07
-------------------------------------------
(0, 7)
(8, 2)
(9, 7)
(1, 3)
(2, 8)
(3, 2)
(4, 10)
(5, 4)
(6, 4)
(7, 3)

-------------------------------------------
Time: 2022-03-22 10:54:08
-------------------------------------------
(0, 3)
(8, 3)
(9, 7)
(1, 4)
(2, 4)
(3, 8)
(4, 9)
(5, 5)
(6, 3)
(7, 4)

-------------------------------------------
Time: 2022-03-22 10:54:09
-------------------------------------------

-------------------------------------------
Time: 2022-03-22 10:54:10
-------------------------------------------
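
The last two batches are empty because `queueStream` takes one RDD from the queue per batch interval by default (`oneAtATime=True`), and the queue held only four RDDs. The plain-Python sketch below imitates that behaviour with a `deque`. The `run_batches` helper and the small batches are hypothetical, and the sketch models only the consumption order, not Spark's scheduling.

```python
from collections import Counter, deque


def run_batches(queue, num_intervals):
    """Pop one batch per interval and count last digits, like the notebook's job.

    Hypothetical model: once the queue is empty, an interval yields no counts.
    """
    results = []
    for _ in range(num_intervals):
        batch = queue.popleft() if queue else []
        results.append(dict(Counter(x % 10 for x in batch)))
    return results


# Four small made-up batches standing in for the four RDDs
queue = deque([[11, 21, 32], [40, 50], [7], [18, 28, 39]])
for interval, counts in enumerate(run_batches(queue, num_intervals=6)):
    print(interval, counts)
```

The first four intervals each produce counts for one batch; intervals 4 and 5 produce empty dictionaries, mirroring the empty batch headers above.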

------------------------------------