# **Data Streaming using PySpark**

---

**DataFrame and SQL Operations**: To run DataFrame and SQL operations on streaming data, you have to create a SparkSession using the SparkContext that the StreamingContext is using. Furthermore, this has to be done in such a way that the session can be re-created after a driver failure, which is achieved by creating a lazily instantiated singleton instance of SparkSession. In the example below, each RDD of the stream is converted to a DataFrame, registered as a temporary view, and then queried using SQL.


In [1]:
# Load Spark engine
import findspark
findspark.init()

In [2]:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import Row, SparkSession

In [3]:
# Lazily instantiated global instance of SparkSession
def getSparkSessionInstance(sparkConf):
    # Build the session on first use only; later calls reuse the cached one
    if 'sparkSessionSingletonInstance' not in globals():
        globals()['sparkSessionSingletonInstance'] = (
            SparkSession.builder.config(conf=sparkConf).getOrCreate()
        )
    return globals()['sparkSessionSingletonInstance']
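
The caching behaviour of `getSparkSessionInstance` can be observed without Spark. The following is a minimal sketch, assuming a hypothetical helper `get_singleton` and a stand-in factory `make_session` (neither is a PySpark API). It only illustrates that the factory runs once and that later calls return the cached object.

```python
# Hypothetical stand-ins: get_singleton and make_session are illustrative
# names, not PySpark APIs. They mirror the globals()-based caching pattern.
calls = []

def make_session():
    """Pretend to build an expensive session object and record the call."""
    calls.append("built")
    return object()

def get_singleton(name, factory):
    """Create the object on first use, then reuse it from globals()."""
    if name not in globals():
        globals()[name] = factory()
    return globals()[name]

first = get_singleton("demoSessionSingleton", make_session)
second = get_singleton("demoSessionSingleton", make_session)

# Both names refer to the same object, and the factory ran only once
print(first is second, len(calls))
```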

In [4]:
sc = SparkContext(appName="PythonSqlNetworkWordCount")
# Batch interval of 5 seconds: each batch becomes one RDD of the DStream
ssc = StreamingContext(sc, 5)

22/03/28 14:18:06 WARN Utils: Your hostname, Predator-G3572 resolves to a loopback address: 127.0.1.1; using 172.29.43.74 instead (on interface eth0)
22/03/28 14:18:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/28 14:18:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/03/28 14:18:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [5]:
# Before starting the stream, open a terminal and run: nc -lk 7000
lines = ssc.socketTextStream('localhost', 7000)
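
If `nc` is not available, a few lines of Python can stand in for `nc -lk 7000`. This is a minimal sketch, not part of the original notebook: `serve_lines` and its `on_listening` hook are hypothetical names, and `socket.create_server` assumes Python 3.8 or newer. It accepts a single client (such as the socket receiver) and sends it newline-terminated UTF-8 text.

```python
import socket

def serve_lines(lines, host="localhost", port=7000, on_listening=None):
    """Accept one client and send it each line, newline-terminated.

    on_listening is a hypothetical hook, called with the bound port once the
    server is ready; it is useful when port=0 lets the OS pick a free port.
    """
    with socket.create_server((host, port)) as server:
        if on_listening is not None:
            on_listening(server.getsockname()[1])
        conn, _addr = server.accept()
        with conn:
            for line in lines:
                conn.sendall((line + "\n").encode("utf-8"))
```

Calling `serve_lines(["this and this", "count"])` from a separate Python process before `ssc.start()` would play the role of the `nc` listener. Unlike `nc -lk`, this sketch returns, and so closes the connection, once the last line has been sent.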

In [6]:
words = lines.flatMap(lambda line: line.split(" "))


# Convert each RDD of the words DStream to a DataFrame and run a SQL query on it
def process(time, rdd):
    print("========= %s =========" % str(time))

    try:
        # Get the singleton instance of SparkSession
        spark = getSparkSessionInstance(rdd.context.getConf())

        # Convert RDD[String] to RDD[Row] to DataFrame
        rowRdd = rdd.map(lambda w: Row(word=w))
        wordsDataFrame = spark.createDataFrame(rowRdd)

        # Create a temporary view using the DataFrame
        wordsDataFrame.createOrReplaceTempView("words")

        # Do a word count on the view using SQL and print it
        wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
        wordCountsDataFrame.show()
    except Exception as e:
        # An empty batch makes createDataFrame fail; report the error instead of hiding it
        print("Skipping batch: %s" % e)

words.foreachRDD(process)
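
For intuition, the SQL query in `process` computes the same result as a plain word count over one batch. The following is a minimal pure-Python sketch, assuming a hypothetical batch of words (the list is made up for illustration, not captured from the stream):

```python
from collections import Counter

# Hypothetical contents of one 5-second batch after the flatMap step
batch_words = ["this", "count", "this", "and"]

# Equivalent of: select word, count(*) as total from words group by word
totals = Counter(batch_words)

# Sorted only to make the printout stable
for word, total in sorted(totals.items()):
    print(word, total)
```

Unlike this sketch, a SQL `group by` gives no ordering guarantee, so the rows printed by `show()` may appear in any order.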

In [7]:
ssc.start()


22/03/28 14:20:32 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/03/28 14:20:32 WARN BlockManager: Block input-0-1648473631800 replicated to only 0 peer(s) instead of 1 peers
22/03/28 14:20:34 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/03/28 14:20:34 WARN BlockManager: Block input-0-1648473634400 replicated to only 0 peer(s) instead of 1 peers




22/03/28 14:20:37 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/03/28 14:20:37 WARN BlockManager: Block input-0-1648473636800 replicated to only 0 peer(s) instead of 1 peers
22/03/28 14:20:38 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
22/03/28 14:20:38 WARN BlockManager: Block input-0-1648473638400 replicated to only 0 peer(s) instead of 1 peers
                                                                                

+-----+-----+
| word|total|
+-----+-----+
|count|    1|
| this|    2|
|  and|    1|
+-----+-----+




+----+-----+
|word|total|
+----+-----+
| and|    1|
|    |    1|
|this|    1|
+----+-----+
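
The row with an empty `word` in the batch above is most likely a consequence of splitting on a single space: `str.split(" ")` returns empty strings for consecutive spaces and for an empty line. The sketch below shows the behaviour in plain Python. The sample strings are hypothetical, and switching to `line.split()` is only a suggestion, not something the notebook does.

```python
# Hypothetical input lines; the second has two spaces, the third is empty
samples = ["this and", "this  and", ""]

for line in samples:
    by_space = line.split(" ")    # what the notebook's flatMap uses
    by_whitespace = line.split()  # splits on whitespace runs, drops empties
    print(repr(line), by_space, by_whitespace)
```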




In [None]:
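# Stop streaming and the SparkContext; stopGraceFully waits for received data to be processed first
ssc.stop(stopSparkContext=True, stopGraceFully=True)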