# FIT3182 - Big data management and processing

# Activity: Spark Streaming with Python #

**Apache Spark** is a fast and general engine for large-scale data processing. It has been reported that Spark can run up to **100x faster** than Hadoop MapReduce in memory and up to **10x faster** on disk. Spark is designed so that applications can be written quickly in Java, Scala or Python.

**Apache Spark Streaming** makes it easy to build scalable fault-tolerant **streaming** applications. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

In this activity, we will first learn how to **write Spark Streaming programs in Python** using **discretized streams** (**DStreams**), which represent continuous streams of data. Then, we will introduce you to **Spark Structured Streaming**.

Let's get started!

## 1. Overview ##

### What is Apache Spark?
Apache Spark is a fast and general engine for big data processing and a distributed processing framework.

It aims to provide a big data processing framework that can be used for streaming data manipulation (Spark Streaming), machine learning and batch processing (Hadoop integration). Spark introduces an **abstract common data format** used for efficient data sharing across parallel computations: the **RDD (Resilient Distributed Dataset)**.

### What is Apache Spark Streaming?
Spark Streaming provides a high-level abstraction called **discretized stream** or **DStream (a sequence of RDDs)**, which represents a continuous stream of data. **Streaming data** can be brought in from many different live streams or sources (e.g. Twitter, Kafka). Then, the processed data can be manipulated and stored in a big database and/or published to Web pages.

Stream processing is a way of looking at and manipulating data in real time as it arrives, in contrast to batch processing, where data is collected first and processed later. One obvious benefit of stream processing is that it reduces the latency between an event occurring and an action being taken in response to it.

Once real-time input data streams are received, Spark Streaming divides the data into "batches", and then the Spark engine processes them. In this activity, we will learn and practice how we can manipulate input data streams in Python.

<font color="blue">
    
#### What is a batch?
- The data collected during one batch interval (a time in seconds) before it is processed

#### What is an RDD?
- Collection of data distributed across a cluster of machines
- Think of it like a new type of format
    - e.g., xml, json...
- Data is static (immutable); it doesn't change over time

#### What is a DStream?
- Continuous stream of data arriving in real time
- Processed in micro-batches
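
To make the idea of a batch concrete, here is a minimal pure-Python sketch (not Spark code) of how timestamped records could be grouped into micro-batches of a fixed interval. The function name `to_micro_batches` and the sample events are made up for illustration; Spark Streaming performs this step internally.

```python
from collections import defaultdict

def to_micro_batches(events, batch_interval):
    """Group (timestamp_in_seconds, record) pairs into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for timestamp, record in events:
        # Integer division maps every timestamp to the index of its batch.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(record)
    # Return the non-empty batches in time order.
    return [batches[i] for i in sorted(batches)]

# Hypothetical events: (arrival time in seconds, line of text)
events = [
    (1, "This is line 0"),
    (4, "This is line 1"),
    (12, "This is line 2"),
    (27, "This is line 3"),
]
micro_batches = to_micro_batches(events, batch_interval=10)
print(micro_batches)
```

With a 10-second interval, the first two lines land in the same batch, while the other two each form a batch of their own.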

## 2. Create Streaming Context ##

### Our Example
To explain the use of the Spark APIs of Python, we will demonstrate a simple example:  ***"counting the number of words in input data streams"***.

Imagine we are receiving the input text data streams through a TCP socket from a certain data server, and we wish to count the number of words in the data.

### SparkContext and StreamingContext
The Apache Spark community released a powerful Python package, **`pyspark`**. Using **`pyspark`**, we can initialise Spark, load streaming data, create RDDs from the data, and sort, filter and sample the data.

In particular, we will import and use **`StreamingContext`** from **`pyspark.streaming`**, which is the main entry point for Spark Streaming functionality. The **`StreamingContext`** object provides methods used to create DStreams from various input sources.

Spark applications run as independent sets of processes on a cluster, which is specified by the **`SparkContext`** object. **`SparkContext`** can connect to several types of cluster managers (local (standalone), Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (passed to `SparkContext`) to the executors. Finally, **`SparkContext`** sends tasks to the executors to run.

### Python Code
Thus, we need to import these two contexts:

```
from pyspark import SparkContext # spark
from pyspark.streaming import StreamingContext # spark streaming
```

As mentioned, **`SparkContext`** is the main object under which everything else can be used. Then, we need to pass this object with a batch interval (in this example, we use **10 seconds**) into the **`StreamingContext`** object. By doing so, we're ready to create our own stream context via `StreamingContext`:

```
# Create a local StreamingContext with as many working processors as possible and a batch interval of 10 seconds            
batch_interval = 10

# local[*]: run Spark locally with as many working processors as logical cores on your machine.
sc = SparkContext(master="local[*]", appName = "WordCountApp") 

# a batch interval of 10 seconds   
ssc = StreamingContext(sc, batch_interval)
```

For the `master` argument, we use a local server with as many worker threads as there are logical cores (i.e. `local[*]`). If we want Spark to run locally with `k` worker threads, we can specify `local[k]`. Note that a receiver-based source such as a socket occupies one thread, so at least `local[2]` is needed to leave a thread for processing.

The `appName` argument is the name shown on the Spark cluster UI. The batch interval (i.e. `batch_interval`) must be set based on the latency requirements of your application and available cluster resources.


## 3. Create DStream Data

Once a `StreamingContext` (i.e. `ssc`) is defined, we can define a DStream representing the streaming data received from a data server through a TCP socket. The server is specified in the method `ssc.socketTextStream(host, port)`, where `host` is the host name and `port` is the port number. In this example, the host is the local host and the port is 9999.

```
# Create a DStream connecting to hostname:port
host = "localhost"
port = 9999
lines = ssc.socketTextStream(host, port)
```

The variable `lines` represents the stream of data (i.e. DStream) that will be received from the data server. A unit record in this data corresponds to a line of text. 

To count the number of words in each line, we first need a function that splits the line into words. In this example, we use a lambda function:

```
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
```

`flatMap` is a one-to-many DStream operation. It creates a new DStream by generating multiple new records from each record. Thus, each line will be split into multiple words and we create a new DStream which is the stream of words. 
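
The difference between a one-to-one `map` and a one-to-many `flatMap` can be sketched with plain Python lists. This is only an analogy that runs without Spark; the sample lines are made up.

```python
lines = ["This is line 0", "This is line 1"]

# map-like: exactly one output element per input element
# (here, one list of words per line).
mapped = [line.split(" ") for line in lines]

# flatMap-like: each input element may produce several output elements,
# which are flattened into a single sequence of words.
flat_mapped = [word for line in lines for word in line.split(" ")]

print(mapped)
print(flat_mapped)
```

The map-like version keeps two elements (one list per line), whereas the flatMap-like version yields eight individual words.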

Next, we create a DStream of pairs (i.e. the `pairs` DStream consisting of (word, 1) pairs). We can then use the `reduceByKey` transformation on the `pairs` DStream to count the occurrences of each word in each batch. We can implement this as follows:
```
# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the counting result
wordCounts.pprint()
```
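
What `reduceByKey` does within a single batch can be imitated in plain Python. The helper `reduce_by_key` below is a hypothetical stand-in written for this illustration, not part of `pyspark`.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func (like reduceByKey on one batch)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

words = "This is line 0 This is line 1".split(" ")
pairs = [(word, 1) for word in words]
word_counts = reduce_by_key(pairs, lambda x, y: x + y)
print(word_counts)
```

Each word that appears in both lines ends up with a count of 2, and the line numbers with a count of 1.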

## 4. Run Spark Streaming

Note that up to now, we have only set up the computation for our Spark Streaming example; no real processing has started yet. To start processing, we need to run the following code:

```
# Start the computation
ssc.start()             
# Wait for the computation to terminate. 
# We have added a `timeout` to deliberately cancel the execution after one minute. 
# In practice, you would not set this.

try:
    ssc.awaitTermination(timeout=60)  
except KeyboardInterrupt:
    ssc.stop()
    
# If we want to manually stop the streaming context, use the following.
ssc.stop()
```

### Important Note
We need to combine all the above code snippets as shown below. The result is a **Streaming Client** program, which counts the words in the lines sent by the **Streaming Server** application. Before running the **Streaming Client**, we need to run a **Streaming Server** application. Please download the **FIT3182 - TCP_Server.ipynb** file from Moodle and open it in another tab. Run the **FIT3182 - TCP_Server.ipynb** code. Then, run the code below. The words in the lines sent from the TCP Server will be counted and printed in this browser tab every 10 seconds.

```
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Reuse the existing SparkContext if there is one (this avoids the error
# "Cannot run multiple SparkContexts at once"); otherwise create a new one.
# 'local[*]' = use as many worker threads as there are logical cores.
conf = SparkConf().setMaster("local[*]").setAppName("WordCountApp")
sc = SparkContext.getOrCreate(conf)

# Create a StreamingContext with a batch interval of 10 seconds
batch_interval = 10
ssc = StreamingContext(sc, batch_interval)

host = "localhost"
port = 9999

# Create the DStream (continuous stream of data) from the TCP source
lines = ssc.socketTextStream(host, int(port))

# The actual work: "counting the number of words in input data streams".
# flatMap() applies the function to each element and flattens the results,
# turning a stream of text lines into a stream of individual words
# (the delimiter is a space ' ').
words = lines.flatMap(lambda line: line.split(" "))

# map() applies the function to each element; the output has exactly
# one element per input element.
pairs = words.map(lambda word: (word, 1))

# reduceByKey() groups the tuples with the same key and reduces their
# values with the given function; here we sum the counts of each word
# in each batch.
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the result
wordCounts.pprint()

ssc.start()
try:
    # Stop waiting after 20 seconds; without a timeout this call blocks
    # until the context is stopped.
    ssc.awaitTermination(timeout=20)
except KeyboardInterrupt:
    pass
finally:
    # Always stop the StreamingContext (and, by default, the SparkContext):
    # only one StreamingContext can be active in a JVM at a time.
    ssc.stop()
```

If a `StreamingContext` started by an earlier cell is still active, `ssc.start()` fails with the following error. Stop the earlier context with `ssc.stop()` (or restart the kernel) before running the cell again.

```
Py4JJavaError: An error occurred while calling o3880.start.
: java.lang.IllegalStateException: Only one StreamingContext may be started in this JVM. Currently running StreamingContext was started atorg.apache.spark.streaming.api.java.JavaStreamingContext.start(JavaStreamingContext.scala:557)
```


```
-------------------------------------------
Time: 2024-04-30 04:05:40
-------------------------------------------
('4', 2)
('line', 10)
('0', 2)
('1', 3)
('This', 10)
('5', 1)
('3', 2)
('is', 10)

-------------------------------------------
Time: 2024-04-30 04:05:50
-------------------------------------------
('4', 2)
('line', 10)
('0', 1)
('1', 1)
('This', 10)
('5', 1)
('3', 3)
('is', 10)
('2', 2)
```



## 5. Concepts in Spark Streaming ##

Now we will learn some basic concepts in Spark Streaming. 

### Discretized Streams (DStreams)
As mentioned above, **DStream** is the basic abstraction in Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source, or the processed data stream generated by transforming the input stream.

A DStream is represented as a continuous series of **RDDs**, the RDD being Spark's abstraction of an immutable, distributed dataset (see [Spark Programming](https://spark.apache.org/docs/latest/rdd-programming-guide.html) for more details). Each RDD in a DStream contains data from a certain interval.

Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in our above example, the `flatMap` operation is applied on each RDD to generate the RDDs of the `words` DStream. 


### Transformations on DStreams
We can apply various transformation operations on a DStream to modify its structure. Below, we see some of these transformations.

#### UpdateStateByKey Operation
This operation allows us to maintain **arbitrary state** while continuously updating it with new information. 

In order to use this operation, we need to do the following:

1. Define the state
2. Define the state update function: specify with a function how to update the state

To illustrate, let's get back to our previous example. Now we want to keep a count of each word seen in a text data stream. Here, **the running count is the state** and we will use **the `updateStateByKey` operation** for this update purpose:

```
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Add the new values to the previous running count to get the new count.
# prev_running_count is None the first time a key is seen.
def updateFunc(new_values, prev_running_count):
    return sum(new_values) + (prev_running_count or 0)

# Reuse the existing SparkContext if there is one (this avoids the error
# "Cannot run multiple SparkContexts at once"); otherwise create a new one.
conf = SparkConf().setMaster("local[*]").setAppName("WordCountApp")
sc = SparkContext.getOrCreate(conf)

# Create a StreamingContext with a batch interval of 10 seconds
batch_interval = 10
ssc = StreamingContext(sc, batch_interval)

# Enable checkpointing: the state of the streaming application is stored
# periodically in this directory (required by updateStateByKey).
ssc.checkpoint("checkpoint")

host = "localhost"
port = 9999

lines = ssc.socketTextStream(host, int(port))

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word across batches.
# updateStateByKey() maintains arbitrary per-key state while processing the
# incoming data stream: the update function is applied for each batch, and
# the updated state is kept for use when the next batch is processed.
# It is mostly used to maintain a rolling or cumulative count.
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.updateStateByKey(updateFunc)

# Print the result
wordCounts.pprint()

ssc.start()
try:
    ssc.awaitTermination(timeout=60)
except KeyboardInterrupt:
    pass
finally:
    # Stops the StreamingContext and, by default, the SparkContext as well
    ssc.stop()
```

Can you see the effect of using the `updateStateByKey` operation? The printed counts now accumulate across batches instead of restarting with each batch. For every key (word), the operation calls the update function (`updateFunc`), which has two parameters:

1. `new_values`: the sequence of new values for that key in the current batch (here, a list of 1s, one per occurrence of the word).
2. `prev_running_count`: the previous state for that key, i.e. the running count so far (`None` if the word has not been seen before).


Note that the `updateStateByKey` operation needs the checkpoint directory to be configured.
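
The way the state evolves from batch to batch can be traced with a small pure-Python simulation. It reuses `updateFunc` from the example above; the helper `update_state_by_key` and the two sample batches are assumptions made for this sketch and only mimic the behaviour of the real operation.

```python
from collections import defaultdict

def updateFunc(new_values, prev_running_count):
    return sum(new_values) + (prev_running_count or 0)

def update_state_by_key(state, batch_pairs, update_func):
    """Apply update_func to every key that has new values or existing state."""
    new_values = defaultdict(list)
    for key, value in batch_pairs:
        new_values[key].append(value)
    keys = set(state) | set(new_values)
    # Keys without new values receive an empty list; unseen keys receive None as state.
    return {key: update_func(new_values.get(key, []), state.get(key)) for key in keys}

batches = [["This", "is", "line", "0"], ["This", "is", "line", "1"]]
state = {}
for batch in batches:
    pairs = [(word, 1) for word in batch]
    state = update_state_by_key(state, pairs, updateFunc)
    print(sorted(state.items()))
```

After the second batch, the words seen in both batches have a running count of 2, while `'0'` keeps its count of 1 even though it did not appear again.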

##### Checkpointing

A streaming application is typically expected to run 24 hours a day. Thus, it needs to be resilient to failures unrelated to the application logic, such as system failures, driver failures and JVM crashes. Checkpointing saves enough information (for example, the generated RDDs) to reliable storage so that the application can recover from such a failure.

To summarise, checkpoints provide a way of recovering to a safe stable application snapshot. Using the `ssc.checkpoint()` method, we can tell the Spark engine **where to store the checkpoint files**.
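
As a rough sketch of how recovery from a checkpoint might look, `pyspark` provides `StreamingContext.getOrCreate(checkpointPath, setupFunc)`, which rebuilds the context from checkpoint data if any exists and otherwise calls the setup function. The function names `create_context` and `run_with_recovery` and the 60-second timeout are assumptions for this illustration; the imports are placed inside the functions so that nothing runs until `run_with_recovery()` is called.

```python
CHECKPOINT_DIR = "checkpoint"  # same directory name as used in this activity

def create_context():
    """Build a fresh StreamingContext; only called when no checkpoint data exists."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, 10)  # batch interval of 10 seconds
    ssc.checkpoint(CHECKPOINT_DIR)

    # All DStream operations must be defined inside this setup function.
    lines = ssc.socketTextStream("localhost", 9999)
    pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
    counts = pairs.updateStateByKey(lambda new, prev: sum(new) + (prev or 0))
    counts.pprint()
    return ssc

def run_with_recovery(timeout=60):
    """Recover the context from the checkpoint if possible, else create it, then run."""
    from pyspark.streaming import StreamingContext

    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    try:
        ssc.awaitTermination(timeout=timeout)
    finally:
        ssc.stop()
```

Calling `run_with_recovery()` in an environment with Spark installed and the TCP server running would start the job; on a restart, the running counts would be restored from the checkpoint directory rather than starting from zero.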

**Options for checkpointing**
- HDFS (Hadoop Distributed File System)
    - Designed for storing large datasets across clusters of computers
    - Parallel processing
        - Data can be read and written on many nodes at the same time
    - Scalable
        - More space can be added by adding more nodes (computers)
    - Fault tolerance
        - Replicates data across multiple nodes (three copies by default)
- Cloud storage
- Local file system:
    - Acceptable for local experiments such as this activity, but avoid it in production: if the machine fails, the checkpoint data is lost

### Window Operation
Spark Streaming also provides windowed computations, which allow us to apply transformations over a sliding window of data.

Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.

A window operation needs two parameters:

1. window length: the duration of the window.
2. sliding interval: the interval at which the window operation is performed.

These two parameters must be multiples of the batch interval of the source DStream (i.e. in our example: 10 sec).

To illustrate, refer to our previous example. If we want to generate word counts over the last 20 seconds of data, every 10 seconds, we can use the following command:

```
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 20, 10)
```

```
reduceByKeyAndWindow()
```

<font color='red'>
- Takes three required arguments; we use four here
    
- 1st: Reduce function, to combine the counts of the words over the window
    
- 2nd: Inverse reduce function, to remove the counts of words that are no longer in the window (may be `None`, in which case every window is recomputed from scratch)
    
- 3rd: Window length (20 seconds)
    
- 4th: Sliding interval (10 seconds; optional, defaults to the batch interval)
</font><br>
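
The interplay of the reduce and inverse reduce functions can be sketched in plain Python. With a 10-second batch interval, a 20-second window spans two batches and slides by one batch. The function `window_counts_incremental` and the sample batches are invented for this illustration and only imitate the incremental computation.

```python
from collections import Counter

def window_counts_incremental(batches, window_batches):
    """Word counts over a sliding window, updated incrementally, one result per batch."""
    window = Counter()
    results = []
    for i, batch in enumerate(batches):
        # Reduce function (x + y): add the counts of the batch entering the window.
        window.update(batch)
        # Inverse reduce function (x - y): subtract the batch that just left the window.
        if i >= window_batches:
            window.subtract(batches[i - window_batches])
        # This sketch drops words whose count has fallen to zero.
        window = +window
        results.append(dict(window))
    return results

batches = [["a", "b"], ["a"], ["c"]]
results = window_counts_incremental(batches, window_batches=2)
for result in results:
    print(result)
```

In the third result, the first batch has left the window, so `b` disappears and the count of `a` drops back to 1 without the whole window being recounted.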

<font color='blue'>
**Exercise**: Apply the reduceByKeyAndWindow operation, and check how it is working!
</font><br>

<font color="red">

**WHAT IS A SLIDING WINDOW OF DATA**
- Method to perform windowed computation on a continuous stream of data
- Break down a continuous stream of data into discrete chunks

```
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Add the new values to the previous running count to get the new count.
# prev_running_count is None the first time a key is seen.
def updateFunc(new_values, prev_running_count):
    return sum(new_values) + (prev_running_count or 0)

# Reuse the existing SparkContext if there is one (this avoids the error
# "Cannot run multiple SparkContexts at once"); otherwise create a new one.
conf = SparkConf().setMaster("local[*]").setAppName("WordCountApp")
sc = SparkContext.getOrCreate(conf)

# Create a StreamingContext with a batch interval of 10 seconds
batch_interval = 10
ssc = StreamingContext(sc, batch_interval)

# Enable checkpointing (required by updateStateByKey, and by
# reduceByKeyAndWindow when an inverse reduce function is used).
ssc.checkpoint("checkpoint")

host = "localhost"
port = 9999

lines = ssc.socketTextStream(host, int(port))

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Running count of each word across all batches
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.updateStateByKey(updateFunc)

# Word counts over the last 20 seconds of data, computed every 10 seconds
windowedWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 20, 10)

# Print the results
wordCounts.pprint()
windowedWordCounts.pprint()

ssc.start()
try:
    ssc.awaitTermination(timeout=60)
except KeyboardInterrupt:
    pass
finally:
    # Stops the StreamingContext and, by default, the SparkContext as well
    ssc.stop()
```

Running this cell while a `StreamingContext` from an earlier cell is still active raises the same `java.lang.IllegalStateException: Only one StreamingContext may be started in this JVM` error as before. In that case the new context never starts, and any output that appears is produced by the context that is still running, not by the stateful and windowed streams defined above.


```
-------------------------------------------
Time: 2024-04-30 04:21:40
-------------------------------------------
('line', 10)
('0', 2)
('1', 1)
('This', 10)
('5', 1)
('3', 1)
('is', 10)
('2', 5)

-------------------------------------------
Time: 2024-04-30 04:21:50
-------------------------------------------
('4', 1)
('line', 10)
('0', 1)
('1', 4)
('This', 10)
('5', 2)
('3', 1)
('is', 10)
('2', 1)

-------------------------------------------
Time: 2024-04-30 04:22:00
-------------------------------------------
('4', 1)
('line', 10)
('1', 2)
('0', 1)
('This', 10)
('5', 2)
('3', 3)
('is', 10)
('2', 1)

-------------------------------------------
Time: 2024-04-30 04:22:10
-------------------------------------------
('4', 2)
('line', 10)
('0', 1)
('1', 3)
('This', 10)
('5', 1)
('3', 2)
('is', 10)
('2', 1)

-------------------------------------------
Time: 2024-04-30 04:22:20
-------------------------------------------
('4', 1)
('line', 10)
('0', 3)
('1', 3)
('This', 10)
('3', 1)
('is', 10)
```

### Join Operations

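We can also easily join two different streams into one in Spark Streaming. The join works on DStreams of key-value pairs: in each batch interval, the RDD generated by one stream is joined by key with the RDD generated by the other.

For example, if we want to join the `stream2` data with the `stream1` data, we can use the following code: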

```
stream1 = ...
stream2 = ...
joinedStream = stream1.join(stream2)
```
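
What the join produces for a single batch can be imitated with plain Python lists of key-value pairs. The helper `join_batches` and the sample data are invented for this sketch; the real operation works on the RDDs of the two DStreams.

```python
def join_batches(batch1, batch2):
    """Inner join of two lists of (key, value) pairs, like joining the RDDs of one batch."""
    joined = []
    for key1, value1 in batch1:
        for key2, value2 in batch2:
            if key1 == key2:
                # Matching keys are combined into (key, (value1, value2)).
                joined.append((key1, (value1, value2)))
    return joined

# Hypothetical batches from two streams, keyed by user name
clicks = [("alice", "home"), ("bob", "cart")]
purchases = [("bob", 30), ("carol", 12)]
joined = join_batches(clicks, purchases)
print(joined)
```

Only `bob` appears in both batches, so the joined batch contains a single pair for that key.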

### Output Operations on DStreams ##

When we want to send the data in a DStream to an external system or database, we can use various output operations. The following output operations can be used:

- `print()`: prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. In Python, `pprint()` corresponds to `print()`.
- `saveAsTextFiles(prefix, [suffix])`: saves the DStream data as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
- `foreachRDD(func)`: applies the function `func` to each RDD generated from the stream; this is the method to use for pushing the data in each RDD out to an external system. Note that `func` is executed in the driver process running the streaming application, and will usually contain RDD actions.

For example, with our original example, on the `wordCounts` DStream, we can use the following code:

```
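def sendPartition(records):
    connection = createNewConnection()  # Assuming such a function exists
    try:
        for record in records:
            connection.send(record)
    finally:
        # Close the connection even if sending a record fails
        connection.close()

# Open one connection per partition rather than one per record
wordCounts.foreachRDD(lambda rdd: rdd.foreachPartition(sendPartition))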
```

In `sendPartition()`, we create a single connection object and send all the records in an RDD partition using that connection.
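
The benefit of one connection per partition can be checked with a stand-in connection class that counts how often it is opened. `FakeConnection`, `send_partition` and the sample partitions are hypothetical and exist only for this sketch; no external system is contacted.

```python
class FakeConnection:
    """Stand-in for a real connection; counts how many connections were opened."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.sent = []

    def send(self, record):
        self.sent.append(record)

    def close(self):
        pass

def send_partition(records):
    """Send all records of one partition over a single connection."""
    connection = FakeConnection()
    try:
        for record in records:
            connection.send(record)
    finally:
        connection.close()
    return len(connection.sent)

# Two hypothetical partitions of (word, count) records
partitions = [[("This", 10), ("is", 10)], [("line", 10)]]
sent_per_partition = [send_partition(partition) for partition in partitions]
print(FakeConnection.opened, sent_per_partition)
```

Three records are sent, but only two connections are opened, one for each partition.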

As an example, if we want to store each RDD in a MongoDB database, for example `test_db`, we can use the following code in the `sendPartition()` function:

```
connection = MongoClient()
test_db = connection.get_database('test_db')
....
```
You will learn more on this topic in the next tutorial.

## 6. Spark Structured Streaming

Reference link: <a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" target="_blank">[REF]</a>

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. This is a simple example of a Structured Streaming query adapted from Spark's official documentation.

Let’s say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Make sure the TCP Server notebook is running. 

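Before we start, here are a few points on how Structured Streaming compares with Spark Streaming (DStreams):

- Spark Streaming is based on micro-batch processing: data is processed in small, discrete batches.
- Structured Streaming also processes data in micro-batches by default, but presents the stream as an unbounded table (DataFrame) that grows as data arrives; an experimental continuous processing mode is available from Spark 2.3.
- Structured Streaming is generally considered easier to use than Spark Streaming.
- Structured Streaming was introduced in Spark 2.0, whereas Spark Streaming is the older API (its Python API was added in Spark 1.2).

First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionalities related to Spark.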

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split
```

```
spark = SparkSession.builder.getOrCreate()
```

Next, let’s create a streaming DataFrame that represents text data received from a server listening on localhost:9999, and transform the DataFrame to calculate word counts.

```
socket_sdf = (
    spark.readStream
    .format('socket')
    .options(host='localhost', port=9999)
    .load()
)
```

```
word_counts_sdf = (
    socket_sdf
    # withColumn() creates a new column or replaces an existing one;
    # here each line is replaced by an array of its words.
    # A raw string is used so that '\s' is not treated as a Python escape.
    .withColumn('value', split('value', r'\s+'))
    # explode() transforms an array column into a set of rows,
    # with each row containing one value from the array;
    # the resulting column is named 'word'.
    .select(explode('value').alias('word'))
    .groupBy('word')
    .count()
)
```

The **socket_sdf** DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named “value”, and each line in the streaming text data becomes a row in the table. Note that it is not currently receiving any data, as we are just setting up the transformation and have not yet started it.

Next, we use two built-in SQL functions, `split` and `explode`, to split each line into multiple rows with one word each. In addition, we use the function `alias` to name the new column “word”. Finally, we group by the unique values in that column and count them. The resulting **word_counts_sdf** is a streaming DataFrame which represents the running word counts of the stream.
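
To illustrate the claim that a streaming computation is expressed the same way as a batch computation, the sketch below wraps the transformation in a function that could be applied to a static DataFrame as well. The names `word_counts` and `demo_batch` and the two sample rows are assumptions for this illustration; the imports are placed inside the functions so that nothing runs until `demo_batch()` is called in an environment with Spark installed.

```python
def word_counts(lines_df):
    """Word-count transformation for a DataFrame with a string column 'value'."""
    from pyspark.sql.functions import explode, split

    return (
        lines_df
        .withColumn('value', split('value', r'\s+'))
        .select(explode('value').alias('word'))
        .groupBy('word')
        .count()
    )

def demo_batch():
    """Apply the same transformation to a small static (batch) DataFrame."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    static_df = spark.createDataFrame(
        [("This is line 0",), ("This is line 1",)], ["value"]
    )
    word_counts(static_df).show()
```

The same `word_counts` function could be given the streaming DataFrame `socket_sdf` instead of `static_df`; only the way the result is written out (`show()` versus `writeStream`) would differ.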

```
writer = (
    word_counts_sdf.writeStream
    .format('console')
    .outputMode('complete')
)
```

```
query = writer.start()
```

After the code above is executed, the streaming computation will have started in the background.
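
The `complete` output mode used above writes the whole result table after every trigger. The pure-Python sketch below contrasts it with the `update` mode, which only writes the rows that changed. It is a simplified model of the semantics and not Spark code; the function `run_query` and the sample micro-batches are made up.

```python
from collections import Counter

def run_query(micro_batches, output_mode):
    """Return what the sink would receive after each micro-batch."""
    running = Counter()
    outputs = []
    for batch in micro_batches:
        before = dict(running)
        running.update(batch)
        if output_mode == "complete":
            # The whole result table is written every time.
            outputs.append(dict(running))
        elif output_mode == "update":
            # Only rows whose count changed in this micro-batch are written.
            outputs.append({w: c for w, c in running.items() if before.get(w) != c})
        else:
            raise ValueError(f"unsupported output mode: {output_mode}")
    return outputs

micro_batches = [["This", "is"], ["is", "line"]]
complete = run_query(micro_batches, "complete")
update = run_query(micro_batches, "update")
print(complete)
print(update)
```

After the second micro-batch, the complete mode repeats the unchanged row for `This`, whereas the update mode only reports `is` and `line`.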

When you are done, stop the query:

```
query.stop()
```


## Summary

Congratulations on finishing this activity!

<font color='blue'>
**Wrap up what we've learned:**
- Learned that Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
- Learned that Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results
- Learned that using "StreamingContext", we can define the input sources by creating input DStreams; apply transformation and output operations to DStreams; and receive data and process it.
- Learned that the "updateStateByKey" operation allows you to maintain arbitrary state while continuously updating it with new information. 
- Learned how to use "dstream.foreachRDD" that allows data to be sent out to external systems.
    
**Note: From Spark 3 onwards, the DStream API for Kafka integration has been removed for Python. Therefore, in this tutorial, we also introduced you to Structured Streaming with Spark. With structured streaming, you can express your streaming computation the same way you would express a batch computation on static data. You will use structured streaming in your assignment.**
    
Extra reading: Do refer to this paper, which dives deeper into Structured Streaming in Spark:
    
https://dl-acm-org.ezproxy.lib.monash.edu.au/doi/pdf/10.1145/3183713.3190664