# Lecture 3: Spark Streaming

_Spark Streaming_ is an extension of the Spark API that enables scalable stream processing.

The continuous stream of input data can be ingested from many data sources, such as **Kafka**, **Amazon S3** or **TCP sockets**.

The Spark API lets you process data via high-level functions such as *map* and *reduce*. As we are going to see, it is also possible to use dataframe operations.

Processed data can be exported to an external database to feed live dashboards or offline analyses, stored in files, or passed to a further stage of a Kafka pipeline.

Overall, the practice of reading data from a set of sources, pre-processing it, and then storing it in a different format for later analysis is extremely common and has its own name: **real-time ETL pipelines**.
- **E**xtract
- **T**ransform
- **L**oad

Spark Streaming works by dividing the input data into _micro-batches_ that can be treated as static datasets. In Spark this sequence of micro-batches is referred to as a *discretized stream* (*DStream*). Internally, a DStream is represented as a sequence of RDDs, one per batch.

![DStream](imgs/lecture3/DStream.png)
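
The micro-batch idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's actual implementation: the `discretize` helper and the sample `(timestamp, value)` events are hypothetical, and each event is simply assigned to the batch whose time interval contains its timestamp.

```python
def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Hypothetical helper for illustration only: batch i collects the values
    whose timestamp falls in [i * batch_interval, (i + 1) * batch_interval).
    """
    batches = {}
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        batches.setdefault(index, []).append(value)
    last = max(batches, default=-1)
    # intervals that received no data still produce an (empty) batch
    return [batches.get(i, []) for i in range(last + 1)]


# made-up events: (arrival time in seconds, payload)
events = [(0.1, "a"), (0.7, "b"), (2.3, "c"), (3.9, "d"), (4.2, "e")]
micro_batches = discretize(events, batch_interval=2)
print(micro_batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Each inner list plays the role of one RDD of the DStream: a small static dataset covering one batch interval.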

Any transformation applied to the DStream, such as `DStream.map()`, acts independently on each batch. For example, in the image below, we filter each RDD of the original stream to remove some data and produce a new stream.

![DStream_filter](imgs/lecture3/Dstream_filter.png)
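
The per-batch behaviour of a transformation can be sketched in plain Python as well. This is only an analogy under stated assumptions: the batches are ordinary lists, and `transform_each_batch` is a hypothetical helper standing in for what a DStream transformation such as `filter()` does to every RDD.

```python
def transform_each_batch(batches, func):
    """Apply the same function to every batch, one batch at a time."""
    return [func(batch) for batch in batches]


# three made-up micro-batches of transaction amounts
batches = [[5, 120, 40], [300, 7], []]

# keep only the amounts above 50, independently within each batch
large_only = transform_each_batch(
    batches, lambda batch: [amount for amount in batch if amount > 50]
)
print(large_only)  # [[120], [300], []]
```

The original batches are left untouched and a new sequence of batches is produced, mirroring how a transformation on a DStream yields a new DStream.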

In this lecture we will see how to set up a simple stream using a TCP socket as a data source.

## Create and Start a Spark Session

In [1]:
# import the python libraries to create/connect to a Spark Session
from pyspark.sql import SparkSession

# build a SparkSession 
#   connect to the master node on the port where the master node is listening (7077)
#   declare the app name 
#   configure the executor memory to 512 MB
#   either *connect* or *create* a new Spark Context
spark = SparkSession.builder \
    .master("spark://spark-master:7077")\
    .appName("My streaming spark application")\
    .config("spark.executor.memory", "512m")\
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")\
    .config("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")\
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/05/25 15:15:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
spark

In [3]:
# get the SparkContext associated with the Spark session
sc = spark.sparkContext

# print its status
sc

## Spark _Streaming_ context

The first step of a Spark streaming application is the creation of a `StreamingContext`. 

The `StreamingContext` is a crucial component in Spark Streaming. It's responsible for initializing the Spark Streaming application and specifying how to handle micro-batches of data. 

The `StreamingContext` is similar in concept to the `SparkContext`, but it must be initialized with some additional information so that it knows how to handle the micro-batches.

To create a `StreamingContext`, you can use the `StreamingContext(SparkContext, batch_interval)` constructor. The `SparkContext` object provides the necessary environment for Spark Streaming, while the `batch_interval` parameter determines the (wall-time) duration of each batch in seconds.

It's important to note that you can only have at most **one** `StreamingContext` for each Spark application. Attempting to create multiple `StreamingContext` objects in a single application will result in errors.

Create a Spark `StreamingContext` with a batch interval of 2 seconds:

In [4]:
from pyspark.streaming import StreamingContext

# create a streaming context with a batch interval of 2 seconds
ssc = StreamingContext(sc, 2) 

### Starting and Stopping Spark Streaming

To process data in real-time using Spark, we need to create a `StreamingContext`, define the operations to perform on the data, and specify the data source and sink to connect to.

Once the streaming operations are defined, we can start processing the stream by calling the `.start()` method of the `StreamingContext` object (`ssc` in our case). Similarly, we can stop the streaming processing by calling the `.stop()` method.

**NOTE:** By default, stopping the `StreamingContext` also stops the `SparkContext`, which closes the entire Spark application. To prevent this, pass the `stopSparkContext=False` option when stopping the `StreamingContext`.
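
As a minimal sketch of this start/stop cycle, the hypothetical helper below (`run_for` is not part of Spark) starts a streaming context, lets it run for a fixed time, and then stops it while keeping the `SparkContext` alive. It assumes `ssc` is a `StreamingContext` whose source and operations have already been declared.

```python
def run_for(ssc, seconds, graceful=False):
    """Run an already configured StreamingContext for a limited time.

    Hypothetical convenience wrapper: it only chains existing
    StreamingContext methods and adds no behaviour of its own.
    """
    ssc.start()
    try:
        # block until the stream terminates or `seconds` have elapsed
        ssc.awaitTerminationOrTimeout(seconds)
    finally:
        # keep the SparkContext (and the Spark application) alive
        ssc.stop(stopSparkContext=False, stopGraceFully=graceful)


# possible usage in the notebook, once the stream has been set up:
# run_for(ssc, 10)
```

Using `try`/`finally` ensures the streaming context is stopped even if the cell is interrupted while waiting.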

### TCP Socket Source

For this example, Spark will read data from a TCP socket using Spark Streaming.

A TCP socket is a communication endpoint used to establish a connection between two devices over a network.
You can think of it as a telephone call: the two endpoints first have to establish a connection; once the connection is established, data can be transferred; as soon as one of the two ends closes the connection, the whole communication is lost.
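
The telephone analogy can be tried out with Python's standard `socket` module. The sketch below is an illustration under stated assumptions and is not the lecture's producer: both endpoints run on the local machine, the port is chosen by the operating system, and the helper names `serve_once` and `read_all_lines` are hypothetical.

```python
import socket
import threading


def serve_once(server_socket, lines):
    """Accept a single connection, send newline-terminated lines, hang up."""
    connection, _ = server_socket.accept()
    with connection:
        for line in lines:
            connection.sendall((line + "\n").encode("utf-8"))


def read_all_lines(host, port):
    """Connect to (host, port) and read lines until the peer closes."""
    with socket.create_connection((host, port)) as client:
        with client.makefile("r", encoding="utf-8") as stream:
            return [line.rstrip("\n") for line in stream]


# listening endpoint; port 0 asks the operating system for any free port
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(server, ["hello", "world"]))
worker.start()

received = read_all_lines("127.0.0.1", port)  # the "call" ends when the server hangs up
worker.join()
server.close()
print(received)  # ['hello', 'world']
```

The reading side only stops when the sending side closes the connection, which is the "call is over" moment of the analogy.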

We will generate a dummy data stream representing fake credit card transactions.

A simple Python program will be used to create this data stream.
You can find it in `utils/producer.py`. 
When executed, the producer will try to establish a TCP connection and send data on port `5555` of a given `host` (`spark-master` in our case). 

Before executing the producer program, take a moment to review the `producer.py` code to understand how it works. It's important to understand the logic of the program before using it to generate the streaming data.

In [6]:
! cat utils/producer.py

import socket
import json
import time
import random
import argparse

# Define some lists of first and last names to use for generating random messages
first_names=('John','Andy','Joe','Alice','Jill')
last_names=('Johnson','Smith','Jones', 'Millers','Darby')

# Define a function for sending messages over the socket
def send_messages(client_socket):
    try:
        while 1:
            # Generate a random message with a random name, surname, amount, delta_t, and flag
            msg = {
                'name': random.choice(first_names),
                'surname': random.choice(last_names),
                'amount': '{:.2f}'.format(random.random()*1000),
                'delta_t': '{:.2f}'.format(random.random()*10),
                'flag': random.choices([0,1], weights=[0.8, 0.2])[0]
            }
            # Encode the message as JSON and send it over the socket
            client_socket.send((json.dumps(msg)+"\n").encode('utf-8'))
            # Sleep for a s

The producer will generate new records in the form of a random combination of:
- `name`
- `surname`
- `amount`: amount of the credit card transaction
- `delta_t`: time between transactions
- `flag`: random flag indicating whether the transaction is potentially fraudulent

This information is formatted as JSON, with one message per line.
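
To make the message format concrete, here is a minimal sketch that builds one record with the same fields as the producer and decodes it again. The `make_record` helper and the fixed random seed are assumptions made for illustration; the field definitions mirror the part of `producer.py` shown above.

```python
import json
import random

FIRST_NAMES = ("John", "Andy", "Joe", "Alice", "Jill")
LAST_NAMES = ("Johnson", "Smith", "Jones", "Millers", "Darby")


def make_record(rng):
    """Build one fake transaction with the producer's field layout."""
    return {
        "name": rng.choice(FIRST_NAMES),
        "surname": rng.choice(LAST_NAMES),
        "amount": "{:.2f}".format(rng.random() * 1000),
        "delta_t": "{:.2f}".format(rng.random() * 10),
        "flag": rng.choices([0, 1], weights=[0.8, 0.2])[0],
    }


rng = random.Random(0)  # seeded only to make the sketch repeatable
line = json.dumps(make_record(rng)) + "\n"  # one JSON message per line

decoded = json.loads(line)
# amount and delta_t travel as strings, so convert them before doing arithmetic
amount = float(decoded["amount"])
delta_t = float(decoded["delta_t"])
print(decoded, amount, delta_t)
```

Note that `amount` and `delta_t` are formatted as strings by the producer, while `flag` is an integer; any numerical processing downstream needs an explicit conversion.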

### Declaring the `StreamingContext` data source as a TCP socket

To inform Spark that the StreamingContext data source will be a TCP socket located at a specific `hostname` and `port`, we can use the `socketTextStream(hostname, port)` method.

Refer to the [StreamingContext documentation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.streaming.StreamingContext.html) for additional available options.


In [7]:
# the hostname and port number
hostname = "spark-master"
portnumber = 5555

# declare the Spark Streaming source as TCP socket 
socket_stream = ssc.socketTextStream(hostname, portnumber)

### Start the python producer.py script

From a terminal/WSL, connect to the `spark-master` Docker container using the command
```bash
docker exec -it spark-master bash
``` 

From inside the Docker container, move to the `/mapd-workspace` folder and run the Python script with the option `--hostname spark-master`:

```bash
python notebooks/utils/producer.py --hostname spark-master
```

## Exploring the data stream

The first thing we need to do is load the data describing each transaction, which is formatted as JSON.

In [8]:
import json

# use the map() transformation to apply the same function to every element of each RDD
# the function we want to apply is json.loads(), which parses each message
json_stream = socket_stream.map(""" --- """)
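
The argument of `map()` must be a callable. As a plain-Python sketch of what the cell above is aiming for (no Spark involved, and the two sample lines are made up), the built-in `map` can stand in for the DStream transformation: passing `json.loads` parses every message, while passing a string placeholder fails as soon as the function is invoked.

```python
import json

# two made-up messages, as they would arrive from the socket (one JSON per line)
lines = [
    '{"name": "Alice", "amount": "12.50"}',
    '{"name": "Joe", "amount": "7.00"}',
]

# a callable is applied to every element
records = list(map(json.loads, lines))
names = [record["name"] for record in records]
print(names)  # ['Alice', 'Joe']

# a string is not callable: the error only appears when elements are processed
try:
    list(map(""" --- """, lines))
except TypeError as error:
    message = str(error)
    print(message)  # 'str' object is not callable
```

In the streaming case the analogous call would be `socket_stream.map(json.loads)`; since transformations are lazy, a wrong argument is only reported once the stream is started and a batch is processed.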

It is possible to print some elements of each batch with `pprint()`. This can be used to explore the RDDs.

In [9]:
json_stream.pprint()

**Start the computations with `ssc.start()` and stop with `ssc.stop(stopSparkContext=False)`.** 

_Remember that once the StreamingContext has been stopped, it must be redefined anew if we want to restart the streaming computations._
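
Because a stopped context cannot be restarted, it can be convenient to wrap the whole setup in a function that is called again whenever a fresh stream is needed. The sketch below is one possible arrangement, not the lecture's prescribed approach: `build_streaming_pipeline` and its `context_factory` parameter are hypothetical names, and the pipeline (parse each line as JSON, then print) is assumed from the cells above.

```python
import json


def build_streaming_pipeline(sc, hostname, port, batch_interval=2,
                             context_factory=None):
    """Create a new StreamingContext and re-declare source and operations.

    `context_factory` is an illustrative hook; by default the real
    pyspark StreamingContext class is used.
    """
    if context_factory is None:
        # imported lazily so the function can be defined without a cluster
        from pyspark.streaming import StreamingContext
        context_factory = StreamingContext

    ssc = context_factory(sc, batch_interval)
    stream = ssc.socketTextStream(hostname, port)
    # parse each message and print a few elements of every batch
    stream.map(json.loads).pprint()
    return ssc


# possible usage in the notebook after a previous context has been stopped:
# ssc = build_streaming_pipeline(sc, "spark-master", 5555)
# ssc.start()
```

Every call returns a brand-new context, so the source and the operations are always declared before `start()` is invoked.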

In [9]:
ssc.start()

[Stage 2:>                  (0 + 1) / 1][Stage 3:>                  (0 + 1) / 1]

23/05/25 15:02:27 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(Pyt

[Stage 2:>                                                          (0 + 1) / 1]

23/05/25 15:02:29 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 75) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(Pyt

[Stage 2:>                  (0 + 1) / 1][Stage 4:>                  (0 + 1) / 1]

23/05/25 15:02:30 ERROR TaskSetManager: Task 0 in stage 4.0 failed 4 times; aborting job
23/05/25 15:02:30 ERROR JobScheduler: Error running job streaming job 1685026948000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/util.py", line 71, in call
    r = self.func(t, *rdds)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/dstream.py", line 254, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/usr/bi

    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:02:56 ERROR TaskSetManager: Task 0 in stage 18.0 failed 4 times; aborting job


23/05/25 15:02:57 ERROR JobScheduler: Error running job streaming job 1685026976000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/util.py", line 71, in call
    r = self.func(t, *rdds)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/dstream.py", line 254, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a

23/05/25 15:02:58 WARN TaskSetManager: Lost task 0.0 in stage 19.0 (TID 135) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:00 WARN TaskSetManager: Lost task 0.0 in stage 20.0 (TID 139) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:02 WARN TaskSetManager: Lost task 0.0 in stage 21.0 (TID 143) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:02 ERROR TaskSetManager: Task 0 in stage 21.0 failed 4 times; aborting job
23/05/25 15:03:03 ERROR JobScheduler: Error running job streaming job 1685026982000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/util.py", line 71, in call
    r = self.func(t, *rdds)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/dstream.py", line 254, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/usr/b

23/05/25 15:03:04 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 147) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:06 WARN TaskSetManager: Lost task 0.0 in stage 23.0 (TID 151) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:07 ERROR TaskSetManager: Task 0 in stage 23.0 failed 4 times; aborting job
23/05/25 15:03:07 ERROR JobScheduler: Error running job streaming job 1685026986000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/util.py", line 71, in call
    r = self.func(t, *rdds)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/dstream.py", line 254, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/usr/b

23/05/25 15:03:08 WARN TaskSetManager: Lost task 0.0 in stage 24.0 (TID 155) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:10 WARN TaskSetManager: Lost task 0.0 in stage 25.0 (TID 159) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:12 WARN TaskSetManager: Lost task 0.0 in stage 26.0 (TID 163) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:14 WARN TaskSetManager: Lost task 0.0 in stage 27.0 (TID 167) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:15 ERROR TaskSetManager: Task 0 in stage 27.0 failed 4 times; aborting job
23/05/25 15:03:15 ERROR JobScheduler: Error running job streaming job 1685026994000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/util.py", line 71, in call
    r = self.func(t, *rdds)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/dstream.py", line 254, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/usr/b

23/05/25 15:03:16 WARN TaskSetManager: Lost task 0.0 in stage 28.0 (TID 171) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:17 ERROR TaskSetManager: Task 0 in stage 28.0 failed 4 times; aborting job


23/05/25 15:03:17 ERROR JobScheduler: Error running job streaming job 1685026996000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/util.py", line 71, in call
    r = self.func(t, *rdds)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/dstream.py", line 254, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/sql/utils.py", line 190, in deco
    return f(*a

23/05/25 15:03:18 WARN TaskSetManager: Lost task 0.0 in stage 29.0 (TID 175) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:20 WARN TaskSetManager: Lost task 0.0 in stage 30.0 (TID 179) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:22 WARN TaskSetManager: Lost task 0.0 in stage 31.0 (TID 183) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:22 ERROR TaskSetManager: Task 0 in stage 31.0 failed 4 times; aborting job
23/05/25 15:03:23 ERROR JobScheduler: Error running job streaming job 1685027002000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/util.py", line 71, in call
    r = self.func(t, *rdds)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/dstream.py", line 254, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/usr/b

23/05/25 15:03:24 WARN TaskSetManager: Lost task 0.0 in stage 32.0 (TID 187) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:26 WARN TaskSetManager: Lost task 0.0 in stage 33.0 (TID 191) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:28 WARN TaskSetManager: Lost task 0.0 in stage 34.0 (TID 195) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:30 WARN TaskSetManager: Lost task 0.0 in stage 35.0 (TID 199) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:32 WARN TaskSetManager: Lost task 0.0 in stage 36.0 (TID 203) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:34 WARN TaskSetManager: Lost task 0.0 in stage 37.0 (TID 207) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:36 WARN TaskSetManager: Lost task 0.0 in stage 38.0 (TID 211) (172.21.0.5 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:38 WARN TaskSetManager: Lost task 0.0 in stage 39.0 (TID 215) (172.21.0.3 executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/worker.py", line 678, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/lib/pyspark.zip/pyspark/serializers.py", line 273, in dump_stream
    vs = list(itertools.islice(iterator, batch))
TypeError: 'str' object is not callable

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:552)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:758)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(P

23/05/25 15:03:53 ERROR TaskSetManager: Task 0 in stage 46.0 failed 4 times; aborting job
23/05/25 15:03:53 ERROR JobScheduler: Error running job streaming job 1685027032000 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/util.py", line 71, in call
    r = self.func(t, *rdds)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/streaming/dstream.py", line 254, in takeAndPrint
    taken = rdd.take(num + 1)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/bin/spark-3.3.2-bin-hadoop3/python/pyspark/context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/opt/conda/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/usr/b

In [10]:
ssc.stop(stopSparkContext=False)

23/05/25 15:04:03 ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
23/05/25 15:04:03 ERROR TaskSchedulerImpl: Lost executor 1 on 172.21.0.6: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.


## Working with Streaming data

Now that we know how to stream data into Spark, let's explore how we can perform basic distributed operations on the data.

Before we proceed, we need a running `StreamingContext` again: the connection between the socket and Spark is lost when the context is stopped, and a stopped `StreamingContext` cannot be started a second time.

To set up the stream again, we need to:
1. Create a new `StreamingContext` object (we can reuse the `ssc` variable name in our case).
2. Point it to the correct TCP socket and port where the data is being streamed from.
3. Restart the Python producer application.

Once the `StreamingContext` is properly set up and running, we can start applying distributed operations to the streaming data. 

In [11]:
# create a new Spark StreamingContext with a batch interval of 2 seconds
ssc = """ --- """

In [12]:
# define the socket stream using the appropriate endpoint and port
socket_stream = """ --- """
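As a rough illustration of steps 1 and 2, the two cells above could be filled in along these lines. This is only a sketch under stated assumptions: the host `localhost` and port `9999` are placeholders for wherever the producer script is actually listening, and `build_socket_stream` is a hypothetical helper introduced here, not part of PySpark.

```python
BATCH_INTERVAL_S = 2        # batch interval mentioned in the cell comment above
SOCKET_HOST = "localhost"   # assumption: replace with the producer's hostname
SOCKET_PORT = 9999          # assumption: replace with the producer's port


def build_socket_stream(sc, host=SOCKET_HOST, port=SOCKET_PORT,
                        batch_interval_s=BATCH_INTERVAL_S):
    """Create a fresh StreamingContext and attach a TCP text stream to it."""
    # imported here so that the helper can be defined without a Spark install
    from pyspark.streaming import StreamingContext

    # a new context is needed every time: a stopped one cannot be restarted
    ssc = StreamingContext(sc, batch_interval_s)
    # each line of text received on the socket becomes one record of the DStream
    socket_stream = ssc.socketTextStream(host, port)
    return ssc, socket_stream
```

In the notebook this would be called as `ssc, socket_stream = build_socket_stream(sc)`, using the existing Spark context.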

In [13]:
# start the python producer script
### from the terminal/WSL shell

We now start listening on the TCP socket, parsing each message of the input data stream as JSON.

**Remember to get rid of the `pprint()` action, which would otherwise be performed continuously, dumping the input data into the Jupyter cells.**

In [15]:
# create a new json_stream object by parsing each message read from the socket as JSON
json_stream = socket_stream.map(""" --- """)

AttributeError: 'str' object has no attribute 'map'
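`DStream.map` expects a function that is applied to every record, so the placeholder has to be replaced with a callable; a plain string there fails on the executors with `'str' object is not callable`. A minimal sketch of the parsing step, shown on a single message so it can be tried without a running stream (the sample message and the helper name `parse_message` are hypothetical; the real field set depends on the producer script):

```python
import json


def parse_message(line):
    """Decode one JSON-encoded text line into a Python dictionary."""
    return json.loads(line)


# hypothetical message, only meant to resemble what the producer might send
sample = '{"amount": "12.5", "delta_t": "3.0", "flag": "1"}'
record = parse_message(sample)
print(record["amount"])

# on the stream, the same function would be applied to every record, e.g.
#   json_stream = socket_stream.map(parse_message)
```

Passing `json.loads` directly to `map` would work equally well; the named wrapper simply leaves room for error handling later.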

#### Converting Streaming Data to a DataFrame

To make use of Spark's higher-level APIs, we can convert each batch of streaming data into a DataFrame. 

To do so, we'll first need to convert the numeric features of the incoming JSON data into Python floats and integers. This is a simple type cast operation that can be easily parallelized.

After casting the data, we can create a `Row` object for each transaction using the resulting Python dictionary. These `Row` objects can then be used to create a DataFrame, allowing us to use Spark's higher-level APIs for data processing.

In [None]:
from pyspark.sql import Row

# create a row for each message 
#   convert each numerical value to the proper python type
#   create a row from each message
def create_row_rdd(t):
    t['amount'] = float(t['amount'])
    t['delta_t'] = float(t['delta_t'])
    t['flag'] = int(t['flag'])
    
    return Row(**t)

# apply the transformation to every record of the json_stream DStream
row_stream = json_stream.map(create_row_rdd)
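To see what the cast does to a single message, the same logic can be tried on a plain dictionary, without Spark. The sketch below assumes that the numeric values arrive as strings and that the name fields are called `first_name` and `last_name`; both are assumptions made for illustration. Unlike `create_row_rdd`, this hypothetical `cast_fields` returns a copy instead of modifying its input, and it stops short of wrapping the result in a `Row`.

```python
def cast_fields(t):
    """Return a copy of the message with numeric fields cast to Python numbers."""
    out = dict(t)                        # leave the original message untouched
    out["amount"] = float(t["amount"])
    out["delta_t"] = float(t["delta_t"])
    out["flag"] = int(t["flag"])
    return out


# hypothetical message; field names other than amount/delta_t/flag are assumed
message = {
    "first_name": "Ada",
    "last_name": "Lovelace",
    "amount": "12.5",
    "delta_t": "3.0",
    "flag": "1",
}
casted = cast_fields(message)
print(casted)
```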

The method `DStream.foreachRDD` can be used to apply custom transformations to each *batch* of data. 

In our case, we are interested in converting each batch of data into a Spark DataFrame and performing operations on it, such as counting the number of transactions for each user.

In this specific use case, we treat a user as suspicious whenever more than one of their transactions within the same batch has the `flag` field equal to one. For simplicity, we will assume that such batches represent fraudulent activity.

In reality, such a flag might be set on the fly using static rules or an ML-based model.

**NOTE**: If left unconstrained, Spark might create a very large number of partitions for this streaming application (the default value of `spark.sql.shuffle.partitions` is 200).

Using far more partitions than necessary adds considerable overhead, since every partition is scheduled as a separate task and shuffles have to exchange data between partitions.

We can force Spark to use a small yet reasonable number of partitions (given the problem and the resources we have), thus making it more efficient in the case of small workloads and few executors.

In [None]:
# this line is a trick to force Spark to use a small number of partitions (4 in this example)
spark.conf.set("spark.sql.shuffle.partitions", 4)

### Process each batch to identify possibly fraudulent transactions

1. Convert the RDD into a DataFrame (provide the schema if necessary).
2. Compute the _number of flagged transactions per batch per user_ (create a unique `userID` field as the combination of _FirstLastname_ to identify individual users).
3. Identify all the "suspicious" transactions per user: all users with more than one flagged transaction per batch will be assigned an `isFraud` boolean variable.
4. Format the resulting `userID` and `isFraud` information in a DataFrame to mimic a "live report" of the suspicious transactions.

In [None]:
from pyspark.sql.functions import concat, col, lit, countDistinct

def process_batch(rdd):
    # convert rdd to df
    #   check the documentation and/or the Lecture2 notebook 
    #   for details on how to create and pass a schema to a dataframe   
    df = """ --- """
    
    # find number of transactions for each user when flag = 1 
    #    declare a new column to create a unique user identifier 
    #    this can be easily done by concatenating first- and last-name fields
    #    check the concat function from pyspark.sql.functions 
    num_transactions = """ --- """
    
    # find suspicious transactions
    #    keep only users with more than one flagged transaction per batch
    #    create an "isFraud" column with a value of 1 for the selected users (check the lit function)
    #    from the dataframe, project only the userID and isFraud columns
    sus_transactions = """ --- """
    
    # (trigger an automatic alert)
    # print the first 5 items of the resulting dataframe
    sus_transactions.show(5)
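Before filling in the DataFrame version, it may help to check the intended per-batch logic on a small in-memory example. The sketch below reproduces steps 2 to 4 in plain Python rather than with Spark; the field names `first_name` and `last_name`, the sample batch, and the helper `flag_suspicious_users` are all assumptions made for illustration.

```python
from collections import Counter


def flag_suspicious_users(batch):
    """Return one report row per user with more than one flagged transaction."""
    # count flagged transactions per user, keyed by first name + last name
    flagged = Counter(
        t["first_name"] + t["last_name"]
        for t in batch
        if t["flag"] == 1
    )
    # keep only the users that exceed one flagged transaction in this batch
    return [
        {"userID": user, "isFraud": 1}
        for user, n_flagged in flagged.items()
        if n_flagged > 1
    ]


# hypothetical micro-batch of already-cast transactions
batch = [
    {"first_name": "Ada", "last_name": "Lovelace", "flag": 1},
    {"first_name": "Alan", "last_name": "Turing", "flag": 1},
    {"first_name": "Ada", "last_name": "Lovelace", "flag": 1},
    {"first_name": "Grace", "last_name": "Hopper", "flag": 0},
    {"first_name": "Grace", "last_name": "Hopper", "flag": 0},
]
report = flag_suspicious_users(batch)
print(report)
```

In the DataFrame version, the same steps would typically map onto a `filter` on `flag`, a `withColumn` using `concat` to build `userID`, a `groupBy("userID").count()`, a second `filter` on the count, and a `withColumn` using `lit` for `isFraud`.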

Finally, instruct Spark to execute this `process_batch` function **for each RDD** in your DStream.

In [None]:
row_stream.foreachRDD(process_batch)
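Some micro-batches may arrive empty, for example while the producer is not sending data, and building a DataFrame from an empty RDD without an explicit schema can fail. One possible guard, sketched here as a hypothetical `skip_empty` wrapper that is not part of PySpark, is to check `rdd.isEmpty()` before calling the batch handler:

```python
def skip_empty(handler):
    """Wrap a per-batch handler so that empty micro-batches are ignored."""
    def wrapped(rdd):
        # isEmpty() is available on PySpark RDDs
        if rdd.isEmpty():
            return None
        return handler(rdd)
    return wrapped


# in the notebook, the registration would then read
#   row_stream.foreachRDD(skip_empty(process_batch))
```

The wrapper keeps `process_batch` itself free of bookkeeping, which makes it easier to test on a static RDD.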

Now you should be ready to start the Spark streaming context.

In [None]:
ssc.start()

In [None]:
# stop streaming context
ssc.stop(stopSparkContext=False)
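Instead of starting and stopping the context in two separate cells, the stream can also be run for a fixed amount of time. The sketch below is one way to do it, assuming a 30-second run is enough to observe a few batches; `run_for` is a hypothetical helper, while `awaitTerminationOrTimeout` and the `stopGraceFully` argument belong to the PySpark `StreamingContext` API.

```python
def run_for(ssc, seconds):
    """Start the streaming context, let it run for a while, then stop it."""
    ssc.start()
    # block until the stream terminates by itself or the timeout expires
    ssc.awaitTerminationOrTimeout(seconds)
    # keep the SparkContext alive and let already received data be processed
    ssc.stop(stopSparkContext=False, stopGraceFully=True)


# in the notebook (assumed duration):
#   run_for(ssc, 30)
```

A graceful stop may take noticeably longer than an immediate one, since pending batches are processed first.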

## Stop worker and master

In [None]:
sc.stop()
spark.stop()

Finally, use `docker compose down` to stop and clear all running containers.