
Error in Spark Submit #2883

Closed
mohamedih opened this issue Jun 19, 2020 · 19 comments
Labels
on-hold Issues or Pull Requests with this label will never be considered stale

Comments

@mohamedih

mohamedih commented Jun 19, 2020

I'm using the Spark chart in IBM Cloud with values.yaml. The cluster master and the workers are up and running in Kubernetes. When I run spark-submit from a pod inside Kubernetes, I see the behavior below.

This is the log from the driver itself:

20/06/19 04:44:45 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/6 is now RUNNING
20/06/19 04:44:45 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/5 is now EXITED (Command exited with code 1)
20/06/19 04:44:45 INFO StandaloneSchedulerBackend: Executor app-20200619044436-0006/5 removed: Command exited with code 1
20/06/19 04:44:45 INFO BlockManagerMaster: Removal of executor 5 requested
20/06/19 04:44:45 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 5
20/06/19 04:44:45 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
20/06/19 04:44:45 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200619044436-0006/7 on worker-20200619005448-172.30.161.25-38829 (172.30.161.25:38829) with 1 core(s)
20/06/19 04:44:45 INFO StandaloneSchedulerBackend: Granted executor ID app-20200619044436-0006/7 on hostPort 172.30.161.25:38829 with 1 core(s), 1024.0 MB RAM
20/06/19 04:44:45 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/7 is now RUNNING
20/06/19 04:44:47 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/6 is now EXITED (Command exited with code 1)
20/06/19 04:44:47 INFO StandaloneSchedulerBackend: Executor app-20200619044436-0006/6 removed: Command exited with code 1
20/06/19 04:44:47 INFO BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
20/06/19 04:44:47 INFO BlockManagerMaster: Removal of executor 6 requested
20/06/19 04:44:47 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 6
20/06/19 04:44:47 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200619044436-0006/8 on worker-20200619005407-172.30.161.23-45683 (172.30.161.23:45683) with 1 core(s)
20/06/19 04:44:47 INFO StandaloneSchedulerBackend: Granted executor ID app-20200619044436-0006/8 on hostPort 172.30.161.23:45683 with 1 core(s), 1024.0 MB RAM
20/06/19 04:44:47 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/8 is now RUNNING
20/06/19 04:44:47 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/7 is now EXITED (Command exited with code 1)
20/06/19 04:44:47 INFO StandaloneSchedulerBackend: Executor app-20200619044436-0006/7 removed: Command exited with code 1
20/06/19 04:44:47 INFO BlockManagerMaster: Removal of executor 7 requested
20/06/19 04:44:47 INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster.
20/06/19 04:44:47 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 7
20/06/19 04:44:47 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200619044436-0006/9 on worker-20200619005448-172.30.161.25-38829 (172.30.161.25:38829) with 1 core(s)
20/06/19 04:44:47 INFO StandaloneSchedulerBackend: Granted executor ID app-20200619044436-0006/9 on hostPort 172.30.161.25:38829 with 1 core(s), 1024.0 MB RAM
20/06/19 04:44:47 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/9 is now RUNNING
20/06/19 04:44:49 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/8 is now EXITED (Command exited with code 1)
20/06/19 04:44:49 INFO StandaloneSchedulerBackend: Executor app-20200619044436-0006/8 removed: Command exited with code 1
20/06/19 04:44:49 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200619044436-0006/10 on worker-20200619005407-172.30.161.23-45683 (172.30.161.23:45683) with 1 core(s)
20/06/19 04:44:49 INFO StandaloneSchedulerBackend: Granted executor ID app-20200619044436-0006/10 on hostPort 172.30.161.23:45683 with 1 core(s), 1024.0 MB RAM
20/06/19 04:44:49 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
20/06/19 04:44:49 INFO BlockManagerMaster: Removal of executor 8 requested
20/06/19 04:44:49 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 8
20/06/19 04:44:49 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/10 is now RUNNING
20/06/19 04:44:50 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/9 is now EXITED (Command exited with code 1)
20/06/19 04:44:50 INFO StandaloneSchedulerBackend: Executor app-20200619044436-0006/9 removed: Command exited with code 1
20/06/19 04:44:50 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200619044436-0006/11 on worker-20200619005448-172.30.161.25-38829 (172.30.161.25:38829) with 1 core(s)
20/06/19 04:44:50 INFO StandaloneSchedulerBackend: Granted executor ID app-20200619044436-0006/11 on hostPort 172.30.161.25:38829 with 1 core(s), 1024.0 MB RAM
20/06/19 04:44:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
20/06/19 04:44:50 INFO BlockManagerMaster: Removal of executor 9 requested
20/06/19 04:44:50 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 9
20/06/19 04:44:50 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/11 is now RUNNING
20/06/19 04:44:52 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200619044436-0006/10 is now EXITED (Command exited with code 1)
20/06/19 04:44:52 INFO StandaloneSchedulerBackend: Executor app-20200619044436-0006/10 removed: Command exited with code 1
20/06/19 04:44:52 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200619044436-0006/12 on worker-20200619005407-172.30.161.23-45683 (172.30.161.23:45683) with 1 core(s)
20/06/19 04:44:52 INFO StandaloneSchedulerBackend: Granted executor ID app-20200619044436-0006/12 on hostPort 172.30.161.23:45683 with 1 core(s), 1024.0 MB RAM
20/06/19 04:44:52 INFO BlockManagerMaster: Removal of executor 10 requested

This is the error from one of the workers:

20/06/19 04:44:47 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=39057" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@cloudant-spark-app-59dcc87d58-r4cxr:39057" "--executor-id" "8" "--hostname" "172.30.161.23" "--cores" "1" "--app-id" "app-20200619044436-0006" "--worker-url" "spark://Worker@172.30.161.23:45683"
20/06/19 04:44:49 INFO Worker: Executor app-20200619044436-0006/8 finished with state EXITED message Command exited with code 1 exitStatus 1
20/06/19 04:44:49 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 8
20/06/19 04:44:49 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20200619044436-0006, execId=8)
20/06/19 04:44:49 INFO Worker: Asked to launch executor app-20200619044436-0006/10 for Cloudant Spark SQL Example with Dataframe
20/06/19 04:44:49 INFO SecurityManager: Changing view acls to: spark
20/06/19 04:44:49 INFO SecurityManager: Changing modify acls to: spark
20/06/19 04:44:49 INFO SecurityManager: Changing view acls groups to: 
20/06/19 04:44:49 INFO SecurityManager: Changing modify acls groups to: 
20/06/19 04:44:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
20/06/19 04:44:49 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=39057" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@cloudant-spark-app-59dcc87d58-r4cxr:39057" "--executor-id" "10" "--hostname" "172.30.161.23" "--cores" "1" "--app-id" "app-20200619044436-0006" "--worker-url" "spark://Worker@172.30.161.23:45683"
20/06/19 04:44:52 INFO Worker: Executor app-20200619044436-0006/10 finished with state EXITED message Command exited with code 1 exitStatus 1
20/06/19 04:44:52 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 10
20/06/19 04:44:52 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20200619044436-0006, execId=10)
20/06/19 04:44:52 INFO Worker: Asked to launch executor app-20200619044436-0006/12 for Cloudant Spark SQL Example with Dataframe
20/06/19 04:44:52 INFO SecurityManager: Changing view acls to: spark
20/06/19 04:44:52 INFO SecurityManager: Changing modify acls to: spark
20/06/19 04:44:52 INFO SecurityManager: Changing view acls groups to: 
20/06/19 04:44:52 INFO SecurityManager: Changing modify acls groups to: 
20/06/19 04:44:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
20/06/19 04:44:52 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=39057" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@cloudant-spark-app-59dcc87d58-r4cxr:39057" "--executor-id" "12" "--hostname" "172.30.161.23" "--cores" "1" "--app-id" "app-20200619044436-0006" "--worker-url" "spark://Worker@172.30.161.23:45683"
20/06/19 04:44:54 INFO Worker: Executor app-20200619044436-0006/12 finished with state EXITED message Command exited with code 1 exitStatus 1
20/06/19 04:44:54 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 12
20/06/19 04:44:54 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20200619044436-0006, execId=12)
20/06/19 04:44:54 INFO Worker: Asked to launch executor app-20200619044436-0006/14 for Cloudant Spark SQL Example with Dataframe
20/06/19 04:44:54 INFO SecurityManager: Changing view acls to: spark
20/06/19 04:44:54 INFO SecurityManager: Changing modify acls to: spark
20/06/19 04:44:54 INFO SecurityManager: Changing view acls groups to: 
20/06/19 04:44:54 INFO SecurityManager: Changing modify acls groups to: 
20/06/19 04:44:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()

Here is the spark-submit command:

./bin/spark-submit --class ReadFromCloudant --master spark://ews-spark-master-0.ews-spark-headless.default.svc.cluster.local:7077 exp-spark-in-scala-assembly-0.1.0-SNAPSHOT.jar
@mohamedih
Author

mohamedih commented Jun 19, 2020

Here are the Kubernetes resources for the driver process:

apiVersion: v1
kind: Service
metadata:
  name: cloudant-spark-app
  labels:
    app: cloudant-spark-app
spec:
  ports:
  - port: 4040
    name: http
    protocol: TCP
    targetPort: 4040
  selector:
    app: cloudant-spark-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudant-spark-app
  labels:
    app: cloudant-spark-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudant-spark-app
  template:
    metadata:
      labels:
        app: cloudant-spark-app
    spec:
      containers:
      - name: cloudant-spark-app
        image: fxo-cio-events-dev-docker-local.artifactory.swg-devops.com/spark_cloudant_to_db2:0.9
        imagePullPolicy: Always
        ports:
        - containerPort: 4040
      imagePullSecrets:
      - name: artifactory-secret

The other example that I used:


apiVersion: v1
kind: Service
metadata:
  name: db2-cloudant-job
spec:
  clusterIP: None
  selector:
    app: db2-cloudant-job
---
apiVersion: v1
kind: Pod
metadata:
  name: db2-cloudant-job
  labels:
    billingType: "monthly" # optional params [hourly | monthly (default)]
    #region: us-south # Example: us-south
    #zone: dal12 # Example: dal13
spec:
  imagePullSecrets:
  - name: artifactory-secret
  containers:
  - image: fxo-cio-events-dev-docker-local.artifactory.swg-devops.com/spark_cloudant_to_db2:0.9
    imagePullPolicy: Always
    name: container-name

@andresbono
Member

Hi @mohamedih, did you try the commands shown in the chart notes? You can obtain them with:

$ helm get notes <your_release>

It will show how to execute spark-submit with a sample jar file.

@mohamedih
Author

@andresbono it shows commands that use the workers directly, but in my case I will not submit jobs from the workers. I'm not sure what the actual problem is.

@andresbono
Member

Hi @mohamedih, we have identified some issues related to the submit command from remote nodes. We are going to investigate it and we will try to find a solution. We will update this issue when we have more information.

@andresbono added the on-hold label on Jun 22, 2020
@andresbono
Member

FYI @mohamedih, #2946 (comment). You might find it useful.

@MohamedKari
Copy link

Hi,

I am experiencing the exact same thing on Minikube.

It works when submitting from a worker, but not from a newly created pod:

helm install spark-bitnami bitnami/spark
kubectl run -i --tty spark-interactive --image=bitnami/spark:3.0.1-debian-10-r12 -- bash
spark-submit --master spark://spark-bitnami-master-svc:7077 --class org.apache.spark.examples.SparkPi /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.0.1.jar 5

results (on the submitting container) in

20/10/04 21:53:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/10/04 21:53:12 INFO SparkContext: Running Spark version 3.0.1
20/10/04 21:53:12 INFO ResourceUtils: ==============================================================
20/10/04 21:53:12 INFO ResourceUtils: Resources for spark.driver:

20/10/04 21:53:12 INFO ResourceUtils: ==============================================================
20/10/04 21:53:12 INFO SparkContext: Submitted application: Spark Pi
20/10/04 21:53:12 INFO SecurityManager: Changing view acls to: spark
20/10/04 21:53:12 INFO SecurityManager: Changing modify acls to: spark
20/10/04 21:53:12 INFO SecurityManager: Changing view acls groups to: 
20/10/04 21:53:12 INFO SecurityManager: Changing modify acls groups to: 
20/10/04 21:53:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
20/10/04 21:53:13 INFO Utils: Successfully started service 'sparkDriver' on port 36171.
20/10/04 21:53:13 INFO SparkEnv: Registering MapOutputTracker
20/10/04 21:53:13 INFO SparkEnv: Registering BlockManagerMaster
20/10/04 21:53:13 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/10/04 21:53:13 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/10/04 21:53:13 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/10/04 21:53:13 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-6ca773b7-e641-46ed-bb53-fcac323696dc
20/10/04 21:53:13 INFO MemoryStore: MemoryStore started with capacity 413.9 MiB
20/10/04 21:53:13 INFO SparkEnv: Registering OutputCommitCoordinator
20/10/04 21:53:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/10/04 21:53:13 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://spark-interactive:4040
20/10/04 21:53:13 INFO SparkContext: Added JAR file:/opt/bitnami/spark/examples/jars/spark-examples_2.12-3.0.1.jar at spark://spark-interactive:36171/jars/spark-examples_2.12-3.0.1.jar with timestamp 1601848393418
20/10/04 21:53:13 WARN SparkContext: Please ensure that the number of slots available on your executors is limited by the number of cores to task cpus and not another custom resource. If cores is not the limiting resource then dynamic allocation will not work properly!
20/10/04 21:53:13 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-bitnami-master-svc:7077...
20/10/04 21:53:13 INFO TransportClientFactory: Successfully created connection to spark-bitnami-master-svc/10.107.88.54:7077 after 61 ms (0 ms spent in bootstraps)
20/10/04 21:53:13 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20201004215313-0006
20/10/04 21:53:13 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201004215313-0006/0 on worker-20201004212445-172.18.0.8-44203 (172.18.0.8:44203) with 1 core(s)
20/10/04 21:53:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46245.
20/10/04 21:53:13 INFO NettyBlockTransferService: Server created on spark-interactive:46245
20/10/04 21:53:13 INFO StandaloneSchedulerBackend: Granted executor ID app-20201004215313-0006/0 on hostPort 172.18.0.8:44203 with 1 core(s), 1024.0 MiB RAM
20/10/04 21:53:13 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/10/04 21:53:13 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201004215313-0006/1 on worker-20201004212414-172.18.0.7-43745 (172.18.0.7:43745) with 1 core(s)
20/10/04 21:53:13 INFO StandaloneSchedulerBackend: Granted executor ID app-20201004215313-0006/1 on hostPort 172.18.0.7:43745 with 1 core(s), 1024.0 MiB RAM
20/10/04 21:53:13 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, spark-interactive, 46245, None)
20/10/04 21:53:13 INFO BlockManagerMasterEndpoint: Registering block manager spark-interactive:46245 with 413.9 MiB RAM, BlockManagerId(driver, spark-interactive, 46245, None)
20/10/04 21:53:13 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, spark-interactive, 46245, None)
20/10/04 21:53:13 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, spark-interactive, 46245, None)
20/10/04 21:53:13 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/0 is now RUNNING
20/10/04 21:53:13 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/1 is now RUNNING
20/10/04 21:53:14 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/10/04 21:53:15 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
20/10/04 21:53:15 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 5 output partitions
20/10/04 21:53:15 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/10/04 21:53:15 INFO DAGScheduler: Parents of final stage: List()
20/10/04 21:53:15 INFO DAGScheduler: Missing parents: List()
20/10/04 21:53:15 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/10/04 21:53:15 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.1 KiB, free 413.9 MiB)
20/10/04 21:53:15 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1816.0 B, free 413.9 MiB)
20/10/04 21:53:15 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-interactive:46245 (size: 1816.0 B, free: 413.9 MiB)
20/10/04 21:53:15 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1223
20/10/04 21:53:15 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
20/10/04 21:53:15 INFO TaskSchedulerImpl: Adding task set 0.0 with 5 tasks
20/10/04 21:53:16 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/0 is now EXITED (Command exited with code 1)
20/10/04 21:53:16 INFO StandaloneSchedulerBackend: Executor app-20201004215313-0006/0 removed: Command exited with code 1
20/10/04 21:53:16 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201004215313-0006/2 on worker-20201004212445-172.18.0.8-44203 (172.18.0.8:44203) with 1 core(s)
20/10/04 21:53:16 INFO StandaloneSchedulerBackend: Granted executor ID app-20201004215313-0006/2 on hostPort 172.18.0.8:44203 with 1 core(s), 1024.0 MiB RAM
20/10/04 21:53:16 INFO BlockManagerMaster: Removal of executor 0 requested
20/10/04 21:53:16 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
20/10/04 21:53:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
20/10/04 21:53:16 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/1 is now EXITED (Command exited with code 1)
20/10/04 21:53:16 INFO StandaloneSchedulerBackend: Executor app-20201004215313-0006/1 removed: Command exited with code 1
20/10/04 21:53:16 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201004215313-0006/3 on worker-20201004212414-172.18.0.7-43745 (172.18.0.7:43745) with 1 core(s)
20/10/04 21:53:16 INFO StandaloneSchedulerBackend: Granted executor ID app-20201004215313-0006/3 on hostPort 172.18.0.7:43745 with 1 core(s), 1024.0 MiB RAM
20/10/04 21:53:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/10/04 21:53:16 INFO BlockManagerMaster: Removal of executor 1 requested
20/10/04 21:53:16 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
20/10/04 21:53:16 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/2 is now RUNNING
20/10/04 21:53:16 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/3 is now RUNNING
20/10/04 21:53:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/2 is now EXITED (Command exited with code 1)
20/10/04 21:53:18 INFO StandaloneSchedulerBackend: Executor app-20201004215313-0006/2 removed: Command exited with code 1
20/10/04 21:53:18 INFO BlockManagerMaster: Removal of executor 2 requested
20/10/04 21:53:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
20/10/04 21:53:18 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201004215313-0006/4 on worker-20201004212445-172.18.0.8-44203 (172.18.0.8:44203) with 1 core(s)
20/10/04 21:53:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20201004215313-0006/4 on hostPort 172.18.0.8:44203 with 1 core(s), 1024.0 MiB RAM
20/10/04 21:53:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/10/04 21:53:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/4 is now RUNNING
20/10/04 21:53:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/3 is now EXITED (Command exited with code 1)
20/10/04 21:53:18 INFO StandaloneSchedulerBackend: Executor app-20201004215313-0006/3 removed: Command exited with code 1
20/10/04 21:53:18 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201004215313-0006/5 on worker-20201004212414-172.18.0.7-43745 (172.18.0.7:43745) with 1 core(s)
20/10/04 21:53:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
20/10/04 21:53:18 INFO BlockManagerMaster: Removal of executor 3 requested
20/10/04 21:53:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
20/10/04 21:53:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20201004215313-0006/5 on hostPort 172.18.0.7:43745 with 1 core(s), 1024.0 MiB RAM
20/10/04 21:53:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/5 is now RUNNING
20/10/04 21:53:20 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/4 is now EXITED (Command exited with code 1)
20/10/04 21:53:20 INFO StandaloneSchedulerBackend: Executor app-20201004215313-0006/4 removed: Command exited with code 1
20/10/04 21:53:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
20/10/04 21:53:20 INFO BlockManagerMaster: Removal of executor 4 requested
20/10/04 21:53:20 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201004215313-0006/6 on worker-20201004212445-172.18.0.8-44203 (172.18.0.8:44203) with 1 core(s)
20/10/04 21:53:20 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
20/10/04 21:53:20 INFO StandaloneSchedulerBackend: Granted executor ID app-20201004215313-0006/6 on hostPort 172.18.0.8:44203 with 1 core(s), 1024.0 MiB RAM
20/10/04 21:53:20 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/6 is now RUNNING
20/10/04 21:53:20 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/5 is now EXITED (Command exited with code 1)
20/10/04 21:53:20 INFO StandaloneSchedulerBackend: Executor app-20201004215313-0006/5 removed: Command exited with code 1
20/10/04 21:53:20 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20201004215313-0006/7 on worker-20201004212414-172.18.0.7-43745 (172.18.0.7:43745) with 1 core(s)
20/10/04 21:53:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
20/10/04 21:53:20 INFO StandaloneSchedulerBackend: Granted executor ID app-20201004215313-0006/7 on hostPort 172.18.0.7:43745 with 1 core(s), 1024.0 MiB RAM
20/10/04 21:53:20 INFO BlockManagerMaster: Removal of executor 5 requested
20/10/04 21:53:20 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 5
20/10/04 21:53:20 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20201004215313-0006/7 is now RUNNING
^C20/10/04 21:53:20 INFO SparkContext: Invoking stop() from shutdown hook
20/10/04 21:53:20 INFO SparkUI: Stopped Spark web UI at http://spark-interactive:4040
20/10/04 21:53:20 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:38, took 5.619530 s

On the master:

20/10/04 21:53:13 INFO Master: Registering app Spark Pi
20/10/04 21:53:13 INFO Master: Registered app Spark Pi with ID app-20201004215313-0006
20/10/04 21:53:13 INFO Master: Launching executor app-20201004215313-0006/0 on worker worker-20201004212445-172.18.0.8-44203
20/10/04 21:53:13 INFO Master: Launching executor app-20201004215313-0006/1 on worker worker-20201004212414-172.18.0.7-43745
20/10/04 21:53:16 INFO Master: Removing executor app-20201004215313-0006/0 because it is EXITED
20/10/04 21:53:16 INFO Master: Launching executor app-20201004215313-0006/2 on worker worker-20201004212445-172.18.0.8-44203
20/10/04 21:53:16 INFO Master: Removing executor app-20201004215313-0006/1 because it is EXITED
20/10/04 21:53:16 INFO Master: Launching executor app-20201004215313-0006/3 on worker worker-20201004212414-172.18.0.7-43745
20/10/04 21:53:18 INFO Master: Removing executor app-20201004215313-0006/2 because it is EXITED
20/10/04 21:53:18 INFO Master: Launching executor app-20201004215313-0006/4 on worker worker-20201004212445-172.18.0.8-44203
20/10/04 21:53:18 INFO Master: Removing executor app-20201004215313-0006/3 because it is EXITED
20/10/04 21:53:18 INFO Master: Launching executor app-20201004215313-0006/5 on worker worker-20201004212414-172.18.0.7-43745
20/10/04 21:53:20 INFO Master: Removing executor app-20201004215313-0006/4 because it is EXITED
20/10/04 21:53:20 INFO Master: Launching executor app-20201004215313-0006/6 on worker worker-20201004212445-172.18.0.8-44203
20/10/04 21:53:20 INFO Master: Removing executor app-20201004215313-0006/5 because it is EXITED
20/10/04 21:53:20 INFO Master: Launching executor app-20201004215313-0006/7 on worker worker-20201004212414-172.18.0.7-43745
20/10/04 21:53:20 INFO Master: Received unregister request from application app-20201004215313-0006
20/10/04 21:53:20 INFO Master: Removing app app-20201004215313-0006
20/10/04 21:53:20 WARN Master: Got status update for unknown executor app-20201004215313-0006/6
20/10/04 21:53:20 WARN Master: Got status update for unknown executor app-20201004215313-0006/7
20/10/04 21:53:20 INFO Master: 172.18.0.5:47218 got disassociated, removing it.
20/10/04 21:53:20 INFO Master: spark-interactive:36171 got disassociated, removing it.

I've been working on this for hours but didn't find a way to fix it. Any ideas?

Thanks, Mo

@MohamedKari

MohamedKari commented Oct 4, 2020

Finally, I think I got somewhere. The executor can't resolve the driver pod's hostname to an IP (for pod hostname resolution, see https://stackoverflow.com/questions/59258223/how-to-resolve-pod-hostnames-from-other-pods). Therefore, passing the driver IP explicitly using --conf spark.driver.host=172.18.0.5 does the trick:

spark-submit --master spark://spark-bitnami-master-svc:7077 --class org.apache.spark.examples.SparkPi --conf spark.driver.host=172.18.0.5 /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.0.1.jar 5

or

pyspark --master spark://spark-bitnami-master-svc:7077  --conf spark.driver.host=172.18.0.5

Now, the executor can respond correctly.
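To avoid hardcoding the pod IP, the same value can be derived inside the submitting pod at launch time. A minimal sketch, assuming a Linux image where `hostname -i` prints the pod IP (the master URL and jar path are the ones from the commands above):

```shell
# Resolve this pod's own IP instead of hardcoding it.
# Assumption: `hostname -i` prints the pod IP (possibly several addresses;
# we take the first). Fall back to the plain hostname if it fails.
DRIVER_HOST=$(hostname -i 2>/dev/null | awk '{print $1}')
DRIVER_HOST=${DRIVER_HOST:-$(hostname)}
echo "driver host: $DRIVER_HOST"

# It could then be passed through as (not executed here):
#   spark-submit --master spark://spark-bitnami-master-svc:7077 \
#     --conf spark.driver.host=$DRIVER_HOST \
#     --class org.apache.spark.examples.SparkPi \
#     /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.0.1.jar 5
```

This keeps the submit command portable across pod restarts, since the pod IP changes each time the pod is rescheduled.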

@mohamedih
Author

@MohamedKari try adding "spark.driver.host" to your spark-submit command, where the value is the internal hostname of your Spark driver's service.

example :
CMD ["./bin/spark-submit" ,"--class", "ReadFromCloudant","--conf", "spark.driver.host=db2-to-cloudant-spark-app.ews-spark","--master", "spark://spark33-2439-44-master-svc.ews-spark:7077", "--driver-class-path", "jars/db2jcc4.jar", "--jars", "jars/db2jcc4.jar", "exp-spark-in-scala-assembly-0.1.0-SNAPSHOT.jar"]
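For the hostname in `spark.driver.host` above to be resolvable by the executors, a headless Service has to select the driver pod. A minimal sketch, with the service name and `ews-spark` namespace taken from the command above; the `app` label is an assumption and must match whatever labels the driver pod actually carries:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db2-to-cloudant-spark-app
  namespace: ews-spark
spec:
  clusterIP: None                      # headless: DNS resolves directly to the pod IP
  selector:
    app: db2-to-cloudant-spark-app     # assumed label on the driver pod
```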

@andresbono
Member

Thanks for this valuable information! We will try this fix for the chart. Contributions are more than welcome, in case you want to draft a PR addressing the changes.

@MohamedKari

MohamedKari commented Oct 8, 2020

@andresbono, thinking about it, it's not really a problem with the Bitnami Spark image, I think.

Doing this with native Spark on Kubernetes (not using the Bitnami image; I was comparing deployment alternatives), I came up with the following setup:

The headless-service setup follows https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode-networking.

apiVersion: v1
kind: Pod
metadata:
  name: spark-notebook
  labels:
    role: spark-notebook
spec:
  serviceAccountName: spark
  containers:
  - name: spark-notebook
    image: mokari94/spark-app:latest
    env:
      - name: HOME # overwrites HOME so Jupyter will for sure have permission
        value: "/tmp/.jupyter"
      - name: PYSPARK_DRIVER_PYTHON
        value: "jupyter"
      - name: PYSPARK_DRIVER_PYTHON_OPTS
        value: "notebook . --ip 0.0.0.0"
    command: 
      - /opt/spark/bin/pyspark 
      - --master 
      - k8s://https://kubernetes:443 
      - --conf 
      - spark.jars.ivy=/tmp/.ivy 
      - --conf 
      - spark.kubernetes.container.image=mokari94/spark-app:latest 
      - --conf
      - spark.driver.port=40694
      - --conf 
      - spark.driver.host=spark-notebook-service 
      # The above is the name of the headless service
      # executors will need to send their communication to the driver process
      # which we want to be on the Jupyter server.
      # However, pod names are not directly resolvable to the host.
      # So instead, we use a service that then routes to the pod.
      # Instead of using a "standard" service that can loadbalance across multiple pods, 
      # we use a headless service that simply passes through requests to the pod that matches the selector.
      - --conf 
      - spark.kubernetes.driver.pod.name=spark-notebook

    ports:
    - containerPort: 8888
    - containerPort: 40694
---
apiVersion: v1
kind: Service
metadata:
  name: spark-notebook-service
spec:
  clusterIP: None
  selector:
    role: spark-notebook
  ports:
    - protocol: TCP
      port: 8888
      targetPort: 8888
      name: jupyter-web-ui
    - protocol: TCP
      port: 40694
      targetPort: 40694
      name: spark-driver-port

So, I guess, including a note in the image docs should be enough to close this issue. Pondering over it a bit conceptually, I don't see how the image itself could be changed, because it's really a problem of deploying the driver, isn't it? What do you think?

@yangjinlogic

I met an error. This is my command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://10.96.254.134:7077 --deploy-mode cluster file:///examples/jars/spark-examples_2.12-3.0.1.jar 1000

but it shows this:

java.nio.file.NoSuchFileException: /examples/jars/spark-examples_2.12-3.0.1.jar

The working directory for the command is /opt/spark.
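A possible cause, sketched under the assumption that the jar ships inside the image: with a file:// URI the path is resolved by the process that launches the driver, while a local:// URI tells Spark the path already exists on each node's filesystem. The jar location under /opt/spark below is an assumption based on the stated working directory:

```shell
# Dry-run sketch (echo only): submit the bundled example jar with a local://
# URI so each node resolves the path on its own filesystem.
JAR=/opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar   # assumed location
echo ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://10.96.254.134:7077 \
  --deploy-mode cluster \
  "local://$JAR" 1000
```

Drop the leading echo to actually submit, after confirming the jar really exists at that path inside the image.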

@andresbono
Member

Hi @MohamedKari, I agree with that, the container image itself doesn't need changes. We can change the chart, or at least the chart documentation to explain how to connect to Spark from remote nodes.

@andresbono
Member

Hi @yangjinlogic, are you using the Bitnami Chart for Spark? It seems like you are using other container images.

@brsolomon-deloitte
Contributor

Took me a minute to understand what was being said in the comments above; it's that spark.driver.host needs to be the IP of the client submitting the job. Programmatically, that is:

./bin/run-example \
  --master spark://spark-master-svc:7077 \
  --deploy-mode client \
  --conf spark.driver.host=$(hostname -i) \
  SparkPi

@HeisenbergPPT

Finally, I think, I got somewhere. The executor can't resolve the Pod hostname to the IP (for Pod hostname resolution, see https://stackoverflow.com/questions/59258223/how-to-resolve-pod-hostnames-from-other-pods). Therefore, passing the driver IP explicitly using --conf spark.driver.host=172.18.0.5 will do the trick:

spark-submit --master spark://spark-bitnami-master-svc:7077 --class org.apache.spark.examples.SparkPi --conf spark.driver.host=172.18.0.5 /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.0.1.jar 5

or

pyspark --master spark://spark-bitnami-master-svc:7077  --conf spark.driver.host=172.18.0.5

Now, the executor can respond correctly.

thank you so much, this really helped me ! after adding spark.driver.host conf, it finally started to distribute tasks to other workers!

@rafariossaa
Contributor

I am glad that helped you.
Could I close this issue?

@HeisenbergPPT

HeisenbergPPT commented Oct 11, 2022 via email

@carrodher
Member

Unfortunately, this issue was created a long time ago, and although there is an internal task to fix it, it was not prioritized for the short/mid term. It's not a technical reason but one of capacity, since we're a small team.

That being said, contributions via PRs are more than welcome in both repositories (containers and charts), in case you would like to contribute.

Since then there have been several releases of this asset, and it's possible the issue has been resolved as part of other changes. If that's not the case and you are still experiencing this issue, please feel free to reopen it and we will re-evaluate it.

@bitnami-bot bitnami-bot added this to Solved in Support Oct 20, 2022
@github-actions github-actions bot moved this from Solved to Pending in Support Oct 20, 2022
@carrodher carrodher moved this from Pending to Solved in Support Oct 20, 2022
@mxcolin

mxcolin commented Dec 1, 2022


Took me a minute to understand what was being said in the comments above; it's that spark.driver.host needs to be the IP of the client submitting the job. Programmatically, that is:

./bin/run-example \
  --master spark://spark-master-svc:7077 \
  --deploy-mode client \
  --conf spark.driver.host=$(hostname -i) \
  SparkPi

I'm at just this point in working this all out, but when I use

spark.driver.host=$(hostname -i)

I get

22/12/01 01:32:01 ERROR SparkContext: Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
        at sun.nio.ch.Net.checkAddress(Net.java:104)
        at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:217)
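A guess at the cause (the full trace isn't shown): on some systems `hostname -i` prints several addresses, or an address from an /etc/hosts entry that Spark cannot bind, and the unusable value then surfaces as UnresolvedAddressException. A hedged sketch that picks the first plain IPv4 address instead:

```shell
# Pick the first IPv4 address from `hostname -i`; fall back to loopback so the
# variable is never empty or a multi-address string Spark cannot bind.
DRIVER_HOST=$(hostname -i 2>/dev/null | tr ' ' '\n' \
  | grep -m1 -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' || echo 127.0.0.1)
echo "spark.driver.host=$DRIVER_HOST"
```

Then pass --conf spark.driver.host=$DRIVER_HOST to spark-submit as in the earlier examples.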

@fmulero fmulero removed this from Solved in Support Jan 18, 2023