task submitted never be executed using dask-yarn #1285

Closed
zwang1986 opened this issue Jul 25, 2017 · 8 comments

Comments

@zwang1986

Hi! I got this error after configuration.

>>> from dask_yarn import YARNCluster
>>> cluster = YARNCluster()
ec2-54-222-217-210.cn-north-1.compute.amazonaws.com.cn 9000
ec2-54-222-217-210.cn-north-1.compute.amazonaws.com.cn 8088
>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> cluster.start(2, cpus=1, memory=500)
tcp://172.31.4.217:45665

ClientConfig(127.0.0.1,38049)
41371
2017-07-24 23:04:48,000 INFO  [Thread-3] knit.Client$ (Client.scala:start(80)) - Staring Application Master
Attemping upload of /usr/local/lib/python2.7/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar
Uploading resource file:/usr/local/lib/python2.7/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar -> hdfs://ec2-54-222-217-210.cn-north-1.compute.amazonaws.com.cn:9000/user/root/.knitDeps/knit-1.0-SNAPSHOT.jarhdfs://ec2-54-222-217-210.cn-north-1.compute.amazonaws.com.cn:9000/user/root/.knitDeps/knit-1.0-SNAPSHOT.jar
2017-07-24 23:04:50,396 INFO  [Thread-3] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at ec2-54-222-217-210.cn-north-1.compute.amazonaws.com.cn/172.31.4.217:8032
Attemping upload of /usr/local/lib/python2.7/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip
Uploading resource file:/usr/local/lib/python2.7/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip -> hdfs://ec2-54-222-217-210.cn-north-1.compute.amazonaws.com.cn:9000/user/root/.knitDeps/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip2017-07-24 23:05:19,884 INFO  [Thread-3] knit.Client$ (Client.scala:start(171)) - Submitting application application_1500743829967_0030
2017-07-24 23:05:19,912 INFO  [Thread-3] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(251)) - Submitted application application_1500743829967_0030
172.31.15.100 38627
u'application_1500743829967_0030'
>>> 
>>> future = client.submit(lambda x: x + 1, 10)
>>> future.result()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/distributed/client.py", line 155, in result
    six.reraise(*result)
  File "<string>", line 2, in reraise
distributed.scheduler.KilledWorker: ('<lambda>-1755a7b8598406e265550c6e0de3412d', u'tcp://172.31.15.100:38951')

It seems that the submitted task cannot run. Inspecting the client shows that it has no resources:

>>> client
<Client: scheduler='tcp://172.31.4.217:45665' processes=0 cores=0>
>>> 

Running yarn logs -applicationId application_1500743829967_0030 gives:


2017-07-24 23:07:15,058 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at ec2-54-222-217-210.cn-north-1.compute.amazonaws.com.cn/172.31.4.217:8032
/tmp/logs/root/logs/application_1500743829967_0030does not have any log files.

Thanks for your help in advance!

@martindurant
Member

Please join us at https://github.com/dask/knit/issues, where I am trying to iron out issues around knit/dask-yarn.
For your specific issue, can you determine whether any of the worker containers started? The simplest type of failure is that YARN was not able to allocate the memory/CPUs necessary for the workers.
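
One way to check the allocation question is to ask the ResourceManager how much capacity is free. The sketch below assumes the YARN REST endpoint `/ws/v1/cluster/metrics` is reachable on the web port (8088 in the output above) and that its `clusterMetrics` object carries `availableMB` and `availableVirtualCores`; please verify both against your Hadoop version. `can_allocate` is a hypothetical helper, not part of knit or dask-yarn.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_host, rm_port=8088):
    """Fetch cluster metrics from the YARN ResourceManager REST API."""
    url = "http://%s:%d/ws/v1/cluster/metrics" % (rm_host, rm_port)
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))["clusterMetrics"]


def can_allocate(metrics, n_workers, cpus, memory_mb):
    """Return True if the free resources cover n_workers containers.

    Hypothetical helper: it ignores the application master container
    and YARN's rounding up to the minimum allocation size.
    """
    need_mb = n_workers * memory_mb
    need_cores = n_workers * cpus
    return (metrics["availableMB"] >= need_mb
            and metrics["availableVirtualCores"] >= need_cores)
```

For the request in this issue, `can_allocate(fetch_cluster_metrics(host), 2, cpus=1, memory_mb=500)` would be the corresponding call. A `False` result points at resource limits; a `True` result only rules that cause out.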

@zwang1986
Author

zwang1986 commented Jul 25, 2017

Moving to the issue "container cannot start dask workers".

@martindurant Yes, I double-checked the YARN log after posting this issue. No worker was started by the containers. Part of the log:
2017-07-25 01:31:19,759 INFO [main] client.RMProxy (RMProxy.java:createRMProxy(98)) - Connecting to ResourceManager at ec2-54-222-217-210.cn-north-1.compute.amazonaws.com.cn/172.31.4.217:8032

Container: container_1500743829967_0030_01_000001 on ip-172-31-15-100.cn-north-1.compute.internal_46203

=========================================================================================================
LogType:stderr
Log Upload Time:24-Jul-2017 23:31:34
LogLength:704
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/ephemeral-hdfs/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/yarn-local/filecache/48/knit-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (io.continuum.knit.ApplicationMaster$).
log4j:WARN Please initialize the log4j system properly.
LogType:stdout
Log Upload Time:24-Jul-2017 23:31:34
LogLength:0
Log Contents:

Container: container_1500743829967_0030_01_000002 on ip-172-31-15-100.cn-north-1.compute.internal_46203

=========================================================================================================
LogType:stderr
Log Upload Time:24-Jul-2017 23:31:34
LogLength:10239950
Log Contents:
distributed.nanny - INFO - Start Nanny at: 'tcp://172.31.15.100:46715'
distributed.worker - INFO - Start worker at: tcp://172.31.15.100:42065
distributed.worker - INFO - nanny at: 172.31.15.100:46715
distributed.worker - INFO - http at: 172.31.15.100:39361
distributed.worker - INFO - bokeh at: 172.31.15.100:8789
distributed.worker - INFO - Waiting to connect to: tcp://172.31.4.217:45665
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 0.50 GB
distributed.worker - INFO - Local Directory: worker-lqy3fgbm
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/mnt/yarn-local/usercache/root/appcache/application_1500743829967_0030/container_1500743829967_0030_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/nanny.py", line 456, in run
yield worker._start(*worker_start_args)

@martindurant
Member

I think your traceback is truncated. Could you perhaps put it into a gist?
Already we can see that the container did start and dask-worker did run, but it did not manage to connect to the scheduler.
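
To separate a plain network problem from a protocol problem, a quick TCP check from the worker node to the scheduler address can help. This is a minimal sketch using only the standard library; `scheduler_reachable` is a hypothetical helper, and a successful connect only shows the port is open, not that the dask handshake will succeed.

```python
import socket


def scheduler_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on the worker host, `scheduler_reachable("172.31.4.217", 45665)` would test the scheduler address shown in the worker log above.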

@zwang1986
Author

zwang1986 commented Jul 25, 2017

Sure. Please find the log here: traceback.
It was trying to restart the worker several times, so I truncated the first traceback.

@martindurant
Member

@mrocklin, does this indicate a version mismatch between the workers and the scheduler?

@martindurant
Member

@zwang1986, have you had any luck trying this again? The configuration system in knit has improved, and you can now use the client's check_versions method to see whether a version mismatch was at the root of your problem.

@mrocklin
Member

ValueError: Unexpected response from register: b'OK'

Can you verify that the worker and scheduler are either both Python 2 or both Python 3?
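
The tracebacks earlier in this thread already give a hint: the client-side traceback comes from a `python2.7` site-packages directory, while the nanny in the container runs from a `python3.6` environment. The sketch below just reads the interpreter version out of such paths; `python_version_from_path` is a hypothetical helper, and the worker path is shortened to its tail for readability.

```python
import re

# Matches directory names such as "python2.7" or "python3.6"
_PY_DIR = re.compile(r"python(\d+)\.(\d+)")


def python_version_from_path(path):
    """Return (major, minor) parsed from a site-packages path, or None."""
    match = _PY_DIR.search(path)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Path from the client-side traceback in the original report
client_path = "/usr/local/lib/python2.7/site-packages/distributed/client.py"
# Tail of the worker-side path from the container's stderr log
worker_path = "lib/python3.6/site-packages/distributed/nanny.py"

client_version = python_version_from_path(client_path)
worker_version = python_version_from_path(worker_path)
same_major = client_version[0] == worker_version[0]
print("client:", client_version, "worker:", worker_version,
      "same major version:", same_major)
```

This only inspects path names, so it is a heuristic; running `python --version` in each environment is the more direct check.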

@jcrist
Member

jcrist commented Feb 6, 2019

dask-yarn has been rewritten and is much more robust (see http://yarn.dask.org/en/latest/ for docs). It is likely that your issues have been resolved with the new library. Closing as stale; feel free to open a new issue in the appropriate repo if you still have issues.

@jcrist jcrist closed this as completed Feb 6, 2019