distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero #6395

Closed
fjetter opened this issue May 20, 2022 · 15 comments · Fixed by #6590 or #6822
@fjetter (Member) commented May 20, 2022

This test is evolving into a frequent offender.

Short test report (failures on main)

https://github.com/dask/distributed/actions/runs/2353756567
https://github.com/dask/distributed/actions/runs/2345977494

Failure on #6371
https://github.com/dask/distributed/runs/6517029607?check_suite_focus=true


self = <Popen: returncode: -9 args: ['/usr/share/miniconda3/envs/dask-distributed/b...>
timeout = 30

    def _wait(self, timeout):
        """Internal implementation of wait() on POSIX."""
        if self.returncode is not None:
            return self.returncode
    
        if timeout is not None:
            endtime = _time() + timeout
            # Enter a busy loop if we have a timeout.  This busy loop was
            # cribbed from Lib/threading.py in Thread.wait() at r71065.
            delay = 0.0005 # 500 us -> initial delay of 1 ms
            while True:
                if self._waitpid_lock.acquire(False):
                    try:
                        if self.returncode is not None:
                            break  # Another thread waited.
                        (pid, sts) = self._try_wait(os.WNOHANG)
                        assert pid == self.pid or pid == 0
                        if pid == self.pid:
                            self._handle_exitstatus(sts)
                            break
                    finally:
                        self._waitpid_lock.release()
                remaining = self._remaining_time(endtime)
                if remaining <= 0:
>                   raise TimeoutExpired(self.args, timeout)
E                   subprocess.TimeoutExpired: Command '['/usr/share/miniconda3/envs/dask-distributed/bin/dask-scheduler', '--dashboard-address', ':0']' timed out after 30 seconds
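For orientation, the shape of the code path that raises this looks roughly like the following (an illustrative sketch, not the actual test source; the real test drives the process through a popen context manager, as discussed further down):

    import signal
    import subprocess

    # Illustrative sketch: start the scheduler CLI with an ephemeral dashboard
    # port, ask it to shut down, and wait for it to exit.  If startup/shutdown
    # hangs, wait() raises the subprocess.TimeoutExpired shown above and the
    # process has to be killed (hence the -9 returncode in the Popen repr).
    proc = subprocess.Popen(
        ["dask-scheduler", "--dashboard-address", ":0"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    try:
        # ... interact with the scheduler here ...
        proc.send_signal(signal.SIGINT)
        proc.wait(timeout=30)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()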
fjetter added the flaky test label May 20, 2022
fjetter added the tests label May 25, 2022
gjoseph92 added a commit to gjoseph92/distributed that referenced this issue May 26, 2022
Tests like dask#6395 will fail (timeout) because a log statement doesn't get printed, but since you never get to see what _was_ printed, CI failures are hard to debug.

Adds a `wait_for_log_line` helper that tees the output to stdout, so you can at least see what happened.
gjoseph92 added a commit to gjoseph92/distributed that referenced this issue May 26, 2022
Tests like dask#6395 will fail (timeout) because a log statement doesn't get printed, but since you never get to see what _was_ printed, CI failures are hard to debug.

Adds a `wait_for_log_line` helper that tees the output to stdout, so you can at least see what happened.
mrocklin pushed a commit that referenced this issue May 27, 2022
Tests like #6395 will fail (timeout) because a log statement doesn't get printed, but since you never get to see what _was_ printed, CI failures are hard to debug.

Adds a `wait_for_log_line` helper that tees the output to stdout, so you can at least see what happened.
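A rough sketch of what such a tee-ing helper can look like (illustrative only; the signature and details of the helper actually added in these commits may differ):

    import sys

    def wait_for_log_line(match: bytes, stream, max_lines: int = 10_000) -> bytes:
        """Read ``stream`` line by line, echoing every line to stdout, until a
        line containing ``match`` appears; return that line."""
        for _ in range(max_lines):
            line = stream.readline()
            if not line:
                raise EOFError("stream closed before the expected log line appeared")
            sys.stdout.buffer.write(line)  # tee, so CI logs show the subprocess output
            if match in line:
                return line
        raise AssertionError(f"{match!r} not found in the first {max_lines} lines")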
@gjoseph92 (Collaborator)

We'll see what's happening in CI with #6461, but I'm wondering if this is just the typical thing where port 8787 is already in use. pytest --count 10 -n4 distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero will easily make this test fail. That's not with a TimeoutError, but maybe CI is so slow that the test times out before the process actually starts and stops?
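For context, the "port zero" in the test name means asking the OS for any free ephemeral port rather than a fixed default; a minimal illustration of that behaviour:

    import socket

    # Binding to port 0 lets the OS pick a free ephemeral port, which is what
    # `--dashboard-address :0` requests for the dashboard.
    sock = socket.socket()
    sock.bind(("127.0.0.1", 0))
    print(sock.getsockname()[1])  # prints whichever free port the OS picked
    sock.close()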

@gjoseph92 (Collaborator)

With #6461 in, we'll wait for a week to see if these subprocess.TimeoutExpired errors keep occurring; if not, we'll close this issue.

@crusaderky (Collaborator)

@fjetter (Member Author) commented Jun 23, 2022

Still an issue...

https://github.com/fjetter/distributed/runs/7023391512?check_suite_focus=true

shows a log message pattern like the following:

2022-06-23 12:39:09,197 - distributed.scheduler - INFO - -----------------------------------------------
2022-06-23 12:39:11,227 - distributed.scheduler - INFO - End scheduler

and the test fails with a "client cannot connect" error after 5 s, tearing the test down. I suspect this is where the ~5 s delay between the two info messages above comes from.
Trying to identify which lines issue these logs is quite interesting: End scheduler very clearly runs once the CLI event loop finishes, i.e. just before the CLI process terminates.
The other log message, "-" * 47, has to be the log message at the top of the CLI run command. The interesting information is that after the scheduler object is initialized we'd expect another one of these logs to appear, but this clearly doesn't happen.
Therefore, in this test the CLI process is actually stuck in Scheduler.__init__, before we get the log message "To route to workers diagnostics web server please install jupyter-server-proxy" that is typically issued in get_handlers.

So either one of the handlers blocks before it reaches the http.proxy module, or something earlier in __init__ is responsible.

@fjetter (Member Author) commented Jul 4, 2022

I'm currently investigating the above "scheduler does not start up" error a bit more closely on the branch https://github.com/fjetter/distributed/tree/run_test_dashboard_port_zero_until_it_fails

So far

  1. I added a couple of log statements on module import to trace the import of packages. This was motivated by determining that scheduler process startup is indeed stuck at this line:
    import distributed.dashboard.scheduler
  2. I already tried removing a couple of imports, but regardless of what I did, I ended up in this state as long as I imported bokeh and initialized the dashboard.
  3. I enabled faulthandler and set it up to print a traceback during scheduler init (see the sketch after this list). This indeed confirms that we're stuck in an import.
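A minimal sketch of that kind of faulthandler setup (the 20 s timeout and the exact placement are illustrative choices, not necessarily what the debug branch does):

    import faulthandler
    import sys

    # Dump the traceback of every thread to stderr if we are still inside this
    # code path after 20 seconds, without killing the process.
    faulthandler.dump_traceback_later(20, repeat=False, file=sys.stderr, exit=False)
    try:
        import distributed.dashboard.scheduler  # the import that appears to hang
    finally:
        faulthandler.cancel_dump_traceback_later()  # disarm once the import returns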

After this, I could observe two things so far:

  1. My first attempt with faulthandler suggested that I was stuck while importing pandas (specifically the tslib module; another run showed pyarrow).
  2. I could not reproduce this. After restarting the test job (and enabling GitHub debugging), I can see a very interesting message:
    https://github.com/fjetter/distributed/actions/runs/2609527766
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "d:\a\distributed\distributed\distributed\cli\dask_scheduler.py", line 234, in main
    asyncio.run(run())
  File "C:\Miniconda3\envs\dask-distributed\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Miniconda3\envs\dask-distributed\lib\asyncio\base_events.py", line 616, in run_until_complete
    return future.result()
  File "d:\a\distributed\distributed\distributed\cli\dask_scheduler.py", line 197, in run
    scheduler = Scheduler(
  File "d:\a\distributed\distributed\distributed\scheduler.py", line 2967, in __init__
    import distributed.dashboard.scheduler
  File "d:\a\distributed\distributed\distributed\dashboard\scheduler.py", line 12, in <module>
    from distributed.dashboard.components.scheduler import (
  File "d:\a\distributed\distributed\distributed\dashboard\components\scheduler.py", line 20, in <module>
    from bokeh.io import curdoc
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\io\__init__.py", line 28, in <module>
    from .saving import save
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\io\saving.py", line 34, in <module>
    from ..models.layouts import LayoutDOM
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\models\__init__.py", line 35, in <module>
    from .annotations import *
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\models\annotations.py", line 83, in <module>
    from .formatters import TickFormatter
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\models\formatters.py", line 600, in <module>
    _df_defaults = _df.properties_with_values()
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\core\has_props.py", line 555, in properties_with_values
    return self.query_properties_with_values(lambda prop: prop.serialized,
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\core\has_props.py", line 607, in query_properties_with_values
    if not query(descriptor):
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\core\has_props.py", line 555, in <lambda>
    return self.query_properties_with_values(lambda prop: prop.serialized,
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\core\property\descriptors.py", line 429, in serialized
    return self.property.serialized
SystemError: PyEval_EvalFrameEx returned a result with an error set

A couple of observations here:

  1. The SystemError is interesting. I'm not sure if this is a genuine error or something that was triggered by using faulthandler.dump_traceback_later.
  2. I'm surprised to see that we're operating on two different Windows partitions (C: with the conda env and D: with the checked-out code).
    The above error is also accompanied by a non-descriptive TypeError: expected bytes, str found error message (no proper log formatting, just a single line).

Note: Even though we have this additional debug/error information, the test is still failing with an OSError: Timed out trying to connect to tcp://127.0.0.1:51044 after 5 s, like most of our tests. Therefore, we may have actually been missing this all along?

@fjetter (Member Author) commented Jul 4, 2022

Next curiosity: we're only executing Python 3.8 and Python 3.9 jobs for macOS.

os: [ubuntu-latest, windows-latest, macos-latest]
python-version: ["3.8", "3.9", "3.10"]
# Cherry-pick test modules to split the overall runtime roughly in half
partition: [ci1, not ci1]
exclude:
  - os: macos-latest
    python-version: 3.9

I got the same on my branch as well https://github.com/fjetter/distributed/blob/b36cc0cf44f60c4491953ee9801e41b33b9788d4/.github/workflows/tests.yaml#L25-L31

However, if I inspect the actually installed versions, I can see that the Python 3.8 job is using Python 3.9!

    python             conda-forge/win-64::python-3.9.13-h9a09f29_0_cpython
    python_abi         conda-forge/win-64::python_abi-3.9-2_cp39

Edit: For some reason I cannot find the original test report anymore. I was suspecting an improper or buggy mamba solve but couldn't verify due to a lack of logs.
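As a side note, a trivial (hypothetical) CI step like the following can confirm which interpreter a job actually resolved, independent of what the matrix entry claims:

    import platform
    import sys

    # Print the interpreter this job actually runs, to compare against the
    # matrix entry (e.g. a "3.8" job that resolved a Python 3.9 conda env).
    print(sys.executable)
    print(platform.python_version())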

@fjetter (Member Author) commented Jul 4, 2022

I could reproduce the "stuck during pyarrow import" failure https://github.com/fjetter/distributed/runs/7178677135?check_suite_focus=true

(faulthandler output is bottom to top, i.e. the first line is the most recent.) We're clearly stuck in pyarrow init:

  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1176 in create_module
  File "<frozen importlib._bootstrap>", line 571 in module_from_spec
  File "<frozen importlib._bootstrap>", line 674 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pyarrow/__init__.py", line 65 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 883 in exec_module
  File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pandas/compat/pyarrow.py", line 6 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 883 in exec_module
  File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pandas/compat/__init__.py", line 20 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 883 in exec_module
  File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pandas/__init__.py", line 22 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 883 in exec_module
  File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "<frozen importlib._bootstrap>", line 1050 in _gcd_import
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/importlib/__init__.py", line 126 in import_module
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/bokeh/util/dependencies.py", line 53 in import_optional

This one happened on 3.10 macOS. Possibly we see two different errors on different Python versions / operating systems.

The line the pyarrow import is stuck on is indeed the initialization of the Cython module:
https://github.com/apache/arrow/blob/e90472e35b40f58b17d408438bb8de1641bfe6ef/python/pyarrow/__init__.py#L65

(pyarrow 7.0.0)

@fjetter (Member Author) commented Jul 4, 2022

cc @graingert

@fjetter (Member Author) commented Jul 4, 2022

Just had a debugging session with @graingert. What we currently think is happening:

  • pyarrow deadlocks in the dask-scheduler process (or is taking an insane amount of time, >5 s)
  • The parent process running the test in the meantime raises a timeout error
  • We enter the exit clause of the popen context manager
  • The context manager sends a SIGINT
  • faulthandler somehow converts this to a SIGABRT, and/or Cython code raises a SystemExit("Abort!"), which tears down and closes the event loop such that we see the "End scheduler" log just before the Abort! line

It's still unclear why pyarrow is locking up. I am currently running a few more CI runs:

  1. A CI run with additional sys_exit information https://github.com/fjetter/distributed/actions/runs/2611522382
  2. A CI run with pyarrow<6 (that would require us to skip / drop the shuffle service tests for now) https://github.com/fjetter/distributed/actions/runs/2611510200
  3. A CI run without pyarrow https://github.com/fjetter/distributed/actions/runs/2611513504

Edit: the pyarrow==6.0.1 run also ran into the same problem: https://github.com/fjetter/distributed/runs/7184426045?check_suite_focus=true

@fjetter (Member Author) commented Jul 4, 2022

Turns out pyarrow is not the culprit. When running the same job without pyarrow, the import is equally stuck, this time in pandas hashing. Maybe Cython??

@fjetter (Member Author) commented Jul 4, 2022

OK, so the event loop is closed because the timeout bubbles up as a KeyboardInterrupt:

    import pyarrow.lib as _lib
  File "<frozen importlib._bootstrap>", line 404, in parent
KeyboardInterrupt

Now, why are all the Cython imports so slow?? I started a test run with an ancient Cython version (I don't know how compatible this is / how much this depends on the version that the libs were compiled with).

I'm continuing these tests on https://github.com/fjetter/distributed/tree/try_no_pyarrow

@graingert (Member)

event loop is closed because the timeout bubbles up as a keyboard interrupt

That's weird, it should be the wait_for_signals handler

@graingert (Member)

Ah wait_for_signals hasn't run yet!

@fjetter (Member Author) commented Jul 4, 2022

Ah wait_for_signals hasn't run yet!

Well, that still doesn't explain why the import is locking up.

@fjetter (Member Author) commented Jul 4, 2022

now, why are all the cython imports so slow?? I started a test run with an ancient cython version (I don't know how compatible this is / how this depends on the version that the libs were compiled with)

Ran an old Cython version (0.29.24). Still locking up.
