distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero #6395

Closed
fjetter opened this issue May 20, 2022 · 15 comments · Fixed by #6590 or #6822
@fjetter (Member) commented May 20, 2022

This test is evolving into a frequent offender.

Short test report (failures on main)

https://github.com/dask/distributed/actions/runs/2353756567
https://github.com/dask/distributed/actions/runs/2345977494

Failure on #6371
https://github.com/dask/distributed/runs/6517029607?check_suite_focus=true


self = <Popen: returncode: -9 args: ['/usr/share/miniconda3/envs/dask-distributed/b...>
timeout = 30

    def _wait(self, timeout):
        """Internal implementation of wait() on POSIX."""
        if self.returncode is not None:
            return self.returncode
    
        if timeout is not None:
            endtime = _time() + timeout
            # Enter a busy loop if we have a timeout.  This busy loop was
            # cribbed from Lib/threading.py in Thread.wait() at r71065.
            delay = 0.0005 # 500 us -> initial delay of 1 ms
            while True:
                if self._waitpid_lock.acquire(False):
                    try:
                        if self.returncode is not None:
                            break  # Another thread waited.
                        (pid, sts) = self._try_wait(os.WNOHANG)
                        assert pid == self.pid or pid == 0
                        if pid == self.pid:
                            self._handle_exitstatus(sts)
                            break
                    finally:
                        self._waitpid_lock.release()
                remaining = self._remaining_time(endtime)
                if remaining <= 0:
>                   raise TimeoutExpired(self.args, timeout)
E                   subprocess.TimeoutExpired: Command '['/usr/share/miniconda3/envs/dask-distributed/bin/dask-scheduler', '--dashboard-address', ':0']' timed out after 30 seconds
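For orientation, the shape of the code path that raises this looks roughly like the following (an illustrative sketch, not the actual test source; the real test drives the process through a popen context manager, as discussed further down):

    import signal
    import subprocess

    # Illustrative sketch: start the scheduler CLI with an ephemeral dashboard
    # port, ask it to shut down, and wait for it to exit.  If startup/shutdown
    # hangs, wait() raises the subprocess.TimeoutExpired shown above and the
    # process has to be killed (hence the -9 returncode in the Popen repr).
    proc = subprocess.Popen(
        ["dask-scheduler", "--dashboard-address", ":0"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    try:
        # ... interact with the scheduler here ...
        proc.send_signal(signal.SIGINT)
        proc.wait(timeout=30)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()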
fjetter added the flaky test label May 20, 2022
fjetter added the tests label May 25, 2022
gjoseph92 added a commit to gjoseph92/distributed that referenced this issue May 26, 2022
Tests like dask#6395 will fail (timeout) because a log statement doesn't get printed, but since you never get to see what _was_ printed, CI failures are hard to debug.

Adds a `wait_for_log_line` helper that tees the output to stdout, so you can at least see what happened.
gjoseph92 added a commit to gjoseph92/distributed that referenced this issue May 26, 2022
Tests like dask#6395 will fail (timeout) because a log statement doesn't get printed, but since you never get to see what _was_ printed, CI failures are hard to debug.

Adds a `wait_for_log_line` helper that tees the output to stdout, so you can at least see what happened.
mrocklin pushed a commit that referenced this issue May 27, 2022
Tests like #6395 will fail (timeout) because a log statement doesn't get printed, but since you never get to see what _was_ printed, CI failures are hard to debug.

Adds a `wait_for_log_line` helper that tees the output to stdout, so you can at least see what happened.
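A rough sketch of what such a tee-ing helper can look like (illustrative only; the signature and details of the helper actually added in these commits may differ):

    import sys

    def wait_for_log_line(match: bytes, stream, max_lines: int = 10_000) -> bytes:
        """Read ``stream`` line by line, echoing every line to stdout, until a
        line containing ``match`` appears; return that line."""
        for _ in range(max_lines):
            line = stream.readline()
            if not line:
                raise EOFError("stream closed before the expected log line appeared")
            sys.stdout.buffer.write(line)  # tee, so CI logs show the subprocess output
            if match in line:
                return line
        raise AssertionError(f"{match!r} not found in the first {max_lines} lines")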
@gjoseph92 (Collaborator)

We'll see what's happening in CI with #6461, but I'm wondering if this is just the typical thing where port 8787 is already in use. pytest --count 10 -n4 distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero will easily make this test fail. That's not with a TimeoutError, but maybe CI is so slow that the test times out before the process actually starts and stops?
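For context, the "port zero" in the test name means asking the OS for any free ephemeral port rather than a fixed default; a minimal illustration of that behaviour:

    import socket

    # Binding to port 0 lets the OS pick a free ephemeral port, which is what
    # `--dashboard-address :0` requests for the dashboard.
    sock = socket.socket()
    sock.bind(("127.0.0.1", 0))
    print(sock.getsockname()[1])  # prints whichever free port the OS picked
    sock.close()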

@gjoseph92 (Collaborator)

With #6461 in, we'll wait for a week to see if these subprocess.TimeoutExpired errors keep occurring; if not, we'll close this issue.

@crusaderky (Collaborator)

@fjetter (Member Author) commented Jun 23, 2022

Still an issue...

https://github.com/fjetter/distributed/runs/7023391512?check_suite_focus=true

shows a log message pattern like the following:

2022-06-23 12:39:09,197 - distributed.scheduler - INFO - -----------------------------------------------
2022-06-23 12:39:11,227 - distributed.scheduler - INFO - End scheduler

and the test fails with a "client cannot connect" error after 5 s, tearing the test down. I suspect this is where the ~5 s delay between the two info messages above comes from.
Trying to identify which lines issue these logs is quite interesting: End scheduler very clearly runs once the CLI event loop finishes, i.e. just before the CLI process terminates.
The other log message, "-" * 47, has to be the log message at the top of the CLI run command. The interesting information is that after the scheduler object is initialized we'd expect another one of these logs to appear, but this clearly doesn't happen.
Therefore, in this test the CLI process is actually stuck in Scheduler.__init__, before we get the log message "To route to workers diagnostics web server please install jupyter-server-proxy" that is typically issued in get_handlers.

So either one of the handlers blocks before it reaches the http.proxy module, or something earlier in __init__ is responsible.

@fjetter (Member Author) commented Jul 4, 2022

I'm currently investigating the above "scheduler does not start up" error a bit more closely on the branch https://github.com/fjetter/distributed/tree/run_test_dashboard_port_zero_until_it_fails

So far

  1. I added a couple of log statements on module import to trace the import of packages. This was motivated by determining that scheduler process startup is indeed stuck at this line:
    import distributed.dashboard.scheduler
  2. I already tried removing a couple of imports, but regardless of what I did, I ended up in this state as long as I imported bokeh and initialized the dashboard.
  3. I enabled faulthandler and set it up to print a traceback during scheduler init (see the sketch after this list). This indeed confirms that we're stuck in an import.
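A minimal sketch of that kind of faulthandler setup (the 20 s timeout and the exact placement are illustrative choices, not necessarily what the debug branch does):

    import faulthandler
    import sys

    # Dump the traceback of every thread to stderr if we are still inside this
    # code path after 20 seconds, without killing the process.
    faulthandler.dump_traceback_later(20, repeat=False, file=sys.stderr, exit=False)
    try:
        import distributed.dashboard.scheduler  # the import that appears to hang
    finally:
        faulthandler.cancel_dump_traceback_later()  # disarm once the import returns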

After this, I could observe two things so far:

  1. My first attempt with faulthandler suggested that I was stuck while importing pandas (specifically the tslib module; another run showed pyarrow).
  2. I could not reproduce this. After restarting the test job (and enabling GitHub debugging), I can see a very interesting message:
    https://github.com/fjetter/distributed/actions/runs/2609527766
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "d:\a\distributed\distributed\distributed\cli\dask_scheduler.py", line 234, in main
    asyncio.run(run())
  File "C:\Miniconda3\envs\dask-distributed\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "C:\Miniconda3\envs\dask-distributed\lib\asyncio\base_events.py", line 616, in run_until_complete
    return future.result()
  File "d:\a\distributed\distributed\distributed\cli\dask_scheduler.py", line 197, in run
    scheduler = Scheduler(
  File "d:\a\distributed\distributed\distributed\scheduler.py", line 2967, in __init__
    import distributed.dashboard.scheduler
  File "d:\a\distributed\distributed\distributed\dashboard\scheduler.py", line 12, in <module>
    from distributed.dashboard.components.scheduler import (
  File "d:\a\distributed\distributed\distributed\dashboard\components\scheduler.py", line 20, in <module>
    from bokeh.io import curdoc
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\io\__init__.py", line 28, in <module>
    from .saving import save
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\io\saving.py", line 34, in <module>
    from ..models.layouts import LayoutDOM
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\models\__init__.py", line 35, in <module>
    from .annotations import *
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\models\annotations.py", line 83, in <module>
    from .formatters import TickFormatter
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\models\formatters.py", line 600, in <module>
    _df_defaults = _df.properties_with_values()
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\core\has_props.py", line 555, in properties_with_values
    return self.query_properties_with_values(lambda prop: prop.serialized,
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\core\has_props.py", line 607, in query_properties_with_values
    if not query(descriptor):
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\core\has_props.py", line 555, in <lambda>
    return self.query_properties_with_values(lambda prop: prop.serialized,
  File "C:\Miniconda3\envs\dask-distributed\lib\site-packages\bokeh\core\property\descriptors.py", line 429, in serialized
    return self.property.serialized
SystemError: PyEval_EvalFrameEx returned a result with an error set

A couple of observations here:

  1. The SystemError is interesting. I'm not sure if this is a genuine error or something that was triggered by using faulthandler.dump_traceback_later.
  2. I'm surprised to see that we're operating on two different Windows partitions (C: with the conda env and D: with the checked-out code).
    The above error is also accompanied by a non-descriptive TypeError: expected bytes, str found error message (no proper log formatting, just a single line).

Note: Even though we have this additional debug/error information, the test is still failing with an OSError: Timed out trying to connect to tcp://127.0.0.1:51044 after 5 s, like most of our tests. Therefore, we may have actually been missing this all along?

@fjetter (Member Author) commented Jul 4, 2022

Next curiosity: we're only executing Python 3.8 and Python 3.9 jobs for macOS.

os: [ubuntu-latest, windows-latest, macos-latest]
python-version: ["3.8", "3.9", "3.10"]
# Cherry-pick test modules to split the overall runtime roughly in half
partition: [ci1, not ci1]
exclude:
  - os: macos-latest
    python-version: 3.9

I got the same on my branch as well https://github.com/fjetter/distributed/blob/b36cc0cf44f60c4491953ee9801e41b33b9788d4/.github/workflows/tests.yaml#L25-L31

However, if I inspect the actually installed versions, I can see that the Python 3.8 job is using Python 3.9!

    python             conda-forge/win-64::python-3.9.13-h9a09f29_0_cpython
    python_abi         conda-forge/win-64::python_abi-3.9-2_cp39

Edit: For some reason I cannot find the original test report anymore. I was suspecting an improper or buggy mamba solve but couldn't verify due to a lack of logs.
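As a side note, a trivial (hypothetical) CI step like the following can confirm which interpreter a job actually resolved, independent of what the matrix entry claims:

    import platform
    import sys

    # Print the interpreter this job actually runs, to compare against the
    # matrix entry (e.g. a "3.8" job that resolved a Python 3.9 conda env).
    print(sys.executable)
    print(platform.python_version())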

@fjetter (Member Author) commented Jul 4, 2022

I could reproduce the "stuck during pyarrow import" failure https://github.com/fjetter/distributed/runs/7178677135?check_suite_focus=true

(faulthandler output is bottom to top, i.e. the first line is the most recent.) We're clearly stuck in pyarrow init:

  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1176 in create_module
  File "<frozen importlib._bootstrap>", line 571 in module_from_spec
  File "<frozen importlib._bootstrap>", line 674 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pyarrow/__init__.py", line 65 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 883 in exec_module
  File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pandas/compat/pyarrow.py", line 6 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 883 in exec_module
  File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pandas/compat/__init__.py", line 20 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 883 in exec_module
  File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pandas/__init__.py", line 22 in <module>
  File "<frozen importlib._bootstrap>", line 241 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 883 in exec_module
  File "<frozen importlib._bootstrap>", line 688 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1006 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1027 in _find_and_load
  File "<frozen importlib._bootstrap>", line 1050 in _gcd_import
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/importlib/__init__.py", line 126 in import_module
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/bokeh/util/dependencies.py", line 53 in import_optional

This one happened on 3.10 macOS. Possibly we see two different errors on different Python versions / operating systems.

The line the pyarrow import is stuck on is indeed the initialization of the Cython module:
https://github.com/apache/arrow/blob/e90472e35b40f58b17d408438bb8de1641bfe6ef/python/pyarrow/__init__.py#L65

(pyarrow 7.0.0)

@fjetter (Member Author) commented Jul 4, 2022

cc @graingert

@fjetter (Member Author) commented Jul 4, 2022

Just had a debugging session with @graingert. What we currently think is happening:

  • pyarrow deadlocks in the dask-scheduler process (or is taking an insane amount of time, >5 s)
  • The parent process running the test in the meantime raises a timeout error
  • We enter the exit clause of the popen context manager
  • The context manager sends a SIGINT
  • faulthandler somehow converts this to a SIGABRT, and/or Cython code raises a SystemExit("Abort!"), which tears down and closes the event loop such that we see the "End scheduler" log just before the Abort! line

It's still unclear why pyarrow is locking up. I am currently running a few more CI runs:

  1. A CI run with additional sys_exit information https://github.com/fjetter/distributed/actions/runs/2611522382
  2. A CI run with pyarrow<6 (that would require us to skip / drop the shuffle service tests for now) https://github.com/fjetter/distributed/actions/runs/2611510200
  3. A CI run without pyarrow https://github.com/fjetter/distributed/actions/runs/2611513504

Edit: the pyarrow==6.0.1 run also ran into the same problem: https://github.com/fjetter/distributed/runs/7184426045?check_suite_focus=true

@fjetter (Member Author) commented Jul 4, 2022

Turns out pyarrow is not the culprit. When running the same job without pyarrow, the import is equally stuck, this time in pandas hashing. Maybe Cython??

@fjetter (Member Author) commented Jul 4, 2022

OK, so the event loop is closed because the timeout bubbles up as a KeyboardInterrupt:

    import pyarrow.lib as _lib
  File "<frozen importlib._bootstrap>", line 404, in parent
KeyboardInterrupt

Now, why are all the Cython imports so slow?? I started a test run with an ancient Cython version (I don't know how compatible this is / how much this depends on the version that the libs were compiled with).

I'm continuing these tests on https://github.com/fjetter/distributed/tree/try_no_pyarrow

@graingert (Member)

event loop is closed because the timeout bubbles up as a keyboard interrupt

That's weird, it should be the wait_for_signals handler

@graingert (Member)

Ah wait_for_signals hasn't run yet!

@fjetter (Member Author) commented Jul 4, 2022

Ah wait_for_signals hasn't run yet!

Well, that still doesn't explain why the import is locking up.

@fjetter (Member Author) commented Jul 4, 2022

now, why are all the cython imports so slow?? I started a test run with an ancient cython version (I don't know how compatible this is / how this depends on the version that the libs were compiled with)

Ran an old Cython version (0.29.24). Still locking up.
