
Unit Test Timeouts (Dask Instability) #2354

Closed
chukarsten opened this issue Jun 9, 2021 · 11 comments
Labels: blocker (an issue blocking a release), bug (issues tracking problems with existing features), testing (issues related to testing)

Comments

@chukarsten
Collaborator

chukarsten commented Jun 9, 2021

We're currently seeing unit tests hit the GitHub Actions limit of 6 hours. This is not good, for obvious reasons.

  • 3.8 core deps: 6 hr timeout (in progress)
  • build_conda_pkg, 3.8 core deps, 3.7 non-core deps: 6 hr timeout (in progress)
  • 3.7 non-core deps: 6 hr timeout
  • 3.8 non-core deps: 6 hr timeout
  • 3.7 non-core deps: 1.5 hrs
  • build_conda_pkg
  • 3.7 non-core deps
  • 3.8

@chukarsten
Collaborator Author

chukarsten commented Jun 10, 2021

Just adding some data from this 3.8 core deps series of checks. Attaching the logs from that run.

github_unittests.txt

One thing I've noticed is that they're all pausing around the 91-93% completed mark. I doubt there's much value in figuring out which tests those are, but it might be a route to pursue.

@chukarsten
Collaborator Author

Here's another for 3.9 non-core deps.

github_unittests_2.txt

@dsherry added the testing and bug labels Jun 10, 2021
@dsherry
Contributor

dsherry commented Jun 10, 2021

Thanks for filing @chukarsten

Thankfully we can rule out conda as a cause, since this happens for our normal unit test builds and not just for build_conda_pkg.

Is there any other info we should collect that could help us figure this out? A few ideas below:

  • How reliably can we repro the timeout? Is it happening 50% of the time we run the unit test job, more, less?
  • Which test or tests are not completing properly? If we can get pytest to log the start and end of each test, we can look at the logs and deduce which test has not ended when the hang occurs (a sketch of one approach follows this list). This looked potentially useful.
  • Do we still see these timeouts if we run the tests without any pytest parallelization?
  • This is just a hunch, but what happens if we disable the dask engine tests? I know we've seen some flakes with those recently (Flaky Dask test #2341).
  • What do CPU and memory utilization look like while the tests are running?
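
Not from the thread, but a minimal sketch of the per-test start/end logging idea above, using standard pytest hooks in a conftest.py (the log file name is made up; with pytest-xdist all workers append to the same file, so lines may interleave):

# conftest.py -- hypothetical sketch, not code from the evalml repo.
# Appends a timestamped START/END line for every test, so a hung run's log
# shows which test(s) started but never reached END.
import datetime

LOG_PATH = "pytest_progress.log"  # made-up file name

def _log(event, nodeid):
    with open(LOG_PATH, "a") as f:
        f.write(f"{datetime.datetime.now().isoformat()} {event} {nodeid}\n")

def pytest_runtest_logstart(nodeid, location):
    _log("START", nodeid)

def pytest_runtest_logfinish(nodeid, location):
    _log("END", nodeid)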

@dsherry added the blocker label Jun 10, 2021
@dsherry
Contributor

dsherry commented Jun 10, 2021

(@freddyaboulton I added you on here since this connects to #2298 and #1815)

@chukarsten
Collaborator Author

chukarsten commented Jun 10, 2021

Changing the Makefile to do verbose logging with pytest, we get the following log. It shows the last executed test to be "evalml/tuners/random_search_tuner.py::evalml.tuners.random_search_tuner.RandomSearchTuner".

@freddyaboulton
Contributor

@chukarsten
Collaborator Author

chukarsten commented Jun 14, 2021

I think @freddyaboulton is certainly onto something here, and we're pointing firmly at Dask. I made this PR to separate out the dask unit tests, which gives us the option of not blocking merge when they fail. That PR failed on test_automl_immediate_quit, which is still in the set of dask tests.
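
For illustration only (not necessarily how the PR does it), one way to carve the dask tests out into their own job is to tag them at collection time and then select by marker; the dask engine tests all live under evalml/tests/automl_tests/dask_tests/:

# conftest.py -- hypothetical sketch of splitting the dask tests into their own
# CI job; PR #2376 may do this differently (e.g. by selecting test paths in CI).
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        # Tag anything under the dask_tests directory with a "dask" marker.
        if "dask_tests" in str(item.fspath):
            item.add_marker(pytest.mark.dask)

def pytest_configure(config):
    # Register the marker so pytest doesn't warn about an unknown mark.
    config.addinivalue_line("markers", "dask: tests that spin up a dask cluster")

The main job could then run pytest -m "not dask" while a separate, non-blocking job runs pytest -m dask.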

Looking into the root cause of the dask unit test failures is puzzling. The logs contain a lot of this:

distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']})
distributed.scheduler - WARNING - Communication failed during replication: {'status': 'missing-data', 'keys'

Why does this happen? It seems that, somewhere, the reference to the data being acted upon is getting lost. Additionally, the 'workers: []' suggests that perhaps the nanny process is killing the workers. I suspect there's something going on with how the data is scattered, but I'm also suspicious of what's happening under the covers with these four jobs running together in pseudo parallel/series.

This dask distributed issue suggests disabling adaptive scaling for the cluster. Unfortunately we don't use adaptive clusters, just regular local, static clusters, so that's not the issue. This other issue points at the scattering of data as a potential cause, with workers being abandoned, but we're not getting the same connection errors.

@chukarsten
Collaborator Author

After trying #2376 to separate out the dask jobs and setting broadcast=False by default for the DaskEngine's client, I have a flaky test failure with test_automl_immediate_quit. Documented here.
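
For context, a minimal, hypothetical sketch of the kind of setup being described: a static (non-adaptive) LocalCluster whose data is handed to the cluster with broadcast=False so it isn't replicated to every worker. The real DaskEngine code differs; n_rows is a stand-in for the training work:

# Hypothetical sketch, not the actual DaskEngine implementation.
import pandas as pd
from dask.distributed import Client, LocalCluster

def n_rows(X, y):
    # Stand-in for the real pipeline-training function.
    return len(X), len(y)

if __name__ == "__main__":
    X = pd.DataFrame({"a": range(100)})
    y = pd.Series(range(100))

    # Static local cluster, no adaptive scaling.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    with Client(cluster) as client:
        # broadcast=False: each piece of data goes to a single worker instead of
        # being replicated everywhere, which is the setting the comment refers to.
        X_ref = client.scatter(X, broadcast=False)
        y_ref = client.scatter(y, broadcast=False)
        future = client.submit(n_rows, X_ref, y_ref)
        print(future.result())
    cluster.close()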

@chukarsten changed the title from Unit Test Timeouts to Unit Test Timeouts (Dask Instability) Jun 14, 2021
@freddyaboulton
Contributor

I am now seeing the following stack trace in build_conda_pkg:

[gw3] linux -- Python 3.7.10 $PREFIX/bin/python

X_y_binary_cls = (          0         1         2   ...        17        18        19
0  -0.039268  0.131912 -0.211206  ...  1.976989  ...ns], 0     0
1     0
2     1
3     1
4     1
     ..
95    1
96    1
97    1
98    1
99    0
Length: 100, dtype: int64)
cluster = LocalCluster(15c4b3ad, 'tcp://127.0.0.1:45201', workers=0, threads=0, memory=0 B)

    def test_submit_training_jobs_multiple(X_y_binary_cls, cluster):
        """Test that training multiple pipelines using the parallel engine produces the
        same results as the sequential engine."""
        X, y = X_y_binary_cls
        with Client(cluster) as client:
            pipelines = [
                BinaryClassificationPipeline(
                    component_graph=["Logistic Regression Classifier"],
                    parameters={"Logistic Regression Classifier": {"n_jobs": 1}},
                ),
                BinaryClassificationPipeline(component_graph=["Baseline Classifier"]),
                BinaryClassificationPipeline(component_graph=["SVM Classifier"]),
            ]
    
            def fit_pipelines(pipelines, engine):
                futures = []
                for pipeline in pipelines:
                    futures.append(
                        engine.submit_training_job(
                            X=X, y=y, automl_config=automl_data, pipeline=pipeline
                        )
                    )
                results = [f.get_result() for f in futures]
                return results
    
            # Verify all pipelines are trained and fitted.
            seq_pipelines = fit_pipelines(pipelines, SequentialEngine())
            for pipeline in seq_pipelines:
                assert pipeline._is_fitted
    
            # Verify all pipelines are trained and fitted.
>           par_pipelines = fit_pipelines(pipelines, DaskEngine(client=client))

evalml/tests/automl_tests/dask_tests/test_dask_engine.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in fit_pipelines
    results = [f.get_result() for f in futures]
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in <listcomp>
    results = [f.get_result() for f in futures]
evalml/automl/engine/dask_engine.py:30: in get_result
    return self.work.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Future: cancelled, key: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c>
timeout = None

    def result(self, timeout=None):
        """Wait until computation completes, gather result to local process.
    
        If *timeout* seconds are elapsed before returning, a
        ``dask.distributed.TimeoutError`` is raised.
        """
        if self.client.asynchronous:
            return self.client.sync(self._result, callback_timeout=timeout)
    
        # shorten error traceback
        result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
        if self.status == "error":
            typ, exc, tb = result
            raise exc.with_traceback(tb)
        elif self.status == "cancelled":
>           raise result
E           concurrent.futures._base.CancelledError: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c

This seems to be a known issue in dask: dask/distributed#4612

@angela97lin
Contributor

angela97lin commented Jun 15, 2021

Deleted my old post, but here's a red run: https://github.com/alteryx/evalml/actions/runs/939673304. It seems to be the same stack trace @freddyaboulton posted above.

@chukarsten
Collaborator Author

I believe this issue no longer blocks, per this PR to separate out the dask jobs (#2376), this PR to refactor the dask jobs to cut down on flakes, this PR to make the separate dask jobs non-blocking for merge to main, and this PR to add a timeout so that pathological dask tests no longer run for 6 hours before ultimately being cancelled by GH Actions.
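
As a hedged aside on the timeout piece: if the limit is enforced with the pytest-timeout plugin (an assumption, not confirmed here; the 240-second value is made up), it would look roughly like this:

# Hypothetical sketch using the pytest-timeout plugin (pip install pytest-timeout);
# the actual PR may enforce the limit differently.
import pytest

@pytest.mark.timeout(240)  # fail after 4 minutes instead of hanging until GH Actions kills the job
def test_dask_job_finishes_quickly():
    assert 1 + 1 == 2

The same limit can also be applied to a whole run with pytest --timeout=240.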

Going to move this to closed because the dask-related timeouts are no longer an issue and shouldn't be for the foreseeable future. However, the underlying cause is still unknown.
