
Unit Test Timeouts (Dask Instability) #2354

Closed
chukarsten opened this issue Jun 9, 2021 · 11 comments
Labels: blocker (an issue blocking a release), bug (issues tracking problems with existing features), testing (issues related to testing)

Comments

@chukarsten
Collaborator

chukarsten commented Jun 9, 2021

We're currently seeing unit tests hit the GitHub Actions limit of 6 hours. This is not good, for obvious reasons.

  • 3.8 core deps: 6 hr timeout (in progress)
  • build_conda_pkg, 3.8 core deps, 3.7 non-core deps: 6 hr timeout (in progress)
  • 3.7 non-core deps: 6 hr timeout
  • 3.8 non-core deps: 6 hr timeout
  • 3.7 non-core deps: 1.5 hrs
  • build_conda_pkg
  • 3.7 non-core deps
  • 3.8

@chukarsten
Collaborator Author

chukarsten commented Jun 10, 2021

Just adding some data from this 3.8 core deps series of checks. Attaching the logs from that run.

github_unittests.txt

One thing I've noticed is that they're all pausing around the 91-93% completed mark. I doubt there's much value in figuring out which tests those are, but it might be a route to pursue.

@chukarsten
Collaborator Author

Here's another for 3.9 non-core deps.

github_unittests_2.txt

@dsherry added the testing and bug labels Jun 10, 2021
@dsherry
Contributor

dsherry commented Jun 10, 2021

Thanks for filing @chukarsten

Thankfully we can rule out conda as a cause, since this happens for our normal unit test builds and not just for build_conda_pkg.

Is there any other info we should collect that could help us figure this out? A few ideas below:

  • How reliably can we repro the timeout? Is it happening 50% of the time we run the unit test job, more, less?
  • Which test or tests are not completing properly? If we can get pytest to log the start and end of each test, we can look at the logs and deduce which test has not ended when the hang occurs (a sketch of one approach follows this list). This looked potentially useful.
  • Do we still see these timeouts if we run the tests without any pytest parallelization?
  • This is just a hunch, but what happens if we disable the dask engine tests? I know we've seen some flakes with those recently (Flaky Dask test #2341).
  • What do CPU and memory utilization look like while the tests are running?
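
Not from the thread, but a minimal sketch of the per-test start/end logging idea above, using standard pytest hooks in a conftest.py (the log file name is made up; with pytest-xdist all workers append to the same file, so lines may interleave):

# conftest.py -- hypothetical sketch, not code from the evalml repo.
# Appends a timestamped START/END line for every test, so a hung run's log
# shows which test(s) started but never reached END.
import datetime

LOG_PATH = "pytest_progress.log"  # made-up file name

def _log(event, nodeid):
    with open(LOG_PATH, "a") as f:
        f.write(f"{datetime.datetime.now().isoformat()} {event} {nodeid}\n")

def pytest_runtest_logstart(nodeid, location):
    _log("START", nodeid)

def pytest_runtest_logfinish(nodeid, location):
    _log("END", nodeid)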

@dsherry added the blocker label Jun 10, 2021
@dsherry
Contributor

dsherry commented Jun 10, 2021

(@freddyaboulton I added you on here since this connects to #2298 and #1815)

@chukarsten
Collaborator Author

chukarsten commented Jun 10, 2021

Changing the Makefile to do verbose logging with pytest, we get the following log. It shows the last executed test to be "evalml/tuners/random_search_tuner.py::evalml.tuners.random_search_tuner.RandomSearchTuner".

@freddyaboulton
Contributor

@chukarsten
Collaborator Author

chukarsten commented Jun 14, 2021

I think @freddyaboulton is certainly onto something here, and we're pointing firmly at Dask. I made this PR to separate out the dask unit tests, which gives us the option of not blocking merge when they fail. That PR failed on test_automl_immediate_quit, which is still in the set of dask tests.
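
For illustration only (not necessarily how the PR does it), one way to carve the dask tests out into their own job is to tag them at collection time and then select by marker; the dask engine tests all live under evalml/tests/automl_tests/dask_tests/:

# conftest.py -- hypothetical sketch of splitting the dask tests into their own
# CI job; PR #2376 may do this differently (e.g. by selecting test paths in CI).
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        # Tag anything under the dask_tests directory with a "dask" marker.
        if "dask_tests" in str(item.fspath):
            item.add_marker(pytest.mark.dask)

def pytest_configure(config):
    # Register the marker so pytest doesn't warn about an unknown mark.
    config.addinivalue_line("markers", "dask: tests that spin up a dask cluster")

The main job could then run pytest -m "not dask" while a separate, non-blocking job runs pytest -m dask.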

Looking into the root cause of the dask unit test failures is puzzling. The logs contain a lot of this:

distributed.worker - WARNING - Could not find data: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']} on workers: [] (who_has: {'Series-32a3ef2ca4739b46a6acc2ac58638b32': ['tcp://127.0.0.1:45587']})
distributed.scheduler - WARNING - Communication failed during replication: {'status': 'missing-data', 'keys'

Why does this happen? It seems that, somewhere, the reference to the data being acted upon is getting lost. Additionally, the 'workers: []' suggests that perhaps the nanny process is killing the workers. I suspect there's something going on with how the data is scattered, but I'm also suspicious of what's happening under the covers with these four jobs running together in pseudo parallel/series.

This dask distributed issue suggests disabling adaptive scaling for the cluster. Unfortunately we don't use adaptive clusters, just regular local, static clusters, so that's not the issue. This other issue points at the scattering of data as a potential cause, with workers being abandoned, but we're not getting the same connection errors.

@chukarsten
Collaborator Author

After trying #2376 to separate out the dask jobs and setting broadcast=False by default for the DaskEngine's client, I have a flaky test failure with test_automl_immediate_quit. Documented here.
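
For context, a minimal, hypothetical sketch of the kind of setup being described: a static (non-adaptive) LocalCluster whose data is handed to the cluster with broadcast=False so it isn't replicated to every worker. The real DaskEngine code differs; n_rows is a stand-in for the training work:

# Hypothetical sketch, not the actual DaskEngine implementation.
import pandas as pd
from dask.distributed import Client, LocalCluster

def n_rows(X, y):
    # Stand-in for the real pipeline-training function.
    return len(X), len(y)

if __name__ == "__main__":
    X = pd.DataFrame({"a": range(100)})
    y = pd.Series(range(100))

    # Static local cluster, no adaptive scaling.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    with Client(cluster) as client:
        # broadcast=False: each piece of data goes to a single worker instead of
        # being replicated everywhere, which is the setting the comment refers to.
        X_ref = client.scatter(X, broadcast=False)
        y_ref = client.scatter(y, broadcast=False)
        future = client.submit(n_rows, X_ref, y_ref)
        print(future.result())
    cluster.close()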

@chukarsten changed the title from Unit Test Timeouts to Unit Test Timeouts (Dask Instability) Jun 14, 2021
@freddyaboulton
Contributor

I am now seeing the following stack trace in build_conda_pkg:

[gw3] linux -- Python 3.7.10 $PREFIX/bin/python

X_y_binary_cls = (          0         1         2   ...        17        18        19
0  -0.039268  0.131912 -0.211206  ...  1.976989  ...ns], 0     0
1     0
2     1
3     1
4     1
     ..
95    1
96    1
97    1
98    1
99    0
Length: 100, dtype: int64)
cluster = LocalCluster(15c4b3ad, 'tcp://127.0.0.1:45201', workers=0, threads=0, memory=0 B)

    def test_submit_training_jobs_multiple(X_y_binary_cls, cluster):
        """Test that training multiple pipelines using the parallel engine produces the
        same results as the sequential engine."""
        X, y = X_y_binary_cls
        with Client(cluster) as client:
            pipelines = [
                BinaryClassificationPipeline(
                    component_graph=["Logistic Regression Classifier"],
                    parameters={"Logistic Regression Classifier": {"n_jobs": 1}},
                ),
                BinaryClassificationPipeline(component_graph=["Baseline Classifier"]),
                BinaryClassificationPipeline(component_graph=["SVM Classifier"]),
            ]
    
            def fit_pipelines(pipelines, engine):
                futures = []
                for pipeline in pipelines:
                    futures.append(
                        engine.submit_training_job(
                            X=X, y=y, automl_config=automl_data, pipeline=pipeline
                        )
                    )
                results = [f.get_result() for f in futures]
                return results
    
            # Verify all pipelines are trained and fitted.
            seq_pipelines = fit_pipelines(pipelines, SequentialEngine())
            for pipeline in seq_pipelines:
                assert pipeline._is_fitted
    
            # Verify all pipelines are trained and fitted.
>           par_pipelines = fit_pipelines(pipelines, DaskEngine(client=client))

evalml/tests/automl_tests/dask_tests/test_dask_engine.py:103: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in fit_pipelines
    results = [f.get_result() for f in futures]
evalml/tests/automl_tests/dask_tests/test_dask_engine.py:94: in <listcomp>
    results = [f.get_result() for f in futures]
evalml/automl/engine/dask_engine.py:30: in get_result
    return self.work.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Future: cancelled, key: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c>
timeout = None

    def result(self, timeout=None):
        """Wait until computation completes, gather result to local process.
    
        If *timeout* seconds are elapsed before returning, a
        ``dask.distributed.TimeoutError`` is raised.
        """
        if self.client.asynchronous:
            return self.client.sync(self._result, callback_timeout=timeout)
    
        # shorten error traceback
        result = self.client.sync(self._result, callback_timeout=timeout, raiseit=False)
        if self.status == "error":
            typ, exc, tb = result
            raise exc.with_traceback(tb)
        elif self.status == "cancelled":
>           raise result
E           concurrent.futures._base.CancelledError: train_pipeline-4bd4a99325cd3cc91144f86b64d6503c

This seems to be a known issue in dask: dask/distributed#4612

@angela97lin
Contributor

angela97lin commented Jun 15, 2021

Deleted my old post, but here's a red run: https://github.com/alteryx/evalml/actions/runs/939673304. It seems to be the same stack trace @freddyaboulton posted above.

@chukarsten
Collaborator Author

I believe this issue no longer blocks, per this PR to separate out the dask jobs (#2376), this PR to refactor the dask jobs to cut down on flakes, this PR to make the separate dask jobs non-blocking for merge to main, and this PR to add a timeout so that pathological dask tests no longer run for 6 hours before ultimately being cancelled by GH Actions.
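
As a hedged aside on the timeout piece: if the limit is enforced with the pytest-timeout plugin (an assumption, not confirmed here; the 240-second value is made up), it would look roughly like this:

# Hypothetical sketch using the pytest-timeout plugin (pip install pytest-timeout);
# the actual PR may enforce the limit differently.
import pytest

@pytest.mark.timeout(240)  # fail after 4 minutes instead of hanging until GH Actions kills the job
def test_dask_job_finishes_quickly():
    assert 1 + 1 == 2

The same limit can also be applied to a whole run with pytest --timeout=240.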

Going to move this to closed because the dask-related timeouts are no longer an issue and shouldn't be for the foreseeable future. However, the underlying cause is still unknown.
