
Karstengine #2506

Merged: 6 commits merged into main from the karstengine branch on Jul 21, 2021

Conversation

@chukarsten (Contributor) commented Jul 13, 2021:

An additional engine, written using concurrent.futures in the style of DaskEngine, to try to work around the Dask scheduler issues we've been experiencing.
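For context, a minimal sketch of the pattern this engine uses, assuming only that a thin wrapper delegates to a concurrent.futures executor (the CFClient name matches this PR, but the bodies here are illustrative, not the PR's code):

from concurrent.futures import ThreadPoolExecutor


class CFClient:
    """Thin wrapper letting a concurrent.futures executor stand in for Dask's Client."""

    def __init__(self, pool):
        self.pool = pool  # a ThreadPoolExecutor or ProcessPoolExecutor

    def submit(self, fn, *args, **kwargs):
        """Pass through to imitate Dask's Client API."""
        return self.pool.submit(fn, *args, **kwargs)


# submit() returns a concurrent.futures.Future, analogous to the future
# returned by dask.distributed.Client.submit.
client = CFClient(ThreadPoolExecutor(max_workers=4))
future = client.submit(sum, [1, 2, 3])
assert future.result() == 6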

@codecov bot commented Jul 13, 2021:

Codecov Report

Merging #2506 (057e00f) into main (b3e16f0) will increase coverage by 0.3%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2506     +/-   ##
=======================================
+ Coverage   99.7%   99.9%   +0.3%     
=======================================
  Files        283     285      +2     
  Lines      25903   26170    +267     
=======================================
+ Hits       25802   26134    +332     
+ Misses       101      36     -65     
Impacted Files Coverage Δ
evalml/automl/engine/dask_engine.py 100.0% <ø> (+66.0%) ⬆️
evalml/automl/engine/__init__.py 100.0% <100.0%> (ø)
evalml/automl/engine/cf_engine.py 100.0% <100.0%> (ø)
.../tests/automl_tests/dask_tests/test_automl_dask.py 100.0% <100.0%> (ø)
...ml/tests/automl_tests/dask_tests/test_cf_engine.py 100.0% <100.0%> (ø)
evalml/tests/conftest.py 98.6% <0.0%> (+0.4%) ⬆️
evalml/automl/automl_search.py 99.9% <0.0%> (+0.6%) ⬆️
evalml/automl/engine/engine_base.py 100.0% <0.0%> (+1.6%) ⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@angela97lin (Contributor) commented:

Please tell me we're keeping this name

@@ -31,11 +31,11 @@ test-dask:

.PHONY: git-test-dask
git-test-dask:
@chukarsten (Contributor, Author) commented:

Not entirely sure how the dask code coverage was working before this, since the "--cov" flag points at the top level of the source code. Previously, it would have looked for the dask source code in the same folder as the dask tests.

@angela97lin (Contributor) replied:

Interesting... If I'm understanding this correctly, looks like we were testing if our test code was being covered instead LOL: https://codecov.io/gh/alteryx/evalml/tree/aad921c5bbbafdb51529b907df6e46ad6c8feb3f/evalml/tests/automl_tests/dask_tests (from my branch)

yoikes, good catch 😁

@chukarsten (Contributor, Author) replied:

This was entirely @freddyaboulton's catch, but thanks lol

"""Pass through to imitate Dask's Client API."""
return self.pool.submit(*args, **kwargs)


@chukarsten (Contributor, Author) commented:

concurrent.futures' implementation of futures is a little different from Dask's, but not significantly.

@freddyaboulton (Contributor) replied:

Yea, looking at the engines, I think the only difference is _send_data_to_cluster? I think the door is open to refactoring in the future; we can make _send_data_to_cluster a no-op in the CFEngine.
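A rough sketch of that refactor, assuming the engines share a base class exposing this hook (the EngineBase name is taken from evalml/automl/engine/engine_base.py in the coverage report; the bodies are illustrative):

# Hypothetical shared hook: DaskEngine would scatter data to the cluster here.
class EngineBase:
    def _send_data_to_cluster(self, X, y):
        raise NotImplementedError


class CFEngine(EngineBase):
    def _send_data_to_cluster(self, X, y):
        # concurrent.futures executors receive their arguments directly at
        # submit() time, so there is no separate scatter step: a pass-through
        # no-op is all the CFEngine needs.
        return X, y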

yield pool
pool.shutdown()


@chukarsten (Contributor, Author) commented:

I included this for completeness. You'll see the comment below indicating how to add process pool testing to the mix. ProcessPool works locally but fails on the github executor. I suspect the additional processes might exceed the meager resources allocated.
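A sketch of the fixture pattern under discussion, assuming pytest (names are illustrative):

import pytest
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


# Add ProcessPoolExecutor to params to bring process pools into the mix;
# it works locally but fails on the GitHub executor.
@pytest.fixture(params=[ThreadPoolExecutor])
def pool(request):
    pool = request.param()
    yield pool
    pool.shutdown()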

@freddyaboulton (Contributor) replied:

That would be my guess as well

@angela97lin (Contributor) left a comment:

LGTM (from my limited knowledge, lol), and pretty cool stuff. Thanks for explaining to me why we need this!

Left some comments, but nothing blocking. 😁

evalml/automl/engine/cf_engine.py:
return self.work.result()

def cancel(self):
"""Cancel the current computation."""
@angela97lin (Contributor) commented:

Docstring nitpick: Return type here?
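For reference, concurrent.futures.Future.cancel() returns a bool, so the docstring could state that; a sketch, assuming self.work holds the underlying future:

def cancel(self):
    """Cancel the current computation.

    Returns:
        bool: True if the computation was cancelled; False if it is already
            running or finished and could not be cancelled.
    """
    return self.work.cancel()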

engine class for testing purposes.

e.g. The CFEngine can be run either with a ThreadPoolExecutor or a ProcessPoolExecutor,
so _get_engine_support("CFEngine", thread_pool, process_pool, "thread", cluster) returns a
@angela97lin (Contributor) commented:

A little confused on this docstring: is the example supposed to be _get_engine_support("CFEngine", "thread", cluster)? 😮

@angela97lin (Contributor) added:

Also, I feel like there's some potential for refactoring that I can't quite put my finger on: if all we're checking for is parallel_engine_type, is it possible to do:

def _get_engine_support(parallel_engine_type, resource):
    if parallel_engine_type == "CFEngine":
        client_class = CFClient
        engine_class = CFEngine
        # thread_pool should be passed as resource
    elif parallel_engine_type == "DaskEngine":
        client_class = dd.Client
        engine_class = DaskEngine
        # cluster should be passed as resource
    return resource, client_class, engine_class

@chukarsten (Contributor, Author) replied:

Sorry, this is a weird helper function. It's gone through a few iterations and ultimately looks like a lot of hullabaloo for little return. I think the reason I didn't do it the way you mentioned was to keep the decision logic for what the resources should be entirely inside the function, rather than relying on implied knowledge in the calling function/test.

@chukarsten (Contributor, Author) added:

@angela97lin I think there's good potential for refactoring in the engine code. I filed this to cover that.



@pytest.mark.parametrize("pool_type", ["threads", "processes"])
def test_submit_training_job_single(
@angela97lin (Contributor) commented:

The code for this file is so similar to that of the dask tests that I can't help but question whether it's a good idea to combine the two, like you have for the automl tests. What do ya think? Is this worth parameterizing? Or should they be two separate files because they're covering two different engines?
(Just thinking that if we decide to update the tests to include other checks, we'd have to make sure we do it for both this file and test_dask_engine.py 🤔, but structurally I like the separation more.)

@chukarsten (Contributor, Author) replied:

Yea, I struggled mightily with this and landed on the half-and-half solution you see. Obviously I started with the copy-pasta approach, but when I considered parameterizing the engine tests like the automl tests, I decided to maintain physical distance between the two, since the dask tests were flaky. Part of me wanted to see whether these new tests would be flaky too. I'm actually more uncomfortable about parameterizing the test_automl_dask.py tests to encompass the CFEngine because of the dask flakiness.

I dunno. I'm going to file an issue to track this and commit to one path or another: either (1) a separate test_cf_engine.py and test_automl_cf.py, or (2) parameterized dask/cf test files. #2533

@freddyaboulton (Contributor) replied:

I think it's fine to keep them separate for now! That said, I don't follow the reasoning that we shouldn't parameterize over the engines because the dask tests may flake again: if they do, we would still see that it was the dask engine parameter that flaked, so we could still tell whether the dask engine or the cf engine was at fault. Additionally, make git-test-dask runs both test_cf_engine and test_dask_engine right now, so we're already lumping the tests together as is.
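A toy illustration of the reporting behavior described above (the test body is illustrative):

import pytest

ENGINES = ["CFEngine", "DaskEngine"]


@pytest.mark.parametrize("engine_name", ENGINES)
def test_engine_smoke(engine_name):
    # pytest reports each parameter as its own case, e.g.
    # test_engine_smoke[DaskEngine], so a flaky dask run would still be
    # identifiable from the failing parameter.
    assert engine_name in ENGINES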

Commits pushed to the branch:

- Cleanup and parameterization.
- Refactored test_automl_dask to cut down lines. Lint.
- Removed the process pool executor from testing as it crashes on GitHub.
- Updated the dask tests --cov flag.
- Removed uncovered lines in test files.
@freddyaboulton (Contributor) left a comment:

@chukarsten This is so cool! Thanks for putting this together.

Nothing blocking merge. But if we have time, I think we can get rid of CFClient: ThreadPoolExecutor and ProcessPoolExecutor already match the API of the dask client. In fact, I think the dask client was modeled after the API of concurrent.futures.

https://distributed.dask.org/en/latest/client.html#client

self.client = client
self._data_futures_cache = {}

def submit_evaluation_job(self, automl_config, pipeline, X, y) -> EngineComputation:
@freddyaboulton (Contributor) commented:

nit-pick: Let's get rid of the type hint? I know the DaskEngine is guilty of this too so feel free to delete there as well hehe


def __enter__(self):
return self

def __exit__(self, typ, value, traceback):
@freddyaboulton (Contributor) commented:

Typically the exit method of a context manager does some clean-up: should we call self.pool.close() here?
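One way to do that cleanup, with the caveat that concurrent.futures executors expose shutdown() rather than close() (their own __exit__ calls shutdown(wait=True)); a sketch:

from concurrent.futures import ThreadPoolExecutor


class CFClient:
    def __init__(self, pool):
        self.pool = pool

    def __enter__(self):
        return self

    def __exit__(self, typ, value, traceback):
        # Mirror the executors' own context-manager behavior: wait for
        # outstanding work to finish, then release the workers.
        self.pool.shutdown(wait=True)


with CFClient(ThreadPoolExecutor()) as client:
    assert client.pool.submit(len, "abc").result() == 3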



class CFClient:
"""Custom CFClient API to match Dask's CFClient and allow context management."""
@freddyaboulton (Contributor) commented:

typo: to match Dask's Client?

)


class CFClient:
@freddyaboulton (Contributor) commented:

Can we get rid of this class? ThreadPoolExecutor and ProcessPoolExecutor already have context management and a submit method in their API.

https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example
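Indeed, an executor can be used directly wherever CFClient is used today; a minimal demonstration:

from concurrent.futures import ThreadPoolExecutor


def square(n):
    return n * n


# ThreadPoolExecutor is itself a context manager and exposes submit(),
# which are the two things CFClient currently provides.
with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(square, 3)
    assert future.result() == 9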

"""Pass through to imitate Dask's Client API."""
return self.pool.submit(*args, **kwargs)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea looking at the engines, I think the only difference is _send_data_to_cluster ? I think the door is open to refactoring in the future, we can make _send_data_to_cluster a no-op in the CFEngine



@pytest.mark.parametrize("pool_type", ["threads", "processes"])
def test_submit_training_job_single(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to keep separate for now! That being said I don't follow the reason that we should not parameterize over the engines because the dask tests may flake again. If they do, we would still see that the dask engine parameter was the one that flaked? So we would still be able to tell if the dask engine or cf engine flaked. Additionally, make git-test-dask will run both test_cf_engine and test_dask_engine right now so we're already lumping the tests together as is.

@chukarsten merged commit 26db19b into main on Jul 21, 2021
@chukarsten deleted the karstengine branch on Jul 21, 2021
@chukarsten mentioned this pull request on Jul 22, 2021
@chukarsten mentioned this pull request on Aug 23, 2021