[FIX,AUTOSCHEDULER] Fix auto_scheduler to run with multiprocessing's spawn start method #6671
Conversation
Note that "spawn" is slower than "fork" (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
The problem with using fork is that it is unsafe on macOS or when using the CUDA API. The PyTorch developers recommend not using fork with CUDA: https://pytorch.org/docs/master/notes/multiprocessing.html#cuda-in-multiprocessing. It looks like we have to choose something that works over something that is fast. I also don't think it really matters that spawning is slower than forking: we are not spawning that many processes, and if the overhead from spawning were big enough to cause a significant slowdown, then we shouldn't have been using processes/threads in the first place.
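The trade-off above can be sketched with a minimal example: a pool created with an explicit "spawn" context starts each worker in a fresh interpreter instead of inheriting a forked copy of the parent's state. The worker function here is a placeholder, not TVM code.

```python
import multiprocessing

def on_gpu(x):
    # Placeholder worker: in a spawned child, driver state such as a CUDA
    # context is initialized fresh rather than inherited from a forked
    # parent, which is what makes "spawn" the safe choice.
    return x * 2

if __name__ == "__main__":
    # force=True lets the script run even if a start method was already set.
    multiprocessing.set_start_method("spawn", force=True)
    with multiprocessing.Pool(2) as pool:
        print(pool.map(on_gpu, [1, 2, 3]))  # [2, 4, 6]
```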
AutoTVM uses a lot of Python multiprocessing, and I expect it will be much slower when using spawn. AutoTVM uses multiprocessing for feature extraction, so it needs to launch about 50,000 tasks every measurement batch. The situation for Ansor is better, as Ansor does not rely on multiprocessing as heavily as AutoTVM does. On the other hand, removing all multiprocessing requires some engineering effort, and this is not on my agenda for now.
@merrymercy spawn is definitely slower; with autotvm, spawn is about 50% slower. I figured out where the CUDA drivers were being called in forked threads and fixed it. Now fork will be used on Linux.
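Picking the start method per platform, as described above, can be sketched like this. The helper name is hypothetical and not part of TVM's API.

```python
import multiprocessing
import sys

def pick_start_method():
    # "fork" is fast and safe enough on Linux (once no CUDA calls happen in
    # forked children); "spawn" is the portable fallback for macOS/Windows.
    return "fork" if sys.platform.startswith("linux") else "spawn"

if __name__ == "__main__":
    ctx = multiprocessing.get_context(pick_start_method())
    print(ctx.get_start_method())
```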
@merrymercy @tqchen A review on this would be great.
assert len(inputs) == len(build_results), "Measure input size should be equal to build results"
pool = NoDaemonPool(n_parallel)
tuple_res = pool.map(rpc_run_worker, range(len(build_results)))
# This pool is not doing computationally intensive work, so we can use threads
Did you benchmark the speed of ProcessingPool vs. ThreadPool?
For the comment, is this pool "not doing computationally intensive work" or "not doing computationally intensive work in Python"?
The pool is doing basically no work. Each thread in the pool spawns a process and then waits till it times out.
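The pattern described here — threads that only launch a subprocess and wait, doing essentially no Python work themselves — can be sketched as follows. The worker function is a hypothetical stand-in for a measurement worker, not TVM's actual runner.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool

def run_one(cmd, timeout=10):
    # Each thread spawns a process and blocks until it exits or times out,
    # so the GIL is not a bottleneck and a ThreadPool suffices.
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return proc.returncode
    except subprocess.TimeoutExpired:
        return None  # treat a hung measurement as a timeout

if __name__ == "__main__":
    cmds = [[sys.executable, "-c", "pass"]] * 4
    with ThreadPool(4) as pool:
        print(pool.map(run_one, cmds))
```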
@@ -321,10 +316,11 @@ def _get_feature(self, indexes):
indexes = np.array(indexes)
need_extract = [x for x in indexes if x not in fea_cache]
args = [(self.space.get(x), self.target, self.task) for x in need_extract]
Doing this still requires serializing a lot of things. Did you test the performance before vs. after?
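One rough way to gauge the serialization overhead the reviewer is asking about is to time `pickle.dumps` on the argument list, since that is essentially what `multiprocessing` does to ship arguments to workers. The tuples below are stand-ins for the real `(space.get(x), target, task)` arguments.

```python
import pickle
import time

def pickled_size_and_time(args):
    # Measure how many bytes and how long it takes to serialize `args`,
    # approximating the per-batch cost multiprocessing pays to ship them.
    start = time.perf_counter()
    payload = pickle.dumps(args)
    return len(payload), time.perf_counter() - start

if __name__ == "__main__":
    # Hypothetical stand-in for ~50,000 feature-extraction tasks per batch.
    args = [(i, "llvm", {"op": "conv2d"}) for i in range(50_000)]
    size, secs = pickled_size_and_time(args)
    print(f"{size} bytes in {secs:.3f}s")
```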
@@ -87,6 +120,24 @@ def __init__(self, task, state):
state = state if isinstance(state, StateObject) else state.state_object
self.__init_handle_by_constructor__(_ffi_api.MeasureInput, task, state)

def serialize(self):
Can we use `__getstate__`?
There was some weird initialization bug with `__getstate__` that I could not figure out. I'll add a comment about this.
Thanks for the refactoring! I would like to see benchmark results for both autotvm and the auto-scheduler, so we don't get a performance regression.
Another concern is that I introduced a new kind of workload in this PR (#6710, https://github.com/apache/incubator-tvm/blob/5d93c61896acafe5d0b76b70615f2e2823cbf3b2/python/tvm/auto_scheduler/workload_registry.py#L158) for relay integration.
Here are benchmark results on Linux using forking:
Looks pretty similar, except for a high standard deviation in runtime; I'm not sure what's causing this. I still think we should merge this, as it fixes users not being able to run auto-scheduling on Macs.
The auto-scheduler part looks good to me. I believe this PR solves the problem while not bringing any performance regression.
Thanks for your contribution! I don't have more comments on this.
@merrymercy The autotvm changes are required for things to work.
Here are the autotvm times you requested: 1024x1024 with 100 trials.
Results are very similar.
They are broken with respect to the multiprocessing changes I have introduced in this PR. Technically speaking, they were never correct, as the script to be executed should be wrapped in
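Context for the point above: the Python `multiprocessing` docs state that under the "spawn" start method the child re-imports the main module, so any process-launching code in a script must sit behind an entry-point guard. A minimal sketch:

```python
import multiprocessing

def work(x):
    return x + 1

# Under "spawn", each child re-imports this module. Without the guard below,
# the pool-creating code would run again in every child and recursively
# spawn more processes.
if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(work, [1, 2]))  # [2, 3]
```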
Sorry, could you point me to the related lines? I only see you rename some variables in
I was thinking it was cloudpickle reloading the main module, but I took off the
Also, did you remove the tuning logs between runs?
I added
Can you delete all autotvm-related code and tutorials in this PR? We can investigate later. In the meantime, can you try on a machine with more cores?
This PR and #6710 are conflicting. I want to merge this first so I can do the merge and cleanup on my PR; otherwise, you will have to do more work yourself.
Here are the tests I've run on a 64-core machine:
It seems like the issue is hit when there are a lot of cores. I've pulled out the autotvm parts and will send another PR when I figure those out.
How about also undoing the tutorials? I guess their format will be wrong because you added indentation to text blocks.
The sphinx-gallery will run all tutorials together, and it would still be useful to keep the tutorials in the ipynb style (no main).
Maybe we should at least leave a comment that they are broken when using the spawn method?
Like apache#6671 this PR fixes autotvm when using the spawn start method for multiprocessing. I've added some tests to make sure that things work with spawn in the CI.
…spawn start method (apache#6671)
* Fix multiprocessing with spawn issues
* address reviewer feedback
* Fix tutorials
* formatting
* undo autotvm work
* Undo tutorial changes
* Add spawn tests
* fix test
This PR fixes autotvm and the auto-scheduler when using multiprocessing's spawn start method. I had to remove all nested function declarations, properly serialize auto_scheduler's `SearchTask`, and propagate registries to subprocesses. Unfortunately, the registry propagation does not work correctly when running under pytest. I've disabled those tests for now (`tests/python/unittest/test_runtime_rpc.py`), but they work if run directly. Maybe someone else can take a look at this. See #6650
@merrymercy