
[FIX,AUTOSCHEDULER] Fix auto_scheduler to run with multiprocessing's spawn start method #6671

Merged
merged 8 commits into apache:main from fix_multiprocessing on Oct 29, 2020

Conversation

tkonolige
Contributor

This PR fixes autotvm and the auto-scheduler when using multiprocessing's spawn start method. I had to remove all nested function declarations, properly serialize auto_scheduler's SearchTask, and propagate registries to subprocesses. Unfortunately, the registry propagation does not work correctly when running under pytest. I've disabled those tests for now (tests/python/unittest/test_runtime_rpc.py), but they work if run directly. Maybe someone else can take a look at this.

See #6650
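
For context, a minimal illustration (generic Python, not TVM code) of why the nested function declarations had to be removed: the spawn start method pickles the worker target by reference, and pickle can only find functions defined at module top level.

```python
# Minimal illustration (not TVM code) of the nested-function problem with
# "spawn": the pool target is pickled by reference, and pickle can only
# find functions defined at module top level.
import multiprocessing

def top_level_worker(x):
    return x * x

def run():
    def nested_worker(x):  # local function: cannot be pickled
        return x * x

    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(top_level_worker, range(4)))  # works
        # pool.map(nested_worker, range(4))  # AttributeError: Can't pickle local object

if __name__ == "__main__":
    run()
```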

@merrymercy

@merrymercy merrymercy self-assigned this Oct 13, 2020
@merrymercy
Member

merrymercy commented Oct 14, 2020

Note that "spawn" is slower than "fork" (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
We should not change all default settings to "spawn".
We should prefer "fork" as the default setting on Linux and macOS when it is possible.

@tkonolige
Contributor Author

The problem with using fork is that it is unsafe on macOS or when using the CUDA API. The PyTorch developers recommend not using fork with CUDA: https://pytorch.org/docs/master/notes/multiprocessing.html#cuda-in-multiprocessing. It looks like we have to choose something that works over something that is fast.

I also don't think it really matters that spawning is slower than forking. We are not spawning that many processes. And if the overhead from spawning is big enough to cause a significant slowdown, then we shouldn't have been using processes/threads in the first place.

@merrymercy
Member

merrymercy commented Oct 15, 2020

AutoTVM uses a lot of Python multiprocessing, and I expect it will be much slower when using spawn. AutoTVM uses multiprocessing for feature extraction, so it needs to launch about 50,000 tasks every measurement batch.

The situation for Ansor is better as Ansor does not rely on multiprocessing as heavily as AutoTVM does.
However, I'd like to see benchmark numbers about fork vs. spawn.
i.e. Run a search with n_trials=64 with (Ansor, AutoTVM) x (fork, spawn).
The requirement from my side is that I don't want to see any performance regression.
I prefer to only use the slower and safer fallback mechanism when the fast approach does not work.

On the other hand, removing all multiprocessing requires some engineering effort. But this is not on my agenda for now.

@tkonolige tkonolige force-pushed the fix_multiprocessing branch 2 times, most recently from 0862d0f to ad116fc on October 19, 2020
@tkonolige
Contributor Author

@merrymercy spawn is definitely slower. With autotvm, spawn is about 50% slower. I figured out where the CUDA drivers were being called in forked threads and fixed it. Now fork should be used on Linux.
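
A sketch of the platform-dependent choice being described (get_mp_context is a hypothetical helper, not the exact code in this PR): prefer the faster fork on Linux, fall back to the safer spawn elsewhere.

```python
# Sketch of the platform check described above (get_mp_context is a
# hypothetical helper, not the exact code in this PR): prefer the faster
# "fork" on Linux, fall back to the safer "spawn" on macOS and Windows.
import multiprocessing
import sys

def get_mp_context():
    method = "fork" if sys.platform.startswith("linux") else "spawn"
    return multiprocessing.get_context(method)

if __name__ == "__main__":
    with get_mp_context().Pool(4) as pool:
        print(pool.map(abs, [-1, -2, -3]))
```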

@tkonolige
Contributor Author

@merrymercy @tqchen A review on this would be great.

@tqchen
Member

tqchen commented Oct 24, 2020

cc @jcf94 @comaniac @yzhliu please also help to take a look

python/tvm/auto_scheduler/measure.py (outdated)
assert len(inputs) == len(build_results), "Measure input size should be equal to build results"
pool = NoDaemonPool(n_parallel)
tuple_res = pool.map(rpc_run_worker, range(len(build_results)))
# This pool is not doing computationally intensive work, so we can use threads
Member

Did you benchmark the speed of ProcessingPool vs. ThreadPool?
For the comment, is this pool "not doing computationally intensive work" or "not doing computationally intensive work in Python"?

Contributor Author

The pool is doing basically no work. Each thread in the pool spawns a process and then waits till it times out.
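
Roughly, the pattern being described looks like the following (illustrative names, not the actual TVM code): each worker thread only launches a measurement subprocess and blocks until it finishes or times out, so a thread pool is sufficient and cheaper than a process pool.

```python
# Rough sketch of the pattern being described (illustrative names, not the
# actual TVM code): each worker thread just launches a subprocess and waits
# for it to finish or time out, so threads are enough and the GIL is not a
# bottleneck.
import subprocess
from multiprocessing.pool import ThreadPool

def run_one_measurement(cmd, timeout=10):
    try:
        return subprocess.run(cmd, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        return None  # measurement timed out

if __name__ == "__main__":
    commands = [["sleep", "1"]] * 8
    with ThreadPool(4) as pool:
        results = pool.map(run_one_measurement, commands)
```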

python/tvm/autotvm/task/task.py (outdated)
@@ -321,10 +316,11 @@ def _get_feature(self, indexes):

indexes = np.array(indexes)
need_extract = [x for x in indexes if x not in fea_cache]
args = [(self.space.get(x), self.target, self.task) for x in need_extract]
Member

Doing this still needs serializing a lot of things. Did you test the performance before vs. after?

src/auto_scheduler/measure_record.cc
python/tvm/auto_scheduler/utils.py
@@ -87,6 +120,24 @@ def __init__(self, task, state):
state = state if isinstance(state, StateObject) else state.state_object
self.__init_handle_by_constructor__(_ffi_api.MeasureInput, task, state)

def serialize(self):
Member

Can we use __getstate__?

Contributor Author

There was some weird initialization bug with __getstate__ that I could not figure out. I'll add a comment about this.
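
For reference, the standard pickle hooks being discussed look like this (a generic sketch, not the actual MeasureInput implementation):

```python
# Generic sketch of the standard pickle hooks discussed here (not the actual
# MeasureInput implementation): __getstate__ decides what gets pickled,
# __setstate__ rebuilds the object in the spawned worker.
class Example:
    def __init__(self, task, state):
        self.task = task
        self.state = state

    def __getstate__(self):
        # Return only plain-Python data so the child process can reconstruct it.
        return {"task": self.task, "state": self.state}

    def __setstate__(self, data):
        # Called instead of __init__ when unpickling; re-run the constructor
        # so any handles get rebuilt on the worker side.
        self.__init__(data["task"], data["state"])
```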

@merrymercy
Member

Thanks for the refactoring! I would like to see benchmark results for both autotvm and the auto-scheduler, so we don't get a performance regression.

python/tvm/auto_scheduler/workload_registry.py (outdated)
python/tvm/auto_scheduler/workload_registry.py (outdated)
@merrymercy
Member

merrymercy commented Oct 26, 2020

Another concern is that I introduced a new kind of workload in this PR (#6710, https://github.com/apache/incubator-tvm/blob/5d93c61896acafe5d0b76b70615f2e2823cbf3b2/python/tvm/auto_scheduler/workload_registry.py#L158) for relay integration.
When using auto-scheduler in Relay, we don't serialize a ComputeDAG by function + arguments.
We save a copy of the TE expressions generated by relay as the workload.
So we have to support serializing TE expressions after this PR. But I am not sure whether it works because there are pointers in TE expressions.

@tkonolige
Contributor Author

Here are benchmark results on Linux using forking. This is dense with 64 x 64 matrices, n_trials=16, 5 benchmarking iterations.

| Branch | Tuner | Target | Mean Time | STD |
| --- | --- | --- | --- | --- |
| main | autotvm | llvm | 20.429478 | 1.914881 |
| PR | autotvm | llvm | 26.400752 | 9.013005 |
| main | autotvm | cuda | 7.010600 | 0.080864 |
| PR | autotvm | cuda | 7.001349 | 0.017357 |
| main | autoscheduler | llvm | 16.711073 | 2.572820 |
| PR | autoscheduler | llvm | 14.752735 | 0.757895 |
| main | autoscheduler | cuda | 32.981713 | 0.072185 |
| PR | autoscheduler | cuda | 32.283536 | 0.054209 |

Looks pretty similar, except for a high standard deviation in one runtime; I'm not sure what's causing that. I still think we should merge this, as it fixes users not being able to run auto-scheduling on Macs.

@tkonolige tkonolige force-pushed the fix_multiprocessing branch 2 times, most recently from 520f7fa to 0a41780 on October 26, 2020
merrymercy previously approved these changes Oct 27, 2020
@merrymercy
Member

merrymercy commented Oct 27, 2020

The auto-scheduler part looks good to me. I believe this PR solves the problem without introducing any performance regression.
But I am not sure about the autotvm part. Your test case is not correct: with n_trials=16, the code you modified will not even be executed, because the cost model only starts to work after collecting enough training samples (n_trials > 64). To benchmark correctly, you should use a larger matrix size (n=1024), run autotvm with n_trials > 64 and the XGBTuner, and report the time spent in simulated annealing (which actually runs the modified xgboost_cost_model.py).
Could you delete the autotvm part of this PR (files: python/tvm/autotvm/task/task.py, python/tvm/autotvm/tuner/xgboost_cost_model.py) so we can merge it?
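
For reference, the kind of benchmark being requested would look roughly like this (a hedged sketch assuming the tvm.autotvm Python API of this era; the task construction, shapes, and log file name are left to the caller and are not part of this PR):

```python
# Hedged sketch of the requested benchmark, assuming the tvm.autotvm API of
# this era; `task` and the log file name are placeholders, not PR code.
from tvm import autotvm

def run_xgb_benchmark(task, log_file="dense_1024.log"):
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=5, repeat=1),
    )
    # The XGB cost model (and the modified xgboost_cost_model.py) only kicks
    # in after enough samples are collected, hence n_trial > 64.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=100,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(log_file)],
    )
```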

@jcf94
Contributor

jcf94 left a comment

Thanks for your contribution! I don't have more comments on this.

@tkonolige
Contributor Author

@merrymercy The autotvm changes are required for things to work.

@tkonolige
Contributor Author

Here are the autotvm times you requested. 1024x1024 with 100 trials.

| Branch | Tuner | Target | Time |
| --- | --- | --- | --- |
| PR | autotvm | llvm | 297.038928 |
| PR | autotvm | cuda | 15.138037 |
| main | autotvm | llvm | 297.375458 |
| main | autotvm | cuda | 14.163432 |

Results are very similar.

@tkonolige
Contributor Author

They are broken with respect to the multiprocessing changes I have introduced in this PR. Technically speaking, they were never correct, as the script being executed should be wrapped in an if __name__ == "__main__" guard.
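
For illustration, here is why the guard matters under spawn (a generic example, not a TVM script): the child process re-imports the main module, so any unguarded top-level code would run again in every worker.

```python
# Generic example (not a TVM script) of why the guard matters under "spawn":
# the child process re-imports the main module, so unguarded top-level code
# would run again in every worker and try to spawn workers of its own.
import multiprocessing

def worker(x):
    return x + 1

if __name__ == "__main__":
    with multiprocessing.get_context("spawn").Pool(2) as pool:
        print(pool.map(worker, [1, 2, 3]))
```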

@merrymercy
Member

merrymercy commented Oct 27, 2020

Sorry, could you point me to the related lines? I only see that you renamed some variables in python/tvm/autotvm/measure/local_executor.py. Is this because of cloudpickle?

@tkonolige
Contributor Author

tkonolige commented Oct 27, 2020

I was thinking it was cloudpickle reloading the main module, but I took out the __name__ == "__main__" lines and it performs the same. __name__ == "__main__" is still required for non-Linux machines (i.e., the tutorials are buggy on non-Linux machines), but it isn't causing the issue here. The only difference I can see, then, is that your machine has so many threads. Can you reproduce your results on a different machine?

Also, did you remove the tuning logs between runs?

@merrymercy
Member

merrymercy commented Oct 27, 2020

I added __name__ == "__main__" but the results are the same.

Can you delete all autotvm-related code and tutorials from this PR? We can investigate later. In the meantime, can you try on a machine with more cores?
Can you check whether your modifications to the tutorials break their format?
I don't think the tutorial scripts can generate correct RST files after adding indentation to the text block.
You can build the tutorials locally and check the generated HTML files.

This PR and #6710 are conflicting. I want to merge this one first so I can handle the merge and cleanup on my PR. Otherwise, you will have to do more work yourself.

@tkonolige
Contributor Author

Here are the tests I've run on a 64-core machine:

This PR
XGB iter:   0	tr-a-recall@64: 0.598716	tr-map: 0.142857
XGB iter:  25	tr-a-recall@64: 0.667324	tr-map: 0.500000
XGB stopped. Best iteration: [13] tr-a-recall@64:0.67690	tr-map:0.50000
XGB train: 1.24	obs: 64	error: 50	n_cache: 64
SA iter: 50	last_update: 49	max-0: 4.42	max-1: 4.71	temp: 0.90	elapsed: 16.14
SA iter: 100	last_update: 97	max-0: 4.55	max-1: 4.71	temp: 0.80	elapsed: 30.59
SA iter: 150	last_update: 145	max-0: 4.56	max-1: 4.74	temp: 0.70	elapsed: 45.74
SA iter: 200	last_update: 197	max-0: 4.58	max-1: 4.83	temp: 0.60	elapsed: 60.61
SA iter: 250	last_update: 249	max-0: 4.67	max-1: 4.84	temp: 0.50	elapsed: 76.09
SA iter: 300	last_update: 299	max-0: 4.69	max-1: 4.84	temp: 0.40	elapsed: 90.07
SA iter: 350	last_update: 349	max-0: 4.71	max-1: 4.84	temp: 0.30	elapsed: 104.27
SA iter: 400	last_update: 377	max-0: 4.71	max-1: 4.84	temp: 0.20	elapsed: 118.12
SA iter: 450	last_update: 448	max-0: 4.72	max-1: 4.84	temp: 0.10	elapsed: 130.86
SA iter: 500	last_update: 498	max-0: 4.76	max-1: 4.84	temp: 0.00	elapsed: 143.39
SA iter: 500	last_update: 498	elapsed: 143.39

main
XGB iter:   0	tr-a-recall@64: 0.641594	tr-map: 0.500000
XGB iter:  25	tr-a-recall@64: 0.716034	tr-map: 0.500000
XGB iter:  50	tr-a-recall@64: 0.728245	tr-map: 1.000000
XGB stopped. Best iteration: [38] tr-a-recall@64:0.73020	tr-map:0.50000
XGB train: 1.10	obs: 64	error: 46	n_cache: 64
SA iter: 50	last_update: 49	max-0: 6.21	max-1: 6.94	temp: 0.90	elapsed: 3.40
SA iter: 100	last_update: 99	max-0: 6.50	max-1: 7.16	temp: 0.80	elapsed: 6.68
SA iter: 150	last_update: 149	max-0: 6.74	max-1: 7.38	temp: 0.70	elapsed: 10.24
SA iter: 200	last_update: 199	max-0: 6.86	max-1: 7.39	temp: 0.60	elapsed: 13.91
SA iter: 250	last_update: 249	max-0: 6.97	max-1: 7.39	temp: 0.50	elapsed: 17.68
SA iter: 300	last_update: 294	max-0: 7.07	max-1: 7.56	temp: 0.40	elapsed: 21.29
SA iter: 350	last_update: 348	max-0: 7.11	max-1: 7.56	temp: 0.30	elapsed: 24.52
SA iter: 400	last_update: 399	max-0: 7.15	max-1: 7.56	temp: 0.20	elapsed: 27.68
SA iter: 450	last_update: 449	max-0: 7.16	max-1: 7.56	temp: 0.10	elapsed: 30.63
SA iter: 500	last_update: 499	max-0: 7.17	max-1: 7.56	temp: 0.00	elapsed: 33.48
SA iter: 500	last_update: 499	elapsed: 33.48

Seems like the issue is hit when there are a lot of cores.

I've pulled out the autotvm parts and will send another PR once I figure those out.

@merrymercy
Member

merrymercy commented Oct 28, 2020

How about also undoing the tutorial changes? I guess their format will be wrong because you added indentation to text blocks.
For the autotvm problem, I am okay with accepting the following solution:
provide two versions of the code and pick the version according to the OS.

@tqchen
Member

tqchen commented Oct 28, 2020

The Sphinx gallery will run all tutorials together, and it would still be useful to keep the tutorials in ipynb style (no main guard).

@tkonolige
Contributor Author

Maybe we should at least leave a comment that they are broken when using the spawn method?

tkonolige pushed a commit to tkonolige/incubator-tvm that referenced this pull request Oct 28, 2020
Like apache#6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
merrymercy pushed a commit that referenced this pull request Oct 29, 2020
Like #6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
@merrymercy merrymercy merged commit 6be6363 into apache:main Oct 29, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
Like apache#6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
…spawn start method (apache#6671)

* Fix multiprocessing with spawn issues

* address reviewer feedback

* Fix tutorials

* formatting

* undo autotvm work

* Undo tutorial changes

* Add spawn tests

* fix test
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
Like apache#6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
…spawn start method (apache#6671)

* Fix multiprocessing with spawn issues

* address reviewer feedback

* Fix tutorials

* formatting

* undo autotvm work

* Undo tutorial changes

* Add spawn tests

* fix test
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
Like apache#6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
…spawn start method (apache#6671)

* Fix multiprocessing with spawn issues

* address reviewer feedback

* Fix tutorials

* formatting

* undo autotvm work

* Undo tutorial changes

* Add spawn tests

* fix test