
[FIX,AUTOSCHEDULER] Fix auto_scheduler to run with multiprocessing's spawn start method #6671

Merged
merged 8 commits into apache:main from fix_multiprocessing on Oct 29, 2020

Conversation

tkonolige
Contributor

This PR fixes autotvm and the auto-scheduler when using multiprocessing's spawn start method. I had to remove all nested function declarations, properly serialize auto_scheduler's SearchTask, and propagate registries to subprocesses. Unfortunately, the registry propagation does not work correctly when running under pytest. I've disabled those tests for now (tests/python/unittest/test_runtime_rpc.py), but they work if run directly. Maybe someone else can take a look at this.

See #6650
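
For context, a minimal illustration (generic Python, not TVM code) of why the nested function declarations had to be removed: the spawn start method pickles the worker target by reference, and pickle can only find functions defined at module top level.

```python
# Minimal illustration (not TVM code) of the nested-function problem with
# "spawn": the pool target is pickled by reference, and pickle can only
# find functions defined at module top level.
import multiprocessing

def top_level_worker(x):
    return x * x

def run():
    def nested_worker(x):  # local function: cannot be pickled
        return x * x

    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(top_level_worker, range(4)))  # works
        # pool.map(nested_worker, range(4))  # AttributeError: Can't pickle local object

if __name__ == "__main__":
    run()
```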

@merrymercy

@merrymercy merrymercy self-assigned this Oct 13, 2020
@merrymercy
Member

merrymercy commented Oct 14, 2020

Note that "spawn" is slower than "fork" (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
We should not change all default settings to "spawn".
We should prefer "fork" as the default setting on Linux and macOS when it is possible.

@tkonolige
Contributor Author

The problem with using fork is that it is unsafe on macOS or when using the CUDA API. The PyTorch developers recommend not using fork with CUDA: https://pytorch.org/docs/master/notes/multiprocessing.html#cuda-in-multiprocessing. It looks like we have to choose something that works over something that is fast.

I also don't think it really matters that spawning is slower than forking. We are not spawning that many processes. And if the overhead from spawning is big enough to cause a significant slowdown, then we shouldn't have been using processes/threads in the first place.

@merrymercy
Member

merrymercy commented Oct 15, 2020

AutoTVM uses a lot of Python multiprocessing, and I expect it will be much slower when using spawn. AutoTVM uses multiprocessing for feature extraction, so it needs to launch about 50,000 tasks every measurement batch.

The situation for Ansor is better as Ansor does not rely on multiprocessing as heavily as AutoTVM does.
However, I'd like to see benchmark numbers about fork vs. spawn.
i.e. Run a search with n_trials=64 with (Ansor, AutoTVM) x (fork, spawn).
The requirement from my side is that I don't want to see any performance regression.
I prefer to only use the slower and safer fallback mechanism when the fast approach does not work.

On the other hand, removing all multiprocessing requires some engineering effort. But this is not on my agenda for now.

@tkonolige tkonolige force-pushed the fix_multiprocessing branch 2 times, most recently from 0862d0f to ad116fc on October 19, 2020
@tkonolige
Contributor Author

@merrymercy spawn is definitely slower. With autotvm, spawn is about 50% slower. I figured out where the CUDA drivers were being called in forked threads and fixed it. Now fork should be used on Linux.
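
A sketch of the platform-dependent choice being described (get_mp_context is a hypothetical helper, not the exact code in this PR): prefer the faster fork on Linux, fall back to the safer spawn elsewhere.

```python
# Sketch of the platform check described above (get_mp_context is a
# hypothetical helper, not the exact code in this PR): prefer the faster
# "fork" on Linux, fall back to the safer "spawn" on macOS and Windows.
import multiprocessing
import sys

def get_mp_context():
    method = "fork" if sys.platform.startswith("linux") else "spawn"
    return multiprocessing.get_context(method)

if __name__ == "__main__":
    with get_mp_context().Pool(4) as pool:
        print(pool.map(abs, [-1, -2, -3]))
```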

@tkonolige
Contributor Author

@merrymercy @tqchen A review on this would be great.

@tqchen
Member

tqchen commented Oct 24, 2020

cc @jcf94 @comaniac @yzhliu please also help to take a look

python/tvm/auto_scheduler/measure.py (outdated)
assert len(inputs) == len(build_results), "Measure input size should be equal to build results"
pool = NoDaemonPool(n_parallel)
tuple_res = pool.map(rpc_run_worker, range(len(build_results)))
# This pool is not doing computationally intensive work, so we can use threads
Member

Did you benchmark the speed of ProcessingPool vs. ThreadPool?
For the comment, is this pool "not doing computationally intensive work" or "not doing computationally intensive work in Python"?

Contributor Author

The pool is doing basically no work. Each thread in the pool spawns a process and then waits till it times out.
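
Roughly, the pattern being described looks like the following (illustrative names, not the actual TVM code): each worker thread only launches a measurement subprocess and blocks until it finishes or times out, so a thread pool is sufficient and cheaper than a process pool.

```python
# Rough sketch of the pattern being described (illustrative names, not the
# actual TVM code): each worker thread just launches a subprocess and waits
# for it to finish or time out, so threads are enough and the GIL is not a
# bottleneck.
import subprocess
from multiprocessing.pool import ThreadPool

def run_one_measurement(cmd, timeout=10):
    try:
        return subprocess.run(cmd, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        return None  # measurement timed out

if __name__ == "__main__":
    commands = [["sleep", "1"]] * 8
    with ThreadPool(4) as pool:
        results = pool.map(run_one_measurement, commands)
```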

python/tvm/autotvm/task/task.py (outdated)
@@ -321,10 +316,11 @@ def _get_feature(self, indexes):

indexes = np.array(indexes)
need_extract = [x for x in indexes if x not in fea_cache]
args = [(self.space.get(x), self.target, self.task) for x in need_extract]
Member

Doing this still needs serializing a lot of things. Did you test the performance before vs. after?

src/auto_scheduler/measure_record.cc
python/tvm/auto_scheduler/utils.py
@@ -87,6 +120,24 @@ def __init__(self, task, state):
state = state if isinstance(state, StateObject) else state.state_object
self.__init_handle_by_constructor__(_ffi_api.MeasureInput, task, state)

def serialize(self):
Member

Can we use __getstate__?

Contributor Author

There was some weird initialization bug with __getstate__ that I could not figure out. I'll add a comment about this.
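
For reference, the standard pickle hooks being discussed look like this (a generic sketch, not the actual MeasureInput implementation):

```python
# Generic sketch of the standard pickle hooks discussed here (not the actual
# MeasureInput implementation): __getstate__ decides what gets pickled,
# __setstate__ rebuilds the object in the spawned worker.
class Example:
    def __init__(self, task, state):
        self.task = task
        self.state = state

    def __getstate__(self):
        # Return only plain-Python data so the child process can reconstruct it.
        return {"task": self.task, "state": self.state}

    def __setstate__(self, data):
        # Called instead of __init__ when unpickling; re-run the constructor
        # so any handles get rebuilt on the worker side.
        self.__init__(data["task"], data["state"])
```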

@merrymercy
Member

Thanks for the refactoring! I would like to see benchmark results for both autotvm and the auto-scheduler, so we don't get a performance regression.

python/tvm/auto_scheduler/workload_registry.py (outdated)
python/tvm/auto_scheduler/workload_registry.py (outdated)
@merrymercy
Member

merrymercy commented Oct 26, 2020

Another concern is that I introduced a new kind of workload in this PR (#6710, https://github.com/apache/incubator-tvm/blob/5d93c61896acafe5d0b76b70615f2e2823cbf3b2/python/tvm/auto_scheduler/workload_registry.py#L158) for relay integration.
When using auto-scheduler in Relay, we don't serialize a ComputeDAG by function + arguments.
We save a copy of the TE expressions generated by relay as the workload.
So we have to support serializing TE expressions after this PR. But I am not sure whether it works because there are pointers in TE expressions.

@tkonolige
Contributor Author

Here are benchmark results on Linux using forking. This is dense with 64 x 64 matrices, n_trials=16, 5 benchmarking iterations.

| Branch | Tuner | Target | Mean Time | STD |
| --- | --- | --- | --- | --- |
| main | autotvm | llvm | 20.429478 | 1.914881 |
| PR | autotvm | llvm | 26.400752 | 9.013005 |
| main | autotvm | cuda | 7.010600 | 0.080864 |
| PR | autotvm | cuda | 7.001349 | 0.017357 |
| main | autoscheduler | llvm | 16.711073 | 2.572820 |
| PR | autoscheduler | llvm | 14.752735 | 0.757895 |
| main | autoscheduler | cuda | 32.981713 | 0.072185 |
| PR | autoscheduler | cuda | 32.283536 | 0.054209 |

Looks pretty similar, except for a high standard deviation in one runtime; I'm not sure what's causing that. I still think we should merge this, as it fixes users not being able to run auto-scheduling on Macs.

@tkonolige tkonolige force-pushed the fix_multiprocessing branch 2 times, most recently from 520f7fa to 0a41780 on October 26, 2020
merrymercy previously approved these changes Oct 27, 2020
@merrymercy
Member

merrymercy commented Oct 27, 2020

The auto-scheduler part looks good to me. I believe this PR solves the problem without introducing any performance regression.
But I am not sure about the autotvm part. Your test case is not correct: with n_trials=16, the code you modified will not even be executed, because the cost model only starts to work after collecting enough training samples (n_trials > 64). To benchmark correctly, you should use a larger matrix size (n=1024), run autotvm with n_trials > 64 and the XGBTuner, and report the time spent in simulated annealing (which actually runs the modified xgboost_cost_model.py).
Could you delete the autotvm part of this PR (files: python/tvm/autotvm/task/task.py, python/tvm/autotvm/tuner/xgboost_cost_model.py) so we can merge it?
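
For reference, the kind of benchmark being requested would look roughly like this (a hedged sketch assuming the tvm.autotvm Python API of this era; the task construction, shapes, and log file name are left to the caller and are not part of this PR):

```python
# Hedged sketch of the requested benchmark, assuming the tvm.autotvm API of
# this era; `task` and the log file name are placeholders, not PR code.
from tvm import autotvm

def run_xgb_benchmark(task, log_file="dense_1024.log"):
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(number=5, repeat=1),
    )
    # The XGB cost model (and the modified xgboost_cost_model.py) only kicks
    # in after enough samples are collected, hence n_trial > 64.
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=100,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(log_file)],
    )
```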

@jcf94
Contributor

jcf94 left a comment

Thanks for your contribution! I don't have more comments on this.

@tkonolige
Contributor Author

@merrymercy The autotvm changes are required for things to work.

@tkonolige
Contributor Author

Here are the autotvm times you requested. 1024x1024 with 100 trials.

| Branch | Tuner | Target | Time |
| --- | --- | --- | --- |
| PR | autotvm | llvm | 297.038928 |
| PR | autotvm | cuda | 15.138037 |
| main | autotvm | llvm | 297.375458 |
| main | autotvm | cuda | 14.163432 |

Results are very similar.

@tkonolige
Contributor Author

They are broken with respect to the multiprocessing changes I have introduced in this PR. Technically speaking, they were never correct, as the script being executed should be wrapped in an if __name__ == "__main__" guard.
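
For illustration, here is why the guard matters under spawn (a generic example, not a TVM script): the child process re-imports the main module, so any unguarded top-level code would run again in every worker.

```python
# Generic example (not a TVM script) of why the guard matters under "spawn":
# the child process re-imports the main module, so unguarded top-level code
# would run again in every worker and try to spawn workers of its own.
import multiprocessing

def worker(x):
    return x + 1

if __name__ == "__main__":
    with multiprocessing.get_context("spawn").Pool(2) as pool:
        print(pool.map(worker, [1, 2, 3]))
```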

@merrymercy
Member

merrymercy commented Oct 27, 2020

Sorry, could you point me to the related lines? I only see that you renamed some variables in python/tvm/autotvm/measure/local_executor.py. Is this because of cloudpickle?

@tkonolige
Contributor Author

tkonolige commented Oct 27, 2020

I was thinking it was cloudpickle reloading the main module, but I took out the __name__ == "__main__" lines and it performs the same. __name__ == "__main__" is still required for non-Linux machines (i.e., the tutorials are buggy on non-Linux machines), but it isn't causing the issue here. The only difference I can see, then, is that your machine has so many threads. Can you reproduce your results on a different machine?

Also, did you remove the tuning logs between runs?

@merrymercy
Member

merrymercy commented Oct 27, 2020

I added __name__ == "__main__" but the results are the same.

Can you delete all autotvm-related code and tutorials from this PR? We can investigate later. In the meantime, can you try on a machine with more cores?
Can you check whether your modifications to the tutorials break their format?
I don't think the tutorial scripts can generate correct RST files after adding indentation to the text block.
You can build the tutorials locally and check the generated HTML files.

This PR and #6710 are conflicting. I want to merge this one first so I can handle the merge and cleanup on my PR. Otherwise, you will have to do more work yourself.

@tkonolige
Contributor Author

Here are the tests I've run on a 64-core machine:

This PR
XGB iter:   0	tr-a-recall@64: 0.598716	tr-map: 0.142857
XGB iter:  25	tr-a-recall@64: 0.667324	tr-map: 0.500000
XGB stopped. Best iteration: [13] tr-a-recall@64:0.67690	tr-map:0.50000
XGB train: 1.24	obs: 64	error: 50	n_cache: 64
SA iter: 50	last_update: 49	max-0: 4.42	max-1: 4.71	temp: 0.90	elapsed: 16.14
SA iter: 100	last_update: 97	max-0: 4.55	max-1: 4.71	temp: 0.80	elapsed: 30.59
SA iter: 150	last_update: 145	max-0: 4.56	max-1: 4.74	temp: 0.70	elapsed: 45.74
SA iter: 200	last_update: 197	max-0: 4.58	max-1: 4.83	temp: 0.60	elapsed: 60.61
SA iter: 250	last_update: 249	max-0: 4.67	max-1: 4.84	temp: 0.50	elapsed: 76.09
SA iter: 300	last_update: 299	max-0: 4.69	max-1: 4.84	temp: 0.40	elapsed: 90.07
SA iter: 350	last_update: 349	max-0: 4.71	max-1: 4.84	temp: 0.30	elapsed: 104.27
SA iter: 400	last_update: 377	max-0: 4.71	max-1: 4.84	temp: 0.20	elapsed: 118.12
SA iter: 450	last_update: 448	max-0: 4.72	max-1: 4.84	temp: 0.10	elapsed: 130.86
SA iter: 500	last_update: 498	max-0: 4.76	max-1: 4.84	temp: 0.00	elapsed: 143.39
SA iter: 500	last_update: 498	elapsed: 143.39

main
XGB iter:   0	tr-a-recall@64: 0.641594	tr-map: 0.500000
XGB iter:  25	tr-a-recall@64: 0.716034	tr-map: 0.500000
XGB iter:  50	tr-a-recall@64: 0.728245	tr-map: 1.000000
XGB stopped. Best iteration: [38] tr-a-recall@64:0.73020	tr-map:0.50000
XGB train: 1.10	obs: 64	error: 46	n_cache: 64
SA iter: 50	last_update: 49	max-0: 6.21	max-1: 6.94	temp: 0.90	elapsed: 3.40
SA iter: 100	last_update: 99	max-0: 6.50	max-1: 7.16	temp: 0.80	elapsed: 6.68
SA iter: 150	last_update: 149	max-0: 6.74	max-1: 7.38	temp: 0.70	elapsed: 10.24
SA iter: 200	last_update: 199	max-0: 6.86	max-1: 7.39	temp: 0.60	elapsed: 13.91
SA iter: 250	last_update: 249	max-0: 6.97	max-1: 7.39	temp: 0.50	elapsed: 17.68
SA iter: 300	last_update: 294	max-0: 7.07	max-1: 7.56	temp: 0.40	elapsed: 21.29
SA iter: 350	last_update: 348	max-0: 7.11	max-1: 7.56	temp: 0.30	elapsed: 24.52
SA iter: 400	last_update: 399	max-0: 7.15	max-1: 7.56	temp: 0.20	elapsed: 27.68
SA iter: 450	last_update: 449	max-0: 7.16	max-1: 7.56	temp: 0.10	elapsed: 30.63
SA iter: 500	last_update: 499	max-0: 7.17	max-1: 7.56	temp: 0.00	elapsed: 33.48
SA iter: 500	last_update: 499	elapsed: 33.48

Seems like the issue is hit when there are a lot of cores.

I've pulled out the autotvm parts and will send another PR once I figure those out.

@merrymercy
Member

merrymercy commented Oct 28, 2020

How about also undoing the tutorial changes? I guess their format will be wrong because you added indentation to text blocks.
For the autotvm problem, I am okay with accepting the following solution:
provide two versions of the code and pick the version according to the OS.

@tqchen
Member

tqchen commented Oct 28, 2020

The Sphinx gallery will run all tutorials together, and it would still be useful to keep the tutorials in ipynb style (no main guard).

@tkonolige
Contributor Author

Maybe we should at least leave a comment that they are broken when using the spawn method?

tkonolige pushed a commit to tkonolige/incubator-tvm that referenced this pull request Oct 28, 2020
Like apache#6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
merrymercy pushed a commit that referenced this pull request Oct 29, 2020
Like #6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
@merrymercy merrymercy merged commit 6be6363 into apache:main Oct 29, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
Like apache#6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
…spawn start method (apache#6671)

* Fix multiprocessing with spawn issues

* address reviewer feedback

* Fix tutorials

* formatting

* undo autotvm work

* Undo tutorial changes

* Add spawn tests

* fix test
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
Like apache#6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
…spawn start method (apache#6671)

* Fix multiprocessing with spawn issues

* address reviewer feedback

* Fix tutorials

* formatting

* undo autotvm work

* Undo tutorial changes

* Add spawn tests

* fix test
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
Like apache#6671 this PR fixes autotvm when using the spawn start method for
multiprocessing. I've added some tests to make sure that things work
with spawn in the CI.
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
…spawn start method (apache#6671)

* Fix multiprocessing with spawn issues

* address reviewer feedback

* Fix tutorials

* formatting

* undo autotvm work

* Undo tutorial changes

* Add spawn tests

* fix test