
[Auto-Schedule][Fix] Fix hang while tune model through rpc #9032

Merged
merged 4 commits into apache:main from echuraev/fix_rpc_hangs on Sep 23, 2021

Conversation

echuraev
Contributor

No description provided.

@echuraev
Contributor Author

@shingjan please take a look at this PR. After #8492 tuning through RPC no longer works properly: it hangs. The problem is in resource management and the global dictionary with arguments: the global arguments don't release the resources they hold, and the RPC session hangs until they are released.
Could you clarify why you added args as an argument to _timed_eval_func and _rpc_run?
With this fix, the hang during tuning is gone.
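
To illustrate the kind of leak I mean, here is a generic Python sketch under my own assumptions (GLOBAL_ARGS and RemoteArray are made-up names, not the actual auto_scheduler code):

# Generic illustration of the reference-holding problem described above.
# GLOBAL_ARGS and RemoteArray are hypothetical names, not TVM code.
GLOBAL_ARGS = {}


class RemoteArray:
    """Stand-in for a device ndarray whose lifetime keeps an RPC endpoint open."""

    def __init__(self, endpoint):
        self.endpoint = endpoint


def run_measurement(endpoint):
    # The array is stored in a module-level dict, so the reference outlives
    # this call; nothing ever drops GLOBAL_ARGS["x"], so `endpoint` (and the
    # RPC session behind it) is never released.
    GLOBAL_ARGS["x"] = RemoteArray(endpoint)
    # ... run the kernel on the remote device ...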

@vinx13 vinx13 self-assigned this Sep 16, 2021
@vinx13
Member

vinx13 commented Sep 16, 2021

args was added because some workloads, such as sparse ones, require predefined inputs that shouldn't be randomly generated. These inputs are looked up from a global dictionary that isn't propagated to the popen workers, so we fetch the predefined arguments in the main process and pass them to the workers. Is the hang related to this global dictionary? With this PR I'm afraid the lookup won't work, because it is called from the worker side.

@echuraev
Contributor Author

echuraev commented Sep 16, 2021

> args was added because some workloads, such as sparse ones, require predefined inputs that shouldn't be randomly generated. These inputs are looked up from a global dictionary that isn't propagated to the popen workers, so we fetch the predefined arguments in the main process and pass them to the workers. Is the hang related to this global dictionary? With this PR I'm afraid the lookup won't work, because it is called from the worker side.

Yes, the hang is related to the global dictionary. In that case I have another workaround for the hang: in _rpc_run and _timed_eval_func we can create a copy of the global arguments and work with that copy. It works this way; I have already tested it. I mean something like this:

            assert len(args) == len(build_res.args)
+            import copy
+            loc_args = copy.deepcopy(args)
            # pylint: disable=consider-using-enumerate
-            for idx in range(len(args)):
-                if args[idx] is None:
+            for idx in range(len(loc_args)):
+                if loc_args[idx] is None:
                    build_res_arg = build_res.args[idx]
                    empty_array = ndarray.empty(
                        get_const_tuple(build_res_arg.shape), build_res_arg.dtype, dev
                    )
                    random_fill(empty_array)
-                    args[idx] = empty_array
+                    loc_args[idx] = empty_array
                else:
-                    args[idx] = ndarray.array(args[idx], dev)
+                    loc_args[idx] = ndarray.array(args[idx], dev)

I don't really like this workaround, though, because it requires a deep copy of the arguments. What do you think about it?

@vinx13
Member

vinx13 commented Sep 16, 2021

Sorry, I'm a bit confused. Is it because args is not being properly managed? It should be a Python list with numpy ndarrays as elements; it is pickled, sent to the workers, and then unpickled. Do you know exactly which resource is not freed?

@echuraev
Contributor Author

> Is it because args is not being properly managed?

I think so. When args comes from the global dictionary, some object holds the rpc_endpoint during tuning and we cannot destroy it. This is why the hang happens.

> Do you know exactly which resource is not freed?

I tried to find the exact object which holds the resources and leads to the hang, but didn't find it. Thinking about it now, I suppose it could be the empty ndarrays that were created, but I have to check. Maybe a good solution would be to regenerate the arguments which were not predefined for each run.

@vinx13
Member

vinx13 commented Sep 16, 2021

> But I have to check. Maybe a good solution would be to regenerate the arguments which were not predefined for each run.

I think this is already the case. Currently, on the main process side, we look up the dictionary via prepare_runner_args, which creates a Python list of arguments containing only the predefined args (converted to numpy arrays to make them picklable). For every argument that is not predefined, None is added to the list as a placeholder. This list is then sent to the workers, and each worker replaces None with a randomly generated tvm.ndarray.
Empty arrays created in timed_rpc_run should be destroyed after running the program on the device.
Also, if you can provide steps to reproduce the hang, I can definitely take a look.
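
For reference, a rough sketch of the main-process side of that flow (simplified pseudocode under assumptions; the name prepare_args_sketch and the task_inputs dict are illustrative, not the actual prepare_runner_args implementation):

# Sketch of the placeholder mechanism described above; names are assumptions.
def prepare_args_sketch(task_inputs, arg_names):
    """Main process: keep predefined inputs, use None as a placeholder."""
    args = []
    for name in arg_names:
        if name in task_inputs:
            # Assumed to already be a picklable numpy array, so it can be
            # sent to the popen worker.
            args.append(task_inputs[name])
        else:
            args.append(None)  # filled in with a random ndarray on the worker side
    return args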

@echuraev
Contributor Author

> Empty arrays created in timed_rpc_run should be destroyed after running the program on the device.

Who guarantees that the arrays will be destroyed? We have a global variable which was passed into the function, and inside the function new arrays were created in that global object. After the program has run on the device, these arrays won't be destroyed, because they are part of the global object. Or am I wrong?

I updated this PR with another fix for the problem: I create a local copy of the arguments in _rpc_run and _timed_eval_func. When we leave the function scope, the local objects are destroyed, which guarantees that the arrays are released.

Another workable fix is to set the arguments to None after the function has executed. For example:

            assert len(args) == len(build_res.args)
+            indices = []
            # pylint: disable=consider-using-enumerate
            for idx in range(len(args)):
                if args[idx] is None:
                    build_res_arg = build_res.args[idx]
                    empty_array = ndarray.empty(
                        get_const_tuple(build_res_arg.shape), build_res_arg.dtype, dev
                    )
                    random_fill(empty_array)
+                    indices.append(idx)
                    args[idx] = empty_array
                else:
                    args[idx] = ndarray.array(args[idx], dev)
            dev.sync()

            # First run for check that the kernel is correct
            func.entry_func(*args)
            dev.sync()

            costs = time_f(*args).results

+            for idx in indices:
+                args[idx] = None

In this case, the arrays will also be destroyed.

As for steps to reproduce the issue (a rough sketch of the commands follows the list):

  1. Run an rpc_tracker.
  2. Connect an Android or iOS device to the tracker using the AndroidRPC or ios_rpc applications.
  3. Run a tuning session on this device; after several trials the tuning will hang.
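
A minimal sketch of steps 1 and 3, assuming a tracker on 127.0.0.1:9190 and a device registered under the key "android" (the matmul workload, device key, log file name, and trial count are my own illustrative choices, not taken from this PR):

# Step 1 (shell): start the tracker.
#   python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
# Step 2: register the device through the AndroidRPC / ios_rpc app under the key "android".
# Step 3 (Python): tune a toy workload through the tracker.
import tvm
from tvm import te, auto_scheduler


@auto_scheduler.register_workload
def matmul(N, M, K):
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]


target = tvm.target.Target("llvm -mtriple=arm64-linux-android")
task = auto_scheduler.SearchTask(func=matmul, args=(512, 512, 512), target=target)

runner = auto_scheduler.RPCRunner(key="android", host="127.0.0.1", port=9190)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=64,
    runner=runner,
    measure_callbacks=[auto_scheduler.RecordToFile("matmul_android.json")],
)
# Before the fix, this could hang after a few trials.
task.tune(tune_option)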

@vinx13
Member

vinx13 commented Sep 17, 2021

I see. If I understand correctly, the issue is that when the popen worker gets a copy of args here https://github.com/apache/tvm/blob/main/python/tvm/exec/popen_worker.py#L77 and calls timed_rpc_run, it creates new ndarrays as elements of args, and when timed_rpc_run finishes it doesn't free these ndarrays. This looks like a Python GC issue? Note that args is not a global dictionary, it is a local copy in each worker, so it seems you don't need to record the indices of the ndarrays (in the alternative fix you provided). Did you try setting args=None, or setting all elements to None (the predefined arrays are also converted to ndarray from numpy arrays locally in each popen worker, so they should also be freed)? Otherwise the alternative fix looks good to me, as it avoids deep copying.

@echuraev
Contributor Author

> Did you try setting args=None, or setting all elements to None (the predefined arrays are also converted to ndarray from numpy arrays locally in each popen worker, so they should also be freed)? Otherwise the alternative fix looks good to me, as it avoids deep copying.

Yes, I tried it and it works fine. I wrote the code with indices only because I wasn't sure whether I could set the predefined args to None. If you are okay with the alternative fix (setting the arguments to None), I'll update this PR tomorrow. Thank you very much.
The only thing I don't really like about this fix is that we need to set the elements to None in both cases: after the function has executed and also when an exception is raised.
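
For illustration, a hedged sketch of keeping that in a single place with try/finally; args, func, time_f, and dev are the names from the snippets above, and this is not the final patch:

# Sketch only: release the arrays whether the run succeeds or raises.
try:
    # First run to check that the kernel is correct.
    func.entry_func(*args)
    dev.sync()
    costs = time_f(*args).results
finally:
    # Drop the references so the ndarrays can be freed even on error.
    for idx in range(len(args)):
        args[idx] = None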

@vinx13
Member

vinx13 commented Sep 17, 2021

Alternatively, we can try creating a local list without a deep copy. I think the issue is that the arguments sent to the popen workers are not freed immediately (not sure why); if so, creating another list should work:

loc_args = []
for idx, arg in enumerate(args):
    if arg is None:
        # create and randomly fill an empty device array, as in the snippets above
        empty_array = ndarray.empty(
            get_const_tuple(build_res.args[idx].shape), build_res.args[idx].dtype, dev
        )
        random_fill(empty_array)
        loc_args.append(empty_array)
    else:
        loc_args.append(ndarray.array(arg, dev))

@echuraev
Contributor Author

@vinx13 sorry for the delay, I was busy over the weekend. I updated the PR and added the fix with local arguments. Could you please take a look once again?

Contributor

@shingjan shingjan left a comment


LGTM. Probably need to re-trigger CI. BTW, is this an issue that PopenPoolExecutor will always have with parameter/argument passing?

@vinx13
Member

vinx13 commented Sep 20, 2021

@shingjan not sure why args sent to the popen worker are not immediately GC'ed.

@shingjan
Contributor

@echuraev after #9052 merges, you can rebase and the CI issue should be fixed.

@echuraev
Contributor Author

> @echuraev after #9052 merges, you can rebase and the CI issue should be fixed.

Thank you! I'll wait until that PR is merged.

@echuraev echuraev force-pushed the echuraev/fix_rpc_hangs branch 2 times, most recently from cd14bbb to 80df07a on September 22, 2021 11:26
Member

@vinx13 vinx13 left a comment


Found some minor change needed

python/tvm/auto_scheduler/measure.py (review comment on this file; outdated, resolved)
@echuraev
Contributor Author

> Found some minor change needed

Sorry, I only now saw that you added a comment to this PR. Thank you for the change.

@vinx13 vinx13 merged commit ee6fe12 into apache:main Sep 23, 2021
@vinx13
Member

vinx13 commented Sep 23, 2021

Thanks @echuraev @shingjan, this PR is merged!

@echuraev echuraev deleted the echuraev/fix_rpc_hangs branch September 24, 2021 10:36
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
* [Auto-Schedule][Fix] Fix hang while tune model through rpc

* Fix problem with hang by using deep copy

* Fix with local args

* Update python/tvm/auto_scheduler/measure.py

Co-authored-by: Wuwei Lin <vincentl13x@gmail.com>
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
* [Auto-Schedule][Fix] Fix hang while tune model through rpc

* Fix problem with hang by using deep copy

* Fix with local args

* Update python/tvm/auto_scheduler/measure.py

Co-authored-by: Wuwei Lin <vincentl13x@gmail.com>