
[TEST] Conv2dGrad Takes long time to finish #8579

Closed
tqchen opened this issue Jul 28, 2021 · 5 comments · Fixed by #8581

@tqchen
Member

tqchen commented Jul 28, 2021

Looking at the test time log, https://ci.tlcpack.ai/job/tvm/job/main/1356/testReport/ctypes.tests.python.relay/test_op_grad_level2/

conv2d grad takes a long time to finish (~1 hour).

Possible cause: this is likely because our numerical gradient checker computes a numerical gradient for every direction, which scales linearly with the number of tensor entries.

Possible resolutions:

  • K0: Use memoization to cache the numerical gradient computations.
  • K1: Run a different kind of test that randomizes the direction and checks consistency; see page 8 of this slide.

I would suggest we go with K1, checking a few random directions instead of every direction.
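The K1 idea can be sketched in plain NumPy (a minimal sketch, not TVM's actual `check_grad` API; the function name, `num_dirs`, and the tolerances are hypothetical choices for illustration): instead of perturbing every tensor entry, compare the analytical directional derivative ⟨∇f, v⟩ against a central finite difference of f along a few random unit directions v.

```python
import numpy as np

def check_grad_random_dirs(f, grad_f, x, num_dirs=3, eps=1e-6, rtol=1e-3):
    """Gradient check along a few random directions (the K1 idea).

    Compares the analytical directional derivative <grad f(x), v>
    against a central finite difference of f along each random unit
    direction v. Cost is O(num_dirs) forward evaluations instead of
    O(x.size) for a full per-entry numerical gradient.
    """
    g = grad_f(x)
    for _ in range(num_dirs):
        v = np.random.randn(*x.shape)
        v /= np.linalg.norm(v)  # unit direction
        numeric = (f(x + eps * v) - f(x - eps * v)) / (2 * eps)
        analytic = float(np.dot(g.ravel(), v.ravel()))
        np.testing.assert_allclose(numeric, analytic, rtol=rtol, atol=1e-6)

# Example: f(x) = sum(x**2) has gradient 2*x, so the check passes.
check_grad_random_dirs(lambda x: np.sum(x ** 2), lambda x: 2 * x,
                       np.random.randn(8, 8))
```

Note that a full per-entry check on this 8×8 input would need 128 forward evaluations, while the randomized check needs only `2 * num_dirs` regardless of tensor size.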

@tqchen changed the title from [TEST] cImprove Numerical Gradient Checker to [TEST] Improve Numerical Gradient Checker on Jul 28, 2021
@tqchen
Member Author

tqchen commented Jul 28, 2021

Follow-up with @altanh suggests that this might be related to a perf regression; we will follow up on this thread.

@tqchen changed the title from [TEST] Improve Numerical Gradient Checker to [TEST] Conv2dGrad Takes long time to finish on Jul 28, 2021
@altanh
Contributor

altanh commented Jul 28, 2021

After analysis, we believe this regression is due to a combination of the changes in #8486 and an inefficient loop in the check_grad function:

  1. [RELAY]Switch from CompileEngine to TECompiler in Interpreter #8486 replaced the previous global (or thread-local) CompileEngine with a fresh TECompiler per Interpreter instance. This means that the previous behavior of caching lowered functions (for given input types) globally across all interpreters no longer holds.

  2. In check_grad, at

    fwd_plus = intrp.evaluate(fwd_func)(*inputs).numpy().astype("float64")

    notice that the forward function is recompiled by the interpreter executor for every element of the input tensor. It is important to note that the interpreter executor actually creates a new Interpreter object for every evaluated expression (and hence loses the cache), causing a complete recompile on each evaluation.

Proposed Solution

The second point can easily be rectified by hoisting the forward function evaluation out of the hot loop, since the function doesn't change. In fact, it only needs to be evaluated once per device and target combination. This will immediately fix the extreme regression. A PR will be posted soon.
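As a sketch of the hoisting fix (hypothetical names throughout; `evaluate` here is only a stand-in for `intrp.evaluate`, with a counter replacing the expensive recompile), the compiled forward function is obtained once and then reused inside the per-entry finite-difference loop:

```python
import numpy as np

compile_count = 0  # stands in for the cost of an interpreter recompile

def evaluate(fwd_func):
    """Hypothetical stand-in for intrp.evaluate: each call represents
    one expensive recompilation of fwd_func."""
    global compile_count
    compile_count += 1
    return fwd_func

def numeric_grad(fwd_func, x, eps=1e-6):
    # The fix: evaluate (compile) the forward function once, outside
    # the hot loop, since fwd_func never changes between iterations.
    f = evaluate(fwd_func)
    grad = np.zeros_like(x)
    flat, g = x.ravel(), grad.ravel()  # views into x and grad
    for i in range(flat.size):
        old = flat[i]
        flat[i] = old + eps
        fwd_plus = f(x)
        flat[i] = old - eps
        fwd_minus = f(x)
        flat[i] = old
        g[i] = (fwd_plus - fwd_minus) / (2 * eps)
    return grad

grad = numeric_grad(lambda t: float(np.sum(t ** 2)), np.ones((4, 4)))
assert compile_count == 1  # one compile for all 16 gradient entries
```

With the buggy shape, the `evaluate` call sits inside the loop body and `compile_count` would reach `2 * x.size` instead of 1.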

Regarding the first point, we would like to move away from relying on hidden global caching for performance as much as possible as this has caused confusion in the past. Thus we will not modify the new interpreter behavior.

@altanh
Contributor

altanh commented Jul 29, 2021

Update: the problem goes a bit deeper with the Interpreter executor. Specifically, the function returned by intrp.evaluate(fwd_func) is actually a Python closure which itself creates a new interpreter each time it is invoked. This means we can't get around the caching problem just by calling .evaluate once.

Possible solutions:

  1. Just give up on using the interpreter executor for now.
  2. Rewrite the evaluate function so that it creates only one interpreter (and has the closure call it) rather than a new interpreter each time the closure is invoked.

I'll go ahead with option 2 in the hotfix PR for now.
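The difference between the two closure shapes can be sketched with a toy `Interpreter` (a hypothetical class, not the real Relay Interpreter; it just counts constructions in place of building a real compile cache):

```python
class Interpreter:
    """Toy stand-in for the Relay Interpreter: constructing one is
    expensive, so we count how many get built."""
    instances = 0

    def __init__(self):
        Interpreter.instances += 1
        self.cache = {}  # the compile cache that is lost on each rebuild

    def run(self, expr, args):
        return expr(*args)

def evaluate_per_call(expr):
    # Current behavior: the returned closure builds a *new* Interpreter
    # (with an empty cache) every time it is invoked.
    def closure(*args):
        return Interpreter().run(expr, args)
    return closure

def evaluate_once(expr):
    # Option 2: build the Interpreter once; the closure reuses it, so
    # its cache survives across invocations.
    intrp = Interpreter()
    def closure(*args):
        return intrp.run(expr, args)
    return closure

square = lambda x: x * x

f = evaluate_per_call(square)
for i in range(3):
    f(i)
after_buggy = Interpreter.instances  # 3 interpreters for 3 calls

Interpreter.instances = 0
g = evaluate_once(square)
for i in range(3):
    g(i)
after_fixed = Interpreter.instances  # 1 interpreter for 3 calls
```

The fixed shape moves the construction from the closure body to the enclosing scope, so the closure captures one long-lived interpreter instead of rebuilding it per invocation.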

@tqchen
Member Author

tqchen commented Jul 29, 2021

Let us go with option 1 and just skip the interpreter executor in the grad evaluation.

@tqchen
Member Author

tqchen commented Jul 30, 2021

Closing this for now; opened #8601 to track the follow-up item.
