
[Relay][Legalize][ARM_CPU] Handling NHWC layout for arm_cpu. #3754

Merged · 1 commit · Aug 14, 2019

Conversation

anijain2305
Contributor

Relevant Issue - #2519

Currently, TFLite-parsed models fail to compile for ARM devices. The reason is that TFLite models use the NHWC layout, and we do not support NHWC conv2d on ARM. TF models (which are also NHWC) do run on ARM devices, but only because the TF parser adds layout transforms while parsing the graph (which is not a good idea, as parsers should retain the framework layout).

As a first step, this PR adds a legalize function for conv2d on arm_cpu that inserts transposes before and after the op so that the NCHW schedule can be used. This gets TFLite models working on ARM CPUs.

Follow-up work requires cleaning up the TF parser. Parsers should keep the frontend layout, so we can add a legalize for all targets to fall back to NCHW for conv2d and then clean up the TF parser. (Let me know if I should open a new issue to track these tasks.)
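The transpose wrapping the PR describes can be sketched in plain Python (a standalone illustration with hypothetical helper names, not TVM's actual Legalize API, which rewrites Relay expressions): the NHWC input is transposed to the channel-first layout, fed to a channel-first-only kernel, and the result is transposed back.

```python
def conv1x1_chw(x_chw, w_oc):
    """1x1 convolution on a CHW image: out[o][i][j] = sum_c w[o][c] * x[c][i][j]."""
    C, H, W = len(x_chw), len(x_chw[0]), len(x_chw[0][0])
    return [[[sum(w_oc[o][c] * x_chw[c][i][j] for c in range(C))
              for j in range(W)] for i in range(H)] for o in range(len(w_oc))]

def hwc_to_chw(x):
    H, W, C = len(x), len(x[0]), len(x[0][0])
    return [[[x[i][j][c] for j in range(W)] for i in range(H)] for c in range(C)]

def chw_to_hwc(x):
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[x[c][i][j] for c in range(C)] for j in range(W)] for i in range(H)]

def conv1x1_hwc_legalized(x_hwc, w_oc):
    # "Legalize" sketch: transpose in, run the channel-first kernel, transpose out.
    return chw_to_hwc(conv1x1_chw(hwc_to_chw(x_hwc), w_oc))

x = [[[1.0, 2.0], [3.0, 4.0]],        # 2x2 image, 2 channels, HWC layout
     [[5.0, 6.0], [7.0, 8.0]]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 output channels, 2 input channels
y = conv1x1_hwc_legalized(x, w)
# y[i][j][o] == sum_c w[o][c] * x[i][j][c], e.g. y[0][0] == [1.0, 2.0, 3.0]
```

The two transposes are exactly the overhead discussed below: the result is correct, but the data movement is pure cost on top of the conv itself.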

@yzhliu @FrozenGene @apivovarov @kevinthesun @yongwww

@FrozenGene
Member

FrozenGene commented Aug 12, 2019

Inserting transposes is not a good solution IMO. As demonstrated, it has a big performance impact when conv2d itself executes fast. Besides us, the community has also reported this: https://discuss.tvm.ai/t/quantization-autotvm-performance-degradation-due-to-transpose-ops-when-nchw-layout-is-used/3561. I think we should support the NHWC layout in the spatial pack schedule on ARM CPU instead of inserting transpose ops. Moreover, @jackwish has investigated the advantages and disadvantages of the NHWC and NCHW layouts on ARM CPU internally. If we don't adopt the NHWC layout, we cannot get the best performance and beat QNNPACK.

In summary, we should support the NHWC layout as is done here: https://github.com/dmlc/tvm/blob/master/topi/python/topi/arm_cpu/bitserial_conv2d.py#L264. It is not difficult and only needs minor work on our side. Unifying on NCHW sounds good but is not the best choice in practice if we want the best performance (which is one key factor in whether users adopt TVM).
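The locality argument behind the NHWC preference can be illustrated with simple offset arithmetic (a standalone sketch, not TVM code; the tensor sizes are arbitrary): in NHWC all channel values of one pixel sit at consecutive addresses, which suits the channel-wise inner loops of depthwise and quantized conv2d, while in NCHW the same values are H*W elements apart.

```python
H, W, C = 4, 4, 8

def nhwc_offset(i, j, c):
    # Flat-buffer offset of element (i, j, c) in NHWC layout.
    return (i * W + j) * C + c

def nchw_offset(i, j, c):
    # Flat-buffer offset of the same element in NCHW layout.
    return (c * H + i) * W + j

# Addresses touched by a reduction over all channels of pixel (1, 2):
nhwc_offsets = [nhwc_offset(1, 2, c) for c in range(C)]
nchw_offsets = [nchw_offset(1, 2, c) for c in range(C)]
print(nhwc_offsets)  # stride 1: 48, 49, ..., 55 (contiguous)
print(nchw_offsets)  # stride H*W = 16: 6, 22, 38, ...
```

Whether this translates into beating QNNPACK depends on the full schedule, but the contiguous channel access is the property the NHWC schedule exploits.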

@anijain2305
Contributor Author

anijain2305 commented Aug 12, 2019

@FrozenGene I agree that writing a schedule for NHWC is the right long-term solution. However, for the time being, can this be merged? The current state is bad because compilation fails outright. As we get time, we can start working on the NHWC schedule. If you have already looked into it and it can be open-sourced, that would be really helpful as well.

@FrozenGene
Member

I am busy with our internal DSP product. I cannot say whether @jackwish has the interest and time to help with it; if so, I think we could contribute it back.

@anijain2305
Contributor Author

anijain2305 commented Aug 12, 2019

Ok, in that case, I would suggest taking the current PR as an intermediate step until a better NHWC schedule is merged.

Performance is definitely one of the key factors for using TVM. But equally important is portability, both across frameworks and across HW devices. It is fine to be in a temporary stage where things are functional but not performant, as opposed to a stage where we are neither functional nor performant. Currently, we fall into the latter bucket for TFLite models, which makes it harder to convince edge-device vendors (who like TFLite) to try TVM.

----------
attrs : nnvm.top.AttrDict or tvm.attrs.Attrs
Attributes of current convolution
inputs : nnvm.symbol or tvm.relay.Expr
Contributor


Aren't we going to discontinue NNVM support?

Contributor Author


Thanks! Yes, I have made it work only for Relay. Changed the comments to reflect that.

@zhenhuaw-me
Contributor

I am busy with our internal DSP product. I cannot say whether @jackwish has the interest and time to help with it; if so, I think we could contribute it back.

We may need some internal discussions on this...


@zhenhuaw-me zhenhuaw-me left a comment


Thanks for the ping. Are we using this to make it possible to test models (I mean something like MobileNet)?

topi/python/topi/arm_cpu/conv2d.py
@anijain2305
Contributor Author

Thanks for the ping. Are we using this to make it possible to test models (I mean something like MobileNet)?

Yes

@kevinthesun
Contributor

kevinthesun commented Aug 13, 2019

@FrozenGene I think this PR avoids introducing transposes by replacing conv2d NHWC directly with conv2d NCHW. This partially resolves our issue. In the long term, we might need #3670 to better manage graph-level layout conversion. For TOPI schedules, I suggest we only add schedules for a new layout when an obvious performance benefit is shown. Otherwise, we can just leverage the existing layout schedules and convert nodes into that layout.

@FrozenGene
Member

FrozenGene commented Aug 14, 2019

@FrozenGene I think this PR avoids introducing transposes by replacing conv2d NHWC directly with conv2d NCHW. This partially resolves our issue. In the long term, we might need #3670 to better manage graph-level layout conversion. For TOPI schedules, I suggest we only add schedules for a new layout when an obvious performance benefit is shown. Otherwise, we can just leverage the existing layout schedules and convert nodes into that layout.

Yeah, we have investigated the NHWC and NCHW layouts for quantized models in depth and found that NHWC gets us better performance; otherwise, we have to introduce a pack operation internally to get the same effect (better locality) as NHWC when using NCHW. However, when we execute quantized models, we care very much about performance, and the time spent on packing cannot be ignored. Maybe @jackwish could share more of his investigation results.
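The pack operation mentioned here can be sketched as follows (a hypothetical standalone illustration, not TVM's actual NCHW[x]c implementation): recovering channel locality in NCHW means reordering the data into a channel-blocked layout, an extra data-movement step whose cost shows up at runtime.

```python
def pack_nchw_to_nchwc(x, c_block):
    """Reorder a nested-list tensor (N, C, H, W) into (N, C//c_block, H, W, c_block)
    so that c_block channel values of each pixel become contiguous."""
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    assert C % c_block == 0
    return [[[[[x[n][co * c_block + ci][h][w] for ci in range(c_block)]
               for w in range(W)] for h in range(H)]
             for co in range(C // c_block)] for n in range(N)]

# N=1, C=4, H=1, W=2; element value encodes (channel, column) as c*10 + w.
x = [[[[c * 10 + w for w in range(2)] for _ in range(1)] for c in range(4)]]
packed = pack_nchw_to_nchwc(x, c_block=2)
# packed[0][1][0][0] gathers channels 2 and 3 at pixel (0, 0): [20, 30]
```

Every element is copied once during packing, which is the runtime cost being weighed against simply computing in NHWC from the start.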

@kevinthesun
Contributor

kevinthesun commented Aug 14, 2019

@FrozenGene I see. I think there is no harm in improving layout support and TOPI schedules at the same time. This PR can serve as a short-term solution for arm_cpu fp32.

@kevinthesun added the "status: need update" (need update based on feedbacks) label Aug 14, 2019
@kevinthesun kevinthesun merged commit 5498e54 into apache:master Aug 14, 2019
@kevinthesun
Contributor

Thanks!

Note
----
Unlike other TOPI functions, this function operates on both graph level and operator level,
so we have to pass 'F' to make it support our two versions of graph IR, NNVM and Relay.
Member


@anijain2305 why are NNVM and the F argument still mentioned here?

Contributor Author


Ah, thanks for pointing that out. I copied the description from other functions that have this comment. Will send a separate PR to clean up the comments.

@zhenhuaw-me
Contributor

Drafted the ARM NHWC schedule in PR #3859, if you guys are interested.
