Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Relay] Fix index order in conv2d computation for Arm CPU. #8361

Merged
merged 1 commit into from
Jul 8, 2021

Conversation

AnastasiaStulova
Copy link
Contributor

@AnastasiaStulova AnastasiaStulova commented Jun 28, 2021

When dilation is larger than value 1 in conv2d with NHWC
layout, the ordering of indexes when accessing data array
in computation of convolution appears to be incorrect.

'data_vec' is defined as

lambda n, oho, owo, kh, kw, ic, ohi, owi:

but accessed as

data_vec[n, oho, owo, kh, kw, ohi, owi, ic]

This patch fixes the order of indexes and modifies the test
so that it is suitable for running on an AArch64 CPU.

@AnastasiaStulova
Copy link
Contributor Author

@giuseros

Copy link
Contributor

@giuseros giuseros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch @AnastasiaStulova ! So yes, I agree on the testing part. If you could add a separate PR that enables arm_cpu tests, that would be very helpful

@giuseros
Copy link
Contributor

@mbaret @leandron could you double check this and possibly merge?

@mbaret
Copy link
Contributor

mbaret commented Jun 29, 2021

This patch looks good to me, ideally we should have test cases for this though. I'm not up to speed on the status of Arm CI so am unsure how immediately feasible this is, perhaps @u99127 or @leandron know more.

@u99127
Copy link
Contributor

u99127 commented Jun 30, 2021

This patch looks good to me, ideally we should have test cases for this though. I'm not up to speed on the status of Arm CI so am unsure how immediately feasible this is, perhaps @u99127 or @leandron know more.

do the existing topi tests catch the failure on AArch64 ?

@AnastasiaStulova
Copy link
Contributor Author

This patch looks good to me, ideally we should have test cases for this though. I'm not up to speed on the status of Arm CI so am unsure how immediately feasible this is, perhaps @u99127 or @leandron know more.

do the existing topi tests catch the failure on AArch64 ?

@u99127 The test indeed caught the bug and it already what we need for arm_cpu testing
https://github.com/apache/tvm/blob/main/tests/python/topi/python/test_topi_conv2d_nhwc.py#L33
but it is not enabled by default in the list of tested devices:
https://github.com/apache/tvm/blob/main/tests/python/topi/python/test_topi_conv2d_nhwc.py#L78
It is very easy to add it to the list "llvm -device=arm_cpu" then it runs straight away.

@AnastasiaStulova
Copy link
Contributor Author

FYI there is also https://github.com/apache/tvm/blob/main/tests/python/topi/python/test_topi_conv2d_nchw.py for testing NCHW layout that already runs on all available targets. So if we run it where Arm CPU is enabled it will test it automatically.

The only problem could be that it is a bit long-running as it has more than 350 test cases but we could think of creating a fast running mode too.

@u99127
Copy link
Contributor

u99127 commented Jul 1, 2021

FYI there is also https://github.com/apache/tvm/blob/main/tests/python/topi/python/test_topi_conv2d_nchw.py for testing NCHW layout that already runs on all available targets. So if we run it where Arm CPU is enabled it will test it automatically.

The only problem could be that it is a bit long-running as it has more than 350 test cases but we could think of creating a fast running mode too.

Got it - yes I could just reproduce the failure with that test but as you say I did have to add 'llvm -device=arm_cpu' to the devices list as you said to see the AssertionError failure, so may be the solution here is that we add 'llvm -device=arm_cpu' to the test here or actually fix the test to only use 'llvm -device=arm_cpu' when running on ci_arm rather than having it run twice on the GPU CI which is where TOPI testing is turned on.

I suspect the following things need to happen here to progress this with the tests.

  1. We run the topi tests in Jenkins using the Task script and just turn that on for the CI on AArch64. That is a one liner change and hopefully it works. However while pipe cleaning that on my local machine , I've hit a few testisms in the topi tests that need to be cleaned up. Especially with some tests being suitable only for CUDA but not being marked as so. I can help with this. That's pretty easy and straightforward but will need buy in from the project that running these tests in AArch64 CI is a good thing ! I think this bug just shows the value of running these tests in CI. I would prefer to start with the entirety of the TOPI tests, it appears to take me about 30-40 minutes on a machine that has similar characteristics to the one in CI.

  2. I agree with you that we should look to add to this PR an additional change to the test script you mention by adding 'llvm -device=arm_cpu' to the list of targets only if one is running the test on AArch64 hardware, otherwise when we come around to doing 1 above we'd be running this test once for the llvm target and once for an llvm -device=arm_cpu and just duplicating testing for a path that is unneeded . For bonus points we should audit the topi test directory and clean it up as we go along because I note that many tests again run for both llvm and llvm -device=arm_cpu which probably just uses more runtime during CI when it is not needed.

regards
Ramana

When dilation is larger than value 1 in conv2d with NHWC
layout, the ordering of indexes when accessing data array
in computation of convolution appears to be incorrect.

'data_vec' is defined as

lambda n, oho, owo, kh, kw, ic, ohi, owi:

But accessed as

data_vec[n, oho, owo, kh, kw, ohi, owi, ic]

This patch fixes the order of indexes and modifies the test
so that it is suitable for running on an AArch64 CPU.
u99127 pushed a commit to u99127/tvm that referenced this pull request Jul 6, 2021
This pull request cleans up the TOPI testuite for use on the AArch64
CI target by doing the following:

- Introducing a script to run the tests on AArch64 with a suitable
  invocation of the llvm target string by setting the TVM_TEST_TARGETS
  environment variable.

- Cleaning up the use of hard coded targets and moving the testsuite
  to testing more sensibly with the use of tvm.testing.enabled_targets.

- Cleanup the use of tvm.target.create.

- The above allows for the use of tests reasonably with the topi
  tests and cleans up what is needed from the testsuite.

Putting this up for a test run on ci_gpu and ci_cpu to see the effects
of moving TOPI test runs to AArch64 CPU before firing up the Jenkins
changes.

The motivation was from apache#8361 to pipeclean and add this support.
u99127 pushed a commit to u99127/tvm that referenced this pull request Jul 6, 2021
This pull request cleans up the TOPI testuite for use on the AArch64
CI target by doing the following:

- Introducing a script to run the tests on AArch64 with a suitable
  invocation of the llvm target string by setting the TVM_TEST_TARGETS
  environment variable.

- Cleaning up the use of hard coded targets and moving the testsuite
  to testing more sensibly with the use of tvm.testing.enabled_targets.

- Cleanup the use of tvm.target.create vs tvm.target.Target

- The above allows for the use of tests reasonably with the topi
  tests and cleans up what is needed from the testsuite.

Putting this up for a test run on ci_gpu and ci_cpu to see the effects
of moving TOPI test runs to AArch64 CPU before firing up the Jenkins
changes.

The motivation was from apache#8361 to pipeclean and add this support.
@AnastasiaStulova
Copy link
Contributor Author

This PR now allows enabling testing of this functionality in CI for Aarch64 after the following #8409 gets merged.

Copy link
Contributor

@jcf94 jcf94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM.

Does this PR has any dependency with #8409 ? I think it's ready for merge? @AnastasiaStulova

@u99127
Copy link
Contributor

u99127 commented Jul 8, 2021

I think this can be merged before #8409 as it would make my life easier to know that upstream tests are as clan as possible.

@jcf94 jcf94 merged commit 0b39af7 into apache:main Jul 8, 2021
@jcf94
Copy link
Contributor

jcf94 commented Jul 8, 2021

Thanks! @u99127 @AnastasiaStulova @giuseros

u99127 pushed a commit to u99127/tvm that referenced this pull request Aug 4, 2021
This pull request cleans up the TOPI testuite for use on the AArch64
CI target by doing the following:

- Introducing a script to run the tests on AArch64 with a suitable
  invocation of the llvm target string by setting the TVM_TEST_TARGETS
  environment variable.

- Cleaning up the use of hard coded targets and moving the testsuite
  to testing more sensibly with the use of tvm.testing.enabled_targets.

- Cleanup the use of tvm.target.create.

- The above allows for the use of tests reasonably with the topi
  tests and cleans up what is needed from the testsuite.

Putting this up for a test run on ci_gpu and ci_cpu to see the effects
of moving TOPI test runs to AArch64 CPU before firing up the Jenkins
changes.

The motivation was from apache#8361 to pipeclean and add this support.
ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
When dilation is larger than value 1 in conv2d with NHWC
layout, the ordering of indexes when accessing data array
in computation of convolution appears to be incorrect.

'data_vec' is defined as

lambda n, oho, owo, kh, kw, ic, ohi, owi:

But accessed as

data_vec[n, oho, owo, kh, kw, ohi, owi, ic]

This patch fixes the order of indexes and modifies the test
so that it is suitable for running on an AArch64 CPU.
zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022
When dilation is larger than value 1 in conv2d with NHWC
layout, the ordering of indexes when accessing data array
in computation of convolution appears to be incorrect.

'data_vec' is defined as

lambda n, oho, owo, kh, kw, ic, ohi, owi:

But accessed as

data_vec[n, oho, owo, kh, kw, ohi, owi, ic]

This patch fixes the order of indexes and modifies the test
so that it is suitable for running on an AArch64 CPU.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants