Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] On-Device Testing in TVM CI #98

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mehrdadh
Copy link
Member

This RFC proposes a structure to bring on-device testing on non-cloud supported hardware targets in TVM.

Co-authored-by: David Riazati driazati@octoml.ai

cc @areusch @leandron

@areusch
Copy link
Contributor

areusch commented Jan 24, 2023

2. **Testing.** At a minimum, Device Test Maintainers should re-run any simulated integration tests ordinarily ran in TVM’s CI on real hardware targets. In addition, they are welcome to bring more tests with more input samples or tuning with more trials to show better accuracy and performance benchmark. For nightly, running the test could be triggered based on a timer and implemented however the HW vendor desires. This way Device Test Maintainers have flexibility on the implementation and are not required to make a connection to TVM Jenkins node.

3. **Test results.** We expect Device Test Maintainers to publicly report functional test results for any on-device tests which also run on simulators in the TVM CI. To facilitate this, TVM will provide reporting infrastructure (i.e. a test dashboard) to present those results in public domain. Our proposal is that Device Test Maintainers upload the tests results in the form of pytest artifacts to an S3 bucket which is provided by TVM community. Device Test Maintainers are also welcome to show the results in the form of a website, but the tests artifacts should be uploaded to the S3 bucket so they can be retrieved in future.
- Other alternative is to use a Github repository to host the test results. Github repo is not the ideal solution for saving and downloading files and it could be slow for hosting large number of files for a website.
Copy link
Contributor

@gromero gromero Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree to not keep the test results in GH :-)

Copy link
Contributor

@gromero gromero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mehrdadh @driazati Thanks for this RFC! It's neat the idea of adding mechanisms to allow "third party CIs" -- using tiers -- to handle/help the on-device testing in the TVM CI. LGTM!

@mehrdadh I just asked for a couple of clarifications, so formally I'll approve the PR once you answer / address them :-)


# Test Tiers
So far we explained a minimal setup to bring a on-device testing CI to TVM on a nightly basis. However, in principle one could enable more frequent testing. TVM defines these tiers:
1. **Tier 1: Run CI for all PRs.** This tier is equivalent to testing support for existing hardware targets that exist in cloud. This case requires large resources to avoid increasing the CI time. TVM community expects close CI infrastructure monitoring if they a Device Test Maintainer registers at this tier. If failures are observed in a CI at this tier which are due to to failures in the CI infrastructure, TVM community expect it to be resolved in one day time frame. If this requirement is not fulfilled the mentioned CI would be degraded to lower tiers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you meant "TVM community expects close CI infrastructure monitoring if there is a Device Test Maintainer registered at this tier"?

nit: TVM community expects (s/expect/expects)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think TVM community is singular and should be expects?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mehrdadh yeah, that's what I meant, so it should be "TVM community expects" instead of 'expect' (which is in the text).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mehrdadh ah, I see the confusion here. So, I meant two fixes:

  1. TVM community expects close CI infrastructure monitoring if they a Device Test Maintainer registers at this tier => TVM community expects close CI infrastructure monitoring if there is a Device Test Maintainer registered at this tier"

  2. TVM community expect it to be resolved in one day time frame. => TVM community expects it to be resolved in one-day time frame.

rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved
rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved
rfcs/0098_on_device_testing.md Show resolved Hide resolved
rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved
rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved

[Recent changes](https://github.com/apache/tvm/pull/13300) in TVM CI have divided TVM previous single large CI into multiple smaller Jenkins files which are more readable and easy to manage. A device test maintainer could reuse those Jenkins files with their own Jenkins instance to add an On-Device CI for specific hardware. We expect TVM to keep this CI reusability.

In addition to Jenkins file, any device test maintainer requires an infrastructure to manage their device fleet on each server. Since each device test maintainer is specialized in their domain we expect the infrastructure to be internally managed. From the TVM community perspective, there are few expectations from the device test maintainer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/device test maintainer/Device Test Maintainer/ ?

In addition to Jenkins file, any device test maintainer requires an infrastructure to manage their device fleet on each server. Since each device test maintainer is specialized in their domain we expect the infrastructure to be internally managed. From the TVM community perspective, there are few expectations from the device test maintainer:

- Provide N number of workers to Jenkins head to have enough support to run PRs in parallel. N is defined based on the tier that the device test maintainer is registering their CI and it is subjective to the target and CI traffic, CI time, etc.
- For each worker provide M number of devices. M is defined based on number of tests that are running on device and it should provide enough parallelism to avoid this CI to be the bottle neck.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: /bottle neck/ bottleneck/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member Author

@mehrdadh mehrdadh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gromero thanks! PTKL.

rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved

# Test Tiers
So far we explained a minimal setup to bring a on-device testing CI to TVM on a nightly basis. However, in principle one could enable more frequent testing. TVM defines these tiers:
1. **Tier 1: Run CI for all PRs.** This tier is equivalent to testing support for existing hardware targets that exist in cloud. This case requires large resources to avoid increasing the CI time. TVM community expects close CI infrastructure monitoring if they a Device Test Maintainer registers at this tier. If failures are observed in a CI at this tier which are due to to failures in the CI infrastructure, TVM community expect it to be resolved in one day time frame. If this requirement is not fulfilled the mentioned CI would be degraded to lower tiers.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think TVM community is singular and should be expects?

rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved
rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved
rfcs/0098_on_device_testing.md Show resolved Hide resolved
rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved
rfcs/0098_on_device_testing.md Outdated Show resolved Hide resolved
In addition to Jenkins file, any device test maintainer requires an infrastructure to manage their device fleet on each server. Since each device test maintainer is specialized in their domain we expect the infrastructure to be internally managed. From the TVM community perspective, there are few expectations from the device test maintainer:

- Provide N number of workers to Jenkins head to have enough support to run PRs in parallel. N is defined based on the tier that the device test maintainer is registering their CI and it is subjective to the target and CI traffic, CI time, etc.
- For each worker provide M number of devices. M is defined based on number of tests that are running on device and it should provide enough parallelism to avoid this CI to be the bottle neck.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@mehrdadh mehrdadh requested a review from gromero March 1, 2023 20:29
@mehrdadh
Copy link
Member Author

mehrdadh commented Mar 1, 2023

@gromero PTAL, thanks!

@gromero
Copy link
Contributor

gromero commented Mar 2, 2023

@gromero PTAL, thanks!

argh! sorry, missed your previous ping long ago! So, just the nits in this comment (#98 (comment)) are missing. Otherwise LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants