[Bugfix] Auto scheduler tutorial failure on CI #6723

Merged: 8 commits merged into apache:main on Oct 21, 2020

comaniac
Contributor

@comaniac comaniac commented Oct 21, 2020

  • Disable the failed checker when reading the log file.
  • Add logging to track why the log is considered invalid on CI.

We can merge this PR first if it passes the CI. Then we can have a follow-up PR to fix the root cause and remove the logging.
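
The logging idea in the description can be sketched roughly as follows. This is only an illustration, not the code in this PR: `read_records` and the logger name are hypothetical, and the sketch assumes the log is a file with one JSON record per line.

```python
import json
import logging

logger = logging.getLogger("record_check")  # hypothetical logger name


def read_records(lines):
    """Parse JSON-lines records, logging and skipping any line that fails to parse."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Report the offending line so the failure can be diagnosed from CI output.
            logger.warning("skipping invalid record at line %d: %s (%r)",
                           lineno, err, line[:80])
    return records
```

Skipping instead of raising keeps the tutorial running while the warning still records which line was rejected and why.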

cc @merrymercy @jcf94 @junrushao1994

@comaniac comaniac changed the title [Bugfix] Auto scheduler tutorial failure on CI [DO NOT MERGE][Bugfix] Auto scheduler tutorial failure on CI Oct 21, 2020
@comaniac comaniac changed the title [DO NOT MERGE][Bugfix] Auto scheduler tutorial failure on CI [Bugfix] Auto scheduler tutorial failure on CI Oct 21, 2020
@jroesch
Member

jroesch commented Oct 21, 2020

@comaniac thanks for doing this!

@junrushao
Member

Yeah, we should probably unblock the CI first; then we can work together to find the root cause in exactly the same Docker environment used in the CI.

@comaniac
Contributor Author

Although CI still failed, I've finally got the incorrect JSON. Here are the instructions to reproduce:

  1. Put the following line in conv2d.json.
{"i": [["[\"conv2d_layer\", 1, 7, 7, 512, 512, 3, 3, [1, 1], [1, 1]]", "cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32", [-1, 16, 64, 49152, 65536, 1024, 8, 32]], [[], [["CI", 5], ["SP", 3, 0, 1, [1, 1, 1, 1], 1], ["SP", 3, 5, 512, [4, 32, 1, 1], 1], ["SP", 3, 10, 7, [1, 1, 1, 7], 1], ["SP", 3, 15, 7, [1, 1, 7, 1], 1], ["SP", 3, 20, 512, [8, 1], 1], ["SP", 3, 23, 3, [1, 1], 1], ["SP", 3, 26, 3, [1, 3], 1], ["RE", 3, [0, 5, 10, 15, 1, 6, 11, 16, 2, 7, 12, 17, 20, 23, 26, 21, 24, 27, 3, 8, 13, 18, 22, 25, 28, 4, 9, 14, 19]], ["FSP", 6, 0, 1, 3], ["FSP", 6, 4, 2, 3], ["FSP", 6, 8, 3, 3], ["FSP", 6, 12, 4, 3], ["RE", 6, [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]], ["CA", 3, 6, 11], ["CHR", 2, "shared", [3]], ["CA", 3, 4, 14], ["CHR", 1, "shared", [4]], ["CA", 2, 5, 14], ["CI", 1], ["FU", 8, [0, 1, 2, 3]], ["AN", 8, 0, 5], ["FU", 8, [1, 2, 3, 4]], ["AN", 8, 1, 4], ["FU", 8, [2, 3, 4, 5]], ["AN", 8, 2, 6], ["FU", 4, [0, 1, 2, 3]], ["SP", 4, 0, 16, [1], 1], ["AN", 4, 1, 2], ["FFSP", 4, 0, [4, 3, 2, 1], 1, 1], ["AN", 4, 1, 6], ["FU", 2, [0, 1, 2, 3]], ["SP", 2, 0, 784, [7], 1], ["AN", 2, 1, 2], ["FFSP", 2, 0, [4, 3, 2, 1], 1, 1], ["AN", 2, 1, 6], ["PR", 5, 0, "auto_unroll_max_step$0"]]]], "r": [[0.001027474], 0, 1.97181, 1603158399], "v": "v0.2"}
  2. Run the following code.
import tvm
from tvm import auto_scheduler
auto_scheduler.load_best("conv2d.json")
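
Before involving TVM at all, the record can be inspected with the standard `json` module. The sketch below is a hypothetical helper, not part of TVM. It assumes the layout visible in the line above: `"i"` holds the task and the state with its transform steps, `"r"` holds the measurement result, and `"v"` holds the format version. Reading the four `"r"` entries as costs, error code, total cost, and timestamp is my interpretation of the record, not something confirmed in this thread.

```python
import json


def summarize_record(line):
    """Return a small summary of one log record (hypothetical helper)."""
    rec = json.loads(line)
    task, state = rec["i"]          # task description and [?, transform steps]
    costs, error_no, _all_cost, _timestamp = rec["r"]  # assumed field meanings
    steps = state[1]
    return {
        "version": rec["v"],
        "workload_key": task[0],
        "target": task[1],
        "num_steps": len(steps),
        "step_kinds": sorted({step[0] for step in steps}),
        "costs": costs,
        "error_no": error_no,
    }


# Abbreviated, made-up record with the same layout as the line in step 1.
sample = (r'{"i": [["[\"conv2d_layer\", 1, 7, 7]", "cuda", [-1, 16]], '
          r'[[], [["CI", 5], ["SP", 3, 0, 1, [1, 1, 1, 1], 1]]]], '
          r'"r": [[0.001], 0, 1.9, 1603158399], "v": "v0.2"}')
summary = summarize_record(sample)
print(summary["version"], summary["num_steps"], summary["step_kinds"])
```

Running the same helper over every line of a log file would show whether records with different layouts or version strings are mixed in one file.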

I'll try to fix it tomorrow. @jcf94 will help investigate as well.

@jcf94
Contributor

jcf94 commented Oct 21, 2020

@comaniac Thanks!

@junrushao
Member

Nice, we finally got the JSON file to reproduce!

@jcf94
Contributor

jcf94 commented Oct 21, 2020

Nice, we finally got the JSON file to reproduce!

😢 Unfortunately, it seems we still can't reproduce this bug locally. I'm still not able to figure out how this log was generated.

@comaniac
Contributor Author

I disabled the tutorials to make the CI green first. Will file another PR to fix the issue.

@comaniac
Contributor Author

We found the root cause: the log file generated by the tutorial is never removed, so every CI run appends several lines of log to the same file. On top of that, #6671 changed the log format and appended records in the new format to that file, which is also read by other CI runs. After this PR, I'll file another PR to make sure every CI run is independent.
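
One way to make each run independent is to delete any leftover log before the tutorial starts tuning. The sketch below only illustrates that idea and is not the follow-up PR. `fresh_log_path` and `demo` are hypothetical names, and the file contents are placeholders.

```python
import os
import tempfile


def fresh_log_path(path):
    """Delete a leftover log at `path`, if any, so a run starts from an empty file."""
    if os.path.exists(path):
        os.remove(path)
    return path


def demo():
    """Simulate a stale log from an earlier run, then a clean new run."""
    workdir = tempfile.mkdtemp()
    log_file = os.path.join(workdir, "conv2d.json")
    # A previous run left a record in some other format behind.
    with open(log_file, "w") as f:
        f.write('{"v": "old-format"}\n')
    fresh_log_path(log_file)
    leftover = os.path.exists(log_file)
    # The new run appends its own records, as the tuner would.
    with open(log_file, "a") as f:
        f.write('{"v": "v0.2"}\n')
    with open(log_file) as f:
        return leftover, f.read().splitlines()
```

Writing the log to a fresh temporary directory per run would achieve the same isolation without deleting anything.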

@tqchen tqchen merged commit 9ae386c into apache:main Oct 21, 2020
@comaniac comaniac deleted the ansor_fix_ci branch October 21, 2020 21:04
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Oct 29, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020