[RFC][AutoTVM] Tuning with subgraph granularity #64

Open
wants to merge 1 commit into base: main
Conversation

@DzAvril commented Mar 22, 2022

This is an RFC PR for apache/tvm#10650.

cc @FrozenGene @masahi @comaniac

Our implementation is inspired by the auto-scheduler.

# Unresolved questions
[unresolved-questions]: #unresolved-questions
Contributor

How are you handling serialization of the tuning results? AutoTVM logs just contain the implementation name, but that is not specific enough for subgraphs. Will you use a structural hash like the auto-scheduler does?
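For reference, TVM exposes a structural hash utility, `tvm.ir.structural_hash`; a minimal example of hashing a small Relay function (the shapes and dtypes below are purely illustrative and not tied to the RFC's implementation) looks like this:

```python
# Minimal illustration of tvm.ir.structural_hash on a small Relay function.
# Shapes and dtypes are illustrative only.
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 128), dtype="float32")
w = relay.var("w", shape=(256, 128), dtype="float32")
func = relay.Function([x, w], relay.nn.dense(x, w))

# Structurally identical functions hash to the same value.
print(tvm.ir.structural_hash(func))
```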

Contributor

Same question here. This also determines how general the tuned schedule can be. For example, consider the following hashing approaches:

H1: Similar to the auto-scheduler, such as ["impl-name", "args", "compute DAG MD5"]. This guarantees that the tuned schedule will be applied ONLY to exactly the same subgraph.

H2: Canonicalize the compute DAG. For example, canonicalize a compute DAG cast - dense - cast to elementwise - dense - elementwise. Since different elementwise ops in this DAG should result in almost the same performance, canonicalization makes the tuned result more generally applicable. The drawback is that we need to be careful that the canonicalized compute DAG does not cover too many undesired cases.
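A toy sketch of the H2 idea (the helper names, op set, and key layout below are illustrative, not part of AutoTVM): map elementwise ops to a generic token before hashing, so DAGs that differ only in their elementwise ops share one record key.

```python
# Illustrative sketch of H2-style canonicalization before hashing.
# The op set, helpers, and key layout are hypothetical, not AutoTVM APIs.
import hashlib

# Ops treated as interchangeable elementwise ops for keying purposes.
ELEMENTWISE_OPS = {"cast", "add", "relu", "sigmoid", "multiply"}

def canonicalize(dag_ops):
    """Replace every elementwise op name with a generic token."""
    return ["elementwise" if op in ELEMENTWISE_OPS else op for op in dag_ops]

def workload_key(impl_name, args, dag_ops):
    payload = repr((impl_name, args, canonicalize(dag_ops))).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

# Under H2, these two subgraphs share a key; under H1 they would not.
k1 = workload_key("dense_pack.x86", ((1, 128), (256, 128)), ["cast", "dense", "cast"])
k2 = workload_key("dense_pack.x86", ((1, 128), (256, 128)), ["relu", "dense", "add"])
assert k1 == k2
```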

Author (@DzAvril)

> H1: Similar to the auto-scheduler, such as ["impl-name", "args", "compute DAG MD5"]. This guarantees that the tuned schedule will be applied ONLY to exactly the same subgraph.

We use the first approach (H1). We serialize the I/O tensors of the subgraph into a hashable tuple and compute the tuple's MD5 to get a hash key, then use subgraph name + hash key as the subgraph task name.
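A minimal sketch of this keying scheme as described (the helper names and shapes are hypothetical, not the actual implementation): serialize the subgraph's I/O tensor shapes and dtypes into a hashable tuple, take its MD5, and append the digest to the subgraph name.

```python
# Hypothetical sketch of the described task-name scheme; not the RFC's code.
import hashlib

def io_signature(io_tensors):
    """io_tensors: iterable of (shape, dtype) pairs for the subgraph's inputs and outputs."""
    return tuple((tuple(shape), dtype) for shape, dtype in io_tensors)

def subgraph_task_name(subgraph_name, io_tensors):
    digest = hashlib.md5(repr(io_signature(io_tensors)).encode("utf-8")).hexdigest()
    return f"{subgraph_name}_{digest}"

# Example: a dense subgraph with one data input, one weight, and one output.
print(subgraph_task_name(
    "subgraph_dense",
    [((1, 128), "float32"), ((256, 128), "float32"), ((1, 256), "float32")],
))
```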

Case 1: The anchor operator has more than one implementation.
We register the subgraph tuning task by `outputs` in step 1.4. No matter how many implementations the anchor operator has, step 1.1 only picks the implementation with the highest plevel and returns the outputs computed by it, so the subgraph tuning task may not contain the potentially best implementation.

Case 2: The anchor operator's `fcompute` function needs a value from the config, as in the code block below. In step 2.2, computing the output calls the function `_pack_data`, and `cfg` is supposed to be the best config of the subgraph. But at step 2.2 we do not yet know which subgraph the anchor operator belongs to, so we cannot get the right config from the tuning history and have to fall back to the default one. This may cause a large performance regression.
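The code block referred to above is not reproduced in this excerpt; the following is only an illustrative sketch of the pattern, loosely modeled on TVM's x86 dense packing (here `cfg` is a plain dict standing in for the AutoTVM config, and the knob name "tile_x" and the shapes are made up):

```python
# Illustrative sketch of an fcompute whose output layout depends on a tuning
# knob. `cfg` is a plain dict stand-in for the AutoTVM config; the knob name
# "tile_x" and the shapes are hypothetical.
from tvm import te

def _pack_data(cfg, weight):
    N, K = weight.shape
    # At step 2.2 no tuning record is available yet, so only a default
    # packing factor can be used, which may be far from the tuned optimum.
    packw_bn = cfg.get("tile_x", 16)
    return te.compute(
        (N // packw_bn, K, packw_bn),
        lambda z, y, x: weight[z * packw_bn + x, y],
        name="packed_weight",
    )

# Building the compute with a fallback (empty) config.
weight = te.placeholder((256, 128), name="weight", dtype="float32")
packed = _pack_data({}, weight)
```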
Contributor

Do you see a potential way around this? All the cases where I've wanted subgraph tuning in AutoTVM are around layout rewriting.

Contributor

I feel it's the same issue as my previous comment. If we could generate N tasks based on the implementations, we should be able to work around this.

Author (@DzAvril)

> Do you see a potential way around this? All the cases where I've wanted subgraph tuning in AutoTVM are around layout rewriting.

Not yet.

Author (@DzAvril)

> I feel it's the same issue as my previous comment. If we could generate N tasks based on the implementations, we should be able to work around this.

I think they are not the same. The key problem in this case is that the tuning results are recorded with the subgraph task name as the workload key, but at step 2.2 the subgraph task name has not been created yet because the ops after the anchor op have not been visited. So we don't know which tuning log we should use.

Comment on lines +19 to +22
- We hand-wrote a schedule, marked as ```ScheduleA```, for the subgraph above; the latency of subgraph inference with it is 104 microseconds. Then we tuned the subgraph with single-op granularity. In the tuning log we found ```ScheduleA```, and the latency recorded in its measurement result is 329 microseconds.
- The best schedule from the tuning log in the step above is marked as ```ScheduleB```; the latency recorded in its measurement result is 237 microseconds, and the latency of subgraph inference with ```ScheduleB``` is 224 microseconds.

From the example above we can tell that AutoTVM would not find ```ScheduleA```, the obviously better schedule. This means the tuning result is distorted, and the distortion would be more pronounced if the shape of the output were bigger.
Contributor

This is a great motivating example. I believe this is also an important reason that the auto-scheduler can achieve end-to-end speedups over AutoTVM.

Comment on lines +101 to +102
Case 1: The anchor operator has more than one implementation.
We register the subgraph tuning task by `outputs` in step 1.4. No matter how many implementations the anchor operator has, step 1.1 only picks the implementation with the highest plevel and returns the outputs computed by it, so the subgraph tuning task may not contain the potentially best implementation.
Contributor

This is my major concern too, as one important feature of AutoTVM is making implementation selection transparent to the tuning results. This is also a challenging issue that the auto-scheduler hasn't figured out yet. As you pointed out, selecting an implementation based on the plevel before tuning could be sub-optimal; is there any way to still generate N subgraph tasks (N = #impl) for the anchor op?

Author (@DzAvril)

Generating N subgraph tasks means we need N subgraph outputs, which means we need to run LowerToTECompute::Lower N times. It would be a big change to the current framework, but I think it would work. Even if we can resolve this case, case 2 is still a problem that prevents the anchor op from finding the best config among the N tuning results.
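A rough, runnable sketch of the idea (the callables below are stand-ins, not real TVM internals): lower the subgraph once per anchor implementation and register one task per lowering, analogous to invoking LowerToTECompute::Lower N times.

```python
# Hypothetical sketch of "one subgraph task per anchor implementation".
# The callables are stand-ins for lowering and task registration; they are
# not real TVM APIs.
from typing import Callable, List

def create_subgraph_tasks(
    implementations: List[str],
    lower_with_impl: Callable[[str], object],
    register_task: Callable[[object, str], dict],
) -> List[dict]:
    tasks = []
    for impl in implementations:
        # Analogous to running LowerToTECompute::Lower once per implementation.
        outputs = lower_with_impl(impl)
        tasks.append(register_task(outputs, impl))
    return tasks

# Toy usage with stub callables standing in for lowering and registration.
print(create_subgraph_tasks(
    ["dense_pack.x86", "dense_nopack.x86"],
    lower_with_impl=lambda impl: f"outputs<{impl}>",
    register_task=lambda outputs, impl: {"impl": impl, "outputs": outputs},
))
```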


@masahi commented Apr 13, 2022

Any update?

@DzAvril commented Apr 13, 2022

> Any update?

No updates recently. The three cases mentioned in the RFC are still obstacles to completing this PR.
