[RFC][AutoTVM] Tuning with subgraph granularity #64

Open
wants to merge 1 commit into base: main
Conversation

@DzAvril commented Mar 22, 2022

This is an RFC PR for apache/tvm#10650.

cc @FrozenGene @masahi @comaniac

Our implementation is inspired by the auto-scheduler.

# Unresolved questions
[unresolved-questions]: #unresolved-questions
Contributor

How are you handling serialization of the tuning results? AutoTVM logs just contain the implementation name, but that is not specific enough for subgraphs. Will you use a structural hash like the auto-scheduler does?
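For reference, TVM exposes a structural hash utility, `tvm.ir.structural_hash`; a minimal example of hashing a small Relay function (the shapes and dtypes below are purely illustrative and not tied to the RFC's implementation) looks like this:

```python
# Minimal illustration of tvm.ir.structural_hash on a small Relay function.
# Shapes and dtypes are illustrative only.
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 128), dtype="float32")
w = relay.var("w", shape=(256, 128), dtype="float32")
func = relay.Function([x, w], relay.nn.dense(x, w))

# Structurally identical functions hash to the same value.
print(tvm.ir.structural_hash(func))
```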

Contributor

Same question here. This also determines how general the tuned schedule can be. For example, consider the following hashing approaches:

H1: Similar to the auto-scheduler, such as ["impl-name", "args", "compute DAG MD5"]. This guarantees that the tuned schedule will be applied ONLY to exactly the same subgraph.

H2: Canonicalize the compute DAG. For example, canonicalize a compute DAG cast - dense - cast to elementwise - dense - elementwise. Since different elementwise ops in this DAG should result in almost the same performance, canonicalization makes the tuned result more generally applicable. The drawback is that we need to be careful that the canonicalized compute DAG does not cover too many undesired cases.
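A toy sketch of the H2 idea (the helper names, op set, and key layout below are illustrative, not part of AutoTVM): map elementwise ops to a generic token before hashing, so DAGs that differ only in their elementwise ops share one record key.

```python
# Illustrative sketch of H2-style canonicalization before hashing.
# The op set, helpers, and key layout are hypothetical, not AutoTVM APIs.
import hashlib

# Ops treated as interchangeable elementwise ops for keying purposes.
ELEMENTWISE_OPS = {"cast", "add", "relu", "sigmoid", "multiply"}

def canonicalize(dag_ops):
    """Replace every elementwise op name with a generic token."""
    return ["elementwise" if op in ELEMENTWISE_OPS else op for op in dag_ops]

def workload_key(impl_name, args, dag_ops):
    payload = repr((impl_name, args, canonicalize(dag_ops))).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

# Under H2, these two subgraphs share a key; under H1 they would not.
k1 = workload_key("dense_pack.x86", ((1, 128), (256, 128)), ["cast", "dense", "cast"])
k2 = workload_key("dense_pack.x86", ((1, 128), (256, 128)), ["relu", "dense", "add"])
assert k1 == k2
```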

Author (@DzAvril)

> H1: Similar to the auto-scheduler, such as ["impl-name", "args", "compute DAG MD5"]. This guarantees that the tuned schedule will be applied ONLY to exactly the same subgraph.

We use the first approach (H1). We serialize the I/O tensors of the subgraph into a hashable tuple and compute the tuple's MD5 to get a hash key, then use subgraph name + hash key as the subgraph task name.
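A minimal sketch of this keying scheme as described (the helper names and shapes are hypothetical, not the actual implementation): serialize the subgraph's I/O tensor shapes and dtypes into a hashable tuple, take its MD5, and append the digest to the subgraph name.

```python
# Hypothetical sketch of the described task-name scheme; not the RFC's code.
import hashlib

def io_signature(io_tensors):
    """io_tensors: iterable of (shape, dtype) pairs for the subgraph's inputs and outputs."""
    return tuple((tuple(shape), dtype) for shape, dtype in io_tensors)

def subgraph_task_name(subgraph_name, io_tensors):
    digest = hashlib.md5(repr(io_signature(io_tensors)).encode("utf-8")).hexdigest()
    return f"{subgraph_name}_{digest}"

# Example: a dense subgraph with one data input, one weight, and one output.
print(subgraph_task_name(
    "subgraph_dense",
    [((1, 128), "float32"), ((256, 128), "float32"), ((1, 256), "float32")],
))
```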

Case 1: The anchor operator has more than one implementation.
We register the subgraph tuning task by `outputs` in step 1.4. No matter how many implementations the anchor operator has, step 1.1 only picks the implementation with the highest plevel and returns the outputs computed by it, so the subgraph tuning task may not contain the potentially best implementation.

Case 2: The anchor operator's `fcompute` function needs a value from the config, as in the code block below. In step 2.2, computing the output calls the function `_pack_data`, and `cfg` is supposed to be the best config of the subgraph. But at step 2.2 we do not yet know which subgraph the anchor operator belongs to, so we cannot get the right config from the tuning history and have to fall back to the default one. This may cause a large performance regression.
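The code block referred to above is not reproduced in this excerpt; the following is only an illustrative sketch of the pattern, loosely modeled on TVM's x86 dense packing (here `cfg` is a plain dict standing in for the AutoTVM config, and the knob name "tile_x" and the shapes are made up):

```python
# Illustrative sketch of an fcompute whose output layout depends on a tuning
# knob. `cfg` is a plain dict stand-in for the AutoTVM config; the knob name
# "tile_x" and the shapes are hypothetical.
from tvm import te

def _pack_data(cfg, weight):
    N, K = weight.shape
    # At step 2.2 no tuning record is available yet, so only a default
    # packing factor can be used, which may be far from the tuned optimum.
    packw_bn = cfg.get("tile_x", 16)
    return te.compute(
        (N // packw_bn, K, packw_bn),
        lambda z, y, x: weight[z * packw_bn + x, y],
        name="packed_weight",
    )

# Building the compute with a fallback (empty) config.
weight = te.placeholder((256, 128), name="weight", dtype="float32")
packed = _pack_data({}, weight)
```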
Contributor

Do you see a potential way around this? All the cases where I've wanted subgraph tuning in AutoTVM are around layout rewriting.

Contributor

I feel it's the same issue as my previous comment. If we could generate N tasks based on the implementations, we should be able to work around this.

Author (@DzAvril)

> Do you see a potential way around this? All the cases where I've wanted subgraph tuning in AutoTVM are around layout rewriting.

Not yet.

Author (@DzAvril)

> I feel it's the same issue as my previous comment. If we could generate N tasks based on the implementations, we should be able to work around this.

I think they are not the same. The key problem in this case is that the tuning results are recorded with the subgraph task name as the workload key, but at step 2.2 the subgraph task name has not been created yet because the ops after the anchor op have not been visited. So we don't know which tuning log we should use.

Comment on lines +19 to +22
- We hand-wrote a schedule, marked as ```ScheduleA```, for the subgraph above; the latency of subgraph inference with it is 104 microseconds. Then we tuned the subgraph with single-op granularity. In the tuning log we found ```ScheduleA```, and the latency recorded in its measurement result is 329 microseconds.
- The best schedule from the tuning log in the step above is marked as ```ScheduleB```; the latency recorded in its measurement result is 237 microseconds, and the latency of subgraph inference with ```ScheduleB``` is 224 microseconds.

From the example above we can tell that AutoTVM would not find ```ScheduleA```, the obviously better schedule. This means the tuning result is distorted, and the distortion would be more pronounced if the shape of the output were bigger.
Contributor

This is a great motivating example. I believe this is also an important reason that the auto-scheduler can achieve end-to-end speedups over AutoTVM.

Comment on lines +101 to +102
Case 1: The anchor operator has more than one implementation.
We register the subgraph tuning task by `outputs` in step 1.4. No matter how many implementations the anchor operator has, step 1.1 only picks the implementation with the highest plevel and returns the outputs computed by it, so the subgraph tuning task may not contain the potentially best implementation.
Contributor

This is my major concern too, as one important feature of AutoTVM is making implementation selection transparent to the tuning results. This is also a challenging issue that the auto-scheduler hasn't figured out yet. As you pointed out, selecting an implementation based on the plevel before tuning could be sub-optimal; is there any way to still generate N subgraph tasks (N = #impl) for the anchor op?

Author (@DzAvril)

Generating N subgraph tasks means we need N subgraph outputs, which means we need to run LowerToTECompute::Lower N times. It would be a big change to the current framework, but I think it would work. Even if we can resolve this case, case 2 is still a problem that prevents the anchor op from finding the best config among the N tuning results.
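A rough, runnable sketch of the idea (the callables below are stand-ins, not real TVM internals): lower the subgraph once per anchor implementation and register one task per lowering, analogous to invoking LowerToTECompute::Lower N times.

```python
# Hypothetical sketch of "one subgraph task per anchor implementation".
# The callables are stand-ins for lowering and task registration; they are
# not real TVM APIs.
from typing import Callable, List

def create_subgraph_tasks(
    implementations: List[str],
    lower_with_impl: Callable[[str], object],
    register_task: Callable[[object, str], dict],
) -> List[dict]:
    tasks = []
    for impl in implementations:
        # Analogous to running LowerToTECompute::Lower once per implementation.
        outputs = lower_with_impl(impl)
        tasks.append(register_task(outputs, impl))
    return tasks

# Toy usage with stub callables standing in for lowering and registration.
print(create_subgraph_tasks(
    ["dense_pack.x86", "dense_nopack.x86"],
    lower_with_impl=lambda impl: f"outputs<{impl}>",
    register_task=lambda outputs, impl: {"impl": impl, "outputs": outputs},
))
```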


@masahi commented Apr 13, 2022

Any update?

@DzAvril commented Apr 13, 2022

> Any update?

No updates recently. The three cases mentioned in the RFC are still obstacles to completing this PR.
