
[Relay] [Pass] Add mixed precision (e.g. FP16) model conversion pass #8069

Merged
merged 59 commits into apache:main on Jun 21, 2021

Conversation

@AndrewZhaoLuo (Contributor) commented May 18, 2021

This implements a pass to convert an fp32 relay graph into an fp16 version. The RFC is described here.

Changes:

  • Add the pass and unit tests
  • Make DataType hashable via std::hash (it is kind of slow though) (include/tvm/runtime/data_type.h)
  • Fix a bug where conv2d did not work with accumulation dtypes different from its input dtypes (python/tvm/topi/nn/conv2d.py)

Testing

Some unit tests.

Models tested (onnx):
Image Classification

Object Detection

Embedding Models

Super resolution:

NLP:

Models tested (relay native):

  • ResNet18
  • ResNet18-3D
  • Densenet
  • LSTM (unrolled)
  • Squeezenet
  • Mobilenet

By "tested" I mean I confirmed the pass performed some transformation on the graph, and that a forward pass could be run on CPU with output that roughly matches the FP32 output. I have no performance metrics or results on other devices yet.
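
For reference, here is a rough sketch of what that kind of smoke test looks like. It uses the pass's final name (ToMixedPrecision, adopted later in this thread); the input name, shape, tolerances, and target are placeholders, not the exact values used:

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    # `mod` and `params` come from a frontend importer; "data" and the shape are placeholders.
    inp = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")
    mod = relay.transform.InferType()(mod)
    fp16_mod = relay.transform.ToMixedPrecision("float16")(mod)

    def run(module):
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(module, target="llvm", params=params)
        m = graph_executor.GraphModule(lib["default"](tvm.cpu(0)))
        m.set_input("data", inp)
        m.run()
        return m.get_output(0).numpy()

    # "Matches somewhat": compare against the fp32 result with a loose tolerance.
    np.testing.assert_allclose(run(fp16_mod), run(mod), rtol=1e-2, atol=1e-2)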

Future PRs (in order of priority)

  1. Show this actually leads to speedups!
  2. Make the coloring function and output_dtype/accumulation_dtype functions extensible via Python
  3. An extensive audit of existing relay ops into the coloring lists.
  4. Write a pass to fold unnecessary casts, e.g. cast(fp16) --> cast(fp32) --> cast(fp16) can probably just be one cast(fp16) (see the sketch after this list)
  5. Rename the colors into something less generic and easily confused for something else
  6. Rewrite the signature of functions automatically (right now everything is kept in fp32 and internal to the function things are cast)
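
As a concrete illustration of item 4 above, the kind of redundant chain the folding pass would target looks like this (a sketch using the Relay Python API; the folding itself is future work, not something this PR does):

    from tvm import relay

    x = relay.var("x", shape=(1, 16), dtype="float32")
    # Chain produced by naive cast insertion: fp32 -> fp16 -> fp32 -> fp16.
    chained = relay.cast(relay.cast(relay.cast(x, "float16"), "float32"), "float16")
    # The precision loss already happens at the first cast, so the chain is
    # equivalent to a single cast, which is what the folding pass would emit:
    folded = relay.cast(x, "float16")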

Known issues

  • Right now the pass will mutate nodes in the original relay graph

cc @mbrookhart, @csullivan: please take a look and add relevant reviewers

Speedups (add as I go along)

BERT w/ input shape [1, 128] on M1 Mac (based on https://github.com/octoml/Apple-M1-BERT) and 10000 tuning trials:

FP32 version - Mean inference time (std dev): 107.82 ms (3.39 ms)
FP16 version - Mean inference time (std dev): 80.04 ms (6.19 ms)

~25% speedup!

Yolov2 (https://github.com/onnx/models) w/ 10000 tuning trials on M1 Mac
FP32 version - Mean inference time (std dev): 112.21 ms (3.75 ms)
FP16 version - Mean inference time (std dev): 71.05 ms (4.04 ms)

~36% speedup!
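
For reference, a rough sketch of how such mean/std numbers are typically collected with TVM's time_evaluator (assuming a GraphModule m built as in the earlier sketch; the run counts here are arbitrary):

    import numpy as np
    import tvm

    # Time the compiled "run" function; results are per-repeat mean times in seconds.
    timer = m.module.time_evaluator("run", tvm.cpu(0), number=10, repeat=30)
    times_ms = np.array(timer().results) * 1000
    print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (times_ms.mean(), times_ms.std()))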

@AndrewZhaoLuo changed the title from "Add FP16 model conversion pass" to "[Relay] [Pass] Add FP16 model conversion pass" on May 18, 2021
@AndrewZhaoLuo force-pushed the andrewluo-add-fp16-conversion-pass branch from 6aed619 to 86744b0 on May 25, 2021 19:47
@AndrewZhaoLuo force-pushed the andrewluo-add-fp16-conversion-pass branch 2 times, most recently from 2c6c2c1 to 8bb3ad6 on June 3, 2021 23:23
@anijain2305 (Contributor)

Thanks for the useful feature. Is this ready for review?

@AndrewZhaoLuo (Contributor Author)

Hey Animesh, it'll be ready for review soon, probably by Monday morning (PST).

There are still some misc. improvements that should be made, but I've decided to push those down to later PRs.

@AndrewZhaoLuo force-pushed the andrewluo-add-fp16-conversion-pass branch 3 times, most recently from 8d02038 to 483dc29 on June 5, 2021 00:41
@AndrewZhaoLuo marked this pull request as ready for review on June 5, 2021 00:57
@AndrewZhaoLuo (Contributor Author)

This is ready for review

"""
return _ffi_api.AnnotateSpans()


def RewriteFP16():
Contributor

Do you want to call it AMPRewriter?

Contributor Author

Good idea. Done.

@anijain2305 (Contributor) left a comment

Overall the structure looks good. I will do a more detailed review later on. The only suggestion for now is to think about naming: should we call it AMP? Later on we can reuse this for BF16.

// GREEN colored ops should always be done in FP16 due to the speed and memory savings
// GRAY colored ops can be done in FP16 but don't have speedups to justify a dedicated cast.
// RED colored ops should not be done in FP16 due to numerical reasons.
enum FP16ConversionCategory { RED, GRAY, GREEN };
Contributor

What happens if there is an op that is not associated with any of the colors? Is the default RED?

Contributor Author

By default it would be RED and a warning would be emitted.

Contributor

Some suggestions:

  1. These are more like an attribute than a category.
  2. Use straightforward terms, such as ALWAYS/FOLLOW/NEVER, instead of RED/GRAY/GREEN.
  3. Emitting warnings for non-specified ops may result in tedious messages. We could make it configurable to let users decide whether to print out these ops.

Contributor Author

I've implemented the suggestions listed.


if (color == op_to_initial_color.end()) {
if (ignore_missing) {
LOG(WARNING) << "Op name " << op_name << " not in included in fp16 conversion lists!.";
Contributor

Remove the period at the end.

Contributor Author

Done

LOG(WARNING) << "Op name " << op_name << " not in included in fp16 conversion lists!.";
return RED;
} else {
LOG(FATAL) << "Op name " << op_name << " not in included in fp16 lists!.";
Contributor

Remove the period at the end.

Contributor Author

Done

@anijain2305 (Contributor)

Can we get a few more initial reviews - @mbrookhart , @csullivan?

@AndrewZhaoLuo I would also suggest testing a dynamic model like SSD or Mask R-CNN. Your current list of object detection models involves YOLO, which is a static model.

@mbrookhart (Contributor) left a comment

A few minor changes for a first pass. I like the big idea of the pass; I want to dig into a couple of the details still, but overall it's looking very good.

auto h1 = std::hash<T1>()(pair.first);
auto h2 = std::hash<T2>()(pair.second);

return h1 ^ (h2 << 1);
Contributor

If I remember correctly, the xor hash combine is pretty prone to hash conflicts? Maybe use the boost approach? return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2))

Contributor Author

Done.


public:
explicit AmpGraphCreator(ColorFunc colorer, OutputDtypeFunc output_dtype_func)
: ExprMutator(), colorer(colorer), output_dtype_func(output_dtype_func) {}
Contributor

Since you're using the recursive mutator here, you might run into stack overflows on larger models. I haven't looked at this pass in much detail yet; is it possible to do this with a post-order traversal (or a MixedModeMutator)?

Contributor Author

Yes, right now I actually depend on a post-order traversal (since we want all arguments of a call node to be mutated before we decide whether to convert the call node to fp16). I'll look into MixedModeMutator to solve this issue.

Contributor Author

Done

Comment on lines 1224 to 1225
np.random.seed(90)

Contributor

Tianqi prefers that we don't set random seeds, so that intermittent bugs can be found across CI runs.

Contributor Author

Oh ok, that's an interesting idea. I had a failure where the passing rtol was 1.05e-5, so I'm just going to increase the tolerance.

@AndrewZhaoLuo (Contributor Author)

Can we get a few more initial reviews - @mbrookhart , @csullivan?

@AndrewZhaoLuo I would also suggest testing a dynamic model like SSD or Mask R-CNN. Your current list of object detection models involves YOLO, which is a static model.

I tried it on an SSD model and it seems to work fine. For Mask R-CNN I haven't found a spare model file that converts well and can be run normally in FP32.

@AndrewZhaoLuo force-pushed the andrewluo-add-fp16-conversion-pass branch from dd03c23 to 391b15a on June 9, 2021 01:46
@anijain2305 (Contributor)

TF SSD is good enough. Thanks @AndrewZhaoLuo

@csullivan (Contributor)

Thanks for this great PR! Would it be too much to ask for AMPRewrite and the corresponding infra to support mixed precision with generic reduced-precision floating point types? I notice the main assumption is downcasting to float16, though TVM has support for other reduced-precision fp types for which mixed precision is useful, e.g. float32 + bfloat16, as well as possibly user-defined floating point types.

@comaniac (Contributor) left a comment

Thanks for the RFC and PR. The overall idea LGTM, and I believe this will be an important feature in the future. I just have some concerns about the current implementation.

@@ -1199,3 +1198,20 @@ def FakeQuantizationToInteger():
The registered SimplifyExpr pass.
"""
return _ffi_api.FakeQuantizationToInteger()


def AMPRewrite():
Contributor

Since this is the user API, we might need to think a bit more to make it more straightforward. For example, AMP or AutoCast would be better names IMHO.

Contributor Author

I disagree. All the passes have names that are verbs describing what they do, while AMP is a noun. Maybe AutoCast would be better, but it doesn't capture the mixed-precision nature.

Maybe ToMixedPrecision would be a better name?

Contributor

I don't think AutoCast fails to capture the nature. For example: https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast

Contributor Author

Hmm, I would prefer ToMixedPrecision still if that is fine with you.

The example you list only works for me because it exists under the amp namespace. AutoCast by itself, without being part of torch.cuda.amp, does not convey mixed precision.

Contributor

I'm fine if no others complain about this naming.

src/relay/transforms/fp32_to_fp16.cc (outdated review thread, resolved)
tests/python/relay/test_fp32_to_fp16_transform.py (outdated review thread, resolved)


using OpStringSet = std::unordered_set<std::string>;

// Default lists inspired from TF's classifications:
Contributor

I'd prefer not to specify op lists in a pass: it means we need to maintain this pass every time we add a new op. It would be better to follow the logic of other similar passes: register an attribute on each op, and if an op doesn't have this attribute registered, use the default behavior. It is also impossible for this implementation to accept user-defined rules from Python.

Contributor Author

Good advice.

I'll use better terms instead of RED/GRAY/GREEN.
I'll also make the warning messages configurable for the user.
For registering attributes on each op, I think it's probably a good idea, but do you have an example of this strategy I could look at?
User-defined rules from Python are a goal I will try for. It might take a little longer though.

Contributor

You can refer to the design document of the layout conversion pass: https://tvm.apache.org/docs/dev/convert_layout.html. It's actually not hard to take rules from Python with this design.

Contributor Author

This is now done.
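
For reference, registering a rule from Python ends up looking roughly like the sketch below. The attribute name is the one this pass queries; the op name is hypothetical, and the return convention (conversion category, accumulation dtype, output dtype) plus the category constant reflect my reading of the merged code, so treat the details as a sketch:

    import tvm
    from tvm.relay.transform import mixed_precision  # home of the default op rules

    # Hypothetical custom op; real ops get their defaults from mixed_precision.py.
    @tvm.ir.register_op_attr("nn.my_custom_conv", "FTVMMixedPrecisionConversionType")
    def my_custom_conv_mixed_precision_rule(call_node, mixed_precision_type):
        # Always convert; accumulate in fp32 (assumes the op has an out_dtype attr),
        # and produce outputs in the requested mixed precision type.
        return [mixed_precision.MIXED_PRECISION_ALWAYS, "float32", mixed_precision_type]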

src/relay/transforms/fp32_to_fp16.cc (outdated review thread, resolved)
// Determine the final color.
FP16ConversionCategory final_color;
if (initial_color == GRAY) {
final_color = all_args_fp16_compatible ? GREEN : RED;
Contributor

Can you provide an example of FP16 incompatible?

Contributor Author

An example with concat.

We have two branches whose outputs are fed into concat.

The first branch has a RED operation and returns an FP32 tensor.
The second branch returns an FP16 tensor.

Now that I say this, it might be better to be a bit smarter about GRAY ops when we have heterogeneous floating point types coming in.

E.g. let's say we had a concat with 10 fp16 args and 1 fp32 arg. It would be wasteful to convert everything to fp32 by default and set the color to RED in this case.

I will change this so the number of fp16/fp32 args is taken into account: if there is a majority of fp16 args or a tie, we color GREEN; otherwise we color RED. Thoughts?

Contributor

The workaround sounds fine to me. Again, I'd suggest putting these op-specific heuristics into op attributes instead of this pass.

Contributor Author

On closer thought, I will leave things as they are, since only some ops will benefit from the trick I described. In the future, exposing this via op attributes might be worthwhile, but I cannot think of major savings that would come from it.

src/relay/transforms/fp32_to_fp16.cc (outdated review thread, resolved)
Attrs new_attrs = GetNewAttrs(call_node, output_dtypes.accumulation_dtype);
Expr output = Call(new_op, call_args, new_attrs, call_arg_types, call_node->span);
if (output_dtypes.accumulation_dtype != output_dtypes.output_dtype) {
output = CastArg(output, GetType(output), output_dtypes.output_dtype);
Contributor

Wouldn't this introduce unnecessary cast ops? For example, if the accumulation dtype is FP32 and the following op is RED, will this produce A(GREEN) - cast_to_fp16 - cast_to_fp32 - B(RED)?

Contributor Author

Hmm I don't believe so since CachedCast will also cache the reverse result.

E.g. CachedCast(A(Green), FP16) would produce A(GREEN) - cast_to_fp16

But internally it would cache:
Node, wanted_dtype
A(GREEN), FP16 --> cast_to_fp16
cast_to_fp16, FP32 --> A(GREEN)

So attempting to cast cast_to_fp16 to fp32 would return A(GREEN)

It would be worth having a test case to cover this, however, to make sure.

Contributor

I see... this mechanism is interesting and I haven't paid much attention to it. At first glance, I would worry that the cache could blow up when the model is very large, but I'll probably take a deeper look at this mechanism later.

Comment on lines 208 to 216
// TODO(AndrewZhaoLuo): remove when batch_matmul handles accumulation dtypes well.
// Batched matmul has inconsistent support for mixed precision operations.
// Many schedules ignore the out_dtype attribute which leads to errors when
// input types do not match the out_dtype. Therefore, accumulate to fp16 if green.
if (auto op_node = call->op.as<OpNode>()) {
if (op_node->name == "nn.batch_matmul") {
return {DataType::Float(16), DataType::Float(16)};
}
}
Contributor

This again illustrates the importance of registering a casting function on each op instead of here.

Contributor Author

This is now functionally done in the Python interface.

@AndrewZhaoLuo (Contributor Author) commented Jun 9, 2021

Hey folks, I've covered the simple changes requested; here is the list of more involved changes along with the associated reviewer. Several of these changes were planned as future PRs, but it might be best to just commit this correctly the first time (since it doesn't really touch other files):

  • Support other floating point types out of the box (e.g. bfloat16)
    • Need to test bfloat16 on gpu
  • Naming of things (pass, GREEN/RED/GRAY, etc.)
  • Python interface for Coloring/Accumulation logic
  • How to register ops for coloring
  • MixedModeMutator to avoid stackoverflow

Let me know if I missed anything

conv2d might take in fp16 and give a fp32 result.
Attrs is const because we get it as a const.
*/
T* mutable_attrs = const_cast<T*>(attrs);
Member

Can we create and return a new attribute?

Contributor Author

Done

@masahi (Member) commented Jun 11, 2021

@AndrewZhaoLuo I can suggest trying out DETR model. This is an interesting transformer based object detection model that consists exclusively of conv2d / matmul (no NMS). I believe this is a great fit for fp16 quantization. I have a tuning and benchmark script at https://github.com/masahi/torchscript-to-tvm/blob/master/detr/detr_test.py (should work with PT 1.7). I'm interested in both fp32 and fp16 performance on M1.

I also have Faster R-CNN, but it requires a good BLAS lib for reasonable performance (due to dynamic-batch dense), so I don't recommend it on M1. Mask R-CNN is even worse; it has dynamic-batch dense/conv2d/conv2d_transpose.

Another model that could be interesting is TF2 ssd mobilenet with combined NMS. Many people are interested in this variant of SSD and I have a model that cleanly exports to ONNX. Ask me if you want to try this out, I can give you the model and the script.

@masahi (Member) commented Jun 11, 2021

@AndrewZhaoLuo oh by M1 do you mean its cpu or gpu (metal)?

@AndrewZhaoLuo (Contributor Author)

@AndrewZhaoLuo oh by M1 do you mean its cpu or gpu (metal)?

I think it's CPU. Here's the benchmarking + tuning script I used: https://github.com/AndrewZhaoLuo/TVM-Sandbox/blob/a3c4b6b2235afb1826b237af1136bbb9539c9ff9/fp16_pass/benchmark_m1_mac_fp16.py

The other models you have are interesting. I think the SSD model I used has combined NMS; at least, it returns variable-length tensors representing different numbers of detected objects.

@masahi (Member) commented Jun 11, 2021

I see, interesting to see that fp16 makes things faster on CPU.

So you've tested only on LLVM? Does this work on metal target? Not sure if our metal backend supports fp16 or if M1 GPU is good at fp16 in general @echuraev

Later I can test it on CUDA (tensorcore) and OpenCL (intel), and hopefully @Lunderberg for vulkan.

Comment on lines 184 to 185
Expr result = expr_dtype == wanted_dtype ? expr : Cast(expr, wanted_dtype);
cast_nodes_cache[{expr_node, wanted_dtype}] = result;
@comaniac (Contributor) commented Jun 11, 2021

I reviewed the cache mechanism and I think I got the idea. Here is the example I went through:

Consider the op A (out: fp32, want: fp16), the cache will look like the following after processing A's output:

(A, fp16): cast
(cast, fp32): A

Now consider the followed op B:
Case 1. If B wants fp32, then like you mentioned before, we query (cast, fp32) and get A, so it becomes A -> B.
Case 2. If B wants fp16, then we query (cast, fp16), which is missed and a new entry (cast, fp16): cast is created and returned, so it becomes A -> cast -> B.

This mechanism seems working well, and the cache size should be reasonable as it only keeps pointers. Two possible improvements:

  1. Apparently, the cache entry (cast, fp16): cast in the example is not necessary. I think we can simply return expr when expr_dtype == wanted_dtype?
  2. The created cast ops may be useless, such as the one in case 1. Is it possible to create this op lazily? For example, when casting the output, we only create a cache entry but don't really create the node. Once the entry is queried by the followed ops for the first time, we create the cast node and update the cache.

Another direction I would actually recommend is removing the cache and letting this pass generate as many cast ops as it wants, and then running the SimplifyExpr pass afterward to cancel back-to-back cast ops (ref: #8081). IIUC, this should generate the same IR as the current pass, so it doesn't hurt the final performance (please correct me if I missed something).

Contributor Author

  1. Right now that is functionally what happens with this line:
    Expr result = expr_dtype == wanted_dtype ? expr : Cast(expr, wanted_dtype);

It still creates a cache entry though, so I reorganized it to be clearer and not insert into the cache when expr_dtype == wanted_dtype.

  2. Hmm, I believe creating the op lazily will not have any benefit, because there aren't any useless casts (see point 1).

The idea of having another pass handle back-to-back casts is appealing, as the tool can be used in many other situations. The main concern I have is about correctness, e.g. does it handle weird edge cases well? I'll take a closer look at the existing PR and think a little more about this.

I do agree that this is a better direction to go, however, and will refactor the pass when a sufficiently correct cast-folding pass exists and is checked into main.
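
For readers following along, here is a conceptual Python sketch of the bidirectional cache being discussed (an illustration of the idea only, with made-up names, not the pass's C++ code):

    cast_cache = {}  # (id of expr, wanted dtype) -> expr already cast to that dtype

    def cached_cast(expr, expr_dtype, wanted_dtype, make_cast):
        if expr_dtype == wanted_dtype:
            return expr  # no cast needed, and no cache entry inserted
        key = (id(expr), wanted_dtype)
        if key not in cast_cache:
            cast = make_cast(expr, wanted_dtype)
            cast_cache[key] = cast
            # Record the reverse mapping too, so casting the cast back to the
            # original dtype returns the original node instead of a new cast.
            cast_cache[(id(cast), expr_dtype)] = expr
        return cast_cache[key]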

@echuraev (Contributor)

So you've tested only on LLVM? Does this work on metal target? Not sure if our metal backend supports fp16 or if M1 GPU is good at fp16 in general @echuraev

The Metal backend supports fp16. And as far as I know @elvin-n has run fp16 models with our Metal backend and collected some performance metrics. I think he'll add some information about it.

As for M1, we haven't tried running fp16 models on Metal on M1 yet. Theoretically it should work, but we should check it.

@AndrewZhaoLuo (Contributor Author)

@anijain2305
@masahi
@comaniac
@mbrookhart
@csullivan

PTAL. I believe I've addressed all the major points.

@elvin-n (Contributor) commented Jun 17, 2021

So you've tested only on LLVM? Does this work on metal target? Not sure if our metal backend supports fp16 or if M1 GPU is good at fp16 in general @echuraev

The Metal backend supports fp16. And as far as I know @elvin-n has run fp16 models with our Metal backend and collected some performance metrics. I think he'll add some information about it.
As for M1, we haven't tried running fp16 models on Metal on M1 yet. Theoretically it should work, but we should check it.

Have you or @elvin-n tried tuning metal on m1 mac yet? Doesn't seem to work out of the box.

We have not tried to tune Metal on M1 yet. Have you tried with AutoTVM or the AutoScheduler?

@echuraev (Contributor)

So you've tested only on LLVM? Does this work on metal target? Not sure if our metal backend supports fp16 or if M1 GPU is good at fp16 in general @echuraev

The Metal backend supports fp16. And as far as I know @elvin-n has run fp16 models with our Metal backend and collected some performance metrics. I think he'll add some information about it.
As for M1, we haven't tried running fp16 models on Metal on M1 yet. Theoretically it should work, but we should check it.

Have you or @elvin-n tried tuning metal on m1 mac yet? Doesn't seem to work out of the box.

No, we didn't try it. I'll take a look at it.

@comaniac (Contributor) left a comment

I don't have other major comments. I think, as the first PR of the AMP support, this PR is basically ready to be merged once the comments from @mbrookhart are addressed. The other improvements can be done in follow-up PRs.

python/tvm/relay/transform/mixed_precision.py (outdated review thread, resolved)
@AndrewZhaoLuo (Contributor Author) commented Jun 17, 2021

@mbrookhart PTAL. I'm going to push ADT support down to a future PR.

@mbrookhart (Contributor) left a comment

I'm happy with this; we have a couple of TODOs, but I think the core of it is in a great place.

@masahi @anijain2305 Any more comments? Otherwise I'll plan to merge later this afternoon.

@Lunderberg (Contributor)

Later I can test it on CUDA (tensorcore) and OpenCL (intel), and hopefully @Lunderberg for vulkan.

Currently, I can run all the tests in test_to_mixed_precision.py with the LLVM target/device, but both cuda and vulkan backends throw an exception at TVMFuncCall in c_runtime_api.cc if I edit the run_module function to use a different target.

On the cuda side, it's failing a check that requires 16-bit floats to be used in pairs.

Check failed: lanes % 2 == 0 (1 vs. 0) : only support even lane for half type

On the vulkan side, it's something similar with the validation checks failing an alignment rule.

Check failed: res == SPV_SUCCESS (-10 vs. 0) :  index=27 error:Structure id 12 decorated as Block for variable in StorageBuffer storage class must follow standard storage buffer layout rules: member 0 contains an array with stride 6 not satisfying alignment to 8
%_struct_12 = OpTypeStruct %_runtimearr_v3half  

I don't think either of these are reasons not to merge, and I've added the vulkan errors to my todo list for the ongoing float16 work.

@masahi merged commit 8e63486 into apache:main on Jun 21, 2021
@masahi (Member) commented Jun 21, 2021

Thanks @AndrewZhaoLuo for the great work, and everyone for reviews!!

I'll follow up with CUDA and OpenCL support.

@AndrewZhaoLuo (Contributor Author) commented Jun 21, 2021

Added some tracking issues for CUDA and Vulkan:
#8295
#8294

@CoinCheung

Hi,

I tried this:

def compile_model(mod, params, target, logfile, save_path):
    tvm.relay.backend.compile_engine.get().clear()
    mod = tvm.relay.transform.ToMixedPrecision(
            mixed_precision_type='float16')(mod)
    with tvm.autotvm.apply_history_best(logfile):
        with tvm.transform.PassContext(opt_level=3):
            lib = tvm.relay.build(mod, target=target, params=params)
    lib.export_library(save_path)  # save the compiled model; the path must end in .so or C++ won't recognize it

But I got the error:

Traceback (most recent call last):
  File "main.py", line 207, in <module>
    args.save_path)
  File "main.py", line 122, in compile_model
    mixed_precision_type='float16')(mod)
  File "/root/build/tvm/python/tvm/ir/transform.py", line 161, in __call__
    return _ffi_transform_api.RunPass(self, mod)
  File "/root/build/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  23: TVMFuncCall
  22: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::transform::Pass, tvm::IRModule)>::AssignTypedLambda<tvm::transform::{lambda(tvm::transform::Pass, tvm::IRModule)#7}>(tvm::transform::{lambda(tvm::transform::Pass, tvm::IRModule)#7}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  21: tvm::transform::Pass::operator()(tvm::IRModule) const
  20: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  19: tvm::relay::transform::FunctionPassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  18: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::relay::Function (tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::relay::transform::ToMixedPrecision(tvm::runtime::DataType, int)::{lambda(tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)#1}>(tvm::relay::transform::ToMixedPrecision(tvm::runtime::DataType, int)::{lambda(tvm::relay::Function, tvm::IRModule, tvm::transform::PassContext)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  17: tvm::relay::ToMixedPrecision(tvm::RelayExpr const&, tvm::runtime::DataType const&, int)
  16: tvm::relay::MixedModeMutator::VisitExpr(tvm::RelayExpr const&)
  15: tvm::relay::MixedModeMutator::VisitLeaf(tvm::RelayExpr const&)
  14: _ZN3tvm5relay16MixedModeMutator17DispatchVisitExprERKNS_9Re
  13: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  12: tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)
  11: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlR
  10: tvm::relay::MixedPrecisionPass::VisitExpr_(tvm::relay::FunctionNode const*)
  9: tvm::relay::ExprMutator::VisitExpr_(tvm::relay::FunctionNode const*)
  8: tvm::relay::MixedModeMutator::VisitExpr(tvm::RelayExpr const&)
  7: tvm::relay::MixedModeMutator::VisitLeaf(tvm::RelayExpr const&)
  6: _ZN3tvm5relay16MixedModeMutator17DispatchVisitExprERKNS_9Re
  5: tvm::relay::ExprMutator::VisitExpr(tvm::RelayExpr const&)
  4: tvm::relay::ExprFunctor<tvm::RelayExpr (tvm::RelayExpr const&)>::VisitExpr(tvm::RelayExpr const&)
  3: _ZZN3tvm5relay11ExprFunctorIFNS_9RelayExprERKS2_EE10InitVTableEvENUlR
  2: tvm::relay::MixedModeMutator::VisitExpr_(tvm::relay::CallNode const*)
  1: tvm::relay::MixedPrecisionPass::Rewrite_(tvm::relay::CallNode const*, tvm::RelayExpr const&)
  0: tvm::Op::GetAttrMapContainer(tvm::runtime::String const&)
  File "/root/build/tvm/src/ir/../node/attr_registry.h", line 146
TVMError: Attribute 'FTVMMixedPrecisionConversionType' is not registered

Did I miss any key point about using this feature?

@comaniac (Contributor)

  • You might need to rebuild TVM.
  • Please do not ask usage questions under feature PRs. Please post the question to the discuss forum instead.

@CoinCheung

Hi,

I am not sure whether this is a usage question or whether the code can be refined. I am using a quite recent commit pulled from GitHub:

(screenshot of the commit omitted)

And I built from source following the steps on the docs website. The only changes in config.cmake are as follows:

set(USE_LLVM ON)
set(USE_CUDA ON)
set(USE_CUDNN ON)

Should I still go to the discussion forum for help?

@comaniac (Contributor)

Yes. Please go to the discuss forum. You can refer to this PR and tag relevant people in the post.

ylc pushed a commit to ylc/tvm that referenced this pull request Sep 29, 2021
…pache#8069)

* Initial skeleton for fp16 pass.

initial green gray and red lists

move fp16 conversion to own folder

second pass example

split up files a bit more

cool nodes bro

initial transform pass

* Working python version of fp16 pass.

fix topi conv2d not casting kernel to output type

working resnet, but conv2d topi intrinsics need work

tests for resnet

add more tests, extend coverage for converter

update tests, ensure red ops convert back to fp32

clean up code a bit

simplify fp16 output dtype examination

fix pass

update tests

initial coloring

* Rewrite python passes in C++

inspect arg fields

add propagate colors pass"

private -> public inheritance"

rewrite draft

full transformation in c++

remove prints

fp16 pass the proper wrapping

insert extra cast to pass type checking

fix previously broken test by removing cast in wrong scenario

remove old python_files

* Extend support to things besides CallNodes. E.g. tuples and lets

fp32 invalidate typing instead of cast adding

basic tests

skeleton code out

Stash work -- casting based on checked types

working let statements

add more ops, handle functions more generally

add multiply, fix broken case

support TupleNodes properly, move hash function for datatypes into data_type.h

update simple let test with structural expectation

cleanup p1

remove old file

* Rewrite how and when casting is done by checking types directly.

add support for GPT2, BERT

add some more comments

new single pass version

formatting

make a lot of things const references

clean up tests

more cleanup

more comments

final comment

add newline

* linting and formatting

* add AST header

* remove todo

* lint errors2

* remove i386 incompatible features

* Trigger CI again

* set seed

* lint

* address animesh's initial comments

* mutate attributes only if they were originally floats

* initial comments from matthew

* add comment on hashing strat

* add missing ;

* edge case when mutating attrs

* Cody's easy to address comments

* add test to show green-red casting works

* remove np.random seed from each test

* remove as many references to fp16 types in favor of generic mixed types

* rename RED, GREEN, GRAY to MIXED_PRECISION_ALLOW, etc.

* skeleton for supporting arbitrary mixed types

* cool tests

* Using MixedModeMutator

* rename things ToMixedPrecision

* rename passes to amp.cc

* rename tests to match transform

* clean up typos

* rename even better to_mixed_precision

* don't insert into cache when dtypes equal

* new python interface for registering ops

* cleaner registering ops

* add fp64 structural test

* clean up and comments

* make copy of attributes

* asf header

* pylint

* remove TODO which is solved

* Apply nits from code review (comaniac)

Co-authored-by: Cody Yu <comaniac0422@gmail.com>

* change cast_node_cache --> cast_node_cache_

* add check for returned vals

* better error msg

* docstring for pass in python

* fix default behavior to be proper

* better error reporting via single flag

* priority to 0

* address more nits

* fix story telling slightly

* restart

* correct docstring

* change class fields to have _ at end

* add class docstring

* add comment on accumulation dtype hack

* ADT warnings

* add todo

* fix linter

Co-authored-by: Cody Yu <comaniac0422@gmail.com>
@MeJerry215

It seems like all conv/matmul/dense ops are cast to fp16 when using the mixed precision pass.
(screenshot omitted)
But I still have a question: why aren't the weights cast to fp16 so that all the cast ops can be removed?

@comaniac (Contributor)

Please do not ask questions in the PR directly.

Weights have cast ops because they are parameters instead of constants. You have to bind the parameters first, run ToMixedPrecision, and then run FoldConstant to remove the casts.
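
A minimal sketch of that flow (assuming mod and params already exist; exact API spellings may vary slightly across TVM versions):

    from tvm import relay
    from tvm.relay.build_module import bind_params_by_name

    mod["main"] = bind_params_by_name(mod["main"], params)  # params become constants
    mod = relay.transform.ToMixedPrecision("float16")(mod)  # insert mixed-precision ops and casts
    mod = relay.transform.FoldConstant()(mod)               # fold the casts on constant weights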

zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request Mar 4, 2022
…pache#8069)
