apache · manupak · Feb 15, 2022 · Dec 10, 2021 · Feb 4, 2022 · Feb 8, 2022
diff --git a/rfcs/0046-module-based-model-runtime-for-aot.md b/rfcs/0046-module-based-model-runtime-for-aot.md
@@ -0,0 +1,348 @@
+# Module-based Model Runtime Interface for AOT
+
+- Feature Name: module_based_model_runtime_for_aot
+- Start Date: 2021-09-17
+- RFC PR: [apache/tvm-rfcs#0046](https://github.com/apache/tvm-rfcs/pull/0046)
+- GitHub Issue: [apache/tvm#0000](https://github.com/apache/tvm/issues/0000)
+
+# **Summary**
+
+This RFC describes a [Module-based Model Runtime
+interface](https://discuss.tvm.apache.org/t/discuss-module-based-model-runtime-interface/5025) for
+the [Ahead-of-Time Executor](https://discuss.tvm.apache.org/t/implementing-aot-in-tvm/9206), thereby
+enabling its use from the TVM C++ Runtime.
+
+# **Motivation**
+
+The microTVM project has made significant progress towards an Ahead-of-Time Executor for compiled
+Relay models. At the time of writing, it is possible to codegen a TIR function which executes
+Relay models that have known shapes, don't have graph-level control flow, and execute only on the
+CPU device. Currently, the C runtime is the only runtime environment which can interact with
+this generated code. However, significant interest exists in enabling the C++ runtime to use the
+Ahead-of-Time executor.
+
+# **Guide-level explanation**
+
+Users select the AOT executor at compile time through the traditional GraphExecutor compilation flow
+(e.g. `tvm.relay.build`) by including `-executor=aot` in the Target
+[1]. The return value of `tvm.relay.build` in this case is an `AotExecutorFactory` Module
+object. Users instantiate the AOT executor via `AotExecutorFactory` as they do with `GraphExecutor`:
+
+```python
+import tvm
+from tvm import relay
+
+ir_mod = tvm.parser.fromtext("""\
+      #[version = "0.0.5"]
+      def @main(%a : Tensor[(1, 2), uint8], %b : Tensor[(1, 2), uint8]) {
+          %0 = %a + %b;
+          %0
+      }"""
+    )
+
+# With the AOT executor selected, relay.build returns an AotExecutorFactory.
+with tvm.transform.PassContext(opt_level=3):
+  factory = relay.build(
+       ir_mod, "llvm -executor=aot", module_name="my_mod")
+
+# Instantiate an AotExecutor on the CPU device.
+aot_executor = factory["my_mod"](tvm.cpu(0))
+```
+
+`AotExecutor` supports the traditional Module-Based Model Runtime Interface and can be used just as
+a user would normally use `GraphExecutor`:
+
+```python
+import numpy as np
+
+# Inputs must match the declared shape (1, 2) and dtype uint8.
+aot_executor.set_input("a", tvm.nd.array(np.array([[1, 2]], dtype="uint8")))
+aot_executor.set_input("b", tvm.nd.array(np.array([[3, 5]], dtype="uint8")))
+aot_executor.run()
+output = aot_executor.get_output(0)
+np.testing.assert_array_equal(output.asnumpy(), np.array([[4, 7]], dtype="uint8"))
+```
+
+[1] NOTE: The target string is not the final place this customization should be made. However, it
+is where runtime-related options have been placed so far. A separate RFC will split the Target
+string into Target options (which affect tuning) and runtime options.
+
+# **Reference-level explanation**
+
+The AotExecutorCodegen is already committed to TVM. This module produces a top-level TIR function
+which invokes the Relay operators (implemented in TIR) in the correct order. An example is given
+below:
+
+```bash
+PrimFunc([input1, input2, output]) attrs={"global_symbol": "tvmgen_my_mod_run_model", "runner_function": (bool)1} {
+  // attr [(nullptr)] device_id = 0
+  // attr [(nullptr)] device_type = 1
+  tir.tvm_call_packed("tvmgen_my_mod_fused_add", input1, input2, output)
+}
+```
+
+The AotExecutor then needs to accomplish the following to implement the Module-based Model Runtime
+Interface:
+
+1. Allocate input and output tensors as defined in the `run_model` function using the correct Device
+   API.
+2. Provide a mapping from relay parameter name to positional argument.
+3. Invoke the generated TIR function and provide profiling.
+
+### Compiler ↔ Runtime Metadata
+
+In order to implement (1) and (2) above, additional metadata about the `run_model` function needs to
+be communicated from Compiler to Runtime:
+
+- The mapping between Relay parameter name and TIR argument position
+- The number of inputs and outputs
+- The type of each parameter
+- Information sufficient to choose a Device API to allocate memory for that data.
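+
+As a rough illustration of why each of these items is needed, the sketch below models the metadata
+in plain Python. `ParamInfo`, `arg_index`, and `nbytes` are hypothetical names chosen for this
+example; they are not existing TVM APIs, and the real metadata layout is defined later in this RFC.
+
+```python
+from dataclasses import dataclass
+from math import prod
+from typing import List, Tuple
+
+
+@dataclass
+class ParamInfo:
+    """Hypothetical per-parameter record; not an existing TVM class."""
+    relay_name: str         # name used in the Relay program, e.g. "a"
+    shape: Tuple[int, ...]  # static shape, known at compile time
+    dtype_bits: int         # element width in bits, e.g. 8 for uint8
+
+
+def arg_index(params: List[ParamInfo], relay_name: str) -> int:
+    """Map a Relay parameter name to its position in the TIR argument list."""
+    for i, p in enumerate(params):
+        if p.relay_name == relay_name:
+            return i
+    raise KeyError(relay_name)
+
+
+def nbytes(p: ParamInfo) -> int:
+    """Number of bytes the runtime must allocate for this parameter."""
+    return prod(p.shape) * ((p.dtype_bits + 7) // 8)
+
+
+# Inputs first, then outputs, mirroring PrimFunc([input1, input2, output]).
+params = [
+    ParamInfo("a", (1, 2), 8),
+    ParamInfo("b", (1, 2), 8),
+    ParamInfo("output", (1, 2), 8),
+]
+```
+
+With records like these, `set_input("b", ...)` can resolve `"b"` to argument position 1, and the
+runtime knows how many bytes to request from the chosen Device API for each argument.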
+
+At present, Metadata is passed from Compiler to Runtime in several different ways:
+
+1. Constant DLTensor can be bundled with code and supplied to `runtime::Module` via
+   `runtime::MetadataModule`
+2. Many non-DSO-exportable backends (`cuda`, `hexagon`, `metal`, `opencl`, `sdaccel`, `rocm`,
+   `vulkan`) have adopted the convention of including a
+   [`runtime::FunctionInfo`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L106)
+   (NOTE: distinct from `tvm::relay::transform::FunctionInfo`) in their serialization:
+
+    ```cpp
+    /*! \brief function information needed by device */
+    struct FunctionInfo {
+      std::string name;
+      std::vector<DLDataType> arg_types;
+      std::vector<std::string> launch_param_tags;
+    };
+    ```
+
+3. AotExecutorCodegen and GraphExecutorCodegen have adopted the practice of producing the
+   graph-level
+   [`runtime::MetadataNode`](https://github.com/apache/tvm/blob/main/src/runtime/meta_data.h#L55):
+
+    ```cpp
+    /*!
+     * \brief Structure that can be optionally used by the executor codegen
+     */
+    class MetadataNode : public Object {
+     public:
+      /*! \brief input information for the main function */
+      Array<String> inputs;
+      /*! \brief number of outputs of the main function */
+      int num_outputs = 1;
+      /*! \brief the executor to be used to run the model */
+      String executor = kTvmExecutorGraph;
+
+      String mod_name = "";
+    };
+    ```
+
+4. The recent AOTExecutor implementation has created `tvm::relay::transform::FunctionInfo` which
+   communicates statistics about memory usage and I/O operation for each TIR operator and aggregate
+   statistics for the top-level AOT function:
+
+    ```cpp
+    struct FunctionInfoNode : public Object {
+      Map<Target, Integer> workspace_sizes;
+      Map<Target, Integer> io_sizes;
+      Map<Target, Integer> constant_sizes;
+      Map<Target, tir::PrimFunc> tir_primfuncs;
+      Map<Target, Function> relay_primfuncs;
+    };
+    ```
+
+
+Some duplication of information is already present. This is likely due in part to the existing
+middle-end compiler design, in which a separate `IRModule` is produced for each backend. Another
+factor may be that, since each `runtime::Module` is responsible for its own serialization and
+passing a `Node` across `PackedFunc` requires a cast, the lack of a centralized facility for
+`runtime::Module`s to obtain module-level Metadata has led backend authors to roll their own. This
+pattern makes it very difficult to assess the full scope of metadata handed to the runtime,
+particularly across all backends.
+
+Work is currently ongoing to unify the pre-codegen `IRModule` into a single instance. After this
+work is completed, it will be much easier to produce a centralized module-level Metadata. This RFC
+argues for the expansion of `runtime::MetadataNode` in the following ways:
+
+1. Rename `runtime::MetadataModule` to `runtime::ConstLoaderModule` to disambiguate the two and make
+   its purpose clearer.
+2. Expand `input_args` in the existing `runtime::Metadata` to parity with `runtime::FunctionInfo`,
+   plus include `_sizes` from `tvm::relay::transform::FunctionInfoNode` and the required `shape` and
+   `dtype` information from the beginning of this section.
+3. Introduce `ModelMetadataModule` to contain this information for use with the C++ runtime.
+
+    ```cpp
+    class ModelMetadataModule {
+     public:
+      virtual PackedFunc GetFunction(const std::string& name,
+                                     const ObjectPtr<Object>& sptr_to_self) {
+        if (name == "get_model_metadata") {
+           // Capture sptr_to_self so the module outlives the returned PackedFunc.
+           return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {
+              *rv = ModelMetadata(metadata_);
+           });
+        } else {
+          return PackedFunc();
+        }
+      }
+
+     private:
+      const struct ModelMetadata* metadata_;
+    };
+    ```
+
+4. Introduce an optional implementation for the C runtime.
+5. Export `runtime::Metadata` to Model Library Format.
+
+The new proposed definition of `runtime::Metadata` is as follows. NOTE that this is a C definition
+because it will be made available to both the C and C++ runtimes. A C++ wrapper will be written.
+
+```c
+struct ParameterInfo {
+  const char* relay_name_hint;
+  const char* tir_name_hint;
+  int64_t* shape;
+  int64_t ndim;
+  DLDataType dtype;
+  TargetDevice target_device;  // NOTE: future addition; not covered in this RFC.
+};
+
+struct FunctionInfo {
+  const char* function_name;
+  struct ParameterInfo* params;
+  int num_inputs;
+  int num_outputs;
+  int64_t workspace_size_bytes;
+  int64_t io_size_bytes;
+  int64_t constant_size_bytes;
+};
+
+struct Metadata {
+  int version;
+  struct FunctionInfo* functions;
+  const char* module_name;
+};
+```
+
+### Internal workings of AotExecutor (`--runtime=c++ --interface-api=packed`)
+
+Given the above, we can now sketch out the way AotExecutor should behave (for C++ runtime).
+
+Module initialization will:
+
+1. Load the `ModelMetadata` using the `get_model_metadata` PackedFunc.
+2. Allocate space for the parameters to `tvmgen_<model_name>_run_model`.
+3. Look up and load any linked parameters using the `--link-params` mechanism.
+
+- `set_input`, `get_input`, `get_output` all work as they do in `GraphExecutor`.
+- `run` assembles `TVMArgs` containing inputs + outputs and invokes `tvmgen_<model_name>_run_model`.
+- `time_evaluator` is implemented in the same way as it is in `GraphExecutor`. Timing `run_model` is
+  done using the CPU timer.
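+
+The flow described above can be sketched in plain Python. This is only a behavioural model under
+simplifying assumptions: `AotExecutorSketch` is a hypothetical class, buffers are Python lists
+rather than device memory, the metadata is passed to the constructor directly instead of being
+loaded through `get_model_metadata`, and `run_model` stands in for the generated
+`tvmgen_<model_name>_run_model` function.
+
+```python
+from typing import Callable, Dict, List, Sequence
+
+
+class AotExecutorSketch:
+    """Illustrative stand-in for the proposed AotExecutor; not TVM code."""
+
+    def __init__(self, input_names: Sequence[str], arg_sizes: Sequence[int],
+                 run_model: Callable[..., int]):
+        # Metadata gives the Relay name -> argument position mapping.
+        self._index: Dict[str, int] = {n: i for i, n in enumerate(input_names)}
+        self._num_inputs = len(input_names)
+        # Allocate one buffer per run_model argument: inputs first, then outputs.
+        self._args: List[List[int]] = [[0] * n for n in arg_sizes]
+        self._run_model = run_model
+
+    def set_input(self, name: str, values: Sequence[int]) -> None:
+        buf = self._args[self._index[name]]
+        if len(values) != len(buf):
+            raise ValueError(f"expected {len(buf)} elements for input {name!r}")
+        buf[:] = values  # copy into the pre-allocated buffer
+
+    def run(self) -> None:
+        # Assemble inputs + outputs as positional arguments and invoke the model.
+        if self._run_model(*self._args) != 0:
+            raise RuntimeError("run_model reported an error")
+
+    def get_output(self, index: int) -> List[int]:
+        return self._args[self._num_inputs + index]
+
+
+def run_model(a: List[int], b: List[int], out: List[int]) -> int:
+    """Plays the role of the generated function for an element-wise add."""
+    for i in range(len(out)):
+        out[i] = a[i] + b[i]
+    return 0
+
+
+executor = AotExecutorSketch(["a", "b"], [2, 2, 2], run_model)
+executor.set_input("a", [1, 2])
+executor.set_input("b", [3, 5])
+executor.run()
+```
+
+After `run()`, `executor.get_output(0)` holds the element-wise sum of the two inputs, written by
+`run_model` into the output buffer that the executor allocated up front.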
+
+### Internal workings of AotExecutor (`--runtime=c --interface-api=packed`)
+
+The C runtime version works in a very similar way with C accessor functions for the `ModelMetadata`.
+
+### No AotExecutor implementation planned (`--runtime=c --interface-api=c`)
+
+When `-interface-api=c` is present in the Target string, the `run_model` function no longer accepts
+the PackedFunc interface and instead accepts `arg_values` directly as positional args:
+
+```c
+TVM_DLL int32_t tvmgen_default_run_model(void* arg0, void* arg1, void* arg2) {
+  void* input = arg0;
+  void* input1 = arg1;
+  void* output = arg2;
+  (void)tvmgen_default_fused_multiply(input, input1, output);
+  return 0;
+}
+```
+
+Additional work is underway to wrap this in a firmware-friendly interface. A core design goal of
+this interface is to offload all memory management tasks to the calling code to facilitate
+integration with bare-metal embedded devices.
+
+Therefore, it would go against the goals of the C interface to introduce a generic runtime wrapper
+compatible with the PackedFunc calling convention. It may be possible to do so in the future, but
+such an implementation should be motivated by a rationale more closely tied to the embedded
+runtime setting.
+
+### Operator Calling Convention
+
+TVM uses three internal calling conventions:
+
+1. `call_packed` - the traditional calling convention used in the C++ runtime
+2. `call_cpacked` - similar to `call_packed`, but TVM presumes a symbol is linked into the binary
+   containing that function name (i.e. `TVMBackendGetFuncFromEnv` is not used to look up the
+   PackedFunc)
+3. `unpacked` - used with microTVM to avoid overhead of PackedFunc calls in statically-linked
+   binaries. See [AOT optimisations for Embedded Targets
+   RFC](https://discuss.tvm.apache.org/t/rfc-utvm-aot-optimisations-for-embedded-targets/9849).
+
+The AOT `run_func` can use a different calling convention externally (e.g. `--interface-api`) than
+that used internally with Implemented Operators (`--unpacked-args`). However, there are some
+circumstances under which not all choices can be used:
+
+- When targeting the C++ runtime: `call_packed` must be used when non-DSO-exportable modules exist;
+  otherwise `call_cpacked` may be used. `unpacked` may not be used with AOT Executor as the
+  interface has not settled.
+- When targeting the C runtime: any calling convention may be selected for either the interface API
+  or the operator calling convention. However, when using `--interface-api=c` (i.e. the `unpacked`
+  `run_func` calling convention), you must also use the `unpacked` calling convention with
+  Implemented Operators.
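+
+The constraints above can be expressed as a small validation routine. The sketch below is a
+hypothetical helper written for illustration (`check_calling_conventions` and its argument names
+are not part of TVM); it simply encodes the two bullets as stated.
+
+```python
+from typing import List
+
+
+def check_calling_conventions(runtime: str, interface_api: str, operator_cc: str,
+                              has_non_dso_modules: bool = False) -> List[str]:
+    """Return the violated constraints; an empty list means the combination is allowed.
+
+    runtime:       "c" or "c++"
+    interface_api: "packed" or "c"
+    operator_cc:   "call_packed", "call_cpacked", or "unpacked"
+    """
+    errors: List[str] = []
+    if runtime == "c++":
+        if operator_cc == "unpacked":
+            errors.append("unpacked operators are not supported with the C++ runtime")
+        elif has_non_dso_modules and operator_cc != "call_packed":
+            errors.append("non-DSO-exportable modules require call_packed")
+    elif runtime == "c":
+        if interface_api == "c" and operator_cc != "unpacked":
+            errors.append("interface-api=c requires unpacked operators")
+    else:
+        raise ValueError(f"unknown runtime: {runtime!r}")
+    return errors
+```
+
+For example, `check_calling_conventions("c++", "packed", "call_cpacked", has_non_dso_modules=True)`
+reports a violation, while the same call with `has_non_dso_modules=False` returns an empty list.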
+
+# **Drawbacks**
+
+Why should we *not* do this?
+
+- This requires quite a bit of rework of the Metadata-passing mechanism, with potential for breakage.
+- It also introduces yet another Executor to the runtime to maintain.
+- It may introduce additional constraints on the `<C-runtime, C-interface>` implementation, which
+  may make it more difficult to make progress on microTVM.
+
+# **Rationale and alternatives**
+
+This RFC doesn't address the question of "why add an AOT executor?" The RFC which added it in the
+first place is a better location to look for rationale to motivate that. In general, not following
+through with this RFC would relegate the AOT executor to a C-runtime-only component. There is
+significant interest in AOT from C++ runtime users, and maintaining compatibility with both
+increases the chances that AOT executor will support all TVM runtime features.
+
+The controversial questions this RFC addresses are as follows:
+
+### Should we maintain a unified approach to code-generating the AOT executor?
+
+An alternative approach could introduce an additional e.g. `aot_cpp_executor_codegen.cc` and create
+a third pathway (in the Graph/AOT build flow). Doing this allows us to implement runtime-specific
+compiler primitives, which may simplify both pipelines. However, soon those pipelines will grow more
+complicated as features are added to leverage AOT, such as Unified Static Memory Planning. The
+burden of double-maintenance of those features outweighs the advantage of a simplified
+implementation. A unified approach also makes it easier for newcomers to understand the compiler.
+
+### Should we attempt to unify the Metadata?
+
+Metadata could be left in the scattered form it is now. It may be that the implementation of this
+RFC prioritizes expansion of `ModelMetadata` over propagating it to the various non-DSO-exportable
+`runtime::Module`. Ultimately though, maintaining separate function-level metadata adds confusion
+and code bloat. It also makes it harder to reason about the compiler as a whole. For these reasons,
+this RFC advocates for centralizing the Metadata.
+
+# **Prior art**
+
+There is no known prior art of a C++-runtime-compatible AOT implementation.
+
+# **Unresolved questions**
+
+- Who will we break if we unify Model metadata?
+- Will this play nicely with the VM compilation flow when it is unified?
+- How will TargetDevice come into play here?
+
+# **Future possibilities**
+
+Not covered in this RFC, but particularly useful with the C++ runtime, is heterogeneous execution. In
+the present PoC, AotExecutor will CHECK-fail if a non-cpu device is given. A future implementation
+will annotate the parameters with one of:
+
+- A `device_type` — in which case mapping from `device_type` to `tvm::Device` will be done in the
+  same way as the `GraphExecutor`
+- A `target_device` — in which case a new mapping will be defined
+
+Aside from that, the larger unresolved question that makes it difficult to add heterogeneous execution is:
+
+- How should AOT codegen invoke the Device API?
+
+Before this question can be answered, some progress needs to be made on the [C device
+API](https://discuss.tvm.apache.org/t/pre-rfc-c-device-api/10874) and we need to define TIR
+bindings.