
Add Target Pre-processing RFC #71

Merged 4 commits into apache:main on Aug 2, 2022
Conversation

@Mousius (Member) commented May 11, 2022

No description provided.

@tqchen (Member) commented May 11, 2022

Thanks @Mousius. Given these fields are pretty relevant to compiler configurations in the traditional domain, it would be nice to also discuss prior approaches (e.g. where those fields normally sit in, say, LLVM) for posterity. This would also help us make meaningful choices that align with existing terminologies. My quick read is that they seem to be aligned, but it would be nice to double check.

## Architecture Pre-processing
```c++
Target("c")
.set_arch_preprocessor(MyArchPreprocessor)
```


If arch preprocessor is attached to Target, then this precludes serializing targets into a string. Maybe we should just allow subclassing of targets?

Also, here the target is already created, so it should be something like apply_arch_preprocessor...

@Mousius (Member Author)

Embarrassingly, you have found a typo: this should be TVM_REGISTER_TARGET_KIND, which should make it a lot clearer how this is meant to work 😸

When a Target is parsed, the preprocessor will re-process the arguments, and therefore reproducing the Target from a string should result in a new Target with the pre-processed information.
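
A minimal sketch of the registration this refers to, assuming the `set_arch_preprocessor` name from the RFC excerpt above; the kind and preprocessor names are illustrative:

```c++
// Sketch only: the preprocessor is attached to the TargetKind (not to an
// already-created Target), so it runs whenever a Target of this kind is parsed.
TVM_REGISTER_TARGET_KIND("c", kDLCPU)
    .set_arch_preprocessor(MyArchPreprocessor);
```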

Contributor

Just to make sure I understand, does this mean that only the first round trip of str -> Target -> str is allowed to introduce changes? That is, round trips of Target -> str -> Target will reproduce the original Target, and round trips of str -> Target -> str are idempotent?

@Mousius (Member Author)

str -> Target -> str continues to function as it does today, where the AttrPreprocessor may introduce changes in attributes; the KeysPreprocessor would also behave this way. Theoretically, if the logic doesn't flip-flop between values, the first conversion from str -> Target creates all the values, and you're correct that future transformations should be idempotent (I can't see a reason the logic would flip-flop, but it is possible).

The ArchPreprocessor does not hold this property, as the internal arch property is repopulated from attrs on instantiation.
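
For illustration, a hedged sketch of the round-trip behaviour being discussed; the method names are assumed from the existing Target API and this is not a committed guarantee:

```c++
// The preprocessor runs on parse, so re-parsing the exported string should
// converge after the first trip if the preprocessing logic is deterministic.
Target first("c -mcpu=cortex-m4");    // str -> Target, preprocessor runs here
std::string exported = first->str();  // Target -> str
Target second(exported);              // str -> Target, preprocessor runs again
// `second` should match `first`, and further round trips should be idempotent;
// the arch information itself is recomputed from attrs on each instantiation.
```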

```c++
class TargetKind {
  ...
  FTVMArchPreprocessor arch_preprocessor;
```
@kparzysz-quic commented May 12, 2022

If it's a part of TargetKind, then you cannot customize it per specific architecture. For example, both x86-64 and aarch64 have the same kind: "llvm", but they will need different preprocessors.

@Mousius (Member Author)

I believe we'll end up with something similar to the example in clang (https://github.com/llvm/llvm-project/blob/2f04e703bff3d9858f53225fa7c780b240c3e247/clang/lib/Driver/ToolChains/Clang.cpp#L331-L343) where the actual parser can take different routes dependent on more specific hints.

@kparzysz-quic commented May 16, 2022

Clang driver takes flags given by the user and translates them into specific "subtarget" configuration. There is a flag -target <triple>, which roughly corresponds to a TargetMachine, and then there are other details (implicit or explicit) that eventually create TargetSubtargetInfo. That part of clang allows each target to specify the exact features, having already established what that target is. This is similar, for example, to how target kind "rocm" has attributes specific to the ROCm family, the key here being that we know we're dealing with ROCm, and not with any generic GPU.

The target kind "llvm", on the other hand, does correspond to any generic CPU, be it x86 or Arm. Flags specific to x86 may not be applicable to Arm, and vice versa. If we attach every arch preprocessor to kind "llvm", then whatever code executes them will have to know which preprocessor to apply, and currently there is no information that would aid with this selection.

This goes beyond the scope of this RFC, but my vote would be to eliminate "llvm" as a target kind, and replace it with specific kinds of CPUs. We could have target "host" that would correspond to the compilation host (and can be automatically translated to an appropriate specific CPU kind).

@Mousius (Member Author)

> Clang driver takes flags given by the user and translates them into specific "subtarget" configuration. There is a flag -target <triple>, which roughly corresponds to a TargetMachine, and then there are other details (implicit or explicit) that eventually create TargetSubtargetInfo.

Apologies, I oversimplified; you're right that Clang/LLVM has a few more data structures to describe this (including SubTargets). In this RFC we're essentially using Map<String, ObjectRef> as a replacement for most of the SubTarget in the .arch property. I can add further details to the Clang/LLVM Prior Art to reflect this 😸

> The target kind "llvm", on the other hand, does correspond to any generic CPU, be it x86 or Arm. Flags specific to x86 may not be applicable to Arm, and vice versa. If we attach every arch preprocessor to kind "llvm", then whatever code executes them will have to know which preprocessor to apply, and currently there is no information that would aid with this selection.

Sorry, I'm not quite following, so I'll provide an example 😸. In the case of LLVM we'd attach a pre-processor which can inspect the attributes and do something similar to:

```c++
Map<String, ObjectRef> llvm_preprocessor(...attrs...) {
  bool is_aarch64 = ...logic...;
  if (is_aarch64) {
    return aarch64_preprocessor(...attrs...);
  }
}
```

Therefore, whenever you instantiate Target("llvm -march=aarch64-woof-woof-woof"), the preprocessor would give you a target.arch Map which has been pre-processed using the aarch64 processor, similar to the snippet I pasted from clang with the switch logic. This does mean we have to create a preprocessor for both the llvm and c Targets to map mcpu, march, mtune or mtriple to the relevant logic path, in the same way that LLVM suggests processing march and mcpu for SubTargetInfo; this preprocessor would have the freedom to generate any eventual Map<String, ObjectRef> for the specific details of the architecture being targeted by the llvm Target of TVM.

> This goes beyond the scope of this RFC, but my vote would be to eliminate "llvm" as a target kind, and replace it with specific kinds of CPUs. We could have target "host" that would correspond to the compilation host (and can be automatically translated to an appropriate specific CPU kind).

Target is definitely a difficult problem to unpick, so I'm glad you agree it's out of scope 😸. Personally I would avoid magic Targets such as "composite" and "host" in favour of better defining them; llvm is maybe just a code generator for however we describe a CPU Target? Potentially worth starting an RFC to get the discussion going and then discussing it in the community meetup?


Looking at your example: set_arch_preprocessor(MyArchPreprocessor), what arch would this apply to? If I need to add code to the llvm_preprocessor above to recognize MyArch, and then call MyArchPreprocessor, then there is no point in having the set_arch_preprocessor in the first place.

@Mousius (Member Author)

My initial thought was that MyArchPreprocessor would be the CPU preprocessor which invokes other preprocessors when it decides they're appropriate. So you'd have:

```c++
Map<String, ObjectRef> AArch64PreProcessor(...attrs...) {
  return ...things...;
}

Map<String, ObjectRef> CPUPreprocessor(...attrs...) {
  bool is_aarch64 = ...logic...;
  if (is_aarch64) {
    return AArch64PreProcessor(...attrs...);
  }
}

TVM_REGISTER_TARGET_KIND("llvm", kDLCPU)
    .set_arch_preprocessor(CPUPreprocessor);
```

Which then means when you invoke:

```c++
my_target = Target("llvm -mtriple=aarch64-woof-woof");
```

You can inspect it as:

```c++
my_target.arch.is_aarch64  // True
```

This then allows us to add multiple preprocessors to the llvm Target via the single CPU preprocessor entrypoint.

@kparzysz-quic commented May 16, 2022

This would be equivalent to adding arch as an attribute, and setting it using the attribute preprocessor, wouldn't it? The difference is that with your proposal we don't access arch as an attribute, but as a member.

Edit: Actually attrs is a (public) member in Target; it's just handled by helper functions.


> my_target.arch.is_aarch64 // True

I think it would be my_target.arch["is_aarch64"].

@Mousius (Member Author)

It'll come out as my_target.arch.is_aarch64 due to using __getattr__ here to prettify the attributes:
https://github.com/apache/tvm-rfcs/pull/71/files#diff-c04c196e8069297e1fcfdd5a514a268b971943b265649eaa41edd8d47052d56aR142-R150

@kparzysz-quic

I think it's a good idea overall. I'm in favor of adding more internal flexibility to targets.

@Mousius (Member Author) commented May 16, 2022

> Thanks @Mousius. Given these fields are pretty relevant to compiler configurations in the traditional domain, it would be nice to also discuss prior approaches (e.g. where those fields normally sit in, say, LLVM) for posterity. This would also help us make meaningful choices that align with existing terminologies. My quick read is that they seem to be aligned, but it would be nice to double check.

Good idea @tqchen. One thing I've picked out from a second read of the LLVM approach is the term Features rather than Arch, which we could potentially align on. I've updated the Prior Art section to link out to what I've seen of the flow in LLVM, though I am by no means an LLVM expert.

@areusch (Contributor) commented May 16, 2022

I'm supportive of this idea. I didn't want to derail this RFC by discussing @kparzysz-quic's idea about the architecture of the llvm target here, but I think that would be a good follow-up discussion.

@kparzysz-quic left a comment

I think this is generally ok. I suggest elaborating a bit more in the text of the RFC that the preprocessors apply to the target kind, and that for all architectures, the code specific to each architecture would need to be handled as a part of the common preprocessor. The use of names like MyArchPreprocessor and MyKeysPreprocessor suggests that there can be multiple specialized preprocessors. I'd replace them with CArchPreprocessor and CKeysPreprocessor, since they both show up in the c target kind.

Comment on lines 85 to 88:

> These implementations can be stored under `src/target/preprocessors/<arch_identifier>.{cc.h}` to allow them to be composed together such as:
>
> * src/target/preprocessors/aarch64.cc
> * src/target/preprocessors/cpu.cc


I'd get rid of this section---it sounds like a recommendation.

@Mousius (Member Author)

hmm, do we not want to try and ensure these are consistently organised in the codebase?

@Mousius (Member Author) commented May 18, 2022

> I think this is generally ok. I suggest elaborating a bit more in the text of the RFC that the preprocessors apply to the target kind, and that for all architectures, the code specific to each architecture would need to be handled as a part of the common preprocessor. The use of names like MyArchPreprocessor and MyKeysPreprocessor suggests that there can be multiple specialized preprocessors. I'd replace them with CArchPreprocessor and CKeysPreprocessor, since they both show up in the c target kind.

Ideally I'd like to convey the ability to attach it to any TargetKind rather than implying it's designed for a specific TargetKind; maybe CPUArchPreprocessor, which can be re-used between llvm and c and calls out to AArch64ArchPreprocessor or x86ArchPreprocessor? You can use a single-level pre-processor if that's appropriate (i.e. for rocm), which would just be ROCmArchPreprocessor. I can outline both cases if that'd help?
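
To outline both cases, a hypothetical sketch; the preprocessor names are placeholders rather than settled API:

```c++
// Composed pre-processor shared between CPU-style kinds: the CPU-level entry
// point decides internally whether to call AArch64- or x86-specific logic.
TVM_REGISTER_TARGET_KIND("llvm", kDLCPU)
    .set_arch_preprocessor(CPUArchPreprocessor);
TVM_REGISTER_TARGET_KIND("c", kDLCPU)
    .set_arch_preprocessor(CPUArchPreprocessor);

// Single-level pre-processor for a kind that needs no further dispatch.
TVM_REGISTER_TARGET_KIND("rocm", kDLROCM)
    .set_arch_preprocessor(ROCmArchPreprocessor);
```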

@Lunderberg (Contributor) left a comment

I like it overall, though I do have one potential concern: By making it easier to query the architecture compared to cross-architecture features, will developers more often use architecture-specific checks that unnecessarily limit TVM features to specific architectures?

Expanding, I can see three primary cases for querying the target:

  1. To select a schedule that has been hand-tuned for a specific architecture.
  2. To check if a schedule or optimization is capable of running on a specific target. (e.g. the schedule stays within the maximum number of registers)
  3. To select an implementation during lowering (e.g. only use a warp-level reduction if the target has a sufficient warp size)

Of these cases, the 1st case should compare against the architecture itself. The 2nd and 3rd cases should compare against specific features that are provided by that architecture, but may also be provided by other architectures.

While I don't think this is a major issue, I can see it becoming an issue over time if architecture checks are done for convenience instead of feature checks.

- Feature Name: target-architecture-preprocessor
- Start Date: 2022-04-04
- RFC PR: [apache/tvm-rfcs#0070](https://github.com/apache/tvm-rfcs/pull/0000)
Contributor

Nitpick: The RFC link doesn't point to the discussion PR. #70 points to the DeclBuffer RFC.

@Mousius (Member Author)

😿

```c++
using TargetKeys = Array<String>;

using FTVMAttrPreprocessor = runtime::TypedPackedFunc<TargetAttrs(TargetAttrs)>;
using FTVMArchPreprocessor = runtime::TypedPackedFunc<TargetArch(TargetAttrs)>;
```
Contributor

This function signature wouldn't allow the TargetAttrs to be modified as part of the FTVMArchPreprocessor. If we're recognizing the architecture based on a specified "arch" or "mcpu" attribute, I think we should also modify the attributes to no longer contain the "arch" or "mcpu" attributes. Otherwise, we have the same information in two different locations, which allows discrepancies between the two representations.

@Mousius (Member Author)

Given the breadth of TVM, I don't believe we can adequately clamp this to a specific set of input attributes?

Also, I was trying not to mutate the input attrs in favour of providing a new augmented object, so when you specify my_target = Target("c -mcpu=woofles") you can reliably expect my_target.mcpu to still provide the expected answer. Once processed, .arch would give you a more detailed set of information derived from the combination of input attrs rather than attempting to replace them?
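
As a hedged illustration of that split (the feature name below is made up for the example):

```c++
// The supplied attribute is left untouched; .arch only adds derived detail.
Target my_target("c -mcpu=woofles");
my_target->GetAttr<String>("mcpu");               // still "woofles"
my_target->GetArch<Bool>("some_feature", false);  // derived by the pre-processor
```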

```c++
Target my_target("c -mcpu=cortex-m4");
my_target->GetArch<Bool>("is_aarch64", false); // false
my_target->GetArch<Bool>("has_dsp", false); // true
```
Contributor

  • What should determine whether a parameter is defined in attrs or arch?

  • How can a Target override the values used for an architecture?

    • With the current design, it looks like this would be impossible; because only the FTVMKeysPreprocessor takes the arch as input, it looks like the answer is no.

    • I think this would be a useful feature in order to selectively disable features for debugging and performance comparison. Alternatively, we'd get this same benefit if GetAttr is allowed to delegate to GetArch for undefined values, and GetArch is only used internally.

@Mousius (Member Author)

The main distinction is that the architecture features are inferred from the Target attrs; if we want users to pass them in then they would need to be in attrs, as you're correct that there's no way of modifying arch from a user's point of view.

> I think this would be a useful feature in order to selectively disable features for debugging and performance comparison. Alternatively, we'd get this same benefit if GetAttr is allowed to delegate to GetArch for undefined values, and GetArch is only used internally.

I referenced this a bit here:
https://github.com/apache/tvm-rfcs/pull/71/files#diff-c04c196e8069297e1fcfdd5a514a268b971943b265649eaa41edd8d47052d56aR174-R182

I'd rather keep Target attrs separated out from arch so as not to confuse the Target API by having too many things accessible as direct properties of it?


```c++
using FTVMAttrPreprocessor = runtime::TypedPackedFunc<TargetAttrs(TargetAttrs)>;
using FTVMArchPreprocessor = runtime::TypedPackedFunc<TargetArch(TargetAttrs)>;
using FTVMKeysPreprocessor = runtime::TypedPackedFunc<TargetKeys(TargetAttrs, TargetKeys)>;
```
Contributor

It looks like the keys preprocessor is primarily used to define properties based on the determined architecture, and it has significant overlap with the target tags. When should this be used instead of tags?

@Mousius (Member Author)

Hmm, I don't see the overlap, taking the example of:

```c++
TVM_REGISTER_TARGET_TAG("raspberry-pi/4b-aarch64")
    .set_config({{"kind", String("llvm")},
                 {"mtriple", String("aarch64-linux-gnu")},
                 {"mcpu", String("cortex-a72")},
                 {"mattr", Array<String>{"+neon"}},
                 {"num-cores", Integer(4)},
                 {"host", Map<String, ObjectRef>{{"kind", String("llvm")},
                                                 {"mtriple", String("aarch64-linux-gnu")},
                                                 {"mcpu", String("cortex-a72")},
                                                 {"mattr", Array<String>{"+neon"}},
                                                 {"num-cores", Integer(4)}}}});
```

These are pre-configured Targets with various mtriple, mcpu and mattr attributes already set - once pre-processed these can produce a set of architecture features for subsequent steps, such as replacing this check in the operator strategy:

https://github.com/apache/tvm/blob/f88a10fb00419c51a116a63f931a98d8286b23de/python/tvm/relay/op/strategy/arm_cpu.py#L232-L245

Other tagged Targets will likely have the same mattr and mcpu, thus rather than trying to hand craft the permutations each time, the pre-processor will 😸
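
For example, a hedged sketch of what that replacement could look like once the tag above has been pre-processed; the accessor follows the RFC's GetArch example, while the exact feature names are illustrative:

```c++
// Strategy-style code reads the derived architecture information directly,
// instead of re-parsing mcpu/mattr by hand for every tagged Target.
Target target("raspberry-pi/4b-aarch64");
target->GetArch<Bool>("is_aarch64", false);  // expected: true for this tag
target->GetArch<Bool>("has_neon", false);    // expected: true, given mattr has "+neon"
```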

Contributor

It's also entirely possible that I'm misunderstanding. I had been seeing both features as a way to define a bag of properties associated with some short key. For the tags the short key is the name of the tag, and for the new preprocessor the short key is the architecture name.

@Mousius (Member Author) commented May 18, 2022

> I like it overall, though I do have one potential concern: By making it easier to query the architecture compared to cross-architecture features, will developers more often use architecture-specific checks that unnecessarily limit TVM features to specific architectures?

Unfortunately, this RFC neither improves nor worsens the situation you describe; I think the only way of really controlling how the information is used is to limit the information visible to these functions. Target.current() is global and you have access to it all right now (see: https://github.com/apache/tvm/blob/main/python/tvm/topi/arm_cpu/conv2d_int8.py#L174-L177). Ideally people would move to using .arch.feature, or we would only make the .arch property visible to the internals (rather than the full Target), as you suggested, but that compiler infrastructure change would be out of scope for this RFC.

Also, .arch may be misleading here, as it's a description of the architecture, which means it'll include features such as .arch.has_dsp, enabling 2 and 3 based on the features available within the architecture. To further motivate this, the preprocessor can produce a register_count value on .arch if a specific CPU/attr/triple/etc is seen, and that could be accessed via my_target.arch.register_count.

Based on my re-review of LLVM (thanks to @tqchen), it might help to use my_target.features.dsp rather than my_target.arch.has_dsp, clarifying that these are features available to the Target? What do you think?
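
If we went that way, the C++ side might hypothetically look like the following; neither the accessor name nor the feature keys are settled by this RFC:

```c++
// Hypothetical renamed accessor mirroring the GetArch example earlier in the
// thread; "dsp" and "register_count" are illustrative feature names only.
Target my_target("c -mcpu=cortex-m4");
my_target->GetFeature<Bool>("dsp", false);
my_target->GetFeature<Integer>("register_count", Integer(0));
```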

@Lunderberg (Contributor)

> Based on my re-review of LLVM (thanks to @tqchen), it might help to use my_target.features.dsp rather than my_target.arch.has_dsp, clarifying that these are features available to the Target? What do you think?

I like that, and the renaming makes it clear which are boolean parameters and which are variable attributes. That would also be useful for cleaning up the Vulkan target, which right now has a large number of boolean attributes that would be better expressed as feature support.

The renaming would also help with avoiding architecture-specific checks at the code review stage. Since the information defined by the architecture would be pulled over into my_target.features, checks that directly access my_target.arch should only occur for case (1).

@junrushao (Member)

Thanks @Mousius for drafting this RFC!

First of all, I completely agree on the importance of handling arch-specific checks. To use our own experience as an example, on CUDA we might want to check whether the PTX intrinsic cp.async.commit_group is available on a certain architecture before tensorizing using that intrinsic, and the existing approach is to add extra flags in TargetKind.

To motivate discussion, I would love to provide more context on the system design, and then talk about some specific points I noticed in the current RFC, and finally propose some ideas inspired by this RFC.

Background

Design principles. I just wanted to share some of my design principles when developing TVM, a compiler that I believe aims to work across hardware models:

  • A1. Generic. By design, TVM serves the common interest of all vendors. Therefore, it would be great if we could discuss broadly how a design choice would benefit all hardware platforms, including ARM, NV, Intel, Qual, etc.
  • A2. Minimal. Ideally, practitioners are supposed to learn only the bare minimum to participate in TVM development.
    • Take TIR as an example: for a developer who knows the IR and passes, the only extra concept needed to use and develop TIR is Target.
    • Indeed Relay is in a less ideal state, but Relax is designed to address this issue so I wouldn't worry too much.
  • A3. Customizable. We want to develop the infrastructure to provide customization opportunities for all vendors to independently work on their extensions in a collaborative way, making TVM a platform for everyone. A few examples:
    • TIR scheduling is heading in this direction: all the schedule primitives are decoupled from each other, so that vendors can develop their own primitives without affecting each other.
    • TVMScript is going to be re-designed this way, treating TIR and Relax as independent dialects while the core infra itself supports any 3rd-party IR.
    • Admittedly Relay is much more limited, with the assumption of the Relay => TE => TIR lowering path and some hooks to hack in other compilation paths, but Relax is going to change this, so again I wouldn't worry...

Current arch-specific checks. Most of the 6 groups of arch-specific helper functions mentioned in the "Motivation" section were developed by concurrent efforts from multiple parties, and therefore I would say fragmentation is almost certain to happen. On the other hand, the fragmentation of those helper functions, which are indeed ad hoc, is currently confined locally as independent checks without yet polluting the design of the global infrastructure.

Special target attributes. Below are a few existing target attributes that carry special semantics, the design of which I don't fully agree with, but which are now preserved as legacy:

  • keys. This is used to guide TOPI dispatch. For example, "llvm --keys=arm_cpu,cpu" first finds if there is any GenericFunc registered with arm_cpu, and if not, it falls back to cpu.
  • libs. This is used in TOPI to dispatch to vendor libraries. For example, "cuda -libs=cudnn" prefers dispatching to cuDNN.
  • device and model. These two sometimes control the behavior of auto-tuning.

Existing arch-like attributes. Note that the design of Target attributes in TargetKind is intended to describe the capability of hardware, according to the Target RFC:

> Pass in additional target-specific attributes like ISA/library extension to a target.

Therefore, currently all the arch-like flags are centered around Target. As an example:

  • The Vulkan target comprehensively includes hardware feature support (e.g. whether or not fp16 is supported) and physical limits of the device (e.g. the max number of threads allowed).
  • The CUDA target is defined in a similar way but is less comprehensive so far. It doesn't require architectural change to grow its attributes.

Note that the number of attributes may grow if there are new hardware features, but to the best of my knowledge it is less common for those hardware features to bloat, mainly because there is a concrete cost to growing features in hardware.

Target tag system. Given the fact that existing hardware models are enumerable, the Target RFC proposes to use "tags" to allow easy creation of targets. For example, Target("nvidia/jetson-agx-xavier") gives full specification of this device, including the cuda target and the ARM host. At the time of writing, it only takes 200 tags to describe all the CUDA hardware.
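
As a small illustration of the tag system described here (the attributes behind the tag are registered in TVM's tag table and elided):

```c++
// One plain-string tag expands into a fully-specified device target plus host:
// for this tag, a cuda target whose host is the board's ARM CPU.
Target jetson("nvidia/jetson-agx-xavier");
```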

What on earth are Targets? Actually, a target in TVM refers not only to the hardware but also to the codegen target. For example, an LLVM target means TVM codegens to LLVM and lets LLVM do the rest of the compilation, while a CUDA target codegens to CUDA source code (i.e. *.cu files) and invokes NVCC for compilation.

Discussion

Naming choice. When implementing the Target RFC, I came up with the preprocessor to be backward compatible with existing functionality that auto-detects the local environment (e.g. CUDA version), which I have to admit is probably not the best name. In the context of this RFC, we might want to use a name like target_parser instead to be more specific.

Where to dispatch target parsers. Currently, the preprocessor is dispatched solely based on TargetKind, which is admittedly a bit limited and overlooks the possibility that aarch64 and x86 may need completely different parsers. Therefore, we would love to raise a question here: based on which attribute should the parser be dispatched? Previous wisdom in clang seems to suggest dispatching according to mtriple (IIUC), so shall we introduce anything similar, for example, dispatching based on --device aarch64-foo-bar?

Do we need multiple target parsers? My answer is no. If the parsers are maintained by different vendors separately in their own interest, then they can decide how to implement the parsing for "keys" and "arch" together, without conflicting with other contributors. Therefore, I would say this is already consistent with principle A3 without having to develop multiple parsers.

Function signature for a target parser. Note that with the introduction of the previous Target RFC, a target object is canonically represented by a JSON-like object. Therefore, the signature could be:

```c++
using TargetJSON = Map<String, ObjectRef>;
using FTVMTargetParser = TypedPackedFunc<TargetJSON(TargetJSON)>;
```

so that it's generic enough and could potentially be useful if vendors in the future want to hack around the Target object. To clarify the discussion here, I would propose that our parsing code path not go back to string again, i.e.:

  • str -> TargetJSON (canonical form) -> TargetJSON (target parser applied) -> Target
  • TargetJSON (canonical form) -> TargetJSON (target parser applied) -> Target
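
A minimal sketch of a parser with this signature, assuming it operates on the canonical JSON form and that the usual tvm::runtime container types are in scope; the key handling is illustrative only:

```c++
// Illustrative only: read the canonical JSON, derive extra information,
// and return the augmented JSON for the Target to be built from.
using TargetJSON = Map<String, ObjectRef>;

TargetJSON ExampleTargetParser(TargetJSON target) {
  Array<String> keys;
  if (target.count("keys")) {
    keys = Downcast<Array<String>>(target.at("keys"));
  }
  keys.push_back("cpu");  // e.g. ensure a dispatch key derived from the attrs
  target.Set("keys", keys);
  return target;
}
```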

Distinguishing arch and attrs. Note that the number of possible attributes grows less frequently than we might expect. Also, there is no fundamental difference between arch and attrs given that arch is just special attrs, and target is designed for this according to point C5 in the Target RFC. Therefore, I would personally prefer a more flattened access pattern, in which case we do not need to distinguish arch and attrs.

Folder structure. According to principle A3, I would propose that different vendors may maintain their own TargetKind and parsers separately, for example, aarch64 could be maintained using:

  • src/target/target/aarch64/parser.cc
  • src/target/target/aarch64/tags.cc
  • src/target/target/aarch64/kind.cc
  • src/target/target/aarch64/helpers.cc

where the last item provides the pre-defined packed functions mentioned in the "Motivation" section.

@Mousius (Member Author) commented May 20, 2022

Hi @junrushao1994, thanks for the elaborate reply 😸 I don't want to debate our personal principles but I appreciate you sharing them and will reference them where I can.

> Current arch-specific checks. Most of the 6 groups of arch-specific helper functions mentioned in the "Motivation" section were developed by concurrent efforts from multiple parties, and therefore I would say fragmentation is almost certain to happen. On the other hand, the fragmentation of those helper functions, which are indeed ad hoc, is currently confined locally as independent checks without yet polluting the design of the global infrastructure.

Yip, this fragmentation occurs due to the lack of a standardised mechanism for these groups to use, which this RFC aims to provide. I'm not sure why we consider that pollution, given it should have a positive impact on all groups and aids in providing customisation through a well-defined route established by all vendors, providing simplicity and customisation (A1, A2 and A3).

> Existing arch-like attributes. Note that the design of Target attributes in TargetKind is intended to describe the capability of hardware, according to the Target RFC:

I believe this still holds given the crossover between hardware and code generators; for c/llvm you will still be able to pass mattr/mcpu/etc, which are the standard way to specify CPU, ISA and extensions in existing compilers (A1). This also follows the TVM guideline of "Be consistent with existing well-known package's APIs if the features overlap. For example, tensor operation APIs should always be consistent with the numpy API.".

> Target tag system. Given the fact that existing hardware models are enumerable, the Target RFC proposes to use "tags" to allow easy creation of targets. For example, Target("nvidia/jetson-agx-xavier") gives the full specification of this device, including the cuda target and the ARM host. At the time of writing, it only takes 200 tags to describe all the CUDA hardware.

> What on earth are Targets? Actually, a target in TVM refers not only to the hardware but also to the codegen target. For example, an LLVM target means TVM codegens to LLVM and lets LLVM do the rest of the compilation, while a CUDA target codegens to CUDA source code (i.e. *.cu files) and invokes NVCC for compilation.

There's definitely an existing amount of confusion in the Target system, but I think it's even more confusing than this when looking at the tagged entries, such as:

```c++
TVM_REGISTER_TARGET_TAG("raspberry-pi/4b-aarch64")
    .set_config({{"kind", String("llvm")},
                 {"mtriple", String("aarch64-linux-gnu")},
                 {"mcpu", String("cortex-a72")},
                 {"mattr", Array<String>{"+neon"}},
                 {"num-cores", Integer(4)},
                 {"host", Map<String, ObjectRef>{{"kind", String("llvm")},
                                                 {"mtriple", String("aarch64-linux-gnu")},
                                                 {"mcpu", String("cortex-a72")},
                                                 {"mattr", Array<String>{"+neon"}},
                                                 {"num-cores", Integer(4)}}}});
```

This defines a system configuration that then uses the llvm code generator with a specific CPU profile; therefore the Target system can represent at minimum 3 distinct layers: systems, hardware and code generators. Given the principle of having to understand a minimal amount (A2), Target needs to be streamlined into understandable parts. This is a digression from the actual RFC, though, as the features are pre-processed from the tagged Target's information.

The problem I've faced trying to figure out how to implement this is that llvm takes mcpu/mattr/etc, which are used to infer features later, whereas the cuda Target has a more complete reflection of the attrs available for CUDA. These inconsistencies make it difficult for a user to approach TVM, and they don't keep what a developer has to learn to the bare minimum (A2). It also diverges from other compilers, where you'd expect mcpu/mtriple/etc to infer much of this information for you (A1).

> Do we need multiple target parsers? My answer is no. If the parsers are maintained by different vendors separately in their own interest, then they can decide how to implement the parsing for "keys" and "arch" together, without conflicting with other contributors. Therefore, I would say this is already consistent with principle A3 without having to develop multiple parsers.

> Naming choice. When implementing the Target RFC, I came up with the preprocessor to be backward compatible with existing functionality that auto-detects the local environment (e.g. CUDA version), which I have to admit is probably not the best name. In the context of this RFC, we might want to use a name like target_parser instead to be more specific.

I don't mind using one parser; I was just making it more granular and avoiding stepping on people's toes with the existing pre-processor. Either design seems to fulfil your principle of customisability (A3); more granularity means you only have to implement a small piece of customisation, whereas the single parser requires some thought as to how to cater for all attributes within a given TargetKind parser.

> Where to dispatch target parsers. Currently, the preprocessor is dispatched solely based on TargetKind, which is admittedly a bit limited and overlooks the possibility that aarch64 and x86 may need completely different parsers. Therefore, we would love to raise a question here: based on which attribute should the parser be dispatched? Previous wisdom in clang seems to suggest dispatching according to mtriple (IIUC), so shall we introduce anything similar, for example, dispatching based on --device aarch64-foo-bar?

This requires developers to learn more than the bare minimum (A2), as users are required to learn the specifics of how to specify a Target in TVM. If an llvm Target supports -mcpu/-mattr/-mtriple then all of these fields can be reasonably expected to allow the Target to infer features given the relation to GCC/LLVM (A1).

> Distinguishing arch and attrs. Note that the number of possible attributes grows less frequently than we might expect. Also, there is no fundamental difference between arch and attrs given that arch is just special attrs, and target is designed for this according to point C5 in the Target RFC. Therefore, I would personally prefer a more flattened access pattern, in which case we do not need to distinguish arch and attrs.

Based on LLVM, the Features inferred from a given CPU/GPU profile can already be extensive (see, for example, the AArch64 feature list in LLVM), which would lead to a drastic increase in attrs if we were to track them this way.

If we were to attach all of these directly to Target (i.e. llvm), that would drastically increase the number of available fields, and in all cases only a subset would be used, specific to a given CPU/GPU profile. Therefore I would strongly advise following the pattern (as in LLVM) of using a separate structure for this; clear boundaries provide a much more straightforward developer experience (A2) and allow us to do things as @Lunderberg suggested, like plucking just features or arch to give to Passes rather than the entire Target.

> Folder structure. According to principle A3, I would propose that different vendors may maintain their own TargetKind and parsers separately, for example, aarch64 could be maintained using:
>
>   • src/target/target/aarch64/parser.cc
>   • src/target/target/aarch64/tags.cc
>   • src/target/target/aarch64/kind.cc
>   • src/target/target/aarch64/helpers.cc
>
> where the last item provides the pre-defined packed functions mentioned in the "Motivation" section.

I believe the proposed layout also conforms to A3 as it already details two different pre-processors, and you can easily imagine it extending:

  • src/target/preprocessors/aarch64.cc
  • src/target/preprocessors/cpu.cc
  • src/target/preprocessors/cuda.cc
  • src/target/preprocessors/woofles.cc
  • src/target/preprocessors/x86_64.cc

The composition of CPU conditionally calling AArch64 or x86_64 also allows multiple vendors to provide pre-processing functionality for llvm, by virtue of being able to call their own functions within the central CPU pre-processor.

Outcomes

I'll try to summarise the above into changes to the RFC which I can implement 😸

Single Parser

I can re-word the RFC to use a single parser (target_parser) rather than the three separate ones, using the syntax you suggested:

```c++
using TargetJSON = Map<String, ObjectRef>;
using FTVMTargetParser = TypedPackedFunc<TargetJSON(TargetJSON)>;
```
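
A possible registration shape for this single parser; `set_target_parser` matches the implementation commits referenced later in this thread, while the parser name here is a placeholder:

```c++
// The single target_parser replaces the separate attr/arch/keys preprocessors.
TVM_REGISTER_TARGET_KIND("c", kDLCPU)
    .set_target_parser(CTargetParser);
```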

Target Field

I'm strongly in favour of having target.features to represent this; do you actively object to it?

Location

I believe the src/target/parsers folder would provide for maximum customisation given parsers can be re-used between TargetKinds and therefore serves as a good starting point? Also happy for an RFC in 6 months to move everything to a new folder if we re-factor the Targets 😸

@tqchen (Member) commented May 20, 2022

Thanks folks for the discussion. I think it summarizes to the following points:

  • Q0: Subfield grouping (e.g. features) or simply leave them as top-level attrs?
  • Q1: Folder structure: target/preprocessors/cuda.cc vs target/cuda/cuda_preprocessor.cc.
    • Note that code re-use is less likely to be an issue, as there can be common includes among sub-folder structures.
  • Q2: How to reconcile attributes already specified for (part of) a subfield group: llvm can use mattr to specify the necessary features. What happens to CUDA (e.g. we are proposing to group hw-related limitations into features), and how can we specify some of the fields (which under the proposal will go into a features subfield) during tag registration?

@junrushao (Member)

@Mousius Thank you so much for your response! This makes lots of sense to me! Also, thanks for including my personal principles in the discussion; they are completely okay to disagree with :-)

> I'm not sure why we consider that pollution, given it should have a positive impact on all groups and aids in providing customisation through a well-defined route established by all vendors, providing simplicity and customisation (A1, A2 and A3).

Sorry, my words may have led to some miscommunication. As I mentioned at the very beginning, I completely agree with the idea of handling arch more systematically and it's important to us, so by no means do I consider it pollution. I mean the fragmentation itself is not good but hasn't become extremely bad right now, because these checks are not used globally at the infrastructure level, so there is no namespace pollution.

> I believe this still holds given the crossover between hardware and code generators; for c/llvm you will still be able to pass mattr/mcpu/etc, which are the standard way to specify CPU, ISA and extensions in existing compilers.

Agreed. It is designed to be consistent with existing wisdom, i.e. LLVM/GCC mattr/mcpu/etc., so I believe it's not controversial.

> There's definitely an existing amount of confusion in the Target system, but I think it's even more confusing than this when looking at the tagged entries, such as: .... This defines a system configuration that then uses the llvm code generator with a specific CPU profile; therefore the Target system can represent at minimum 3 distinct layers: systems/hardware/code generators.

Yes, I think your understanding is correct. Per the Target RFC, this object represents host compiler info (C0), heterogeneous compilation settings (C1), compiler behavior control (C2), and additional ISA/library (C4). The tag system is therefore introduced to reduce cognitive overhead, because end users can use a simple plain-string tag Target("raspberry-pi/4b-aarch64") to represent all of C0/1/2/4 instead of understanding every detail of the hardware themselves (A2). In other words, to the end users, the vision of the tag system is that they only need a simple string to configure TVM codegen for their hardware.

> whereas the cuda Target has a more complete reflection of the attrs available for CUDA. These inconsistencies make it difficult for a user to approach TVM, and they don't keep what a developer has to learn to the bare minimum (A2). It also diverges from other compilers, where you'd expect mcpu/mtriple/etc to infer much of this information for you (A1).

Right, in fact I share the same confusion when it comes to CUDA/Vulkan/etc., where (to the best of my knowledge) there is no precedent for mcpu/mattr/mtriple/etc (except for the LLVM NVPTX backend). Certainly, we should develop a unified approach to address this confusion, which could possibly be the arch proposed in the RFC.

> I don't mind using one parser; I was just making it more granular and avoiding stepping on people's toes with the existing pre-processor. Either design seems to fulfil your principle of customisability (A3); more granularity means you only have to implement a small piece of customisation, whereas the single parser requires some thought as to how to cater for all attributes within a given TargetKind parser.

Thanks for sharing your thoughts! Definitely. The issue with having multiple parsers is that the order of their application is less explicit, so I lean a little towards a single one, while the implementation could share common helper functions.

> Where to dispatch target parsers. Currently, the preprocessor is dispatched solely based on TargetKind, which is admittedly a bit limited and overlooks the possibility that aarch64 and x86 may need completely different parsers. Therefore, we would love to raise a question here: based on which attribute should the parser be dispatched? Previous wisdom in clang seems to suggest dispatching according to mtriple (IIUC), so shall we introduce anything similar, for example, dispatching based on --device aarch64-foo-bar?

> This requires developers to learn more than the bare minimum (A2), as users are required to learn the specifics of how to specify a Target in TVM. If an llvm Target supports -mcpu/-mattr/-mtriple then all of these fields can be reasonably expected to allow the Target to infer features given the relation to GCC/LLVM (A1).

Right. I agree. TargetKind-based dispatch is certainly the clearest approach as you mentioned (A2). The reason I'm thinking about (not 100% sure though) other directions is: what if different vendors (say Qualcomm and ARM) want to customize their own target parsing logic separately, while their TargetKind is the same llvm? Admittedly I don't have a perfect answer for this and would love to hear about your thoughts.

> Based on LLVM, the Features inferred from a given CPU/GPU profile can already be extensive, which would lead to a drastic increase in attrs if we were to track them this way. ... Therefore I would strongly advise following the pattern (as in LLVM) of using a separate structure for this; clear boundaries provide a much more straightforward developer experience (A2) and allow us to do things as @Lunderberg suggested, like plucking just features or arch to give to Passes rather than the entire Target.

Thank you for detailing the challenges! I'm convinced that target.features may be the best way to go. A couple more questions:

  • What's the difference between target.features and target.arch? If they are the same thing, I would say features is certainly a better name
  • IIUC we are not going to introduce 2 levels of indirection like target.arch.features. Is that correct?
  • Is it possible to completely replace target.attrs with more structured configuration? For example, target.features for C4 in the Target RFC, while target.hardware_limits for maximum shared memory, etc (just a hypothetical name, please feel free to suggest better ones)

> I believe the src/target/parsers folder would provide for maximum customisation given parsers can be re-used between TargetKinds and therefore serves as a good starting point? Also happy for an RFC in 6 months to move everything to a new folder if we re-factor the Targets 😸

Sounds good to me!

Overall, I think we are in broad agreement, and thanks for sharing all the context!

@areusch (Contributor) commented May 23, 2022

Great! So I think we just need some updates to the text of the RFC based on the discussion, then we are good to merge.

@Mousius (Member Author) commented Jun 9, 2022

In the spirit of keeping this simple to review, the text of this RFC has been reformulated to cover the target_parser replacement for the preprocessor which @junrushao1994 proposed. I've raised a follow-on for the features addition to the Target so that it can be reviewed in isolation, with updates from the discussion here: #78

Hopefully this makes it simpler for everyone 😸

cc @kparzysz-quic @junrushao1994 @tqchen @areusch @Lunderberg

@areusch (Contributor) left a comment

Thanks @Mousius, just a couple of clarifying questions about the text, but I'm good with the overall approach.

@Mousius (Member Author) commented Jun 27, 2022

@areusch I replied to your comments, could you take a look and see if it makes more sense to you? Then if we're all happy I'll do a final update on the text 😸

@areusch (Contributor) commented Jul 14, 2022

@kparzysz-quic @junrushao1994 @tqchen @Lunderberg can you look at this one again?

@kparzysz-quic left a comment

LGTM

```c++
}

TargetJSON CPUTargetParser(TargetJSON target) {
  if (IsAArch64Target(target)) {
```


Unless the details of the "kind" are listed in the JSON, there is no way of knowing what architecture the target refers to.

@kparzysz-quic

Actually, the target JSON only shows the name of the kind, so the target parser (if it operates on the JSON) won't see the details of the kind unless they are included in the JSON. I think that expanding the contents of the target JSON to contain the kind specifics as well is critical for this feature to make sense. Please clarify if that's part of the plan.

@Mousius (Member Author) commented Jul 15, 2022

> Actually, the target JSON only shows the name of the kind, so the target parser (if it operates on the JSON) won't see the details of the kind unless they are included in the JSON. I think that expanding the contents of the target JSON to contain the kind specifics as well is critical for this feature to make sense. Please clarify if that's part of the plan.

Hi @kparzysz-quic, can you expand on what's missing here? According to https://discuss.tvm.apache.org/t/rfc-tvm-target-specification/6844, the Target JSON includes the kind, attrs and keys.

@kparzysz-quic

```c++
Map<String, ObjectRef> TargetNode::Export() const {
  Map<String, ObjectRef> result = {
      {"kind", this->kind->name},  // <-- only the name of the kind is exported
      {"tag", this->tag},
      {"keys", this->keys},
  };
  if (this->host.defined()) {
    result.Set("host", this->GetHost().value_or(Target())->Export());
  }
  for (const auto& kv : attrs) {
    result.Set(kv.first, kv.second);
  }
  return result;
}
```

@Mousius (Member Author) commented Jul 15, 2022

@kparzysz-quic, sorry, I'm still not understanding here - the name of the kind is enough to look it up in the kind registry and gather any kind-specific information. What further details are missing?

@kparzysz-quic

> @kparzysz-quic, sorry, I'm still not understanding here - the name of the kind is enough to look it up in the kind registry and gather any kind-specific information. What further details are missing?

I thought the attrs were stored in the kind, but they are in the target itself. Nevermind.

Mousius added a commit to Mousius/tvm that referenced this pull request Jul 18, 2022
This adds the `target_parser` as described in apache/tvm-rfcs#71, which parses an incoming `TargetJSON` and produces a new configuration for generating the final `Target` object from.

Marks `set_attrs_preprocessor` as deprecated and errors if both `set_attrs_preprocessor` and `set_target_parser` exist together.
kparzysz-quic pushed a commit to apache/tvm that referenced this pull request Jul 18, 2022
This adds the `target_parser` as described in apache/tvm-rfcs#71, which parses an incoming `TargetJSON` and produces a new configuration for generating the final `Target` object from.

Marks `set_attrs_preprocessor` as deprecated and errors if both `set_attrs_preprocessor` and `set_target_parser` exist together.
@areusch (Contributor) commented Jul 25, 2022

@Mousius want to make the other suggested updates? then I'm good with this RFC

@areusch areusch merged commit 78423c5 into apache:main Aug 2, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022
This adds the `target_parser` as described in apache/tvm-rfcs#71, which parses an incoming `TargetJSON` and produces a new configuration for generating the final `Target` object from.

Marks `set_attrs_preprocessor` as deprecated and errors if both `set_attrs_preprocessor` and `set_target_parser` exist together.
mikeseven pushed a commit to mikeseven/tvm that referenced this pull request Sep 27, 2023
This adds the `target_parser` as described in apache/tvm-rfcs#71, which parses an incoming `TargetJSON` and produces a new configuration for generating the final `Target` object from.

Marks `set_attrs_preprocessor` as deprecated and errors if both `set_attrs_preprocessor` and `set_target_parser` exist together.