[Bug] [RISC-V RVV] sqrt operator shows poor vectorization performance #18564

@yanyanyanggg

Description

The sqrt (square root) operator performs poorly with the RISC‑V Vector (RVV) extension, achieving only 0.385× the performance of the scalar implementation. This is unexpected for a mathematical function that should see significant benefits from vectorization.

Steps to Reproduce

  1. Generate the sqrt operator with the following configuration:
params = {
    "dtype": "float32",
    "batch": 14,
    "channels": 23,
    "input_height": 67,
    "input_width": 99
}
  2. Export the operator to two targets:

    • RV target (scalar, without vector extension):
      llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c

    • RVV target (with vector extension):
      llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c,+v

  3. Run performance measurement on both targets.
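The two target strings differ only in the `+v` attribute. A minimal sketch of building the same operator for both targets follows; it assumes TVM 0.19 with the Relay frontend and an LLVM build that includes the RISC-V backend. `make_target` and `build_sqrt` are hypothetical helper names, not the reporter's `export_op`.

```python
# Attributes shared by both targets in this report.
BASE_ATTRS = ["+64bit", "+m", "+a", "+f", "+d", "+c"]


def make_target(with_vector: bool) -> str:
    """Return the LLVM target string; the RVV variant only appends +v."""
    attrs = BASE_ATTRS + (["+v"] if with_vector else [])
    return (
        "llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 "
        "-mabi=lp64d -mattr=" + ",".join(attrs)
    )


def build_sqrt(params: dict, with_vector: bool):
    """Build an elementwise sqrt for one target (hypothetical helper).

    TVM is imported lazily so the target-string helper above stays usable
    on machines without TVM installed.
    """
    import tvm
    from tvm import relay

    shape = (params["batch"], params["channels"],
             params["input_height"], params["input_width"])
    data = relay.var("data", shape=shape, dtype=params["dtype"])
    mod = tvm.IRModule.from_expr(relay.Function([data], relay.sqrt(data)))
    with tvm.transform.PassContext(opt_level=3):
        return relay.build(mod, target=tvm.target.Target(make_target(with_vector)))


if __name__ == "__main__":
    print(make_target(False))
    print(make_target(True))
```

Generating both strings from one attribute list keeps the scalar and vector builds identical apart from `+v`, so any timing difference can be attributed to the vector extension.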

Operator definition code:

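from tvm import relay

# NOTE: export_op is the reporter's own helper and is not shown in this issue.
def export_sqrt(params, set_dir=None, platform="rv"):
    # Input tensor in NCHW layout; shape and dtype come from params.
    data = relay.var("data",
                     shape=(params["batch"], params["channels"],
                            params["input_height"], params["input_width"]),
                     dtype=params["dtype"])
    # Elementwise square root over the whole tensor.
    sqrt_op = relay.sqrt(data)
    # params must also contain "op_name", which the configuration above omits.
    export_op(sqrt_op, params["op_name"], [data], params, set_dir=set_dir)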

Performance Data

  • RV execution time: 11.502000 ms
  • RVV execution time: 29.906500 ms
  • Acceleration ratio (RV/RVV): 0.385 (RVV is ~2.6× slower)
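The reported ratio can be reproduced from the two timings. The sketch below only restates that arithmetic; `speedup` is a hypothetical helper name.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Ratio > 1 means the candidate is faster than the baseline."""
    return baseline_ms / candidate_ms


rv_ms = 11.502000   # scalar RV execution time from the report
rvv_ms = 29.906500  # RVV execution time from the report

ratio = speedup(rv_ms, rvv_ms)  # RV/RVV, as reported
slowdown = rvv_ms / rv_ms       # how many times longer RVV takes

print(f"RV/RVV ratio: {ratio:.3f}")
print(f"RVV takes {slowdown:.1f}x as long as RV")
```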

Environment Information

  • TVM version: 0.19.0
  • LLVM version: [Please provide: llvm-config --version]
  • Hardware: Spacemit K1‑X bit‑brick board
  • CPU: Spacemit X60 (8 cores, 1.6 GHz)
  • ISA: rv64imafdcv (with vector extensions)
  • Memory: 7.6 GB
  • OS: Bianbu 2.2, Linux kernel 6.6.63
  • Operation: Elementwise square root on ~2.1M elements (14 × 23 × 67 × 99 = 2,135,826)
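The element count follows directly from the shape given in the reproduction parameters. A short check, assuming float32 takes 4 bytes per element:

```python
import math

# Shape values copied from the reproduction parameters.
params = {
    "dtype": "float32",
    "batch": 14,
    "channels": 23,
    "input_height": 67,
    "input_width": 99,
}

shape = (params["batch"], params["channels"],
         params["input_height"], params["input_width"])
num_elements = math.prod(shape)
bytes_per_element = 4  # float32
tensor_bytes = num_elements * bytes_per_element

print(f"elements: {num_elements:,}")
print(f"input tensor size: {tensor_bytes / 1e6:.1f} MB")
```

The input and output tensors together are roughly twice this size, which helps when judging whether the kernel is compute-bound or memory-bound on this board.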

Expected Behavior

RVV vectorization should provide a performance improvement over the scalar RV baseline for mathematical functions like square root.

Additional Context

  • The sqrt operation is applied elementwise to a tensor of ~2.1M elements (14 × 23 × 67 × 99 = 2,135,826).
  • The performance regression (2.6× slower) suggests that the vectorized implementation of sqrt may be using suboptimal instructions or inefficient vector length management.
  • This is part of a broader pattern where multiple mathematical operators (log, sqrt, etc.) show severe performance degradation with RVV, indicating a potential issue with vector intrinsic mapping or loop vectorization for elementwise math functions.
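One way to test the "suboptimal instructions" hypothesis is to count the relevant mnemonics in the generated assembly for each target. The sketch below works on plain assembly text; `count_mnemonics` is a hypothetical helper and the sample is hand-written for illustration, not output from this build. It assumes the assembly can be dumped from the compiled module, for example via the module's `get_source("asm")` in TVM.

```python
from collections import Counter

# RVV vector sqrt, scalar single-precision sqrt, and vector-length setup.
DEFAULT_MNEMONICS = ("vfsqrt.v", "fsqrt.s", "vsetvli", "vsetivli")


def count_mnemonics(asm_text: str, mnemonics=DEFAULT_MNEMONICS) -> dict:
    """Count occurrences of selected instruction mnemonics in assembly text."""
    counts = Counter()
    for line in asm_text.splitlines():
        code = line.split("#", 1)[0].strip()  # drop trailing comments
        # Skip blank lines, labels, and assembler directives.
        if not code or code.endswith(":") or code.startswith("."):
            continue
        op = code.split()[0]
        if op in mnemonics:
            counts[op] += 1
    return {m: counts[m] for m in mnemonics}


# Hand-written illustrative loop body, not taken from the actual build.
sample = """
.LBB0_1:
    vsetvli a3, a2, e32, m1, ta, ma   # set vector length for 32-bit elements
    vle32.v v8, (a0)
    vfsqrt.v v8, v8
    vse32.v v8, (a1)
"""

print(count_mnemonics(sample))
```

If the RVV build contains no `vfsqrt.v` but many `fsqrt.s`, the loop was not vectorized; many `vsetvli` instructions inside the hot loop would instead point at vector length management.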

Metadata

  • Labels: needs-triage, type: bug
