Conversation

@guosran guosran commented Jan 23, 2026

This PR introduces a new pipeline to lower TOSA operations to the Taskflow dialect through Linalg and Affine conversions. Key changes include:

  1. New Pipeline: Added TosaToTaskflowPipeline.cpp that orchestrates:

    • TOSA Optimizations: Integrated standard TOSA cleanups (infer-shapes, make-broadcastable) and layerwise constant folding. Note: constant folding currently has limited impact (see Issue #246, "[P1] Enhance TOSA constant folding effectiveness").
    • TOSA -> Linalg/Arith/Tensor conversion.
    • Linalg Optimizations: Enabled elementwise-fusion, which proved critical for merging operation chains into single kernels.
    • One-Shot Bufferization: Configured with IdentityLayoutMap for deterministic results.
    • Linalg -> Affine conversion.
    • Affine -> Taskflow conversion.
  2. Tests:

    • Added tosa-to-taskflow.mlir to verify the full end-to-end pipeline.
    • Added tosa-to-affine.mlir for inspecting the intermediate structural lowering.
    • Added tosa-fusion.mlir to verify that operator chains are correctly fused into single loops.
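To make the orchestration above concrete, here is a hypothetical plain-Python sketch of the pipeline's shape: each stage is a module-to-module transformation applied in a fixed order. The stage names below are informal labels chosen for illustration, not real MLIR pass identifiers.

```python
# Hypothetical sketch of the pipeline structure. Stage names are informal
# labels, not actual MLIR pass names; 'apply' stands in for running one pass.

PIPELINE = [
    "tosa-optimize",            # infer-shapes, make-broadcastable, const-fold
    "tosa-to-linalg",           # TOSA -> Linalg/Arith/Tensor
    "linalg-fuse-elementwise",  # merge op chains into single kernels
    "one-shot-bufferize",       # IdentityLayoutMap for deterministic results
    "linalg-to-affine",         # Linalg -> Affine
    "affine-to-taskflow",       # Affine -> Taskflow
]

def run_pipeline(module, stages=PIPELINE, apply=lambda m, s: f"{m}|{s}"):
    # The default 'apply' just records the stage order on the module string.
    for stage in stages:
        module = apply(module, stage)
    return module
```

The point of the sketch is only the ordering: every stage consumes the previous stage's output, which is why the intermediate `tosa-to-affine` pipeline (discussed below) can be split out cleanly.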

@guosran guosran requested review from ShangkunLi and Copilot and removed request for Copilot January 23, 2026 23:14
Copilot AI left a comment

Pull request overview

This PR adds a new TOSA→Taskflow lowering pipeline and wires it into the mlir-neura-opt tool, along with the necessary dialect/extension registrations and tests.

Changes:

  • Introduces MLIRTosaToTaskflowPipeline and registerTosaToTaskflowPipeline() to lower TOSA through Linalg, bufferization, and Affine to Taskflow.
  • Updates mlir-neura-opt to register TOSA/bufferization dialects, bufferization interfaces, MLIR extensions, and the new pipeline, and links in the required MLIR libraries.
  • Adds conversion tests for full TOSA→Taskflow lowering and direct Affine→Taskflow lowering, plus updates a CGRA-Bench submodule reference.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

File Description
tools/mlir-neura-opt/mlir-neura-opt.cpp Registers new dialects, bufferization interfaces, MLIR extensions, passes, and the TOSA→Taskflow pipeline in the optimization tool.
tools/mlir-neura-opt/CMakeLists.txt Links additional MLIR dialect, transform, bufferization, and extension libraries required by the tool.
test/benchmark/CGRA-Bench Updates submodule commit for CGRA benchmark data.
test/Conversion/TosaToTaskflow/tosa-to-taskflow.mlir Adds an end-to-end test for the new TOSA→Taskflow pipeline.
test/Conversion/TosaToTaskflow/affine-to-taskflow.mlir Adds a focused test for the Affine→Taskflow conversion pass.
lib/Conversion/TosaToTaskflow/TosaToTaskflowPipeline.cpp Implements the TOSA→Taskflow pass pipeline (TOSA→Linalg/Arith/Tensor, bufferization, Linalg→Affine, Affine→Taskflow).
lib/Conversion/TosaToTaskflow/CMakeLists.txt Builds and links the new MLIRTosaToTaskflowPipeline library with required MLIR components.
lib/Conversion/CMakeLists.txt Integrates the new TOSA→Taskflow pipeline library into the Conversion CMake hierarchy and interface library.
include/Conversion/ConversionPasses.h Declares registerTosaToTaskflowPipeline() for external registration.


@guosran guosran force-pushed the feature/tosa-lowering branch from e041bd9 to 0199b8d on January 23, 2026 23:19
This PR introduces a new pipeline to lower TOSA operations to the Taskflow dialect through Linalg and Affine conversions. Key changes include:

1.  **New Pipeline**: Added  that orchestrates:
    *   TOSA -> Linalg/Arith/Tensor conversion.
    *   Linalg optimizations (elementwise fusion).
    *   One-Shot Bufferization with  for deterministic results.
    *   Linalg -> Affine conversion.
    *   Affine -> Taskflow conversion.

2.  **Tooling Support**:
    *   Updated  to register TOSA, Bufferization, and related Dialects.
    *   Explicitly registered Bufferization interfaces for Linalg, Tensor, Arith, and SCF to prevent runtime crashes.
    *   Added  and  links.

3.  **Tests**:
    *   Added  to verify the full pipeline.
    *   Added  for direct affine lowering verification.
    *   Ensured compatibility with existing Taskflow tests (e.g. ).
@guosran guosran force-pushed the feature/tosa-lowering branch from 0199b8d to 609d337 on January 23, 2026 23:24
@ShangkunLi

Thanks for this great job~

I think we should enable a progressive lowering process from tosa to taskflow. Like from tosa to affine, then affine to taskflow.

And could you please help investigate if we can perform some graph-level optimization in tosa level (e.g., operator fusion), or we have to introduce the linalg as an intermediate representation.

Split the TosaToTaskflow pipeline into two distinct pipelines:
1. : Lowers TOSA to Linalg (with optimizations and bufferization) and then to Affine. This serves as a foundational pipeline for inspection or further affine transformations.
2. : A composite pipeline that runs  followed by .

Key changes:
- Refactored  to expose .
- Registered both pipelines in  and .
- Added  test case to verify the intermediate affine stage.
@guosran

guosran commented Jan 24, 2026

Thanks for this great job~

I think we should enable a progressive lowering process from tosa to taskflow. Like from tosa to affine, then affine to taskflow.

And could you please help investigate if we can perform some graph-level optimization in tosa level (e.g., operator fusion), or we have to introduce the linalg as an intermediate representation.

I have split the pipeline and added a test accordingly. Perhaps I could perform the optimizations in a subsequent PR?

Enabled TOSA standard optimization passes (InferShapes, MakeBroadcastable, LayerwiseConstantFold) in the pipeline. Added 'tosa-fusion.mlir' to verify Linalg elementwise fusion and 'tosa-opt.mlir' to benchmark TOSA constant folding (currently a known limitation).
@guosran

guosran commented Jan 24, 2026

Thanks for this great job~
I think we should enable a progressive lowering process from tosa to taskflow. Like from tosa to affine, then affine to taskflow.
And could you please help investigate if we can perform some graph-level optimization in tosa level (e.g., operator fusion), or we have to introduce the linalg as an intermediate representation.

I have split the pipeline and added a test accordingly. Perhaps I could perform the optimizations in a subsequent PR?

Done some of the optimizations; will present them here.

@guosran

guosran commented Jan 24, 2026

Summary of optimizations

  • Input Source (TOSA IR)
    The test case consists of a chain of three logical operations: Add, Multiply, and Maximum (ReLU).
func.func @fusion_test(%arg0: tensor<16xf32>) -> tensor<16xf32> {
  // Op 1: Add
  %0 = tosa.add %arg0, %arg0 : (tensor<16xf32>, tensor<16xf32>) -> tensor<16xf32>
  // Op 2: Multiply
  %1 = tosa.mul %0, %0 : (tensor<16xf32>, tensor<16xf32>) -> tensor<16xf32>
  // Op 3: Maximum (ReLU placeholder)
  %zeros = "tosa.const"() {value = dense<0.0> : tensor<16xf32>} : () -> tensor<16xf32>
  %2 = tosa.maximum %1, %zeros : (tensor<16xf32>, tensor<16xf32>) -> tensor<16xf32>
  
  return %2 : tensor<16xf32>
}
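For reference, the semantics of this three-op chain can be modeled in plain Python (a hypothetical hand-written model of the IR above, not something generated by the pipeline):

```python
# Hypothetical plain-Python model of the tosa.add -> tosa.mul -> tosa.maximum
# chain in @fusion_test; the function name mirrors the IR for readability.

def fusion_test(arg0):
    r0 = [v + v for v in arg0]                      # tosa.add %arg0, %arg0
    r1 = [v * v for v in r0]                        # tosa.mul %0, %0
    zeros = [0.0] * len(arg0)                       # tosa.const dense<0.0>
    return [max(a, b) for a, b in zip(r1, zeros)]   # tosa.maximum %1, %zeros

# e.g. fusion_test([3.0, -2.0]) -> [36.0, 16.0]
```

Note that since the multiply squares its input, the final maximum-with-zero is a no-op on the values; the chain still exercises all three ops structurally, which is what the fusion test needs.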
  • Baseline Result (Without Linalg Fusion)
    Without the fusion transformation, each high-level TOSA operation results in a standalone loop nest, which is far from optimal.
func.func @fusion_test(%arg0: memref<16xf32>, %arg1: memref<16xf32>) {
  %cst = arith.constant 0.000000e+00 : f32  // Zero constant used by Pass 3
  // Pass 1: Addition
  %tmp1 = memref.alloc() : memref<16xf32>
  affine.for %i = 0 to 16 {
    %v0 = affine.load %arg0[%i] : memref<16xf32>
    %v1 = arith.addf %v0, %v0 : f32
    affine.store %v1, %tmp1[%i] : memref<16xf32>
  }
  // Pass 2: Multiplication (Reads output of Pass 1)
  %tmp2 = memref.alloc() : memref<16xf32>
  affine.for %i = 0 to 16 {
    %v1 = affine.load %tmp1[%i] : memref<16xf32>
    %v2 = arith.mulf %v1, %v1 : f32
    affine.store %v2, %tmp2[%i] : memref<16xf32>
  }
  // Pass 3: Maximum (Reads output of Pass 2)
  affine.for %i = 0 to 16 {
    %v2 = affine.load %tmp2[%i] : memref<16xf32>
    %v3 = arith.maximumf %v2, %cst : f32
    affine.store %v3, %arg1[%i] : memref<16xf32>
  }
  return
}
  • Optimized Result (With Linalg Fusion)
    With the fusion pass enabled in our current pipeline, all three operations are consolidated into a single loop nest.
func.func @fusion_test(%arg0: memref<16xf32>, %arg1: memref<16xf32>) {
  %cst = arith.constant 0.000000e+00 : f32
  %alloc = memref.alloc() {alignment = 64 : i64} : memref<16xf32>
  affine.for %arg2 = 0 to 16 {
    %0 = affine.load %arg0[%arg2] : memref<16xf32>
    %1 = arith.addf %0, %0 : f32        // Fused Op 1
    %2 = arith.mulf %1, %1 : f32        // Fused Op 2
    %3 = arith.maximumf %2, %cst : f32  // Fused Op 3
    affine.store %3, %alloc[%arg2] : memref<16xf32>
  }
  memref.copy %alloc, %arg1 : memref<16xf32> to memref<16xf32>
  return
}
  • Conclusion
    Lowering through Linalg with elementwise fusion is crucial: it collapses three loop nests into one and eliminates the intermediate buffers, so the fused values flow through registers instead of round-tripping through memory.
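A rough back-of-envelope model, hand-counted from the two IR snippets above (these are structural counts, not measurements, and they assume memref.copy lowers to one load and one store per element):

```python
# Hand-counted memory-traffic model for the baseline vs. fused IR above.
# Counts are assumptions derived by reading the snippets, not profiling data.

N = 16  # tensor<16xf32>

# Baseline: three affine.for nests, two intermediate memrefs (%tmp1, %tmp2).
baseline = {"loop_nests": 3, "allocs": 2, "loads": 3 * N, "stores": 3 * N}

# Fused: one affine.for nest plus the final memref.copy (%alloc -> %arg1),
# which we assume costs one load and one store per element.
fused = {"loop_nests": 1, "allocs": 1,
         "loads": N + N,    # loop loads + copy reads
         "stores": N + N}   # loop stores + copy writes
```

Under this model, fusion cuts total memory traffic from 6N to 4N accesses and, more importantly, removes the loop-nest boundaries that would otherwise force intermediate results to memory; eliminating the trailing copy (e.g., by writing directly into %arg1) would reduce it further to 2N.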

@tancheng

Why remove the relu_int.cpp?

// TaskFlow Conversion Passes.
std::unique_ptr<mlir::Pass> createConvertAffineToTaskflowPass();
void registerTosaToAffinePipeline();
void registerTosaToTaskflowPipeline();

I think we may not need such a pass pipeline for now.

You can add a test to verify the end-to-end lowering process, e.g., from python -> tosa -> linalg -> affine -> taskflow.

This is to verify the lowering process; we can then add more optimizations at each level.
