Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding vector dialect support to Calyx? #1777

Closed
medbzkst opened this issue Nov 15, 2023 · 8 comments
Closed

Adding vector dialect support to Calyx? #1777

medbzkst opened this issue Nov 15, 2023 · 8 comments
Labels
C: CIRCT Changes for the CIRCT backend Status: Needs Triage Issue needs some thinking

Comments

@medbzkst
Copy link

I am targeting the lowering of MLIR codes that contain functions that receive memrefs as input, transform them into vectors (using load, maskedload, or gather operations), perform several operations using vector dialect, and then spit out a memref as an output (using store, maskedstore, or scatter operation). The thing is that CIRCT does not support lowering vector dialect operations to Calyx. However, vector dialect is an abstraction that can be directly concretised into hardware. One way to see that Calyx is suitable for that is the Dahlia and the systolic array generator frontends.

So what about generating hardware that interfaces with memories (memrefs as inputs and outputs) and performs all the required vector operations as they are instructed in vector operations? Maybe a conversion pass vector-to-calyx?

@sampsyo
Copy link
Contributor

sampsyo commented Nov 17, 2023

Hi, @medbzkst—thanks for the super interesting suggestion! I will freely admit I'm no expert on either vector nor the SCFToCalyx pass infrastructure. However, the prospect of lowering data-parallel MLIR ops to hardware seems totally tractable.

The main question I'd have is: can this be accomplished (perhaps in a lower-performance way) with existing MLIR passes? For example, there is VectorToSCF. Does it suffice, for correctness only, to use this and then lower the resulting vector-free code to Calyx? I'd be super interested to know if this causes any problems.

I expect that to generate not-great hardware, but it would be extremely useful as a baseline. It's not clear to me whether a better lowering would go directly from vector to Calyx or if some intermediate point would make more sense, but I'd be willing to believe that the direct route is good.

@sampsyo
Copy link
Contributor

sampsyo commented Nov 17, 2023

It occurs to me to mention @jiahanxie353, who has been ramping up on the CIRCT-related Calyx infrastructure. Is this something you'd have an opinion about, or even be interested in contributing to?

@jiahanxie353
Copy link
Collaborator

Hi @medbzkst thanks for bringing up this super interesting conversion!

While I'm not an expert on vector either, my current understanding is that lowering from Vector to SCF and then from SCF to Calyx seems more reasonable. The reason is SCF's if and for are very useful for Vector transferring operations, such as loading/storing to/from the memory. This is particularly when we have high dimensional Vectors, where the Vectors need to be unpacked, the bounds need to be checked, etc. It seems to me that SCF is more suitable for these aspects.

or even be interested in contributing to?

I'm definitely interested! I can delve more into it and return to the thread with some insights once I've formulated some thoughts.

@medbzkst
Copy link
Author

medbzkst commented Nov 19, 2023

Thank you for the feedback @sampsyo and @jiahanxie353.

Indeed the first idea is to try to use the available infrastructure, and intuitively it would be a great baseline for correctness and performance. However, I came across some problems while I was trying to do so.

The VectorToScf pass seems to be not as generic as it sounds. From the top of the source code file you can see the comment "This file implements lowering of vector transfer operations to SCF.". So it is precisely there to lower the transfer operations. Even if I give it a run on a very simple MLIR module that receives 2 vectors as input, sum them up, and return the output, it does not behave as expected in the sense that it does not convert that to an SCF for or while loop even by setting the -full-unroll option.

You can give it a try if you want on this MLIR module:

module {
    func.func @testing(%vec1 : vector<8xf64>, %vec2 : vector<8xf64>) -> (vector<8xf64>) {
        %res = arith.addf %vec1, %vec2 : vector<8xf64>
        return %res : vector<8xf64>
    }
}

By running mlir-opt --convert-scf-to-vector filename.mlir or mlir-opt --convert-scf-to-vector="full-unroll=1" filename.mlir.

Now the idea was, let's convert that manually (by an external script as a temporary solution before implementing a pass for that) to the desired loop implementation in SCF. The problem happened with those operations that iterate and check a condition, i.e. the masked operations. For example a gather operation if thought of as a loop, it would require checking whether to grab an element from memory or skip it and use a pass-through element. Even less sophisticated than that, a maskedload operation would do the same. The problem here is that the loop will have more than one control block which MLIR does not seem to be happy with using the while loop (perhaps that's why there is no pass available that does that?). Using the for loop it would also be problematic because those operations yield nothing, but at least it is at the level of CIRCT here.

Finally, the last chance was to lower everything to CF and lower from there to Calyx as it will be just about a bunch of blocks with some controls. However, this means having backedges in the CFG since we have loops, which is not lowerable so far.

For those reasons, I believe there should be a way to have a VectorToCalyx pass. Most especially when dealing with the masked operations, it would be probably more optimal to decide on how to connect components at the pass level and have a direct Calyx code optimised for that.

@sampsyo
Copy link
Contributor

sampsyo commented Nov 22, 2023

Intriguing. I think it might be useful, if we were to go down the "direct lowering" road, to try hand-writing some Calyx programs that perform data-parallel vector-ish operations and see how those look…

@rachitnigam
Copy link
Contributor

@medbzkst another possible place to look is #1585. @xerpi implemented a Halide-to-Calyx compiler that implements lowering for vector operations to Calyx but it wasn't merged into CIRCT. Perhaps that's a good place to start?

Also, if you have questions about Calyx, we can add you to the Calyx slack and directly answer questions there as well. If you'd like to join, please send me an email.

@rachitnigam rachitnigam added Status: Needs Triage Issue needs some thinking C: CIRCT Changes for the CIRCT backend labels Nov 23, 2023
@jiahanxie353
Copy link
Collaborator

jiahanxie353 commented Nov 23, 2023

Thanks @medbzkst for the detailed issue description!

Even if I give it a run on a very simple MLIR module that receives 2 vectors as input, sum them up, and return the output, it does not behave as expected in the sense that it does not convert that to an SCF for or while loop even by setting the -full-unroll option.

This is surprising...

Anyway, I think we should still keep VectorToSCF pass. Because I think structured control flow is just very important for vector operations. And I hope VectorToSCF is still under development, so they will eventually support full functionalities? Otherwise it's so surprising that they couldn't event support vectors add...

Now the idea was, let's convert that manually (by an external script as a temporary solution before implementing a pass for that) to the desired loop implementation in SCF.

With this manual temporary solution of basic case of maskload, I would propose using scf.for and then lower to Calyx. It's true that we currently don't support that, but I can try to implement it.

At the end of the day, I would imagine we can have code like:

func.func @maskedload(%base_vector: memref<?xf32>, %mask: memref<8xi1>, %pass_thru: memref<8xf32>) -> memref<8xf32> {
  %result = scf.for %index = 0 to %size step 1 {
    %mask_val = load %mask[%index] 
    %cond = cmpi eq %mask_val, %c1
    %result_next = scf.if %cond {
      %val = load %base_vector[%index] 
      scf.yield %val
    } else {
      %val = load %pass_thru[%index]
      scf.yield %val
    }
  scf.yield %result_next
  }
  return %result
}

So as long as we support lowering scf.for to Calyx, we are good for now.


Most especially when dealing with the masked operations, it would be probably more optimal to decide on how to connect components at the pass level and have a direct Calyx code optimised for that.

I think it might be useful, if we were to go down the "direct lowering" road, to try hand-writing some Calyx programs that perform data-parallel vector-ish operations and see how those look…

Agree with these comments. Given that vector is already pretty low level, we can expose vectors to Calyx directly and exploit parallelisms there.

Summing up, the solution I come up with is using a mix of SCF and Calyx, where SCF is in charge of the control flow, and Calyx to exploit "data-parallel vector-ish operations". And I can

  1. Support scf.for (with non-empty yield) lowering to Calyx;
  2. I will also explore some direct VectorToCalyx opportunities, especially with mask operations.

Please let me know what you think

@rachitnigam
Copy link
Contributor

Hey @medbzkst! I'm going to close this issue but please feel free to re-open this or a new issue if you'd like to discuss another problem/need help with something!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: CIRCT Changes for the CIRCT backend Status: Needs Triage Issue needs some thinking
Projects
None yet
Development

No branches or pull requests

4 participants