Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend the peepopt shiftmul matcher #3873

Merged
merged 8 commits into from
Oct 16, 2023
Merged

Conversation

povik
Copy link
Member

@povik povik commented Aug 2, 2023

This extends the shiftmul matcher in two useful ways:

  • It deals with cases where the shift amount is zero-padded from below because opt_expr ran before and applied opt.opt_expr.mul_low_zeros.
  • It can deal with left shifts implementing a demuxer (through the addition of a mirrored matcher).

Related to #3833.

@povik
Copy link
Member Author

povik commented Aug 3, 2023

Force pushed. There was a confusion in the commit message about whether it's the left or right shift matcher we are adding.

@phsauter
Copy link
Contributor

phsauter commented Aug 3, 2023

As we already started to discuss in #3833, this currently only handles: foo[s*W+:W], where W is a constant.

Since yesterday I went through different scenarios/cases I could see happening and thought about how to handle them.
For this I will generally refer to foo as an array of multi-bit elements as this is probably the most common origin of such a construct.

  • foo[s*W+:W1] where W1<W. This is the case when only a few bits are selected from some array of elements or similar. This case is trivial as one can just add more padding in-between output bits but currently It seems to be that shift_amount must exactly match the shifter output-width for it to then apply the transformation. Edit: This is already covered, I just assumed it isn't because of the comment at the top and the following case isn't covered.
  • foo[s*W+:W2] where W2>W. My guess is that this case would not occur very often but it covers the case where one would like to select something from two consecutive elements from an array of elements. Personally I think this case should be noted but does not need to be handled right now. (the transformation here should however work basically the same except that we need to use clog2(W2) instead (size of shift-output).
  • foo[s*W+O+:W1] where O is a constant. This case describes the selection of a few bits from one element of the array. The offset gives the starting position of the bits. As I understand it, your proposal was to handle this separately in a shiftadd peepot, so that we then have two shifts, one being directly attached to a $mul, making it possible to use shiftmul. I think this is a good idea but it does have one minor caveat, lets again distinguish two options:
    1. We place the constant-offset shift in-front of the soon-to-be shiftmul block. This always works but it has the slight disadvantage that multiple accesses to different bits from the same element, may now be converted into shiftmul blocks, each having a different offset at its input and we are going to rely on later optimization passes to optimize this for us. If it is caught before techmap I would say this is a non-problem. If it is only caught after, then we create a lot of $MUX cells for nothing, hurting performance and increasing the memory footprint.
    2. We place the constant-offset shift after the shiftmul block. This would eliminate the minor caveat from (1) but it only works if W>(O+W1), which should be the case if the code and all previous transformations are reasonable but we cannot guarantee it. Meaning such a transformation could not be done independently and would have to be called as part of shiftmul, since we need to know W from the mul.
  • foo[s*W+O+:W1] where O=V*var. I do not expect this to occur often (if at all). This represents the case where first an element is selected from an array and then some starting point is selected, from which W1 bits are used. Such a scenario would likely also require a more holistic approach as we now essentially have two nested shifts and we would have to find out which is larger and perform that first. It is also possible that some earlier part might actually already split this into two shifts for us (though I doubt it). Again, I think this is uncommon enough to make a note of it but ignore it.

Tl;Dr: assuming foo[s*W+:W1], we should check which of W and W1 is larger and add padding accordingly, W=W1 is an arbitrary restriction.
Constant offsets can be generally dealt with by adding another shift in-front, then this pass can take care of the strided access. This may be suboptimal for performance.
More complex cases are possible but likely rare. They would likely require a monolithic approach.

@povik
Copy link
Member Author

povik commented Aug 3, 2023

  • foo[s*W+:W1] where W1<W. This is the case when only a few bits are selected from some array of elements or similar. This case is trivial as one can just add more padding in-between output bits but currently It seems to be that shift_amount must exactly match the shifter output-width for it to then apply the transformation.

Not really, W1<W is readily supported with the transformation as-is. (shift_amount is the signal that's input to the shift cell and specifies the shift amount, and the pattern matcher looks for a $mul cell the output of which matches this signal, but that's unrelated to W1<W.)

  • foo[s*W+:W2] where W2>W. My guess is that this case would not occur very often but it covers the case where one would like to select something from two consecutive elements from an array of elements. Personally I think this case should be noted but does not need to be handled right now. (the transformation here should however work basically the same except that we need to use clog2(W2) instead.

Yeah, I am not sure how often this would come up in actual code but it's not inconceivable. Maybe what other RTL compilers do in such a case would be of useful input since existing codebases could be accustomed to that. In any case I guess there should be some heuristics comparing the magnitude of W2 and W to decide when the transformation is considered beneficial.

@phsauter
Copy link
Contributor

phsauter commented Aug 3, 2023

You are correct in both cases (I already noticed the first one myself).

Other compilers are unlikely to use a concept like $shiftx to represent a block-select/mux, I would imagine they just use a block-select in the first case (in fact I know that at least one does exactly that).
In the W2>W (selection larger than a block) I would assume they just replicate the necessary wires into each group.
Meaning each group will be sized to W2 instead of W and filled with the wires going into the neighboring groups.

@povik
Copy link
Member Author

povik commented Aug 3, 2023

  • foo[s*W+O+:W1] where O is a constant. This case describes the selection of a few bits from one element of the array. The offset gives the starting position of the bits. As I understand it, your proposal was to handle this separately in a shiftadd peepot, so that we then have two shifts, one being directly attached to a $mul, making it possible to use shiftmul. I think this is a good idea but it does have one minor caveat, lets again distinguish two options:

    1. We place the constant-offset shift in-front of the soon-to-be shiftmul block. This always works but it has the slight disadvantage that multiple accesses to different bits from the same element, may now be converted into shiftmul blocks, each having a different offset at its input and we are going to rely on later optimization passes to optimize this for us. If it is caught before techmap I would say this is a non-problem. If it is only caught after, then we create a lot of $MUX cells for nothing, hurting performance and increasing the memory footprint.

I am not sure there's an issue there, and if there is overhead, maybe the situation can be improved by only leaving those signals connected to the shift cell that can actually be selected, and rounding up W1 (instead of W) to the next power-of-2. Though at that point maybe we should just be leaving $bmux behind, which didn't exist when the transformation was originally written.

@povik
Copy link
Member Author

povik commented Aug 3, 2023

Other compilers are unlikely to use a concept like $shiftx to represent a block-select/mux, I would imagine they just use a block-select in the first case (in fact I know that at least one does exactly that).
In the W2>W (selection larger than a block) I would assume they just replicate the necessary wires into each group.
Meaning each group will be sized to W2 instead of W and filled with the wires going into the neighboring groups.

Well but then at some point if W2 is large and W is small you could be better off expressing this as W independent shifts, instead of one block select with a lot of replication in the input signal. That's what I am slightly curious about how other compilers handle.

@phsauter
Copy link
Contributor

phsauter commented Aug 3, 2023

Well but then at some point if W2 is large and W is small you could be better off expressing this as W independent shifts, instead of one block select with a lot of replication in the input signal. That's what I am slightly curious about how other compilers handle.

Makes sense, I mentally didn't make W2 large enough :-)
I am going to try it today or tomorrow and will report back.

@phsauter
Copy link
Contributor

phsauter commented Aug 3, 2023

Though at that point maybe we should just be leaving $bmux behind.

Personally I think there is a pretty strong case for transforming it to a $bmux anyway as it makes it way more obvious whats even happening and I think in most normal cases it should generate less MUX cells.

A block-mux halves the bits it works on in every stage, so its cost should be something like block_size*2*log2(#blocks).
A barrel-shifter on the other hand always operates on the same number of bits in each stage, so its cost is roughly data_size*log2(#blocks) (where data_size=block_size*#blocks). This isn't entirely accurate since it actually matter how compact the bits of the control signals are (hence why this optimization exists in the first place).

Anyway, one would expect a block-mux to be more efficient in most cases, especially the ones you are likely to encounter in reality.
One big exception is once W2 is larger than W because for the bmux block_size=max(W2,W), while for the shifter it doesn't really change anything.

Obviously and arbitrarily good MUX-tree optimizer will likely achieve identical results but the amount of cells you have to operate on matters, especially once you try to synthesize larger designs.

@povik
Copy link
Member Author

povik commented Aug 3, 2023

A block-mux halves the bits it works on in every stage, so its cost should be something like block_size*2*log2(#blocks).

I don't think that's true since if each stage works on a different amount of bits, and that amount doubles going to earlier stages, the final cost must come down to something like block_size*#blocks*2 - 1 (for power-of-two #blocks). Once you prune the trivial and unused muxes in the implementation of the shift I think the cost will be the same.

One big exception is once W2 is larger than W because for the bmux block_size=max(W2,W), while for the shifter it doesn't really change anything.

Agreed on shift having a lower cost (in terms of elementary muxes) for the case W2>W.

Personally I think there is a pretty strong case for transforming it to a $bmux anyway...

I agree. Not so much because of any overhead consideration but for clarity and to make it easier for later passes to reason about what's going on in the circuit.

@phsauter
Copy link
Contributor

phsauter commented Aug 3, 2023

By the way I am currently working on a shiftadd.pmg that should be able to deal with any $add and $sub cells.
Funnily enough dealing with foo[s*W-O+:W1] (note the minus) is actually easier since we can just prepend some fill-bits.

The more common foo[s*W+O+:W1] is a bit of a pain if you want to make it as generic as possible since the +O would make it so a previous shift makes the lowest O bits unreachable, this is fine in all reasonable cases but there is nothing stopping foo[s*W] from going into the negative just enough to still reach these now lost bits. So the transformation wouldn't be 100% the same anymore.

Currently I just limited it so that one value must be constant (O) and the other must be unsigned. This might not be ideal if it starts chaining multiple operations though but for now it should work. I still have some bugs though, meaning they are not quite functionally equivalent for all the test cases I am using.
I will push them as soon as they are, then you can decide if you want them in here as well.

@phsauter
Copy link
Contributor

phsauter commented Aug 3, 2023

the final cost must come down to something like block_size*#blocks*2 - 1 (for power-of-two #blocks)

Yes you are correct, it even says that on the paper in front of me, I just didn't pay attention while typing it (well I didn't have the -1 but I ignored it on purpose since its just a constant).

@povik
Copy link
Member Author

povik commented Aug 3, 2023

By the way I am currently working on a shiftadd.pmg...

Good to know!

Currently I just limited it so that one value must be constant (O) and the other must be unsigned.

Yeah, I don't see any other reasonable way to do it. wreduce will convert cell inputs to unsigned if it deduces that it's okay, so relying on the input being unsigned in your situation should not be too restrictive, hopefully.

I still have some bugs though, meaning they are not quite functionally equivalent for all the test cases I am using.
I will push them as soon as they are, then you can decide if you want them in here as well.

I think it's best if you open a separate PR (a draft one if you don't feel it's ready for merging), and then we can discuss the shiftadd rule there. I don't have merging powers here but those that have will be happy we split things.

@phsauter
Copy link
Contributor

phsauter commented Aug 4, 2023

I will be opening a new PR, in the last user-meeting they said they prefer more small PRs since they are easier to review, exactly what you are saying as well.

Will you try to use $bmux/$demux now or are you still going for pure $shift/shiftx transformations for this PR?

I guess the second approach is probably cleaner for now, the other one should probably also be added as a separate PR.
So that in the end we have a flow that does shiftadd, then shiftbmux (if W2/W ratio is favorable) and then finally shiftmul that can take the remaining ones.

@povik
Copy link
Member Author

povik commented Aug 4, 2023

Will you try to use $bmux/$demux now or are you still going for pure $shift/shiftx transformations for this PR?

I guess the second approach is probably cleaner for now, the other one should probably also be added as a separate PR.

Yeah, I think I will make the $bmux/$demux change into a separate PR.

So that in the end we have a flow that does shiftadd, then shiftbmux (if W2/W ratio is favorable) and then finally shiftmul that can take the remaining ones.

Agreed as a prospect if what you mean by shiftmul here is the transformation to produce W independent shifters for high W2/W, though I would rather name it differently -- call shiftmul the transformation producing bmuxes (for continuity), and name a separate rule producing the W independent shifters. If what you meant by shiftmul here is the original shiftmul transformation (extended to W2>W), then I don't think we want that at the same time as the bmux transformation, since the cost of the circuitry they produce is essentially the same (once you prune the mux network in the implementation of the shifter).

@povik
Copy link
Member Author

povik commented Aug 4, 2023

Err, I wrote something bit confusing earlier...

One big exception is once W2 is larger than W because for the bmux block_size=max(W2,W), while for the shifter it doesn't really change anything.

Agreed on shift having a lower cost (in terms of elementary muxes) for the case W2>W.

I was thinking the untransformed shifter has a lower cost than the block-mux, but that is without accounting for the cost of the $mul... Once you shiftmul-transform the shifter to remove the $mul then the cost of the shifter and block-mux should be the same.

@povik
Copy link
Member Author

povik commented Aug 4, 2023

So as I see it we have three options we need to decide between when stumbling on a shiftmul pattern: Transform to $bmux, transform to W independent shifters, do nothing.

@phsauter
Copy link
Contributor

phsauter commented Aug 4, 2023

Your naming scheme is fine as well, I just think its important to keep the original (ie shift+mul -> shift) to make sure people can decide not to use the shift+mul -> bmux transformation if that makes any trouble for them (some potential later pass that relies on the $shiftx cells specifically.

So as I see it we have three options we need to decide between when stumbling on a shiftmul pattern: Transform to $bmux, transform to W independent shifters, do nothing.

Yes exactly, except that we don't really decide. Each option is just a separate peepopt that checks for the respective W2/W ratio.

filter !port(shift, \B).empty()
endmatch

match neg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match neg
match neg
select neg->type == $neg

Without this it tries to access port \Y on any cell, no matter the type, this gives an error as not all cells have a \Y port.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! Of course

} else {
// case of `$shl`
shift_amount = port(shift, \B);
if (!param(neg, \B_SIGNED).as_bool())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!param(neg, \B_SIGNED).as_bool())
if (!param(shift, \B_SIGNED).as_bool())

Likely a typo. If neg doesn't exist we also can't access any parameters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this cleans up the removal of the dffmux pattern in a0e99a9
Originally added in b8774ae

The changes (removals) concerning init look fine to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, can add that to the commit message

if (genmode == "shiftmul")
GENERATE_PATTERN(peepopt_pm, shiftmul);
GENERATE_PATTERN(peepopt_pm, shiftmul_right);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
GENERATE_PATTERN(peepopt_pm, shiftmul_right);
log_error("Abort in %s:%d: peepopt-shiftmul does not implement a generate option\n", __FILE__, __LINE__)

Doesn't this require a generate block in the pattern to actually work?

Running it ends with:

ERROR: pmgen generator is stuck: 10000 iterations with no matching module generated.

I think the most relevant PR is #1298.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, I never looked into the generate stuff. Then maybe we should remove the generate flag altogether, I don't think muldiv has any generate blocks either.

@phsauter
Copy link
Contributor

phsauter commented Aug 9, 2023

The two actual code changes should make it pass the current tests.
The generate change is more a suggestion that also applies to muldiv.

Personally I also think we should add log_header() for all pmg-pases that describe the transformation in 1-2 sentences.

@povik
Copy link
Member Author

povik commented Aug 9, 2023

Thanks. I think we should mainly expand the help message, not sure if the patterns each warrant their log_header

@phsauter
Copy link
Contributor

phsauter commented Aug 9, 2023

Thanks. I think we should mainly expand the help message, not sure if the patterns each warrant their log_header

yeah that makes more sense, help it is!

Drop code that was once used by the 'dffmux' pattern but now is unused
after that pattern has been obsoleted by the 'opt_dff' pass.
No functional change intended.
The `opt_expr` pass running before `peepopt` can interfere with the
detection of a shiftmul pattern due to some of the bottom bits of the
shift amount being replaced with constant zero. Extend the detection to
cover those situations as well.
Add a separate shiftmul pattern to match on left shifts which implement
demuxing. This mirrors the right shift pattern matcher but is probably
best kept separate instead of merging the two into a single matcher.
In any case the diff of the two matchers should be easily readable.
@povik
Copy link
Member Author

povik commented Oct 16, 2023

Rebased, squashed the fixup into its parent commit, last three commits are new.

@nakengelhardt nakengelhardt merged commit edee11b into YosysHQ:master Oct 16, 2023
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants