Extend the `peepopt` shiftmul matcher #3873

povik · 2023-08-02T16:15:44Z

This extends the shiftmul matcher in two useful ways:

It deals with cases where the shift amount is zero-padded from below because opt_expr ran before and applied opt.opt_expr.mul_low_zeros.
It can deal with left shifts implementing a demuxer (through the addition of a mirrored matcher).

Related to #3833.

povik · 2023-08-03T10:03:47Z

Force pushed. There was a confusion in the commit message about whether it's the left or right shift matcher we are adding.

phsauter · 2023-08-03T11:49:31Z

As we already started to discuss in #3833, this currently only handles: foo[s*W+:W], where W is a constant.

Since yesterday I went through different scenarios/cases I could see happening and thought about how to handle them.
For this I will generally refer to foo as an array of multi-bit elements as this is probably the most common origin of such a construct.

foo[s*W+:W1] where W1<W. This is the case when only a few bits are selected from some array of elements or similar. This case is trivial as one can just add more padding in-between output bits but currently It seems to be that shift_amount must exactly match the shifter output-width for it to then apply the transformation. Edit: This is already covered, I just assumed it isn't because of the comment at the top and the following case isn't covered.
foo[s*W+:W2] where W2>W. My guess is that this case would not occur very often but it covers the case where one would like to select something from two consecutive elements from an array of elements. Personally I think this case should be noted but does not need to be handled right now. (the transformation here should however work basically the same except that we need to use clog2(W2) instead (size of shift-output).
foo[s*W+O+:W1] where O is a constant. This case describes the selection of a few bits from one element of the array. The offset gives the starting position of the bits. As I understand it, your proposal was to handle this separately in a shiftadd peepot, so that we then have two shifts, one being directly attached to a $mul, making it possible to use shiftmul. I think this is a good idea but it does have one minor caveat, lets again distinguish two options:
1. We place the constant-offset shift in-front of the soon-to-be shiftmul block. This always works but it has the slight disadvantage that multiple accesses to different bits from the same element, may now be converted into shiftmul blocks, each having a different offset at its input and we are going to rely on later optimization passes to optimize this for us. If it is caught before techmap I would say this is a non-problem. If it is only caught after, then we create a lot of $MUX cells for nothing, hurting performance and increasing the memory footprint.
2. We place the constant-offset shift after the shiftmul block. This would eliminate the minor caveat from (1) but it only works if W>(O+W1), which should be the case if the code and all previous transformations are reasonable but we cannot guarantee it. Meaning such a transformation could not be done independently and would have to be called as part of shiftmul, since we need to know W from the mul.
foo[s*W+O+:W1] where O=V*var. I do not expect this to occur often (if at all). This represents the case where first an element is selected from an array and then some starting point is selected, from which W1 bits are used. Such a scenario would likely also require a more holistic approach as we now essentially have two nested shifts and we would have to find out which is larger and perform that first. It is also possible that some earlier part might actually already split this into two shifts for us (though I doubt it). Again, I think this is uncommon enough to make a note of it but ignore it.

Tl;Dr: assuming foo[s*W+:W1], we should check which of W and W1 is larger and add padding accordingly, W=W1 is an arbitrary restriction.
Constant offsets can be generally dealt with by adding another shift in-front, then this pass can take care of the strided access. This may be suboptimal for performance.
More complex cases are possible but likely rare. They would likely require a monolithic approach.

povik · 2023-08-03T14:07:07Z

foo[s*W+:W1] where W1<W. This is the case when only a few bits are selected from some array of elements or similar. This case is trivial as one can just add more padding in-between output bits but currently It seems to be that shift_amount must exactly match the shifter output-width for it to then apply the transformation.

Not really, W1<W is readily supported with the transformation as-is. (shift_amount is the signal that's input to the shift cell and specifies the shift amount, and the pattern matcher looks for a $mul cell the output of which matches this signal, but that's unrelated to W1<W.)

foo[s*W+:W2] where W2>W. My guess is that this case would not occur very often but it covers the case where one would like to select something from two consecutive elements from an array of elements. Personally I think this case should be noted but does not need to be handled right now. (the transformation here should however work basically the same except that we need to use clog2(W2) instead.

Yeah, I am not sure how often this would come up in actual code but it's not inconceivable. Maybe what other RTL compilers do in such a case would be of useful input since existing codebases could be accustomed to that. In any case I guess there should be some heuristics comparing the magnitude of W2 and W to decide when the transformation is considered beneficial.

phsauter · 2023-08-03T14:30:15Z

You are correct in both cases (I already noticed the first one myself).

Other compilers are unlikely to use a concept like $shiftx to represent a block-select/mux, I would imagine they just use a block-select in the first case (in fact I know that at least one does exactly that).
In the W2>W (selection larger than a block) I would assume they just replicate the necessary wires into each group.
Meaning each group will be sized to W2 instead of W and filled with the wires going into the neighboring groups.

povik · 2023-08-03T14:36:16Z

foo[s*W+O+:W1] where O is a constant. This case describes the selection of a few bits from one element of the array. The offset gives the starting position of the bits. As I understand it, your proposal was to handle this separately in a shiftadd peepot, so that we then have two shifts, one being directly attached to a $mul, making it possible to use shiftmul. I think this is a good idea but it does have one minor caveat, lets again distinguish two options:

We place the constant-offset shift in-front of the soon-to-be shiftmul block. This always works but it has the slight disadvantage that multiple accesses to different bits from the same element, may now be converted into shiftmul blocks, each having a different offset at its input and we are going to rely on later optimization passes to optimize this for us. If it is caught before techmap I would say this is a non-problem. If it is only caught after, then we create a lot of $MUX cells for nothing, hurting performance and increasing the memory footprint.

I am not sure there's an issue there, and if there is overhead, maybe the situation can be improved by only leaving those signals connected to the shift cell that can actually be selected, and rounding up W1 (instead of W) to the next power-of-2. Though at that point maybe we should just be leaving $bmux behind, which didn't exist when the transformation was originally written.

povik · 2023-08-03T15:00:08Z

Other compilers are unlikely to use a concept like $shiftx to represent a block-select/mux, I would imagine they just use a block-select in the first case (in fact I know that at least one does exactly that).
In the W2>W (selection larger than a block) I would assume they just replicate the necessary wires into each group.
Meaning each group will be sized to W2 instead of W and filled with the wires going into the neighboring groups.

Well but then at some point if W2 is large and W is small you could be better off expressing this as W independent shifts, instead of one block select with a lot of replication in the input signal. That's what I am slightly curious about how other compilers handle.

phsauter · 2023-08-03T15:08:18Z

Well but then at some point if W2 is large and W is small you could be better off expressing this as W independent shifts, instead of one block select with a lot of replication in the input signal. That's what I am slightly curious about how other compilers handle.

Makes sense, I mentally didn't make W2 large enough :-)
I am going to try it today or tomorrow and will report back.

phsauter · 2023-08-03T16:02:26Z

Though at that point maybe we should just be leaving $bmux behind.

Personally I think there is a pretty strong case for transforming it to a $bmux anyway as it makes it way more obvious whats even happening and I think in most normal cases it should generate less MUX cells.

A block-mux halves the bits it works on in every stage, so its cost should be something like block_size*2*log2(#blocks).
A barrel-shifter on the other hand always operates on the same number of bits in each stage, so its cost is roughly data_size*log2(#blocks) (where data_size=block_size*#blocks). This isn't entirely accurate since it actually matter how compact the bits of the control signals are (hence why this optimization exists in the first place).

Anyway, one would expect a block-mux to be more efficient in most cases, especially the ones you are likely to encounter in reality.
One big exception is once W2 is larger than W because for the bmux block_size=max(W2,W), while for the shifter it doesn't really change anything.

Obviously and arbitrarily good MUX-tree optimizer will likely achieve identical results but the amount of cells you have to operate on matters, especially once you try to synthesize larger designs.

povik · 2023-08-03T17:06:54Z

A block-mux halves the bits it works on in every stage, so its cost should be something like block_size*2*log2(#blocks).

I don't think that's true since if each stage works on a different amount of bits, and that amount doubles going to earlier stages, the final cost must come down to something like block_size*#blocks*2 - 1 (for power-of-two #blocks). Once you prune the trivial and unused muxes in the implementation of the shift I think the cost will be the same.

One big exception is once W2 is larger than W because for the bmux block_size=max(W2,W), while for the shifter it doesn't really change anything.

Agreed on shift having a lower cost (in terms of elementary muxes) for the case W2>W.

Personally I think there is a pretty strong case for transforming it to a $bmux anyway...

I agree. Not so much because of any overhead consideration but for clarity and to make it easier for later passes to reason about what's going on in the circuit.

phsauter · 2023-08-03T17:36:47Z

By the way I am currently working on a shiftadd.pmg that should be able to deal with any $add and $sub cells.
Funnily enough dealing with foo[s*W-O+:W1] (note the minus) is actually easier since we can just prepend some fill-bits.

The more common foo[s*W+O+:W1] is a bit of a pain if you want to make it as generic as possible since the +O would make it so a previous shift makes the lowest O bits unreachable, this is fine in all reasonable cases but there is nothing stopping foo[s*W] from going into the negative just enough to still reach these now lost bits. So the transformation wouldn't be 100% the same anymore.

Currently I just limited it so that one value must be constant (O) and the other must be unsigned. This might not be ideal if it starts chaining multiple operations though but for now it should work. I still have some bugs though, meaning they are not quite functionally equivalent for all the test cases I am using.
I will push them as soon as they are, then you can decide if you want them in here as well.

phsauter · 2023-08-03T17:39:58Z

the final cost must come down to something like block_size*#blocks*2 - 1 (for power-of-two #blocks)

Yes you are correct, it even says that on the paper in front of me, I just didn't pay attention while typing it (well I didn't have the -1 but I ignored it on purpose since its just a constant).

povik · 2023-08-03T18:09:37Z

By the way I am currently working on a shiftadd.pmg...

Good to know!

Currently I just limited it so that one value must be constant (O) and the other must be unsigned.

Yeah, I don't see any other reasonable way to do it. wreduce will convert cell inputs to unsigned if it deduces that it's okay, so relying on the input being unsigned in your situation should not be too restrictive, hopefully.

I still have some bugs though, meaning they are not quite functionally equivalent for all the test cases I am using.
I will push them as soon as they are, then you can decide if you want them in here as well.

I think it's best if you open a separate PR (a draft one if you don't feel it's ready for merging), and then we can discuss the shiftadd rule there. I don't have merging powers here but those that have will be happy we split things.

phsauter · 2023-08-04T08:24:48Z

I will be opening a new PR, in the last user-meeting they said they prefer more small PRs since they are easier to review, exactly what you are saying as well.

Will you try to use $bmux/$demux now or are you still going for pure $shift/shiftx transformations for this PR?

I guess the second approach is probably cleaner for now, the other one should probably also be added as a separate PR.
So that in the end we have a flow that does shiftadd, then shiftbmux (if W2/W ratio is favorable) and then finally shiftmul that can take the remaining ones.

povik · 2023-08-04T08:56:23Z

Will you try to use $bmux/$demux now or are you still going for pure $shift/shiftx transformations for this PR?

I guess the second approach is probably cleaner for now, the other one should probably also be added as a separate PR.

Yeah, I think I will make the $bmux/$demux change into a separate PR.

So that in the end we have a flow that does shiftadd, then shiftbmux (if W2/W ratio is favorable) and then finally shiftmul that can take the remaining ones.

Agreed as a prospect if what you mean by shiftmul here is the transformation to produce W independent shifters for high W2/W, though I would rather name it differently -- call shiftmul the transformation producing bmuxes (for continuity), and name a separate rule producing the W independent shifters. If what you meant by shiftmul here is the original shiftmul transformation (extended to W2>W), then I don't think we want that at the same time as the bmux transformation, since the cost of the circuitry they produce is essentially the same (once you prune the mux network in the implementation of the shifter).

povik · 2023-08-04T09:09:21Z

Err, I wrote something bit confusing earlier...

One big exception is once W2 is larger than W because for the bmux block_size=max(W2,W), while for the shifter it doesn't really change anything.

Agreed on shift having a lower cost (in terms of elementary muxes) for the case W2>W.

I was thinking the untransformed shifter has a lower cost than the block-mux, but that is without accounting for the cost of the $mul... Once you shiftmul-transform the shifter to remove the $mul then the cost of the shifter and block-mux should be the same.

povik · 2023-08-04T09:14:03Z

So as I see it we have three options we need to decide between when stumbling on a shiftmul pattern: Transform to $bmux, transform to W independent shifters, do nothing.

phsauter · 2023-08-04T09:17:28Z

Your naming scheme is fine as well, I just think its important to keep the original (ie shift+mul -> shift) to make sure people can decide not to use the shift+mul -> bmux transformation if that makes any trouble for them (some potential later pass that relies on the $shiftx cells specifically.

So as I see it we have three options we need to decide between when stumbling on a shiftmul pattern: Transform to $bmux, transform to W independent shifters, do nothing.

Yes exactly, except that we don't really decide. Each option is just a separate peepopt that checks for the respective W2/W ratio.

phsauter · 2023-08-09T09:48:08Z

passes/pmgen/peepopt_shiftmul_left.pmg

+	filter !port(shift, \B).empty()
+endmatch
+
+match neg


Suggested change

match neg

match neg

select neg->type == $neg

Without this it tries to access port \Y on any cell, no matter the type, this gives an error as not all cells have a \Y port.

Ha! Of course

phsauter · 2023-08-09T09:49:12Z

passes/pmgen/peepopt_shiftmul_left.pmg

+	} else {
+		// case of `$shl`
+		shift_amount = port(shift, \B);
+		if (!param(neg, \B_SIGNED).as_bool())


Suggested change

if (!param(neg, \B_SIGNED).as_bool())

if (!param(shift, \B_SIGNED).as_bool())

Likely a typo. If neg doesn't exist we also can't access any parameters.

phsauter · 2023-08-09T10:15:15Z

passes/pmgen/peepopt.cc

It looks like this cleans up the removal of the dffmux pattern in a0e99a9
Originally added in b8774ae

The changes (removals) concerning init look fine to me.

Yes, can add that to the commit message

phsauter · 2023-08-09T10:20:24Z

passes/pmgen/peepopt.cc

 			if (genmode == "shiftmul")
-				GENERATE_PATTERN(peepopt_pm, shiftmul);
+				GENERATE_PATTERN(peepopt_pm, shiftmul_right);


Suggested change

GENERATE_PATTERN(peepopt_pm, shiftmul_right);

log_error("Abort in %s:%d: peepopt-shiftmul does not implement a generate option\n", __FILE__, __LINE__)

Doesn't this require a generate block in the pattern to actually work?

Running it ends with:

ERROR: pmgen generator is stuck: 10000 iterations with no matching module generated.

I think the most relevant PR is #1298.

I don't know, I never looked into the generate stuff. Then maybe we should remove the generate flag altogether, I don't think muldiv has any generate blocks either.

phsauter · 2023-08-09T10:24:15Z

The two actual code changes should make it pass the current tests.
The generate change is more a suggestion that also applies to muldiv.

Personally I also think we should add log_header() for all pmg-pases that describe the transformation in 1-2 sentences.

povik · 2023-08-09T10:32:08Z

Thanks. I think we should mainly expand the help message, not sure if the patterns each warrant their log_header

phsauter · 2023-08-09T17:46:04Z

Thanks. I think we should mainly expand the help message, not sure if the patterns each warrant their log_header

yeah that makes more sense, help it is!

Drop code that was once used by the 'dffmux' pattern but now is unused after that pattern has been obsoleted by the 'opt_dff' pass.

No functional change intended.

The `opt_expr` pass running before `peepopt` can interfere with the detection of a shiftmul pattern due to some of the bottom bits of the shift amount being replaced with constant zero. Extend the detection to cover those situations as well.

Add a separate shiftmul pattern to match on left shifts which implement demuxing. This mirrors the right shift pattern matcher but is probably best kept separate instead of merging the two into a single matcher. In any case the diff of the two matchers should be easily readable.

povik · 2023-10-16T12:18:53Z

Rebased, squashed the fixup into its parent commit, last three commits are new.

povik mentioned this pull request Aug 2, 2023

part-selects ($shift, $shiftx) create huge MUX-trees #3833

Closed

povik force-pushed the peepopt-work branch from 7e7917b to 0983083 Compare August 3, 2023 10:02

povik mentioned this pull request Aug 7, 2023

Produce $bmux cells in shiftmul transformation #3880

Draft

This was referenced Aug 8, 2023

Optimization of nowrshmsk #3875

Merged

peepopt: Add shiftadd pattern #3883

Merged

phsauter reviewed Aug 9, 2023

View reviewed changes

povik added 4 commits October 16, 2023 13:52

peepopt: Drop unused 'initbits' code

a0c3be3

Drop code that was once used by the 'dffmux' pattern but now is unused after that pattern has been obsoleted by the 'opt_dff' pass.

peepopt: Clean up 'shiftmul' a bit

bd8a81a

No functional change intended.

peepopt: Try to use original wires

dd1a8ae

povik force-pushed the peepopt-work branch from 75fb5ab to d4dfd35 Compare October 16, 2023 12:18

povik added 3 commits October 16, 2023 14:19

peepopt: Remove broken -generate option

5c0c825

peepopt: Describe rules in help message

660be4a

pmgen: Fix sample syntax

d6d1cc7

povik force-pushed the peepopt-work branch from d4dfd35 to d6d1cc7 Compare October 16, 2023 12:19

nakengelhardt merged commit edee11b into YosysHQ:master Oct 16, 2023
15 of 16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend the `peepopt` shiftmul matcher #3873

Extend the `peepopt` shiftmul matcher #3873

povik commented Aug 2, 2023

povik commented Aug 3, 2023

phsauter commented Aug 3, 2023 •

edited

Loading

povik commented Aug 3, 2023 •

edited

Loading

phsauter commented Aug 3, 2023

povik commented Aug 3, 2023

povik commented Aug 3, 2023

phsauter commented Aug 3, 2023

phsauter commented Aug 3, 2023

povik commented Aug 3, 2023

phsauter commented Aug 3, 2023

phsauter commented Aug 3, 2023

povik commented Aug 3, 2023

phsauter commented Aug 4, 2023

povik commented Aug 4, 2023 •

edited

Loading

povik commented Aug 4, 2023

povik commented Aug 4, 2023

phsauter commented Aug 4, 2023

phsauter Aug 9, 2023

povik Aug 9, 2023

phsauter Aug 9, 2023

povik Aug 9, 2023

phsauter Aug 9, 2023

povik Aug 9, 2023

phsauter Aug 9, 2023

povik Aug 9, 2023

phsauter commented Aug 9, 2023

povik commented Aug 9, 2023

phsauter commented Aug 9, 2023

povik commented Oct 16, 2023

	if (!param(neg, \B_SIGNED).as_bool())
	if (!param(shift, \B_SIGNED).as_bool())

	GENERATE_PATTERN(peepopt_pm, shiftmul_right);
	log_error("Abort in %s:%d: peepopt-shiftmul does not implement a generate option\n", __FILE__, __LINE__)

Extend the peepopt shiftmul matcher #3873

Extend the peepopt shiftmul matcher #3873

Conversation

povik commented Aug 2, 2023

povik commented Aug 3, 2023

phsauter commented Aug 3, 2023 • edited Loading

povik commented Aug 3, 2023 • edited Loading

phsauter commented Aug 3, 2023

povik commented Aug 3, 2023

povik commented Aug 3, 2023

phsauter commented Aug 3, 2023

phsauter commented Aug 3, 2023

povik commented Aug 3, 2023

phsauter commented Aug 3, 2023

phsauter commented Aug 3, 2023

povik commented Aug 3, 2023

phsauter commented Aug 4, 2023

povik commented Aug 4, 2023 • edited Loading

povik commented Aug 4, 2023

povik commented Aug 4, 2023

phsauter commented Aug 4, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phsauter commented Aug 9, 2023

povik commented Aug 9, 2023

phsauter commented Aug 9, 2023

povik commented Oct 16, 2023

Extend the `peepopt` shiftmul matcher #3873

Extend the `peepopt` shiftmul matcher #3873

phsauter commented Aug 3, 2023 •

edited

Loading

povik commented Aug 3, 2023 •

edited

Loading

povik commented Aug 4, 2023 •

edited

Loading