
Network compression #17

Closed
nhanvtran opened this issue Nov 11, 2017 · 22 comments

@nhanvtran (Contributor)

After low-magnitude weights are pruned from a network, how do we implement the mechanism for skipping them in the HLS translation and the resulting RTL?

@benjaminkreis (Member) commented Nov 30, 2017

I've checked in a simplified external example showing that setting a non-programmable, weight-like variable to zero does indeed get the corresponding DSP removed.
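For reference, the effect can be reproduced with a tiny standalone Vivado HLS snippet along these lines (a minimal sketch, not the checked-in example; names and values are made up):

```cpp
#include "ap_fixed.h"

typedef ap_fixed<18,8> weight_t;

// When a weight is a compile-time constant zero, HLS can constant-fold the
// corresponding multiply-accumulate, so no DSP is inferred for that term.
void tiny_mac(weight_t data[4], weight_t &acc) {
    #pragma HLS PIPELINE
    const weight_t w[4] = {0.5, 0, 0.25, 0};  // the two zero entries should cost no DSPs
    acc = 0;
    for (int i = 0; i < 4; i++) {
        acc += w[i] * data[i];
    }
}
```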

Here in HLS4ML, there is still something I'm trying to figure out. In the default keras-to-hls example (10 inputs -> 32-node hidden layer -> 1 output), I can set weights to zero in the first compute_layer with no problems. As expected, I see a reduction in DSP usage. However, if I set one or more weights to zero in the second compute_layer, vivado_hls aborts during synthesis with messages like

Global variable initializer type does not match global variable type!
[1 x i18]* @w2.V.1
Broken module found, compilation aborted!
Stack dump: 
0.  Running pass 'Function Pass Manager' on module '/home/kreis/sparse_tests/HLS4ML/keras-to-hls/my-hls-dir-test/myproject_prj/solution1/.autopilot/db/a.o.1.bc'.
/data/xilinx/Vivado_HLS/2016.4/bin/loader: line 179: 14762 Aborted                 (core dumped) "$RDI_PROG" "$@"

I've traced this down to this pragma: https://github.com/hls-fpga-machine-learning/HLS4ML/blob/master/nnet_utils/nnet_layer.h#L66. If I remove it, vivado_hls synthesizes without crashing (and smartly removes the corresponding DSP and the upstream ones not needed). @ejk43 or anyone else -- any ideas on why we can't partition the weights array in the second compute_layer when some of the weights are zero?

@ejk43 (Contributor) commented Nov 30, 2017

These partition pragmas go back to the days of partial unrolling, which fails without cyclic/block partitioning. We may be able to drop the partition directives as a workaround; the compiler may be smart enough to recognize that it needs to fully partition the array to achieve the target pipeline directives...
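To make the distinction concrete, here is a hedged sketch (not the hls4ml code; names and factors are made up) of why the directive existed and why it may now be redundant:

```cpp
#include "ap_fixed.h"
typedef ap_fixed<18,8> weight_t;
#define N_IN 32

// Partial-unroll era: the UNROLL factor only pays off if the weight array is
// cyclically (or block-) partitioned so each parallel copy of the loop body
// gets its own memory port.
void mac_partial_unroll(weight_t data[N_IN], weight_t weights[N_IN], weight_t &res) {
    #pragma HLS ARRAY_PARTITION variable=weights cyclic factor=4
    weight_t acc = 0;
    for (int i = 0; i < N_IN; i++) {
        #pragma HLS UNROLL factor=4
        acc += weights[i] * data[i];
    }
    res = acc;
}

// Fully pipelined style: with PIPELINE the loop is fully unrolled, and HLS will
// usually partition a small weight array on its own, which is why the explicit
// directive may be droppable.
void mac_pipelined(weight_t data[N_IN], weight_t weights[N_IN], weight_t &res) {
    #pragma HLS PIPELINE II=1
    weight_t acc = 0;
    for (int i = 0; i < N_IN; i++) {
        acc += weights[i] * data[i];
    }
    res = acc;
}
```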

Of course, that still does not address the root cause. I have no idea why you'd see that particular abort/crash in the HLS compiler. Do you have a branch/commit you can point to that shows the error?

@benjaminkreis (Member)

If you run this example in this branch, you will see the error: https://github.com/hls-fpga-machine-learning/HLS4ML/tree/compress/example-prjs/higgs-1layer (the only thing I changed w.r.t. the head is setting one weight to zero here).

I think removing the line is probably OK because the total resource usage (DSPs, FFs, and LUTs) and the latency are unchanged.

It's strange to me that it only happens in the second layer computation. I am going to try playing around with our three-layer example to see what I can learn about that.

@benjaminkreis (Member) commented Dec 1, 2017

I've been playing around with our three-hidden-layer example (27 in->64 node->32 node->32 node->2 out). With the weight-array partition pragma in place (i.e. no changes to the code), I can set a weight to zero in any layer computation without getting the crash. Setting a random selection of weights to zero gives the expected reduction in resource usage:

Default weights: 3471 DSPs
Random 10% of weights set to zero: 3124 DSPs
Random 20% of weights set to zero: 2723 DSPs
Random 50% of weights set to zero: 1737 DSPs

If I change the architecture to only one output (27 in->64 node->32 node->32 node->1 out) and set a weight to zero in the last layer computation, I get the crash just like I did for the one hidden layer example with one output.

One thing that makes "one weight=0 with one output" special is that the number of multipliers you should be able to remove is 1 + the number of nodes in the previous layer (the hidden node feeding that weight becomes entirely unused, so its own multipliers can go too; with more than one output, you can only remove the one multiplier). With this in mind, I also tried going back to the original 2-output architecture and setting the two weights connecting one node to the outputs to zero. This should remove 2 + the number of nodes in the previous layer multipliers. I thought this would crash, but it did not. It used 3447 DSPs, which is close to 3471-(2+32)=3437, though I'm not sure why it's not exact (perhaps because with ap_fixed<18,8> there isn't a one-to-one correspondence between multipliers and DSPs -- seems to be the case just based on the totals above!).

So as of right now it seems like there is something really special about one weight=0 with one output.

@benjaminkreis (Member)

I've done one more thing to investigate this.

@nhanvtran was wondering if the problem is specific to 2D arrays that could be 1D arrays, as is the case when the number of outputs equals one. To test that, I made a new function, compute_layer2, in this other (ugly) branch, with the weight array hardcoded inside. I tried making it 1D, and it didn't abort. But I also tried making it 2D here, and it didn't abort!! This is the version checked in now, which is really the same as the default code except that we use the hardcoded weights instead of the weights passed to the function. You can run this here.
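For readers following along, the hardcoded-weights variant is roughly this shape (a sketch with made-up sizes and values; compute_layer2 in the branch is the real thing):

```cpp
#include "ap_fixed.h"
typedef ap_fixed<18,8> weight_t;
#define N_IN  10
#define N_OUT 1

// Same structure as the default compute_layer, except the weights live in a
// local constant array instead of coming in as a function argument. Both the
// 1D and 2D forms of this synthesized fine with zero entries, which points at
// the argument passing rather than the array dimensionality.
void compute_layer2_sketch(weight_t data[N_IN], weight_t res[N_OUT]) {
    #pragma HLS PIPELINE
    static const weight_t weights[N_IN][N_OUT] = {
        {0.5}, {0}, {0.25}, {0.125}, {0}, {0.5}, {0.25}, {0}, {0.125}, {0.5}
    };
    for (int j = 0; j < N_OUT; j++) {
        weight_t acc = 0;
        for (int i = 0; i < N_IN; i++) {
            acc += weights[i][j] * data[i];
        }
        res[j] = acc;
    }
}
```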

My only attempt at going a little further was to create compute_layer3, which removes the template complication, but this also crashed like the default, so the problem is something else about passing the weights in.

So, still not sure what is going on here. But I'm thinking we can stop investing time into this, at least for now, if simply removing that pragma fixes everything.

@nhanvtran (Contributor, Author)

Do you just want to remove the pragma, or put in a safeguard that skips the pragma if Nin or Nout == 1?

@benjaminkreis (Member)

I have a preference for removing it altogether, because if we don't need it for Nout==1, why do we need it for Nout!=1?

@ejk43 (Contributor) commented Dec 2, 2017

Just something else to consider: I added this function_instantiate pragma last month (https://github.com/hls-fpga-machine-learning/HLS4ML/blob/master/nnet_utils/nnet_layer.h#L59), which supposedly tells HLS that the specified variables are constants and can be optimized on a per-instantiation basis.
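For context, that directive has this general form, applied to the constant arguments of the layer function (a paraphrase; the linked source line is authoritative):

```cpp
// Tell HLS to treat this argument as a constant that can be specialized (and
// optimized) separately for each instantiation of the function.
#pragma HLS function_instantiate variable=weights
```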

I'm not totally sure if this directive has any measurable impact, but I'm suspicious it could also be contributing to the crash you're seeing here. But taking out the unneeded partition directive seems like a fine solution.

@ejk43 (Contributor) commented Dec 3, 2017

I'm curious about another point here...

Default weights: 3471 DSPs
Random 10% of weights set to zero: 3124 DSPs
Random 20% of weights set to zero: 2723 DSPs
Random 50% of weights set to zero: 1737 DSPs

Did you happen to try using a resource reuse factor in these tests? I'm skeptical this will hit the full DSP reduction with resource reuse... it might only be able to optimize out a DSP if all multiplies along a certain path are 0?

@benjaminkreis (Member) commented Dec 3, 2017

Hi @ejk43, I think the function_instantiate pragma you refer to is okay. This crash happens whether or not it's there.

Regarding the DSP usage, I agree. I believe the optimal case would be if Vivado HLS puts the zeros in the same path. Then your resource usage would go from (number of multiplications)/(reuse factor) to (number of multiplications)/(reuse factor) - floor[(number of zeros)/(reuse factor)] (do you agree?).
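Restating that best case as a small helper (hypothetical, just the arithmetic from the sentence above, not project code):

```cpp
// Best case: HLS packs the zero weights onto the same time-multiplexed
// multipliers, so every group of 'reuse' zeros saves one DSP.
constexpr int best_case_dsps(int n_mult, int n_zeros, int reuse) {
    return (n_mult + reuse - 1) / reuse   // ceil(n_mult / reuse)
         - (n_zeros / reuse);             // floor(n_zeros / reuse)
}
// e.g. best_case_dsps(4864, 2432, 2) == 1216, the "optimal" number quoted below
```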

To test this, I tried reuse factors of 2 and 3 as well:

          Reuse=1  Reuse=2  Reuse=3
Default :    3471     2432     1620
10% zero:    3124     2432     1620
20% zero:    2723     2432     1620
50% zero:    1737     1622     1595

You can see a few things here:

  1. The number of multiplications is 27x64+64x32+32x32+32x2=4864, but for whatever reason it does not need this many DSPs for default weights and a reuse factor of 1. However, when you move to a reuse factor of 2, it uses (number of multiplications)/2=2432 DSPs.
  2. For a reuse factor of 2, you don't see a reduction in the number of DSPs until quite a large number of weights are set to zero. And when you do see the reduction, it is less than optimal (1622 instead of 1216). So it's not doing the fully optimal reduction in DSPs. Luckily, compression techniques can remove even 90% of the multiplications while maintaining performance.
  3. For a reuse factor of 3, you see a similar pattern but with a much smaller reduction.

@ejk43 (Contributor) commented Dec 4, 2017

A couple thoughts:

The number of multiplications is 27x64+64x32+32x32+32x2=4864, but for whatever reason it does not need this many DSPs for default weights and a reuse factor of 1

I'd have to guess there are some weights that end up being zero or a bitwise power of two, which can be easily optimized out.

For a reuse factor of 2, you don't see a reduction in number of DSPs until quite a large number of weights are set to zero. And when you see the reduction, it is less than optimal (1622 instead of 1216). So it's not doing the fully optimal reduction in DSPs

Okay, that's a good data point. Yes, I'd definitely like to see more resource reuse with zeroed weights. I agree with your calculation -- and HLS does not seem smart enough right now to achieve (multipliers - zeros)/(reuse factor). This may be something we have to work on if/when it's a priority.

@benjaminkreis (Member) commented Dec 5, 2017

Nhan and I discussed, and I tried another test. In this branch I added the number of weights equal to zero to the layer struct. Then in the layer computation, I limit the number of DSPs to (multipliers - zeros)/(reuse factor) and see what happens.
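Roughly, the mechanism looks like this (a sketch under assumed names; the branch above has the actual change):

```cpp
#include "ap_fixed.h"

// Layer config carries the zero count alongside the existing parameters.
struct layer1_config {
    typedef ap_fixed<18,8> weight_t;
    static const unsigned n_in  = 27;
    static const unsigned n_out = 64;
    static const unsigned reuse_factor = 2;
    static const unsigned n_zeros = 0;   // filled in from the pruned weight file
};

// Inside the layer computation, cap the multiplier instances HLS may allocate
// at (multiplications - zeros) / reuse_factor.
template<class data_T, class res_T, typename CONFIG_T>
void compute_layer_sketch(data_T data[CONFIG_T::n_in],
                          res_T res[CONFIG_T::n_out],
                          typename CONFIG_T::weight_t weights[CONFIG_T::n_in][CONFIG_T::n_out]) {
    const int multiplier_limit =
        (CONFIG_T::n_in * CONFIG_T::n_out - CONFIG_T::n_zeros) / CONFIG_T::reuse_factor;
    #pragma HLS ALLOCATION instances=mul limit=multiplier_limit operation

    for (unsigned j = 0; j < CONFIG_T::n_out; j++) {
        typename CONFIG_T::weight_t acc = 0;
        for (unsigned i = 0; i < CONFIG_T::n_in; i++) {
            acc += weights[i][j] * data[i];
        }
        res[j] = acc;
    }
}
```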

For reuse factor = 2, I see:

Weights                   n_zeros                     Latency  Interval  DSP   FF     LUT
Random 20% of weights=0   not used                    59       2         2432  81955  79065
Random 20% of weights=0   used, assuming 20% zeros    60       2         1965  81841  89710
Default weights           used, assuming 20% zeros    62       6         1965  88194  123819

The first row is before the DSP usage limit accounts for the number of zeros. In the second row, I apply the limit. Unfortunately, in this case it adds one clock of latency and some more LUTs. As you can see in the third row, similar things happen if you try to squeeze the default weights through the reduced number of DSPs (edit: i.e., it is doing some multiplications with logic cells). It's actually a more drastic change, with the interval increasing to 6, so I'm not 100% sure what's going on, but at the least we can say this is not a silver bullet.

@nhanvtran (Contributor, Author)

what's the reuse_factor you set?

@benjaminkreis (Member)

2

@benjaminkreis (Member) commented Dec 15, 2017

For @nhanvtran, I've repeated the test above where I randomly change weights to zero (and don't change the max multiplier allocation), now with the 1-hidden-layer example. Here we have 10 inputs, a 32-node hidden layer, and 1 output, so the maximum number of multipliers is 10x32+32x1=352. In this test, I only randomly set weights in the first layer's computation to zero and leave the final 32 as is.

          Reuse=1  Reuse=2  Reuse=3
Default :     289      176      116
10% zero:     263      176      116
20% zero:     237      176      116
50% zero:     145      136      116

So with a reuse factor of 2 or 3 and up to 20% zero weights, we use the same number of DSPs as with default weights. In other words, HLS is not aligning the paths with zero weights here.

We start to see what we are hoping for when 50% of the weights are zero and the reuse factor is 2 (i.e. it is aligning paths with zero weights). But things are getting a bit funny at this point -- in another random set, I got 149 DSPs, which is actually more than with reuse=1. And the "interval" for reuse=3 is 6, not 3!

I think I will go even simpler and entirely remove the second layer in a test, so we are just looking at one layer computation. It would be nice to understand that first before looking at the interplay of layers that leads to the removal of downstream multiplications, etc.

@benjaminkreis (Member)

@nhanvtran in case you get to this before me, here is a small head start on looking at this for only one layer computation (i.e. no hidden layers). It's just the normal keras-to-hls.py with a break statement to stop after the first layer, and it also contains the random zeroing of the weights.

Don't confuse this with this branch, which is the one with the "n_zeros" number added to the layer config (though we might want to combine the two branches).

@nhanvtran (Contributor, Author)

So I'm a little confused; everything seems to be working for me, at least so far.

I have a branch here: nvt/sparse_nzero, where I have set 50% of the first layer's 10x32 weights to zero. Then I tried reuse factors of 1, 2, and 4. Here are the synthesis results for the first layer:

+-----------------+---------+-------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  |
+-----------------+---------+-------+--------+--------+
|reuse1           |        0|    140|    2893|    2516|
+-----------------+---------+-------+--------+--------+
|reuse2           |        0|     84|    3026|    3992|
+-----------------+---------+-------+--------+--------+
|reuse4           |        0|     40|    3116|    3480|
+-----------------+---------+-------+--------+--------+

Sometimes HLS figures out how to save a few more DSPs, but for the most part it scales as expected: roughly the 160 nonzero multiplications divided by the reuse factor (160, 80, 40).

@benjaminkreis (Member) commented Dec 19, 2017

Did you check the latency and interval?

@benjaminkreis (Member) commented Dec 20, 2017

In this branch I tried something slightly different where I:

  1. first loop through the weights to make a list of the required multiplications (here)
  2. then loop through the list and do the required multiplications (here)

The idea behind this was to perhaps help HLS "align the zeros" so that the number of multiplication "circuits" it needs is just the number of DSPs we allow it (with the n_zeros constraint). This would avoid doing extra multiplications in FFs and LUTs. The resource usage I get (see below) looks pretty similar to @nhanvtran's table above, perhaps suggesting that the extra FFs and LUTs are being used for routing instead of multiplication. Need to dig into this a little more... (A rough sketch of the two-pass structure follows the table below.)

For 50% of the weights set to zero:

+-----------------+---------+-------+--------+--------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  | Latency|Interval|
+-----------------+---------+-------+--------+--------+--------+--------+
|Reuse=1          |        0|    132|    2772|    2451|       4|       1| 
+-----------------+---------+-------+--------+--------+--------+--------+
|Reuse=2          |        0|     78|    2882|    3829|       4|       2|
+-----------------+---------+-------+--------+--------+--------+--------+
|Reuse=4          |        0|     43|    3241|    3846|       6|       4|
+-----------------+---------+-------+--------+--------+--------+--------+
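A minimal sketch of the two-pass idea (assumed names and sizes; the branch linked above has the real version): first gather the nonzero weights and their indices, then loop over only those products.

```cpp
#include "ap_fixed.h"
typedef ap_fixed<18,8> weight_t;
#define N_IN      10
#define N_OUT     32
#define N_NONZERO 160   // known ahead of time for a fixed, pruned weight set

// Pass 1 records which (input, output) pairs actually need a multiply;
// pass 2 performs only those multiplies. The hope is that the number of
// multiplier "circuits" HLS needs is then N_NONZERO / reuse_factor rather
// than N_IN*N_OUT / reuse_factor.
void compute_layer_compressed_sketch(weight_t data[N_IN],
                                     weight_t weights[N_IN][N_OUT],
                                     weight_t res[N_OUT]) {
    int      in_idx[N_NONZERO];
    int      out_idx[N_NONZERO];
    weight_t w_nz[N_NONZERO];

    // Pass 1: build the list of required multiplications.
    int k = 0;
    for (int i = 0; i < N_IN; i++) {
        for (int j = 0; j < N_OUT; j++) {
            if (weights[i][j] != 0 && k < N_NONZERO) {
                in_idx[k] = i; out_idx[k] = j; w_nz[k] = weights[i][j];
                k++;
            }
        }
    }

    // Pass 2: do only the required multiplications.
    for (int j = 0; j < N_OUT; j++) res[j] = 0;
    for (int m = 0; m < N_NONZERO; m++) {
        res[out_idx[m]] += w_nz[m] * data[in_idx[m]];
    }
}
```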

@ejk43 (Contributor) commented Jan 3, 2018

Hey, I saw this a while ago but kept forgetting to respond:

The resource usage I get (see below) looks pretty similar to @nhanvtran's table above, perhaps suggesting that the extra FFs and LUTs are being used for routing instead of multiplication

This is pretty typical when reusing multipliers... The extra FF and LUT logic, I believe, is used to register intermediate values, since there are now extra storage requirements within the operations (higher latency, higher computation interval).

The reduced multipliers are definitely worth the effort though, since the computations are so multiplication-heavy.

@benjaminkreis (Member)

That's good to know! At one point I convinced myself that we were actually doing multiplications with logic, but I'm less sure of that now.

@nhanvtran (Contributor, Author)

It's working for now; let's close this.
