
Network compression #17

Closed
nhanvtran opened this issue Nov 11, 2017 · 22 comments

@nhanvtran (Contributor)

After low-magnitude weights are pruned from a network, how do we implement the mechanism for skipping them in the HLS translation and the resulting RTL?

@benjaminkreis (Member) commented Nov 30, 2017

I've checked in a simplified external example showing that setting a non-programmable, weight-like variable to zero does indeed get the corresponding DSP removed.
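For reference, the effect can be reproduced with a tiny standalone Vivado HLS snippet along these lines (a minimal sketch, not the checked-in example; names and values are made up):

```cpp
#include "ap_fixed.h"

typedef ap_fixed<18,8> weight_t;

// When a weight is a compile-time constant zero, HLS can constant-fold the
// corresponding multiply-accumulate, so no DSP is inferred for that term.
void tiny_mac(weight_t data[4], weight_t &acc) {
    #pragma HLS PIPELINE
    const weight_t w[4] = {0.5, 0, 0.25, 0};  // the two zero entries should cost no DSPs
    acc = 0;
    for (int i = 0; i < 4; i++) {
        acc += w[i] * data[i];
    }
}
```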

Here in HLS4ML, there is still something I'm trying to figure out. In the default keras-to-hls example (10 inputs -> 32-node hidden layer -> 1 output), I can set weights to zero in the first compute_layer with no problems. As expected, I see a reduction in DSP usage. However, if I set one or more weights to zero in the second compute_layer, vivado_hls aborts during synthesis with messages like

Global variable initializer type does not match global variable type!
[1 x i18]* @w2.V.1
Broken module found, compilation aborted!
Stack dump: 
0.  Running pass 'Function Pass Manager' on module '/home/kreis/sparse_tests/HLS4ML/keras-to-hls/my-hls-dir-test/myproject_prj/solution1/.autopilot/db/a.o.1.bc'.
/data/xilinx/Vivado_HLS/2016.4/bin/loader: line 179: 14762 Aborted                 (core dumped) "$RDI_PROG" "$@"

I've traced this down to this pragma: https://github.com/hls-fpga-machine-learning/HLS4ML/blob/master/nnet_utils/nnet_layer.h#L66. If I remove it, vivado_hls synthesizes without crashing (and smartly removes the corresponding DSP and the upstream ones not needed). @ejk43 or anyone else -- any ideas on why we can't partition the weights array in the second compute_layer when some of the weights are zero?

@ejk43 (Contributor) commented Nov 30, 2017

These partition pragmas go back to the days of partial unrolling, which fails without cyclic/block partitioning. We may be able to drop the partition directives as a workaround; the compiler may be smart enough to recognize that it needs to fully partition the array to achieve the target pipeline directives...
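To make the distinction concrete, here is a hedged sketch (not the hls4ml code; names and factors are made up) of why the directive existed and why it may now be redundant:

```cpp
#include "ap_fixed.h"
typedef ap_fixed<18,8> weight_t;
#define N_IN 32

// Partial-unroll era: the UNROLL factor only pays off if the weight array is
// cyclically (or block-) partitioned so each parallel copy of the loop body
// gets its own memory port.
void mac_partial_unroll(weight_t data[N_IN], weight_t weights[N_IN], weight_t &res) {
    #pragma HLS ARRAY_PARTITION variable=weights cyclic factor=4
    weight_t acc = 0;
    for (int i = 0; i < N_IN; i++) {
        #pragma HLS UNROLL factor=4
        acc += weights[i] * data[i];
    }
    res = acc;
}

// Fully pipelined style: with PIPELINE the loop is fully unrolled, and HLS will
// usually partition a small weight array on its own, which is why the explicit
// directive may be droppable.
void mac_pipelined(weight_t data[N_IN], weight_t weights[N_IN], weight_t &res) {
    #pragma HLS PIPELINE II=1
    weight_t acc = 0;
    for (int i = 0; i < N_IN; i++) {
        acc += weights[i] * data[i];
    }
    res = acc;
}
```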

Of course, that still does not address the root cause. I have no idea why you'd see that particular abort/crash in the HLS compiler. Do you have a branch/commit you can point to that shows the error?

@benjaminkreis (Member)

If you run this example in this branch, you will see the error: https://github.com/hls-fpga-machine-learning/HLS4ML/tree/compress/example-prjs/higgs-1layer (the only thing I changed w.r.t. the head is setting one weight to zero here).

I think removing the line is probably OK because the total resource usage (DSPs, FFs, and LUTs) and the latency are unchanged.

It's strange to me that it only happens in the second layer computation. I am going to try playing around with our three-layer example to see what I can learn about that.

@benjaminkreis (Member) commented Dec 1, 2017

I've been playing around with our three-hidden-layer example (27 in->64 node->32 node->32 node->2 out). With the weight-array partition pragma in place (i.e. no changes to the code), I can set a weight to zero in any layer computation without getting the crash. Setting a random selection of weights to zero gives the expected reduction in resource usage:

Default weights: 3471 DSPs
Random 10% of weights set to zero: 3124 DSPs
Random 20% of weights set to zero: 2723 DSPs
Random 50% of weights set to zero: 1737 DSPs

If I change the architecture to only one output (27 in->64 node->32 node->32 node->1 out) and set a weight to zero in the last layer computation, I get the crash just like I did for the one hidden layer example with one output.

One thing that makes "one weight=0 with one output" special is that the number of multipliers you should be able to remove is 1 + the number of nodes in the previous layer (the hidden node feeding that weight becomes entirely unused, so its own multipliers can go too; with more than one output, you can only remove the one multiplier). With this in mind, I also tried going back to the original 2-output architecture and setting the two weights connecting one node to the outputs to zero. This should remove 2 + the number of nodes in the previous layer multipliers. I thought this would crash, but it did not. It used 3447 DSPs, which is close to 3471-(2+32)=3437, though I'm not sure why it's not exact (perhaps because with ap_fixed<18,8> there isn't a one-to-one correspondence between multipliers and DSPs -- seems to be the case just based on the totals above!).

So as of right now it seems like there is something really special about one weight=0 with one output.

@benjaminkreis (Member)

I've done one more thing to investigate this.

@nhanvtran was wondering if the problem is specific to 2D arrays that could be 1D arrays, as is the case when the number of outputs equals one. To test that, I made a new function, compute_layer2, in this other (ugly) branch, with the weight array hardcoded inside. I tried making it 1D, and it didn't abort. But I also tried making it 2D here, and it didn't abort!! This is the version checked in now, which is really the same as the default code except that we use the hardcoded weights instead of the weights passed to the function. You can run this here.
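For readers following along, the hardcoded-weights variant is roughly this shape (a sketch with made-up sizes and values; compute_layer2 in the branch is the real thing):

```cpp
#include "ap_fixed.h"
typedef ap_fixed<18,8> weight_t;
#define N_IN  10
#define N_OUT 1

// Same structure as the default compute_layer, except the weights live in a
// local constant array instead of coming in as a function argument. Both the
// 1D and 2D forms of this synthesized fine with zero entries, which points at
// the argument passing rather than the array dimensionality.
void compute_layer2_sketch(weight_t data[N_IN], weight_t res[N_OUT]) {
    #pragma HLS PIPELINE
    static const weight_t weights[N_IN][N_OUT] = {
        {0.5}, {0}, {0.25}, {0.125}, {0}, {0.5}, {0.25}, {0}, {0.125}, {0.5}
    };
    for (int j = 0; j < N_OUT; j++) {
        weight_t acc = 0;
        for (int i = 0; i < N_IN; i++) {
            acc += weights[i][j] * data[i];
        }
        res[j] = acc;
    }
}
```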

My only attempt at going a little further was to create compute_layer3, which removes the template complication, but this also crashed like the default, so the problem is something else about passing the weights in.

So, still not sure what is going on here. But I'm thinking we can stop investing time into this, at least for now, if simply removing that pragma fixes everything.

@nhanvtran (Contributor, Author)

Do you just want to remove the pragma, or put in a safeguard that skips the pragma if Nin or Nout == 1?

@benjaminkreis (Member)

I have a preference for removing it altogether, because if we don't need it for Nout==1, why do we need it for Nout!=1?

@ejk43 (Contributor) commented Dec 2, 2017

Just something else to consider: I added this function_instantiate pragma last month (https://github.com/hls-fpga-machine-learning/HLS4ML/blob/master/nnet_utils/nnet_layer.h#L59), which supposedly tells HLS that the specified variables are constants and can be optimized on a per-instantiation basis.
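For context, that directive has this general form, applied to the constant arguments of the layer function (a paraphrase; the linked source line is authoritative):

```cpp
// Tell HLS to treat this argument as a constant that can be specialized (and
// optimized) separately for each instantiation of the function.
#pragma HLS function_instantiate variable=weights
```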

I'm not totally sure if this directive has any measurable impact, but I'm suspicious it could also be contributing to the crash you're seeing here. But taking out the unneeded partition directive seems like a fine solution.

@ejk43 (Contributor) commented Dec 3, 2017

I'm curious about another point here...

Default weights: 3471 DSPs
Random 10% of weights set to zero: 3124 DSPs
Random 20% of weights set to zero: 2723 DSPs
Random 50% of weights set to zero: 1737 DSPs

Did you happen to try using a resource reuse factor in these tests? I'm skeptical this will hit the full DSP reduction with resource reuse... it might only be able to optimize out a DSP if all multiplies along a certain path are 0?

@benjaminkreis (Member) commented Dec 3, 2017

Hi @ejk43, I think the function_instantiate pragma you refer to is okay. This crash happens whether or not it's there.

Regarding the DSP usage, I agree. I believe the optimal case would be if Vivado HLS puts the zeros in the same path. Then your resource usage would go from (number of multiplications)/(reuse factor) to (number of multiplications)/(reuse factor) - floor[(number of zeros)/(reuse factor)] (do you agree?).
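Restating that best case as a small helper (hypothetical, just the arithmetic from the sentence above, not project code):

```cpp
// Best case: HLS packs the zero weights onto the same time-multiplexed
// multipliers, so every group of 'reuse' zeros saves one DSP.
constexpr int best_case_dsps(int n_mult, int n_zeros, int reuse) {
    return (n_mult + reuse - 1) / reuse   // ceil(n_mult / reuse)
         - (n_zeros / reuse);             // floor(n_zeros / reuse)
}
// e.g. best_case_dsps(4864, 2432, 2) == 1216, the "optimal" number quoted below
```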

To test this, I tried reuse factors of 2 and 3 as well:

          Reuse=1  Reuse=2  Reuse=3
Default :    3471     2432     1620
10% zero:    3124     2432     1620
20% zero:    2723     2432     1620
50% zero:    1737     1622     1595

You can see a few things here:

  1. The number of multiplications is 27x64+64x32+32x32+32x2=4864, but for whatever reason it does not need this many DSPs for default weights and a reuse factor of 1. However, when you move to a reuse factor of 2, it uses (number of multiplications)/2=2432 DSPs.
  2. For a reuse factor of 2, you don't see a reduction in the number of DSPs until quite a large number of weights are set to zero. And when you do see the reduction, it is less than optimal (1622 instead of 1216). So it's not doing the fully optimal reduction in DSPs. Luckily, compression techniques can remove even 90% of the multiplications while maintaining performance.
  3. For a reuse factor of 3, you see a similar pattern but with a much smaller reduction.

@ejk43 (Contributor) commented Dec 4, 2017

A couple thoughts:

The number of multiplications is 27x64+64x32+32x32+32x2=4864, but for whatever reason it does not need this many DSPs for default weights and a reuse factor of 1

I'd have to guess there are some weights that end up being zero or a bitwise power of two, which can be easily optimized out.

For a reuse factor of 2, you don't see a reduction in number of DSPs until quite a large number of weights are set to zero. And when you see the reduction, it is less than optimal (1622 instead of 1216). So it's not doing the fully optimal reduction in DSPs

Okay, that's a good data point. Yes, I'd definitely like to see more resource reuse with zeroed weights. I agree with your calculation -- and HLS does not seem smart enough right now to achieve (multipliers - zeros)/(reuse factor). This may be something we have to work on if/when it's a priority.

@benjaminkreis (Member) commented Dec 5, 2017

Nhan and I discussed, and I tried another test. In this branch I added the number of weights equal to zero to the layer struct. Then in the layer computation, I limit the number of DSPs to (multipliers - zeros)/(reuse factor) and see what happens.
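Roughly, the mechanism looks like this (a sketch under assumed names; the branch above has the actual change):

```cpp
#include "ap_fixed.h"

// Layer config carries the zero count alongside the existing parameters.
struct layer1_config {
    typedef ap_fixed<18,8> weight_t;
    static const unsigned n_in  = 27;
    static const unsigned n_out = 64;
    static const unsigned reuse_factor = 2;
    static const unsigned n_zeros = 0;   // filled in from the pruned weight file
};

// Inside the layer computation, cap the multiplier instances HLS may allocate
// at (multiplications - zeros) / reuse_factor.
template<class data_T, class res_T, typename CONFIG_T>
void compute_layer_sketch(data_T data[CONFIG_T::n_in],
                          res_T res[CONFIG_T::n_out],
                          typename CONFIG_T::weight_t weights[CONFIG_T::n_in][CONFIG_T::n_out]) {
    const int multiplier_limit =
        (CONFIG_T::n_in * CONFIG_T::n_out - CONFIG_T::n_zeros) / CONFIG_T::reuse_factor;
    #pragma HLS ALLOCATION instances=mul limit=multiplier_limit operation

    for (unsigned j = 0; j < CONFIG_T::n_out; j++) {
        typename CONFIG_T::weight_t acc = 0;
        for (unsigned i = 0; i < CONFIG_T::n_in; i++) {
            acc += weights[i][j] * data[i];
        }
        res[j] = acc;
    }
}
```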

For reuse factor = 2, I see:

Weights                   n_zeros                     Latency  Interval  DSP   FF     LUT
Random 20% of weights=0   not used                    59       2         2432  81955  79065
Random 20% of weights=0   used, assuming 20% zeros    60       2         1965  81841  89710
Default weights           used, assuming 20% zeros    62       6         1965  88194  123819

The first row is before the DSP usage limit accounts for the number of zeros. In the second row, I apply the limit. Unfortunately, in this case it adds one clock of latency and some more LUTs. As you can see in the third row, similar things happen if you try to squeeze the default weights through the reduced number of DSPs (edit: i.e., it is doing some multiplications with logic cells). It's actually a more drastic change, with the interval increasing to 6, so I'm not 100% sure what's going on, but at the least we can say this is not a silver bullet.

@nhanvtran (Contributor, Author)

what's the reuse_factor you set?

@benjaminkreis (Member)

2

@benjaminkreis (Member) commented Dec 15, 2017

For @nhanvtran, I've repeated the test above where I randomly change weights to zero (and don't change the max multiplier allocation), now with the 1-hidden-layer example. Here we have 10 inputs, a 32-node hidden layer, and 1 output, so the maximum number of multipliers is 10x32+32x1=352. In this test, I only randomly set weights in the first layer's computation to zero and leave the final 32 as is.

          Reuse=1  Reuse=2  Reuse=3
Default :     289      176      116
10% zero:     263      176      116
20% zero:     237      176      116
50% zero:     145      136      116

So with a reuse factor of 2 or 3 and up to 20% zero weights, we use the same number of DSPs as with default weights. In other words, HLS is not aligning the paths with zero weights here.

We start to see what we are hoping for when 50% of the weights are zero and the reuse factor is 2 (i.e. it is aligning paths with zero weights). But things are getting a bit funny at this point -- in another random set, I got 149 DSPs, which is actually more than with reuse=1. And the "interval" for reuse=3 is 6, not 3!

I think I will go even simpler and entirely remove the second layer in a test, so we are just looking at one layer computation. It would be nice to understand that first before looking at the interplay of layers that leads to the removal of downstream multiplications, etc.

@benjaminkreis (Member)

@nhanvtran in case you get to this before me, here is a small head start on looking at this for only one layer computation (i.e. no hidden layers). It's just the normal keras-to-hls.py with a break statement to stop after the first layer, and it also contains the random zeroing of the weights.

Don't confuse this with this branch, which is the one with the "n_zeros" number added to the layer config (though we might want to combine the two branches).

@nhanvtran (Contributor, Author)

So I'm a little confused; everything seems to be working for me, at least so far.

I have a branch here: nvt/sparse_nzero, where I have set 50% of the first layer's 10x32 weights to zero. Then I tried reuse factors of 1, 2, and 4. Here are the synthesis results for the first layer:

+-----------------+---------+-------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  |
+-----------------+---------+-------+--------+--------+
|reuse1           |        0|    140|    2893|    2516|
+-----------------+---------+-------+--------+--------+
|reuse2           |        0|     84|    3026|    3992|
+-----------------+---------+-------+--------+--------+
|reuse4           |        0|     40|    3116|    3480|
+-----------------+---------+-------+--------+--------+

Sometimes HLS figures out how to save a few more DSPs, but for the most part it scales as expected: roughly the 160 nonzero multiplications divided by the reuse factor (160, 80, 40).

@benjaminkreis (Member) commented Dec 19, 2017

Did you check the latency and interval?

@benjaminkreis (Member) commented Dec 20, 2017

In this branch I tried something slightly different where I:

  1. first loop through the weights to make a list of the required multiplications (here)
  2. then loop through the list and do the required multiplications (here)

The idea behind this was to perhaps help HLS "align the zeros" so that the number of multiplication "circuits" it needs is just the number of DSPs we allow it (with the n_zeros constraint). This would avoid doing extra multiplications in FFs and LUTs. The resource usage I get (see below) looks pretty similar to @nhanvtran's table above, perhaps suggesting that the extra FFs and LUTs are being used for routing instead of multiplication. Need to dig into this a little more... (A rough sketch of the two-pass structure follows the table below.)

For 50% of the weights set to zero:

+-----------------+---------+-------+--------+--------+--------+--------+
|       Name      | BRAM_18K| DSP48E|   FF   |   LUT  | Latency|Interval|
+-----------------+---------+-------+--------+--------+--------+--------+
|Reuse=1          |        0|    132|    2772|    2451|       4|       1| 
+-----------------+---------+-------+--------+--------+--------+--------+
|Reuse=2          |        0|     78|    2882|    3829|       4|       2|
+-----------------+---------+-------+--------+--------+--------+--------+
|Reuse=4          |        0|     43|    3241|    3846|       6|       4|
+-----------------+---------+-------+--------+--------+--------+--------+
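A minimal sketch of the two-pass idea (assumed names and sizes; the branch linked above has the real version): first gather the nonzero weights and their indices, then loop over only those products.

```cpp
#include "ap_fixed.h"
typedef ap_fixed<18,8> weight_t;
#define N_IN      10
#define N_OUT     32
#define N_NONZERO 160   // known ahead of time for a fixed, pruned weight set

// Pass 1 records which (input, output) pairs actually need a multiply;
// pass 2 performs only those multiplies. The hope is that the number of
// multiplier "circuits" HLS needs is then N_NONZERO / reuse_factor rather
// than N_IN*N_OUT / reuse_factor.
void compute_layer_compressed_sketch(weight_t data[N_IN],
                                     weight_t weights[N_IN][N_OUT],
                                     weight_t res[N_OUT]) {
    int      in_idx[N_NONZERO];
    int      out_idx[N_NONZERO];
    weight_t w_nz[N_NONZERO];

    // Pass 1: build the list of required multiplications.
    int k = 0;
    for (int i = 0; i < N_IN; i++) {
        for (int j = 0; j < N_OUT; j++) {
            if (weights[i][j] != 0 && k < N_NONZERO) {
                in_idx[k] = i; out_idx[k] = j; w_nz[k] = weights[i][j];
                k++;
            }
        }
    }

    // Pass 2: do only the required multiplications.
    for (int j = 0; j < N_OUT; j++) res[j] = 0;
    for (int m = 0; m < N_NONZERO; m++) {
        res[out_idx[m]] += w_nz[m] * data[in_idx[m]];
    }
}
```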

@ejk43 (Contributor) commented Jan 3, 2018

Hey, I saw this a while ago but kept forgetting to respond:

The resource usage I get (see below) looks pretty similar to @nhanvtran's table above, perhaps suggesting that the extra FFs and LUTs are being used for routing instead of multiplication

This is pretty typical when reusing multipliers... The extra FF and LUT logic, I believe, is used to register intermediate values, since there are now extra storage requirements within the operations (higher latency, higher computation interval).

The reduced multipliers are definitely worth the effort though, since the computations are so multiplication-heavy.

@benjaminkreis (Member)

That's good to know! At one point I convinced myself that we were actually doing multiplications with logic, but I'm less sure of that now.

@nhanvtran (Contributor, Author)

It's working for now; let's close this.
