Network compression #17
I've checked in a simplified external example showing that setting a non-programmable, weight-like variable to zero does indeed get the DSP removed. Here in HLS4ML, there is still something I'm trying to figure out. In the default keras-to-hls example we have (10 inputs -> 32 node hidden layer -> 1 output), I can set weights to zero in the first compute_layer with no problems. As expected, I see a reduction in the DSP usage. However, if I set one or more weights to zero in the second compute_layer, vivado_hls aborts during synthesis with messages like
I've traced this down to this pragma: https://github.com/hls-fpga-machine-learning/HLS4ML/blob/master/nnet_utils/nnet_layer.h#L66. If I remove it, vivado_hls synthesizes without crashing (and smartly removes the corresponding DSP and the upstream ones not needed). @ejk43 or anyone else -- any ideas on why we can't partition the weights array in the second compute_layer when some of the weights are zero?
These partition pragmas go back to the days of partial unrolling, which fails without cyclic/block partitioning. It may be the case that we can drop the partition directives as a workaround, and the compiler may be smart enough to recognize that it needs to fully partition the array to achieve the target pipeline directives... Of course, that still does not address the root cause. I have no idea why you'd see that particular abort/crash in the HLS compiler. Do you have a branch/commit you can point to that shows the error?
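For reference, here is a minimal sketch of the kind of inner-product loop that sits below the directive in nnet_layer.h. This is illustrative, not the actual template code: the names, fixed sizes, and use of float instead of ap_fixed are all simplifications. The commented-out pragma marks where the partition directive under discussion would go; the point is that a weight pinned to zero turns the product into a constant that synthesis can prune.

```cpp
#include <cassert>

// Illustrative sizes (the real code templates these via a CONFIG struct)
const int N_IN = 4;
const int N_OUT = 2;

void compute_layer(const float data[N_IN],
                   const float weights[N_IN][N_OUT],
                   const float biases[N_OUT],
                   float res[N_OUT]) {
    // #pragma HLS ARRAY_PARTITION variable=weights ...  <-- the directive under discussion
    for (int jj = 0; jj < N_OUT; jj++) {
        float acc = biases[jj];
        for (int ii = 0; ii < N_IN; ii++) {
            // A weight fixed at zero makes this product a constant zero,
            // which the synthesizer can optimize away (removing a DSP).
            acc += data[ii] * weights[ii][jj];
        }
        res[jj] = acc;
    }
}
```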
If you run this example in this branch, you will see the error: https://github.com/hls-fpga-machine-learning/HLS4ML/tree/compress/example-prjs/higgs-1layer (the only thing I changed w.r.t. the head is setting one weight to zero here). I think removing the line is probably ok because the total resource usage for DSPs, FFs, and LUTs and the latency are unchanged. It's strange to me that it only happens in the second layer computation. I am going to try playing around with our three layer example to see what I can learn about that.
I've been playing around with our three hidden layer example (27 in -> 64 node -> 32 node -> 32 node -> 2 out). With the weight array partition pragma in (ie no changes to the code), I can set a weight to zero in any layer computation without getting the crash. Setting a random selection of weights to zero follows the expected resource usage:

Default weights: 3471 DSPs

If I change the architecture to only one output (27 in -> 64 node -> 32 node -> 32 node -> 1 out) and set a weight to zero in the last layer computation, I get the crash just like I did for the one hidden layer example with one output. One thing that makes "one weight = 0 with one output" special is that the number of multipliers you should be able to remove is 1 + the number of nodes in the previous layer (whereas if you have more than one output, you can only remove one multiplier).

With this in mind, I also tried going back to the original 2-output architecture and setting the two weights connecting one node to the outputs to zero. This should remove 2 + the number of nodes in the previous layer multipliers. I thought this would crash, but it did not. It used 3447 DSPs, which is close to 3471 - (2 + 32) = 3437, though I'm not sure why it's not exact (perhaps because with ap_fixed<18,8> there isn't a one-to-one correspondence between multipliers and DSPs -- that seems to be the case just based on the totals above!). So as of right now, it seems like there is something really special about one weight = 0 with one output.
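The counting argument above can be written down as a small helper. This is a hypothetical sanity-check function, not project code, and it assumes one multiplier per nonzero weight, which, as noted, is not exact for ap_fixed<18,8>: when all weights leaving one hidden node are zero, the node's output is unused, so its incoming products die too.

```cpp
#include <cassert>

// Upper bound on multipliers that become removable when n_zeroed_outgoing
// weights (all the weights leaving one hidden node) are set to zero:
// the zeroed products themselves, plus the n_prev upstream products that
// fed the now-dead node. Assumes one multiplier per nonzero weight.
int removable_multipliers(int n_prev, int n_zeroed_outgoing) {
    return n_zeroed_outgoing + n_prev;
}
```

For the 2-output case with a 32-node previous layer, this gives 2 + 32 = 34, matching the 3471 - (2 + 32) = 3437 estimate quoted above.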
I've done one more bit of investigation. @nhanvtran was wondering if the problem is specific to 2D arrays that could be 1D arrays, as is the case when the number of outputs equals one. To test that, I made a new function in this other, ugly branch called compute_layer2 where I hardcoded the weight array inside. I tried making it 1D, and it didn't abort. But I also tried making it 2D here, and it didn't abort!! That's the version checked in now, which is really the same as the default code except we use the hardcoded weights instead of the weights passed to the function. You can run this here. My only attempt at going a little farther was to create compute_layer3, which removes the template complication, but this also crashed like the default, so it's something else about passing the weights in. So, still not sure what is going on here. But I'm thinking we can stop investing time in this, at least for now, if simply removing that pragma fixes everything.
do you just want to remove the pragma or put in a safety that doesn't use the pragma if Nin or Nout == 1? |
I have a preference for removing it altogether, because if we don't need it for Nout==1, why do we need it for Nout!=1? |
Just something else to consider: I added this function_instantiate pragma last month (https://github.com/hls-fpga-machine-learning/HLS4ML/blob/master/nnet_utils/nnet_layer.h#L59), which supposedly tells HLS that the specified variables are constants and can be optimized on a per-instantiation basis. I'm not totally sure if this directive has any measurable impact, but I'm suspicious it could also be contributing to the crash you're seeing here. But taking out the unneeded partition directive seems like a fine solution. |
I'm curious about another point here...
Did you happen to try using a resource reuse factor in these tests? I'm skeptical this will hit the full DSP reduction with resource reuse... it might only be able to optimize out a DSP if all multiplies along a certain path are 0? |
Hi @ejk43, I think the function_instantiate pragma you refer to is okay; this crash happens whether or not it's there. Regarding the DSP usage, I agree. I believe the optimal case would be if Vivado HLS puts the zeros in the same path. Then your resource usage would go from (number of multiplications)/(reuse factor) to (number of multiplications)/(reuse factor) - floor[(number of zeros)/(reuse factor)] (do you agree?). To test this, I tried with reuse factors of 2 and 3. You can see a few things here:
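That best-case formula can be sketched as follows. This is a hypothetical helper, assuming HLS manages to schedule all the zero-weight products onto the same shared multiplier paths:

```cpp
#include <cassert>

// Best-case DSP count with resource reuse:
//   ceil(n_mult / reuse) - floor(n_zeros / reuse)
// i.e. the reused multiplier count, minus one instance for every full
// group of zero-weight products that can share a path.
int best_case_dsps(int n_mult, int n_zeros, int reuse) {
    int total = (n_mult + reuse - 1) / reuse;  // ceil division
    return total - n_zeros / reuse;            // integer division = floor
}
```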
A couple thoughts:
I'd have to guess there are some weights that end up being zero or a power of two, which can be easily optimized out.
Okay, that's a good data point. Yes I'd definitely like to see more resource reuse with zero-ed weights. I agree with your calculation-- and HLS does not seem smart enough now to achieve (multipliers - zeros)/(reuse factor). This may be something we have to work to hit if/when it's a priority |
Nhan and I discussed, and I tried another test. In this branch I added the number of weights equal to zero to the layer struct. Then in the layer computation, I limit the number of DSPs to (multipliers - zeros)/(reuse factor) and see what happens. For reuse factor = 2, I see:
The first row is before the DSP usage limit accounted for the number of zeros. In the second row, I use the limit. Unfortunately, in this case it has to add one clock of latency and a few more LUTs. As you can see in the third row, similar things happen if you try to squeeze the default weights through the reduced number of DSPs (Edit: ie it is doing the multiplication with logic cells). It's actually a more drastic change, with the interval increasing to 6, so I'm not 100% sure what's going on, but I think at the least we can say this is not a silver bullet.
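A sketch of what that limit computation could look like. The struct field and helper names here are hypothetical, not the actual branch code, and the ALLOCATION pragma is shown only as a comment since the exact directive syntax varies by Vivado HLS version:

```cpp
#include <cassert>

// Hypothetical layer config carrying the zero-weight count, mirroring the
// n_zeros field added to the layer struct in the test branch.
struct layer_config {
    static const int n_in = 10;
    static const int n_out = 32;
    static const int n_zeros = 20;
    static const int reuse_factor = 2;
};

// DSP cap: (multipliers - zeros) / (reuse factor)
template <class CONFIG>
int multiplier_limit() {
    return (CONFIG::n_in * CONFIG::n_out - CONFIG::n_zeros) / CONFIG::reuse_factor;
}

// Inside compute_layer this limit would then feed an allocation directive,
// something along the lines of:
//   #pragma HLS ALLOCATION instances=mul limit=... operation
```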
what's the reuse_factor you set? |
2 |
For @nhanvtran, I've repeated the test above where I randomly change weights to zero (and don't change the max multiplier allocation), now with the 1 hidden layer example. Here we have 10 inputs, a 32 node hidden layer, and 1 output, so the maximum number of multipliers is 10x32 + 32x1 = 352. In this test, I only randomly set weights in the first layer compute to zero, and leave the final 32 as is.
So with a reuse factor of 2 or 3 and up to 20% zero weights, we use the same number of DSPs as with default weights. In other words, HLS is not aligning the paths with zero weights here. We start to see what we are hoping for when 50% of the weights are zero and the reuse factor is 2 (ie aligning paths with zero weights). But things are getting a bit funny at this point -- in another random set, I got 149 DSPs, which is actually more than reuse=1. And the "interval" for reuse=3 is 6, not 3! I think I will go even simpler and entirely remove the second layer in a test, so we are just looking at one layer computation. It would be nice to understand that first before looking at the interplay of layers that leads to the removal of downstream multiplications, etc.
@nhanvtran in case you get to this before me, here is a small head start on looking at this for only one layer computation (ie no hidden layers). It's just the normal keras-to-hls.py with a break statement to stop after the first layer, and it also contains the random setting to zeros for the weights. Don't confuse this with this branch, which is the one with the "n_zeros" number added to the layer config (though we might want to combine the two branches). |
So I'm a little confused, everything seems to be working for me, at least so far. I have a branch here:
Sometimes HLS figures out how to save more DSP, but for the most part it scales as expected. |
Did you check the latency and interval? |
In this branch I tried something slightly different where I:
The idea behind this was to perhaps help HLS "align the zeros" so that the number of multiplication "circuits" it needs is just the number of DSPs we allow it (with the nzeros constraint). This would avoid doing extra multiplications in FFs and LUTs. The resource usage I get (see below) looks pretty similar to @nhanvtran's table above, perhaps suggesting that the extra FFs and LUTs are being used for routing instead of multiplication. Need to dig into this a little more... For 50% of the weights set to zero:
Hey, I saw this a while ago but kept forgetting to respond:
This is pretty typical when reusing multipliers... The extra FF and LUT logic, I believe, is used to register intermediate values, since there's now extra storage requirements within the operations (higher latency, higher computation interval). The reduced multipliers are definitely worth the effort though, since the computations are so multiplication-heavy. |
That's good to know! At one point I convinced myself that we were actually doing multiplications with logic, but I'm less sure of that now. |
It's working for now, let's close this. |
After low weights are removed from a network, how do we implement the mechanism for skipping those in the HLS translation and RTL?