8 bit Winograd Convolution? #16

Open
manojrohit opened this issue Oct 10, 2018 · 5 comments

@manojrohit

Is it possible to implement Winograd convolution with 8-bit weights and activations? The intermediate transformations cause overflows, which result in a loss of accuracy for the overall CNN. Is anyone aware of research implementing Winograd in low-precision domains?
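
A minimal illustration of the overflow being described, assuming the standard F(2x2,3x3) data-transform matrix B^T: each transformed value is a sum or difference of up to four inputs, so int8 activations can produce transformed values that no longer fit in int8.

```python
import numpy as np

# Standard F(2,3) data-transform matrix B^T.
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]])

d = np.full((4, 4), 127, dtype=np.int8)      # worst-case int8 activation tile
v = B_T @ d.astype(np.int32) @ B_T.T         # exact 2-D transform in int32
print(v.max())                               # 508 = 4 * 127, outside the int8 range
```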

@andravin
Owner

There are a couple of ways to think about this. I will assume you are using 8-bit integers and not 8-bit floating point numbers.

For deployment, the network weights are constant, so the Winograd components can be computed offline in high precision, then quantized to 8-bits and stored. Because the Winograd components over-determine the raw weights, they actually contain more information than the raw weights.
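
As a rough sketch of that offline path (the per-filter symmetric scale below is an assumption, not a prescribed recipe), using the standard F(2x2,3x3) filter-transform matrix G:

```python
import numpy as np

# Standard F(2,3) filter-transform matrix G.
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])

def transform_and_quantize_weights(g):
    """g: (3, 3) float kernel -> (4, 4) int8 Winograd components plus their scale."""
    u = G @ g @ G.T                          # Winograd components in high precision
    scale = np.abs(u).max() / 127.0          # symmetric per-filter scale (assumption)
    u_q = np.clip(np.round(u / scale), -128, 127).astype(np.int8)
    return u_q, scale

g = np.random.randn(3, 3).astype(np.float32)
u_q, scale = transform_and_quantize_weights(g)
print(u_q.shape)                             # (4, 4): 16 stored values per 3x3 kernel
```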

The downside is that the Winograd components use more memory than the raw weights: the F(2x2, 3x3) filter transform expands the raw weights by a factor of 1.78X, and F(4x4, 3x3) by 4X.
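
Those factors follow from the tile sizes: F(m x m, 3x3) stores an (m+2) x (m+2) transformed tile per 3x3 kernel.

```python
# Worked check of the weight expansion factors quoted above.
print(4 * 4 / (3 * 3))   # F(2x2,3x3): 16 / 9  ~= 1.78
print(6 * 6 / (3 * 3))   # F(4x4,3x3): 36 / 9  = 4.0
```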

Typically 8-bit activations are computed using full-precision multiplication and 32-bit accumulation, so that there is no precision loss during the computation. Then the 32-bit results are quantized to 8-bits before the next stage of computation.
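
A minimal sketch of that pattern (the helper and its requantization scale are illustrative, not taken from any particular framework): int8 operands, exact products accumulated in int32, then a requantization back to int8.

```python
import numpy as np

def int8_matmul_requantize(a_q, b_q, scale_a, scale_b, scale_out):
    """a_q, b_q: int8 matrices; the scales map int8 codes back to real values."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)   # exact 32-bit accumulation
    real = acc * (scale_a * scale_b)                     # dequantize the accumulator
    return np.clip(np.round(real / scale_out), -128, 127).astype(np.int8)

a = np.random.randint(-128, 128, (4, 64), dtype=np.int8)
b = np.random.randint(-128, 128, (64, 4), dtype=np.int8)
c = int8_matmul_requantize(a, b, 0.02, 0.05, 0.5)
```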

So you could also apply the Winograd transform to 32-bit activations before quantizing to 8-bits. If you were to do this, you would probably fuse the multiplication, inverse Winograd transform, bias, activation, forward Winograd transform, and quantization stages into a single operation.
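
A rough sketch of that fused tail for F(2x2,3x3) (this is not any library's actual kernel; the scale handling is an assumption about how the bookkeeping could be done): given the int32 Winograd-domain accumulator for one tile, apply the inverse transform, bias, ReLU, and quantization. The forward transform for the next layer would then read overlapping 4x4 windows of these quantized activations.

```python
import numpy as np

# Standard F(2,3) inverse-transform matrix A^T.
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float32)

def fused_tile_epilogue(m_acc, scale_m, bias, scale_out):
    """m_acc: (4, 4) int32 Winograd-domain accumulator for one output tile.
    scale_m: combined weight/activation scale of the Winograd-domain products."""
    y = A_T @ (m_acc.astype(np.float32) * scale_m) @ A_T.T   # inverse transform -> (2, 2)
    y = np.maximum(y + bias, 0.0)                            # bias + ReLU
    return np.clip(np.round(y / scale_out), -128, 127).astype(np.int8)
```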

The downside of this approach is that activations are stored in the Winograd domain, which represents an expansion of the raw activations. The smaller the tile size, the bigger the expansion: F(2x2,3x3) expands raw activations by 4X, F(4x4,3x3) by 2.25X.
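
Again the factors come directly from the tile geometry: an (m+2) x (m+2) Winograd-domain tile produces only m x m spatial outputs.

```python
# Worked check of the activation expansion factors quoted above.
print(4 * 4 / (2 * 2))   # F(2x2,3x3): 16 / 4  = 4.0
print(6 * 6 / (4 * 4))   # F(4x4,3x3): 36 / 16 = 2.25
```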

Another possibility is to quantize the activations to even less than 8-bit precision, so that when you perform the Winograd transform, the result uses no more than 8-bits. This probably works well in some applications at least, as there are research results showing accurate classification using low-precision activations.
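
For a sense of the headroom involved (the roughly 5-bit figure below is just an illustration for F(2x2,3x3), assuming the standard B^T and nonnegative post-ReLU inputs): each transformed value combines at most four inputs with +/-1 coefficients, so inputs bounded by 31 keep the transformed tile within int8.

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]])

gain = int(np.abs(B_T).sum(axis=1).max()) ** 2   # 2 per 1-D pass -> 4 for the 2-D transform
print(gain)                                      # 4
print(gain * 31)                                 # 124 <= 127: ~5-bit inputs fit int8 after the transform
```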

Another possibility is to use a 1-D Winograd transform, call it F(2x1, 3x3) or F(4x1, 3x3). This effectively turns a 2-D direct convolution into a 1-D direct convolution nested inside of a 1-D Winograd transform. The arithmetic complexity reduction is smaller, but so are the precision loss and the activation and weight expansion. The computational intensity is also higher, because the multiplications can be computed as matrix multiplications nested inside of a 1-D direct convolution. Additionally, this might map to tensor-core-style arithmetic better than even 2-D direct convolution does. The 1-D Winograd transforms also have even better data locality than the 2-D Winograd transforms, which are in turn better than the large-tile FFT method.
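
A sketch of the nesting described above, for F(2x1,3x3) (plain float reference code, not a tuned int8 kernel; the standard F(2,3) matrices are assumed): a 1-D Winograd transform along the width, with a 3-tap direct convolution nested along the height.

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=np.float32)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=np.float32)

def winograd_f2x1_3x3(x, g):
    """x: (H, W) input, g: (3, 3) kernel -> valid 'convolution' (CNN-style correlation).
    Assumes W - 2 is even; a remainder column would need a direct fallback."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2), dtype=np.float32)
    u = g @ G.T                                  # each kernel row transformed: (3, 4)
    for y in range(H - 2):
        for x0 in range(0, W - 3, 2):            # tiles of 2 outputs along the width
            m = np.zeros(4, dtype=np.float32)
            for r in range(3):                   # nested 1-D direct convolution over rows
                m += u[r] * (B_T @ x[y + r, x0:x0 + 4])
            out[y, x0:x0 + 2] = A_T @ m
    return out
```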

As an aside, I would like to point out that the effect of locality on the minimum workspace size is missing from recent analyses of fast algorithms for convnets, even though our original publication exploited Winograd locality to fit the entire working set in the GPU's small shared memory space. Obviously small-tile convolutions make possible instruction schedules that have fewer cache misses than large-tile (FFT) convolution algorithms do.

I hope this gives you some ideas!

@BUG1989

BUG1989 commented Feb 27, 2019

@andravin @manojrohit
Thank you very much for your advice. I have implemented int8 Winograd F(2,3) on the ARM platform. In my implementation it is faster than fp32 Winograd F(6,3) in most cases, and it is in an open-source project : )
ncnn int8 pr
naive c code

@manojrohit
Author

Thank you for sharing the code @BUG1989. Any comments about accuracy degradation?

@BUG1989

BUG1989 commented Feb 27, 2019

@manojrohit
In my project, the int8 Winograd F(2,3) has the same accuracy as the original int8 conv3x3s1.

@andravin
Owner

andravin commented Feb 28, 2019

Another thing to try with int8 Winograd is to quantize each of the Winograd components separately.

This might be especially helpful when the input to the convolutional layer is the output of a ReLU activation. In that case, the input is nonnegative, so the Winograd component with input transform [0,1,1,0] is also nonnegative. The other Winograd components, with transforms [0,1,-1,0], [1,0,-1,0], and [0,-1,0,1], are signed with an expected mean of zero (you might have a sign flip in any of these components depending on how you compute the transform).

You probably capture an extra bit of dynamic range if you map the [0,1,1,0] components to an unsigned int8 with range [0,255] and the other components to a signed int8 with range [-128,127]. You just have to be careful to scale the components appropriately when performing the inverse transform.
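
A hypothetical sketch of that per-component scheme for the 1-D F(2,3) data transform (the scale handling is illustrative only; signs follow the B^T written below, which may be flipped relative to other conventions):

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=np.float32)

def quantize_components(d, scales):
    """d: (4,) nonnegative (post-ReLU) activations; scales: one step size per component."""
    v = B_T @ d
    q = np.empty(4, dtype=np.int32)
    q[1] = np.clip(np.round(v[1] / scales[1]), 0, 255)         # [0,1,1,0]: nonnegative -> uint8 range
    for i in (0, 2, 3):
        q[i] = np.clip(np.round(v[i] / scales[i]), -128, 127)  # signed components -> int8 range
    return q

# Each product then has to be rescaled by its own component scale (times the
# weight-component scale) before the A^T reduction, so the inverse transform
# sums consistently scaled values.
```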
