
Add support for fused convolutions #2294

Merged
merged 58 commits into from
Oct 11, 2021

Conversation

arrufat
Contributor

@arrufat arrufat commented Feb 2, 2021

I've been playing a bit with the idea of having fused convolutions (convolution + batch_norm) in dlib.
I think the first step would be to move all the operations that are done by the affine_ layer into the convolution, that is, update the bias of the convolution and re-scale the filters.

This PR adds some helper methods that allow doing this. The next step could be adding a new layer that can be constructed from an affine_ layer and is a no-op, like the tag layers, or adding a version of the affine_ layer that does nothing (just outputs its input, without copying or anything). How would you approach this?

Finally, here's an example that uses a visitor to update the convolutions that are below an affine layer.
It can be built by putting the file into the examples folder and loading the pretrained ResNet-50 from dnn_introduction3_ex.cpp. If we manage to make something interesting out of it, maybe it would be worth including this visitor, too.

#include "resnet.h"

#include <dlib/dnn.h>
#include <dlib/image_io.h>

using namespace std;
using namespace dlib;

class visitor_fuse_convolutions
{
    public:
    template <typename T> void fuse_convolutions(T&) const
    {
        // do nothing for other layer types
    }

    // handle the standard case (convolutional layer followed by an affine layer)
    template <long nf, long nr, long nc, int sy, int sx, int py, int px, typename U, typename E>
    void fuse_convolutions(add_layer<affine_, add_layer<con_<nf, nr, nc, sy, sx, py, px>, U>, E>& l)
    {
        // get the parameters from the affine layer as alias_tensor_instance
        auto gamma = l.layer_details().get_gamma();
        auto beta = l.layer_details().get_beta();

        // get the convolution below the affine layer and its parameters
        auto& conv = l.subnet().layer_details();
        const long num_filters_out = conv.num_filters();
        const long num_rows = conv.nr();
        const long num_cols = conv.nc();
        tensor& params = conv.get_layer_params();
        // guess the number of input filters
        long num_filters_in;
        if (conv.bias_is_disabled())
            num_filters_in = params.size() / num_filters_out / num_rows / num_cols;
        else
            num_filters_in = (params.size() - num_filters_out) / num_filters_out / num_rows / num_cols;

        // set the new number of parameters for this convolution
        const size_t num_params = num_filters_in * num_filters_out * num_rows * num_cols + num_filters_out;
        alias_tensor filters(num_filters_out, num_filters_in, num_rows, num_cols);
        alias_tensor biases(1, num_filters_out);
        if (conv.bias_is_disabled())
        {
            conv.enable_bias();
            resizable_tensor new_params = params;
            new_params.set_size(num_params);
            biases(new_params, filters.size()) = 0;
            params = new_params;
        }

        // update the biases: the affine transform maps a bias b to gamma*b + beta,
        // so scale any existing bias by gamma before adding beta
        auto b = biases(params, filters.size());
        const matrix<float> fused_bias = pointwise_multiply(mat(b), mat(gamma)) + mat(beta);
        b = fused_bias;

        // rescale the filters
        DLIB_CASSERT(filters.num_samples() == gamma.k());
        auto t = filters(params, 0);
        float* f = t.host();
        const float* g = gamma.host();
        for (long n = 0; n < filters.num_samples(); ++n)
        {
            for (long k = 0; k < filters.k(); ++k)
            {
                for (long r = 0; r < filters.nr(); ++r)
                {
                    for (long c = 0; c < filters.nc(); ++c)
                    {
                        f[tensor_index(t, n, k, r, c)] *= g[n];
                    }
                }
            }
        }

        // reset the affine layer
        gamma = 1;
        beta = 0;
    }

    template <typename input_layer_type>
    void operator()(size_t , input_layer_type& l) const
    {
        // ignore other layers
    }

    template <typename T, typename U, typename E>
    void operator()(size_t , add_layer<T, U, E>& l)
    {
        fuse_convolutions(l);
    }
};

int main(const int argc, const char** argv)
try
{
    resnet::infer_50 net1, net2;
    std::vector<std::string> labels;
    deserialize("resnet50_1000_imagenet_classifier.dnn") >> net1 >> labels;
    net2 = net1;
    matrix<rgb_pixel> image;
    load_image(image, "elephant.jpg");

    const auto& label1 = labels[net1(image)];
    const auto& out1 = net1.subnet().get_output();
    resizable_tensor probs(out1);
    tt::softmax(probs, out1);
    cout << "pred1: " << label1 << " (" << max(mat(probs)) << ")" << endl;


    // fuse the convolutions in the network
    dlib::visit_layers_backwards(net2, visitor_fuse_convolutions());
    const auto& label2 = labels[net2(image)];
    const auto& out2 = net2.subnet().get_output();
    tt::softmax(probs, out2);
    cout << "pred2: " << label2 << " (" << max(mat(probs)) << ")" << endl;

    cout << "max abs difference: " << max(abs(mat(out1) - mat(out2))) << endl;
    DLIB_CASSERT(max(abs(mat(out1) - mat(out2))) < 1e-2);
}
catch (const exception& e)
{
    cout << e.what() << endl;
    return EXIT_FAILURE;
}

output with this image (elephant.jpg):

pred1: African_elephant (0.962677)
pred2: African_elephant (0.962623)
max abs difference: 0.00436211

UPDATE: make the visitor more generic and show results with a real image

@arrufat arrufat mentioned this pull request Feb 3, 2021
@pfeatherstone
Contributor

This looks good. Does it speed up performance? CPU and GPU?

@arrufat
Contributor Author

arrufat commented Feb 3, 2021

I didn't benchmark anything, but I guess it will be the same as without "fusing" the convolutions, since we are still calling the affine_ layer. Now we need a layer that can be constructed from an affine_ layer and is a no-op. Then the process could look like:

  1. train with network defined with bn_con_
  2. assign to the same network, but defined with affine_
  3. run the visitor_fuse_convolutions
  4. assign the modified network to another one defined with a no-op layer instead of affine_

If we could add a method like .disable() to the affine_ layer and make it a no-op, that would be great, since step 4 could be done in the visitor.

@pfeatherstone
Contributor

If we could add a method like .disable() to the affine_ layer and make it a no-op, that would be great, since step 4 could be done in the visitor.

I would do that. Just need to add a boolean flag to the affine_ layer. Need to make that flag serializable and bump the version of the layer when serializing while still supporting the old version (i.e. without the boolean flag) when deserializing. That's the easiest thing to do from a user perspective. They don't have to create a 3rd type of network and they don't need to care about how deserialization works.

@arrufat
Contributor Author

arrufat commented Feb 3, 2021

Yes, if that is possible, me too :)

@davisking
Owner

Yeah, I would go with a .disable() on the affine layer :)

@arrufat
Contributor Author

arrufat commented Feb 3, 2021

I've updated the affine_ layer to support the disabled mode.
In that example, I ran the image through both networks 1000 times, and these are the average timings:

layer     inference  fps  net size     VRAM
affine    4.115 ms   243  86.9994 MiB  594 MiB
disabled  3.829 ms   261  86.8261 MiB  588 MiB

So, some improvements that we can get for free :)

@pfeatherstone
Contributor

So 7% performance gain depending on which way you look at it. Nice work!

@pfeatherstone
Contributor

pfeatherstone commented Feb 3, 2021

Darknet FPS / dlib FPS ~ 70%, so it must be something else that's holding dlib back. But this is definitely a step closer.

@arrufat
Contributor Author

arrufat commented Feb 3, 2021

ResNet-50 has only 49 batch normalizations (YOLOv3 has 75 and YOLOv4 has 109). So I am expecting to gain a bit more on those models :)

@arrufat
Contributor Author

arrufat commented Feb 3, 2021

@davisking when you have time, let us know what you think about this PR, notably the naming of functions, etc.

Also, maybe that visitor could be more generic and accept the case where the batch norm has an fc layer as its input... but I am not sure it's worth it. In that case, the visitor should be named differently.

Ah, I put the visitor there because it depends directly on the affine_ layer (I tried to put it into the utilities header, but that didn't work).

@arrufat arrufat changed the title add helper methods to implement fused convolutions Add support for fused convolutions Feb 3, 2021
@pfeatherstone
Contributor

pfeatherstone commented Feb 3, 2021

It does make you wonder if there is any point in the affine_ layer anymore, unless people are using it during training too. My guess is they are using it for converting batch norm layers, in which case this visitor should always be used. So maybe the visitor should somehow be implicitly called when inferring with networks. Hmm.

@arrufat
Contributor Author

arrufat commented Feb 4, 2021

From the affine_ layer documentation, this layer does nothing (identity transform) unless initialized with a bn_ layer. And it has no trainable parameters, so there's no point in using it for training.

I guess it should be disabled by default, and then enabled once it's initialized with a bn_ layer.
That way, when users print the net, they will see "disabled" if the affine_ hasn't been initialized properly. Thoughts on this?

@arrufat
Contributor Author

arrufat commented Feb 4, 2021

After fixing the segfault, I now get this error when trying to do inference with a fused network that had bias disabled in its convolutions:

Error detected at line 258.
Error detected in file ../external/dlib/dlib/cuda/cudnn_dlibapi.cpp.
Error detected in function void dlib::cuda::add(float, dlib::tensor&, float, const dlib::tensor&).

Failing expression was (have_same_dimensions(src, dest) || (src.num_samples()==1 && src.k()==dest.k() && src.nr()==1 && src.nc()==1) || (src.num_samples()==1 && src.k()==dest.k() && src.nr()==dest.nr() && src.nc()==dest.nc()) || (src.num_samples()==1 && src.k()==1 && src.nr()==dest.nr() && src.nc()==dest.nc()) || (src.num_samples()==dest.num_samples() && src.k()==1 && src.nr()==1 && src.nc()==1)) && is_same_object(src,dest) == false.

         dest.num_samples(): 1
         dest.k():           64
         dest.nr():          112
         dest.nc():          112
         src.num_samples():  0
         src.k():            0
         src.nr():           0
         src.nc():           0

So I am guessing there are some other parameters that should be modified elsewhere in the network... I'll keep digging.

EDIT: I found the problem: the biases alias tensor was only set up if the bias was enabled, so I have to set it up when enabling the bias. Either that, or set it up no matter what.

@davisking
Owner

Sweet. Can't look today. Might look tomorrow. Not sure.

@arrufat
Contributor Author

arrufat commented Sep 5, 2021

No hurries :) Thank you!

@arrufat
Contributor Author

arrufat commented Sep 6, 2021

Update: I tried fusing the weights of another network (VoVNet-based) I had already trained, which already had bias disabled in its convolutions.
The inference speed increased by around 7% as expected, and the VRAM usage decreased as well, in contrast to the YOLOv3 model... I don't know what you're going to do with this information, but here it is anyway :)

@davisking
Owner

From the affine_ layer documentation, this layer does nothing (identity transform) unless initialized with a bn_ layer. And it has no trainable parameters, so there's no point in using it for training.

I guess it should be disabled by default, and then enabled once it's initialized with a bn_ layer.
That way, when the user couts the net, it will see disabled if the affine_ hasn't been initialized properly. Thoughts on this?

Yeah, it defaults to an identity transform. So nothing wrong with having it default to not actually doing anything until it gets assigned. Not sure how I missed this comment and didn't reply until now :/

@davisking
Owner

I'm severely tardy in looking at this PR too :( Should I look now or are you editing it again?

@arrufat
Contributor Author

arrufat commented Sep 25, 2021

There's no hurry, don't feel any pressure :)
You can have a look now, I think it's ready (it has worked on all the networks I've tried).

@arrufat
Contributor Author

arrufat commented Sep 25, 2021

The functionality is already there, but I should probably add some tests for networks that have convolutions with and without biases.
I mean, it works, but there are no tests.
I should also probably update an example program (of your choosing) to showcase this new functionality. Having the extra 7% speedup is quite nice :)

Owner

@davisking davisking left a comment


Yeah this looks cool. I left some comments. Totally add some tests too :)

@arrufat
Contributor Author

arrufat commented Sep 26, 2021

If the tests pass, then it's ready :)

Owner

@davisking davisking left a comment


This is nice :)

@davisking davisking merged commit adca747 into davisking:master Oct 11, 2021
@facug91
Contributor

facug91 commented Aug 26, 2022

I was having a problem on master when running the same example code I used in #2655, using the Docker image nvidia/cuda:10.2-cudnn7-devel-ubuntu18.04. When I executed the code, I got the following exception:

Error while calling cudnnConvolutionBiasActivationForward( context(), &alpha1, descriptor(data), data.device(), (const cudnnFilterDescriptor_t)filter_handle, filters.device(), (const cudnnConvolutionDescriptor_t)conv_handle, (cudnnConvolutionFwdAlgo_t)forward_algo, forward_workspace, forward_workspace_size_in_bytes, &alpha2, out_desc, out, descriptor(biases), biases.device(), identity_activation_descriptor(), out_desc, out) in file /home/user/dlib/dlib/cuda/cudnn_dlibapi.cpp:1219. code: 9, reason: CUDNN_STATUS_NOT_SUPPORTED

I found that this specific commit is the one causing the problem; testing with the commit before this one, it works.

@arrufat
Contributor Author

arrufat commented Aug 26, 2022

Oh, that's odd, I've been using this a lot without issues. Can you give more details? Maybe it's related to the CUDA version? I've only ever tried it with CUDA 11, I think.

Edit: I just saw the issue.

@facug91
Contributor

facug91 commented Aug 26, 2022

I should read the code, but it seems to me that the problem might be with cuDNN 7.
