
Feature suggestions #3

Open · 4 of 10 tasks
pfeatherstone opened this issue Jan 8, 2021 · 141 comments

Comments

@pfeatherstone
Contributor

pfeatherstone commented Jan 8, 2021

Just putting some suggestions out there. Maybe we could organize these into projects.

  • support Scaled-YOLOv4
  • support YOLOv5 models
  • add a loss layer for training YOLO models
  • create a fuse_conv_batchnorm visitor for faster inference
  • bipartite matching loss (Hungarian algorithm)
  • transformer (for a DETR-like model)
  • add GIoU, DIoU and CIoU options to loss_yolo
  • see what happens when pre-training the (darknet53) backbone with the Barlow Twins loss on unlabelled ImageNet, then fine-tuning the neck and heads with loss_yolo. Hopefully training the backbone with an unsupervised loss gives better features than one trained specifically for classification. Presumably, with the backbone frozen and therefore not requiring gradient computation or batch-normalisation statistics, this should also accelerate training and reduce VRAM usage, right?
  • add CutMix augmentation
  • enhance CPU performance with NNPACK or oneDNN
@arrufat
Member

arrufat commented Jan 8, 2021

Yes, I want to do all those things at some point :)

I've already started working on YOLO scaled models, in particular YOLOv4x-mish, which can be found here.

My idea is to make a generic template class for yolo models, where the template type is the yolo model itself, then put each one in a separate unit, so that we can link to them (this greatly improves compilation time).

I've also started working on the to_label() part for yolo models, but it's not ready yet (lack of time these days). I will definitely push it unless someone wants to work on it. My dream would be to have the training part working, as well.

I am not sure how we can improve the performance. In my tests, dlib performs faster than PyTorch for DenseNet, ResNet and VoVNet architectures and uses less memory (for small batch sizes, up to 4 or 8), but the tendency inverts for big batch sizes. If that is the case, dlib should be fast on single inference, so I am wondering if it's the post-processing (NMS, etc.) that drags the performance down... I want to test that at some point, as well.

@pfeatherstone
Contributor Author

When I was doing my tests, both the ONNX inference and the dlib inference were doing the NMS stuff. The NMS is practically instantaneous (I haven't properly measured it, though). I think it's something else that's causing the bottlenecks. But your benchmarks are very interesting, and not what I was expecting based on my tests with yolov3. Properly profiling this at some point will be useful.
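
A minimal sketch of the kind of micro-timing that could separate the forward pass from the post-processing (run_forward() and run_nms() are placeholder names, not real APIs; with CUDA in the loop the GPU work also needs to be synchronised, e.g. via CUDA_LAUNCH_BLOCKING=1 as mentioned below, for the numbers to mean anything):

#include <chrono>
#include <cstdio>

// Time an arbitrary callable and return the average milliseconds per call.
template <typename F>
double time_ms(F&& f, int iters = 100)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}

// usage sketch (placeholders):
//   std::printf("forward: %.2f ms\n", time_ms([&]{ run_forward(); }));
//   std::printf("nms:     %.2f ms\n", time_ms([&]{ run_nms(); }));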

@arrufat
Member

arrufat commented Jan 8, 2021

Did you set CUDA_LAUNCH_BLOCKING to 1? I used that in all my benchmarks.

@pfeatherstone
Contributor Author

So when running all the YOLO models with this repository, do you get similar performance to darknet and PyTorch?

@pfeatherstone
Contributor Author

Did you set CUDA_LAUNCH_BLOCKING to 1? I used that in all my benchmarks.

No, I've never used that. I imagine that would slow PyTorch down.

@arrufat
Member

arrufat commented Jan 8, 2021

I added it because the creator of PyTorch suggested doing so for proper benchmarking: arrufat/dlib-pytorch-benchmark#2

@pfeatherstone
Contributor Author

pfeatherstone commented Jan 8, 2021

Oh ok, fair enough. At the end of the day, running yolov3 with onnxruntime, PyTorch or darknet yields roughly 65 FPS on 416x416 images. With dlib, I think I got around 45 FPS. If we can close that gap, that would be great.

@arrufat
Member

arrufat commented Jan 18, 2021

Oh ok, fair enough. At the end of the day, running yolov3 with onnxruntime, PyTorch or darknet yields roughly 65 FPS on 416x416 images. With dlib, I think I got around 45 FPS. If we can close that gap, that would be great.

I've just run yolov3 on darknet and dlib and I can confirm similar numbers, more precisely, on an NVIDIA GeForce GTX 1080 Ti:

model (size)    FPS (darknet)   FPS (dlib)   VRAM MiB (darknet)   VRAM MiB (dlib)
yolov3 (416)    70              50           865                  835
yolov4 (608)    32              22           1545                 1742

I agree, it'd be cool if we could find out and fix the bottlenecks.

@pfeatherstone
Contributor Author

That's great, thanks. Whenever I have time I will have a look. A decent profiler will go a long way. I always struggle to interpret Sysprof. I'll try Orbit again at some point, though I had trouble building it last time, I seem to remember.

@pfeatherstone
Contributor Author

It could be that tensor.host() is called in a few places, which would introduce unnecessary synchronisation barriers. I don't know enough about CUDA, to be honest.

@arrufat
Member

arrufat commented Jan 18, 2021

As far as I know, tensor.host() is only called once in user code (to get the actual output of the network). I need to check if it's called somewhere else inside some layer implementation...

@pfeatherstone
Contributor Author

Yep, building Google's Orbit profiler failed again. I'll have to do this at some point in my free time. Thanks @arrufat for investigating.

@davisking

davisking commented Jan 18, 2021 via email

@pfeatherstone
Contributor Author

@davisking Does dlib do fused convolution and batch normalisation?

@pfeatherstone
Contributor Author

Darknet definitely does that. But then again, PyTorch doesn't, and it achieves similar FPS to darknet, if not faster. I've seen onnxruntime achieve even higher FPS, but it does all sorts of crazy shit with graph optimization.

@pfeatherstone
Contributor Author

If dlib doesn't implement fused conv-batchnorm, maybe that could be implemented as a layer visitor when doing inference, which updates the convolutional filters and biases, and nulls the affine layers.

@pfeatherstone
Contributor Author

Here is Alexey's code for fused conv-batchnorm:

void fuse_conv_batchnorm(network net)
{
    int j;
    for (j = 0; j < net.n; ++j) {
        layer *l = &net.layers[j];

        if (l->type == CONVOLUTIONAL) {
            //printf(" Merges Convolutional-%d and batch_norm \n", j);

            if (l->share_layer != NULL) {
                l->batch_normalize = 0;
            }

            if (l->batch_normalize) {
                int f;
                for (f = 0; f < l->n; ++f)
                {
                    // fold the batch norm into the conv bias: b' = beta - gamma * mean / sqrt(var + eps)
                    l->biases[f] = l->biases[f] - (double)l->scales[f] * l->rolling_mean[f] / (sqrt((double)l->rolling_variance[f] + .00001));

                    double precomputed = l->scales[f] / (sqrt((double)l->rolling_variance[f] + .00001));

                    const size_t filter_size = l->size*l->size*l->c / l->groups;
                    int i;
                    for (i = 0; i < filter_size; ++i) {
                        int w_index = f*filter_size + i;

                        // rescale every weight of this filter: w' = w * gamma / sqrt(var + eps)
                        l->weights[w_index] *= precomputed;
                    }
                    }
                }

                free_convolutional_batchnorm(l);
                l->batch_normalize = 0;
#ifdef GPU
                if (gpu_index >= 0) {
                    push_convolutional_layer(*l);
                }
#endif
            }
        }
        else  if (l->type == SHORTCUT && l->weights && l->weights_normalization)
        {
            if (l->nweights > 0) {
                //cuda_pull_array(l.weights_gpu, l.weights, l.nweights);
                int i;
                for (i = 0; i < l->nweights; ++i) printf(" w = %f,", l->weights[i]);
                printf(" l->nweights = %d, j = %d \n", l->nweights, j);
            }

            // nweights - l.n or l.n*l.c or (l.n*l.c*l.h*l.w)
            const int layer_step = l->nweights / (l->n + 1);    // 1 or l.c or (l.c * l.h * l.w)

            int chan, i;
            for (chan = 0; chan < layer_step; ++chan)
            {
                float sum = 1, max_val = -FLT_MAX;

                if (l->weights_normalization == SOFTMAX_NORMALIZATION) {
                    for (i = 0; i < (l->n + 1); ++i) {
                        int w_index = chan + i * layer_step;
                        float w = l->weights[w_index];
                        if (max_val < w) max_val = w;
                    }
                }

                const float eps = 0.0001;
                sum = eps;

                for (i = 0; i < (l->n + 1); ++i) {
                    int w_index = chan + i * layer_step;
                    float w = l->weights[w_index];
                    if (l->weights_normalization == RELU_NORMALIZATION) sum += lrelu(w);
                    else if (l->weights_normalization == SOFTMAX_NORMALIZATION) sum += expf(w - max_val);
                }

                for (i = 0; i < (l->n + 1); ++i) {
                    int w_index = chan + i * layer_step;
                    float w = l->weights[w_index];
                    if (l->weights_normalization == RELU_NORMALIZATION) w = lrelu(w) / sum;
                    else if (l->weights_normalization == SOFTMAX_NORMALIZATION) w = expf(w - max_val) / sum;
                    l->weights[w_index] = w;
                }
            }

            l->weights_normalization = NO_NORMALIZATION;

#ifdef GPU
            if (gpu_index >= 0) {
                push_shortcut_layer(*l);
            }
#endif
        }
        else {
            //printf(" Fusion skip layer type: %d \n", l->type);
        }
    }
}

So a dlib visitor and a bit of tensor manipulation. Shouldn't be too bad.
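
Stripped of the darknet bookkeeping, the fusion itself is just a per-filter rescaling. A stand-alone sketch of the arithmetic (plain arrays rather than dlib's tensor API; names and layout are illustrative only):

#include <cmath>
#include <cstddef>
#include <vector>

// Fold batch-norm parameters (gamma, beta, running mean/variance) into the
// preceding convolution, mirroring the darknet code above:
//   w' = w * gamma / sqrt(var + eps)
//   b' = beta - gamma * mean / sqrt(var + eps)
void fuse_conv_batchnorm(std::vector<float>& weights,      // num_filters * filter_size
                         std::vector<float>& biases,       // per filter, holds the BN beta
                         const std::vector<float>& gamma,  // BN scale, per filter
                         const std::vector<float>& mean,   // BN running mean, per filter
                         const std::vector<float>& var,    // BN running variance, per filter
                         std::size_t filter_size,
                         float eps = 1e-5f)
{
    const std::size_t num_filters = biases.size();
    for (std::size_t f = 0; f < num_filters; ++f)
    {
        const float scale = gamma[f] / std::sqrt(var[f] + eps);
        biases[f] -= mean[f] * scale;
        for (std::size_t i = 0; i < filter_size; ++i)
            weights[f * filter_size + i] *= scale;
    }
}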

@davisking

@davisking Does dlib do fused convolution and batch normalisation ?

No. Need to have new layers for that.

@pfeatherstone
Contributor Author

won't a layer visitor do the job?

@davisking

won't a layer visitor do the job?

Yeah or that with appropriate updates to the code.

@pfeatherstone
Contributor Author

@arrufat can you run your benchmark again but disable fuse_conv_batchnorm in darknet? I've done a quick grep, and I think all you need to do is comment out the calls on the following lines:
line 2251 in parser.c
line 162 in demo.c
line 1617 in detector.c
If you're using darknet demo ..., then you won't need to do the last one.
I would do it myself, but the benchmark is only meaningful if it's done on the same machine in the same "conditions".
If the FPS is still around 70, then we know it's not fuse_conv_batchnorm that's causing the performance boost.

@arrufat
Member

arrufat commented Jan 19, 2021

@pfeatherstone after doing what you suggested, I go from 70 fps to 60 fps, so there's some room for improvement there :)

@pfeatherstone
Contributor Author

That's promising, but it might still suggest that something else is causing the bottlenecks. I'll try adding the visitor this weekend. It shouldn't take too long. It also requires adding "bypass" functionality to the affine_ layer.

@arrufat
Member

arrufat commented Jan 19, 2021

Would it be possible to make a new layer, similar to affine_, that can be constructed from a bn_ but behaves like a tag (i.e. it just forwards the input to its output without any runtime cost)?

Then we could assign a network defined with bn_ layers to a network defined with these "bypass" layers, in the same way it's done for affine_.

@pfeatherstone
Contributor Author

Can do. I have no strong opinions. I imagine there would be no runtime cost if there was a flag, due to branch prediction always guessing correctly (presumably there would be an if-statement around the flag: if true, simply forward input to output). Honestly it makes no difference to me, whichever is the most expressive. Your way requires a new type, which means the whole network is a new type, which means the compiler has to compile yet another gigantic type, which means I have to wait another 15 minutes for clang to build yolov3. But at this stage, ±15 minutes for building networks in dlib isn't a biggie.

@arrufat
Member

arrufat commented Jan 19, 2021

Yes, I agree, compile times are getting a bit out of hand for big YOLO models (such as the recently published improvements to YOLOv4). Maybe having an extra branch in each bn_ layer wouldn't affect the performance...

Regarding the compile times, that's why I build each model as a separate library and then link to it, so I don't have to rebuild it every time I change the code somewhere else.

https://github.com/dlib-users/darknet/blob/b78eddc08a7f5520103b2b296067a3516f5f7faa/CMakeLists.txt#L74-L77

Here are the sizes of the compiled models, yolov3 is really tiny compared to the latest yolov4x_mish...

-rw-r--r--  1 adria 1.9M Jan 18 23:43 libyolov3.a
-rw-r--r--  1 adria 4.8M Jan 18 23:43 libyolov4.a
-rw-r--r--  1 adria 5.2M Jan 18 23:43 libyolov4_sam_mish.a
-rw-r--r--  1 adria  15M Jan 18 23:45 libyolov4x_mish.a

@pfeatherstone
Contributor Author

If I had a couple of months of free time, I would roll up my sleeves and propose a new functional API for DNNs in dlib, using dynamic polymorphism instead of static polymorphism for neural networks. I think that would solve a lot of frustrations, including compile times. I can see the benefits of using templates, it means you delegate optimisations to the compiler, but with large models, as you said, it gets out of hand. Having a functional API similar to PyTorch, for example, would make DNNs more accessible in dlib, I think. But this would require a huge amount of time to get it right.

@pfeatherstone
Contributor Author

But that would require a lot of work on the tensor type too, I think. So this wouldn't be an easy thing to do.
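
Just to make the idea concrete, a toy sketch of what a runtime-composed, PyTorch-style API could look like, using std::function for the type erasure (the tensor type here is a bare stand-in, not dlib's tensor):

#include <functional>
#include <vector>

// Bare-bones stand-in tensor, purely illustrative.
struct tensor { std::vector<float> data; std::vector<long> shape; };

// A layer is a pair of type-erased callables; concrete layers capture their
// parameters inside the closures instead of appearing in the network's type.
struct layer
{
    std::function<tensor(const tensor&)> forward;
    std::function<tensor(const tensor&)> backward;  // gradient w.r.t. the input (simplified)
};

// A network is a runtime list of layers, so adding or removing a layer does
// not create a new C++ type and does not trigger a long template recompile.
struct network
{
    std::vector<layer> layers;
    tensor operator()(tensor x) const
    {
        for (const auto& l : layers)
            x = l.forward(x);
        return x;
    }
};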

@pfeatherstone
Contributor Author

Yes, I agree, compile times are getting a bit out of hand for big YOLO models (such as the recently published improvements to YOLOv4). Maybe having an extra branch in each bn_ layer wouldn't affect the performance...

Regarding the compile times, that's why I build each model as a separate library and then link to it, so I don't have to rebuild it every time I change the code somewhere else.

https://github.com/dlib-users/darknet/blob/b78eddc08a7f5520103b2b296067a3516f5f7faa/CMakeLists.txt#L74-L77

Here are the sizes of the compiled models, yolov3 is really tiny compared to the latest yolov4x_mish...

-rw-r--r--  1 adria 1.9M Jan 18 23:43 libyolov3.a
-rw-r--r--  1 adria 4.8M Jan 18 23:43 libyolov4.a
-rw-r--r--  1 adria 5.2M Jan 18 23:43 libyolov4_sam_mish.a
-rw-r--r--  1 adria  15M Jan 18 23:45 libyolov4x_mish.a

It's still impressive that a single model compiles to nearly 2 MB of binary. Maybe the bottlenecks are caused by code bloat? I don't know. I've never properly looked at the effects of binary size on performance.

@arrufat
Member

arrufat commented Jan 19, 2021

Honestly, I really like the declarative way of defining networks in dlib; even if it requires some work to add new layers, it's worth it because:

  • serialization works really well
  • autocompletion anywhere inside the network, for example net.subnet().layer_details().set_num_filters();
  • a lot of errors are caught at compile time (in PyTorch I always get the tensor shapes wrong, that never happened in dlib)
  • for me, it's really easy to look at a dlib network definition and know what it's doing

Other than the compile times, I think dlib's approach to neural nets is the best (but I might be biased :P)

EDIT: also, if at some point dlib is ported to C++20, we could use concepts to get better error messages when we make a mistake in the network definition, that would be awesome.

@pfeatherstone
Contributor Author

I could be wrong. I'm sure they do work but they are definitely not optimised for other architectures.

@arrufat
Member

arrufat commented Mar 4, 2022

You can do label smoothing using the loss_multibinary_log, as I noted in davisking/dlib#2141. Actually, in ResNet strikes back, they do exactly that.

Quoting from the paper:

Our procedure include recent advances from the literature as well as new proposals. Noticeably, we depart from the usual cross-entropy loss. Instead, our training solves a multi-classification problem when using Mixup and CutMix: we minimize the binary cross entropy for each concept selected by these augmentations, assuming that all the mixed concepts are present in the synthetized image.

@pfeatherstone
Contributor Author

pfeatherstone commented Mar 4, 2022

I'm sorry I missed that. Presumably the labels can be any real number in the range [-1,+1]?

@arrufat
Member

arrufat commented Mar 4, 2022

They can be any real value, just positive for one class and negative for the other, and the absolute value would be the weight of the label. So if you have a dataset with a 1:2 imbalance, you can set the labels as 0.67, -1.34 or something like that.
The values themselves don't matter, what matters is the relation between them. If you use large values, you might have to reduce the learning rate.
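
To make that weighting concrete, a small sketch (the 2x factor is an arbitrary convention; as noted above, only the ratio between the weights matters, and the sign encodes the class):

#include <cstddef>

// Build a weighted binary label: sign encodes the class (+ present, - absent),
// magnitude encodes the weight, so the rarer side of an imbalanced dataset
// can be up-weighted.
float make_weighted_label(bool positive, std::size_t num_pos, std::size_t num_neg)
{
    const float total = static_cast<float>(num_pos + num_neg);
    const float w_pos = total / (2.0f * static_cast<float>(num_pos));
    const float w_neg = total / (2.0f * static_cast<float>(num_neg));
    return positive ? +w_pos : -w_neg;
}

// e.g. with a 1:2 imbalance (1000 positives, 2000 negatives) this gives
// +1.5 for positive samples and -0.75 for negative ones.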

@pfeatherstone
Contributor Author

I thought the smoothed version of cross entropy still assumed a softmax layer during inference, like what is done in torch, whereas the loss in dlib uses a sigmoid, right? I don't know which is best, but there is definitely functionality there already in dlib, you're right. I'm not trying to start another religious war, don't worry, just throwing out ideas. If you don't find them appropriate I will remove them.

@arrufat
Member

arrufat commented Mar 4, 2022

From what I've seen, if you want to output a single label per image, training using softmax or sigmoid leads to almost identical results.

I prefer the latter, since it's a more generic approach. It can deal with the cases when:

  • none of the training classes appear at inference time (since all sigmoids can output zero, they don't need to add up to 1)
  • more than one class appears in an image: they don't need to fight, they can all output 1 if they want.

In fact, the classification part of YOLO uses a sigmoid, and generally, we train it with bounding boxes that only have one label.

If you check that ResNet paper, you'll see the experiments comparing what they call Cross Entropy (CE) and Binary Cross Entropy (BCE), which correspond to loss_multiclass_log and loss_multibinary_log in dlib, respectively.

@pfeatherstone
Contributor Author

Interesting, yeah I see your point. I haven't read that paper. I read https://arxiv.org/pdf/1905.04899.pdf

@arrufat
Member

arrufat commented Mar 4, 2022

I was checking the YOLOv5 code, and it uses the CIOU loss, as I thought (it's even hard-coded). I think the other ones don't make much sense if you have the CIOU loss at your disposal.

@pfeatherstone
Contributor Author

pfeatherstone commented Mar 4, 2022

Oh ok. They must have changed it recently. For a long time I was following the code, and even though they supported all 3 IOU losses, they used GIOU. Maybe that was with their YOLOv3 repo. The author even said in an issue somewhere that CIOU provided no benefits over GIOU. But that might not be the case anymore.

ultralytics/yolov3#996 (comment)

@pfeatherstone
Contributor Author

In any case, I'm not instigating a debate here, I just think it would be a nice addition to support all 3. Again, I'm not suggesting you do them. It's just an idea. It could very well be the case that for some models GIOU gives you equal performance with less compute.
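
For reference, GIoU is cheap to compute for axis-aligned boxes: it is plain IoU minus the fraction of the smallest enclosing box not covered by the union. A stand-alone sketch (not dlib's rectangle type; degenerate boxes are not handled):

#include <algorithm>

struct box { float x1, y1, x2, y2; };  // corner coordinates, x1 < x2 and y1 < y2

static float area(const box& b) { return (b.x2 - b.x1) * (b.y2 - b.y1); }

// GIoU(a, b) = IoU(a, b) - (area(C) - area(a union b)) / area(C),
// where C is the smallest box enclosing both a and b.
float giou(const box& a, const box& b)
{
    const float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = iw * ih;
    const float uni   = area(a) + area(b) - inter;

    const box c { std::min(a.x1, b.x1), std::min(a.y1, b.y1),
                  std::max(a.x2, b.x2), std::max(a.y2, b.y2) };

    return inter / uni - (area(c) - uni) / area(c);
}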

@davisking

@davisking talking about type erasure. I was thinking of adding a general purpose type erasure module similar to boost-ext TE or Dyno, but targeting C++11. What do you think? Hopefully it won't require a concept map like the Dyno library does and will be as succinct as the TE library. I've got it working with C++14 but think it should be possible to have a C++11 version. So you could have polymorphic objects on the stack without inheritance. This is really useful in API design, avoiding loads of smart pointers and inheritance, so you can use good old C++ objects.

I haven't used any of those libraries so I'm not 100% sure what you mean. I think a lot of the tools in boost get kind of carried away with "hey this is neat" vs "this is really the best way to deliver value to humans", so I don't end up using boost very much :|

@davisking

@davisking jumping back to DL, have you tried https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html with your torch models? I haven't. If so, what is your experience with it? There is also https://github.com/hjmshi/PyTorch-LBFGS

I haven't. I assume it's a proper LBFGS implementation though.

@davisking

@davisking jumping back to DL, have you tried https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html with your torch models? I haven't. If so, what is your experience with it? There is also https://github.com/hjmshi/PyTorch-LBFGS

Oh, that's an SGD tool that uses the LBFGS equation to compute the step direction. That's not really "LBFGS" though. LBFGS is supposed to be used with a line search and is not an SGD-type algorithm. Like, it does not operate on "batches".

@pfeatherstone
Contributor Author

pfeatherstone commented Mar 5, 2022

@davisking talking about type erasure. I was thinking of adding a general purpose type erasure module similar to boost-ext TE or Dyno, but targeting C++11. What do you think? Hopefully it won't require a concept map like the Dyno library does and will be as succinct as the TE library. I've got it working with C++14 but think it should be possible to have a C++11 version. So you could have polymorphic objects on the stack without inheritance. This is really useful in API design, avoiding loads of smart pointers and inheritance, so you can use good old C++ objects.

I haven't used any of those libraries so I'm not 100% sure what you mean. I think a lot of the tools in boost get kind of carried away with "hey this is neat" vs "this is really the best way to deliver value to humans", so I don't end up using boost very much :|

Neither of them are boost libraries, actually. TE is 300 lines of code and that's it. I would check out TE. It's really cool. It's dynamic polymorphism without inheritance, so non-intrusive. I think you would like it a lot. But it requires C++17. You can make it C++14 with a couple of mods, but it requires a bit more work to make it C++11. But with it, any kind of type erasure is trivially done.
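
For anyone following along, the core trick (Sean Parent's concept/model pattern) fits in a few dozen lines of C++11 with no library at all: the stored type never inherits from anything, and the inheritance that does exist is an internal detail of the wrapper. A minimal, illustrative sketch:

#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// A value type that can hold anything providing `std::string name() const`,
// without the stored type inheriting from any interface.
class any_named
{
public:
    template <typename T,
              typename = typename std::enable_if<
                  !std::is_same<typename std::decay<T>::type, any_named>::value>::type>
    any_named(T&& value)
        : self(std::make_shared<model<typename std::decay<T>::type>>(std::forward<T>(value))) {}

    std::string name() const { return self->name(); }

private:
    struct concept_t
    {
        virtual ~concept_t() {}
        virtual std::string name() const = 0;
    };

    template <typename T>
    struct model : concept_t
    {
        explicit model(T v) : value(std::move(v)) {}
        std::string name() const override { return value.name(); }
        T value;
    };

    // shared, immutable state makes copies of any_named cheap and safe
    std::shared_ptr<const concept_t> self;
};

// usage: struct widget { std::string name() const { return "widget"; } };
//        any_named x = widget{};   // widget does not inherit from anything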

@davisking

@davisking talking about type erasure. I was thinking of adding a general purpose type erasure module similar to boost-ext TE or Dyno, but targeting C++11. What do you think? Hopefully it won't require a concept map like the Dyno library does and will be as succinct as the TE library. I've got it working with C++14 but think it should be possible to have a C++11 version. So you could have polymorphic objects on the stack without inheritance. This is really useful in API design, avoiding loads of smart pointers and inheritance, so you can use good old C++ objects.

I haven't used any of those libraries so I'm not 100% sure what you mean. I think a lot of the tools in boost get kind of carried away with "hey this is neat" vs "this is really the best way to deliver value to humans", so I don't end up using boost very much :|

Neither of them are boost libraries, actually. TE is 300 lines of code and that's it. I would check out TE. It's really cool. It's dynamic polymorphism without inheritance, so non-intrusive. I think you would like it a lot. But it requires C++17. You can make it C++14 with a couple of mods, but it requires a bit more work to make it C++11. But with it, any kind of type erasure is trivially done.

Yeah, it's clever. It seems like a hack around not having template concepts (which are in C++20 now) though. Like, use template concepts. They ought to be great. Like, update your compiler, you know :D Dlib has to wait a long time to update, but end users, which includes us when not adding code to the dlib repo, can use all the newest stuff. Like, I use C++17 at work and it's great.

@pfeatherstone
Contributor Author

pfeatherstone commented Mar 5, 2022

It's not the same as concepts though. It's closer to inheritance than it is to templates. It's a generalisation of std::function but for any kind of object. I think that's the best way I can explain it. Like std::function isn't a template hack, it's achieving something else. But yeah, I see your point, I could just use that library and accept the C++17 requirement. I just like to keep stuff as portable as possible. I still have to support old compilers for some of my stuff at work. That's why I've submitted a few PRs in the past that add C++14 and C++17 backports.

@pfeatherstone
Contributor Author

And I don't quite understand the argument for updating my compiler. I've done that before: I build it, then run the binary on another Linux box, and then it complains it can't find the right version of glibc or something. And aggressively statically linking everything doesn't always work. So I always just use the default compiler for a system.

@pfeatherstone
Contributor Author

pfeatherstone commented Mar 5, 2022

It would be great if gcc and clang supported a feature like "clang update" or "gcc update" and everything just worked and I had all the latest features. That is where modern languages shine; they've embraced the bleeding edge a bit more. I would really like it if C++ no longer supported dynamic linking. Disk space is no longer an issue. Unless you're dynamically loading modules at runtime, I see no benefit in shared libraries. For me they cause problems and mean I can't really do stuff like updating my compiler and expect everything to work.

@pfeatherstone
Contributor Author

Anyway, silly rant over. I proposed a type erasure utility since you mentioned that IF you were to re-write the DNN stuff, you would use a more dynamic model, likely using type erasure (instead of inheritance) to record forward (and backward?) functions. So it seemed appropriate to mention it.

@davisking

It's not the same as concepts though. It's closer to inheritance than it is to templates. It's a generalisation of std::function but for any kind of object.

Sure, but why not use templates? Like aside from being able to hide some implementation details outside a header I would prefer template concepts to this most of the time. I mean, I use std::function sometimes too. But IDK. This feels excessively clever, but maybe it isn't. What do you use this for that isn't nicely accomplished via a shared_ptr and inheritance of a template?

I think that's the best way I can explain it. Like std::function isn't a template hack, it's achieving something else. But yeah, I see your point, I could just use that library and accept the C++17 requirement. I just like to keep stuff as portable as possible. I still have to support old compilers for some of my stuff at work. That's why I've submitted a few PRs in the past that add C++14 and C++17 backports.

Yeah, I know this pain well :|

@davisking

And I don't quite understand the argument for updating my compiler. I've done that before: I build it, then run the binary on another Linux box, and then it complains it can't find the right version of glibc or something. And aggressively statically linking everything doesn't always work. So I always just use the default compiler for a system.

I mean yeah, but getting a new compiler is great when you can get it :D

@davisking

It would be great if gcc and clang supported a feature like "clang update" or "gcc update" and everything just worked and I had all the latest features. That is where modern languages shine; they've embraced the bleeding edge a bit more. I would really like it if C++ no longer supported dynamic linking. Disk space is no longer an issue. Unless you're dynamically loading modules at runtime, I see no benefit in shared libraries. For me they cause problems and mean I can't really do stuff like updating my compiler and expect everything to work.

Depends on your domain. For instance, I work on a large software system that really needs dynamic linking. It's a bunch of separate processes. Linking it all into one big monster process would be less good since it is a safety-critical system. We don't want one thing blowing up to take down the whole thing, so that process isolation is a big deal. And at the same time, if everything was statically linked then we would quite literally run out of RAM.

But yeah, for most applications static linking is the way to go. Or mostly static linking anyway.

@davisking

davisking commented Mar 8, 2022

Anyway, silly rant over. I proposed a type erasure utility since you mentioned that IF you were to re-write the DNN stuff, you would use a more dynamic model, likely using type erasure (instead of inheritance) to record forward (and backward?) functions. So it seemed appropriate to mention it.

Yeah, would definitely use type erasure if rewriting the DNN stuff.

And I deliberately didn't use type erasure the first time, since type erasure makes serialization more complicated. Not a lot more complicated, but still more complicated. It makes a lot of stuff a little bit more complicated. But still, that was not optimizing the right thing. I didn't expect DNNs to end up being as large and crazy as they got, and at those sizes and complexities type erasure is definitely the better way to go.

@pfeatherstone
Contributor Author

What do you use this for that isn't nicely accomplished via a shared_ptr and inheritance of a template?

I would watch Sean Parent's talk "Inheritance Is the Base Class of Evil" or Louis Dionne's CppCon talk on Dyno. They would explain it better than I do.

@arrufat
Member

arrufat commented Mar 10, 2022

@pfeatherstone I just noticed (by chance) that you can load a network that differs in the definition only in the number of repeated layers.
This means that, due to the way that I defined the YOLOv5 models, you can pick and build just one variant, and load any of the other variants. That's kind of cool :)
It would be even cooler to be able to change it at runtime programmatically, though, similar to the number of outputs in the fc layer, or to the number of filters in the convolutional layers.

I will check if I can add an option to change the number of repetitions in the repeat layer at runtime. @davisking, do you think that's feasible?
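
For context, the existing runtime knobs mentioned above look roughly like this (a rough sketch only: the toy network and layer indices are made up, and the setters are meant to be called before the parameters get allocated, i.e. before the net is used):

#include <dlib/dnn.h>

// Toy network purely for illustration: fc on top of relu(con) on an image input.
using net_type = dlib::loss_multiclass_log<
                     dlib::fc<10,
                     dlib::relu<
                     dlib::con<6, 5, 5, 1, 1,
                     dlib::input<dlib::matrix<unsigned char>>>>>>;

int main()
{
    net_type net;
    // Change the fc layer's output count and the conv layer's filter count at
    // runtime, instead of baking different numbers into different network types.
    dlib::layer<1>(net).layer_details().set_num_outputs(42);
    dlib::layer<3>(net).layer_details().set_num_filters(16);
}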

@pfeatherstone
Contributor Author

That would be awesome!

@pfeatherstone
Contributor Author

pfeatherstone commented Mar 10, 2022

It would also be cool if the batch-normalization layers had a runtime option to set them to training mode or evaluation mode (affine-layer behaviour), similar to other NN frameworks. So you would only ever need one network definition, not two (one for training and one for evaluation).

Then you could do something like model.train() to set it back to batch-normalization behaviour and keep track of statistics, or model.eval() to set it to affine behaviour. That would probably save compilation time massively too (halve it, in theory).

@pfeatherstone
Contributor Author

Very quickly, going back to enhanced loss_yolo layer supporting the IOU variants. Currently, it assumes the inputs to the yolo loss function have had sigmoid applied to them. That wouldn't work with GIOU, DIOU or CIOU. Is there any reason why the loss_yolo layer doesn't do the sigmoid itself? If it did, then you could add CIOU for example and not worry about having to undo the sigmoid.

@arrufat
Member

arrufat commented Mar 10, 2022

Very quickly, going back to enhanced loss_yolo layer supporting the IOU variants. Currently, it assumes the inputs to the yolo loss function have had sigmoid applied to them. That wouldn't work with GIOU, DIOU or CIOU. Is there any reason why the loss_yolo layer doesn't do the sigmoid itself? If it did, then you could add CIOU for example and not worry about having to undo the sigmoid.

Yes, there's a technical reason. First, I wanted to perform the sigmoid operation inside the loss function, but the get_output() method returns a const reference, so it meant that I had to copy the output tensor in the loss layer to apply the sigmoid. That's why I decided to do it outside. Moreover, the Darknet code also does it in the network when new_coords==1, and it has the option to do the CIOU loss.

@davisking

@pfeatherstone I just noticed (by chance) that you can load a network that differs in the definition only in the number of repeated layers. This means that, due to the way that I defined the YOLOv5 models, you can pick and build just one variant, and load any of the other variants. That's kind of cool :) It would be even cooler to be able to change it at runtime programmatically, though, similar to the number of outputs in the fc layer, or to the number of filters in the convolutional layers.

I will check if I can add an option to change the number of repetitions in the repeat layer at runtime. @davisking, do you think that's feasible?

Yeah, probably not a problem.
