Feature suggestions #3
Yes, I want to do all those things at some point :) I've already started working on YOLO scaled models, in particular YOLOv4x-mish, which can be found here. My idea is to make a generic template class for YOLO models, where the template type is the YOLO model itself, and then put each one in a separate compilation unit so that we can link to them (this greatly improves compilation time). I've also started working on the […]

I am not sure how we can improve the performance. In my tests, dlib performs faster than PyTorch for DenseNet, ResNet and VoVNet architectures and uses less memory (for small batch sizes, up to 4 or 8), but the tendency inverts for big batch sizes. If that is the case, dlib should be fast on single inference, so I am wondering if it's the post-processing (NMS, etc.) that drags the performance down... I want to test that at some point, as well.
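Something along these lines, as a rough sketch (all names here are hypothetical, not this repo's actual API):

```cpp
// yolo_detector.h -- generic wrapper; NET is the dlib network type itself.
#include <dlib/dnn.h>
#include <string>
#include <vector>

struct detection
{
    dlib::rectangle rect;
    float confidence;
    std::string label;
};

template <typename NET>
class yolo_detector
{
public:
    yolo_detector();
    std::vector<detection> detect(const dlib::matrix<dlib::rgb_pixel>& image);
private:
    NET net;  // the heavyweight network type lives behind this wrapper
};

// yolov4x_mish.cpp -- the ONLY translation unit that instantiates the type,
// so everything else just links against it:
//
//   #include "yolo_detector.h"
//   #include "yolov4x_mish_def.h"                  // hypothetical definition
//   template class yolo_detector<yolov4x_mish>;    // explicit instantiation
```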
When I was doing my tests, both ONNX inference and dlib inference were doing the NMS stuff. The NMS stuff is practically instantaneous (I haven't properly measured it, though). I think it's something else that's causing bottlenecks. But your benchmarks are very interesting, and not what I was expecting based on my tests with YOLOv3. Properly profiling this at some point will be useful.
Did you set […]?
So when running all the YOLO models with this repository, do you get similar performance to darknet and PyTorch?
No, I've never used that. I imagine that would slow PyTorch down.
I added it because the creator of PyTorch suggested doing so, for proper benchmarking: arrufat/dlib-pytorch-benchmark#2
Oh OK, fair enough. At the end of the day, running YOLOv3 with onnxruntime, PyTorch or darknet yields roughly 65 FPS on 416x416 images. With dlib, I think I got around 45 FPS. If we can close that gap, that would be great.
I've just run yolov3 on darknet and dlib and I can confirm similar numbers; more precisely, on an NVIDIA GeForce GTX 1080 Ti: […]
I agree, it'd be cool if we could find out and fix the bottlenecks.
That's great, thanks. Whenever I have time I will have a look. A decent profiler will go a long way. I always struggle to interpret sysprof. I'll try Orbit again at some point, though I had trouble building it last time, I seem to remember.
It could be […]
As far as I know, […]
Yep, building Google's Orbit profiler failed again. I'll have to do this at some point in my free time. Thanks @arrufat for investigating.
I'm not sure what causes this, but there shouldn't be any unnecessary tensor.host() calls. When the network is running it should all stay on the GPU. My guess is that darknet is making use of the fused conv+relu methods in cuDNN. dlib doesn't do that yet; it's still running those as two calls to cuDNN rather than one, which is a modest but non-trivial difference in speed, if darknet is doing it like that.
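For context, the fused path in cuDNN is a single call, cudnnConvolutionBiasActivationForward, instead of a convolution followed by a separate bias+activation pass. A minimal sketch, assuming the descriptors, algorithm and workspace are already set up and the activation descriptor is CUDNN_ACTIVATION_RELU:

```cpp
#include <cudnn.h>

// y = relu(conv(x, w) + bias) in one kernel launch. Error checking omitted.
void fused_conv_bias_relu(cudnnHandle_t handle,
                          const cudnnTensorDescriptor_t x_desc, const void* x,
                          const cudnnFilterDescriptor_t w_desc, const void* w,
                          const cudnnConvolutionDescriptor_t conv_desc,
                          cudnnConvolutionFwdAlgo_t algo,
                          void* workspace, size_t workspace_bytes,
                          const cudnnTensorDescriptor_t bias_desc, const void* bias,
                          const cudnnActivationDescriptor_t relu_desc,
                          const cudnnTensorDescriptor_t y_desc, void* y)
{
    const float alpha1 = 1.0f;  // scales the convolution result
    const float alpha2 = 0.0f;  // zero disables the extra "z" input term
    cudnnConvolutionBiasActivationForward(
        handle, &alpha1,
        x_desc, x, w_desc, w, conv_desc, algo, workspace, workspace_bytes,
        &alpha2, y_desc, y,   // z is unused since alpha2 == 0
        bias_desc, bias, relu_desc, y_desc, y);
}
```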
@davisking Does dlib do fused convolution and batch normalisation?
Darknet definitely does that. But then again, PyTorch doesn't, and it achieves similar FPS to darknet, if not faster. I've seen onnxruntime achieve even higher FPS, but it does all sorts of crazy shit with graph optimization.
If dlib doesn't implement fused conv-batchnorm, maybe that could be implemented as a layer visitor when doing inference, which updates the convolutional filters and biases and nulls the affine layers.
Here is Alexey's code for fused conv-batchnorm: […]
So a dlib visitor and a bit of tensor manipulation. Shouldn't be too bad.
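The folding itself is just standard parameter algebra; a minimal sketch (not Alexey's actual code) of what such a visitor would compute:

```cpp
#include <cmath>
#include <vector>

// Fold a batch-norm/affine layer (gamma, beta, running mean/var) into the
// preceding convolution's filters and biases:
//   y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
//     = (scale*w)*x + scale*(b - mean) + beta,  scale = gamma / sqrt(var + eps)
void fuse_conv_batchnorm(std::vector<float>& filters,  // num_filters * filter_size
                         std::vector<float>& biases,   // one per filter
                         const std::vector<float>& gamma,
                         const std::vector<float>& beta,
                         const std::vector<float>& mean,
                         const std::vector<float>& var,
                         float eps, std::size_t filter_size)
{
    for (std::size_t f = 0; f < biases.size(); ++f)
    {
        const float scale = gamma[f] / std::sqrt(var[f] + eps);
        for (std::size_t i = 0; i < filter_size; ++i)
            filters[f * filter_size + i] *= scale;
        biases[f] = scale * (biases[f] - mean[f]) + beta[f];
    }
}
```

After that, the affine layer can be set to the identity so inference skips the extra pass.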
No. Need to have new layers for that.
Won't a layer visitor do the job?
Yeah, or that with appropriate updates to the code.
@arrufat can you run your benchmark again but disable […]?
@pfeatherstone after doing what you suggested, I go from 70 FPS to 60 FPS, so there's some room for improvement there :)
That's promising. But it might still suggest it's something else causing the bottlenecks. I'll try adding the visitor this weekend; it shouldn't take too long. It also requires adding "bypass" functionality in […]
Could it be possible to make a new layer similar to […]? Then we could assign a network defined with […]
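If I remember right, dlib can already do something close to this: the affine_ layer is constructible from a bn_ layer, so a network trained with bn_con can be assigned to an otherwise identical network that uses affine. A sketch (the layer stack is just illustrative):

```cpp
#include <dlib/dnn.h>
using namespace dlib;

using train_net = loss_multiclass_log<fc<10,
    relu<bn_con<con<16,3,3,1,1, input<matrix<unsigned char>>>>>>>;
using infer_net = loss_multiclass_log<fc<10,
    relu<affine<con<16,3,3,1,1, input<matrix<unsigned char>>>>>>>;

int main()
{
    train_net tnet;
    // ... train tnet ...
    infer_net inet = tnet;  // each bn_con's running statistics are folded
                            // into the corresponding affine layer
}
```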
Can do. I have no strong opinions. I imagine there would be no runtime cost if there was a flag, due to branch prediction always guessing correctly (presumably there would be an if-statement around the flag: if true, simply forward input to output). Honestly, it makes no difference to me; whichever is the most expressive. Your way requires a new type, which means the whole network is a new type, which means the compiler has to compile yet another gigantic type, which means I have to wait another 15 minutes for clang to build yolov3. But at this stage, ±15 minutes for building networks in dlib isn't a biggy.
Yes, I agree, compile times are getting a bit out of hand for big YOLO models (such as the recently published improvements to YOLOv4). Maybe having an extra branch in each […]

Regarding the compile times, that's why I build each model as a separate library and then link to it, so I don't have to rebuild it every time I change the code somewhere else. Here are the sizes of the compiled models; yolov3 is really tiny compared to the latest yolov4x_mish... […]
If I had a couple of months of free time, I would roll up my sleeves and propose a new functional API for DNNs in dlib, using dynamic polymorphism instead of static polymorphism for neural networks. I think that would solve a lot of frustrations, including compile times. I can see the benefits of using templates: the compiler gets the whole network to optimise. But with large models, as you said, it gets out of hand. Having a functional API similar to PyTorch, for example, would make DNNs more accessible in dlib, I think. But this would require a huge amount of time to get right.
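For illustration, the core of such a dynamic design would be a type-erased layer in the style of Sean Parent's runtime-polymorphism pattern. A purely hypothetical sketch, nothing like dlib's current API:

```cpp
#include <functional>
#include <memory>
#include <vector>

struct tensor_t { std::vector<float> data; /* shape omitted for brevity */ };

class any_layer
{
public:
    // Any type with a forward(tensor_t) member can be stored, no inheritance
    // required on the caller's side.
    template <typename Layer>
    any_layer(Layer l) : self(std::make_shared<model<Layer>>(std::move(l))) {}

    tensor_t forward(const tensor_t& x) const { return self->forward(x); }

private:
    struct concept_t
    {
        virtual ~concept_t() = default;
        virtual tensor_t forward(const tensor_t&) const = 0;
    };
    template <typename Layer>
    struct model final : concept_t
    {
        model(Layer l) : layer(std::move(l)) {}
        tensor_t forward(const tensor_t& x) const override { return layer.forward(x); }
        Layer layer;
    };
    std::shared_ptr<const concept_t> self;
};

// A network then becomes a runtime list of layers, e.g.
//   std::vector<any_layer> net = { conv(...), relu(), conv(...) };
// so adding a layer never changes the network's C++ type.
```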
But that would require a lot of work on the […]
It's still impressive that a single model compiles to nearly 2 MB of binary. Maybe the bottlenecks are caused by code bloat? I don't know; I've never properly looked at the effects of binary size on performance.
Honestly, I really like the declarative way of defining networks in dlib. Even if it requires some work to add new layers, it's worth it because: […]
Other than the compile times, I think dlib's approach to neural nets is the best (but I might be biased :P)

EDIT: also, if at some point dlib is ported to C++20, we could use concepts to get better error messages when we make a mistake in the network definition. That would be awesome.
I could be wrong. I'm sure they do work, but they are definitely not optimised for other architectures.
You can do label smoothing using the […] loss. Quoting from the paper: […]
I'm sorry, I missed that. Presumably the labels can be any real number in the range [-1, +1]?
They can be any real value, just positive for one class and negative for the other, and the absolute value would be the weight of the label. So if you have a dataset with a 1:2 imbalance, you can set the labels to 0.67 and -1.34, or something like that.
I thought the smoothed version of cross entropy still assumed a softmax layer during inference, like what is done in torch […], whereas the loss in dlib is using a sigmoid, right? I don't know which is best, but there is definitely functionality there already in dlib, you're right. I'm not trying to start another religious war, don't worry. Just throwing ideas out there; if you don't find them appropriate, I will remove them.
From what I've seen, if you want to output a single label per image, training using softmax or sigmoid leads to almost identical results. I prefer the latter, since it's a more generic approach. It can deal with the cases when: […]
In fact, the classification part of YOLO uses a sigmoid, and generally we train it with bounding boxes that only have one label. If you check that ResNet paper, you'll see the experiments between what they call Cross Entropy (CE) and Binary Cross Entropy (BCE), which correspond to the […]
Interesting, yeah, I see your point. I haven't read that paper. I read https://arxiv.org/pdf/1905.04899.pdf
I was checking the YOLOv5 code, and it uses the CIOU loss, as I thought (it's even hard-coded). I think the other ones don't make much sense if you have the CIOU loss at your disposal.
Oh OK, they must have changed it recently. For a long time I was following the code, and even though they supported all 3 IOU losses, they used GIOU. Maybe that was in their YOLOv3 repo. The author even said in an issue somewhere that CIOU provided no benefits over GIOU. But that might not be the case anymore.
In any case, I'm not instigating a debate here; I just think it would be a nice addition to support all 3. Again, I'm not suggesting you do them, it's just an idea. It could very well be the case that for some models GIOU gives you equal performance with less compute.
I haven't used any of those libraries, so I'm not 100% sure what you mean. I think a lot of the tools in boost get kind of carried away with "hey this is neat" vs "this is really the best way to deliver value to humans", so I don't end up using boost very much :|
I haven't. I assume it's a proper LBFGS implementation, though.
Oh, that's an SGD tool that uses the LBFGS equation to compute the step direction. That's not really "LBFGS" though. LBFGS is supposed to be used with a line search and is not an SGD-type algorithm; it does not operate on "batches".
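For reference, this is the shape of proper LBFGS in dlib's own optimisation API: a full-batch minimiser driven by a line search (the Rosenbrock objective is just a stand-in):

```cpp
#include <dlib/optimization.h>
#include <cmath>

int main()
{
    using namespace dlib;
    typedef matrix<double,0,1> column_vector;

    // Classic Rosenbrock test function; every evaluation sees the full
    // objective -- there are no mini-batches anywhere.
    auto rosen = [](const column_vector& m) -> double
    {
        const double x = m(0), y = m(1);
        return 100.0 * std::pow(y - x * x, 2) + std::pow(1 - x, 2);
    };

    column_vector start(2);
    start = -1.2, 1.0;

    // lbfgs_search_strategy(10) keeps 10 curvature pairs; find_min drives
    // the line search until the objective stops improving.
    find_min_using_approximate_derivatives(
        lbfgs_search_strategy(10),
        objective_delta_stop_strategy(1e-7),
        rosen, start, -1);
}
```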
Neither of them are boost libraries, actually. TE is 300 lines of code and that's it. I would check out TE, it's really cool: it's dynamic polymorphism without inheritance, so non-intrusive. I think you would like it a lot. But it requires C++17. You can make it C++14 with a couple of mods, but it requires a bit more work to make it C++11. With it, any kind of type erasure is trivially done.
Yeah, it's clever. It seems like a hack around not having template concepts (which are in C++20 now) though. Like, use template concepts, they ought to be great. Like, update your compiler, you know :D dlib has to wait a long time to update, but end users, which includes us when not adding code to the dlib repo, can use all the newest stuff. I use C++17 at work and it's great.
It's not the same as concepts, though. It's closer to inheritance than it is to templates; it's a generalisation of std::function, but for any kind of object. I think that's the best way I can explain it. std::function isn't a template hack, it's achieving something else. But yeah, I see your point, I could just use that library and accept the C++17 requirement. I just like to keep stuff as portable as possible; I still have to support old compilers for some of my stuff at work. That's why I've submitted a few PRs in the past that add C++14 and C++17 backports.
And I don't quite understand the argument for updating my compiler. I've done that before: I build it, then run the binary on another Linux box, and it complains it can't find the right version of glibc or something. And aggressively statically linking everything doesn't always work. So I always just use the default compiler for a system.
It would be great if gcc and clang supported a feature like "clang update" or "gcc update" and everything just worked and I had all the latest features. That is where modern languages shine: they've embraced the bleeding edge a bit more. I would really like it if C++ no longer supported dynamic linking. Disk space is no longer an issue; unless you're dynamically loading modules at runtime, I see no benefit in shared libraries. For me they cause problems and mean I can't really do stuff like updating my compiler and expect everything to work.
Anyway, silly rant over. I proposed a type erasure utility since you mentioned that IF you were to re-write the DNN stuff, you would use a more dynamic model, likely using type erasure (instead of inheritance) to record forward (and backward?) functions. So it seemed appropriate to mention it.
Sure, but why not use templates? Aside from being able to hide some implementation details outside a header, I would prefer template concepts to this most of the time. I mean, I use […]
Yeah, I know this pain well :|
I mean yeah, but getting a new compiler is great when you can get it :D
Depends on your domain. For instance, I work on a large software system that really needs dynamic linking. It's a bunch of separate processes; linking it all into one big monster process would be less good, since it is a safety-critical system. We don't want one thing blowing up to take down the whole thing, so that process isolation is a big deal. And at the same time, if everything was statically linked we would quite literally run out of RAM. But yeah, for most applications static linking is the way to go. Or mostly-static linking, anyway.
Yeah, I would definitely use type erasure if rewriting the DNN stuff. I deliberately didn't use type erasure the first time, since type erasure makes serialization more complicated. Not a lot more complicated, but still more complicated; it makes a lot of stuff a little bit more complicated. But still, that was not optimizing the right thing. I didn't expect DNNs to end up being as large and crazy as they got, and at those sizes and complexities type erasure is definitely the better way to go.
I would watch Sean Parent's talk "Inheritance Is the Base Class of Evil" or Louis Dionne's CppCon talk on Dyno. They explain it better than I do.
@pfeatherstone I just noticed (by chance) that you can load a network that differs in its definition only in the number of repeated layers. I will check if I can add an option to change the number of repetitions in the repeat layer at runtime. @davisking, do you think that's feasible?
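For context, the repetition count in dlib's repeat layer is currently a template parameter, so nets differing only in that count are distinct C++ types. A sketch (the block itself is illustrative):

```cpp
#include <dlib/dnn.h>
using namespace dlib;

template <typename SUBNET> using block = relu<con<16,3,3,1,1,SUBNET>>;

// Two different types, even though only the repetition count differs:
using net4 = loss_multiclass_log<fc<10, repeat<4, block, input<matrix<unsigned char>>>>>;
using net8 = loss_multiclass_log<fc<10, repeat<8, block, input<matrix<unsigned char>>>>>;
// A runtime repetition count would let one type cover both.
```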
That would be awesome!
It would also be cool if the batch-normalization layers had a runtime option to set them to training mode or evaluation mode (affine layer behaviour), similar to other NN frameworks. That way you would only ever need one network definition, not two (one for training and one for evaluation). Then you could do something like […]
Very quickly, going back to enhanced […]
Yes, there's a technical reason. First, I wanted to perform the sigmoid operation inside the loss function, but the […]
Yeah, probably not a problem.
Just putting some suggestions out there. Maybe we could organize these into projects.
- fuse_conv_batchnorm visitor for enhanced performance during inference