Feature suggestions #3
Yes, I want to do all those things at some point :) I've already started working on YOLO scaled models, in particular YOLOv4x-mish, which can be found here. My idea is to make a generic template class for YOLO models, where the template type is the YOLO model itself, and then put each one in a separate compilation unit so that we can link to them (this greatly improves compilation time). I've also started working on the […]

I am not sure how we can improve the performance. In my tests, dlib performs faster than PyTorch for DenseNet, ResNet and VoVNet architectures and uses less memory (for small batch sizes, up to 4 or 8), but the tendency inverts for big batch sizes. If that is the case, dlib should be fast on single inference, so I am wondering if it's the post-processing (NMS, etc.) that drags the performance down... I want to test that at some point, as well.
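Something along these lines, as a rough sketch (all names here are hypothetical, not this repo's actual API):

```cpp
// yolo_detector.h -- generic wrapper; NET is the dlib network type itself.
#include <dlib/dnn.h>
#include <string>
#include <vector>

struct detection
{
    dlib::rectangle rect;
    float confidence;
    std::string label;
};

template <typename NET>
class yolo_detector
{
public:
    yolo_detector();
    std::vector<detection> detect(const dlib::matrix<dlib::rgb_pixel>& image);
private:
    NET net;  // the heavyweight network type lives behind this wrapper
};

// yolov4x_mish.cpp -- the ONLY translation unit that instantiates the type,
// so everything else just links against it:
//
//   #include "yolo_detector.h"
//   #include "yolov4x_mish_def.h"                  // hypothetical definition
//   template class yolo_detector<yolov4x_mish>;    // explicit instantiation
```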
When I was doing my tests, both ONNX inference and dlib inference were doing the NMS stuff. The NMS stuff is practically instantaneous (I haven't properly measured it, though). I think it's something else that's causing bottlenecks. But your benchmarks are very interesting, and not what I was expecting based on my tests with YOLOv3. Properly profiling this at some point will be useful.
Did you set […]?
So when running all the YOLO models with this repository, do you get similar performance to darknet and PyTorch?
No, I've never used that. I imagine that would slow PyTorch down.
I added it because the creator of PyTorch suggested doing so, for proper benchmarking: arrufat/dlib-pytorch-benchmark#2
Oh OK, fair enough. At the end of the day, running YOLOv3 with onnxruntime, PyTorch or darknet yields roughly 65 FPS on 416x416 images. With dlib, I think I got around 45 FPS. If we can close that gap, that would be great.
I've just run yolov3 on darknet and dlib and I can confirm similar numbers; more precisely, on an NVIDIA GeForce GTX 1080 Ti: […]
I agree, it'd be cool if we could find out and fix the bottlenecks.
That's great, thanks. Whenever I have time I will have a look. A decent profiler will go a long way. I always struggle to interpret sysprof. I'll try Orbit again at some point, though I had trouble building it last time, I seem to remember.
It could be […]
As far as I know, […]
Yep, building Google's Orbit profiler failed again. I'll have to do this at some point in my free time. Thanks @arrufat for investigating.
I'm not sure what causes this, but there shouldn't be any unnecessary tensor.host() calls. When the network is running it should all stay on the GPU. My guess is that darknet is making use of the fused conv+relu methods in cuDNN. dlib doesn't do that yet; it's still running those as two calls to cuDNN rather than one, which is a modest but non-trivial difference in speed, if darknet is doing it like that.
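For context, the fused path in cuDNN is a single call, cudnnConvolutionBiasActivationForward, instead of a convolution followed by a separate bias+activation pass. A minimal sketch, assuming the descriptors, algorithm and workspace are already set up and the activation descriptor is CUDNN_ACTIVATION_RELU:

```cpp
#include <cudnn.h>

// y = relu(conv(x, w) + bias) in one kernel launch. Error checking omitted.
void fused_conv_bias_relu(cudnnHandle_t handle,
                          const cudnnTensorDescriptor_t x_desc, const void* x,
                          const cudnnFilterDescriptor_t w_desc, const void* w,
                          const cudnnConvolutionDescriptor_t conv_desc,
                          cudnnConvolutionFwdAlgo_t algo,
                          void* workspace, size_t workspace_bytes,
                          const cudnnTensorDescriptor_t bias_desc, const void* bias,
                          const cudnnActivationDescriptor_t relu_desc,
                          const cudnnTensorDescriptor_t y_desc, void* y)
{
    const float alpha1 = 1.0f;  // scales the convolution result
    const float alpha2 = 0.0f;  // zero disables the extra "z" input term
    cudnnConvolutionBiasActivationForward(
        handle, &alpha1,
        x_desc, x, w_desc, w, conv_desc, algo, workspace, workspace_bytes,
        &alpha2, y_desc, y,   // z is unused since alpha2 == 0
        bias_desc, bias, relu_desc, y_desc, y);
}
```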
@davisking Does dlib do fused convolution and batch normalisation?
Darknet definitely does that. But then again, PyTorch doesn't, and it achieves similar FPS to darknet, if not faster. I've seen onnxruntime achieve even higher FPS, but it does all sorts of crazy shit with graph optimization.
If dlib doesn't implement fused conv-batchnorm, maybe that could be implemented as a layer visitor when doing inference, which updates the convolutional filters and biases and nulls the affine layers.
Here is Alexey's code for fused conv-batchnorm: […]
So a dlib visitor and a bit of tensor manipulation. Shouldn't be too bad.
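The folding itself is just standard parameter algebra; a minimal sketch (not Alexey's actual code) of what such a visitor would compute:

```cpp
#include <cmath>
#include <vector>

// Fold a batch-norm/affine layer (gamma, beta, running mean/var) into the
// preceding convolution's filters and biases:
//   y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
//     = (scale*w)*x + scale*(b - mean) + beta,  scale = gamma / sqrt(var + eps)
void fuse_conv_batchnorm(std::vector<float>& filters,  // num_filters * filter_size
                         std::vector<float>& biases,   // one per filter
                         const std::vector<float>& gamma,
                         const std::vector<float>& beta,
                         const std::vector<float>& mean,
                         const std::vector<float>& var,
                         float eps, std::size_t filter_size)
{
    for (std::size_t f = 0; f < biases.size(); ++f)
    {
        const float scale = gamma[f] / std::sqrt(var[f] + eps);
        for (std::size_t i = 0; i < filter_size; ++i)
            filters[f * filter_size + i] *= scale;
        biases[f] = scale * (biases[f] - mean[f]) + beta[f];
    }
}
```

After that, the affine layer can be set to the identity so inference skips the extra pass.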
No. Need to have new layers for that.
Won't a layer visitor do the job?
Yeah, or that with appropriate updates to the code.
@arrufat can you run your benchmark again but disable […]?
@pfeatherstone after doing what you suggested, I go from 70 FPS to 60 FPS, so there's some room for improvement there :)
That's promising. But it might still suggest it's something else causing the bottlenecks. I'll try adding the visitor this weekend; it shouldn't take too long. It also requires adding "bypass" functionality in […]
Could it be possible to make a new layer similar to […]? Then we could assign a network defined with […]
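If I remember right, dlib can already do something close to this: the affine_ layer is constructible from a bn_ layer, so a network trained with bn_con can be assigned to an otherwise identical network that uses affine. A sketch (the layer stack is just illustrative):

```cpp
#include <dlib/dnn.h>
using namespace dlib;

using train_net = loss_multiclass_log<fc<10,
    relu<bn_con<con<16,3,3,1,1, input<matrix<unsigned char>>>>>>>;
using infer_net = loss_multiclass_log<fc<10,
    relu<affine<con<16,3,3,1,1, input<matrix<unsigned char>>>>>>>;

int main()
{
    train_net tnet;
    // ... train tnet ...
    infer_net inet = tnet;  // each bn_con's running statistics are folded
                            // into the corresponding affine layer
}
```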
Can do. I have no strong opinions. I imagine there would be no runtime cost if there was a flag, due to branch prediction always guessing correctly (presumably there would be an if-statement around the flag: if true, simply forward input to output). Honestly, it makes no difference to me; whichever is the most expressive. Your way requires a new type, which means the whole network is a new type, which means the compiler has to compile yet another gigantic type, which means I have to wait another 15 minutes for clang to build yolov3. But at this stage, ±15 minutes for building networks in dlib isn't a biggy.
Yes, I agree, compile times are getting a bit out of hand for big YOLO models (such as the recently published improvements to YOLOv4). Maybe having an extra branch in each […]

Regarding the compile times, that's why I build each model as a separate library and then link to it, so I don't have to rebuild it every time I change the code somewhere else. Here are the sizes of the compiled models; yolov3 is really tiny compared to the latest yolov4x_mish... […]
If I had a couple of months of free time, I would roll up my sleeves and propose a new functional API for DNNs in dlib, using dynamic polymorphism instead of static polymorphism for neural networks. I think that would solve a lot of frustrations, including compile times. I can see the benefits of using templates: the compiler gets the whole network to optimise. But with large models, as you said, it gets out of hand. Having a functional API similar to PyTorch, for example, would make DNNs more accessible in dlib, I think. But this would require a huge amount of time to get right.
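For illustration, the core of such a dynamic design would be a type-erased layer in the style of Sean Parent's runtime-polymorphism pattern. A purely hypothetical sketch, nothing like dlib's current API:

```cpp
#include <functional>
#include <memory>
#include <vector>

struct tensor_t { std::vector<float> data; /* shape omitted for brevity */ };

class any_layer
{
public:
    // Any type with a forward(tensor_t) member can be stored, no inheritance
    // required on the caller's side.
    template <typename Layer>
    any_layer(Layer l) : self(std::make_shared<model<Layer>>(std::move(l))) {}

    tensor_t forward(const tensor_t& x) const { return self->forward(x); }

private:
    struct concept_t
    {
        virtual ~concept_t() = default;
        virtual tensor_t forward(const tensor_t&) const = 0;
    };
    template <typename Layer>
    struct model final : concept_t
    {
        model(Layer l) : layer(std::move(l)) {}
        tensor_t forward(const tensor_t& x) const override { return layer.forward(x); }
        Layer layer;
    };
    std::shared_ptr<const concept_t> self;
};

// A network then becomes a runtime list of layers, e.g.
//   std::vector<any_layer> net = { conv(...), relu(), conv(...) };
// so adding a layer never changes the network's C++ type.
```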
But that would require a lot of work on the […]
It's still impressive that a single model compiles to nearly 2 MB of binary. Maybe the bottlenecks are caused by code bloat? I don't know; I've never properly looked at the effects of binary size on performance.
Honestly, I really like the declarative way of defining networks in dlib. Even if it requires some work to add new layers, it's worth it because: […]
Other than the compile times, I think dlib's approach to neural nets is the best (but I might be biased :P)

EDIT: also, if at some point dlib is ported to C++20, we could use concepts to get better error messages when we make a mistake in the network definition. That would be awesome.
I could be wrong. I'm sure they do work, but they are definitely not optimised for other architectures.
You can do label smoothing using the […] loss. Quoting from the paper: […]
I'm sorry, I missed that. Presumably the labels can be any real number in the range [-1, +1]?
They can be any real value, just positive for one class and negative for the other, and the absolute value would be the weight of the label. So if you have a dataset with a 1:2 imbalance, you can set the labels to 0.67 and -1.34, or something like that.
I thought the smoothed version of cross entropy still assumed a softmax layer during inference, like what is done in torch […], whereas the loss in dlib is using a sigmoid, right? I don't know which is best, but there is definitely functionality there already in dlib, you're right. I'm not trying to start another religious war, don't worry. Just throwing ideas out there; if you don't find them appropriate, I will remove them.
From what I've seen, if you want to output a single label per image, training using softmax or sigmoid leads to almost identical results. I prefer the latter, since it's a more generic approach. It can deal with the cases when: […]
In fact, the classification part of YOLO uses a sigmoid, and generally we train it with bounding boxes that only have one label. If you check that ResNet paper, you'll see the experiments between what they call Cross Entropy (CE) and Binary Cross Entropy (BCE), which correspond to the […]
Interesting, yeah, I see your point. I haven't read that paper. I read https://arxiv.org/pdf/1905.04899.pdf
I was checking the YOLOv5 code, and it uses the CIOU loss, as I thought (it's even hard-coded). I think the other ones don't make much sense if you have the CIOU loss at your disposal.
Oh OK, they must have changed it recently. For a long time I was following the code, and even though they supported all 3 IOU losses, they used GIOU. Maybe that was in their YOLOv3 repo. The author even said in an issue somewhere that CIOU provided no benefits over GIOU. But that might not be the case anymore.
In any case, I'm not instigating a debate here; I just think it would be a nice addition to support all 3. Again, I'm not suggesting you do them, it's just an idea. It could very well be the case that for some models GIOU gives you equal performance with less compute.
I haven't used any of those libraries, so I'm not 100% sure what you mean. I think a lot of the tools in boost get kind of carried away with "hey this is neat" vs "this is really the best way to deliver value to humans", so I don't end up using boost very much :|
I haven't. I assume it's a proper LBFGS implementation, though.
Oh, that's an SGD tool that uses the LBFGS equation to compute the step direction. That's not really "LBFGS" though. LBFGS is supposed to be used with a line search and is not an SGD-type algorithm; it does not operate on "batches".
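For reference, this is the shape of proper LBFGS in dlib's own optimisation API: a full-batch minimiser driven by a line search (the Rosenbrock objective is just a stand-in):

```cpp
#include <dlib/optimization.h>
#include <cmath>

int main()
{
    using namespace dlib;
    typedef matrix<double,0,1> column_vector;

    // Classic Rosenbrock test function; every evaluation sees the full
    // objective -- there are no mini-batches anywhere.
    auto rosen = [](const column_vector& m) -> double
    {
        const double x = m(0), y = m(1);
        return 100.0 * std::pow(y - x * x, 2) + std::pow(1 - x, 2);
    };

    column_vector start(2);
    start = -1.2, 1.0;

    // lbfgs_search_strategy(10) keeps 10 curvature pairs; find_min drives
    // the line search until the objective stops improving.
    find_min_using_approximate_derivatives(
        lbfgs_search_strategy(10),
        objective_delta_stop_strategy(1e-7),
        rosen, start, -1);
}
```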
Neither of them are boost libraries, actually. TE is 300 lines of code and that's it. I would check out TE, it's really cool: it's dynamic polymorphism without inheritance, so non-intrusive. I think you would like it a lot. But it requires C++17. You can make it C++14 with a couple of mods, but it requires a bit more work to make it C++11. With it, any kind of type erasure is trivially done.
Yeah, it's clever. It seems like a hack around not having template concepts (which are in C++20 now) though. Like, use template concepts, they ought to be great. Like, update your compiler, you know :D dlib has to wait a long time to update, but end users, which includes us when not adding code to the dlib repo, can use all the newest stuff. I use C++17 at work and it's great.
It's not the same as concepts, though. It's closer to inheritance than it is to templates; it's a generalisation of std::function, but for any kind of object. I think that's the best way I can explain it. std::function isn't a template hack, it's achieving something else. But yeah, I see your point, I could just use that library and accept the C++17 requirement. I just like to keep stuff as portable as possible; I still have to support old compilers for some of my stuff at work. That's why I've submitted a few PRs in the past that add C++14 and C++17 backports.
And I don't quite understand the argument for updating my compiler. I've done that before: I build it, then run the binary on another Linux box, and it complains it can't find the right version of glibc or something. And aggressively statically linking everything doesn't always work. So I always just use the default compiler for a system.
It would be great if gcc and clang supported a feature like "clang update" or "gcc update" and everything just worked and I had all the latest features. That is where modern languages shine: they've embraced the bleeding edge a bit more. I would really like it if C++ no longer supported dynamic linking. Disk space is no longer an issue; unless you're dynamically loading modules at runtime, I see no benefit in shared libraries. For me they cause problems and mean I can't really do stuff like updating my compiler and expect everything to work.
Anyway, silly rant over. I proposed a type erasure utility since you mentioned that IF you were to re-write the DNN stuff, you would use a more dynamic model, likely using type erasure (instead of inheritance) to record forward (and backward?) functions. So it seemed appropriate to mention it.
Sure, but why not use templates? Aside from being able to hide some implementation details outside a header, I would prefer template concepts to this most of the time. I mean, I use […]
Yeah, I know this pain well :|
I mean yeah, but getting a new compiler is great when you can get it :D
Depends on your domain. For instance, I work on a large software system that really needs dynamic linking. It's a bunch of separate processes; linking it all into one big monster process would be less good, since it is a safety-critical system. We don't want one thing blowing up to take down the whole thing, so that process isolation is a big deal. And at the same time, if everything was statically linked we would quite literally run out of RAM. But yeah, for most applications static linking is the way to go. Or mostly-static linking, anyway.
Yeah, I would definitely use type erasure if rewriting the DNN stuff. I deliberately didn't use type erasure the first time, since type erasure makes serialization more complicated. Not a lot more complicated, but still more complicated; it makes a lot of stuff a little bit more complicated. But still, that was not optimizing the right thing. I didn't expect DNNs to end up being as large and crazy as they got, and at those sizes and complexities type erasure is definitely the better way to go.
I would watch Sean Parent's talk "Inheritance Is the Base Class of Evil" or Louis Dionne's CppCon talk on Dyno. They explain it better than I do.
@pfeatherstone I just noticed (by chance) that you can load a network that differs in its definition only in the number of repeated layers. I will check if I can add an option to change the number of repetitions in the repeat layer at runtime. @davisking, do you think that's feasible?
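For context, the repetition count in dlib's repeat layer is currently a template parameter, so nets differing only in that count are distinct C++ types. A sketch (the block itself is illustrative):

```cpp
#include <dlib/dnn.h>
using namespace dlib;

template <typename SUBNET> using block = relu<con<16,3,3,1,1,SUBNET>>;

// Two different types, even though only the repetition count differs:
using net4 = loss_multiclass_log<fc<10, repeat<4, block, input<matrix<unsigned char>>>>>;
using net8 = loss_multiclass_log<fc<10, repeat<8, block, input<matrix<unsigned char>>>>>;
// A runtime repetition count would let one type cover both.
```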
That would be awesome!
It would also be cool if the batch-normalization layers had a runtime option to set them to training mode or evaluation mode (affine layer behaviour), similar to other NN frameworks. That way you would only ever need one network definition, not two (one for training and one for evaluation). Then you could do something like […]
Very quickly, going back to enhanced […]
Yes, there's a technical reason. First, I wanted to perform the sigmoid operation inside the loss function, but the […]
Yeah, probably not a problem.
Just putting some suggestions out there. Maybe we could organize these into projects.
- fuse_conv_batchnorm visitor for enhanced performance during inference