Loading the frontal face shape predictor calls 135053 times malloc. #2919
Comments
You sure you are building with optimizations turned on? For instance, I
get a load time of 480ms for loading the 68 point model.
…On Wed, Feb 28, 2024 at 11:30 AM Martijn Courteaux ***@***.***> wrote:
Main idea
I am not familiar with the dlib codebase, but it seems there is some
mem_manager stuff happening in quite a few places. As the whole
dlib::deserialize<> traversal is doing a bunch of small mallocs/news,
this is ideal for a bump allocator (a.k.a. "memory arena").
I get that it's not trivial to integrate that into the STL containers
being used. The standard library (since C++17) offers "polymorphic memory
resources" in the std::pmr:: namespace, which support bump allocators.
However, most allocations happen inside dlib::matrix (I estimate 70% of
them).
So, I instrumented operator new() and operator delete() to keep track of
these things. The result is that during loading of the frontal face shape
predictor here is what happens:
- 135053 allocations
- 4 frees
- total allocation 68MB.
- average allocation size: 515 bytes per malloc.
- 70% of those allocations all happen inside dlib::matrix.
Overall, I'd argue that this is bad for performance.
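For readers who want to reproduce this kind of measurement, here is a minimal sketch of counting allocations by replacing the global operator new and operator delete. It is not the instrumentation used for the numbers above; the names AllocStats, measure and demo_small_blocks are hypothetical, and the workload is a stand-in for the deserialization call.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical helper names (AllocStats, measure, demo_small_blocks);
// none of this is part of dlib.
struct AllocStats {
    std::size_t allocations;
    std::size_t frees;
    std::size_t bytes;
};

static std::atomic<std::size_t> g_allocations{0};
static std::atomic<std::size_t> g_frees{0};
static std::atomic<std::size_t> g_bytes{0};
static void* volatile g_sink = nullptr;  // discourages the optimizer from eliding allocations

// Replacing the global operator new routes every `new` in the program,
// including those made inside library containers, through these counters.
void* operator new(std::size_t size) {
    g_allocations.fetch_add(1, std::memory_order_relaxed);
    g_bytes.fetch_add(size, std::memory_order_relaxed);
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept {
    if (p) g_frees.fetch_add(1, std::memory_order_relaxed);
    std::free(p);
}

// Sized delete forwards to the unsized version so each free is counted once.
void operator delete(void* p, std::size_t) noexcept { ::operator delete(p); }

// Runs `work` and returns the allocation activity it caused.
template <class Work>
AllocStats measure(Work&& work) {
    const std::size_t a0 = g_allocations.load();
    const std::size_t f0 = g_frees.load();
    const std::size_t b0 = g_bytes.load();
    work();
    return {g_allocations.load() - a0, g_frees.load() - f0, g_bytes.load() - b0};
}

// Stand-in workload: n small blocks of 128 floats, similar in size to the
// roughly 515-byte average reported above.
AllocStats demo_small_blocks(std::size_t n) {
    return measure([n] {
        std::vector<std::vector<float>> blocks;
        blocks.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            blocks.emplace_back(std::size_t{128}, 0.0f);
        }
        g_sink = blocks.data();
    });
}
```

Calling demo_small_blocks(1000) should report roughly 1001 allocations (one for the outer vector plus one per block). To measure the real thing, wrap the dlib::deserialize call in measure instead of the stand-in workload.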
------------------------------
I actually tested it, and replaced the default operator new behavior by
using a bump allocator (memory arena), and the load time went from 1.75s to
1.18s, which is a 48% increase in loading speed (about 33% less load time).
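As an illustration of the technique, here is a minimal bump allocator sketch. It is not the allocator used in the test above and not part of dlib; BumpArena is a hypothetical name, and a fixed capacity chosen up front is an assumption of this sketch. The standard library's std::pmr::monotonic_buffer_resource follows the same idea and can grow on demand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical arena type for illustration; not part of dlib.
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity)
        : buffer_(new unsigned char[capacity]), capacity_(capacity), offset_(0) {}

    // Returns `size` bytes aligned to `alignment` (must be a power of two).
    // Throws std::bad_alloc when the arena is exhausted.
    void* allocate(std::size_t size,
                   std::size_t alignment = alignof(std::max_align_t)) {
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_.get());
        const std::uintptr_t current = base + offset_;
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        const std::uintptr_t aligned = (current + mask) & ~mask;
        const std::size_t start = static_cast<std::size_t>(aligned - base);
        if (size > capacity_ || start > capacity_ - size) throw std::bad_alloc{};
        offset_ = start + size;  // "bump" the offset; no per-block bookkeeping
        return reinterpret_cast<void*>(aligned);
    }

    // Individual blocks are never freed; everything is released at once.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::unique_ptr<unsigned char[]> buffer_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

An allocation is just an alignment round-up and an addition, which is why this pattern suits a workload with many small allocations and almost no frees. The trade-off is that memory is only reclaimed when the whole arena is reset or destroyed.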
Anything else?
*No response*
|
Sorry, my reported times were actually from both the "frontal_face_detector" AND the "shape_predictor_68_face_landmarks" together. Let me break down more clearly what's happening:
So, my timings were skewed because I was recording the allocations and frees in too much detail. I don't know how you manage to load the 68 point model so quickly. I'm using this snippet:

    std::string path = "shape_predictor_68_face_landmarks.dat";
    try {
        // Stream the serialized model from disk into the shape predictor.
        dlib::deserialize(path) >> m_internals->sp_face_landmarks;
        return true;
    } catch (const dlib::serialization_error &e) {
        spdlog::error("Could not load {}: {}", path, e.what());
        return false;
    }
|
I was just doing Anyway, you shouldn't need to worry about this startup time right? Just don't do it more than once? |
Warning: this issue has been inactive for 35 days and will be automatically closed on 2024-04-14 if there is no further activity. If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search. |
I indeed do it once, but this is a very expensive wait time of 1100ms. My computer can read more than 1GB/s (sequential reading) from SSD. The thing we are loading is 70MB, which should take less than 70ms, not 1142ms. Of course, I'm aware of the base64-decode happening for the FFD. Overall, what I'm trying to say is that this way of making it user-friendly (read: programmer-friendly) is actually making it unsuitable for production code. It's a bad user experience if this thing takes 1.1s to load 70MB of coefficients. |
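The arithmetic behind that comparison can be sketched as follows. The function names are hypothetical, and the 1 GB/s figure is treated as 1000 MB/s for simplicity.

```cpp
#include <cassert>
#include <cmath>

// Milliseconds needed to read `size_mb` megabytes at `mb_per_s` MB/s.
double ideal_read_ms(double size_mb, double mb_per_s) {
    return size_mb / mb_per_s * 1000.0;
}

// Throughput in MB/s implied by loading `size_mb` megabytes in `elapsed_ms`.
double effective_mb_per_s(double size_mb, double elapsed_ms) {
    return size_mb / (elapsed_ms / 1000.0);
}
```

With the numbers from the comment above, ideal_read_ms(70, 1000) gives 70 ms, while effective_mb_per_s(70, 1142) gives about 61 MB/s, roughly 16 times slower than the sequential read speed of the disk.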
I was never inconvenienced by the loading time of the shape predictor model. Out of curiosity, I just timed how long it takes on my machine, and it's about 350 ms. Here's what I did, using the webcam face pose example program. Add this at the top:

    #include <chrono>
    using fms = std::chrono::duration<float, std::milli>;

Time the loading:

    const auto t0 = std::chrono::steady_clock::now();
    deserialize("shape_predictor_68_face_landmarks.dat") >> pose_model;
    const auto t1 = std::chrono::steady_clock::now();
    cout << "shape predictor loaded in " << chrono::duration_cast<fms>(t1 - t0).count() << " ms\n";
|
Warning: this issue has been inactive for 35 days and will be automatically closed on 2024-05-22 if there is no further activity. If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search. |
Notice: this issue has been closed because it has been inactive for 45 days. You may reopen this issue if it has been closed in error. |
Solved by ignoring it long enough. Nice. |