
Faster statm reading on linux #133

Closed
treo opened this issue Oct 6, 2016 · 7 comments

@treo

treo commented Oct 6, 2016

@akhodakivskiy has experienced that reading from /proc/self/statm seems to stall his computations while using dl4j.

I have investigated this a bit further and found that re-opening the file on each call, as is currently done, isn't efficient. Averaged over 64K calls, it takes about 1200 ns per call to open, read, and close the file. I then tried moving the fopen/fclose calls out of the loop and simply using rewind to seek back to the beginning. This brought the average read down to about 500 to 600 ns, but it no longer read the correct values.

Using open/close instead, with lseek and read+sscanf, i.e. reading with unbuffered calls, I get the correct value on each iteration, and on average it takes about 350 ns.

So I think it would be nice if reading the memory usage on Linux could be done without opening and closing the statm file each time data is needed.

@akhodakivskiy

Do you think it's feasible to track all allocations from within javacpp? E.g. create a wrapper around malloc that increments allocation size?

@saudet

saudet commented Oct 7, 2016

It may be possible, though it would be hard, but that's not the point. We need to know how much physical memory is used, because that's what, for example, YARN checks when deciding whether to kill a process.

@saudet

saudet commented Oct 8, 2016

By leaving the file descriptor open, along with a couple of other changes, I've managed to make the call 20 times faster, so it should be satisfactory now. @akhodakivskiy Let me know if this call still causes problems though. Thanks!

@saudet

saudet commented Oct 8, 2016

Tested with this code:

import org.bytedeco.javacpp.*;
import org.bytedeco.javacpp.annotation.*;

@Platform
public class TestPhysical {
    public static void main(String[] args) {
        Loader.load();
        long time = System.nanoTime();
        for (int i = 0; i < 1000000; i++) {
            Pointer.physicalBytes();
        }
        System.out.println((System.nanoTime() - time) / 1000000 + " ns " + Pointer.physicalBytes() + " bytes");
    }
}

@akhodakivskiy

It looks much better: deallocator() now accounts for 7% of CPU time, down from 70% in the previous version. Not sure what the desired level is, though?

On the threads screenshot below the red is waiting for deallocator()

[Two screenshots of thread profiles, taken Oct 11, 2016, showing time spent waiting on deallocator()]

@saudet

saudet commented Oct 11, 2016

Well, memory allocation itself isn't free and the cost is sometimes borne at deallocation time...

@saudet

saudet commented Nov 14, 2016

In any case, the faster implementation is now in version 1.2.5, which seems to be taking about as much time as native memory allocation itself. Please let me know if you find a case where this is still a bottleneck though. Thanks for reporting this issue!

@saudet saudet closed this as completed Nov 14, 2016