Zero-copy and 32-bit address spaces #182

Closed
fitzgen opened this issue Feb 7, 2017 · 17 comments

fitzgen (Member) commented Feb 7, 2017

gimli is zero-copy [0], which means that whole ELF/Mach-O sections must be loaded into memory in their entirety before they are handed to gimli. Combined with large object files and large debug sections, this is problematic on 32-bit architectures, which have a limited address space.

Some Firefox folks are investigating using gimli to symbolicate Firefox's built-in profiler's backtraces, and this is a limiting issue. libxul.so alone is 1.2 GB, and there just isn't enough contiguous VM address space to mmap it whole (and often plain not enough physical memory available).

Julian Seward (one of said Firefox folks) tells me he had this same issue with Valgrind: it used to map whole object files, and it constantly failed on 32-bit architectures (particularly ARM). He replaced Valgrind's use of raw base pointers and offsets with an abstraction layer to read individual values. Under the hood, the abstraction is basically a direct-mapped cache with a few very large (64K) lines. Eventually, they even gained swappable backend implementations for the abstraction, one of which transparently fetched debug info over the network.
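For concreteness, here is a rough Rust sketch of that kind of direct-mapped cache. All names, the line count, and the I/O details are hypothetical (and Unix-only); this is not Valgrind's or gimli's actual code:

  use std::fs::File;
  use std::io;
  use std::os::unix::fs::FileExt; // for read_at (Unix-only in this sketch)

  const LINE_SIZE: u64 = 64 * 1024; // 64K lines, as in Valgrind
  const NUM_LINES: usize = 16;      // "a few" lines; the exact count is a guess

  struct CacheLine {
      tag: u64,        // file offset of the first byte held in `data`
      valid: bool,
      data: Box<[u8]>, // always LINE_SIZE bytes
  }

  struct DirectMappedCache {
      file: File,
      lines: Vec<CacheLine>, // NUM_LINES entries
  }

  impl DirectMappedCache {
      // Direct mapping: each file offset can live in exactly one line.
      fn line_index(offset: u64) -> usize {
          ((offset / LINE_SIZE) as usize) % NUM_LINES
      }

      // Read one byte, refilling the corresponding line from the file on a miss.
      fn read_u8(&mut self, offset: u64) -> io::Result<u8> {
          let idx = Self::line_index(offset);
          let tag = offset - offset % LINE_SIZE;
          let line = &mut self.lines[idx];
          if !line.valid || line.tag != tag {
              // Miss: load this slice of the file. A short read near the end
              // of the file just leaves the tail of the line unused.
              line.data.fill(0);
              self.file.read_at(&mut line.data, tag)?;
              line.tag = tag;
              line.valid = true;
          }
          Ok(line.data[(offset - tag) as usize])
      }
  }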

So: can we revisit the zero-copy design and take an approach similar to what Valgrind did?

I think we could abstract reading raw values behind a Reader trait, and provide an MmapWholeFile implementation of that trait to (mostly) maintain the status quo. (It isn't clear to me whether we would be able to keep all of the existing zero-copy paths, however.) Then we could have a second Reader implementation, DirectMappedCache, along the lines of what Valgrind does.

This will likely introduce more error propagation and try!s within gimli code, but I suspect that it won't change the API for users very much if at all, since parsing is already fallible.
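To make that shape a bit more concrete, here is a minimal, hypothetical sketch (not a proposed final API) of what such a trait and its whole-file backend could look like:

  use std::borrow::Cow;

  // Hypothetical error type for reads past the end of a section.
  #[derive(Debug)]
  pub struct OutOfBounds;

  // All raw reads in gimli would go through this trait.
  pub trait Reader {
      // Total length of the section data.
      fn len(&self) -> u64;

      // Read `count` bytes at `offset`, borrowing when the backing storage
      // permits and copying (e.g. out of a cache line) when it does not.
      // Takes `&mut self` so that caching backends can refill their lines.
      fn read(&mut self, offset: u64, count: u64) -> Result<Cow<'_, [u8]>, OutOfBounds>;
  }

  // Whole-file backend: reads can stay zero-copy borrows. A real version
  // would wrap an mmap'd region rather than a Vec.
  pub struct MmapWholeFile {
      data: Vec<u8>,
  }

  impl Reader for MmapWholeFile {
      fn len(&self) -> u64 {
          self.data.len() as u64
      }

      fn read(&mut self, offset: u64, count: u64) -> Result<Cow<'_, [u8]>, OutOfBounds> {
          let start = offset as usize;
          let end = start.checked_add(count as usize).ok_or(OutOfBounds)?;
          self.data.get(start..end).map(Cow::Borrowed).ok_or(OutOfBounds)
      }
  }

  // A DirectMappedCache like the one sketched above would implement the same
  // trait, returning Cow::Owned for any range not resident in a single line.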

As far as performance goes: I'll quote Julian's recounting of his experience doing this transition in Valgrind:

FWIW, we got a perhaps 25% slowdown in reading line, CFI and symbol information once the abstraction layer was in, which was tolerable. In my experience with Valgrind, for most practical uses, the cost of post-processing is far greater than the cost of reading the image.

My hope is that we could mitigate even that slowdown by defaulting to the MmapWholeFile implementation of the Reader trait on 64-bit architectures in our addr2line and dwarfdump ports.

Finally, if we decide to continue down this road, Julian has volunteered to contribute the necessary changes.

Thoughts @philipc @jonhoo @tromey @khuey @kamalmarhubi @jvns @jimblandy?

Thanks!


[0] True, a bunch of stuff can't be and isn't actually zero-copy, like LEB128 parsing, but we avoid copies as much as possible right now.

@jimblandy

Does Gimli have any established "realistic" loads with which to assess the cost?

If you're going to reuse cache lines, then parsing will definitely have to stop borrowing data out of the file's mmap'd image.

Have you looked into mapping only the portions of the executables that contain the debug info, instead of mapping the whole file?

fitzgen (Member, Author) commented Feb 7, 2017

Does Gimli have any established "realistic" loads with which to assess the cost?

I think the closest would be our addr2line benchmarks:

https://github.com/gimli-rs/addr2line#performance
https://github.com/gimli-rs/addr2line/blob/master/benchmark.sh

Perhaps @khuey, @jvns, and @kamalmarhubi have more here.

If you're going to reuse cache lines, then parsing will definitely have to stop borrowing data out of the file's mmap'd image.

I have vague hopes for using std::borrow::Cow to help cut down on copies when possible. But yes, we would be giving up the ability to indefinitely borrow from the mapped object file.

Have you looked into mapping only the portions of the executables that contain the debug info, instead of mapping the whole file?

I have not.

It should be possible to read just the object file's header to get the offset ranges for each section and then map only those sections into memory. If a section is very large, however, we are back at square one.

I'm not sure whether it is worth pursuing this approach or not.

khuey (Contributor) commented Feb 7, 2017

My libxul's .debug_info is 500 MB (the total binary is 900 MB, .text is 110 MB), so I don't think mapping only the relevant sections saves you much in the grand scheme of things.

Unfortunately I don't have performance data for anything other than toy programs yet with my stuff.

philipc (Collaborator) commented Feb 7, 2017

Can we change the API to return offsets instead of references, so that any copies are delayed? For the mmap case, you would still be able to turn that into a zero-copy reference.

fitzgen (Member, Author) commented Feb 7, 2017

Can we change the API to return offsets instead of references, so that any copies are delayed? For the mmap case, you would still be able to turn that into a zero-copy reference.

"the API" == the hypothetical Reader trait's interface, right?

So if we wanted to read 10 contiguous bytes from the file, we return (a newtype of) a tuple (start_offset, 10) and delay grabbing the data until we need it for the next operation (if ever)?

This would work great for strings we get out of .debug_str. It implies an extra step when doing something like parsing line number program opcodes, but I don't see any reason it would imply additional runtime cost vs returning Cow.

philipc (Collaborator) commented Feb 7, 2017

"the API" == the hypothetical Reader trait's interface, right?

I meant anything in gimli's API that returns &'input. This would of course also affect internal usage of &'input.

So the idea is we return the offset instead of immediately doing an allocation+copy. For the non-mmap case, you'll probably still need to do the allocation+copy if you want to read it, but you may not always want to.

This would work great for strings we get out of .debug_str. It implies an extra step when doing something like parsing line number program opcodes

Yes, this would be for things like .debug_str. For line number program opcodes, we're already copying them, so it wouldn't affect that (the only reference I can see in the line number parsing is the file names).

philipc (Collaborator) commented Feb 11, 2017

I had a play with converting the API to use offsets, and it didn't work out very well. The problem is that we need to support values that might be offsets into one of a number of sections. So the API probably needs to return a struct of { section: &'input Reader, offset: usize } instead, where each section potentially (but not necessarily) has a different Reader.
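A minimal sketch of that kind of handle (hypothetical names, not a final API):

  // Instead of returning `&'input [u8]`, the parser hands back the section the
  // value came from plus an offset into it; the caller dereferences it later.
  pub struct SectionOffset<'input, R> {
      // The reader for whichever section this value lives in
      // (.debug_str, .debug_info, ...); each section may use a different R.
      pub section: &'input R,
      // Byte offset of the value within that section.
      pub offset: usize,
  }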

julian-seward1 commented Feb 20, 2017

To make this a bit more concrete, here's a precis of what I did
in Valgrind. Taken from coregrind/m_debuginfo/priv_image.h, if
you want to look.

We have an abstract type DiImage (Debug Info Image), and a 64-bit
unsigned int type for denoting offsets in it:

  typedef  struct   _DiImage  DiImage;
  typedef  uint64_t           DiOffT;

You can create an image from a local file ..

  DiImage* ML_(img_from_local_file)(const HChar* fullpath);

.. and by other means not relevant here (by connecting to a
debuginfo-server over TCP/IP)

Then you can ask some basic stuff

  /* Virtual size of the image. */
  DiOffT ML_(img_size)(const DiImage* img);

  /* Does the section [offset, +size) exist in the image? */
  Bool ML_(img_valid)(const DiImage* img, DiOffT offset, SizeT size);

You can copy arbitrary chunks out of the image:

  /* Get info out of an image.  If any part of the section denoted by
     [offset, +size) is invalid, does not return. */
  void ML_(img_get)(/*OUT*/void* dst,
                    DiImage* img, DiOffT offset, SizeT size);

which is general, but pretty damn awkward. So there are various
kinds of convenience functions:

  /* Fetch various sized primitive types from the image.  These operate
     at the endianness and word size of the host. */
  UShort ML_(img_get_UShort)(DiImage* img, DiOffT offset);

  /* Do strlen of a C string in the image. */
  SizeT ML_(img_strlen)(DiImage* img, DiOffT off);

(and many more)

and you can do some not-so-obvious things that have to be done in
the implementation itself for the reasons described in the comment
(assuming we're connected to a remote debuginfo server)

  /* Calculate the "GNU Debuglink CRC" for the image.  This
     unfortunately has to be done as part of the DiImage implementation
     because it involves reading the entire image, and is therefore
     something that needs to be handed off to the remote server -- since
     to do it otherwise would imply pulling the entire image across the
     connection, making the client/server split pointless. */
  UInt ML_(img_calc_gnu_debuglink_crc32)(DiImage* img);

and finally of course

  /* Destroy an existing image. */
  void ML_(img_done)(DiImage*);

@julian-seward1

Ecch. The previous comment window seems to have auto-transformed various
instances of "DiImage STAR" (pointer-to-DiImage) into plain DiImage (i.e. the
star has disappeared).

fitzgen (Member, Author) commented Feb 20, 2017

(I edited the comment for formatting.)

@julian-seward1

The downside of the above is of course that you have to copy out
everything that you want to keep from the image. In practice I don't
think that's a big deal. In our implementation the main challenge was
to implement DiImage using a cache of slices of the file (per comments
above) fast enough so that there wasn't too much of a slowdown
compared with the mmap-the-whole-thing approach. In the end I think
it was about 20%-30% slower, which isn't great, but it wasn't a showstopper
either, and I never even bothered to implement DiImage-via-mmap
even for 64-bit targets.

The fact that this all had to be done in C (hence, no templates, traits,
or compiler-assisted specialisation) made the perf problems more difficult.
In Rust you might be able to do better.

Obviously the slowdown depends on how much stuff you haul out of
the image. In Valgrind's case it is: the ELF symbol tables and (obviously)
the associated strings, Dwarf unwind info, and Dwarf inline-unwind info.

@julian-seward1

Oh, and the Dwarf line number info too.

fitzgen (Member, Author) commented May 3, 2017

Status update: AIUI, @julian-seward1 went with a more incremental approach using code that already exists in mozilla-central.

This is still something that I'd like to have sometime before any hypothetical 1.0 release (which we haven't talked about at all and which seems very far in the future). Also, I haven't had many cycles for gimli and related projects lately, so I probably won't be doing this myself super soon either.

@julian-seward1

The Valgrind implementation of the segmented image machinery recently got extended to handle compressed debuginfo images (revs 15868, 16279), which is useful for gcc >= 7.0 but was surprisingly tricky to make work well.

luser (Contributor) commented Jun 8, 2017

I have vague hopes for using std::borrow::Cow to help cut down on copies when possible. But yes, we would be giving up the ability to indefinitely borrow from the mapped object file.

Assuming that the structs you're returning are POD (which I think they are?), you ought to be able to return Cow<'a, T> instead of &'a T and hide internally whether you're using mmap or not.

I keep thinking about this because I have yet to settle on the nicest way of parsing existing binary data formats in Rust. I was fiddling with something related that may be relevant:
https://gist.github.com/luser/c992337384fae88cd5d65f28176bfb07

In that example I was worried about returning references to unaligned data, which is not that hard to work around but does require a copy; you could use the same principle to hand back a copy when you're not using mmap.
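A rough sketch of that principle (FileHeader and read_header are hypothetical; a real implementation would probably reach for a crate rather than raw pointer casts):

  use std::borrow::Cow;
  use std::mem;

  // Hypothetical plain-old-data header.
  #[derive(Clone, Copy, Debug)]
  #[repr(C)]
  struct FileHeader {
      magic: u32,
      section_count: u32,
  }

  // Borrow the header when `bytes` happens to be suitably aligned (e.g. from
  // an mmap), otherwise copy it out; callers see a Cow either way.
  fn read_header(bytes: &[u8]) -> Option<Cow<'_, FileHeader>> {
      if bytes.len() < mem::size_of::<FileHeader>() {
          return None;
      }
      let ptr = bytes.as_ptr() as *const FileHeader;
      // Sound only because FileHeader is POD with no invalid bit patterns;
      // endianness handling is omitted here.
      if (ptr as usize) % mem::align_of::<FileHeader>() == 0 {
          Some(Cow::Borrowed(unsafe { &*ptr }))
      } else {
          Some(Cow::Owned(unsafe { ptr.read_unaligned() }))
      }
  }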

philipc (Collaborator) commented Jul 22, 2017

gimli currently uses usize for offsets and lengths. I think we need to change to using u64 in order to support large object files on 32-bit. Another option would be to make it an associated type of the Reader trait.
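A sketch of the associated-type option (hypothetical, not necessarily what gimli would end up with):

  // Each Reader implementation picks its own offset/length type, so a 32-bit
  // host can still use 64-bit offsets when the backing store needs them.
  pub trait Reader {
      // u64 for a backend over files larger than the host's address space;
      // usize is enough for a plain in-memory slice.
      type Offset: Copy + Ord;

      fn len(&self) -> Self::Offset;
      // ... read methods would take and return Self::Offset as well.
  }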

philipc (Collaborator) commented Jul 31, 2017

The Reader trait is now defined and implemented for EndianBuf. It would be nice to have another implementation of it, but I think that is outside the scope of this crate.

philipc closed this as completed Jul 31, 2017