Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize dwarf line numbers decoding #7413

Merged
merged 7 commits into from Feb 15, 2019

Conversation

asterite
Copy link
Member

@asterite asterite commented Feb 11, 2019

The first time an Exception's full backtrace (filenames, line numbers, etc.) must be built, the runtime will first decode all DWARF information from the dwarf file (or somewhere else? not sure).

I noticed this process takes about 6 seconds for the compiler's std specs (because it's a huge program, the bigger the program the bigger the dwarf file and the longer it will take to decode it), specifically you can see this doing:

$ bin/crystal build spec/std_spec.cr
$ ./std_spec -e "inspects with cause"
.

Finished in 5.88 seconds

I thought 6 seconds to decode the backtrace was too much so I started to dig in the code to see if I can improve it. I managed to improve part of the decoding process by about 6 times.

The improvement is explained in the PR code comments. There's a decode_sequences method in DWARF::LineNumbers. Previously it took 4.84 seconds to do it for the std_spec. Now it takes 0.82 seconds.

Now running that spec above looks like this:

$ bin/crystal build spec/std_spec.cr
14:05 $ ./std_spec -e "inspects with cause"
.

Finished in 1.34 seconds

Note that this improvement is only see the first time ex.inspect_with_backtrace is called for any exception. Later on the dwarf info should have been decoded. Still, it's good to win a lot of time in this first decoding.

I believe there might be more things to improve, but so far this is the easiest change that gives the most boost.

@yxhuvud
Copy link
Contributor

yxhuvud commented Feb 11, 2019

Replacing the implementation of #find at line 187 with a binary search looks like another low hanging fruit, assuming it is a nontrivial amount of rows.

@ysbaddaden
Copy link
Contributor

Impressive boost, especially since the files/directories arrays are usually small if I remember correctly.

@asterite
Copy link
Member Author

@yxhuvud Yes, find could use a binary search, I thought of it. That said, I don't understand why @matrix is defined like that, maybe pushing all Rows in a single array would be enough? Not sure.

But that's not the bottleneck, at least when I profiled it. The bottleneck are:

  • Debug::DWARF::Strings#decode: most of the time spent it gets and tell/pos (this can be optimized just a bit by using gets('\0', chomp: true) but I didn't want to make this PR too big
  • Debug::DWARF.read_unsigned_leb128: most of the time is spent in read_byte

I'm not sure there's something we can do about those two points. In both cases (I think?) the IO is repositioned and read, and because it's buffered an entire block is consumed. Maybe changing the buffer size could improve this because decode usually needs just a few bytes (it's just a name) and I guess the same applies to read_unsigned_leb128. But for that we should first make the buffer size of IO::Buffered configurable.

@asterite
Copy link
Member Author

asterite commented Feb 11, 2019

@ysbaddaden Yes, the arrays are small (about 600 entries for std_spec). The problem is that the register_filename and register_directory end up being called, at least in std_spec, 1 million times, and most of those calls are always with the same filename/directory that was passed the last time. In fact, another way I tried to optimize this is by only recomputing the file/path indexes when register.file was set, or when Register.new was done, or at the beginning of read_statement_program but the code ended up being a bit more complex and taking more time if I also didn't do the "is it the same as the last time" check.

src/debug/dwarf/line_numbers.cr Outdated Show resolved Hide resolved
src/debug/dwarf/line_numbers.cr Outdated Show resolved Hide resolved
@ysbaddaden
Copy link
Contributor

The matrix is defined as an Array(Array(Row)) to match the dwarf definition, but more importantly to skip a bunch of addresses to reach the actual block more quickly.

@asterite
Copy link
Member Author

I added another commit that introduces two temporary hashes that map dir/file-names to indexes, to avoid doing linear scans over @files and @directories. That reduces the time of Debug::DWARF::LineNumbers#decode_sequences from 0.82s to 0.78s.

# another file comes next, etc.). So we remember the last
# mapped index to avoid unnecessary lookups.
@last_filename : String?
@last_filename_index = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, if the files are read in some order (and not jumping from file to file, do @last_filename ever differ from @files[-1], (and similar for the index)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they differ. They are not read in order. I don't know why, though, I'm not familiar with the format. Maybe understanding the format first, then optimizing it might be better... 😊

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I recall correctly, the format follows functions (symbols). An instruction group of the function refers to a file:line, then the following group likely refers to the same file at a further line... unless the compiler inlined code, which will cause jumps to another file, etc.

If I recall correctly, that is.

@asterite
Copy link
Member Author

asterite commented Feb 12, 2019

I just pushed another commit with more improvements. I tried to do small commits for this but I was experimenting and in the end it was a bit hard for me to separate all the changes.

Essentially, I changed some UInt32 | UInt64 unions to just UInt64 (the cost of dealing with unions is bigger than just tracking the exact type). Another change is avoiding calling io.tell multiple times (this is expensive; I also used io.pos, I think we should remove the tell alias).

But the biggest change is in DWARF::Strings: we now use a small (16Kb) buffer to read strings from instead of seeking and reading from the IO (which is done a lot of times).

So now:

$ bin/crystal build spec/std_spec.cr
$ ./std_spec -e "inspects with cause"
.

Finished in 1.31 seconds

(down from 1.83s)

I'm almost sure more things can be optimized, but with these changes I'm pretty satisfied now.

@asterite
Copy link
Member Author

Meh, I'll redo the last commits in smaller commits because CI failed and I can't figure out why...

@jkthorne
Copy link
Contributor

@asterite why use 2 hashes? Directories and filenames would not collide. It seems like you could reduce this to one register method and one hash.

@asterite
Copy link
Member Author

@wontruefree files and directories are public and are used by other parts of the code that need the "index -> name" mapping.

@jkthorne
Copy link
Contributor

@asterite I am not sure I understand.

But if you have considered it and know why it won't work that is cool. I cannot make a PR into your fork but here is what I was thinking.

master...wontruefree:opt/dwarf-decoding

@asterite
Copy link
Member Author

@wontruefree the @files and @directories variables are exposed through the getters of LineNumbers and used here. In your code you just stop putting data there. Try it with this code:

begin
  raise "oh no"
rescue ex
  puts ex.inspect_with_backtrace
end

begin
  [1][2]
rescue ex
  puts ex.inspect_with_backtrace
end

Do you get something correct in the output?

@Sija
Copy link
Contributor

Sija commented Feb 12, 2019

@asterite CI just failed again with GC Warning: Failed to expand heap by 16777216 bytes, seems like crusade is not over yet...

@jkthorne
Copy link
Contributor

@asterite I see what you are saying.
I updated the branch and reduced the external interface to use a method called paths.

After thinking about this why not just make the index the instance variable and have the callstack use a hash lookup. Seems like that would be faster.

@asterite
Copy link
Member Author

I think some tests are allocating too much memory and it dies on Linux 32 bits. But I didn't start investigating that yet.

@asterite
Copy link
Member Author

I just pushed yet another commit. Thanks to what @wontruefree said I noticed that it doesn't make sense to build those files/directories arrays. Instead, we can store them directly inside each Row.

@ysbaddaden Do you see a problem with this?

src/callstack.cr Outdated Show resolved Hide resolved
@ysbaddaden
Copy link
Contributor

@asterite It's probably fine. I don't recall exactly why I used arrays and indices. Maybe it helped limit the overall memory usage by avoiding to keep duplicated string allocations?

@asterite
Copy link
Member Author

@ysbaddaden I don't think so because all of the strings are stored inside arrays inside Sequences so a same string shares the same memory for others. But maybe it wasn't like that at some point, I don't know. This way it seems to work :-)

Maybe if I have time I'll learn that format and write specs for it, so we can continue simplifying it (with more confidence). However, right now I think the times are more or less fine so I might focus on something else after this.

@ysbaddaden
Copy link
Contributor

Once the sequences are decoded we don't retain any references to them, so their @directories and @files arrays will be collected by the GC. I think most folder names and enough file names are shared across sequences that de-duplicating can be significant on memory usage, especially with large programs, but that should be verified.

@Sija
Copy link
Contributor

Sija commented Feb 13, 2019

@asterite Would you mind updating PR description with the newest run-time numbers?

@asterite
Copy link
Member Author

@Sija done!

@asterite
Copy link
Member Author

Good to go?

@asterite asterite added this to the 0.28.0 milestone Feb 14, 2019
@asterite
Copy link
Member Author

I'll merge it. Should I squash or not? I never know. I think each commit is meaningful by itself.

@ysbaddaden
Copy link
Contributor

Let's keep each commit. We could squash "simplify while loop" but that's nitpicking.

@asterite asterite merged commit d83fe1c into crystal-lang:master Feb 15, 2019
@asterite asterite deleted the opt/dwarf-decoding branch March 30, 2019 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants