Code extraction performance improvements #604

imjoehaines · 2020-07-15T14:21:14Z

Goal

This PR improves the performance of extracting code from files in a stacktrace

The performance improvements are especially relevant when a stacktrace has the same file multiple times (e.g. when chaining private methods) and when the files are long

Exactly how much better this is is impossible to say without looking at a specific stacktrace — it can be more than a 3x speedup if it mostly consists of the same few files and they are very long, but every stacktrace is different 🌈. I didn't observe any noticable slow down when a stackrace contained exclusively different files, but the new code does involve a bit more work in the worst case scenario

Changeset

Added

Code extraction is now handled by a CodeExtractor class. This exists to keep track of which files in the stacktrace we need to get code from (we will sometimes remove files from the processed stacktrace) and so that it can do a single pass per file, extracting all the regions of code that it needs in one go
The Stacktraceclass is now a module as it was always basically one method and was immediately converted to an array when it was called
A bunch of new tests and fixtures for code extraction and some untested parts of the stacktrace processing

Linked issues

Related to #481

Review

For the submitter, initial self-review:

Commented on code changes inline explain the reasoning behind the approach
Reviewed the test cases added for completeness and possible points for discussion
A changelog entry was added for the goal of this pull request
Check the scope of the changeset - is everything in the diff required for the pull request?
This pull request is ready for:
- Initial review of the intended approach, not yet feature complete
- Structural review of the classes, functions, and properties modified
- Final review

For the pull request reviewer(s), this changeset has been reviewed for:

Consistency across platforms for structures or concepts added or modified
Consistency between the changeset and the goal stated above
Internal consistency with the rest of the library - is there any overlap between existing interfaces and any which have been added?
Usage friction - is the proposed change in usage cumbersome or complicated?
Performance and complexity - are there any cases of unexpected O(n^3) when iterating, recursing, flat mapping, etc?
Concurrency concerns - if components are accessed asynchronously, what issues will arise
Thoroughness of added tests and any missing edge cases
Idiomatic use of the language

This also uses 'to_a' to ensure the order of code is correct. Hashes are compared without regard for order because usually the order doesn't matter. I don't think the API cares about the order because we explicitly number the lines, but it's useful to validate our code is doing what is expected

We now open each file in a stacktrace once to save opening the same file multiple times if it is in the stacktrace multiple times. It's quite common to have the same file in multiple stacktrace lines, e.g. when a private method raises. The performance gain is especially noticeable when a a really long file is repeated in the stacktrace as we have to iterate through each line to find the ones we're interested in e.g. a trace may look like this: '/some/file line 5 in "foo"' '/some/file line 10 in "bar"' '/some/file line 15 in "baz"' Previously we would: 1. Open the file for frame 1 2. Read each line up to line 11 3. Trim lines down to lines 2-8 4. Open the file for frame 2 5. Read each line up to line 14 6. Trim lines down to lines 7-13 7. Open the file for frame 3 8. Read each line up to line 19 9. Trim lines down to lines 12-18 Now we: 1. Open the file once 2. Read from line 1 until line 19 3. Add the lines each frame cares about: a. Lines 1-11 for frame 1 b. Lines 3-14 for frame 2 c. Lines 8-19 for frame 3 3. Trim lines down: a. To lines 2-8 for frame 1 b. To lines 7-13 for frame 2 c. To lines 12-18 for frame 3 Trimming is necessary because we need to record more lines than we actually want to keep, so that we can always record 7 lines of context, even when the frame starts at line 1 or the last line in the file The new way is still quite inefficient and requires iterating over the stack frames multiple times, however this passes all tests and can be improved after profiling it against the old method

There's no Stacktrace.new anymore, so we get one less frame

These will end up being converted into unreadable paths, e.g. "spec/fixtures/crashes/file1.rb" is readable from the root of bugsnag-ruby but will be stripped to "file1.rb" by the project_root, which is unreadable as it's relative to a different directory This test ensures we read files based on the expanded file path, rather than the final path we send to the Bugsnag API

This exposed a bug in the library, which may have never been reported because it was never hit or because it's quite subtle (the stacktrace would be truncated, but that may not be obvious) Previously we were using 'return' which returns from the method, not the 'map' so we'd remove the entire stacktrace if one of the files ended up empty It's possible this condition is never hit and therefore no one has encountered the bug to report it, but it's easy enough to fix and now there's a test to ensure it does the right thing

Most other notifiers set 'code' to null/nil in this case and there's not really a reason not to. If we omit it entirely we _could_ break someone's code if they assume the key 'code' always exists

:s I think this used to make sense in one permutation, but: 11 / 2 == 5.5 ceil(5.5) == 6 6 + 1 == 7 So we were previously using MAXIMUM_LINES_TO_KEEP in a very obsfucated way...

This is largely covered by 'stacktrace_spec', but these tests can be a bit more specific

These were required when Stacktrace was a class and needed to be 'new'-ed, but now it returns an array already

lib/bugsnag/stacktrace.rb

This is no longer needed because, other than misconfiguration, there's no way for the filename to be empty here it was implemented in a479b6e, which also let users add their own filters for the stacktrace file name. This would have been necessary then, but seems like it's not anymore — the only way to trigger this in a real backtrace is by setting the project root to a filename, which isn't a realistic use-case See #604 (review)

imjoehaines added 18 commits July 15, 2020 11:26

Add a more complete test for code in stack frames

c2d906b

Add test for when the stacktrace spans two files

a4b0f12

Make Stacktrace a module as it's really 1 method

7465716

Fix report_spec as we've removed a frame

582ead4

There's no Stacktrace.new anymore, so we get one less frame

Add colons to TODOs so Rubocop is happy

487b526

Add a test for when regex parsing fails

32bd10c

Add a test for stripping Gem paths from files

065794d

Remove TODO

3754669

Most other notifiers set 'code' to null/nil in this case and there's not really a reason not to. If we omit it entirely we _could_ break someone's code if they assume the key 'code' always exists

Simplify how many lines we get from a file

add0634

:s I think this used to make sense in one permutation, but: 11 / 2 == 5.5 ceil(5.5) == 6 6 + 1 == 7 So we were previously using MAXIMUM_LINES_TO_KEEP in a very obsfucated way...

Give CodeExtractor access to configuration

6628cd1

Add some basic tests for CodeExtractor

7416084

This is largely covered by 'stacktrace_spec', but these tests can be a bit more specific

Handle exceptions in extract!

e355f1d

Improve YARD docs

85a38c4

Remove unnecessary 'to_a' calls

0fe0262

These were required when Stacktrace was a class and needed to be 'new'-ed, but now it returns an array already

imjoehaines marked this pull request as ready for review July 15, 2020 14:50

imjoehaines requested a review from kattrali July 15, 2020 14:53

kattrali reviewed Jul 17, 2020

View reviewed changes

lib/bugsnag/stacktrace.rb Outdated Show resolved Hide resolved

imjoehaines added 2 commits July 17, 2020 11:58

Remove unnecessary Rubucop toggle

d2c4512

kattrali approved these changes Jul 17, 2020

View reviewed changes

imjoehaines merged commit d3c6d2b into next Jul 23, 2020

imjoehaines deleted the code-extraction-performance-improvements branch July 23, 2020 15:58

imjoehaines mentioned this pull request Jul 24, 2020

Release v6.15.0 #614

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Code extraction performance improvements #604

Code extraction performance improvements #604

imjoehaines commented Jul 15, 2020 •

edited

Code extraction performance improvements #604

Code extraction performance improvements #604

Conversation

imjoehaines commented Jul 15, 2020 • edited

Goal

Changeset

Added

Linked issues

Review

imjoehaines commented Jul 15, 2020 •

edited