Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace spans can be filtered by a minimum duration #215

Merged
merged 6 commits into from Mar 20, 2024
Merged

Conversation

pda
Copy link
Member

@pda pda commented Mar 6, 2024

Introduce the ability to filter trace spans by a minimum duration, so that fast spans can be discarded and only slow (interesting) spans are kept in memory and uploaded.

Configurable using BUILDKITE_ANALYTICS_TRACE_MIN_SECONDS (decimal value) in environment.

@pda pda requested a review from a team as a code owner March 6, 2024 00:39
Configurable using BUILDKITE_ANALYTICS_TRACE_MIN_SECONDS (decimal value)
in environment.
@@ -35,21 +35,30 @@ def initialize(section, start_at, end_at, detail)
@children = []
end

def duration
raise IncompleteSpan if end_at.nil?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see this error being caught anywhere? If that's the case, I think this will might cause Rspec to fail with that "error outside of execution" error message and/or fail the job?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah.

I think this is what Java would call an IllegalStateException, and is basically a runtime assertion that some internal usage hasn't gone wrong; calling duration before setting end_at makes no sense, and as far as we can tell nothing is or should be doing that.

My normal default is to crash with a clear error message when something “impossible” happens.

But perhaps I need to rethink that in test collectors where it's more important that the show goes on, even if that means the test collector can't behave sanely. 🤔

I'm pretty sure without this check, the old code would have done this instead:

NoMethodError: undefined method `-' for nil:NilClass

So I don't think it's a regression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right I see, you're setting the end_at right before the duration method is called - so the failure case would be if the call to Monotonic time library fails, which "should be" impossible. Is it worth putting in such a check for a dependant library? Not sure why the time call would fail? And seeing as this would have raised an exception in the past anyways, and we've never seen a case of that with the collector, maybe its fine to just rely on monotonic time doing its thing and not having an explicit check for it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump on this comment :)

Copy link
Member Author

@pda pda Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it's fine; I don't think the monotonic time check can realistically return nil, if anything it might raise an error and crash the whole party.

So yeah I think it's fine like this.
I'd pitch it as:

if the impossible happens, give it a more descriptive name than “NoMethodError”

@pda
Copy link
Member Author

pda commented Mar 6, 2024

I wonder if configuring this as milliseconds (integer?) would be better than seconds (decimal/float).

I find much easier to express 1 second in milliseconds (1,000) than 1 millisecond in seconds (0.001?).
And I suspect people would generally want to configure this somewhere in the 1–1000ms range, not some whole number of seconds.

I think I'll change it.

pda added 5 commits March 6, 2024 13:55
Interally we use floating-point seconds for duration, so I've continued
to do that internally, but the external configuration layer now accepts
milliseconds instead.
I ran `rspec` without `bundle exec` and hit some errors that were easy
to fix. By fixing them I'm not advocating running it outside bundler :)

The spec is using Pathname from stdlib, so we should require it.

We're stubbing out the top-level Rails constant in one test, but that's
leaking into all subsequent tests, leading to weird errors elsewhere:

    spec/test_collector/minitest_plugin/without_plugin_spec.rb

    #<TypeError:"Rails is not a module\n/Users/pda/bk/test-collector-ruby/spec/test_collector/minitest_plugin/trace_spec.rb:92: previous definition of Rails was here">
If the option isn't present in env when .configure is called, ensure the
trace_min_duration configuration is reset to nil, rather than leaving it
in its previous state.

This is mainly useful in tests where an earlier test may have set a
different value, since we're dealing with class level state, not
per-instance.
@pda
Copy link
Member Author

pda commented Mar 6, 2024

I've tested this against a large build within Buildkite with BUILDKITE_ANALYTICS_TRACE_MIN_MS=10, and verified that I see spans over 10ms and none under 10ms 👌🏼

Copy link
Contributor

@niceking niceking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think raising the error slightly confuses me as it makes me think that this is a likely case that we should protect from, but it is as @pda pointed out not a regression and we've never had this happen in the while so its pretty harmless to leave in 🤷‍♀️

Excited to support this feature tho!! Big spans only!!!!

@pda pda merged commit 6e91c54 into main Mar 20, 2024
1 check passed
@pda pda deleted the trace-min-seconds branch March 20, 2024 22:52
@pda pda mentioned this pull request Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants