This repository has been archived by the owner on Dec 23, 2023. It is now read-only.

Adds Tracing.getExportComponent().flushAndShutdown() for use within application shutdown hooks. #1141

Merged: 13 commits, May 12, 2018

Contributor

@cwensel cwensel commented Apr 20, 2018

This allows a developer to force a flush from within a shutdown hook or other means.

Unfortunately the underlying Disruptor instance only provides a #shutdown() call, not a flush, or a public method for testing for backlog. Thus shutdown has propagated up to the above api call.

I did not add a test case; I'm happy to discuss how one could be implemented reliably.
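To illustrate the intended use from a shutdown hook, here is a minimal sketch. It assumes a hypothetical stand-in class, FakeExportComponent, rather than the real OpenCensus ExportComponent; only the flushAndShutdown() name comes from this PR's title, and everything else (class names, fields, the hook name) is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ShutdownHookSketch {
  /** Hypothetical stand-in for the export component; not the OpenCensus class. */
  static final class FakeExportComponent {
    final List<String> pending = new ArrayList<>();
    final List<String> exported = new ArrayList<>();
    boolean shutDown;

    synchronized void record(String span) {
      if (!shutDown) {
        pending.add(span);
      }
    }

    /** Exports whatever is still pending, then stops accepting new spans. */
    synchronized void flushAndShutdown() {
      exported.addAll(pending);
      pending.clear();
      shutDown = true;
    }
  }

  /** Builds the hook thread without registering it, so it can also be run directly. */
  static Thread newHook(FakeExportComponent component) {
    return new Thread(component::flushAndShutdown, "trace-flush-hook");
  }

  /** Registers the hook so the JVM runs it during an orderly shutdown. */
  static void install(FakeExportComponent component) {
    Runtime.getRuntime().addShutdownHook(newHook(component));
  }
}
```

An application would call install(...) once at startup. Shutdown hooks run only on orderly JVM exit, so spans can still be lost if the process is killed abruptly.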

@codecov-io

codecov-io commented Apr 20, 2018

Codecov Report

Merging #1141 into master will decrease coverage by 0.23%.
The diff coverage is 48.71%.

@@             Coverage Diff              @@
##             master    #1141      +/-   ##
============================================
- Coverage      82.1%   81.87%   -0.24%     
- Complexity     1228     1229       +1     
============================================
  Files           192      192              
  Lines          5941     5975      +34     
  Branches        551      553       +2     
============================================
+ Hits           4878     4892      +14     
- Misses          916      935      +19     
- Partials        147      148       +1
Impacted Files | Coverage Δ | Complexity Δ
...va/io/opencensus/trace/export/ExportComponent.java | 88.88% <0%> (-11.12%) | 2 <0> (ø)
...re/trace/export/InProcessSampledSpanStoreImpl.java | 94.04% <0%> (-1.14%) | 19 <0> (ø)
...us/implcore/trace/export/SampledSpanStoreImpl.java | 92.85% <0%> (-7.15%) | 3 <0> (ø)
...sus/implcore/trace/export/ExportComponentImpl.java | 85% <0%> (-15%) | 9 <0> (ø)
...opencensus/implcore/internal/SimpleEventQueue.java | 75% <0%> (-25%) | 2 <0> (ø)
.../opencensus/impl/internal/DisruptorEventQueue.java | 78.04% <55.55%> (-18.38%) | 4 <2> (ø)
...census/implcore/trace/export/SpanExporterImpl.java | 93.15% <69.23%> (-5.19%) | 8 <1> (+1)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36c018e...d662a76.

@sebright
Contributor

/cc @bogdandrutu

@bogdandrutu bogdandrutu requested review from dinooliva and sebright and removed request for HailongWen April 22, 2018 19:25
Contributor

@sebright sebright left a comment

Thanks for adding this feature! I have a few comments, mainly about the behavior of the Disruptor. I also added some suggestions on testing the two main parts of this change.

@@ -66,6 +66,8 @@ public static ExportComponent newNoopExportComponent() {
*/
public abstract SampledSpanStore getSampledSpanStore();

public abstract void flushAndShutdown();
Contributor

Could you please add a Javadoc with a @since tag? Also, I would expect a shutdown combined with a flush to be a more common use case than an immediate shutdown, so I think this method could just be named "shutdown". That name is also consistent with ExecutorService.shutdown, which finishes executing the existing tasks.
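As a reference point for the naming argument, the following sketch shows the ExecutorService.shutdown() behavior mentioned above using only the JDK: tasks submitted before shutdown() still run to completion. The class and method names (ExecutorShutdownSketch, runThenShutdown) are illustrative and not part of this PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ExecutorShutdownSketch {
  /** Submits taskCount tasks, calls shutdown(), and returns how many of them ran. */
  static int runThenShutdown(int taskCount) {
    AtomicInteger completed = new AtomicInteger();
    ExecutorService executor = Executors.newSingleThreadExecutor();
    for (int i = 0; i < taskCount; i++) {
      executor.execute(completed::incrementAndGet);
    }
    // shutdown() rejects new tasks but lets the already-submitted ones finish.
    executor.shutdown();
    try {
      if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("executor did not terminate in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for termination", e);
    }
    return completed.get();
  }
}
```

By contrast, shutdownNow() attempts to cancel work that has not finished, which is closer to the "immediate shutdown" case described as less common.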

@@ -66,6 +66,8 @@ public static ExportComponent newNoopExportComponent() {
*/
public abstract SampledSpanStore getSampledSpanStore();

public abstract void flushAndShutdown();
Contributor

This method should be concrete, for backwards-compatibility. The same applies to all of the new methods on public non-final classes in the api directory.
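The compatibility concern can be sketched as follows. The classes here (ExportComponentLike, LegacyComponent, FlushingComponent) are hypothetical stand-ins, not the OpenCensus types: a subclass written before the new method existed keeps compiling only if the new method has a body.

```java
final class ConcreteMethodSketch {
  /** Hypothetical stand-in for a public, non-final API class. */
  abstract static class ExportComponentLike {
    abstract String name();

    /** Added in a later release; concrete so that existing subclasses keep compiling. */
    void shutdown() {
      // Default: nothing to flush or stop.
    }
  }

  /** A subclass written before shutdown() existed; it does not override it. */
  static final class LegacyComponent extends ExportComponentLike {
    @Override
    String name() {
      return "legacy";
    }
  }

  /** A newer subclass that opts in to real shutdown behavior. */
  static final class FlushingComponent extends ExportComponentLike {
    boolean shutDown;

    @Override
    String name() {
      return "flushing";
    }

    @Override
    void shutdown() {
      shutDown = true;
    }
  }
}
```

Had shutdown() been declared abstract, LegacyComponent would fail to compile against the new API version.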

* Shuts down the underlying disruptor.
*
* <p>Unfortunately there is no underlying public flush mechanism, without it there is a race
* condition in the ring buffer where it can hold events into the jvm shutdown.
Contributor

Unfortunately there is no underlying public flush mechanism, without it there is a race
condition in the ring buffer where it can hold events into the jvm shutdown.

I'm not sure I understand this comment. Is this an explanation for why DisruptorEventQueue can't support flushing without also shutting down the disruptor, or does this implementation contain a race condition? If there is a race condition, what effect could it have?

Contributor Author

I removed the comment. It was lamenting the lack of a flush vs a shutdown, unclearly.

@@ -132,6 +132,13 @@ static SampledSpanStore newNoopSampledSpanStore() {
@PublicForTesting
public abstract Set<String> getRegisteredSpanNamesForCollection();

/**
* This forces any underlying event queue to flush any pending events and shutdown and handlers.
Contributor

Should this say "any handlers"?

Contributor Author

the underlying implementation is simply shutting down the event queue. there are no handlers.

Contributor

I didn't understand the part of the Javadoc that mentions handlers. Should that part be removed?

@Override
public void flushAndShutdown() {
sampledSpanStore.shutdown();
spanExporter.flush();
Contributor

Shouldn't this method also shut down the SpanExporter thread?

*/
@Override
public void shutdown() {
disruptor.shutdown();
Contributor

The documentation for Disruptor.shutdown says "It is critical that publishing to the ring buffer has stopped before calling this method, otherwise it may never return." Should this method stop the DisruptorEventQueue from accepting more Entries before calling Disruptor.shutdown?

Contributor Author

In theory, but this would require a check on every publish/enqueue call from what I can tell, unless we swap something out with a Noop impl.

but this is a shutdown; it's intended to be run during the jvm shutdown, so new spans should stop at some point naturally. thus, continuing to accept spans until there are no more might be considered a virtue.

if the jvm never shuts down, there is clearly an issue elsewhere (the thing accepting requests that spawns spans is not itself shutdown).

Contributor

I meant that I don't think that this class should continue to enqueue Entries on the Disruptor after the Disruptor has been shut down. That could cause the thread that recorded the instrumentation data to block forever. I think it would be better to discard the data that cannot be processed, and log an error, to avoid interfering with the application.

I'm not sure how we could make this class continue to accept data after shutdown without a method supporting similar behavior on the Disruptor.

/cc @bogdandrutu

Contributor Author

I think there is only danger if users misuse the shutdown() call. that is, try to shut down while the application is running, not when the application is shutting down.

there are two dangers. the first is the shutdown() call not returning if spans are added at a frequency greater than the cycle frequency, which prevents the hasBacklog() call from ever returning false.

the second is the enqueue() call not discarding entries after shutdown, causing, I presume, a memory/resource leak. I haven’t tested this theory.

as I see it, we document this behavior. or within DisruptorEventQueue#enqueue() we wrap the enqueue logic in an anonymous class, and swap it out for a Noop version at shutdown.

Contributor

I think that the number of entries is limited by the buffer size, but I want to avoid having the call to enqueue not return when the buffer is full. I prefer your suggestion to swap the disruptor for a no-op version. The no-op enqueue method could also log a warning the first time it is called.
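The swap-to-no-op idea discussed in this thread might look roughly like the sketch below. It is an assumption-laden illustration, not the code in this PR: EventQueueLike and Enqueuer are hypothetical names, and plain lists stand in for the Disruptor ring buffer and for the logger.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class NoopEnqueuerSketch {
  /** Strategy for accepting entries; replaced at shutdown. */
  interface Enqueuer {
    void enqueue(Runnable entry);
  }

  /** Hypothetical stand-in for the event queue; a list replaces the ring buffer. */
  static final class EventQueueLike {
    final List<Runnable> accepted = Collections.synchronizedList(new ArrayList<>());
    final List<String> warnings = Collections.synchronizedList(new ArrayList<>());

    // volatile so that threads calling enqueue() observe the swap made by shutdown().
    private volatile Enqueuer enqueuer = entry -> accepted.add(entry);

    void enqueue(Runnable entry) {
      enqueuer.enqueue(entry);
    }

    void shutdown() {
      // Swap in a no-op enqueuer that discards entries and warns only once.
      enqueuer =
          new Enqueuer() {
            private final AtomicBoolean logged = new AtomicBoolean(false);

            @Override
            public void enqueue(Runnable entry) {
              if (logged.compareAndSet(false, true)) {
                warnings.add("entry discarded: queue already shut down");
              }
            }
          };
      // A real implementation would shut down the underlying Disruptor here.
    }
  }
}
```

Because the check is replaced by a field swap, the normal enqueue path pays only for a volatile read rather than an explicit "is shut down" test on every call.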

Contributor Author

took a shot at both.

@@ -188,5 +193,18 @@ public void run() {
}
}
}

void flush() {
Contributor

I think you could test this method by creating a SpanExporterImpl with a very long scheduleDelay, adding a few spans (less than the buffer size), calling flush(), and then asserting that the spans were exported, as in exportNotSampledSpans.
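The testing approach suggested above can be sketched with a self-contained stand-in. BatchingExporter below is hypothetical and far simpler than SpanExporterImpl; it only shows why a very long schedule delay, combined with fewer spans than the buffer size, isolates the effect of flush().

```java
import java.util.ArrayList;
import java.util.List;

final class FlushSketch {
  /** Hypothetical batching exporter; a stand-in for illustration, not SpanExporterImpl. */
  static final class BatchingExporter {
    private final Object monitor = new Object();
    private final List<String> buffer = new ArrayList<>();
    private final List<String> exported = new ArrayList<>();
    private final int bufferSize;
    private final long scheduleDelayMillis;
    private final Thread worker;

    BatchingExporter(int bufferSize, long scheduleDelayMillis) {
      this.bufferSize = bufferSize;
      this.scheduleDelayMillis = scheduleDelayMillis;
      this.worker = new Thread(this::runLoop, "exporter-worker");
      this.worker.setDaemon(true);
      this.worker.start();
    }

    void addSpan(String span) {
      synchronized (monitor) {
        buffer.add(span);
        if (buffer.size() >= bufferSize) {
          monitor.notifyAll(); // wake the worker early when the batch is full
        }
      }
    }

    /** Exports everything buffered so far on the calling thread, ignoring the delay. */
    void flush() {
      synchronized (monitor) {
        exportLocked();
      }
    }

    List<String> exportedSnapshot() {
      synchronized (monitor) {
        return new ArrayList<>(exported);
      }
    }

    void stop() {
      worker.interrupt();
    }

    private void exportLocked() {
      exported.addAll(buffer);
      buffer.clear();
    }

    private void runLoop() {
      synchronized (monitor) {
        try {
          while (true) {
            long start = System.nanoTime();
            // Wait until the batch is full or the delay elapses; the loop guards
            // against spurious wakeups.
            while (buffer.size() < bufferSize) {
              long remaining = scheduleDelayMillis - (System.nanoTime() - start) / 1_000_000L;
              if (remaining <= 0) {
                break;
              }
              monitor.wait(remaining);
            }
            exportLocked();
          }
        } catch (InterruptedException e) {
          // stop() was called; exit the worker thread.
        }
      }
    }
  }
}
```

A test would create the exporter with, say, a buffer size of 4 and a one-hour delay, add two spans, assert that nothing has been exported, call flush(), and then assert that both spans were exported.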

Contributor Author

ok, i left this out for now. looks like there are three options

  • copy the test class, and change the delay (removing the other tests)
  • add a method on SpanExporter to change the delay via a test visible method, change the delay in the new test
  • refactor the test class so the setup is in setup() and subclass to change the delay and other params

i’m sure you have something better in mind, so let me know.

Contributor

I think that the new test could probably go in the existing test class, but the initialization code would need to be refactored to allow the scheduleDelay to be customized for each test. I think you could refactor it with these steps:

  • Move spanExporter and startEndHandler into the tests that need them, as local variables.
  • Move the call to spanExporter.registerHandler into the tests that need the SpanExporter.
  • Make createSampledEndedSpan and createNotSampledEndedSpan each take the StartEndHandler as an argument.

Contributor Author

any comments on how I should implement the test?

* condition in the ring buffer where it can hold events into the jvm shutdown.
*/
@Override
public void shutdown() {
Contributor

This method seems difficult to test, because the Disruptor is a singleton. I'm fine with adding a TODO to test it after we refactor the class to avoid the singleton.

@sebright sebright assigned dinooliva and unassigned HailongWen Apr 24, 2018
@sebright sebright added the action required The pull request is blocked by something other than a need for code review. label Apr 24, 2018
*
* @since 0.13
*/
public abstract void shutdown();
Contributor

I am not so sure if we should expose this API in RunningSpanStore.
For me, adding a shutdown() in ExportComponent is fine - that could be the entry for users and it hides all the details.

Contributor

What would a shutdown method do in RunningSpanStore? I don't think it has a separate thread.

Contributor

Maybe make it protected?

Contributor

@sebright sorry, I meant SampledSpanStore in the first place. But for RunningSpanStore, a shutdown does make sense. For a RunningSpanStore, one might want to close the spans, set the proper status and sampling option, and export some important running spans before the program exits unexpectedly - I made that up :)

Back to the topic, what I originally meant was that we should probably only expose ExportComponent.shutdown(). The other shutdown methods it calls should be hidden from users (by making them protected, or marking them as Internal).

Contributor

Yes, I like that idea. Users probably wouldn't need to call shutdown on only one part of the TraceComponent. I think that the new methods on SampledSpanStore and SpanExporter are only called from the implementation anyway, so they could be completely removed from the superclasses under api/.

Contributor Author

i’ll wait on guidance after you have reviewed the new commit.

Contributor

I would prefer to avoid exposing new APIs until they are needed. Did you only add the methods so that they could be called by ExportComponent.shutdown, or is there a use case for only calling shutdown or flush on a subset of the components?

@cwensel
Contributor Author

cwensel commented Apr 26, 2018

pushed changes per some of the comments.

still needs a test, see above.

unsure if we should still hide the shutdown call chain. waiting on guidance.

@sebright sebright removed the action required label Apr 26, 2018
@sebright sebright assigned sebright and HailongWen and unassigned sebright Apr 26, 2018
@sebright sebright added the action required label May 1, 2018
@cwensel
Contributor Author

cwensel commented May 3, 2018

ok, added the test

}

@Test
public void exportNotSampledSpansFlushed() {
Contributor

@Test
public void exportNotSampledSpansFlushed() {
// Set the export delay to a very long value in order to confirm the #flush() below works
SpanExporterImpl spanExporter = SpanExporterImpl.create(4, Duration.create(Long.MAX_VALUE, 0));
Contributor

It looks like Duration.create has a limit on the number of seconds, and it uses zero when the value is greater than 315576000000L:

if (seconds < -MAX_SECONDS || seconds > MAX_SECONDS) {
  return ZERO;

I created an issue to improve the API: #1179
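The pitfall described above can be reproduced with a small stand-in. DurationLike below is hypothetical and mirrors only the range check and the limit quoted in this comment; it is not the OpenCensus Duration class, and longestDelay() is an invented helper.

```java
final class DurationLimitSketch {
  // Limit quoted in the review comment above.
  static final long MAX_SECONDS = 315576000000L;

  /** Hypothetical stand-in that mirrors only the quoted range check. */
  static final class DurationLike {
    static final DurationLike ZERO = new DurationLike(0, 0);

    final long seconds;
    final int nanos;

    private DurationLike(long seconds, int nanos) {
      this.seconds = seconds;
      this.nanos = nanos;
    }

    static DurationLike create(long seconds, int nanos) {
      // Out-of-range input silently becomes ZERO instead of raising an error.
      if (seconds < -MAX_SECONDS || seconds > MAX_SECONDS) {
        return ZERO;
      }
      return new DurationLike(seconds, nanos);
    }
  }

  /** The longest delay this sketch can represent without collapsing to ZERO. */
  static DurationLike longestDelay() {
    return DurationLike.create(MAX_SECONDS, 0);
  }
}
```

Under this check, a test that passes Long.MAX_VALUE seconds as its "very long" delay would actually get a zero delay, so a value at or below the limit is the safer choice.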

spanExporter.registerHandler("test.service", serviceHandler);

SpanImpl span1 = createNotSampledEndedSpan(startEndHandler, SPAN_NAME_1);
SpanImpl span2 = createSampledEndedSpan(startEndHandler, SPAN_NAME_2);
Contributor

I think this test could be simplified by removing the non-sampled span, since that was tested in exportNotSampledSpans.

@cwensel
Contributor Author

cwensel commented May 3, 2018

cleaned up test.

if it looks good, I can squash and rebase to resolve any conflicts.

@songy23 songy23 added this to the Release 0.14.0 milestone May 3, 2018
@sebright
Contributor

sebright commented May 3, 2018

The test looks good. I don't mind if you rebase.

I think that there are only two remaining review comments, related to removing the public methods on SpanExporter and SampledSpanStore (#1141 (comment)), and preventing threads from blocking when enqueueing events after Disruptor shutdown (#1141 (comment)).

@cwensel
Contributor Author

cwensel commented May 4, 2018

lowered the visibility on the shutdown methods. and see comments above #1141 (comment)

@sebright sebright added action required and removed action required labels May 4, 2018
@cwensel
Contributor Author

cwensel commented May 10, 2018

pushed a noop handler for the enqueuing. let me know if I missed anything else.

@cwensel
Contributor Author

cwensel commented May 10, 2018

rebased, didn’t squash.

went w/ AtomicBoolean, since it isn’t initialized unless shutdown() is called and seems a little cleaner.

Contributor

@sebright sebright left a comment

This looks good. I think the only comment left is the one about making the DisruptorEnqueuer volatile.

CHANGELOG.md Outdated
- Map http attributes to Stackdriver format (fix [#1153](https://github.com/census-instrumentation/opencensus-java/issues/1153)).

## 0.13.1 - 2018-05-02
- Fix a typo on displaying Aggregation Type for a View on StatsZ page.
- Set bucket bounds as "le" labels for Prometheus Stats exporter.

## 0.13.0 - 2018-04-27
=======
Contributor

Is this left over from the merge?

Contributor Author

fixed.

CHANGELOG.md Outdated
@@ -1,13 +1,15 @@
## Unreleased

## 0.13.2 - 2018-05-08
- Adds Tracing.getExportComponent().shutdown() for use within application shutdown hooks.
Contributor

0.13.2 was already released, so this should go under unreleased.

Contributor Author

ugh, sorry, wasn’t paying attention. fixed.

@sebright
Contributor

Thanks! I'll merge it when the build passes. Do you mind if I squash it?

@cwensel
Contributor Author

cwensel commented May 10, 2018

please do squash it. i didn’t for fear of losing track of all the pending change comments.

@sebright
Contributor

I think the nullness checker error is a false positive. You could fix the build by adding @SuppressWarnings("nullness") to the DisruptorEventQueue constructor, since the method already has some warning suppression that needs to be cleaned up.

@cwensel
Contributor Author

cwensel commented May 11, 2018

ok, if i change @SuppressWarnings({"unchecked", "nullness"}) back to @SuppressWarnings({"unchecked"}) the verGJF check passes.

i’m using both the intellij google-java-format plugin and google java format style, which have slightly different results, but neither fixes the above issue.

i’m unsure how to make verGJF more informative on the failure.

@sebright
Contributor

I'm surprised that ./gradlew verGJF is giving different results in CI. I thought it would behave consistently, since the Gradle version is fixed, and build.gradle specifies the version for google-java-format. Maybe we should run ./gradlew goGJF in CI and then print the diff to help with debugging.

I tried running ./gradlew goGJF locally, and the diff didn't include the line with SuppressWarnings:

diff --git a/impl/src/main/java/io/opencensus/impl/internal/DisruptorEventQueue.java b/impl/src/main/java/io/opencensus/impl/internal/DisruptorEventQueue.java
index 95c5d14d1..5145ca3b7 100644
--- a/impl/src/main/java/io/opencensus/impl/internal/DisruptorEventQueue.java
+++ b/impl/src/main/java/io/opencensus/impl/internal/DisruptorEventQueue.java
@@ -157,13 +157,11 @@ public final class DisruptorEventQueue implements EventQueue {
   @Override
   public void enqueue(Entry entry) {
     enqueuer.enqueue(entry);
   }
 
-  /**
-   * Shuts down the underlying disruptor.
-   */
+  /** Shuts down the underlying disruptor. */
   @Override
   public void shutdown() {
     enqueuer =
         new DisruptorEnqueuer() {
           final AtomicBoolean logged = new AtomicBoolean(false);
@@ -188,12 +186,11 @@ public final class DisruptorEventQueue implements EventQueue {
   // An event in the {@link EventQueue}. Just holds a reference to an EventQueue.Entry.
   private static final class DisruptorEvent {
 
     // TODO(bdrutu): Investigate if volatile is needed. This object is shared between threads so
     // intuitively this variable must be volatile.
-    @Nullable
-    private volatile Entry entry = null;
+    @Nullable private volatile Entry entry = null;
 
     // Sets the EventQueueEntry associated with this DisruptorEvent.
     void setEntry(@Nullable Entry entry) {
       this.entry = entry;
     }

@cwensel
Contributor Author

cwensel commented May 11, 2018

ah, goGJF, new to me. will try that.

@sebright sebright merged commit 83fd637 into census-instrumentation:master May 12, 2018
@sebright
Contributor

Thanks!

@sebright sebright removed the action required label May 12, 2018
@cwensel cwensel deleted the flush-spans branch May 12, 2018 00:24
7 participants