Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metric "vertx.eventbus.messages.pending" can contain incorrect value. #1480

Closed
sfitts opened this issue Jun 28, 2016 · 24 comments
Closed

Metric "vertx.eventbus.messages.pending" can contain incorrect value. #1480

sfitts opened this issue Jun 28, 2016 · 24 comments

Comments

@sfitts
Copy link

@sfitts sfitts commented Jun 28, 2016

Version: 3.3.0

In the case where a handler is removed after a message is sent, but before it is received then the pending count can be incremented, but not decremented. Consider the following sequence:

A message is delivered locally and a handler is found at line 366 in EventBusImpl.java. This increments the pending count metric and calls deliverToHandler. When the runOnContext block is scheduled it is determined that the handler has been removed at line 498 and so the message is never actually processed. As a result the pending count is never decremented making is seem like there are pending messages when in fact there aren't.

This happens most often with the "credit" handler which is installed to receive the credit based flow control messages. It is often the case that the hander gets removed because the producer has been closed, but the consumer on the other side doesn't know this and sends the credit update.

@sfitts
Copy link
Author

@sfitts sfitts commented Jun 28, 2016

This also results in the message contributing to the "delivered" metric, even though it isn't (though since the % of these is low, that's not as problematic).

@vietj
Copy link
Member

@vietj vietj commented Jun 28, 2016

it seems to me that extrapolating a pending count based on these callbacks is not a good idea (although I think it's useful) because a message may not be effectively delivered for the reason you explained but there are other reasons, like a paused MessageConsumer that is unregistered.

@vietj
Copy link
Member

@vietj vietj commented Jun 28, 2016

Perhaps that having a callback to signal that a message has been received for a particular handler would be more useful and provide something equivalent :

void enqueuedMessage(H handler, boolean local);

called in deliverToHandler - this would maintain a pending count per handler, that the implementation could handle as it wants. This way the metrics implementation is aware that a message shall be delivered for a particular handler and if that handler is unregistered it can update anything accordingly.

@vietj
Copy link
Member

@vietj vietj commented Jun 29, 2016

@sfitts
Copy link
Author

@sfitts sfitts commented Jun 29, 2016

OK, took me a bit to reason out the approach here, but I think it looks good. Having the pending count is definitely useful since it is the only way to know if messages are getting delayed due to lack of scheduling resources. Tracking that on a per-handler basis makes sense (since it gives you a more granular picture of which resources are the issue).

Based on this I assume that you'd mark the pending count for the handler in scheduleMessage and then adjust in either beginHandleMessage or unregisterHandler. You could maintain a global count if you wanted by using state in HandlerMetric. This would also allow for tracking of the time spent waiting to be scheduled.

Thanks for looking at this.

vietj added a commit to vert-x3/vertx-dropwizard-metrics that referenced this issue Jun 30, 2016
… unsubscribed when the message is scheduled for delivery - see eclipse-vertx/vert.x#1480
@vietj
Copy link
Member

@vietj vietj commented Jun 30, 2016

the schedule message has been implemented and tested for Dropwizard in https://github.com/vert-x3/vertx-dropwizard-metrics/tree/eventbusmetrics-schedule-message

can you have a look and see how it goes for you ?

@sfitts
Copy link
Author

@sfitts sfitts commented Jul 1, 2016

Will do -- I have a newbie question however -- where are the snapshot releases hosted? I looked in sonatype snapshots and it seems to only have 3.0.0. Also came up empty poking around the wiki, though I could easily have missed it. TIA for the help.

@vietj
Copy link
Member

@vietj vietj commented Jul 1, 2016

    <profile>
      <id>sonatype-repository</id>
      <repositories>
        <repository>
          <id>sonatype-nexus-releases</id>
          <url>https://oss.sonatype.org/content/repositories/releases</url>
          <layout>default</layout>
        </repository>
        <repository>
          <id>sonatype-nexus-snapshots</id>
          <url>https://oss.sonatype.org/content/repositories/snapshots</url>
          <snapshots>
            <enabled>true</enabled>
          </snapshots>
          <layout>default</layout>
          <releases>
            <enabled>false</enabled>
          </releases>
        </repository>
    </repositories>
    </profile>

@sfitts
Copy link
Author

@sfitts sfitts commented Jul 1, 2016

Thanks -- not sure how I missed that.

@vietj
Copy link
Member

@vietj vietj commented Jul 4, 2016

hi any feedback ?

@sfitts
Copy link
Author

@sfitts sfitts commented Jul 5, 2016

Sorry -- I'm obviously missing something basic. I tried this on Thursday, but didn't see the newer code in sonatype, so I figured I'd wait a bit. Then life intruded so I just tried it again now I and still don't see the new code. Here is the URL to the sources JAR that I think is current:

https://oss.sonatype.org/content/repositories/snapshots/io/vertx/vertx-dropwizard-metrics/3.3.1-SNAPSHOT/vertx-dropwizard-metrics-3.3.1-20160704.234042-13-sources.jar

In my gradle build I have the following:

    repositories {
        maven {
            url 'https://oss.sonatype.org/content/repositories/snapshots'
        }
        mavenCentral()
        jcenter()

        // For hadrian analytics
        maven {
            url "http://repository.opendatagroup.com/maven"
        }

And I'm able to reference 3.3.1-SNAPSHOT just fine, but I don't see your new code. Sorry, I know I'm not being much help right now, but I'm not sure what I'm doing wrong.

@vietj
Copy link
Member

@vietj vietj commented Jul 5, 2016

hum actually it's not deployed because it's not merged, I'm afraid you should build both yourself.

@sfitts
Copy link
Author

@sfitts sfitts commented Jul 5, 2016

Got it -- failed to notice it was still in a branch. Unfortunately getting everything built and tested with our server is going to take a bit and I won't have time until after I'm done with some other work stuff. Walked through the code and it looks good, though I did have one quick question.

Once the code is available in a SNAPSHOT release I can easily give it a try and tell you how it behaves in our environment (for what's the worth).

@tsegismont
Copy link
Contributor

@tsegismont tsegismont commented Jul 26, 2017

@sfitts the fix has been released now. Please reopen if there's anything outstanding.

@tsegismont tsegismont closed this Jul 26, 2017
@dehypnosis
Copy link

@dehypnosis dehypnosis commented Jul 5, 2021

image

Hi, I think here comes a similar problem.
In my application, it seems exact number of handlers going up and down well as time goes. while some of pending handlers seem to forget to decrease themselves metric.

At the end of the cliff, I had restarted the application.
using vertx 4.1.0.

@vietj
Copy link
Member

@vietj vietj commented Jul 5, 2021

do you have a reproducer for this ?

@dehypnosis
Copy link

@dehypnosis dehypnosis commented Jul 7, 2021

@vietj

Hi, because of company property issue, rather I made a simple code. which creates transient message consumers periodically while providing massive messages into event bus for 10s.

Then I could reproduce the same issue like above. I think this mismatch supposed to happen at the point of unregistering... maybe?

Here is the code:
https://github.com/dehypnosis/vertx-pending-handlers/blob/main/src/main/java/com/kakaomobility/connection/Main.java

you can run with ./run.sh on UNIX system

And the metrics after message provider stopped, pending handlers not decreasing:

# HELP vertx_eventbus_pending Number of messages not processed yet
# TYPE vertx_eventbus_pending gauge
vertx_eventbus_pending{side="local",} 90000.0
# HELP vertx_http_server_active_requests Number of requests being processed
# TYPE vertx_http_server_active_requests gauge
vertx_http_server_active_requests{method="GET",} 1.0
# HELP vertx_http_server_request_bytes_max Size of requests in bytes
# TYPE vertx_http_server_request_bytes_max gauge
vertx_http_server_request_bytes_max{method="GET",} 0.0
# HELP vertx_http_server_request_bytes Size of requests in bytes
# TYPE vertx_http_server_request_bytes summary
vertx_http_server_request_bytes_count{method="GET",} 23.0
vertx_http_server_request_bytes_sum{method="GET",} 0.0
# HELP vertx_eventbus_processed_total Number of processed messages
# TYPE vertx_eventbus_processed_total counter
vertx_eventbus_processed_total{side="local",} 270000.0
# HELP vertx_http_server_bytes_written_total Number of bytes sent by the server
# TYPE vertx_http_server_bytes_written_total counter
vertx_http_server_bytes_written_total 73114.0
# HELP vertx_eventbus_received_total Number of messages received
# TYPE vertx_eventbus_received_total counter
vertx_eventbus_received_total{side="local",} 610000.0
# HELP vertx_http_server_requests_total Number of processed requests
# TYPE vertx_http_server_requests_total counter
vertx_http_server_requests_total{code="200",method="GET",route="/",} 23.0
# HELP vertx_eventbus_sent_total Number of messages sent (point-to-point)
# TYPE vertx_eventbus_sent_total counter
vertx_eventbus_sent_total{side="local",} 610000.0
# HELP vertx_http_server_active_connections Number of opened connections to the server
# TYPE vertx_http_server_active_connections gauge
vertx_http_server_active_connections 1.0
# HELP vertx_http_server_response_time_seconds_max Request processing time
# TYPE vertx_http_server_response_time_seconds_max gauge
vertx_http_server_response_time_seconds_max{code="200",method="GET",route="/",} 0.0
# HELP vertx_http_server_response_time_seconds Request processing time
# TYPE vertx_http_server_response_time_seconds summary
vertx_http_server_response_time_seconds_count{code="200",method="GET",route="/",} 23.0
vertx_http_server_response_time_seconds_sum{code="200",method="GET",route="/",} 0.051776341
# HELP vertx_eventbus_delivered_total Number of messages delivered to handlers
# TYPE vertx_eventbus_delivered_total counter
vertx_eventbus_delivered_total{side="local",} 360000.0
# HELP vertx_http_server_response_bytes_max Size of responses in bytes
# TYPE vertx_http_server_response_bytes_max gauge
vertx_http_server_response_bytes_max{code="200",method="GET",route="/",} 0.0
# HELP vertx_http_server_response_bytes Size of responses in bytes
# TYPE vertx_http_server_response_bytes summary
vertx_http_server_response_bytes_count{code="200",method="GET",route="/",} 23.0
vertx_http_server_response_bytes_sum{code="200",method="GET",route="/",} 73114.0
# HELP vertx_eventbus_reply_failures_total Number of message reply failures
# TYPE vertx_eventbus_reply_failures_total counter
vertx_eventbus_reply_failures_total 610000.0
# HELP vertx_eventbus_handlers Number of event bus handlers in use
# TYPE vertx_eventbus_handlers gauge
vertx_eventbus_handlers 1.0

Thank you.

@dehypnosis
Copy link

@dehypnosis dehypnosis commented Jul 20, 2021

do you think should i have to worry about this situation? 🥲

@vietj
Copy link
Member

@vietj vietj commented Jul 20, 2021

I will have a look at your code @dehypnosis

@vietj
Copy link
Member

@vietj vietj commented Jul 20, 2021

can you provide a reproducer that works locally ?

@dehypnosis
Copy link

@dehypnosis dehypnosis commented Jul 21, 2021

@vietj
Thank you. This code can work locally. I had made it for you. 😃

https://github.com/dehypnosis/vertx-pending-handlers

to run, clone it and just execute run.sh

@vietj
Copy link
Member

@vietj vietj commented Jul 22, 2021

can you open a new issue and link this reproducer ?

@vietj
Copy link
Member

@vietj vietj commented Jul 22, 2021

I tried to run it but I don't see where I should look at to the the issue can you elaborate ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants