Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 8138][pulsar-client] Improve timeout handling in ClientCnx to cover all remaining request types (GetLastMessageId, GetTopics, GetSchema, GetOrCreateSchema) #8149

Merged
merged 2 commits into from
Sep 29, 2020

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Sep 28, 2020

Master Issue: #8138

Motivation

Since one of the root causes for #8138 is that there isn't timeout handling for GetLastMessageId requests, this PR adds that.
The PR improves timeout handling in ClientCnx to cover all remaining request types that don't currently have timeout handling. The request types are:
- GetLastMessageId
- GetTopics
- GetSchema
- GetOrCreateSchema

Modifications

The existing requestTimeoutQueue solution to handle "ordinary" command requests has been extended to cover all the requests types.

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests in a new test class ClientCnxRequestTimeoutQueueTest to test request timeout handling.
    • cd pulsar-client; mvn test -Dtest=ClientCnxRequestTimeoutQueueTest

- add timeout handling for requests of
    - GetLastMessageId
    - GetTopics
    - GetSchema
    - GetOrCreateSchema

- extend the existing requestTimeoutQueue solution to handle
  these other request types besides the command requests that
  it is currently used for
@lhotari lhotari force-pushed the lh-pulsar-client-request-timeouts branch from 5dff7f1 to 2fb0ca5 Compare September 28, 2020 11:20
- pendingGetSchemaRequests and pendingGetOrCreateSchemaRequests
  should be cleaned in the same way as other pending requests
@lhotari
Copy link
Member Author

lhotari commented Sep 28, 2020

/pulsarbot run-failure-checks

1 similar comment
@lhotari
Copy link
Member Author

lhotari commented Sep 28, 2020

/pulsarbot run-failure-checks

@jiazhai jiazhai merged commit bf6a88e into apache:master Sep 29, 2020
@jiazhai jiazhai added this to the 2.7.0 milestone Sep 29, 2020
@lhotari lhotari deleted the lh-pulsar-client-request-timeouts branch September 29, 2020 09:54
jiazhai pushed a commit that referenced this pull request Sep 30, 2020
Fixes #8138

### Motivation

The issue #8138 is about multiple memory leaks happening when using the Pulsar Java Client Reader API.  
One of the leaks was previously fixed by #8149 . 

This PR fixes two remaining ConsumerImpl memory leak issues:
- leaks of ConsumerImpl instances via active timers (unAckedMessageTracker, acknowledgmentsGroupingTracker, stats)
- leaks of ConsumerImpl instances via ClientCnx when the consumer gets closed by the Broker and the consumer picks a new connection to use

### Modifications

- modifications to fix the leaks of ConsumerImpl via active timers:
  - calls to "closeConsumerTasks" were added in the locations where they were missing

- modifications to fix the leaks of ConsumerImpl via ClientCnx
  - a solution was added to track to which ClientCnx the ConsumerImpl has been registered. When the ConsumerImpl gets registered to a new ClientCnx, the same instance is unregistered from the previous ClientCnx it was previously registered to.

  - when the connection gets switched, the ConsumerImpl must be
    de-registered from the previous ClientCnx where it got registered
    previously
    - the connection gets switched with the current seek implementation
      that disconnects the consumer after a seek.
      - the new connection is picked randomly of the existing connections
        in the connection pool. Therefore reproducing the leak required
        setting connectionsPerBroker > 1.

  - attempt to cover also other cases where the ConsumerImpl instance
    wasn't de-registered properly.
    - whenever client.cleanupConsumer(this) is called, the
      de-registration from ClientCnx (where the ConsumerImpl was previously registered) should be made
    - fix a bug in closeAsync where cleanup was skipped when the channel
      was closed or an exception occurred in sending the close consumer
      request


### Verifying this change

- this PR doesn't include new tests. There is a manual test in lhotari@aff700c that has been used to reproduce the memory leak and to verify that no ConsumerImpl references are leaked after the fix has been applied.
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Oct 3, 2020
…cover all remaining request types (GetLastMessageId, GetTopics, GetSchema, GetOrCreateSchema) (apache#8149)

* Add timeout handling for all other request types in ClientCnx
Master Issue: apache#8138

### Motivation

Since one of the root causes for apache#8138 is that there isn't timeout handling for GetLastMessageId requests, this PR adds that.
The PR improves timeout handling in ClientCnx to cover all remaining request types that don't currently have timeout handling. The request types are:
    - GetLastMessageId
    - GetTopics
    - GetSchema
    - GetOrCreateSchema

### Modifications

The existing requestTimeoutQueue solution to handle "ordinary" command requests has been extended to cover all the requests types.

### Verifying this change

This change added tests and can be verified as follows:

- Added unit tests in a new test class ClientCnxRequestTimeoutQueueTest to test request timeout handling.
    - `cd pulsar-client; mvn test -Dtest=ClientCnxRequestTimeoutQueueTest`
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Oct 3, 2020
Fixes apache#8138

### Motivation

The issue apache#8138 is about multiple memory leaks happening when using the Pulsar Java Client Reader API.  
One of the leaks was previously fixed by apache#8149 . 

This PR fixes two remaining ConsumerImpl memory leak issues:
- leaks of ConsumerImpl instances via active timers (unAckedMessageTracker, acknowledgmentsGroupingTracker, stats)
- leaks of ConsumerImpl instances via ClientCnx when the consumer gets closed by the Broker and the consumer picks a new connection to use

### Modifications

- modifications to fix the leaks of ConsumerImpl via active timers:
  - calls to "closeConsumerTasks" were added in the locations where they were missing

- modifications to fix the leaks of ConsumerImpl via ClientCnx
  - a solution was added to track to which ClientCnx the ConsumerImpl has been registered. When the ConsumerImpl gets registered to a new ClientCnx, the same instance is unregistered from the previous ClientCnx it was previously registered to.

  - when the connection gets switched, the ConsumerImpl must be
    de-registered from the previous ClientCnx where it got registered
    previously
    - the connection gets switched with the current seek implementation
      that disconnects the consumer after a seek.
      - the new connection is picked randomly of the existing connections
        in the connection pool. Therefore reproducing the leak required
        setting connectionsPerBroker > 1.

  - attempt to cover also other cases where the ConsumerImpl instance
    wasn't de-registered properly.
    - whenever client.cleanupConsumer(this) is called, the
      de-registration from ClientCnx (where the ConsumerImpl was previously registered) should be made
    - fix a bug in closeAsync where cleanup was skipped when the channel
      was closed or an exception occurred in sending the close consumer
      request


### Verifying this change

- this PR doesn't include new tests. There is a manual test in lhotari@aff700c that has been used to reproduce the memory leak and to verify that no ConsumerImpl references are leaked after the fix has been applied.
wolfstudy pushed a commit that referenced this pull request Oct 30, 2020
…cover all remaining request types (GetLastMessageId, GetTopics, GetSchema, GetOrCreateSchema) (#8149)

* Add timeout handling for all other request types in ClientCnx
Master Issue: #8138

Since one of the root causes for #8138 is that there isn't timeout handling for GetLastMessageId requests, this PR adds that.
The PR improves timeout handling in ClientCnx to cover all remaining request types that don't currently have timeout handling. The request types are:
    - GetLastMessageId
    - GetTopics
    - GetSchema
    - GetOrCreateSchema

The existing requestTimeoutQueue solution to handle "ordinary" command requests has been extended to cover all the requests types.

This change added tests and can be verified as follows:

- Added unit tests in a new test class ClientCnxRequestTimeoutQueueTest to test request timeout handling.
    - `cd pulsar-client; mvn test -Dtest=ClientCnxRequestTimeoutQueueTest`

(cherry picked from commit bf6a88e)
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Nov 13, 2020
…cover all remaining request types (GetLastMessageId, GetTopics, GetSchema, GetOrCreateSchema) (apache#8149)

* Add timeout handling for all other request types in ClientCnx
Master Issue: apache#8138

### Motivation

Since one of the root causes for apache#8138 is that there isn't timeout handling for GetLastMessageId requests, this PR adds that.
The PR improves timeout handling in ClientCnx to cover all remaining request types that don't currently have timeout handling. The request types are:
    - GetLastMessageId
    - GetTopics
    - GetSchema
    - GetOrCreateSchema

### Modifications

The existing requestTimeoutQueue solution to handle "ordinary" command requests has been extended to cover all the requests types.

### Verifying this change

This change added tests and can be verified as follows:

- Added unit tests in a new test class ClientCnxRequestTimeoutQueueTest to test request timeout handling.
    - `cd pulsar-client; mvn test -Dtest=ClientCnxRequestTimeoutQueueTest`
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Nov 13, 2020
Fixes apache#8138

### Motivation

The issue apache#8138 is about multiple memory leaks happening when using the Pulsar Java Client Reader API.  
One of the leaks was previously fixed by apache#8149 . 

This PR fixes two remaining ConsumerImpl memory leak issues:
- leaks of ConsumerImpl instances via active timers (unAckedMessageTracker, acknowledgmentsGroupingTracker, stats)
- leaks of ConsumerImpl instances via ClientCnx when the consumer gets closed by the Broker and the consumer picks a new connection to use

### Modifications

- modifications to fix the leaks of ConsumerImpl via active timers:
  - calls to "closeConsumerTasks" were added in the locations where they were missing

- modifications to fix the leaks of ConsumerImpl via ClientCnx
  - a solution was added to track to which ClientCnx the ConsumerImpl has been registered. When the ConsumerImpl gets registered to a new ClientCnx, the same instance is unregistered from the previous ClientCnx it was previously registered to.

  - when the connection gets switched, the ConsumerImpl must be
    de-registered from the previous ClientCnx where it got registered
    previously
    - the connection gets switched with the current seek implementation
      that disconnects the consumer after a seek.
      - the new connection is picked randomly of the existing connections
        in the connection pool. Therefore reproducing the leak required
        setting connectionsPerBroker > 1.

  - attempt to cover also other cases where the ConsumerImpl instance
    wasn't de-registered properly.
    - whenever client.cleanupConsumer(this) is called, the
      de-registration from ClientCnx (where the ConsumerImpl was previously registered) should be made
    - fix a bug in closeAsync where cleanup was skipped when the channel
      was closed or an exception occurred in sending the close consumer
      request


### Verifying this change

- this PR doesn't include new tests. There is a manual test in lhotari@aff700c that has been used to reproduce the memory leak and to verify that no ConsumerImpl references are leaked after the fix has been applied.
merlimat pushed a commit to merlimat/pulsar that referenced this pull request Dec 19, 2020
…cover all remaining request types (GetLastMessageId, GetTopics, GetSchema, GetOrCreateSchema) (apache#8149)

* Add timeout handling for all other request types in ClientCnx
Master Issue: apache#8138

### Motivation

Since one of the root causes for apache#8138 is that there isn't timeout handling for GetLastMessageId requests, this PR adds that.
The PR improves timeout handling in ClientCnx to cover all remaining request types that don't currently have timeout handling. The request types are:
    - GetLastMessageId
    - GetTopics
    - GetSchema
    - GetOrCreateSchema

### Modifications

The existing requestTimeoutQueue solution to handle "ordinary" command requests has been extended to cover all the requests types.

### Verifying this change

This change added tests and can be verified as follows:

- Added unit tests in a new test class ClientCnxRequestTimeoutQueueTest to test request timeout handling.
    - `cd pulsar-client; mvn test -Dtest=ClientCnxRequestTimeoutQueueTest`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants