Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-8058: Fix ConnectClusterStateImpl.connectors() method #6384

Merged
merged 9 commits into from Apr 7, 2019

Conversation

@C0urante
Copy link
Contributor

C0urante commented Mar 7, 2019

This makes the ConnectClusterStateImpl.connectors() method synchronous, whereas before it was implicitly asynchronous with no way to tell whether it had completed or not.

More detail can be found in the Jira.

Tested manually. Seems like a light enough change that even unit tests would be overkill, but if reviewers feel differently tests can be added.
A unit test has also been added for the ConnectClusterStateImpl class.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)
@C0urante

This comment has been minimized.

Copy link
Contributor Author

C0urante commented Mar 7, 2019

@mageshn @wicknicks @rayokota could you guys take a look when you have a chance? 😃

Copy link
Contributor

wicknicks left a comment

good catch, @C0urante. one comment.

Copy link
Contributor

mageshn left a comment

Nice catch @C0urante. LGTM.

Copy link
Contributor

rayokota left a comment

LGTM

Copy link
Contributor

wicknicks left a comment

Thanks, @C0urante. LGTM.

@C0urante

This comment has been minimized.

Copy link
Contributor Author

C0urante commented Mar 11, 2019

@rhauch @ewencp Hi guys! Got a minor bug fix for you; been reviewed and approved by three contributors and seems ready to be looked at by a committer or two. Would appreciate your review when you get the chance.

Copy link
Contributor

kkonstantine left a comment

Nice find @C0urante !
I think using FutureCallback already allows you to do what you implement with the countdown latch.
Also I have a question on whether we need to forward this request to the leader as we do elsewhere.

@kkonstantine

This comment has been minimized.

Copy link
Contributor

kkonstantine commented Mar 12, 2019

Forgot to mention that a unit test would be nice (for the whole class even?), since this seems it never worked because it returned immediately.

…StateImpl.connectors()
@C0urante

This comment has been minimized.

Copy link
Contributor Author

C0urante commented Mar 12, 2019

@kkonstantine I think I've addressed your two inline comments, and I've added a unit test for the ConnectClusterStateImpl.connectors() method. The other method (connectorHealth(...)) is synchronous and doesn't really look like it warrants an additional test.

As far as the ten-second timeout goes, I'm open to suggestions but it doesn't appear that rebalances or forwarding to master should be a factor.

Copy link
Contributor

kkonstantine left a comment

@C0urante this looks even better now IMO and on par with the rest of the code.
One nit and one more suggested improvement on the tests. Thanks for including those btw. Definitely on the right track.

@C0urante

This comment has been minimized.

Copy link
Contributor Author

C0urante commented Mar 14, 2019

@kkonstantine thanks for the review! I've taken your comments into account and pushed the next iteration, which consists of changes to how the ConnectClusterStateImpl is tested. Hope this looks good and addresses your review remarks, but will be happy to keep iterating if not or if there's anything else that needs work.

Copy link
Contributor

kkonstantine left a comment

LGTM!

Copy link
Contributor

rhauch left a comment

Nice find, @C0urante! One really minor suggestion for slight improvement in readability, but not a bit deal.

This has never worked, right? This PR should be relatively easy to backport.

@C0urante

This comment has been minimized.

Copy link
Contributor Author

C0urante commented Mar 15, 2019

@rhauch yep, this bug has been present since REST extensions were added to Connect. I've added your suggested change to the PR, let me know if there are any other alterations you'd like.

@rhauch
rhauch approved these changes Mar 15, 2019
Copy link
Contributor

rhauch left a comment

Thanks, @C0urante! LGTM.

Copy link
Contributor

ewencp left a comment

Mostly looks good, one comment. This is the sort of thing that should come with list of backports -- the bug appears to already be tagged with at least some info about previous versions that are also affected so we probably don't want to close without backporting.

Regarding timeout, it should really be as conservative as possible without being longer than the minimum of (rebalance timeout, default REST API request timeout).

@C0urante

This comment has been minimized.

Copy link
Contributor Author

C0urante commented Mar 18, 2019

@ewencp can you shed some light on why we'd want to take rebalances into account for the timeout? I could be wrong, but I don't think a rebalance would affect the latency of a call to Herder.connectors(...).

@ewencp

This comment has been minimized.

Copy link
Contributor

ewencp commented Mar 21, 2019

Calling connectors() for DistributedHerder adds a request to the request queue, which are processed serially in the thread that manages cluster membership so they get a consistent view of the state of the Herder. Pretty much all the requests follow that pattern, though it is true some could potentially be optimized to not require it if some other locking were put in place for certain parts of the state.

The thread that processes the requests is also the one managing the member and calling poll() on it, which can block on ensuring membership in the group, which means request processing latency can be affected by rebalances.

@C0urante

This comment has been minimized.

Copy link
Contributor Author

C0urante commented Mar 24, 2019

@ewencp thanks for the clarification; I did not realize that membership and request handling were taken care of with the same thread. I've taken your suggestion literally and just made the timeout the minimum of the timeout for the /connectors resource and the rebalance timeout (if running in distributed mode). LMK how this looks!

@C0urante

This comment has been minimized.

Copy link
Contributor Author

C0urante commented Mar 25, 2019

BTW, I think we want to backport this to 2.0, which according to the KIP is when Connect REST extensions were introduced and AFAICT the code for the ConnectClusterStateImpl class hasn't been modified since.

@ewencp
ewencp approved these changes Mar 28, 2019
Copy link
Contributor

ewencp left a comment

one comment that we could follow up on, but i'm fine merging as is

@@ -308,10 +309,16 @@ void registerRestExtensions(HerderProvider provider, Plugins plugins, ResourceCo
config.getList(WorkerConfig.REST_EXTENSION_CLASSES_CONFIG),
config, ConnectRestExtension.class);

long herderRequestTimeoutMs = ConnectorsResource.REQUEST_TIMEOUT_MS;
if (config instanceof DistributedConfig) {

This comment has been minimized.

Copy link
@ewencp

ewencp Mar 28, 2019

Contributor

nit: i don't love instanceof solutions. might be better to generalize thee rebalance timeout so support a small/zero value for StandaloneConfig.

I wouldn't block merging on this, but this is an anti-pattern, so if there is a cleaner solution, it might be worth exploring.

This comment has been minimized.

Copy link
@rhauch

rhauch Apr 3, 2019

Contributor

@C0urante, what do you think about doing something like the following instead of the instanceof block?

        Integer rebalanceTimeoutMs = config.getInt(DistributedConfig.REBALANCE_TIMEOUT_MS_CONFIG);
        if (rebalanceTimeoutMs != null) {
            herderRequestTimeoutMs = Math.min(herderRequestTimeoutMs, rebalanceTimeoutMs.longValue());
        }

This still uses the DistributedConfig.REBALANCE_TIMEOUT_MS_CONFIG, but otherwise is simply checking for the existence of that property to determine if herderRequestTimeoutMs should be set to a new value.

This comment has been minimized.

Copy link
@C0urante

C0urante Apr 6, 2019

Author Contributor

@rhauch sgtm! I've added your code snippet verbatim to the PR

@rhauch rhauch merged commit 71e721f into apache:trunk Apr 7, 2019
2 checks passed
2 checks passed
JDK 11 and Scala 2.12 SUCCESS 10805 tests run, 69 skipped, 0 failed.
Details
JDK 8 and Scala 2.11 SUCCESS 10805 tests run, 69 skipped, 0 failed.
Details
rhauch added a commit that referenced this pull request Apr 7, 2019
Fixed the ConnectClusterStateImpl.connectors() method and throw an exception on timeout. Added unit test.

Author: Chris Egerton <chrise@confluent.io>

Reviewers: Magesh Nandakumar <magesh.n.kumar@gmail.com>, Robert Yokota <rayokota@gmail.com>, Arjun Satish <wicknicks@users.noreply.github.com>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <ewen@confluent.io>

Closes #6384 from C0urante:kafka-8058
rhauch added a commit that referenced this pull request Apr 7, 2019
Fixed the ConnectClusterStateImpl.connectors() method and throw an exception on timeout. Added unit test.

Author: Chris Egerton <chrise@confluent.io>

Reviewers: Magesh Nandakumar <magesh.n.kumar@gmail.com>, Robert Yokota <rayokota@gmail.com>, Arjun Satish <wicknicks@users.noreply.github.com>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <ewen@confluent.io>

Closes #6384 from C0urante:kafka-8058
rhauch added a commit that referenced this pull request Apr 7, 2019
Fixed the ConnectClusterStateImpl.connectors() method and throw an exception on timeout. Added unit test.

Author: Chris Egerton <chrise@confluent.io>

Reviewers: Magesh Nandakumar <magesh.n.kumar@gmail.com>, Robert Yokota <rayokota@gmail.com>, Arjun Satish <wicknicks@users.noreply.github.com>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <ewen@confluent.io>

Closes #6384 from C0urante:kafka-8058
@C0urante C0urante deleted the C0urante:kafka-8058 branch Apr 7, 2019
jarekr pushed a commit to confluentinc/kafka that referenced this pull request Apr 18, 2019
* apache/trunk:
  MINOR: Add security considerations for remote JMX in Kafka docs (apache#6544)
  MINOR: Remove redundant access specifiers from metrics interfaces (apache#6527)
  MINOR: Correct KStream documentation (apache#6552)
  KAFKA-8013; Avoid underflow when reading a Struct from a partially correct buffer (apache#6340)
  KAFKA-8058: Fix ConnectClusterStateImpl.connectors() method (apache#6384)
  MINOR: Move common consumer tests out of abstract consumer class (apache#6548)
  KAFKA-8168; Add a generated ApiMessageType class
  KAFKA-7893; Refactor ConsumerBounceTest to reuse functionality from BaseConsumerTest (apache#6238)
  MINOR: Tighten up metadata upgrade test (apache#6531)
  KAFKA-8190; Don't update keystore modification time during validation (apache#6539)
  MINOR: Fixed a few warning in core and connects (apache#6545)
  KAFKA-7904; Add AtMinIsr partition metric and TopicCommand option (KIP-427)
  MINOR: fix throttling and status in ConnectionStressWorker
  KAFKA-8090: Use automatic RPC generation in ControlledShutdown
  KAFKA-6399: Remove Streams max.poll.interval override (apache#6509)
  KAFKA-8126: Flaky Test org.apache.kafka.connect.runtime.WorkerTest.testAddRemoveTask (apache#6475)
  HOTFIX: Update unit test for KIP-443
  KAFKA-7190: KIP-443; Remove streams overrides on repartition topics (apache#6511)
  KAFKA-8183: Add retries to WorkerUtils#verifyTopics (apache#6532)
  KAFKA-8181: Removed Avro topic from TOC on kafka (apache#6529)
Pengxiaolong added a commit to Pengxiaolong/kafka that referenced this pull request Jun 14, 2019
)

Fixed the ConnectClusterStateImpl.connectors() method and throw an exception on timeout. Added unit test.

Author: Chris Egerton <chrise@confluent.io>

Reviewers: Magesh Nandakumar <magesh.n.kumar@gmail.com>, Robert Yokota <rayokota@gmail.com>, Arjun Satish <wicknicks@users.noreply.github.com>, Konstantine Karantasis <konstantine@confluent.io>, Randall Hauch <rhauch@gmail.com>, Ewen Cheslack-Postava <ewen@confluent.io>

Closes apache#6384 from C0urante:kafka-8058
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.