Skip to content

Conversation

@XComp
Copy link
Contributor

@XComp XComp commented Mar 22, 2022

What is the purpose of the change

There was a connection issue again with ZK that supposedly caused the IDCounter to be incremented once too often.

Brief change log

  • I fixed the issue in the same way it was done in FLINK-26120.
  • Additionally, I extended the ci log4j properties to log also request handling on the ZK server side

Verifying this change

A CI run with a forced ZK test failure was added temporarily to verify the log4j configuration changes.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 22, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@XComp
Copy link
Contributor Author

XComp commented Mar 22, 2022

The ci run failed due to the debug commit I added: The zookeeper-server.log contain additional debug logs about final requests, e.g. the first getAndIncrement call:

14:48:43,156 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:createSession cxid:0x0 zxid:0x1 txntype:-10 reqpath:n/a
14:48:43,165 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:createSession cxid:0x0 zxid:0x1 txntype:-10 reqpath:n/a
14:48:43,188 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:getData cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:/zookeeper/config
14:48:43,188 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:getData cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:/zookeeper/config
14:48:43,191 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:getData cxid:0x2 zxid:0xfffffffffffffffe txntype:unknown reqpath:/zookeeper/config
14:48:43,191 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:getData cxid:0x2 zxid:0xfffffffffffffffe txntype:unknown reqpath:/zookeeper/config
14:48:43,203 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:exists cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
14:48:43,203 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:exists cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
14:48:43,207 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:exists cxid:0x4 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
14:48:43,207 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:exists cxid:0x4 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
14:48:43,235 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:createContainer cxid:0x5 zxid:0x2 txntype:19 reqpath:n/a
14:48:43,271 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:createContainer cxid:0x5 zxid:0x2 txntype:19 reqpath:n/a
14:48:43,324 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:createContainer cxid:0x6 zxid:0x3 txntype:-1 reqpath:n/a
14:48:43,337 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:exists cxid:0x7 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default
14:48:43,337 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:exists cxid:0x7 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default
14:48:43,394 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:createContainer cxid:0x8 zxid:0x4 txntype:19 reqpath:n/a
14:48:43,395 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:createContainer cxid:0x8 zxid:0x4 txntype:19 reqpath:n/a
14:48:43,395 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:exists cxid:0x9 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default
14:48:43,395 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:exists cxid:0x9 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default
14:48:43,399 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:getData cxid:0xa zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/checkpoint_id_counter
14:48:43,399 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:getData cxid:0xa zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/checkpoint_id_counter
14:48:43,446 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:create2 cxid:0xb zxid:0x5 txntype:15 reqpath:n/a
14:48:43,446 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:create2 cxid:0xb zxid:0x5 txntype:15 reqpath:n/a
14:48:43,451 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:getData cxid:0xc zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/checkpoint_id_counter
14:48:43,452 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:getData cxid:0xc zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/checkpoint_id_counter
14:48:43,573 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - Processing request:: sessionid:0x101b66cf7c70000 type:setData cxid:0xd zxid:0x6 txntype:5 reqpath:n/a
14:48:43,580 [        SyncThread:0] DEBUG org.apache.zookeeper.server.FinalRequestProcessor            [] - sessionid:0x101b66cf7c70000 type:setData cxid:0xd zxid:0x6 txntype:5 reqpath:n/a

I agree that it's a bit verbose. But it should give us a way to see, what actually happened on the ZooKeeper server.

@zentol zentol removed their assignment Mar 28, 2022
XComp added 2 commits April 1, 2022 09:54
We only care about monotonically increasing numbers.
Unstable connections could cause a retry by the curator
client. See FLINK-26120 where a the same error cause was
fixed in the same way.
org.apache.zookeeper.server.FinalRequestProcessor provides debug logs for actual
operations being performed on the server, which might be interesting to see
when investigating test instabilities.

The following logs are part of the debug logs:
Processing request:: sessionid:0x1003757ccd40000 type:getData cxid:0x11 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/default/checkpoint_id_counter

...contains the type of request (e.g. 'getData') and path (e.g. '/flink/default/checkpoint_id_counter').
@XComp
Copy link
Contributor Author

XComp commented Apr 1, 2022

Thanks, @autophagy . I removed the debug commit and rebased the branch. I'm gonna go ahead and create backport PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants