[fix](flink) Fix FetchResult and MemoryScratchSink stuck #42216

Merged
xinyiZzz merged 3 commits into apache:master from xinyiZzz:20241021_fix_scratch_sink
Oct 22, 2024

@xinyiZzz
Contributor

Before each get from the queue, the sink task dependency is set ready. As a result, if the sink task puts into the queue faster than FetchResult gets from it, the queue size will always be 10.
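
The effect described above can be illustrated with a small single-threaded model. This is only a sketch, not the Doris BE code: the function name, the use of a plain boolean for the dependency, and the rule that the sink blocks itself once the queue holds 10 entries are assumptions made for illustration. The limit of 10 is taken from the description.

```python
from collections import deque

CAPACITY = 10  # queue limit mentioned in the PR description


def sizes_seen_by_fetch(num_gets):
    """Return the queue size observed right before each get (hypothetical model)."""
    queue = deque()
    sink_ready = True  # stands in for the sink task dependency (assumption)
    seen = []
    for _ in range(num_gets):
        # The getter sets the sink dependency ready before every get.
        sink_ready = True
        # A sink that is faster than the getter: whenever it is ready it keeps
        # putting until the queue is full, then blocks its own dependency.
        while sink_ready:
            if len(queue) < CAPACITY:
                queue.append("block")
            else:
                sink_ready = False
        seen.append(len(queue))
        queue.popleft()
    return seen
```

In this model the sink gets a chance to refill the queue before every get, so the getter always finds the queue at its limit.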

The sink dependency must be set ready before getting from the queue. Otherwise, if the queue is emptied after the sink task puts into it and before the sink task's dependency is blocked, the get will be stuck and the sink dependency will never be set ready.
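
The interleaving described above can be replayed step by step. The sketch below is a deterministic model rather than the actual implementation: `simulate`, the boolean `sink_ready`, and the exact point at which the sink blocks itself are hypothetical, chosen only to show why the order of "set ready" and "get" matters.

```python
from collections import deque

CAPACITY = 10  # queue limit mentioned in the PR description


def simulate(set_ready_before_get):
    """Replay the race and return "progress" or "stuck" (hypothetical model)."""
    queue = deque()
    sink_ready = True  # stands in for the sink task dependency (assumption)

    # 1. The sink task fills the queue up to its limit.
    while len(queue) < CAPACITY:
        queue.append("block")

    # 2. The getter empties the queue before the sink has blocked itself.
    #    Setting the dependency ready here changes nothing yet, because it
    #    is still ready.
    while queue:
        queue.popleft()
        sink_ready = True

    # 3. Only now does the sink block its dependency, acting on the stale
    #    observation that the queue was full.
    sink_ready = False

    # 4. The getter asks for the next entry while the queue is empty.
    if set_ready_before_get:
        sink_ready = True  # wake the sink first, then wait for data
    if sink_ready:
        queue.append("block")  # the sink can run and produce again
    if queue:
        queue.popleft()
        return "progress"
    # The get waits on an empty queue, so a "set ready" placed after the get
    # is never reached and the sink is never woken.
    return "stuck"
```

Calling `simulate(True)` models the ordering the description asks for, while `simulate(False)` models the ordering it warns against.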

This fixes the following read timeout reported by the Flink connector:

WARN  org.apache.doris.flink.backend.BackendClient                 [] - Get next from Doris BE{host='', port=9060} failed.
org.apache.doris.shaded.org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
	at org.apache.doris.shaded.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:179) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.shaded.org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.shaded.org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:464) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.shaded.org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:362) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.shaded.org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:245) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.shaded.org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.sdk.thrift.TDorisExternalService$Client.recvGetNext(TDorisExternalService.java:92) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.sdk.thrift.TDorisExternalService$Client.getNext(TDorisExternalService.java:79) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.flink.backend.BackendClient.getNext(BackendClient.java:185) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.flink.source.reader.DorisValueReader.hasNext(DorisValueReader.java:243) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.flink.source.split.DorisSplitRecords.nextRecordFromSplit(DorisSplitRecords.java:71) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.doris.flink.source.split.DorisSplitRecords.nextRecordFromSplit(DorisSplitRecords.java:34) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	at org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:140) ~[flink-connector-files-1.17.1.jar:1.17.1]
	at org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:417) ~[flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:68) ~[flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) ~[flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550) ~[flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231) ~[flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839) ~[flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788) ~[flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952) [flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931) [flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) [flink-dist-1.17.1.jar:1.17.1]
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) [flink-dist-1.17.1.jar:1.17.1]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_191]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_191]
	at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_191]
	at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_191]
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[?:1.8.0_191]
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[?:1.8.0_191]
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[?:1.8.0_191]
	at org.apache.doris.shaded.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:177) ~[blob_p-44cf081797997465dd46a38b036d2b88e1b6e4d4-bb500662c4f1b3245c2c995a4e691a8a:2.1.4]
	... 24 more

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@xinyiZzz
Contributor Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.45% (9697/25891)
Line Coverage: 28.69% (80476/280472)
Region Coverage: 28.14% (41640/147990)
Branch Coverage: 24.71% (21161/85638)
Coverage Report: http://coverage.selectdb-in.cc/coverage/281953830700d43b5417339e15c4c6ae1f0245cc_281953830700d43b5417339e15c4c6ae1f0245cc/report/index.html

@xinyiZzz
Contributor Author

run buildall

@github-actions
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.45% (9703/25907)
Line Coverage: 28.70% (80519/280528)
Region Coverage: 28.14% (41652/148013)
Branch Coverage: 24.70% (21151/85644)
Coverage Report: http://coverage.selectdb-in.cc/coverage/43ca732fe07a91653ddf88c9e547427367b3904a_43ca732fe07a91653ddf88c9e547427367b3904a/report/index.html

Contributor

@yiguolei yiguolei left a comment

LGTM

@github-actions bot added the "approved" label (Indicates a PR has been approved by one committer.) on Oct 22, 2024
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

Contributor

@wangbo wangbo left a comment

LGTM

@xinyiZzz xinyiZzz merged commit b24477c into apache:master Oct 22, 2024
xinyiZzz added a commit to xinyiZzz/incubator-doris that referenced this pull request Oct 28, 2024
yiguolei pushed a commit that referenced this pull request Oct 28, 2024
xinyiZzz added a commit to xinyiZzz/incubator-doris that referenced this pull request Nov 11, 2024

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.0.3-merged, p0_b, reviewed


4 participants