[Flink-37703][UnitTest]The testRecoverFromIntermWithoutAdditionalState test failed of azure cron connector pipeline #26633

liangyu-1 · 2025-06-05T01:53:05Z

What is the purpose of the change

This PR aims to find out what makes the hadoop-fs unit tests unstable.

Brief change log

Add some logging to AbstractRecoverableWriterTest.java

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (100MB)
Extended integration test for recovery after master (JobManager) failure
Added test that validates that TaskInfo is transferred only once across recoveries
Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

liangyu-1 · 2025-06-05T01:54:29Z

@lsyldliu is this pr OK?

flinkbot · 2025-06-05T01:56:20Z

CI report:

d1d5294 Azure: SUCCESS
3769ee4 Azure: PENDING
af43492 UNKNOWN

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

lsyldliu

@liangyu-1 Thanks for your contribution, I left some comments.

lsyldliu · 2025-06-05T11:21:42Z

flink-core/src/test/java/org/apache/flink/core/fs/AbstractRecoverableWriterTest.java

+                recoverables.put(INIT_EMPTY_PERSIST, stream.persist());
+            } catch (IOException e) {
+                System.err.println("Unable to open file for writing " + path.toString());
+                throw e;


Apr 28 12:19:16 java.io.IOException: All datanodes [DatanodeInfoWithStorage[127.0.0.1:46278,DS-26d47d25-42de-4eef-a409-8a700a8bc82a,DISK]] are bad. Aborting...
Apr 28 12:19:16 	at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1537)
Apr 28 12:19:16 	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1472)
Apr 28 12:19:16 	at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1244)
Apr 28 12:19:16 	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:663)

Based on the original error message, we need to find out why the HDFS DataNode is broken. Can throwing an exception on the client side reveal the root cause? I don't know much about HDFS, so I'm not sure. Do we need to turn on server-side logging when we bring up the HDFS cluster, so that we can observe the behavior on the server side?

lsyldliu · 2025-06-05T11:35:21Z

flink-core/src/test/java/org/apache/flink/core/fs/AbstractRecoverableWriterTest.java

+            try {
+                stream.write(testData1.getBytes(StandardCharsets.UTF_8));
+            } catch (IOException e) {
+                System.err.println("Initial write failed: " + e.getMessage());


Can you use LOG.info to print the message?

lsyldliu · 2025-06-05T11:48:00Z

flink-core/src/test/java/org/apache/flink/core/fs/AbstractRecoverableWriterTest.java

-            recoverables.put(INIT_EMPTY_PERSIST, stream.persist());
-
-            stream.write(testData1.getBytes(StandardCharsets.UTF_8));
+            try {


I think we should simplify the logging logic and only log in the catch block, as follows:

// This is just for locate the root cause:
// https://issues.apache.org/jira/browse/FLINK-37703
// After the fix, this logic should be reverted.
int branch = 0;
try {
    branch++;
    stream = initWriter.open(path);
    branch++;
    recoverables.put(INIT_EMPTY_PERSIST, stream.persist());
    branch++;
    stream.write(testData1.getBytes(StandardCharsets.UTF_8));
    branch++;
    recoverables.put(INTERM_WITH_STATE_PERSIST, stream.persist());
    branch++;
    recoverables.put(INTERM_WITH_NO_ADDITIONAL_STATE_PERSIST, stream.persist());
    // and write some more data
    branch++;
    stream.write(testData2.getBytes(StandardCharsets.UTF_8));
    branch++;
    recoverables.put(FINAL_WITH_EXTRA_STATE, stream.persist());
} catch (IOException e) {
    LOG.info(
            "The exception branch was: {}, detail exception msg: {}",
            branch,
            e.getMessage());
    throw e;
} finally {
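The step-counter idea in this suggestion can be tried out in isolation. What follows is only a minimal sketch, not code from the PR: BranchTracker, IoStep and lastBranch are hypothetical names, a loop stands in for the straight-line calls in the test, and java.util.logging replaces the SLF4J LOG so that the example runs without extra dependencies.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper illustrating the "count the step, log it on failure" pattern.
final class BranchTracker {

    // Stand-in for the SLF4J LOG used in the real test (assumption for self-containment).
    private static final Logger LOG = Logger.getLogger(BranchTracker.class.getName());

    /** One I/O call that may fail, standing in for open/persist/write. */
    @FunctionalInterface
    public interface IoStep {
        void run() throws IOException;
    }

    private int branch = 0;

    /** Runs the steps in order; on failure logs the 1-based step number and rethrows. */
    public void runAll(List<IoStep> steps) throws IOException {
        branch = 0;
        try {
            for (IoStep step : steps) {
                // Increment before the call so the counter names the step that is running.
                branch++;
                step.run();
            }
        } catch (IOException e) {
            LOG.info("The exception branch was: " + branch
                    + ", detail exception msg: " + e.getMessage());
            throw e;
        }
    }

    /** The 1-based number of the last step that was started. */
    public int lastBranch() {
        return branch;
    }

    public static void main(String[] args) {
        IoStep ok = () -> { };
        IoStep failing = () -> {
            throw new IOException("simulated write failure");
        };

        BranchTracker tracker = new BranchTracker();
        try {
            tracker.runAll(Arrays.asList(ok, ok, failing, ok));
        } catch (IOException e) {
            // The third step failed, so the counter stops at 3.
            System.out.println("failed at step " + tracker.lastBranch());
        }
    }
}
```

Because the counter is incremented before each call, the logged value identifies the call that threw rather than the last one that succeeded, which matches the ordering in the snippet above.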

lsyldliu · 2025-06-05T11:49:03Z

flink-core/src/test/java/org/apache/flink/core/fs/AbstractRecoverableWriterTest.java

-
-            recoveredStream.write(testData3.getBytes(StandardCharsets.UTF_8));
-            recoveredStream.closeForCommit().commit();
+            try {


lsyldliu

I'm just curious why the historical commits exist here. We don't need them.

You should use git rebase to rebase your branch onto the master branch.

lsyldliu · 2025-06-06T06:59:55Z

flink-core/src/test/java/org/apache/flink/core/fs/AbstractRecoverableWriterTest.java

 */
 public abstract class AbstractRecoverableWriterTest {

+    private static final Logger Log = LoggerFactory.getLogger(AbstractRecoverableWriterTest.class);


Suggested change

-    private static final Logger Log = LoggerFactory.getLogger(AbstractRecoverableWriterTest.class);
+    private static final Logger LOG = LoggerFactory.getLogger(AbstractRecoverableWriterTest.class);

liangyu-1 and others added 5 commits September 19, 2024 13:52

[FLINK-36112] Add Support for CreateFlag.NO_LOCAL_WRITE in FLINK on YARN's File Creation to Manage Disk Space and Network Load in Labeled YARN Nodes

55b55fe

fixup! [FLINK-36112] Add Support for CreateFlag.NO_LOCAL_WRITE in FLINK on YARN's File Creation to Manage Disk Space and Network Load in Labeled YARN Nodes

224c99c

Merge branch 'apache:master' into FLINK-36112

09f589b

Merge branch 'apache:master' into FLINK-36112

fbe7491

The testRecoverFromIntermWithoutAdditionalState test failed of azure cron connector pipeline

d1d5294

lsyldliu reviewed Jun 5, 2025

liangyu-1 added 2 commits June 6, 2025 14:42

add log in lig4j2.propeties and modify how we print error message in /AbstractRecoverableWriterTest

3769ee4

modify how we print error message in /AbstractRecoverableWriterTest

af43492

lsyldliu reviewed Jun 6, 2025

liangyu-1 closed this Jun 6, 2025

liangyu-1 deleted the FLINK-37703 branch June 6, 2025 07:33
