HDDS-10316. Speed up TestReconTasks #6223
Conversation
Thanks @raju-balpande for the changes. Kindly update the PR description with details of how you are trying to improve performance.
Just a minor comment. Also, can you please attach screenshots to the PR showing the previous and new performance figures?
@@ -74,21 +80,22 @@ public void init() throws Exception {
     conf.set("ozone.scm.stale.node.interval", "6s");
     conf.set("ozone.scm.dead.node.interval", "8s");
-    cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(1)
+    cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(3)
Thanks @raju-balpande for working on this patch. What is the need to increase the number of datanodes in the cluster?
With 1 datanode it worked fine locally but got stuck in a wait condition in CI. After multiple such tries I found CI passing when we switched to 3 datanodes.
The performance after the changes can be viewed at https://github.com/raju-balpande/apache_ozone/actions/runs/8091976415/job/22112283975
and previously it was https://github.com/raju-balpande/apache_ozone/actions/runs/7843423779/job/21404234425
Can you tell in which test case and on which wait condition it was getting stuck when 1 DN was used for the cluster?
Hi @devmadhuu,
It was getting stuck on the wait condition at TestReconTasks.java:173:
LambdaTestUtils.await(120000, 6000, () -> {
  List<UnhealthyContainers> allMissingContainers =
      reconContainerManager.getContainerSchemaManager()
          .getUnhealthyContainers(
              ContainerSchemaDefinition.UnHealthyContainerStates.MISSING,
              0, 1000);
  return (allMissingContainers.size() == 1);
});
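For context, `LambdaTestUtils.await(timeoutMillis, intervalMillis, check)` polls the condition until it returns true or the timeout elapses. A minimal stand-alone sketch of that polling pattern (not the actual Ozone utility; class and method names here are illustrative only):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

public class AwaitSketch {
    // Polls `check` every intervalMillis until it returns true,
    // throwing TimeoutException once timeoutMillis has elapsed.
    static void await(long timeoutMillis, long intervalMillis,
                      Callable<Boolean> check) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (check.call()) {
                return;
            }
            Thread.sleep(intervalMillis);
        }
        throw new TimeoutException(
            "condition not met within " + timeoutMillis + " ms");
    }

    public static void main(String[] args) throws Exception {
        // Condition that becomes true on the third poll,
        // simulating a container report that eventually arrives.
        int[] calls = {0};
        await(1000, 10, () -> ++calls[0] >= 3);
        System.out.println("calls=" + calls[0]);
    }
}
```

If the condition never flips (e.g. the container report never reaches Recon), the await runs until the 120-second budget is exhausted, which is what the CI timeout below shows.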
As I see in log https://github.com/raju-balpande/apache_ozone/actions/runs/7916623862/job/21611614999
Error: Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 365.519 s <<< FAILURE! - in org.apache.hadoop.ozone.recon.TestReconTasks
Error: org.apache.hadoop.ozone.recon.TestReconTasks.testMissingContainerDownNode  Time elapsed: 300.006 s <<< ERROR!
java.util.concurrent.TimeoutException: testMissingContainerDownNode() timed out after 300 seconds
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	Suppressed: java.lang.InterruptedException: sleep interrupted
		at java.lang.Thread.sleep(Native Method)
		at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:133)
		at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:180)
		at org.apache.hadoop.ozone.recon.TestReconTasks.testMissingContainerDownNode(TestReconTasks.java:173)
		at java.lang.reflect.Method.invoke(Method.java:498)
Ok, thanks for the update @raju-balpande. However, I am not sure why the above test condition should time out with just 1 DN in CI and pass locally: with 1 DN in the cluster, we are shutting down that only DN, so the missing container count should be 1.
I ran TestReconTasks with your changes locally and faced the same result - testEmptyMissingContainerDownNode fails:
2024-03-01 23:50:24,505 [IPC Server handler 19 on default port 15002] DEBUG server.SCMDatanodeHeartbeatDispatcher (SCMDatanodeHeartbeatDispatcher.java:dispatch(157)) - Dispatching ICRs.
2024-03-01 23:50:24,505 [IPC Server handler 50 on default port 15009] DEBUG server.SCMDatanodeHeartbeatDispatcher (SCMDatanodeHeartbeatDispatcher.java:dispatch(157)) - Dispatching ICRs.
2024-03-01 23:50:24,510 [Recon-FixedThreadPoolWithAffinityExecutor-0-0] INFO scm.ReconContainerManager (ReconContainerManager.java:addNewContainer(246)) - Successfully added container #2 to Recon.
23:50:24.532 [8cc60fff-ccbe-46e4-9c74-b718081d73d5-ChunkReader-8] ERROR DNAudit - user=null | ip=null | op=UPDATE_CONTAINER {containerID=112022403450798084, forceUpdate=false} | ret=FAILURE
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: ContainerID 112022403450798084 does not exist
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:305) ~[classes/:?]
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:183) ~[classes/:?]
Can you please attach the stacktrace log so I can understand the flow? I didn't see this error. Thanks.
Changing the maven runner's JRE to java11 fixes these tests for me locally (this is weird).
Some comments. Please check.
@@ -141,7 +149,7 @@ public void testMissingContainerDownNode() throws Exception {
         (ReconContainerManager) reconScm.getContainerManager();
     ContainerInfo containerInfo =
         scmContainerManager
-            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "test");
+            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "testMissingContainer");
If you are increasing the number of datanodes, it would be better to keep the replication factor as THREE as well.
     ContainerInfo containerInfo =
         scmContainerManager
-            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "test");
+            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "testEmptyMissingContainer");
If you are increasing the number of datanodes, it would be better to keep the replication factor as THREE here as well.
I tried changing it to HddsProtos.ReplicationFactor.THREE, but it seems to have a problem with the number of pipelines:
java.io.IOException: Could not allocate container. Cannot get any matching pipeline for replicationConfig: RATIS/THREE, State:PipelineState.OPEN
at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.allocateContainer(ContainerManagerImpl.java:202)
at org.apache.hadoop.ozone.recon.TestReconTasks.testEmptyMissingContainerDownNode(TestReconTasks.java:236)
at java.lang.reflect.Method.invoke(Method.java:498)
Ok, this might be because the criteria of sufficient healthy nodes is not met: the default `minRatisVolumeSizeBytes` is 1 GB while `containerSizeBytes` is 5 GB. For the test case it is okay then to use ReplicationFactor.ONE.
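As a back-of-the-envelope check of that explanation (illustrative arithmetic only, not actual SCM code; the class and method names below are made up), if each datanode only offers a 1 GB Ratis volume while a container needs 5 GB, no node qualifies, so no RATIS/THREE pipeline can form:

```java
public class PipelineCapacitySketch {
    // Counts how many of the given volume sizes can host a container
    // of the given size -- a simplified stand-in for SCM's healthy-node check.
    static int eligibleNodes(long[] volumeSizeBytes, long containerSizeBytes) {
        int eligible = 0;
        for (long v : volumeSizeBytes) {
            if (v >= containerSizeBytes) {
                eligible++;
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        long[] volumes = {oneGb, oneGb, oneGb}; // 1 GB volume per datanode
        long containerSize = 5 * oneGb;         // 5 GB container size
        int eligible = eligibleNodes(volumes, containerSize);
        // RATIS/THREE needs 3 eligible nodes; RATIS/ONE needs 1.
        System.out.println("eligible=" + eligible
            + " canFormThree=" + (eligible >= 3));
    }
}
```

Under these assumed defaults zero nodes are eligible, matching the "Cannot get any matching pipeline" error above.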
Thanks @raju-balpande for continuing to work on this patch. Changes LGTM +1. For your pipeline issue, I have added the explanation.
Thanks @raju-balpande for the patch, @devmadhuu, @myskov for the review.
@raju-balpande can you take a look at HDDS-10654? I recently faced some flakiness in TestReconTasks, can you check if this change caused it?
This reverts commit ccaaf57. Reason for revert: intermittent test failures (HDDS-10654)
(cherry picked from commit ccaaf57)
What changes were proposed in this pull request?
Speed up TestReconTasks
Cluster creation and initial setup are now done once for all test methods, with the tests modified accordingly.
Run time improved from 140.549 seconds to 98.482 seconds.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10316
How was this patch tested?
Locally the tests pass with 1 datanode, but in CI they only passed reliably with 3 datanodes, hence 3 was kept.