
HDDS-10316. Speed up TestReconTasks #6223

Merged: 40 commits, Apr 2, 2024

Conversation

@raju-balpande (Contributor) commented Feb 15, 2024

What changes were proposed in this pull request?

Speed up TestReconTasks.
The cluster creation and initial setup are now done once for all test methods, and the tests are adjusted accordingly.
The run time improved from 140.549 seconds to 98.482 seconds.
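
To illustrate the approach (a minimal sketch, not the exact patch; the class and field names beyond the MiniOzoneCluster API are illustrative): the cluster setup is hoisted from a per-test fixture into a one-time, class-level fixture, e.g. with JUnit 5's @BeforeAll/@AfterAll, so the MiniOzoneCluster is built once and reused by every test method.

import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.MiniOzoneCluster;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.TestInstance;

// Sketch only: shows the shared-cluster pattern, not the merged test code.
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class SharedReconClusterSketch {

  private MiniOzoneCluster cluster;

  @BeforeAll
  void init() throws Exception {
    OzoneConfiguration conf = new OzoneConfiguration();
    // Same datanode count as the PR finally uses.
    cluster = MiniOzoneCluster.newBuilder(conf)
        .setNumDatanodes(3)
        .includeRecon(true)   // the real test runs against Recon; flag assumed here
        .build();
    cluster.waitForClusterToBeReady();
  }

  @AfterAll
  void shutdown() {
    if (cluster != null) {
      cluster.shutdown();
    }
  }
}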

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10316

How was this patch tested?

It works locally with the number of datanodes set to 1, but in CI it only worked with 3 datanodes, so 3 was kept.

@devmadhuu (Contributor) left a comment

Thanks @raju-balpande for the changes. Kindly update the PR description with details on how you are trying to improve performance.

@raju-balpande marked this pull request as ready for review February 29, 2024 10:32
@raju-balpande changed the title from "hdds-10316 DO Not merge, changes in TestReconTasks to have a second thought" to "hdds-10316. changes in TestReconTasks to have a second thought" on Feb 29, 2024
@raju-balpande changed the title from "hdds-10316. changes in TestReconTasks to have a second thought" to "hdds-10316. Speed up TestReconTasks" on Feb 29, 2024
@myskov self-requested a review March 1, 2024 11:40
@devmadhuu (Contributor) left a comment

Just a minor comment. Also, can you please attach a screenshot to the PR showing the previous and the new performance figures?

@@ -74,21 +80,22 @@ public void init() throws Exception {

conf.set("ozone.scm.stale.node.interval", "6s");
conf.set("ozone.scm.dead.node.interval", "8s");
-    cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(1)
+    cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(3)
Contributor

Thanks @raju-balpande for working on this patch. Why is it necessary to increase the number of datanodes in the cluster?

Contributor Author

With 1 datanode it worked fine locally but got stuck in a wait condition in CI. After multiple attempts I found that CI passes when switching to 3 datanodes.

The performance after the changes can be viewed at https://github.com/raju-balpande/apache_ozone/actions/runs/8091976415/job/22112283975

and previously it was https://github.com/raju-balpande/apache_ozone/actions/runs/7843423779/job/21404234425

Contributor


Can you tell which test case and which wait condition it was getting stuck on when 1 DN was used for the cluster?

@raju-balpande (Contributor Author), Mar 12, 2024

Hi @devmadhuu,
It was getting stuck on the wait condition at TestReconTasks.java:173:

LambdaTestUtils.await(120000, 6000, () -> {
  List<UnhealthyContainers> allMissingContainers =
      reconContainerManager.getContainerSchemaManager()
          .getUnhealthyContainers(
              ContainerSchemaDefinition.UnHealthyContainerStates.MISSING,
              0, 1000);
  return (allMissingContainers.size() == 1);
});

As I can see in the log https://github.com/raju-balpande/apache_ozone/actions/runs/7916623862/job/21611614999:

Error: Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 365.519 s <<< FAILURE! - in org.apache.hadoop.ozone.recon.TestReconTasks
Error: org.apache.hadoop.ozone.recon.TestReconTasks.testMissingContainerDownNode  Time elapsed: 300.006 s <<< ERROR!
java.util.concurrent.TimeoutException: testMissingContainerDownNode() timed out after 300 seconds
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    Suppressed: java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:133)
        at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:180)
        at org.apache.hadoop.ozone.recon.TestReconTasks.testMissingContainerDownNode(TestReconTasks.java:173)
        at java.lang.reflect.Method.invoke(Method.java:498)

Contributor

Ok, thanks for the update @raju-balpande. However, I am not sure why the above test condition should time out with just 1 DN in CI and pass locally: with 1 DN in the cluster we shut down that only DN, so the missing container count should be 1.

@myskov (Contributor) left a comment

I ran TestReconTasks with your changes locally and faced the same result - testEmptyMissingContainerDownNode fails:

2024-03-01 23:50:24,505 [IPC Server handler 19 on default port 15002] DEBUG server.SCMDatanodeHeartbeatDispatcher (SCMDatanodeHeartbeatDispatcher.java:dispatch(157)) - Dispatching ICRs.
2024-03-01 23:50:24,505 [IPC Server handler 50 on default port 15009] DEBUG server.SCMDatanodeHeartbeatDispatcher (SCMDatanodeHeartbeatDispatcher.java:dispatch(157)) - Dispatching ICRs.
2024-03-01 23:50:24,510 [Recon-FixedThreadPoolWithAffinityExecutor-0-0] INFO scm.ReconContainerManager (ReconContainerManager.java:addNewContainer(246)) - Successfully added container #2 to Recon.
23:50:24.532 [8cc60fff-ccbe-46e4-9c74-b718081d73d5-ChunkReader-8] ERROR DNAudit - user=null | ip=null | op=UPDATE_CONTAINER {containerID=112022403450798084, forceUpdate=false} | ret=FAILURE
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: ContainerID 112022403450798084 does not exist
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:305) ~[classes/:?]
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:183) ~[classes/:?]

@raju-balpande (Contributor Author)

Can you please attach the stack trace log so I can understand the flow? I didn't see this error. Thanks.

@raju-balpande changed the title from "hdds-10316. Speed up TestReconTasks" to "HDDS-10316. Speed up TestReconTasks" on Mar 4, 2024
@myskov (Contributor) commented Mar 5, 2024

Changing the Maven runner's JRE to Java 11 fixes these tests for me locally (this is weird).

@devmadhuu (Contributor) left a comment

Some comments, please check.


@@ -141,7 +149,7 @@ public void testMissingContainerDownNode() throws Exception {
(ReconContainerManager) reconScm.getContainerManager();
ContainerInfo containerInfo =
scmContainerManager
-          .allocateContainer(RatisReplicationConfig.getInstance(ONE), "test");
+          .allocateContainer(RatisReplicationConfig.getInstance(ONE), "testMissingContainer");
Contributor

If you are increasing the number of datanodes, then it is better to keep the replication factor as THREE as well.
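
For reference, the suggested variant would look roughly like the sketch below (illustrative only; as discussed further down, the test ultimately kept ReplicationFactor.ONE):

// Sketch of the suggested change, not the merged code: allocate the test
// container with a RATIS/THREE replication config to match the 3-datanode cluster.
// scmContainerManager is the SCM container manager already used by the test.
ContainerInfo containerInfo =
    scmContainerManager.allocateContainer(
        RatisReplicationConfig.getInstance(HddsProtos.ReplicationFactor.THREE),
        "testMissingContainer");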

ContainerInfo containerInfo =
scmContainerManager
-          .allocateContainer(RatisReplicationConfig.getInstance(ONE), "test");
+          .allocateContainer(RatisReplicationConfig.getInstance(ONE), "testEmptyMissingContainer");
Contributor

If you are increasing the number of datanodes, then it is better to keep the replication factor as THREE as well.

Contributor Author

I tried changing it to HddsProtos.ReplicationFactor.THREE, but it seems to have a problem with the number of pipelines:

java.io.IOException: Could not allocate container. Cannot get any matching pipeline for replicationConfig: RATIS/THREE, State:PipelineState.OPEN

at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.allocateContainer(ContainerManagerImpl.java:202)
at org.apache.hadoop.ozone.recon.TestReconTasks.testEmptyMissingContainerDownNode(TestReconTasks.java:236)
at java.lang.reflect.Method.invoke(Method.java:498)

Contributor

Ok, this might be because the criteria for sufficient healthy nodes is not met: the default minRatisVolumeSizeBytes is 1 GB while containerSizeBytes is 5 GB. For this test case it is then okay to use ReplicationFactor.ONE.
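
If one did want RATIS/THREE pipelines to form in such a small test cluster, one possible knob (a sketch under the assumption that the container-size versus volume-size check is the blocker; not verified against this test) would be to shrink the SCM container size in the test configuration:

// Sketch only: lower the container size so MiniOzoneCluster volume/space checks
// are easier to satisfy; the key exists in ozone-default.xml, the value is illustrative.
OzoneConfiguration conf = new OzoneConfiguration();
conf.set("ozone.scm.container.size", "1GB");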

@devmadhuu (Contributor) left a comment

Thanks @raju-balpande for continuing to work on this patch. Changes LGTM, +1. For your pipeline issue, I have added an explanation.


@adoroszlai merged commit ccaaf57 into apache:master on Apr 2, 2024
29 checks passed
@adoroszlai (Contributor)

Thanks @raju-balpande for the patch, @devmadhuu, @myskov for the review.

@dombizita (Contributor)

@raju-balpande, can you take a look at HDDS-10654? I recently faced some flakiness in TestReconTasks; can you check whether this change caused it?

adoroszlai added a commit that referenced this pull request Apr 6, 2024
This reverts commit ccaaf57.

Reason for revert: intermittent test failures (HDDS-10654)
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024
This reverts commit ccaaf57.

Reason for revert: intermittent test failures (HDDS-10654)

(cherry picked from commit 31c2cfb)