HDDS-10316. Speed up TestReconTasks #6223
Conversation
Thanks @raju-balpande for the changes. Kindly update the PR description with details of how you are trying to improve performance.
Just a minor comment. Also, can you please attach screenshots to the PR showing the previous and new performance figures?
@@ -74,21 +80,22 @@ public void init() throws Exception {
     conf.set("ozone.scm.stale.node.interval", "6s");
     conf.set("ozone.scm.dead.node.interval", "8s");
-    cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(1)
+    cluster = MiniOzoneCluster.newBuilder(conf).setNumDatanodes(3)
Thanks @raju-balpande for working on this patch. What is the need to increase the number of datanodes in the cluster?
With 1 datanode it worked fine locally but got stuck in a wait condition in CI. After multiple such tries I found CI passing when we switched to 3 datanodes.
The performance after the changes can be viewed at https://github.com/raju-balpande/apache_ozone/actions/runs/8091976415/job/22112283975
and previously it was https://github.com/raju-balpande/apache_ozone/actions/runs/7843423779/job/21404234425
Can you tell in which test case and on which wait condition it was getting stuck when 1 DN was used for the cluster?
Hi @devmadhuu,
It was getting stuck on the wait condition at TestReconTasks.java:173:
LambdaTestUtils.await(120000, 6000, () -> {
  List<UnhealthyContainers> allMissingContainers =
      reconContainerManager.getContainerSchemaManager()
          .getUnhealthyContainers(
              ContainerSchemaDefinition.UnHealthyContainerStates.MISSING,
              0, 1000);
  return (allMissingContainers.size() == 1);
});
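For context, `LambdaTestUtils.await(timeoutMillis, intervalMillis, check)` polls the condition until it returns true or the timeout elapses. A minimal stand-alone sketch of that polling pattern (not the actual Ozone utility; class and method names here are illustrative only):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

public class AwaitSketch {
    // Polls `check` every intervalMillis until it returns true,
    // throwing TimeoutException once timeoutMillis has elapsed.
    static void await(long timeoutMillis, long intervalMillis,
                      Callable<Boolean> check) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (check.call()) {
                return;
            }
            Thread.sleep(intervalMillis);
        }
        throw new TimeoutException(
            "condition not met within " + timeoutMillis + " ms");
    }

    public static void main(String[] args) throws Exception {
        // Condition that becomes true on the third poll,
        // simulating a container report that eventually arrives.
        int[] calls = {0};
        await(1000, 10, () -> ++calls[0] >= 3);
        System.out.println("calls=" + calls[0]);
    }
}
```

If the condition never flips (e.g. the container report never reaches Recon), the await runs until the 120-second budget is exhausted, which is what the CI timeout below shows.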
As I see in log https://github.com/raju-balpande/apache_ozone/actions/runs/7916623862/job/21611614999
Error: Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 365.519 s <<< FAILURE! - in org.apache.hadoop.ozone.recon.TestReconTasks
Error: org.apache.hadoop.ozone.recon.TestReconTasks.testMissingContainerDownNode  Time elapsed: 300.006 s <<< ERROR!
java.util.concurrent.TimeoutException: testMissingContainerDownNode() timed out after 300 seconds
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	Suppressed: java.lang.InterruptedException: sleep interrupted
		at java.lang.Thread.sleep(Native Method)
		at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:133)
		at org.apache.ozone.test.LambdaTestUtils.await(LambdaTestUtils.java:180)
		at org.apache.hadoop.ozone.recon.TestReconTasks.testMissingContainerDownNode(TestReconTasks.java:173)
		at java.lang.reflect.Method.invoke(Method.java:498)
Ok, thanks for the update @raju-balpande. However, I am not sure why the above test condition should time out with just 1 DN in CI and pass locally: with 1 DN in the cluster, we are shutting down that only DN, so the missing container count should be 1.
I ran TestReconTasks with your changes locally and faced the same result - testEmptyMissingContainerDownNode fails:
2024-03-01 23:50:24,505 [IPC Server handler 19 on default port 15002] DEBUG server.SCMDatanodeHeartbeatDispatcher (SCMDatanodeHeartbeatDispatcher.java:dispatch(157)) - Dispatching ICRs.
2024-03-01 23:50:24,505 [IPC Server handler 50 on default port 15009] DEBUG server.SCMDatanodeHeartbeatDispatcher (SCMDatanodeHeartbeatDispatcher.java:dispatch(157)) - Dispatching ICRs.
2024-03-01 23:50:24,510 [Recon-FixedThreadPoolWithAffinityExecutor-0-0] INFO scm.ReconContainerManager (ReconContainerManager.java:addNewContainer(246)) - Successfully added container #2 to Recon.
23:50:24.532 [8cc60fff-ccbe-46e4-9c74-b718081d73d5-ChunkReader-8] ERROR DNAudit - user=null | ip=null | op=UPDATE_CONTAINER {containerID=112022403450798084, forceUpdate=false} | ret=FAILURE
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: ContainerID 112022403450798084 does not exist
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:305) ~[classes/:?]
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:183) ~[classes/:?]
Can you please attach the stacktrace log so I can understand the flow? I didn't see this error. Thanks.
Changing the maven runner's JRE to java11 fixes these tests for me locally (this is weird).
Some comments. Please check.
@@ -141,7 +149,7 @@ public void testMissingContainerDownNode() throws Exception {
         (ReconContainerManager) reconScm.getContainerManager();
     ContainerInfo containerInfo =
         scmContainerManager
-            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "test");
+            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "testMissingContainer");
If you are increasing the number of datanodes, it would be better to keep the replication factor as THREE as well.
     ContainerInfo containerInfo =
         scmContainerManager
-            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "test");
+            .allocateContainer(RatisReplicationConfig.getInstance(ONE), "testEmptyMissingContainer");
If you are increasing the number of datanodes, it would be better to keep the replication factor as THREE here as well.
I tried changing it to HddsProtos.ReplicationFactor.THREE, but it seems to have a problem with the number of pipelines:
java.io.IOException: Could not allocate container. Cannot get any matching pipeline for replicationConfig: RATIS/THREE, State:PipelineState.OPEN
at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.allocateContainer(ContainerManagerImpl.java:202)
at org.apache.hadoop.ozone.recon.TestReconTasks.testEmptyMissingContainerDownNode(TestReconTasks.java:236)
at java.lang.reflect.Method.invoke(Method.java:498)
Ok, this might be because the criteria of sufficient healthy nodes is not met: the default `minRatisVolumeSizeBytes` is 1 GB while `containerSizeBytes` is 5 GB. For the test case it is okay then to use ReplicationFactor.ONE.
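As a back-of-the-envelope check of that explanation (illustrative arithmetic only, not actual SCM code; the class and method names below are made up), if each datanode only offers a 1 GB Ratis volume while a container needs 5 GB, no node qualifies, so no RATIS/THREE pipeline can form:

```java
public class PipelineCapacitySketch {
    // Counts how many of the given volume sizes can host a container
    // of the given size -- a simplified stand-in for SCM's healthy-node check.
    static int eligibleNodes(long[] volumeSizeBytes, long containerSizeBytes) {
        int eligible = 0;
        for (long v : volumeSizeBytes) {
            if (v >= containerSizeBytes) {
                eligible++;
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        long oneGb = 1L << 30;
        long[] volumes = {oneGb, oneGb, oneGb}; // 1 GB volume per datanode
        long containerSize = 5 * oneGb;         // 5 GB container size
        int eligible = eligibleNodes(volumes, containerSize);
        // RATIS/THREE needs 3 eligible nodes; RATIS/ONE needs 1.
        System.out.println("eligible=" + eligible
            + " canFormThree=" + (eligible >= 3));
    }
}
```

Under these assumed defaults zero nodes are eligible, matching the "Cannot get any matching pipeline" error above.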
Thanks @raju-balpande for continuing to work on this patch. Changes LGTM +1. For your pipeline issue, I have added the explanation.
Thanks @raju-balpande for the patch, @devmadhuu, @myskov for the review.
@raju-balpande can you take a look at HDDS-10654? I recently faced some flakiness in TestReconTasks, can you check if this change caused it?
This reverts commit ccaaf57. Reason for revert: intermittent test failures (HDDS-10654)
(cherry picked from commit ccaaf57)
What changes were proposed in this pull request?
Speed up TestReconTasks
Cluster creation and initial setup are now done once for all test methods, with the tests modified accordingly.
Run time improved from 140.549 seconds to 98.482 seconds.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10316
How was this patch tested?
Locally the tests pass with 1 datanode, but in CI they only passed reliably with 3 datanodes, hence 3 was kept.