
HDDS-10086. Intermittent timeout in TestSafeMode #5945

Merged
adoroszlai merged 4 commits into apache:master from adoroszlai:HDDS-10086
Jan 9, 2024

Conversation

@adoroszlai (Contributor)

What changes were proposed in this pull request?

HDDS-8982 added a new assertion in TestSafeMode and set a 1-minute timeout for the test case. The following problem was encountered in a recent run:

Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 103.8 s <<< FAILURE! -- in org.apache.hadoop.fs.ozone.TestSafeMode
org.apache.hadoop.fs.ozone.TestSafeMode.o3fs -- Time elapsed: 72.90 s <<< ERROR!
java.util.concurrent.TimeoutException: o3fs() timed out after 60 seconds

The initial selectContainer call correctly found no matching container:

2024-01-08 10:08:41,553 [main] WARN  container.ContainerManagerImpl (ContainerManagerImpl.java:getMatchingContainer(344)) - Container allocation failed on pipeline=Pipeline[ Id: 30c296b9-71b9-4744-8977-9b77b35a0eb3, Nodes: e6c09ac7-730a-4058-a1a8-e64ffc2fb789(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19)d591e5af-e8f5-464c-8a55-75d5e8fc5b83(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19)6df5c950-b1f8-41e9-a8bd-47cd1228c3d6(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19), ReplicationConfig: RATIS/THREE, State:OPEN, leaderId:e6c09ac7-730a-4058-a1a8-e64ffc2fb789, CreationTimestamp2024-01-08T10:08:39.438Z[Etc/UTC]]
java.lang.IllegalArgumentException
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
	at org.apache.hadoop.hdds.scm.node.SCMNodeManager.minHealthyVolumeNum(SCMNodeManager.java:1204)
	at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.minHealthyVolumeNum(PipelineManagerImpl.java:669)
	at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getOpenContainerCountPerPipeline(ContainerManagerImpl.java:351)
	at org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getMatchingContainer(ContainerManagerImpl.java:331)
	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.selectContainer(WritableRatisContainerProvider.java:193)
	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:163)
	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:92)
	at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
	at org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)

Pipeline creation failed, since no datanodes were available:

2024-01-08 10:08:41,553 [main] WARN  pipeline.WritableRatisContainerProvider (WritableRatisContainerProvider.java:getContainer(106)) - Pipeline creation failed for repConfig RATIS/THREE Datanodes may be used up. Try to see if any pipeline is in ALLOCATED state, and then will wait for it to be OPEN
org.apache.hadoop.hdds.scm.exceptions.SCMException: Ratis pipeline number meets the limit: 3 replicationConfig : RATIS/THREE
	at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.create(RatisPipelineProvider.java:153)
	at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.create(RatisPipelineProvider.java:57)
	at org.apache.hadoop.hdds.scm.pipeline.PipelineFactory.create(PipelineFactory.java:89)
	at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:255)
	at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:241)
	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:100)
	at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
	at org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)

However, one pipeline was found in ALLOCATED state, so the call waited for it to become OPEN:

2024-01-08 10:09:41,554 [main] WARN  pipeline.WritableRatisContainerProvider (WritableRatisContainerProvider.java:getContainer(122)) - Waiting for one of pipelines [PipelineID=57157ebf-cb57-4f69-817d-8bea082c3750] to be OPEN failed. 
java.io.IOException: Pipeline 57157ebf-cb57-4f69-817d-8bea082c3750 is not ready in 60000 ms
	at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitOnePipelineReady(PipelineManagerImpl.java:772)
	at org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:120)
	at org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
	at org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)

The problem is that both timeouts are 60 seconds, so the test may be aborted just before the expected IOException is thrown.
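
To illustrate the race, here is a minimal, self-contained sketch; the class and the stand-in method below are hypothetical and do not reflect the actual TestSafeMode code, only the shape of the problem: the assertion can only observe the IOException after the internal 60-second wait, but JUnit's own 60-second limit may expire first.

import static org.junit.jupiter.api.Assertions.assertThrows;

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.Timeout;

class TimeoutRaceSketch {

  /**
   * Stand-in for PipelineManagerImpl#waitOnePipelineReady: it gives up only
   * after its own 60-second wait and then throws the IOException.
   */
  private void waitOnePipelineReady() throws IOException, InterruptedException {
    TimeUnit.SECONDS.sleep(60);   // internal wait, 60000 ms
    throw new IOException("Pipeline is not ready in 60000 ms");
  }

  @Test
  @Timeout(60)                    // test-level limit, also 60 seconds
  void expectedExceptionMayNeverArrive() {
    // Both limits are 60 seconds, so JUnit can abort the test a moment
    // before the IOException the assertion is waiting for is thrown.
    assertThrows(IOException.class, this::waitOnePipelineReady);
  }
}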

This PR increases the test timeout to 2 minutes. At first I tried reducing the pipeline report interval to avoid the unnecessary wait; that fixed the original issue, but hit another intermittent timeout while shutting down datanodes (which is part of the original test, before the getContainer call).
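
The fix itself is small; a hedged sketch of what the change presumably looks like in TestSafeMode (the exact method name and annotation placement may differ):

  // Give the test twice the internal 60-second pipeline wait, so the
  // expected IOException has time to surface before JUnit aborts the run.
  @Test
  @Timeout(120)   // previously 60 seconds
  void testSafeMode() throws Exception {
    // ... shut down datanodes, then assert that getContainer fails ...
  }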

https://issues.apache.org/jira/browse/HDDS-10086

How was this patch tested?

Passed in 10x20 runs:
https://github.com/adoroszlai/ozone/actions/runs/7447762180

adoroszlai added the test label Jan 8, 2024
adoroszlai self-assigned this Jan 8, 2024
@SaketaChalamchala (Contributor)

cc @SaketaChalamchala

@duongkame (Contributor) left a comment

Thanks for the change @adoroszlai. Looks reasonable to me.

adoroszlai merged commit c23b713 into apache:master Jan 9, 2024
adoroszlai deleted the HDDS-10086 branch January 9, 2024 21:03
@adoroszlai (Contributor, Author)

Thanks @duongkame for the review.
