@adoroszlai
Contributor

What changes were proposed in this pull request?

  • Log at info level when EC reconstruction starts. Update the existing completion and failure messages to use the same format.
  • Add a debug-level message for container create/close commands.

https://issues.apache.org/jira/browse/HDDS-7080
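
The start/completion/failure pattern described above can be sketched roughly as follows. This is only an illustration, not the code from this patch: the class `ReconstructionLogSketch`, the helpers `runTask`, `log` and `elapsedMs`, and the in-memory `LINES` list are all hypothetical stand-ins, and the real task logs through the project's logger rather than to a list. The point is that all three messages share the same command description, so the lines for one reconstruction can be correlated.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of consistent start/done/failed logging; not the patch itself. */
public class ReconstructionLogSketch {

  /** Collected log lines; an assumed stand-in for a real logger. */
  static final List<String> LINES = new ArrayList<>();

  /** Records a line with its level and echoes it to stdout. */
  static void log(String level, String message) {
    String line = level + " " + message;
    LINES.add(line);
    System.out.println(line);
  }

  /** Milliseconds elapsed since the given System.nanoTime() reading. */
  static long elapsedMs(long startNanos) {
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
  }

  /**
   * Runs the work and logs IN_PROGRESS before it, then DONE or FAILED after it.
   * Every message repeats the same description of the command.
   */
  static boolean runTask(String description, Callable<Void> work) {
    log("INFO", "IN_PROGRESS " + description);
    long start = System.nanoTime();
    try {
      work.call();
      log("INFO", "DONE " + description + " in " + elapsedMs(start) + " ms");
      return true;
    } catch (Exception e) {
      log("WARN", "FAILED " + description + " after " + elapsedMs(start) + " ms: " + e);
      return false;
    }
  }

  public static void main(String[] args) {
    String description = "reconstructECContainersCommand: containerID=1, missingIndexes=[1]";
    // Successful run: logs IN_PROGRESS, then DONE with the elapsed time.
    runTask(description, () -> null);
    // Failing run: logs IN_PROGRESS, then FAILED with the elapsed time and the exception.
    runTask(description, () -> {
      throw new IOException("Chunk write failed");
    });
  }
}
```

Building the description once and reusing it for every state keeps the three messages aligned, which makes it easy to grep for a single container ID across the start and end of a reconstruction.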

How was this patch tested?

Ran TestECContainerRecovery locally and checked the log output:

2023-05-16 11:45:57,038 [ContainerReplicationThread-0] INFO  reconstruction.ECReconstructionCoordinatorTask (ECReconstructionCoordinatorTask.java:runTask(65)) - IN_PROGRESS reconstructECContainersCommand: containerID=1, replication=rs-3-2-1024k, missingIndexes=[1], sources={2=4f8c1ee8-843d-4e20-a85d-84a8bafed0a1(localhost/127.0.0.1), 3=ef236338-4845-41a8-aac7-e4a6b965d1de(localhost/127.0.0.1), 4=954c0e80-343e-491f-8fd9-01a4f3fbc54a(localhost/127.0.0.1), 5=4f1313dc-72fe-469f-b9c7-97ffc1f000ae(localhost/127.0.0.1)}, targets={1=bcf6c97b-1dba-46e8-b7da-8fd5295ca1c7(localhost/127.0.0.1)}
2023-05-16 11:45:57,181 [ContainerReplicationThread-0] INFO  reconstruction.ECReconstructionCoordinatorTask (ECReconstructionCoordinatorTask.java:runTask(75)) - DONE reconstructECContainersCommand: containerID=1, replication=rs-3-2-1024k, missingIndexes=[1], sources={2=4f8c1ee8-843d-4e20-a85d-84a8bafed0a1(localhost/127.0.0.1), 3=ef236338-4845-41a8-aac7-e4a6b965d1de(localhost/127.0.0.1), 4=954c0e80-343e-491f-8fd9-01a4f3fbc54a(localhost/127.0.0.1), 5=4f1313dc-72fe-469f-b9c7-97ffc1f000ae(localhost/127.0.0.1)}, targets={1=bcf6c97b-1dba-46e8-b7da-8fd5295ca1c7(localhost/127.0.0.1)} in 143 ms
2023-05-16 11:46:36,634 [ContainerReplicationThread-0] INFO  reconstruction.ECReconstructionCoordinatorTask (ECReconstructionCoordinatorTask.java:runTask(65)) - IN_PROGRESS reconstructECContainersCommand: containerID=2, replication=rs-3-2-1024k, missingIndexes=[1], sources={2=4f1313dc-72fe-469f-b9c7-97ffc1f000ae(localhost/127.0.0.1), 3=66abd4c3-c150-40f4-9c64-748ed52588f8(localhost/127.0.0.1), 4=21cc8efd-52be-41fb-89ae-fc02f677a135(localhost/127.0.0.1), 5=7ea5f635-94c5-4b17-863e-2be9fa008825(localhost/127.0.0.1)}, targets={1=ef236338-4845-41a8-aac7-e4a6b965d1de(localhost/127.0.0.1)}
2023-05-16 11:46:39,831 [ContainerReplicationThread-0] WARN  reconstruction.ECReconstructionCoordinatorTask (ECReconstructionCoordinatorTask.java:runTask(79)) - FAILED reconstructECContainersCommand: containerID=2, replication=rs-3-2-1024k, missingIndexes=[1], sources={2=4f1313dc-72fe-469f-b9c7-97ffc1f000ae(localhost/127.0.0.1), 3=66abd4c3-c150-40f4-9c64-748ed52588f8(localhost/127.0.0.1), 4=21cc8efd-52be-41fb-89ae-fc02f677a135(localhost/127.0.0.1), 5=7ea5f635-94c5-4b17-863e-2be9fa008825(localhost/127.0.0.1)}, targets={1=ef236338-4845-41a8-aac7-e4a6b965d1de(localhost/127.0.0.1)} after 3198 ms
java.io.IOException: Chunk write failed at the new target node: ef236338-4845-41a8-aac7-e4a6b965d1de(localhost/127.0.0.1). Aborting the reconstruction process.
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.checkFailures(ECReconstructionCoordinator.java:333)
	at org.apache.hadoop.ozone.container.TestECContainerRecovery.lambda$testECContainerRecoveryWithTimedOutRecovery$1(TestECContainerRecovery.java:321)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:232)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:171)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:141)
	at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
	at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:348)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Unexpected Storage Container Exception: org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Requested operation not allowed as ContainerState is UNHEALTHY
	at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:632)
	at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:303)
	at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.lambda$writeChunkToContainer$2(BlockOutputStream.java:714)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
	... 3 more
Caused by: org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Requested operation not allowed as ContainerState is UNHEALTHY
	at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:718)
	at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:301)
	... 7 more

@adoroszlai adoroszlai self-assigned this May 16, 2023
@adoroszlai adoroszlai added the EC label May 16, 2023
@adoroszlai adoroszlai requested a review from sodonnel May 16, 2023 15:35
Contributor

@sodonnel sodonnel left a comment

LGTM

@adoroszlai adoroszlai marked this pull request as ready for review May 16, 2023 16:28
@adoroszlai adoroszlai merged commit ed85c7e into apache:master May 17, 2023
@adoroszlai adoroszlai deleted the HDDS-7080 branch May 17, 2023 05:45
errose28 added a commit to errose28/ozone that referenced this pull request May 17, 2023
* master: (78 commits)
  HDDS-8575. Intermittent failure in TestCloseContainerEventHandler.testCloseContainerWithDelayByLeaseManager (apache#4688)
  HDDS-7241. EC: Reconstruction could fail with orphan blocks. (apache#4718)
  HDDS-8577. [Snapshot] Disable compaction log when loading metadata for snapshot (apache#4697)
  HDDS-7080. EC: Offline reconstruction needs better logging (apache#4719)
  HDDS-8626. Config thread pool in ReplicationServer (apache#4715)
  HDDS-8616. Underreplication not fixed if all replicas start decommissioning (apache#4711)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8254. Close containers when volume reaches utilisation threshold (apache#4583)
  HDDS-8615. Explicitly show EC block type in 'ozone debug chunkinfo' command output (apache#4706)
  HDDS-8623. Delete duplicate getBucketInfo in OMKeyCommitRequest (apache#4712)
  HDDS-8339. Recon Show the number of keys marked for Deletion in Recon UI. (apache#4519)
  HDDS-8572. Support CodecBuffer for protobuf v3 codecs. (apache#4693)
  HDDS-8010. Improve DN warning message when getBlock does not find the block. (apache#4698)
  HDDS-8621. IOException is never thrown in SCMRatisServer.getRatisRoles(). (apache#4710)
  HDDS-8463. S3 key uniqueness in deletedTable (apache#4660)
  HDDS-8584. Hadoop client write slowly when stream enabled (apache#4703)
  HDDS-7732. EC: Verify block deletion from missing EC containers (apache#4705)
  HDDS-8581. Avoid random ports in integration tests (apache#4699)
  HDDS-8504. ReplicationManager: Pass used and excluded node separately for Under and Mis-Replication (apache#4694)
  HDDS-8576. Close RocksDB instance in RDBStore if RDBStore's initialization fails after RocksDB instance creation (apache#4692)
  ...