HDDS-15135. Import container to another volume if already exists in selected one#10161
HDDS-15135. Import container to another volume if already exists in selected one#10161devmadhuu wants to merge 3 commits intoapache:masterfrom
Conversation
… when selected target volume already has same container directory.
ArafatKhan2198
left a comment
There was a problem hiding this comment.
Leaving a few comments @devmadhuu please take a look.
| } | ||
| } | ||
|
|
||
| public void importContainerWithVolumeRetry(long containerID, Path tarFilePath, |
There was a problem hiding this comment.
We can add JavaDoc to clarify committed-space ownership.
/**
* Imports a container and retries on alternate volumes only when the selected
* volume already has the container directory.
*
* The caller is responsible for releasing committed bytes on
* initialTargetVolume. This method releases committed bytes only for
* retry-selected volumes.
*/
| doImportContainer(containerID, tarFilePath, targetVolume, compression); | ||
| } finally { | ||
| importContainerProgress.remove(containerID); | ||
| deleteFileQuietely(tarFilePath); |
There was a problem hiding this comment.
We already have a utility method for it - FileUtils.deleteQuietly
could we use it?
| } | ||
|
|
||
| @Test | ||
| public void testImportWithVolumeRetrySkipsSelectedVolumeWithContainerDir() |
There was a problem hiding this comment.
The new tests cover the successful retry case, which is good. But I would also add a negative test where every candidate volume already has the stale container directory.
Expected behavior:
Import/create tries all candidate volumes.
It fails with CONTAINER_ALREADY_EXISTS.
Committed bytes are restored on all attempted volumes.
The tarball is deleted.
No container is added to containerSet.
This is important because the PR changes the final error message to “already exists on all candidate volumes”, but currently the main added tests mostly verify the happy retry path.
| @@ -260,6 +262,43 @@ protected void createContainerMetaData(File containerMetaDataPath, | |||
| assertEquals(2, callCount.get()); | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
Suggestion - DO you think the test coverage should verify non-retriable failures are not retried
The PR explicitly says retry is scoped only to CONTAINER_ALREADY_EXISTS. That behavior should have a test.
For example, mock controller.importContainer(...) to throw a normal IOException or a StorageContainerException with a different result code. Then verify only the initially selected volume was attempted.
There was a problem hiding this comment.
@ArafatKhan2198 this comment doesn't hold right in this test class. We do mock controller.importContainer in ContainerImporter#importContainerWithVolumeRetry which belongs to TestContainerImporter class.
|
Hi @devmadhuu , do we see container replication eventually failed in real case due to this? |
@ChenSammi thanks for your review. This was identified as part of some use case discussed. We don't have good conflict resolution for Ratis or EC when a DN ends up with multiple copies, but we can end up in this situation due to volume failures as well. We need to prioritize moving to a safer state, which means allowing replication to pass. We do not want to end up in a situation where containers are perpetually under-replicated because no target nodes are valid. Since replication failure is not the only way DNs can end up with duplicate replicas, we should allow replication and just pick a different volume. Better duplicate replica handling on startup may require internal reconciliation within the DN but that can be done later. cc: @errose28 |
This change prevents DN replication/import from failing the whole target DN when the volume selected by the volume choosing policy already contains a stale container directory.
Key changes:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15135
How was this patch tested?
This patch was tested using some new unit tests and some existing ones:
Tests: