HDDS-10230. Preventing V3 Schema from Creating Container DB in the Wrong Location #6113
Conversation
Thanks @xichen01 for working on this. There are some test failures: https://github.com/xichen01/ozone/actions/runs/7685544604/job/20943567917#step:6:11

The relevant hunk:

```java
if (result != VolumeCheckResult.HEALTHY ||
    !df.getContainerSchemaV3Enabled() || !isDbLoaded()) {
  return result;
}

// Check that per-volume RocksDB is present.
File dbFile = new File(dbParentDir, CONTAINER_DB_NAME);
if (!dbFile.exists() || !dbFile.canRead()) {
```

with the patch changing the `dbFile` construction to:

```java
File dbFile = dbParentDir == null ? null : new File(dbParentDir, CONTAINER_DB_NAME);
```
If it passes the L261 check, can dbParentDir still be null?
dbParentDir will be non-null if it passes the L261 check, so I think this can be simplified.
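To make the suggested simplification concrete: once the earlier health check guarantees the DB is loaded, dbParentDir cannot be null, so the ternary null-guard is unnecessary. A runnable sketch (the class name, the CONTAINER_DB_NAME value, and the helper method are hypothetical, not the actual Ozone code):

```java
import java.io.File;

public class DbFileCheck {
  // Assumed value; in Ozone this is the CONTAINER_DB_NAME constant.
  static final String CONTAINER_DB_NAME = "container.db";

  // The caller guarantees dbParentDir is non-null (unhealthy volumes already
  // returned from the earlier check), so no ternary guard is needed here.
  static boolean dbPresent(File dbParentDir) {
    File dbFile = new File(dbParentDir, CONTAINER_DB_NAME);
    return dbFile.exists() && dbFile.canRead();
  }

  public static void main(String[] args) {
    // A directory that exists but (almost certainly) has no container.db in it.
    File parent = new File(System.getProperty("java.io.tmpdir"));
    System.out.println(dbPresent(parent));
  }
}
```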
```diff
  } catch (IOException e) {
    throw new IOException("Can't init db instance under path "
        + containerDBPath + " for volume " + getStorageID(), e);
+   dbLoadFailure.set(true);
```
@xichen01, thanks for reporting this. Can you share more detail about which operations cause the DB load failure and throw the IOException, besides the known initPerDiskDBStore? Is there an exception stack you can share?
1. Manually destroy the RocksDB file.
2. Restart the DN and write data.

This can reproduce a similar problem.
```
ERROR org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer: Load db store for HddsVolume /xxx/xxx/xxx/ozone/hdds failed
java.io.IOException: Can't init db instance under path /xxx/xxxx/xxx/ozone/hdds/CID-xxx/DS-xxx/container.db for volume DS-xxx
    at org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:235)
    at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.loadAllHddsVolumeDbStore(HddsVolumeUtil.java:99)
    at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.<init>(OzoneContainer.java:146)
    at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.<init>(DatanodeStateMachine.java:153)
    at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:295)
    at org.apache.hadoop.ozone.HddsDatanodeService.start(HddsDatanodeService.java:227)
    at org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:195)
    at org.apache.hadoop.ozone.HddsDatanodeService.call(HddsDatanodeService.java:104)
    at picocli.CommandLine.executeUserObject(CommandLine.java:1953)
    at picocli.CommandLine.access$1300(CommandLine.java:145)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2352)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2346)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2311)
    at picocli.CommandLine$AbstractParseResultHandler.handleParseResult(CommandLine.java:2172)
    at picocli.CommandLine.parseWithHandlers(CommandLine.java:2550)
    at picocli.CommandLine.parseWithHandler(CommandLine.java:2485)
    at org.apache.hadoop.hdds.cli.GenericCli.execute(GenericCli.java:96)
    at org.apache.hadoop.hdds.cli.GenericCli.run(GenericCli.java:87)
    at org.apache.hadoop.ozone.HddsDatanodeService.main(HddsDatanodeService.java:178)
Caused by: java.io.IOException: Failed init RocksDB, db path : /xxx/xxx/xxx/ozone/hdds/CID-xxx/DS-xxx/container.db, exception :org.rocksdb.RocksDBException CURRENT file does not end with newline
    at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:130)
    at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:191)
    at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.start(AbstractDatanodeStore.java:146)
    at org.apache.hadoop.ozone.container.metadata.AbstractDatanodeStore.<init>(AbstractDatanodeStore.java:99)
    at org.apache.hadoop.ozone.container.metadata.DatanodeStoreSchemaThreeImpl.<init>(DatanodeStoreSchemaThreeImpl.java:54)
    at org.apache.hadoop.ozone.container.keyvalue.helpers.BlockUtils.getUncachedDatanodeStore(BlockUtils.java:84)
    at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.initPerDiskDBStore(HddsVolumeUtil.java:79)
    at org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:232)
    ... 18 more
```
I think we need to root-cause this issue. Could you kindly check the related DN logs to see if there are any suspicious entries? If we don't have the root cause, this extra IOException catch may not be enough.
No other related logs were found that could help locate the root cause. This may be a legacy issue that has existed since the cluster was created, but with this PR we can find such issues faster.
I have observed another impact when the volume metadata path was removed incorrectly in a test environment. At that point, container loading hit an NPE in another flow; the NPE was handled in #5921 by skipping the load of that container. Since that was a test-script issue where cleanup was not done properly, I am referring to it as a similar case.
@xichen01, in case we don't know the exact cause of the DB loading failure, let's change this catch from IOException to Throwable.
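A sketch of the suggested change (the class, method, and simulated failure below are hypothetical scaffolding, not Ozone code; only the catch-Throwable-and-mark-failure pattern is the point):

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicBoolean;

public class DbLoadGuard {
  private final AtomicBoolean dbLoadFailure = new AtomicBoolean(false);

  // Hypothetical load method: catching Throwable instead of IOException means
  // any unexpected failure (including runtime errors) marks the volume's DB
  // as unloaded rather than escaping and leaving it half-initialized.
  void loadDbStore(File containerDBPath) {
    try {
      // ... the real initPerDiskDBStore(...) call would go here ...
      throw new RuntimeException("simulated unexpected failure");
    } catch (Throwable e) {
      dbLoadFailure.set(true);
    }
  }

  boolean isDbLoaded() {
    return !dbLoadFailure.get();
  }
}
```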
@xichen01 , could you please address my above comment and provide a new patch?
OK, I will.
```diff
@@ -192,6 +195,9 @@ private static HddsDispatcher createDispatcher(DatanodeDetails dd, UUID scmId,
   conf.set(OZONE_METADATA_DIRS, TEST_DIR);
   VolumeSet volumeSet = new MutableVolumeSet(dd.getUuidString(), conf, null,
       StorageVolume.VolumeType.DATA_VOLUME, null);
+  Path tempDir = Files.createTempDirectory("");
```
This tempDir should be explicitly deleted; otherwise it will leak disk space on the CI server and eventually cause an out-of-space error.
Same for the change in TestSecureContainerServer.java.
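One way to avoid the leak, sketched below with hypothetical test scaffolding (a JUnit @AfterClass hook or commons-io's FileUtils.deleteDirectory would work equally well):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempDirCleanup {
  public static void main(String[] args) throws IOException {
    Path tempDir = Files.createTempDirectory("");
    try {
      // ... test body using tempDir would go here ...
    } finally {
      // Delete contents first, then the directory itself:
      // File#delete fails on non-empty directories.
      File dir = tempDir.toFile();
      File[] children = dir.listFiles();
      if (children != null) {
        for (File f : children) {
          f.delete();
        }
      }
      dir.delete();
    }
    System.out.println(Files.exists(tempDir)); // prints "false"
  }
}
```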
Thanks @xichen01 for the contribution.
…ong Location (apache#6113) (cherry picked from commit 6ed1e58)
What changes were proposed in this pull request?

Preventing V3 Schema from Creating Container DB in the Wrong Location.

If an HddsVolume cannot load its "V3 Schema Container DB", it will create a new "V3 Schema Container DB" in the directory the datanode process was started from (because of dbOptions.setCreateIfMissing(true);), often the login user's home directory. If there are multiple abnormal HddsVolumes, all the Containers from these abnormal Volumes will use the same DB.

Root cause

When the HddsVolume fails to load the "V3 Schema Container DB", containerData.getVolume().getDbParentDir() will be null. The parent directory passed to KeyValueContainerLocationUtil#getContainerDBFile is then null, so java.io.File#getAbsolutePath resolves the relative path against the working directory and returns ${user.dir}/container.db.
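The resolution to ${user.dir} can be demonstrated directly: java.io.File treats a null parent as a plain relative path, which getAbsolutePath resolves against the current working directory. A minimal self-contained demonstration (the variable names are illustrative, not the actual Ozone code):

```java
import java.io.File;

public class NullParentDemo {
  public static void main(String[] args) {
    File dbParentDir = null; // what getDbParentDir() yields after a failed DB load

    // With a null parent, File(File, String) behaves like File(String):
    // the result is the relative path "container.db".
    File dbFile = new File(dbParentDir, "container.db");

    // getAbsolutePath resolves a relative path against ${user.dir},
    // i.e. the directory the JVM process was started from.
    System.out.println(dbFile.getAbsolutePath());
    System.out.println(dbFile.getAbsolutePath().equals(
        new File(System.getProperty("user.dir"), "container.db").getAbsolutePath()));
    // second line prints "true"
  }
}
```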
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10230
How was this patch tested?
Unit Test