-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] Add kernel support for v2 checkpoints #2826
[Kernel] Add kernel support for v2 checkpoints #2826
Conversation
|
||
// invalid checkpoints | ||
intercept[RuntimeException] { | ||
new CheckpointInstance( | ||
new Path(FAKE_DELTA_LOG_PATH, | ||
"00000000000000000010.checkpoint.0000000002.parquet").toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Had to change to make in valid - previously was valid V2 checkpoint path
kernel/kernel-api/src/test/scala/io/delta/kernel/internal/DeltaHistoryManagerSuite.scala
Show resolved
Hide resolved
@@ -598,6 +629,22 @@ protected Optional<LogSegment> getLogSegmentForVersion( | |||
); | |||
throw new IllegalStateException(msg); | |||
} | |||
|
|||
// Reading sidecars only applies for v2 checkpoints. | |||
if (newCheckpoint.format == CheckpointInstance.CheckpointFormat.V2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: For a V2 checkpoint instance, do not include the compatibility file in the checkpoints in the LogSegment. Only include checkpoint manifest + sidecar files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can potentially lead to incorrect results being served to the user:
- Table A only has one checkpoint which happens to be a compatibility V2 checkpoint. Because of https://github.com/delta-io/delta/pull/2826/files#diff-12397663510cec22feeea64e8e3f0ddee114de86025e00495b02ab89cc3d2d01R69, newCheckpoint.format will be
single_checkpoint
. - Since we only read sidecars if format == v2, no sidecars will be read for this checkpoint and only the non-file actions will be read.
This could result in the construction of a snapshot which looks legal but is missing references to manyAddFile
s that are present in those sidecars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be fixed - now, we read the highest version checkpoint, preferring V2 over V1 checkpoints if multiple exist at the same version.
|
||
if (pathParts.length == 3 && pathParts[1].equals("checkpoint") && | ||
pathParts[2].equals("parquet")) { | ||
// Classic checkpoint 00000000000000000010.checkpoint.parquet | ||
this.version = Long.parseLong(pathParts[0]); | ||
this.numParts = Optional.empty(); | ||
this.format = CheckpointFormat.SINGLE_PART; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With V2 Checkpoints, a checkpoint that appears to be a single_part checkpoint can actually be a V2 checkpoint. The only way to be sure about whether the checkpoint is V2 or classic is to read the checkpoint and look for the CheckpointManifest
option action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
few questions:
- How does the ActionsIterator.java handle the v2 checkpoint (of json format) or sidecar files when reading the actions?
- Can we add some integration tests that test see end-2-end? You can look at
DeltaTableReadsSuite
for examples. - Do we need to read the sidecar files list within a checkpoint file while creating the
LogSegment
? Can this be delayed until we start reading the delta actions for table state construction? - Small doc explaining the design and any decision will be helpful in understanding.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/checkpoints/Checkpointer.java
Outdated
Show resolved
Hide resolved
} | ||
return Optional.of(files); | ||
} catch (Exception e) { | ||
logger.warn("Failed to load sidecars from file {}", checkpointPath, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the logger.warn(new ParametrizedMessage(..., checkpointPath), e)
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/SnapshotManager.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/checkpoints/CheckpointInstance.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/checkpoints/Checkpointer.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/checkpoints/Checkpointer.java
Outdated
Show resolved
Hide resolved
// Either V2 Parquet manifest or V1 Parquet single checkpoint. | ||
actionItr = tableClient.getParquetHandler().readParquetFiles( | ||
singletonCloseableIterator(checkpointFile), | ||
SidecarFile.READ_SCHEMA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qn: is the read schema SidecardFile.READ_SCHEMA
valid? From my understanding looking at the spec, the checkpoint parquet file will have top-level columns add
, remove
, sidecar
etc. The read schema should be sidecard: struct (path: String, sizeInBytes: Long, modTime: Long)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/checkpoints/CheckpointInstance.java
Show resolved
Hide resolved
if (newCheckpoint.format.usesSidecars()) { | ||
final List<FileStatus> sidecarFiles = listSidecars(tableClient); | ||
// Only read the checkpoint for sidecars if any sidecars exist in the table. | ||
if (!sidecarFiles.isEmpty()) { | ||
// Safe to call .get() here, as we know the CheckpointInstance was initialized | ||
// with a valid filepath. | ||
final Set<Path> referencedSidecars = new HashSet<>( | ||
newCheckpoint.getReferencedSidecars(tableClient, logPath).get()); | ||
newCheckpointFileList.addAll(sidecarFiles.stream() | ||
.filter(f -> referencedSidecars.contains(new Path(f.getPath()))) | ||
.collect(Collectors.toList())); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we delay the loading of the sidecard files until we start reading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so - will update.
…nto kernel-v2-checkpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Few minor changes.
@prakharjain09 @dhruvarya-db Please take a look at it.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/FileNames.java
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/checkpoints/CheckpointInstance.java
Show resolved
Hide resolved
if (format == CheckpointFormat.V2) { | ||
return Objects.hash(version, numParts, format, filePath); | ||
} | ||
return Objects.hash(version, numParts, format, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my other comment.
).foreach(ci => assert(ci.compareTo(CheckpointInstance.MAX_VALUE) < 0)) | ||
} | ||
|
||
test("checkpoint instance equality") { | ||
val single = new CheckpointInstance("01.checkpoint.parquet") | ||
val multipartPart1 = new CheckpointInstance("01.checkpoint.01.02.parquet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The delta spec mandates the versions to be 20 length padded. So these are invalid names.
I think delta-spark also doesn't enforce this right now and this should be fixed in both places. Should not block this PR though.
} | ||
} | ||
|
||
test("v2 checkpoint support with multiple sidecars") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also test empty sidecars?
} | ||
} | ||
|
||
test("compatibility checkpoint with sidecar files") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also have a test case where both checkpoints - v2 and compat are present and assert that v2 is picked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is covered in SnapshotManagerSuite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add here also. From the SnapshotImpl
you can get the LogSegment and verify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chirag-s-db Is this handled?
|
||
// Validate snapshot and data. | ||
validateSnapshot(path.toString, DeltaLog.forTable(spark, path.toString).update()) | ||
checkTable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also assert that the right checkpoint is being used for the snapshot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be covered in SnapshotManagerSuite
- this suite is end-to-end just to cover the overall read process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes its already covered in UTs. It will be good to assert that the tests are exactly testing the scenario that they were supposed to. Helps with avoiding and configuration related issues e.g. this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, this is now in the test as well.
spark.sql("INSERT INTO tbl VALUES (1, 'a'), (2, 'b')") | ||
spark.sql("INSERT INTO tbl VALUES (3, 'c'), (4, 'd')") | ||
spark.sql("INSERT INTO tbl VALUES (5, 'e'), (6, 'f')") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we insert bit more data, may be 100 rows using Seq.range
. Set the parquet/json handler batch size to less than 100, so that we can multiple batches from the ParquetHandler when reading checkpoint manifest or side car files. This exercises the iterator changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/CheckpointV2ReadSuite.scala
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
test("compatibility checkpoint with sidecar files") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chirag-s-db Is this handled?
kernel/kernel-api/src/main/java/io/delta/kernel/internal/SnapshotImpl.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/checkpoints/CheckpointInstance.java
Outdated
Show resolved
Hide resolved
// For V2 checkpoints, the filepath is included in the hash of the instance (as we consider | ||
// different UUID checkpoints to be different checkpoint instances. Otherwise, ignore | ||
// the filepath (which is empty) when hashing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, has been updated for only V2 containing filepath.
public static StructType READ_SCHEMA = new StructType() | ||
.add("path", StringType.STRING, false /* nullable */) | ||
.add("sizeInBytes", LongType.LONG, false /* nullable */) | ||
.add("modificationTime", LongType.LONG, false /* nullable */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move the public static fields to the top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActionsIterator.java
Outdated
Show resolved
Hide resolved
@@ -224,7 +221,7 @@ class CheckpointerSuite extends AnyFunSuite | |||
} | |||
} | |||
|
|||
object CheckpointerSuite { | |||
object CheckpointerSuite extends MockTableClientUtils with VectorTestUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why extend MockTableClientUtils
? VectorTestUtils
makes sense as it contains the stringVector
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
kernel/kernel-api/src/test/scala/io/delta/kernel/internal/SnapshotManagerSuite.scala
Outdated
Show resolved
Hide resolved
"delta.kernel.default.parquet.reader.batch-size" -> "10", | ||
"delta.kernel.default.json.reader.batch-size" -> "10", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting configs like this works for Kernel?
I thought we need to create a custom table client with Configuration
containing these. And pass that custom table client to checkTable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
val expectedV2CkptToRead = | ||
ckptVersionExpected.getOrElse(snapshotFromSpark.version - (snapshotFromSpark.version % 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this check? snapshotFromSpark.version % 2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checks that it is the most recent even version.
assert(snapshotImpl.getLogSegment.checkpoints.asScala.map( | ||
f => new Path(f.getPath).getName.contains("-")).contains(expectV2CheckpointFormat)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use the regex added in FileNames
to identify the v2 checkpoint conclusively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a check in FileNames to identify conclusively, used CheckpointInstance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Which Delta project/connector is this regarding?
Description
Add support for V2 checkpoints. When reconstructing the
LogSegment
of a table at a given version, check if the checkpoint file to be read is a checkpoint manifest. If it is, include the sidecar files referenced by that manifest in theLogSegment
checkpoint files. See #2232How was this patch tested?
See changes to
LogReplaySuite
,SnapshotManagementSuite
,CheckpointerSuite
,FileNamesSuite
, andCheckpointInstanceSuite
.Does this PR introduce any user-facing changes?
No.