-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata Log Entries metadata table #667
base: main
Are you sure you want to change the base?
Metadata Log Entries metadata table #667
Conversation
@@ -292,6 +292,13 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]: | |||
return self.snapshot_by_id(ref.snapshot_id) | |||
return None | |||
|
|||
def _snapshot_as_of_timestamp_ms(self, timestamp_ms: int) -> Optional[Snapshot]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to make this public? PyIceberg ought to have an interface for this, though I suppose it's understandable if we don't want this to be it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting a snapshot by timestamp should be a public function, I'm not opposed to making this public. But I'm unsure if timestamp_ms: int
is the preferred signature we want as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timestamp_ms certainly keeps us closer to the spec, although I could see a case for some kind of int | datetime.datetime
interface (maybe requiring datetimes to be timezone-aware?). Still, it's not exactly hard for callers to do something like datatime.now().timestamp()
so I don't know how necessary it is to work with datetime objects directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1 for just supporting timestamp_ms: int
for now, because it is what we support consistently across all Iceberg APIs, and as a bonus we don't have to worry about validating timezone-awareness.
pyiceberg/table/__init__.py
Outdated
|
||
table_schema = pa.schema([ | ||
("timestamp", pa.timestamp(unit='ms'), True), | ||
("file", pa.string(), True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are file and timestamp nullable? In what scenarios is it expected to have a log entry with a snapshot id but no timestamp or file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, good catch. timestamp
and file
should both be required fields, according to the Java schema. The third element of the tuple represents nullable
, which should be False
for both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means the rest of the field's nullable fields are also wrong.
@@ -226,7 +226,8 @@ def __eq__(self, other: Any) -> bool: | |||
class Snapshot(IcebergBaseModel): | |||
snapshot_id: int = Field(alias="snapshot-id") | |||
parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None) | |||
sequence_number: Optional[int] = Field(alias="sequence-number", default=None) | |||
# cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import | |||
sequence_number: Optional[int] = Field(alias="sequence-number", default=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason the default value for the sequence number has to be changed to 0 as opposed to None?
"latest_sequence_number": latest_snapshot.sequence_number if latest_snapshot else None, | ||
} | ||
|
||
# imitates `addPreviousFile` from Java, might could move this to `metadata_log` constructor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# imitates `addPreviousFile` from Java, might could move this to `metadata_log` constructor | |
# imitates `addPreviousFile` from Java; this could move this to `metadata_log` constructor |
Resolves #594 (and part of #511)
This PR creates a metadata table for "Metadata Log Entries", similar to its spark equivalent (
metadata_log_entries
).To query the metadata table, use
References
Spark metadata log entries table is implemented in
MetadataLogEntriesTable.java
The metadata log entries log is modified during
TableMetadata
creation, in which the current metadata log entry is appended (1, 2, 3). This leads to a surprising behavior where the last row of metadata entries table is based on when the query ran.For example,
Get
Snapshot
by timestamp (_snapshot_as_of_timestamp_ms
) is modeled aftersnapshotIdAsOfTime
from JavaThere's an issue with reading V1 spec where the
sequence-number
isNone
instead of0
. According to the Iceberg spec, when reading v1 metadata for v2,Snapshot
fieldsequence-number
must default to 0 (source).