Metadata Log Entries metadata table #667

Open
wants to merge 8 commits into
base: main
from

Conversation

kevinjqliu
Collaborator

@kevinjqliu kevinjqliu commented Apr 28, 2024

Resolves #594 (and part of #511)

This PR creates a metadata table for "Metadata Log Entries", similar to its Spark equivalent (metadata_log_entries).

To query the metadata table, use:

tbl.inspect.metadata_log_entries()

References

The Spark metadata log entries table is implemented in MetadataLogEntriesTable.java.

The metadata log is modified during TableMetadata creation, when an entry for the current metadata file is appended (1, 2, 3). This leads to surprising behavior: the timestamp in the last row of the metadata log entries table depends on when the query ran.

For example,

import time

# Run the same query twice, 5 seconds apart
a = spark.sql(f"SELECT * FROM {identifier}.metadata_log_entries").toPandas()
time.sleep(5)
b = spark.sql(f"SELECT * FROM {identifier}.metadata_log_entries").toPandas()

(Pdb) display(a)
display (a):                 timestamp                                               file  latest_snapshot_id  latest_schema_id  latest_sequence_number
0 2024-04-28 17:21:31.336  s3://warehouse/default/table_metadata_log_entr...                 NaN               NaN                     NaN
1 2024-04-28 17:21:31.531  s3://warehouse/default/table_metadata_log_entr...        4.105762e+18               0.0                     0.0
2 2024-04-28 17:21:31.600  s3://warehouse/default/table_metadata_log_entr...        7.201925e+18               0.0                     0.0
3 2024-04-28 17:21:34.204  s3://warehouse/default/table_metadata_log_entr...        1.984627e+18               0.0                     0.0

(Pdb) display(b)
display (b):                 timestamp                                               file  latest_snapshot_id  latest_schema_id  latest_sequence_number
0 2024-04-28 17:21:31.336  s3://warehouse/default/table_metadata_log_entr...                 NaN               NaN                     NaN
1 2024-04-28 17:21:31.531  s3://warehouse/default/table_metadata_log_entr...        4.105762e+18               0.0                     0.0
2 2024-04-28 17:21:31.600  s3://warehouse/default/table_metadata_log_entr...        7.201925e+18               0.0                     0.0
3 2024-04-28 17:21:42.336  s3://warehouse/default/table_metadata_log_entr...        1.984627e+18               0.0                     0.0

# Notice the timestamp in the last row of a and b differs by more than 5 seconds

Getting a snapshot by timestamp (_snapshot_as_of_timestamp_ms) is modeled after snapshotIdAsOfTime from Java.
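
A minimal sketch of such a lookup, assuming it scans the snapshot log in order and keeps the last entry whose timestamp is at or before the requested time. `SnapshotLogEntry` and `snapshot_id_as_of` are hypothetical stand-ins for illustration, not PyIceberg's classes or the method added in this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SnapshotLogEntry:
    """Hypothetical stand-in for one entry of a table's snapshot log."""

    snapshot_id: int
    timestamp_ms: int


def snapshot_id_as_of(log: List[SnapshotLogEntry], timestamp_ms: int) -> Optional[int]:
    """Return the ID of the snapshot that was current at `timestamp_ms`, or None."""
    result: Optional[int] = None
    # Assumption: `log` is ordered from oldest to newest entry.
    for entry in log:
        if entry.timestamp_ms <= timestamp_ms:
            result = entry.snapshot_id
    return result


log = [
    SnapshotLogEntry(snapshot_id=10, timestamp_ms=1_000),
    SnapshotLogEntry(snapshot_id=11, timestamp_ms=2_000),
    SnapshotLogEntry(snapshot_id=12, timestamp_ms=3_000),
]
current_at_2500 = snapshot_id_as_of(log, 2_500)
```

A timestamp earlier than the first log entry yields None, which is why the rows above can have empty snapshot columns.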

There's an issue with reading v1 metadata: the snapshot's sequence-number is None instead of 0. According to the Iceberg spec, when reading v1 metadata for v2, the Snapshot field sequence-number must default to 0 (source).
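
To illustrate the default, here is a minimal sketch that uses a plain dataclass rather than PyIceberg's pydantic `Snapshot` model. `SnapshotSketch` and `parse_snapshot` are hypothetical names; the only behavior shown is filling in 0 when the `sequence-number` key is absent, as in v1 metadata.

```python
import json
from dataclasses import dataclass

# Assumed to match the constant mentioned in the diff below.
INITIAL_SEQUENCE_NUMBER = 0


@dataclass
class SnapshotSketch:
    """Hypothetical, reduced snapshot record used only for this example."""

    snapshot_id: int
    sequence_number: int


def parse_snapshot(raw: str) -> SnapshotSketch:
    """Parse a snapshot JSON object, defaulting a missing sequence number to 0."""
    data = json.loads(raw)
    return SnapshotSketch(
        snapshot_id=data["snapshot-id"],
        sequence_number=data.get("sequence-number", INITIAL_SEQUENCE_NUMBER),
    )


v1_snapshot = parse_snapshot('{"snapshot-id": 1}')
v2_snapshot = parse_snapshot('{"snapshot-id": 2, "sequence-number": 7}')
```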

@kevinjqliu kevinjqliu marked this pull request as ready for review April 28, 2024 17:18
@@ -292,6 +292,13 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
return self.snapshot_by_id(ref.snapshot_id)
return None

def _snapshot_as_of_timestamp_ms(self, timestamp_ms: int) -> Optional[Snapshot]:

@corleyma corleyma Apr 29, 2024

Any reason not to make this public? PyIceberg ought to have an interface for this, though I suppose it's understandable if we don't want this to be it.

Collaborator Author

Getting a snapshot by timestamp should be a public function, and I'm not opposed to making this public. But I'm unsure whether timestamp_ms: int is the signature we want as input.

@corleyma corleyma Apr 30, 2024

timestamp_ms certainly keeps us closer to the spec, although I could see a case for some kind of int | datetime.datetime interface (maybe requiring datetimes to be timezone-aware?). Still, it's not exactly hard for callers to do something like int(datetime.now().timestamp() * 1000), so I don't know how necessary it is to work with datetime objects directly.

Collaborator

I'm +1 for just supporting timestamp_ms: int for now, because it is what we support consistently across all Iceberg APIs, and as a bonus we don't have to worry about validating timezone-awareness.
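
For callers who do hold datetime objects, the conversion discussed in this thread could look like the sketch below. `to_timestamp_ms` is a hypothetical caller-side helper, not part of PyIceberg; it rejects naive datetimes, which is the validation the int-only signature lets the library avoid.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError("expected a timezone-aware datetime")
    # Dividing timedeltas keeps the arithmetic in integers and avoids float rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


timestamp_ms = to_timestamp_ms(datetime.now(timezone.utc))
```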


table_schema = pa.schema([
("timestamp", pa.timestamp(unit='ms'), True),
("file", pa.string(), True),

Why are file and timestamp nullable? In what scenarios is it expected to have a log entry with a snapshot id but no timestamp or file?

Collaborator Author

ah, good catch. timestamp and file should both be required fields, according to the Java schema. The third element of the tuple represents nullable, which should be False for both.

Collaborator Author

This means the nullable flags on the rest of the fields are also wrong.

@@ -226,7 +226,8 @@ def __eq__(self, other: Any) -> bool:
 class Snapshot(IcebergBaseModel):
     snapshot_id: int = Field(alias="snapshot-id")
     parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None)
-    sequence_number: Optional[int] = Field(alias="sequence-number", default=None)
+    # cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import
+    sequence_number: Optional[int] = Field(alias="sequence-number", default=0)
Collaborator

Is there a reason the default value for the sequence number has to be changed to 0 as opposed to None?

"latest_sequence_number": latest_snapshot.sequence_number if latest_snapshot else None,
}

# imitates `addPreviousFile` from Java, might could move this to `metadata_log` constructor
Collaborator

Suggested change
# imitates `addPreviousFile` from Java, might could move this to `metadata_log` constructor
# imitates `addPreviousFile` from Java; this could move to the `metadata_log` constructor

@Fokko Fokko mentioned this pull request May 13, 2024
8 tasks
@kevinjqliu kevinjqliu mentioned this pull request May 14, 2024
32 tasks
Development

Successfully merging this pull request may close these issues.

[feat request] Add metadata_log_entries metadata table
3 participants