Metadata Log Entries metadata table #667

kevinjqliu · 2024-04-28T16:59:56Z

Resolves #594 (and part of #511)

This PR creates a metadata table for "Metadata Log Entries", similar to its Spark equivalent (metadata_log_entries).

To query the metadata table, use:

tbl.inspect.metadata_log_entries()

References

Add Snapshots table metadata #524 (snapshots metadata table)
Add Refs metadata table #602 (references metadata table)
Add entries metadata table #551 (entries metadata table)

Spark metadata log entries table is implemented in MetadataLogEntriesTable.java

The metadata log is modified during TableMetadata creation, when the current metadata log entry is appended (1, 2, 3). This leads to surprising behavior: the timestamp in the last row of the metadata log entries table depends on when the query ran.

For example,

import time

a = spark.sql(f"SELECT * FROM {identifier}.metadata_log_entries").toPandas()
time.sleep(5)
b = spark.sql(f"SELECT * FROM {identifier}.metadata_log_entries").toPandas()

(Pdb) display(a)
display (a):                 timestamp                                               file  latest_snapshot_id  latest_schema_id  latest_sequence_number
0 2024-04-28 17:21:31.336  s3://warehouse/default/table_metadata_log_entr...                 NaN               NaN                     NaN
1 2024-04-28 17:21:31.531  s3://warehouse/default/table_metadata_log_entr...        4.105762e+18               0.0                     0.0
2 2024-04-28 17:21:31.600  s3://warehouse/default/table_metadata_log_entr...        7.201925e+18               0.0                     0.0
3 2024-04-28 17:21:34.204  s3://warehouse/default/table_metadata_log_entr...        1.984627e+18               0.0                     0.0

(Pdb) display(b)
display (b):                 timestamp                                               file  latest_snapshot_id  latest_schema_id  latest_sequence_number
0 2024-04-28 17:21:31.336  s3://warehouse/default/table_metadata_log_entr...                 NaN               NaN                     NaN
1 2024-04-28 17:21:31.531  s3://warehouse/default/table_metadata_log_entr...        4.105762e+18               0.0                     0.0
2 2024-04-28 17:21:31.600  s3://warehouse/default/table_metadata_log_entr...        7.201925e+18               0.0                     0.0
3 2024-04-28 17:21:42.336  s3://warehouse/default/table_metadata_log_entr...        1.984627e+18               0.0                     0.0

# Notice the timestamp in the last row of a and b differs by more than 5 seconds

Getting a snapshot by timestamp (_snapshot_as_of_timestamp_ms) is modeled after snapshotIdAsOfTime from Java.
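To make the lookup concrete, here is a minimal sketch of an "as of timestamp" search over a snapshot log. It is not the code from this PR: SnapshotLogEntry and snapshot_id_as_of are hypothetical names, and the sketch assumes the log is ordered from oldest to newest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SnapshotLogEntry:
    """Hypothetical stand-in for one entry of a table's snapshot log."""
    snapshot_id: int
    timestamp_ms: int


def snapshot_id_as_of(log: List[SnapshotLogEntry], timestamp_ms: int) -> Optional[int]:
    """Return the id of the last entry committed at or before timestamp_ms.

    Returns None when every entry is newer than timestamp_ms.
    Assumes `log` is ordered from oldest to newest.
    """
    result: Optional[int] = None
    for entry in log:
        if entry.timestamp_ms <= timestamp_ms:
            result = entry.snapshot_id
    return result


history = [
    SnapshotLogEntry(snapshot_id=101, timestamp_ms=1_000),
    SnapshotLogEntry(snapshot_id=102, timestamp_ms=2_000),
    SnapshotLogEntry(snapshot_id=103, timestamp_ms=3_000),
]
print(snapshot_id_as_of(history, 2_500))
```

A timestamp older than the first entry yields None, which is why the real method returns Optional[Snapshot].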

There's an issue when reading v1 metadata: the sequence-number is None instead of 0. According to the Iceberg spec, when reading v1 metadata for v2, the Snapshot field sequence-number must default to 0 (source).
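The defaulting rule can be sketched without any Iceberg dependency. This is only an illustration: read_sequence_number is a hypothetical helper, and the snapshot is assumed to arrive as a plain dict keyed by the spec's field names.

```python
from typing import Any, Dict

# Name taken from the diff comment in this PR; the value 0 follows the spec rule quoted above.
INITIAL_SEQUENCE_NUMBER = 0


def read_sequence_number(snapshot_json: Dict[str, Any]) -> int:
    """Return the snapshot's sequence number, defaulting to 0 when it is absent.

    v1 metadata does not write `sequence-number`, so a missing (or null) value
    is treated as the initial sequence number.
    """
    value = snapshot_json.get("sequence-number")
    return INITIAL_SEQUENCE_NUMBER if value is None else value


v1_snapshot = {"snapshot-id": 1, "timestamp-ms": 1_000}
v2_snapshot = {"snapshot-id": 2, "timestamp-ms": 2_000, "sequence-number": 7}
print(read_sequence_number(v1_snapshot), read_sequence_number(v2_snapshot))
```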

corleyma · 2024-04-29T21:11:44Z

pyiceberg/table/metadata.py

@@ -292,6 +292,13 @@ def snapshot_by_name(self, name: str) -> Optional[Snapshot]:
            return self.snapshot_by_id(ref.snapshot_id)
        return None

+    def _snapshot_as_of_timestamp_ms(self, timestamp_ms: int) -> Optional[Snapshot]:


Any reason not to make this public? PyIceberg ought to have an interface for this, though I suppose it's understandable if we don't want this to be it.

Getting a snapshot by timestamp should be a public function, and I'm not opposed to making this one public. But I'm unsure whether timestamp_ms: int is the signature we want as input.

timestamp_ms certainly keeps us closer to the spec, although I could see a case for some kind of int | datetime.datetime interface (maybe requiring datetimes to be timezone-aware?). Still, it's not exactly hard for callers to do something like int(datetime.now().timestamp() * 1000), so I don't know how necessary it is to work with datetime objects directly.

I'm +1 for just supporting timestamp_ms: int for now, because it is what we support consistently across all Iceberg APIs, and as a bonus we don't have to worry about validating timezone-awareness.
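For callers who start from a datetime, the conversion discussed in this thread could look like the sketch below. to_timestamp_ms is a hypothetical helper, not part of PyIceberg; it rejects naive datetimes and uses integer timedelta arithmetic to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_ms(value: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError("expected a timezone-aware datetime")
    # timedelta // timedelta is exact integer division (sub-millisecond parts are floored).
    return (value - EPOCH) // timedelta(milliseconds=1)


now_ms = to_timestamp_ms(datetime.now(timezone.utc))
print(now_ms)
```

The resulting integer can then be passed wherever a timestamp_ms: int argument is expected.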

corleyma · 2024-04-29T21:25:05Z

pyiceberg/table/__init__.py

+
+        table_schema = pa.schema([
+            ("timestamp", pa.timestamp(unit='ms'), True),
+            ("file", pa.string(), True),


why are file and timestamp nullable? In what scenarios is it expected to have a log entry with a snapshot id but no timestamp or file?

ah, good catch. timestamp and file should both be required fields, according to the Java schema. The third element of the tuple represents nullable, which should be False for both.

This means the nullable flags on the rest of the fields are also wrong.

syun64 · 2024-04-30T12:51:11Z

pyiceberg/table/snapshots.py

@@ -226,7 +226,8 @@ def __eq__(self, other: Any) -> bool:
 class Snapshot(IcebergBaseModel):
    snapshot_id: int = Field(alias="snapshot-id")
    parent_snapshot_id: Optional[int] = Field(alias="parent-snapshot-id", default=None)
-    sequence_number: Optional[int] = Field(alias="sequence-number", default=None)
+    # cannot import `INITIAL_SEQUENCE_NUMBER` due to circular import
+    sequence_number: Optional[int] = Field(alias="sequence-number", default=0)


Is there a reason the default value for the sequence number has to be changed to 0 as opposed to None?

syun64 · 2024-04-30T12:51:55Z

pyiceberg/table/__init__.py

+                "latest_sequence_number": latest_snapshot.sequence_number if latest_snapshot else None,
+            }
+
+        # imitates `addPreviousFile` from Java, might could move this to `metadata_log` constructor


Suggested change

-        # imitates `addPreviousFile` from Java, might could move this to `metadata_log` constructor
+        # imitates `addPreviousFile` from Java; this could move to `metadata_log` constructor
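The step this comment describes, appending the current metadata file to the stored log before building rows, can be sketched as follows. MetadataLogEntry here is a hypothetical dataclass rather than the PyIceberg model, and using a last-updated timestamp for the appended entry is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MetadataLogEntry:
    """Hypothetical stand-in for one entry of the table's metadata log."""
    metadata_file: str
    timestamp_ms: int


def log_with_current_file(
    metadata_log: List[MetadataLogEntry],
    current_metadata_file: str,
    last_updated_ms: int,
) -> List[MetadataLogEntry]:
    """Return a new list: the stored log plus an entry for the current metadata file.

    The stored log only lists previous metadata files, so the current one is
    appended here. The input list is left unchanged.
    """
    current = MetadataLogEntry(current_metadata_file, last_updated_ms)
    return [*metadata_log, current]


previous = [
    MetadataLogEntry("v1.metadata.json", 1_000),
    MetadataLogEntry("v2.metadata.json", 2_000),
]
entries = log_with_current_file(previous, "v3.metadata.json", 3_000)
print([entry.metadata_file for entry in entries])
```

Building a new list instead of mutating the stored log keeps repeated queries from appending the current file more than once.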

kevinjqliu added 6 commits April 28, 2024 11:06

add metadata_entries table with tests

9a0423d

make test work

ecec57e

remove comment

9e506c2

add doc

9c77d57

make lint

b26f08f

comment

58b0609

kevinjqliu marked this pull request as ready for review April 28, 2024 17:18

comment

f7dd165

corleyma reviewed Apr 29, 2024


use pa.field and set nullable properly

4655c97

syun64 reviewed Apr 30, 2024


Fokko mentioned this pull request May 13, 2024

Add metadata tables #511

Open

8 tasks

kevinjqliu mentioned this pull request May 14, 2024

PyIceberg Near-Term Roadmap #736

Open

32 tasks

kevinjqliu commented Apr 28, 2024 (edited)
corleyma Apr 29, 2024 (edited)
kevinjqliu Apr 30, 2024
corleyma Apr 30, 2024 (edited)
syun64 May 2, 2024
corleyma Apr 29, 2024
kevinjqliu Apr 30, 2024
kevinjqliu Apr 30, 2024
syun64 Apr 30, 2024
syun64 Apr 30, 2024
