[8619] feat(client-python): add statistic value serdes by tsungchih · Pull Request #10877 · apache/gravitino

tsungchih · 2026-04-27T14:25:37Z

What changes were proposed in this pull request?

This PR adds a Python client implementation of statistic value serdes, modeled on the following Java classes.

JsonUtils.java

StatisticValueSerializer
StatisticValueDeserializer

StatisticValues.java

StatisticValues

Why are the changes needed?

These classes are prerequisites for implementing MetadataObjectStatisticsOperations.

Fix: #8619

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

tsungchih · 2026-04-27T15:11:41Z

@jerryshao Could you please help with reviewing this PR when you are available? I would greatly appreciate any feedback from you. Thanks.

github-actions · 2026-04-27T15:32:57Z

Code Coverage Report

Overall Project	65.53%	🟢
Files changed	No Java source files changed	-

Module	Coverage
aliyun	1.73%	🔴
api	47.13%	🟢
authorization-common	85.96%	🟢
aws	1.1%	🔴
azure	2.6%	🔴
catalog-common	10.2%	🔴
catalog-fileset	80.02%	🟢
catalog-glue	82.47%	🟢
catalog-hive	81.83%	🟢
catalog-jdbc-clickhouse	79.06%	🟢
catalog-jdbc-common	43.93%	🟢
catalog-jdbc-doris	80.28%	🟢
catalog-jdbc-hologres	54.03%	🟢
catalog-jdbc-mysql	79.23%	🟢
catalog-jdbc-oceanbase	78.38%	🟢
catalog-jdbc-postgresql	82.05%	🟢
catalog-jdbc-starrocks	78.27%	🟢
catalog-kafka	77.01%	🟢
catalog-lakehouse-generic	45.07%	🟢
catalog-lakehouse-hudi	79.1%	🟢
catalog-lakehouse-iceberg	86.98%	🟢
catalog-lakehouse-paimon	77.71%	🟢
catalog-model	77.72%	🟢
cli	44.51%	🟢
client-java	77.63%	🟢
common	48.75%	🟢
core	81.5%	🟢
filesystem-hadoop3	76.97%	🟢
flink	40.55%	🟢
flink-runtime	0.0%	🔴
gcp	14.2%	🔴
hadoop-common	10.39%	🔴
hive-metastore-common	46.83%	🟢
iceberg-common	55.24%	🟢
iceberg-rest-server	68.05%	🟢
integration-test-common	0.0%	🔴
jobs	66.17%	🟢
lance-common	23.88%	🔴
lance-rest-server	57.84%	🟢
lineage	53.02%	🟢
optimizer	82.95%	🟢
optimizer-api	21.95%	🔴
server	85.6%	🟢
server-common	70.01%	🟢
spark	32.79%	🔴
spark-common	39.09%	🔴
trino-connector	34.28%	🔴

Copilot

Pull request overview

This PR adds Python client support for statistic value modeling and JSON (de)serialization, intended as a prerequisite for implementing statistics-related operations (e.g., MetadataObjectStatisticsOperations).

Changes:

Introduces StatisticValues (typed wrappers for boolean/long/double/string/list/object statistic values).
Adds StatisticValueSerdes to serialize/deserialize StatisticValue instances to/from JSON-compatible Python primitives.
Adds unit tests covering StatisticValues behavior and JSON serdes roundtrips.
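
As a rough illustration of the roundtrip the overview describes, here is a minimal sketch of mapping JSON-compatible primitives to typed wrappers and back. `Stat`, `serialize`, and `deserialize` are hypothetical stand-ins, not the classes added in this PR, and the dispatch order is an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Stat:
    """Hypothetical typed wrapper: kind is boolean/long/double/string/list/object."""

    kind: str
    payload: Any


def deserialize(data: Any) -> Stat:
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(data, bool):
        return Stat("boolean", data)
    if isinstance(data, int):
        return Stat("long", data)
    if isinstance(data, float):
        return Stat("double", data)
    if isinstance(data, str):
        return Stat("string", data)
    if isinstance(data, list):
        return Stat("list", [deserialize(item) for item in data])
    if isinstance(data, dict):
        return Stat("object", {k: deserialize(v) for k, v in data.items()})
    raise ValueError(f"Unsupported statistic value: {data!r}")


def serialize(stat: Stat) -> Any:
    # Containers are unwrapped recursively; scalars are returned as-is.
    if stat.kind == "list":
        return [serialize(item) for item in stat.payload]
    if stat.kind == "object":
        return {k: serialize(v) for k, v in stat.payload.items()}
    return stat.payload


raw = {"rows": 42, "ratio": 0.5, "sorted": True, "tags": ["a", "b"]}
roundtripped = serialize(deserialize(raw))
```

The one Python-specific detail worth noting is the `bool` check coming first; without it, `True` would be classified as a long.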

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

File	Description
clients/client-python/gravitino/api/stats/statistic_values.py	Adds concrete `StatisticValue` implementations and factory helpers.
clients/client-python/gravitino/api/stats/json_serdes/statistic_value_serdes.py	Implements JSON serialization/deserialization for `StatisticValue`.
clients/client-python/gravitino/api/stats/json_serdes/__init__.py	Introduces the stats JSON serdes package.
clients/client-python/gravitino/api/rel/types/json_serdes/base.py	Extends `JsonSerializable` type bounds to include `StatisticValue`.
clients/client-python/tests/unittests/api/stats/test_statistic_values.py	Adds unit tests for `StatisticValues`.
clients/client-python/tests/unittests/api/stats/test_statistic_value_serdes.py	Adds unit tests for `StatisticValueSerdes`.

Copilot · 2026-04-29T10:56:58Z

+        def __hash__(self) -> int:
+            return hash(tuple(v.value() for v in self._value_list))
+


ListValue.__hash__ hashes v.value() for each element. This will raise TypeError for nested statistic values whose .value() is unhashable (e.g., a nested ListValue returns list, ObjectValue returns dict). Hashing should be based on the StatisticValue elements themselves (or on a stable serialized representation), not their raw .value().
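
The failure mode in this comment can be reproduced with a small sketch. `LongValue` and `ListValue` below are simplified, hypothetical stand-ins, not the Gravitino classes; `raw_hash` mirrors the reviewed code, while `__hash__` hashes the elements themselves.

```python
from typing import Any, List


class LongValue:
    """Hypothetical stand-in for a scalar statistic value."""

    def __init__(self, value: int) -> None:
        self._value = value

    def value(self) -> int:
        return self._value

    def __eq__(self, other: object) -> bool:
        return isinstance(other, LongValue) and self._value == other._value

    def __hash__(self) -> int:
        return hash(self._value)


class ListValue:
    """Hypothetical stand-in for a list statistic value."""

    def __init__(self, values: List[Any]) -> None:
        self._value_list = values

    def value(self) -> List[Any]:
        return self._value_list

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ListValue) and self._value_list == other._value_list

    def raw_hash(self) -> int:
        # Mirrors the reviewed code: hashes each element's raw .value().
        return hash(tuple(v.value() for v in self._value_list))

    def __hash__(self) -> int:
        # Hashes the elements themselves, so nested values stay hashable.
        return hash(tuple(self._value_list))


nested = ListValue([ListValue([LongValue(1)]), ListValue([LongValue(2)])])

try:
    nested.raw_hash()
    raw_hash_failed = False
except TypeError:
    # Each inner .value() is a plain list, which is unhashable.
    raw_hash_failed = True

element_hash = hash(nested)
```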

Copilot · 2026-04-29T10:56:59Z

+        def __hash__(self) -> int:
+            return hash(tuple(v.value() for v in self._value_map.values()))
+


ObjectValue.__hash__ hashes only value_map.values() in insertion order. This can violate the hash contract because two equal dicts (same key/value pairs but different insertion order) compare equal yet can produce different hashes; it also fails for nested unhashable .value() results. Use a key-inclusive, order-independent representation (e.g., a frozenset/tuple of sorted (key, value) pairs, hashing the StatisticValue objects).
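
A minimal sketch of the difference, using plain dicts of ints instead of StatisticValue objects (the function names are hypothetical):

```python
from typing import Dict


def values_only_hash(value_map: Dict[str, int]) -> int:
    # Mirrors the reviewed approach: keys are ignored and insertion order leaks in.
    return hash(tuple(value_map.values()))


def key_inclusive_hash(value_map: Dict[str, int]) -> int:
    # Sorting (key, value) pairs by key removes the dependence on insertion order.
    return hash(tuple(sorted(value_map.items(), key=lambda kv: kv[0])))


a = {"min": 1, "max": 9}
b = {"max": 9, "min": 1}  # same pairs, different insertion order

assert a == b  # dict equality ignores insertion order
same_hash = key_inclusive_hash(a) == key_inclusive_hash(b)
values_in_same_order = tuple(a.values()) == tuple(b.values())
```

Because the two value tuples differ in order, `values_only_hash` can return different results for equal dicts, whereas the key-sorted variant feeds identical tuples to `hash` and always agrees.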

Copilot · 2026-04-29T10:56:59Z

+
+        self.assertEqual(value.value(), value_dict)
+        self.assertEqual(
+            value.data_type().name(),
+            Types.StructType.of(
+                Types.StructType.Field.nullable_field(key, value.data_type())
+                for key, value in value_dict.items()
+            ).name(),
+        )
+        self.assertEqual(
+            hash(value), hash(tuple(v.value() for v in value_dict.values()))
+        )
+        self.assertEqual(value, twin_value)
+        self.assertNotEqual(value, another_value)
+        self.assertEqual(
+            value.data_type().name(),
+            Types.StructType.of(
+                Types.StructType.Field.nullable_field(key, value.data_type())
+                for key, value in value_dict.items()
+            ).name(),
+        )
+        self.assertEqual(


This assertion doesn’t actually validate the struct’s field types: StructType.name() is always Name.STRUCT, and the expected StructType.of(...) call is also constructed incorrectly (it passes a generator as a single field). Consider asserting on value.data_type().fields() or simple_string() and build the expected struct with Types.StructType.of(*field_list) using each entry’s StatisticValue.data_type().

Suggested change

-        self.assertEqual(value.value(), value_dict)
-        self.assertEqual(
-            value.data_type().name(),
-            Types.StructType.of(
-                Types.StructType.Field.nullable_field(key, value.data_type())
-                for key, value in value_dict.items()
-            ).name(),
-        )
-        self.assertEqual(
-            hash(value), hash(tuple(v.value() for v in value_dict.values()))
-        )
-        self.assertEqual(value, twin_value)
-        self.assertNotEqual(value, another_value)
-        self.assertEqual(
-            value.data_type().name(),
-            Types.StructType.of(
-                Types.StructType.Field.nullable_field(key, value.data_type())
-                for key, value in value_dict.items()
-            ).name(),
-        )
-        self.assertEqual(
+        expected_data_type = Types.StructType.of(
+            *[
+                Types.StructType.Field.nullable_field(
+                    key, statistic_value.data_type()
+                )
+                for key, statistic_value in value_dict.items()
+            ]
+        )
+        self.assertEqual(value.value(), value_dict)
+        self.assertEqual(value.data_type().fields(), expected_data_type.fields())
+        self.assertEqual(
+            hash(value), hash(tuple(v.value() for v in value_dict.values()))
+        )
+        self.assertEqual(value, twin_value)
+        self.assertNotEqual(value, another_value)
+        self.assertEqual(value.data_type().fields(), expected_data_type.fields())
+        self.assertEqual(

Copilot · 2026-04-29T10:56:59Z

+        self.assertEqual(
+            value.data_type().name(),
+            Types.ListType.nullable(value_list[0].data_type()).name(),
+        )
+        self.assertEqual(hash(value), hash(tuple(v.value() for v in value_list)))
+        self.assertEqual(value, twin_value)
+        self.assertNotEqual(value, another_value)


There are duplicated assertions in this test (the data_type().name()/hash()/==/!= checks are repeated). Removing the duplicates will make the test easier to maintain without reducing coverage.

Copilot · 2026-04-29T10:57:00Z

+        self.assertEqual(
+            value.data_type().name(),
+            Types.StructType.of(
+                Types.StructType.Field.nullable_field(key, value.data_type())
+                for key, value in value_dict.items()
+            ).name(),
+        )
+        self.assertEqual(
+            hash(value), hash(tuple(v.value() for v in value_dict.values()))
+        )
+        self.assertEqual(value, twin_value)
+        self.assertNotEqual(value, another_value)


There are duplicated assertions in this test (the data_type().name()/hash()/==/!= checks are repeated). Removing the duplicates will make the test easier to maintain without reducing coverage.

Suggested change

-        self.assertEqual(
-            value.data_type().name(),
-            Types.StructType.of(
-                Types.StructType.Field.nullable_field(key, value.data_type())
-                for key, value in value_dict.items()
-            ).name(),
-        )
-        self.assertEqual(
-            hash(value), hash(tuple(v.value() for v in value_dict.values()))
-        )
-        self.assertEqual(value, twin_value)
-        self.assertNotEqual(value, another_value)

Copilot · 2026-04-29T10:57:00Z

+class StatisticValueSerdes(JsonSerializable[StatisticValue[Any]]):
+    """Customized JSON Serializer and Deserializer for StatisticValue."""
+


The PR description mentions adding Java classes (e.g., JsonUtils.java / StatisticValues.java), but the actual changes in this PR are in the Python client. Please update the PR description to reflect the Python implementation so reviewers/users aren’t confused.

Copilot · 2026-04-29T10:57:00Z

+    def object_value(value_list: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]":
+        """Creates a statistic value that holds a list of other statistic values.
+
+        Args:
+            value_list: the list of statistic values to be held by this statistic value
+
+        Returns:
+            A ListValue instance containing the provided list of statistic values
+        """
+        return StatisticValues.ObjectValue(value_list)


object_value takes a dict/map, but its docstring/argument naming still refers to a “list”, and the Returns section says it returns ListValue. This is misleading for API consumers; please update the docstring and rename the parameter to something like value_map (and fix the Returns description to ObjectValue).

Suggested change

-    def object_value(value_list: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]":
-        """Creates a statistic value that holds a list of other statistic values.
-
-        Args:
-            value_list: the list of statistic values to be held by this statistic value
-
-        Returns:
-            A ListValue instance containing the provided list of statistic values
-        """
-        return StatisticValues.ObjectValue(value_list)
+    def object_value(value_map: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]":
+        """Creates a statistic value that holds a map of other statistic values.
+
+        Args:
+            value_map: the map of statistic values to be held by this statistic value
+
+        Returns:
+            An ObjectValue instance containing the provided map of statistic values
+        """
+        return StatisticValues.ObjectValue(value_map)

Copilot · 2026-04-29T10:57:01Z

+                Types.StructType.Field.nullable_field(key, value.data_type())
+                for key, value in self._value_map.items()


Types.StructType.of is a varargs factory (of(*fields)), but this code passes a generator expression as a single argument. That builds a StructType whose only “field” is a generator object, which will break when callers use fields() / simple_string() / serialization. Build a list of Field and splat it (or use of(*field_list)).

Suggested change

-                Types.StructType.Field.nullable_field(key, value.data_type())
-                for key, value in self._value_map.items()
+                *[
+                    Types.StructType.Field.nullable_field(key, value.data_type())
+                    for key, value in self._value_map.items()
+                ]
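
To illustrate why the unpacking matters, here is a small sketch with a hypothetical varargs factory. This `StructType` is a stand-in loosely shaped like `Types.StructType`, not the real class, and fields are plain tuples.

```python
from typing import Tuple


class StructType:
    """Hypothetical stand-in with a varargs factory."""

    def __init__(self, fields: Tuple[object, ...]) -> None:
        self._fields = fields

    @classmethod
    def of(cls, *fields: object) -> "StructType":
        return cls(fields)

    def fields(self) -> Tuple[object, ...]:
        return self._fields


value_map = {"a": "long", "b": "string"}

# Generator expression: the factory receives ONE argument, the generator itself.
wrong = StructType.of((key, dtype) for key, dtype in value_map.items())

# Unpacked list: the factory receives one argument per field.
right = StructType.of(*[(key, dtype) for key, dtype in value_map.items()])

print(len(wrong.fields()), len(right.fields()))  # 1 2
```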

jerryshao · 2026-04-29T11:03:00Z

Generally LGTM. Can you address Copilot's comments?

apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

tsungchih · 2026-04-30T09:54:47Z

@jerryshao I've addressed the review comments from Copilot. Could you please review the changes when you're available? Thanks!

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Copilot · 2026-04-30T10:06:44Z

+        self.assertEqual(
+            hash(value), hash(tuple(v.value() for v in value_dict.values()))
+        )


ObjectValue.__hash__ is computed via StatisticValues._make_hash, which includes keys (sorted) and hashes of nested values. This test’s expected hash (hash(tuple(v.value() for v in value_dict.values()))) ignores keys and insertion-order differences, so it won’t match the actual implementation. Update the assertion to either compare hash(value) with hash(twin_value) only, or compute the expected hash using the same key-inclusive, order-independent representation.

Suggested change

-        self.assertEqual(
-            hash(value), hash(tuple(v.value() for v in value_dict.values()))
-        )
+        self.assertEqual(hash(value), hash(twin_value))

Copilot · 2026-04-30T10:06:45Z

 from gravitino.api.rel.types.types import Type
+from gravitino.api.stats.statistic_value import StatisticValue
 from gravitino.dto.rel.partitioning.partitioning import Partitioning
 from gravitino.dto.rel.partitions.partition_dto import PartitionDTO

 _GravitinoTypeT = TypeVar(
    "_GravitinoTypeT",
    bound=Union[
-        Expression, Type, Partitioning, PartitionDTO, Distribution, Index, SortOrder
+        Expression,
+        Type,
+        Partitioning,
+        PartitionDTO,
+        Distribution,
+        Index,
+        SortOrder,
+        StatisticValue,
    ],


Adding StatisticValue to the JsonSerializable type bound introduces an unnecessary dependency from api/rel/types/json_serdes into api/stats (and forces importing stats whenever JsonSerializable is imported). Since JsonSerializable is a generic serializer interface, consider removing the restrictive bound=Union[...] (or loosening it to a generic unconstrained TypeVar) so new serdes types don’t require editing this base module and to keep package layering clean.

The suggested change is already in the branch, so I am not sure where this comment still needs to be addressed.
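
For reference, the loosening that the review comment describes could look roughly like the sketch below. This `JsonSerializable` is a hypothetical stand-in, and the method names and signatures are assumptions rather than the actual base module.

```python
from abc import ABC, abstractmethod
from typing import Any, Generic, TypeVar

# Unconstrained: any type may be plugged in without editing this module.
T = TypeVar("T")


class JsonSerializable(ABC, Generic[T]):
    """Hypothetical generic serdes interface; method names are assumptions."""

    @classmethod
    @abstractmethod
    def serialize(cls, obj: T) -> Any:
        ...

    @classmethod
    @abstractmethod
    def deserialize(cls, data: Any) -> T:
        ...


class IntSerdes(JsonSerializable[int]):
    """A new serdes defined elsewhere, with no change to the base module."""

    @classmethod
    def serialize(cls, obj: int) -> Any:
        return obj

    @classmethod
    def deserialize(cls, data: Any) -> int:
        return int(data)
```

With an unbounded `TypeVar`, the base module no longer needs to import every domain type it might serialize, which is the layering concern raised in the comment.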

Copilot · 2026-04-30T10:06:45Z

+                    Types.StructType.Field.nullable_field(key, value.data_type())
+                    for key, value in self._value_map.items()


ObjectValue.data_type() builds a StructType using self._value_map.items() iteration order. Since Types.StructType.__eq__ is order-sensitive (fields list equality), two ObjectValues with identical key/value pairs but different insertion order will report different data_type(), and ListValue type-homogeneity checks can fail for lists of such objects. Consider constructing struct fields in a deterministic order (e.g., sort by key) so the schema is stable and independent of dict insertion order.

Suggested change

-                    Types.StructType.Field.nullable_field(key, value.data_type())
-                    for key, value in self._value_map.items()
+                    Types.StructType.Field.nullable_field(
+                        key, self._value_map[key].data_type()
+                    )
+                    for key in sorted(self._value_map)
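
The effect of sorting by key can be seen in a small sketch that uses `(name, type)` tuples as hypothetical stand-ins for struct fields:

```python
from typing import Dict, List, Tuple


def struct_fields(value_map: Dict[str, str], sort_keys: bool) -> List[Tuple[str, str]]:
    """Builds (name, type) pairs; a hypothetical stand-in for StructType fields."""
    keys = sorted(value_map) if sort_keys else list(value_map)
    return [(key, value_map[key]) for key in keys]


a = {"min": "long", "max": "long"}
b = {"max": "long", "min": "long"}  # equal dict, different insertion order

unsorted_equal = struct_fields(a, sort_keys=False) == struct_fields(b, sort_keys=False)
sorted_equal = struct_fields(a, sort_keys=True) == struct_fields(b, sort_keys=True)
print(unsorted_equal, sorted_equal)  # False True
```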

Copilot · 2026-04-30T10:06:45Z

+    def test_deserialize_unsupported_type(self):
+        with self.assertRaises(ValueError):
+            StatisticValueSerdes.deserialize(None)
+


StatisticValueSerdes.deserialize(None) triggers Precondition.check_argument(data is not None, ...) which raises IllegalArgumentException, not ValueError. This test will fail as written; update it to expect IllegalArgumentException (and optionally assert on the message).
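
A sketch of the corrected test shape follows. `IllegalArgumentException` and `deserialize` are hypothetical stand-ins defined locally, the exception is assumed not to subclass `ValueError`, and the error message is invented for illustration.

```python
import unittest
from typing import Any


class IllegalArgumentException(Exception):
    """Hypothetical stand-in; assumed here NOT to subclass ValueError."""


def deserialize(data: Any) -> Any:
    # Stand-in for a precondition check that rejects None input.
    if data is None:
        raise IllegalArgumentException("Cannot deserialize statistic value from null")
    return data


class TestDeserialize(unittest.TestCase):
    def test_deserialize_none(self) -> None:
        # Expect the precondition's exception type, and optionally check the message.
        with self.assertRaises(IllegalArgumentException) as ctx:
            deserialize(None)
        self.assertIn("null", str(ctx.exception))
```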

apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

tsungchih · 2026-05-01T05:23:24Z

@jerryshao I've addressed the second round of review comments from Copilot. Could you please review the changes when you're available? Thanks!

tsungchih and others added 4 commits April 27, 2026 22:08

feat: add statistic values

e5bb20e

apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

test: add unit tests for statistic values

a1c6b84

apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

feat: add statistic value serdes

ff4768b

apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

test: add unit tests for statistic value serdes

619faf3

apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

tsungchih marked this pull request as ready for review April 27, 2026 15:11

jerryshao requested a review from Copilot April 29, 2026 10:52

Copilot started reviewing on behalf of jerryshao April 29, 2026 10:53

Copilot AI reviewed Apr 29, 2026

tsungchih and others added 3 commits April 30, 2026 16:26

refactor: fix copilot review comment for nested value hash

952a1b4

apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

test: enrich unit tests to cover nested value hash

02402f6

apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

refactor: address copilot review feedback

49efeea

apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

jerryshao requested a review from Copilot April 30, 2026 10:01

Copilot started reviewing on behalf of jerryshao April 30, 2026 10:02

Copilot AI reviewed Apr 30, 2026

refactor: address review comments from Copilot round 2

feed726

apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>

jerryshao assigned tsungchih May 6, 2026

jerryshao approved these changes May 6, 2026

jerryshao merged commit 81a2edd into apache:main May 6, 2026
26 checks passed
