[8619] feat(client-python): add statistic value serdes #10877

Merged
jerryshao merged 8 commits into apache:main from tsungchih:feat-client-python-statistic-value-serdes
May 6, 2026

Conversation

@tsungchih
Collaborator

@tsungchih tsungchih commented Apr 27, 2026

What changes were proposed in this pull request?

This PR adds a Python client implementation of the statistic value serdes, modeled on the following Java classes.

JsonUtils.java

  • StatisticValueSerializer
  • StatisticValueDeserializer

StatisticValues.java

  • StatisticValues

Why are the changes needed?

These classes are a prerequisite for implementing MetadataObjectStatisticsOperations.

Fix: #8619

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

tsungchih and others added 4 commits April 27, 2026 22:08
apache#8619

Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619

Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619

Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619

Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
@tsungchih tsungchih marked this pull request as ready for review April 27, 2026 15:11
@tsungchih
Collaborator Author

@jerryshao Could you please help with reviewing this PR when you are available? I would greatly appreciate any feedback from you. Thanks.

@github-actions

github-actions Bot commented Apr 27, 2026

Code Coverage Report

Overall Project 65.53% 🟢
Files changed No Java source files changed -

Module Coverage
aliyun 1.73% 🔴
api 47.13% 🟢
authorization-common 85.96% 🟢
aws 1.1% 🔴
azure 2.6% 🔴
catalog-common 10.2% 🔴
catalog-fileset 80.02% 🟢
catalog-glue 82.47% 🟢
catalog-hive 81.83% 🟢
catalog-jdbc-clickhouse 79.06% 🟢
catalog-jdbc-common 43.93% 🟢
catalog-jdbc-doris 80.28% 🟢
catalog-jdbc-hologres 54.03% 🟢
catalog-jdbc-mysql 79.23% 🟢
catalog-jdbc-oceanbase 78.38% 🟢
catalog-jdbc-postgresql 82.05% 🟢
catalog-jdbc-starrocks 78.27% 🟢
catalog-kafka 77.01% 🟢
catalog-lakehouse-generic 45.07% 🟢
catalog-lakehouse-hudi 79.1% 🟢
catalog-lakehouse-iceberg 86.98% 🟢
catalog-lakehouse-paimon 77.71% 🟢
catalog-model 77.72% 🟢
cli 44.51% 🟢
client-java 77.63% 🟢
common 48.75% 🟢
core 81.5% 🟢
filesystem-hadoop3 76.97% 🟢
flink 40.55% 🟢
flink-runtime 0.0% 🔴
gcp 14.2% 🔴
hadoop-common 10.39% 🔴
hive-metastore-common 46.83% 🟢
iceberg-common 55.24% 🟢
iceberg-rest-server 68.05% 🟢
integration-test-common 0.0% 🔴
jobs 66.17% 🟢
lance-common 23.88% 🔴
lance-rest-server 57.84% 🟢
lineage 53.02% 🟢
optimizer 82.95% 🟢
optimizer-api 21.95% 🔴
server 85.6% 🟢
server-common 70.01% 🟢
spark 32.79% 🔴
spark-common 39.09% 🔴
trino-connector 34.28% 🔴

Contributor

Copilot AI left a comment

Pull request overview

This PR adds Python client support for statistic value modeling and JSON (de)serialization, intended as a prerequisite for implementing statistics-related operations (e.g., MetadataObjectStatisticsOperations).

Changes:

  • Introduces StatisticValues (typed wrappers for boolean/long/double/string/list/object statistic values).
  • Adds StatisticValueSerdes to serialize/deserialize StatisticValue instances to/from JSON-compatible Python primitives.
  • Adds unit tests covering StatisticValues behavior and JSON serdes roundtrips.
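
As a rough illustration of what a roundtrip through "JSON-compatible Python primitives" involves, the sketch below maps decoded JSON onto typed wrappers and back. The `Wrapped` class and the `serialize`/`deserialize` functions are hypothetical stand-ins, not the PR's `StatisticValues` or `StatisticValueSerdes` API.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Wrapped:
    """Hypothetical stand-in for a typed statistic value wrapper."""

    kind: str  # one of: boolean, long, double, string, list, object
    payload: Any


def deserialize(data: Any) -> Wrapped:
    """Map a decoded JSON value onto a typed wrapper."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(data, bool):
        return Wrapped("boolean", data)
    if isinstance(data, int):
        return Wrapped("long", data)
    if isinstance(data, float):
        return Wrapped("double", data)
    if isinstance(data, str):
        return Wrapped("string", data)
    if isinstance(data, list):
        return Wrapped("list", tuple(deserialize(item) for item in data))
    if isinstance(data, dict):
        return Wrapped("object", tuple((k, deserialize(v)) for k, v in data.items()))
    raise ValueError(f"unsupported type: {type(data).__name__}")


def serialize(value: Wrapped) -> Any:
    """Turn a typed wrapper back into JSON-compatible primitives."""
    if value.kind == "list":
        return [serialize(item) for item in value.payload]
    if value.kind == "object":
        return {k: serialize(v) for k, v in value.payload}
    return value.payload


raw = json.loads('{"rows": 42, "ratio": 0.5, "tags": ["a", "b"], "ok": true}')
roundtrip = serialize(deserialize(raw))
```

The one ordering detail worth noting is the `bool` check: testing `int` first would classify `True` as a long.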

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| clients/client-python/gravitino/api/stats/statistic_values.py | Adds concrete StatisticValue implementations and factory helpers. |
| clients/client-python/gravitino/api/stats/json_serdes/statistic_value_serdes.py | Implements JSON serialization/deserialization for StatisticValue. |
| clients/client-python/gravitino/api/stats/json_serdes/__init__.py | Introduces the stats JSON serdes package. |
| clients/client-python/gravitino/api/rel/types/json_serdes/base.py | Extends JsonSerializable type bounds to include StatisticValue. |
| clients/client-python/tests/unittests/api/stats/test_statistic_values.py | Adds unit tests for StatisticValues. |
| clients/client-python/tests/unittests/api/stats/test_statistic_value_serdes.py | Adds unit tests for StatisticValueSerdes. |

Comment on lines +205 to +207
def __hash__(self) -> int:
    return hash(tuple(v.value() for v in self._value_list))

Copilot AI Apr 29, 2026

ListValue.__hash__ hashes v.value() for each element. This will raise TypeError for nested statistic values whose .value() is unhashable (e.g., a nested ListValue returns list, ObjectValue returns dict). Hashing should be based on the StatisticValue elements themselves (or on a stable serialized representation), not their raw .value().
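
The failure mode described above can be reproduced with a minimal sketch. `Leaf` and `ListOf` are hypothetical stand-ins for scalar and list statistic values, not the PR's classes; `raw_hash` mirrors the flagged pattern, and `__hash__` shows one possible alternative that hashes the elements themselves.

```python
from typing import Any, List


class Leaf:
    """Hypothetical stand-in for a scalar statistic value."""

    def __init__(self, raw: Any) -> None:
        self._raw = raw

    def value(self) -> Any:
        return self._raw

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Leaf) and self._raw == other._raw

    def __hash__(self) -> int:
        return hash(self._raw)


class ListOf:
    """Hypothetical stand-in for a list-valued statistic value."""

    def __init__(self, items: List[Any]) -> None:
        self._items = list(items)

    def value(self) -> List[Any]:
        return self._items

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ListOf) and self._items == other._items

    def raw_hash(self) -> int:
        # Pattern flagged in the review: hashes each element's raw value.
        return hash(tuple(item.value() for item in self._items))

    def __hash__(self) -> int:
        # Alternative: hash the elements themselves, so nesting recurses.
        return hash(tuple(self._items))


nested = ListOf([ListOf([Leaf(1), Leaf(2)])])

try:
    nested.raw_hash()
    raw_hash_failed = False
except TypeError:
    # The inner value() is a plain list, which is unhashable.
    raw_hash_failed = True

element_hash = hash(nested)
```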

Collaborator Author

addressed

Comment on lines +232 to +234
def __hash__(self) -> int:
    return hash(tuple(v.value() for v in self._value_map.values()))

Copilot AI Apr 29, 2026

ObjectValue.__hash__ hashes only value_map.values() in insertion order. This can violate the hash contract because two equal dicts (same key/value pairs but different insertion order) compare equal yet can produce different hashes; it also fails for nested unhashable .value() results. Use a key-inclusive, order-independent representation (e.g., a frozenset/tuple of sorted (key, value) pairs, hashing the StatisticValue objects).
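
A small sketch of the hash-contract point, using plain `int` values in place of StatisticValue objects. Both helper functions are hypothetical and are not the PR's code.

```python
from typing import Dict


def values_only_hash(value_map: Dict[str, int]) -> int:
    # Pattern flagged in the review: ignores keys, depends on insertion order.
    return hash(tuple(value_map.values()))


def key_inclusive_hash(value_map: Dict[str, int]) -> int:
    # frozenset of (key, value) pairs: includes keys, ignores insertion order.
    return hash(frozenset(value_map.items()))


first = {"min": 1, "max": 9}
second = {"max": 9, "min": 1}  # same pairs, different insertion order
```

Here `first == second` holds, so the two dicts must hash equally for a hash based on them to be valid. `key_inclusive_hash` satisfies that, while `values_only_hash` hashes `(1, 9)` for one and `(9, 1)` for the other.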

Collaborator Author

addressed

Comment on lines +119 to +140

self.assertEqual(value.value(), value_dict)
self.assertEqual(
    value.data_type().name(),
    Types.StructType.of(
        Types.StructType.Field.nullable_field(key, value.data_type())
        for key, value in value_dict.items()
    ).name(),
)
self.assertEqual(
    hash(value), hash(tuple(v.value() for v in value_dict.values()))
)
self.assertEqual(value, twin_value)
self.assertNotEqual(value, another_value)
self.assertEqual(
    value.data_type().name(),
    Types.StructType.of(
        Types.StructType.Field.nullable_field(key, value.data_type())
        for key, value in value_dict.items()
    ).name(),
)
self.assertEqual(
Copilot AI Apr 29, 2026

This assertion doesn’t actually validate the struct’s field types: StructType.name() is always Name.STRUCT, and the expected StructType.of(...) call is also constructed incorrectly (it passes a generator as a single field). Consider asserting on value.data_type().fields() or simple_string() and build the expected struct with Types.StructType.of(*field_list) using each entry’s StatisticValue.data_type().

Suggested change
- self.assertEqual(value.value(), value_dict)
- self.assertEqual(
-     value.data_type().name(),
-     Types.StructType.of(
-         Types.StructType.Field.nullable_field(key, value.data_type())
-         for key, value in value_dict.items()
-     ).name(),
- )
- self.assertEqual(
-     hash(value), hash(tuple(v.value() for v in value_dict.values()))
- )
- self.assertEqual(value, twin_value)
- self.assertNotEqual(value, another_value)
- self.assertEqual(
-     value.data_type().name(),
-     Types.StructType.of(
-         Types.StructType.Field.nullable_field(key, value.data_type())
-         for key, value in value_dict.items()
-     ).name(),
- )
- self.assertEqual(
+ expected_data_type = Types.StructType.of(
+     *[
+         Types.StructType.Field.nullable_field(
+             key, statistic_value.data_type()
+         )
+         for key, statistic_value in value_dict.items()
+     ]
+ )
+ self.assertEqual(value.value(), value_dict)
+ self.assertEqual(value.data_type().fields(), expected_data_type.fields())
+ self.assertEqual(
+     hash(value), hash(tuple(v.value() for v in value_dict.values()))
+ )
+ self.assertEqual(value, twin_value)
+ self.assertNotEqual(value, another_value)
+ self.assertEqual(value.data_type().fields(), expected_data_type.fields())
+ self.assertEqual(

Collaborator Author

addressed

Comment on lines +96 to +102
self.assertEqual(
    value.data_type().name(),
    Types.ListType.nullable(value_list[0].data_type()).name(),
)
self.assertEqual(hash(value), hash(tuple(v.value() for v in value_list)))
self.assertEqual(value, twin_value)
self.assertNotEqual(value, another_value)
Copilot AI Apr 29, 2026

There are duplicated assertions in this test (the data_type().name()/hash()/==/!= checks are repeated). Removing the duplicates will make the test easier to maintain without reducing coverage.

Collaborator Author

addressed

Comment on lines +133 to +144
self.assertEqual(
    value.data_type().name(),
    Types.StructType.of(
        Types.StructType.Field.nullable_field(key, value.data_type())
        for key, value in value_dict.items()
    ).name(),
)
self.assertEqual(
    hash(value), hash(tuple(v.value() for v in value_dict.values()))
)
self.assertEqual(value, twin_value)
self.assertNotEqual(value, another_value)
Copilot AI Apr 29, 2026

There are duplicated assertions in this test (the data_type().name()/hash()/==/!= checks are repeated). Removing the duplicates will make the test easier to maintain without reducing coverage.

Suggested change
- self.assertEqual(
-     value.data_type().name(),
-     Types.StructType.of(
-         Types.StructType.Field.nullable_field(key, value.data_type())
-         for key, value in value_dict.items()
-     ).name(),
- )
- self.assertEqual(
-     hash(value), hash(tuple(v.value() for v in value_dict.values()))
- )
- self.assertEqual(value, twin_value)
- self.assertNotEqual(value, another_value)

Collaborator Author

addressed

Comment on lines +29 to +31
class StatisticValueSerdes(JsonSerializable[StatisticValue[Any]]):
    """Customized JSON Serializer and Deserializer for StatisticValue."""

Copilot AI Apr 29, 2026

The PR description mentions adding Java classes (e.g., JsonUtils.java / StatisticValues.java), but the actual changes in this PR are in the Python client. Please update the PR description to reflect the Python implementation so reviewers/users aren’t confused.

Collaborator Author

addressed

Comment on lines +93 to +102
def object_value(value_list: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]":
    """Creates a statistic value that holds a list of other statistic values.

    Args:
        value_list: the list of statistic values to be held by this statistic value

    Returns:
        A ListValue instance containing the provided list of statistic values
    """
    return StatisticValues.ObjectValue(value_list)
Copilot AI Apr 29, 2026

object_value takes a dict/map, but its docstring/argument naming still refers to a “list”, and the Returns section says it returns ListValue. This is misleading for API consumers; please update the docstring and rename the parameter to something like value_map (and fix the Returns description to ObjectValue).

Suggested change
- def object_value(value_list: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]":
-     """Creates a statistic value that holds a list of other statistic values.
-
-     Args:
-         value_list: the list of statistic values to be held by this statistic value
-
-     Returns:
-         A ListValue instance containing the provided list of statistic values
-     """
-     return StatisticValues.ObjectValue(value_list)
+ def object_value(value_map: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]":
+     """Creates a statistic value that holds a map of other statistic values.
+
+     Args:
+         value_map: the map of statistic values to be held by this statistic value
+
+     Returns:
+         An ObjectValue instance containing the provided map of statistic values
+     """
+     return StatisticValues.ObjectValue(value_map)

Collaborator Author

addressed

Comment on lines +228 to +229
Types.StructType.Field.nullable_field(key, value.data_type())
for key, value in self._value_map.items()
Copilot AI Apr 29, 2026

Types.StructType.of is a varargs factory (of(*fields)), but this code passes a generator expression as a single argument. That builds a StructType whose only “field” is a generator object, which will break when callers use fields() / simple_string() / serialization. Build a list of Field and splat it (or use of(*field_list)).
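
The varargs pitfall is easy to see in isolation. In this sketch `struct_of` is a hypothetical factory that is only loosely shaped like `of(*fields)`, and fields are reduced to `(name, type name)` tuples. It is not the Gravitino `Types.StructType` API.

```python
import types
from typing import List, Tuple

Field = Tuple[str, str]  # hypothetical (name, type name) pair


def struct_of(*fields: Field) -> List[Field]:
    """Hypothetical varargs factory, loosely shaped like of(*fields)."""
    return list(fields)


columns = {"min": "long", "max": "long"}

# A bare generator expression is passed as ONE positional argument.
wrong = struct_of((name, type_name) for name, type_name in columns.items())

# Unpacking a list passes each field as its own argument.
right = struct_of(*[(name, type_name) for name, type_name in columns.items()])

wrong_is_generator = isinstance(wrong[0], types.GeneratorType)
```

With the bare generator, `wrong` holds a single element that is the generator object itself. `right` holds one entry per column.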

Suggested change
- Types.StructType.Field.nullable_field(key, value.data_type())
- for key, value in self._value_map.items()
+ *[
+     Types.StructType.Field.nullable_field(key, value.data_type())
+     for key, value in self._value_map.items()
+ ]

Collaborator Author

addressed

@jerryshao
Contributor

Generally LGTM. Could you address Copilot's comments?

tsungchih and others added 3 commits April 30, 2026 16:26
apache#8619

Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619

Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619

Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
@tsungchih
Collaborator Author

@jerryshao I've addressed the review comments from Copilot. Could you please take a look at the changes when you're available? Thanks!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment on lines +123 to +125
self.assertEqual(
    hash(value), hash(tuple(v.value() for v in value_dict.values()))
)
Copilot AI Apr 30, 2026

ObjectValue.__hash__ is computed via StatisticValues._make_hash, which includes keys (sorted) and hashes of nested values. This test’s expected hash (hash(tuple(v.value() for v in value_dict.values()))) ignores keys and insertion-order differences, so it won’t match the actual implementation. Update the assertion to either compare hash(value) with hash(twin_value) only, or compute the expected hash using the same key-inclusive, order-independent representation.

Suggested change
- self.assertEqual(
-     hash(value), hash(tuple(v.value() for v in value_dict.values()))
- )
+ self.assertEqual(hash(value), hash(twin_value))

Collaborator Author

addressed

Comment on lines 27 to 43
  from gravitino.api.rel.types.types import Type
+ from gravitino.api.stats.statistic_value import StatisticValue
  from gravitino.dto.rel.partitioning.partitioning import Partitioning
  from gravitino.dto.rel.partitions.partition_dto import PartitionDTO

  _GravitinoTypeT = TypeVar(
      "_GravitinoTypeT",
      bound=Union[
-         Expression, Type, Partitioning, PartitionDTO, Distribution, Index, SortOrder
+         Expression,
+         Type,
+         Partitioning,
+         PartitionDTO,
+         Distribution,
+         Index,
+         SortOrder,
+         StatisticValue,
      ],
Copilot AI Apr 30, 2026

Adding StatisticValue to the JsonSerializable type bound introduces an unnecessary dependency from api/rel/types/json_serdes into api/stats (and forces importing stats whenever JsonSerializable is imported). Since JsonSerializable is a generic serializer interface, consider removing the restrictive bound=Union[...] (or loosening it to a generic unconstrained TypeVar) so new serdes types don’t require editing this base module and to keep package layering clean.
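
For illustration only, this is roughly what a generic serdes interface with an unconstrained TypeVar could look like. `Serializable`, `RowCount` and `RowCountSerdes` are hypothetical names and do not reflect the actual `JsonSerializable` definition in the Python client.

```python
from abc import ABC, abstractmethod
from typing import Any, Generic, TypeVar

# Unconstrained: new serializable types need no edit to this base module.
T = TypeVar("T")


class Serializable(ABC, Generic[T]):
    """Hypothetical stand-in for a generic serdes interface."""

    @classmethod
    @abstractmethod
    def serialize(cls, data_object: T) -> Any: ...

    @classmethod
    @abstractmethod
    def deserialize(cls, data: Any) -> T: ...


class RowCount:
    """Hypothetical domain type living in some other package."""

    def __init__(self, count: int) -> None:
        self.count = count


class RowCountSerdes(Serializable[RowCount]):
    @classmethod
    def serialize(cls, data_object: RowCount) -> Any:
        return {"count": data_object.count}

    @classmethod
    def deserialize(cls, data: Any) -> RowCount:
        return RowCount(data["count"])


restored = RowCountSerdes.deserialize({"count": 7})
```

The trade-off is that an unconstrained TypeVar no longer documents, or lets a type checker enforce, which types are expected to have serdes.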

Collaborator Author

The suggested change is already there in the branch. I have no idea where to address this comment.

Comment on lines +254 to +255
Types.StructType.Field.nullable_field(key, value.data_type())
for key, value in self._value_map.items()
Copilot AI Apr 30, 2026

ObjectValue.data_type() builds a StructType using self._value_map.items() iteration order. Since Types.StructType.__eq__ is order-sensitive (fields list equality), two ObjectValues with identical key/value pairs but different insertion order will report different data_type(), and ListValue type-homogeneity checks can fail for lists of such objects. Consider constructing struct fields in a deterministic order (e.g., sort by key) so the schema is stable and independent of dict insertion order.
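
The ordering concern can be sketched without the Gravitino types. Below, a struct schema is reduced to a list of hypothetical `(name, type name)` pairs, which compares order-sensitively just as the comment says the struct's field list does.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # hypothetical (name, type name) pair


def fields_in_insertion_order(type_by_key: Dict[str, str]) -> List[Field]:
    # Schema depends on how the dict happened to be built.
    return [(key, type_name) for key, type_name in type_by_key.items()]


def fields_sorted_by_key(type_by_key: Dict[str, str]) -> List[Field]:
    # Schema is deterministic for any dict with the same key/value pairs.
    return [(key, type_by_key[key]) for key in sorted(type_by_key)]


first = {"min": "long", "max": "double"}
second = {"max": "double", "min": "long"}  # equal dict, different order
```

The two dicts compare equal, yet only `fields_sorted_by_key` yields the same field list for both.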

Suggested change
- Types.StructType.Field.nullable_field(key, value.data_type())
- for key, value in self._value_map.items()
+ Types.StructType.Field.nullable_field(
+     key, self._value_map[key].data_type()
+ )
+ for key in sorted(self._value_map)

Collaborator Author

addressed

Comment on lines +101 to +104
def test_deserialize_unsupported_type(self):
    with self.assertRaises(ValueError):
        StatisticValueSerdes.deserialize(None)

Copilot AI Apr 30, 2026

StatisticValueSerdes.deserialize(None) triggers Precondition.check_argument(data is not None, ...) which raises IllegalArgumentException, not ValueError. This test will fail as written; update it to expect IllegalArgumentException (and optionally assert on the message).
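
A minimal sketch of why the expected exception type matters in `assertRaises`. It assumes, as the comment implies, that the raised exception is not a `ValueError` subclass. `IllegalArgumentError` and `deserialize` here are hypothetical stand-ins, not the client's `IllegalArgumentException` or `StatisticValueSerdes.deserialize`.

```python
import unittest


class IllegalArgumentError(Exception):
    """Hypothetical stand-in; deliberately NOT a ValueError subclass."""


def deserialize(data):
    # Hypothetical precondition check, standing in for the real serdes.
    if data is None:
        raise IllegalArgumentError("statistic value must not be None")
    return data


class DeserializeTest(unittest.TestCase):
    def test_none_raises_precondition_error(self):
        with self.assertRaises(IllegalArgumentError) as ctx:
            deserialize(None)
        self.assertIn("must not be None", str(ctx.exception))

    def test_value_error_expectation_does_not_match(self):
        # assertRaises(ValueError) does not catch the unrelated exception,
        # so it propagates out of the inner with block.
        with self.assertRaises(IllegalArgumentError):
            with self.assertRaises(ValueError):
                deserialize(None)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DeserializeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```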

Collaborator Author

addressed

apache#8619

Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
@tsungchih
Collaborator Author

@jerryshao I've addressed the second round of review comments from Copilot. Could you please take a look at the changes when you're available? Thanks!

@jerryshao jerryshao merged commit 81a2edd into apache:main May 6, 2026
26 checks passed
Development

Successfully merging this pull request may close these issues.

[Subtask] Add Serdes of StatisticValue

3 participants