[8619] feat(client-python): add statistic value serdes#10877
[8619] feat(client-python): add statistic value serdes#10877jerryshao merged 8 commits intoapache:mainfrom
Conversation
apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
|
@jerryshao Could you please help with reviewing this PR when you are available? I would greatly appreciate any feedback from you. Thanks. |
Code Coverage Report
|
There was a problem hiding this comment.
Pull request overview
This PR adds Python client support for statistic value modeling and JSON (de)serialization, intended as a prerequisite for implementing statistics-related operations (e.g., MetadataObjectStatisticsOperations).
Changes:
- Introduces
StatisticValues(typed wrappers for boolean/long/double/string/list/object statistic values). - Adds
StatisticValueSerdesto serialize/deserializeStatisticValueinstances to/from JSON-compatible Python primitives. - Adds unit tests covering
StatisticValuesbehavior and JSON serdes roundtrips.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| clients/client-python/gravitino/api/stats/statistic_values.py | Adds concrete StatisticValue implementations and factory helpers. |
| clients/client-python/gravitino/api/stats/json_serdes/statistic_value_serdes.py | Implements JSON serialization/deserialization for StatisticValue. |
| clients/client-python/gravitino/api/stats/json_serdes/init.py | Introduces the stats JSON serdes package. |
| clients/client-python/gravitino/api/rel/types/json_serdes/base.py | Extends JsonSerializable type bounds to include StatisticValue. |
| clients/client-python/tests/unittests/api/stats/test_statistic_values.py | Adds unit tests for StatisticValues. |
| clients/client-python/tests/unittests/api/stats/test_statistic_value_serdes.py | Adds unit tests for StatisticValueSerdes. |
| def __hash__(self) -> int: | ||
| return hash(tuple(v.value() for v in self._value_list)) | ||
|
|
There was a problem hiding this comment.
ListValue.__hash__ hashes v.value() for each element. This will raise TypeError for nested statistic values whose .value() is unhashable (e.g., a nested ListValue returns list, ObjectValue returns dict). Hashing should be based on the StatisticValue elements themselves (or on a stable serialized representation), not their raw .value().
| def __hash__(self) -> int: | ||
| return hash(tuple(v.value() for v in self._value_map.values())) | ||
|
|
There was a problem hiding this comment.
ObjectValue.__hash__ hashes only value_map.values() in insertion order. This can violate the hash contract because two equal dicts (same key/value pairs but different insertion order) compare equal yet can produce different hashes; it also fails for nested unhashable .value() results. Use a key-inclusive, order-independent representation (e.g., a frozenset/tuple of sorted (key, value) pairs, hashing the StatisticValue objects).
|
|
||
| self.assertEqual(value.value(), value_dict) | ||
| self.assertEqual( | ||
| value.data_type().name(), | ||
| Types.StructType.of( | ||
| Types.StructType.Field.nullable_field(key, value.data_type()) | ||
| for key, value in value_dict.items() | ||
| ).name(), | ||
| ) | ||
| self.assertEqual( | ||
| hash(value), hash(tuple(v.value() for v in value_dict.values())) | ||
| ) | ||
| self.assertEqual(value, twin_value) | ||
| self.assertNotEqual(value, another_value) | ||
| self.assertEqual( | ||
| value.data_type().name(), | ||
| Types.StructType.of( | ||
| Types.StructType.Field.nullable_field(key, value.data_type()) | ||
| for key, value in value_dict.items() | ||
| ).name(), | ||
| ) | ||
| self.assertEqual( |
There was a problem hiding this comment.
This assertion doesn’t actually validate the struct’s field types: StructType.name() is always Name.STRUCT, and the expected StructType.of(...) call is also constructed incorrectly (it passes a generator as a single field). Consider asserting on value.data_type().fields() or simple_string() and build the expected struct with Types.StructType.of(*field_list) using each entry’s StatisticValue.data_type().
| self.assertEqual(value.value(), value_dict) | |
| self.assertEqual( | |
| value.data_type().name(), | |
| Types.StructType.of( | |
| Types.StructType.Field.nullable_field(key, value.data_type()) | |
| for key, value in value_dict.items() | |
| ).name(), | |
| ) | |
| self.assertEqual( | |
| hash(value), hash(tuple(v.value() for v in value_dict.values())) | |
| ) | |
| self.assertEqual(value, twin_value) | |
| self.assertNotEqual(value, another_value) | |
| self.assertEqual( | |
| value.data_type().name(), | |
| Types.StructType.of( | |
| Types.StructType.Field.nullable_field(key, value.data_type()) | |
| for key, value in value_dict.items() | |
| ).name(), | |
| ) | |
| self.assertEqual( | |
| expected_data_type = Types.StructType.of( | |
| *[ | |
| Types.StructType.Field.nullable_field( | |
| key, statistic_value.data_type() | |
| ) | |
| for key, statistic_value in value_dict.items() | |
| ] | |
| ) | |
| self.assertEqual(value.value(), value_dict) | |
| self.assertEqual(value.data_type().fields(), expected_data_type.fields()) | |
| self.assertEqual( | |
| hash(value), hash(tuple(v.value() for v in value_dict.values())) | |
| ) | |
| self.assertEqual(value, twin_value) | |
| self.assertNotEqual(value, another_value) | |
| self.assertEqual(value.data_type().fields(), expected_data_type.fields()) | |
| self.assertEqual( |
| self.assertEqual( | ||
| value.data_type().name(), | ||
| Types.ListType.nullable(value_list[0].data_type()).name(), | ||
| ) | ||
| self.assertEqual(hash(value), hash(tuple(v.value() for v in value_list))) | ||
| self.assertEqual(value, twin_value) | ||
| self.assertNotEqual(value, another_value) |
There was a problem hiding this comment.
There are duplicated assertions in this test (the data_type().name()/hash()/==/!= checks are repeated). Removing the duplicates will make the test easier to maintain without reducing coverage.
| self.assertEqual( | ||
| value.data_type().name(), | ||
| Types.StructType.of( | ||
| Types.StructType.Field.nullable_field(key, value.data_type()) | ||
| for key, value in value_dict.items() | ||
| ).name(), | ||
| ) | ||
| self.assertEqual( | ||
| hash(value), hash(tuple(v.value() for v in value_dict.values())) | ||
| ) | ||
| self.assertEqual(value, twin_value) | ||
| self.assertNotEqual(value, another_value) |
There was a problem hiding this comment.
There are duplicated assertions in this test (the data_type().name()/hash()/==/!= checks are repeated). Removing the duplicates will make the test easier to maintain without reducing coverage.
| self.assertEqual( | |
| value.data_type().name(), | |
| Types.StructType.of( | |
| Types.StructType.Field.nullable_field(key, value.data_type()) | |
| for key, value in value_dict.items() | |
| ).name(), | |
| ) | |
| self.assertEqual( | |
| hash(value), hash(tuple(v.value() for v in value_dict.values())) | |
| ) | |
| self.assertEqual(value, twin_value) | |
| self.assertNotEqual(value, another_value) |
| class StatisticValueSerdes(JsonSerializable[StatisticValue[Any]]): | ||
| """Customized JSON Serializer and Deserializer for StatisticValue.""" | ||
|
|
There was a problem hiding this comment.
The PR description mentions adding Java classes (e.g., JsonUtils.java / StatisticValues.java), but the actual changes in this PR are in the Python client. Please update the PR description to reflect the Python implementation so reviewers/users aren’t confused.
| def object_value(value_list: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]": | ||
| """Creates a statistic value that holds a list of other statistic values. | ||
|
|
||
| Args: | ||
| value_list: the list of statistic values to be held by this statistic value | ||
|
|
||
| Returns: | ||
| A ListValue instance containing the provided list of statistic values | ||
| """ | ||
| return StatisticValues.ObjectValue(value_list) |
There was a problem hiding this comment.
object_value takes a dict/map, but its docstring/argument naming still refers to a “list”, and the Returns section says it returns ListValue. This is misleading for API consumers; please update the docstring and rename the parameter to something like value_map (and fix the Returns description to ObjectValue).
| def object_value(value_list: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]": | |
| """Creates a statistic value that holds a list of other statistic values. | |
| Args: | |
| value_list: the list of statistic values to be held by this statistic value | |
| Returns: | |
| A ListValue instance containing the provided list of statistic values | |
| """ | |
| return StatisticValues.ObjectValue(value_list) | |
| def object_value(value_map: dict[str, StatisticValue[Any]]) -> "ObjectValue[Any]": | |
| """Creates a statistic value that holds a map of other statistic values. | |
| Args: | |
| value_map: the map of statistic values to be held by this statistic value | |
| Returns: | |
| An ObjectValue instance containing the provided map of statistic values | |
| """ | |
| return StatisticValues.ObjectValue(value_map) |
| Types.StructType.Field.nullable_field(key, value.data_type()) | ||
| for key, value in self._value_map.items() |
There was a problem hiding this comment.
Types.StructType.of is a varargs factory (of(*fields)), but this code passes a generator expression as a single argument. That builds a StructType whose only “field” is a generator object, which will break when callers use fields() / simple_string() / serialization. Build a list of Field and splat it (or use of(*field_list)).
| Types.StructType.Field.nullable_field(key, value.data_type()) | |
| for key, value in self._value_map.items() | |
| *[ | |
| Types.StructType.Field.nullable_field(key, value.data_type()) | |
| for key, value in self._value_map.items() | |
| ] |
|
Generally LGTM, can you fix the Copilot's comment? |
apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
apache#8619 Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
|
@jerryshao I've addressed review comments from Copilot. Could you please take a review on the changes when you're available? Thanks! |
| self.assertEqual( | ||
| hash(value), hash(tuple(v.value() for v in value_dict.values())) | ||
| ) |
There was a problem hiding this comment.
ObjectValue.__hash__ is computed via StatisticValues._make_hash, which includes keys (sorted) and hashes of nested values. This test’s expected hash (hash(tuple(v.value() for v in value_dict.values()))) ignores keys and insertion-order differences, so it won’t match the actual implementation. Update the assertion to either compare hash(value) with hash(twin_value) only, or compute the expected hash using the same key-inclusive, order-independent representation.
| self.assertEqual( | |
| hash(value), hash(tuple(v.value() for v in value_dict.values())) | |
| ) | |
| self.assertEqual(hash(value), hash(twin_value)) |
| from gravitino.api.rel.types.types import Type | ||
| from gravitino.api.stats.statistic_value import StatisticValue | ||
| from gravitino.dto.rel.partitioning.partitioning import Partitioning | ||
| from gravitino.dto.rel.partitions.partition_dto import PartitionDTO | ||
|
|
||
| _GravitinoTypeT = TypeVar( | ||
| "_GravitinoTypeT", | ||
| bound=Union[ | ||
| Expression, Type, Partitioning, PartitionDTO, Distribution, Index, SortOrder | ||
| Expression, | ||
| Type, | ||
| Partitioning, | ||
| PartitionDTO, | ||
| Distribution, | ||
| Index, | ||
| SortOrder, | ||
| StatisticValue, | ||
| ], |
There was a problem hiding this comment.
Adding StatisticValue to the JsonSerializable type bound introduces an unnecessary dependency from api/rel/types/json_serdes into api/stats (and forces importing stats whenever JsonSerializable is imported). Since JsonSerializable is a generic serializer interface, consider removing the restrictive bound=Union[...] (or loosening it to a generic unconstrained TypeVar) so new serdes types don’t require editing this base module and to keep package layering clean.
There was a problem hiding this comment.
The suggested change is already there in the branch. I have no idea where to address this comment.
| Types.StructType.Field.nullable_field(key, value.data_type()) | ||
| for key, value in self._value_map.items() |
There was a problem hiding this comment.
ObjectValue.data_type() builds a StructType using self._value_map.items() iteration order. Since Types.StructType.__eq__ is order-sensitive (fields list equality), two ObjectValues with identical key/value pairs but different insertion order will report different data_type(), and ListValue type-homogeneity checks can fail for lists of such objects. Consider constructing struct fields in a deterministic order (e.g., sort by key) so the schema is stable and independent of dict insertion order.
| Types.StructType.Field.nullable_field(key, value.data_type()) | |
| for key, value in self._value_map.items() | |
| Types.StructType.Field.nullable_field( | |
| key, self._value_map[key].data_type() | |
| ) | |
| for key in sorted(self._value_map) |
| def test_deserialize_unsupported_type(self): | ||
| with self.assertRaises(ValueError): | ||
| StatisticValueSerdes.deserialize(None) | ||
|
|
There was a problem hiding this comment.
StatisticValueSerdes.deserialize(None) triggers Precondition.check_argument(data is not None, ...) which raises IllegalArgumentException, not ValueError. This test will fail as written; update it to expect IllegalArgumentException (and optionally assert on the message).
apache#8619 Co-authored-by: Copilot <copilot@github.com> Signed-off-by: George T. C. Lai <tsungchih.hd@gmail.com>
|
@jerryshao I've addressed the second round of review comments from Copilot. Could you please take a review on the changes when you're available? Thanks! |
What changes were proposed in this pull request?
This PR aims to add implementation of statistic value serdes in Python client that refers to the following Java classes.
JsonUtils.javaStatisticValues.javaWhy are the changes needed?
We need to have these classes before being able to implement
MetadataObjectStatisticsOperations.Fix: #8619
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests