Add UnboundReference class#4679
Add UnboundReference class#4679rdblue merged 4 commits intoapache:masterfrom samredai:UnboundReference
Conversation
|
This has been rebased and is ready for review |
|
BTW, might want to add a note in the PR description that this class corresponds to the Java API's NamedReference. A lot of times after looking through the git blame, I’ll go back and look at PRs for context and given we're changing the name that might give others a jumping off point to go investigate further. |
|
|
||
|
|
||
| class UnboundReference: | ||
| """A reference not yet bounded to a field in a schema |
There was a problem hiding this comment.
I think you want "bound" instead of "bounded".
|
@kbendick good point, I've added this to the docstring: |
| """A reference not yet bound to a field in a schema | ||
|
|
||
| Args: | ||
| name (str): The name of the field |
There was a problem hiding this comment.
it might pay to provide more details on name here for nested fields, or at least point to other documentation on how they are expected to be referred to.
This isn't directly related to this PR, but i could find in the specification what is considered a valid field name other than the fact that they can contain periods "." at least for imports. This is quite possibly user error though.
There was a problem hiding this comment.
There aren't spec requirements for field names. In the implementations, we index using . to join the names so that you can easily project names like a.b.
There was a problem hiding this comment.
@emkornfield agreed this could use more details. Once the accessor functionality is added I'm planning to circle back here and add an example in the docstring along the lines of:
Example:
>>> from iceberg.expressions.base import UnboundReference
>>> from iceberg.schema import Schema
>>> from iceberg.types import FloatType, MapType, NestedField, StringType
>>> schema = Schema(
... NestedField(
... field_id=1,
... name="location",
... field_type=MapType(
... key_id=2,
... key_type=StringType(),
... value_id=3,
... value_type=FloatType(),
... value_is_optional=False,
... ),
... is_optional=False,
... ),
... schema_id=1,
... )
>>> name = "location.lat"
>>> unbound_reference = UnboundReference(name="location.lat")
>>> unbound_reference.bind(schema=schema, case_sensitive=False)
BoundReference(field=1, accessor=Accessor(position=1, inner=Accessor(position=2)))I'll also expand name description to be more descriptive. Thanks!
| Returns: | ||
| BoundReference: A reference bound to the specific field in the Iceberg schema | ||
| """ | ||
| field = schema.find_field(name_or_id=self.name, case_sensitive=case_sensitive) |
There was a problem hiding this comment.
are nested field names guaranteed to be unambigous?
Can there be a field named "a.b" and a field named "b" nested under a struct "a"?
There was a problem hiding this comment.
Yes, they are guaranteed to be unambiguous. Iceberg will fail if there are names that map to the same flattened version.
This adds an
UnboundReferenceclass that has abindmethod which returns aBoundReferenceclass. Merging this relies on PR #4678 which adds theSchema.accessor_for_fieldmethod. For complete functionality, a subsequent PR needs to be opened that implements thestructmethod in theBuildPositionAccessorsvisitor (added in the other PR) that builds a map of field ID to schema position accessor (the same PR should probably include tests forUnboundReferenceandBuildPositionAccessors.To summarize:
BuildPositionAccessorsandUnboundReferencecc: @rdblue @emkornfield @dramaticlly @CircArgs @jun-he @dhruv-pratap