Skip to content

Add UnboundReference class#4679

Merged
rdblue merged 4 commits intoapache:masterfrom
samredai:UnboundReference
May 4, 2022
Merged

Add UnboundReference class#4679
rdblue merged 4 commits intoapache:masterfrom
samredai:UnboundReference

Conversation

@samredai
Copy link
Contributor

@samredai samredai commented May 2, 2022

This adds an UnboundReference class that has a bind method which returns a BoundReference class. Merging this relies on PR #4678 which adds the Schema.accessor_for_field method. For complete functionality, a subsequent PR needs to be opened that implements the struct method in the BuildPositionAccessors visitor (added in the other PR) that builds a map of field ID to schema position accessor (the same PR should probably include tests for UnboundReference and BuildPositionAccessors.

To summarize:

cc: @rdblue @emkornfield @dramaticlly @CircArgs @jun-he @dhruv-pratap

@samredai samredai marked this pull request as ready for review May 3, 2022 15:59
@samredai
Copy link
Contributor Author

samredai commented May 3, 2022

This has been rebased and is ready for review

@kbendick
Copy link
Contributor

kbendick commented May 3, 2022

BTW, might want to add a note in the PR description that this class corresponds to the Java API's NamedReference.

A lot of times after looking through the git blame, I’ll go back and look at PRs for context and given we're changing the name that might give others a jumping off point to go investigate further.



class UnboundReference:
"""A reference not yet bounded to a field in a schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want "bound" instead of "bounded".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Thanks

@samredai
Copy link
Contributor Author

samredai commented May 3, 2022

@kbendick good point, I've added this to the docstring:

    Note:
        An unbound reference is sometimes referred to as a "named" reference

"""A reference not yet bound to a field in a schema

Args:
name (str): The name of the field
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might pay to provide more details on name here for nested fields, or at least point to other documentation on how they are expected to be referred to.

This isn't directly related to this PR, but i could find in the specification what is considered a valid field name other than the fact that they can contain periods "." at least for imports. This is quite possibly user error though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There aren't spec requirements for field names. In the implementations, we index using . to join the names so that you can easily project names like a.b.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emkornfield agreed this could use more details. Once the accessor functionality is added I'm planning to circle back here and add an example in the docstring along the lines of:

    Example:
        >>> from iceberg.expressions.base import UnboundReference
        >>> from iceberg.schema import Schema
        >>> from iceberg.types import FloatType, MapType, NestedField, StringType
        >>> schema = Schema(
        ...     NestedField(
        ...         field_id=1,
        ...         name="location",
        ...         field_type=MapType(
        ...             key_id=2,
        ...             key_type=StringType(),
        ...             value_id=3,
        ...             value_type=FloatType(),
        ...             value_is_optional=False,
        ...         ),
        ...         is_optional=False,
        ...     ),
        ...     schema_id=1,
        ... )
        >>> name = "location.lat"
        >>> unbound_reference = UnboundReference(name="location.lat")
        >>> unbound_reference.bind(schema=schema, case_sensitive=False)
        BoundReference(field=1, accessor=Accessor(position=1, inner=Accessor(position=2)))

I'll also expand name description to be more descriptive. Thanks!

Returns:
BoundReference: A reference bound to the specific field in the Iceberg schema
"""
field = schema.find_field(name_or_id=self.name, case_sensitive=case_sensitive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are nested field names guaranteed to be unambigous?

Can there be a field named "a.b" and a field named "b" nested under a struct "a"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are guaranteed to be unambiguous. Iceberg will fail if there are names that map to the same flattened version.

@rdblue rdblue merged commit 42e47f5 into apache:master May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants