Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Support iceberg base catalog in python library (#3245) #4706

Merged
merged 32 commits into from
May 27, 2022

Conversation

dhruv-pratap
Copy link
Contributor

@dhruv-pratap dhruv-pratap commented May 5, 2022

READY FOR REVIEW. The current PR has changes for the base Catalog with table and namespace APIs. It also has a concrete In-Memory Catalog implementation, that is Dictionary based, for testing purposes. It also helps us validate our Catalog interface and see if it holds firm in action.

The PR also includes placeholders for Table and PartitionSpec, and minutely touches issue #3227 i.e., support table in iceberg python library, and #3228 i.e. support partition spec in iceberg python library, as the catalog management is dependent on those classes.

@github-actions github-actions bot added the python label May 5, 2022
python/src/iceberg/catalog/base.py Outdated Show resolved Hide resolved
python/src/iceberg/catalog/base.py Outdated Show resolved Hide resolved
python/src/iceberg/catalog/base.py Outdated Show resolved Hide resolved
python/src/iceberg/catalog/base.py Outdated Show resolved Hide resolved
python/src/iceberg/catalog/base.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start!

python/src/iceberg/catalog/base.py Outdated Show resolved Hide resolved
python/src/iceberg/catalog/base.py Outdated Show resolved Hide resolved
@dhruv-pratap dhruv-pratap marked this pull request as ready for review May 10, 2022 15:12
python/setup.cfg Outdated Show resolved Hide resolved
…ass to remove 'attrs' dependency. Move in-memory catalog completely in test sources for testing base catalog only. (apache#3245)
@danielcweeks danielcweeks self-requested a review May 12, 2022 16:26
@dhruv-pratap dhruv-pratap changed the base branch from master to 0.9.x May 18, 2022 23:53
@dhruv-pratap dhruv-pratap changed the base branch from 0.9.x to master May 18, 2022 23:54
@dhruv-pratap
Copy link
Contributor Author

Sorry for the mess above ^. Had to do a rebase from master to pull in recent PRs to add namespace to spellcheck-dictionary.txt for the CI checks to pass.

@dhruv-pratap dhruv-pratap requested a review from rdblue May 19, 2022 14:18

# Then
assert table
assert table is not given_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just asserting that the table isn't the same instance? Is that useful?

Copy link
Contributor Author

@dhruv-pratap dhruv-pratap May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue In the earlier commits I was asserting if the table name has changed, but since I had to rollback all changes to the Table interface in this PR I resorted to checking just that the table is changing in some form post renaming. I will eventually be changing this to assert table name change once we add attributes back to Table in #3227

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind. I worked around it for the time being. Since the Table class isn't frozen yet I added an attribute identifier post creation in the InMemoryCatalog and asserting on identifier renaming for now. This obviously would fail, and rightly so, once we make Table immutable/frozen.

@rdblue
Copy link
Contributor

rdblue commented May 19, 2022

I'm ready to commit this once the error message format is fixed.

@dhruv-pratap dhruv-pratap requested a review from rdblue May 19, 2022 22:15
@dhruv-pratap
Copy link
Contributor Author

@rdblue Are we good to merge?

class InMemoryCatalog(Catalog):
"""An in-memory catalog implementation for testing purposes."""

__tables: Dict[Identifier, Table]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit / question: I know that the usage of double underscore somewhat mangles function names to make them "private". Does that apply to class fields as well?

I've always preferred single underscore for this reason, but I'm not a huge pythonista and this is a test class. So more just for my own knowledge as well as to better understand the current "best practices".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this too. In some ways I think the double underscore here signals "These properties don't normally exist in a non-testing Catalog implementation" but I think it'd be clearer to just directly describe that in the docstring and use property names without any underscores. You're right in that the double underscore would mangle the names, but only for a child class, something like Foo(InMemoryCatalog), which isn't done anywhere here. Python would then make the property name _InMemoryCatalog__tables for instances of Foo.

I came to the same conclusion as you, this is a test class so it doesn't matter as much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a single underscore is more appropriate here. It is an internal variable of the InMemoryCatalog.

If we want to set this as a ClassVar, then we also have to annotation for it:

Suggested change
__tables: Dict[Identifier, Table]
_tables: ClassVar[Dict[Identifier, Table]]

But I'm not sure if we want to do that since we would like to instantiate a fresh InMemoryCatalog for the different tests.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few last remarks, looks great in general!

properties(Properties): Catalog properties
"""

def __init__(self, name: str, properties: Properties):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll continue my quest for data classes here as well. Instead of creating the getters/setters, we could also let the dataclasses do this:

@dataclass
class Catalog(ABC):

    name: str =  field()
    name: Properties =  field(default_factory=dict)

This will automatically generate the getters and setters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to go full blown pythonic in the future, we could also validate these using Pydantic validators: https://pydantic-docs.helpmanual.io/usage/validators/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using @dataclass is a good idea, but we can move on here and refactor it later.

class InMemoryCatalog(Catalog):
"""An in-memory catalog implementation for testing purposes."""

__tables: Dict[Identifier, Table]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a single underscore is more appropriate here. It is an internal variable of the InMemoryCatalog.

If we want to set this as a ClassVar, then we also have to annotation for it:

Suggested change
__tables: Dict[Identifier, Table]
_tables: ClassVar[Dict[Identifier, Table]]

But I'm not sure if we want to do that since we would like to instantiate a fresh InMemoryCatalog for the different tests.

if namespace in self.__namespaces:
raise AlreadyExistsError(f"Namespace already exists: {namespace}")
else:
self.__namespaces[namespace] = properties if properties else {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less is more:

Suggested change
self.__namespaces[namespace] = properties if properties else {}
self.__namespaces[namespace] = properties or {}

"""

@staticmethod
def identifier_to_tuple(identifier: Union[str, Identifier]) -> Identifier:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming things is one of the hardest things in Computer Science, but I find this one a bit awkward since:

Identifier = Tuple[str, ...]

So it is actually already a tuple. How about:

Suggested change
def identifier_to_tuple(identifier: Union[str, Identifier]) -> Identifier:
def to_identifier(identifier: Union[str, Identifier]) -> Identifier:

TEST_TABLE_LOCATION = "protocol://some/location"
TEST_TABLE_PARTITION_SPEC = PartitionSpec()
TEST_TABLE_PROPERTIES = {"key1": "value1", "key2": "value2"}
NO_SUCH_TABLE_ERROR = "Table does not exist: \\('com', 'organization', 'department', 'my_table'\\)"
Copy link
Contributor

@rdblue rdblue May 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should use string names, not tuples. Using a tuple leaks the internal details to people using the library or something built on this library, and this is needlessly confusing for them.


@pytest.fixture
def catalog() -> InMemoryCatalog:
return InMemoryCatalog("test.in.memory.catalog", {"test.key": "test.value"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test.key isn't used anywhere, nor is test.in.memory.catalog. Can you add tests for properties and name?

@rdblue
Copy link
Contributor

rdblue commented May 27, 2022

There are still a few things to fix, but I think we can clean up things like moving to @dataclass separately to unblock other work.

@rdblue rdblue merged commit f792052 into apache:master May 27, 2022
@dhruv-pratap dhruv-pratap deleted the python_issue_3245_base_catalog branch May 27, 2022 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants