
Declare enums explicitly, fix type hints #74

Merged (6 commits) on Jul 29, 2022

Conversation

@vnlitvinov (Contributor)

Signed-off-by: Vasily Litvinov vasilij.n.litvinov@intel.com

This is related to #73
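As a sketch of what the PR title's "declare enums explicitly" can look like: integer codes that previously lived only in docstrings become `IntEnum` members, which type hints can then reference. The member names and values below follow the interchange-protocol discussion but should be treated as illustrative, not as the exact declarations this PR lands.

```python
import enum


class DtypeKind(enum.IntEnum):
    """Integer codes for the logical type of a column (illustrative values)."""

    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# Explicit enums keep type hints honest: a signature such as
#   def dtype(self) -> Tuple[DtypeKind, int, str, str]
# is checkable, while a bare `int` code is not. IntEnum members still
# compare equal to plain ints, so existing callers keep working.
print(DtypeKind.FLOAT, int(DtypeKind.FLOAT))
```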

@rgommers (Member) left a comment

Overall LGTM, thanks @vnlitvinov. Just one question about ColumnNullType

(two resolved review threads on protocol/dataframe_protocol.py, now outdated)
@vnlitvinov marked this pull request as draft on February 24, 2022
@vnlitvinov (Contributor, PR author)

I'm converting this to draft until I finish migrating the prototype to pandas - I've already had to change a few things here...

@vnlitvinov (Contributor, PR author)

I hope this is enough to start chewing on; I'll probably polish the formatting so that at least PEP 8 checks are happy.

@vnlitvinov (Contributor, PR author)

Do note three changes to the spec:

  1. Column.describe_categorical now returns a tuple of three elements (aligning the spec with what all existing implementations do). This keeps things in sync, but makes future extensions harder, so we should discuss it.
  2. Column.describe_categorical now raises TypeError on non-categorical dtypes instead of RuntimeError.
  3. DataFrame.__dataframe__ was removed (it was confusing); its version field was moved to a DataFrame.version attribute instead.

Any feedback on those changes is welcome!

@rgommers (Member)

Thanks @vnlitvinov!

2. Column.describe_categorical now raises TypeError on non-categorical dtypes instead of RuntimeError.

This seems fine to me.

  1. Column.describe_categorical now returns a tuple of three elements (aligning the spec with what all existing implementations do). This keeps things in sync, but makes future extensions harder, so we should discuss it.

These are the implementations:

The docstrings also talk about a dictionary in both places; for cuDF the type annotation matches the code (tuple), in Vaex it matches the docs.

I think we're fine making backwards incompatible changes at this point. So I'd vote for using a dictionary, it's not much more verbose, and makes it possible to add another return value in case that would ever be needed. @maartenbreddels, @shwina what do you think?
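The dict-shaped return value voted for here can be sketched as follows. The key names mirror the spec discussion ("is_ordered", "is_dictionary", "categories"); the `TypedDict` shape and the toy values are illustrative.

```python
from typing import Any, Optional, TypedDict


class CategoricalDescription(TypedDict):
    """Illustrative typed shape for Column.describe_categorical."""

    is_ordered: bool            # whether the ordering of categories is meaningful
    is_dictionary: bool         # whether values are dictionary-encoded indices
    categories: Optional[Any]   # the Column of category values, or None if unavailable


desc: CategoricalDescription = {
    "is_ordered": False,
    "is_dictionary": True,
    "categories": None,
}
```

A dict is slightly more verbose than a 3-tuple, but it lets a future spec revision add a key without breaking positional unpacking in existing consumers.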

@rgommers (Member)

3. DataFrame.__dataframe__ was removed (it was confusing); its version field was moved to a DataFrame.version attribute instead.

This was inherited from:

Both of those define this, and it kinda makes sense to have it. There was discussion elsewhere too that I cannot find right now.

The rationale for removing it is unclear, and you are also removing docs and keywords that are needed. Can you please explain in more detail why you want to do this @vnlitvinov?

@vnlitvinov (Contributor, PR author)

vnlitvinov commented Feb 25, 2022

The comment you're linking to describes something somewhat different from what we have now in the spec.
That cuDFFrame.__dataframe__ method was describing the internals to some extent, while the implementation I removed returned self and version.

My main point is __dataframe__ should belong to a real dataframe object, not to an exchange one, serving as a distinction flag if you wish.

Alternatives to removing the method on exchange object are:

  • change the signature of this __dataframe__ to always return self and keep .version as class attribute, but, to be honest, I don't see much value in that;
  • change the spec of .__dataframe__ method on a real dataframe object to return a dictionary instead.

I am personally inclined to just remove the method on the exchange object, but I'm curious what others think (cc @jorisvandenbossche @shwina @jreback @aregm)

Oh, forgot to add... as for changing .describe_categorical - should we turn the other description fields (.describe_null and .dtype) into dicts as well for the sake of better extensibility?

@rgommers (Member)

My main point is __dataframe__ should belong to a real dataframe object, not to an exchange one, serving as a distinction flag if you wish.

I reviewed this again:

I think return self would still make sense, and not having __dataframe__ isn't distinguishing of this being a protocol-conforming object, but I'm fine with merging this as is.

change the signature of this __dataframe__ to always return self and keep .version as class attribute, but, to be honest, I don't see much value in that;

The point of this would be that, if you want code that handles a mix of dataframe types/objects, the only guarantee you have is that calling df.__dataframe__ gives you a protocol-conforming object (if it exists, of course). There's no other robust isinstance check you can do. But it only matters if people are going to be passing around these objects to begin with.
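The dispatch pattern described here, where `__dataframe__` is the only portable hook, can be sketched as below. The function and class names are illustrative, not from any real library.

```python
def to_interchange(obj):
    """Return a protocol-conforming interchange object for any supporting dataframe.

    Since implementations share no common base class, presence of
    __dataframe__ is the only robust capability check available.
    """
    if hasattr(obj, "__dataframe__"):
        return obj.__dataframe__()
    raise TypeError(
        f"{type(obj).__name__} does not support the dataframe interchange protocol"
    )


class FakeFrame:
    """Toy stand-in for a library's dataframe type."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # A real library would return its interchange wrapper object here.
        return self


frame = to_interchange(FakeFrame())  # works via duck typing
```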

Oh, forgot to add... as for changing .describe_categorical - should we turn the other description fields (.describe_null and .dtype) into dicts as well for the sake of better extensibility?

Not sure about that, that'd be quite verbose and quite a bit of churn. Given that we'd have to version the protocol to do anything with it, and can deal with a backwards-compatible extension via versioning, I'd suggest to leave it as is.

@rgommers (Member) left a comment

Reviewed again, all looks fine except for my one comment about not deleting docs and the _nan_as_null and _allow_copy attributes.

(resolved review thread on protocol/dataframe_protocol.py)
@vnlitvinov (Contributor, PR author)

@rgommers I've updated the PR again following my logic at Pandas PR

I've changed Column.describe_categorical back to returning a dictionary (which I've typed), and I've restored the DataFrame.__dataframe__() method, both to keep the documentation on its arguments and to make it possible to pass the object into from_dataframe().

I don't like the idea of saving the arguments as attributes - those should be implementation-specific IMHO, hence I'm marking it as @abstractmethod (as all other methods here, too).

"""

@property
def size(self) -> Optional[int]:
@abstractmethod
def size(self) -> int:
Member

The reason this may be None is that the size may be unknown, or expensive to calculate. IIRC there was a fair bit of discussion about this, and this was on purpose. Same as null_count further down.

Comment

@rgommers, could you elaborate on when the size could be unknown? Also, if the size is expensive to calculate, why should we return None?

Member

Think of a library using delayed evaluation of method calls that can give a variable size (filter/select/etc.). Accessing .size will trigger immediate evaluation, which can take an arbitrary amount of time. Dask is an example of such a library.

And why: a property lookup is supposed to be fast, and not trigger expensive computations. If so, it should be converted to a method. But .size is well-established, and the preference was for this property to remain but for libraries to allow returning None if the size is unknown without triggering a computation.
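The lazy-evaluation scenario described above can be sketched as follows: a column whose contents are not yet computed cannot report a size without forcing that computation. The class and its names are illustrative, not taken from any real library.

```python
from typing import Callable, List, Optional


class LazyColumn:
    """Toy column with delayed evaluation, to illustrate why size may be None."""

    def __init__(self, produce: Callable[[], List[int]]):
        self._produce = produce
        self._materialized: Optional[List[int]] = None

    @property
    def size(self) -> Optional[int]:
        # A property must stay cheap: report the size only if already known,
        # rather than triggering a potentially expensive computation.
        return None if self._materialized is None else len(self._materialized)

    def materialize(self) -> List[int]:
        if self._materialized is None:
            self._materialized = self._produce()  # the expensive step
        return self._materialized


col = LazyColumn(lambda: [x for x in range(1000) if x % 3 == 0])
assert col.size is None   # unknown before evaluation
col.materialize()
assert col.size == 334    # known afterwards
```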

Comment

What if the user eventually wants to get the size of a column? Does the user have to use a different/extra API to get it?

Contributor Author

Also note that Buffer has .bufsize which is somewhat equal to this size computing.
Hence maybe we should introduce some separate method of getting the size?

Contributor Author

@rgommers ping on the questions above?

Member

We decided here that None (and hence Optional) can indeed be dropped. So the current change here is fine.

Let me point out that we then indeed change .size from a property to a method. I'll not resolve this issue yet to give folks a chance to respond.

Member

Let me point out that we then indeed change .size from a property to a method. I'll not resolve this issue yet to give folks a chance to respond.

Which is not what this PR does right now; it declares a property in an ABC:

@property
@abstractmethod

Going from .size to .size() would be a pretty annoying bc-breaking change at this point (goes for null_count as well).

Member

Re-discussed once more (apologies for the churn): consensus was to turn these into methods indeed, to indicate that they may trigger a potentially expensive computation to return the size as an integer.

Member

Pushed a commit to turn .size into a method (with an explanation in its docstring).

null_count remains Optional[int], so I did not touch it.
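The resolution reached in this thread can be sketched as: `size` becomes an abstract method returning a plain `int` (call syntax signals the potential cost), while `null_count` remains an `Optional[int]` property. The class shapes below are an illustrative fragment, not the spec's exact text.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class Column(ABC):
    """Illustrative fragment of the Column ABC after this change."""

    @abstractmethod
    def size(self) -> int:
        """Size of the column in elements.

        A method rather than a property: for lazily evaluated
        implementations this may trigger an expensive computation.
        """

    @property
    @abstractmethod
    def null_count(self) -> Optional[int]:
        """Number of null elements, or None if it cannot be produced cheaply."""


class ListColumn(Column):
    """Toy implementation for illustration only."""

    def __init__(self, data: List[Optional[int]]):
        self._data = data

    def size(self) -> int:
        return len(self._data)

    @property
    def null_count(self) -> Optional[int]:
        return sum(1 for x in self._data if x is None)


col = ListColumn([1, None, 3])
assert col.size() == 3       # note the call parentheses
assert col.null_count == 1   # still a plain property
```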

@rgommers (Member)

rgommers commented Apr 2, 2022

Thanks @vnlitvinov. LGTM modulo my one comment.

@vnlitvinov (Contributor, PR author)

Note: this PR aims to change the spec, so #69 is also related.

@vnlitvinov (Contributor, PR author)

I would say it's blocked by #69, let's merge that and then update this and merge it as well.

NON_NULLABLE : int
Non-nullable column.
USE_NAN : int
Use explicit float NaN/NaT value.
Collaborator

Is NaT a thing outside of numpy? As far as I know that's just a sentinel value.

Contributor Author

I don't have much insight here, was just moving the enum descriptions from .describe_null docstring. @rgommers might know more about that.

Member

Yes indeed, it's not a standardized thing; it's a sentinel value that behaves similarly to NaN in comparisons and sorting. So this was already a mistake in the content.

Member

USE_NAN is just for floating point data then. NaT should be removed here.

For USE_SENTINEL it should be usable: it's just a particular value of the same data type as the rest of the data in the column, which the receiving library can always interpret, I'd say. So it can stay there.
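The null-representation enum after this thread's conclusion can be sketched as below: USE_NAN covers only floating-point NaN, and NumPy-specific sentinels such as NaT fall under USE_SENTINEL. The member values mirror the spec discussion but should be treated as illustrative.

```python
import enum


class ColumnNullType(enum.IntEnum):
    """Illustrative codes for how a column represents missing values."""

    NON_NULLABLE = 0   # column cannot contain nulls
    USE_NAN = 1        # explicit float NaN (floating-point data only; not NaT)
    USE_SENTINEL = 2   # a sentinel value of the column's own dtype (e.g. NaT)
    USE_BITMASK = 3    # separate validity bitmask, one bit per value
    USE_BYTEMASK = 4   # separate validity bytemask, one byte per value
```

Under this reading, a datetime column using NaT would report USE_SENTINEL together with the sentinel value, rather than USE_NAN.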

Member

The pandas implementation would also need to be updated (it currently uses USE_NAN for datetime data):

https://github.com/pandas-dev/pandas/blob/a7b8c1dc8cf76ab36e45315386a1a9d8274a8974/pandas/core/exchange/column.py#L41

Member

I pushed a commit removing NaT.

(resolved review thread on protocol/dataframe_protocol.py)
honno added a commit to data-apis/dataframe-interchange-tests that referenced this pull request Jul 22, 2022
Signed-off-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
This addresses the review comment that NaT is not a thing outside of NumPy. Hence, for not-a-datetime, all implementers should use sentinel values, because those are explicit.
@rgommers (Member)

I pushed a rebase after fixing conflicts introduced after merging gh-69 (in describe_categorical, and also updated the related CategoricalDescription).

@rgommers added the bc-breaking label (a change that is not fully backwards compatible) on Jul 29, 2022
@rgommers (Member) left a comment

All comments have been resolved now, so I will merge this. Thanks a lot @vnlitvinov and everyone else who commented.

Between this PR and gh-69, there are now two bc-breaking changes. We've agreed that those changes will be synced quickly to pandas (by @mroeschke), Vaex (by @honno), Modin (by @vnlitvinov), and cuDF (by @shwina), so this won't cause problems with implementations being out of sync with each other.

@vnlitvinov (Contributor, PR author)

Pull request for syncing Modin with new spec: modin-project/modin#4763

@honno (Member)

honno commented Aug 8, 2022

Vaex PR for syncing with the new spec: vaexio/vaex#2150

@mroeschke

pandas PRs for syncing with the new spec:

pandas-dev/pandas#47887
pandas-dev/pandas#47886

@vnlitvinov (Contributor, PR author)

Follow-up question: Column.describe_categorical returns, among other things, an is_dictionary: bool element and a categories: Column element (the latter may be None). Is is_dictionary still valid, and is it even needed, if categories being None can convey the same meaning, namely that we cannot provide the categories?

cc @rgommers @shwina @mroeschke @honno

@honno (Member)

honno commented Aug 10, 2022

Is is_dictionary still valid and is it even needed if categories can just be None to signify the same meaning that we cannot convey the categories?

In terms of necessity for interchange, I don't think so, as yeah, you can just check col.describe_categorical["categories"] is None. Now you mention it, in terms of semantics I'd prefer checking for None over checking a separate variable.
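The None-based check suggested here can be sketched as below; the stub column and helper function are illustrative, not part of the spec.

```python
class StubColumn:
    """Toy column exposing a describe_categorical dict, for illustration."""

    def __init__(self, categories):
        self.describe_categorical = {
            "is_ordered": False,
            "is_dictionary": categories is not None,
            "categories": categories,
        }


def categories_available(col) -> bool:
    # Rely on "categories" being None rather than on a separate boolean flag.
    return col.describe_categorical["categories"] is not None


assert not categories_available(StubColumn(None))
assert categories_available(StubColumn(["a", "b"]))
```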

Labels: bc-breaking (a change that is not fully backwards compatible), interchange-protocol

7 participants