
Declare enums explicitly, fix type hints #74

Merged (6 commits) on Jul 29, 2022

Conversation

@vnlitvinov (Contributor)

Signed-off-by: Vasily Litvinov vasilij.n.litvinov@intel.com

This is related to #73
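As a sketch of what the PR title's "declare enums explicitly" can look like: integer codes that previously lived only in docstrings become `IntEnum` members, which type hints can then reference. The member names and values below follow the interchange-protocol discussion but should be treated as illustrative, not as the exact declarations this PR lands.

```python
import enum


class DtypeKind(enum.IntEnum):
    """Integer codes for the logical type of a column (illustrative values)."""

    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# Explicit enums keep type hints honest: a signature such as
#   def dtype(self) -> Tuple[DtypeKind, int, str, str]
# is checkable, while a bare `int` code is not. IntEnum members still
# compare equal to plain ints, so existing callers keep working.
print(DtypeKind.FLOAT, int(DtypeKind.FLOAT))
```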

@rgommers (Member) left a comment

Overall LGTM, thanks @vnlitvinov. Just one question about ColumnNullType

(two resolved review threads on protocol/dataframe_protocol.py, now outdated)
@vnlitvinov marked this pull request as draft on February 24, 2022
@vnlitvinov (Contributor, PR author)

I'm converting this to draft until I finish migrating the prototype to pandas - I've already had to change a few things here...

@vnlitvinov (Contributor, PR author)

I hope this is enough to start chewing on; I'll probably polish the formatting so that at least PEP 8 checks are happy.

@vnlitvinov (Contributor, PR author)

Do note three changes to the spec:

  1. Column.describe_categorical now returns a tuple of three elements (aligning the spec with what all existing implementations do). This keeps things in sync, but makes future extensions harder, so we should discuss it.
  2. Column.describe_categorical now raises TypeError on non-categorical dtypes instead of RuntimeError.
  3. DataFrame.__dataframe__ was removed (it was confusing); its version field was moved to a DataFrame.version attribute instead.

Any feedback on those changes is welcome!

@rgommers (Member)

Thanks @vnlitvinov!

2. Column.describe_categorical now raises TypeError on non-categorical dtypes instead of RuntimeError.

This seems fine to me.

  1. Column.describe_categorical now returns a tuple of three elements (aligning the spec with what all existing implementations do). This keeps things in sync, but makes future extensions harder, so we should discuss it.

These are the implementations:

The docstrings also talk about a dictionary in both places; for cuDF the type annotation matches the code (tuple), in Vaex it matches the docs.

I think we're fine making backwards incompatible changes at this point. So I'd vote for using a dictionary, it's not much more verbose, and makes it possible to add another return value in case that would ever be needed. @maartenbreddels, @shwina what do you think?
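The dict-shaped return value voted for here can be sketched as follows. The key names mirror the spec discussion ("is_ordered", "is_dictionary", "categories"); the `TypedDict` shape and the toy values are illustrative.

```python
from typing import Any, Optional, TypedDict


class CategoricalDescription(TypedDict):
    """Illustrative typed shape for Column.describe_categorical."""

    is_ordered: bool            # whether the ordering of categories is meaningful
    is_dictionary: bool         # whether values are dictionary-encoded indices
    categories: Optional[Any]   # the Column of category values, or None if unavailable


desc: CategoricalDescription = {
    "is_ordered": False,
    "is_dictionary": True,
    "categories": None,
}
```

A dict is slightly more verbose than a 3-tuple, but it lets a future spec revision add a key without breaking positional unpacking in existing consumers.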

@rgommers (Member)

3. DataFrame.__dataframe__ was removed (it was confusing); its version field was moved to a DataFrame.version attribute instead.

This was inherited from:

Both of those define this, and it kinda makes sense to have it. There was discussion elsewhere too that I cannot find right now.

The rationale for removing it is unclear, and you are also removing docs and keywords that are needed. Can you please explain in more detail why you want to do this @vnlitvinov?

@vnlitvinov (Contributor, PR author)

vnlitvinov commented Feb 25, 2022

The comment you're linking to describes something somewhat different from what we have now in the spec.
That cuDFFrame.__dataframe__ method was describing the internals to some extent, while the implementation I removed returned self and version.

My main point is __dataframe__ should belong to a real dataframe object, not to an exchange one, serving as a distinction flag if you wish.

Alternatives to removing the method on exchange object are:

  • change the signature of this __dataframe__ to always return self and keep .version as class attribute, but, to be honest, I don't see much value in that;
  • change the spec of .__dataframe__ method on a real dataframe object to return a dictionary instead.

I am personally inclined to just remove the method on the exchange object, but I'm curious what others think (cc @jorisvandenbossche @shwina @jreback @aregm)

Oh, forgot to add... as for changing .describe_categorical - should we turn the other description fields (.describe_null and .dtype) into dicts as well for the sake of better extensibility?

@rgommers (Member)

My main point is __dataframe__ should belong to a real dataframe object, not to an exchange one, serving as a distinction flag if you wish.

I reviewed this again:

I think return self would still make sense, and not having __dataframe__ isn't distinguishing of this being a protocol-conforming object, but I'm fine with merging this as is.

change the signature of this __dataframe__ to always return self and keep .version as class attribute, but, to be honest, I don't see much value in that;

The point of this would be that, if you want code that handles a mix of dataframe types/objects, the only guarantee you have is that calling df.__dataframe__ gives you a protocol-conforming object (if it exists, of course). There's no other robust isinstance check you can do. But it only matters if people are going to be passing around these objects to begin with.
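The dispatch pattern described here, where `__dataframe__` is the only portable hook, can be sketched as below. The function and class names are illustrative, not from any real library.

```python
def to_interchange(obj):
    """Return a protocol-conforming interchange object for any supporting dataframe.

    Since implementations share no common base class, presence of
    __dataframe__ is the only robust capability check available.
    """
    if hasattr(obj, "__dataframe__"):
        return obj.__dataframe__()
    raise TypeError(
        f"{type(obj).__name__} does not support the dataframe interchange protocol"
    )


class FakeFrame:
    """Toy stand-in for a library's dataframe type."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # A real library would return its interchange wrapper object here.
        return self


frame = to_interchange(FakeFrame())  # works via duck typing
```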

Oh, forgot to add... as for changing .describe_categorical - should we turn the other description fields (.describe_null and .dtype) into dicts as well for the sake of better extensibility?

Not sure about that, that'd be quite verbose and quite a bit of churn. Given that we'd have to version the protocol to do anything with it, and can deal with a backwards-compatible extension via versioning, I'd suggest to leave it as is.

@rgommers (Member) left a comment

Reviewed again, all looks fine except for my one comment about not deleting docs and the _nan_as_null and _allow_copy attributes.

(resolved review thread on protocol/dataframe_protocol.py)
@vnlitvinov (Contributor, PR author)

@rgommers I've updated the PR again following my logic at Pandas PR

I've changed Column.describe_categorical back to returning a dictionary (which I've typed), and I've restored the DataFrame.__dataframe__() method, both to keep the documentation on its arguments and to make it possible to pass the object into from_dataframe().

I don't like the idea of saving the arguments as attributes - those should be implementation-specific IMHO, hence I'm marking it as @abstractmethod (as all other methods here, too).

"""

@property
def size(self) -> Optional[int]:
@abstractmethod
def size(self) -> int:
Member

The reason this may be None is that the size may be unknown, or expensive to calculate. IIRC there was a fair bit of discussion about this, and this was on purpose. Same as null_count further down.

Comment

@rgommers, could you elaborate on when the size could be unknown? Also, if the size is expensive to calculate, why should we return None?

Member

Think of a library using delayed evaluation of method calls that can give a variable size (filter/select/etc.). Accessing .size will trigger immediate evaluation, which can take an arbitrary amount of time. Dask is an example of such a library.

And why: a property lookup is supposed to be fast, and not trigger expensive computations. If so, it should be converted to a method. But .size is well-established, and the preference was for this property to remain but for libraries to allow returning None if the size is unknown without triggering a computation.
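The lazy-evaluation scenario described above can be sketched as follows: a column whose contents are not yet computed cannot report a size without forcing that computation. The class and its names are illustrative, not taken from any real library.

```python
from typing import Callable, List, Optional


class LazyColumn:
    """Toy column with delayed evaluation, to illustrate why size may be None."""

    def __init__(self, produce: Callable[[], List[int]]):
        self._produce = produce
        self._materialized: Optional[List[int]] = None

    @property
    def size(self) -> Optional[int]:
        # A property must stay cheap: report the size only if already known,
        # rather than triggering a potentially expensive computation.
        return None if self._materialized is None else len(self._materialized)

    def materialize(self) -> List[int]:
        if self._materialized is None:
            self._materialized = self._produce()  # the expensive step
        return self._materialized


col = LazyColumn(lambda: [x for x in range(1000) if x % 3 == 0])
assert col.size is None   # unknown before evaluation
col.materialize()
assert col.size == 334    # known afterwards
```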

Comment

What if the user eventually wants to get the size of a column? Does the user have to use a different/extra API to get it?

Contributor Author

Also note that Buffer has .bufsize which is somewhat equal to this size computing.
Hence maybe we should introduce some separate method of getting the size?

Contributor Author

@rgommers ping on the questions above?

Member

We decided here that None (and hence Optional) can indeed be dropped. So the current change here is fine.

Let me point out that we then indeed change .size from a property to a method. I'll not resolve this issue yet to give folks a chance to respond.

Member

Let me point out that we then indeed change .size from a property to a method. I'll not resolve this issue yet to give folks a chance to respond.

Which is not what this PR does right now; it declares a property in an ABC:

@property
@abstractmethod

Going from .size to .size() would be a pretty annoying bc-breaking change at this point (goes for null_count as well).

Member

Re-discussed once more (apologies for the churn): consensus was to turn these into methods indeed, to indicate that they may trigger a potentially expensive computation to return the size as an integer.

Member

Pushed a commit to turn .size into a method (with an explanation in its docstring).

null_count remains Optional[int], so I did not touch it.
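The resolution reached in this thread can be sketched as: `size` becomes an abstract method returning a plain `int` (call syntax signals the potential cost), while `null_count` remains an `Optional[int]` property. The class shapes below are an illustrative fragment, not the spec's exact text.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class Column(ABC):
    """Illustrative fragment of the Column ABC after this change."""

    @abstractmethod
    def size(self) -> int:
        """Size of the column in elements.

        A method rather than a property: for lazily evaluated
        implementations this may trigger an expensive computation.
        """

    @property
    @abstractmethod
    def null_count(self) -> Optional[int]:
        """Number of null elements, or None if it cannot be produced cheaply."""


class ListColumn(Column):
    """Toy implementation for illustration only."""

    def __init__(self, data: List[Optional[int]]):
        self._data = data

    def size(self) -> int:
        return len(self._data)

    @property
    def null_count(self) -> Optional[int]:
        return sum(1 for x in self._data if x is None)


col = ListColumn([1, None, 3])
assert col.size() == 3       # note the call parentheses
assert col.null_count == 1   # still a plain property
```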

@rgommers (Member)

rgommers commented Apr 2, 2022

Thanks @vnlitvinov. LGTM modulo my one comment.

@vnlitvinov (Contributor, PR author)

Note: this PR aims to change the spec, so #69 is also related.

@vnlitvinov (Contributor, PR author)

I would say it's blocked by #69, let's merge that and then update this and merge it as well.

NON_NULLABLE : int
Non-nullable column.
USE_NAN : int
Use explicit float NaN/NaT value.
Collaborator

Is NaT a thing outside of numpy? As far as I know that's just a sentinel value.

Contributor Author

I don't have much insight here, was just moving the enum descriptions from .describe_null docstring. @rgommers might know more about that.

Member

Yes indeed, it's not a standardized thing; it's a sentinel value that behaves similarly to NaN in comparisons and sorting. So this was already a mistake in the content.

Member

USE_NAN is just for floating point data then. NaT should be removed here.

For USE_SENTINEL it should be usable: it's just a particular value of the same data type as the rest of the data in the column, which the receiving library can always interpret, I'd say. So it can stay there.
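The null-representation enum after this thread's conclusion can be sketched as below: USE_NAN covers only floating-point NaN, and NumPy-specific sentinels such as NaT fall under USE_SENTINEL. The member values mirror the spec discussion but should be treated as illustrative.

```python
import enum


class ColumnNullType(enum.IntEnum):
    """Illustrative codes for how a column represents missing values."""

    NON_NULLABLE = 0   # column cannot contain nulls
    USE_NAN = 1        # explicit float NaN (floating-point data only; not NaT)
    USE_SENTINEL = 2   # a sentinel value of the column's own dtype (e.g. NaT)
    USE_BITMASK = 3    # separate validity bitmask, one bit per value
    USE_BYTEMASK = 4   # separate validity bytemask, one byte per value
```

Under this reading, a datetime column using NaT would report USE_SENTINEL together with the sentinel value, rather than USE_NAN.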

Member

The pandas implementation would also need to be updated (it currently uses USE_NAN for datetime data):

https://github.com/pandas-dev/pandas/blob/a7b8c1dc8cf76ab36e45315386a1a9d8274a8974/pandas/core/exchange/column.py#L41

Member

I pushed a commit removing NaT.

(resolved review thread on protocol/dataframe_protocol.py)
honno added a commit to data-apis/dataframe-interchange-tests that referenced this pull request Jul 22, 2022
Signed-off-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
This addresses the review comment that NaT is not a thing outside of NumPy. Hence, for not-a-datetime, all implementers should use sentinel values, because those are explicit.
@rgommers (Member)

I pushed a rebase after fixing conflicts introduced after merging gh-69 (in describe_categorical, and also updated the related CategoricalDescription).

@rgommers added the bc-breaking label (a change that is not fully backwards compatible) on Jul 29, 2022
@rgommers (Member) left a comment

All comments have been resolved now, so I will merge this. Thanks a lot @vnlitvinov and everyone else who commented.

Between this PR and gh-69, there are now two bc-breaking changes. We've agreed that those changes will be synced quickly to pandas (by @mroeschke), Vaex (by @honno), Modin (by @vnlitvinov), and cuDF (by @shwina), so this won't cause problems with implementations being out of sync with each other.

@vnlitvinov (Contributor, PR author)

Pull request for syncing Modin with new spec: modin-project/modin#4763

@honno (Member)

honno commented Aug 8, 2022

Vaex PR for syncing with the new spec: vaexio/vaex#2150

@mroeschke

pandas PRs for syncing with the new spec:

pandas-dev/pandas#47887
pandas-dev/pandas#47886

@vnlitvinov (Contributor, PR author)

Follow-up question: Column.describe_categorical returns, among other things, an is_dictionary: bool element and a categories: Column element (the latter may be None). Is is_dictionary still valid, and is it even needed, if categories being None can convey the same meaning, namely that we cannot provide the categories?

cc @rgommers @shwina @mroeschke @honno

@honno (Member)

honno commented Aug 10, 2022

Is is_dictionary still valid and is it even needed if categories can just be None to signify the same meaning that we cannot convey the categories?

In terms of necessity for interchange, I don't think so, as yeah, you can just check col.describe_categorical["categories"] is None. Now you mention it, in terms of semantics I'd prefer checking for None over checking a separate variable.
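The None-based check suggested here can be sketched as below; the stub column and helper function are illustrative, not part of the spec.

```python
class StubColumn:
    """Toy column exposing a describe_categorical dict, for illustration."""

    def __init__(self, categories):
        self.describe_categorical = {
            "is_ordered": False,
            "is_dictionary": categories is not None,
            "categories": categories,
        }


def categories_available(col) -> bool:
    # Rely on "categories" being None rather than on a separate boolean flag.
    return col.describe_categorical["categories"] is not None


assert not categories_available(StubColumn(None))
assert categories_available(StubColumn(["a", "b"]))
```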

Labels: bc-breaking (a change that is not fully backwards compatible), interchange-protocol

7 participants