GH-34979: [Python] Create a base class for Table and RecordBatch #34980

danepitkin · 2023-04-07T19:54:57Z

Rationale for this change

This is an incremental first step towards #30559

What changes are included in this PR?

Introduce class _Table in table.pxi.

Are these changes tested?

Existing pytests will check for regression.

Are there any user-facing changes?

No

Closes: [Python] Create a base class for Table and RecordBatch #34979

github-actions · 2023-04-07T19:55:52Z

Closes: [Python] Create a base class for Table and RecordBatch #34979

github-actions · 2023-04-07T19:55:55Z

⚠️ GitHub issue #34979 has been automatically assigned in GitHub to PR creator.

danepitkin · 2023-04-07T20:00:06Z

The first commit moves the take() method from the individual Table and RecordBatch classes to the base class _Table. The docstring is shared, but ends up being a bit confusing. I think the best path forward is to refactor this and create an internal method on the _Table class e.g. _Table._take() and override it in the subclasses with a subclass-specific docstring.

E.g.

cdef class Table(_Table):
    def take(self, object indices):
        """ <Class-specific docstring> """
        return self._take(indices)

Also, the class name is wrong in the docstring:

>>> print(pa.RecordBatch.take.__doc__)
_Table.take(self, indices)

        Select rows from the record batch.
        ...
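For illustration, the wrapper approach described above can be sketched in plain Python rather than Cython. This is only a sketch under assumptions: the `rows` attribute and the `describe` method are hypothetical and exist just to make it runnable, and `__qualname__` stands in for the signature line that appears at the top of the compiled docstring.

```python
class _Table:
    """Hypothetical shared base; holds rows as a plain list for illustration."""

    def __init__(self, rows):
        self.rows = list(rows)

    def _take(self, indices):
        # Shared implementation: build a new object of the caller's own type.
        return type(self)(self.rows[i] for i in indices)

    def describe(self):
        """Shared docstring, inherited unchanged by every subclass."""
        return f"{type(self).__name__} with {len(self.rows)} rows"


class Table(_Table):
    def take(self, indices):
        """Select rows from the table."""
        return self._take(indices)


class RecordBatch(_Table):
    def take(self, indices):
        """Select rows from the record batch."""
        return self._take(indices)


# The thin wrappers carry their own class name and docstring...
print(RecordBatch.take.__qualname__)      # RecordBatch.take
print(RecordBatch.take.__doc__)           # Select rows from the record batch.
# ...while an inherited method still reports the class that defined it.
print(RecordBatch.describe.__qualname__)  # _Table.describe
```

The cost is one small wrapper per method per subclass; the benefit is that each public method documents itself under its own class.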

danepitkin · 2023-04-07T20:41:19Z

Docstrings look good with commit 2, but we do have a wrapper to maintain now.

>>> print(pa.RecordBatch.take.__doc__)
RecordBatch.take(self, indices)

        Select rows from the record batch.
        ...

AlenkaF

LGTM 👍

jorisvandenbossche · 2023-04-11T12:56:15Z

The docstring is shared, but ends up being a bit confusing.

On the other hand, isn't a significant part of the usefulness of sharing the implementation lost if we can't share the docstring? (Especially in this example, where the function is only a one-liner; for other methods it might be more worthwhile.)
What did you find confusing about it? (the need to constantly say "RecordBatch or Table"?)

danepitkin · 2023-04-11T19:53:54Z

python/pyarrow/table.pxi

+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.


I left only the Table examples. I can include RecordBatch explicitly if preferred.

I think that's fine to just use a single example. We could always add a comment like # or pa.RecordBatch.from_pandas(df) above the equivalent Table line, to make it clear how to run the equivalent example with a RecordBatch instead of a Table

Or adding a standard sentence like "The following example uses a Table, but it works the same for RecordBatch". Or would that just be unnecessary noise in most cases?
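To make the single-example idea concrete, here is a rough plain-Python sketch of one shared docstring whose example uses a Table and mentions the RecordBatch equivalent in a comment. Everything here is a hypothetical stand-in (the `columns` attribute, the constructors, and the row-filtering logic); it is not the pyarrow implementation.

```python
class _Tabular:
    """Hypothetical stand-in for the shared base class."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values (None = missing)
        self.columns = columns

    def drop_null(self):
        """
        Remove rows that contain missing values from a RecordBatch or Table.

        Examples
        --------
        >>> # or: data = RecordBatch({"a": [1, None], "b": ["x", "y"]})
        >>> data = Table({"a": [1, None], "b": ["x", "y"]})
        >>> data.drop_null().columns
        {'a': [1], 'b': ['x']}
        """
        names = list(self.columns)
        rows = zip(*(self.columns[n] for n in names))
        # Keep only rows where no column holds a missing value.
        kept = [row for row in rows if None not in row]
        return type(self)({n: [row[i] for row in kept] for i, n in enumerate(names)})


class Table(_Tabular):
    pass


class RecordBatch(_Tabular):
    pass
```

Both subclasses expose the very same docstring object, so the one-line comment is the only place where the alternative type has to be mentioned.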

jorisvandenbossche

Looks good! Some minor comments on naming and wording

jorisvandenbossche · 2023-04-13T14:51:34Z

python/pyarrow/lib.pxd

@@ -469,15 +469,19 @@ cdef class ChunkedArray(_PandasConvertible):
    cdef getitem(self, int64_t i)


-cdef class Table(_PandasConvertible):
+cdef class _Table(_PandasConvertible):


Do we maybe want to use "Base" in the name (to make it clearer that it is a shared base class?) _BaseTable would be fine (or _BaseTabular, if that's a better word to describe the commonality between RecordBatch and Table)

How about _Tabular? It's an adjective so class RecordBatch(_Tabular) and class Table(_Tabular) sound very descriptive IMO. My initial thought is that adding Base here is redundant, but let me know if you think otherwise.

"tabular" sounds good!

jorisvandenbossche · 2023-04-13T14:52:19Z

python/pyarrow/table.pxi

+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.


Suggested change

-        Remove missing values from a Table.
+        Remove missing values from a RecordBatch or Table.

Do we always want to mention both like this?

Updated. I could also change it to "Remove missing values from a tabular object."

jorisvandenbossche · 2023-04-13T14:53:38Z

python/pyarrow/table.pxi

+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.


I think that's fine to just use a single example. We could always add a comment like # or pa.RecordBatch.from_pandas(df) above the equivalent Table line, to make it clear how to run the equivalent example with a RecordBatch instead of a Table

jorisvandenbossche · 2023-04-13T14:54:15Z

python/pyarrow/table.pxi

+
+    def take(self, object indices):
+        """
+        Select rows from the table.


Suggested change

-        Select rows from the table.
+        Select rows from the RecordBatch or Table.

jorisvandenbossche · 2023-04-13T14:54:53Z

python/pyarrow/table.pxi

+
+        Returns
+        -------
+        taken : Table


Suggested change

-        taken : Table
+        taken : RecordBatch or Table

jorisvandenbossche · 2023-04-13T14:55:39Z

python/pyarrow/table.pxi

+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.


Or adding a standard sentence like "The following example uses a Table, but it works the same for RecordBatch". Or would that just be unnecessary noise in most cases?

danepitkin · 2023-04-13T23:20:46Z

I addressed the comments and also added methods to_string() and __repr__ to the base class because I was testing out docstrings for Table vs RecordBatch for similarity. Sorry, I hope not to add too much more scope to this PR! 😬

danepitkin · 2023-04-13T23:15:41Z

python/pyarrow/table.pxi

+
+    def __repr__(self):
+        if not self._is_initialized():
+            raise ValueError("This object's internal pointer is NULL, do not "


Do we want this ValueError? Table had it, but RecordBatch didn't. It seems superfluous IMO. I had to add _is_initialized() so each subclass could implement checking its C++ object for validity.

Looking back at when this was added:

ARROW-3789: [Python] Use common conversion path for Arrow to pandas.Series/DataFrame. Zero copy optimizations for DataFrame, add split_blocks and self_destruct options #6067

https://github.com/apache/arrow/pull/6067/files#diff-cede36e8e2e0eb6e6e1ee21745db9687174527f463520c6e6d8b9e8f957bf304

it might make sense to keep it, but note that I am not familiar with the self-destruct option for tables.

Yes, you can have a table object that is no longer backed by Arrow data after doing df = table.to_pandas(self_destruct=True). This check and error then prevent a segfault when just printing the table (calling any other method on it will still segfault).

This option seems to have no effect for RecordBatch (I assume this is because the RecordBatch.to_pandas method converts the batch into a Table and then calls Table.to_pandas, so even though the Table C++ object is destructed, the original RecordBatch still owns that data).

So we need to keep this check, and I think it is fine that for RecordBatch it is essentially never used.
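As a rough plain-Python sketch of the pattern discussed in this thread: the base class owns `__repr__` and asks each subclass whether its underlying data is still valid. The `_data` attribute, the `release()` method, and the error message are all hypothetical; `release()` merely imitates the state a table is left in after `to_pandas(self_destruct=True)`.

```python
class _Tabular:
    def _is_initialized(self):
        # Each subclass checks its own underlying object.
        raise NotImplementedError

    def __repr__(self):
        if not self._is_initialized():
            # Hypothetical message; raising here avoids touching freed data.
            raise ValueError("underlying data is no longer available")
        return f"{type(self).__name__}({self._data!r})"


class Table(_Tabular):
    def __init__(self, data):
        self._data = data

    def _is_initialized(self):
        return self._data is not None

    def release(self):
        # Hypothetical stand-in for the effect of a self-destructing conversion.
        self._data = None


class RecordBatch(_Tabular):
    def __init__(self, data):
        self._data = data

    def _is_initialized(self):
        # In this sketch a batch never loses its data, so the guard never fires.
        return self._data is not None
```

With this layout the guard lives in one place, and a subclass that can never end up uninitialized pays only for a cheap check.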

danepitkin · 2023-04-13T23:17:28Z

python/pyarrow/table.pxi

    def __repr__(self):
+        # TODO remove this and update pytests/doctests for


I will follow up with a subsequent diff. When I remove this, RecordBatch prints out partial tabular data like Table, but a bunch of doctests need to be updated, so I felt it's better done in a separate diff.

I think it makes sense to have a separate PR for the changes in the doctest. Am looking forward to seeing a better repr for the RecordBatch also! =)

jorisvandenbossche

Thanks!

ursabot · 2023-04-27T16:53:09Z

Benchmark runs are scheduled for baseline = 07d02d6 and contender = 7bf1dec. 7bf1dec is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️10.69% ⬆️6.3%] test-mac-arm
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7bf1dec7 ec2-t3-xlarge-us-east-2
[Finished] 7bf1dec7 test-mac-arm
[Finished] 7bf1dec7 ursa-i9-9960x
[Finished] 7bf1dec7 ursa-thinkcentre-m75q
[Finished] 07d02d6c ec2-t3-xlarge-us-east-2
[Finished] 07d02d6c test-mac-arm
[Failed] 07d02d6c ursa-i9-9960x
[Failed] 07d02d6c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2023-04-27T16:53:44Z

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

apache#34980)

### Rationale for this change

This is an incremental first step towards apache#30559

### What changes are included in this PR?

Introduce `class _Table` in `table.pxi`.

### Are these changes tested?

Existing pytests will check for regression.

### Are there any user-facing changes?

No

* Closes: apache#34979

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

GH-34979: [Python] Create a base class for Table and RecordBatch

f6daac2

danepitkin requested a review from AlenkaF as a code owner April 7, 2023 19:54

github-actions bot added the awaiting review Awaiting review label Apr 7, 2023

github-actions bot added the Component: Python label Apr 7, 2023

Move docstrings to subclasses

148bae6

Add init method

b1287d0

AlenkaF approved these changes Apr 11, 2023

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 11, 2023

danepitkin marked this pull request as draft April 11, 2023 19:34

Use shared docstrings, move second shared compute func to base class

bb72dc9

danepitkin commented Apr 11, 2023

danepitkin marked this pull request as ready for review April 11, 2023 19:54

jorisvandenbossche reviewed Apr 13, 2023

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 13, 2023

Update Base class name, update docstrings with more examples

daa562c

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 13, 2023

danepitkin added 2 commits April 13, 2023 17:42

Add __repr__ and to_string methods for identical outputs

2b44da7

Fix tests, remove commented code

d771c6a

danepitkin commented Apr 14, 2023

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 21, 2023

jorisvandenbossche approved these changes Apr 27, 2023

jorisvandenbossche merged commit 7bf1dec into apache:main Apr 27, 2023
17 of 18 checks passed

github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Apr 27, 2023
