Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34979: [Python] Create a base class for Table and RecordBatch #34980

Merged
merged 7 commits into from
Apr 27, 2023

Conversation

danepitkin
Copy link
Contributor

@danepitkin danepitkin commented Apr 7, 2023

Rationale for this change

This is an incremental first step towards #30559

What changes are included in this PR?

Introduce class _Table in table.pxi.

Are these changes tested?

Existing pytests will check for regression.

Are there any user-facing changes?

No

@danepitkin danepitkin requested a review from AlenkaF as a code owner April 7, 2023 19:54
@github-actions github-actions bot added the awaiting review Awaiting review label Apr 7, 2023
@github-actions
Copy link

github-actions bot commented Apr 7, 2023

@github-actions
Copy link

github-actions bot commented Apr 7, 2023

⚠️ GitHub issue #34979 has been automatically assigned in GitHub to PR creator.

@danepitkin
Copy link
Contributor Author

danepitkin commented Apr 7, 2023

The first commit moves the take() method from the individual Table and RecordBatch classes to the base class _Table. The docstring is shared, but ends up being a bit confusing. I think the best path forward is to refactor this and create an internal method on the _Table class e.g. _Table._take() and override it in the subclasses with a subclass-specific docstring.

E.g.

cdef class Table(_Table):
    def take(self, object indices):
    """ <Class-specific docstring> """
        return self._take(indices)

Also, the class name is wrong in the docstring

>>> print(pa.RecordBatch.take.__doc__)
_Take.take(self, indices)

        Select rows from the record batch.
        ...

@danepitkin
Copy link
Contributor Author

Docstrings look good with commit 2, but we do have a wrapper to maintain now.

>>> print(pa.RecordBatch.take.__doc__)
RecordBatch.take(self, indices)

        Select rows from the record batch.
        ...

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 11, 2023
@jorisvandenbossche
Copy link
Member

The docstring is shared, but ends up being a bit confusing.

On the other hand, a significant part of the usefulness of sharing the implementation is lost if we can't share the docstring? (especially in this example where the function is only a one liner, for others that might be more worth it)
What did you find confusing about it? (the need to constantly say "RecordBatch or Table"?)

@danepitkin danepitkin marked this pull request as draft April 11, 2023 19:34

def drop_null(self):
"""
Remove missing values from a Table.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left only the Table examples. I can include RecordBatch explicitly if preferred.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine to just use a single example. We could always add a comment like # or pa.RecordBatch.from_pandas(df) above the equivalent Table line, to make it clear how to run the equivalent example with a RecordBatch instead of a Table

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or adding a standard sentence like "The following example uses a Table, but it works the same for RecordBatch". Or would that just be unnecessary noise in most cases?

@danepitkin danepitkin marked this pull request as ready for review April 11, 2023 19:54
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some minor comments on naming and wording

@@ -469,15 +469,19 @@ cdef class ChunkedArray(_PandasConvertible):
cdef getitem(self, int64_t i)


cdef class Table(_PandasConvertible):
cdef class _Table(_PandasConvertible):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we maybe want to use "Base" in the name (to make it clearer that it is a shared base class?) _BaseTable would be fine (or _BaseTabular, if that's a better word to describe the commonality between RecordBatch and Table)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about _Tabular? It's an adjective so class RecordBatch(_Tabular) and class Table(_Tabular) sound very descriptive IMO. My initial thought is that adding Base here is redundant, but let me know if you think otherwise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"tabular" sounds good!


def drop_null(self):
"""
Remove missing values from a Table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Remove missing values from a Table.
Remove missing values from a RecordBatch or Table.

Do we always want to mention both like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I could also change to Remove missing values from tabular object.


def drop_null(self):
"""
Remove missing values from a Table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine to just use a single example. We could always add a comment like # or pa.RecordBatch.from_pandas(df) above the equivalent Table line, to make it clear how to run the equivalent example with a RecordBatch instead of a Table


def take(self, object indices):
"""
Select rows from the table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Select rows from the table.
Select rows from the RecordBatch or Table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!


Returns
-------
taken : Table
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
taken : Table
taken : RecordBatch or Table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!


def drop_null(self):
"""
Remove missing values from a Table.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or adding a standard sentence like "The following example uses a Table, but it works the same for RecordBatch". Or would that just be unnecessary noise in most cases?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 13, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 13, 2023
@danepitkin
Copy link
Contributor Author

I addressed the comments and also added methods to_string() and __repr__ to the base class because I was testing out docstrings for Table vs RecordBatch for similarity. Sorry, I hope not to add too much more scope to this PR! 😬


def __repr__(self):
if not self._is_initialized():
raise ValueError("This object's internal pointer is NULL, do not "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this ValueError? Table had it, but RecordBatch didn't. It seems superfluous IMO. I had to add _is_initialized() so each subclass could implement checking its C++ object for validity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can have a table object that no longer is being backed by Arrow data when doing df = table.to_pandas(self_destruct=True). This check and error then prevents getting a segfault when just printing table (calling any other method on it will still segfault)

This options seems to have no effect for RecordBatch (I assume this is because the RecordBatch.to_pandas method converts the batch into a Table and then calls Table.to_pandas, so even though the Table C++ object is destructed, the original RecordBatch still owns that data)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we need to keep this check, and I think it is fine that for RecordBatch is this essentially never used

def __repr__(self):
# TODO remove this and update pytests/doctests for
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will follow up with a subsequent diff. When I remove this, RecordBatch prints out partial tabular data like Table, but a bunch of doctests need to be updated so I felt its better done in a separate diff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to have a separate PR for the changes in the doctest. Am looking forward to seeing a better repr for the RecordBatch also! =)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 21, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jorisvandenbossche jorisvandenbossche merged commit 7bf1dec into apache:main Apr 27, 2023
17 of 18 checks passed
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Apr 27, 2023
@ursabot
Copy link

ursabot commented Apr 27, 2023

Benchmark runs are scheduled for baseline = 07d02d6 and contender = 7bf1dec. 7bf1dec is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️10.69% ⬆️6.3%] test-mac-arm
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7bf1dec7 ec2-t3-xlarge-us-east-2
[Finished] 7bf1dec7 test-mac-arm
[Finished] 7bf1dec7 ursa-i9-9960x
[Finished] 7bf1dec7 ursa-thinkcentre-m75q
[Finished] 07d02d6c ec2-t3-xlarge-us-east-2
[Finished] 07d02d6c test-mac-arm
[Failed] 07d02d6c ursa-i9-9960x
[Failed] 07d02d6c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Apr 27, 2023

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
apache#34980)

### Rationale for this change

This is an incremental first step towards apache#30559

### What changes are included in this PR?

Introduce `class _Table` in `table.pxi`.

### Are these changes tested?

Existing pytests will check for regression.

### Are there any user-facing changes?

No
* Closes: apache#34979

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
apache#34980)

### Rationale for this change

This is an incremental first step towards apache#30559

### What changes are included in this PR?

Introduce `class _Table` in `table.pxi`.

### Are these changes tested?

Existing pytests will check for regression.

### Are there any user-facing changes?

No
* Closes: apache#34979

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
apache#34980)

### Rationale for this change

This is an incremental first step towards apache#30559

### What changes are included in this PR?

Introduce `class _Table` in `table.pxi`.

### Are these changes tested?

Existing pytests will check for regression.

### Are there any user-facing changes?

No
* Closes: apache#34979

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Create a base class for Table and RecordBatch
4 participants