GH-34979: [Python] Create a base class for Table and RecordBatch #34980
The first commit moves the E.g.
Also, the class name is wrong in the docstring
Docstrings look good with commit 2, but we do have a wrapper to maintain now.
LGTM 👍
On the other hand, a significant part of the usefulness of sharing the implementation is lost if we can't share the docstring? (Especially in this example, where the function is only a one-liner; for others it might be more worth it.)
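To make the trade-off concrete, here is a minimal pure-Python sketch (not the actual Cython code in `table.pxi`). The names `_Base`, `TableLike`, and `BatchLike` are hypothetical stand-ins: one subclass inherits the shared method together with its generic docstring, while the other adds a one-line wrapper purely to carry a class-specific docstring.

```python
class _Base:
    """Hypothetical shared base class; stands in for the PR's base class."""

    def __init__(self, values):
        self._values = list(values)

    def drop_null(self):
        """Remove missing values from a RecordBatch or Table."""
        # Shared implementation: keep every value that is not None.
        return type(self)(v for v in self._values if v is not None)


class TableLike(_Base):
    # Inherits both the implementation and the generic docstring.
    pass


class BatchLike(_Base):
    def drop_null(self):
        """Remove missing values from a RecordBatch."""
        # Wrapper that exists only to carry a class-specific docstring;
        # this is the extra code that has to be maintained.
        return super().drop_null()


print(TableLike.drop_null.__doc__)
print(BatchLike.drop_null.__doc__)
```

The wrapper keeps the per-class docstring accurate, at the cost of one more method per subclass for every shared function.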
python/pyarrow/table.pxi (outdated)
    def drop_null(self):
        """
        Remove missing values from a Table.
I left only the `Table` examples. I can include `RecordBatch` explicitly if preferred.
I think that's fine to just use a single example. We could always add a comment like `# or pa.RecordBatch.from_pandas(df)` above the equivalent Table line, to make it clear how to run the equivalent example with a RecordBatch instead of a Table.
Or adding a standard sentence like "The following example uses a Table, but it works the same for RecordBatch". Or would that just be unnecessary noise in most cases?
Looks good! Some minor comments on naming and wording
python/pyarrow/lib.pxd (outdated)
    @@ -469,15 +469,19 @@ cdef class ChunkedArray(_PandasConvertible):
         cdef getitem(self, int64_t i)

    -cdef class Table(_PandasConvertible):
    +cdef class _Table(_PandasConvertible):
Do we maybe want to use "Base" in the name (to make it clearer that it is a shared base class)? `_BaseTable` would be fine (or `_BaseTabular`, if that's a better word to describe the commonality between RecordBatch and Table).
How about `_Tabular`? It's an adjective, so `class RecordBatch(_Tabular)` and `class Table(_Tabular)` sound very descriptive IMO. My initial thought is that adding `Base` here is redundant, but let me know if you think otherwise.
"tabular" sounds good!
python/pyarrow/table.pxi (outdated)
    def drop_null(self):
        """
        Remove missing values from a Table.
Suggested change:
    - Remove missing values from a Table.
    + Remove missing values from a RecordBatch or Table.
Do we always want to mention both like this?
Updated. I could also change it to `Remove missing values from tabular object.`
python/pyarrow/table.pxi (outdated)
    def take(self, object indices):
        """
        Select rows from the table.
Suggested change:
    - Select rows from the table.
    + Select rows from the RecordBatch or Table.
Updated!
python/pyarrow/table.pxi (outdated)
    Returns
    -------
    taken : Table
Suggested change:
    - taken : Table
    + taken : RecordBatch or Table
Updated!
I addressed the comments and also added methods
    def __repr__(self):
        if not self._is_initialized():
            raise ValueError("This object's internal pointer is NULL, do not "
Do we want this `ValueError`? `Table` had it, but `RecordBatch` didn't. It seems superfluous IMO. I had to add `_is_initialized()` so each subclass could implement checking its C++ object for validity.
Looking back at when this was added:
- ARROW-3789: [Python] Use common conversion path for Arrow to pandas.Series/DataFrame. Zero copy optimizations for DataFrame, add split_blocks and self_destruct options #6067
- https://github.com/apache/arrow/pull/6067/files#diff-cede36e8e2e0eb6e6e1ee21745db9687174527f463520c6e6d8b9e8f957bf304
It might make sense to keep it, but note that I am not familiar with the self-destruct option for Table.
Yes, you can have a table object that is no longer backed by Arrow data after doing `df = table.to_pandas(self_destruct=True)`. This check and error then prevent a segfault when just printing `table` (calling any other method on it will still segfault).
This option seems to have no effect for RecordBatch (I assume this is because the `RecordBatch.to_pandas` method converts the batch into a Table and then calls `Table.to_pandas`, so even though the Table C++ object is destructed, the original RecordBatch still owns that data).
So we need to keep this check, and I think it is fine that for RecordBatch it is essentially never used.
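The pattern discussed in this thread can be sketched in plain Python. This is an illustration, not the Cython implementation: `_TabularSketch`, `TableSketch`, the `_data` attribute, and `_release()` are hypothetical, with `_data = None` standing in for a NULL C++ pointer after `to_pandas(self_destruct=True)`. The error message is also made up, since the original one is truncated in the quoted snippet.

```python
class _TabularSketch:
    """Hypothetical base class: shared __repr__ with a validity guard."""

    def _is_initialized(self):
        # Each subclass checks its own underlying object.
        raise NotImplementedError

    def __repr__(self):
        if not self._is_initialized():
            # Fail with a Python exception instead of touching freed data.
            raise ValueError("underlying data has been released")
        return f"{type(self).__name__}({self._data!r})"


class TableSketch(_TabularSketch):
    def __init__(self, data):
        self._data = data

    def _is_initialized(self):
        return self._data is not None

    def _release(self):
        # Stand-in for what self_destruct does to the backing data.
        self._data = None


t = TableSketch({"a": [1, 2]})
print(repr(t))
t._release()
try:
    repr(t)
except ValueError as exc:
    print("ValueError:", exc)
```

Keeping the guard in the shared `__repr__` costs a subclass like RecordBatch nothing if its check always succeeds, which matches the conclusion above.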
    def __repr__(self):
        # TODO remove this and update pytests/doctests for
I will follow up with a subsequent diff. When I remove this, `RecordBatch` prints out partial tabular data like `Table`, but a bunch of doctests need to be updated, so I felt it's better done in a separate diff.
I think it makes sense to have a separate PR for the changes in the doctest. Am looking forward to seeing a better repr for the RecordBatch also! =)
Thanks!
Benchmark runs are scheduled for baseline = 07d02d6 and contender = 7bf1dec. 7bf1dec is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have high level of regressions.
apache#34980) ### Rationale for this change This is an incremental first step towards apache#30559 ### What changes are included in this PR? Introduce `class _Table` in `table.pxi`. ### Are these changes tested? Existing pytests will check for regression. ### Are there any user-facing changes? No * Closes: apache#34979 Authored-by: Dane Pitkin <dane@voltrondata.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
### Rationale for this change
This is an incremental first step towards #30559
### What changes are included in this PR?
Introduce `class _Table` in `table.pxi`.
### Are these changes tested?
Existing pytests will check for regression.
### Are there any user-facing changes?
No