Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9967: [Python] Add compute module docs + expose more option classes #8145

Closed
wants to merge 18 commits into from

Conversation

arw2019
Copy link
Contributor

@arw2019 arw2019 commented Sep 9, 2020

#8163 exposes pyarrow.compute kernels and generates their docstrings. This PR adds documentation for the module in the User Guide and the Python API reference.

@arw2019 arw2019 marked this pull request as draft September 9, 2020 04:22
@github-actions
Copy link

github-actions bot commented Sep 9, 2020

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@arw2019 arw2019 changed the title WIP: ARROW-7871: [Python] [DOC] Expose more compute kernels and add compute module docs ARROW-9967: [Python] Add compute module docs Sep 10, 2020
@github-actions
Copy link

@arw2019 arw2019 force-pushed the ARROW-7871 branch 2 times, most recently from edb57cc to 3f99ae8 Compare September 13, 2020 03:18
@arw2019 arw2019 marked this pull request as ready for review September 13, 2020 03:18
@arw2019 arw2019 changed the title ARROW-9967: [Python] Add compute module docs ARROW-9967: [Python] Add compute module documentation Sep 13, 2020
@@ -201,6 +201,11 @@ an ``Invalid`` :class:`Status` when overflow is detected.
+--------------------------+------------+--------------------+---------------------+
| subtract_checked | Binary | Numeric | Numeric |
+--------------------------+------------+--------------------+---------------------+
| divide | Binary | Numeric | Numeric |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that this was missing on the c++ page

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thank you! However, can you leave the table alphabetically-sorted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually my bad, you already had it there. Reverted

@jorisvandenbossche
Copy link
Member

@arw2019 thanks a lot for working on this! (and sorry for the overlap with #8163)

docs/source/python/api/compute.rst Outdated Show resolved Hide resolved

ascii_lower
ascii_upper
binary_length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not really a transform?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fair. Would Structural Transforms be a good place for it?

docs/source/python/compute.rst Show resolved Hide resolved
>>> a = pa.array([1, 1, 2, 3])
>>> pc.sum(a)
<pyarrow.Int64Scalar: 7>
>>> b = pa.array([4, 5, 8, 2])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this array is not used? (maybe show pc.equals(a, b) ?)

@arw2019 arw2019 force-pushed the ARROW-7871 branch 2 times, most recently from e79559c to d117ed1 Compare September 15, 2020 04:10
@@ -294,6 +294,46 @@ def mode(array):
return call_function("mode", [array])


def and_(x1, x2):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche @pitrou

AFAIK we can't have methods named and and or in Python. For now I went with and_ and or_ here, for consistency with and_kleene and or_kleene, but for example numpy uses np.logical_and and np.logical_or

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a function named "and":

>>> from pyarrow import compute
>>> getattr(compute, "and")
<function pyarrow.compute.and(left, right, *, memory_pool=None)>

So you should just fix compute.py so that it's exported as and_.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Fixed now

mean
min_max
sum
mode
Copy link
Member

@pitrou pitrou Sep 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this render the docstrings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. It's what's done on the other pages:
https://github.com/apache/arrow/blob/master/docs/source/python/api/datatypes.rst

I'm having trouble building the docs locally (related to ARROW-10018 I think) so haven't been able to check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't render the docstring here, but it will create a separate page for each of those functions and link to that from this table.

(the separate pages might be a bit overkill since the docstrings here are often not that informative yet, but it's indeed how we do it for other functions as well)

@arw2019 arw2019 force-pushed the ARROW-7871 branch 5 times, most recently from c2445c1 to c1c5846 Compare September 16, 2020 20:26
@arw2019
Copy link
Contributor Author

arw2019 commented Sep 17, 2020

Question: in 952649d I'm attempting to expose PartitionNthOptions but it errors at compilation. What is (are) my mistake(s)?

This is needed to make the partition_nth_indices Python wrapper usable.

@pitrou
Copy link
Member

pitrou commented Sep 21, 2020

Cython generally doesn't convert implicitly between arbitrary Python objects and C/C++ types, you have to type your code explicitly. Also, it is better to use __cinit__ here so that the constructor never gets skipped, so for example (untested):

    def __cinit__(self, int64_t pivot):
        self.partition_nth_options.reset(new CPartitionNthOptions(pivot))

@arw2019
Copy link
Contributor Author

arw2019 commented Sep 21, 2020

Cython generally doesn't convert implicitly between arbitrary Python objects and C/C++ types, you have to type your code explicitly.

Thanks @pitrou! That was the problem

@arw2019 arw2019 changed the title ARROW-9967: [Python] Add compute module documentation ARROW-9967: [Python] Add compute module docs + expose more option classes Sep 22, 2020
@arw2019
Copy link
Contributor Author

arw2019 commented Sep 22, 2020

This is ready for re-review.

I believe that I've addressed the feedback from previous reviews. I've also now exposed all the option classes so that all the kernels listed in the docs are usable from Python.


.. seealso::

:ref:`C++ compute functions documentation <cpp-compute>`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Will this work? I'm not 100% sure as I haven't yet compiled the docs to explicitly check


cdef class VarianceOptions(FunctionOptions):
cdef:
unique_ptr[CVarianceOptions] variance_options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why unique_ptr is used in some of the other classes, but it shouldn't be necessary. See MinMaxOptions for an example without.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PartitionNthOptions we need it because it doesn't have a null constructor so it can't be allocated on the stack. That's not the case for VarianceOptions, though, so I'll take a look

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten VarianceOptions without unique_ptr.

I think that with the other three options classes exposed in this PR (SetLookupOptions, PartitionNthOptions and StrptimeOptions) it's necessary to use unique_ptr or, alternatively, we would need to define default constructors

@kszucs
Copy link
Member

kszucs commented Oct 7, 2020

@github-actions crossbow submit test-ubuntu-18.04-docs

@github-actions
Copy link

github-actions bot commented Oct 7, 2020

Revision: 73272bf

Submitted crossbow builds: ursa-labs/crossbow @ actions-615

Task Status
test-ubuntu-18.04-docs Azure

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Will merge after a couple changes. Thank you @arw2019 !

@pitrou
Copy link
Member

pitrou commented Oct 8, 2020

CI failures are unrelated.

@pitrou pitrou closed this in ba7ee65 Oct 8, 2020
@arw2019
Copy link
Contributor Author

arw2019 commented Oct 9, 2020

Thanks @pitrou @jorisvandenbossche for reviewing!

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…sses

apache#8163 exposes `pyarrow.compute` kernels and generates their docstrings. This PR adds documentation for the module in the User Guide and the Python API reference.

Closes apache#8145 from arw2019/ARROW-7871

Lead-authored-by: arw2019 <andrew.r.wieteska@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants