ARROW-14738: [Python][Doc] Make return types clickable #11726

Closed
wants to merge 14 commits into from

Conversation

@amol- (Contributor) commented Nov 17, 2021

No description provided.

@github-actions

@jorisvandenbossche (Member)

I suppose you tried it locally?

@amol- (Contributor, Author) commented Nov 18, 2021

I suppose you tried it locally?

Yep, tried locally. With napoleon_use_rtype the resulting HTML is a little worse, because you end up with

Returns: The table with blah blah
Return Type: Table

instead of

Returns: Table - The table with blah blah

but the Table in "Return Type" becomes clickable.
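
(For context, a minimal conf.py sketch of the napoleon option being discussed; the exact Arrow configuration isn't shown in this thread, so treat the surrounding settings as assumptions:)

# docs/source/conf.py -- minimal sketch, assuming sphinx.ext.napoleon is in use
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
]
# Emit the return type as a separate "Return type" field, which Sphinx can
# turn into a cross-reference (clickable link), at the cost of splitting the
# "Returns" entry in two as described above.
napoleon_use_rtype = True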

@jorisvandenbossche (Member)

I was thinking we could maybe also use numpydoc for our docstring rendering. They also support making the return type into a clickable link, but generalize this to parameter types as well (and it doesn't result in the duplicate "Returns" issue you mentioned above).
See eg https://numpydoc.readthedocs.io/en/latest/example.html#module-example
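
(For reference, a minimal sketch of what enabling numpydoc's type cross-referencing could look like in conf.py; the option names are real numpydoc settings, but the exact changes made in this PR live in its diff, so this is only an illustration:)

# docs/source/conf.py -- sketch, assuming numpydoc replaces sphinx.ext.napoleon
extensions = [
    "sphinx.ext.autodoc",
    "numpydoc",
]
# Turn parameter and return types in docstrings into cross-reference links.
numpydoc_xref_param_type = True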

@amol- (Contributor, Author) commented Nov 25, 2021

I was thinking we could maybe also use numpydoc for our docstring rendering. They also support making the return type into a clickable link, but generalize this to parameter types as well (and it doesn't result in the duplicate "Returns" issue you mentioned above). See eg https://numpydoc.readthedocs.io/en/latest/example.html#module-example

I did a quick test and it seems to work.
The only thing I'm not confident about is that the docs seem to have been moved from numpydoc to napoleon in the past ( https://github.com/apache/arrow/pull/208/files ) to fix some build issues. @xhochy might know more about what led to that change and whether there is any risk in going back to numpydoc

@jorisvandenbossche (Member)

I think it should be fine to switch to numpydoc in general (for example, pandas, numpy and scikit-learn are all using it). There might be some pyarrow-specific things (since there are slight differences between the two), but I assume those would be solvable.

@xhochy (Member) commented Nov 25, 2021

@xhochy might know more about what led to that change and whether there is any risk in going back to numpydoc

No, I don't remember anymore why I did that :(

@jorisvandenbossche (Member)

Is this ready, or still draft?

@amol- (Contributor, Author) commented Dec 9, 2021

Is this ready, or still draft?

It was a draft as it started as an experiment, but locally all the parts of the API reference I could check seem to render correctly, so the transition to numpydoc seems reasonable. There was a CI failure when building the docs that I think I have addressed in 8365150, so I think we can consider this one ready to go.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review December 9, 2021 16:41
@jorisvandenbossche (Member)

Can you add numpydoc to conda_env_sphinx.txt?

Looking at the doc build on CI, there are some warnings:

/arrow/docs/source/python/api/filesystems.rst.rst:38: WARNING: autosummary: stub file not found 'HadoopFileSystem'. Check your autosummary_generate setting.
/arrow/docs/source/docstring of pyarrow.Array.rst:33: WARNING: autosummary: stub file not found 'pyarrow.Array.buffers'. Check your autosummary_generate setting.
/arrow/docs/source/docstring of pyarrow.Array.rst:33: WARNING: autosummary: stub file not found 'pyarrow.Array.cast'. Check your autosummary_generate setting.
/arrow/docs/source/docstring of pyarrow.Array.rst:33: WARNING: autosummary: stub file not found 'pyarrow.Array.dictionary_encode'. Check your autosummary_generate setting.
...

(I can also try to take a look at this next week)

Another set look like:

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/parquet.py:docstring of pyarrow.parquet.write_table:120: WARNING: undefined label: python:bltin-boolean-values (if the link has no caption the label must precede a section header)

we might need to add "bool" (and some other words) to the list of things that should not be linked (see numpydoc_xref_ignore option at https://numpydoc.readthedocs.io/en/latest/install.html#configuration)

@amol- (Contributor, Author) commented Dec 13, 2021

I was able to resolve the bool and numpy:array_like warnings by adding intersphinx references to the Python and NumPy documentation. I also fixed errors related to Example and Warning sections, as all section names must be plural for numpydoc.

I'll have to investigate the missing stub files
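
(A sketch of the intersphinx mappings described above; the exact inventory URLs used in the PR are an assumption:)

# docs/source/conf.py -- sketch; requires "sphinx.ext.intersphinx" in extensions
intersphinx_mapping = {
    # lets references such as bool resolve to the Python documentation
    "python": ("https://docs.python.org/3", None),
    # lets references such as array_like resolve to the NumPy documentation
    "numpy": ("https://numpy.org/doc/stable/", None),
}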

@amol- (Contributor, Author) commented Dec 13, 2021

Most warnings seem to have been fixed. I still see some errors in cuda.rst, though I'm not sure why they are happening, given that the file seems to correctly specify a currentmodule:

WARNING: don't know which module to import for autodocumenting 'BufferReader' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'BufferWriter' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'Context' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'CudaBuffer' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'HostBuffer' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'IpcMemHandle' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'new_host_buffer' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'read_message' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'read_record_batch' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'serialize_record_batch' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)

@amol- (Contributor, Author) commented Dec 13, 2021

Most warnings seem to have been fixed. I still see some errors in cuda.rst, though I'm not sure why they are happening, given that the file seems to correctly specify a currentmodule:

It seems it might be related to pyarrow.cuda being mocked when CUDA is not enabled ( -DPYARROW_BUILD_CUDA=off ), but this doesn't happen locally for me, even though I'm building without CUDA too.
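
(One possible place to look, assuming the mocking happens via Sphinx's autodoc_mock_imports; that is an assumption, the thread doesn't say how pyarrow.cuda is mocked:)

# docs/source/conf.py -- hypothetical: mock modules unavailable at doc-build time
autodoc_mock_imports = ["pyarrow.cuda"]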

@wjones127 (Member) commented Dec 14, 2021

Took a look at this locally. For others' benefit, here's how this changes the RecordBatch API reference page:

Before
[before screenshot]

After
[after screenshot]

Looks pretty nice!

Although I don't love the extra-heavy font weight on the parameter names. Would you be willing to add this to the CSS at docs/source/_static/theme_overrides.css?

b, strong {
  font-weight: bold;
}

That would make the above page look like:

Fixed
[screenshot: Screen Shot 2021-12-14 at 1 39 23 PM]

  'IPython.sphinxext.ipython_directive',
  'IPython.sphinxext.ipython_console_highlighting',
  'breathe',
- 'sphinx_tabs.tabs'
+ 'sphinx_tabs.tabs',
+ 'sphinx.ext.intersphinx'
Member

Hmm, can we order this list alphabetically?

Contributor Author

👍 done

@@ -1234,8 +1234,8 @@ cdef class Table(_PandasConvertible):
     """
     A collection of top-level named, equal length Arrow arrays.

-    Warning
-    -------
+    Warnings
+    --------
Member

There is a single warning here; is the plural intended?

Contributor Author

Yes, it's intended, because numpydoc only recognises "Warnings" as a valid section header, not "Warning".
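
(A minimal sketch of a numpydoc-style docstring with the plural header; illustrative only, not the actual Table docstring:)

def example():
    """Short summary.

    Warnings
    --------
    numpydoc only recognises the plural "Warnings" section header;
    a singular "Warning" section is not part of its format.
    """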

@jorisvandenbossche (Member)

https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.mean.html

Although I don't love the extra-heavy font weight on the parameter names. Would you be willing to add this to the CSS at docs/source/_static/theme_overrides.css?

That's indeed odd (and I agree it would be good to add that small CSS snippet to make this look better).
It might be worth reporting this to the upstream theme, though. The strange thing is that I don't see the same issue in e.g. the pandas docs (e.g. https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean)

@amol- (Contributor, Author) commented Dec 20, 2021

https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.mean.html

Although I don't love the extra-heavy font weight on the parameter names. Would you be willing to add this to the CSS at docs/source/_static/theme_overrides.css?

That's indeed odd (and I agree it would be good to add that small CSS snippet to make this look better). It might be worth reporting this to the upstream theme, though. The strange thing is that I don't see the same issue in e.g. the pandas docs (e.g. https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean)

Actually it's there for the pandas docs too

[pandas.mov video attachment]

@jorisvandenbossche (Member)

What's the video showing exactly? I see it changing back and forth between bold and very bold. But how is that triggered?

@wjones127 (Member)

@jorisvandenbossche @amol- Let's continue the font discussion in the theme repo. I don't want to hold up this PR on that.

@jorisvandenbossche (Member)

@wjones127 thanks for opening the issue!

I tried this out locally, and I am seeing some other strange formatting artifacts (but related to what sphinx / numpydoc output). For example on the Table page (https://arrow.apache.org/docs/python/generated/pyarrow.Table.html in the online docs):

[screenshot]

For the first parameter it is done correctly, but for the second parameter the other words which are not auto-linked are formatted as code instead of normal text, for some reason.
That might need some digging into numpydoc to understand where this comes from (I might have time for that later)

@amol- (Contributor, Author) commented Jan 3, 2022

Given that further formatting discussions are probably expected to happen in the theme issue ( pydata/pydata-sphinx-theme#527 ), should we ship this to move the documentation to numpydoc in preparation for 7.0.0?

@jorisvandenbossche (Member) commented Jan 11, 2022

Personally, I find #11726 (comment) a somewhat annoying issue to ship as is (the strange formatting of the type list).

I looked a bit into what numpydoc is doing here, and it seems that it is not very smart in how it tries to create links. Basically every word in the type section of the docstring gets transformed into a reference, and thus if sphinx finds nothing linkable, it gets rendered as code. That's the reason you get the "of" in "list of Array" rendered as code.

Now, for a solution for this, I currently see two options:

Option 1 is to use the numpydoc feature numpydoc_xref_ignore (specified in conf.py) to list a set of words that should be ignored and not transformed in a reference. Having checked the Table page (https://arrow.apache.org/docs/python/generated/pyarrow.Table.html, the screenshot above is from that page), this would already give something like:

numpydoc_xref_ignore = {
    "optional", "default", "None", "True", "False", "or", "of",
    "iterator", "function", "object",
    # TODO those could be removed if we rewrite the docstring a bit
    "values", "coercible", "to", "arrays",
}

Option 2 could be to use some custom CSS to let the code in the type explanation look like normal text. That would be something like:

span.classifier code.xref span.pre {
  color: rgba(var(--pst-color-text-base),1);
  font-family: var(--pst-font-family-base);
  font-size: 1rem;
}

This might actually be the "easier" solution, but it's a bit more of a hack (while the other uses an actual numpydoc option), and it is a bit less robust (e.g. if the HTML structure generated by sphinx changes).

@amol- (Contributor, Author) commented Jan 11, 2022

@jorisvandenbossche
Looking at numpydoc there are a few things that come to my mind:

  1. We should avoid the form list of int, because of is nowhere recognised for containers; containers are dealt with in the list[int] form. We should probably migrate all cases to that form, because it solves some misunderstandings from numpydoc
  2. numpydoc is somewhat buggy: or is only detected as a literal in some cases; in other cases it is parsed as a reference and becomes obj:`or`.
  3. For some words (function, Mapping, iterator) it's correct that they are identified as generic references, as they are in practice interfaces/protocols, so they might not have a reference in the Python docs but we want to identify them as code anyway.

My suggestion would be to apply (1) and solve (2) using numpydoc_xref_ignore = {"or", "and", "of", ",", "default", "optional"}. For bool, True, False, object I think it would be wrong to place them in ignore as they are currently correctly detected and linked to the Python documentation.

That solves the majority of cases; a few instances of problems still remain, like
[Screenshot 2022-01-11 at 15 56 56]
where "depending on" still remains as code,
or
[Screenshot 2022-01-11 at 15 57 09]
where the whole parameter is misinterpreted for unknown reasons.

@jorisvandenbossche (Member)

  1. We should avoid the form list of int, because of is nowhere recognised for containers; containers are dealt with in the list[int] form. We should probably migrate all cases to that form, because it solves some misunderstandings from numpydoc

Personally I am not in favour of using type annotation syntax in the docstrings (if we want type annotations, we can actually use type annotations in the signature). My stance is that docstrings should be meant as human-readable text, and I think only a minority of Python users is familiar with typing syntax.
(I know list[int] is of course readable, and probably relatively easy to understand even for someone who doesn't know type annotation syntax, but you quickly get into more complicated stuff with unions etc. if you want to do this consistently)

If the "of" is the problem here, having that in the numpydoc_xref_ignore list (as you already did) solves it as well.

2. numpydoc is somewhat buggy: or is only detected as a literal in some cases; in other cases it is parsed as a reference and becomes obj:`or`.

Yes, I noticed that as well. I started looking into why, but didn't look further as just including in the ignore list also fixes it.

3. For some words (function, Mapping, iterator) it's correct that they are identified as generic references, as they are in practice interfaces/protocols, so they might not have a reference in the Python docs but we want to identify them as code anyway.

I don't think, for example, "function" should be identified/formatted as code? While it's indeed a generic Python term, it's not an actual Python code keyword or builtin or .. (i.e. "if you type it in a console, you get an error"), so I think we should see that as prose text?

For bool, True, False, object I think it would be wrong to place them in ignore as they are currently correctly detected and linked to the Python documentation.

The reason that I included None, True and False in the ignore list in my comment above is that we very often have things like "param : type, default None" or "..., default True". I don't see much added value in linking the None/True/False to the Python docs each time, while this linking does add visual "noise" (by changing a prose text "default None" into something with two colors, because one word is a link and the other is not).

(Now, there is a bug in numpydoc where listing those 3 values in the ignore list doesn't have any effect, because the "ignore" list has no priority over some default values they link (which you can supplement with numpydoc_xref_aliases), so this discussion item is a bit moot for now.)

[about "use_threads : bool, default True"] where the whole parameter is misinterpreted for unknown reasons.

That's because there is a missing space in our code:

use_threads: bool, default True
    Whether to parallelize the conversion using multiple threads.
deduplicate_objects : bool, default False

numpydoc is sensitive to the delimiter actually being " : " with space before/after the colon. Normally the validation should catch that (#7732), but I suppose this rule is not yet activated in the checks.
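
(To illustrate the delimiter numpydoc expects, a hypothetical docstring sketch, not the exact pyarrow source:)

def to_pandas(use_threads=True, deduplicate_objects=False):
    """Convert to a pandas object (illustrative only).

    Parameters
    ----------
    use_threads : bool, default True
        Note the " : " with a space on both sides of the colon; without the
        leading space numpydoc treats the whole line as a description.
    deduplicate_objects : bool, default False
        The description of the parameter goes on the indented line below it.
    """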

@wjones127 (Member)

My stance is that docstrings should be meant as human readable text, and I think only a minority of python users is familiar with typing syntax.

Well, that's sort of the problem here; numpydoc's having a hard time reading the human-readable text 😉 That being said, I think it's totally fine in the case of list of; that's something this was explicitly designed for, according to this test: https://github.com/numpy/numpydoc/blob/main/numpydoc/tests/test_xref.py#L44

While list of is pretty standard, there are some more deviant descriptions that might need to be changed. For example, the pandas.Series or pandas.Dataframe depending on type of object should probably be changed to just pandas.Series or pandas.Dataframe, and then in the return description we should add Returns series if ___ and DataFrame otherwise.

@jorisvandenbossche (Member)

Well, that's sort of the problem here; numpydoc's having a hard time reading the human-readable text 😉

But I didn't mean any random "free-form" text :). But yes, I agree with you that most cases that don't fall into these "type1 or type2" or "type1 of type2" cases should probably be rewritten anyway, and moved into the description field.
That's also the reason why I put the "values coercible to arrays" above with a TODO comment. That is needed in the ignore list for formatting the current docstrings as is, but I think we should rather use a fixed set of terms like "array-like" for this, and if needed have a more free-form explanation of what that means in the parameter description on the next line.

So fully agreed with:

For example, the pandas.Series or pandas.Dataframe depending on type of object, should probably be changed to just pandas.Series or pandas.Dataframe, and then in the return description we should add Returns series if ___ and DataFrame otherwise.

Note that the test you linked also uses some basic items in an ignore list: https://github.com/numpy/numpydoc/blob/e9384ce346359cbec556454ae69c1af44d6a9017/numpydoc/tests/test_xref.py#L200 (so that's something we in any case want to copy)

@amol- (Contributor, Author) commented Jan 12, 2022

Well, that's sort of the problem here; numpydoc's having a hard time reading the human-readable text 😉 That being said, I think it's totally fine in the case of list of; that's something this was explicitly designed for, according to this test: https://github.com/numpy/numpydoc/blob/main/numpydoc/tests/test_xref.py#L44

Well, list of is frequently interpreted in a decent way, but the code seems to be written to mostly support containers in the form list[str], dict[str, int] and so on by default...
See https://github.com/numpy/numpydoc/blob/main/numpydoc/xref.py#L26-L30; of is not even listed as a divisor of text blocks ( https://github.com/numpy/numpydoc/blob/main/numpydoc/xref.py#L54-L57 ), and I think that in the end what's going on is that it splits on ' ' and by chance interprets the part before the space as one type and the part after the space as another type.

That's also the reason why I put the "values coercible to arrays" above with a TODO comment. That is needed in the ignore list for formatting the current docstrings as is, but I think we should rather use a fixed set of terms like "array-like" for this, and if needed have a more free-form explanation of what that means in the parameter description on the next line.

Note that array-like is a recognised expression in the numpy docs, but it means numpy.array. We might want to add a definition of arrow-array to have a similar thing for Arrow arrays.

ci/scripts/python_build.sh (outdated review thread, resolved)
@kszucs (Member) commented Jan 18, 2022

@jorisvandenbossche do we want to include this in 7.0? Personally I'd like to :)

@jorisvandenbossche (Member)

@amol- I pushed a commit with some changes:

  • A bunch of general docstring fixes that will ensure it gets rendered better with this PR
  • Some more aliases (it would maybe be a good feature request for numpydoc that you can specify a "namespace" to look into as well, eg to always try looking in the "pyarrow" namespace as well. For a similar effect, I added some aliases for often-used objects like RecordBatch and Table)
  • Extra ignored words. I know you were not really a fan of adding more, but I also added a TODO comment to many of them, and we can fix those in follow-up PRs. But if we want to get this merged for 7.0, I would prefer to keep those ignores short-term to avoid incorrect rendering for those.
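
(For reference, a sketch of what the aliases for often-used objects mentioned in the second bullet could look like in conf.py; the actual entries in the pushed commit aren't reproduced here, so these are assumptions:)

# docs/source/conf.py -- hypothetical aliases so bare names resolve to pyarrow objects
numpydoc_xref_aliases = {
    "Table": "pyarrow.Table",
    "RecordBatch": "pyarrow.RecordBatch",
    "Schema": "pyarrow.Schema",
}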

@amol- (Contributor, Author) commented Jan 19, 2022

@amol- I pushed a commit with some changes:

  • A bunch of general docstring fixes that will ensure it gets rendered better with this PR
  • Some more aliases (it would maybe be a good feature request for numpydoc that you can specify a "namespace" to look into as well, eg to always try looking in the "pyarrow" namespace as well. For a similar effect, I added some aliases for often-used objects like RecordBatch and Table)
  • Extra ignored words. I know you were not really a fan of adding more, but I also added a TODO comment to many of them, and we can fix those in follow-up PRs. But if we want to get this merged for 7.0, I would prefer to keep those ignores short-term to avoid incorrect rendering for those.

Fine for me, we can always iterate.

@kszucs (Member) left a comment

Thanks @amol-, @jorisvandenbossche! Merging on green.

@kszucs kszucs closed this in e9e16c9 Jan 19, 2022
@ursabot commented Jan 19, 2022

Benchmark runs are scheduled for baseline = fd580db and contender = e9e16c9. e9e16c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.09%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
