ARROW-14738: [Python][Doc] Make return types clickable #11726

Closed
wants to merge 14 commits into from

Conversation

@amol- (Contributor) commented Nov 17, 2021

No description provided.

@github-actions

@jorisvandenbossche (Member)

I suppose you tried it locally?

@amol- (Contributor, Author) commented Nov 18, 2021

I suppose you tried it locally?

Yep, tried locally. With napoleon_use_rtype the resulting HTML is a little worse, because you end up with

Returns: The table with blah blah
Return Type: Table

instead of

Returns: Table - The table with blah blah

but the Table in "Return Type" becomes clickable.
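
(For context, a minimal conf.py sketch of the napoleon option being discussed; the exact Arrow configuration isn't shown in this thread, so treat the surrounding settings as assumptions:)

# docs/source/conf.py -- minimal sketch, assuming sphinx.ext.napoleon is in use
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
]
# Emit the return type as a separate "Return type" field, which Sphinx can
# turn into a cross-reference (clickable link), at the cost of splitting the
# "Returns" entry in two as described above.
napoleon_use_rtype = True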

@jorisvandenbossche (Member)

I was thinking we could maybe also use numpydoc for our docstring rendering. They also support making the return type into a clickable link, but generalize this to parameter types as well (and it doesn't result in the duplicate "Returns" issue you mentioned above).
See eg https://numpydoc.readthedocs.io/en/latest/example.html#module-example
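
(For reference, a minimal sketch of what enabling numpydoc's type cross-referencing could look like in conf.py; the option names are real numpydoc settings, but the exact changes made in this PR live in its diff, so this is only an illustration:)

# docs/source/conf.py -- sketch, assuming numpydoc replaces sphinx.ext.napoleon
extensions = [
    "sphinx.ext.autodoc",
    "numpydoc",
]
# Turn parameter and return types in docstrings into cross-reference links.
numpydoc_xref_param_type = True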

@amol- (Contributor, Author) commented Nov 25, 2021

I was thinking we could maybe also use numpydoc for our docstring rendering. They also support making the return type into a clickable link, but generalize this to parameter types as well (and it doesn't result in the duplicate "Returns" issue you mentioned above). See eg https://numpydoc.readthedocs.io/en/latest/example.html#module-example

I did a quick test and it seems to work.
The only thing I'm not confident about is that the docs seem to have been moved from numpydoc to napoleon in the past ( https://github.com/apache/arrow/pull/208/files ) to fix some build issues. @xhochy might know more about what led to that change and whether there is any risk in going back to numpydoc

@jorisvandenbossche (Member)

I think it should be fine to switch to numpydoc in general (for example, pandas, numpy and scikit-learn are all using it). There might be some pyarrow-specific things (since there are slight differences between the two), but I assume those would be solvable.

@xhochy (Member) commented Nov 25, 2021

@xhochy might know more about what led to that change and whether there is any risk in going back to numpydoc

No, I don't remember anymore why I did that :(

@jorisvandenbossche (Member)

Is this ready, or still draft?

@amol- (Contributor, Author) commented Dec 9, 2021

Is this ready, or still draft?

It was a draft as it started as an experiment, but locally all the parts of the API reference I could check seem to render correctly, so the transition to numpydoc seems reasonable. There was a CI failure when building the docs that I think I have addressed in 8365150, so I think we can consider this one ready to go.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review December 9, 2021 16:41
@jorisvandenbossche (Member)

Can you add numpydoc to conda_env_sphinx.txt?

Looking at the doc build on CI, there are some warnings:

/arrow/docs/source/python/api/filesystems.rst.rst:38: WARNING: autosummary: stub file not found 'HadoopFileSystem'. Check your autosummary_generate setting.
/arrow/docs/source/docstring of pyarrow.Array.rst:33: WARNING: autosummary: stub file not found 'pyarrow.Array.buffers'. Check your autosummary_generate setting.
/arrow/docs/source/docstring of pyarrow.Array.rst:33: WARNING: autosummary: stub file not found 'pyarrow.Array.cast'. Check your autosummary_generate setting.
/arrow/docs/source/docstring of pyarrow.Array.rst:33: WARNING: autosummary: stub file not found 'pyarrow.Array.dictionary_encode'. Check your autosummary_generate setting.
...

(I can also try to take a look at this next week)

Another set look like:

/opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/parquet.py:docstring of pyarrow.parquet.write_table:120: WARNING: undefined label: python:bltin-boolean-values (if the link has no caption the label must precede a section header)

we might need to add "bool" (and some other words) to the list of things that should not be linked (see numpydoc_xref_ignore option at https://numpydoc.readthedocs.io/en/latest/install.html#configuration)

@amol- (Contributor, Author) commented Dec 13, 2021

I was able to resolve the bool and numpy:array_like warnings by adding intersphinx references to the Python and NumPy documentation. I also fixed errors related to Example and Warning sections, as all section names must be plural for numpydoc.

I'll have to investigate the missing stub files
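
(A sketch of the intersphinx mappings described above; the exact inventory URLs used in the PR are an assumption:)

# docs/source/conf.py -- sketch; requires "sphinx.ext.intersphinx" in extensions
intersphinx_mapping = {
    # lets references such as bool resolve to the Python documentation
    "python": ("https://docs.python.org/3", None),
    # lets references such as array_like resolve to the NumPy documentation
    "numpy": ("https://numpy.org/doc/stable/", None),
}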

@amol- (Contributor, Author) commented Dec 13, 2021

Most warnings seem to have been fixed. I still see some errors in cuda.rst, though I'm not sure why they are happening, given that the file seems to correctly specify a currentmodule:

WARNING: don't know which module to import for autodocumenting 'BufferReader' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'BufferWriter' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'Context' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'CudaBuffer' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'HostBuffer' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'IpcMemHandle' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'new_host_buffer' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'read_message' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'read_record_batch' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)
WARNING: don't know which module to import for autodocumenting 'serialize_record_batch' (try placing a "module" or "currentmodule" directive in the document, or giving an explicit module name)

@amol- (Contributor, Author) commented Dec 13, 2021

Most warnings seem to have been fixed. I still see some errors in cuda.rst, though I'm not sure why they are happening, given that the file seems to correctly specify a currentmodule:

It seems it might be related to pyarrow.cuda being mocked when CUDA is not enabled ( -DPYARROW_BUILD_CUDA=off ), but this doesn't happen locally for me, even though I'm building without CUDA too.
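
(One possible place to look, assuming the mocking happens via Sphinx's autodoc_mock_imports; that is an assumption, the thread doesn't say how pyarrow.cuda is mocked:)

# docs/source/conf.py -- hypothetical: mock modules unavailable at doc-build time
autodoc_mock_imports = ["pyarrow.cuda"]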

@wjones127 (Member) commented Dec 14, 2021

Took a look at this locally. For others' benefit, here's how this changes the RecordBatch API reference page:

Before
[before screenshot]

After
[after screenshot]

Looks pretty nice!

Although I don't love the extra-heavy font weight on the parameter names. Would you be willing to add this to the CSS at docs/source/_static/theme_overrides.css?

b, strong {
  font-weight: bold;
}

That would make the above page look like:

Fixed
[screenshot: Screen Shot 2021-12-14 at 1 39 23 PM]

  'IPython.sphinxext.ipython_directive',
  'IPython.sphinxext.ipython_console_highlighting',
  'breathe',
- 'sphinx_tabs.tabs'
+ 'sphinx_tabs.tabs',
+ 'sphinx.ext.intersphinx'
Member

Hmm, can we order this list alphabetically?

Contributor Author

👍 done

@@ -1234,8 +1234,8 @@ cdef class Table(_PandasConvertible):
     """
     A collection of top-level named, equal length Arrow arrays.

-    Warning
-    -------
+    Warnings
+    --------
Member

There is a single warning here; is the plural intended?

Contributor Author

Yes, it's intended, because numpydoc only recognises "Warnings" as a valid section header, not "Warning".
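
(A minimal sketch of a numpydoc-style docstring with the plural header; illustrative only, not the actual Table docstring:)

def example():
    """Short summary.

    Warnings
    --------
    numpydoc only recognises the plural "Warnings" section header;
    a singular "Warning" section is not part of its format.
    """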

@jorisvandenbossche (Member)

https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.mean.html

Although I don't love the extra-heavy font weight on the parameter names. Would you be willing to add this to the CSS at docs/source/_static/theme_overrides.css?

That's indeed odd (and I agree it would be good to add that small CSS snippet to make this look better).
It might be worth reporting this to the upstream theme, though. The strange thing is that I don't see the same issue in e.g. the pandas docs (e.g. https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean)

@amol- (Contributor, Author) commented Dec 20, 2021

https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.mean.html

Although I don't love the extra-heavy font weight on the parameter names. Would you be willing to add this to the CSS at docs/source/_static/theme_overrides.css?

That's indeed odd (and I agree it would be good to add that small CSS snippet to make this look better). It might be worth reporting this to the upstream theme, though. The strange thing is that I don't see the same issue in e.g. the pandas docs (e.g. https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean)

Actually it's there for the pandas docs too

[pandas.mov video attachment]

@jorisvandenbossche (Member)

What's the video showing exactly? I see it changing back and forth between bold and very bold. But how is that triggered?

@wjones127 (Member)

@jorisvandenbossche @amol- Let's continue the font discussion in the theme repo. I don't want to hold up this PR on that.

@jorisvandenbossche (Member)

@wjones127 thanks for opening the issue!

I tried this out locally, and I am seeing some other strange formatting artifacts (but related to what sphinx / numpydoc output). For example on the Table page (https://arrow.apache.org/docs/python/generated/pyarrow.Table.html in the online docs):

[screenshot]

For the first parameter it is done correctly, but for the second parameter the other words which are not auto-linked are formatted as code instead of normal text, for some reason.
That might need some digging into numpydoc to understand where this comes from (I might have time for that later)

@amol- (Contributor, Author) commented Jan 3, 2022

Given that further formatting discussions are probably expected to happen in the theme issue ( pydata/pydata-sphinx-theme#527 ), should we ship this to move the documentation to numpydoc in preparation for 7.0.0?

@jorisvandenbossche (Member) commented Jan 11, 2022

Personally, I find #11726 (comment) a somewhat annoying issue to ship as is (the strange formatting of the type list).

I looked a bit into what numpydoc is doing here, and it seems that it is not very smart in how it tries to create links. Basically every word in the type section of the docstring gets transformed into a reference, and thus if sphinx finds nothing linkable, it gets rendered as code. That's the reason you get the "of" in "list of Array" rendered as code.

Now, for a solution for this, I currently see two options:

Option 1 is to use the numpydoc feature numpydoc_xref_ignore (specified in conf.py) to list a set of words that should be ignored and not transformed in a reference. Having checked the Table page (https://arrow.apache.org/docs/python/generated/pyarrow.Table.html, the screenshot above is from that page), this would already give something like:

numpydoc_xref_ignore = {
    "optional", "default", "None", "True", "False", "or", "of",
    "iterator", "function", "object",
    # TODO those could be removed if we rewrite the docstring a bit
    "values", "coercible", "to", "arrays",
}

Option 2 could be to use some custom CSS to let the code in the type explanation look like normal text. That would be something like:

span.classifier code.xref span.pre {
  color: rgba(var(--pst-color-text-base),1);
  font-family: var(--pst-font-family-base);
  font-size: 1rem;
}

This might actually be the "easier" solution, but it's a bit more of a hack (while the other uses an actual numpydoc option), and it is a bit less robust (e.g. if the HTML structure generated by sphinx changes).

@amol- (Contributor, Author) commented Jan 11, 2022

@jorisvandenbossche
Looking at numpydoc there are a few things that come to my mind:

  1. We should avoid the form list of int, because of is nowhere recognised for containers; containers are dealt with in the list[int] form. We should probably migrate all cases to that form, because it solves some misunderstandings from numpydoc
  2. numpydoc is somewhat buggy: or is only detected as a literal in some cases; in other cases it is parsed as a reference and becomes obj:`or`.
  3. For some words (function, Mapping, iterator) it's correct that they are identified as generic references, as they are in practice interfaces/protocols, so they might not have a reference in the Python docs but we want to identify them as code anyway.

My suggestion would be to apply (1) and solve (2) using numpydoc_xref_ignore = {"or", "and", "of", ",", "default", "optional"}. For bool, True, False, object I think it would be wrong to place them in ignore as they are currently correctly detected and linked to the Python documentation.

That solves the majority of cases; a few instances of problems still remain, like
[Screenshot 2022-01-11 at 15 56 56]
where "depending on" still remains as code,
or
[Screenshot 2022-01-11 at 15 57 09]
where the whole parameter is misinterpreted for unknown reasons.

@jorisvandenbossche (Member)

  1. We should avoid the form list of int, because of is nowhere recognised for containers; containers are dealt with in the list[int] form. We should probably migrate all cases to that form, because it solves some misunderstandings from numpydoc

Personally I am not in favour of using type annotation syntax in the docstrings (if we want type annotations, we can actually use type annotations in the signature). My stance is that docstrings should be meant as human-readable text, and I think only a minority of Python users is familiar with typing syntax.
(I know list[int] is of course readable, and probably relatively easy to understand even for someone who doesn't know type annotation syntax, but you quickly get into more complicated stuff with unions etc. if you want to do this consistently)

If the "of" is the problem here, having that in the numpydoc_xref_ignore list (as you already did) solves it as well.

2. numpydoc is somewhat buggy: or is only detected as a literal in some cases; in other cases it is parsed as a reference and becomes obj:`or`.

Yes, I noticed that as well. I started looking into why, but didn't look further as just including in the ignore list also fixes it.

3. For some words (function, Mapping, iterator) it's correct that they are identified as generic references, as they are in practice interfaces/protocols, so they might not have a reference in the Python docs but we want to identify them as code anyway.

I don't think, for example, "function" should be identified/formatted as code? While it's indeed a generic Python term, it's not an actual Python code keyword or builtin or .. (i.e. "if you type it in a console, you get an error"), so I think we should see that as prose text?

For bool, True, False, object I think it would be wrong to place them in ignore as they are currently correctly detected and linked to the Python documentation.

The reason that I included None, True and False in the ignore list in my comment above is that we very often have things like "param : type, default None" or "..., default True". I don't see much added value in linking the None/True/False to the Python docs each time, while this linking does add visual "noise" (by changing a prose text "default None" into something with two colors, because one word is a link and the other is not).

(Now, there is a bug in numpydoc where listing those 3 values in the ignore list doesn't have any effect, because the "ignore" list has no priority over some default values they link (which you can supplement with numpydoc_xref_aliases), so this discussion item is a bit moot for now.)

[about "use_threads : bool, default True"] where the whole parameter is misinterpreted for unknown reasons.

That's because there is a missing space in our code:

use_threads: bool, default True
    Whether to parallelize the conversion using multiple threads.
deduplicate_objects : bool, default False

numpydoc is sensitive to the delimiter actually being " : " with space before/after the colon. Normally the validation should catch that (#7732), but I suppose this rule is not yet activated in the checks.
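
(To illustrate the delimiter numpydoc expects, a hypothetical docstring sketch, not the exact pyarrow source:)

def to_pandas(use_threads=True, deduplicate_objects=False):
    """Convert to a pandas object (illustrative only).

    Parameters
    ----------
    use_threads : bool, default True
        Note the " : " with a space on both sides of the colon; without the
        leading space numpydoc treats the whole line as a description.
    deduplicate_objects : bool, default False
        The description of the parameter goes on the indented line below it.
    """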

@wjones127 (Member)

My stance is that docstrings should be meant as human readable text, and I think only a minority of python users is familiar with typing syntax.

Well, that's sort of the problem here; numpydoc's having a hard time reading the human-readable text 😉 That being said, I think it's totally fine in the case of list of; that's something this was explicitly designed for, according to this test: https://github.com/numpy/numpydoc/blob/main/numpydoc/tests/test_xref.py#L44

While list of is pretty standard, there are some more deviant descriptions that might need to be changed. For example, the pandas.Series or pandas.Dataframe depending on type of object should probably be changed to just pandas.Series or pandas.Dataframe, and then in the return description we should add Returns series if ___ and DataFrame otherwise.

@jorisvandenbossche (Member)

Well, that's sort of the problem here; numpydoc's having a hard time reading the human-readable text 😉

But I didn't mean any random "free-form" text :). But yes, I agree with you that most cases that don't fall into these "type1 or type2" or "type1 of type2" cases should probably be rewritten anyway, and moved into the description field.
That's also the reason why I put the "values coercible to arrays" above with a TODO comment. That is needed in the ignore list for formatting the current docstrings as is, but I think we should rather use a fixed set of terms like "array-like" for this, and if needed have a more free-form explanation of what that means in the parameter description on the next line.

So fully agreed with:

For example, the pandas.Series or pandas.Dataframe depending on type of object, should probably be changed to just pandas.Series or pandas.Dataframe, and then in the return description we should add Returns series if ___ and DataFrame otherwise.

Note that the test you linked also uses some basic items in an ignore list: https://github.com/numpy/numpydoc/blob/e9384ce346359cbec556454ae69c1af44d6a9017/numpydoc/tests/test_xref.py#L200 (so that's something we in any case want to copy)

@amol- (Contributor, Author) commented Jan 12, 2022

Well, that's sort of the problem here; numpydoc's having a hard time reading the human-readable text 😉 That being said, I think it's totally fine in the case of list of; that's something this was explicitly designed for, according to this test: https://github.com/numpy/numpydoc/blob/main/numpydoc/tests/test_xref.py#L44

Well, list of is frequently interpreted in a decent way, but the code seems to be written to mostly support containers in the form list[str], dict[str, int] and so on by default...
See https://github.com/numpy/numpydoc/blob/main/numpydoc/xref.py#L26-L30; of is not even listed as a divisor of text blocks ( https://github.com/numpy/numpydoc/blob/main/numpydoc/xref.py#L54-L57 ), and I think that in the end what's going on is that it splits on ' ' and by chance interprets the part before the space as one type and the part after the space as another type.

That's also the reason why I put the "values coercible to arrays" above with a TODO comment. That is needed in the ignore list for formatting the current docstrings as is, but I think we should rather use a fixed set of terms like "array-like" for this, and if needed have a more free-form explanation of what that means in the parameter description on the next line.

Note that array-like is a recognised expression in the numpy docs, but it means numpy.array. We might want to add a definition of arrow-array to have a similar thing for Arrow arrays.

ci/scripts/python_build.sh (outdated review thread, resolved)
@kszucs (Member) commented Jan 18, 2022

@jorisvandenbossche do we want to include this in 7.0? Personally I'd like to :)

@jorisvandenbossche (Member)

@amol- I pushed a commit with some changes:

  • A bunch of general docstring fixes that will ensure it gets rendered better with this PR
  • Some more aliases (it would maybe be a good feature request for numpydoc that you can specify a "namespace" to look into as well, eg to always try looking in the "pyarrow" namespace as well. For a similar effect, I added some aliases for often-used objects like RecordBatch and Table)
  • Extra ignored words. I know you were not really a fan of adding more, but I also added a TODO comment to many of them, and we can fix those in follow-up PRs. But if we want to get this merged for 7.0, I would prefer to keep those ignores short-term to avoid incorrect rendering for those.
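
(For reference, a sketch of what the aliases for often-used objects mentioned in the second bullet could look like in conf.py; the actual entries in the pushed commit aren't reproduced here, so these are assumptions:)

# docs/source/conf.py -- hypothetical aliases so bare names resolve to pyarrow objects
numpydoc_xref_aliases = {
    "Table": "pyarrow.Table",
    "RecordBatch": "pyarrow.RecordBatch",
    "Schema": "pyarrow.Schema",
}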

@amol- (Contributor, Author) commented Jan 19, 2022

@amol- I pushed a commit with some changes:

  • A bunch of general docstring fixes that will ensure it gets rendered better with this PR
  • Some more aliases (it would maybe be a good feature request for numpydoc that you can specify a "namespace" to look into as well, eg to always try looking in the "pyarrow" namespace as well. For a similar effect, I added some aliases for often-used objects like RecordBatch and Table)
  • Extra ignored words. I know you were not really a fan of adding more, but I also added a TODO comment to many of them, and we can fix those in follow-up PRs. But if we want to get this merged for 7.0, I would prefer to keep those ignores short-term to avoid incorrect rendering for those.

Fine for me, we can always iterate.

@kszucs (Member) left a comment

Thanks @amol-, @jorisvandenbossche! Merging on green.

@kszucs kszucs closed this in e9e16c9 Jan 19, 2022
@ursabot commented Jan 19, 2022

Benchmark runs are scheduled for baseline = fd580db and contender = e9e16c9. e9e16c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.13% ⬆️0.09%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
