Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-7874: [Python][Archery] Validate docstrings with numpydoc #6420

Closed
wants to merge 16 commits into from

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Feb 13, 2020

No description provided.

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@jorisvandenbossche
Copy link
Member

Cool!

I wanted to try it out, but do you know if there is a way to only run this check with archery? Eg archery lint --with-numpydoc 1 is still running all other lint checks as well, so to just run numpydoc you would need to manually disable all others

@jorisvandenbossche
Copy link
Member

I think we should maybe start with a subset of the checks (an overview is here: https://github.com/numpy/numpydoc/blob/5e9a629c5882a32b5539345f807cd8fc8f0dc624/numpydoc/validate.py#L35).

Maybe things like all parameters documented, no wrong section names, are most important to start with.

@kszucs
Copy link
Member Author

kszucs commented Feb 17, 2020

@jorisvandenbossche SGTM

@kszucs
Copy link
Member Author

kszucs commented Feb 17, 2020

Yeah, I'm running it via

archery lint --with-clang-format false --with-cpplint false --with-iwyu false --with-cmake-format false --with-rat false --with-r false --with-rust false --with-docker false --with-numpydoc true

@kszucs kszucs changed the title [Python][Archery] Validate docstrings with numpydoc ARROW-7874: [Python][Archery] Validate docstrings with numpydoc Feb 18, 2020
@kszucs kszucs marked this pull request as ready for review February 18, 2020 13:54
@kszucs
Copy link
Member Author

kszucs commented Feb 18, 2020

@jorisvandenbossche I've added an standalone archery command for running it, because it requires pyarrow to be avilable for import during runtime. You can also pass a list of rules to ignore or include.

archery numpydoc --whitelist ... --blacklist ...
archery numpydoc -w RULE1 -w RULE2  # enable two rules
archery numpydoc -b RULE1 -b RULE2  # desallow two rules

It will help with iterating over the high priority violations.

@github-actions
Copy link

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

dev/archery/archery/cli.py Outdated Show resolved Hide resolved
dev/archery/archery/lang/python.py Outdated Show resolved Hide resolved
dev/archery/archery/utils/lint.py Show resolved Hide resolved
@kszucs
Copy link
Member Author

kszucs commented Feb 19, 2020

@jorisvandenbossche ready for review

@jorisvandenbossche
Copy link
Member

In the output, I see things like

pyarrow.lib.table
-> table(data, names=None, schema=None, metadata=None)
SS03: Summary does not end with a period
SS06: Summary should fit in a single line
PR02: Unknown parameters {'data', 'schema', 'metadata', 'names'}
SA04: Missing description for See Also "Table.from_arrays" reference
SA04: Missing description for See Also "Table.from_pandas" reference
SA04: Missing description for See Also "Table.from_pydict" reference

So it seems that for cython methods the parameters are not recognized? (I thought this worked in pandas, though, I might need to check)

@kszucs
Copy link
Member Author

kszucs commented Feb 19, 2020

Numpydoc looks for the signature, but:

>>> import pyarrow as pa
>>> import inspect
>>> inspect.signature(pa.table)
ValueError: no signature found for builtin <built-in function table>

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 19, 2020

Pandas example:

In [28]: inspect.signature(pd.Timestamp)   
Out[28]: <Signature (ts_input=<object object at 0x7faa0506d3a0>, freq=None, tz=None, unit=None, year=None, month=None, day=None, hour=None, minute=None, second=None, microsecond=None, nanosecond=None, tzinfo=None)>

where Timestamp is also defined in cython.
Could this be some cython option?

@jorisvandenbossche
Copy link
Member

Ah, but for a function, it also doesn't work in pandas (we don't have many functions from cython directly exposed publicly). And Timestamp is a class, not cdef class, which might be the difference

@kszucs
Copy link
Member Author

kszucs commented Feb 19, 2020

Yeah, cython doesn't set __text_signature__ for the cdef'd classes and functions, despite that embedsignature directive is set to True, and inspect.signature looks for that property.

@jorisvandenbossche
Copy link
Member

What I find surprising is that eg pa.table is actually not a cdef or cpdef, but just a def

@jorisvandenbossche
Copy link
Member

Anyway, I think this can be merged? Or should we wait a bit more for someone else to review?

@kszucs
Copy link
Member Author

kszucs commented Feb 19, 2020

@jorisvandenbossche I managed to add a workaround, so now we can parse the signature of pa.table as

<Signature (data, names=None, schema=None, metadata=None)>

It alseo required some additional gymnastics to get rid of the cython specific typehints, se we convert

array(obj, type=None, mask=None, size=None, from_pandas=None,
      bool safe=True, MemoryPool memory_pool=None)

To:

<Signature (obj, type=None, mask=None, size=None, from_pandas=None,
            safe=True, memory_pool=None)>

@kszucs
Copy link
Member Author

kszucs commented Feb 19, 2020

You can test it with archery numpydoc -w PR02

Copy link
Member Author

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, merging on green

@kszucs kszucs closed this in dc01a36 Feb 24, 2020
kszucs added a commit that referenced this pull request Mar 25, 2020
…pydoc

Depends on #6420.

Reduces the number of docstring violations from 1335 to 793 (fixes 542).

This is going to require more patches, but we need to start somewhere.

Closes #6444 from kszucs/docstrings

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants