Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34868: [Python] Share docstrings between classes #34894

Merged
merged 7 commits into from
Apr 24, 2023

Conversation

danepitkin
Copy link
Member

@danepitkin danepitkin commented Apr 4, 2023

Rationale for this change

Python classes sometimes duplicate docstrings, but change one word such as class name. Add a decorator function as a utility to help deduplicate docstring descriptions. Only works in Python. Does not work in Cython due to this CPython issue python/cpython#91309.

What changes are included in this PR?

Add a decorator @doc that can copy, concatenate, and/or format docstrings between classes.

Are these changes tested?

Tests added.

>>> import pyarrow
>>> from pyarrow.filesystem import FileSystem, LocalFileSystem, DaskFileSystem, S3FSWrapper
>>> from pyarrow.hdfs import HadoopFileSystem
>>> for fs in [FileSystem, LocalFileSystem, DaskFileSystem, S3FSWrapper, HadoopFileSystem]:
...     print(fs.__name__)
...     print(fs.isdir.__doc__)
... 
FileSystem

        Return True if path is a directory.

        Parameters
        ----------
        path : str
            Path to check.
        
LocalFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

DaskFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

S3FSWrapper

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

HadoopFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

Note that FileSystem.isdir.__doc__ is not dedented because it does not use the @doc decorator.

Are there any user-facing changes?

No

@danepitkin danepitkin requested a review from AlenkaF as a code owner April 4, 2023 20:02
@github-actions
Copy link

github-actions bot commented Apr 4, 2023

@github-actions
Copy link

github-actions bot commented Apr 4, 2023

⚠️ GitHub issue #34868 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good utility for writing docstrings 👍
Do we know if there are other places in pyarrow that would benefit from it?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 5, 2023
@danepitkin
Copy link
Member Author

danepitkin commented Apr 5, 2023

arrow/python/pyarrow/types.py could be a good option! This is the first one that stood out to me, but I haven't yet thoroughly reviewed all python code. Most of the benefit would be in Cython code unfortunately..

Do you think we should add those to this PR? I was thinking of doing it in separate PRs, but who knows when that will happen!

@jorisvandenbossche
Copy link
Member

Do we know if there are other places in pyarrow that would benefit from it?

I am not sure we have many places that currently are not in cython, and have repetition in docstrings. Looking through our python files, we do already have some manual templating in parquet/core.py, ipc.py or orc.py (eg to share the explanation of certain parameters between multiple docstrings), but I don't know if they would fit this pattern for using @doc.

pyarrow.filesystem is also deprecated, so that will be removed in a few releases (not that we can't improve the docs with this PR in the meantime). types.py indeed looks like the main obvious candidate where this could be useful.

@danepitkin danepitkin marked this pull request as draft April 6, 2023 17:35
@danepitkin
Copy link
Member Author

Updated types.py:

>>> import pyarrow
>>> from pyarrow.types import *
>>> print(is_null.__doc__)

Return True if value is an instance of type: null.

Parameters
----------
t : DataType

>>> print(is_boolean.__doc__)

Return True if value is an instance of type: boolean.

Parameters
----------
t : DataType

>>> print(is_map.__doc__)

Return True if value is an instance of type: logical map.

Parameters
----------
t : DataType

>>> print(is_unicode.__doc__)

Alias for is_string.

Parameters
----------
t : DataType

>>> print(is_large_unicode.__doc__)

Alias for is_large_string.

Parameters
----------
t : DataType

return t.id in _UNION_TYPES


@doc(is_null, datatype="nested")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we like nested or nested type?

Copy link
Member Author

@danepitkin danepitkin Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose nested type since it's describing where the type comes from.

return t.id == lib.Type_DATE64


@doc(is_null, datatype="logical map")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep "logical" in logical map?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying implementation is @GARROW_TYPE_MAP: A repeated struct logical type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove that. We don't have a clear definition of "logical" types in the arrow format spec, and the fact that a map type is basically a list of structs is an implementation detail (and adding "logical" here doesn't help making that clear)

return t.id in _DECIMAL_TYPES


@doc(is_null, datatype="decimal128")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added "128" to decimal128 here.

return t.id == lib.Type_DECIMAL128


@doc(is_null, datatype="decimal256")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added "256" to decimal256 here.

return t.id == lib.Type_INTERVAL_MONTH_DAY_NANO


@doc(is_null, datatype="primitive")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we like primitive or primitive type better?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose primitive type since it's describing a family of types.

@danepitkin danepitkin requested a review from AlenkaF April 7, 2023 13:26
@danepitkin
Copy link
Member Author

I applied @doc to the types.py!

@danepitkin
Copy link
Member Author

One interesting limitation is if there are docstrings that use curly braces {}. If it's a doctest, we would need to replace {} with an equivalent like dict() or set(). If it's part of the descriptions, it can be escaped e.g. {{}}

example:

        Examples
        --------
        >>> import pyarrow as pa
        >>> import pandas as pd
        >>> df = pd.DataFrame({'year': [2020, 2022, 2019, 2021],
        ...                    'n_legs': [2, 4, 5, 100],
        ...                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})

vs.

        Examples
        --------
        >>> import pyarrow as pa
        >>> import pandas as pd
        >>> df = pd.DataFrame(dict('year': [2020, 2022, 2019, 2021],
        ...                    'n_legs': [2, 4, 5, 100],
        ...                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]))

@danepitkin danepitkin marked this pull request as ready for review April 7, 2023 19:13
@AlenkaF
Copy link
Member

AlenkaF commented Apr 11, 2023

Thanks so much for the updates, the optimisation in the types.py looks great! 👏

One interesting limitation is if there are docstrings that use curly braces {}. If it's a doctest, we would need to replace {} with an equivalent like dict() or set(). If it's part of the descriptions, it can be escaped e.g. {{}}

Oh, this is not ideal.
What exactly happens in the case where the examples use curly braces? Does this happen often?

@danepitkin
Copy link
Member Author

danepitkin commented Apr 11, 2023

Thanks so much for the updates, the optimisation in the types.py looks great! 👏

One interesting limitation is if there are docstrings that use curly braces {}. If it's a doctest, we would need to replace {} with an equivalent like dict() or set(). If it's part of the descriptions, it can be escaped e.g. {{}}

Oh, this is not ideal. What exactly happens in the case where the examples use curly braces? Does this happen often?

If the example uses curly braces, pyarrow will throw a runtime error:

@doc(is_null, datatype="primitive type")
def is_primitive(t):
    """
    Example

    >>> pydict = {"cat": 0, "dog": 1}
    """
    return lib._is_primitive(t.id)
>>> import pyarrow as pa
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dane/code/arrow/python/pyarrow/__init__.py", line 288, in <module>
    import pyarrow.types as types
  File "/Users/dane/code/arrow/python/pyarrow/types.py", line 291, in <module>
    @doc(is_null, datatype="primitive type")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dane/code/arrow/python/pyarrow/util.py", line 76, in decorator
    params_applied = [
                     ^
  File "/Users/dane/code/arrow/python/pyarrow/util.py", line 77, in <listcomp>
    component.format(**params)
KeyError: '"cat"'

The fix would be to replace them like this:

    >>> pydict = dict(cat = 0, dog = 1)

This mostly shows up in Cython docstrings, which are not applicable at the moment!

@AlenkaF
Copy link
Member

AlenkaF commented Apr 12, 2023

OK. Well, I do not have much against that (using dict or set instead of {}).

return t.id == lib.Type_BOOL


@doc(is_null, datatype="integer")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@doc(is_null, datatype="integer")
@doc(is_null, datatype="any integer")

I would keep the "any" to clearly indicate this is for any integer (i.e. all bitwidths and signed/unsigned)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! Will update

return t.id == lib.Type_DATE64


@doc(is_null, datatype="logical map")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove that. We don't have a clear definition of "logical" types in the arrow format spec, and the fact that a map type is basically a list of structs is an implementation detail (and adding "logical" here doesn't help making that clear)

@github-actions github-actions bot added awaiting review Awaiting review awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Apr 13, 2023
@jorisvandenbossche
Copy link
Member

If it's a doctest, we would need to replace {} with an equivalent like dict() or set(). If it's part of the descriptions, it can be escaped e.g. {{}}

I think also in a doctest you can use the {{ .. }} escape approach?

(and to be clear, the current PR doesn't run into that, right? It's only for when we would want to use it more broadly for docstrings with examples?)

@danepitkin
Copy link
Member Author

If it's a doctest, we would need to replace {} with an equivalent like dict() or set(). If it's part of the descriptions, it can be escaped e.g. {{}}

I think also in a doctest you can use the {{ .. }} escape approach?

(and to be clear, the current PR doesn't run into that, right? It's only for when we would want to use it more broadly for docstrings with examples?)

Correct! There are no uses of {} in the docstrings of this PR. You can also use {{ .. }} for descriptions, but my guess is that doctests would fail if you tried to use it in the code examples. I noticed Pandas usage always uses dict() instead of {{ .. }} in code examples explicitly, but does use {{ .. }} in the descriptions.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Apr 14, 2023
@danepitkin
Copy link
Member Author

All feedback is applied!

@AlenkaF AlenkaF merged commit cd14e20 into apache:main Apr 24, 2023
@ursabot
Copy link

ursabot commented Apr 24, 2023

Benchmark runs are scheduled for baseline = f4bd43d and contender = cd14e20. cd14e20 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.75% ⬆️0.12%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] cd14e201 ec2-t3-xlarge-us-east-2
[Failed] cd14e201 test-mac-arm
[Finished] cd14e201 ursa-i9-9960x
[Finished] cd14e201 ursa-thinkcentre-m75q
[Finished] f4bd43d2 ec2-t3-xlarge-us-east-2
[Failed] f4bd43d2 test-mac-arm
[Finished] f4bd43d2 ursa-i9-9960x
[Finished] f4bd43d2 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
### Rationale for this change

Python classes sometimes duplicate docstrings, but change one word such as class name. Add a decorator function as a utility to help deduplicate docstring descriptions. Only works in Python. Does not work in Cython due to this CPython issue python/cpython#91309.

### What changes are included in this PR?

Add a decorator `@ doc` that can copy, concatenate, and/or format docstrings between classes.

### Are these changes tested?

Tests added. 

```
>>> import pyarrow
>>> from pyarrow.filesystem import FileSystem, LocalFileSystem, DaskFileSystem, S3FSWrapper
>>> from pyarrow.hdfs import HadoopFileSystem
>>> for fs in [FileSystem, LocalFileSystem, DaskFileSystem, S3FSWrapper, HadoopFileSystem]:
...     print(fs.__name__)
...     print(fs.isdir.__doc__)
... 
FileSystem

        Return True if path is a directory.

        Parameters
        ----------
        path : str
            Path to check.
        
LocalFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

DaskFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

S3FSWrapper

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

HadoopFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

```
Note that `FileSystem.isdir.__doc__` is not dedented because it does not use the `@ doc` decorator.

### Are there any user-facing changes?

No
* Closes: apache#34868

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Alenka Frim <frim.alenka@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
### Rationale for this change

Python classes sometimes duplicate docstrings, but change one word such as class name. Add a decorator function as a utility to help deduplicate docstring descriptions. Only works in Python. Does not work in Cython due to this CPython issue python/cpython#91309.

### What changes are included in this PR?

Add a decorator `@ doc` that can copy, concatenate, and/or format docstrings between classes.

### Are these changes tested?

Tests added. 

```
>>> import pyarrow
>>> from pyarrow.filesystem import FileSystem, LocalFileSystem, DaskFileSystem, S3FSWrapper
>>> from pyarrow.hdfs import HadoopFileSystem
>>> for fs in [FileSystem, LocalFileSystem, DaskFileSystem, S3FSWrapper, HadoopFileSystem]:
...     print(fs.__name__)
...     print(fs.isdir.__doc__)
... 
FileSystem

        Return True if path is a directory.

        Parameters
        ----------
        path : str
            Path to check.
        
LocalFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

DaskFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

S3FSWrapper

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

HadoopFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

```
Note that `FileSystem.isdir.__doc__` is not dedented because it does not use the `@ doc` decorator.

### Are there any user-facing changes?

No
* Closes: apache#34868

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Alenka Frim <frim.alenka@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
### Rationale for this change

Python classes sometimes duplicate docstrings, but change one word such as class name. Add a decorator function as a utility to help deduplicate docstring descriptions. Only works in Python. Does not work in Cython due to this CPython issue python/cpython#91309.

### What changes are included in this PR?

Add a decorator `@ doc` that can copy, concatenate, and/or format docstrings between classes.

### Are these changes tested?

Tests added. 

```
>>> import pyarrow
>>> from pyarrow.filesystem import FileSystem, LocalFileSystem, DaskFileSystem, S3FSWrapper
>>> from pyarrow.hdfs import HadoopFileSystem
>>> for fs in [FileSystem, LocalFileSystem, DaskFileSystem, S3FSWrapper, HadoopFileSystem]:
...     print(fs.__name__)
...     print(fs.isdir.__doc__)
... 
FileSystem

        Return True if path is a directory.

        Parameters
        ----------
        path : str
            Path to check.
        
LocalFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

DaskFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

S3FSWrapper

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

HadoopFileSystem

Return True if path is a directory.

Parameters
----------
path : str
    Path to check.

```
Note that `FileSystem.isdir.__doc__` is not dedented because it does not use the `@ doc` decorator.

### Are there any user-facing changes?

No
* Closes: apache#34868

Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Alenka Frim <frim.alenka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Sharing docstrings between classes
4 participants