Docstring formatting and linting using `pydocstyle` and `darglint` #2670

angela97lin · 2021-08-23T15:43:05Z

Closes #878

I introduced two packages in this PR: pydocstyle and darglint.

Some notable changes via pydocstyle:

Catches missing arguments: D417: Missing argument descriptions in the docstring (argument(s) X are missing descriptions in 'validate' docstring). Unfortunately, this doesn't always catch everything :(
Consistent spacing between sections of multi-line docstrings.
Consistent punctuation (ex: the description / summary must end in a period), unfortunately not for arguments :'(
Use "Args" instead of "Arguments". I think before we tried to conform to "Arguments", but since "Args" is more Google-style (https://google.github.io/styleguide/pyguide.html) and recognized as a valid section header, I'm updating to use "Args".
If there's any formatting that we don't like, we can choose to ignore it; if there's any formatting we want to include on top of the Google style, we can add it! I chose to add D400 because it felt more consistent :)
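As a rough illustration of the points above, here is a minimal sketch of a docstring in the shape these checks push toward: a one-line summary ending in a period, a blank line before each section, and an "Args" header that describes every argument. The function itself is hypothetical and only borrows the name `validate` from the D417 message; it is not taken from evalml.

```python
def validate(X, y):
    """Checks that the features and target have the same number of rows.

    Args:
        X (list): Feature rows.
        y (list): Target values, one per row of X.

    Returns:
        bool: True if X and y have the same length.
    """
    return len(X) == len(y)
```

Dropping the description for `y` is the kind of omission D417 is meant to report, and removing the final period from the summary line is what D400 flags.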

Pydoc error codes: http://www.pydocstyle.org/en/5.0.1/error_codes.html

Notable changes via darglint:

Detects when "Returns" is missing or misformatted in multiline docstrings.
Detects missing arguments even in misformatted sections.
Requires "Raises" with the appropriate exception if the method raises an exception.
If a method does not return anything, delete the Returns section rather than using Returns: None.
Detects excess docstrings (when parameters are deleted but their docstring entries are not).
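A small sketch of what the darglint points above look like in practice: the function raises an exception, so it carries a "Raises" section, and it returns nothing, so it has no "Returns" section at all. The function name and message are hypothetical, chosen only for illustration.

```python
def print_ratio(numerator, denominator):
    """Prints the ratio of two numbers.

    Args:
        numerator (float): Value to divide.
        denominator (float): Value to divide by.

    Raises:
        ValueError: If denominator is zero.
    """
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    print(numerator / denominator)
```

If the `denominator` parameter were later removed from the signature but left in the docstring, that leftover entry is the "excess docstring" case mentioned above.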

**However, a strong con for darglint is that it takes a noticeable amount of time (1-2 minutes?). This increases our lint job from ~2 minutes to ~5-6 minutes on GitHub.**
I wonder if adding this as a pre-commit hook could help with this time; it could be worth looking into.

I found https://pypi.org/project/docformatter/ as an automatic docformatter, but in practice it didn't work extremely well. For example, it likes to break docstrings up by character count, but the Google convention is that the first line of a docstring must fit on one line. Given that this PR makes the adjustments necessary for the repo at large, future PRs should only need to worry about the code lines that have been modified, which is not as big of a lift.

codecov · 2021-08-23T15:48:31Z

Codecov Report

Merging #2670 (8f714bf) into main (b239dbc) will not change coverage.
The diff coverage is 100.0%.

@@          Coverage Diff          @@
##            main   #2670   +/-   ##
=====================================
  Coverage   99.8%   99.8%           
=====================================
  Files        301     301           
  Lines      27904   27904           
=====================================
  Hits       27841   27841           
  Misses        63      63

Impacted Files	Coverage Δ
evalml/__init__.py	`100.0% <ø> (ø)`
evalml/__main__.py	`100.0% <ø> (ø)`
evalml/automl/__init__.py	`100.0% <ø> (ø)`
evalml/automl/automl_algorithm/__init__.py	`100.0% <ø> (ø)`
evalml/automl/automl_algorithm/automl_algorithm.py	`100.0% <ø> (ø)`
...valml/automl/automl_algorithm/default_algorithm.py	`100.0% <ø> (ø)`
...lml/automl/automl_algorithm/iterative_algorithm.py	`100.0% <ø> (ø)`
evalml/automl/automl_search.py	`99.9% <ø> (ø)`
evalml/automl/callbacks.py	`100.0% <ø> (ø)`
evalml/automl/engine/__init__.py	`100.0% <ø> (ø)`
... and 171 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

freddyaboulton · 2021-08-23T18:24:54Z

Can't wait for this pr to be ready 🥳

angela97lin · 2021-09-13T02:29:07Z

setup.cfg

@@ -11,6 +11,7 @@ exclude = docs/*
 ignore = E501,W504,W503
 per-file-ignores =
    **/__init__.py:F401
+    **/tests/*:D


What this means: ignore all docstring linting errors for test files.

angela97lin · 2021-09-13T02:31:57Z

setup.cfg

+skip=__init__.py
+[darglint]
+ignore=DAR402


DAR402: When a docstring describes an exception not explicitly raised.

chukarsten

I feel super vindicated by the reintroduction of "Raises". That was the hill I was ready to die on. I definitely feel able to accept the risk of a longer lint time on the server CI. I don't think that's a huge deal and is particularly mitigated by the pre-commit hook you linked to. I also feel like there's gotta be a way to do the lint and lint fix locally but with just a diff of files between the current commit and HEAD to minimize local lint time. But we can try it and see whether we get irritated. Great work!!!

ParthivNaresh

Now that was a workout. Excellent work!

ParthivNaresh · 2021-09-13T20:41:21Z

evalml/pipelines/components/ensemble/sklearn_stacked_ensemble_base.py

@@ -10,7 +11,7 @@
 class SklearnStackedEnsembleBase(Estimator):
    """Stacked Ensemble Base Class.

-    Arguments:
+    Args:


Is there supposed to be a "Raises" part of the docstring here?

Yeah, though I think these linting packages aren't smart enough to pick up that a class docstring should be associated with the init docstring 😅

ParthivNaresh · 2021-09-13T20:50:50Z

evalml/utils/cli_utils.py

-    """Prints information about the system, evalml, and dependencies of evalml.
-
-    Returns:
-        None


Good riddance of the returns

freddyaboulton

@angela97lin Thank you for doing this! This is awesome. I'm glad we're making sure that all of our functions/methods have docstrings and that all of the arguments are properly formatted.

I am ok with the increase in the lint job time. Locally, I tend to run that only once before I push up and I don't think we'd notice on GH since the other jobs run longer than lint.

I left a couple of questions about the changes in this PR, like some docstrings got condensed into one line and None got removed from the type list for some arguments.

What style choices are not captured by pydocstyle/darglint that we'll still want to keep an eye out for during PR review? I have the following:

First letter of argument description is capitalized.
Docstring sentences end in periods.
Types are in lower-case: bool vs Bool and dict vs Dict.

Let's add these and any others to our contributing guide?
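Since the linters don't cover these three conventions, a reviewer could in principle spot-check them with a few lines of Python. The sketch below is purely illustrative: the `ARG_LINE` pattern and the `style_problems` helper are hypothetical, handle only simple `name (type): Description.` lines, and are not part of pydocstyle, darglint, or evalml.

```python
import re

# Hypothetical pattern for a Google-style argument line:
# "name (type): Description."
ARG_LINE = re.compile(r"^\s*(?P<name>\w+) \((?P<type>[^)]*)\): (?P<desc>.+)$")

# Capitalized spellings that the convention above wants in lower-case.
CAPITALIZED_TYPES = re.compile(r"\b(Bool|Dict|List|Str|Int|Float)\b")


def style_problems(line):
    """Return a list of style problems for one argument line (empty if none)."""
    match = ARG_LINE.match(line)
    if match is None:
        return ["not an argument line"]
    problems = []
    desc = match.group("desc").strip()
    if desc[:1].islower():
        problems.append("description should start with a capital letter")
    if not desc.endswith("."):
        problems.append("description should end with a period")
    if CAPITALIZED_TYPES.search(match.group("type")):
        problems.append("type should be lower-case")
    return problems
```

For example, `style_problems("    flag (Bool): whether to run")` reports all three problems, while `style_problems("    a (int): A number.")` returns an empty list.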

freddyaboulton · 2021-09-13T20:03:19Z

setup.cfg

+skip=__init__.py
+[darglint]
+ignore=DAR402


Should we also skip 401? Looks like darglint only checks if the current method raises an exception and not any methods called within that method.

Just starting a discussion. I don't think we need to resolve this to merge but curious what everyone thinks.

def _raise_if_a_is_odd(a):
    """Raises a Value error if a is odd.

    Raises:
        ValueError if a is odd
    """
    if a % 2 == 1:
        raise ValueError("A is odd!")


def my_func(a, b):
    """Does my_func.

    Args:
        a (int): A number.
        b (int): B number.

    Returns:
        Sum of a and b
    """
    _raise_if_a_is_odd(a)
    return a + b
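To make the question concrete, here is a small sketch of a check that only looks at `raise` statements written directly in a function's own body. It uses the standard-library `ast` module and is an assumption about the general approach, not darglint's actual implementation. The helper name `directly_raised` is hypothetical.

```python
import ast

SOURCE = '''
def _raise_if_a_is_odd(a):
    if a % 2 == 1:
        raise ValueError("A is odd!")

def my_func(a, b):
    _raise_if_a_is_odd(a)
    return a + b
'''


def directly_raised(source):
    """Map each function name to the exception names raised in its own body."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        names = set()
        for child in ast.walk(node):
            if isinstance(child, ast.Raise) and child.exc is not None:
                exc = child.exc
                # `raise ValueError(...)` is a Call whose func is a Name.
                target = exc.func if isinstance(exc, ast.Call) else exc
                if isinstance(target, ast.Name):
                    names.add(target.id)
        result[node.name] = names
    return result


raised = directly_raised(SOURCE)
```

Under this kind of analysis, `my_func` has no exceptions of its own, so a "Raises: ValueError" entry in its docstring would look like an exception that is documented but never explicitly raised, even though a caller can in fact see one.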

freddyaboulton · 2021-09-13T20:08:18Z

evalml/automl/automl_search.py

@@ -1042,8 +1045,7 @@ def search(self, show_iteration_plot=True):
        self._searched = True

    def _find_best_pipeline(self):
-        """Finds the best pipeline in the rankings
-        If self._best_pipeline already exists, check to make sure it is different from the current best pipeline before training and thresholding"""
+        """Finds the best pipeline in the rankings If self._best_pipeline already exists, check to make sure it is different from the current best pipeline before training and thresholding."""


freddyaboulton · 2021-09-13T20:13:16Z

evalml/automl/automl_search.py

-            pickle_protocol (int): the pickle data stream format.
+        Args:
+            file_path (str): Location to save file.
+            pickle_type ({"pickle", "cloudpickle"}): The pickling library to use.


Interesting that {"pickle", "cloudpickle"} passes the darglint check.

freddyaboulton · 2021-09-13T20:16:50Z

evalml/automl/automl_search.py


        Returns:
-            Dict[str, Dict[str, float]]: Dictionary keyed by pipeline name that maps to a dictionary of scores.
+            dict[str, Dict[str, float]]: Dictionary keyed by pipeline name that maps to a dictionary of scores.


I don't think Dict raises an error with flake8/pydocstyle?

Nope, you're correct! Just a consistency thing--we could choose either caps or not :)

freddyaboulton · 2021-09-13T20:17:26Z

evalml/automl/engine/cf_engine.py

@@ -132,36 +129,37 @@ def submit_evaluation_job(self, automl_config, pipeline, X, y) -> EngineComputat
        )
        return CFComputation(future)

-    def submit_training_job(self, automl_config, pipeline, X, y) -> EngineComputation:


freddyaboulton · 2021-09-13T20:34:27Z

evalml/data_checks/outliers_data_check.py

-        provides a seventh order Taylor series approximation to the two true
-        functional relationships, and was estimated using least squares
-        regression.
+        """Calculate the probability that there are no true outliers in a numeric (integer or float) column. It is based on creating 100,000 samples consisting of a given number of records, and then repeating this over a grid of sample sizes. Each value in a sample is drawn from a log normal distribution, and then the number of potential outliers in the data is determined using the skew adjusted box plot approach based on the medcouple statistic. It was observed that the distribution of the percentage of outliers could be described by a gamma distribution, with the shape and scale parameters changing with the sample size. For each sample size, the shape and scale parameters of the gamma distriubtion were estimated using maximum likelihood methods. The set of estimate shape and scale parameters for different sample size were then used to fit equations that relate these two parameters to the sample size. These equations use a transendental logrithmic functional form that provides a seventh order Taylor series approximation to the two true functional relationships, and was estimated using least squares regression.


Also @eccabay, private methods/functions are not shown in the docs.

freddyaboulton · 2021-09-13T20:45:55Z

evalml/automl/utils.py


    Raises:
-        ValueError: if any pipeline names are duplicated.
+        ValueError: If any pipeline names are duplicated.


The capital If is because we care about starting with a capital letter. darglint/pydocstyle don't care right?

Yup yup 😅 Definitely more of a stylistic thing, but I think it looks a tad bit cleaner!

freddyaboulton · 2021-09-13T21:00:24Z

evalml/pipelines/time_series_classification_pipelines.py

+            y (pd.Series): Future target of shape [n_samples].
+            X_train (pd.DataFrame): Data the pipeline was trained on of shape [n_samples_train, n_feautures].
+            y_train (pd.Series): Targets used to train the pipeline of shape [n_samples_train].
+            objective (ObjectiveBase, str): Objective used to threshold predicted probabilities, optional.


Why did you get rid of None? The parameter can be None in this method.

This could be a stylistic thing, but following Google's example, it seemed like even if None was an acceptable value for a parameter, it wasn't listed as a type. Instead, they used "optional" and "Defaults to None" to indicate that!
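A minimal sketch of that convention, using a hypothetical function rather than the pipeline method under review: None is not listed among the types, and the description says the argument is optional and what it defaults to.

```python
def to_labels(probabilities, cutoff=None):
    """Converts predicted probabilities to binary labels.

    Args:
        probabilities (list[float]): Predicted probabilities.
        cutoff (float, optional): Decision threshold. Defaults to None, in which case 0.5 is used.

    Returns:
        list[int]: 1 where the probability exceeds the cutoff, 0 otherwise.
    """
    if cutoff is None:
        cutoff = 0.5
    return [int(p > cutoff) for p in probabilities]
```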

freddyaboulton · 2021-09-13T21:09:16Z

@chukarsten @angela97lin Looks like scikit-learn lints only on the diff but it looks kinda hairy. Maybe it's cause I don't know how to bash script lol

https://github.com/scikit-learn/scikit-learn/blob/main/build_tools/circle/linting.sh
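For anyone who would rather avoid bash, the diff-only idea can be sketched in Python. This is not a port of the scikit-learn script, just a rough outline under two assumptions: that `flake8` is the lint entry point (so it picks up the pydocstyle/darglint settings in setup.cfg), and that comparing against `main` is the right base. The function names are hypothetical.

```python
import subprocess


def python_files_from_diff(diff_output):
    """Keep only the .py paths from `git diff --name-only` output."""
    paths = (line.strip() for line in diff_output.splitlines())
    return [path for path in paths if path.endswith(".py")]


def changed_python_files(base="main"):
    """List Python files added, copied, modified, or renamed relative to base."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=ACMR", base],
        capture_output=True,
        text=True,
        check=True,
    )
    return python_files_from_diff(result.stdout)


def lint_changed_files(base="main"):
    """Run flake8 on changed files only and return its exit code."""
    files = changed_python_files(base)
    if not files:
        return 0  # Nothing to lint.
    # Assumption: flake8 is installed and configured via setup.cfg.
    return subprocess.run(["flake8", *files]).returncode
```

Calling `lint_changed_files()` from the repository root would then lint only the files touched relative to `main`, which is where the time savings would come from if darglint is the slow step.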

angela97lin · 2021-09-14T18:30:25Z

@freddyaboulton Really cool, I had no idea! I filed #2779 with a link to your comment 😁

Also, thanks for the suggestions! I think another stylistic thing is adding the default value to the argument string. Updated the contributing guide with all of these: https://github.com/alteryx/evalml/blob/878_docstring_formatting/contributing.md#code-style-guide

angela97lin added 2 commits August 20, 2021 12:18

init

69e7b11

add auto

82f358c

angela97lin self-assigned this Aug 23, 2021

angela97lin added 20 commits August 26, 2021 16:28

merge

61f2207

Merge branch 'main' into 878_docstring_formatting

fa838da

do some more cleanup

542eb74

Merge branch '878_docstring_formatting' of github.com:alteryx/evalml into 878_docstring_formatting

f74c174

fix some data checks

28e67a2

start to add lines

3752f8b

dashes under sections

4ba3d25

sphinx dont tree warnings as errors, need to remove later but using to test changes in doc api ref

0cec01d

try update make bat

0801840

try to fix class imbal dc as example

8c646c8

empty commit

c6a0e54

change to parameters

2bdbb99

fix data checks

677385c

cleanup more

4674ea6

fixing more

3ad3847

add ignore test directory flag to lint command

cf45bb7

more cleanup

7bc5d94

merge main

6849d3a

adding more fixes

7049183

Merge branch 'main' into 878_docstring_formatting

3a06632

angela97lin changed the title ~~Docstring formatting~~ Docstring formatting and linting using pydocstyle Aug 30, 2021

angela97lin added 4 commits August 30, 2021 18:19

revert tests and clean more

aee0e45

Merge branch '878_docstring_formatting' of github.com:alteryx/evalml into 878_docstring_formatting

8afc300

merge

913dc99

clean up objectives

70572a0

angela97lin added 11 commits September 10, 2021 23:53

change indentation

11cd30e

try

3c889a8

try again

6720618

test square brackets

39ca658

test underscore

9b4e2cc

remove raises

2a166d8

backslash

9a98ba8

try removal

32d5cc1

test

4fc6db7

clean up setup

eea28fc

Merge branch 'main' into 878_docstring_formatting

d45408c

angela97lin commented Sep 13, 2021

angela97lin added 2 commits September 12, 2021 22:53

cleanup and revert accidental merge diffs

57941fc

Merge branch '878_docstring_formatting' of github.com:alteryx/evalml into 878_docstring_formatting

4fbaa63

angela97lin marked this pull request as ready for review September 13, 2021 02:54

chukarsten approved these changes Sep 13, 2021

Merge branch 'main' into 878_docstring_formatting

da14413

ParthivNaresh approved these changes Sep 13, 2021

freddyaboulton approved these changes Sep 13, 2021

angela97lin mentioned this pull request Sep 14, 2021

Decrease runtime of lint job #2779

Open

angela97lin added 4 commits September 14, 2021 15:48

clean up from comments

a818536

Merge branch 'main' into 878_docstring_formatting

1da0d4d

add default to contributing

f0bc3db

Merge branch '878_docstring_formatting' of github.com:alteryx/evalml into 878_docstring_formatting

8f714bf

angela97lin merged commit 2719871 into main Sep 14, 2021

angela97lin deleted the 878_docstring_formatting branch September 14, 2021 20:28

chukarsten mentioned this pull request Sep 15, 2021

Release v0.33.0 #2784

Merged
