[ENH] Pyjanitor fill method #704

samukweku · 2020-07-26T11:28:37Z

PR Description

Please describe the changes proposed in the pull request:

A method-chainable fill method for forward and backward fills on selected columns of a dataframe.

This PR resolves #700 .

Please ensure that you have done the following:

PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

If you're not on the contributors list, add yourself to AUTHORS.rst.

Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Quick Check

To do a very quick check that everything is correct, follow these steps below:

Run the command make check from pyjanitor's top-level directory. This will automatically run:
- black formatting
- flake8 checking
- running the test suite
- docs build

Once done, please check off the check-box above.

If make check does not work for you, you can execute the commands listed in the Makefile individually.

Code Changes

If you are adding code changes, please ensure the following:

Ensure that you have added tests.
Run all tests ($ pytest .) locally on your machine.
- Check to ensure that test coverage covers the lines of code that you have added.
- Ensure that all tests pass.

Documentation Changes

If you are adding documentation changes, please ensure the following:

Build the docs locally.
View the docs to check that it renders correctly.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

codecov · 2020-07-26T11:42:59Z

Codecov Report

Merging #704 into dev will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##              dev     #704   +/-   ##
=======================================
  Coverage   93.16%   93.16%           
=======================================
  Files          16       16           
  Lines         600      600           
=======================================
  Hits          559      559           
  Misses         41       41

ericmjl

Hey @samukweku!

First off, tremendous job here! Thanks for kickstarting the fill API!

I'm making a bunch of suggestions to the front-facing API, mostly for the sake of consistency with the rest of the API. In particular, I think the implementation of fill() can be simplified by making the API match that of find_replace. The "category" of actions there are similar: map a set of column names to an action that is taking place per column name. I think we should simplify the API such that it follows the use of dictionaries, which is the Pythonic way of doing discrete mappings.

The only other thing we probably want to think about is whether we should have a fill function or a fill_<something> function as the name. In the category of "filling" functions, there's fillna() (in pandas), fill_empty (in pyjanitor), and so fill feels a bit "God-like" in that it is named very generically. I know naming things is hard, so I'm going to propose a few straw man names, with their pros and cons. If you think none of them are good, please let me know, ignore the suggestions, and proceed as is.

fill_direction: Makes the directionality aspect a bit more explicit, but the use of direction in the name overlaps with what I've proposed as the API change (directions={"col1":"up", "col2": "down"})
fill_in: Doesn't have the name overlap problem, but is less explicit.
fillin: Matches more closely to the fillna API, but I can imagine it being mistyped all the time as filling.

Having written all of that, maybe fill is ok 😄.

Looking forward to the changes, @samukweku! I'll make an issue for the notebook, so that someone else can take it on. (Human map reduce is pretty awesome for these independent tasks.)

janitor/functions.py

ericmjl · 2020-07-26T14:24:43Z

janitor/functions.py

+    if not columns:
+        raise ValueError("columns argument cannot be empty")
+
+    if not isinstance(columns, (list, tuple, str)):
+        raise TypeError(
+            """
+            The columns argument should be a list/tuple of column names,
+            a single column, or a string of columns separated by ','
+            """
+        )
+
+    if not isinstance(directions, (list, tuple, str)):
+        raise TypeError(
+            """
+            The directions argument should be a list/tuple of directions,
+            a single direction, or a string of directions separated by
+            ','
+            """
+        )
+
+    # if columns/directions is a string, convert to list
+    if isinstance(columns, str):
+        columns = columns.split(",")
+    if isinstance(directions, str):
+        directions = directions.split(",")
+
+    # check that all columns supplied exist in the dataframe
+    non_existent_columns = set(columns).difference(df.columns)
+    if any(non_existent_columns):
+        # handles multiple columns
+        if len(non_existent_columns) > 1:
+            outcome = [f"'{word}'" for word in non_existent_columns]
+            outcome = ", ".join(outcome)
+            raise ValueError(
+                f"Columns {outcome} do not exist in the dataframe"
+            )
+        outcome = "".join(non_existent_columns)
+        raise ValueError(f"Column '{outcome}' does not exist in the dataframe")
+
+    # check supplied directions exist in directions_options
+    direction_options = ("up", "down", "updown", "downup")
+    directions = [direction.lower() for direction in directions]
+    incorrect_direction = set(directions).difference(direction_options)
+    if any(incorrect_direction):
+        raise ValueError(
+            """
+            The direction should be either 'up', 'down', 'updown', 'downup',
+            or a combination of the aforementioned options
+            """
+        )
+
+    # check that the length of columns matches the length of directions
+    # if the length of directions is greater than 1
+    length_columns_arg = len(columns)
+    length_directions_arg = len(directions)
+
+    if length_columns_arg != length_directions_arg:
+        if length_directions_arg > 1:
+            raise ValueError(
+                """
+                The number of arguments in the columns and
+                directions do not match.
+                """
+            )
+        directions = directions * length_columns_arg
+
+    fill_pairs = dict(zip(columns, directions))


With the API change, a lot of these checks can be removed. I think the only ones that would be necessary when using a dictionary are that:

It is not an empty dictionary

All keys exist as columns in the dataframe,

All values are one of the four valid options.

We'd have fewer lines of code to maintain this way.

tests/functions/test_fill.py

samukweku · 2020-07-26T21:44:29Z

@ericmjl , thanks for taking out the time to review the code. I wonder if it is better to lump the code, somewhat, into fill_empty:

df = fill_empty(df, column_names=['col1', 'col2'], value=0, directions=None)

if the value argument is not None, then directions must be None and vice versa. It might make it simpler and remove the need for one more function that might be confusing name wise. The drawback might be that I may not be able to pass in a dictionary for the mapping. @hectormz @zbarry @szuckerman , and others, I would also love your thoughts on this.

For the names suggested, fill_direction seems most promising.

hectormz · 2020-07-26T23:01:20Z

@samukweku I'm just catching up to speed on the issue/PR.

FYI, you've been tagging me at @HectorM14 which is my old account. Tag me at @hectormz if you want to get my attention 😁

samukweku · 2020-07-26T23:09:42Z

@samukweku I'm just catching up to speed on the issue/PR.

FYI, you've been tagging me at @HectorM14 which is my old account. Tag me at @hectormz if you want to get my attention

Haha!, my bad. Now I know the right way to get your attention

ericmjl · 2020-07-28T23:43:41Z

The PR's been open for a few days, and we've not had much in the way of replies.

@samukweku, I propose we go forward with fill_direction as the function name. (By convention, most function names are two words, after all.) This is probably more informative than lumping it into fill_empty. In any case, I am not sure that we're near a 1.x release, so if this turns out to be an API wart after a while, we can always change it later. (Famous last words - I know that people end up just adapting, ha ha ha.)

Co-authored-by: Eric Ma <ericmjl@users.noreply.github.com>

samukweku · 2020-07-30T10:34:34Z

the failing test is from test_filter_date.py, specifically the test_filter_date_column_name. I think it should raise a keyerror, since the column 42 does not exist in the dataframe. Not sure how to fix it

ericmjl · 2020-07-30T11:37:58Z

@samukweku I think I see what’s going on! Could you change the expected error raised from TypeError to KeyError? That probably will fix it. (A local check before pushing can confirm for you whether this was right or not.)

samukweku · 2020-07-30T12:13:26Z

@ericmjl I did change it to KeyError but it failed on the local test, whereas TypeError doesn't trigger an error locally

ericmjl · 2020-08-01T17:12:16Z

@samukweku hmm, I'm quite baffled myself. Can you decorate it with @pytest.mark.xfail, and put a reason on the xfail decorator as "Unknown error; KeyError fails locally but ValueError fails on CI"?

https://docs.pytest.org/en/latest/skipping.html#xfail-mark-test-functions-as-expected-to-fail

In doing this, we'll at least preserve the test, but it won't break the CI unnecessarily.

hectormz · 2020-08-01T17:30:35Z

@ericmjl @samukweku I'm intrigued! I'll look at this quickly myself

hectormz · 2020-08-01T17:51:13Z

@samukweku @ericmjl I figured it out 🤔

So Pandas 1.1.0 came out in the past few days, and the CI is grabbing the latest version. You likely have something like 1.0.5 locally.

According to the release notes:

KeyErrors raised by loc specify missing labels¶
Previously, if labels were missing for a .loc call, a KeyError was raised stating that this was no longer supported.
Now the error message also includes a list of the missing labels (max 10 items, display width 80 characters). See GH34272.

So, we should either be flexible here, or set a maximum or minimum pandas version for now (1.1.0)

hectormz · 2020-08-01T20:41:39Z

Are you still using the Docker container? The CI grabs the latest version it can, while following the min/max version rules we set. These things pop up for me when I install once locally, and don't keep the packages up to date

samukweku · 2020-08-01T20:43:49Z

@hectormz Yes, I'm using the docker container, and the version there is 1.0.4. Thanks for sharing your CI knowledge by the way.

ericmjl · 2020-08-01T21:25:33Z

@hectormz and @samukweku I think I have an idea. I'm also using the Docker container. There's probably a cached layer that's causing this error to not be revealed. I'm going to clear out my cache:

docker system prune -a -f

and see what happens.

ericmjl · 2020-08-01T23:27:34Z

OK! Looping back and confirming that pandas 1.1.0 was the reason why we have had issues.

I suspect pandas 1.0.5 was what got resolved by conda, but in the CI system, 1.1.0 got installed because we pip install the package.

I’m going to suggest we mark the test as xfail with 1.1.0 as the reason why it’s failing. It’s not crucial at the moment, I think, for that test to be working, and with the new .loc changes in pandas, I think it means we can have some other package-wide changes underneath the hood (i.e. by not needing to do checks to make sure a column is present.)

@samukweku, if you've got the time, would you like to do the honors of finishing up the PR? If you're blocked because of time, let me know, I'll send a PR into your branch.

ericmjl

I approve! Let's leave this open for one more day to get any more comments in.

ericmjl · 2020-08-04T01:04:21Z

@samukweku it's past 24 hours! @hectormz, @zbarry or @szuckerman would you like to do the honors of hitting the big green button?

hectormz · 2020-08-04T01:47:54Z

@ericmjl I can do it, but @samukweku can you add a one line summary docstring? (Unless my phone is not showing it properly)

samukweku · 2020-08-04T01:58:56Z

@ericmjl I can do it, but @samukweku can you add a one line summary docstring? (Unless my phone is not showing it properly)

Hi @hectormz , not sure what you mean. Where should the one line summary docstring be located?

hectormz · 2020-07-26T22:08:58Z

janitor/functions.py

+    directions: Union[Tuple[str], List[str], str] = "down",
+    limit: Optional[int] = None,
+) -> pd.DataFrame:
+    """


Can you add a one-line docstring summary here?

janitor/functions.py

tests/functions/test_fill.py

janitor/functions.py

tests/functions/test_fill_direction.py

hectormz · 2020-08-04T04:37:00Z

@samukweku I realize I still haven't resolved the issue/PR on docstrings, so it should be clearer after that!

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

ericmjl · 2020-08-05T01:39:12Z

Anytime you’d like to, @hectormz, please do the honors!

root added 10 commits July 26, 2020 10:12

forward and backward fill methods

f4f3b65

updates to fill method

4c8f93a

updates to fill method

6ad127f

updates to fill_method

4e44eb7

updates to fill method

2562938

updates to fill method

b98ac93

updates to fill method

3c06a72

updates to fill method

13b49ec

updates to fill method

914d363

update to changelog

ea43252

ericmjl requested changes Jul 26, 2020

View reviewed changes

samukweku and others added 8 commits July 30, 2020 07:16

Update tests/functions/test_fill.py

d6be2a6

Co-authored-by: Eric Ma <ericmjl@users.noreply.github.com>

updates to fill_method

cf9678d

updates to fill_method

3d71da6

update fill_method

4ffd4a6

update fill_method

1a685b1

updates to fill_method

13487ab

update fill_method

7932965

updates to fill_method

5a83f8f

root added 3 commits August 2, 2020 04:53

updates to fill_method

be5b286

updates to fill_method

6c442d2

updates to fill_method

cf65898

ericmjl approved these changes Aug 2, 2020

View reviewed changes

hectormz requested changes Aug 4, 2020

View reviewed changes

hectormz reviewed Aug 4, 2020

View reviewed changes

janitor/functions.py Outdated Show resolved Hide resolved

hectormz requested changes Aug 4, 2020

View reviewed changes

samukweku and others added 12 commits August 4, 2020 16:37

Update janitor/functions.py

4741546

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update janitor/functions.py

36200e3

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update tests/functions/test_fill_direction.py

98b2e0c

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update tests/functions/test_fill_direction.py

c4c40cd

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update tests/functions/test_fill_direction.py

f56c573

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update tests/functions/test_fill_direction.py

5267a44

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update tests/functions/test_fill_direction.py

480d5da

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update tests/functions/test_fill_direction.py

35d3f9e

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update tests/functions/test_fill_direction.py

62247d8

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update tests/functions/test_fill_direction.py

52c5c89

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

Update janitor/functions.py

5caede8

Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>

docstring updates

306eb80

hectormz merged commit 350be8e into pyjanitor-devs:dev Aug 5, 2020

samukweku mentioned this pull request Nov 25, 2020

[ENH] fill_direction improve #778

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Pyjanitor fill method #704

[ENH] Pyjanitor fill method #704

samukweku commented Jul 26, 2020 •

edited

codecov bot commented Jul 26, 2020

ericmjl left a comment

ericmjl Jul 26, 2020

samukweku commented Jul 26, 2020 •

edited by hectormz

hectormz commented Jul 26, 2020

samukweku commented Jul 26, 2020 •

edited

ericmjl commented Jul 28, 2020

samukweku commented Jul 30, 2020

ericmjl commented Jul 30, 2020

samukweku commented Jul 30, 2020 •

edited

ericmjl commented Aug 1, 2020 •

edited

hectormz commented Aug 1, 2020

hectormz commented Aug 1, 2020

hectormz commented Aug 1, 2020

samukweku commented Aug 1, 2020 •

edited

ericmjl commented Aug 1, 2020

ericmjl commented Aug 1, 2020 •

edited

ericmjl left a comment

ericmjl commented Aug 4, 2020

hectormz commented Aug 4, 2020

samukweku commented Aug 4, 2020

hectormz Jul 26, 2020

hectormz commented Aug 4, 2020

ericmjl commented Aug 5, 2020

[ENH] Pyjanitor fill method #704

[ENH] Pyjanitor fill method #704

Conversation

samukweku commented Jul 26, 2020 • edited

PR Description

Quick Check

Code Changes

Documentation Changes

Relevant Reviewers

codecov bot commented Jul 26, 2020

Codecov Report

ericmjl left a comment

Choose a reason for hiding this comment

ericmjl Jul 26, 2020

Choose a reason for hiding this comment

samukweku commented Jul 26, 2020 • edited by hectormz

hectormz commented Jul 26, 2020

samukweku commented Jul 26, 2020 • edited

ericmjl commented Jul 28, 2020

samukweku commented Jul 30, 2020

ericmjl commented Jul 30, 2020

samukweku commented Jul 30, 2020 • edited

ericmjl commented Aug 1, 2020 • edited

hectormz commented Aug 1, 2020

hectormz commented Aug 1, 2020

hectormz commented Aug 1, 2020

samukweku commented Aug 1, 2020 • edited

ericmjl commented Aug 1, 2020

ericmjl commented Aug 1, 2020 • edited

ericmjl left a comment

Choose a reason for hiding this comment

ericmjl commented Aug 4, 2020

hectormz commented Aug 4, 2020

samukweku commented Aug 4, 2020

hectormz Jul 26, 2020

Choose a reason for hiding this comment

hectormz commented Aug 4, 2020

ericmjl commented Aug 5, 2020

samukweku commented Jul 26, 2020 •

edited

samukweku commented Jul 26, 2020 •

edited by hectormz

samukweku commented Jul 26, 2020 •

edited

samukweku commented Jul 30, 2020 •

edited

ericmjl commented Aug 1, 2020 •

edited

samukweku commented Aug 1, 2020 •

edited

ericmjl commented Aug 1, 2020 •

edited