Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Pyjanitor fill method #704

Merged
merged 33 commits into from Aug 5, 2020
Merged

[ENH] Pyjanitor fill method #704

merged 33 commits into from Aug 5, 2020

Conversation

samukweku
Copy link
Collaborator

@samukweku samukweku commented Jul 26, 2020

PR Description

Please describe the changes proposed in the pull request:

  • A method-chainable fill method for forward and backward fills on selected columns of a dataframe.

This PR resolves #700 .

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.rst.
  1. Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Quick Check

To do a very quick check that everything is correct, follow these steps below:

  • Run the command make check from pyjanitor's top-level directory. This will automatically run:
    • black formatting
    • flake8 checking
    • running the test suite
    • docs build

Once done, please check off the check-box above.

If make check does not work for you, you can execute the commands listed in the Makefile individually.

Code Changes

If you are adding code changes, please ensure the following:

  • Ensure that you have added tests.
  • Run all tests ($ pytest .) locally on your machine.
    • Check to ensure that test coverage covers the lines of code that you have added.
    • Ensure that all tests pass.

Documentation Changes

If you are adding documentation changes, please ensure the following:

  • Build the docs locally.
  • View the docs to check that it renders correctly.

Relevant Reviewers

Please tag maintainers to review.

@codecov
Copy link

codecov bot commented Jul 26, 2020

Codecov Report

Merging #704 into dev will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##              dev     #704   +/-   ##
=======================================
  Coverage   93.16%   93.16%           
=======================================
  Files          16       16           
  Lines         600      600           
=======================================
  Hits          559      559           
  Misses         41       41           

Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @samukweku!

First off, tremendous job here! Thanks for kickstarting the fill API!

I'm making a bunch of suggestions to the front-facing API, mostly for the sake of consistency with the rest of the API. In particular, I think the implementation of fill() can be simplified by making the API match that of find_replace. The "category" of actions there are similar: map a set of column names to an action that is taking place per column name. I think we should simplify the API such that it follows the use of dictionaries, which is the Pythonic way of doing discrete mappings.

The only other thing we probably want to think about is whether we should have a fill function or a fill_<something> function as the name. In the category of "filling" functions, there's fillna() (in pandas), fill_empty (in pyjanitor), and so fill feels a bit "God-like" in that it is named very generically. I know naming things is hard, so I'm going to propose a few straw man names, with their pros and cons. If you think none of them are good, please let me know, ignore the suggestions, and proceed as is.

  • fill_direction: Makes the directionality aspect a bit more explicit, but the use of direction in the name overlaps with what I've proposed as the API change (directions={"col1":"up", "col2": "down"})
  • fill_in: Doesn't have the name overlap problem, but is less explicit.
  • fillin: Matches more closely to the fillna API, but I can imagine it being mistyped all the time as filling.

Having written all of that, maybe fill is ok 😄.

Looking forward to the changes, @samukweku! I'll make an issue for the notebook, so that someone else can take it on. (Human map reduce is pretty awesome for these independent tasks.)

janitor/functions.py Outdated Show resolved Hide resolved
Comment on lines 4231 to 4297
if not columns:
raise ValueError("columns argument cannot be empty")

if not isinstance(columns, (list, tuple, str)):
raise TypeError(
"""
The columns argument should be a list/tuple of column names,
a single column, or a string of columns separated by ','
"""
)

if not isinstance(directions, (list, tuple, str)):
raise TypeError(
"""
The directions argument should be a list/tuple of directions,
a single direction, or a string of directions separated by
','
"""
)

# if columns/directions is a string, convert to list
if isinstance(columns, str):
columns = columns.split(",")
if isinstance(directions, str):
directions = directions.split(",")

# check that all columns supplied exist in the dataframe
non_existent_columns = set(columns).difference(df.columns)
if any(non_existent_columns):
# handles multiple columns
if len(non_existent_columns) > 1:
outcome = [f"'{word}'" for word in non_existent_columns]
outcome = ", ".join(outcome)
raise ValueError(
f"Columns {outcome} do not exist in the dataframe"
)
outcome = "".join(non_existent_columns)
raise ValueError(f"Column '{outcome}' does not exist in the dataframe")

# check supplied directions exist in directions_options
direction_options = ("up", "down", "updown", "downup")
directions = [direction.lower() for direction in directions]
incorrect_direction = set(directions).difference(direction_options)
if any(incorrect_direction):
raise ValueError(
"""
The direction should be either 'up', 'down', 'updown', 'downup',
or a combination of the aforementioned options
"""
)

# check that the length of columns matches the length of directions
# if the length of directions is greater than 1
length_columns_arg = len(columns)
length_directions_arg = len(directions)

if length_columns_arg != length_directions_arg:
if length_directions_arg > 1:
raise ValueError(
"""
The number of arguments in the columns and
directions do not match.
"""
)
directions = directions * length_columns_arg

fill_pairs = dict(zip(columns, directions))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the API change, a lot of these checks can be removed. I think the only ones that would be necessary when using a dictionary are that:

  • It is not an empty dictionary
  • All keys exist as columns in the dataframe,
  • All values are one of the four valid options.

We'd have fewer lines of code to maintain this way.

tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
@samukweku
Copy link
Collaborator Author

samukweku commented Jul 26, 2020

@ericmjl , thanks for taking out the time to review the code. I wonder if it is better to lump the code, somewhat, into fill_empty:

df = fill_empty(df, column_names=['col1', 'col2'], value=0, directions=None)

if the value argument is not None, then directions must be None and vice versa. It might make it simpler and remove the need for one more function that might be confusing name wise. The drawback might be that I may not be able to pass in a dictionary for the mapping. @hectormz @zbarry @szuckerman , and others, I would also love your thoughts on this.

For the names suggested, fill_direction seems most promising.

@hectormz
Copy link
Collaborator

@samukweku I'm just catching up to speed on the issue/PR.

FYI, you've been tagging me at @HectorM14 which is my old account. Tag me at @hectormz if you want to get my attention 😁

@samukweku
Copy link
Collaborator Author

samukweku commented Jul 26, 2020

@samukweku I'm just catching up to speed on the issue/PR.

FYI, you've been tagging me at @HectorM14 which is my old account. Tag me at @hectormz if you want to get my attention

Haha!, my bad. Now I know the right way to get your attention

@ericmjl
Copy link
Member

ericmjl commented Jul 28, 2020

The PR's been open for a few days, and we've not had much in the way of replies.

@samukweku, I propose we go forward with fill_direction as the function name. (By convention, most function names are two words, after all.) This is probably more informative than lumping it into fill_empty. In any case, I am not sure that we're near a 1.x release, so if this turns out to be an API wart after a while, we can always change it later. (Famous last words - I know that people end up just adapting, ha ha ha.)

@samukweku
Copy link
Collaborator Author

the failing test is from test_filter_date.py, specifically the test_filter_date_column_name. I think it should raise a keyerror, since the column 42 does not exist in the dataframe. Not sure how to fix it

@ericmjl
Copy link
Member

ericmjl commented Jul 30, 2020

@samukweku I think I see what’s going on! Could you change the expected error raised from TypeError to KeyError? That probably will fix it. (A local check before pushing can confirm for you whether this was right or not.)

@samukweku
Copy link
Collaborator Author

samukweku commented Jul 30, 2020

@ericmjl I did change it to KeyError but it failed on the local test, whereas TypeError doesn't trigger an error locally

@ericmjl
Copy link
Member

ericmjl commented Aug 1, 2020

@samukweku hmm, I'm quite baffled myself. Can you decorate it with @pytest.mark.xfail, and put a reason on the xfail decorator as "Unknown error; KeyError fails locally but ValueError fails on CI"?

https://docs.pytest.org/en/latest/skipping.html#xfail-mark-test-functions-as-expected-to-fail

In doing this, we'll at least preserve the test, but it won't break the CI unnecessarily.

@hectormz
Copy link
Collaborator

hectormz commented Aug 1, 2020

@ericmjl @samukweku I'm intrigued! I'll look at this quickly myself

@hectormz
Copy link
Collaborator

hectormz commented Aug 1, 2020

@samukweku @ericmjl I figured it out 🤔

So Pandas 1.1.0 came out in the past few days, and the CI is grabbing the latest version. You likely have something like 1.0.5 locally.

According to the release notes:

KeyErrors raised by loc specify missing labels¶
Previously, if labels were missing for a .loc call, a KeyError was raised stating that this was no longer supported.
Now the error message also includes a list of the missing labels (max 10 items, display width 80 characters). See GH34272.

So, we should either be flexible here, or set a maximum or minimum pandas version for now (1.1.0)

@hectormz
Copy link
Collaborator

hectormz commented Aug 1, 2020

Are you still using the Docker container? The CI grabs the latest version it can, while following the min/max version rules we set. These things pop up for me when I install once locally, and don't keep the packages up to date

@samukweku
Copy link
Collaborator Author

samukweku commented Aug 1, 2020

@hectormz Yes, I'm using the docker container, and the version there is 1.0.4. Thanks for sharing your CI knowledge by the way.

@ericmjl
Copy link
Member

ericmjl commented Aug 1, 2020

@hectormz and @samukweku I think I have an idea. I'm also using the Docker container. There's probably a cached layer that's causing this error to not be revealed. I'm going to clear out my cache:

docker system prune -a -f

and see what happens.

@ericmjl
Copy link
Member

ericmjl commented Aug 1, 2020

OK! Looping back and confirming that pandas 1.1.0 was the reason why we have had issues.

I suspect pandas 1.0.5 was what got resolved by conda, but in the CI system, 1.1.0 got installed because we pip install the package.

I’m going to suggest we mark the test as xfail with 1.1.0 as the reason why it’s failing. It’s not crucial at the moment, I think, for that test to be working, and with the new .loc changes in pandas, I think it means we can have some other package-wide changes underneath the hood (i.e. by not needing to do checks to make sure a column is present.)

@samukweku, if you've got the time, would you like to do the honors of finishing up the PR? If you're blocked because of time, let me know, I'll send a PR into your branch.

Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve! Let's leave this open for one more day to get any more comments in.

@ericmjl
Copy link
Member

ericmjl commented Aug 4, 2020

@samukweku it's past 24 hours! @hectormz, @zbarry or @szuckerman would you like to do the honors of hitting the big green button?

@hectormz
Copy link
Collaborator

hectormz commented Aug 4, 2020

@ericmjl I can do it, but @samukweku can you add a one line summary docstring? (Unless my phone is not showing it properly)

@samukweku
Copy link
Collaborator Author

@ericmjl I can do it, but @samukweku can you add a one line summary docstring? (Unless my phone is not showing it properly)

Hi @hectormz , not sure what you mean. Where should the one line summary docstring be located?

directions: Union[Tuple[str], List[str], str] = "down",
limit: Optional[int] = None,
) -> pd.DataFrame:
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a one-line docstring summary here?

janitor/functions.py Outdated Show resolved Hide resolved
janitor/functions.py Outdated Show resolved Hide resolved
janitor/functions.py Outdated Show resolved Hide resolved
janitor/functions.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
tests/functions/test_fill.py Outdated Show resolved Hide resolved
janitor/functions.py Outdated Show resolved Hide resolved
janitor/functions.py Outdated Show resolved Hide resolved
tests/functions/test_fill_direction.py Outdated Show resolved Hide resolved
tests/functions/test_fill_direction.py Outdated Show resolved Hide resolved
tests/functions/test_fill_direction.py Outdated Show resolved Hide resolved
tests/functions/test_fill_direction.py Outdated Show resolved Hide resolved
tests/functions/test_fill_direction.py Outdated Show resolved Hide resolved
tests/functions/test_fill_direction.py Outdated Show resolved Hide resolved
tests/functions/test_fill_direction.py Outdated Show resolved Hide resolved
tests/functions/test_fill_direction.py Outdated Show resolved Hide resolved
@hectormz
Copy link
Collaborator

hectormz commented Aug 4, 2020

@samukweku I realize I still haven't resolved the issue/PR on docstrings, so it should be clearer after that!

samukweku and others added 12 commits August 4, 2020 16:37
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
@ericmjl
Copy link
Member

ericmjl commented Aug 5, 2020

Anytime you’d like to, @hectormz, please do the honors!

@hectormz hectormz merged commit 350be8e into pyjanitor-devs:dev Aug 5, 2020
@samukweku samukweku mentioned this pull request Nov 25, 2020
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Forward/Backward Fill
3 participants