New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Pyjanitor fill method #704
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #704 +/- ##
=======================================
Coverage 93.16% 93.16%
=======================================
Files 16 16
Lines 600 600
=======================================
Hits 559 559
Misses 41 41 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @samukweku!
First off, tremendous job here! Thanks for kickstarting the fill API!
I'm making a bunch of suggestions to the front-facing API, mostly for the sake of consistency with the rest of the API. In particular, I think the implementation of fill()
can be simplified by making the API match that of find_replace
. The "category" of actions there are similar: map a set of column names to an action that is taking place per column name. I think we should simplify the API such that it follows the use of dictionaries, which is the Pythonic way of doing discrete mappings.
The only other thing we probably want to think about is whether we should have a fill
function or a fill_<something>
function as the name. In the category of "filling" functions, there's fillna()
(in pandas), fill_empty
(in pyjanitor), and so fill
feels a bit "God-like" in that it is named very generically. I know naming things is hard, so I'm going to propose a few straw man names, with their pros and cons. If you think none of them are good, please let me know, ignore the suggestions, and proceed as is.
fill_direction
: Makes the directionality aspect a bit more explicit, but the use ofdirection
in the name overlaps with what I've proposed as the API change (directions={"col1":"up", "col2": "down"}
)fill_in
: Doesn't have the name overlap problem, but is less explicit.fillin
: Matches more closely to thefillna
API, but I can imagine it being mistyped all the time asfilling
.
Having written all of that, maybe fill
is ok 😄.
Looking forward to the changes, @samukweku! I'll make an issue for the notebook, so that someone else can take it on. (Human map reduce is pretty awesome for these independent tasks.)
janitor/functions.py
Outdated
if not columns: | ||
raise ValueError("columns argument cannot be empty") | ||
|
||
if not isinstance(columns, (list, tuple, str)): | ||
raise TypeError( | ||
""" | ||
The columns argument should be a list/tuple of column names, | ||
a single column, or a string of columns separated by ',' | ||
""" | ||
) | ||
|
||
if not isinstance(directions, (list, tuple, str)): | ||
raise TypeError( | ||
""" | ||
The directions argument should be a list/tuple of directions, | ||
a single direction, or a string of directions separated by | ||
',' | ||
""" | ||
) | ||
|
||
# if columns/directions is a string, convert to list | ||
if isinstance(columns, str): | ||
columns = columns.split(",") | ||
if isinstance(directions, str): | ||
directions = directions.split(",") | ||
|
||
# check that all columns supplied exist in the dataframe | ||
non_existent_columns = set(columns).difference(df.columns) | ||
if any(non_existent_columns): | ||
# handles multiple columns | ||
if len(non_existent_columns) > 1: | ||
outcome = [f"'{word}'" for word in non_existent_columns] | ||
outcome = ", ".join(outcome) | ||
raise ValueError( | ||
f"Columns {outcome} do not exist in the dataframe" | ||
) | ||
outcome = "".join(non_existent_columns) | ||
raise ValueError(f"Column '{outcome}' does not exist in the dataframe") | ||
|
||
# check supplied directions exist in directions_options | ||
direction_options = ("up", "down", "updown", "downup") | ||
directions = [direction.lower() for direction in directions] | ||
incorrect_direction = set(directions).difference(direction_options) | ||
if any(incorrect_direction): | ||
raise ValueError( | ||
""" | ||
The direction should be either 'up', 'down', 'updown', 'downup', | ||
or a combination of the aforementioned options | ||
""" | ||
) | ||
|
||
# check that the length of columns matches the length of directions | ||
# if the length of directions is greater than 1 | ||
length_columns_arg = len(columns) | ||
length_directions_arg = len(directions) | ||
|
||
if length_columns_arg != length_directions_arg: | ||
if length_directions_arg > 1: | ||
raise ValueError( | ||
""" | ||
The number of arguments in the columns and | ||
directions do not match. | ||
""" | ||
) | ||
directions = directions * length_columns_arg | ||
|
||
fill_pairs = dict(zip(columns, directions)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the API change, a lot of these checks can be removed. I think the only ones that would be necessary when using a dictionary are that:
- It is not an empty dictionary
- All keys exist as columns in the dataframe,
- All values are one of the four valid options.
We'd have fewer lines of code to maintain this way.
@ericmjl , thanks for taking out the time to review the code. I wonder if it is better to lump the code, somewhat, into fill_empty:
if the For the names suggested, |
@samukweku I'm just catching up to speed on the issue/PR. FYI, you've been tagging me at |
Haha!, my bad. Now I know the right way to get your attention |
The PR's been open for a few days, and we've not had much in the way of replies. @samukweku, I propose we go forward with |
Co-authored-by: Eric Ma <ericmjl@users.noreply.github.com>
the failing test is from |
@samukweku I think I see what’s going on! Could you change the expected error raised from |
@ericmjl I did change it to |
@samukweku hmm, I'm quite baffled myself. Can you decorate it with https://docs.pytest.org/en/latest/skipping.html#xfail-mark-test-functions-as-expected-to-fail In doing this, we'll at least preserve the test, but it won't break the CI unnecessarily. |
@ericmjl @samukweku I'm intrigued! I'll look at this quickly myself |
@samukweku @ericmjl I figured it out 🤔 So Pandas 1.1.0 came out in the past few days, and the CI is grabbing the latest version. You likely have something like 1.0.5 locally. According to the release notes:
So, we should either be flexible here, or set a maximum or minimum pandas version for now (1.1.0) |
Are you still using the Docker container? The CI grabs the latest version it can, while following the min/max version rules we set. These things pop up for me when I install once locally, and don't keep the packages up to date |
@hectormz Yes, I'm using the docker container, and the version there is 1.0.4. Thanks for sharing your CI knowledge by the way. |
@hectormz and @samukweku I think I have an idea. I'm also using the Docker container. There's probably a cached layer that's causing this error to not be revealed. I'm going to clear out my cache:
and see what happens. |
OK! Looping back and confirming that pandas 1.1.0 was the reason why we have had issues. I suspect pandas 1.0.5 was what got resolved by conda, but in the CI system, 1.1.0 got installed because we pip install the package. I’m going to suggest we mark the test as xfail with 1.1.0 as the reason why it’s failing. It’s not crucial at the moment, I think, for that test to be working, and with the new @samukweku, if you've got the time, would you like to do the honors of finishing up the PR? If you're blocked because of time, let me know, I'll send a PR into your branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve! Let's leave this open for one more day to get any more comments in.
@samukweku it's past 24 hours! @hectormz, @zbarry or @szuckerman would you like to do the honors of hitting the big green button? |
@ericmjl I can do it, but @samukweku can you add a one line summary docstring? (Unless my phone is not showing it properly) |
Hi @hectormz , not sure what you mean. Where should the one line summary docstring be located? |
janitor/functions.py
Outdated
directions: Union[Tuple[str], List[str], str] = "down", | ||
limit: Optional[int] = None, | ||
) -> pd.DataFrame: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a one-line docstring summary here?
@samukweku I realize I still haven't resolved the issue/PR on docstrings, so it should be clearer after that! |
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Anytime you’d like to, @hectormz, please do the honors! |
PR Description
Please describe the changes proposed in the pull request:
fill
method for forward and backward fills on selected columns of a dataframe.This PR resolves #700 .
Please ensure that you have done the following:
<your_username>
:dev
, but rather from<your_username>
:<feature-branch_name>
.AUTHORS.rst
.CHANGELOG.rst
under the latest version header (i.e. the one that is "on deck") describing the contribution.Quick Check
To do a very quick check that everything is correct, follow these steps below:
make check
from pyjanitor's top-level directory. This will automatically run:Once done, please check off the check-box above.
If
make check
does not work for you, you can execute the commands listed in the Makefile individually.Code Changes
If you are adding code changes, please ensure the following:
$ pytest .
) locally on your machine.Documentation Changes
If you are adding documentation changes, please ensure the following:
Relevant Reviewers
Please tag maintainers to review.