Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Pandas String Methods in a Single Class #694

Merged
merged 39 commits into from Jul 13, 2020
Merged

[ENH] Pandas String Methods in a Single Class #694

merged 39 commits into from Jul 13, 2020

Conversation

samukweku
Copy link
Collaborator

@samukweku samukweku commented Jul 3, 2020

PR Description

Please describe the changes proposed in the pull request:

  • Single function to access pandas' string methods
  • Small subset to see if the current method is acceptable

This PR resolves #360 .

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.rst.
  1. Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Quick Check

To do a very quick check that everything is correct, follow these steps below:

  • Run the command make check from pyjanitor's top-level directory. This will automatically run:
    • black formatting
    • flake8 checking
    • running the test suite
    • docs build

Once done, please check off the check-box above.

If make check does not work for you, you can execute the commands listed in the Makefile individually.

Code Changes

If you are adding code changes, please ensure the following:

  • Ensure that you have added tests.
  • Run all tests ($ pytest .) locally on your machine.
    • Check to ensure that test coverage covers the lines of code that you have added.
    • Ensure that all tests pass.

Documentation Changes

If you are adding documentation changes, please ensure the following:

  • Build the docs locally.
  • View the docs to check that it renders correctly.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl
Copy link
Member

ericmjl commented Jul 3, 2020

@samukweku thanks for the PR! I have to admit my brain is struggling a bit to decide whether this goes in functions.py or another submodule. Can you share your thoughts here?

The other piece I'm noticing is that some of the names are inconsistent with the string methods. Would it make sense to simply wrap them verbatim?

@samukweku
Copy link
Collaborator Author

@ericmjl Thanks for the feedback!
Another submodule is not a bad idea. However, in functions.py we have find_replace, which has a string method attached to it, and filter_string which uses str_contains underneath. As such, I am not sure if that would create confusion.

For the names, the only one is concat, which i felt conveyed what the function was to do, compared to pandas' string cat. But, seeing that users would be more familiar with the pandas' naming, I'll stick with pandas' names.

The other thought I had, which I want feedback on, applies to chaining. An example could be:

 df.string.split(column='A', pat=',').string.get(2)

Does it look pleasing to the eye? the multiple string calls? Dont know if it is something that would put users off.

@hectormz
Copy link
Collaborator

hectormz commented Jul 4, 2020

Is this meant to handle a subset of pandas.str methods, or all of them? In the past when I looked at this issue I was intrigued by using a programmatic way to wrap them all

@samukweku
Copy link
Collaborator Author

@hectormz Yes, handle all of them. So, just one class to access all string methods

@hectormz
Copy link
Collaborator

hectormz commented Jul 4, 2020

Okay got it. My concern is that there are about 53 pandas.Series.str methods.
I'm torn by the manual aspect of it, and would like to see something like the programmatic/generic approach @szuckerman mentioned. But it would also need to be obvious and documented enough for users to understand what they're using.

@ericmjl
Copy link
Member

ericmjl commented Jul 4, 2020

@hectormz Yes, handle all of them. So, just one class to access all string methods

I agree with you, @samukweku, this should be the end goal, being able to chain up all of the string methods.

Okay got it. My concern is that there are about 53 pandas.Series.str methods.
I'm torn by the manual aspect of it, and would like to see something like the programmatic/generic approach @szuckerman mentioned. But it would also need to be obvious and documented enough for users to understand what they're using.

To make it easier to refer to, here's all of the string methods. I agree with @hectormz and @szuckerman - an automated thing might be better.

@samukweku, I think we can approach this task in a slightly smarter way than manually wrapping each function manually. This should probably be done over a few PRs, but the general idea probably would involve the following functions:

  • A function to identify all pandas string methods, and return their names.
  • A wrapper class that checks a function call for its availability as a string method, and executes the function.

From my prior experience, you might end up using the object.__dict__ representation to make this dynamic passthrough a reality. I'm not sure how deep you'll have to go with calling object.__dict__["something"].__dict__["somethingelse"] to consistently passthrough all of the functions, though, so it might take some experimenting. That said, I'm quite sure that this is a supremely fun programming problem!

Now, I just wanted to also mention that even though there may be a more automated way of approaching this task, your work thus far in this PR is still valuable, @samukweku, because it prompted the conversation and realization of how much work it would be to wrap all of the methods manually. So thank you for taking the initiative here!

@samukweku
Copy link
Collaborator Author

Thanks for the feedback @ericmjl @hectormz !

@VPerrollaz
Copy link
Collaborator

  • I personally believe that a string module might make the methods more discoverable and keep the namespaces tidy.
  • From a technical viewpoint, there might also be some useful stuff to pick in this for the __doc__, __name__ and __signature__ dynamic manipulation.
  • However this would only help users in ipython/jupyter sessions, I'm not sure how to generate static documentation with sphynx (or anything else actually) from this kind of monkey patching.

@samukweku
Copy link
Collaborator Author

samukweku commented Jul 9, 2020

Hi Team. Made some updates based on the suggestions. As always, love to get your feedback and suggestions for further improvement.

@codecov
Copy link

codecov bot commented Jul 9, 2020

Codecov Report

Merging #694 into dev will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##              dev     #694   +/-   ##
=======================================
  Coverage   93.16%   93.16%           
=======================================
  Files          16       16           
  Lines         600      600           
=======================================
  Hits          559      559           
  Misses         41       41           

@ericmjl
Copy link
Member

ericmjl commented Jul 9, 2020

Wonderful, wonderful effort, @samukweku! Thank you for handling this. I will be reviewing it this evening, after I finish my work today :).

Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggested edits, @samukweku, for consistency. Otherwise, great work, thanks for doing the wrapping piece!

Let's get another review in before merging.

janitor/functions.py Outdated Show resolved Hide resolved
janitor/functions.py Outdated Show resolved Hide resolved
Comment on lines 4079 to 4086
data = [
func.__name__
for _, func in inspect.getmembers(pd.Series.str, inspect.isfunction)
if not func.__name__.startswith("_")
]

if string_function not in data:
raise KeyError(f"{string_function} is not a Pandas string method.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WOW! 🎉

tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@hectormz hectormz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @samukweku ! I echo Eric.

I'll take his docstring requests one small step further by asking for periods at the end of each.

Currently, pandas will raise an exception if an invalid kwarg is provided right? Should we leave that as is, or add our JanitorError into the mix as well?

I'm assuming that the tests are the same ones you made when you started to implement this manually. Do you think they are sufficient, or are there other tests for other functions that would cover some corner cases? It would be too much to test them all, but just a thought that came to mind.

@samukweku
Copy link
Collaborator Author

@ericmjl @HectorM14 Thanks! got some guidance from @VPerrollaz script, so it's a community effort :) I will add more tests, and possibly check Stack Overflow or some blogs for some edge cases. Not familiar with JanitorError, so cant say much on that; I default to Pandas exception handling. As always, more feedback are welcome.

@hectormz
Copy link
Collaborator

@samukweku You don't have to add new tests. I was just curious if they were picked for a reason, or if there is an ideal set that covers different issues.

Do all the string methods return a series?

@samukweku
Copy link
Collaborator Author

@HectorM14 , yes, all the string methods return a series, same as in Pandas. The tests were picked at random, just testing for those that needed extra parameters and just one test for string methods that required no parameters

@hectormz
Copy link
Collaborator

mamba IRL! 🤩

janitor/functions.py Outdated Show resolved Hide resolved
janitor/functions.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
tests/functions/test_process_text.py Outdated Show resolved Hide resolved
@hectormz
Copy link
Collaborator

If the conda/mamba solving is fixed, I just added some requested changes to tighten up the docstrings and it's all good to go @samukweku !

samukweku and others added 7 commits July 13, 2020 15:22
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
Co-authored-by: Hector <23343812+hectormz@users.noreply.github.com>
@samukweku
Copy link
Collaborator Author

Yes, problem fixed. @hectormz @ericmjl how did you know what to fix for the build error? New to this, so questions might be silly. When I checked the view log, the only error I noticed was deploy.provider : unknown value pages:git.

@samukweku
Copy link
Collaborator Author

@ericmjl Also, how do I fix the deepsource error?

@ericmjl
Copy link
Member

ericmjl commented Jul 13, 2020

@samukweku when I looked at the build log, I saw that only that particular build stalled at the docker container build step in which the conda env solving piece looked slow.

As such, I had the following hypotheses:

  • Something went wrong with conda
  • Something went wrong with docker build

At first, I thought it was conda stalling. In writing the response to you, I then realized, oh, the first conda step actually worked, so maybe the docker build was the real issue. But I wasn’t sure how to debug what’s going on in a Docker container build on a remote machine, so I thought, let’s just disable it for now, since it’s not 100% crucial for the package. It’s merely helpful for development.

That was my thought process! Hopefully that’s informative for you too :).

@ericmjl
Copy link
Member

ericmjl commented Jul 13, 2020

As for the current errors, let’s see.

For DeepSource: does the deep source config file (.deepsource.toml) exist in your fork?

For the Azure builds, looks like one was an HTTP error. That just needs a restart. I think I have added you to the pipelines, @samukweku. Can you see if you can restart the build? (You might need to login to Azure again.)

@samukweku
Copy link
Collaborator Author

Didn't have to do anything except push new commit. Fingers crossed.

@ericmjl
Copy link
Member

ericmjl commented Jul 13, 2020

Everything looks good to me. @hectormz would you like to do the honors?

@hectormz
Copy link
Collaborator

@ericmjl will do! just fixed one last thing in .rst's. Will wait for the CI to finish and then will merge!

@hectormz hectormz merged commit fb56198 into pyjanitor-devs:dev Jul 13, 2020
@ericmjl
Copy link
Member

ericmjl commented Jul 13, 2020

🎉

@samukweku samukweku deleted the pyjanitor_string_wrapper branch July 13, 2020 20:59
@samukweku samukweku self-assigned this Nov 28, 2020
@samukweku samukweku mentioned this pull request Dec 4, 2020
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] Wrap pandas string methods
4 participants