
Select_Columns Function Added Suggestion... #77

Closed
jcvall opened this issue Dec 4, 2018 · 29 comments
Labels
enhancement New feature or request

Comments

@jcvall
Contributor

jcvall commented Dec 4, 2018

I see there is a remove_columns function. I think a select_columns function would be a nice addition. It would be cleaner and easier to understand than df[['col1', 'col2', 'col3']].

@ericmjl
Member

ericmjl commented Dec 4, 2018

@jcvall if you were to design the API for this function, what would it be? What would be the relevant arguments, and what would it return? What corner cases might need to be taken care of?

I'm asking because I'd like to solicit API designs that will eventually get used, rather than design what I think is useful only to find it doesn't get used. Also, some initially simple-looking functions (e.g. clean_names) turned out to have a bunch of possible additional kwargs that were useful that I didn't think of at first glance!

This is your opportunity to shape what the API looks like, too!

@ericmjl ericmjl added the enhancement New feature or request label Dec 4, 2018
@ericmjl ericmjl assigned ericmjl and unassigned ericmjl Dec 4, 2018
@ericmjl
Member

ericmjl commented Dec 5, 2018

At first glance, now that I have had some time to think about it, an API that might be useful would look like this:

df = (
    pd.DataFrame(...)
    .select_columns(col_names=['col1', 'col2', 'col3', ...])
)

One could have an optional invert=True, such that we select the complement of the passed-in column names:

df = (
    pd.DataFrame(...)
    .select_columns(col_names=['col1', 'col2', 'col3', ...], invert=True)
)

@jcvall any thoughts? The only thing I think needs a second look is whether invert is the best kwarg name.

@jcvall
Contributor Author

jcvall commented Dec 6, 2018

I have been coding for a few years in Python and R, but I apologize because I am new to this (GitHub).

This is exactly what I had in mind! I think it will be well used. One of the first things I do is select my columns when I start a piece of code. I use df[['col1','col2',...]] in Python and %>% select() in R all the time, so this would probably be used in conjunction with most functions in this package. I gave it a stab (with the select_columns function) last night, but I am honestly not at the level yet to edit code in a package (I hope to be one day soon).

What I would love to do is create a cheat sheet for pyjanitor (like we see in RStudio). Your package has a lot going on; I really find myself just importing pandas, numpy, and janitor and I am off and running. There is so much good stuff here, though, that I need a cheat sheet.

By the way, invert is great (unselect, deselect, or remove could also work, but remove would not be my first choice). I would like to help more, so I will be looking for ways to make a contribution.

@ericmjl
Member

ericmjl commented Dec 6, 2018

@jcvall, if you're willing to do so, I'd like to invite you to paste your implementation here, where I can guide you through the programming model needed to make select_columns a thing.

I can see how the code should work, but I'm happy to give the opportunity to newcomers. Everybody had a start somewhere, including myself, and in my case, it was with the matplotlib devs, doing some very simple and repetitive things.

Let me know if you'd like to give it a shot - it will be a rewarding process! If not, no worries, I can see how the code would work, and can put up a first pass on select_columns in the near future.

@jcvall
Contributor Author

jcvall commented Dec 6, 2018

I’m in. I will give it a go.


@jcvall
Contributor Author

jcvall commented Dec 9, 2018

Here we go....a little trial and error and this is what I have so far.

def select_columns(df, col, invert=False):

    if invert == False:
        df = df[col]
        return df

    elif invert == True:
        df = df.drop(columns=col)
        return df

@ericmjl
Member

ericmjl commented Dec 9, 2018

@jcvall looks great! I'd do a few modifications, with explanations below:

from typing import List
import pandas as pd

#[0]
def select_columns(df: pd.DataFrame, columns: List, invert: bool = False):
    """ #[1]
    Method-chainable selection of columns.

    Optional ability to invert selection of columns available as well.

    Method-chaining example:

    .. code-block:: python
        
        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param invert: Whether or not to invert the selection. This will result in selection of
        the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    #[2]
    if invert:
        #[3]
        return df.drop(columns=columns)

    else:
        return df[columns]

Note 0: Used the keyword argument columns, to preserve consistency with the rest of the API. Currently coalesce, concatenate_columns, encode_categorical, fill_empty, get_dupes and label_encode all use the columns kwarg. Consistency and repetition helps with learning the API better.

Note 1: I added a docstring. Docstrings help us document the intent behind the function and its parameters, and they get automatically built as part of the docs. (It's pretty amazing what readthedocs does, be sure to support the project!)

Note 2: It's usually pythonic to write boolean conditionals as:

if X:
    # do something here
else:
    # do something else here

Note 3: This is inspired by a bit of technical detail, but if I remember correctly, df.drop(columns=columns) results in the data being copied. I'm OK with that for now, but we may in the future end up with a PR from a certain colleague of mine (@zbarry, ahem ahem) who might propose a modification that doesn't result in the dataframe being copied. This is related to issues #76 and #79, for which I think we will end up needing a sane default sometime in the future.
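To see the copy behavior in action, here is a quick check (plain pandas, not pyjanitor code) showing that drop returns a new frame and leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
dropped = df.drop(columns=['b'])

dropped['a'] = 0                    # mutate the result of drop()...
assert df['a'].tolist() == [1, 2]   # ...the original frame is unchanged
assert list(dropped.columns) == ['a']
```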

Are you familiar with Git and software development workflows? If not, ping back here, I'm happy to guide you through it. We can also do a Google Hangouts call if you'd like. I'd like your contribution to be recorded in the version control history of the repository.

@zbarry
Collaborator

zbarry commented Dec 9, 2018

🤔

@jcvall
Contributor Author

jcvall commented Dec 10, 2018

Wow. Awesome. I don’t know git that well. Would be honored to have you walk me through it. A google chat is cool with me. My schedule is crazy with the kids, but might have a window on the weekends. If that doesn’t work you can guide me here. Again thanks.


@ericmjl
Member

ericmjl commented Dec 10, 2018

@jcvall totally happy to do so here. That way we can avoid trying to coordinate times. Asynchronous work is always good, even better when there's no pressures involved.

Pardon me if the instructions below cover what you already know; I'm working off the assumption that your git level is at a Software Carpentry student's level (i.e. "may have typed some git commands, but don't really have the mental model").

First off, fork the repository.
[screenshot]

This will give you a copy of the repository that lives under your username on GitHub. It doesn't automatically sync with my copy, as you'll need to execute some commands to make that happen.

Secondly, clone the repository locally. There are two options: you can do it by SSH, or you can do it by HTTPS. Both give the exact same result, but SSH may be simpler once you've set it up. By convention, under my home folder on my Mac, I have a folder in which I place all of my projects that reside on GitHub. Cloning a repository simply results in a folder that lives on your local hard drive, and that one is a copy of the fork that you have. I would suggest that you use the GitHub Desktop application, if you're not familiar with Git commands at the command line. It'll give you an easier time ramping up. (Learning the git commands, though, is extremely useful, particularly for automation purposes.)

[screenshot]

The cloned repository will have the exact history of your forked repository which, at the time you forked it, had the exact history of my repository. Canonically, my repository is considered "upstream" to your fork, and your fork is the "origin" for your local clone.

What you'll want to do now is create a branch on your local copy. (GitHub Desktop doesn't show you all of the files in the repository, only the ones that have changed.) In the GitHub Desktop interface, you can click on the branches drop-down menu, and then click "create branch". This branch will house your changes to the source code without affecting your "master" branch.

[screenshot]

Give it an informative name, e.g. select-columns, and then start making changes to the source code, in this case, by copy/pasting my modified version of your function into the functions.py file.

When you're done with those steps, ping back here, happy to guide you through the rest of the steps involved.

@szuckerman
Collaborator

@jcvall, this is an awesome suggestion! I was an R coder for many years and used dplyr::select often.

What's great about the R version of select is that it has other ways of selecting columns, such as starts_with or ends_with, which are really useful when trying to slice down a data.frame.

And you can join those as well. For example, you can pass ['col1', ends_with('name')] to match col1 and any column that ends with name.

I'd be willing to add those features if you'd like.
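For readers curious what such helpers might look like, here is a minimal sketch (the names ends_with and resolve_columns are hypothetical, not part of pyjanitor):

```python
import pandas as pd

# Hypothetical helper: returns a predicate matching column-name suffixes.
def ends_with(suffix):
    return lambda col: col.endswith(suffix)

# Hypothetical resolver: expands a mixed list of exact names and
# predicates into a flat list of matching column names.
def resolve_columns(df, specs):
    selected = []
    for spec in specs:
        if callable(spec):
            selected.extend(col for col in df.columns if spec(col))
        else:
            selected.append(spec)
    return selected

df = pd.DataFrame({'col1': [1], 'first_name': ['a'], 'last_name': ['b']})
resolve_columns(df, ['col1', ends_with('name')])
# → ['col1', 'first_name', 'last_name']
```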

@jcvall
Contributor Author

jcvall commented Dec 13, 2018

Sure. I'd like to learn how to do this, so I am OK with you using what Eric did above and just adding to it. Can you copy the code here? Then I will give it my best shot at uploading it to GitHub. Very excited! I have some other possible ideas and would like to jump in to help when I can. Like Eric said, you've got to start somewhere.


@ericmjl
Member

ericmjl commented Dec 13, 2018

@jcvall thanks for the enthusiasm! However, let's go slowly with this. It's okay to have a feature-limited first pass on the PR, and I'd actually prefer it, because it makes reviewing the code much easier. 😄

@jcvall
Contributor Author

jcvall commented Dec 14, 2018

Eric, sounds like a plan. I saved the function on my computer under my github desktop folder. Ready for further instructions. 👍


@ericmjl
Member

ericmjl commented Dec 14, 2018

Ok! Now, what you'll want to do is to "push" those changes to your fork of pyjanitor that resides on GitHub. To do so:

  1. In the GitHub Desktop interface, publish your branch. (There should be a button at the top.)

[screenshot]

  2. GitHub Desktop will also show you the files that have changed. You'll want to enter a commit message.

[screenshot]

This commit message gives me a summary of the changes that you have made so far.

  3. Hit the blue commit button. Then you'll see that you can "Push Origin" in the top toolbar. Click that. It will take the changes that are local to your computer and send them up to your fork of pyjanitor that lives on GitHub.

[screenshot]

  4. Now go to the main GitHub pyjanitor page. You should see something similar to this:

[screenshot]

You'll want to now open a "pull request" - you're requesting that I pull the changes from your fork into my branch.

[screenshot]

Fill out the two text boxes. The top one contains a summary of changes, and the bottom one contains a checklist for you to follow. Follow the instructions there carefully, and then submit your PR!

@CWen001
Contributor

CWen001 commented Feb 20, 2019

Although this issue was closed, may I add some thoughts here about the select_columns function?
Would it be possible to enhance it, or to add some select helpers that make selecting more flexible?

Let's take the example from this question on StackOverflow:

df = pd.DataFrame({'foo.aa': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
                   'foo.fighters': [0, 1, np.nan, 0, 0, 0],
                   'foo.bars': [0, 0, 0, 0, 0, 1],
                   'bar.baz': [5, 5, 6, 5, 5.6, 6.8],
                   'foo.fox': [2, 4, 1, 0, 0, 5],
                   'nas.foo': ['NA', 0, 1, 0, 0, 0],
                   'foo.manchu': ['NA', 0, 0, 0, 0, 0]})

If I only want the columns starting with 'foo', I currently have to prepare the selection first:
filter_col = [col for col in df if col.startswith('foo')] and then use
df[filter_col] or df.select_columns(filter_col).

I'm wondering if it is possible to save that effort inside select_columns() by adding some arguments or helpers like starts_with=' '. The same logic could apply to ends_with, contains, and regex - just like the powerful select() function in dplyr, which has a number of special helper functions that only work inside select.

Thank you all for making this wonderful library for users to enjoy EDA.

@zbarry
Collaborator

zbarry commented Feb 20, 2019

I wonder if a glob (foo*, *foo*, *foo) like we were talking about for something else (or this) would be best? Or glob + regex, depending on how much fun we want to have with it.
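For reference, the standard library's fnmatch module already does shell-style glob matching over a list of names:

```python
import fnmatch
import pandas as pd

df = pd.DataFrame({'foo.aa': [1], 'foo.bars': [0], 'bar.baz': [5]})
cols = fnmatch.filter(df.columns, 'foo*')  # shell-style glob over column names
df[cols]  # selects only the foo.* columns
```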

@ericmjl
Member

ericmjl commented Feb 20, 2019

@CWen001 that'd be a great feature to add! And I think @zbarry's globbing idea might be the best implementation start, for the following reasons:

  • We can limit the number of kwargs in the function, and hence limit the amount of control flow implemented inside. My nxviz library is a bit hard to maintain because of the amount of control flow inside.
  • The interface remains nicely declarative, which is more in line with a functional style.

@CWen001, do you have any bandwidth to give this a shot? Both @zbarry and I are swamped at work at the moment (I am juggling two medium-sized projects, while he's on at least 3), but we can help guide you through the process if you're up for it.

@CWen001
Contributor

CWen001 commented Feb 20, 2019

Thank you very much for the quick replies and suggestions. I'm still a Python newbie with limited experience, but I am willing to learn by doing. I tried my best to finish this first draft after following your instructions above. Please have a look, and feel free to correct and guide.

Four function parameters with default value None are added: starts_with, ends_with, contains, and regex. As you can tell, with the current control flow only one way of selecting is enabled: we either provide a list of column names or use one of these four new ways. Due to my limited experience, I couldn't find a convenient way to implement ['col1', ends_with('name')] as suggested by @szuckerman.

import re
from typing import List

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def select_columns2(df: pd.DataFrame = None, columns: List = None, starts_with: str = None,
                    ends_with: str = None, contains: str = None, regex: str = None,
                    invert: bool = False):
    """
    Method-chainable selection of columns.

    Optional ability to invert selection of columns available as well.

    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param starts_with: Case-sensitive string used to select columns whose names start with it.
    :param ends_with: Case-sensitive string used to select columns whose names end with it.
    :param contains: Case-sensitive string used to select columns whose names contain it.
    :param regex: A regex pattern used to select columns whose names match it.
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    if starts_with:
        if invert:
            return df.drop(columns=[col for col in df if col.startswith(starts_with)])
        else:
            return df[[col for col in df if col.startswith(starts_with)]]

    if ends_with:
        if invert:
            return df.drop(columns=[col for col in df if col.endswith(ends_with)])
        else:
            return df[[col for col in df if col.endswith(ends_with)]]

    if contains:
        if invert:
            return df.drop(columns=[col for col in df if contains in col])
        else:
            return df[[col for col in df if contains in col]]

    if regex:
        if invert:
            return df.drop(columns=[col for col in df if re.match(regex, col)])
        else:
            return df[[col for col in df if re.match(regex, col)]]

    if invert:
        return df.drop(columns=columns)
    else:
        return df[columns]

Maybe we could further improve the control flow. I'm also thinking about adding a boolean parameter keep_others=False, which would decide whether to select columns or just rearrange their order. Like the dplyr code select(df, var3, everything()), which simply puts var3 first, it would be nice to reflect the clean design philosophy of pyjanitor: bring the handy features of R over in a pythonic way.
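The everything()-style reordering mentioned above can be approximated in plain pandas with a one-liner (a sketch, not an implemented pyjanitor feature):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
# Move 'c' to the front; the remaining column order is preserved.
cols = ['c'] + [col for col in df.columns if col != 'c']
df[cols]
```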

@ericmjl
Member

ericmjl commented Feb 20, 2019

@CWen001 that's a great starter implementation! I modified it a little bit. Yes, it looks like control flow is unavoidable here; that said, I think we can shorten the number of kwargs in the function signature, which makes it much more flexible for necessary expansion in the future.

What do you think about my modification of your code? If you like it, please go ahead and PR it in - I'd love for the credit to go to you, because you initiated it!

import re
from typing import List

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def select_columns(df: pd.DataFrame = None, columns: List = None,
                   search_string: str = None, method: str = 'exact',
                   invert: bool = False):
    """
    Method-chainable selection of columns.

    Optional ability to invert selection of columns available as well.

    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param search_string: The string to match column names against.
    :param method: A string that specifies how to select the columns. Supported strings include:
        "exact", "starts_with", "ends_with", "contains", and "regex". "exact" is the default.
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    # Check that the method name exists inside the approved set of methods.
    methods = ["exact", "starts_with", "ends_with", "contains", "regex"]
    if method not in methods:
        raise KeyError(f'method kwarg must be one of {methods}')

    # Every method except "exact" requires that search_string be specified.
    # Only "exact" requires that columns be specified.
    if method == "exact":
        if columns is None:
            raise ValueError("`columns` must be specified!")
    else:
        if search_string is None:
            raise ValueError("`search_string` must be specified!")

    # Get the set of columns specified, according to the method specified:
    if method == 'starts_with':
        columns = [col for col in df if col.startswith(search_string)]
    elif method == 'ends_with':
        columns = [col for col in df if col.endswith(search_string)]
    elif method == "contains":
        columns = [col for col in df if search_string in col]
    elif method == "regex":
        columns = [col for col in df if re.match(search_string, col)]
    # This is the case where "exact" is specified. Leaving this here to be
    # explicit about this scenario being handled.
    else:
        pass

    # Finally, identify whether an inversion is requested or not.
    if invert:
        return df.drop(columns=columns)
    else:
        return df[columns]

@zbarry
Collaborator

zbarry commented Feb 20, 2019

I found out you might be able to get fancy and use: https://docs.python.org/3/library/fnmatch.html#fnmatch.translate

You can then translate an arbitrary *-containing expression to your regex and avoid all those if/elif statements for before, after, exact and replace them with method in {None, 'glob', 'regex'}
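A quick demonstration of fnmatch.translate, which compiles a glob into a regex source string:

```python
import re
from fnmatch import translate

pattern = translate('*foo*')  # glob source → anchored regex source string
assert re.match(pattern, 'my_foo_col')
assert not re.match(pattern, 'bar_col')
```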

See what you think about this:

import re
from fnmatch import translate
from typing import List

import pandas as pd


def select_columns(df: pd.DataFrame = None, columns: List = None,
                   search_string: str = None, method: str = None,
                   invert: bool = False):
    """
    Method-chainable selection of columns.

    Optional ability to invert selection of columns available as well.

    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param search_string: Either:
        1) an exact string to look for
        2) a regular expression used to match column names
        3) a shell-style glob string (e.g., `*_thing_*`)
    :param method: A string that specifies how to select the columns. Supported
        methods are {"glob", "regex"}.
        1) Default (`None`) indicates to look for an exact match.
        2) "glob": support shell-style globbing for column names
            (uses `fnmatch.translate`).
        3) "regex": regex (`re` module)-based string matching.
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    # Check that the method name exists inside the approved set of methods.
    supported_methods = [None, "glob", "regex"]

    if method not in supported_methods:
        raise ValueError(f'`method` keyword argument must be in '
                         f'{supported_methods}')

    if method is None and columns is None:
        # Only exact matches require that columns be specified.
        raise ValueError("`columns` must be specified.")

    elif method is not None and search_string is None:
        # For the regex and glob methods, search_string must be specified.
        raise ValueError("`search_string` must be specified.")

    # If a method is specified, extract the set of columns accordingly.
    if method == "glob":
        search_string = translate(search_string)

    if method in ["glob", "regex"]:
        columns = [col for col in df if re.match(search_string, col)]

    # Finally, identify whether an inversion is requested or not.
    if invert:
        return df.drop(columns=columns)
    else:
        return df[columns]

Completely untested, so no warranty...

@ericmjl
Copy link
Member

ericmjl commented Feb 20, 2019

WHOOOOOOO! Look who's getting fancy here, @zbarry!

@CWen001, feel free to take either of our modified implementations (both will probably work, though I think Zach's is more elegant!), add examples and tests, and submit a PR! I'd love to see your contribution merged on 😄.

@zbarry
Collaborator

zbarry commented Feb 20, 2019

Can you tell how bored I am? This allows method to be your own custom function.

import re
from fnmatch import translate
from typing import Union, List, Callable

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def select_columns(df: pd.DataFrame = None, columns: List = None,
                   search_string: str = None,
                   method: Union[str, Callable] = None, invert: bool = False):
    """
    Method-chainable selection of columns.

    Optional ability to invert selection of columns available as well.

    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param search_string: Either:
        1) an exact string to look for
        2) a regular expression used to match column names
        3) a shell-style glob string (e.g., `*_thing_*`)
        4) a string conforming to the specifications used by your supplied
            callable object (see `method`).
    :param method: A string or function that specifies how to select the
        columns. Supported methods are {"glob", "regex"}.
        1) Default (`None`) indicates to look for an exact match.
        2) "glob": support shell-style globbing for column names
            (uses `fnmatch.translate`).
        3) "regex": regex (`re` module)-based string matching.
        4) A callable object (function, object containing `__call__`, method)
            which takes each column name in turn and returns True on a match.
            If `search_string` is not None, it is passed to your callable as
            the second parameter.
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    # Check that the method name exists inside the approved set of methods.
    supported_methods = [None, "glob", "regex"]

    if method not in supported_methods and not callable(method):
        raise ValueError(f'`method` keyword argument must be in '
                         f'{supported_methods} or callable.')

    if method is None and columns is None:
        # Only exact matches require that columns be specified.
        raise ValueError("`columns` must be specified.")

    elif method in ["glob", "regex"] and search_string is None:
        # For the regex and glob methods, search_string must be specified.
        raise ValueError("`search_string` must be specified.")

    # If a method is specified, extract the set of columns accordingly.
    if callable(method):
        args = [search_string] if search_string is not None else []
        columns = [col for col in df if method(col, *args)]

    if method == "glob":
        search_string = translate(search_string)

    if method in ["glob", "regex"]:
        columns = [col for col in df if re.match(search_string, col)]

    # Finally, identify whether an inversion is requested or not.
    if invert:
        return df.drop(columns=columns)
    else:
        return df[columns]

@CWen001
Contributor

CWen001 commented Feb 21, 2019

Thanks for the feedback! Both versions of the modification look much more efficient. I immediately learned that it is a good idea to apply the invert check last, just once :)

The latter version from @zbarry is so powerful that it should probably be the one. May I ask if I could work further on the PR this weekend? I don't mean to claim the credit; I'm just grateful to take the chance to learn git, glob, and callables. Thank you all.

@szuckerman
Collaborator

szuckerman commented Feb 21, 2019

I think the above looks great!

A few thoughts to trim it down though:

  1. Maybe calling it just select instead of select_columns to maintain more parity with SQL?

  2. I see what's trying to be done with method, but I don't think it's really necessary. If we use @zbarry's suggestion of fnmatch then strs will match exactly (through regex), globs will match through regex and, (obviously) regex will match through regex.

  3. Of course this leaves off the method option, but I'm not sure how many people will really use it. And if people need to use it they can just use an iterator.

For example, say I have a DataFrame with 100 columns whose names are numbers prefixed with "col_", e.g. "col_1", "col_2", etc.

If I want all the even-numbered columns, I could merely do:

even_cols = (f'col_{i}' for i in range(1, 101) if i % 2 == 0)

and pass that as the columns argument.

This would cut down the code for the entire select method to something like:

        search_string = translate(search_string)
        columns = [col for col in df if re.match(search_string, col)]
        return df.drop(columns=columns) if invert else df[columns]

To do what I was mentioning above with ["col_name", "col_*"] you would just need to take the code above, but have a helper list that maintains the full result. Something like this:

#search_cols is something like ["col_name", "col_*"]

full_column_list = []

for col in search_cols:
    search_string = translate(col)
    columns = [col for col in df if re.match(search_string, col)]
    full_column_list.extend(columns)

return df.drop(columns=full_column_list) if invert else df[full_column_list]

@ericmjl
Member

ericmjl commented Feb 21, 2019

May I ask if I could work further on the PR this weekend? I don't mean to claim the credit; I'm just grateful to take the chance to learn git, glob, and callables. Thank you all.

@CWen001 totally, no problem! Looking forward to seeing what you have. Also, don't forget to consider @szuckerman's suggestion as well. 😄

@zbarry
Collaborator

zbarry commented Feb 21, 2019

May I ask if I could work further on the PR this weekend? I don't mean to claim the credit; I'm just grateful to take the chance to learn git, glob, and callables. Thank you all.

It's your show. Learning by doing is the only way I learn (to really understand something), myself :)

@zbarry zbarry reopened this Feb 21, 2019
@CWen001
Contributor

CWen001 commented Feb 23, 2019

Hello. I'm still fascinated by the idea of ["col_name", "col_*"], because in many cases we want to manually select one or two columns plus a pattern. For instance, select_columns(["Name", "Position", "Defense_*"]) for the following.

df = pd.DataFrame({'Name': ['James', 'Jordan', 'Yao', 'Curry', 'Harden', 'Nowitzki'],
                   'Position':['SF','SG','C','PG','PG','PF'],
                   'Offense_1':[3, 5.1, np.nan, 4.7, 5.6, 6.8],
                   'Offense_2': [0, 1, np.nan, 0, 0, 0],
                   'Offense_3': [0, 0, 0, 0, 0, 1],
                   'Defense_1': [5, 5, 6, 5, 5.6, 6.8],
                   'Defense_2': [2, 4, 1, 0, 0, 5],
                   'Defense_3': ['NA', 0, 1, 0, 0, 0],
                   'Pace_1': ['NA', 3, 6, 1, 7, 3],})

I tried to understand @szuckerman's code. If I interpret it right, the root of all three search methods ('exact', 'glob', 're') is the same -- they all go through re.

2. see what's trying to be done with method, but I don't think it's really necessary. If we use @zbarry's suggestion of fnmatch then strs will match exactly (through regex), globs will match through regex and, (obviously) regex will match through regex.

Then I put things together, and the most important parameter might be search_cols instead of columns: it accepts inputs for all methods, enabling things like ["col_name", "col_*"].

import fnmatch
import re
from typing import List

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def select_columns(df: pd.DataFrame = None, search_cols: List = None,
                   invert: bool = False):
    """
    Method-chainable selection of columns.

    Optional ability to invert selection of columns available as well.

    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'col_*'], invert=True)

    :param df: A pandas DataFrame.
    :param search_cols: A list of column names or search strings to use for selection. Valid inputs include:
        1) an exact column name to look for
        2) a regular expression used to match column names
        3) a shell-style glob string (e.g., `*_thing_*`)
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    full_column_list = []

    for col in search_cols:
        search_string = fnmatch.translate(col)
        columns = [col for col in df if re.match(search_string, col)]
        full_column_list.extend(columns)

    return df.drop(columns=full_column_list) if invert else df[full_column_list]

When I tested the code, I found that things work as expected when the inputs are "exact" and/or "glob" strings. However, when the inputs include a regular expression, it doesn't go well. For example, select_columns(['Name', '3$']) only returns Name, not the columns ending with 3. I guess we need to do something at for col in search_cols: search_string = fnmatch.translate(col): when the input col is already a regex pattern, the function returns what we don't want. Feedback is welcome. Should we still keep the 'method' parameter to control the flow, or are there other ways? Thank you and have a nice weekend.
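The behavior described above is reproducible: fnmatch.translate escapes regex metacharacters, so a pattern like '3$' is treated as the literal text "3$" rather than "ends with 3":

```python
import re
from fnmatch import translate

# '3$' becomes a search for the literal two characters "3$"...
assert re.match(translate('3$'), 'Defense_3') is None
# ...whereas the glob spelling of "ends with 3" is '*3'.
assert re.match(translate('*3'), 'Defense_3')
```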

@ericmjl
Member

ericmjl commented Feb 23, 2019

@CWen001 thanks for working on this!

I think we can support just exact and globs for now. I think regex seems to be a use case that most data analyst types won't be concerned with. @zbarry and @szuckerman are pretty advanced Pythonistas 😄; with pyjanitor, though, I think both the interface and the source code should be kept beginner friendly as much as possible.

CWen001 added a commit to CWen001/pyjanitor that referenced this issue Feb 24, 2019
Update select_columns functions. For discussion please see pyjanitor-devs#77
@CWen001 CWen001 mentioned this issue Feb 25, 2019
9 tasks
@ericmjl ericmjl closed this as completed Apr 7, 2019