Select_Columns Function Added Suggestion... #77
Comments
@jcvall if you were to design the API for this function, what would it be? What would be the relevant arguments, and what would it return? What corner cases might need to be taken care of? I'm asking because I'd like to solicit API designs that will eventually get used, rather than design what I think is useful only to find it doesn't get used. Also, some initially simple-looking functions (e.g. clean_names) turned out to have a bunch of possible additional kwargs that were useful that I didn't think of at first glance! This is your opportunity to shape what the API looks like too! |
At first glance, now that I have had some time to think about it, an API that might be useful would look like this:
df = (
    pd.DataFrame(...)
    .select_columns(col_names=[col_1, col_2, col_3, ...])
)
One could have an optional
df = (
    pd.DataFrame(...)
    .select_columns(col_names=[col_1, col_2, col_3, ...], invert=True)
)
@jcvall any thoughts? The only thing I would think needs a second-pass look would be whether |
I have been coding for a few years in python and r, but I apologize because I am new to this (Github). |
@jcvall, if you're willing to do so, I'd like to invite you to paste your implementation here, where I can guide you through the programming model needed to make I can see how the code should work, but I'm happy to give the opportunity to newcomers. Everybody had a start somewhere, including myself, and in my case, it was with the matplotlib devs, doing some very simple and repetitive things. Let me know if you'd like to give it a shot - it will be a rewarding process! If not, no worries, I can see how the code would work, and can put up a first pass on |
I’m in. I will give it a go. Sent with GitHawk |
Here we go... a little trial and error and this is what I have so far.
def select_columns(df, col, invert=False):
    if invert == False:
        df = df[col]
        return df
    elif invert == True:
        # drop() returns a new DataFrame, so the result has to be kept
        df = df.drop(columns=col)
        return df |
@jcvall looks great! I'd do a few modifications, with explanations below:
from typing import List
import pandas as pd

#[0]
def select_columns(df: pd.DataFrame, columns: List, invert: bool = False):
    """ #[1]
    Method-chainable selection of columns.
    Optional ability to invert selection of columns available as well.
    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param invert: Whether or not to invert the selection. This will result in selection of
        the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    #[2]
    if invert:
        #[3]
        return df.drop(columns=columns)
    else:
        return df[columns]
Note 0: Used the keyword argument
Note 1: I added a docstring. Docstrings help us document the intent behind the function and its parameters, and they get built automatically as part of the docs. (It's pretty amazing what readthedocs does, be sure to support the project!)
Note 2: It's usually pythonic to write boolean conditionals as:
if X:
    # do something here
else:
    # do something else here
Note 3: This is inspired by a bit of technical detail, but if I remember correctly,
Are you familiar with Git and software development workflows? If not, ping back here, I'm happy to guide you through it. We can also do a Google Hangouts call if you'd like. I'd like your contribution to be recorded in the version control history of the repository. |
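For anyone following along, here is a minimal sketch of how the revised function behaves. It assumes a plain function used through `DataFrame.pipe` rather than a registered DataFrame method, and the column names `a`, `b`, `c` are made up for illustration.

```python
import pandas as pd


def select_columns(df: pd.DataFrame, columns: list, invert: bool = False) -> pd.DataFrame:
    """Return only `columns`, or everything except them when `invert` is True."""
    if invert:
        # drop() returns a new DataFrame by default, so `df` itself is untouched
        return df.drop(columns=columns)
    return df[columns]


# Hypothetical data, only for illustration
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# pipe() lets a plain function take part in a method chain
kept = df.pipe(select_columns, ["a", "b"])
dropped = df.pipe(select_columns, ["a", "b"], invert=True)

print(list(kept.columns))     # ['a', 'b']
print(list(dropped.columns))  # ['c']
print(list(df.columns))       # ['a', 'b', 'c'], the original is unchanged
```

Both branches return a new DataFrame, which is what keeps the function safe to use in the middle of a chain.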
🤔 |
Wow. Awesome. I don’t know git that well. Would be honored to have you walk me through it. A google chat is cool with me. My schedule is crazy with the kids, but might have a window on the weekends. If that doesn’t work you can guide me here. Again thanks. Sent with GitHawk |
@jcvall totally happy to do so here. That way we can avoid trying to coordinate times. Asynchronous work is always good, even better when there's no pressure involved. Pardon me if the instructions below cover what you already know; I'm working off the assumption that your git level is at a Software Carpentry student's level (i.e. "may have typed some git commands, but don't really have the mental model").
First off, fork the repository. This will give you a copy of the repository that lives under your username on GitHub. It doesn't automatically sync with my copy, as you'll need to execute some commands to make that happen.
Secondly, clone the repository locally. There are two options: you can do it by SSH, or you can do it by HTTPS. Both give the exact same result, but SSH may be simpler once you've set it up. By convention, under my home folder on my Mac, I have a folder in which I place all of my projects that reside on GitHub. Cloning a repository simply results in a folder that lives on your local hard drive, and that folder is a copy of your fork. I would suggest that you use the GitHub Desktop application if you're not familiar with Git commands at the command line. It'll give you an easier time ramping up. (Learning the git commands, though, is extremely useful, particularly for automation purposes.)
The cloned repository will have the exact history of your forked repository, which will have the exact history of my repository at the time that you forked it. (My repository is canonically "upstream" of your fork, and your fork is the "origin" of your local clone.)
What you'll want to do now is create a branch on your local copy. GitHub Desktop doesn't show you all of the files in the repository, only the ones that are changed. On the GitHub Desktop interface, you can click on the branches drop-down menu, and then click on "create branch".
This branch will house changes to the source code, without affecting the "master" branch that you have. Give it an informative name, e.g. When you're done with those steps, ping back here, happy to guide you through the rest of the steps involved. |
@jcvall, this is an awesome suggestion! I was an R coder for many years and used What's great about the R version of And you can join those as well. For example, you can pass I'd be willing to add those features if you'd like. |
Sure. I'd like to learn how to do this, so I am OK with you using what Eric did above and just adding to it. Can you copy the code here? Then I will give it my best shot to upload it to GitHub. Very excited! I have some other possible ideas and would like to jump in to help when I can. Like Eric said, you've got to start somewhere. Sent with GitHawk |
@jcvall thanks for the enthusiasm! However, let's go slowly with this. It's okay to have a feature-limited first pass on the PR, and I'd actually prefer it, because it makes reviewing the code much easier. 😄 |
Eric, sounds like a plan. I saved the function on my computer under my github desktop folder. Ready for further instructions. 👍 Sent with GitHawk |
Ok! Now, what you'll want to do is to "push" those changes to your fork of
This commit message tells me a summary of the changes that you have made thus far.
You'll want to now make a "pull request" - you're requesting me to pull in changes from your fork into my branch. Fill out the two text boxes. The top one contains a summary of changes, and the bottom one contains a checklist for you to follow. Follow the instructions carefully there, and then submit your PR! |
Although this issue was closed, may I add some thoughts here about the Select_Columns function? Let's take the example from this question on StackOverflow.
If I only want the columns starting with 'foo', I probably have to prepare the select helper I'm wondering if it is possible to save that effort inside Select_Columns() by adding some arguments or helpers like starts_with=' '. The same logic could apply to ends_with, contains, and regex. It would be just like the powerful select() function in dplyr, where there are a number of special functions that only work inside the select function. Thank you all for making this wonderful library for users to enjoy EDA. |
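To make the request concrete, here is a rough sketch of the hand-written "select helpers" that such arguments would replace. The column names are hypothetical (they are not taken from the linked question); with a real DataFrame one would iterate over `df.columns` and pass the resulting list to `df[...]`.

```python
import re

# Hypothetical column names; with a DataFrame you would iterate over df.columns
columns = ["foo.aa", "foo.bars", "bar.baz", "sec.bar", "nas.foo"]

# starts_with / ends_with / contains map directly onto plain string operations
starts_with_foo = [c for c in columns if c.startswith("foo")]
ends_with_bar = [c for c in columns if c.endswith("bar")]
contains_foo = [c for c in columns if "foo" in c]

# re.search scans the whole name; re.match would only test the beginning
regex_matches = [c for c in columns if re.search(r"\.ba[rz]", c)]

print(starts_with_foo)  # ['foo.aa', 'foo.bars']
print(ends_with_bar)    # ['sec.bar']
print(contains_foo)     # ['foo.aa', 'foo.bars', 'nas.foo']
print(regex_matches)    # ['foo.bars', 'bar.baz', 'sec.bar']
```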
Wonder if a glob |
@CWen001 that'd be a great feature to add! And I think @zbarry's globbing idea might be the best implementation start, for the following reasons:
@CWen001, do you have any bandwidth to give this a shot? Both @zbarry and I are swamped at work at the moment (I am juggling two medium-sized projects, while he's on at least 3), but we can help guide you through the process if you're up for it. |
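As a small illustration of the globbing idea (a sketch only, not the eventual pyjanitor implementation): one glob pattern syntax can express starts-with, ends-with, contains, and exact matches. The helper name `glob_columns` and the column names are made up, and `fnmatchcase` is used here so that matching stays case-sensitive on every platform.

```python
from fnmatch import fnmatchcase

# Hypothetical column names
columns = ["foo_1", "foo_2", "bar_1", "x_foo_y"]


def glob_columns(names, pattern):
    """Return the names that match a shell-style glob, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(glob_columns(columns, "foo*"))   # starts with: ['foo_1', 'foo_2']
print(glob_columns(columns, "*_1"))    # ends with:   ['foo_1', 'bar_1']
print(glob_columns(columns, "*foo*"))  # contains:    ['foo_1', 'foo_2', 'x_foo_y']
print(glob_columns(columns, "bar_1"))  # exact:       ['bar_1']
```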
Thank you very much for the quick replies and suggestions. I'm still a Python newbie with limited experience, but I am willing to learn by doing. I tried my best to finish this first draft after following your instructions above. Please have a look and feel free to correct and guide. Four function parameters with default value None are added, and these included
@pf.register_dataframe_method
def select_columns2(df: pd.DataFrame = None, columns: List = None, starts_with: str = None,
                    ends_with: str = None, contains: str = None, regex: str = None, invert: bool = False):
    """
    Method-chainable selection of columns.
    Optional ability to invert selection of columns available as well.
    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param starts_with: Case sensitive strings that are used to select columns whose name starts with these strings.
    :param ends_with: Case sensitive strings that are used to select columns whose name ends with these strings.
    :param contains: Case sensitive strings that are used to select columns whose name contains these strings.
    :param regex: A regex pattern that is used to select columns whose name matches the regex pattern.
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
import re

@pf.register_dataframe_method
def select_columns2(df: pd.DataFrame = None, columns: List = None, starts_with: str = None,
                    ends_with: str = None, contains: str = None, regex: str = None, invert: bool = False):
    if starts_with:
        if invert:
            return df.drop(columns=[col for col in df if col.startswith(starts_with)])
        else:
            return df[[col for col in df if col.startswith(starts_with)]]
    if ends_with:
        if invert:
            return df.drop(columns=[col for col in df if col.endswith(ends_with)])
        else:
            return df[[col for col in df if col.endswith(ends_with)]]
    if contains:
        # Python strings have no .contains(); use the `in` operator instead
        if invert:
            return df.drop(columns=[col for col in df if contains in col])
        else:
            return df[[col for col in df if contains in col]]
    if regex:
        # Keep (or drop) the columns that DO match the pattern
        if invert:
            return df.drop(columns=[col for col in df if re.match(regex, col)])
        else:
            return df[[col for col in df if re.match(regex, col)]]
    if invert:
        return df.drop(columns=columns)
    else:
        return df[columns]
Maybe we could further improve the control flow. I'm also thinking about adding a bool value parameter |
@CWen001 that's a great starter implementation! I modified it a little bit. Yes, it looks like control flow is unavoidable here; that said, I think we can reduce the number of kwargs in the function signature, which makes it much more flexible for necessary expansion in the future. What do you think about my modification of your code? If you like it, please go ahead and PR it in - I'd love for the credit to go to you, because you initiated it!
import re

@pf.register_dataframe_method
def select_columns(df: pd.DataFrame = None, columns: List = None, search_string: str = None, method: str = 'exact', invert: bool = False):
    """
    Method-chainable selection of columns.
    Optional ability to invert selection of columns available as well.
    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param search_string: The string to match column names against.
        Required for every method except "exact".
    :param method: A string that specifies how to select the columns. Supported strings include:
        "exact", "starts_with", "ends_with", "contains", and "regex". "Exact" is the default.
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    # Check that the method name exists inside the approved set of methods.
    methods = ["exact", "starts_with", "ends_with", "contains", "regex"]
    if method not in methods:
        raise KeyError(f'method kwarg must be one of {methods}')
    # Every method except for "exact" requires that search_string be specified.
    # Only "exact" requires that columns be specified.
    if method == "exact":
        if columns is None:
            raise ValueError("`columns` must be specified!")
    else:
        if search_string is None:
            raise ValueError("`search_string` must be specified!")
    # Get out the set of columns that are specified, according to the method specified:
    if method == 'starts_with':
        columns = [col for col in df if col.startswith(search_string)]
    elif method == 'ends_with':
        columns = [col for col in df if col.endswith(search_string)]
    elif method == "contains":
        columns = [col for col in df if search_string in col]
    elif method == "regex":
        columns = [col for col in df if re.match(search_string, col)]
    # This is the case where "exact" is specified. Leaving this here to be
    # explicit about this scenario being handled.
    else:
        pass
    # Finally, identify whether an inversion is requested or not.
    if invert:
        return df.drop(columns=columns)
    else:
        return df[columns] |
I found out you might be able to get fancy and use: https://docs.python.org/3/library/fnmatch.html#fnmatch.translate You can then translate an arbitrary See what you think about this:
import re
from fnmatch import translate

def select_columns(df: pd.DataFrame = None, columns: List = None,
                   search_string: str = None, method: str = None,
                   invert: bool = False):
    """
    Method-chainable selection of columns.
    Optional ability to invert selection of columns available as well.
    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param search_string: Either:
        1) an exact string to look for
        2) a regular expression used to match column names
        3) a shell-style glob string (e.g., `*_thing_*`)
    :param method: A string that specifies how to select the columns. Supported
        methods are {"glob", "regex"}.
        1) Default (`None`) indicates to look for an exact match.
        2) "glob": support shell-style globbing for column names
           (Uses `fnmatch.translate`).
        3) "regex": regex (`re` module)-based string matching.
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    # Check that the method name exists inside the approved set of methods.
    supported_methods = [None, "glob", "regex"]
    if method not in supported_methods:
        raise ValueError(f'`method` keyword argument must be in '
                         f'{supported_methods}')
    if method is None and columns is None:
        # Only exact matches require that columns be specified.
        raise ValueError("`columns` must be specified.")
    elif method is not None and search_string is None:
        # For the regex and glob methods, search_string must be specified.
        raise ValueError("`search_string` must be specified.")
    # If method is specified, extract set of columns accordingly
    if method == "glob":
        search_string = translate(search_string)
    if method in ["glob", "regex"]:
        columns = [col for col in df if re.match(search_string, col)]
    # Finally, identify whether an inversion is requested or not.
    if invert:
        return df.drop(columns=columns)
    else:
        return df[columns]
Completely untested, so no warranty... |
Can you tell how bored I am? This allows
import re
from fnmatch import translate
from typing import Union, List, Callable

@pf.register_dataframe_method
def select_columns(df: pd.DataFrame = None, columns: List = None,
                   search_string: str = None,
                   method: Union[str, Callable] = None, invert: bool = False):
    """
    Method-chainable selection of columns.
    Optional ability to invert selection of columns available as well.
    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'c'], invert=True)

    :param df: A pandas DataFrame.
    :param columns: A list of columns to select.
    :param search_string: Either:
        1) an exact string to look for
        2) a regular expression used to match column names
        3) a shell-style glob string (e.g., `*_thing_*`)
        4) a string conforming to the specifications used by your supplied
           callable object (see `method`).
    :param method: A string or function that specifies how to select the
        columns. Supported methods are {"glob", "regex"}.
        1) Default (`None`) indicates to look for an exact match.
        2) "glob": support shell-style globbing for column names
           (Uses `fnmatch.translate`).
        3) "regex": regex (`re` module)-based string matching.
        4) A callable object (function, object containing __call__, method)
           which takes as input `columns` passed one at a time and returns True
           on match. If `search_string` is not None, it is passed to your
           callable as the second parameter.
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    # Check that the method name exists inside the approved set of methods.
    supported_methods = [None, "glob", "regex"]
    if method not in supported_methods and not callable(method):
        raise ValueError(f'`method` keyword argument must be in '
                         f'{supported_methods} or callable.')
    if method is None and columns is None:
        # Only exact matches require that columns be specified.
        raise ValueError("`columns` must be specified.")
    elif method in ("glob", "regex") and search_string is None:
        # For the regex and glob methods, search_string must be specified.
        raise ValueError("`search_string` must be specified.")
    # If method is specified, extract set of columns accordingly
    if callable(method):
        args = [search_string] if search_string is not None else []
        columns = [col for col in df if method(col, *args)]
    if method == "glob":
        search_string = translate(search_string)
    if method in ["glob", "regex"]:
        columns = [col for col in df if re.match(search_string, col)]
    # Finally, identify whether an inversion is requested or not.
    if invert:
        return df.drop(columns=columns)
    else:
        return df[columns] |
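To show what the callable branch would allow, here is a stripped-down sketch that works on a plain list of names instead of a DataFrame. The helper `match_columns` is hypothetical and only mirrors the `args` handling from the function above; the column names are made up.

```python
from typing import Callable, List, Optional


def match_columns(names: List[str], method: Callable[..., bool],
                  search_string: Optional[str] = None) -> List[str]:
    """Keep the names for which `method(name[, search_string])` is truthy."""
    # search_string is only forwarded when it was actually supplied
    args = [search_string] if search_string is not None else []
    return [name for name in names if method(name, *args)]


names = ["Offense_1", "Offense_2", "Defense_1", "Pace_1"]

# An existing two-argument function: str.startswith(name, prefix)
by_prefix = match_columns(names, str.startswith, "Off")

# A one-argument callable, with no search_string at all
by_suffix = match_columns(names, lambda name: name.endswith("_1"))

print(by_prefix)  # ['Offense_1', 'Offense_2']
print(by_suffix)  # ['Offense_1', 'Defense_1', 'Pace_1']
```

The trade-off is flexibility against a larger surface to document and test, which matters if the function is meant to stay beginner friendly.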
Thanks for the feedback! Both modified versions look much more efficient. I immediately learned that it is a good idea to put the The latter version from @zbarry is so powerful that it should be the one we go with. May I ask if I could work on the PR this weekend? I don't mean to claim the credit; I would just be grateful for the chance to learn git, glob, and callables. Thank you all. |
I think the above looks great! A few thoughts to trim it down though:
For example, I have a If I want all the cols with even numbers, I could merely do:
even_cols = ('col_' + str(i) for i in range(1, 101) if i % 2 == 0)
and pass that as the This would cut down the code for the entire
search_string = translate(search_string)
columns = [col for col in df if re.match(search_string, col)]
return df.drop(columns=columns) if invert else df[columns]
To do what I was mentioning above with
# search_cols is something like ["col_name", "col_*"]
full_column_list = []
for col in search_cols:
    search_string = translate(col)
    columns = [col for col in df if re.match(search_string, col)]
    full_column_list.extend(columns)
return df.drop(columns=full_column_list) if invert else df[full_column_list] |
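One corner case worth noting with the loop above: when two patterns match the same column, the name lands in the list twice, and indexing a DataFrame with a repeated name returns that column twice. Below is a minimal sketch of the same loop on a plain list of names, with de-duplication added. The helper `expand_patterns` and the column names are hypothetical, and the de-duplication step is my own addition rather than part of the proposal.

```python
import re
from fnmatch import translate


def expand_patterns(names, patterns):
    """Expand exact names and shell-style globs into matching column names."""
    matched = []
    for pattern in patterns:
        # translate() turns the glob into an anchored regular expression
        regex = re.compile(translate(pattern))
        matched.extend(name for name in names if regex.match(name))
    # dict.fromkeys drops repeats while keeping first-seen order
    return list(dict.fromkeys(matched))


names = ["col_name", "col_1", "col_2", "other"]

# "col_name" matches both patterns but is only returned once
print(expand_patterns(names, ["col_name", "col_*"]))
# ['col_name', 'col_1', 'col_2']
```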
@CWen001 totally, no problem! Looking forward to seeing what you have. Also, don't forget to consider @szuckerman's suggestion as well. 😄 |
It's your show. Learning by doing is the only way I learn (to really understand something), myself :) |
Hello. I'm still fascinated by the idea of
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['James', 'Jordan', 'Yao', 'Curry', 'Harden', 'Nowitzki'],
                   'Position': ['SF', 'SG', 'C', 'PG', 'PG', 'PF'],
                   'Offense_1': [3, 5.1, np.nan, 4.7, 5.6, 6.8],
                   'Offense_2': [0, 1, np.nan, 0, 0, 0],
                   'Offense_3': [0, 0, 0, 0, 0, 1],
                   'Defense_1': [5, 5, 6, 5, 5.6, 6.8],
                   'Defense_2': [2, 4, 1, 0, 0, 5],
                   'Defense_3': ['NA', 0, 1, 0, 0, 0],
                   'Pace_1': ['NA', 3, 6, 1, 7, 3]})
I tried to understand @szuckerman's code. If I interpret it right, that means the root for searching all three methods ('exact', 'glob', 're') is the same: all of them use re.
Then I put things together, and the most important parameter might be
import fnmatch
import re

@pf.register_dataframe_method
def select_columns(df: pd.DataFrame = None, search_cols: List = None,
                   invert: bool = False):
    """
    Method-chainable selection of columns.
    Optional ability to invert selection of columns available as well.
    Method-chaining example:

    .. code-block:: python

        df = pd.DataFrame(...).select_columns(['a', 'b', 'col_*'], invert=True)

    :param df: A pandas DataFrame.
    :param search_cols: A list of column names or search string to be used to select. Valid inputs include
        1) an exact column name to look for
        2) a regular expression used to match column names
        3) a shell-style glob string (e.g., `*_thing_*`)
    :param invert: Whether or not to invert the selection.
        This will result in selection of the complement of the columns provided.
    :returns: A pandas DataFrame with the columns selected.
    """
    full_column_list = []
    for col in search_cols:
        search_string = fnmatch.translate(col)
        columns = [col for col in df if re.match(search_string, col)]
        full_column_list.extend(columns)
    return df.drop(columns=full_column_list) if invert else df[full_column_list]
When I tested the code, I found things work as expected when inputs are "exact" and/or "glob" strings. However, when inputs include a regular expression, it doesn't go well. For example, |
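A likely explanation for the regex trouble, sketched below: `fnmatch.translate` treats its input as a glob, so regex metacharacters such as `^`, `.`, `\d` and `$` are escaped into literal characters before `re.match` ever sees them. The patterns and the column name here are made up for illustration.

```python
import re
from fnmatch import translate

name = "Offense_1"
regex_pattern = r"^Off.*_\d$"  # a regular expression
glob_pattern = "Off*_[0-9]"    # roughly the same intent, written as a glob

# Used directly, the regular expression matches
as_regex = re.match(regex_pattern, name) is not None

# Passed through translate(), '^', '.', '\' and '$' become literal characters,
# so the column name no longer matches
regex_via_translate = re.match(translate(regex_pattern), name) is not None

# A real glob survives the translation as intended
glob_via_translate = re.match(translate(glob_pattern), name) is not None

print(as_regex)             # True
print(regex_via_translate)  # False
print(glob_via_translate)   # True
```

So a single code path through `translate` can serve exact names and globs, but regular expressions would need their own branch that skips the translation.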
@CWen001 thanks for working on this! I think we can support just exact and globs for now. I think regex seems to be a use case that most data analyst types won't be concerned with. @zbarry and @szuckerman are pretty advanced Pythonistas 😄; with pyjanitor, though, I think both the interface and the source code should be kept beginner friendly as much as possible. |
Update select_columns functions. For discussion please see pyjanitor-devs#77
I see there is a remove columns function. I think a select_columns function would work nicely. It would be cleaner and easier to understand than df[['col1','col2','col3']].