Adds where() similar to tidyselect #5

dah33 · 2021-08-02T12:53:42Z

Similar to where() in tidyselect. This enables column filtering like:

{% macro col_is_string(col) %}
    {{ return(col.is_string()) }}
{% endmacro %}

{{ dbtplyr.where(col_is_string, source('dev_dan','stg_access_rcrm_timesheets')) }}

See https://dplyr.tidyverse.org/reference/select.html
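For readers less familiar with tidyselect, the core idea of where() can be modelled in a few lines of plain Python. This is only a sketch of the filtering logic, not the macro from this PR: the Column class below is a hypothetical stand-in for dbt's Column object, and the set of string type names is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Column:
    """Hypothetical stand-in for dbt's Column; only what the sketch needs."""
    name: str
    dtype: str

    def is_string(self) -> bool:
        # Assumed set of string type names, for illustration only.
        return self.dtype.lower() in {"text", "varchar", "character varying"}


def where(predicate: Callable[[Column], bool], columns: List[Column]) -> List[Column]:
    """Keep the columns for which the predicate returns True, preserving order."""
    return [col for col in columns if predicate(col)]


def col_is_string(col: Column) -> bool:
    """Python counterpart of the col_is_string macro in the description."""
    return col.is_string()


columns = [
    Column("employee_name", "varchar"),
    Column("hours", "numeric"),
    Column("notes", "text"),
]
selected = where(col_is_string, columns)
print([col.name for col in selected])
```

The predicate is passed as a value rather than called by the caller, which mirrors how the macro is handed to dbtplyr.where() in the example above.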

emilyriederer · 2021-08-08T13:18:04Z

Hi @dah33 ! Thanks so much for the PR. I'd definitely love to add more tidyselect verbs that people would find useful.

I have a couple of questions and would appreciate your thoughts:

1. I actually have not explored the Column class extensively, but looking at the docs, I see that there are 4 main methods related to data types (is_number, is_numeric, is_float, is_string). Am I thinking about it correctly that these are the only methods that would work with this functionality? If so, I'm wondering if we should hardcode them internally so users don't have to individually do the step of defining the col_is_DATATYPE(col) macro.
2. Right now, I think the returned list is of Column objects whereas other dbtplyr functions return only the string names of the columns. For the use cases you foresee, is there any loss of functionality if we change this to ultimately return names only?
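One possible shape for the two ideas raised here (built-in type checks, and returning names only) is sketched below in plain Python. Everything in it is hypothetical: TYPE_CHECKS, where_type, the Column stand-in, and the type-name sets are illustrative assumptions, not part of dbtplyr or dbt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Column:
    """Hypothetical stand-in for dbt's Column object."""
    name: str
    dtype: str


# Hypothetical built-in checks keyed by a short type name, so that callers
# do not have to define a col_is_DATATYPE helper themselves.
# The type-name sets are illustrative, not dbt's actual lists.
TYPE_CHECKS: Dict[str, Callable[[Column], bool]] = {
    "string": lambda c: c.dtype.lower() in {"text", "varchar"},
    "numeric": lambda c: c.dtype.lower() in {"numeric", "decimal"},
    "float": lambda c: c.dtype.lower() in {"float", "double precision"},
}


def where_type(kind: str, columns: List[Column]) -> List[str]:
    """Return only the names of the columns that pass the named check."""
    check = TYPE_CHECKS[kind]
    return [c.name for c in columns if check(c)]


columns = [
    Column("employee_name", "varchar"),
    Column("hours", "numeric"),
    Column("rate", "float"),
]
print(where_type("numeric", columns))
```

Returning names keeps the result compatible with functions that expect strings, at the cost of dropping the type information that a later check would need.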

Thanks again!

dah33 · 2021-08-10T11:56:47Z

Hi!

Yes, the Column class in dbt is pretty basic. I was planning to open a PR adding is_date() and is_boolean() methods.

I am currently building a small data profiling package for dbt. It uses dbt's Column class to iterate over all the columns in a relation and display some summary statistics, a bit like R's summary() function or pandas_profiling. My approach is to accept a relation object as one argument, along with a second argument that is either a function or a list of column names. If a function is given, it is passed each Column object in turn and decides whether that column should be included in the analysis. If a list is given instead, a column is included when its name is present in the list.
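The function-or-list selector described above could look roughly like this in plain Python. It is a sketch of the dispatch logic only; select_columns and the Column stand-in are hypothetical names, not code from the profiling package.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Union


@dataclass
class Column:
    """Hypothetical stand-in for dbt's Column object."""
    name: str
    dtype: str


Selector = Union[Callable[[Column], bool], Iterable[str]]


def select_columns(columns: List[Column], selector: Selector) -> List[Column]:
    """Filter columns with either a predicate or a list of column names."""
    if callable(selector):
        # Function case: ask the predicate about each Column in turn.
        return [col for col in columns if selector(col)]
    # List case: keep a column when its name appears in the list.
    wanted = set(selector)
    return [col for col in columns if col.name in wanted]


columns = [
    Column("employee_name", "varchar"),
    Column("hours", "numeric"),
    Column("notes", "text"),
]
by_function = select_columns(columns, lambda col: col.dtype == "numeric")
by_list = select_columns(columns, ["notes", "employee_name"])
print([col.name for col in by_function])
print([col.name for col in by_list])
```

In both cases the result follows the order of the relation's columns, not the order of the list passed in.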

So to answer your questions specifically:

1. Yes, I think it would be useful to define is_number/numeric/float/string() functions, as a direct analogy to R's built-in is.numeric/etc. functions. I have also implemented them in my project, so I could remove them there :)
2. I think it would be more consistent with dbt and dbt_utils to accept and return a list of Column objects. Perhaps if you implemented this, you might be able to remove the get_column_names macro? I was thinking, what is the analog to R's tibble: is it the relation object, a list of column objects, a list of strings, or something else? A good use case is to get all columns that are numeric and start with "sales_" (with the tests in either order).
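The "numeric and starts with sales_" use case can be illustrated with a small plain-Python model. It assumes where() both accepts and returns Column-like objects; the Column stand-in, the two predicates, and the numeric type names are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Column:
    """Hypothetical stand-in for dbt's Column object."""
    name: str
    dtype: str


def where(predicate: Callable[[Column], bool], columns: List[Column]) -> List[Column]:
    """Keep the columns for which the predicate returns True."""
    return [col for col in columns if predicate(col)]


def is_numeric(col: Column) -> bool:
    # Assumed numeric type names, for illustration only.
    return col.dtype.lower() in {"numeric", "decimal"}


def starts_with_sales(col: Column) -> bool:
    return col.name.startswith("sales_")


columns = [
    Column("sales_total", "numeric"),
    Column("sales_region", "varchar"),
    Column("discount", "decimal"),
    Column("sales_tax", "decimal"),
]

# Because where() returns Column objects, the two tests can be chained
# in either order and still have access to both name and dtype.
type_first = where(starts_with_sales, where(is_numeric, columns))
name_first = where(is_numeric, where(starts_with_sales, columns))
print([col.name for col in type_first])
print([col.name for col in name_first])
```

If the first step returned only names, the type test could no longer run second, which is the asymmetry the "either order" remark points at.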

emilyriederer · 2021-09-05T16:18:03Z

Thanks so much for the contribution, @dah33 ! (Also, your profiling package looks very neat. I'll be excited to follow the progress there!)

I really like the idea of this functionality. I tweaked the PR slightly for now to both take in and return string arguments to keep the API simpler for users. You give very good feedback that it might be useful for users to be able to access the Column objects and not just the names. That's a pretty big breaking change, so I will open a new issue on that but postpone it to a later version, after I think more about how to offer either return type without breaking existing behavior.

Thanks again!

dah33 added 2 commits August 2, 2021 13:24

Generalisation of get_matches using functions

8ef77de

add equivalent to tidyselect's where()

69ecb6a

See https://dplyr.tidyverse.org/reference/select.html

emilyriederer merged commit 69ecb6a into emilyriederer:main Sep 5, 2021

emilyriederer mentioned this pull request Sep 5, 2021

Add ability to return either list of column names or Column objects #7

Open
