Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds where() similar to tidyselect #5

Merged
merged 2 commits into from
Sep 5, 2021
Merged

Conversation

dah33
Copy link
Contributor

@dah33 dah33 commented Aug 2, 2021

Similar to where() in tidyselect. This enables column filtering like:

{% macro col_is_string(col) %}
    {{ return(col.is_string()) }}
{% endmacro %}

{{ dbtplyr.where(col_is_string, source('dev_dan','stg_access_rcrm_timesheets')) }}

@emilyriederer
Copy link
Owner

Hi @dah33 ! Thanks so much for the PR. I'd definitely love to add more tidyselect verbs that people would find useful.

I have a couple of questions and would appreciate your thoughts:

  • I actually have not explored the Column class extensively, but looking at the docs, I see that there are 4 main methods related to data types (is_number, is_numeric, is_float, is_string). Am I thinking about it correctly that these are the only methods that would work with this functionality? If so, I'm wondering if we should hardcode them internally so users don't have to individually do the step of defining the col_is_DATATYPE(col) macro
  • Right now, I think the returned list is of Column objects whereas other dbtplyr functions return only the string names of the columns. For the use cases you foresee, is there any loss of functionality if we change this to ultimately return names only?

Thanks again!

@dah33
Copy link
Contributor Author

dah33 commented Aug 10, 2021

Hi!

Yes the Column class in dbt is pretty basic. I was planning to PR is_date() and is_boolean() methods.

I am currently building a small data profiling package for dbt. I am currently using dbt's Column class to iterate over all the columns in a relation to display some summary statistics, a bit like R's summary() function, or pandas_profiling. My approach for this is to accept a relation object as an argument, along with a second argument which is either a function or a list of column names. The function is passed each column object in turn, to ask if the column should be included in the analysis. If a list is used instead, each column name is checked against the list to see if it is present.

So to answer your questions specifically:

  1. Yes, I think it would be useful to define is_number/numeric/float/string() functions, as a direct analogy to R's built in is.numeric/etc function. I have also implemented them in my project, so I could remove :)

  2. I think it would be more consistent with dbt and dbt_utils to accept and return a list of Column objects. Perhaps if you implemented this, you might be able to remove the get_column_names macro? I was thinking, what is the analog to R's tibble: is it the relation object, a list of column objects, a list of strings, or something else? A good use case is to get all columns that are numeric and start with "sales_" (with the tests in either order).

@emilyriederer emilyriederer merged commit 69ecb6a into emilyriederer:main Sep 5, 2021
@emilyriederer
Copy link
Owner

Thanks so much for the contribution, @dah33 ! (Also, your profiling package looks very neat. I'll be excited to follow the progress there!)

I really like the idea of this functionality. I tweaked the PR slightly for now to both take in and return string arguments to keep the API simpler for users. You give very good feedback that it might be useful for users to be able to access the Column objects and not just the names. That's a pretty big breaking change, so I will open a new issue on that but postpone it to a later version after I think more about how to provide the option to get out either without breaking anything as it exists.

Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants