
[JOSS Review]: Substantial scholarly effort--add proof it's not a "thin client" #248

Closed · tweed1e opened this issue Jan 31, 2021 · 2 comments

@tweed1e commented Jan 31, 2021

I think the package clears the bar for scholarly contribution for openjournals/joss-reviews#2927, but the paper doesn't quite prove it yet. The criterion I'm thinking of is Substantial scholarly effort:

“Minor utility” packages, including “thin” API clients, and single-function packages are not acceptable.

So what makes this a "thick" client? To me, it's the filter-then-collect bit, which is mentioned in the paper, but you don't hype it up enough or give details about how much it's helping. The two things I want to know / want you to be specific about are:

  1. Does filter-then-collect download the whole dataset, then filter it before giving it to you? You write:

"The data is only downloaded, and loaded into R as an ‘sf’ object, once the query is complete and the user requests the final result".

If I were reading this for the first time, I might not know that it's filtered before downloading. A rephrase might be "Once the query is complete, the final result is downloaded and loaded into R as an 'sf' object." That's also not perfect, but it will help answer my question along with the next point.

  2. How much does the pre-filter help? Is it worth the extra effort over just using a 'thin' client and filtering yourself? Here's an example that you could use:
# thin-client style: download every feature
all_ha <- bcdc_query_geodata(ha_record) %>%
  collect()

# filter-then-collect: filtering is done by the API before anything is downloaded
my_ha <- bcdc_query_geodata(ha_record) %>%
  filter(HLTH_AUTHORITY_NAME == "Interior") %>%
  collect()

paste0(
  "The full data is ", format(object.size(all_ha), units = "auto"),
  ", the filtered data is ", format(object.size(my_ha), units = "auto"),
  ", which is a ", scales::percent(1 - as.numeric(object.size(my_ha)) / as.numeric(object.size(all_ha))),
  " reduction!"
)
#> "The full data is 9.9 Mb, the filtered data is 239.7 Kb, which is a 98% reduction!"

Technically this still doesn't make the point, because you could swap the collect and filter calls and still end up with the same object size reduction even though you did download the entire file. So, maybe use this along with a line that explicitly states that the filtering is done by the API before downloading, or an intermediate example where you call filter and not collect; see the sketch below. The filter without collect gives that great message about how it's working.
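
For example, something like this (untested sketch, assuming bcdata is attached and reusing the same ha_record as above) would surface that message without downloading anything:

# filter() without collect() just builds the query; printing the resulting
# object describes the request that would be sent to the server --
# no data has been downloaded at this point
bcdc_query_geodata(ha_record) %>%
  filter(HLTH_AUTHORITY_NAME == "Interior")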

NB: I switched to the "Interior" region because the coastline is usually what makes the shapefiles so large. The current example in the paper should give around a 70% improvement, which is also great and makes a better looking map, so feel free to keep that.
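
One more option (rough, untested sketch, same assumptions as above): timing the two queries would demonstrate the download savings directly, since object size alone can't distinguish where the filtering happened:

# wall-clock comparison: the timings include the download, so the
# server-side filter should be much faster because far fewer features
# cross the wire
system.time(
  all_ha <- bcdc_query_geodata(ha_record) %>% collect()
)
system.time(
  my_ha <- bcdc_query_geodata(ha_record) %>%
    filter(HLTH_AUTHORITY_NAME == "Interior") %>%
    collect()
)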

@stephhazlitt (Member) commented

This is a great suggestion, thanks @tweed1e.

@ateucher (Collaborator) commented

Closed by #263
