ARROW-17188: [R] Update news for 9.0.0 #13726
r/NEWS.md
Outdated
* Aggregations over partition columns return correct results. (ARROW-16700)
* `dplyr::union` and `dplyr::union_all` are supported. (ARROW-15622)
* `dplyr::glimpse` is supported. (ARROW-16776)
* `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query`. (ARROW-15016)
Do we want to add here that both `dplyr::show_query()` and `dplyr::explain()` also work?
We now support namespacing (`pkg::` prefixing), so something like the chunk below works.
```r
library(arrow)
library(dplyr)

mtcars %>%
  mutate(make_model = rownames(.)) %>%
  arrow_table() %>%
  # The stringr:: prefix is accepted inside the Arrow query
  mutate(name_length = stringr::str_length(make_model)) %>%
  collect()
```
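For the first question, a minimal sketch of the three plan-inspection calls on one query, assuming arrow 9.0.0 or later and dplyr are installed. The filter threshold and column selection are illustrative choices, not taken from the PR.

```r
library(arrow)
library(dplyr)

# Build a lazy Arrow query; nothing is computed yet
query <- arrow_table(mtcars) %>%
  filter(mpg > 25) %>%
  select(mpg, cyl)

# Each call prints the underlying plan for the query
show_exec_plan(query)
show_query(query)
explain(query)

# collect() runs the query and returns a data frame
result <- collect(query)
```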
@@ -19,19 +19,54 @@

# arrow 8.0.0.9000
Shouldn't this be 9.0.0?
-# arrow 8.0.0.9000
+# arrow 9.0.0
OK, I see this is supposed to be done by the `utils-prepare.sh` script, as with the other versions.
Correct, we don't make this change manually.
Thanks for doing this. A few suggestions:
r/NEWS.md
Outdated
* `lubridate::parse_date_time()` datetime parser:
  * `orders` with year, month, day, hours, minutes, and seconds components are supported.
  * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`).
## Arrays and tables
Let's reorder: first dplyr, then reading/writing, then this (or general assorted bugfixes), then packaging.
* UnionDatasets can unify schemas of multiple InMemoryDatasets with varying schemas. (ARROW-16085)
* `write_dataset()` preserves all schema metadata again. In 8.0.0, it would drop most metadata, breaking packages such as sfarrow. (ARROW-16511)
* Reading and writing functions (such as `write_csv_arrow()`) will automatically (de-)compress data if the file path contains a compression extension (e.g. `"data.csv.gz"`). This works locally as well as on remote filesystems like S3 and GCS. (ARROW-16144)
This was already sort of the case for CSV and JSON, though there were some bugs. Parquet and Feather, however, don't automatically do anything with the file path.
* Reading and writing functions (such as `write_csv_arrow()`) will automatically (de-)compress data if the file path contains a compression extension (e.g. `"data.csv.gz"`). This works locally as well as on remote filesystems like S3 and GCS. (ARROW-16144)
* `FileSystemFactoryOptions` can be provided to `open_dataset()`, allowing you to pass options such as which file prefixes to ignore. (ARROW-15280)
* By default, `S3FileSystem` will not create or delete buckets. To enable that, pass the configuration option `allow_bucket_creation` or `allow_bucket_deletion`. (ARROW-15906)
* `GcsFileSystem` and `gs_bucket()` allow connecting to Google Cloud Storage. (ARROW-13404, ARROW-16887)
Maybe lead with this one? We should sort the section based on relevance/priority.
r/NEWS.md
Outdated

## Arrow dplyr queries

* Bugfixes:
Likewise, let's lead with major new features (new dplyr verbs, then new functions) and put bug fixes at the end.
Sure, though I was putting these at the top because they seemed like critical bugfixes.
r/NEWS.md
Outdated
* Functions can be called with package namespace prefixes (e.g. `stringr::`, `lubridate::`) within queries. For example, `stringr::str_length` will now dispatch to the same kernel as `str_length`. (ARROW-14575)
* User-defined functions are supported in queries. Use `register_scalar_function()` to create them. (ARROW-16444)
* `lubridate::parse_date_time()` datetime parser: (ARROW-14848, ARROW-16407)
  * `orders` with year, month, day, hours, minutes, and seconds components are supported.
Are some orders not supported?
Yes, #13506 adds the remaining formats/orders. So far the focus was on supporting the orders that would enable the higher-level parsers (e.g. `ymd_hms()`).
r/NEWS.md
Outdated
  * the `orders` argument in the Arrow binding works as follows: `orders` are transformed into `formats` which subsequently get applied in turn. There is no `select_formats` parameter and no inference takes place (like is the case in `lubridate::parse_date_time()`).
* `lubridate::ymd()` and related string date parsers supported. (ARROW-16394). Month (`ym`, `my`) and quarter (`yq`) resolution parsers are also added. (ARROW-16516)
* lubridate family of `ymd_hms` datetime parsing functions are supported. (ARROW-16395)
* `lubridate::fast_strptime()` supported. (ARROW-16439)
I'm not sure we need a separate bullet point for every function that just says "supported". We can group them where relevant, and we don't need to include all of the JIRA issue IDs.
I second ☝🏻 that.
Yeah, I'll consolidate. Wrote those bullets in a hurry :)
I include the JIRA IDs because it makes it much easier for me to revise the news. We can strip them at the end if people don't want them shown publicly.
r/NEWS.md
Outdated
* `dplyr::glimpse` is supported. (ARROW-16776)
* `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query()`. `dplyr::show_query()` and `dplyr::explain()` also work in Arrow dplyr pipelines. (ARROW-15016)
* Functions can be called with package namespace prefixes (e.g. `stringr::`, `lubridate::`) within queries. For example, `stringr::str_length` will now dispatch to the same kernel as `str_length`. (ARROW-14575)
* User-defined functions are supported in queries. Use `register_scalar_function()` to create them. (ARROW-16444)
This should go higher up. We should also discuss `map_batches()` alongside this, since they're both kinds of UDF.
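For context, a minimal sketch of the scalar UDF bullet under discussion, assuming arrow 9.0.0 or later. The function name `add_one` and the derived column are hypothetical and chosen only for illustration; `map_batches()` is not shown.

```r
library(arrow)
library(dplyr)

# Register a hypothetical scalar UDF named "add_one".
# The first argument of the R function is a context object supplied by Arrow;
# auto_convert = TRUE converts inputs to R vectors and the output back to Arrow.
register_scalar_function(
  "add_one",
  function(context, x) x + 1,
  in_type = schema(x = float64()),
  out_type = float64(),
  auto_convert = TRUE
)

# Call the registered function by name inside an Arrow dplyr query
result <- arrow_table(mtcars) %>%
  transmute(mpg, mpg_plus_one = add_one(mpg)) %>%
  collect()
```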
r/NEWS.md
Outdated
* `dplyr::union` and `dplyr::union_all` are supported. (ARROW-15622)
* `dplyr::glimpse` is supported. (ARROW-16776)
* `show_exec_plan()` can be added to the end of a dplyr pipeline to show the underlying plan, similar to `dplyr::show_query()`. `dplyr::show_query()` and `dplyr::explain()` also work in Arrow dplyr pipelines. (ARROW-15016)
* Functions can be called with package namespace prefixes (e.g. `stringr::`, `lubridate::`) within queries. For example, `stringr::str_length` will now dispatch to the same kernel as `str_length`. (ARROW-14575)
This is also significant
r/NEWS.md
Outdated
* Count distinct now gives correct result across multiple row groups. (ARROW-16807)
* Aggregations over partition columns return correct results. (ARROW-16700)
* `dplyr::union` and `dplyr::union_all` are supported. (ARROW-15622)
* `dplyr::glimpse` is supported. (ARROW-16776)
Can we say more than just "supported"?
Authored-by: Will Jones <willjones127@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Benchmark runs are scheduled for baseline = 71ccff9 and contender = a5f0c56. a5f0c56 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.