-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16276: [R] Arrow 8.0 News #13005
Conversation
|
|
It might be better to describe function and package names as |
d995012
to
671a7f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome...thank you for putting it together!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for doing this! A few suggestions inline, and one general one: see if you can remove the word "now" in as many places as you can. It's redundant: we're describing what's in the release, so of course we mean "now".
r/NEWS.md
Outdated
|
||
Custom extension arrays can be created and registered, allowing other packages to | ||
define their own array types. Extension arrays wrap regular Arrow array types and | ||
provide customized behavior and/or storage. A common use-case for extension types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we link to the format docs on extension types? https://arrow.apache.org/docs/format/Columnar.html#extension-types
There are also some use cases described there.
Also, I'm not sure the use case here is correct. If it's just about custom serialization of R objects, isn't that what as_arrow_array
is for? Extension types are about when you need to define a standard outside of just this implementation, like when you want to have Python and R both understand the semantics of the data. If you're just trying to round trip data with R, the regular R metadata mechanism works for you, and if you need to serialize/deserialize the data differently, define an S3 method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, and I think I agree with you. I borrowed this language from the extension types R docs:
Lines 262 to 268 in d6ca3e2
#' Extension arrays are wrappers around regular Arrow [Array] objects | |
#' that provide some customized behaviour and/or storage. A common use-case | |
#' for extension types is to define a customized conversion between an | |
#' an Arrow [Array] and an R object when the default conversion is slow | |
#' or looses metadata important to the interpretation of values in the array. | |
#' For most types, the built-in | |
#' [vctrs extension type][vctrs_extension_type] is probably sufficient. |
@paleolimbot Any thoughts on that?
I'm can simple cut that part out for now and link to the existing docs on extension types:
Custom extension arrays can be created and registered, allowing other packages to define their own array types. Extension arrays wrap regular Arrow array types and provide customized behavior and/or storage. See further description and an example with
?new_extension_type
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing documentation is correct, although a little confusing: as_arrow_array()
is about converting to an Arrow Array, but you need to use an ExtensionType
subclass in order to customize converting from an Arrow Array. Another compelling use of ExtensionType
is, as Neal mentioned, where the type is defined in a Python package as well.
Perhaps a solution here is to group the S3 Generics heading and the ExtensionType heading, because they're both under the theme of extensibility? Maybe:
Extensibility
- Added S3 generic methods to create the core Arrow object types. In particular, packages can define the
as_arrow_array()
generic to ensure that a custom vector type is converted to an Arrow Array in a particular way (e.g., when converting adata.frame
to an Arrow Table). Packages can also define anas_arrow_table()
method to customize conversion of a table-like object (e.g., when an object is passed towrite_parquet()
orwrite_feather()
). - Custom ExtensionTypes can be created and registered, allowing other packages to define their own array types and/or conversions from Arrow Arrays to R vectors. Extension arrays wrap regular Arrow array types and provide customized behavior and/or storage. See documentation for
new_extension_type()
for details. - Implemented a generic extension type and
as_arrow_array()
methods for all objects wherevctrs::vec_is()
returnsTRUE
(i.e., any object that can be used as a column in atibble::tibble()
), provided that the underlyingvctrs::vec_data()
can be converted to an Arrow Array.
(feel free to mix/match/scramble/disregard this with what you've written already!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah agreed that combining the sections makes sense.
r/NEWS.md
Outdated
- now can take a list of datasets with differing schemas and attempt to unify the | ||
schemas to produce a `UnionDataset`. | ||
* Arrow `{dplyr}` queries: | ||
- are now supported on `RecordBatchReader`. This allows results from DuckDB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just one example of use case, there are/will be others (for example, you can pass a RecordBatchReader over the C interface, so you can get one from wherever in pyarrow, including Flight, and do dplyr on it)
Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor notes but this looks good, let's get this finished and merged before the next RC is cut if we can
r/NEWS.md
Outdated
* `lubridate::tz()` (timezone), | ||
* `lubridate::semester()` (semester), | ||
* `lubridate::dst()` (daylight savings time indicator), | ||
* `lubridate::date()` (extract date), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this correct?
* `lubridate::date()` (extract date), | |
* `lubridate::date()` (extract date from timestamp), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
instead of `timestamp(unit = "s")`. | ||
* For Arrow dplyr queries, added additional `{lubridate}` features and fixes: | ||
* New component extraction functions: | ||
* `lubridate::tz()` (timezone), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
* `lubridate::tz()` (timezone), | |
* `lubridate::tz()` (string timezone), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd rather limit these parenthetical to just explain abbreviations (tz, dst, epiyear), rather than try to function as docs. We link to the lubridate function docs directly for each bullet, so more detail is readily available to the reader.
r/NEWS.md
Outdated
* New component extraction functions: | ||
* `lubridate::tz()` (timezone), | ||
* `lubridate::semester()` (semester), | ||
* `lubridate::dst()` (daylight savings time indicator), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this?
* `lubridate::dst()` (daylight savings time indicator), | |
* `lubridate::dst()` (daylight savings time indicator, logical/boolean), |
r/NEWS.md
Outdated
* `lubridate::semester()` (semester), | ||
* `lubridate::dst()` (daylight savings time indicator), | ||
* `lubridate::date()` (extract date), | ||
* `lubridate::epiyear()` (epiyear), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is epiyear? drop the parenthetical if we don't have anything to clarify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"year according to epidemilogical week calendar".
r/NEWS.md
Outdated
* `lubridate::date()` (extract date), | ||
* `lubridate::epiyear()` (epiyear), | ||
* `lubridate::month()` works with integer inputs. | ||
* Added `lubridate::make_date()` & `lubridate::make_datetime()` + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop "Added" from all of these, seems inconsistent with the ones above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
and chunking is acceptable, using `ChunkedArray$create()`. | ||
* ChunkedArrays can be concatenated with `c()`. | ||
* RecordBatches and Tables support `cbind()`. | ||
* Tables support `rbind()`. `concat_tables()` is also provided to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct, no rbind for RecordBatch? wasn't there some alternative to concatenate batches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative is to make it a Table, but that's not really new IMO. https://github.com/apache/arrow/blob/master/r/R/record-batch.R#L195
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
Benchmark runs are scheduled for baseline = 6b32c30 and contender = 526fa07. 526fa07 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Let me know if I've missed anything important in this release!