Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16276: [R] Arrow 8.0 News #13005

Closed
wants to merge 12 commits into from

Conversation

wjones127
Copy link
Member

Let me know if I've missed anything important in this release!

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@wjones127 wjones127 changed the title ARROW-16272: Arrow 8.0 News ARROW-16276: Arrow 8.0 News Apr 26, 2022
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has no components in JIRA, make sure you assign one.

@eitsupi
Copy link
Contributor

eitsupi commented Apr 27, 2022

It might be better to describe function and package names as `{dplyr}` or `write_dataset()` so that pkgdown can create the links automatically.

https://github.com/r-lib/pkgdown/blob/feb91bf46c3ea78f8a03aead9f9a4934e3965ba4/vignettes/linking.Rmd?rgh-link-date=2022-02-09T10%3A52%3A57Z#L20-L36

@wjones127 wjones127 marked this pull request as ready for review April 27, 2022 21:01
@wjones127 wjones127 changed the title ARROW-16276: Arrow 8.0 News ARROW-16276: [R] Arrow 8.0 News Apr 27, 2022
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome...thank you for putting it together!

r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Show resolved Hide resolved
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for doing this! A few suggestions inline, and one general one: see if you can remove the word "now" in as many places as you can. It's redundant: we're describing what's in the release, so of course we mean "now".

r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated

Custom extension arrays can be created and registered, allowing other packages to
define their own array types. Extension arrays wrap regular Arrow array types and
provide customized behavior and/or storage. A common use-case for extension types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we link to the format docs on extension types? https://arrow.apache.org/docs/format/Columnar.html#extension-types

There are also some use cases described there.

Also, I'm not sure the use case here is correct. If it's just about custom serialization of R objects, isn't that what as_arrow_array is for? Extension types are about when you need to define a standard outside of just this implementation, like when you want to have Python and R both understand the semantics of the data. If you're just trying to round trip data with R, the regular R metadata mechanism works for you, and if you need to serialize/deserialize the data differently, define an S3 method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, and I think I agree with you. I borrowed this language from the extension types R docs:

arrow/r/R/extension.R

Lines 262 to 268 in d6ca3e2

#' Extension arrays are wrappers around regular Arrow [Array] objects
#' that provide some customized behaviour and/or storage. A common use-case
#' for extension types is to define a customized conversion between an
#' an Arrow [Array] and an R object when the default conversion is slow
#' or looses metadata important to the interpretation of values in the array.
#' For most types, the built-in
#' [vctrs extension type][vctrs_extension_type] is probably sufficient.

@paleolimbot Any thoughts on that?

I'm can simple cut that part out for now and link to the existing docs on extension types:

Custom extension arrays can be created and registered, allowing other packages to define their own array types. Extension arrays wrap regular Arrow array types and provide customized behavior and/or storage. See further description and an example with ?new_extension_type.

Copy link
Member

@paleolimbot paleolimbot Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing documentation is correct, although a little confusing: as_arrow_array() is about converting to an Arrow Array, but you need to use an ExtensionType subclass in order to customize converting from an Arrow Array. Another compelling use of ExtensionType is, as Neal mentioned, where the type is defined in a Python package as well.

Perhaps a solution here is to group the S3 Generics heading and the ExtensionType heading, because they're both under the theme of extensibility? Maybe:

Extensibility

  • Added S3 generic methods to create the core Arrow object types. In particular, packages can define the as_arrow_array() generic to ensure that a custom vector type is converted to an Arrow Array in a particular way (e.g., when converting a data.frame to an Arrow Table). Packages can also define an as_arrow_table() method to customize conversion of a table-like object (e.g., when an object is passed to write_parquet() or write_feather()).
  • Custom ExtensionTypes can be created and registered, allowing other packages to define their own array types and/or conversions from Arrow Arrays to R vectors. Extension arrays wrap regular Arrow array types and provide customized behavior and/or storage. See documentation for new_extension_type() for details.
  • Implemented a generic extension type and as_arrow_array() methods for all objects where vctrs::vec_is() returns TRUE (i.e., any object that can be used as a column in a tibble::tibble()), provided that the underlying vctrs::vec_data() can be converted to an Arrow Array.

(feel free to mix/match/scramble/disregard this with what you've written already!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed that combining the sections makes sense.

r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated
- now can take a list of datasets with differing schemas and attempt to unify the
schemas to produce a `UnionDataset`.
* Arrow `{dplyr}` queries:
- are now supported on `RecordBatchReader`. This allows results from DuckDB
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just one example of use case, there are/will be others (for example, you can pass a RecordBatchReader over the C interface, so you can get one from wherever in pyarrow, including Flight, and do dplyr on it)

wjones127 and others added 3 commits April 29, 2022 08:05
Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
r/NEWS.md Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor notes but this looks good, let's get this finished and merged before the next RC is cut if we can

r/NEWS.md Outdated
* `lubridate::tz()` (timezone),
* `lubridate::semester()` (semester),
* `lubridate::dst()` (daylight savings time indicator),
* `lubridate::date()` (extract date),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this correct?

Suggested change
* `lubridate::date()` (extract date),
* `lubridate::date()` (extract date from timestamp),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

instead of `timestamp(unit = "s")`.
* For Arrow dplyr queries, added additional `{lubridate}` features and fixes:
* New component extraction functions:
* `lubridate::tz()` (timezone),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Suggested change
* `lubridate::tz()` (timezone),
* `lubridate::tz()` (string timezone),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather limit these parenthetical to just explain abbreviations (tz, dst, epiyear), rather than try to function as docs. We link to the lubridate function docs directly for each bullet, so more detail is readily available to the reader.

r/NEWS.md Outdated
* New component extraction functions:
* `lubridate::tz()` (timezone),
* `lubridate::semester()` (semester),
* `lubridate::dst()` (daylight savings time indicator),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this?

Suggested change
* `lubridate::dst()` (daylight savings time indicator),
* `lubridate::dst()` (daylight savings time indicator, logical/boolean),

r/NEWS.md Outdated
* `lubridate::semester()` (semester),
* `lubridate::dst()` (daylight savings time indicator),
* `lubridate::date()` (extract date),
* `lubridate::epiyear()` (epiyear),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is epiyear? drop the parenthetical if we don't have anything to clarify

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"year according to epidemilogical week calendar".

r/NEWS.md Outdated
* `lubridate::date()` (extract date),
* `lubridate::epiyear()` (epiyear),
* `lubridate::month()` works with integer inputs.
* Added `lubridate::make_date()` & `lubridate::make_datetime()` +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop "Added" from all of these, seems inconsistent with the ones above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

and chunking is acceptable, using `ChunkedArray$create()`.
* ChunkedArrays can be concatenated with `c()`.
* RecordBatches and Tables support `cbind()`.
* Tables support `rbind()`. `concat_tables()` is also provided to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, no rbind for RecordBatch? wasn't there some alternative to concatenate batches?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative is to make it a Table, but that's not really new IMO. https://github.com/apache/arrow/blob/master/r/R/record-batch.R#L195

r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
r/NEWS.md Outdated Show resolved Hide resolved
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

@ursabot
Copy link

ursabot commented May 7, 2022

Benchmark runs are scheduled for baseline = 6b32c30 and contender = 526fa07. 526fa07 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.08% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.28% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 526fa070 ec2-t3-xlarge-us-east-2
[Finished] 526fa070 test-mac-arm
[Finished] 526fa070 ursa-i9-9960x
[Finished] 526fa070 ursa-thinkcentre-m75q
[Finished] 6b32c300 ec2-t3-xlarge-us-east-2
[Finished] 6b32c300 test-mac-arm
[Finished] 6b32c300 ursa-i9-9960x
[Finished] 6b32c300 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants