ARROW-8300: [R] Documentation and changelog updates for 0.17
This edits the NEWS.md and adds a python.Rmd (static) example of using the `reticulate` bindings.

cc @wesm @pitrou

Closes #6833 from nealrichardson/docs-0.17

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
nealrichardson and wesm committed Apr 7, 2020
1 parent c9f0a02 commit 042a6ec
Showing 23 changed files with 292 additions and 108 deletions.
2 changes: 1 addition & 1 deletion r/DESCRIPTION
@@ -63,6 +63,7 @@ Collate:
'compute.R'
'csv.R'
'dataset.R'
'filesystem.R'
'ipc_stream.R'
'deprecated.R'
'dictionary.R'
@@ -72,7 +73,6 @@ Collate:
'dplyr.R'
'feather.R'
'field.R'
'filesystem.R'
'install-arrow.R'
'json.R'
'list.R'
59 changes: 57 additions & 2 deletions r/NEWS.md
@@ -19,12 +19,67 @@

# arrow 0.16.0.9000

## Feather v2

This release includes support for version 2 of the Feather file format.
Feather v2 features full support for all Arrow data types,
fixes the 2GB per-column limitation for large amounts of string data,
and allows files to be compressed using either `lz4` or `zstd`.
`write_feather()` can write either version 2 or
[version 1](https://github.com/wesm/feather) Feather files, and `read_feather()`
automatically detects which file version it is reading.
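
As a minimal sketch of the behavior described above (the data frame and temporary paths are made up for illustration), the same data can be written as Feather v2 and v1 and read back without telling `read_feather()` which version to expect. The `zstd` step is guarded with `codec_is_available()` because some builds disable compression libraries.

```r
library(arrow)

# Illustrative data; any data.frame would do
df <- data.frame(id = 1:3, label = c("a", "b", "c"), stringsAsFactors = FALSE)

v2_path <- tempfile(fileext = ".feather")
v1_path <- tempfile(fileext = ".feather")

# Version 2 is the default; version 1 must be requested explicitly
write_feather(df, v2_path)
write_feather(df, v1_path, version = 1)

# Compression applies to version 2 files only; skip it if the codec
# was not compiled into this build of the C++ library
if (codec_is_available("zstd")) {
  zstd_path <- tempfile(fileext = ".feather")
  write_feather(df, zstd_path, compression = "zstd")
}

# No version argument is needed when reading
from_v2 <- read_feather(v2_path)
from_v1 <- read_feather(v1_path)
```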

Related to this change, several functions around reading and writing data
have been reworked. `read_ipc_stream()` and `write_ipc_stream()` have been
added to facilitate writing data to the Arrow IPC stream format, which is
slightly different from the IPC file format (Feather v2 *is* the IPC file format).

Behavior has been standardized: all `read_<format>()` functions return an R `data.frame`
(default) or a `Table` if the argument `as_data_frame = FALSE`;
all `write_<format>()` functions return the data object, invisibly.
To facilitate some workflows, a special `write_to_raw()` function is added
to wrap `write_ipc_stream()` and return the `raw` vector containing the buffer
that was written.
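
The return-value conventions above can be sketched as follows; the data frame and the temporary file are illustrative assumptions, not part of the release notes.

```r
library(arrow)

df <- data.frame(x = c(1.5, 2.5), y = c("a", "b"), stringsAsFactors = FALSE)
path <- tempfile()

# write_*() functions return their input invisibly
returned <- write_ipc_stream(df, path)

# read_*() functions return a data.frame by default ...
as_df <- read_ipc_stream(path)
# ... or an Arrow Table when as_data_frame = FALSE
as_table <- read_ipc_stream(path, as_data_frame = FALSE)

# write_to_raw() returns the written buffer as a raw vector,
# which read_ipc_stream() can read directly
buf <- write_to_raw(df)
from_raw <- read_ipc_stream(buf)
```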

To achieve this standardization, `read_table()`, `read_record_batch()`,
`read_arrow()`, and `write_arrow()` have been deprecated.
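
For code that still calls the deprecated functions, the following sketch shows one possible migration. The mapping is inferred from how the deprecated wrappers dispatch in `r/R/deprecated.R` in this commit; the data and path are placeholders.

```r
library(arrow)

df <- data.frame(n = 1:2)
path <- tempfile()

# write_arrow(df, path)   -> write_feather(df, path)
write_feather(df, path)
# write_arrow(df, raw())  -> write_to_raw(df)
buf <- write_to_raw(df)

# read_arrow(path)        -> read_feather(path)
df_from_file <- read_feather(path)
# read_arrow(buf)         -> read_ipc_stream(buf)
df_from_raw <- read_ipc_stream(buf)
# read_table(path)        -> read_feather(path, as_data_frame = FALSE)
tab <- read_feather(path, as_data_frame = FALSE)
```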

## Python interoperability

The 0.17 Apache Arrow release includes a C data interface that allows
exchanging Arrow data in-process at the C level without copying
and without libraries having a build or runtime dependency on each other. This enables
us to use `reticulate` to share data between R and Python (`pyarrow`) efficiently.

See `vignette("python", package = "arrow")` for details.

## Datasets

* Dataset reading benefits from many speedups and fixes in the C++ library
* Datasets have a `dim()` method, which sums rows across all files (#6635, @boshek)
* Dataset filtering now treats `NA` as `FALSE`, consistent with `dplyr::filter()`
* Dataset filtering is now correctly supported for all Arrow date/time/timestamp column types
* `vignette("dataset", package = "arrow")` now has correct, executable code
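
A small sketch of the `dim()` method mentioned in the list above, assuming a build with Parquet support. The directory layout and file names are invented for the example.

```r
library(arrow)

# Build a tiny two-file dataset in a temporary directory
ds_dir <- tempfile()
dir.create(ds_dir)
write_parquet(
  data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE),
  file.path(ds_dir, "part-1.parquet")
)
write_parquet(
  data.frame(x = 4:5, y = c("d", "e"), stringsAsFactors = FALSE),
  file.path(ds_dir, "part-2.parquet")
)

ds <- open_dataset(ds_dir)

# dim() sums rows across both files: 3 + 2 rows, 2 columns
ds_dim <- dim(ds)
```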

## Installation

* Installation on Linux now builds the C++ library from source by default, with some compression libraries disabled. For a faster, richer build, set the environment variable `NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details and more options.
* Source installation is faster and more reliable on more Linux distributions.

## Other bug fixes

* Timezones are faithfully preserved in roundtrip between R and Arrow
* `read_feather()` and other reader functions close any file connections they open
* Arrow R6 objects no longer have namespace collisions when the `R.oo` package is also loaded
* `FileStats` is renamed to `FileInfo`, and the original spelling has been deprecated

# arrow 0.16.0.2

* `install_arrow()` now installs the latest release of `arrow`, including Linux dependencies, either for CRAN releases or for development builds (if `nightly = TRUE`)
* Package installation on Linux no longer downloads C++ dependencies unless the `LIBARROW_DOWNLOAD` or `NOT_CRAN` enviroment variable is set
* Package installation on Linux no longer downloads C++ dependencies unless the `LIBARROW_DOWNLOAD` or `NOT_CRAN` environment variable is set
* `write_feather()`, `write_arrow()` and `write_parquet()` now return their input,
similar to the `write_*` functions in the `readr` package (#6387, @boshek)
* Can now infer the type of an R `list` and create a ListArray when all list elements are the same type (#6275, @michaelchirico)
* Dataset filtering is now correctly supported for all Arrow date/time/timestamp column types.

# arrow 0.16.0

28 changes: 13 additions & 15 deletions r/R/dataset.R
@@ -115,24 +115,22 @@ Dataset <- R6Class("Dataset", inherit = ArrowObject,
self
}
},
#' @description
#' Start a new scan of the data
#' @return A [ScannerBuilder]
# @description
# Start a new scan of the data
# @return A [ScannerBuilder]
NewScan = function() unique_ptr(ScannerBuilder, dataset___Dataset__NewScan(self)),
ToString = function() self$schema$ToString()
),
active = list(
#' @description
#' Return the Dataset's `Schema`
schema = function() shared_ptr(Schema, dataset___Dataset__schema(self)),
metadata = function() self$schema$metadata,
num_rows = function() {
warning("Number of rows unknown; returning NA", call. = FALSE)
NA_integer_
},
num_cols = function() length(self$schema),
#' @description
#' Return the Dataset's type.
# @description
# Return the Dataset's type.
type = function() dataset___Dataset__type_name(self)
)
)
@@ -172,11 +170,11 @@ FileSystemDataset <- R6Class("FileSystemDataset", inherit = Dataset,
}
),
active = list(
#' @description
#' Return the files contained in this `FileSystemDataset`
# @description
# Return the files contained in this `FileSystemDataset`
files = function() dataset___FileSystemDataset__files(self),
#' @description
#' Return the format of files in this `Dataset`
# @description
# Return the format of files in this `Dataset`
format = function() {
shared_ptr(FileFormat, dataset___FileSystemDataset__format(self))$..dispatch()
},
@@ -197,8 +195,8 @@ FileSystemDataset <- R6Class("FileSystemDataset", inherit = Dataset,
#' @export
UnionDataset <- R6Class("UnionDataset", inherit = Dataset,
active = list(
#' @description
#' Return the UnionDataset's child `Dataset`s
# @description
# Return the UnionDataset's child `Dataset`s
children = function() {
map(dataset___UnionDataset__children(self), ~shared_ptr(Dataset, .)$..dispatch())
}
@@ -380,8 +378,8 @@ FileFormat <- R6Class("FileFormat", inherit = ArrowObject,
}
),
active = list(
#' @description
#' Return the `FileFormat`'s type
# @description
# Return the `FileFormat`'s type
type = function() dataset___FileFormat__type_name(self)
)
)
27 changes: 27 additions & 0 deletions r/R/deprecated.R
@@ -38,3 +38,30 @@ read_table <- function(x, ...) {
.Deprecated("read_arrow")
read_arrow(x, ..., as_data_frame = FALSE)
}

#' @rdname read_ipc_stream
#' @export
read_arrow <- function(x, ...) {
if (inherits(x, "raw")) {
read_ipc_stream(x, ...)
} else {
read_feather(x, ...)
}
}

#' @rdname write_ipc_stream
#' @export
write_arrow <- function(x, sink, ...) {
if (inherits(sink, "raw")) {
# HACK for sparklyr
# Note that this returns a new R raw vector, not the one passed as `sink`
write_to_raw(x)
} else {
write_feather(x, sink, ...)
}
}

#' @rdname FileInfo
#' @export
#' @include filesystem.R
FileStats <- FileInfo
16 changes: 15 additions & 1 deletion r/R/feather.R
@@ -17,6 +17,12 @@

#' Write data in the Feather format
#'
#' Feather provides binary columnar serialization for data frames.
#' It is designed to make reading and writing data frames efficient,
#' and to make sharing data across data analysis languages easy.
#' This function writes both the original, limited specification of the format
#' and the version 2 specification, which is the Apache Arrow IPC file format.
#'
#' @param x `data.frame`, [RecordBatch], or [Table]
#' @param sink A string file path or [OutputStream]
#' @param version integer Feather file version. Version 2 is the current version.
@@ -36,6 +42,7 @@
#' @return The input `x`, invisibly. Note that if `sink` is an [OutputStream],
#' the stream will be left open.
#' @export
#' @seealso [RecordBatchWriter] for lower-level access to writing Arrow IPC data.
#' @examples
#' \donttest{
#' tf <- tempfile()
@@ -109,15 +116,22 @@ write_feather <- function(x,

#' Read a Feather file
#'
#' Feather provides binary columnar serialization for data frames.
#' It is designed to make reading and writing data frames efficient,
#' and to make sharing data across data analysis languages easy.
#' This function reads both the original, limited specification of the format
#' and the version 2 specification, which is the Apache Arrow IPC file format.
#'
#' @param file A character file path, a raw vector, or `InputStream`, passed to
#' `FeatherReader$create()`.
#' @inheritParams read_delim_arrow
#' @param ... additional parameters
#' @param ... additional parameters, passed to [FeatherReader$create()][FeatherReader]
#'
#' @return A `data.frame` if `as_data_frame` is `TRUE` (the default), or an
#' Arrow [Table] otherwise
#'
#' @export
#' @seealso [FeatherReader] and [RecordBatchReader] for lower-level access to reading Arrow IPC data.
#' @examples
#' \donttest{
#' tf <- tempfile()
9 changes: 0 additions & 9 deletions r/R/filesystem.R
@@ -79,15 +79,6 @@ FileInfo <- R6Class("FileInfo",
)
)

#' @include arrow-package.R
#' @title FileSystem entry info (Deprecated. Use FileInfo instead.)
#' @usage NULL
#' @format NULL
#'
#' @rdname FileStats
#' @export
FileStats <- FileInfo

#' @title file selector
#' @format NULL
#'
22 changes: 0 additions & 22 deletions r/R/ipc_stream.R
@@ -52,18 +52,6 @@ write_ipc_stream <- function(x, sink, ...) {
invisible(x_out)
}

#' @rdname write_ipc_stream
#' @export
write_arrow <- function(x, sink, ...) {
if (inherits(sink, "raw")) {
# HACK for sparklyr
# Note that this returns a new R raw vector, not the one passed as `sink`
write_to_raw(x)
} else {
write_feather(x, sink, ...)
}
}

#' Write Arrow data to a raw vector
#'
#' [write_ipc_stream()] and [write_feather()] write data to a sink and return
@@ -124,13 +112,3 @@ read_ipc_stream <- function(x, as_data_frame = TRUE, ...) {
}
out
}

#' @rdname read_ipc_stream
#' @export
read_arrow <- function(x, ...) {
if (inherits(x, "raw")) {
read_ipc_stream(x, ...)
} else {
read_feather(x, ...)
}
}
2 changes: 1 addition & 1 deletion r/R/py-to-r.R
@@ -59,7 +59,7 @@ r_to_py.RecordBatch <- function(x, convert = FALSE) {
delete_arrow_schema(schema_ptr)
delete_arrow_array(array_ptr)
})

pa <- reticulate::import("pyarrow", convert = convert)
ExportRecordBatch(x, array_ptr, schema_ptr)
pa$RecordBatch$`_import_from_c`(array_ptr, schema_ptr)
36 changes: 15 additions & 21 deletions r/README.md
@@ -1,11 +1,8 @@
# arrow

[![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
[![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
[![Nightly macOS Build
Status](https://travis-ci.org/ursa-labs/arrow-r-nightly.png?branch=master)](https://travis-ci.org/ursa-labs/arrow-r-nightly)
[![Nightly Windows Build
Status](https://ci.appveyor.com/api/projects/status/ume8udm5r26u2c9l/branch/master?svg=true)](https://ci.appveyor.com/project/nealrichardson/arrow-r-nightly-yxl55/branch/master)
[![codecov](https://codecov.io/gh/ursa-labs/arrow-r-nightly/branch/master/graph/badge.svg)](https://codecov.io/gh/ursa-labs/arrow-r-nightly)

[Apache Arrow](https://arrow.apache.org/) is a cross-language
@@ -26,11 +23,17 @@ access to Arrow memory and messages.

Install the latest release of `arrow` from CRAN with

``` r
```r
install.packages("arrow")
```

Installing a released version of the `arrow` package should require no
Conda users on Linux and macOS can install `arrow` from conda-forge with

```
conda install -c conda-forge --strict-channel-priority r-arrow
```

Installing a released version of the `arrow` package requires no
additional system dependencies. For macOS and Windows, CRAN hosts binary
packages that contain the Arrow C++ library. On Linux, source package
installation will also build necessary C++ dependencies. For a faster,
@@ -41,7 +44,7 @@ If you install the `arrow` package from source and the C++ library is
not found, the R package functions will notify you that Arrow is not
available. Call

``` r
```r
arrow::install_arrow()
```

@@ -55,10 +58,6 @@ source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.
install_arrow()
```

Conda users on Linux and macOS can install `arrow` from conda-forge with

conda install -c conda-forge --strict-channel-priority r-arrow

## Installing a development version

Development versions of the package (binary and source) are built daily and hosted at
@@ -92,16 +91,11 @@ brew install apache-arrow
brew install apache-arrow --HEAD
```

On Windows, you can download a .zip file with the arrow dependencies
from the [rwinlib](https://github.com/rwinlib/arrow/releases) project,
On Windows, you can download a .zip file with the arrow dependencies from the
[nightly bintray repository](https://dl.bintray.com/ursalabs/arrow-r/libarrow/bin/windows-35/),
and then set the `RWINLIB_LOCAL` environment variable to point to that
zip file before installing the `arrow` R package. That project contains
released versions of the C++ library; for a development version, Windows
users may be able to find a binary by going to the [Apache Arrow
project’s
Appveyor](https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow),
selecting an R job from a recent build, and downloading the
`build\arrow-*.zip` file from the “Artifacts” tab.
zip file before installing the `arrow` R package. Version numbers in that
repository correspond to dates, and you will likely want the most recent.

If you need to alter both the Arrow C++ library and the R package code,
or if you can’t get a binary version of the latest C++ library
@@ -168,7 +162,7 @@ generation.
The codegen.R script has these additional dependencies:

``` r
remotes::install_github("romainfrancois/decor")
remotes::install_github("nealrichardson/decor")
install.packages("glue")
```

3 changes: 3 additions & 0 deletions r/_pkgdown.yml
@@ -67,6 +67,7 @@ reference:
- Partitioning
- Expression
- Scanner
- FileFormat
- title: Reading and writing files
contents:
- read_feather
@@ -76,6 +77,7 @@ reference:
- read_json_arrow
- write_feather
- write_ipc_stream
- write_to_raw
- write_parquet
- title: C++ reader/writer interface
contents:
@@ -125,6 +127,7 @@ reference:
- MemoryPool
- default_memory_pool
- FileSystem
- FileInfo
- FileStats
- FileSelector
- title: Installation helpers