ARROW-8300: [R] Documentation and changelog updates for 0.17

This edits the NEWS.md and adds a python.Rmd (static) example of using the `reticulate` bindings.

cc @wesm @pitrou

Closes #6833 from nealrichardson/docs-0.17

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Wes McKinney <wesm+git@apache.org>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
apache · Apr 7, 2020 · 042a6ec
1 parent c9f0a02
commit 042a6ec
Showing 23 changed files with 292 additions and 108 deletions.
diff --git a/r/DESCRIPTION b/r/DESCRIPTION
@@ -63,6 +63,7 @@ Collate:
     'compute.R'
     'csv.R'
     'dataset.R'
+    'filesystem.R'
     'ipc_stream.R'
     'deprecated.R'
     'dictionary.R'
@@ -72,7 +73,6 @@ Collate:
     'dplyr.R'
     'feather.R'
     'field.R'
-    'filesystem.R'
     'install-arrow.R'
     'json.R'
     'list.R'

diff --git a/r/NEWS.md b/r/NEWS.md
@@ -19,12 +19,67 @@
 
 # arrow 0.16.0.9000
 
+## Feather v2
+
+This release includes support for version 2 of the Feather file format.
+Feather v2 features full support for all Arrow data types,
+fixes the 2GB per-column limitation for large amounts of string data,
+and it allows files to be compressed using either `lz4` or `zstd`.
+`write_feather()` can write either version 2 or
+[version 1](https://github.com/wesm/feather) Feather files, and `read_feather()`
+automatically detects which file version it is reading.
+
+Related to this change, several functions around reading and writing data
+have been reworked. `read_ipc_stream()` and `write_ipc_stream()` have been
+added to facilitate writing data to the Arrow IPC stream format, which is
+slightly different from the IPC file format (Feather v2 *is* the IPC file format).
+
+Behavior has been standardized: all `read_<format>()` functions return an R `data.frame`
+(default) or a `Table` if the argument `as_data_frame = FALSE`;
+all `write_<format>()` functions return the data object, invisibly.
+To facilitate some workflows, a special `write_to_raw()` function is added
+to wrap `write_ipc_stream()` and return the `raw` vector containing the buffer
+that was written.
+
+To achieve this standardization, `read_table()`, `read_record_batch()`,
+`read_arrow()`, and `write_arrow()` have been deprecated.
+
+## Python interoperability
+
+The 0.17 Apache Arrow release includes a C data interface that allows
+exchanging Arrow data in-process at the C level without copying
+and without libraries having a build or runtime dependency on each other. This enables
+us to use `reticulate` to share data between R and Python (`pyarrow`) efficiently.
+
+See `vignette("python", package = "arrow")` for details.
+
+## Datasets
+
+* Dataset reading benefits from many speedups and fixes in the C++ library
+* Datasets have a `dim()` method, which sums rows across all files (#6635, @boshek)
+* Dataset filtering now treats `NA` as `FALSE`, consistent with `dplyr::filter()`
+* Dataset filtering is now correctly supported for all Arrow date/time/timestamp column types
+* `vignette("dataset", package = "arrow")` now has correct, executable code
+
+## Installation
+
+* Installation on Linux now builds the C++ library from source by default, with some compression libraries disabled. For a faster, richer build, set the environment variable `NOT_CRAN=true`. See `vignette("install", package = "arrow")` for details and more options.
+* Source installation is faster and more reliable on more Linux distributions.
+
+## Other bug fixes
+
+* Timezones are faithfully preserved in roundtrip between R and Arrow
+* `read_feather()` and other reader functions close any file connections they open
+* Arrow R6 objects no longer have namespace collisions when the `R.oo` package is also loaded
+* `FileStats` is renamed to `FileInfo`, and the original spelling has been deprecated
+
+# arrow 0.16.0.2
+
 * `install_arrow()` now installs the latest release of `arrow`, including Linux dependencies, either for CRAN releases or for development builds (if `nightly = TRUE`)
-* Package installation on Linux no longer downloads C++ dependencies unless the `LIBARROW_DOWNLOAD` or `NOT_CRAN` enviroment variable is set
+* Package installation on Linux no longer downloads C++ dependencies unless the `LIBARROW_DOWNLOAD` or `NOT_CRAN` environment variable is set
 * `write_feather()`, `write_arrow()` and `write_parquet()` now return their input,
 similar to the `write_*` functions in the `readr` package (#6387, @boshek)
 * Can now infer the type of an R `list` and create a ListArray when all list elements are the same type (#6275, @michaelchirico)
-* Dataset filtering is now correctly supported for all Arrow date/time/timestamp column types.
 
 # arrow 0.16.0
 

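Note on the NEWS.md entry above: the Feather v2 behavior it describes can be tried with a short session like the following. This is a minimal sketch, assuming an installed `arrow` build that can still write version 1 files. The data frame and the temporary file paths are illustrative and are not part of this commit.

```r
library(arrow)

# Illustrative data; not taken from the commit.
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Feather v2 (the Arrow IPC file format) is the default version.
v2_path <- tempfile(fileext = ".feather")
returned <- write_feather(df, v2_path)

# The original format is still available on request.
v1_path <- tempfile(fileext = ".feather")
write_feather(df, v1_path, version = 1)

# read_feather() detects the file version on its own.
from_v2 <- read_feather(v2_path)
from_v1 <- read_feather(v1_path)

# as_data_frame = FALSE returns an Arrow Table instead of a data.frame.
tab <- read_feather(v2_path, as_data_frame = FALSE)

unlink(c(v1_path, v2_path))
```

Here `returned` is the input `df`, matching the "return the data object, invisibly" rule in the entry, which lets a write call sit in the middle of a pipeline.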
diff --git a/r/R/dataset.R b/r/R/dataset.R
@@ -115,24 +115,22 @@ Dataset <- R6Class("Dataset", inherit = ArrowObject,
         self
       }
     },
-    #' @description
-    #' Start a new scan of the data
-    #' @return A [ScannerBuilder]
+    # @description
+    # Start a new scan of the data
+    # @return A [ScannerBuilder]
     NewScan = function() unique_ptr(ScannerBuilder, dataset___Dataset__NewScan(self)),
     ToString = function() self$schema$ToString()
   ),
   active = list(
-    #' @description
-    #' Return the Dataset's `Schema`
     schema = function() shared_ptr(Schema, dataset___Dataset__schema(self)),
     metadata = function() self$schema$metadata,
     num_rows = function() {
       warning("Number of rows unknown; returning NA", call. = FALSE)
       NA_integer_
     },
     num_cols = function() length(self$schema),
-    #' @description
-    #' Return the Dataset's type.
+    # @description
+    # Return the Dataset's type.
     type = function() dataset___Dataset__type_name(self)
   )
 )
@@ -172,11 +170,11 @@ FileSystemDataset <- R6Class("FileSystemDataset", inherit = Dataset,
     }
   ),
   active = list(
-    #' @description
-    #' Return the files contained in this `FileSystemDataset`
+    # @description
+    # Return the files contained in this `FileSystemDataset`
     files = function() dataset___FileSystemDataset__files(self),
-    #' @description
-    #' Return the format of files in this `Dataset`
+    # @description
+    # Return the format of files in this `Dataset`
     format = function() {
       shared_ptr(FileFormat, dataset___FileSystemDataset__format(self))$..dispatch()
     },
@@ -197,8 +195,8 @@ FileSystemDataset <- R6Class("FileSystemDataset", inherit = Dataset,
 #' @export
 UnionDataset <- R6Class("UnionDataset", inherit = Dataset,
   active = list(
-    #' @description
-    #' Return the UnionDataset's child `Dataset`s
+    # @description
+    # Return the UnionDataset's child `Dataset`s
     children = function() {
       map(dataset___UnionDataset__children(self), ~shared_ptr(Dataset, .)$..dispatch())
     }
@@ -380,8 +378,8 @@ FileFormat <- R6Class("FileFormat", inherit = ArrowObject,
     }
   ),
   active = list(
-    #' @description
-    #' Return the `FileFormat`'s type
+    # @description
+    # Return the `FileFormat`'s type
     type = function() dataset___FileFormat__type_name(self)
   )
 )

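Note on the `Dataset` bindings documented in the hunks above: `schema` and `files`, together with the `dim()` method mentioned in NEWS.md, can be exercised on a small multi-file dataset. This is a sketch under assumptions: the directory layout and file names are made up, and `format = "ipc"` is assumed to be accepted by the installed version of `open_dataset()`.

```r
library(arrow)

# Write two small IPC (Feather v2) files into one directory.
# The directory and file names are illustrative only.
dir <- tempfile()
dir.create(dir)
write_feather(data.frame(x = 1:3), file.path(dir, "part-1.arrow"))
write_feather(data.frame(x = 4:5), file.path(dir, "part-2.arrow"))

# Discover every file in the directory as a single Dataset.
ds <- open_dataset(dir, format = "ipc")

dims <- dim(ds)                 # rows summed across files, then columns
n_files <- length(ds$files)     # active binding on FileSystemDataset
col_names <- ds$schema$names    # active binding on Dataset
```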
diff --git a/r/R/deprecated.R b/r/R/deprecated.R
@@ -38,3 +38,30 @@ read_table <- function(x, ...) {
   .Deprecated("read_arrow")
   read_arrow(x, ..., as_data_frame = FALSE)
 }
+
+#' @rdname read_ipc_stream
+#' @export
+read_arrow <- function(x, ...) {
+  if (inherits(x, "raw")) {
+    read_ipc_stream(x, ...)
+  } else {
+    read_feather(x, ...)
+  }
+}
+
+#' @rdname write_ipc_stream
+#' @export
+write_arrow <- function(x, sink, ...) {
+  if (inherits(sink, "raw")) {
+    # HACK for sparklyr
+    # Note that this returns a new R raw vector, not the one passed as `sink`
+    write_to_raw(x)
+  } else {
+    write_feather(x, sink, ...)
+  }
+}
+
+#' @rdname FileInfo
+#' @export
+#' @include filesystem.R
+FileStats <- FileInfo
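
Note on migrating away from the wrappers moved into deprecated.R above: each branch of `read_arrow()` and `write_arrow()` maps onto one of the standardized functions, so callers can use those directly. A minimal sketch with illustrative data follows; it deliberately does not call the deprecated functions themselves.

```r
library(arrow)

df <- data.frame(x = 1:3)

# Formerly write_arrow(df, raw()): serialize to an in-memory IPC stream.
buf <- write_to_raw(df)

# Formerly read_arrow(buf): a raw vector holds IPC stream data.
from_raw <- read_ipc_stream(buf)

# Formerly write_arrow(df, path) and read_arrow(path): Feather files.
path <- tempfile(fileext = ".feather")
write_feather(df, path)
from_file <- read_feather(path)
unlink(path)
```

As the comment in `write_arrow()` points out, the raw vector comes back as the return value; a vector passed in as `sink` is never modified.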
diff --git a/r/R/feather.R b/r/R/feather.R
@@ -17,6 +17,12 @@
 
 #' Write data in the Feather format
 #'
+#' Feather provides binary columnar serialization for data frames.
+#' It is designed to make reading and writing data frames efficient,
+#' and to make sharing data across data analysis languages easy.
+#' This function writes both the original, limited specification of the format
+#' and the version 2 specification, which is the Apache Arrow IPC file format.
+#'
 #' @param x `data.frame`, [RecordBatch], or [Table]
 #' @param sink A string file path or [OutputStream]
 #' @param version integer Feather file version. Version 2 is the current.
@@ -36,6 +42,7 @@
 #' @return The input `x`, invisibly. Note that if `sink` is an [OutputStream],
 #' the stream will be left open.
 #' @export
+#' @seealso [RecordBatchWriter] for lower-level access to writing Arrow IPC data.
 #' @examples
 #' \donttest{
 #' tf <- tempfile()
@@ -109,15 +116,22 @@ write_feather <- function(x,
 
 #' Read a Feather file
 #'
+#' Feather provides binary columnar serialization for data frames.
+#' It is designed to make reading and writing data frames efficient,
+#' and to make sharing data across data analysis languages easy.
+#' This function reads both the original, limited specification of the format
+#' and the version 2 specification, which is the Apache Arrow IPC file format.
+#'
 #' @param file A character file path, a raw vector, or `InputStream`, passed to
 #' `FeatherReader$create()`.
 #' @inheritParams read_delim_arrow
-#' @param ... additional parameters
+#' @param ... additional parameters, passed to [FeatherReader$create()][FeatherReader]
 #'
 #' @return A `data.frame` if `as_data_frame` is `TRUE` (the default), or an
 #' Arrow [Table] otherwise
 #'
 #' @export
+#' @seealso [FeatherReader] and [RecordBatchReader] for lower-level access to reading Arrow IPC data.
 #' @examples
 #' \donttest{
 #' tf <- tempfile()

diff --git a/r/R/filesystem.R b/r/R/filesystem.R
@@ -79,15 +79,6 @@ FileInfo <- R6Class("FileInfo",
   )
 )
 
-#' @include arrow-package.R
-#' @title FileSystem entry info (Deprecated. Use FileInfo instead.)
-#' @usage NULL
-#' @format NULL
-#'
-#' @rdname FileStats
-#' @export
-FileStats <- FileInfo
-
 #' @title file selector
 #' @format NULL
 #'

diff --git a/r/R/ipc_stream.R b/r/R/ipc_stream.R
@@ -52,18 +52,6 @@ write_ipc_stream <- function(x, sink, ...) {
   invisible(x_out)
 }
 
-#' @rdname write_ipc_stream
-#' @export
-write_arrow <- function(x, sink, ...) {
-  if (inherits(sink, "raw")) {
-    # HACK for sparklyr
-    # Note that this returns a new R raw vector, not the one passed as `sink`
-    write_to_raw(x)
-  } else {
-    write_feather(x, sink, ...)
-  }
-}
-
 #' Write Arrow data to a raw vector
 #'
 #' [write_ipc_stream()] and [write_feather()] write data to a sink and return
@@ -124,13 +112,3 @@ read_ipc_stream <- function(x, as_data_frame = TRUE, ...) {
   }
   out
 }
-
-#' @rdname read_ipc_stream
-#' @export
-read_arrow <- function(x, ...) {
-  if (inherits(x, "raw")) {
-    read_ipc_stream(x, ...)
-  } else {
-    read_feather(x, ...)
-  }
-}
diff --git a/r/R/py-to-r.R b/r/R/py-to-r.R
@@ -59,7 +59,7 @@ r_to_py.RecordBatch <- function(x, convert = FALSE) {
     delete_arrow_schema(schema_ptr)
     delete_arrow_array(array_ptr)
   })
-  
+
   pa <- reticulate::import("pyarrow", convert = convert)
   ExportRecordBatch(x, array_ptr, schema_ptr)
   pa$RecordBatch$`_import_from_c`(array_ptr, schema_ptr)

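Note on `r_to_py.RecordBatch` above: a round trip through Python might look like the sketch below. It assumes that `reticulate` is installed and that the active Python environment has a `pyarrow` recent enough to support the C data interface (0.17 or later). The `has_pyarrow` guard and the example batch are illustrative and are not part of this commit.

```r
library(arrow)

# Only attempt the exchange when reticulate and pyarrow are both usable.
has_pyarrow <- requireNamespace("reticulate", quietly = TRUE) &&
  reticulate::py_module_available("pyarrow")

if (has_pyarrow) {
  batch <- record_batch(x = 1:3, y = c("a", "b", "c"))

  # R -> Python: dispatches to the r_to_py.RecordBatch method shown above.
  py_batch <- reticulate::r_to_py(batch)

  # Python -> R: arrow's py_to_r method hands back an R RecordBatch.
  back <- reticulate::py_to_r(py_batch)
  n_rows <- back$num_rows
}
```

When `pyarrow` is not available, the guard skips the exchange instead of raising an error, which keeps the sketch safe to run in a plain R session.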
diff --git a/r/README.md b/r/README.md
@@ -1,11 +1,8 @@
 # arrow
 
 [![cran](https://www.r-pkg.org/badges/version-last-release/arrow)](https://cran.r-project.org/package=arrow)
+[![CI](https://github.com/apache/arrow/workflows/R/badge.svg?event=push)](https://github.com/apache/arrow/actions?query=workflow%3AR+branch%3Amaster+event%3Apush)
 [![conda-forge](https://img.shields.io/conda/vn/conda-forge/r-arrow.svg)](https://anaconda.org/conda-forge/r-arrow)
-[![Nightly macOS Build
-Status](https://travis-ci.org/ursa-labs/arrow-r-nightly.png?branch=master)](https://travis-ci.org/ursa-labs/arrow-r-nightly)
-[![Nightly Windows Build
-Status](https://ci.appveyor.com/api/projects/status/ume8udm5r26u2c9l/branch/master?svg=true)](https://ci.appveyor.com/project/nealrichardson/arrow-r-nightly-yxl55/branch/master)
 [![codecov](https://codecov.io/gh/ursa-labs/arrow-r-nightly/branch/master/graph/badge.svg)](https://codecov.io/gh/ursa-labs/arrow-r-nightly)
 
 [Apache Arrow](https://arrow.apache.org/) is a cross-language
@@ -26,11 +23,17 @@ access to Arrow memory and messages.
 
 Install the latest release of `arrow` from CRAN with
 
-``` r
+```r
 install.packages("arrow")
 ```
 
-Installing a released version of the `arrow` package should require no
+Conda users on Linux and macOS can install `arrow` from conda-forge with
+
+```
+conda install -c conda-forge --strict-channel-priority r-arrow
+```
+
+Installing a released version of the `arrow` package requires no
 additional system dependencies. For macOS and Windows, CRAN hosts binary
 packages that contain the Arrow C++ library. On Linux, source package
 installation will also build necessary C++ dependencies. For a faster,
@@ -41,7 +44,7 @@ If you install the `arrow` package from source and the C++ library is
 not found, the R package functions will notify you that Arrow is not
 available. Call
 
-``` r
+```r
 arrow::install_arrow()
 ```
 
@@ -55,10 +58,6 @@ source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.
 install_arrow()
 ```
 
-Conda users on Linux and macOS can install `arrow` from conda-forge with
-
-    conda install -c conda-forge --strict-channel-priority r-arrow
-
 ## Installing a development version
 
 Development versions of the package (binary and source) are built daily and hosted at
@@ -92,16 +91,11 @@ brew install apache-arrow
 brew install apache-arrow --HEAD
 ```
 
-On Windows, you can download a .zip file with the arrow dependencies
-from the [rwinlib](https://github.com/rwinlib/arrow/releases) project,
+On Windows, you can download a .zip file with the arrow dependencies from the
+[nightly bintray repository](https://dl.bintray.com/ursalabs/arrow-r/libarrow/bin/windows-35/),
 and then set the `RWINLIB_LOCAL` environment variable to point to that
-zip file before installing the `arrow` R package. That project contains
-released versions of the C++ library; for a development version, Windows
-users may be able to find a binary by going to the [Apache Arrow
-project’s
-Appveyor](https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow),
-selecting an R job from a recent build, and downloading the
-`build\arrow-*.zip` file from the “Artifacts” tab.
+zip file before installing the `arrow` R package. Version numbers in that
+repository correspond to dates, and you will likely want the most recent.
 
 If you need to alter both the Arrow C++ library and the R package code,
 or if you can’t get a binary version of the latest C++ library
@@ -168,7 +162,7 @@ generation.
 The codegen.R script has these additional dependencies:
 
 ``` r
-remotes::install_github("romainfrancois/decor")
+remotes::install_github("nealrichardson/decor")
 install.packages("glue")
 ```
 

diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml
@@ -67,6 +67,7 @@ reference:
   - Partitioning
   - Expression
   - Scanner
+  - FileFormat
 - title: Reading and writing files
   contents:
   - read_feather
@@ -76,6 +77,7 @@ reference:
   - read_json_arrow
   - write_feather
   - write_ipc_stream
+  - write_to_raw
   - write_parquet
 - title: C++ reader/writer interface
   contents:
@@ -125,6 +127,7 @@ reference:
   - MemoryPool
   - default_memory_pool
   - FileSystem
+  - FileInfo
   - FileStats
   - FileSelector
 - title: Installation helpers