
ARROW-3731: MVP to read parquet in R library #3230

Closed
jeffwong-nflx wants to merge 11 commits into apache:master from jeffwong-nflx:master

Conversation

jeffwong-nflx
Contributor

I am contributing to Arrow 3731. This PR has the minimum functionality to read parquet files into an arrow::Table, which can then be converted to a tibble. Multiple parquet files can be read inside lapply, and then concatenated at the end.

Steps to compile

  1. Build arrow and parquet c++ projects
  2. In R run devtools::load_all() (see the usage sketch below)
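
After those steps, usage would look roughly like this (a sketch only; the file names and the "r" package path are placeholders, not part of this PR):

devtools::load_all("r")  # load the R package from the arrow checkout
# read one or more parquet files and row-bind them into a single tibble
df <- read_parquet(c("part-0.parquet", "part-1.parquet"))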

What I could use help with:
The biggest challenge for me is my lack of experience with pkg-config. The R library has a configure file which uses pkg-config to figure out what c++ libraries to link to. Currently, configure looks up the Arrow project and links to -larrow only. We need it to also link to -lparquet. I do not know how to modify pkg-config's metadata to let it know to link to both -larrow and -lparquet.

@kou changed the title from "MVP to read parquet in R library [Arrow 3731]" to "ARROW-3731: MVP to read parquet in R library" on Dec 20, 2018
@romainfrancois left a comment
Contributor

Thanks for this initial work. I've added some comments inline.

I need to have a look at the parquet api on the python side as we try to be as compatible as possible.

This needs

  • more low level api too, i.e. classes such as parquet::arrow::FileReader exposed on the R side.
  • a way to write parquet files too
  • unit tests

I'll come back here with more comments once I have more information on both python and C++ api for parquet.

tables = lapply(files, function(f) {
  return (as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f))))
})
do.call('rbind', tables)
Contributor

We'd probably use vctrs::vec_rbind(!!!tables) or dplyr::bind_rows(!!!tables) instead of do.call + rbind.

arrow currently does not depend on dplyr so I guess vctrs::vec_rbind(!!!tables) is preferred. @hadley
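
For reference, a minimal sketch of that suggestion applied to the snippet above (assuming the vctrs package is available; vec_rbind() accepts the spliced list of tibbles):

tables <- lapply(files, function(f) {
  # each element is the arrow::Table converted to a tibble
  as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f)))
})
# splice the list with !!! instead of do.call + rbind
vctrs::vec_rbind(!!!tables)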

#' @param files a vector of filenames
#' @export
read_parquet = function(files) {
  tables = lapply(files, function(f) {
Contributor

we'd probably use purrr::map() or even purrr::map_dfr()
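
A hedged sketch of that variant (assuming purrr is available; map_dfr() maps over the files and row-binds the resulting tibbles):

read_parquet <- function(files) {
  purrr::map_dfr(files, function(f) {
    # convert each arrow::Table to a tibble, then bind rows
    as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f)))
  })
}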

#' @export
read_parquet = function(files) {
  tables = lapply(files, function(f) {
    return (as_tibble(shared_ptr(`arrow::Table`, read_parquet_file(f))))
Contributor

Maybe we could have only one parquet reader instead of one per file. But I need to read more about the parquet api first.

r/configure Outdated
@@ -49,7 +49,8 @@ if [ "$INCLUDE_DIR" ] || [ "$LIB_DIR" ]; then
elif [ "$PKGCONFIG_CFLAGS" ] || [ "$PKGCONFIG_LIBS" ]; then
echo "Found pkg-config cflags and libs!"
PKG_CFLAGS=${PKGCONFIG_CFLAGS}
PKG_LIBS=${PKGCONFIG_LIBS}
#PKG_LIBS=${PKGCONFIG_LIBS}
PKG_LIBS="-larrow -lparquet" # hard coded! what is the right way to do this?
Contributor

I think this should rather be addressed by updating the PKG_CONFIG_NAME higher up in the script.

Contributor

Having this seems to work for me:

PKG_CONFIG_NAME="arrow parquet"

This needs an update to the README.Rmd (and README.md) to have -DARROW_PARQUET=ON, i.e.:

cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_parquet_file(std::string filename) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(
Contributor

Although the name is a big hint, I'm not sure what the macro does and if it should perhaps be replaced by something that plays well with Rcpp.

@romainfrancois
Contributor

Seems like the python version of parquet.read_table returns an arrow::Table: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table

We can/should probably do the same here, and maybe have an as_tibble argument like in read_feather to convert to a tibble when set to TRUE.
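
A rough sketch of that shape (the argument name and default are assumptions, mirroring read_feather rather than a settled API):

read_parquet <- function(file, as_tibble = TRUE) {
  # read a single file into an arrow::Table
  tab <- shared_ptr(`arrow::Table`, read_parquet_file(file))
  # optionally convert to a tibble, as read_feather does
  if (isTRUE(as_tibble)) as_tibble(tab) else tab
}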

@@ -0,0 +1,10 @@
#' Read parquet file from disk
Contributor

This needs the license headers; you can just copy them from any other R file.
https://travis-ci.org/apache/arrow/jobs/470349974#L493

@@ -0,0 +1,38 @@
// // Licensed to the Apache Software Foundation (ASF) under one
Contributor

The cpp files must adhere to the style. You can enforce that with

./r/lint.sh --fix

@romainfrancois
Contributor

@jeffwong-nflx I've incorporated your commits and then some in this branch on my fork:
https://github.com/romainfrancois/arrow/tree/3731/parquet

This needs more work before it can be merged to the upstream repo, but if you want to continue working on this, please send pull requests to that branch.

@romainfrancois
Contributor

Looking more at the python api (https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files), it looks like parquet.read_table reads a single file, which would simplify the code here; users may iterate themselves, but that does not belong in the read_parquet function. Maybe the function should be called parquet_read_table instead of read_parquet. Unfortunately we cannot nest packages in R, but once arrow is done, one thing we could do is make a small convenience parquet package so that we could do e.g. parquet::read_table().
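
Purely illustrative, under that naming idea (parquet_read_table is hypothetical here, not an existing function):

# read a single file into an arrow::Table, mirroring pyarrow.parquet.read_table
tab <- parquet_read_table("example.parquet")
df <- as_tibble(tab)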

Then, we'd need something like the ParquetFile and ParquetWriter classes at the r level: https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing

For multiple files, we should follow the partitioned dataset api of https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files

@wesm
Member

wesm commented Jan 2, 2019

I would recommend that you do not spend much energy implementing the ParquetDataset logic in R. I would like to port this from Python to C++, so we can maintain a common implementation: https://issues.apache.org/jira/browse/ARROW-3764

@wesm
Member

wesm commented Jan 2, 2019

It might be a good idea for someone to do that here in January so that it can be available to R and Ruby as soon as possible.

@romainfrancois
Contributor

Definitely. Most of what the R package does is rely on the C++ api and make things available to R. I just assumed that the python code did that for ParquetDataset too. We'll wait for a C++ implementation to mature.

@romainfrancois
Contributor

I submitted a PR to @jeffwong-nflx fork to incorporate some changes on this PR: https://github.com/jeffwong-nflx/arrow/pull/2

@wesm
Member

wesm commented Jan 3, 2019

@jeffwong-nflx would you mind if Romain picks up this branch? I think Romain can submit a new PR with your changes.

@romainfrancois
Contributor

I can go either way. It all depends on whether @jeffwong-nflx wants to do more work, in which case I can guide/help and work on other things in the meantime.

The other nicety is that it might give me ideas for a "how to contribute to the arrow R package" document.

Follow up to initial parquet support
@jeffwong-nflx
Contributor Author

@romainfrancois let's consolidate onto your branch. This minimum functionality to read parquet files into R was what I was looking for in my application; I don't know if I can commit to getting the R library to have feature parity with the python library.

@romainfrancois
Contributor

Fair enough.

What I'm proposing is that we merge this ASAP, and then I'll work out additional things for more parquet support in another branch/PR.

To merge, we need to fix the last small issues, mostly about files that have not been removed (not sure why).

Travis should pass then, and we can merge this PR to the apache repo.

@wesm things are easier if we don't merge any other R PR for now (otherwise, because of the centrality of the generated RcppExports.* files, this typically needs a rebase, and that's not fun. I'm used to it so I don't mind, but I'd rather avoid it for external contributions).

@romainfrancois
Contributor

@jeffwong-nflx in addition, if you like, would you add a line in the DESCRIPTION file and mark yourself as a contributor ("ctb"), e.g.

https://github.com/apache/arrow/blob/master/r/DESCRIPTION#L6

Something like:

    person("Jeff", "Wong", email = "<your@email>", role = c("ctb")),

@jeffwong-nflx
Contributor Author

@romainfrancois I added all the items to my fork, hope Travis CI passes!

@jeffwong-nflx
Contributor Author

Travis CI failed, but it looks like the R part passed, and the python part failed. @wesm

@romainfrancois
Contributor

The problem looks unrelated. I restarted the job. We'll see what happens. Perhaps we'll need to rebase.

@codecov-io

Codecov Report

Merging #3230 into master will increase coverage by 0.37%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #3230      +/-   ##
==========================================
+ Coverage   88.18%   88.55%   +0.37%     
==========================================
  Files         535      540       +5     
  Lines       72873    73104     +231     
==========================================
+ Hits        64260    64740     +480     
+ Misses       8506     8257     -249     
  Partials      107      107
Impacted Files Coverage Δ
cpp/src/gandiva/configuration.h 88.23% <0%> (-11.77%) ⬇️
python/pyarrow/__init__.py 65.78% <0%> (-4.36%) ⬇️
cpp/src/arrow/util/compression_snappy.cc 75% <0%> (-2.78%) ⬇️
cpp/src/arrow/type.h 89.67% <0%> (-1.89%) ⬇️
python/pyarrow/tests/test_array.py 97.28% <0%> (-1.84%) ⬇️
cpp/src/arrow/python/helpers.cc 78.26% <0%> (-1.4%) ⬇️
cpp/src/arrow/util/thread-pool-test.cc 98.14% <0%> (-0.78%) ⬇️
cpp/src/arrow/record_batch.cc 89.84% <0%> (-0.74%) ⬇️
cpp/src/arrow/io/hdfs-internal.cc 33.04% <0%> (-0.58%) ⬇️
cpp/src/plasma/io.cc 74.21% <0%> (-0.4%) ⬇️
... and 126 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1a8c8f0...c67fa3d.

@wesm left a comment
Member

+1

@@ -4,6 +4,7 @@ Version: 0.0.0.9000
Authors@R: c(
person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")),
person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),
person("Jeffrey", "Wong", email = "jeffreyw@netflix.com", role = c("ctb")),
Member

I don't know that we need to maintain these bylines

Contributor

Not necessarily.

@wesm closed this in 5723ada on Jan 5, 2019
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request on Jan 10, 2019:

Author: Jeffrey Wong <jeffreyw@netflix.com>
Author: Romain Francois <romain@purrple.cat>
Author: jeffwong-nflx <jeffreyw@netflix.com>

Closes apache#3230 from jeffwong-nflx/master and squashes the following commits:

c67fa3d <jeffwong-nflx> Merge pull request #3 from jeffwong-nflx/cleanup
1df3026 <Jeffrey Wong> don't hard code -larrow and -lparquet
8ccaa51 <Jeffrey Wong> cleanup
75ba5c9 <Jeffrey Wong> add contributor
56adad2 <jeffwong-nflx> Merge pull request #2 from romainfrancois/3731/parquet-2
7d6e64d <Romain Francois> read_parquet() only reading one parquet file, and gains a `as_tibble` argument
e936b44 <Romain Francois> need parquet on travis too
ff260c5 <Romain Francois> header was too commented, renamed to parquet.cpp
9e1897f <Romain Francois> styling etc ...
456c5d2 <Jeffrey Wong> read parquet files
22d89dd <Jeffrey Wong> hardcode -larrow and -lparquet
@kou mentioned this pull request on Feb 22, 2019