Merge pull request #2 from romainfrancois/3731/parquet-2
Follow up to initial parquet support
jeffwong-nflx committed Jan 4, 2019
2 parents 456c5d2 + 7d6e64d commit 56adad2
Showing 11 changed files with 129 additions and 47 deletions.
2 changes: 2 additions & 0 deletions .travis.yml
@@ -307,6 +307,8 @@ matrix:
language: r
cache: packages
latex: false
env:
- ARROW_TRAVIS_PARQUET=1
before_install:
# Have to copy-paste this here because of how R's build steps work
- eval `python $TRAVIS_BUILD_DIR/ci/detect-changes.py`
1 change: 1 addition & 0 deletions r/DESCRIPTION
@@ -61,6 +61,7 @@ Collate:
'memory_pool.R'
'message.R'
'on_exit.R'
'parquet.R'
'read_record_batch.R'
'read_table.R'
'reexports-bit64.R'
1 change: 1 addition & 0 deletions r/NAMESPACE
@@ -113,6 +113,7 @@ export(print.integer64)
export(read_arrow)
export(read_feather)
export(read_message)
export(read_parquet)
export(read_record_batch)
export(read_schema)
export(read_table)
4 changes: 4 additions & 0 deletions r/R/RcppExports.R

Some generated files are not rendered by default.

33 changes: 33 additions & 0 deletions r/R/parquet.R
@@ -0,0 +1,33 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

#' Read a Parquet file from disk
#'
#' @param file a file path
#' @param as_tibble should the [arrow::Table][arrow__Table] be converted to a tibble?
#' @param ... currently ignored
#'
#' @return an [arrow::Table][arrow__Table], or a data frame if `as_tibble` is `TRUE`.
#'
#' @export
read_parquet <- function(file, as_tibble = TRUE, ...) {
  # Read the file at `file` into an arrow::Table through the C++ binding
  tab <- shared_ptr(`arrow::Table`, read_parquet_file(file))
  if (isTRUE(as_tibble)) {
    tab <- as_tibble(tab)
  }
  tab
}
2 changes: 1 addition & 1 deletion r/README.Rmd
@@ -25,7 +25,7 @@ git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
-cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
+cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
```

61 changes: 16 additions & 45 deletions r/README.md
@@ -14,7 +14,7 @@ git clone https://github.com/apache/arrow.git
cd arrow/cpp && mkdir release && cd release

# It is important to statically link to boost libraries
-cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
+cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
make install
```

@@ -38,48 +38,19 @@ tf <- tempfile()
#> # A tibble: 10 x 2
#> x y
#> <int> <dbl>
#> 1 1 -0.255
#> 2 2 -0.162
#> 3 3 -0.614
#> 4 4 -0.322
#> 5 5 0.0693
#> 6 6 -0.920
#> 7 7 -1.08
#> 8 8 0.658
#> 9 9 0.821
#> 10 10 0.539
arrow::write_arrow(tib, tf)

# read it back with pyarrow
pa <- import("pyarrow")
as_tibble(pa$open_file(tf)$read_pandas())
#> # A tibble: 10 x 2
#> x y
#> <int> <dbl>
#> 1 1 -0.255
#> 2 2 -0.162
#> 3 3 -0.614
#> 4 4 -0.322
#> 5 5 0.0693
#> 6 6 -0.920
#> 7 7 -1.08
#> 8 8 0.658
#> 9 9 0.821
#> 10 10 0.539
```

## Development

### Code style

We use Google C++ style in our C++ code. Check for style errors with

```
./lint.sh
```

You can fix the style issues with

#> 1 1 0.0855
#> 2 2 -1.68
#> 3 3 -0.0294
#> 4 4 -0.124
#> 5 5 0.0675
#> 6 6 1.64
#> 7 7 1.54
#> 8 8 -0.0209
#> 9 9 -0.982
#> 10 10 0.349
# arrow::write_arrow(tib, tf)

# # read it back with pyarrow
# pa <- import("pyarrow")
# as_tibble(pa$open_file(tf)$read_pandas())
```
./lint.sh --fix
```
2 changes: 1 addition & 1 deletion r/configure
@@ -26,7 +26,7 @@
# R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'

# Library settings
-PKG_CONFIG_NAME="arrow"
+PKG_CONFIG_NAME="arrow parquet"
PKG_DEB_NAME="arrow"
PKG_RPM_NAME="arrow"
PKG_CSW_NAME="arrow"
21 changes: 21 additions & 0 deletions r/man/read_parquet.Rd

Some generated files are not rendered by default.

12 changes: 12 additions & 0 deletions r/src/RcppExports.cpp

Some generated files are not rendered by default.

37 changes: 37 additions & 0 deletions r/src/parquet.cpp
@@ -0,0 +1,37 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>

// [[Rcpp::export]]
std::shared_ptr<arrow::Table> read_parquet_file(std::string filename) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_THROW_NOT_OK(
      arrow::io::ReadableFile::Open(filename, arrow::default_memory_pool(), &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

  return table;
}
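
Each fallible call in `read_parquet_file` is wrapped in `PARQUET_THROW_NOT_OK`, which turns a failed status into a C++ exception. The block below is a minimal, self-contained sketch of that general pattern, not the Parquet library's actual definition: `Status`, `SKETCH_THROW_NOT_OK`, `OpenSketch`, and `OpenOrThrow` are hypothetical names invented for illustration.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a status type. This is NOT arrow::Status.
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status IOError(std::string msg) { return Status(false, std::move(msg)); }

  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), message_(std::move(msg)) {}

  bool ok_;
  std::string message_;
};

// Hypothetical macro in the spirit of PARQUET_THROW_NOT_OK: evaluate the
// expression exactly once and throw if the resulting status is not OK.
#define SKETCH_THROW_NOT_OK(expr)                      \
  do {                                                 \
    Status status_ = (expr);                           \
    if (!status_.ok()) {                               \
      throw std::runtime_error(status_.message());     \
    }                                                  \
  } while (0)

// Hypothetical function using the same "return a status, write the result
// through an out-parameter" shape as the Open(..., &infile) call above.
Status OpenSketch(const std::string& filename, std::string* out) {
  if (filename.empty()) {
    return Status::IOError("empty filename");
  }
  *out = "handle:" + filename;
  return Status::OK();
}

// Wrapper that lists its fallible steps without explicit if-checks.
std::string OpenOrThrow(const std::string& filename) {
  std::string handle;
  SKETCH_THROW_NOT_OK(OpenSketch(filename, &handle));
  return handle;
}
```

With this shape, a wrapper can chain several fallible steps in sequence and stop at the first failure. Rcpp-exported functions generally surface an uncaught `std::exception` as an R error, so a failed read should reach the R caller as an error condition rather than a crash.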
