ARROW-13709: Reading JSON in R recipe (#64)
* Ensure that test chunks are not rendered

* Add code to delete any temporarily generated files, add recipe for reading JSON

* Rephrase
thisisnic committed Sep 9, 2021
1 parent e6faed2 commit 894ec7789c4d850271ee80ba6f83514a56f8fcf9
Showing 2 changed files with 111 additions and 47 deletions.
@@ -10,6 +10,7 @@ library(testthat)
library(dplyr)
# Include test
knitr::opts_template$set(test = list(
  include = FALSE,
  test = TRUE,
  eval = params$inline_test_output
))
@@ -1,8 +1,8 @@
# Reading and Writing Data

This chapter contains recipes related to reading and writing data using Apache
Arrow. When reading files into R using Apache Arrow, you can choose to read in
your file as either a `tibble` or as an Arrow Table object.

There are a number of circumstances in which you may want to read in the data as an Arrow Table:
* your dataset is large and loading it into memory may lead to performance issues
@@ -11,7 +11,9 @@ There are a number of circumstances in which you may want to read in the data as

## Converting from a tibble to an Arrow Table

You want to convert an existing `tibble` or `data.frame` into an Arrow Table.

### Solution

```{r, table_create_from_tibble}
air_table <- Table$create(airquality)
@@ -25,7 +27,11 @@ test_that("table_create_from_tibble chunk works as expected", {

## Converting data from an Arrow Table to a tibble

You want to convert an Arrow Table to a tibble to view the data or work with it
in your usual analytics pipeline. You can use either `dplyr::collect()` or
`as.data.frame()` to do this.

### Solution

```{r, collect_table}
air_tibble <- dplyr::collect(air_table)
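# The prose above also mentions as.data.frame(); as an illustrative alternative
# (not part of the original chunk), the same conversion can be done with:
air_tibble_2 <- as.data.frame(air_table)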
@@ -37,11 +43,12 @@ test_that("collect_table chunk works as expected", {
})
```

## Writing a Parquet file

You want to write Parquet files to disk.

### Solution

You can write Parquet files to disk using `arrow::write_parquet()`.
```{r, write_parquet}
# Create table
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
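# The rest of this chunk falls outside the hunk; a minimal sketch of the write
# step described above (the "my_table.parquet" file is read back in later recipes):
write_parquet(my_table, "my_table.parquet")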
@@ -54,9 +61,11 @@ test_that("write_parquet chunk works as expected", {
})
```

## Reading a Parquet file

You want to read a Parquet file.

### Solution

Given a Parquet file, it can be read back in by using `arrow::read_parquet()`.

```{r, read_parquet}
parquet_tbl <- read_parquet("my_table.parquet")
@@ -78,6 +87,9 @@ test_that("read_parquet_2 works as expected", {
expect_s3_class(parquet_tbl, "data.frame")
})
```

### Discussion

If you set `as_data_frame` to `FALSE`, the file will be read in as an Arrow Table.

```{r, read_parquet_table}
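# This chunk's body falls outside the hunk; a minimal sketch of the call the
# paragraph above describes (the object name is illustrative):
parquet_table <- read_parquet("my_table.parquet", as_data_frame = FALSE)
parquet_table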
@@ -94,18 +106,25 @@ test_that("read_parquet_table_class works as expected", {
})
```

## Read a Parquet file from S3

You want to read a Parquet file from S3.

### Solution

You can open a Parquet file saved on S3 by calling `read_parquet()` and passing the relevant URI as the `file` argument.

```{r, read_parquet_s3, eval = FALSE}
df <- read_parquet(file = "s3://ursa-labs-taxi-data/2019/06/data.parquet")
```

### See also

For more in-depth instructions, including how to work with S3 buckets that require authentication, see the guide to reading and writing data to/from S3 buckets: https://arrow.apache.org/docs/r/articles/fs.html.
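
As a minimal sketch of the authenticated case covered in that guide (the bucket name and credential lookup below are placeholders), you can construct the filesystem object explicitly with `s3_bucket()` and read from it:

```{r, eval = FALSE}
# Placeholder bucket and credentials; see the linked guide for full details
bucket <- s3_bucket(
  "my-private-bucket",
  access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
  secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY")
)
df <- read_parquet(bucket$path("2019/06/data.parquet"))
```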

## Filter columns while reading a Parquet file

You want to specify which columns to include when reading in a Parquet file.

### Solution

You can specify which columns to read in via the `col_select` argument.

```{r, read_parquet_filter}
# Create table to read back in
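# The rest of this chunk falls outside the hunk; a minimal sketch of the
# pattern it describes (table and file names below are illustrative):
dist_time <- tibble::tibble(distance = c(12.2, 15.7, 14.2), time = c(43, 44, 40))
write_parquet(dist_time, "dist_time.parquet")
# Read in only the "time" column via col_select
time_only <- read_parquet("dist_time.parquet", col_select = "time")
time_only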
@@ -123,28 +142,11 @@ test_that("read_parquet_filter works as expected", {
})
```

## Write an IPC/Feather V2 file

You want to write an IPC/Feather V2 file.

The Arrow IPC file format is identical to the Feather version 2 format. If you call `write_arrow()`, you will get a warning telling you to use `write_feather()` instead.

```{r, write_arrow}
# Create table
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
write_arrow(my_table, "my_table.arrow")
```
```{r, test_write_arrow, opts.label = "test"}
test_that("write_arrow chunk works as expected", {
expect_true(file.exists("my_table.arrow"))
expect_warning(
write_arrow(iris, "my_table.arrow"),
regexp = "Use 'write_ipc_stream' or 'write_feather' instead."
)
})
```

### Solution

Instead, you can use `write_feather()`.

```{r, write_feather}
my_table <- Table$create(tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
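# The write call falls outside the hunk; a minimal sketch, matching the
# "my_table.arrow" file checked by the test below and read back in later:
write_feather(my_table, "my_table.arrow")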
@@ -155,7 +157,7 @@ test_that("write_feather chunk works as expected", {
expect_true(file.exists("my_table.arrow"))
})
```
### Discussion

For legacy support, you can write data in the original Feather format by setting the `version` parameter to `1`.

@@ -169,11 +171,15 @@ write_feather(mtcars, "my_table.feather", version = 1)
test_that("write_feather1 chunk works as expected", {
expect_true(file.exists("my_table.feather"))
})
unlink("my_table.feather")
```

## Read a Feather file

You want to read a Feather file.

### Solution

You can read Feather files in via `read_feather()`.

```{r, read_feather}
my_feather_tbl <- read_feather("my_table.arrow")
@@ -182,15 +188,23 @@ my_feather_tbl <- read_feather("my_table.arrow")
test_that("read_feather chunk works as expected", {
expect_identical(dplyr::collect(my_feather_tbl), tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99)))
})
unlink("my_table.arrow")
```

## Write Streaming IPC Files

You want to write to the IPC stream format.

### Solution

You can write to the IPC stream format using `write_ipc_stream()`.

```{r, write_ipc_stream}
# Create table
my_table <- Table$create(
  tibble::tibble(
    group = c("A", "B", "C"),
    score = c(99, 97, 99)
  )
)
# Write to IPC stream format
write_ipc_stream(my_table, "my_table.arrows")
```
@@ -199,15 +213,23 @@ test_that("write_ipc_stream chunk works as expected", {
expect_true(file.exists("my_table.arrows"))
})
```

## Read Streaming IPC Files

You want to read from the IPC stream format.

### Solution

You can read from the IPC stream format using `read_ipc_stream()`.
```{r, read_ipc_stream}
my_ipc_stream <- arrow::read_ipc_stream("my_table.arrows")
```
```{r, test_read_ipc_stream, opts.label = "test"}
test_that("read_ipc_stream chunk works as expected", {
  expect_equal(
    my_ipc_stream,
    tibble::tibble(group = c("A", "B", "C"), score = c(99, 97, 99))
  )
})
unlink("my_table.arrows")
```

## Reading and Writing CSV files
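
The solution chunk for this section falls outside the hunk shown below; as a
minimal sketch of the pattern its test exercises (writing the built-in `cars`
data frame to CSV and reading it back as an Arrow Table), something like the
following could be used:

```{r, eval = FALSE}
# Write a data frame to CSV, then read it back without converting to a tibble
write_csv_arrow(cars, "cars.csv")
my_csv <- read_csv_arrow("cars.csv", as_data_frame = FALSE)
```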
@@ -233,13 +255,48 @@ my_csv <- read_csv_arrow("cars.csv", as_data_frame = FALSE)
test_that("read_csv_arrow chunk works as expected", {
expect_equivalent(dplyr::collect(my_csv), cars)
})
unlink("cars.csv")
```

## Read JSON files

You want to read a JSON file.

### Solution

```{r, read_json_arrow}
# Create a file to read back in
tf <- tempfile()
writeLines('
{"country": "United Kingdom", "code": "GB", "long": -3.44, "lat": 55.38}
{"country": "France", "code": "FR", "long": 2.21, "lat": 46.23}
{"country": "Germany", "code": "DE", "long": 10.45, "lat": 51.17}
', tf, useBytes = TRUE)
# Read in the data
countries <- read_json_arrow(tf, col_select = c("country", "long", "lat"))
countries
```
```{r, test_read_json_arrow, opts.label = "test"}
test_that("read_json_arrow chunk works as expected", {
  expect_equivalent(
    countries,
    tibble::tibble(
      country = c("United Kingdom", "France", "Germany"),
      long = c(-3.44, 2.21, 10.45),
      lat = c(55.38, 46.23, 51.17)
    )
  )
})
unlink(tf)
```


## Write Partitioned Data

You want to save data to disk in partitions based on columns in the data.

### Solution

You can use `write_dataset()` to do this.

```{r, write_dataset}
write_dataset(airquality, "airquality_partitioned", partitioning = c("Month", "Day"))
@@ -267,9 +324,11 @@ Each of these folders contains 1 or more Parquet files containing the relevant p
list.files("airquality_partitioned/Month=5/Day=10")
```

## Reading Partitioned Data

You want to read partitioned data.

### Solution

You can use `open_dataset()` to read partitioned data.

```{r, open_dataset}
# Read data from directory
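# The rest of this chunk falls outside the hunk; a minimal sketch of the call
# it describes (the object name is illustrative):
air_data <- open_dataset("airquality_partitioned")
air_data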
@@ -285,3 +344,7 @@ test_that("open_dataset chunk works as expected", {
})
```

```{r}
unlink("airquality_partitioned", recursive = TRUE)
```
