Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Reading large Parquet files from a seekable connection fails #36819

Closed
paleolimbot opened this issue Jul 22, 2023 · 0 comments · Fixed by #37274
Closed

[R] Reading large Parquet files from a seekable connection fails #36819

paleolimbot opened this issue Jul 22, 2023 · 0 comments · Fixed by #37274

Comments

@paleolimbot
Copy link
Member

Describe the bug, including details regarding any error messages, version, and platform.

To support IO that calls into R (e.g., using connection objects as Input/Output streams), we wrap most our calls to the various file readers in RunWithCapturedR(). We didn't do this for Parquet because it didn't seem to require it; however, it seems that for Parquet files that are large enough we actually do ( https://stackoverflow.com/questions/76739590/r-arrow-read-parquet-call-to-r-seek-on-r-connection-from-a-non-r-thread-fro ).

At least the ReadTable() calls below should be wrapped in RunWithCapturedR() to support this behaviour:

arrow/r/src/parquet.cpp

Lines 98 to 148 in b557e85

// [[parquet::export]]
std::shared_ptr<arrow::Table> parquet___arrow___FileReader__ReadTable1(
const std::shared_ptr<parquet::arrow::FileReader>& reader) {
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
return table;
}
// [[parquet::export]]
std::shared_ptr<arrow::Table> parquet___arrow___FileReader__ReadTable2(
const std::shared_ptr<parquet::arrow::FileReader>& reader,
const std::vector<int>& column_indices) {
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadTable(column_indices, &table));
return table;
}
// [[parquet::export]]
std::shared_ptr<arrow::Table> parquet___arrow___FileReader__ReadRowGroup1(
const std::shared_ptr<parquet::arrow::FileReader>& reader, int i) {
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadRowGroup(i, &table));
return table;
}
// [[parquet::export]]
std::shared_ptr<arrow::Table> parquet___arrow___FileReader__ReadRowGroup2(
const std::shared_ptr<parquet::arrow::FileReader>& reader, int i,
const std::vector<int>& column_indices) {
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadRowGroup(i, column_indices, &table));
return table;
}
// [[parquet::export]]
std::shared_ptr<arrow::Table> parquet___arrow___FileReader__ReadRowGroups1(
const std::shared_ptr<parquet::arrow::FileReader>& reader,
const std::vector<int>& row_groups) {
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadRowGroups(row_groups, &table));
return table;
}
// [[parquet::export]]
std::shared_ptr<arrow::Table> parquet___arrow___FileReader__ReadRowGroups2(
const std::shared_ptr<parquet::arrow::FileReader>& reader,
const std::vector<int>& row_groups, const std::vector<int>& column_indices) {
std::shared_ptr<arrow::Table> table;
PARQUET_THROW_NOT_OK(reader->ReadRowGroups(row_groups, column_indices, &table));
return table;
}

Component(s)

R

paleolimbot added a commit that referenced this issue Sep 5, 2023
### Rationale for this change

When we first added RunWithCapturedR to support reading files from R connections, none of the Parquet tests seemed to call R from another thread. Because RunWithCapturedR comes with some complexity, I didn't add it anywhere it wasn't strictly needed. A recent StackOverflow post exposed that reading very large parquet files do use multiple threads and thus need RunWithCapturedR.

### What changes are included in this PR?

The two most common calls to read a parquet in which a user might trigger this failure are now wrapped in RunWithCapturedR.

### Are these changes tested?

The changes are tested in the current suite.

### Are there any user-facing changes?

No.
* Closes: #36819

Lead-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
@paleolimbot paleolimbot added this to the 14.0.0 milestone Sep 5, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…pache#37274)

### Rationale for this change

When we first added RunWithCapturedR to support reading files from R connections, none of the Parquet tests seemed to call R from another thread. Because RunWithCapturedR comes with some complexity, I didn't add it anywhere it wasn't strictly needed. A recent StackOverflow post exposed that reading very large parquet files do use multiple threads and thus need RunWithCapturedR.

### What changes are included in this PR?

The two most common calls to read a parquet in which a user might trigger this failure are now wrapped in RunWithCapturedR.

### Are these changes tested?

The changes are tested in the current suite.

### Are there any user-facing changes?

No.
* Closes: apache#36819

Lead-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…pache#37274)

### Rationale for this change

When we first added RunWithCapturedR to support reading files from R connections, none of the Parquet tests seemed to call R from another thread. Because RunWithCapturedR comes with some complexity, I didn't add it anywhere it wasn't strictly needed. A recent StackOverflow post exposed that reading very large parquet files do use multiple threads and thus need RunWithCapturedR.

### What changes are included in this PR?

The two most common calls to read a parquet in which a user might trigger this failure are now wrapped in RunWithCapturedR.

### Are these changes tested?

The changes are tested in the current suite.

### Are there any user-facing changes?

No.
* Closes: apache#36819

Lead-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Co-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants