Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37969: [C++][Parquet] add more closed file checks for ParquetFileWriter #38390

Merged
merged 11 commits into from
Nov 16, 2023

Conversation

quanghgx
Copy link
Contributor

@quanghgx quanghgx commented Oct 22, 2023

Rationale for this change

Operations on closed ParquetFileWriter are not allowed, but should not segfault. Somehow, ParquetFileWriter::Close() also reset its pimpl, so after that, any operators, those need this pointer will lead to segfault

What changes are included in this PR?

Adding more checks for closed file.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@mapleFU
Copy link
Member

mapleFU commented Oct 22, 2023

(I didn't go through the issue carefully, but would CheckClosed function like S3 Filesystem better here?)

e.g. https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L1249

@quanghgx quanghgx changed the title GH-37969: [C++] ParquetFileWriter add more closed file checks (WIP) GH-37969: [C++] ParquetFileWriter add more closed file checks Oct 29, 2023
@quanghgx
Copy link
Contributor Author

(I didn't go through the issue carefully, but would CheckClosed function like S3 Filesystem better here?)

e.g. https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/s3fs.cc#L1249

Thanks for your hint. Refactored the checking into separate method. Use "Assert" here because it will throw instead of return arrow::Status. As operation on closed object is not a normal application flow.

Please help take a look again, @mapleFU

@quanghgx quanghgx changed the title GH-37969: [C++] ParquetFileWriter add more closed file checks GH-37969: [C++][Parquet] ParquetFileWriter add more closed file checks Oct 29, 2023
@quanghgx quanghgx changed the title GH-37969: [C++][Parquet] ParquetFileWriter add more closed file checks GH-37969: [C++][Parquet] add more closed file checks for ParquetFileWriter Oct 29, 2023
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 30, 2023
@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2023

Use "Assert" here because it will throw instead of return arrow::Status. As operation on closed object is not a normal application flow.

Ooops, I think you're choosing the right way, but thats maybe because parquet lib uses exception, and arrow uses Status. So I think we'd better use Exception here. If we are do it in parquet/arrow directory we need use Status

@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2023

  Status Close() override {
    if (!closed_) {
      // Make idempotent
      closed_ = true;
      if (row_group_writer_ != nullptr) {
        auto row_group_writer = std::move(row_group_writer_);
        PARQUET_CATCH_NOT_OK(row_group_writer->Close());
      }
      PARQUET_CATCH_NOT_OK(writer_->Close());
    }

Actually this change LGTM. But I'm not fully understand why row_group_writer_ will be double-closed. For example, when call FileWriter::Close twice, the closed_ will be true, so row_group_writer_ will only close once. Did I miss something?

Edit: You can leave this patch just checking here, and find out why it would close twice, then open a new issue for that?

I'm a bit tired today, maybe I'll gothrough the code here tomorrow

@quanghgx
Copy link
Contributor Author

Actually this change LGTM. But I'm not fully understand why row_group_writer_ will be double-closed. For example, when call FileWriter::Close twice, the closed_ will be true, so row_group_writer_ will only close once. Did I miss something?

you are right, normally, it will not be called twice. However, in test case for use-after-close with writer->Close() and then writer->NewBufferedRowGroup():

ASSERT_OK(writer->Close());
EXPECT_THROW_THAT([&]() { ASSERT_OK(writer->WriteTable(*table, 1)); }, ParquetException,
::testing::Property(
&ParquetException::what,
::testing::HasSubstr("Cannot get properties from closed file")));
// These two functions do not throw because they use PARQUET_CATCH_NOT_OK
// in their implementations.
ASSERT_RAISES_WITH_MESSAGE(IOError,
"IOError: Cannot append buffered-row-group to closed file",
writer->NewBufferedRowGroup());

To be honest, I think I need to study a bit more to fully understand state diagram of these there attributes {writer_, row_group_writer_ and closed_ } of FileWriterImpl. If you have hints for me, that will be great?

std::unique_ptr<ParquetFileWriter> writer_;
RowGroupWriter* row_group_writer_;
ArrowWriteContext column_write_context_;
std::shared_ptr<ArrowWriterProperties> arrow_properties_;
bool closed_;

Thank you so much @mapleFU

@mapleFU
Copy link
Member

mapleFU commented Oct 30, 2023

The logic here is not hard to understand, but a bit complex because so many pimpl used and there are parquet and also parquet/arrow. I'll try to draw something tomorrow.

@quanghgx quanghgx force-pushed the gh37969_segfault_use_after_close branch from 2b0dcc4 to c718b8b Compare October 31, 2023 01:00
@quanghgx
Copy link
Contributor Author

quanghgx commented Oct 31, 2023

Update:,

  • Revised MR (simplified) so that it focuses on the fix for writer->Close(); writer->WriteTable(..)
  • Double closing and other cases of use after close will need further understanding. Will open a new issue with your help to clarify the intended flow of the code.

How do you think? Thank you so much for guiding me along the way; I don't mind to document it down, get your review along with MR for that. Thanks @mapleFU

@mapleFU
Copy link
Member

mapleFU commented Oct 31, 2023

Parquet is about two parts in this library

  1. parquet: the core part of parquet, mostly it handles types like parquet defined physical-type and logical type.
  2. parquet/arrow: the arrow part of parquet. It read / write arrow array, and maintaining some other infos.

The writer has the structure below:

arrow::parquet::FileWriter (Holding parquet::ParquetFileWriter, and some arrow-related schema/metadata)
parquet::ParquetFileWriter ( holds fileMetadata and at most one active row-group writer)
- parquet::FileSerializer extends parquet::ParquetFileWriter::Contents
parquet::RowGroupWriter (Holding some ColumnWriter)
- parquet::RowGroupSerializer extends parquet::RowGroupWriter::Contents
ColumnWriter
- May hold encoder and write multiple pages

Also, there are some buffered and unbuffered RowGroupWriter. You can go through the doc for that.

mapleFU
mapleFU previously approved these changes Oct 31, 2023
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Also cc @wgtmac @pitrou

@mapleFU mapleFU dismissed their stale review October 31, 2023 16:16

Find the reason for re-close

@mapleFU
Copy link
Member

mapleFU commented Oct 31, 2023

Oh previously I approve because the change looks ok to me, but after re-review I found the reason for re-close.

  Status NewRowGroup(int64_t chunk_size) override {
    if (row_group_writer_ != nullptr) {
      PARQUET_CATCH_NOT_OK(row_group_writer_->Close());
    }
    PARQUET_CATCH_NOT_OK(row_group_writer_ = writer_->AppendRowGroup());
    return Status::OK();
  }

  Status Close() override {
    if (!closed_) {
      // Make idempotent
      closed_ = true;
      if (row_group_writer_ != nullptr) {
        PARQUET_CATCH_NOT_OK(row_group_writer_->Close());
      }
      PARQUET_CATCH_NOT_OK(writer_->Close());
    }
    return Status::OK();
  }

Also I think the re-close is like:

Close();
NewRowGroupWriter();

in parquet::arrow::FileWriter. And

  1. Close -> row_group_writer_->Close() called once without cleaning closed_
  2. NewRowGroupWriter() -> find row_group_writer_ != nullptr, and close it.

Solving: You can also add a CheckClosed like #38390 (comment) here. This time we should pick return Status, because the parquet::arrow uses Status.

Here I think you should:

  1. keep the current change, I think it's ok here.
  2. Add CheckClosed in NewRowGroup and NewBufferedRowGroup in parquet::arrow::FileWriter

This will also prevent from the issue, because WriteTable api will NewRowGroup first.

@quanghgx quanghgx marked this pull request as draft November 1, 2023 00:53
@quanghgx
Copy link
Contributor Author

Thank you, @mapleFU, for providing a clear and detailed guideline. I've applied these changes and added back two simple tests.

I have one final comment regarding the WriteTable function:

We currently have two options for handling errors in this function:

  • (A.) We can let writer->WriteTable throw a ParquetException. This is because, in this part of the code, it calls this->properties, which directly returns a ParquetException from writer_.
  • (B.) Alternatively, we can add RETURN_NOT_OK(CheckClosed()); to the beginning of WriteTable, similar to NewRowGroup and NewBufferedRowGroup. This would consistently return arrow::Status::Invalid.

I personally lean towards option (B.) for the sake of consistency. However, considering that use-after-close is not a typical flow (usually requiring user code revision to fully handle this case), using an exception might be a viable option here.

This part is making me uncertain, so I would appreciate your input on this matter. Thanks mappleFU.

@mapleFU
Copy link
Member

mapleFU commented Nov 14, 2023

Sorry for delaying reply.

Nice catch, I missed this->properties() call here. I prefer we can also add a CheckClosed in WriteTable, this is much more easier... Personally properties might be called frequently, which might be hard to debug and reason.

@quanghgx quanghgx marked this pull request as ready for review November 14, 2023 13:38
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Also cc @pitrou @wgtmac

@mapleFU
Copy link
Member

mapleFU commented Nov 14, 2023

CI failed in unrelated

Will wait other committers review

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't you protecting all public methods in FileWriter? I don't think it makes sense to only protect a couple of them.

@mapleFU
Copy link
Member

mapleFU commented Nov 14, 2023

@pitrou Should this better for adding it on

  Status WriteColumnChunk(const std::shared_ptr<ChunkedArray>& data, int64_t offset,
                          int64_t size) override {...} 

  Status WriteRecordBatch(const RecordBatch& batch) override;

I think other methods like Init and schema() is not so neccessary here?

@pitrou
Copy link
Member

pitrou commented Nov 14, 2023

Everywhere where we rely on something that's destroyed by Close.

@mapleFU
Copy link
Member

mapleFU commented Nov 14, 2023

Well I think adding in functions I mentioned above is ok

Previously I think NewRowGroup is OK, because without WriteRecordBatch, every write operation will call New{Buffered}RowGroup, but this change also LGTM

Maybe you can try add CheckClosed in WriteRecordBatch, WriteColumnChunk (maybe also Init?). I think for some interfaces, like schema, can not add this check. If a WriteTable or WriteColumnChunk calls another overload function with same name, we can not adding the check here.

@quanghgx
Copy link
Contributor Author

Thanks @mapleFU and @pitrou
Added:

  • WriteColumnChunk(.. ) // Actual implementation; other two overloads all delegate to this.
  • WriteRecordBatch(.. )

Reviewed, safe to skip:

  • Init() // Internally implemented and called only within FileWriter::Make, with a newly created writer.
  • schema() // Independent of the writer.
  • properties() // Internal implementation, called by WriteTable, WriteRecordBatch, both already validated.
  • metadata() // In the current implementation of ParquetFileWriter::Content, it's safe to be called when closed. Open to adding this.

@quanghgx
Copy link
Contributor Author

quanghgx commented Nov 15, 2023

update: checked 3 failing tests. They are unrelated to this MR, I believe.

@pitrou
Copy link
Member

pitrou commented Nov 16, 2023

Indeed, the CI failures are unrelated.

@pitrou
Copy link
Member

pitrou commented Nov 16, 2023

Thank you @quanghgx ! I will merge this PR now.

@pitrou pitrou merged commit 3e0ca5b into apache:main Nov 16, 2023
31 of 32 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Nov 16, 2023
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Nov 16, 2023
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 3e0ca5b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

@quanghgx quanghgx deleted the gh37969_segfault_use_after_close branch November 17, 2023 01:13
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…tFileWriter (apache#38390)

### Rationale for this change
Operations on closed ParquetFileWriter are not allowed, but should not segfault. Somehow, ParquetFileWriter::Close() also reset its pimpl, so after that, any operators, those need this pointer will lead to segfault

### What changes are included in this PR?
Adding more checks for closed file.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.

* Closes: apache#37969

Authored-by: Quang Hoang <quanghgx@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R][C++] segfault when writing to ParquetFileWriter after closing
3 participants