Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Minor: Declare lifetime of shared_ptr<Page> on PageReader #34722

Closed
mapleFU opened this issue Mar 24, 2023 · 3 comments · Fixed by #35368
Closed

[C++][Parquet] Minor: Declare lifetime of shared_ptr<Page> on PageReader #34722

mapleFU opened this issue Mar 24, 2023 · 3 comments · Fixed by #35368

Comments

@mapleFU
Copy link
Member

mapleFU commented Mar 24, 2023

Describe the enhancement requested

In C++ Parquet, SerializedPageReader holds a decryption_buffer_ and decompression_buffer_, and when NextPage called, it will reuse that buffer. So, Page's buffer data is bounded on SerializedPageReader::NextPage, when NextPage is called, it's data might be reset.

I have two problems here:

  1. If it's by design, why don't we cache raw data in SerializePageReader?
  2. Should we note these in comments?

Component(s)

C++, Parquet

@mapleFU
Copy link
Member Author

mapleFU commented Mar 24, 2023

@pitrou @wjones127 mind take a look?

@cyb70289
Copy link
Contributor

It might be easier to review if you can paster the code link here.

@mapleFU
Copy link
Member Author

mapleFU commented Mar 24, 2023

The interface:

// Abstract page iterator interface. This way, we can feed column pages to the
// ColumnReader through whatever mechanism we choose
class PARQUET_EXPORT PageReader {
  using DataPageFilter = std::function<bool(const DataPageStats&)>;

 public:
  virtual ~PageReader() = default;

  // @returns: shared_ptr<Page>(nullptr) on EOS, std::shared_ptr<Page>
  // containing new Page otherwise
  virtual std::shared_ptr<Page> NextPage() = 0;

The actual:

// This subclass delimits pages appearing in a serialized stream, each preceded
// by a serialized Thrift format::PageHeader indicating the type of each page
// and the page metadata.
class SerializedPageReader : public PageReader {
 public:
  SerializedPageReader(std::shared_ptr<ArrowInputStream> stream, int64_t total_num_values,
                       Compression::type codec, const ReaderProperties& properties,
                       const CryptoContext* crypto_ctx, bool always_compressed)
      : properties_(properties),
        stream_(std::move(stream)),
        decompression_buffer_(AllocateBuffer(properties_.memory_pool(), 0)),
        decryption_buffer_(AllocateBuffer(properties_.memory_pool(), 0)) {}

It (implicitly) depend on SerializedPageReader's buffer

wjones127 added a commit that referenced this issue May 9, 2023
…35368)

### Rationale for this change

This change update wording of Parquet `NextPage`. It will reuse same decompression/decrypt buffer internal. So, use should attension it syntax.

### What changes are included in this PR?

No code change, just add some comments.

### Are these changes tested?

No need

### Are there any user-facing changes?

No.

* Closes: #34722

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added this to the 13.0.0 milestone May 9, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…age (apache#35368)

### Rationale for this change

This change update wording of Parquet `NextPage`. It will reuse same decompression/decrypt buffer internal. So, use should attension it syntax.

### What changes are included in this PR?

No code change, just add some comments.

### Are these changes tested?

No need

### Are there any user-facing changes?

No.

* Closes: apache#34722

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…age (apache#35368)

### Rationale for this change

This change update wording of Parquet `NextPage`. It will reuse same decompression/decrypt buffer internal. So, use should attension it syntax.

### What changes are included in this PR?

No code change, just add some comments.

### Are these changes tested?

No need

### Are there any user-facing changes?

No.

* Closes: apache#34722

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…age (apache#35368)

### Rationale for this change

This change update wording of Parquet `NextPage`. It will reuse same decompression/decrypt buffer internal. So, use should attension it syntax.

### What changes are included in this PR?

No code change, just add some comments.

### Are these changes tested?

No need

### Are there any user-facing changes?

No.

* Closes: apache#34722

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants