
PARQUET-2411: [C++][Parquet] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader #39153

Merged
merged 3 commits into from
Jan 5, 2024

Conversation

@jp0317 (Contributor) commented Dec 9, 2023

Rationale for this change

This proposes an API to read only the dictionary from ByteArrayDictionaryRecordReader, enabling use cases where the caller just wants to inspect the dictionary content.

What changes are included in this PR?

New APIs to enable reading dictionary with RecordReader.

Are these changes tested?

Unit tests.

Are there any user-facing changes?

New APIs without breaking existing workflow.

@jp0317 jp0317 requested a review from wgtmac as a code owner December 9, 2023 00:50
github-actions bot commented Dec 9, 2023

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

return nullptr;
}
// Verify the current data page is dictionary encoded.
if (this->current_encoding_ != Encoding::RLE_DICTIONARY) {
Contributor

doesn't this need to be plain dictionary also?

Contributor

or actually, I guess PLAIN?

Contributor Author

The current_encoding_ is set to RLE_DICTIONARY when the page encoding is PLAIN_DICTIONARY, so this check covers both.

Member

Perhaps adding a comment to clear the confusion?

Contributor Author

added some comments, thanks!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Dec 9, 2023
/// \brief Returns the dictionary owned by the current decoder. Throws an
/// exception if the current decoder is not for dictionary encoding.
/// \param[out] dictionary_length The number of dictionary entries.
virtual const uint8_t* ReadDictionary(int32_t* dictionary_length) = 0;
Contributor

Why uint8_t? Should this be int32_t? Or could you provide more details on how to interpret this?

Contributor Author

I just made it consistent with values(), which uses uint8_t*. The caller should handle the conversion properly based on the column type.

Member

It seems the comment on values() says FLBA and ByteArray types do not use this array. Should we change this into a template function, or simply return const void*, so that downstream code is required to be aware of the type, avoiding misuse of the uint8_t*?

Contributor Author

changed to void*, thanks!

<< EncodingToString(this->current_encoding_);
throw ParquetException(ss.str());
}
auto decoder = dynamic_cast<DictDecoder<DType>*>(this->current_decoder_);
Member

nit: use arrow::internal::checked_cast

Contributor Author

done, thanks!

Contributor Author

didn't notice that there's a virtual base. changed back to dynamic_cast.

@@ -61,6 +61,34 @@ static constexpr uint32_t kFooterSize = 8;
// For PARQUET-816
static constexpr int64_t kMaxDictHeaderSize = 100;

bool IsColumnChunkFullyDictionaryEncoded(const ColumnChunkMetaData& col) {
Member

Move it into anonymous namespace?

Contributor Author

done, thanks!

// fully dictionary encoded byte arrays. The caller can verify if the reader can read
// and expose the dictionary by checking the reader's read_dictionary(). If a column
// chunk uses dictionary encoding but then falls back to plain encoding, the returned
// reader will read decoded data without exposing the dictionary.
Member

Should we throw if the column chunk falls back halfway, to avoid misuse?

Contributor Author (@jp0317) Dec 10, 2023

If it falls back, read_dictionary() will return false. I reworded the comment to state that the caller should verify the reader using read_dictionary().

@wgtmac wgtmac changed the title PARQUET-2411: [C++] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader PARQUET-2411: [C++][Parquet] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader Dec 9, 2023
@Hattonuri (Contributor)

Am I right that this can be used for faster checking of value presence for Expression == ? (like in bloom filters)

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 10, 2023
return RecordReader(
i,
/*read_dictionary=*/encoding_to_expose == ExposedEncoding::DICTIONARY &&
IsColumnChunkFullyDictionaryEncoded(*metadata()->ColumnChunk(i)));
Member

Nit (not about correctness): metadata()->ColumnChunk(i) would be heavy because it creates a mirror object for the ColumnChunk.

Contributor Author

good point, thanks! It's probably fine as this is on reader creation, rather than a perf critical path.

Contributor

I guess it depends on how expensive the initialization is?

Member

I guess it depends on how expensive the initialization is?

The code in the current metadata implementation is a bit tricky, but it does not copy the underlying data. I think it's OK to do this as long as we don't call it per row group.

@mapleFU (Member) left a comment

Rest looks ok to me

Also, this isn't directly related to ByteArray, is it?

@@ -1369,6 +1369,26 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
return bytes_for_values;
}

const void* ReadDictionary(int32_t* dictionary_length) override {
if (this->current_decoder_ == nullptr && !this->HasNextInternal()) {
Member

Would this trigger initialization? (Called before initialization, since current_decoder_ == nullptr)

Contributor Author

Yes, it would try to initialize it in HasNextInternal.

@mapleFU (Member) commented Dec 12, 2023

Am I right that this can be used for faster checking of value presence for Expression == ? (like in bloom filters)

Hmmm, we haven't implemented a dictionary filter yet, but maybe we can try it later.

@emkornfield (Contributor)

Am I right that this can be used for faster checking of value presence for Expression == ? (like in bloom filters)

This is one potential use. In general, it can be used to reduce the cost of predicate evaluation by precomputing the predicate on the dictionary, so that each per-value evaluation becomes a simple lookup. For some predicates (like equality) that can rule out all data being present, this can have a large impact.

@@ -368,6 +368,11 @@ class PARQUET_EXPORT RecordReader {

virtual void DebugPrintState() = 0;

/// \brief Returns the dictionary owned by the current decoder. Throws an
/// exception if the current decoder is not for dictionary encoding.
Contributor

Maybe add a comment that in general the only safe time to call this is before any values have been read?

Contributor Author

I think it's also fine to read the dictionary after reading some values, as long as the decoder is still valid.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 12, 2023
/// \brief Returns the dictionary owned by the current decoder. Throws an
/// exception if the current decoder is not for dictionary encoding.
/// \param[out] dictionary_length The number of dictionary entries.
virtual const void* ReadDictionary(int32_t* dictionary_length) = 0;
Contributor

I think it should be possible to make this templated (with implementations in the .cc file)? If so, I think that might make sense for consumers. Either way, we should clarify how users can understand what to cast the dictionary to.

Contributor Author

Added some explanations and examples in the function comments. PTAL

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 12, 2023
@mapleFU (Member) commented Dec 13, 2023

A remaining question: what is this expected to do with non-ByteArray types, like int32?

@jp0317 (Contributor Author) commented Dec 13, 2023

A remaining question: what is this expected to do with non-ByteArray types, like int32?

Oh sorry, I missed the last comment. The new GetDictionary itself would work as long as the internal decoder is a dictionary decoder, regardless of type. The issue is that the current record reader only reads the dictionary for byte arrays (i.e., values() always returns decoded values and read_dictionary() returns false for non-byte-array types, which is why I target byte arrays in this PR). I think it probably makes sense to extend read_dictionary to other types in a separate PR.

@mapleFU (Member) left a comment

Rest looks good to me

// PLAIN_DICTIONARY.
if (this->current_encoding_ != Encoding::RLE_DICTIONARY) {
std::stringstream ss;
ss << "Data page is not dictionary encoded. Encoding: "
Member

nit: should we also add descr_->ToString() to help debugging?

Contributor Author

Maybe it's better to leave this as a user choice (they can still print the descr using the existing API)? I don't have a strong preference, but let me know if you prefer always printing it.

/// exception if the current decoder is not for dictionary encoding. The caller is
/// responsible for casting the returned pointer to proper type depending on the
/// column's physical type. An example:
/// ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
Member

Suggested change
/// ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
/// const ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));

/// column's physical type. An example:
/// ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
/// or:
/// float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));
Member

Suggested change
/// float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));
/// const float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));

Contributor Author

done, thanks for catching this!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Dec 19, 2023
@mapleFU (Member) commented Jan 3, 2024

@emkornfield @jp0317 Would you like to merge this?

@emkornfield (Contributor)

Yes, I think we can merge it if there are no objections.

@mapleFU mapleFU merged commit bec0385 into apache:main Jan 5, 2024
33 of 34 checks passed
@mapleFU mapleFU removed the awaiting merge Awaiting merge label Jan 5, 2024

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit bec0385.

There were 2 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

@jp0317 jp0317 deleted the dict-record-reader branch January 10, 2024 01:16
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
… data via ByteArrayDictionaryRecordReader (apache#39153)

### Rationale for this change
This proposes an API to read only the dictionary  from ByteArrayDictionaryRecordReader, enabling possible uses cases where the caller just want to check the dictionary content.

### What changes are included in this PR?

New APIs to enable reading dictionary with RecordReader.

### Are these changes tested?
Unit tests.

### Are there any user-facing changes?

New APIs without breaking existing workflow.

Authored-by: jp0317 <zjpzlz@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
… data via ByteArrayDictionaryRecordReader (apache#39153)
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
… data via ByteArrayDictionaryRecordReader (apache#39153)