PARQUET-2411: [C++][Parquet] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader #39153
Force-pushed from 64de2a6 to 2a04a2e.
  return nullptr;
}
// Verify the current data page is dictionary encoded.
if (this->current_encoding_ != Encoding::RLE_DICTIONARY) {
Doesn't this need to accept PLAIN_DICTIONARY as well?
or actually, I guess PLAIN?
current_encoding_ is set to RLE_DICTIONARY when the page encoding is PLAIN_DICTIONARY.
Perhaps add a comment to clear up the confusion?
added some comments, thanks!
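The point made in this thread, that a legacy PLAIN_DICTIONARY data page is treated as RLE_DICTIONARY before the check runs, can be illustrated with a minimal sketch. The `Encoding` enum and both helper functions below are hypothetical stand-ins written for this example, not the actual Parquet C++ declarations.

```cpp
#include <cassert>

// Hypothetical stand-in for the Parquet encoding enum; only the values
// needed for this illustration are listed.
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE, RLE_DICTIONARY };

// Assumed normalization step: a PLAIN_DICTIONARY data page is recorded as
// RLE_DICTIONARY, so later checks only need to compare against one value.
constexpr Encoding NormalizeDataPageEncoding(Encoding page_encoding) {
  return page_encoding == Encoding::PLAIN_DICTIONARY ? Encoding::RLE_DICTIONARY
                                                     : page_encoding;
}

// Mirrors the shape of the check under review: after normalization, a single
// comparison against RLE_DICTIONARY covers both dictionary encodings.
bool IsDictionaryDataPage(Encoding page_encoding) {
  const Encoding current = NormalizeDataPageEncoding(page_encoding);
  // Invariant: PLAIN_DICTIONARY can no longer appear once normalized.
  assert(current != Encoding::PLAIN_DICTIONARY);
  return current == Encoding::RLE_DICTIONARY;
}
```

With this shape, a page written as PLAIN_DICTIONARY passes the check while a PLAIN page does not, which is why the single comparison in the diff is sufficient.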
cpp/src/parquet/column_reader.h
Outdated
/// \brief Returns the dictionary owned by the current decoder. Throws an
/// exception if the current decoder is not for dictionary encoding.
/// \param[out] dictionary_length The number of dictionary entries.
virtual const uint8_t* ReadDictionary(int32_t* dictionary_length) = 0;
Why uint8_t? Should this be int32? Or could you provide more details on how to interpret it?
I just made it consistent with values(), which uses uint8_t*. The caller should handle the conversion properly based on the column type.
It seems the comment on values() says FLBA and ByteArray types do not use this array. Should we change this into a template function, or simply return const void* so that downstream callers are required to be aware of the type and avoid misusing the uint8_t*?
changed to void*, thanks!
     << EncodingToString(this->current_encoding_);
  throw ParquetException(ss.str());
}
auto decoder = dynamic_cast<DictDecoder<DType>*>(this->current_decoder_);
nit: use arrow::internal::checked_cast
done, thanks!
I didn't notice that there's a virtual base, so I changed it back to dynamic_cast.
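The reason a virtual base rules out a static downcast (which is what a checked cast falls back to in release builds) can be shown with a small sketch. The three class names below are simplified, hypothetical stand-ins for the decoder hierarchy, not the real Parquet classes.

```cpp
#include <cassert>

// Hypothetical, simplified hierarchy: both concrete decoders inherit
// virtually from the common base.
struct Decoder {
  virtual ~Decoder() = default;
};

struct DictDecoder : virtual Decoder {
  int dictionary_length() const { return 3; }
};

struct PlainDecoder : virtual Decoder {};

// static_cast<DictDecoder*>(decoder) would be rejected by the compiler here,
// because a pointer to a virtual base cannot be statically downcast.
// dynamic_cast is allowed and yields nullptr when the dynamic type differs.
int DictionaryLengthOrMinusOne(Decoder* decoder) {
  assert(decoder != nullptr);
  auto* dict = dynamic_cast<DictDecoder*>(decoder);
  return dict != nullptr ? dict->dictionary_length() : -1;
}
```

Passing a `PlainDecoder` returns -1 instead of invoking undefined behavior, which is the safety property the dynamic_cast provides here.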
@@ -61,6 +61,34 @@ static constexpr uint32_t kFooterSize = 8;
// For PARQUET-816
static constexpr int64_t kMaxDictHeaderSize = 100;

bool IsColumnChunkFullyDictionaryEncoded(const ColumnChunkMetaData& col) {
Move it into an anonymous namespace?
done, thanks!
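The body of the helper is not shown in this thread. As a rough sketch of what a "fully dictionary encoded" check could look like, assuming per-page encoding statistics are available, the following uses hypothetical stand-in types (`PageType`, `Encoding`, `PageEncodingStats`) and a hypothetical function name; it is one plausible approach, not necessarily what the PR implements. It is placed in an anonymous namespace, as suggested above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the metadata types; not the real declarations.
enum class PageType { DICTIONARY_PAGE, DATA_PAGE, DATA_PAGE_V2 };
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY };

struct PageEncodingStats {
  PageType page_type;
  Encoding encoding;
};

namespace {

// Returns true only if the stats describe one dictionary page followed
// exclusively by dictionary-encoded data pages.
bool IsFullyDictionaryEncoded(const std::vector<PageEncodingStats>& stats) {
  // Without stats we cannot prove anything, so answer conservatively.
  if (stats.empty()) return false;

  // The first entry is assumed to describe the dictionary page itself.
  const PageEncodingStats& first = stats.front();
  if (first.page_type != PageType::DICTIONARY_PAGE ||
      (first.encoding != Encoding::PLAIN &&
       first.encoding != Encoding::PLAIN_DICTIONARY)) {
    return false;
  }

  // Every remaining entry must be a dictionary-encoded data page; a single
  // PLAIN data page means the writer fell back partway through the chunk.
  for (std::size_t i = 1; i < stats.size(); ++i) {
    const bool is_data_page = stats[i].page_type == PageType::DATA_PAGE ||
                              stats[i].page_type == PageType::DATA_PAGE_V2;
    const bool is_dict_encoded = stats[i].encoding == Encoding::RLE_DICTIONARY ||
                                 stats[i].encoding == Encoding::PLAIN_DICTIONARY;
    if (!is_data_page || !is_dict_encoded) return false;
  }
  return true;
}

}  // namespace
```

Returning false on empty stats is a deliberate choice in this sketch: exposing a dictionary that does not cover all pages would be worse than falling back to decoded values.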
// fully dictionary encoded byte arrays. The caller can verify if the reader can read
// and expose the dictionary by checking the reader's read_dictionary(). If a column
// chunk uses dictionary encoding but then falls back to plain encoding, the returned
// reader will read decoded data without exposing the dictionary.
Should we throw if the column chunk falls back halfway, to avoid misuse?
If it falls back, read_dictionary() will return false. I reworded the comment to state that the caller should verify the reader using read_dictionary().
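The calling pattern described in this reply, check read_dictionary() first and only then ask for the dictionary, might look like the sketch below. `MockRecordReader` is a hypothetical mock that imitates only the two members discussed here; it is not the real RecordReader, and `DictionarySizeOrMinusOne` is an illustrative helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical mock exposing only the two members discussed in this thread.
class MockRecordReader {
 public:
  MockRecordReader(bool read_dictionary, std::vector<int32_t> dictionary)
      : read_dictionary_(read_dictionary), dictionary_(std::move(dictionary)) {}

  // True only when the reader is able to expose the dictionary.
  bool read_dictionary() const { return read_dictionary_; }

  // Throws when the dictionary cannot be exposed, like the API under review.
  const void* ReadDictionary(int32_t* dictionary_length) {
    assert(dictionary_length != nullptr);
    if (!read_dictionary_) {
      throw std::runtime_error("Data page is not dictionary encoded.");
    }
    *dictionary_length = static_cast<int32_t>(dictionary_.size());
    return dictionary_.data();
  }

 private:
  bool read_dictionary_;
  std::vector<int32_t> dictionary_;
};

// Returns the number of dictionary entries, or -1 when the reader reports
// that it cannot expose a dictionary (for example after a fallback).
int32_t DictionarySizeOrMinusOne(MockRecordReader& reader) {
  if (!reader.read_dictionary()) return -1;
  int32_t length = 0;
  reader.ReadDictionary(&length);
  return length;
}
```

Checking the flag up front keeps the exception path for genuine misuse rather than for the expected fallback case.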
Am I right that this can be used to check for value presence faster, e.g. for an == expression (like with bloom filters)?
return RecordReader(
    i,
    /*read_dictionary=*/encoding_to_expose == ExposedEncoding::DICTIONARY &&
        IsColumnChunkFullyDictionaryEncoded(*metadata()->ColumnChunk(i)));
A nit (not about correctness): metadata()->ColumnChunk(i) could be heavy because it creates a mirror of the ColumnChunk.
good point, thanks! It's probably fine as this is on reader creation, rather than a perf critical path.
I guess it depends on how expensive the initialization is?
The code in the current metadata implementation is a bit tricky, but it does not copy the underlying data. I think it's OK to do this as long as we don't call it per row group.
Force-pushed from 2ad82fb to 72f2f8e.
Rest looks ok to me
Also, this isn't directly related to ByteArray?
@@ -1369,6 +1369,26 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
    return bytes_for_values;
  }

  const void* ReadDictionary(int32_t* dictionary_length) override {
    if (this->current_decoder_ == nullptr && !this->HasNextInternal()) {
Would this trigger initialization? (Called before initialization, since current_decoder_ == nullptr)
Yes, it would try to initialize it in HasNextInternal.
Hmm, we haven't implemented a dictionary filter, but maybe we can try it later.
This is one potential use. In general, it can be used to reduce the cost of evaluating predicates by precomputing them on the dictionary, so that each predicate becomes a simple lookup. For some predicates (like equality) that can rule out any matching data being present, this can have a large impact.
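The idea of precomputing a predicate on the dictionary can be sketched as follows. This is an illustration only: the dictionary is modeled as a vector of strings and the data page as a vector of dictionary indices, and all three function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Evaluates `value == target` once per dictionary entry instead of per row.
std::vector<bool> MatchDictionary(const std::vector<std::string>& dictionary,
                                  const std::string& target) {
  std::vector<bool> matches(dictionary.size());
  for (std::size_t i = 0; i < dictionary.size(); ++i) {
    matches[i] = (dictionary[i] == target);
  }
  return matches;
}

// If no dictionary entry matches, no row in a fully dictionary-encoded
// chunk can match, so the data pages need not be read at all.
bool AnyEntryMatches(const std::vector<bool>& matches) {
  return std::any_of(matches.begin(), matches.end(),
                     [](bool matched) { return matched; });
}

// Per-row evaluation becomes a table lookup by dictionary index.
int64_t CountMatchingRows(const std::vector<bool>& matches,
                          const std::vector<int32_t>& indices) {
  int64_t count = 0;
  for (int32_t index : indices) {
    assert(index >= 0 && static_cast<std::size_t>(index) < matches.size());
    if (matches[static_cast<std::size_t>(index)]) ++count;
  }
  return count;
}
```

The string comparison runs once per distinct value, so the benefit grows with the ratio of rows to dictionary entries.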
cpp/src/parquet/column_reader.h
Outdated
@@ -368,6 +368,11 @@ class PARQUET_EXPORT RecordReader {

  virtual void DebugPrintState() = 0;

  /// \brief Returns the dictionary owned by the current decoder. Throws an
  /// exception if the current decoder is not for dictionary encoding.
Maybe add a comment that in general the only safe time to call this is before any values have been read?
I think it's also fine to read the dictionary after some values have been read, as long as the decoder is still valid.
/// \brief Returns the dictionary owned by the current decoder. Throws an
/// exception if the current decoder is not for dictionary encoding.
/// \param[out] dictionary_length The number of dictionary entries.
virtual const void* ReadDictionary(int32_t* dictionary_length) = 0;
I think it should be possible to make this templated (with implementations in the .cc file)? If so, I think that might make sense for consumers. Either way, we should clarify how users can tell what to cast the dictionary to.
Added some explanations and examples in the function comments. PTAL
One remaining question: what would it be expected to do with non-ByteArray types, like int32?
Oh sorry, I missed the last comment. The new
Rest looks good to me
// PLAIN_DICTIONARY.
if (this->current_encoding_ != Encoding::RLE_DICTIONARY) {
  std::stringstream ss;
  ss << "Data page is not dictionary encoded. Encoding: "
nit: should we also add descr_->ToString() to help debugging?
maybe it's better to leave this as a user choice (they can still choose to print the descr using existing api)? I don't have a strong preference but let me know if you prefer always printing it.
cpp/src/parquet/column_reader.h
Outdated
/// exception if the current decoder is not for dictionary encoding. The caller is
/// responsible for casting the returned pointer to proper type depending on the
/// column's physical type. An example:
///   ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
Suggested change:
- ///   ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
+ ///   const ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
cpp/src/parquet/column_reader.h
Outdated
/// column's physical type. An example:
///   ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
/// or:
///   float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));
Suggested change:
- ///   float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));
+ ///   const float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));
done, thanks for catching this!
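The const-correct casts from the suggestions above can be exercised in a standalone sketch. The `ByteArray` struct below is a hypothetical stand-in that assumes a length-plus-pointer layout, and the two helper functions are illustrative names; the `raw` pointer plays the role of the value returned by ReadDictionary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in assuming a (length, pointer) layout for byte arrays.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// For a BYTE_ARRAY column: interpret the type-erased dictionary as
// const ByteArray entries and copy each one into an owning string.
std::vector<std::string> CopyByteArrayDictionary(const void* raw, int32_t length) {
  assert(raw != nullptr || length == 0);
  const ByteArray* dict = reinterpret_cast<const ByteArray*>(raw);
  std::vector<std::string> out;
  out.reserve(static_cast<std::size_t>(length));
  for (int32_t i = 0; i < length; ++i) {
    out.emplace_back(reinterpret_cast<const char*>(dict[i].ptr), dict[i].len);
  }
  return out;
}

// For a FLOAT column: the same pointer is interpreted as const float values.
double SumFloatDictionary(const void* raw, int32_t length) {
  assert(raw != nullptr || length == 0);
  const float* dict = reinterpret_cast<const float*>(raw);
  double sum = 0.0;
  for (int32_t i = 0; i < length; ++i) sum += dict[i];
  return sum;
}
```

Because the pointer is type-erased, nothing stops a caller from picking the wrong helper, so the choice has to be driven by the column's physical type.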
Force-pushed from 42c4d36 to 08c1ce6.
@emkornfield @jp0317 Would you like to merge this?
Yes, I think we can merge it if there are no objections.
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit bec0385. There were 2 benchmark results with an error:
There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
… data via ByteArrayDictionaryRecordReader (apache#39153)

### Rationale for this change
This proposes an API to read only the dictionary from ByteArrayDictionaryRecordReader, enabling possible use cases where the caller just wants to check the dictionary content.

### What changes are included in this PR?
New APIs to enable reading dictionary with RecordReader.

### Are these changes tested?
Unit tests.

### Are there any user-facing changes?
New APIs without breaking existing workflow.

Authored-by: jp0317 <zjpzlz@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
This proposes an API to read only the dictionary from ByteArrayDictionaryRecordReader, enabling possible use cases where the caller just wants to check the dictionary content.
What changes are included in this PR?
New APIs to enable reading dictionary with RecordReader.
Are these changes tested?
Unit tests.
Are there any user-facing changes?
New APIs without breaking existing workflow.