
PARQUET-2411: [C++][Parquet] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader #39153

Merged
merged 3 commits into from
Jan 5, 2024

Conversation

@jp0317 (Contributor) commented Dec 9, 2023

Rationale for this change

This proposes an API to read only the dictionary from ByteArrayDictionaryRecordReader, enabling use cases where the caller just wants to inspect the dictionary content.

What changes are included in this PR?

New APIs to enable reading dictionary with RecordReader.

Are these changes tested?

Unit tests.

Are there any user-facing changes?

New APIs without breaking existing workflow.

@jp0317 jp0317 requested a review from wgtmac as a code owner December 9, 2023 00:50
github-actions bot commented Dec 9, 2023

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

return nullptr;
}
// Verify the current data page is dictionary encoded.
if (this->current_encoding_ != Encoding::RLE_DICTIONARY) {
Contributor

doesn't this need to be plain dictionary also?

Contributor

or actually, I guess PLAIN?

Contributor Author

The current_encoding_ is set to RLE_DICTIONARY when the page encoding is PLAIN_DICTIONARY, so this check covers both.

Member

Perhaps adding a comment to clear the confusion?

Contributor Author

added some comments, thanks!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Dec 9, 2023
/// \brief Returns the dictionary owned by the current decoder. Throws an
/// exception if the current decoder is not for dictionary encoding.
/// \param[out] dictionary_length The number of dictionary entries.
virtual const uint8_t* ReadDictionary(int32_t* dictionary_length) = 0;
Contributor

Why uint8_t? Should this be int32_t? Or could you provide more details on how to interpret this?

Contributor Author

I just made it consistent with values(), which uses uint8_t*. The caller should handle the conversion properly based on the column type.

Member

It seems the comment on values() says FLBA and ByteArray types do not use this array. Should we change this into a template function, or simply return const void*, so that downstream code is required to be aware of the type, avoiding misuse of the uint8_t*?

Contributor Author

changed to void*, thanks!

<< EncodingToString(this->current_encoding_);
throw ParquetException(ss.str());
}
auto decoder = dynamic_cast<DictDecoder<DType>*>(this->current_decoder_);
Member

nit: use arrow::internal::checked_cast

Contributor Author

done, thanks!

Contributor Author

didn't notice that there's a virtual base. changed back to dynamic_cast.

@@ -61,6 +61,34 @@ static constexpr uint32_t kFooterSize = 8;
// For PARQUET-816
static constexpr int64_t kMaxDictHeaderSize = 100;

bool IsColumnChunkFullyDictionaryEncoded(const ColumnChunkMetaData& col) {
Member

Move it into anonymous namespace?

Contributor Author

done, thanks!

// fully dictionary encoded byte arrays. The caller can verify if the reader can read
// and expose the dictionary by checking the reader's read_dictionary(). If a column
// chunk uses dictionary encoding but then falls back to plain encoding, the returned
// reader will read decoded data without exposing the dictionary.
Member

Should we throw if the column chunk falls back halfway, to avoid misuse?

Contributor Author (@jp0317) Dec 10, 2023

If it falls back, read_dictionary() will return false. I reworded the comment to state that the caller should verify the reader using read_dictionary().

@wgtmac wgtmac changed the title PARQUET-2411: [C++] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader PARQUET-2411: [C++][Parquet] Allow reading dictionary without reading data via ByteArrayDictionaryRecordReader Dec 9, 2023
@Hattonuri (Contributor)

Am I right that this can be used for faster checking of value presence for Expression == ? (like in bloom filters)

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 10, 2023
return RecordReader(
i,
/*read_dictionary=*/encoding_to_expose == ExposedEncoding::DICTIONARY &&
IsColumnChunkFullyDictionaryEncoded(*metadata()->ColumnChunk(i)));
Member

Nit (not about correctness): metadata()->ColumnChunk(i) would be heavy because it creates a mirror object for the ColumnChunk.

Contributor Author

good point, thanks! It's probably fine as this is on reader creation, rather than a perf critical path.

Contributor

I guess it depends on how expensive the initialization is?

Member

I guess it depends on how expensive the initialization is?

The code in the current metadata implementation is a bit tricky, but it does not copy the underlying data. I think it's OK to do this as long as we don't call it per row group.

@mapleFU (Member) left a comment

Rest looks ok to me

Also, this isn't directly related to ByteArray, is it?

@@ -1369,6 +1369,26 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
return bytes_for_values;
}

const void* ReadDictionary(int32_t* dictionary_length) override {
if (this->current_decoder_ == nullptr && !this->HasNextInternal()) {
Member

Would this trigger initialization? (Called before initialization, since current_decoder_ == nullptr)

Contributor Author

Yes, it would try to initialize it in HasNextInternal.

@mapleFU (Member) commented Dec 12, 2023

Am I right that this can be used for faster checking of value presence for Expression == ? (like in bloom filters)

Hmmm, we haven't implemented a dictionary filter yet, but maybe we can try it later.

@emkornfield (Contributor)

Am I right that this can be used for faster checking of value presence for Expression == ? (like in bloom filters)

This is one potential use. In general, it can be used to reduce the cost of predicate evaluation by precomputing the predicate on the dictionary, so that each per-value evaluation becomes a simple lookup. For some predicates (like equality) that can rule out all data being present, this can have a large impact.

@@ -368,6 +368,11 @@ class PARQUET_EXPORT RecordReader {

virtual void DebugPrintState() = 0;

/// \brief Returns the dictionary owned by the current decoder. Throws an
/// exception if the current decoder is not for dictionary encoding.
Contributor

Maybe add a comment that in general the only safe time to call this is before any values have been read?

Contributor Author

I think it's also fine to read the dictionary after reading some values, as long as the decoder is still valid.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 12, 2023
/// \brief Returns the dictionary owned by the current decoder. Throws an
/// exception if the current decoder is not for dictionary encoding.
/// \param[out] dictionary_length The number of dictionary entries.
virtual const void* ReadDictionary(int32_t* dictionary_length) = 0;
Contributor

I think it should be possible to make this templated (with implementations in the .cc file)? If so, I think that might make sense for consumers. Either way, we should clarify how users can understand what to cast the dictionary to.

Contributor Author

Added some explanations and examples in the function comments. PTAL

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Dec 12, 2023
@mapleFU (Member) commented Dec 13, 2023

A remaining question: what is this expected to do with non-ByteArray types, like int32?

@jp0317 (Contributor Author) commented Dec 13, 2023

A remaining question: what is this expected to do with non-ByteArray types, like int32?

Oh sorry, I missed the last comment. The new GetDictionary itself would work as long as the internal decoder is a dictionary decoder, regardless of type. The issue is that the current record reader only reads the dictionary for byte arrays (i.e., values() always returns decoded values and read_dictionary() returns false for non-byte-array types, which is why I target byte arrays in this PR). I think it probably makes sense to extend read_dictionary to other types in a separate PR.

@mapleFU (Member) left a comment

Rest looks good to me

// PLAIN_DICTIONARY.
if (this->current_encoding_ != Encoding::RLE_DICTIONARY) {
std::stringstream ss;
ss << "Data page is not dictionary encoded. Encoding: "
Member

nit: should we also add descr_->ToString() to help debugging?

Contributor Author

Maybe it's better to leave this as a user choice (they can still print the descr using the existing API)? I don't have a strong preference, but let me know if you prefer always printing it.

/// exception if the current decoder is not for dictionary encoding. The caller is
/// responsible for casting the returned pointer to proper type depending on the
/// column's physical type. An example:
/// ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
Member

Suggested change
/// ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
/// const ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));

/// column's physical type. An example:
/// ByteArray* dict = reinterpret_cast<const ByteArray*>(ReadDictionary(&len));
/// or:
/// float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));
Member

Suggested change
/// float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));
/// const float* dict = reinterpret_cast<const float*>(ReadDictionary(&len));

Contributor Author

done, thanks for catching this!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Dec 19, 2023
@mapleFU (Member) commented Jan 3, 2024

@emkornfield @jp0317 Would you like to merge this?

@emkornfield (Contributor)

Yes, I think we can merge it if there are no objections.

@mapleFU mapleFU merged commit bec0385 into apache:main Jan 5, 2024
33 of 34 checks passed
@mapleFU mapleFU removed the awaiting merge Awaiting merge label Jan 5, 2024

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit bec0385.

There were 2 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

@jp0317 jp0317 deleted the dict-record-reader branch January 10, 2024 01:16
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
… data via ByteArrayDictionaryRecordReader (apache#39153)

### Rationale for this change
This proposes an API to read only the dictionary  from ByteArrayDictionaryRecordReader, enabling possible uses cases where the caller just want to check the dictionary content.

### What changes are included in this PR?

New APIs to enable reading dictionary with RecordReader.

### Are these changes tested?
Unit tests.

### Are there any user-facing changes?

New APIs without breaking existing workflow.

Authored-by: jp0317 <zjpzlz@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
… data via ByteArrayDictionaryRecordReader (apache#39153)
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
… data via ByteArrayDictionaryRecordReader (apache#39153)