Improve speed of writing string dictionaries to parquet by skipping a copy(#1764) #2322
use crate::{data_type::*, file::writer::SerializedFileWriter};
use levels::{calculate_array_levels, LevelInfo};

mod byte_array;
mod levels;

/// An object-safe API for writing an [`ArrayRef`]
trait ArrayWriter {
I ended up implementing type erasure within the ByteArrayWriter, and so this indirection can be removed
Looks good to me -- I am not super familiar with all the structs in this area of the code, but this looks like a beautiful way to use ArrayIter and connect up the existing pieces.
pub struct TypedDictionaryArray<'a, K: ArrowPrimitiveType, V> {
    /// The dictionary array
    dictionary: &'a DictionaryArray<K>,
    /// The values of the dictionary
    values: &'a V,
}

// Manually implement `Clone` to avoid `V: Clone` type constraint
It is strange that having a reference to &V would require V: Clone in order to #[derive(Clone)] 🤷
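The behaviour discussed here can be reproduced in a small standalone sketch. `#[derive(Clone)]` adds a `Clone` bound on every type parameter regardless of how the fields use it, so a hand-written impl is one way to drop the bound when the struct only holds `&V`. The names `Typed` and `NotClone` below are hypothetical stand-ins, not the arrow-rs types.

```rust
/// Hypothetical value type that deliberately does not implement `Clone`.
struct NotClone(u32);

/// Hypothetical stand-in for a struct that only borrows `V`.
struct Typed<'a, V> {
    values: &'a V,
}

// Manual impl: copying a shared reference never needs `V: Clone`.
// `#[derive(Clone)]` would instead generate
// `impl<'a, V: Clone> Clone for Typed<'a, V>`.
impl<'a, V> Clone for Typed<'a, V> {
    fn clone(&self) -> Self {
        Self { values: self.values }
    }
}

fn main() {
    let v = NotClone(7);
    let a = Typed { values: &v };
    let b = a.clone();
    // Both copies point at the same borrowed value.
    assert!(std::ptr::eq(a.values, b.values));
    assert_eq!(b.values.0, 7);
}
```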
That RFC doesn't sound quite right (or at least it is overkill, like a 🔨 for swatting a 🪰). All that is needed in this case is to recognize that the struct only uses &V rather than V, rather than a way to provide generic arguments to macros 😱
Also, I am firmly of the belief that adding more generics is not the answer to most of life's problems 🤣 Maybe because my feeble mind can't handle the extra level of indirection
@@ -143,6 +143,17 @@ pub fn create_random_array(
                })
                .collect::<Result<Vec<(&str, ArrayRef)>>>()?,
        )?),
        d @ Dictionary(_, value_type)
            if crate::compute::can_cast_types(value_type, d) =>
Using cast is a neat trick here 👍
($array:ident, $key:ident, $val:ident, $op:expr $(, $arg:expr)*) => {{
    $op($array
        .as_any()
        .downcast_ref::<DictionaryArray<arrow::datatypes::$key>>()
When you say "could be made faster", I take the issue to be that this effectively creates something that iterates over strings: to encode the column, the arrow dictionary index is used to find a string, which is then used to find the parquet dictionary index, which is then written.
It could potentially be faster if we skipped the string step in the middle and simply computed an arrow dictionary index --> parquet dictionary index mapping up front and applied that mapping during writing.
(I think you said this in this PR's description, but I am restating it to confirm I understand what is happening.)
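To make the two paths concrete, here is a rough std-only sketch of the idea, not the parquet writer's actual code. `Interner`, `encode_via_strings`, and `encode_via_remap` are hypothetical names, and plain slices stand in for the arrow dictionary keys and values.

```rust
use std::collections::HashMap;

/// Toy stand-in for the parquet-side dictionary (hypothetical, not the real encoder).
#[derive(Default)]
struct Interner {
    lookup: HashMap<String, u32>,
    values: Vec<String>,
}

impl Interner {
    /// Returns the index of `value`, inserting it if it has not been seen.
    fn intern(&mut self, value: &str) -> u32 {
        if let Some(&idx) = self.lookup.get(value) {
            return idx;
        }
        let idx = self.values.len() as u32;
        self.values.push(value.to_string());
        self.lookup.insert(value.to_string(), idx);
        idx
    }
}

/// Per-row path: each row resolves its key to a string, then hashes that string.
fn encode_via_strings(keys: &[usize], dict: &[&str], interner: &mut Interner) -> Vec<u32> {
    keys.iter().map(|&k| interner.intern(dict[k])).collect()
}

/// Remapped path: intern each dictionary value once, then rows are a table lookup.
fn encode_via_remap(keys: &[usize], dict: &[&str], interner: &mut Interner) -> Vec<u32> {
    let remap: Vec<u32> = dict.iter().map(|v| interner.intern(v)).collect();
    keys.iter().map(|&k| remap[k]).collect()
}

/// Resolves encoded indices back to strings for comparison.
fn decode(interner: &Interner, indices: &[u32]) -> Vec<String> {
    indices.iter().map(|&i| interner.values[i as usize].clone()).collect()
}

fn main() {
    let dict = ["a", "b", "c"];
    let keys: [usize; 4] = [2, 0, 2, 1];

    // Pretend an earlier batch already interned "b", so the mapping is not the identity.
    let mut x = Interner::default();
    x.intern("b");
    let via_strings = encode_via_strings(&keys, &dict, &mut x);

    let mut y = Interner::default();
    y.intern("b");
    let via_remap = encode_via_remap(&keys, &dict, &mut y);

    // The raw indices may differ, but both decode to the same column.
    assert_eq!(decode(&x, &via_strings), decode(&y, &via_remap));
}
```

One trade-off visible in this sketch: building the mapping up front interns every dictionary value, including ones no row refers to, and can assign indices in a different order than the per-row path. Memoizing lazily, as each key is first encountered, would avoid both at the cost of a branch per row.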
Benchmark runs are scheduled for baseline = 6859efa and contender = b8fd432. b8fd432 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Draft as builds on #2136
Which issue does this PR close?
Closes #1764
Rationale for this change
What changes are included in this PR?
This alters the parquet writer to not hydrate dictionaries when writing. There is still a potential optimisation here to memoize the dictionary keys as they are converted, instead of interning the same dictionary key repeatedly, but I need to have a think about how to expose this from ArrayAccessor.
Are there any user-facing changes?
No