Improve speed of writing string dictionaries to parquet by skipping a copy (#1764) #2322

Merged

1 commit merged into apache:master on Aug 5, 2022

Conversation

@tustvold (Contributor) commented Aug 4, 2022

Draft, as this builds on #2136.

Which issue does this PR close?

Closes #1764

Rationale for this change

write_batch primitive/4096 values string dictionary                                                                            
                        time:   [281.80 us 281.91 us 282.03 us]
                        thrpt:  [169.67 MiB/s 169.74 MiB/s 169.81 MiB/s]
                 change:
                        time:   [-11.583% -11.483% -11.395%] (p = 0.00 < 0.05)
                        thrpt:  [+12.861% +12.973% +13.101%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

What changes are included in this PR?

This alters the parquet writer to not hydrate dictionaries when writing. There is still a potential optimisation here to memoize the dictionary keys as they are converted, instead of interning the same dictionary key repeatedly, but I need to have a think about how to expose this from ArrayAccessor.
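To illustrate what "hydrating" a dictionary means, here is a minimal sketch in plain Rust (the names and shapes are simplified stand-ins, not the actual parquet writer code): a dictionary array stores integer keys into a small set of distinct values, and the old path materialised an owned string per row before writing, while the new path iterates the borrowed values directly.

```rust
// Illustrative sketch only: `keys` index into the distinct `values` set.
fn hydrate(keys: &[usize], values: &[&str]) -> Vec<String> {
    // Old path: materialise one owned string per row (the extra copy).
    keys.iter().map(|&k| values[k].to_string()).collect()
}

fn iter_without_hydration<'a>(
    keys: &'a [usize],
    values: &'a [&'a str],
) -> impl Iterator<Item = &'a str> + 'a {
    // New path: yield borrowed values per row, with no per-row allocation.
    keys.iter().map(move |&k| values[k])
}

fn main() {
    let keys = [0usize, 1, 0];
    let values = ["hello", "world"];
    let hydrated = hydrate(&keys, &values);
    let borrowed: Vec<&str> = iter_without_hydration(&keys, &values).collect();
    assert_eq!(hydrated, vec!["hello", "world", "hello"]);
    assert_eq!(borrowed, vec!["hello", "world", "hello"]);
    println!("both paths agree");
}
```

Both paths produce the same row values; the difference is purely the per-row `String` allocation that the borrowed iterator avoids.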

Are there any user-facing changes?

No

@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Aug 4, 2022
@tustvold tustvold marked this pull request as ready for review August 5, 2022 10:31
@alamb changed the title from "Don't hydrate string dictionaries when writing to parquet (#1764)" to "Improve speed of writing string dictionaries to parquet by skipping a copy (#1764)" Aug 5, 2022
use crate::{data_type::*, file::writer::SerializedFileWriter};
use levels::{calculate_array_levels, LevelInfo};

mod byte_array;
mod levels;

/// An object-safe API for writing an [`ArrayRef`]
trait ArrayWriter {
@tustvold (author) commented:
I ended up implementing type erasure within the ByteArrayWriter, and so this indirection can be removed
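A minimal sketch of the type-erasure idea mentioned in that comment — all names here are illustrative, not the crate's actual types: a concrete writer is hidden behind an object-safe trait, so callers hold a `Box<dyn ...>` with no generic parameters and the extra indirection layer becomes unnecessary.

```rust
// Object-safe interface: no generic methods, so it can be a trait object.
trait ErasedColumnWriter {
    fn write(&mut self, values: &[&str]);
    fn bytes_written(&self) -> usize;
}

// A concrete writer; the trait object erases its type from callers.
// (Hypothetical name, for illustration only.)
struct ByteArrayWriterSketch {
    bytes: usize,
}

impl ErasedColumnWriter for ByteArrayWriterSketch {
    fn write(&mut self, values: &[&str]) {
        self.bytes += values.iter().map(|v| v.len()).sum::<usize>();
    }
    fn bytes_written(&self) -> usize {
        self.bytes
    }
}

fn main() {
    // Object safety lets the concrete type live behind `Box<dyn ...>`.
    let mut w: Box<dyn ErasedColumnWriter> = Box::new(ByteArrayWriterSketch { bytes: 0 });
    w.write(&["foo", "quux"]);
    assert_eq!(w.bytes_written(), 7);
    println!("wrote {} bytes", w.bytes_written());
}
```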

@alamb (Contributor) reviewed:

Looks good to me -- I am not super familiar with all the structs in this area of the code, but this looks like a beautiful way to use ArrayIter and connect up the existing pieces.

pub struct TypedDictionaryArray<'a, K: ArrowPrimitiveType, V> {
/// The dictionary array
dictionary: &'a DictionaryArray<K>,
/// The values of the dictionary
values: &'a V,
}

// Manually implement `Clone` to avoid `V: Clone` type constraint
A contributor commented:

it is strange that having a reference to &V would require V: Clone in order to #[derive(Clone)] 🤷
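A minimal sketch of why the manual impl is needed, using a struct shaped like `TypedDictionaryArray`: `#[derive(Clone)]` conservatively adds a `V: Clone` bound to the generated impl even though the struct only stores `&V`, so implementing `Clone` by hand drops that bound.

```rust
// Simplified stand-in for a struct that only borrows its generic parameter.
struct Typed<'a, V> {
    values: &'a V,
}

// Cloning just copies the reference, so no `V: Clone` bound is required.
// `#[derive(Clone)]` would have demanded one anyway.
impl<'a, V> Clone for Typed<'a, V> {
    fn clone(&self) -> Self {
        Typed { values: self.values }
    }
}

struct NotClone; // deliberately does not implement Clone

fn main() {
    let v = NotClone;
    let t = Typed { values: &v };
    let t2 = t.clone(); // compiles even though `NotClone` is not `Clone`
    assert!(std::ptr::eq(t.values, t2.values)); // both point at the same value
    println!("cloned a borrow of a non-Clone type");
}
```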

A contributor replied:
That RFC doesn't sound quite right (or at least it is overkill -- like a 🔨 for swatting a 🪰). All that is needed in this case is to recognize that the struct only uses &V rather than V, rather than a way to provide generic arguments to macros 😱

Also, I am firmly of the belief that adding more generics is not the answer to most of life's problems 🤣 Maybe because my feeble mind can't handle the extra level of indirection.

@@ -143,6 +143,17 @@ pub fn create_random_array(
})
.collect::<Result<Vec<(&str, ArrayRef)>>>()?,
)?),
d @ Dictionary(_, value_type)
if crate::compute::can_cast_types(value_type, d) =>
A contributor commented:
using cast is a neat trick here 👍

($array:ident, $key:ident, $val:ident, $op:expr $(, $arg:expr)*) => {{
$op($array
.as_any()
.downcast_ref::<DictionaryArray<arrow::datatypes::$key>>()
A contributor commented:
When you say "could be made faster": the issue is that this effectively creates something that iterates over strings, which means that to encode the column, the arrow dictionary index is used to find a string, which is then used to find the parquet dictionary index, which is then written.

It could potentially be faster if we skipped the string step in the middle and simply computed an arrow dictionary index --> parquet dictionary index mapping up front and applied that mapping during writing

(I think you said this in this PR's description, but I am restating it to confirm I understand what is happening)
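The optimisation described above can be sketched in plain Rust (a hypothetical illustration, not the PR's code — `Interner`, `intern`, and `encode` are invented names): build the arrow-key → parquet-dictionary-index mapping once per batch, so the string lookup and interning happen once per distinct dictionary entry rather than once per row.

```rust
use std::collections::HashMap;

/// Toy stand-in for the parquet dictionary encoder's string interner.
struct Interner {
    indices: HashMap<String, u64>,
}

impl Interner {
    fn new() -> Self {
        Interner { indices: HashMap::new() }
    }

    /// Returns the parquet dictionary index for `value`, interning it on first use.
    fn intern(&mut self, value: &str) -> u64 {
        let next = self.indices.len() as u64;
        *self.indices.entry(value.to_string()).or_insert(next)
    }
}

/// Encode dictionary-encoded rows: `keys[i]` indexes into `values`.
/// The mapping is computed up front, so interning runs once per distinct
/// value instead of once per row.
fn encode(keys: &[usize], values: &[&str], interner: &mut Interner) -> Vec<u64> {
    // arrow dictionary key -> parquet dictionary index, computed once
    let mapping: Vec<u64> = values.iter().map(|v| interner.intern(v)).collect();
    keys.iter().map(|&k| mapping[k]).collect()
}

fn main() {
    let mut interner = Interner::new();
    let out = encode(&[0, 1, 0, 2, 1], &["a", "b", "c"], &mut interner);
    println!("{:?}", out); // [0, 1, 0, 2, 1]
}
```

The hot per-row loop then touches only integer indices; no string is hashed or compared after the mapping is built.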

@tustvold tustvold merged commit b8fd432 into apache:master Aug 5, 2022
24 checks passed
@ursabot commented Aug 5, 2022

Benchmark runs are scheduled for baseline = 6859efa and contender = b8fd432. b8fd432 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels: arrow (Changes to the arrow crate), parquet (Changes to the parquet crate), performance

Successfully merging this pull request may close these issues:

Optimized Writing of Arrow Byte Array to Parquet (#1764)