Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to FlightDataEncoder to always resend batch dictionaries #4896

Merged
merged 4 commits into from
Oct 12, 2023

Conversation

alexwilcoxson-rel
Copy link
Contributor

@alexwilcoxson-rel alexwilcoxson-rel commented Oct 5, 2023

Which issue does this PR close?

Closes #4895

Rationale for this change

see #4895

What changes are included in this PR?

Adds with_send_dictionaries to FlightDataEncoderBuilder to opt in to always sending dictionary messages with each record batch encoded.

Are there any user-facing changes?

with_send_dictionaries added to FlightDataEncoderBuilder and FlightDataEncoder. Doc strings updated.

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Oct 5, 2023
@alamb
Copy link
Contributor

alamb commented Oct 6, 2023

Thank you for this PR @alexwilcoxson-rel -- I hope to find time to review it properly in the next few days

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alexwilcoxson-rel -- I think this is a really nice improvement and it could be merged as is.

I was thinking to a future when we might have proper support for resending dictionaries and trying to minimize the API impact as well as adding a nice place for more documentation. What do you think about using an enum instead of a bool like this:

So instead of

pub struct FlightDataEncoder {
...
    send_dictionaries: bool,
...
}

Something like

pub struct FlightDataEncoder {
...
    dictionary_handling: DictionaryHandling,
...
}

/// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
pub enum DictionaryHandling {
  /// Expands to the underlying type (default). This likely sends more data over the network
  /// but requires less memory (dictionaries are not tracked) and is  more compatible
  /// with other arrow flight client implementations that may not support `DictionaryEncoding`
  /// see [`hydrate_dictionary`] for more details.
  Hydrate,
  /// Send dictionary FlightData with every RecordBatch that contains a [`DictionaryArray`].
  /// See [`Self::Hydrate`] for more tradeoffs. No attempt is made to skip sending the same (logical)
  /// dictionary values twice. 
  Resend, 
}

@@ -520,14 +542,92 @@ mod tests {
);
}

#[tokio::test]
async fn test_dictionary_hydration() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


let arr_one: Arc<DictionaryArray<UInt16Type>> =
Arc::new(vec!["a", "a", "b"].into_iter().collect());
let arr_two: Arc<DictionaryArray<UInt16Type>> =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

@alamb
Copy link
Contributor

alamb commented Oct 10, 2023

BTW it looks like some documentation also needs to be updated, like

/// # Caveats

I can handle doing so, but wanted to mention it in case @alexwilcoxson-rel was going to change this one more

@alexwilcoxson-rel
Copy link
Contributor Author

BTW it looks like some documentation also needs to be updated, like

/// # Caveats

I can handle doing so, but wanted to mention it in case @alexwilcoxson-rel was going to change this one more

Updated these docs

@alexwilcoxson-rel
Copy link
Contributor Author

Updated to use Enum per @alamb feedback

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alexwilcoxson-rel -- this looks great

@alamb
Copy link
Contributor

alamb commented Oct 11, 2023

The CI seems to be failing the doc checks. I'll try and fix that

@alamb
Copy link
Contributor

alamb commented Oct 11, 2023

Here is a proposed PR to your branch for improved documentation relativityone#1 🙏

Improve docs for sending flight dictionaries
@alexwilcoxson-rel
Copy link
Contributor Author

Here is a proposed PR to your branch for improved documentation relativityone#1 🙏

LGTM, merged! please rerun CI when you can.

@alamb alamb changed the title Add option to FlightDataEncoder to always resend batch dictionaries Add option to FlightDataEncoder to always resend batch dictionaries Oct 12, 2023
@alamb alamb merged commit d5a655d into apache:master Oct 12, 2023
13 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 12, 2023

Thanks again @alexwilcoxson-rel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add option to FlightDataEncoder to always send dictionaries
2 participants