Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial support for parquet input/output format. #1510

Merged
merged 2 commits into from
Mar 11, 2024
Merged

Initial support for parquet input/output format. #1510

merged 2 commits into from
Mar 11, 2024

Conversation

gz
Copy link
Collaborator

@gz gz commented Mar 7, 2024

Adds pretty basic parquet/arrow support.

  • The input/deserialization (for now) is done just by converting to JSON (this is pretty inefficient but doesn't require us to change the API in adapters as we can provide an [u8] buffer)
  • The output/serialization is done using serde-arrow and some modifications in the APIs to make it work

Long term think about a better API that works well for column oriented data where parsing/conversion happens in external libraries.

We'll probably need to extend this further to customize the serialization/deserialization.

@mihaibudiu review for the compiler changes, @ryzhyk for the rest

Is this a user-visible change (yes/no): yes

@gz gz requested review from ryzhyk and mihaibudiu March 7, 2024 23:54
crates/adapters/src/format/parquet/mod.rs Outdated Show resolved Hide resolved
crates/adapters/src/format/parquet/mod.rs Outdated Show resolved Hide resolved
crates/adapters/src/format/parquet/mod.rs Show resolved Hide resolved
crates/adapters/src/format/parquet/mod.rs Show resolved Hide resolved
crates/adapters/src/format/parquet/mod.rs Outdated Show resolved Hide resolved
crates/adapters/src/format/parquet/mod.rs Outdated Show resolved Hide resolved
crates/adapters/src/format/parquet/mod.rs Outdated Show resolved Hide resolved
crates/adapters/src/format/parquet/mod.rs Show resolved Hide resolved
@@ -225,6 +229,9 @@ pub trait SerCursor {
/// Serialize current key. Panics if invalid.
fn serialize_key(&mut self, dst: &mut Vec<u8>) -> AnyResult<()>;

/// Serialize current key. Panics if invalid.
fn serialize_arrow_key(&mut self, dst: &mut ArrowBuilder) -> AnyResult<()>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this serialize_key_to_arrow. The comment may need to be adjusted too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more abstract Trait we can use instead of ArrowBuilder?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more abstract Trait we can use instead of ArrowBuilder?

This is bit of a desperation move we decided to do here since the given APIs &mut [u8] and Vec don't really work that well for arbitrary formats... We'll have to rework this API a bit but I'm not really familiar enough to say how it should look like.

crates/adapters/src/format/parquet/mod.rs Show resolved Hide resolved
}
let mut w = cursor.weight();
if !(-MAX_DUPLICATES..=MAX_DUPLICATES).contains(&w) {
bail!("Unable to output record with very large weight {w}. Consider adjusting your SQL queries to avoid duplicate output records, e.g., using 'SELECT DISTINCT'.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we really have to do this?
Just write all of them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken from current JSON impl cc @ryzhyk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some users out there may not like it if we output a trillion identical records. We can make MAX_DUPLICATES very large, but I think we need to bound it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, it is very unlikely that someone will write such a query.
Second, if we crash it's our fault, if the output is very large, it's their fault.
Third, this is not the right way to report this problem, we really have to work on an error stream that survives the circuit crash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't crash, just report an error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is even worse, because the results will be wrong and no one may be looking at that error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative is a DoS attack that we help the user launch on themselves.

}
if w < 0 {
// TODO: we don't support deletes in the parquet format yet.
continue;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is continue the best solution?
Here I would bail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely problematic, but there are cases where ignoring deletes is actually ok, e.g., when the set of keys is stable.

crates/adapters/src/format/parquet/mod.rs Show resolved Hide resolved
@gz
Copy link
Collaborator Author

gz commented Mar 9, 2024

I added support for Date32 and Time64 in serde-arrow and forked the crate meanwhile chmp/serde_arrow#147

Copy link
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a user-visible change to me. You even added docs!

docs/api/parquet.md Show resolved Hide resolved
crates/pipeline-types/src/format/parquet.rs Show resolved Hide resolved
crates/pipeline-types/src/format/parquet.rs Show resolved Hide resolved
@gz gz force-pushed the parquet branch 3 times, most recently from 8f86358 to fde17b4 Compare March 11, 2024 02:57
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz gz merged commit 60169ef into main Mar 11, 2024
5 checks passed
@gz gz deleted the parquet branch March 11, 2024 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants