Initial support for parquet input/output format. #1510
Conversation
crates/adapters/src/catalog.rs (outdated)

```diff
@@ -225,6 +229,9 @@ pub trait SerCursor {
     /// Serialize current key. Panics if invalid.
     fn serialize_key(&mut self, dst: &mut Vec<u8>) -> AnyResult<()>;

+    /// Serialize current key. Panics if invalid.
+    fn serialize_arrow_key(&mut self, dst: &mut ArrowBuilder) -> AnyResult<()>;
```
I would call this `serialize_key_to_arrow`. The doc comment may need to be adjusted too.
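For illustration, a minimal sketch of the suggested rename, with a corrected doc comment. `ArrowBuilder`, `AnyResult`, and the toy cursor below are stand-ins for the real types in `crates/adapters`, not the actual implementation:

```rust
// Stand-in for the real error alias used by the adapters crate.
type AnyResult<T> = Result<T, String>;

// Toy stand-in for the Arrow builder the real trait writes into.
#[derive(Default)]
struct ArrowBuilder {
    rows: Vec<String>,
}

trait SerCursor {
    /// Serialize the current key into the Arrow builder.
    /// Panics if the cursor does not point at a valid key.
    fn serialize_key_to_arrow(&mut self, dst: &mut ArrowBuilder) -> AnyResult<()>;
}

// Trivial cursor over an in-memory list of keys, for demonstration only.
struct VecCursor {
    keys: Vec<String>,
    pos: usize,
}

impl SerCursor for VecCursor {
    fn serialize_key_to_arrow(&mut self, dst: &mut ArrowBuilder) -> AnyResult<()> {
        dst.rows.push(self.keys[self.pos].clone());
        Ok(())
    }
}

fn main() {
    let mut cursor = VecCursor { keys: vec!["k1".into()], pos: 0 };
    let mut builder = ArrowBuilder::default();
    cursor.serialize_key_to_arrow(&mut builder).unwrap();
    assert_eq!(builder.rows, vec!["k1".to_string()]);
}
```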
Is there a more abstract Trait we can use instead of ArrowBuilder?
> Is there a more abstract Trait we can use instead of ArrowBuilder?
This is a bit of a desperation move we decided on here, since the given APIs (`&mut [u8]` and `Vec`) don't really work that well for arbitrary formats... We'll have to rework this API a bit, but I'm not familiar enough to say what it should look like.
```rust
        }
        let mut w = cursor.weight();
        if !(-MAX_DUPLICATES..=MAX_DUPLICATES).contains(&w) {
            bail!("Unable to output record with very large weight {w}. Consider adjusting your SQL queries to avoid duplicate output records, e.g., using 'SELECT DISTINCT'.");
```
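To make the bound under discussion concrete, here is a hedged, self-contained sketch of the check: a positive weight `w` means the record is emitted `w` times, and weights outside `±MAX_DUPLICATES` are rejected. The constant's value and the `expand_weight` helper are illustrative, not the adapter's actual code:

```rust
// Illustrative bound; the real value in the adapter may differ.
const MAX_DUPLICATES: i64 = 1_000_000;

/// Returns how many copies of a record a weight `w` expands to,
/// or an error if the weight exceeds the configured bound.
fn expand_weight(w: i64) -> Result<usize, String> {
    if !(-MAX_DUPLICATES..=MAX_DUPLICATES).contains(&w) {
        return Err(format!(
            "Unable to output record with very large weight {w}. \
             Consider adjusting your SQL queries to avoid duplicate \
             output records, e.g., using 'SELECT DISTINCT'."
        ));
    }
    // Negative weights (deletes) contribute no output copies here.
    Ok(w.max(0) as usize)
}

fn main() {
    assert_eq!(expand_weight(3).unwrap(), 3);
    assert_eq!(expand_weight(-2).unwrap(), 0);
    assert!(expand_weight(MAX_DUPLICATES + 1).is_err());
}
```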
Why do we really have to do this?
Just write all of them.
Taken from the current JSON impl. cc @ryzhyk
Some users out there may not like it if we output a trillion identical records. We can make `MAX_DUPLICATES` very large, but I think we need to bound it.
First, it is very unlikely that someone will write such a query.
Second, if we crash, it's our fault; if the output is very large, it's their fault.
Third, this is not the right way to report this problem; we really have to work on an error stream that survives a circuit crash.
This shouldn't crash, just report an error.
This is even worse, because the results will be wrong and no one may be looking at that error.
The alternative is a DoS attack that we help the user launch on themselves.
```rust
        }
        if w < 0 {
            // TODO: we don't support deletes in the parquet format yet.
            continue;
```
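To show what the `continue` branch above implies, here is a hedged sketch of the behavior with simplified record and weight types (the `encode` helper is hypothetical, not the adapter's API): negative-weight records (deletes) are silently dropped, while positive weights are expanded into copies. The reviewer's alternative would replace the `continue` with a `bail!`:

```rust
// Sketch of the encoder loop under discussion, with records modeled
// as (key, weight) pairs instead of the real cursor-based types.
fn encode(records: &[(&str, i64)]) -> Vec<String> {
    let mut out = Vec::new();
    for &(key, w) in records {
        if w < 0 {
            // Deletes are not yet supported in the parquet format, so the
            // current code silently drops them (bailing is the alternative).
            continue;
        }
        // A positive weight w means the record appears w times in the output.
        for _ in 0..w {
            out.push(key.to_string());
        }
    }
    out
}

fn main() {
    let out = encode(&[("a", 2), ("b", -1), ("c", 1)]);
    assert_eq!(out, vec!["a", "a", "c"]);
}
```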
Is `continue` the best solution? Here I would bail.
This is definitely problematic, but there are cases where ignoring deletes is actually ok, e.g., when the set of keys is stable.
Meanwhile, I added support for Date32 and Time64 in serde-arrow and forked the crate: chmp/serde_arrow#147
This seems like a user-visible change to me. You even added docs!
Force-pushed from 8f86358 to fde17b4.
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Adds pretty basic parquet/arrow support.
Long term, we should think about a better API that works well for column-oriented data where parsing/conversion happens in external libraries.
We'll probably need to extend this further to customize the serialization/deserialization.
@mihaibudiu, please review the compiler changes; @ryzhyk, the rest.
Is this a user-visible change (yes/no): yes