PARQUET-1711: support recursive proto schemas by limiting recursion depth #995
ping
fixed missing dep issue. can someone approve the ci flow?
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java
@@ -99,9 +139,9 @@ private Type.Repetition getRepetition(FieldDescriptor descriptor) {
}
}

- private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
+ private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
would it make sense to consolidate seen and depth into a single data-structure that can be passed through and abstract some of the direct access to the multimap?
?
the seen map does encode the seen fields along with their depth as a single data structure. depth being a separate arg is important b/c it's the current depth in the traversal, and is used to update the seen data structure.
right, I was thinking of encapsulating this logic into its own class, so they can be recorded and updated together, to:
1. Reduce the additional parameters that have to be passed through.
2. Encapsulate the logic behind more mnemonic method names (e.g. AddRecursiveStep()).
i'm not sure encapsulation helps with readability or protection in this case. they are really tracking different things, and should be understood by readers of the traversal code to know how each piece of state is used.
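To make the two pieces of state in this thread concrete, here is a minimal sketch of a depth-limited traversal that keeps seen and depth as separate arguments. It is not the code from this PR: MsgType, DepthLimitedWalk, render, walk and maxRecursion are hypothetical names, a plain java.util map stands in for ImmutableSetMultimap, and the rule "replace the field with bytes once its type occurs more than maxRecursion times on the current path" is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a protobuf message descriptor. */
final class MsgType {
  final String name;
  // Field name -> nested message type; a null value marks a primitive field.
  final Map<String, MsgType> fields = new LinkedHashMap<>();

  MsgType(String name) {
    this.name = name;
  }
}

final class DepthLimitedWalk {
  /** Renders one line per field, truncating recursion past maxRecursion (assumed rule). */
  static List<String> render(MsgType root, int maxRecursion) {
    List<String> out = new ArrayList<>();
    walk(root, new HashMap<>(), 0, maxRecursion, out);
    return out;
  }

  /**
   * seen maps a type name to the depths at which it occurs on the current path;
   * depth is the current depth and is what gets recorded into seen.
   */
  static void walk(MsgType type, Map<String, List<Integer>> seen, int depth,
                   int maxRecursion, List<String> out) {
    // Copy before updating so sibling branches never observe each other's entries.
    Map<String, List<Integer>> path = new HashMap<>();
    for (Map.Entry<String, List<Integer>> e : seen.entrySet()) {
      path.put(e.getKey(), new ArrayList<>(e.getValue()));
    }
    path.computeIfAbsent(type.name, k -> new ArrayList<>()).add(depth);

    StringBuilder indent = new StringBuilder();
    for (int i = 0; i < depth; i++) {
      indent.append("  ");
    }

    for (Map.Entry<String, MsgType> field : type.fields.entrySet()) {
      MsgType child = field.getValue();
      if (child == null) {
        out.add(indent + field.getKey() + ": primitive");
        continue;
      }
      // This sketch only uses how many times the type occurs, not the depth values.
      int occurrences = path.getOrDefault(child.name, Collections.emptyList()).size();
      if (occurrences > maxRecursion) {
        out.add(indent + field.getKey() + ": bytes (recursion truncated)");
      } else {
        out.add(indent + field.getKey() + ": group " + child.name);
        walk(child, path, depth + 1, maxRecursion, out);
      }
    }
  }
}
```

For a self-referential BinaryTree type with a primitive value field and left/right fields of its own type, render(tree, 0) yields three lines, with left and right both truncated. Wrapping seen and depth into one helper class, as suggested above, would change only the signature of walk; the copy-then-record step would stay the same.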
thanks for the review. updated to handle the logging perf concern as well as fixing the javadoc errors.
…epth This approach could address the other recursion related issues (PARQUET-129, PARQUET-554).
ping
" }\n" +
"}";
public void testProto3ConvertAllDatatypes() {
String expectedSchema = JOINER.join(
is it possible to separate this tpe of code style cleanup from functional changes?
wdym by "tpe"?
if this isn't blocking, i'd rather avoid the busy-work to undo and redo in a different branch.
: value instanceof Message
? ((Message) value).toByteString()
// Worst-case, just dump as plain java string.
: ByteString.copyFromUtf8(value.toString());
is this actually an intended state? If not, it is probably better to raise an exception than to write data that could be hard to recover.
this is intended. for a real-time, production pipeline i'm working on, losing data as it passes through or killing the job b/c of an uncaught exception is problematic as it could lead to data loss and down time. this way, there's some way to know what the problematic data was and fix it properly asap.
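The fallback order under discussion can be sketched without the protobuf dependency. This is an illustration only: BytesSerializable is a hypothetical stand-in for Message#toByteString(), byte[] stands in for ByteString, and LenientBytes is an invented name, not a class in parquet-mr.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a protobuf Message that can serialize itself. */
interface BytesSerializable {
  byte[] toBytes();
}

final class LenientBytes {
  /**
   * Converts a value to bytes without ever throwing on an unexpected type:
   * raw bytes pass through, serializable messages serialize themselves,
   * and anything else is dumped as its UTF-8 string form.
   */
  static byte[] toBytes(Object value) {
    if (value instanceof byte[]) {
      return (byte[]) value;
    }
    if (value instanceof BytesSerializable) {
      return ((BytesSerializable) value).toBytes();
    }
    // Worst case: keep a readable trace of the problematic value.
    return value.toString().getBytes(StandardCharsets.UTF_8);
  }
}
```

The stricter alternative raised in the review would replace the last return with a thrown exception. That trades the risk of writing hard-to-recover data for the risk of failing the job on an unexpected value.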
? (ByteString) value
// TODO: figure out a way to use MessageOrBuilder
: value instanceof Message
? ((Message) value).toByteString()
does recordconsumer offer a stream API or something else to avoid the additional array/bytestring copies?
@@ -0,0 +1,50 @@
message Trees.BinaryTree {
optional group value = 1 {
Aren't groups deprecated?
or is this parquet, not proto?
this is parquet schema, not proto. protos should/would have a .proto suffix.
option java_package = "org.apache.parquet.proto.test";

message BinaryTree {
google.protobuf.Any value = 1;
it would be good to verify that something like:
message WrappedTree {
google.protobuf.Any non_recursive = 1;
BinaryTree tree = 2;
}
also gives expected results (non_recursive doesn't accidentally trigger any of the recursion logic).
}
i think the existing non-recursive proto tests exercise the existing and newly added (the skipping behavior) code paths.
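The WrappedTree concern can be sketched as a standalone check. The point is that a type is treated as recursive only when it already occurs on the path from the root to the current field, so a type shared by sibling branches (Any here) is not flagged. Node and PathScopedCheck are hypothetical names and this is not the converter's actual logic, only an illustration of path-scoped tracking.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a message type with named message-typed fields. */
final class Node {
  final String type;
  final Map<String, Node> fields = new LinkedHashMap<>();

  Node(String type) {
    this.type = type;
  }
}

final class PathScopedCheck {
  /** Counts fields whose type is already one of their own ancestors. */
  static int countRecursiveFields(Node node, Deque<String> path) {
    path.push(node.type);
    int hits = 0;
    for (Node child : node.fields.values()) {
      if (path.contains(child.type)) {
        hits++; // the type is on the current path, so expanding it would recurse
      } else {
        hits += countRecursiveFields(child, path);
      }
    }
    path.pop(); // leave the path as it was, so siblings start from the same state
    return hits;
  }
}
```

Built against the WrappedTree shape from the comment above (an Any field next to a BinaryTree whose own value is also Any), this check counts only the two self-referential BinaryTree fields; neither Any field is flagged.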
Mostly looks reasonable, I'm not too familiar with parquet-mr. @shangxinli can you recommend someone who might be able to give a better review?
pinging @shangxinli :)
@ggershinsky Can you have a look?
I'm ok with the current state of the PR, and would like to thank its reviewers.
I would also like to recommend adding @matthieun as a co-author to this PR, per the discussion in the parallel PR.
Co-authored-by: matthieun <matthieu.nahoum@gmail.com>
can someone retry the github actions? there seemed to have been a transient issue that caused one of the test/build targets to fail. i'd like to get this change in this week.
@ggershinsky what is the process to merge this? Does parquet-mr just use the github UI?
yep, just the squash/merge button.
i'd love to just hit the button. i don't see it. the workflow for travis ci had a failure due to a transient connection issue, and so it wasn't giving me the option to merge. the ui messaging also states that "Only those with write access to this repository can merge pull requests."
@shangxinli are you ok with this PR in its current form?
yeah, i still don't see a button to merge. it now shows everything approved, checks passed, and no conflicts. i think a committer needs to merge.
@jinyius only committers can see the button. I was asking because different repos have different commit procedures. Should be able to merge this soon as long as @shangxinli doesn't express concerns.
Thanks @jinyius
LGTM
Jira
Tests
ProtoSchemaConverterTest#test*Recursion
ProtoWriteSupportTest#test*Recursion
Commits
Documentation