Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1711: support recursive proto schemas by limiting recursion depth #995

Merged
merged 3 commits into from
Nov 2, 2022
Merged

PARQUET-1711: support recursive proto schemas by limiting recursion depth #995

merged 3 commits into from
Nov 2, 2022

Conversation

jinyius
Copy link
Contributor

@jinyius jinyius commented Sep 8, 2022

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • ProtoSchemaConverterTest#test*Recursion
    • ProtoWriteSupportTest#test*Recursion

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@jinyius
Copy link
Contributor Author

jinyius commented Sep 14, 2022

ping

@jinyius
Copy link
Contributor Author

jinyius commented Sep 28, 2022

fixed missing dep issue. can someone approve the ci flow?

@@ -99,9 +139,9 @@ private Type.Repetition getRepetition(FieldDescriptor descriptor) {
}
}

private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder) {
private <T> Builder<? extends Builder<?, GroupBuilder<T>>, GroupBuilder<T>> addField(FieldDescriptor descriptor, final GroupBuilder<T> builder, ImmutableSetMultimap<String, Integer> seen, int depth) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to consolidate seen and depth into a single data-structure that can be passed through and abstract some of the direct access to the multimap?

Copy link
Contributor Author

@jinyius jinyius Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

the seen map does encode the seen fields along with their depth as a single datastructure. depth being a separate arg is important b/c it's the current depth in the traversal, and is used to update the seen data structure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I was thinking of encapsulating this logic into its own class, so they can be recorded and updated together, to 1. Reduce additional parameters that have to be passed through.
2. Encapsulate the logic behind more mnemonic method names (e.g. AddRecursiveStep())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure encapsulation helps with readability or protection in this case. they are really tracking different things, and should be understood by readers of the traversal code to know how each piece of state is used.

@jinyius
Copy link
Contributor Author

jinyius commented Sep 30, 2022

thanks for the review. updated to handle the logging perf concern as well as fixing the javadoc errors.

@jinyius
Copy link
Contributor Author

jinyius commented Oct 7, 2022

ping

" }\n" +
"}";
public void testProto3ConvertAllDatatypes() {
String expectedSchema = JOINER.join(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to separate this tpe of code style cleanup from functional changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wdym by "tpe"?

if this isn't blocking, i'd rather avoid the busy-work to undo and redo in a different branch.

: value instanceof Message
? ((Message) value).toByteString()
// Worst-case, just dump as plain java string.
: ByteString.copyFromUtf8(value.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this actually an intended state? If not it is probably better to raise an exception then writing data that could possibly be hard to recover.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is intended. for a real-time, production pipeline i'm working on, losing data as it passes through or killing the job b/c of an uncaught exception is problematic as it could lead to data loss and down time. this way, there's some way to know what the problematic data was and fix it properly asap.

? (ByteString) value
// TODO: figure out a way to use MessageOrBuilder
: value instanceof Message
? ((Message) value).toByteString()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does recordconsumer offer a stream API or something else to avoid the additional array/bytestring copies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,50 @@
message Trees.BinaryTree {
optional group value = 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't groups deprecated?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or is par not proto?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is parquet schema, not proto. protos should/would have a .proto suffix.

option java_package = "org.apache.parquet.proto.test";

message BinaryTree {
google.protobuf.Any value = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to verify that something like:

message WrappedTree {
google.protobuf.Any non_recursive = 1;
BinaryTree tree = 2;
}

Also gives expected results (non_recursive doesn't accidentally trigger any of the recursio logic).
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the existing non-recursive proto tests exercise the existing and newly added (the skipping behavior) code paths.

@emkornfield
Copy link
Contributor

Mostly looks reasonable, I'm not too familiar with parquet-mr @shangxinli can you recommend someone who might be able to give a better review?

@jinyius
Copy link
Contributor Author

jinyius commented Oct 17, 2022

Mostly looks reasonable, I'm not too familiar with parquet-mr @shangxinli can you recommend someone who might be able to give a better review?

pinging @shangxinli :)

@shangxinli
Copy link
Contributor

@ggershinsky Can you have a look?

Copy link
Contributor

@ggershinsky ggershinsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with the current state of the PR, and would like to thank its reviewers.

@ggershinsky
Copy link
Contributor

I would also like to recommend adding @matthieun as a co-author to this PR, per the discussion in the parallel PR.

@jinyius
Copy link
Contributor Author

jinyius commented Oct 19, 2022

I would also like to recommend adding @matthieun as a co-author to this PR, per the discussion in the parallel PR.

how do you do this? i think i did it right...

Co-authored-by: matthieun <matthieu.nahoum@gmail.com>
@jinyius jinyius requested review from ggershinsky and emkornfield and removed request for emkornfield and ggershinsky October 19, 2022 22:07
@jinyius jinyius requested review from emkornfield and ggershinsky and removed request for ggershinsky and emkornfield October 19, 2022 22:07
@jinyius
Copy link
Contributor Author

jinyius commented Oct 24, 2022

can someone retry the github actions? there seemed to have been a transient issue that caused one of the test/build targets to fail. i'd like to get this change in this week.

@jinyius jinyius requested review from emkornfield and ggershinsky and removed request for ggershinsky and emkornfield October 25, 2022 16:49
@emkornfield
Copy link
Contributor

@ggershinsky what is the process to merge this? Does parquet-mr just use the github UI?

@ggershinsky
Copy link
Contributor

yep, just the squash/merge button.

@jinyius jinyius closed this Oct 31, 2022
@jinyius jinyius reopened this Oct 31, 2022
@jinyius
Copy link
Contributor Author

jinyius commented Oct 31, 2022

@ggershinsky

i'd love to just hit the button. i don't see it. the workflow for travis ci had a failure due to a transient connection issue, and so it wasn't giving me the option to merge. the ui messaging also states that "Only those with write access to this repository can merge pull requests."

@ggershinsky
Copy link
Contributor

@shangxinli are you ok with this PR in its current form?

@jinyius
Copy link
Contributor Author

jinyius commented Nov 1, 2022

yeah, i still don't see a button to merge. it now shows everything approved, checks passed, and no conflicts.

i think a committer needs to merge.

@emkornfield
Copy link
Contributor

@jinyius only committers can see the button. I was asking because different repos have different commit procedures. Should be able to merge this soon as long as @shangxinli doesn't express concerns.

Copy link
Contributor

@ggershinsky ggershinsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jinyius

@shangxinli
Copy link
Contributor

LGTM

@shangxinli shangxinli merged commit d75596b into apache:master Nov 2, 2022
@jinyius jinyius deleted the feat/protorecurse branch November 2, 2022 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants