PARQUET-1711: Break circular dependencies in proto definitions #988

matthieun · 2022-08-18T20:02:00Z

If proto definitions contain circular dependencies, the proto schema converter now breaks the cycle and logs a warning instead of failing with a StackOverflowError.

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-1711

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
- Proto definitions with circular dependencies tested in ProtoSchemaConverterTest

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and classes in the PR contain Javadoc that explains what they do

shangxinli · 2022-08-21T19:09:11Z

parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java

+      final String name = fieldDescriptor.getFullName();
+      final List<String> newParentNames = new ArrayList<>(parentNames);
+      newParentNames.add(name);
+      if (parentNames.contains(name)) {


List.contains is slower than a HashSet lookup. Is there a reason we don't use a HashSet?

matthieun · 2022-08-22

The list is mostly there to preserve ordering, so that the dependency chain is printed in order in the warning message. I understand that it might be slower when the schema definition is deeply nested, but the list never grows larger than the deepest nesting in the schema.
If this is still a concern, I am happy to switch to HashSet at the expense of dumbing down the log message (printing the nesting chain out of order would not be valuable anyway, I think).
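One way to get both properties discussed above (constant-time membership checks and an in-order dependency chain for the warning) is a LinkedHashSet, which iterates in insertion order. The sketch below is not the PR's code: it assumes a toy model in which the schema is a map from a name to the names nested under it, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class OrderedCycleCheck {

  /**
   * Walks a toy schema graph (name -> nested names, a hypothetical stand-in
   * for proto descriptors) and returns one message per cycle that was broken.
   */
  static List<String> breakCycles(Map<String, List<String>> nestedNames, String root) {
    List<String> warnings = new ArrayList<>();
    visit(nestedNames, root, new LinkedHashSet<>(), warnings);
    return warnings;
  }

  private static void visit(Map<String, List<String>> nestedNames, String name,
      Set<String> path, List<String> warnings) {
    // add() returns false when the name is already on the current path,
    // so the membership check is a hash lookup rather than a list scan.
    if (!path.add(name)) {
      // LinkedHashSet iterates in insertion order, so the chain prints in order.
      warnings.add(String.join(" -> ", path) + " -> " + name);
      return;
    }
    for (String child : nestedNames.getOrDefault(name, List.of())) {
      visit(nestedNames, child, path, warnings);
    }
    // Backtrack so that sibling branches do not see this name as a parent.
    path.remove(name);
  }
}
```

Because the set is mutated and then restored on the way back up, this variant also avoids copying the parent chain at every level; whether that matters for realistic schema depths is an open question.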

shangxinli · 2022-08-21T19:11:27Z

parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoSchemaConverter.java

+      newParentNames.add(name);
+      if (parentNames.contains(name)) {
+        // Circular dependency, skip
+        LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),


I am not very familiar with Proto. Are circular definitions normal in proto by design, or are they the result of a mistake?

matthieun · 2022-08-22

It is possible to create circular dependencies, and that is the problem. I am not sure in what case they would be useful; however, since they can exist, parquet should not fail with StackOverflowError when it encounters them 😄

shangxinli · 2022-08-23

In that case, we silently break the circle without throwing an exception. Is that OK?

matthieun · 2022-08-23

Well, another option would be to add a new configuration setting that lets the user choose between failing with a good error message and silently breaking the circle like this. However, I am not familiar with how parquet-protobuf is configured. If I should go that route, I'd appreciate some examples!
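To make the two behaviours in this proposal concrete, here is a minimal sketch of a traversal that either fails with a descriptive message or breaks the circle, depending on a policy value. Everything in it is an assumption for illustration: the CyclePolicy enum is hypothetical and is not an existing parquet-protobuf setting, and the schema is modelled as a plain map from a name to the names nested under it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CyclePolicyDemo {

  /** Hypothetical setting; not an existing parquet-protobuf option. */
  enum CyclePolicy { FAIL, BREAK }

  /** Returns the names that were converted, in visit order. */
  static List<String> convert(Map<String, List<String>> nestedNames, String root,
      CyclePolicy policy) {
    List<String> converted = new ArrayList<>();
    visit(nestedNames, root, new LinkedHashSet<>(), policy, converted);
    return converted;
  }

  private static void visit(Map<String, List<String>> nestedNames, String name,
      Set<String> path, CyclePolicy policy, List<String> converted) {
    if (path.contains(name)) {
      String chain = String.join(" -> ", path) + " -> " + name;
      if (policy == CyclePolicy.FAIL) {
        // Fail fast with the full chain instead of overflowing the stack.
        throw new IllegalStateException("Circular dependency: " + chain);
      }
      // BREAK: skip the repeated name (a real converter would also log the chain).
      return;
    }
    path.add(name);
    converted.add(name);
    for (String child : nestedNames.getOrDefault(name, List.of())) {
      visit(nestedNames, child, path, policy, converted);
    }
    path.remove(name);
  }
}
```

With BREAK the caller gets a truncated schema and no exception; with FAIL the caller gets no schema but learns exactly which chain is circular.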

jinyius · 2022-08-31

i had been working on this issue as well and arrived at a similar solution to this one (however, without skipping/losing data) and linked to the prs in this pr conversation. ptal, and if you folks prefer it, i can submit a merge against head and close out this pr.

matthieun · 2022-08-26T23:40:10Z

@shangxinli Let me know if this is good to merge!

jinyius · 2022-08-31T05:11:31Z

hmm... what timing. i actually have a pr for what i think is a more robust approach that truncates at an arbitrary recursion depth by putting the remaining recursion levels into a binary blob. this approach lets downstream consumers query the non-truncated parts fine, and allows udfs to be defined to reinstantiate the truncated recursive fields.

i didn't submit the pr for merge quite yet b/c i'm busy trying to finish off the overall project i needed this for at work, so it's just coded against 1.12.3 and not head.

ptal, and if everyone likes my proposal, i can spend a few cycles and move it to head:

schema converter pr:
https://github.com/promotedai/parquet-mr/pull/1
write support pr:
https://github.com/promotedai/parquet-mr/pull/2
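The depth-limiting idea described in this comment can be illustrated with a toy recursive type. The sketch below is not the code from the linked PRs: Node, Truncated, and the int-based byte encoding are hypothetical stand-ins (a real implementation would presumably use the proto message's own serialized form). It expands the first maxDepth levels into directly readable values and keeps the rest as bytes that can later be decoded, which is the role a UDF would play downstream.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class DepthLimitDemo {

  /** Toy recursive message: a value plus an optional nested message of the same type. */
  static final class Node {
    final int value;
    final Node next; // null when there is no nested message

    Node(int value, Node next) {
      this.value = value;
      this.next = next;
    }
  }

  /** Expanded levels that stay queryable, plus the truncated tail as opaque bytes. */
  static final class Truncated {
    final List<Integer> expanded = new ArrayList<>();
    byte[] remainder = new byte[0];
  }

  /** Expands at most maxDepth levels and serializes whatever is left. */
  static Truncated truncate(Node root, int maxDepth) {
    Truncated out = new Truncated();
    Node current = root;
    for (int depth = 0; depth < maxDepth && current != null; depth++) {
      out.expanded.add(current.value);
      current = current.next;
    }
    out.remainder = serialize(current);
    return out;
  }

  /** Toy encoding: each remaining value as a 4-byte int, outermost first. */
  static byte[] serialize(Node node) {
    int count = 0;
    for (Node n = node; n != null; n = n.next) {
      count++;
    }
    ByteBuffer buffer = ByteBuffer.allocate(4 * count);
    for (Node n = node; n != null; n = n.next) {
      buffer.putInt(n.value);
    }
    return buffer.array();
  }

  /** Rebuilds the truncated tail from the bytes; returns null for an empty blob. */
  static Node deserialize(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    int count = bytes.length / 4;
    int[] values = new int[count];
    for (int i = 0; i < count; i++) {
      values[i] = buffer.getInt();
    }
    Node node = null;
    for (int i = count - 1; i >= 0; i--) {
      node = new Node(values[i], node);
    }
    return node;
  }
}
```

Unlike breaking the cycle at the first repeat, nothing is dropped here: data beyond the depth limit is harder to query but can still be recovered.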

jinyius · 2022-09-08T07:44:53Z

fyi, i sent pr #995

shangxinli · 2022-09-27T02:32:40Z

@matthieun and @jinyius Would it be possible for you both to sync to come up with one solution? You can put the other one as co-author.

jinyius · 2022-09-28T05:49:48Z

imho, i believe #995 is a superset of functionality to this pr.

shangxinli · 2022-10-09T20:26:45Z

Hi @jinyius and @matthieun, thank you both for the contribution; we really appreciate your patience with us. Now that we have two PRs for the same issue, we had better merge them into one. Given that this PR is earlier, would it be a good idea to incorporate what is missing from #995 into this PR? @matthieun can add @jinyius as a co-author in that case.

Does it make sense to both of you?

jinyius · 2022-10-10T04:51:05Z

i don't think merging will help here. both approaches do similar things in terms of traversing and expanding out the schema on recursive fields. they differ on the state used during the traversal, and they differ on how to deal with the remaining recursive data (this one silently ignores it, but mine stores it as serialized bytes).

i don't care about authorship. i want this to get fixed, and fixed properly.

matthieun · 2022-10-10T17:26:57Z

Hi, I am fine with whatever solution. If you choose #995 that works, please just close this one!

shangxinli · 2022-12-03T18:34:17Z

Since #995 is merged, let's close this one. Thanks @matthieun for the contribution!

PARQUET-1711: Break circular dependencies in proto definitions

71424f2

jinyius mentioned this pull request Sep 8, 2022

PARQUET-1711: support recursive proto schemas by limiting recursion depth #995

Merged

shangxinli closed this Dec 3, 2022
