Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1711: Break circular dependencies in proto definitions #988

Closed
wants to merge 1 commit into from

Conversation

matthieun
Copy link
Contributor

In case some proto definitions have circular dependencies, the proto schema converter breaks those and logs a warning, instead of a StackOverflowException.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • Proto definitions with circular dependencies tested in ProtoSchemaConverterTest

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

final String name = fieldDescriptor.getFullName();
final List<String> newParentNames = new ArrayList<>(parentNames);
newParentNames.add(name);
if (parentNames.contains(name)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list contains would be slower than HashSet. Any reason we don't use HashSet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list is mostly used to keep the ordering, so that the dependency chain is printed in order in the warning message. I understand that in case the schema definition is really deep with nested types it might be slower, but overall that list is not growing any bigger than the deepest nesting in the schema.
If this is still a concern, I am happy to switch to HashSet at the expense of maybe dumming down the log message (printing the nesting chain out of order would not be valuable anyway I think).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense

newParentNames.add(name);
if (parentNames.contains(name)) {
// Circular dependency, skip
LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar with Proto. By design, is the 'circular' normal in the proto or it is caused by issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to create circular dependencies, that is the problem. I am not sure in what case they would be useful, however since they can exist, parquet should not fail with StackOverflowError when it encounters them 😄

Copy link
Contributor

@shangxinli shangxinli Aug 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we silently break the circle without throwing an exception. Is that OK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, another option would be to add a new configuration setting to allow the user to either have it fail with a good error message, or just silently break the circle like this. However I am not familiar with how parquet-protobuf is configured. If I should go that route, I'd appreciate some examples!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i had been working on this issue as well and arrived at a similar solution to this one (however, without skipping/losing data) and linked to the prs in this pr conversation. ptal, and if you folks prefer it, i can submit a merge against head and close out this pr.

@matthieun
Copy link
Contributor Author

@shangxinli Let me know if this is good to merge!

@jinyius
Copy link
Contributor

jinyius commented Aug 31, 2022

hmm... what timing. i actually have a pr for what i think is a more robust approach that truncates at an arbitrary recursion depth by putting the remaining recursion levels into a binary blob. this approach lets downstream querying things query the non-truncated parts fine, and allows for udfs to be defined to reinstantiate the truncated recursed fields.

i didn't submit the pr for merge quite yet b/c i'm busy trying to finish off the overall project i needed this for at work, so it's just coded against 1.12.3 and not head.

ptal, and if everyone likes my proposal, i can spend a few cycles and move it to head:

schema converter pr:

@jinyius
Copy link
Contributor

jinyius commented Sep 8, 2022

fyi, i sent pr #995

@shangxinli
Copy link
Contributor

@matthieun and @jinyius Would it be possible for you both to sync to come up with one solution? You can put the other one as co-author.

@jinyius
Copy link
Contributor

jinyius commented Sep 28, 2022

@matthieun and @jinyius Would it be possible for you both to sync to come up with one solution? You can put the other one as co-author.

imho, i believe #995 is a superset of functionality to this pr.

@shangxinli
Copy link
Contributor

Hi @jinyius and @matthieun, Thank both of you for the contribution and we really appreciate your patience with us. Now we have two PRs for the same issue, we better merge them into one. Given this PR is earlier, would it be a good idea to incorporate #995 into this PR for what is missing? @matthieun can add @jinyius as a co-author in that case.

Does it make sense to both of you?

@jinyius
Copy link
Contributor

jinyius commented Oct 10, 2022

Hi @jinyius and @matthieun, Thank both of you for the contribution and we really appreciate your patience with us. Now we have two PRs for the same issue, we better merge them into one. Given this PR is earlier, would it be a good idea to incorporate #995 into this PR for what is missing? @matthieun can add @jinyius as a co-author in that case.

Does it make sense to both of you?

i don't think merging will help here. both approaches do similar things in terms of traversing and expanding out the schema on recursive fields. the differ on the state used during the traversal, and they differ on how to deal with the remaining recursive data (this one silently ignores, but the mine stores as serialized bytes).

i don't care about authorship. i want this to get fixed, and fixed properly.

@matthieun
Copy link
Contributor Author

Hi, I am fine with whatever solution. If you choose #995 that works, please just close this one!

@shangxinli
Copy link
Contributor

Since #995 is merged, let's close this one. Thanks @matthieun for the contribution !

@shangxinli shangxinli closed this Dec 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants