PARQUET-1711: Break circular dependencies in proto definitions #988
final String name = fieldDescriptor.getFullName();
final List<String> newParentNames = new ArrayList<>(parentNames);
newParentNames.add(name);
if (parentNames.contains(name)) {
List.contains would be slower than a HashSet lookup. Is there a reason we don't use a HashSet?
The list is mostly used to keep the ordering, so that the dependency chain is printed in order in the warning message. I understand that if the schema definition is really deep, with many nested types, it might be slower, but the list never grows larger than the deepest nesting in the schema.
If this is still a concern, I am happy to switch to HashSet at the expense of maybe dumbing down the log message (printing the nesting chain out of order would not be valuable anyway, I think).
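One way to get both properties discussed here (ordered output for the log message and constant-time membership checks) is java.util.LinkedHashSet, which iterates in insertion order. The following is a minimal sketch, not the code in this PR; the class name, method name, and the field names in main are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class ParentChainSketch {
  /**
   * Returns the dependency chain, in insertion order, if visiting `name`
   * would close a cycle; returns null when `name` is not yet on the chain.
   */
  static String describeCycle(Set<String> parentNames, String name) {
    if (!parentNames.contains(name)) { // O(1) on average for a hash-based set
      return null;
    }
    // A LinkedHashSet iterates in insertion order, so the chain stays readable.
    return String.join(" -> ", parentNames) + " -> " + name;
  }

  public static void main(String[] args) {
    Set<String> parents = new LinkedHashSet<>();
    parents.add("pkg.A.b"); // hypothetical field full names
    parents.add("pkg.B.c");
    parents.add("pkg.C.a");
    System.out.println(describeCycle(parents, "pkg.A.b"));
  }
}
```

As noted in the comment above, the chain is bounded by the nesting depth of the schema, so the practical difference from a list is likely small.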
Makes sense.
newParentNames.add(name);
if (parentNames.contains(name)) {
  // Circular dependency, skip
  LOG.warn("Breaking circular dependency:{}{}", System.lineSeparator(),
I am not very familiar with Proto. By design, are circular definitions normal in proto, or are they caused by mistakes?
It is possible to create circular dependencies; that is the problem. I am not sure in what case they would be useful, but since they can exist, parquet should not fail with a StackOverflowError when it encounters them 😄
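To illustrate why a self-referencing definition overflows the stack in a naive recursive conversion, and how tracking the current path avoids it, here is a simplified sketch. It assumes a toy schema model (a map from message type name to its fields) rather than the protobuf descriptor API, and it checks type names instead of the field full names used in this PR; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class CycleBreakingWalkSketch {
  /**
   * Expands `type` into dotted leaf-field paths. `schema` maps a message type
   * name to its fields (field name -> type name); any type name that is not a
   * key of `schema` is treated as a scalar.
   */
  static List<String> expand(Map<String, Map<String, String>> schema, String type) {
    List<String> out = new ArrayList<>();
    walk(schema, type, "", new ArrayList<>(List.of(type)), out);
    return out;
  }

  private static void walk(Map<String, Map<String, String>> schema, String type,
                           String prefix, List<String> path, List<String> out) {
    for (Map.Entry<String, String> field : schema.getOrDefault(type, Map.of()).entrySet()) {
      String fieldPath = prefix + field.getKey();
      String fieldType = field.getValue();
      if (!schema.containsKey(fieldType)) {
        out.add(fieldPath); // scalar leaf
        continue;
      }
      if (path.contains(fieldType)) {
        continue; // re-entering a type already on the path: skip instead of recursing forever
      }
      path.add(fieldType);
      walk(schema, fieldType, fieldPath + ".", path, out);
      path.remove(path.size() - 1); // backtrack so sibling fields see the right path
    }
  }
}
```

Without the path check, a message A with a field of type B, where B has a field of type A, would recurse until the JVM throws StackOverflowError.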
In that case, we silently break the circle without throwing an exception. Is that OK?
Well, another option would be to add a new configuration setting that lets the user choose between failing with a good error message and silently breaking the circle like this. However, I am not familiar with how parquet-protobuf is configured. If I should go that route, I'd appreciate some examples!
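The option floated in this comment could look roughly like the sketch below. This is purely illustrative: the enum, the class, and the method are hypothetical and do not correspond to any existing parquet-protobuf setting.

```java
import java.util.List;

final class CyclePolicySketch {
  /** Hypothetical user-facing choice: fail loudly or skip the repeated field. */
  enum OnCycle { FAIL, SKIP }

  /**
   * Returns true if the field should be converted and false if it should be
   * skipped. Under FAIL, a cycle raises an exception that names the chain.
   */
  static boolean shouldConvert(List<String> parentNames, String name, OnCycle policy) {
    if (!parentNames.contains(name)) {
      return true; // no cycle, convert as usual
    }
    String chain = String.join(" -> ", parentNames) + " -> " + name;
    if (policy == OnCycle.FAIL) {
      throw new IllegalStateException("Circular dependency in proto definition: " + chain);
    }
    return false; // SKIP: the caller would log a warning and drop the field
  }
}
```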
i had been working on this issue as well and arrived at a similar solution to this one (however, without skipping/losing data) and linked to the prs in this pr conversation. ptal, and if you folks prefer it, i can submit a merge against head and close out this pr.
@shangxinli Let me know if this is good to merge!
hmm... what timing. i actually have a pr for what i think is a more robust approach that truncates at an arbitrary recursion depth by putting the remaining recursion levels into a binary blob. this approach lets downstream querying things query the non-truncated parts fine, and allows for udfs to be defined to reinstantiate the truncated recursed fields. i didn't submit the pr for merge quite yet b/c i'm busy trying to finish off the overall project i needed this for at work, so it's just coded against 1.12.3 and not head. ptal, and if everyone likes my proposal, i can spend a few cycles and move it to head: schema converter pr:
fyi, i sent pr #995
@matthieun and @jinyius Would it be possible for you both to sync to come up with one solution? You can put the other one as co-author.
imho, i believe #995 is a superset of functionality to this pr.
Hi @jinyius and @matthieun, Thank you both for the contribution; we really appreciate your patience with us. Now that we have two PRs for the same issue, we had better merge them into one. Given this PR is earlier, would it be a good idea to incorporate into this PR whatever it is missing from #995? @matthieun can add @jinyius as a co-author in that case. Does that make sense to both of you?
i don't think merging will help here. both approaches do similar things in terms of traversing and expanding out the schema on recursive fields. they differ on the state used during the traversal, and they differ on how to deal with the remaining recursive data (this one silently ignores it, but mine stores it as serialized bytes). i don't care about authorship. i want this to get fixed, and fixed properly.
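The difference described here, dropping the recursive remainder versus keeping it as serialized bytes past a depth limit, can be sketched as follows. This is a loose illustration of the idea only, not the code in #995: the Node class stands in for a recursive proto message, the comma-separated text encoding stands in for proto serialization, and the column name next_bytes is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class DepthTruncationSketch {
  /** Minimal self-referencing record, standing in for a recursive proto message. */
  static final class Node {
    final int value;
    final Node next;

    Node(int value, Node next) {
      this.value = value;
      this.next = next;
    }
  }

  /** Stand-in for real serialization: encodes the remaining chain as UTF-8 text such as "3,4". */
  static byte[] serialize(Node node) {
    StringBuilder sb = new StringBuilder();
    for (Node n = node; n != null; n = n.next) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(n.value);
    }
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  /**
   * Expands up to `levelsLeft` nested levels as structured data, then stores
   * whatever remains as bytes instead of dropping it.
   */
  static Map<String, Object> toRow(Node node, int levelsLeft) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("value", node.value);
    if (node.next == null) {
      return row;
    }
    if (levelsLeft > 0) {
      row.put("next", toRow(node.next, levelsLeft - 1));
    } else {
      row.put("next_bytes", serialize(node.next)); // truncated remainder, still recoverable
    }
    return row;
  }
}
```

The structured levels remain queryable, and the bytes column keeps the rest of the data available for later decoding, which is the trade-off argued for in the comment above.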
Hi, I am fine with whatever solution. If you choose #995 that works, please just close this one!
Since #995 is merged, let's close this one. Thanks @matthieun for the contribution!
In case some proto definitions have circular dependencies, the proto schema converter breaks those and logs a warning, instead of failing with a StackOverflowError.
Jira
Tests
ProtoSchemaConverterTest
Commits
Documentation