story 71751536. rdf:type triples are duplicated in output #421
Remove duplicates from RDFStream. https://www.pivotaltracker.com/s/projects/684825/stories/71751536
 * Removes duplicate triples.
 */
public void removeDuplicates() {
    final LinkedHashSet triplesLHS = new LinkedHashSet(Lists.newArrayList(triples));
Doesn't this mean we have to load every triple in RAM in order to de-dupe, or is there something clever happening under the hood?
Why do we need to de-dupe? (And yes, @cbeer, putting these triples in an ArrayList or LinkedHashSet will pull them all into memory. Pretty much anything from the basic Java Collection API that impls Collection is an in-memory construct.)
I think we should de-dupe the rdf:type statements -- it looks sloppy to repeat them.
Ideally, we would be able to avoid adding the duplicate types in the first place. But if that doesn't work, then a better pattern would be an Iterator implementation that wraps the triples and suppresses duplicate rdf:type triples. We would only need the rdf:type triples in memory, which would avoid the typical cases where this would be a problem (e.g. large numbers of children, etc.).
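The wrapping iterator described above could look roughly like the following. This is a minimal sketch, not code from the project: `SimpleTriple` and `TypeDedupingIterator` are hypothetical names, `SimpleTriple` stands in for whatever triple class the stream actually carries, and the source iterator is assumed never to yield `null`. Only rdf:type triples are remembered; everything else passes straight through.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for a triple; real code would use the project's triple type. */
final class SimpleTriple {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    final String subject;
    final String predicate;
    final String object;

    SimpleTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    boolean isRdfType() {
        return RDF_TYPE.equals(predicate);
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof SimpleTriple)) {
            return false;
        }
        SimpleTriple t = (SimpleTriple) other;
        return subject.equals(t.subject) && predicate.equals(t.predicate) && object.equals(t.object);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subject, predicate, object);
    }
}

/** Wraps an iterator and suppresses repeated rdf:type triples only. */
final class TypeDedupingIterator implements Iterator<SimpleTriple> {
    private final Iterator<SimpleTriple> source;
    // Only rdf:type triples are ever stored here, so memory use is
    // bounded by the number of distinct type statements.
    private final Set<SimpleTriple> seenTypes = new HashSet<>();
    private SimpleTriple next; // look-ahead slot; null means "not fetched yet"

    TypeDedupingIterator(Iterator<SimpleTriple> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        while (next == null && source.hasNext()) {
            SimpleTriple candidate = source.next();
            // Set.add returns false when the triple was already seen.
            if (!candidate.isRdfType() || seenTypes.add(candidate)) {
                next = candidate;
            }
        }
        return next != null;
    }

    @Override
    public SimpleTriple next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        SimpleTriple result = next;
        next = null;
        return result;
    }
}
```

Wrapping is then a one-liner, e.g. `new TypeDedupingIterator(triples.iterator())`. Duplicate non-type triples are deliberately left alone, since tracking them would bring back the memory problem.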
It would be much better to do some kind of limited filtering. The extant filter() method could be reused for this. The whole purpose of RdfStream was originally precisely to avoid pulling all the triples into heap, because we saw severe performance degradations when that happened.
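As a rough illustration of the limited-filtering idea, here is a stateful predicate that rejects only repeated rdf:type triples. It is a sketch under assumptions: it uses `java.util.function.Predicate` and `java.util.stream.Stream` rather than the actual `filter()` signature on RdfStream (not shown in this thread), `TypeDedupFilter` is a hypothetical name, and each triple is modelled as a three-element `List<String>` of subject, predicate, object.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TypeDedupFilter {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /**
     * Returns a stateful predicate that accepts every non-type triple and
     * only the first occurrence of each rdf:type triple. Not thread-safe,
     * so it is meant for sequential processing only.
     */
    static Predicate<List<String>> firstOccurrenceOfType() {
        final Set<List<String>> seenTypes = new HashSet<>();
        // Set.add returns false for a triple that was already recorded.
        return triple -> !RDF_TYPE.equals(triple.get(1)) || seenTypes.add(triple);
    }

    /** Small demonstration: the repeated type statement is dropped. */
    static List<List<String>> demo() {
        Stream<List<String>> triples = Stream.of(
            Arrays.asList("s", RDF_TYPE, "A"),
            Arrays.asList("s", "p", "o"),
            Arrays.asList("s", RDF_TYPE, "A"));
        return triples.filter(firstOccurrenceOfType()).collect(Collectors.toList());
    }
}
```

The predicate holds only the type statements it has seen, so the rest of the stream is never accumulated on the heap.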
I don't think we should merge this as is. It would be good to understand why there are duplicate triples in the first place. It'd surprise me if that was necessary.
I believe the ticket describes why (or rather "when") the duplicate triples are introduced.
https://www.pivotaltracker.com/story/show/71751536
The idea of not introducing duplicates in the first place is appealing, as it is not clear to me how we would filter duplicates without building up the triples in-memory.
- I'm actually fine with duplicate triples. I understand @escowles's point, but it just doesn't bother me that much. RDF is a machine format.
- If we can understand how to avoid the duplicates at some closer-to-the-JCR layer, that is without question the right thing to do. It shouldn't be that hard. It currently seems to be a matter of maintaining some in-flight state in the subclasses of NodeRdfContext.
- NodeRdfContext is a piece of junk. In fact, the whole triple generation subsystem should be reworked to bring far more of the action out into the type system. And yes, I wrote a huge amount of that and that's one reason I'm so sure about it.
This commit has been reverted.