story 71751536. rdf:type triples are duplicated in output #421

mohideen · 2014-07-17T20:54:16Z

Remove duplicates from RDFStream.

https://www.pivotaltracker.com/s/projects/684825/stories/71751536

cbeer · 2014-07-23T20:32:19Z

fcrepo-kernel/src/main/java/org/fcrepo/kernel/utils/iterators/RdfStream.java

+     * Removes duplicate triples.
+     */
+    public void removeDuplicates() {
+        final LinkedHashSet triplesLHS = new LinkedHashSet(Lists.newArrayList(triples));


Doesn't this mean we have to load every triple in RAM in order to de-dupe, or is there something clever happening under the hood?

Why do we need to de-dupe? (And yes, @cbeer, putting these triples in an ArrayList or LinkedHashSet will pull them all into memory. Pretty much anything in the basic Java Collections API that implements Collection is an in-memory construct.)
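The memory concern can be illustrated with a small, self-contained sketch. It uses only the JDK, not the project's classes: `CountingIterator` and `copyToSet` are hypothetical names made up for this illustration, and the copy loop only approximates what building a LinkedHashSet from an iterator does.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.NoSuchElementException;
import java.util.Set;

final class EagerCopyDemo {

    /** Hypothetical iterator over 0..limit-1 that records how many elements were pulled. */
    static final class CountingIterator implements Iterator<Integer> {
        private final int limit;
        int pulled = 0;

        CountingIterator(final int limit) {
            this.limit = limit;
        }

        @Override
        public boolean hasNext() {
            return pulled < limit;
        }

        @Override
        public Integer next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return pulled++;
        }
    }

    /** Copies every element of the source into a set; the source is fully drained before this returns. */
    static <T> Set<T> copyToSet(final Iterator<T> source) {
        final Set<T> out = new LinkedHashSet<>();
        while (source.hasNext()) {
            out.add(source.next());
        }
        return out;
    }
}
```

After `copyToSet` returns, the iterator's `pulled` counter equals its limit: every element had to be read, and held in the set, before a single de-duplicated element could be handed to a consumer.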

I think we should de-dupe the rdf:type statements -- it looks sloppy to repeat them.

Ideally, we would be able to avoid adding the duplicate types in the first place. But if that doesn't work, then a better pattern would be an Iterator implementation that wraps the triples and suppresses duplicate rdf:type triples. We would only need the rdf:type triples in memory, which would avoid the typical cases where this would be a problem (e.g. large numbers of children, etc.).
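A minimal sketch of the wrapping-iterator idea, under stated assumptions: `SimpleTriple` is a hypothetical stand-in for the real triple class, `DistinctTypeIterator` is an invented name, and the source iterator is assumed never to yield null. It is not code from this pull request.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for a triple; not the Jena class. */
final class SimpleTriple {
    final String subject;
    final String predicate;
    final String object;

    SimpleTriple(final String subject, final String predicate, final String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    @Override
    public boolean equals(final Object other) {
        if (!(other instanceof SimpleTriple)) {
            return false;
        }
        final SimpleTriple t = (SimpleTriple) other;
        return subject.equals(t.subject) && predicate.equals(t.predicate) && object.equals(t.object);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subject, predicate, object);
    }
}

/** Passes every triple through, except rdf:type triples that were already emitted. */
final class DistinctTypeIterator implements Iterator<SimpleTriple> {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    private final Iterator<SimpleTriple> source;
    // Only rdf:type triples are remembered, so memory grows with the number of distinct types.
    private final Set<SimpleTriple> seenTypes = new HashSet<>();
    private SimpleTriple next;

    DistinctTypeIterator(final Iterator<SimpleTriple> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        // Advance lazily until a triple that should be emitted is found.
        while (next == null && source.hasNext()) {
            final SimpleTriple candidate = source.next();
            if (!RDF_TYPE.equals(candidate.predicate) || seenTypes.add(candidate)) {
                next = candidate;
            }
        }
        return next != null;
    }

    @Override
    public SimpleTriple next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        final SimpleTriple result = next;
        next = null;
        return result;
    }
}
```

Non-type triples are never stored, so a resource with a very large number of children still streams through with memory use bounded by its distinct rdf:type statements. Duplicates of other predicates are deliberately left alone.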

It would be much better to do some kind of limited filtering. The extant filter() method could be reused for this. The whole purpose of RdfStream was originally precisely to avoid pulling all the triples into heap, because we saw severe performance degradations when that happened.
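One way the limited-filtering suggestion might look is a stateful predicate handed to a filter step. This is a sketch only: it uses `java.util.function.Predicate` and models a triple as a three-element `List<String>` of subject, predicate, object, neither of which is claimed to match the actual `filter()` signature on RdfStream. `TypeDedupFilter` and `firstTypeOnly` are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class TypeDedupFilter {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /**
     * Returns a predicate that accepts every non-type triple and only the
     * first occurrence of each rdf:type triple. Each triple is assumed to be
     * a list of [subject, predicate, object].
     */
    static Predicate<List<String>> firstTypeOnly() {
        // State is limited to the rdf:type triples seen so far.
        final Set<List<String>> seen = new HashSet<>();
        return triple -> !RDF_TYPE.equals(triple.get(1)) || seen.add(triple);
    }
}
```

A caller would create one predicate per stream, for example `triples.stream().filter(TypeDedupFilter.firstTypeOnly())`. Because the predicate carries state, it should not be shared between streams or used with a parallel stream.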

@cbeer, @ajs6f, are you ok with duplicate triples? Should we find another technique for maintaining uniqueness or should this ticket be dropped?

I don't think we should merge this as is. It would be good to understand why there are duplicate triples in the first place. It'd surprise me if that was necessary.

I believe the ticket describes why (or rather "when") the duplicate triples are introduced.
https://www.pivotaltracker.com/story/show/71751536

The idea of not introducing duplicates in the first place is appealing, as it is not clear to me how we would filter duplicates without building up the triples in-memory.

I'm actually fine with duplicate triples. I understand @escowles's point, but it just doesn't bother me that much. RDF is a machine format.

If we can understand how to avoid the duplicates at some closer-to-the-JCR layer, that is without question the right thing to do. It shouldn't be that hard. It currently seems to be a matter of maintaining some in-flight state in the subclasses of NodeRdfContext.

NodeRdfContext is a piece of junk. In fact, the whole triple generation subsystem should be reworked to bring far more of the action out into the type system. And yes, I wrote a huge amount of that and that's one reason I'm so sure about it.

awoods · 2014-07-23T21:25:11Z

This commit has been reverted.

story 71751536. rdf:type triples are duplicated in output

130eb1f

Remove duplicates from RDFStream. https://www.pivotaltracker.com/s/projects/684825/stories/71751536

awoods pushed a commit that referenced this pull request Jul 23, 2014

Merge pull request #421 from umd-lib/rdf-dedup

fd8eccd

story 71751536. rdf:type triples are duplicated in output

awoods merged commit fd8eccd into fcrepo:master Jul 23, 2014

cbeer reviewed Jul 23, 2014

peichman-umd deleted the rdf-dedup branch December 10, 2014 21:47
