Issues/#207 turtle pretty print #290

ansell · 2016-08-12T04:37:00Z

This PR addresses GitHub issue: #207

Briefly describe the changes proposed in this PR:

Extends the current TurtleWriter and TriGWriter to support BasicWriterSettings.PRETTY_PRINT
Buffers Statements before writing them out for TurtleWriter and TriGWriter
Adds many regression tests in RDFWriterTest
Fixes many regression test failures in other RDFParser implementations
Supports blank node contexts in RDF/JSON based on the _: prefix
Switches BasicWriterSettings.PRETTY_PRINT to default to false, as it should only be switched on when appropriate

Make sure you've followed the Contributor Guidelines. In particular (please tick to indicate you've taken care of it):

RDF4J code formatting has been applied
tests are included
all tests succeed

Note that the current structure is not designed to allow pretty-printing of nested anonymous structures and, in particular, does not support pretty-printed lists/collections. It is a middle ground between almost no pretty-printing and output that is easier to verify by eye than arbitrary N-Triples/N-Quads.

Note: this is targeted at master, so it does not involve the 2.0 release.

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

abrokenjester · 2016-08-12T04:53:10Z

core/rio/trig/src/main/java/org/eclipse/rdf4j/rio/trig/TriGWriter.java

@@ -100,6 +100,32 @@ public void handleStatement(Statement st)
 			throw new RuntimeException("Document writing has not yet been started");
 		}

+        // If we are pretty-printing, all writing is buffered until endRDF is
+        // called


To avoid OoM errors, perhaps we can use a configurable buffer size instead of just waiting for endRDF; similar to what is done in BufferedGroupingRDFHandler.
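
A minimal sketch of the configurable-buffer idea, in plain Java with no RDF4J dependency. The class `BoundedStatementBuffer`, its `capacity` parameter and the string-array triples are hypothetical names chosen for illustration; this is not how BufferedGroupingRDFHandler is implemented, only one way a writer could flush subject-grouped batches whenever the buffer fills.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: buffers (subject, predicate, object) triples and
// flushes them, grouped by subject, once `capacity` triples are held.
class BoundedStatementBuffer {
    private final int capacity;
    private final List<String[]> buffer = new ArrayList<>();
    private final List<String> output = new ArrayList<>();

    BoundedStatementBuffer(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    void handleStatement(String subject, String predicate, String object) {
        buffer.add(new String[] { subject, predicate, object });
        if (buffer.size() >= capacity) {
            flush();
        }
    }

    // Writes whatever is still buffered; call once at the end of the document.
    void endRDF() {
        flush();
    }

    private void flush() {
        // Group predicate/object pairs by subject, keeping first-seen order.
        Map<String, List<String>> bySubject = new LinkedHashMap<>();
        for (String[] t : buffer) {
            bySubject.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t[1] + " " + t[2]);
        }
        for (Map.Entry<String, List<String>> e : bySubject.entrySet()) {
            output.add(e.getKey() + " " + String.join(" ; ", e.getValue()) + " .");
        }
        buffer.clear();
    }

    List<String> getOutput() {
        return output;
    }
}
```

With a capacity of 2, three statements about the same subject come out as two separate groups, which shows that grouping is only local to each batch.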

ansell · 2016-08-12

The trouble with that is that it isn't a full solution: if we anonymise any blank nodes at all, it will corrupt some datasets that rely heavily on blank nodes. For example, parts of the OWL to RDF translation insist on blank nodes rather than IRIs, and these could be corrupted by a buffering/anonymising solution depending on the order in which the statements are delivered. The pure streaming approach and the full buffering approach will never corrupt datasets that don't OOM.
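
The hazard can be illustrated with a small, self-contained Java sketch. The class and method names are hypothetical and the triples are plain string arrays rather than RDF4J Statements. It shows that a decision made on a partial window of statements can differ from the decision made on the full document.

```java
import java.util.List;

// Hypothetical illustration of why anonymising on a partial buffer is unsafe.
class WindowedAnonymisationDemo {
    // Returns true if `bnode` is never used as an object in the statements
    // seen so far, i.e. it looks safe to write it without a label.
    static boolean looksShortenable(List<String[]> seen, String bnode) {
        for (String[] triple : seen) {
            if (triple[2].equals(bnode)) {
                return false;
            }
        }
        return true;
    }
}
```

If a writer flushed after the first two statements about `_:b1` and dropped its label, a later statement such as `:s :r _:b1` could no longer refer to the same node.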

There may need to be another option to switch on/off anonymising blank nodes that could be used to specify whether local grouping/buffering occurred rather than a full buffer.

One issue I have noticed is that the pretty-printing setting is on by default. We probably want to switch that off in this pull request, as I don't see pretty-printing as a production feature for any scalable system, even for the other formats.

abrokenjester · 2016-08-12

Good point, hadn't thought of that.

As they both close the active context, they will both interfere with pretty-printing in some corner cases; the spec needs to be checked to make sure they need to have this effect. Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

ansell · 2016-08-15T06:57:09Z

@jeenbroekstra and others, this is ready for more review (after the 2.0 release is out the door). Many regression tests have been added to verify round-tripping of the blank node anonymisation for Turtle/TriG. Some of them also exposed regressions in the other parsers, which have been fixed as part of this pull request. No attempt has been made to pretty-print list structures so far, but this could still be useful on its own.

abrokenjester · 2016-08-16T21:29:16Z

Hudson reports a test failure on this PR. See https://hudson.eclipse.org/rdf4j/job/rdf4j-verify-pr/10/

ansell · 2016-08-16T23:27:19Z

I can't replicate that failure locally, and I didn't modify any of the query result IO code in this pull request, so this pull request may not be the cause.

It is hypothetically possible that it is related to a previous pull request of mine that did change query result IO code, but the exception is intermittent and an NPE isn't the easiest thing to diagnose in multi-threaded code :( I thought we had fixed the NPEs from BackgroundQueryResult at some stage in the past, as they haven't occurred for a while.

....
[INFO] RDF4J compliance tests ............................. SUCCESS [  0.029 s]
[INFO] RDF4J Model compliance test ........................ SUCCESS [  8.093 s]
[INFO] RDF4J Query Result IO compliance tests ............. SUCCESS [  9.022 s]
[INFO] RDF4J Rio compliance tests ......................... SUCCESS [01:06 min]
[INFO] RDF4J HTTP server compliance tests ................. SUCCESS [ 14.270 s]
[INFO] RDF4J SAIL and Repository compliance test .......... SUCCESS [09:34 min]
[INFO] RDF4J SeRQL query parser compliance tests .......... SUCCESS [  4.501 s]
[INFO] RDF4J SPARQL query parser compliance tests ......... SUCCESS [02:25 min]
[INFO] RDF4J GeoSPARQL compliance tests ................... SUCCESS [  1.880 s]
[INFO] RDF4J BOM .......................................... SUCCESS [  0.006 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19:47 min
[INFO] Finished at: 2016-08-17T09:21:31+10:00
[INFO] Final Memory: 127M/687M
[INFO] ------------------------------------------------------------------------

ansell · 2016-08-16T23:35:32Z

One bug report from the Sesame history that references handleClose and an NPE is:

https://openrdf.atlassian.net/browse/SES-1978

ansell · 2016-08-16T23:38:22Z

I also reported a very similar stack trace during the migration to the new httpclient as part of the pull request review:

https://bitbucket.org/openrdf/sesame/pull-requests/134/update-to-apache-httpcomponents-httpclient/diff

ansell · 2016-09-01T00:38:34Z

This is good to be reviewed/merged from my point of view, as the two intermittent bugs that my testing (and Hudson) have been finding are being dealt with separately and do not only happen on this branch.

It is a fairly useful form of Turtle pretty-printing, with the major exception that it does not include lists so far. Supporting lists in general would need quite a bit of work, as there are quite a few rabbit holes on the way to that goal.

abrokenjester · 2016-09-01T04:36:54Z

Looks good to me. This is already a useful improvement in itself and it nicely localizes the pretty-printing logic, so that for future extensions (handling lists etc) we need only tweak that part of the code.

abrokenjester · 2016-09-01T04:43:28Z

core/rio/turtle/src/main/java/org/eclipse/rdf4j/rio/turtle/TurtleWriter.java

+								// Cannot shorten this blank node as it is used as the object of a statement somewhere 
+								// so must be written in a non-anonymous form
+								canShortenSubjectBNode = false;
+							}


To be able to handle this kind of nested blank nodes (such as in anonymous objects, or list structures) we need to build up a dependency mapping of some sort. Or rather: if we have a case where the subject is the object of another statement, we can still shorten it if we first print that other statement.

Anyway, I realize there's more to it than that, so I'm fine with the implementation as-is for now. Just trying to think about future extension.
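
As a rough sketch of the first step towards such a dependency mapping, the following plain Java counts how often each blank node is used as an object and classifies it accordingly. Everything here (`BNodeShorteningSketch`, the `Form` enum, string-array triples) is hypothetical and is not the code in TurtleWriter. It only treats direct self-references as non-nestable and does not detect longer cycles.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical first-pass classification of blank node subjects.
class BNodeShorteningSketch {
    enum Form { ANONYMOUS_SUBJECT, NESTED_OBJECT, LABELLED }

    static boolean isBNode(String term) {
        return term.startsWith("_:");
    }

    static Map<String, Form> classify(List<String[]> triples) {
        // Count uses of each blank node in the object position.
        Map<String, Integer> objectUses = new HashMap<>();
        Set<String> selfReferencing = new HashSet<>();
        for (String[] t : triples) {
            if (isBNode(t[2])) {
                objectUses.merge(t[2], 1, Integer::sum);
                if (t[0].equals(t[2])) {
                    selfReferencing.add(t[2]);
                }
            }
        }
        Map<String, Form> result = new LinkedHashMap<>();
        for (String[] t : triples) {
            String subject = t[0];
            if (!isBNode(subject) || result.containsKey(subject)) {
                continue;
            }
            int uses = objectUses.getOrDefault(subject, 0);
            if (uses == 0) {
                // Never an object: can be written without a label.
                result.put(subject, Form.ANONYMOUS_SUBJECT);
            } else if (uses == 1 && !selfReferencing.contains(subject)) {
                // Used once elsewhere: candidate for nesting inside that statement.
                result.put(subject, Form.NESTED_OBJECT);
            } else {
                // Shared or self-referencing: must keep its label.
                result.put(subject, Form.LABELLED);
            }
        }
        return result;
    }
}
```

A writer using this would still have to order the output so that the referencing statement is printed first, which is where the real complexity lies.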

seralf · 2016-12-13

Thinking about future extensions, as you say: this seems somewhat similar to a naive framing algorithm. One could populate a "lazy" collection (a stream?) of subject references, then try to resolve them for formatting when needed.
At the same time, it would be possible to apply CURIEs or another short representation to the subject URIs, as they can improve readability for Turtle.
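
A minimal sketch of the CURIE idea in plain Java follows. The names `CurieSketch` and `abbreviate` are hypothetical, and the local-name check is a deliberately conservative assumption, much stricter than what the Turtle grammar actually allows. IRIs that do not match a known namespace, or whose remainder is not a simple name, are left in full `<...>` form.

```java
import java.util.Map;

// Hypothetical helper: shortens an IRI to prefix:local when a namespace matches.
class CurieSketch {
    static String abbreviate(String iri, Map<String, String> prefixToNamespace) {
        String bestPrefix = null;
        String bestNamespace = "";
        // Pick the longest namespace that the IRI starts with.
        for (Map.Entry<String, String> e : prefixToNamespace.entrySet()) {
            String namespace = e.getValue();
            if (iri.startsWith(namespace) && namespace.length() > bestNamespace.length()) {
                bestPrefix = e.getKey();
                bestNamespace = namespace;
            }
        }
        if (bestPrefix != null) {
            String local = iri.substring(bestNamespace.length());
            // Conservative subset of valid local names (assumption, see above).
            if (local.matches("[A-Za-z_][A-Za-z0-9_-]*")) {
                return bestPrefix + ":" + local;
            }
        }
        return "<" + iri + ">";
    }
}
```

Falling back to the full IRI whenever the local part looks unusual keeps the output parseable at the cost of some readability.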

abrokenjester · 2016-09-01T05:16:04Z

I'm happy with this as-is. Do you wish to merge now or do you have further things you want to put in?

ansell · 2016-09-01T05:26:01Z

No more things for now, merge as is.

ansell added 3 commits August 11, 2016 21:25

issue#207 : Initial primitive turtle pretty printing

a0f04b7

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Also add separate pretty writer test for trig

0860c0d

Note: trig pretty writer is failing for some tests still at this point Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix tests for TriGWriter

cc9c7b8

TriGWriter wasn't sending statements to be buffered correctly Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

abrokenjester reviewed Aug 12, 2016
View reviewed changes

ansell added 26 commits August 12, 2016 01:11

issue#207 : More work on regression testing before working on anonymo…

f9c8735

…us blank node notation Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix statement pattern that was matching twice accidentally

91395b4

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Create so many regression tests that other things start t…

2e5bee6

…o break Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix ambiguous varargs

fbae8f5

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix some code in AbstractRDFParser and RDFXMLParser so fu…

2491d01

…ll stack traces get through Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Switch pretty printing off by default as it isn't scalable

70eec6d

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Set pretty printing option properly in the two RDFXML wri…

50511d5

…ter tests for future use Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Support the use of blank nodes as contexts for RDF/JSON

73e953f

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix another issue where a triple pattern was matching too…

fb26540

… often Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix tests for parser/writers that don't preserve bnode id…

1a016d2

… by design or default Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Try not to trouble the problematic child that is RDF/XML …

671a16c

…with strange IRIs Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Start work on infrastructure for anonymous bnode writing

7d7dcde

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Add more regression tests

4d66718

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Just use a single handleStatementInternal

ea5c4c2

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Generalise to not just TriG but any format that supports …

aa27634

…contexts Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix regression tests to use the test parameters

2a1ae54

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Even Turtle is broken by the statements across contexts c…

fd76978

…heck, so isolate it further Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Reduce the scope of performance testing while unit testing

fe71b5e

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Push through the object bnode anonymisation boolean

28101f1

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Find further corner case where context bnode used as subj…

e7a6da7

…ect/object Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Add TriGPrettyPrinterWriterTest that fails because handle…

e4ff622

…Comment closes context Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix test to break legitimately

8e7b9a1

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Avoid closing context when writing comments to avoid brea…

7f25c52

…king pretty printing Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Add tests with namespace called before handleStatement an…

726bbc0

…d handleComment Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Add tests that may fail if not using pretty printer but n…

59259f1

…ot with it Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

ansell added 7 commits August 15, 2016 01:56

issue#207 : Remove reference to pretty printing in test names

7e17622

Preparing to move these tests to RDFWriterTest to verify the patterns don't fail parsing for any other parsers other than TriG Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Move tests to RDFWriterTest for all writer/parser combina…

cc2fa33

…tions The patterns shouldn't fail for any writer/parser combinations. Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Segregate context sensitive tests to isolate most of the …

92c69b9

…failing tests Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Fix TriGWriter logic for shortening contexts

5026dcc

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Switch StringReader/StringWriter to ByteArray versions fo…

8bdc59f

…r BinaryRDF Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Switch internal test variables back to private

b840f46

They needed to be protected while the tests were being bootstrapped in TriGPrettyWriterTest, but not after they were relocated to RDFWriterTest Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

issue#207 : Switch exNs to private

60b93ec

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>

ansell mentioned this pull request Aug 16, 2016

NPE with BackgroundTupleResult.handleClose in stacktrace #293

Closed

abrokenjester reviewed Sep 1, 2016
View reviewed changes

abrokenjester merged commit 58f87f1 into eclipse-rdf4j:master Sep 1, 2016

ansell commented Aug 12, 2016 • edited

abrokenjester Aug 12, 2016

ansell Aug 12, 2016

abrokenjester Aug 12, 2016

ansell commented Aug 15, 2016

abrokenjester commented Aug 16, 2016

ansell commented Aug 16, 2016

ansell commented Aug 16, 2016

ansell commented Aug 16, 2016

ansell commented Sep 1, 2016

abrokenjester commented Sep 1, 2016

abrokenjester Sep 1, 2016

seralf Dec 13, 2016

abrokenjester commented Sep 1, 2016

ansell commented Sep 1, 2016