Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues/#207 turtle pretty print #290

Merged

Conversation

ansell
Copy link
Contributor

@ansell ansell commented Aug 12, 2016

This PR addresses GitHub issue: #207

Briefly describe the changes proposed in this PR:

  • Extends the current TurtleWriter and TriGWriter to support BasicWriterSettings.PRETTY_PRINT
  • Buffers Statements before writing them out for TurtleWriter and TriGWriter
  • Added many regression tests in RDFWriterTest
  • Fixed many regression test failures in other RDFParser implementations
  • Support blank node contexts in RDF/JSON based on _: prefix
  • Switch BasicWriterSettings.PRETTY_PRINT to default to false as it should only be switched on when appropriate

Make sure you've followed the Contributor Guidelines. In particular (please tick to indicate you've taken care of it):

  • RDF4J code formatting has been applied
  • tests are included
  • all tests succeed

Note that the current structure is not designed to allow pretty-printing of nested anonymous structures, and in particular, does not support pretty printed lists/collections. It is a medium ground between almost no pretty printing, and something that looks simpler to verify compared to arbitrary N-Triples/N-Quads.

Note, this is targeted as master, so does not involve the 2.0 release.

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Note: trig pretty writer is failing for some tests still at this point

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
TriGWriter wasn't sending statements to be buffered correctly

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
@@ -100,6 +100,32 @@ public void handleStatement(Statement st)
throw new RuntimeException("Document writing has not yet been started");
}

// If we are pretty-printing, all writing is buffered until endRDF is
// called
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid OoM errors, perhaps we can use a configurable buffer size instead of just waiting for endRDF; similar to what is done in BufferedGroupingRDFHandler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trouble with that is that it isn't a full solution and will corrupt some datasets that rely heavily on blank node use if we anonymise any blank nodes at all. For example, there are parts of the OWL to RDF translation which insist on the use of blank nodes and not IRIs which could be corrupted by a buffering/anonymising solution depending on how the statements are delivered. The pure streaming approach and the full buffering approach will never corrupt datasets that don't OOM.

There may need to be another option to switch on/off anonymising blank nodes that could be used to specify whether local grouping/buffering occurred rather than a full buffer.

One issue I have noticed is that Pretty-Printing setting is on by default, and we probably want to switch that in this pull request as I don't see pretty-printing as a production feature for any scalable system, even for the other formats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, hadn't thought of that.

…us blank node notation

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…o break

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…ll stack traces get through

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…ter tests for future use

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
… often

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
… by design or default

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…with strange IRIs

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…contexts

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…heck, so isolate it further

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…ect/object

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
As they both close the active context, they will both interfere with pretty-printing in some corner cases and need to check the spec to make sure they need to have this effect.

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…Comment closes context

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…king pretty printing

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…d handleComment

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…ot with it

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Preparing to move these tests to RDFWriterTest to verify the patterns don't fail parsing for any other parsers other than TriG

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…tions

The patterns shouldn't fail for any writer/parser combinations.

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…failing tests

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
…r BinaryRDF

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
They needed to be protected while the tests were being bootstrapped in TriGPrettyWriterTest, but not after they were relocated to RDFWriterTest

Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
Signed-off-by: Peter Ansell <p_ansell@yahoo.com>
@ansell
Copy link
Contributor Author

ansell commented Aug 15, 2016

@jeenbroekstra and others, this is ready for more review (after the 2.0 release is out the door). Many regression tests have been added to verify round tripping for the blank node anonymisation for Turtle/TriG, some of which also made visible regressions in the other parsers which have also been fixed as part of this pull request. No attempt was made to pretty-print list structures so far, but this could still be useful on its own.

@abrokenjester
Copy link
Contributor

Hudson reports a test failure on this PR. See https://hudson.eclipse.org/rdf4j/job/rdf4j-verify-pr/10/

@ansell
Copy link
Contributor Author

ansell commented Aug 16, 2016

I can't replicate that failure locally and I didn't modify any of the query result io code in this pull request so this pull request may not be the cause there.

It is hypothetically possible that it may have been related to a previous pull request of mine that did change query resultio code, but the exception is intermittent and NPE isn't the easiest thing to diagnose with multi-threading code :( I thought we had fixed the NPE's from BackgroundQueryResult at some stage in the past as they haven't occurred for a while.

....
[INFO] RDF4J compliance tests ............................. SUCCESS [  0.029 s]
[INFO] RDF4J Model compliance test ........................ SUCCESS [  8.093 s]
[INFO] RDF4J Query Result IO compliance tests ............. SUCCESS [  9.022 s]
[INFO] RDF4J Rio compliance tests ......................... SUCCESS [01:06 min]
[INFO] RDF4J HTTP server compliance tests ................. SUCCESS [ 14.270 s]
[INFO] RDF4J SAIL and Repository compliance test .......... SUCCESS [09:34 min]
[INFO] RDF4J SeRQL query parser compliance tests .......... SUCCESS [  4.501 s]
[INFO] RDF4J SPARQL query parser compliance tests ......... SUCCESS [02:25 min]
[INFO] RDF4J GeoSPARQL compliance tests ................... SUCCESS [  1.880 s]
[INFO] RDF4J BOM .......................................... SUCCESS [  0.006 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19:47 min
[INFO] Finished at: 2016-08-17T09:21:31+10:00
[INFO] Final Memory: 127M/687M
[INFO] ------------------------------------------------------------------------

@ansell
Copy link
Contributor Author

ansell commented Aug 16, 2016

One bug report from sesame history that references handleClose and NPE is:

https://openrdf.atlassian.net/browse/SES-1978

@ansell
Copy link
Contributor Author

ansell commented Aug 16, 2016

I also reported a very similar stack trace during the migration to the new httpclient as part of the pull request review:

https://bitbucket.org/openrdf/sesame/pull-requests/134/update-to-apache-httpcomponents-httpclient/diff

@ansell
Copy link
Contributor Author

ansell commented Sep 1, 2016

This is good to be reviewed/merged from my point of view, as the two intermittent bugs that my testing (and hudson) has been finding are being dealt with separately and are not only happening on this branch.

It is a fairly useful form of Turtle pretty-printing with the major exception that it does not include lists so far, and would need quite a bit of work to support lists in general as there are quite a few rabbit holes to encounter on the way to that goal.

@abrokenjester
Copy link
Contributor

Looks good to me. This is already a useful improvement in itself and it nicely localizes the pretty-printing logic, so that for future extensions (handling lists etc) we need only tweak that part of the code.

// Cannot shorten this blank node as it is used as the object of a statement somewhere
// so must be written in a non-anonymous form
canShortenSubjectBNode = false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be able to handle this kind of nested blank nodes (such as in anonymous objects, or list structures) we need to build up a dependency mapping of some sort. Or rather: if we have a case where the subject is the object of another statement, we can still shorten it if we first print that other statement.

Anyway, I realize there's more to it than that, so I'm fine with the implementation as-is for now. Just trying to think about future extension.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking around future extensions, as you say: it seems to me someway similar to a naive framing algorithm: one could populate a "lazy" collection (stream?) of subject references, then try to resolve them for formatting when needed.
At the same time it could be possible to apply CURIEs or other short representation on the subject uris, as they can improve readability for turtle.

@abrokenjester
Copy link
Contributor

I'm happy with this as-is. Do you wish to merge now or do you have further things you want to put in?

@ansell
Copy link
Contributor Author

ansell commented Sep 1, 2016

No more things for now, merge as is.

@abrokenjester abrokenjester merged commit 58f87f1 into eclipse-rdf4j:master Sep 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants