StreamRDFWriter getWriterStream() #1296

Closed
AtesComp opened this issue May 7, 2022 · 6 comments

@AtesComp
Contributor

AtesComp commented May 7, 2022

StreamRDFWriter Class

The following issue came up using the Maven related release for Jena ARQ 4.4.0. I see it was just updated to 4.5.0.

In the StreamRDFWriter class, calling:
public static StreamRDF getWriterStream(OutputStream output, RDFFormat format)
causes a hang / lock-up condition. However, calling:
public static StreamRDF getWriterStream(OutputStream output, RDFFormat format, Context context)
with a null context does not hang the process.
...at least in the application I've developed as an extension to OpenRefine. See RDF Transform.

Reviewing the code doesn't appear to reveal any issue as getWriterStream(output, format) simply calls getWriterStream(output, format, null). Very odd. Perhaps a test pattern can help.
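One possible shape for such a test pattern is a small timeout harness that runs the suspect call on a worker thread and reports whether it returned. This is only a sketch using the Java standard library; `HangProbe` and `completesWithin` are hypothetical names, and the call under test (for example, a lambda wrapping `StreamRDFWriter.getWriterStream(output, format)`) would be supplied by the caller.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: checks whether a call returns within a time limit. */
final class HangProbe {
    private HangProbe() {}

    /**
     * Returns true if the call finished (normally or by throwing) within the
     * timeout, and false if it was still running when the timeout expired.
     */
    static boolean completesWithin(Callable<?> call, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-probe");
            t.setDaemon(true); // a stuck call must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(call);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException e) {
            return true; // the call threw, so it did not hang
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a truly stuck call may ignore it
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Running the two-argument and three-argument calls through the same probe, in a project that has only the intended Jena version on the classpath, would help separate a library problem from a dependency conflict.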

Additionally, there are some comment issues and / or possible code corrections for these functions. For:
getWriterStream(OutputStream output, RDFFormat format, Context context)
the comments declare:

@return StreamRDF, or null if format is not registered for streaming.

No mention of exceptions. However, the code clearly throws an exception:

   if ( x == null )
       throw new RiotException("Failed to find a writer factory for "+format) ;

As documented, a return null; would be enough.
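The difference matters to callers: a null check never fires if the method throws, and a try/catch is never entered if it returns null. The sketch below contrasts the two contracts with a stand-in registry. It is not Jena code; `LookupContracts`, its methods, and the registry contents are hypothetical, and `IllegalStateException` stands in for `RiotException`.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical stand-in for a format-to-writer-factory lookup. */
final class LookupContracts {
    private LookupContracts() {}

    // Hypothetical registry with a single streaming-capable format.
    private static final Map<String, Supplier<String>> REGISTRY =
            Map.of("ntriples", () -> "ntriples-writer");

    /** Contract as documented: null when the format is not registered. */
    static String findOrNull(String format) {
        Supplier<String> factory = REGISTRY.get(format);
        return factory == null ? null : factory.get();
    }

    /** Contract as implemented: an exception when the format is not registered. */
    static String findOrThrow(String format) {
        Supplier<String> factory = REGISTRY.get(format);
        if (factory == null)
            throw new IllegalStateException("Failed to find a writer factory for " + format);
        return factory.get();
    }

    /** Caller-side wrapper that tolerates either behaviour. */
    static Optional<String> findSafely(String format) {
        try {
            return Optional.ofNullable(findOrThrow(format));
        } catch (IllegalStateException e) {
            return Optional.empty();
        }
    }
}
```

Until the comment and the code agree, a caller that guards against both outcomes, as `findSafely` does, is the conservative choice.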

StreamRDF... Classes

Some light humor...
Why are these StreamRDF... classes in ...riot/system/ and not in .../riot/writer/stream/?
And what's up with ...riot/system/stream only holding Locator... classes?

Related Documentation

There is a lot of good resource material for RDFStream that could use some attention. The documentation on RIOT streaming (see Working with RDF Streams in Apache Jena) needs some luv'n to document the access to and use of the various stream classes. Particularly, the use of RDFFormat vs Lang (RDFFormat seems to be the new hotness).

Most of the documentation is centered around using datasets, models, and graphs. Fair enough. However, there are exigent use cases for processing large RDF datasets where the "pretty" printers just don't scale...as documented. An iterative, streaming service is needed without first loading up a structure (i.e., duplicating the data) whether "in memory" or "persistent". Sequentially reading in non-RDF data, processing discrete units to an RDF compliant form, and writing (preferably in BLOCKS form) directly to an RDF file (or to a repository) is more performant...even if there are some duplicate results.
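To make the read, convert, write loop concrete, here is a minimal sketch in plain Java that does not use Jena. It assumes the input is one `subject,predicate,value` row per line, that the first two cells are safe to append to an IRI, and it writes N-Triples rather than a BLOCKS format to keep the example short. The class name, the `http://example.org/` namespace, and the row layout are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical converter: one input row in, one N-Triples line out, nothing retained. */
final class StreamingRowsToNTriples {
    private StreamingRowsToNTriples() {}

    /** Hypothetical namespace used to mint IRIs from the first two cells. */
    static final String BASE = "http://example.org/";

    /** Escapes the characters that must be escaped inside an N-Triples string literal. */
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Converts rows to triples one at a time and returns the number of triples written. */
    static long convert(Reader in, Writer out) {
        try {
            BufferedReader reader = new BufferedReader(in);
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cells = line.split(",", 3);
                if (cells.length < 3) continue; // skip rows that lack three cells
                out.write("<" + BASE + cells[0] + "> <" + BASE + cells[1] + "> \""
                        + escapeLiteral(cells[2]) + "\" .\n");
                count++;
            }
            out.flush();
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Memory use stays flat regardless of input size because each row is written as soon as it is converted; the trade-off, as noted above, is that duplicates are not detected.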

Hmmm, the C coded Serd library seems to be very performant, small, and converts to several formats. Could the code be reviewed and converted to Java to help speed this kind of processing?

Conclusion

Class issues...documentation issues...a little frustration. I do plan on spending some time contributing to this effort...at least the documentation part.

Thanks for Jena.

@afs
Member

afs commented May 8, 2022

Why are these StreamRDF... classes in ...riot/system/ and not in .../riot/writer/stream/?

Different meanings of "stream".

org.apache.jena.riot.system.stream is in support of the stream manager - that is IO streams.

org.apache.jena.riot.system includes StreamRDF - a stream of triples/quads/prefixes.

@afs
Member

afs commented May 8, 2022

RIOT has its own tokenizer and parsers - the combination is x2 to x4 faster. The tokenizer is the performance bottleneck.

The fastest parsers in Jena run at up to 1m triples/second on binary RDF Thrift. RDF Protobuf is slightly less than 10% slower (making protobuf work for open-ended streams of input seems to create an extra object, and at 1 microsecond a triple this is observable).

The performance of Turtle and N-Triples etc. is approximately 240 kTPS and 400 kTPS respectively. The only difference is the N-Triples grammar parser being much simpler than all the "if"s for Turtle.

All these are a minimum of x4 faster than JavaCC.

All parsing performance is sensitive to the hardware used, so these figures are relative. (They were measured on an old Core i5 with a SATA SSD, which has been used consistently for measurements over time.)

Java has to convert to Java chars at some point which is a copy. In fact, it is faster to convert large buffers using Java built-in UTF-8 handling than to try to do one less copy but of each RDF term. Java checks all input for validity of UTF-8.
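As an illustration of the bulk-conversion point, the sketch below decodes a whole byte block in one call with the JDK's UTF-8 decoder and only then walks the resulting chars. It is not RIOT's tokenizer; `BulkUtf8Decode` and its methods are hypothetical, and the whitespace token count is just a placeholder for real tokenizing.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical example: decode a large block once, then scan chars. */
final class BulkUtf8Decode {
    private BulkUtf8Decode() {}

    /** Decodes the whole block in one pass and rejects invalid UTF-8. */
    static CharBuffer decodeBlock(byte[] block) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(block));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid UTF-8", e);
        }
    }

    /** Counts whitespace-separated tokens in already-decoded chars. */
    static int countTokens(CharSequence chars) {
        int tokens = 0;
        boolean inToken = false;
        for (int i = 0; i < chars.length(); i++) {
            boolean ws = Character.isWhitespace(chars.charAt(i));
            if (!ws && !inToken) tokens++; // a token starts at each whitespace-to-text transition
            inToken = !ws;
        }
        return tokens;
    }
}
```

The alternative of decoding each RDF term separately would avoid one copy, but it pays the decoder setup and validation cost once per term instead of once per block.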

If you'd like to improve the tokenizer and provide a PR, that would be great.

@afs
Member

afs commented May 8, 2022

@AtesComp Could you provide a test case to illustrate the issue with StreamRDFWriter.getWriterStream? There is a lot in rdf-transform that may be influencing the issue.

@AtesComp
Contributor Author

AtesComp commented May 8, 2022

Thanks, Andy. Of course, I should have written .../riot/io/stream, or just .../riot/stream, instead of .../riot/writer/stream. I was just locked onto writing out RDF export files. The system directory just didn't click with me.

As for the real issue, I'm noting that the OpenRefine 3.5.2 version uses an older 3.x Jena ARQ that is squashing my dependency on the 4.5.0 Jena ARQ Maven release. I looked at many ways to force it to use the newer jars but to no avail. Since my code is just a lowly red-headed stepchild extension to OpenRefine, I don't have much say in the matter. I'm fairly sure the getWriterStream issue is due to the older jar.

However, the OpenRefine 3.6-SNAPSHOT is up-to-date! So, now, if I can just get them to make an official release, all will be good. Well, mostly. The Jena documentation needs updating. I can live with it for now.

@afs
Member

afs commented May 8, 2022

So this issue "StreamRDFWriter getWriterStream()" can be closed?

@AtesComp
Contributor Author

AtesComp commented May 8, 2022 via email

afs closed this as completed May 8, 2022