StreamRDFWriter getWriterStream() #1296

AtesComp · 2022-05-07T23:15:30Z

StreamRDFWriter Class

The following issue came up using the Maven related release for Jena ARQ 4.4.0. I see it was just updated to 4.5.0.

In the StreamRDFWriter class, calling:
public static StreamRDF getWriterStream(OutputStream output, RDFFormat format)
causes a hung / lock up condition. However, calling:
public static StreamRDF getWriterStream(OutputStream output, RDFFormat format, Context context)
with a null context does not hang the process.
...at least in the application I've developed as an extension to OpenRefine. See RDF Transform.

Reviewing the code doesn't appear to reveal any issue as getWriterStream(output, format) simply calls getWriterStream(output, format, null). Very odd. Perhaps a test pattern can help.

Additionally, there are some comment issues and / or possible code corrections for these functions. For:
getWriterStream(OutputStream output, RDFFormat format, Context context)
the comments declare:

@return StreamRDF, or null if format is not registered for streaming.

No mention of exceptions. However, the code clearly throws an exception:

   if ( x == null )
       throw new RiotException("Failed to find a writer factory for "+format) ;

As documented, a return null; would be enough.

StreamRDF... Classes

Some light humor...
Why are these StreamRDF... classes in ...riot/system/ and not in .../riot/writer/stream/?
And what's up with ...riot/system/stream only holding Locator... classes?

Related Documentation

There is a lot of good resource material for StreamRDF that could use some attention. The documentation on RIOT streaming (see Working with RDF Streams in Apache Jena) needs some luv'n to document the access to and use of the various stream classes. Particularly, the use of RDFFormat vs Lang (RDFFormat seems to be the new hotness).

Most of the documentation is centered around using datasets, models, and graphs. Fair enough. However, there are exigent use cases for processing large RDF datasets where the "pretty" printers just don't scale...as documented. An iterative, streaming service is needed without first loading up a structure (i.e., duplicating the data) whether "in memory" or "persistent". Sequentially reading in non-RDF data, processing discrete units to an RDF compliant form, and writing (preferably in BLOCKS form) directly to an RDF file (or to a repository) is more performant...even if there are some duplicate results.
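The streaming pattern described above can be sketched roughly as follows. This is a minimal illustration using only the Java standard library, not Jena's StreamRDF API, and it writes flat N-Triples lines rather than the BLOCKS Turtle form. The class name NTriplesRowWriter, its methods, and the example.org IRIs are hypothetical, and IRIs are assumed to be already valid.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper for illustration only; not part of Jena.
final class NTriplesRowWriter {
    private final Writer out;

    NTriplesRowWriter(Writer out) {
        this.out = out;
    }

    /** Escapes backslash, double quote, LF and CR, as N-Triples requires inside a literal. */
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Writes one triple with a plain string literal as object.
     * Each triple goes straight to the Writer; no graph or model is built first.
     * Assumption: both IRIs are already valid and need no escaping.
     */
    void writeLiteralTriple(String subjectIri, String predicateIri, String literal)
            throws IOException {
        out.write("<" + subjectIri + "> <" + predicateIri + "> \""
                + escapeLiteral(literal) + "\" .\n");
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        NTriplesRowWriter writer = new NTriplesRowWriter(sw);
        // In a real converter these values would come from the non-RDF source, row by row.
        writer.writeLiteralTriple("http://example.org/row/1", "http://example.org/name", "Ada");
        writer.writeLiteralTriple("http://example.org/row/2", "http://example.org/name", "Grace");
        System.out.print(sw);
    }
}
```

Because every row is written as soon as it is converted, memory use stays flat regardless of input size. The trade-off is the one noted above: nothing deduplicates triples that occur more than once.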

Hmmm, the C coded Serd library seems to be very performant, small, and converts to several formats. Could the code be reviewed and converted to Java to help speed this kind of processing?

Conclusion

Class issues...documentation issues...a little frustration. I do plan on spending some time contributing to this effort...at least the documentation part.

Thanks for Jena.


afs · 2022-05-08T12:35:04Z

Why are these StreamRDF... classes in ...riot/system/ and not in .../riot/writer/stream/?

Different meanings of "stream".

org.apache.jena.riot.system.stream is in support of the stream manager - that is IO streams.

org.apache.jena.riot.system includes StreamRDF - a stream of triples/quads/prefixes.

afs · 2022-05-08T12:44:41Z

RIOT has its own tokenizer and parsers - the combination is x2 to x4 faster. The tokenizer is the performance bottleneck.

The fastest parsers in Jena run at up to 1m triples/second on binary RDF Thrift. RDF Protobuf is slightly less than 10% slower (making protobuf work for open-ended streams of input seems to create an extra object and at 1 microsecond per triple this is observable).

The performance of Turtle and N-Triples etc is approximately 240 kTPS and 400 kTPS respectively. The only difference is the grammar parser for N-Triples being much simpler than all the "if"s for Turtle.

All these are a minimum of x4 faster than Javacc.

All parsing performance is sensitive to the hardware used, so these figures are relative. (They are from an old Core i5 with a SATA SSD, which has been used consistently for measurements over time.)

Java has to convert to Java chars at some point which is a copy. In fact, it is faster to convert large buffers using Java built-in UTF-8 handling than to try to do one less copy but of each RDF term. Java checks all input for validity of UTF-8.
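A rough sketch of the bulk-conversion approach, using only standard JDK classes: bytes are decoded in large chunks through a CharsetDecoder set to report malformed input instead of replacing it. This is not RIOT's actual code; the class name BulkUtf8Decode and the 64K buffer size are arbitrary choices for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not taken from RIOT.
final class BulkUtf8Decode {

    /**
     * Decodes the whole stream as UTF-8 in large chunks.
     * Malformed input raises a CharacterCodingException (an IOException)
     * rather than being silently replaced.
     */
    static String decodeAll(InputStream in) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[64 * 1024]; // arbitrary size, chosen for illustration
        try (Reader reader = new InputStreamReader(in, decoder)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "<http://example.org/caf\u00e9> ."
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAll(new ByteArrayInputStream(bytes)));
    }
}
```

A tokenizer would then work on the decoded chars, so the one copy is paid per buffer rather than per RDF term.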

If you'd like to improve the tokenizer and provide a PR, that would be great.

afs · 2022-05-08T12:47:34Z

@AtesComp Could you provide a test case to illustrate the issue with StreamRDFWriter.getWriterStream? There is a lot in rdf-transform that may be influencing the issue.

AtesComp · 2022-05-08T15:06:24Z

Thanks, Andy. Of course, I should have written .../riot/io/stream, or just .../riot/stream, instead of .../riot/writer/stream. I was just locked onto writing out RDF export files. The system directory just didn't click with me.

As for the real issue, I'm noting that the OpenRefine 3.5.2 version uses an older 3.x Jena ARQ that is squashing my dependency on the 4.5.0 Jena ARQ Maven release. I looked at many ways to force it to use the newer jars but to no avail. Since my code is just a lowly red-headed stepchild extension to OpenRefine, I don't have much say in the matter. I'm fairly sure the getWriterStream issue is due to the older jar.

However, the OpenRefine 3.6-SNAPSHOT is up-to-date! So, now, if I can just get them to make an official release, all will be good. Well, mostly. The Jena documentation needs updating. I can live with it for now.

afs · 2022-05-08T15:11:18Z

So this issue "StreamRDFWriter getWriterStream()" can be closed?

AtesComp · 2022-05-08T15:21:26Z

Yep. I'll need to retest when a new official OpenRefine version comes out. It "looks" like it should work.

afs closed this as completed May 8, 2022
