StreamRDFWriter getWriterStream() #1296
Comments
Different meanings of "stream". org.apache.jena.riot.system.stream is in support of the stream manager, that is, IO streams. org.apache.jena.riot.system includes StreamRDF, a stream of triples/quads/prefixes.
RIOT has its own tokenizer and parsers; the combination is x2 to x4 faster. The tokenizer is the performance bottleneck.
The fastest parsers in Jena run at up to 1m triples/second on binary RDF Thrift. RDF Protobuf is slightly less than 10% slower (making protobuf work for open-ended streams of input seems to create an extra object, and at 1 microsecond a triple this is observable). The performance of Turtle and N-Triples is approximately 240 kTPS and 400 kTPS respectively. The only difference is that the N-Triples grammar parser is much simpler than all the "if"s for Turtle. All of these are a minimum of x4 faster than JavaCC.
All parsing performance is sensitive to the hardware used, so these figures are relative. (They were measured on an old Core i5 with a SATA SSD, which has been used consistently for measurements over time.)
Java has to convert to Java chars at some point, which is a copy. In fact, it is faster to convert large buffers using Java's built-in UTF-8 handling than to try to do one less copy for each RDF term. Java checks all input for UTF-8 validity.
If you'd like to improve the tokenizer and provide a PR, that would be great.
@AtesComp Could you provide a test case to illustrate the issue with
Thanks, Andy. Of course, I should have written
As for the real issue, I'm noting that OpenRefine 3.5.2 uses an older 3.x Jena ARQ that is squashing my dependency on the Jena ARQ 4.5.0 Maven release. I looked at many ways to force it to use the newer jars, but to no avail. Since my code is just a lowly red-headed stepchild extension to OpenRefine, I don't have much say in the matter. I'm fairly sure the
However, the OpenRefine 3.6-SNAPSHOT is up to date! So now, if I can just get them to make an official release, all will be good. Well, mostly. The Jena documentation needs updating. I can live with it for now.
So this issue "StreamRDFWriter getWriterStream()" can be closed?
Yep. I'll need to retest when a new official OpenRefine version comes out. It "looks" like it should work.
StreamRDFWriter Class
The following issue came up using the Maven related release for Jena ARQ 4.4.0. I see it was just updated to 4.5.0.
In the StreamRDFWriter class, calling:
public static StreamRDF getWriterStream(OutputStream output, RDFFormat format)
causes a hung / lock up condition. However, calling:
public static StreamRDF getWriterStream(OutputStream output, RDFFormat format, Context context)
with a null context does not hang the process.
...at least in the application I've developed as an extension to OpenRefine. See RDF Transform.
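One way to turn a hang like this into a reproducible test is to run the suspect call on a worker thread and fail if it does not return within a time limit. The following is only a minimal sketch: the class HangCheck and the method completesWithin are hypothetical names, and stand-in tasks are used in place of the real Jena call so that it runs without Jena on the classpath.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HangCheck {
    /**
     * Runs the task on a daemon worker thread.
     * Returns true if it finishes within the timeout, false if it is still running.
     */
    public static boolean completesWithin(Runnable task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-check");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupt the worker
            return false;
        } catch (ExecutionException e) {
            // The task threw, so it did not hang; surface the failure to the caller.
            throw new IllegalStateException("Task failed instead of hanging", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // In a real test the task would be the suspect call, for example:
        //   () -> StreamRDFWriter.getWriterStream(output, format)
        // The two tasks below are stand-ins (assumptions) for a healthy and a hung call.
        Runnable quick = () -> { };
        Runnable stuck = () -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("quick task finished: " + completesWithin(quick, 1_000));
        System.out.println("stuck task finished: " + completesWithin(stuck, 200));
    }
}
```

Running the two-argument and three-argument overloads through the same check, inside and outside the host application, should show whether the hang comes from the call itself or from the environment it runs in.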
Reviewing the code doesn't appear to reveal any issue, as getWriterStream(output, format) simply calls getWriterStream(output, format, null). Very odd. Perhaps a test pattern can help.
Additionally, there are some comment issues and / or possible code corrections for these functions. For:
getWriterStream(OutputStream output, RDFFormat format, Context context)
the comments declare:
No mention of exceptions. However, the code clearly throws an exception:
As documented, a return null; would be enough.
StreamRDF... Classes
Some light humor...
Why are these StreamRDF... classes in ...riot/system/ and not in .../riot/writer/stream/?
And what's up with ...riot/system/stream only holding Locator... classes?
Related Documentation
There is a lot of good resource material for StreamRDF that could use some attention. The documentation on RIOT streaming (see Working with RDF Streams in Apache Jena) needs some luv'n to document the access to and use of the various stream classes, particularly the use of RDFFormat vs Lang (RDFFormat seems to be the new hotness).
Most of the documentation is centered around using datasets, models, and graphs. Fair enough. However, there are exigent use cases for processing large RDF datasets where the "pretty" printers just don't scale...as documented. An iterative, streaming service is needed without first loading up a structure (i.e., duplicating the data), whether "in memory" or "persistent". Sequentially reading in non-RDF data, processing discrete units to an RDF compliant form, and writing (preferably in BLOCKS form) directly to an RDF file (or to a repository) is more performant...even if there are some duplicative results.
Hmmm, the C coded Serd library seems to be very performant, small, and converts to several formats. Could the code be reviewed and converted to Java to help speed this kind of processing?
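The read, convert, write pattern described above can be sketched without any in-memory model. This is an illustration only, not Jena code: the class StreamingConvert, the "id,name" row layout, and the example.org IRIs are all assumptions, and the output is hand-written N-Triples rather than a BLOCKS format. In Jena itself the sink would be a StreamRDF obtained from StreamRDFWriter.getWriterStream instead of a plain Writer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class StreamingConvert {
    /** Escapes backslash, double quote, LF and CR for use inside an N-Triples literal. */
    public static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Reads "id,name" rows one at a time and writes one triple per row as soon as it is read.
     * Only the current row is held in memory. Returns the number of triples written.
     * Assumption: ids are already safe to embed in an IRI; no IRI validation is done here.
     */
    public static long convert(BufferedReader rows, Writer out) {
        long count = 0;
        try {
            String line;
            while ((line = rows.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // skip rows that do not match the assumed layout
                }
                String id = line.substring(0, comma).trim();
                String name = line.substring(comma + 1).trim();
                out.write("<http://example.org/item/" + id + "> "
                        + "<http://example.org/name> "
                        + "\"" + escapeLiteral(name) + "\" .\n");
                count++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        BufferedReader rows = new BufferedReader(new StringReader("1,Alice\n2,Bob \"Bobby\"\n"));
        StringWriter out = new StringWriter();
        long n = convert(rows, out);
        System.out.print(out);
        System.out.println(n + " triples written");
    }
}
```

Because each row is written as soon as it is converted, memory use stays flat regardless of input size; the trade-off, as noted above, is that duplicate triples are not detected.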
Conclusion
Class issues...documentation issues...a little frustration. I do plan on spending some time contributing to this effort...at least the documentation part.
Thanks for Jena.