need streaming & async RDF readers/writers (marshallers/unmarshallers) #211
This reminds me how "fun" it was to work with streaming XML parsers in a previous project. For example, you cannot use XPath anymore, nor other standard XML technologies... Also, the kind of applications being written is in a totally different category, because you are forcing yourself to never aggregate the triples in a graph (or you defeat the purpose). I am still not sure what kind of applications one would write, and more importantly: how. Anyway, I am not saying this is useless: feel free to work on that! Just note that the current Reader/Writer framework is not designed to make use of that, and you will have to provide a totally different abstraction. Finally, you might want to rephrase your use-case in terms similar to what you can find in https://github.com/travisbrown/syzygist.
"you cannot use XPath anymore, nor other standard XML technologies". Nothing should force the end user to stay in streaming mode. A graph is just a simple fold away. But the parsers should not determine whether a graph will be used in the end, or whether the whole thing is just going to be looked at triple by triple in a big-data scenario, or if things are just being sent in batches to an indexer, or indeed if they are just being converted to another format and sent along into another pipe. Btw, there is no need to write a streaming XML RDF parser: we can use Jena's, as I did in the rww project - not saying Iteratees are the way to go, just pointing to that as an example that is already written. As far as which library to use for streaming parsers, I am not yet sure. This is a good thread to list some of them with pros and cons, or by linking this to a wiki page.
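To illustrate the "a graph is just a simple fold away" point, here is a minimal sketch. The `Triple` case class, the line-per-triple "parser", and the use of a plain `Set` as a stand-in graph are all hypothetical, not banana-rdf's actual types:

```scala
// Hypothetical stand-ins for illustration only (not banana-rdf's API).
final case class Triple(s: String, p: String, o: String)

// A toy streaming "parser": one whitespace-separated triple per line,
// yielded lazily so the consumer decides what happens next.
def parseTriples(lines: Iterator[String]): Iterator[Triple] =
  lines.map { line =>
    val parts = line.split(" ", 3)
    Triple(parts(0), parts(1), parts(2))
  }

// "A graph is just a simple fold away": aggregate only if you need to.
// Consumers in a big-data scenario would skip this and consume the
// iterator triple by triple instead.
def toGraph(triples: Iterator[Triple]): Set[Triple] =
  triples.foldLeft(Set.empty[Triple])(_ + _)
```

The parser stays agnostic: the same `Iterator[Triple]` can be folded into a graph, batched to an indexer, or re-serialised on the fly.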
Of course, but then people must realize that in that case, you pay a performance penalty because you are async under the hood, while it was not necessary. If you are not sure why/where, Google is your friend. Anyway, implementing a synchronous parser on top of an asynchronous one is far from free. And the whole thing becomes much more difficult to maintain, especially if you don't want to (or just cannot) rely on a framework to make things faster. One must not conflate scalable and fast. And one must understand that there are many ways to be scalable. In my applications, I do not need non-blocking parsers. But what really worries me is that you seem to confuse many different concepts: blocking, non-blocking, async, sync, reactive, stream, laziness.
I know... I was just taking that as an example to explain the trade-offs. So far, you are the only one really interested in such parsers, so I wanted to make clear that there are caveats that people should not ignore.
As far as I know, there is no such framework today. And I guess you want something that will work with scala-js?
I think you have a very loose definition of stream... By the way, your current
If the consumer of the … In any case, I think I have given enough remarks on that thread. But I am very curious about what you will come up with.
Of course I know that my NTriples parser is blocking. I was giving it as an example of something that streamed results but still blocked on a thread. And the Iterator's blocking behaviour does not add a problem over and above the
CPUs are much faster than IO, and parsing NTriples is very easy, so on most devices you'd not have problems with the IO accumulating, especially if the IO is coming over the internet. (I'll refer people to Odersky's 2nd course on reactive programming on Coursera for details about the difference in speed between the fastest CPU operation and sending a message over the internet: it's the difference between 1 second and 4 years - at least for the reception of the first packet.) But what is sure is that the NTriples parser, with its blocking IO, does not solve the JavaScript problem. There you need a non-blocking, streaming parser, as explained for example by @RubenVerborgh in his article Lightning Fast RDF in JavaScript.
That's a fallacy: you don't know what the consumer is doing. It might well be that the consumer needs to perform multiple IO operations for every triple it receives. Hence, the consumer can be slower than the IO input stream, because the consumer is not necessarily CPU-bound. The fact that N-Triples requires little CPU to parse is not relevant; parsing is just an intermediary between IO and some (slow?) consumer.
@RubenVerborgh wrote:
I agree. The point of that part of the argument was that there is an advantage of a streaming parser that returns results one by one over one that returns a whole graph when done, at least as far as memory management goes. I was only arguing about the advantages of streaming the results. Currently our API for readers is:

```scala
/** RDF readers for a given syntax. */
trait RDFReader[Rdf <: RDF, M[_], +S] {

  /** Tries parsing an RDF Graph from an [[java.io.InputStream]] and a base URI.
    *
    * If the encoding for the input is known, prefer the Reader
    * version of this function.
    * @param base the base URI to use, to resolve relative URLs found in the InputStream
    */
  def read(is: InputStream, base: String): M[Rdf#Graph]

  /** Tries parsing an RDF Graph from a [[java.io.Reader]] and a base URI.
    * @param base the base URI to use, to resolve relative URLs found in the InputStream
    */
  def read(reader: Reader, base: String): M[Rdf#Graph]
}
```

To solve the problem you are speaking about, one needs to take back-pressure into account, which is why I was recently studying the akka reactive streams framework, which does take that into account (their code is being developed on the 2.3-dev branch). So there I agree with Alex. Akka is also being ported to scala-js, so I really look forward to seeing what is going on there this week at Scala eXchange in London, where I will be speaking. But those libraries are still very alpha, while your code, @RubenVerborgh, has been in production for a few years now. We here are still very new to the world of JavaScript, so here are some questions I have for you that could help us - I hope they don't sound too stupid. :-)
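For contrast with the graph-at-once `RDFReader` above, a streaming counterpart could yield triples lazily and leave aggregation to the caller. Everything below is a hedged sketch: `StreamingRDFReader`, `ToyNTriplesReader`, and the one-triple-per-line "parsing" are hypothetical stand-ins, not banana-rdf code:

```scala
import java.io.{BufferedReader, Reader}

// Hypothetical triple representation for illustration only.
final case class Triple(s: String, p: String, o: String)

// Hypothetical streaming signature (not banana-rdf's actual API):
// instead of materialising an M[Rdf#Graph], the parser exposes the
// triples as a lazy Iterator, so the caller decides whether to fold
// them into a graph, batch them to an indexer, or re-serialise them.
trait StreamingRDFReader {
  def readTriples(reader: Reader, base: String): Iterator[Triple]
}

// Toy line-oriented implementation standing in for a real N-Triples
// parser; `base` is ignored here since the toy does no URI resolution.
object ToyNTriplesReader extends StreamingRDFReader {
  def readTriples(reader: Reader, base: String): Iterator[Triple] = {
    val br = new BufferedReader(reader)
    Iterator
      .continually(br.readLine())
      .takeWhile(_ != null)
      .map { line =>
        val parts = line.split(" ", 3)
        Triple(parts(0), parts(1), parts(2))
      }
  }
}
```

Note that a plain `Iterator` is still pull-based and blocking; a non-blocking variant would need a different abstraction (callbacks, or a reactive-streams `Publisher`) to deal with back-pressure.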
First of all, as you know, JavaScript is bigger than just browsers. So on platforms such as Node.js, streaming from files and over HTTP makes a lot of sense. For the browser, jQuery is an abstraction that indeed gives you the whole request. But if we look at the API offered by browsers, it's much more straightforward to get partial input. The
Not really. Again, N3.js was foremost created for Node.js, where there is no UI but only an event loop (and the UI normally runs in a different thread in browsers anyway). The idea of N3.js is that you can give it data chunks of arbitrary length, and that the parser will always go as far as possible.

The reasoning behind this has more to do with availability of data than with returning fast to the event loop; i.e., it just parses as fast as possible, without waiting for the whole input to be available. I wouldn't do it to save cycles in the event loop, as you need to do the parsing anyway. Sometimes you just have part of the data, and you want to “parse it away” as soon as possible, so that you can already start acting on the parsed triples while you are waiting for more data to arrive. It's just faster and far more logical. Furthermore, this comes in handy if you want to parse files that are larger than your main memory.

Non-blocking is what you get by default in Node.js. You don't actively wait until a file has been read into memory; instead, you process chunks as they arrive. This non-blocking behavior is implemented through asynchronous callbacks.
Streams with protection against back- and forward-pressure are horribly slow; don't use them. Let producers call a method on your parser to send data, and allow consumers to give you a callback through which you pass data when it arrives.
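The push-based design described here (producers call a method to feed data, consumers register a callback) might be sketched in Scala roughly as follows. `PushParser`, its newline-delimited chunk handling, and the `Triple` type are all assumptions for illustration, not N3.js's actual API:

```scala
// Hypothetical triple type for illustration only.
final case class Triple(s: String, p: String, o: String)

// Push-based parser sketch: the producer feeds arbitrary-length chunks
// via addChunk; the consumer supplies the onTriple callback; no stream
// machinery sits in between. Splitting on '\n' stands in for real
// tokenisation of an RDF syntax.
final class PushParser(onTriple: Triple => Unit) {
  private val buf = new StringBuilder

  // Producer calls this whenever a chunk of input arrives; the parser
  // always goes as far as the data allows, emitting complete triples.
  def addChunk(chunk: String): Unit = {
    buf.append(chunk)
    var i = buf.indexOf("\n")
    while (i >= 0) {
      emit(buf.substring(0, i))
      buf.delete(0, i + 1)
      i = buf.indexOf("\n")
    }
  }

  // Producer signals end of input; flush any trailing triple.
  def end(): Unit =
    if (buf.nonEmpty) { emit(buf.toString); buf.clear() }

  private def emit(line: String): Unit = {
    val parts = line.split(" ", 3)
    if (parts.length == 3) onTriple(Triple(parts(0), parts(1), parts(2)))
  }
}
```

Because the parser emits each triple as soon as it is complete, the consumer can start acting on results while later chunks are still in flight, without any buffering of the whole input.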
Thanks @RubenVerborgh. No surprise, async/non-blocking comes at a price. But we lost Henry's use-case along the way: he doesn't want to accumulate the parsed triples in a graph. I am still not sure how that would work.
Careful, @betehess, there's a difference between “async/non-blocking” and “streams with protection against back- and forward-pressure” (= the Node.js implementation of streams, which is different from JavaScript in general, and in particular from the N3.js library). In fact, non-blocking can be surprisingly fast or even faster, precisely because you don't need the lookahead. (Try parsing an 8GB file in a blocking way.) The streaming N3.js parser is currently by far the fastest in JavaScript.
I know, I was just mentioning the trade-off: there is a price in context-switching, nothing new there. In many applications, I have seen non-blocking network IO along with blocking parsers working on separate threads, resulting in async results. It seems to work really great for many people... Anyway, that was not my question.
And there is also a price in memory consumption in building a graph. The point is to allow the developer to decide when he wants to pay the price. Perhaps he does not want to build the graph, but just re-serialise it in a different form immediately, or send it to a store for indexing. Note that in a single-threaded parser using
Mhh, there is something going on in Parboiled2 regarding continuation parsing.
Grok is also a streaming parser, and in this video its author explains very well how he gets very good performance. Look forward to it. https://www.parleys.com/tutorial/grok-optimistic-mini-library-efficient-safe-parsing
JavaScript runs in a single thread, so blocking parsers are a complete no-no. A parser has to be able to parse a small chunk of RDF quickly and send the result on to be processed, so that the event loop can be freed for drawing the UI.
The JavaScript parsers in rdfstorew work like that, sending each triple to a callback function, which presumably passes through the event loop. I'll try to work out exactly how that works as I get more time.
Streaming/non-blocking parsers also reduce the memory footprint:
This means that the RDFStore APIs also need to accept not just graphs but non-blocking streaming constructs, so that a parser can send the store a stream of parsed triples.
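The store-side API asked for here might look roughly like the sketch below: a sink that accepts triples one at a time, so a parser can feed it without ever building a graph. `TripleSink`, `CountingStore`, and `pipe` are hypothetical names for illustration, not RDFStore's actual API:

```scala
// Hypothetical triple type for illustration only.
final case class Triple(s: String, p: String, o: String)

// A store that accepts a stream of parsed triples instead of a graph;
// this is the non-blocking-friendly shape: the parser pushes triples
// as they are produced, and the store indexes them incrementally.
trait TripleSink {
  def addTriple(t: Triple): Unit
}

// Trivial sink standing in for a real indexing store: it only counts.
final class CountingStore extends TripleSink {
  var size: Int = 0
  def addTriple(t: Triple): Unit = size += 1
}

// A parser producing an Iterator[Triple] can feed the store as it
// goes; no graph is ever accumulated in memory.
def pipe(triples: Iterator[Triple], sink: TripleSink): Unit =
  triples.foreach(sink.addTriple)
```

The same `TripleSink` shape also works as the callback target of a push-based parser, so both pull- and push-style parsers can feed the store.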