
Need streaming & async RDF readers/writers (marshallers/unmarshallers) #211

Open
bblfish opened this issue Dec 6, 2014 · 13 comments


bblfish commented Dec 6, 2014

JavaScript runs in a single thread, so blocking parsers are a complete no-no. A parser has to be able to parse a small chunk of RDF quickly and send the result on to be processed, so that the event loop can be freed for drawing the UI.

The JavaScript parsers in rdfstorew work like that by sending each triple to a callback function which presumably passes through the event loop. I'll try to work out exactly how that works as I get more time.

Streaming/non-blocking parsers also reduce the memory footprint:

  • streaming parsers (such as the current NTriples parser, which is an Iterator) mean that the whole graph never needs to be in memory; the triples can either be passed on to a store for indexation, or folded into a graph if needed.
  • blocking parsers, like all the current ones (except for the rdfstorew ones, I think), are expensive in threads, each taking up 0.5–1 MB of RAM, and lead to a lot of context switching on network IO. Saving files to the hard drive and then parsing them is not an option for larger files if one wants responses as fast as possible, as when parsing for the purposes of a proxy.

This means that the RDFStore APIs also need to accept not just graphs, but non-blocking streaming constructs, so that a parser can send the store a stream of parsed triples.
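A minimal sketch of what such a store-facing API could look like. All names here (TripleStore, addTriple, flush) are illustrative, not an existing banana-rdf or rdfstorew API:

```javascript
// Hypothetical sketch: a store that accepts a stream of parsed triples
// instead of a finished graph. The parser pushes each triple as soon as
// it is parsed; the store indexes in batches so the event loop is not
// held for long.
function TripleStore(batchSize) {
  this.batchSize = batchSize || 100;
  this.batch = [];
  this.indexed = [];
}

// The parser calls this for every triple it emits.
TripleStore.prototype.addTriple = function (triple) {
  this.batch.push(triple);
  if (this.batch.length >= this.batchSize) this.flush();
};

TripleStore.prototype.flush = function () {
  // A real store would update its indexes here; we just accumulate.
  this.indexed = this.indexed.concat(this.batch);
  this.batch = [];
};

// Wiring: a streaming parser feeds triples one by one.
var store = new TripleStore(2);
[{ s: 'a', p: 'b', o: 'c' },
 { s: 'a', p: 'b', o: 'd' },
 { s: 'a', p: 'b', o: 'e' }].forEach(function (t) { store.addTriple(t); });
store.flush();
console.log(store.indexed.length); // 3
```

The point is only that the store's input is a stream of triples, never a whole graph.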


betehess commented Dec 6, 2014

This reminds me how "fun" it was to work with streaming XML parsers in a previous project. For example, you cannot use XPath anymore, nor other standard XML technologies... Also, the kinds of applications being written are in a totally different category, because you are forcing yourself to never aggregate the triples into a graph (or you defeat the purpose). I am still not sure what kind of applications one would write, and more importantly: how.

Anyway, I am not saying this is useless: feel free to work on that! Just note that the current Reader/Writer framework is not designed to make use of that, and you will have to provide a totally different abstraction.

Finally, you might want to rephrase your use-case in terms similar to what you can find in https://github.com/travisbrown/syzygist .


bblfish commented Dec 6, 2014

"you cannot use XPath anymore, nor other standard XML technologies". Nothing should force the end user to stay in streaming mode. A graph is just a simple fold away. But the parsers should not determine whether a graph will be used in the end, or whether the whole thing is just going to be looked at triple by triple in a big-data scenario, or if things are just being sent in batches to an indexer, or indeed if they are just being converted to another format and sent along into another pipe.
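The "fold" here could be as simple as the following sketch, which accumulates a triple stream into an in-memory graph (all names are illustrative; a "graph" is modelled as a set of triples keyed by their string form):

```javascript
// Sketch of "a graph is just a fold away": the same triple stream can
// be consumed one-by-one, or folded into an in-memory graph when the
// user actually wants one.
function foldIntoGraph(tripleStream) {
  var graph = {};
  tripleStream.forEach(function (t) {
    // Keying by the triple's string form makes duplicates collapse,
    // which is the set semantics an RDF graph requires.
    graph[t.s + ' ' + t.p + ' ' + t.o] = t;
  });
  return graph;
}

var triples = [
  { s: '<a>', p: '<b>', o: '<c>' },
  { s: '<a>', p: '<b>', o: '<c>' }, // duplicate collapses in the graph
  { s: '<a>', p: '<b>', o: '<d>' }
];
var graph = foldIntoGraph(triples);
console.log(Object.keys(graph).length); // 2
```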

Btw, there is no need to write a streaming XML RDF parser: we can use Jena's, as I did in the rww project. I am not saying Iteratees are the way to go, just pointing to that as an example that is already written.

As far as which library to use for streaming parsers I am not yet sure. This is a good thread to list some of them with pros and cons, or by linking this to a wiki page.


betehess commented Dec 7, 2014

Nothing should force the end user to stay in the streaming mode. A graph is just a simple fold away.

Of course, but then people must realize that in that case, you have a performance penalty because you are async under the hood, while it was not necessary. If you are not sure why/where, Google is your friend. Anyway, implementing a synchronous parser on top of an asynchronous one is far from being free. And the whole thing becomes much more difficult to maintain, especially if you don't want to (or just cannot) rely on a framework to make things faster.

One must not conflate scalable and fast. And one must understand that there are many ways to be scalable. In my applications, I do not need non-blocking parsers. But what really worries me is that you seem to confuse many different concepts: blocking, non-blocking, async, sync, reactive, stream, laziness.

Btw, there is no need to write a streaming XML rdf parser

I know... I was just taking that as an example to explain the trade-offs. So far, you are the only one really interested in such parsers, so I wanted to make clear that there are caveats that people should not ignore.

As far as which library to use for streaming parsers I am not yet sure. This is a good thread to list some of them with pros and cons, or by linking this to a wiki page.

As far as I know, there is no such framework today. And I guess you want something that will work with scala-js?

streaming parsers ( such as the current NTriples parser which is an Iterator)

I think you have a very loose definition of stream... By the way, your current NTriplesReader is 100% blocking: on the one hand, it blocks on its input as it reads synchronously from a Reader, and on the other hand, Iterator is also a blocking abstraction (it is just kinda lazy).

means that the whole graph does not ever need to be in memory

If the consumer of the Iterator is slower than the IO, then the whole graph can still end up in memory... Not as an RDF#Graph of course, but as bytes accumulated in the Reader. Async itself is not enough to solve this problem, I guess you'd want something like "reactive streams", à la akka-stream, where the producer and the consumer have ways to communicate. And you'd want to design your whole application around those streams.
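The producer/consumer communication Alexandre alludes to can be illustrated with a toy demand-driven stream: the consumer signals how much it can take with request(n), so a fast producer cannot flood a slow consumer. This is a sketch of the concept only, not the akka-stream or Reactive Streams API (real implementations queue demand instead of recursing as this toy does):

```javascript
// Toy "reactive streams" flavour: data flows only in response to demand.
function Publisher(items) {
  this.items = items;
  this.pos = 0;
  this.done = false;
}

Publisher.prototype.subscribe = function (subscriber) {
  var self = this;
  subscriber.subscription = {
    request: function (n) {
      // Emit at most n items, then wait for more demand.
      while (n-- > 0 && self.pos < self.items.length) {
        subscriber.onNext(self.items[self.pos++]);
      }
      if (self.pos === self.items.length && !self.done) {
        self.done = true;
        subscriber.onComplete();
      }
    }
  };
  subscriber.subscription.request(1); // initial demand
};

var received = [];
new Publisher(['t1', 't2', 't3']).subscribe({
  onNext: function (t) {
    received.push(t);
    // Ask for the next triple only once this one has been handled,
    // so bytes never pile up faster than the consumer can take them.
    this.subscription.request(1);
  },
  onComplete: function () { console.log(received.length); } // prints 3
});
```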

In any case, I think I have given enough remarks on that thread. But I am very curious with what you will come up with.


bblfish commented Dec 7, 2014

streaming parsers ( such as the current NTriples parser which is an Iterator)

I think you have a very loose definition of stream... By the way, your current NTriplesReader is 100% blocking: on the one hand, it blocks on its input as it reads synchronously from a Reader, and on the other hand, Iterator is also a blocking abstraction (it is just kinda lazy).

Of course I know that my NTriples parser is blocking. I was giving it as an example of something that streams results but still blocks on a thread. And the Iterator's blocking behaviour does not add a problem over and above the java.io.Reader, since the Iterator is just an interface on the reader, and the blocking is in the same direction.

If the consumer of the Iterator is slower than the IO, then the whole graph can still end up in memory... Not as an RDF#Graph of course, but as bytes accumulated in the Reader.

CPUs are much faster than IO, and parsing NTriples is very easy, so on most devices you'd not have problems with the IO accumulating, especially if the IO is coming over the internet. (I'll refer people to Odersky's second course on reactive programming on Coursera for details about the difference in speed between the fastest CPU operation and sending a message over the internet: it's the difference between 1 second and 4 years, at least for the reception of the first packet.)
In any case, that cannot be a problem my NTriples parser has over any of the Jena or Sesame parsers, since they also use the java.io.{InputStream,Reader} libraries. So compared to the current banana abstraction over those libraries, the NTriples parser saves on needing to have the whole graph in memory. There are situations where that is a huge win, as described in the header of this issue. Sadly, with the current reader abstraction in banana, that does not become visible to the user.

But what is sure is that the NTriples parser, with its blocking IO, does not solve the JavaScript problem. There you need a non-blocking, streaming parser, as explained for example by @RubenVerborgh in his article Lightning Fast RDF in JavaScript.

@RubenVerborgh

CPUs are much faster than IO, and parsing NTriples is very easy, so in most devices you'd not have problems with the IO accumulating, especially if the IO is coming over the internet.

That's a fallacy: you don't know what the consumer is doing. It might well be that the consumer needs to perform multiple IO operations for every triple it receives. Hence, the consumer can be slower than the IO input stream, because the consumer is not necessarily CPU-bound. The fact that N-Triples requires little CPU to parse is not relevant; parsing is just an intermediary between IO and some (slow?) consumer.


bblfish commented Dec 7, 2014

@RubenVerborgh wrote:

CPUs are much faster than IO, and parsing NTriples is very easy, so in most devices you'd not have problems with the IO accumulating, especially if the IO is coming over the internet.

That's a fallacy: you don't know what the consumer is doing. It might well be that the consumer needs to perform multiple IO operations for every triple it receives. Hence, the consumer can be slower than the IO input, because the consumer is not necessarily CPU-bound. The fact that N-Triples require little CPU to parse is not that relevant; parsing is just an intermediary between IO and some consumer.

I agree. The point of that part of the argument was that, at least as far as memory management goes, a streaming parser that returns results one by one has an advantage over one that returns a whole graph when done. I was only arguing about the advantages of streaming the results. Currently our API for readers is

/** RDF readers for a given syntax. */
trait RDFReader[Rdf <: RDF, M[_], +S] {

  /** Tries parsing an RDF Graph from an [[java.io.InputStream]] and a base URI.
    *
    * If the encoding for the input is known, prefer the Reader
    * version of this function.
    * @param base the base URI to use, to resolve relative URLs found in the InputStream
    */
  def read(is: InputStream, base: String): M[Rdf#Graph]

  /** Tries parsing an RDF Graph from a [[java.io.Reader]] and a base URI.
    * @param base the base URI to use, to resolve relative URLs found in the Reader
    */
  def read(reader: Reader, base: String): M[Rdf#Graph]
}

The M in M[Rdf#Graph] allows the result to be non-blocking, as M can be a Future, but it forces the return of a whole graph. It is more likely that what is needed (perhaps in addition to the above API) is for the return value to be a stream of triples, where M[Triple] could be a Stream[Triple]... (One probably also needs to replace the java.io.{InputStream, Reader} arguments with something else too.)
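In JavaScript terms, the analogue of returning M[Triple] instead of M[Rdf#Graph] would be a reader that hands back a lazy, asynchronous stream of triples rather than a finished graph. A sketch only, not an actual banana-rdf or N3.js API; the "parser" here just splits complete lines:

```javascript
// Hypothetical pull-based reader: the caller iterates triples at its
// own pace; the whole graph never needs to exist in memory.
async function* readTriples(chunks) {
  let buffer = '';
  for await (const chunk of chunks) {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep the incomplete trailing line for the next chunk
    for (const line of lines) {
      if (line.trim()) yield line.trim(); // a real parser would yield a Triple object
    }
  }
  if (buffer.trim()) yield buffer.trim(); // flush whatever remains at the end
}

const parsed = [];
const finished = (async () => {
  // A triple split across two chunks is handled transparently.
  const chunks = ['<a> <b> <c> .\n<a> ', '<b> <d> .\n'];
  for await (const t of readTriples(chunks)) parsed.push(t);
  console.log(parsed.length); // 2
})();
```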

To solve the problem you are speaking about, one needs to take account of back-pressure, which is why I was recently studying the akka reactive streams framework, which takes that into account (their code is being developed on the 2.3-dev branch). So there I agree with Alex. Akka is also being ported to scala-js, so I really look forward to seeing what is going on there this week at Scala eXchange in London, where I will be speaking.

But those libraries are still very alpha, while your code, @RubenVerborgh, has been in production for a few years now. We here are still very new to the world of JavaScript, so here are some questions I have for you that could help us - I hope they don't sound too stupid. :-)

  1. Using jQuery one gets the whole result of an internet request back in one go. (Is there a way to get it back in blocks, to parse smaller pieces? WebSockets perhaps?) Otherwise I can't quite make sense of the IO back-pressure argument in JavaScript: in the browser, JS always seems to need to get the whole answer, which is a problem if you were to receive a huge file. (I suppose this is where LDP Paging would come in useful, but then if it makes the pages small enough, you'd be getting small enough chunks of RDF to parse them in one go - no?)
  2. As I understand it, N3.js parses the RDF returned in small blocks so as to allow the event loop and other parts of the UI to draw themselves. (Btw, would it be more efficient to parse chunks of RDF so as to reduce event loop usage, which I suppose comes at some cost?) The point of N3.js is that these parsers must be asynchronous, non-blocking and stream the results, for those reasons, right?
  3. What else am I missing?

@RubenVerborgh

Using JQuery one gets the whole result of an internet request back in one go. ( Is there a way to get it back in blocks, to parse smaller pieces? WebSockets perhaps?) Because otherwise I can't quite make sense of the io back pressure argument in JavaScript. You always in JS seem to need to get the whole answer - which is a problem it seems if you were to receive a huge file.

First of all, as you know, JavaScript is bigger than just browsers. So on platforms such as Node.js, streaming from files and HTTP makes a lot of sense. For the browser, jQuery is an abstraction that indeed gives you the whole request. But if we look at the API offered by browsers, it's much more straightforward to get partial input. The XMLHttpRequest documentation can help you a lot here; basically, periodically (or on progress events) check the responseText property, keeping track of the number of characters you have already read, and feed the deltas to the parser.
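The delta-tracking technique described above can be sketched as follows. The XHR wiring is left in comments since XMLHttpRequest is a browser API; the parser callback and function names are illustrative:

```javascript
// Track how much of responseText has already been consumed and hand
// only the new part (the delta) to the parser.
function makeDeltaFeeder(onChunk) {
  var seen = 0; // characters already handed to the parser
  return function (responseText) {
    if (responseText.length > seen) {
      onChunk(responseText.substring(seen));
      seen = responseText.length;
    }
  };
}

var chunks = [];
var feed = makeDeltaFeeder(function (delta) { chunks.push(delta); });

// In a browser this would be driven by progress events, e.g.:
//   xhr.onprogress = function () { feed(xhr.responseText); };
// Here we simulate responseText growing over time:
feed('<a> <b> ');
feed('<a> <b> <c> .\n');
feed('<a> <b> <c> .\n'); // no growth, so nothing is fed

console.log(chunks.length); // 2
```

Concatenating the deltas always reproduces the full response, so the parser sees every character exactly once.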

As I understand N3.js parses the RDF returned in small blocks so as to allow the event loop and other parts of the UI to draw themselves. ( Btw would it be more efficient to parse chunks of RDF so as to reduce event loop usage, which I suppose comes at some cost? ). The point of N3.js is that these parsers must be asynchronous, non-blocking and stream the results, for those reasons, right?

Not really. Again, N3.js was foremost created for Node.js, where there is no UI but only an event loop (and the UI normally runs in a different thread in browsers anyway). The idea of N3.js is that you can give it data chunks of arbitrary length, and that the parser will always go as far as possible. The reasoning behind this has more to do with availability of data rather than returning fast to the event loop; i.e., it just parses as fast as possible, without waiting for the whole input to be available. I wouldn't do it to save cycles in the event loop, as you need to do the parsing anyway. Sometimes you just have part of the data, and you want to “parse it away” as soon as possible, so that you can already start acting on the parsed triples while you are waiting for more data to arrive. It's just faster and far more logical. Furthermore, this comes in handy if you want to parse files that are larger than your main memory.

Non-blocking is what you get by default in Node.js. You don't actively wait until a file has been read into memory; instead, you process chunks as they arrive. This non-blocking behavior is implemented through asynchronous callbacks.

What else am I missing?

Streams with protection against back- and forward-pressure are horribly slow; don't use them. Let producers call a method on your parser to send data, and allow consumers to give you a callback through which you pass data when it arrives.
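That push-based design can be sketched like this: the producer calls a method on the parser with arbitrary-sized chunks, and the consumer receives triples through a callback, with no stream machinery in between. Names are illustrative, not the actual N3.js API, and the "parsing" is just line splitting:

```javascript
// Push-based parser: producer calls addChunk/end, consumer gets a
// callback per triple. The parser always goes as far as the available
// data allows, buffering only the incomplete trailing line.
function NTriplesPush(onTriple, onEnd) {
  var buffer = '';
  return {
    // Producer side: chunks may split a triple anywhere.
    addChunk: function (chunk) {
      buffer += chunk;
      var lines = buffer.split('\n');
      buffer = lines.pop(); // keep any incomplete trailing line
      lines.forEach(function (line) {
        line = line.trim();
        if (line) onTriple(line); // a real parser would emit a triple object
      });
    },
    end: function () {
      if (buffer.trim()) onTriple(buffer.trim());
      onEnd();
    }
  };
}

var triples = [];
var parser = NTriplesPush(
  function (t) { triples.push(t); },
  function () { console.log(triples.length); } // prints 3
);
parser.addChunk('<a> <b> <c> .\n<a> <b> ');
parser.addChunk('<d> .\n<a> <b> <e> .');
parser.end();
```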


betehess commented Dec 7, 2014

Thanks @RubenVerborgh. No surprise, async/non-blocking comes at a price.

But we lost Henry's use-case along the way: he doesn't want to accumulate the parsed triples in a graph. I am still not sure how that would work.

@RubenVerborgh

Careful, @betehess, there's a difference between “async/non-blocking” and “streams with protection against back- and forward-pressure” (= the Node.js implementation of streams, which is different from JavaScript in general, and in particular from the N3.js library).

In fact, non-blocking can be surprisingly fast or even faster, precisely because you don't need the lookahead. (Try parsing an 8GB file in a blocking way.) The streaming N3.js parser is currently by far the fastest in JavaScript.


betehess commented Dec 7, 2014

I know, I was just mentioning the trade-off: there is a price in context-switching, nothing new there. In many applications, I have seen non-blocking network IO along with blocking parsers working on separate threads, resulting in async results. It seems to work really great for many people...

Anyway, that was not my question.


bblfish commented Dec 7, 2014

And there is also a price in memory consumption in building a graph. The point is to allow the developer to decide when he wants to pay that price. Perhaps he does not want to build the graph, but just re-serialise it into a different format immediately, or send it to a store for indexing.

Note that with a single-threaded parser using java.io.* you also get context switching, there between threads, which are very expensive, each taking half a MB of RAM. If a thread is blocking on IO, the thread will be switched into context, will be asked to do something, and will then be immediately switched out of context (assuming it has nothing to do). That is why actor frameworks such as akka are so powerful, allowing one to have an actor for the price of 300 bytes, instead of 500–1000 KB for a thread. These actor frameworks become really important in the context of single-threaded languages such as JavaScript. In fact my guess is that an akka actor framework ported to JS would be built upon the event loop mechanism provided by JavaScript. It would be really interesting to find out. I guess we just have to wait on Phillip Haller's report.


bblfish commented Dec 9, 2014

Mhh, there is something going on in Parboiled2 regarding continuation parsing


bblfish commented Jun 21, 2015

Grok is also a streaming parser, and in this video the author explains very well how he gets very good performance. Worth a look: https://www.parleys.com/tutorial/grok-optimistic-mini-library-efficient-safe-parsing
