Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make location handling less obtrusive #2

Closed
aantron opened this issue Jan 14, 2016 · 8 comments
Closed

Make location handling less obtrusive #2

aantron opened this issue Jan 14, 2016 · 8 comments

Comments

@aantron
Copy link
Owner

aantron commented Jan 14, 2016

The parsers currently emit streams of location * signal, which causes a problem – for each processing function, should it take location * signal streams or signal streams? This forces the user to think about when to call drop_locations.

I will probably change the parsers to return signal streams and a separate function unit -> location which will evaluate to the last location emitted. The user will have to be warned that peeking the signal stream will result in inaccurate location corresponding to the next call to next.

@dbuenzli would you happen to have any recommendations on this?

@dbuenzli
Copy link

I will probably change the parsers to return signal streams and a separate function unit -> location which will evaluate to the last location emitted. The user will have to be warned that peeking the signal stream will result in inaccurate location corresponding to the next call to next.

This looks a little bit brittle to use. Not sure you are going to like my answer.

What I would personally do is not provide a stream abstraction/interface at the lowest level of your library. Streams of signals should be built on top. One of the reason is that the streaming landscape in OCaml is a little bit unsatisfying and fragmented at the moment and the possible arrival of effects in the future may change things again (e.g. among others there is sequence, gen, core's sequence and we have been discussing privately on an alternative with @pqwy). I don't think that it's the task of your library to solve this problem. The value I personally see in your library is a non-blocking, error-correcting, incremental parsing of the insane HTML5 specification. I don't really care about its streaming abstraction and combinators, the latter should be left to a generic streaming library.

The separate function you propose looks like a closure acting on a hidden data structure, why not expose this data structure so that more operations can be performed on it ? This data structure, holding your parsing state, could simply be called a decoder and among the various operations on it, one would allow to get the next signal and another the position of the last signal (peeking can be built on top of that). This would allow to easily write adapters to the generic streaming libraries mentioned above. What I'm basically telling is that I think that you should follow jsonm's design which allows to adapt not only to the various async libraries but also to the various generic streaming libraries.

Once you have that lowest level abtraction you can provide various kind of lighter, quick and dirty api's to the end-user (e.g. only streams with no positions) without caring too much since you know that if you don't provide the right thing the user can easily dig one level down to access the data and form the stream it needs with the libraries it uses.

You should make a base API that allows maximal reuse of the value of your library which is: standard compliant, non-blocking and incremental html5/xml parsers.

@Drup
Copy link

Drup commented Jan 14, 2016

(jsonm's design is basically the same as gen, which is the right streaming library for stop-and-go streams, such as the one used in markup.ml)

@aantron
Copy link
Owner Author

aantron commented Jan 14, 2016

Well, there are several matters here.

CPS streams are the way in which the library is internally composed, i.e. this abstraction is not merely introduced by the interface. All the interface does is hide their CPS nature.

Of course, there are closures with mutable state involved, and I have considered exposing some of them as state types. However, I haven't yet had a strong reason to do so, and it would complicate the interface – so far, in my opinion, unnecessarily. For this issue, I was thinking of something like a parameter ?get_location : (unit -> location) ref. Ugly and C-like as it is, I think it may be worth it for keeping other parts of the interface clean.

Whether to provide an API like jsonm has been a question for me since the beginning. After thinking about it at length, I decided that CPS streams are largely equivalent in meaning, but easier to program with. It is also quite easy to adapt them to both "generic" stream types and to asynchronous programming. Markup_lwt is an example that does both. I am not yet sure how performance is impacted relative to a jsonm-style interface.

I don't believe streams (or I/O) belong in Markup.ml. My plan is to eventually factor these out into separate libraries. I'm also likely to expose the CPS interface at some point. Currently the closest way to it is the monadic I/O adapter functor Markup.Asynchronous which is exposed, though hidden from ocamldoc.

The main reason I did not use an existing sequence library is poor support for programming that is both agnostic to the question of synchronous vs. asynchronous usage and supports a straightforward concept of composition. I may be wrong about this, of course, but that is the impression I got.

I am very interested in this, and would be glad to join whatever discussions on sequence types that are ongoing. Besides Markup.ml, Lambda Soup and a few other projects I am considering also need good sequence types.

@c-cube
Copy link
Sponsor

c-cube commented Jan 14, 2016

I don't want to restart the iterator debate here, but we really need a place to do so. I fail to see why we need one or two additional types, and how they would improve on what currently exists (as cited above, Core.Sequence, gen, sequence, various lazy lists), even assuming effects are coming into OCaml (they would improve gen too, for instance).

@aantron
Copy link
Owner Author

aantron commented Jan 14, 2016

Can someone provide a link to effects? I vote in favor of debating sequences somewhere. This really belongs in a well-known (but tightly focused) library or even the standard library.

@c-cube
Copy link
Sponsor

c-cube commented Jan 14, 2016

A bit dated, but http://kcsrk.info/ocaml/multicore/2015/05/20/effects-multicore/ was the original post, I believe.

@dbuenzli
Copy link

Le jeudi, 14 janvier 2016 à 20:06, Anton a écrit :

Whether to provide an API like jsonm has been a question for me since the beginning. After thinking about it at length, I decided that CPS streams are largely equivalent in meaning, but easier to program with.
Providing a jsonm-like doesn't have to dictate what you use internally or provide in higher-level apis. I just argue that given this issue was opened and the rather unsatisfactory solution you propose (callbacks/closures floating in the ether), it seems better to me if you expose a jsonm-like api rather than reveal your CPS streams (which can be provided in other convenience APIs). This especially if you have the feeling that your streams could be a bottleneck. A simpler interface that is basically exposing your parser state as an abstract type with function acting on it could allow you to be more versatile w.r.t. internals rewritings if you hit performance problems.

Le jeudi, 14 janvier 2016 à 20:15, Simon Cruanes a écrit :

I fail to see why we need one or two additional types, and how they would improve on what currently exists (as cited above, Core.Sequence, gen, sequence, various lazy lists),

Well if you fail to see you didn't follow the discussion then. So to restate it, in my opinion none of these solutions have a compelling story for handling errors occuring at either the producer or consumer end, something that is important for my use cases.

Best,

Daniel

@aantron
Copy link
Owner Author

aantron commented Jan 15, 2016

Daniel, thanks for your comments. I took a "middle" approach for the time being. There is value in having a parser/decoder type for querying in the future. I don't think I will move to a full jsonm-like interface (i.e. ``Await`) for now, mainly because it would require non-trivial internal adaptation, and I don't yet see a benefit to usage. I will be able to make a more informed decision when I am working on performance. In the meantime, the current interface and a jsonm-like interface are both implementable in terms of each other, so this should not be a big problem. I can always make the basic interface of today be the high-level wrapper for tomorrow, or even just have a big breaking change before 1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants