Added elements helper and advanced tutorial.
- Documentation updates.
- Travis fixes.
- Fixes in write_html and write_xml that caused non-termination on Lwt.
- Exposed strings_to_bytes.
aantron committed Jan 13, 2016
1 parent 788d92c commit c11dc8b
Showing 9 changed files with 185 additions and 65 deletions.
3 changes: 2 additions & 1 deletion .travis.yml
@@ -17,9 +17,10 @@ before_script:
- eval `opam config env`

script:
- opam install -y ounit
- "[ -n $DOCS ] || opam install -y lambdasoup"
- "[ -z $LWT ] || opam install -y lwt js_of_ocaml"
- "[ -z $COVERALLS ] || opam install -y bisect_ppx ocveralls"
- opam install -y ounit js_of_ocaml lambdasoup
- make install
- make dependency-test
- make test
139 changes: 90 additions & 49 deletions README.md
@@ -1,14 +1,20 @@
# Markup.ml   [![version pre0.5][version]][releases] [![(BSD license)][license-img]][license]
# Markup.ml   [![version 0.5][version]][releases] [![(BSD license)][license-img]][license]

[version]: https://img.shields.io/badge/version-pre0.5-blue.svg
[version]: https://img.shields.io/badge/version-0.5-blue.svg
[license-img]: https://img.shields.io/badge/license-BSD-blue.svg

Markup.ml is a pair of streaming, error-recovering parsers, one for HTML and one
for XML, with a simple interface: each parser is a function that transforms
streams.
Markup.ml is a pair of parsers implementing the HTML5 and XML specifications.
Usage is simple, because each parser is just a function from byte streams to
parsing signal streams.

Here is an example of pretty-printing and correcting an HTML fragment. The code
is in the left column.
HTML5 gives complicated rules for well-formed markup, error reporting, and
recovery. Markup.ml encapsulates them. While the XML specification does not
include error recovery, Markup.ml also recovers from XML errors, after reporting
them. Thus, it provides best-effort parsing of both HTML and XML.

Here is an example of Markup.ml correcting errors in a small HTML fragment, then
pretty-printing it. The code is in the left column, and the center column shows
the values produced.

```ocaml
open Markup;;
@@ -18,10 +24,9 @@ string s "<p><em>Markup.ml<p>rocks!" (* malformed HTML *)
|> parse_html `Start_element "p"
`Start_element "em"
`Text "Markup.ml"
~report (1, 4) (* can use ~report to abort parsing; *)
(`Unmatched_start_tag "em") (* ignored by default *)
~report (1, 4) (`Unmatched_start_tag "em")
`End_element (* /em: recovery *)
`End_element (* /p *)
`End_element (* /p: not an error *)
`Start_element "p"
`Start_element "em" (* recovery *)
`Text "rocks!"
@@ -39,26 +44,33 @@ string s "<p><em>Markup.ml<p>rocks!" (* malformed HTML *)
</p>" (* valid HTML *)
```

Some features:
In addition to being error-correcting, the parsers are:

- *streaming*: capable of parsing partial input while more input is still being
received;
- *lazy*: not parsing input unless it is needed to emit the next parsing signal,
so you can easily stop parsing partway through a document;
- *non-blocking*: they can be used with [Lwt][lwt], but still provide a
straightforward synchronous interface for simple usage; and
- *one-pass*: memory consumption is limited since the parsers don't build up a
document representation, nor buffer input beyond a small amount of lookahead.

The parsers detect character encodings automatically. Strings emitted are in
UTF-8.

- Supports both strict and error-correcting parsing.
- Based on the [HTML5][HTML5] and [XML][XML] specifications. This concerns HTML
error recovery especially.
- Character encodings detected automatically; emits UTF-8.
- Can be used in simple synchronous style or with [Lwt][lwt].
- Streaming and lazy – partial input is processed as it is received, but only if
needed.
- Parses input in one pass and does not build up a document representation in
memory.
The parsers are subjected to fairly thorough [testing][tests], with more tests
to be added in the future.

The interface is centered around four transformations between byte streams and
signal streams: [`parse_html`][parse_html], [`write_html`][write_html],
## Interface and simple usage

The interface is centered around four functions between byte streams and signal
streams: [`parse_html`][parse_html], [`write_html`][write_html],
[`parse_xml`][parse_xml], and [`write_xml`][write_xml]. These have several
optional arguments for fine-tuning their behavior. The rest of the functions
either input or output byte streams, or transform signal streams in some
interesting way.

Here are some more usage examples:
Some examples:

```ocaml
(* Show up to 10 XML well-formedness errors to the user. Stop after
@@ -82,8 +94,57 @@ file "some_file"
~element:(fun (_, name) _ children -> Element (name, children))
```

The library is subjected to fairly thorough [testing][tests], with more tests on
the way before 1.0 release.
## Advanced: Cohttp + Markup.ml + Lambda Soup + Lwt

The code below is a complete program that requests a Google search, then
performs a streaming scrape of result titles. The first GitHub link is printed,
then the program exits without waiting for the rest of input. Perhaps early exit
is not so important for a Google results page, but it may be needed for large
documents. Memory consumption is low because only the `h3` elements are
converted into DOM-like trees.

```ocaml
open Lwt.Infix
let () =
  Markup_lwt.ensure_tail_calls (); (* Workaround for current Lwt :( *)
  Lwt_main.run begin
    Uri.of_string "https://www.google.com/search?q=markup.ml"
    |> Cohttp_lwt_unix.Client.get
    >|= snd (* Assume success and get body. *)
    >|= Cohttp_lwt_body.to_stream (* Now an Lwt_stream.t. *)
    >|= Markup_lwt.lwt_stream (* Now a Markup.stream. *)
    >|= Markup.strings_to_bytes
    >|= Markup.parse_html
    >|= Markup.drop_locations
    >|= Markup.elements (fun name _ -> snd name = "h3")
    >>= Markup_lwt.iter begin fun h3_subtree ->
      h3_subtree
      |> Markup.write_html
      |> Markup_lwt.to_string
      >|= Soup.parse
      >|= fun soup ->
        let open Soup in
        match soup $? "a[href*=github]" with
        | None -> ()
        | Some a -> a |> texts |> List.iter print_string; print_newline ()
    end
  end
```

This prints `aantron/markup.ml · GitHub`. To run it, do:

```sh
ocamlfind opt -linkpkg -package lwt.unix -package cohttp.lwt \
-package markup.lwt -package lambdasoup scrape.ml && ./a.out
```

You can get all the necessary packages by

```sh
opam install lwt cohttp lambdasoup markup
```

## Installing

@@ -103,15 +164,16 @@ To remove the pin later, run `make uninstall`.
## Documentation

The interface of Markup.ml is three modules [`Markup`][Markup],
[`Markup_lwt`][Markup_lwt], and [`Markup_lwt_unix`][Markup_lwt_unix].
[`Markup_lwt`][Markup_lwt], and [`Markup_lwt_unix`][Markup_lwt_unix]. The last
two are available only if you have Lwt installed.

## Help wanted

Parsing markup has more applications than one person can easily think of, which
makes it difficult to do exhaustive testing. I would greatly appreciate any bug
reports.

While the parsers are in an "advanced" state of completion, there is still
Although the parsers are in an "advanced" state of completion, there is still
considerable work to be done on standard conformance and speed. Again, any help
would be appreciated.

@@ -127,7 +189,7 @@ Feel free to open any issues on GitHub, or send me an email at

[travis]: https://travis-ci.org/aantron/markup.ml/branches
[travis-img]: https://img.shields.io/travis/aantron/markup.ml/master.svg
[coveralls]: google.com
[coveralls]: https://coveralls.io/github/aantron/markup.ml?branch=master
[coveralls-img]: https://img.shields.io/coveralls/aantron/markup.ml/master.svg

## License
@@ -138,27 +200,6 @@ The Markup.ml source distribution includes a copy of the HTML5 entity list,
which is distributed under the W3C document license. The copyright notices and
text of this license are also found in [LICENSE][license].

## Interesting

As it turns out, there is no simple way to read an entire text file into a
string using the standard library of OCaml. If you have Markup.ml installed,
however, you can do

```ocaml
file "foo.txt" |> to_string
```

This only supports text mode.

Markup.ml also makes a decent half of a character encodings library – you can
use it to convert byte sources into Unicode scalar values. For example, suppose
you have a file in UTF-16. Then, you can do

```ocaml
open Encoding
file "encoded.txt" |> decode utf_16 |> iter (*...do something with the ints...*)
```

[releases]: https://github.com/aantron/markup.ml/releases
[parse_html]: http://aantron.github.io/markup.ml/#VALparse_html
[write_html]: http://aantron.github.io/markup.ml/#VALwrite_html
10 changes: 5 additions & 5 deletions doc/descr
@@ -1,20 +1,20 @@
Error-recovering HTML and XML parsers and writers with a functional interface.
Error-recovering functional HTML5 and XML parsers and writers.

Markup.ml provides an HTML parser and an XML parser. The parsers are wrapped in
a simple interface: they are functions that transform byte streams to parsing
signal streams. Streams can be manipulated in various ways, such as processing
by fold, filter, and map, assembly into DOM tree structures, or serialization
back to HTML or XML.

Both parsers are based on their respective standards. The HTML parser, in
particular, is based on the state machines defined in HTML5.

The parsers are error-recovering by default, and accept fragments. This makes it
very easy to get a best-effort parse of some input. The parsers can, however, be
easily configured to be strict, and to accept only full documents.

Apart for this, the parsers are streaming (do not build up a document in
Apart from this, the parsers are streaming (do not build up a document in
memory), non-blocking (can be used with threading libraries), lazy (do not
consume input unless the signal stream is being read), and process the input in
a single pass. They automatically detect the character encoding of the input
stream, and convert everything to UTF-8.

Both parsers are based on their respective standards. The HTML parser, in
particular, is based on the state machines defined in HTML5.
6 changes: 3 additions & 3 deletions src/META
@@ -1,20 +1,20 @@
version = "0.5"
description = "Error-recovering streaming HTML and XML parsers"
description = "Error-recovering functional HTML5 and XML parsers"
requires = "uutf"
archive(byte) = "markup.cma"
archive(native) = "markup.cmxa"

package "lwt" (
version = "0.5"
description = "Error-recovering streaming HTML and XML parsers"
description = "Error-recovering functional HTML5 and XML parsers"
exists_if = "markup_lwt.cma"
requires = "markup lwt"
archive(byte) = "markup_lwt.cma"
archive(native) = "markup_lwt.cmxa"

package "unix" (
version = "0.5"
description = "Error-recovering streaming HTML and XML parsers"
description = "Error-recovering functional HTML5 and XML parsers"
exists_if = "markup_lwt_unix.cma"
requires = "markup.lwt lwt.unix"
archive(byte) = "markup_lwt_unix.cma"
2 changes: 1 addition & 1 deletion src/html_writer.ml
@@ -105,7 +105,7 @@ let write signals =

| `End_element ->
begin match !open_elements with
| [] -> ()
| [] -> next_signal throw e k
| name::rest ->
open_elements := rest;
emit_list ["</"; name; ">"] throw e k
31 changes: 27 additions & 4 deletions src/markup.mli
@@ -1,8 +1,7 @@
(* This file is part of Markup.ml, released under the BSD 2-clause license. See
doc/LICENSE for details, or visit https://github.com/aantron/markup.ml. *)

(** Flexible error-recovering HTML and XML parsers and writers with a simple
interface.
(** Error-recovering functional HTML and XML parsers and writers.
Markup.ml is an HTML and XML parsing and serialization library. It:
@@ -74,8 +73,10 @@ val write_xml : signal stream -> char stream
{!ASYNCHRONOUS}, which will later be shared with a planned [Markup_async]
module.
Markup.ml is developed on GitHub and distributed under the BSD license.
[LINKS]. This documentation is for version 0.5 of the library. *)
Markup.ml is developed on {{:https://github.com/aantron/markup.ml} GitHub}
and distributed under the
{{:https://github.com/aantron/markup.ml/blob/master/doc/LICENSE}
BSD license}. This documentation is for version 0.5 of the library. *)



@@ -527,6 +528,24 @@ Element ("p" [
]}
*)

val elements :
(name -> (name * string) list -> bool) -> (signal, 's) stream ->
((signal, 's) stream, 's) stream
(** [elements f s] scans the signal stream [s] for
[`Start_element (name, attributes)] signals that satisfy
[f name attributes]. Each such matching signal is the beginning of a
substream that ends with the corresponding [`End_element] signal. The result
of [elements f s] is the stream of these substreams. In simpler words,
[elements f s] creates a sequence of streams of elements in [s] that match
[f].
Matches don't nest. If there is a matching element contained in another
matching element, only the top one results in a substream.
Code using [elements] does not have to read each substream to completion, or
at all. However, once the using code has tried to get the next substream, it
should not try to read a previous one. *)
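
The grouping rule in the `elements` documentation above, in particular that matches don't nest, can be illustrated with a small list-based model. This is only a sketch: the `signal` type and the `elements_model` function below are hypothetical names invented for illustration, and the real `Markup.elements` works on lazy streams rather than lists.

```ocaml
(* A simplified, list-based model of the grouping described for [elements].
   The types below are assumptions made for this sketch only; the library
   itself uses lazy streams and a richer signal type. *)
type name = string * string            (* (namespace, local name) *)

type signal =
  | Start_element of name * (name * string) list
  | End_element
  | Text of string

(* [elements_model f signals] returns one sub-list per matching element.
   While a match is being collected, [depth] counts the elements still open,
   so a matching element nested inside another match is not reported again. *)
let elements_model f signals =
  let rec scan acc = function
    | [] -> List.rev acc
    | (Start_element (name, attrs) as s) :: rest when f name attrs ->
      let subtree, rest' = collect [s] 1 rest in
      scan (subtree :: acc) rest'
    | _ :: rest -> scan acc rest
  and collect sub depth = function
    | [] -> (List.rev sub, [])          (* input ended inside a match *)
    | (Start_element _ as s) :: rest -> collect (s :: sub) (depth + 1) rest
    | End_element :: rest ->
      if depth = 1 then (List.rev (End_element :: sub), rest)
      else collect (End_element :: sub) (depth - 1) rest
    | s :: rest -> collect (s :: sub) depth rest
  in
  scan [] signals

let h3 = ("", "h3")
let div = ("", "div")

(* <div><h3>a<h3>b</h3></h3><h3>c</h3></div> *)
let input =
  [ Start_element (div, []);
    Start_element (h3, []); Text "a";
    Start_element (h3, []); Text "b"; End_element;   (* nested h3 *)
    End_element;
    Start_element (h3, []); Text "c"; End_element;
    End_element ]

let result = elements_model (fun name _ -> snd name = "h3") input
```

In this model `result` holds two sub-lists: the outer `h3` together with the `h3` nested inside it, and then the sibling `h3`. The nested element does not produce a sub-list of its own.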

val drop_locations : (location * signal, 's) stream -> (signal, 's) stream
(** Forgets location information emitted by the parsers. It is equivalent to
[map snd]. *)
@@ -571,6 +590,10 @@ val xhtml_entity : string -> string option
(** Translates XHTML entities. This function is for use with the [~entity]
argument of [parse_xml] when parsing XHTML. *)

val strings_to_bytes : (string, 's) stream -> (char, 's) stream
(** [strings_to_bytes s] is the stream of all the bytes of all strings in
[s]. *)
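
As a rough illustration of what "the stream of all the bytes of all strings" means, the behaviour can be modelled with the standard library's `Seq` module in place of Markup.ml streams. `strings_to_bytes_model` is a hypothetical name used only for this sketch, and it is not the library's implementation.

```ocaml
(* Stdlib-only model (OCaml 4.07 or later): lazily flatten a sequence of
   strings into the sequence of their bytes, keeping the original order. *)
let strings_to_bytes_model (strings : string Seq.t) : char Seq.t =
  Seq.flat_map String.to_seq strings

(* Chunks as they might arrive from a network body. *)
let chunks = List.to_seq ["<p>"; "Mark"; "up"; "</p>"]

let bytes = strings_to_bytes_model chunks

(* Reassembling the bytes gives back the concatenation of the chunks. *)
let joined = String.of_seq bytes
```

Chunk boundaries disappear in the output, which is why the step sits between a chunked body stream and `parse_html` in the Cohttp example.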



(** {2 Asynchronous interface} *)
2 changes: 1 addition & 1 deletion src/markup_lwt_unix.mli
@@ -1,7 +1,7 @@
(* This file is part of Markup.ml, released under the BSD 2-clause license. See
doc/LICENSE for details, or visit https://github.com/aantron/markup.ml. *)

(** Unix functions based on [Lwt_io] for the Lwt interface to Markup.ml.
(** Stream functions based on [Lwt_io].
This module contains additional functions over {!Markup_lwt}.
