Added elements helper and advanced tutorial.

- Documentation updates.
- Travis fixes.
- Fixes in write_html and write_xml that caused non-termination on Lwt.
- Exposed strings_to_bytes.
aantron · Jan 13, 2016 · c11dc8b
1 parent 788d92c
commit c11dc8b

Showing 9 changed files with 185 additions and 65 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -17,9 +17,10 @@ before_script:
   - eval `opam config env`
 
 script:
+  - opam install -y ounit
+  - "[ -n $DOCS ] || opam install -y lambdasoup"
   - "[ -z $LWT ] || opam install -y lwt js_of_ocaml"
   - "[ -z $COVERALLS ] || opam install -y bisect_ppx ocveralls"
-  - opam install -y ounit js_of_ocaml lambdasoup
   - make install
   - make dependency-test
   - make test

diff --git a/README.md b/README.md
@@ -1,14 +1,20 @@
-# Markup.ml &nbsp; [![version pre0.5][version]][releases] [![(BSD license)][license-img]][license]
+# Markup.ml &nbsp; [![version 0.5][version]][releases] [![(BSD license)][license-img]][license]
 
-[version]:       https://img.shields.io/badge/version-pre0.5-blue.svg
+[version]:       https://img.shields.io/badge/version-0.5-blue.svg
 [license-img]:   https://img.shields.io/badge/license-BSD-blue.svg
 
-Markup.ml is a pair of streaming, error-recovering parsers, one for HTML and one
-for XML, with a simple interface: each parser is a function that transforms
-streams.
+Markup.ml is a pair of parsers implementing the HTML5 and XML specifications.
+Usage is simple, because each parser is just a function from byte streams to
+parsing signal streams.
 
-Here is an example of pretty-printing and correcting an HTML fragment. The code
-is in the left column.
+HTML5 gives complicated rules for well-formed markup, error reporting, and
+recovery. Markup.ml encapsulates them. While the XML specification does not
+include error recovery, Markup.ml also recovers from XML errors, after reporting
+them. Thus, it provides best-effort parsing of both HTML and XML.
+
+Here is an example of Markup.ml correcting errors in a small HTML fragment, then
+pretty-printing it. The code is in the left column, and the center column shows
+the values produced.
 
 ```ocaml
 open Markup;;
@@ -18,10 +24,9 @@ string s                "<p><em>Markup.ml<p>rocks!"    (* malformed HTML *)
 |> parse_html           `Start_element "p"
                         `Start_element "em"
                         `Text "Markup.ml"
-                        ~report (1, 4)  (* can use ~report to abort parsing; *)
-                          (`Unmatched_start_tag "em")  (* ignored by default *)
+                        ~report (1, 4) (`Unmatched_start_tag "em")
                         `End_element                   (* /em: recovery *)
-                        `End_element                   (* /p *)
+                        `End_element                   (* /p: not an error *)
                         `Start_element "p"
                         `Start_element "em"            (* recovery *)
                         `Text "rocks!"
@@ -39,26 +44,33 @@ string s                "<p><em>Markup.ml<p>rocks!"    (* malformed HTML *)
                          </p>"                         (* valid HTML *)
 ```
 
-Some features:
+In addition to being error-correcting, the parsers are:
+
+- *streaming*: capable of parsing partial input while more input is still being
+  received;
+- *lazy*: not parsing input unless it is needed to emit the next parsing signal,
+  so you can easily stop parsing partway through a document;
+- *non-blocking*: they can be used with [Lwt][lwt], but still provide a
+  straightforward synchronous interface for simple usage; and
+- *one-pass*: memory consumption is limited since the parsers don't build up a
+  document representation, nor buffer input beyond a small amount of lookahead.
+
+The parsers detect character encodings automatically. Strings emitted are in
+UTF-8.
 
-- Supports both strict and error-correcting parsing.
-- Based on the [HTML5][HTML5] and [XML][XML] specifications. This concerns HTML
-  error recovery especially.
-- Character encodings detected automatically; emits UTF-8.
-- Can be used in simple synchronous style or with [Lwt][lwt].
-- Streaming and lazy – partial input is processed as it is received, but only if
-  needed.
-- Parses input in one pass and does not build up a document representation in
-  memory.
+The parsers are subjected to fairly thorough [testing][tests], with more tests
+to be added in the future.
 
-The interface is centered around four transformations between byte streams and
-signal streams: [`parse_html`][parse_html], [`write_html`][write_html],
+## Interface and simple usage
+
+The interface is centered around four functions between byte streams and signal
+streams: [`parse_html`][parse_html], [`write_html`][write_html],
 [`parse_xml`][parse_xml], and [`write_xml`][write_xml]. These have several
 optional arguments for fine-tuning their behavior. The rest of the functions
 either input or output byte streams, or transform signal streams in some
 interesting way.
 
-Here are some more usage examples:
+Some examples:
 
 ```ocaml
 (* Show up to 10 XML well-formedness errors to the user. Stop after
@@ -82,8 +94,57 @@ file "some_file"
   ~element:(fun (_, name) _ children -> Element (name, children))
 ```
 
-The library is subjected to fairly thorough [testing][tests], with more tests on
-the way before 1.0 release.
+## Advanced: Cohttp + Markup.ml + Lambda Soup + Lwt
+
+The code below is a complete program that requests a Google search, then
+performs a streaming scrape of result titles. The first GitHub link is printed,
+then the program exits without waiting for the rest of input. Perhaps early exit
+is not so important for a Google results page, but it may be needed for large
+documents. Memory consumption is low because only the `h3` elements are
+converted into DOM-like trees.
+
+```ocaml
+open Lwt.Infix
+
+let () =
+  Markup_lwt.ensure_tail_calls ();    (* Workaround for current Lwt :( *)
+
+  Lwt_main.run begin
+    Uri.of_string "https://www.google.com/search?q=markup.ml"
+    |> Cohttp_lwt_unix.Client.get
+    >|= snd                           (* Assume success and get body. *)
+    >|= Cohttp_lwt_body.to_stream     (* Now an Lwt_stream.t. *)
+    >|= Markup_lwt.lwt_stream         (* Now a Markup.stream. *)
+    >|= Markup.strings_to_bytes
+    >|= Markup.parse_html
+    >|= Markup.drop_locations
+    >|= Markup.elements (fun name _ -> snd name = "h3")
+    >>= Markup_lwt.iter begin fun h3_subtree ->
+      h3_subtree
+      |> Markup.write_html
+      |> Markup_lwt.to_string
+      >|= Soup.parse
+      >|= fun soup ->
+        let open Soup in
+        match soup $? "a[href*=github]" with
+        | None -> ()
+        | Some a -> a |> texts |> List.iter print_string; print_newline ()
+    end
+  end
+```
+
+This prints `aantron/markup.ml · GitHub`. To run it, do:
+
+```sh
+ocamlfind opt -linkpkg -package lwt.unix -package cohttp.lwt \
+    -package markup.lwt -package lambdasoup scrape.ml && ./a.out
+```
+
+You can get all the necessary packages by
+
+```sh
+opam install lwt cohttp lambdasoup markup
+```
 
 ## Installing
 
@@ -103,15 +164,16 @@ To remove the pin later, run `make uninstall`.
 ## Documentation
 
 The interface of Markup.ml is three modules [`Markup`][Markup],
-[`Markup_lwt`][Markup_lwt], and [`Markup_lwt_unix`][Markup_lwt_unix].
+[`Markup_lwt`][Markup_lwt], and [`Markup_lwt_unix`][Markup_lwt_unix]. The last
+two are available only if you have Lwt installed.
 
 ## Help wanted
 
 Parsing markup has more applications than one person can easily think of, which
 makes it difficult to do exhaustive testing. I would greatly appreciate any bug
 reports.
 
-While the parsers are in an "advanced" state of completion, there is still
+Although the parsers are in an "advanced" state of completion, there is still
 considerable work to be done on standard conformance and speed. Again, any help
 would be appreciated.
 
@@ -127,7 +189,7 @@ Feel free to open any issues on GitHub, or send me an email at
 
 [travis]:        https://travis-ci.org/aantron/markup.ml/branches
 [travis-img]:    https://img.shields.io/travis/aantron/markup.ml/master.svg
-[coveralls]:     google.com
+[coveralls]:     https://coveralls.io/github/aantron/markup.ml?branch=master
 [coveralls-img]: https://img.shields.io/coveralls/aantron/markup.ml/master.svg
 
 ## License
@@ -138,27 +200,6 @@ The Markup.ml source distribution includes a copy of the HTML5 entity list,
 which is distributed under the W3C document license. The copyright notices and
 text of this license are also found in [LICENSE][license].
 
-## Interesting
-
-As it turns out, there is no simple way to read an entire text file into a
-string using the standard library of OCaml. If you have Markup.ml installed,
-however, you can do
-
-```ocaml
-file "foo.txt" |> to_string
-```
-
-This only supports text mode.
-
-Markup.ml also makes a decent half of a character encodings library – you can
-use it to convert byte sources into Unicode scalar values. For example, suppose
-you have a file in UTF-16. Then, you can do
-
-```ocaml
-open Encoding
-file "encoded.txt" |> decode utf_16 |> iter (*...do something with the ints...*)
-```
-
 [releases]:        https://github.com/aantron/markup.ml/releases
 [parse_html]:      http://aantron.github.io/markup.ml/#VALparse_html
 [write_html]:      http://aantron.github.io/markup.ml/#VALwrite_html

diff --git a/doc/descr b/doc/descr
@@ -1,20 +1,20 @@
-Error-recovering HTML and XML parsers and writers with a functional interface.
+Error-recovering functional HTML5 and XML parsers and writers.
 
 Markup.ml provides an HTML parser and an XML parser. The parsers are wrapped in
 a simple interface: they are functions that transform byte streams to parsing
 signal streams. Streams can be manipulated in various ways, such as processing
 by fold, filter, and map, assembly into DOM tree structures, or serialization
 back to HTML or XML.
 
+Both parsers are based on their respective standards. The HTML parser, in
+particular, is based on the state machines defined in HTML5.
+
 The parsers are error-recovering by default, and accept fragments. This makes it
 very easy to get a best-effort parse of some input. The parsers can, however, be
 easily configured to be strict, and to accept only full documents.
 
-Apart for this, the parsers are streaming (do not build up a document in
+Apart from this, the parsers are streaming (do not build up a document in
 memory), non-blocking (can be used with threading libraries), lazy (do not
 consume input unless the signal stream is being read), and process the input in
 a single pass. They automatically detect the character encoding of the input
 stream, and convert everything to UTF-8.
-
-Both parsers are based on their respective standards. The HTML parser, in
-particular, is based on the state machines defined in HTML5.
diff --git a/src/META b/src/META
@@ -1,20 +1,20 @@
 version = "0.5"
-description = "Error-recovering streaming HTML and XML parsers"
+description = "Error-recovering functional HTML5 and XML parsers"
 requires = "uutf"
 archive(byte) = "markup.cma"
 archive(native) = "markup.cmxa"
 
 package "lwt" (
   version = "0.5"
-  description = "Error-recovering streaming HTML and XML parsers"
+  description = "Error-recovering functional HTML5 and XML parsers"
   exists_if = "markup_lwt.cma"
   requires = "markup lwt"
   archive(byte) = "markup_lwt.cma"
   archive(native) = "markup_lwt.cmxa"
 
   package "unix" (
     version = "0.5"
-    description = "Error-recovering streaming HTML and XML parsers"
+    description = "Error-recovering functional HTML5 and XML parsers"
     exists_if = "markup_lwt_unix.cma"
     requires = "markup.lwt lwt.unix"
     archive(byte) = "markup_lwt_unix.cma"

diff --git a/src/html_writer.ml b/src/html_writer.ml
@@ -105,7 +105,7 @@ let write signals =
 
       | `End_element ->
         begin match !open_elements with
-        | [] -> ()
+        | [] -> next_signal throw e k
         | name::rest ->
           open_elements := rest;
           emit_list ["</"; name; ">"] throw e k
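
A note on why the one-line change in `html_writer.ml` matters. The writer is in continuation-passing style, so a branch that returns `()` without calling `throw`, `e`, or `k` leaves the consumer waiting forever, which is a plausible reading of the "non-termination on Lwt" in the commit message. The sketch below illustrates that failure mode in isolation. It is not the library's code: the `next` type, `of_list`, `filter_buggy`, `filter_fixed`, and `pull_once` are all hypothetical names chosen for illustration.

```ocaml
(* Hypothetical CPS pull function: a well-behaved implementation calls
   exactly one of the three continuations (error, end of stream, item). *)
type 'a next = (exn -> unit) -> (unit -> unit) -> ('a -> unit) -> unit

(* A source that yields the items of a list, then signals end of stream. *)
let of_list (l : 'a list) : 'a next =
  let remaining = ref l in
  fun _throw e k ->
    match !remaining with
    | [] -> e ()
    | x :: rest -> remaining := rest; k x

(* Buggy: when an item is rejected, no continuation is called at all,
   analogous to the old `[] -> ()` branch. *)
let filter_buggy p (next : 'a next) : 'a next =
  fun throw e k ->
    next throw e (fun x -> if p x then k x else ())

(* Fixed: when an item is rejected, keep pulling with the same
   continuations, analogous to `next_signal throw e k`. *)
let filter_fixed p (next : 'a next) : 'a next =
  let rec pull throw e k =
    next throw e (fun x -> if p x then k x else pull throw e k)
  in
  pull

(* Pull once and report which continuation, if any, was called. *)
let pull_once (next : 'a next) =
  let result = ref `Stalled in
  next
    (fun _ -> result := `Failed)
    (fun () -> result := `Ended)
    (fun x -> result := `Item x);
  !result
```

In a synchronous setting the stalled case just returns silently. Under a promise-based scheduler, the promise waiting on the continuation would never be resolved.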

diff --git a/src/markup.mli b/src/markup.mli
@@ -1,8 +1,7 @@
 (* This file is part of Markup.ml, released under the BSD 2-clause license. See
    doc/LICENSE for details, or visit https://github.com/aantron/markup.ml. *)
 
-(** Flexible error-recovering HTML and XML parsers and writers with a simple
-    interface.
+(** Error-recovering functional HTML and XML parsers and writers.
 
     Markup.ml is an HTML and XML parsing and serialization library. It:
 
@@ -74,8 +73,10 @@ val write_xml  : signal stream -> char stream
     {!ASYNCHRONOUS}, which will later be shared with a planned [Markup_async]
     module.
 
-    Markup.ml is developed on GitHub and distributed under the BSD license.
-    [LINKS]. This documentation is for version 0.5 of the library. *)
+    Markup.ml is developed on {{:https://github.com/aantron/markup.ml} GitHub}
+    and distributed under the
+    {{:https://github.com/aantron/markup.ml/blob/master/doc/LICENSE}
+    BSD license}. This documentation is for version 0.5 of the library. *)
 
 
 
@@ -527,6 +528,24 @@ Element ("p" [
 ]}
  *)
 
+val elements :
+  (name -> (name * string) list -> bool) -> (signal, 's) stream ->
+    ((signal, 's) stream, 's) stream
+(** [elements f s] scans the signal stream [s] for
+    [`Start_element (name, attributes)] signals that satisfy
+    [f name attributes]. Each such matching signal is the beginning of a
+    substream that ends with the corresponding [`End_element] signal. The result
+    of [elements f s] is the stream of these substreams. In simpler words,
+    [elements f s] creates a sequence of streams of elements in [s] that match
+    [f].
+
+    Matches don't nest. If there is a matching element contained in another
+    matching element, only the top one results in a substream.
+
+    Code using [elements] does not have to read each substream to completion, or
+    at all. However, once the using code has tried to get the next substream, it
+    should not try to read a previous one. *)
+
 val drop_locations : (location * signal, 's) stream -> (signal, 's) stream
 (** Forgets location information emitted by the parsers. It is equivalent to
     [map snd]. *)
@@ -571,6 +590,10 @@ val xhtml_entity : string -> string option
 (** Translates XHTML entities. This function is for use with the [~entity]
     argument of [parse_xml] when parsing XHTML. *)
 
+val strings_to_bytes : (string, 's) stream -> (char, 's) stream
+(** [strings_to_bytes s] is the stream of all the bytes of all strings in
+    [s]. *)
+
 
 
 (** {2 Asynchronous interface} *)
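
To make the contract in the new `elements` doc comment concrete, here is a minimal model over plain lists. It is a sketch under simplifying assumptions, not Markup.ml's implementation: `signal` and `elements_model` are hypothetical, element names are plain strings rather than the library's namespace and local name pairs, and each substream is collected eagerly into a list instead of being a lazy stream.

```ocaml
(* Simplified stand-in for the library's signal type. *)
type signal =
  | Start_element of string * (string * string) list
  | End_element
  | Text of string

(* Return one sublist per top-level matching element, from its start
   signal through the corresponding end signal. *)
let elements_model
    (f : string -> (string * string) list -> bool)
    (signals : signal list) : signal list list =
  let rec scan acc = function
    | [] -> List.rev acc
    | (Start_element (name, attrs) as s) :: rest when f name attrs ->
      (* Found a match: collect its whole subtree, then resume scanning
         after it, so nested matches are not reported separately. *)
      let sub, rest' = take_subtree [s] 1 rest in
      scan (sub :: acc) rest'
    | _ :: rest -> scan acc rest
  and take_subtree sub depth = function
    | [] -> (List.rev sub, [])  (* input ended before the element closed *)
    | s :: rest ->
      let depth' =
        match s with
        | Start_element _ -> depth + 1
        | End_element -> depth - 1
        | Text _ -> depth
      in
      let sub' = s :: sub in
      if depth' = 0 then (List.rev sub', rest)
      else take_subtree sub' depth' rest
  in
  scan [] signals
```

The depth counter is what implements "matches don't nest": a matching element inside another matching element is carried along inside the outer sublist rather than starting a new one.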

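The newly exposed `strings_to_bytes` bridges sources that produce string chunks (such as an HTTP body) and the parsers, which consume bytes. As a rough model of the documented behavior, using the standard library's `Seq` in place of Markup.ml streams (`strings_to_bytes_model` is a hypothetical name, and this is not the library's implementation):

```ocaml
(* Lazily flatten a sequence of strings into the sequence of their bytes.
   Empty strings contribute nothing. Requires OCaml 4.07 or later. *)
let strings_to_bytes_model (strings : string Seq.t) : char Seq.t =
  Seq.flat_map String.to_seq strings
```

Because `Seq` is lazy, no chunk is inspected until the consumer asks for a byte from it, which mirrors the lazy behavior the README describes for the parsers.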
diff --git a/src/markup_lwt_unix.mli b/src/markup_lwt_unix.mli
@@ -1,7 +1,7 @@
 (* This file is part of Markup.ml, released under the BSD 2-clause license. See
    doc/LICENSE for details, or visit https://github.com/aantron/markup.ml. *)
 
-(** Unix functions based on [Lwt_io] for the Lwt interface to Markup.ml.
+(** Stream functions based on [Lwt_io].
 
     This module contains additional functions over {!Markup_lwt}.