Tubax

Tubax is a library to parse raw XML data and convert it into native ClojureScript data structures.

It uses sax.js under the hood to provide a fast way to convert from XML.

Currently there is no good way to parse XML and other markup languages with ClojureScript. There are no ClojureScript-based libraries, and most of the JavaScript ones require access to the DOM.

This last point is critical because HTML5 Web Workers don’t have access to the DOM APIs, so an alternative is necessary.

Another approach to XML processing is a middle ground: some libraries parse XML into a JSON format.

The problem with these is that JSON is not a faithful representation of the XML format. Some XML documents cannot be represented as JSON without losing information.

For example, the following XML will lose information when transformed into JSON.

<root>
    <field-a>A</field-a>
    <field-b>B</field-b>
    <field-a>A</field-a>
</root>
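
To make the loss concrete, here is a small sketch in plain Clojure data, with no library calls. The map `as-json-like` is a hypothetical stand-in for what an XML-to-JSON converter might produce by grouping children under their tag name; `as-nodes` is hand-written in the node shape used by the examples in this document.

```clojure
;; Hypothetical JSON-like result: children grouped by tag name.
;; The grouping no longer records that field-b sat between the two field-a elements.
(def as-json-like
  {:root {:field-a ["A" "A"]
          :field-b "B"}})

;; Node-style result: children are kept as an ordered vector of nodes.
(def as-nodes
  {:tag :root
   :attributes {}
   :content [{:tag :field-a :attributes {} :content ["A"]}
             {:tag :field-b :attributes {} :content ["B"]}
             {:tag :field-a :attributes {} :content ["A"]}]})

;; Document order can still be read back from the node form.
(def child-order (mapv :tag (:content as-nodes)))
;; child-order is [:field-a :field-b :field-a]
```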

Another main objective of Tubax is to be fully compatible with the clojure.xml format, so that functionality already available in the Clojure API, such as zippers, can be used on the parsed result.
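
As an illustration of the zipper point, the sketch below walks a hand-written node with `clojure.zip/xml-zip`. The node literal is an assumption about what parsed output looks like, based on the examples in this document; nothing here calls Tubax itself.

```clojure
(require '[clojure.zip :as zip])

;; Hand-written node in the shape this document shows for parsed output.
(def node
  {:tag :root
   :attributes {}
   :content [{:tag :field-a :attributes {} :content ["A"]}
             {:tag :field-b :attributes {} :content ["B"]}]})

;; xml-zip reads the children of a node from its :content key.
(def second-child-tag
  (-> (zip/xml-zip node)
      zip/down     ; move to the first child, :field-a
      zip/right    ; move to its sibling, :field-b
      zip/node     ; take the node at the current location
      :tag))
;; second-child-tag is :field-b
```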

Warning
Not on Clojars yet. I’ll update this information when it’s available.

All examples will use this XML as if it existed in a (def xml-data "…​") definition.

<rss version="2.0">
  <channel>
    <title>RSS Title</title>
    <description>This is an example of an RSS feed</description>
    <link>http://www.example.com/main.html</link>
    <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000 </lastBuildDate>
    <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>Example entry</title>
      <description>Here is some text containing an interesting description.</description>
      <link>http://www.example.com/blog/post/1</link>
      <guid isPermaLink="false">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
      <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
    </item>
    <item>
      <title>Example entry2</title>
      <description>Here is some text containing an interesting description.</description>
      <link>http://www.example.com/blog/post/1</link>
      <guid isPermaLink="true">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
      <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
    </item>
  </channel>
</rss>

To parse an XML document you only have to call the xml->clj function:

(require '[tubax.core :refer [xml->clj]])

(xml->clj xml-data)

The library bundles sax.js as its main dependency. You can pass the following options to the conversion to customize its behaviour.

:strict (default true)

When not in strict mode, the parser is more forgiving about XML structure. In strict mode, a malformed document makes the parsing throw an exception.

Warning
Parsing loosely formatted XML in non-strict mode can cause unexpected behaviour, so it’s not recommended.
(def xml-data "<a><b></a>")

(core/xml->clj xml-data {:strict false})

;; => {:tag :a :attributes {} :content [{:tag :b :attributes {} :content []}]}

(core/xml->clj xml-data {:strict true})

;; => js/Error #Parse error

:trim (default true)

When enabled, the parser removes all leading and trailing whitespace from text nodes.

(def xml-data "<a>  test  </a>")

(core/xml->clj xml-data {:trim false})

;; => {:tag :a :attributes {} :content ["  test  "]}

(core/xml->clj xml-data {:trim true})

;; => {:tag :a :attributes {} :content ["test"]}

:normalize (default false)

Replaces whitespace characters (tabs, line endings, etc.) in text nodes with spaces.

(def xml-data "<a>normalize\ntest</a>")

(core/xml->clj xml-data {:normalize false})

;; => {:tag :a :attributes {} :content ["normalize\ntest"]}

(core/xml->clj xml-data {:normalize true})

;; => {:tag :a :attributes {} :content ["normalize test"]}
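
The effect of the trim and normalize options on a single text node can be approximated with `clojure.string`. This is only a sketch of the behaviour described above, not Tubax's code: `clean-text` is a made-up helper, and it assumes that normalizing collapses each run of whitespace into one space.

```clojure
(require '[clojure.string :as str])

(defn clean-text
  "Hypothetical helper: applies trim and/or normalize to a text node's string."
  [text {:keys [trim normalize]}]
  (cond-> text
    trim      str/trim                       ; drop leading and trailing whitespace
    normalize (str/replace #"\s+" " ")))     ; assumption: whitespace runs become one space

(clean-text "  test  " {:trim true})              ; "test"
(clean-text "  test  " {:trim false})             ; "  test  "
(clean-text "normalize\ntest" {:normalize true})  ; "normalize test"
```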

:lowercase (default true)

In non-strict mode, tag and attribute names are converted to lower-case when this option is true. When it is false, they are converted to upper-case.

(def xml-data "<root att1='t1'>test</root>")

(core/xml->clj xml-data {:strict false :lowercase true})

;; => {:tag :root :attributes {:att1 "t1"} :content ["test"]}

(core/xml->clj xml-data {:strict false :lowercase false})

;; => {:tag :ROOT :attributes {:ATT1 "t1"} :content ["test"]}

:xmlns (default false)

By default, no additional data is produced when an XML namespace is found.

When the xmlns option is enabled, the node elements carry more information about their namespaces.

(def xml-data "<element xmlns='http://foo'>value</element>")

(core/xml->clj xml-data {:xmlns false})

;; => {:tag :element :attributes {:xmlns "http://foo"} :content ["value"]}

(core/xml->clj xml-data {:xmlns true})

;; => {:tag :element :attributes {:xmlns {:name "xmlns" :value "http://foo" :prefix "xmlns" :local "" :uri "http://www.w3.org/2000/xmlns/"}} :content ["value"]}

:strict-entities (default false)

When enabled, the parser fails when it finds an entity that is not predefined.

(def xml-data "<element>&aacute;</element>")

(core/xml->clj xml-data {:strict-entities false})

;; => {:tag :element :attributes {} :content ["á"]}

(core/xml->clj xml-data {:strict-entities true})

;; => js/Error #Parser error
(require '[tubax.helpers :as th])

For simplicity the following examples suppose:

(require '[tubax.core :refer [xml->clj]])

(def result (xml->clj xml-data))
(th/tag {:tag :item :attributes {} :content ["Text"]})
;; => :item
(th/attributes {:tag :item :attributes {} :content ["Text"]})
;; => {}
(th/children {:tag :item :attributes {} :content ["Text"]})
;; => ["Text"]
(th/text {:tag :item :attributes {} :content ["Text"]})
;; => "Text"

(th/text {:tag :item :attributes {} :content [{:tag :item :attributes {} :content [...]}]})
;; => nil
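
Because nodes are plain maps, the same values can also be read with ordinary keyword lookups. The sketch below does not use the helpers namespace; `node-text` is a hypothetical stand-in that assumes a text accessor returns the first child when it is a string and nil otherwise, which matches the two results shown above.

```clojure
(def leaf   {:tag :item :attributes {} :content ["Text"]})
(def parent {:tag :item :attributes {} :content [leaf]})

(defn node-text
  "Hypothetical accessor: the first child when it is a string, else nil."
  [node]
  (let [first-child (first (:content node))]
    (when (string? first-child) first-child)))

(:tag leaf)         ; :item
(:attributes leaf)  ; {}
(:content leaf)     ; ["Text"]
(node-text leaf)    ; "Text"
(node-text parent)  ; nil
```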

These functions retrieve the first node that matches the query passed as an argument.

(th/find-first result {:tag :item})

;; => {:tag :item :attributes {} :content [{:tag :title :attributes {} :content ["Example entry"]} ...]}
(th/find-first result {:path [:rss :channel :description]})

;; => {:tag :description :attributes {} :content ["This is an example of an RSS feed"]}
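
A path query of this kind can be pictured as a walk from the root down through child tags. The following is a minimal sketch on plain data, not the library's implementation: `first-at-path` is a hypothetical name, and it simply takes the first matching child at each level without backtracking.

```clojure
;; Hand-written, shortened version of the example feed.
(def feed
  {:tag :rss
   :attributes {:version "2.0"}
   :content [{:tag :channel
              :attributes {}
              :content [{:tag :title :attributes {} :content ["RSS Title"]}
                        {:tag :description
                         :attributes {}
                         :content ["This is an example of an RSS feed"]}]}]})

(defn first-at-path
  "Hypothetical helper: follows the tags in path from the root and returns
   the node reached, or nil when some step has no matching child."
  [root [root-tag & more]]
  (when (= root-tag (:tag root))
    (reduce (fn [node tag]
              ;; first child element whose tag matches; text children are skipped
              (some #(when (and (map? %) (= tag (:tag %))) %)
                    (:content node)))
            root
            more)))

(def description (first-at-path feed [:rss :channel :description]))
(def missing     (first-at-path feed [:rss :item]))
```

Here `description` is the description node and `missing` is nil, since the shortened feed has no item directly under the root.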

Search for the first element that has the attribute defined:

(th/find-first result {:attribute :isPermaLink})

;; => {:tag :guid :attributes {:isPermaLink "false"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}

Search for the first element that has an attribute with the specified value:

(th/find-first result {:attribute [:isPermaLink "true"]})

;; => {:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}

These functions retrieve a lazy sequence of the elements that match the query passed as an argument.

(th/find-all result {:tag :link})

;; => ({:tag :link :attributes {} :content ["http://www.example.com/main.html"]}
;;     {:tag :link :attributes {} :content ["http://www.example.com/blog/post/1"]}
;;     {:tag :link :attributes {} :content ["http://www.example.com/blog/post/1"]})
(th/find-all result {:path [:rss :channel :item :title]})

;; => ({:tag :title :attributes {} :content ["Example entry"]}
;;     {:tag :title :attributes {} :content ["Example entry2"]})
(th/find-all result {:attribute :isPermaLink})

;; => ({:tag :guid :attributes {:isPermaLink "false"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}
;;     {:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]})
(th/find-all result {:attribute [:isPermaLink "true"]})

;; => ({:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]})
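
For readers curious how such queries can work over this node shape, here is a minimal sketch using `tree-seq` from clojure.core. It is not Tubax's implementation: the function names `all-nodes`, `by-tag` and `by-attribute` are made up, and the tree is a shortened hand-written sample with a placeholder guid value.

```clojure
;; Shortened sample tree; "id-1" is a placeholder value.
(def tree
  {:tag :channel
   :attributes {}
   :content [{:tag :link :attributes {} :content ["http://www.example.com/main.html"]}
             {:tag :item
              :attributes {}
              :content [{:tag :guid :attributes {:isPermaLink "false"} :content ["id-1"]}
                        {:tag :link :attributes {} :content ["http://www.example.com/blog/post/1"]}]}]})

(defn all-nodes
  "Lazy depth-first sequence of every element node; text children are skipped."
  [root]
  (filter map? (tree-seq map? :content root)))

(defn by-tag
  "Lazy sequence of the nodes whose :tag equals tag."
  [root tag]
  (filter #(= tag (:tag %)) (all-nodes root)))

(defn by-attribute
  "Lazy sequence of the nodes that define the attribute attr."
  [root attr]
  (filter #(contains? (:attributes %) attr) (all-nodes root)))

(def link-urls  (mapv (comp first :content) (by-tag tree :link)))
(def first-guid (first (by-attribute tree :isPermaLink)))
```

Taking `first` of a lazy result, as done for `first-guid`, is one simple way to get find-first style behaviour out of a find-all style function.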

Tubax does not have many restrictions for contributions. Just open an issue or pull request.

lein test

This library is under the Apache 2.0 License.