Tubax

Table of Contents

1. Introduction
2. Rationale
3. Install
4. Usage
5. Contribute
6. Running tests
7. License

1. Introduction

Tubax is a library that parses raw XML data and converts it into native ClojureScript data structures.

It uses sax.js under the hood to provide a fast way to convert from XML.

2. Rationale

Currently there is no good way to parse XML and other markup languages with ClojureScript. There are no ClojureScript-based libraries, and most of the JavaScript ones require access to the DOM.

This last point is critical because HTML5 Web Workers don’t have access to the DOM APIs, so an alternative is necessary.

Another approach to XML processing is to take a middle ground: some libraries parse XML into a JSON format.

The problem with these is that JSON is not a faithful representation of XML. Some XML documents cannot be represented as JSON without losing information.

For example, the following XML loses information when transformed into JSON, because the repeated field-a element and the order of the children are hard to express in a plain JSON object.

<root>
    <field-a>A</field-a>
    <field-b>B</field-b>
    <field-a>A</field-a>
</root>
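As a rough illustration, here is the same document written by hand in the {:tag :attributes :content} shape used in the examples later in this document. It is not captured parser output, and the names parsed-root, child-tags and as-flat-map are only illustrative. The sketch contrasts the ordered vector of children with a naive name-to-value map, similar in spirit to what an XML-to-JSON conversion might produce.

```clojure
;; Hand-written tree in the {:tag :attributes :content} shape used throughout
;; this document; an illustration, not captured parser output.
(def parsed-root
  {:tag :root
   :attributes {}
   :content [{:tag :field-a :attributes {} :content ["A"]}
             {:tag :field-b :attributes {} :content ["B"]}
             {:tag :field-a :attributes {} :content ["A"]}]})

;; Children live in a vector, so both order and duplicates are kept.
(def child-tags
  (mapv :tag (:content parsed-root)))

;; A naive element-name -> text map collapses the repeated field-a
;; and no longer records where field-b sat between the two.
(def as-flat-map
  (into {} (map (juxt :tag (comp first :content)) (:content parsed-root))))
```

The vector keeps three children in their original order, while the flat map ends up with only two entries.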

Another main objective of Tubax is to be fully compatible with the clojure.xml format, so that functionality already available in the Clojure API, such as zippers, can be used on the result.
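To make the zipper point concrete, the following sketch walks and edits a small tree with clojure.zip. The tree is written by hand in the shape shown in this document rather than produced by the parser, and the names tree, ttl-node and updated are illustrative. It assumes only that child nodes are stored under :content, which is what clojure.zip/xml-zip relies on.

```clojure
(require '[clojure.zip :as zip])

;; Hand-written stand-in for a parsed document (an assumption, not parser output).
(def tree
  {:tag :rss
   :attributes {:version "2.0"}
   :content [{:tag :channel
              :attributes {}
              :content [{:tag :title :attributes {} :content ["RSS Title"]}
                        {:tag :ttl :attributes {} :content ["1800"]}]}]})

;; Navigate rss -> channel -> title -> ttl and read the node at that location.
(def ttl-node
  (-> (zip/xml-zip tree)
      zip/down    ; <channel>
      zip/down    ; <title>
      zip/right   ; <ttl>
      zip/node))

;; Zippers also allow functional edits: replace the ttl text, then rebuild the tree.
(def updated
  (-> (zip/xml-zip tree)
      zip/down
      zip/down
      zip/right
      (zip/edit assoc :content ["3600"])
      zip/root))
```

The original tree is left untouched; updated is a new tree in which only the ttl text differs.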

3. Install

Warning

Not on Clojars yet. I’ll update this information when it’s available.

4. Usage

All examples will use this XML as if it existed in a (def xml-data "…") definition.

<rss version="2.0">
  <channel>
    <title>RSS Title</title>
    <description>This is an example of an RSS feed</description>
    <link>http://www.example.com/main.html</link>
    <lastBuildDate>Mon, 06 Sep 2010 00:01:00 +0000 </lastBuildDate>
    <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>Example entry</title>
      <description>Here is some text containing an interesting description.</description>
      <link>http://www.example.com/blog/post/1</link>
      <guid isPermaLink="false">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
      <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
    </item>
    <item>
      <title>Example entry2</title>
      <description>Here is some text containing an interesting description.</description>
      <link>http://www.example.com/blog/post/1</link>
      <guid isPermaLink="true">7bd204c6-1655-4c27-aeee-53f933c5395f</guid>
      <pubDate>Sun, 06 Sep 2009 16:20:00 +0000</pubDate>
    </item>
  </channel>
</rss>

4.1. Basic usage

To parse an XML document, you only need to call the xml->clj function:

(require '[tubax.core :refer [xml->clj]])

(xml->clj xml-data)

4.2. Additional options

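The library bundles sax.js as its main dependency. You can pass the following options to the conversion to customize its behaviour. The examples in this section call the function through a core alias, which assumes a require such as (require '[tubax.core :as core]).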

4.2.1. Strict mode

default true

When not in strict mode, the parser is more forgiving about the XML structure. In strict mode, a malformed document makes the parser throw an exception.

Warning

Some "loose" formats can cause unexpected behaviour, so non-strict mode is not recommended.

(def xml-data "<a><b></a>")

(core/xml->clj xml-data {:strict false})

;; => {:tag :a :attributes {} :content [{:tag :b :attributes {} :content []}]}

(core/xml->clj xml-data {:strict true})

;; => js/Error #Parse error

4.2.2. Trim whitespace

default true

When enabled, the parser removes leading and trailing whitespace from text nodes.

(def xml-data "<a>  test  </a>")

(core/xml->clj xml-data {:trim false})

;; => {:tag :a :attributes {} :content ["  test  "]}

(core/xml->clj xml-data {:trim true})

;; => {:tag :a :attributes {} :content ["test"]}

4.2.3. Normalize whitespace

default false

When enabled, whitespace characters in text nodes (tabs, line breaks, etc.) are replaced with spaces.

(def xml-data "<a>normalize\ntest</a>")

(core/xml->clj xml-data {:normalize false})

;; => {:tag :a :attributes {} :content ["normalize\ntest"]}

(core/xml->clj xml-data {:normalize true})

;; => {:tag :a :attributes {} :content ["normalize test"]}

4.2.4. Lowercase (non-strict mode only)

default true

In non-strict mode, tag and attribute names are lowercased by default. Set this option to false to get them in upper case instead.

(def xml-data "<root att1='t1'>test</root>")

(core/xml->clj xml-data {:strict false :lowercase true})

;; => {:tag :root :attributes {:att1 "t1"} :content ["test"]}

(core/xml->clj xml-data {:strict false :lowercase false})

;; => {:tag :ROOT :attributes {:ATT1 "t1"} :content ["test"]}

4.2.5. Support for XML namespaces

default false

By default, no additional data is produced when an XML namespace is found.

When the xmlns option is enabled, the node elements carry more information about their namespaces.

(def xml-data "<element xmlns='http://foo'>value</element>")

(core/xml->clj xml-data {:xmlns false})

;; => {:tag :element :attributes {:xmlns "http://foo"} :content ["value"]}

(core/xml->clj xml-data {:xmlns true})

;; => {:tag :element :attributes {:xmlns {:name "xmlns" :value "http://foo" :prefix "xmlns" :local "" :uri "http://www.w3.org/2000/xmlns/"}} :content ["value"]}

4.2.6. Strict entities

default false

When enabled, the parser fails when it finds an entity that is not one of the predefined XML entities.

(def xml-data "<element>&aacute;</element>")

(core/xml->clj xml-data {:strict-entities false})

;; => {:tag :element :attributes {} :content ["á"]}

(core/xml->clj xml-data {:strict-entities true})

;; => js/Error #Parser error

4.3. Utility functions

(require '[tubax.helpers :as th])

For simplicity, the following examples assume:

(require '[tubax.core :refer [xml->clj]])

(def result (xml->clj xml-data))

4.3.1. Access data-structure

(th/tag {:tag :item :attributes {} :content ["Text"]})
;; => :item

(th/attributes {:tag :item :attributes {} :content ["Text"]})
;; => {}

(th/children {:tag :item :attributes {} :content ["Text"]})
;; => ["Text"]

(th/text {:tag :item :attributes {} :content ["Text"]})
;; => "Text"

(th/text {:tag :item :attributes {} :content [{:tag :item :attributes {} :content [...]}]})
;; => nil

4.3.2. Find first node

These functions retrieve the first node that matches the query passed as an argument.

(th/find-first result {:tag :item})

;; => {:tag :item :attributes {} :content [{:tag :title :attributes {} :content ["Example entry"]} ...]}

(th/find-first result {:path [:rss :channel :description]})

;; => {:tag :description :attributes {} :content ["This is an example of an RSS feed"]}

Search for the first element that has the given attribute defined:

(th/find-first result {:attribute :isPermaLink})

;; => {:tag :guid :attributes {:isPermaLink "false"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}

Search for the first element that has an attribute with the specified value:

(th/find-first result {:attribute [:isPermaLink "true"]})

;; => {:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}

4.3.3. Find all nodes

These functions retrieve a lazy sequence of the elements that match the query passed as an argument.

(th/find-all result {:tag :link})

;; => ({:tag :link :attributes {} :content ["http://www.example.com/main.html"]}
;;     {:tag :link :attributes {} :content ["http://www.example.com/blog/post/1"]}
;;     {:tag :link :attributes {} :content ["http://www.example.com/blog/post/1"]})

(th/find-all result {:path [:rss :channel :item :title]})

;; => ({:tag :title :attributes {} :content ["Example entry"]}
;;     {:tag :title :attributes {} :content ["Example entry2"]})

(th/find-all result {:attribute :isPermaLink})

;; => ({:tag :guid :attributes {:isPermaLink "false"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]}
;;     {:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]})

(th/find-all result {:attribute [:isPermaLink "true"]})

;; => ({:tag :guid :attributes {:isPermaLink "true"} :content ["7bd204c6-1655-4c27-aeee-53f933c5395f"]})
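To give an intuition for what a tag or attribute query does, here is a minimal sketch built only on clojure.core. It is not Tubax's implementation. The functions nodes-with-tag and nodes-with-attribute are hypothetical, and doc is a hand-written tree in the shape shown above. The sketch assumes element nodes are maps whose children sit under :content and whose attributes sit under :attributes.

```clojure
;; Hand-written stand-in for a parsed document (an assumption, not parser output).
(def doc
  {:tag :channel
   :attributes {}
   :content [{:tag :link :attributes {} :content ["http://www.example.com/main.html"]}
             {:tag :item
              :attributes {}
              :content [{:tag :title :attributes {} :content ["Example entry"]}
                        {:tag :link :attributes {} :content ["http://www.example.com/blog/post/1"]}
                        {:tag :guid :attributes {:isPermaLink "false"} :content ["7bd204c6"]}]}]})

(defn nodes-with-tag
  "Hypothetical helper: lazy seq of every element under root (root included)
  whose :tag equals tag, in document order."
  [root tag]
  (->> (tree-seq map? :content root)   ; depth-first walk; text nodes are leaves
       (filter #(and (map? %) (= tag (:tag %))))))

(defn nodes-with-attribute
  "Hypothetical helper: lazy seq of every element that defines attribute attr."
  [root attr]
  (->> (tree-seq map? :content root)
       (filter #(and (map? %) (contains? (:attributes %) attr)))))

;; Collect the text of every <link>, wherever it appears in the tree.
(def link-texts
  (map (comp first :content) (nodes-with-tag doc :link)))
```

Because tree-seq and filter are both lazy, taking the first element of either helper only walks the tree as far as the first match, which is roughly the relationship one would expect between a find-first and a find-all style query.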

5. Contribute

Tubax does not have many restrictions for contributions. Just open an issue or pull request.

6. Running tests

lein test

7. License

This library is under the Apache 2.0 License.
