Skip to content

Latest commit

 

History

History

zipper-primer

Zipper Primer

First steps into the wonderful world of uniform hierarchical data processing

Hierarchy is everywhere

When thinking of hierarchical data, words like "XML", "XHTML", "Relational Database" or "a tree" may come into mind. But these are forms, structures, representations or organisations, not data.

Let's focus on data: Hierarchical data belonging to a certain domain may be obvious: A genealogy software will unsurprisingly use terms like "parents", "children" and the like.

But even if not required by the domain, you can choose to see hierarchies: "This article has been read by these persons" or "this operand is part of that equation" or "this list is just a degenerated tree".

Parts and pieces

Hierarchical data consists of nodes and edges. A node may have a value.

Hierarchy

For now, let's assume that there will be exactly one path between two nodes. And we call it a tree. The-node-where-it-all-starts is called "root". But under these circumstances, the root is simply a point of view:

Hierarchy viewpoint Hierarchy yet another point of view

Data modelling

A naive data model approach of the arboreal first view could be a map:

    (def data {:root [{:node [1 2]} {:node [3 4]}]})

But root is an unnecessary part of information, because there will always be exactly one root. Same goes for node, because there will be only nodes. Stripping root and nodes apart leads to

    (def data [[1 2] [3 4]])

And this is a nested vector. Working with nested vectors is easy: You can access and change single values:

    (get-in data [0 0])
    (assoc-in data [0 1] "CUGB")

Now let's use a map to model this talk:

    (def talk {:talk "Clojure Zipper Primer"
               :speaker {:person {:firstname "Ingo"
                                  :lastname "Küper"}}}

You can read a single value:

    (get-in talk [:speaker :person :lastname])

You can even add a nested value:

    (assoc-in talk [:speaker :company] "doctronic GmbH & Co. KG")

So Clojure has great support for working with hierarchical data models. That's all nice and easy if you know the path of the involved element or if you can calculate that path. But you can not move nor navigate within the hierarchical data. (Unless you do some constant path (re)calculations, of course.)

But the greates disadvantage by far is the data model leaking in the algorithm:

  • If you use nested vectors, you have to use indices.
  • If you use nested maps, you have to use keywords.
  • If you use XML, you may find yourself lost in the DOM- or SAX-API.
  • Your algorithm will be flooded by details of the data model. Changing the data model from nested vectors to XML means rewriting the algorithm.

And that's no good. Let's decouple the algorithm from the concrete data model.

Zippers to the rescue

Zippers were introduced by Gérard Huet in 1997.

A zipper separates the tree's structure from its values. You will navigate through, examine and even reconstruct the tree by the use of Locations. A Location is a (horrific simple) data structure managed by the Zipper API. It contains information about the parent and the children and the siblings of a node. Usually, you have exactly one current node and therefore one current Location, but you're not limited to that.

To use zippers, add

    (:require [clojure.zip :as zip])

to your namespace.

Let's create a zipper for the nested vector representation of the tree, move downwards two times and get the node's value:

    (def data [[1 2] [3 4]])

    (-> data
        (zip/vector-zip)
        (zip/next)
        (zip/next)
        (zip/node))
  • zip/vector-zip returns the Location information of the root node.
  • zip/next operates on Locations. It moves downwards, because it uses a Depth-First approach to get to the next node. The result is the Location of the next node.
  • zip/node takes a Location of a node and returns the node's data.

Combined with the power of Clojure (data_processing.clj)

A Leaf is a node without children.

Let's use Function Composition and Sequences to get the values of all Leaves.

    (def data [[1 2] [3 4]])

    (defn leaf-nodes
      [zipper]
      (->> zipper
           (iterate zip/next)
           (take-while (complement zip/end?))
           (filter (complement zip/branch?))
           (map zip/node)))

    (-> data
        (zip/vector-zip)
        (leaf-nodes))
  • (iterate zip/next) returns an infinite sequence of nodes, repeating the last node ad infinitum.
  • (take-while (complement zip/end?)) returns a finite sequence of all nodes.
  • (filter (complement zip/branch?)) returns a sequence of non-branching (a.k.a. Leaf) nodes.
  • (map zip/node) returns a sequence of node values.

Great fun! Now let's change the value of every Leaf to "CUGB":

    (def data [[1 2] [3 4]])

    (defn replace-leaf-nodes
      [zipper]
      (->> zipper
           (iterate (fn [loc]
                      (zip/next
                       (if (zip/branch? loc)
                         loc
                         (zip/replace loc "CUGB")))))
           (filter zip/end?)
           (first)
           (zip/root)))

    (-> data
        (zip/vector-zip)
        (leaf-nodes))

Can you figure out how repleace-leaf-nodes works?

Changing the data model

Magnificent and marvelous so far!

Now let's change the data model. Same data, other model:

A calculation recipe, which will result in a lazy-sequence of values:

    (def data (partition 2 (range 1 5)))

Or a XML representation:

    (def data
      "<root>
         <branch>
           <node>1</node>
           <node>2</node>
         </branch>
         <branch>
           <node>3</node>
           <node>4</node>
         </branch>
       </root>")

And the only thing you need to change is the creation of the zipper!

Use

    (zip/seq-zip data)

for sequences and

    (zip/xml-zip (parse-xml-string data))

for XML data, given as a string and parsed by a simple function like

    (defn parse-xml-string
      [s]
      (xml/parse (java.io.ByteArrayInputStream. (.getBytes s))))

Use

    (xml/emit (replace-leaf-nodes
                (zip/xml-zip (parse-xml-string data))))

at the REPL to get

    <?xml version='1.0' encoding='UTF-8'?>
    <root>
      <branch>
        <node>
          CUGB
        </node>
        <node>
          CUGB
        </node>
      </branch>
      <branch>
        <node>
          CUGB
        </node>
        <node>
          CUGB
        </node>
      </branch>
    </root>

Amazing! We can change the data model without affecting our data processing functions!

Data transformation (data_transformation.clj)

But there is even more a zipper can do for you:

If you have a flat data model, but the data could be interpreted in one (or more!) hierarchical way, you don't need to convert them! Use a zipper!

Let's say we got a flat data structure of customers, projects and teams:

    (def data
      [{:customer 1
        :project "Clojure User Group"
        :team "Clojure"}
       {:customer 1
        :project "Java Web-Services"
        :team "Java"}
       {:customer 2
        :project "Clojure Web-Services"
        :team "Clojure"}
       {:customer 2
        :project "Java User Group"
        :team "Java"}])

You can break down several information: "Show me all projects per customer" or "Show me all projects per team" or "Show me all projects per customer per team" for example.

Using a zipper, you may end up having this piece of code

    (->> data
         (group-all-by :customer)
         (group-all-by :team))

to get this result

    {:customer
      {1
        {:team
          {"Clojure"
            [{:customer 1, :project "Clojure User Group", :team "Clojure"}],
           "Java"
            [{:customer 1, :project "Java Web-Services", :team "Java"}]}},
       2
         {:team
           {"Clojure"
             [{:customer 2, :project "Clojure Web-Services", :team "Clojure"}],
            "Java"
             [{:customer 2, :project "Java User Group", :team "Java"}]}}}}

To use a zipper, you only need to define three functions

  • a predicate which returns true for nodes which may have children.
  • a function to return a nodes children, if any.
  • a function to create a node with a given set of children

and call zip/zipper to create the Location of the root node.

Challenges

To get your head around zippers, you need to think in Locations within a hierarchical structure and wandering around from one Location to another. If you need a node's value, you take the value at the node's Location.

Thankfully, changing a tree isn't possible. You will always get a new tree. If you change the tree's structure by adding or removing nodes, you get a new Location. So you may end up having four different worlds:

  • the values (nodes) of the old tree
  • the structural information (Locations) of the old tree
  • the nodes of the new tree
  • the Locations of the new tree

The "new tree" is sometimes called "future tree", because you're just at a certain Location and haven't seen the full new tree yet. And in fact it doesn't even exists at that point of time. To materialize the new tree, you need to head to its root and emit the whole beast.

Resources