# Getting started with `delb`


This notebook gives a brief introduction with a few examples on TEI-XML
encoded transcriptions. The cells' output isn't pre-rendered, so you need to
execute the code cells in order to see their output. The first cell requires
an internet connection. The guide assumes that you have basic knowledge of
the Python programming language, XML markup, and text encodings that use
the latter.

The full API reference is available at https://delb.readthedocs.io/en/latest/api.html .

First, we will install the library, including the required dependencies to load 
resources over `https` and then import the `Document` class:

In [None]:
! pip install delb[https-loader]

# workaround for issue with entrypoint registration on mybinder.org
from _delb.plugins import https_loader # not necessary in local environments

from delb import Document

## Loading a document

`delb` can instantiate document representations from a variety of source
arguments, namely URLs, strings and `lxml` objects. If these are not enough,
more document loaders can be implemented and configured to be used. See the API
doc's *Document loaders* section as well as the chapter on *Extending delb* for
more on this.

Let's load R.L. Stevenson's *Treasure Island* from a web server:

In [None]:
treasure_island = Document(
    "https://ota.bodleian.ox.ac.uk/repository/xmlui/bitstream/handle/20.500.12024/5730/5730.xml"
)
print(f"{str(treasure_island)[:367]}\n[…]\n{str(treasure_island)[-384:]}")

#### Related API docs

[Document class](https://delb.readthedocs.io/en/latest/api.html#delb.Document),
[Document loaders](https://delb.readthedocs.io/en/latest/api.html#module-delb.loaders)

## Node types

The document's content is contained in a tree whose root node is available as
the `root` property of a document instance and is of the `TagNode` type.

In [None]:
root = treasure_island.root
print("The root node's name:", root.universal_name)
print("The number of the root's child nodes:", len(root))

Besides `TagNode`s, there are also `TextNode`s as well as `CommentNode`s and
`ProcessingInstructionNode`s. The latter two are filtered out by default;
there's more on this in the API doc's *Default filters* section.

Querying document contents is of course a common task, and in the context of
the whole document, CSS queries are straightforward. So, what title is recorded
in the document's header section?

In [None]:
title_node = treasure_island.css_select("titleStmt title").first
print(title_node.full_text)

For our purpose we want to manipulate the title to really just contain the
work's title without further textual annotation, so we need to fetch the containing
text node and alter its content:

In [None]:
text_node = title_node.first_child
text_node.content = "Treasure Island"
print(title_node.full_text)

#### Related API docs

[`Document.root`](https://delb.readthedocs.io/en/latest/api.html#delb.Document.root),
[`TagNode` class](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode),
[`TextNode` class](https://delb.readthedocs.io/en/latest/api.html#delb.TextNode),
[`CommentNode` class](https://delb.readthedocs.io/en/latest/api.html#delb.CommentNode),
[`ProcessingInstructionNode` class](https://delb.readthedocs.io/en/latest/api.html#delb.ProcessingInstructionNode),
[default filters](https://delb.readthedocs.io/en/latest/api.html#default-filters),
[`Document.css_select`](https://delb.readthedocs.io/en/latest/api.html#delb.Document.css_select),
[`TagNode.full_text`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.full_text)

## Navigating the tree

Now, let's find out what the author record says and use different ways that
`delb` provides to navigate the tree.

Given that we know the document has only one author and that the node
holding that information follows the `title`, it can simply be fetched
by targeting the right sibling:

In [None]:
author_node = title_node.next_node()
print(author_node.full_text)

But a more generic approach would consider that there could be several authors
and that there's no constraint regarding the order of nodes in the containing
`titleStmt` node. Therefore we define a filter that matches only nodes with
the name `author` and use it after one that only passes `TagNode`s to fetch
all matching child nodes of the `titleStmt`:

In [None]:
from delb import is_tag_node

def has_author_name(node: "NodeBase") -> bool:
    return node.local_name == "author"

title_statement = title_node.parent
author_nodes = title_statement.child_nodes(is_tag_node, has_author_name)
print(author_nodes)

Wait, `author_nodes` is a generator object, now what? Or first of all, why?
Employing generators allows lazy evaluation of iterables and avoids the large
intermediate containers that could be expected in operations like this:

```python
def is_paragraph_node(node):
    return is_tag_node(node) and node.local_name == "p"

for node in root.child_nodes(is_paragraph_node, recurse=True):
    # if `child_nodes` returned a list instead of a generator, a rather big
    # list could be allocated in memory while only one item is used at a time
    # within the loop; also, you might want to `break` out of the loop early
    pass
```
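The way several filters can be combined with lazy evaluation can be sketched in plain Python. This is only an illustration, not `delb`'s implementation: the `filtered` helper and both predicates are hypothetical, strings stand in for nodes, and the sketch assumes that an item has to pass every filter to be yielded.

```python
from typing import Callable, Iterable, Iterator

Filter = Callable[[str], bool]


def filtered(items: Iterable[str], *filters: Filter) -> Iterator[str]:
    """Lazily yield the items that pass every given filter (hypothetical helper)."""
    for item in items:
        if all(f(item) for f in filters):
            yield item


def is_lowercase(name: str) -> bool:
    return name.islower()


def is_short(name: str) -> bool:
    return len(name) <= 4


names = ["title", "Author", "head", "div", "P"]
matches = filtered(names, is_lowercase, is_short)

# only as many items as needed are examined to produce the first match
first_match = next(matches)
# the rest is evaluated when the generator is consumed further
remaining = list(matches)
```

No intermediate list is built until `list` is called explicitly; the filters run item by item as the generator is advanced.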

Since we are after the author names and not the containing nodes, we can use
that generator (once) in a
[list comprehension](https://docs.python.org/glossary.html#term-list-comprehension):

In [None]:
[node.full_text for node in author_nodes]
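Why the generator can be used only once is a property of Python generators in general. A minimal sketch with a hypothetical `numbers` generator function:

```python
def numbers():
    """A hypothetical generator function that yields three values."""
    yield from (1, 2, 3)


gen = numbers()
first_pass = [n * 2 for n in gen]
# the generator is exhausted now, so a second pass finds nothing
second_pass = [n * 2 for n in gen]
```

If the items are needed more than once, they can be stored in a list first, or a fresh generator can be requested.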

#### Related API docs

[`TagNode.next_node`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.next_node),
[`is_tag_node`](https://delb.readthedocs.io/en/latest/api.html#delb.is_tag_node),
[`TagNode.local_name`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.local_name),
[`TagNode.parent`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.parent),
[`TagNode.child_nodes`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.child_nodes),

## Rearranging nodes

How about we create another document that contains a table of contents, or
rather a tree, and sparse title information? As this shall demonstrate the
manipulation of trees, we start off with a copy of the document (instead of
just extracting the information and building a new tree, which would be more
appropriate in a real application). First we define a namespace, register it
with a prefix for serializations and clone the document to get a copy of its
root node.

In [None]:
from delb import register_namespace

TOC_NS = "https://t.oc/"
register_namespace("toc", TOC_NS)

root = treasure_island.clone().root

Next we select and clone the nodes containing the title and author information,
alter their namespace and place them as the first child nodes of the root. Note
that, except for `replace`, all methods that add nodes to a tree can take a
variable number of nodes as arguments, so an iterable can be unpacked into
them by prefixing it with `*`.

In [None]:
nodes = [
    node.clone(deep=True) for node in root.css_select("titleStmt title, titleStmt author")
]

for node in nodes:
    node.namespace = TOC_NS

root.insert_child(0, *nodes)
print(str(root)[:213])
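The `*` unpacking used in the call above is plain Python and works with any function that accepts a variable number of positional arguments. In the following sketch, `insert_into` is a hypothetical stand-in that operates on a list of strings; it is not a `delb` method.

```python
from typing import List


def insert_into(children: List[str], index: int, *nodes: str) -> List[str]:
    """Return a copy of `children` with all `nodes` inserted at `index`."""
    result = list(children)
    result[index:index] = nodes
    return result


new_nodes = ["title", "author"]

# both calls are equivalent; the first unpacks the list into two arguments
unpacked = insert_into(["teiHeader", "text"], 0, *new_nodes)
explicit = insert_into(["teiHeader", "text"], 0, "title", "author")
```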

It's noteworthy at this point that `delb` doesn't allow you to just move nodes
around within or between trees. Following the paradigm that *explicit is better
than implicit*, it doesn't detach nodes for you, and you have to either `clone`
(as before or by passing the `clone` argument as `True`) or `detach` a
non-root node before you insert it into a tree.
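The rule can be pictured with a toy model. The `ToyNode` class below is purely hypothetical and far simpler than `delb`'s node types; in particular, the `ValueError` is an assumption of this sketch and says nothing about which exception `delb` raises.

```python
from typing import List, Optional


class ToyNode:
    """A toy tree node that refuses to adopt nodes that are still attached."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["ToyNode"] = None
        self.children: List["ToyNode"] = []

    def append_child(self, node: "ToyNode") -> None:
        if node.parent is not None:
            raise ValueError(f"{node.name!r} is attached; clone or detach it first")
        node.parent = self
        self.children.append(node)

    def detach(self) -> "ToyNode":
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        return self

    def clone(self) -> "ToyNode":
        copy = ToyNode(self.name)
        for child in self.children:
            copy.append_child(child.clone())
        return copy


source, target = ToyNode("source"), ToyNode("target")
title = ToyNode("title")
source.append_child(title)

try:
    target.append_child(title)  # an implicit move is rejected
    moved_implicitly = True
except ValueError:
    moved_implicitly = False

target.append_child(title.clone())   # the source tree keeps its node
target.append_child(title.detach())  # the source tree gives up its node
```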

Then we get rid of the `teiHeader` and its contents. As the references to the
detached nodes are lost, the nodes themselves will be removed from the heap
upon garbage collection.

In [None]:
root.css_select("teiHeader").first.detach()
print(str(root)[:208])

For the directory of sections, we'll take the more straightforward way of
extracting information and assembling a new tree. So let's make a container for
the items:

In [None]:
contents = root.new_tag_node("contents", namespace=TOC_NS)
print(contents)

We'll need to implement the data conversion in a function because we need to
recursively scan for subsections. Also, filter functions are defined and used
to iterate over the relevant nodes. To reduce imperative instructions, the
`tag` function is employed, which allows a brief, declarative way to build
subtrees.

In [None]:
from delb import first, tag

def is_head(node: "NodeBase") -> bool:
    return node.local_name == "head"

def is_section(node: "NodeBase") -> bool:
    return node.local_name == "div"


def extract_section_titles(node: "TagNode") -> "List[TagNode]":
    result = []
    for child_node in node.child_nodes(is_tag_node, is_section):
        head = first(child_node.child_nodes(is_tag_node, is_head))
        section_item = node.new_tag_node("section", namespace=TOC_NS, children=[
            tag("title", head.full_text)
        ])
        result.append(section_item)

        subsections = extract_section_titles(child_node)
        if subsections:
            section_item.append_child(tag("subsections", subsections))

    return result

body = root.css_select("text body").first
contents.append_child(*extract_section_titles(body))
root.append_child(contents)
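For comparison, the same recursive pattern can be tried without a network connection on a small inline sample. The sketch below deliberately uses the standard library's `xml.etree.ElementTree` instead of `delb`; the sample markup and the `section_titles` function are made up for illustration and ignore namespaces.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE = """
<body>
  <div><head>Part One</head>
    <div><head>Chapter 1</head><p>text</p></div>
    <div><head>Chapter 2</head><p>text</p></div>
  </div>
  <div><head>Part Two</head></div>
</body>
"""


def section_titles(node: ET.Element) -> List[ET.Element]:
    """Return one `section` element per direct `div` child, nested recursively."""
    result = []
    for div in node.findall("div"):  # direct children only
        section = ET.Element("section")
        ET.SubElement(section, "title").text = div.findtext("head", default="")
        subsections = section_titles(div)
        if subsections:
            ET.SubElement(section, "subsections").extend(subsections)
        result.append(section)
    return result


toc = ET.Element("contents")
toc.extend(section_titles(ET.fromstring(SAMPLE)))
toc_titles = [title.text for title in toc.iter("title")]
print(ET.tostring(toc, encoding="unicode"))
```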

Finally we'll alter the root node's identity and get rid of the original
contents. For that last bit, we'll first collect the target nodes in a list,
because a tree, like a list, is a mutable object and must not be changed while
iterating over it.

In [None]:
root.namespace = TOC_NS
root.local_name = "book_contents"

def namespace_filter(node: "NodeBase") -> bool:
    return node.namespace != TOC_NS

for node in list(root.child_nodes(is_tag_node, namespace_filter)):
    node.detach()
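What can go wrong when a container is changed during iteration is easiest to see with a plain Python list. The sketch only demonstrates list behaviour; how a tree reacts to the same mistake is not shown here.

```python
items = ["a", "b", "c", "d"]
for item in items:
    # removing while iterating shifts the remaining items, so some are skipped
    items.remove(item)
skipped = list(items)

items = ["a", "b", "c", "d"]
for item in list(items):
    # iterating over a snapshot leaves the loop unaffected by the removals
    items.remove(item)
```

After the first loop two items are left over, while the second loop empties the list as intended.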

#### Related API docs

[`register_namespace`](https://delb.readthedocs.io/en/latest/api.html#delb.register_namespace),
[`Document.clone`](https://delb.readthedocs.io/en/latest/api.html#delb.Document.clone),
[`TagNode.css_select`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.css_select),
[`TagNode.namespace`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.namespace),
[`TagNode.insert_child`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.insert_child),
[`TagNode.detach`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.detach),
[`TagNode.new_tag_node`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.new_tag_node),
[`tag`](https://delb.readthedocs.io/en/latest/api.html#delb.tag),
[`TagNode.append_child`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.append_child)

## Saving documents

All right then, to wrap it all up, we'll attach the newly created tree to a
document, clean up the namespace-prefix mess (well, mostly; unfortunately
there's no way yet to declare a namespace that is already associated with a
prefix as the default namespace) and save the document to disk:

In [None]:
contents = Document(root)

from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp_path:
    target = Path(tmp_path) / "treasure_island_sections.xml"
    contents.save(target, pretty=True)
    print(target.read_text())

#### Related API docs

[`Document.save`](https://delb.readthedocs.io/en/latest/api.html#delb.Document.save)