Codiicsa - Asciidoc Reversed

Introduction

Codiicsa is an extendable Python program for converting Docbook XML to the Asciidoc markup language.

By default it is configured to reverse Docbook that follows the conventions Asciidoc uses when it generates Docbook. Documents using other conventions will usually need some configuration.

Codiicsa is in its initial stages of development but so far it converts nested sections, simple lists, some text markup and some blocks.

Codiicsa outputs the tags for XML that it doesn’t understand to indicate where manual editing is needed.

Note	Codiicsa outputs the tags and contents for unrecognised elements, but not the attributes of those elements.

Usage

python codiicsa.py infile outfile

Design Notes

Codiicsa operates by parsing the input Docbook into a tree and then generates the output during a depth-first tree walk. The output is kept in memory so that the processing for a node can edit the output of its children if needed.

Peculiarities of ElementTree

Codiicsa uses the ElementTree parser, whose data model affects how the resulting tree has to be processed. The contents of each XML element are stored as:

the initial text in the text attribute on the node representing the element, and
a list of child nodes.

Text between the child nodes, which is logically part of the text of the parent element elem, is held in the tail attributes of the child nodes.

If elem represents a node which does not contain text (either by Docbook rules or by convention) then the tail attributes can be ignored since they are just whitespace to make the Docbook look nice (Asciidoc generates a lot of this).

But if elem represents a node that can contain text, care must be taken to ensure that when elem is processed the tails of its children are processed too. The tail of elem itself is outside its closing tag and is not part of its processing.

Attributes of ElementTree that can be useful in your processing:

elem.text contains the text as described above.
elem.tail contains the text following the element’s closing tag.
elem.get( name ) gets the XML attribute value for the attribute name or None if the element doesn’t have that attribute.
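
The text and tail rules above can be seen with the standard library parser alone. This is a small illustration, not Codiicsa code; the sample XML and the flatten helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# A small Docbook-like fragment with text mixed between child elements.
para = ET.fromstring(
    '<para id="p1">Hello <emphasis>big</emphasis> wide '
    '<literal>x</literal> world</para>'
)
emphasis, literal = list(para)

# Text before the first child is stored on the parent's text attribute.
print(repr(para.text))       # 'Hello '
# Text after a child's closing tag is stored on that child's tail.
print(repr(emphasis.tail))   # ' wide '
print(repr(literal.tail))    # ' world'
# get() returns None for an attribute the element does not have.
print(para.get('id'), para.get('role'))   # p1 None


def flatten(elem):
    """Rebuild the full text of elem, including the tails of its children."""
    parts = [elem.text or '']
    for child in elem:
        parts.append(flatten(child))
        parts.append(child.tail or '')   # the tail belongs to the parent's text
    return ''.join(parts)


print(flatten(para))         # Hello big wide x world
```

Note that text and tail are None rather than an empty string when there is no text, so code that joins them usually needs an `or ''` guard as in flatten.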

The Out class

Codiicsa accumulates its output as a list of strings rather than concatenating it into a single string, since Python may build a whole new string each time one is modified (whether it does depends on the exact circumstances).

The Out class is a convenience wrapper for a list that makes it simpler to add strings and other Out objects, for example those returned from processing child nodes.

The Out constructor takes either a string or a list of strings as the initial list of strings.

Out and string objects can be combined using the + operator, making it convenient to express a sequence of output. For example, the code to generate the Asciidoc link syntax:

<<target,text>>

from the XML

<link linkend="target">text</link>

is

out = Out( '<<' ) + elem.get( 'linkend' ) + ',' + elem.text \
      + self.Children( elem ) + '>>' + elem.tail

Note that if the first item had not been made into an Out object it would still have worked, but the first four items ( <<, the linkend attribute value, the , and the element’s text) would have been concatenated into a single string by the standard Python + operator before that string was prepended to the Out object returned from processing the children.

Because the + operator works for Out objects, Python makes += work as well, so the following appends to out.

out += 'blah blah' + elem.text
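
To make the operator behaviour concrete, here is a minimal sketch of a list wrapper that behaves as described. It is not the real Out class: the name OutSketch, the parts attribute and the choice to treat None as empty are all assumptions made for the illustration.

```python
class OutSketch:
    """Hypothetical stand-in for Out: a thin wrapper around a list of strings."""

    def __init__(self, initial=None):
        if initial is None:
            self.parts = []
        elif isinstance(initial, str):
            self.parts = [initial]
        else:
            self.parts = list(initial)

    @staticmethod
    def _as_list(other):
        if isinstance(other, OutSketch):
            return other.parts
        if other is None:          # assumption: missing text or tail adds nothing
            return []
        if isinstance(other, str):
            return [other]
        return list(other)

    def __add__(self, other):      # OutSketch + something
        return OutSketch(self.parts + self._as_list(other))

    def __radd__(self, other):     # something + OutSketch
        return OutSketch(self._as_list(other) + self.parts)

    def __str__(self):
        return ''.join(self.parts)


children = OutSketch()             # pretend the link has no child elements
out = OutSketch('<<') + 'target' + ',' + 'text' + children + '>>' + None
print(out)                         # <<target,text>>

# Starting with plain strings: '<<' + 'target' is joined first by str,
# then __radd__ prepends the result to the OutSketch.
plain_first = '<<' + 'target' + OutSketch(['>>'])

# No __iadd__ is defined, so += falls back to __add__.
out += 'blah blah'
```

Python falls back to `__add__` for += when a class does not define `__iadd__`, which is the behaviour the preceding paragraph relies on.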

Configuration

Codiicsa is designed to be extended or modified without changing the original code. This means that you do not have to attempt to merge your changes if bug fixes or extensions are made to the original Codiicsa.

Please read the Design Notes section first; it provides background that helps in understanding the process.

Codiicsa specifies all processing using a Python class with methods named to match the Docbook element names. When an element is processed the appropriately named method is called with the element tree node representing the docbook element as its parameter.

Note	The preceding paragraph is the key point: method names match docbook element names. That’s how Codiicsa finds them, so the case must match as well.

This allows extension or overriding in the way that cascading style sheets or the Asciidoc configuration files work. Methods in a child class are added to the class or override an existing method of the same name. You don’t need to change the original to configure it.
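
The lookup-by-name idea can be sketched in a few lines. This is an illustration of the technique, not Codiicsa's actual code: the class names, the plain-string output and the markup chosen for each element are assumptions.

```python
import xml.etree.ElementTree as ET


class BaseSketch:
    """Hypothetical converter: one method per Docbook element name."""

    def Process(self, elem):
        # Find a method whose name is exactly the tag name (case matters).
        handler = getattr(self, elem.tag, None)
        if handler is None:
            # Unrecognised element: output the tags and the contents.
            return '<%s>%s</%s>' % (elem.tag, self.Children(elem), elem.tag)
        return handler(elem)

    def Children(self, elem):
        out = elem.text or ''
        for child in elem:
            out += self.Process(child) + (child.tail or '')
        return out

    def para(self, elem):
        return self.Children(elem) + '\n'

    def emphasis(self, elem):
        return '_' + self.Children(elem) + '_'


class MySketch(BaseSketch):
    def emphasis(self, elem):      # overrides the base class method
        return '*' + self.Children(elem) + '*'

    def literal(self, elem):       # adds handling the base class lacks
        return '+' + self.Children(elem) + '+'


doc = ET.fromstring(
    '<para>a <emphasis>b</emphasis> <literal>c</literal> <foo>d</foo></para>'
)
print(BaseSketch().Process(doc))   # a _b_ <literal>c</literal> <foo>d</foo>
print(MySketch().Process(doc))     # a *b* +c+ <foo>d</foo>
```

The helper methods start with an uppercase letter so that a lowercase tag name cannot accidentally select them.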

As a concrete (but silly) example, let’s output 'bibliorefs not handled' when a <biblioref> element is seen, and let’s override the existing <link> element processing to output 'links are good'.

This example will also serve to introduce a couple of other pieces of necessary boilerplate.

The user python file (say my_file.py) is:

import codiicsa # (1)
import sys # to get the command line args

class my_stuff( codiicsa.docbook_article ) : # (2)

    def __init__( self, *args ) :
        codiicsa.docbook_article.__init__( self, *args ) # (3)

    def biblioref( self, elem ) :
        return codiicsa.Out( 'bibliorefs not handled' )

    def link( self, elem ) :
        return codiicsa.Out( 'links are good' )

codiicsa.convert( sys.argv[1], sys.argv[2], my_stuff ) # (4)

(1) Put the codiicsa.py file in the same directory so Python can find it to import.
(2) Derive your extension from the appropriate document type class, here article.
(3) Boilerplate to pass initial parameters to the base class without caring how many there are. Use keyword arguments if you need to add your own initial parameters. See Constructor Parameters below for the standard parameters.
(4) Call the convert routine, passing the input file name, the output file name and your class.

This configuration file overrides the standard Codiicsa processing of <link> and adds processing for <biblioref> which is not currently processed.

The program is then used by:

python my_file.py infilename outfilename

Adding top level elements

Codiicsa currently only supports the article top level element, but you can add support for others.

Classes defining top level elements must be derived from codiicsa.docbook_common instead of codiicsa.docbook_article.

The class must have a method for the new top level element, eg book, and for any specific contained elements you want to process, eg bookinfo.

You have to set the initial value for the section level counter so that the correct underlining is used for section titles (see docbook_article.article() ).

Constructor Parameters

The constructors of the Codiicsa base classes codiicsa.docbook_common and codiicsa.docbook_article have the following parameters.

codiicsa.classname( tree, all_ids = False )

tree: is the tree generated by ElementTree. Used to generate the dictionary of child to parent links.
all_ids: specifies whether Pre (see below) outputs all ids. When False (the default) ids that start with an underscore are ignored.
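
ElementTree nodes carry no reference to their parent, which is why a child to parent dictionary is needed. How Codiicsa builds its dictionary is not shown here; the following is the usual ElementTree idiom, with a made-up sample document and variable names.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<article><section><title>T</title><para>P</para></section></article>'
)
tree = ET.ElementTree(root)

# Map every child element to its parent by visiting each node once.
parents = {child: parent for parent in tree.iter() for child in parent}

title = root.find('section/title')
print(parents[title].tag)            # section
print(parents[parents[title]].tag)   # article
print(root in parents)               # False, the root has no parent
```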

Library Methods

Codiicsa provides a set of processing methods (ie you call them with self.name) that provide common processing capabilities.

All names begin with an uppercase letter to minimise clashes with methods named for tag names. If tag names do clash, call the library method by:

self.Processing.method( ... )

Pre

self.Pre( elem, inline = False, attrs = [] )

Generate prefix markup based on attributes of elem. Generates:

[[id]]: Based on the id attribute. Ids starting with underscore are ignored on the assumption that they are machine generated and will be re-generated. The all_ids initialisation parameter controls this behavior.
[attrs, role=…]: Outputs a comma separated list of the positional attributes passed as attrs and the role attribute if it is present on the element.

The inline parameter controls whether the [[id]] is inline or on a line by itself.
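
As a rough illustration of the rules above, the following sketch builds the same kind of prefix from an element. It is not the real Pre method: the function name, the plain-string return value, the quoting of the role value and the way inline suppresses the newline after the attribute list are all assumptions.

```python
import xml.etree.ElementTree as ET


def pre_sketch(elem, inline=False, attrs=(), all_ids=False):
    """Hypothetical re-creation of the prefix markup described above."""
    out = []
    ident = elem.get('id')
    # Skip machine generated ids (leading underscore) unless all_ids is set.
    if ident and (all_ids or not ident.startswith('_')):
        out.append('[[' + ident + ']]' + ('' if inline else '\n'))
    items = list(attrs)
    role = elem.get('role')
    if role is not None:
        items.append('role="' + role + '"')   # quoting style is an assumption
    if items:
        out.append('[' + ','.join(items) + ']' + ('' if inline else '\n'))
    return ''.join(out)


block = ET.fromstring('<para id="intro" role="lead"/>')
generated = ET.fromstring('<section id="_design_notes"/>')

print(pre_sketch(block, attrs=['quote']))
print(repr(pre_sketch(generated)))            # '' because the id is ignored
```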

Process

self.Process( elem )

Process the specified element to an Out object. Only used if out of order processing is required.

Note	Use self.Children() to process children.

Children

self.Children( elem, do = None, dont = set() )

Process the children of elem.

do

specifies the tag names of the children to process. The values can be:

None: (default) all children
a string: just process children of that tagname
a set: just process children with tagnames in the set

dont

if do is None (ie all) dont is a set specifying exceptions.

returns

an Out object
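
The do and dont selection rules can be sketched as a standalone filter. This is an illustration only: select_children is a hypothetical helper that returns the chosen child elements rather than an Out object, and accepting a plain string for dont is an assumption based on the dont='title' usage mentioned elsewhere in this document.

```python
import xml.etree.ElementTree as ET


def select_children(elem, do=None, dont=frozenset()):
    """Return the children of elem chosen by the do / dont rules."""
    # Allow a single tag name to be given as a plain string.
    if isinstance(dont, str):
        dont = {dont}
    if do is None:
        # Default: all children, minus the exceptions in dont.
        return [child for child in elem if child.tag not in dont]
    if isinstance(do, str):
        do = {do}
    return [child for child in elem if child.tag in do]


section = ET.fromstring(
    '<section><title>T</title><para>a</para><note>n</note><para>b</para></section>'
)


def tags(children):
    return [child.tag for child in children]


print(tags(select_children(section)))                 # all four children
print(tags(select_children(section, do='para')))      # only the paras
print(tags(select_children(section, dont='title')))   # everything but the title
```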

Strip, Stripl, Stripr

self.Strip( from, chars )
self.Stripl( from, chars )
self.Stripr( from, chars )

Remove unwanted characters from the ends of the output. Strip removes from both ends, Stripl removes from the start (left) end and Stripr removes from the end (right) end.

from: String, list or Out object to be stripped.
chars: A string specifying a set of characters to remove.
returns: An Out object with a single string.

Note	Works by catenating the input to a single string then stripping it.
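
The join-then-strip approach in the note can be sketched as follows. The function is hypothetical: it takes a string or any iterable of strings and returns a one-element list as a stand-in for an Out object holding a single string.

```python
def strip_sketch(source, chars, left=True, right=True):
    """Join source into one string, then strip chars from the chosen ends."""
    text = source if isinstance(source, str) else ''.join(source)
    if left:
        text = text.lstrip(chars)
    if right:
        text = text.rstrip(chars)
    return [text]    # stand-in for an Out object with a single string


pieces = ['\n\n', 'Some ', 'text', '\n']
print(strip_sketch(pieces, '\n'))                 # both ends, like Strip
print(strip_sketch(pieces, '\n', right=False))    # left end only, like Stripl
print(strip_sketch(pieces, '\n', left=False))     # right end only, like Stripr
```

Joining first matters because the characters to remove may span several list items, as the two leading newlines and the trailing newline do here.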

Underline_title

Generates an Asciidoc underlined title from the <title> child of the element. The underline character used is chosen from the current <section> nesting level.

self.Underline_title( elem )

returns: an empty Out if there is no <title> child or one containing the title.

Note	Since Underline_title has already processed the <title> child subsequent processing of the children of this element should specify dont='title'.
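
For reference, this is roughly what an underlined title looks like when built by hand. The sketch assumes the standard Asciidoc two-line title characters (=, -, ~, ^, + for levels 0 to 4); how Codiicsa maps its section counter onto these levels is not shown here, and the function name is made up.

```python
# Underline characters for Asciidoc two-line titles, levels 0 to 4.
UNDERLINES = '=-~^+'


def underline_title_sketch(title, level):
    """Return title followed by an underline of the same length."""
    if not 0 <= level < len(UNDERLINES):
        raise ValueError('Asciidoc only has title levels 0 to 4')
    return title + '\n' + UNDERLINES[level] * len(title) + '\n'


print(underline_title_sketch('Codiicsa - Asciidoc Reversed', 0))
print(underline_title_sketch('Design Notes', 1))
print(underline_title_sketch('The Out class', 2))
```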

Block_title

Generates an Asciidoc block title from the <title> child of the element.

self.Block_title( elem )

returns: an empty Out if there is no <title> child or one containing the title.

Note	Since Block_title has already processed the <title> child subsequent processing of the children of this element should specify dont='title'.

Library Functions

The library functions are not members of a class so they are called conventionally. Since they can’t clash with tag names they have lower case names.

convert

Use the specified class to convert the input file to the output file.

convert( infile, outfile, \
         dbclass = None, cargs = [], kcargs = {}, \
         cwsl = True )

infile: Filename of the input docbook file.
outfile: Filename to write the Asciidoc to, overwrites existing files.
dbclass: The name of the class to use for the conversion processing. If None (the default) one of the standard classes is used based on the docbook root element type.
cargs: List of parameters to be passed to the dbclass constructor (see Constructor Parameters) in addition to the tree.
kcargs: Dictionary of keyword parameters to be passed to the dbclass constructor.
cwsl: Crush whitespace lines (see below).

For readability Docbook is often formatted with nested tags separated by newlines and indented so that they look like:

<articleinfo>
    <title>AsciiDoc User Guide</title>
    <author>
        <firstname>Stuart</firstname>
        <surname>Rackham</surname>
        <email>srackham@gmail.com</email>
    </author>
    <authorinitials>SJR</authorinitials>
</articleinfo>

When an element can contain text, technically this whitespace is part of the element, but it produces a lot of whitespace lines in the Asciidoc output.

When cwsl is true (the default) contiguous whitespace lines are reduced to a single blank line, since in Asciidoc more than one is redundant. This may be annoying if you have literal elements containing multiple blank lines. cwsl can be set to False to prevent crushing, but hand editing the literals is likely to take much less effort than removing the redundant whitespace that is otherwise produced.
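
The crushing step can be pictured with a short sketch. It is not the code Codiicsa uses: the function name is made up, and it is assumed that a whitespace line is any line that is empty or contains only spaces and tabs.

```python
def crush_whitespace_lines(text):
    """Reduce each run of whitespace-only lines to a single blank line."""
    out = []
    previous_blank = False
    for line in text.split('\n'):
        blank = line.strip() == ''
        if not blank:
            out.append(line)
        elif not previous_blank:
            out.append('')       # keep the first blank line of a run
        previous_blank = blank
    return '\n'.join(out)


sample = 'Title\n   \n\n\t\nFirst paragraph.\n\n\n\nSecond paragraph.\n'
print(crush_whitespace_lines(sample))
```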
