Cassandra Targett edited this page Aug 18, 2016 · 4 revisions

Content Conversion Process

  • Export pages as HTML from Confluence

  • Lucidworks conversion ant tasks found in confluence-export/conversion-tools/jsoup/build.xml:

    • Clean up confluence styles (ant scrape)

    • Add document hierarchy as links to each page (ant map then ant hier)

  • Pandoc for html → asciidoc conversion

(Kudos to Mitzi Morris at Lucidworks for the scripts.)
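The steps above could be scripted roughly as follows. This is only a sketch: the ant targets and the build file path come from this page, but whether the targets need extra properties, the `.adoc` output extension, and the helper names `ant_commands` and `pandoc_command` are assumptions made for illustration.

```python
from pathlib import Path

# Build file location as given on this page.
BUILD_FILE = "confluence-export/conversion-tools/jsoup/build.xml"


def ant_commands(build_file=BUILD_FILE):
    """Return the ant invocations in the order listed: scrape, map, hier."""
    return [["ant", "-f", build_file, target] for target in ("scrape", "map", "hier")]


def pandoc_command(page, out_dir):
    """Return the pandoc invocation that converts one HTML page to asciidoc.

    The ".adoc" extension is an assumption, not something this page specifies.
    """
    page = Path(page)
    target = Path(out_dir) / (page.stem + ".adoc")
    return ["pandoc", "-f", "html", "-t", "asciidoc", "-o", str(target), str(page)]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` so the run stops at the first failing step.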

Once the pages have been converted, they need some additional cleanup:

  • Pandoc emits classic asciidoc syntax, some of which Asciidoctor simplifies or extends. For consistency, these conventions should be changed to the Asciidoctor forms:

    • Headings in asciidoc are underlined with ~ or ^ symbols to indicate the level, while Asciidoctor prefixes the heading with equal signs = (the number of equal signs is the heading level). The equal sign approach is more straightforward and readable when editing pages.

    • Code example boxes are defined in asciidoc by a row of multiple hyphens before and after the example; Asciidoctor uses exactly 4 hyphens.

    • Anchors are defined with text between double square brackets (like [[ ]]). Anchors are less necessary in the converted pages and can generally be removed.

    • <more to come>

  • Images

    • The images exported out of Confluence have arcane naming and an odd directory structure. We should make the effort to clean those up in a consolidated image directory with human-readable names.

    • Image references will also need double colons between image and <path> (image::<path>) so they are treated as block images instead of inline images. Once converted, each block image must also be on its own line instead of inline with the text, or it will not render in the PDF.
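The image-macro change described above might be automated along these lines. This is a minimal sketch, assuming the converted pages reference images as `image:<path>[...]`; the function name `to_block_images` is hypothetical. Putting a blank line on each side of the image is a conservative choice made here, not a rule stated on this page.

```python
import re

# Inline image macro: "image:" followed by a path and [attributes].
# The (?!:) lookahead skips macros that are already block images ("image::").
IMAGE_MACRO = re.compile(r"\bimage:(?!:)([^\s\[]+)\[([^\]]*)\]")


def to_block_images(text):
    """Rewrite inline image macros as block macros on their own lines."""
    out = []
    for line in text.splitlines():
        matches = list(IMAGE_MACRO.finditer(line))
        if not matches:
            out.append(line)  # nothing to convert on this line
            continue
        pieces, pos = [], 0
        for m in matches:
            before = line[pos:m.start()].strip()
            if before:
                pieces.append(before)
            pieces.append(f"image::{m.group(1)}[{m.group(2)}]")
            pos = m.end()
        rest = line[pos:].strip()
        if rest:
            pieces.append(rest)
        # Separate surrounding text and the image with blank lines.
        out.append("\n\n".join(pieces))
    return "\n".join(out)
```

For example, `See image:images/foo.png[Foo] for details.` becomes three separate blocks: the text `See`, the macro `image::images/foo.png[Foo]`, and the text `for details.`. The result still needs a manual read-through, since splitting a sentence around an image rarely leaves good prose.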

The conversion script could be improved to programmatically fix some of these issues:

  • Remove runs of more than 4 of the same character in a row (specifically ~, ^, -).

  • When a run of ~ or ^ is found, add the appropriate number of = at the start of the line above:

    • multiple ~ should be two equal signs ==

    • multiple ^ should be three equal signs ===

  • Remove anchor text surrounded by double square brackets ([[ ]]).
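The fixes listed above could look something like the following. It is a sketch rather than the actual conversion script: the function name `clean_asciidoc` is hypothetical, the heading mapping is the one given on this page, and long hyphen runs are shortened to a 4-hyphen code delimiter on the assumption that every such run marks a code example.

```python
import re

# Mapping stated on this page: "~" underline -> "==", "^" underline -> "===".
HEADING_LEVELS = {"~": "==", "^": "==="}
UNDERLINE = re.compile(r"^([~^])\1+$")   # a line made only of ~ or only of ^
LONG_FENCE = re.compile(r"^-{5,}$")      # more than 4 hyphens in a row
ANCHOR = re.compile(r"\[\[[^\]]*\]\]")   # [[anchor-text]]


def clean_asciidoc(text):
    """Apply the heading, code-delimiter, and anchor fixes line by line."""
    out = []
    for line in text.splitlines():
        stripped = line.rstrip()
        m = UNDERLINE.match(stripped)
        if m and out and out[-1].strip():
            # Drop the underline and prefix the title on the line above.
            out[-1] = f"{HEADING_LEVELS[m.group(1)]} {out[-1].strip()}"
            continue
        if LONG_FENCE.match(stripped):
            # Limitation: a title underlined with hyphens would also land here.
            out.append("----")
            continue
        # Limitation: anchors inside code examples are removed as well.
        out.append(ANCHOR.sub("", line))
    return "\n".join(out)
```

A title line followed by `~~~~~` comes out as `== Title`, and a code example wrapped in long hyphen rows comes out wrapped in `----`. Because of the limitations noted in the comments, the output should be reviewed before it is committed.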

Converting the content may be time- and labor-intensive, but it is only required once.
