Aard Dictionary container format is a binary file format that combines dictionary metadata, lookup index and compressed article data.
Aard files have the following layout: header, metadata, index 1, index 2, articles
Aard files start with a fixed size header described by the following specification:
HEADER_SPEC = (('signature', '>4s'), ('sha1sum', '>40s'), ('version', '>H'), ('uuid', '>16s'), ('volume', '>H'), ('of', '>H'), ('meta_length', '>L'), ('index_count', '>L'), ('article_offset', '>L'), ('index1_item_format', '>4s'), ('key_length_format', '>2s'), ('article_length_format', '>2s'), )
- signature
- signature string 'aard' identifying file as Aard Dictionary file
- sha1sum
- sha1 sum of dictionary file content following signature and sha1 bytes
- version
- Aard format version, a number, currently must be 1
- uuid
- dictionary unique identifier shared by all volumes of the same dictionary
- volume
- volume number of this this file
- of
- total number of volumes for the dictionary
- meta_length
- length of metadata compressed string
- index_count
- number of words in the dictionary
- article_offset
- article offset
- index1_item_format
- either >LL or >LQ (if maximum volume file size is set to a value bigger then 2^32 - 1) - :mod:`struct` format for key pointer and article pointer.
- key_length_format
- >H - key length format in index2_item
- article_length_format
- >L - article length format
Metadata is a dictionary stored as a JSON-encoded string. Aard Dictionary viewer uses the following metadata keys:
- article_count
number of unique articles in the volume
.. versionchanged:: 0.9.0 In previous versions this property meant total number of articles in the whole dictionary. Readers should examine `article_count_is_volume_total` metadata property to determine whether this value refers to number of articles in volume or in dictionary.
- article_count_is_volume_total
.. versionadded:: 0.9.0
true
if article_count property means total number of articles in the volume. Since aardtools 0.9.0 compiler always sets this property totrue
.- index_language
- dictionary's "from" language (two or thre letter ISO code)
- article_language
- dictionary's "from" language (two or thre letter ISO code)
- title
- dictionary title
- version
- dictionary version
- description
- dictionary desription
- copyright
- copyright notice
- license
- full license text
- source
- description of the source from which dicionary data originated
Index 1 is a sequence of fixed-size items containing two values: pointer to Index 2 item and pointer to article item.
Index 2 is a sequence of variable-length items containing two values: length of dictionary key text and key text itself.
Articles is a sequence of variable length items containing two values: length of article text and article text itself.
.. seealso:: Module :mod:`struct` `Documentation <http://docs.python.org/library/struct.html>`_ for the :mod:`struct` module
From container format perspective article is just a string that is stored
either as is or compressed with gzip or bz2 whichever takes less space. Thus
articles in Aard files may be in any format that can be represented as
string, for example plain text or HTML. Current Aard Dictionary
implementation expects HTML 4 or XHTML 1.0 formatted text without
enclosing html
and body
tags.