FreeDict HOWTO – Writing A FreeDict Dictionary

Sebastian Humenda edited this page Sep 13, 2020 · 2 revisions

Writing A FreeDict Dictionary

Abstract

This chapter deals with the process of building a FreeDict format file from your gathered sources. We cover TEI DTD installation, SGML and XML catalog configuration, some introductory-level XML, final formats and a couple of shortcuts.

Introduction

What we don't deal with here is the actual process of compiling a translating dictionary. That task is potentially endless and depends heavily on your own circumstances. You need to develop your own approaches and processes for gathering source materials and checking the quality of your entries. Here are some process issues to look out for and check with small sample sets before you get too far along.

  • Make sure your editor or word processor can output plain text in UTF-8: no word-processor or browser-specific markup, just simple text that can represent the characters of the languages you are writing for. Using different fonts, while helpful in a word processor, generally won't work in plain UTF-8 text, and almost certainly won't in your final output version.
  • If you are importing from a spreadsheet application, try exporting the sheets in simple Comma Separated Value (CSV) format. You can often use almost any character or set of characters as the separator. You may be able to convert the result to Dictd format with a simple script; we have a shortcut for that in Chapter 7, The Dictd Approach.
  • If you are starting from scratch and writing your dictionary mostly by hand, please consider using a template and an XML editor like (X)emacs. These make the process much less error-prone and tedious. See the tools section for more information.
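
The first two checks above can be tried on a small sample before committing to a workflow. The following is a minimal Python sketch, assuming a two-column export (headword, translation) with ";" as the separator; the column layout, the separator and the function names are illustrative assumptions, not part of any FreeDict tool.

```python
import csv
import io


def decode_utf8(raw_bytes):
    """Decode bytes as UTF-8; raises UnicodeDecodeError for other encodings."""
    return raw_bytes.decode("utf-8")


def read_word_pairs(text, delimiter=";"):
    """Parse 'headword<delimiter>translation' rows into a list of tuples.

    The two-column layout is an assumption made for this sketch.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    pairs = []
    for row in reader:
        # Skip blank lines and rows that do not have exactly two columns.
        if len(row) != 2:
            continue
        headword, translation = (cell.strip() for cell in row)
        if headword and translation:
            pairs.append((headword, translation))
    return pairs


# Small sample, as it might come out of a spreadsheet export.
sample = "Haus;house\nBaum;tree\n\nübel;bad\n".encode("utf-8")
pairs = read_word_pairs(decode_utf8(sample))
```

If decoding fails on your real export, the file was probably saved in a different encoding and should be re-exported before you go any further.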

The FreeDict Entry Format

Abstract

Though we claim to adhere to TEI P5 XML, Chapter 9 Dictionaries, additional rules and restrictions apply.

At first sight the TEI guidelines are very complex. At second sight they still are, but it is important to note that they were written primarily to allow encoding as much existing text as possible, tagged to a reasonable level of detail. The wide variety of existing text makes the TEI tagset very permissive, allowing almost any tag to be used inside any other.

This permissiveness makes it difficult to write software that processes pure TEI and reformats it into other formats such as TeX, Formatting Objects or text.

Besides being too permissive, the TEI Guidelines are incomplete for our needs, because they do not define any ontologies. Ontologies are needed for encoding different things in our dictionaries:

  • the Part of Speech of headwords, i.e. the contents of pos elements. Should verbs be marked as 'v', 'verb' or 'Verb'?
  • the Usage Domain of entry meanings - technology, botany etc.
  • the type of Cross References - whether the reference points to a synonym, an alternative spelling, a derived word etc.

Of course, these ontologies should be used for many dictionaries, allowing us to keep the processing software simple. If required, they can be localized before being presented to a dictionary user.

For these reasons, it is part of FreeDict's agenda to develop language-neutral ontologies for the things mentioned above.

Table 5.1. Part of Speech Typology (recommended contents of the pos element)

Element Content   Meaning
n                 noun
v                 verb (transitivity unknown)
vt                transitive verb
vi                intransitive verb
vti               transitive and intransitive verb
adv               adverb
adj               adjective
conj              conjunction
prep              preposition
int               interjection
pron              pronoun
art               article
num               numeral
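
A fixed typology like this is easy to check mechanically. The following Python sketch holds the recommended values from Table 5.1 in a set and maps a few common spellings onto them; the alias table and the function name are hypothetical and only illustrate the idea.

```python
# Recommended contents of the pos element, taken from Table 5.1.
RECOMMENDED_POS = {
    "n", "v", "vt", "vi", "vti", "adv", "adj",
    "conj", "prep", "int", "pron", "art", "num",
}

# Hypothetical aliases for spellings found in raw source material.
# Extend or replace this mapping to suit your own sources.
ALIASES = {
    "noun": "n",
    "verb": "v",
    "adverb": "adv",
    "adjective": "adj",
    "conjunction": "conj",
    "preposition": "prep",
    "interjection": "int",
    "pronoun": "pron",
    "article": "art",
    "numeral": "num",
}


def normalize_pos(value):
    """Return the recommended pos value, or None if it is not recognised."""
    key = value.strip().lower().rstrip(".")
    if key in RECOMMENDED_POS:
        return key
    return ALIASES.get(key)
```

With this sketch, 'v', 'verb' and 'Verb' all normalize to 'v', while an unknown value returns None so it can be reviewed by hand.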

Best Practices

  • Avoid using more than one orth element per entry. Instead, create separate entries and link them to each other.
  • Put question marks into note elements of entries that still need review. With this convention, other editors will be able to find those entries easily.
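
The question mark convention makes it possible to list entries awaiting review with a short script. The following is a minimal Python sketch using the standard library; the sample fragment and the function name are illustrative assumptions, and a real dictionary file will contain more structure than shown here.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}

# Illustrative fragment only; not a complete FreeDict file.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <entry>
    <form><orth>Haus</orth></form>
    <sense><cit type="trans"><quote>house</quote></cit></sense>
  </entry>
  <entry>
    <form><orth>Baum</orth></form>
    <sense><cit type="trans"><quote>tree</quote></cit><note>?</note></sense>
  </entry>
</body>"""


def entries_needing_review(xml_text):
    """Return the headwords of entries that have a '?' inside a note."""
    root = ET.fromstring(xml_text)
    flagged = []
    for entry in root.iter("{%s}entry" % TEI_URI):
        notes = entry.findall(".//tei:note", TEI_NS)
        if any("?" in "".join(note.itertext()) for note in notes):
            orth = entry.find(".//tei:orth", TEI_NS)
            flagged.append(orth.text if orth is not None else None)
    return flagged


flagged = entries_needing_review(SAMPLE)
```

For the sample above, only the entry for "Baum" is reported, because it is the only one with a question mark in a note element.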

Two Approaches

There are at least two approaches you might take to building a FreeDict format dictionary.

  1. Use the FreeDict DTDs and produce TEI P5 XML, discussed in chapter 6. This gives you the most flexibility.
  2. Produce a simply (and accurately) formatted plain text file that you then process with a command line tool (which probably needs to be written or extended) that converts it to TEI XML. This can be quicker if you are comfortable with it, but limits your options for lexicographic information. This option is discussed in chapter 7.
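
To give a feel for the second approach, here is a minimal Python sketch of such a converter. It assumes a hypothetical input format of one entry per line with three tab-separated fields (headword, part of speech, translation), and it emits only a bare body fragment without a TEI header or namespace declaration. The entry layout (form/orth, gramGrp/pos, sense/cit/quote) is one plausible TEI dictionary encoding, so check it against the FreeDict rules before relying on it.

```python
import xml.etree.ElementTree as ET


def line_to_entry(line):
    """Build one entry element from 'headword<TAB>pos<TAB>translation'."""
    headword, pos, translation = line.rstrip("\n").split("\t")
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword
    gram = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram, "pos").text = pos
    sense = ET.SubElement(entry, "sense")
    cit = ET.SubElement(sense, "cit", {"type": "trans"})
    ET.SubElement(cit, "quote").text = translation
    return entry


def convert(lines):
    """Convert an iterable of text lines into a serialized body fragment."""
    body = ET.Element("body")
    for line in lines:
        if line.strip():  # ignore blank lines
            body.append(line_to_entry(line))
    # ElementTree escapes characters such as & and < in text content.
    return ET.tostring(body, encoding="unicode")


xml_out = convert(["Haus\tn\thouse\n", "laufen\tvi\tto run\n"])
```

A line with the wrong number of fields raises a ValueError in this sketch, which is a reasonable way to catch formatting mistakes early in a hand-maintained text file.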

You may of course combine these or find any number of others. After all, it's your dictionary; we just need it in a certain format :)