Demo Site

A Python module for easy conversion of numeric values to natural language with full internationalization. E.g. "2100" can be mapped to "two thousand one hundred", "deux mille cent", "2.wav,1000.wav,1.wav,100.wav", or any other representation, according to rules-based configuration.

Quick Start

The script provides an example usage of the library, allowing command line evaluation of natural language. E.g.:

$ python
Usage: <integer to translate> <locale code>
$ python 123456 fr_FR
['cent', 'vingt', 'trois', 'mille', 'quatre', 'cent', 'cinquante', 'six']
$ python 123456 en_GB
['one', 'hundred', 'and', 'twenty', 'three', 'thousand', 'four', 'hundred', 
'and', 'fifty', 'six']
  • The locale code must match a file in the config directory.
  • The example script can be used as a starting point for using the library in any way you want.
  • If new language rules are required (or amendments are required for existing ones), browse the language config files, and read about how rules are set up below.


Generation of natural language - even just for numbers - can be complex.
Generally, in natural languages a number is represented as a series of words drawn from a relatively small vocabulary (e.g. 'one', 'two', ...), together with some multipliers (e.g. 'hundred', 'thousand') and linking words (e.g. 'and'). For example, when learning how to count, you might start by learning how to say one to ten. Then it is relatively easy to get as far as one hundred, by learning the words for twenty, thirty, etc., and combining them with what you already know. It is easier still to go into the hundreds and thousands once you know how to group numbers by place value and combine them with their multiplier.

Based on this understanding, a first attempt at converting numeric values to natural language would be to create a lookup table for each word in the vocabulary, then write an algorithm to stitch the words together based on some logic. At first this seems a perfectly adequate approach. In fact it would probably be the best choice for a single language, but as soon as internationalization is attempted the complexity increases drastically. This is due to differences not just in vocabulary across languages, but also in the rules for grouping, multipliers and other tricky corner cases.

For example, in English the number 100 is spoken as "one hundred", but in French the same number is rendered "cent". So not only is the vocabulary different, but so is the syntax (in French there is no need to specify the multiplier of hundreds if it is 1). Additionally, the number 101 is "one hundred and one" in English, but simply "cent un" in French, with no linking word. Another interesting example is how tens and units are combined in English and German. In English, 21 is "twenty one", but in German it is "einundzwanzig" ("one and twenty"). Furthermore, the German rendering is a single word, whereas the English is two separate words. Further complexities can be found in all languages.


NaturalNum offers the following:

  • Arbitrary mapping of numeric values to lists of string tokens, allowing a rich set of natural language representations
  • Rules-based approach, expressing knowledge of natural languages declaratively. This strategy scales horizontally - the rules for different languages are separately maintained, rather than nested among each other in a single algorithm.
  • Rules are triggered by pattern-matching. This is a natural choice considering how equivalence classes often depend on the 'place value' of digits.
  • Wildcarding and variable-binding are possible ('digivars'). This allows generic rules to be created for cases where the natural language representation is based on digits within the input value, e.g. when mapping to resource tokens such as audio filenames.
  • Rules can resolve a subset of the input number recursively, to allow reuse.
    This means the size of the rule-set scales very well with an increasing domain of input numbers.



Each supported natural language must be configured with rules; sample languages are provided in config/.

For a given input numeric value, the rules config for the current language is searched from top to bottom. A rule is triggered as soon as the input numeric value matches the pattern on the left-hand side of that rule. The pattern-matching is based first on the number of digits, and then on the values of the digits themselves. As soon as a match is found, no further matching is attempted. This means rules should be specified from the most specific to the most general.
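The search order described above can be sketched in a few lines of Python. This is an illustration only, not NaturalNum's actual code: the function names `matches` and `first_match` are hypothetical, and letters in a pattern are assumed to act as single-digit wildcards (the digivars described later in this document).

```python
def matches(pattern, number):
    """Return True if the digit string `number` matches `pattern`.

    Assumption: a letter in the pattern matches any single digit;
    a digit in the pattern matches only itself.
    """
    if len(pattern) != len(number):         # first: the number of digits
        return False
    return all(p.isalpha() or p == d        # then: the digit values
               for p, d in zip(pattern, number))


def first_match(patterns, number):
    """Search top to bottom and stop at the first matching pattern."""
    for pattern in patterns:
        if matches(pattern, number):
            return pattern
    return None


good_order = ["t0", "1u", "tu"]   # most specific first
bad_order = ["tu", "t0", "1u"]    # the general pattern shadows the others

print(first_match(good_order, "30"))  # t0
print(first_match(bad_order, "30"))   # tu
```

With the general pattern listed first, the more specific rules below it can never fire, which is why ordering matters.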

The right-hand-side of each rule is always resolved to a list of tokens representing the natural language representation for the matched numeric input.

The left-hand and right-hand sides of each rule are separated by '='.

A very simple example rule for English follows (note that '#' indicates the beginning of a comment, which is ignored by the reader):

21=twenty,one ## rule for 21
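As a rough sketch of how such a line could be read, the following Python splits a rule into its pattern and token list. The function name `parse_rule` is hypothetical and this is not necessarily how the library's own reader works; it only assumes what the text states, namely that '=' separates the two sides, ',' separates tokens, and '#' starts a comment.

```python
def parse_rule(line):
    """Split one config line into (pattern, tokens).

    Returns None for blank or comment-only lines.
    """
    body = line.split("#", 1)[0].strip()    # drop everything from the first '#'
    if not body:
        return None
    pattern, _, rhs = body.partition("=")   # left-hand side, '=', right-hand side
    tokens = [token.strip() for token in rhs.split(",")]
    return pattern.strip(), tokens


print(parse_rule("21=twenty,one ## rule for 21"))  # ('21', ['twenty', 'one'])
```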

The list of tokens on the right-hand side can be anything, so they could be used to identify resources, e.g. audio files containing spoken utterances for playback. A very simple example mapping the digits 0-9 to a corresponding audio filename is given below:

0=0.wav           ## an input of "0" is matched, and mapped to the output 
                  ## string "0.wav"
1=1.wav           ## "1" -> "1.wav"
                  ## (...)
9=9.wav           ## "9" -> "9.wav"

So far this does not seem too interesting; it is just a simple lookup table mapping literal numeric values to literal strings. It would not scale, and there would be a lot of repetition.

However, there is a way to make these rules more compact.


A digivar is a wildcard character which, when appearing on the left-hand side of a rule will match any single digit in the same position. What makes digivars especially useful is that they act as a variable which is bound to the matched digit, and when referenced on the right-hand-side (preceded by $) the digivar will be replaced with its bound value. Digivars can be mixed freely with literal values on the right-hand side of a rule.

For example, we can simplify the rules previously given for 0-9 into a single rule:

u=$u.wav

Any single digit value (0-9) will match this rule. The single digit in the matched numeric value will be bound to the digivar 'u' (any alphabetic character could have been chosen, but 'u' seems a natural choice for 'units'). When $u is encountered on the right-hand-side, this will be replaced with the digit matched to digivar 'u', followed by '.wav'. So '1' would resolve to '1.wav' and so on, up to '9.wav'.
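The binding and substitution steps described above might look like this in Python. It is a minimal sketch under stated assumptions: `bind` and `substitute` are hypothetical names, and the regular expression assumes a reference is always `$` followed by exactly one letter.

```python
import re


def bind(pattern, number):
    """Match `number` against `pattern`; return {digivar: digit} or None."""
    if len(pattern) != len(number):
        return None
    bindings = {}
    for p, d in zip(pattern, number):
        if p.isalpha():
            bindings[p] = d        # a digivar binds to the digit in its position
        elif p != d:
            return None            # a literal digit must match exactly
    return bindings


def substitute(template, bindings):
    """Replace each $x in `template` with the digit bound to digivar x."""
    return re.sub(r"\$([A-Za-z])", lambda m: bindings[m.group(1)], template)


bindings = bind("u", "7")
print(substitute("$u.wav", bindings))  # 7.wav
```

Because the substituted digit is spliced into the surrounding text, literals and digivars can be mixed in one token, e.g. `1$u` with `u` bound to 9 gives `19`.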

There are a few restrictions to the use of digivars:

  • A digivar can only be a single alphabetic character (this makes the left-hand-side value easier to read from a pattern-matching perspective). If several alphabetic characters appear on the left-hand side in sequence, it will be assumed they are separate single-character digivars.
  • Digivars must be alphabetic characters (case-sensitive), to distinguish from literal digits.
  • The same digivar cannot be used more than once on the left-hand-side (you cannot bind the same digivar to more than one digit).
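These restrictions could be checked when a rule is loaded. The sketch below is illustrative only; `check_pattern` is a hypothetical helper, and whether NaturalNum validates patterns this way (or at all) is not stated in this document.

```python
import string


def check_pattern(pattern):
    """Return the set of digivars in a left-hand side, or raise ValueError."""
    seen = set()
    for ch in pattern:
        if ch in string.digits:
            continue                              # literal digit
        if ch not in string.ascii_letters:
            raise ValueError(f"{ch!r} is neither a digit nor a digivar")
        if ch in seen:
            raise ValueError(f"digivar {ch!r} is used more than once")
        seen.add(ch)                              # each letter is its own digivar
    return seen


print(sorted(check_pattern("h1u")))   # ['h', 'u']
print(sorted(check_pattern("uU")))    # ['U', 'u'] (case-sensitive, so distinct)
```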

It is perhaps intuitive that digivars are scoped only to the single rule they are defined in. So if the same digivar 'u' appears in a number of rules, it is bound in each case to the digit matched in that rule only.

Note that it is not mandatory for a digivar to appear on the right-hand-side.
So digivars can be used as straightforward wildcards, although there is no obvious use-case for this.

Care must be taken in defining rules in the correct order (most specific to least specific) to get the correct result. The following is a more advanced example of how we might continue the previous examples as far as 0-999 using digivars. Note that for the remainder of this example, we will define a minimal vocabulary of audio filenames to be the following:

  • 0.wav, 1.wav, ..., 19.wav
  • 20.wav, 30.wav, up to 90.wav
  • 100.wav (represents "hundred", not "one hundred")
  • and.wav (linking word)

The limited vocabulary means we will have to compose tokens together in sequence on the right-hand side, where necessary. For clarity, we will not specify the file extension (.wav) on the right-hand side of each rule (in practice it would be straightforward for a calling application to append the extension to each of the generated tokens).

## Single Digits (units)
u=$u                               ## Units (e.g. 7 = 7.wav)

## Double Digits (tens and units)
t0=$t0                             ## Exact tens (e.g. 30 = 30.wav)
1u=1$u                             ## 11-19 (e.g. 19 = 19.wav)
tu=$t0,$u                          ## Remaining 'tens and units' 
                                   ## (e.g 56 = 50.wav, 6.wav)

## Three Digits (hundreds, tens and units)
h00=$h,100                         ## Exact hundreds 
                                   ## (e.g. 300 = 3.wav, 100.wav)
h0u=$h,100,and,$u                  ## Hundreds and unit with no tens 
                                   ## (e.g. 709 = 7.wav, 100.wav, and.wav, 
                                   ## 9.wav)
ht0=$h,100,and,$t0                 ## Hundreds and exact tens 
                                   ## (e.g. 850 = 8.wav, 100.wav, and.wav, 
                                   ## 50.wav)
h1u=$h,100,and,1$u                 ## Hundreds and 11-19 
                                   ## (e.g. 218 = 2.wav, 100.wav, and.wav, 
                                   ## 18.wav)
htu=$h,100,and,$t0,$u              ## Hundreds and remaining tens and units 
                                   ## (e.g. 546 = 5.wav, 100.wav, and.wav, 
                                   ## 40.wav, 6.wav)

This is good progress; we have specified the natural language rules for all numbers 0-999 in 9 rules. However, it is not perfect, as there is some repetition. The rules for the single and double digits are duplicated further down in the rules for 3 digits. Without any further tricks, this redundancy would fan out as patterns for more digits are provided.
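To check the claim that these 9 rules cover 0-999 using only the minimal vocabulary, the rule set can be run through a small first-match resolver. This is a sketch rather than the library's implementation: `load`, `resolve` and `VOCABULARY` are hypothetical names, and inputs are assumed to be digit strings without leading zeros.

```python
import re

RULES = """
u=$u
t0=$t0
1u=1$u
tu=$t0,$u
h00=$h,100
h0u=$h,100,and,$u
ht0=$h,100,and,$t0
h1u=$h,100,and,1$u
htu=$h,100,and,$t0,$u
"""

# Minimal vocabulary from the text, without the .wav extension.
VOCABULARY = ({str(n) for n in range(20)}
              | {str(n) for n in range(20, 100, 10)}
              | {"100", "and"})


def load(text):
    """Parse rule lines into an ordered list of (pattern, templates)."""
    rules = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            lhs, rhs = line.split("=", 1)
            rules.append((lhs, rhs.split(",")))
    return rules


def resolve(number, rules):
    """Apply the first rule whose pattern matches the digit string."""
    for pattern, templates in rules:
        if len(pattern) != len(number):
            continue
        bound = {}
        for p, d in zip(pattern, number):
            if p.isalpha():
                bound[p] = d
            elif p != d:
                break                      # literal mismatch: try the next rule
        else:
            return [re.sub(r"\$([A-Za-z])", lambda m: bound[m.group(1)], t)
                    for t in templates]
    raise ValueError(f"no rule matches {number!r}")


rules = load(RULES)
for n in range(1000):
    tokens = resolve(str(n), rules)
    assert set(tokens) <= VOCABULARY, (n, tokens)

print(resolve("546", rules))  # ['5', '100', 'and', '40', '6']
```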

This is where recursive rules come into the picture.

Recursive Rules

It would be useful if a rule could defer part of its evaluation to another rule. For example, a number of languages have the same rules for hundreds, tens and units, regardless of where they appear in the value. More specifically, in English, we should not have to repeat ourselves when we define how the value "34" is represented when evaluating "34,000", "134", or "34,034,034". The latter case could be decomposed into steps in an informal 'rule' as follows:

+- (34) million
+- (34) thousand
+- and
+- (34)

In all but one of the steps, the value 34 is placed in brackets to show it is delegated to a separate rule which is already defined for 'tens and units'.

This is very similar to how recursive rules work in NaturalNum. In the example above, the rules for single and double digits were repeated within the rules for hundreds, tens and units. This would fan out into increasing redundancy as values of increasing length are supported.

However, we could change the way three digits are resolved, as follows:

## Single Digits (units)
u=$u                               ## Units (e.g. 7 = 7.wav)

## Double Digits (tens and units)
t0=$t0                             ## Exact tens (e.g. 30 = 30.wav)
1u=1$u                             ## 11-19 (e.g. 19 = 19.wav)
tu=$t0,$u                          ## Remaining 'tens and units' 
                                   ## (e.g 56 = 50.wav, 6.wav)

## Three Digits (hundreds, tens and units)
h00=$h,100                         ## Exact hundreds 
                                   ## (e.g. 300 = 3.wav, 100.wav)
h0u=$h,100,and,$u                  ## Hundreds and unit with no tens 
                                   ## (e.g. 709 = 7.wav, 100.wav, and.wav, 
                                   ## 9.wav)
htu=$h,100,and,($t$u)              ## Hundreds and tens and units - tens
                                   ## and units are deferred to a matching
                                   ## rule
                                   ## e.g. 546 = 5.wav, 
                                   ##            100.wav, 
                                   ##            and.wav, 
                                   ##            (46)
                                   ##              = 40.wav, 6.wav

The single and double digits are specified as in the previous example. But there are now only three rules for three digits, instead of five. This doesn't sound like a huge improvement, but as more digits are added the benefits are compounded (especially when the rules for three digits are re-used in 4+ digits).

The logic here is as follows. When "546" is resolved, the first rule matched is the final one (matching the pattern "htu"). This resolves to ("5", "100", "and"), but the tens and units are parenthesised, so they are resolved recursively. This means that the digits substituted for "$t$u" ("46") are fed back through the rules, matching the rule with pattern "tu". This rule then gives the remainder of the result, which is "40", "6".
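The recursive step can be sketched as follows. This is an illustration under assumptions, not NaturalNum's code: `bind` and `resolve` are hypothetical names, and a token is treated as recursive only when the whole token is wrapped in parentheses after substitution.

```python
import re

# The seven rules from the example above, in order, as (pattern, templates).
RULES = [
    ("u",   ["$u"]),
    ("t0",  ["$t0"]),
    ("1u",  ["1$u"]),
    ("tu",  ["$t0", "$u"]),
    ("h00", ["$h", "100"]),
    ("h0u", ["$h", "100", "and", "$u"]),
    ("htu", ["$h", "100", "and", "($t$u)"]),
]


def bind(pattern, number):
    """Return {digivar: digit} if `number` matches `pattern`, else None."""
    if len(pattern) != len(number):
        return None
    bound = {}
    for p, d in zip(pattern, number):
        if p.isalpha():
            bound[p] = d
        elif p != d:
            return None
    return bound


def resolve(number, rules=RULES):
    """Resolve a digit string to tokens, recursing into (...) tokens."""
    for pattern, templates in rules:
        bound = bind(pattern, number)
        if bound is None:
            continue
        tokens = []
        for template in templates:
            text = re.sub(r"\$([A-Za-z])", lambda m: bound[m.group(1)], template)
            if text.startswith("(") and text.endswith(")"):
                tokens.extend(resolve(text[1:-1], rules))   # feed back through the rules
            else:
                tokens.append(text)
        return tokens
    raise ValueError(f"no rule matches {number!r}")


print(resolve("546"))  # ['5', '100', 'and', '40', '6']
print(resolve("218"))  # ['2', '100', 'and', '18']
```

Here "218" reaches the "1u" rule through the recursion, so the separate "h1u" and "ht0" rules from the earlier version are no longer needed.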


NaturalNum goes some way towards solving the problem of natural language representation of integer values. It is not perfect however, as shown by the following issues:

  • It is still hard to write new rules. Defining rules for a new language requires some study, particularly finding the edge cases and identifying where candidates for re-use exist. Validation is time-consuming, perhaps requiring a native speaker of the language to confirm the details. This is the nature of the problem domain however.
  • A fairly large assumption is made that only cardinal numbers in the nominative case need to be represented. This is fine for counting, but if we are trying to say 'six chairs', then grammatical gender and case (accusative, dative, etc.) come into play. For example, Polish has a table of variations of the number 1, depending on gender, plurality and case (nominative, accusative, etc.) [1], [2]. This can introduce problems when representing currency amounts, for example.
  • The analysis of the suitability of this approach has not progressed beyond European languages. It is not yet known how well Asian languages (for example) would adapt to this scheme. Either a Romanized system or Unicode could be used for the representations, but it is not yet known whether their syntax would lend itself to the approach taken by NaturalNum.

Further Work

NaturalNum could be developed in the following ways:

  • Add new languages, particularly untested classes of language, e.g. Chinese

  • Expose as a web service

  • Extend configuration into the millions, and beyond

  • Create further abstractions to represent variations with respect to the gender and class of the related noun. A possible way to capture this:




