danmar edited this page Sep 14, 2010 · 4 revisions

This will be a simple program for checking that a translated document matches its original. It completely ignores the words! Instead, it performs the matching based on the structure and formatting of the documents.

Here is an example in Swedish/English:
Swedish: skiner solen?
English: the dog is happy.

Even if you know neither Swedish nor English, you can tell that the sentences do not match, because one is a question and the other is not.
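
The idea can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function names `sentence_kind` and `same_kind` are made up for this example, and it assumes the final punctuation mark is enough to classify a sentence.

```python
def sentence_kind(sentence):
    """Classify a sentence by its final punctuation mark only (the words are ignored)."""
    endings = {"?": "question", "!": "exclamation", ".": "statement"}
    stripped = sentence.strip()
    if not stripped:
        return "empty"
    return endings.get(stripped[-1], "other")


def same_kind(original, translation):
    """Return True if both sentences end with the same kind of punctuation."""
    return sentence_kind(original) == sentence_kind(translation)


# The example above: a question paired with a statement does not match.
print(same_kind("skiner solen?", "the dog is happy."))
```

A real check would probably need to handle languages with other punctuation conventions, which this sketch does not attempt.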

Some possible checks:

  • each paragraph should have the same number of sentences
  • each list should have the same number of items
  • if numbers are found, they should be the same in both documents
  • matching formatting (if some text is bold in the first document, there should be some bold text in the second document too)
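
The checks above could be prototyped roughly as follows. This is a minimal sketch under several assumptions, not the program's implementation: it assumes HTML-like tag names (`p`, `ul`, `ol`, `li`, `b`, `strong`), it counts sentences by terminal punctuation, and the names `structure_summary` and `compare` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: these tag names mark bold text and lists (HTML-like markup).
BOLD_TAGS = {"b", "strong"}
LIST_TAGS = {"ul", "ol"}


def count_sentences(text):
    """Count runs of . ! ? that are followed by whitespace or the end of the text."""
    return len(re.findall(r"[.!?]+(?=\s|$)", text.strip()))


def structure_summary(xml_text):
    """Collect the structural facts used by the checks; the words are never compared."""
    root = ET.fromstring(xml_text)
    all_text = "".join(root.itertext())
    return {
        "sentences_per_paragraph": [
            count_sentences("".join(p.itertext())) for p in root.iter("p")
        ],
        "items_per_list": [
            len(el.findall("li")) for el in root.iter() if el.tag in LIST_TAGS
        ],
        "numbers": sorted(re.findall(r"\d+(?:[.,]\d+)*", all_text)),
        "has_bold": any(el.tag in BOLD_TAGS for el in root.iter()),
    }


def compare(original_xml, translated_xml):
    """Return the names of the checks on which the two documents differ."""
    a = structure_summary(original_xml)
    b = structure_summary(translated_xml)
    return [name for name in a if a[name] != b[name]]
```

For example, calling `compare` on an original and its translation returns an empty list when all four checks agree, and otherwise the names of the checks that failed. Note that the number check compares the digits literally, so a decimal comma in one language and a decimal point in the other would be reported as a mismatch.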

This program is written primarily for XML-based documents (HTML, DocBook, etc.).