Brian McConnell email@example.com
Software developers typically start off building their application in their native language, more often than not, in English. Supporting other languages usually comes much later in the product development cycle, and by then it is often painful and expensive to retrofit the application to use localization and translation tools.
This article explains a simple technique that can be used to embed multilingual functionality in almost any application early on, while also enabling it to receive its translations from a variety of network based machine and human translation resources.
One of the common ways to localize an application is to use the gettext() package. Gettext enables developers to translate texts associated with their applications. This is typically done using text files that contain a list of prompts and their translations.
msgid "Hello World"
msgstr "Hola Mundo"

msgid "Goodbye"
msgstr "Adiós"
Within the application, the prompts are generated by calling a function that examines the prompt catalog to look up translations, using notation such as:
print _("Hello World")
print _("Goodbye")
This file based approach worked well when software release cycles were infrequent and when the messages to be translated were static. In today’s world of bi-weekly release cycles and dynamic content, this approach is unwieldy and expensive. Just keeping the translation files in sync with the source material is a challenge, even for relatively simple projects (and does not work at all for dynamic content).
I recently joined Gengo, a Tokyo/SF based translation technology company, following the acquisition of Worldwide Lexicon, an open source translation platform I worked on for many years. WWL combined machine, crowd and professional translation, presented via a web services API, and enabled users to request the best available translations while optimizing for speed, quality and cost. Meanwhile Gengo has built a network of thousands of freelance translators who are accessed via a web services API.
At OSCON, we are presenting a design pattern (and a Python library) for a cloud based localization and dynamic content translation tool. The Python library, code named Avalon, is available at mygengo.github.com/avalon. It queries a variety of translation resources on demand, and enables developers to localize their applications and translate dynamic content as it is served, using either machine or human translation. The library is intended primarily as a demonstration of the utility of this approach. This article describes how to replicate the approach in the development environment of your choice.
gettext is popular because it is easy to use within an application. Translating a string via gettext is as simple as:
print _("Hello World")
The utility we developed provides similar ease of use, but instead of using static message catalogs, calls out to machine and human translation services. With this tool, you simply do the following:
sl = 'en'                             # source language = English
tl = 'es'                             # target language = Spanish
google_apikey = 'foo'                 # Google Translate API key
gengo_public_key = 'bar'              # Gengo public API key
gengo_private_key = 'foobar'          # Gengo private API key
translate_order = ['gengo', 'google'] # order in which services are called

print _("Hello World")
Behind the scenes, the utility is doing the following things:
- Check memcached to see if there is already a translation for the text; if so, use it
- If not, call the translation services in the order listed, caching results on success
- For human translation services, if no translation is available, trigger a request for one (the completed translation will appear later and will take precedence over machine translations)
The basic design pattern used here is pretty simple, and automatically detects new texts that require translation. It’s also a very flexible approach, and enables the user to switch translation modes on the fly. For example, the app might use human translation for highly visible texts, and fall back to machine translation for content that appears “below the fold”.
There are three major categories of translation services available on the Internet: machine translation engines, translation memory, and human translation services.
Among machine translation services, Google Translate and Microsoft Translator are the best in terms of accuracy and performance. Both are statistical machine translation systems, trained on large corpora of translated texts. Microsoft Translator is free (with limits on query volume and message length), while Google Translate charges a nominal fee ($20 per 1 million characters, or about $0.0002 per word). Both are fast, and provide reasonably accurate translations for major languages (although they are clearly computer generated translations, they are usually suitable for communicating the “gist” of the source material).
Translation memories are searchable databases of previously created human translations. Two systems of note are TAUS (www.translationautomation.com) and Transifex (www.transifex.net). Transifex is a particularly interesting system as it is a hosted service for managing localization projects, and also functions as a translation memory. It’s sort of like Github for translation and localization. Transifex is rolling out an API to query its translation memory, which will enable developers to manage their translation assets in a centralized repository, while eliminating the need to keep translation files in sync (translations can simply be loaded via API calls). We’ll add a connector for Transifex when it is ready.
Professional translation services, such as Gengo, enable developers to treat a network of human translators as an automated resource that can be incorporated into virtually any application or process. Gengo, for example, provides a well documented REST API and wrapper libraries through which applications can request, score and comment on translations. In effect, it makes a network of human translators look like a machine translation engine, although the translations are not instantaneous (I’ll discuss strategies for dealing with that in a bit).
This approach enables you to blend machine and human translation to optimize for cost, quality and speed. In fact, you can switch between different modes of translation or different translation providers for texts on the same page. An online newspaper, for example, might have headlines and lead paragraphs translated by professionals, while content “below the fold” is machine translated or translated by less expert translators. Another variation on this strategy is to monitor page views, and trigger human translation for texts that are viewed more than N times in a specific language. These strategies enable developers to build adaptive translation systems that adjust their spending for human translation to best meet user needs.
Online stores are an example of where cost optimization is important. Let’s consider a store with several thousand products in its catalog. The store wants to be accessible in Spanish to cater to the Hispanic market. Each product description contains about 100 words, and will cost between $5 to $10 to translate professionally. Translating the entire catalog via paid translators might cost more than the store initially wants to spend. On the other hand, translating the top 10% of the catalog via professionals and the bottom 90% via machine translation effectively reduces the cost by 90% without noticeably affecting presentation quality. As traffic increases, the store can translate a higher percentage of the catalog via professionals. This strategy allows companies to test the effects of machine versus professional translation, and to automatically invest in translating items that are most likely to yield the best ROI. This type of functionality will become a basic feature in many e-commerce platforms over the next year or so.
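The arithmetic behind that strategy is easy to check. Using hypothetical numbers (a 5,000-product catalog, 100 words per description, roughly $0.075 per word for professional translation, and machine translation treated as free), a back-of-the-envelope estimate looks like:

```python
def catalog_cost(products, words_each=100, rate_per_word=0.075,
                 pro_fraction=0.10):
    """Estimate spend when only the top pro_fraction of the catalog is
    professionally translated and the rest is machine translated
    (treated as free here)."""
    full = products * words_each * rate_per_word
    blended = full * pro_fraction
    return full, blended

full, blended = catalog_cost(5000)
print("full catalog: $%.0f, top 10%% only: $%.0f" % (full, blended))
```

As traffic grows, raising `pro_fraction` lets the store dial up professional translation spending gradually instead of paying for the whole catalog at once.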
If you need something translated quickly, for example because the content ages quickly, machine translation provides instant results, albeit at lower quality. Human translation services, even highly automated systems, generally do not provide immediate results. Translation turnaround time is variable, and depends on a number of factors, including: the language you are translating to (and how many translators are available for that language), day of week, time of day, and the price paid. For common language pairs, such as English ↔ Spanish, translations are generally done very quickly, within minutes to a few hours. Less common language pairs typically take longer, especially outside of normal business hours.
A typical approach developers take is to use machine translations as a temporary placeholder, and sometimes to identify them as such. For example, when translating a breaking news story, a publisher might display a header such as “This article was translated by Google Translate. This temporary machine translation will be replaced by a professional translation shortly. You can view the original page in English here”. This notifies the reader that the machine translation is temporary, and is being offered as a convenience while the content is replaced or post-edited by people.
API based human translation services typically allow you to poll for updates, or to register to receive an HTTP callback when a translation is completed or revised. If your application or translation repository has a publicly accessible domain, you can use callbacks to receive immediate updates. Then as soon as a human translation is completed, the machine translation is purged and replaced. If you rely on cache expiration and polling for change detection, this can delay the visibility of newly completed human translations somewhat.
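A callback handler along these lines can be sketched as below. The payload shape is hypothetical; your provider’s API documentation defines the real field names:

```python
cache = {}  # keyed the same way as the translation lookup

def on_translation_complete(payload):
    """Handle a completed-translation callback. The payload fields here
    are hypothetical; consult your provider's API docs for real ones."""
    key = payload["cache_key"]
    # A finished human translation replaces whatever machine
    # translation was cached under the same key.
    cache[key] = payload["translated_text"]

# Simulate the provider POSTing a completed job back to us:
on_translation_complete({"cache_key": "en/es/Hello World",
                         "translated_text": "Hola Mundo"})
```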
Another benefit of this approach is its ability to deal with dynamic content because new texts will be detected automatically and queued for translation. One obvious issue, especially for websites or interfaces that have a large amount of dynamic content is that rendering will be slow when many new texts appear at once. There are a number of ways to deal with this and hide these performance issues from users.
The most obvious fix is to cache translations aggressively, using memcached or a local data store, whichever is most appropriate for your situation. I used memcached in cloudtext.py since it’s built into App Engine, and is also a cheap way to temporarily store data in that environment. If you cache with a long time to live, most of your translations will be cached most of the time. If your translations are often updated after they are initially created, a common scenario in crowd translation systems or for post-edited professional translation, you’ll want to reduce the cache time to live so that post-edits propagate relatively quickly (depending on site traffic and update frequency, 15 minutes to an hour is usually a reasonable setting).
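A minimal time-to-live cache along these lines, with a dict standing in for memcached and a 15-minute default expiry so post-edits propagate:

```python
import time

class TTLCache:
    """Dict-based stand-in for memcached with per-entry expiry."""

    def __init__(self, ttl=900):  # 900 s = 15 minutes
        self.ttl = ttl
        self.store = {}

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]
        # Expired or missing: drop it so the next lookup re-translates.
        self.store.pop(key, None)
        return None
```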
Another thing to do is to make translation requests asynchronously. In this case, the translation function is read-only, and either gets a cache hit or not. If it gets a cache miss, it sends a message to a background process that, in turn calls out to the translation service(s) and updates the cache when the results come back. The first time someone loads a new page in their language, the page may be mostly untranslated. Meanwhile translation requests are made in the background, and the page or interface is quickly updated when it is loaded a few moments later.
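A sketch of that asynchronous variant, using a thread and an in-process queue as stand-ins for a real task queue and translation service:

```python
import queue
import threading

cache = {}
pending = queue.Queue()

def translate_async(text):
    """Read-only lookup: either a cache hit, or queue a request."""
    if text in cache:
        return cache[text]
    pending.put(text)
    return text  # serve the untranslated source text for now

def worker():
    # A real worker would call out to the translation service(s) here.
    while True:
        text = pending.get()
        cache[text] = "translated:" + text
        pending.task_done()

threading.Thread(target=worker, daemon=True).start()

first = translate_async("Hello")   # first load: source text comes back
pending.join()                     # background worker fills the cache
second = translate_async("Hello")  # subsequent load: translated
```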
Spiders are another trick you can use to improve performance, as well as ensure that new texts are queued for translation before users view the page. Simply configure a spider to crawl the visible pages on your site. This will trigger translations for newly detected texts, so that by the time users view a page, most of its texts will already have been translated. Preloading the most commonly used texts, for example using a dictionary of the most frequently used prompts, is another way to do this.
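Preloading can be sketched the same way: run each frequent prompt through the normal lookup path before users arrive (here `translate()` is a stand-in for the real cache-plus-service lookup):

```python
cache = {}

def translate(text):
    # Stand-in for the normal cache-then-service lookup path.
    if text not in cache:
        cache[text] = "MT:" + text  # placeholder for a real service call
    return cache[text]

def preload(prompts):
    """Warm the cache ahead of user traffic, as a spider crawl would."""
    for prompt in prompts:
        translate(prompt)

preload(["Hello World", "Goodbye", "Checkout"])
```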
If possible, it is best to maintain backward compatibility with the preferred file based localization utility for the language or framework you are working with. Then, query resources in the following order:
- memcache or local cache
- file based localization utility (e.g. gettext, yaml, Java properties, etc)
- translation memory (if enabled)
- human translation service (if enabled)
- machine translation engine (if enabled)
This sequence enables you to use static, file based translations for static content (e.g. localizing your site or app’s user interface), and on demand “over the air” translation for dynamic content. In other words, you get the best of both worlds: manually curated translations where they make sense, and on demand translation for everything else.
If you would like to build and share your own version of this tool, I’d like to hear from you. Just fork Avalon and get in touch!