An Apache OODT, Apache Tika, and Apache Solr based system to automatically take large TSV file datasets, and to translate them from one language to another. Built and inspired by the DARPA XDATA Employment dataset.

chrismattmann/bigtranslate

BigTranslate

BigTranslate is a distributed, parallelized (MapReduce-style) wrapper around Apache™ Tika and its Translation API, as provided by Tika-Python. It uses Apache™ OODT to split and distribute the machine translation of many millions of rows of data. The system has been tested on up to 190 million rows of TSV data, involving millions of translations, on 16-core nodes, and finishes in a reasonable amount of time. BigTranslate uses ETLLib to provide a clean facade for JSON and TSV data processing and to prepare data for translation with Tika. Once translated, the data is ingested into Apache™ Solr for querying, retrieval, and large-scale analytics.
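The split-and-translate pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BigTranslate's actual code: the function names (`chunk_rows`, `translate_chunk`, `translate_tsv`) are hypothetical, a local thread pool stands in for the distribution that BigTranslate delegates to Apache™ OODT, and `translate` is a placeholder callable that a real setup would presumably back with a Tika-Python translation call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List

Row = List[str]


def chunk_rows(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Split step: yield successive lists of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def translate_chunk(chunk: List[Row], column: int,
                    translate: Callable[[str], str]) -> List[Row]:
    """Map step: translate one column in every row of a chunk."""
    out: List[Row] = []
    for row in chunk:
        new_row = list(row)
        new_row[column] = translate(row[column])
        out.append(new_row)
    return out


def translate_tsv(tsv_text: str, column: int,
                  translate: Callable[[str], str],
                  chunk_size: int = 1000, workers: int = 4) -> str:
    """Translate one column of TSV text, chunk by chunk, in parallel.

    Assumption: a thread pool on one machine replaces the OODT-based
    distribution; `translate` is a hypothetical stand-in for a call
    into a machine translation service.
    """
    rows = [line.split("\t") for line in tsv_text.splitlines()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so row order is kept.
        results = pool.map(
            lambda chunk: translate_chunk(chunk, column, translate),
            chunk_rows(rows, chunk_size),
        )
        translated = [row for chunk in results for row in chunk]
    # Reduce step: reassemble the translated rows into TSV text.
    return "\n".join("\t".join(row) for row in translated)
```

Passing the translator in as a callable keeps the splitting and reassembly logic testable without a running translation service; for example, `translate_tsv(text, 1, str.upper, chunk_size=2)` exercises the whole pipeline with a dummy "translator".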

Apache™ Tika provides a facade to, and has been tested with, the following Machine Translation APIs.

See the wiki for more information on installing and running BigTranslate:

You can clone the wiki by running:
git clone https://github.com/chrismattmann/bigtranslate.wiki.git
