Welcome to Apache Joshua (Incubating)

Joshua is a statistical machine translation toolkit for both phrase-based (new in version 6.0) and syntax-based decoding. It can be run with pre-built language packs available for download, and can also be used to build models for new language pairs. Among the many features of Joshua are:

  • Support for both phrase-based and syntax-based decoding models
  • Translation of weighted input lattices
  • Thrax: a Hadoop-based, scalable grammar extractor
  • A sparse feature architecture supporting an arbitrary number of features

The latest release of Joshua is always linked directly from the home page.

New in 6.X

Joshua 6.X includes the following new features:

  • A fast phrase-based decoder with the ability to read Moses phrase tables
  • Large speed improvements compared to the previous syntax-based decoder
  • Special input handling
  • A host of bugfixes and stability improvements

Quick start

Joshua requires a Java JDK, version 1.8 or later.
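
To confirm that the installed Java meets this requirement, a check along the following lines may help. This is a minimal sketch for a POSIX shell: the helper name `java_major_version` is hypothetical and not part of Joshua, and the parsing assumes the first line of `java -version` contains the quoted version string.

```shell
#!/bin/sh
# Hypothetical helper (not part of Joshua): map a Java version string
# to its major version, e.g. "1.8.0_292" -> 8 and "11.0.2" -> 11.
java_major_version() {
  version=$1
  case "$version" in
    1.*) version=${version#1.} ;;   # legacy scheme: 1.8.x means Java 8
  esac
  echo "${version%%[!0-9]*}"        # keep only the leading digits
}

# Report on the installed version, if java is on the PATH.
if command -v java >/dev/null 2>&1; then
  # `java -version` prints to stderr; the version is the quoted field.
  installed=$(java -version 2>&1 | awk -F '"' 'NR == 1 { print $2 }')
  major=$(java_major_version "$installed")
  if [ "${major:-0}" -ge 8 ]; then
    echo "Java $installed is new enough for Joshua"
  else
    echo "Java '$installed' could not be confirmed as 1.8 or later" >&2
  fi
fi
```

The case statement handles the older `1.x` numbering, in which Java 8 reports itself as `1.8`.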

Running the decoder in any form requires setting a few basic environment variables: $JAVA_HOME and $JOSHUA. Certain optional portions of the model-training pipeline may also need $MOSES.

export JAVA_HOME=/path/to/java  # maybe /usr/java/home
export JOSHUA=/path/to/joshua
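
Before going further, it can be useful to verify that the required variables are actually set. The following is a minimal sketch, assuming a POSIX shell; `missing_joshua_vars` is a hypothetical helper name, not something Joshua provides. It leaves out $MOSES because that variable is optional.

```shell
#!/bin/sh
# Hypothetical helper (not part of Joshua): print the name of every
# required variable that is unset or empty, one per line.
missing_joshua_vars() {
  for name in JAVA_HOME JOSHUA; do
    eval "value=\${$name:-}"          # indirect lookup of the variable
    [ -n "$value" ] || echo "$name"
  done
}

missing=$(missing_joshua_vars)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on a single line.
  echo "Please set:" $missing >&2
fi
```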

You might also find it helpful to set these:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

Then, compile Joshua by typing:

cd $JOSHUA
mvn clean package

You also need to download and compile KenLM and Thrax:

bash ./download-deps.sh

The basic method for invoking the decoder looks like this:

cat SOURCE | $JOSHUA/bin/joshua-decoder -m MEM -c CONFIG OPTIONS > OUTPUT
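
The invocation above can be wrapped in a small shell function so that the memory setting and config file are not retyped each time. This is only a sketch: the function `decode`, the `DRY_RUN` switch, and the default values `4g` and `joshua.config` are illustrative assumptions, not part of Joshua. Substitute the memory size and config file that match your model.

```shell
#!/bin/sh
# Hypothetical wrapper around the invocation above (not shipped with Joshua).
# Usage: decode SOURCE OUTPUT [OPTIONS...]
# MEM and CONFIG come from the environment; the defaults are placeholders.
decode() {
  src=$1
  out=$2
  shift 2
  mem=${MEM:-4g}                      # assumed example value
  config=${CONFIG:-joshua.config}     # assumed example file name
  if [ -n "${DRY_RUN:-}" ]; then
    # Show the command instead of running it.
    cmd="$JOSHUA/bin/joshua-decoder -m $mem -c $config"
    if [ $# -gt 0 ]; then
      cmd="$cmd $*"
    fi
    echo "cat $src | $cmd > $out"
    return 0
  fi
  # Equivalent to `cat SOURCE | ... > OUTPUT`, without the extra process.
  "$JOSHUA/bin/joshua-decoder" -m "$mem" -c "$config" "$@" < "$src" > "$out"
}

# Example (dry run): DRY_RUN=1 decode input.txt output.txt
```

Setting `DRY_RUN` to any non-empty value prints the command without running it, which is a cheap way to check paths before a long decoding job.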

Some example usage scenarios and scripts can be found in the examples/ directory.

Development With Eclipse

If you plan to work on the decoder, we suggest using Eclipse. To generate the Eclipse project files, type:

mvn eclipse:eclipse

Working with "language packs"

Joshua includes a number of "language packs", which are pre-built models that allow you to use the translation system as a black box, without worrying too much about how machine translation works. You can browse the models available for download on the Joshua website.

Building new models

Joshua includes a pipeline script that allows you to build new models, provided you have training data. This pipeline can be run (more or less) by invoking a single command, which handles data preparation, alignment, phrase-table or grammar construction, and tuning of the model parameters. See the documentation for a walkthrough and more information about the many available options.

License

Joshua is licensed and released under the permissive Apache License v2.0, a copy of which ships with the Joshua source code.
