This repository has been archived by the owner. It is now read-only.
Beginning 6.0 documentation
mjpost committed Dec 30, 2014
1 parent 9ea77b0 commit 216a358d3741d44d4a7c44250dfea151df298812
Showing 31 changed files with 11,659 additions and 24 deletions.
@@ -0,0 +1,3 @@
_site
.philog.LOGFILE
.#*
@@ -0,0 +1,7 @@
---
layout: default6
category: links
title: Advanced features
---


@@ -0,0 +1,24 @@
---
layout: default6
category: links
title: Bundling a configuration
---

A *bundled configuration* is a minimal set of configuration, resource, and script files. The script `$JOSHUA/scripts/support/run-bundler.py` packages up such a run bundle, which can then easily be transferred and shared.

**Example invocation:**

    ./run-bundler.py \
      --force \
      /path/to/rundir/runs/5/test/1/joshua.config \
      /path/to/rundir/runs/5 \
      bundled-configurations \
      "-top-n 1 \
      -output-format %S \
      -mark-oovs false \
      -server-port 5674 \
      -tm/pt \"thrax pt 20 /path/to/rundir/runs/5/test/1/grammar.gz\""

A new directory `./bundled-configurations` will be created, and all the bundled files will be copied or created in it. To use the configuration with Joshua, run the executable file `./bundled-configurations/bundle-runner.sh`.

Note that the additional options between the outer pair of quotation marks are passed as arguments to the `$JOSHUA/scripts/copy-config.pl` script. That script has some special parameters, especially the `-tm/..` option.

@@ -0,0 +1,6 @@
---
layout: default6
title: Features
---

Joshua 5.0 uses a sparse feature representation to encode features internally.
@@ -0,0 +1,72 @@
---
layout: default6
category: advanced
title: Joshua file formats
---
This page describes the formats of Joshua configuration and support files.

## Translation models (grammars)

Joshua supports two grammar file formats: a text-based version (also used by Hiero, shared by
[cdec](), and supported by [hierarchical Moses]()), and an efficient
[packed representation](packing.html) developed by [Juri Ganitkevich](http://cs.jhu.edu/~juri).

Grammar rules follow this format:

[LHS] ||| SOURCE-SIDE ||| TARGET-SIDE ||| FEATURES

The source and target sides contain a mixture of terminals and nonterminals. The nonterminals are
linked across sides by indices. There is no limit to the number of paired nonterminals in the rule
or on the nonterminal labels (Joshua supports decoding with SAMT and GHKM grammars).

[X] ||| el chico [X,1] ||| the boy [X,1] ||| -3.14 0 2 17
[S] ||| el chico [VP,1] ||| the boy [VP,1] ||| -3.14 0 2 17
[VP] ||| [NP,1] [IN,2] [VB,3] ||| [VB,3] [IN,2] [NP,1] ||| 0.0019026637 0.81322956
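
The way indices link nonterminals across the two sides can be checked mechanically. The following is a minimal Python sketch, not part of Joshua: the names `linked_nonterminals` and `sides_agree` are hypothetical, and it assumes (as in the examples above) that a linked pair carries the same label on both sides.

```python
import re

# Matches a bracketed nonterminal carrying an index, e.g. "[X,1]" or "[NP,2]".
NONTERMINAL = re.compile(r"^\[([^\],]+),(\d+)\]$")


def linked_nonterminals(side):
    """Map each nonterminal index on one side of a rule to its label."""
    links = {}
    for token in side.split():
        match = NONTERMINAL.match(token)
        if match:
            links[int(match.group(2))] = match.group(1)
    return links


def sides_agree(source, target):
    """Return True if both sides use the same indices with the same labels."""
    return linked_nonterminals(source) == linked_nonterminals(target)


# The third example rule above: the target side reorders the nonterminals.
print(sides_agree("[NP,1] [IN,2] [VB,3]", "[VB,3] [IN,2] [NP,1]"))
```

Terminals are ignored, so only the bracketed, indexed tokens take part in the comparison.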

The feature values can have optional labels, e.g.:

[X] ||| el chico [X,1] ||| the boy [X,1] ||| lexprob=-3.14 lexicalized=1 numwords=2 count=17
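
To make the four-field layout concrete, here is a small Python sketch that splits a text-format rule and reads its feature values. It is an illustration only, not Joshua's own loader: `parse_rule` is a hypothetical name, and naming unlabeled features by their position ("0", "1", ...) is an assumption made for this example.

```python
def parse_rule(line):
    """Split a text-format grammar rule into LHS, source, target, and features."""
    lhs, source, target, features = [field.strip() for field in line.split("|||")]
    parsed = {}
    for position, item in enumerate(features.split()):
        if "=" in item:
            # Labeled feature, e.g. "lexprob=-3.14".
            name, value = item.split("=", 1)
        else:
            # Unlabeled feature: use its position as the name (an assumption).
            name, value = str(position), item
        parsed[name] = float(value)
    return {
        "lhs": lhs.strip("[]"),
        "source": source.split(),
        "target": target.split(),
        "features": parsed,
    }


rule = parse_rule(
    "[X] ||| el chico [X,1] ||| the boy [X,1] ||| "
    "lexprob=-3.14 lexicalized=1 numwords=2 count=17"
)
print(rule["lhs"], rule["features"])
```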

One file common to decoding is the glue grammar, which for Hiero grammars is defined as follows:

[GOAL] ||| <s> ||| <s> ||| 0
[GOAL] ||| [GOAL,1] [X,2] ||| [GOAL,1] [X,2] ||| -1
[GOAL] ||| [GOAL,1] </s> ||| [GOAL,1] </s> ||| 0

Joshua's [pipeline](pipeline.html) supports extraction of Hiero and SAMT grammars via
[Thrax](thrax.html) or GHKM grammars using [Michel Galley](http://www-nlp.stanford.edu/~mgalley/)'s
GHKM extractor (included) or Moses' GHKM extractor (if Moses is installed).

## Language Model

Joshua has two language model implementations: [KenLM](http://kheafield.com/code/kenlm/) and
[BerkeleyLM](http://berkeleylm.googlecode.com). All language model implementations support the
standard ARPA format output by [SRILM](http://www.speech.sri.com/projects/srilm/). In addition,
KenLM and BerkeleyLM support compiled formats that can be loaded more quickly and efficiently. KenLM
is written in C++ and is supported via a JNI bridge, while BerkeleyLM is written in Java. KenLM is
the default because of its support for left-state minimization.
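
As a quick sanity check before compiling, the header of an ARPA file can be inspected. An ARPA file begins with a `\data\` section listing one `ngram N=COUNT` line per order. The Python sketch below reads those counts; the function name `arpa_ngram_counts` and the sample lines are made up for illustration and are not part of Joshua, KenLM, or BerkeleyLM.

```python
import re


def arpa_ngram_counts(lines):
    """Read the n-gram counts declared in the header of an ARPA file."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = re.match(r"ngram (\d+)=(\d+)$", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line:
                # The first non-empty line that is not a count ends the header.
                break
    return counts


# A made-up header; with a real file, pass an open file object instead.
sample = ["\\data\\", "ngram 1=3", "ngram 2=2", "", "\\1-grams:", "-1.0\t<s>\t-0.5"]
print(arpa_ngram_counts(sample))
```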

### Compiling for KenLM

To compile an ARPA language model for KenLM, use the provided `build_binary` command:

    $JOSHUA/bin/build_binary lm.arpa lm.kenlm

This command takes the `lm.arpa` file and produces the compiled version in `lm.kenlm`.

### Compiling for BerkeleyLM

To compile an ARPA language model for BerkeleyLM, type the following, replacing `MEM` with the amount of memory to give the JVM:

    java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm

The `lm.berkeleylm` file can then be listed directly in the [Joshua configuration file](decoder.html).

## Joshua configuration file

The [decoder page](decoder.html) documents decoder command-line and config file options.

## Thrax configuration

See [the thrax page](thrax.html) for more information about the Thrax configuration file.
@@ -0,0 +1,124 @@
---
layout: default6
title: Welcome to Joshua
---

<h4 class="blog-post-title">Welcome to Joshua!</h4>

<p>This blog post shows a few different types of content that's supported and styled with Bootstrap. Basic typography, images, and code are all supported.</p>
<hr>
<p>Cum sociis natoque penatibus et magnis <a href="#">dis parturient montes</a>, nascetur ridiculus mus. Aenean eu leo quam. Pellentesque ornare sem lacinia quam venenatis vestibulum. Sed posuere consectetur est at lobortis. Cras mattis consectetur purus sit amet fermentum.</p>
<blockquote>
<p>Curabitur blandit tempus porttitor. <strong>Nullam quis risus eget urna mollis</strong> ornare vel eu leo. Nullam id dolor id nibh ultricies vehicula ut id elit.</p>
</blockquote>
<p>Etiam porta <em>sem malesuada magna</em> mollis euismod. Cras mattis consectetur purus sit amet fermentum. Aenean lacinia bibendum nulla sed consectetur.</p>
<h2>Heading</h2>
<p>Vivamus sagittis lacus vel augue laoreet rutrum faucibus dolor auctor. Duis mollis, est non commodo luctus, nisi erat porttitor ligula, eget lacinia odio sem nec elit. Morbi leo risus, porta ac consectetur ac, vestibulum at eros.</p>
<h3>Sub-heading</h3>
<p>Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.</p>
<pre><code>Example code block</code></pre>
<p>Aenean lacinia bibendum nulla sed consectetur. Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa.</p>
<h3>Sub-heading</h3>
<p>Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Aenean lacinia bibendum nulla sed consectetur. Etiam porta sem malesuada magna mollis euismod. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus.</p>
<ul>
<li>Praesent commodo cursus magna, vel scelerisque nisl consectetur et.</li>
<li>Donec id elit non mi porta gravida at eget metus.</li>
<li>Nulla vitae elit libero, a pharetra augue.</li>
</ul>
<p>Donec ullamcorper nulla non metus auctor fringilla. Nulla vitae elit libero, a pharetra augue.</p>
<ol>
<li>Vestibulum id ligula porta felis euismod semper.</li>
<li>Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.</li>
<li>Maecenas sed diam eget risus varius blandit sit amet non magna.</li>
</ol>
<p>Cras mattis consectetur purus sit amet fermentum. Sed posuere consectetur est at lobortis.</p>
</div><!-- /.blog-post -->

<!-- <nav> -->
<!-- <ul class="pager"> -->
<!-- <li><a href="#">Previous</a></li> -->
<!-- <li><a href="#">Next</a></li> -->
<!-- </ul> -->
<!-- </nav> -->


This page contains end-user oriented documentation for the 6.0 release of
[the Joshua decoder](http://joshua-decoder.org/).

## Download and Setup

1. Download Joshua by clicking the big green button above, or from the command line:

wget -q https://github.com/joshua-decoder/joshua-releases/joshua-6.0.tgz

1. Next, unpack it, set environment variables, and compile everything:

tar xzf joshua-6.0.tgz
cd joshua-6.0

# for bash
export JAVA_HOME=/path/to/java
export JOSHUA=$(pwd)
echo "export JOSHUA=$JOSHUA" >> ~/.bashrc

        # for tcsh (which reads ~/.tcshrc, not ~/.profile)
        setenv JAVA_HOME /path/to/java
        setenv JOSHUA `pwd`
        echo "setenv JOSHUA $JOSHUA" >> ~/.tcshrc

ant

(If you don't know what to set `$JAVA_HOME` to, try `/usr/java/default`)

1. If you have a Hadoop installation, make sure that the environment variable `$HADOOP` is set and
points to it. If you don't, Joshua will roll one out for you in standalone mode. Hadoop is only
needed if you plan to build new models with Joshua.

1. In addition, you will need to install Moses if either of the following applies to you:

- You wish to build phrase-based models (Joshua 6.0 includes a phrase-based decoder, but
not the tools for building such a model)

- You are building your own models (phrase- or syntax-based) and wish to use Cherry & Foster's
[batch MIRA tuner](http://aclweb.org/anthology-new/N/N12/N12-1047v2.pdf) instead of the included MERT.

Follow [the instructions for installing Moses
here](http://www.statmt.org/moses/?n=Development.GetStarted), and then define the `$MOSES`
environment variable to point to the root of the Moses installation.
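
Before running anything, it can help to confirm which of the variables mentioned above are set. The following Python sketch is a hypothetical helper, not shipped with Joshua; it treats `JAVA_HOME` and `JOSHUA` as required and `HADOOP` and `MOSES` as optional, following the setup steps above.

```python
import os


def missing_variables(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


required = ["JAVA_HOME", "JOSHUA"]
optional = ["HADOOP", "MOSES"]  # only needed for building models / Moses tools

for name in missing_variables(required):
    print("required variable not set: $" + name)
for name in missing_variables(optional):
    print("optional variable not set: $" + name)
```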

## Quick start

Our <a href="pipeline.html">pipeline script</a> is the quickest way to get started. The following example
trains and tests a complete model translating from Bengali to English.

First, download the Indian languages data:

    wget --no-check-certificate -O indian-languages.tgz https://github.com/joshua-decoder/indian-parallel-corpora/tarball/master
    tar xf indian-languages.tgz
    ln -s joshua-decoder-indian-parallel-corpora-b71d31a input

Then, train and test a model:

$JOSHUA/bin/pipeline.pl --source bn --target en \
--no-prepare --aligner berkeley \
--corpus input/bn-en/tok/training.bn-en \
--tune input/bn-en/tok/dev.bn-en \
--test input/bn-en/tok/devtest.bn-en

This will align the data with the Berkeley aligner, build a Hiero model, tune with MERT, decode the
test sets, and report results that should correspond with what you find on <a
href="/indian-parallel-corpora/">the Indian Parallel Corpora page</a>. For
more details, including information on the many options available with the pipeline script, please
see <a href="pipeline.html">its documentation page</a>.

## More information

For more detail on the decoder itself, including its command-line options, see
[the Joshua decoder page](decoder.html). You can also learn more about other steps of
[the Joshua MT pipeline](pipeline.html), including [grammar extraction](thrax.html) with Thrax and
Joshua's [efficient grammar representation](packing.html).

If you have problems or issues, you might find some help [on our answers page](faq.html) or
[in the mailing list archives](https://groups.google.com/forum/?fromgroups#!forum/joshua_support).

A [bundled configuration](bundle.html), which is a minimal set of configuration, resource, and script files, can be created and easily transferred and shared.
@@ -0,0 +1,139 @@
---
layout: default6
title: Alignment with Jacana
---

## Introduction

jacana-xy is a token-based word aligner for machine translation, adapted from the original
English-English word aligner jacana-align described in the following paper:

A Lightweight and High Performance Monolingual Word Aligner. Xuchen Yao, Benjamin Van Durme,
Chris Callison-Burch and Peter Clark. Proceedings of ACL 2013, short papers.

It currently supports only French-to-English alignment with a very limited feature set, the result
of a one-week hack at the [Eighth MT Marathon 2013](http://statmt.org/mtm13). Please feel free to check
out the code, read to the bottom of this page, and
[send the author an email](http://www.cs.jhu.edu/~xuchen/) if you want to add more language pairs to
it.

## Build

jacana-xy is written in a mixture of Java and Scala. If you build with ant, you have to set the
environment variables `JAVA_HOME` and `SCALA_HOME`. On my system, I have:

export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
export SCALA_HOME=/home/xuchen/Downloads/scala-2.10.2

Then type:

ant

`build/lib/jacana-xy.jar` will be built for you.

If you build from Eclipse, first install scala-ide, then import the whole jacana folder as a Scala project. Eclipse should find the .project file and set up the project automatically for you.

## Demo

`scripts-align/runDemoServer.sh` starts the web demo. Point your browser to http://localhost:8080/ and you should be able to align some sentences.

Note: To tell jacana-xy where to look for resource files, pass the property `JACANA_HOME` to Java when you run it:

java -DJACANA_HOME=/path/to/jacana -cp jacana-xy.jar ......

## Browser

You can also browse one or two alignment files (`*.json`) by opening `src/web/AlignmentBrowser.html` in Firefox.

Note 1: due to strict security settings for accessing local files, Chrome and IE won't work.

Note 2: the input `*.json` files have to be in the same folder as `AlignmentBrowser.html`.

## Align

`scripts-align/alignFile.sh` aligns tab-separated sentence files and writes the output to a `.json` file that the browser accepts:

    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -src fr -tgt en -m fr-en.model -a s.txt -o s.json

`scripts-align/alignFile.sh` also takes GIZA++-style input files (one file containing the source sentences, the other the target sentences) and writes a single `.align` file with dashed alignment indices (e.g. "1-2 0-4"):

    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -m fr-en.model -src fr -tgt en -a s1.txt -b s2.txt -o s.align
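
The dashed alignment format is easy to read back in. The Python sketch below is not part of jacana-xy; `parse_alignment` is a hypothetical name, and the sketch assumes each link is written as source index, dash, target index.

```python
def parse_alignment(line):
    """Turn a line such as "1-2 0-4" into a list of (source, target) index pairs."""
    pairs = []
    for link in line.split():
        source_index, target_index = link.split("-")
        pairs.append((int(source_index), int(target_index)))
    return pairs


print(parse_alignment("1-2 0-4"))
```

Each line of the `.align` file would be passed through this function in turn, giving one list of links per sentence pair.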

## Training

    java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -r train.json -d dev.json -t test.json -m /tmp/align.model

The aligner then trains on `train.json` and reports F1 values on `dev.json` every 10 iterations. Once the stopping criterion has been reached, it tests on `test.json`.
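
For reference, F1 over alignments is commonly computed from the sets of predicted and gold links. The Python sketch below shows that standard set-based definition; it is an assumption that jacana-xy scores alignments this way, and `alignment_f1` is a hypothetical name.

```python
def alignment_f1(predicted, gold):
    """Harmonic mean of precision and recall over sets of (source, target) links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        # Covers empty inputs as well as alignments with no links in common.
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# One of two predicted links is correct, and one of two gold links is found.
print(alignment_f1({(0, 0), (1, 1)}, {(0, 0), (1, 2)}))
```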

Every 10 iterations, a model file is saved to (in this example) `/tmp/align.model.iter_XX.F1_XX.X`. Normally I select the one with the best F1 on `dev.json`, then run a final test on `test.json`:

java -DJACANA_HOME=../ -jar ../build/lib/jacana-xy.jar -t test.json -m /tmp/align.model.iter_XX.F1_XX.X

In this case, since the training data is missing, the aligner assumes it is a test job: it reads the model file given by the `-m` option and tests on `test.json`.
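
Picking the saved model with the best dev F1 can be automated from the file names. This Python sketch is a hypothetical helper, not part of jacana-xy; it assumes the `XX` placeholders in `align.model.iter_XX.F1_XX.X` are plain decimal numbers, and the file names in the example are invented.

```python
import re

# Assumed file-name suffix: ".iter_<iteration>.F1_<score>".
MODEL_NAME = re.compile(r"\.iter_(\d+)\.F1_(\d+(?:\.\d+)?)$")


def best_model(paths):
    """Return the path with the highest F1 (ties go to the later iteration)."""
    scored = []
    for path in paths:
        match = MODEL_NAME.search(path)
        if match:
            scored.append((float(match.group(2)), int(match.group(1)), path))
    if not scored:
        return None
    return max(scored)[2]


# Invented file names, for illustration only.
models = [
    "/tmp/align.model.iter_10.F1_71.2",
    "/tmp/align.model.iter_20.F1_84.5",
    "/tmp/align.model.iter_30.F1_83.9",
]
print(best_model(models))
```

The returned path is what would be passed to `-m` for the final test run.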

All the json files are in a format like the following (also accepted by the browser for display):

[
{
"id": "0008",
"name": "Hansards.french-english.0008",
"possibleAlign": "0-0 0-1 0-2",
"source": "bravo !",
"sureAlign": "1-3",
"target": "hear , hear !"
},
{
"id": "0009",
"name": "Hansards.french-english.0009",
"possibleAlign": "1-1 6-5 7-5 6-6 7-6 13-10 13-11",
"source": "monsieur le Orateur , ma question se adresse à le ministre chargé de les transports .",
"sureAlign": "0-0 2-1 3-2 4-3 5-4 8-7 9-8 10-9 12-10 14-11 15-12",
"target": "Mr. Speaker , my question is directed to the Minister of Transport ."
}
]

The `possibleAlign` field is not used.
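
A quick consistency check on such a file can catch indexing mistakes before training. The Python sketch below is illustrative only and not part of jacana-xy; `out_of_range_links` is a hypothetical name, and it assumes whitespace tokenization and 0-based `source-target` indices, which is consistent with the first record above (where "1-3" links the two "!" tokens).

```python
import json


def out_of_range_links(record):
    """Return the sureAlign links whose indices fall outside the two sentences."""
    source_length = len(record["source"].split())
    target_length = len(record["target"].split())
    problems = []
    for link in record["sureAlign"].split():
        i, j = (int(index) for index in link.split("-"))
        if not (0 <= i < source_length and 0 <= j < target_length):
            problems.append(link)
    return problems


# The first record from the example above.
records = json.loads("""
[
  {
    "id": "0008",
    "name": "Hansards.french-english.0008",
    "possibleAlign": "0-0 0-1 0-2",
    "source": "bravo !",
    "sureAlign": "1-3",
    "target": "hear , hear !"
  }
]
""")

for record in records:
    print(record["id"], out_of_range_links(record))
```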

Training stops after 300 iterations or when the difference in the objective between two consecutive iterations falls below 0.001, whichever happens first. Both values are currently hard-coded. If you need more flexibility here, send me an email!
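
The stopping rule can be sketched as a simple loop. This Python example is not the aligner's actual training code: `train` and the `step` callback (which is assumed to run one iteration and return the objective value) are hypothetical names used only to illustrate the rule.

```python
def train(step, max_iterations=300, tolerance=0.001):
    """Run step() until the objective changes by less than tolerance,
    or until max_iterations is reached, whichever happens first.

    Returns the number of iterations run and the last objective value.
    """
    previous = None
    for iteration in range(1, max_iterations + 1):
        objective = step(iteration)
        if previous is not None and abs(objective - previous) < tolerance:
            return iteration, objective
        previous = objective
    return max_iterations, previous


# A toy objective that flattens out as training proceeds.
iterations, final_objective = train(lambda k: 1.0 / k)
print(iterations, final_objective)
```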

## Support More Languages

To add support for more languages, you need to:

- obtain labelled word alignments (the download already includes French-English under `alignment-data/fr-en`; I also have Chinese-English and Arabic-English; let me know if you have more). Usually 100 labelled sentence pairs are enough.
- implement some feature functions for this language pair.

To add more features, you need to implement the following interface:

edu.jhu.jacana.align.feature.AlignFeature

and override the following function:

addPhraseBasedFeature

For instance, a simple feature for the French-English alignment task, which checks whether two words are listed as translations in Wiktionary, implements the function as:

    def addPhraseBasedFeature(pair: AlignPair, ins: AlignFeatureVector, i: Int, srcSpan: Int, j: Int, tgtSpan: Int,
          currState: Int, featureAlphabet: Alphabet) {
      // No feature is added when j == -1.
      if (j != -1) {
        // Join the tokens covered by each span into a single phrase string.
        val srcTokens = pair.srcTokens.slice(i, i+srcSpan).mkString(" ")
        val tgtTokens = pair.tgtTokens.slice(j, j+tgtSpan).mkString(" ")

        // Fire a binary feature when Wiktionary lists the pair as translations.
        if (WiktionaryMultilingual.exists(srcTokens, tgtTokens)) {
          ins.addFeature("InWiktionary", NONE_STATE, currState, 1.0, srcSpan, featureAlphabet)
        }
      }
    }

This is a more general function that also handles phrase alignment. However, it is best to implement it for token alignment only, since the phrase alignment part is currently very slow to train (60x slower than token alignment).

Some other language-independent and English-only features are implemented under the package `edu.jhu.jacana.align.feature`, for instance:

- `StringSimilarityAlignFeature`: various string similarity measures
- `PositionalAlignFeature`: features based on relative sentence positions
- `DistortionAlignFeature`: Markovian (state transition) features

When you add features for more languages, just create a new package like the one for French-English:

edu.jhu.jacana.align.feature.fr_en

and start coding!
