A hypergraph-based syntactic translation grammar extractor for use with cdec and similar translation systems.
- extract tree transduction rules of various kinds (tree-to-string, tree-to-tree, string-to-tree) from aligned parallel corpora with parses on one or both sides
- transduction rules can be minimal or composed, with limits on the size and complexity of the rules
- extract rules from parse forests or (k-best) lists of parses
- score extracted rules with a variety of standard features
python hg_rule_extractor.py test_data/test.fr test_data/test.en test_data/test.al --t2s -m -s 1000 > rules.t2s
The options are:
- test_data/test.fr is the source side of the bitext, parsed, one tree per line
- test_data/test.en is the target side of the bitext, one sentence per line (not parsed)
- --t2s indicates that xRs (tree-to-string) rules should be extracted
- -m indicates that minimal (non-composed) rules should be extracted
- -s 1000 indicates that rules may have up to 1000 symbols in them (effectively, this disables any size-based filtering)
The above command writes the following rules to the file rules.t2s:
(PP [P] (NP (DT l') [NN])) ||| [1] [2] ||| 0-0 2-1 ||| count=1.0 sent_count=1
(VP (VB a) [VBN] [PP]) ||| [1] [2] ||| 1-0 2-1 ||| count=1.0 sent_count=1
(P à) ||| to ||| 0-0 ||| count=1.0 sent_count=1
(S [NP] [VP] [PUNC]) ||| [1] [2] [3] ||| 0-0 1-1 2-2 ||| count=1.0 sent_count=1
(PUNC .) ||| . ||| 0-0 ||| count=1.0 sent_count=1
(VBN marché) ||| walked ||| 0-0 ||| count=1.0 sent_count=1
(JJ petit) ||| young ||| 0-0 ||| count=1.0 sent_count=1
(DT le) ||| the ||| 0-0 ||| count=1.0 sent_count=1
(NP [DT] [JJ] [NN]) ||| [1] [2] [3] ||| 0-0 1-1 2-2 ||| count=1.0 sent_count=1
(NN école) ||| school ||| 0-0 ||| count=1.0 sent_count=1
(NN garçon) ||| boy ||| 0-0 ||| count=1.0 sent_count=1
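The rule lines above can be read back with a few lines of Python. The following is a minimal sketch, not part of the extractor itself: the function names `parse_rule` and `source_frontier` are hypothetical, and the reading of the third field (pairs of source-leaf index and target-token index) is an assumption inferred from the sample output rather than a documented format.

```python
def parse_rule(line):
    """Split one rule line into source tree, target tokens, alignment links, features."""
    source, target, alignment, features = [f.strip() for f in line.split("|||")]
    # Alignment links look like "0-0 2-1"; assumed to be source-target index pairs.
    links = [tuple(int(i) for i in link.split("-")) for link in alignment.split()]
    feats = {}
    for item in features.split():
        name, value = item.split("=", 1)
        feats[name] = float(value)
    return source, target.split(), links, feats


def source_frontier(tree):
    """Return the leaves of a source tree: terminals and [X] nonterminal slots.

    A token is a leaf if it is not a parenthesis and does not directly follow
    an opening parenthesis (the token after "(" is a node label). This simple
    tokenizer assumes terminals do not themselves contain parentheses.
    """
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    frontier = []
    previous = None
    for token in tokens:
        if token not in ("(", ")") and previous != "(":
            frontier.append(token)
        previous = token
    return frontier


line = "(PP [P] (NP (DT l') [NN])) ||| [1] [2] ||| 0-0 2-1 ||| count=1.0 sent_count=1"
source, target, links, feats = parse_rule(line)
leaves = source_frontier(source)
# Pair each aligned source leaf with its target token.
print([(leaves[s], target[t]) for s, t in links])
# prints [('[P]', '[1]'), ('[NN]', '[2]')]
```

Under this reading, the unaligned source leaf `l'` in the first rule simply has no counterpart on the target side.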
./t2s_score/score.sh rules.t2s test_data/sgt-params.txt test_data/tgs-params.txt
The second and third arguments are files of lexical translation probabilities.
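What score.sh computes internally is not described here. As a rough illustration of one standard rule feature, the sketch below estimates the relative frequency p(target | source) from the `count=` values in extracted rule lines. The function name `relative_frequencies` is hypothetical, the second input rule is made up for the example, and nothing here is claimed to match the features that score.sh actually produces.

```python
from collections import defaultdict


def relative_frequencies(rule_lines):
    """Estimate p(target | source) for each rule from its count= feature."""
    counts = defaultdict(float)   # (source, target) -> summed count
    totals = defaultdict(float)   # source -> summed count over all targets
    for line in rule_lines:
        source, target, _alignment, features = [f.strip() for f in line.split("|||")]
        feats = dict(item.split("=", 1) for item in features.split())
        count = float(feats["count"])
        counts[(source, target)] += count
        totals[source] += count
    return {pair: c / totals[pair[0]] for pair, c in counts.items()}


rules = [
    "(JJ petit) ||| young ||| 0-0 ||| count=1.0 sent_count=1",
    # Hypothetical second rule, added only to make the example non-trivial.
    "(JJ petit) ||| small ||| 0-0 ||| count=3.0 sent_count=3",
]
for (source, target), p in relative_frequencies(rules).items():
    print(source, "->", target, p)
```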
- For information on tree-to-string (xRs) translation rules, see
- For more information on the supported tree-to-tree formalism, see this paper
This software is a rewrite of the Grex grammar extractor.