A hypergraph-based syntactic translation grammar extractor for use with cdec and similar translation systems.

cmu-mtlab/hypergrex

HyperGrex

Supported functionality

  • extract various kinds (tree-to-string, tree-to-tree, string-to-tree) of tree transduction rules from aligned parallel corpora with parses on one or both sides
  • transduction rules can be minimal or composed, with limits on the size and complexity of the rules
  • extract rules from parse forests or (k-best) lists of parses
  • score extracted rules with a variety of standard features

Example tree-to-string extraction

python hg_rule_extractor.py test_data/test.fr test_data/test.en test_data/test.al --t2s -m -s 1000 > rules.t2s

The arguments and options are:

  • test_data/test.fr is the source side of the bitext, parsed, one tree per line
  • test_data/test.en is the target side of the bitext, one sentence per line (not parsed)
  • test_data/test.al contains the word alignments between the two sides of the bitext
  • --t2s indicates that tree-to-string (xRs) rules should be extracted
  • -m indicates that minimal (non-composed) rules should be extracted
  • -s 1000 indicates that rules may have up to 1000 symbols in them (effectively, this disables any size-based filtering)
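
When extraction is part of a larger pipeline, the same command can be assembled and run from Python. The following is a minimal sketch, not part of HyperGrex: the helper names build_extract_command and extract_rules are hypothetical, and only the arguments and flags shown in the command above are used.

```python
import subprocess


def build_extract_command(source_trees, target_sentences, alignments,
                          max_symbols=1000):
    """Return the argument list for the tree-to-string example command.

    Hypothetical helper: it only mirrors the README command line
    (--t2s, -m and -s) and adds no other options.
    """
    return [
        "python", "hg_rule_extractor.py",
        source_trees, target_sentences, alignments,
        "--t2s", "-m", "-s", str(max_symbols),
    ]


def extract_rules(command, output_path):
    """Run the extractor and redirect its standard output to output_path."""
    with open(output_path, "w") as out:
        # check=True raises CalledProcessError if the extractor exits non-zero.
        subprocess.run(command, stdout=out, check=True)
```

Calling extract_rules(build_extract_command("test_data/test.fr", "test_data/test.en", "test_data/test.al"), "rules.t2s") from the repository root would then be equivalent to the shell command above.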

The above command writes the following rules to the file rules.t2s:

(PP [P] (NP (DT l') [NN])) ||| [1] [2] ||| 0-0 2-1 ||| count=1.0 sent_count=1
(VP (VB a) [VBN] [PP]) ||| [1] [2] ||| 1-0 2-1 ||| count=1.0 sent_count=1
(P à) ||| to ||| 0-0 ||| count=1.0 sent_count=1
(S [NP] [VP] [PUNC]) ||| [1] [2] [3] ||| 0-0 1-1 2-2 ||| count=1.0 sent_count=1
(PUNC .) ||| . ||| 0-0 ||| count=1.0 sent_count=1
(VBN marché) ||| walked ||| 0-0 ||| count=1.0 sent_count=1
(JJ petit) ||| young ||| 0-0 ||| count=1.0 sent_count=1
(DT le) ||| the ||| 0-0 ||| count=1.0 sent_count=1
(NP [DT] [JJ] [NN]) ||| [1] [2] [3] ||| 0-0 1-1 2-2 ||| count=1.0 sent_count=1
(NN école) ||| school ||| 0-0 ||| count=1.0 sent_count=1
(NN garçon) ||| boy ||| 0-0 ||| count=1.0 sent_count=1
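
Each rule line has four fields separated by |||: the source tree fragment, the target string, the alignment links, and the features. A minimal sketch of reading such a line follows. It is not part of HyperGrex; the function name parse_rule and the returned dictionary layout are assumptions made for illustration, and the reading of the alignment field as pairs of source and target positions is inferred from the output above.

```python
def parse_rule(line):
    """Split one rule line into its four |||-separated fields.

    Hypothetical helper: the field names in the returned dict are
    chosen for illustration only.
    """
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    source, target, alignment, features = fields

    # Alignment links look like "0-0 2-1": pairs of integer positions.
    links = [tuple(int(i) for i in link.split("-"))
             for link in alignment.split()]

    # Features look like "count=1.0 sent_count=1": name=value pairs.
    feature_values = {}
    for item in features.split():
        name, value = item.split("=", 1)
        feature_values[name] = float(value)

    return {
        "source": source,
        "target": target.split(),
        "alignment": links,
        "features": feature_values,
    }
```

For the first rule in the listing, parse_rule returns the target tokens [1] and [2], the alignment links (0, 0) and (2, 1), and the features count and sent_count, both as floats.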

Adding features to rules

./t2s_score/score.sh rules.t2s test_data/sgt-params.txt test_data/tgs-params.txt

The first argument is the rule file produced by the extractor. The second and third arguments are files of lexical translation probabilities.
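
The features that score.sh computes are not listed here. As an illustration only, one standard count-based feature in grammar extraction is the relative frequency of a target side given its source side. The sketch below shows that computation under stated assumptions: the function relative_frequencies is hypothetical, it is not the HyperGrex scorer, and it takes rules as in-memory (source, target, count) triples rather than reading any file format.

```python
from collections import defaultdict


def relative_frequencies(rules):
    """Estimate p(target | source) from rule counts.

    rules is an iterable of (source, target, count) triples.
    Hypothetical illustration, not the computation done by score.sh.
    """
    pair_counts = defaultdict(float)
    source_totals = defaultdict(float)
    for source, target, count in rules:
        pair_counts[(source, target)] += count
        source_totals[source] += count
    # Divide each pair count by the total count of its source side.
    return {
        (source, target): count / source_totals[source]
        for (source, target), count in pair_counts.items()
    }
```

With hypothetical counts of 3 for (NN garçon) → boy and 1 for (NN garçon) → lad, the estimates are 0.75 and 0.25 respectively.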

For further information

This software is a rewrite of the Grex grammar extractor.
