CCG Parsing: 2015

This is the software used for the following publications:

Weakly-Supervised Grammar-Informed Bayesian CCG Parser Learning
Dan Garrette, Chris Dyer, Jason Baldridge, and Noah A. Smith
In Proceedings of AAAI 2015

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
Dan Garrette, Chris Dyer, Jason Baldridge, and Noah A. Smith
In Proceedings of CoNLL 2015

Getting the code

$ git clone git@github.com:dhgarrette/2015-ccg-parsing.git
$ cd 2015-ccg-parsing

Data setup

Put the English, Chinese, and Italian data into the following directories:

 data/ccgbank
 data/ccgbank-chinese
 data/ccgbank-italian

The files should be arranged as follows:

$ ls data/ccgbank/AUTO
00  02  04  06  08  10  12  14  16  18  20  22  24
01  03  05  07  09  11  13  15  17  19  21  23
$ ls data/ccgbank-chinese/AUTO
00  02  04  06  08  10  20  22  24  26  28  30
01  03  05  07  09  11  21  23  25  27  29  31
$ ls data/ccgbank-italian/pro
civil_law.pro.txt   jrc_acquis.pro.txt  newspaper.pro.txt    
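
Before compiling, it can help to confirm that the three data directories listed above are in place. The function below is a minimal sketch, not part of this repository: the name `check_data_dirs` is hypothetical, and it only checks that the `AUTO` and `pro` directories exist, not that their contents are complete.

```shell
# Hypothetical helper (not shipped with this repository).
# Prints each expected data directory that is missing under the given root
# (default: data) and returns non-zero if any is missing.
check_data_dirs() {
  root="${1:-data}"
  missing=0
  for d in ccgbank/AUTO ccgbank-chinese/AUTO ccgbank-italian/pro; do
    if [ ! -d "$root/$d" ]; then
      echo "missing: $root/$d"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_data_dirs` from the repository root checks `data/`; pass a different root as the first argument if the data lives elsewhere.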

Running the code

First, compile the code and generate the run script:

$ ./compile

Then run:

$ target/start dhg.ccg.run.Parse2015Run [options]

Options

  • --model: The model to use. Required (no default). Options:
    • pcfg: for the pcfg model (see AAAI-2015 paper)
    • scg: for the supertag-context model (see CoNLL-2015 paper)
  • --learning: The learning algorithm to use to train the model. Options: {mcmc}. Default: mcmc.
  • --additional-rules: Additional CCG rules to be allowed by the parser (comma-separated). Example: FC,BX,FC2,BX2. Default: x (meaning no additional rules).
  • --lang: The language of the CCGBank to use. Options: {en, ch, it}. Default: en.
  • --max-sent-len: The maximum sentence length allowed (all sentences longer than this are filtered out). Options: an integer or all for no limit. Default: all.
  • --td-tok: The maximum number of tokens to be read when building the tag dictionary. Options: an integer (a trailing k expands to 000; e.g. 10k becomes 10000) or all for no limit. Default: all.
  • --train-sent: The maximum number of sentences to be used for training. Options: an integer or all for no limit. Default: all.
  • --test-sent: The maximum number of sentences to be used for testing. Options: an integer or all for no limit. Default: all.
  • --sampling-iterations: The number of MCMC sampling iterations to run. Default: 500.
  • --burnin-iterations: The number of MCMC burn-in iterations to run. Default: 0.
  • --alpha-root: See paper for details. Default: 1.0.
  • --alpha-biny: See paper for details. Default: 100.0.
  • --alpha-unry: See paper for details. Default: 100.0.
  • --alpha-term: See paper for details. Default: 10000.0.
  • --alpha-prod: See paper for details. Default: 100.0.
  • --alpha-cntx: See paper for details. Only relevant for --model scg. Default: 1000.0.
  • --root-init: Root parameter initializer. Options:
    • uniform
    • catprior: use the grammar-defined category prior.
    • tdecatprior: use the grammar-defined category prior, with atomic category probabilities estimated using the tag dictionary and raw data. DEFAULT.
  • --nt-prod-init: Nonterminal production parameter initializer (for both binary and unary). Options:
    • uniform
    • catprior: use the grammar-defined category prior.
    • tdecatprior: use the grammar-defined category prior, with atomic category probabilities estimated using the tag dictionary and raw data. DEFAULT.
  • --term-prod-init: Terminal production parameter initializer (for both binary and unary). Options:
    • uniform
    • tdentry: Use the tag dictionary and raw data to estimate terminal (word) probabilities for each supertag. DEFAULT.
  • --tr-init: Context production parameter initializer (for both left and right contexts). Only relevant for --model scg. Options:
    • uniform
    • tdentry: use the tag dictionary and raw data to estimate transition probabilities.
    • combine-uniform: use CCG supertag combinability mixed with uniform.
    • combine-tdentry: use CCG supertag combinability mixed with tdentry. DEFAULT.
  • --pterm: See paper for details. Default: 0.7.
  • --pmod: See paper for details. Default: 0.1.
  • --pfwd: See paper for details. Default: 0.5.
  • --comb-tr-mass: Amount of probability mass devoted to "combinable" contexts (called sigma (σ) in the CoNLL-2015 paper). Only relevant for --model scg. Default: 0.85.
  • --td-cutoff: Exclude tag dictionary entries that occur with less than this proportion in the TD-training corpus. Default: 0.0.
  • --max-accept-tries: Number of samples drawn for each sentence in each iteration. Only relevant for --model scg. Default: 1.
  • --output-file: A file where the parsed trees of the test sentences should be written. Default: do not write out trees.
  • --train-termdel: Allow terminal deletion from a training sentence when a parse is not found. Options: {false, true}. Default: false.
  • --test-termdel: Allow terminal deletion from a test sentence when a parse is not found. Options: {false, true}. Default: false.
  • --max-train-tok: The maximum number of tokens to be read for the training data. Options: an integer (a trailing k expands to 000; e.g. 10k becomes 10000) or all for no limit. Default: all.
  • --mcmc-output-count-file: File where additional data should be written. Default: do not write out this information.
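
As an illustration of how the options above combine, the sketch below wraps one possible invocation in a shell function. The function name `run_scg_small`, the `RUNNER` variable, and all option values are assumptions chosen for a quick small-scale run; they are not recommended settings from the papers. Only the flags themselves come from the list above.

```shell
# Hypothetical wrapper (not shipped with this repository).
# RUNNER defaults to the launcher generated by ./compile; set RUNNER=echo
# for a dry run that prints the arguments instead of starting the parser.
# $1 is the file to which parsed test trees are written (--output-file).
run_scg_small() {
  "${RUNNER:-target/start}" dhg.ccg.run.Parse2015Run \
    --model scg \
    --lang en \
    --max-sent-len 20 \
    --td-tok 10k \
    --train-sent 1000 \
    --sampling-iterations 100 \
    --burnin-iterations 10 \
    --output-file "$1"
}
```

For example, `run_scg_small trees.txt` would train the supertag-context model on a reduced English setup and write the parsed test trees to `trees.txt`. Swapping `--model scg` for `--model pcfg` selects the AAAI-2015 model instead.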