Elements of language processing and learning.


ELPL course 2012

Steven Laan        6036031
Auke Wiggers       6036163

This repository contains an implementation of the CYK and Viterbi algorithms for the course Elements of 
Language Processing and Learning (2012). The authors are Steven Laan and Auke Wiggers. The main program, 
main.py, can be called with several parameters to start parsing. Its two main uses are 
described below.
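
For readers unfamiliar with the technique, the following is a minimal sketch of Viterbi-style CYK parsing.
It is not the code in main.py. It assumes a grammar in Chomsky normal form, and the function name
viterbi_cyk and the two dictionary formats (lexical, binary) are made up for this illustration.

```python
from collections import defaultdict


def viterbi_cyk(words, lexical, binary, start="S"):
    """Return (probability, tree) of the most probable parse, or None.

    Assumed (hypothetical) grammar format, not taken from main.py:
      lexical: dict mapping word -> list of (category, probability)
      binary:  dict mapping (left, right) -> list of (parent, probability)
    """
    n = len(words)
    # chart[(i, j)] maps a category to its best (probability, tree) over words[i:j]
    chart = defaultdict(dict)

    # Fill the spans of width 1 using the lexical rules.
    for i, word in enumerate(words):
        for cat, prob in lexical.get(word, []):
            best = chart[(i, i + 1)].get(cat)
            if best is None or prob > best[0]:
                chart[(i, i + 1)][cat] = (prob, (cat, word))

    # Combine two adjacent spans, keeping only the best entry per category.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, (lp, lt) in chart[(i, k)].items():
                    for right, (rp, rt) in chart[(k, j)].items():
                        for parent, prob in binary.get((left, right), []):
                            p = prob * lp * rp
                            best = chart[(i, j)].get(parent)
                            if best is None or p > best[0]:
                                chart[(i, j)][parent] = (p, (parent, lt, rt))

    return chart[(0, n)].get(start)
```

Trees are returned as nested tuples of the form (category, child, ...). A real parser for treebank
grammars also has to deal with unary rules and unknown words, which this sketch leaves out.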

NOTE: This program uses the Python multiprocessing module. Each process will use as much processing power 
as it can, so if your system has 4 cores, setting the number of processes to 4 will almost certainly 
leave the machine unusable for anything else while the program runs. Also, using multiple processes with 
-hl set higher than 2 will cause a huge increase in memory use, which may cause your system to shut down. 
We're sorry for the inconvenience. Some parameter settings are computationally quite expensive and result 
in very long computation times. 
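
One way to pick a value for -cpu that keeps the machine responsive is to leave at least one core free.
The helper below is only a suggestion and is not part of main.py; the name suggested_process_count and
the reserve parameter are made up for this sketch.

```python
import multiprocessing


def suggested_process_count(reserve=1):
    """Return a process count that leaves `reserve` cores free, but never less than 1."""
    return max(1, multiprocessing.cpu_count() - reserve)


if __name__ == "__main__":
    # Print a value that could be passed to main.py via -cpu.
    print(suggested_process_count())
```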

To parse a single sentence and find the most probable parse for it:
$ python main.py -c <path_to_training_treebank> -s <sentence> 

To parse a text file and write the most probable parse for each sentence in it to an output file,
while forming a new gold standard file:
$ python main.py -c <path_to_training_treebank> -i <test_file> -o <output_file> -g <gold_file>
  -ng <new_gold_file>

These and other options can be displayed by using:
$ python main.py --help 

This displays the following information:

usage: main.py [-h] -c CORPUSFILE (-s PARSESENTENCE | -i INPUTFILE)
               [-cpu NUMBER_OF_PROCESSES] [-nl NUMBER_OF_LINES]
               [-msl MAX_SENTENCE_LENGTH] [-ru THRESHOLD]
               [-cl MAX_SUFFIX_LENGTH] [-lc] [-fp] [-rn] [-v] [-hl HEAD_LEX]

Implementation of CYK and Viterbi algorithms. Authors: Steven Laan and Auke
Wiggers.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUSFILE, --corpusfile CORPUSFILE
                        The path to the file containing the (training)
                        Parse the given sentence
  -i INPUTFILE, --inputfile INPUTFILE
                        Path to an input file containing testsentences.
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        Path to an output file in which most probable parses
                        will be stored.
                        Path to a gold standard file containing parsetrees.
                        Path to a file where the new gold standard file will
                        be saved.
  -cpu NUMBER_OF_PROCESSES, --number-of-processes NUMBER_OF_PROCESSES
                        The number of processes that will be used. Default 1.
  -nl NUMBER_OF_LINES, --number-of-lines NUMBER_OF_LINES
                        Maximum number of lines to parse. Default All.
  -msl MAX_SENTENCE_LENGTH, --max-sentence-length MAX_SENTENCE_LENGTH
                        Maximum sentence length (longer sentences will be
                        skipped). Default 15.
  -ru THRESHOLD, --replace-unknown THRESHOLD
                        Replace words that occur fewer than <threshold> times
                        by an 'unknown word'-token. Default 0.
  -cl MAX_SUFFIX_LENGTH, --classify-unknown MAX_SUFFIX_LENGTH
                        Classify words that do not occur in the corpus based
                        on suffix. Default 0.
  -lc, --lowercase      Lower case words in both grammar and testsentences.
  -fp, --forceparse     Force parsing of the grammar from treebank even if a
                        savefile is available.
  -rn, --replace-numeric
                        Replace numeric values by a 'numeral'-token.
  -v, --verbose         Print feedback during runtime.
  -hl HEAD_LEX, --head-lex HEAD_LEX
                        Enable head-lexicalization, keep track of <head_lex -
                        1> parent categories.
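
The -lc, -rn and -ru options in the help text above describe preprocessing of the words. As a rough
illustration of what such preprocessing might look like (this is a sketch, not the code in main.py),
consider the function below. The token strings <UNK> and <NUM>, the pattern used to detect numeric
values, and the order of the steps are all assumptions.

```python
import re
from collections import Counter

UNKNOWN = "<UNK>"  # hypothetical name for the 'unknown word' token
NUMERAL = "<NUM>"  # hypothetical name for the 'numeral' token
# Assumed definition of "numeric": digits, optionally signed, with . or , separators.
NUMERIC = re.compile(r"^[+-]?\d+([.,]\d+)*$")


def normalize(sentences, threshold=0, lowercase=False, replace_numeric=False):
    """Lowercase, replace numerals, then replace words seen fewer than `threshold` times.

    `sentences` is a list of sentences, each a list of word strings.
    """
    processed = []
    for sentence in sentences:
        words = [w.lower() if lowercase else w for w in sentence]
        if replace_numeric:
            words = [NUMERAL if NUMERIC.match(w) else w for w in words]
        processed.append(words)

    # Count words after the earlier steps, so that e.g. "The" and "the" are merged first.
    counts = Counter(w for sentence in processed for w in sentence)
    return [[w if counts[w] >= threshold else UNKNOWN for w in sentence]
            for sentence in processed]
```

With the default threshold of 0 no word is replaced, which matches the default given for -ru.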

-nl, -lc, -fp, -rn, -v, -ru, -msl and -hl are all optional parameters. A corpus file must always be 
specified with -c. Parsesentence (-s) and Inputfile (-i) are mutually exclusive: the program parses either 
a single sentence or a file of sentences, never both. If -o, -g, -ng, -cpu or -nl is given when parsing a 
single sentence, these arguments are simply ignored.
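
The usage line "(-s PARSESENTENCE | -i INPUTFILE)" is what Python's argparse prints for a required,
mutually exclusive group. The sketch below shows how such a command line could be declared. It is a
reduced reconstruction for illustration, not the parser from main.py: only a few options are included,
and build_parser is a made-up name.

```python
import argparse


def build_parser():
    """Build a reduced parser: required -c, and exactly one of -s or -i."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-c", "--corpusfile", required=True)

    # Exactly one option from this group must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-s", dest="parsesentence")
    group.add_argument("-i", "--inputfile")

    # Only meaningful together with -i; accepted but unused with -s.
    parser.add_argument("-o", "--outputfile")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Passing both -s and -i, or neither, makes argparse print an error and exit.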