Elements of language processing and learning.
ELPL course 2012
----------------
Authors:
Steven Laan 6036031
Auke Wiggers 6036163
--------------------------

This repository contains an implementation of the CYK and Viterbi algorithms
for the course Elements of Language Processing and Learning (2012). The
authors are Steven Laan and Auke Wiggers.

The main program, main.py, can be called with several parameters to initiate
language parsing. Its two main uses are described below.

NOTE: This program uses the Python multiprocessing module. Each process will
take as much processing power as it can, meaning that if your system has 4
cores, setting the number of processes to 4 will almost surely leave the
system unusable for anything else while the program runs. Also, using
multiple processes with the setting -hl > 2 will cause a huge increase in
memory use. This may cause your system to shut down. We're sorry for the
inconvenience. Some parameter settings are computationally quite expensive,
which results in very long computation times.

----------------------------------

To parse a single sentence and find the most probable parse for it:

    $ python main.py -c <path_to_training_treebank> -s <sentence>

To parse a text file and write the most probable parse for each sentence in
it to an output file, while forming a new gold standard file:

    $ python main.py -c <path_to_training_treebank> -i <test_file> -o <output_file> -g <gold_file> -ng <new_gold_file>

These and other options can be displayed by using:

    $ python main.py --help

This displays the following information:

usage: main.py [-h] -c CORPUSFILE (-s PARSESENTENCE | -i INPUTFILE)
               [-o OUTPUTFILE] [-g GOLDSTANDARD] [-ng NEWGOLDSTANDARD]
               [-cpu NUMBER_OF_PROCESSES] [-nl NUMBER_OF_LINES]
               [-msl MAX_SENTENCE_LENGTH] [-ru THRESHOLD]
               [-cl MAX_SUFFIX_LENGTH] [-lc] [-fp] [-rn] [-v] [-hl HEAD_LEX]

Implementation of CYK and Viterbi algorithms. Authors: Steven Laan and Auke
Wiggers.
optional arguments:
  -h, --help            show this help message and exit
  -c CORPUSFILE, --corpusfile CORPUSFILE
                        The path to the file containing the (training)
                        treebank.
  -s PARSESENTENCE, --parsesentence PARSESENTENCE
                        Parse the given sentence
  -i INPUTFILE, --inputfile INPUTFILE
                        Path to an input file containing testsentences.
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        Path to an output file in which most probable parses
                        will be stored.
  -g GOLDSTANDARD, --goldstandard GOLDSTANDARD
                        Path to a gold standard file containing parsetrees.
  -ng NEWGOLDSTANDARD, --newgoldstandard NEWGOLDSTANDARD
                        Path to a file where the new gold standard file will
                        be saved.
  -cpu NUMBER_OF_PROCESSES, --number-of-processes NUMBER_OF_PROCESSES
                        The number of processes that will be used. Default 1.
  -nl NUMBER_OF_LINES, --number-of-lines NUMBER_OF_LINES
                        Maximum number of lines to parse. Default All.
  -msl MAX_SENTENCE_LENGTH, --max-sentence-length MAX_SENTENCE_LENGTH
                        Maximum sentence length (longer sentences will be
                        skipped). Default 15.
  -ru THRESHOLD, --replace-unknown THRESHOLD
                        Replace words that occur fewer than <threshold> times
                        by an 'unknown word'-token. Default 0.
  -cl MAX_SUFFIX_LENGTH, --classify-unknown MAX_SUFFIX_LENGTH
                        Classify words that do not occur in the corpus based
                        on suffix. Default 0.
  -lc, --lowercase      Lower case words in both grammar and testsentences.
  -fp, --forceparse     Force parsing of the grammar from treebank even if a
                        savefile is available.
  -rn, --replace-numeric
                        Replace numeric values by a 'numeral'-token.
  -v, --verbose         Print feedback during runtime.
  -hl HEAD_LEX, --head-lex HEAD_LEX
                        Enable head-lexicalization, keep track of
                        <head_lex - 1> parent categories.

The parameters -nl, -lc, -fp, -rn, -v, -ru, -msl and -hl are all optional.
A corpus file must always be given with -c. The options -s (parse a sentence)
and -i (parse an input file) are mutually exclusive: the program runs in
either single-sentence mode or file mode, and exactly one of the two must be
given.
If -o, -g, -ng, -cpu or -nl is entered when parsing a sentence, these arguments will simply be ignored.
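
The argument rules described above (required -c, exactly one of -s and -i,
file-related options ignored in sentence mode) can be illustrated with a small
argparse sketch. This is not the code from main.py: it covers only a subset
of the options, the helper name effective_options is hypothetical, and
resetting the ignored options to their defaults is an assumption about what
"ignored" means in practice.

```python
import argparse


def build_parser():
    """Build a parser mirroring part of the help text (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Implementation of CYK and Viterbi algorithms.",
    )
    # -c is always required.
    parser.add_argument("-c", "--corpusfile", required=True,
                        help="Path to the (training) treebank.")
    # Exactly one of -s / -i must be supplied.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-s", "--parsesentence",
                      help="Parse the given sentence.")
    mode.add_argument("-i", "--inputfile",
                      help="Path to a file with test sentences.")
    # Options that only make sense when parsing a file.
    parser.add_argument("-o", "--outputfile")
    parser.add_argument("-g", "--goldstandard")
    parser.add_argument("-ng", "--newgoldstandard")
    parser.add_argument("-cpu", "--number-of-processes", type=int, default=1)
    parser.add_argument("-nl", "--number-of-lines", type=int, default=None)
    # A general option that applies in both modes.
    parser.add_argument("-msl", "--max-sentence-length", type=int, default=15)
    return parser


# Defaults restored in sentence mode (assumed meaning of "ignored").
FILE_MODE_DEFAULTS = {
    "outputfile": None,
    "goldstandard": None,
    "newgoldstandard": None,
    "number_of_processes": 1,
    "number_of_lines": None,
}


def effective_options(args):
    """Return the options as a dict, dropping file-mode options for -s."""
    options = dict(vars(args))
    if args.parsesentence is not None:
        options.update(FILE_MODE_DEFAULTS)
    return options


if __name__ == "__main__":
    parsed = build_parser().parse_args(
        ["-c", "treebank.txt", "-s", "the dog barks", "-o", "out.txt"]
    )
    print(effective_options(parsed))
```

Passing both -s and -i to this parser makes argparse exit with an error, and
omitting both does the same, which matches the (-s PARSESENTENCE | -i INPUTFILE)
group shown in the usage line.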