Narrative schemas are structures extracted from text. This algorithm (originally by Chambers and Jurafsky, 2008/2009) automatically extracts information about events from a corpus. Events are verbs paired with their dependencies. For example, in the text "My dog eats potatoes", the events are eat(subject, dog) and eat(object, potato). Words are lemmatized prior to further processing.
From all extracted events, the algorithm then tries to build schemas. A schema is a list of events
that are likely to occur together in a text. This likelihood is based on the coreference and co-occurrence of fillers,
the actual words filling the dependencies.
Using the above example, dog and potato are fillers, dog filling the argument position of subject
and potato filling the argument position of object.
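The event and filler terminology above can be illustrated with a minimal Java sketch. The class name EventSketch and its fields are hypothetical and chosen for illustration only; they are not the classes used by the program, and the two events are written by hand rather than extracted from parsed text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the NarrativeSchemas code base.
final class EventSketch {
    final String verb;        // lemmatized verb, e.g. "eat"
    final String dependency;  // argument position, e.g. "subject"
    final String filler;      // lemmatized word filling that position, e.g. "dog"

    EventSketch(String verb, String dependency, String filler) {
        this.verb = verb;
        this.dependency = dependency;
        this.filler = filler;
    }

    // Renders the event in the notation used in this README: verb(dependency, filler).
    @Override
    public String toString() {
        return verb + "(" + dependency + ", " + filler + ")";
    }

    // Hand-written events for "My dog eats potatoes" (already lemmatized).
    static List<EventSketch> exampleEvents() {
        List<EventSketch> events = new ArrayList<EventSketch>();
        events.add(new EventSketch("eat", "subject", "dog"));
        events.add(new EventSketch("eat", "object", "potato"));
        return events;
    }

    public static void main(String[] args) {
        for (EventSketch event : exampleEvents()) {
            System.out.println(event);
        }
    }
}
```

Running the sketch prints the two events from the example, one per line.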
- To run this program, you need Java version 1.7.
- You also need the Stanford CoreNLP package and Apache Commons CLI.
Stanford CoreNLP can be downloaded at http://nlp.stanford.edu/software/corenlp.shtml
Apache Commons CLI can be downloaded at http://commons.apache.org/proper/commons-cli/
Unzip both packages.
Compile the java files to a jar file. (In this example, the file is named "NarrativeSchemas.jar")
Set the classpath to include the following files:
NarrativeSchemas.jar
commons-cli-1.2.jar
joda-time.jar
stanford-corenlp-YYYY-MM-DD-models.jar
stanford-corenlp-YYYY-MM-DD.jar
xom.jar
where YYYY-MM-DD represents a date. The program was written and tested with stanford-corenlp-2012-07-09. The second file is from the Apache Commons CLI package; the last four files are located in the Stanford CoreNLP package.
Run starter.Starter <arguments>
where the arguments are as follows:
-buffer ARG path to buffer file
-error ARG path to error buffer file
-corpus ARG path to corpus
-f flag to indicate that the corpus path is a folder containing ONLY PLAIN TEXT files
-nyt flag to indicate that the corpus path is a folder containing NYT formatted files
-output ARG desired filename of final output file
-size ARG size of schema (number of verbs). Default: 6
-shuffle shuffle verbs prior to schema building
-sort sort verbs prior to schema building
-write write frequency file. Default: false
-beta ARG set beta value. Default beta value: 0.2
-lambda ARG set lambda value. Default lambda value: 0.08
-fpi use full prepositional information for prepositions. Default: false
-co chain builder only. Only runs the ChainBuilder (first part of algorithm)
-so schema builder only. Only runs the SchemaBuilder (third part of algorithm)
-np no parse. Only runs ChainBuilder and SchemaBuilder
-f and -nyt cannot be set simultaneously.
-sort and -shuffle can be set simultaneously, but -sort always takes precedence over -shuffle.
-size, -shuffle, -sort, -write, -beta, -lambda, -fpi, -co, -so and -np are optional.
-buffer, -error, -corpus and -output take a filename/path as argument.
-size takes an integer as argument.
-beta and -lambda take a floating point number as argument.
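The two interaction rules above (-f and -nyt are mutually exclusive; -sort wins over -shuffle) can be sketched in plain Java. This is only an illustration under stated assumptions: the class OptionRulesSketch and its methods are hypothetical, the sketch does not use Apache Commons CLI, and it is not the program's actual option handling.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of the NarrativeSchemas code base.
final class OptionRulesSketch {

    // Rejects the one combination this README rules out: -f together with -nyt.
    static void checkCorpusFlags(Set<String> flags) {
        if (flags.contains("-f") && flags.contains("-nyt")) {
            throw new IllegalArgumentException("-f and -nyt cannot be set simultaneously");
        }
    }

    // Returns which verb ordering applies. -sort takes precedence over -shuffle.
    // The label "none" for the case where neither flag is given is an assumption
    // of this sketch; the README does not say what the program does in that case.
    static String verbOrdering(Set<String> flags) {
        if (flags.contains("-sort")) {
            return "sort";
        }
        if (flags.contains("-shuffle")) {
            return "shuffle";
        }
        return "none";
    }

    public static void main(String[] args) {
        Set<String> flags = new HashSet<String>(Arrays.asList(args));
        checkCorpusFlags(flags);
        System.out.println("verb ordering: " + verbOrdering(flags));
    }
}
```

For example, passing both -sort and -shuffle to the sketch selects "sort", while passing both -f and -nyt raises an IllegalArgumentException.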
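A sample run (assuming that all relevant files are in a folder called "bin") would look like this, entered as a single command line. Options in square brackets are optional, and "|" separates alternatives. The ";" classpath separator is for Windows; on Linux or macOS, use ":" instead.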
java -cp bin/commons-cli-1.2.jar;bin/NarrativeSchemas.jar;bin/joda-time.jar;
bin/stanford-corenlp-2012-07-06-models.jar;bin/stanford-corenlp-2012-07-09.jar;bin/xom.jar
starter.Starter -buffer ./buffer -error ./errorb (-nyt|-f)
-corpus c:/path/to/corpus/nyt [-np|-co|-so] -output ./schemas_size6_v1 [-size 6] [-shuffle|-sort]
[-write] [-beta 0.3] [-lambda 0.07] [-fpi]
If you notice a lot of skipped files (Skipping file...filename), try running the program with more memory, using -Xmx. 1 GB should suffice. E.g. run:
java -Xmx1024m -cp <classpath> starter.Starter <arguments>
Once running, the program should not be interrupted. If it must be interrupted, it is best to do so while the following is shown
in the console:
Opening file...filename
Adding annotator tokenize
Adding annotator ...
This way, information from all files prior to the one interrupted will be saved. The information saved this way is only an intermediate stage of processing.
If you already have a file containing data from a prior interrupted run, you can bypass the parsing phase and jump directly to the pair generation and schema builder. To do this, specify the flag "-np". All other arguments have to be set as well. In this case, -buffer must point to the file containing the data from the previous run.