NarrativeSchemas

About Narrative Schemas

Narrative Schemas are structures extracted from text. This algorithm (originally described by Chambers and Jurafsky, 2008/2009) automatically extracts information about events from a corpus. An event is a verb together with one of its dependencies. For example, in the text My dog eats potatoes, the events would be eat(subject, dog) and eat(object, potato). Please note that words are lemmatized prior to further processing.
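
To make the event representation concrete, the following minimal Java sketch (purely illustrative; the class and its names are not taken from this repository) models one event as a lemmatized verb, a dependency label and the word filling that dependency:

// Illustrative sketch only; this class is not part of the repository.
public class Event {
    private final String verb;       // lemmatized verb, e.g. "eat"
    private final String dependency; // dependency label, e.g. "subject" or "object"
    private final String filler;     // lemmatized filler word, e.g. "dog" or "potato"

    public Event(String verb, String dependency, String filler) {
        this.verb = verb;
        this.dependency = dependency;
        this.filler = filler;
    }

    public String toString() {
        return verb + "(" + dependency + ", " + filler + ")"; // e.g. eat(subject, dog)
    }
}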

From all extracted events, the algorithm then tries to build schemas. A schema is a list of events that are likely to occur together in a text. This likelihood is estimated from the coreference and co-occurrence of fillers, the actual words that fill the dependencies.
In the example above, dog and potato are fillers: dog fills the subject position and potato fills the object position.
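
As a rough illustration of the co-occurrence idea (this is not the scoring code used in the repository; the actual scoring follows Chambers and Jurafsky and is more involved), one could tally how often two verb/dependency slots are observed sharing the same filler:

// Illustrative sketch only; class and method names are hypothetical.
import java.util.HashMap;
import java.util.Map;

public class SharedFillerCounter {
    // key: an ordered pair of slots, e.g. "eat:subject|walk:subject"
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    // Record that two slots were filled by the same (coreferring) word in a document.
    public void observe(String slotA, String slotB) {
        Integer c = counts.get(key(slotA, slotB));
        counts.put(key(slotA, slotB), c == null ? 1 : c + 1);
    }

    public int count(String slotA, String slotB) {
        Integer c = counts.get(key(slotA, slotB));
        return c == null ? 0 : c;
    }

    private String key(String a, String b) {
        return a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
    }
}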

Running the program

  • Please note that Java version 1.7 is required to run this program.
  • Please note that the Stanford CoreNLP package and Apache Commons CLI are also required.

Stanford CoreNLP can be downloaded at http://nlp.stanford.edu/software/corenlp.shtml
Apache Commons CLI can be downloaded at http://commons.apache.org/proper/commons-cli/

Unzip both packages.

Compile the Java files to a jar file. (In this example, the jar is named "NarrativeSchemas.jar".)

Set the classpath to include the following files:
NarrativeSchemas.jar
commons-cli-1.2.jar
joda-time.jar
stanford-corenlp-YYYY-MM-DD-models.jar
stanford-corenlp-YYYY-MM-DD.jar
xom.jar

where YYYY-MM-DD represents a release date. The program was written and tested with stanford-corenlp-2012-07-09. The second file comes from the Apache Commons CLI package; the last four files are located in the Stanford CoreNLP package.
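
For example, assuming the project sources are under a folder called "src" and all dependency jars are in "bin" (both paths are only illustrative), compiling and packaging could look like this on Windows:

mkdir classes
javac -cp bin/commons-cli-1.2.jar;bin/joda-time.jar;bin/stanford-corenlp-YYYY-MM-DD-models.jar;bin/stanford-corenlp-YYYY-MM-DD.jar;bin/xom.jar -d classes <all .java files under src>
jar cf NarrativeSchemas.jar -C classes .

On Linux or Mac, use ":" instead of ";" as the classpath separator.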

Run starter.TempStart <arguments> where arguments are as follows:

-buffer ARG path to buffer file
-error ARG path to error buffer file
-corpus ARG path to corpus
-f flag to indicate that the corpus path is a folder that contains ONLY PLAIN TEXT files
-nyt flag to indicate that the corpus path is a folder containing NYT-formatted files
-output ARG desired filename of final output file
-size ARG size of schema (number of verbs). Default: 6
-shuffle shuffle verbs prior to schema building
-sort sort verbs prior to schema building
-write write frequency file. Default: false
-beta ARG set beta value. Default beta value: 0.2
-lambda ARG set lambda value. Default lambda value: 0.08
-fpi use full prepositional information. Default: false
-co chain builder only. Only runs the ChainBuilder (first part of algorithm)
-so schema builder only. Only runs the SchemaBuilder (third part of algorithm)
-np no parse. Only runs ChainBuilder and SchemaBuilder

-f and -nyt cannot be set simultaneously.
-sort and -shuffle can be set simultaneously, but -sort always takes precedence over -shuffle.
-size, -shuffle, -sort, -write, -beta, -lambda, -fpi, -co, -so and -np are optional.

-buffer, -error, -corpus and -output take a filename/path as argument.
-size takes an integer as argument.
-beta and -lambda take a floating point number as argument.

A sample run (assuming that all relevant files are in a folder called "bin") would look like this:

java -cp bin/commons-cli-1.2.jar;bin/NarrativeSchemas.jar;bin/joda-time.jar;bin/stanford-corenlp-2012-07-06-models.jar;bin/stanford-corenlp-2012-07-09.jar;bin/xom.jar starter.Starter -buffer ./buffer -error ./errorb (-nyt|-f) -corpus c:/path/to/corpus/nyt [-np|-co|-so] -output ./schemas_size6_v1 [-size 6] [-shuffle|-sort] [-write] [-beta 0.3] [-lambda 0.07] [-fpi]

Problems

If you notice a lot of skipped files (Skipping file...filename), try running the program with more memory using the -Xmx option; 1 GB should suffice. For example, run:

java -Xmx1024m -cp <classpath as above> starter.Starter <arguments>

Interrupting the program

Once running, the program should not be interrupted. If it must be interrupted, it is best to do so while the following is shown in the console:
Opening file...filename
Adding annotator tokenize
Adding annotator ...

This way, information from all files processed before the interrupted one will be saved. Note that the information saved this way is only an intermediate stage of processing.

You can bypass the parsing phase and jump directly to pair generation and schema building if you already have a file containing data from a previously interrupted run. To do this, specify the "-np" flag; all other arguments still have to be set. Please note that in this case the path given to -buffer must point to the file with the data from the previous run (see the example below).
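
For example, a resume run that reuses the buffer file from a previously interrupted run might look like this (classpath abbreviated; all paths are illustrative):

java -cp <classpath as above> starter.Starter -np -buffer ./buffer -error ./errorb -nyt -corpus c:/path/to/corpus/nyt -output ./schemas_resumed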
