- Event Mention Evaluation (EvmEval)
This repository provides file conversion and scoring tools for event mention detection. It consists of the following three pieces of code:
- A simple converter from Brat annotation tool format to CMU detection format
- A scorer that can score system performance based on CMU detection format
- A visualizer that uses the Embedded Brat Viewer (not actively maintained)
To use the software, first prepare the CMU format annotation file from the Brat annotation output using "brat2tbf.py". The scorer then takes two documents in this format, one as the gold standard and one as the system output. The scorer also needs the token files produced by the tokenizer. The usage of each tool is described below.
Use the example shell script "example_run.sh" to perform all of the above steps on the sample documents. If it succeeds, you will find the scoring results in the example_data directory.
Most utility code can be found in the util directory.
The following scripts find the corresponding files by docid and file extension, so the file extension must be provided exactly. The scripts have default values for these extensions, but may require additional arguments if the extensions are changed.
Here is how to find the extension:
Brat annotation files normally have the following name:
<docid>.ann
In this case, the file extension is ".ann", which the converter assumes as the default extension. If yours differs, change it with the "-ae" argument.
In past evaluations, tokenization tables were provided. They normally have the following name:
<docid>.tab
In this case, the file extension is ".tab", which both the converter and the scorer assume as the default extension. If yours differs, change it with the "-te" argument.
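To illustrate why the extension has to match exactly, here is a minimal sketch of docid-based file pairing. It is not the code used by the converter or scorer; the function names `docid_from_filename` and `pair_files` are hypothetical, and the directory layout is an assumption.

```python
from pathlib import Path


def docid_from_filename(filename: str, extension: str) -> str:
    """Strip an exact extension (e.g. ".ann" or ".tab") to recover the docid."""
    if not filename.endswith(extension):
        raise ValueError(f"{filename!r} does not end with {extension!r}")
    return filename[: len(filename) - len(extension)]


def pair_files(ann_dir, token_dir, ann_ext=".ann", tab_ext=".tab"):
    """Yield (docid, annotation_path, token_table_path) for docids found in both places."""
    for ann_path in sorted(Path(ann_dir).glob("*" + ann_ext)):
        docid = docid_from_filename(ann_path.name, ann_ext)
        tab_path = Path(token_dir) / (docid + tab_ext)
        if tab_path.exists():
            yield docid, ann_path, tab_path


# With the wrong extension the recovered docid is wrong, so no token table is found.
print(docid_from_filename("doc1.tkn.ann", ".tkn.ann"))  # doc1
print(docid_from_filename("doc1.tkn.ann", ".ann"))      # doc1.tkn
```

If a file is named `doc1.tkn.ann` but the extension is left at ".ann", the docid becomes `doc1.tkn`, and the lookup for the matching token table fails.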
The current scorer can score event mention detection and coreference based on the (.tbf) format. It also requires the token table files to detect invisible words and to generate CoNLL style coreference files.
- Produce F1-like scores by mapping system mentions to gold standard mentions; read the scoring documentation for more details.
- Produce a comparison output showing the differences between system and gold standard: a. a text-based comparison output (-d option); b. a web-based comparison output using Brat's embedded visualization (-v option)
- If specified, generate temporary CoNLL format files and use the CoNLL reference scorer to produce coreference scores
- Conduct temporal evaluation as well, if specified with the "-a" argument.
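As a rough illustration of the "F1-like" score mentioned in the list above, the sketch below computes precision, recall, and F1 from a matched score and the mention counts. This is the standard formula only; how the scorer maps system mentions to gold mentions, and how partial matches are credited, is defined in the scoring documentation and is not reproduced here. The function name and its parameters are hypothetical.

```python
def precision_recall_f1(matched: float, num_system: int, num_gold: int):
    """Return (precision, recall, f1) given a matched score and mention counts.

    `matched` is assumed to be the total credit assigned to system mentions
    that were mapped to gold mentions (it may be fractional for partial matches).
    """
    precision = matched / num_system if num_system else 0.0
    recall = matched / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 3 matched mentions out of 4 system mentions and 6 gold mentions.
p, r, f1 = precision_recall_f1(3, 4, 6)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.75 R=0.50 F1=0.60
```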
usage: scorer_v1.8.py [-h] -g GOLD -s SYSTEM [-d COMPARISON_OUTPUT]
[-o OUTPUT] [-c COREF] [-a SEQUENCING] [-t TOKEN_PATH]
[-m COREF_MAPPING] [-of OFFSET_FIELD]
[-te TOKEN_TABLE_EXTENSION] [-ct COREFERENCE_THRESHOLD]
[-b] [--eval_mode {char,token}] [-wl TYPE_WHITE_LIST]
[-dn DOC_ID_TO_EVAL]
Event mention scorer, provides support to Event Nugget scoring, Event Coreference and Event Sequencing scoring.
core arguments:
-g GOLD, --gold GOLD Golden Standard
-s SYSTEM, --system SYSTEM System output
optional arguments:
-d COMPARISON_OUTPUT, --comparison_output COMPARISON_OUTPUT
Compare and help show the difference between system
and gold
-o OUTPUT, --output OUTPUT
Optional evaluation result redirects, put eval result
to file
-c COREF, --coref COREF
Eval Coreference result output, need to put the
referenceconll coref scorer in the same folder with
this scorer
-a SEQUENCING, --sequencing SEQUENCING
Eval Event sequencing result output (After and
Subevent)
-t TOKEN_PATH, --token_path TOKEN_PATH
Path to the directory containing the token mappings
file, only used in token mode.
-m COREF_MAPPING, --coref_mapping COREF_MAPPING
Which mapping will be used to perform coreference
mapping.
-of OFFSET_FIELD, --offset_field OFFSET_FIELD
A pair of integer indicates which column we should
read the offset in the token mapping file, index
startsat 0, default value will be [2, 3]
-te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
any extension appended after docid of token table
files. Default is [.tab], only used in token mode.
-ct COREFERENCE_THRESHOLD, --coreference_threshold COREFERENCE_THRESHOLD
Threshold for coreference mention mapping
-b, --debug turn debug mode on
--eval_mode {char,token}
Use Span or Token mode. The Span mode will take a span
as range [start:end], while the Token mode consider
each token is provided as a single id.
-wl TYPE_WHITE_LIST, --type_white_list TYPE_WHITE_LIST
Provide a file, where each line list a mention type
subtype pair to be evaluated. Types that are out of
this white list will be ignored.
-dn DOC_ID_TO_EVAL, --doc_id_to_eval DOC_ID_TO_EVAL
Provide one single doc id to evaluate.
The validator checks whether the supplied "tbf" file follows the assumed structure. The validator will exit with status 255 if any errors are found. Validation logs will be written to the same directory as the validator, with "errlog" as the extension.
usage: validator.py [-h] -s SYSTEM [-tm] [-t TOKEN_PATH] [-of OFFSET_FIELD]
[-te TOKEN_TABLE_EXTENSION] [-wc WORD_COUNT_FILE]
[-ty TYPE_FILE] [-b]
The validator check whether the supplied 'tbf' file follows assumed structure.
The validator will exit at status 255 if any errors are found, validation
logs will be written at the same directory of the validator with 'errlog' as
extension.
core arguments:
-s SYSTEM, --system SYSTEM System output
optional arguments:
-h, --help show this help message and exit
-tm, --token_mode Token mode, default is false.
-t TOKEN_PATH, --token_path TOKEN_PATH
Path to the directory containing the token mappings
file, only in token mode.
-of OFFSET_FIELD, --offset_field OFFSET_FIELD
A pair of integer indicates which column we should
read the offset in the token mapping file, index
starts at 0, default value will be [2, 3]. Only used
in token mode.
-te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
any extension appended after docid of token table
files. Default is [.tab]
-wc WORD_COUNT_FILE, --word_count_file WORD_COUNT_FILE
A word count file that can be used to help validation,
such as the character_counts.tsv in LDC2016E64.
-ty TYPE_FILE, --type_file TYPE_FILE
If provided, the validator will check whether the type
subtype pair is valid.
-b, --debug turn debug mode on
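The exit status described above makes the validator easy to call from another script. The following is a minimal sketch, assuming that `validator.py` is in the current working directory and that a successful validation exits with status 0 (the latter is an assumption, since only status 255 is documented). The helper names are hypothetical; the flags come from the usage message above.

```python
import subprocess
import sys


def build_validator_command(system_file, token_dir=None, type_file=None):
    """Assemble the validator command line from the documented flags."""
    cmd = [sys.executable, "validator.py", "-s", system_file]
    if token_dir is not None:
        # Token mode needs the directory that holds the token tables.
        cmd += ["-tm", "-t", token_dir]
    if type_file is not None:
        cmd += ["-ty", type_file]
    return cmd


def interpret_exit_status(returncode: int) -> str:
    """Map the validator's exit status to a label."""
    if returncode == 255:
        return "invalid"      # documented: errors were found
    if returncode == 0:
        return "valid"        # assumption: 0 means no errors
    return "unexpected"       # anything else, e.g. a crash


def validate(system_file, **kwargs) -> str:
    """Run the validator and return its interpreted status."""
    result = subprocess.run(build_validator_command(system_file, **kwargs))
    return interpret_exit_status(result.returncode)
```

For example, `validate("system.tbf", type_file="types.txt")` would run the validator with the "-ty" check enabled; both file names are placeholders.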
This is a tool that converts the Brat annotation format to the TBF format. We currently try to make as few assumptions as possible. However, in order to resolve transitive coreference redirects automatically, the relation for coreference must be named "Coreference". The tool is also developed for event coreference only.
- ID convention
The default setup follows the Brat v1.3 ID convention:
- T: text-bound annotation
- R: relation
- E: event
- A: attribute
- M: modification (alias for attribute, for backward compatibility)
- N: normalization [new in v1.3 of Brat]
- #: note
Further development might allow a customized ID convention.
- This code only scans for and detects event mentions and their attributes. Event arguments and entities are currently not handled. Annotations other than event mentions (with their attributes and text spans) are ignored, which means that only "E" annotations and their related attributes are read.
- Discontinuous text-bound annotations will be supported.
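The ID convention listed above can be sketched as a simple prefix lookup. This is only an illustration of the convention, not the converter's actual parsing code; the function names and the sample lines are made up for the example.

```python
# Prefix-to-kind table following the Brat v1.3 ID convention listed above.
BRAT_ID_PREFIXES = {
    "T": "text-bound annotation",
    "R": "relation",
    "E": "event",
    "A": "attribute",
    "M": "modification",
    "N": "normalization",
    "#": "note",
}


def classify_brat_line(line: str):
    """Return (annotation_id, kind) for one line of a Brat .ann file."""
    annotation_id = line.split("\t", 1)[0]
    kind = BRAT_ID_PREFIXES.get(annotation_id[:1], "unknown")
    return annotation_id, kind


def event_lines(lines):
    """Keep only the event ("E") annotations, which are the converter's starting point."""
    return [line for line in lines if classify_brat_line(line)[1] == "event"]


# Made-up sample lines in Brat standoff format.
sample = [
    "T1\tAttack 10 16\tbombed",
    "E1\tAttack:T1",
    "A1\tRealis E1 Actual",
    "R1\tCoreference Arg1:E1 Arg2:E2",
]
print(event_lines(sample))  # ['E1\tAttack:T1']
```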
brat2tokenFormat.py [-h] (-d DIR | -f FILE) -t TOKENPATH [-o OUT]
[-oe EXT] [-i EID] [-w] [-te TOKEN_TABLE_EXTENSION]
[-ae ANNOTATION_EXTENSION] [-b]
This converter converts Brat annotation files to a single token-based event mention description file (CMU format). It accepts either a single file name or the name of a directory that contains the Brat annotation output. The converter also requires token offset files that share the same name as the annotation file, with the extension .txt.tab. The converter searches for the token file in the directory specified by the '-t' argument.
Required Arguments:
-d DIR, --dir DIR directory of the annotation files
-f FILE, --file FILE name of one annotation file
-t TOKENPATH, --tokenPath TOKENPATH
directory to search for the corresponding token files
Optional arguments:
-h, --help show this help message and exit
-o OUT, --out OUT output path, 'converted' in the current path by
default
-oe EXT, --ext EXT output extension, 'tbf' by default
-i EID, --eid EID an engine id that will appears at each line of the
output file. 'brat_conversion' will be used by default
-w, --overwrite force overwrite existing output file
-te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
any extension appended after docid of token table
files. Default is .txt.tab
-ae ANNOTATION_EXTENSION, --annotation_extension ANNOTATION_EXTENSION
any extension appended after docid of annotation
files. Default is .tkn.ann
-b, --debug turn debug mode on
This software converts LDC's XML format for the TAC KBP 2015 Event Nugget task to the Brat format. More specifically, it converts LDC's event nuggets and coreferences to events and coreference links that can be viewed via the Brat web interface. Brat annotation configurations for the output are available in the directory src/main/resources/.
The software requires Java 1.8. See pom.xml for other dependencies.
You can see its usage with the following command:
$ java -jar target/converter-1.0.3-jar-with-dependencies.jar -h
Option Description
------ -----------
-a <annotation dir> annotation directory
--ae <annotation file extension> annotation file extension
-d whether to detag text
-h help
-i <input mode> input mode ("event-nugget")
-o <output dir> output directory
-t <text dir> text directory
--te <text file extension> text file extension
The software requires Java 1.8. A precompiled jar is located in the bin directory. To compile the project from source you will also need Maven 2.7+.
Our tokenizer implementation is based on the tokenizer in the Stanford CoreNLP tool. The software is implemented in Java, and its requirements are as follows:
- Java 1.8
- The same number of text files and brat annotation files (*.ann) with the same file base name
java -jar bin/token-file-maker-1.0.3-jar-with-dependencies.jar -a <annotation> -e <extension> [-h] -o <output> [-s <separator>] -t <text>
-a <annotation> annotation directory
-e <extension> text file extension
-h print this message
-o <output> output directory
-s <separator> separator chars for tokenization
-t <text> text directory
These are tab-delimited files which map the tokens to their positions in the tokenized files. A mapping table contains 4 columns for each row, and the rows contain an ordered listing of the document's tokens. The columns are:
- token_id: A string of "t" followed by a token-number beginning at 0
- token_str: The literal string of a given-token
- tkn_begin: Index of the token's first character in the tkn file
- tkn_end: Index of the token's last character in the tkn file
Please note that all 4 fields are required and will be used:
- The converter will use token_id, tkn_begin, tkn_end to convert characters to token id
- The scorer will use the token_str to detect invisible words
The tokenization table files are created using our automatic tool, which wraps the Stanford tokenizer and provides boundary checks.
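The column layout above can be read with a few lines of Python. The sketch below is an illustration under assumptions, not the converter's own code: it assumes the column order given above, treats tkn_end as inclusive (as the description of "last character" suggests), and skips any row whose begin offset is not a number, such as a possible header row. The names `Token`, `read_token_table`, and `span_to_token_ids` are hypothetical, and the sample table is made up.

```python
import csv
import io
from typing import List, NamedTuple


class Token(NamedTuple):
    token_id: str
    token_str: str
    begin: int
    end: int  # index of the last character, inclusive


def read_token_table(handle, offset_fields=(2, 3)) -> List[Token]:
    """Read a tab-delimited token table; offset_fields are 0-based column indices."""
    begin_col, end_col = offset_fields
    tokens = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= max(begin_col, end_col) or not row[begin_col].isdigit():
            continue  # blank line or header row (assumption)
        tokens.append(Token(row[0], row[1], int(row[begin_col]), int(row[end_col])))
    return tokens


def span_to_token_ids(tokens, start, end):
    """Return ids of tokens overlapping the inclusive character span [start, end]."""
    return [t.token_id for t in tokens if t.begin <= end and t.end >= start]


# Made-up table for the text "The attack failed".
sample = "t0\tThe\t0\t2\nt1\tattack\t4\t9\nt2\tfailed\t11\t16\n"
tokens = read_token_table(io.StringIO(sample))
print(span_to_token_ids(tokens, 4, 16))  # ['t1', 't2']
```

The default `offset_fields=(2, 3)` mirrors the default of the "-of" argument shown in the usage messages.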
The visualization is provided as a mechanism to compare different outputs. It is optional and can be ignored if one is only interested in the scores. This code may be updated frequently. Please refer to the command line "-h" for detailed instructions.
The visualization code represents mention differences in JSON, which is then passed to Embedded Brat.
Recent changes make visualizing clusters possible by creating an additional JSON object. When enabled, there will be a cluster selector on the webpage. Selecting a cluster hides all other event mentions.
The visualization mapping does not fully reflect the scoring process; it is just a means to help compare the data. Note that there are up to 2^k different ways of aligning the mentions, where k is the number of attributes. The input to the visualization system is the most basic mapping (span only). It need not capture the true mapping of mention type or realis status: because several mapping options are identical in a span-only mapping, the visualization system simply chooses whichever comes first.
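One way to read the 2^k figure above is that each attribute can either be required to match or be ignored when aligning mentions. The sketch below enumerates those combinations for the two attributes named in the paragraph; it illustrates the count only and says nothing about how the scorer computes each mapping. The attribute labels and the function name are assumptions.

```python
from itertools import combinations


def attribute_combinations(attributes):
    """List every subset of attributes, from span only (empty) to all attributes."""
    combos = []
    for size in range(len(attributes) + 1):
        combos.extend(combinations(attributes, size))
    return combos


# k = 2 attributes gives 2^2 = 4 ways; the empty tuple is the span-only mapping.
for combo in attribute_combinations(["mention_type", "realis_status"]):
    print(combo if combo else "span only")
```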
The text-based visualization can be generated using "scorer.py" by supplying the "-d" argument. The format is straightforward: a text document is produced for comparison. The annotations of both systems are displayed on one line, separated by "|".
The web-based visualization takes the text visualization file, then:
- It converts the file to the Brat Embedded JSON format and stores it in the visualization folder (visualization/json)
- It starts a server in the visualization folder at localhost:8000
- The user can now browse the locally hosted site for comparison
- The user can stop the server when done and restart it at any time using "start.sh". It is not necessary to regenerate the JSON data if one only wishes to view the existing data
usage: visualize.py [-h] -d COMPARISON_OUTPUT -t TOKENPATH [-x TEXT]
[-v VISUALIZATION_HTML_PATH] [-of OFFSET_FIELD]
[-te TOKEN_TABLE_EXTENSION] [-se SOURCE_FILE_EXTENSION]
Mention visualizer, will create a side-by-side embedded visualization from the mapping
Required Arguments:
-d COMPARISON_OUTPUT, --comparison_output COMPARISON_OUTPUT
The comparison output file between system and gold,
used to recover the mapping
-t TOKENPATH, --tokenPath TOKENPATH
Path to the directory containing the token mappings
file
Optional Arguments:
-h, --help show this help message and exit
-x TEXT, --text TEXT Path to the directory containing the original text
-v VISUALIZATION_HTML_PATH, --visualization_html_path VISUALIZATION_HTML_PATH
The Path to find visualization web pages, default path
is [visualization]
-of OFFSET_FIELD, --offset_field OFFSET_FIELD
A pair of integer indicates which column we should
read the offset in the token mapping file, index
startsat 0, default value will be [2, 3]
-te TOKEN_TABLE_EXTENSION, --token_table_extension TOKEN_TABLE_EXTENSION
any extension appended after docid of token table
files. Default is [.txt.tab]
-se SOURCE_FILE_EXTENSION, --source_file_extension SOURCE_FILE_EXTENSION
any extension appended after docid of source
files.Default is [.tkn.txt]