Releases: UCDenver-ccp/CRAFT
CRAFT v5.0.2
This release fixes a bug so that the #end document tags are now properly appended to the end of CoNLL-formatted documents.
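As a rough illustration of what this fix guarantees, the sketch below checks that the last non-blank line of a CoNLL-formatted document is the #end document tag. The function name and the sample text are hypothetical, and the exact layout of the CRAFT CoNLL files is assumed rather than taken from the distribution.

```python
def ends_with_end_document(text: str) -> bool:
    """Return True if the last non-blank line is exactly '#end document'."""
    # Drop blank lines so trailing newlines do not affect the check.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "#end document"


# Hypothetical sample documents (not taken from the corpus).
complete = "#begin document (example); part 000\nexample 0 0 mice\n#end document\n"
truncated = "#begin document (example); part 000\nexample 0 0 mice\n"

print(ends_with_end_document(complete))
print(ends_with_end_document(truncated))
```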
CRAFT v5.0.1
This release fixes a bug that prevented the MONDO annotations from being used together with other annotation types.
CRAFT v5.0.0
This release of the CRAFT corpus incorporates annotations to the Mondo Disease Ontology (MONDO).
CRAFT v4.0.1
This release incorporates updates to a few incorrect annotations. The guidelines used for coreference annotation have also been added to the distribution.
CRAFT v4.0.0
CRAFT v4.0.0 marks the integration of the 30 reserved articles that were used as the evaluation set for the 2019 CRAFT Shared Task. The CRAFT corpus now consists of 97 full-text articles and accompanying annotations.
CRAFT v3.1.3
Note: The CRAFT v3.1.3 release is the release used for the 2019 CRAFT Shared Task.
This release updates the file-conversion dependency to v0.2.2 to handle/prevent some improper discontinuous spans in the coreference annotations. For details, please see this issue: UCDenver-ccp/craft-shared-tasks#1 and the changes made in the file-conversion project: https://github.com/UCDenver-ccp/file-conversion/blob/master/CHANGES.md
CRAFT v3.1.2
This release includes:
-
Corrected erroneous extension class prefixes in the concept-annotation/GO_MF/GO_MF+extensions/GO_MF_stub+GO_MF_extensions.obo file.
-
Reverted the HeadRule used in dependency conversion back to STANFORD to match the newly added CoNLL-U files, with a corresponding update to the CoNLL-X files.
-
Added correctly formatted CoNLL-U files for the dependency parses. See structural-annotation/dependency/conllu. Many thanks to Manuel Ciosici and Sampo Pyysalo for their help in creating and vetting these files.
CRAFT v3.1.1
Changes in this release include the following:
-
Reverted the file format of the CRAFT dependency data to CoNLL-X. The CoNLL-U files in v3.1 were improperly formatted (among other things, the XPOS and UPOS columns were mistakenly swapped), and there is no UPOS data to include. This change aligns the dependency files more closely with the original CRAFT dependency files (available in CRAFT v3.0 and earlier releases). Those original files were missing one POS column, which is now included to fully comply with the CoNLL-X format. The non-compliant CoNLL-U files have been removed from the distribution. Also, a minor change was made to the HeadRule used in the conversion from treebank files to the dependency files: the CONLL HeadRule is now used instead of the STANFORD HeadRule.
-
The file-conversion library dependency was updated to 0.2.1 to include changes to support the CoNLL-X file generation mentioned above as well as updates to allow the boot script to work with Java >= 9.0.
-
The parse for the final sentence of document 14611657 was added to the 14611657.tree file, and the dependency parse was automatically derived and added to 14611657.conll.
-
For document 16098226, the use of NCBITaxon:1910954 was replaced with NCBITaxon:10847 (partly because NCBITaxon:1910954 is not present in the NCBITaxonomy OBO file that is distributed with CRAFT).
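The column mix-up mentioned in the first item above is easier to see with the two layouts side by side. The sketch below lists the ten fields of each format as defined in the public CoNLL-X and CoNLL-U descriptions and applies a simple heuristic: a value in the UPOS column that is neither "_" nor one of the 17 universal tags suggests that the POS columns are swapped. The helper names and the sample token lines are hypothetical; this is not the check used to prepare the release.

```python
# Column layouts as given in the public format descriptions.
CONLLX_FIELDS = ("ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
                 "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL")
CONLLU_FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# The 17 universal part-of-speech tags.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def parse_token_line(line, fields):
    """Split one tab-separated token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    return dict(zip(fields, values))


def upos_looks_valid(token):
    """Heuristic: the UPOS column holds '_' or a universal tag."""
    return token["UPOS"] == "_" or token["UPOS"] in UPOS_TAGS


# Hypothetical token lines: the second has UPOS and XPOS swapped.
well_formed = "1\tmice\tmouse\tNOUN\tNNS\t_\t2\tnsubj\t_\t_"
swapped = "1\tmice\tmouse\tNNS\tNOUN\t_\t2\tnsubj\t_\t_"

print(upos_looks_valid(parse_token_line(well_formed, CONLLU_FIELDS)))
print(upos_looks_valid(parse_token_line(swapped, CONLLU_FIELDS)))
```

The heuristic is deliberately loose: a few tags such as SYM exist in both the universal and Penn Treebank tag sets, so a single token cannot prove a swap, but a whole file with mostly non-universal values in the UPOS column is a strong signal.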
CRAFT v3.1
The changes detailed below were prompted in part by preparation of the CRAFT corpus for the CRAFT Shared Task.
Changes for v3.1
-
The top-level directory has been reorganized into three main directories for annotations.
- concept_annotation/ stores all annotations of ontology concept mentions
- structural_annotation/ stores all syntactic annotations and annotations related to document structure
- coreference_annotation/ stores all coreference annotations
-
A Clojure Boot script has been added to the distribution to generate annotation files in different formats on request. With this addition, annotation files in alternative formats (e.g. brat, uima, knowtator-2) have been removed, leaving only the native file format for each annotation type. Doing so has reduced the overall size of the CRAFT project to under the 1 GB threshold imposed by GitHub.
-
Knowtator-2 project archives have been removed from the distribution. They can now be created dynamically using the new Clojure Boot script.
-
Some treebank files have been adjusted based on errors reported by the CoNLL 2018 universal dependency shared task evaluation script (http://universaldependencies.org/conll18/evaluation.html) when run over dependency parses derived from the treebank files. Most errors took the form of multiple ROOT nodes present in the dependency parse and were related to nested CAPTION constructs in the treebank files. These were addressed by un-nesting the CAPTION constructs. There were also a few errors related to empty forms in the resulting dependency parses. These stemmed from lists in the treebank files that used empty forms, e.g. (: ) or (SYM ), and these were removed from the treebank files.
-
New versions of the dependency files have been derived from the manually annotated treebank files using the ClearNLP library, specifically the C2DConvert.java application (https://github.com/clir/clearnlp/blob/master/src/main/java/edu/emory/clir/clearnlp/bin/C2DConvert.java). The file format for the dependency files has also been updated to use the CoNLL-U file format (https://universaldependencies.org/format.html). The original versions of the dependency files have been removed from the repository.
-
Some erroneous relations were removed from a single knowtator-2 annotation file for the CL+extension concepts.
-
The coreference annotations have been revised to resolve instances of identity chains sharing mentions. The original knowtator files have been removed and replaced with knowtator-2 format files that contain the revised annotations. For details on the changes to the coreference annotations, please see this README.
-
The distribution now includes XSD files for the knowtator and knowtator-2 XML file formats. See the schema/ directory.
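The multiple-ROOT errors described in the treebank item above can be spotted with a small check over a dependency file. The sketch below is a minimal illustration, not the CoNLL 2018 evaluation script: it assumes tab-separated CoNLL-style token lines with the HEAD value in the seventh column, blank lines between sentences, and comment lines starting with "#". The function name and sample data are hypothetical.

```python
def root_counts(conll_text):
    """Return the number of tokens with HEAD == '0' for each sentence."""
    counts = []
    current = None  # None means we are between sentences
    for raw in conll_text.splitlines():
        if not raw.strip():
            # A blank line closes the current sentence, if any.
            if current is not None:
                counts.append(current)
                current = None
            continue
        if raw.startswith("#"):
            continue  # skip comment lines
        if current is None:
            current = 0
        cols = raw.split("\t")
        # HEAD is the seventh column in both CoNLL-X and CoNLL-U.
        if len(cols) > 6 and cols[6] == "0":
            current += 1
    if current is not None:
        counts.append(current)
    return counts


# Hypothetical two-sentence sample; the second sentence has two roots.
sample = (
    "1\tA\t_\t_\tDT\t_\t2\tdet\t_\t_\n"
    "2\tmouse\t_\t_\tNN\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tFigure\t_\t_\tNN\t_\t0\troot\t_\t_\n"
    "2\t1\t_\t_\tCD\t_\t0\troot\t_\t_\n"
)

print(root_counts(sample))  # [1, 2]
```

Any sentence whose count is not exactly 1 would be a candidate for the kind of manual inspection described above.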
CRAFT v3.0
This CRAFT v3.0 release consists of 67 full-text, open-access biomedical journal articles and gold-standard annotations of them along multiple axes, specifically: sentence segmentation, tokenization, part-of-speech tagging, markup of dependency structures, treebanking, markup of coreferential noun phrases, markup of document sections and typography, and annotation of concepts represented in ten Open Biomedical Ontologies (the Chemical Entities of Biological Interest ontology, Cell Ontology, Gene Ontology Biological Process, Gene Ontology Cellular Component, Gene Ontology Molecular Function, Molecular Process Ontology, NCBI Taxonomy, Protein Ontology, Sequence Ontology, and the Uberon anatomical ontology). Also included are the versions of the ontologies used for the concept annotations and various text files useful for comparing automatically generated concept annotations to this gold standard.