Commit
Merge branch 'master' into feature/issue1190-maui
* master:
  #1041 - Add parameter to enable lower-cased lookup of first word in sentence in SfstAnnotator
  #1362 - NifWriter does not write out NE identifier
  #1362 - NifWriter does not write out NE identifier
  #1152 - Introduce "order" feature on tokens
  #1366 - Added support in CONLL-U reader for document and paragraph IDs
  #1041 - Add parameter to enable lower-cased lookup of first word in sentence in SfstAnnotator
  #1366 - Added support in CONLL-U reader for document and paragraph IDs
  #1367 - Support TCF orthography via SofaChangeAnnotations
  #1041 - Add parameter to enable lower-cased lookup of first word in sentence in SfstAnnotator
  #1327 - Update LIF support
  #1366 - Added support in CONLL-U reader for document and paragraph IDs
  #1367 - Support TCF orthography via SofaChangeAnnotations
  Forgot to commit the list declaration
  Warn if CONLL-U file contains multiple documents
  Added support in CONLL-U reader for document and paragraph IDs
  #186 - Change artifactId to "dkpro-core-XXX"
  #1299 - Update to CoreNLP 3.9.2
  #1337 - Connl2012 writer uses WordSense, but does not declare it
  #1299 - Update to CoreNLP 3.9.2
  Added parameter to enable lower-cased lookup of first word in sentence.
Showing 41 changed files with 3,397 additions and 812 deletions.
79 changes: 79 additions & 0 deletions
dkpro-core-api-segmentation-asl/src/main/resources/desc/type/LexicalUnits_customized.xml
@@ -1,72 +1,151 @@
<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Segmentation</name>
  <description/>
  <version>${version}</version>
  <vendor>Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt</vendor>
  <imports>
    <import name="desc.type.LexicalUnits"/>
  </imports>
  <types>
    <typeDescription>
      <name>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Compound</name>
      <description>This type represents a compound word that can be decomposed, e.g. "flowerpot". Each Compound has at least two Splits.</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>splits</name>
          <description>A word that can be decomposed into different parts.</description>
          <rangeTypeName>uima.cas.FSArray</rangeTypeName>
          <elementType>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Split</elementType>
        </featureDescription>
      </features>
    </typeDescription>
    <typeDescription>
      <name>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token</name>
      <description><p>Token is one of the two types commonly produced by a segmenter (the other being Sentence). A Token usually represents a word, although it may be used to represent multiple tightly connected words (e.g. "New York") or parts of a word (e.g. the possessive "'s"). One may choose to split compound words into multiple tokens, e.g. ("CamelCase" -&gt; "Camel", "Case"; "Zauberstab" -&gt; "Zauber", "stab"). Most processing components operate on Tokens, usually within the limits of the surrounding Sentence. E.g. a part-of-speech tagger analyses each Token in a Sentence and assigns a part-of-speech to each Token.</p></description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>parent</name>
          <description>The parent of this token. This feature is meant to be used when the token participates in a constituency parse and then refers to a constituent containing this token. The type of this feature is {@link Annotation} to avoid adding a dependency on the syntax API module.</description>
          <rangeTypeName>uima.tcas.Annotation</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>lemma</name>
          <description/>
          <rangeTypeName>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>stem</name>
          <description/>
          <rangeTypeName>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Stem</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>pos</name>
          <description/>
          <rangeTypeName>de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>morph</name>
          <description>The morphological feature associated with this token.</description>
          <rangeTypeName>de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.morph.MorphologicalFeatures</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>id</name>
          <description>If this unit had an ID in the source format from which it was imported, it may be stored here. IDs are typically not assigned by DKPro Core components. If an ID is present, it should be respected by writers.</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>form</name>
          <description>Potentially normalized form of the token text that should be used instead of the covered text if set.</description>
          <rangeTypeName>de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.TokenForm</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>syntacticFunction</name>
          <description/>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>order</name>
          <description>Disambiguates the token order for tokens which have the same offsets, e.g. when the contraction "à" is analyzed as two tokens "a" and "a".</description>
          <rangeTypeName>uima.cas.Integer</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
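
Note on the new `order` feature: the description says it disambiguates tokens that share the same offsets. Below is a minimal sketch of how a consumer might use it, assuming a plain Python stand-in rather than the real UIMA/DKPro Core classes. The `Token` dataclass, its `text` field, the default `order` of 0, the `in_reading_order` helper, and the sample sentence are all hypothetical and only serve to illustrate the tie-breaking idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for the Token annotation (not the UIMA class)."""
    begin: int
    end: int
    text: str
    order: int = 0  # assumption: 0 when the feature is not set


def in_reading_order(tokens):
    """Sort by offsets first, then use `order` to break ties on identical spans."""
    return sorted(tokens, key=lambda t: (t.begin, t.end, t.order))


# Illustrative input: in "Vou à praia" the contraction "à" (offsets 4-5)
# is analyzed as two tokens "a" and "a" that share the same span.
tokens = [
    Token(6, 11, "praia"),
    Token(4, 5, "a", order=2),
    Token(0, 3, "Vou"),
    Token(4, 5, "a", order=1),
]

ordered = in_reading_order(tokens)
print([(t.text, t.order) for t in ordered])
```

Without the `order` key, the two "a" tokens would compare as equal on offsets alone and their relative position would depend on input order, which is presumably the ambiguity the feature is meant to remove.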