A Taxonomy of Processors

Keith Alcock edited this page Apr 5, 2021 · 7 revisions

The hierarchy of Processor classes is described here, including those from other related projects. The page is organized as a class hierarchy, but it also notes which processors contain other processors and forward some calls to them as delegates.

  • Processor (trait, object org.clulab.processors) - Trait for all processor implementations. The key method is annotate, which contains the entire annotation functionality of a given processor class.
    • ShallowNLPProcessor (class, object org.clulab.processors.shallownlp) - performs only shallow analysis, which includes tokenization, POS tagging, and NER. Note that this class uses our own tokenizer, and POS tagger and NER from Stanford's CoreNLP.
      • CoreNLPProcessor (class, object org.clulab.processors.corenlp) - this is a wrapper for the entire Stanford CoreNLP pipeline, which contains their constituent parser and coreference resolution (on top of what ShallowNLPProcessor does). Use this class if you need the classic CoreNLP behavior. If you'd like to use their dependency parser, use FastNLPProcessor instead.
        • BioNLPProcessor (class org.clulab.processors.bionlp) - customizes CoreNLPProcessor for biomedical texts. This includes a new tokenizer that is better suited for biomedical texts, as well as a biomedical NER. This class resides in the reach project, not processors.
      • FastNLPProcessor (class, object org.clulab.processors.fastnlp) - Almost the same as CoreNLPProcessor, but uses Stanford's dependency parser instead of their constituent parser. Because of this, the annotate method in this class tends to be faster than the one in CoreNLPProcessor. Use this class if you need dependency trees rather than constituent trees.
        • FastNLPProcessorWithSemanticRoles (class org.clulab.processors.fastnlp) - adds semantic roles from CluProcessor on top of all the functionality in FastNLPProcessor.
          • EidosEnglishProcessor (class org.clulab.wm.eidos) - adds the traits of EidosProcessor to the superclass, adapting it to work for the eidos project where the class resides.
          • EidosCluProcessor (class org.clulab.wm.eidos) - also adds the traits of EidosProcessor to the superclass. That superclass is currently the same as the superclass of EidosEnglishProcessor, which makes this class redundant. However, as the name suggests, the superclass has in the past been CluCoreProcessor, and this class remains in case the superclass needs to be changed back without affecting EidosEnglishProcessor.
        • FastBioNLPProcessor (class org.clulab.processors.bionlp) - customizes FastNLPProcessor for biomedical texts. This includes a new tokenizer that is better suited for biomedical texts, as well as a biomedical NER. This class resides in the reach project, not processors.
    • CluProcessor (class, object org.clulab.processors.clu) - uses tools developed in house, all released under the Apache license. This includes a tokenizer, POS tagger, NER, dependency parser, and semantic role labeling (SRL).
      • CluCoreProcessor (class, object org.clulab.processors.clucore) - adds Stanford's NumericEntityRecognizer, which recognizes numeric entities such as dates, times, and money, to the functionality of CluProcessor.
      • SpanishCluProcessor (class org.clulab.processors.clu) - CluProcessor for Spanish
        • EidosSpanishProcessor (class org.clulab.wm.eidos) - adds EidosProcessor trait to superclass
      • PortugueseCluProcessor (class org.clulab.processors.clu) - CluProcessor for Portuguese
        • EidosPortugueseProcessor (class org.clulab.wm.eidos) - adds EidosProcessor trait to superclass
    • EidosProcessor (trait org.clulab.wm.eidoscommon) - describes processors used in the eidos project. These are processors which include the traits SentencesExtractor, LanguageSpecific, Tokenizing, and EidosTokenizing. SentencesExtractor includes two minimal methods for extracting documents and sentences that can be called externally without the caller needing to know about the entire EidosSystem class. LanguageSpecific means that the processor applies to a particular language and has a tag set to match. Tokenizing means that it can supply access to its tokenizer, and EidosTokenizing means that it has an EidosTokenizer. This special tokenizer provides important paragraph splitting functionality, among other things.
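
The two relationships in the list above, inheritance and containment with forwarding, can be illustrated with a small self-contained sketch. It is a toy model only: every Sketch* name is hypothetical, annotate returns annotation layer names instead of a real Document, and the srl method is an assumed stand-in for whatever call the real classes forward to CluProcessor. It does not reproduce the actual code in processors or eidos.

```scala
// Toy model only: all Sketch* names are hypothetical stand-ins, not the real
// classes from the processors or eidos projects.
trait SketchProcessor {
  // Stand-in for annotate: returns the names of the annotation layers produced.
  def annotate(text: String): Seq[String]
}

// Plays the role of a shallow processor: tokens, POS tags, and named entities.
class SketchShallow extends SketchProcessor {
  def annotate(text: String): Seq[String] = Seq("tokens", "pos", "ner")
}

// Subclass that adds dependency parsing on top of the shallow layers.
class SketchFast extends SketchShallow {
  override def annotate(text: String): Seq[String] =
    super.annotate(text) :+ "dependencies"
}

// Independent branch of the hierarchy that can also supply semantic roles.
class SketchClu extends SketchProcessor {
  def annotate(text: String): Seq[String] =
    Seq("tokens", "pos", "ner", "dependencies", "srl")
  // Assumed helper: the part of the work another processor can borrow.
  def srl(text: String): Seq[String] = Seq("srl")
}

// Inherits from SketchFast but contains a SketchClu and forwards the
// semantic role call to that delegate.
class SketchFastWithSrl(delegate: SketchClu = new SketchClu())
    extends SketchFast {
  override def annotate(text: String): Seq[String] =
    super.annotate(text) ++ delegate.srl(text)
}

// Stand-in for a project-specific trait that is mixed into a processor.
trait SketchProjectTrait { self: SketchProcessor =>
  def language: String
}

// Project-specific processor: superclass behavior plus the mixed-in trait.
class SketchEnglish extends SketchFastWithSrl() with SketchProjectTrait {
  val language: String = "english"
}

object TaxonomySketch {
  def main(args: Array[String]): Unit = {
    val processor: SketchProcessor = new SketchEnglish()
    // prints "tokens, pos, ner, dependencies, srl"
    println(processor.annotate("Some text.").mkString(", "))
  }
}
```

Callers hold only a SketchProcessor reference, so swapping one branch of the hierarchy for another does not change the calling code. That is the same reason a class such as EidosCluProcessor can change its superclass without affecting its users.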