1. [`Language`](#language)
1. [`Word`](#word)
1. [`Doc`](#doc)
1. [`Process`](#process)
1. [`Pipeline`](#pipeline)
1. [`MorphosyntacticFeature`](#morpho)
1. [`MorphosyntacticFeatureBundle`](#morpho-bundle)
1. [`Form`](#form)
1. [`DependencyTree`](#dt)

# `Language` <a name="language"></a>

`Language` objects identify each language and keep track of its attributes (latitude and longitude, three-letter ISO 639-3 code, etc.). Users do not normally need to create these themselves, though the type could be extended to distinguish dialects.

In [1]:
from cltk.core.data_types import Language

In [2]:
Language(
    name="Classical Mongolian",
    glottolog_id="",
    latitude=0.0,
    longitude=0.0,
    dates=[],
    family_id="",
    parent_id="",
    level="",
    iso_639_3_code="cmg",
    type="h",
)

Language(name='Classical Mongolian', glottolog_id='', latitude=0.0, longitude=0.0, dates=[], family_id='', parent_id='', level='', iso_639_3_code='cmg', type='h')
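
As a rough illustration of the dialect extension mentioned above, the sketch below subclasses a simplified stand-in for `Language`. The stand-in carries only a few of the fields shown above, and the `Dialect` class with its `parent_iso_code` and `region` fields is hypothetical, not part of CLTK.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Language:
    """Simplified stand-in with a subset of the fields shown above (not the CLTK class)."""

    name: str
    glottolog_id: str
    iso_639_3_code: str
    dates: List[int] = field(default_factory=list)


@dataclass
class Dialect(Language):
    """Hypothetical subtype: a variety tied to its parent language."""

    parent_iso_code: str = ""
    region: str = ""


attic = Dialect(
    name="Attic Greek",
    glottolog_id="",  # left empty, as in the Classical Mongolian example
    iso_639_3_code="grc",
    parent_iso_code="grc",
    region="Attica",
)

# A Dialect can be used wherever a Language is expected.
print(isinstance(attic, Language))
print(attic.region)
```

Subclassing keeps every existing attribute available, so code written against `Language` would keep working with the more specific type.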

In [3]:
from cltk.languages.glottolog import LANGUAGES

In [4]:
print(LANGUAGES["gmh"])

Language(name='Middle High German', glottolog_id='midd1343', latitude=0.0, longitude=0.0, dates=[], family_id='indo1319', parent_id='midd1349', level='language', iso_639_3_code='gmh', type='h')


The following lists all pre-modern languages identified by the [Glottolog project](http://glottolog.org).

In [5]:
from pprint import pprint
pprint([(l.name, l.iso_639_3_code) for _, l in LANGUAGES.items()])

[('Aequian', 'xae'),
 ('Aghwan', 'xag'),
 ('Akkadian', 'akk'),
 ('Alanic', 'xln'),
 ('Ancient Greek', 'grc'),
 ('Ancient Hebrew', 'hbo'),
 ('Ancient Ligurian', 'xlg'),
 ('Ancient Macedonian', 'xmk'),
 ('Ancient North Arabian', 'xna'),
 ('Ancient Zapotec', 'xzp'),
 ('Andalusian Arabic', 'xaa'),
 ('Anglo-Norman', 'xno'),
 ('Aquitanian', 'xaq'),
 ('Ardhamāgadhī Prākrit', 'pka'),
 ('Armazic', 'xrm'),
 ('Avestan', 'ave'),
 ('Bactrian', 'xbc'),
 ('Bengali', 'ben'),
 ('Bolgarian', 'xbo'),
 ('Burma Pyu', 'pyx'),
 ('Camunic', 'xcc'),
 ('Carian', 'xcr'),
 ('Celtiberian', 'xce'),
 ('Church Slavic', 'chu'),
 ('Cisalpine Gaulish', 'xcg'),
 ('Classical Armenian', 'xcl'),
 ('Classical Mandaic', 'myz'),
 ('Classical Mongolian', 'cmg'),
 ('Classical Nahuatl', 'nci'),
 ('Classical Newari', 'nwc'),
 ('Classical Quechua', 'qwc'),
 ('Classical Syriac', 'syc'),
 ('Classical Tibetan', 'xct'),
 ('Coptic', 'cop'),
 ('Cumbric', 'xcb'),
 ('Cuneiform Luwian', 'xlu'),
 ('Curonian', 'xcu'),
 ('Dacian', 'xdc'),
 ('E

# `Word` <a name="word"></a>

`Word` is one of the most important objects. It contains all token-level information generated by a `Process`.

In [6]:
from cltk.core.data_types import Word

In [7]:
Word(index_char_start=0, index_char_stop=6, index_token=0, string="Gallia", pos="nom")

Word(index_char_start=0, index_char_stop=6, index_token=0, index_sentence=None, string='Gallia', pos='nom', lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

# `Doc` <a name="doc"></a>

`Doc` is the other of the two most important types. It too is built up each time a `Process` runs.

Token-level information is stored at `Doc.words`, while larger units of information (e.g., the original input string) are kept in other attributes of the `Doc`.

In [8]:
from cltk.core.data_types import Doc

In [9]:
Doc(raw="Gallia est omnis divisa in partes tres")

Doc(language=None, words=None, pipeline=None, raw='Gallia est omnis divisa in partes tres', normalized_text=None)

It also contains a number of helper methods, which (usually) read from `.words` and return convenient data structures.

In [10]:
from cltk import NLP
cltk_nlp = NLP(language="lat")
cltk_doc = cltk_nlp.analyze(text="Gallia est omnis divisa in partes tres")

‎𐤀 CLTK version '1.0.11'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinNERProcess`, `LatinLexiconProcess`.


In [11]:
cltk_doc.tokens  # List[str]

['Gallia', 'est', 'omnis', 'divisa', 'in', 'partes', 'tres']
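
To make the idea of a helper that reads from `.words` concrete, here is a minimal sketch using stand-in `Word` and `Doc` dataclasses. These are simplified assumptions for illustration, not CLTK's own implementation, and `tokens_without_stops` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Stand-in for a token-level record (only two fields kept)."""

    string: str
    stop: Optional[bool] = None


@dataclass
class Doc:
    """Stand-in document: raw text plus a flat list of words."""

    raw: str
    words: Optional[List[Word]] = None

    @property
    def tokens(self) -> List[str]:
        """Surface string of every word, in order."""
        return [word.string for word in self.words or []]

    @property
    def tokens_without_stops(self) -> List[str]:
        """Hypothetical helper: surface strings of words not marked as stops."""
        return [word.string for word in self.words or [] if not word.stop]


raw = "Gallia est omnis"
words = [Word(string=token) for token in raw.split()]
words[1].stop = True  # pretend a stopword step flagged 'est'

doc = Doc(raw=raw, words=words)
print(doc.tokens)
print(doc.tokens_without_stops)
```

Because each helper is computed from `.words` on access, it reflects whatever annotations earlier steps have already written to the `Word` objects.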

# `Process` <a name="process"></a>

A `Process` is a Python `class` that wraps a particular algorithm type for a particular language (e.g., Sanskrit tokenization). It is designed to be invoked by the `Pipeline`, though a user may call it directly, too.

See notebook [Make custom Process and add to Pipeline](https://github.com/cltk/cltk/blob/dev/notebooks/Make%20custom%20Process%20and%20add%20to%20Pipeline.ipynb) for a demonstration.

In [12]:
from copy import deepcopy
from dataclasses import dataclass, field
from typing import List, Type
from boltons.cacheutils import cachedproperty
from cltk.core.data_types import Process
from cltk.stops.words import Stops

In [13]:
@dataclass
class StopsProcess(Process):
    """

    >>> from cltk.core.data_types import Doc, Word
    >>> from cltk.stops.processes import StopsProcess
    >>> from cltk.languages.example_texts import get_example_text
    >>> lang = "lat"
    >>> words = [Word(string=token) for token in split_punct_ws(get_example_text(lang))]
    >>> stops_process = StopsProcess(language=lang)
    >>> output_doc = stops_process.run(Doc(raw=get_example_text(lang), words=words))
    >>> output_doc.words[1].string
    'est'
    >>> output_doc.words[1].stop
    True
    """

    @cachedproperty
    def algorithm(self):
        return Stops(iso_code=self.language).get_stopwords()

    def run(self, input_doc: Doc) -> Doc:
        """Note this marks a word a stop if there is a match on
        either the inflected form (``Word.string``) or the
        lemma (``Word.lemma``).
        """
        output_doc = deepcopy(input_doc)
        stops_list = self.algorithm

        for index, word_obj in enumerate(output_doc.words):
            if (word_obj.string in stops_list) or (word_obj.lemma in stops_list):
                word_obj.stop = True
            else:
                word_obj.stop = False
            output_doc.words[index] = word_obj

        return output_doc

# `Pipeline` <a name="pipeline"></a>

A `Pipeline` holds an ordered list of `Process` classes; `NLP()` invokes them in that order.

See notebook [Make custom Process and add to Pipeline](https://github.com/cltk/cltk/blob/dev/notebooks/Make%20custom%20Process%20and%20add%20to%20Pipeline.ipynb) for a demonstration.

In [14]:
from cltk.core.data_types import Pipeline
from cltk.languages.utils import get_lang
from cltk.tokenizers.processes import AkkadianTokenizationProcess

In [15]:
@dataclass
class AkkadianPipeline(Pipeline):
    """Default ``Pipeline`` for Akkadian.

    >>> from cltk.languages.pipelines import AkkadianPipeline
    >>> a_pipeline = AkkadianPipeline()
    >>> a_pipeline.description
    'Pipeline for the Akkadian language.'
    >>> a_pipeline.language
    Language(name='Akkadian', glottolog_id='akka1240', latitude=33.1, longitude=44.1, dates=[], family_id='afro1255', parent_id='east2678', level='language', iso_639_3_code='akk', type='a')
    >>> a_pipeline.language.name
    'Akkadian'
    >>> a_pipeline.processes[0]
    <class 'cltk.tokenizers.processes.AkkadianTokenizationProcess'>
    """

    description: str = "Pipeline for the Akkadian language."
    language: Language = get_lang("akk")
    processes: List[Type[Process]] = field(
        default_factory=lambda: [AkkadianTokenizationProcess, StopsProcess]
    )
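
The ordering matters because each `Process` receives the `Doc` produced by the one before it. The sketch below shows one way such a runner could work. Every class here is a simplified stand-in, and the `analyze` function is an assumption about the general pattern rather than the actual `NLP()` code.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import List, Optional, Type


@dataclass
class Word:
    string: str
    stop: Optional[bool] = None


@dataclass
class Doc:
    raw: str
    words: List[Word] = field(default_factory=list)


class Process:
    """Stand-in base class: subclasses implement ``run``."""

    def run(self, input_doc: Doc) -> Doc:
        raise NotImplementedError


class WhitespaceTokenizationProcess(Process):
    """Toy tokenizer: split the raw text on whitespace."""

    def run(self, input_doc: Doc) -> Doc:
        output_doc = deepcopy(input_doc)
        output_doc.words = [Word(string=token) for token in output_doc.raw.split()]
        return output_doc


class ToyStopsProcess(Process):
    """Toy stopword marker with a hard-coded, illustrative list."""

    stops = {"est", "in"}

    def run(self, input_doc: Doc) -> Doc:
        output_doc = deepcopy(input_doc)
        for word in output_doc.words:
            word.stop = word.string in self.stops
        return output_doc


@dataclass
class Pipeline:
    processes: List[Type[Process]] = field(default_factory=list)


def analyze(text: str, pipeline: Pipeline) -> Doc:
    """Instantiate each process class in order and feed it the previous output."""
    doc = Doc(raw=text)
    for process_class in pipeline.processes:
        doc = process_class().run(doc)
    return doc


toy_pipeline = Pipeline(processes=[WhitespaceTokenizationProcess, ToyStopsProcess])
doc = analyze("Gallia est omnis divisa in partes tres", toy_pipeline)
print([(word.string, word.stop) for word in doc.words])
```

Swapping the two entries in `processes` would leave `words` unmarked, since the stopword step would run before any tokens exist.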

# `MorphosyntacticFeature` <a name="morpho"></a>

This model inherits from the Python standard library's `IntEnum`. A `MorphosyntacticFeature` (at `cltk/morphology/universal_dependencies_features.py`) has been made for each [morphosyntactic tag defined by v2 of the Universal Dependencies project](https://universaldependencies.org/u/feat/all.html).

In [16]:
from enum import auto
from cltk.morphology.universal_dependencies_features import Case, Gender, MorphosyntacticFeature, Number

In [17]:
# 'hercule' ('by Hercules, assuredly, indeed')
# http://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0060:entry=hercule
gender = Gender.masculine
print("Gender:", gender)
case = Case.vocative
print("Case:", case)
number = Number.singular
print("Number", number)

Gender: masculine
Case: vocative
Number singular


In [18]:
type(case)

Case

In [19]:
print("`MorphosyntacticFeature.name`:", case.name)
print("`MorphosyntacticFeature.value`", case.value)

`MorphosyntacticFeature.name`: vocative
`MorphosyntacticFeature.value` 17


In [20]:
# Example declaration
class Case(MorphosyntacticFeature):
    """The case of a noun phrase.
    see https://universaldependencies.org/u/feat/Case.html
    """

    # structural cases
    nominative = auto()
    accusative = auto()
    ergative = auto()
    absolutive = auto()

    # oblique cases
    abessive = auto()
    befefactive = auto()
    causative = auto()
    comparative = auto()
    considerative = auto()
    comitative = auto()
    dative = auto()
    distributive = auto()
    equative = auto()
    genitive = auto()
    instrumental = auto()
    partitive = auto()
    vocative = auto()

    # spatiotemporal cases
    ablative = auto()
    additive = auto()
    adessive = auto()
    allative = auto()
    delative = auto()
    elative = auto()
    essive = auto()
    illative = auto()
    inessive = auto()
    lative = auto()
    locative = auto()
    perlative = auto()
    sublative = auto()
    superessive = auto()
    terminative = auto()
    temporal = auto()
    translative = auto()
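
The numbering seen in `.value` follows from how `auto()` behaves in an `IntEnum`: members are numbered from 1 in declaration order. The sketch below demonstrates this with a small stand-in hierarchy. The `Feature` base class and its `__str__` override are assumptions, inferred from the bare names printed above.

```python
from enum import IntEnum, auto


class Feature(IntEnum):
    """Hypothetical base class: members print as their bare name."""

    def __str__(self) -> str:
        return self.name


class Number(Feature):
    """Stand-in feature with three members, numbered by auto()."""

    singular = auto()
    dual = auto()
    plural = auto()


print(Number.plural, Number.plural.value)  # plural 3

# Members can be looked up by name or by integer value.
print(Number["dual"] is Number.dual)  # True
print(Number(1) is Number.singular)  # True
```

One consequence of relying on `auto()` is that inserting a new member in the middle of a declaration shifts the values of every member after it, so stored integer values should not be treated as stable identifiers.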

In [21]:
# Users can learn a bit about these features, too
#help(case)

Note the distinction between a word's POS and its features. POS tags name broad word classes, such as noun or verb, while features describe finer-grained properties like case, gender, and number.

In [22]:
from cltk.morphology.universal_dependencies_features import POS

In [23]:
# so the Latin word 'hercule' would be
pos = POS.interjection
print(pos)

interjection


# `MorphosyntacticFeatureBundle` <a name="morpho-bundle"></a>

`MorphosyntacticFeature`s are brought together to describe a word with `MorphosyntacticFeatureBundle`.

In [24]:
from cltk.morphology.morphosyntax import MorphosyntacticFeatureBundle

In [25]:
bundle = MorphosyntacticFeatureBundle(case, gender, number)

In [26]:
print(type(bundle))

<class 'cltk.morphology.morphosyntax.MorphosyntacticFeatureBundle'>


In [27]:
print(bundle)

{Case: [vocative], Gender: [masculine], Number: [singular]}


In [28]:
# How to access features within a bundle
print("Gender", bundle["Gender"])
print("Case", bundle["Case"])
print("Number", bundle["Number"])

Gender [masculine]
Case [vocative]
Number [singular]
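
The lookup-by-name behaviour above can be pictured as a mapping from each feature type to a list of its values. The following is a minimal sketch of that idea with stand-in enums. The `FeatureBundle` class is hypothetical and is not the CLTK implementation.

```python
from enum import IntEnum, auto
from typing import Dict, List, Type, Union


class Feature(IntEnum):
    """Hypothetical base class for the stand-in features below."""

    def __str__(self) -> str:
        return self.name


class Case(Feature):
    nominative = auto()
    vocative = auto()


class Gender(Feature):
    masculine = auto()
    feminine = auto()


class FeatureBundle:
    """Group feature values by their type; allow lookup by type or type name."""

    def __init__(self, *features: Feature) -> None:
        self.features: Dict[Type[Feature], List[Feature]] = {}
        for feature in features:
            self.features.setdefault(type(feature), []).append(feature)

    def __getitem__(self, key: Union[str, Type[Feature]]) -> List[Feature]:
        if isinstance(key, str):
            # Match on the class name, e.g. "Case"
            for feature_type, values in self.features.items():
                if feature_type.__name__ == key:
                    return values
            raise KeyError(key)
        return self.features[key]


bundle = FeatureBundle(Case.vocative, Gender.masculine)
print([str(value) for value in bundle["Case"]])
print([str(value) for value in bundle[Gender]])
```

Storing a list per feature type leaves room for ambiguous analyses, where a word form is compatible with more than one value of the same feature.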


`POS` and `MorphosyntacticFeatureBundle` are intended to be placed within a `Word` instance.

In [29]:
Word(string="hercule", pos=pos, features=bundle)

Word(index_char_start=None, index_char_stop=None, index_token=None, index_sentence=None, string='hercule', pos=interjection, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={Case: [vocative], Gender: [masculine], Number: [singular]}, category={}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)

# `Form` <a name="form"></a>

Since `Doc.words` is a flat list of `Word` objects, it can be difficult to model hierarchical data, such as dependency trees. To help with such modeling, the `Form` and `DependencyTree` types were created.

`Form` inherits from `Element` in the standard library's `xml` package, and `DependencyTree` inherits from `ElementTree`.

See [Modeling syntax with Form and DependencyTree](https://github.com/cltk/cltk/blob/dev/notebooks/CLTK%20Demonstration.ipynb) for a full example.

In [30]:
from cltk.dependency.tree import Form

# `DependencyTree` <a name="dt"></a>

See [Modeling syntax with Form and DependencyTree](https://github.com/cltk/cltk/blob/dev/notebooks/CLTK%20Demonstration.ipynb) for a full example.

In [31]:
from cltk.dependency.tree import DependencyTree
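
Since `Form` and `DependencyTree` build on `Element` and `ElementTree`, the underlying idea can be sketched with the standard library alone. The example below uses plain `xml.etree.ElementTree` objects rather than the CLTK types, and the dependency relations are hand-assigned for illustration, not parser output.

```python
import xml.etree.ElementTree as ET

# The head word is the root element; its dependents are child elements.
root = ET.Element("divisa", {"relation": "root"})
subject = ET.SubElement(root, "Gallia", {"relation": "nsubj"})
ET.SubElement(root, "est", {"relation": "aux"})

tree = ET.ElementTree(root)

# Walk the tree in document order, printing each word and its relation.
for element in tree.iter():
    print(element.tag, element.get("relation"))

# ElementTree has no parent pointers, so build a child-to-parent map
# when a word's governor is needed.
parents = {child: parent for parent in tree.iter() for child in parent}
print(parents[subject].tag)
```

Nesting elements this way captures the head-dependent structure that a flat list of words cannot express directly.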