This notebook shows how to wrap a function in a `Process` and then call it from a `Pipeline`.

# Make a new `Process`

To understand how a `Process` works, we will create a new one here. It wraps a transliteration function for one particular language, Old English.

In [1]:
from cltk.core.data_types import Process

In [2]:
# This CLTK class transliterates the Anglo-Saxon runic alphabet into the Latin alphabet
from cltk.phonology.ang.transliteration import Transliterate

In [3]:
oe_runes = "ᚩᚠᛏ ᛋᚳᚣᛚᛞ ᛋᚳᛖᚠᛁᛝ ᛋᚳᛠᚦᛖᚾᚪ ᚦᚱᛠᛏᚢᛗ"  # type str
oe_latin = Transliterate().transliterate(text=oe_runes, mode="Latin")  # type str
print(oe_latin)

oft scyld scefin sceathena threatum


In [4]:
from dataclasses import dataclass
from copy import deepcopy
from boltons.cacheutils import cachedproperty
from cltk.core.exceptions import CLTKException
from cltk.core.data_types import Doc, Word

In [5]:
@dataclass
class OldEnglishTransliterationProcess(Process):
    """A simple ``Process`` for transliteration of 
    Old English in the runic alphabet.
    """
    
    language: str = None

    @cachedproperty
    def algorithm(self):
        """Return the transliteration function that ``run`` applies to each word."""
        return Transliterate().transliterate

    def run(self, input_doc: Doc) -> Doc:
        output_doc = deepcopy(input_doc)
        
        for index, word_obj in enumerate(output_doc.words):
            oe_latin = self.algorithm(text=word_obj.string, mode="Latin")  # type str
            word_obj.phonetic_transcription = oe_latin
            output_doc.words[index] = word_obj

        return output_doc

oe_words = [Word(string=w) for w in oe_runes.split()]
cltk_doc_oe = Doc(words=oe_words)
translit_proc = OldEnglishTransliterationProcess
cltk_doc_oe = translit_proc().run(input_doc=cltk_doc_oe)

In [6]:
# now you can see that a value has been added to Word.phonetic_transcription
print(cltk_doc_oe.words[0])

Word(index_char_start=None, index_char_stop=None, index_token=None, index_sentence=None, string='ᚩᚠᛏ', pos=None, lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=None, named_entity=None, syllables=None, phonetic_transcription='oft', definition=None)


In [7]:
print([(w.string, w.phonetic_transcription) for w in cltk_doc_oe.words])

[('ᚩᚠᛏ', 'oft'), ('ᛋᚳᚣᛚᛞ', 'scyld'), ('ᛋᚳᛖᚠᛁᛝ', 'scefin'), ('ᛋᚳᛠᚦᛖᚾᚪ', 'sceathena'), ('ᚦᚱᛠᛏᚢᛗ', 'threatum')]


Note that most `Process`es in the CLTK library are more complex than this because they use inheritance, which helps the project scale. For instance:

`Process` <--- `StemmingProcess` <--- {`LatinStemmingProcess`, `MiddleEnglishStemmingProcess`, `MiddleHighGermanStemmingProcess`, `OldFrenchStemmingProcess`}

In these cases, the separation of `algorithm` from `run` allows for different functions to be called for each language.
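
The following is a minimal sketch of that inheritance pattern, not the CLTK's own code. To run without the CLTK installed it uses hypothetical stand-ins for `Word`, `Doc`, and the parent `Process`, and the two "languages" are toy algorithms (uppercasing and reversing) chosen only to show that a shared `run` can call a different `algorithm` in each subclass.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Callable, List, Optional


# Hypothetical stand-ins for the CLTK's Word and Doc data types.
@dataclass
class Word:
    string: str
    phonetic_transcription: Optional[str] = None


@dataclass
class Doc:
    words: List[Word] = field(default_factory=list)


@dataclass
class TransliterationProcess:
    """Generic parent: ``run`` is shared, ``algorithm`` is left to subclasses."""

    language: Optional[str] = None

    @property
    def algorithm(self) -> Callable[[str], str]:
        raise NotImplementedError("Subclasses choose the function to call.")

    def run(self, input_doc: Doc) -> Doc:
        # Work on a copy so the caller's Doc is left untouched.
        output_doc = deepcopy(input_doc)
        for word in output_doc.words:
            word.phonetic_transcription = self.algorithm(word.string)
        return output_doc


@dataclass
class UppercaseTransliterationProcess(TransliterationProcess):
    """Toy subclass: the 'transliteration' is just uppercasing."""

    language: Optional[str] = "toy-upper"  # assumed label, not a real ISO code

    @property
    def algorithm(self) -> Callable[[str], str]:
        return str.upper


@dataclass
class ReversingTransliterationProcess(TransliterationProcess):
    """Toy subclass: the 'transliteration' reverses each word."""

    language: Optional[str] = "toy-reverse"  # assumed label, not a real ISO code

    @property
    def algorithm(self) -> Callable[[str], str]:
        return lambda text: text[::-1]


doc = Doc(words=[Word(string="oft"), Word(string="scyld")])
upper_doc = UppercaseTransliterationProcess().run(doc)
reversed_doc = ReversingTransliterationProcess().run(doc)
print([w.phonetic_transcription for w in upper_doc.words])     # ['OFT', 'SCYLD']
print([w.phonetic_transcription for w in reversed_doc.words])  # ['tfo', 'dlycs']
```

Neither subclass redefines `run`; each only swaps the function returned by `algorithm`, which is why adding a language in this style takes just a few lines.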

# Add a `Process` to a `Pipeline`

Each `Process` takes a `Doc`, adds information to it and its `Word`s, and then returns the `Doc`. A `Process` like our new `OldEnglishTransliterationProcess` might belong at the end, so in the following we append it to the end of the `Pipeline`.
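
To make the order of execution concrete, here is a rough sketch of how a list of `Process` classes could be applied to a `Doc` one after another. It assumes the pipeline simply instantiates each class in turn and passes the `Doc` along; the `analyze` function, the toy processes, and the stand-in `Word` and `Doc` classes are all hypothetical and are not the CLTK's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for the CLTK's Word and Doc data types.
@dataclass
class Word:
    string: str
    phonetic_transcription: Optional[str] = None


@dataclass
class Doc:
    raw: str = ""
    words: List[Word] = field(default_factory=list)


class ToyTokenizationProcess:
    """Toy process: split the raw text on whitespace into Word objects."""

    def run(self, input_doc: Doc) -> Doc:
        input_doc.words = [Word(string=token) for token in input_doc.raw.split()]
        return input_doc


class ToyTransliterationProcess:
    """Toy process: 'transliterate' each existing Word by lowercasing it."""

    def run(self, input_doc: Doc) -> Doc:
        for word in input_doc.words:
            word.phonetic_transcription = word.string.lower()
        return input_doc


def analyze(text: str, processes: list) -> Doc:
    """Instantiate each process class in order and hand the Doc to the next one."""
    doc = Doc(raw=text)
    for process_class in processes:
        doc = process_class().run(input_doc=doc)
    return doc


processes = [ToyTokenizationProcess]
processes.append(ToyTransliterationProcess)  # the new process runs last
doc = analyze("Oft Scyld Scefing", processes)
print([(w.string, w.phonetic_transcription) for w in doc.words])
# [('Oft', 'oft'), ('Scyld', 'scyld'), ('Scefing', 'scefing')]

# With the order reversed, transliteration runs before any Word exists.
misordered = analyze("Oft Scyld", [ToyTransliterationProcess, ToyTokenizationProcess])
print([w.phonetic_transcription for w in misordered.words])  # [None, None]
```

The second call shows why position matters in this sketch: a process that reads `Doc.words` has nothing to work on until a tokenizer has filled that list.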

In [8]:
from cltk import NLP

In [9]:
# Load the Old English NLP class
cltk_nlp = NLP(language="ang")

𐤀 CLTK version '1.1.5'.
Pipeline for language 'Old English (ca. 450-1100)' (ISO: 'ang'): `MultilingualTokenizationProcess`, `OldEnglishLemmatizationProcess`, `OldEnglishEmbeddingsProcess`, `StopsProcess`.


In [10]:
# Inspect the Pipeline, which is contained in NLP
from pprint import pprint
pprint(cltk_nlp.pipeline.processes)

[<class 'cltk.tokenizers.processes.MultilingualTokenizationProcess'>,
 <class 'cltk.lemmatize.processes.OldEnglishLemmatizationProcess'>,
 <class 'cltk.embeddings.processes.OldEnglishEmbeddingsProcess'>,
 <class 'cltk.stops.processes.StopsProcess'>]


In [11]:
# Add the new custom Process to the end
cltk_nlp.pipeline.processes.append(OldEnglishTransliterationProcess)

In [12]:
# Now run the pipeline and see the results written to Word.phonetic_transcription
cltk_doc = cltk_nlp.analyze(text=oe_runes)
print(cltk_doc.words[0])

This part of the CLTK depends upon models from the CLTK project.
Do you want to download 'https://github.com/cltk/ang_models_cltk' to '~/cltk_data/ang'? [Y/n] 
Y
CLTK message: This part of the CLTK depends upon word embedding models from the Fasttext project.
Do you want to download file 'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ang.vec' to '/Users/kylejohnson/cltk_data/ang/embeddings/fasttext/wiki.ang.vec'? [Y/n] 
Y


100%|█████████████████████████████████████| 34.0M/34.0M [00:02<00:00, 13.7MiB/s]


Word(index_char_start=0, index_char_stop=3, index_token=0, index_sentence=None, string='ᚩᚠᛏ', pos=None, lemma='ᚩᚠᛏ', stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription='oft', definition=None)
