## spaCy

As sample text we have a multi-sentence fragment from [The Atlantic](https://www.theatlantic.com/technology/archive/2018/12/what-us-cities-can-learn-from-venices-floods/577255/):

In [23]:
text = '''
The Venetian climate-adaptation repertoire is staged in four settings. 
The barrier islands that separate the Adriatic from the lagoon are the first line of defense against high water. 
The three inlets that cut through the islands where the tides surge and ebb constitute the second: 
Here, when completed, the mose barriers will shut when needed. Then comes the lagoon and, finally, the city. 
Large-scale engineering, which tends to get more attention, is confined to the mose barriers. 
Projects on the islands and within the lagoon focus on restoring damaged landforms and the habitats they offer to indigenous plants and animals. 
Meanwhile, urban adaptation campaigns are focused on dredging city canals and bolstering individual buildings.
'''

We'll create a spaCy pipeline and process the text.  I'm using the "large" model for English, documented here: https://github.com/explosion/spacy-models/releases/tag/en_core_web_lg-2.0.0

The spaCy website is a strength of the project, incidentally.  The clarity, organization, and directness of the docs are appealing, though I find them a bit thin on details, e.g. of the APIs.

So: this model comes with a POS tagger, NER, a dependency parser (a CNN trained on OntoNotes), GloVe vectors (300 dim).  Tokenization and lemmatization too. 

In [24]:
import spacy

nlp = spacy.load('en_core_web_lg')
type(nlp)

spacy.lang.en.English

Interesting class name!  English is a [Language](https://spacy.io/api/language), and Language has [Pipe](http://spacy.io/api/pipe) objects.  So unlike the CLTK, there's a single object that provides access to all tasks available for each language.

Note that by default, when you load a model, all components of the pipeline are active.

As in Stanford CoreNLP, processing text returns a container object, a [Doc](https://spacy.io/api/doc), which is a sequence of [Token](https://spacy.io/api/token) objects.  The English object is callable, so simply passing it the text launches the complete pipeline.

In [25]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [26]:
doc = nlp(text)
type(doc), type(doc[0])

(spacy.tokens.doc.Doc, spacy.tokens.token.Token)

The Token has many functions for interrogating its place in the dependency graph, and doesn't necessarily have the greatest interface, e.g.

In [27]:
token = doc[2]
token, token.lemma, token.pos

(Venetian, 17525330293902761285, 83)

Looks like we're getting IDs for lemma and POS tag.  Docs say to append an underscore to the attributes:

In [28]:
token.lemma_ ,token.pos_

('venetian', 'ADJ')
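
The large integers are consistent with spaCy keeping each string once in a shared table and storing only a hash on the token.  Here is a toy sketch of that idea; `ToyStringStore` is a hypothetical stand-in, and SHA-256 is substituted for whatever hash function spaCy actually uses:

```python
import hashlib


class ToyStringStore:
    """Maps strings to integer IDs and back (toy model, not spaCy's code)."""

    def __init__(self):
        self._by_id = {}

    def add(self, string):
        # Derive a stable 64-bit integer from the string's bytes
        digest = hashlib.sha256(string.encode("utf-8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_id[key] = string
        return key

    def __getitem__(self, key):
        return self._by_id[key]


store = ToyStringStore()
lemma_id = store.add("venetian")   # what an ID-valued attribute would hold
lemma_str = store[lemma_id]        # what the underscore variant would return
print(lemma_id, lemma_str)
```

Under this reading, `token.lemma` is the key and `token.lemma_` is the lookup done for you.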

So to get a complete set of tags, we might say:

In [29]:
[(token.text, token.lemma_, token.pos_) for token in doc]

[('\n', '\n', 'SPACE'),
 ('The', 'the', 'DET'),
 ('Venetian', 'venetian', 'ADJ'),
 ('climate', 'climate', 'NOUN'),
 ('-', '-', 'PUNCT'),
 ('adaptation', 'adaptation', 'NOUN'),
 ('repertoire', 'repertoire', 'NOUN'),
 ('is', 'be', 'VERB'),
 ('staged', 'stag', 'VERB'),
 ('in', 'in', 'ADP'),
 ('four', 'four', 'NUM'),
 ('settings', 'setting', 'NOUN'),
 ('.', '.', 'PUNCT'),
 ('\n', '\n', 'SPACE'),
 ('The', 'the', 'DET'),
 ('barrier', 'barrier', 'NOUN'),
 ('islands', 'island', 'NOUN'),
 ('that', 'that', 'ADJ'),
 ('separate', 'separate', 'VERB'),
 ('the', 'the', 'DET'),
 ('Adriatic', 'adriatic', 'PROPN'),
 ('from', 'from', 'ADP'),
 ('the', 'the', 'DET'),
 ('lagoon', 'lagoon', 'NOUN'),
 ('are', 'be', 'VERB'),
 ('the', 'the', 'DET'),
 ('first', 'first', 'ADJ'),
 ('line', 'line', 'NOUN'),
 ('of', 'of', 'ADP'),
 ('defense', 'defense', 'NOUN'),
 ('against', 'against', 'ADP'),
 ('high', 'high', 'ADJ'),
 ('water', 'water', 'NOUN'),
 ('.', '.', 'PUNCT'),
 ('\n', '\n', 'SPACE'),
 ('The', 'the', 'DET'),

So, in summary:

In [30]:
import spacy

nlp = spacy.load('en_core_web_lg')
lemmas_tags = [(token.text, token.lemma_, token.pos_) for token in nlp(text)]

The interface is tight and seemingly economical, though a beginner may not immediately realize that all tasks defined for the language are run on invoking the `nlp` object.  In particular, the syntactic parser slows things down by an order of magnitude.  

The `disable` keyword argument specifies which pipeline components not to run.

In [31]:
import time

start_time = time.time()
nlp(text*100)
print("Full pipeline: {0} seconds".format(time.time() - start_time))

start_time = time.time()
nlp(text*100, disable=['parser', 'ner'])
print("Tokenizer and tagger only: {0} seconds".format(time.time() - start_time))

Full pipeline: 2.6630516052246094 seconds
Tokenizer and tagger only: 0.16348052024841309 seconds
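
A single timed call can be noisy, so it may be worth repeating the measurement.  Below is a small helper, as a sketch only: `best_time` and `toy_workload` are hypothetical names, and the workload is a stand-in so that the snippet runs without a model loaded.

```python
import time


def best_time(func, *args, repeats=3, **kwargs):
    """Return the fastest wall-clock time, in seconds, over several calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)


def toy_workload(n):
    # Hypothetical stand-in for a pipeline call such as nlp(text * 100)
    return sum(i * i for i in range(n))


elapsed = best_time(toy_workload, 100_000, repeats=5)
print("Best of 5: {0:.4f} seconds".format(elapsed))
```

With the model loaded, `best_time(nlp, text * 100)` and `best_time(nlp, text * 100, disable=['parser', 'ner'])` would be the analogous calls.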


I think I would prefer to explicitly name the pipeline components I want when creating the pipeline.  This is how the Stanford CoreNLP system works.  Here's an example from a production system at work (no IP here ;).  We use Clojure.

```clojure
(defn make-pipeline
  "Creates and returns Stanford CoreNLP (pipeline) object with specified annotators"
  [annotators]
  (StanfordCoreNLP. (doto 
                      (Properties.) (.put "annotators" annotators)
                      (.setProperty "tokenize.options" "unicodeEllipsis=true"))))

(defonce master-pipe (make-pipeline "tokenize, ssplit, pos, lemma, ner, parse"))
```
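
For comparison, here is the same opt-in idea as a Python toy.  Nothing here is spaCy or CoreNLP API: `REGISTRY`, `make_pipeline`, and the two annotators are hypothetical, chosen only to show components being selected by name and unknown names failing loudly.

```python
def tokenize(doc):
    # Naive whitespace tokenizer, for illustration only
    doc["tokens"] = doc["text"].split()
    return doc


def lowercase(doc):
    doc["lower"] = [t.lower() for t in doc["tokens"]]
    return doc


# Hypothetical registry of available annotators, keyed by name
REGISTRY = {"tokenize": tokenize, "lowercase": lowercase}


def make_pipeline(annotators):
    """Build a function that runs only the annotators named in the string."""
    names = [name.strip() for name in annotators.split(",") if name.strip()]
    unknown = [name for name in names if name not in REGISTRY]
    if unknown:
        raise ValueError("Unknown annotators: {0}".format(unknown))
    steps = [REGISTRY[name] for name in names]

    def run(text):
        doc = {"text": text}
        for step in steps:
            doc = step(doc)
        return doc

    return run


pipeline = make_pipeline("tokenize, lowercase")
result = pipeline("Then comes the lagoon")
print(result["lower"])
```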

## CLTK

Let's use Latin, on the assumption that Latin tools are among the most popular of the CLTK.  Note that I've never used the CLTK Latin tools, so I'm approaching this like the newbie I am.

First we make sure we have the corpora installed.

In [32]:
from cltk.corpus.utils.importer import CorpusImporter

corpus_importer = CorpusImporter('latin')
corpus_importer.list_corpora


['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia']

In [33]:
corpus_importer.import_corpus('latin_models_cltk')

For a text, let's try with a bit from the greatest autobiography ever:

In [34]:
latin_text = '''
per idem tempus annorum novem, ab undevicensimo anno aetatis meae usque ad duodetricensimum, 
seducebamur et seducebamus, falsi atque fallentes in variis cupiditatibus, 
et palam per doctrinas quas liberales vocant, occulte autem falso nomine religionis, 
hic superbi, ibi superstitiosi, ubique vani, hac popularis gloriae sectantes inanitatem,
usque ad theatricos plausus et contentiosa carmina et agonem coronarum faenearum 
et spectaculorum nugas et intemperantiam libidinum, illac autem purgari nos ab istis sordibus expetentes, 
cum eis qui appellarentur electi et sancti afferremus escas de quibus nobis in officina aqualiculi 
sui fabricarent angelos et deos per quos liberaremur.
'''

As we know, lemmas and POS tags must be requested separately.  Starting with the former, one is immediately faced with, it seems, several options, according to http://docs.cltk.org/en/latest/latin.html.  There is a dictionary-based lemmatizer, and then an extended discussion about n-gram backoff lemmatizers.  Which one works best?  We aren't told, but one might reason that since the dictionary-based implementation amounts to a unigram model, perhaps a backoff mechanism will better handle syncretism (ambiguities in the paradigms).

However, the documentation for the backoff lemmatizers seems to be incomplete (?), so we'll stick to the simpler model.
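
To make the backoff idea concrete, here is a toy chain: exact dictionary lookup first, then suffix rules, then the token itself.  This is a sketch of the general technique, not the CLTK implementation; the table and rules are invented, and the two suffix rules would give wrong lemmas for other conjugations.

```python
import re


def dictionary_lemma(token, table):
    """First stage: exact lookup; returns None when the form is unknown."""
    return table.get(token)


def suffix_lemma(token, rules):
    """Second stage: first matching (pattern, replacement) rule, else None."""
    for pattern, replacement in rules:
        if re.search(pattern, token):
            return re.sub(pattern, replacement, token)
    return None


def backoff_lemmatize(tokens, table, rules):
    """Try each stage in turn; fall back to the token itself."""
    lemmas = []
    for token in tokens:
        lemma = dictionary_lemma(token, table)
        if lemma is None:
            lemma = suffix_lemma(token, rules)
        if lemma is None:
            lemma = token  # last resort: identity
        lemmas.append(lemma)
    return lemmas


# Hypothetical toy data, chosen only to exercise each stage
toy_table = {"annorum": "annus", "aetatis": "aetas"}
toy_rules = [(r"ebamur$", "o"), (r"ebamus$", "o")]

toy_lemmas = backoff_lemmatize(
    ["annorum", "seducebamur", "seducebamus", "per"], toy_table, toy_rules
)
print(toy_lemmas)
```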

The text doesn't need the J/I and V/U replacer.  The lemmatizer itself has the interesting name `LemmaReplacer`, as if it's in a class hierarchy with the `JVReplacer`.

In [35]:
from cltk.stem.lemma import LemmaReplacer

latin_text = latin_text.lower()
lemmatizer = LemmaReplacer('latin')
lemmatizer.lemmatize(latin_text)

['per',
 'idem',
 'tempus',
 'annus',
 'novo',
 ',',
 'ab',
 'undevicensimo',
 'adnato',
 'aetas',
 'meus',
 'usque',
 'ad',
 'duodetricensimum',
 ',',
 'seducebamur',
 'et',
 'seducebamus',
 ',',
 'fallo',
 'atque',
 'fallo',
 'in',
 'varius1',
 'cupiditas',
 ',',
 'et',
 'pala',
 'per',
 'doctrina',
 'qui1',
 'liberalis1',
 'voco',
 ',',
 'occulo',
 'autem',
 'fallo',
 'nomen',
 'religio',
 ',',
 'hic',
 'superbus',
 ',',
 'ibi',
 'superstitiosus',
 ',',
 'ubique',
 'vanus',
 ',',
 'hic',
 'populo',
 'gloria',
 'sector2',
 'inanitas',
 ',',
 'usque',
 'ad',
 'theatricos',
 'plaudo',
 'et',
 'contentiosus',
 'carmen1',
 'et',
 'agon',
 'corona',
 'faenearum',
 'et',
 'spectaculum',
 'nugae',
 'et',
 'intemperantia',
 'libido',
 ',',
 'illic',
 'autem',
 'purgo',
 'nos',
 'ab',
 'iste',
 'sordes',
 'expeto',
 ',',
 'cum',
 'is',
 'qui1',
 'appello',
 'eligo',
 'et',
 'sancio',
 'afferremus',
 'esca',
 'de',
 'qui1',
 'nos',
 'in',
 'officina',
 'aqualiculi',
 'suo',
 'fabricarent',
 'a

My extremely weak knowledge of Latin suggests that there are a few errors: `anno` -> `adnato`, and `seducebamur` is not reduced, among the others.  Effects of a dictionary-based implementation, I'm guessing.

Also it would be nice to have the inflected forms returned with the lemmas, in effect getting tokenization from the lemmatizer.  No need though, since the POS tagger also returns individual tokens.

Now for the POS tags.  Again there are several options.  Not knowing the facts for Latin, I happen to believe the CRF tagger works better than TnT.

In [36]:
from cltk.tag.pos import POSTag

tagger = POSTag('latin')
tagger.tag_crf(latin_text)

[('per', 'R--------'),
 ('idem', 'P-S---NA-'),
 ('tempus', 'N-S---NA-'),
 ('annorum', 'A-S---MA-'),
 ('novem', 'N-S---MA-'),
 (',', 'U--------'),
 ('ab', 'R--------'),
 ('undevicensimo', 'A-S---MB-'),
 ('anno', 'N-S---MB-'),
 ('aetatis', 'N-S---FG-'),
 ('meae', 'A-S---FG-'),
 ('usque', 'D--------'),
 ('ad', 'R--------'),
 ('duodetricensimum', 'N-S---MA-'),
 (',', 'U--------'),
 ('seducebamur', 'V1PPIP---'),
 ('et', 'C--------'),
 ('seducebamus', 'N-S---MN-'),
 (',', 'U--------'),
 ('falsi', 'T-PRPPMN-'),
 ('atque', 'C--------'),
 ('fallentes', 'N-P---MN-'),
 ('in', 'R--------'),
 ('variis', 'A-P---FB-'),
 ('cupiditatibus', 'N-P---FB-'),
 (',', 'U--------'),
 ('et', 'C--------'),
 ('palam', 'N-S---FA-'),
 ('per', 'R--------'),
 ('doctrinas', 'N-P---FA-'),
 ('quas', 'A-P---FA-'),
 ('liberales', 'A-P---MN-'),
 ('vocant', 'V3PPIA---'),
 (',', 'U--------'),
 ('occulte', 'D--------'),
 ('autem', 'C--------'),
 ('falso', 'A-S---NB-'),
 ('nomine', 'N-S---NB-'),
 ('religionis', 'N-S---MG-'),
 (

Interesting: we get not just a POS tag but a whole feature bundle.  As an aside, the doc page doesn't tell us how to interpret the complex tag, and I suggest that we should evaluate the accuracy of these.  
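
My guess is that the nine characters follow a Perseus-treebank-style positional scheme (part of speech, person, number, tense, mood, voice, gender, case, degree).  The decoder below rests entirely on that assumption, which the CLTK page does not confirm; the value tables are partial, and unknown codes are passed through unchanged.

```python
# Assumed field order for the 9-character tags (Perseus-style positional scheme)
FIELDS = ["pos", "person", "number", "tense", "mood",
          "voice", "gender", "case", "degree"]

# Assumed, partial value tables; codes not listed are passed through unchanged
VALUES = {
    "pos": {"N": "noun", "V": "verb", "T": "participle", "A": "adjective",
            "D": "adverb", "C": "conjunction", "R": "preposition",
            "P": "pronoun", "U": "punctuation"},
    "number": {"S": "singular", "P": "plural"},
    "tense": {"P": "present", "I": "imperfect", "R": "perfect", "F": "future"},
    "mood": {"I": "indicative", "S": "subjunctive", "P": "participle"},
    "voice": {"A": "active", "P": "passive"},
    "gender": {"M": "masculine", "F": "feminine", "N": "neuter"},
    "case": {"N": "nominative", "G": "genitive", "D": "dative",
             "A": "accusative", "B": "ablative", "V": "vocative"},
}


def decode_tag(tag):
    """Split a positional tag into named features, skipping '-' placeholders."""
    if len(tag) != len(FIELDS):
        raise ValueError(
            "Expected {0} characters, got {1!r}".format(len(FIELDS), tag)
        )
    features = {}
    for field, code in zip(FIELDS, tag):
        if code != "-":
            features[field] = VALUES.get(field, {}).get(code, code)
    return features


print(decode_tag("V1PPIP---"))
print(decode_tag("N-S---MB-"))
```

Read this way, `T-PRPPMN-` for `falsi` comes out as a perfect passive participle, masculine nominative plural, which at least looks plausible.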

So now we just have to join the two output lists to obtain the equivalent to the spaCy output, praying that the number of tokens returned by the two components is the same (one hopes that the same tokenizer is used).

In [37]:
lemmas = lemmatizer.lemmatize(latin_text)
tags = tagger.tag_crf(latin_text)
[(token, lemma, tag) for ((token, tag), lemma) in zip(tags, lemmas)]

[('per', 'per', 'R--------'),
 ('idem', 'idem', 'P-S---NA-'),
 ('tempus', 'tempus', 'N-S---NA-'),
 ('annorum', 'annus', 'A-S---MA-'),
 ('novem', 'novo', 'N-S---MA-'),
 (',', ',', 'U--------'),
 ('ab', 'ab', 'R--------'),
 ('undevicensimo', 'undevicensimo', 'A-S---MB-'),
 ('anno', 'adnato', 'N-S---MB-'),
 ('aetatis', 'aetas', 'N-S---FG-'),
 ('meae', 'meus', 'A-S---FG-'),
 ('usque', 'usque', 'D--------'),
 ('ad', 'ad', 'R--------'),
 ('duodetricensimum', 'duodetricensimum', 'N-S---MA-'),
 (',', ',', 'U--------'),
 ('seducebamur', 'seducebamur', 'V1PPIP---'),
 ('et', 'et', 'C--------'),
 ('seducebamus', 'seducebamus', 'N-S---MN-'),
 (',', ',', 'U--------'),
 ('falsi', 'fallo', 'T-PRPPMN-'),
 ('atque', 'atque', 'C--------'),
 ('fallentes', 'fallo', 'N-P---MN-'),
 ('in', 'in', 'R--------'),
 ('variis', 'varius1', 'A-P---FB-'),
 ('cupiditatibus', 'cupiditas', 'N-P---FB-'),
 (',', ',', 'U--------'),
 ('et', 'et', 'C--------'),
 ('palam', 'pala', 'N-S---FA-'),
 ('per', 'per', 'R--------'),
 ('
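
Since `zip` silently truncates to the shorter input, a difference in token counts between the two components would go unnoticed.  A small guard, sketched here with the hypothetical name `join_annotations`, makes the failure visible; note that it checks only lengths, so it would not catch two tokenization differences that happen to cancel out.

```python
def join_annotations(tags, lemmas):
    """Merge (token, tag) pairs with lemmas into (token, lemma, tag) triples."""
    if len(tags) != len(lemmas):
        # zip() would silently truncate to the shorter list
        raise ValueError(
            "Token count mismatch: {0} tags vs {1} lemmas".format(
                len(tags), len(lemmas)
            )
        )
    return [(token, lemma, tag) for (token, tag), lemma in zip(tags, lemmas)]


# First three items of the tagger and lemmatizer outputs shown earlier
sample_tags = [("per", "R--------"), ("idem", "P-S---NA-"), ("tempus", "N-S---NA-")]
sample_lemmas = ["per", "idem", "tempus"]

joined = join_annotations(sample_tags, sample_lemmas)
print(joined)
```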

Summarizing again,

In [38]:
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp(text, disable=['parser', 'ner'])
lemmas_tags_spacy = [(token.text, token.lemma_, token.pos_) for token in doc]

In [39]:
from cltk.stem.lemma import LemmaReplacer
from cltk.tag.pos import POSTag

latin_text = latin_text.lower()

lemmatizer = LemmaReplacer('latin')
lemmas = lemmatizer.lemmatize(latin_text)

tagger = POSTag('latin')
tags = tagger.tag_crf(latin_text)

lemmas_tags_cltk = [(token, lemma, tag) for ((token, tag), lemma) in zip(tags, lemmas)]