# NLP with pretrained models - spaCy and StanfordNLP

In [None]:
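import spacy
import tabulate
from IPython.display import HTML, display

# Load the small English pipeline. Install it first with:
#   python -m spacy download en_core_web_sm
en = spacy.load("en_core_web_sm")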

In [None]:
text = ("Donald John Trump (born June 14, 1946) is the 45th and current president of "
        "the United States.  Before entering politics, he was a businessman and television personality.")
print(text)

In [None]:
doc_en = en(text)

First, spaCy splits the document into sentences and each sentence into tokens.

In [None]:
list(doc_en.sents)

spaCy also identifies a number of linguistic features for every token. The most basic of these are the lemma and two types of part-of-speech tags: the `pos_` attribute contains the [Universal POS tags](https://universaldependencies.org/u/pos/) from the [Universal Dependencies](https://universaldependencies.org/) project, while the `tag_` attribute contains more fine-grained, language-specific part-of-speech tags.

In [None]:
features = [[t.orth_, t.lemma_, t.pos_, t.tag_] for t in doc_en]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

spaCy also offers pretrained models for named entity recognition. Their results are available on the `ent_iob_` and `ent_type_` attributes of each token. The `ent_type_` attribute tells us what type of entity the token refers to. In the English models, these entity types follow the [OntoNotes standard](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf). In our example, we see that `Donald John Trump` refers to a person, `June 14, 1946` to a date, `45th` to an ordinal number, and `the United States` to a geo-political entity (GPE).

The letters on the `ent_iob_` attribute give the position of the token in the entity. `O` means the token is outside of an entity, `B` means the token is at the beginning of an entity, and `I` means it is inside an entity (at any position except for the beginning). In this way, we can tell apart several entities of the same type that immediately follow each other. Together these letters form the so-called `BIO` tagging scheme. There are other tagging schemes, such as `BILUO`, which also has letters for the last position and single (unique) tokens in an entity, but the BIO scheme gives you all the information you need.  

In [None]:
entities = [(t.orth_, t.ent_iob_, t.ent_type_) for t in doc_en]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))
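
To make the BIO scheme concrete, here is a minimal sketch in plain Python that groups `(text, iob, type)` triples, shaped like the rows in the table above, back into entity spans. The function name `group_bio` and the example tags are hand-written for illustration; they are not spaCy API or actual model output.

```python
def group_bio(tagged_tokens):
    """Group (text, iob, ent_type) triples into (entity_text, ent_type) spans."""
    spans = []
    current_words, current_type = [], None
    for text, iob, ent_type in tagged_tokens:
        # "B" always opens a new entity. An "I" whose type does not match
        # the open entity is treated as a new start as well.
        if iob == "B" or (iob == "I" and ent_type != current_type):
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [text], ent_type
        elif iob == "I":
            current_words.append(text)
        else:
            # "O": close any entity that is still open.
            if current_words:
                spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans


# Hand-written tags for illustration, not actual model output.
example = [
    ("Donald", "B", "PERSON"), ("John", "I", "PERSON"), ("Trump", "I", "PERSON"),
    ("is", "O", ""), ("the", "O", ""), ("45th", "B", "ORDINAL"),
    ("president", "O", ""), ("of", "O", ""),
    ("the", "B", "GPE"), ("United", "I", "GPE"), ("States", "I", "GPE"),
]
print(group_bio(example))
```

Because every entity starts with `B`, two entities of the same type that directly follow each other are still split correctly: `[("Ghent", "B", "GPE"), ("Belgium", "B", "GPE")]` yields two spans rather than one.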

You can also access the entities directly on the `ents` attribute of the document: 

In [None]:
print([(ent.text, ent.label_) for ent in doc_en.ents])

Finally, spaCy contains a dependency parser, which analyzes the grammatical relations between the tokens. For every token, `dep_` holds the relation label and `head` the token it attaches to.

In [None]:
syntax = [[token.text, token.dep_, token.head.text] for token in doc_en]
display(HTML(tabulate.tabulate(syntax, tablefmt='html')))
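
A flat table of heads can be hard to read as a tree. The sketch below, in plain Python, turns `(index, text, dep, head_index)` rows into an indented tree. It relies on spaCy's convention that the root token is its own head. With a loaded model, such rows could be built from `(t.i, t.text, t.dep_, t.head.i)`. The helper names and the example parse are hand-written assumptions, not model output.

```python
from collections import defaultdict


def children_by_head(rows):
    """Return the root index and a map from head index to dependent indices."""
    children = defaultdict(list)
    root = None
    for i, _text, _dep, head_i in rows:
        if head_i == i:
            # spaCy convention: the root points to itself.
            root = i
        else:
            children[head_i].append(i)
    return root, dict(children)


def render_tree(rows):
    """Return the parse as a list of indented 'text (dep)' lines."""
    info = {i: (text, dep) for i, text, dep, _head in rows}
    root, children = children_by_head(rows)
    lines = []

    def visit(i, depth):
        text, dep = info[i]
        lines.append("  " * depth + f"{text} ({dep})")
        for child in children.get(i, []):
            visit(child, depth + 1)

    visit(root, 0)
    return lines


# Hand-written parse of "he was a businessman", for illustration only.
example_rows = [
    (0, "he", "nsubj", 1),
    (1, "was", "ROOT", 1),
    (2, "a", "det", 3),
    (3, "businessman", "attr", 1),
]
print("\n".join(render_tree(example_rows)))
```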

## Multilingual NLP

spaCy has models not only for English, but also for many other languages. Here is an example of a Dutch sentence, which means "Charles Michel is the prime minister of Belgium".

In [None]:
nl = spacy.load("nl_core_news_sm")
text_nl = "Charles Michel is de eerste minister van België."
doc_nl = nl(text_nl)

The tokens in the Dutch document have the same attributes as those in the English one. Take care, however, because the functionality of the models can differ across languages. Here are three main differences between the English and the Dutch model: 

- The Dutch model does not offer lemmatization: the `lemma_` attribute is identical to the `orth_` attribute.
- The Dutch model uses a very different set of fine-grained part-of-speech tags on the `tag_` attribute.
- The Dutch model has different entity types (PER, LOC and ORG) than the English one.

These differences stem from the training corpora that were used to build the models, whose annotation guidelines can differ considerably.

In [None]:
info = [(t.orth_, t.lemma_, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_nl]
display(HTML(tabulate.tabulate(info, tablefmt='html')))
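
If you want to compare entities across the two models, one option is to project the finer English labels onto the coarser Dutch set. The sketch below does this in plain Python. The mapping itself is an assumption made for illustration and is not part of spaCy; labels without a coarse counterpart, such as dates and ordinals, are simply dropped.

```python
# Hypothetical projection of OntoNotes-style labels onto PER/LOC/ORG.
# This mapping is an illustrative assumption, not something spaCy provides.
COARSE_LABELS = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
}


def to_coarse(entities):
    """Map (text, label) pairs to coarse labels, dropping unmapped types."""
    return [
        (text, COARSE_LABELS[label])
        for text, label in entities
        if label in COARSE_LABELS
    ]


# The entities described for the English example sentence.
english_entities = [
    ("Donald John Trump", "PERSON"),
    ("June 14, 1946", "DATE"),
    ("45th", "ORDINAL"),
    ("the United States", "GPE"),
]
print(to_coarse(english_entities))
```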

## StanfordNLP

Another library whose functionality overlaps with that of spaCy is StanfordNLP. [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/), not to be confused with Stanford's Java [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library, is a [Python library](https://github.com/stanfordnlp/stanfordnlp) built on top of PyTorch that offers a fully neural pipeline with tokenization (including multi-word units), lemmatization, part-of-speech tagging (including morphological features) and dependency parsing. These components were built and trained for the [CoNLL-2018 shared task](https://nlp.stanford.edu/pubs/qi2018universal.pdf). The pipeline does not include named entity recognition, but the quality of its dependency parsing is state of the art. On top of that, it also offers a Python interface to CoreNLP.

Its API is very similar to that of spaCy:

In [None]:
import stanfordnlp

stanfordnlp.download('nl')
nl_stanford = stanfordnlp.Pipeline(lang="nl")

In [None]:
doc_nl_stanford = nl_stanford(text_nl)

The `text` and `lemma` properties speak for themselves. The `upos` attribute contains the universal part-of-speech tags we also find on spaCy's `pos_` attribute; the `xpos` attribute corresponds to spaCy's `tag_` attribute and contains the fine-grained tags with morphological properties. The `governor` attribute contains the (1-based) index of the head of each token; `dependency_relation` contains the grammatical relation between the two.

In [None]:
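stanford_info = []
for sentence in doc_nl_stanford.sentences:
    for token in sentence.tokens:
        # A token can expand into several words (multi-word units).
        for word in token.words:
            # The running index lines up with `governor` because the text is one sentence.
            stanford_info.append((
                len(stanford_info) + 1, word.text, word.lemma, word.upos,
                word.xpos, word.dependency_relation, word.governor,
            ))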

In [None]:
display(HTML(tabulate.tabulate(stanford_info, tablefmt='html')))
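
Numeric governors are compact but not very readable. The following sketch, in plain Python, replaces each 1-based governor index with the text of the head word, assuming (as in the CoNLL-U format) that a governor of 0 marks the root and that all rows come from a single sentence. The function name `resolve_heads` and the example parse are hand-written for illustration and are not actual StanfordNLP output.

```python
def resolve_heads(rows):
    """Replace 1-based governor indices with the head word's text.

    rows: (index, text, relation, governor) tuples for ONE sentence.
    A governor of 0 is assumed to mark the root.
    """
    text_by_index = {index: text for index, text, _rel, _gov in rows}
    return [
        (text, relation, "ROOT" if governor == 0 else text_by_index[governor])
        for _index, text, relation, governor in rows
    ]


# Hand-written parse of the Dutch example sentence, for illustration only.
example_parse = [
    (1, "Charles", "nsubj", 6),
    (2, "Michel", "flat", 1),
    (3, "is", "cop", 6),
    (4, "de", "det", 6),
    (5, "eerste", "amod", 6),
    (6, "minister", "root", 0),
    (7, "van", "case", 8),
    (8, "België", "nmod", 6),
    (9, ".", "punct", 6),
]
for row in resolve_heads(example_parse):
    print(row)
```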