<b>Stanza</b>
<br>
Stanza provides simple, flexible, and unified interfaces for downloading and running various NLP models; see the Downloading Models and Pipeline pages of its documentation for details. At a high level, to start annotating text you first initialize a Pipeline, which pre-loads and chains a series of Processors, each performing a specific NLP task (e.g., tokenization, dependency parsing, or named entity recognition).
<br>
<br>
In most cases you need to download Stanza's pre-trained models for a language before running any NLP task on it. This takes a single call to `stanza.download`. The language can be specified with either a full language name (e.g., "Japanese") or a short code (e.g., "ja").
<br>
The reference paper for Stanza is available at this <a href="https://arxiv.org/abs/2003.07082">link</a>.
<br>
In this course we are going to work with pre-trained language models. Of course, if you download the Stanza code from the corresponding GitHub page, you can start working on your own models, for instance to train your own named entity recognizer.
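
Because a Pipeline chains Processors, some processors only work when others are also present. The sketch below collects the prerequisites stated in the later sections of this notebook into a lookup table and checks a `processors` string against it. The helper name `missing_prerequisites` and the table are made up for this illustration, and the `mwt` entry is an added assumption. Stanza runs its own check when the pipeline is built and may relax these rules for some languages, so treat this as a reading aid only.

```python
# Prerequisites as described in this notebook, not read from Stanza itself.
# The "mwt" entry is an assumption added for completeness.
REQUIRES = {
    "tokenize": [],
    "mwt": ["tokenize"],
    "pos": ["tokenize", "mwt"],
    "lemma": ["tokenize", "mwt", "pos"],
    "depparse": ["tokenize", "mwt", "pos", "lemma"],
    "ner": ["tokenize"],
}


def missing_prerequisites(processors):
    """Return {processor: [missing prerequisites]} for a comma-separated string."""
    requested = [name.strip() for name in processors.split(",") if name.strip()]
    present = set(requested)
    problems = {}
    for name in requested:
        missing = [req for req in REQUIRES.get(name, []) if req not in present]
        if missing:
            problems[name] = missing
    return problems


print(missing_prerequisites("tokenize,mwt,pos,lemma,depparse"))
print(missing_prerequisites("tokenize,depparse"))
```

An empty dictionary means every listed processor has its prerequisites in the same string.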

In [1]:
import stanza
stanza.download('en',verbose=False)

In [2]:
text = "Barack Obama was born in Hawaii. His style is different from Donald Trump's"

<b> Tokenizer</b><br>

In [3]:
nlp = stanza.Pipeline('en', processors='tokenize', use_gpu=False, verbose=False)  # Build a tokenizer-only pipeline on CPU

In [4]:
doc = nlp(text) # Run the pipeline on the input text

In [5]:
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

id: (1,)	text: Barack
id: (2,)	text: Obama
id: (3,)	text: was
id: (4,)	text: born
id: (5,)	text: in
id: (6,)	text: Hawaii
id: (7,)	text: .
id: (1,)	text: His
id: (2,)	text: style
id: (3,)	text: is
id: (4,)	text: different
id: (5,)	text: from
id: (6,)	text: Donald
id: (7,)	text: Trump's
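
The ids above print as one-element tuples such as `(1,)` because a Stanza token can cover more than one syntactic word; a multi-word token carries a `(start, end)` pair instead. Below is a minimal sketch of turning such ids into a CoNLL-U style label. It uses hand-written tuples rather than a live pipeline, and the helper name `format_token_id` is made up for this illustration.

```python
def format_token_id(token_id):
    """Render a token id tuple as '3' (single word) or '3-4' (multi-word token)."""
    if len(token_id) == 1:
        return str(token_id[0])
    start, end = token_id
    return f"{start}-{end}"


# The first id mirrors the output above; the second is a hypothetical
# multi-word token spanning words 3 and 4.
print(format_token_id((1,)))
print(format_token_id((3, 4)))
```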


You can also tokenize text without sentence segmentation: set `tokenize_no_ssplit` to `True` and the whole input is treated as a single sentence, so the token ids below run from 1 to 14 instead of restarting at the second sentence.

In [6]:
nlp = stanza.Pipeline('en', processors='tokenize', use_gpu=False, tokenize_no_ssplit=True, verbose=False)  # Build the pipeline with sentence segmentation disabled
doc = nlp(text) # Run the pipeline on the input text

for i, sentence in enumerate(doc.sentences):
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

id: (1,)	text: Barack
id: (2,)	text: Obama
id: (3,)	text: was
id: (4,)	text: born
id: (5,)	text: in
id: (6,)	text: Hawaii
id: (7,)	text: .
id: (8,)	text: His
id: (9,)	text: style
id: (10,)	text: is
id: (11,)	text: different
id: (12,)	text: from
id: (13,)	text: Donald
id: (14,)	text: Trump's
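
With `tokenize_no_ssplit=True`, Stanza's documentation describes a blank line (two consecutive newlines) as the only sentence boundary it still honours. If the sentences were split by other means, they can therefore be joined before being handed to the pipeline. A small sketch of that preparation step follows; the variable names are illustrative and the block itself does not call Stanza.

```python
# Sentences already split by some other means (here: by hand).
sentences = [
    "Barack Obama was born in Hawaii.",
    "His style is different from Donald Trump's",
]

# A blank line between sentences marks the boundary for a pipeline
# built with tokenize_no_ssplit=True.
presplit_text = "\n\n".join(sentences)
print(repr(presplit_text))

# Passing presplit_text to the pipeline from the cell above,
# i.e. doc = nlp(presplit_text), should then yield one Stanza sentence
# per input sentence rather than the single 14-token sentence shown above.
```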


If you have already tokenized your text and just want to use Stanza for downstream processing, set `tokenize_pretokenized` to `True` to bypass the neural tokenizer.

In [7]:
# Reuse the tokens of the single sentence produced by the previous cell
tokens = [token.text for token in doc.sentences[0].tokens]
print(tokens)

['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', '.', 'His', 'style', 'is', 'different', 'from', 'Donald', "Trump's"]


In [8]:
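nlp = stanza.Pipeline('en', processors='tokenize', tokenize_pretokenized=True, verbose=False)
doc = nlp([tokens])  # Pretokenized input is a list of sentences, each a list of tokens
print([token.text for token in doc.sentences[0].tokens])  # Tokens come back exactly as supplied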

['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', '.', 'His', 'style', 'is', 'different', 'from', 'Donald', "Trump's"]
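
Stanza's documentation describes two accepted shapes for pretokenized input: a list of token lists (one inner list per sentence), or a string in which tokens are separated by whitespace and sentences by newlines. The sketch below converts between the two shapes in plain Python; the helper names are made up for this illustration and no Stanza call is involved.

```python
def to_pretokenized_string(sentences):
    """Join token lists into the string form: one sentence per line."""
    return "\n".join(" ".join(tokens) for tokens in sentences)


def from_pretokenized_string(text):
    """Split the string form back into a list of token lists."""
    return [line.split() for line in text.splitlines() if line.strip()]


pretokenized = [
    ["Barack", "Obama", "was", "born", "in", "Hawaii", "."],
    ["His", "style", "is", "different", "from", "Donald", "Trump's"],
]

as_string = to_pretokenized_string(pretokenized)
print(as_string)

# Converting back recovers the original token lists.
round_trip = from_pretokenized_string(as_string)
print(round_trip == pretokenized)
```

Either shape can be passed to a pipeline built with `tokenize_pretokenized=True`.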


<b>Part of Speech (POS)</b><br>
Stanza also supplies a processor that labels each word with its universal POS (`UPOS`) tag, treebank-specific POS (`XPOS`) tag, and universal morphological features (`UFeats`).
<br>
The part-of-speech tags can be accessed via the `upos` (alias `pos`) and `xpos` fields of each Word in a Sentence.

<br>
Note: POSProcessor requires the TokenizeProcessor and MWTProcessor in the pipeline. 
<br>
More information on the POS tags: https://universaldependencies.org/u/pos/index.html



In [9]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos', verbose=False)
doc = nlp(text)
for i, sent in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}' for word in sent.words], sep='\n')

word: Barack	upos: PROPN	xpos: NNP
word: Obama	upos: PROPN	xpos: NNP
word: was	upos: AUX	xpos: VBD
word: born	upos: VERB	xpos: VBN
word: in	upos: ADP	xpos: IN
word: Hawaii	upos: PROPN	xpos: NNP
word: .	upos: PUNCT	xpos: .
word: His	upos: PRON	xpos: PRP$
word: style	upos: NOUN	xpos: NN
word: is	upos: AUX	xpos: VBZ
word: different	upos: ADJ	xpos: JJ
word: from	upos: ADP	xpos: IN
word: Donald	upos: PROPN	xpos: NNP
word: Trump's	upos: PROPN	xpos: NNP
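
Once words carry UPOS tags, common follow-up steps are counting tag frequencies and keeping only content words. The sketch below does both on the (word, upos) pairs copied from the output above, so it runs without a pipeline. The set `CONTENT_TAGS` is an illustrative choice, not something defined by Stanza.

```python
from collections import Counter

# (word, upos) pairs copied from the output above.
tagged = [
    ("Barack", "PROPN"), ("Obama", "PROPN"), ("was", "AUX"), ("born", "VERB"),
    ("in", "ADP"), ("Hawaii", "PROPN"), (".", "PUNCT"),
    ("His", "PRON"), ("style", "NOUN"), ("is", "AUX"), ("different", "ADJ"),
    ("from", "ADP"), ("Donald", "PROPN"), ("Trump's", "PROPN"),
]

# How often each universal POS tag occurs.
upos_counts = Counter(upos for _, upos in tagged)
print(upos_counts.most_common(3))

# Keep only words whose tag marks them as content words (illustrative set).
CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
content_words = [word for word, upos in tagged if upos in CONTENT_TAGS]
print(content_words)
```

With a live pipeline, the same pairs can be collected with `[(word.text, word.upos) for sent in doc.sentences for word in sent.words]`.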


<b>LemmaProcessor</b>
<br>
Like other NLP toolkits, Stanza supports lemmatization, which is handled by the `LemmaProcessor`.<br>
TokenizeProcessor, MWTProcessor, and POSProcessor are required in the pipeline to run LemmaProcessor. 
<br>
Lemmatizing words in a sentence and accessing their lemmas afterwards can be done as below.

In [10]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma', verbose=False)
doc = nlp(text)
for i, sent in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for word in sent.words], sep='\n')

word: Barack 	lemma: Barack
word: Obama 	lemma: Obama
word: was 	lemma: be
word: born 	lemma: bear
word: in 	lemma: in
word: Hawaii 	lemma: Hawaii
word: . 	lemma: .
word: His 	lemma: his
word: style 	lemma: style
word: is 	lemma: be
word: different 	lemma: different
word: from 	lemma: from
word: Donald 	lemma: Donald
word: Trump's 	lemma: Trump'
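
One practical effect of lemmatization is that different surface forms collapse onto a shared lemma. The sketch below groups the (word, lemma) pairs copied from the output above and reports the lemmas that cover more than one form. It is plain Python and the variable names are illustrative.

```python
from collections import defaultdict

# (word, lemma) pairs copied from the output above.
pairs = [
    ("Barack", "Barack"), ("Obama", "Obama"), ("was", "be"), ("born", "bear"),
    ("in", "in"), ("Hawaii", "Hawaii"), (".", "."),
    ("His", "his"), ("style", "style"), ("is", "be"), ("different", "different"),
    ("from", "from"), ("Donald", "Donald"), ("Trump's", "Trump'"),
]

# Collect every distinct surface form seen for each lemma.
forms_by_lemma = defaultdict(set)
for word, lemma in pairs:
    forms_by_lemma[lemma].add(word)

# Lemmas that merge several different surface forms.
merged = {
    lemma: sorted(forms)
    for lemma, forms in forms_by_lemma.items()
    if len(forms) > 1
}
print(merged)
```

In this text only `be` merges two forms, `is` and `was`.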


<b>DepparseProcessor</b>
<br>
To analyze how the words in a sentence relate to one another syntactically, you can use the `DepparseProcessor`, which provides an accurate syntactic dependency parser.
<br>
Remember: DepparseProcessor requires TokenizeProcessor, MWTProcessor, POSProcessor, and LemmaProcessor in the pipeline. The head index of each Word can be accessed through the property `head` (0 marks the root), and the dependency relation between a word and its head through `deprel`.
<br>
Here is an example:

In [11]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse', verbose = False)
doc = nlp(text)
for i, sent in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {sent.words[word.head-1].text if word.head > 0 else "root"}\tdeprel: {word.deprel}'  for word in sent.words], sep='\n')

id: 1	word: Barack	head id: 4	head: born	deprel: nsubj:pass
id: 2	word: Obama	head id: 1	head: Barack	deprel: flat
id: 3	word: was	head id: 4	head: born	deprel: aux:pass
id: 4	word: born	head id: 0	head: root	deprel: root
id: 5	word: in	head id: 6	head: Hawaii	deprel: case
id: 6	word: Hawaii	head id: 4	head: born	deprel: obl
id: 7	word: .	head id: 4	head: born	deprel: punct
id: 1	word: His	head id: 2	head: style	deprel: nmod:poss
id: 2	word: style	head id: 4	head: different	deprel: nsubj
id: 3	word: is	head id: 4	head: different	deprel: cop
id: 4	word: different	head id: 0	head: root	deprel: root
id: 5	word: from	head id: 6	head: Donald	deprel: case
id: 6	word: Donald	head id: 4	head: different	deprel: obl
id: 7	word: Trump's	head id: 6	head: Donald	deprel: flat
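
The `head` and `deprel` columns encode a tree: every word points to its head, and the word whose head is 0 is the root. The sketch below rebuilds that tree for the first sentence from the rows copied from the output above and renders it as indented text. It is plain Python; the function name `render` is illustrative.

```python
from collections import defaultdict

# (id, word, head, deprel) rows copied from the first sentence above.
rows = [
    (1, "Barack", 4, "nsubj:pass"),
    (2, "Obama", 1, "flat"),
    (3, "was", 4, "aux:pass"),
    (4, "born", 0, "root"),
    (5, "in", 6, "case"),
    (6, "Hawaii", 4, "obl"),
    (7, ".", 4, "punct"),
]

# Map each head id to the ids of its dependents, in sentence order.
children = defaultdict(list)
for word_id, _, head, _ in rows:
    children[head].append(word_id)

by_id = {word_id: (text, deprel) for word_id, text, _, deprel in rows}


def render(word_id, depth=0):
    """Return the subtree rooted at word_id as a list of indented lines."""
    text, deprel = by_id[word_id]
    lines = ["  " * depth + f"{text} ({deprel})"]
    for child in children[word_id]:
        lines.extend(render(child, depth + 1))
    return lines


# The root is the only word whose head is 0.
tree_lines = render(children[0][0])
print("\n".join(tree_lines))
```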


To visualize the result, you can install the `spacy_stanza` package. It wraps the Stanza library so that you can use spaCy's display API to render the parse.

<b>Named Entity Recognition</b>
<br>
In Stanza, NER is performed by the NERProcessor and can be invoked by the name `ner`. The NER processor requires the tokenizer in the pipeline. At the time of writing, NER is supported for only 8 of the 66 languages.

In [13]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner', verbose = False)

doc = nlp(text)
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

entity: Barack Obama	type: PERSON
entity: Hawaii	type: GPE
entity: Donald Trump's	type: PERSON
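
Besides the entity spans in `doc.ents`, Stanza also exposes a per-token tag in each token's `ner` field, written in BIOES format: `B-`, `I-`, and `E-` mark the beginning, inside, and end of a multi-token entity, `S-` marks a single-token entity, and `O` marks tokens outside any entity. The sketch below decodes such tags back into spans. The tags are hand-written to match the entities printed above, not captured from a run, and the function name `decode_bioes` is made up for this illustration.

```python
def decode_bioes(tokens, tags):
    """Collect (entity_text, entity_type) spans from BIOES tags."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current, current_type = [], None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "S":
            # Single-token entity.
            entities.append((token, entity_type))
            current, current_type = [], None
        elif prefix == "B":
            # Start a new multi-token entity.
            current, current_type = [token], entity_type
        elif prefix in ("I", "E") and current and entity_type == current_type:
            current.append(token)
            if prefix == "E":
                entities.append((" ".join(current), entity_type))
                current, current_type = [], None
        else:
            # Malformed sequence: drop the partial span.
            current, current_type = [], None
    return entities


tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", ".",
          "His", "style", "is", "different", "from", "Donald", "Trump's"]
# Hand-written tags consistent with the entity output above.
tags = ["B-PERSON", "E-PERSON", "O", "O", "O", "S-GPE", "O",
        "O", "O", "O", "O", "O", "B-PERSON", "E-PERSON"]

entities = decode_bioes(tokens, tags)
print(*[f"entity: {text}\ttype: {etype}" for text, etype in entities], sep="\n")
```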
