# Stanza Tutorial

(C) 2023 by [Damir Cavar](http://damir.cavar.me/)

**Version:** 1.0, September 2023

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

This tutorial accompanies the L645 Advanced Natural Language Processing course taught in Fall 2023 at Indiana University. It assumes that you are using a recent distribution of [Python 3.x](https://python.org/) and [Stanza](https://stanfordnlp.github.io/stanza/) 1.5.1 or newer.

[Stanza](https://stanfordnlp.github.io/stanza/) must be installed in your [Python](https://python.org/) environment. Follow the instructions on the [Stanza](https://stanfordnlp.github.io/stanza/) installation page to set up a working environment for the following code. The code also requires an internet connection so that the specific language models can be downloaded and installed.

First, we import the [Stanza](https://stanfordnlp.github.io/stanza/) module and [spaCy's displaCy](https://spacy.io/usage/visualizers) for visualization:

In [1]:
import stanza
from spacy import displacy

The following code downloads the default English language models for [Stanza](https://stanfordnlp.github.io/stanza/):

In [2]:
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

2023-09-18 11:11:56 INFO: Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/default.zip:   0%|          | 0…

2023-09-18 11:12:26 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.


We can configure the [Stanza](https://stanfordnlp.github.io/stanza/) pipeline to contain all desired linguistic annotation modules. In this case we request the following processors:
- tokenizer (`tokenize`)
- multi-word token expander (`mwt`)
- part-of-speech tagger (`pos`)
- lemmatizer (`lemma`)
- dependency parser (`depparse`)
- constituency parser (`constituency`)

In [3]:
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,depparse,constituency', use_gpu=False, download_method="reuse_resources")

2023-09-18 11:12:27 INFO: Loading these models for language: en (English):
| Processor    | Package  |
---------------------------
| tokenize     | combined |
| pos          | combined |
| lemma        | combined |
| constituency | wsj      |
| depparse     | combined |

2023-09-18 11:12:27 INFO: Using device: cpu
2023-09-18 11:12:27 INFO: Loading: tokenize
2023-09-18 11:12:27 INFO: Loading: pos
2023-09-18 11:12:27 INFO: Loading: lemma
2023-09-18 11:12:28 INFO: Loading: constituency
2023-09-18 11:12:28 INFO: Loading: depparse
2023-09-18 11:12:28 INFO: Done loading processors!


## Processing Dependency Parse Trees

I wrote the following function to convert the [Stanza](https://stanfordnlp.github.io/stanza/) dependency tree data structure into a data structure compatible with [spaCy's displaCy](https://spacy.io/usage/visualizers), so that dependency trees can be drawn with [spaCy's](https://spacy.io/) excellent visualizer:

In [4]:
def get_stanza_dep_displacy_manual(doc):
    """Convert a Stanza document into displaCy's manual dependency format."""
    res = []
    for x in doc.sentences:
        words = []
        arcs = []
        for w in x.words:
            words.append({"text": w.text, "tag": w.upos})
            # The root word has no head inside the sentence, so it gets no arc.
            if w.deprel == "root":
                continue
            # Stanza word ids and heads are 1-based; the arcs index words from 0.
            start = w.head - 1
            end = w.id - 1
            # Store the smaller index as "start"; "dir" records which end the arrow points to.
            if start < end:
                arcs.append({"start": start, "end": end, "label": w.deprel, "dir": "right"})
            else:
                arcs.append({"start": end, "end": start, "label": w.deprel, "dir": "left"})
        res.append({"words": words, "arcs": arcs})
    return res
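
To make the shape of the converted data concrete, here is a minimal sketch that runs without any Stanza models. It is not part of the original notebook: `to_displacy_manual` is a hypothetical restatement of the conversion above, and the stand-in document is hand-annotated for illustration (the heads and relations are my assumptions, not Stanza output). The stand-in only mimics the attributes the conversion reads: `sentences`, `words`, `id`, `text`, `upos`, `head`, and `deprel`.

```python
from types import SimpleNamespace


def to_displacy_manual(doc):
    """Restatement of the conversion above so this sketch runs on its own."""
    res = []
    for sentence in doc.sentences:
        words, arcs = [], []
        for w in sentence.words:
            words.append({"text": w.text, "tag": w.upos})
            if w.deprel == "root":  # the root gets no incoming arc
                continue
            start, end = w.head - 1, w.id - 1  # 1-based ids to 0-based indices
            if start < end:
                arcs.append({"start": start, "end": end, "label": w.deprel, "dir": "right"})
            else:
                arcs.append({"start": end, "end": start, "label": w.deprel, "dir": "left"})
        res.append({"words": words, "arcs": arcs})
    return res


def word(i, text, upos, head, deprel):
    """Build a stand-in word with the attributes the conversion needs."""
    return SimpleNamespace(id=i, text=text, upos=upos, head=head, deprel=deprel)


# Hand-annotated stand-in (assumed analysis, for illustration only)
sentence = SimpleNamespace(words=[
    word(1, "John", "PROPN", 2, "nsubj"),
    word(2, "loves", "VERB", 0, "root"),
    word(3, "to", "PART", 4, "mark"),
    word(4, "read", "VERB", 2, "xcomp"),
    word(5, "books", "NOUN", 4, "obj"),
    word(6, ".", "PUNCT", 2, "punct"),
])
toy_doc = SimpleNamespace(sentences=[sentence])

result = to_displacy_manual(toy_doc)
for arc in result[0]["arcs"]:
    print(arc)
```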

As in [spaCy](https://spacy.io/), we obtain an annotation object with [Stanza](https://stanfordnlp.github.io/stanza/) by passing a sentence or text segment to the NLP pipeline that we specified above and assigned to the `nlp` variable:

In [5]:
doc = nlp("John loves to read books.")
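
Before visualizing the result, it can help to look at what the annotation object holds. The following is a minimal sketch, not part of the original notebook: `dependency_rows` is a hypothetical helper that lists each word together with the text of its head, and the two-word stand-in document is hand-written so that the block runs without Stanza models. It relies only on the `id`, `text`, `upos`, `head`, and `deprel` word attributes; calling `dependency_rows(doc)` on the document from the cell above should work the same way.

```python
from types import SimpleNamespace


def dependency_rows(doc):
    """Return (id, text, upos, head text, deprel) for every word in doc."""
    rows = []
    for sentence in doc.sentences:
        for w in sentence.words:
            # head is a 1-based index into sentence.words; 0 marks the root
            head_text = sentence.words[w.head - 1].text if w.head > 0 else "root"
            rows.append((w.id, w.text, w.upos, head_text, w.deprel))
    return rows


# Hand-written stand-in for a parsed document (assumed values, not Stanza output)
stand_in = SimpleNamespace(sentences=[SimpleNamespace(words=[
    SimpleNamespace(id=1, text="John", upos="PROPN", head=2, deprel="nsubj"),
    SimpleNamespace(id=2, text="reads", upos="VERB", head=0, deprel="root"),
])])

rows = dependency_rows(stand_in)
for row in rows:
    print(*row, sep="\t")
```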

We can now generate the [spaCy](https://spacy.io/)-compatible data format from the dependency tree to be able to visualize it:

In [6]:
res = get_stanza_dep_displacy_manual(doc)

We render the tree with [displaCy](https://spacy.io/usage/visualizers), passing `manual=True` because the input is a precomputed data structure rather than a spaCy `Doc`:

In [7]:
displacy.render(res, style="dep", manual=True, options={"compact":False, "distance":110})

**(C) 2023 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**