# Stanza Tutorial

(C) 2023-2025 by [Damir Cavar](http://damir.cavar.me/)

**Version:** 1.2, January 2025

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

**Prerequisites:**

In [None]:
!pip install -U stanza

To install [spaCy](https://spacy.io/), which this notebook uses only for its displaCy visualizer, follow the instructions on the [Install spaCy page](https://spacy.io/usage).

In [None]:
!pip install -U pip setuptools wheel

The following spaCy installation fits my environment, which uses a GPU with CUDA 12.x; adjust the extras to your own setup. See the [spaCy homepage](https://spacy.io/usage) for detailed installation instructions.

In [None]:
!pip install -U 'spacy[cuda12x,transformers,lookups,ja]'

## Introduction

This tutorial accompanies the [L645 Advanced Natural Language Processing](http://damir.cavar.me/l645/) course taught in Fall 2023 at Indiana University. It assumes that you are using a recent distribution of [Python 3.x](https://python.org/) and [Stanza](https://stanfordnlp.github.io/stanza/) 1.5.1 or newer.

This notebook assumes that you have set up [Stanza](https://stanfordnlp.github.io/stanza/) on your computer with your [Python](https://python.org/) distribution. Follow the instructions on the [Stanza](https://stanfordnlp.github.io/stanza/) installation page to set up a working environment for the following code. The code will also require that you are online and that the specific language models can be downloaded and installed.

We load the [Stanza](https://stanfordnlp.github.io/stanza/) module and [spaCy's displaCy](https://spacy.io/usage/visualizers) for visualization:

In [None]:
import stanza
from stanza.models.common.doc import Document
from stanza.pipeline.core import Pipeline
from spacy import displacy

The following code downloads the German language model for [Stanza](https://stanfordnlp.github.io/stanza/):

In [None]:
stanza.download('de')

We can configure the [Stanza](https://stanfordnlp.github.io/stanza/) pipeline to contain all desired linguistic annotation modules. In this case we use:
- tokenizer
- multi-word token (MWT) expander
- Part-of-Speech tagger
- lemmatizer
- named entity recognizer
- dependency parser
- constituency parser
- sentiment classifier

In [None]:
# Use the default German models for all processors (the "ncbi_disease" and "ontonotes" NER packages are not German models).
nlp = stanza.Pipeline('de', processors='tokenize,mwt,pos,lemma,ner,depparse,constituency,sentiment', use_gpu=False, download_method="reuse_resources")

In [None]:
doc = nlp("Gummibärchen habe ich grüne noch keine gegessen.")
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

In [None]:
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')
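
The `feats` value printed above is a single string in the Universal Dependencies notation, with `Name=Value` pairs separated by `|`. The following is a minimal sketch of how such a string could be turned into a dictionary for easier lookup. The helper name `parse_feats` and the example string are illustrative assumptions, not part of Stanza and not actual parser output.

```python
def parse_feats(feats):
    """Turn a UD feature string such as 'Case=Nom|Number=Plur' into a dict."""
    # Stanza stores None when a word has no features; CoNLL-U files use "_".
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hand-written example value (an assumption, not actual Stanza output).
example_feats = "Case=Acc|Gender=Neut|Number=Plur"
feature_map = parse_feats(example_feats)
print(feature_map["Case"], feature_map["Number"])
```

With a real document, the same helper could be applied to `word.feats` for each word in `sent.words`.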

In [None]:
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

In [None]:
for sentence in doc.sentences:
    print(sentence.constituency)

In [None]:
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

In [None]:
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')
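
The token-level `ner` values use a BIOES scheme (`B-`, `I-`, `E-` for the beginning, inside, and end of a multi-token entity, `S-` for a single-token entity, and `O` for tokens outside any entity). As a rough sketch of how these tags relate to the entity spans in `doc.ents`, the following groups a tag sequence back into spans. The function name `bioes_to_spans` and the token and tag lists are hand-written assumptions for illustration, not Stanza output.

```python
def bioes_to_spans(tokens, tags):
    """Group BIOES-tagged tokens into (text, type) entity spans."""
    spans = []
    current, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((token, label))
            current, current_type = [], None
        elif prefix == "B":
            # Start collecting a multi-token entity.
            current, current_type = [token], label
        elif prefix in ("I", "E") and current and label == current_type:
            current.append(token)
            if prefix == "E":
                spans.append((" ".join(current), label))
                current, current_type = [], None
        else:
            # "O" or an inconsistent tag sequence: drop any partial span.
            current, current_type = [], None
    return spans


# Hand-written illustration (an assumption, not actual Stanza output).
example_tokens = ["Angela", "Merkel", "besuchte", "Paris", "."]
example_tags = ["B-PER", "E-PER", "O", "S-LOC", "O"]
entity_spans = bioes_to_spans(example_tokens, example_tags)
print(entity_spans)
```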

In [None]:
for i, sentence in enumerate(doc.sentences):
    print("%d -> %d" % (i, sentence.sentiment))
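
The sentiment processor reports an integer per sentence: 0 for negative, 1 for neutral, and 2 for positive. A small sketch of mapping these values to readable labels is shown below. The names `SENTIMENT_LABELS` and `sentiment_label`, as well as the example scores, are illustrative assumptions.

```python
# Mapping of Stanza's sentence-level sentiment values to labels.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def sentiment_label(score):
    """Return a readable label for a sentiment value, or 'unknown'."""
    return SENTIMENT_LABELS.get(score, "unknown")


# Hand-written example scores (an assumption, not actual Stanza output).
example_scores = [2, 1, 0]
for i, score in enumerate(example_scores):
    print(f"{i} -> {score} ({sentiment_label(score)})")
```

With a real document, `sentence.sentiment` would take the place of the example scores.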

## Language ID

In [None]:
stanza.download(lang="multilingual")
stanza.download(lang="en")
# stanza.download(lang="fr")
stanza.download(lang="de")

In [None]:
# Use a separate variable so that the German pipeline in `nlp` is not overwritten.
langid_nlp = Pipeline(lang="multilingual", processors="langid")
texts = ["Hello world.", "Hallo, Welt!", "Ciao mondo!", "Hola mundo!"]
docs = [Document([], text=text) for text in texts]
langid_nlp(docs)
print("\n".join(f"{doc.text}\t{doc.lang}" for doc in docs))

## Processing Dependency Parse Trees

I wrote the following function to convert the [Stanza](https://stanfordnlp.github.io/stanza/) dependency tree data structure to a data structure compatible with [spaCy's displaCy](https://spacy.io/usage/visualizers), so that dependency trees can be rendered with [spaCy's](https://spacy.io/) excellent visualizer:

In [None]:
def get_stanza_dep_displacy_manual(doc):
    """Convert a Stanza document to displaCy's manual dependency format."""
    res = []
    for sentence in doc.sentences:
        words = []
        arcs = []
        for w in sentence.words:
            words.append({"text": w.text, "tag": w.upos})
            # The root word has head 0 and gets no incoming arc.
            if w.deprel == "root":
                continue
            head = w.head - 1      # 0-based position of the head
            dependent = w.id - 1   # 0-based position of the dependent
            # displaCy expects start < end; "dir" marks the end with the arrowhead.
            if head < dependent:
                arcs.append({"start": head, "end": dependent, "label": w.deprel, "dir": "right"})
            else:
                arcs.append({"start": dependent, "end": head, "label": w.deprel, "dir": "left"})
        res.append({"words": words, "arcs": arcs})
    return res
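
To make the target format concrete, the following sketch applies the same conversion logic to hand-written rows instead of a Stanza document, so the resulting dictionary can be inspected without loading any model. The helper name `rows_to_displacy` and the three-word example are assumptions made for illustration only.

```python
def rows_to_displacy(rows):
    """Build a displaCy-style dict from (id, text, upos, head, deprel) rows."""
    words, arcs = [], []
    for word_id, text, upos, head, deprel in rows:
        words.append({"text": text, "tag": upos})
        if head == 0:
            continue  # the root has no incoming arc
        start, end = head - 1, word_id - 1  # 0-based positions
        if start < end:
            arcs.append({"start": start, "end": end, "label": deprel, "dir": "right"})
        else:
            arcs.append({"start": end, "end": start, "label": deprel, "dir": "left"})
    return {"words": words, "arcs": arcs}


# Hand-written analysis (an assumption, not actual parser output).
example_rows = [
    (1, "She", "PRON", 2, "nsubj"),
    (2, "sleeps", "VERB", 0, "root"),
    (3, "well", "ADV", 2, "advmod"),
]
example_parse = rows_to_displacy(example_rows)
print(example_parse["arcs"])
```

Each arc stores the left position in `start` and the right position in `end`, while `dir` records on which side the dependent sits.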

As in [spaCy](https://spacy.io/), we generate an annotation object with [Stanza](https://stanfordnlp.github.io/stanza/) by submitting a sentence or text segment to the German NLP pipeline that we specified above and assigned to the `nlp` variable:

In [None]:
doc = nlp("Gummibärchen habe ich grüne noch keine gegessen.")

We can now generate the [spaCy](https://spacy.io/)-compatible data format from the dependency tree to be able to visualize it:

In [None]:
res = get_stanza_dep_displacy_manual(doc)

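The rendering can be achieved using the [displaCy](https://spacy.io/usage/visualizers) call below. Since `res` is a list of plain dictionaries rather than spaCy `Doc` objects, `manual=True` is required:

In [None]:
displacy.render(res, style="dep", manual=True, options={"compact": False, "distance": 110})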

## Data Format - CoNLL

In [None]:
from stanza.utils.conll import CoNLL

In [None]:
CoNLL.write_doc2conll(doc, "output.conllu")
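
The resulting file follows the CoNLL-U layout: comment lines start with `#`, sentences are separated by blank lines, and each word line has ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following is a minimal reading sketch using only the standard library. The function name `read_conllu` and the sample text are hand-written assumptions, not the contents of `output.conllu`.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of field dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample (an assumption, not actual Stanza output).
sample_conllu = "\n".join([
    "# text = She sleeps.",
    "1\tShe\tshe\tPRON\tPRP\tCase=Nom\t2\tnsubj\t_\t_",
    "2\tsleeps\tsleep\tVERB\tVBZ\tNumber=Sing\t0\troot\t_\t_",
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
    "",
])
conllu_sentences = read_conllu(sample_conllu)
print([(row["form"], row["head"], row["deprel"]) for row in conllu_sentences[0]])
```

Note that all fields stay strings in this sketch, and multi-word token lines (with ID ranges such as `1-2`) are not treated specially.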

## Visualization using NetworkX and PyPlot

In [None]:
stanza.download('en')

In [None]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse,constituency', use_gpu=True)

In [None]:
doc = nlp("I saw the man with the binoculars.")

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

In [None]:
G = nx.DiGraph()

In [None]:
for sentence in doc.sentences:
    # Add nodes for each word
    for word in sentence.words:
        G.add_node(word.id, label=word.text)

    # Add edges based on dependency relations
    for word in sentence.words:
        if word.head > 0:  # Not the root
            G.add_edge(word.head, word.id, label=word.deprel)
        else: # Handle the root node (e.g., connect to a virtual root or identify it as such)
            G.add_node(0, label="ROOT") # Add a virtual root node
            G.add_edge(0, word.id, label="root")


In [None]:
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, labels=nx.get_node_attributes(G, 'label'))
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, 'label'))
plt.show()
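
A spring layout does not preserve word order or tree depth, so a plain-text outline can be a useful complement to the plot. The sketch below builds an indented outline from head indices, with a virtual ROOT node at 0 as in the graph above. The function name `dependency_outline` and the rows are hand-written assumptions; the attachment of "with the binoculars" is ambiguous, and a parser may choose a different analysis.

```python
from collections import defaultdict


def dependency_outline(rows):
    """Return indented lines for a tree given (id, form, head, deprel) rows."""
    children = defaultdict(list)
    text = {0: "ROOT"}  # virtual root node
    for word_id, form, head, deprel in rows:
        children[head].append((word_id, deprel))
        text[word_id] = form

    lines = []

    def visit(node, relation, depth):
        label = text[node] if relation is None else f"{text[node]} [{relation}]"
        lines.append("  " * depth + label)
        for child, child_relation in children[node]:
            visit(child, child_relation, depth + 1)

    visit(0, None, 0)
    return lines


# Hand-written analysis (an assumption, not actual parser output).
example_heads = [
    (1, "I", 2, "nsubj"),
    (2, "saw", 0, "root"),
    (3, "the", 4, "det"),
    (4, "man", 2, "obj"),
    (5, "with", 7, "case"),
    (6, "the", 7, "det"),
    (7, "binoculars", 2, "obl"),
    (8, ".", 2, "punct"),
]
outline = dependency_outline(example_heads)
print("\n".join(outline))
```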

## Visualize the Constituent Parse Tree with NLTK

In [None]:
from nltk import Tree

In [None]:
constituent_tree_string = str(doc.sentences[0].constituency)

In [None]:
nltk_tree = Tree.fromstring(constituent_tree_string)

In [None]:
nltk_tree.draw()  # opens a separate Tk window; nltk_tree.pretty_print() prints a text rendering instead

**(C) 2023-2025 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**