# Testing Dependency Parsers on English

(C) 2024 by [Damir Cavar](http://damir.cavar.me/)

This notebook generates dependency parse trees for English ellipsis constructions. For more details, see the [THEC Russian Sub-corpus](https://github.com/dcavar/thec_rus). If you use this code, please cite:

Cavar, Damir and Mompelat, Ludovic and Abdo, Muhammad (2024) *The Typology of Ellipsis: A Corpus for Linguistic Analysis and Machine Learning Applications.* In Hahn, M. et al. (eds.) Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, St. Julian's, Malta, pages 46-54.

and

Cavar, Damir and V. Holthenrichs (2024) *NLP Corpus of Ellipsis: Modeling Ellipsis in Slavic.* Paper presented at the Formal Approaches to Slavic Linguistics (FASL) 33. Halifax, Canada.


In [None]:
!pip install stanza
!pip install spacy

In [2]:
from pprint import pprint
import stanza
from spacy import displacy
import spacy


The following cell downloads the default Stanza model package for English, the language we want to test:

In [3]:
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-05-10 20:18:08 INFO: Downloaded file to /home/damir/stanza_resources/resources.json
2024-05-10 20:18:08 INFO: Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/default.zip:   0%|          | 0…

2024-05-10 20:18:18 INFO: Downloaded file to /home/damir/stanza_resources/en/default.zip
2024-05-10 20:18:21 INFO: Finished downloading models and saved to /home/damir/stanza_resources


We will use a Stanza pipeline for English with a dependency parser and a constituency parser:

In [4]:
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse,mwt,constituency', use_gpu=True, download_method="reuse_resources")

2024-05-10 21:26:15 INFO: Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| mwt          | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |

2024-05-10 21:26:15 INFO: Using device: cuda
2024-05-10 21:26:15 INFO: Loading: tokenize
2024-05-10 21:26:16 INFO: Loading: mwt
2024-05-10 21:26:16 INFO: Loading: pos
2024-05-10 21:26:16 INFO: Loading: lemma
2024-05-10 21:26:16 INFO: Loading: constituency
2024-05-10 21:26:17 INFO: Loading: depparse
2024-05-10 21:26:17 INFO: Done loading processors!
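
To inspect what the pipeline produces, the dependency relations can be read off the words of each sentence. The sketch below assumes the Stanza word attributes `id`, `text`, `head` and `deprel`, the same ones used later in this notebook. The helper name `dependency_rows` is made up for illustration, and the stand-in document built with `SimpleNamespace` only mimics the shape of a Stanza document so that the example runs without loading a model. Its head indices and labels are written by hand and are not parser output.

```python
from types import SimpleNamespace


def dependency_rows(doc):
    """Return (sentence_no, id, text, head_text, deprel) for every word."""
    rows = []
    for s_no, sent in enumerate(doc.sentences, start=1):
        for w in sent.words:
            # head is a 1-based word index; 0 marks the root of the sentence
            head_text = sent.words[w.head - 1].text if w.head > 0 else "ROOT"
            rows.append((s_no, w.id, w.text, head_text, w.deprel))
    return rows


def _word(i, text, head, deprel):
    # Hypothetical stand-in for a Stanza word object
    return SimpleNamespace(id=i, text=text, head=head, deprel=deprel)


# Hand-written toy document (assumption, not parser output)
toy_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[
    _word(1, "Many", 2, "nsubj"),
    _word(2, "do", 0, "root"),
    _word(3, "n't", 2, "advmod"),
])])

for row in dependency_rows(toy_doc):
    print(row)
```

With the real pipeline, `dependency_rows(nlp("..."))` should yield rows of the same shape.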


The following function converts Stanza's dependency analysis into the data structure that spaCy's displaCy visualizer expects for dependency trees in manual mode:

In [5]:
def get_stanza_dep_displacy_manual(doc):
    """Convert a Stanza document into a list of displaCy manual-mode parses, one per sentence."""
    res = []
    for sent in doc.sentences:
        words = []
        arcs = []
        for w in sent.words:
            words.append({"text": w.text, "tag": w.upos})
            # The root word has no incoming arc to draw.
            if w.deprel == "root":
                continue
            # Stanza word indices are 1-based; displaCy offsets are 0-based.
            head = w.head - 1
            dep = w.id - 1
            # Arcs always run from the left word to the right word;
            # "dir" records on which side the dependent sits.
            if head < dep:
                arcs.append({"start": head, "end": dep, "label": w.deprel, "dir": "right"})
            else:
                arcs.append({"start": dep, "end": head, "label": w.deprel, "dir": "left"})
        res.append({"words": words, "arcs": arcs})
    return res
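
The conversion always emits arcs with `start` smaller than `end`, encodes the direction in `dir`, and skips the root word, so each sentence ends up with one arc fewer than it has words. A small sanity check along these lines can catch indexing mistakes before rendering. The helper name `check_displacy_parse` and the toy input are hypothetical and written by hand; this is a sketch, not part of Stanza or spaCy.

```python
def check_displacy_parse(parse):
    """Return a list of problems found in one manual-mode parse dict."""
    problems = []
    n = len(parse["words"])
    for i, arc in enumerate(parse["arcs"]):
        # Offsets must point at existing words, with start left of end
        if not (0 <= arc["start"] < arc["end"] < n):
            problems.append(f"arc {i}: bad span {arc['start']}-{arc['end']}")
        if arc["dir"] not in ("left", "right"):
            problems.append(f"arc {i}: bad dir {arc['dir']!r}")
    # The root has no incoming arc, so n words should give n - 1 arcs
    if n and len(parse["arcs"]) != n - 1:
        problems.append(f"expected {n - 1} arcs, found {len(parse['arcs'])}")
    return problems


# Hand-written toy parses (assumption, not parser output)
good = {
    "words": [{"text": "Many", "tag": "ADJ"},
              {"text": "do", "tag": "AUX"},
              {"text": "n't", "tag": "PART"}],
    "arcs": [{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
             {"start": 1, "end": 2, "label": "advmod", "dir": "right"}],
}
bad = {
    "words": good["words"],
    "arcs": [{"start": 2, "end": 1, "label": "advmod", "dir": "right"}],
}

print(check_displacy_parse(good))
print(check_displacy_parse(bad))
```

Applied to the output of `get_stanza_dep_displacy_manual`, the check would be run once per sentence, for example `[check_displacy_parse(p) for p in res]`.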

The data structure in the following cell (variable *test_dep*) shows what spaCy's displaCy module expects for the visualization of a dependency tree:
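
A minimal sketch of such a structure, assuming displaCy's manual input format with a `words` list and an `arcs` list. The sentence, tags and labels are written by hand for illustration and are not parser output.

```python
# Hand-written example in displaCy's manual input format (assumption:
# the values are illustrative, not produced by a parser).
test_dep = {
    "words": [
        {"text": "This", "tag": "DT"},
        {"text": "is", "tag": "VBZ"},
        {"text": "a", "tag": "DT"},
        {"text": "sentence", "tag": "NN"},
    ],
    "arcs": [
        # start/end are 0-based word offsets, with start left of end;
        # dir tells on which side the dependent is
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}

print(len(test_dep["words"]), "words,", len(test_dep["arcs"]), "arcs")
```

With spaCy installed, a structure like this can be passed to `displacy.render(test_dep, style="dep", manual=True)`, the same call used for the Stanza output in the cells that follow.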

Here is another, more complex data structure:
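
As a hedged illustration, the sketch below encodes one plausible Universal Dependencies style analysis of an elliptical sentence, with the auxiliary *do* standing in as head of the second conjunct. It is annotated by hand; the actual parser output may differ. The variable name `test_dep_complex` is hypothetical, and the optional `title` key is assumed to be supported by displaCy's manual mode.

```python
# Hand-annotated sketch (assumption, not parser output)
test_dep_complex = {
    "title": "Ellipsis example (hand-annotated)",
    "words": [
        {"text": "Some", "tag": "DET"},
        {"text": "people", "tag": "NOUN"},
        {"text": "like", "tag": "VERB"},
        {"text": "broccoli", "tag": "NOUN"},
        {"text": ",", "tag": "PUNCT"},
        {"text": "but", "tag": "CCONJ"},
        {"text": "many", "tag": "ADJ"},
        {"text": "do", "tag": "AUX"},
        {"text": "n't", "tag": "PART"},
        {"text": ".", "tag": "PUNCT"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "det", "dir": "left"},
        {"start": 1, "end": 2, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "obj", "dir": "right"},
        {"start": 4, "end": 7, "label": "punct", "dir": "left"},
        {"start": 5, "end": 7, "label": "cc", "dir": "left"},
        {"start": 6, "end": 7, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 7, "label": "conj", "dir": "right"},
        {"start": 7, "end": 8, "label": "advmod", "dir": "right"},
        {"start": 2, "end": 9, "label": "punct", "dir": "right"},
    ],
}

# Every word except the root ("like") has exactly one incoming arc
print(len(test_dep_complex["words"]), "words,",
      len(test_dep_complex["arcs"]), "arcs")
```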

The following elliptical example sentence is parsed and converted into the displaCy data structure:

In [6]:
doc = nlp("Some people like broccoli, but many don't.")
res = get_stanza_dep_displacy_manual(doc)

In [7]:
html = displacy.render(res, style="dep", manual=True, options={"compact":False, "distance":110})

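The second example spells out the material that is elided in the first sentence, so that the two parses can be compared:

In [8]: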
doc = nlp("Some people like broccoli, but many people don't like broccoli.")
res = get_stanza_dep_displacy_manual(doc)

In [9]:
html = displacy.render(res, style="dep", manual=True, options={"compact":False, "distance":110})
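
Beyond visual inspection, the two analyses could also be compared programmatically. The sketch below reduces a displaCy-style parse to a set of `(head, label, dependent)` triples and prints the edges found in only one of two parses. The helper name `labeled_edges` and the two toy parses are hypothetical and written by hand, not parser output. Triples are keyed on word text, so repeated words in a sentence collapse into one entry.

```python
def labeled_edges(parse):
    """Return the set of (head_text, label, dependent_text) triples."""
    words = parse["words"]
    edges = set()
    for arc in parse["arcs"]:
        # dir == "left": dependent at start, head at end; otherwise reversed
        if arc["dir"] == "left":
            dep, head = arc["start"], arc["end"]
        else:
            dep, head = arc["end"], arc["start"]
        edges.add((words[head]["text"], arc["label"], words[dep]["text"]))
    return edges


# Hand-written toy parses (assumption, not parser output)
elliptical = {
    "words": [{"text": "many"}, {"text": "do"}, {"text": "n't"}],
    "arcs": [{"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
             {"start": 1, "end": 2, "label": "advmod", "dir": "right"}],
}
full = {
    "words": [{"text": "many"}, {"text": "people"}, {"text": "do"},
              {"text": "n't"}, {"text": "like"}, {"text": "broccoli"}],
    "arcs": [{"start": 0, "end": 1, "label": "amod", "dir": "left"},
             {"start": 1, "end": 4, "label": "nsubj", "dir": "left"},
             {"start": 2, "end": 4, "label": "aux", "dir": "left"},
             {"start": 3, "end": 4, "label": "advmod", "dir": "left"},
             {"start": 4, "end": 5, "label": "obj", "dir": "right"}],
}

only_elliptical = labeled_edges(elliptical) - labeled_edges(full)
only_full = labeled_edges(full) - labeled_edges(elliptical)
print(sorted(only_elliptical))
print(sorted(only_full))
```

To apply this to the sentences above, the two results would need to be kept in separate variables (for example `res_ellipsis` and `res_full`, names chosen here for illustration), since `res` is overwritten by the second example.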
