# Parsing with Stanza Demo

### The Stanza NLP Library

This notebook gives a demo of the Stanza NLP Python package from Stanford: https://stanfordnlp.github.io/stanza/

This demo uses these Stanza data types:
* [Pipeline](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline)
* [Document](https://stanfordnlp.github.io/stanza/data_objects.html#document)
* [Sentence](https://stanfordnlp.github.io/stanza/data_objects.html#sentence)
* [ParseTree](https://stanfordnlp.github.io/stanza/data_objects.html#parsetree)
* [Word](https://stanfordnlp.github.io/stanza/data_objects.html#word)


### Data
The data comes from the OneStopEnglish corpus, which provides the same texts at three reading levels (Elementary, Intermediate, Advanced): https://github.com/nishkalavallabhi/OneStopEnglishCorpus

In [1]:
import pandas as pd
import numpy as np

import stanza

In [2]:
!pip freeze | grep stanza

stanza==1.6.1


In [3]:
df_data = pd.read_csv("OneStopEnglishCorpus/Texts-Together-OneCSVperFile/Thatcher.csv",
                      delimiter=',',encoding='cp1252')

print(len(df_data))

17


In [4]:
df_data.head()

Unnamed: 0,Elementary,Intermediate,Advanced
0,"Margaret Thatcher, the most famous British\npr...","Margaret Thatcher, the best known British prim...","Margaret Thatcher, the most dominant British\n..."
1,"The British prime minister, David Cameron,\nsa...","The British prime minister, David Cameron, sai...","The British prime minister, David Cameron,\nwh..."
2,"President Barack Obama said, “Here in\nAmerica...","In a statement, President Barack Obama said,\n...","In a statement, President Barack Obama said,\n..."
3,Margaret Thatcher was the first woman\nleader ...,The first woman elected to lead a major wester...,The first woman elected to lead a major wester...
4,"When they heard of her death, politicians\nfro...","Thatcher, who was 87, had been in poor health\...","Thatcher, who was 87, had been in declining\nh..."


In [5]:
# some rows have no Elementary or Intermediate version (shown as empty cells):
df_data.tail()

Unnamed: 0,Elementary,Intermediate,Advanced
12,,As the British economy recovered from the very...,Boosted by the newly arrived revenues from\nBr...
13,,"After she retired, she wrote highly successful...",But she also deployed her notorious\n“handbagg...
14,,,Her allies in the tabloid press egged her on.\...
15,,,"But untrammelled power, with the defeat or\nre..."
16,,,"In retirement, she wrote highly successful\nme..."
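
Because the shorter versions run out of rows before the Advanced column does, any loop over a column will meet missing values (pandas represents them as `NaN`, which is a float). A minimal sketch of skipping them before parsing, in plain Python; the helper name `present_texts` and the sample values are illustrative and not part of the notebook:

```python
def present_texts(values):
    """Keep only real, non-empty strings; drop None and float NaN placeholders."""
    return [v for v in values if isinstance(v, str) and v.strip()]

# Stand-in for a column such as df_data.Elementary.tolist() (hypothetical sample values).
column = ["First text", float("nan"), None, "   ", "Second text"]
kept = present_texts(column)
print(f"{len(kept)} of {len(column)} entries can be parsed")
```

Checking for `str` rather than calling a NaN test keeps the helper independent of pandas and also filters whitespace-only cells.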


In [6]:
# One text in this dataset can be composed of multiple sentences:

text1 = df_data.Elementary.iloc[1]
print(text1)

The British prime minister, David Cameron,
said: “I was very sad when l heard of Lady
Thatcher’s death. We’ve lost a great leader,
a great prime minister and a great Briton.”
He added: “She was our first woman prime
minister – and she didn’t just lead our
country, she saved our country.” He added
that he believed she would be remembered
as the greatest British peacetime
prime minister.


In [7]:
text2 = df_data.Advanced.iloc[1]
print(text2)

The British prime minister, David Cameron,
who is cutting short his trip to Europe to return
to London following the news, said: “It was with
great sadness that l learned of Lady Thatcher’s
death. We’ve lost a great leader, a great prime
minister and a great Briton.” He added: “As our
first woman prime minister, Margaret Thatcher
succeeded against all the odds, and the real
thing about Margaret Thatcher is that she didn’t
just lead our country, she saved our country, and
I believe she will go down as the greatest British
peacetime prime minister.”
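
Both texts keep the hard line breaks of the corpus files. The notebook passes them to Stanza unchanged; if single-line text is preferred, one possible preprocessing step is sketched below. The helper name `collapse_whitespace` is an assumption for illustration. Note that character offsets such as `start_char` would then refer to the flattened string, not the original.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (including line breaks) with a single space."""
    return " ".join(text.split())

# First lines of the Elementary text printed above.
sample = (
    "The British prime minister, David Cameron,\n"
    "said: “I was very sad when l heard of Lady\n"
    "Thatcher’s death."
)
flat = collapse_whitespace(sample)
print(flat)
```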


## Applying the Stanza Constituency Parsing Model

In [8]:
# The constituency processor requires tokenization and POS tagging to run first
nlp_pipeline = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')

# stanza.pipeline.core.Pipeline
print(type(nlp_pipeline))

2023-11-06 22:49:28 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2023-11-06 22:49:29 INFO: Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| pos          | combined_charlm     |
| constituency | ptb3-revised_charlm |

2023-11-06 22:49:29 INFO: Using device: cpu
2023-11-06 22:49:29 INFO: Loading: tokenize
2023-11-06 22:49:31 INFO: Loading: pos
2023-11-06 22:49:31 INFO: Loading: constituency
2023-11-06 22:49:32 INFO: Done loading processors!


<class 'stanza.pipeline.core.Pipeline'>


In [9]:
doc = nlp_pipeline(text1)
print(type(doc))

<class 'stanza.models.common.doc.Document'>


In [10]:
sentences = doc.sentences
print(f"# sentences in Elementary text: {len(sentences)}\n")

print("First 5 words of first sentences:")
print(sentences[0].words[:5])

print("\nFirst 5 words of 2nd sentences:")
print(sentences[1].words[:5])

# sentences in Elementary text: 4

First 5 words of first sentences:
[{
  "id": 1,
  "text": "The",
  "upos": "DET",
  "xpos": "DT",
  "feats": "Definite=Def|PronType=Art",
  "start_char": 0,
  "end_char": 3
}, {
  "id": 2,
  "text": "British",
  "upos": "ADJ",
  "xpos": "JJ",
  "feats": "Degree=Pos",
  "start_char": 4,
  "end_char": 11
}, {
  "id": 3,
  "text": "prime",
  "upos": "ADJ",
  "xpos": "JJ",
  "feats": "Degree=Pos",
  "start_char": 12,
  "end_char": 17
}, {
  "id": 4,
  "text": "minister",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "start_char": 18,
  "end_char": 26
}, {
  "id": 5,
  "text": ",",
  "upos": "PUNCT",
  "xpos": ",",
  "start_char": 26,
  "end_char": 27
}]

First 5 words of 2nd sentences:
[{
  "id": 1,
  "text": "We",
  "upos": "PRON",
  "xpos": "PRP",
  "feats": "Case=Nom|Number=Plur|Person=1|PronType=Prs",
  "start_char": 104,
  "end_char": 106
}, {
  "id": 2,
  "text": "’ve",
  "upos": "AUX",
  "xpos": "VBP",
  "feats": "Mood=Ind|Number=Pl

In [11]:
# words in first sentence:
print([w.text for w in sentences[0].words])

['The', 'British', 'prime', 'minister', ',', 'David', 'Cameron', ',', 'said', ':', '“', 'I', 'was', 'very', 'sad', 'when', 'l', 'heard', 'of', 'Lady', 'Thatcher', '’s', 'death', '.']


In [12]:
# Viewing parse tree of first sentence in text/document:
parse = sentences[0].constituency
print(type(parse))

<class 'stanza.models.constituency.parse_tree.Tree'>


In [13]:
print(parse)

(ROOT (S (NP (NP (DT The) (JJ British) (JJ prime) (NN minister)) (, ,) (NP (NNP David) (NNP Cameron)) (, ,)) (VP (VBD said) (: :) (`` “) (S (NP (PRP I)) (VP (VBD was) (ADJP (RB very) (JJ sad)) (SBAR (WHADVP (WRB when)) (S (NP (NNP l)) (VP (VBD heard) (PP (IN of) (NP (NNP Lady) (NNP Thatcher))) (NP (POS ’s) (NN death)))))))) (. .)))


In [14]:
print(parse.pretty_print())

(ROOT
  (S
    (NP
      (NP (DT The) (JJ British) (JJ prime) (NN minister))
      (, ,)
      (NP (NNP David) (NNP Cameron))
      (, ,))
    (VP
      (VBD said)
      (: :)
      (`` “)
      (S
        (NP (PRP I))
        (VP
          (VBD was)
          (ADJP (RB very) (JJ sad))
          (SBAR
            (WHADVP (WRB when))
            (S
              (NP (NNP l))
              (VP
                (VBD heard)
                (PP
                  (IN of)
                  (NP (NNP Lady) (NNP Thatcher)))
                (NP (POS ’s) (NN death))))))))
    (. .)))
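
The bracketed notation is easy to read back into a nested structure without Stanza. The sketch below parses the subject noun phrase from the tree above into `(label, children)` tuples and recovers its words. The function names `parse_brackets` and `leaves` are illustrative, and the parser assumes that no token contains a raw parenthesis, which holds for this tree.

```python
import re

def parse_brackets(s):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Words (leaves) are kept as plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return node()

def leaves(tree):
    """Collect the words of a parsed tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words

# Subject NP copied from the parse above.
subject = "(NP (NP (DT The) (JJ British) (JJ prime) (NN minister)) (, ,) (NP (NNP David) (NNP Cameron)) (, ,))"
tree = parse_brackets(subject)
print(tree[0], leaves(tree))
```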



In [15]:
# depth of each of 4 sentences in this document:
for idx,sent in enumerate(sentences):
    parse = sent.constituency
    print(f"Depth of parse {idx}: ",parse.depth())
    

Depth of parse 0:  11
Depth of parse 1:  7
Depth of parse 2:  11
Depth of parse 3:  14
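
These depths can be cross-checked from the bracketed string alone, assuming `depth()` counts the labeled nodes on the longest path from the root down to a word. Under that assumption the depth equals the maximum nesting level of the parentheses. The sketch below applies this to the first sentence's parse, copied from the output above; `bracket_depth` is an illustrative helper, not a Stanza function.

```python
def bracket_depth(bracketed):
    """Return the maximum nesting level of parentheses in a bracketed tree string."""
    depth = deepest = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest

# Parse of the first Elementary sentence, split across lines for readability.
first_sentence = (
    "(ROOT (S (NP (NP (DT The) (JJ British) (JJ prime) (NN minister)) (, ,) "
    "(NP (NNP David) (NNP Cameron)) (, ,)) (VP (VBD said) (: :) (`` “) "
    "(S (NP (PRP I)) (VP (VBD was) (ADJP (RB very) (JJ sad)) "
    "(SBAR (WHADVP (WRB when)) (S (NP (NNP l)) (VP (VBD heard) "
    "(PP (IN of) (NP (NNP Lady) (NNP Thatcher))) (NP (POS ’s) (NN death)))))))) (. .)))"
)
print(bracket_depth(first_sentence))
```

The deepest path runs through `ROOT, S, VP, S, VP, SBAR, S, VP, PP, NP, NNP`, which is eleven labeled nodes and agrees with the value reported for parse 0.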


In [16]:
# Apply the same code to a different text: the "Advanced" version of the same passage

doc = nlp_pipeline(text2)
sentences_advanced = doc.sentences

print(f"# sentences in Advanced text: {len(sentences_advanced)}\n")

for idx,sent in enumerate(sentences_advanced):
    parse = sent.constituency
    print(f"Depth of parse {idx}: ",parse.depth())

# sentences in Advanced text: 3

Depth of parse 0:  13
Depth of parse 1:  7
Depth of parse 2:  15
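
To compare the two versions, the depths printed above can be summarized per reading level. This is a small worked calculation on the seven numbers from this one passage, so it illustrates the comparison and is not evidence of a general trend.

```python
from statistics import mean

# Parse depths copied from the two outputs above.
depths = {
    "Elementary": [11, 7, 11, 14],
    "Advanced": [13, 7, 15],
}

# (number of sentences, mean depth, maximum depth) per reading level
summary = {level: (len(d), mean(d), max(d)) for level, d in depths.items()}

for level, (n, avg, deepest) in summary.items():
    print(f"{level}: {n} sentences, mean depth {avg:.2f}, max depth {deepest}")
```

For this passage the Advanced version has fewer sentences and a slightly higher mean and maximum depth (mean 11.67 versus 10.75, maximum 15 versus 14).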


In [17]:
# words in first sentence:
print([w.text for w in sentences_advanced[0].words])

['The', 'British', 'prime', 'minister', ',', 'David', 'Cameron', ',', 'who', 'is', 'cutting', 'short', 'his', 'trip', 'to', 'Europe', 'to', 'return', 'to', 'London', 'following', 'the', 'news', ',', 'said', ':', '“', 'It', 'was', 'with', 'great', 'sadness', 'that', 'l', 'learned', 'of', 'Lady', 'Thatcher', '’s', 'death', '.']


In [18]:
parse = sentences_advanced[0].constituency
print(parse.pretty_print())

(ROOT
  (S
    (NP
      (NP
        (NP (DT The) (JJ British) (JJ prime) (NN minister))
        (, ,)
        (NP (NNP David) (NNP Cameron)))
      (, ,)
      (SBAR
        (WHNP (WP who))
        (S
          (VP
            (VBZ is)
            (VP
              (VBG cutting)
              (ADVP (JJ short))
              (NP
                (NP (PRP$ his) (NN trip))
                (PP
                  (IN to)
                  (NP (NNP Europe))))
              (S
                (VP
                  (TO to)
                  (VP
                    (VB return)
                    (PP
                      (IN to)
                      (NP (NNP London)))
                    (PP
                      (VBG following)
                      (NP (DT the) (NN news))))))))))
      (, ,))
    (VP
      (VBD said)
      (: :)
      (`` “)
      (S
        (NP (PRP It))
        (VP
          (VBD was)
          (PP
            (IN with)
            (NP (JJ great) (NN sadness)))
          (

## Dependency Parsing with Stanza

Based on:
https://stanfordnlp.github.io/stanza/depparse.html   
* Note: the example on that page uses French (`fr`); this notebook uses English (`en`).


In [19]:
# Import the Tree class itself instead of aliasing the whole module as `Tree`
from stanza.models.constituency.parse_tree import Tree

In [20]:
# Create a new pipeline that includes the Stanza dependency parser:

nlp_dep_pipeline = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse')

print(type(nlp_dep_pipeline))

2023-11-06 22:49:41 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json:   0%|   …

2023-11-06 22:49:42 INFO: Loading these models for language: en (English):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |

2023-11-06 22:49:42 INFO: Using device: cpu
2023-11-06 22:49:42 INFO: Loading: tokenize
2023-11-06 22:49:42 INFO: Loading: pos
2023-11-06 22:49:43 INFO: Loading: lemma
2023-11-06 22:49:43 INFO: Loading: depparse
2023-11-06 22:49:44 INFO: Done loading processors!


<class 'stanza.pipeline.core.Pipeline'>


In [21]:
doc2 = nlp_dep_pipeline(text1)
sentences_depparse = doc2.sentences

In [22]:
print(sentences_depparse[0].words[:5])

[{
  "id": 1,
  "text": "The",
  "lemma": "the",
  "upos": "DET",
  "xpos": "DT",
  "feats": "Definite=Def|PronType=Art",
  "head": 4,
  "deprel": "det",
  "start_char": 0,
  "end_char": 3
}, {
  "id": 2,
  "text": "British",
  "lemma": "British",
  "upos": "ADJ",
  "xpos": "JJ",
  "feats": "Degree=Pos",
  "head": 4,
  "deprel": "amod",
  "start_char": 4,
  "end_char": 11
}, {
  "id": 3,
  "text": "prime",
  "lemma": "prime",
  "upos": "ADJ",
  "xpos": "JJ",
  "feats": "Degree=Pos",
  "head": 4,
  "deprel": "amod",
  "start_char": 12,
  "end_char": 17
}, {
  "id": 4,
  "text": "minister",
  "lemma": "minister",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "head": 9,
  "deprel": "nsubj",
  "start_char": 18,
  "end_char": 26
}, {
  "id": 5,
  "text": ",",
  "lemma": ",",
  "upos": "PUNCT",
  "xpos": ",",
  "head": 6,
  "deprel": "punct",
  "start_char": 26,
  "end_char": 27
}]


In [23]:
sentences_depparse[0].print_dependencies()

('The', 4, 'det')
('British', 4, 'amod')
('prime', 4, 'amod')
('minister', 9, 'nsubj')
(',', 6, 'punct')
('David', 4, 'appos')
('Cameron', 6, 'flat')
(',', 4, 'punct')
('said', 0, 'root')
(':', 9, 'punct')
('“', 15, 'punct')
('I', 15, 'nsubj')
('was', 15, 'cop')
('very', 15, 'advmod')
('sad', 9, 'ccomp')
('when', 18, 'advmod')
('l', 18, 'nsubj')
('heard', 15, 'advcl')
('of', 23, 'case')
('Lady', 23, 'nmod:poss')
('Thatcher', 20, 'flat')
('’s', 20, 'case')
('death', 18, 'obl')
('.', 9, 'punct')
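
Each triple is `(word, head id, relation)`, where head ids are 1-based positions in the sentence and `0` marks the root. A minimal sketch of turning the triples above into a head-to-dependents map, using only the standard library; the variable names are illustrative.

```python
from collections import defaultdict

# Triples copied from the print_dependencies() output above.
deps = [
    ("The", 4, "det"), ("British", 4, "amod"), ("prime", 4, "amod"),
    ("minister", 9, "nsubj"), (",", 6, "punct"), ("David", 4, "appos"),
    ("Cameron", 6, "flat"), (",", 4, "punct"), ("said", 0, "root"),
    (":", 9, "punct"), ("“", 15, "punct"), ("I", 15, "nsubj"),
    ("was", 15, "cop"), ("very", 15, "advmod"), ("sad", 9, "ccomp"),
    ("when", 18, "advmod"), ("l", 18, "nsubj"), ("heard", 15, "advcl"),
    ("of", 23, "case"), ("Lady", 23, "nmod:poss"), ("Thatcher", 20, "flat"),
    ("’s", 20, "case"), ("death", 18, "obl"), (".", 9, "punct"),
]

children = defaultdict(list)   # head id -> list of (id, word, relation)
root_id = None
for idx, (text, head, rel) in enumerate(deps, start=1):
    if head == 0:
        root_id = idx
    else:
        children[head].append((idx, text, rel))

root_word = deps[root_id - 1][0]
print("root:", root_word)
for idx, text, rel in children[root_id]:
    print(f"  {rel}: {text} (id {idx})")
```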


In [24]:
# Another view of the dependencies: each word paired with the text of its head.
# A similar dump is available with:
#   parse = sentences_depparse[0]
#   print(parse.dependencies_string())

for word in sentences_depparse[0].words:
    head_id = word.head
    head_word_text = sentences_depparse[0].words[head_id-1].text if head_id>0 else " ****ROOT****"
    print(word.id, word.text,"-->",head_id,head_word_text)

1 The --> 4 minister
2 British --> 4 minister
3 prime --> 4 minister
4 minister --> 9 said
5 , --> 6 David
6 David --> 4 minister
7 Cameron --> 6 David
8 , --> 4 minister
9 said --> 0  ****ROOT****
10 : --> 9 said
11 “ --> 15 sad
12 I --> 15 sad
13 was --> 15 sad
14 very --> 15 sad
15 sad --> 9 said
16 when --> 18 heard
17 l --> 18 heard
18 heard --> 15 sad
19 of --> 23 death
20 Lady --> 23 death
21 Thatcher --> 20 Lady
22 ’s --> 20 Lady
23 death --> 18 heard
24 . --> 9 said
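
Following the head links repeatedly leads from any word up to the root. The sketch below does this with the head ids listed above; `path_to_root` is an illustrative helper and the lists are copied by hand from the output.

```python
# Words and head ids of the first sentence, copied from the output above (ids are 1-based).
words = ["The", "British", "prime", "minister", ",", "David", "Cameron", ",",
         "said", ":", "“", "I", "was", "very", "sad", "when", "l", "heard",
         "of", "Lady", "Thatcher", "’s", "death", "."]
heads = [4, 4, 4, 9, 6, 4, 6, 4, 0, 9, 15, 15, 15, 15, 9, 18, 18, 15,
         23, 23, 20, 20, 18, 9]

def path_to_root(word_id):
    """Return the words visited when walking from word_id up to the root."""
    path = []
    while word_id != 0:
        path.append(words[word_id - 1])
        word_id = heads[word_id - 1]
    return path

print(" -> ".join(path_to_root(21)))

# Number of head links between the deepest word and the root.
max_edges = max(len(path_to_root(i)) - 1 for i in range(1, len(words) + 1))
print("longest path to root:", max_edges, "links")
```

The longest chain counts links in the dependency tree, which is a different measure from the constituency `depth()` used earlier.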


In [25]:
# A head ID of 0 marks the root of the sentence, so filter for it

print("Showing head word of sentence:")
[word for word in sentences_depparse[0].words if word.head==0]

Showing head word of sentence:


[{
   "id": 9,
   "text": "said",
   "lemma": "say",
   "upos": "VERB",
   "xpos": "VBD",
   "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
   "head": 0,
   "deprel": "root",
   "start_char": 43,
   "end_char": 47
 }]