# Welcome to Stanza!

![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)
![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)

Stanza is a Python NLP toolkit that supports 60+ human languages. It is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data, and offers pretrained models on 100 treebanks. Additionally, Stanza provides a stable, officially maintained Python interface to the Java Stanford CoreNLP toolkit.

In this tutorial, we will demonstrate how to set up Stanza and annotate text with its native neural network NLP models. For the use of the Python CoreNLP interface, please see other tutorials.

## 1. Installing Stanza

Note that Stanza only supports Python 3.6 and above. Installing and importing Stanza are as simple as running the following commands:

In [4]:
# Install; note that the prefix "!" is not needed if you are running in a terminal
#!pip install stanza

# Import the package
import stanza

#https://stanfordnlp.github.io/stanza/client_regex.html
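
Since Stanza requires Python 3.6 or above, a script can check the interpreter version up front and fail with a readable message. The following is a minimal sketch; the helper name `python_is_supported` is hypothetical and not part of Stanza.

```python
import sys

# Minimum version taken from the note above (Python 3.6).
MIN_PYTHON = (3, 6)

def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum version (hypothetical helper)."""
    return tuple(version_info[:2]) >= minimum

# Fail early with a clear message instead of a confusing error later on.
if not python_is_supported():
    raise RuntimeError("Python {}.{} or newer is required".format(*MIN_PYTHON))
```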

### More Information

For common troubleshooting, please visit our [troubleshooting page](https://stanfordnlp.github.io/stanfordnlp/installation_usage.html#troubleshooting).

## 2. Downloading Models

You can download models with the `stanza.download` command. The language can be specified with either a full language name (e.g., "english"), or a short code (e.g., "en"). 

By default, models will be saved to your `~/stanza_resources` directory. If you want to specify your own path to save the model files, you can pass a `dir=your_path` argument.
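
To make the choice of directory explicit, the sketch below resolves either the default location or a custom one to an absolute path. It only uses the standard library; `resolve_model_dir` is a hypothetical helper, not a Stanza function.

```python
import os

# Default location named in the text above.
DEFAULT_DIR = os.path.join("~", "stanza_resources")

def resolve_model_dir(custom_dir=None):
    """Return an absolute model directory (hypothetical helper, not part of Stanza)."""
    chosen = custom_dir if custom_dir else DEFAULT_DIR
    # Expand "~" to the home directory and normalise relative paths.
    return os.path.abspath(os.path.expanduser(chosen))

print(resolve_model_dir())               # the default directory under your home folder
print(resolve_model_dir("./my_models"))  # a custom directory, made absolute
```

The resulting path could then be passed as the `dir` argument described above. The keyword name may differ between Stanza releases, so it is worth checking the signature of the installed version.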


In [8]:
# Download an English model into the default directory
print("Downloading English model...")
stanza.download('en')

# Similarly, download a (simplified) Chinese model
# Note that you can use verbose=False to turn off all printed messages
#print("Downloading Chinese model...")
#stanza.download('zh', verbose=False)

Downloading English model...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 13.4MB/s]                    
2020-11-03 10:56:55 INFO: Downloading default packages for language: en (English)...
2020-11-03 10:56:56 INFO: File exists: /Users/emlanza/stanza_resources/en/default.zip.
2020-11-03 10:57:01 INFO: Finished downloading models and saved to /Users/emlanza/stanza_resources.


### More Information

Pretrained models are provided for 60+ different languages. For all languages, available models and the corresponding short language codes, please check out the [models page](https://stanfordnlp.github.io/stanza/models.html).


## 3. Processing Text


### Constructing Pipeline

To process a piece of text, you'll need to first construct a `Pipeline` with different `Processor` units. The pipeline is language-specific, so again you'll need to first specify the language (see examples).

- By default, the pipeline will include all processors, including tokenization, multi-word token expansion, part-of-speech tagging, lemmatization, dependency parsing and named entity recognition (for supported languages). However, you can always specify what processors you want to include with the `processors` argument.

- Stanza's pipeline is CUDA-aware: a CUDA device is used whenever one is available, and the CPU is used otherwise. You can force the pipeline to use the CPU regardless by setting `use_gpu=False`.

- Again, you can suppress all printed messages by setting `verbose=False`.
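
The three options above can be collected in one place before the pipeline is built. The sketch below assembles them into keyword arguments; `pipeline_kwargs` is a hypothetical helper, and the block itself does not import Stanza, so it runs without any models installed.

```python
def pipeline_kwargs(lang, processors=None, use_gpu=True, verbose=True):
    """Assemble keyword arguments for stanza.Pipeline (hypothetical helper)."""
    kwargs = {"lang": lang, "use_gpu": use_gpu, "verbose": verbose}
    if processors:
        # The processors argument is a comma-separated string, e.g. "tokenize,pos".
        kwargs["processors"] = ",".join(processors)
    return kwargs

zh_args = pipeline_kwargs("zh", ["tokenize", "lemma", "pos", "depparse"],
                          use_gpu=False, verbose=False)
print(zh_args)

# With Stanza installed and the model downloaded, this could be used as:
# zh_nlp = stanza.Pipeline(**zh_args)
```

Leaving `processors` out of the dictionary keeps the default behaviour, in which the pipeline loads all available processors.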

In [3]:
# Build an English pipeline, with all processors by default
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en')

# Build a Chinese pipeline, with customized processor list and no logging, and force it to use CPU
#print("Building a Chinese pipeline...")
#zh_nlp = stanza.Pipeline('zh', processors='tokenize,lemma,pos,depparse', verbose=False, use_gpu=False)

2020-11-03 10:52:04 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| sentiment | sstplus   |
| ner       | ontonotes |

2020-11-03 10:52:04 INFO: Use device: cpu
2020-11-03 10:52:04 INFO: Loading: tokenize
2020-11-03 10:52:04 INFO: Loading: pos


Building an English pipeline...


2020-11-03 10:52:05 INFO: Loading: lemma
2020-11-03 10:52:05 INFO: Loading: depparse
2020-11-03 10:52:06 INFO: Loading: sentiment
2020-11-03 10:52:07 INFO: Loading: ner
2020-11-03 10:52:08 INFO: Done loading processors!


Building a Chinese pipeline...


### Annotating Text

After a pipeline is successfully constructed, you can get annotations of a piece of text simply by passing the string into the pipeline object. The pipeline returns a `Document` object, from which you can access detailed annotations. For example:


In [16]:
# Processing English text
en_doc = en_nlp("I deny that")
print(type(en_doc))

# Processing Chinese text
#zh_doc = zh_nlp("达沃斯世界经济论坛是每年全球政商界领袖聚在一起的年度盛事。")
#print(type(zh_doc))

<class 'stanza.models.common.doc.Document'>


### More Information

For more information on how to construct a pipeline and information on different processors, please visit our [pipeline page](https://stanfordnlp.github.io/stanfordnlp/pipeline.html).

## 4. Accessing Annotations

Annotations can be accessed from the returned `Document` object. 

A `Document` contains a list of `Sentence`s, and a `Sentence` contains a list of `Token`s and `Word`s. For the most part `Token`s and `Word`s overlap, but some tokens can be divided into multiple words: for instance, the French token `aux` is divided into the words `à` and `les`, while in English a word and a token are equivalent. Note that dependency parses are derived over `Word`s.

Additionally, a `Span` object is used to represent annotations that are part of a document, such as named entity mentions.
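
The token-to-word relationship can be illustrated with a toy model. The classes below are hypothetical stand-ins built from `namedtuple`; they are not Stanza's own `Token` and `Word` classes and only mirror the containment described above.

```python
from collections import namedtuple

# Toy stand-ins, not the real Stanza classes.
ToyWord = namedtuple("ToyWord", ["text"])
ToyToken = namedtuple("ToyToken", ["text", "words"])

def sentence_words(tokens):
    """Flatten the words of every token, preserving order."""
    return [word for token in tokens for word in token.words]

tokens = [
    ToyToken("aux", [ToyWord("à"), ToyWord("les")]),   # one token, two words
    ToyToken("enfants", [ToyWord("enfants")]),          # one token, one word
]

print([t.text for t in tokens])                  # token view
print([w.text for w in sentence_words(tokens)])  # word view
```

In this toy sentence there are two tokens but three words, which is why the token view and the word view of a sentence can differ in length.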


The following example iterates over all English sentences and words, and prints the word information one by one:

In [17]:
for i, sent in enumerate(en_doc.sentences):
    print("[Sentence {}]".format(i+1))
    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(\
              word.text, word.lemma, word.pos, word.head, word.deprel))
    print("")

[Sentence 1]
I           	I           	PRON  	2	nsubj       
deny        	deny        	VERB  	0	root        
that        	that        	PRON  	2	obj         
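
The `head` column in the output above is the 1-based index of the governing word within the sentence, with 0 marking the root. As a small sketch, the rows can be turned into readable dependency edges; the rows are copied from the output above and `dependency_edges` is a hypothetical helper.

```python
# (id, text, head, deprel) rows copied from the printed output above.
rows = [
    (1, "I", 2, "nsubj"),
    (2, "deny", 0, "root"),
    (3, "that", 2, "obj"),
]

def dependency_edges(rows):
    """Return (dependent, relation, governor) triples; head 0 maps to 'ROOT'."""
    text_by_id = {word_id: text for word_id, text, _, _ in rows}
    edges = []
    for word_id, text, head, deprel in rows:
        governor = "ROOT" if head == 0 else text_by_id[head]
        edges.append((text, deprel, governor))
    return edges

for dependent, relation, governor in dependency_edges(rows):
    print("{} --{}--> {}".format(dependent, relation, governor))
```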



The following example iterates over all extracted named entity mentions and prints their character spans and types.

In [18]:
print("Mention text\tType\tStart-End")
for ent in en_doc.ents:
    print("{}\t{}\t{}-{}".format(ent.text, ent.type, ent.start_char, ent.end_char))

Mention text	Type	Start-End


Alternatively, you can directly print a `Word` object to view all its annotations as a Python dict:

In [12]:
word = en_doc.sentences[0].words[0]
print(word)

{
  "id": 1,
  "text": "Oscar",
  "lemma": "Oscar",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 2,
  "deprel": "nsubj",
  "misc": "start_char=0|end_char=5"
}
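
The `feats` and `misc` fields in the output above are packed as `key=value` pairs separated by `|`. The sketch below unpacks such a string and uses the character offsets to slice the input text. It assumes that the offsets index into the original input string; `parse_pipe_fields` is a hypothetical helper, and the sentence is the one used in a later cell of this notebook.

```python
def parse_pipe_fields(value):
    """Split a 'key=value|key=value' string into a dict of strings."""
    if not value:
        return {}
    return dict(item.split("=", 1) for item in value.split("|"))

text = "Oscar, how about you give us our first number?"

# Values copied from the printed word above.
misc = parse_pipe_fields("start_char=0|end_char=5")
feats = parse_pipe_fields("Number=Sing")

start, end = int(misc["start_char"]), int(misc["end_char"])
print(text[start:end])
print(feats)
```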


### More Information

For all information on different data objects, please visit our [data objects page](https://stanfordnlp.github.io/stanza/data_objects.html).

## 5. Resources

Apart from this interactive tutorial, we also provide tutorials on our website that cover a variety of use cases such as how to use different model "packages" for a language, how to use spaCy as a tokenizer, how to process pretokenized text without running the tokenizer, etc. For these tutorials please visit [our Tutorials page](https://stanfordnlp.github.io/stanza/tutorials.html).

Other resources that you may find helpful include:

- [Stanza Homepage](https://stanfordnlp.github.io/stanza/index.html)
- [FAQs](https://stanfordnlp.github.io/stanza/faq.html)
- [GitHub Repo](https://github.com/stanfordnlp/stanza)
- [Reporting Issues](https://github.com/stanfordnlp/stanza/issues)
- [Stanza System Description Paper](http://arxiv.org/abs/2003.07082)


In [13]:
nlp = stanza.Pipeline('en', processors='tokenize,pos', use_gpu=True, pos_batch_size=3000) # Build the pipeline, specify part-of-speech processor's batch size
doc = nlp("Barack Obama was born in Hawaii.") # Run the pipeline on the input text
print(doc) # Look at the result

2020-11-03 11:43:31 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | ewt     |
| pos       | ewt     |

2020-11-03 11:43:31 INFO: Use device: cpu
2020-11-03 11:43:31 INFO: Loading: tokenize
2020-11-03 11:43:31 INFO: Loading: pos
2020-11-03 11:43:32 INFO: Done loading processors!


[
  [
    {
      "id": 1,
      "text": "Barack",
      "upos": "PROPN",
      "xpos": "NNP",
      "feats": "Number=Sing",
      "misc": "start_char=0|end_char=6"
    },
    {
      "id": 2,
      "text": "Obama",
      "upos": "PROPN",
      "xpos": "NNP",
      "feats": "Number=Sing",
      "misc": "start_char=7|end_char=12"
    },
    {
      "id": 3,
      "text": "was",
      "upos": "AUX",
      "xpos": "VBD",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
      "misc": "start_char=13|end_char=16"
    },
    {
      "id": 4,
      "text": "born",
      "upos": "VERB",
      "xpos": "VBN",
      "feats": "Tense=Past|VerbForm=Part|Voice=Pass",
      "misc": "start_char=17|end_char=21"
    },
    {
      "id": 5,
      "text": "in",
      "upos": "ADP",
      "xpos": "IN",
      "misc": "start_char=22|end_char=24"
    },
    {
      "id": 6,
      "text": "Hawaii",
      "upos": "PROPN",
      "xpos": "NNP",
      "feats": "Number=Sing",
      "misc": "sta

In [12]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,ner,lemma')
doc = nlp("Oscar, how about you give us our first number?")
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')

In [37]:
# Token objects carry the NER tag; lemmas are stored on the Word objects inside each token
print(*[f'token: {token.text}\tner: {token.ner}\tlemma: {" ".join(word.lemma for word in token.words)}' for sent in doc.sentences for token in sent.tokens], sep='\n')

In [None]:
text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
with CoreNLPClient(
        annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse'],
        timeout=30000,
        memory='16G') as client:

    # Use tokensregex patterns to find who wrote a sentence.
    pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
    matches = client.tokensregex(text, pattern)
    # sentences contains a list with matches for each sentence.
    print(len(matches["sentences"])) # prints: 3
    # length tells you whether or not there are any matches in this sentence.
    print(matches["sentences"][1]["length"]) # prints: 1
    # You can access matches like most regex groups.
    print(matches["sentences"][1]["0"]["text"]) # prints: "Chris wrote a simple sentence"
    print(matches["sentences"][1]["0"]["1"]["text"]) # prints: "Chris"
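
To walk the whole result rather than indexing one sentence by hand, the nested structure can be traversed generically. The sketch below uses a hand-built dictionary that mirrors the shape described in the comments above. It is not real server output, and a real response may carry additional keys; `collect_matches` is a hypothetical helper.

```python
# Hand-built stand-in mirroring the shape described in the comments above.
mock_matches = {
    "sentences": [
        {"length": 0},
        {"length": 1,
         "0": {"text": "Chris wrote a simple sentence",
               "1": {"text": "Chris"}}},
        {"length": 0},
    ]
}

def collect_matches(matches):
    """Return (sentence_index, full_match_text, first_group_text) tuples."""
    found = []
    for sent_idx, sentence in enumerate(matches["sentences"]):
        # Matches are stored under string keys "0", "1", ... up to length - 1.
        for match_idx in range(sentence["length"]):
            match = sentence[str(match_idx)]
            group = match.get("1", {}).get("text")
            found.append((sent_idx, match["text"], group))
    return found

print(collect_matches(mock_matches))
```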

In [31]:
# Use tokensregex patterns to find who wrote a sentence.
pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
matches = tokensregex(text, pattern)

NameError: name 'tokensregex' is not defined

In [30]:

# sentences contains a list with matches for each sentence.
print(len(matches["sentences"])) # prints: 3

# length tells you whether or not there are any matches in this sentence.
print(matches["sentences"][1]["length"]) # prints: 1

# You can access matches like most regex groups.
print(matches["sentences"][1]["0"]["text"]) # prints: "Chris wrote a simple sentence"
print(matches["sentences"][1]["0"]["1"]["text"]) # prints: "Chris"

NameError: name 'matches' is not defined

In [29]:
import stanza

# Requirements

In [38]:
stanza.install_corenlp(dir="/Users/emlanza/Desktop/KAIU/stanford-corenlp-4.1.0")



In [55]:
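# %env sets the variable for the notebook's own process; "!export" would only affect a short-lived subshell
%env CORENLP_HOME=/Users/emlanza/Desktop/KAIU/stanford-corenlp-4.1.0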

In [42]:
stanza.download_corenlp_models(model='english', version='4.1.0', dir="/Users/emlanza/Desktop/KAIU/stanford-corenlp-4.1.0")

2020-11-03 16:53:54 INFO: Downloading english models (version 4.1.0) into directory /Users/emlanza/Desktop/KAIU/stanford-corenlp-4.1.0...
Downloading http://nlp.stanford.edu/software/stanford-corenlp-4.1.0-models-english.jar: 100%|██████████| 671M/671M [18:07<00:00, 617kB/s]   


# Program

In [5]:
from stanza.server import CoreNLPClient

In [6]:
import os
corenlp_dir="/Users/emlanza/Desktop/KAIU/stanford-corenlp-4.1.0"
os.environ["CORENLP_HOME"] = corenlp_dir
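
The cell above sets `CORENLP_HOME` through `os.environ`. One reason this is a safe choice inside a notebook is that a variable exported in a child shell disappears when that shell exits, whereas `os.environ` changes the current Python process and is inherited by the processes it starts. The sketch below demonstrates the difference with a throwaway variable; `DEMO_CORENLP_HOME` and the path `/tmp/demo` are placeholders used only for this demonstration.

```python
import os
import subprocess
import sys

VAR = "DEMO_CORENLP_HOME"  # placeholder variable name, used only for this demo
os.environ.pop(VAR, None)

# An export inside a child shell ends together with that shell.
subprocess.run("export {}=/tmp/demo".format(VAR), shell=True)
seen_after_export = os.environ.get(VAR)

# Setting os.environ changes this process and is inherited by child processes.
os.environ[VAR] = "/tmp/demo"
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('{}'))".format(VAR)],
    stdout=subprocess.PIPE, universal_newlines=True)
seen_by_child = child.stdout.strip()

print(seen_after_export)  # the export did not reach this process
print(seen_by_child)      # the child process sees the value
```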

In [73]:
# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

CoreNLP-to-HTML.xsl
LIBRARY-LICENSES
LICENSE.txt
Makefile
README.txt
RESOURCE-LICENSES
SemgrexDemo.java
ShiftReduceDemo.java
StanfordCoreNlpDemo.java
StanfordDependenciesManual.pdf
build.xml
corenlp.sh
ejml-core-0.38-sources.jar
ejml-core-0.38.jar
ejml-ddense-0.38-sources.jar
ejml-ddense-0.38.jar
ejml-simple-0.38-sources.jar
ejml-simple-0.38.jar
input.txt
input.txt.out
input.txt.xml
javax.activation-api-1.2.0-sources.jar
javax.activation-api-1.2.0.jar
javax.json-api-1.0-sources.jar
javax.json.jar
jaxb-api-2.4.0-b180830.0359-sources.jar
jaxb-api-2.4.0-b180830.0359.jar
jaxb-core-2.3.0.1-sources.jar
jaxb-core-2.3.0.1.jar
jaxb-impl-2.4.0-b180830.0438-sources.jar
jaxb-impl-2.4.0-b180830.0438.jar
joda-time-2.10.5-sources.jar
joda-time.jar
jollyday-0.4.9-sources.jar
jollyday.jar
patterns
pom-java-11.xml
pom.xml
protobuf.jar
slf4j-api.jar
slf4j-simple.jar
stanford-corenlp-4.1.0-javadoc.jar
stanford-corenlp-4.1.0-models-

In [74]:
# Import client module
from stanza.server import CoreNLPClient

In [84]:
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9000
client = CoreNLPClient(
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
    memory='4G', 
    endpoint='http://localhost:9000',
    be_quiet=True)
print(client)

# Start the background server and wait for some time
# Note that in practice this is totally optional, as by default the server will be started when the first annotation is performed
client.start()
import time; time.sleep(10)

2020-11-03 20:10:42 INFO: Writing properties to tmp file: corenlp_server-a594a5a9bb464cdb.props
2020-11-03 20:10:42 INFO: Starting server with command: java -Xmx4G -cp /Users/emlanza/Desktop/KAIU/stanford-corenlp-4.1.0/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-a594a5a9bb464cdb.props -annotators tokenize,ssplit,pos,lemma,ner -preload -outputFormat serialized


<stanza.server.client.CoreNLPClient object at 0x7fee77f98090>
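
Instead of sleeping for a fixed number of seconds, one option is to poll the server port until it accepts connections. The sketch below is a generic standard-library approach and not something Stanza provides; `wait_for_port` is a hypothetical helper. An open port does not guarantee that the server has finished loading its models, so this is only a rough readiness check. The demo listens on a temporary local port so that the block runs without a CoreNLP server.

```python
import socket
import time

def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Poll until a TCP connection succeeds or the timeout elapses (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and try again.
            time.sleep(interval)
    return False

# Self-contained demo: listen on an ephemeral local port and wait for it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
ready = wait_for_port("127.0.0.1", demo_port, timeout=2.0)
listener.close()
print(ready)
```

For the client constructed above, the call would be `wait_for_port("localhost", 9000)`, matching the endpoint passed to `CoreNLPClient`.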


In [78]:
# Annotate some text
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
document = client.annotate(text)
print(type(document))

<class 'CoreNLP_pb2.Document'>


In [79]:
# Iterate over all tokens in all sentences, and print out the word, lemma, pos and ner tags
print("{:12s}\t{:12s}\t{:6s}\t{}".format("Word", "Lemma", "POS", "NER"))

for i, sent in enumerate(document.sentence):
    print("[Sentence {}]".format(i+1))
    for t in sent.token:
        print("{:12s}\t{:12s}\t{:6s}\t{}".format(t.word, t.lemma, t.pos, t.ner))
    print("")

Word        	Lemma       	POS   	NER
[Sentence 1]
Albert      	Albert      	NNP   	PERSON
Einstein    	Einstein    	NNP   	PERSON
was         	be          	VBD   	O
a           	a           	DT    	O
German      	german      	JJ    	NATIONALITY
-           	-           	HYPH  	O
born        	bear        	VBN   	O
theoretical 	theoretical 	JJ    	TITLE
physicist   	physicist   	NN    	TITLE
.           	.           	.     	O

[Sentence 2]
He          	he          	PRP   	O
developed   	develop     	VBD   	O
the         	the         	DT    	O
theory      	theory      	NN    	O
of          	of          	IN    	O
relativity  	relativity  	NN    	O
.           	.           	.     	O
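
The token-level NER tags in the table above can be merged into multi-word mentions by joining consecutive tokens that share the same tag. The sketch below is a simplification for illustration and not CoreNLP's own mention detection, which can also report mentions whose tokens are tagged `O`. The tagged tokens are copied from the first sentence of the table, and `group_mentions` is a hypothetical helper.

```python
# (word, ner) pairs copied from the first sentence of the table above.
tagged = [
    ("Albert", "PERSON"), ("Einstein", "PERSON"), ("was", "O"), ("a", "O"),
    ("German", "NATIONALITY"), ("-", "O"), ("born", "O"),
    ("theoretical", "TITLE"), ("physicist", "TITLE"), (".", "O"),
]

def group_mentions(tagged_tokens, outside="O"):
    """Merge consecutive tokens that share a non-'O' tag into (text, tag) mentions."""
    mentions = []
    current_words, current_tag = [], None
    for word, tag in tagged_tokens:
        if tag != outside and tag == current_tag:
            # Same entity type as the previous token: extend the mention.
            current_words.append(word)
            continue
        if current_words:
            # The tag changed: close the mention collected so far.
            mentions.append((" ".join(current_words), current_tag))
        current_words, current_tag = ([word], tag) if tag != outside else ([], None)
    if current_words:
        mentions.append((" ".join(current_words), current_tag))
    return mentions

for text, tag in group_mentions(tagged):
    print("{:30s}\t{}".format(text, tag))
```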



In [80]:
# Iterate over all detected entity mentions
print("{:30s}\t{}".format("Mention", "Type"))

for sent in document.sentence:
    for m in sent.mentions:
        print("{:30s}\t{}".format(m.entityMentionText, m.entityType))

Mention                       	Type
Albert Einstein               	PERSON
German                        	NATIONALITY
theoretical physicist         	TITLE
He                            	PERSON


In [85]:
# Shut down the background CoreNLP server
client.stop()

time.sleep(10)
!ps -o pid,command | grep java


In [11]:
print("Starting a server with the Python \"with\" statement...")

with CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
                   memory='4G', endpoint='http://localhost:9000', be_quiet=True) as client:
    text = "Albert Einstein was a German-born theoretical physicist."
    document = client.annotate(text)

    print("{:30s}\t{}".format("Mention", "Type"))
    for sent in document.sentence:
        for m in sent.mentions:
            print("{:30s}\t{}".format(m.entityMentionText, m.entityType))

print("\nThe server should be stopped upon exit from the \"with\" statement.")

In [10]:
with CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
                   memory='4G', endpoint='http://localhost:9000', be_quiet=True) as client:
    text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."

    # Use tokensregex patterns to find who wrote a sentence.
    pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
    matches = client.tokensregex(text, pattern)

    # sentences contains a list with matches for each sentence.
    print(len(matches["sentences"])) # prints: 3

    # length tells you whether or not there are any matches in this sentence.
    print(matches["sentences"][1]["length"]) # prints: 1
    
    # You can access matches like most regex groups.
    print(matches["sentences"][1]["0"]["text"]) # prints: "Chris wrote a simple sentence"
    print(matches["sentences"][1]["0"]["1"]["text"]) # prints: "Chris"

In [8]:
import stanza
import stanza.server.semgrex as semgrex

# stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

doc = nlp("Banning opal removed all artifact decks from the meta.  I miss playing lantern.")
semgrex_results = semgrex.process_doc(doc,
                                      "{pos:NN}=object <obl {}=action",
                                      "{cpos:NOUN}=thing <obj {cpos:VERB}=action")
print(semgrex_results)

In [9]:
# get noun phrases with tregex
def noun_phrases(_client, _text, _annotators=None):
    pattern = 'NP'
    matches = _client.tregex(_text,pattern,annotators=_annotators)
    print("\n".join(["\t"+sentence[match_id]['spanString'] for sentence in matches['sentences'] for match_id in sentence]))

# English example
with CoreNLPClient(timeout=30000, memory='16G') as client:
    englishText = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print('---')
    print(englishText)
    noun_phrases(client,englishText,_annotators="tokenize,ssplit,pos,lemma,parse")