# Stanza

Stanza is a collection of accurate and efficient tools for many human languages in one place. From raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to languages of your choosing.

Stanza is a Python natural language analysis package. Its tools can be chained in a pipeline to convert a string of human-language text into lists of sentences and words, to generate the base forms of those words along with their parts of speech and morphological features, to produce a syntactic dependency parse, and to recognize named entities. The toolkit is designed to work the same way across more than 60 languages, using the Universal Dependencies formalism.

- Native Python implementation requiring minimal effort to set up;
- Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition;
- Pretrained neural models supporting 66 (human) languages;
- A stable, officially maintained Python interface to CoreNLP.

In [2]:
import stanza
stanza.download('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 13.8MB/s]                    
2020-12-05 16:03:13 INFO: Downloading default packages for language: en (English)...
Downloading http://nlp.stanford.edu/software/stanza/1.1.0/en/default.zip: 100%|██████████| 428M/428M [03:40<00:00, 1.94MB/s]  
2020-12-05 16:07:06 INFO: Finished downloading models and saved to /home/akshay/stanza_resources.


In [3]:
nlp = stanza.Pipeline('en')

2020-12-05 16:07:25 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| sentiment | sstplus   |
| ner       | ontonotes |

2020-12-05 16:07:25 INFO: Use device: cpu
2020-12-05 16:07:25 INFO: Loading: tokenize
2020-12-05 16:07:25 INFO: Loading: pos
2020-12-05 16:07:26 INFO: Loading: lemma
2020-12-05 16:07:26 INFO: Loading: depparse
2020-12-05 16:07:28 INFO: Loading: sentiment
2020-12-05 16:07:29 INFO: Loading: ner
2020-12-05 16:07:30 INFO: Done loading processors!


In [4]:
doc = nlp("Akshay is teaching Stanza library.")

In [5]:
print(doc)

[
  [
    {
      "id": 1,
      "text": "Akshay",
      "lemma": "Akshay",
      "upos": "PROPN",
      "xpos": "NNP",
      "feats": "Number=Sing",
      "head": 3,
      "deprel": "nsubj",
      "misc": "start_char=0|end_char=6",
      "ner": "S-PERSON"
    },
    {
      "id": 2,
      "text": "is",
      "lemma": "be",
      "upos": "AUX",
      "xpos": "VBZ",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
      "head": 3,
      "deprel": "aux",
      "misc": "start_char=7|end_char=9",
      "ner": "O"
    },
    {
      "id": 3,
      "text": "teaching",
      "lemma": "teach",
      "upos": "VERB",
      "xpos": "VBG",
      "feats": "Tense=Pres|VerbForm=Part",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=10|end_char=18",
      "ner": "O"
    },
    {
      "id": 4,
      "text": "Stanza",
      "lemma": "stanza",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Number=Sing",
      "head": 5,
      "deprel": "compound",
   

In [6]:
doc.entities

[{
   "text": "Akshay",
   "type": "PERSON",
   "start_char": 0,
   "end_char": 6
 },
 {
   "text": "Stanza",
   "type": "ORG",
   "start_char": 19,
   "end_char": 25
 }]
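
The `start_char` and `end_char` values in the entity output are character offsets into the original input string. As a minimal sketch (the text and spans are copied from the cells above; `span_text` is a hypothetical helper, not part of Stanza), ordinary slicing recovers each mention:

```python
# Text and entity spans copied from the cells above.
text = "Akshay is teaching Stanza library."
entities = [
    {"text": "Akshay", "type": "PERSON", "start_char": 0, "end_char": 6},
    {"text": "Stanza", "type": "ORG", "start_char": 19, "end_char": 25},
]


def span_text(source, entity):
    """Return the substring of `source` covered by an entity's offsets (hypothetical helper)."""
    return source[entity["start_char"]:entity["end_char"]]


for ent in entities:
    # end_char is exclusive, so a plain Python slice gives back the mention text.
    print(ent["type"], repr(span_text(text, ent)))
```

With a live pipeline, the same idea should apply to the objects in `doc.entities`, reading `ent.start_char` and `ent.end_char` as attributes instead of dictionary keys.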

In [8]:
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.xpos)


Akshay Akshay NNP
is be VBZ
teaching teach VBG
Stanza stanza NN
library library NN
. . .
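
Besides `text`, `lemma`, and `xpos`, each word in the printed document also carries a `head` index and a `deprel` label. The sketch below shows one way to turn those fields into readable dependency edges. It uses only the first three word entries visible in the printed document (the rest is truncated there), and `dependency_edges` is a hypothetical helper, not a Stanza function.

```python
# First three word entries of the sentence, copied from the printed document
# (only the fields needed here; later words are truncated in that output).
words = [
    {"id": 1, "text": "Akshay", "head": 3, "deprel": "nsubj"},
    {"id": 2, "text": "is", "head": 3, "deprel": "aux"},
    {"id": 3, "text": "teaching", "head": 0, "deprel": "root"},
]


def dependency_edges(words):
    """Yield (dependent, relation, head) triples; head id 0 denotes the artificial root."""
    by_id = {w["id"]: w["text"] for w in words}
    for w in words:
        if w["head"] == 0:
            head_text = "ROOT"
        else:
            # Fall back to a placeholder if the head word is not in the list.
            head_text = by_id.get(w["head"], f"<id {w['head']}>")
        yield w["text"], w["deprel"], head_text


edges = list(dependency_edges(words))
for dependent, relation, head in edges:
    print(f"{dependent} --{relation}--> {head}")
```

Word ids are 1-based within a sentence, so with live Stanza objects the head of a non-root word can be looked up as `sentence.words[word.head - 1]`.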


In [12]:
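# Map each processor to the package that should be downloaded for it.
# Note: 'gsd' and 'hdt' appear to be German (de) package names; the log below
# lists no English tokenize or pos package, so only the lemma, ner, and
# charlm models are fetched for 'en'.
processor_dict = {
    'tokenize': 'gsd',
    'pos': 'hdt',
    'ner': 'conll03',
    'lemma': 'default'
}
stanza.download('en', processors=processor_dict, package=None)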

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 16.5MB/s]                    
2020-12-05 16:16:56 INFO: Downloading these customized packages for language: en (English)...
| Processor       | Package  |
------------------------------
| lemma           | ewt      |
| ner             | conll03  |
| forward_charlm  | 1billion |
| backward_charlm | 1billion |

2020-12-05 16:16:56 INFO: File exists: /home/akshay/stanza_resources/en/lemma/ewt.pt.
Downloading http://nlp.stanford.edu/software/stanza/1.1.0/en/ner/conll03.pt:   8%|▊         | 6.42M/80.8M [00:18<03:33, 349kB/s] 


KeyboardInterrupt: 

# Tokenization and Sentence Segmentation

In [27]:

nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('Akshay is teaching Stanza.Stanza is the next revolution')
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

2020-12-05 16:27:56 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | ewt     |

2020-12-05 16:27:56 INFO: Use device: cpu
2020-12-05 16:27:56 INFO: Loading: tokenize
2020-12-05 16:27:56 INFO: Done loading processors!


====== Sentence 1 tokens =======
id: (1,)	text: Akshay
id: (2,)	text: is
id: (3,)	text: teaching
id: (4,)	text: Stanza
id: (5,)	text: .
====== Sentence 2 tokens =======
id: (1,)	text: Stanza
id: (2,)	text: is
id: (3,)	text: the
id: (4,)	text: next
id: (5,)	text: revolution
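
Token ids are printed as tuples such as `(1,)` because a token can cover more than one word: an ordinary token has a single-element id, while a multi-word token carries the ids of its first and last word. The sketch below formats such tuples in the range style used by CoNLL-U files. `conllu_id` is a hypothetical helper, and the `(1, 2)` input is illustrative only, since no multi-word token occurs in the English output above.

```python
def conllu_id(token_id):
    """Format a token id tuple as a CoNLL-U style id string (hypothetical helper)."""
    if len(token_id) == 1:
        # Ordinary token: one word, one id.
        return str(token_id[0])
    # Multi-word token: the tuple holds the first and last word id.
    first, last = token_id
    return f"{first}-{last}"


print(conllu_id((1,)))    # ordinary token, as in the output above
print(conllu_id((1, 2)))  # illustrative multi-word token spanning words 1 and 2
```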


# Parts of Speech

In [31]:

nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos')
doc = nlp('Barack Obama was born in Hawaii.')
for sent in doc.sentences:
    for word in sent.words:
        # Words without morphological features are shown as "_".
        feats = word.feats if word.feats else "_"
        print(f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {feats}')

2020-12-05 16:34:41 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | ewt     |
| pos       | ewt     |

2020-12-05 16:34:41 INFO: Use device: cpu
2020-12-05 16:34:41 INFO: Loading: tokenize
2020-12-05 16:34:41 INFO: Loading: pos
2020-12-05 16:34:42 INFO: Done loading processors!


word: Barack	upos: PROPN	xpos: NNP	feats: Number=Sing
word: Obama	upos: PROPN	xpos: NNP	feats: Number=Sing
word: was	upos: AUX	xpos: VBD	feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: born	upos: VERB	xpos: VBN	feats: Tense=Past|VerbForm=Part|Voice=Pass
word: in	upos: ADP	xpos: IN	feats: _
word: Hawaii	upos: PROPN	xpos: NNP	feats: Number=Sing
word: .	upos: PUNCT	xpos: .	feats: _
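
The `feats` column packs several morphological features into one string, with `|` between features and `=` between each name and its value. A minimal sketch of unpacking it into a dictionary follows; `parse_feats` is a hypothetical helper, and the sample string is the one printed for "was" above.

```python
def parse_feats(feats):
    """Split a feature string such as 'Number=Sing' into a dict.

    None, an empty string, or '_' all mean that the word has no features.
    """
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


was_feats = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
print(was_feats["Tense"])
print(parse_feats("_"))
```

Once the features are in a dictionary, filtering becomes a simple lookup, for example selecting every word whose `Tense` is `Past`.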


# Sentiment Analysis using Stanza

In [20]:
nlp = stanza.Pipeline('en', processors='tokenize,sentiment')

2020-12-05 16:24:21 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | ewt     |
| sentiment | sstplus |

2020-12-05 16:24:21 INFO: Use device: cpu
2020-12-05 16:24:21 INFO: Loading: tokenize
2020-12-05 16:24:21 INFO: Loading: sentiment
2020-12-05 16:24:23 INFO: Done loading processors!


In [21]:
doc = nlp('Ram is a bad boy')

In [23]:
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)

0 0


In [24]:
doc = nlp('Ram is a good boy')
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)

0 2


In [25]:
doc = nlp('Ram is a boy')
for i, sentence in enumerate(doc.sentences):
    print(i, sentence.sentiment)

0 1
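
The integer scores are easier to read with labels attached. In Stanza's sentiment models, 0, 1, and 2 stand for negative, neutral, and positive, which matches the three results above. The sketch below applies that mapping to the scores printed in the preceding cells; `SENTIMENT_LABELS` and `sentiment_label` are hypothetical names introduced here for illustration.

```python
# Label for each score value returned by the sentiment processor.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def sentiment_label(score):
    """Map a sentence-level sentiment score to a label (hypothetical helper)."""
    return SENTIMENT_LABELS.get(score, "unknown")


# Sentences and scores copied from the cells above.
results = {
    "Ram is a bad boy": 0,
    "Ram is a good boy": 2,
    "Ram is a boy": 1,
}

for sentence_text, score in results.items():
    print(f"{sentence_text!r}: {sentiment_label(score)}")
```

With a live pipeline, the helper would be called as `sentiment_label(sentence.sentiment)` inside the loop over `doc.sentences`.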


# Lemmatization

In [33]:
nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')
doc = nlp('Akshay is teaching Stanza.')
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

2020-12-05 16:36:18 INFO: Loading these models for language: en (English):
| Processor | Package |
-----------------------
| tokenize  | ewt     |
| pos       | ewt     |
| lemma     | ewt     |

2020-12-05 16:36:18 INFO: Use device: cpu
2020-12-05 16:36:18 INFO: Loading: tokenize
2020-12-05 16:36:18 INFO: Loading: pos
2020-12-05 16:36:19 INFO: Loading: lemma
2020-12-05 16:36:19 INFO: Done loading processors!


word: Akshay 	lemma: Akshay
word: is 	lemma: be
word: teaching 	lemma: teach
word: Stanza 	lemma: stanza
word: . 	lemma: .
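
To see what the lemmatizer changed, the surface forms can be compared with their lemmas. The sketch below does this for the pairs printed above, with the trailing space that the print statement added removed. The variable names are illustrative only.

```python
# (text, lemma) pairs copied from the output above.
pairs = [
    ("Akshay", "Akshay"),
    ("is", "be"),
    ("teaching", "teach"),
    ("Stanza", "stanza"),
    (".", "."),
]

# Words whose lemma differs from the surface form in any way.
changed = [(text, lemma) for text, lemma in pairs if text != lemma]

# Of those, the words where only the letter case differs.
case_only = [(text, lemma) for text, lemma in changed if text.lower() == lemma.lower()]

print("changed:", changed)
print("case only:", case_only)
```

In this sentence "is" and "teaching" are reduced to their base forms, while "Stanza" is only lowercased and "Akshay" is left unchanged.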
