<a href="https://colab.research.google.com/github/Viny2030/NLP/blob/main/Stanza_Beginners_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Stanza!

![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)
![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)

Stanza is a Python NLP toolkit that supports 60+ human languages. It is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data, and offers pretrained models on 100 treebanks. Additionally, Stanza provides a stable, officially maintained Python interface to the Java Stanford CoreNLP toolkit.

In this tutorial, we will demonstrate how to set up Stanza and annotate text with its native neural network NLP models. For the use of the Python CoreNLP interface, please see other tutorials.

## 1. Installing Stanza

Note that Stanza only supports Python 3.6 and above. Installing and importing Stanza are as simple as running the following commands:

In [1]:
# Install; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

# Import the package
import stanza

Collecting stanza
  Downloading stanza-1.9.2-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading stanza-1.9.2-py3-none-any.whl (1.1 MB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.14.0 stanza-1.9.2


https://stanfordnlp.github.io/stanza/tutorials.html


### More Information

For common troubleshooting, please visit our [troubleshooting page](https://stanfordnlp.github.io/stanfordnlp/installation_usage.html#troubleshooting).

## 2. Downloading Models

You can download models with the `stanza.download` command. The language can be specified with either a full language name (e.g., "english"), or a short code (e.g., "en").

By default, models are saved to your `~/stanza_resources` directory. To save the model files elsewhere, pass a `model_dir=your_path` argument (early Stanza releases named this argument `dir`).
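A minimal sketch of choosing the download location described above. It assumes the target-directory keyword is `model_dir`, as in recent Stanza releases (some older releases named it `dir`). The helper names `resolve_model_dir` and `download_models` are hypothetical, and Stanza is imported inside the function so that the path logic can be tried on its own.

```python
import os


def resolve_model_dir(custom_dir=None):
    """Return the directory to store models in (hypothetical helper).

    Falls back to ~/stanza_resources, the default location mentioned above.
    """
    if custom_dir:
        return os.path.abspath(os.path.expanduser(custom_dir))
    return os.path.join(os.path.expanduser("~"), "stanza_resources")


def download_models(lang, custom_dir=None):
    """Download the default models for `lang` into the resolved directory."""
    import stanza  # imported lazily; requires `pip install stanza`

    target = resolve_model_dir(custom_dir)
    # Assumption: `model_dir` is the keyword for the target directory.
    stanza.download(lang, model_dir=target)
    return target
```

Calling `download_models('en', '~/my_models')` would fetch the English models into that folder. A pipeline built later has to be pointed at the same directory in order to find them.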


In [2]:
# Download an English model into the default directory
print("Downloading English model...")
stanza.download('en')

# Similarly, download a (simplified) Chinese model
# Note that you can use verbose=False to turn off all printed messages
print("Downloading Chinese model...")
stanza.download('zh', verbose=False)

Downloading English model...


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.9.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.9.0/models/default.zip:   0%|          | 0…

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


Downloading Chinese model...


### More Information

Pretrained models are provided for 60+ different languages. For all languages, available models and the corresponding short language codes, please check out the [models page](https://stanfordnlp.github.io/stanza/models.html).


## 3. Processing Text


### Constructing Pipeline

To process a piece of text, you'll need to first construct a `Pipeline` with different `Processor` units. The pipeline is language-specific, so again you'll need to first specify the language (see examples).

- By default, the pipeline will include all processors, including tokenization, multi-word token expansion, part-of-speech tagging, lemmatization, dependency parsing and named entity recognition (for supported languages). However, you can always specify what processors you want to include with the `processors` argument.

- Stanza's pipeline is CUDA-aware: it uses a CUDA device whenever one is available and falls back to the CPU otherwise. You can force the pipeline to use the CPU by setting `use_gpu=False`.

- Again, you can suppress all printed messages by setting `verbose=False`.
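The three options above can be gathered in one place. The following is a sketch only: `pipeline_kwargs` and `build_pipeline` are hypothetical helper names, while `processors`, `use_gpu` and `verbose` are the `stanza.Pipeline` arguments named in the list. Stanza is imported lazily so the argument handling can be inspected without loading any models.

```python
def pipeline_kwargs(lang, processors=None, use_gpu=None, verbose=False):
    """Collect keyword arguments for stanza.Pipeline (hypothetical helper)."""
    kwargs = {"lang": lang, "verbose": verbose}
    if processors:
        # The processors argument is a comma-separated string of names.
        kwargs["processors"] = ",".join(processors)
    if use_gpu is not None:
        # Leaving use_gpu out keeps the CUDA-aware default behaviour.
        kwargs["use_gpu"] = use_gpu
    return kwargs


def build_pipeline(lang, **options):
    """Construct a pipeline from the collected arguments."""
    import stanza  # imported lazily; requires `pip install stanza`

    return stanza.Pipeline(**pipeline_kwargs(lang, **options))


# Inspect the arguments for a CPU-only Chinese pipeline without building it.
print(pipeline_kwargs("zh", ["tokenize", "pos"], use_gpu=False))
```

Keeping the argument dictionary separate from construction makes it easy to log or compare configurations before paying the cost of loading models.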


In [3]:
# Build an English pipeline, with all processors by default
print("Building an English pipeline...")
en_nlp = stanza.Pipeline('en')

# Build a Chinese pipeline, with customized processor list and no logging, and force it to use CPU
print("Building a Chinese pipeline...")
zh_nlp = stanza.Pipeline('zh', processors='tokenize,lemma,pos,depparse', verbose=False, use_gpu=False)

Building an English pipeline...




Building a Chinese pipeline...




### Annotating Text

After a pipeline is successfully constructed, you can annotate a piece of text simply by passing the string to the pipeline object. The pipeline returns a `Document` object, from which you can access detailed annotations. For example:


In [10]:
import pandas as pd

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [15]:
input = pd.read_csv('https://raw.githubusercontent.com/Viny2030/datasets/refs/heads/main/input.txt', sep='\t', header=None)

In [16]:
input

Unnamed: 0,0
0,Jane bought me these books.
1,Jane bought a book for me.
2,She dropped a line to him. Thank you.
3,She sleeps.
4,I sleep a lot.
5,I was born in Madrid.
6,the cat was chased by the dog.
7,I was born in Madrid during 1995.
8,"Out of all this , something good will come."
9,Susan left after the rehearsal. She did it well.


In [4]:
# Processing English text
en_doc = en_nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
print(type(en_doc))

# Processing Chinese text
zh_doc = zh_nlp("达沃斯世界经济论坛是每年全球政商界领袖聚在一起的年度盛事。")
print(type(zh_doc))

<class 'stanza.models.common.doc.Document'>
<class 'stanza.models.common.doc.Document'>


In [18]:
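# Processing English text
# Note: this passes the literal string "input" to the pipeline, not the rows of
# the DataFrame loaded earlier, so the resulting document holds one token, "input".
doc = en_nlp("input")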


In [19]:
doc


[
  [
    {
      "id": 1,
      "text": "input",
      "lemma": "input",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Number=Sing",
      "head": 0,
      "deprel": "root",
      "start_char": 0,
      "end_char": 5,
      "ner": "O",
      "multi_ner": [
        "O"
      ],
      "misc": "SpaceAfter=No"
    }
  ]
]
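To annotate the sentences stored in the DataFrame rather than the literal string, each row can be passed to the pipeline in turn. This is a minimal sketch: `annotate_texts` is a hypothetical helper, and it accepts any callable in place of the pipeline, so it can be tried without loading a model.

```python
def annotate_texts(nlp, texts):
    """Run `nlp` on each non-empty text and return (text, document) pairs."""
    results = []
    for text in texts:
        text = str(text).strip()
        if not text:
            continue  # skip blank rows
        results.append((text, nlp(text)))
    return results


# With the objects from this notebook, the call would look like this
# (column 0 holds the sentences because the file was read with header=None):
#   pairs = annotate_texts(en_nlp, input[0].tolist())
#   for text, document in pairs:
#       print(text, len(document.sentences))
```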

### More Information

For more information on how to construct a pipeline and information on different processors, please visit our [pipeline page](https://stanfordnlp.github.io/stanfordnlp/pipeline.html).

## 4. Accessing Annotations

Annotations can be accessed from the returned `Document` object.

A `Document` contains a list of `Sentence`s, and a `Sentence` contains a list of `Token`s and `Word`s. For the most part `Token`s and `Word`s overlap, but some tokens can be divided into multiple words. For instance, the French token `aux` is divided into the words `à` and `les`, while in English a word and a token are equivalent. Note that dependency parses are derived over `Word`s.

Additionally, a `Span` object is used to represent annotations that are part of a document, such as named entity mentions.
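The token/word distinction can be illustrated with a short sketch. It relies on the `tokens`, `words` and `text` attributes of Stanza's data objects, but uses stand-in objects built with `SimpleNamespace` (an assumption made for illustration) so that it runs without downloading a French model. `token_word_pairs` is a hypothetical helper name.

```python
from types import SimpleNamespace


def token_word_pairs(sentence):
    """Return (token text, [word texts]) for every token in a sentence."""
    return [(tok.text, [w.text for w in tok.words]) for tok in sentence.tokens]


# Stand-in objects mimicking the attributes used above: the multi-word
# token "aux" expands into two words, while "amis" maps to itself.
aux = SimpleNamespace(
    text="aux",
    words=[SimpleNamespace(text="à"), SimpleNamespace(text="les")],
)
amis = SimpleNamespace(text="amis", words=[SimpleNamespace(text="amis")])
sentence = SimpleNamespace(tokens=[aux, amis])

for token_text, word_texts in token_word_pairs(sentence):
    print(token_text, "->", word_texts)
```

The same function should work on a sentence taken from a real `Document`, for example `fr_doc.sentences[0]`, where `fr_doc` would be the output of a French pipeline.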


The following example iterates over all English sentences and words, and prints the word information one by one:


In [20]:
for i, sent in enumerate(doc.sentences):
    print("[Sentence {}]".format(i+1))
    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(\
              word.text, word.lemma, word.pos, word.head, word.deprel))
    print("")

[Sentence 1]
input       	input       	NOUN  	0	root        



In [5]:
for i, sent in enumerate(en_doc.sentences):
    print("[Sentence {}]".format(i+1))
    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(\
              word.text, word.lemma, word.pos, word.head, word.deprel))
    print("")

[Sentence 1]
Barack      	Barack      	PROPN 	4	nsubj:pass  
Obama       	Obama       	PROPN 	1	flat        
was         	be          	AUX   	4	aux:pass    
born        	bear        	VERB  	0	root        
in          	in          	ADP   	6	case        
Hawaii      	Hawaii      	PROPN 	4	obl         
.           	.           	PUNCT 	4	punct       

[Sentence 2]
He          	he          	PRON  	3	nsubj:pass  
was         	be          	AUX   	3	aux:pass    
elected     	elect       	VERB  	0	root        
president   	president   	NOUN  	3	xcomp       
in          	in          	ADP   	6	case        
2008        	2008        	NUM   	3	obl         
.           	.           	PUNCT 	3	punct       
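The `head` column in the output above is a 1-based index into the sentence's word list, with 0 marking the root. A sketch of turning those indices into readable (head, relation, dependent) triples follows. `dependency_triples` is a hypothetical helper, and the stand-in sentence copies the first four rows of the output above so the example runs without a model.

```python
from types import SimpleNamespace


def dependency_triples(sentence):
    """Return (head text, relation, dependent text) for each word.

    `word.head` is a 1-based index into `sentence.words`; 0 marks the root.
    """
    triples = []
    for word in sentence.words:
        if word.head == 0:
            head_text = "ROOT"
        else:
            head_text = sentence.words[word.head - 1].text
        triples.append((head_text, word.deprel, word.text))
    return triples


# Stand-in for the start of "Barack Obama was born in Hawaii."
sample = SimpleNamespace(words=[
    SimpleNamespace(text="Barack", head=4, deprel="nsubj:pass"),
    SimpleNamespace(text="Obama", head=1, deprel="flat"),
    SimpleNamespace(text="was", head=4, deprel="aux:pass"),
    SimpleNamespace(text="born", head=0, deprel="root"),
])

for head, relation, dependent in dependency_triples(sample):
    print("{} --{}--> {}".format(head, relation, dependent))
```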



The following example iterates over all extracted named entity mentions and prints their character spans and types.

In [21]:
print("Mention text\tType\tStart-End")
for ent in doc.ents:
    print("{}\t{}\t{}-{}".format(ent.text, ent.type, ent.start_char, ent.end_char))

Mention text	Type	Start-End


In [6]:
print("Mention text\tType\tStart-End")
for ent in en_doc.ents:
    print("{}\t{}\t{}-{}".format(ent.text, ent.type, ent.start_char, ent.end_char))

Mention text	Type	Start-End
Barack Obama	PERSON	0-12
Hawaii	GPE	25-31
2008	DATE	62-66
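Entity mentions are often easier to work with when grouped by type. The sketch below uses the same `text`, `type`, `start_char` and `end_char` attributes as the cell above. `entities_by_type` is a hypothetical helper, and the stand-in document copies the three mentions printed above so the example runs without a model.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_type(doc):
    """Group entity mentions as {type: [(text, start_char, end_char), ...]}."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.type].append((ent.text, ent.start_char, ent.end_char))
    return dict(grouped)


# Stand-in document holding the mentions listed in the output above.
sample_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Barack Obama", type="PERSON", start_char=0, end_char=12),
    SimpleNamespace(text="Hawaii", type="GPE", start_char=25, end_char=31),
    SimpleNamespace(text="2008", type="DATE", start_char=62, end_char=66),
])

for ent_type, mentions in entities_by_type(sample_doc).items():
    print(ent_type, mentions)
```

With a real pipeline result, `entities_by_type(en_doc)` would be the equivalent call.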


And similarly for the Chinese text:

In [22]:
for i, sent in enumerate(zh_doc.sentences):
    print("[Sentence {}]".format(i+1))
    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(\
              word.text, word.lemma, word.pos, word.head, word.deprel))
    print("")

[Sentence 1]
达沃斯         	达沃斯         	PROPN 	4	nmod        
世界          	世界          	NOUN  	4	nmod        
经济          	经济          	NOUN  	4	nmod        
论坛          	论坛          	NOUN  	16	nsubj       
是           	是           	AUX   	16	cop         
每年          	每年          	DET   	10	det         
全球          	全球          	NOUN  	10	nmod        
政商          	政商          	NOUN  	9	compound    
界           	界           	PART  	10	nmod        
领袖          	领袖          	NOUN  	11	nsubj       
聚           	聚           	VERB  	16	acl:relcl   
在           	在           	VERB  	11	mark        
一起          	一起          	NOUN  	11	obj         
的           	的           	PART  	11	mark:rel    
年度          	年度          	NOUN  	16	nmod        
盛事          	盛事          	NOUN  	0	root        
。           	。           	PUNCT 	16	punct       




Alternatively, you can directly print a `Word` object to view all its annotations as a Python dict:

In [23]:
word = doc.sentences[0].words[0]
print(word)

{
  "id": 1,
  "text": "input",
  "lemma": "input",
  "upos": "NOUN",
  "xpos": "NN",
  "feats": "Number=Sing",
  "head": 0,
  "deprel": "root",
  "start_char": 0,
  "end_char": 5
}


In [8]:
word = en_doc.sentences[0].words[0]
print(word)

{
  "id": 1,
  "text": "Barack",
  "lemma": "Barack",
  "upos": "PROPN",
  "xpos": "NNP",
  "feats": "Number=Sing",
  "head": 4,
  "deprel": "nsubj:pass",
  "start_char": 0,
  "end_char": 6
}
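Because each `Word` can be viewed as a dict, a whole document can be flattened into rows suitable for a table. This is a sketch under the assumption that each word exposes a `to_dict()` method returning the fields printed above. `words_to_rows` is a hypothetical helper, and the stand-in objects exist only so the example runs without a model.

```python
from types import SimpleNamespace


def words_to_rows(doc, fields=("text", "lemma", "upos", "head", "deprel")):
    """Flatten a document into one dict per word, tagged with sentence number."""
    rows = []
    for sent_index, sent in enumerate(doc.sentences, start=1):
        for word in sent.words:
            record = word.to_dict()
            row = {"sentence": sent_index}
            # Missing fields become None instead of raising KeyError.
            row.update({name: record.get(name) for name in fields})
            rows.append(row)
    return rows


# Stand-in word carrying the same fields as the printed example above.
barack = SimpleNamespace(to_dict=lambda: {
    "id": 1, "text": "Barack", "lemma": "Barack", "upos": "PROPN",
    "head": 4, "deprel": "nsubj:pass",
})
sample_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[barack])])

print(words_to_rows(sample_doc))
```

The resulting list of dicts can be handed to `pd.DataFrame(...)` if a tabular view is wanted.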


### More Information

For all information on different data objects, please visit our [data objects page](https://stanfordnlp.github.io/stanza/data_objects.html).


## 5. Resources

Apart from this interactive tutorial, we also provide tutorials on our website that cover a variety of use cases such as how to use different model "packages" for a language, how to use spaCy as a tokenizer, how to process pretokenized text without running the tokenizer, etc. For these tutorials please visit [our Tutorials page](https://stanfordnlp.github.io/stanza/tutorials.html).

Other resources that you may find helpful include:

- [Stanza Homepage](https://stanfordnlp.github.io/stanza/index.html)
- [FAQs](https://stanfordnlp.github.io/stanza/faq.html)
- [GitHub Repo](https://github.com/stanfordnlp/stanza)
- [Reporting Issues](https://github.com/stanfordnlp/stanza/issues)
- [Stanza System Description Paper](http://arxiv.org/abs/2003.07082)

