
# **Tag Text**


---


In this guide, we show you how to tag text using the **PyMUSAS [RuleBasedTagger](https://ucrel.github.io/pymusas/api/spacy_api/taggers/rule_based#rulebasedtagger)** so that you can extract token-level **[USAS Semantic Tags](https://ucrel.lancs.ac.uk/usas/)** from the tagged text.

The guide is broken down by language. For each language we are going to:
1. Download the relevant pre-configured PyMUSAS `RuleBasedTagger` spaCy component for the language.
2. Download and use a Natural Language Processing (NLP) pipeline that will tokenize, lemmatize, and Part Of Speech (POS) tag the text. In most cases, this will be a spaCy pipeline. Note that the PyMUSAS `RuleBasedTagger` only requires the data to be tokenized, but having the lemma and POS tag will improve the accuracy of the tagging.
3. Run the PyMUSAS `RuleBasedTagger`.
4. Extract token-level linguistic information from the tagged text, which will include USAS semantic tags.
5. For the Chinese, Italian, Portuguese, Spanish, Welsh, and English taggers, which support Multi Word Expression (MWE) identification and tagging, show how to extract this information from the tagged text as well.
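
The download and setup steps follow the same pattern for every language in this guide. As a rough sketch, the snippet below collects the model names used in the sections that follow and rebuilds the wheel URL from them. The names `MODELS`, `wheel_url`, and `load_tagger` are illustrative helpers introduced here, not part of PyMUSAS or spaCy. `load_tagger` simply mirrors the setup cells of this guide and assumes both models are already installed.

```python
# Illustrative helpers (not part of PyMUSAS or spaCy).
# Model names and exclude lists are copied from the sections of this guide.
MODELS = {
    # language: (spaCy pipeline, PyMUSAS model, spaCy components to exclude)
    "chinese": ("zh_core_web_sm", "cmn_dual_upos2usas_contextual", ["parser", "ner"]),
    "dutch": ("nl_core_news_sm", "nl_single_upos2usas_contextual", ["parser", "ner", "tagger"]),
    "french": ("fr_core_news_sm", "fr_single_upos2usas_contextual", ["parser", "ner"]),
    "italian": ("it_core_news_sm", "it_dual_upos2usas_contextual", ["parser", "ner", "tagger"]),
    "portuguese": ("pt_core_news_sm", "pt_dual_upos2usas_contextual", ["parser", "ner"]),
    "spanish": ("es_core_news_sm", "es_dual_upos2usas_contextual", ["parser", "ner"]),
    "finnish": ("fi_core_news_sm", "fi_single_upos2usas_contextual",
                ["tagger", "parser", "attribute_ruler", "ner"]),
    "english": ("en_core_web_sm", "en_dual_none_contextual", ["parser", "ner"]),
}


def wheel_url(pymusas_model: str, version: str = "0.3.3") -> str:
    """Rebuild the release wheel URL in the form used by the pip commands in this guide."""
    base = "https://github.com/UCREL/pymusas-models/releases/download"
    name = f"{pymusas_model}-{version}"
    return f"{base}/{name}/{name}-py3-none-any.whl"


def load_tagger(language: str):
    """Build a spaCy pipeline with the PyMUSAS tagger added, as in the setup cells."""
    import spacy  # imported lazily so the table above is usable without spaCy

    spacy_model, pymusas_model, exclude = MODELS[language]
    nlp = spacy.load(spacy_model, exclude=exclude)
    tagger_pipeline = spacy.load(pymusas_model)
    nlp.add_pipe("pymusas_rule_based_tagger", source=tagger_pipeline)
    return nlp


print(wheel_url(MODELS["english"][1]))
```

Keeping the per-language differences in one table makes it easier to see that only the model names and the excluded components change between sections.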



---


# **CHINESE**


---
First download both the [Chinese PyMUSAS RuleBasedTagger](https://github.com/UCREL/pymusas-models/releases/tag/cmn_dual_upos2usas_contextual-0.3.3) spaCy component and the small [Chinese spaCy model](https://spacy.io/models/zh):


In [1]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/cmn_dual_upos2usas_contextual-0.3.3/cmn_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
!python -m spacy download zh_core_web_sm

Collecting cmn-dual-upos2usas-contextual==0.3.3
  Downloading https://github.com/UCREL/pymusas-models/releases/download/cmn_dual_upos2usas_contextual-0.3.3/cmn_dual_upos2usas_contextual-0.3.3-py3-none-any.whl (1.3 MB)
Collecting pymusas<0.4.0,>=0.3.0 (from cmn-dual-upos2usas-contextual==0.3.3)
  Downloading pymusas-0.3.0-py3-none-any.whl (51 kB)
Collecting click<8.1.0 (from pymusas<0.4.0,>=0.3.0->cmn-dual-upos2usas-contextual==0.3.3)
  Downloading click-8.0.4-py3-none-any.whl (97 kB)
Installing collected packages: click, pymusas, cmn-dual-upos2usas-contextual
  Attempting uninstall: click
    Found existing installation: click 8.1.6

Then create the tagger, in a Python script:

**NOTE:** Currently, there is no lemmatization component in the spaCy pipeline for Chinese.

In [2]:
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('zh_core_web_sm', exclude=['parser', 'ner'])
# Load the Chinese PyMUSAS rule-based tagger in a separate spaCy pipeline
chinese_tagger_pipeline = spacy.load('cmn_dual_upos2usas_contextual')
# Adds the Chinese PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=chinese_tagger_pipeline)

<pymusas.spacy_api.taggers.rule_based.RuleBasedTagger at 0x795dedde1680>

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). \
The example text is taken from the Chinese Wikipedia page on the topic of [The Nile River](https://zh.wikipedia.org/wiki/%E5%B0%BC%E7%BD%97%E6%B2%B3):

In [3]:
text = "尼罗河 是一条流經非洲東部與北部的河流，與中非地區的剛果河、非洲南部的赞比西河以及西非地区的尼日尔河並列非洲最大的四個河流系統。"

output_doc = nlp(text)

print(f'Text\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.pos_}\t{token._.pymusas_tags}')

Text	POS	USAS Tags
尼罗河	PROPN	['Z2']
是	VERB	['A3', 'Z5']
一	NUM	['N1']
条	NUM	['G2.1/P1', 'S7.4-', 'A1.7+', 'S8-']
流經	NOUN	['Z99']
非洲	PROPN	['Z2']
東部	NOUN	['Z99']
與北部	PROPN	['Z99']
的	PART	['Z5']
河流	NOUN	['W3/M4', 'N5+']
，	PUNCT	['PUNCT']
與	VERB	['Z99']
中非	PROPN	['Z99']
地區	NOUN	['Z99']
的	PART	['Z5']
剛果河	PROPN	['Z99']
、	PUNCT	['PUNCT']
非洲	PROPN	['Z2']
南部	NOUN	['M6']
的	PART	['Z5']
赞比西河	NOUN	['Z99']
以及	CCONJ	['N5++', 'N5.2+', 'A13.3', 'Z5']
西非	PROPN	['Z99']
地区	NOUN	['A1.1.1', 'B3/X1', 'G1.1c', 'W3', 'F4/M7', 'K2', 'M7', 'A4.1', 'N3.6', 'B1', 'T1.1', 'O4.4', 'N5.1-', 'S5+c', 'B3', 'Y1', 'C1/H1@']
的	PART	['Z5']
尼日尔河	NOUN	['Z99']
並列	VERB	['Z99']
非洲	PROPN	['Z2']
最	ADV	['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
大	VERB	['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
的	PART	['Z5']
四	NUM	['N1']
個	NUM	['N1']
河流	NOUN	['W3/M4', 'N5+']
系統	NOUN	['Z99']
。	PUNCT	['PUNCT']
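
Many tokens in the output above carry the tag `Z99`, which in the USAS tagset marks an unmatched token. A quick way to gauge how much of a text the lexicon covered is to count those tokens. The sketch below works on plain `(text, tags)` pairs copied from a few rows of the output above. `unmatched_ratio` is a hypothetical helper name; with a real `Doc`, the pairs could be built as `[(t.text, t._.pymusas_tags) for t in output_doc]`.

```python
from typing import Iterable, List, Tuple


def unmatched_ratio(rows: Iterable[Tuple[str, List[str]]]) -> float:
    """Fraction of non-punctuation tokens whose only tag is Z99 (unmatched)."""
    considered = 0
    unmatched = 0
    for _text, tags in rows:
        if tags == ["PUNCT"]:
            continue  # punctuation rows carry the PUNCT tag and are skipped here
        considered += 1
        if tags == ["Z99"]:
            unmatched += 1
    return unmatched / considered if considered else 0.0


# A few rows copied from the Chinese output above.
rows = [
    ("尼罗河", ["Z2"]),
    ("是", ["A3", "Z5"]),
    ("一", ["N1"]),
    ("流經", ["Z99"]),
    ("非洲", ["Z2"]),
    ("東部", ["Z99"]),
    ("，", ["PUNCT"]),
]
print(f"{unmatched_ratio(rows):.2f}")
```

Skipping punctuation is a design choice of this sketch, so that the ratio reflects only word tokens.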


For Chinese, the tagger also identifies and tags Multi-Word Expressions (MWEs). To find these MWEs, you can run the following:

In [4]:
print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')
for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

Text	POS	MWE start and end index	USAS Tags
最	ADV	(28, 30)	['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
大	VERB	(28, 30)	['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
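
The loop above prints one row per token, so a two-token expression shows up as two rows with the same index pair. To get one entry per expression, the tokens can be grouped by that shared `(start, end)` pair. This is a minimal sketch on plain tuples copied from the output above. `group_mwes` is a hypothetical helper, and judging from the output, the indexes appear to be token offsets with an exclusive end.

```python
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int]
Row = Tuple[str, Span, List[str]]


def group_mwes(rows: Iterable[Row]) -> Dict[Span, Dict[str, List[str]]]:
    """Group token rows that share an MWE span covering more than one token."""
    groups: Dict[Span, Dict[str, List[str]]] = {}
    for text, (start, end), tags in rows:
        if end - start > 1:
            # All tokens of one MWE share the same tags, so store them once.
            entry = groups.setdefault((start, end), {"tokens": [], "tags": tags})
            entry["tokens"].append(text)
    return groups


# Rows as (token text, first MWE index pair, USAS tags), copied from the output above.
tags = ['A11.1+', 'N5+++', 'N3.2+++', 'A11.1+++', 'N5.1+', 'O2/M4', 'O3']
rows = [
    ("的", (27, 28), ['Z5']),
    ("最", (28, 30), tags),
    ("大", (28, 30), tags),
]
for span, entry in group_mwes(rows).items():
    print(span, "".join(entry["tokens"]), entry["tags"])
```

With a real `Doc`, the rows could be built from `token.text`, `token._.pymusas_mwe_indexes[0]`, and `token._.pymusas_tags`, as in the loop above.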




---


# **DUTCH**


---
First download both the [Dutch PyMUSAS RuleBasedTagger](https://github.com/UCREL/pymusas-models/releases/tag/nl_single_upos2usas_contextual-0.3.3) spaCy component and the small [Dutch spaCy model](https://spacy.io/models/nl):


In [13]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/nl_single_upos2usas_contextual-0.3.3/nl_single_upos2usas_contextual-0.3.3-py3-none-any.whl
!python -m spacy download nl_core_news_sm

Collecting nl-single-upos2usas-contextual==0.3.3
  Downloading https://github.com/UCREL/pymusas-models/releases/download/nl_single_upos2usas_contextual-0.3.3/nl_single_upos2usas_contextual-0.3.3-py3-none-any.whl (159 kB)
Installing collected packages: nl-single-upos2usas-contextual
Successfully installed nl-single-upos2usas-contextual-0.3.3
Collecting nl-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-3.6.0/nl_core_news_sm-3.6.0-py3-none-any.whl (12.8 MB)
Installing collected packages: nl-core-news-sm
Successfully installed nl-core-news-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('nl_core_news_sm')


Then create the tagger, in a Python script:

In [14]:
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('nl_core_news_sm', exclude=['parser', 'ner', 'tagger'])
# Load the Dutch PyMUSAS rule-based tagger in a separate spaCy pipeline
dutch_tagger_pipeline = spacy.load('nl_single_upos2usas_contextual')
# Adds the Dutch PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=dutch_tagger_pipeline)

<pymusas.spacy_api.taggers.rule_based.RuleBasedTagger at 0x795de720f340>

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last). \
The example text is taken from the Dutch Wikipedia page on the topic of [The Nile River](https://nl.wikipedia.org/wiki/Nijl):

In [15]:
text = "De Nijl is met een lengte van 5499 tot 6695 km de langste of de op een na langste rivier van de wereld."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')

Text	Lemma	POS	USAS Tags
De	de	DET	['Z5']
Nijl	Nijl	PROPN	['Z99']
is	zijn	AUX	['A3+', 'Z5']
met	met	ADP	['Z5']
een	een	DET	['Z5']
lengte	lengte	NOUN	['N3.7', 'T1.3', 'M4']
van	van	ADP	['Z5']
5499	5499	NUM	['N1']
tot	tot	ADP	['Z99']
6695	6695	NUM	['N1']
km	km	SYM	['Z99']
de	de	DET	['Z5']
langste	lang	ADJ	['N3.7+', 'T1.3+', 'N3.3+', 'N3.2+', 'X7+']
of	of	CCONJ	['Z5']
de	de	DET	['Z5']
op	op	ADP	['A5.1+', 'G2.2+', 'A1.1.1', 'M6', 'Z5']
een	één	NUM	['N1', 'T3', 'T1.2', 'Z8']
na	na	ADP	['N4', 'Z5']
langste	lang	ADJ	['N3.7+', 'T1.3+', 'N3.3+', 'N3.2+', 'X7+']
rivier	rivier	NOUN	['W3/M4', 'N5+']
van	van	ADP	['Z5']
de	de	DET	['Z5']
wereld	wereld	NOUN	['W1', 'S5+c', 'A4.1', 'N5+']
.	.	PUNCT	['PUNCT']
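
The tags in the output above follow the USAS notation: an upper-case letter for the discourse field, digits (optionally dotted) for subdivisions, optional `+` or `-` signs for the position on a semantic scale, and a slash joining two tags for double membership. The sketch below splits a tag string into those parts with a regular expression. `parse_usas_tag` is a hypothetical helper, the pattern is an assumption fitted to the tags printed in this guide rather than an official grammar, and strings that do not fit (such as `PUNCT`) come back as `None`.

```python
import re
from typing import Dict, List, Optional

# Assumed pattern: field letter, dotted digits, optional +/- signs,
# then any trailing lower-case letters or %/@ markers.
_TAG_PART = re.compile(r"^([A-Z])(\d+(?:\.\d+)*)([+-]*)([a-z%@]*)$")


def parse_usas_tag(tag: str) -> List[Optional[Dict[str, str]]]:
    """Split a tag such as 'W3/M4' into one dict per slash-separated part."""
    parts: List[Optional[Dict[str, str]]] = []
    for part in tag.split("/"):
        match = _TAG_PART.match(part)
        if match is None:
            parts.append(None)  # not a USAS-style tag, e.g. 'PUNCT'
            continue
        field, subdivision, polarity, markers = match.groups()
        parts.append({
            "field": field,
            "subdivision": subdivision,
            "polarity": polarity,
            "markers": markers,
        })
    return parts


for example in ["N3.7+", "W3/M4", "S5+c", "PUNCT"]:
    print(example, parse_usas_tag(example))
```

Returning `None` for unparseable parts, instead of raising, keeps the helper usable in a loop over every token of a document.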




---


# **FRENCH**


---
First download both the [French PyMUSAS RuleBasedTagger](https://github.com/UCREL/pymusas-models/releases/tag/fr_single_upos2usas_contextual-0.3.3) spaCy component and the small [French spaCy model](https://spacy.io/models/fr):



In [16]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/fr_single_upos2usas_contextual-0.3.3/fr_single_upos2usas_contextual-0.3.3-py3-none-any.whl
!python -m spacy download fr_core_news_sm

Collecting fr-single-upos2usas-contextual==0.3.3
  Downloading https://github.com/UCREL/pymusas-models/releases/download/fr_single_upos2usas_contextual-0.3.3/fr_single_upos2usas_contextual-0.3.3-py3-none-any.whl (88 kB)
Installing collected packages: fr-single-upos2usas-contextual
Successfully installed fr-single-upos2usas-contextual-0.3.3
Collecting fr-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.6.0/fr_core_news_sm-3.6.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


Then create the tagger, in a Python script:

In [17]:
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('fr_core_news_sm', exclude=['parser', 'ner'])
# Load the French PyMUSAS rule-based tagger in a separate spaCy pipeline
french_tagger_pipeline = spacy.load('fr_single_upos2usas_contextual')
# Adds the French PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=french_tagger_pipeline)

<pymusas.spacy_api.taggers.rule_based.RuleBasedTagger at 0x795debee0180>

The tagger is now set up for tagging text through the spaCy pipeline like so (this example follows on from the last).\
The example text is taken from the French Wikipedia page on the topic of [The Nile River](https://fr.wikipedia.org/wiki/Nil):

In [18]:
text = "Le Nil est un fleuve d'Afrique. Avec une longueur d'environ 6 700 km, c'est avec le fleuve Amazone, le plus long fleuve du monde."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')

Text	Lemma	POS	USAS Tags
Le	le	DET	['Z5']
Nil	Nil	PROPN	['Z99']
est	être	AUX	['M6']
un	un	DET	['Z5']
fleuve	fleuve	NOUN	['W3/M4', 'N5+']
d'	de	ADP	['Z5']
Afrique	Afrique	PROPN	['Z99']
.	.	PUNCT	['PUNCT']
Avec	avec	ADP	['Z5']
une	un	DET	['Z5']
longueur	longueur	NOUN	['N3.7', 'T1.3', 'M4']
d'	de	ADP	['Z5']
environ	environ	ADV	['Z5']
6	6	DET	['Z99']
700	700	NUM	['N1']
km	kilomètre	NOUN	['N3.3', 'N3.7']
,	,	PUNCT	['PUNCT']
c'	ce	PRON	['Z8']
est	être	VERB	['M6']
avec	avec	ADP	['Z5']
le	le	DET	['Z5']
fleuve	fleuve	NOUN	['W3/M4', 'N5+']
Amazone	Amazone	PROPN	['Z99']
,	,	PUNCT	['PUNCT']
le	le	DET	['Z5']
plus	plus	ADV	['Z5']
long	long	ADJ	['Z99']
fleuve	fleuve	NOUN	['W3/M4', 'N5+']
du	de	ADP	['Z5']
monde	monde	NOUN	['Z99']
.	.	PUNCT	['PUNCT']
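
Each token above comes with a list of candidate tags. As we understand the PyMUSAS documentation, the list is ranked with the most likely tag first, so a simple summary of a text is the frequency of each token's first tag. The sketch below assumes that ordering and runs on rows copied from the first sentence of the output above. `first_tag_counts` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def first_tag_counts(rows: Iterable[Tuple[str, List[str]]]) -> Counter:
    """Count the first (assumed most likely) tag of every token that has tags."""
    counts: Counter = Counter()
    for _text, tags in rows:
        if tags:  # guard against tokens with an empty tag list
            counts[tags[0]] += 1
    return counts


# Rows copied from the first French sentence in the output above.
rows = [
    ("Le", ["Z5"]),
    ("Nil", ["Z99"]),
    ("est", ["M6"]),
    ("un", ["Z5"]),
    ("fleuve", ["W3/M4", "N5+"]),
    ("d'", ["Z5"]),
    ("Afrique", ["Z99"]),
    (".", ["PUNCT"]),
]
for tag, count in first_tag_counts(rows).most_common():
    print(f"{tag}\t{count}")
```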




---


# **ITALIAN**


---



In [19]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/it_dual_upos2usas_contextual-0.3.3/it_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
!python -m spacy download it_core_news_sm

Collecting it-dual-upos2usas-contextual==0.3.3
  Downloading https://github.com/UCREL/pymusas-models/releases/download/it_dual_upos2usas_contextual-0.3.3/it_dual_upos2usas_contextual-0.3.3-py3-none-any.whl (522 kB)
Installing collected packages: it-dual-upos2usas-contextual
Successfully installed it-dual-upos2usas-contextual-0.3.3
Collecting it-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-3.6.0/it_core_news_sm-3.6.0-py3-none-any.whl (13.0 MB)
Installing collected packages: it-core-news-sm
Successfully installed it-core-news-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')


In [20]:
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('it_core_news_sm', exclude=['parser', 'ner', 'tagger'])
# Load the Italian PyMUSAS rule-based tagger in a separate spaCy pipeline
italian_tagger_pipeline = spacy.load('it_dual_upos2usas_contextual')
# Adds the Italian PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=italian_tagger_pipeline)

<pymusas.spacy_api.taggers.rule_based.RuleBasedTagger at 0x795de4f17600>

In [21]:
text = "Il Nilo è un fiume africano lungo 6.852 km che attraversa otto stati dell'Africa. Tradizionalmente considerato il fiume più lungo del mondo, contende il primato della lunghezza al Rio delle Amazzoni."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')

Text	Lemma	POS	USAS Tags
Il	il	DET	['Z5']
Nilo	Nilo	PROPN	['Z99']
è	essere	AUX	['A5.1', 'S7.1++', 'X3.2', 'Q2.2', 'A8', 'N3.1%']
un	uno	DET	['Z5']
fiume	fiume	NOUN	['W3']
africano	africano	ADJ	['Z2']
lungo	lungo	ADJ	['N3.7+']
6.852	6.852	NUM	['N1']
km	chilometro	NOUN	['N3.3']
che	che	PRON	['Z8']
attraversa	attraversare	VERB	['M1', 'M6', 'S8-', 'A1.8+', 'A6.3+', 'F4/L2', 'O4.4', 'Q1.2', 'E3-', 'S1.1.1', 'S9@']
otto	otto	NUM	['N1']
stati	stato	NOUN	['G2.1/H1', 'B2', 'A3']
dell'	di il	ADP	['Z99']
Africa	Africa	PROPN	['Z2']
.	.	PUNCT	['PUNCT']
Tradizionalmente	Tradizionalmente	NOUN	['Z99']
considerato	considerare	VERB	['A5.1', 'N2', 'A11.1+', 'Q2.2', 'S1.1.1', 'Q1.3', 'S9%', 'X2.1', 'X2.4', 'X6']
il	il	DET	['Z5']
fiume	fiume	NOUN	['W3']
più	più	ADV	['N3.3+', 'A13.3']
lungo	lungo	ADJ	['N3.3+', 'A13.3']
del	di il	ADP	['Z5']
mondo	mondo	NOUN	['W1']
,	,	PUNCT	['PUNCT']
contende	contendere	VERB	['S7.3']
il	il	DET	['Z5']
primato	primato	NOUN	['A5.1+++', 'A11.1+']
della	di il	ADP	['Z99']
lunghezz

In [22]:
print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

Text	POS	MWE start and end index	USAS Tags
più	ADV	(20, 22)	['N3.3+', 'A13.3']
lungo	ADJ	(20, 22)	['N3.3+', 'A13.3']




---


# **PORTUGUESE**


---



In [23]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/pt_dual_upos2usas_contextual-0.3.3/pt_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
!python -m spacy download pt_core_news_sm

Collecting pt-dual-upos2usas-contextual==0.3.3
  Downloading https://github.com/UCREL/pymusas-models/releases/download/pt_dual_upos2usas_contextual-0.3.3/pt_dual_upos2usas_contextual-0.3.3-py3-none-any.whl (286 kB)
Installing collected packages: pt-dual-upos2usas-contextual
Successfully installed pt-dual-upos2usas-contextual-0.3.3
Collecting pt-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.6.0/pt_core_news_sm-3.6.0-py3-none-any.whl (13.0 MB)
Installing collected packages: pt-core-news-sm
Successfully installed pt-core-news-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_sm')


In [24]:
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('pt_core_news_sm', exclude=['parser', 'ner'])
# Load the Portuguese PyMUSAS rule-based tagger in a separate spaCy pipeline
portuguese_tagger_pipeline = spacy.load('pt_dual_upos2usas_contextual')
# Adds the Portuguese PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=portuguese_tagger_pipeline)

<pymusas.spacy_api.taggers.rule_based.RuleBasedTagger at 0x795ddeff4d00>

In [25]:
text = "Todos estes estudos levam a que o comprimento de ambos os rios permaneça em aberto, continuando por isso o debate e como tal, continuando-se a considerar o Nilo como o rio mais longo."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')

Text	Lemma	POS	USAS Tags
Todos	todo	DET	['N5.1+']
estes	este	DET	['Z5', 'Z8']
estudos	estudo	NOUN	['P1', 'X2.4', 'H2', 'Q1.2', 'C1']
levam	levar	VERB	['A9+', 'T1.3', 'C1', 'A1.1.1', 'M2', 'S7.1-', 'A2.1+', 'X2.4', 'S6+', 'S7.4+', 'N3', 'A2.1+', 'P1', 'M1', 'X2.5+', 'F1@', 'F2@', 'Q1.2@', 'B3@']
a	a	SCONJ	['M6', 'Z5']
que	que	SCONJ	['A13.3', 'A6.1+', 'Z5', 'Z8']
o	o	DET	['Z5']
comprimento	comprimento	NOUN	['N3.7', 'T1.3', 'M4']
de	de	ADP	['Z5']
ambos	ambos	DET	['N5']
os	o	DET	['Z5']
rios	rio	NOUN	['W3/M4', 'N5+']
permaneça	permanecer	VERB	['T2++', 'M8', 'N5.2+']
em	em	SCONJ	['A5.1+', 'G2.2+', 'A1.1.1', 'M6', 'O4.2+', 'Z5']
aberto	abrir	VERB	['A10+', 'T2+']
,	,	PUNCT	['PUNCT']
continuando	continuar	VERB	['Z99']
por	por	ADP	['N4', 'Z5', 'T1.2']
isso	isso	PRON	['N4', 'Z5', 'T1.2']
o	o	DET	['Z5']
debate	debate	NOUN	['Q2.1', 'Q2.1/A6.1-', 'Q2.1/E3-', 'Q2.2']
e	e	CCONJ	['Z5']
como	como	ADP	['Z5']
tal	tal	PRON	['Z5']
,	,	PUNCT	['PUNCT']
continuando-se	continuar se	VERB	['Z99']
a	a	SCONJ	['M6',

In [26]:
print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

Text	POS	MWE start and end index	USAS Tags
por	ADP	(17, 19)	['N4', 'Z5', 'T1.2']
isso	PRON	(17, 19)	['N4', 'Z5', 'T1.2']
mais	ADV	(33, 35)	['T1.3++', 'N3.7++', 'N3.3++', 'N3.2++']
longo	ADJ	(33, 35)	['T1.3++', 'N3.7++', 'N3.3++', 'N3.2++']




---


# **SPANISH**


---



In [5]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/es_dual_upos2usas_contextual-0.3.3/es_dual_upos2usas_contextual-0.3.3-py3-none-any.whl
!python -m spacy download es_core_news_sm

Collecting es-dual-upos2usas-contextual==0.3.3
  Downloading https://github.com/UCREL/pymusas-models/releases/download/es_dual_upos2usas_contextual-0.3.3/es_dual_upos2usas_contextual-0.3.3-py3-none-any.whl (208 kB)
Installing collected packages: es-dual-upos2usas-contextual
Successfully installed es-dual-upos2usas-contextual-0.3.3
Collecting es-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.6.0/es_core_news_sm-3.6.0-py3-none-any.whl (12.9 MB)
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')


In [6]:
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('es_core_news_sm', exclude=['parser', 'ner'])
# Load the Spanish PyMUSAS rule-based tagger in a separate spaCy pipeline
spanish_tagger_pipeline = spacy.load('es_dual_upos2usas_contextual')
# Adds the Spanish PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=spanish_tagger_pipeline)



<pymusas.spacy_api.taggers.rule_based.RuleBasedTagger at 0x795dd7cfd580>

In [7]:
text = "Los Países Bajos son un país soberano ubicado al noreste de la Europa continental y el país constituyente más grande de los cuatro que, junto con las islas de Aruba, Curazao y San Martín, forman el Reino de los Países Bajos."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')

Text	Lemma	POS	USAS Tags
Los	el	DET	['Z5']
Países	Países	PROPN	['Z2']
Bajos	Bajos	PROPN	['Z2']
son	ser	AUX	['A3+', 'L1', 'Z5']
un	uno	DET	['Z5', 'N1']
país	país	NOUN	['G1.1c', 'W3', 'M7']
soberano	soberano	ADJ	['Z99']
ubicado	ubicado	ADJ	['Z99']
al	al	ADP	['Z5']
noreste	noreste	NOUN	['Z99']
de	de	ADP	['Z5']
la	el	DET	['Z5']
Europa	Europa	PROPN	['Z2', 'S7', 'M7']
continental	continental	PROPN	['Z99']
y	y	CCONJ	['Z5', 'A1.8+']
el	el	DET	['Z5']
país	país	NOUN	['G1.1c', 'W3', 'M7']
constituyente	constituyente	ADJ	['Z99']
más	más	ADV	['A13.3', 'N6++', 'Z5']
grande	grande	ADJ	['N3.1+/A6.1+/A13.2+', 'A5']
de	de	ADP	['Z5']
los	el	DET	['Z5']
cuatro	cuatro	NUM	['N1']
que	que	PRON	['Z5', 'Z8']
,	,	PUNCT	['PUNCT']
junto	junto	ADJ	['A2.2', 'S5+', 'A1.8+']
con	con	ADP	['Z5', 'A4.1']
las	el	DET	['Z5']
islas	isla	NOUN	['W3M7']
de	de	ADP	['Z5']
Aruba	Aruba	PROPN	['Z99']
,	,	PUNCT	['PUNCT']
Curazao	Curazao	PROPN	['Z99']
y	y	CCONJ	['Z5', 'A1.8+']
San	San	PROPN	['S9', 'S2', 'A4.1']
Martín	Martín	PROPN	['Z

In [8]:
print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

Text	POS	MWE start and end index	USAS Tags
Países	PROPN	(1, 3)	['Z2']
Bajos	PROPN	(1, 3)	['Z2']
Países	PROPN	(42, 44)	['Z2']
Bajos	PROPN	(42, 44)	['Z2']




---


# **FINNISH**


---



In [27]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/fi_single_upos2usas_contextual-0.3.3/fi_single_upos2usas_contextual-0.3.3-py3-none-any.whl
!python -m spacy download fi_core_news_sm

Collecting fi-single-upos2usas-contextual==0.3.3
  Downloading https://github.com/UCREL/pymusas-models/releases/download/fi_single_upos2usas_contextual-0.3.3/fi_single_upos2usas_contextual-0.3.3-py3-none-any.whl (664 kB)
Installing collected packages: fi-single-upos2usas-contextual
Successfully installed fi-single-upos2usas-contextual-0.3.3
Collecting fi-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fi_core_news_sm-3.6.0/fi_core_news_sm-3.6.0-py3-none-any.whl (14.3 MB)
Installing collected packages: fi-core-news-sm
Successfully installed fi-core-news-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('fi_core_news_sm')


In [28]:
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load("fi_core_news_sm", exclude=['tagger', 'parser', 'attribute_ruler', 'ner'])
# Load the Finnish PyMUSAS rule-based tagger in a separate spaCy pipeline
finnish_tagger_pipeline = spacy.load('fi_single_upos2usas_contextual')
# Adds the Finnish PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=finnish_tagger_pipeline)

<pymusas.spacy_api.taggers.rule_based.RuleBasedTagger at 0x795ddc09fe80>

In [29]:
text = "Pankki on instituutio, joka tarjoaa finanssipalveluita, erityisesti maksuliikenteen hoitoa ja luotonantoa."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
print(f'{"Text":<20}{"Lemma":<20}{"POS":<8}USAS Tags')
for token in output_doc:
    print(f'{token.text:<20}{token.lemma_:<20}{token.pos_:<8}{token._.pymusas_tags}')

Text	Lemma	POS	USAS Tags
Pankki	pankki	NOUN	['I1/H1', 'K5.2/I1.1']
on	olla	AUX	['A3+', 'A1.1.1', 'M6', 'Z5']
instituutio	instituutio	NOUN	['S5+']
,	,	PUNCT	['PUNCT']
joka	joka	PRON	['Z8', 'N5.1+']
tarjoaa	tarjota	VERB	['A9-', 'Q2.2', 'F1', 'S6+', 'A7+', 'I2.2']
finanssipalveluita	finanssipalvelu	NOUN	['Z99']
,	,	PUNCT	['PUNCT']
erityisesti	erityisesti	ADV	['A14']
maksuliikenteen	maksuliikenteen	NOUN	['Z99']
hoitoa	hoito	NOUN	['B3', 'S4']
ja	ja	CCONJ	['Z5']
luotonantoa	luotonanto	NOUN	['Z99']
.	.	PUNCT	['PUNCT']
Text                Lemma               POS     USAS Tags
Pankki              pankki              NOUN    ['I1/H1', 'K5.2/I1.1']
on                  olla                AUX     ['A3+', 'A1.1.1', 'M6', 'Z5']
instituutio         instituutio         NOUN    ['S5+']
,                   ,                   PUNCT   ['PUNCT']
joka                joka                PRON    ['Z8', 'N5.1+']
tarjoaa             tarjota             VERB    ['A9-', 'Q2.2', 'F1', 'S6+', 'A7+', 'I2.2']
finans



---


# **ENGLISH**

---



In [9]:
!pip install https://github.com/UCREL/pymusas-models/releases/download/en_dual_none_contextual-0.3.3/en_dual_none_contextual-0.3.3-py3-none-any.whl
!python -m spacy download en_core_web_sm

Collecting en-dual-none-contextual==0.3.3
  Downloading https://github.com/UCREL/pymusas-models/releases/download/en_dual_none_contextual-0.3.3/en_dual_none_contextual-0.3.3-py3-none-any.whl (902 kB)
Installing collected packages: en-dual-none-contextual
Successfully installed en-dual-none-contextual-0.3.3
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [10]:
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])
# Load the English PyMUSAS rule-based tagger in a separate spaCy pipeline
english_tagger_pipeline = spacy.load('en_dual_none_contextual')
# Adds the English PyMUSAS rule-based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)

<pymusas.spacy_api.taggers.rule_based.RuleBasedTagger at 0x795dd8b2aac0>

In [11]:
text = "The Nile is a major north-flowing river in Northeastern Africa."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')

Text	Lemma	POS	USAS Tags
The	the	DET	['Z5']
Nile	Nile	PROPN	['Z2']
is	be	AUX	['A3+', 'Z5']
a	a	DET	['Z5']
major	major	ADJ	['A11.1+', 'N3.2+']
north	north	NOUN	['M6']
-	-	PUNCT	['PUNCT']
flowing	flow	VERB	['M4', 'M1']
river	river	NOUN	['W3/M4', 'N5+']
in	in	ADP	['Z5']
Northeastern	Northeastern	PROPN	['Z1mf', 'Z3c']
Africa	Africa	PROPN	['Z1mf', 'Z3c']
.	.	PUNCT	['PUNCT']
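
The printed table above is tab-separated, so the same rows can be saved for later analysis instead of only being printed. The sketch below writes them with Python's standard `csv` module, joining each token's tags with a space so that they fit in one column. `write_tsv` is a hypothetical helper and the rows are copied from the English output above; with a real `Doc` they could be built from `token.text`, `token.lemma_`, `token.pos_`, and `token._.pymusas_tags`.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple

Row = Tuple[str, str, str, List[str]]


def write_tsv(rows: Iterable[Row], handle: TextIO) -> None:
    """Write a header plus one tab-separated line per token to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Text", "Lemma", "POS", "USAS Tags"])
    for text, lemma, pos, tags in rows:
        # USAS tags contain no spaces, so a space is a safe separator here.
        writer.writerow([text, lemma, pos, " ".join(tags)])


# Rows copied from the English output above.
rows = [
    ("The", "the", "DET", ["Z5"]),
    ("Nile", "Nile", "PROPN", ["Z2"]),
    ("river", "river", "NOUN", ["W3/M4", "N5+"]),
]
buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_tsv(rows, buffer)
print(buffer.getvalue())
```

Using `csv.writer` instead of manual string joins means that a token containing a tab or a quote character is escaped rather than silently breaking the column layout.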


In [12]:
print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')

Text	POS	MWE start and end index	USAS Tags
Northeastern	PROPN	(10, 12)	['Z1mf', 'Z3c']
Africa	PROPN	(10, 12)	['Z1mf', 'Z3c']
