<a href="https://colab.research.google.com/github/castroborges/castroborges.github.io/blob/main/POS_Tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Marcação de Classe de Palavra

Conhecida em inglês como *Part-of-speech tagging* ou *POS Tagging*, visa classificar automaticamente cada palavra de um texto de acordo com sua classe gramatical (substantivo, verbo, adjetivo etc.)

## Corpora

Corpora com anotação de classe de palavra foram complilados para diferentes idiomas.

Neste colab, exploraremos dois corpora de marcação de classe de palavra em Português: Mac-Morpho e uma parte em Português da Universal Dependencies *(UD_Portuguese-GSD)*


## Mac-Morpho

Baixando o corpus (Fonte [NILC](http://www.nilc.icmc.usp.br/macmorpho/)).

In [1]:
!wget http://www.nilc.icmc.usp.br/macmorpho/macmorpho-v3.tgz
!tar zxvf macmorpho-v3.tgz

--2024-11-20 00:54:12--  http://www.nilc.icmc.usp.br/macmorpho/macmorpho-v3.tgz
Resolving www.nilc.icmc.usp.br (www.nilc.icmc.usp.br)... 143.107.183.225
Connecting to www.nilc.icmc.usp.br (www.nilc.icmc.usp.br)|143.107.183.225|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2463485 (2.3M) [application/x-gzip]
Saving to: ‘macmorpho-v3.tgz’


2024-11-20 00:54:14 (1.94 MB/s) - ‘macmorpho-v3.tgz’ saved [2463485/2463485]

macmorpho-dev.txt
macmorpho-test.txt
macmorpho-train.txt


Carregando os conjuntos de treino, desenvolvimento e teste do corpus


In [10]:
with open('macmorpho-train.txt') as f:
  doc = f.read().split('\n')

traindata = []
for linha in doc:
  sentenca = [tuple(par.split('_')) for par in linha.split()]
  traindata.append(sentenca)

with open('macmorpho-dev.txt') as f:
  doc = f.read().split('\n')

devdata = []
for linha in doc:
  sentenca = [tuple(par.split('_')) for par in linha.split()]
  devdata.append(sentenca)

with open('macmorpho-test.txt') as f:
  doc = f.read().split('\n')

testdata = []
for linha in doc:
  sentenca = [tuple(par.split('_')) for par in linha.split()]
  testdata.append(sentenca)

corpus = traindata + devdata + testdata

Sentença anotada com as classes de palavras no Mac-Morpho

In [11]:
corpus[100]

[('A', 'ART'),
 ('braquiária', 'N'),
 ('decunbens', 'ADJ'),
 ('só', 'PDEN'),
 ('fracassou', 'V'),
 ('na', 'PREP+ART'),
 ('Amazônia', 'NPROP'),
 ('e', 'KC'),
 ('é', 'V'),
 ('claro', 'ADJ'),
 ('não', 'ADV'),
 ('podia', 'V'),
 ('se', 'PROPESS'),
 ('dar', 'V'),
 ('bem', 'ADV'),
 ('no', 'PREP+ART'),
 ('sul', 'NPROP'),
 ('.', 'PU')]

Tamanho do corpus

In [12]:
len(corpus)

49934

Contando cada marcador no corpus

In [13]:
from collections import Counter

tagset = []
for snt in corpus:
  for palavra in snt:
    tagset.append(palavra[1])

Counter(tagset)

Counter({'N': 200977,
         'V': 99621,
         'PREP': 91314,
         'CUR': 2473,
         'NUM': 16181,
         'PREP+ART': 58335,
         'NPROP': 91765,
         'PU': 138865,
         'PROADJ': 15415,
         'PRO-KS': 10919,
         'ADJ': 43269,
         'KC': 23366,
         'ART': 68618,
         'KS': 12099,
         'PCP': 19548,
         'ADV': 24814,
         'PROPESS': 11538,
         'PREP+PROADJ': 1715,
         'PDEN': 5666,
         'PROSUB': 6381,
         'PREP+PROPESS': 533,
         'ADV-KS': 1041,
         'PREP+PRO-KS': 219,
         'PREP+PROSUB': 710,
         'IN': 284,
         'PREP+ADV': 85})

## UD_Portuguese-GSD

Baixando o corpus

In [14]:
!git clone https://github.com/UniversalDependencies/UD_Portuguese-GSD.git

Cloning into 'UD_Portuguese-GSD'...
remote: Enumerating objects: 12585, done.[K
remote: Counting objects: 100% (4316/4316), done.[K
remote: Compressing objects: 100% (2459/2459), done.[K
remote: Total 12585 (delta 2226), reused 2854 (delta 1857), pack-reused 8269 (from 1)[K
Receiving objects: 100% (12585/12585), 34.91 MiB | 18.26 MiB/s, done.
Resolving deltas: 100% (8662/8662), done.


Carregando os conjuntos de treino, desenvolvimento e teste do corpus

In [19]:
import os

def parse(fname):
  with open(fname) as f:
    doc = f.read()
  doc = doc.split('\n\n')

  snts = []
  for j, inst in enumerate(doc[:-1]):
    snt = []
    rows = inst.split('\n')
    for i, elem in enumerate(rows):
      if elem[0] != '#':
        r = elem.split('\t')
        palavra, tag = r[1], r[3]
        if tag != '_':
          snt.append((palavra, tag))
      snts.append(snt)
  return snts

path = 'UD_Portuguese-GSD/'

traindata = parse(os.path.join(path, 'pt_gsd-ud-train.conllu'))
devdata = parse(os.path.join(path, 'pt_gsd-ud-dev.conllu'))
testdata = parse(os.path.join(path, 'pt_gsd-ud-test.conllu'))
corpus = traindata + devdata + testdata

corpus = traindata + devdata + testdata

Sentenca anotada com as classes de palavras no Mac-Morpho

In [20]:
len(corpus)

367900

In [21]:
corpus[0]

[('O', 'DET'),
 ('objetivo', 'NOUN'),
 ('de', 'ADP'),
 ('os', 'DET'),
 ('principais', 'ADJ'),
 ('hotéis', 'NOUN'),
 ('de', 'ADP'),
 ('a', 'DET'),
 ('cidade', 'NOUN'),
 ('é', 'AUX'),
 ('que', 'CCONJ'),
 ('o', 'DET'),
 ('hóspede', 'NOUN'),
 ('jamais', 'ADV'),
 ('tenha', 'AUX'),
 ('que', 'CCONJ'),
 ('sair', 'VERB'),
 ('dali', 'ADV'),
 ('e', 'CCONJ'),
 ('gaste', 'VERB'),
 ('a', 'ADP'),
 ('cada', 'DET'),
 ('minuto', 'NOUN'),
 ('de', 'ADP'),
 ('a', 'DET'),
 ('estadia', 'NOUN'),
 ('.', 'PUNCT')]

Contando cada marcador no corpus

In [22]:
from collections import Counter

tagset = []
for snt in corpus:
  for palavra in snt:
    tagset.append(palavra[1])

Counter(tagset)

Counter({'DET': 1891119,
         'NOUN': 2206437,
         'ADP': 2084110,
         'ADJ': 582072,
         'AUX': 240795,
         'CCONJ': 405391,
         'ADV': 350058,
         'VERB': 1017428,
         'PUNCT': 1617659,
         'PROPN': 1375042,
         'NUM': 332684,
         'PRON': 291703,
         'SCONJ': 52872,
         'SYM': 43102,
         'PART': 20942,
         'X': 14166})