<a href="https://colab.research.google.com/github/castroborges/castroborges.github.io/blob/main/Parsing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Constituency and Dependency Relations

**Parsing** is the task of analyzing the syntax of a sentence and producing a representation of the relations between its components as a structure (a "tree").

## Constituency Parsing

Constituency parsing is based on the idea that groups of words can behave as single units, called constituents.

We will use the [Berkeley Neural Parser](https://spacy.io/universe/project/self-attentive-parser) (`benepar`) to parse English text.

Installing the dependencies:

In [1]:
!pip install sentencepiece
!pip install benepar
!pip install -U pip setuptools wheel
!pip install -U spacy[cuda102]
!python3 -m spacy download en_core_web_md



Importing the dependencies and initializing the models

In [6]:
import benepar, spacy

benepar.download('benepar_en3')

[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!


True

In [8]:
nlp = spacy.load('en_core_web_md')
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})



<benepar.integrations.spacy_plugin.BeneparComponent at 0x7d2f2b0a99c0>

Extracting the constituency tree

In [9]:
texto = "Hal, switch to manual hibernation control."

doc = nlp(texto)



In [15]:
sent = list(doc.sents)[0]
sent._.parse_string

'(FRAG (INTJ (UH Hal)) (, ,) (VP (VB switch) (PP (IN to) (NP (JJ manual) (NN hibernation) (NN control)))) (. .))'
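The bracketed string above encodes a tree: each `(LABEL ...)` group is a node, and bare words are leaves. As a rough illustration of how to read it, here is a minimal sketch that parses the string with the standard library only. The helpers `parse_brackets`, `leaves` and `phrase_labels` are hypothetical names introduced for this example and are not part of benepar or spaCy; in practice, `nltk.Tree.fromstring` covers the same need.

```python
# The parse string printed by `sent._.parse_string` above.
PARSE = ("(FRAG (INTJ (UH Hal)) (, ,) (VP (VB switch) (PP (IN to) "
         "(NP (JJ manual) (NN hibernation) (NN control)))) (. .))")


def parse_brackets(s):
    """Turn a bracketed parse string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of the tree in sentence order."""
    words = []
    for child in node[1]:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words


def phrase_labels(node):
    """Labels of nodes that dominate other nodes (i.e. not plain POS tags)."""
    label, children = node
    found = [label] if any(not isinstance(c, str) for c in children) else []
    for child in children:
        if not isinstance(child, str):
            found.extend(phrase_labels(child))
    return found


tree = parse_brackets(PARSE)
print(leaves(tree))
print(phrase_labels(tree))
```

The phrase-level labels come out in the same top-down order in which the constituents are nested, which makes the string easier to relate to the tree it describes.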

Iterating over the constituents

In [17]:
for const in sent._.constituents:
  if len(const._.labels) != 0:
    print(const._.labels, const)

('FRAG',) Hal, switch to manual hibernation control.
('INTJ',) Hal
('VP',) switch to manual hibernation control
('PP',) to manual hibernation control
('NP',) manual hibernation control


FRAG: Fragment

INTJ: Interjection

VP: Verb phrase

PP: Prepositional phrase

NP: Noun phrase
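The glossary above can be kept next to the parser output as a small lookup table. The sketch below is only illustrative: `LABEL_NAMES` and `describe` are hypothetical names, the `(labels, text)` pairs are copied by hand from the output printed earlier, and labels missing from the glossary are passed through unchanged.

```python
# Glossary of the labels that appear in this notebook's example.
LABEL_NAMES = {
    "FRAG": "Fragment",
    "INTJ": "Interjection",
    "VP": "Verb phrase",
    "PP": "Prepositional phrase",
    "NP": "Noun phrase",
}


def describe(labels):
    """Spell out a tuple of labels; unknown labels are returned as-is."""
    return " + ".join(LABEL_NAMES.get(label, label) for label in labels)


# (labels, text) pairs copied from the constituent listing above.
constituents = [
    (("FRAG",), "Hal, switch to manual hibernation control."),
    (("INTJ",), "Hal"),
    (("VP",), "switch to manual hibernation control"),
    (("PP",), "to manual hibernation control"),
    (("NP",), "manual hibernation control"),
]

for labels, text in constituents:
    print(f"{describe(labels):22} {text}")
```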

# Dependency Parsing

In a dependency tree, the structure of a sentence is represented by syntactic relations between pairs of words. One word, called the root, serves as the starting point of the tree.
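One simple way to picture such a tree is a list holding, for each word, the index of its head. The sketch below uses the head indices that the English model reports for the example sentence later in this notebook. It assumes the spaCy convention that the root is its own head; `find_root` and `children_of` are hypothetical helper names, not library functions.

```python
# Words and head indices for "Hal, switch to manual hibernation control."
# (taken from the English dependency table in this notebook).
words = ["Hal", ",", "switch", "to", "manual", "hibernation", "control", "."]
heads = [2, 2, 2, 2, 6, 6, 3, 2]


def find_root(heads):
    """Return the index of the single word that is its own head."""
    roots = [i for i, head in enumerate(heads) if head == i]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def children_of(heads):
    """Map each word index to the indices of its direct dependents."""
    children = {i: [] for i in range(len(heads))}
    for i, head in enumerate(heads):
        if head != i:                 # the root has no incoming arc
            children[head].append(i)
    return children


root = find_root(heads)
children = children_of(heads)
print("root:", words[root])
for head, deps in children.items():
    for dep in deps:
        print(f"{words[head]} -> {words[dep]}")
```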

# spaCy

Let's see how to obtain the dependency tree of a text automatically with the spaCy library. For more information, read the [full documentation](https://spacy.io/usage).

Installing the dependencies

In [18]:
!pip install tabulate



Downloading the trained models for Portuguese and English

In [19]:
!python3 -m spacy download pt_core_news_lg
!python3 -m spacy download en_core_web_trf


## Parsing in Portuguese

Initializing the model

In [17]:
import spacy

spacy.prefer_gpu()
nlp = spacy.load('pt_core_news_lg')

Obtaining the dependencies

In [23]:
doc = nlp('Hal, mude para o controle de hibernação.')

In [24]:
import tabulate

# One row per token: index, lemma, part of speech, morphology, dependency label, head
data = []
for token in doc:
  data.append((token.i, token.lemma_, token.pos_, token.morph, token.dep_, token.head))

header = ['Idx', 'Lemma', 'Part of speech', 'Morphology', 'Dependency', 'Governor']
print(tabulate.tabulate(data, header))


  Idx  Lemma       Part of speech    Morphology                                             Dependency    Governor
-----  ----------  ----------------  -----------------------------------------------------  ------------  ----------
    0  Hal         PROPN             Gender=Masc|Number=Sing                                nsubj         mude
    1  ,           PUNCT                                                                    punct         mude
    2  mudar       VERB              Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  ROOT          mude
    3  para        ADP                                                                      case          controle
    4  o           DET               Definite=Def|Gender=Masc|Number=Sing|PronType=Art      det           controle
    5  controle    NOUN              Gender=Masc|Number=Sing                                obl           mude
    6  de          ADP

Finding substructures

In [32]:
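def get_subfrase(root, tokens, subfrase):
  # Record the current token as (index, text) so the result can be sorted later.
  subfrase.append((root.i, str(root)))

  # Recurse into every syntactic child to collect the whole subtree.
  # `tokens` is unused but kept so that existing calls keep working.
  for token in root.children:
    subfrase = get_subfrase(token, tokens, subfrase)
  return subfrase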

In [33]:
r = get_subfrase(doc[5], doc, [])

[w[1] for w in sorted(r, key=lambda x: x[0])]



['para', 'o', 'controle', 'de', 'hibernação']

## Parsing in English

Importing the dependencies and initializing the model

In [34]:
import spacy

spacy.prefer_gpu()
nlp = spacy.load('en_core_web_trf')



Obtaining the dependencies

In [36]:
doc = nlp('Hal, switch to manual hibernation control.')



In [37]:
import tabulate

# One row per token: index, lemma, part of speech, morphology, dependency label, head
data = []
for token in doc:
  data.append((token.i, token.lemma_, token.pos_, token.morph, token.dep_, token.head))

header = ['Idx', 'Lemma', 'Part of speech', 'Morphology', 'Dependency', 'Governor']
print(tabulate.tabulate(data, header))

  Idx  Lemma        Part of speech    Morphology      Dependency    Governor
-----  -----------  ----------------  --------------  ------------  ----------
    0  Hal          PROPN             Number=Sing     npadvmod      switch
    1  ,            PUNCT             PunctType=Comm  punct         switch
    2  switch       VERB              VerbForm=Inf    ROOT          switch
    3  to           ADP                               prep          switch
    4  manual       ADJ               Degree=Pos      amod          control
    5  hibernation  NOUN              Number=Sing     compound      control
    6  control      NOUN              Number=Sing     pobj          to
    7  .            PUNCT             PunctType=Peri  punct         switch
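The subtree search from the Portuguese section works on this parse as well. The sketch below reproduces it on plain lists, so it runs without loading a model: the head indices are copied from the table above, and `subtree` is a hypothetical helper written for this example. With a real `Doc`, spaCy's `Token.subtree` attribute should give the same tokens directly.

```python
# Words and head indices copied from the English dependency table above.
words = ["Hal", ",", "switch", "to", "manual", "hibernation", "control", "."]
heads = [2, 2, 2, 2, 6, 6, 3, 2]


def subtree(start, heads):
    """Return the sorted indices of `start` and all of its descendants."""
    # Build a head -> dependents map, skipping the root's self-reference.
    children = {}
    for i, head in enumerate(heads):
        if i != head:
            children.setdefault(head, []).append(i)

    # Iterative depth-first walk starting from `start`.
    collected, stack = [], [start]
    while stack:
        node = stack.pop()
        collected.append(node)
        stack.extend(children.get(node, []))
    return sorted(collected)


# Subtree headed by "to" (index 3).
print([words[i] for i in subtree(3, heads)])
```

For this sentence, the subtree under "to" covers the same words as the PP constituent found by the constituency parser, which shows how the two representations relate.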
