<a href="https://colab.research.google.com/github/blancavazquez/PLN/blob/main/notebooks/3_POS_tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing part-of-speech tagging

The goal of this notebook is to tag parts of speech (POS) using two widely used tools:

* NLTK
* spaCy


## NLTK [Natural Language Toolkit](https://www.nltk.org/)

NLTK is a suite of libraries and programs for natural language processing (NLP), written in Python.

For part-of-speech tagging, its English tagger emits [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tags by default and can also map them to a coarser [Universal tagset](https://universaldependencies.org/u/pos/).
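
As a rough illustration of how a fine-grained tagset relates to a coarse one, the sketch below maps a few Penn Treebank tags onto universal categories. The `PTB_TO_UNIVERSAL` dictionary and the `to_universal` helper are hypothetical names written for this example and cover only a handful of tags. The mapping assumes the 12-category scheme behind NLTK's universal tagset, which differs slightly from the 17 UPOS tags on the linked page (for instance, proper nouns fall under `NOUN` and punctuation under `.`). In NLTK itself, passing `tagset='universal'` to `nltk.pos_tag` should apply the full mapping.

```python
# Hypothetical, partial mapping from Penn Treebank tags to coarse universal tags.
# Only a handful of tags are covered; unknown tags fall back to "X" (other).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON", "DT": "DET", "IN": "ADP", "CD": "NUM", "CC": "CONJ",
    ",": ".", ".": ".",
}


def to_universal(tagged_tokens):
    """Convert (word, PTB tag) pairs into (word, universal tag) pairs."""
    return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_tokens]


# A small hand-tagged example (not produced by running a tagger).
example = [("Steve", "NNP"), ("Jobs", "NNP"), ("waited", "VBD"), ("patiently", "RB"), (".", ".")]
print(to_universal(example))
```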

In [None]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [None]:
sentence = """Details matter, it's worth waiting to get it right. Steve Jobs."""
tokens = nltk.word_tokenize(sentence)  # Tokenize the text
print("tokens:", tokens)

tokens: ['Details', 'matter', ',', 'it', "'s", 'worth', 'waiting', 'to', 'get', 'it', 'right', '.', 'Steve', 'Jobs', '.']


In [None]:
etiquetas = nltk.pos_tag(tokens)  # Run POS tagging
print("************ PoS Tagging Result ************ ")
for word, pos_tag in etiquetas:
    print(f"{word}: {pos_tag}")

************ PoS Tagging Result ************ 
Details: NNS
matter: NN
,: ,
it: PRP
's: VBZ
worth: JJ
waiting: VBG
to: TO
get: VB
it: PRP
right: RB
.: .
Steve: NNP
Jobs: NNP
.: .
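
The tagged pairs above can be regrouped by tag to see which words share a category. This is a minimal sketch: the tagger output is copied by hand into a list so that it runs without NLTK, and `tagged_pairs` and `words_by_tag` are names introduced here for illustration.

```python
from collections import defaultdict

# (word, tag) pairs copied from the tagging output above.
tagged_pairs = [
    ("Details", "NNS"), ("matter", "NN"), (",", ","), ("it", "PRP"), ("'s", "VBZ"),
    ("worth", "JJ"), ("waiting", "VBG"), ("to", "TO"), ("get", "VB"), ("it", "PRP"),
    ("right", "RB"), (".", "."), ("Steve", "NNP"), ("Jobs", "NNP"), (".", "."),
]

# Group the words under their POS tag, preserving the order of appearance.
words_by_tag = defaultdict(list)
for word, tag in tagged_pairs:
    words_by_tag[tag].append(word)

for tag, words in words_by_tag.items():
    print(f"{tag}: {words}")
```

Grouping this way makes tagging errors easier to spot: `matter` lands under `NN` (noun) even though it acts as a verb in this sentence.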


In [None]:
from nltk.corpus import treebank
nltk.download('treebank')
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t)

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
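
In the bracketed notation above, each innermost pair `(TAG word)` is a leaf carrying its POS tag, while the outer labels (`S`, `NP-SBJ`, `VP`, and so on) mark phrase-level constituents. The sketch below recovers the leaves with a regular expression applied to a plain-text copy of the printed tree. It is only an illustration of the notation; on an NLTK `Tree` object, `t.pos()` should return the same (word, tag) pairs directly.

```python
import re

# Copy of the parse tree printed above, kept as plain text.
tree_text = """
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
"""

# A leaf looks like "(TAG word)": a tag and a word with no nested parentheses.
leaf_pattern = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")
leaves = [(word, tag) for tag, word in leaf_pattern.findall(tree_text)]

print(leaves[:4])
print("tokens:", len(leaves))
```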


# spaCy

[spaCy](https://spacy.io/) is an open-source library for natural language processing (NLP), written in Python and Cython.

It uses the [Universal](https://universaldependencies.org/u/pos/) POS tagset for part-of-speech tagging, exposed through `token.pos_`.
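
For reference, the Universal Dependencies scheme defines 17 coarse-grained tags. The lookup table below is a small sketch for reading them; `UPOS` and `describe` are hypothetical names used only in this example, and the descriptions are short paraphrases. In spaCy, `spacy.explain("PROPN")` offers a similar lookup.

```python
# The 17 coarse-grained tags defined by Universal Dependencies (UPOS).
UPOS = {
    "ADJ": "adjective", "ADP": "adposition", "ADV": "adverb", "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction", "DET": "determiner", "INTJ": "interjection",
    "NOUN": "noun", "NUM": "numeral", "PART": "particle", "PRON": "pronoun",
    "PROPN": "proper noun", "PUNCT": "punctuation", "SCONJ": "subordinating conjunction",
    "SYM": "symbol", "VERB": "verb", "X": "other",
}


def describe(tag):
    """Return a readable description of a UPOS tag (hypothetical helper)."""
    return UPOS.get(tag, "unknown tag")


for tag in ("VERB", "PROPN", "PUNCT"):
    print(f"{tag}: {describe(tag)}")
```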

In [None]:
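!pip install -U spacy
!python -m spacy download en_core_web_sm  # English model used in the next cell (may already be installed on Colab)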



In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the English pipeline (tokenizer, tagger, parser, and NER)
text = ("Details matter, it's worth waiting to get it right. Steve Jobs.")  # Text to process
doc = nlp(text)  # Run the pipeline on the text

print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

for entity in doc.ents:  # Iterate over the named entities found
    print(entity.text, entity.label_)

Noun phrases: ['Details', 'it', 'it', 'Steve Jobs']
Verbs: ['matter', 'wait', 'get']
Steve Jobs PERSON


## Named entities

A named entity is a specific word or phrase that refers to a person, place, organization, monetary amount, time expression, or other real-world value.

* Named entities matter in natural language processing (NLP) because they provide valuable information and context about the text.

In [None]:
# spaCy
texto = ("Originally marketed as a temperance drink and intended as a patent medicine, Coca-Cola was invented in the late 19th century by John Stith Pemberton in Atlanta. In 1888, Pemberton sold the ownership rights to Asa Griggs Candler, a businessman, whose marketing tactics led Coca-Cola to its dominance of the global soft-drink market throughout the 20th and 21st centuries.[4] The name refers to two of its original ingredients: coca leaves and kola nuts (a source of caffeine)")  # Text to process
ner_categories = ["PERSON", "ORG", "GPE", "PRODUCT"]  # Entity categories to keep
doc = nlp(texto)  # Run the pipeline on the text
entidades = []
for entidad in doc.ents:  # Iterate over the named entities found
  if entidad.label_ in ner_categories:
    entidades.append((entidad.text, entidad.label_))

In [None]:
for entidad, categoria in entidades:
  print(f"{entidad}: {categoria}")

Coca-Cola: ORG
John Stith Pemberton: PERSON
Atlanta: GPE
Pemberton: ORG
Griggs Candler: PERSON
Coca-Cola: ORG
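
To summarize the result, the pairs above can be collected into the distinct mentions found under each label. This is a minimal sketch with the output copied by hand so that it runs without spaCy; `entidades_spacy` and `by_label` are names introduced for this example.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output above.
entidades_spacy = [
    ("Coca-Cola", "ORG"), ("John Stith Pemberton", "PERSON"), ("Atlanta", "GPE"),
    ("Pemberton", "ORG"), ("Griggs Candler", "PERSON"), ("Coca-Cola", "ORG"),
]

# Collect the distinct mentions under each label, keeping first-seen order.
by_label = defaultdict(list)
for mention, label in entidades_spacy:
    if mention not in by_label[label]:
        by_label[label].append(mention)

for label, mentions in by_label.items():
    print(f"{label}: {mentions}")
```

The grouped view also exposes two likely model errors in this run: the second `Pemberton` is labeled `ORG`, and `Asa` is missing from `Griggs Candler`.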


In [None]:
# Visualize the results
spacy.displacy.render(doc,style="ent")

In [None]:
#NLTK
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True

In [None]:
texto = ("Originally marketed as a temperance drink and intended as a patent medicine, Coca-Cola was invented in the late 19th century by John Stith Pemberton in Atlanta. In 1888, Pemberton sold the ownership rights to Asa Griggs Candler, a businessman, whose marketing tactics led Coca-Cola to its dominance of the global soft-drink market throughout the 20th and 21st centuries.[4] The name refers to two of its original ingredients: coca leaves and kola nuts (a source of caffeine)")  # Text to process
tokens = word_tokenize(texto)
pos_tags = pos_tag(tokens)

In [None]:
entidades = ne_chunk(pos_tags)
print(entidades)

(S
  Originally/RB
  marketed/VBN
  as/IN
  a/DT
  temperance/NN
  drink/NN
  and/CC
  intended/VBN
  as/IN
  a/DT
  patent/NN
  medicine/NN
  ,/,
  (PERSON Coca-Cola/NNP)
  was/VBD
  invented/VBN
  in/IN
  the/DT
  late/JJ
  19th/JJ
  century/NN
  by/IN
  (PERSON John/NNP Stith/NNP Pemberton/NNP)
  in/IN
  (GPE Atlanta/NNP)
  ./.
  In/IN
  1888/CD
  ,/,
  (PERSON Pemberton/NNP)
  sold/VBD
  the/DT
  ownership/NN
  rights/NNS
  to/TO
  (PERSON Asa/NNP Griggs/NNP Candler/NNP)
  ,/,
  a/DT
  businessman/NN
  ,/,
  whose/WP$
  marketing/NN
  tactics/NNS
  led/VBD
  Coca-Cola/NNP
  to/TO
  its/PRP$
  dominance/NN
  of/IN
  the/DT
  global/JJ
  soft-drink/NN
  market/NN
  throughout/IN
  the/DT
  20th/JJ
  and/CC
  21st/CD
  centuries/NNS
  ./.
  [/$
  4/CD
  ]/IN
  The/DT
  name/NN
  refers/NNS
  to/TO
  two/CD
  of/IN
  its/PRP$
  original/JJ
  ingredients/NNS
  :/:
  coca/NN
  leaves/NNS
  and/CC
  kola/NN
  nuts/NNS
  (/(
  a/DT
  source/NN
  of/IN
  caffeine/NN
  )/))
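
In the chunk tree above, tokens outside any entity appear as `word/TAG`, while each entity is wrapped as `(LABEL word/TAG ...)`. The sketch below pulls the entities out of a short excerpt of that printed output with a regular expression. It only illustrates the notation and assumes labels are written in capital letters; with the actual NLTK tree, iterating over `entidades.subtrees()` and reading each subtree's `label()` would be the more direct route.

```python
import re

# Excerpt of the ne_chunk output printed above, kept as plain text.
chunk_excerpt = """
(PERSON Coca-Cola/NNP)
was/VBD
invented/VBN
by/IN
(PERSON John/NNP Stith/NNP Pemberton/NNP)
in/IN
(GPE Atlanta/NNP)
"""

# An entity looks like "(LABEL word/TAG word/TAG ...)".
entity_pattern = re.compile(r"\(([A-Z]+) ([^()]+)\)")

nltk_entities = []
for label, body in entity_pattern.findall(chunk_excerpt):
    # Drop the "/TAG" suffix from each token and rejoin the words.
    words = [token.rsplit("/", 1)[0] for token in body.split()]
    nltk_entities.append((" ".join(words), label))

print(nltk_entities)
```

In this run NLTK labels the first `Coca-Cola` as `PERSON`, whereas spaCy labeled it `ORG`, so the two tools can disagree on the same text.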


# Lexical resources

In [None]:
import nltk
from nltk.corpus import words
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# List the stop words
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on
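
A common use of this list is to drop high-frequency function words before further analysis. The sketch below filters the tokens of the Steve Jobs sentence against a small hand-picked subset of stop words, so that it runs without NLTK; `stop_subset` and `content_tokens` are names introduced here. In the notebook, `set(stopwords.words('english'))` would replace the subset.

```python
# Hand-picked subset of English stop words, for illustration only;
# in the notebook, use set(stopwords.words('english')) instead.
stop_subset = {"a", "and", "by", "in", "it", "of", "the", "to", "was"}

# Tokens produced earlier by nltk.word_tokenize for the Steve Jobs sentence.
tokens = ['Details', 'matter', ',', 'it', "'s", 'worth', 'waiting', 'to',
          'get', 'it', 'right', '.', 'Steve', 'Jobs', '.']

# Compare in lowercase, because the stop-word list is lowercase.
content_tokens = [tok for tok in tokens if tok.lower() not in stop_subset]
print(content_tokens)
```

Punctuation tokens survive this filter because they are not stop words; removing them would need a separate check.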

# WordNet

In [None]:
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
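wn.synsets('motorcar')  # Look up the synsets containing 'motorcar' (result is not displayed here)
print("lemmas: ", wn.synset('car.n.01').lemmas())  # Lemmas (synonyms) grouped in the synset car.n.01
wn.synset('car.n.01').definition()  # Definition (gloss) of the synset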

lemmas:  [Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]


'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
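
The lemma identifiers printed above follow the pattern `word.pos.nn.lemma`: the first three parts name the synset (`car.n.01` is the first noun sense of "car"), and the last part is one of its synonyms. The sketch below splits the identifiers copied from the output. `split_lemma_id` is a hypothetical helper that assumes this four-part format, so it would mis-split a synset whose head word itself contains a dot. With NLTK objects, `lemma.name()` should give the synonym directly.

```python
# Lemma identifiers copied from the output above.
lemma_ids = ["car.n.01.car", "car.n.01.auto", "car.n.01.automobile",
             "car.n.01.machine", "car.n.01.motorcar"]


def split_lemma_id(lemma_id):
    """Split 'word.pos.nn.lemma' into (synset name, POS, sense number, lemma)."""
    head, pos, sense, lemma = lemma_id.split(".", 3)
    return f"{head}.{pos}.{sense}", pos, int(sense), lemma


for lemma_id in lemma_ids:
    synset_name, pos, sense, lemma = split_lemma_id(lemma_id)
    print(f"{lemma!r} belongs to {synset_name} (pos={pos}, sense={sense})")
```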