# POS Tagging

POS (part-of-speech) tagging, also called morphological tagging, is the process of classifying the words of a text according to their lexical category.

Each word receives a lexical category drawn from a set of tags, coded according to the word's function in the corresponding language. The text must be tokenized before it can be POS tagged.
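As a rough illustration of what "tokenized" means here, the sketch below splits a sentence into word and punctuation tokens with a regular expression. The helper name `simple_tokenize` and its pattern are assumptions made for this example; this is not how NLTK's tokenizers work internally, and the notebook's own cells rely on NLTK's tokenizers instead.

```python
import re

# Hypothetical pattern: a word is a run of letters/digits, optionally with an
# internal apostrophe; any other non-space character becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(text):
    """Split raw text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("The Palace lies in London, England.")
print(tokens)
```

The result is a plain list of strings, which is the shape of input a tagger such as `pos_tag` expects.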

NLTK provides a function called pos_tag. It classifies English words according to a predefined tag set. This particular tagger is based on machine learning and was trained on thousands of manually tagged example sentences. It can therefore estimate the most probable lexical category of a term, which does not mean it is free of errors.
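To make the idea of "learning the most probable tag from tagged examples" concrete, here is a deliberately simplified sketch: a most-frequent-tag baseline. It is not the model behind `pos_tag`; the training sentences, the function names and the `NN` default are all invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged sample, invented for illustration only.
TRAINING_SENTENCES = [
    [("the", "DT"), ("house", "NN"), ("stands", "VBZ")],
    [("they", "PRP"), ("house", "VBP"), ("refugees", "NNS")],
    [("the", "DT"), ("house", "NN"), ("burned", "VBD")],
]


def train_most_frequent_tag(sentences):
    """Map every word to the tag it received most often in the sample."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default="NN"):
    """Tag each token with its most frequent tag, or `default` if unseen."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


model = train_most_frequent_tag(TRAINING_SENTENCES)
print(tag_tokens(["They", "house", "refugees"], model))
```

In the last call "house" is tagged `NN` because that was its most frequent tag in the sample, even though it acts as a verb in this sentence. That is the kind of error a statistically most probable tag can still produce.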

A complete list of the tag codes used by NLTK can be obtained as follows:

In [9]:
import nltk
#nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

The description of a single category can also be obtained:

In [10]:
nltk.help.upenn_tagset("NNP")

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


Because this tagger may not be good enough in some cases, tagging accuracy can be improved by combining it with manually created POS taggers.
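One common way to combine taggers is a fallback chain: each tagger is tried in turn and the first one that returns a tag wins. The sketch below shows that idea in plain Python. The lexicon entries, the suffix rules and the function names are hypothetical, chosen only for this example. NLTK's own sequential taggers (for instance `RegexpTagger` and `UnigramTagger`) accept a `backoff` argument for a similar purpose.

```python
import re

# Hypothetical hand-written rules: (pattern, tag), tried in order.
SUFFIX_RULES = [
    (re.compile(r"^\d+([.,]\d+)?$"), "CD"),  # numbers
    (re.compile(r".+ing$"), "VBG"),          # gerunds
    (re.compile(r".+ly$"), "RB"),            # many adverbs
]


def lexicon_tagger(lexicon):
    """Return a tagger that only knows the words listed in `lexicon`."""
    def tag(token):
        return lexicon.get(token.lower())
    return tag


def rule_tagger(rules):
    """Return a tagger that applies the first matching regex rule."""
    def tag(token):
        for pattern, label in rules:
            if pattern.match(token):
                return label
        return None
    return tag


def chain(taggers, default="NN"):
    """Try each tagger in order; use `default` when all of them return None."""
    def tag_tokens(tokens):
        tagged = []
        for token in tokens:
            label = default
            for tagger in taggers:
                candidate = tagger(token)
                if candidate is not None:
                    label = candidate
                    break
            tagged.append((token, label))
        return tagged
    return tag_tokens


manual_lexicon = {"grammys": "NNPS", "adele": "NNP"}
tag = chain([lexicon_tagger(manual_lexicon), rule_tagger(SUFFIX_RULES)])
result = tag(["Adele", "singing", "42", "tonight"])
print(result)
```

The order of the chain is a design choice: putting the manual lexicon first lets hand-made corrections override everything else, while a trained tagger could take the place of the final default.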

## Tagging

In [14]:
example = "The Palace of Westminster serves as the meeting place for both the House of Commons and the House of Lords, the two houses of the Parliament of the United Kingdom. Informally known as the Houses of Parliament after its occupants, the Palace lies on the north bank of the River Thames in the City of Westminster, in central London, England."
# Tokenize the text
tokenized_text = nltk.word_tokenize(example)
print(tokenized_text)
# Tag the tokens with pos_tag
text_pos = nltk.pos_tag(tokenized_text)
print(text_pos)

['The', 'Palace', 'of', 'Westminster', 'serves', 'as', 'the', 'meeting', 'place', 'for', 'both', 'the', 'House', 'of', 'Commons', 'and', 'the', 'House', 'of', 'Lords', ',', 'the', 'two', 'houses', 'of', 'the', 'Parliament', 'of', 'the', 'United', 'Kingdom', '.', 'Informally', 'known', 'as', 'the', 'Houses', 'of', 'Parliament', 'after', 'its', 'occupants', ',', 'the', 'Palace', 'lies', 'on', 'the', 'north', 'bank', 'of', 'the', 'River', 'Thames', 'in', 'the', 'City', 'of', 'Westminster', ',', 'in', 'central', 'London', ',', 'England', '.']
[('The', 'DT'), ('Palace', 'NNP'), ('of', 'IN'), ('Westminster', 'NNP'), ('serves', 'NNS'), ('as', 'IN'), ('the', 'DT'), ('meeting', 'NN'), ('place', 'NN'), ('for', 'IN'), ('both', 'CC'), ('the', 'DT'), ('House', 'NNP'), ('of', 'IN'), ('Commons', 'NNPS'), ('and', 'CC'), ('the', 'DT'), ('House', 'NNP'), ('of', 'IN'), ('Lords', 'NNPS'), (',', ','), ('the', 'DT'), ('two', 'CD'), ('houses', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Parliament', 'NNP'), ('of'

# POS in Spanish

To do morphological tagging in another language, we need to find a tagger already trained for that language or train one ourselves. We also need to know which word categories exist for that language.

The following <a href="https://colab.research.google.com/github/vitojph/kschool-nlp-18/blob/master/notebooks/pos-tagger-es.ipynb">link</a> shows a practical example of POS tagging in Spanish.

# Exercise

- Get from the API a list of tweets in English that are not retweets and contain the hashtag #GRAMMYs.
- Tokenize them
- POS tag them with NLTK's pos_tag function
- Get the list and frequency of singular and plural nouns

In [19]:
# Get from the API a list of tweets in English that are not retweets and contain the hashtag #GRAMMYs.
import os
import pandas as pd
import requests
from dotenv import load_dotenv
# Load the values from the .env file into environment variables
load_dotenv()
# Read the token into a variable
bearer_token = os.environ.get("BEARER_TOKEN")

url = "https://api.twitter.com/2/tweets/search/recent"

headers = {
    "Authorization": f"Bearer {bearer_token}",
    "User-Agent": "v2FullArchiveSearchPython"
}

hashtag = '#GRAMMYs'
params = {
    'query': f'lang:en {hashtag} -is:retweet',
    'max_results': 10,
    'tweet.fields': "lang"
}

response = requests.get(url, headers=headers, params=params)
print(response)
# Raise an exception if the request was not successful
if response.status_code != 200:
    raise Exception(response.status_code, response.text)
df = pd.json_normalize(response.json()['data'])
# Display and export the tweets
pd.set_option('display.max_colwidth', None)
df.to_csv('tweets_pos.csv')
df

<Response [200]>


Unnamed: 0,id,lang,text
0,1446193464782344194,en,How the heck do you eat healthy when running from airport to airport? NYC next! #Carnegie #billboard #grammys #kittwakeley #symphonyofsinnersandsaints https://t.co/W0RTvf4JUb
1,1446193384075444224,en,"Legendary #Guitarist and #Grammys nominee @JonFinn appears as husband/wife #bluesy duo with #Performing #RecordingArtist @JuliFinn at @RegentArlington Great #Guitar Night 11/5, celebrating #Boston’s finest in a slam dunk showcase all under one roof! https://t.co/6G9B3g5X3S https://t.co/wMldRd9EIC"
2,1446188882396123138,en,Even #Adele was upset by #Beyonce's #Grammys snub https://t.co/eXQiBLSUz7
3,1446180310970945541,en,📷 Tbt with Latin Grammy Award Winner @alexandergdz @gentedezona We’re all the way up❤🙏🎵⭐ . . . . . . . #grammys #latin #awards #winner #cuba #venezuela #guyana #reggaeton #pop #star #work #business #grind #street #picoftheday #tbt... https://t.co/kQPG9hrwOm
4,1446174603399909380,en,FC let's vote \nWe are behind \nGrammys: Who Do You Hope Is Nominated for Song &amp; Album of the Year at 2022 ceremony? Vote! https://t.co/jfReZ28dQk #GRAMMYs #Newcastle https://t.co/snOOx4QtSn
5,1446171651662618625,en,https://t.co/QgKJDLoejV @davido @wizkhalifa @lilbaby4PF @MI_Abaga need your support please check out my new song #KanyeWest #ExpensivePain #BandcampFriday #MyUniverse #Jemappelle #GRAMMYs #drake #NewMusic #music #DONDA #MEMENTOMORI #LilNasX
6,1446162654448558086,en,This is the weight each genre and voting bodies have for the final voting to pick the winners\n\n#Grammys https://t.co/78x0YYXo66
7,1446157006205771776,en,#GRAMMYs an informational thread:
8,1446156881907572738,en,".@Adele says she spoke to Beyoncé after the #Grammys in 2017\n\n— ""They don’t want to support the way that she’s moving things forward with her releases and the things that she’s talking about.” https://t.co/vvEbjzqnTY"
9,1446155234347294725,en,There’s a lemon in it #GRAMMYs
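The request cell reads `response.json()['data']` directly. A slightly more defensive variant is sketched below, assuming that a search with no matches returns a body without a `data` key (in which case the direct lookup would raise `KeyError`). The helpers `build_query` and `extract_tweets` are hypothetical names introduced for this example, and the payloads are hand-written stand-ins, not real API responses.

```python
def build_query(hashtag, lang="en", include_retweets=False):
    """Assemble the search query string used in the request parameters."""
    parts = [f"lang:{lang}", hashtag]
    if not include_retweets:
        parts.append("-is:retweet")
    return " ".join(parts)


def extract_tweets(payload):
    """Return the list of tweet dicts in a decoded response, or [] if absent."""
    return payload.get("data", [])


# Hand-written stand-ins for decoded JSON bodies (assumed shapes).
with_results = {"data": [{"id": "1", "lang": "en", "text": "#GRAMMYs"}],
                "meta": {"result_count": 1}}
without_results = {"meta": {"result_count": 0}}

print(build_query("#GRAMMYs"))
print(len(extract_tweets(with_results)), len(extract_tweets(without_results)))
```

With helpers like these, the DataFrame could be built from `extract_tweets(response.json())`, yielding an empty frame instead of an exception when nothing matches.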


In [20]:
# Cleaning and tokenization
from nltk.tokenize import TweetTokenizer

# remove question and exclamation marks (literal matches, not regexes)
df['text'] = df['text'].str.replace('¿', '', regex=False)
df['text'] = df['text'].str.replace('?', '', regex=False)  # the default raised a warning
df['text'] = df['text'].str.replace('!', '', regex=False)
df['text'] = df['text'].str.replace('¡', '', regex=False)
# strip accents
df['text'] = df['text'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
# remove the hashtag symbol
df['text'] = df['text'].str.replace('#', '', regex=False)
# remove mentions, URLs and every character that is not a letter, digit, space or tab
df['text'] = df['text'].replace(r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)', '', regex=True)
# remove digits
df['text'] = df['text'].replace(r'[0-9]+', '', regex=True)
# remove emojis (note: this converts every column to str)
df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
# remove any remaining URLs
df['text'] = df['text'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
# remove line breaks and tabs, collapse repeated spaces
df['text'] = df['text'].str.replace('\n', '', regex=False)
df['text'] = df['text'].str.replace('\t', '', regex=False)
df['text'] = df['text'].str.replace(' {2,}', ' ', regex=True)
df['text'] = df['text'].str.strip()
# convert the text to lowercase
df['text'] = df['text'].str.lower()
# drop rows whose text is empty
df = df[df['text'].astype(bool)]

# Tokenization
tt = TweetTokenizer()
df['tokenized_text'] = df['text'].apply(tt.tokenize)
df

Unnamed: 0,id,lang,text,tokenized_text
0,1446193464782344194,en,how the heck do you eat healthy when running from airport to airport nyc next carnegie billboard grammys kittwakeley symphonyofsinnersandsaints,"[how, the, heck, do, you, eat, healthy, when, running, from, airport, to, airport, nyc, next, carnegie, billboard, grammys, kittwakeley, symphonyofsinnersandsaints]"
1,1446193384075444224,en,legendary guitarist and grammys nominee appears as husbandwife bluesy duo with performing recordingartist at great guitar night celebrating bostons finest in a slam dunk showcase all under one roof,"[legendary, guitarist, and, grammys, nominee, appears, as, husbandwife, bluesy, duo, with, performing, recordingartist, at, great, guitar, night, celebrating, bostons, finest, in, a, slam, dunk, showcase, all, under, one, roof]"
2,1446188882396123138,en,even adele was upset by beyonces grammys snub,"[even, adele, was, upset, by, beyonces, grammys, snub]"
3,1446180310970945541,en,tbt with latin grammy award winner were all the way up grammys latin awards winner cuba venezuela guyana reggaeton pop star work business grind street picoftheday tbt,"[tbt, with, latin, grammy, award, winner, were, all, the, way, up, grammys, latin, awards, winner, cuba, venezuela, guyana, reggaeton, pop, star, work, business, grind, street, picoftheday, tbt]"
4,1446174603399909380,en,fc lets vote we are behind grammys who do you hope is nominated for song amp album of the year at ceremony vote grammys newcastle,"[fc, lets, vote, we, are, behind, grammys, who, do, you, hope, is, nominated, for, song, amp, album, of, the, year, at, ceremony, vote, grammys, newcastle]"
5,1446171651662618625,en,abaga need your support please check out my new song kanyewest expensivepain bandcampfriday myuniverse jemappelle grammys drake newmusic music donda mementomori lilnasx,"[abaga, need, your, support, please, check, out, my, new, song, kanyewest, expensivepain, bandcampfriday, myuniverse, jemappelle, grammys, drake, newmusic, music, donda, mementomori, lilnasx]"
6,1446162654448558086,en,this is the weight each genre and voting bodies have for the final voting to pick the winnersgrammys,"[this, is, the, weight, each, genre, and, voting, bodies, have, for, the, final, voting, to, pick, the, winnersgrammys]"
7,1446157006205771776,en,grammys an informational thread,"[grammys, an, informational, thread]"
8,1446156881907572738,en,says she spoke to beyonce after the grammys in they dont want to support the way that shes moving things forward with her releases and the things that shes talking about,"[says, she, spoke, to, beyonce, after, the, grammys, in, they, dont, want, to, support, the, way, that, shes, moving, things, forward, with, her, releases, and, the, things, that, shes, talking, about]"
9,1446155234347294725,en,theres a lemon in it grammys,"[theres, a, lemon, in, it, grammys]"
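Two side effects of the cleaning are visible in the table above: row 6 ends in `winnersgrammys` because the line breaks between the two words were deleted rather than replaced by a space, and row 5 starts with `abaga`, the leftover of a mention that contained an underscore. The sketch below is one possible alternative, not the notebook's method. The function name `clean_tweet`, its patterns and the `lowercase` flag are assumptions. The flag is optional because lowercasing discards capitalization, which taggers tend to use as a cue for proper nouns.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")            # \w also covers underscores
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]")  # keep letters and whitespace only
SPACE_RE = re.compile(r"\s+")               # spaces, tabs and line breaks


def clean_tweet(text, lowercase=True):
    """Remove URLs, mentions, accents, emojis, digits and punctuation."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Decompose accented characters, then drop everything outside ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = NON_LETTER_RE.sub("", text)
    # Collapse any whitespace run (including line breaks) into one space.
    text = SPACE_RE.sub(" ", text).strip()
    return text.lower() if lowercase else text


print(clean_tweet("pick the winners\n\n#Grammys https://t.co/78x0YYXo66"))
print(clean_tweet("@MI_Abaga need your support"))
print(clean_tweet("Even #Adele was upset", lowercase=False))
```

Applied to the DataFrame, this could look like `df['text'] = df['text'].apply(clean_tweet)`, assuming the column still holds the raw tweet text.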


- Get the list and frequency of singular and plural proper nouns


In [76]:
# Tag the text with pos_tag
# Get the list and count of singular and plural proper nouns
import nltk

nnp_words = []
nnps_words = []
for row in df['tokenized_text']:
    tags = nltk.pos_tag(row)
    for word, tag in tags:
        if tag == 'NNP':
            nnp_words.append(word)
        elif tag == 'NNPS':
            nnps_words.append(word)

print("Singular proper nouns, count and list:", len(nnp_words))
print(nnp_words)
print("Plural proper nouns, count and list:", len(nnps_words))
print(nnps_words)


Singular proper nouns, count and list: 0
[]
Plural proper nouns, count and list: 0
[]


- Get the list and frequency of verbs in all tenses

In [78]:
# Tag the text with pos_tag
# Get the list and count of verbs in all tenses
import nltk

verb_words = []
verb_tags = ["VBZ", "VBP", "VBN", "VBG", "VBD", "VB"]
for row in df['tokenized_text']:
    tags = nltk.pos_tag(row)
    for word, tag in tags:
        if tag in verb_tags:
            verb_words.append(word)

print("Verbs, count and list:", len(verb_words))
print(verb_words)


Verbs, count and list: 38
['do', 'eat', 'running', 'airport', 'appears', 'performing', 'celebrating', 'was', 'upset', 'snub', 'were', 'winner', 'grind', 'vote', 'are', 'do', 'hope', 'is', 'nominated', 'grammys', 'need', 'please', 'check', 'drake', 'is', 'have', 'pick', 'says', 'spoke', 'beyonce', 'dont', 'want', 'support', 'shes', 'moving', 'shes', 'talking', 'grammys']


- Get the list and frequency of all adjectives

In [79]:
# Tag the text with pos_tag
# Get the list and count of all adjectives
import nltk

adjective_words = []
adjective_tags = ["JJ", "JJR", "JJS"]
for row in df['tokenized_text']:
    tags = nltk.pos_tag(row)
    for word, tag in tags:
        if tag in adjective_tags:
            adjective_words.append(word)

print("Adjectives, count and list:", len(adjective_words))
print(adjective_words)


Adjectives, count and list: 23
['healthy', 'nyc', 'next', 'legendary', 'great', 'finest', 'slam', 'latin', 'grammy', 'latin', 'cuba', 'picoftheday', 'fc', 'amp', 'newcastle', 'new', 'song', 'kanyewest', 'bandcampfriday', 'newmusic', 'weight', 'final', 'informational']
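The cells above print each list and its length, whereas the exercise asks for frequencies. A possible way to get per-word counts for all three categories in a single pass is sketched below with `collections.Counter`. The function `count_by_group`, the group names and the `sample` data are hypothetical; `sample` is a hand-written stand-in for tagged rows, not real tagger output.

```python
from collections import Counter, defaultdict

# Penn Treebank tag groups used in the exercise.
TAG_GROUPS = {
    "proper_nouns": {"NNP", "NNPS"},
    "verbs": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adjectives": {"JJ", "JJR", "JJS"},
}


def count_by_group(tagged_rows, groups=TAG_GROUPS):
    """Count word frequencies per tag group over rows of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for row in tagged_rows:
        for word, tag in row:
            for name, tags in groups.items():
                if tag in tags:
                    counts[name][word] += 1
    return counts


# Hand-written stand-in for two tagged rows.
sample = [
    [("latin", "JJ"), ("grammy", "NN"), ("is", "VBZ")],
    [("latin", "JJ"), ("awards", "NNS"), ("is", "VBZ"), ("new", "JJ")],
]
counts = count_by_group(sample)
print(counts["adjectives"].most_common())
print(counts["verbs"].most_common())
```

With the notebook's data, the tagged rows could come from `df['tokenized_text'].apply(nltk.pos_tag)`, so each row is tagged once instead of once per category.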
