https://www.butte.edu/departments/cas/tipsheets/grammar/parts_of_speech.html

https://medium.com/data-science/part-of-speech-tagging-514c25fd7882

    Please book my flight to California
    I read a very good book


POS tagging using Markov chains:
https://medium.com/data-science/part-of-speech-tagging-with-hidden-markov-chain-models-e9fccc835c0e
https://medium.com/data-science/part-of-speech-tagging-for-beginners-3a0754b2ebba
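
As a rough illustration of the hidden Markov model approach named in the links above, here is a minimal Viterbi sketch that tags "book" as a verb or a noun depending on its context, as in the two example sentences earlier. The tag set, the shortened sentences, and every probability in `START`, `TRANS`, and `EMIT` are made up for this example; a real tagger would estimate them from a tagged corpus.

```python
# Toy hidden Markov model for POS tagging.
# All probabilities are invented for illustration (assumption, not corpus data).
TAGS = ["VB", "DT", "JJ", "NN"]

# P(first tag)
START = {"VB": 0.4, "DT": 0.3, "JJ": 0.1, "NN": 0.2}

# P(next tag | current tag)
TRANS = {
    "VB": {"VB": 0.05, "DT": 0.60, "JJ": 0.15, "NN": 0.20},
    "DT": {"VB": 0.01, "DT": 0.01, "JJ": 0.38, "NN": 0.60},
    "JJ": {"VB": 0.02, "DT": 0.02, "JJ": 0.16, "NN": 0.80},
    "NN": {"VB": 0.40, "DT": 0.20, "JJ": 0.10, "NN": 0.30},
}

# P(word | tag); words missing from a tag's dict have probability 0
EMIT = {
    "VB": {"book": 0.3, "read": 0.7},
    "DT": {"a": 1.0},
    "JJ": {"good": 1.0},
    "NN": {"book": 0.5, "flight": 0.5},
}


def viterbi(words):
    """Return the most probable (word, tag) sequence under the toy model."""
    # best[i][tag] = (probability of the best path ending in `tag` at position i,
    #                 previous tag on that path)
    best = [{tag: (START[tag] * EMIT[tag].get(words[0], 0.0), None) for tag in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            # Pick the predecessor that maximizes the path probability into `tag`.
            prev = max(TAGS, key=lambda p: best[-1][p][0] * TRANS[p][tag])
            score = best[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(word, 0.0)
            column[tag] = (score, prev)
        best.append(column)

    # Trace back from the most probable final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(zip(words, reversed(path)))


print(viterbi(["book", "a", "flight"]))
# [('book', 'VB'), ('a', 'DT'), ('flight', 'NN')]
print(viterbi(["read", "a", "good", "book"]))
# [('read', 'VB'), ('a', 'DT'), ('good', 'JJ'), ('book', 'NN')]
```

The sketch multiplies raw probabilities, which is fine for short sentences; longer inputs would normally use log probabilities to avoid underflow.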



https://www.nltk.org/book/ch05.html
https://www.nltk.org/howto/chunk.html
https://web.archive.org/web/20150412115803/http://www.eecis.udel.edu:80/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf



In [1]:
# Standard library
import codecs
import copy
import json
import os
import re
import string
import textwrap
from collections import Counter, OrderedDict

# Third-party utilities
import pandas as pd
from IPython.display import HTML
from unidecode import unidecode

# NLTK
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet, words
from nltk.corpus import words as nltk_words
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer

# spaCy
import spacy
from spacy import displacy
from spacy.tokens import Doc
# !python -m spacy download en_core_web_sm

# TextBlob
from textblob import TextBlob
from textblob.taggers import PatternTagger


# Download required resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('words')
nltk.download('omw-1.4')
nltk.download('universal_tagset')


[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/user/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to /home/user/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data

True

## What is a Part of Speech?

A part of speech (POS) describes the grammatical function of a word. In Natural Language Processing (NLP), POS tagging is a basic building block for language models and for interpreting text. POS tags feed higher-level NLP functions, but they are worth understanding on their own, and they can be used directly in your text analysis.

English is commonly described as having eight (sometimes nine) parts of speech. Butte College gives the following definitions:

    Noun: A noun is the name of a person, place, thing, or idea.
    Pronoun: A pronoun is a word used in place of a noun.
    Verb: A verb expresses action or being.
    Adjective: An adjective modifies or describes a noun or pronoun.
    Adverb: An adverb modifies or describes a verb, an adjective, or another adverb.
    Preposition: A preposition is a word placed before a noun or pronoun to form a phrase modifying another word in the sentence.
    Conjunction: A conjunction joins words, phrases, or clauses.
    Interjection: An interjection is a word used to express emotion.
    Determiner or Article: A grammatical marker of definiteness (the) or indefiniteness (a, an). These are not always considered POS but are often included in POS tagging libraries.

In [2]:

some_text = "Yesterday, I booked a table at a new Italian restaurant downtown. Later that evening, I picked up a book on Renaissance art from the library."
# some_text = "While I watched the show, the guard stood at the watch post, unmoving."
# some_text = "He had to bear the weight of responsibility, even though a bear had just been spotted near the cabin."
# some_text = "The coach will train the athletes to improve their train of thought and reaction times."
# some_text = "I didn't object to the idea, but the object in question was clearly broken."



# Tokenization using NLTK
nltk_tokens = word_tokenize(some_text)

# Tokenization using spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(some_text)
spacy_tokens = [token.text for token in doc]

# Tokens produced by both tokenizers, de-duplicated in order of first appearance
common_tokens = list(OrderedDict.fromkeys(token for token in nltk_tokens if token in spacy_tokens))



# POS Tagging using NLTK
nltk_pos_tags = pos_tag(nltk_tokens)

# POS Tagging using TextBlob
blob = TextBlob(some_text, pos_tagger=PatternTagger())
blob_pos_tags = blob.tags

# POS Tagging using spaCy
spacy_pos_tags = {
    token.text: token.pos_ 
    for token in doc 
    if not token.is_punct and not token.is_space
}

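# Token -> tag lookups. A dict keeps only the last tag seen for each token,
# so a word that occurs twice with different tags is shown with a single tag.
nltk_tag_by_token = dict(nltk_pos_tags)
blob_tag_by_token = dict(blob_pos_tags)

df = pd.DataFrame({
    "Token": common_tokens,
    "POS Tag (NLTK)": [nltk_tag_by_token.get(token) for token in common_tokens],
    "POS Tag (spaCy)": [spacy_pos_tags.get(token) for token in common_tokens],
    "POS Tag (TextBlob)": [blob_tag_by_token.get(token) for token in common_tokens],
})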

HTML(df.to_html(index=False))


Token,POS Tag (NLTK),POS Tag (spaCy),POS Tag (TextBlob)
Yesterday,NN,NOUN,NN
",",",",,
I,PRP,PRON,PRP
booked,VBD,VERB,VBN
a,DT,DET,DT
table,NN,NOUN,NN
at,IN,ADP,IN
new,JJ,ADJ,JJ
Italian,JJ,ADJ,JJ
restaurant,NN,NOUN,NN
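
The comparison above looks tags up by token text, so a word that appears twice with different tags (the situation the commented-out sentences are designed to produce) would show only one of them. The sketch below, which uses only the standard library, shows one way to keep every tag per token and to list the rows where taggers disagree. The helper names `tags_per_token` and `disagreements` are hypothetical, and the tags for "book a book" are assumed for illustration; the `rows` data is copied from the NLTK and TextBlob columns of the table above.

```python
from collections import defaultdict


def tags_per_token(tagged_tokens):
    """Collect every distinct tag seen for each token, in order of first appearance."""
    seen = defaultdict(list)
    for token, tag in tagged_tokens:
        if tag not in seen[token]:
            seen[token].append(tag)
    return dict(seen)


def disagreements(rows):
    """Return the rows (token, tag_a, tag_b, ...) where the tags are not all equal."""
    return [row for row in rows if len(set(row[1:])) > 1]


# Hypothetical tagger output for "book a book"; these tags are assumed.
tagged = [("book", "VB"), ("a", "DT"), ("book", "NN")]

print(dict(tagged))            # {'book': 'NN', 'a': 'DT'}  (the verb reading is lost)
print(tags_per_token(tagged))  # {'book': ['VB', 'NN'], 'a': ['DT']}

# (token, NLTK tag, TextBlob tag) rows copied from the table above.
rows = [
    ("Yesterday", "NN", "NN"),
    ("I", "PRP", "PRP"),
    ("booked", "VBD", "VBN"),
    ("a", "DT", "DT"),
    ("table", "NN", "NN"),
]
print(disagreements(rows))     # [('booked', 'VBD', 'VBN')]
```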


In [17]:


def get_penn_treebank_mapping():
    """
    Return the part-of-speech tags used in the Penn Treebank Project as a
    DataFrame with the columns "Tag" and "Description".
    Source: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    """
    penn_treebank_mapping = {
        "CC": "Coordinating conjunction",
        "CD": "Cardinal number",
        "DT": "Determiner",
        "EX": "Existential there",
        "FW": "Foreign word",
        "IN": "Preposition or subordinating conjunction",
        "JJ": "Adjective",
        "JJR": "Adjective, comparative",
        "JJS": "Adjective, superlative",
        "LS": "List item marker",
        "MD": "Modal",
        "NN": "Noun, singular or mass",
        "NNS": "Noun, plural",
        "NNP": "Proper noun, singular",
        "NNPS": "Proper noun, plural",
        "PDT": "Predeterminer",
        "POS": "Possessive ending",
        "PRP": "Personal pronoun",
        "PRP$": "Possessive pronoun",
        "RB": "Adverb",
        "RBR": "Adverb, comparative",
        "RBS": "Adverb, superlative",
        "RP": "Particle",
        "SYM": "Symbol",
        "TO": "to",
        "UH": "Interjection",
        "VB": "Verb, base form",
        "VBD": "Verb, past tense",
        "VBG": "Verb, gerund or present participle",
        "VBN": "Verb, past participle",
        "VBP": "Verb, non-3rd person singular present",
        "VBZ": "Verb, 3rd person singular present",
        "WDT": "Wh-determiner",
        "WP": "Wh-pronoun",
        "WP$": "Possessive wh-pronoun",
        "WRB": "Wh-adverb",
    }

    # One row per tag, built directly from the (tag, description) pairs
    penn_treebank_df = pd.DataFrame(
        list(penn_treebank_mapping.items()), columns=["Tag", "Description"]
    )
    return penn_treebank_df

get_penn_treebank_mapping()

,Tag,Description
0,CC,Coordinating conjunction
1,CD,Cardinal number
2,DT,Determiner
3,EX,Existential there
4,FW,Foreign word
5,IN,Preposition or subordinating conjunction
6,JJ,Adjective
7,JJR,"Adjective, comparative"
8,JJS,"Adjective, superlative"
9,LS,List item marker
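
To connect these fine-grained Penn Treebank tags back to the traditional parts of speech listed earlier, a lookup like the one below can help. The grouping in `COARSE_POS` is a hand-written simplification assumed for this example (for instance, `IN` also covers subordinating conjunctions, and `TO` can mark an infinitive), not an official mapping, and `coarse_pos` is a hypothetical helper name.

```python
# Rough grouping of Penn Treebank tags into traditional parts of speech.
# This grouping is an assumption made for illustration, not an official mapping.
COARSE_POS = {
    "Noun": {"NN", "NNS", "NNP", "NNPS"},
    "Pronoun": {"PRP", "PRP$", "WP", "WP$"},
    "Verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"},
    "Adjective": {"JJ", "JJR", "JJS"},
    "Adverb": {"RB", "RBR", "RBS", "WRB"},
    "Preposition": {"IN", "TO"},
    "Conjunction": {"CC"},
    "Interjection": {"UH"},
    "Determiner": {"DT", "PDT", "WDT"},
}


def coarse_pos(penn_tag):
    """Return the traditional part of speech for a Penn tag, or 'Other'."""
    for name, tags in COARSE_POS.items():
        if penn_tag in tags:
            return name
    return "Other"


# NLTK tags taken from the comparison table earlier in the notebook.
tagged = [("I", "PRP"), ("booked", "VBD"), ("a", "DT"), ("table", "NN")]
print([(token, coarse_pos(tag)) for token, tag in tagged])
# [('I', 'Pronoun'), ('booked', 'Verb'), ('a', 'Determiner'), ('table', 'Noun')]
```

Tags that have no obvious traditional counterpart, such as `CD` or `SYM`, fall through to "Other".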


## Visualizing Part of Speech tags with spaCy

In [3]:
# Dependency parse; each token is labelled with its coarse POS tag
displacy.render(doc, style="dep", jupyter=True)

In [4]:
# Named entities found in the text (entity labels, not POS tags)
displacy.render(doc, style="ent", jupyter=True)

In [None]:

# Inspiration: https://towardsdatascience.com/visualizing-part-of-speech-tags-with-nltk-and-spacy-42056fcd777e/

# Universal POS tags as used by spaCy (SPACE is a spaCy-specific addition)
ALL_POS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ",
    "NOUN", "NUM", "PART", "PRON", "PROPN", "PUNCT",
    "SCONJ", "SYM", "VERB", "X", "SPACE"
]

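def get_spacy_pos_ents(doc, pos_tags):
    """
    Build displacy "manual" spans (character offsets plus a label) for every
    token whose POS tag is in pos_tags. Punctuation and whitespace tokens are skipped.
    """
    return [
        {"start": token.idx, "end": token.idx + len(token.text), "label": token.pos_}
        for token in doc
        if token.pos_ in pos_tags and not token.is_space and not token.is_punct
    ]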

def get_default_pos_colors():
    return {
        "ADJ": "lime",
        "ADP": "khaki",
        "ADV": "orange",
        "AUX": "plum",
        "CCONJ": "cornflowerblue",
        "DET": "forestgreen",
        "INTJ": "lightcoral",
        "NOUN": "turquoise",
        "NUM": "salmon",
        "PART": "yellow",
        "PRON": "blueviolet",
        "PROPN": "lightseagreen",
        "PUNCT": "lightgray",
        "SCONJ": "rosybrown",
        "SYM": "darkorange",
        "VERB": "lightpink",
        "X": "silver",
        "SPACE": "white"
    }

def visualize_pos_spacy(doc, pos_tags=None, colors=None, displacy_options=None):
    """
    Visualize POS tags with displacy's 'ent' style, starting from a spaCy Doc.

    Parameters:
    - doc: an already processed spaCy Doc object
    - pos_tags: list of spaCy POS tags (such as "NOUN", "VERB", etc.) to visualize
    - colors: dictionary of custom colors for the POS tags
    - displacy_options: dictionary of additional settings for displacy
    """
    if not isinstance(doc, Doc):
        raise TypeError("The 'doc' parameter must be an instance of spacy.tokens.Doc.")

    if pos_tags is None:
        pos_tags = ALL_POS_TAGS

    if colors is None:
        colors = get_default_pos_colors()

    if displacy_options is None:
        displacy_options = {}

    ents = get_spacy_pos_ents(doc, pos_tags)

    options = {"ents": pos_tags, "colors": colors}
    options.update(displacy_options)

    displacy.render({"text": doc.text, "ents": ents}, style="ent", manual=True, options=options)

doc = nlp(some_text)
visualize_pos_spacy(doc)
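
Because `visualize_pos_spacy` renders in manual mode, displacy only needs the text plus a list of `start`/`end`/`label` spans. The sketch below shows how such spans could be built from plain (token, tag) pairs, for example the output of another tagger, without a spaCy `Doc`. The helper name `spans_from_tagged_tokens` is hypothetical, and it assumes each token appears verbatim in the text in the order given.

```python
def spans_from_tagged_tokens(text, tagged_tokens):
    """Locate each token in `text` and return displacy-style span dicts."""
    spans = []
    cursor = 0  # search from the end of the previous token to keep order
    for token, label in tagged_tokens:
        start = text.find(token, cursor)
        if start == -1:
            continue  # token text not found verbatim; skip it
        end = start + len(token)
        spans.append({"start": start, "end": end, "label": label})
        cursor = end
    return spans


text = "I booked a table"
# spaCy tags taken from the comparison table earlier in the notebook.
tagged = [("I", "PRON"), ("booked", "VERB"), ("a", "DET"), ("table", "NOUN")]

manual_doc = {"text": text, "ents": spans_from_tagged_tokens(text, tagged)}
print(manual_doc["ents"])
# [{'start': 0, 'end': 1, 'label': 'PRON'}, {'start': 2, 'end': 8, 'label': 'VERB'},
#  {'start': 9, 'end': 10, 'label': 'DET'}, {'start': 11, 'end': 16, 'label': 'NOUN'}]
```

A dictionary shaped like `manual_doc` is what the function above passes to `displacy.render(..., style="ent", manual=True, options=options)`.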