# Part-of-Speech Tagging

Part-of-Speech (POS) tagging is a natural language processing technique that involves assigning specific grammatical categories or labels (such as nouns, verbs, adjectives, adverbs, pronouns, etc.) to individual words within a sentence. This process provides insights into the syntactic structure of the text, aiding in understanding word relationships, disambiguating word meanings, and facilitating various linguistic and computational analyses of textual data.

POS tagging matters because it lets computers identify the grammatical role each word plays in a sentence (noun, verb, adjective, and so on). That information helps disambiguate words with multiple meanings and provides a foundation for deeper semantic analysis, which makes it useful for tasks such as machine translation, sentiment analysis, and question answering.

## Universal POS tags

These tags mark the core part-of-speech categories. To distinguish additional lexical and grammatical properties of words, use the universal features.

* ADJ: adjective
* ADP: adposition
* ADV: adverb
* AUX: auxiliary
* CCONJ: coordinating conjunction
* DET: determiner
* INTJ: interjection
* NOUN: noun
* NUM: numeral
* PART: particle
* PRON: pronoun
* PROPN: proper noun
* PUNCT: punctuation
* SCONJ: subordinating conjunction
* SYM: symbol
* VERB: verb
* X: other
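
The Universal Dependencies documentation groups these 17 tags into open-class words, closed-class words, and a remaining "other" group. A minimal sketch of that grouping as a lookup table follows; the names `UPOS_CLASSES` and `word_class` are illustrative and not part of any library.

```python
# Grouping of the 17 universal POS tags (names below are illustrative).
UPOS_CLASSES = {
    "open": ["ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"],
    "closed": ["ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ"],
    "other": ["PUNCT", "SYM", "X"],
}

def word_class(tag):
    """Return the group a universal POS tag belongs to, or None if unknown."""
    for group, tags in UPOS_CLASSES.items():
        if tag in tags:
            return group
    return None

print(word_class("NOUN"))   # open
print(word_class("DET"))    # closed
print(word_class("PUNCT"))  # other
```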


## Workflow

A typical POS tagging pipeline involves the following steps:

1. Tokenization: Divide the input text into discrete tokens, usually words or subwords. Tokenization is the first stage in most NLP tasks.
2. Loading Language Models: When using a library such as NLTK or spaCy, load the relevant language model. These models have been trained on large amounts of linguistic data and encode a language’s grammatical patterns.
3. Text Processing: If required, preprocess the text to handle special characters or remove superfluous information. Clean text aids correct POS labeling, but note that lowercasing can remove the capitalization cues that taggers use to recognize proper nouns.
4. Linguistic Analysis: Analyze the text’s grammatical structure, that is, the context in which each word appears and the role it could play in the sentence.
5. Part-of-Speech Tagging: Assign a tag (noun, verb, adjective, and so on) to each token, based on the word itself and the context established in the previous step.
6. Results Analysis: Check the tagging results against the source text for accuracy and consistency, then identify and correct any mistagged words.
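
The steps above can be sketched in plain Python. This is a toy illustration only: the lexicon `TOY_LEXICON` is a hypothetical stand-in for a trained language model, and the function names are assumptions made for this example.

```python
import re

# Hypothetical toy lexicon standing in for a trained language model.
TOY_LEXICON = {
    "the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN",
}

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """Look each token up in the lexicon; unknown words get X (other)."""
    tagged = []
    for token in tokens:
        if not any(ch.isalnum() for ch in token):
            tagged.append((token, "PUNCT"))
        else:
            tagged.append((token, TOY_LEXICON.get(token.lower(), "X")))
    return tagged

def review(tagged):
    """Return the tokens that could not be tagged, for manual inspection."""
    return [token for token, pos in tagged if pos == "X"]

tagged = tag(tokenize("The dog sat on the mat."))
print(tagged)
print("Needs review:", review(tagged))  # Needs review: ['dog']
```

A real pipeline would replace the lexicon lookup with a trained tagger, but the overall shape (tokenize, tag, review) stays the same.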

## Types

1. Rule Based POS Tagging

    Rule-based part-of-speech (POS) tagging is a method of labeling words with their corresponding parts of speech using a set of pre-defined rules. This is in contrast to machine learning-based POS tagging, which relies on training a model on a large annotated corpus of text.

    In a rule-based POS tagging system, words are assigned POS tags based on their characteristics and the context in which they appear. For example, a rule-based POS tagger might assign the tag “noun” to any word that ends in “-tion” or “-ment,” as these suffixes are often used to form nouns.

    Rule-based POS taggers can be relatively simple to implement and are often used as a starting point for more complex machine learning-based taggers. However, they are usually less accurate than machine learning-based taggers, and they become hard to maintain as the number of hand-written rules grows.

    Here is a very basic example of how a rule-based POS tagger might work:

    1. Define a set of rules for assigning POS tags to words. For example:

        * If the word ends in “-tion,” assign the tag “noun.”
        * If the word ends in “-ment,” assign the tag “noun.”
        * If the word is all uppercase, assign the tag “proper noun.”
        * If the word ends in “-ing,” assign the tag “verb.”

    2. Iterate through the words in the text and apply the rules to each word in turn. For example:

        * “Nation” would be tagged as “noun” based on the first rule.
        * “Investment” would be tagged as “noun” based on the second rule.
        * “UNITED” would be tagged as “proper noun” based on the third rule.
        * “Running” would be tagged as “verb” based on the fourth rule.

    3. Output the POS tags for each word in the text.

    More complex systems can include additional rules and logic to handle more varied and nuanced text.
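
A minimal sketch of the four example rules, applied in order with the first match winning. The function name `rule_based_tag` and the fallback tag "unknown" are assumptions made for this illustration.

```python
def rule_based_tag(word):
    """Apply the suffix and capitalization rules in order; first match wins."""
    lower = word.lower()
    if lower.endswith("tion") or lower.endswith("ment"):
        return "noun"
    if word.isupper() and len(word) > 1:
        return "proper noun"
    if lower.endswith("ing"):
        return "verb"
    return "unknown"  # no rule matched

for word in ["Nation", "Investment", "UNITED", "Running", "cat"]:
    print(word, "->", rule_based_tag(word))
```

Rules this simple misfire easily ("morning" would be tagged as a verb), which is why practical rule-based taggers combine many rules with context checks.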

2. Statistical POS Tagging

    Statistical part-of-speech (POS) tagging is a method of labeling words with their corresponding parts of speech using statistical techniques. This is in contrast to rule-based POS tagging, which relies on pre-defined rules, and to unsupervised learning-based POS tagging, which does not use any annotated training data.

    In statistical POS tagging, a model is trained on a large annotated corpus of text to learn the patterns and characteristics of different parts of speech. The model uses this training data to predict the POS tag of a given word based on the context in which it appears and the probability of different POS tags occurring in that context.

    Statistical POS taggers can be more accurate and efficient than rule-based taggers, especially for tasks with large or complex datasets. However, they require a large amount of annotated training data and can be computationally intensive to train.

    Here is an example of how a statistical POS tagger might work:

    1. Collect a large annotated corpus of text and divide it into training and testing sets.
    2. Train a statistical model on the training data, for example a hidden Markov model whose probabilities are estimated by maximum likelihood.
    3. Use the trained model to predict the POS tags of the words in the testing data.
    4. Evaluate the performance of the model by comparing the predicted tags to the true tags in the testing data and calculating metrics such as accuracy, precision, and recall.
    5. Adjust the model and repeat the process until the desired level of accuracy is achieved.
    6. Use the trained model to perform POS tagging on new, unseen text.

    There are various statistical techniques that can be used for POS tagging, and the choice of technique will depend on the specific characteristics of the dataset and the desired level of accuracy.
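
As a rough illustration of the train, predict, and evaluate loop, the sketch below uses one of the simplest statistical taggers: each word gets the tag it was most often seen with in training, and unseen words get the overall most frequent tag. The tiny corpus is made up for this example and all names are illustrative.

```python
from collections import Counter, defaultdict

# Tiny hand-made corpus (hypothetical data, for illustration only).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]
TEST = [
    [("a", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bird", "NOUN"), ("sings", "VERB")],
]

def train_unigram(sentences):
    """Count word/tag pairs and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            tag_totals[tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    default_tag = tag_totals.most_common(1)[0][0]  # fallback for unseen words
    return model, default_tag

def predict(model, default_tag, words):
    return [model.get(word, default_tag) for word in words]

def accuracy(model, default_tag, sentences):
    correct = total = 0
    for sentence in sentences:
        words = [word for word, _ in sentence]
        gold = [tag for _, tag in sentence]
        predicted = predict(model, default_tag, words)
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total

model, default_tag = train_unigram(TRAIN)
print(f"Accuracy: {accuracy(model, default_tag, TEST):.2f}")  # Accuracy: 0.83
```

The single error comes from the unseen word "sings", which falls back to the default tag NOUN. Context-aware models such as HMMs address exactly this kind of mistake.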

3. Transformation-based tagging (TBT)

    Transformation-based tagging (TBT), also known as Brill tagging, is a method of part-of-speech (POS) tagging that starts from an initial tagging (for example, each word’s most frequent tag) and then applies an ordered series of rules that transform the tags of words in the text. This is in contrast to rule-based POS tagging, which assigns tags directly from pre-defined rules, and to statistical POS tagging, which relies on a trained model to predict tags based on probability.

    In TBT, a set of rules is defined to transform the tags of words in a text based on the context in which they appear. For example, a rule might change the tag of a verb to a noun if it appears after a determiner such as “the.” The rules are applied to the text in a specific order, and the tags are updated after each transformation.

    TBT can be more accurate than rule-based tagging, especially for tasks with complex grammatical structures. However, it can be more computationally intensive and requires a larger set of rules to achieve good performance.

    Here is a very basic example of how a TBT system might work: 

    1. Define a set of rules for transforming the tags of words in the text. For example:

        * If a word is tagged “verb” and appears directly after a determiner, change the tag to “noun.”
        * If a word is tagged “noun” and appears directly after the particle “to,” change the tag to “verb.”

    2. Start from an initial tagging and apply the rules in a specific order. For example:

        * In the sentence “The run was long,” if “run” is initially tagged as a verb, the first rule changes it to a noun because it follows the determiner “The.”
        * In the sentence “She wants to race,” if “race” is initially tagged as a noun, the second rule changes it to a verb because it follows “to.”

    3. Output the transformed tags for each word in the text.

    More complex systems can include additional rules and logic to handle more varied and nuanced text.
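
A minimal sketch of transformation-based tagging with two illustrative rules: a verb directly after a determiner becomes a noun, and a noun directly after the particle "to" becomes a verb. The initial lexicon, the second rule, and all names here are assumptions chosen for illustration; real systems typically learn their rules from an annotated corpus.

```python
# Hypothetical initial lexicon: each word's assumed most frequent tag.
INITIAL_TAGS = {
    "the": "DET", "run": "VERB", "was": "AUX", "long": "ADJ",
    "she": "PRON", "wants": "VERB", "to": "PART", "race": "NOUN",
}

# Each rule: (current tag, required previous tag, new tag), applied in order.
RULES = [
    ("VERB", "DET", "NOUN"),
    ("NOUN", "PART", "VERB"),
]

def initial_tag(words):
    """Assign each word its initial tag; unknown words default to NOUN."""
    return [INITIAL_TAGS.get(word.lower(), "NOUN") for word in words]

def apply_rules(tags):
    """Apply each transformation rule in turn, updating tags as we go."""
    tags = list(tags)
    for current, previous, new in RULES:
        for i in range(1, len(tags)):
            if tags[i] == current and tags[i - 1] == previous:
                tags[i] = new
    return tags

for sentence in ["The run was long", "She wants to race"]:
    words = sentence.split()
    print(list(zip(words, apply_rules(initial_tag(words)))))
```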

4. Hidden Markov Model POS tagging

    Hidden Markov models (HMMs) are a type of statistical model that can be used for part-of-speech (POS) tagging in natural language processing (NLP). In an HMM-based POS tagger, a model is trained on a large annotated corpus of text to learn the patterns and characteristics of different parts of speech. The model uses this training data to predict the POS tag of a given word based on the probability of different tags occurring in the context of the word.

    An HMM-based POS tagger consists of a set of states, each corresponding to a possible POS tag, and a set of transitions between the states. The model is trained on the training data to learn the probabilities of transitioning from one state to another and the probabilities of observing different words given a particular state.

    To perform POS tagging on a new text using an HMM-based tagger, the model uses the probabilities learned during training to compute the most likely sequence of POS tags for the words in the text. This is typically done using the Viterbi algorithm, a dynamic programming method that finds the most likely tag sequence without enumerating every possible sequence.

    HMMs are widely used for POS tagging and other tasks in NLP due to their ability to model complex sequential data and their efficiency in computation. However, they can be sensitive to the quality of the training data and may require a large amount of annotated data to achieve good performance.
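
The Viterbi decoding step can be sketched as follows. The start, transition, and emission probabilities below are made-up toy values rather than estimates from a real corpus, and `UNSEEN` is an assumed small probability for words a state never emitted in training.

```python
STATES = ["DET", "NOUN", "VERB"]

# Toy probabilities (hypothetical values, for illustration only).
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VERB": {"walk": 0.5, "barks": 0.5},
}
UNSEEN = 1e-6  # assumed probability for word/tag pairs not in EMIT

def viterbi(words):
    """Return the most likely tag sequence for the given words."""
    if not words:
        return []
    # best[i][s]: probability of the best tag path ending in state s at word i
    best = [{s: START[s] * EMIT[s].get(words[0], UNSEEN) for s in STATES}]
    back = [{}]  # back[i][s]: previous state on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for s in STATES:
            prev, score = max(
                ((p, best[i - 1][p] * TRANS[p][s]) for p in STATES),
                key=lambda pair: pair[1],
            )
            best[i][s] = score * EMIT[s].get(words[i], UNSEEN)
            back[i][s] = prev
    # Trace back from the most probable final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "walk"]))  # ['DET', 'NOUN', 'VERB']
print(viterbi(["the", "walk"]))         # ['DET', 'NOUN']
```

Note how "walk" is tagged VERB after a noun but NOUN after a determiner: the transition probabilities resolve the ambiguity. Production implementations usually work with log probabilities to avoid numerical underflow on long sentences.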

## POS Tagging with spaCy

```python
# Import libraries
import spacy

# Load the English language model
nlp = spacy.load('en_core_web_sm')

# Sample text
sample_text = "Virat Kohli’s first act of the day stemmed from forgetfulness. He went and stood at second slip,\
    his customary position when Rohit Sharma tenants the first slip position. KL Rahul gestured to Kohli, who smiled,\
        and went and stood in the first slip. That was the only ‘lapse’ as Kohli was actively involved in every little move through Australia’s batting innings."

# Lowercasing
text = sample_text.lower()

# Process the text using spaCy's English model
doc = nlp(text)

# Print the original text and POS tags
print("Original Text: ", sample_text)
print("POS tagging result: ")

for token in doc:
    print(f"{token} -> {token.pos_}")
```

```
Original Text:  Virat Kohli’s first act of the day stemmed from forgetfulness. He went and stood at second slip,    his customary position when Rohit Sharma tenants the first slip position. KL Rahul gestured to Kohli, who smiled,        and went and stood in the first slip. That was the only ‘lapse’ as Kohli was actively involved in every little move through Australia’s batting innings.
POS tagging result: 
virat -> PROPN
kohli -> PROPN
’s -> PART
first -> ADJ
act -> NOUN
of -> ADP
the -> DET
day -> NOUN
stemmed -> VERB
from -> ADP
forgetfulness -> NOUN
. -> PUNCT
he -> PRON
went -> VERB
and -> CCONJ
stood -> VERB
at -> ADP
second -> ADJ
slip -> NOUN
, -> PUNCT
    -> SPACE
his -> PRON
customary -> ADJ
position -> NOUN
when -> SCONJ
rohit -> PROPN
sharma -> PROPN
tenants -> VERB
the -> DET
first -> ADJ
slip -> ADJ
position -> NOUN
. -> PUNCT
kl -> PROPN
rahul -> PROPN
gestured -> VERB
to -> ADP
kohli -> PROPN
, -> PUNCT
who -> PRON
smiled -> VERB
, -> PUNCT
        -> SPACE
and -> CCONJ
```


## Use Cases

Here are some use cases of POS tagging:

1. Syntactic Analysis: By understanding the grammatical role of each word (e.g., noun phrase, verb phrase), POS tagging helps analyze the sentence structure and relationships between words. This is achieved using hidden Markov models and other algorithms that predict the most likely sequence of POS tags based on the given text.
2. Disambiguation: Words like “play” can be a noun or verb. POS tagging helps identify the correct meaning based on context, using tagsets that define the possible tags for each word type and their contexts.
3. Language Modeling: POS tags provide valuable information about the relationships between words, which is useful for building statistical models of language. These models can be enhanced with deep learning techniques to improve their accuracy and handling of complex linguistic patterns.
4. Preprocessing for Other NLP Tasks: POS tagging is often a preliminary step for tasks like named entity recognition and information extraction. By identifying the part of speech for each word, we can better understand the structure of the text and extract relevant information more accurately. For example, prepositions and other function words help determine the relationships between entities in a sentence.
5. Lemmatization and Stemming: These techniques reduce words to their base forms (e.g., “running” to “run”). POS tags can help identify the correct base form depending on the word’s function in the sentence. For example, “saw” lemmatizes to “see” when it is a verb but stays “saw” when it is a noun.
6. Grammar Checking: POS information can be used to flag potential grammatical errors, like using a verb in the wrong tense. This is particularly useful in grammar checking software, where the POS tagger output helps identify mistakes.
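
To illustrate the lemmatization use case, here is a toy sketch in which the lemma depends on both the word and its POS tag. The table `LEMMAS` is a small hand-written example rather than a real lemmatizer, and the function name is an assumption.

```python
# Hypothetical lemma table keyed by (word, POS tag).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("running", "VERB"): "run",
}

def lemmatize(word, pos):
    """Return the base form for a word given its POS tag.

    Falls back to the lowercased word when the pair is not in the table.
    """
    return LEMMAS.get((word.lower(), pos), word.lower())

print(lemmatize("saw", "VERB"))     # see
print(lemmatize("saw", "NOUN"))     # saw
print(lemmatize("Leaves", "NOUN"))  # leaf
```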

## Challenges

Some common challenges in part-of-speech (POS) tagging include:

1. Ambiguity: Some words can have multiple POS tags depending on the context in which they appear, making it difficult to determine their correct tag. For example, the word “bass” can be a noun (a type of fish) or an adjective (having a low frequency or pitch).

2. Out-of-vocabulary (OOV) words: Words that are not present in the training data of a POS tagger can be difficult to tag accurately, especially if they are rare or specific to a particular domain.
3. Complex grammatical structures: Languages with complex grammatical structures, such as languages with many inflections or free word order, can be more challenging to tag accurately.
4. Lack of annotated training data: Some languages or domains may have limited annotated training data, making it difficult to train a high-performing POS tagger.
5. Inconsistencies in annotated data: Annotated data can sometimes contain errors or inconsistencies, which can negatively impact the performance of a POS tagger.
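
Two of these challenges, ambiguity and out-of-vocabulary words, are easy to quantify on a tagged corpus. The sketch below uses a made-up six-token corpus; the data and function names are assumptions for illustration.

```python
from collections import defaultdict

# Tiny tagged corpus (hypothetical data, for illustration only).
TAGGED = [
    ("the", "DET"), ("bass", "NOUN"), ("swam", "VERB"),
    ("the", "DET"), ("bass", "ADJ"), ("line", "NOUN"),
]

def ambiguous_words(tagged):
    """Return words that appear with more than one tag, with their tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged:
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(tags) for w, tags in tags_by_word.items() if len(tags) > 1}

def oov_rate(tagged, new_tokens):
    """Fraction of new tokens that never appeared in the tagged corpus."""
    vocabulary = {word.lower() for word, _ in tagged}
    unseen = [token for token in new_tokens if token.lower() not in vocabulary]
    return len(unseen) / len(new_tokens)

print(ambiguous_words(TAGGED))                           # {'bass': ['ADJ', 'NOUN']}
print(oov_rate(TAGGED, ["the", "bass", "drum", "kit"]))  # 0.5
```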


## Sources

* Universal POS Tags: [universaldependencies.org](https://universaldependencies.org/u/pos/)
* GeeksForGeeks: [POS(Parts-Of-Speech) Tagging in NLP](https://www.geeksforgeeks.org/nlp-part-of-speech-default-tagging/)
* Shiksha: [Understanding Part-of-Speech Tagging in NLP: Techniques and Applications](https://www.shiksha.com/online-courses/articles/pos-tagging-in-nlp/)
* Analytics Vidhya: [How Part-of-Speech Tag, Dependency and Constituency Parsing Aid In Understanding Text Data?](https://www.analyticsvidhya.com/blog/2020/07/part-of-speechpos-tagging-dependency-parsing-and-constituency-parsing-in-nlp/)