
Note:

- This section covers how words are categorized into parts of speech (POS) and how those categories are assigned automatically to words in text.
- It lays the groundwork for understanding language structure, which higher-level natural language processing (NLP) tasks build on.



### **What are Lexical Categories?**

- Lexical categories, also known as **parts of speech**, are the groups or classes into which words are classified based on their grammatical and syntactic roles within sentences.

- These categories determine how a word functions in a sentence, and knowing a word's part of speech allows a computer to interpret and process its meaning more accurately. Some of the most common lexical categories are:

  - **Nouns (N)**: Represent people, places, things, or ideas (e.g., *dog*, *city*, *intelligence*).
  - **Verbs (V)**: Indicate actions or states (e.g., *run*, *is*, *sing*).
  - **Adjectives (ADJ)**: Modify nouns, describing qualities or characteristics (e.g., *blue*, *fast*, *large*).
  - **Adverbs (ADV)**: Modify verbs, adjectives, or other adverbs (e.g., *quickly*, *very*, *smoothly*).
  - **Pronouns (PRON)**: Substitute for nouns or noun phrases (e.g., *he*, *she*, *it*).
  - **Determiners (DET)**: Introduce nouns (e.g., *the*, *a*, *some*).
  - **Prepositions (ADP)**: Relate nouns to other words in a sentence (e.g., *on*, *in*, *under*).
  - **Conjunctions (CONJ)**: Connect words, phrases, or clauses (e.g., *and*, *but*, *if*).

- These categories allow NLP models to structure and parse language effectively, making it easier to interpret meaning and syntactic relations.



### **Open vs. Closed Classes**

Words are generally divided into two primary groups: **open classes** and **closed classes**.

- **Open Classes**:
  - Include nouns, verbs, adjectives, and adverbs. These categories are called "open" because they are constantly expanding as new words are coined and added. For example, recent additions to the lexicon like "selfie" and "cryptocurrency" belong to these open categories.
  
- **Closed Classes**:
  - Include pronouns, prepositions, determiners, and conjunctions. These categories are "closed" because they have a limited and relatively fixed set of members. New additions to these classes are rare, and they change very gradually over time.

This distinction is crucial in understanding language dynamics. While open-class words provide most of the content in language (i.e., **lexical words**), closed-class words act as function words, providing grammatical structure.
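
The open/closed split can be expressed as a simple lookup. The sketch below is illustrative only: the tag names reuse the abbreviations from the list of lexical categories above, the example sentence is tagged by hand, and the helper names `word_class` and `content_words` are hypothetical.

```python
# Coarse tag names follow the abbreviations used earlier in this note.
OPEN_CLASS_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}
CLOSED_CLASS_TAGS = {"PRON", "ADP", "DET", "CONJ"}


def word_class(tag):
    """Return 'open', 'closed', or 'other' for a coarse POS tag."""
    if tag in OPEN_CLASS_TAGS:
        return "open"
    if tag in CLOSED_CLASS_TAGS:
        return "closed"
    return "other"


def content_words(tagged_tokens):
    """Keep only the open-class (content) words from (word, tag) pairs."""
    return [word for word, tag in tagged_tokens if word_class(tag) == "open"]


# Hand-tagged example sentence (tags assigned manually for illustration).
tagged_sentence = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"),
]

print(content_words(tagged_sentence))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Dropping the closed-class words leaves the words that carry most of the sentence's content, which is the intuition behind many stop-word filters.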



### **Part-of-Speech (POS) Tagging**

- **POS tagging** is the process of assigning the appropriate part-of-speech tag to each word in a sentence.
- This process allows NLP models to determine not just what a word is, but how it functions in context.
- For example, the word "run" can be a noun ("a run in the park") or a verb ("I run every day"), and the POS tagger helps resolve such ambiguities by using contextual information.

    **Example**:  
    Sentence: *The quick brown fox jumps over the lazy dog.*  
    POS Tags:  
    - The (DET)
    - quick (ADJ)
    - brown (ADJ)
    - fox (NOUN)
    - jumps (VERB)
    - over (ADP)
    - the (DET)
    - lazy (ADJ)
    - dog (NOUN)



### **Importance of POS Tagging in NLP**

POS tagging is a crucial preprocessing step in NLP pipelines. It plays a role in multiple downstream tasks, including:

1. **Syntactic Parsing**: Understanding the syntactic structure of a sentence requires identifying the roles of words based on their POS.
2. **Information Retrieval**: Helps in extracting relevant information from text by focusing on particular types of words (e.g., nouns, verbs).
3. **Machine Translation**: POS tags help a system choose a translation that matches the word's function in the sentence.
4. **Named Entity Recognition (NER)**: NER often relies on POS tagging to identify proper nouns and specific entities (e.g., people, organizations, locations).
5. **Sentiment Analysis**: The role of adjectives and adverbs in text is crucial for determining the sentiment (positive or negative) expressed in a sentence.



### **Automating POS Tagging: Challenges and Approaches**

- Automating POS tagging is a non-trivial task due to linguistic complexities, such as **ambiguity**.
- Many words in English can take multiple parts of speech, depending on the context. For example:

- **Word**: "fly"
  - **Noun**: The *fly* on the wall.
  - **Verb**: Birds *fly* in the sky.

- Automated POS tagging uses statistical models or machine learning algorithms to resolve such ambiguities and assign the correct tags.
- There are several methods for implementing POS taggers:

  - **Rule-Based Tagging**:
    - Involves manually defined rules for assigning tags based on the words and their contexts (e.g., a word ending in "ing" might be tagged as a present participle verb).
  - **Stochastic/Probabilistic Tagging**:
    - Uses statistical models to assign POS tags based on the likelihood of a tag given a word and its surrounding context.
      - Popular techniques include:
        - **Unigram Tagging**:
          - Assigns the most frequent tag for each word, ignoring context.
        - **Bigram and Trigram Tagging (N-Gram Tagging)**:
          - Considers the probability of a tag given the previous one or two tags, thereby incorporating context.
        
  - **Supervised Learning Methods**:
    - Machine learning models trained on large corpora of tagged text.
    - Some of the common supervised methods include:
      - **Hidden Markov Models (HMM)**:
        - A sequence model in which the hidden states correspond to POS tags and the observations are the words in the sentence.
      - **Maximum Entropy Models**:
        - A more flexible probabilistic framework that can incorporate multiple features to assign tags.
      - **Conditional Random Fields (CRF)**:
        - A sequence labeling model that scores the whole tag sequence jointly and can use rich, overlapping features; it is commonly used for POS tagging, especially for morphologically rich languages.
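
To make the unigram and rule-based ideas concrete, here is a minimal pure-Python sketch. It assumes a tiny hand-made training set and a few hypothetical suffix rules; it is not NLTK's implementation, and the names `train_unigram`, `suffix_rule`, and `tag_sentence` are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set of (word, tag) pairs; purely illustrative.
TRAINING_DATA = [
    ("the", "DET"), ("fly", "NOUN"), ("sat", "VERB"), ("on", "ADP"),
    ("the", "DET"), ("wall", "NOUN"),
    ("birds", "NOUN"), ("fly", "VERB"), ("in", "ADP"), ("the", "DET"),
    ("sky", "NOUN"),
    ("planes", "NOUN"), ("fly", "VERB"), ("quickly", "ADV"),
]


def train_unigram(pairs):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def suffix_rule(word):
    """Hand-written fallback rules for words the unigram model has not seen."""
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    if word.endswith("ly"):
        return "ADV"
    return "NOUN"  # assumed default guess for unknown words


def tag_sentence(words, model):
    """Tag each word with its most frequent tag, backing off to suffix rules."""
    return [(w, model.get(w.lower()) or suffix_rule(w.lower())) for w in words]


model = train_unigram(TRAINING_DATA)
print(tag_sentence(["The", "fly", "on", "the", "wall"], model))
# [('The', 'DET'), ('fly', 'VERB'), ('on', 'ADP'), ('the', 'DET'), ('wall', 'NOUN')]
print(tag_sentence(["Birds", "flying", "slowly"], model))
# [('Birds', 'NOUN'), ('flying', 'VERB'), ('slowly', 'ADV')]
```

The first result tags *fly* as a verb even though it follows a determiner. That is the expected weakness of a unigram tagger: it always picks the most frequent tag and ignores context.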

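
The HMM approach listed above is usually decoded with the Viterbi algorithm, which finds the most probable tag sequence for a sentence. The sketch below assumes a three-tag toy model whose start, transition, and emission probabilities are set by hand (a real tagger would estimate them from a tagged corpus); every number and name in it is hypothetical.

```python
# Hypothetical, hand-set probabilities for a three-tag toy HMM.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT_P = {
    "DET": {"the": 1.0},
    "NOUN": {"fly": 0.3, "birds": 0.4, "wall": 0.3},
    "VERB": {"fly": 0.8, "sat": 0.2},
}
UNSEEN = 1e-6  # small probability for word/tag pairs missing from EMIT_P


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []
    # best[i][tag] = (probability of the best path ending in `tag` at i, previous tag)
    best = [{}]
    for tag in TAGS:
        best[0][tag] = (START_P[tag] * EMIT_P[tag].get(words[0], UNSEEN), None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            # Pick the previous tag that maximizes the path probability so far.
            prev = max(TAGS, key=lambda p: best[i - 1][p][0] * TRANS_P[p][tag])
            prob = (best[i - 1][prev][0] * TRANS_P[prev][tag]
                    * EMIT_P[tag].get(words[i], UNSEEN))
            best[i][tag] = (prob, prev)
    # Backtrack from the best final tag to recover the full sequence.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    path.reverse()
    return path


print(viterbi(["the", "fly"]))    # ['DET', 'NOUN']
print(viterbi(["birds", "fly"]))  # ['NOUN', 'VERB']
```

Unlike the unigram approach, the same word *fly* receives different tags here because the transition probabilities make a noun likely after a determiner and a verb likely after a noun.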


### **Example: Using NLTK for POS Tagging**

Here is an example of how POS tagging works using the **Natural Language Toolkit (NLTK)** in Python:


In [5]:
# Importing necessary modules from NLTK (Natural Language Toolkit)
import nltk
from nltk import pos_tag  # pos_tag is a function to perform Part-of-Speech tagging
from nltk.tokenize import word_tokenize  # word_tokenize is used to split the sentence into words

# Download the Punkt models that word_tokenize() relies on
nltk.download('punkt')
# Download the pre-trained averaged perceptron model that pos_tag() relies on
nltk.download('averaged_perceptron_tagger')
# Note: recent NLTK releases (3.9 and later) look for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead; download those if a LookupError appears.

# Example sentence that we will process
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenizing the sentence
# This step splits the sentence into individual tokens (words and punctuation).
# For example, "The quick brown fox jumps over the lazy dog."
# becomes ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
tokens = word_tokenize(sentence)

# Applying POS tagging
# The pos_tag() function takes the list of words (tokens) and assigns a part of speech (POS) tag to each one.
# For example:
# ('The', 'DT') --> 'DT' means Determiner
# ('quick', 'JJ') --> 'JJ' means Adjective
# ('fox', 'NN') --> 'NN' means Noun, singular or mass
tagged = pos_tag(tokens)

# Printing the list of tokens along with their POS tags
# The exact tags depend on the tagger model; the run recorded below tags
# 'brown' as NN rather than JJ and includes the final '.' as its own token.
print(tagged)

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!



This simple example shows how NLTK assigns POS tags to each word in a sentence.
Here,
  - **DT** stands for determiner,
  - **JJ** for adjective,
  - **NN** for noun, and
  - **VBZ** for a verb in third-person singular form.


In [9]:

# Print the meanings of the POS tags used in the sentence
from nltk.data import load # Load data from NLTK's data directory
nltk.download('tagsets') # This line is added to download tagset information

print("\nPOS Tag Meanings:")
tagdict = load('help/tagsets/upenn_tagset.pickle')
for word, tag in tagged:
    if tag in tagdict:
        print(f"{word}  -->  {tag}: {tagdict[tag]}")


POS Tag Meanings:
The  -->  DT: ('determiner', 'all an another any both del each either every half la many much nary neither no some such that the them these this those ')
quick  -->  JJ: ('adjective or numeral, ordinal', 'third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary ... ')
brown  -->  NN: ('noun, common, singular or mass', 'common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ... ')
fox  -->  NN: ('noun, common, singular or mass', 'common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ... ')
jumps  -->  VBZ: ('verb, present tense, 3rd person singular', 'bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes smolders pictures emerges

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


### Observations:

#### Which POS tagging standard is used in the code above?


The **POS tagging standard** used in the code is the **Penn Treebank Tagset**. This is evident because **the default tagger in NLTK (which the code utilizes) applies the Penn Treebank tagset for English.**

##### Examples from the Output of the Code:
- **'DT'**: Determiner (e.g., *The*)
- **'JJ'**: Adjective (e.g., *quick*)
- **'NN'**: Noun, singular or mass (e.g., *fox*)
- **'VBZ'**: Verb, 3rd person singular present (e.g., *jumps*)

The Penn Treebank tagset is one of the most commonly used tagsets for English POS tagging. It defines 36 part-of-speech tags, plus separate tags for punctuation and symbols. It is used in many NLP tasks and corpora, including the **Wall Street Journal (WSJ) Corpus** and other large datasets.

For more information on the specific tags used in the Penn Treebank, you can refer to [Penn Treebank POS Tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
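
The fine-grained Penn Treebank tags can be collapsed into the coarse categories introduced at the start of this note. The mapping below is a partial, hand-written sketch rather than an official conversion table: it covers only a subset of tags, folds proper nouns into `NOUN`, maps `IN` to `ADP` even though `IN` also covers subordinating conjunctions, and labels everything else `X`. The `tagged` list is copied from the recorded output of the tagging cell.

```python
# Partial, hand-written mapping from Penn Treebank tags to coarse categories.
# Simplifications: proper nouns are folded into NOUN, and IN is mapped to ADP
# although it also covers subordinating conjunctions.
PENN_TO_COARSE = {
    "DT": "DET",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "IN": "ADP",
    "PRP": "PRON", "PRP$": "PRON",
    "CC": "CONJ",
}


def to_coarse(tagged_tokens, unknown="X"):
    """Replace each Penn Treebank tag with its coarse category."""
    return [(word, PENN_TO_COARSE.get(tag, unknown)) for word, tag in tagged_tokens]


# Output copied from the tagging cell in this note.
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]

print(to_coarse(tagged))
# [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'NOUN'), ('fox', 'NOUN'),
#  ('jumps', 'VERB'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'),
#  ('dog', 'NOUN'), ('.', 'X')]
```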

#### What is the purpose of using `nltk.download('punkt')` in the code?


- The `nltk.download('punkt')` command downloads the **Punkt** tokenizer models, which NLTK uses to split text into sentences.
- `word_tokenize()` depends on them because it first splits the input into sentences and then splits each sentence into word and punctuation tokens.
- Without the 'punkt' resource, the call to `word_tokenize()` fails with a `LookupError`, because NLTK cannot find the model it needs.




#### Why do we use `nltk.download('averaged_perceptron_tagger')` in the code?

- The `nltk.download('averaged_perceptron_tagger')` command downloads the **Averaged Perceptron Tagger**. This is a machine learning model used to assign part-of-speech (POS) tags to words.
- The model tags each word in the input sentence with its most probable POS category (e.g., noun, verb, adjective).
- This tagger uses both the word itself and its context (neighboring words and the tags already assigned to preceding words) to decide on a POS tag.




#### What does the `pos_tag()` function do in this code?

- The `pos_tag()` function takes a list of tokens (words) and assigns a part-of-speech (POS) tag to each word.
- For example, the word "fox" is tagged as **NN** (Noun), and the word "jumps" is tagged as **VBZ** (Verb, 3rd person singular present).
- The `pos_tag()` function is key to identifying the syntactic role each word plays in a sentence.




#### What is tokenization, and why do we need it in NLP?

- **Tokenization** is the process of breaking a text into smaller components, usually words or sentences.
- In NLP, tokenization is a critical preprocessing step because it enables us to handle text at the word or sentence level.
- In this code, the **`word_tokenize()`** function splits the sentence *"The quick brown fox jumps over the lazy dog."* into the tokens `['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']`.
- This is required for further processing, such as POS tagging or any other text-based operations.
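
As a rough illustration of what word-level tokenization does, the sketch below uses a single regular expression. It is a simplified stand-in and not the algorithm behind `word_tokenize()`; the function name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

print(simple_tokenize("don't"))
# ['don', "'", 't']
```

The second call shows why a dedicated tokenizer is preferable: the regular expression breaks the contraction into three pieces, whereas a Treebank-style tokenizer treats contractions with dedicated rules.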




#### What POS tags are generated for the example sentence, and what do they mean?


The sentence "The quick brown fox jumps over the lazy dog." is tokenized, and in the output above each token is tagged as follows:

- **The (DT)**: Determiner (DT) – Articles like "the" and "a."
- **quick (JJ)**: Adjective (JJ) – Describes a noun.
- **brown (NN)**: Noun (NN) – A tagging error in this run: *brown* modifies *fox* here, so JJ would be the expected tag.
- **fox (NN)**: Noun (NN) – A singular or mass noun.
- **jumps (VBZ)**: Verb (VBZ) – Verb, 3rd person singular present.
- **over (IN)**: Preposition or subordinating conjunction (IN).
- **the (DT)**: Determiner (DT).
- **lazy (JJ)**: Adjective (JJ).
- **dog (NN)**: Noun (NN).
- **. (.)**: Sentence-final punctuation, which the tokenizer emits as its own token.

Each tag follows the **Penn Treebank Tagset** standard, where **DT** represents determiners, **JJ** represents adjectives, **NN** is a noun, and **VBZ** represents a verb in third-person singular present form.




#### What is the significance of using the Penn Treebank Tagset in this context?

- The **Penn Treebank Tagset** is one of the most widely used tagsets for POS tagging in English.
- It provides 36 tags that cover most syntactic categories, including nouns, verbs, adjectives, adverbs, determiners, and more.
- This tagset allows for detailed syntactic analysis of text, making it a popular choice in many NLP tasks, such as parsing and machine learning-based text analysis.




#### How can we use POS tagging for downstream NLP tasks?

POS tagging is a critical step in many **downstream NLP tasks**, including:
- **Named Entity Recognition (NER)**: Identifying proper nouns (people, places, organizations) relies on tagging nouns and other POS categories.
- **Syntactic Parsing**: Understanding sentence structure (subjects, predicates, objects) requires knowledge of each word's part of speech.
- **Machine Translation**: Translating a sentence accurately requires understanding how each word functions in a sentence, which is made possible by POS tagging.
- **Sentiment Analysis**: POS tags can help identify adjectives and adverbs, which are often key to understanding sentiment.
  
By assigning POS tags, NLP systems can analyze the grammatical structure and meaning of text, improving the performance of these downstream tasks.
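
A common first step in such tasks is to filter tokens by tag. The sketch below reuses the tagged output recorded earlier in this note and selects words by Penn Treebank tag prefix; the helper name `words_with_tag_prefix` is hypothetical.

```python
# Output copied from the tagging cell in this note.
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with `prefix` (e.g. 'JJ', 'NN', 'VB')."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


print(words_with_tag_prefix(tagged, "JJ"))  # adjectives: ['quick', 'lazy']
print(words_with_tag_prefix(tagged, "NN"))  # nouns: ['brown', 'fox', 'dog']
print(words_with_tag_prefix(tagged, "VB"))  # verbs: ['jumps']
```

Because *brown* was tagged NN in that run, it shows up among the nouns rather than the adjectives, which illustrates how tagging errors propagate into downstream steps.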




#### What are some challenges associated with POS tagging?

Some of the main challenges in POS tagging include:
- **Ambiguity**: Many words can have different parts of speech depending on context. For example, the word "run" can be both a noun ("a run") and a verb ("to run"). The tagger must use context to disambiguate.
- **Out-of-Vocabulary Words**: Unknown or rare words that do not appear in the training data can be difficult to tag correctly.
- **Complex Sentence Structures**: POS tagging can become challenging with more complex sentences, especially when dealing with multiple clauses or nested structures.
- **Errors in Tokenization**: If the sentence is not tokenized properly (e.g., punctuation left attached to a word, or a multi-word expression split incorrectly), POS tagging accuracy can suffer.




#### How does the `pos_tag()` function handle words with multiple meanings?

- The `pos_tag()` function in NLTK uses context to handle words with multiple meanings.
- It utilizes the **Averaged Perceptron Tagger**, a statistical model that considers not only the word itself but also the words around it (context) to make decisions.
- For example:
  - **Word**: "run"
    - In "I go for a run," it is tagged as a noun (NN).
    - In "I run every day," it is a present-tense verb, which the Penn Treebank tagset labels VBP (verb, non-3rd-person singular present).
  
- The tagger uses surrounding words to determine the most likely tag for ambiguous words.
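
The role of context can be illustrated with a hand-written rule that looks only at the preceding word. This is a deliberately simple stand-in: the averaged perceptron learns weights over many contextual features instead of applying fixed rules, and the word lists and the function name `tag_run` below are hypothetical.

```python
DETERMINERS = {"a", "an", "the"}
SUBJECT_PRONOUNS = {"i", "you", "we", "they"}


def tag_run(words):
    """Guess NN or VBP for each occurrence of 'run' from the preceding word."""
    guesses = []
    for i, word in enumerate(words):
        if word.lower() != "run":
            continue
        prev = words[i - 1].lower() if i > 0 else ""
        if prev in DETERMINERS:
            guesses.append((i, "NN"))       # "a run": noun reading
        elif prev in SUBJECT_PRONOUNS:
            guesses.append((i, "VBP"))      # "I run": present-tense verb reading
        else:
            guesses.append((i, "UNKNOWN"))  # this toy rule cannot decide
    return guesses


print(tag_run("I go for a run".split()))   # [(4, 'NN')]
print(tag_run("I run every day".split()))  # [(1, 'VBP')]
```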
