<img src='https://www.di.uniroma1.it/sites/all/themes/sapienza_bootstrap/logo.png' width="200"/>  

# Part_1_8_Part_of_Speech_Tagging  

In Natural Language Processing (`NLP`), tagging is a crucial process for annotating text with meaningful labels that aid in linguistic and semantic analysis. Among these, **Part-of-Speech (`POS`) tagging** plays a foundational role in identifying the grammatical roles of words in a sentence, such as noun, verb, adjective, or adverb. This understanding is critical for tasks like syntactic parsing, named entity recognition, machine translation, and text-to-speech systems.  

`POS` tagging methods have evolved from rule-based systems to sophisticated algorithms like **Hidden Markov Models (`HMMs`)** and **Conditional Random Fields (CRFs)**, which leverage statistical properties for better contextual analysis. More recently, **neural network-based models** have introduced significant advancements, enabling state-of-the-art performance by leveraging word embeddings and deep learning architectures.  

### **Objectives:**  
In this notebook, Parham provides an overview of Part-of-Speech tagging, its significance in `NLP`, and the algorithms behind it, including Hidden Markov Models (`HMMs`) and neural networks. Through practical exercises, Parham will train a neural network for `POS` tagging and use `NLTK` to implement the Stanford `POS` Tagger.  

### **References:**  
- [https://www.nltk.org/book/ch05.html](https://www.nltk.org/book/ch05.html)  
- [https://web.stanford.edu/~jurafsky/slp3/old_oct19/8.pdf](https://web.stanford.edu/~jurafsky/slp3/old_oct19/8.pdf)  
- [https://www.linguisticsweb.org/doku.php?id=linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_pos_tagger_python](https://www.linguisticsweb.org/doku.php?id=linguisticsweb:tutorials:linguistics_tutorials:automaticannotation:stanford_pos_tagger_python)  
- [https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
- [https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/core-mathematics/pure-maths/matrices/eigenvalues-and-eigenvectors.html](https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/core-mathematics/pure-maths/matrices/eigenvalues-and-eigenvectors.html)

### **Contributors:**  
- Parham Membari  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Gmail_icon_%282020%29.svg" alt="Logo" width="20" height="20"> **Email**: p.membari96@gmail.com  
    - <img src="https://www.iconsdb.com/icons/preview/red/linkedin-6-xxl.png" alt="Logo" width="20" height="20"> **LinkedIn**: [LinkedIn](https://www.linkedin.com/in/p-mem/)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/a/ae/Github-desktop-logo-symbol.svg" alt="Logo" width="20" height="20"> **GitHub**: [GitHub](https://github.com/parham075)  
    - <img src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Medium_logo_Monogram.svg" alt="Logo" width="20" height="20"> **Medium**: [Medium](https://medium.com/@p.membari96)  

**Table of Contents:**  
1. Import Libraries
2. Introduction to Tagging in NLP  
3. Classical Algorithms Behind `POS` Tagging (Rule-Based, HMM)  
4. Fine-Tuning of a Neural Network for `POS` Tagging  
5. Using NLTK to Handle Stanford POS Tagger  
6. Closing Thoughts  

## 1. Import Libraries

In [1]:
import os
import nltk
import numpy as np
import spacy
import torch

## 2. Introduction to Tagging in NLP  



In Natural Language Processing (NLP), **tagging** involves assigning meaningful labels to elements of text, such as words, phrases, or sentences. These labels capture linguistic or semantic information that is essential for various NLP applications. For example:  
- **Part-of-Speech (POS) Tagging:** Assigns grammatical roles (e.g., noun, verb, adjective).  

- **Named Entity Recognition (NER):** Identifies proper nouns like names, locations, or organizations.  

- **Semantic Role Labeling (SRL):** Describes the roles words play in the semantic structure of a sentence.  

Each tagging approach serves a unique purpose, contributing to tasks like text parsing, translation, summarization, and information extraction. Techniques for tagging range from traditional rule-based systems to modern neural network-based methods:  
- **Rule-Based Tagging:** Relies on linguistic rules and patterns. It works well for predictable structures but struggles with ambiguity and language variability.  
- **Statistical Tagging:** Algorithms like Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) use probabilistic methods to predict tags based on contextual patterns in labeled data.  
- **Neural Network-Based Tagging:** Leverages word embeddings and deep learning architectures like BiLSTMs and Transformers to achieve state-of-the-art performance by capturing complex patterns in language.  

### 2.1. Part-of-Speech Tagging: A Closer Look  

Among these tagging tasks, **Part-of-Speech (POS) tagging** is a foundational one in NLP. It identifies the grammatical role of each word in a sentence, helping to structure raw text for downstream tasks. Consider the sentence:  

_"Computer Science department of Sapienza University of Rome is intellectually lively and reputed for its research outcome."_  

POS tagging identifies:  
- Computer      → Proper Noun (NNP)  
- Science       → Proper Noun (NNP)  
- department    → Noun (NN)  
- of            → Preposition (IN)  
- Sapienza      → Proper Noun (NNP)  
- University    → Proper Noun (NNP)  
- of            → Preposition (IN)  
- Rome          → Proper Noun (NNP)  
- is            → Verb (VBZ)  
- intellectually → Adverb (RB)  
- lively        → Adjective (JJ)  
- and           → Coordinating Conjunction (CC)  
- reputed       → Verb, Past Participle (VBN)  
- for           → Preposition (IN)  
- its           → Possessive Pronoun (PRP$)  
- research      → Noun (NN)  
- outcome       → Noun (NN)  

> Note: for the full list of tags, see the Penn Treebank [documentation](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

By providing information about grammatical structure, this tagging helps machines understand not just individual words, but also the connections between them within a sentence.

### 2.2. Two classes of words: **Open** vs. **Closed**:
- Closed class words
    - Relatively fixed membership
    - Usually function words: short, frequent words with grammatical function
    - determiners: a, an, the
    - pronouns: she, he, I
    - prepositions: on, under, over, near, by, …
- Open class words
    - Usually content words: Nouns, Verbs, Adjectives, Adverbs
    - Plus interjections: oh, ouch, uh-huh, yes, hello
    - New nouns and verbs like iPhone or to fax



### 2.3. Why Part-of-Speech Tagging?  

Here’s why POS tagging is so valuable:  

- **Supports Other NLP Tasks**: POS tagging provides crucial insights for tasks like syntactic parsing, sentiment analysis, and text-to-speech systems.  
- **Parsing**: Knowing POS tags can improve syntactic parsing accuracy, which is vital for machine translation and language understanding.  
- **Machine Translation (MT)**: POS tags help reorder structures, such as adjectives and nouns, when translating between languages like Spanish and English.  
- **Sentiment Analysis**: Distinguishing adjectives or verbs can reveal sentiment or emotional tone in text.  
- **Text-to-Speech**: Pronunciation ambiguity, as seen with words like *lead* or *object*, can be resolved using POS tags.  
- **Linguistic Analysis**: POS tagging aids in studying linguistic change, such as meaning shifts and the creation of new words.  

In short, POS tagging acts as a bridge, enabling both practical NLP tasks and linguistic research to benefit from accurate syntactic understanding.  


### 2.4. How Difficult is POS Tagging in English?  

Although English `POS` tagging has achieved high accuracy, it is not without challenges. Ambiguity is a major issue:  

- About **15% of word types** in English are ambiguous (e.g., *back* can be a noun, verb, adjective, or adverb).  
- However, **85% of word types are unambiguous** (e.g., *Sapienza* is always a proper noun, and *intellectually* is always an adverb).  
- The ambiguous 15% are highly frequent in text, meaning **~60% of word tokens** in actual usage are ambiguous.  

Here are examples of how the word *back* varies based on context:  

- **Adjective (ADJ)**: _Earnings growth took a **back** seat._  
- **Noun (NOUN)**: _A small building in the **back**._  
- **Verb (VERB)**: _A clear majority of senators **back** the bill._  
- **Particle (PART)**: _Enable the country to buy **back** debt._  
- **Adverb (ADV)**: _I was twenty-one **back** then._  
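
The gap between type-level and token-level ambiguity described above can be illustrated with a small sketch. The tagged corpus below is a hypothetical, hand-made toy example (it is not real corpus data, so its percentages will not match the 15% and 60% figures); the point is only that a single ambiguous but frequent word such as *back* accounts for a much larger share of tokens than of types.

```python
from collections import defaultdict

# Hypothetical hand-tagged toy corpus of (word, tag) pairs; not real corpus data.
tagged_tokens = [
    ("the", "DET"), ("senators", "NOUN"), ("back", "VERB"), ("the", "DET"), ("bill", "NOUN"),
    ("a", "DET"), ("building", "NOUN"), ("in", "ADP"), ("the", "DET"), ("back", "NOUN"),
    ("growth", "NOUN"), ("took", "VERB"), ("a", "DET"), ("back", "ADJ"), ("seat", "NOUN"),
    ("they", "PRON"), ("buy", "VERB"), ("back", "PART"), ("debt", "NOUN"),
]

# Collect the set of tags observed for each word type.
tags_per_word = defaultdict(set)
for word, tag in tagged_tokens:
    tags_per_word[word].add(tag)

# A word type is ambiguous if it was seen with more than one tag.
ambiguous_types = {w for w, tags in tags_per_word.items() if len(tags) > 1}

# Share of distinct word types that are ambiguous.
type_ambiguity = len(ambiguous_types) / len(tags_per_word)
# Share of running tokens that belong to an ambiguous type.
token_ambiguity = sum(w in ambiguous_types for w, _ in tagged_tokens) / len(tagged_tokens)

print(f"Ambiguous types:  {sorted(ambiguous_types)}")
print(f"Type ambiguity:   {type_ambiguity:.1%}")
print(f"Token ambiguity:  {token_ambiguity:.1%}")
```

In this toy corpus only one word type is ambiguous, yet it covers a noticeably larger fraction of the tokens, which mirrors the effect seen in real English text.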


### 2.5. POS Tagging Performance  

How accurate is POS tagging? Modern methods have achieved impressive results:  

- **Tagging Accuracy**: About **97%**, which hasn't changed much in the last decade. Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and neural network-based approaches like BERT perform similarly.  
- **Baseline Accuracy**: Even a "stupid" baseline, which tags every known word with its most frequent tag and every unknown word as a noun, achieves **92%** accuracy.  

The high accuracy is partly because many words are unambiguous. However, improving the remaining 3% can be difficult due to rare and ambiguous cases.  
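
The most-frequent-tag baseline mentioned above can be sketched in a few lines. The training sentences, the helper names `train_baseline` and `tag_baseline`, and the choice of `NN` as the fallback tag are illustrative assumptions, not part of any library; a real evaluation would use a proper tagged corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: sentences of (word, tag) pairs.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("back", "NN"), ("hurts", "VBZ")],
    [("they", "PRP"), ("back", "VBP"), ("the", "DT"), ("bill", "NN")],
    [("in", "IN"), ("the", "DT"), ("back", "NN")],
]

def train_baseline(sentences):
    """Map each word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag_baseline(words, most_frequent_tag, unknown_tag="NN"):
    """Tag known words with their most frequent tag and unknown words as nouns."""
    return [(w, most_frequent_tag.get(w.lower(), unknown_tag)) for w in words]

model = train_baseline(train)
tagged = tag_baseline(["they", "back", "the", "senator"], model)
print(tagged)
```

Note that this baseline ignores context entirely: *back* receives its most frequent training tag even where it is used as a verb, which is exactly the kind of error the remaining few percent consist of.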

## 3. Classical Algorithms Behind `POS` Tagging (Rule-Based, HMM)  

Several algorithms are used to perform Part-of-Speech (POS) tagging, each with its strengths and ideal use cases. Let's explore two classical approaches: **Rule-Based Tagging** and **Statistical Methods (e.g., Hidden Markov Models)**.

#### **3.1. Rule-Based POS Tagging**  

This approach relies on a set of predefined linguistic rules and dictionaries:  

- **How it Works**:  
  - Uses lexicons (dictionaries) where words are tagged with their possible parts of speech.  
  - Applies hand-crafted rules to disambiguate between possible tags based on word context.  
  - For example:  
    - If a word is preceded by a determiner like "the," it is likely a noun.  
    - If a word ends with "ly," it is likely an adverb.  

- **Advantages**:  
  - Effective for languages with well-defined grammar rules.  
  - Requires no training data.  

- **Limitations**:  
  - Hand-crafting rules is labor-intensive and language-specific.  
  - Struggles with ambiguous words or phrases outside the rule set.  
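
The lexicon-plus-rules idea described above can be sketched as follows. This is a minimal illustration, not a real tagger: the `LEXICON` contents, the function name `rule_based_tag`, and the order in which the rules fire are all assumptions made for the example.

```python
# Hypothetical mini-lexicon: each word maps to its possible Penn Treebank tags.
LEXICON = {
    "the": ["DT"],
    "a": ["DT"],
    "dog": ["NN"],
    "back": ["NN", "VB", "JJ", "RB"],
    "runs": ["VBZ", "NNS"],
    "quickly": ["RB"],
}

def rule_based_tag(words):
    """Tag words using the lexicon, then hand-crafted rules to disambiguate."""
    tags = []
    for word in words:
        candidates = LEXICON.get(word.lower(), [])
        prev_tag = tags[-1] if tags else None
        if len(candidates) == 1:
            # Unambiguous in the lexicon: nothing to decide.
            tag = candidates[0]
        elif prev_tag == "DT":
            # Rule: a word preceded by a determiner is likely a noun.
            noun_tags = [t for t in candidates if t.startswith("NN")]
            tag = noun_tags[0] if noun_tags else "NN"
        elif word.lower().endswith("ly"):
            # Rule: a word ending in "ly" is likely an adverb.
            tag = "RB"
        elif candidates:
            # No rule applies: fall back to the first lexicon entry.
            tag = candidates[0]
        else:
            # Unknown word with no matching rule: guess noun.
            tag = "NN"
        tags.append(tag)
    return list(zip(words, tags))

result = rule_based_tag(["The", "back", "runs", "slowly"])
print(result)
```

Even in this small sketch the limitation is visible: every new ambiguity needs another hand-written rule, and the outcome depends on the order in which the rules are checked.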

#### **3.2. Statistical Methods: Hidden Markov Models (HMM)**  

HMMs are a probabilistic approach that leverages the likelihood of word sequences:  

- **How it Works**:  
  - Treats POS tagging as a sequence labeling problem.  
  - Uses the probability of a word given its tag (**emission probabilities**) and the probability of transitioning from one tag to the next (**transition probabilities**).  
  - Finds the most likely sequence of tags using the **Viterbi algorithm**.  

- **Advantages**:  
  - Handles ambiguous words well by considering context.  
  - Requires annotated training data but is more adaptable than rule-based methods.  

- **Limitations**:  
  - Assumes each word depends only on its own tag and each tag depends only on the previous tag, which limits the context the model can use.  
  - Simpler models compared to modern neural networks. 
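
To make the decoding step more concrete, here is a minimal sketch of the Viterbi algorithm on a tiny HMM. All probabilities, the three-tag tagset, and the four-word vocabulary are hypothetical values chosen by hand for illustration; in practice they would be estimated from an annotated corpus.

```python
import numpy as np

# Hypothetical toy HMM; every number below is made up for illustration.
tags = ["DT", "NN", "VB"]
start_p = np.array([0.6, 0.3, 0.1])      # P(first tag)
trans_p = np.array([                     # P(current tag | previous tag)
    [0.05, 0.90, 0.05],  # from DT
    [0.10, 0.30, 0.60],  # from NN
    [0.50, 0.30, 0.20],  # from VB
])
vocab = {"the": 0, "senators": 1, "back": 2, "bill": 3}
emit_p = np.array([                      # P(word | tag)
    [1.00, 0.00, 0.00, 0.00],  # DT
    [0.00, 0.40, 0.20, 0.40],  # NN
    [0.00, 0.00, 0.70, 0.30],  # VB
])

def viterbi(words):
    """Return the most likely tag sequence for a list of in-vocabulary words."""
    obs = [vocab[w] for w in words]
    n, k = len(obs), len(tags)
    best = np.zeros((n, k))               # best[t, j]: score of best path ending in tag j at step t
    backptr = np.zeros((n, k), dtype=int) # backptr[t, j]: previous tag on that best path
    best[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, n):
        # scores[i, j]: best path ending in tag i at t-1, then moving to tag j
        scores = best[t - 1][:, None] * trans_p
        backptr[t] = scores.argmax(axis=0)
        best[t] = scores.max(axis=0) * emit_p[:, obs[t]]
    # Follow the back-pointers from the best final tag.
    path = [int(best[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

decoded = viterbi(["the", "senators", "back", "the", "bill"])
print(decoded)
```

With these toy numbers, *back* is decoded as a verb after *senators* but as a noun directly after *the*, which shows how transition probabilities bring in context. For long sentences, an implementation would typically sum log-probabilities instead of multiplying raw probabilities to avoid numerical underflow.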
  

#### **3.3. Illustrating HMM with an Example: Markov Chain in a Restaurant**  

Suppose a restaurant serves exactly one of three dishes each day (Pizza 🍕, Hamburger 🍔, or Hotdog 🌭), and we want to calculate the probability of each dish being served tomorrow. The following diagram represents the Markov chain for this restaurant, showing the transition probabilities between dishes:  

<p align="center"><img src="../imgs/restaurant_Markov.png" alt="Markov Chain" width="40%" height="40%" style="display: block; margin: 20px auto;"/></p>  

The diagram can be summarized as the following transition matrix `A`:  

  ```python
  A = [   
      # 🍔   🍕   🌭
      [0.2, 0.6, 0.2],  # 🍔
      [0.3, 0.0, 0.7],  # 🍕
      [0.5, 0.0, 0.5]   # 🌭
  ]
  ```

- **Key Property of Markov Chains**:  
  The future state depends only on the current state. Mathematically:  
  $$
  P(X_{n+1} = x \mid X_1 = x_1, \dots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)
  $$  

  Given the restaurant has served the following sequence of dishes:  
  🍕 → 🍔 → 🍕 → ?  

  The probabilities of serving each dish tomorrow are:  
  $$ P(X_4 = 🍕 | X_3 = 🍕) = 0.0 $$  
  $$ P(X_4 = 🍔 | X_3 = 🍕) = 0.3 $$  
  $$ P(X_4 = 🌭 | X_3 = 🍕) = 0.7 $$  

  Thus, the highest probability is for a **Hotdog 🌭** to be served tomorrow.  
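
The same lookup can be written as a vector-matrix product, which also extends to more than one day ahead. The sketch below assumes the row/column order 🍔, 🍕, 🌭 used in the matrix `A` above; the variable names are illustrative.

```python
import numpy as np

# Transition matrix from the text; rows and columns are ordered Hamburger, Pizza, Hotdog.
A = np.array([
    [0.2, 0.6, 0.2],  # from Hamburger
    [0.3, 0.0, 0.7],  # from Pizza
    [0.5, 0.0, 0.5],  # from Hotdog
])
dishes = ["Hamburger", "Pizza", "Hotdog"]

# One-hot vector: today's dish is Pizza with certainty.
today = np.array([0.0, 1.0, 0.0])

# Multiplying by A advances the distribution by one day.
tomorrow = today @ A
day_after = tomorrow @ A

print("Tomorrow: ", dict(zip(dishes, tomorrow.round(2))))
print("Day after:", dict(zip(dishes, day_after.round(2))))
print("Most likely tomorrow:", dishes[int(np.argmax(tomorrow))])
```

Because of the Markov property, only today's dish matters: the earlier part of the sequence does not enter the calculation at all.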


#### **3.4. Long-Term Behavior of the Markov Chain**  

Now consider a random walk of dishes over 10 steps:  
🍕 → 🌭 → 🌭 → 🍔 → 🍕 → 🍔 → 🍔 → 🍕 → 🌭 → 🍔 → ?  

After 10 steps, the empirical probability of each dish can be estimated from its relative frequency:  
$$ P(🍕) = \frac{\text{Occurrences of 🍕}}{\text{Total steps}} = \frac{3}{10} = 0.3 $$  
$$ P(🍔) = \frac{\text{Occurrences of 🍔}}{\text{Total steps}} = \frac{4}{10} = 0.4 $$  
$$ P(🌭) = \frac{\text{Occurrences of 🌭}}{\text{Total steps}} = \frac{3}{10} = 0.3 $$  

Do these probabilities converge to specific values, or will they continue to fluctuate? Let's find out using a Python script.  


In [41]:
import numpy as np

# Transition matrix
A = np.array([
    [0.2, 0.6, 0.2],  # 🍔
    [0.3, 0.0, 0.7],  # 🍕
    [0.5, 0.0, 0.5]   # 🌭
])

dishes = ["🍔", "🍕", "🌭"]
num_steps = 1000
state = 1  # Start with 🍕
states = []

for _ in range(num_steps):
    states.append(state)
    state = np.random.choice([0, 1, 2], p=A[state])

# Calculate empirical probabilities
counts = np.bincount(states, minlength=len(dishes))
probabilities = counts / num_steps

print("Long-term probabilities:")
for dish, prob in zip(dishes, probabilities):
    print(f"{dish}: {prob:.3f}")


Long-term probabilities:
🍔: 0.342
🍕: 0.212
🌭: 0.446


The frequencies appear to converge to fixed values. This limit is called the **stationary distribution** (or equilibrium state) of the Markov chain.  
> To learn why this happens, see this [documentation](https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/core-mathematics/pure-maths/matrices/eigenvalues-and-eigenvectors.html)
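
As a cross-check on the simulation, the stationary distribution can also be computed directly. A stationary distribution π satisfies πA = π, so it is a left eigenvector of `A` with eigenvalue 1. The sketch below is one common way to obtain it with `numpy.linalg.eig`; the variable names are illustrative.

```python
import numpy as np

# Transition matrix from the text; rows and columns are ordered Hamburger, Pizza, Hotdog.
A = np.array([
    [0.2, 0.6, 0.2],  # from Hamburger
    [0.3, 0.0, 0.7],  # from Pizza
    [0.5, 0.0, 0.5],  # from Hotdog
])
dishes = ["Hamburger", "Pizza", "Hotdog"]

# Left eigenvectors of A are the (right) eigenvectors of A transposed.
eigenvalues, eigenvectors = np.linalg.eig(A.T)

# Pick the eigenvector whose eigenvalue is closest to 1.
idx = np.argmin(np.abs(eigenvalues - 1.0))
stationary = np.real(eigenvectors[:, idx])

# Rescale so the entries sum to 1 and form a probability distribution.
stationary = stationary / stationary.sum()

for dish, prob in zip(dishes, stationary):
    print(f"{dish}: {prob:.3f}")
```

The result (roughly 0.352, 0.211, 0.437) is close to the empirical frequencies from the 1000-step simulation, and the simulated values should move closer to it as the number of steps grows.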

In [42]:
import numpy as np

# Transition matrix
A = np.array([
    [0.2, 0.6, 0.2],  # 🍔
    [0.3, 0.0, 0.7],  # 🍕
    [0.5, 0.0, 0.5]   # 🌭
])

dishes = ["🍔", "🍕", "🌭"]
current_state = 1  # Start with 🍕
num_steps = 1000
predictions = []

for step in range(num_steps):
    # Get probabilities for the next day
    next_day_probs = A[current_state]
    
    # Predict the most likely state for tomorrow
    predicted_state = np.argmax(next_day_probs)
    predictions.append(dishes[predicted_state])
    
    # Update current state for the next iteration
    current_state = predicted_state

# Calculate frequencies of each dish
from collections import Counter
freq = Counter(predictions)
for dish, count in freq.items():
    print(f"{dish}: {count / num_steps:.3f}")


🌭: 0.334
🍔: 0.333
🍕: 0.333

Because `np.argmax` always picks the single most likely next dish (and the first index on ties), this greedy prediction is deterministic: starting from 🍕 it cycles 🍕 → 🌭 → 🍔 → 🍕, so each dish appears about one third of the time. These frequencies differ from the sampled long-term probabilities above.


**Exercise 1: Predicting the Sequence of Dishes Using Markov Chains**

**Objective**  
In this assignment, you will implement a Markov Chain model to simulate the prediction of dishes served in a restaurant over 50 days. You will calculate the sequence of probable dishes and their long-term probabilities. 
The restaurant serves three dishes: 🍔 (Hamburger), 🍕 (Pizza), and 🌭 (Hotdog). The transitions between these dishes are governed by the following Markov chain transition matrix `A`:
$$
A = \begin{bmatrix}
0.2 & 0.6 & 0.2 \\
0.3 & 0.0 & 0.7 \\
0.5 & 0.0 & 0.5
\end{bmatrix}
$$


**Your Task**  

1. **Simulate the Sequence**  
   - Start with a given dish (e.g., 🍕 for Day 1).  
   - Use the transition matrix to predict the dish for the next day, iterating for 50 days.  
   - Print the sequence of dishes over the 50 days.

2. **Calculate Probabilities**  
   - At the end of the simulation, calculate the long-term probabilities of each dish:
     $$
     P(\text{Dish}) = \frac{\text{Occurrences of Dish}}{\text{Total Steps}}
     $$
   - Print the probabilities for 🍔, 🍕, and 🌭.

**Additional Hints and Guidelines**  

1. **Transition Matrix**  
   Use the matrix `A` to calculate the probabilities of transitioning from the current dish to each possible next dish.  

2. **Choose the Next Dish**  
   Use the probabilities from the transition matrix to determine the next dish. For this exercise, choose the dish with the maximum transition probability; note that this greedy choice makes the sequence deterministic.  

3. **Implementation Steps**  
   - Start with a given dish.  
   - Use the row corresponding to the current dish in the transition matrix `A` to calculate probabilities for the next day.  
   - Update the current dish and repeat the process for 50 days.  

In [43]:
# @title 🧑🏿‍💻 Your code here

In [44]:
# @title 👀 Solution

import numpy as np

# Transition matrix A
transition_matrix = np.array([
    [0.2, 0.6, 0.2],  # Probabilities for Hamburger
    [0.3, 0.0, 0.7],  # Probabilities for Pizza
    [0.5, 0.0, 0.5]   # Probabilities for Hotdog
])

dishes = ["Hamburger 🍔", "Pizza 🍕", "Hotdog 🌭"]

def simulate_markov_chain(initial_dish, days):
    current_dish = dishes.index(initial_dish)
    sequence = [initial_dish]
    for _ in range(days - 1):
        # Get the probabilities for the next dish based on the current dish
        probabilities = transition_matrix[current_dish]
        # Choose the next dish based on the maximum transition probability
        next_dish = np.argmax(probabilities)
        # Append the name of the next dish to the sequence
        sequence.append(dishes[next_dish])
        # Update the current dish to the index of the next dish
        current_dish = next_dish

    return sequence

def calculate_probabilities(sequence):
    total_steps = len(sequence)
    probabilities = {dish: sequence.count(dish) / total_steps for dish in dishes}
    return probabilities

# Simulate for 50 days
initial_dish = "Pizza 🍕"
days = 50
sequence = simulate_markov_chain(initial_dish, days)
probabilities = calculate_probabilities(sequence)

# Output results
print(f"We have: {sequence[-1]} for the day 50th\n")
print("Sequence of Dishes Over 50 Days:\n")
steps_to_show = 5
for day in range(0, days, steps_to_show):
    print(" -> ".join(sequence[day:day+steps_to_show]), "->")

print("\nProbabilities After 50 Days:")
for dish, prob in probabilities.items():
    print(f"{dish}: {prob:.2f}")


We have: Hotdog 🌭 for the day 50th

Sequence of Dishes Over 50 Days:

Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 ->
Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 ->
Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 ->
Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 ->
Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 ->
Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 ->
Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 ->
Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 ->
Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 ->
Pizza 🍕 -> Hotdog 🌭 -> Hamburger 🍔 -> Pizza 🍕 -> Hotdog 🌭 ->

Probabilities After 50 Days:
Hamburger 🍔: 0.32
Pizza 🍕: 0.34
Hotdog 🌭: 0.34


## 4. Fine-Tuning of a Neural Network for `POS` Tagging  

## 5. Using NLTK to Handle Stanford POS Tagger  

## 6. Closing Thoughts  