<a href="https://colab.research.google.com/github/gaoyuchen1515-art/rulinwaishi/blob/main/CBS5502_Tutorial_pos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">

# 🧠 **CBS5502: Computational Linguistics and NLP Technologies**

### 🐍 **2nd Python Tutorial**
### 📅 *February 4, 2026*

---

## 🇨🇳 **PoS Tagging and Disambiguation**
---
### 👨‍🏫 **Instructor**
**Dr. WAN Mingyu**

### 👨‍🏫 **Teaching Assistant**
**Mr. BAO Xiaoyi**

</div>

---

## 🌟 Welcome!

Welcome to the tutorial series of **CBS5502**!  
In this tutorial, we will explore how **Part-of-Speech (PoS) tagging** works and how ambiguity can be resolved using **three different approaches**, all demonstrated with the classic ambiguous sentence:

> **“We can can the can.”** 🌟

---

## 🎯 Learning Objectives

By the end of this tutorial, you will be able to:

- Understand what **PoS tagging** is and why it is important in NLP  
- Identify **lexical and structural ambiguity** in natural language  
- Apply **three approaches to PoS tagging**:
  - Rule-based tagging
  - Statistical / probabilistic tagging (HMM)
  - Transformation-based (Brill-style) tagging learned from data  
- Analyze and interpret tagging results for ambiguous sentences  

---

🚀 Let’s Get Started!

In [9]:
# Import required libraries
import nltk
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.tag import hmm
from collections import defaultdict
from nltk.tag import brill, brill_trainer
from nltk.tag import UnigramTagger, BigramTagger, DefaultTagger
from nltk.corpus import treebank

# Ensure the required NLTK data is available.
# Newer NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Input sentence
sentence = "We can can the can."

# Tokenizing the sentence into words
tokens = nltk.word_tokenize(sentence)
print("Tokenized Sentence:", tokens)


### 1️⃣ Rule-Based Approach

#### 🔍 Overview
The **rule-based approach** assigns Part-of-Speech (PoS) tags using **handcrafted linguistic rules**, typically based on word forms, surrounding context, or fixed patterns.  
This method does **not rely on training data**, making it easy to understand and implement.

#### 🧠 How It Works
- Each rule matches a word (or pattern) in the sentence
- The **first matching rule** determines the PoS tag
- Rules are applied **sequentially**, from top to bottom
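
The first-match behaviour described above can be sketched as an ordered list of (condition, tag) pairs. This is a minimal illustration under stated assumptions, not the tagger defined later in this tutorial: the names `RULES` and `tag_with_rules` are hypothetical, and the rule set is deliberately tiny.

```python
# Each rule is a (condition, tag) pair. A condition receives the token list
# and the current index. Rules are tried top to bottom; the first match wins.
RULES = [
    (lambda toks, i: toks[i] == "We", "PRP"),
    (lambda toks, i: toks[i] == "the", "DT"),
    # Specific contextual rule: "can" right after "the" is a noun.
    (lambda toks, i: toks[i] == "can" and i > 0 and toks[i - 1] == "the", "NN"),
    # General rule: any other "can" is treated as a modal verb.
    (lambda toks, i: toks[i] == "can", "MD"),
    # Fallback: everything else is tagged as a noun.
    (lambda toks, i: True, "NN"),
]


def tag_with_rules(tokens):
    """Tag each token with the first rule whose condition holds."""
    tagged = []
    for i, word in enumerate(tokens):
        for condition, tag in RULES:
            if condition(tokens, i):
                tagged.append((word, tag))
                break  # stop at the first matching rule
    return tagged


print(tag_with_rules(["We", "can", "open", "the", "can"]))
```

Order matters here: if the general `can` rule were listed before the contextual one, it would always match first and the noun reading after *the* would never be reached. The verb *open* falls through to the noun fallback, which already hints at how brittle a small rule set is.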

#### 🧩 Example Rules
For our ambiguous sentence, we define a few **simple and intuitive rules**:
- Tag **“We”** as a personal pronoun
- Tag **“the”** as a determiner
- Assign **“can”** a default modal-verb tag
- Use a fallback rule for unknown cases

These rules illustrate both the **strength** (clarity) and the **limitation** of the rule-based approach: it is aware of context only where someone has written a rule for it.

In [None]:
# --------------------------------------------------
# STEP 1: Define default (most likely) POS tags
# --------------------------------------------------
# This dictionary provides a fallback tag for each word.
# If no contextual rule applies, we use these tags.
most_likely_tags = {
    "We": "PRP",   # Personal pronoun
    "can": "MD",   # Modal verb (default assumption)
    "the": "DT"    # Determiner
}

In [None]:
# --------------------------------------------------
# STEP 2: Define the rule-based POS tagging function
# --------------------------------------------------
def rule_based_pos_tagger(tokens):
    """
    Assign POS tags to a list of tokens using
    handcrafted contextual rules.

    Parameters:
        tokens (list): A list of word tokens

    Returns:
        list: A list of (word, POS tag) tuples
    """

    tagged_sentence = []  # Store the final tagged output

    # Iterate through each word with its position
    for i, word in enumerate(tokens):

        # --------------------------------------------------
        # STEP 3: Apply context-sensitive rules
        # --------------------------------------------------

        # Rule 1:
        # If "can" appears immediately after "We",
        # it functions as a modal verb (e.g., "We can ...")
        if word == "can" and i > 0 and tokens[i - 1] == "We":
            tag = "MD"

        # Rule 2:
        # If "can" follows "the", it is treated as a noun
        # (e.g., "the can")
        elif word == "can" and i > 0 and tokens[i - 1] == "the":
            tag = "NN"

        # Rule 3:
        # If "can" follows another "can",
        # it is treated as a main verb
        # (e.g., "can can the...")
        elif word == "can" and i > 0 and tokens[i - 1] == "can":
            tag = "VB"

        # --------------------------------------------------
        # STEP 4: Apply default rule
        # --------------------------------------------------
        # If no specific contextual rule matches,
        # fall back to the most likely tag
        else:
            tag = most_likely_tags.get(word, "NN")
            # Unknown words default to NN (noun)

        # Add the (word, tag) pair to the result
        tagged_sentence.append((word, tag))

    return tagged_sentence

## 🔎 Step-by-Step Rule Application

Sentence: **We can can the can .**

| Position | Token | Left Context | Applied Rule | Assigned Tag |
|---------:|-------|--------------|--------------|--------------|
| 0 | We | - | Default dictionary rule | PRP |
| 1 | can | We | Rule 1: *can* after *We* | MD |
| 2 | can | can | Rule 3: *can* after *can* | VB |
| 3 | the | can | Default dictionary rule | DT |
| 4 | can | the | Rule 2: *can* after *the* | NN |
| 5 | . | can | Default fallback | NN |

Note that the fallback also tags the final period as **NN**, because no rule covers punctuation.

In [10]:
# --------------------------------------------------
# STEP 5: Apply the rule-based tagger
# --------------------------------------------------
rule_based_tags = rule_based_pos_tagger(tokens)

# Display the result
print("Rule-Based POS Tags:")
for word, tag in rule_based_tags:
    print(f"{word:>5}  →  {tag}")


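## ⚠️ Error Cases & Discussion

### Example 1
Sentence: **They can fish.**

Expected:
- can → MD
- fish → VB

Rule-Based Output:
- can → MD ✅
- fish → NN ❌

📌 *Why?*  
No rule says that a word following a modal is a verb, and *fish* is not in the dictionary, so it falls through to the default **NN** tag. Note that *can* is tagged MD only through the dictionary default: Rule 1 looks for **“We”** specifically and does not fire after *They*.

### Example 2
Sentence: **The can can rust.**

Correct interpretation:
- can → NN
- can → MD
- rust → VB

Rule-Based Output:
- can → MD ❌ (Rule 2 compares against lowercase **“the”**, so it misses sentence-initial *The*)
- can → VB ❌ (Rule 3 fires because the previous token is *can*)
- rust → NN ❌ (default fallback)

📌 *Discussion point:*  
Rules tuned to one sentence do not transfer: each of the three errors comes from a rule that was written with **“We can can the can.”** in mind.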

### 🧠 Teaching Notes

- This approach relies entirely on **manually written rules**
- Each rule encodes **explicit linguistic intuition**

#### ✅ Strengths
- Easy to understand and interpret
- Transparent decision-making process

#### ❌ Limitations
- Difficult to scale to large vocabularies
- Brittle when encountering unseen or unexpected patterns

### 🧠 Learning Takeaways

- Rule-based tagging makes **linguistic assumptions explicit**
- Context helps, but only when **manually encoded**
- Error cases reveal why **learning from data is necessary**
- HMM and Brill taggers automate what rules attempt to approximate

## 2️⃣ Hidden Markov Model (HMM) Approach

### 🔍 Overview
The **Hidden Markov Model (HMM)** approach is a **statistical sequence-labeling method** that assigns PoS tags by modeling language as a **probabilistic process**.  
It predicts the **most likely sequence of tags** for a sentence using probabilities learned from a **tagged corpus**.

### 🧠 Core Assumptions
HMM PoS tagging relies on two key assumptions:

- **Markov Assumption:**  
  The current tag depends only on a limited number of previous tags (one in a bigram HMM, two in a trigram HMM).
- **Output Independence Assumption:**  
  Each word depends only on its own tag, not on neighboring words or tags.

### 🔁 Decoding Strategy
To determine the optimal tag sequence, HMMs use the **Viterbi algorithm**, which efficiently finds:

> ✅ The most probable tag sequence for the entire sentence,  
> rather than tagging each word independently.



### 🧠 Step 1: What Does an HMM Model?

An HMM models language with two probability components:

1. **Transition Probability**
   - $$P(t_i \mid t_{i-1})$$  
   - How likely one tag follows another

2. **Emission Probability**
   - $$P(w_i \mid t_i)$$  
   - How likely a word is generated by a tag

The goal is to find the **most probable tag sequence** for the entire sentence.
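
To make the two components concrete, here is a small sketch that estimates them by relative-frequency counting. The three-sentence corpus is hand-made for illustration (it is not Brown data), the helper names `transition_prob` and `emission_prob` are hypothetical, and no smoothing is applied, so unseen events get probability 0.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical data, Penn-style tags).
corpus = [
    [("We", "PRP"), ("can", "MD"), ("swim", "VB")],
    [("We", "PRP"), ("can", "MD"), ("can", "VB"), ("the", "DT"), ("fish", "NN")],
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
]

transition_counts = defaultdict(Counter)  # previous tag -> counts of next tag
emission_counts = defaultdict(Counter)    # tag -> counts of emitted word

for sentence in corpus:
    prev_tag = "<s>"  # sentence-start marker
    for word, tag in sentence:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][word] += 1
        prev_tag = tag


def transition_prob(prev_tag, tag):
    """Estimate P(tag | prev_tag) as a relative frequency."""
    total = sum(transition_counts[prev_tag].values())
    return transition_counts[prev_tag][tag] / total if total else 0.0


def emission_prob(tag, word):
    """Estimate P(word | tag) as a relative frequency."""
    total = sum(emission_counts[tag].values())
    return emission_counts[tag][word] / total if total else 0.0


print("P(MD | PRP)  =", transition_prob("PRP", "MD"))
print("P(can | VB)  =", emission_prob("VB", "can"))
print("P(can | NN)  =", emission_prob("NN", "can"))
```

In this toy corpus a pronoun is always followed by a modal, while *can* accounts for half of the verb emissions and half of the noun emissions. A tagger trained on a real corpus does the same counting at scale, usually with smoothing added.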

### 🔁 Step 2: Why We Need Sequence-Level Decisions

Ambiguous words like **“can”** cannot be tagged reliably in isolation.

✅ HMMs solve this by:
- Considering **previous tags**
- Evaluating the **entire sentence**
- Using **global optimization** via the Viterbi algorithm
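
The sequence-level decision can be sketched with a compact Viterbi decoder. All probabilities below are invented for illustration, the tables `START`, `TRANS`, and `EMIT` and the function name `viterbi` are hypothetical, and the sketch assumes a non-empty sentence. Practical implementations typically work with log probabilities to avoid underflow.

```python
# Toy model for "We can can the can" (all numbers are made up).
TAGS = ["PRP", "MD", "VB", "DT", "NN"]

START = {"PRP": 0.6, "DT": 0.3, "NN": 0.1}   # P(tag | sentence start)
TRANS = {                                     # P(next tag | current tag)
    "PRP": {"MD": 0.7, "VB": 0.3},
    "MD":  {"VB": 0.9, "MD": 0.1},
    "VB":  {"DT": 0.6, "NN": 0.3, "PRP": 0.1},
    "DT":  {"NN": 1.0},
    "NN":  {"MD": 0.5, "VB": 0.5},
}
EMIT = {                                      # P(word | tag)
    "PRP": {"We": 1.0},
    "MD":  {"can": 1.0},
    "VB":  {"can": 0.2, "swim": 0.8},
    "DT":  {"the": 1.0},
    "NN":  {"can": 0.3, "fish": 0.7},
}


def viterbi(words):
    """Return the most probable tag sequence for a non-empty word list."""
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{t: (START.get(t, 0.0) * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for i in range(1, len(words)):
        column = {}
        for t in TAGS:
            emit = EMIT[t].get(words[i], 0.0)
            # Pick the predecessor tag that maximizes the path probability.
            column[t] = max(
                (best[i - 1][p][0] * TRANS[p].get(t, 0.0) * emit, p) for p in TAGS
            )
        best.append(column)

    # Backtrack from the most probable final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["We", "can", "can", "the", "can"]))
```

Taken word by word, *can* would be tagged MD every time, because that is its highest emission probability in this toy model. The decoder instead returns PRP, MD, VB, DT, NN: the transition probabilities make a verb after a modal and a noun after a determiner the better overall path.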

### 🔧 Step 3: Environment Setup

In [None]:
import nltk
from nltk.corpus import brown
from nltk.tag import hmm

In [11]:
# Download resources (run once)
nltk.download('punkt')
nltk.download('brown')


### 📚 Step 4: Prepare Training Data

HMMs require a **tagged corpus** to learn probabilities.
Here, we use the Brown Corpus (news category).

## 🏷️ Official Tag List for `hmm_tagger` (Brown Tagset)

### ✅ Primary Reference (Recommended for Teaching)

You can find the **complete Brown Corpus tagset**, along with detailed explanations, in the **NLTK Book, Chapter 5**:

👉 **NLTK Book: Categorizing and Tagging Words (Brown Tagset)**  
*(nltk.org)*

### 📚 What This Section Covers

This reference documents:

- ✅ All Brown tags (e.g. `PPSS`, `AT`, `NP`, `VB`, `MD`)
- ✅ Examples of words annotated with each tag
- ✅ Key differences between the **Brown tagset** and the **Penn Treebank tagset**

📌 **Teaching note:**  
The `HiddenMarkovModelTagger` in NLTK inherits its tagset directly from the corpus it is trained on. When trained with the Brown Corpus, it therefore produces **Brown-style PoS tags**.
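
Because the rule-based section uses Penn-style tags while a Brown-trained tagger emits Brown tags, comparing outputs requires a mapping. The sketch below is a partial, hand-picked mapping covering only the tags relevant to our example sentence. The names `BROWN_TO_PENN` and `to_penn` are hypothetical, and the `brown_style` list is the expected reading written by hand, not actual tagger output.

```python
# Partial mapping from Brown tags to their closest Penn Treebank counterparts.
# Only tags relevant to "We can can the can ." are listed.
BROWN_TO_PENN = {
    "PPSS": "PRP",  # nominative pronoun other than 3rd person singular (we, they, I)
    "AT": "DT",     # article (the, a)
    "NP": "NNP",    # proper noun
    "MD": "MD",     # modal
    "VB": "VB",     # verb, base form
    "NN": "NN",     # singular common noun
    ".": ".",       # sentence-final punctuation
}


def to_penn(tagged_sentence):
    """Convert (word, Brown tag) pairs; unmapped tags are left unchanged."""
    return [(word, BROWN_TO_PENN.get(tag, tag)) for word, tag in tagged_sentence]


# Hand-written Brown-style reading of the example (not tagger output).
brown_style = [("We", "PPSS"), ("can", "MD"), ("can", "VB"),
               ("the", "AT"), ("can", "NN"), (".", ".")]
print(to_penn(brown_style))
```

Leaving unmapped tags unchanged keeps the sketch safe for the many Brown tags it does not cover; a full comparison would need a complete mapping table.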

In [12]:
# Load tagged sentences for training
train_sentences = brown.tagged_sents(categories='news')

# Inspect one example
train_sentences[0]


### 🏗️ Step 5: Train the HMM Tagger

The training process automatically learns:
- Tag transition probabilities
- Word emission probabilities

In [13]:
# Train an HMM tagger
hmm_tagger = hmm.HiddenMarkovModelTagger.train(train_sentences)


### 🔍 Step 6: Apply the HMM to an Ambiguous Sentence

In [14]:
sentence = "We can can the can ."
tokens = nltk.word_tokenize(sentence)

hmm_tags = hmm_tagger.tag(tokens)
hmm_tags

What to look for in the output:

✔ Whether the tag sequence reflects the grammatical structure (pronoun, modal, verb, article, noun)  
✔ How contextual probabilities resolve the ambiguity of *can*  
✔ That the same word can receive different tags within one sentence

## 3️⃣ Transformation-Based (Brill) Tagging

### 🔍 Overview
The **Transformation-Based Approach**, also known as **Brill Tagging**, is a **hybrid method** that combines:

- ✅ A **simple statistical baseline tagger**
- ✅ A set of **learned transformation rules**

Instead of assigning tags in one step, Brill tagging **iteratively corrects errors** made by an initial tagger using rules learned from a **tagged corpus**.
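
The error-correction idea can be sketched in plain Python with a single rule template, "change tag A to B when the previous tag is C". This is a simplified illustration, not NLTK's Brill trainer: the toy data, the scoring (errors fixed minus correct tags that match the rule's context), and the names `best_rule` and `apply_rule` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy data: gold tags, and the output of a baseline tagger
# that always assigns "can" its most frequent tag (MD).
gold = [
    [("We", "PRP"), ("can", "MD"), ("can", "VB"), ("the", "DT"), ("can", "NN")],
    [("the", "DT"), ("can", "NN"), ("fell", "VBD")],
]
baseline = [
    [("We", "PRP"), ("can", "MD"), ("can", "MD"), ("the", "DT"), ("can", "MD")],
    [("the", "DT"), ("can", "MD"), ("fell", "VBD")],
]


def best_rule(gold, predicted):
    """Pick the rule (from_tag, to_tag, prev_tag) with the best net score."""
    fixes, breaks = Counter(), Counter()
    for g_sent, p_sent in zip(gold, predicted):
        for i in range(1, len(p_sent)):
            prev_tag, cur_tag, gold_tag = p_sent[i - 1][1], p_sent[i][1], g_sent[i][1]
            if cur_tag != gold_tag:
                fixes[(cur_tag, gold_tag, prev_tag)] += 1   # rule would fix this error
            else:
                breaks[(cur_tag, prev_tag)] += 1            # rule would damage this tag
    return max(fixes, key=lambda r: fixes[r] - breaks[(r[0], r[2])])


def apply_rule(rule, predicted):
    """Apply one rule, checking contexts against the tags before the change."""
    from_tag, to_tag, prev_tag = rule
    corrected = []
    for sent in predicted:
        new_sent = list(sent)
        for i in range(1, len(sent)):
            if sent[i][1] == from_tag and sent[i - 1][1] == prev_tag:
                new_sent[i] = (sent[i][0], to_tag)
        corrected.append(new_sent)
    return corrected


current = baseline
for _ in range(2):  # learn and apply two rules in sequence
    rule = best_rule(gold, current)
    print("Learned rule: change %s to %s when the previous tag is %s" % rule)
    current = apply_rule(rule, current)

print("Matches gold:", current == gold)
```

On this toy data the first learned rule turns MD into NN after a determiner, and the second turns MD into VB after another modal, at which point the output matches the gold tags. Each rule is readable on its own, which is the property that makes transformation-based tagging easy to inspect.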


### 🧠 Core Idea

Brill tagging follows this learning cycle:

1. Start with a **baseline tagger** (e.g., Unigram Tagger)
2. Compare its output with **gold-standard tags**
3. Learn **transformation rules** that reduce errors
4. Apply the rules sequentially to improve tagging accuracy

📌 The learned rules are **human-readable**, making this approach both **accurate and interpretable**.

### 🔧 Step 1: Environment Setup

In [15]:
!pip install -q sklearn-crfsuite

In [16]:
import nltk
from nltk.corpus import brown

nltk.download('brown')
nltk.download('punkt')

sentences = brown.tagged_sents(categories='news')


In [None]:
from nltk.tag import DefaultTagger, UnigramTagger

default_tagger = DefaultTagger('NN')
baseline_tagger = UnigramTagger(sentences, backoff=default_tagger)

In [None]:
sentence = "We can can the can .".split()
baseline_tagger.tag(sentence)

In [None]:
def word_features(sent, i):
    word = sent[i]
    features = {
        'word.lower()': word.lower(),
        'is_upper': word.isupper(),
        'is_title': word.istitle(),
        'is_digit': word.isdigit(),
    }
    if i > 0:
        features['prev_word'] = sent[i-1]
    else:
        features['BOS'] = True

    if i < len(sent) - 1:
        features['next_word'] = sent[i+1]
    else:
        features['EOS'] = True

    return features

In [None]:
def sent2features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [tag for _, tag in sent]

X = [sent2features([w for w, t in s]) for s in sentences]
y = [sent2labels(s) for s in sentences]

In [None]:
import sklearn_crfsuite

# Train on a small subset to keep runtime manageable
X_small, y_small = X[:500], y[:500]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=10,
    all_possible_transitions=False
)

crf.fit(X_small, y_small)

In [None]:
test_sentence = "We can can the can .".split()
features = sent2features(test_sentence)

list(zip(test_sentence, crf.predict_single(features)))

### ⚠️ Practical Note on CRF Training

Note that the cells above train a **Conditional Random Field (CRF)** on handcrafted context features rather than NLTK's Brill trainer.
CRFs are powerful but computationally expensive models.
Training on large corpora with all possible tag transitions
can take a very long time.

✅ For teaching and experimentation:
- Use a small subset of data
- Limit the number of iterations
- Disable unnecessary transitions (`all_possible_transitions=False`)

This preserves the learning behavior while keeping runtime manageable.

#### 🧠 Interpretation

- **Rule-Based Tagger**
  - Handles the three uses of *can* in this sentence because the rules were written for it.
  - Performance is **fragile** and depends entirely on manual rule coverage.

- **Hidden Markov Model (HMM)**
  - Resolves ambiguity using **learned transition and emission probabilities**.
  - Makes **global sequence-level decisions**, which makes it more robust than word-by-word tagging.

- **Brill-Style (Transformation-Based) Tagger**
  - Starts from a weak baseline and **learns contextual corrections**.
  - Combines the **interpretability of rules** with **data-driven learning**.
  - Often outperforms unigram or bigram taggers when **training data is limited**.

---

#### ✅ Key Takeaway

The three methods arrive at their tags for this example in different ways:

- Rule-based tagging relies on **explicit linguistic intuition**
- HMM tagging relies on **probabilistic sequence modeling**
- Brill-style tagging bridges both worlds by **learning rules from data**

This comparison highlights why transformation-based methods remain an important conceptual bridge between symbolic and statistical NLP approaches.

## 📝 Playground: Ending Exercises

The following exercises encourage you to **apply, compare, and reflect** on the three PoS tagging approaches covered in this tutorial. Focus on **ambiguity**, **context**, and **model behavior** rather than just correctness.

---

### 🧪 Exercise 1: English Ambiguity Challenge  
**Sentence:**  
> *“Time flies like an arrow.”*

This sentence is famously ambiguous and can be interpreted in multiple ways.

#### ✅ Tasks
1. **Tokenize** the sentence.
2. Apply:
   - Rule-based tagging  
   - HMM tagging  
   - Brill-style (transformation-based) tagging
3. Record the PoS tags produced by each method.

#### 🧠 Guiding Questions
- Which word(s) show different PoS tags across methods?
- Does *flies* behave as a **noun** or a **verb**?
- Is *like* treated as a **verb**, **preposition**, or **conjunction**?
- Which approach best captures the intended reading?

#### 💡 Reflection
- Why is this sentence difficult to tag correctly without full syntactic analysis?
- How does sequence-level modeling help resolve ambiguity?

---

### 🧪 Exercise 2: Chinese Structural Ambiguity  
**Sentence:**  
> **我喜欢吃苹果的人。**  
> *(Pinyin: Wǒ xǐhuān chī píngguǒ de rén.)*

This sentence is a classic example used in **Chinese NLP** to test ambiguity resolution.

#### ✅ Tasks
1. Segment the sentence into words (use a Chinese tokenizer if available).
2. Assign PoS tags to each word.
3. Identify at least **two possible interpretations** of the sentence.

#### 🧠 Key Points to Consider
- The grammatical role of **“的”**
- Whether **“吃苹果”** attaches to:
  - *喜欢* (like to eat apples), or
  - *人* (people who eat apples)
- How relative clauses are formed in Chinese

#### 💡 Reflection
- Why is **“的”** challenging for PoS tagging and parsing?
- What additional information (syntax, semantics, or context) would help disambiguate the sentence?
- Why do purely rule-based approaches struggle with this example?

---

### 🌟 Take-Home Insight

These exercises illustrate that:

- **PoS tagging alone is often insufficient** for full disambiguation
- Ambiguity exists at both **lexical** and **structural** levels
- Real-world NLP systems must integrate **context, syntax, and semantics**

✅ Congratulations on completing the tutorial!

In [None]:
!pip uninstall nltk -y
!pip uninstall sklearn-crfsuite -y
!pip uninstall jieba -y
!pip uninstall pkuseg -y

# Clear the pip cache
!pip cache purge

# Install all required libraries
!pip install nltk==3.8.1
!pip install numpy
!pip install jieba

print("Installation complete! Now restart the runtime...")

In [6]:
# ============================
# Exercise 1: English Ambiguity Analysis
# ============================

# Clean installation
!pip uninstall nltk -y -q
!pip install nltk==3.9.0 -q  # Use compatible version

import nltk
import numpy as np
from collections import defaultdict

# Download NLTK data.
# NLTK 3.9 looks up 'punkt_tab' and 'averaged_perceptron_tagger_eng';
# the older resource names are kept for earlier releases.
resources = ['punkt', 'punkt_tab', 'averaged_perceptron_tagger',
             'averaged_perceptron_tagger_eng', 'brown']
try:
    for resource in resources:
        nltk.download(resource, quiet=True)
except Exception:
    # Fallback: retry without HTTPS certificate verification
    import ssl
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context
    for resource in resources:
        nltk.download(resource, quiet=True)

print("=" * 60)
print("EXERCISE 1: English Ambiguity - 'Time flies like an arrow'")
print("=" * 60)

# Define the sentence
sentence = "Time flies like an arrow"
tokens = nltk.word_tokenize(sentence)
print(f"Original sentence: {sentence}")
print(f"Tokenized: {tokens}\n")

# ========== Method 1: Rule-based Tagging ==========
print("Method 1: Rule-based POS Tagging")
print("-" * 40)

# Define tagging rules
rule_based_tags = []

for i, word in enumerate(tokens):
    word_lower = word.lower()

    # Rule 1: First word "Time" is likely a noun
    if i == 0 and word_lower == "time":
        rule_based_tags.append((word, "NN"))  # Noun

    # Rule 2: "flies" could be verb or noun
    elif word_lower == "flies":
        if i > 0 and tokens[i-1].lower() == "time":
            rule_based_tags.append((word, "VBZ"))  # Verb 3rd person singular
        else:
            rule_based_tags.append((word, "NNS"))  # Noun plural

    # Rule 3: "like" could be preposition or verb
    elif word_lower == "like":
        if i > 0 and tokens[i-1].lower() in ["flies", "fly"]:
            rule_based_tags.append((word, "IN"))  # Preposition
        else:
            rule_based_tags.append((word, "VB"))  # Verb

    # Rule 4: "an" is a determiner
    elif word_lower == "an":
        rule_based_tags.append((word, "DT"))

    # Rule 5: "arrow" is a noun
    elif word_lower == "arrow":
        rule_based_tags.append((word, "NN"))

    # Default rule
    else:
        rule_based_tags.append((word, "UNK"))

print("Rule-based tagging results:")
for word, tag in rule_based_tags:
    print(f"  {word}: {tag}")

# ========== Method 2: NLTK Default Tagger ==========
print("\nMethod 2: NLTK Default Tagger (Averaged Perceptron)")
print("-" * 40)

# Use NLTK's pre-trained tagger
default_tags = nltk.pos_tag(tokens)
print("NLTK tagging results:")
for word, tag in default_tags:
    print(f"  {word}: {tag}")

# ========== Method 3: Create a Custom Tagger ==========
print("\nMethod 3: Custom Pattern-based Tagger")
print("-" * 40)

# Define regular-expression patterns (tried in order; the first match wins)
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'POS'),                # possessive
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers (decimal point escaped)
    (r'.*', 'NN')                      # default: noun
]

# Create a regex tagger
from nltk import RegexpTagger
regex_tagger = RegexpTagger(patterns)
regex_tags = regex_tagger.tag(tokens)

print("Pattern-based tagging results:")
for word, tag in regex_tags:
    print(f"  {word}: {tag}")

# ========== Analysis and Comparison ==========
print("\n" + "=" * 60)
print("ANALYSIS AND COMPARISON")
print("=" * 60)

print("\nTagging results comparison:")
print(f"{'Word':<8} {'Rule-based':<12} {'NLTK':<12} {'Pattern-based':<12}")
print("-" * 45)
for i, word in enumerate(tokens):
    rule_tag = rule_based_tags[i][1]
    nltk_tag = default_tags[i][1]
    pattern_tag = regex_tags[i][1]
    print(f"{word:<8} {rule_tag:<12} {nltk_tag:<12} {pattern_tag:<12}")

print("\n1. Key ambiguity points:")
print("   - 'flies': Can be verb (VBZ) or noun (NNS)")
print("   - 'like': Can be preposition (IN) or verb (VB)")
print("   - 'Time': Always noun (NN) in this context")

print("\n2. Two interpretations:")
print("   Interpretation A: 'Time passes quickly like an arrow'")
print("   - Structure: Time(NN) flies(VBZ) like(IN) an(DT) arrow(NN)")
print("   - Meaning: Time moves as fast as an arrow")
print("\n   Interpretation B: 'Time-flies (insects) enjoy an arrow'")
print("   - Structure: Time(NN) flies(NNS) like(VB) an(DT) arrow(NN)")
print("   - Meaning: Insects called 'time flies' like arrows")

print("\n3. Method evaluation:")
print("   - Rule-based: Good for specific cases, needs manual tuning")
print("   - NLTK: Pre-trained averaged perceptron tagger, learned from annotated data")
print("   - Pattern-based: Good fallback, uses word endings and patterns")

# ============================
# Exercise 2: Chinese Sentence Analysis
# ============================
print("\n\n" + "=" * 60)
print("EXERCISE 2: Chinese Sentence Analysis")
print("=" * 60)

print("NOTE: Chinese analysis requires additional libraries.")
print("We'll analyze the sentence conceptually instead.")

# Chinese sentence information
chinese_sentence = "我喜欢吃苹果的人"
print(f"\nChinese sentence: {chinese_sentence}")
print("Pinyin: Wǒ xǐhuān chī píngguǒ de rén")
print("Literal: I like eat apple DE person")
print("Common translation: 'I like people who eat apples' or 'I like to eat apples'")

# ========== Manual Analysis ==========
print("\n" + "-" * 40)
print("MANUAL LINGUISTIC ANALYSIS")
print("-" * 40)

print("\nWord-by-word breakdown:")
print("  我 (wǒ) - I/me (pronoun)")
print("  喜欢 (xǐhuān) - to like (verb)")
print("  吃 (chī) - to eat (verb)")
print("  苹果 (píngguǒ) - apple (noun)")
print("  的 (de) - possessive/relative particle")
print("  人 (rén) - person/people (noun)")

print("\nPossible structures:")
print("\n1. Structure A: 我 + 喜欢 + [吃苹果 + 的 + 人]")
print("   Parse: I + like + [people who eat apples]")
print("   Meaning: I like people who eat apples")
print("   POS: PRON + VERB + [VERB + NOUN + PART + NOUN]")

print("\n2. Structure B: 我 + 喜欢 + 吃 + 苹果")
print("   Parse: I + like + eat + apples")
print("   Meaning: I like to eat apples (this reading leaves '的人' unparsed)")
print("   POS: PRON + VERB + VERB + NOUN")

print("\n3. Structure C: [我喜欢吃苹果] + 的 + 人")
print("   Parse: [I like to eat apples] + 's + person")
print("   Meaning: The person who likes to eat apples")
print("   POS: [PRON + VERB + VERB + NOUN] + PART + NOUN")

print("\nAmbiguity source:")
print("  - The particle '的' can create relative clauses")
print("  - Chinese allows noun phrases without explicit relative pronouns")
print("  - Context determines the correct parse")

# ========== Comparison with English ==========
print("\n" + "=" * 60)
print("COMPARISON: ENGLISH vs CHINESE AMBIGUITY")
print("=" * 60)

print("\nEnglish 'Time flies like an arrow':")
print("  - Type: Lexical and structural ambiguity")
print("  - Resolution: POS tagging + syntax parsing")
print("  - Key words: 'flies'(V/N), 'like'(P/V)")

print("\nChinese '我喜欢吃苹果的人':")
print("  - Type: Structural/syntactic ambiguity")
print("  - Resolution: Phrase structure parsing")
print("  - Key element: Particle '的' (creates relative clauses)")

print("\nSimilarities:")
print("  - Both require context for full disambiguation")
print("  - Both have multiple valid interpretations")
print("  - Syntax plays crucial role in meaning")

print("\nDifferences:")
print("  - English: Word-level ambiguity (same word, different POS)")
print("  - Chinese: Phrase-level ambiguity (same sequence, different structures)")
print("  - English uses word order and function words")
print("  - Chinese uses particles and context")

# ========== Advanced Analysis ==========
print("\n\n" + "=" * 60)
print("ADVANCED ANALYSIS: TUTORIAL EXAMPLES")
print("=" * 60)

# Test the tutorial examples
tutorial_sentences = [
    "We can can the can",
    "Fruit flies like a banana",
    "The old man the boat"
]

print("\nTesting additional ambiguous sentences:")

for i, sent in enumerate(tutorial_sentences, 1):
    print(f"\n{i}. Sentence: '{sent}'")
    tokens = nltk.word_tokenize(sent)
    tags = nltk.pos_tag(tokens)

    print(f"   Tokens: {tokens}")
    print(f"   POS tags: {tags}")

    # Specific analysis for each sentence
    if i == 1:
        print("   Analysis: 'can' appears 3 times with different POS:")
        print("     - Position 1: Modal verb (ability)")
        print("     - Position 2: Verb (to preserve)")
        print("     - Position 4: Noun (container)")
    elif i == 2:
        print("   Analysis: Same ambiguity as 'Time flies like an arrow'")
        print("     - 'flies': Can be noun (insects) or verb")
        print("     - 'like': Can be preposition or verb")
    elif i == 3:
        print("   Analysis: Garden path sentence")
        print("     - Initial parse: 'The old man' (noun phrase)")
        print("     - Correct parse: 'The old' (noun) + 'man' (verb)")
        print("     - Meaning: Elderly people operate the boat")

# ========== Implementation Demonstration ==========
print("\n\n" + "=" * 60)
print("IMPLEMENTATION: SIMPLE DISAMBIGUATION SYSTEM")
print("=" * 60)

def analyze_sentence(sentence):
    """Simple function to analyze sentence ambiguity"""
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)

    print(f"\nAnalyzing: '{sentence}'")
    print(f"Tokens: {tokens}")
    print(f"POS tags: {tags}")

    # Check for specific patterns
    if "flies" in sentence.lower() and "like" in sentence.lower():
        print("\nPattern detected: 'X flies like Y'")
        print("Possible interpretations:")
        print("  1. X (flies like Y) - X moves similarly to Y")
        print("  2. X-flies (like Y) - Insects called X-flies enjoy Y")

    elif "can" in tokens and tokens.count("can") > 1:
        print("\nPattern detected: Multiple 'can' usage")
        print("Common pattern: Modal verb + Verb + Determiner + Noun")
        print("Example: We [can-MD] [can-VB] the [can-NN]")

    return tokens, tags

# Test the function
test_sentences = [
    "Time flies like an arrow",
    "We can can the can",
    "I saw the man with the telescope"
]

for sent in test_sentences:
    analyze_sentence(sent)

# ========== Conclusion ==========
print("\n" + "=" * 60)
print("CONCLUSION: POS TAGGING AND DISAMBIGUATION")
print("=" * 60)

print("\nKey takeaways from this tutorial:")
print("\n1. POS Tagging Methods:")
print("   - Rule-based: Transparent but limited")
print("   - Statistical (HMM): Learns from data, more robust")
print("   - Transformation-based (Brill): Learns readable correction rules from data")

print("\n2. Ambiguity Types:")
print("   - Lexical: Same word, different meanings/POS")
print("   - Structural: Same words, different parse trees")
print("   - Scope: Modifier attachment problems")

print("\n3. Language Differences:")
print("   - English: Suffixes such as -s, -ed, and -ing give some POS cues")
print("   - Chinese: Relies more on word order and particles")
print("   - Both: Require context for full disambiguation")

print("\n4. Practical Applications:")
print("   - Machine translation")
print("   - Information extraction")
print("   - Question answering")
print("   - Grammar checking")

print("\n5. Limitations:")
print("   - POS tagging alone is insufficient")
print("   - Need syntax and semantics")
print("   - Context is crucial")

print("\n" + "=" * 60)
print("END OF TUTORIAL ANALYSIS")
print("=" * 60)

# Quick test of the tutorial's main example
print("\n\nQUICK TEST: Tutorial's main example")
main_example = "We can can the can"
tokens_ex = nltk.word_tokenize(main_example)
tags_ex = nltk.pos_tag(tokens_ex)

print(f"\nSentence: {main_example}")
print("Word-by-word analysis:")
for i, (word, tag) in enumerate(tags_ex):
    pos_explanation = {
        'PRP': 'Personal pronoun',
        'MD': 'Modal verb (indicates ability/permission)',
        'VB': 'Verb (base form)',
        'DT': 'Determiner',
        'NN': 'Noun (singular)'
    }
    explanation = pos_explanation.get(tag, 'Unknown')
    print(f"  '{word}': {tag} - {explanation}")

print("\nComplete analysis:")
print("  'We can [can] the [can]'")
print("  Where:")
print("    can₁ (position 1): Modal verb - 'are able to'")
print("    can₂ (position 2): Verb - 'to put into containers'")
print("    can₃ (position 4): Noun - 'metal container'")

EXERCISE 1: English Ambiguity - 'Time flies like an arrow'
Original sentence: Time flies like an arrow
Tokenized: ['Time', 'flies', 'like', 'an', 'arrow']

Method 1: Rule-based POS Tagging
----------------------------------------
Rule-based tagging results:
  Time: NN
  flies: VBZ
  like: IN
  an: DT
  arrow: NN

Method 2: NLTK Default Tagger (Averaged Perceptron)
-----------------------------