<a href="https://colab.research.google.com/github/gaoyuchen1515-art/rulinwaishi/blob/main/CBS5502_Tutorial_pos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">

# 🧠 **CBS5502: Computational Linguistics and NLP Technologies**

### 🐍 **2nd Python Tutorial**
### 📅 *February 4, 2026*

---

## 🇨🇳 **PoS Tagging and Disambiguation**
---
### 👨‍🏫 **Instructor**
**Dr. WAN Mingyu**

### 👨‍🏫 **Teaching Assistant**
**Mr. BAO Xiaoyi**

</div>

---

## 🌟 Welcome!

Welcome to the tutorial series of **CBS5502**!  
In this tutorial, we will explore how **Part-of-Speech (PoS) tagging** works and how ambiguity can be resolved using **three different approaches**, all demonstrated with the classic ambiguous sentence:

> **“We can can the can.”** 🌟

---

## 🎯 Learning Objectives

By the end of this tutorial, you will be able to:

- Understand what **PoS tagging** is and why it is important in NLP  
- Identify **lexical and structural ambiguity** in natural language  
- Apply **three approaches to PoS tagging**:
  - Rule-based tagging
  - Statistical / probabilistic tagging
  - Dictionary- or library-based tagging using Python  
- Analyze and interpret tagging results for ambiguous sentences  

---

🚀 Let’s Get Started!

In [9]:
# Import required libraries
import nltk
nltk.download('punkt_tab')
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.tag import hmm
from collections import defaultdict
from nltk.tag import brill, brill_trainer
from nltk.tag import UnigramTagger, BigramTagger, DefaultTagger
from nltk.corpus import treebank

# Ensure you have the required NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Input sentence
sentence = "We can can the can."

# Tokenizing the sentence into words
tokens = nltk.word_tokenize(sentence)
print("Tokenized Sentence:", tokens)


### 1️⃣ Rule-Based Approach

#### 🔍 Overview
The **rule-based approach** assigns Part-of-Speech (PoS) tags using **handcrafted linguistic rules**, typically based on word forms, surrounding context, or fixed patterns.  
This method does **not rely on training data**, making it easy to understand and implement.

#### 🧠 How It Works
- Each rule matches a word (or pattern) in the sentence
- The **first matching rule** determines the PoS tag
- Rules are applied **sequentially**, from top to bottom
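The first-match behaviour described above can be sketched as an ordered list of (condition, tag) pairs. This is a minimal illustration rather than the tutorial's own tagger: the names `RULES` and `first_match_tag`, and the tiny rule set itself, are assumptions made for the example.

```python
# Hypothetical ordered rule list: each entry is (condition, tag).
# A condition receives the token list and the index of the current token.
RULES = [
    (lambda toks, i: toks[i] == "We", "PRP"),    # personal pronoun
    (lambda toks, i: toks[i] == "the", "DT"),    # determiner
    (lambda toks, i: toks[i] == "can" and i > 0 and toks[i - 1] == "the", "NN"),
    (lambda toks, i: toks[i] == "can", "MD"),    # general default for "can"
]

def first_match_tag(tokens, fallback="NN"):
    """Tag each token with the first rule whose condition holds."""
    tagged = []
    for i, word in enumerate(tokens):
        tag = fallback  # used only when no rule matches
        for condition, rule_tag in RULES:
            if condition(tokens, i):
                tag = rule_tag
                break  # stop at the first matching rule
        tagged.append((word, tag))
    return tagged

print(first_match_tag(["We", "can", "open", "the", "can"]))
```

Because order matters, moving the general `can` rule above the contextual one would make the contextual rule unreachable. The unknown word *open* falls through to the `NN` fallback, which already hints at how brittle such rule lists are.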

#### 🧩 Example Rules
For our ambiguous sentence, we define a few **simple and intuitive rules**:
- Tag **“We”** as a personal pronoun
- Tag **“the”** as a determiner
- Assign **“can”** a default modal-verb tag
- Use a fallback rule for unknown cases

These rules illustrate both the **strength** (clarity) and **limitation** (lack of context awareness) of the rule-based approach.

In [None]:
# --------------------------------------------------
# STEP 1: Define default (most likely) POS tags
# --------------------------------------------------
# This dictionary provides a fallback tag for each word.
# If no contextual rule applies, we use these tags.
most_likely_tags = {
    "We": "PRP",   # Personal pronoun
    "can": "MD",   # Modal verb (default assumption)
    "the": "DT"    # Determiner
}

In [None]:
# --------------------------------------------------
# STEP 2: Define the rule-based POS tagging function
# --------------------------------------------------
def rule_based_pos_tagger(tokens):
    """
    Assign POS tags to a list of tokens using
    handcrafted contextual rules.

    Parameters:
        tokens (list): A list of word tokens

    Returns:
        list: A list of (word, POS tag) tuples
    """

    tagged_sentence = []  # Store the final tagged output

    # Iterate through each word with its position
    for i, word in enumerate(tokens):

        # --------------------------------------------------
        # STEP 3: Apply context-sensitive rules
        # --------------------------------------------------

        # Rule 1:
        # If "can" appears immediately after "We",
        # it functions as a modal verb (e.g., "We can ...")
        if word == "can" and i > 0 and tokens[i - 1] == "We":
            tag = "MD"

        # Rule 2:
        # If "can" follows "the", it is treated as a noun
        # (e.g., "the can")
        elif word == "can" and i > 0 and tokens[i - 1] == "the":
            tag = "NN"

        # Rule 3:
        # If "can" follows another "can",
        # it is treated as a main verb
        # (e.g., "can can the...")
        elif word == "can" and i > 0 and tokens[i - 1] == "can":
            tag = "VB"

        # --------------------------------------------------
        # STEP 4: Apply default rule
        # --------------------------------------------------
        # If no specific contextual rule matches,
        # fall back to the most likely tag
        else:
            tag = most_likely_tags.get(word, "NN")
            # Unknown words default to NN (noun)

        # Add the (word, tag) pair to the result
        tagged_sentence.append((word, tag))

    return tagged_sentence

## 🔎 Step-by-Step Rule Application

Sentence: **We can can the can .**

| Position | Token | Left Context | Applied Rule | Assigned Tag |
|---------:|-------|--------------|--------------|--------------|
| 0 | We | (none) | Default dictionary rule | PRP |
| 1 | can | We | Rule 1: *can* after *We* | MD |
| 2 | can | can | Rule 3: *can* after *can* | VB |
| 3 | the | can | Default dictionary rule | DT |
| 4 | can | the | Rule 2: *can* after *the* | NN |
| 5 | . | can | Default fallback | NN |

In [10]:
# --------------------------------------------------
# STEP 5: Apply the rule-based tagger
# --------------------------------------------------
rule_based_tags = rule_based_pos_tagger(tokens)

# Display the result
print("Rule-Based POS Tags:")
for word, tag in rule_based_tags:
    print(f"{word:>5}  →  {tag}")


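## ⚠️ Error Cases & Discussion

### Example 1
Sentence: **They can fish.**

Expected:
- can → MD
- fish → VB

Rule-Based Output:
- can → MD ✅
- fish → NN ❌

📌 *Why?*  
No rule says that a word following a modal is a verb, and *fish* is not in `most_likely_tags`, so it falls through to the default `NN`. Note that *can* is tagged `MD` only by the dictionary default, since Rule 1 fires only after *We*.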

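### Example 2
Sentence: **The can can rust.**

Correct interpretation:
- can → NN
- can → MD

Rule-Based Output:
- can → MD ❌
- can → VB ❌

📌 *Discussion point:*  
Rule 2 compares against lowercase *the*, so sentence-initial *The* does not trigger it and the first *can* falls back to `MD`. Rule 3 then tags the second *can* as `VB`, although it is the modal here. Rules tuned to one sentence do not transfer, and any correct tag they produce on new input is **accidental**, not robust.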

### 🧠 Teaching Notes

- This approach relies entirely on **manually written rules**
- Each rule encodes **explicit linguistic intuition**

#### ✅ Strengths
- Easy to understand and interpret
- Transparent decision-making process

#### ❌ Limitations
- Difficult to scale to large vocabularies
- Brittle when encountering unseen or unexpected patterns

### 🧠 Learning Takeaways

- Rule-based tagging makes **linguistic assumptions explicit**
- Context helps, but only when **manually encoded**
- Error cases reveal why **learning from data is necessary**
- HMM and Brill taggers automate what rules attempt to approximate

## 2️⃣ Hidden Markov Model (HMM) Approach

### 🔍 Overview
The **Hidden Markov Model (HMM)** approach is a **statistical sequence-labeling method** that assigns PoS tags by modeling language as a **probabilistic process**.  
It predicts the **most likely sequence of tags** for a sentence using probabilities learned from a **tagged corpus**.

### 🧠 Core Assumptions
HMM PoS tagging relies on two key assumptions:

- **Markov Assumption:**  
  The current tag depends only on a limited number of previous tags (typically one or two).
- **Output Independence Assumption:**  
  Each word is generated independently given its tag.

### 🔁 Decoding Strategy
To determine the optimal tag sequence, HMMs use the **Viterbi algorithm**, which efficiently finds:

> ✅ The most probable tag sequence for the entire sentence,  
> rather than tagging each word independently.



### 🧠 Step 1: What Does an HMM Model?

An HMM models language with two probability components:

1. **Transition Probability**
   - $$P(t_i \mid t_{i-1})$$  
   - How likely one tag follows another

2. **Emission Probability**
   - $$P(w_i \mid t_i)$$  
   - How likely a word is generated by a tag

The goal is to find the **most probable tag sequence** for the entire sentence.
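
Combining the two components gives the tagging objective. The formula below assumes a first-order (bigram) model and a start symbol $t_0$ placed before the first word; the start symbol is a notational assumption introduced here for convenience.

$$\hat{t}_{1:n} = \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)$$

For *We can can the can*, the candidate sequence PRP MD VB DT NN is scored as

$$P(\text{PRP} \mid t_0)\,P(\text{We} \mid \text{PRP}) \cdot P(\text{MD} \mid \text{PRP})\,P(\text{can} \mid \text{MD}) \cdot P(\text{VB} \mid \text{MD})\,P(\text{can} \mid \text{VB}) \cdot P(\text{DT} \mid \text{VB})\,P(\text{the} \mid \text{DT}) \cdot P(\text{NN} \mid \text{DT})\,P(\text{can} \mid \text{NN})$$

and competes against every other tag sequence of the same length.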

### 🔁 Step 2: Why We Need Sequence-Level Decisions

Ambiguous words like **“can”** cannot be tagged reliably in isolation.

✅ HMMs solve this by:
- Considering **previous tags**
- Evaluating the **entire sentence**
- Using **global optimization** via the Viterbi algorithm
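
The difference between isolated and sequence-level tagging can be sketched with a hand-written Viterbi decoder. Everything numeric here is an assumption: the probabilities in `TRANS` and `EMIT` are invented toy values, not estimates from any corpus, and the function name `viterbi` is just a label for this example.

```python
# Toy probabilities invented for illustration; they are NOT estimated from a corpus.
TAGS = ["PRP", "MD", "VB", "DT", "NN"]
TRANS = {  # P(tag | previous tag); missing pairs count as 0
    "START": {"PRP": 0.6, "DT": 0.3, "NN": 0.1},
    "PRP": {"MD": 0.6, "VB": 0.4},
    "MD": {"VB": 0.9, "PRP": 0.1},
    "VB": {"DT": 0.6, "NN": 0.3, "PRP": 0.1},
    "DT": {"NN": 1.0},
    "NN": {"MD": 0.5, "VB": 0.4, "NN": 0.1},
}
EMIT = {  # P(word | tag); remaining mass belongs to words not listed here
    "PRP": {"We": 1.0},
    "MD": {"can": 1.0},
    "VB": {"can": 0.1},
    "DT": {"the": 1.0},
    "NN": {"can": 0.05},
}

def viterbi(words):
    """Return the most probable tag sequence under the toy model."""
    # best[i][tag] = (probability of the best path ending in tag, previous tag)
    best = [{t: (TRANS["START"].get(t, 0.0) * EMIT[t].get(words[0], 0.0), None)
             for t in TAGS}]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p][0] * TRANS[p].get(t, 0.0))
            score = best[-1][prev][0] * TRANS[prev].get(t, 0.0) * EMIT[t].get(word, 0.0)
            column[t] = (score, prev)
        best.append(column)
    # Backtrack from the best final tag
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

words = ["We", "can", "can", "the", "can"]
print("Viterbi: ", viterbi(words))
# Tagging each word in isolation picks the tag with the highest emission score
print("Isolated:", [max(TAGS, key=lambda t: EMIT[t].get(w, 0.0)) for w in words])
```

Under these toy numbers the isolated strategy assigns `MD` to all three occurrences of *can*, while the Viterbi path is PRP MD VB DT NN, because transitions such as MD→MD and DT→MD carry zero probability.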

### 🔧 Step 3: Environment Setup

In [None]:
import nltk
from nltk.corpus import brown
from nltk.tag import hmm

In [11]:
# Download resources (run once)
nltk.download('punkt')
nltk.download('brown')


### 📚 Step 4: Prepare Training Data

HMMs require a **tagged corpus** to learn probabilities.
Here, we use the Brown Corpus (news category).

## 🏷️ Official Tag List for `hmm_tagger` (Brown Tagset)

### ✅ Primary Reference (Recommended for Teaching)

You can find the **complete Brown Corpus tagset**, along with detailed explanations, in the **NLTK Book, Chapter 5**:

👉 **NLTK Book: Categorizing and Tagging Words (Brown Tagset)**  
*(nltk.org)*

### 📚 What This Section Covers

This reference documents:

- ✅ All Brown tags (e.g. `PPSS`, `AT`, `NP`, `VB`, `MD`)
- ✅ Examples of words annotated with each tag
- ✅ Key differences between the **Brown tagset** and the **Penn Treebank tagset**

📌 **Teaching note:**  
The `HiddenMarkovModelTagger` in NLTK inherits its tagset directly from the corpus it is trained on. When trained with the Brown Corpus, it therefore produces **Brown-style PoS tags**.

In [12]:
# Load tagged sentences for training
train_sentences = brown.tagged_sents(categories='news')

# Inspect one example
train_sentences[0]


### 🏗️ Step 5: Train the HMM Tagger

The training process automatically learns:
- Tag transition probabilities
- Word emission probabilities
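
To make "learning" concrete, both kinds of probability can be estimated by simple relative-frequency counting. The sketch below uses a three-sentence corpus invented for illustration and applies no smoothing; the estimator NLTK's trainer actually uses may differ, and the helper name `mle` is an assumption.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus invented for illustration (not taken from Brown).
corpus = [
    [("We", "PRP"), ("can", "MD"), ("swim", "VB")],
    [("We", "PRP"), ("can", "MD"), ("can", "VB"), ("the", "DT"), ("can", "NN")],
    [("the", "DT"), ("can", "NN"), ("fell", "VB")],
]

transition_counts = defaultdict(Counter)  # previous tag -> next tag -> count
emission_counts = defaultdict(Counter)    # tag -> word -> count

for sentence in corpus:
    prev_tag = "START"
    for word, tag in sentence:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][word] += 1
        prev_tag = tag

def mle(counts, given, outcome):
    """Relative-frequency estimate of P(outcome | given)."""
    total = sum(counts[given].values())
    return counts[given][outcome] / total if total else 0.0

print("P(MD | PRP) =", mle(transition_counts, "PRP", "MD"))
print("P(VB | MD)  =", mle(transition_counts, "MD", "VB"))
print("P(can | NN) =", mle(emission_counts, "NN", "can"))
print("P(can | VB) =", mle(emission_counts, "VB", "can"))
```

With so little data, any tag pair or word that never occurs gets probability zero, which is why practical taggers usually add some form of smoothing.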

In [13]:
# Train an HMM tagger
hmm_tagger = hmm.HiddenMarkovModelTagger.train(train_sentences)


### 🔍 Step 6: Apply the HMM to an Ambiguous Sentence

In [14]:
sentence = "We can can the can ."
tokens = nltk.word_tokenize(sentence)

hmm_tags = hmm_tagger.tag(tokens)
hmm_tags


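What to look for in the output (tags follow the Brown tagset, e.g. `PPSS`, `MD`, `AT`, `NN`):

✔ Whether the tag sequence reflects the grammatical structure (pronoun, modal, verb, article, noun)  
✔ How ambiguity is resolved using contextual probabilities  
✔ Whether the same word receives different tags in the same sentence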

## 3️⃣ Transformation-Based (Brill) Tagging

### 🔍 Overview
The **Transformation-Based Approach**, also known as **Brill Tagging**, is a **hybrid method** that combines:

- ✅ A **simple statistical baseline tagger**
- ✅ A set of **learned transformation rules**

Instead of assigning tags in one step, Brill tagging **iteratively corrects errors** made by an initial tagger using rules learned from a **tagged corpus**.


### 🧠 Core Idea

Brill tagging follows this learning cycle:

1. Start with a **baseline tagger** (e.g., Unigram Tagger)
2. Compare its output with **gold-standard tags**
3. Learn **transformation rules** that reduce errors
4. Apply the rules sequentially to improve tagging accuracy
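
The learning cycle above can be sketched in plain Python on a single gold-tagged sentence. This is a toy under stated assumptions, not NLTK's implementation (which lives in `nltk.tag.brill_trainer` and works from rule templates over a full corpus): the `BASELINE` dictionary, the single rule shape "change A to B when the previous tag is P", and the exhaustive search over candidates are all simplifications chosen for the example.

```python
# Toy data invented for illustration.
gold = [("We", "PRP"), ("can", "MD"), ("can", "VB"), ("the", "DT"), ("can", "NN")]
words = [w for w, _ in gold]

# Step 1: baseline = most likely tag per word (hand-written stand-in for a unigram tagger)
BASELINE = {"We": "PRP", "can": "MD", "the": "DT"}
tags = [BASELINE.get(w, "NN") for w in words]

def apply_rule(tags, rule):
    """Apply 'change from_tag to to_tag when the previous tag is prev_tag'."""
    from_tag, to_tag, prev_tag = rule
    # Conditions are checked against the tags as they were before this rule ran
    return [to_tag if i > 0 and t == from_tag and tags[i - 1] == prev_tag else t
            for i, t in enumerate(tags)]

def errors(tags):
    """Step 2: count disagreements with the gold-standard tags."""
    return sum(t != g for t, (_, g) in zip(tags, gold))

# Steps 3-4: greedily pick the rule that removes the most errors, apply it, repeat
tagset = sorted({g for _, g in gold})
candidates = [(a, b, p) for a in tagset for b in tagset for p in tagset if a != b]
learned = []
while True:
    best = min(candidates, key=lambda r: errors(apply_rule(tags, r)))
    if errors(apply_rule(tags, best)) >= errors(tags):
        break  # no rule improves the tagging any further
    tags = apply_rule(tags, best)
    learned.append(best)

print("Learned rules:", learned)
print("Final tagging:", list(zip(words, tags)))
```

Each learned triple reads as a plain statement, for example `('MD', 'NN', 'DT')` means "change MD to NN after a determiner", which is the sense in which transformation rules stay inspectable.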

📌 The learned rules are **human-readable**, making this approach both **accurate and interpretable**.

### 🔧 Step 1: Environment Setup

In [15]:
!pip install -q sklearn-crfsuite

In [16]:
import nltk
from nltk.corpus import brown

nltk.download('brown')
nltk.download('punkt')

sentences = brown.tagged_sents(categories='news')


In [None]:
from nltk.tag import DefaultTagger, UnigramTagger

default_tagger = DefaultTagger('NN')
baseline_tagger = UnigramTagger(sentences, backoff=default_tagger)

In [None]:
sentence = "We can can the can .".split()
baseline_tagger.tag(sentence)

In [None]:
def word_features(sent, i):
    word = sent[i]
    features = {
        'word.lower()': word.lower(),
        'is_upper': word.isupper(),
        'is_title': word.istitle(),
        'is_digit': word.isdigit(),
    }
    if i > 0:
        features['prev_word'] = sent[i-1]
    else:
        features['BOS'] = True

    if i < len(sent) - 1:
        features['next_word'] = sent[i+1]
    else:
        features['EOS'] = True

    return features

In [None]:
def sent2features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [tag for _, tag in sent]

X = [sent2features([w for w, t in s]) for s in sentences]
y = [sent2labels(s) for s in sentences]

In [None]:
import sklearn_crfsuite  # installed above via pip

# Train on a small subset to keep runtime manageable
X_small, y_small = X[:500], y[:500]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=10,
    all_possible_transitions=False
)

crf.fit(X_small, y_small)

In [None]:
test_sentence = "We can can the can .".split()
features = sent2features(test_sentence)

list(zip(test_sentence, crf.predict_single(features)))

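### ⚠️ Practical Note on CRF Training

The cells above train a Conditional Random Field (CRF), a discriminative sequence model, rather than a Brill tagger; NLTK's own Brill trainer lives in `nltk.tag.brill_trainer`.
CRFs are powerful but computationally expensive models.
Training on large corpora with all possible tag transitions
can take a very long time.

✅ For teaching and experimentation: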
- Use a small subset of data
- Limit the number of iterations
- Disable unnecessary transitions

This preserves the learning behavior while keeping runtime manageable.

#### 🧠 Interpretation

- **Rule-Based Tagger**
  - Correctly handles this sentence due to carefully designed rules.
  - Performance is **fragile** and depends entirely on manual rule coverage.

- **Hidden Markov Model (HMM)**
  - Resolves ambiguity using **learned transition and emission probabilities**.
  - Makes **global sequence-level decisions**, which tends to make results more robust than word-by-word tagging.

- **Brill-Style (Transformation-Based) Tagger**
  - Starts from a weak baseline and **learns contextual corrections**.
  - Combines the **interpretability of rules** with **data-driven learning**.
  - Often outperforms unigram or bigram taggers when **training data is limited**.

---

#### ✅ Key Takeaway

The three methods approach this example in different ways:

- Rule-based tagging relies on **explicit linguistic intuition**
- HMM tagging relies on **probabilistic sequence modeling**
- Brill-style tagging bridges both worlds by **learning rules from data**

This comparison highlights why transformation-based methods remain an important conceptual bridge between symbolic and statistical NLP approaches.

## 📝 Playground: Ending Exercises

The following exercises encourage you to **apply, compare, and reflect** on the three PoS tagging approaches covered in this tutorial. Focus on **ambiguity**, **context**, and **model behavior** rather than just correctness.

---

### 🧪 Exercise 1: English Ambiguity Challenge  
**Sentence:**  
> *“Time flies like an arrow.”*

This sentence is famously ambiguous and can be interpreted in multiple ways.

#### ✅ Tasks
1. **Tokenize** the sentence.
2. Apply:
   - Rule-based tagging  
   - HMM tagging  
   - Brill-style (transformation-based) tagging
3. Record the PoS tags produced by each method.

#### 🧠 Guiding Questions
- Which word(s) show different PoS tags across methods?
- Does *flies* behave as a **noun** or a **verb**?
- Is *like* treated as a **verb**, **preposition**, or **conjunction**?
- Which approach best captures the intended reading?

#### 💡 Reflection
- Why is this sentence difficult to tag correctly without full syntactic analysis?
- How does sequence-level modeling help resolve ambiguity?

---

### 🧪 Exercise 2: Chinese Structural Ambiguity  
**Sentence:**  
> **我喜欢吃苹果的人。**  
> *(Pinyin: Wǒ xǐhuān chī píngguǒ de rén.)*

This sentence is a classic example used in **Chinese NLP** to test ambiguity resolution.

#### ✅ Tasks
1. Segment the sentence into words (use a Chinese tokenizer if available).
2. Assign PoS tags to each word.
3. Identify at least **two possible interpretations** of the sentence.

#### 🧠 Key Points to Consider
- The grammatical role of **“的”**
- Whether **“吃苹果”** modifies:
  - *我* (I like to eat apples), or
  - *人* (people who eat apples)
- How relative clauses are formed in Chinese

#### 💡 Reflection
- Why is **“的”** challenging for PoS tagging and parsing?
- What additional information (syntax, semantics, or context) would help disambiguate the sentence?
- Why do purely rule-based approaches struggle with this example?

---

### 🌟 Take-Home Insight

These exercises illustrate that:

- **PoS tagging alone is often insufficient** for full disambiguation
- Ambiguity exists at both **lexical** and **structural** levels
- Real-world NLP systems must integrate **context, syntax, and semantics**

✅ Congratulations on completing the tutorial!

In [None]:
!pip uninstall nltk -y
!pip uninstall sklearn-crfsuite -y
!pip uninstall jieba -y
!pip uninstall pkuseg -y

# Clear the pip cache
!pip cache purge

# Install all required libraries
!pip install nltk==3.8.1
!pip install numpy
!pip install jieba

print("Installation complete! Now restart the runtime...")

In [2]:
# First, reinstall nltk to fix any issues
!pip uninstall nltk -y -q
!pip install nltk==3.8.1 -q

import nltk
import numpy as np
from collections import defaultdict

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('brown', quiet=True)

print("=" * 60)
print("EXERCISE 1: English Ambiguity - 'Time flies like an arrow'")
print("=" * 60)

# Define the sentence
sentence = "Time flies like an arrow"
tokens = nltk.word_tokenize(sentence)
print(f"Original sentence: {sentence}")
print(f"Tokenized: {tokens}\n")

# ========== Method 1: Rule-based Tagging ==========
print("Method 1: Rule-based POS Tagging")
print("-" * 40)

# Define tagging rules
rule_based_tags = []

for i, word in enumerate(tokens):
    word_lower = word.lower()

    # Rule 1: First word "Time" is likely a noun
    if i == 0 and word_lower == "time":
        rule_based_tags.append((word, "NN"))  # Noun

    # Rule 2: "flies" could be verb or noun
    elif word_lower == "flies":
        # Check previous word
        if i > 0 and tokens[i-1].lower() == "time":
            # "Time flies" - flies is likely a verb
            rule_based_tags.append((word, "VBZ"))  # Verb 3rd person singular
        else:
            rule_based_tags.append((word, "NNS"))  # Noun plural

    # Rule 3: "like" could be preposition or verb
    elif word_lower == "like":
        # Check context
        if i > 0 and tokens[i-1].lower() in ["flies", "fly"]:
            rule_based_tags.append((word, "IN"))  # Preposition
        else:
            rule_based_tags.append((word, "VB"))  # Verb

    # Rule 4: "an" is a determiner
    elif word_lower == "an":
        rule_based_tags.append((word, "DT"))

    # Rule 5: "arrow" is a noun
    elif word_lower == "arrow":
        rule_based_tags.append((word, "NN"))

    # Default rule
    else:
        rule_based_tags.append((word, "UNK"))  # Unknown

print("Rule-based tagging results:")
for word, tag in rule_based_tags:
    print(f"  {word}: {tag}")

# ========== Method 2: NLTK Default Tagger ==========
print("\nMethod 2: NLTK Default Tagger (Averaged Perceptron)")
print("-" * 40)

# Use NLTK's pre-trained tagger (an averaged perceptron model,
# not a transformation-based one; tags follow the Penn Treebank set)
default_tags = nltk.pos_tag(tokens)
print("NLTK tagging results:")
for word, tag in default_tags:
    print(f"  {word}: {tag}")

# ========== Method 3: Statistical Tagging ==========
print("\nMethod 3: Statistical Tagging (Based on Brown Corpus)")
print("-" * 40)

from nltk.corpus import brown

# Count how often each word occurs with each tag in the Brown Corpus
# (keys are case-sensitive, so tokens are lowercased before lookup below)
word_tag_freq = nltk.ConditionalFreqDist(brown.tagged_words())

# Create statistical tagger
statistical_tags = []
for word in tokens:
    word_lower = word.lower()

    # If word exists in corpus, use most frequent tag
    if word_lower in word_tag_freq:
        most_common_tag = word_tag_freq[word_lower].max()
        statistical_tags.append((word, most_common_tag))
    else:
        # Fallback strategy
        if word_lower in ["time", "arrow"]:
            statistical_tags.append((word, "NN"))
        elif word_lower == "flies":
            statistical_tags.append((word, "NNS"))  # Default as plural noun
        elif word_lower == "like":
            statistical_tags.append((word, "IN"))  # Default as preposition
        elif word_lower == "an":
            statistical_tags.append((word, "AT"))  # Article tag in Brown Corpus
        else:
            statistical_tags.append((word, "UNK"))

print("Statistical tagging results:")
for word, tag in statistical_tags:
    print(f"  {word}: {tag}")

# ========== Analysis and Comparison ==========
print("\n" + "=" * 60)
print("ANALYSIS AND COMPARISON")
print("=" * 60)

# Convert to dictionaries for easy access
rule_dict = dict(rule_based_tags)
default_dict = dict(default_tags)
stat_dict = dict(statistical_tags)

print("\n1. Tagging differences for 'flies':")
print(f"   Rule-based: {rule_dict.get('flies', 'N/A')} - could be noun(NNS) or verb(VBZ)")
print(f"   NLTK: {default_dict.get('flies', 'N/A')}")
print(f"   Statistical: {stat_dict.get('flies', 'N/A')}")

print("\n2. Tagging differences for 'like':")
print(f"   Rule-based: {rule_dict.get('like', 'N/A')} - could be preposition(IN) or verb(VB)")
print(f"   NLTK: {default_dict.get('like', 'N/A')}")
print(f"   Statistical: {stat_dict.get('like', 'N/A')}")

print("\n3. Two possible interpretations of the sentence:")
print("   a) Time passes quickly like an arrow (Time [flies like an arrow])")
print("      Structure: Noun + Verb phrase")
print("      Tagging: Time(NN) flies(VBZ) like(IN) an(DT) arrow(NN)")
print("      Meaning: Time passes as quickly as an arrow flies")
print("\n   b) Time-flies (insects) enjoy an arrow ([Time flies] like an arrow)")
print("      Structure: Noun phrase + Verb + Noun phrase")
print("      Tagging: Time(NN) flies(NNS) like(VB) an(DT) arrow(NN)")
print("      Meaning: Insects called 'time flies' enjoy arrows")

print("\n4. Which method is best?")
print("   - Rule-based: Flexible but subjective")
print("   - NLTK: Based on extensive training data, usually most accurate")
print("   - Statistical: Based on corpus frequencies, needs sufficient data")



print("\n\n" + "=" * 60)
print("EXERCISE 2: Chinese Ambiguity - '我喜欢吃苹果的人'")
print("=" * 60)

# Install Chinese processing library
try:
    import jieba
    print("jieba is already installed")
except ImportError:
    print("Installing jieba...")
    !pip install jieba -q
    import jieba

# Chinese sentence
chinese_sentence = "我喜欢吃苹果的人"

print(f"Chinese sentence: {chinese_sentence}")
print(f"Pinyin: Wǒ xǐhuān chī píngguǒ de rén")
print(f"Literal translation: I like eat apple DE person")

# ========== Method 1: jieba Segmentation ==========
print("\nMethod 1: jieba Word Segmentation")
print("-" * 40)

# Use jieba for word segmentation
words = jieba.lcut(chinese_sentence)
print(f"Segmentation results: {words}")

# Manual POS tagging for jieba results
jieba_pos = []
for word in words:
    if word == "我":
        jieba_pos.append((word, "r"))  # Pronoun
    elif word == "喜欢":
        jieba_pos.append((word, "v"))  # Verb
    elif word == "吃":
        jieba_pos.append((word, "v"))  # Verb
    elif word == "苹果":
        jieba_pos.append((word, "n"))  # Noun
    elif word == "的":
        jieba_pos.append((word, "uj"))  # Auxiliary
    elif word == "人":
        jieba_pos.append((word, "n"))  # Noun
    else:
        jieba_pos.append((word, "x"))  # Unknown

print("POS tagging results:")
for word, tag in jieba_pos:
    print(f"  {word}: {tag}")

print("\nTag meanings:")
print("  r - pronoun")
print("  v - verb")
print("  n - noun")
print("  uj - auxiliary particle")

# ========== Method 2: Rule-based Analysis ==========
print("\nMethod 2: Rule-based Segmentation and POS Tagging")
print("-" * 40)

# Three possible segmentation methods
segmentations = [
    # Segmentation 1: 我 / 喜欢 / 吃苹果 / 的 / 人
    ["我", "喜欢", "吃苹果", "的", "人"],
    # Segmentation 2: 我喜欢 / 吃 / 苹果 / 的 / 人
    ["我喜欢", "吃", "苹果", "的", "人"],
    # Segmentation 3: 我 / 喜欢 / 吃苹果的人
    ["我", "喜欢", "吃苹果的人"]
]

print("Three possible segmentation methods:")
for i, seg in enumerate(segmentations, 1):
    print(f"  Method {i}: {' | '.join(seg)}")

# Analyze first segmentation method
selected_seg = segmentations[0]
rule_pos_chinese = []

# Assign POS tags
for word in selected_seg:
    if word == "我":
        rule_pos_chinese.append((word, "PRON"))
    elif word == "喜欢":
        rule_pos_chinese.append((word, "VERB"))
    elif word == "吃苹果":
        rule_pos_chinese.append((word, "VP"))  # Verb phrase
    elif word == "的":
        rule_pos_chinese.append((word, "PART"))  # Particle
    elif word == "人":
        rule_pos_chinese.append((word, "NOUN"))
    else:
        rule_pos_chinese.append((word, "UNK"))

print("\nRule-based POS tagging (using segmentation method 1):")
for word, tag in rule_pos_chinese:
    print(f"  {word}: {tag}")

# ========== Ambiguity Analysis ==========
print("\n" + "=" * 60)
print("AMBIGUITY ANALYSIS")
print("=" * 60)

print("\nTwo main interpretations:")
print("\n1. Interpretation A: I like [people who eat apples]")
print("   Structure: I + like + [people who eat apples]")
print("   Meaning: I like people who eat apples")
print("   Grammatical analysis:")
print("     - '吃苹果的' modifies '人' (forms noun phrase)")
print("     - '的' functions as a structural particle")
print("     - '吃苹果的人' is the object of '喜欢'")

print("\n2. Interpretation B: I like to eat apples")
print("   Structure: I + like + eat + apples")
print("   Meaning: I like eating apples ('的人' might be redundant)")
print("   Grammatical analysis:")
print("     - '吃苹果' is the object of '喜欢'")
print("     - '的' might indicate emphasis or colloquial expression")

print("\n3. Interpretation C: [I like eating apples]'s person (less common)")
print("   Structure: [I like eating apples] + 的 + person")
print("   Meaning: The person who likes eating apples")
print("   Grammatical analysis:")
print("     - '我喜欢吃苹果' modifies '人' as a relative clause")

print("\nKey ambiguity points:")
print("   1. Grammatical role of '的':")
print("      - Structural particle: connects modifier and head noun")
print("      - Might indicate possession or modification")
print("      - Could be a modal particle in some dialects")
print("\n   2. Syntactic function of '吃苹果':")
print("      - As modifier: modifies '人'")
print("      - As object: direct object of '喜欢'")
print("\n   3. Structural hierarchy:")
print("      - (I like (people who eat apples))")
print("      - ((I like eating apples)'s person)")

print("\nDisambiguation strategies:")
print("   1. Context: surrounding sentences")
print("   2. Intonation/stress: in spoken language")
print("   3. Syntactic parsing: full parse tree analysis")
print("   4. Semantic analysis: agent and patient roles of '喜欢'")
print("   5. Pragmatic factors: discourse context and speaker intent")

# ========== Summary and Comparison ==========
print("\n" + "=" * 60)
print("SUMMARY AND COMPARISON")
print("=" * 60)

print("\nEnglish sentence 'Time flies like an arrow':")
print("  - Ambiguity type: Structural + Lexical")
print("  - Key points: 'flies'(verb/noun), 'like'(preposition/verb)")
print("  - Method performance:")
print("     Rule-based: Flexible but requires manual rules")
print("     NLTK: Usually most accurate, based on large training data")
print("     Statistical: Needs sufficient corpus data")

print("\nChinese sentence '我喜欢吃苹果的人':")
print("  - Ambiguity type: Structural + Elliptical")
print("  - Key points: Function of '的', structural hierarchy")
print("  - Method performance:")
print("     jieba: Automatic segmentation but limited POS tagging")
print("     Rule-based: Requires linguistic knowledge")

print("\nGeneral conclusions:")
print("  1. POS tagging is fundamental but insufficient for full disambiguation")
print("  2. Structural ambiguity requires syntactic analysis")
print("  3. Statistical methods generally outperform pure rule-based methods")
print("  4. Chinese adds a word-segmentation step before tagging, which introduces extra ambiguity")
print("  5. Context is crucial for disambiguation")

# ========== Additional Tests ==========
print("\n" + "=" * 60)
print("ADDITIONAL TEST SENTENCES")
print("=" * 60)

test_sentences = [
    "We can can the can",  # Example from tutorial
    "Fruit flies like a banana",  # Classic ambiguity
    "The old man the boat"  # Garden path sentence
]

for i, sent in enumerate(test_sentences, 1):
    print(f"\n{i}. Test sentence: {sent}")
    tokens = nltk.word_tokenize(sent)
    tags = nltk.pos_tag(tokens)
    print(f"   Tokens: {tokens}")
    print(f"   POS tags: {tags}")

    # Simple explanation
    if i == 1:
        print("   Ambiguity: 'can' can be modal verb(MD), verb(VB), or noun(NN)")
    elif i == 2:
        print("   Ambiguity: 'flies' can be noun(fruit flies) or verb(flies)")
    elif i == 3:
        print("   Ambiguity: 'man' can be noun(person) or verb(to operate)")

print("\n" + "=" * 60)
print("CODE EXECUTION COMPLETE!")
print("=" * 60)

# Final demonstration with tutorial example
print("\n\n" + "=" * 60)
print("TUTORIAL EXAMPLE: 'We can can the can'")
print("=" * 60)

tutorial_sentence = "We can can the can"
tutorial_tokens = nltk.word_tokenize(tutorial_sentence)
tutorial_tags = nltk.pos_tag(tutorial_tokens)

print(f"Sentence: {tutorial_sentence}")
print(f"Tokens: {tutorial_tokens}")
print(f"POS tags: {tutorial_tags}")
print("\nExplanation of 'can' occurrences:")
print("  1. 'can' (position 1): Modal verb (MD) - indicates ability")
print("  2. 'can' (position 2): Verb (VB) - means to preserve/containerize")
print("  3. 'can' (position 4): Noun (NN) - refers to a container")

print("\nThis demonstrates lexical ambiguity:")
print("  - Same word form 'can' has three different POS tags")
print("  - Context determines the correct interpretation")
print("  - Statistical models learn these patterns from training data")

EXERCISE 1: English Ambiguity - 'Time flies like an arrow'
Original sentence: Time flies like an arrow
Tokenized: ['Time', 'flies', 'like', 'an', 'arrow']

Method 1: Rule-based POS Tagging
----------------------------------------
Rule-based tagging results:
  Time: NN
  flies: VBZ
  like: IN
  an: DT
  arrow: NN

Method 2: NLTK Default Tagger (Averaged Perceptron)
----------------------------------------
NLTK tagging results:
  Time: NNP
  flies: NNS
  like: IN
  an: DT
  arrow: NN

Method 3: Statistical Tagging (Based on Brown Corpus)
----------------------------------------
Statistical tagging results:
  Time: NN
  flies: NNS
  like: CS
  an: AT
  arrow: NN

ANALYSIS AND COMPARISON

1. Tagging differences for 'flies


jieba is already installed
Chinese sentence: 我喜欢吃苹果的人
Pinyin: Wǒ xǐhuān chī píngguǒ de rén
Literal translation: I like eat apple DE person

Method 1: jieba Word Segmentation
----------------------------------------




Segmentation results: ['我', '喜欢', '吃', '苹果', '的', '人']
POS tagging results:
  我: r
  喜欢: v
  吃: v
  苹果: n
  的: uj
  人: n

Tag meanings:
  r - pronoun
  v - verb
  n - noun
  uj - auxiliary particle

Method 2: Rule-based Segmentation and POS Tagging
----------------------------------------
Three possible segmentation methods:
  Method 1: 我 | 喜欢 | 吃苹果 | 的 | 人
  Method 2: 我喜欢 | 吃 | 苹果 | 的 | 人
  Method 3: 我 | 喜欢 | 吃苹果的人

Rule-based POS tagging (using segmentation method 1):
  我: PRON
  喜欢: VERB
  吃苹果: VP
  的: PART
  人: NOUN

AMBIGUITY ANALYSIS

Two main interpretations:

1. Interpretation A: I like [people who eat apples]
   Structure: I + like + [people who eat apples]
   Meaning: I like people who eat apples
   Grammatical analysis:
     - '吃苹果的' modifies '人' (forms noun phrase)
     - '的' functions as a structural particle
     - '吃苹果的人' is the object of '喜欢'

2. Int