# I. Rule-based and Statistical Approaches for Part-of-Speech Tagging

Part-of-speech (POS) tagging is the process of assigning a grammatical tag to each word in a sentence. Each tag represents the word's syntactic category, such as noun, verb, adjective, or adverb. POS tagging is an essential step in many Natural Language Processing (NLP) tasks, including parsing, machine translation, and information retrieval.

POS tagging can be approached using different techniques, including rule-based approaches, statistical approaches, and hybrid approaches that combine both. In statistical approaches, Hidden Markov Models (HMMs) and Maximum Entropy Markov Models (MEMMs) are commonly used.
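To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding over a toy three-tag model. The probability tables, the `UNSEEN` floor, and the function name `viterbi` are all invented for illustration; none of the numbers are estimated from a corpus.

```python
import math

# Hypothetical toy HMM: every number below is invented for illustration.
TAGS = ["DT", "NN", "VBZ"]
START = {"DT": 0.8, "NN": 0.15, "VBZ": 0.05}
TRANS = {
    "DT": {"DT": 0.01, "NN": 0.94, "VBZ": 0.05},
    "NN": {"DT": 0.05, "NN": 0.25, "VBZ": 0.70},
    "VBZ": {"DT": 0.60, "NN": 0.30, "VBZ": 0.10},
}
EMIT = {
    "DT": {"the": 0.9},
    "NN": {"cat": 0.4, "mat": 0.3, "sits": 0.01},
    "VBZ": {"sits": 0.5},
}
UNSEEN = 1e-6  # crude floor for word/tag pairs missing from EMIT


def viterbi(words):
    """Return the most probable tag sequence under the toy HMM (log space)."""
    def emit(tag, word):
        return math.log(EMIT[tag].get(word.lower(), UNSEEN))

    # best[i][tag] = (log-probability of the best path ending in tag at i, previous tag)
    best = [{t: (math.log(START[t]) + emit(t, words[0]), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            score, prev = max(
                (best[-1][p][0] + math.log(TRANS[p][tag]), p) for p in TAGS
            )
            column[tag] = (score + emit(tag, word), prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(zip(words, reversed(path)))


print(viterbi(["The", "cat", "sits"]))
```

A real HMM tagger would estimate the start, transition, and emission tables from a tagged corpus and use proper smoothing for unseen words.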

Implement a rule-based part-of-speech (POS) tagger:
* a. Write a set of rules to assign POS tags to words based on their context.
* b. Apply the rules to a sample text and evaluate the accuracy of the tagger.
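
Steps a and b can be sketched without any corpus: a few word and suffix rules plus a token-level accuracy check against hand-labeled tags. The rule list, the gold tags, and the helper names `tag_tokens` and `accuracy` are illustrative assumptions, not part of NLTK.

```python
import re

# Hypothetical mini rule set: exact words first, then suffix heuristics.
RULES = [
    (re.compile(r"(the|a|an)", re.I), "DT"),
    (re.compile(r"(on|in|at)", re.I), "IN"),
    (re.compile(r"is", re.I), "VBZ"),
    (re.compile(r"\w+ing"), "VBG"),
    (re.compile(r"\w+ly"), "RB"),
]


def tag_tokens(tokens, default="NN"):
    """Tag each token with the first rule that matches the whole token."""
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:
            if pattern.fullmatch(token):
                tagged.append((token, tag))
                break
        else:
            tagged.append((token, default))  # no rule matched: use the default tag
    return tagged


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(p_tag == g_tag for (_, p_tag), (_, g_tag) in zip(predicted, gold))
    return correct / len(gold)


# Hand-labeled reference tags for the sample sentence (Penn Treebank style).
gold = [("The", "DT"), ("cat", "NN"), ("is", "VBZ"), ("sitting", "VBG"),
        ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")]
predicted = tag_tokens([word for word, _ in gold])
print(predicted)
print(f"Accuracy: {accuracy(predicted, gold):.3f}")
```

On this sentence only the final period is mistagged, because no rule covers punctuation and the default tag is `NN`.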



Implement a statistical POS tagger trained on a labeled corpus:
* a. Train a statistical POS tagger on a labeled corpus using a machine learning algorithm such as Naive Bayes or Maximum Entropy.
* b. Apply the trained model to tag a sample text and evaluate its accuracy.
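
As a warm-up for the train-then-evaluate loop, here is a minimal sketch of a most-frequent-tag (unigram) baseline. It is a simpler technique than Naive Bayes or Maximum Entropy, and the tiny `TRAIN` and `TEST` corpora and the helper names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labeled corpus; a real tagger would train on thousands of sentences.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
TEST = [
    [("a", "DT"), ("cat", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("bird", "NN"), ("sings", "VBZ")],
]


def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default="NN"):
    """Look up each word; unseen words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


def evaluate(model, tagged_sents):
    """Token-level accuracy on held-out tagged sentences."""
    total = correct = 0
    for sent in tagged_sents:
        predicted = tag(model, [w for w, _ in sent])
        correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
        total += len(sent)
    return correct / total


model = train_unigram(TRAIN)
print(tag(model, ["The", "cat", "barks"]))
print(f"Held-out accuracy: {evaluate(model, TEST):.3f}")
```

The unseen verb "sings" falls back to `NN`, which is the kind of error that context features or a backoff chain are meant to reduce.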





In [None]:
import nltk
import re
# Newer NLTK releases (3.9+) may instead need 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('treebank')
nltk.download('averaged_perceptron_tagger')
# maxent_ne_chunker: pre-trained English named-entity chunkers trained on an ACE corpus
# (possibly the ACE 2004 Multilingual Training Corpus)
nltk.download('maxent_ne_chunker')

In [None]:
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger #important for POS tagging


# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging

# Rule-based POS Tagger
def rule_based_pos_tagger(sentence):
    # Each rule pairs a compiled regex with a Penn Treebank tag.
    # Rules are tried in order; the first match wins.
    rules = [
        (re.compile(r'\bThe\b'), 'DT'),
        (re.compile(r'\bcat\b'), 'NN'),
        (re.compile(r'\bis\b'), 'VBZ'),       # 3rd person singular present
        (re.compile(r'\bsitting\b'), 'VBG'),  # present participle
        (re.compile(r'\bon\b'), 'IN'),
        (re.compile(r'\bthe\b'), 'DT'),
        (re.compile(r'\bmat\b'), 'NN'),
    ]
    tagged_sentence = []
    words = word_tokenize(sentence)
    for word in words:
        for pattern, tag in rules:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                break
        else:
            # No rule matched (for example, the final period)
            tagged_sentence.append((word, 'UNKNOWN'))
    return tagged_sentence

# Statistical POS Tagger
def statistical_pos_tagger(sentence):
    # Note: the taggers are retrained on every call; train once outside
    # the function if you tag many sentences.
    tagged_sents = treebank.tagged_sents()[:3000]

    # Split data into training and testing sets (80/20)
    train_size = int(len(tagged_sents) * 0.8)
    train_set = tagged_sents[:train_size]
    test_set = tagged_sents[train_size:]

    # Create a backoff chain: bigram -> unigram -> default
    default_tagger = DefaultTagger('NN')  # Assigns 'NN' to every word
    unigram_tagger = UnigramTagger(train_set, backoff=default_tagger)
    bigram_tagger = BigramTagger(train_set, backoff=unigram_tagger)

    # Evaluate on the held-out test set
    # (older NLTK releases name this method evaluate() instead of accuracy())
    accuracy = bigram_tagger.accuracy(test_set)
    print("Accuracy:", accuracy)

    # Apply the trained model to tag the sentence
    return bigram_tagger.tag(word_tokenize(sentence))

In [None]:
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging
sample_sentence = "The cat is sitting on the mat."

# Rule-based POS Tagging
rule_based_tags = rule_based_pos_tagger(sample_sentence)
print("Rule-based POS Tags:")
print(rule_based_tags)

# Statistical POS Tagging
statistical_tags = statistical_pos_tagger(sample_sentence)
print("Statistical POS Tags:")
print(statistical_tags)


Additionally, NLTK has a built-in function called `pos_tag`, which uses a pre-trained tagger.
See the example below.

In [None]:
sample_sentence = "The cat is sitting on the mat."

tagged_sentence = nltk.pos_tag(word_tokenize(sample_sentence))
print(tagged_sentence)

### Exercise

Update the rule-based tagger with patterns using regex. An example could be:

```
(r'\b\w+s\b|\b\w+es\b', 'NN'),     # Nouns ending in -s or -es
```

From here, provide an updated rule-based tagger and a statistical tagger that can assign part-of-speech tags to the following complex sentence:

```
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."
```

In [1]:
import re
from nltk.tokenize import word_tokenize

def rule_based_pos_tagger(sentence):
    # Rules are tried in order and the first match wins, so the closed-class
    # word lists come before the broad suffix heuristics. Otherwise "this",
    # "his", and "its" would hit the plural-noun rule and "her" the -er rule.
    rules = [
        (re.compile(r'\b(and|or|but|nor|yet|so)\b', re.I), 'CC'),  # Coordinating conjunction
        (re.compile(r'\b(in|on|at|by|for|with|about|against)\b', re.I), 'IN'),  # Preposition
        (re.compile(r'\b(is|am|are|was|were|be|being|been)\b', re.I), 'VB'),  # Be verbs
        (re.compile(r'\b(has|have|had)\b', re.I), 'HV'),  # Have verbs (Brown-style tag)
        (re.compile(r'\b(do|does|did)\b', re.I), 'DO'),  # Do verbs (Brown-style tag)
        (re.compile(r'\b(I|me|you|he|she|it|we|they)\b', re.I), 'PRP'),  # Personal pronoun
        (re.compile(r'\b(my|your|his|her|its|our|their)\b', re.I), 'PRP$'),  # Possessive pronoun
        (re.compile(r'\b(a|an|the)\b', re.I), 'DT'),  # Article
        (re.compile(r'\b(this|that|these|those)\b', re.I), 'DT'),  # Demonstrative determiner
        # Suffix heuristics: cheap guesses that misfire on words such as "over" or "jumps"
        (re.compile(r'\b\w+ing\b', re.I), 'VBG'),  # Gerund or present participle
        (re.compile(r'\b\w+ed\b', re.I), 'VBD'),  # Past tense verb
        (re.compile(r'\b\w+es\b', re.I), 'VBZ'),  # 3rd person singular present verb
        (re.compile(r'\b\w+s\b', re.I), 'NNS'),  # Plural noun
        (re.compile(r'\b\w+ly\b', re.I), 'RB'),  # Adverb
        (re.compile(r'\b\w+er\b|\b\w+est\b', re.I), 'JJR'),  # Comparative or superlative adjective
        # Adjective suffixes, checked after -ly so adverbs are not tagged JJ
        (re.compile(r'\b\w+y\b|\b\w+ful\b|\b\w+ous\b|\b\w+ive\b|\b\w+ic\b', re.I), 'JJ'),
    ]

    # Tokenize the sentence
    words = word_tokenize(sentence)
    tagged_sentence = []

    for word in words:
        for pattern, tag in rules:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                break
        else:
            # No rule matched this token
            tagged_sentence.append((word, 'UNKNOWN'))

    return tagged_sentence


In [2]:
import typing
import nltk
from nltk.corpus import treebank
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Ensure necessary NLTK datasets are downloaded
nltk.download('treebank')
nltk.download('universal_tagset')


class StatisticalPOSTagger:

    def __init__(self) -> None:

        self.vec = DictVectorizer()
        self.encoder = LabelEncoder()
        self.nb = MultinomialNB()

        # Get the Treebank dataset
        tagged_sentences = treebank.tagged_sents(tagset='universal')
        X, y = [], []
        for tagged in tagged_sentences:
            untagged = [w for w, _ in tagged]
            for index in range(len(tagged)):
                X.append(self.__features(untagged, index))
                y.append(tagged[index][1])

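        # Split the dataset at the token level. Tokens from the same sentence can
        # land in both sets, so this accuracy is likely optimistic compared with
        # a sentence-level split.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)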
        # Train the model
        self._fit(X_train, y_train)
        # Report the Test accuracy
        print(f'Test Accuracy: {self._eval(X_test, y_test)}')

    def __call__(self, sentence: str):
        X = []
        words = sentence.split()
        for index in range(len(words)):
            X.append(self.__features(words, index))
        X = self.vec.transform(X)
        y = self.nb.predict(X)
        y = self.encoder.inverse_transform(y)
        return [(word, label) for word, label in zip(words, y)]

    def _fit(self, X, y):
        X = self.vec.fit_transform(X)
        y = self.encoder.fit_transform(y)
        self.nb.fit(X, y)

    def _eval(self, X: typing.List[typing.Dict], y: typing.List[str]):
        X = self.vec.transform(X)
        y_true = self.encoder.transform(y)
        y_pred = self.nb.predict(X)
        return accuracy_score(y_true, y_pred)

    @staticmethod
    def __features(sentence, index):
        return {
            'word': sentence[index],
            'is_first': index == 0,
            'is_last': index == len(sentence) - 1,
            'is_capitalized': sentence[index][0].upper() == sentence[index][0],
            'is_all_caps': sentence[index].upper() == sentence[index],
            'is_all_lower': sentence[index].lower() == sentence[index],
            'prefix-1': sentence[index][0],
            'prefix-2': sentence[index][:2],
            'prefix-3': sentence[index][:3],
            'suffix-1': sentence[index][-1],
            'suffix-2': sentence[index][-2:],
            'suffix-3': sentence[index][-3:],
            'prev_word': '' if index == 0 else sentence[index - 1],
            'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
            'has_hyphen': '-' in sentence[index],
            'is_numeric': sentence[index].isdigit(),
        }
 


In [3]:
#show printed output below
sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."

In [4]:
rule_based_pos_tagger(sentence)

[('The', 'DT'),
 ('quick', 'UNKNOWN'),
 ('brown', 'UNKNOWN'),
 ('fox', 'UNKNOWN'),
 ('jumps', 'NNS'),
 ('over', 'JJR'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'UNKNOWN'),
 ('while', 'UNKNOWN'),
 ('it', 'PRP'),
 ("'s", 'UNKNOWN'),
 ('raining', 'VBG'),
 ('heavily', 'RB'),
 ('.', 'UNKNOWN')]

In [5]:
statistical_pos_tagger = StatisticalPOSTagger()
statistical_pos_tagger(sentence)

Test Accuracy: 0.9502880413190306


[('The', 'DET'),
 ('quick', 'NOUN'),
 ('brown', 'ADJ'),
 ('fox', 'NOUN'),
 ('jumps', 'NOUN'),
 ('over', 'ADP'),
 ('the', 'DET'),
 ('lazy', 'ADJ'),
 ('dog', 'VERB'),
 ('while', 'ADP'),
 ("it's", 'PRON'),
 ('raining', 'VERB'),
 ('heavily.', 'NOUN')]
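
The statistical output keeps `it's` and `heavily.` as single tokens because `__call__` splits on whitespace, whereas the Treebank training data separates clitics and punctuation. A minimal sketch of the difference follows. The regex tokenizer `simple_tokenize` is a hypothetical stand-in, not NLTK's `word_tokenize`, though on this sentence it produces the same tokens that appear in the rule-based output above.

```python
import re

# Hypothetical stand-in tokenizer: clitics such as 's, then words, then single punctuation marks.
TOKEN_RE = re.compile(r"'\w+|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_RE.findall(text)


sentence = "The quick brown fox jumps over the lazy dog while it's raining heavily."

# Whitespace splitting leaves the clitic and the period attached.
print(sentence.split()[-3:])

# Regex tokenization separates them, closer to Treebank conventions.
print(simple_tokenize(sentence)[-5:])
```

Feeding the tagger tokens that match its training conventions would likely change some of the tags above, since features such as `suffix-1` currently see the period instead of the final letter.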

---