
## Rule-based and Statistical Approaches for Part-of-Speech Tagging

### Introduction:

Part-of-Speech (POS) tagging assigns grammatical tags to words in a sentence, identifying their syntactic categories such as nouns, verbs, adjectives, etc. It's essential for various NLP tasks like parsing and machine translation. There are two main approaches to POS tagging: rule-based methods, which use predefined rules to tag words, and statistical methods, which use machine learning models like Hidden Markov Models (HMMs) or Maximum Entropy Markov Models (MEMMs).

In this assignment, we will:

- Implement a rule-based POS tagger by writing rules to assign tags based on word context and evaluate its accuracy.
- Develop a statistical POS tagger by training a machine learning model on a labeled corpus and assess its performance.


In [1]:
import nltk
import re
nltk.download('punkt')
nltk.download('treebank')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')  # Pre-trained English named entity chunkers trained on an ACE corpus (possibly the ACE 2004 Multilingual Training Corpus)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

#### Comment:
We have downloaded the NLTK data packages needed for the tasks in this notebook:
- Tokenization: Punkt tokenizer (punkt).
- Part-of-Speech Tagging: Averaged perceptron tagger (averaged_perceptron_tagger).
- Training and evaluating models: Penn Treebank sample corpus (treebank).
- Named Entity Recognition: Maximum entropy named entity chunker (maxent_ne_chunker).

In [2]:
# Import the libraries
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger  # Taggers used for the statistical approach

#### Comment:
We have imported the following from NLTK:
- POS Tagging (pos_tag): Assigns POS tags to the words of a tokenized sentence.
- Named Entity Recognition (ne_chunk): Recognizes and categorizes named entities in POS-tagged text.
- Tokenization (word_tokenize): Splits text into individual words or tokens.
- Penn Treebank Corpus (treebank): Provides access to NLTK's sample of the Penn Treebank, a widely used corpus of English text with detailed annotations.
- Taggers (DefaultTagger, UnigramTagger, BigramTagger): Assign POS tags automatically. DefaultTagger gives every word the same tag, while UnigramTagger and BigramTagger learn tag frequencies from a tagged corpus.

In [3]:
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging

# Rule-based POS Tagger
def rule_based_pos_tagger(sentence):
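    # Each rule pairs a compiled regex with the tag to assign on a match.
    # Note: 'VB' is used here as a coarse verb tag; the Penn Treebank tags
    # for 'is' and 'sitting' would be 'VBZ' and 'VBG'.
    rules = [
        (re.compile(r'\bThe\b'), 'DT'),
        (re.compile(r'\bcat\b'), 'NN'),
        (re.compile(r'\bis\b'), 'VB'),
        (re.compile(r'\bsitting\b'), 'VB'),
        (re.compile(r'\bon\b'), 'IN'),
        (re.compile(r'\bthe\b'), 'DT'),
        (re.compile(r'\bmat\b'), 'NN'),
    ]
    tagged_sentence = []
    words = word_tokenize(sentence)
    for word in words:
        # The first rule that matches the token decides its tag
        for pattern, tag in rules:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                break
        else:
            # No rule matched (the for-else branch runs only without a break)
            tagged_sentence.append((word, 'UNKNOWN'))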
    return tagged_sentence


#### Comment:
- First, we defined the function rule_based_pos_tagger.
- The rules are:
- (re.compile(r'\bThe\b'), 'DT') - Matches the word "The" and assigns the tag 'DT' (determiner).
- (re.compile(r'\bcat\b'), 'NN') - Matches the word "cat" and assigns the tag 'NN' (noun).
- (re.compile(r'\bis\b'), 'VB') - Matches the word "is" and assigns the tag 'VB' (verb). The more specific Penn Treebank tag would be 'VBZ'.
- (re.compile(r'\bsitting\b'), 'VB') - Matches the word "sitting" and assigns the tag 'VB' (verb). The more specific Penn Treebank tag would be 'VBG'.
- (re.compile(r'\bon\b'), 'IN') - Matches the word "on" and assigns the tag 'IN' (preposition).
- (re.compile(r'\bthe\b'), 'DT') - Matches the word "the" and assigns the tag 'DT' (determiner).
- (re.compile(r'\bmat\b'), 'NN') - Matches the word "mat" and assigns the tag 'NN' (noun).
- Next, we created an empty list to store the (token, tag) pairs.
- We tokenized the sentence with word_tokenize.
- For each token, the loop tries the rules in order. The first matching rule assigns its tag; if no rule matches, the token is tagged 'UNKNOWN'.
- Finally, we returned the list.
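
One detail of the matching loop is worth checking: `pattern.match` only anchors at the start of a token, so a rule such as `\bon\b` also fires on a longer token that merely begins with that word. The sketch below is not part of the assignment code; the helper `tag_token` and the token "on-site" are made up for illustration. It compares `match` with `fullmatch` on a reduced rule list.

```python
import re

# Two of the rules from the tagger above, reused for illustration
RULES = [
    (re.compile(r'\bon\b'), 'IN'),
    (re.compile(r'\bmat\b'), 'NN'),
]

def tag_token(token, use_fullmatch):
    """Return the tag of the first rule that matches, else 'UNKNOWN'."""
    for pattern, tag in RULES:
        # match() anchors only at the start; fullmatch() requires the whole token
        hit = pattern.fullmatch(token) if use_fullmatch else pattern.match(token)
        if hit:
            return tag
    return 'UNKNOWN'

print(tag_token('on', use_fullmatch=False))       # IN
print(tag_token('on-site', use_fullmatch=False))  # IN, because 'on' matches at the start
print(tag_token('on-site', use_fullmatch=True))   # UNKNOWN
```

Whether `fullmatch` is preferable depends on how the tokenizer splits the text; for single-word rules like the ones above it is usually the stricter and safer choice.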

In [4]:
# Statistical POS Tagger
def statistical_pos_tagger(sentence):
    # Note: the taggers are retrained on every call to this function.
    # Load the first 3000 tagged sentences from the treebank corpus
    train_data = treebank.tagged_sents()[:3000]

    # Split data into training (80%) and testing (20%) sets
    train_size = int(len(train_data) * 0.8)
    train_set = train_data[:train_size]
    test_set = train_data[train_size:]

    # Build a backoff chain: bigram -> unigram -> default
    default_tagger = DefaultTagger('NN')  # Assigns 'NN' to every word
    unigram_tagger = UnigramTagger(train_set, backoff=default_tagger)  # Most frequent tag per word
    bigram_tagger = BigramTagger(train_set, backoff=unigram_tagger)  # Adds the previous tag as context

    # Evaluate on the held-out test set
    accuracy = bigram_tagger.accuracy(test_set)
    print("Accuracy:", accuracy)

    # Apply the trained model to tag the sentence
    tagged_sentence = bigram_tagger.tag(word_tokenize(sentence))
    return tagged_sentence

#### Comment:
- The statistical_pos_tagger function demonstrates a statistical approach to POS tagging using NLTK's taggers (DefaultTagger, UnigramTagger, BigramTagger).
- First, we retrieved the first 3000 tagged sentences from the treebank corpus.
- Then, we split the data into training (80%) and testing (20%) sets.
- DefaultTagger assigns 'NN' to every word and serves as the last fallback. UnigramTagger learns the most frequent tag of each word from the training set and backs off to DefaultTagger for words it has not seen.
- BigramTagger additionally uses the previous token's tag as context and backs off to UnigramTagger when it has not seen that context.
- We calculated the accuracy of the bigram tagger on the test set.
- We applied the trained tagger to the input sentence and returned the result.
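
To make the backoff idea concrete, here is a minimal pure-Python sketch of a bigram -> unigram -> default chain. It is a simplification for illustration, not NLTK's implementation; the function names (`train_unigram`, `train_bigram`, `tag_with_backoff`) and the toy training data are assumptions introduced here.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # No previous tag at the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_with_backoff(words, bigram, unigram, default="NN"):
    """Try the bigram table first, then the unigram table, then the default tag."""
    tagged, prev = [], None
    for word in words:
        tag = bigram.get((prev, word)) or unigram.get(word) or default
        tagged.append((word, tag))
        prev = tag
    return tagged

# Hypothetical toy corpus standing in for the treebank training set
toy = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
bigram = train_bigram(toy)
unigram = train_unigram(toy)

print(tag_with_backoff(["the", "cat", "barks"], bigram, unigram))
# "fox" is unseen, so it falls through to the default tag 'NN'
print(tag_with_backoff(["the", "fox", "barks"], bigram, unigram))
```

The second call shows why unseen words end up tagged 'NN' in a chain like this: neither table has an entry for them, so the default tag is the only remaining answer.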

In [5]:
# Part 1: Rule-based and Statistical Approaches for Part-of-Speech Tagging
sample_sentence = "The cat is sitting on the mat."

# Rule-based POS Tagging
rule_based_tags = rule_based_pos_tagger(sample_sentence)
print("Rule-based POS Tags:")
print(rule_based_tags)

# Statistical POS Tagging
statistical_tags = statistical_pos_tagger(sample_sentence)
print("Statistical POS Tags:")
print(statistical_tags)

Rule-based POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VB'), ('sitting', 'VB'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', 'UNKNOWN')]
Accuracy: 0.8748033560566335
Statistical POS Tags:
[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]


#### Comment:
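- Here we applied both the rule-based and the statistical tagger to the sample sentence and printed the results. The rule-based tagger labels both verbs 'VB' and leaves the full stop as 'UNKNOWN', while the statistical tagger assigns 'VBZ', 'VBG', and '.'.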
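
Tagging accuracy is the share of tokens whose predicted tag equals the reference tag. As a rough sketch, the helper below (the name `token_accuracy` is hypothetical) scores the rule-based output against the statistical output printed above. Treating the statistical output as the reference is an assumption made purely for illustration, since neither output is gold-standard data.

```python
def token_accuracy(predicted, reference):
    """Fraction of positions where the predicted tag equals the reference tag."""
    if len(predicted) != len(reference):
        raise ValueError("Both taggings must cover the same tokens")
    matches = sum(p_tag == r_tag
                  for (_, p_tag), (_, r_tag) in zip(predicted, reference))
    return matches / len(reference)

# Outputs copied from the cell above
rule_based = [('The', 'DT'), ('cat', 'NN'), ('is', 'VB'), ('sitting', 'VB'),
              ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', 'UNKNOWN')]
statistical = [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'),
               ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]

print(token_accuracy(rule_based, statistical))  # 0.625 (5 of 8 tags agree)
```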

Additionally, NLTK has a built-in function called `pos_tag`, which uses the pretrained averaged perceptron tagger downloaded earlier. See the example below.

In [6]:
sample_sentence = "The cat is sitting on the mat."

tagged_sentence = nltk.pos_tag(word_tokenize(sample_sentence))
print(tagged_sentence)

[('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('sitting', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]


In [7]:
# sent = "The quick brown fox jumps over the lazy dog while it's raining heavily."
# Function with updated rule based POS tagger
def updated_pos_tag_regex(sentence):
    patterns = [
        # Rules are tried in order, so specific words come before generic suffixes
        (re.compile(r'\b(The|the|an|An|a|A)\b'), 'DT'),  # Determiners
        (re.compile(r'\b(quick|brown|lazy)\b'), 'JJ'),  # Adjectives
        (re.compile(r'\b(fox|dog)\b'), 'NN'),  # Nouns
        (re.compile(r"\bjumps\b|\'s\b"), 'VBZ'),  # Verbs, 3rd person singular present
        (re.compile(r'\b(while|over|and)\b'), 'IN'),  # Prepositions and conjunctions
        (re.compile(r'\b(it|he|she)\b'), 'PRP'),  # Pronouns
        (re.compile(r'\b\w+ing\b'), 'VBG'),  # Gerunds ending in 'ing'
        (re.compile(r'\b\w+ly\b'), 'RB'),  # Adverbs ending in 'ly'
        (re.compile(r'\b\w+ed\b'), 'VBD'),  # Past-tense verbs ending in 'ed'
        (re.compile(r'\b\w+s\b|\b\w+es\b'), 'NNS'),  # Plural nouns ending in 's' or 'es'
        (re.compile(r'\.'), '.'),  # Full stop
        (re.compile(r'\,'), ','),  # Comma
    ]
    tagged_sentence = []
    words = word_tokenize(sentence)
    for word in words:
        tagged = False
        for pattern, tag in patterns:
            if pattern.match(word):
                tagged_sentence.append((word, tag))
                tagged = True
                break
        if not tagged:
            tagged_sentence.append((word, 'UNKNOWN'))
    return tagged_sentence



#### Comment:
- Here rule_based_pos_tagger is extended into updated_pos_tag_regex. The rules are tried in the order listed, and the first match wins.
- (re.compile(r'\b(The|the|an|An|a|A)\b'), 'DT') - Matches an article and assigns the tag 'DT' (determiner).
- (re.compile(r'\b(quick|brown|lazy)\b'), 'JJ') - Matches "quick", "brown", or "lazy" and assigns the tag 'JJ' (adjective).
- (re.compile(r'\b(fox|dog)\b'), 'NN') - Matches "fox" or "dog" and assigns the tag 'NN' (noun).
- (re.compile(r"\bjumps\b|\'s\b"), 'VBZ') - Matches "jumps" or "'s" and assigns the tag 'VBZ' (verb, 3rd person singular present).
- (re.compile(r'\b(while|over|and)\b'), 'IN') - Matches "while", "over", or "and" and assigns the tag 'IN' (preposition or subordinating conjunction). In the Penn Treebank tagset, "and" is normally tagged 'CC'.
- (re.compile(r'\b(it|he|she)\b'), 'PRP') - Matches "it", "he", or "she" and assigns the tag 'PRP' (personal pronoun).
- (re.compile(r'\b\w+ing\b'), 'VBG') - Matches a word ending in "ing" and assigns the tag 'VBG' (gerund or present participle).
- (re.compile(r'\b\w+ly\b'), 'RB') - Matches a word ending in "ly" and assigns the tag 'RB' (adverb).
- (re.compile(r'\b\w+ed\b'), 'VBD') - Matches a word ending in "ed" and assigns the tag 'VBD' (verb, past tense).
- (re.compile(r'\b\w+s\b|\b\w+es\b'), 'NNS') - Matches a word ending in "s" or "es" and assigns the tag 'NNS' (plural noun).
- (re.compile(r'\.'), '.') - Matches '.' and assigns the tag '.' (full stop).
- (re.compile(r'\,'), ',') - Matches ',' and assigns the tag ',' (comma).

- Next, we created an empty list to store the (token, tag) pairs.
- We tokenized the sentence with word_tokenize.
- For each token, the loop tries the rules in order. The first matching rule assigns its tag; if no rule matches, the token is tagged 'UNKNOWN'.
- Finally, we returned the list.
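
Because the first matching rule wins, the order of the rules changes the result whenever a word fits both a specific rule and a generic suffix rule. The sketch below illustrates this with two rules only; the helper `first_match_tag` is hypothetical and uses `fullmatch` for simplicity, which differs slightly from the `match` call in the tagger above.

```python
import re

def first_match_tag(word, rules, default="UNKNOWN"):
    """Return the tag of the first rule whose pattern covers the whole word."""
    for pattern, tag in rules:
        if pattern.fullmatch(word):
            return tag
    return default

# A specific rule for "jumps" and a generic plural-suffix rule
specific_first = [
    (re.compile(r"jumps"), "VBZ"),
    (re.compile(r"\w+s"), "NNS"),
]
general_first = list(reversed(specific_first))

print(first_match_tag("jumps", specific_first))  # VBZ
print(first_match_tag("jumps", general_first))   # NNS, the suffix rule shadows the verb rule
print(first_match_tag("is", specific_first))     # NNS, a false positive of the suffix rule
```

The last line shows a limitation of suffix rules in general: short function words such as "is" also end in "s", so they need an explicit rule placed before the generic one.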

### Rule-based POS Tagger and Statistical POS Tagger

In [8]:
# Test the improved Rule-based POS Tagger
sent = "The quick brown fox jumps over the lazy dog while it's raining heavily."

# Rule-based POS Tagging
rule_based_tags_updated = updated_pos_tag_regex(sent)
print("Rule-based POS Tags (Updated):")
print(rule_based_tags_updated)

# Statistical POS Tagging
statistical_tags = statistical_pos_tagger(sent)
print("Statistical POS Tags:")
print(statistical_tags)

# NLTK's pretrained tagger (pos_tag)
nltk_tagged_sent = nltk.pos_tag(word_tokenize(sent))
print("NLTK Statistical POS Tags:")
print(nltk_tagged_sent)

Rule-based POS Tags (Updated):
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'), ('raining', 'VBG'), ('heavily', 'RB'), ('.', '.')]
Accuracy: 0.8748033560566335
Statistical POS Tags:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NN'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'), ('raining', 'NN'), ('heavily', 'RB'), ('.', '.')]
NLTK Statistical POS Tags:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'), ('raining', 'VBG'), ('heavily', 'RB'), ('.', '.')]


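#### Comment:
- Here we applied the updated rule-based tagger, the treebank-trained statistical tagger, and NLTK's pos_tag to the sample sentence and printed the results.
- The rule-based tagger and pos_tag agree on every token except "brown" ('JJ' vs. 'NN'). The treebank-trained tagger assigns 'NN' to "brown", "jumps", "lazy", and "raining", which is consistent with the DefaultTagger('NN') fallback being used for these words.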
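
The three outputs above can also be compared mechanically. The following sketch lists the tokens on which the taggers disagree; the helper name `disagreements` is an assumption, and the tag lists are copied from the printed results.

```python
def disagreements(*taggings):
    """Return (word, tags) for every position where the taggers differ."""
    diffs = []
    for aligned in zip(*taggings):
        word = aligned[0][0]
        tags = tuple(tag for _, tag in aligned)
        if len(set(tags)) > 1:
            diffs.append((word, tags))
    return diffs

# Outputs copied from the cell above
rule_based = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
              ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
              ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'),
              ('raining', 'VBG'), ('heavily', 'RB'), ('.', '.')]
treebank_bigram = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
                   ('jumps', 'NN'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'),
                   ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'),
                   ('raining', 'NN'), ('heavily', 'RB'), ('.', '.')]
nltk_pos_tag = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
                ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
                ('dog', 'NN'), ('while', 'IN'), ('it', 'PRP'), ("'s", 'VBZ'),
                ('raining', 'VBG'), ('heavily', 'RB'), ('.', '.')]

# Tags are listed in the order: rule-based, treebank bigram, pos_tag
for word, tags in disagreements(rule_based, treebank_bigram, nltk_pos_tag):
    print(word, tags)
```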