Q3 
Load Data

In [1]:
# for q3 we're using data_2 
with open("../Data/Data_2.txt", "r", encoding="utf-8") as file:
    inputData = file.read()

Testing if it was opened properly

In [2]:
print(inputData)

The big black dog barked at the white cat and chased away.


1) Demonstration of 3 POS tagging methods

In [3]:
# imports
import nltk
from textblob import TextBlob
from nltk import pos_tag, word_tokenize, RegexpTagger

# download the tokenizer and tagger models (skipped if already up to date)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/afanasevartur/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/afanasevartur/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [4]:
#tokenization

inputData_tokenized = word_tokenize(inputData)
print(inputData_tokenized)

['The', 'big', 'black', 'dog', 'barked', 'at', 'the', 'white', 'cat', 'and', 'chased', 'away', '.']


In [5]:
# 1. NLTK POS Tagger
print("1. NLTK POS Tagger Output:")
nltk_tags = pos_tag(inputData_tokenized)
print(nltk_tags)


1. NLTK POS Tagger Output:
[('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'), ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB'), ('.', '.')]


In [6]:
# 2. TextBlob POS Tagger
print("2. TextBlob POS Tagger Output:")
blob = TextBlob(inputData)
textblob_tags = blob.tags
print(textblob_tags)

2. TextBlob POS Tagger Output:
[('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'), ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB')]


In [7]:
# 3. Regular Expression Tagger
# Patterns are tried in order; the first one that matches a token decides its tag.
patterns = [
    (r'(?i)^(the|a|an)$', 'DT'),      # Determiners/articles (The, the)
    (r'.*ed$', 'VBD'),                # Past tense verbs (barked, chased)
    (r'^(at|in|on|by)$', 'IN'),       # Prepositions (at)
    (r'^(and|but|or)$', 'CC'),        # Conjunctions (and)
    (r'^(away|here|there)$', 'RB'),   # Adverbs (away)
    (r'^(big|black|white)$', 'JJ'),   # Adjectives (big, black, white)
    (r'^\.$', '.'),                   # Period at the end of a sentence
    (r'.*', 'NN'),                    # Default: everything else is tagged as a noun (dog, cat)
]
print("3. Regular Expression Tagger Output:")
regexp_tagger = RegexpTagger(patterns)
regexp_tags = regexp_tagger.tag(inputData_tokenized)
print(regexp_tags)

3. Regular Expression Tagger Output:
[('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'), ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB'), ('.', '.')]


2) Differences Explanation

The most important difference between these 3 methods is that NLTK and TextBlob rely on statistical machine learning models trained on large annotated corpora, while the Regular Expression (Regex) Tagger is a strictly deterministic algorithm based on human-written rules.
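
Despite that difference in approach, the three outputs in part 1 agree almost entirely on this one sentence. A small check makes this concrete; the tag lists below are copied from the printed outputs, and the variable `missing` is introduced here only for illustration.

```python
# Tag lists copied verbatim from the three outputs in part 1.
nltk_tags = [('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'),
             ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'),
             ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB'),
             ('.', '.')]
textblob_tags = [('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'),
                 ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'),
                 ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB')]
regexp_tags = [('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'),
               ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'),
               ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB'),
               ('.', '.')]

# The hand-written rules reproduce the NLTK result exactly on this sentence.
print(nltk_tags == regexp_tags)

# (token, tag) pairs that NLTK produced but TextBlob did not.
missing = [pair for pair in nltk_tags if pair not in textblob_tags]
print(missing)
```

The only pair missing from the TextBlob output is the final period, so on this input the methods differ in how punctuation is handled rather than in the tags assigned to words.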

Breakdown by criteria:

Accuracy - 
NLTK and TextBlob: Highly accurate on general English text, because they take context into account. In the phrase "The big black dog barked...", they tag "barked" as a past tense verb (VBD) and "dog" as a noun (NN) using what the model has learned about the word itself and about the words and tags around it.
Regex Tagger: Low accuracy on general text. Each token is tagged in isolation by the first pattern it matches, so the result is only as good as the rules. A word that fits a pattern by accident is mis-tagged (the .*ed$ rule would tag "red" or "bed" as VBD), and every word that matches no specific rule falls through to the default NN. The tagger has no notion of context, so it cannot tell "book" the noun from "book" the verb (to make a reservation). Its perfect result on the sentence above reflects rules written for exactly that sentence.
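
To illustrate how quickly the rules break down, here is a minimal sketch that applies the same patterns to a different sentence. It uses a small first-match loop over `re.match` as a stand-in for `RegexpTagger`, so it runs without NLTK; the test sentence and its hand-assigned gold tags are assumptions made up for this example, and `first_match_tag` is a hypothetical helper.

```python
import re

# Patterns copied from the RegexpTagger cell in part 1.
patterns = [
    (r'(?i)^(the|a|an)$', 'DT'),
    (r'.*ed$', 'VBD'),
    (r'^(at|in|on|by)$', 'IN'),
    (r'^(and|but|or)$', 'CC'),
    (r'^(away|here|there)$', 'RB'),
    (r'^(big|black|white)$', 'JJ'),
    (r'^\.$', '.'),
    (r'.*', 'NN'),
]


def first_match_tag(token):
    """Return the tag of the first pattern that matches the token."""
    for regex, tag in patterns:
        if re.match(regex, token):
            return tag
    return None


# Hypothetical sentence with hand-assigned Penn Treebank tags (an assumption).
gold = [('The', 'DT'), ('red', 'JJ'), ('dog', 'NN'), ('slept', 'VBD'),
        ('on', 'IN'), ('the', 'DT'), ('bed', 'NN'), ('.', '.')]

predicted = [(token, first_match_tag(token)) for token, _ in gold]

# Collect (token, expected, predicted) for every disagreement.
errors = [(token, want, got)
          for (token, want), (_, got) in zip(gold, predicted)
          if want != got]
accuracy = (len(gold) - len(errors)) / len(gold)

print(errors)
print(f"accuracy = {accuracy:.3f}")
```

Three of the eight tokens are wrong: "red" and "bed" are caught by the .*ed$ rule, and "slept" falls through to the default NN, giving an accuracy of 0.625 on this sentence.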

Tag Sets - 
NLTK and TextBlob: Out of the box, they use the standardized Penn Treebank tag set (e.g., NN for nouns, JJ for adjectives, VBD for past tense verbs) and apply dozens of different tags for fine-grained classification. The one visible difference in the outputs above is that TextBlob's tags property drops punctuation tokens, so the final ('.', '.') pair is absent from its result.
Regex Tagger: The tag set is entirely up to the programmer. To stay compatible with Penn Treebank, the developer must map every pattern to a Penn Treebank tag by hand (as in the code above: .*ed$ -> VBD). Any gap in the rules produces a wrong tag; for example, the irregular verb "slept" does not match .*ed$ and falls through to the default NN.

Performance - 
Regex Tagger: Fastest per token. Matching a handful of regular expressions is cheap and requires no model to be loaded into RAM. The cost lies elsewhere: writing rules that cover general English would take an enormous amount of manual effort.
NLTK and TextBlob: They need time to initialize (loading a pre-trained model such as averaged_perceptron_tagger) and use more RAM. For real-world text they are far more practical overall, since they handle arbitrary input without new rules having to be written.

Use Cases - 
NLTK: A widely used toolkit for teaching, research, and more complex NLP tasks. A good fit when full control over the natural language processing pipeline and high accuracy are required.
TextBlob: Well suited to rapid prototyping. It is built on top of NLTK and provides a simpler, more intuitive API. Suitable for basic analytics tasks where code has to be written quickly and without extra configuration.
Regex Tagger: Unsuitable as a standalone tagger for general-purpose text, but useful in two cases:
As a backoff tagger when the statistical model fails to handle an unknown word.
For highly specialized subject areas (e.g., extracting and tagging specific part numbers, phone numbers, or currency codes with a rigid structure).
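
The backoff arrangement from the first case can be sketched with NLTK's `backoff` parameter. This is only an illustration under stated assumptions: the primary tagger here is a `UnigramTagger` built from a tiny hand-written lookup table (`known_words`, a made-up stand-in for a trained statistical model), and the regex rules are a shortened version of the patterns from part 1.

```python
from nltk.tag import RegexpTagger, UnigramTagger

# Shortened rule set; consulted only when the primary tagger has no answer.
fallback = RegexpTagger([
    (r'(?i)^(the|a|an)$', 'DT'),
    (r'^(and|but|or)$', 'CC'),
    (r'.*ed$', 'VBD'),
    (r'.*', 'NN'),
])

# Hypothetical lookup table standing in for a trained model (an assumption).
known_words = {'red': 'JJ', 'slept': 'VBD'}

# Words found in the table get its tag; all other words go to the regex rules.
tagger = UnigramTagger(model=known_words, backoff=fallback)

tokens = ['The', 'red', 'dog', 'slept', 'and', 'barked']
tagged = tagger.tag(tokens)
print(tagged)
```

Here "red" and "slept" are resolved by the lookup table, which avoids the mistakes the regex rules alone would make on them, while the remaining tokens are handled by the fallback rules.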

3) Parse Tree