# Q3 - Afanasev Artur

## Dataset: Data_2.txt 

1. Demonstrate 3 POS tagging methods
2. Explain differences 
3. Draw parse tree

In [95]:
# for q3 we're using data_2 
with open("../Data/Data_2.txt", "r", encoding="utf-8") as file:
    inputData = file.read()

Testing if it was opened properly

In [96]:
print(inputData)

The big black dog barked at the white cat and chased away.


1) Demonstration of 3 POS tagging methods

In [97]:
#imports 
import nltk 
from textblob import TextBlob
from nltk import pos_tag, word_tokenize, RegexpTagger

#launch
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/afanasevartur/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/afanasevartur/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [98]:
#tokenization

inputData_tokenized = word_tokenize(inputData)
print(inputData_tokenized)

['The', 'big', 'black', 'dog', 'barked', 'at', 'the', 'white', 'cat', 'and', 'chased', 'away', '.']


In [99]:
# 1. NLTK POS Tagger
print("1. NLTK POS Tagger Output:")
nltk_tags = pos_tag(inputData_tokenized)
print(nltk_tags)


1. NLTK POS Tagger Output:
[('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'), ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB'), ('.', '.')]


In [100]:
# 2. TextBlob POS Tagger
print("2. TextBlob POS Tagger Output:")
blob = TextBlob(inputData)
textblob_tags = blob.tags
print(textblob_tags)

2. TextBlob POS Tagger Output:
[('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'), ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB')]


In [101]:
# 3. Regular Expression Tagger
patterns = [
    (r'(?i)^(the|a|an)$', 'DT'), # Determiners/articles (The, the)
    (r'.*ed$', 'VBD'), # Past tense verbs (barked, chased)
    (r'^(at|in|on|by)$', 'IN'), # Prepositions (at)
    (r'^(and|but|or)$', 'CC'), # Conjunctions (and)
    (r'^(away|here|there)$', 'RB'), # Adverbs (away)
    (r'^(big|black|white)$', 'JJ'), # Adjectives (big, black, white)
    (r'^\.$', '.'), # Period at the end of a sentence
    (r'.*', 'NN') # By default, everything else is considered a noun (dog, cat)
    ]
print("3. Regular Expression Tagger Output:")
regexp_tagger = RegexpTagger(patterns)
regexp_tags = regexp_tagger.tag(inputData_tokenized)
print(regexp_tags)

3. Regular Expression Tagger Output:
[('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'), ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB'), ('.', '.')]
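The three outputs can also be compared programmatically rather than by eye. The sketch below copies the tag lists printed above into literals (so it runs without NLTK or TextBlob) and reports where they differ; the variable name `missing_in_textblob` is only illustrative.

```python
# Tag lists copied from the outputs of the three cells above.
nltk_tags = [('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'),
             ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'),
             ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB'),
             ('.', '.')]
textblob_tags = [('The', 'DT'), ('big', 'JJ'), ('black', 'JJ'), ('dog', 'NN'),
                 ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('white', 'JJ'),
                 ('cat', 'NN'), ('and', 'CC'), ('chased', 'VBD'), ('away', 'RB')]
# The regex tagger output printed above is identical to NLTK's.
regexp_tags = list(nltk_tags)

# Do the NLTK and regex taggers agree on every token?
print("NLTK == Regex:", nltk_tags == regexp_tags)

# Which (word, tag) pairs does TextBlob leave out?
missing_in_textblob = [pair for pair in nltk_tags if pair not in textblob_tags]
print("Dropped by TextBlob:", missing_in_textblob)
```

On this sentence the only difference is that TextBlob omits the sentence-final period from its tag list.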


2) Differences Explanation

The most important difference between these 3 methods is that NLTK and TextBlob rely on statistical machine learning models trained on large text corpora, while the Regular Expression (Regex) Tagger is a strictly deterministic algorithm based on human-written rules. Because TextBlob is a wrapper around NLTK, the first two outputs agree on every word; TextBlob only drops the final period.

Breakdown by criteria:

Accuracy - 
NLTK and TextBlob: Highly accurate on general text because they take the context of a sentence into account. For the phrase "The big black dog barked...", they identify "barked" as a past tense verb (VBD) and "dog" as a noun (NN) using features of the word itself and of the words around it.
Regex Tagger: Has low accuracy on general text. It is right only where a rule happens to fit: a word that matches no specific pattern falls through to the default NN, and a word can also match the wrong pattern ("red" ends in "ed", so the .*ed$ rule would tag it VBD). It doesn't understand context: it cannot distinguish "book" as a noun from "book" as a verb (to reserve), because any single rule assigns the same tag to both.
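To make the failure modes concrete, here is a minimal sketch that re-applies the same rule list to a few words that are not in Data_2.txt. It uses only the standard `re` module; `rule_tag` is a hypothetical stand-in that mimics the first-match-wins behaviour of `RegexpTagger`, not NLTK's own code.

```python
import re

# Same rules, in the same order, as the RegexpTagger patterns above.
PATTERNS = [
    (r'(?i)^(the|a|an)$', 'DT'),
    (r'.*ed$', 'VBD'),
    (r'^(at|in|on|by)$', 'IN'),
    (r'^(and|but|or)$', 'CC'),
    (r'^(away|here|there)$', 'RB'),
    (r'^(big|black|white)$', 'JJ'),
    (r'^\.$', '.'),
    (r'.*', 'NN'),
]

def rule_tag(tokens):
    """Tag each token with the first rule whose pattern matches it."""
    tagged = []
    for token in tokens:
        for pattern, tag in PATTERNS:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
    return tagged

# 'red' and 'bed' are not verbs, and 'slept' is one, yet the rules say otherwise.
result = rule_tag(['The', 'red', 'bed', 'slept', 'here', '.'])
print(result)
```

Here "red" and "bed" are caught by the .*ed$ rule and tagged VBD, while the irregular verb "slept" matches no specific rule and ends up as NN.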

Tag Sets - 
NLTK and TextBlob: Out of the box, they use the standardized Penn Treebank tag set (e.g., NN for nouns, JJ for adjectives, VBD for past tense verbs). They understand and apply dozens of different tags for fine-grained classification.
Regex Tagger: The tag set is entirely up to the programmer. To stay compatible with the Penn Treebank standard, the developer must map each pattern to a Penn Treebank tag by hand (as we did in the code: .*ed$ -> VBD). Any omission will result in incorrect tagging (for example, an irregular verb like "slept" does not match the .*ed$ rule and falls through to the default NN).

Performance - 
Regex Tagger: Wins in pure computation speed per token. String matching using regular expressions is fast and doesn't require loading a model into RAM. However, creating a system of rules that covers the entire English language would take an enormous amount of manual effort.
NLTK and TextBlob: They require time for initialization (loading pre-trained models, such as averaged_perceptron_tagger) and consume more RAM. However, on real-world problems they are far more practical, as they process any text without the need to write new rules.

Use Cases - 
NLTK: A standard toolkit for teaching, research, and more complex NLP tasks. Used when full control over the natural language processing pipeline and high accuracy are required.
TextBlob: Ideal for rapid prototyping. It's a wrapper around NLTK, providing a simpler and more intuitive API. Suitable for basic analytics tasks where you need to write code quickly and without unnecessary configuration.
Regex Tagger: Unsuitable as a standalone tagger for general-purpose text. However, it is useful in two cases:
As a backoff tagger when the statistical model fails to handle an unknown word.
For highly specialized subject areas (e.g., extracting and tagging specific part numbers, phone numbers, or currency codes with a rigid structure).
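The backoff idea can be sketched with NLTK's sequential taggers, which accept a `backoff` argument. This is only an illustration under stated assumptions: the top layer is a tiny hand-made lookup table (`known_words`, invented for this example) standing in for a trained model, and the suffix rules are deliberately minimal. No corpus download is needed.

```python
from nltk.tag import DefaultTagger, RegexpTagger, UnigramTagger

# Last resort: call everything a noun.
default_tagger = DefaultTagger('NN')

# Middle layer: a few suffix rules for words the lookup table has never seen.
suffix_tagger = RegexpTagger(
    [(r'.*ed$', 'VBD'), (r'.*ly$', 'RB'), (r'.*ing$', 'VBG')],
    backoff=default_tagger,
)

# Top layer: a hand-made word -> tag table (illustrative, not learned from data).
known_words = {'the': 'DT', 'dog': 'NN', 'at': 'IN', 'cat': 'NN'}
lookup_tagger = UnigramTagger(model=known_words, backoff=suffix_tagger)

# 'growled' and 'loudly' are unknown to the table, so the regex rules decide.
tokens = ['the', 'dog', 'growled', 'loudly', 'at', 'the', 'cat']
backoff_tags = lookup_tagger.tag(tokens)
print(backoff_tags)
```

Each tagger answers only when it can; otherwise the token is handed to the next tagger in the chain, which is where the regex rules earn their place.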

3) Parse Tree

In [102]:
from nltk import CFG, ChartParser

# define CFG
grammar = CFG.fromstring("""
  S -> NP VP Punc
  NP -> Det Adj Adj N | Det Adj N
  VP -> VP1 CC VP2
  VP1 -> V PP
  VP2 -> V Adv
  PP -> P NP
  
  Det -> 'The' | 'the'
  Adj -> 'big' | 'black' | 'white'
  N -> 'dog' | 'cat'
  V -> 'barked' | 'chased'
  P -> 'at'
  CC -> 'and'
  Adv -> 'away'
  Punc -> '.'
""")

parser = ChartParser(grammar)

#showcase
print("Parse Trees:")
for tree in parser.parse(inputData_tokenized):
    tree.pretty_print()

Parse Trees:
                              S                                        
      ________________________|_____________________________________    
     |                                 VP                           |  
     |                         ________|___________________         |   
     |                       VP1                |          |        |  
     |               _________|___              |          |        |   
     |              |             PP            |          |        |  
     |              |      _______|____         |          |        |   
     NP             |     |            NP       |         VP2       |  
  ___|_________     |     |    ________|____    |     _____|___     |   
Det Adj  Adj   N    V     P  Det      Adj   N   CC   V        Adv  Punc
 |   |    |    |    |     |   |        |    |   |    |         |    |   
The big black dog barked  at the     white cat and chased     away  .  
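The same tree can also be inspected in code instead of being read off the drawing. The sketch below rebuilds it from a hand-written bracketed string (transcribed from the tree above, so any transcription slip would be mine) and pulls out the leaves and the noun phrases with `nltk.tree.Tree`.

```python
from nltk.tree import Tree

# Bracketed form of the parse tree shown above, written out by hand.
bracketed = """
(S
  (NP (Det The) (Adj big) (Adj black) (N dog))
  (VP
    (VP1 (V barked) (PP (P at) (NP (Det the) (Adj white) (N cat))))
    (CC and)
    (VP2 (V chased) (Adv away)))
  (Punc .))
"""
tree = Tree.fromstring(bracketed)

# Reading the leaves left to right gives back the token sequence.
leaves = tree.leaves()

# Collect the words under every NP node (outer nodes are visited first).
noun_phrases = [' '.join(t.leaves())
                for t in tree.subtrees(lambda t: t.label() == 'NP')]

print(leaves)
print(noun_phrases)
```

The two noun phrases found this way are "The big black dog" and "the white cat", matching the two NP branches in the drawing.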



# Q5 - Alternative approach: spaCy - Afanasev Artur

## Dataset - Data_1.txt

## code implementation:

In [103]:
#install spaCy and download the English model first:
#pip3 install spacy
#python3 -m spacy download en_core_web_sm

import spacy
import time

#load the text file (assuming Data_1.txt is in the Data folder)
with open("../Data/Data_1.txt", "r", encoding="utf-8") as file:
    text = file.read()

nlp = spacy.load("en_core_web_sm")

#text process
start_time = time.time()
doc = nlp(text)
spacy_time = time.time() - start_time

#extract text from spaCy Token objects
spacy_tokens = [token.text for token in doc]

print("Tokens using spaCy:\n")
print(spacy_tokens)
print(f"\nProcessing time: {spacy_time:.5f} seconds")

Tokens using spaCy:

['Classification', 'is', 'the', 'task', 'of', 'choosing', 'the', 'correct', 'class', 'label', 'for', 'a', 'given', 'input', '.', 'In', 'basic', '\n', 'classification', 'tasks', ',', 'each', 'input', 'is', 'considered', 'in', 'isolation', 'from', 'all', 'other', 'inputs', ',', 'and', 'the', 'set', 'of', 'labels', 'is', 'defined', 'in', 'advance', '.', 'The', 'basic', 'classification', 'task', 'has', 'a', 'number', 'of', 'interesting', 'variants', '.', 'For', 'example', ',', 'in', 'multiclass', 'classification', ',', 'each', 'instance', 'may', 'be', 'assigned', 'multiple', 'labels', ';', 'in', 'open', '-', 'class', 'classification', ',', 'the', 'set', 'of', 'labels', 'is', 'not', 'defined', 'in', 'advance', ';', 'and', 'in', 'sequence', 'classification', ',', 'a', 'list', 'of', 'inputs', 'are', 'jointly', 'classified', '.']

Processing time: 0.01418 seconds


## comparison:

In [104]:
import string
from nltk.tokenize import word_tokenize

def analyze_token_quality(method_title, token_list, raw_string):
    # 1. count all and unique tokens 
    token_count = len(token_list)
    vocab_size = len(set(token_list))
    
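    # 2. count tokens that contain any punctuation character
    #    (note: standalone tokens such as '.' or ',' are counted too)
    punct_chars = set(string.punctuation)
    attached_punct = 0
    for token in token_list:
        if punct_chars.intersection(token):
            attached_punct += 1
            
    # 3. check hyphenated words: a word counts as broken if the tokenizer
    #    did not keep it as a single token (surrounding punctuation is stripped first)
    raw_words = [w.strip(string.punctuation) for w in raw_string.split()]
    words_with_hyphens = [w for w in raw_words if '-' in w]
    
    broken_hyphen_count = 0
    for hyphen_word in words_with_hyphens:
        if hyphen_word not in token_list:
            broken_hyphen_count += 1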
            
    print(f"--- {method_title} ---")
    print("Total tokens:", token_count)
    print("Tokens with punctuation attached:", attached_punct)
    print("Hyphenated words split incorrectly:", broken_hyphen_count)
    print("Unique tokens:", vocab_size)
    print()

nltk_result = word_tokenize(text)

analyze_token_quality("Group's Approach (NLTK)", nltk_result, text)
analyze_token_quality("Alternative Approach (spaCy)", spacy_tokens, text)

--- Group's Approach (NLTK) ---
Total tokens: 94
Tokens with punctuation attached: 13
Hyphenated words split incorrectly: 0
Unique tokens: 53

--- Alternative Approach (spaCy) ---
Total tokens: 97
Tokens with punctuation attached: 13
Hyphenated words split incorrectly: 1
Unique tokens: 55
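A note on the "punctuation attached" figure: the check above counts every token that contains a punctuation character, so standalone tokens such as '.' and ',' are included. In the spaCy token list printed earlier, all 13 flagged tokens are standalone punctuation marks. A stricter check that only flags punctuation glued to the edge of a word could look like the sketch below; `count_attached_punct` and the sample inputs are hypothetical and not part of the group's code.

```python
import string

def count_attached_punct(token_list):
    """Count tokens where punctuation is glued to the edge of a word, e.g. 'input.'."""
    punct_chars = set(string.punctuation)
    attached = 0
    for token in token_list:
        if not token:
            continue
        # The token must contain a real word character...
        has_word_char = any(ch.isalnum() for ch in token)
        # ...and start or end with a punctuation mark.
        edge_is_punct = token[0] in punct_chars or token[-1] in punct_chars
        if has_word_char and edge_is_punct:
            attached += 1
    return attached

# A naive whitespace split leaves ',' and '.' stuck to the words.
whitespace_split = "each input is considered in isolation, and open-class labels vary.".split()
# A proper tokenizer separates them; the inner hyphen is not an edge.
clean_tokens = ['each', 'input', ',', 'open-class', 'labels', '.']

print(count_attached_punct(whitespace_split))
print(count_attached_punct(clean_tokens))
```

With this definition a tokenizer is penalised only for tokens like "isolation," and not for correctly separated punctuation.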



## 5.2 Compare and contrast the alternative approach with the group's approach.

The alternative approach implemented is the spaCy tokenizer, which contrasts significantly with the group's approaches in its underlying architecture and data representation.

1. Object-Oriented vs. String-Based: The group's tokenizers, such as NLTK or regex, primarily return a list of plain strings. In contrast, spaCy returns a `Doc` object containing `Token` objects. Each token is not just a string, but a rich object that already holds linguistic annotations such as part-of-speech tags, lemmas, and whether it is a stop word.
2. Non-Destructive Tokenization: Regex-based tokenizers and similar approaches often lose the original whitespace characters. NLTK separates punctuation, but reconstructing the exact original text can be tricky. spaCy is non-destructive: each token stores the whitespace that followed it, allowing us to reconstruct the original text exactly at any time.
3. Contextual Awareness: NLTK's `word_tokenize` combines the pre-trained `punkt` sentence splitter with rule-based word splitting. spaCy's tokenizer is also rule-based, but it feeds a neural-network-backed pipeline (`en_core_web_sm`) whose later components (tagger, parser, entity recognizer) use the surrounding context.
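The non-destructive property from point 2 can be checked directly. The sketch below assumes spaCy is installed and uses `spacy.blank("en")`, which carries only the tokenizer and needs no model download; the sample string is made up for the test and includes a newline and a double space on purpose.

```python
import spacy

# A blank English pipeline contains only the tokenizer.
nlp = spacy.blank("en")

sample_text = "In basic\nclassification tasks,  each input is considered in isolation."
doc = nlp(sample_text)

# Every token remembers the whitespace that followed it in the original text.
rebuilt = "".join(token.text_with_ws for token in doc)

print(rebuilt == sample_text)
print(doc.text == sample_text)
```

Joining a plain list of strings from `word_tokenize` with spaces would not give this guarantee, because the original spacing and line breaks are not stored in the tokens.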

## 5.3 Explain why the alternative approach is better, worse, or just different.

The spaCy approach serves a fundamentally different purpose compared to traditional NLP tokenizers.

Reasons why spaCy is BETTER:
Production-Ready and Fast: While it takes slightly longer to load the model initially, spaCy is written in Cython and is heavily optimized for speed on massive datasets compared to pure Python implementations like NLTK.
Rich Linguistic Features: Calling `nlp(text)` runs the whole pipeline, so POS tags, syntactic dependencies, and named entities are computed together with the tokens in one go, making it very powerful for complex text analytics tasks.
Intelligent Edge-Case Handling: It handles complex punctuation, currency symbols, and abbreviations exceptionally well without requiring custom RegEx rules.

Reasons why spaCy is WORSE:
Resource Intensive: It requires downloading and loading a language model (e.g., `en_core_web_sm`), which consumes more RAM than simple NLTK or regex operations.
Less Customizable: NLTK allows developers to easily build custom tokenizers from scratch. spaCy is closer to a "black box" pipeline: its tokenizer rules can be adjusted, but changing its core tokenization behavior for highly niche, non-standard text like obscure programming code takes much more effort.

Final conclusion:
The spaCy approach is better for real-world application development where speed on large corpora and rich linguistic metadata are required. However, the group's approach (especially NLTK) remains better for academic environments, rapid prototyping, and scenarios where memory resources are strictly limited.