<a href="https://colab.research.google.com/github/acastellanos-ie/NLP-MBDS-EN/blob/main/03_linguistic_structure_and_interpretability/parsing_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classical Syntax and its Limitations (Lab)

**Learning Objective:**
Understand how classical NLP pipelines combine dependency parsing with hand-written rules to extract structured information (Subject-Verb-Object triples) from text.

More importantly, we will empirically expose the fragility of rule-based systems when faced with real-world linguistic variability, setting the stage for the deep learning revolution (Semantics & Transformers).

In [None]:
# 1. Environment Setup
# We use spaCy, a widely used library for classical NLP pipelines.
!pip install -q spacy
!python -m spacy download en_core_web_sm -q

### Phase 1: Visualizing Linguistic Structure
Before deep learning, we relied on parsing sentences into directed graphs (Dependency Trees) to understand grammatical relationships.
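
To make the "directed graph" idea concrete, here is a minimal sketch that stores a dependency tree as plain Python tuples and prints each head-to-dependent arc. The arcs are hand-annotated for illustration, using labels in the style of spaCy's English models (`nsubj`, `prep`, `pobj`). They are an assumption, not actual parser output, and `PARSE` and `arcs` are hypothetical names.

```python
# Hand-annotated dependency parse (an assumption for illustration, not model output).
# Each entry: (token text, index of its head, dependency label).
# The root token points to itself.
PARSE = [
    ("The",   3, "det"),
    ("quick", 3, "amod"),
    ("brown", 3, "amod"),
    ("fox",   4, "nsubj"),
    ("jumps", 4, "ROOT"),
    ("over",  4, "prep"),
    ("the",   8, "det"),
    ("lazy",  8, "amod"),
    ("dog",   5, "pobj"),
    (".",     4, "punct"),
]

def arcs(parse):
    """Return (head text, label, dependent text) for every non-root token."""
    return [
        (parse[head][0], label, text)
        for i, (text, head, label) in enumerate(parse)
        if head != i
    ]

for head, label, dependent in arcs(PARSE):
    print(f"{head:>6} --{label}--> {dependent}")
```

Every token except the root has exactly one incoming arc, which is what makes the structure a tree rather than an arbitrary graph.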

In [None]:
import spacy
from spacy import displacy
from IPython.display import display, HTML

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

text = "The quick brown fox jumps over the lazy dog."
doc = nlp(text)

# Render the dependency tree using displaCy
html = displacy.render(doc, style="dep", jupyter=False, options={'distance': 100})
display(HTML(html))

### Phase 2: The Illusion of Rule-Based Information Extraction
Let's build a classic Subject-Verb-Object (SVO) extractor. Extractors like this were a common building block of early Knowledge Graphs and Question Answering (QA) systems.

In [None]:
def extract_svo_triples(text, nlp_model):
    doc = nlp_model(text)
    svo_triples = []

    for token in doc:
        # Find the Subject (substring match covers nsubj, nsubjpass, csubj)
        if "subj" in token.dep_:
            subject = token
            verb = token.head
            # Find the Object attached to the Verb
            for obj in verb.children:
                if "obj" in obj.dep_:
                    svo_triples.append((subject, verb, obj))
                elif obj.dep_ == "prep":
                    for pobj in obj.children:
                        if pobj.dep_ == "pobj":
                            svo_triples.append((subject, verb, pobj))
    return svo_triples

# Let's test it on a sterile, simple dataset
sterile_text = "John bought a new car. Mary gave John a book. Alice traveled to Paris."
sterile_triples = extract_svo_triples(sterile_text, nlp)

print("--- Extracted Triples ---")
for triple in sterile_triples:
    print(f"Subject: {triple[0].text: <5} | Verb: {triple[1].text: <8} | Object: {triple[2].text}")

### Phase 3: Building a Simple QA System
We can use these triples to answer basic questions by matching the verb.

In [None]:
def simple_qa(question, svo_triples, nlp_model):
    question_doc = nlp_model(question)
    question_verb = None

    # Identify the main verb in the question
    for token in question_doc:
        if token.pos_ == "VERB":
            question_verb = token
            break

    if question_verb is not None:
        for triple in svo_triples:
            subject, verb, obj = triple
            # Match by lemma (base form of the verb: bought -> buy)
            if verb.lemma_ == question_verb.lemma_:
                return f"{subject.text} {verb.text} {obj.text}"

    return "System Failure: Cannot determine answer."

print("Question: Who bought a car?")
print("Answer:  ", simple_qa("Who bought a car?", sterile_triples, nlp))

### Phase 4: The Stress Test
The system above looks like artificial intelligence. **It is not.** It is a rigid rule over the shape of the parse tree.

Let's see what happens when we introduce standard linguistic variance (Synonyms, Passive Voice, Complex Clauses).
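
The synonym case can be previewed without running the parser. The sketch below assumes a tiny hand-written lemma table (`LEMMAS` is hypothetical and stands in for a real lemmatizer) and shows that exact lemma equality, the matching rule used by `simple_qa`, cannot relate `bought` to `purchased`.

```python
# Hypothetical lemma table standing in for a real lemmatizer
# (an assumption for illustration, not spaCy output).
LEMMAS = {
    "bought": "buy",
    "buys": "buy",
    "purchased": "purchase",
    "gave": "give",
    "given": "give",
}

def same_verb(a, b):
    """Exact lemma equality: the same kind of test simple_qa applies to verbs."""
    return LEMMAS.get(a, a) == LEMMAS.get(b, b)

print(same_verb("bought", "buys"))       # True: inflections of one verb
print(same_verb("gave", "given"))        # True: active and passive forms
print(same_verb("bought", "purchased"))  # False: synonyms have different lemmas
```

Lemmatization normalizes inflection, not meaning, so a synonym is as far from the question verb as an unrelated word.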

In [None]:
# The real world does not speak in perfect SVO structures.
real_world_text = "A brand new car was purchased by John. The book, which was heavy, was given to John by Mary."
real_world_triples = extract_svo_triples(real_world_text, nlp)

print("--- QA Stress Test ---")
questions = [
    "Who bought a car?",           # Fails: Synonym (purchased vs bought)
    "What did John purchase?",     # Fails: Syntactic inversion (Passive voice makes 'car' the subject)
    "Who gave the book to John?"   # Fails: Passive voice makes 'book' the subject
]

for q in questions:
    print(f"\nQ: {q}")
    print(f"A: {simple_qa(q, real_world_triples, nlp)}")

print("\n--- Why did it fail? Look at the Triples ---")
for triple in real_world_triples:
    print(f"Parsed -> Subject: {triple[0].text: <4} | Verb: {triple[1].text: <9} | Object: {triple[2].text}")



### Takeaway for the Student

In the passive sentence `A brand new car was purchased by John.`, the dependency parser labels "car" as the grammatical subject (`nsubjpass`).

Our hardcoded logic treats every grammatical subject as the doer of the action, so it thinks the car is doing the purchasing.

To fix this in classical NLP, you would need an ever-growing set of "if/else" rules to handle every grammatical edge case.
This is unsustainable.
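
To see what one such extra rule looks like, the sketch below adds passive-voice handling to a toy extractor over a hand-annotated parse. The labels (`nsubjpass`, `agent`, `pobj`) follow the style of spaCy's English models, but the annotation is written by hand as an assumption, and `Tok`, `PASSIVE`, `children`, and `extract_with_passive_rule` are hypothetical names, not part of the lab code.

```python
from typing import NamedTuple

class Tok(NamedTuple):
    text: str
    head: int   # index of the head token (the root points to itself)
    dep: str    # dependency label

# Hand-annotated parse of "A car was purchased by John"
# (an assumption for illustration, not model output).
PASSIVE = [
    Tok("A", 1, "det"),
    Tok("car", 3, "nsubjpass"),
    Tok("was", 3, "auxpass"),
    Tok("purchased", 3, "ROOT"),
    Tok("by", 3, "agent"),
    Tok("John", 4, "pobj"),
]

def children(parse, head_idx):
    """Indices of tokens whose head is head_idx, skipping the root's self-loop."""
    return [i for i, t in enumerate(parse) if t.head == head_idx and i != head_idx]

def extract_with_passive_rule(parse):
    """Extra rule: for a passive subject, the agent's object is the logical subject."""
    triples = []
    for tok in parse:
        if tok.dep != "nsubjpass":
            continue
        verb_idx = tok.head
        for c in children(parse, verb_idx):
            if parse[c].dep == "agent":
                for g in children(parse, c):
                    if parse[g].dep == "pobj":
                        # Swap roles: (doer, verb, thing acted upon)
                        triples.append((parse[g].text, parse[verb_idx].text, tok.text))
    return triples

print(extract_with_passive_rule(PASSIVE))  # [('John', 'purchased', 'car')]
```

This covers a single construction. Agentless passives ("The car was purchased"), relative clauses, and synonyms would each need further rules of their own.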

**The solution?**

Dense Vector Semantics (Next Session).