
# Chapter 2: Playing with Grammar

Grammar is one of the main building blocks of language. Each human language -- and programming language for that matter -- has a set of rules that every speaker must follow or risk not being understood. These grammatical rules can be uncovered using NLP and are useful for extracting structured data from sentences.

In this chapter, we use **spaCy**, **TextBlob**, and (optionally) a large language model to reveal the grammatical structure of words and sentences. We cover five core recipes:

1. **Counting nouns** -- determining whether a noun is singular or plural, and inflecting between the two forms
2. **Dependency parsing** -- uncovering the grammatical tree structure of a sentence
3. **Extracting noun chunks** -- pulling out complete noun phrases
4. **Extracting subjects and objects** -- identifying who did what to whom
5. **Pattern matching** -- finding grammatical patterns (e.g., verb phrases) using spaCy's rule-based `Matcher`

**Mathematical foundation.** At its core, a dependency parse represents a sentence as a **directed tree** $T = (V, A)$ where the vertex set $V$ corresponds to tokens and the arc set $A$ encodes head$\to$dependent relations. The tree satisfies three properties: (i) there is exactly one **root** node with in-degree $0$, (ii) every other node has in-degree $1$, and (iii) there is a unique directed path from the root to every node. This tree structure is what spaCy computes behind the scenes using a transition-based parser.
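The three tree properties can be checked mechanically. The sketch below is a minimal illustration, not spaCy code: it assumes the parse is given as a hypothetical `heads` list in which `heads[i]` is the index of token `i`'s head and the root points to itself.

```python
def is_dependency_tree(heads):
    """Return True if `heads` encodes a single-rooted directed tree.

    heads[i] is the index of token i's head; the root points to itself
    (this self-loop convention is an assumption of the sketch).
    """
    n = len(heads)
    if n == 0 or any(not 0 <= h < n for h in heads):
        return False
    roots = [i for i, h in enumerate(heads) if h == i]
    if len(roots) != 1:              # property (i): exactly one root
        return False
    # Property (ii) holds by construction: each token stores exactly one head.
    # Property (iii): walking up from any token must reach the root without looping.
    for start in range(n):
        seen = set()
        node = start
        while heads[node] != node:
            if node in seen:
                return False         # cycle that never reaches the root
            seen.add(node)
            node = heads[node]
    return True

print(is_dependency_tree([1, 1, 1]))   # token 1 is the root of both others
print(is_dependency_tree([1, 0, 2]))   # tokens 0 and 1 point at each other
```

Storing one head per token makes the in-degree condition automatic, so only the root count and the absence of cycles need an explicit check.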

## Environment Setup

We begin by installing the required packages and downloading the spaCy models. The **small model** (`en_core_web_sm`) is fast and suitable for most syntactic tasks, while the **large model** (`en_core_web_lg`) ships with $300$-dimensional word vectors and tends to produce more accurate results for irregular forms and semantic similarity.

In [1]:
# Install required packages
!pip install -q spacy textblob

# Download spaCy models
!python -m spacy download en_core_web_sm -q
!python -m spacy download en_core_web_lg -q

12.8/12.8 MB 86.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
400.7/400.7 MB 4.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies


In [2]:
import spacy
from enum import Enum
import warnings
warnings.filterwarnings('ignore')

# Load models
small_model = spacy.load("en_core_web_sm")
large_model = spacy.load("en_core_web_lg")

print("Small model pipeline:", small_model.pipe_names)
print("Large model pipeline:", large_model.pipe_names)
print(f"Large model vectors: {large_model.vocab.vectors.shape}")

Small model pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Large model pipeline: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Large model vectors: (342918, 300)


Both models share the same six-component pipeline: `tok2vec` (token embedding), tagging, parsing, attribute ruling, lemmatization, and named entity recognition. Tokenization runs before these components and is not listed in `pipe_names`. The key difference is that the large model includes a $342{,}918 \times 300$ word vector matrix (the shape printed above) -- each row is a $300$-dimensional vector $\mathbf{v}_w \in \mathbb{R}^{300}$. These are pretrained static embeddings of the **GloVe** / **word2vec** family, and they are what enable meaningful similarity computations later in the chapter.

## 2.1 Counting Nouns -- Plural and Singular

Determining whether a noun is **singular** or **plural** is a fundamental building block for many NLP tasks. If you want to compute word frequency statistics, you typically need to group `"bird"` and `"birds"` together. To do that programmatically, you need a way to detect the **grammatical number** of each noun.

We explore three approaches: **lemma comparison** (if the lemma differs from the surface form, the noun is likely inflected), **morphological features** (spaCy's `morph` attribute directly encodes grammatical number), and **LLM-based classification** (optional, requires API key).

**Linguistic background.** English nouns follow two main pluralization patterns. **Regular** nouns add `-s` or `-es` (book $\to$ books), while **irregular** nouns change their stem (goose $\to$ geese), take an unusual suffix (child $\to$ children), or stay unchanged (deer $\to$ deer). Irregular forms are where NLP models are most likely to struggle, since they cannot rely on simple suffix rules.

### 2.1.1 Method 1: Lemma Comparison

The **lemma** of a word is its canonical dictionary form. For nouns, the lemma is the singular form. If `token.lemma_` $\neq$ `token.text`, the word has been inflected -- for nouns, this almost always means it is plural.

In [3]:
text = "I have five birds"
doc = small_model(text)

for token in doc:
    if (token.pos_ == "NOUN" and token.lemma_ != token.text):
        print(token.text, "plural")

birds plural


The model correctly identifies `"birds"` as plural. Under the hood, spaCy's lemmatizer maps `"birds"` $\to$ `"bird"`. Since `"birds"` $\neq$ `"bird"`, we flag it as plural.

This approach is elegant because it piggybacks on the lemmatizer -- a component that already exists in every spaCy pipeline. The downside is that it only works when the lemmatizer itself is accurate, which can fail on rare irregular nouns (as we will see shortly).
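One further caveat is capitalization: if the lemma comes back lowercased (as it commonly does for common nouns), a sentence-initial singular noun such as `"Dog"` differs from its lemma `"dog"` and would be flagged as plural. The sketch below shows a case-insensitive variant of the comparison. The helper name `is_plural_by_lemma` and the hand-written (surface form, lemma) pairs are assumptions for illustration, not spaCy output.

```python
def is_plural_by_lemma(text, lemma):
    """Heuristic: a noun whose surface form differs from its lemma,
    ignoring case, is treated as plural."""
    return text.lower() != lemma.lower()

# (surface form, lemma) pairs written by hand for illustration
examples = [("birds", "bird"), ("Dog", "dog"), ("geese", "goose"), ("deer", "deer")]
for text, lemma in examples:
    label = "plural" if is_plural_by_lemma(text, lemma) else "singular"
    print(f"{text:<6} {lemma:<6} {label}")
```

Zero plurals such as `"deer"` remain invisible to this heuristic, because the surface form and the lemma are identical in both numbers.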

### 2.1.2 Method 2: Morphological Features

spaCy also exposes **morphological features** via the `token.morph` attribute. These include grammatical number (`Number=Sing` or `Number=Plur`), case, tense, and other inflectional properties.

In [4]:
doc = small_model("I have five birds.")
print(doc[3].text, "->", doc[3].morph.get("Number"))

birds -> ['Plur']


The `morph.get("Number")` call returns a list -- here `['Plur']` -- confirming that `"birds"` is plural. In the English pipelines, these features are filled in by the **attribute ruler**, which maps the tagger's fine-grained tags to morphological features through rule tables.

Note the return type: `morph.get()` returns a **list**, not a string, and the list is empty when the token carries no value for that feature. A list is used because some morphological categories can have multiple values in certain languages. In English, number is single-valued for nouns.
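Because the list can be empty, it may be worth separating "no number feature" from "singular" instead of folding both into one label. A minimal sketch, assuming only that the input is the list returned by `token.morph.get("Number")`; the helper name and the `"unknown"` label are hypothetical.

```python
def number_from_morph(values):
    """Map the list returned by token.morph.get("Number") to a label."""
    if not values:
        return "unknown"       # feature absent on this token
    if "Plur" in values:
        return "plural"
    if "Sing" in values:
        return "singular"
    return "unknown"           # some other value, e.g. from another language

print(number_from_morph(["Plur"]))
print(number_from_morph(["Sing"]))
print(number_from_morph([]))
```

An explicit third label lets downstream code decide how to treat tokens with no number information, rather than silently counting them as singular.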

### 2.1.3 Building a Reusable Function

Let us wrap both methods into a single function. We use Python's `Enum` class to give clean labels to our noun number categories.

In [5]:
class Noun_number(Enum):
    SINGULAR = 1
    PLURAL = 2

def get_nouns_number(text, model, method="lemma"):
    nouns = []
    doc = model(text)
    for token in doc:
        if (token.pos_ == "NOUN"):
            if method == "lemma":
                if token.lemma_ != token.text:
                    nouns.append((token.text, Noun_number.PLURAL))
                else:
                    nouns.append((token.text, Noun_number.SINGULAR))
            elif method == "morph":
                morph_number = token.morph.get("Number")
                if morph_number and morph_number[0] == "Plur":
                    nouns.append((token.text, Noun_number.PLURAL))
                else:
                    nouns.append((token.text, Noun_number.SINGULAR))
    return nouns

print("Function defined successfully.")

Function defined successfully.


The function accepts three arguments: the raw text, a spaCy model, and the detection method (`"lemma"` or `"morph"`). It returns a list of tuples pairing each detected noun with its grammatical number, encoded as a `Noun_number` enum value. This clean interface makes it easy to swap models and methods for comparison.

**Note on the original textbook code:** The morph branch in the textbook has a logic inversion (checking `== "Sing"` but appending `PLURAL`). We have corrected this here so that the function behaves as intended.

### 2.1.4 Testing with Irregular Nouns -- Small Model

The real challenge for any noun-number system is **irregular plurals**. The word `"geese"` (plural of `"goose"`) does not follow the regular `-s` suffix rule. Let us see how the small model handles it.

In [6]:
text = "Three geese crossed the road"

nouns_morph = get_nouns_number(text, small_model, "morph")
print("Small model (morph):", nouns_morph)

nouns_lemma = get_nouns_number(text, small_model, "lemma")
print("Small model (lemma):", nouns_lemma)

Small model (morph): [('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]
Small model (lemma): [('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]


In this run, both methods with the small model handle `"geese"` correctly: each labels it plural and `"road"` singular. For the lemma method, this means the lemma differs from the surface form (the expected lemma is `"goose"`); for the morph method, it means the token carries `Number=Plur`.

This result is not guaranteed across model versions. If a lemmatizer leaves `"geese"` unchanged, then `"geese"` $=$ `"geese"` and the lemma method reports singular; likewise, a tagging error that yields `Number=Sing` breaks the morph method. The small model (`en_core_web_sm`, a $12.8$ MB download in the log above) trades some accuracy for speed, so irregular forms are the first place to check for such errors. This is the usual **model capacity vs. accuracy** tradeoff: smaller models are faster but can be less reliable on edge cases.

### 2.1.5 Testing with Irregular Nouns -- Large Model

In [7]:
nouns_morph = get_nouns_number(text, large_model, "morph")
print("Large model (morph):", nouns_morph)

nouns_lemma = get_nouns_number(text, large_model, "lemma")
print("Large model (lemma):", nouns_lemma)

Large model (morph): [('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]
Large model (lemma): [('geese', <Noun_number.PLURAL: 2>), ('road', <Noun_number.SINGULAR: 1>)]


The large model (a $400.7$ MB download) gives the same labels: both the **lemma method** and the **morph method** classify `"geese"` as plural and `"road"` as singular. In the run shown here, all four model/method combinations agree.

| | Small Model | Large Model |
|---|---|---|
| **Lemma method** | Correct ("geese" plural) | Correct ("geese" plural) |
| **Morph method** | Correct ("geese" plural) | Correct ("geese" plural) |

**Practical takeaway:** Agreement on a single sentence does not prove robustness. The lemma and the morphological features come from different pipeline components, so they can disagree on rare irregular forms or across model versions. Test both methods on a sample of your own vocabulary before settling on one.

### 2.1.6 Using an LLM (Optional -- Requires OpenAI API Key)

Large language models like GPT-3.5 can also classify noun number. Their advantage is that they have seen enormous amounts of text and can handle virtually any irregular form. The tradeoff is **cost and latency** -- an API call is orders of magnitude slower and more expensive than a local spaCy lookup.

In [15]:
from google.colab import userdata
OPEN_AI_KEY = userdata.get('OPENAI_API_KEY')

In [17]:
# This cell requires an OpenAI API key in OPEN_AI_KEY
# (loaded from Colab secrets in the previous cell).

from openai import OpenAI
client = OpenAI(api_key=OPEN_AI_KEY)

prompt = ('Decide whether each noun in the following text is singular or plural. '
          'Return the list in the format of a python tuple: (word, number). '
          'Do not provide any additional explanations. '
          'Sentence: Three geese crossed the road.')

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    max_tokens=256,
    top_p=1.0,
    frequency_penalty=0,
    presence_penalty=0,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
)
print(response.choices[0].message.content)

[('geese', 'plural'), ('road', 'singular')]


GPT-3.5 correctly identifies both `"geese"` as plural and `"road"` as singular. LLMs excel at this task because they have internalized English morphology from billions of tokens of training data. However, for a production pipeline processing millions of documents, the per-token cost and latency of API calls make this approach impractical as a default. Use it as a **fallback** for ambiguous cases or as a validation oracle.

**Cost comparison (rough estimates):** spaCy processes a sentence in $\sim 1$ ms at effectively zero marginal cost, while a GPT-3.5 API call takes $\sim 500$ ms and costs roughly $\$0.00005$ per sentence -- adding up to $\sim\$50$ per million sentences.
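The arithmetic behind these figures is easy to reproduce. The snippet below plugs in the rough per-sentence numbers quoted above; they are estimates rather than measurements, and the calculation assumes strictly sequential processing.

```python
SENTENCES = 1_000_000

# Rough figures quoted in the text, not measurements
spacy_ms_per_sentence = 1
llm_ms_per_sentence = 500
llm_usd_per_sentence = 0.00005

# Convert milliseconds to hours for sequential processing
spacy_hours = SENTENCES * spacy_ms_per_sentence / 1000 / 3600
llm_hours = SENTENCES * llm_ms_per_sentence / 1000 / 3600
llm_cost = SENTENCES * llm_usd_per_sentence

print(f"spaCy: {spacy_hours:.2f} h")
print(f"LLM:   {llm_hours:.1f} h sequential, ${llm_cost:.2f}")
```

Under these assumptions, a million sentences take about 0.28 hours locally and about 139 hours through sequential API calls. Batching or parallel requests would shorten the wall-clock time but not the per-sentence cost.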

### 2.1.7 Inflecting Nouns with TextBlob

Beyond detecting number, we often need to **convert** between singular and plural forms -- for example, to normalize all nouns to singular before computing frequency statistics. The **TextBlob** package provides `pluralize()` and `singularize()` methods.

In [19]:
from textblob import TextBlob
import nltk
nltk.download('punkt_tab')

texts = ["book", "goose", "pen", "point", "deer"]
blob_objs = [TextBlob(text) for text in texts]

plurals = [blob_obj.words.pluralize()[0] for blob_obj in blob_objs]
print("Plurals:", plurals)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Plurals: ['books', 'geese', 'pens', 'points', 'deer']


TextBlob handles all five cases correctly, including two tricky irregular forms: `"goose"` $\to$ `"geese"` (vowel change) and `"deer"` $\to$ `"deer"` (zero plural -- the singular and plural forms are identical). Under the hood, TextBlob uses a rule-based inflection engine with a curated exception dictionary for irregular forms.

In [20]:
# Now reverse: plural -> singular
blob_objs = [TextBlob(text) for text in plurals]
singulars = [blob_obj.words.singularize()[0] for blob_obj in blob_objs]
print("Singulars:", singulars)

Singulars: ['book', 'goose', 'pen', 'point', 'deer']


The round-trip is perfect for these five words: singular $\to$ plural $\to$ singular returns exactly the original forms. This **invertibility** matters for data pipelines that normalize nouns to singular for counting and later need to regenerate the plural surface form.

**Production tip:** When building a text normalization pipeline, always validate the round-trip on your domain vocabulary. Domain-specific terms (medical, legal, technical) may have unusual plurals that TextBlob's dictionary does not cover.
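A round-trip check of this kind can be sketched as a small helper that accepts any pair of inflection functions. The helper name `find_round_trip_failures` and the two naive suffix-based functions are hypothetical stand-ins that keep the example self-contained; they are not TextBlob's rules.

```python
def find_round_trip_failures(words, pluralize, singularize):
    """Return (word, plural, recovered) for every word that does not
    survive singular -> plural -> singular."""
    failures = []
    for word in words:
        plural = pluralize(word)
        recovered = singularize(plural)
        if recovered != word:
            failures.append((word, plural, recovered))
    return failures

# Toy stand-ins for an inflection library (suffix-only rules)
def naive_pluralize(word):
    return word + ("es" if word.endswith("s") else "s")

def naive_singularize(word):
    return word[:-1] if word.endswith("s") else word

failures = find_round_trip_failures(["book", "pen", "bus"], naive_pluralize, naive_singularize)
print(failures)
```

With TextBlob installed, the same helper could be called with `lambda w: Word(w).pluralize()` and `lambda w: Word(w).singularize()` (after `from textblob import Word`) to audit a domain vocabulary.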

## 2.2 Getting the Dependency Parse

A **dependency parse** reveals the grammatical structure of a sentence as a tree. Each word is connected to its **head** (the word it depends on) via a labeled arc that describes the grammatical relationship. The **root** of the tree is typically the main verb.

Formally, given a sentence of $n$ tokens $w_1, w_2, \ldots, w_n$, a dependency parse is a set of arcs:

$$\mathcal{D} = \{(w_i, r_{ij}, w_j) \mid w_i \text{ is the head of } w_j \text{ with relation } r_{ij}\}$$

where each $w_j$ (except the root) has exactly one head $w_i$, and the resulting graph forms a **tree**. The relation label $r_{ij}$ comes from a fixed tagset -- spaCy uses a variant of the **ClearNLP** / **Universal Dependencies** scheme, with labels like `nsubj` (nominal subject), `dobj` (direct object), `prep` (prepositional modifier), and so on.

spaCy computes this parse using a **transition-based** neural parser that processes the sentence left-to-right, making shift/reduce decisions at each step. This gives $O(n)$ parsing time, making it extremely fast even for long documents.
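To make the shift/reduce idea concrete, here is a minimal sketch of the textbook **arc-standard** transition system. It is an illustration only: spaCy's parser uses its own transition system and a neural model to choose each action, whereas the action sequence here is supplied by hand for a three-word sentence.

```python
def run_arc_standard(n_tokens, transitions):
    """Apply a given sequence of arc-standard transitions.

    Returns the set of (head, dependent) index pairs and the final stack.
    The transition sequence is supplied by hand here; a real parser
    predicts it with a model.
    """
    stack, buffer, arcs = [], list(range(n_tokens)), set()
    for action in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":      # top of stack heads the item below it
            dependent = stack.pop(-2)
            arcs.add((stack[-1], dependent))
        elif action == "RIGHT-ARC":     # item below heads the top of stack
            dependent = stack.pop()
            arcs.add((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs, stack

tokens = ["She", "eats", "fish"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"]
arcs, stack = run_arc_standard(len(tokens), actions)
for head, dep in sorted(arcs):
    print(f"{tokens[head]} -> {tokens[dep]}")
print("root:", tokens[stack[0]], "| transitions:", len(actions))
```

Each token is shifted once and attached once (except the root), so a sentence of $n$ tokens needs $2n - 1$ transitions here. That count is where the linear running time comes from.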

### 2.2.1 Printing Dependencies

In [21]:
sentence = 'I have seldom heard him mention her under any other name.'

def print_dependencies(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(f"{token.text:<12} {token.dep_:<12} {spacy.explain(token.dep_)}")

print_dependencies(sentence, small_model)

I            nsubj        nominal subject
have         aux          auxiliary
seldom       advmod       adverbial modifier
heard        ROOT         root
him          nsubj        nominal subject
mention      ccomp        clausal complement
her          dobj         direct object
under        prep         prepositional modifier
any          det          determiner
other        amod         adjectival modifier
name         pobj         object of preposition
.            punct        punctuation


This output reveals the complete grammatical skeleton of the sentence. The verb **`"heard"`** is the **ROOT** -- the main predicate that all other words ultimately depend on. Let us trace the key relationships (each arrow below points from a dependent to its head and is labeled with the relation):

- `"I"` $\xrightarrow{\text{nsubj}}$ `"heard"` -- "I" is the nominal subject of "heard"
- `"have"` $\xrightarrow{\text{aux}}$ `"heard"` -- "have" is an auxiliary verb modifying "heard"
- `"mention"` $\xrightarrow{\text{ccomp}}$ `"heard"` -- "mention" is a **clausal complement** of "heard" (i.e., the thing that was heard)
- `"her"` $\xrightarrow{\text{dobj}}$ `"mention"` -- "her" is the direct object of "mention"
- `"name"` $\xrightarrow{\text{pobj}}$ `"under"` -- "name" is the object of the preposition "under"

The **ccomp** relation is particularly interesting. It tells us that `"heard him mention"` is a perception verb construction -- "I heard [him mention her]" -- where the entire clause `"him mention her"` functions as the complement of `"heard"`.

**Why this matters in practice:** The dependency parse is the foundation for **information extraction**, **relation extraction**, and **question answering**. If we want to answer "Who heard what?", we follow the `nsubj` and `ccomp` arcs from the ROOT to get: Subject = "I", Event = "heard him mention her under any other name."

### 2.2.2 Traversing Ancestors

Every token (except the root) has a chain of **ancestors** leading up to the root. This chain is the path through the dependency tree from the token up to the root.

In [22]:
def print_ancestors(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(f"{token.text:<12} {[t.text for t in token.ancestors]}")

print_ancestors(sentence, small_model)

I            ['heard']
have         ['heard']
seldom       ['heard']
heard        []
him          ['mention', 'heard']
mention      ['heard']
her          ['mention', 'heard']
under        ['mention', 'heard']
any          ['name', 'under', 'mention', 'heard']
other        ['name', 'under', 'mention', 'heard']
name         ['under', 'mention', 'heard']
.            ['heard']


The ancestor lists confirm the tree structure. Notice that **`"heard"`** has an empty ancestor list -- it is the root. Every other token's ancestor chain terminates at `"heard"`.

Tracing the chain for `"name"`: name $\to$ under $\to$ mention $\to$ heard. This tells us the grammatical nesting: "name" is governed by the preposition "under", which modifies "mention", which is a complement of "heard". The **depth** of a token in the tree equals the length of its ancestor list. Here, `"name"` has depth $3$, while direct dependents of the root like `"I"` have depth $1$.

**Algorithmic note.** Retrieving the ancestor chain is an $O(d)$ operation where $d$ is the depth of the token. Since dependency trees are typically shallow (average depth $\approx 3$-$5$ for English), this is effectively constant time.
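The $O(d)$ walk can be written out directly. In this sketch the `heads` list is transcribed by hand from the ancestor lists above, with the root pointing to itself; it is a plain-Python stand-in for `token.ancestors`, not spaCy's implementation.

```python
tokens = ["I", "have", "seldom", "heard", "him", "mention",
          "her", "under", "any", "other", "name", "."]
# Head index of each token, transcribed from the ancestor lists above;
# the root ("heard", index 3) points to itself.
heads = [3, 3, 3, 3, 5, 3, 5, 5, 10, 10, 7, 3]

def ancestors(i, heads):
    """Walk head links from token i up to the root: one step per level."""
    chain = []
    while heads[i] != i:
        i = heads[i]
        chain.append(i)
    return chain

for i, tok in enumerate(tokens):
    chain = ancestors(i, heads)
    print(f"{tok:<8} depth={len(chain)} {[tokens[j] for j in chain]}")
```

The loop runs once per ancestor, so its cost is proportional to the token's depth and independent of the sentence length.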

### 2.2.3 Traversing Children

The reverse direction -- from a token to its **children** (dependents) -- reveals which words are directly governed by each head.

In [23]:
def print_children(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(f"{token.text:<12} {[t.text for t in token.children]}")

print_children(sentence, small_model)

I            []
have         []
seldom       []
heard        ['I', 'have', 'seldom', 'mention', '.']
him          []
mention      ['him', 'her', 'under']
her          []
under        ['name']
any          []
other        []
name         ['any', 'other']
.            []


The children view is the **top-down** perspective. The root `"heard"` has five direct children: the subject (`"I"`), auxiliary (`"have"`), adverb (`"seldom"`), clausal complement (`"mention"`), and punctuation (`"."`). Leaf tokens like `"I"`, `"him"`, `"her"` have no children -- they are terminal nodes.

The **branching factor** of the root (number of children) is $5$, which is relatively high. In production NLP systems, high-branching roots often indicate complex sentences that may benefit from clause splitting before downstream processing.
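Branching factors can be computed for every token by inverting the head links. The sketch below again uses a hand-transcribed `heads` list (root pointing to itself) in place of a spaCy `Doc`.

```python
from collections import defaultdict

tokens = ["I", "have", "seldom", "heard", "him", "mention",
          "her", "under", "any", "other", "name", "."]
# Head index of each token, transcribed from the children lists above
heads = [3, 3, 3, 3, 5, 3, 5, 5, 10, 10, 7, 3]

children = defaultdict(list)
for i, head in enumerate(heads):
    if head != i:                 # skip the root's self-loop
        children[head].append(i)

for i, tok in enumerate(tokens):
    kids = children.get(i, [])
    if kids:                      # only print tokens that govern something
        print(f"{tok:<8} branching={len(kids)} {[tokens[k] for k in kids]}")
```

A single pass over the head list builds all child lists at once, which is convenient when flagging sentences whose root governs unusually many tokens.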

### 2.2.4 Left and Right Children

spaCy distinguishes between **left children** (appearing before the head in linear order) and **right children** (appearing after). This distinction is useful for English, which is predominantly **head-initial** for verb phrases but **head-final** for noun phrases.

In [24]:
def print_lefts_and_rights(sentence, model):
    doc = model(sentence)
    for token in doc:
        lefts = [t.text for t in token.lefts]
        rights = [t.text for t in token.rights]
        print(f"{token.text:<12} lefts={str(lefts):<30} rights={rights}")

print_lefts_and_rights(sentence, small_model)

I            lefts=[]                             rights=[]
have         lefts=[]                             rights=[]
seldom       lefts=[]                             rights=[]
heard        lefts=['I', 'have', 'seldom']        rights=['mention', '.']
him          lefts=[]                             rights=[]
mention      lefts=['him']                        rights=['her', 'under']
her          lefts=[]                             rights=[]
under        lefts=[]                             rights=['name']
any          lefts=[]                             rights=[]
other        lefts=[]                             rights=[]
name         lefts=['any', 'other']               rights=[]
.            lefts=[]                             rights=[]


The left/right split reveals the **linear ordering** within the tree. For the root `"heard"`, all three left children (`"I"`, `"have"`, `"seldom"`) are pre-verbal elements (subject, auxiliary, adverb), while the right children (`"mention"`, `"."`) are post-verbal (complement, punctuation). This pattern is characteristic of **SVO (Subject-Verb-Object)** word order in English.

For the noun `"name"`, both modifiers (`"any"`, `"other"`) are left children -- adjectives and determiners precede their noun head in English. This left/right distinction is valuable for **text generation** tasks where you need to reconstruct grammatical word order from dependency relations.
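The left/right split is simply a comparison of token positions. A minimal sketch on the same hand-transcribed head list (an assumption standing in for the parsed `Doc`):

```python
tokens = ["I", "have", "seldom", "heard", "him", "mention",
          "her", "under", "any", "other", "name", "."]
# Head index of each token, transcribed from the output above (root points to itself)
heads = [3, 3, 3, 3, 5, 3, 5, 5, 10, 10, 7, 3]

def lefts_and_rights(i, heads):
    """Split the children of token i by whether they precede or follow it."""
    kids = [j for j, h in enumerate(heads) if h == i and j != i]
    return [j for j in kids if j < i], [j for j in kids if j > i]

for name in ("heard", "mention", "name"):
    i = tokens.index(name)
    lefts, rights = lefts_and_rights(i, heads)
    print(f"{name:<8} lefts={[tokens[j] for j in lefts]} rights={[tokens[j] for j in rights]}")
```

Because the children are visited in index order, each list preserves the linear order of the sentence, which is what a generation step would need to re-linearize a head and its dependents.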

### 2.2.5 Subtrees

The **subtree** of a token is the set of all tokens reachable by following child links downward. It represents the complete phrase headed by that token.

In [25]:
def print_subtree(sentence, model):
    doc = model(sentence)
    for token in doc:
        print(f"{token.text:<12} {[t.text for t in token.subtree]}")

print_subtree(sentence, small_model)

I            ['I']
have         ['have']
seldom       ['seldom']
heard        ['I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.']
him          ['him']
mention      ['him', 'mention', 'her', 'under', 'any', 'other', 'name']
her          ['her']
under        ['under', 'any', 'other', 'name']
any          ['any']
other        ['other']
name         ['any', 'other', 'name']
.            ['.']


The subtrees reveal **complete phrases** embedded within the sentence:

- `"heard"` $\to$ the entire sentence (it is the root, so its subtree is everything)
- `"mention"` $\to$ the full clausal complement: "him mention her under any other name"
- `"under"` $\to$ the prepositional phrase "under any other name"
- `"name"` $\to$ the noun phrase "any other name"

**This is one of the most powerful features of the dependency parse.** By extracting the subtree of any token, you get a complete, grammatically coherent phrase. This is exactly what we use in the next sections to extract noun chunks, subjects, and objects.

The subtree size of the root equals $n$ (the sentence length), while leaf nodes have subtree size $1$. For this sentence ($n = 12$), the sum of all subtree sizes is $1 + 1 + 1 + 12 + 1 + 7 + 1 + 4 + 1 + 1 + 3 + 1 = 34$.
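The total of $34$ can be verified with a few lines of Python. Each token belongs to its own subtree and to the subtree of each of its ancestors, so the sum of subtree sizes equals $n$ plus the sum of all token depths. The `heads` list below is transcribed by hand from the parse shown earlier in this section, with the root pointing to itself.

```python
# Head index of each token for the example sentence (root "heard" = index 3)
heads = [3, 3, 3, 3, 5, 3, 5, 5, 10, 10, 7, 3]

def subtree_sizes(heads):
    sizes = [0] * len(heads)
    for i in range(len(heads)):
        node = i
        sizes[node] += 1            # a token belongs to its own subtree
        while heads[node] != node:  # ...and to the subtree of every ancestor
            node = heads[node]
            sizes[node] += 1
    return sizes

def depth(i, heads):
    d = 0
    while heads[i] != i:
        i = heads[i]
        d += 1
    return d

sizes = subtree_sizes(heads)
total_depth = sum(depth(i, heads) for i in range(len(heads)))
print(sizes, sum(sizes))
print(len(heads) + total_depth)
```

Both printed totals are $34$: $12$ tokens plus a combined depth of $22$.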

## 2.3 Extracting Noun Chunks

**Noun chunks** (also called base **noun phrases** or NPs) are contiguous spans of text built around a noun and the words that describe it (determiners, adjectives, compound nouns, etc.). They are flat: a chunk does not contain another noun phrase nested inside it. For example, in *"The big red apple fell on the scared cat"*, the noun chunks are *"The big red apple"* and *"the scared cat"*.

spaCy computes noun chunks from the dependency parse by identifying noun heads and collecting their left-branching dependents. This is exposed via the `doc.noun_chunks` property.

**Why noun chunks matter:** They are the basic building blocks for **named entity recognition**, **relation extraction**, **knowledge graph construction**, and **text summarization**. Extracting "who/what" from a sentence almost always starts with identifying the noun chunks.

In [26]:
def print_noun_chunks(text, model):
    doc = model(text)
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text)

# Sample text (from Sherlock Holmes)
sherlock_text = (
    "To Sherlock Holmes she is always the woman. I have seldom heard "
    "him mention her under any other name. In his eyes she eclipses and predominates "
    "the whole of her sex. It was not that he felt any emotion akin to love for "
    "Irene Adler. All emotions, and that one particularly, were abhorrent to his "
    "cold, precise but admirably balanced mind."
)

print_noun_chunks(sherlock_text, small_model)

Sherlock Holmes
she
the woman
I
him
her
any other name
his eyes
she
the whole
her sex
It
he
any emotion
Irene Adler
All emotions
his cold, precise but admirably balanced mind


The noun chunk extractor identifies **17 noun phrases** in this passage. Several observations stand out:

**Pronouns are included.** spaCy treats pronouns (`"she"`, `"I"`, `"him"`, `"he"`, `"It"`) as single-word noun chunks. This is linguistically correct -- pronouns substitute for noun phrases and occupy the same syntactic positions.

**Complex noun phrases are captured whole.** The chunk `"his cold, precise but admirably balanced mind"` spans $8$ tokens and includes a possessive determiner, three coordinated adjectives, an adverb, and the head noun. The dependency parse ensures all modifiers are collected.

**Proper nouns work correctly.** Both `"Sherlock Holmes"` (two tokens) and `"Irene Adler"` (two tokens) are extracted as complete noun chunks.

**What is NOT a noun chunk.** Notice that `"under any other name"` is not listed as a chunk -- only `"any other name"` is, because the preposition `"under"` is not part of the noun phrase itself. Similarly, verbs and adverbs are excluded. Noun chunks are strictly nominal.

### 2.3.1 Noun Chunk Properties

Noun chunks are spaCy `Span` objects and inherit all their properties: start/end indices, the containing sentence, a root token, and a similarity method (for models with word vectors).

In [27]:
def explore_properties(sentence, model):
    doc = model(sentence)
    other_span = "emotions"
    other_doc = model(other_span)
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk.text)
        print(f"  Start/End indices:  {noun_chunk.start}, {noun_chunk.end}")
        print(f"  Sentence:           {str(noun_chunk.sent)[:60]}...")
        print(f"  Root word:          {noun_chunk.root.text}")
        print(f"  Similarity to '{other_span}': {noun_chunk.similarity(other_doc):.4f}")
        print()
    print(f"Whole sentence similarity to '{other_span}': {doc.similarity(other_doc):.4f}")

sentence = ("All emotions, and that one particularly, were abhorrent "
            "to his cold, precise but admirably balanced mind.")

print("=== Small Model ===")
explore_properties(sentence, small_model)

=== Small Model ===
All emotions
  Start/End indices:  0, 2
  Sentence:           All emotions, and that one particularly, were abhorrent to h...
  Root word:          emotions
  Similarity to 'emotions': 0.5082

his cold, precise but admirably balanced mind
  Start/End indices:  11, 19
  Sentence:           All emotions, and that one particularly, were abhorrent to h...
  Root word:          mind
  Similarity to 'emotions': 0.0127

Whole sentence similarity to 'emotions': 0.1021


With the small model, similarity scores are computed using **context-sensitive tensors** from the tagger/parser rather than dedicated word vectors. This leads to somewhat arbitrary similarity values -- for instance, `"All emotions"` has a similarity of only $0.5082$ to the word `"emotions"`, and `"his cold, precise but admirably balanced mind"` scores a near-zero $0.0127$. Note also that the small model's parse yields only two noun chunks for this sentence.

The small model does not ship with pretrained word vectors, so `Span.similarity()` falls back to comparing internal representations that were not optimized for semantic similarity. spaCy normally emits a `UserWarning` about this; it does not appear here because the setup cell calls `warnings.filterwarnings('ignore')`.

In [28]:
print("=== Large Model ===")
explore_properties(sentence, large_model)

=== Large Model ===
All emotions
  Start/End indices:  0, 2
  Sentence:           All emotions, and that one particularly, were abhorrent to h...
  Root word:          emotions
  Similarity to 'emotions': 0.8877

that one
  Start/End indices:  4, 6
  Sentence:           All emotions, and that one particularly, were abhorrent to h...
  Root word:          one
  Similarity to 'emotions': 0.3738

his cold, precise but admirably balanced mind
  Start/End indices:  11, 19
  Sentence:           All emotions, and that one particularly, were abhorrent to h...
  Root word:          mind
  Similarity to 'emotions': 0.5161

Whole sentence similarity to 'emotions': 0.5706


The large model produces much more interpretable similarity scores. `"All emotions"` has a cosine similarity of **0.8877** to `"emotions"` -- high, as expected, since the chunk contains the word itself. `"his cold, precise but admirably balanced mind"` scores a moderate **0.5161**, which makes semantic sense: a "mind" described with an affect-laden word like "cold" is related to "emotions." The large model also finds a third chunk, `"that one"`, whose generic words score lowest at **0.3738**.

The similarity is computed as the **cosine similarity** between the average word vectors of each span:

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum_{i=1}^{300} u_i v_i}{\sqrt{\sum_{i=1}^{300} u_i^2} \cdot \sqrt{\sum_{i=1}^{300} v_i^2}}$$

where $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{300}$ are the averaged vectors. The whole-sentence similarity ($0.5706$) falls between the chunk scores: the sentence average mixes emotion-related words (`"emotions"`, `"abhorrent"`, `"cold"`) with function words that dilute it.
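The formula translates directly into plain Python. The sketch below averages the vectors of each span and then applies the cosine formula. The three-dimensional vectors are made up for illustration (real spaCy vectors have $300$ dimensions), and returning $0.0$ for a zero vector is a choice of this sketch rather than documented library behavior.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                       # sketch convention for zero vectors
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "word vectors" for two spans
span_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # averages to [0.5, 0.5, 0.0]
span_b = [[1.0, 1.0, 0.0]]
span_c = [[0.0, 0.0, 1.0]]

sim_ab = cosine(mean_vector(span_a), mean_vector(span_b))
sim_ac = cosine(mean_vector(span_a), mean_vector(span_c))
print(round(sim_ab, 4), round(sim_ac, 4))
```

Averaging discards word order and lets frequent, generic words dilute the result, which is one reason sentence-level scores are harder to interpret than chunk-level ones.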

**Production insight:** If you need reliable semantic similarity for downstream tasks (search, clustering, recommendation), always use a model with pretrained word vectors or, better yet, a sentence transformer model (e.g., `sentence-transformers/all-MiniLM-L6-v2`). We will explore this more in Chapter 3.

## 2.4 Extracting Subjects and Objects of the Sentence

One of the most common information extraction tasks is identifying **who did what to whom**. This maps directly onto the grammatical concepts of **subject** (who), **verb** (did what), and **object** (to whom). Using the dependency parse, we can extract these programmatically.

The key dependency labels we target are: `nsubj` (nominal subject), `nsubjpass` (passive subject), `dobj` (direct object), `dative` (indirect/dative object), and `pobj` (object of preposition). We use the `token.subtree` property (from Section 2.2) to expand each key token into its full phrase.
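To make the subtree expansion concrete, here is a pure-Python sketch that mimics it on a hand-written parse. The function `subtree_indices` and the `heads`/`deps` lists are assumptions for illustration (they are not spaCy output); the sketch follows the convention that the root token is its own head.

```python
def subtree_indices(heads, i):
    """Indices of token i and all of its descendants, in sentence order.

    heads[j] is the index of token j's head; the root points to itself.
    """
    children = {}
    for j, h in enumerate(heads):
        if j != h:
            children.setdefault(h, []).append(j)
    found, stack = [], [i]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)

# Hand-written (hypothetical) parse of one sentence.
words = ["The", "big", "black", "cat", "stared", "at", "the", "small", "dog", "."]
heads = [3, 3, 3, 4, 4, 4, 8, 8, 5, 4]
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj", "punct"]

# Find the first subject token, then expand it to its full phrase.
subject_index = next(i for i, dep in enumerate(deps) if "subj" in dep)
span = subtree_indices(heads, subject_index)
start, end = span[0], span[-1] + 1  # end is exclusive, like a spaCy span
subject_phrase = " ".join(words[start:end])
print(subject_phrase)  # The big black cat
```

Taking the first and last index of the subtree works here because the subtree is contiguous; the functions in this section rely on the same property.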

### 2.4.1 Subject and Direct Object Functions

In [29]:
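def get_subject_phrase(doc):
    """Return the span covering the first subject's subtree, or None."""
    for token in doc:
        # Substring test: matches nsubj, nsubjpass, and any other label containing "subj"
        if "subj" in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1  # span end is exclusive
            return doc[start:end]
    return None

def get_object_phrase(doc):
    """Return the span covering the first direct object's subtree, or None."""
    for token in doc:
        if "dobj" in token.dep_:
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1  # span end is exclusive
            return doc[start:end]
    return None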

sentences = [
    "The big black cat stared at the small dog.",
    "Jane watched her brother in the evenings.",
    "Laura gave Sam a very interesting book."
]

for sentence in sentences:
    doc = small_model(sentence)
    subject_phrase = get_subject_phrase(doc)
    object_phrase = get_object_phrase(doc)
    print(sentence)
    print(f"  Subject:       {subject_phrase}")
    print(f"  Direct object: {object_phrase}")
    print()

The big black cat stared at the small dog.
  Subject:       The big black cat
  Direct object: None

Jane watched her brother in the evenings.
  Subject:       Jane
  Direct object: her brother

Laura gave Sam a very interesting book.
  Subject:       Laura
  Direct object: a very interesting book



The results demonstrate three different sentence structures:

**Sentence 1:** *"The big black cat stared at the small dog."* The subject is `"The big black cat"` (a noun phrase with two adjective modifiers). The direct object is `None` because `"stared"` is an **intransitive verb** -- it does not take a direct object. The dog is not a direct object but rather the object of the preposition `"at"` (a `pobj`, not a `dobj`).

**Sentence 2:** *"Jane watched her brother in the evenings."* The subject is `"Jane"` and the direct object is `"her brother"`. The phrase `"in the evenings"` is an adverbial prepositional phrase modifying the verb, not an object.

**Sentence 3:** *"Laura gave Sam a very interesting book."* The subject is `"Laura"` and the direct object is `"a very interesting book"`. But where is `"Sam"`? Sam is the **indirect (dative) object** -- the recipient of the giving. Our `get_object_phrase` function only looks for `dobj`, so it misses the dative. We address this next.

**Linguistic note.** The verb `"gave"` is **ditransitive** -- it takes both a direct object (the thing given) and an indirect/dative object (the recipient). The dependency parse assigns `dative` to `"Sam"` and `dobj` to `"book"`, correctly distinguishing the two roles.

### 2.4.2 A Generalized Phrase Extractor

We can unify subject, object, and dative extraction into a single function parameterized by the dependency label to search for.

In [30]:
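def get_phrase(doc, phrase):
    """Return the subtree span of the first token whose dep_ contains `phrase`, or None."""
    for token in doc:
        if phrase in token.dep_:  # substring test, e.g. "subj" matches "nsubj"
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1  # span end is exclusive
            return doc[start:end]
    return None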

sentence = "Laura gave Sam a very interesting book."
doc = small_model(sentence)

subject_phrase = get_phrase(doc, "subj")
object_phrase = get_phrase(doc, "obj")
dative_phrase = get_phrase(doc, "dative")

print(sentence)
print(f"  Subject:        {subject_phrase}")
print(f"  Direct object:  {object_phrase}")
print(f"  Dative object:  {dative_phrase}")

Laura gave Sam a very interesting book.
  Subject:        Laura
  Direct object:  a very interesting book
  Dative object:  Sam


Now we capture all three roles: Laura (agent/subject) gave Sam (recipient/dative) a very interesting book (theme/direct object). This maps onto the classic **thematic roles** from linguistics:

$$\text{AGENT} \xrightarrow{\text{gave}} \text{THEME} \xrightarrow{\text{to}} \text{RECIPIENT}$$

$$\text{Laura} \xrightarrow{\text{gave}} \text{a very interesting book} \xrightarrow{\text{to}} \text{Sam}$$

**Production application.** This kind of extraction is the backbone of **knowledge graph construction**. From this single sentence, we could create a structured triple: `(Laura, gave, book)` with the attribute `recipient=Sam`. At scale, extracting such triples from millions of documents builds the knowledge bases that power search engines and recommendation systems.
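As a rough illustration of the triple idea, the sketch below stores the extracted roles in a small record and adds it to an adjacency-list graph. The `Triple` class, `build_triple` helper, and `graph` structure are hypothetical names chosen for this example, not part of spaCy or of any particular knowledge-graph library.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    attributes: tuple = ()  # extra (key, value) pairs, e.g. the recipient

def build_triple(subject, verb, direct_object, recipient=None):
    """Package extracted phrases as a triple, attaching the recipient if present."""
    attributes = (("recipient", recipient),) if recipient else ()
    return Triple(subject, verb, direct_object, attributes)

# Values taken from the example sentence "Laura gave Sam a very interesting book."
triple = build_triple("Laura", "gave", "book", recipient="Sam")

# A minimal graph: subject -> list of (predicate, object) edges.
graph = defaultdict(list)
graph[triple.subject].append((triple.predicate, triple.obj))

print(triple)
print(dict(graph))
```

In practice the string arguments would come from the spans returned by `get_phrase`, typically reduced to their root words.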

### 2.4.3 Extracting Prepositional Phrase Objects

Unlike subjects and direct objects (which we treat as unique per clause), a sentence can contain **multiple prepositional phrases**, so the function below collects every match into a list instead of returning only the first one.

In [31]:
def get_prepositional_phrase_objs(doc):
    prep_spans = []
    for token in doc:
        if ("pobj" in token.dep_):
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            prep_spans.append(doc[start:end])
    return prep_spans

sentences = [
    "The big black cat stared at the small dog.",
    "Jane watched her brother in the evenings."
]

for sentence in sentences:
    doc = small_model(sentence)
    subject_phrase = get_phrase(doc, "subj")
    object_phrase = get_phrase(doc, "obj")
    prep_objs = get_prepositional_phrase_objs(doc)
    print(sentence)
    print(f"  Subject:              {subject_phrase}")
    print(f"  Direct object:        {object_phrase}")
    print(f"  Prepositional objects: {prep_objs}")
    print()

The big black cat stared at the small dog.
  Subject:              The big black cat
  Direct object:        the small dog
  Prepositional objects: [the small dog]

Jane watched her brother in the evenings.
  Subject:              Jane
  Direct object:        her brother
  Prepositional objects: [the evenings]



Two interesting observations emerge:

**Sentence 1** shows `"the small dog"` appearing as **both** a direct object and a prepositional object. This is because our `get_phrase(doc, "obj")` uses substring matching -- `"obj"` is a substring of `"pobj"`. So the prepositional object `pobj` is incorrectly caught by the "obj" search. Using the exact label `"dobj"` instead of `"obj"` would avoid this issue. This is an important lesson: **always verify that your string matching is precise enough**.

**Sentence 2** correctly identifies `"the evenings"` as a prepositional object (from "in the evenings").
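The substring pitfall from Sentence 1 can be reproduced without a parser. In this sketch the `(text, label)` pairs are written by hand and assumed to mirror the parse discussed above; `find_first` is a hypothetical helper that switches between substring and exact matching.

```python
# Hand-written (text, dependency label) pairs for the key tokens of
# "The big black cat stared at the small dog."
tokens = [("cat", "nsubj"), ("stared", "ROOT"), ("at", "prep"), ("dog", "pobj")]

def find_first(tokens, label, exact=False):
    """Return the text of the first token whose label matches, else None."""
    for text, dep in tokens:
        matched = (dep == label) if exact else (label in dep)
        if matched:
            return text
    return None

loose = find_first(tokens, "obj")                 # substring test also hits "pobj"
strict = find_first(tokens, "dobj", exact=True)   # no dobj in this sentence
print(loose, strict)  # dog None
```

Exact comparison (`token.dep_ == "dobj"`) is the stricter option; substring matching remains convenient when a family of labels such as `nsubj` and `nsubjpass` should be treated alike.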

**Exercise for the reader:** The current `get_prepositional_phrase_objs` function returns only the noun phrase inside the prepositional phrase (e.g., `"the small dog"` rather than `"at the small dog"`). Modify the function to include the preposition itself by also collecting the preposition token in the span.

## 2.5 Finding Patterns in Text Using Grammatical Information

So far, we have used spaCy's built-in properties (dependency labels, noun chunks) to extract grammatical structures. But what if we want to find **custom patterns** -- for example, all verb phrases of a specific structure?

spaCy's **`Matcher`** is a rule-based pattern matching engine that operates on token attributes (POS tags, dependency labels, morphological features, text, shape, and more). Think of it as **regular expressions for linguistic structure** -- instead of matching character patterns, you match token-level grammatical patterns.

### 2.5.1 Defining Verb Phrase Patterns

We define four patterns that capture different verb phrase structures in English: a simple verb (e.g., "paints"), an auxiliary followed by a verb (e.g., "was observing"), an auxiliary followed by an adjective -- a copular construction (e.g., "were late"), and an auxiliary followed by a verb and a preposition -- a phrasal verb (e.g., "were staring at").

In [32]:
from spacy.matcher import Matcher

matcher = Matcher(small_model.vocab)

patterns = [
    [{"POS": "VERB"}],                                  # Simple verb
    [{"POS": "AUX"}, {"POS": "VERB"}],                  # Auxiliary + verb
    [{"POS": "AUX"}, {"POS": "ADJ"}],                   # Copular construction
    [{"POS": "AUX"}, {"POS": "VERB"}, {"POS": "ADP"}]   # Phrasal verb
]
matcher.add("Verb", patterns)

print(f"Matcher initialized with {len(patterns)} patterns under label 'Verb'.")

Matcher initialized with 4 patterns under label 'Verb'.


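Each pattern is a **list of dictionaries**, where each dictionary specifies constraints on a single token. Here, we only use the `POS` (part-of-speech) key, but the Matcher supports many more, including `LEMMA`, `DEP`, `MORPH`, `SHAPE`, `LENGTH`, `IS_ALPHA`, and `TEXT`. Regular expressions are available through the `REGEX` operator, which is applied to a string attribute (for example `{"TEXT": {"REGEX": "^un"}}`) rather than used as a key of its own.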
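A short sketch of some of these other attributes follows. It uses a blank English pipeline (tokenizer only), so no trained model is needed; the names `demo_matcher`, `NAME`, and `LONG_WORD` and the example sentence are assumptions made for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline provides tokenization and lexical attributes only,
# which is enough for TEXT, IS_ALPHA, and LENGTH constraints.
nlp = spacy.blank("en")
demo_matcher = Matcher(nlp.vocab)

# "Sherlock" or "sherlock", followed by any alphabetic token.
demo_matcher.add(
    "NAME",
    [[{"TEXT": {"REGEX": "^[Ss]herlock$"}}, {"IS_ALPHA": True}]],
)
# Any single token that is at least 10 characters long.
demo_matcher.add("LONG_WORD", [[{"LENGTH": {">=": 10}}]])

demo_doc = nlp("To Sherlock Holmes she eclipses and predominates.")
found = [
    (nlp.vocab.strings[match_id], demo_doc[start:end].text)
    for match_id, start, end in demo_matcher(demo_doc)
]
print(sorted(found))
```

Patterns that rely on `POS`, `DEP`, or `MORPH` would need a trained pipeline such as `small_model`, because a blank pipeline does not assign those annotations.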

The Matcher compiles these patterns into an efficient lookup structure. At match time, it scans the document in $O(n \cdot m)$ time where $n$ is the document length and $m$ is the total pattern length -- very fast for typical pattern sets.

### 2.5.2 Running the Matcher on Text

In [33]:
sherlock_text = (
    "To Sherlock Holmes she is always the woman. I have seldom heard "
    "him mention her under any other name. In his eyes she eclipses and predominates "
    "the whole of her sex. It was not that he felt any emotion akin to love for "
    "Irene Adler. All emotions, and that one particularly, were abhorrent to his "
    "cold, precise but admirably balanced mind. He was, I take it, the most perfect "
    "reasoning and observing machine that the world has seen. As a lover, he would "
    "have placed himself in a false position. He never spoke of the softer passions, "
    "save with a gibe and a sneer. They were admirable things for the observer - "
    "excellent for drawing the veil from men's motives and actions. But for the "
    "trained reasoner to admit such intrusions into his own delicate and finely "
    "adjusted temperament was to introduce a distracting factor which might throw "
    "a doubt upon all his mental results."
)

doc = small_model(sherlock_text)
matches = matcher(doc)

print(f"Found {len(matches)} matches:\n")
for match_id, start, end in matches:
    string_id = small_model.vocab.strings[match_id]
    span = doc[start:end]
    print(f"  [{start:>3}:{end:<3}]  {span.text}")

Found 23 matches:

  [ 12:13 ]  heard
  [ 14:15 ]  mention
  [ 25:26 ]  eclipses
  [ 27:28 ]  predominates
  [ 39:40 ]  felt
  [ 44:45 ]  love
  [ 57:59 ]  were abhorrent
  [ 73:74 ]  take
  [ 81:82 ]  observing
  [ 86:88 ]  has seen
  [ 87:88 ]  seen
  [ 95:97 ]  have placed
  [ 96:97 ]  placed
  [105:106]  spoke
  [111:112]  save
  [120:122]  were admirable
  [129:130]  drawing
  [142:143]  trained
  [145:146]  admit
  [154:155]  adjusted
  [158:159]  introduce
  [163:165]  might throw
  [164:165]  throw


The Matcher finds **23 matches** across the Sherlock Holmes passage. Let us analyze the results by pattern type:

**Simple verbs** (Pattern 1: `[VERB]`): `"heard"`, `"mention"`, `"eclipses"`, `"predominates"`, `"felt"`, `"love"`, `"take"`, `"observing"`, `"seen"`, `"placed"`, `"spoke"`, `"save"`, `"drawing"`, `"trained"`, `"admit"`, `"adjusted"`, `"introduce"`, `"throw"` -- these 18 matches are the most frequent because a single verb always matches Pattern 1. A few of them, such as `"love"` (in "akin to love") and `"save"` (in "save with a gibe"), are arguably tagging errors by the small model rather than true verbs.

**Auxiliary + Verb** (Pattern 2: `[AUX, VERB]`): `"has seen"` at $[86{:}88]$, `"have placed"` at $[95{:}97]$, and `"might throw"` at $[163{:}165]$. The first two are **perfect** constructions; the third is a **modal** construction.

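**Auxiliary + Adjective** (Pattern 3: `[AUX, ADJ]`): `"were abhorrent"` at $[57{:}59]$ and `"were admirable"` at $[120{:}122]$ -- these are **copular constructions** (linking verb + predicate adjective).

**Auxiliary + Verb + Preposition** (Pattern 4: `[AUX, VERB, ADP]`): no span in this passage matches, so the phrasal-verb pattern contributes nothing to the output.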

**Overlap issue.** Notice that `"has seen"` matches as `[AUX, VERB]` at $[86{:}88]$, while `"seen"` also matches independently as `[VERB]` at $[87{:}88]$. The same happens for `"have placed"` and `"might throw"`. This is because the Matcher returns **all** matching spans, including those that overlap. In a production system, you would want to filter out subsumed matches -- keeping only the longest span at each position.

**Filtering overlapping matches** is left as an exercise. One common approach: sort matches by length (descending), then greedily keep only non-overlapping spans.
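One possible sketch of that greedy approach, written over plain `(start, end)` pairs so it runs without a model. The function name `keep_longest_spans` is hypothetical; breaking length ties by earlier start is a choice made here.

```python
def keep_longest_spans(spans):
    """Greedy overlap filter: prefer longer spans, then earlier ones."""
    # Sort by length (descending), breaking ties by start position.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept, used = [], set()
    for start, end in ordered:
        # Keep the span only if none of its token positions is already taken.
        if not any(i in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)

# A few (start, end) pairs copied from the matcher output above.
sample = [(81, 82), (86, 88), (87, 88), (95, 97), (96, 97), (163, 165), (164, 165)]
filtered = keep_longest_spans(sample)
print(filtered)
# [(81, 82), (86, 88), (95, 97), (163, 165)]
```

When working with `Span` objects rather than index pairs, spaCy's `spacy.util.filter_spans` helper applies a similar longest-span-first rule.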

### 2.5.3 See Also -- Additional Matching Capabilities

The spaCy `Matcher` supports far more than POS-based patterns. You can match on the text itself, word length, alphanumeric status, punctuation, the `dep_` and `morph` attributes, lemma, entity type, and others. You can also use regular expressions within patterns. For full documentation, see the spaCy rule-based matching guide at https://spacy.io/usage/rule-based-matching.

This capability connects forward to **Chapter 7** (Visualizing Text Data), where we use **displaCy** to visualize dependency parses graphically, and to **Chapter 4** (Named Entity Recognition), where entity patterns can be combined with grammatical patterns for more sophisticated information extraction.

## Chapter Summary

This chapter covered five fundamental techniques for grammatical analysis with spaCy:

**2.1 Noun Number Detection.** We compared four approaches (the lemma and morph methods, each with the small and the large model) and found that the **lemma method with `en_core_web_lg`** is most reliable for irregular nouns. TextBlob provides clean `pluralize()`/`singularize()` conversion, including for irregular forms like "goose" $\leftrightarrow$ "geese" and zero-plurals like "deer."

**2.2 Dependency Parsing.** The dependency parse represents sentence structure as a directed tree $T = (V, A)$ rooted at the main verb. We explored four traversal methods -- ancestors ($O(d)$ upward), children (one-level down), left/right children (respecting word order), and subtrees (complete dominated phrases).

**2.3 Noun Chunks.** spaCy's `doc.noun_chunks` extracts complete noun phrases as `Span` objects. With the large model, these spans support meaningful cosine similarity via averaged $300$-dimensional word vectors.

**2.4 Subject/Object Extraction.** By searching for tokens with specific dependency labels (`nsubj`/`nsubjpass`, `dobj`, `dative`, `pobj`) and expanding them via `token.subtree`, we can extract the core semantic roles from simple sentences -- the foundation for relation extraction and knowledge graph construction. Matching on a substring such as `"obj"` also catches `pobj`, so the label test should be as precise as the task requires.

**2.5 Rule-Based Pattern Matching.** spaCy's `Matcher` enables regex-like pattern matching over token attributes. We defined four verb phrase patterns and found $23$ matches in a short text, including both simple verbs and multi-word constructions.

**Cross-chapter connections:** The **word vectors** used for similarity in Section 2.3 will be explored in depth in **Chapter 3** (Semantic Similarity). The **displaCy** visualization of dependency trees is covered in **Chapter 7** (Visualizing Text Data). **Named entity recognition** (Chapter 4) builds directly on the noun chunk and dependency parsing infrastructure from this chapter. The **Matcher** patterns from Section 2.5 can be combined with entity patterns for more sophisticated information extraction pipelines.