# 📘 Learning NLP Basics

This notebook covers the foundational text-preprocessing steps every NLP pipeline needs.  
By the end you will be able to:

| Skill | Tool(s) |
|---|---|
| Divide text into **sentences** | NLTK · spaCy |
| Divide sentences into **words** (tokenization) | NLTK · spaCy |
| **Part-of-speech** tagging | NLTK · spaCy |
| Combine similar words (**lemmatization**) | NLTK · spaCy |
| Remove **stopwords** | NLTK · spaCy |

> **Packages used:** `nltk`, `spacy`

---

## 0 - Environment Setup

Install and download all the resources we need for this notebook.

In [1]:
# Install packages (uncomment if needed)
# !pip install nltk spacy

# Download NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('omw-1.4', quiet=True)
print("✅ NLTK resources downloaded")

✅ NLTK resources downloaded


In [2]:
# Download spaCy English model
import subprocess, sys
subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm", "-q"])

import spacy
nlp = spacy.load("en_core_web_sm")
print("✅ spaCy model loaded")

✅ spaCy model loaded


## Helper - File Utility

A simple function to read a plain-text file (mirrors the `file_utils` notebook from the book's repo).

In [3]:
def read_text_file(filename: str) -> str:
    """Read and return the entire contents of a text file."""
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()

## Sample Text - *The Adventures of Sherlock Holmes*

We will use a short excerpt from Arthur Conan Doyle's story as our running example.  
(The book's GitHub repo stores this in `data/sherlock_holmes_1.txt`.)

In [4]:
# For this notebook we embed the sample text directly.
# Replace with read_text_file("../data/sherlock_holmes_1.txt") if you have the file.

sherlock_holmes_part_of_text = """To Sherlock Holmes she is always the woman. I have seldom heard \
him mention her under any other name. In his eyes she eclipses and \
predominates the whole of her sex. It was not that he felt any emotion \
akin to love for Irene Adler. All emotions, and that one particularly, \
were abhorrent to his cold, precise but admirably balanced mind. He was, \
I take it, the most perfect reasoning and observing machine that the \
world has seen, but as a lover he would have placed himself in a false \
position. He never spoke of the softer passions, save with a gibe and a \
sneer. They were admirable things for the observer—excellent for drawing \
the veil from men's motives and actions. But for the trained reasoner to \
admit such intrusions into his own delicate and finely adjusted \
temperament was to introduce a distracting factor which might throw a \
doubt upon all his mental results."""

print(sherlock_holmes_part_of_text[:200], "...")
print(f"\nTotal characters: {len(sherlock_holmes_part_of_text)}")

To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotio ...

Total characters: 872


---
## 1 - Dividing Text into Sentences

Sentences are the main processing unit in many NLP tasks (e.g., providing context to LLMs).

Simply splitting on periods (`.`) is **not** reliable because:
- Periods appear in abbreviations ("Dr. Smith will see you now.")
- Capital letters appear in proper nouns, not just sentence starts

Both NLTK and spaCy provide robust sentence tokenizers that handle these edge cases.
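
To make the first pitfall concrete, here is a minimal standard-library sketch that splits the abbreviation example on periods. The second sentence and the variable names are illustrative additions, not part of the original notebook.

```python
# Naive sentence splitting on "." (illustration only, not a recommended approach).
text = "Dr. Smith will see you now. Please take a seat."

# Split on every period and drop empty fragments.
naive_sentences = [part.strip() for part in text.split(".") if part.strip()]

for i, sent in enumerate(naive_sentences, 1):
    print(f"[{i}] {sent}")

# The text contains 2 sentences, but the period in "Dr." adds a spurious break.
print(f"Naive split found {len(naive_sentences)} fragments")
```

The abbreviation is cut off as its own fragment, which is the kind of error a trained sentence tokenizer is meant to avoid.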

---
### 1.1 Using NLTK

In [5]:
import nltk

# Load the Punkt sentence tokenizer for English
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

# Tokenize
sentences_nltk = tokenizer.tokenize(sherlock_holmes_part_of_text)

# Display results
for i, sent in enumerate(sentences_nltk, 1):
    print(f"  [{i}] {sent}")

print(f"\n→ Total sentences (NLTK): {len(sentences_nltk)}")

  [1] To Sherlock Holmes she is always the woman.
  [2] I have seldom heard him mention her under any other name.
  [3] In his eyes she eclipses and predominates the whole of her sex.
  [4] It was not that he felt any emotion akin to love for Irene Adler.
  [5] All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
  [6] He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position.
  [7] He never spoke of the softer passions, save with a gibe and a sneer.
  [8] They were admirable things for the observer—excellent for drawing the veil from men's motives and actions.
  [9] But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results.

→ Total sentences (NLTK): 9


### 1.2 Using spaCy

In [6]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sherlock_holmes_part_of_text)

sentences_spacy = [sent.text for sent in doc.sents]

for i, sent in enumerate(sentences_spacy, 1):
    print(f"  [{i}] {sent}")

print(f"\n→ Total sentences (spaCy): {len(sentences_spacy)}")

  [1] To Sherlock Holmes she is always the woman.
  [2] I have seldom heard him mention her under any other name.
  [3] In his eyes she eclipses and predominates the whole of her sex.
  [4] It was not that he felt any emotion akin to love for Irene Adler.
  [5] All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
  [6] He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position.
  [7] He never spoke of the softer passions, save with a gibe and a sneer.
  [8] They were admirable things for the observer—excellent for drawing the veil from men's motives and actions.
  [9] But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results.

→ Total sentences (spaCy): 9


### 1.3 Speed Comparison

spaCy loads a full language model and runs multiple pipeline components, so sentence splitting alone is slower than NLTK's dedicated tokenizer.

In [7]:
import time

def split_into_sentences_nltk(text):
    return tokenizer.tokenize(text)

def split_into_sentences_spacy(text):
    doc = nlp(text)
    return [sent.text for sent in doc.sents]

# --- Benchmark (one run each; perf_counter is the clock intended for timing) ---
start = time.perf_counter()
split_into_sentences_nltk(sherlock_holmes_part_of_text)
nltk_time = time.perf_counter() - start

start = time.perf_counter()
split_into_sentences_spacy(sherlock_holmes_part_of_text)
spacy_time = time.perf_counter() - start

print(f"NLTK  : {nltk_time:.6f} s")
print(f"spaCy : {spacy_time:.6f} s")
print(f"\nspaCy is ~{spacy_time/nltk_time:.0f}× slower for sentence splitting alone.")

NLTK  : 0.000661 s
spaCy : 0.074355 s

spaCy is ~112× slower for sentence splitting alone.


> **When to use which?**  
> If you only need sentence splitting → **NLTK** is faster.  
> If you are already using spaCy for other tasks (POS tagging, NER, etc.) → use **spaCy** for the whole pipeline.
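
A single timed call can be noisy, so the ratio above should be read as a rough indication. As a hedged sketch, the helper below uses the standard-library `timeit` module to repeat a call several times and keep the best average. `best_time` and `split_on_spaces` are hypothetical names introduced here; the stand-in workload only exists so the sketch runs without NLTK or spaCy.

```python
import timeit

def best_time(func, *args, repeat=5, number=20):
    """Return the best average time per call, in seconds, over several runs."""
    # timeit.repeat returns one total time per run of `number` calls.
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number

# Stand-in workload (assumption): replace with the notebook's splitter functions.
def split_on_spaces(text):
    return text.split(" ")

sample = "To Sherlock Holmes she is always the woman."
elapsed = best_time(split_on_spaces, sample)
print(f"best average over 5 runs: {elapsed:.9f} s per call")
```

In the notebook, the same helper could be called as `best_time(split_into_sentences_nltk, sherlock_holmes_part_of_text)` and likewise for the spaCy function.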

### 1.4 Other Languages

| Library | Supported Languages |
|---|---|
| **NLTK** | Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, Spanish, Swedish, Turkish |
| **spaCy** | Chinese, Dutch, French, German, Greek, Italian, Japanese, Portuguese, Romanian, Spanish, and others |

```python
# NLTK: Spanish
tokenizer_es = nltk.data.load("tokenizers/punkt/spanish.pickle")

# spaCy: Spanish (download first: python -m spacy download es_core_news_sm)
nlp_es = spacy.load("es_core_news_sm")
```

---

## 2 - Dividing Sentences into Words (Tokenization)

Many NLP tasks operate at the **word level**, such as building semantic models or searching for specific parts of speech.

---
### 2.1 Using NLTK

In [8]:
import nltk

words_nltk = nltk.tokenize.word_tokenize(sherlock_holmes_part_of_text)

print(words_nltk[:30])
print(f"\n→ Total tokens (NLTK): {len(words_nltk)}")

['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'the', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole']

→ Total tokens (NLTK): 171


**Notes on NLTK word tokenization:**
- Punctuation and quotes are treated as separate tokens.
- Contractions and possessives are split but **not** expanded: `don't` → `do`, `n't`; `men's` → `men`, `'s`.
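
For contrast, here is a minimal regex baseline using only the standard library. It is not how NLTK tokenizes; `simple_tokenize` is a hypothetical helper that separates punctuation but has no notion of clitics such as `'s`.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is neither
    # a word character nor whitespace (i.e., a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("He never spoke of men's motives.")
print(tokens)
```

Here the possessive comes out as three tokens (`men`, `'`, `s`) rather than the two described in the notes above, which is one reason to prefer a purpose-built tokenizer.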

---
### 2.2 Multi-Word Expression (MWE) Tokenizer

Sometimes we want to keep certain phrases as a single token (e.g., *"dim sum dinner"*).

In [9]:
from nltk.tokenize import MWETokenizer

# Initialize with multi-word expressions
mwe_tokenizer = MWETokenizer([('dim', 'sum', 'dinner')])
mwe_tokenizer.add_mwe(('best', 'dim', 'sum'))

# Example 1: no MWE match
tokens_1 = mwe_tokenizer.tokenize(
    'Last night I went for dinner in an Italian restaurant. The pasta was delicious.'.split()
)
print("Example 1:", tokens_1)

# Example 2: MWE matches present
tokens_2 = mwe_tokenizer.tokenize(
    'I went out to a dim sum dinner last night. This restaurant has the best dim sum in town.'.split()
)
print("Example 2:", tokens_2)

Example 1: ['Last', 'night', 'I', 'went', 'for', 'dinner', 'in', 'an', 'Italian', 'restaurant.', 'The', 'pasta', 'was', 'delicious.']
Example 2: ['I', 'went', 'out', 'to', 'a', 'dim_sum_dinner', 'last', 'night.', 'This', 'restaurant', 'has', 'the', 'best_dim_sum', 'in', 'town.']


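Notice how `dim_sum_dinner` and `best_dim_sum` are kept as single tokens in Example 2. Because the input was split with `str.split()`, punctuation stays attached to the neighboring word (`night.`, `town.`), so an expression immediately followed by punctuation would not be matched.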
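
The merging idea can be sketched in plain Python. This is a simplified illustration under the assumption of greedy, longest-first matching at each position; it is not NLTK's implementation, and `merge_mwes` is a hypothetical name.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Merge known multi-word expressions in a token list into single tokens."""
    # Try longer expressions first so the longest match wins at each position.
    ordered = sorted(mwes, key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == tuple(mwe):
                merged.append(separator.join(mwe))
                i += len(mwe)  # skip past the tokens that were merged
                break
        else:
            # No expression starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

mwes = [("dim", "sum", "dinner"), ("best", "dim", "sum")]
sentence = "This restaurant has the best dim sum in town."
print(merge_mwes(sentence.split(), mwes))
```

As with the NLTK example, the final `town.` keeps its period because the input was split on whitespace only.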

---
### 2.3 Using spaCy

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sherlock_holmes_part_of_text)

words_spacy = [token.text for token in doc]

print(words_spacy[:30])
print(f"\n→ Total tokens (spaCy): {len(words_spacy)}")

['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'the', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.', 'In', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole']

→ Total tokens (spaCy): 173


### 2.4 Comparing NLTK vs spaCy Tokens

In [11]:
# Tokens unique to each tokenizer
only_in_spacy = set(words_spacy) - set(words_nltk)
only_in_nltk  = set(words_nltk) - set(words_spacy)

print(f"Tokens only in spaCy ({len(only_in_spacy)}): {only_in_spacy}")
print(f"Tokens only in NLTK  ({len(only_in_nltk)}):  {only_in_nltk}")
print(f"\nNLTK token count : {len(words_nltk)}")
print(f"spaCy token count: {len(words_spacy)}")

Tokens only in spaCy (3): {'—', 'observer', 'excellent'}
Tokens only in NLTK  (1):  {'observer—excellent'}

NLTK token count : 171
spaCy token count: 173


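> **Key differences:**  
> - In this excerpt the only difference is the em dash: spaCy splits `observer—excellent` into three tokens while NLTK keeps it as one, which accounts for the 173 vs. 171 token counts.  
> - More generally, spaCy keeps newline characters (`\n`) as separate tokens and splits hyphenated words (e.g., `high-power` → `high`, `-`, `power`).  
> - If you are doing further processing with spaCy, use its tokenizer. Otherwise NLTK word tokenization is sufficient.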
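
Note that a set difference ignores how often a token occurs. As a hedged sketch, `collections.Counter` subtraction compares the two token lists as multisets instead. The token lists below are made up for illustration, reusing the `high-power` example.

```python
from collections import Counter

# Hypothetical outputs of two tokenizers for the same short text.
tokens_a = ["a", "high-power", "laser", "and", "a", "lens"]
tokens_b = ["a", "high", "-", "power", "laser", "and", "lens"]

# Set difference: only reports token types missing from the other side.
set_only_a = set(tokens_a) - set(tokens_b)

# Counter difference: also reports surplus occurrences of shared tokens.
count_only_a = Counter(tokens_a) - Counter(tokens_b)
count_only_b = Counter(tokens_b) - Counter(tokens_a)

print("set difference     :", set_only_a)
print("surplus in a       :", dict(count_only_a))
print("surplus in b       :", dict(count_only_b))
```

The second `a` in `tokens_a` shows up only in the `Counter` comparison.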

---

## 3 - Part-of-Speech (POS) Tagging

POS tagging assigns a grammatical category (noun, verb, adjective, …) to each token. This is useful for:
- Filtering tokens by type (e.g., extract only nouns)
- Disambiguation (e.g., *"bank"* as noun vs. verb)
- Downstream tasks like named-entity recognition

---
### 3.1 Using NLTK

In [12]:
import nltk

# First tokenize, then tag
words = nltk.tokenize.word_tokenize(sherlock_holmes_part_of_text)
pos_tags_nltk = nltk.pos_tag(words)

# Show first 20 tagged tokens
for word, tag in pos_tags_nltk[:20]:
    print(f"  {word:20s} → {tag}")

print(f"\n→ Total tagged tokens: {len(pos_tags_nltk)}")

  To                   → TO
  Sherlock             → NNP
  Holmes               → NNP
  she                  → PRP
  is                   → VBZ
  always               → RB
  the                  → DT
  woman                → NN
  .                    → .
  I                    → PRP
  have                 → VBP
  seldom               → VBN
  heard                → RB
  him                  → PRP
  mention              → VB
  her                  → PRP
  under                → IN
  any                  → DT
  other                → JJ
  name                 → NN

→ Total tagged tokens: 171


NLTK uses the **Penn Treebank** tagset. Common tags include:

| Tag | Meaning | Examples |
|---|---|---|
| `NN` | Noun, singular | woman, name |
| `NNP` | Proper noun | Sherlock, Holmes |
| `VB` | Verb, base form | speak, observe |
| `VBD` | Verb, past tense | felt, placed |
| `JJ` | Adjective | cold, precise |
| `RB` | Adverb | seldom, admirably |
| `PRP` | Personal pronoun | she, he, I |
| `DT` | Determiner | the, a, any |

You can look up any tag with:
```python
nltk.help.upenn_tagset('VBD')
```

---
### 3.2 Using spaCy

In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sherlock_holmes_part_of_text)

# spaCy provides both fine-grained (.tag_) and coarse (.pos_) tags
print(f"{'Token':20s} {'POS':8s} {'Fine Tag':10s} {'Description'}")
print("-" * 65)
for token in list(doc)[:20]:
    print(f"  {token.text:20s} {token.pos_:8s} {token.tag_:10s} {spacy.explain(token.pos_)}")

Token                POS      Fine Tag   Description
-----------------------------------------------------------------
  To                   ADP      IN         adposition
  Sherlock             PROPN    NNP        proper noun
  Holmes               PROPN    NNP        proper noun
  she                  PRON     PRP        pronoun
  is                   AUX      VBZ        auxiliary
  always               ADV      RB         adverb
  the                  DET      DT         determiner
  woman                NOUN     NN         noun
  .                    PUNCT    .          punctuation
  I                    PRON     PRP        pronoun
  have                 AUX      VBP        auxiliary
  seldom               ADV      RB         adverb
  heard                VERB     VBN        verb
  him                  PRON     PRP        pronoun
  mention              VERB     VB         verb
  her                  PRON     PRP        pronoun
  under                ADP      IN         adposition


> **Tip:** spaCy's `.pos_` gives the Universal POS tag (NOUN, VERB, ADJ, …) while `.tag_` gives a more detailed tag similar to Penn Treebank.

### 3.3 Extracting Specific Parts of Speech

A common practical task is extracting only nouns or only verbs:

In [14]:
# Extract nouns using spaCy
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
print(f"Nouns ({len(nouns)}): {nouns}")

# Extract verbs using spaCy
verbs = [token.text for token in doc if token.pos_ == "VERB"]
print(f"\nVerbs ({len(verbs)}): {verbs}")

# Extract adjectives using spaCy
adjs = [token.text for token in doc if token.pos_ == "ADJ"]
print(f"\nAdjectives ({len(adjs)}): {adjs}")

Nouns (29): ['woman', 'name', 'eyes', 'whole', 'sex', 'emotion', 'emotions', 'mind', 'reasoning', 'machine', 'world', 'lover', 'position', 'passions', 'gibe', 'sneer', 'things', 'observer', 'veil', 'men', 'motives', 'actions', 'reasoner', 'intrusions', 'temperament', 'distracting', 'factor', 'doubt', 'results']

Verbs (18): ['heard', 'mention', 'eclipses', 'predominates', 'felt', 'love', 'take', 'observing', 'seen', 'placed', 'spoke', 'save', 'drawing', 'trained', 'admit', 'adjusted', 'introduce', 'throw']

Adjectives (15): ['other', 'akin', 'abhorrent', 'cold', 'precise', 'balanced', 'perfect', 'false', 'softer', 'admirable', 'excellent', 'such', 'own', 'delicate', 'mental']
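
The same kind of filtering works on NLTK-style `(word, tag)` pairs by matching on the Penn Treebank tag prefix. The sketch below uses a few pairs copied from the NLTK output in section 3.1; `words_with_tag_prefix` is a hypothetical helper.

```python
# A few (word, tag) pairs taken from the NLTK tagging output shown earlier.
tagged = [
    ("Sherlock", "NNP"), ("Holmes", "NNP"), ("she", "PRP"), ("is", "VBZ"),
    ("always", "RB"), ("the", "DT"), ("woman", "NN"), ("mention", "VB"),
    ("name", "NN"),
]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose Penn Treebank tag starts with `prefix`."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns_nltk = words_with_tag_prefix(tagged, "NN")  # NN, NNS, NNP, NNPS
verbs_nltk = words_with_tag_prefix(tagged, "VB")  # VB, VBD, VBG, VBN, VBP, VBZ

print("Nouns:", nouns_nltk)
print("Verbs:", verbs_nltk)
```

One difference to keep in mind: the `NN` prefix also matches proper nouns (`NNP`), whereas the spaCy filter on `NOUN` above leaves out tokens tagged `PROPN`.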


---

## 4 - Combining Similar Words (Lemmatization)

**Lemmatization** reduces words to their base (dictionary) form:
- *running*, *ran*, *runs* → **run**
- *better* → **good**
- *women* → **woman**

This differs from **stemming**, which just chops off suffixes (often producing non-words like *"happi"*).

---
### 4.1 Lemmatization with NLTK (WordNet)

In [15]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Examples
examples = ["running", "ran", "better", "women", "geese", "rocks", "corpora"]
for word in examples:
    print(f"  {word:15s} → {lemmatizer.lemmatize(word)}")

  running         → running
  ran             → ran
  better          → better
  women           → woman
  geese           → goose
  rocks           → rock
  corpora         → corpus


> **Note:** NLTK's WordNet lemmatizer works best when you also supply the POS tag. Without it, it defaults to noun.

```python
lemmatizer.lemmatize("better", pos="a")   # → "good"
lemmatizer.lemmatize("running", pos="v")  # → "run"
```
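
Since `nltk.pos_tag` returns Penn Treebank tags while the lemmatizer expects WordNet POS letters (`"n"`, `"v"`, `"a"`, `"r"`), a small mapping is usually needed to connect the two. The helper below is a minimal sketch; `penn_to_wordnet` is a hypothetical name, and falling back to noun is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (defaults to noun)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag

for tag in ["NN", "VBD", "JJ", "RB", "DT"]:
    print(f"{tag:4s} -> {penn_to_wordnet(tag)}")

# Intended use with the notebook's objects (not executed here):
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
# for each (word, tag) pair returned by nltk.pos_tag(...).
```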

---
### 4.2 Lemmatization with spaCy

In [16]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(sherlock_holmes_part_of_text)

# Display token → lemma
print(f"{'Token':20s} → {'Lemma':20s} {'POS'}")
print("-" * 50)
for token in doc[:25]:
    if token.text != token.lemma_:  # only show where lemma differs
        print(f"  {token.text:20s} → {token.lemma_:20s} {token.pos_}")

Token                → Lemma                POS
--------------------------------------------------
  To                   → to                   ADP
  is                   → be                   AUX
  heard                → hear                 VERB
  him                  → he                   PRON
  her                  → she                  PRON
  In                   → in                   ADP
  eyes                 → eye                  NOUN


spaCy automatically uses the POS tag to pick the correct lemma, making it more accurate out of the box.

### 4.3 Stemming (for comparison)

Stemming is a faster but cruder approach: it applies rules to strip suffixes.

In [17]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

examples = ["running", "ran", "better", "women", "happiness", "university", "corpora"]
print(f"{'Word':15s} {'Stem':15s} {'Lemma (NLTK)':15s}")
print("-" * 45)
for word in examples:
    print(f"  {word:15s} {stemmer.stem(word):15s} {lemmatizer.lemmatize(word):15s}")

Word            Stem            Lemma (NLTK)   
---------------------------------------------
  running         run             running        
  ran             ran             ran            
  better          better          better         
  women           women           woman          
  happiness       happi           happiness      
  university      univers         university     
  corpora         corpora         corpus         


> Stemming is faster but can produce non-words (e.g., *"happi"*, *"univers"*).  
> Lemmatization is slower but returns a dictionary form when it finds one; as the table shows, without the right POS hint *running* and *better* come back unchanged.
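
To see why rule-based suffix stripping yields non-words, consider the toy stemmer below. It is deliberately crude and is not the Porter algorithm; `crude_stem`, its suffix list, and the minimum stem length of 3 are all assumptions made for illustration.

```python
def crude_stem(word, suffixes=("ness", "ing", "ity", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word  # no rule applied

for word in ["running", "ran", "happiness", "university", "corpora"]:
    print(f"  {word:15s} {crude_stem(word)}")
```

The rules know nothing about the vocabulary, so *happiness* becomes *happi* and *running* becomes *runn*, while irregular forms such as *ran* are left untouched.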

---

## 5 - Removing Stopwords

**Stopwords** are very frequent words (the, is, at, which, on, …) that carry little semantic meaning. Removing them can improve results for tasks like text classification or topic modeling.

---
### 5.1 NLTK Stopwords

In [18]:
from nltk.corpus import stopwords

stop_words_nltk = set(stopwords.words('english'))
print(f"NLTK has {len(stop_words_nltk)} English stopwords.")
print(f"Sample: {sorted(list(stop_words_nltk))[:20]}")

NLTK has 198 English stopwords.
Sample: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']


In [19]:
# Remove stopwords from our tokenized text
import nltk

words = nltk.tokenize.word_tokenize(sherlock_holmes_part_of_text)
filtered_words_nltk = [w for w in words if w.lower() not in stop_words_nltk]

print(f"Before: {len(words)} tokens")
print(f"After : {len(filtered_words_nltk)} tokens")
print(f"Removed: {len(words) - len(filtered_words_nltk)} stopwords\n")
print("Filtered tokens:", filtered_words_nltk[:20])

Before: 171 tokens
After : 89 tokens
Removed: 82 stopwords

Filtered tokens: ['Sherlock', 'Holmes', 'always', 'woman', '.', 'seldom', 'heard', 'mention', 'name', '.', 'eyes', 'eclipses', 'predominates', 'whole', 'sex', '.', 'felt', 'emotion', 'akin', 'love']


### 5.2 spaCy Stopwords

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")

# spaCy exposes its stopword set on the language defaults
stop_words_spacy = nlp.Defaults.stop_words
print(f"spaCy has {len(stop_words_spacy)} English stopwords.")

# Filter using spaCy's built-in .is_stop attribute
doc = nlp(sherlock_holmes_part_of_text)
filtered_tokens_spacy = [token.text for token in doc if not token.is_stop and not token.is_punct]

print(f"\nBefore: {len(doc)} tokens")
print(f"After : {len(filtered_tokens_spacy)} tokens (stopwords + punctuation removed)")
print("\nFiltered tokens:", filtered_tokens_spacy[:20])

spaCy has 326 English stopwords.

Before: 173 tokens
After : 64 tokens (stopwords + punctuation removed)

Filtered tokens: ['Sherlock', 'Holmes', 'woman', 'seldom', 'heard', 'mention', 'eyes', 'eclipses', 'predominates', 'sex', 'felt', 'emotion', 'akin', 'love', 'Irene', 'Adler', 'emotions', 'particularly', 'abhorrent', 'cold']


### 5.3 Comparing Stopword Lists

In [21]:
# See which words differ between the two stopword lists
only_nltk  = stop_words_nltk - stop_words_spacy
only_spacy = stop_words_spacy - stop_words_nltk

print(f"Only in NLTK  ({len(only_nltk)}): {sorted(only_nltk)[:10]} ...")
print(f"Only in spaCy ({len(only_spacy)}): {sorted(only_spacy)[:10]} ...")
print(f"Shared: {len(stop_words_nltk & stop_words_spacy)}")

Only in NLTK  (75): ['ain', 'aren', "aren't", 'couldn', "couldn't", 'd', 'didn', "didn't", 'doesn', "doesn't"] ...
Only in spaCy (203): ["'d", "'ll", "'m", "'re", "'s", "'ve", 'across', 'afterwards', 'almost', 'alone'] ...
Shared: 123
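
Because the lists differ, it can be worth adjusting them for the task at hand. For example, *not* is removed in both filtered outputs above, which may be undesirable when negation matters. The sketch below shows the idea with a small hand-written stopword set (an assumption, standing in for either library's list).

```python
# Tiny illustrative stopword set; in the notebook this would be
# stop_words_nltk or stop_words_spacy.
base_stopwords = {"it", "was", "not", "that", "he", "any", "to", "for"}

# Words to keep even though the default list treats them as stopwords.
keep = {"not"}
custom_stopwords = base_stopwords - keep

tokens = "It was not that he felt any emotion akin to love".split()
filtered = [t for t in tokens if t.lower() not in custom_stopwords]

print(filtered)
```

Working on a copy (`base_stopwords - keep`) leaves the original set untouched, so other cells that rely on the default list are not affected.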


---

## 6 - Putting It All Together: A Complete Preprocessing Pipeline

Let's combine everything into a single reusable function using spaCy:

In [22]:
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text, remove_stopwords=True, lemmatize=True):
    """
    Full NLP preprocessing pipeline.
    Returns a list of dicts with token info.
    """
    doc = nlp(text)
    results = []
    for token in doc:
        # Skip punctuation, whitespace, and optionally stopwords
        if token.is_punct or token.is_space:
            continue
        if remove_stopwords and token.is_stop:
            continue

        results.append({
            "text": token.text,
            "lemma": token.lemma_ if lemmatize else token.text,
            "pos": token.pos_,
            "tag": token.tag_,
            "is_stop": token.is_stop,
        })
    return results

# Run the pipeline
processed = preprocess(sherlock_holmes_part_of_text)

# Display nicely
print(f"{'Text':18s} {'Lemma':18s} {'POS':8s} {'Tag'}")
print("=" * 55)
for tok in processed[:25]:
    print(f"  {tok['text']:18s} {tok['lemma']:18s} {tok['pos']:8s} {tok['tag']}")

print(f"\n→ {len(processed)} meaningful tokens after preprocessing")

Text               Lemma              POS      Tag
  Sherlock           Sherlock           PROPN    NNP
  Holmes             Holmes             PROPN    NNP
  woman              woman              NOUN     NN
  seldom             seldom             ADV      RB
  heard              hear               VERB     VBN
  mention            mention            VERB     VB
  eyes               eye                NOUN     NNS
  eclipses           eclipse            VERB     VBZ
  predominates       predominate        VERB     VBZ
  sex                sex                NOUN     NN
  felt               feel               VERB     VBD
  emotion            emotion            NOUN     NN
  akin               akin               ADJ      JJ
  love               love               VERB     VB
  Irene              Irene              PROPN    NNP
  Adler              Adler              PROPN    NNP
  emotions           emotion            NOUN     NNS
  particularly       particularly       ADV      RB
  a
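
The list of dicts returned by `preprocess` is easy to aggregate further. As a hedged sketch, the block below counts lemmas with `collections.Counter`. The records are hand-written in the same shape as the function's output, with values mirroring rows shown above, so the sketch runs without spaCy.

```python
from collections import Counter

# Hand-written records in the shape preprocess() returns (illustrative subset).
sample_processed = [
    {"text": "emotion", "lemma": "emotion", "pos": "NOUN", "tag": "NN", "is_stop": False},
    {"text": "emotions", "lemma": "emotion", "pos": "NOUN", "tag": "NNS", "is_stop": False},
    {"text": "felt", "lemma": "feel", "pos": "VERB", "tag": "VBD", "is_stop": False},
    {"text": "eyes", "lemma": "eye", "pos": "NOUN", "tag": "NNS", "is_stop": False},
]

# Count how often each lemma occurs, merging inflected forms.
lemma_counts = Counter(tok["lemma"] for tok in sample_processed)

# Collect the distinct noun lemmas in alphabetical order.
noun_lemmas = sorted({tok["lemma"] for tok in sample_processed if tok["pos"] == "NOUN"})

print(lemma_counts.most_common(2))
print(noun_lemmas)
```

Counting on the `lemma` field rather than `text` is what lets *emotion* and *emotions* contribute to the same entry.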

---

## Summary

| Step | NLTK | spaCy |
|---|---|---|
| **Sentence splitting** | `punkt` tokenizer: fast, single-purpose | `doc.sents`: part of full pipeline |
| **Word tokenization** | `word_tokenize()`: rule-based | Automatic during `nlp()` call |
| **POS tagging** | `pos_tag()`: Penn Treebank tags | `.pos_` (universal) / `.tag_` (detailed) |
| **Lemmatization** | `WordNetLemmatizer`: needs POS hint | `.lemma_`: automatic with POS context |
| **Stopwords** | `stopwords.words('english')`: 198 words | `nlp.Defaults.stop_words`: 326 words |

**Rule of thumb:** If you need a full processing pipeline, use **spaCy**. If you need a quick, lightweight operation, **NLTK** may be faster.

### Further Reading
- NLTK Documentation: https://www.nltk.org/
- spaCy Documentation: https://spacy.io/
- Punkt algorithm paper: https://aclanthology.org/J06-4003.pdf
- spaCy processing pipelines: https://spacy.io/usage/processing-pipelines