<p style="text-align: center; font-size: 28px;"><b>Text Preprocessing</b></p>

**Goals of this Unit**

The purpose of this unit is to learn how to prepare text data effectively for NLP tasks. After completing it, you will be able to:

**Objectives**
* Recognize several preprocessing techniques used in NLP.
* Use **Python** and **regex** to clean and remove unnecessary formatting.
* **Tokenize** text with **NLTK**.
* **Normalize** text using Python, regex, and NLTK:

  * remove affixes
  * change text casing
  * remove common words

**Reminder**

Learning is collaborative — engage with the Codecademy community through forums, share your progress, request code reviews, and review others' work to reinforce your understanding.


# Natural Language Parsing with Regular Expressions

Exploring hidden code words in declassified CIA files might sound like an intelligence officer’s job, and identifying gender bias in Harry Potter could seem like work for a literature expert. However, by applying natural language parsing with regular expressions, these kinds of analyses are accessible to anyone.

Even if we don’t consciously think about sentence structure as we write, the way we build sentences is essential for conveying meaning. Studying how words are arranged — and which ones are chosen — can reveal deeper layers of a text. It can uncover connotations, expose an author’s biases, and provide insights that even close reading might miss.

With Python’s re module for regular expressions and the Natural Language Toolkit (NLTK), you can search for key terms, analyze their frequency and context, and detect patterns in parts of speech to uncover subtle meanings within a text. Let’s dive in!

## Compiling and Matching

Regular expressions (regex) are used for **pattern matching in text**. In Python, the `re` module provides tools to work with regex.

### Using `re.compile()`

`re.compile()` creates a **regular expression object** based on a pattern:

```python
regular_expression_object = re.compile("[A-Za-z]{4}")
```

This matches **exactly 4 alphabetic characters** (uppercase or lowercase).

### Using `.match()`

`.match()` checks whether the pattern matches at the **beginning** of a string. Anything after the matched text is ignored:

```python
result = regular_expression_object.match("Toto")
```

* If there is a match → a **match object** is returned.
* If not → `None` is returned.

To retrieve the matched text, use:

```python
result.group(0)
```

### Shortcut — One Line

You can skip compiling and match directly:

```python
result = re.match("[A-Za-z]{4}", "Toto")
```
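Because `.match()` returns `None` when nothing matches, calling `.group(0)` on the result without a check can raise an `AttributeError`. The following is a minimal sketch of a guard; `first_four_letters` is a hypothetical helper name used only for illustration.

```python
import re

four_letters = re.compile(r"[A-Za-z]{4}")

def first_four_letters(text):
    """Return the four leading letters of text, or None if text does not start with four letters."""
    result = four_letters.match(text)
    # A match object is truthy and None is falsy, so this guards the .group() call.
    return result.group(0) if result else None

print(first_four_letters("Toto"))    # Toto
print(first_four_letters("Totoro"))  # Toto (only the first four letters are consumed)
print(first_four_letters("Oz"))      # None
```

Note that `"Totoro"` still matches: `{4}` limits how much the pattern consumes, not how long the input string may be.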

### Regex Meaning Table

| Pattern    | Matches                             |
| ---------- | ----------------------------------- |
| `[A-Za-z]` | Any letter (uppercase or lowercase) |
| `[A-Z]`    | Only uppercase letters              |
| `[a-z]`    | Only lowercase letters              |
| `\d`       | Any digit (0–9)                     |
| `.`        | Any character except newline        |

These basics are essential before moving to more complex parsing tasks in NLP preprocessing.


In [6]:
import re

# characters are defined
character_1 = "Dorothy"
character_2 = "Henry"

# compile your regular expression here
# [A-Za-z] means only letters (uppercase or lowercase), {7} means exactly 7 letters.
regular_expression = re.compile("[A-Za-z]{7}")

# check for a match to character_1 here
# .match() checks whether the text starts with 7 letters (anything after them is ignored).
result_1 = regular_expression.match(character_1)
print(result_1)  # prints a match object if found, otherwise None

# store and print the matched text here
# .group(0) extracts the actual matched substring from the match object.
match_1 = result_1.group(0)
print(match_1)  # should print "Dorothy"

# match a 7-character string of word characters against character_2 here
# "Henry" has only 5 characters, so the pattern cannot match and the result is None.
result_2 = re.match(r"\w{7}", character_2)
print(result_2)  # prints None because "Henry" is shorter than 7 characters

<re.Match object; span=(0, 7), match='Dorothy'>
Dorothy
None


## Searching and Finding

The `.search()` method scans **an entire string** from left to right and returns the **first match** it finds. This differs from `.match()`, which only checks the **beginning** of a string.

### Example

```python
result = re.search(r"\w{8}", "Are you a Munchkin?")
```

* `.search()` ➜ finds `"Munchkin"`
* `.match()` ➜ returns `None` (because the match is not at the start)
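
The difference can be checked directly. This short sketch runs both functions on the same sentence; the variable names are illustrative only.

```python
import re

sentence = "Are you a Munchkin?"
pattern = r"\w{8}"  # eight consecutive word characters

searched = re.search(pattern, sentence)  # scans the whole string
matched = re.match(pattern, sentence)    # only tries position 0

print(searched.group(0))  # Munchkin
print(searched.span())    # (10, 18)
print(matched)            # None
```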

---

### Finding All Matches — `.findall()`

To retrieve **all non-overlapping matches**, use `.findall()`. It returns a **list** of every match found.

Example text:

```python
text = "Everything is green here, while in the country of the Munchkins blue was the favorite color..."
```

To find all 8-character word sequences:

```python
list_of_matches = re.findall(r"\w{8}", text)
```

**Result** (for the full passage, which is abbreviated above):

```
['Everythi', 'Munchkin', 'favorite', 'friendly', 'Munchkin']
```

`.findall()` is useful for **frequency analysis**, keyword extraction, and pattern detection in NLP preprocessing.
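
One caveat for frequency analysis is that a bare pattern also matches inside longer words. The sketch below uses a made-up sample sentence (not taken from the novel) to compare a substring pattern with a word-boundary pattern, and then counts words with `collections.Counter`.

```python
import re
from collections import Counter

sample = "the lion roared. two lions slept near a dandelion, and the lion yawned."

substring_hits = re.findall(r"lion", sample)       # also hits "lions" and "dandelion"
whole_word_hits = re.findall(r"\blion\b", sample)  # only the standalone word

print(len(substring_hits))   # 4
print(len(whole_word_hits))  # 2

# Frequency of every word with four or more letters
word_counts = Counter(re.findall(r"\b[a-z]{4,}\b", sample))
print(word_counts.most_common(1))  # [('lion', 2)]
```

Whether the looser or the stricter count is the right one depends on the question being asked of the text.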


In [7]:
with open("utils/the_wizard_of_oz_text.txt", encoding='utf-8') as oz_file:
    oz_text = oz_file.read().lower()

In [9]:
# Step 1: Search oz_text for the first occurrence of 'wizard'
# .search() scans the entire text and returns a MATCH OBJECT if found, or None if not found.
# It does NOT return True/False directly, but the match object can be used as a boolean.
found_wizard = re.search("wizard", oz_text)
print(found_wizard)  # shows position of the first match (or None)

# Step 2: Find all occurrences of 'lion' in the text
# .findall() scans the entire text and returns a LIST of every match found.
# Useful for frequency analysis and keyword extraction.
all_lions = re.findall("lion", oz_text)
print(all_lions)  # prints every match of 'lion'

# Step 3: Count the number of times 'lion' appears
# len() gives the total frequency of matches in the text.
number_lions = len(all_lions)
print(number_lions)  # higher frequency may indicate higher importance in the text

<re.Match object; span=(14, 20), match='wizard'>
['lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion', 'lion'

## Part-of-Speech Tagging

To analyze text more meaningfully, we can examine it **word by word** and identify the **part of speech** (POS) of each word — a process known as **part-of-speech tagging**.

### Why It Matters

POS tagging helps us understand **how words function** within a sentence, which supports tasks like:

* syntax analysis
* sentiment analysis
* information extraction
* text generation

---

### Example Sentence

**Wow! Ramona and her class are happily studying the new textbook she has on NLP.**

| Part of Speech   | Examples                     |
| ---------------- | ---------------------------- |
| **Noun**         | Ramona, class, textbook, NLP |
| **Pronoun**      | her, she                     |
| **Determiner**   | the                          |
| **Verb**         | studying, are, has           |
| **Adjective**    | new                          |
| **Adverb**       | happily                      |
| **Preposition**  | on                           |
| **Conjunction**  | and                          |
| **Interjection** | Wow                          |

---

### POS Tagging with `nltk`

Use `pos_tag()` to automatically tag words:

```python
from nltk import pos_tag

word_sentence = ['do', 'you', 'suppose', 'oz', 'could', 'give', 'me', 'a', 'heart', '?']
part_of_speech_tagged_sentence = pos_tag(word_sentence)
```

**Output:**

```
[('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'),
 ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'),
 ('heart', 'NN'), ('?', '.')]
```

---

### Common POS Abbreviations

| Tag | Meaning    |
| --- | ---------- |
| NN  | Noun       |
| VB  | Verb       |
| RB  | Adverb     |
| JJ  | Adjective  |
| DT  | Determiner |
| PRP | Pronoun    |
| MD  | Modal verb |

You can find the full list of POS tags in the NLTK documentation.
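
Since `pos_tag()` returns plain `(word, tag)` tuples, the result can be summarized with ordinary Python. The sketch below reuses the tagged sentence from the output above, so it needs no tagger; the variable names are illustrative.

```python
from collections import Counter

# Tagged sentence copied from the pos_tag() output above
tagged = [('do', 'VB'), ('you', 'PRP'), ('suppose', 'VB'), ('oz', 'NNS'),
          ('could', 'MD'), ('give', 'VB'), ('me', 'PRP'), ('a', 'DT'),
          ('heart', 'NN'), ('?', '.')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(2))  # [('VB', 3), ('PRP', 2)]

# Collect the words carrying any verb tag (VB, VBD, VBG, ...)
verbs = [word for word, tag in tagged if tag.startswith('VB')]
print(verbs)  # ['do', 'suppose', 'give']
```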


In [14]:
# Step 1: Import the NLTK tokenizers for sentences and words
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Step 2: Make sure the Punkt tokenizer models are available (run once)
# This is needed for both sentence and word tokenization.
nltk.download('punkt')

# Step 3: Split the full text (oz_text) into individual sentences
# sent_tokenize() takes a large string and returns a list of sentence strings.
oz_sentences = sent_tokenize(oz_text)

# Step 4: Tokenize each sentence into a list of word tokens
# This will produce a list of lists, where each inner list contains tokens of one sentence.
word_tokenized_oz = [word_tokenize(sentence) for sentence in oz_sentences]

[nltk_data] Downloading package punkt to /home/alamanna1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
import nltk
from nltk import pos_tag

In [18]:
# Step 1: Access and print the sentence at index 100
witches_fate = word_tokenized_oz[100]
print("Original tokenized sentence at index 100:", witches_fate)

# Step 2: Create an empty list to store POS-tagged sentences
pos_tagged_oz = []

# Step 3: Loop through all tokenized sentences and apply POS tagging
for sentence in word_tokenized_oz:
    tagged_sentence = pos_tag(sentence)  # assign part of speech to each token
    pos_tagged_oz.append(tagged_sentence)

# Step 4: Access and print the POS-tagged version of the sentence at index 100
witches_fate_pos = pos_tagged_oz[100]
print("POS-tagged version of the same sentence:", witches_fate_pos)

Original tokenized sentence at index 100: ['``', 'the', 'house', 'must', 'have', 'fallen', 'on', 'her', '.']
POS-tagged version of the same sentence: [('``', '``'), ('the', 'DT'), ('house', 'NN'), ('must', 'MD'), ('have', 'VB'), ('fallen', 'VBN'), ('on', 'IN'), ('her', 'PRP'), ('.', '.')]


## Introduction to Chunking

With part-of-speech tagging complete, you can now discover **patterns in sentence structure** to extract meaning. This process is called **chunking** — it groups words based on their POS tags using **regular expressions**.

### What is Chunking?

Chunking allows you to:

* search for **structural patterns** in tagged text
* define **custom rules** based on POS tags
* extract meaningful **word groups** (e.g., adjective + noun)

---

### Chunk Grammar Format

```python
chunk_grammar = "AN: {<JJ><NN>}"
```

**Explanation:**

* `AN` → user-defined chunk name (here: *Adjective + Noun*)
* `{}` → chunk pattern definition
* `<JJ>` → matches **adjective**
* `<NN>` → matches **noun**

This grammar finds **any adjective followed by a noun**.

---

### Running the Chunk Parser

```python
from nltk import RegexpParser
chunk_parser = RegexpParser(chunk_grammar)
```

Once defined, you can apply it using `.parse()` on POS-tagged sentences.

**Example Input:**

```python
pos_tagged_sentence = [
    ('where', 'WRB'), ('is', 'VBZ'), 
    ('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN'), ('?', '.')
]
```

**Example Chunking:**

```python
chunked = chunk_parser.parse(pos_tagged_sentence)
```

This identifies **“emerald city”** as a chunk — matching the `<JJ><NN>` pattern.

Chunking is a powerful tool to reveal structure in text and extract meaningful information for NLP tasks.


In [20]:
from nltk import RegexpParser, Tree

# Step 1: Define grammar for chunking — AN = Adjective-Noun
# <JJ> matches adjectives and <NN> matches nouns. Together, they form small noun phrases.
chunk_grammar = "AN: {<JJ><NN>}"

# Step 2: Create a parser that will search for the grammar pattern in POS-tagged sentences
chunk_parser = RegexpParser(chunk_grammar)

# Step 3: Chunk the POS-tagged sentence at index 282 in the novel
# This applies the grammar to detect adjective+noun pairs in that specific sentence.
scaredy_cat = chunk_parser.parse(pos_tagged_oz[282])
print("Chunked sentence showing adjective-noun pairs labeled as 'AN':")
print(scaredy_cat)

# Step 4: Visualize the result in a tree format for easier understanding of the structure
Tree.fromstring(str(scaredy_cat)).pretty_print()  # Displays chunks as a syntactic tree

Chunked sentence showing adjective-noun pairs labeled as 'AN':
(S ``/`` where/WRB is/VBZ the/DT (AN emerald/JJ city/NN) ?/. ''/'')
                         S                                    
   ______________________|__________________________           
  |       |       |      |     |    |               AN        
  |       |       |      |     |    |        _______|_____     
``/`` where/WRB is/VBZ the/DT ?/. ''/'' emerald/JJ     city/NN



## Chunking Noun Phrases

Chunking lets you group words into meaningful units based on their **part-of-speech (POS)** tags. A very useful and common type is **NP-chunking** (noun phrase chunking), where we look for phrases that behave as a single noun in a sentence.

A typical **noun phrase (NP)** structure in English often looks like:

* a **determiner** `DT` (e.g., *the, a, this*)
* followed by **zero or more adjectives** `JJ` (e.g., *wicked, new*)
* ending with a **noun** `NN` (e.g., *witch, book*)

Example POS-tagged sentence:

> `[('we', 'PRP'), ('are', 'VBP'), ('so', 'RB'), ('grateful', 'JJ'), ('to', 'TO'), ('you', 'PRP'), ('for', 'IN'), ('having', 'VBG'), ('killed', 'VBN'), ('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'), ('of', 'IN'), ('the', 'DT'), ('east', 'NN'), (',', ','), ('and', 'CC'), ('for', 'IN'), ('setting', 'VBG'), ('our', 'PRP$'), ('people', 'NNS'), ('free', 'VBP'), ('from', 'IN'), ('bondage', 'NN'), ('.', '.')]`

From this sentence, three noun phrases of the form `(DT) (JJ)* NN` are:

* `(('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'))`
* `(('the', 'DT'), ('east', 'NN'))`
* `(('bondage', 'NN'),)` (no determiner, no adjective, just a noun)

To automatically detect such patterns, you define **chunk grammar** using a regex-like syntax over POS tags:

```python
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"
```

Explanation of this grammar:

* `NP:` is the name of the chunk (noun phrase).
* `{ ... }` defines the pattern to chunk.
* `<DT>` matches any **determiner**.
* `?` is the **optional** quantifier: 0 or 1 determiner.
* `<JJ>` matches any **adjective**.
* `*` is the **Kleene star**: 0 or more adjectives.
* `<NN>` matches only the tag `NN` (a singular or mass noun). Plural nouns (`NNS`) and proper nouns (`NNP`) are not matched; a pattern such as `<NN.*>` would cover them.

Once noun phrases are chunked, you can:

* Perform **frequency analysis** to find important, recurring noun phrases.
* Use NP-chunks as **pseudo-topics** to tag and categorize documents.
* Analyze an author’s **adjective choices** (e.g., sentiment or bias) for specific nouns.

The interpretation of these NP-chunks and their frequencies depends on your goals and your understanding of the text.
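
The frequency analysis mentioned above can be sketched with `collections.Counter`. This is an assumption about what a chunk counter might do, not the course's helper implementation; `count_np_chunks` is a hypothetical name, and the sentences are tagged by hand so no tagger models are needed.

```python
from collections import Counter
from nltk import RegexpParser

# Hand-tagged sentences, so this sketch does not depend on pos_tag()
tagged_sentences = [
    [('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'), ('laughed', 'VBD'), ('.', '.')],
    [('dorothy', 'NN'), ('feared', 'VBD'), ('the', 'DT'), ('wicked', 'JJ'),
     ('witch', 'NN'), ('.', '.')],
]

chunk_parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")

def count_np_chunks(sentences):
    """Count each NP chunk, keyed by its tuple of (word, tag) pairs."""
    counts = Counter()
    for sentence in sentences:
        tree = chunk_parser.parse(sentence)
        # Each chunk is a subtree labeled 'NP'; its leaves are the (word, tag) pairs.
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
            counts[tuple(subtree.leaves())] += 1
    return counts

np_counts = count_np_chunks(tagged_sentences)
print(np_counts.most_common(1))
```

Keying the counter on the full tuple of `(word, tag)` pairs keeps "the witch" and "the wicked witch" as separate entries, which matters when studying adjective choices.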


In [21]:
from utils.np_chunk_counter import np_chunk_counter

# Step 1: Define noun-phrase (NP) chunk grammar
# NP: {<DT>?<JJ>*<NN>}
# <DT>?  → optional determiner (0 or 1)
# <JJ>*  → zero or more adjectives
# <NN>   → a noun (the head of the phrase)
chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

# Step 2: Create a RegexpParser object using the NP grammar
chunk_parser = RegexpParser(chunk_grammar)

# Step 3: Create an empty list to store NP-chunked sentences
np_chunked_oz = []

# Step 4: Loop through each POS-tagged sentence and apply chunking
for sentence in pos_tagged_oz:
    chunked_sentence = chunk_parser.parse(sentence)
    np_chunked_oz.append(chunked_sentence)

# Step 5: Count the 30 most frequent NP chunks using np_chunk_counter
most_common_np_chunks = np_chunk_counter(np_chunked_oz)

# Step 6: Display the result
print(most_common_np_chunks)

[((('i', 'NN'),), 325), ((('dorothy', 'NN'),), 222), ((('the', 'DT'), ('scarecrow', 'NN')), 213), ((('the', 'DT'), ('lion', 'NN')), 148), ((('the', 'DT'), ('tin', 'NN')), 123), ((('woodman', 'NN'),), 112), ((('oz', 'NN'),), 86), ((('toto', 'NN'),), 73), ((('head', 'NN'),), 59), ((('the', 'DT'), ('woodman', 'NN')), 59), ((('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN')), 58), ((('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN')), 51), ((('the', 'DT'), ('witch', 'NN')), 49), ((('the', 'DT'), ('girl', 'NN')), 46), ((('the', 'DT'), ('road', 'NN')), 41), ((('room', 'NN'),), 29), ((('nothing', 'NN'),), 29), ((('the', 'DT'), ('air', 'NN')), 29), ((('the', 'DT'), ('country', 'NN')), 26), ((('the', 'DT'), ('land', 'NN')), 24), ((('a', 'DT'), ('heart', 'NN')), 24), ((('the', 'DT'), ('west', 'NN')), 23), ((('axe', 'NN'),), 23), ((('the', 'DT'), ('sun', 'NN')), 22), ((('the', 'DT'), ('little', 'JJ'), ('girl', 'NN')), 22), ((('course', 'NN'),), 22), ((('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN'

## Chunking Verb Phrases

A verb phrase (VP) contains a **verb** along with its **objects**, **modifiers**, or **complements**. VP-chunking helps us identify patterns in actions and understand how characters behave or how actions are described in a text.

### Two Common Verb Phrase Structures

| Structure  | Pattern                   | Description                                   |
| ---------- | ------------------------- | --------------------------------------------- |
| **Form 1** | `VB → NP → (optional RB)` | Verb first, noun phrase next, optional adverb |
| **Form 2** | `NP → VB → (optional RB)` | Noun phrase first, verb next, optional adverb |

Both structures carry the **same meaning**. For example:

* `('said', 'VBD'), ('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN')`
* `('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN'), ('said', 'VBD')`

---

### **Chunk Grammar — Form 1**

```python
chunk_grammar = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"
```

### Explanation of Each Part

| Pattern   | Meaning                                                                    |
| --------- | -------------------------------------------------------------------------- |
| `<VB.*>`  | Matches **any verb** (VB, VBD, VBG, VBN, VBZ…) — the `.*` allows any tense |
| `<DT>?`   | Optional determiner (0 or 1)                                               |
| `<JJ>*`   | Zero or more adjectives                                                    |
| `<NN>`    | A noun (head of the phrase)                                                |
| `<RB.?>?` | **Entire adverb is optional** — see breakdown below                        |

---

### **Understanding `<RB.?>?`**

| Position     | Meaning                                                                  |
| ------------ | ------------------------------------------------------------------------ |
| `.` (inside) | Wildcard → any single character                                          |
| `?` (inner)  | Makes that character optional → allows `RB`, `RBR`, or `RBS`             |
| `?` (outer)  | Makes the **entire adverb optional** — it may appear once, or not at all |

✔ Valid adverb forms matched:
`RB` → regular adverb (*quickly*)
`RBR` → comparative (*faster*)
`RBS` → superlative (*fastest*)
`None` → no adverb at all
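
To see which tags a tag pattern accepts, the text inside `<...>` can be treated as an ordinary regex that must match the whole tag. This is a simplified approximation for illustration, not how NLTK implements chunk grammars internally.

```python
import re

tags = ['RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBZ', 'NN', 'NNS']

# Approximation: the content of <...> must match the entire tag
adverb_tags = [tag for tag in tags if re.fullmatch(r"RB.?", tag)]
verb_tags = [tag for tag in tags if re.fullmatch(r"VB.*", tag)]
plain_noun = [tag for tag in tags if re.fullmatch(r"NN", tag)]

print(adverb_tags)  # ['RB', 'RBR', 'RBS']
print(verb_tags)    # ['VB', 'VBD', 'VBZ']
print(plain_noun)   # ['NN']
```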

---

### Chunk Grammar — Form 2

```python
chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"
```

This reverses NP and verb — but the meaning is the same.

---

### Why It Matters

Once VP chunks are extracted, you can:

* Analyze character **actions**
* Investigate **tone/bias** through adverbs
* Track **frequent verb phrases** for narrative insight

This is where your interpretation and domain knowledge become valuable for understanding the text beyond raw syntax.


In [23]:
from utils.vp_chunk_counter import vp_chunk_counter

# Step 1: Define the first of two grammars for verb phrases (VP)

# Grammar 1: Verb phrase structure → VERB + NOUN PHRASE + optional ADVERB
# <VB.*>    → matches any verb form (VB, VBD, VBG, VBN, VBZ, etc.)
# <DT>?     → optional determiner (0 or 1)
# <JJ>*     → zero or more adjectives
# <NN>      → noun (head of the noun phrase)
# <RB.?>?   → optional adverb:
#               inner "?" → RB / RBR / RBS
#               outer "?" → whole adverb element is optional
chunk_grammar1 = "VP: {<VB.*><DT>?<JJ>*<NN><RB.?>?}"

# Step 2: Use grammar 1 for chunking (the next cell uses grammar 2 for comparison)
chunk_parser = RegexpParser(chunk_grammar1)

# Step 3: Empty list to store chunked sentences
vp_chunked_oz = []

# Step 4: Loop through each POS-tagged sentence and apply the chunk parser
for sentence in pos_tagged_oz:
    chunked_sentence = chunk_parser.parse(sentence)
    vp_chunked_oz.append(chunked_sentence)

# Step 5: Use provided function to count the 30 most common verb phrases
most_common_vp_chunks = vp_chunk_counter(vp_chunked_oz)

# Step 6: Print results (action-based insights)
print(most_common_vp_chunks)

[((('said', 'VBD'), ('the', 'DT'), ('scarecrow', 'NN')), 33), ((('said', 'VBD'), ('dorothy', 'NN')), 31), ((('asked', 'VBN'), ('dorothy', 'NN')), 20), ((('said', 'VBD'), ('the', 'DT'), ('tin', 'NN')), 19), ((('said', 'VBD'), ('the', 'DT'), ('lion', 'NN')), 15), ((('said', 'VBD'), ('the', 'DT'), ('girl', 'NN')), 10), ((('asked', 'VBN'), ('the', 'DT'), ('scarecrow', 'NN')), 10), ((('answered', 'VBD'), ('the', 'DT'), ('scarecrow', 'NN')), 8), ((('said', 'VBD'), ('the', 'DT'), ('cowardly', 'JJ'), ('lion', 'NN')), 8), ((('said', 'VBD'), ('oz', 'NN')), 8), ((('said', 'VBD'), ('the', 'DT'), ('woodman', 'NN')), 7), ((('pass', 'VB'), ('the', 'DT'), ('night', 'NN')), 6), ((('asked', 'VBN'), ('the', 'DT'), ('girl', 'NN')), 6), ((('see', 'VB'), ('the', 'DT'), ('great', 'JJ'), ('oz', 'NN')), 6), ((('answered', 'VBD'), ('oz', 'NN')), 6), ((('replied', 'VBD'), ('oz', 'NN')), 6), ((('cried', 'VBN'), ('dorothy', 'NN')), 5), ((('asked', 'VBN'), ('the', 'DT'), ('tin', 'NN')), 5), ((('asked', 'VBN'), ('th

In [24]:
# Step 1: Define the second grammar for verb phrases (VP)

# Grammar 2: Verb phrase structure → NOUN PHRASE + VERB + optional ADVERB
# Same elements as grammar 1, but the noun phrase comes before the verb:
# <DT>?     → optional determiner (0 or 1)
# <JJ>*     → zero or more adjectives
# <NN>      → noun (head of the noun phrase)
# <VB.*>    → matches any verb form (VB, VBD, VBG, VBN, VBZ, etc.)
# <RB.?>?   → optional adverb:
#               inner "?" → RB / RBR / RBS
#               outer "?" → whole adverb element is optional
chunk_grammar2 = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

# Step 2: Use grammar 2 for chunking (can switch between the two for comparison)
chunk_parser = RegexpParser(chunk_grammar2)

# Step 3: Empty list to store chunked sentences
vp_chunked_oz = []

# Step 4: Loop through each POS-tagged sentence and apply the chunk parser
for sentence in pos_tagged_oz:
    chunked_sentence = chunk_parser.parse(sentence)
    vp_chunked_oz.append(chunked_sentence)

# Step 5: Use provided function to count the 30 most common verb phrases
most_common_vp_chunks = vp_chunk_counter(vp_chunked_oz)

# Step 6: Print results (action-based insights)
print(most_common_vp_chunks)

[((('i', 'NN'), ('am', 'VBP')), 24), ((('i', 'NN'), ('was', 'VBD')), 17), ((('dorothy', 'NN'), ('was', 'VBD')), 13), ((('i', 'NN'), ('have', 'VBP')), 9), ((('i', 'NN'), ('know', 'VBP')), 8), ((('i', 'NN'), ('had', 'VBD')), 8), ((('dorothy', 'NN'), ('had', 'VBD')), 7), ((('oz', 'NN'), ('had', 'VBD')), 7), ((('the', 'DT'), ('scarecrow', 'NN'), ('said', 'VBD')), 6), ((('i', 'NN'), ('am', 'VBP'), ('oz', 'RB')), 6), ((('oz', 'NN'), ('was', 'VBD')), 6), ((('toto', 'NN'), ('did', 'VBD'), ('not', 'RB')), 5), ((('dorothy', 'NN'), ('looked', 'VBD')), 5), ((('i', 'NN'), ('have', 'VBP'), ('never', 'RB')), 5), ((('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN'), ('had', 'VBD')), 5), ((('i', 'NN'), ("'m", 'VBP')), 5), ((('i', 'NN'), ('do', 'VBP'), ("n't", 'RB')), 5), ((('i', 'NN'), ('want', 'VBP')), 5), ((('the', 'DT'), ('scarecrow', 'NN'), ('had', 'VBD')), 5), ((('the', 'DT'), ('balloon', 'NN'), ('was', 'VBD')), 4), ((('the', 'DT'), ('cyclone', 'NN'), ('had', 'VBD')), 4), ((('i', 'NN'), ('came', 'VB

## Chunk Filtering

Chunk filtering allows you to **remove unwanted parts of speech** from chunks instead of directly defining what to include. Rather than specifying the exact structure (like NP or VP), you can chunk the **entire sentence**, then filter out certain patterns that break the phrase.

### How It Works

This method chunks the full sentence first, and then **filters out specific POS tags** such as verbs or prepositions. If a filtered tag appears **in the middle of a chunk**, it **splits the chunk** into two separate pieces, which helps isolate meaningful phrases. NLTK's documentation calls this technique *chinking*.

### Chunk Filtering Grammar

```python
chunk_grammar = """NP: {<.*>+}
                       }<VB.?|IN>+{"""
```

### Explanation of the Grammar

| Component       | Purpose                                                                                      |
| --------------- | -------------------------------------------------------------------------------------------- |
| `NP:`           | Defines the chunk name (noun phrase).                                                        |
| `{<.*>+}`       | Chunks **everything** in the sentence (all POS tags).                                        |
| `}<VB.?\|IN>+{` | Filters out **verbs (`VB.?`) and prepositions (`IN`)**. These split chunks when encountered. |

### Why Use Chunk Filtering?

* Helps **extract meaningful segments** without strict grammatical rules.
* Reveals phrases where nouns naturally group together.
* Can be used to **remove noise** while keeping the functional phrase.

Chunk filtering gives you a *more flexible* way to explore patterns in POS-tagged text — especially useful when sentence structures vary.
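
The splitting behavior can be imitated in plain Python to make it concrete. This is a simplified sketch of the idea, not NLTK's implementation; `split_on_filtered_tags` is a hypothetical helper, and the sentence is tagged by hand.

```python
import re

# Hand-tagged sentence used only for illustration
tagged = [('then', 'RB'), ('she', 'PRP'), ('sat', 'VBD'), ('upon', 'IN'),
          ('a', 'DT'), ('settee', 'NN'), ('and', 'CC'), ('watched', 'VBD'),
          ('the', 'DT'), ('people', 'NNS'), ('dance', 'NN'), ('.', '.')]

def split_on_filtered_tags(sentence, filter_pattern=r"VB.?|IN"):
    """Group consecutive words, starting a new group whenever a filtered tag appears."""
    chunks, current = [], []
    for word, tag in sentence:
        if re.fullmatch(filter_pattern, tag):
            # A filtered tag closes the current group and is itself dropped.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks

print(split_on_filtered_tags(tagged))
# [['then', 'she'], ['a', 'settee', 'and'], ['the', 'people', 'dance', '.']]
```

The groups still contain tokens such as `and` and `.`, which shows why the list of filtered tags usually needs tuning for a given text.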


In [25]:
# Step 1: Chunk entire sentence into one big chunk
grammar = "Chunk: {<.*>+}"
parser = RegexpParser(grammar)
chunked_dancers = parser.parse(pos_tagged_oz[230])
print("One big chunk:", chunked_dancers)

# Step 2: Define chunk filtering grammar (NP = noun phrase)
chunk_grammar = """NP: {<.*>+}
                       }<VB.?|IN>+{"""

# Step 3: Create parser using NP filtering grammar
chunk_parser = RegexpParser(chunk_grammar)

# Step 4: Apply filtering to sentence at index 230
filtered_dancers = chunk_parser.parse(pos_tagged_oz[230])
print("Filtered NP chunks:", filtered_dancers)

# Step 5: Visualize the filtered chunks
Tree.fromstring(str(filtered_dancers)).pretty_print()

One big chunk: (S
  (Chunk
    then/RB
    she/PRP
    sat/VBD
    upon/IN
    a/DT
    settee/NN
    and/CC
    watched/VBD
    the/DT
    people/NNS
    dance/NN
    ./.))
Filtered NP chunks: (S
  (NP then/RB she/PRP)
  sat/VBD
  upon/IN
  (NP a/DT settee/NN and/CC)
  watched/VBD
  (NP the/DT people/NNS dance/NN ./.))
                                                 S                                                  
    _____________________________________________|_______________________________                    
   |       |         |               NP                  NP                      NP                 
   |       |         |          _____|_____       _______|_______        ________|________________   
sat/VBD upon/IN watched/VBD then/RB     she/PRP a/DT settee/NN and/CC the/DT people/NNS dance/NN ./.



## Review

You now have the tools to perform natural language parsing using **regular expressions** and **nltk**. With these techniques, you can uncover structure, meaning, and even bias in any text dataset.

### Key Takeaways

* **Regex Methods**

  * `.compile()` and `.match()` → look for a **single match at the start** of a string.
  * `.search()` → finds the **first match anywhere** in the string.
  * `.findall()` → returns **all non-overlapping matches**.

* **Part-of-Speech Tagging**

  * Use `pos_tag()` in nltk to label each word by its grammatical role.

* **Chunking**

  * Group words based on POS tags using a regex-style **chunk grammar**.
  * Use `RegexpParser(...).parse()` to apply chunking to POS-tagged sentences.

* **NP-Chunking (Noun Phrases)**

  * Pattern: optional `DT` + any number of `JJ` + `NN`
  * Useful for finding recurring **subjects and topics** in a text.

* **VP-Chunking (Verb Phrases)**

  * Pattern: verb `VB` + noun phrase + optional adverb `RB`
  * Reveals how actions are described and how subjects behave — helpful for detecting **bias or attitude**.

* **Chunk Filtering**

  * Instead of specifying what to include, specify what to **remove** to isolate meaningful chunks.

These tools form the foundation of many NLP preprocessing pipelines — and now you can build your own.


In [27]:
from utils.chunk_counter import chunk_counter

# custom chunk grammar: focus on descriptive noun clusters and remove noise
chunk_grammar = r'''
Chunk: {<DT>?<JJ>*<NN.*>+}
            }<VB.*|IN|CC|PRP|PRP\$>+{
'''

# create parser
chunk_parser = RegexpParser(chunk_grammar)

# list to hold chunked sentences
chunked_oz = []

# loop through each POS-tagged sentence and apply chunking
for pos_tagged_sentence in pos_tagged_oz:
    chunked_oz.append(chunk_parser.parse(pos_tagged_sentence))

# count and print most common chunks
most_common_chunks = chunk_counter(chunked_oz)
print(most_common_chunks)

[((('i', 'NN'),), 293), ((('the', 'DT'), ('scarecrow', 'NN')), 212), ((('dorothy', 'NN'),), 211), ((('the', 'DT'), ('lion', 'NN')), 147), ((('the', 'DT'), ('tin', 'NN'), ('woodman', 'NN')), 111), ((('oz', 'NN'),), 84), ((('toto', 'NN'),), 73), ((('the', 'DT'), ('woodman', 'NN')), 59), ((('the', 'DT'), ('wicked', 'JJ'), ('witch', 'NN')), 58), ((('head', 'NN'),), 56), ((('the', 'DT'), ('emerald', 'JJ'), ('city', 'NN')), 49), ((('the', 'DT'), ('witch', 'NN')), 47), ((('i', 'NNS'),), 46), ((('the', 'DT'), ('girl', 'NN')), 44), ((('the', 'DT'), ('road', 'NN')), 41), ((('brains', 'NNS'),), 34), ((('eyes', 'NNS'),), 31), ((('arms', 'NNS'),), 29), ((('the', 'DT'), ('air', 'NN')), 29), ((('nothing', 'NN'),), 28), ((('friends', 'NNS'),), 28), ((('the', 'DT'), ('country', 'NN')), 26), ((('the', 'DT'), ('people', 'NNS')), 25), ((('the', 'DT'), ('winkies', 'NNS')), 25), ((('the', 'DT'), ('land', 'NN')), 24), ((('a', 'DT'), ('heart', 'NN')), 24), ((('the', 'DT'), ('west', 'NN')), 23), ((('the', 'DT'

# Introduction

Text preprocessing is the process of cleaning and preparing text data for use in a specific context. It is used in almost all **NLP pipelines**, including:

* voice recognition software
* search engine lookups
* machine learning model training

Text data can vary widely—from its format (websites, messages, voice input) to the people generating it (language, dialect). These differences introduce **noise**, which must be reduced.

## Goal

The main objective is to keep **only the words needed** for your NLP task.

## What You Will Learn

This lesson introduces key strategies for preparing text data. While not exhaustive, it includes several widely used techniques:

* **Using Regex & NLTK libraries**
* **Noise Removal** — removing unwanted characters and formatting
* **Tokenization** — splitting text into smaller units called *tokens*
* **Normalization** — broader processing tasks including **stemming** and **lemmatization**

These methods form the foundation of most NLP workflows.


**In a nutshell: how it works**

Original Sentence

"Who was partying?"

Step 1 – Noise Removal: Remove punctuation and special characters:

"Who was partying"

Step 2 – Lowercasing: Convert all text to lowercase for consistency:

"who was partying"

Step 3 – Tokenization: Split the sentence into individual units (tokens):

["who", "was", "partying"]

Step 4 – Lemmatization: Convert each token to its base/dictionary form:

["who", "be", "party"]

## Text Preprocessing — Noise Removal

Text cleaning is a crucial step in preparing data for NLP tasks. Raw text often contains **noise** that does not contribute to your analysis and must be removed beforehand.

### Common Types of Noise

Depending on your data source (Twitter API, web scraping, voice input, etc.), you may need to remove:

* Punctuation and accents
* Special characters
* Numeric digits
* Leading/trailing/vertical whitespace
* HTML formatting
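
As a rough sketch of how each noise type in this list could be handled with the standard library alone, the helper below chains a few substitutions. The names `strip_accents` and `remove_noise` are made up for this example, and the order of the steps is one reasonable choice rather than a fixed recipe.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def remove_noise(text):
    text = re.sub(r"<[^>]+>", "", text)   # HTML tags
    text = strip_accents(text)            # accents
    text = re.sub(r"\d+", "", text)       # numeric digits
    text = re.sub(r"[^\w\s]", "", text)   # punctuation and special characters
    text = re.sub(r"\s+", " ", text)      # collapse whitespace runs, including newlines
    return text.strip()                   # leading/trailing whitespace

print(remove_noise("<p>  Café #42:\n déjà vu!  </p>"))
# Cafe deja vu
```

Which of these steps you keep depends on the task: digits or accents may be meaningful in some datasets, so treat each line as optional.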

---

### Using `re.sub()` for Noise Removal

The `sub()` function from Python’s `re` module is commonly used for text cleaning.

**It takes three required arguments:**

| Argument           | Description                                 |
| ------------------ | ------------------------------------------- |
| `pattern`          | Regex to search for (use `r""` raw strings) |
| `replacement_text` | Text to replace matches with                |
| `input`            | Original string being cleaned               |

**Returns:** a new string with replacements applied.

---

### Example 1 — Removing HTML Tags

```python
import re

text = "<p>    This is a paragraph</p>"
result = re.sub(r'<.?p>', '', text)
print(result)
#    This is a paragraph
```

Here, we replace the tags with an **empty string** `''`, effectively removing them.

---

### Example 2 — Removing Whitespace

```python
import re

text = "    This is a paragraph"
result = re.sub(r'\s{4}', '', text)
print(result)
# This is a paragraph
```

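The pattern `\s{4}` matches four consecutive whitespace characters anywhere in the string. Here the only such run is the four spaces at the beginning, so only those are removed.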
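
To make the difference between an unanchored and an anchored pattern concrete, here is a small comparison. The sample string is invented for this sketch; `^` ties the match to the start of the string.

```python
import re

text = "    This is    a paragraph"

# Unanchored: every run of four whitespace characters is removed, wherever it occurs.
unanchored = re.sub(r"\s{4}", "", text)
print(unanchored)
# This isa paragraph

# Anchored with ^: only whitespace at the very start of the string is removed.
anchored = re.sub(r"^\s+", "", text)
print(anchored)
# This is    a paragraph
```

For plain leading whitespace, `text.lstrip()` does the same job as the anchored pattern without a regex.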

---

These tools allow you to clean raw text efficiently and prepare it for tokenization, normalization, or deeper NLP processing.


In [29]:
# Step 1: Import regular expression library
import re

# Step 2: Remove opening and closing <h1> tags from the headline
headline_one = '<h1>Nation\'s Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini</h1>'
headline_no_tag = re.sub(r"<.?h1>", "", headline_one)  # removes both <h1> and </h1> tags

# Step 3: Remove all '@' characters from the tweet
tweet = '@fat_meats, veggies are better than you think.'
tweet_no_at = re.sub(r"@", "", tweet)  # removes @ symbols

# Test prints
try:
    print(headline_no_tag)
except NameError:
    print('No variable called `headline_no_tag`')

try:
    print(tweet_no_at)
except NameError:
    print('No variable called `tweet_no_at`')

Nation's Top Pseudoscientists Harness High-Energy Quartz Crystal Capable Of Reversing Effects Of Being Gemini
fat_meats, veggies are better than you think.


## Tokenization

To perform most NLP tasks, we first need to **break text into smaller components**. This process is called **tokenization**, and the resulting pieces are called **tokens**.

### Why Tokenize?

Tokenization helps with tasks such as:

* Counting words or sentences
* Tracking word frequency
* Finding co-occurring terms
* Preparing text for models and analysis

Tokens are usually **individual words**, but they can also be **sentences** or other structured units.
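
Once text is split into tokens, the counting tasks listed above become one-liners. The sketch below uses a simple regex as a stand-in tokenizer (an assumption made to keep the example free of downloads); the sample sentence is invented.

```python
import re
from collections import Counter

text = "The cat sat. The cat ran."

# Stand-in tokenizer (assumption): lowercase, then grab runs of letters.
tokens = re.findall(r"[a-z]+", text.lower())

word_counts = Counter(tokens)                   # word frequency
pair_counts = Counter(zip(tokens, tokens[1:]))  # adjacent (co-occurring) word pairs

print(word_counts.most_common(2))
# [('the', 2), ('cat', 2)]
print(pair_counts.most_common(1))
# [(('the', 'cat'), 2)]
```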

---

### Word Tokenization

Use `word_tokenize()` from **nltk** to split text into individual words:

```python
from nltk.tokenize import word_tokenize

text = "Tokenize this text"
tokenized = word_tokenize(text)

print(tokenized)
# ["Tokenize", "this", "text"]
```

---

### Sentence Tokenization

Use `sent_tokenize()` when you want to separate entire sentences:

```python
from nltk.tokenize import sent_tokenize

text = "Tokenize this sentence. Also, tokenize this sentence."
tokenized = sent_tokenize(text)

print(tokenized)
# ['Tokenize this sentence.', 'Also, tokenize this sentence.']
```

Tokenization is often one of the **first steps** in NLP preprocessing, usually right after noise removal: it gives you access to each part of the text for further analysis or transformation.


In [31]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

# Sample ECG text
ecg_text = 'An electrocardiogram is used to record the electrical conduction through a person\'s heart. The readings can be used to diagnose cardiac arrhythmias.'

# Step 1: Tokenize by word
tokenized_by_word = word_tokenize(ecg_text)

# Step 2: Tokenize by sentence
tokenized_by_sentence = sent_tokenize(ecg_text)

# Print results
try:
    print('Word Tokenization:')
    print(tokenized_by_word)
except NameError:
    print('Expected a variable called `tokenized_by_word`')

try:
    print('Sentence Tokenization:')
    print(tokenized_by_sentence)
except NameError:
    print('Expected a variable called `tokenized_by_sentence`')

Word Tokenization:
['An', 'electrocardiogram', 'is', 'used', 'to', 'record', 'the', 'electrical', 'conduction', 'through', 'a', 'person', "'s", 'heart', '.', 'The', 'readings', 'can', 'be', 'used', 'to', 'diagnose', 'cardiac', 'arrhythmias', '.']
Sentence Tokenization:
["An electrocardiogram is used to record the electrical conduction through a person's heart.", 'The readings can be used to diagnose cardiac arrhythmias.']


## Normalization

After tokenization and noise removal, many NLP workflows require **text normalization**, which prepares tokens for deeper analysis. Normalization includes several types of preprocessing:

### Common Normalization Techniques

* **Upper or lowercasing**
* **Stopword removal**
* **Stemming** — removing prefixes/suffixes
* **Lemmatization** — converting words to their root form

The simplest normalization step is **changing text to lowercase or uppercase**, ensuring consistency across tokens.

### Example — Changing Case

```python
my_string = 'tHiS HaS a MiX oF cAsEs'

print(my_string.upper())
# 'THIS HAS A MIX OF CASES'

print(my_string.lower())
# 'this has a mix of cases'
```

Normalization helps reduce variation in text (e.g., “Apple”, “apple”, “APPLE”) and improves token consistency for downstream NLP tasks.
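
A quick way to see this effect is to count tokens before and after lowercasing. The token list here is made up for illustration.

```python
from collections import Counter

tokens = ["Apple", "apple", "APPLE", "pie"]

# Without normalization, each casing is counted as a different token.
raw_counts = Counter(tokens)
print(raw_counts)
# Counter({'Apple': 1, 'apple': 1, 'APPLE': 1, 'pie': 1})

# After lowercasing, the three variants collapse into one.
lowered_counts = Counter(token.lower() for token in tokens)
print(lowered_counts)
# Counter({'apple': 3, 'pie': 1})
```

The trade-off is that case sometimes carries meaning (for example a brand name versus a common noun), so lowercasing is a choice to make per task.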

In [33]:
brands = 'Salvation Army, YMCA, Boys & Girls Club of America'

# Step 1: Convert all characters to lowercase
brands_lower = brands.lower()

# Step 2: Convert all characters to uppercase
brands_upper = brands.upper()

# Print to verify results
try:
    print(f'Lowercased brands: {brands_lower}')
except NameError:
    print('Expected a variable called `brands_lower`')

try:
    print(f'Uppercased brands: {brands_upper}')
except NameError:
    print('Expected a variable called `brands_upper`')

Lowercased brands: salvation army, ymca, boys & girls club of america
Uppercased brands: SALVATION ARMY, YMCA, BOYS & GIRLS CLUB OF AMERICA


## Stopword Removal

Stopwords are high-frequency words that usually carry **little meaning** on their own, such as **“a”**, **“an”**, and **“the”**. Removing them helps focus on the more informative words. Note that NLTK's English list also contains negations such as “not”, which can matter for sentiment analysis, so check the list against your task before applying it.

### Using NLTK’s Stopword List

```python
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
```

The words are stored as a **set** for fast lookup.

---

### Removing Stopwords from Text

```python
from nltk.tokenize import word_tokenize

nbc_statement = "NBC was founded in 1926 making it the oldest major broadcast network in the USA"

word_tokens = word_tokenize(nbc_statement)  # tokenize sentence; stop_words comes from the snippet above

statement_no_stop = [word for word in word_tokens if word not in stop_words]

print(statement_no_stop)
# ['NBC', 'founded', '1926', 'making', 'oldest', 'major', 'broadcast', 'network', 'USA']
```

---

**Process Summary:**

1. Tokenize the text.
2. Filter out words that are in `stop_words`.
3. Keep only the meaningful words for analysis.

Stopword removal is essential for focusing on **semantically relevant content** during text processing and NLP.
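
Because the stopword list is an ordinary Python set, it can be adjusted before filtering. The sketch below keeps negations for a sentiment-style task. The five-word `stop_words` set is a hand-picked stand-in for NLTK's much longer list, and the variable names are invented for this example.

```python
# Small hand-picked subset of English stopwords (assumption, for illustration only).
stop_words = {"the", "is", "a", "not", "this"}

# For sentiment-style tasks, keep negations by removing them from the stopword set.
negations = {"not", "no", "nor"}
sentiment_stop_words = stop_words - negations

tokens = ["this", "movie", "is", "not", "good"]

default_filtered = [t for t in tokens if t not in stop_words]
sentiment_filtered = [t for t in tokens if t not in sentiment_stop_words]

print(default_filtered)
# ['movie', 'good']
print(sentiment_filtered)
# ['movie', 'not', 'good']
```

With the default set the sentence appears to say the opposite of what was written, which is why the list is worth reviewing per task.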


In [34]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Sample survey text
survey_text = 'A YouGov study found that American\'s like Italian food more than any other country\'s cuisine.'

# Step 1: Load English stopwords into a set
stop_words = set(stopwords.words("english"))

# Step 2: Tokenize the survey text
tokenized_survey = word_tokenize(survey_text)

# Step 3: Remove stopwords using list comprehension
text_no_stops = [word for word in tokenized_survey if word not in stop_words]

# Print results for verification
try:
    print(f'Stopwords type: {type(stop_words)}')
except NameError:
    print('Expected a variable called `stop_words`')

try:
    print(f'Words Tokenized: {tokenized_survey}')
except NameError:
    print('Expected a variable called `tokenized_survey`')

try:
    print(f'Text without Stops: {text_no_stops}')
except NameError:
    print('Expected a variable called `text_no_stops`')

Stopwords type: <class 'set'>
Words Tokenized: ['A', 'YouGov', 'study', 'found', 'that', 'American', "'s", 'like', 'Italian', 'food', 'more', 'than', 'any', 'other', 'country', "'s", 'cuisine', '.']
Text without Stops: ['A', 'YouGov', 'study', 'found', 'American', "'s", 'like', 'Italian', 'food', 'country', "'s", 'cuisine', '.']
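
Notice that `'A'` survives in the output above: the stopword list is all lowercase, and the membership test is case-sensitive. One possible fix, sketched below, is to compare the lowercased token while keeping the original casing in the result. The small `stop_words` set is a hand-picked subset (an assumption) containing just the entries relevant to this sentence.

```python
# Tokens copied from the cell output above.
tokenized_survey = ['A', 'YouGov', 'study', 'found', 'that', 'American', "'s", 'like',
                    'Italian', 'food', 'more', 'than', 'any', 'other', 'country', "'s",
                    'cuisine', '.']

# Hand-picked subset of the stopword list (assumption, for illustration only).
stop_words = {'a', 'that', 'more', 'than', 'any', 'other'}

# Compare the lowercased token against the all-lowercase stopword set,
# but keep the original casing in the result.
text_no_stops = [word for word in tokenized_survey if word.lower() not in stop_words]

print(text_no_stops)
# ['YouGov', 'study', 'found', 'American', "'s", 'like', 'Italian', 'food', 'country', "'s", 'cuisine', '.']
```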


## Stemming

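Stemming is a normalization technique used in NLP to bluntly **strip affixes** (prefixes and suffixes) from words, reducing them to a simpler base form. NLTK's `PorterStemmer`, used below, removes suffixes only and lowercases its output. Stemming helps unify similar words such as: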

* *going* → **go**
* *connected* → **connect**
* *happiness* → **happi**

Search engines often use stemming to improve keyword matching between queries and documents.

---

### Using `PorterStemmer` in NLTK

To use stemming, first import and initialize the stemmer:

```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
```

---

### Applying Stemming to Tokenized Text

```python
tokenized = ['NBC', 'was', 'founded', 'in', '1926', '.', 'This', 'makes', 'NBC', 'the', 'oldest', 'major', 'broadcast', 'network', '.']

stemmed = [stemmer.stem(token) for token in tokenized]

print(stemmed)
# ['nbc', 'wa', 'found', 'in', '1926', '.', 'thi', 'make', 'nbc', 'the', 'oldest', 'major', 'broadcast', 'network', '.']
```

---

### Important Note

Stemming is fast but **not always precise**: the result may not be a real word (“was” → “wa”, “This” → “thi”), and unrelated words can collapse to the same stem. Use it carefully depending on your NLP task.

When real dictionary forms are needed, **lemmatization** is the more accurate alternative.
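
To see why blunt suffix stripping produces non-words, here is a toy stemmer. It is a teaching sketch only and far simpler than the Porter algorithm: `toy_stem` and its `SUFFIXES` tuple are hypothetical names, and the rule set is deliberately naive.

```python
# Toy suffix list (assumption): the real Porter algorithm uses many more rules and conditions.
SUFFIXES = ("ness", "ing", "ed", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, as long as at least two characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

for word in ["going", "connected", "networks", "was", "news"]:
    print(word, "->", toy_stem(word))
# going -> go
# connected -> connect
# networks -> network
# was -> wa
# news -> new
```

The rule that correctly turns "networks" into "network" is the same one that mangles "was" and "news": the stemmer has no dictionary to tell the cases apart.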


In [35]:
# Step 1: Import PorterStemmer and initialize it
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# Step 2: Tokenize the text
populated_island = 'Java is an Indonesian island in the Pacific Ocean. It is the most populated island in the world, with over 140 million people.'
island_tokenized = word_tokenize(populated_island)

# Step 3: Apply stemming using a list comprehension
stemmed = [stemmer.stem(token) for token in island_tokenized]

# Print results for verification
try:
    print('A stemmer exists:')
    print(stemmer)
except NameError:
    print('Expected a variable called `stemmer`')

try:
    print('Words Tokenized:')
    print(island_tokenized)
except NameError:
    print('Expected a variable called `island_tokenized`')

try:
    print('Stemmed Words:')
    print(stemmed)
except NameError:
    print('Expected a variable called `stemmed`')

A stemmer exists:
<PorterStemmer>
Words Tokenized:
['Java', 'is', 'an', 'Indonesian', 'island', 'in', 'the', 'Pacific', 'Ocean', '.', 'It', 'is', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'with', 'over', '140', 'million', 'people', '.']
Stemmed Words:
['java', 'is', 'an', 'indonesian', 'island', 'in', 'the', 'pacif', 'ocean', '.', 'it', 'is', 'the', 'most', 'popul', 'island', 'in', 'the', 'world', ',', 'with', 'over', '140', 'million', 'peopl', '.']


## Lemmatization

**Lemmatization** is a normalization technique that converts words to their **root (dictionary) form**, known as the *lemma*. Unlike stemming, lemmatization considers the **part of speech (POS)** of each word, making it **more accurate but less efficient**.

---

### Using `WordNetLemmatizer` in NLTK

```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
```

---

### Applying Lemmatization to Tokens

```python
tokenized = ["NBC", "was", "founded", "in", "1926"]

lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

print(lemmatized)
# ["NBC", "wa", "founded", "in", "1926"]
```

---

### Why “was” → “wa”?

Because `lemmatize()` **assumes every word is a noun by default**.
To get accurate results (e.g., “was” → “be”), we must supply **part-of-speech tags** — which we'll explore in the next exercise.

---

**Key Difference vs. Stemming:**

| Feature        | Stemming           | Lemmatization         |
| -------------- | ------------------ | --------------------- |
| Considers POS? | ❌ No               | ✔ Yes                 |
| Accuracy       | Low–Medium         | High                  |
| Output         | Can be nonsensical | Valid dictionary word |

Lemmatization is most useful when **POS tagging** is available — then it becomes a powerful tool for NLP preprocessing.


In [36]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Step 1: Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Step 2: Tokenize the input string
populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'
tokenized_string = word_tokenize(populated_island)

# Step 3: Lemmatize each token using list comprehension
lemmatized_words = [lemmatizer.lemmatize(token) for token in tokenized_string]

# Print results for verification
try:
    print(f'A lemmatizer exists: {lemmatizer}')
except NameError:
    print('Expected a variable called `lemmatizer`')
try:
    print(f'Words Tokenized: {tokenized_string}')
except NameError:
    print('Expected a variable called `tokenized_string`')
try:
    print(f'Lemmatized Words: {lemmatized_words}')
except NameError:
    print('Expected a variable called `lemmatized_words`')

A lemmatizer exists: <WordNetLemmatizer>
Words Tokenized: ['Indonesia', 'was', 'founded', 'in', '1945', '.', 'It', 'contains', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']
Lemmatized Words: ['Indonesia', 'wa', 'founded', 'in', '1945', '.', 'It', 'contains', 'the', 'most', 'populated', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']


## Part-of-Speech Tagging

To improve lemmatization accuracy, we first need to determine the **part of speech (POS)** for each word in a string. This allows us to provide contextual information to the lemmatizer, resulting in more accurate word roots.

### How the POS Tagging Function Works

#### 1. Import Required Tools

```python
from nltk.corpus import wordnet
from collections import Counter
```

* **wordnet** provides word meanings and POS categories
* **Counter** counts occurrences of each POS type

---

#### 2. Get Synsets (Word Senses)

```python
def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)
```

`synsets()` returns possible meanings of a word — each tagged with a POS.

---

#### 3. Count POS Types

We count how many times each part of speech appears:

```python
pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])  # nouns
# ... same for verbs, adjectives, adverbs
```

---

#### 4. Determine the Most Likely POS

```python
most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
```

This returns the most probable part of speech for the input word.
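
Putting the four steps together, one way to write the helper is shown below. It is a sketch along the lines described above and not necessarily identical to the `get_part_of_speech` used in the exercises. The split into a separate `most_likely_pos` function is an addition made here so the counting logic can be tried without NLTK installed.

```python
from collections import Counter

def most_likely_pos(pos_tags):
    """Return the most frequent WordNet POS letter in pos_tags ('n' if none match)."""
    # Seed the counter in a fixed order so that ties, and words with no synsets,
    # fall back to 'n' (the lemmatizer's own default).
    pos_counts = Counter({"n": 0, "v": 0, "a": 0, "r": 0})
    for tag in pos_tags:
        if tag in pos_counts:
            pos_counts[tag] += 1
    return pos_counts.most_common(1)[0][0]

def get_part_of_speech(word):
    # Imported here so most_likely_pos() can be used without NLTK.
    from nltk.corpus import wordnet  # needs the WordNet corpus: nltk.download('wordnet')
    return most_likely_pos(synset.pos() for synset in wordnet.synsets(word))

print(most_likely_pos(["v", "v", "n"]))  # v
print(most_likely_pos([]))               # n
```

Only the four letters `n`, `v`, `a`, and `r` are counted; any other tag a synset reports is ignored in this sketch.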

---

### Example: Improved Lemmatization

```python
tokenized = ["How", "old", "is", "the", "country", "Indonesia"]

lemmatized = [
    lemmatizer.lemmatize(token, get_part_of_speech(token)) 
    for token in tokenized
]

print(lemmatized)
# ['How', 'old', 'be', 'the', 'country', 'Indonesia']
```

Because the correct POS was provided, **“is” became “be”** — something basic lemmatization could not achieve.

---

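This approach gives the lemmatizer a **part-of-speech hint for each word**, which makes normalization noticeably more accurate. Note that the helper picks the most common POS among a word's WordNet senses; it does not look at the surrounding sentence.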
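
The helper described above looks at each word in isolation. An alternative, sketched here as an assumption rather than as part of this lesson's code, is to use a tagger that reads the whole sentence (such as `nltk.pos_tag`, which returns Penn Treebank tags like `NN`, `VBD`, or `JJ`) and map those tags to the single letters the lemmatizer expects. The function name `penn_to_wordnet` is invented for this example.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else fall back to the lemmatizer's default

# Hand-written (token, tag) pairs for illustration (assumption); in practice they
# would come from a tagger such as nltk.pos_tag(tokens).
tagged = [("Indonesia", "NNP"), ("was", "VBD"), ("founded", "VBN"), ("in", "IN")]
mapped = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(mapped)
# [('Indonesia', 'n'), ('was', 'v'), ('founded', 'v'), ('in', 'n')]
```

Each mapped letter can then be passed as the second argument to `lemmatizer.lemmatize(token, pos)`.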


In [38]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from utils.part_of_speech2 import get_part_of_speech

# Step 1: Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Step 2: Tokenize the input string
populated_island = 'Indonesia was founded in 1945. It contains the most populated island in the world, Java, with over 140 million people.'
tokenized_string = word_tokenize(populated_island)

# Step 3: Lemmatize using POS tagging for better accuracy
lemmatized_pos = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized_string]

# Print results
try:
    print(f'The lemmatized words are: {lemmatized_pos}')
except NameError:
    print('Expected a variable called `lemmatized_pos`')

The lemmatized words are: ['Indonesia', 'be', 'found', 'in', '1945', '.', 'It', 'contain', 'the', 'most', 'populate', 'island', 'in', 'the', 'world', ',', 'Java', ',', 'with', 'over', '140', 'million', 'people', '.']


## Review

This lesson introduced essential techniques for **text preprocessing**, a crucial step before applying NLP models or analysis. The goal is to turn raw text into a clean and structured form suitable for downstream tasks.

### Key Concepts Covered

* **Text Preprocessing**
  The process of cleaning and formatting text so it is usable for NLP applications.

* **Noise Removal**
  Removes irrelevant elements such as punctuation, HTML tags, symbols, or excess whitespace.

* **Tokenization**
  Splits text into smaller units — typically **words** or **sentences** — using tools like `word_tokenize()` and `sent_tokenize()`.

* **Normalization**
  A set of transformations that makes tokens more consistent:

  * Uppercasing / Lowercasing
  * Stopword removal
  * **Stemming** — bluntly removes prefixes/suffixes
  * **Lemmatization** — finds the dictionary root of a word

* **Stemming**
  Reduces words to shorter forms (may lose meaning).
  Example: “running” → “run”, “was” → “wa”

* **Lemmatization**
  More accurate than stemming and often uses **part-of-speech tagging** to return meaningful word roots.
  Example: “was” → “be”

---

### Final Takeaway

Before building your preprocessing pipeline, **decide how you want the text formatted** and **why**. Once your goal is clear, these preprocessing tools allow you to shape the data exactly as needed for NLP models, analysis, or feature extraction.


In [40]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from utils.part_of_speech import get_part_of_speech
from nltk.corpus import stopwords
import re

# Step 1: Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Step 2: Original HTML text
oprah_wiki = "<p>Working in local media, she was both the youngest news anchor and the first black female news anchor at Nashville's WLAC-TV. </p>"

# Step 3: Remove <p> tags
cleaned_wiki = re.sub(r"<.?p>", "", oprah_wiki)

# Step 4: Remove periods and commas using regex
cleaned_wiki = re.sub(r"[.,]", "", cleaned_wiki)

# Step 5: Convert to lowercase
cleaned_wiki = cleaned_wiki.lower()

# Step 6: Tokenize
tokenized_sentence = word_tokenize(cleaned_wiki)

# Step 7: Load the English stopword list (tokens are filtered in the next cell)
stop_words = set(stopwords.words("english"))
print(stop_words)

{'further', "they'd", 'same', 'did', "that'll", 'd', "i've", "needn't", 'only', 'very', 'that', 'not', 'and', 'some', 'a', "we'll", 'yours', 'can', "she's", "we're", "shan't", 'for', "she'd", 'needn', 'in', 'mightn', 'so', 'll', 'me', 'is', 'more', 'there', 'doing', "you'll", 'then', "it's", "he's", 'm', 'but', 'between', 'hasn', 'itself', "they've", "i'm", 'no', 'both', 'you', "hasn't", 'through', 'again', 'as', 'during', 'its', 'themselves', "aren't", 'him', 'an', 'shouldn', 'where', 'being', 'while', 'weren', 'against', 'how', 'should', 'who', 'few', 'out', 'are', 'whom', "you're", 'these', 've', 'of', 'will', 'below', 'over', 'now', 'this', 'all', "shouldn't", 'what', 'haven', "it'll", 'down', "i'd", 'under', 'with', "you've", 'didn', 'most', "it'd", 't', 'your', 'about', 'do', 'they', 'too', 'her', 'does', 'here', 'their', 'doesn', 'won', 'own', 're', 'theirs', "couldn't", 's', 'each', 'mustn', 'because', 'other', 'were', 'don', 'hers', "don't", 'hadn', 'the', 'wouldn', 'than', 'a

In [41]:
stop_words.update({"''", "``", "'s"})  # extra cleanup
text_no_stops = [word for word in tokenized_sentence if word not in stop_words]

# Step 8: Lemmatize using POS tagging
lemmatized_words = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in text_no_stops]

# Final output
print(lemmatized_words)

['work', 'local', 'medium', 'young', 'news', 'anchor', 'first', 'black', 'female', 'news', 'anchor', 'nashville', 'wlac-tv']
