# Natural Language Processing: Words, Tokens, and Regular Expressions
## Exercises Notebook - Session 2

This notebook contains exercises covering:
- Word and token counting
- Regular expressions in Python
- Unicode handling
- Unix text processing tools
- Tokenization concepts

---
## Section 1: Word and Token Counting
---

### Exercise 1.1: Counting Words

Given the sentence from the slides: *"They picnicked by the pool, then lay back on the grass and looked at the stars."*

Write Python code to:
1. Count words excluding punctuation
2. Count words including punctuation as separate tokens
3. Count unique word types (case-insensitive)

In [None]:
# YOUR CODE HERE
sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
import re

# 1. Count words excluding punctuation
words_no_punct = re.findall(r"\b\w+\b", sentence)
print(f"Words excluding punctuation: {words_no_punct}")
# We need to use len() to count the number of words excluding punctuation
print(f"Number of words excluding punctuation: {len(words_no_punct)}")

# 2. Count words including punctuation as separate tokens
# \w+ matches a word; [^\w\s] matches a single punctuation mark
words_with_punct = re.findall(r"\w+|[^\w\s]", sentence)
print(f"Words including punctuation: {words_with_punct}")
# We need to use len() to count the number of tokens including punctuation
print(f"Number of words including punctuation: {len(words_with_punct)}")

# 3. Count unique types (case-insensitive)
unique_types = set(word.lower() for word in words_no_punct)
print(f"Unique types (case-insensitive): {unique_types}")
# We need to use len() to count the number of unique types
print(f"Number of unique types (case-insensitive): {len(unique_types)}")

Words excluding punctuation: ['They', 'picnicked', 'by', 'the', 'pool', 'then', 'lay', 'back', 'on', 'the', 'grass', 'and', 'looked', 'at', 'the', 'stars']
Number of words excluding punctuation: 16
Words including punctuation: ['They', 'picnicked', 'by', 'the', 'pool', ',', 'then', 'lay', 'back', 'on', 'the', 'grass', 'and', 'looked', 'at', 'the', 'stars', '.']
Number of words including punctuation: 18
Unique types (case-insensitive): {'they', 'lay', 'grass', 'the', 'picnicked', 'at', 'back', 'by', 'on', 'pool', 'and', 'stars', 'looked', 'then'}
Number of unique types (case-insensitive): 14
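
The token and type counts can also be read off a frequency table. The sketch below is one possible approach, assuming `collections.Counter` from the standard library and the same `\w+` word pattern; the names `tokens` and `type_counts` are illustrative and not taken from the slides.

```python
import re
from collections import Counter

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Lowercase first so that "The" and "the" would count as the same type
tokens = re.findall(r"\w+", sentence.lower())
type_counts = Counter(tokens)

print(f"Tokens: {len(tokens)}")        # total word instances
print(f"Types: {len(type_counts)}")    # distinct word forms
print(f"Most frequent: {type_counts.most_common(1)}")
```

A `Counter` keeps the per-word frequencies as well, which a plain `set` discards.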


### Exercise 1.2: Handling Disfluencies

The slides show the utterance: *"I do uh main- mainly business data processing"*

Write code to:
1. Count all tokens including disfluencies
2. Remove filled pauses (uh, um) and fragments (words ending with -)
3. Count "clean" words

In [7]:
# YOUR CODE HERE
utterance = "I do uh main- mainly business data processing"

# 1. Count all tokens including disfluencies
# We can use the split() method to tokenize the utterance by whitespace
tokens_with_disfluencies = utterance.split()
print(f"Tokens including disfluencies: {tokens_with_disfluencies}")
# We need to use len() to count the number of tokens including disfluencies
print(f"Number of tokens including disfluencies: {len(tokens_with_disfluencies)}")

# 2. Remove filled pauses (uh, um) and fragments (words ending with -)
# We can use a list comprehension to eliminate filled pauses and fragments
tokens_cleaned = [token for token in tokens_with_disfluencies if token not in {"uh", "um"} and not token.endswith("-")]
print(f"Tokens after removing disfluencies: {tokens_cleaned}")

# 3. Count "clean" words
# We can use len() to count the number of clean words
print(f"Number of clean words: {len(tokens_cleaned)}")

Tokens including disfluencies: ['I', 'do', 'uh', 'main-', 'mainly', 'business', 'data', 'processing']
Number of tokens including disfluencies: 8
Tokens after removing disfluencies: ['I', 'do', 'mainly', 'business', 'data', 'processing']
Number of clean words: 6


---
## Section 2: Regular Expressions
---

### Exercise 2.1: Basic Pattern Matching

Using the regex patterns from the slides, write code to:
1. Find all words starting with capital letters
2. Find all digits in a text
3. Find words that are NOT capitalized

In [None]:
# YOUR CODE HERE
text = "Chapter 1: Down the Rabbit Hole. Alice was 7 years old in 1865."

# 1. Find all words starting with capital letters
capitalized_words = re.findall(r"\b[A-Z][a-zA-Z]*\b", text)
print(f"Capitalized words: {capitalized_words}")

# 2. Find all digits in a text
digits = re.findall(r"\d+", text)
print(f"Digits in the text: {digits}")

# 3. Find words that are NOT capitalized
non_capitalized_words = re.findall(r"\b[a-z][a-zA-Z]*\b", text)
print(f"Non-capitalized words: {non_capitalized_words}")

### Exercise 2.2: Kleene Star and Plus

The slides explain Kleene operators (* and +).

Write patterns to match:
1. "baa" followed by zero or more 'a's (baa, baaa, baaaa...)
2. One or more digits
3. Any sequence of characters (wildcard)

In [10]:
# YOUR CODE HERE
test_strings = ["ba", "baa", "baaa", "baaaa", "baaaaa"]

# 1. "baa" followed by zero or more 'a's (baa, baaa, baaaa...)
pattern_1 = r"^baa+$"
matches_1 = [s for s in test_strings if re.match(pattern_1, s)]
print(f"Strings matching 'baa' followed by zero or more 'a's: {matches_1}")

# 2. One or more digits
pattern_2 = r"^\d+$"
matches_2 = [s for s in test_strings if re.match(pattern_2, s)]
print(f"Strings matching one or more digits: {matches_2}")

# 3. Any sequence of characters (wildcard)
pattern_3 = r"^.*$"
matches_3 = [s for s in test_strings if re.match(pattern_3, s)]
print(f"Strings matching any sequence of characters: {matches_3}")

Strings matching 'baa' followed by zero or more 'a's: ['baa', 'baaa', 'baaaa', 'baaaaa']
Strings matching one or more digits: []
Strings matching any sequence of characters: ['ba', 'baa', 'baaa', 'baaaa', 'baaaaa']


### Exercise 2.3: Anchors and Word Boundaries

Write patterns using anchors (^, $, \b) to:
1. Match lines starting with a capital letter
2. Match lines ending with a period
3. Find the word "the" as a complete word (not in "other" or "there")

In [None]:
# YOUR CODE HERE
lines = [
    "The quick brown fox",
    "jumped over the lazy dog.",
    "other animals watched.",
    "There was nothing else."
]

# 1. Match lines starting with a capital letter
# We use the caret (^) to indicate the start of the line and [A-Z] to match any capital letter
pattern_start_capital = r"^[A-Z].*"
lines_starting_with_capital = [line for line in lines if re.match(pattern_start_capital, line)]
print(f"Lines starting with a capital letter: {lines_starting_with_capital}")

# 2. Match lines ending with a period
# We use the dollar sign ($) to indicate the end of the line and \. to match a literal period
pattern_end_period = r".*\.$"
lines_ending_with_period = [line for line in lines if re.match(pattern_end_period, line)]
print(f"Lines ending with a period: {lines_ending_with_period}")

# 3. Find the word "the" as a complete word (not in "other" or "there")
# We use \b to indicate word boundaries around "the"
pattern_whole_word_the = r"\bthe\b"
lines_with_whole_word_the = [line for line in lines if re.search(pattern_whole_word_the, line, re.IGNORECASE)]
print(f"Lines containing the whole word 'the': {lines_with_whole_word_the}")

### Exercise 2.4: Substitutions and Capture Groups

The slides show date format conversion. Write code to:
1. Convert US dates (mm/dd/yyyy) to EU format (dd-mm-yyyy)
2. Anonymize email addresses by replacing with [EMAIL]
3. Convert "I'm" contractions to "I am"

In [None]:
# YOUR CODE HERE
text_dates = "Meeting on 10/15/2011 and 03/22/2024"
text_emails = "Contact john@example.com or jane@test.org"
text_contractions = "I'm happy and I'm excited"

### Solution 2.4

In [None]:
import re

# 1. Date format conversion (exact example from slides)
text_dates = "The date is 10/15/2011"
converted = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\2-\1-\3', text_dates)
print(f"1. US to EU date: '{text_dates}' -> '{converted}'")

# 2. Email anonymization
text_emails = "Contact john@example.com or jane@test.org"
anon = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text_emails)
print(f"\n2. Anonymized: '{anon}'")

# 3. Contraction expansion
text_contractions = "I'm happy and I'm excited"
expanded = re.sub(r"I'm", "I am", text_contractions)
print(f"\n3. Expanded: '{expanded}'")

# EXPLANATION (Slides reference):
# Capture groups use parentheses () to store matched values.
# In the replacement string:
# - \1 refers to first captured group (month)
# - \2 refers to second captured group (day)
# - \3 refers to third captured group (year)
# The slides show: re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\2-\1-\3", string)
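
Numbered references such as `\1` can become hard to read as patterns grow. As a possible alternative (not shown in the slides, as far as this notebook indicates), Python's `re` module also supports named groups. The sketch below assumes the group names `month`, `day`, and `year`, which are chosen here purely for illustration.

```python
import re

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement
us_date = re.compile(r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})")

text = "Meeting on 10/15/2011 and 03/22/2024"
eu_text = us_date.sub(r"\g<day>-\g<month>-\g<year>", text)
print(eu_text)
```

The replacement string now documents itself: the day comes first, then the month, then the year.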

### Exercise 2.5: ELIZA-style Pattern Matching

The slides show ELIZA, an early chatbot using regex.
Implement simple ELIZA rules for:
1. "I'm [emotion]" -> "WHY DO YOU THINK YOU ARE [emotion]"
2. "I need [something]" -> "WHAT WOULD IT MEAN IF YOU GOT [something]"
3. Any input with "always" -> "CAN YOU THINK OF A SPECIFIC EXAMPLE?"

In [None]:
# These are the rules for a simple ELIZA-like chatbot

# 1. "I'm [emotion]" -> "WHY DO YOU THINK YOU ARE [emotion]"
# 2. "I need [something]" -> "WHAT WOULD IT MEAN IF YOU GOT [something]"
# 3. Any input with "always" -> "CAN YOU THINK OF A SPECIFIC EXAMPLE?"

def simple_eliza(user_input):
    # Implement ELIZA rules
    # Rule 1
    if match := re.match(r"I'm (.+)", user_input, re.IGNORECASE):
        emotion = match.group(1)
        return f"WHY DO YOU THINK YOU ARE {emotion.upper()}"
    # Rule 2
    elif match := re.match(r"I need (.+)", user_input, re.IGNORECASE):
        something = match.group(1)
        return f"WHAT WOULD IT MEAN IF YOU GOT {something.upper()}"
    # Rule 3
    elif re.search(r"always", user_input, re.IGNORECASE):
        return "CAN YOU THINK OF A SPECIFIC EXAMPLE?"
    else:
        return "PLEASE TELL ME MORE."

# Test
test_inputs = [
    "I'm depressed",
    "I need a vacation", 
    "They always ignore me"
]
# Now we can test the function with the test inputs
for input_str in test_inputs:
    response = simple_eliza(input_str)
    print(f"User: {input_str}\nELIZA: {response}\n") 

---
## Section 3: Unicode and Encoding
---

### Exercise 3.1: Unicode Code Points

The slides explain that Unicode assigns code points to characters.
Write code to:
1. Get the code point of 'a' (should be U+0061)
2. Get the character for code point U+00F1 (ñ)
3. Show the UTF-8 byte encoding of "hello"

In [12]:
# 1. Get the code point of 'a' (should be U+0061)
# We can use the ord() function to get the Unicode code point of a character
code_point_a = ord('a')
print(f"Code point of 'a': U+{code_point_a:04X}")

# 2. Get the character for code point U+00F1 (ñ)
# We can use the chr() function to get the character from a Unicode code point
char_ñ = chr(0x00F1)
print(f"Character for code point U+00F1: {char_ñ}")

# 3. Show the UTF-8 byte encoding of "hello"
# We can use the encode() method to get the UTF-8 byte encoding of a string
utf8_hello = "hello".encode('utf-8')
print(f"UTF-8 byte encoding of 'hello': {utf8_hello}")

Code point of 'a': U+0061
Character for code point U+00F1: ñ
UTF-8 byte encoding of 'hello': b'hello'


### Exercise 3.2: Multi-byte UTF-8 Characters

The slides show that different scripts need different byte lengths.
Examine the byte lengths for:
1. ASCII character: 'A'
2. Spanish: 'ñ' (U+00F1)
3. Chinese: '姚' (a character from the slides)
4. Emoji: '😀'

In [16]:
# Examine the byte lengths for:
# 1. ASCII character: 'A'
ascii_char = "A"
# We can use the encode() method to get the UTF-8 byte encoding of the ASCII character
ascii_bytes = ascii_char.encode("utf-8")
# We need to use len() to get the byte length
print(f"Byte length of ASCII character 'A': {len(ascii_bytes)} bytes")

# 2. Spanish: 'ñ' (U+00F1)
spanish_char = "ñ"
# We can use the encode() method to get the UTF-8 byte encoding of the Spanish character
spanish_bytes = spanish_char.encode("utf-8")
# We need to use len() to get the byte length
print(f"Byte length of Spanish character 'ñ': {len(spanish_bytes)} bytes")

# 3. Chinese: '姚' (a character from the slides)
chinese_char = "姚"
# We can use the encode() method to get the UTF-8 byte encoding of the Chinese character
chinese_bytes = chinese_char.encode("utf-8")
# We need to use len() to get the byte length
print(f"Byte length of Chinese character '姚': {len(chinese_bytes)} bytes")

# 4. Emoji: '😀'
emoji_char = "😀"
# We can use the encode() method to get the UTF-8 byte encoding of the emoji character
emoji_bytes = emoji_char.encode("utf-8")
# We need to use len() to get the byte length
print(f"Byte length of emoji character '😀': {len(emoji_bytes)} bytes")

Byte length of ASCII character 'A': 1 bytes
Byte length of Spanish character 'ñ': 2 bytes
Byte length of Chinese character '姚': 3 bytes
Byte length of emoji character '😀': 4 bytes
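
The byte lengths above follow from the code point ranges that UTF-8 assigns to each encoded length. The sketch below is a minimal illustration of that rule; `utf8_length` is a hypothetical helper name, and the characters are written as escape sequences so the example does not depend on the file's own encoding.

```python
def utf8_length(char):
    """Return the number of UTF-8 bytes for one character, based on its code point."""
    code_point = ord(char)
    if code_point <= 0x7F:        # U+0000..U+007F: ASCII
        return 1
    if code_point <= 0x7FF:       # U+0080..U+07FF
        return 2
    if code_point <= 0xFFFF:      # U+0800..U+FFFF
        return 3
    return 4                      # U+10000..U+10FFFF

for char in ["A", "\u00f1", "\u59da", "\U0001F600"]:
    # Cross-check the range rule against Python's own encoder
    assert utf8_length(char) == len(char.encode("utf-8"))
    print(f"U+{ord(char):04X}: {utf8_length(char)} byte(s)")
```

In practice `len(char.encode("utf-8"))` is all that is needed; the helper only makes the underlying ranges visible.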


### Exercise 3.3: String Length vs Byte Length

The slides note that len() returns code points, not bytes.
Compare string length vs byte length for mixed text.

In [None]:
# YOUR CODE HERE
mixed_text = "Señor 你好 😀"

# Compare string length vs byte length for mixed text.
# For string length, we can use len() directly on the string
string_length = len(mixed_text)
# For byte length, we encode the string to UTF-8 and then use len()
byte_length = len(mixed_text.encode("utf-8"))
print(f"String length: {string_length} characters")
print(f"Byte length (UTF-8): {byte_length} bytes")

String length: 10 characters
Byte length (UTF-8): 18 bytes


---
## Section 4: Shell Commands for Text Processing
---

### Exercise 4.1: Basic tr Command

The slides show Unix tools for tokenization.
Using shell commands, tokenize text by:
1. Converting all characters to lowercase
2. Replacing non-alphabetic chars with newlines

In [None]:
# Create a sample file first
sample_text = """THE SONNETS
by William Shakespeare
From fairest creatures
We desire increase"""

with open('sample.txt', 'w') as f:
    f.write(sample_text)

# YOUR SHELL COMMANDS HERE (use ! prefix in Jupyter)
# 1. Converting all characters to lowercase
# We use the wsl prefix to run the command in WSL environment
! wsl cat sample.txt | wsl tr 'A-Z' 'a-z'

# 2. Replacing non-alphabetic chars with newlines
! wsl sed 's/[^a-zA-Z]/\n/g' sample.txt

"cat" is not recognized as an internal or external command,
operable program or batch file.
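
If the Unix tools are not available from the notebook's shell, as the error message above suggests, the same two steps can be approximated in Python. This is a sketch under that assumption and not a replacement for the shell exercise; the variable names `lowered` and `one_per_line` are illustrative.

```python
import re

sample_text = """THE SONNETS
by William Shakespeare
From fairest creatures
We desire increase"""

# 1. Lowercase everything (roughly what tr 'A-Z' 'a-z' does for ASCII letters)
lowered = sample_text.lower()
print(lowered)

# 2. Replace every non-alphabetic character with a newline (like the sed command)
one_per_line = re.sub(r"[^a-zA-Z]", "\n", sample_text)
print(one_per_line)
```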


### Exercise 4.2: Word Frequency Count

Implement the complete pipeline from the slides:
```
tr -sc 'A-Za-z' '\n' < file | sort | uniq -c | sort -n -r
```
This should output word frequencies.

In [None]:
# YOUR CODE HERE - implement in shell or Python
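
One way to approach this in Python is to mirror each stage of the pipeline with a standard-library call. The sketch below is a possible starting point, not the reference solution; `word_frequencies` is a hypothetical helper, the sample string is made up for the demonstration, and the order of words with equal counts may differ from what `sort -n -r` prints.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (word, count) pairs, most frequent first."""
    # tr -sc 'A-Za-z' '\n': keep only runs of ASCII letters, one word each
    words = re.findall(r"[A-Za-z]+", text)
    # sort | uniq -c: count identical words
    counts = Counter(words)
    # sort -n -r: highest counts first
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

sample = "the cat saw the dog and the dog saw the cat"
for word, count in word_frequencies(sample):
    print(f"{count:7d} {word}")
```

To run it on a file, the text could be read with `open(...).read()` and passed to the same function.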

### Exercise 4.3: The Mystery 'd'

The slides show that in Shakespeare's text, 'd' appears 8954 times.
The slide asks "What happened here?"

Explain why and write code to investigate contractions like "'d".

In [None]:
# YOUR CODE HERE
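
A likely explanation is that the letters-only pipeline treats the apostrophe as a separator, so an elided past tense such as "lov'd" is split into "lov" and a stray "d". The sketch below illustrates the effect on an invented sample sentence (not a quotation from the corpus); the apostrophe-aware pattern is one possible choice among several.

```python
import re
from collections import Counter

# Invented sample with elided past-tense forms (not a corpus quote)
sample = "He lov'd her and she lov'd him, yet both were wrong'd."

# Letters-only tokenization, as in the tr pipeline: the apostrophe splits words
letters_only = re.findall(r"[A-Za-z]+", sample)
print(Counter(letters_only)["d"])

# Allowing an internal apostrophe keeps the contractions intact
with_apostrophes = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", sample)
print(Counter(with_apostrophes)["d"])

# List the tokens ending in 'd to see where the stray "d" came from
contractions = [w for w in with_apostrophes if w.endswith("'d")]
print(contractions)
```

Running the same comparison on the full Shakespeare text would show whether these contractions account for the count reported on the slide.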