# Text Tokenization Exercise

This exercise explores the challenges of splitting text into sentences and words when dealing with complex real-world text containing dates, amounts, URLs, emails, acronyms, and multi-word expressions.

## The Challenge

Given a text variable, split it into:
1. **Sentences** - logical units of meaning ending with terminal punctuation
2. **Words (tokens)** - individual meaningful units

In [1]:
# Sample text with challenging elements
text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

print("Original text:")
print(text)


Original text:
Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.


In [None]:
import re

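def split_sentences(text):
    # Naive approach: break after every period that is followed by whitespace.
    # This also breaks after abbreviations such as "Dr." and "Jan.".
    sentences = re.split(r'(?<=\.)\s+', text)
    return [s.strip() for s in sentences if s.strip()]

def split_words(text):
    # Naive approach: keep runs of word characters and drop all punctuation,
    # which breaks up amounts, e-mail addresses, URLs, and acronyms.
    return re.findall(r'\b\w+\b', text)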


sentences = split_sentences(text)
words = split_words(text)

print("Sentences:")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent}")

print("\nWords:")
print(words)

Sentences:
1. Dr.
2. John Smith, Ph.D., earned $1,250.50 on Jan.
3. 15, 2024, for his work at A.I.
4. Corp.
5. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.
6. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.

Words:
['Dr', 'John', 'Smith', 'Ph', 'D', 'earned', '1', '250', '50', 'on', 'Jan', '15', '2024', 'for', 'his', 'work', 'at', 'A', 'I', 'Corp', 'You', 'can', 'reach', 'him', 'at', 'j', 'smith', 'ai', 'corp', 'co', 'uk', 'or', 'visit', 'https', 'www', 'ai', 'corp', 'co', 'uk', 'team', 'dr', 'smith', 'for', 'more', 'info', 'The', 'U', 'S', 'A', 'based', 'company', 'reported', 'a', '23', '5', 'increase', 'in', 'Q3', 'revenue', 'totaling', '2', '5M']
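
The output above shows where the naive regexes fail: abbreviations end sentences early, and amounts, the e-mail address, the URL, and the acronyms are cut into fragments. A minimal sketch of one possible improvement follows. It is not a definitive tokenizer: the names `ABBREVIATIONS`, `TOKEN_PATTERNS`, `split_sentences_sketch`, and `tokenize_sketch` are hypothetical, the abbreviation list is hand-picked for this sample, and "Corp." is deliberately left out of it because in this text it ends a sentence.

```python
import re

text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

# Assumption: a tiny, hand-picked list. Real text needs a much larger one.
ABBREVIATIONS = ("Dr", "Mr", "Mrs", "Ms", "Prof", "Jan", "Feb")
ABBREVIATION_SET = {a.lower() + "." for a in ABBREVIATIONS}

# Order matters: the most specific patterns are tried first.
TOKEN_PATTERNS = [
    r"https?://\S+[^\s.,;:!?)]",                   # URLs, minus trailing punctuation
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",               # e-mail addresses
    r"[$€£]?\d+(?:,\d{3})*(?:\.\d+)?[%A-Za-z]*",   # amounts, percentages, plain numbers
    r"(?:[A-Za-z]\.){2,}(?:-\w+)*",                # dotted acronyms: A.I., U.S.A.-based
    r"[A-Za-z]+\.(?:[A-Za-z]\.)+",                 # mixed abbreviations: Ph.D.
    r"(?:" + "|".join(ABBREVIATIONS) + r")\.",     # listed abbreviations: Dr., Jan.
    r"\w+(?:[-']\w+)*",                            # ordinary and hyphenated words
    r"[^\w\s]",                                    # any other single punctuation mark
]
TOKEN_RE = re.compile("|".join(TOKEN_PATTERNS))


def tokenize_sketch(text):
    # All groups are non-capturing, so findall returns the full matches.
    return TOKEN_RE.findall(text)


def split_sentences_sketch(text):
    sentences = []
    start = 0
    # Candidate boundary: terminal punctuation, whitespace, then a capital or digit.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z0-9])", text):
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].lower()
        # Skip the boundary if the preceding token is a listed abbreviation
        # or a dotted acronym such as "a.i.".
        if last_token in ABBREVIATION_SET or re.fullmatch(r"(?:[a-z]\.){2,}", last_token):
            continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sentences = split_sentences_sketch(text)
tokens = tokenize_sketch(text)

for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent}")
print(tokens)
```

On the sample text this sketch yields three sentences and keeps items such as `$1,250.50`, `j.smith@ai-corp.co.uk`, and `U.S.A.-based` as single tokens. The abbreviation list remains the weak point: whether "Corp." ends a sentence depends on context that a fixed list cannot capture.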


# Corpus Tokenization Exercise

This exercise explores the challenges of splitting a large corpus into words and finding the most common ones.

## The Challenge

Given the file `shakes.txt` in the book folder, find the most common words in Shakespeare's book.

In [2]:
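import re
from collections import Counter
from itertools import islice

def tokenize_words(text):
    # Lowercase first so that "The" and "the" are counted as the same word.
    return re.findall(r'\b\w+\b', text.lower())


with open('shakes.txt', 'r', encoding='utf-8') as f:
    # Read only the first 1000 lines; islice stops without scanning the rest of the file.
    corpus_text = ''.join(islice(f, 1000))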

all_words = tokenize_words(corpus_text)

word_counts = Counter(all_words)

most_common = word_counts.most_common(10)

print("Top 10 most common words in Shakespeare's book (first 1000 lines):")
for word, count in most_common:
    print(f"{word}: {count}")

Top 10 most common words in Shakespeare's book (first 1000 lines):
and: 198
the: 181
to: 148
of: 145
my: 115
in: 113
thou: 104
thy: 102
s: 100
that: 97
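
The `s` entry in the list above is most likely an artifact of the `\b\w+\b` pattern, which splits possessives and contractions at the apostrophe. The sketch below compares that pattern with an apostrophe-aware one. The function names and the sample line are hypothetical; the sample is made up for illustration and is not taken from `shakes.txt`.

```python
import re
from collections import Counter

# Made-up sample line, used only to illustrate the effect.
sample = "The king's crown is the King's. 'Tis o'er, and thou know'st it."


def tokenize_naive(text):
    # Same pattern as tokenize_words above: apostrophes act as word boundaries.
    return re.findall(r"\b\w+\b", text.lower())


def tokenize_with_apostrophes(text):
    # Keep straight or curly apostrophes that sit between word characters.
    return re.findall(r"\w+(?:['’]\w+)*", text.lower())


naive_counts = Counter(tokenize_naive(sample))
aware_counts = Counter(tokenize_with_apostrophes(sample))

print(naive_counts.most_common(3))
print(aware_counts.most_common(3))
```

With the second pattern, `king's` and `o'er` are counted as whole words and the stray `s` disappears. Whether possessives should instead be folded into their base word (`king`) is a separate design choice that depends on the analysis.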
