# Text Tokenization Exercise

This exercise explores the challenges of splitting text into sentences and words when dealing with complex real-world text containing dates, amounts, URLs, emails, acronyms, and multi-word expressions.

## The Challenge

Given a text variable, split it into:
1. **Sentences** - logical units of meaning ending with terminal punctuation
2. **Words (tokens)** - individual meaningful units

In [49]:
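# We define a function that splits a text variable into words (tokens),
# i.e. individual meaningful units. Sentence splitting is handled by the
# NLTK and spaCy versions further down.

import re


def split_text(text):
    # A token is a run of characters that are neither whitespace nor commas.
    # A comma directly followed by more such characters stays inside the token,
    # so "$1,250.50" is kept whole while "Smith," loses its trailing comma.
    words = re.findall(r"[^\s,]+(?:,[^\s,]+)*", text)
    return words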


In [50]:
# Sample text with challenging elements
text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

print("Original text:")
print(text)

# Split the text into words
words = split_text(text)
print("\nWords:")
print(words)




Original text:
Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.

Words:
['Dr.', 'John', 'Smith', 'Ph.D.', 'earned', '$1,250.50', 'on', 'Jan.', '15', '2024', 'for', 'his', 'work', 'at', 'A.I.', 'Corp.', 'You', 'can', 'reach', 'him', 'at', 'j.smith@ai-corp.co.uk', 'or', 'visit', 'https://www.ai-corp.co.uk/team/dr-smith', 'for', 'more', 'info.', 'The', 'U.S.A.-based', 'company', 'reported', 'a', '23.5%', 'increase', 'in', 'Q3', 'revenue', 'totaling', '€2.5M.']
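
To see what the pattern in `split_text` does with commas and periods, the sketch below runs the same regular expression on a few short strings. The helper name `split_words` and the sample strings are assumptions made for illustration; they are not part of the exercise text.

```python
import re

# Same pattern as split_text: runs of non-space, non-comma characters,
# optionally joined by commas that are directly followed by more such characters.
TOKEN_PATTERN = re.compile(r"[^\s,]+(?:,[^\s,]+)*")


def split_words(text):
    """Return the tokens matched by the pattern (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


# A comma inside a number is kept; a comma followed by a space is dropped.
print(split_words("earned $1,250.50 on Jan. 15, 2024,"))
# ['earned', '$1,250.50', 'on', 'Jan.', '15', '2024']

# Limitation: a sentence-final period stays attached to the last word.
print(split_words("for more info. The company"))
# ['for', 'more', 'info.', 'The', 'company']

# Limitation: an unspaced list is kept as a single token.
print(split_words("red,green,blue"))
# ['red,green,blue']
```

The last two cases suggest that the pattern is tuned to this sample text rather than being a general tokenizer.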


In [51]:
# Unix tools are not useful for this exercise.
# Tools such as grep, sed, and awk work purely by pattern matching and have no linguistic knowledge.
# They can extract words with a regex, but they cannot reliably split sentences: a plain pattern treats
# every period the same way, so abbreviations like "Dr. Smith" or "Ph.D." are broken into separate segments.
# NLTK's Punkt algorithm and spaCy's trained pipelines, by contrast, learn from real text corpora to tell
# abbreviation periods apart from sentence-ending periods. We therefore set Unix tools aside and rely on
# dedicated NLP libraries for sentence segmentation.
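
The limitation described above can be reproduced in Python with the kind of context-free rule a `sed` or `awk` one-liner would apply: break after every period, question mark, or exclamation mark that is followed by whitespace. This is a minimal sketch; the helper name `naive_sentence_split` and the sample string are assumptions made for illustration.

```python
import re


def naive_sentence_split(text):
    """Split after ., ! or ? followed by whitespace, with no notion of abbreviations."""
    return re.split(r"(?<=[.!?])\s+", text)


# The period in "Dr." is treated exactly like a sentence-ending period.
print(naive_sentence_split("Dr. Smith arrived. He left."))
# ['Dr.', 'Smith arrived.', 'He left.']
```

The rule could be patched with a hand-written list of exceptions, but that list has to be maintained for every new abbreviation.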

In [52]:
# Now we try with NLTK
import nltk
from nltk.tokenize import sent_tokenize, RegexpTokenizer

# Download the Punkt tokenizer models used by sent_tokenize
nltk.download("punkt_tab")


def nltk_split_text(text):
    # Split the text into sentences with NLTK's pretrained Punkt tokenizer
    sentences = sent_tokenize(text)
    # Build a word tokenizer from the same regex pattern used in split_text;
    # it keeps dots in abbreviations and commas inside numbers
    word_tokenizer = RegexpTokenizer(r"[^\s,]+(?:,[^\s,]+)*")
    # Split the whole text into words
    words = word_tokenizer.tokenize(text)
    return sentences, words

# Run the NLTK version on the sample text
sentences, words = nltk_split_text(text)
print("\nSentences (NLTK):")
print(sentences)
print("\nWords (NLTK):")
print(words)


Sentences (NLTK):
['Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I.', 'Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.', 'The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.']

Words (NLTK):
['Dr.', 'John', 'Smith', 'Ph.D.', 'earned', '$1,250.50', 'on', 'Jan.', '15', '2024', 'for', 'his', 'work', 'at', 'A.I.', 'Corp.', 'You', 'can', 'reach', 'him', 'at', 'j.smith@ai-corp.co.uk', 'or', 'visit', 'https://www.ai-corp.co.uk/team/dr-smith', 'for', 'more', 'info.', 'The', 'U.S.A.-based', 'company', 'reported', 'a', '23.5%', 'increase', 'in', 'Q3', 'revenue', 'totaling', '€2.5M.']


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\carid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [53]:
# Now we try with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
def spacy_split_text(text):
    # We process the text with spaCy's neural pipeline
    doc = nlp(text)
    # We extract sentences using dependency-based segmentation
    sentences = [sent.text for sent in doc.sents]
    
    # spaCy emits punctuation as separate tokens, so we drop standalone commas
    words = [token.text for token in doc if token.text != ',']
    # spaCy also splits currency symbols from the amount, so we merge
    # "$" or "€" with the token that follows it
    # We create a new list to hold the merged words
    filtered_words = []
    # We use a variable to skip the next token if we have merged it
    skip_next = False
    for i, token in enumerate(words):
        # If we have to skip the next token, we do it
        if skip_next:
            skip_next = False
            continue
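        # If the token is $ or € and another token follows, we merge the two
        # (the parentheses matter: "and" binds tighter than "or", so without them
        # a trailing "$" would skip the bounds check and raise an IndexError)
        if (token == "$" or token == "€") and i + 1 < len(words):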
            filtered_words.append(token + words[i + 1])
            skip_next = True
        # Otherwise, we just add the token
        else:
            filtered_words.append(token)
    words = filtered_words

    
    return sentences, words

# Run the spaCy version on the sample text
sentences, words = spacy_split_text(text)
print("\nSentences (SpaCy):")
print(sentences)
print("\nWords (SpaCy):")
print(words)


Sentences (SpaCy):
['Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp.', 'You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.', 'The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.']

Words (SpaCy):
['Dr.', 'John', 'Smith', 'Ph.D.', 'earned', '$1,250.50', 'on', 'Jan.', '15', '2024', 'for', 'his', 'work', 'at', 'A.I.', 'Corp.', 'You', 'can', 'reach', 'him', 'at', 'j.smith@ai-corp.co.uk', 'or', 'visit', 'https://www.ai-corp.co.uk/team/dr-smith', 'for', 'more', 'info', '.', 'The', 'U.S.A.-based', 'company', 'reported', 'a', '23.5', '%', 'increase', 'in', 'Q3', 'revenue', 'totaling', '€2.5M.']
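
The spaCy word list above still separates `23.5` from `%`, unlike the regex-based tokenizers. A merge pass over a plain list of token strings could handle both currency symbols and the percent sign, as in the sketch below. The names `merge_symbols` and `CURRENCY_SYMBOLS` are hypothetical, and attaching `%` to the preceding token is an assumption about the desired output.

```python
CURRENCY_SYMBOLS = {"$", "€"}


def merge_symbols(tokens):
    """Attach currency symbols to the following token and '%' to the preceding one."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in CURRENCY_SYMBOLS and i + 1 < len(tokens):
            # Currency symbol with a token after it: glue the two together.
            merged.append(token + tokens[i + 1])
            i += 2
        elif token == "%" and merged:
            # Percent sign: glue it onto the token already collected.
            merged[-1] += "%"
            i += 1
        else:
            merged.append(token)
            i += 1
    return merged


print(merge_symbols(["earned", "$", "1,250.50", "a", "23.5", "%", "increase"]))
# ['earned', '$1,250.50', 'a', '23.5%', 'increase']

# A currency symbol at the very end is kept as-is instead of raising an error.
print(merge_symbols(["price", "in", "$"]))
# ['price', 'in', '$']
```

Because the function takes plain strings, it can be applied to `[token.text for token in doc]` without depending on spaCy itself.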


# Corpus Tokenization Exercise

This exercise explores the challenges of splitting a large corpus into words and finding the most common ones.

## The Challenge

Given the file `shakes.txt` in the book folder, find the words that occur most often in Shakespeare's book.
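
One possible starting point is sketched below, under stated assumptions: the path `book/shakes.txt`, the token pattern (lowercase letters and apostrophes), and the helper names `most_common_words` and `most_common_in_file` are all illustrative choices, not part of the exercise statement.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed token definition: runs of lowercase letters and apostrophes.
WORD_PATTERN = re.compile(r"[a-z']+")


def most_common_words(text, n=10):
    """Return the n most frequent lowercase word tokens with their counts."""
    words = WORD_PATTERN.findall(text.lower())
    return Counter(words).most_common(n)


def most_common_in_file(path, n=10):
    """Read a UTF-8 text file and count its most frequent words."""
    text = Path(path).read_text(encoding="utf-8")
    return most_common_words(text, n)


# Small in-memory demo; for the exercise, call
# most_common_in_file("book/shakes.txt") instead (the path is an assumption).
print(most_common_words("To be, or not to be, that is the question", n=2))
# [('to', 2), ('be', 2)]
```

On a full corpus the top of the list will most likely be function words such as articles and prepositions, so filtering against a stop-word list may be a useful next step.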