# Text Tokenization Exercise

This exercise explores the challenges of splitting text into sentences and words when dealing with complex real-world text containing dates, amounts, URLs, emails, acronyms, and multi-word expressions.

## The Challenge

Given a text variable, split it into:
1. **Sentences** - logical units of meaning ending with terminal punctuation
2. **Words (tokens)** - individual meaningful units

In [54]:
# We define a function that splits a text variable into words (tokens),
# i.e. individual meaningful units. Sentence splitting is done further on
# with NLTK and spaCy.

import re

def split_text(text):
    # Match runs of characters that are neither whitespace nor commas.
    # A comma is kept only when it sits directly between two such runs,
    # as in "1,250.50", so list commas followed by a space are dropped.
    words = re.findall(r"[^\s,]+(?:,[^\s,]+)*", text)
    return words


In [55]:
# Sample text with challenging elements
text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

print("Original text:")
print(text)

# Split the text into words
words = split_text(text)
print("\nWords:")
print(words)




Original text:
Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.

Words:
['Dr.', 'John', 'Smith', 'Ph.D.', 'earned', '$1,250.50', 'on', 'Jan.', '15', '2024', 'for', 'his', 'work', 'at', 'A.I.', 'Corp.', 'You', 'can', 'reach', 'him', 'at', 'j.smith@ai-corp.co.uk', 'or', 'visit', 'https://www.ai-corp.co.uk/team/dr-smith', 'for', 'more', 'info.', 'The', 'U.S.A.-based', 'company', 'reported', 'a', '23.5%', 'increase', 'in', 'Q3', 'revenue', 'totaling', '€2.5M.']
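
To make the behaviour of the word pattern easier to see, here is a minimal sketch that applies the same regular expression to a few short fragments. The name `TOKEN_PATTERN` and the fragments themselves are illustrative assumptions, not part of the exercise.

```python
import re

# Same pattern as in split_text; TOKEN_PATTERN is a name chosen for this sketch
TOKEN_PATTERN = re.compile(r"[^\s,]+(?:,[^\s,]+)*")

# Commas inside a number are kept, commas followed by a space are dropped
amounts = TOKEN_PATTERN.findall("$1,250.50 on Jan. 15, 2024,")
print(amounts)  # ['$1,250.50', 'on', 'Jan.', '15', '2024']

# A sentence-final period stays glued to the last word
trailing = TOKEN_PATTERN.findall("for more info. The")
print(trailing)  # ['for', 'more', 'info.', 'The']

# Limitation: a list comma without a following space joins two words
joined = TOKEN_PATTERN.findall("red,green, blue")
print(joined)  # ['red,green', 'blue']
```

The second and third cases show two limits of the pattern: it cannot tell an abbreviation period from a sentence-final one, and it relies on a space after every list comma.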


In [56]:
# Unix tools are not useful for this exercise.
# Tools like grep, sed, and awk apply fixed patterns and have no model of language, so they are a poor fit for tokenization.
# We can extract words with regex patterns, but these tools cannot reliably split sentences: a simple rule treats every
# period the same way and wrongly breaks after abbreviations such as "Dr." in "Dr. Smith" or "Ph.D.".
# NLTK's Punkt algorithm and spaCy's statistical models, by contrast, are trained on real text to distinguish
# abbreviation periods from sentence-ending periods. Unix tools have no such contextual awareness or learning capability.
# We therefore exclude them: sentence segmentation needs linguistic knowledge that dedicated NLP libraries provide.
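
As a rough illustration of the point above, the following sketch mimics in Python what a one-line pattern rule would do: split after every `.`, `!` or `?` that is followed by whitespace. The function name `naive_sentence_split` and the sample sentence are assumptions made for this example.

```python
import re

def naive_sentence_split(text):
    # Split at whitespace that follows '.', '!' or '?'.
    # Every period is treated the same way, with no notion of abbreviations.
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Smith arrived. He left."
naive = naive_sentence_split(sample)
print(naive)  # ['Dr.', 'Smith arrived.', 'He left.']
```

The title "Dr." ends up as its own "sentence", which is the kind of error a pattern-only approach cannot avoid without a hand-maintained abbreviation list.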

In [57]:
# Now we try with NLTK
import nltk
# We download the Punkt tokenizer models (needed by sent_tokenize)
nltk.download("punkt_tab")
# We import the necessary functions
from nltk.tokenize import sent_tokenize, RegexpTokenizer

def nltk_split_text(text):
    # We use NLTK's built-in Punkt-based function to split the text into sentences
    sentences = sent_tokenize(text)
    # We create a custom word tokenizer using the same regex pattern as before
    # This pattern keeps dots in abbreviations and commas inside numbers
    word_tokenizer = RegexpTokenizer(r"[^\s,]+(?:,[^\s,]+)*")
    # We tokenize the full text into words
    words = word_tokenizer.tokenize(text)
    return sentences, words

# We run the NLTK-based splitter on the sample text
sentences, words = nltk_split_text(text)
print("\nSentences (NLTK):")
print(sentences)
print("\nWords (NLTK):")
print(words)


Sentences (NLTK):
['Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I.', 'Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.', 'The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.']

Words (NLTK):
['Dr.', 'John', 'Smith', 'Ph.D.', 'earned', '$1,250.50', 'on', 'Jan.', '15', '2024', 'for', 'his', 'work', 'at', 'A.I.', 'Corp.', 'You', 'can', 'reach', 'him', 'at', 'j.smith@ai-corp.co.uk', 'or', 'visit', 'https://www.ai-corp.co.uk/team/dr-smith', 'for', 'more', 'info.', 'The', 'U.S.A.-based', 'company', 'reported', 'a', '23.5%', 'increase', 'in', 'Q3', 'revenue', 'totaling', '€2.5M.']


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\carid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [58]:
# Now we try with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
def spacy_split_text(text):
    # We process the text with spaCy's neural pipeline
    doc = nlp(text)
    # We extract sentences using dependency-based segmentation
    sentences = [sent.text for sent in doc.sents]
    
    # We filter out commas and standalone periods that are sentence-final punctuation
    # We keep periods that are part of abbreviations (like Dr. or Ph.D.)
    words = []
    for token in doc:
        # We skip commas completely
        if token.text == ',':
            continue
        # We skip periods that are standalone punctuation (sentence endings)
        # spaCy keeps abbreviation periods attached to their token (e.g. "Dr."),
        # so a token that is exactly "." is treated here as sentence-final punctuation
        elif token.text == '.' and token.is_punct:
            continue
        # Otherwise we keep the token
        else:
            words.append(token.text)

    # spaCy splits currency symbols and % from the number they belong to,
    # so we merge $ or € with the following token and % with the preceding one
    # We create a new list to hold the merged words
    filtered_words = []
    # We use a variable to skip the next token if we have merged it
    skip_next = False
    for i, token in enumerate(words):
        # If we have to skip the next token, we do it
        if skip_next:
            skip_next = False
            continue
        # If the token is $ or € , we merge it with the next token
        if (token == "$" or token == "€") and i + 1 < len(words):
            filtered_words.append(token + words[i + 1])
            skip_next = True
        
        # If the next token is %, we merge it with the current number
        elif i + 1 < len(words) and words[i + 1] == "%":
            filtered_words.append(token + "%")
            skip_next = True
            
        # Otherwise, we just add the token
        else:
            filtered_words.append(token)
    words = filtered_words

    
    return sentences, words

# We run the spaCy-based splitter on the sample text
sentences, words = spacy_split_text(text)
print("\nSentences (SpaCy):")
print(sentences)
print("\nWords (SpaCy):")
print(words)


Sentences (SpaCy):
['Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp.', 'You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.', 'The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.']

Words (SpaCy):
['Dr.', 'John', 'Smith', 'Ph.D.', 'earned', '$1,250.50', 'on', 'Jan.', '15', '2024', 'for', 'his', 'work', 'at', 'A.I.', 'Corp.', 'You', 'can', 'reach', 'him', 'at', 'j.smith@ai-corp.co.uk', 'or', 'visit', 'https://www.ai-corp.co.uk/team/dr-smith', 'for', 'more', 'info', 'The', 'U.S.A.-based', 'company', 'reported', 'a', '23.5%', 'increase', 'in', 'Q3', 'revenue', 'totaling', '€2.5M.']
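
The merging step inside `spacy_split_text` can be looked at in isolation. Below is a minimal sketch that works on a plain list of strings, so it can be tried without loading a spaCy model. The helper name `merge_symbols` and its `prefixes`/`suffixes` parameters are hypothetical and only serve this illustration.

```python
def merge_symbols(tokens, prefixes=("$", "€"), suffixes=("%",)):
    """Attach currency symbols to the next token and % to the previous one."""
    merged = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in prefixes and nxt is not None:
            # Currency symbol: glue it to the token that follows
            merged.append(tok + nxt)
            i += 2
        elif nxt in suffixes:
            # Percent sign: glue it to the current token
            merged.append(tok + nxt)
            i += 2
        else:
            merged.append(tok)
            i += 1
    return merged

pieces = ["earned", "$", "1,250.50", "a", "23.5", "%", "increase"]
result = merge_symbols(pieces)
print(result)  # ['earned', '$1,250.50', 'a', '23.5%', 'increase']
```

An index-based loop replaces the `skip_next` flag used above; both approaches consume two tokens whenever a merge happens.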


# Corpus Tokenization Exercise

This exercise explores the challenges of tokenizing a large corpus and finding its most common words.

## The Challenge

Given the file `shakes.txt` in the book folder, find the most common words in Shakespeare's book.

In [62]:
# Now we need to find the most common words in the text "shakes.txt", first with NLTK
from collections import Counter

# We read the text file
with open("TXT_FILES/shakes.txt", "r", encoding="utf-8") as f:
    shakes_text = f.read()

# Now we create a function to get the most common words
def most_common_words_nltk(text, n=10):
    # We use the nltk_split_text function to get the words
    _, words = nltk_split_text(text)
    # We convert all words to lowercase for uniformity
    words = [word.lower() for word in words]
    # We use Counter to count the occurrences of each word
    word_counts = Counter(words)
    # We return the n most common words
    return word_counts.most_common(n)

# We get the 10 most common words in the shakes_text
common_words_nltk = most_common_words_nltk(shakes_text, n=10)
print("\n10 Most common words in 'shakes.txt' using NLTK:")
print(common_words_nltk)



10 Most common words in 'shakes.txt' using NLTK:
[('the', 27729), ('and', 26746), ('i', 19856), ('to', 18843), ('of', 18163), ('a', 14438), ('my', 12457), ('you', 12175), ('that', 10840), ('in', 10830)]
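
Because the regex tokenizer leaves trailing punctuation attached (for example `'info.'`), a form such as `king.` is counted separately from `king`. One possible refinement, sketched here under the assumption that stripping punctuation at token edges is acceptable for this corpus, is to normalize tokens before counting. The function name `normalized_counts` and the token list are illustrative only.

```python
from collections import Counter
import string

def normalized_counts(tokens):
    # Lowercase each token and strip ASCII punctuation from both ends
    cleaned = (tok.lower().strip(string.punctuation) for tok in tokens)
    # Drop tokens that were pure punctuation and are now empty
    return Counter(tok for tok in cleaned if tok)

tokens = ["The", "king.", "the", "King,", "--", "king"]
top = normalized_counts(tokens).most_common(2)
print(top)  # [('king', 3), ('the', 2)]
```

Note that this also strips leading apostrophes and currency symbols (`'tis` becomes `tis`, `$5` becomes `5`), so whether it is appropriate depends on what should count as a word.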


In [63]:
# Now we do it with spaCy
def most_common_words_spacy(text, n=10):
    # We use the spacy_split_text function to get the words
    _, words = spacy_split_text(text)
    # We convert all words to lowercase for uniformity
    words = [word.lower() for word in words]
    # We use Counter to count the occurrences of each word
    word_counts = Counter(words)
    # We return the n most common words
    return word_counts.most_common(n)

# We get the 10 most common words in the shakes_text
common_words_spacy = most_common_words_spacy(shakes_text, n=10)
print("\n10 Most common words in 'shakes.txt' using SpaCy:")
print(common_words_spacy)

ValueError: [E088] Text of length 5465395 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

In [None]:
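# By default spaCy rejects inputs longer than 1,000,000 characters (nlp.max_length),
# which is why the previous cell failed on the 5,465,395-character text.
# As the error message warns, the parser and NER need a lot of memory on long inputs.

# We raise spaCy's max_length so the whole text can be processed in one call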
nlp = spacy.load("en_core_web_sm")
nlp.max_length = 6000000  # We set it to 6 million characters (more than our text)

def most_common_words_spacy(text, n=10):
    # We use the spacy_split_text function to get the words
    _, words = spacy_split_text(text)
    # We convert all words to lowercase for uniformity
    words = [word.lower() for word in words]
    # We use Counter to count the occurrences of each word
    word_counts = Counter(words)
    # We return the n most common words
    return word_counts.most_common(n)


# We get the 10 most common words in the shakes_text
common_words_spacy = most_common_words_spacy(shakes_text, n=10)
print("\n10 Most common words in 'shakes.txt' using SpaCy:")
print(common_words_spacy)
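
An alternative to raising `max_length` is to cut the text into smaller pieces and process them one by one, which keeps memory use bounded. The sketch below only covers the chunking step and uses nothing but the standard library. The helper `chunk_text` and its `max_chars` parameter are hypothetical names introduced for this example.

```python
def chunk_text(text, max_chars=100_000):
    """Split text into chunks of at most max_chars, breaking on line boundaries.

    A single line longer than max_chars is kept whole, so it can exceed the limit.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

sample = "line one\nline two\nline three\n"
chunks = chunk_text(sample, max_chars=18)
print(chunks)  # ['line one\nline two\n', 'line three\n']
```

The resulting chunks could then be passed to spaCy, for instance with `nlp.pipe(chunks)`, and the word counts added up across chunks. Since the cut points are line breaks rather than sentence boundaries, a sentence that spans two chunks may be segmented differently than in a single pass; for word counting this should matter little.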
