### Introduction

This notebook outlines several methods for tokenizing text into words (and sentences), including:

* whitespace
* nltk (Penn Treebank tokenizer)
* nltk (Twitter-aware)
* spaCy
* custom regular expressions

Along the way, it highlights the differences between them.

***

### Setup: Importing Libraries

First, we import the necessary Python libraries.
* `nltk`: The Natural Language Toolkit, a popular library for NLP tasks.
* `re`: Python's built-in module for regular expressions.
* `json`: For parsing the JSON file containing the tweet data.
* `spacy`: An NLP library built around pretrained processing pipelines; here we use only its tokenizer.
* `Counter`: A dictionary subclass from the `collections` module for counting hashable objects, which is perfect for tallying token frequencies.

In [1]:
# Import necessary libraries
import nltk, re, json
import spacy
from collections import Counter

***

### Downloading NLTK Models

To perform sentence and word tokenization, NLTK relies on pre-trained models. Here, we download the `punkt` tokenizer models, which are used by NLTK's functions for splitting text into sentences (`sent_tokenize`) and words (`word_tokenize`).

In [2]:
# If you haven't downloaded the sentence segmentation model before, do so here
# This command downloads the 'punkt' resource, which includes pre-trained models
# for sentence tokenization for multiple languages.
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mehmetcanyavuz/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

***

### Downloading spaCy Models

Similarly, spaCy uses statistical models to process text. We download `en_core_web_sm`, which is a small, efficient English language model that includes components for tokenization, part-of-speech tagging, named entity recognition, and more. The `!` allows us to run this command directly in the shell from the notebook.

In [5]:
# This command executes a shell command to download the small English model for spaCy.
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 3.5 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


***

### Loading and Configuring the spaCy Model

After downloading the model, we load it into our script using `spacy.load()`. For this notebook's purpose, we only need spaCy's tokenizer. To make the process more efficient, we disable other components of the NLP pipeline like the part-of-speech `tagger`, `ner` (Named Entity Recognizer), and `parser`.

In [6]:
# Load the small English spaCy model, disabling unnecessary components for efficiency.
# `disable` expects a list of separate component names, not one comma-joined string.
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'ner', 'parser'])

# Explicitly remove the pipeline components to ensure they are not used.
nlp.remove_pipe('tagger')
nlp.remove_pipe('ner')
nlp.remove_pipe('parser');

***

### Data Loading: Reading Tweets from JSON

This section defines a helper function to read our data. The function `read_tweets_from_json` opens a specified JSON file, parses its content, and extracts the text from each tweet object, returning a list of tweet strings.

In [7]:
# Define a function to read tweets from a JSON file.
def read_tweets_from_json(filename):
    # Initialize an empty list to store the tweet texts.
    tweets=[]
    # Open the specified file with UTF-8 encoding.
    with open(filename, encoding="utf-8") as file:
        # Load the entire JSON content from the file.
        data=json.load(file)
        # Iterate through each tweet object in the loaded data.
        for tweet in data:
            # Append the value of the "text" key to our list.
            tweets.append(tweet["text"])
    # Return the list of all tweet texts.
    return tweets
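
The function assumes the file holds a JSON array of objects, each with a `"text"` key. The sketch below illustrates that input shape with a small made-up file written to a temporary directory; the sample tweets and the file name `sample_tweets.json` are hypothetical, and the function is repeated in condensed form only so the block runs on its own.

```python
import json
import os
import tempfile


def read_tweets_from_json(filename):
    # Condensed version of the helper above: collect the "text" field of each object.
    with open(filename, encoding="utf-8") as file:
        return [tweet["text"] for tweet in json.load(file)]


# Hypothetical sample data in the shape the helper expects.
sample = [
    {"text": "First example tweet."},
    {"text": "Second example tweet!"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_tweets.json")
    with open(sample_path, "w", encoding="utf-8") as file:
        json.dump(sample, file)
    sample_tweets = read_tweets_from_json(sample_path)

print(sample_tweets)
# ['First example tweet.', 'Second example tweet!']
```

An object without a `"text"` key would raise a `KeyError` here, so a file with a different layout needs a different accessor.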

***

Now, let's specify the path to our data file.

In [8]:
# Store the path to the tweet data file in a variable.
filename="../data/trump_tweets.json"

***

Using the function we just defined, we load the tweet texts from the JSON file into the `tweets` list.

In [9]:
# Call the function to read the tweets and store them in the 'tweets' variable.
tweets=read_tweets_from_json(filename)

***

### Method 1: Whitespace Tokenization

This is the simplest tokenization method. We iterate through each tweet and use Python's built-in `split()` method, which splits a string by any whitespace (spaces, tabs, newlines) by default. The resulting lists of tokens are stored in `whitespace_tokens`.

In [10]:
# Initialize an empty list to hold the tokenized tweets.
whitespace_tokens=[]
# Loop through each tweet in the 'tweets' list.
for tweet in tweets:
    # Split the tweet string by whitespace and append the resulting list of tokens.
    whitespace_tokens.append(tweet.split())
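
To see what this does to a single string, here is a small sketch on a made-up sentence (not taken from the dataset). It also contrasts `split()` with `split(" ")`, which behaves differently on runs of spaces.

```python
# Hypothetical sample text with a double space, a tab, and a newline.
sample = "Not easy!  Punctuation stays attached,\tlike this.\n#hashtag"

# With no argument, split() treats any run of whitespace as one separator.
sample_tokens = sample.split()
print(sample_tokens)
# ['Not', 'easy!', 'Punctuation', 'stays', 'attached,', 'like', 'this.', '#hashtag']

# With an explicit separator, consecutive separators produce empty strings.
explicit = "a  b".split(" ")
print(explicit)
# ['a', '', 'b']
```

Note that `easy!`, `attached,` and `this.` keep their punctuation, which is the main weakness of this method.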

***

### Downloading Additional NLTK Data

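The `punkt_tab` resource contains the Punkt tokenizer parameters stored as plain tab-separated text files rather than as pickled Python objects; the name refers to the file format, not to tab characters in the input. Recent NLTK releases load this resource in place of `punkt`, so `sent_tokenize` and `word_tokenize` may fail with a lookup error until it is downloaded.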

In [14]:
# Download 'punkt_tab', the plain-text (tab-separated) version of the Punkt tokenizer data.
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/mehmetcanyavuz/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

***

### Method 2: NLTK's Penn Treebank Tokenizer

Here, we use `nltk.word_tokenize()`. This tokenizer is more sophisticated than simple whitespace splitting. It follows Penn Treebank conventions: punctuation is separated from words, and contractions are split into two tokens (e.g., `"don't"` becomes `['do', "n't"]`).

In [15]:
# Initialize an empty list to hold the tokenized tweets.
nltk_tokens=[]
# Loop through each tweet in the 'tweets' list.
for tweet in tweets:
    # Use NLTK's standard word tokenizer and append the result.
    nltk_tokens.append(nltk.word_tokenize(tweet, language="english"))

***

### Method 3: NLTK's Casual (Twitter-aware) Tokenizer

The `nltk.casual_tokenize()` function is specifically designed for informal text like tweets. It's better at handling social media conventions like hashtags (`#`), mentions (`@`), and emoticons, often keeping them as single, intact tokens.

In [16]:
# Initialize an empty list to hold the tokenized tweets.
nltk_casual_tokens=[]
# Loop through each tweet in the 'tweets' list.
for tweet in tweets:
    # Use NLTK's casual tokenizer designed for social media text.
    nltk_casual_tokens.append(nltk.casual_tokenize(tweet))

***

### Method 4: spaCy's Tokenizer

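spaCy's tokenizer is the first step of a larger processing pipeline. It is rule-based and language-specific: after splitting on whitespace, it applies the language's prefix, suffix, and infix rules together with a table of special cases such as contractions. We process each tweet with our loaded `nlp` object and extract the text of each token.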

In [17]:
# Initialize an empty list to hold the tokenized tweets.
spacy_tokens=[]
# Loop through each tweet in the 'tweets' list.
for tweet in tweets:
    # Process the tweet with the spaCy nlp object and create a list of the token texts.
    spacy_tokens.append([token.text for token in nlp(tweet)])

***

### Method 5: Custom Extensible Regex Tokenizer

For maximum control, we can define our own tokenizer using regular expressions. This code defines a sequence of regex patterns to capture different types of tokens in a specific order of priority (e.g., mentions first, then hashtags, then words with apostrophes, etc.). The `re.compile()` function creates a reusable regex object for efficiency.

In [18]:
# This regular expression pattern is adapted from Christopher Potts' sentiment tokenizing script.
# It defines a tuple of regex patterns. The order is crucial as they are matched sequentially.
regexes=(
    # Pattern 1: Keep usernames/mentions together (e.g., @user_name).
    r"(?:@[\w_]+)",

    # Pattern 2: Keep hashtags together (e.g., #topic).
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

    # Pattern 3: Keep words with internal apostrophes, hyphens, or underscores together (e.g., "word-word").
    r"(?:[a-z][a-z’'\-_]+[a-z])",

    # Pattern 4: Keep all other sequences of word characters (letters, numbers, underscore) together.
    r"(?:[\w_]+)",

    # Pattern 5: Match any other non-whitespace character as a token (e.g., punctuation).
    r"(?:\S)"
)

# Join all the individual regex patterns with the '|' (OR) operator to create one large regex.
big_regex="|".join(regexes)

# Compile the combined regex for faster execution.
# re.VERBOSE: Allows comments and whitespace in the pattern (which is why '#' is escaped in the hashtag pattern).
# re.I: Makes the matching case-insensitive.
# re.UNICODE: Makes \w match Unicode word characters (already the default for str patterns in Python 3).
my_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

# Define a function that takes text and returns all non-overlapping matches found by our regex.
def my_extensible_tokenize(text):
    return my_extensible_tokenizer.findall(text)
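
A quick check on a made-up sentence shows how the priority order plays out. The patterns are repeated here only so the block runs on its own; the sample text and the names `demo_tokenizer` and `reversed_tokenizer` are hypothetical. The second tokenizer reverses the pattern order to illustrate why the catch-all `\S` has to come last.

```python
import re

# Same five patterns as above, in the same order.
regexes = (
    r"(?:@[\w_]+)",
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",
    r"(?:[a-z][a-z’'\-_]+[a-z])",
    r"(?:[\w_]+)",
    r"(?:\S)",
)
flags = re.VERBOSE | re.I | re.UNICODE
demo_tokenizer = re.compile("|".join(regexes), flags)

sample = "@user loves #nlp, doesn't she?"
demo_tokens = demo_tokenizer.findall(sample)
print(demo_tokens)
# ['@user', 'loves', '#nlp', ',', "doesn't", 'she', '?']

# Alternatives are tried left to right, so putting the catch-all first
# makes every non-whitespace character its own token.
reversed_tokenizer = re.compile("|".join(reversed(regexes)), flags)
reversed_tokens = reversed_tokenizer.findall("@user")
print(reversed_tokens)
# ['@', 'u', 's', 'e', 'r']
```

Because every group is non-capturing (`(?:...)`), `findall` returns the full text of each match rather than tuples of groups.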

***

Now we apply our custom tokenizer to the tweets, just as we did with the other methods.

In [19]:
# Initialize an empty list to hold the tokenized tweets.
extensible_tokens=[]
# Loop through each tweet in the 'tweets' list.
for tweet in tweets:
    # Use our custom regex-based tokenizer function and append the result.
    extensible_tokens.append(my_extensible_tokenize(tweet))

***

## Q1: Write a function to print out the first 5 tokenized tweets in each of the five tokenizers above. Examine those tweets; how would you characterize the differences?

***

To answer this, we'll loop through the first five tweets. The `zip()` function is used to iterate over the results from all five tokenization methods simultaneously for each tweet. For each tweet, we print the output from each tokenizer, making it easy to compare them side-by-side.

**Observations on Differences:**

* **Whitespace:** The most basic. It fails to separate punctuation from words (e.g., `States.` and `Wall,` each remain a single token), so the same word can show up as several distinct types. It's generally not ideal for NLP tasks.
* **NLTK (Standard):** Better than whitespace. It separates most punctuation (e.g., `.` `!` `,`). It splits contractions written with a straight apostrophe (`can't` becomes `ca` and `n't`), but it treats the curly apostrophe `’` as a separate token, so `can’t` becomes `can` `’` `t`. It also splits mentions like `@newtgingrich` into `@` and `newtgingrich`.
* **NLTK (Casual):** Designed for tweets. It keeps mentions (`@newtgingrich`) and hashtags together, and it converts HTML entities (`&amp;` becomes `&`). In the first tweet it still splits `can’t` around the curly apostrophe.
* **spaCy:** Handles contractions, including ones written with a curly apostrophe (`can’t` becomes `ca` `n’t`). It keeps mentions together but sometimes attaches preceding punctuation to them (e.g., `.@newtgingrich`).
* **Extensible (Regex):** Very effective for this specific text. It keeps mentions and other important structures intact as single tokens because we explicitly defined rules for them. It separates punctuation cleanly.

In [20]:
# Use zip to iterate through the first 5 tokenized tweets from all five lists at once.
# 'enumerate' provides an index 'idx' for each set of tweets.
for idx, (one, two, three, four, five) in enumerate(zip(nltk_tokens, nltk_casual_tokens, spacy_tokens, whitespace_tokens, extensible_tokens)):
    # Stop the loop after processing the first 5 tweets (indices 0 through 4).
    if idx >= 5:
        break
    # Print the output from each tokenizer, joining the token lists back into strings for readability.
    print("NLTK      :\t%s" % ' '.join(one))
    print("CASUAL    :\t%s" % ' '.join(two))
    print("SPACY     :\t%s" % ' '.join(three))
    print("WHITESPACE:\t%s" % ' '.join(four))
    print("EXTENSIBLE:\t%s" % ' '.join(five))

    # Print a newline for better separation between tweets.
    print()


NLTK      :	Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but can ’ t get through our Wall , but it takes a lot of Border Agents if there is no Wall . Not easy !
CASUAL    :	Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but can ’ t get through our Wall , but it takes a lot of Border Agents if there is no Wall . Not easy !
SPACY     :	Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States . We stopped the last two - many are still in Mexico but ca n’t get through our Wall , but it takes a lot of Border Agents if there is no Wall . Not easy !
WHITESPACE:	Mexico is doing NOTHING to stop the Caravan which is now fully formed and heading to the United States. We stopped the last two - many are still in Mexico but can’t get thro
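
Since the question asks for a function, the same loop can also be wrapped in one. The following is a minimal sketch, not the notebook's original code: the name `format_first_n` and the toy token lists are hypothetical, and the function returns the display lines instead of printing them directly so the result is easy to inspect.

```python
from itertools import islice


def format_first_n(named_tokenizations, n=5):
    """Return display lines for the first n texts under each named tokenizer."""
    names = list(named_tokenizations)
    width = max(len(name) for name in names)
    lines = []
    # zip(*...) walks all tokenizations in parallel, one text at a time;
    # islice stops after the first n texts.
    for per_text in islice(zip(*named_tokenizations.values()), n):
        for name, tokens in zip(names, per_text):
            lines.append("%s:\t%s" % (name.ljust(width), " ".join(tokens)))
        lines.append("")  # blank line between texts
    return lines


# Hypothetical toy data: two texts, each tokenized two ways.
toy = {
    "WHITESPACE": [["Not", "easy!"], ["No", "Wall."]],
    "EXTENSIBLE": [["Not", "easy", "!"], ["No", "Wall", "."]],
}
toy_lines = format_first_n(toy, n=1)
print("\n".join(toy_lines))
```

With the notebook's variables, the call would take a dictionary such as `{"NLTK": nltk_tokens, "CASUAL": nltk_casual_tokens, ...}`.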

***

## Q2: Write a function `compare(tokenization_one, tokenization_two)` that compares two tokenizations of the same text and finds the 20 most frequent tokens that don't appear in the other.

***

This function is designed to highlight the differences between two tokenization methods. It works by:
1.  Counting the frequency of every token in both tokenization results using `collections.Counter`.
2.  Iterating through the tokens of the first method and checking if they exist in the second. If not, they are added to a "missing" list.
3.  Doing the same for the second method against the first.
4.  Finally, printing the 20 most common tokens that were unique to each method.

In [21]:
# Define a function that takes two lists of tokenized sentences.
def compare(one_tokens, two_tokens):
    
    # Create a Counter object to store token frequencies for the first tokenization.
    one_counts=Counter()
    # Create a Counter object to store token frequencies for the second tokenization.
    two_counts=Counter()

    # Iterate through each tokenized sentence in the first list.
    for sentence in one_tokens:
        # Iterate through each token in the sentence.
        for token in sentence:
            # Increment the count for that token.
            one_counts[token]+=1
        
    # Iterate through each tokenized sentence in the second list.
    for sentence in two_tokens:
        # Iterate through each token in the sentence.
        for token in sentence:
            # Increment the count for that token.
            two_counts[token]+=1
        
    # Create a Counter for tokens present in the second list but missing from the first.
    missing_from_one=Counter()
    # Create a Counter for tokens present in the first list but missing from the second.
    missing_from_two=Counter()
    
    # Iterate through all unique word types found in the first tokenization.
    for word_type in one_counts:
        # If a word is not found in the vocabulary of the second tokenization...
        if word_type not in two_counts:
            # ...add it to the 'missing_from_two' counter with its frequency.
            missing_from_two[word_type]=one_counts[word_type]
        
    # Iterate through all unique word types found in the second tokenization.
    for word_type in two_counts:
        # If a word is not found in the vocabulary of the first tokenization...
        if word_type not in one_counts:
            # ...add it to the 'missing_from_one' counter with its frequency.
            missing_from_one[word_type]=two_counts[word_type]

    # Print the number of tokenized tweets in each list (despite the "Token counts" label, these are tweet counts, not token totals).
    print ("Token counts -- one: %s, two: %s" % (len(one_tokens), len(two_tokens)))
    # Print the 20 most common tokens that are in the second list but not the first.
    print ("\nNot in one:")
    print ('\n'.join("%s\t%d" % (k,v) for (k,v) in missing_from_one.most_common(20)))
    # Print the 20 most common tokens that are in the first list but not the second.
    print ("\nNot in two:")
    print ('\n'.join("%s\t%d" % (k,v) for (k,v) in missing_from_two.most_common(20)))
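
The same comparison can be written more compactly. The sketch below is an alternative formulation under the same logic, not a replacement for the function above: the name `unique_token_counts` and the toy tokenizations are hypothetical, and it returns the two rankings instead of printing them.

```python
from collections import Counter
from itertools import chain


def unique_token_counts(one_tokens, two_tokens, n=20):
    """Return the n most frequent tokens unique to each tokenization."""
    # Flatten each list of token lists and count every token.
    one_counts = Counter(chain.from_iterable(one_tokens))
    two_counts = Counter(chain.from_iterable(two_tokens))
    # Keep only the token types that the other tokenization never produced.
    only_in_one = Counter({t: c for t, c in one_counts.items() if t not in two_counts})
    only_in_two = Counter({t: c for t, c in two_counts.items() if t not in one_counts})
    return only_in_one.most_common(n), only_in_two.most_common(n)


# Hypothetical toy data: the same two texts tokenized two different ways.
toy_one = [["@user", "can't", "wait"], ["@user", "!"]]
toy_two = [["@", "user", "ca", "n't", "wait"], ["@", "user", "!"]]

only_one, only_two = unique_token_counts(toy_one, toy_two)
print(only_one)
# [('@user', 2), ("can't", 1)]
print(dict(only_two))
```

Tokens that both tokenizations produce, such as `wait` and `!`, drop out entirely, even if their frequencies differ.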

***

Now, let's use the `compare` function to see the differences between NLTK's `casual_tokenize` (tweet-aware) and `word_tokenize` (Penn Treebank style). The output shows that the casual tokenizer keeps full mentions (`@realDonaldTrump`) and hashtags (`#Trump2016`) together, while the standard one splits them (leaving `Trump2016` as its own token), rewrites double quotes as pairs of backticks or single quotes, and splits contractions (e.g., `don't` vs. `n't`).

In [22]:
# Call the compare function to analyze the differences between the casual and standard NLTK tokenizers.
compare(nltk_casual_tokens, nltk_tokens)

Token counts -- one: 36583, two: 36583

Not in one:
``	13299
''	11514
's	3541
amp	3364
n't	2503
--	2077
Trump2016	846
U.S.	665
....	542
'm	538
're	528
CelebApprentice	416
Mr.	333
MittRomney	312
've	307
'll	307
IvankaTrump	236
w/	209
'd	175
.....	171

Not in two:
"	24807
@realDonaldTrump	8661
#Trump2016	840
@BarackObama	732
don't	626
#MakeAmericaGreatAgain	560
@FoxNews	547
I'm	524
@foxandfriends	504
can't	423
@ApprenticeNBC	393
@MittRomney	314
It's	304
it's	303
#CelebApprentice	289
@CNN	285
you're	276
doesn't	266
#MAGA	239
@IvankaTrump	237


***

## Q3: Use one of the NLTK tokenizers; write code to determine how many sentences are in this dataset, and what the average number of words per sentence is.

***

To solve this, we iterate through each full tweet text. For each tweet, we first use `nltk.sent_tokenize()` to split it into sentences. Then, for each of those sentences, we use `nltk.word_tokenize()` and count the resulting tokens; punctuation marks are tokens too, so the figure is tokens per sentence rather than strictly words. We keep a running total of the number of sentences and the total number of tokens to calculate the average at the end.

In [23]:
# Initialize a float for the total token/word count.
count=0.
# Initialize an integer for the total sentence count.
num_sents=0
# Loop through each raw tweet string.
for tweet in tweets:
    # For each tweet, loop through the sentences detected by NLTK's sentence tokenizer.
    for sent in nltk.sent_tokenize(tweet):
        # Add the number of words in the current sentence to the total word count.
        count+=len(nltk.word_tokenize(sent))
        # Increment the sentence counter.
        num_sents+=1
# Print the final counts and the calculated average number of tokens per sentence.
print("Sents: %s, Tokens/sent: %.1f" % (num_sents, (count/num_sents)))

Sents: 70618, Tokens/sent: 12.5


***

## Q4 (check-plus): modify the extensible tokenizer above to keep urls together (e.g., www.google.com or http://www.google.com)

***

To handle URLs, we add new patterns to our tuple of regular expressions. It's important to place them before the more general patterns. We add two new rules:
1.  `r"(?:https?:\S+)"`: This captures URLs starting with `http:` or `https:`, followed by any sequence of non-whitespace characters.
2.  `r"(?:www\.\S+)"`: This captures URLs starting with `www.`, also followed by non-whitespace characters.

By placing these near the top of the `regexes` tuple, we ensure they are matched before the text can be broken up by more general rules.

In [24]:
regexes=(
    # Keep usernames/mentions together (e.g., @user_name).
    r"(?:@[\w_]+)",

    # Keep hashtags together (e.g., #topic).
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",

    # Keep URLs together; these come before the general word patterns so they match first.
    r"(?:https?:\S+)",
    r"(?:www\.\S+)",

    # Keep words with internal apostrophes, hyphens, or underscores together.
    r"(?:[a-z][a-z’'\-_]+[a-z])",

    # Keep all other sequences of word characters (letters, digits, underscore) together.
    r"(?:[\w_]+)",

    # Match any other non-whitespace character as its own token.
    r"(?:\S)"
)

big_regex="|".join(regexes)

my_url_extensible_tokenizer = re.compile(big_regex, re.VERBOSE | re.I | re.UNICODE)

def my_extensible_tokenize_with_urls(text):
    return my_url_extensible_tokenizer.findall(text)

***

Let's test our new URL-aware tokenizer on a sample sentence containing a URL. The output shows that the URL is correctly identified and kept as a single token, demonstrating that our modification was successful.

In [25]:
# Test the new tokenizer on a sample sentence and print each token on a new line.
print ('\n'.join(my_extensible_tokenize_with_urls("The course website is http://people.ischool.berkeley.edu/~dbamman/info256.html")))

The
course
website
is
http://people.ischool.berkeley.edu/~dbamman/info256.html
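
One caveat is worth checking: because `\S+` consumes every non-whitespace character, punctuation that directly follows a URL becomes part of the URL token. The sketch below demonstrates this on a made-up sentence and tries one possible refinement, which is an assumption of this example rather than part of the original exercise: the URL patterns are changed so that the final character may not be common sentence punctuation. The helper `build_tokenizer` and both tokenizer names are hypothetical.

```python
import re


def build_tokenizer(url_patterns):
    # Same pattern order as above, with the URL patterns passed in.
    regexes = (
        r"(?:@[\w_]+)",
        r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",
        *url_patterns,
        r"(?:[a-z][a-z’'\-_]+[a-z])",
        r"(?:[\w_]+)",
        r"(?:\S)",
    )
    return re.compile("|".join(regexes), re.VERBOSE | re.I | re.UNICODE)


# The URL patterns used in the notebook.
original = build_tokenizer((r"(?:https?:\S+)", r"(?:www\.\S+)"))

# Hypothetical refinement: the last character must not be sentence punctuation.
trimmed = build_tokenizer(
    (r"(?:https?:\S*[^\s.,;:!?])", r"(?:www\.\S*[^\s.,;:!?])")
)

sample = "See www.google.com or http://www.google.com, thanks"

original_tokens = original.findall(sample)
print(original_tokens)
# ['See', 'www.google.com', 'or', 'http://www.google.com,', 'thanks']

trimmed_tokens = trimmed.findall(sample)
print(trimmed_tokens)
# ['See', 'www.google.com', 'or', 'http://www.google.com', ',', 'thanks']
```

The refinement is a trade-off: a URL that genuinely ends in one of the excluded characters would lose that character, so which behaviour is preferable depends on the data.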

