[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/1.words/EvaluateTokenizationForSentiment.ipynb)

# The impact of tokenization on downstream tasks

Tokenization can have a big impact on downstream model performance. Here, we look at different methods for tokenization and stemming/lemmatization and evaluate how they affect performance on a simple binary sentiment classification task.

We use a train/dev dataset of 1000 reviews from the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

Each tokenization method is evaluated on the same learning algorithm ($l_2$-regularized logistic regression); the only difference is the tokenization process.
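
For reference, the objective minimized by $l_2$-regularized logistic regression can be written roughly as below. The notation is introduced here for illustration and does not appear elsewhere in this notebook: labels are recoded as $y_i \in \{-1, +1\}$, $x_i$ is the feature vector for review $i$, $w$ and $b$ are the weights and intercept, and $C$ is the inverse regularization strength (smaller $C$ means stronger regularization).

$$\min_{w,\, b} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + b\right)\right)\right)$$

Because the learner and $C$ are held fixed, any change in dev accuracy comes from how the tokenizer defines the features $x_i$.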

For more, see: http://sentiment.christopherpotts.net/tokenizing.html.

In [5]:
# download code and data
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/1.words/happyfuntokenizing.py

!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/sentiment.1000.train.txt
!wget https://raw.githubusercontent.com/dbamman/anlp25/main/data/sentiment.1000.dev.txt


In [6]:
# make sure dependencies are installed
!pip install nltk
!pip install spacy
!pip install scikit-learn



In [7]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

import spacy
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model


from happyfuntokenizing import Tokenizer as potts

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Setting up evaluation
We'll set up a class that we can use to test different tokenization methods.

In [8]:
class TokenizationTest():

    def __init__(self, train_file, dev_file):
        self.train_file = train_file
        self.dev_file = dev_file
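        # The documents are already lists of tokens, so the analyzer just passes
        # them through. With a callable analyzer, CountVectorizer skips its own
        # preprocessing, so any lowercasing must happen inside the tokenizer.
        self.count_vectorizer = CountVectorizer(
            max_features=10_000,
            analyzer=lambda x: x,
            lowercase=False,
            strip_accents=None,
            binary=True  # record presence/absence of a token, not its count
        )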
        self.label_encoder = LabelEncoder()

    def read_data(self, filename, tokenizer):
        """Read tab-separated `label<TAB>text` lines and tokenize each text."""
        tokenized_text = []
        labels = []

        with open(filename, encoding="utf-8") as file:
            for line in file:
                cols = line.rstrip().split("\t")
                label = cols[0]
                text = cols[1]
                tokens = list(tokenizer(text))
                tokenized_text.append(tokens)
                labels.append(label)
        return tokenized_text, labels

    def evaluate(self, tokenizer):
        """Train on the train file, then print dev accuracy for `tokenizer`."""
        train_tokens, train_labels = self.read_data(self.train_file, tokenizer)
        dev_tokens, dev_labels = self.read_data(self.dev_file, tokenizer)

        # the vocabulary is built from the training data only
        X_train = self.count_vectorizer.fit_transform(train_tokens)
        X_dev = self.count_vectorizer.transform(dev_tokens)

        # map string labels to integers
        self.label_encoder.fit(train_labels)
        Y_train = self.label_encoder.transform(train_labels)
        Y_dev = self.label_encoder.transform(dev_labels)

        # same model and hyperparameters for every tokenizer
        model = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2')
        model.fit(X_train, Y_train)
        print("Function '%s' Accuracy: %.3f" % (tokenizer.__name__, model.score(X_dev, Y_dev)))
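
To make the featurization step concrete, here is a minimal pure-Python sketch of binary bag-of-words features over pre-tokenized documents. It is a simplified stand-in for what `CountVectorizer(binary=True, analyzer=lambda x: x)` produces, not scikit-learn's implementation: the helper names `build_vocab` and `binary_features` are hypothetical, and the way scikit-learn ranks tokens for `max_features` and breaks ties may differ.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_features=10_000):
    """Map the most frequent training tokens to column indices (hypothetical helper)."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    kept = [tok for tok, _ in counts.most_common(max_features)]
    # assign indices in alphabetical order so the mapping is deterministic
    return {tok: i for i, tok in enumerate(sorted(kept))}


def binary_features(tokens, vocab):
    """Return a 0/1 vector: 1 if the token occurs at least once in the document."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:  # tokens unseen in training are dropped
            vec[idx] = 1
    return vec


train_docs = [["good", "movie", "good"], ["bad", "movie"]]
vocab = build_vocab(train_docs)
print(vocab)  # {'bad': 0, 'good': 1, 'movie': 2}
print(binary_features(["good", "good", "film"], vocab))  # [0, 1, 0]
```

Note that `"film"` is silently ignored at dev time because it was never seen in training. This is one reason tokenization matters: a tokenizer that maps more dev tokens onto known training tokens gives the classifier more to work with.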

## Setting up tokenizers

Now let's set up our tokenizers. Each tokenizer should take as input a string and output a list of strings. We'll try six different tokenization methods.

1. Splitting on whitespace with `str.split()`
2. Splitting on whitespace, then stemming with the [Porter stemmer](https://tartarus.org/martin/PorterStemmer/)
3. Using [`nltk.word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html)
4. Using the [`spacy` tokenizer](https://spacy.io/usage/linguistic-features#how-tokenizer-works)
5. Using the [`spacy` tokenizer](https://spacy.io/usage/linguistic-features#how-tokenizer-works) with [lemmatization](https://spacy.io/api/lemmatizer)
6. Using the [Potts tokenizer](http://sentiment.christopherpotts.net/tokenizing.html) (implemented for you in `happyfuntokenizing.py`)

Note: evaluating the spacy tokenizers might take ~1 minute.
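
As a quick illustration of why these methods can produce different features, the sketch below contrasts whitespace splitting with a small punctuation-aware regex. The pattern `TOKEN_RE` and the function `regex_tokenize` are hypothetical and much simpler than the rules used by NLTK, spaCy, or the Potts tokenizer.

```python
import re

# Hypothetical pattern: a run of word characters (allowing internal
# apostrophes, as in "didn't"), or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def regex_tokenize(text):
    """Split punctuation off from words instead of leaving it attached."""
    return TOKEN_RE.findall(text)


text = "Loved it, didn't you?"
print(text.split())          # ['Loved', 'it,', "didn't", 'you?']
print(regex_tokenize(text))  # ['Loved', 'it', ',', "didn't", 'you', '?']
```

With whitespace splitting, `it,` and `you?` become vocabulary entries distinct from `it` and `you`, which fragments the evidence the classifier sees for each word.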

In [9]:
# load NLTK porter stemmer
stemmer = PorterStemmer()
def tokenize_with_porter(data):
    return [
        stemmer.stem(word) for word in str.split(data)
    ]

In [10]:
# spaCy lemmatization needs the tagger, so disable only the components we don't use
# (each component name must be its own list element)
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def tokenize_with_spacy(data):
    spacy_tokens = nlp(data)
    return [token.text for token in spacy_tokens]

def tokenize_with_spacy_lemma(data):
    spacy_tokens = nlp(data)
    return [token.lemma_ for token in spacy_tokens]

In [11]:
# load Potts sentiment tokenizer
potts_tokenizer = potts()
def tokenize_with_potts(data):
    return list(potts_tokenizer.tokenize(data))

## Testing the tokenizers

In [12]:
tester = TokenizationTest("sentiment.1000.train.txt", "sentiment.1000.dev.txt")

In [9]:
tester.evaluate(str.split)

Function 'split' Accuracy: 0.858


In [10]:
tester.evaluate(tokenize_with_porter)

Function 'tokenize_with_porter' Accuracy: 0.866


In [12]:
tester.evaluate(nltk.word_tokenize)

Function 'word_tokenize' Accuracy: 0.874


In [13]:
tester.evaluate(tokenize_with_spacy)

Function 'tokenize_with_spacy' Accuracy: 0.872


In [14]:
tester.evaluate(tokenize_with_spacy_lemma)

Function 'tokenize_with_spacy_lemma' Accuracy: 0.872


In [15]:
tester.evaluate(tokenize_with_potts)

Function 'tokenize_with_potts' Accuracy: 0.883


## Extra

Inspect the output of some of these tokenizers. How do different tokenizers handle some of the issues we talked about in lecture (e.g., punctuation, emoticons, casing)?

In [15]:
tokenize_with_potts("I love Applied Natural Language Processing, and I hope that I can get enrolled in this course!")  # modify this to test different tokenizers / different strings

['i',
 'love',
 'applied',
 'natural',
 'language',
 'processing',
 ',',
 'and',
 'i',
 'hope',
 'that',
 'i',
 'can',
 'get',
 'enrolled',
 'in',
 'this',
 'course',
 '!']
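
One way to do this inspection is to run several tokenizers on the same string and print the results side by side. The sketch below assumes nothing beyond the standard library: `compare_tokenizers` and `simple_regex` are hypothetical helpers, and `simple_regex` only stands in for a real tokenizer so that the example runs on its own.

```python
import re


def compare_tokenizers(text, tokenizers):
    """Apply each tokenizer in a {name: function} dict to the same text."""
    return {name: list(fn(text)) for name, fn in tokenizers.items()}


def simple_regex(text):
    """Hypothetical stand-in: lowercase, then split words from punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


results = compare_tokenizers(
    "GREAT film :)",
    {"split": str.split, "simple_regex": simple_regex},
)
for name, tokens in results.items():
    print(f"{name:>12}: {tokens}")
```

In this notebook, the dict could instead hold the functions defined above (for example `tokenize_with_porter`, `nltk.word_tokenize`, or `tokenize_with_potts`). Here the naive regex breaks the emoticon `:)` into two tokens while whitespace splitting keeps it whole but preserves the casing of `GREAT`.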

The Potts tokenizer was designed with web text in mind, with special hand-crafted rules for emoticons, HTML tags, and hashtags. Can you approach the performance of the Potts tokenizer (>0.88) by combining some of the other methods we test?

In [16]:
def my_tokenizer(data: str) -> list[str]:
    """Tokenize the `data` string into a list of strings."""
    import re, html

    if not data:
        return []

    # compact regexes
    EMOTICON = r"(?:(?:[:;=8][\-o\*']?[\)\]\(\[dDpP/:}\{@\|\\])|(?:[\)\]\(\[dDpP/:}\{@\|\\][\-o\*']?[:;=8]))"
    URL      = r"(?:https?://\S+|www\.\S+)"
    EMAIL    = r"(?:[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,})"
    MENTION  = r"(?:@\w{1,50})"
    HASHTAG  = r"(?:#[A-Za-z0-9_]+)"
    HTMLTAG  = r"(?:</?[\w!][^>]*>)"
    NUM      = r"(?:\d+(?:[.,]\d+)*(?:%|[kKmMbB])?)"
    WORD     = r"(?:[A-Za-z]+(?:'[A-Za-z]+)*)"
    PUNCT    = r"(?:[^\w\s])"

    MASTER_RE = re.compile("|".join([EMOTICON, HTMLTAG, URL, EMAIL, MENTION, HASHTAG, NUM, WORD, PUNCT]), re.IGNORECASE)
    EMOTICON_RE = re.compile(EMOTICON)
    URL_RE = re.compile(URL, re.IGNORECASE)
    EMAIL_RE = re.compile(EMAIL, re.IGNORECASE)
    MENTION_RE = re.compile(MENTION)
    HASHTAG_RE = re.compile(HASHTAG)
    HTMLTAG_RE = re.compile(HTMLTAG)
    NUM_RE = re.compile(NUM)

    # Porter stemmer
    try:
        _stem = stemmer.stem
    except Exception:
        _stem = lambda w: w

    # preprocess
    text = html.unescape(data).replace("\u200d", " ")
    text = HTMLTAG_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()

    # tokenize
    raw = [m.group(0) for m in MASTER_RE.finditer(text)]

    def _normalize(tok: str) -> list[str]:
        if EMOTICON_RE.fullmatch(tok): return [tok]
        if URL_RE.fullmatch(tok):      return ["__URL__"]
        if EMAIL_RE.fullmatch(tok):    return ["__EMAIL__"]
        if MENTION_RE.fullmatch(tok):  return ["__USER__"]
        if NUM_RE.fullmatch(tok):      return ["__NUM__"]
        if HASHTAG_RE.fullmatch(tok):
            base = tok.lower()
            out = [base]
            inner = tok[1:]  # keep the original casing so the camel-case split below can find word boundaries
            parts = re.split(r"_+", inner)
            split_more = []
            for p in parts:
                split_more += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)|\d+", p)
            out.extend([s.lower() for s in split_more if s])
            return out
        if HTMLTAG_RE.fullmatch(tok):  return []
        return [tok.lower()]

    normed = []
    for t in raw:
        normed.extend(_normalize(t))
    normed = [t for t in normed if t]

    collapsed = []
    for t in normed:
        if len(t) > 1 and not any(ch.isalnum() for ch in t) and len(set(t)) == 1:
            collapsed.append(t[0])
        else:
            collapsed.append(t)

    final = [(_stem(t) if t.isalpha() and len(t) > 2 else t) for t in collapsed]
    return final

In [17]:
tester.evaluate(my_tokenizer)

Function 'my_tokenizer' Accuracy: 0.881
