## Tokenization

### Input:

- A set of documents (e.g. text files), $D$

### Output (tokens):

- A sequence, $W$, containing a list of tokens (words or word pieces) for use in natural language processing

### Output (n-grams):

- A matrix, $X$, containing statistics about word/phrase frequencies in those documents.


### Tokens

The most basic unit of representation in a text.

- characters: documents as sequence of individual letters {h,e,l,l,o, ,w,o,r,l,d}
- words: split on white space {hello, world}
- n-grams: learn a vocabulary of phrases and tokenize those:
  > `“hello world → hello_world”`
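
The three granularities above can be sketched in a few lines of plain Python. This is only an illustration: `phrase_vocab` and `merge_phrases` are hypothetical names, and the phrase vocabulary is written by hand here rather than learned from data.

```python
text = "hello world"

# Character tokens: every character, including the space, is a token.
char_tokens = list(text)

# Word tokens: split on whitespace.
word_tokens = text.split()

# Hypothetical, hand-written phrase vocabulary (normally learned from a corpus).
phrase_vocab = {("hello", "world")}


def merge_phrases(words, vocab):
    """Greedily merge adjacent word pairs that appear in `vocab`."""
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in vocab:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


phrase_tokens = merge_phrases(word_tokens, phrase_vocab)

print(char_tokens)
print(word_tokens)
print(phrase_tokens)
```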


### Goals of Tokenization

To summarize: A major goal of tokenization is to produce features that are

- `predictive` in the learning task
- `interpretable` by human investigators
- `tractable` enough to be easy to work with

Two broad approaches:

1. convert documents to vectors, usually frequency distributions over pre-processed n-grams.
2. convert documents to sequences of tokens, for inputs to sequential models.


### A Traditional Tokenization Pipeline

- Extract text from documents (e.g. PDF, HTML, XML, …)
- Tokenize text into words
- Normalize words (e.g. lowercasing, stemming, lemmatization)
- Remove stop words (e.g. “the”, “a”, “an”, “in”, …)
- Build a vocabulary of words
- Convert documents to vectors of word counts or TF-IDF scores
- Train a model on those vectors
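
A minimal sketch of the middle steps of this pipeline (tokenize, lowercase, remove stop words, build a vocabulary, count), using only the standard library. The corpus and the stop-word list are invented for illustration; text extraction, stemming or lemmatization, TF-IDF weighting, and model training are left out.

```python
import re
from collections import Counter

# Toy corpus and stop-word list, both invented for illustration.
docs = [
    "The cat sat on the mat.",
    "A dog sat in the garden.",
]
stop_words = {"the", "a", "an", "in", "on"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stop_words]


tokenized = [tokenize(d) for d in docs]

# Vocabulary: sorted list of the distinct remaining words.
vocab = sorted({w for doc in tokenized for w in doc})

# Document-term matrix X: one row per document, one column per vocabulary word.
X = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(vocab)
print(X)
```

In practice a library vectorizer would replace the hand-rolled counting; the sketch only shows what such a vectorizer produces.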


### Subword Tokenization for Sequence Models

Modern transformer models (e.g. BERT, GPT) use subword tokenization:

- construct character-level n-grams
- whitespace treated the same as letters
- convert all letters to lowercase, but add a special character marking that the next letter is capitalized.

e.g., BERT’s WordPiece tokenizer:

- a subword algorithm closely related to character-level byte-pair encoding
- learns character n-grams to break words like “playing” into “play” and “##ing”.
- have to fix a vocabulary size: e.g. BERT uses about 30K.
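
To make the “play” + “##ing” split concrete, here is a sketch of greedy longest-match-first splitting against a fixed vocabulary, which is how a WordPiece-style vocabulary is typically applied at tokenization time. It does not show how the vocabulary is learned. The function name `wordpiece_tokenize` and the four-entry `toy_vocab` are assumptions for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest matching pieces from `vocab`, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real one has tens of thousands of entries.
toy_vocab = {"play", "##ing", "##ed", "walk"}

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("walked", toy_vocab))
print(wordpiece_tokenize("jumped", toy_vocab))
```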

![](figs/3.png)


### Segmenting paragraphs/sentences

Many tasks should be done on sentences, rather than corpora as a whole.

- spaCy does a good (but not perfect) job of splitting sentences, while accounting for periods in abbreviations, etc.
- pySBD is a better option for splitting sentences.

There isn’t a grammar-based paragraph tokenizer.

- most corpora have new paragraphs annotated.
- or use line breaks.
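
A small sketch of both points, using only regular expressions: paragraphs are split on blank lines, and a naive sentence splitter shows why abbreviations need a dedicated tool. The sample strings and the regular expressions are illustrative assumptions, not what spaCy or pySBD do internally.

```python
import re

text = (
    "First paragraph, line one.\nStill the first paragraph.\n\n"
    "Second paragraph.\n\n\nThird paragraph."
)

# Treat one or more blank lines as a paragraph boundary.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
print(paragraphs)

# Naive sentence splitting: break after ., ! or ? followed by whitespace.
# It wrongly splits after the abbreviation "Dr.".
naive_sentences = re.split(r"(?<=[.!?])\s+", "Dr. Smith arrived. He sat down.")
print(naive_sentences)
```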


### Pre-processing

An important piece of the “art” of text analysis is deciding what data to throw out.

- Uninformative data add noise and reduce statistical precision.
- They are also computationally costly.

Pre-processing choices can affect down-stream results, especially in unsupervised learning tasks (Denny and Spirling 2017).

- some features are more interpretable: “governor has” / “has discretion” vs “governor has discretion”.


#### Capitalization

Removing capitalization is a standard corpus normalization technique

- usually the capitalized and non-capitalized versions of a word are equivalent, e.g. words capitalized only because they begin a sentence
- → capitalization not informative.

Also: what about “the first amendment” versus “the First Amendment”?

- Compromise: include capitalized version of words not at beginning of sentence.
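
The compromise could be sketched as follows: lowercase only the first character of each sentence and leave all other capitals alone. The function name and the naive sentence splitter are assumptions for illustration; a real pipeline would use a proper sentence segmenter, and proper nouns that happen to start a sentence are lowercased too.

```python
import re


def lowercase_sentence_initial(text):
    """Lowercase only the first character of each sentence; keep other capitals."""
    # Naive split: after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    normalized = [s[0].lower() + s[1:] for s in sentences if s]
    return " ".join(normalized)


print(lowercase_sentence_initial("The court cited the First Amendment. Then it ruled."))
```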

For some tasks, capitalization is important

- needed for sentence splitting, part-of-speech tagging, syntactic parsing, and semantic role labeling.
- For sequence data, e.g. language modeling, some tokenizers remove capitalization but then add a special “capitalized” token before the word.


#### Punctuation

![](figs/4.png)
(Source: Chris Bail text data slides.)

Inclusion of punctuation depends on your task:

- if you are vectorizing the document as a bag of words or bag of n-grams, punctuation won’t be needed.
- like capitalization, punctuation is needed for annotations (sentence splitting, parts of speech, syntax, roles, etc)
  - also needed for language models.


#### Numbers

For classification using bag of words:

- can drop numbers, or replace them with a special token

For language models:

- just treat them like letters.
- GPT-3 can solve math problems (but not well; this is an area of research)
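
For the bag-of-words case, replacing numbers with a placeholder might look like this. The `<NUM>` token and the regular expression are illustrative choices, not a standard.

```python
import re


def mask_numbers(text, token="<NUM>"):
    """Replace each run of digits (with optional , or . separators) by `token`."""
    return re.sub(r"\d+(?:[.,]\d+)*", token, text)


print(mask_numbers("In 2017 the fee rose by 3.5% to 1,200 dollars."))
```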


#### Drop Stopwords?

![](figs/5.png)

- Stopwords are words that are so common that they don’t carry much information.
- can drop stopwords by themselves, but keep them as part of phrases.
- can filter out words and phrases using part-of-speech tags (later).
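
One possible reading of “drop stopwords by themselves, but keep them as part of phrases” is sketched below: stop words are removed from the unigrams, while a trigram is kept only if it neither starts nor ends with a stop word. That edge rule, the token list, and the stop-word list are assumptions for illustration.

```python
stop_words = {"the", "of", "a", "in"}  # tiny illustrative list
tokens = ["the", "burden", "of", "proof", "rests", "in", "the", "state"]

# Unigrams: drop stop words that stand alone.
unigrams = [t for t in tokens if t not in stop_words]

# Trigrams: allow a stop word inside the phrase, but not at either edge.
trigrams = [
    "_".join(tokens[i:i + 3])
    for i in range(len(tokens) - 2)
    if tokens[i] not in stop_words and tokens[i + 2] not in stop_words
]

print(unigrams)
print(trigrams)
```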


#### Stemming/lemmatizing

- Effective dimension reduction with little loss of information.
- Lemmatizer produces real words, but N-grams won’t make grammatical sense
  - e.g., “I am running” → “I am run”

![](figs/6.png)


## What are N-grams?

- N-grams are contiguous sequences of n tokens.
  - Bigrams: 2-grams
  - Trigrams: 3-grams
  - Quadgrams: 4-grams
- Google Developers recommend tf-idf-weighted bigrams as a baseline for text classification.

For example, the sentence “The quick brown fox jumps over the lazy dog” has the following bigrams:

> [“The quick”, “quick brown”, “brown fox”, “fox jumps”, “jumps over”, “over the”, “the lazy”, “lazy dog”]

and the following trigrams:

> [“The quick brown”, “quick brown fox”, “brown fox jumps”, “fox jumps over”, “jumps over the”, “over the lazy”, “the lazy dog”]
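
The two lists above can be reproduced with a short sliding-window function. The helper name `ngrams` is an assumption for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "The quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of 9 tokens yields 9 - n + 1 n-grams, so 8 bigrams and 7 trigrams.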


**Text classification flowchart (from Google Developers):**

![](figs/10.png)


### N-grams and high dimensionality

- N-grams will blow up your feature space:
  - 1-grams: 1000 words → 1000 features
  - 2-grams: 1000 words → up to 1,000,000 features (1000² ordered pairs)
- Filtering out low-frequency and uninformative n-grams is important.
- Google Developers say that a feature space of 20,000 features will work well for descriptive and predictive text classification.
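
Filtering could be sketched as follows: count every n-gram, drop those below a minimum count, and keep the `k` most frequent of the rest. The function name, the toy documents, and the thresholds are assumptions for illustration.

```python
from collections import Counter


def top_k_ngrams(docs_tokens, n, k, min_count=2):
    """Keep the k most frequent n-grams that occur at least `min_count` times."""
    counts = Counter()
    for tokens in docs_tokens:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    frequent = [(gram, c) for gram, c in counts.most_common() if c >= min_count]
    return [gram for gram, _ in frequent[:k]]


toy_docs = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
print(top_k_ngrams(toy_docs, n=2, k=2))
```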


### Hashing Vectorizer

Rather than make a one-to-one lookup for each n-gram, put n-grams through a hashing function that takes an arbitrary string and outputs an integer in some range (e.g. 1 to 10,000).

- This is a lossy transformation, but it can be useful for very large feature spaces.
- The hashing function is deterministic, so the same string will always map to the same integer.
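
A standard-library sketch of the idea. It uses MD5 from `hashlib` as a stand-in hash function (Python's built-in `hash()` is salted per process for strings, so it is not stable across runs). The function names and bucket counts are assumptions for illustration.

```python
import hashlib


def hash_bucket(ngram, n_buckets=10_000):
    """Deterministically map a string to an integer in [0, n_buckets)."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_vectorize(tokens, n_buckets=16):
    """Count tokens per bucket; different tokens may collide in one bucket."""
    vec = [0] * n_buckets
    for t in tokens:
        vec[hash_bucket(t, n_buckets)] += 1
    return vec


tokens = ["new york", "fast food", "new york"]
print(hash_bucket("new york"))
print(hash_vectorize(tokens))
```

No vocabulary has to be stored, which is the appeal for very large feature spaces; the price is that collisions cannot be undone.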

![](figs/11.png)


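```python
from sklearn.feature_extraction.text import HashingVectorizer

# `twenty_train` is assumed to have been loaded earlier in the notebook
# (presumably a 20 Newsgroups subset via sklearn.datasets.fetch_20newsgroups).
vectorizer = HashingVectorizer(
    n_features=2**4,  # 16 hash buckets, deliberately tiny for the demo
    stop_words="english",
    alternate_sign=False,  # keep feature values non-negative
)

# HashingVectorizer is stateless, so transform() needs no prior fit().
X = vectorizer.transform(twenty_train.data)

print(X.shape)
```

```
(2257, 16)
```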


### Collocations

Collocations are phrases that occur together more often than would be expected by chance.

- Non-compositional: the meaning is not the sum of the parts
  > e.g., “New York” is not the sum of “New” and “York”
- Non-substitutable: cannot substitute one component with synonyms
  > e.g., “fast food” is not the same as “quick food”
- Non-modifiable: cannot modify with additional words
  > e.g., “kick the bucket around” is not the same as “kick the bucket”


### Pointwise Mutual Information

- Pointwise Mutual Information (PMI) measures how much more often two words co-occur in a corpus than would be expected if they were independent.
- PMI is defined as:

$$
\begin{aligned}
PMI(w_1, w_2) &= \frac{P(w_1\_w_2)}{P(w_1)\,P(w_2)} \\
&= \frac{\text{Prob. of collocation, actual}}{\text{Prob. of collocation, if independent}}
\end{aligned}
$$

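where $w_1$ and $w_2$ are words in the vocabulary, and $w_1\_w_2$ is the bigram formed by them.
PMI ranks word pairs by how often they collocate, relative to how often they occur apart. (The standard definition takes the logarithm of this ratio, which does not change the ranking.)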

- Generalizes to longer phrases (length $n$) by dividing by the geometric mean of the word probabilities:

$$ \frac{P(w_1\_\ldots\_w_n)}{\prod_{i=1}^{n} \sqrt[n]{P(w_i)}} $$

- Caveat: Rare words will have high PMI, but this is not necessarily a good thing.
  - Can use a threshold to filter out rare words.
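
A sketch of the bigram case on a toy token stream, using the ratio form defined above and plain relative frequencies as probability estimates. The corpus is invented, and `pmi` and `rank_by_pmi` are hypothetical helper names. It also illustrates the caveat: the rare pair (“the”, “new”) scores higher than (“new”, “york”) until a minimum-count threshold removes it.

```python
from collections import Counter

# Toy token stream, invented for illustration.
tokens = (
    "new york is big . new york is old . "
    "the new car is big . york is old"
).split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1


def pmi(w1, w2):
    """Ratio form of PMI (no logarithm): P(w1_w2) / (P(w1) * P(w2))."""
    p_joint = bigram_counts[(w1, w2)] / n_bigrams
    p_w1 = unigram_counts[w1] / n_unigrams
    p_w2 = unigram_counts[w2] / n_unigrams
    return p_joint / (p_w1 * p_w2)


def rank_by_pmi(min_count=1):
    """Rank observed bigrams by PMI, ignoring those seen fewer than `min_count` times."""
    candidates = [bg for bg, c in bigram_counts.items() if c >= min_count]
    return sorted(candidates, key=lambda bg: pmi(*bg), reverse=True)


print(round(pmi("new", "york"), 3))
print(round(pmi("the", "new"), 3))
print(rank_by_pmi(min_count=2))
```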


### Out-of-Vocabulary Words (OOV) for N-grams

- OOV words are words that are not in the vocabulary.
- OOV words are a problem for N-gram models.
  - Can be replaced with a special token, e.g. `<UNK>`.
  - Can be replaced with the POS tag, e.g. `<NOUN>`.
  - Can be replaced with the hypernym, e.g. `<ANIMAL>`.
  - Can use a hash function to map OOV words to a fixed number of buckets.
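
Two of these options, sketched with hypothetical helper names: replacing OOV words with `<UNK>`, and hashing them into a fixed number of buckets. The `<OOV_k>` naming scheme, the bucket count, and the MD5 stand-in hash are assumptions for illustration.

```python
import hashlib


def replace_oov(tokens, vocab, unk="<UNK>"):
    """Replace every token that is not in `vocab` with a single unknown token."""
    return [t if t in vocab else unk for t in tokens]


def hash_oov(tokens, vocab, n_buckets=100):
    """Map each out-of-vocabulary token to one of `n_buckets` placeholder tokens."""
    out = []
    for t in tokens:
        if t in vocab:
            out.append(t)
        else:
            bucket = int(hashlib.md5(t.encode("utf-8")).hexdigest(), 16) % n_buckets
            out.append(f"<OOV_{bucket}>")
    return out


vocab = {"the", "sleeps"}
sentence = ["the", "zyzzyva", "sleeps"]
print(replace_oov(sentence, vocab))
print(hash_oov(sentence, vocab))
```

Hashing keeps some distinction between different unknown words, whereas `<UNK>` collapses them all into one feature.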


## What are parts of speech?

- Nouns, Pronouns, Proper Nouns,
- Verbs, Auxiliaries,
- Adjectives, Adverbs
- Prepositions, Conjunctions,
- Determiners, Particles
- Numerals, Symbols,
- Interjections, etc.

See e.g. https://universaldependencies.org/u/pos/


### POS Tagging

Words often have more than one POS:

- The back door `(adjective)`
- On my back `(noun)`
- Win the voters back `(particle)`
- Promised to back the bill `(verb)`

The POS tagging task:

- Given a sequence of words, assign a POS tag to each word.
- The POS tag is a label from a fixed set of tags.
- Due to ambiguity (and unknown words), we cannot look up the POS tag in a dictionary.


### Why POS Tagging?

- POS tagging is one of the first steps in many NLP tasks.
- For a traditional NLP pipeline, POS tagging is regarded as a prerequisite for further processing.
  - Syntactic parsing: POS tags are used to build a parse tree.
  - Information extraction: POS tags are used to identify named entities, relations, etc.
- Although POS tagging is not a prerequisite for many modern NLP tasks, it is still useful.
  - To understand the basic structure of a sentence.


### Creating a POS Tagger

- A POS tagger is a classifier that assigns a POS tag to each word in a sentence.
- To handle ambiguity, a POS tagger relies on learned models.
- For a `new language or domain`, a POS tagger can be trained from scratch.
  - Define a set of POS tags.
  - Annotate a corpus with POS tags.
- For an `existing language or domain`, a POS tagger can be trained on the existing annotated corpus.
  - Obtain a corpus with POS tags.
- To train a POS tagger,
  - Choose a POS tagging algorithm. (e.g. HMM, CRF, etc.)
  - Train the POS tagging algorithm on the annotated corpus.
  - Evaluate the POS tagging algorithm on a test set.
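
As a sketch of the train and evaluate steps, here is a most-frequent-tag baseline. It is deliberately simpler than an HMM or CRF: each word gets the tag it carried most often in training, and unknown words get the overall most common tag. The tiny annotated corpus and all function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-annotated corpus (invented), using Universal POS tags.
train = [
    [("the", "DET"), ("back", "ADJ"), ("door", "NOUN")],
    [("on", "ADP"), ("my", "PRON"), ("back", "NOUN")],
    [("the", "DET"), ("back", "NOUN"), ("hurts", "VERB")],
]
test = [[("the", "DET"), ("back", "ADJ"), ("window", "NOUN")]]


def train_baseline(sentences):
    """Learn each word's most frequent tag and a default tag for unknown words."""
    tag_counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in sentences:
        for word, tag in sent:
            tag_counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Tag each word by lookup, falling back to `default_tag`."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


def accuracy(gold_sentences, lexicon, default_tag):
    """Share of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sentences:
        predicted = tag_words([w for w, _ in sent], lexicon, default_tag)
        for (_, gold), (_, pred) in zip(sent, predicted):
            correct += gold == pred
            total += 1
    return correct / total


lexicon, default_tag = train_baseline(train)
print(tag_words(["the", "back", "window"], lexicon, default_tag))
print(accuracy(test, lexicon, default_tag))
```

The baseline tags “back” as a noun in every context, which is exactly the ambiguity problem that context-aware models such as HMMs and CRFs are meant to address.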


![](figs/21.png)


[List of Universal POS tags](https://universaldependencies.org/u/pos/)

![](figs/22.png)


```python
# !python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = "POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context."

# Print each token with its coarse Universal POS tag and its fine-grained tag.
for token in nlp(text):
    print(token.text, "=>", token.pos_, "=>", token.tag_)
```


```
POS => PROPN => NNP
tagging => NOUN => NN
is => AUX => VBZ
the => DET => DT
process => NOUN => NN
of => ADP => IN
marking => VERB => VBG
up => ADP => RP
a => DET => DT
word => NOUN => NN
in => ADP => IN
a => DET => DT
text => NOUN => NN
( => PUNCT => -LRB-
corpus => PROPN => NNP
) => PUNCT => -RRB-
as => ADP => IN
corresponding => VERB => VBG
to => ADP => IN
a => DET => DT
particular => ADJ => JJ
part => NOUN => NN
of => ADP => IN
speech => NOUN => NN
, => PUNCT => ,
based => VERB => VBN
on => ADP => IN
both => CCONJ => CC
its => PRON => PRP$
definition => NOUN => NN
and => CCONJ => CC
its => PRON => PRP$
context => NOUN => NN
. => PUNCT => .
```


## Dependency Parsing

- Dependency parsing is the task of assigning a syntactic dependency to each word in a sentence.
- In dependency parsing, dependency tags represent the grammatical function of a word in a sentence.

For example, in the sentence “The quick brown fox jumps over the lazy dog”,

- A dependency links `fox` to `brown`: `fox` acts as the head and `brown` as the dependent (or child).
- This dependency is labeled `amod` (adjectival modifier).


[Universal Dependency Relations](https://universaldependencies.org/u/dep/)

![](figs/23.png)


```python
from spacy import displacy

# Reuses the `nlp` pipeline loaded in the POS tagging example.
sent = "The quick brown fox jumps over the lazy dog."
displacy.render(nlp(sent), jupyter=True)
```

## Constituency Parsing

- Constituency parsing is the task of analyzing a sentence by breaking it into sub-phrases (constituents).
- These sub-phrases belong to a fixed set of syntactic categories, such as NP (noun phrase) and VP (verb phrase).


> “It took me more than two hours to translate a few pages of English.”

![](figs/27.png)
