## Words
### What counts as a word?

##### Terms:
- **Corpus** a computer-readable collection of text or speech
- **Utterance** a *spoken* sentence
- **Disfluency** a break-up of a sentence or phrase
  - **Fragment** a type of disfluency, a broken-off word: "I do uh *main-* mainly business data processing"
  - **Filler** a type of disfluency that fills a pause, for example "*uh*" or "*um*"
- **Types** distinct words in a corpus; the set of types is the vocabulary *V*, and its size is denoted by |*V*|
- **Instances** total number of words in a corpus, denoted by *N*
- **Herdan's Law/Heaps' Law** law stating that the larger the corpus, the more word types it contains:
  - |*V*| = *kN^β* where *k* is positive and 0 < *β* < 1
  - The value of *β* depends on genre, but for large corpora it typically ranges from .67 to .75
  - "vocabulary size for a text goes up significantly faster than the square root of its length in words"
- **Lemma** a set of lexical forms having the same stem, like "cat" and "cats"
- **Lemmatization** the task of determining whether two words have the same root
- **Stem** the unmodified "root" morpheme of a lemma
- **Affix** a non-stem "add on" morpheme
- **Wordform** the full inflected or derived form of a word; for example, "cat" and "cats" share a common lemma but are separate wordforms
- **Token** words or parts of words, typically for neural net applications
- **Code switching** using multiple languages within a single conversation or utterance
- **Datasheet/data statement** specifies properties of a dataset, including motivation, situation, language variety, speaker demographics, collection process, annotation process, and distribution
- **Normalization** a process that typically consists of tokenizing words, normalizing word formats, and segmenting sentences, so that text ends up in a standard format
- **Case Folding** a simple normalization technique where all letters are mapped to a single case
- **Clitic** a part of a word that can't stand on its own and only occurs attached to another word, like the 're in we're
- **Morpheme** the smallest meaning-bearing linguistic unit, like 'un', 'wash', and 'able'
- **Morphology** the study of how words are built up from morphemes
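- **Sentence segmentation** the process of splitting text into individual sentences, typically using punctuation cues such as periods, question marks, and exclamation points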
- **Coreference** the task of determining whether two strings refer to the same entity
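- **Edit Distance** a quantitative measure of string similarity: the minimum number of editing operations (insertions, deletions, substitutions) needed to transform one string into another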
- **Alignment** a correspondence between substrings of two strings
- **Dynamic Programming** a class of algorithms that apply a table-driven method to solve problems by combining solutions to subproblems
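
To make types, instances, and Herdan's/Heaps' Law concrete, here is a minimal Python sketch. The tokenizer (lowercased runs of letters), the sample sentence, and the values of *k* and *β* are assumptions chosen for illustration, not fitted to any real corpus.

```python
import re

def count_types_and_instances(text):
    """Return (|V|, N) using a naive tokenizer: lowercased runs of letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return len(set(words)), len(words)

def heaps_vocabulary_size(n, k, beta):
    """Predicted vocabulary size |V| = k * N**beta."""
    return k * n ** beta

# Hypothetical sample sentence.
text = "the cat sat on the mat and the dog sat too"
v, n = count_types_and_instances(text)
print(f"types |V| = {v}, instances N = {n}")

# With beta above 0.5, |V| grows faster than the square root of N.
# k = 1 is an arbitrary illustrative constant.
for n_words in (10_000, 1_000_000):
    sqrt_growth = heaps_vocabulary_size(n_words, k=1, beta=0.5)
    heaps_growth = heaps_vocabulary_size(n_words, k=1, beta=0.7)
    print(n_words, round(sqrt_growth), round(heaps_growth))
```

In practice *k* and *β* would be estimated from a corpus by measuring |*V*| at several values of *N*.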

##### Named Entities:
- **[Linguistic Data Consortium (LDC)](https://www.ldc.upenn.edu/)** a UPenn linguistic data, education, and technology provider that developed the Penn Treebank tokenization standard
- **[Penn Treebank (PTB)](https://paperswithcode.com/dataset/penn-treebank)** a treebank corpus provided by LDC with labelled POS tags
- **[Penn Treebank Tokenization Standard](https://www.nltk.org/api/nltk.tokenize.treebank.html)** a commonly used tokenization standard for parsed corpora (treebanks)
- **[Natural Language Toolkit (NLTK)](https://www.nltk.org/)** a Python package that includes corpora, NLP tools, and lexical resources like WordNet
- **[WordNet](https://wordnet.princeton.edu/)** a large lexical database of English that groups words into synsets (sets of synonyms), lemmas, etc.
- **[nltk.tokenize.RegexpTokenizer](https://www.nltk.org/api/nltk.tokenize.RegexpTokenizer.html)** a rule-based regex tokenizer provided by NLTK
- **[Byte-pair encoding (BPE)](https://en.wikipedia.org/wiki/Byte_pair_encoding)** initially a compression algorithm, BPE was eventually adapted to build subword-unit token vocabularies based on a corpus of raw text by Sennrich, Haddow and Birch in their paper [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162.pdf)
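- **[Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)** a string metric for measuring the edit distance between two sequences; it extends Levenshtein distance (insertions, deletions, substitutions) by also counting the transposition of two adjacent characters as a single edit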
- **[Porter Stemmer](https://tartarus.org/martin/PorterStemmer/)** a simple and crude stemming algorithm [developed in 1980](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf)
- **[SentencePiece](https://github.com/google/sentencepiece)** an unsupervised text tokenizer and detokenizer that implements subword units such as BPE and the unigram language model. It is mainly for NN-based text generation systems where the vocabulary size is predetermined prior to model training. First put forward by [Kudo and Richardson](https://aclanthology.org/D18-2012.pdf) in 2018.
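
Edit distance and dynamic programming, both defined in the terms above, can be sketched together in a few lines of Python. This is a minimal sketch, not a reference implementation: the `sub_cost` and `transpositions` parameters are names introduced here, and the transposition option implements only the restricted ("optimal string alignment") variant of Damerau–Levenshtein distance.

```python
def edit_distance(source, target, sub_cost=1, transpositions=False):
    """Minimum edit cost to turn source into target, filled in table-style."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i          # delete every character of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j          # insert every character of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
            # Optional: swapping two adjacent characters counts as one edit.
            if (transpositions and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[n][m]

print(edit_distance("intention", "execution"))              # unit costs
print(edit_distance("intention", "execution", sub_cost=2))  # substitution costs 2
print(edit_distance("ab", "ba", transpositions=True))       # one transposition
```

Keeping a backpointer for each cell of the table would additionally recover an alignment between the two strings.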

##### Considerations that are *task-dependent*.  
The answers to these will depend on what problem you are trying to solve:
- Should punctuation be counted as a word?
  - Yes for POS tagging, parsing, speech synthesis
- Should disfluencies count as words?
  - Maybe not for speech transcription
  - Yes for speech recognition (they help predict the upcoming word) and speaker identification
- Should words like "They" and "they" be treated as the same or different word types?
  - Same for speech recognition
  - Different for named-entity tagging

## Corpora
### Always situated within context: language, dialect, time, place, function

- Even in the same language, tasks like segmentation can differ depending on dialect: "talmbout" vs "talking about"
  - Similarly with genre
- Demographic make-up of the speaker(s)
- Language evolves over time

## Naive Unix Command Tokenization

You can use the following Unix commands to achieve a rudimentary tokenization with instance counts:

Converting words to lowercase before counting:

Sorting by word count:
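
The three steps named in this section (counting word instances, lowercasing before counting, and sorting by count) can also be mirrored in Python. This is a rough equivalent for illustration, not the Unix commands themselves; the sample text and the letters-only definition of a word are assumptions.

```python
import re
from collections import Counter

def word_counts(text, lowercase=False):
    """Treat every run of ASCII letters as a word and count its instances."""
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r"[A-Za-z]+", text))

# Hypothetical sample text.
sample = "The cat saw the dog. The dog saw the cat!"

# Without case folding, "The" and "the" are separate types.
print(word_counts(sample))

# With case folding, sorted from most to least frequent.
for word, count in word_counts(sample, lowercase=True).most_common():
    print(count, word)
```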

## Word and Subword Tokenization

- **Top-down tokenization** defines a standard and implements rules
- **Bottom-up tokenization** breaks up words into subword tokens (words, parts of words, individual letters) using statistics of letter sequences to come up with a subword "vocabulary"

Bottom-up is more typically used for NLP tasks. Top-down tokenization gets into all kinds of exceptions that need to be handled:
- Periods generally separate sentences, and therefore break up tokens and become tokens themselves, but what about 'Ph.D.', 'm.p.h.', or $16.99?
- Commas separate sentence phrases, but what about '400,000'?
- Contractions like "don't" are more like two words ("do not"), so we'd want to break it into tokens "do" and "n't", but what about "cap'n"?
- What precedes and follows a colon should typically be split into two tokens, but what about https://some.url?
- Spaces separate words, but what about 'New York' and "rock 'n' roll"?
- Some languages, like Chinese, do not separate words with spaces, and segmenting them into words yields a huge vocabulary with many rarely used words; they are often better tokenized at the character level.
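
A top-down tokenizer that handles a few of the exceptions above can be sketched with Python's `re` module. The pattern below is an illustrative assumption, not the Penn Treebank standard: each alternative covers one exception, earlier alternatives take priority, and any case not listed falls through to the generic word and punctuation rules.

```python
import re

TOKEN_PATTERN = re.compile(r"""
      https?://\S+                      # URLs stay whole
    | Ph\.D\.                           # hand-listed abbreviation
    | (?:[A-Za-z]\.){2,}                # dotted abbreviations such as m.p.h.
    | \$?\d+(?:,\d{3})*(?:\.\d+)?%?     # numbers and prices: 400,000  $16.99
    | \w+(?=n't)                        # the part of a contraction before n't
    | n't                               # the clitic itself
    | \w+(?:'\w+)*                      # words, including cap'n
    | [^\w\s]                           # any other punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Return the non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)

# Hypothetical example sentence.
text = ("She has a Ph.D. and paid $16.99, not 400,000. "
        "Don't tell the cap'n: see https://some.url")
print(tokenize(text))
```

Every new exception needs another hand-written alternative, which is the maintenance burden that motivates learning tokens from data instead.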

Tokenizers are almost always the "first pass" in NLP tasks.
- They need to be *fast*
- Rule-based tokenizers are usually implemented with regular expressions compiled into efficient finite-state automata

**Bottom-up tokenization** is the technique of choice for LLM tasks. It uses data to automatically determine what the tokens should be, avoiding the question of what counts as a word or character, and also avoiding the near-endless list of exceptions that can arise from top-down approaches.

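It can also handle words it has not encountered before: if the training corpus contains 'low', 'new', and 'newer' but not 'lower', a subword tokenizer can still represent 'lower' as a sequence of known subword units such as 'low' and 'er'. For this reason, modern tokenizers tokenize on subwords, which can be arbitrary substrings or meaning-bearing units like the morphemes 'est' and 'er'.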

Most modern tokenization schemes have two parts: a **token learner** and a **token segmenter**. The learner induces a vocabulary from a raw training corpus to use as a set of tokens. The segmenter segments raw test data into tokens from the learned vocabulary.

**Byte-pair encoding**, or **BPE**, starts with a vocabulary of individual characters. It then repeatedly finds the most frequent pair of adjacent symbols in the training corpus (much like counting bigrams), merges that pair into a new composite symbol, and adds the symbol to the vocabulary. In this way it merges individual letters into increasingly longer character strings until a chosen number of merges, *k*, has been performed. Merging usually happens only inside words: the input corpus is first split on whitespace, and a special end-of-word symbol, like _, is appended to each word.
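
The learner and segmenter can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any particular library: the toy corpus is made up, the function names are hypothetical, and ties between equally frequent pairs are broken in favor of the pair counted first.

```python
from collections import Counter

END = "_"  # end-of-word symbol appended to every word

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    """Token learner: return the ordered list of merges induced from the corpus."""
    # Each whitespace-separated word becomes a tuple of characters plus END.
    words = Counter(tuple(w) + (END,) for w in corpus.split())
    merges = []
    for _step in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair counted wins
        merges.append(best)
        words = Counter({merge_pair(s, best): f for s, f in words.items()})
    return merges

def segment(word, merges):
    """Token segmenter: replay the learned merges, in order, on a new word."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus: word frequencies are set by repetition.
corpus = "low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3 + "new " * 2
merges = learn_bpe(corpus, num_merges=8)
print(merges)
print(segment("lower", merges))  # a word that never appears in the corpus
```

Because the segmenter only replays merges learned from the training corpus, an unseen word like "lower" is still split into known subword units rather than mapped to an unknown token.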