Subword models: models below the word level
- In some languages, building word embedding models is hard because the vocabulary is large and open
    - Rich morphology (lots of prefixes and suffixes)
    - New slang and informal words

# Character-level model

Pros:
- Generates embeddings for unknown words
- Similar spellings share similar embeddings
- Solves out-of-vocabulary problem
- **Comparable results versus word-level model with FEWER PARAMETERS**

## Comparing char model with BPE model (seq2seq machine translation)

![](images/subword_1.png)

(Two graphs on the left) The char model helps more for a morphologically rich language such as Czech than for French, though the gain is small if you use a more complex model

# Sub-word models: two trends

- Same architecture as for word-level model:
    - But use smaller units: “word pieces”
    - [Sennrich, Haddow, Birch, ACL’16a],[Chung, Cho, Bengio, ACL’16].
    - BPE
- Hybrid architectures:
    - Main model has words; something else for characters
    - [Costa-Jussà & Fonollosa, ACL’16], [Luong & Manning, ACL’16].

## BPE

- A word segmentation algorithm:
    - Though done as bottom up clustering
    - Start with a unigram vocabulary of all (Unicode) characters in data
    - Most frequent ngram pairs ↦ a new ngram
    - Basically what the SentencePieceTokenizer in the fastai2 library does (it comes from Google's SentencePiece library)

- **Automatically decides vocab for system**
- No longer strongly “word” based in conventional way

![](images/subword_2.png)

- Start with all characters in vocab
- Add a pair (e, s) with freq 9
- Add a pair (es, t) with freq 9
- Add a pair (l, o) with freq 7
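
The merge steps above can be sketched in a few lines. This is only a minimal sketch: it assumes the toy corpus behind the figure is the usual `low` / `lower` / `newest` / `widest` example, and the function names and the "first pair seen wins a tie" rule are my own choices, not something taken from the slide.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    # Start from single characters (assumption: no end-of-word marker).
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Ties are broken by whichever pair was counted first.
        best = max(pairs, key=pairs.get)
        merges.append((best, pairs[best]))
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Assumed toy corpus: word -> frequency.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
# [(('e', 's'), 9), (('es', 't'), 9), (('l', 'o'), 7)]
```

With that corpus the first three merges match the ones listed above. The number of merges is the knob that sets the final vocab size.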

## SentencePiece model

- The SentencePiece model works from raw text
- Whitespace is retained as a special token (▁) and grouped normally
- You can reverse the tokenization at the end by joining the pieces and converting the ▁ markers back to spaces
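
A minimal sketch of that round trip, using plain string operations rather than the SentencePiece library itself. The helper names and the example pieces are hypothetical; only the ▁ convention comes from the notes above.

```python
WS = "\u2581"  # the '▁' whitespace marker

def mark_whitespace(text):
    """Turn raw text into the marked form that gets split into pieces."""
    return WS + text.replace(" ", WS)

def detokenize(pieces):
    """Join the pieces, then turn the markers back into spaces."""
    return "".join(pieces).replace(WS, " ").lstrip(" ")

# Hypothetical segmentation of "Hello world" (not real model output).
pieces = [WS + "Hello", WS + "wor", "ld"]

print(mark_whitespace("Hello world"))  # ▁Hello▁world
print(detokenize(pieces))              # Hello world
```

Because the space is part of the pieces, detokenization needs no language-specific rules about where spaces go.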

Text: "discover discovering discovered disco disc disk discord disconnect disconnected disconneting"

```python
subword(1000)

['▁dis', 'co', 'ver', '▁dis', 'co', 'ver', 'ing', '▁dis', 'co', 'ver', 'ed', '▁dis', 'co', '▁dis', 'c', '▁dis', 'k', '▁dis', 'c', 'or', 'd', '▁dis', 'c', 'on', 'n', 'ect', '▁dis', 'c', 'on', 'n', 'ect', 'ed', '▁dis', 'c', 'on', 'ne', 'ting']

# (note: these are not the vocab entries; this is the original text after subword tokenization)


subword(10000)

['▁discover', '▁discover', 'ing', '▁discovered', '▁disco', '▁disc', '▁dis', 'k', '▁disco', 'rd', '▁disco', 'n', 'n', 'ect', '▁disco', 'n', 'n', 'ect', 'ed', '▁disco', 'nne', 'ting']

```
"Picking a subword vocab size represents a compromise: a larger vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember; but on the downside, it means larger embedding matrices, which require more data to learn."


## Wordpiece

- Wordpiece model tokenizes inside words
- BERT uses a variant of the wordpiece model
    - (Relatively) common words are in the vocabulary: at, fairfax, 1910s
- Other words are built from wordpieces:
    - hypatia = h ##yp ##ati ##a
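
The `hypatia` split above can be reproduced with greedy longest-match-first lookup, which is how BERT-style wordpiece tokenizers segment a word at inference time. This is a sketch only: the tiny vocab is hypothetical and contains just enough entries to cover the examples in these notes.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocab, just large enough for the examples.
vocab = {"at", "fairfax", "1910s", "h", "##yp", "##ati", "##a"}

print(wordpiece_tokenize("fairfax", vocab))  # ['fairfax']
print(wordpiece_tokenize("hypatia", vocab))  # ['h', '##yp', '##ati', '##a']
```

Common words come out as a single piece, while rarer words fall back to several `##` continuation pieces.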

# Using a character-level model to build word-level representations

## Using CNN + max-pooling

![](images/novel_3.png)
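
A rough pure-Python sketch of the idea in the figure: slide filters of a few widths over the character embeddings of a word, then keep the maximum response of each filter. The embedding size, filter widths, random weights, `tanh` nonlinearity and zero-padding of short words are all assumptions made for illustration.

```python
import math
import random

def char_cnn_word_vector(word, char_emb, filters):
    """One feature per filter: max over all character windows."""
    dim = len(next(iter(char_emb.values())))
    chars = [char_emb[c] for c in word]
    features = []
    for filt in filters:  # filt holds `width` weight rows of size `dim`
        width = len(filt)
        # Zero-pad short words so there is at least one window.
        padded = chars + [[0.0] * dim] * max(0, width - len(chars))
        activations = []
        for i in range(len(padded) - width + 1):
            window = padded[i:i + width]
            total = sum(w * x
                        for f_row, c_row in zip(filt, window)
                        for w, x in zip(f_row, c_row))
            activations.append(math.tanh(total))
        features.append(max(activations))  # max-over-time pooling
    return features

rng = random.Random(0)
dim = 4  # hypothetical character embedding size
char_emb = {c: [rng.uniform(-1, 1) for _ in range(dim)]
            for c in "abcdefghijklmnopqrstuvwxyz"}
# Three filters each of width 2 and width 3 (hypothetical sizes).
filters = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
           for width in (2, 3) for _ in range(3)]

vec = char_cnn_word_vector("while", char_emb, filters)
print(len(vec))  # 6, one feature per filter
```

The output size depends only on the number of filters, not on the word length, so any word (including an unseen one) gets a fixed-size vector.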

## Using LSTM

![](images/subword_3.png)

- Run a **character-level** biLSTM over each word and concatenate the 2 final hidden states (this concatenation is used as the word representation)
- Use that representation as input to another LSTM LM that works along the **sequence of words** (TODO: need details on this)
- Train the whole thing **end-to-end** so that the **character embeddings are updated** to **produce good word vectors** (TODO: how?)

## Character-Aware Neural Language Models

https://youtu.be/9oTHFx0Gg3Q?t=3280

http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture12-subwords.pdf

![](images/subword_4.png)

# Test word similarity between char-level word embedding and word-level word embedding

![](images/subword_5.png)

For LSTM-Char (before highway): because the model starts with a conv layer + max-pool, the word embeddings coming out of the char-level model still retain information about the characters
- All words with 'le' or 'ile' (conv filter size 2 or 3) are similar to 'while'
- Words starting with 'rich' are similar to 'richard', but they are not necessarily names

But for LSTM-Char (after highway), things are a bit better
- Even for 'richard', the similar words coming out of this model are still names, unlike the before-highway model.

For out-of-vocabulary similarity (note that LSTM-Word can't handle OOV words, hence the blank first row)

![](images/subword_6.png)

# Combining word-level and character-level model

- A best-of-both-worlds architecture:
    - Translate mostly at the word level
    - Only go to the character level when needed (unknown words - words not in vocabulary)

https://youtu.be/9oTHFx0Gg3Q?t=3753

Note that this is a language model

![](images/subword_7.png)

Note that, e.g., at an UNK output there are 2 losses: one at the word level and one from the small char-level LSTM at the top of the figure

At UNK, we feed the hidden representation of the word-level model in as the starting hidden representation of the char-level model

## Cons

![](images/subword_8.png)

- Because it only gets the current hidden representation, the char-level model has little information from further back => it is not good at capturing the context needed to translate things such as names accurately

# FastText embeddings

- Aim: a next generation efficient word2vec-like word representation library, but **better for rare words and languages with lots of morphology**
- An extension of the w2v skip-gram model with character n-grams

https://youtu.be/9oTHFx0Gg3Q?t=4330

![](images/subword_9.png)

- A word is broken down into character n-grams; for example, the word 'where' is represented by 6 tokens. Note that "<" and ">" are boundary symbols
- Using the same w2v mechanism, calculate *similarity* as the **dot product between the context (neighbor) vector and the center-word vector, but now the center-word vector is built from the vectors of all 6 of those tokens**, which are summed together (there's a hashing trick for storing the n-gram vectors)
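
A small sketch of the two bullets above: extract the boundary-marked n-grams, look each one up in a hashed bucket table, sum the rows, and take the dot product with a context vector. The bucket count, vector size, random weights and the FNV-1a hash are stand-ins chosen for illustration. For simplicity the whole-word token is hashed like any other n-gram here, which is an assumption of this sketch rather than a description of the real library.

```python
import random

def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of '<word>' plus the whole marked word."""
    wrapped = "<" + word + ">"
    grams = [wrapped[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)]
    if wrapped not in grams:
        grams.append(wrapped)
    return grams

def fnv1a(text):
    """32-bit FNV-1a hash, deterministic across runs."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def word_vector(word, table):
    """Sum the bucket vectors of all n-grams of the word."""
    vec = [0.0] * len(table[0])
    for gram in char_ngrams(word):
        row = table[fnv1a(gram) % len(table)]  # hashing trick
        vec = [v + r for v, r in zip(vec, row)]
    return vec

def score(word, context_vec, table):
    """Skip-gram style similarity: dot(center word, context)."""
    return sum(a * b for a, b in zip(word_vector(word, table), context_vec))

rng = random.Random(0)
num_buckets, dim = 1000, 4  # hypothetical sizes
table = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
         for _ in range(num_buckets)]
context_vec = [rng.uniform(-0.5, 0.5) for _ in range(dim)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
print(score("where", context_vec, table))
```

Since an unseen word still shares n-grams with known words, it gets a usable vector too, which is where the benefit for rare words and rich morphology comes from.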