## Summary

This is a question about how to deal with out-of-vocabulary (OOV) tokens or unseen n-grams. Specifically, suppose I am building a 4-gram language model. I need to compute something like P(token | context_of_three_tokens), but I'm not sure how to do this when the context never appears in the training data, that is, when $\sum_v C(w_i w_j w_k v)$ equals 0 and the maximum-likelihood estimate has a zero denominator.
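
To make the problem concrete, here is a minimal sketch of the maximum-likelihood estimate on a toy sentence. The function name `mle_prob` and the toy data are illustrative assumptions, not part of any library or of the code used later in these notes.

```python
from collections import Counter

def mle_prob(token, context, ngram_counts):
    """MLE of P(token | context); returns None when the context is unseen."""
    context = tuple(context)
    # Denominator: sum over v of C(context + v).
    context_total = sum(
        count for ngram, count in ngram_counts.items() if ngram[:-1] == context
    )
    if context_total == 0:
        return None  # the estimate is undefined: 0 / 0
    return ngram_counts[context + (token,)] / context_total

# Toy training data (hypothetical), counted as 4-grams.
tokens = "<s> the cat sat on the mat </s>".split()
n = 4
ngram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(mle_prob("sat", ("<s>", "the", "cat"), ngram_counts))  # context was seen
print(mle_prob("sat", ("a", "dog", "never"), ngram_counts))  # context was never seen
```

The second call is the case this question is about: without some extra treatment, the model has nothing to say about an unseen context.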

When I read Section 3.4 of SLP, the suggested ways to handle unknown words go like this.

> Convert in the training set any word that is not in this set (any OOV word) to the unknown word token \<UNK> in a text normalization step.

> The second alternative, in situations where we don’t have a prior vocabulary in advance, is to create such a vocabulary implicitly, replacing words in the training data by \<UNK> based on their frequency.
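
The frequency-based replacement in the second quote could look roughly like the sketch below. The function name `replace_rare_words` and the threshold `min_count=2` are assumptions chosen for illustration; the text quoted above does not prescribe a specific cutoff.

```python
from collections import Counter

def replace_rare_words(sentences, min_count=2, unk="<UNK>"):
    """Replace every word seen fewer than `min_count` times with `unk`."""
    word_counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if word_counts[word] >= min_count else unk for word in sentence]
        for sentence in sentences
    ]

# Toy corpus (hypothetical): "cat" and "dog" each occur only once.
toy = [["the", "cat", "sat"], ["the", "dog", "sat"]]
normalized = replace_rare_words(toy)
print(normalized)
```

After this step the vocabulary is closed: any word outside it, in training or test data, maps to the same `<UNK>` token.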

I understand how this approach works with a unigram language model. However, as the n of the n-gram grows, the number of rare n-grams grows too. So it seems difficult to do something like: "When there is an unseen sequence in the test set, replace it with \<UNK SEQ> and estimate P(\<UNK SEQ>) from the count of n-grams of the same length that appear only once in the training data."
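
A minimal sketch of that idea, under the assumption that every n-gram type seen exactly once is folded into a single bucket. The function name `count_ngrams_with_unk_bucket` and the bucket key are hypothetical and are not taken from the `traditional_lm` module imported later in these notes.

```python
from collections import Counter

def count_ngrams_with_unk_bucket(sentences, n=3, unk_ngram="<UNK NGRAM>"):
    """Count n-grams, folding all types seen exactly once into one bucket."""
    raw = Counter(
        " ".join(sentence[i:i + n])
        for sentence in sentences
        for i in range(len(sentence) - n + 1)
    )
    # Keep n-grams seen more than once as they are.
    freq = {ngram: count for ngram, count in raw.items() if count > 1}
    # The bucket holds the number of n-gram types whose frequency is 1.
    freq[unk_ngram] = sum(1 for count in raw.values() if count == 1)
    return freq

# Toy corpus (hypothetical).
toy = [["<s>", "a", "b", "</s>"], ["<s>", "a", "b", "c", "</s>"]]
freq = count_ngrams_with_unk_bucket(toy)
print(freq)
```

Even on this tiny example the bucket outweighs the one trigram that repeats, which is the concern raised above: the longer the n-gram, the more of the probability mass ends up in the unknown bucket.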

## Environment

In [1]:
!python --version

Python 3.9.9


In [2]:
!pip list | grep nltk

nltk                 3.6.7


## Answer by Vu-san

This (an UNK-sequence ratio that is too high) happens because the dataset is too small. In practice, we don't run into this phenomenon because the datasets are much larger. Usually, the ratio of UNK sequences is around 5%.
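
One way to check this on a given dataset is to measure how many n-grams in held-out text never occur in the training text. The sketch below is only one possible reading of "ratio of UNK sequences"; the helper names and the toy data are assumptions.

```python
def ngrams(sentence, n):
    """Return the list of n-grams (as tuples) in one tokenized sentence."""
    return [tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]

def unseen_ngram_ratio(train, heldout, n=3):
    """Share of held-out n-gram occurrences that never appear in training."""
    seen = {gram for sentence in train for gram in ngrams(sentence, n)}
    heldout_grams = [gram for sentence in heldout for gram in ngrams(sentence, n)]
    if not heldout_grams:
        return 0.0
    unseen = sum(gram not in seen for gram in heldout_grams)
    return unseen / len(heldout_grams)

# Toy data (hypothetical): one of the two held-out trigrams is unseen.
train_toy = [["<s>", "a", "b", "</s>"]]
heldout_toy = [["<s>", "a", "b", "c"]]
ratio = unseen_ngram_ratio(train_toy, heldout_toy)
print(ratio)
```

Tracking this number while growing the training set would show whether the ratio falls as the answer suggests.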

## Code

In [2]:
from traditional_lm import Lm

In [4]:
with open('data/wiki-en-train.word') as file:
    train = [['<s>'] + line.lower().split() + ['</s>'] for line in file]

In [5]:
trigram = Lm(n=3)
trigram.fit(train)

In [6]:
trigram.ngram_freq

{'<UNK> . </s>': 147,
 '-rrb- . </s>': 101,
 '<UNK> , <UNK>': 86,
 '<UNK> <UNK> ,': 53,
 ', <UNK> ,': 49,
 ', <UNK> <UNK>': 49,
 'for example ,': 47,
 '<s> for example': 35,
 '-lrb- <UNK> -rrb-': 34,
 ', such as': 33,
 '<s> however ,': 32,
 'the <UNK> of': 32,
 'natural language processing': 28,
 'text . </s>': 26,
 '<UNK> and <UNK>': 26,
 ', and <UNK>': 23,
 '<UNK> of the': 23,
 'a number of': 22,
 '<UNK> , and': 21,
 '<s> it is': 19,
 '-lrb- e.g. ,': 19,
 'the <UNK> <UNK>': 18,
 '<UNK> <UNK> <UNK>': 18,
 'in the <UNK>': 18,
 'of natural language': 17,
 'data . </s>': 17,
 '<UNK> -rrb- .': 17,
 '<UNK> -rrb- ,': 16,
 'words . </s>': 16,
 ', the <UNK>': 15,
 '<UNK> in the': 15,
 'the use of': 15,
 '<s> the <UNK>': 14,
 '<UNK> -lrb- <UNK>': 14,
 'a set of': 14,
 'part of speech': 14,
 '<s> this is': 14,
 'of speech recognition': 14,
 'to <UNK> the': 14,
 'such as the': 14,
 ', it is': 13,
 '<UNK> to the': 13,
 "`` <UNK> ''": 13,
 "'' . </s>": 13,
 '<UNK> of <UNK>': 13,
 'as well as': 13,

In [8]:
trigram.ngram_freq['<UNK NGRAM>']
# This is the number of trigram types whose frequency is 1.
# Comparing this with the total trigram count below shows how large it is.

25904

In [11]:
sum(trigram.ngram_freq.values())

34541
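
For reference, the share of the `<UNK NGRAM>` bucket can be worked out directly from the two numbers above. This assumes that the total of 34541 includes the bucket itself, which is what summing all values of `ngram_freq` would give.

```python
# Values copied from the two cell outputs above.
unk_ngram_count = 25904
total_count = 34541

unk_share = unk_ngram_count / total_count
print(f"{unk_share:.1%}")  # roughly three quarters of the total
```

That is far above the roughly 5% mentioned in the answer, which fits the explanation that this training set is small.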