## Summary

I don't understand how Kneser-Ney smoothing works. As a first step, I am trying to understand and reproduce how the [NLTK library](https://www.nltk.org/api/nltk.lm.html) implements it, and I raise several questions below.

## Environment

In [1]:
!python --version

Python 3.9.9


In [1]:
!pip list | grep nltk

nltk                 3.6.7


## Code

In [5]:
from nltk.lm.preprocessing import padded_everygram_pipeline
text = [['You', 'say', 'goodbye'], ['and', 'I', 'say', 'hello']]
train, vocab = padded_everygram_pipeline(2, text)  # specify the highest ngram order as 2

In [10]:
from nltk.lm import MLE, KneserNeyInterpolated, AbsoluteDiscountingInterpolated
lm = MLE(2)  # specify the highest ngram order as 2
lm.fit(train, vocab)

In [15]:
list(lm.vocab)

['<s>', 'You', 'say', 'goodbye', '</s>', 'and', 'I', 'hello', '<UNK>']

### bigram LM

In [23]:
test = [('I', 'say'), ('say', 'goodbye')]
lm.entropy(test)

0.5

I understand that this 0.5 is calculated as
$$ -\frac{1}{2}\left(\log_2 P(\text{say} \mid \text{I}) + \log_2 P(\text{goodbye} \mid \text{say})\right)
= -\frac{1}{2}\left(\log_2 1 + \log_2 \frac{1}{2}\right)
= \frac{1}{2} $$
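
The same value can be reproduced without NLTK. The following is a minimal sketch that counts bigrams by hand, assuming one `<s>` and one `</s>` pad symbol per sentence for the bigram case; the helper names `mle_prob` and `entropy` are made up for this illustration and are not NLTK functions.

```python
import math
from collections import Counter

text = [['You', 'say', 'goodbye'], ['and', 'I', 'say', 'hello']]

# Pad each sentence with one start and one end symbol (bigram case).
padded = [['<s>'] + sent + ['</s>'] for sent in text]

bigram_counts = Counter()   # count(prev, word)
context_counts = Counter()  # count(prev, *)
for sent in padded:
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def mle_prob(word, prev):
    """Maximum-likelihood estimate: count(prev, word) / count(prev, *)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

def entropy(bigrams):
    """Average negative log2 probability over the given (prev, word) pairs."""
    return -sum(math.log2(mle_prob(w, p)) for p, w in bigrams) / len(bigrams)

test = [('I', 'say'), ('say', 'goodbye')]
mle_entropy = entropy(test)
print(mle_entropy)  # 0.5
```

Here "say" is the only word that follows "I" (probability 1), and "goodbye" is one of the two words that follow "say" (probability 1/2).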

In [20]:
train, vocab = padded_everygram_pipeline(2, text)
lmkn = KneserNeyInterpolated(2)
lmkn.fit(train, vocab)

In [22]:
lmkn.entropy(test)

0.6168136649827498

I confirmed that this agrees with the formula below, which I derived by hand.

In [49]:
import numpy as np
-1/2 * (np.log2(0.9 + 0.2/9) + np.log2(0.45 + 0.1/9))

0.6168136649827498
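
For reference, the hand calculation can be written out as a small pure-Python sketch. It does not call NLTK. It assumes a discount of 0.1 (read off the 0.9 and 0.45 terms) and a lower-order distribution built from continuation counts, which is my reading of the formula rather than a confirmed description of NLTK's internals. The names `followers`, `continuation_prob`, and `kn_prob` are made up for this illustration.

```python
import math
from collections import Counter, defaultdict

text = [['You', 'say', 'goodbye'], ['and', 'I', 'say', 'hello']]
DISCOUNT = 0.1  # assumption: inferred from the 0.9 and 0.45 terms above

# Pad each sentence with one start and one end symbol (bigram case).
padded = [['<s>'] + sent + ['</s>'] for sent in text]

# followers[prev][word] = count of the bigram (prev, word)
followers = defaultdict(Counter)
for sent in padded:
    for prev, word in zip(sent, sent[1:]):
        followers[prev][word] += 1

# Number of distinct bigram types in the training data.
bigram_types = sum(len(counts) for counts in followers.values())

def continuation_prob(word):
    """Share of distinct bigram types that end in `word`."""
    preceding = sum(1 for counts in followers.values() if counts[word] > 0)
    return preceding / bigram_types

def kn_prob(word, prev):
    """Interpolated Kneser-Ney bigram probability (assumes `prev` was seen)."""
    counts = followers[prev]
    total = sum(counts.values())
    alpha = max(counts[word] - DISCOUNT, 0.0) / total  # discounted bigram term
    gamma = DISCOUNT * len(counts) / total             # mass given to the lower order
    return alpha + gamma * continuation_prob(word)

def entropy(bigrams):
    """Average negative log2 probability over the given (prev, word) pairs."""
    return -sum(math.log2(kn_prob(w, p)) for p, w in bigrams) / len(bigrams)

test = [('I', 'say'), ('say', 'goodbye')]
kn_entropy = entropy(test)
print(kn_entropy)  # approximately 0.6168
```

Under these assumptions the two terms are 0.9 + 0.1 · 2/9 and 0.45 + 0.1 · 1/9: "say" ends 2 of the 9 distinct bigram types and "goodbye" ends 1.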

In [26]:
test_2 = [('I', 'say'), ('say', 'I')]
lmkn.entropy(test_2)  # The entropy increases because the bigram ("say", "I") never occurs in the training data

3.3043333806562125

In [54]:
test_3 = [('I', 'say'), ('say', 'hi')]
lmkn.entropy(test_3)

inf
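
My tentative explanation for the inf, assuming that NLTK maps the out-of-vocabulary word "hi" to `<UNK>` before scoring and that `<UNK>` never occurs in the training bigrams: both the discounted bigram term and the lower-order term are then zero, so the probability is zero and its negative log is infinite. With the same discount of 0.1 and 9 distinct bigram types as in my hand calculation:

$$ P(\mathrm{UNK} \mid \text{say})
= \frac{\max(0 - 0.1,\ 0)}{2} + 0.1 \cdot \frac{0}{9}
= 0,
\qquad -\log_2 0 = \infty $$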

Question: How can I make this model cope with the case where the entropy becomes inf, that is, how should a token that was not seen in training be handled?

### trigram LM (for additional examples)

In [51]:
train, vocab = padded_everygram_pipeline(3, text)
lmkn_tri = KneserNeyInterpolated(3)
lmkn_tri.fit(train, vocab)

In [52]:
test_tri = [('and', 'I', 'say'), ('I', 'say', 'goodbye')]
lmkn_tri.entropy(test_tri)

2.228464373874321
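
This value can be reproduced by hand. The following is a sketch under my own assumptions: a discount of 0.1, raw counts at the trigram level, continuation counts (numbers of distinct left contexts) at the lower levels, and two pad symbols on each side of a sentence, which gives 11 distinct bigram types. $P_{c}$ denotes a lower-order continuation probability.

$$
\begin{aligned}
P_{c}(\text{say} \mid \text{I}) &= 0.9 + 0.1 \cdot \frac{2}{11} \\
P(\text{say} \mid \text{and}, \text{I}) &= 0.9 + 0.1 \cdot P_{c}(\text{say} \mid \text{I}) \\
P_{c}(\text{goodbye} \mid \text{say}) &= 0.45 + 0.1 \cdot \frac{1}{11} \\
P(\text{goodbye} \mid \text{I}, \text{say}) &= 0 + 0.1 \cdot P_{c}(\text{goodbye} \mid \text{say}) \\
-\frac{1}{2}\left(\log_2 P(\text{say} \mid \text{and}, \text{I}) + \log_2 P(\text{goodbye} \mid \text{I}, \text{say})\right) &\approx 2.2285
\end{aligned}
$$

The trigram ("I", "say", "goodbye") never occurs in the training data, so its discounted term is 0 and all of its probability comes from the lower-order estimate.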

In [53]:
test_tri_2 = [('and', 'I', 'say'), ('I', 'say', 'hi')]
lmkn_tri.entropy(test_tri_2)

inf