Below is some code that uses the [nltk.lm](https://www.nltk.org/api/nltk.lm.html) module to understand how Kneser-Ney smoothing works.

## Environment

In [1]:
!python --version

Python 3.9.9


In [2]:
!pip list | grep nltk

nltk                 3.6.7


## Code

In [3]:
from nltk.lm.preprocessing import padded_everygram_pipeline
text = [['You', 'say', 'goodbye'], ['and', 'I', 'say', 'hello']]
train, vocab = padded_everygram_pipeline(2, text)  # specify the highest ngram order as 2

In [4]:
from nltk.lm import MLE, KneserNeyInterpolated, AbsoluteDiscountingInterpolated
lm = MLE(2)  # specify the highest ngram order as 2
lm.fit(train, vocab)

In [5]:
list(lm.vocab)

['<s>', 'You', 'say', 'goodbye', '</s>', 'and', 'I', 'hello', '<UNK>']

### bigram LM

In [6]:
test = [('I', 'say'), ('say', 'goodbye')]
lm.entropy(test)

0.5

As I understand it, this 0.5 is computed as
$$ -\frac{1}{2}\left(\log_2 P(say \mid I) + \log_2 P(goodbye \mid say)\right)
= -\frac{1}{2}\left(\log_2 1 + \log_2 \frac{1}{2}\right)
= \frac{1}{2} $$
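
As a cross-check that does not depend on NLTK, the same value can be rebuilt from raw bigram counts. This is only a minimal sketch: it assumes each sentence is padded with one `<s>` and one `</s>`, as the order-2 pipeline does, and `mle_prob` is a hypothetical helper, not part of `nltk.lm`.

```python
from collections import Counter
from math import log2

text = [['You', 'say', 'goodbye'], ['and', 'I', 'say', 'hello']]

# Count bigrams and their left contexts over the padded sentences.
bigrams, contexts = Counter(), Counter()
for sent in text:
    padded = ['<s>'] + sent + ['</s>']
    for prev, word in zip(padded, padded[1:]):
        bigrams[(prev, word)] += 1
        contexts[prev] += 1

def mle_prob(word, prev):
    """Maximum-likelihood estimate C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / contexts[prev]

test = [('I', 'say'), ('say', 'goodbye')]
# Entropy here is the average negative log2 probability of the test bigrams.
h_mle = -sum(log2(mle_prob(w, p)) for p, w in test) / len(test)
print(h_mle)  # 0.5
```

Here P(say|I) = 1/1 and P(goodbye|say) = 1/2, which are the two terms in the formula above.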

In [7]:
train, vocab = padded_everygram_pipeline(2, text)
lmkn = KneserNeyInterpolated(2)
lmkn.fit(train, vocab)

In [8]:
lmkn.entropy(test)

0.6168136649827498

I confirmed that this agrees with the formula below, which I derived by hand.

In [9]:
import numpy as np
-1/2 * (np.log2(0.9 + 0.2/9) + np.log2(0.45 + 0.1/9))

0.6168136649827498
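
The constants in the hand calculation can also be derived from counts. The sketch below is my own reconstruction, not NLTK's code: it assumes a discount of 0.1, and it assumes that the lower-order term is the continuation probability, i.e. the number of distinct words preceding `word` divided by the number of distinct bigram types (9 for this corpus). `kn_bigram_prob` and `kn_entropy` are hypothetical helper names.

```python
from collections import Counter
from math import log2

text = [['You', 'say', 'goodbye'], ['and', 'I', 'say', 'hello']]
DISCOUNT = 0.1  # assumption: matches the 0.9 and 0.45 terms in the hand calculation

# Count bigrams over sentences padded with one <s> and one </s>.
bigrams = Counter()
for sent in text:
    padded = ['<s>'] + sent + ['</s>']
    bigrams.update(zip(padded, padded[1:]))

def kn_bigram_prob(word, prev):
    """Interpolated Kneser-Ney estimate of P(word | prev) for a bigram model."""
    followers = {w: c for (p, w), c in bigrams.items() if p == prev}
    # Continuation probability: distinct left contexts of `word` over all bigram types.
    continuation = sum(1 for (_, w) in bigrams if w == word) / len(bigrams)
    context_total = sum(followers.values())
    if context_total == 0:
        # Unseen context: fall back entirely to the lower-order estimate.
        return continuation
    alpha = max(followers.get(word, 0) - DISCOUNT, 0.0) / context_total
    gamma = DISCOUNT * len(followers) / context_total
    return alpha + gamma * continuation

def kn_entropy(test_bigrams):
    """Average negative log2 probability of the given (prev, word) pairs."""
    return -sum(log2(kn_bigram_prob(w, p)) for p, w in test_bigrams) / len(test_bigrams)

h_kn = kn_entropy([('I', 'say'), ('say', 'goodbye')])
print(h_kn)  # about 0.6168
```

Under these assumptions, P(say|I) = 0.9 + 0.1 * 2/9 and P(goodbye|say) = 0.45 + 0.1 * 1/9, which are the two terms inside the logarithms above.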

In [10]:
test_2 = [('I', 'say'), ('say', 'I')]  # Changing the second tuple to ('say', 'You') gives the same entropy
lmkn.entropy(test_2)  # The entropy increases because the unseen bigram "say I" is scored

3.3043333806562125
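
This value can be traced by hand in the same way as before. The sketch assumes a discount of 0.1 and that the denominator 9 is the number of distinct bigram types in the padded training data. Because C('say I') = 0, only the back-off term survives for the second bigram, and 'I' follows exactly one distinct word ('and').

```python
from math import log2

# P(say | I): discounted count 0.9 plus back-off weight 0.1 times continuation 2/9.
p_say_given_i = 0.9 + 0.1 * (2 / 9)
# P(I | say): C('say I') = 0, so only 0.1 * continuation(I) = 0.1 * 1/9 remains.
p_i_given_say = 0.0 + 0.1 * (1 / 9)

h_unseen = -0.5 * (log2(p_say_given_i) + log2(p_i_given_say))
print(h_unseen)  # about 3.3043
```

'You' also follows exactly one distinct word ('<s>'), which is why swapping it in for 'I' leaves the entropy unchanged.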

In [11]:
test_3 = [('I', 'say'), ('say', '<UNK>')]
lmkn.entropy(test_3)

inf

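This is expected behavior: '\<UNK>' is listed in the vocabulary as the unknown-word label, but it never occurs in the training data. Mathematically, C('say \<UNK>') = 0 and |{v: C('v \<UNK>') > 0}| = 0, so the model assigns this bigram probability 0 and its log probability is $-\infty$. In practice, however, the training data usually contains some \<UNK> tokens, so this would be avoided.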
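
Written out, with the discount $d = 0.1$ and the back-off weight $\lambda(say) = 0.1$ taken as assumptions that match the hand calculation earlier, both terms of the interpolated estimate vanish:

$$ P_{KN}(\langle UNK \rangle \mid say)
= \frac{\max(C(say\ \langle UNK \rangle) - d,\ 0)}{C(say)}
+ \lambda(say)\,\frac{|\{v : C(v\ \langle UNK \rangle) > 0\}|}{|\{(v, w) : C(v\ w) > 0\}|}
= \frac{0}{2} + 0.1 \cdot \frac{0}{9}
= 0 $$

Since $-\log_2 0 = \infty$, a single such bigram makes the average entropy infinite.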

### trigram LM

In [16]:
train, vocab = padded_everygram_pipeline(3, text)
lmkn_tri = KneserNeyInterpolated(3)
lmkn_tri.fit(train, vocab)

In [17]:
test_tri = [('and', 'I', 'say'), ('I', 'say', 'goodbye')]
lmkn_tri.entropy(test_tri)

2.228464373874321

In [18]:
test_tri_2 = [('and', 'I', 'say'), ('I', 'say', 'You')]
lmkn_tri.entropy(test_tri_2)

5.057570115250218
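
Both trigram values can be reproduced by hand if the interpolation is applied recursively: trigram level, then bigram level, then the unigram continuation probability. The sketch below is my own reconstruction and rests on several assumptions: a discount of 0.1 at every level, continuation counts (distinct left contexts) at the bigram and unigram levels, and 11 distinct bigram types once each sentence is padded with two `<s>` and two `</s>`.

```python
from math import log2

# P(say | and, I): C('and I say') = 1 of 1, then 'say' is the only word seen
# after '* I', then 'say' has 2 distinct left contexts out of 11 bigram types.
p_say = 0.9 + 0.1 * (0.9 + 0.1 * (2 / 11))

# P(goodbye | I, say): C('I say goodbye') = 0; 'goodbye' follows 1 of the 2
# distinct '* say' contexts; 'goodbye' has 1 distinct left context.
p_goodbye = 0.0 + 0.1 * (0.45 + 0.1 * (1 / 11))

# P(You | I, say): unseen at the trigram and bigram levels, so only the
# unigram continuation term 1/11 survives, scaled by both back-off weights.
p_you = 0.0 + 0.1 * (0.0 + 0.1 * (1 / 11))

h_tri = -0.5 * (log2(p_say) + log2(p_goodbye))
h_tri_2 = -0.5 * (log2(p_say) + log2(p_you))
print(h_tri, h_tri_2)  # about 2.2285 and 5.0576
```

As in the bigram case, the entropy rises sharply for `test_tri_2` because "I say You" is unseen at both the trigram and the bigram level.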