# Surprisal and entropy in probabilistic language modeling

### **Surprisal** (a.k.a. *Shannon information*, *information content*)

* Plain English: the amount of information gained when an event with a given probability occurs; the less probable the event, the larger its surprisal
* Mathematically: for some token $ x_i $ in a sequence $ X = \langle x_1, x_2, \dots \rangle $ and its associated probability $ p(x_i) $ (in a language model, the probability of $ x_i $ given the preceding tokens), the surprisal of $ x_i $ is given by $$ h(x_i) = -\log_{2}{p(x_i)} \text{ bits} $$
* $ p(x_i) = 1 \Rightarrow h(x_i) = 0 \text{ bits} $
* $ p(x_i) = 0 \Rightarrow h(x_i) = \infty \text{ bits} $

In [18]:
import numpy as np


model_vocab = {'<BOS>', '<EOS>', 'oat', 'oats', 'pecans', 'milk', 'is', 'are', 'derived', 'from', 'to', 'under', 'plants'}

X = '<BOS> oat milk is derived from oats <EOS>'
probs_X = {
	'oat': 0.025, # oat|<BOS>
	'milk': 0.55, # milk|<BOS> oat
	'is': 0.74, # is|<BOS> oat milk
	'derived': 0.18, # derived|<BOS> oat milk is
	'from': 0.84, # from|<BOS> oat milk is derived
	'oats': 0.96, # oats|<BOS> oat milk is derived from
	'<EOS>': 0.14 # <EOS>|<BOS> oat milk is derived from oats
}
p_oat = probs_X['oat'] # Probability of 'oat' given the <BOS> token
surprisal_oat = -1 * np.log2(p_oat)
print(f'The surprisal or information content of token \'oat\' (index 1) is { surprisal_oat } bits\n')

p_oats = probs_X['oats'] # Probability of 'oats' given the preceding tokens
surprisal_oats = -1 * np.log2(p_oats)
print(f'The surprisal or information content of token \'oats\' (index 6) is { surprisal_oats } bits')

The surprisal or information content of token 'oat' (index 1) is 5.321928094887363 bits

The surprisal or information content of token 'oats' (index 6) is 0.058893689053568565 bits
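
The same calculation can be applied to every token in the sequence. The following is a minimal sketch, not part of the original notebook: it copies the conditional probabilities from the example, uses the standard-library `math` module instead of NumPy, and introduces the helper name `surprisal` purely for illustration. It also checks that, under the chain rule, the per-token surprisals add up to the surprisal of the whole sequence.

```python
import math

# Conditional probabilities copied from the example, in sequence order.
token_probs = [
    ('oat', 0.025),
    ('milk', 0.55),
    ('is', 0.74),
    ('derived', 0.18),
    ('from', 0.84),
    ('oats', 0.96),
    ('<EOS>', 0.14),
]


def surprisal(p):
    """Surprisal in bits of an outcome with probability p (assumes 0 < p <= 1)."""
    return -math.log2(p)


# Illustrative helper usage: one surprisal value per token.
surprisals = {token: surprisal(p) for token, p in token_probs}
for token, h in surprisals.items():
    print(f'{token:>8}: {h:.3f} bits')

# By the chain rule, the probability of the sequence is the product of the
# conditional probabilities, so its surprisal is the sum of the per-token values.
total_surprisal = sum(surprisals.values())
sequence_prob = math.prod(p for _, p in token_probs)
print(f'   total: {total_surprisal:.3f} bits')
```

Low-probability tokens such as 'oat' and '<EOS>' dominate the total, while highly predictable tokens such as 'oats' contribute very little.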


### **Shannon entropy**
* Plain English: the minimum average number of bits required to represent or transmit an outcome of a random variable without losing any information
* Mathematically: the entropy of a random variable $ X $ distributed according to $ p : \mathcal{X} \rightarrow [0, 1] $ and with possible outcomes $ x_1, x_2, \dots \in \mathcal{X} $ is given by $$ H(X) = -\sum\limits_{x \in \mathcal{X}} {p(x) \log_2{p(x)}} \text{ bits} $$
* The surprisal of each outcome is weighted by its probability
* Thus, one can think of Shannon entropy as the **average** (expected) information content of an outcome
* Note: in the event that $ p(x) = 0 $, the summand $ p(x) \log_2{p(x)} = 0 \log_2{0} $ is taken to be $ 0 $
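
To make the definition concrete, here is a small sketch that computes the entropy of a single distribution whose probabilities sum to 1, including the $ 0 \log_2{0} = 0 $ convention. The function name `entropy_bits` and both distributions are hypothetical: the numbers are made up for illustration and do not come from any model.

```python
import math


def entropy_bits(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    total = sum(dist.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f'probabilities sum to {total}, expected 1')
    # Skip zero-probability outcomes: 0 * log2(0) is taken to be 0.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical next-token distributions (made-up numbers).
next_token = {'oats': 0.5, 'pecans': 0.25, 'milk': 0.25, 'plants': 0.0}
uniform = {'oats': 0.25, 'pecans': 0.25, 'milk': 0.25, 'plants': 0.25}

print(entropy_bits(next_token))  # 1.5
print(entropy_bits(uniform))     # 2.0
```

The uniform distribution has the highest entropy of any distribution over four outcomes (2 bits), while concentrating probability on fewer outcomes lowers it.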

In [19]:
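probs = list(probs_X.values()) # conditional probability of each token in sequence X
log_probs = [np.log2(prob) for prob in probs] # log_2 of each of those probabilities
# Note: each probability comes from a different conditional distribution and together they do not sum to 1,
# so the value below is a probability-weighted sum of per-token surprisals, not the entropy of a single distribution
entropy_X1 = -1 * sum(prob * log_prob for prob, log_prob in zip(probs, log_probs)) # sum of prob * log_prob, multiplied by -1
print(f'The entropy or expected (average) surprisal of sequence X is { entropy_X1 } bits')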

The entropy or expected (average) surprisal of sequence X is 2.0391276513258605 bits
