# Surprisal 😱 and entropy 🧐 in probabilistic language modeling 🔠📈💻

**Hanna Hubarava** & **Alison Y. Kim** 🌿<br>
Computational Psycholinguistics<br>
University of Zurich<br>
13 March 2023<br>

### 😱 **Surprisal** (a.k.a. *Shannon information*, *information content*)

* Plain English: measures how much information is gained when an event with a given probability occurs; the less probable the event, the higher its surprisal
* Mathematically: for some token $ x_i $ in a sequence $ X = \langle x_1, x_2, ... \rangle $ and its associated probability $ p(x_i) $, the surprisal of $ x_i $ is given by $$ h(x_i) = -\log_{2}{p(x_i)} \text{ bits} $$
* $ p(x_i) = 1 \Rightarrow h(x_i) = 0 \text{ bits} $
* $ p(x_i) = 0 \Rightarrow h(x_i) = \infty \text{ bits} $
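
The definition and its two edge cases can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `surprisal_bits` is our own and not part of any library.

```python
import math


def surprisal_bits(p: float) -> float:
    """Surprisal h(x) = -log2 p(x), in bits (hypothetical helper for illustration)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError('p must be a probability in [0, 1]')
    if p == 0.0:
        # log2(0) cannot be evaluated; the surprisal diverges to infinity
        return math.inf
    # adding 0.0 turns the -0.0 produced for p = 1 into a plain 0.0
    return -math.log2(p) + 0.0


for p in (1.0, 0.5, 0.25, 0.0):
    print(f'p = { p }: { surprisal_bits(p) } bits')
```

Halving the probability adds exactly one bit: $ p = 0.5 $ gives 1 bit and $ p = 0.25 $ gives 2 bits.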

In [63]:
import numpy as np
import torch


# key: word, value: vocabulary index
model_vocab = {'<BOS>': 0, '<EOS>': 1, 'oat': 2, 'oats': 3, 'milk': 4, 'is': 5, 'are': 6, 'tasty': 7, 'yummy': 8}

sequence = '<BOS> oat milk is yummy <EOS>'
sequence = sequence.split()
print(f'Sequence: { sequence }\n')

# In language modeling, one typically cares about the probability of token i conditioned on the preceding tokens 0, ..., i-1
# For this example, we create an artificial probability tensor of size [(length of sequence X) - 1] x [length of vocab]
# The first dimension is (length of sequence X) - 1 rather than (length of sequence X): '<BOS>' at position 0 has no preceding context, so nothing is predicted for it

np.random.seed(0)
probs = np.random.dirichlet(
    alpha=np.ones(len(model_vocab)),
    size=len(sequence)-1
)
probs = torch.tensor(probs, dtype=torch.float32)

print('Probabilities of all vocabulary items at token position...')
for i in range(len(probs)):
    print(f'  { i + 1 }: { probs[i] }')

Sequence: ['<BOS>', 'oat', 'milk', 'is', 'yummy', '<EOS>']

Probabilities of all vocabulary items at token position...
  1: tensor([0.0694, 0.1095, 0.0805, 0.0687, 0.0481, 0.0905, 0.0502, 0.1939, 0.2891])
  2: tensor([0.0589, 0.1910, 0.0916, 0.1022, 0.3163, 0.0090, 0.0111, 0.0025, 0.2176])
  3: tensor([0.1211, 0.1641, 0.3092, 0.1291, 0.0498, 0.1220, 0.0101, 0.0821, 0.0124])
  4: tensor([0.3449, 0.0879, 0.0638, 0.0366, 0.1773, 0.0726, 0.1001, 0.0023, 0.1145])
  5: tensor([0.1018, 0.1031, 0.3092, 0.1230, 0.0479, 0.0617, 0.1285, 0.0067, 0.1181])


In [64]:
token_index = 2
token = sequence[token_index] # 'milk'
vocab_item = model_vocab[token] # vocabulary item 4
prob = probs[token_index-1][vocab_item] # probability of 'milk' given '<BOS> oat'
surprisal = -1 * torch.log2(prob)

print(f'The surprisal or information content of token \'{ token }\' (index { token_index }) is { surprisal } bits\n')

The surprisal or information content of token 'milk' (index 2) is 1.6608266830444336 bits
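
The same lookup can be repeated for every predicted position in the sequence. The sketch below is an illustration only: it uses plain Python, and the probability rows are copied by hand from the rounded output printed earlier, so the results agree with the tensor values only up to rounding.

```python
import math

model_vocab = {'<BOS>': 0, '<EOS>': 1, 'oat': 2, 'oats': 3, 'milk': 4,
               'is': 5, 'are': 6, 'tasty': 7, 'yummy': 8}
sequence = '<BOS> oat milk is yummy <EOS>'.split()

# Rounded values copied from the printed tensor; row i - 1 is the distribution at position i
probs = [
    [0.0694, 0.1095, 0.0805, 0.0687, 0.0481, 0.0905, 0.0502, 0.1939, 0.2891],
    [0.0589, 0.1910, 0.0916, 0.1022, 0.3163, 0.0090, 0.0111, 0.0025, 0.2176],
    [0.1211, 0.1641, 0.3092, 0.1291, 0.0498, 0.1220, 0.0101, 0.0821, 0.0124],
    [0.3449, 0.0879, 0.0638, 0.0366, 0.1773, 0.0726, 0.1001, 0.0023, 0.1145],
    [0.1018, 0.1031, 0.3092, 0.1230, 0.0479, 0.0617, 0.1285, 0.0067, 0.1181],
]

surprisals = []
for position in range(1, len(sequence)):  # position 0 ('<BOS>') is not predicted
    token = sequence[position]
    p = probs[position - 1][model_vocab[token]]
    surprisals.append(-math.log2(p))
    print(f"  '{ token }' (index { position }): { surprisals[-1]:.4f} bits")

# Logs of probabilities add up, so this sum is the surprisal of the whole sequence
print(f'Total: { sum(surprisals):.4f} bits')
```

Because the per-token probabilities are conditional on the preceding tokens, the total is the surprisal of the sequence as a whole under this artificial model.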



### 🧐 **Shannon entropy**
* Plain English: measures the uncertainty about the outcome of a random variable $ X $, with possible outcomes $ x_1, \dots, x_{|X|} $ and associated probabilities $ P(x_1), \dots, P(x_{|X|}) $
* Mathematically: the entropy of a random variable $ X $, with possible outcomes $ x_1, \dots, x_{|X|} $ distributed according to $ P : X \rightarrow [0, 1] $, is given by $$ H(X) = -\sum\limits_{x \in X} {P(x) \log_2{P(x)}} \text{ bits} $$
* The surprisal of each outcome is weighted by its probability
* Thus, one can think of Shannon entropy as the <strong>average</strong> information content
* Note: in the event that $ P(x) = 0 $, the summand $ P(x) \log_2{P(x)} = 0 \log_2{0} $ is taken to be $ 0 $
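
The formula, including the $ 0 \log_2{0} = 0 $ convention, can be sketched in plain Python as follows. The function name `entropy_bits` is our own choice for illustration.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError('probabilities must sum to 1')
    # Skipping p == 0 implements the convention 0 * log2(0) = 0;
    # adding 0.0 avoids returning -0.0 for a deterministic distribution
    return sum(-p * math.log2(p) for p in probs if p > 0) + 0.0


print(entropy_bits([0.5, 0.5]))         # fair coin: 1.0
print(entropy_bits([1.0, 0.0]))         # certain outcome: 0.0
print(entropy_bits([0.5, 0.25, 0.25]))  # 1.5
```

In the last case the outcome surprisals are 1, 2 and 2 bits, and their probability-weighted average is $ 0.5 \cdot 1 + 0.25 \cdot 2 + 0.25 \cdot 2 = 1.5 $ bits.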

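Continuing the example above, let us define random variables $ X_1, \dots, X_{n-1} $, one for each predicted token position $ i \in [1, n-1] $, where $ n $ is the length of the sequence (position 0 holds `<BOS>`, which is given rather than predicted). Each $ X_{i} $ can take any value in the model vocabulary, indexed by $ j \in [0, |V|-1] $.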

In [68]:
log_probs = torch.log2(probs) # individual log_2[P(x)] for all token positions i over all model vocabulary items j
entropy = probs * log_probs # summand for each (i, j)
entropy = -1 * torch.sum(entropy, dim=1) # sum of product of probs and log_probs, multiplied by -1

print('Per token entropies')
for i in range(len(entropy)):
    # row i of probs is the distribution over the token at position i + 1 (position 0 is '<BOS>')
    print(f'  X_{ i + 1 }, token at position { i + 1 }: { entropy[i] } bits')

Per token entropies
  X_1, token at position 1: 2.892023801803589 bits
  X_2, token at position 2: 2.507413148880005 bits
  X_3, token at position 3: 2.7292935848236084 bits
  X_4, token at position 4: 2.6935746669769287 bits
  X_5, token at position 5: 2.8194639682769775 bits
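
One caveat about the vectorised computation: if a probability is exactly zero, $ \log_2{0} $ evaluates to $ -\infty $ and the product $ 0 \cdot (-\infty) $ becomes `nan`, so the convention $ 0 \log_2{0} = 0 $ has to be enforced explicitly. The probabilities printed above are all positive, so the cell is unaffected. Below is a minimal NumPy sketch of one way to mask the zeros; the function name `entropy_bits` is our own, and the same masking idea should carry over to torch tensors.

```python
import numpy as np


def entropy_bits(probs):
    """Row-wise Shannon entropy in bits, treating 0 * log2(0) as 0."""
    probs = np.asarray(probs, dtype=np.float64)
    # Compute log2 only where p > 0; the remaining entries keep their initial value 0
    log_probs = np.zeros_like(probs)
    np.log2(probs, out=log_probs, where=probs > 0)
    return -np.sum(probs * log_probs, axis=-1)


example = np.array([
    [0.5, 0.5, 0.0],  # one impossible outcome
    [1.0, 0.0, 0.0],  # deterministic
])
print(entropy_bits(example))
```

The first row behaves like a fair coin (1 bit) and the second carries no uncertainty (0 bits); without the mask both rows would come out as `nan`.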


### ❓ **Questions 4-2**
##### 1. *Does the entropy of a random variable depend on the number of different values that the variable can take? Explain your answer using the formula that defines entropy.*
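**Partly.** Entropy does not depend on the outcomes $ x_1, \dots, x_{|X|} $ themselves: each summand is a function of the probability $ P(x) $ alone. The number of outcomes $ |X| $ does matter indirectly, because it determines how many terms the sum has and therefore how high the entropy can get, namely at most $ \log_2{|X|} $ bits. Outcomes with $ P(x) = 0 $ contribute nothing, so what counts is the number of outcomes with non-zero probability and how the probability mass is spread over them.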
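
A small numerical check makes the role of $ |X| $ concrete. The sketch below uses our own helper names (`entropy_bits`, `uniform`). It compares uniform distributions over different numbers of outcomes, then pads a distribution with zero-probability outcomes.

```python
import math


def entropy_bits(probs):
    # Outcomes with p == 0 are skipped, following the convention 0 * log2(0) = 0
    return sum(-p * math.log2(p) for p in probs if p > 0)


def uniform(n):
    """Uniform distribution over n outcomes."""
    return [1.0 / n] * n


for n in (2, 4, 8):
    print(f'uniform over { n } outcomes: { entropy_bits(uniform(n)) } bits')

# Padding with impossible outcomes raises |X| from 2 to 4 but adds only zero summands
print(f'fair coin:        { entropy_bits([0.5, 0.5]) } bits')
print(f'fair coin padded: { entropy_bits([0.5, 0.5, 0.0, 0.0]) } bits')
```

The uniform cases give 1, 2 and 3 bits, whereas the padded coin stays at 1 bit: extra outcomes change the entropy only if they receive some probability mass.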

##### 2. *Does the entropy of a random variable depend on the distribution of the different values that the variable can take?*
**Yes.** Entropy is a function of the distribution of the values or outcomes associated with variable $ X $, i.e. the respective probabilities associated with outcomes $ x_1, \dots, x_{|X|} $.

##### 3. *Is the <s>random variable</s> entropy of a uniformly distributed random variable high or low?*
**High: in fact, it is maximal for the given random variable.** We can think about this qualitatively. Entropy is a measure of uncertainty or lack of information. The distribution that gives us the *least* amount of information is the one with the *highest* amount of uncertainty. If a random variable $ X $, e.g. the next token of an utterance, takes each of its possible values with equal probability, then no outcome can be predicted better than any other, and the uncertainty is at its highest. Conversely, a more peaked distribution (e.g. one that concentrates most of the probability mass on a few outcomes) reduces the amount of uncertainty in comparison with a uniform distribution.
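
The maximal value can be worked out directly from the entropy formula. For a uniform distribution every outcome has probability $ P(x) = 1 / |X| $, so $$ H(X) = -\sum\limits_{x \in X} {\frac{1}{|X|} \log_2{\frac{1}{|X|}}} = \log_2{|X|} \text{ bits} $$ For the 9-item vocabulary of the example above, this upper bound is $ \log_2{9} \approx 3.17 $ bits. All five per-token entropies computed earlier (roughly 2.51 to 2.89 bits) fall below it, as expected for non-uniform distributions.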