# Surprisal 😱 and entropy 🧐

**Hanna Hubarava** & **Alison Y. Kim** 🌿<br>
Computational Psycholinguistics<br>
University of Zurich<br>
13. March 2023<br>

### 😱 **Surprisal** (a.k.a. *Shannon information*, *information content*)

* Plain English: measures how much information is gained when an event with a given probability occurs; the less probable the event, the higher its surprisal
* Mathematically: for some token $ x_i $ in a sequence $ X = \langle x_1, x_2, \dots \rangle $ and its associated probability $ p(x_i) $, the surprisal of $ x_i $ is given by $$ h(x_i) = -\log_{2}{p(x_i)} \text{ bits} $$
* $ p(x_i) = 1 \Rightarrow h(x_i) = 0 \text{ bits} $
* $ p(x_i) = 0 \Rightarrow h(x_i) = \infty \text{ bits} $
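The definition and its two limiting cases can be checked with a minimal sketch. The helper name `surprisal_bits` is our own choice for illustration and does not come from any library.

```python
import math


def surprisal_bits(p: float) -> float:
    """Surprisal h = -log2(p) in bits; p = 0 is mapped to infinity."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0.0:
        return math.inf
    # log2(1 / p) equals -log2(p) and avoids printing "-0.0" for p = 1
    return math.log2(1.0 / p)


for p in (1.0, 0.5, 0.25, 0.0):
    print(f"p = {p}: {surprisal_bits(p)} bits")
```

Halving the probability adds exactly one bit of surprisal, which is why the logarithm is taken to base 2.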

### 🧐 **Shannon entropy**
* Plain English: measures the uncertainty about the outcome of a random variable $ X $, with possible outcomes $ x_{1}, x_{2}, \dots $ and associated probabilities $ P(x_{1}), P(x_{2}), \dots $
* Mathematically: the entropy of a random variable $ X $ whose outcomes are distributed according to $ P : X \rightarrow [0, 1] $, with $ \sum_{x \in X} P(x) = 1 $, is given by $$ H(X) = -\sum\limits_{x \in X} {P(x) \log_2{P(x)}} \text{ bits} $$
* The surprisal of each outcome is weighted by its probability
* Thus, one can think of Shannon entropy as the **average** (expected) information content of an outcome
* Note: in the event that $ P(x) = 0 $, the summand $ P(x) \log_2{P(x)} = 0 \log_2{0} $ is taken to be $ 0 $
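The formula, including the convention for zero-probability outcomes, can be sketched in a few lines of plain Python. The function name `entropy_bits` and the example distributions are our own illustrative choices.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Skipping terms with P(x) = 0 implements the convention 0 * log2(0) = 0
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)


print(entropy_bits([0.5, 0.5]))         # fair coin
print(entropy_bits([0.5, 0.25, 0.25]))  # three outcomes, one more likely
print(entropy_bits([1.0, 0.0]))         # certain outcome
```

Each summand is the probability of an outcome times its surprisal, so the function returns the probability-weighted average surprisal.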

### **Questions 4-2: Understanding entropy**
##### 1. *Does the entropy of a random variable depend on the number of different values that the variable can take? Explain your answer using the formula that defines entropy.*
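**Not directly.** Entropy is a function of the probabilities alone: each summand $ P(x) \log_2{P(x)} $ depends only on the probability of the outcome, not on the outcome $ x $ itself, and outcomes with $ P(x) = 0 $ contribute nothing. The number of outcomes $ |X| $ matters only indirectly: it determines how many summands there are and bounds the maximum possible entropy, $ H(X) \leq \log_2{|X|} $ bits.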
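A small sketch may help here. The four distributions below are invented for illustration; they show that the entropy follows the probabilities rather than the count of outcomes.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits, with 0 * log2(0) taken to be 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0.0)


two_outcomes = [0.5, 0.5]
four_outcomes_padded = [0.5, 0.5, 0.0, 0.0]      # two extra outcomes that never occur
four_outcomes_uniform = [0.25, 0.25, 0.25, 0.25]
four_outcomes_peaked = [0.97, 0.01, 0.01, 0.01]

for name, dist in [
    ("two outcomes, uniform", two_outcomes),
    ("four outcomes, two impossible", four_outcomes_padded),
    ("four outcomes, uniform", four_outcomes_uniform),
    ("four outcomes, peaked", four_outcomes_peaked),
]:
    print(f"{name}: {entropy_bits(dist):.3f} bits")
```

The padded four-outcome variable has the same entropy as the two-outcome one, and the peaked four-outcome variable has less entropy than the fair coin, even though it has more possible values.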

##### 2. *Does the entropy of a random variable depend on the distribution of the different values that the variable can take?*
**Yes.** Entropy is a function of the distribution of the values or outcomes associated with variable $ X $, i.e. the respective probabilities associated with outcomes $ x_1, \dots, x_{|X|} $.

##### 3. *Is the <s>random variable</s> entropy of a uniformly distributed random variable high or low?*
**High. In fact, it is the maximum possible for a random variable with that set of outcomes.** We can think about this qualitatively. Entropy is a measure of uncertainty or lack of information. The distribution that gives us the *least* amount of information is the one with the *highest* amount of uncertainty. If a random variable $ X $, e.g. an utterance, takes each of its possible values with equal probability, then no outcome is more predictable than any other, and uncertainty is maximal. Conversely, a distribution that concentrates probability mass on a few outcomes (e.g. a sharply peaked one) has lower entropy than the uniform distribution over the same outcomes.
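The qualitative argument can be backed by a short calculation. Writing $ n = |X| $ for the number of outcomes (our shorthand), a uniform distribution assigns $ P(x) = \frac{1}{n} $ to every outcome, so

$$ H(X) = -\sum\limits_{x \in X} {\frac{1}{n} \log_2{\frac{1}{n}}} = -n \cdot \frac{1}{n} \log_2{\frac{1}{n}} = \log_2{n} \text{ bits} $$

For example, a uniform choice among $ n = 8 $ outcomes has an entropy of $ \log_2{8} = 3 $ bits, and no other distribution over those 8 outcomes has more.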

##### 5. *What is the difference between entropy and surprisal?*
For a random variable $ X $:
* **Surprisal** is the amount of information learned from one instance of $ X $ with outcome $ x_i $: $ h(x_{i}) = -\log_{2}{p(x_{i})} $. Think of emoji: 😱.
* **Entropy** is the uncertainty of $ X $, which can take on values $ x_{1}, x_{2}, \dots $. It is the expected or average surprisal. Think of emoji: 🧐.
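One way to see that entropy is the expected surprisal is to sample many outcomes and average their individual surprisals. The sketch below does this for an invented three-outcome distribution; the outcome labels, probabilities, seed and sample size are arbitrary assumptions.

```python
import math
import random

outcomes = ["a", "b", "c"]
probs = [0.5, 0.25, 0.25]
p_of = dict(zip(outcomes, probs))

# Entropy: probability-weighted sum of surprisals (a property of the distribution)
entropy = sum(p * math.log2(1.0 / p) for p in probs)

# Surprisal: a property of each individual observed outcome
rng = random.Random(0)
samples = rng.choices(outcomes, weights=probs, k=100_000)
mean_surprisal = sum(math.log2(1.0 / p_of[x]) for x in samples) / len(samples)

print(f"entropy:                {entropy:.3f} bits")
print(f"mean sampled surprisal: {mean_surprisal:.3f} bits")
```

The two numbers agree up to sampling noise: surprisal varies from observation to observation, while entropy summarises the whole distribution.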


##### 6. *Give a linguistic example to illustrate the difference between entropy and surprisal.*
Our example is based on probabilistic language modeling (LM) 🔠📈💻.

In [None]:
# Define toy vocabulary: key = natural-language token, value = vocabulary index
model_vocab = {'<BOS>': 0, '<EOS>': 1, 'oat': 2, 'oats': 3, 'milk': 4, 'is': 5, 'are': 6, 'tasty': 7, 'yummy': 8}

# Define an (objectively true) utterance
sequence = '<BOS> oat milk is yummy <EOS>'
sequence = sequence.split()
print(f'Sequence: { sequence }\n')

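In LM, one typically cares about the probability of the token at position $ i $ conditioned on the preceding tokens at positions $ 0, \dots, i-1 $. For this example, we create an artificial probability tensor of size $ [(\text{length of sequence } X) - 1] \times [\text{length of vocab}] $: row $ i-1 $ holds the distribution over the vocabulary at position $ i $, and there is no row for `<BOS>` at position $ 0 $ because nothing precedes it.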

In [None]:
import numpy as np
import torch


np.random.seed(3)
rows = len(sequence) - 1
columns = len(model_vocab)
probs = np.random.dirichlet(
    alpha=np.ones(columns),
    size=rows
)
probs = torch.tensor(probs, dtype=torch.float32)
for i in range(len(probs)):
    assert torch.isclose(torch.sum(probs[i]), torch.tensor(1.)) # check that each row sums to 1

print(f'Sequence: { sequence }\n')
print('Probabilities of all vocabulary items at token position...')
for i in range(len(probs)):
    print(f'  { i + 1 }: { probs[i] }')

In [None]:
token_index = 2
token = sequence[token_index] # 'milk'
vocab_item = model_vocab[token] # vocabulary item 4
prob = probs[token_index-1][vocab_item] # probability of 'milk' given '<BOS> oat'
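surprisal = -1 * torch.log2(prob) # h(x_i) = -log_2[p(x_i)]

# .item() converts the 0-dimensional tensor to a plain Python float for printing
print(f'The surprisal or information content of token \'{ token }\' (index { token_index }) is {surprisal.item():.3f} bits\n')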

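Continuing the LM example above, let $ n $ be the length of the sequence and define random variables $ X_{1}, \dots, X_{n-1} $ as the tokens occurring at positions $ i \in [1, n-1] $ (position $ 0 $ is always `<BOS>`). Each $ X_{i} $ can take any value $ V_{j} $ for $ j \in [0, |V|-1] $ in the model vocabulary $ V $. The entropy of $ X_{i} $ is computed from row $ i-1 $ of the probability tensor.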

In [None]:
log_probs = torch.log2(probs) # log_2[P(x)] for every token position (row) and vocabulary item (column)
summands = probs * log_probs # P(x) * log_2[P(x)] for each (position, vocabulary item) pair
entropy = -1 * torch.sum(summands, dim=1) # negative sum over the vocabulary, one value per position

print('Per token entropies')
for i in range(len(entropy)):
    # Row i of probs is the distribution at token position i + 1 (position 0 is <BOS>)
    print(f'  X_{ i + 1 }, token at position { i + 1 }: {entropy[i].item():.3f} bits')
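Because the probabilities above are random, the contrast between the two quantities can be hard to see. As a complement, here is a minimal sketch with a hand-picked, entirely hypothetical next-token distribution after the context `<BOS> oat`; the numbers are assumptions chosen for illustration, not model output.

```python
import math

# Hypothetical distribution over the next token after '<BOS> oat' (values are invented)
next_token_probs = {
    "milk": 0.70,
    "is": 0.10,
    "oats": 0.05,
    "tasty": 0.05,
    "yummy": 0.05,
    "<EOS>": 0.05,
}
assert math.isclose(sum(next_token_probs.values()), 1.0)


def surprisal_bits(token: str) -> float:
    """Surprisal of one specific continuation."""
    return math.log2(1.0 / next_token_probs[token])


# Entropy of the position: expected surprisal before the next token is seen
entropy = sum(p * math.log2(1.0 / p) for p in next_token_probs.values())

print(f"entropy at this position: {entropy:.3f} bits")
print(f"surprisal of 'milk':      {surprisal_bits('milk'):.3f} bits")
print(f"surprisal of 'yummy':     {surprisal_bits('yummy'):.3f} bits")
```

The entropy is fixed once the context is given, whereas the surprisal depends on which token actually follows: the expected continuation 'milk' is less surprising than average, and the unexpected 'yummy' is more surprising.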