# Entropy and Information Gain
#### Math 3480 - Machine Learning
#### Dr. Michael E. Olson

# Reading
* Geron, Chapter 6

## Additional Resources
* [YouTube: Shannon Entropy and Information Gain](https://www.youtube.com/watch?v=9r7FIXEAGvs) by Serrano Academy
* [YouTube: Entropy (for data science) Clearly Explained!!!](https://www.youtube.com/watch?v=7VeUPuFGJHk&t=143s) by StatQuest

# Recall
Expected values are calculated as
$$E(x) = \sum\left(x\cdot P(x)\right)$$

# Introduction Scenario
* Choose 4 balls out of a bucket of red balls
  * How surprised are you if you get RRRR? Not at all
* Choose 4 balls out of a bucket of 75% red 25% blue balls
  * How surprised are you if you get RRRR? Moderately
* Choose 4 balls out of a bucket of 50% red 50% blue balls
  * How surprised are you if you get RRBB? Highly

# Surprise
Surprise and likelihood (probability) are inversely related: the more likely an outcome, the less it surprises us. So, it's tempting to say, $Surprise = \frac{1}{P(x)}$. But this is deceptive: a certain outcome, $P(x)=1$, should carry no surprise, yet this formula gives $\frac{1}{1}=1$.

Try a logarithm. Then $P(x)=1$ gives zero surprise, and the surprise grows without bound as $P(x)$ approaches 0
$$\log_2\left(\frac{1}{P(x)}\right) = -\log_2P(x)$$
* Why $\log_2$? When we have 2 possible outcomes, we have a binary situation
* The number of possible outcomes for drawing 4 balls is $2*2*2*2=2^4$
* The number of possible outcomes for drawing $k$ balls is $2*2*\cdots*2=2^k$

Because of this, we are going to use $\log_2$. The base may change if there are a different number of outcomes, but in a decision tree with yes/no splits, each split has only two outcomes.
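To compare the two candidate definitions numerically, here is a minimal sketch. The function names `inverse_surprise` and `log_surprise` are made up for this illustration and do not come from any library.

```python
import numpy as np

def inverse_surprise(p):
    """First attempt: surprise as 1 / P(x)."""
    return 1.0 / p

def log_surprise(p):
    """Surprise as log2(1 / P(x)) = -log2 P(x), measured in bits."""
    return np.log2(1.0 / p)

# A certain outcome (p = 1) should have zero surprise; only the log version does.
for p in [1.0, 0.5, 0.25, 0.125]:
    print(f"P(x) = {p:<5}  1/P(x) = {inverse_surprise(p):4.1f}  -log2 P(x) = {log_surprise(p):3.1f}")
```

Each time the probability is halved, the logarithmic surprise goes up by exactly one bit.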

__What is the surprise of drawing the sequence RRRB when the bin is 75% red and 25% blue?__ The probability of this happening is,
$$P(x) = 0.75*0.75*0.75*0.25 = 0.1055$$

Fairly low probability. The surprise is then,
$$Surprise = -\log_2 (0.75*0.75*0.75*0.25) = -\log_2 0.75 - \log_2 0.75 - \log_2 0.75 - \log_2 0.25$$
This is just the sum of the surprise of each event!

In [6]:
import numpy as np

-np.log2(0.75) - np.log2(0.75) - np.log2(0.75) - np.log2(0.25)

3.2451124978365313
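As a quick check of the additivity claim, the sketch below computes the surprise of the whole sequence and the sum of the per-draw surprises, then compares them. It assumes the draws are independent (for example, drawn with replacement), so the probabilities multiply; the variable names are illustrative.

```python
import numpy as np

# Probability of each draw in the sequence R, R, R, B (75% red, 25% blue)
draw_probs = np.array([0.75, 0.75, 0.75, 0.25])

# Surprise of the joint event: -log2 of the product of the probabilities
surprise_of_sequence = -np.log2(np.prod(draw_probs))

# Surprise added up draw by draw
sum_of_surprises = np.sum(-np.log2(draw_probs))

print(surprise_of_sequence, sum_of_surprises)
print(np.isclose(surprise_of_sequence, sum_of_surprises))
```

The logarithm turns the product of probabilities into a sum of surprises.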

# Entropy
__Entropy__ is a measure of disorder or uncertainty. The goal of machine learning models, and of data scientists in general, is to reduce uncertainty.
* Entropy is the Expected Value of the Surprise

To find the entropy, weight the surprise of each outcome by its probability
* Find the product for each outcome
$$-P(x)\log_2 P(x)$$
* Add them together, and we get entropy!
$$E(surprise) = Entropy = -\sum P(x)\log_2 P(x)$$

Note that this is just the equation for the expected value, $E(x)=\sum x\cdot P(x)$, with the surprise $-\log_2 P(x)$ in place of $x$.
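The expected-value structure can be made explicit in code. This is a small sketch for the 75% red, 25% blue bucket; the variable names are chosen for illustration.

```python
import numpy as np

probs = np.array([0.75, 0.25])   # P(R), P(B)
surprise = -np.log2(probs)       # surprise of each outcome, in bits

# Expected value: sum of (value * probability), with the surprise as the value
expected_surprise = np.sum(surprise * probs)

print(f"surprise per outcome: {surprise}")
print(f"expected surprise (entropy): {expected_surprise:.4f}")
```

The rare outcome (blue) is more surprising, but it is weighted by a smaller probability.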

In [13]:
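# 4 balls from a bucket of red balls
# P(R) = 1     surprise(R) = -log2(1) = 0
# P(B) = 0     surprise(B) = -log2(0) = infinite

# - [ P(R) * log2 P(R) ] - [ P(B) * log2 P(B) ]
# The P(B) term is 0 * log2(0), which is taken to be 0, so it is left out
E = -1 * np.log2(1) + 0.0  # adding 0.0 turns the floating-point -0.0 into 0.0
print("Entropy = {0:.2f}".format(E))

Entropy = 0.00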


In [14]:
# 4 balls from a bucket of 75% red balls and 25% blue balls
# P(R) = 0.75     surprise(R) = -log2(0.75)
# P(B) = 0.25     surprise(B) = -log2(0.25)

# - [ P(R) * log2 P(R) ] - [ P(B) * log2 P(B) ]
E = -0.75 * np.log2(0.75) - 0.25 * np.log2(0.25)
print("Entropy = {0:.2f}".format(E))

Entropy = 0.81


In [15]:
# 4 balls from a bucket of 50% red balls and 50% blue balls
# P(R) = 0.50     surprise(R) = -log2(0.50)
# P(B) = 0.50     surprise(B) = -log2(0.50)

# - [ P(R) * log2 P(R) ] - [ P(B) * log2 P(B) ]
E = -0.5 * np.log2(0.5) - 0.5 * np.log2(0.5)
print("Entropy = {0:.2f}".format(E))

Entropy = 1.00
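The three buckets above are points on one curve. The sketch below sweeps the fraction of red balls and prints the entropy for each mix; `binary_entropy` is a hypothetical helper written for this example, and it treats $0\cdot\log_2 0$ as 0.

```python
import numpy as np

def binary_entropy(p_red):
    """Entropy in bits of a bucket with a fraction p_red of red balls, the rest blue."""
    probs = np.array([p_red, 1.0 - p_red])
    probs = probs[probs > 0]   # outcomes with probability 0 contribute nothing
    return float(np.sum(probs * np.log2(1.0 / probs)))

for p_red in [1.0, 0.9, 0.75, 0.5, 0.25, 0.1, 0.0]:
    print(f"P(R) = {p_red:.2f}  entropy = {binary_entropy(p_red):.2f}")
```

The entropy is zero when the bucket holds a single color, is symmetric in red and blue, and peaks at 1 bit for the 50/50 bucket.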


# Information Gain
When we leave things to chance, our knowledge is lower and entropy is higher. When our knowledge is higher, entropy is lower.
* Information gain is the reduction in entropy (expected surprise) obtained by transforming a dataset, for example by splitting it on an attribute

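To calculate the information gain of splitting a dataset $x$ on an attribute $A$,
$$IG(x,A)=H(x)-\sum_{v\in Values(A)} \frac{|x_v|}{|x|}H(x_v)$$
where $H$ is the entropy, $x_v$ is the subset of $x$ in which $A$ takes the value $v$, and $|\cdot|$ is the number of samples in a set.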
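A minimal sketch of this formula follows. The helper names `entropy_of_labels` and `information_gain` and the toy data (ball colors labelled by the bin they came from) are hypothetical and only serve to illustrate the calculation.

```python
from collections import Counter
import numpy as np

def entropy_of_labels(labels):
    """Entropy in bits of the class proportions in a list of labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return float(np.sum(probs * np.log2(1.0 / probs)))

def information_gain(labels, attribute_values):
    """H(x) minus the size-weighted entropy of the subsets x_v, one per attribute value v."""
    total = len(labels)
    weighted_child_entropy = 0.0
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        weighted_child_entropy += len(subset) / total * entropy_of_labels(subset)
    return entropy_of_labels(labels) - weighted_child_entropy

# Hypothetical toy data: ball color (label) and the bin it came from (attribute A)
colors = ["R", "R", "R", "B", "R", "B", "B", "B"]
bins = ["left", "left", "left", "left", "right", "right", "right", "right"]

print(f"H(x)     = {entropy_of_labels(colors):.3f}")
print(f"IG(x, A) = {information_gain(colors, bins):.3f}")
```

Here the full set is half red and half blue, while each bin is a 75/25 mix, so knowing the bin reduces the entropy only a little. A split that separated the colors perfectly would gain the full 1 bit.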

# Another example
Now consider strings of 8 letters drawn from A-D. Find the entropy of each:
1. AAAAAAAA
2. AAAABBCD
3. AABBCCDD

In [16]:
# AAAAAAAA
# P(A) = 1
# P(B) = 0
# P(C) = 0
# P(D) = 0

# Terms with P = 0 contribute nothing (0 * log2(0) is taken to be 0)
E = -1 * np.log2(1) + 0.0  # adding 0.0 turns the floating-point -0.0 into 0.0
print("Entropy = {0:.2f}. Since we know that each character has to be an A, our knowledge is high.".format(E))

Entropy = 0.00. Since we know that each character has to be an A, our knowledge is high.


In [18]:
# AAAABBCD
# P(A) = 0.50
# P(B) = 0.25
# P(C) = 0.125
# P(D) = 0.125

E = -0.50 * np.log2(0.50) - 0.25 * np.log2(0.25) - 0.125 * np.log2(0.125) - 0.125 * np.log2(0.125)
print("Entropy = {0:.2f}. Since we only know that half the characters are an A, our knowledge is moderate.".format(E))

Entropy = 1.75. Since we only know that half the characters are an A, our knowledge is moderate.


In [19]:
# AABBCCDD
# P(A) = 0.25
# P(B) = 0.25
# P(C) = 0.25
# P(D) = 0.25

E = -0.25 * np.log2(0.25) - 0.25 * np.log2(0.25) - 0.25 * np.log2(0.25) - 0.25 * np.log2(0.25)
print("Entropy = {0:.2f}. Since each of the four letters is equally likely, our knowledge is low.".format(E))

Entropy = 2.00. Since each of the four letters is equally likely, our knowledge is low.
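Rather than typing the probabilities by hand, they can be counted from the string itself. The sketch below uses `collections.Counter`; `string_entropy` is a hypothetical helper name.

```python
from collections import Counter
import numpy as np

def string_entropy(letters):
    """Entropy in bits of the letter frequencies in a string."""
    counts = np.array(list(Counter(letters).values()), dtype=float)
    probs = counts / counts.sum()   # only letters that occur are counted, so no P = 0 terms
    return float(np.sum(probs * np.log2(1.0 / probs)))

for letters in ["AAAAAAAA", "AAAABBCD", "AABBCCDD"]:
    print(f"{letters}: entropy = {string_entropy(letters):.2f}")
```

The more evenly the letters are spread, the higher the entropy.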


# Another look at Entropy
Another way to define Entropy:
* Entropy is the average number of yes/no questions we need to ask to identify the outcome

__Take the first example: $AAAAAAAA$__
* How many questions do you need to ask to know which letter you drew?
* Since $P(A) = 1$, we don't have to ask any
  * Questions asked = 0 = Entropy

__Take the third example: $AABBCCDD$__
* How many questions do you need to ask to know which letter you drew?
  * We could ask
    * Is it A?
    * Is it B?
    * Is it C?
    * If all three answers are no, we know it's not A, B, or C, so we don't have to ask about D. In the worst case, we asked 3 questions.
* We can look at this a different way. Ask instead,
  * Is it A or B?
  * Consider the diagram:

[![](https://mermaid.ink/img/pako:eNpdkMEKgzAMhl8l5Kwv4GFDqxuDscN2GtZDsXEW1EptD0N991XmQJdT-L8vkGTEUkvCCF9G9DVc77wDX_F4GUBZIGVrMhCDNpAc5xVCGB5getIwQbKK8R-86QnYytiPJZvBLI8L2MXLSJonxTdkG_eUs326qOc8LTDAlkwrlPQXjIvC0S_cEsfIt5Iq4RrLkXezV10vhaVMKqsNRpVoBgpQOKsf767EyBpHPylVwj-kXa35AzbKWTA)](https://mermaid-js.github.io/mermaid-live-editor/edit/#pako:eNpdkMEKgzAMhl8l5Kwv4GFDqxuDscN2GtZDsXEW1EptD0N991XmQJdT-L8vkGTEUkvCCF9G9DVc77wDX_F4GUBZIGVrMhCDNpAc5xVCGB5getIwQbKK8R-86QnYytiPJZvBLI8L2MXLSJonxTdkG_eUs326qOc8LTDAlkwrlPQXjIvC0S_cEsfIt5Iq4RrLkXezV10vhaVMKqsNRpVoBgpQOKsf767EyBpHPylVwj-kXa35AzbKWTA)

In each case, it took 2 questions. With $n$ the number of questions needed to reach each letter, questions asked = $\sum n\cdot P(x) = n_AP(A) + n_BP(B) + n_CP(C) + n_DP(D) = 2*\frac{1}{4} + 2*\frac{1}{4} + 2*\frac{1}{4} + 2*\frac{1}{4} = 2 = Entropy$
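The two questioning strategies for $AABBCCDD$ can be compared by their average number of questions. This is a small sketch; the per-letter question counts are read off from the two strategies described above (asking letter by letter takes 1, 2, 3, and 3 questions for A, B, C, and D).

```python
import numpy as np

probs = np.array([0.25, 0.25, 0.25, 0.25])   # P(A), P(B), P(C), P(D)

# Questions needed to reach each letter under each strategy
one_at_a_time = np.array([1, 2, 3, 3])       # "Is it A?", "Is it B?", "Is it C?"
split_in_half = np.array([2, 2, 2, 2])       # "Is it A or B?", then one more question

print("one at a time:", np.sum(one_at_a_time * probs))
print("split in half:", np.sum(split_in_half * probs))
print("entropy:      ", np.sum(probs * np.log2(1.0 / probs)))
```

Splitting the remaining possibilities in half with each question gives the lower average, and that average matches the entropy.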

__Take the second example: $AAAABBCD$__
* How many questions do you need to ask to know which letter you drew?
  * We could follow the same process, but we'd get the same result (2 questions)
  * Instead, notice that half of them are an A
    * So, ask if it's an A
    * If it's not in that half, then it's B, C, or D. Half of them are B
    * Ask if it's a B
  * Consider the diagram:

[![](https://mermaid.ink/img/pako:eNpVkEsKgzAQQK8yzLpewEWLv5ZC6aJdlSSLYMYaMEZiXBT17o2goLMaHu_BMCOWVhHG-HWyq-Hx4i2EScZ7D9pDcplXAFF0hulD_QQpS8SBPu0E2VqkW5Htipyl4kCXoliLbCuKXXFlmTjQpbixXOAJDTkjtQpHj4vC0ddkiGMcVkWVHBrPkbdzUIdOSU-F0t46jCvZ9HRCOXj7_rUlxt4NtEm5luEHZrXmP2eBVRM)](https://mermaid-js.github.io/mermaid-live-editor/edit/#pako:eNpVkEsKgzAQQK8yzLpewEWLv5ZC6aJdlSSLYMYaMEZiXBT17o2goLMaHu_BMCOWVhHG-HWyq-Hx4i2EScZ7D9pDcplXAFF0hulD_QQpS8SBPu0E2VqkW5Htipyl4kCXoliLbCuKXXFlmTjQpbixXOAJDTkjtQpHj4vC0ddkiGMcVkWVHBrPkbdzUIdOSU-F0t46jCvZ9HRCOXj7_rUlxt4NtEm5luEHZrXmP2eBVRM)

This time, the questions asked = $\sum n\cdot P(x) = n_AP(A) + n_BP(B) + n_CP(C) + n_DP(D) = 1*\frac{1}{2} + 2*\frac{1}{4} + 3*\frac{1}{8} + 3*\frac{1}{8} = 1.75 = Entropy$



In [20]:
1*0.5 + 2*0.25 + 3*0.125 + 3*0.125

1.75
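The match between average questions and entropy is not a coincidence here. In this example every probability is a power of $\frac{1}{2}$, so the number of questions needed for each letter equals its surprise, $-\log_2 P(x)$. The sketch below checks this; the dictionaries are illustrative. For probabilities that are not powers of $\frac{1}{2}$, the average number of yes/no questions can only be greater than or equal to the entropy.

```python
import numpy as np

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Depth of each letter in the question tree: A after 1 question, B after 2, C and D after 3
questions = {"A": 1, "B": 2, "C": 3, "D": 3}

# Questions needed next to the surprise of each letter
for letter, p in probs.items():
    print(letter, questions[letter], np.log2(1.0 / p))

expected_questions = sum(questions[k] * probs[k] for k in probs)
entropy = sum(p * np.log2(1.0 / p) for p in probs.values())
print(expected_questions, entropy)
```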