In [1]:
# Information Theory:

# The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.
# A message saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative.

# We want to formalize this:

# # Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
# # Less likely events should have higher information content.
# # Independent events should have additive information. 
# # # For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.

# Thus, to satisfy these properties, we define the self-information of an event X = x to be:
# I(x) = -log P(x)   --> log is the natural logarithm (base e)
# I(x) is thus expressed in "nats"
# 1 nat is the amount of information gained by observing an event of probability 1/e

# # Other texts use base-2 logarithms, in which case the units are called "bits" or "shannons"
# # Information measured in bits is just a rescaling of information measured in nats: I_bits(x) = I_nats(x) / ln 2

# I(x) measures a single outcome
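
# A minimal sketch of the self-information definition above, using only the standard library.
# The function name self_information and its base parameter are illustrative choices, not from the source.
# It checks the three desired properties: certain events carry no information,
# rarer events carry more, and independent events add up.

```python
import math

def self_information(p, base=math.e):
    """Self-information I(x) = -log P(x) of an event with probability p.

    base=math.e gives nats, base=2 gives bits (hypothetical helper).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # log(1/p) equals -log(p); dividing by log(base) changes the unit.
    return math.log(1.0 / p) / math.log(base)

certain = self_information(1.0)                  # guaranteed event
one_nat = self_information(1.0 / math.e)         # event of probability 1/e
one_head_bits = self_information(0.5, base=2)    # one fair coin toss
two_heads_bits = self_information(0.25, base=2)  # two independent heads

print(certain, one_nat, one_head_bits, two_heads_bits)
```

# Two heads in a row have probability 0.5 * 0.5, so their information is the sum of the two single-toss values.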

In [2]:
# To quantify the uncertainty in an entire probability distribution --> use the Shannon entropy:
# The Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution:
# H(P) = E_{x~P}[I(x)] = -E_{x~P}[log P(x)]
# It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P

# Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy
# Distributions that are closer to uniform have high entropy
# When x is continuous, the Shannon entropy is known as the differential entropy

![image.png](attachment:73b393fd-d07d-4c63-9e49-fc5505e01dd9.png)

![image.png](attachment:5ceb5ade-f369-45bf-be0a-75df2df6cd96.png)
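
# A small sketch of the Shannon entropy of a discrete distribution, assuming the distribution
# is given as a list of probabilities that sum to 1. The name shannon_entropy is illustrative.
# It compares a deterministic, a skewed, and a uniform distribution over two outcomes.

```python
import math

def shannon_entropy(probs, base=math.e):
    """H(P) = -sum_x P(x) log P(x), treating 0 * log 0 as 0."""
    total = sum(p * math.log(1.0 / p) for p in probs if p > 0)
    return total / math.log(base)

deterministic = shannon_entropy([1.0, 0.0], base=2)  # outcome is certain
skewed = shannon_entropy([0.9, 0.1], base=2)         # nearly deterministic
uniform = shannon_entropy([0.5, 0.5], base=2)        # maximally uncertain

print(deterministic, skewed, uniform)
```

# The entropy grows as the distribution moves from deterministic toward uniform.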

In [3]:
# Kullback-Leibler (KL) divergence:
# # If we have two separate probability distributions P(x) and Q(x) over the same random variable x, the KL divergence DKL(P||Q) measures how different these two distributions are
# # In the case of discrete variables, DKL(P||Q) is the extra amount of information
# # # (measured in bits if we use the base-2 logarithm, but in machine learning we usually use nats and the natural logarithm)
# # needed to send a message containing symbols drawn from probability distribution P,
# # when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

![image.png](attachment:5a89ad1e-6bc8-43ae-9f39-59c52eff4f69.png)
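
# A minimal sketch of the discrete KL divergence, DKL(P||Q) = sum_x P(x) log(P(x) / Q(x)), in nats.
# The function name kl_divergence and the example distributions are assumptions made for illustration.
# With Q uniform over three symbols, an idealized code built for Q spends log 3 nats per symbol,
# while the entropy of P is lower, and the gap is the KL divergence.

```python
import math

def kl_divergence(p, q):
    """DKL(P||Q) in nats for two discrete distributions given as lists."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # terms with P(x) = 0 contribute nothing
        if qx == 0:
            return math.inf   # P puts mass where Q has none
        total += px * math.log(px / qx)
    return total

P = [0.5, 0.25, 0.25]
Q = [1 / 3, 1 / 3, 1 / 3]

extra_nats = kl_divergence(P, Q)
extra_bits = extra_nats / math.log(2)  # rescale nats to bits

print(extra_nats, extra_bits)
```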

In [4]:
# The KL divergence has many useful properties, most notably that it is nonnegative. 
# The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables.
# Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. 
# However, it is not a true distance measure because it is not symmetric: DKL(P||Q) ≠ DKL(Q||P) for some P and Q.
# This asymmetry means that there are important consequences to the choice of whether to use DKL(P||Q) or DKL(Q||P).

![image.png](attachment:8ef77b75-0177-4383-8b17-321cc5cad61c.png)
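
# A quick numerical check of the properties above, as a sketch on one hand-picked pair of distributions.
# The helper kl and the values of P and Q are illustrative assumptions, not taken from the source.
# It shows both directions are nonnegative but differ from each other.

```python
import math

def kl(p, q):
    """DKL(P||Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]

kl_pq = kl(P, Q)  # DKL(P||Q)
kl_qp = kl(Q, P)  # DKL(Q||P)

print(kl_pq, kl_qp)
```

# The two directions give different values here, so the choice of direction matters when fitting Q to P.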

In [5]:
# Cross Entropy:
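# # The cross-entropy is closely related to the KL divergence: H(P, Q) = -E_{x~P}[log Q(x)] = H(P) + DKL(P||Q)
# # Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because the term omitted from the KL divergence, H(P), does not depend on Q.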

![image.png](attachment:ad624fe7-fe30-4312-a790-e0c413b8c171.png)
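
# A sketch that checks the identity H(P, Q) = H(P) + DKL(P||Q) numerically and confirms that
# the cross-entropy and the KL divergence pick the same Q from a grid of candidates.
# The helper names, the target P, and the grid of candidates Q = (theta, 1 - theta) are illustrative assumptions.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x) in nats; assumes q > 0 where p > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """DKL(P||Q) in nats; assumes q > 0 where p > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

P = (0.7, 0.3)
thetas = [i / 100 for i in range(1, 100)]  # candidate values 0.01 .. 0.99

# Pick the theta that minimizes each objective over the same candidates.
best_ce = min(thetas, key=lambda t: cross_entropy(P, (t, 1.0 - t)))
best_kl = min(thetas, key=lambda t: kl(P, (t, 1.0 - t)))

Q = (0.4, 0.6)
gap = cross_entropy(P, Q) - (entropy(P) + kl(P, Q))

print(best_ce, best_kl, gap)
```

# Both objectives select the same theta because they differ only by H(P), which is constant in Q.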