## Entropy

#### In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent in the variable's possible outcomes. The concept of information entropy was introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication" and is sometimes called Shannon entropy in his honour.


#### Put more simply, entropy measures how much information an event or message carries, or equivalently how surprising it is. Computers store information in bits, and we try to store it with as few bits as possible, because every additional bit costs more storage and more energy.

#### To find the amount of information an event contains, we can use the probability of that event. When we formulate this, we get the following formula.

#### h(x) = -log2( p(x) ), where p(x) is the probability of event x occurring.

#### We use the base-2 logarithm because computers represent information in binary digits, so h(x) is measured in bits.
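
#### As a minimal sketch, the formula h(x) can be wrapped in a small helper. The function name `self_information` and the input check are illustrative assumptions, not part of the original notebook. It computes log2(1/p), which is mathematically the same as -log2(p).

```python
import math

def self_information(p: float) -> float:
    """Return the information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1/p) equals -log2(p); this form avoids printing -0.0 when p == 1
    return math.log2(1.0 / p)

# Halving the probability adds exactly one bit of information
for p in (1.0, 0.5, 0.25, 0.125):
    print(p, self_information(p))
```

#### A certain event (p = 1) carries 0 bits, and each halving of the probability adds one more bit.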

#### Let's try to understand these concepts better with examples. Let's have a fair coin.

In [8]:
import math
# p(tail)=0.5
# p(head)=0.5
# Let's calculate the amount of information (in bits) of each event
h_p_tail=-math.log2(0.5)
h_p_head=-math.log2(0.5)
print(h_p_tail, " ", h_p_head)

1.0   1.0


#### Since these two events are equally likely, they carry the same amount of information: 1 bit each.

#### Now let's assume our coin is unfair, so the probabilities of getting heads and tails are not equal.

In [16]:
# p(tail)=0.25
# p(head)=0.75
# Let's calculate the amount of information (in bits) of each event
h_p_tail=-math.log2(0.25)
h_p_head=-math.log2(0.75)
print(h_p_tail, " ", h_p_head)

2.0   0.4150374992788438


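#### As the output shows, the less probable event (tails, 2 bits) carries more information than the more probable one (heads, about 0.415 bits). The lower the probability of an event, the more surprising it is when it occurs.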

#### Now, if we want to calculate the information for a random variable X, we need the expected information over all of its possible outcomes, with each outcome weighted by its probability. For this we can use the Shannon entropy formula, which is the most commonly used entropy formula.

![Shannon Entropy Formula](https://www.walletfox.com/course/shannonSource/shannon_formula.png)
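
#### In case the image above does not render, the standard Shannon entropy of a discrete random variable X can be written out with the same base-2 logarithm as h(x). The symbols H, n, and x_i are notation chosen here for illustration and may differ from those in the image.

```latex
H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i) = \sum_{i=1}^{n} p(x_i)\,h(x_i)
```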

In [17]:
#Expected Information of Fair Coin
exp_fair_info=- (0.5*math.log2(0.5) + 0.5*math.log2(0.5))
exp_fair_info

1.0

In [18]:
#Expected Information of Unfair Coin
exp_unfair_info=- (0.25*math.log2(0.25) + 0.75*math.log2(0.75))
exp_unfair_info

0.8112781244591328

#### With the unfair coin, heads is the more likely outcome, so a flip is less surprising on average than a flip of the fair coin. Its entropy (about 0.811 bits) is therefore lower than the fair coin's 1 bit.
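
#### The two calculations above can be generalised to any discrete distribution. The following is a minimal sketch; the function name `entropy` and the check that the probabilities sum to 1 are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: they contribute nothing to the sum
    # (p * log2(p) tends to 0 as p tends to 0).
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin
print(entropy([0.25, 0.75]))  # unfair coin
print(entropy([1.0, 0.0]))    # a coin that always lands the same way
```

#### The fair coin gives the highest entropy of the three, and the coin with a certain outcome gives 0, since there is nothing left to be surprised about.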

## Cross Entropy

#### Cross-entropy is the expected amount of information (in bits) needed to describe events drawn from the true probability distribution P when we use the estimated probability distribution Q in its place. It is defined for two probability distributions over the same set of events. It equals the entropy of P when Q matches P exactly, and is larger otherwise. The cross-entropy formula is:

![Cross-Entropy Formula](https://miro.medium.com/max/700/1*koxFBp0VFEzqeiNL9diaUw.png)
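
#### In case the image above does not render, the standard cross-entropy between two discrete distributions P and Q can be written out in bits as follows. The symbols n and x_i are notation chosen here for illustration and may differ from those in the image.

```latex
H(P, Q) = -\sum_{i=1}^{n} p(x_i)\,\log_2 q(x_i)
```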

#### Let P be the true label distribution and Q be the predicted label distribution. Suppose the true label of one particular sample is B, and our classifier predicts the probabilities (0.15, 0.60, 0.25) for the classes A, B, and C.


In [34]:
# True label distribution: P(A)=0.00, P(B)=1.00, P(C)=0.00
# Predicted Label Distribution: q(A)=0.15, q(B)=0.60, q(C)=0.25
H_p_q = -(0 * math.log2(0.15) + 1 * math.log2(0.6) + 0 * math.log2(0.25))
print("Cross Entropy Loss:" , H_p_q)

Cross Entropy Loss: 0.7369655941662062


#### On the other hand, if our classifier were more confident and predicted the probabilities (0.05, 0.90, 0.05), we would get a cross-entropy of about 0.152, which is lower than in the example above.
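
#### Both predictions can be checked with one reusable function. This is a minimal sketch; the name `cross_entropy` and the list-based interface are assumptions made for illustration.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits; p and q are aligned lists of probabilities."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Terms with p_i == 0 are skipped because they contribute nothing.
    # If q_i == 0 where p_i > 0, math.log2 raises ValueError
    # (the cross-entropy is infinite in that case).
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.0, 1.0, 0.0]  # the true label is B

print(cross_entropy(true_dist, [0.15, 0.60, 0.25]))  # about 0.737
print(cross_entropy(true_dist, [0.05, 0.90, 0.05]))  # about 0.152
```

#### Because the true distribution puts all of its probability on B, the loss depends only on the probability the classifier assigns to B.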

### KL Divergence

#### KL Divergence = Cross Entropy - Entropy, that is, D_KL(P || Q) = H(P, Q) - H(P).

#### It is the amount of information we lose when we use the Q probability distribution to approximate the P probability distribution. A KL divergence of 0 indicates that the P and Q distributions are the same. The formula below applies when P and Q are discrete probability distributions.

![KL Divergence Formula](https://www.statisticshowto.com/wp-content/uploads/2016/10/kl-divergence-2.png)
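
#### The relation KL Divergence = Cross Entropy - Entropy can be checked numerically. The sketch below is illustrative only: the function names are assumptions, and the choice of P (the unfair coin from earlier) and Q (a model that wrongly assumes a fair coin) is a hypothetical pairing, not an example from the original notebook.

```python
import math

def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Discrete KL divergence in bits: sum of p_i * log2(p_i / q_i)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.25, 0.75]  # assumed true distribution: the unfair coin
q = [0.5, 0.5]    # assumed model: treats the coin as fair

direct = kl_divergence(p, q)
via_difference = cross_entropy(p, q) - entropy(p)

print(direct, via_difference)  # both about 0.189 bits
print(kl_divergence(p, p))     # identical distributions give 0.0
```

#### Note that KL divergence is not symmetric: swapping P and Q generally gives a different value.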

### References

https://en.wikipedia.org/wiki/Entropy_(information_theory)

https://medium.com/kaveai/d%C3%BCzensizlik-entropy-%C3%A7apraz-d%C3%BCzensizlik-cross-entropy-ve-kl-iraksakl%C4%B1%C4%9F%C4%B1-kl-divergence-89d26735789f

https://www.youtube.com/watch?v=ErfnhcEV1O8