# Entropy
----

Entropy is a single number that describes the shape of a probability distribution. It also indicates the amount of information the distribution carries.

### Definition of Entropy
- Consider three different probability distributions over a binary outcome.
    - Probability distribution $ Y_1 : P(Y=0) = 0.5, P(Y=1) = 0.5 $ 
    - Probability distribution $ Y_2 : P(Y=0) = 0.8, P(Y=1) = 0.2 $ 
    - Probability distribution $ Y_3 : P(Y=0) = 1.0, P(Y=1) = 0.0 $ 
- In terms of Bayesian statistics, these distributions express the following states of knowledge.
    - $Y_1$: We know nothing about Y.
    - $Y_2$: We believe Y is 0, but it may not be.
    - $Y_3$: We are 100% certain that Y is 0.
- Entropy represents this difference in information.
- **In summary**, entropy is a `numerical expression of the certainty or amount of information in the probability distribution`.
    - If the probability concentrates on one specific value and the probabilities of the other values fall, entropy decreases. If the probability spreads more evenly across the values, entropy increases.
    - In other words, entropy is a `characteristic value indicating what the probability distribution looks like`.
    - In physics, the degree to which a substance's state is dispersed is defined as entropy.
- Mathematically, entropy is a function that takes a probability distribution (pmf or pdf) as input and returns a single number.
    - For example, if a random variable Y follows a discrete probability distribution over K classes, entropy is defined as below.
    - $ H[Y] = -\sum_{k=1}^{K}p(y_k)log_2p(y_k) $
    - By convention, a term with $ p(y_k) = 0 $ contributes 0 to the sum.
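
The definition can be checked against the three binary distributions above. This is a minimal sketch using only the standard library; the function name `entropy` is chosen here for illustration and is not from the original notes.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Terms with p = 0 are skipped, following the convention 0 * log2(0) = 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

y1 = [0.5, 0.5]  # knowing nothing about Y
y2 = [0.8, 0.2]  # believing Y is 0, but not sure
y3 = [1.0, 0.0]  # certain that Y is 0

print(entropy(y1))  # 1.0 bit, the maximum for a binary outcome
print(entropy(y2))  # about 0.722
print(entropy(y3))  # 0.0, no uncertainty left
```

The more the probability concentrates on one value, the lower the result.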

----
### Gini impurity

Gini impurity is a concept similar to entropy: it also measures how spread out a probability distribution is. The difference from entropy is that it does not use a logarithm.
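
The notes do not give a formula for Gini impurity. The sketch below assumes the standard definition $ G[Y] = 1 - \sum_{k=1}^{K}p(y_k)^2 $ and uses a hypothetical helper name, `gini_impurity`.

```python
def gini_impurity(probs):
    """Gini impurity of a discrete distribution: 1 minus the sum of squared probabilities."""
    return 1.0 - sum(p * p for p in probs)

print(gini_impurity([0.5, 0.5]))  # 0.5, the maximum for a binary outcome
print(gini_impurity([0.8, 0.2]))  # about 0.32
print(gini_impurity([1.0, 0.0]))  # 0.0, a pure distribution
```

Like entropy, the value is largest for the uniform distribution and 0 when one outcome is certain.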

----
### Joint Entropy

Joint entropy is the entropy computed from the joint probability distribution of two random variables.

$$ H[X, Y] = -\sum_{i=1}^{K_x}\sum_{j=1}^{K_y}p(x_i, y_j)log_2p(x_i, y_j) $$
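
As a small illustration, the double sum can be evaluated over a joint probability table. The two tables below are made-up examples, and `joint_entropy` is a name chosen for this sketch.

```python
import math

def joint_entropy(joint):
    """Joint entropy in bits, where joint[i][j] = p(x_i, y_j)."""
    # Zero cells are skipped, following the convention 0 * log2(0) = 0.
    return sum(-p * math.log2(p) for row in joint for p in row if p > 0)

# Hypothetical joint tables: rows index x, columns index y.
independent = [[0.25, 0.25],
               [0.25, 0.25]]  # X and Y are two independent fair coins
dependent = [[0.5, 0.0],
             [0.0, 0.5]]      # Y always equals X

print(joint_entropy(independent))  # 2.0 bits
print(joint_entropy(dependent))    # 1.0 bit
```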

### Conditional Entropy

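Conditional entropy measures how much a random variable $X$ helps predict the value of another random variable $Y$: it is the uncertainty about $Y$ that remains once $X$ is known. The first formula below is the entropy of $Y$ for one specific value $x_i$, and the second is its average over all values of $X$.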

$$ H[Y | X=x_i] = -\sum_{j=1}^{K_y}p(y_j|x_i)log_2p(y_j|x_i) $$

$$ H[Y | X] = -\sum_{i=1}^{K_x}\sum_{j=1}^{K_y}p(x_i, y_j)log_2p(y_j|x_i) $$
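
The second formula can be sketched directly from a joint table, using $ p(y_j|x_i) = p(x_i, y_j) / p(x_i) $. The tables are made-up examples, and `conditional_entropy` is a hypothetical helper name.

```python
import math

def conditional_entropy(joint):
    """H[Y|X] in bits, where joint[i][j] = p(x_i, y_j)."""
    total = 0.0
    for row in joint:
        p_x = sum(row)  # marginal probability p(x_i)
        for p_xy in row:
            if p_xy > 0:
                # p(y_j | x_i) = p(x_i, y_j) / p(x_i)
                total -= p_xy * math.log2(p_xy / p_x)
    return total

independent = [[0.25, 0.25],
               [0.25, 0.25]]  # knowing X says nothing about Y
dependent = [[0.5, 0.0],
             [0.0, 0.5]]      # knowing X determines Y

print(conditional_entropy(independent))  # 1.0, the full entropy of Y remains
print(conditional_entropy(dependent))    # 0.0, no uncertainty about Y remains
```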

----
### Cross Entropy

Cross entropy is commonly used in classification problems.

- Cross entropy of two probability distributions p, q : $H[p,q]$
    - For discrete probability distributions : $ H[p,q] = -\sum_{k=1}^{K}p(y_k)log_2q(y_k) $
    - For continuous probability distributions : $ H[p,q] = -\int_y p(y)log_2q(y)dy $
- The inputs of cross entropy are not random variables. They are probability distributions (pmf or pdf).
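
A minimal sketch of the discrete case follows. The name `cross_entropy` and the example distributions are illustrative assumptions, and returning infinity when q assigns zero probability to an outcome that p allows is a convention chosen for this sketch.

```python
import math

def cross_entropy(p, q):
    """Cross entropy H[p, q] in bits for two discrete distributions over the same classes."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk == 0:
            continue  # convention: a term with p = 0 contributes 0
        if qk == 0:
            return math.inf  # q rules out an outcome that p allows
        total -= pk * math.log2(qk)
    return total

p = [0.5, 0.5]
q = [0.8, 0.2]

print(cross_entropy(p, q))  # about 1.322
print(cross_entropy(p, p))  # 1.0, which equals the entropy of p
```

Note that the arguments are the distributions themselves, not samples of a random variable.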

### Cross Entropy for Classification

Consider one example: binary classification.

- In binary classification, Y is 0 or 1.
- p is the true probability distribution of Y. Because the label is known, all of the probability sits on the observed value.
    - So, when Y=1, $ p(Y=0) = 0, p(Y=1) = 1 $
    - And when Y=0, $ p(Y=0) = 1, p(Y=1) = 0 $
- q is the probability distribution of Y that the model predicts from X. If it is a Bernoulli distribution with parameter $\mu$, then
    - $ q(Y=0) = 1 - \mu, q(Y=1) = \mu $
- Therefore, the cross entropy of p, q is
    - When Y=1, $ H[p,q] = -p(Y=0)log_2q(Y=0) -p(Y=1)log_2q(Y=1) = -log_2\mu $
    - When Y=0, $ H[p,q] = -p(Y=0)log_2q(Y=0) -p(Y=1)log_2q(Y=1) = -log_2(1-\mu) $
- We can use this as the loss function of a classification model. When Y=1, cross entropy grows as $\mu$ gets lower (the prediction gets worse). When Y=0, cross entropy grows as $\mu$ gets higher (the prediction gets worse).
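
The derivation can be checked numerically by building p as a one-hot distribution and q as a Bernoulli distribution. This is only a sketch; `cross_entropy`, `bernoulli`, and `one_hot` are names invented for the example.

```python
import math

def cross_entropy(p, q):
    """Cross entropy in bits; p and q map each outcome (0 or 1) to a probability."""
    return sum(-p[y] * math.log2(q[y]) for y in (0, 1) if p[y] > 0)

def bernoulli(mu):
    """Predicted distribution q with q(Y=1) = mu."""
    return {0: 1.0 - mu, 1: mu}

# True distribution p for each possible label: all probability on the observed value.
one_hot = {0: {0: 1.0, 1: 0.0},
           1: {0: 0.0, 1: 1.0}}

for mu in (0.9, 0.5, 0.1):
    loss_y1 = cross_entropy(one_hot[1], bernoulli(mu))  # equals -log2(mu)
    loss_y0 = cross_entropy(one_hot[0], bernoulli(mu))  # equals -log2(1 - mu)
    print(mu, round(loss_y1, 3), round(loss_y0, 3))
```

As $\mu$ falls from 0.9 to 0.1, the loss for Y=1 rises from about 0.152 to about 3.322, while the loss for Y=0 moves in the opposite direction.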

----

- If we have N data points, the average cross entropy is the log-loss : $ -\frac{1}{N}\sum_{i=1}^{N}(y_i log_2 \mu_i + (1 - y_i) log_2 (1-\mu_i)) $
- In the same way, we can handle multi-class classification.
    - categorical log-loss : $ -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}g(y_i, k)log_2p(y_i=k) $
    - $ g(y_i, k) $ means : 1 if $y_i$ == $k$, else 0. It is an indicator function.
    - $ p(y_i=k) $ means : the probability of $ y_i = k $ calculated by the classification model (e.g. softmax).
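
Both formulas can be sketched in a few lines. The function names and sample data are hypothetical, the logarithm is base 2 to match the formulas here, and the predicted probabilities are assumed to lie strictly between 0 and 1 so that every logarithm is defined.

```python
import math

def binary_log_loss(y_true, mu):
    """Average binary cross entropy in bits; mu[i] is the predicted P(y_i = 1)."""
    total = sum(y * math.log2(m) + (1 - y) * math.log2(1 - m)
                for y, m in zip(y_true, mu))
    return -total / len(y_true)

def categorical_log_loss(y_true, probs):
    """Average categorical cross entropy in bits; probs[i][k] is the predicted P(y_i = k)."""
    # The indicator g(y_i, k) keeps only the term of the true class, probs[i][y_i].
    total = sum(math.log2(p_i[y]) for y, p_i in zip(y_true, probs))
    return -total / len(y_true)

print(binary_log_loss([1, 0], [0.9, 0.1]))  # about 0.152, confident and correct
print(binary_log_loss([1, 0], [0.5, 0.5]))  # 1.0, uninformative predictions

labels = [0, 2]
predicted = [[0.5, 0.25, 0.25],
             [0.25, 0.25, 0.5]]
print(categorical_log_loss(labels, predicted))  # 1.0
```

Many libraries compute log-loss with the natural logarithm instead, which changes the value only by a constant factor.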

----
### Kullback-Leibler divergence

KL divergence is a value that measures how different the shapes of two probability distributions $p(y)$ and $q(y)$ are. It is generally written as $KL(p||q)$.

- $KL(p||q)$ = $ H[p,q] - H[p] $ (cross entropy minus the entropy of p, which is why KLD is sometimes called `relative entropy`)
- = $ \sum_{i=1}^{K}p(y_i)log_2(\frac{p(y_i)}{q(y_i)}) $ (for discrete probability distributions)
- If the two probability distributions are identical, KLD is exactly 0. Otherwise it is positive.
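
A short sketch of the discrete formula is given below. The name `kl_divergence` and the example distributions are assumptions made for illustration, as is returning infinity when q assigns zero probability to an outcome that p allows.

```python
import math

def kl_divergence(p, q):
    """KL(p||q) in bits for two discrete distributions over the same classes."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk == 0:
            continue  # convention: a term with p = 0 contributes 0
        if qk == 0:
            return math.inf  # q rules out an outcome that p allows
        total += pk * math.log2(pk / qk)
    return total

p = [0.5, 0.5]
q = [0.8, 0.2]

print(kl_divergence(p, q))  # about 0.322
print(kl_divergence(q, p))  # about 0.278, so KL is not symmetric
print(kl_divergence(p, p))  # 0.0 for identical distributions
```

The first value matches $ H[p,q] - H[p] $ for these distributions: about 1.322 minus 1.0.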

### Mutual Information

Mutual information can serve as an alternative to the `Correlation coefficient`. It is derived from the definition of independence between variables: if X and Y are independent of each other, the joint probability density function is equal to the product of the marginal probability density functions, $ p(x,y) = p(x)p(y) $.

- Mutual information is the KL divergence between $p(x,y)$ and $p(x)p(y)$.
- It measures the dependence between two random variables by measuring the difference between the joint probability density function and the product of the marginal probability density functions. If the two random variables are independent of each other, mutual information is 0.

$$ MI[X,Y] = \sum_{i=1}^{K_x}\sum_{j=1}^{K_y}p(x_i, y_j)log_2(\frac{p(x_i, y_j)}{p(x_i)p(y_j)}) $$
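
As a rough illustration, mutual information can be computed from a joint table by summing over every pair $ (x_i, y_j) $ and comparing each cell with the product of its marginals. The tables are made-up examples, and `mutual_information` is a hypothetical helper name.

```python
import math

def mutual_information(joint):
    """Mutual information in bits, where joint[i][j] = p(x_i, y_j)."""
    p_x = [sum(row) for row in joint]        # marginal of X: row sums
    p_y = [sum(col) for col in zip(*joint)]  # marginal of Y: column sums
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:
                total += p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
    return total

independent = [[0.25, 0.25],
               [0.25, 0.25]]  # p(x, y) = p(x) p(y) in every cell
dependent = [[0.5, 0.0],
             [0.0, 0.5]]      # Y always equals X

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0 bit
```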

#### references
- https://datascienceschool.net/view-notebook/d3ecf5cc7027441c8509c0cae7fea088/