By The End Of This Session You Should Be Able To:
----

- Write a joint and conditional probability distribution for a given dataset
- Define and calculate the following Information Theory concepts:
    - Joint entropy
    - Conditional entropy

Probability Distributions: Joint vs Conditional
--------



Joint probability distribution: p(x,y)

Conditional probability distribution: p(y|x)

Given the data:    
&nbsp;&nbsp;&nbsp;&nbsp;(x, y)     
&nbsp;&nbsp;&nbsp;&nbsp;\-\-\-\-\-  
&nbsp;&nbsp;&nbsp;&nbsp;(1, 0)   
&nbsp;&nbsp;&nbsp;&nbsp;(1, 0)   
&nbsp;&nbsp;&nbsp;&nbsp;(2, 0)   
&nbsp;&nbsp;&nbsp;&nbsp;(2, 1)   

Make the following tables:

- __Joint probability distribution p(x,y)__ 
- __Conditional probability distribution p(y|x)__

__Joint probability distribution p(x,y)__:

| __p(x,y)__ | y=0 | y=1 |  
|:-------:|:------:|:------:|
| x=1 | ?  | ?  |
| x=2 | ? | ? |


__Conditional probability distribution p(y|x)__:

| __p(y &#124; x)__ | y=0 | y=1 |  
|:-------:|:------:|:------:|
| x=1 | ? | ? |
| x=2 | ? | ? |

| __p(x,y)__ | y=0 | y=1 |  
|:-------:|:------:|:------:|
| x=1 | 1/2 | 0 |
| x=2 | 1/4 | 1/4 |

| __p(y &#124; x)__ | y=0 | y=1 |  
|:-------:|:------:|:------:|
| x=1 | 1 | 0 |
| x=2 | 1/2 | 1/2 |
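
The two answer tables can be reproduced by counting. The sketch below is one possible way to do it, using only the standard library; the variable names (`data`, `joint`, `conditional`) are illustrative choices, not part of the original exercise.

```python
from collections import Counter
from fractions import Fraction

# The four (x, y) observations from the data listing above.
data = [(1, 0), (1, 0), (2, 0), (2, 1)]

n = len(data)
pair_counts = Counter(data)             # how often each (x, y) pair occurs
x_counts = Counter(x for x, _ in data)  # how often each x occurs

# Joint distribution: p(x, y) = count(x, y) / n
joint = {pair: Fraction(c, n) for pair, c in pair_counts.items()}

# Conditional distribution: p(y | x) = count(x, y) / count(x)
conditional = {
    (x, y): Fraction(c, x_counts[x]) for (x, y), c in pair_counts.items()
}

for (x, y), p in sorted(joint.items()):
    print(f"p(x={x}, y={y}) = {p}   p(y={y} | x={x}) = {conditional[(x, y)]}")
```

Pairs that never occur, such as (1, 1), are simply absent from the dictionaries; they correspond to the 0 entries in the tables.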

Joint Entropy 
-----
The joint entropy of two discrete random variables X and Y is the entropy of their pairing: (X, Y).

<center><img src="images/joint.png" width="80%"/></center>
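
As a minimal sketch, assuming the usual definition H(X, Y) = -Σ p(x, y) log2 p(x, y) with entropy measured in bits, the joint entropy of the small p(x, y) table from the earlier exercise can be computed as follows. The function name `joint_entropy` is an illustrative choice.

```python
from math import log2

def joint_entropy(joint):
    """Entropy in bits of a joint distribution given as {(x, y): probability}."""
    # Cells with p = 0 are skipped: 0 * log(0) is taken to be 0.
    return sum(-p * log2(p) for p in joint.values() if p > 0)

# p(x, y) from the earlier exercise.
joint = {(1, 0): 1/2, (1, 1): 0.0, (2, 0): 1/4, (2, 1): 1/4}

print(joint_entropy(joint))  # 1.5
```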

Comparing Independent Joint Probability and  Joint Entropy
-------

If X and Y are independent:

X ⊥ Y then __P(X, Y)__ = P(X)P(Y)

Our uncertainty is maximal:

X ⊥ Y then __H(X, Y)__ = H(X) + H(Y)

__NOTE__: You multiply independent probabilities and add their entropies. Taking the log makes for simpler math!

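Our joint uncertainty is at least as large as either marginal uncertainty:

H(X, Y) ≥ max(H(X), H(Y)) ≥ 0

Considering variables jointly never takes more information than considering them separately, with equality only when X and Y are independent: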

H(X, Y) ≤  H(X) + H(Y)
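
These bounds can be checked numerically. The sketch below reuses the small p(x, y) table from the earlier exercise and also builds a hypothetical independent joint distribution with the same marginals, for comparison; the helper `entropy` is an illustrative name.

```python
from math import log2

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(-p * log2(p) for p in probs if p > 0)

joint = {(1, 0): 1/2, (1, 1): 0.0, (2, 0): 1/4, (2, 1): 1/4}

# Marginals p(x) and p(y), obtained by summing out the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_X = entropy(p_x.values())
H_Y = entropy(p_y.values())
H_XY = entropy(joint.values())

# Hypothetical independent joint with the same marginals: p(x) * p(y).
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
H_indep = entropy(independent.values())

print(f"H(X)={H_X:.3f}  H(Y)={H_Y:.3f}  H(X,Y)={H_XY:.3f}")
print(f"independent case: H(X,Y)={H_indep:.3f}  H(X)+H(Y)={H_X + H_Y:.3f}")
```

The observed joint entropy sits between max(H(X), H(Y)) and H(X) + H(Y), while the independent version reaches the upper bound.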

Conditional entropy
-----

H(Y | X) is how much uncertainty is left in Y, once X is known.

__or__

Quantifies the amount of information needed to describe the outcome of a random variable Y given that the value of another random variable X is known. 

Conditional entropy formula
-----

<center>H(Y | X) = H(X, Y) - H(X)</center>

<br>
<center><img src="images/conditional.svg" width="80%"/></center>
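
A short sketch of the formula H(Y | X) = H(X, Y) - H(X), applied to the p(x, y) table from the earlier exercise. As a cross-check it also evaluates the direct form -Σ p(x, y) log2 p(y | x), which is assumed here to be the standard equivalent definition.

```python
from math import log2

# p(x, y) from the earlier exercise; the zero-probability cell is omitted.
joint = {(1, 0): 1/2, (2, 0): 1/4, (2, 1): 1/4}

# Marginal p(x).
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

H_XY = sum(-p * log2(p) for p in joint.values())
H_X = sum(-p * log2(p) for p in p_x.values())

# Chain-rule form from the slide.
H_Y_given_X = H_XY - H_X

# Direct form, using p(y | x) = p(x, y) / p(x).
H_direct = sum(-p * log2(p / p_x[x]) for (x, _), p in joint.items())

print(H_Y_given_X, H_direct)  # 0.5 0.5
```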

Information never hurts
------

H(Y | X) ≤ H(Y) 

Conditioning on data decreases our uncertainty<sup>*</sup>.

<sub>* Or at least never increases it, on average. For one specific value x, H(Y | X=x) can be larger than H(Y).</sub>

https://en.wikipedia.org/wiki/Conditional_entropy

2 Kinds of Conditional Entropy
------

Entropy can be conditioned on

1. A random variable __H(Y | X)__
2. Random variable taking a certain value __H(Y|X=x)__

Care should be taken not to confuse these two definitions of conditional entropy

Generally the specific conditional entropy of Y given a specific value x of X is more useful:

H(Y | X=x)
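
To make the distinction concrete, the sketch below computes H(Y | X=x) for each row of the p(y | x) table from the earlier exercise, then averages the rows with weights p(x) to get H(Y | X). The variable names are illustrative.

```python
from math import log2

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Rows of the p(y | x) table, plus the marginals p(x) and p(y).
p_y_given_x = {1: [1.0, 0.0], 2: [0.5, 0.5]}
p_x = {1: 0.5, 2: 0.5}
p_y = [0.75, 0.25]

# Specific conditional entropy: one number per value x.
specific = {x: entropy(row) for x, row in p_y_given_x.items()}

# Conditional entropy: the p(x)-weighted average of the specific values.
H_Y_given_X = sum(p_x[x] * specific[x] for x in p_x)
H_Y = entropy(p_y)

for x, h in specific.items():
    print(f"H(Y | X={x}) = {h:.3f}")
print(f"H(Y | X) = {H_Y_given_X:.3f}   H(Y) = {H_Y:.3f}")
```

In this data, H(Y | X=2) is 1 bit, which is larger than H(Y) at roughly 0.811 bits, while the average H(Y | X) is 0.5 bits. A single observed value can raise uncertainty even though the average never does.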

Check for understanding
------

If H(Y | X = x) = 0, then what does x tell us about Y?

x accounts for all the uncertainty of Y.

If we know the value of x we know the value of Y.

The Rosetta Stone of IT <br>(What would DS be without Venn diagrams!)
-----

<center><img src="https://www.researchgate.net/profile/Alejandro_Villaverde/publication/261443288/figure/fig1/AS:213423871795203@1427895624743/Graphical-representation-of-the-entropies-HX-HY-joint-entropy-HX-Y.png" width="75%"/></center>

- H(X) - Entropy of X
- H(Y) - Entropy of Y
- H(X,Y) - Joint Entropy
- H(X|Y) - Conditional entropy of X given Y
- H(Y|X) - Conditional entropy of Y given X
- I(X;Y) - Mutual information

Further Study for Information Theory
-----

- Loss metric (e.g., binary cross entropy)

- Pointwise Mutual Information (PMI)

- Maximum Entropy Modeling / MaxEnt

- Markov chains/fields for sufficient statistics

- Kolmogorov complexity for minimum message length

Summary
----

- Building from joint and conditional probability distributions, we can find the corresponding entropies.
- Joint entropy is the entropy of two random variables taken as a pair.
- Conditional entropy is the entropy that remains in one variable once the other is known.
- This is just an overview of the landscape. Applying IT successfully requires practice.

Bonus Material
----

Example Problem
------

X(n) will be Success (or 1) when n is even.  
Y(n) will be Success (or 1) when n is prime.

Let's make the outcome table:

| n | 1 |   2 |  3 |  4 |  5 |  6 |  7 |  8 |
|:-------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| X | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| Y | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |

Joint distribution
------
|__X__/__Y__ | 0 | 1 |  
|:-------:|:------:|:------:|
| 0 |1/8 | 3/8 |
| 1| 3/8 | 1/8 |  

```python
from numpy import log2

# Sum -p * log2(p) over the four cells of the joint table above.
H_X_Y = -((1/8)*log2(1/8)+
          (3/8)*log2(3/8)+
          (3/8)*log2(3/8)+
          (1/8)*log2(1/8))
print(f"{H_X_Y:.3}")
```

```
1.81
```


Mutual Information
-----
<br>
<center><img src="https://upload.wikimedia.org/wikipedia/commons/9/91/Mutual_Information_Examples.svg" width="50%"/></center>

Measures the mutual dependence between the two variables.


Mutual Information
-----

Measures the amount of information that can be obtained about one random variable by observing another.



Mutual Information Formula
-----

<center><img src="images/mutual.svg" width="80%"/></center>

where SI (Specific mutual Information) is the pointwise mutual information

The mutual information between random variables X and Y is a function of their joint probability mass function p(x,y) and marginal probability mass functions p(x) and p(y).

Mutual Information Property
-----

I(X; Y) = H(X) - H(X|Y)

Knowing Y, we can save an average of I(X; Y) bits in encoding X compared to __not__ knowing Y.
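
As a rough sketch, the property can be checked on the even/prime example from the bonus material. The first computation assumes the standard definition I(X; Y) = Σ p(x, y) log2( p(x, y) / (p(x) p(y)) ); the second uses H(X) - H(X | Y), with H(X | Y) obtained as H(X, Y) - H(Y).

```python
from math import log2

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Joint distribution of X (n is even) and Y (n is prime) for n = 1..8.
joint = {(0, 0): 1/8, (0, 1): 3/8, (1, 0): 3/8, (1, 1): 1/8}

# Marginals p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Definition-based mutual information.
I_def = sum(
    p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0
)

# Entropy-based form: I(X; Y) = H(X) - H(X | Y).
H_X = entropy(p_x.values())
H_X_given_Y = entropy(joint.values()) - entropy(p_y.values())
I_entropy = H_X - H_X_given_Y

print(f"{I_def:.4f} {I_entropy:.4f}")
```

Both routes give about 0.189 bits: knowing whether n is prime saves a little under a fifth of a bit when encoding whether n is even.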

Check for understanding
-----

If I(X; Y) = 0, what do we know about X and Y?

X and Y are independent of each other.

- Mutual information is a measure of the mutual dependence between the two variables.

Entropy for Search Engine Results Page (SERP)
-----

<center><img src="http://searchengineland.com/figz/wp-content/seloads/2012/10/Twitter-Search-Screenshot-.png" width="30%"/></center>
"Clickology" is one part of Data Science

Click entropy
-----

Entropy can apply to the probability of clicking.

What does it mean that there is __low click entropy__ for SERP?

If almost everyone clicks on that result, that query's click entropy is low.
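
A small sketch of how click entropy might be computed for a single query. The click counts below are made up for illustration, and `click_entropy` is a hypothetical helper, not a function from any search library.

```python
from math import log2

def click_entropy(clicks):
    """Entropy in bits of the click distribution over a query's results."""
    total = sum(clicks)
    return sum(-(c / total) * log2(c / total) for c in clicks if c > 0)

# Hypothetical click counts per result position for two queries.
navigational = [97, 1, 1, 1]   # almost everyone clicks the first result
ambiguous = [25, 25, 25, 25]   # clicks spread evenly over four results

print(f"navigational: {click_entropy(navigational):.3f} bits")
print(f"ambiguous:    {click_entropy(ambiguous):.3f} bits")
```

The evenly split query reaches the maximum for four results, 2 bits, while the query with one dominant result stays well below 1 bit.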

Check for understanding
------

Examples of __low click entropy__ for SERP?

- Factual queries
- Unambiguous queries
- Trending queries
- High precision results



What would it mean if there was high entropy for a SERP?
-------

High entropy means clicks are spread nearly uniformly across the results.

__PROTIP__: If you already have a good search system and still see high click entropy for some searches, consider personalizing the search results.

Cross entropy
----


H(p, q)

https://www.youtube.com/watch?v=tRsSi_sqXjI

Cross Entropy Formula
-----

<center><img src="images/cross.png" height="500"/></center>

Mean Cross Entropy (MXE)
------


Used when predicting the probability that an example is positive.

It can be proven that in this setting minimizing the cross entropy gives the maximum likelihood hypothesis. 



<center><img src="images/mxe.png" height="500"/></center>
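
A minimal sketch of mean cross entropy for binary labels, assuming the common form MXE = -(1/N) Σ [ y log(p) + (1 - y) log(1 - p) ] with natural logarithms. The function name and the `eps` clipping parameter are illustrative additions to avoid log(0).

```python
from math import log

def mean_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross entropy (in nats) over a list of examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # mostly right and fairly sure
uncertain = [0.5, 0.5, 0.5, 0.5]   # always predicts 50/50

print(f"confident: {mean_cross_entropy(labels, confident):.3f}")
print(f"uncertain: {mean_cross_entropy(labels, uncertain):.3f}")
```

Predictions that put more probability on the true label score a lower MXE; always predicting 0.5 gives log(2), about 0.693.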

Softmax classifier
-----

loss function

A binary logistic regression classifier generalized to multiple classes.

Trained to minimize the cross-entropy loss.

<center><img src="images/soft.png" height="500"/></center>

Categorical Cross-Entropy Loss
-----

The categorical cross-entropy loss is also known as the negative log likelihood. 

Measures the dissimilarity between two probability distributions, 
typically the true labels and the predicted labels. 

It is given by L = - sum(y * log(y_prediction)) 
where y is the probability distribution of true labels (typically a one-hot vector) 
and y_prediction is the probability distribution of the predicted labels, 
often coming from a softmax.
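
The formula L = - sum(y * log(y_prediction)) can be sketched in plain Python as below. The raw scores are made-up values, and the softmax shown is the standard exponentiate-and-normalize form (with the maximum subtracted for numerical stability), assumed here rather than taken from the slides.

```python
from math import exp, log

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_true, y_pred):
    """L = -sum(y * log(y_prediction)), skipping classes where y is 0."""
    return sum(-t * log(p) for t, p in zip(y_true, y_pred) if t > 0)

scores = [2.0, 1.0, 0.1]   # hypothetical raw scores for three classes
y_true = [1, 0, 0]         # one-hot vector: the true class is the first

y_pred = softmax(scores)
loss = categorical_cross_entropy(y_true, y_pred)
print([round(p, 3) for p in y_pred], round(loss, 3))
```

With a one-hot `y_true`, the loss reduces to the negative log of the probability assigned to the true class, which is why it is also called the negative log likelihood.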

Cross Entropy
------

The cross entropy between two probability distributions over the same underlying set of events measures 
the average number of bits needed to identify an event drawn from the set, 
if a coding scheme is used that is optimized for an "unnatural" probability distribution q,
rather than the "true" distribution p.

The cross entropy for the distributions p and q over a given set is defined as follows:
<center><img src="images/cross_def.png" height="500"/></center>
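
A short sketch, assuming the usual discrete definition H(p, q) = -Σ p(x) log2 q(x). The two distributions below are made up for illustration.

```python
from math import log2

def cross_entropy(p, q):
    """H(p, q) in bits: events follow p, but the code is optimized for q."""
    return sum(-pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.25, 0.25, 0.5]   # "unnatural" distribution used to build the code

print(cross_entropy(p, p))  # 1.5, the entropy H(p)
print(cross_entropy(p, q))  # 1.75
```

Coding with the wrong distribution costs an extra 0.25 bits per event on average in this example; the cross entropy is never smaller than the entropy of p itself.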

Check for understanding
-----

What is H(X)?  
What is H(Y)?  
What is H(X)+H(Y)?  

H(X) = -((1/2)*log2(1/2) + (1/2)*log2(1/2)) = 1  
H(Y) = -((1/2)*log2(1/2) + (1/2)*log2(1/2)) = 1  
H(X)+H(Y) = 2

There is less entropy in the joint distribution than in the two marginals combined:

(H(X, Y) ≈ 1.811) < (H(X) + H(Y) = 2)

Mutual information is symmetric
------
<br>
<center><img src="images/sym.svg" height="500"/></center>

Mutual Information
-----

Mutual information is the communication rate in the presence of noise. 

Mutual Information
-----
<br>
<center><img src="images/mutual_information.jpg" height="500"/></center>

<center><img src="images/proof.png" height="500"/></center>

<br>
<br>