# Bayesian Networks

## Cascading rule

The cascading rule is more commonly known as the chain rule of probability. Let X and Y be two dependent variables. Their joint probability is given by:

$
P(X=X_i, Y=Y_j) = P(X=X_i|Y=Y_j) \cdot P(Y=Y_j)
$

Let Z be a third dependent variable. The joint probability is given by:

$
P(X=X_i, Y=Y_j, Z=Z_k) = P(X=X_i|Y=Y_j, Z=Z_k) \cdot P(Y=Y_j, Z=Z_k)
$

Applying the same rule to the last term, we can expand further:

$
P(X=X_i, Y=Y_j, Z=Z_k) = P(X=X_i|Y=Y_j, Z=Z_k) \cdot P(Y=Y_j|Z=Z_k) \cdot P(Z=Z_k)
$
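
The same step can be repeated for any number of variables. As a sketch of the general form (the names $X_1, \dots, X_n$ are only illustrative and do not appear elsewhere in these notes), where the last factor of the product is simply $P(X_n)$:

$
P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i | X_{i+1}, \dots, X_n)
$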

## Dentist example

Consider the following joint probability:
    
$
P(Detect, Pain, Decay, Competent)
$

$P(Detect)$ is the probability that the dentist detects an anomaly, $P(Pain)$ is the probability that the patient suffers pain, $P(Decay)$ is the probability that the patient has tooth decay and $P(Competent)$ is the probability that the dentist is competent.

Applying the cascading rule, we can write:

$
P(Detect, Pain, Decay, Competent) \\
= P(Detect|Pain, Decay, Competent) \cdot P(Pain, Decay, Competent) \\
= P(Detect|Pain, Decay, Competent) \cdot P(Pain|Decay, Competent) \cdot P(Decay, Competent) \\
= P(Detect|Pain, Decay, Competent) \cdot P(Pain|Decay, Competent) \cdot P(Decay|Competent) \cdot P(Competent)
$

We can note the following:

- $P(Decay|Competent)$: the competence of the dentist should have no influence on whether the patient has tooth decay, so $P(Decay|Competent) = P(Decay)$
- $P(Pain|Decay, Competent)$: suffering pain may well depend on having tooth decay, but it should not depend on the dentist's competence, so $P(Pain|Decay, Competent) = P(Pain|Decay)$
- $P(Detect|Pain, Decay, Competent)$: whether the dentist detects an anomaly depends on the patient having tooth decay and on the dentist being competent. Once these two are known, the patient's pain is assumed to bring no further information, so $P(Detect|Pain, Decay, Competent) = P(Detect|Decay, Competent)$

We can then simplify the joint probability to:

$
P(Detect, Pain, Decay, Competent) \\
= P(Detect|Pain, Decay, Competent) \cdot P(Pain|Decay, Competent) \cdot P(Decay|Competent) \cdot P(Competent) \\
= P(Detect|Decay, Competent) \cdot P(Pain|Decay) \cdot P(Decay) \cdot P(Competent)
$

We can conclude that:

- whether the dentist detects an anomaly depends on the patient having tooth decay and on the dentist being competent
- whether the patient suffers pain depends on the patient having tooth decay
- the patient having tooth decay does not depend on any upstream variable
- the dentist being competent does not depend on any upstream variable, and is independent of the patient having tooth decay

We can report this into a graph:

<img src="./images/dentist_network.svg" style="height: 200px">

For this to be an actual Bayesian Network, we need to add the probabilities:

<img src="./images/dentist_network2.svg" style="height: 300px">

A node in this graph with no predecessor requires 1 parameter (one probability), a node with 1 predecessor requires 2 and a node with 2 predecessors requires 4 parameters. In this example the Bayesian network requires 8 parameters.

For binary variables, the number of parameters required by a node is $2^{\text{number of predecessors}}$.

As we have 4 variables, each having two possible values, the full joint distribution has $2^4 = 16$ entries. As the probabilities must sum to 1, only $2^n - 1 = 15$ of them are actually needed, the last one being deduced from the others. A Bayesian network can then be seen as a factorized view of the joint distribution: it requires only 8 probabilities instead of 15, and from these 8 probabilities we are able to infer all 16 entries.
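
The counting above can be checked with a few lines of Python. This is a minimal sketch: the `parents` dictionary encodes the dentist network as described in the text, and the function names are only illustrative.

```python
# Parents of each node in the dentist network, as described in the text.
parents = {
    "Competent": [],
    "Decay": [],
    "Pain": ["Decay"],
    "Detect": ["Decay", "Competent"],
}


def network_parameters(parents):
    """One probability per combination of parent values (binary variables)."""
    return sum(2 ** len(p) for p in parents.values())


def full_joint_parameters(n_variables):
    """Free entries of a full joint table over binary variables."""
    return 2 ** n_variables - 1


print(network_parameters(parents))          # 1 + 1 + 2 + 4
print(full_joint_parameters(len(parents)))  # 2**4 - 1
```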

## Conditional dependence

Let's consider the following Bayes Network:

<img src="./images/conditional_dependence01.svg" style="height: 200px">

The network implies that, once A is known, B and C are independent. This can be written as:

$$
B \bot C | A
$$

B and C are conditionally independent. But this does not imply that B and C are absolutely independent. Actually, if we know B, we get a more accurate idea of what A might be. And because of that, we also have a better understanding of what C could be.

We can actually compute the probability of C given B:

$
P(C|B) = P(C|B,A) \cdot P(A|B) + P(C|B,\lnot A) \cdot P(\lnot A|B)
$

This formulation reflects the following. To know C from B, we need to look at A, which can be either true or false. The probability of C knowing B is then the sum of two terms: the probability of C knowing A and B times the probability that A is true given B, plus the probability of C knowing B and not A times the probability that A is false given B. Everything is conditioned on B. Because B and C are independent once A is known, $P(C|B,A)$ reduces to $P(C|A)$ and $P(C|B,\lnot A)$ to $P(C|\lnot A)$.
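
To make this concrete, here is a small numerical sketch. The probabilities below are hypothetical (they are not taken from the figure) and the helper names are illustrative. The formula is compared with a brute-force computation on the joint distribution.

```python
from itertools import product

# Hypothetical CPTs for the network B <- A -> C (not taken from the figure).
p_a = 0.3
p_b_given_a = {True: 0.9, False: 0.2}  # P(B=true | A)
p_c_given_a = {True: 0.8, False: 0.1}  # P(C=true | A)


def bern(p_true, value):
    """Probability that a binary variable with P(true)=p_true takes `value`."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c):
    """P(A=a, B=b, C=c), factorized along the network."""
    return bern(p_a, a) * bern(p_b_given_a[a], b) * bern(p_c_given_a[a], c)


# Brute force: P(C|B) = P(B, C) / P(B), summing the joint over A.
p_bc = sum(joint(a, True, True) for a in (True, False))
p_b = sum(joint(a, True, c) for a, c in product((True, False), repeat=2))
p_c_given_b_joint = p_bc / p_b

# Formula from the text, using P(C|B,A) = P(C|A).
p_a_given_b = (joint(True, True, True) + joint(True, True, False)) / p_b
p_c_given_b_formula = (p_c_given_a[True] * p_a_given_b
                       + p_c_given_a[False] * (1.0 - p_a_given_b))

# Prior P(C), to see that knowing B changes our belief about C.
p_c = sum(joint(a, b, True) for a, b in product((True, False), repeat=2))
print(p_c, p_c_given_b_joint, p_c_given_b_formula)
```

With these numbers, both computations of $P(C|B)$ agree and differ from $P(C)$, which shows that B and C are not absolutely independent.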

## Confounding cause

Let's now consider this network:

<img src="./images/conditional_dependence02.svg" style="height: 200px">

A and B are independent. But if we know C, then A and B become linked. If there are two potential causes A and B leading to the event C and we know that C has occurred:

- if A occurred as well, then B is less likely to have occurred
- if B occurred as well, then A is less likely to have occurred

This is known as the explaining away effect and so:

- $A \bot B$: **True**: A and B are absolutely independent
- $A \bot B | C$: **False**: A and B are dependent when C is given, this is the explaining away effect.
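
The effect can be observed numerically. The sketch below uses hypothetical probabilities (not taken from the figure) for a network A -> C <- B and illustrative helper names.

```python
# Hypothetical CPTs for the network A -> C <- B (not taken from the figure).
p_a, p_b = 0.1, 0.1
p_c_given = {  # P(C=true | A, B)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.9, (False, False): 0.01,
}


def bern(p_true, value):
    """Probability that a binary variable with P(true)=p_true takes `value`."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c):
    """P(A=a, B=b, C=c), factorized along the network."""
    return bern(p_a, a) * bern(p_b, b) * bern(p_c_given[(a, b)], c)


def prob_a(b=None, c=None):
    """P(A=true | evidence); None means the variable is not observed."""
    bs = (True, False) if b is None else (b,)
    cs = (True, False) if c is None else (c,)
    num = sum(joint(True, bv, cv) for bv in bs for cv in cs)
    den = sum(joint(av, bv, cv)
              for av in (True, False) for bv in bs for cv in cs)
    return num / den


print(prob_a())                # prior P(A)
print(prob_a(b=True))          # P(A | B): unchanged, A and B are independent
print(prob_a(c=True))          # P(A | C): observing C makes A more likely
print(prob_a(b=True, c=True))  # P(A | B, C): B explains C away
```

With these numbers, knowing B alone leaves the probability of A unchanged, while knowing B in addition to C makes A much less likely than knowing C alone.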

## d-Separation

The d-separation technique consists in reading absolute or conditional independences directly from the graph of a Bayes network. D stands for dependence. Two nodes in a Bayes network are dependent (d-connected) if they are connected by an *active* path. A path is *active* when all the intermediate nodes on it are *active*. When there is no known variable (no givens), a node is *active* by default, unless it is a collider, in which case it is *inactive* by default. A collider is a node with confounding causes: on the path considered, both edges point into it (more than one upstream connection). When a node becomes part of the set of givens, its status toggles: a node with only one upstream connection on the path becomes *inactive* and a collider becomes *active*. A collider also becomes *active* when one of its descendants is given.

Let's consider the following examples from the Udacity course:

<img src="./images/d-separation01.svg" style="height: 200px">

We can say the following about this graph if there is no given:

- $C \bot E$: **False** there is a path with only active nodes connecting C and E, so C and E are d-connected.
- $B \bot D$: **False** there is a path with only active nodes connecting B and D
- $A \bot C$: **False** there is a path with only active nodes connecting A and C
- $A \bot E$: **False** there is a path with only active nodes connecting A and E

Now let's assume that the given set includes B:

- $A \bot C | B$: **True** the only path connecting A and C passes through the inactive node B. A and C are d-separated given B.
- $C \bot E | B$: **True** the only path connecting C and E passes through the inactive node B. C and E are d-separated given B.

C is only influenced by B. So if B is known, knowing A brings no additional knowledge about C, which is why A and C are independent given B. And there is no way that knowing E would bring additional knowledge about C either, which is why they are also d-separated given B.

A second example:

<img src="./images/d-separation02.svg" style="height: 200px">

We can say the following about this graph if there is no given:

- $A \bot B$: **True**: A and B are only connected through C, which is a collider on that path and is therefore inactive when the set of givens is empty. A and B are d-separated.
- $A \bot E$: **False**: A influences C which in turn influences E. In the path connecting A and E (A -> C -> E), C is not a collider as it only has 1 upstream connection on that path. A and E are d-connected, they are dependent.
- $D \bot E$: **False**: D and E are connected only through active nodes. In the path connecting D and E through C, C is not a collider (it has no upstream connection on that path). D and E are dependent (d-connected) by virtue of C.

Let's now consider that B is given:

- $A \bot E | B$: **False**: A influences C which in turn influences E. In the path connecting A and E (A -> C -> E), C is not a collider as it only has 1 upstream connection on that path. A and E are d-connected, they are dependent. The fact that B is known does not change anything: the connection between A and E does not pass through B.

Let's now consider that C is given:

- $A \bot E | C$: **True**: The only connection between A and E is through C. Now C is *inactive* because it is not a collider on that path (only one upstream connection) and it is part of the set of givens. Knowing A does not bring any additional information about E, because C is already known.
- $A \bot B | C$: **False**: The path between A and B passes through C. In that case C is a collider (two upstream connections). And because C is also part of the set of givens, C is an active node. And so A and B are d-connected. This is the **explaining away** effect. Two events A and B which are absolutely independent become dependent when an event C is known and A and B are confounding causes for C. C being given, knowing A changes the odds that B also occurred.
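
The rules above can be sketched as a small path-based checker. This is only an illustration: the edge list is an assumption about the second figure (A -> C <- B, C -> D, C -> E), the function names are hypothetical, and enumerating all paths is only practical for small graphs.

```python
# Assumed structure of the second example: A -> C <- B, C -> D, C -> E.
EDGES = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]


def descendants(node, edges):
    """All nodes reachable from `node` by following edge directions."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def undirected_paths(start, end, edges, path=None):
    """Yield every simple path from start to end, ignoring edge directions."""
    path = [start] if path is None else path
    if start == end:
        yield path
        return
    for u, v in edges:
        for here, there in ((u, v), (v, u)):
            if here == start and there not in path:
                yield from undirected_paths(there, end, edges, path + [there])


def is_active(path, edges, given):
    """A path is active when every intermediate node on it is active."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # A collider is active only if it or one of its descendants is given.
            if not ({node} | descendants(node, edges)) & given:
                return False
        elif node in given:
            # A non-collider is inactive as soon as it is given.
            return False
    return True


def d_separated(x, y, given=frozenset(), edges=EDGES):
    """True when no active path connects x and y under the given set."""
    given = set(given)
    return not any(is_active(p, edges, given)
                   for p in undirected_paths(x, y, edges))


print(d_separated("A", "B"))         # A and B with no given
print(d_separated("A", "B", {"C"}))  # A and B given C
print(d_separated("A", "E", {"C"}))  # A and E given C
```

Under the assumed edge list, the checker reproduces the answers listed above for the second example.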

Another way to see d-dependence is to consider active or inactive triplets. On the left-hand side of the following picture, the events are dependent. On the right-hand side they are independent. Filled nodes are known and empty ones are unknown.

<img src="./images/d-separation.png" style="height: 400px">

## Inferences with Bayesian networks

In this section, we will consider the following example:

<img src="./images/bayes_net01.png">

### Enumeration technique

$
P(A,B,C|D,E) = \large \frac{P(A,B,C,D,E)}{P(D,E)}
$

Here A, B and C are the query variables and D and E are the evidence. The conditional probability of the query given the evidence is the ratio of the joint probability of all these variables to the joint probability of the evidence.

Let's take an example and try to calculate the probability $P(+b|+j,+m)$, which is the probability that a burglary occurred given that John and Mary called:

$
P(+b|+j,+m) = \large \frac{P(+b,+j,+m)}{P(+j,+m)}
$

To calculate the joint probability $P(+b,+j,+m)$ using the enumeration technique, we need to sum the probabilities over the hidden variables of the problem, here $a$ and $e$:

$
P(+b,+j,+m) = \sum_{a} \sum_{e} P(+b,+j,+m, a, e)
$

Using the cascading rule for the probabilities, this leads to:

$
P(+b,+j,+m) = \sum_{a} \sum_{e} P(+b) \cdot P(e) \cdot P(a|+b,e) \cdot P(+j|a) \cdot P(+m|a)
$

Each term of the sum is the product of the conditional probabilities of every node in the Bayes network given its upstream nodes. Calling this product the factor $f$, the equation becomes:

$
P(+b,+j,+m) = \sum_{a} \sum_{e} f(e, a) \\
= f(+e,+a) + f(+e,\lnot a) + f(\lnot e,+a) + f(\lnot e,\lnot a)
$

For example:

$
f(+e,+a) = P(+b) \cdot P(+e) \cdot P(+a|+b,+e) \cdot P(+j|+a) \cdot P(+m|+a) \\
= 0.001 \cdot 0.002 \cdot 0.95 \cdot 0.9 \cdot 0.7 \\
= 0.000001197
$

This is only one of the four terms that we need to calculate $P(+b,+j,+m)$, and the denominator $P(+j,+m)$ requires the same work again for $\lnot b$, so there is still a long way to go before we can reach $P(+b|+j,+m)$.
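
The full enumeration can be sketched in Python. The values 0.001, 0.002, 0.95, 0.9 and 0.7 come from the computation above. The remaining table entries are the usual textbook values for this network and are an assumption about what the figure contains. The helper names are illustrative.

```python
from itertools import product

# CPTs of the burglary network. 0.001, 0.002, 0.95, 0.9 and 0.7 appear in the
# text; the other entries are assumed (usual textbook values for this example).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(+a | b, e)
P_J = {True: 0.9, False: 0.05}                      # P(+j | a)
P_M = {True: 0.7, False: 0.01}                      # P(+m | a)


def bern(p_true, value):
    """Probability that a binary variable with P(true)=p_true takes `value`."""
    return p_true if value else 1.0 - p_true


def factor(b, e, a, j=True, m=True):
    """Product of the CPT entries of every node for one full assignment."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


def joint_b_j_m(b):
    """P(b, +j, +m): sum the factor over the hidden variables e and a."""
    return sum(factor(b, e, a) for e, a in product((True, False), repeat=2))


numerator = joint_b_j_m(True)                          # P(+b, +j, +m)
denominator = joint_b_j_m(True) + joint_b_j_m(False)   # P(+j, +m)
print(factor(True, True, True))   # the term f(+e, +a) computed above
print(numerator / denominator)    # P(+b | +j, +m)
```

With the assumed table entries, the result is roughly 0.28. A different figure would of course give a different value.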

### Variable elimination

Enumeration in a large network can be tedious. Another technique, variable elimination, speeds up the process.

Let's take the following example:

<img src="./images/conditional_dependence03.svg" style="height: 300px">

To solve the questions we are going to use the *elimination of variables* technique. We introduce the following factors:

f1(sunny):

sunny | p
------|----
T     | 0.7
F     | 0.3

f2(raise):

raise | p
------|----
T     | 0.01
F     | 0.99

f3(happy, sunny, raise):

happy | sunny | raise | p
------|-------|-------|-----
T     | T     | T     | 1.0
T     | T     | F     | 0.7
T     | F     | T     | 0.9
T     | F     | F     | 0.1
F     | T     | T     | 0.0
F     | T     | F     | 0.3
F     | F     | T     | 0.1
F     | F     | F     | 0.9

The factors are tables of probabilities.

The probability $P(R|S)$ that I got a raise knowing that it is sunny is 0.01, which is simply the probability of getting a raise: the events *raise* and *sunny* are independent.

What is the probability $P(R|H, S)$ of getting a raise knowing that I am happy and it is sunny?

f3 is the only factor that depends on H. As we know that H is true, we can keep only the rows of the table where happy is T:

f3'(sunny, raise):

sunny | raise | p
------|-------|-----
T     | T     | 1.0
T     | F     | 0.7
F     | T     | 0.9
F     | F     | 0.1

We also know that it is sunny, but we have two factors f3' and f1 depending on S, so we multiply them row by row, matching the rows on the value of *sunny*. This new factor is f4:

f4(sunny, raise):

sunny | raise | f3' | f1   | p
------|-------|-----|------|------
T     | T     | 1.0 | 0.7  | 0.7
T     | F     | 0.7 | 0.7  | 0.49
F     | T     | 0.9 | 0.3  | 0.27
F     | F     | 0.1 | 0.3  | 0.03

And because we know that it is sunny, we keep only the rows where sunny is T. This removes the variable from the factor and gives f4':

f4'(raise):

raise | p
------|-----
T     | 0.7
F     | 0.49

To get the probability $P(R|H,S)$ we also need to take into account every factor depending on *raise*, again multiplying them row by row. This results in the f5 factor:

f5(raise):

raise | f4'  | f2   | p
------|------|------|-----
T     | 0.7  | 0.01 | 0.007
F     | 0.49 | 0.99 | 0.4851

We need to normalize the probability so:

$
P(R|H,S) = \large \frac{0.007}{0.007 + 0.4851} = 0.0142
$

Let's now introduce f6, which is the product of all factors:

f6(happy, raise, sunny):

happy | sunny | raise | p
------|-------|-------|-----
T     | T     | T     | 1.0 * 0.7 * 0.01 = 0.007
T     | T     | F     | 0.7 * 0.7 * 0.99 = 0.4851
T     | F     | T     | 0.9 * 0.3 * 0.01 = 0.0027
T     | F     | F     | 0.1 * 0.3 * 0.99 = 0.0297
F     | T     | T     | 0.0 * 0.7 * 0.01 = 0.0
F     | T     | F     | 0.3 * 0.7 * 0.99 = 0.2079
F     | F     | T     | 0.1 * 0.3 * 0.01 = 0.0003
F     | F     | F     | 0.9 * 0.3 * 0.99 = 0.2673

What is the probability $P(R|H)$ of getting a raise if I am happy? From the factor f6 table, we can eliminate the rows where H is False:

f6'(raise, sunny):

sunny | raise | p
------|-------|-----
T     | T     | 0.007
T     | F     | 0.4851
F     | T     | 0.0027
F     | F     | 0.0297

As we know nothing about the weather, we sum out *sunny* by adding the probabilities for each value of raise:

f6''(raise):

raise | p
------|-----
T     | 0.007 + 0.0027 = 0.0097
F     | 0.4851 + 0.0297 = 0.5148

And so:

$
P(R|H) = \large \frac{0.0097}{0.0097 + 0.5148} = 0.0185
$

If we don't know about the weather, the probability of getting a raise if I'm happy is 0.0185. The probability of getting a raise drops to 0.0142 if we also know that it's sunny, because the sunny weather already explains why I am happy.

When making an inference like this in a Bayesian network:

- H is called the **evidence**: this is what we know
- R is the **query**: this is what we are looking for
- S is a **hidden variable**: we know nothing about it and we are not interested in its value
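
The factor operations used in this section (restrict to the evidence, multiply, sum out the hidden variable, normalize) can be sketched as follows. The numbers come from the tables above. The dictionary layout and function names are only illustrative.

```python
# Factors from the tables above, keyed by variable values.
f1 = {True: 0.7, False: 0.3}      # P(sunny)
f2 = {True: 0.01, False: 0.99}    # P(raise)
f3_happy = {                      # P(happy=T | sunny, raise), keyed by (sunny, raise)
    (True, True): 1.0, (True, False): 0.7,
    (False, True): 0.9, (False, False): 0.1,
}


def normalize(factor):
    """Rescale a factor so that its entries sum to 1."""
    total = sum(factor.values())
    return {key: value / total for key, value in factor.items()}


def raise_given_happy(sunny=None):
    """P(raise | happy=T), or P(raise | happy=T, sunny) when `sunny` is set."""
    # When sunny is unknown it is a hidden variable and is summed out.
    sunny_values = (True, False) if sunny is None else (sunny,)
    unnormalized = {
        r: sum(f3_happy[(s, r)] * f1[s] * f2[r] for s in sunny_values)
        for r in (True, False)
    }
    return normalize(unnormalized)


print(round(raise_given_happy(sunny=True)[True], 4))  # P(R | H, S)
print(round(raise_given_happy()[True], 4))            # P(R | H)
```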

### Approximate inference

For very large networks, even variable elimination might take a long time to process. Another technique consists in making an approximate inference, where the approximation is obtained by sampling.

Let's consider the following example:

<img src="./images/bayes_net02.svg" style="height: 300px">

A first example would be to calculate the probability that the grass is wet $P(W)$:

- we start off with a variable for which all parents are defined, in this case variable C, which has no parent
- we randomly select the C variable so that it has a 50% chance of being cloudy: let's say it's cloudy
- we randomly select the S variable, restricted to the case where it's cloudy. There is a 10% chance that the sprinklers are turned on, so let's assume that they are turned off.
- we randomly select the R variable following the same process as with S. There is an 80% chance that it's rainy if it's cloudy, so let's assume that it is actually raining.
- finally, we randomly select the W variable taking into account that the sprinklers are turned off and that it's rainy. That leaves us with a probability of 90% that the grass is wet, so let's assume it is actually wet.
- we repeat the same process over and over.
- we can then calculate the probability that the grass is wet by dividing the number of samples for which the grass was wet by the total number of samples.
- this process is consistent: as the number of samples grows, the estimate converges to the value that an exact inference in the Bayesian network would give.
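
The sampling procedure above can be sketched as follows. The values 0.5, 0.1, 0.8 and 0.9 come from the text. The other table entries are the usual textbook values for this network and are an assumption about what the figure contains. The function names are illustrative.

```python
import random

# Sprinkler network CPTs. 0.5, 0.1, 0.8 and 0.9 appear in the text; the other
# entries are assumed (usual textbook values for this example).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}   # P(S=T | C)
P_R = {True: 0.8, False: 0.2}   # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}  # P(W=T | S, R)


def prior_sample(rng):
    """Sample every variable after its parents: C, then S and R, then W."""
    c = rng.random() < P_C
    s = rng.random() < P_S[c]
    r = rng.random() < P_R[c]
    w = rng.random() < P_W[(s, r)]
    return {"C": c, "S": s, "R": r, "W": w}


def estimate_wet(n_samples, seed=0):
    """Fraction of samples in which the grass is wet."""
    rng = random.Random(seed)
    wet = sum(prior_sample(rng)["W"] for _ in range(n_samples))
    return wet / n_samples


print(estimate_wet(20_000))  # approaches the exact P(W) as n_samples grows
```

With the assumed tables, the exact value of $P(W)$ is 0.6471, so the estimate should land close to it.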

In a second example, we want to calculate the conditional probability $P(W|\lnot C)$. We can take the samples generated earlier and reject all those for which the weather was cloudy. We can then calculate the probability of the grass being wet in the same way as before, using only the remaining samples. This procedure, known as rejection sampling, is still consistent.
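
A minimal sketch of this rejection step is shown below. As before, the table entries other than 0.5, 0.1, 0.8 and 0.9 are assumed textbook values, and the function names are illustrative.

```python
import random

# Sprinkler network CPTs. 0.5, 0.1, 0.8 and 0.9 appear in the text; the other
# entries are assumed (usual textbook values for this example).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}   # P(S=T | C)
P_R = {True: 0.8, False: 0.2}   # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}  # P(W=T | S, R)


def prior_sample(rng):
    """Sample every variable after its parents: C, then S and R, then W."""
    c = rng.random() < P_C
    s = rng.random() < P_S[c]
    r = rng.random() < P_R[c]
    w = rng.random() < P_W[(s, r)]
    return {"C": c, "S": s, "R": r, "W": w}


def rejection_estimate(query, evidence, n_samples, seed=0):
    """Estimate P(query=T | evidence) from the samples matching the evidence."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n_samples):
        sample = prior_sample(rng)
        if all(sample[var] == val for var, val in evidence.items()):
            kept += 1
            hits += sample[query]
    return hits / kept, kept


estimate, kept = rejection_estimate("W", {"C": False}, 20_000)
print(estimate, kept)  # estimate of P(W | not C) and number of samples kept
```

Only about half of the samples are kept here, since the weather is not cloudy in roughly 50% of them.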

But let's assume that the evidence is very unlikely, for instance that the probability that the weather is not cloudy is really small. We would have to reject a lot of samples, and we might be left with too few of them to get an accurate value for the probability.

We can then proceed with likelihood weighting. Instead of randomly selecting all the variables, we fix the given values. The problem is that this results in a sampling that is inconsistent. To make it consistent, we need to add a weight to each sample.

In a third example, we are going to calculate the conditional probability $P(R|S,W)$:

- we initialize the weight of the sample to 1.0
- we start off with a variable for which all parents are defined, in this case variable C
- we randomly select the C variable so that it has a 50% chance of being cloudy: let's say it's cloudy
- the S variable is fixed: it is a given and the upstream node C has been set. We multiply the weight of the sample by the probability P(S|C) because this is the row matching the constraints that we have.
- R is randomly generated taking into account that it is cloudy. Let's assume it's raining.
- the W variable is fixed: it is a given and both parents S and R have been set. We multiply the weight of the sample by P(W|R,S) because it is the row that we are constrained to choose.
- we repeat the process over and over
- we calculate the conditional probability taking into account the weight of each sample.
- this process is consistent: as the number of samples grows, the weighted estimate converges to the value that an exact inference in the Bayesian network would give.
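
The weighting steps above can be sketched as follows for $P(R|S,W)$. The table entries other than 0.5, 0.1, 0.8 and 0.9 are assumed textbook values, and the function names are illustrative.

```python
import random

# Sprinkler network CPTs. 0.5, 0.1, 0.8 and 0.9 appear in the text; the other
# entries are assumed (usual textbook values for this example).
P_C = 0.5
P_S = {True: 0.1, False: 0.5}   # P(S=T | C)
P_R = {True: 0.8, False: 0.2}   # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}  # P(W=T | S, R)


def weighted_sample(rng):
    """One sample with S and W fixed to true, returned with its weight."""
    weight = 1.0
    c = rng.random() < P_C      # C is sampled
    weight *= P_S[c]            # S is evidence: multiply by P(+s | c)
    r = rng.random() < P_R[c]   # R is sampled
    weight *= P_W[(True, r)]    # W is evidence: multiply by P(+w | +s, r)
    return r, weight


def estimate_rain(n_samples, seed=0):
    """Weighted estimate of P(R=T | S=T, W=T)."""
    rng = random.Random(seed)
    rain_weight = total_weight = 0.0
    for _ in range(n_samples):
        r, weight = weighted_sample(rng)
        total_weight += weight
        if r:
            rain_weight += weight
    return rain_weight / total_weight


print(estimate_rain(50_000))
```

With the assumed tables, the exact value of $P(R|S,W)$ is about 0.32. Note that no sample is rejected: every sample contributes, but with a different weight.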

In a fourth example, we want to calculate the conditional probability $P(C|R,S)$:

- R and S are fixed by the evidence and the sample weight is updated accordingly, while W is randomly selected given R and S
- but the C variable is generated first, completely at random, without taking into account the evidence that we have on its children. Many samples may then carry a very small weight. This is a limitation of this random sampling technique.

An alternative sampling method exists: Gibbs sampling, which is a Markov Chain Monte Carlo algorithm. [Wikipedia](https://en.wikipedia.org/wiki/Gibbs_sampling) has a dedicated article.

## References

[MathJax tutorial](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference)

[Introduction to Bayesian Networks](https://www.youtube.com/watch?v=OjlC-4iIndU)

[Bayesian inference](https://www.youtube.com/watch?v=dvi2k7OzBHA)

[d-separation](http://web.mit.edu/jmn/www/6.034/d-separation.pdf) examples and [d-Separation](http://bayes.cs.ucla.edu/BOOK-2K/d-sep.html) without tears. But I found this [one](http://www.andrew.cmu.edu/user/scheines/tutor/d-sep.html#explanation) clearer to me and this is what I have reported in the d-Separation section.

[Gibbs sampling](https://en.wikipedia.org/wiki/Gibbs_sampling) on Wikipedia.