In [1]:
import matplotlib.pyplot as plt
from scipy.stats import binom
import numpy as np
import math
import scipy.stats as stats
# to use hedgehog, one needs to install two packages vose and hedgehog
# pip install git+https://github.com/MaxHalford/vose
# pip install git+https://github.com/MaxHalford/hedgehog
# you may also need to install graphviz to plot PGM
# conda install -c conda-forge python-graphviz 
import hedgehog as hh
import pandas as pd
from scipy.special import logsumexp
from IPython.display import Markdown as md
def hide_code_in_slideshow():   
    from IPython import display
    import binascii
    import os
    uid = binascii.hexlify(os.urandom(8)).decode()    
    html = """<div id="%s"></div>
    <script type="text/javascript">
        $(function(){
            var p = $("#%s");
            if (p.length==0) return;
            while (!p.hasClass("cell")) {
                p=p.parent();
                if (p.prop("tagName") =="body") return;
            }
            var cell = p;
            cell.find(".input").addClass("hide-in-slideshow")
        });
    </script>""" % (uid, uid)
    display.display_html(html, raw=True)
#  a hack to hide code from cell: https://github.com/damianavila/RISE/issues/32    

In [2]:
%%html
<style>
 .container.slides .celltoolbar, .container.slides .hide-in-slideshow {
    display: None ! important;
}
</style>

In [3]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

# CS5010 Artificial Intelligence Principles
### Lecture 12 Uncertainty 3
#### Bayesian (Belief) Networks 

Lei Fang

University of St Andrews

# Last time


* Probabilistic inference: $P(\text{Query}|\text{Evidence})$

* Conditional independence assumption: $P(X_1, X_2|C) = P(X_1|C) P(X_2|C)$ or equivalently $P(X_1|C, X_2) = P(X_1|C)$
  * e.g. once you know you are infected with COVID, the results of two tests become independent 
  * marginally, they are dependent! One positive result makes the other more likely:
  $P(X_1, X_2)\neq P(X_1)P(X_2)$

# This time

- Bayesian network: a graphical representation of probabilistic model
  * also known as: directed probabilistic graphical model, Bayes net, Bayesian belief network ...
  * easier for humans to build models with
  * makes automated machine inference/learning possible (automated inference $\Rightarrow$ AI)
 
- Exact inference of Bayesian network

# Recap: coin tossing example 

Two coins with $p_1 = 0.5$ and $p_2 = 0.2$ (the probability of `head`): your friend randomly picks one, flips that coin three times and records the results. 

Random variables: $C, Y_1, Y_2, Y_3$
  * $C = 1,2$: coin choice
  * $Y_1, Y_2, Y_3$: tossing results, each either `head` or `tail`

Remember the notation: 
  * capital letters, e.g. $C$, $Y_1$, are random variables, 
  * lower-case values, e.g. `head`, `tail`, are the realisations, i.e. the values the r.v.s take

# Recap: conditional Independence (CI)

* For the coin example, the conditional independence assumption: knowing the coin in use, the tosses become independent: $$P(Y_1, Y_2, Y_3|C) = P(Y_1|C)P(Y_2|C)P(Y_3|C)$$

* Equivalently, CI also means: $P(Y_1 |C, Y_2, Y_3) = P(Y_1|C)$, $P(Y_1 |C, Y_2) = P(Y_1|C)$, $P(Y_1 |C, Y_3) = P(Y_1|C)$
  * intuition: $C$ provides all the information required, adding more does *not* affect the distribution

Note that **marginally**, or without the condition, the tosses are not independent ! $$P(Y_1, Y_2, Y_3) \neq P(Y_1)P(Y_2)P(Y_3)$$

E.g. 

* by sum rule, then chain rule, then CI assumption:
  $$\begin{align}P(Y_1=t, Y_2=t, Y_3=t) &= \sum_{c=1,2} P(C=c, Y_1=t, Y_2=t, Y_3=t) \\
  &= \sum_{c=1,2} P(c)P(t, t, t|c) \\
  &= \sum_{c=1,2} P(c)P(t|c)P(t|c)P(t|c)\\
  &= 0.5\cdot 0.5^3 + 0.5\cdot 0.8^3 = \textbf{0.319}\end{align}$$

* to calculate $P(Y_1)$, we follow the same: sum (more to marginalise here)+chain+CI:
$$\begin{align}P(Y_1=t) &= \sum_{c=1,2}\sum_{y_2=h,t}\sum_{y_3=h,t} P(c)P(Y_1=t|c)P(y_2|c)P(y_3|c) = \sum_{c=1,2}P(c)P(Y_1=t|c) \sum_{y_2}\sum_{y_3} P(y_2|c)P(y_3|c)\\
&= \sum_{c=1,2}P(c)P(Y_1=t|c) \underbrace{\sum_{y_2}P(y_2|c) \sum_{y_3} P(y_3|c)}_{1\times 1}= 0.65\end{align}$$
$P(Y_2=t) = P(Y_3=t) = 0.65; \text{so }P(Y_1=t)P(Y_2=t)P(Y_3=t) = 0.65^3 = \textbf{0.275}$

* not independent: $P(Y_1=t, Y_2=t, Y_3=t) \neq P(Y_1=t) P(Y_2=t) P(Y_3=t)$

Alternatively, you can simply verify e.g.

$$P(Y_3=t|Y_2=t, Y_1=t) \neq P(Y_3=t|Y_1=t) \neq P(Y_3=t)$$

Because of the other independence definition: if $X,Y$ are independent, then $P(X|Y) = P(X)$

You should try it as an exercise!
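The numbers above can be checked by brute-force enumeration. The sketch below is only an illustration: the helper names `joint` and `prob` are made up here, and the only inputs are the coin probabilities from the recap. It sums the factorised joint over all outcomes and compares the joint probability of three tails with the product of the marginals.

```python
from itertools import product

# Model from the recap: each coin is picked with probability 0.5;
# coin 1 has P(head) = 0.5 and coin 2 has P(head) = 0.2.
p_coin = {1: 0.5, 2: 0.5}
p_toss = {1: {"h": 0.5, "t": 0.5}, 2: {"h": 0.2, "t": 0.8}}

def joint(c, y1, y2, y3):
    """P(C=c, Y1=y1, Y2=y2, Y3=y3) using the CI factorisation."""
    return p_coin[c] * p_toss[c][y1] * p_toss[c][y2] * p_toss[c][y3]

def prob(**fixed):
    """Marginal probability of the tosses named in `fixed`, e.g. prob(y1="t")."""
    total = 0.0
    for c, y1, y2, y3 in product(p_coin, "ht", "ht", "ht"):
        values = {"y1": y1, "y2": y2, "y3": y3}
        if all(values[name] == v for name, v in fixed.items()):
            total += joint(c, y1, y2, y3)
    return total

p_ttt = prob(y1="t", y2="t", y3="t")   # joint probability of three tails
p_tt = prob(y1="t", y2="t")
p_t = prob(y1="t")                     # marginal of a single toss

print("P(t,t,t)             =", round(p_ttt, 4))
print("P(t)^3               =", round(p_t ** 3, 4))
print("P(Y3=t | Y1=t, Y2=t) =", round(p_ttt / p_tt, 4))
print("P(Y3=t | Y1=t)       =", round(p_tt / p_t, 4))
print("P(Y3=t)              =", round(prob(y3="t"), 4))
```

The three conditional probabilities printed at the end differ from each other, which is the alternative check mentioned above: seeing tails makes coin 2 more plausible, which in turn makes another tail more likely.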

# Recap: CI assumption simplifies joint distribution

Due to chain rule $$P(C, Y_1, Y_2, Y_3) = P(C)P(Y_1, Y_2,Y_3|C)$$


Due to the conditional independence (CI) assumption we have

$$P(C, Y_1, Y_2,Y_3) = P(C)\prod_{i=1}^3P(Y_i|C)$$

Reduce the number of parameters from $2^4-1$ to $3$

# Recap: probabilistic inference

Probabilistic inference: $$P(\text{Query}|\text{Evidence}): P(Y_3|Y_1=\text{head}, Y_2=\text{head})$$

Evidence: $\{Y_1, Y_2 \}$

Query: $\{Y_3\}$

Hidden (Nuisance) r.v.: $\{C\}$
  * $\{\text{All}\} \setminus (\text{Evidence } \cup \text{ Query})$
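As a small illustration of this query, the sketch below computes $P(Y_3|Y_1=\text{head}, Y_2=\text{head})$ for the coin model by summing out the hidden variable $C$ and normalising. The function names `toss_prob` and `query_y3` are hypothetical helpers introduced only for this example.

```python
# Coin model from the recap: P(C) and P(Y_i = head | C).
p_coin = {1: 0.5, 2: 0.5}
p_head = {1: 0.5, 2: 0.2}

def toss_prob(y, c):
    """P(Y_i = y | C = c) for y in {"head", "tail"}."""
    return p_head[c] if y == "head" else 1.0 - p_head[c]

def query_y3(y1, y2):
    """Return P(Y3 | Y1=y1, Y2=y2) as a dict, summing out the hidden C."""
    unnormalised = {}
    for y3 in ("head", "tail"):
        # sum rule over C, then chain rule + CI inside the sum
        unnormalised[y3] = sum(
            p_coin[c] * toss_prob(y1, c) * toss_prob(y2, c) * toss_prob(y3, c)
            for c in p_coin
        )
    evidence = sum(unnormalised.values())  # this is P(Y1=y1, Y2=y2)
    return {y3: p / evidence for y3, p in unnormalised.items()}

posterior = query_y3("head", "head")
print(posterior)
```

After two heads, the fair coin is the more plausible choice, so the probability of a third head moves closer to 0.5 than the prior predictive value of 0.35.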

# Digress: Graph and Directed Acyclic Graph (DAG)

A graph is a data structure consisting of: $$G = \{V, E\}$$
  * a set of **vertices** $V=\{X, Y, Z\}$, 
  * a set of **edges** $E=\{(X,Y), (Y,Z)\}$: 
    * an edge is an unordered pair of nodes, e.g. $(X, Y)$ and $(Y,X)$ are the same edge
    * the relationship is mutual or symmetric: e.g. neighbour, friendship (?)
    
<center><img src="./figs/graph.png" width = "600" height="500"/></center>  

A **directed** graph is a graph with **directed** edges, i.e. the edges are **ordered** pairs
  * direction matters: $(X, Y)$ is not the same as $(Y, X)$
  * asymmetric relationship: e.g. parent-to-child (the reverse does not hold)
    
<center><img src="./figs/directedGraph.png" width = "600" height="500"/></center>  

$\textbf{parent}(\cdot)$ returns the set of parent nodes, e.g.

$$\text{parent}(Y) = \{X, Z\}$$

A directed **acyclic** graph (DAG) is a directed graph **without** cycles 
  * a cycle: a directed path that starts and ends at the same node

**NOT a DAG**: cycles NOT allowed

Adding $Y_1 \Rightarrow X$ creates the cycle $X\Rightarrow Y_1\Rightarrow X$ (this would be an **invalid** Bayes net!)

<center><img src="./figs/diceExampleDCG.png" width = "400" height="500"/></center>

A valid DAG: multiple paths are allowed
  * two possible paths from $X$ to $Y_2$: $X\Rightarrow Y_2$ and $X \Rightarrow Y_1 \Rightarrow Y_2$
  * still **acyclic** though

<center><img src="./figs/diceExampleDAG.png" width = "400" height="500"/></center>
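A directed graph can be stored as a list of ordered pairs, and acyclicity can be checked mechanically. The following is a minimal sketch using Kahn's algorithm (repeatedly removing nodes with no incoming edges); the edge lists are reconstructed from the paths described in the text, and the function names `is_dag` and `parents` are illustrative only.

```python
from collections import deque

def is_dag(vertices, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {v: 0 for v in vertices}
    children = {v: [] for v in vertices}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # start from nodes without parents and peel the graph layer by layer
    queue = deque(v for v in vertices if indegree[v] == 0)
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for child in children[v]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # nodes on a cycle never reach in-degree zero, so they are never visited
    return visited == len(vertices)

def parents(node, edges):
    """Set of parent nodes of `node`."""
    return {p for p, c in edges if c == node}

vertices = ["X", "Y1", "Y2"]
valid = [("X", "Y1"), ("X", "Y2"), ("Y1", "Y2")]   # two paths from X to Y2
invalid = valid + [("Y1", "X")]                    # adds the cycle X -> Y1 -> X

print(is_dag(vertices, valid), is_dag(vertices, invalid))
print(parents("Y2", valid))
```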

# Bayesian net: a DAG with CPTs $\text{bn} = \{G, \{P\}\}$


A Bayesian network is a graphical representation of a probabilistic model


It consists of a **Directed Acyclic Graph (DAG)**, $G$ together with conditional probability tables **CPTs**, $\{P(X_i|\ldots)\}$
  * DAG **G**: 
    * $V=\{X_1, X_2, \ldots, X_n\}$, one random variable per vertex (also called node) 
    * $E$, directed edges represent (conditional) dependences between r.v.s
  * **CPTs**: $P(X_i|\text{parent}(X_i))$, one $P$ for each vertex or r.v.

<!-- Why we need a Bayesian network ? 
  * easier for human to model (we prefer visual graphical syntax to math equations after all)
  * also easier for machine computation (make automated algorithmic inference) -->

For our previous example, the BN representation would contain 4 nodes (one for each r.v.)

<center><img src="./figs/diceExample0.png" width = "600" height="500"/></center>

**Edges** are added for the conditional independence structure 
<center><img src="./figs/diceExample.png" width = "400" height="500"/></center>

Recall $P(C, Y_1, Y_2, Y_3) = P(C)\prod_{i=1}^3 P(Y_i|C)$

  * edge here means _direct influence_ from parent nodes to child nodes
  * the result of $Y_i$ depends on the coin choice $C$

# Conditional probability tables (CPT)

For each random variable (or node), there is one conditional probability distribution to specify
  $$P(X_i|\text{parent}(X_i))$$
  * $\text{parent}(X_i)$ returns the parent nodes of $X_i$

How many parameters do we need for $P(X_i|\text{parent}(X_i))$?

  * remember that conditional distributions are distributions as well
  * it depends on the size of $\text{parent}(X)$: $O((k-1)\cdot k^{|\text{parent}(X)|})$
    * where $k$ is the (average) number of possible values a r.v. can take
      * e.g. $k=2$ for binary r.v.s, like $C, Y_i$
    * ${|\text{parent}(X)|}$ is the number of parents of $X$  
    * one distribution, with $k-1$ free parameters, for each combination of the parent nodes' realisations
      * if all r.v.s are binary, $O(2^{|\text{parent}(X)|})$

Basically, CPTs are a set of distribution tables (one for each node or r.v.), denoted as $\{P\}$
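The parameter count can be written as a one-line formula in code. The sketch below uses a hypothetical helper `cpt_parameters`; it allows each variable to have its own number of values instead of the average $k$, which reduces to $(k-1)\cdot k^{|\text{parent}(X)|}$ when all variables take $k$ values.

```python
def cpt_parameters(k_node, k_parents=()):
    """Free parameters needed for P(X | parent(X)).

    k_node:    number of values X can take
    k_parents: number of values of each parent of X
    """
    rows = 1
    for k in k_parents:
        rows *= k                  # one distribution per parent combination
    return (k_node - 1) * rows     # each distribution has k_node - 1 free entries

n_c = cpt_parameters(2)                    # P(C): binary, no parents
n_y = cpt_parameters(2, [2])               # P(Y_i | C): binary, one binary parent
n_two_parents = cpt_parameters(2, [2, 2])  # binary node with two binary parents
n_mixed = cpt_parameters(3, [2, 4])        # 3-valued node, parents with 2 and 4 values

print(n_c, n_y, n_two_parents, n_mixed)
```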

# Example: Conditional Probability Tables (CPT)

<center><img src="./figs/diceExample.png" width = "400" height="500"/></center>

For node Coin Choice (short-handed as $C$), the CPT is $P(C)$
  * as there is no parent, there are no conditions to worry about

|C   | P(C=c) |
| --- | ---  | 
| 1   | 0.5 | 
| 2   | 0.5 | 

One may also represent the CPT as a **fat** or **short** table 

| C=1  | C=2 |
| --- | ---  | 
| 0.5   | 0.5 | 

  * for $P(C)$, just one parameter is needed as $C$ is binary
    * the other is just $P(C=2)=1-P(C=1)$

<center><img src="./figs/diceExample.png" width = "400" height="500"/></center>

For node $Y_i$, the CPT $P(Y_i|C)$
  
|C   | $Y_i$ |P($Y_i$\|C=c) |
| ---| ---   | ---     |
| 1  | head | 0.5 |
| 1  | tail | 0.5 |
| 2  | head | 0.2 |
| 2  | tail | 0.8 |

or **fat/short** CPT table 

|C   | head | tail |
| ---| ---   | ---     |
| 1  | 0.5 | 0.5 |
| 2  | 0.2 | 0.8 |


  * only one parent node, i.e. $\text{parent}(Y_i) = \{C\}$
  * we need to specify $2^1$ distributions: namely $P(Y_i|C=1)$ and $P(Y_i|C=2)$
    * one parameter for each: conditional distributions are distributions

# Draw everything together: a full BN

A Bayes net is the DAG together with the (fat) CPTs, or $\text{bn}=\{G, \{P\}\}$

<center><img src="./figs/coinBN2.png" width = "500" height="500"/></center>

In total, for this model, we only need 1+2 = 3 parameters
  * one for $P(C)$ and two for all $P(Y_i|C)$
  * remember $P(Y_i|C)$ are shared among all $Y_i$

# Plate notation to simplify repeated models

You may also see "plate" notation to simplify the graphical representation

For independent and identically distributed r.v.s e.g. $P(Y_i|C)$ for $i =1,2,3$, plate notation is handy

Just like a for loop, the two represent the same thing.

<center><img src="./figs/coinPlate.png" width = "300" height="400"/></center>

# Another Bayes network example

I’m at work. My neighbor John calls to say my alarm is ringing, but my neighbor
Mary doesn’t call. Sometimes the alarm is set off by a minor earthquake. So is there a
burglar?


What are the random variables ?

Burglar, Earthquake, Alarm, John Calls, Mary Calls
  * all of them are binary: true or false

A network topology reflects "causal" influence relationships
  * a burglar can set the alarm off
  * an earthquake can set the alarm off
  * the alarm can cause Mary to call me
  * the alarm can cause John to call me 

# Example continued


<center><img src="./figs/burglar.png" width = "400" height="500"/></center>

# Example continued (with CPTs)

We have used **compact representations** of the CPTs (redundant parameters are not shown)

Pay attention to CPT of $\text{Alarm}$, it has two parents
  * one conditional distribution per $\text{Burglary}\times \text{Earthquake}$ combination
  * $2^2 = 4$ parameters

<center><img src="./figs/burglarCPTs.png" width = "600" height="400"/></center>

In total, for this Bayesian network, we need 10 parameters for the CPTs $\{P\}$: 1+1+4+2+2=10

# Bayes network encodes conditional independence relationships

Bayesian network encodes **conditional independence** graphically: the simplest one is

    
<div class="alert alert-block alert-info">
<center> <b>Given its parents, a node is independent from all other nodes except its descendants</b> </center>
</div>

For example, 

<center><img src="./figs/burglar.png" width = "400" height="400"/></center>

John Calls is independent of Burglary, Earthquake and Mary Calls given Alarm (the same holds for Mary Calls):

$$\text{JohnCalls} \perp \{\text{Burglary}, \text{Earthquake}, \text{MaryCalls}\} \mid \text{Alarm}$$
  * `Alarm` is the parent of `JohnCalls`
  
which implies $\require{cancel} P(\text{JohnCalls}|\text{Alarm}, \cancel{\text{Burglary}, \text{Earthquake}, \text{MaryCalls}}) = P(\text{JohnCalls}|\text{Alarm})$

<center><img src="./figs/burglar.png" width = "400" height="400"/></center>

We can also state the following CI relationships:
$$Burglary \perp  Earthquake|\emptyset, \text{ or simply}\;\; Burglary \perp  Earthquake$$
  * $Burglary$'s parent is $\emptyset$: $\{A,J, M\}$ are $B$'s descendants, $E$ is the other r.v.

Actually, all nodes without parents are independent: $\{Burglary, Earthquake\}$ 

and more: $$\text{JohnCalls} \perp \text{MaryCalls} \mid \text{Alarm}$$
  * the CI implies: $P(\text{MaryCalls}|\text{Alarm},\text{JohnCalls}) = P(\text{MaryCalls}|\text{Alarm})$

# Bayesian network: joint distribution factorisation

Due to the conditional independence relationship encoded in a BN, an important property emerges (**factoring property**):

<div class="alert alert-block alert-info">
<center> <b>Joint distribution factorises as the product of CPTs: $P(X_1, X_2,\ldots, X_n) = \prod_{i=1}^n P(X_i|\text{parent}(X_i))$</b> </center>
</div>

For example, due to chain rule, the joint distribution can be decomposed as
    $$P(B, E, A,J,M) = P(B)P(E|B)P(A|B,E)P(J|B,E,A)P(M|B,E,A,J)$$

<center><img src="./figs/burglarCPTs.png" width = "650" height="400"/></center>

Due to CI assumptions encoded in the BN:

$$\begin{align}P(B,E, A,J,M) 
&= P(B)P(E|\cancel{B})P(A|B,E)P(J|\cancel{B,E},A)P(M|\cancel{B,E},A,\cancel{J})\\
&=\underbrace{P(B)P(E)P(A|B,E)P(J|A)P(M|A)}_{\prod \text{CPTs}}\end{align}$$


<div class="alert alert-block alert-info">
<center> <b>Joint distribution factorises as the product of CPTs: $P(X_1, X_2,\ldots, X_n) = \prod_{i=1}^n P(X_i|\text{parent}(X_i))$</b> </center>
</div>
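The factoring property can be tried out numerically. The sketch below builds the joint distribution as the product of the five CPTs and checks two of the CI statements by enumeration. Note the assumption: the CPT numbers are the values commonly quoted for this textbook example and are **not** read from the figure in these slides, so treat them as placeholders; the helper names `bern`, `joint` and `prob` are also illustrative.

```python
from itertools import product

# ASSUMED CPT values (commonly used for this textbook example; the figure may differ).
P_B = 0.001                                   # P(Burglary = true)
P_E = 0.002                                   # P(Earthquake = true)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm = true | B, E)
P_J = {True: 0.90, False: 0.05}               # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}               # P(MaryCalls = true | Alarm)

def bern(p_true, value):
    """Probability of `value` for a binary r.v. with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

worlds = list(product([True, False], repeat=5))

def prob(event):
    """Probability of an event given as a predicate over (b, e, a, j, m)."""
    return sum(joint(*w) for w in worlds if event(*w))

total = sum(joint(*w) for w in worlds)        # a valid joint sums to 1

# CI check 1: P(J | A, B, E) equals P(J | A)
lhs = prob(lambda b, e, a, j, m: j and a and b and e) / prob(lambda b, e, a, j, m: a and b and e)
rhs = prob(lambda b, e, a, j, m: j and a) / prob(lambda b, e, a, j, m: a)

# CI check 2: Burglary and Earthquake are marginally independent
p_be = prob(lambda b, e, a, j, m: b and e)
p_b_times_p_e = prob(lambda b, e, a, j, m: b) * prob(lambda b, e, a, j, m: e)

print(total, lhs, rhs, p_be, p_b_times_p_e)
```

Only 10 numbers are stored, yet all 32 entries of the joint distribution can be recovered from them.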

# How to construct a Bayesian Network

One thing we have yet to mention is how a Bayesian network can be constructed

In light of the probability chain rule 

<div class="alert alert-block alert-info">
    <center> <b>$P(X_1, X_2,\ldots, X_n) = P(X_1)P(X_2|X_1)P(X_3|X_2, X_1)\ldots P(X_n|X_{n-1}, \ldots)=\prod_{i=1}^n P(X_i|X_{i-1}, \ldots, X_1)$</b> </center>
</div>

and the factoring property encoded by a BN 
<div class="alert alert-block alert-info">
<center> <b>Joint distribution factorises as the product of CPTs: $P(X_1, X_2,\ldots, X_n) = \prod_{i=1}^n P(X_i|\text{parent}(X_i))$</b> </center>
</div>

To construct a BN (that satisfies the semantics):
  * identify and number the random variables in *some* order $X_1, X_2, \ldots, X_n$

  * for each $X_i$ in the chosen order $1,2,\ldots,n$
    * identify $\text{parent}(X_i) \subseteq \{X_{i-1}, X_{i-2}, \ldots, X_1\}$ then add the links
      * only need to consider preceding nodes, *aka* topological order
      * also avoid cycles (DAG: acyclic)
    * add CPTs  

# An example with a good order

The order we choose matters; the general rule is
  * causes precede effects 

Consider the Burglary example with order: \[Burglary, Earthquake, Alarm, John Calls, Mary Calls\]

The process goes 
1. Adding `Burglary`: No parents

2. Adding `Earthquake`: No parents (only need to consider `Burglary`: no reason to believe influence exists)

3. Adding `Alarm`: and $\text{parent}$(`Alarm`) = {`Burglary`, `Earthquake`}
  * both have direct influence
  * add directed edges (`Burglary`,`Alarm`), (`Earthquake`,`Alarm`)

4. Adding `JohnCalls`: and $\text{parent}$(`JohnCalls`) = {`Alarm`}
  * direct influence: `JohnCalls` becomes conditionally independent of `Burglary` and `Earthquake` given `Alarm`
    * you may add `Earthquake` as a parent as well: he may not make the call if there is an earthquake
    * all models are wrong but some are useful 
  * add directed edges (`Alarm`,`JohnCalls`)

5. Adding `MaryCalls`: and $\text{parent}$(`MaryCalls`) = {`Alarm`}  


# An example with a bad order

* Consider order \[`MaryCalls`, `JohnCalls`, `Alarm`, `Burglary`, `Earthquake`\]
  * left BN 
  * (the numbers next to the nodes are the parameters required for the CPTs)
    * in total 13 parameters 
* Consider order \[`MaryCalls`, `JohnCalls`, `Earthquake`, `Burglary`, `Alarm` \]
  * right BN
    * in total 31 parameters!

<center><img src="./figs/wrongorders.png" width = "800" height="400"/></center>

Conclusion: a bad order leads to an overly complicated model
  * less compact: the original model has only 10 parameters
  * harder to specify or learn the CPTs as well!
  * too many parameters might also lead to overfitting
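The three totals can be reproduced by counting $2^{|\text{parent}(X)|}$ parameters per binary node. In the sketch below the parent sets for the two bad orders are an assumption: they follow the usual textbook construction for these orders and match the totals of 13 and 31 quoted above, but they are not read from the figure. The helper `total_parameters` is illustrative.

```python
def total_parameters(parents):
    """Number of free CPT parameters when every variable is binary."""
    return sum(2 ** len(ps) for ps in parents.values())

# Causal order: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
causal = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

# ASSUMED structure for order MaryCalls, JohnCalls, Alarm, Burglary, Earthquake
order_mjabe = {"M": [], "J": ["M"], "A": ["M", "J"], "B": ["A"], "E": ["A", "B"]}

# ASSUMED structure for order MaryCalls, JohnCalls, Earthquake, Burglary, Alarm
# (every node depends on all of its predecessors)
order_mjeba = {"M": [], "J": ["M"], "E": ["M", "J"],
               "B": ["M", "J", "E"], "A": ["M", "J", "E", "B"]}

n_causal = total_parameters(causal)
n_mjabe = total_parameters(order_mjabe)
n_mjeba = total_parameters(order_mjeba)
print(n_causal, n_mjabe, n_mjeba)
```

The last structure is fully connected, so it needs $2^5 - 1 = 31$ parameters, exactly as many as a joint table with no independence assumptions at all.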

# Only causal relationships ?


Consider the case: `Traffic` and `Rain`
* Rain causes Traffic: so Traffic depends on Rain
<center><img src="./figs/bnRainTraffic1.png" width = "400" height="400"/></center>

We can calculate the joint distribution by $P(R,T)=P(R)P(T|R)$

|R   | T |P(R,T) |
| ---| ---   | ---     |
| t  | t | 3/16 |
| t  | f | 1/16 |
| f  | t | 6/16 |
| f  | f | 6/16 |

We can also construct a BN the other way around `Traffic` to `Rain`:
<center><img src="./figs/bnRainTraffic2.png" width = "400" height="400"/></center>

* the joint distribution is $P(R,T)=P(T)P(R|T)$

|R   | T |P(R,T) |
| ---| ---   | ---|
| t  | t | 3/16 |
| t  | f | 1/16 |
| f  | t | 6/16 |
| f  | f | 6/16 |

* the two Bayesian networks are **consistent**: as the $P(R,T)$ are the same
  * all possible inference results are the same
  
* but the first one is much easier to specify, explain and reason about.   
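The consistency of the two networks can be checked with exact fractions. Since the CPT values are only shown in the figures, the sketch below derives both sets of CPTs from the joint table above (an assumption about how the figures were produced), then rebuilds the joint from each factorisation and compares the results.

```python
from fractions import Fraction as F

values = (True, False)

# Joint table P(R, T) from the slide; keys are (rain, traffic).
joint = {(True, True): F(3, 16), (True, False): F(1, 16),
         (False, True): F(6, 16), (False, False): F(6, 16)}

# Network 1 (Rain -> Traffic): CPTs P(R) and P(T | R)
p_r = {r: sum(joint[(r, t)] for t in values) for r in values}
p_t_given_r = {(t, r): joint[(r, t)] / p_r[r] for r in values for t in values}

# Network 2 (Traffic -> Rain): CPTs P(T) and P(R | T)
p_t = {t: sum(joint[(r, t)] for r in values) for t in values}
p_r_given_t = {(r, t): joint[(r, t)] / p_t[t] for r in values for t in values}

# Rebuild the joint from each factorisation
joint_1 = {(r, t): p_r[r] * p_t_given_r[(t, r)] for r in values for t in values}
joint_2 = {(r, t): p_t[t] * p_r_given_t[(r, t)] for r in values for t in values}

print("P(R=t) =", p_r[True], " P(T=t|R=t) =", p_t_given_r[(True, True)])
print("P(T=t) =", p_t[True], " P(R=t|T=t) =", p_r_given_t[(True, True)])
print("same joint:", joint_1 == joint_2 == joint)
```

Both networks need three parameters here, but the numbers in the second one, such as $P(R=t|T=f)$, are much less natural to elicit than "how likely is heavy traffic when it rains".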

# Causal topological order

If a BN is constructed in a causal order, it is
  * simpler (fewer edges, fewer parameters for CPTs)
  * easier to elicit from experts
  * easier to interpret
  
Meanwhile, a BN need not be specified in causal order
  * it is still consistent
  * but its edges then represent **correlations**, not causality
  
So an edge of a BN, after all,   
  * might encode a causal relationship
  * but in general only encodes (conditional) dependence, i.e. correlation

# Summary

* Bayesian networks 
  * each node is a r.v.
  * edges encode CI relationships between r.v.s
* How to construct a BN
  * A topological order that follows the cause-effect order gives a compact BN

# Next time

* Exact inference: enumeration algorithm
* Use a software package to create a BN
* Case study: the Sally Clark case
* Approximate inference (next week)
  * sampling based algorithm (MCMC)