# Probabilistic Models – Spring 2022
## Exercise Session 2
Return by Feb 8th 12.00 through Moodle. Session on Feb 8th 14.15.

<span style="color:red">**Enrico Buratto**</span>

### Instructions
Make sure the notebook produces correct results when run sequentially starting from the first cell. You can ensure this by clearing all outputs (`Edit > Clear All Outputs`), running all cells (`Run > Run All Cells`), and finally correcting any errors.

To get points:
1. Submit your answers to the automatically checked Moodle test. 
 - You have 5 tries on the test: the highest obtained score will be taken into account.
 - For numerical questions the tolerance varies by question and will be specified in Moodle.
2. Submit this notebook containing your derivations to Moodle.

Some of the exercises will ask you to return a DAG as an answer. To make it possible to evaluate the answer automatically in Moodle use the following format. Construct the DAG as an adjacency matrix where $A[i, j] = 1$ if there is an edge $j \rightarrow i$ and 0 otherwise. The nodes should be in alphabetical order, so $A \rightarrow B$ corresponds to $0 \rightarrow 1$ (or $1 \rightarrow 2$ in R's 1-based indexing). Finally, concatenate all the rows starting from the first one and submit the vector as your answer. For example the DAG $A \rightarrow B \leftarrow C$ is encoded by the matrix 

$$\begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$

and vector $000101000$.

You can make use of the following examples to construct the DAG with Python/R.

In [1]:
# RUN THIS IF WORKING IN PYTHON
import pandas as pd

# Function to concatenate matrix rows into a single string
def mat2vec(dag):
    return ''.join(str(x) for x in dag.values.reshape(dag.values.shape[0]**2))

# Adjacency matrix
rvs = ["A", "B", "C"]
DAG = pd.DataFrame(0, index=rvs, columns=rvs)

# Example: Set parents of B to be A and C.
DAG.loc["B", ["A", "C"]] = 1

# Create the vector
print(mat2vec(DAG))

000101000


In [2]:
# RUN THIS IF WORKING IN R

# Function to concatenate matrix rows into a single string
mat2vec <- function(dag) {
    return(paste(apply(dag, 1, paste, collapse=""), collapse=""))
}

# Adjacency matrix
rvs <- c("A", "B", "C")
DAG <- data.frame(matrix(0L, ncol = 3, nrow = 3))
colnames(DAG) <- rvs
rownames(DAG) <- rvs

# Example: Set parents of B to be A and C.
DAG["B", c("A", "C")] <- 1

# Create the vector
cat(mat2vec(DAG))

000101000

## Exercise 1
***

Let us consider a 4-sided die rolling experiment as a multinomial model (i.i.d. multi-valued Bernoulli). We roll the die 20 times, and observe data $D$ with the following counts for the sides:

In [21]:
!cat data/1.csv

side	counts
1	5
2	3
3	7
4	5


(a) Calculate the maximum likelihood parameters, given the above data.

(b) Calculate the posterior distribution $P(\theta_1, \theta_2, \theta_3, \theta_4 | D)$ considering the prior $Dir(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$, with
- $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 1$, i.e., the uniform prior and
- $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0.5$, i.e., the Jeffreys prior.

For both, report the mean. 

(c) Using Bayesian inference with the uniform prior (above), calculate the predictive distribution (all 4 probabilities) of the next result given $D$.

(d) Let $\alpha_4$, the 4th hyperparameter to the Dirichlet prior be 3. Specify $\alpha_1$, $\alpha_2$ and $\alpha_3$ such that the mode of the posterior distribution is at $\theta_1 = \theta_2 = \theta_3 = \theta_4$.


### Answer

#### Task a

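We can extend the binary case we saw in class: for a given side, let $N$ be the number of rolls showing that side and $M$ the number of rolls showing any other side. The likelihood of that side's parameter is then $P(D|\theta) \propto \theta^{N}(1-\theta)^{M}$. We differentiate the likelihood and set the derivative to zero to find the maximum: $$ \frac{d}{d\theta}P(D|\theta) = 0$$ This gives $$\theta=\frac{N}{N+M}$$ i.e. the maximum likelihood estimate of each $\theta_i$ is the relative frequency of side $i$ among the 20 rolls.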

#### Task b

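We know that the mean of a Dirichlet distribution is $$E[\theta_i]=\frac{\alpha_i}{\sum_{k=1}^{K}\alpha_k}$$ and that the Dirichlet prior is conjugate to the multinomial likelihood; to be more specific, if the prior is $Dir(\theta;\alpha_1,...,\alpha_K)$, the posterior is $Dir(\theta;\alpha_1+N_1,...,\alpha_K+N_K)$, where $N_i = x_i$ is the observed count of side $i$. For the uniform prior the posterior is therefore $Dir(6, 4, 8, 6)$, and for the Jeffreys prior it is $Dir(5.5, 3.5, 7.5, 5.5)$. The mean of the posterior distribution is then $$E[\theta_i|D]=\frac{x_i+\alpha_i}{\sum_{k=1}^{K}(x_k+\alpha_k)}$$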

#### Task c

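The predictive distribution is obtained by model averaging, i.e. by integrating the probability of the next outcome over the posterior: $$P(x_{21}=i|D)=\int \theta_i P(\theta|D)\,d\theta = E[\theta_i|D]$$ This is exactly the mean of the posterior Dirichlet distribution, so the results are the same as the posterior means of Task b with the uniform prior: $(0.25, 0.1667, 0.3333, 0.25)$.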
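
As a quick cross-check of the predictive distribution, the sketch below applies the conjugate update with exact fractions. The counts are those of `data/1.csv`; the variable names (`counts`, `alpha`, `posterior`, `predictive`) are illustrative and not part of the exercise.

```python
from fractions import Fraction

# Observed counts for sides 1..4 (from data/1.csv)
counts = [5, 3, 7, 5]
# Uniform Dirichlet prior: alpha_i = 1 for every side
alpha = [1, 1, 1, 1]

# Conjugate update: the posterior is Dir(alpha_i + N_i)
posterior = [a + n for a, n in zip(alpha, counts)]

# Predictive probability of side i = posterior mean of theta_i
total = sum(posterior)
predictive = [Fraction(p, total) for p in posterior]

for side, p in enumerate(predictive, start=1):
    print(f"P(next = {side} | D) = {p} = {float(p):.4f}")
```

Using `Fraction` keeps the probabilities exact, which makes it easy to see that they sum to one.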

#### Task d

We know that the mode of a Dirichlet distribution with all $\alpha_i > 1$ is $$\theta_i = \frac{\alpha_i-1}{\sum_{k=1}^{K}\alpha_k-K}$$ Since the posterior is again a Dirichlet distribution with parameters $\alpha_i+x_i$, where $x_i$ is the observed count of side $i$, the mode of the posterior distribution is $$\theta_i = \frac{\alpha_i+x_i-1}{\sum_{k=1}^{K}(\alpha_k+x_k)-K}$$ Now, for $\theta_4$ we have that $$\theta_4 = \frac{3+5-1}{\sum_{k=1}^{K}(\alpha_k+x_k)-K}$$ and we must have $\theta_4=\theta_1=\theta_2=\theta_3$. The denominator is the same for every $\theta_i$, so it is enough to make the numerators equal: $$\alpha_1+5-1=\alpha_2+3-1=\alpha_3+7-1=3+5-1=7$$ In particular $\alpha_1=3$, because sides 1 and 4 have the same count. So, finally:
- $\alpha_1=3$
- $\alpha_2=5$
- $\alpha_3=1$
- $\alpha_4=3$
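
The hyperparameters above can be checked numerically. This is a minimal sketch, assuming the counts of `data/1.csv`; the helper `dirichlet_mode` is a hypothetical name introduced here for illustration.

```python
from fractions import Fraction

counts = [5, 3, 7, 5]   # observed counts for sides 1..4
alpha = [3, 5, 1, 3]    # prior from Task d (alpha_4 = 3 is given)

def dirichlet_mode(params):
    """Mode of Dir(params); the formula is valid when every parameter is > 1."""
    k = len(params)
    denom = sum(params) - k
    return [Fraction(p - 1, denom) for p in params]

# Conjugate update, then the mode of the posterior
posterior = [a + n for a, n in zip(alpha, counts)]
mode = dirichlet_mode(posterior)

print("posterior parameters:", posterior)
print("posterior mode:", [str(m) for m in mode])
```

All four posterior parameters come out equal, so the mode sits at the uniform point.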

In [13]:
import pandas as pd
df = pd.read_csv('./data/1.csv', sep='\t')

# Task a: the maximum likelihood estimate is the relative frequency of each side
print('Task a')
t = df['counts'].sum() # N+M, the total number of rolls
for _, row in df.iterrows():
    print('Side', row['side'])
    print('ML', row['counts']/t)

# Task b: posterior mean under a symmetric Dirichlet prior with parameter alpha
print('\nTask b')
def calculate_mean(df, alpha):
    t = df['counts'].sum()
    k = len(df) # number of sides
    for _, row in df.iterrows():
        print('Side', row['side'])
        p = (row['counts']+alpha)/(t+k*alpha)
        print('Mean', p)
print('Alpha=1')
calculate_mean(df,1)
print('Alpha=0.5')
calculate_mean(df,.5)


Task a
Side 1
ML 0.25
Side 2
ML 0.15
Side 3
ML 0.35
Side 4
ML 0.25

Task b
Alpha=1
Side 1
Mean 0.25
Side 2
Mean 0.16666666666666666
Side 3
Mean 0.3333333333333333
Side 4
Mean 0.25
Alpha=0.5
Side 1
Mean 0.25
Side 2
Mean 0.1590909090909091
Side 3
Mean 0.3409090909090909
Side 4
Mean 0.25


## Exercise 2
***

Show by using the d-separation criterion that a node in a Bayesian network is conditionally independent of all the other nodes, given its (minimal) Markov blanket (parents, children, spouses (parents of children)). 

Give the answer verbally in Moodle. It will be checked manually. For your own future reference it's a good idea to paste the answer here, too. 

Some correct and incorrect proofs will be posted anonymously in Moodle for study; by returning your answer you agree that your version may be published anonymously.

### Answer

In order to prove that a node in a Bayesian network is conditionally independent of all the other nodes, given its Markov blanket, we can use the **Global Markov property**, which is defined as follows (ref. [COMP538, Hong Kong University of Science and Technology](https://cse.hkust.edu.hk/bnbook/pdf/l03.h.pdf)):

Given a Bayesian network, let $X$ and $Y$ be two variables and $\mathbf{Z}$ be a set of variables that does not contain $X$ or $Y$. If $\mathbf{Z}$ d-separates $X$ and $Y$, then $X\mathrel{⫫}Y|\mathbf{Z}$.

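It is therefore enough to show that the Markov blanket of $X$ d-separates $X$ from every node $Y$ outside the blanket. Take any path between $X$ and such a $Y$ (valve names as in Darwiche's book). Since $Y$ is not adjacent to $X$, the path has at least one intermediate node, and its first edge connects $X$ to a parent or to a child:
- If the path starts $X \leftarrow P$, the edge leaves $P$, so the valve at $P$ is sequential or divergent. $P$ is in the blanket, so this valve is closed.
- If the path starts $X \rightarrow C \rightarrow W$, the valve at $C$ is sequential and $C$ is in the blanket, so it is closed.
- If the path starts $X \rightarrow C \leftarrow S$, the valve at $C$ is convergent and open because $C$ is given, but $S$ is a spouse of $X$ and hence in the blanket. $S \neq Y$, so the path continues past $S$, and since the edge $S \rightarrow C$ leaves $S$, the valve at $S$ is sequential or divergent and therefore closed.

In every case the path contains a closed valve, so it is blocked. The Markov blanket thus d-separates $X$ from all other nodes, and by the global Markov property $X$ is conditionally independent of them given its blanket. $\square$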
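
The claim can also be checked mechanically on a small example. The sketch below uses a hypothetical DAG (it is not part of the exercise) and tests d-separation with the moralized ancestral graph criterion, an equivalent alternative to the valve argument; all function names are illustrative.

```python
# Hypothetical example DAG: child -> list of parents
parents = {
    "A": [], "B": [], "X": ["A"], "S": ["B"],
    "C": ["X", "S"], "D": ["C"], "E": ["S"],
}

def children(node):
    return {c for c, ps in parents.items() if node in ps}

def markov_blanket(node):
    """Parents, children and spouses (other parents of the children)."""
    kids = children(node)
    spouses = {p for c in kids for p in parents[c]} - {node}
    return set(parents[node]) | kids | spouses

def ancestors_of(nodes):
    """The given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen

def d_separated(x, y, z):
    """x and y are d-separated by z iff they are disconnected in the
    moralized ancestral graph of {x, y} and z once z is removed."""
    z = set(z)
    keep = ancestors_of({x, y} | z)
    adj = {n: set() for n in keep}
    for child in keep:
        ps = parents[child]
        for p in ps:                      # undirected parent-child edges
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(ps):        # "marry" co-parents
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    seen, stack = {x}, [x]                # search from x, never entering z
    while stack:
        n = stack.pop()
        for m in adj[n] - z - seen:
            seen.add(m)
            stack.append(m)
    return y not in seen

blanket = markov_blanket("X")
others = set(parents) - blanket - {"X"}
print(sorted(blanket), {y: d_separated("X", y, blanket) for y in sorted(others)})
```

In the moral graph the neighbours of a node are exactly its Markov blanket, which is why removing the blanket always isolates the node.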

## Exercise 3
***

Consider the following BN structure. Answer the following queries and questions, justifying your answers.

![](2.3.svg)

(a) Decide whether the following d-separations hold or not. 
- $G \mathrel{⫫}_{G} D \mid A, E$
- $D \mathrel{⫫}_{G} F$
- $H \mathrel{⫫}_{G} B \mid G, C$
- $G \mathrel{⫫}_{G} H \mid A, F$

(b) Construct a Markov equivalent DAG (other than the given), and return it to Moodle in the format specified at the top of the notebook. How many equivalent DAGs are there in total (including the given one)?

(c) Suppose all variables in this network are binary. How many free parameters are needed to parameterize this network?

### Answer

#### Task a

In order to verify whether a d-separation holds or not we can use the valve system, which is also described in Darwiche's book. We have that:

Let $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{Z}$ be disjoint sets of nodes in a DAG _G_. We say that $\mathbf{X}$ and $\mathbf{Y}$ are _d-separated_ by $\mathbf{Z}$ if and only if every path between a node in $\mathbf{X}$ and a node in $\mathbf{Y}$ is blocked by $\mathbf{Z}$. A path is blocked by $\mathbf{Z}$ if and only if at least one valve on the path is closed given $\mathbf{Z}$.

There are three different kinds of valves, each closed under its own condition:
- **Sequential valve**: -> W ->. This is closed if and only if the variable W appears in $\mathbf{Z}$;
- **Divergent valve**: <- W ->. This is closed if and only if the variable W appears in $\mathbf{Z}$;
- **Convergent valve**: -> W <-. This is closed if and only if neither variable W nor any of its descendants appear in $\mathbf{Z}$.

We then have that:
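- $G \mathrel{⫫}_{G} D \mid A, E$ **does not hold** because every valve on the path G->F->B->A<-C->D is open: the sequential valves G->F->B and F->B->A and the divergent valve A<-C->D are open since F, B and C are not given, and the convergent valve B->A<-C is open since A is given;
- $D \mathrel{⫫}_{G} F$ **holds** because the conditioning set is empty, so the convergent valves B->A<-C and F->B<-E are closed, and every path between D and F passes through one of them;
- $H \mathrel{⫫}_{G} B \mid G, C$ **does not hold** because the sequential valve H->E->B is open (E is not given);
- $G \mathrel{⫫}_{G} H \mid A, F$ **holds** because every path from G starts with G->F->B, and this sequential valve is closed since F is given.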
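
The four answers can be cross-checked by enumerating paths and applying the valve rules directly. This is a minimal sketch: the edge list is read off the adjacency matrix built in Task b, and the helper names (`simple_paths`, `valve_closed`, `d_separated`) are illustrative. Enumerating all simple paths is only practical for small graphs such as this one.

```python
# Edges (parent, child) of the Exercise 3 DAG, as encoded in the Task b matrix
edges = [("B", "A"), ("C", "A"), ("E", "B"), ("F", "B"),
         ("G", "F"), ("H", "E"), ("E", "C"), ("C", "D")]
edge_set = set(edges)

neighbours = {}
for parent, child in edges:
    neighbours.setdefault(parent, set()).add(child)
    neighbours.setdefault(child, set()).add(parent)

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def simple_paths(start, goal, path=None):
    """Yield every undirected path from start to goal without repeated nodes."""
    path = path or [start]
    if path[-1] == goal:
        yield path
        return
    for nxt in sorted(neighbours[path[-1]]):
        if nxt not in path:
            yield from simple_paths(start, goal, path + [nxt])

def valve_closed(a, w, b, given):
    """Valve a - w - b on a path: convergent if both edges point into w."""
    if (a, w) in edge_set and (b, w) in edge_set:
        return not (({w} | descendants(w)) & given)
    return w in given  # sequential or divergent valve

def d_separated(x, y, given):
    given = set(given)
    return all(
        any(valve_closed(p[i - 1], p[i], p[i + 1], given)
            for i in range(1, len(p) - 1))
        for p in simple_paths(x, y)
    )

queries = [("G", "D", {"A", "E"}), ("D", "F", set()),
           ("H", "B", {"G", "C"}), ("G", "H", {"A", "F"})]
for x, y, given in queries:
    print(x, y, sorted(given), d_separated(x, y, given))
```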

In [2]:
"""
Task b

Two DAGs are Markov equivalent if and only if they have the
same skeleton (structure omitting edge directions) and the same
set of (unshielded) v-structures (X → Y ← Z , no edge between Z
and X , also called immorality).
"""

import pandas as pd

# Function to concatenate matrix rows into a single string
def mat2vec(dag):
    return ''.join(str(x) for x in dag.values.reshape(dag.values.shape[0]**2))

# Adjacency matrix
rvs = ["A", "B", "C", "D", "E", "F", "G", "H"]
DAG = pd.DataFrame(0, index=rvs, columns=rvs)

# This is the DAG in the picture
DAG.loc["A", ["B", "C"]] = 1
DAG.loc["B", ["E", "F"]] = 1
DAG.loc["F", "G"] = 1
DAG.loc["E", "H"] = 1
DAG.loc["C", "E"] = 1
DAG.loc["D", "C"] = 1

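# Reversing H -> E keeps the skeleton and creates no new v-structure
# (after the change H has E as its only parent), so the result is Markov equivalent
DAG.loc["E", "H"] = 0
DAG.loc["H", "E"] = 1

# There are 8 equivalent DAGs: F->B, E->B, B->A and C->A are fixed by the two
# v-structures; G-F can point either way (2 options) and the chain H-E-C-D has
# 4 orientations without a new v-structure (one per choice of root), so 2 * 4 = 8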

# Create the vector
print(mat2vec(DAG))

0110000000001100000010000010000000000000000000100000000000001000
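
The count of equivalent DAGs can be verified by brute force. The sketch below, with illustrative helper names, tries every orientation of the eight skeleton edges and keeps the acyclic ones whose unshielded v-structures match those of the given DAG.

```python
from itertools import product

# Edges (parent, child) of the DAG in the picture, as encoded above
edges = [("B", "A"), ("C", "A"), ("E", "B"), ("F", "B"),
         ("G", "F"), ("H", "E"), ("E", "C"), ("C", "D")]

def v_structures(dag):
    """Set of unshielded colliders p -> c <- q with p and q non-adjacent."""
    adjacent = {frozenset(e) for e in dag}
    parents = {}
    for p, c in dag:
        parents.setdefault(c, set()).add(p)
    return {
        (frozenset((p, q)), c)
        for c, ps in parents.items()
        for p in ps for q in ps
        if p < q and frozenset((p, q)) not in adjacent
    }

def is_acyclic(dag):
    """Repeatedly remove nodes without incoming edges; a cycle blocks this."""
    nodes = {n for e in dag for n in e}
    remaining = set(dag)
    while nodes:
        sources = {n for n in nodes if all(c != n for _, c in remaining)}
        if not sources:
            return False
        nodes -= sources
        remaining = {(p, c) for p, c in remaining if p not in sources}
    return True

target = v_structures(edges)
equivalent = []
for flips in product([False, True], repeat=len(edges)):
    dag = [(c, p) if flip else (p, c) for (p, c), flip in zip(edges, flips)]
    if is_acyclic(dag) and v_structures(dag) == target:
        equivalent.append(dag)

print(len(equivalent))
```

All orientations share the skeleton by construction, so only acyclicity and the v-structures need to be tested.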


#### Task c

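In order to calculate the number of free parameters needed to parameterize this network we can use the factorization given by the chain rule: $$P(G)P(H)P(F|G)P(B|E,F)P(A|B,C)P(E|H)P(C|E)P(D|C)$$ A binary variable with $k$ binary parents needs one free parameter for each of the $2^k$ parent configurations, so, taking the factors in the order above, the total is $$1+1+2+4+4+2+2+2 = 18$$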
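
The same count can be reproduced from the parent sets. This is a small sketch in which the dictionaries `parents` and `cardinality` and the function `free_parameters` are names introduced here for illustration.

```python
# Parent sets of the Exercise 3 DAG (child -> parents)
parents = {
    "A": ["B", "C"], "B": ["E", "F"], "C": ["E"], "D": ["C"],
    "E": ["H"], "F": ["G"], "G": [], "H": [],
}
cardinality = {v: 2 for v in parents}  # all variables are binary

def free_parameters(node):
    """(r - 1) free values for each configuration of the parents."""
    configurations = 1
    for p in parents[node]:
        configurations *= cardinality[p]
    return (cardinality[node] - 1) * configurations

per_node = {v: free_parameters(v) for v in parents}
print(per_node)
print("total:", sum(per_node.values()))
```

Changing an entry of `cardinality` shows how quickly the count grows for non-binary variables.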

## Exercise 4
***

Consider again the DAG in Exercise 3.

a) What is the factorization implied by the DAG?

Return the factorization in Moodle in plain text in the exact same format as the example: `P(A,B,C)P(B,C|D)P(C|E,F)`. Here
- a set of variables $\{ B, A, C \}$ is encoded as `A,B,C` (note the alphabetical order);
- the factors themselves are in alphabetical order, so not `P(B)P(A)` but `P(A)P(B)`, not `P(A|C)P(A|B)` but `P(A|B)P(A|C)`.

b) Which of the following independencies are stated by the local Markov condition asserted by the DAG?

- $D \mathrel{⫫} F$
- $E \mathrel{⫫} A \mid H$
- $B \mathrel{⫫} H \mid F, G$

### Answer

#### Task a

The factorization is the one already used in Exercise 3 (Task c). In the requested order it is $$P(A|B,C)P(B|E,F)P(C|E)P(D|C)P(E|H)P(F|G)P(G)P(H)$$

#### Task b

A distribution $P$ satisfies the local Markov property if and only if $X\mathrel{⫫}Nondesc(X)|Parents(X)$ holds for all variables $X$. So:

- $D \mathrel{⫫} F$ **is not stated** because the conditioning set is empty, while both $D$ (parent $C$) and $F$ (parent $G$) have a parent that is not given;
- $E \mathrel{⫫} A \mid H$ **is not stated** because, although $H$ is the parent of $E$, $A$ is in fact a descendant of $E$;
- $B \mathrel{⫫} H \mid F, G$ **is not stated** because the parents of $B$ are $E$ and $F$, not $F$ and $G$, so the conditioning set is not $Parents(B)$; moreover $B$ is a descendant of $H$, so the statement for $H$ does not cover it either.
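
The three checks can be automated under a strict reading of the local Markov condition, in which a statement counts as "stated" only if the conditioning set is exactly the parent set of one of the two variables and the other is a non-descendant. The sketch below makes that assumption explicit; the function names are illustrative.

```python
# Parent sets of the Exercise 3 DAG (child -> parents)
parents = {
    "A": ["B", "C"], "B": ["E", "F"], "C": ["E"], "D": ["C"],
    "E": ["H"], "F": ["G"], "G": [], "H": [],
}

def descendants(node):
    """All nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child, ps in parents.items():
            if current in ps and child not in found:
                found.add(child)
                stack.append(child)
    return found

def local_markov(node):
    """Return (non-descendants excluding parents, parents) of `node`."""
    pa = set(parents[node])
    nondesc = set(parents) - {node} - descendants(node) - pa
    return nondesc, pa

def stated(x, y, given):
    """Strict reading: `given` must equal the parents of x (or of y)."""
    for a, b in ((x, y), (y, x)):
        nondesc, pa = local_markov(a)
        if b in nondesc and pa == set(given):
            return True
    return False

print(stated("D", "F", []))          # candidate 1
print(stated("E", "A", ["H"]))       # candidate 2
print(stated("B", "H", ["F", "G"]))  # candidate 3
print(stated("D", "F", ["C"]))       # for contrast: conditioning on Parents(D)
```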

## Exercise 5
***

Consider the following DAG: $X \rightarrow Y \rightarrow Z$.

(a) Suppose the variables are binary and another, equivalent DAG encodes the same joint distribution with the following parameters:

\begin{aligned}
P(Y = 1) = 0.3 \\
P(X = 1 | Y = 1) = 0.2 \\
P(X = 1 | Y = 0) = 0.8 \\
P(Z = 1 | Y = 1) = 0.8 \\
P(Z = 1 | Y = 0) = 0.2 \\
\end{aligned}

Give the parameters corresponding to the first DAG.

(b) What values do the variables take at the mode of the joint distribution?

(c) Compute the marginal probabilities $P(X)$, $P(Y)$, $P(Z)$ and their respective most probable arguments (which value for each random variable gets the highest probability).
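
One possible way to approach all three parts numerically is to build the joint distribution from the given parameters and read everything off it. The following is only a sketch: it assumes the equivalent DAG is $X \leftarrow Y \rightarrow Z$, as the given parameters suggest, and all names are illustrative.

```python
from itertools import product

# Parameters of the equivalent DAG X <- Y -> Z (from the exercise statement)
p_y1 = 0.3
p_x1_given_y = {1: 0.2, 0: 0.8}
p_z1_given_y = {1: 0.8, 0: 0.2}

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Joint P(x, y, z) = P(y) P(x | y) P(z | y)
joint = {
    (x, y, z): bern(p_y1, y) * bern(p_x1_given_y[y], x) * bern(p_z1_given_y[y], z)
    for x, y, z in product([0, 1], repeat=3)
}

def marginal(index):
    out = {0: 0.0, 1: 0.0}
    for assignment, p in joint.items():
        out[assignment[index]] += p
    return out

p_x, p_y, p_z = marginal(0), marginal(1), marginal(2)

# (a) X -> Y -> Z needs P(X), P(Y | X) and P(Z | Y); P(Z | Y) is unchanged
p_y1_given_x = {
    x: sum(p for (xx, yy, _), p in joint.items() if xx == x and yy == 1) / p_x[x]
    for x in (0, 1)
}

# (b) mode of the joint distribution
mode = max(joint, key=joint.get)

print("P(X=1) =", p_x[1], " P(Y=1|X) =", p_y1_given_x)
print("mode (x, y, z) =", mode, "with probability", joint[mode])
print("P(X) =", p_x, " P(Y) =", p_y, " P(Z) =", p_z)
```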

In [None]:
# Provide your answer in cells here

## Exercise 6
***

Faithfulness. Consider a DAG $X \rightarrow Y$ over binary random variables $X,Y$.

(a) Give parameters for a BN over the DAG such that we have $X \mathrel{⫫} Y$ (marginal independence) and the distribution is positive, i.e., gives a non-zero probability for all assignments of $X,Y$.

(b) Give parameters for a BN over the DAG such that we have $X \not\mathrel{⫫} Y$ such that the distribution is positive as well.

(c) Take the parameters in (a), add small random noise to the parameters and renormalize the probabilities such that you have a (valid) BN (i.e. rows of CPTs should sum up to one and all probabilities should be positive). Do you still have $X \mathrel{⫫} Y$?

(d) Does any of this contradict the soundness and completeness of d-separation? Why?

For each part give a short verbal answer in Moodle, e.g., "P(X = 1) = x, P(Y = 1 | X = 1) = y, P(Y = 1 | X = 0) = z  ...". They will be graded manually.

### Answer

#### Task a

Independence between two variables means that conditioning on one does not change the distribution of the other, _i.e._ $P(X|Y) = P(X)$ and $P(Y|X) = P(Y)$. In terms of the parameters of the DAG $X \rightarrow Y$, this task then reduces to choosing $$P(Y=1|X=1)=P(Y=1|X=0)$$ in which case both are equal to $P(Y=1)$. This holds, for example, when every parameter is equal to 0.5: $P(X=1)=0.5$ and $P(Y=1|X=1)=P(Y=1|X=0)=0.5$. The distribution is positive, since every assignment of $X,Y$ has probability 0.25.

#### Task b

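It's sufficient to make the two conditional parameters differ, for example by adding some noise to the parameters of Task a. Any positive choice with $P(Y=1|X=1) \neq P(Y=1|X=0)$ works, such as $P(X=1)=0.5$, $P(Y=1|X=1)=0.6$, $P(Y=1|X=0)=0.4$.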

#### Task c

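Same as Task b: after adding random noise, the parameters $P(Y=1|X=1)$ and $P(Y=1|X=0)$ almost surely differ, so $X \mathrel{⫫} Y$ no longer holds.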
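
The effect of the perturbation can be seen numerically. The sketch below measures dependence as the largest gap between $P(x,y)$ and $P(x)P(y)$; the noise scale 0.01, the fixed seed and all names are assumptions made for illustration. Because each CPT row is stored as a single Bernoulli parameter, the rows still sum to one after the noise is added.

```python
import random

def joint(p_x1, p_y1_x1, p_y1_x0):
    """Joint P(X, Y) of the BN X -> Y over binary variables."""
    px = {1: p_x1, 0: 1 - p_x1}
    py = {1: {1: p_y1_x1, 0: 1 - p_y1_x1},
          0: {1: p_y1_x0, 0: 1 - p_y1_x0}}
    return {(x, y): px[x] * py[x][y] for x in (0, 1) for y in (0, 1)}

def dependence(j):
    """Largest |P(x, y) - P(x) P(y)|; zero exactly when X and Y are independent."""
    px = {x: j[(x, 0)] + j[(x, 1)] for x in (0, 1)}
    py = {y: j[(0, y)] + j[(1, y)] for y in (0, 1)}
    return max(abs(j[(x, y)] - px[x] * py[y]) for x in (0, 1) for y in (0, 1))

# Task a: every parameter equal to 0.5
params_a = (0.5, 0.5, 0.5)

# Task c: small random noise (hypothetical scale 0.01, fixed seed for repeatability)
rng = random.Random(0)
params_c = tuple(p + rng.uniform(-0.01, 0.01) for p in params_a)

print("Task a:", dependence(joint(*params_a)))
print("Task c:", params_c, dependence(joint(*params_c)))
```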

#### Task d

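No. Soundness says that every d-separation in the DAG corresponds to an independence in every distribution that factorizes over it; $X \rightarrow Y$ has no d-separations, so nothing is violated. Completeness only says that if two nodes are not d-separated then there exists some distribution over the DAG in which they are dependent (as in Task b), not that they are dependent in every such distribution. On the contrary, this confirms what is stated on the slides, _i.e._ that creating a BN whose distribution induces additional unfaithful independencies requires fine hand-tuning of the parameters, and that adding even small perturbations to the parameters will destroy those independencies.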