# Probabilistic Models – Spring 2021
## Exercise Session 2
Feb 3rd 16.15.

Carmen Díez

### Instructions
Make sure the notebook produces correct results when run sequentially starting from the first cell. You can ensure this by clearing all outputs (`Edit > Clear All Outputs`), running all cells (`Run > Run All Cells`), and finally correcting any errors.

To get points:
1. Submit your answers to the automatically checked Moodle test. 
 - You have 5 tries on the test: the highest obtained score will be taken into account.
 - For numerical questions the tolerance varies by question and will be specified in Moodle.
2. Submit this notebook containing your derivations to Moodle.

Some of the exercises will ask you to return a DAG as an answer. To make it possible to evaluate the answer automatically in Moodle use the following format. Construct the DAG as an adjacency matrix where $A[i, j] = 1$ if there is an edge $j \rightarrow i$ and 0 otherwise. The nodes should be in alphabetical order, so $A \rightarrow B$ corresponds to $0 \rightarrow 1$ (or $1 \rightarrow 2$ in R's 1-based indexing). Finally, concatenate all the rows starting from the first one and submit the vector as your answer. For example the DAG $A \rightarrow B \leftarrow C$ is encoded by the matrix 

$$\begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$

and vector $000101000$.

You can make use of the following examples to construct the DAG with Python/R.

In [1]:
# RUN THIS IF WORKING IN PYTHON
import pandas as pd

# Function to concatenate matrix rows into a single string
def mat2vec(dag):
    return ''.join(str(x) for x in dag.values.reshape(dag.values.shape[0]**2))

# Adjacency matrix
rvs = ["A", "B", "C"]
DAG = pd.DataFrame(0, index=rvs, columns=rvs)

# Example: Set parents of B to be A and C.
DAG.loc["B", ["A", "C"]] = 1

# Create the vector
print(mat2vec(DAG))

000101000
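
As a sanity check on the encoding convention (row = child, column = parent), the vector can be decoded back into a list of edges. The following is a minimal sketch; `vec2edges` is a hypothetical helper that is not part of the course material.

```python
def vec2edges(vec, names):
    """Decode a row-concatenated adjacency string into (parent, child) pairs."""
    n = len(names)
    if len(vec) != n * n:
        raise ValueError("vector length must equal len(names) ** 2")
    edges = []
    for i, child in enumerate(names):        # row index = child
        for j, parent in enumerate(names):   # column index = parent
            if vec[i * n + j] == "1":
                edges.append((parent, child))
    return edges

print(vec2edges("000101000", ["A", "B", "C"]))
# [('A', 'B'), ('C', 'B')]
```

Decoding the submitted string before uploading it is a cheap way to catch a transposed matrix.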


## Exercise 1
***

Let us consider a 4-sided die rolling experiment as a multinomial model (i.i.d. multi-valued Bernoulli). We roll the die 20 times, and observe data $D$ with the following counts for the sides:

In [2]:
!cat data/1.csv

side	counts
1	5
2	3
3	7
4	5


(a) Calculate the maximum likelihood parameters, given the above data.

(b) Calculate the posterior distribution $P(\theta_1, \theta_2, \theta_3, \theta_4 | D)$ considering the prior $Dir(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$, with
- $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 1$, i.e., the uniform prior and
- $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0.5$, i.e., the Jeffreys prior.

For both, report the mean. 

(c) Using Bayesian inference with the uniform prior, calculate the predictive distribution (all 4 probabilities) of the next result given $D$.

(d) Let $\alpha_4$, the 4th hyperparameter to the Dirichlet prior be 3. Specify $\alpha_1$, $\alpha_2$ and $\alpha_3$ such that the mode of the posterior distribution is at $\theta_1 = \theta_2 = \theta_3 = \theta_4$.


In [3]:
data = "data/1.csv"
df = pd.read_csv(data, sep='\t')
df

   side  counts
0     1       5
1     2       3
2     3       7
3     4       5


### Answers

(a) Let $i \in \{1,2,3,4\}$ (the different sides) and $\theta_i = p_i$. 

Let $\mathbf{p} = (p_1, \dots, p_4)$ with $\sum_{i} p_{i}=1$ be the parameters of the multinomial distribution, $n=20$ the number of rolls, and $x_i$ the number of times that side $i$ was observed, so that $\sum_{i} x_{i}=n$.

The multinomial distribution with these parameters is defined by: 

$f_{\mathbf{p}}(\mathbf{x})=n ! \cdot \prod_{i} \frac{p_{i}^{x_{i}}}{x_{i} !}$

and the likelihood is then:

$L(\mathbf{p})=f_{\mathbf{p}}(\mathbf{x})$. 

We have the constraint $C(\mathbf{p})=\sum_{i} p_{i}=1$. To maximize $L$ under this constraint, we can use the method of Lagrange multipliers: the gradients of $L$ and $C$ have to be collinear:

$\frac{\partial}{\partial p_{i}} L(\mathbf{p})=\lambda \frac{\partial}{\partial p_{i}} C(\mathbf{p})$

Differentiating gives:

$\frac{x_{i}}{p_{i}} L(\mathbf{p})=\lambda$

So $p_i = x_i L(\mathbf{p}) / \lambda$, that is, $p_i$ is proportional to $x_i$. Since $\sum_{i} p_{i}=1$ and $\sum_{i} x_{i}=n$, we finally get that:

$\hat{p}_{i}=\frac{x_{i}}{n}$

In [4]:
total = df['counts'].sum()
for c in df['counts']:
    print(c/total)

0.25
0.15
0.35
0.25
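
The closed form can be checked numerically. The sketch below is an illustration only: it compares the log-likelihood at $x_i/n$ against many random points on the probability simplex, with the counts copied from the data above. The helper name `log_lik` and the sampling scheme are assumptions made for this check.

```python
import math
import random

counts = [5, 3, 7, 5]
n = sum(counts)

def log_lik(p):
    # Multinomial log-likelihood up to the constant log(n! / prod x_i!)
    return sum(x * math.log(pi) for x, pi in zip(counts, p))

mle = [x / n for x in counts]

random.seed(0)
best_random = -math.inf
for _ in range(10_000):
    w = [random.random() + 1e-12 for _ in counts]   # strictly positive weights
    total_w = sum(w)
    p = [wi / total_w for wi in w]                  # normalize onto the simplex
    best_random = max(best_random, log_lik(p))

print(mle, log_lik(mle), best_random)
```

No sampled point should beat the closed-form estimate, which is what the Lagrange argument predicts.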


(b) Dirichlet prior:

$Dir(\mathbf{p}; \alpha_1, \alpha_2, \alpha_3, \alpha_4)=\frac{\Gamma\left(\sum_{i=1}^{K} \alpha_{i}\right)}{\prod_{i=1}^{K} \Gamma\left(\alpha_{i}\right)} \prod_{i=1}^{K} p_{i}^{\alpha_{i}-1}$

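Dirichlet posterior (proportional to the likelihood times the prior; since the Dirichlet is conjugate to the multinomial, it is the same type of distribution with updated parameters):

$P(\mathbf{p} | D) = Dir(\mathbf{p}; \alpha_1 + x_1, \alpha_2 + x_2, \alpha_3 + x_3, \alpha_4 + x_4)$

The mean of a $Dir(\alpha_1, \dots, \alpha_K)$ distribution is:

$\mathrm{E}\left[p_{i}\right]=\frac{\alpha_{i}}{\sum_{k=1}^{K} \alpha_{k}}$

so the posterior mean is $\frac{\alpha_{i}+x_i}{\sum_{k=1}^{4} \alpha_{k}+n}$.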

In [5]:
alpha = 1 #uniform prior
for c in df['counts']:
    print((c+alpha)/(total+4*alpha))

0.25
0.16666666666666666
0.3333333333333333
0.25


In [6]:
alpha = 0.5 #jeffreys prior
for c in df['counts']:
    print((c+alpha)/(total+4*alpha))

0.25
0.1590909090909091
0.3409090909090909
0.25
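
To avoid floating-point noise, the same posterior means can be computed exactly. This is a small sketch using the standard-library `fractions` module; `posterior_mean` is a hypothetical helper and the counts are copied from the data above.

```python
from fractions import Fraction

counts = [5, 3, 7, 5]

def posterior_mean(alpha):
    """Mean of Dir(alpha + x_1, ..., alpha + x_4) for a symmetric prior Dir(alpha, ..., alpha)."""
    post = [Fraction(alpha) + x for x in counts]   # posterior hyperparameters
    total = sum(post)
    return [a / total for a in post]

uniform_mean = posterior_mean(Fraction(1))        # uniform prior
jeffreys_mean = posterior_mean(Fraction(1, 2))    # Jeffreys prior
print(uniform_mean)
print(jeffreys_mean)
```

The exact values are $1/4, 1/6, 1/3, 1/4$ for the uniform prior and $1/4, 7/44, 15/44, 1/4$ for the Jeffreys prior, matching the decimals printed above.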


(c) Prediction by model averaging (same as the mean of the Dirichlet posterior with the updated parameters):

$P(X=i|D,\alpha) = \frac{\alpha_i+x_i}{\sum_{k=1}^{4} (\alpha_{k}+x_k)}$

So the results are the same as in (b) with uniform prior.
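
For reference, the step from model averaging to the posterior mean can be written out as follows, with $\theta$ denoting the side probabilities as in the exercise statement. The numbers use the uniform prior ($\alpha_k = 1$) and the counts from the data.

$$P(X=i \mid D) = \int \theta_i \, Dir(\theta; \alpha_1 + x_1, \dots, \alpha_4 + x_4) \, d\theta = \mathrm{E}[\theta_i \mid D] = \frac{\alpha_i + x_i}{\sum_{k=1}^{4} (\alpha_k + x_k)}$$

$$P(X=1 \mid D) = \tfrac{6}{24}, \quad P(X=2 \mid D) = \tfrac{4}{24}, \quad P(X=3 \mid D) = \tfrac{8}{24}, \quad P(X=4 \mid D) = \tfrac{6}{24}$$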

(d) The mode of the posterior Dirichlet (valid when every $\alpha_i + x_i > 1$) is 

$\theta_{i}=\frac{\alpha_{i}+x_i-1}{\sum_{k=1}^{4} (\alpha_{k}+x_k)-4}$

Having $\alpha_4=3$, for all the $\theta_{i}$ to be the same we need: 

$\theta_4=\frac{\alpha_{4}+x_4-1}{\sum_{k=1}^{4} (\alpha_{k}+x_k)-4} = \frac{3+5-1}{\sum_{k=1}^{4} (\alpha_{k}+x_k)-4}=\theta_1=\theta_2=\theta_3$

The denominator is the same for every $i$, so the numerators must be equal. Then,

$\alpha_1+x_1=\alpha_2+x_2=\alpha_3+x_3= 3+5$, as $\alpha_4+x_4= 3+5$.

$\alpha_1+5=\alpha_2+3=\alpha_3+7=8$ 

Thus,

$\alpha_1=3, \alpha_2=5, \alpha_3=1$.
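
A quick exact check of this answer, as a sketch: plug the hyperparameters into the Dirichlet mode formula $(a_i - 1) / (\sum_k a_k - K)$, where $a_i = \alpha_i + x_i$ are the posterior parameters. The variable names are chosen for this illustration.

```python
from fractions import Fraction

counts = [5, 3, 7, 5]
alpha = [3, 5, 1, 3]   # alpha_4 = 3 is given; the other three are the answer above

post = [a + x for a, x in zip(alpha, counts)]            # posterior parameters, all 8
K = len(post)
mode = [Fraction(a - 1, sum(post) - K) for a in post]    # valid because every a > 1
print(post, mode)
```

Every component of the mode comes out as $7/28 = 1/4$, so the posterior mode sits at the uniform distribution as required.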

## Exercise 2
***

Show by using the d-separation criterion that a node in a Bayesian network is conditionally independent of all the other nodes, given its (minimal) Markov blanket (parents, children, spouses (parents of children)). 

Give the answer verbally in Moodle. It will be checked manually. For your own future reference it's a good idea to paste the answer here, too. 

### Answers

#### d-separation criterion
If $X$, $Y$, and $Z$ are three disjoint subsets of nodes in a DAG $G$, then $Z$ is said to d-separate $X$ and $Y$, if there is no path between a node in $X$ and a node in $Y$ along which the following two conditions hold:
1. every node with converging arrows (→ $V$ ←) is in $Z$ or has a descendant in $Z$.
2. every other node is outside $Z$.

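If $Z$ d-separates $X$ and $Y$ in $G$, we write $X \perp \!\!\! \perp_G Y|Z$ and say that $X$ is blocked from $Y$ by $Z$. By the soundness of d-separation, $X$ and $Y$ are then conditionally independent given $Z$ in every distribution that factorizes according to $G$.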

#### Markov blanket

The Markov blanket of a node $X$ in a Bayesian network is formed by the parents, children and spouses of $X$ (variable $Y$ is a spouse of $X$ if the two variables have a common child in DAG $G$).


#### Proof

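We want $X$ to be independent of every node $Y \neq X$ with $Y \notin B$, given its Markov blanket $B$ (parents, children and spouses), meaning:

$X \perp \!\!\! \perp Y | B$

Every path from $X$ to $Y$ starts with an edge to a parent or a child of $X$. Since $Y \notin B$, that neighbour is not $Y$ itself, so the path continues past it. It suffices to show that each such path is blocked by $B$:

* If the path starts through a parent $P$ ($X$ ← $P$), then $P$ is not a collider on the path (the edge $P$ → $X$ leaves it) and $P \in B$, so the path is blocked.

* If the path starts through a child $C$ and continues to a child of $C$ ($X$ → $C$ → ...), then $C$ is not a collider on the path and $C \in B$, so the path is blocked.

* If the path starts through a child $C$ and continues to another parent $S$ of $C$ ($X$ → $C$ ← $S$), then $S$ is a spouse of $X$, so $S \in B$ and therefore $S \neq Y$. The path continues past $S$, and $S$ is not a collider on it (the edge $S$ → $C$ leaves it), so the path is blocked.

Hence no path between $X$ and $Y$ is active given $B$, and $B$ d-separates $X$ from $Y$.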

$\square$
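
The argument can also be checked mechanically on a concrete graph. The sketch below is an illustration under assumptions: it uses the DAG whose parent sets follow the factorization given in Exercise 4, and it tests d-separation with the moralized ancestral graph criterion (a standard equivalent of the path-based definition) rather than by enumerating paths. All function names are hypothetical.

```python
parents = {
    "A": {"B", "C"}, "B": {"E", "F"}, "C": {"E"}, "D": {"C"},
    "E": {"H"}, "F": {"G"}, "G": set(), "H": set(),
}

def ancestors_of(nodes, parents):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(x, y, z, parents):
    """True if z d-separates x and y (moralized ancestral graph criterion)."""
    keep = ancestors_of({x, y} | set(z), parents)
    nbrs = {v: set() for v in keep}
    for v in keep:
        ps = list(parents[v])            # parents of a kept node are kept too
        for p in ps:                     # drop edge directions
            nbrs[v].add(p)
            nbrs[p].add(v)
        for a in ps:                     # marry co-parents
            for b in ps:
                if a != b:
                    nbrs[a].add(b)
    seen, stack = {x}, [x]               # search from x without entering z
    while stack:
        for w in nbrs[stack.pop()]:
            if w == y:
                return False
            if w not in seen and w not in z:
                seen.add(w)
                stack.append(w)
    return True

def markov_blanket(x, parents):
    children = {v for v in parents if x in parents[v]}
    spouses = {p for c in children for p in parents[c]} - {x}
    return set(parents[x]) | children | spouses

results = {}
for x in parents:
    mb = markov_blanket(x, parents)
    others = set(parents) - mb - {x}
    results[x] = all(d_separated(x, y, mb, parents) for y in others)
    print(x, sorted(mb), results[x])
```

Each node should be reported as d-separated from all nodes outside its blanket, in line with the proof.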

## Exercise 3
***

Consider the following BN structure. Answer the following queries and questions, justifying your answers.

![](2.3.svg)

(a) Decide whether the following d-separations hold or not. 
- $G \mathrel{\unicode{x2AEB}}_{G} D \mid A, E$
- $D \mathrel{\unicode{x2AEB}}_{G} F$
- $H \mathrel{\unicode{x2AEB}}_{G} B \mid G, C$
- $G \mathrel{\unicode{x2AEB}}_{G} H \mid A, F$

(b) Construct a Markov equivalent DAG (other than the given), and return it to Moodle in the format specified at the top of the notebook. How many equivalent DAGs are there in total (including the given one)?

(c) Suppose all variables in this network are binary. How many free parameters are needed to parameterize this network?

### Answers

(a)
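* False: cannot guarantee independence. There exists an active path: G → F → B → A ← C → D. The collider A is in the conditioning set, and none of the other nodes on the path is.
* True: all possible paths are blocked at the colliders in F → B ← E and B → A ← C, since nothing is conditioned on.
* False: cannot guarantee independence. There exists an active path: H → E → B, because E is not in the conditioning set.
* True: every path from G passes through G → F → B, where the non-collider F is in the conditioning set.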
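
The four answers can be reproduced by listing the active paths directly. This is a sketch under assumptions: the parent sets are taken from the factorization stated in part (c), and the helper names (`simple_paths`, `is_active`, `active_paths`) are made up for this illustration. It enumerates every simple path in the skeleton, which is only practical for small graphs like this one.

```python
parents = {
    "A": {"B", "C"}, "B": {"E", "F"}, "C": {"E"}, "D": {"C"},
    "E": {"H"}, "F": {"G"}, "G": set(), "H": set(),
}
children = {v: {c for c in parents if v in parents[c]} for v in parents}

def descendants(v):
    seen, stack = set(), [v]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def simple_paths(x, y, path=None):
    """Yield every simple path from x to y in the skeleton."""
    path = path or [x]
    last = path[-1]
    if last == y:
        yield path
        return
    for w in parents[last] | children[last]:
        if w not in path:
            yield from simple_paths(x, y, path + [w])

def is_active(path, z):
    for a, v, b in zip(path, path[1:], path[2:]):
        if a in parents[v] and b in parents[v]:          # v is a collider
            if v not in z and not (descendants(v) & z):
                return False
        elif v in z:                                     # non-collider in z
            return False
    return True

def active_paths(x, y, z):
    return [p for p in simple_paths(x, y) if is_active(p, set(z))]

queries = [("G", "D", {"A", "E"}), ("D", "F", set()),
           ("H", "B", {"G", "C"}), ("G", "H", {"A", "F"})]
d_sep = [not active_paths(x, y, z) for x, y, z in queries]
print(d_sep)   # [False, True, False, True]
```

An empty list of active paths means the d-separation holds; a non-empty list gives a concrete witness path.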

(b)

In [7]:
# Adjacency matrix
rvs = ["A", "B", "C", "D", "E", "F", "G", "H"]
DAG = pd.DataFrame(0, index=rvs, columns=rvs)

DAG.loc['A', ['B', 'C']] = 1
DAG.loc['B', ['E', 'F']] = 1
DAG.loc['C', 'E'] = 1
DAG.loc['D', 'C'] = 1
DAG.loc['E', 'H'] = 1
DAG.loc['G', 'F'] = 1 # direction changed in this one

# Create the vector
print(mat2vec(DAG))

0110000000001100000010000010000000000001000000000000010000000000


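There are 8 equivalent DAGs. Equivalent DAGs share the same skeleton and the same immoralities, so: 

* F → B ← E and B → A ← C must remain the same, which fixes four edges.
* The remaining four edges (G – F, H – E, E – C, C – D) can each be oriented in two ways, but we must be careful not to create new immoralities among the H, E, C, D nodes. Of the $2^4 = 16$ orientations, those containing H → E ← C or E → C ← D are excluded, and they make up half of them.

$\frac{2^4}{2}=8$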
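
The count can be confirmed by brute force. The sketch below is an illustration only: it tries all $2^8$ orientations of the skeleton and keeps the acyclic ones with exactly the original immoralities. The edge list is read off the factorization in part (c), and the helper names are hypothetical.

```python
from itertools import product

# (parent, child) pairs of the given DAG
edges = [("B", "A"), ("C", "A"), ("E", "B"), ("F", "B"),
         ("E", "C"), ("C", "D"), ("H", "E"), ("G", "F")]

def immoralities(edge_list):
    """Set of (unordered parent pair, child) with non-adjacent parents."""
    adjacent = {frozenset(e) for e in edge_list}
    pa = {}
    for p, c in edge_list:
        pa.setdefault(c, set()).add(p)
    return {(frozenset((a, b)), c)
            for c, ps in pa.items() for a in ps for b in ps
            if a < b and frozenset((a, b)) not in adjacent}

def is_acyclic(edge_list):
    nodes = {v for e in edge_list for v in e}
    remaining = list(edge_list)
    while nodes:
        sources = nodes - {c for _, c in remaining}   # nodes without parents
        if not sources:
            return False                              # only cycles are left
        nodes -= sources
        remaining = [(p, c) for p, c in remaining if p not in sources]
    return True

target = immoralities(edges)
equivalent = []
for flips in product([False, True], repeat=len(edges)):
    cand = [(c, p) if f else (p, c) for (p, c), f in zip(edges, flips)]
    if is_acyclic(cand) and immoralities(cand) == target:
        equivalent.append(cand)

print(len(equivalent))   # 8
```

Any element of `equivalent` other than the original edge list is a valid answer to the first half of (b).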

(c) Given the factorization (calculated in the next exercise): P(A|B,C)P(B|E,F)P(C|E)P(D|C)P(E|H)P(F|G)P(G)P(H).

In [8]:
# A, B: 4 parameters each (two parents); C, D, E, F: 2 each (one parent); G, H: 1 each
print(2*4+4*2+2*1, 'free parameters needed.')

18 free parameters needed.
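
The same count can be derived from the parent sets instead of being typed in by hand. In this sketch (variable names are assumptions) a binary node with $k$ binary parents contributes $2^k$ free parameters, one per parent configuration.

```python
parents = {
    "A": {"B", "C"}, "B": {"E", "F"}, "C": {"E"}, "D": {"C"},
    "E": {"H"}, "F": {"G"}, "G": set(), "H": set(),
}

# One free parameter P(node = 1 | parent configuration) per configuration
free = {node: 2 ** len(ps) for node, ps in parents.items()}
total_free = sum(free.values())
print(free, total_free)   # total is 18
```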


## Exercise 4
***

Consider again the DAG in Exercise 3.

a) What is the factorization implied by the DAG?

Return the factorization in Moodle in plain text in the exact same format as the example: `P(A,B,C)P(B,C|D)P(C|E,F)`. Here
- a set of variables $\{ B, A, C \}$ is encoded as `A,B,C` (note the alphabetical order);
- the factors themselves are in alphabetical order, so not `P(B)P(A)` but `P(A)P(B)`, not `P(A|C)P(A|B)` but `P(A|B)P(A|C)`.

b) Which of the following independencies are stated by the local Markov condition asserted by the DAG?

- $G \mathrel{\unicode{x2AEB}} H$
- $D \mathrel{\unicode{x2AEB}} F$
- $E \mathrel{\unicode{x2AEB}} A \mid H$
- $B \mathrel{\unicode{x2AEB}} H \mid F, G$

### Answers

(a) P(A|B,C)P(B|E,F)P(C|E)P(D|C)P(E|H)P(F|G)P(G)P(H)

(b) Local Markov: for any $X$, $X  \perp \!\!\! \perp Nondesc(X)|Pa(X)$ holds.

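* True: $G$ is independent of $H$ (a non-descendant) given its parents (it has none).
* False: the parents of $D$ ($C$) and of $F$ ($G$) are not given.
* False: $A \in descendants(E)$.
* False: the conditioning set is not the parent set of $B$ ($E$ is missing and $G$ is not a parent), nor that of $H$.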
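
These checks can be automated. The sketch below is one way to read "stated by the local Markov condition": a statement between two single variables counts if, for one of them, the conditioning set is exactly its parent set and the other variable is a non-descendant. That reading, and the helper names, are assumptions of this illustration.

```python
parents = {
    "A": {"B", "C"}, "B": {"E", "F"}, "C": {"E"}, "D": {"C"},
    "E": {"H"}, "F": {"G"}, "G": set(), "H": set(),
}

def descendants(v):
    seen, stack = set(), [v]
    while stack:
        node = stack.pop()
        for c in parents:
            if node in parents[c] and c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def local_markov_at(x, y, given):
    """x _||_ y | given is an instance of the local Markov condition at x."""
    return (set(given) == parents[x]
            and y != x
            and y not in parents[x]
            and y not in descendants(x))

def stated(x, y, given):
    # Independence is symmetric, so check the condition at either endpoint
    return local_markov_at(x, y, given) or local_markov_at(y, x, given)

queries = [("G", "H", set()), ("D", "F", set()),
           ("E", "A", {"H"}), ("B", "H", {"F", "G"})]
answers = [stated(x, y, z) for x, y, z in queries]
print(answers)   # [True, False, False, False]
```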

## Exercise 5
***

Consider the following DAG: $X \rightarrow Y \rightarrow Z$.

(a) Suppose the variables are binary and another, equivalent DAG encodes the same joint distribution with the following parameters:

\begin{aligned}
P(Y = 1) = 0.3 \\
P(X = 1 | Y = 1) = 0.2 \\
P(X = 1 | Y = 0) = 0.8 \\
P(Z = 1 | Y = 1) = 0.8 \\
P(Z = 1 | Y = 0) = 0.2 \\
\end{aligned}

Give the parameters corresponding to the first DAG.

(b) What values do the variables take at the mode of the joint distribution?

(c) Compute the marginal probabilities $P(X)$, $P(Y)$, $P(Z)$ and their respective most probable arguments (which value for each random variable gets the highest probability).

### Answers

Using 

$P(A)=\sum_{i}P(A|B_i)P(B_i)$,

$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$,

$P(C,x_1,x_2,...x_n)= P(C) \prod_{i=1}^{n}P(x_i| C)$,

and the data given:

In [9]:
def pY(y):
    if y==1: 
        return 0.3
    elif y==0: 
        return 1-0.3
def pXCondY(x,y):
    if x==1 and y==1:
        return 0.2
    if x==0 and y==1:
        return 1-0.2
    if x==1 and y==0: 
        return 0.8
    if x==0 and y==0: 
        return 1-0.8

def pZCondY(z,y):
    if z==1 and y==1: 
        return 0.8
    if z==0 and y==1: 
        return 1-0.8
    if z==1 and y==0: 
        return 0.2
    if z==0 and y==0: 
        return 1-0.2
    
def pX(x):
    pX1 = pXCondY(1,0)*pY(0)+pXCondY(1,1)*pY(1)
    if x==1:
        return pX1
    if x==0:
        return 1-pX1

def pZ(z):
    pZ1 = pZCondY(1,0)*pY(0)+pZCondY(1,1)*pY(1)
    if z==1:
        return pZ1
    if z==0:
        return 1-pZ1
    
def pYCondX(y,x):
    pY1CondX1 = pXCondY(1,1)*pY(1)/pX(1)
    pY1CondX0 = pXCondY(0,1)*pY(1)/pX(0)
    if y==1 and x==1:
        return pY1CondX1
    if y==0 and x==1:
        return 1-pY1CondX1
    if y==1 and x==0:
        return pY1CondX0
    if y==0 and x==0:
        return 1-pY1CondX0

def pXIYIZ(x,y,z):
    return pY(y)*pXCondY(x,y)*pZCondY(z,y) #we have the conditional for X and Z given Y

(a)

In [10]:
print('P(X=1)=', pX(1))
print('P(Y=1|X=0)=', pYCondX(1,0))
print('P(Y=1|X=1)=', pYCondX(1,1))
print('P(Z=1|Y=0)=', pZCondY(1,0))
print('P(Z=1|Y=1)=', pZCondY(1,1))

P(X=1)= 0.6199999999999999
P(Y=1|X=0)= 0.6315789473684208
P(Y=1|X=1)= 0.09677419354838711
P(Z=1|Y=0)= 0.2
P(Z=1|Y=1)= 0.8
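
To confirm that the new parameters describe the same joint distribution, both factorizations can be compared on all eight configurations. This is a self-contained sketch: the dictionary names and the `bern` helper are chosen for this illustration, and the numbers for the DAG $X \leftarrow Y \rightarrow Z$ are the ones given in the exercise.

```python
from itertools import product
import math

# Parameters of the equivalent DAG X <- Y -> Z (from the exercise statement)
pY1 = 0.3
pX1_given_Y = {1: 0.2, 0: 0.8}   # P(X=1 | Y=y)
pZ1_given_Y = {1: 0.8, 0: 0.2}   # P(Z=1 | Y=y)

def bern(p1, v):
    """Probability of value v for a binary variable with P(v=1) = p1."""
    return p1 if v == 1 else 1 - p1

def joint_fork(x, y, z):
    return bern(pY1, y) * bern(pX1_given_Y[y], x) * bern(pZ1_given_Y[y], z)

# Parameters of the chain X -> Y -> Z via marginalization and Bayes' rule
pX1 = sum(bern(pY1, y) * pX1_given_Y[y] for y in (0, 1))
pY1_given_X = {x: pY1 * bern(pX1_given_Y[1], x) / bern(pX1, x) for x in (0, 1)}

def joint_chain(x, y, z):
    return bern(pX1, x) * bern(pY1_given_X[x], y) * bern(pZ1_given_Y[y], z)

same = all(math.isclose(joint_fork(*v), joint_chain(*v))
           for v in product((0, 1), repeat=3))
print(pX1, pY1_given_X, same)
```

The parameters $P(Z \mid Y)$ are shared by both DAGs, so only $P(X)$ and $P(Y \mid X)$ had to be recomputed.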


(b)

In [11]:
maxi = 0
x_max = -1
y_max = -1
z_max = -1
for x in range(2):
    for y in range(2):
        for z in range(2):
            prob = pXIYIZ(x,y,z)
            if prob > maxi:
                maxi = prob
                x_max = x
                y_max = y
                z_max = z
            print(x,y,z,' ',prob)
print('Max:',x_max,y_max,z_max, ' ', maxi)

0 0 0   0.11199999999999997
0 0 1   0.027999999999999994
0 1 0   0.04799999999999999
0 1 1   0.192
1 0 0   0.44799999999999995
1 0 1   0.11199999999999999
1 1 0   0.011999999999999997
1 1 1   0.048
Max: 1 0 0   0.44799999999999995


(c)

In [12]:
print('P(X=1)=', pX(1))
print('P(Y=1)=', pY(1))
print('P(Z=1)=', pZ(1))

P(X=1)= 0.6199999999999999
P(Y=1)= 0.3
P(Z=1)= 0.38
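
The question also asks for the most probable value of each variable on its own. A short sketch of that last step, recomputing the marginals from the given parameters (names are chosen for this illustration):

```python
pY1 = 0.3
pX1 = 0.8 * (1 - pY1) + 0.2 * pY1   # sum over y of P(X=1 | Y=y) P(Y=y)
pZ1 = 0.2 * (1 - pY1) + 0.8 * pY1   # sum over y of P(Z=1 | Y=y) P(Y=y)

marginal_p1 = {"X": pX1, "Y": pY1, "Z": pZ1}
# A binary variable's most probable value is 1 exactly when P(=1) > 0.5
most_probable = {v: int(p1 > 0.5) for v, p1 in marginal_p1.items()}
print(most_probable)   # {'X': 1, 'Y': 0, 'Z': 0}
```

So the most probable values are $X=1$ (0.62), $Y=0$ (0.7) and $Z=0$ (0.62). In this example they coincide with the joint mode found in (b), which need not happen in general.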


## Exercise 6
***

Faithfulness. Consider a DAG $X \rightarrow Y$ over binary random variables $X,Y$.

(a) Give parameters for a BN over the DAG such that we have $X \mathrel{\unicode{x2AEB}} Y$ (conditional independence).

(b) Give parameters for a BN over the DAG such that we have $X \not\mathrel{\unicode{x2AEB}} Y$.

(c) Take the parameters in (a), add small random noise to the parameters and renormalize the probabilities such that you have a (valid) BN. Do you still have $X \mathrel{\unicode{x2AEB}} Y$?

(d) Does any of this contradict the soundness and completeness of d-separation? Why?

For each part give a short verbal answer in Moodle, e.g., "P(X = 1) = x, P(Y = 1 | X = 1) = y, P(Y = 1 | X = 0) = z  ...". They will be graded manually.

### Answers

(a) For $X \mathrel{\unicode{x2AEB}} Y$ we need $P(Y | X) = P(Y)$, i.e., the conditional distribution of $Y$ must be the same for both values of $X$. So, for instance: 

$P(X=1)=0.5$ and $P(Y=1|X=0)=P(Y=1|X=1)=0.5$.

(b) We want the opposite thing, for example:

$P(X=1)=0.5$, $P(Y=1|X=0)=0.6$ and $P(Y=1|X=1)=0.4$. Then $P(Y | X) = P(Y)$ doesn't hold.

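(c) With noise the independence most likely won't hold: independence requires the exact equality $P(Y=1|X=0) = P(Y=1|X=1)$, and independent random perturbations of the two parameters will almost surely break it. With small perturbations of the parameters, independencies that are not implied by the graph are usually destroyed.

(d) No. Soundness says that d-separation in the DAG implies independence in every distribution that factorizes over it; $X$ and $Y$ are not d-separated in $X \rightarrow Y$, so soundness claims nothing about them, and the independence in (a) is allowed. Completeness says that when two nodes are not d-separated, there is some distribution over the DAG in which they are dependent, and (b) is such a distribution. Furthermore, (c) shows that adding unfaithful independencies to the BN requires fine hand-tuning of the parameters.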
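
Part (c) can be tried out numerically. The sketch below is an illustration under assumptions: it parameterizes the DAG $X \rightarrow Y$ by $P(X=1)$, $P(Y=1|X=0)$ and $P(Y=1|X=1)$, starts from an independent setting with all three equal to 0.5, and adds uniform noise of magnitude 0.01. The noise scale, the seed and the helper names are arbitrary choices.

```python
import random

def gap(theta):
    """|P(Y=1|X=1) - P(Y=1|X=0)|, which is zero exactly when X and Y are
    independent (assuming 0 < P(X=1) < 1)."""
    return abs(theta["P(Y=1|X=1)"] - theta["P(Y=1|X=0)"])

def perturb(theta, scale=0.01, rng=random):
    # For a binary variable P(.=0) = 1 - P(.=1), so keeping each parameter
    # strictly inside (0, 1) is all the renormalization that is needed.
    return {k: min(max(v + rng.uniform(-scale, scale), 1e-9), 1 - 1e-9)
            for k, v in theta.items()}

random.seed(0)
independent = {"P(X=1)": 0.5, "P(Y=1|X=0)": 0.5, "P(Y=1|X=1)": 0.5}
noisy = perturb(independent)

print(gap(independent), gap(noisy))
```

The gap is exactly zero before the perturbation and nonzero afterwards, illustrating that the independence in (a) depends on an exact equality between parameters.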