## Probabilistic Models – Spring 2022
## First Course Exam, March 11.3.2022 9.00-11.30

<span style="color:red">**Enrico Buratto 015621911**</span>

### Instructions
Make sure the notebook produces correct results when run sequentially starting from the first cell. You can ensure this by clearing all outputs, running all cells, and finally correcting any errors (e.g. Kernel -> Restart & Run all).

-Submit this notebook containing your derivations to Moodle.

-The return form will be closed at 11.30 (or a minute later), be sure to submit your answers in time. Do not forget any files.

-Any outside contact during the exam is strictly forbidden.

-Points are given only to answers that have all the calculations and justifications returned. This includes commenting your code. JUSTIFY ALL YOUR ANSWERS, FINAL RESULTS ARE NOT ENOUGH.

-Return any partial solutions, points are awarded for partial solutions. FOCUS ESPECIALLY ON DESCRIBING EACH STEP IN CALCULATIONS, these are more highly awarded than the numerical calculations. Implementing full general algorithms is not required (though allowed), you need to run the algorithm only on the particular instance.

-Some questions may benefit from online search, you are permitted to use internet, books and sources -
but direct copy-paste from sources is not permitted, and ANY SOURCE, OUTSIDE COURSE MATERIAL, YOU USE, MUST BE CITED.
You can use code from your exercises (if it is needed, cited).

-You can use any language for the calculations, Python and R are preferred.

-The answers will be graded by a person.

Return format:

-Jupyter notebooks are the preferred format.

-Text format is also permitted.

-Pdf is also permitted.

-Clear photos of clearly hand written answers are also permitted.

# Question 1 (6 points)
***

Select 3 of the following 4 concepts and explain their meaning in the context of this course by one paragraph of text (4-5 sentences). How is it defined? What limitations does it have? What is good about it? What is troublesome about it? What can it be used for? How can it be avoided? Provide answers to the appropriate questions to each term.


(a) Soundness and Completeness of d-separation.

(b) Overfitting.

(c) Equivalent Sample size.

(d) Hidden Markov Model.


### Answer

**Soundness and Completeness of d-separation**: both are properties, stated as theorems, that relate d-separation in the graph to conditional independence in the distribution. The soundness theorem says that if two nodes $X$ and $Y$ are d-separated given another node $Z$ in a DAG $G$, then $X$ and $Y$ are conditionally independent given $Z$ in every distribution that factorizes according to $G$; formally: $ X \mathrel{⫫}_{G} Y | Z \Rightarrow  X \mathrel{⫫} Y | Z$. The converse does not hold in general: there can be independencies in the probability distribution that d-separation does not reveal, because d-separation only uses the structure and not the parameters. The completeness theorem says, however, that there exists a parametrization $\theta$ (not necessarily unique) for which the independence relations in the distribution correspond exactly to the d-separations; formally, for such a $\theta$: $X \mathrel{⫫}_{G} Y | Z \Leftrightarrow  X \mathrel{⫫} Y | Z$.
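
The gap between the two theorems can be illustrated on the two-node DAG X -> Y, where X and Y are never d-separated. The sketch below is only an illustration: the function names and all parameter values are hypothetical. One parametrization makes Y depend on X, the other makes the conditional table ignore X, so an independence appears that the structure does not show.

```python
from itertools import product

def joint_xy(p_x1, p_y1_given_x):
    """Joint P(X, Y) for the two-node DAG X -> Y with binary variables."""
    joint = {}
    for x, y in product((0, 1), repeat=2):
        p_x = p_x1 if x == 1 else 1 - p_x1
        p_y = p_y1_given_x[x] if y == 1 else 1 - p_y1_given_x[x]
        joint[(x, y)] = p_x * p_y
    return joint

def independent(joint, tol=1e-12):
    """True if P(x, y) = P(x) P(y) for every assignment."""
    p_x = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    p_y = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    return all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < tol for x, y in joint)

# Hypothetical parameters: the first CPT depends on X, the second ignores it.
dependent_params = joint_xy(0.3, {0: 0.2, 1: 0.9})
flat_params = joint_xy(0.3, {0: 0.5, 1: 0.5})

print(independent(dependent_params))  # False: dependence matches the graph
print(independent(flat_params))       # True: independence not visible in the graph
```

Soundness is not violated in either case, since the graph claims no d-separation between X and Y; completeness only promises that parametrizations like the first one exist.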

**Overfitting**: as a general definition, overfitting is a phenomenon that happens when a function is too closely aligned to a limited set of data points. This means that an overfitted model reduces bias at the price of a larger variance, and the result is usually poor statistical power and poor performance on newly seen data. We saw this concept when we talked about parameter estimation. As the slides report, Maximum Likelihood Estimation can perform poorly because it is extremely sensitive to the training data; in fact, it fits the training data as closely as possible, and with little data this easily leads to overfitting. We then saw that another estimate can and should be used to avoid this problem, _i.e._ the Bayesian estimate.
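
A small numerical sketch of the difference between the two estimates follows. The counts and the pseudocount `alpha` are hypothetical, and the Bayesian estimate is taken here to be the posterior mean under a symmetric Dirichlet prior, which is one common choice and an assumption of this example.

```python
def mle(counts):
    """Maximum likelihood estimate: relative frequencies of the observed values."""
    total = sum(counts)
    return [c / total for c in counts]

def bayes_estimate(counts, alpha=1.0):
    """Posterior mean with a symmetric Dirichlet prior (alpha pseudocounts per value)."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Hypothetical data: three observations of the first value, none of the second.
counts = [3, 0]

print(mle(counts))             # [1.0, 0.0]
print(bayes_estimate(counts))  # [0.8, 0.2]
```

The maximum likelihood estimate assigns probability zero to a value simply because it was not seen in three samples, while the pseudocounts keep the estimate away from the extremes.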

**Hidden Markov Model**: there is a lot that can be said on HMMs, but in short we talked about them as temporal probabilistic models, composed of _states_ $X_t$ and _emissions_ $E_t$: states are hidden, while emissions are visible. To move from one state to the next, a _transition_ $X_t \rightarrow X_{t+1}$ is performed according to $P(X_{t+1}|X_t)$; to produce an observation from a state, an _emission_ $X_t \rightarrow E_t$ is performed according to $P(E_t|X_t)$. An HMM can also be seen as a specific Bayesian network, where there are no colliding arcs and the independencies are always the same: an emission is independent from the rest of the chain given its state, a state is independent from every $X_{t-i-1}$ given $X_{t-i}$ and from every $X_{t+i+1}$ given $X_{t+i}$. The inference is, then, easy to perform.
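
As an illustration of why inference is easy, the following is a minimal sketch of filtering with the forward recursion, which alternates a prediction step through $P(X_{t+1}|X_t)$ and an update step with $P(E_t|X_t)$. The two-state model and all its numbers are hypothetical, and the function name is only illustrative.

```python
def forward(prior, transition, emission, observations):
    """Filtering P(X_t | e_1..e_t) for a discrete HMM via the forward recursion.

    prior[i]         = P(X_1 = i)
    transition[i][j] = P(X_{t+1} = j | X_t = i)
    emission[i][e]   = P(E_t = e | X_t = i)
    """
    n = len(prior)
    belief = list(prior)
    for t, e in enumerate(observations):
        if t > 0:
            # Prediction step: push the belief through the transition model.
            belief = [sum(belief[i] * transition[i][j] for i in range(n))
                      for j in range(n)]
        # Update step: weight by the emission probability and normalize.
        belief = [belief[i] * emission[i][e] for i in range(n)]
        z = sum(belief)
        belief = [b / z for b in belief]
    return belief

# Hypothetical two-state HMM with two possible emissions.
prior = [0.5, 0.5]
transition = [[0.7, 0.3], [0.3, 0.7]]
emission = [[0.9, 0.1], [0.2, 0.8]]

filtered = forward(prior, transition, emission, [0, 0])
print(filtered)
```

Each step only touches the previous belief vector, so the cost grows linearly with the length of the observation sequence.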

# Question 2 (5 points)
***

Give a single concrete probability distribution over three binary random variables X, Y, Z, with properties:

(a) The distribution is positive, i.e., all value assignments have probability > 0.

(b) X is statistically dependent on Y.

(c) Z is statistically dependent on Y.

(d) X is statistically independent of Z.

Clearly justify that the properties are satisfied EXACTLY in a single distribution. (Hint: Use a Bayesian Network to define the distribution!)

## Answer

We can start writing $P(X,Y,Z)$ as a factorization using the chain rule: $$P(X,Y,Z) = P(X)P(Y|X)P(Z|X,Y)$$
Now we know that:

- $X$ is statistically dependent on $Y$, so $P(X|Y)\neq P(X)$, $P(Y|X)\neq P(Y)$
- $Z$ is statistically dependent on $Y$, so $P(Z|Y)\neq P(Z)$, $P(Y|Z)\neq P(Y)$
- $X$ is statistically independent of $Z$, so $P(X|Z)=P(X)$, $P(Z|X)=P(Z)$

Therefore, $$P(X)P(Y|X)P(Z|X,Y)=P(X)P(Y|X)P(Z|Y)$$

We can represent this factorization graphically using a DAG, which follows (**note**: may not be visualized inside the notebook, in case check file q2a.jpg)

![](./q2a.jpg)

We then define the conditional probability tables, which follow (**note**: may not be visualized inside the notebook, in case check file q2b.jpg)

![](./q2b.jpg)

We then just need to use the formula we obtain with the chain rule to calculate a concrete probability distribution; the results follow (**note**: may not be visualized inside the notebook, in case check file q2c.jpg).

![](./q2c.jpg)
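
Properties (a)-(d) can be verified numerically for any candidate joint table. The sketch below is a hedged illustration and does not reproduce the tables in the figures: the CPT values are hypothetical, and the joint is built from the collider structure X -> Y <- Z, chosen here as an assumption because X and Z are then marginally independent by construction.

```python
from itertools import product

# Hypothetical CPTs for the collider X -> Y <- Z (not the tables in the figures).
p_x1 = 0.3
p_z1 = 0.6
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.9}

def bern(p1, value):
    """Probability of a binary value given P(value = 1)."""
    return p1 if value == 1 else 1 - p1

# Joint P(X, Y, Z) = P(X) P(Z) P(Y | X, Z), keyed by (x, y, z).
joint = {
    (x, y, z): bern(p_x1, x) * bern(p_z1, z) * bern(p_y1_given_xz[(x, z)], y)
    for x, y, z in product((0, 1), repeat=3)
}

def marginal(joint, keep):
    """Sum out every variable whose index is not in keep (0 = X, 1 = Y, 2 = Z)."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def independent(joint, i, j, tol=1e-12):
    """True if variables i and j are marginally independent in the joint."""
    p_ij = marginal(joint, (i, j))
    p_i = marginal(joint, (i,))
    p_j = marginal(joint, (j,))
    return all(abs(p - p_i[(a,)] * p_j[(b,)]) < tol for (a, b), p in p_ij.items())

print("(a) positive:     ", all(p > 0 for p in joint.values()))
print("(b) X dependent Y:", not independent(joint, 0, 1))
print("(c) Z dependent Y:", not independent(joint, 2, 1))
print("(d) X indep. Z:   ", independent(joint, 0, 2))
```

Replacing `joint` with the eight probabilities of another candidate distribution runs the same four checks on it.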

# Question 3 (8 points)
***

Consider the following BN structure. 

![bn.svg](./dag2.png)

Answer the following queries and questions.

(a) Decide whether the following d-separations hold or not. Justify your answer here in detail. Points only with a solid justification.


(a1) A d-separated from B?

(a2) C d-separated from E given B?

(a3) C d-separated from F given D?

(a4) A d-separated from  B given E, F?

(b) Give another DAG that is in the same Markov equivalence class as the DAG given above.

(c) For the DAG, give an arc and a d-separation relation, such that reversing the arc changes the status of the relation, i.e., from holding to not holding or from not holding to holding.

(d) Suppose two nodes X and Y are not adjacent in the DAG structure of a Bayesian network, i.e., there is no arc between them. Show that there is a set S, such that X and Y are d-separated given S. (Hint: Assume without loss of generality that X is before Y in a causal order of the DAG. Which nodes need to be conditioned on to make X and Y d-separated?)

## Answer

### Task a

_(I cite from my second exercise set)_
In order to verify whether a d-separation holds or not we can use the valve system, also described in Darwiche's book. We have that:

Let $\bold{X}$, $\bold{Y}$ and $\bold{Z}$ be disjoint sets of nodes in a DAG _G_. We say that $\bold{X}$ and $\bold{Y}$ are _d-separated_ by $\bold{Z}$ if and only if every path between a node in $\bold{X}$ and a node in $\bold{Y}$ is blocked by $\bold{Z}$. A path is blocked by $\bold{Z}$ if and only if at least one valve on the path is closed given $\bold{Z}$.

We can have three different valves, that are closed under certain conditions:
- **Sequential valve**: -> W ->. This is closed if and only if the variable W appears in $\bold{Z}$;
- **Divergent valve**: <- W ->. This is closed if and only if the variable W appears in $\bold{Z}$;
- **Convergent valve**: -> W <-. This is closed if and only if neither variable W nor any of its descendants appear in $\bold{Z}$.
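
The three valve rules can also be checked mechanically. The sketch below enumerates the undirected paths between two nodes and tests every valve on each path. The edge list is an assumption, reconstructed from the arrows named in this answer because the figure is not reproduced here, and the function names are illustrative.

```python
# Edges reconstructed from the arrows named in this answer (an assumption).
EDGES = [("A", "C"), ("B", "C"), ("A", "D"), ("C", "D"),
         ("D", "F"), ("B", "E"), ("E", "F")]

def descendants(node, edges):
    """All nodes reachable from node by following arcs forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def simple_paths(start, goal, edges):
    """All undirected simple paths between start and goal."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    def extend(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                yield from extend(path + [nxt])

    yield from extend([start])

def valve_closed(prev, w, nxt, z, edges):
    """Apply the valve rules to the middle node w of the triple prev - w - nxt."""
    edge_set = set(edges)
    convergent = (prev, w) in edge_set and (nxt, w) in edge_set
    if convergent:
        # Closed iff neither w nor any of its descendants is in z.
        return w not in z and not (descendants(w, edges) & z)
    # Sequential or divergent valve: closed iff w is in z.
    return w in z

def d_separated(x, y, z, edges=EDGES):
    """True if every path between x and y has at least one closed valve given z."""
    z = set(z)
    for path in simple_paths(x, y, edges):
        triples = zip(path, path[1:], path[2:])
        if not any(valve_closed(a, w, b, z, edges) for a, w, b in triples):
            return False  # this path has every valve open
    return True

print(d_separated("C", "E", {"B"}))  # True
print(d_separated("C", "F", {"D"}))  # False
```

Path enumeration is exponential in general, which is acceptable for a six-node graph; larger networks would call for a reachability-based test instead.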

We then have that:

- $A\mathrel{⫫}_{G}B|C,D,E,F$ **does not hold** because A->C<-B is open (there exists an active path)
- $C\mathrel{⫫}_{G}E|B$ **holds** because C<-B->E is closed, and even if C->D->F is open, D->F<-E is closed
- $C\mathrel{⫫}_{G}F|D$ **does not hold** because C<-B->E and B->E->F are open
- $A\mathrel{⫫}_{G}B|E,F$ **does not hold** because A->C<-B is open: C is a convergent valve and its descendant F (reached through C->D->F) is in the conditioning set

### Task b

Two DAGs are Markov equivalent if and only if they have the same skeleton (the structure without edge directions) and the same set of unshielded v-structures (X → Y ← Z with no edge between X and Z, also called immoralities). We can then reverse a "secure" arc, i.e. one whose reversal neither creates nor destroys a v-structure and does not introduce a cycle, in order to have a Markov equivalent DAG, for instance B->E. The new DAG follows (**note**: may not be visualized inside the notebook, in case check file q3b.jpg).

![](./q3b.jpg)

### Task c

To solve this task we can just choose one d-separation from Task a, say $C\mathrel{⫫}_{G}E|B$, which holds. If we reverse E->F to F->E we are changing the valve D->F<-E from convergent to sequential (D->F->E). The d-separation then changes from holding to not holding, because the path C->D->F->E is open given B.

### Task d

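The set S could be the Markov boundary (i.e. the minimal Markov blanket): the Markov boundary of a variable $X$ is a minimal set of variables $B$ with $X \notin B$ such that $$X \mathrel{⫫} (\mathcal{X}  \setminus B) \setminus \{X\} |B$$
Since the Markov boundary always exists (source https://en.wikipedia.org/wiki/Markov_blanket, but it's also intuitive), this gives a suitable S whenever Y is not itself in the boundary of X. Y can be in the boundary without being adjacent to X, namely when it is another parent of a child of X, so to cover this case as well we follow the hint. With X before Y in a causal order, X is a non-descendant of Y and, since the two nodes are not adjacent, X is not a parent of Y. Take S to be the set of parents of Y. A path that reaches Y through a parent is blocked at that parent, which is a sequential or divergent valve in S. A path that leaves Y through a child must contain a convergent valve among the descendants of Y, because X is not a descendant of Y; the first such valve is closed, because neither it nor any of its descendants is in S. Therefore there always exists a set S such that X and Y are d-separated given S.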

# Question 4 (5 points)
***

Suppose W1 and W2 are words that appear commonly in emails. Suppose you have the following dataset:

SPAM:

-400 emails with W1 and W2

-300 emails with W1 but not W2

-200 emails with W2 but not W1

-100 emails with neither W1 nor W2

NOT SPAM:

-100 emails with W1 and W2

-200 emails with W1 but not W2

-300 emails with W2 but not W1

-400 emails with neither W1 nor W2

(a) Formulate the Naive Bayes Classifier for classifying mail into SPAM and not SPAM with the occurrence of words W1 and W2 as the features. What is the DAG structure? Which independence relation(s) are we assuming here?

(b) Learn the parameters using Maximum likelihood estimation.

(c) Using these parameters, what is the probability of the next email being spam, if both words W1 and W2 appear in it?

In [6]:
# (using modified code from my notebook)

import pandas as pd

# count occurrences. Just do it manually
n_spam, n_ham = 1000, 1000  # total number of emails in each class
df = pd.DataFrame({'word': ['W1', 'W2'],
                   'spam': [700, 600],   # 400+300 and 400+200 spam emails contain W1 and W2
                   'ham':  [300, 400]})  # 100+200 and 100+300 ham emails contain W1 and W2

# maximum likelihood: P(W=1|class) = count(W=1, class) / count(class)
df['spam'] = df['spam'] / n_spam
df['ham'] = df['ham'] / n_ham
prior_spam = n_spam / (n_spam + n_ham)  # P(C=spam) = 0.5
print('Task b')
print(df)

# P(C=spam|W1W2)
def calculatePosterior(data, df):
    # unnormalized P(C) * prod_i P(W_i=1|C) for each class
    spam_given_d = prior_spam
    ham_given_d = 1 - prior_spam
    for word in data.split():
        row = df[df['word'] == word]
        spam_given_d *= row['spam'].values[0]
        ham_given_d *= row['ham'].values[0]
    # normalize over the two classes
    return spam_given_d / (spam_given_d + ham_given_d)

post = calculatePosterior('W1 W2', df)
print('Task c')
print('P(C=spam|W1W2) =', round(post, 4))
print('P(C=ham|W1W2) =', round(1 - post, 4))

Task b
  word  spam  ham
0   W1   0.7  0.3
1   W2   0.6  0.4
Task c
P(C=spam|W1W2) = 0.7778
P(C=ham|W1W2) = 0.2222


# Question 5 (6 points)
***

Consider the following Bayesian network modelling a student's chances on arriving to a school on time.

![bn_ontime.png](bn_ontime.png)

(a) What is the probability of all random variables getting the value yes?

(b) Calculate the probability of being on time by factor elimination. You may use any version of the algorithm but you must calculate it yourself, and not use packages. Return all calculations.

(c) Suppose you intervene on the system, forcing the alarm on. Calculate the probability P("On Time"=Yes|do("Alarm On"=Yes)), that is the probability of being on time when alarm is set on by intervention.

## Answer

### Task a

We can represent the BN as follows:

- Say A is the random variable associated with "Alarm on?"
- Say B is the random variable associated with "Bus late"
- Say S is the random variable associated with "Over-slept?"
- Say O is the random variable associated with "On time?"

Using the chain rule we have the following factorization: $$P(A,B,S,O) = P(A)P(B|A)P(S|A,B)P(O|A,B,S)$$

where $P(B|A)=P(B)$ since A and B are independent, $P(S|A,B)=P(S|A)$ since S and B are independent, $P(O|A,B,S)=P(O|B,S)$ since O and A are independent given S. Therefore, we have $$P(A,B,S,O)=P(A)P(B)P(S|A)P(O|B,S)$$

If we want all the random variables to be equal to yes, we can use the above formula to compute the probability. We then have $0.9*0.2*0.1*0.1=0.0018$.
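
The product can be checked with a short script. The four numbers are the ones used above; attaching them to specific table entries in the order of the factorization is an assumption, since the figure is not reproduced here, and the variable names are illustrative.

```python
import math

# Values taken from the product above, assumed to follow the order of the
# factorization P(A) P(B) P(S|A) P(O|B,S) with every variable set to yes.
p_a_yes = 0.9           # P(A = yes)
p_b_yes = 0.2           # P(B = yes)
p_s_yes_given_a = 0.1   # P(S = yes | A = yes)
p_o_yes_given_bs = 0.1  # P(O = yes | B = yes, S = yes)

p_all_yes = math.prod([p_a_yes, p_b_yes, p_s_yes_given_a, p_o_yes_given_bs])
print(round(p_all_yes, 6))  # 0.0018
```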