# Probabilistic Models – Spring 2022
## Exercise Session 1
Return by Jan 25th 9.00 through Moodle. Session on Jan 25th 14.15.

### Instructions
Make sure the notebook produces correct results when run sequentially starting from the first cell. You can ensure this by clearing all outputs (`Edit > Clear All Outputs`), running all cells (`Run > Run All Cells`), and finally correcting any errors.

To get points:
1. Submit your answers to the automatically checked Moodle test. 
 - You have 5 tries on the test: the highest obtained score will be taken into account.
 - For numerical questions the tolerance is +/- 0.01.
2. Submit this notebook containing your derivations to Moodle.

## Exercise 1
***

Consider the following joint distribution $P$:

In [1]:
!cat data/1.csv

A	B	C	P
True	True	True	0.075
True	True	False	0.05
True	False	True	0.225
True	False	False	0.15
False	True	True	0.025
False	True	False	0.1
False	False	True	0.075
False	False	False	0.3


(a) What is $P(A=T, C=T)$?

Update the distribution by conditioning on the event $C=T$, that is, construct the conditional distribution $P( \cdot |C=T)$.

(b) What is $P(A=T|C=T)$? $P(B=T|C= T)$?

(c) Is the event $A=T$ independent of the event $C=T$? Is $B=T$ independent of $C=T$?

### Instructions

If you're using Python you can start by reading the provided file into a [Pandas](https://pandas.pydata.org/) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) or similarly to a [data.frame](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame) in R. To check for equality between two real numbers do not use `x == y`, as it gives false negatives on limited precision floats. Rather, use for example [`math.isclose(x, y)`](https://docs.python.org/3/library/math.html#math.isclose) in Python or [`near(x, y)`](https://dplyr.tidyverse.org/reference/near.html) in R.
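
As a small illustration of the pitfall described above, the following sketch compares the two approaches on a sum that is not exactly representable in binary floating point. The values are chosen for illustration only and are not part of the exercise data.

```python
import math

x = 0.1 + 0.2
y = 0.3

# Exact comparison fails because 0.1 + 0.2 is stored as 0.30000000000000004
exact = (x == y)

# math.isclose compares within a relative tolerance (default rel_tol=1e-09)
close = math.isclose(x, y)

print(exact, close)  # False True
```

Note that with the default `abs_tol=0.0`, `math.isclose` never reports a nonzero value as close to exactly zero, so comparisons against zero need an explicit `abs_tol`.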


In [2]:
import pandas as pd
import math as math

df = pd.read_csv('./data/1.csv', sep='\t')

# Task a. P(A=T,C=T) = sum of rows matching the condition
print('Task a')
prob_a = df.loc[(df['A'] == True) & (df['C'] == True)]['P'].sum()
print('P(A=T,C=T) =', prob_a)

# Task b. P(.|C=T) = P(., C=T)/P(C=T)
print('\nTask b')
pctrue = df.loc[df['C'] == True]['P'].sum()
prob_ba = prob_a/pctrue
prob_bb = (df.loc[(df['B'] == True) & (df['C'] == True)]['P'].sum())/pctrue
print('P(A=T|C=T) =', prob_ba, '\nP(B=T|C=T) =', prob_bb)

# Task c. Events X and Y are independent if P(X|Y) = P(X) (equivalently, P(Y|X) = P(Y))
print('\nTask c')
patrue = df.loc[df['A'] == True]['P'].sum()
pbtrue = df.loc[df['B'] == True]['P'].sum()
ind_c1 = math.isclose(prob_ba,patrue)
ind_c2 = math.isclose(prob_bb,pbtrue)
print('A=T,C=T', ind_c1)
print('B=T,C=T', ind_c2)

Task a
P(A=T,C=T) = 0.3

Task b
P(A=T|C=T) = 0.7499999999999999 
P(B=T|C=T) = 0.25

Task c
A=T,C=T False
B=T,C=T True


## Exercise 2
***

Consider again the joint distribution $P$ from Exercise 1.

(a) What is $P(A=T \vee B=T)$?


Update the distribution by conditioning on the event $(A=T \vee B=T)$, that is, construct the conditional distribution $P( \cdot |A=T \vee B=T)$.

(b) What is $P(A=T|A=T \vee B=T)$? $P(B=T|A=T \vee B=T)$?

(c) Is the event $B=T$ conditionally independent of $C=T$ given the event $(A=T \vee B=T)$?

In [3]:
# Task a. P(A=T or B=T) is the total probability of the rows where A=T or B=T
print('Task a')
updated = df.loc[(df['A'] == True) | (df['B'] == True)]
prob_a = updated['P'].sum()
print('P(A=T or B=T) =', prob_a)

# Task b. Same idea as in Exercise 1, but with the different fixed event
print('\nTask b')
patrue = updated.loc[updated['A'] == True]['P'].sum()
pbtrue = updated.loc[updated['B'] == True]['P'].sum()
prob_ba = patrue/prob_a
prob_bb = pbtrue/prob_a
print('P(A=T|A=T or B=T) =', prob_ba, '\nP(B=T|A=T or B=T) =', prob_bb)

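# Task c. B=T and C=T are conditionally independent given E = (A=T or B=T)
# if P(B=T, C=T | E) = P(B=T | E) * P(C=T | E)
print('\nTask c')
pbc_e = updated.loc[(updated['B'] == True) & (updated['C'] == True)]['P'].sum()/prob_a
pc_e = updated.loc[updated['C'] == True]['P'].sum()/prob_a
ind_c1 = math.isclose(pbc_e, prob_bb*pc_e)
print('Is B=T independent of C=T given (A=T or B=T)?', ind_c1)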


Task a
P(A=T or B=T) = 0.625

Task b
P(A=T|A=T or B=T) = 0.8 
P(B=T|A=T or B=T) = 0.4

Task c
Is B=T independent of C=T given (A=T or B=T)? False
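
For reference, the arithmetic behind Task c can be written out from the table in Exercise 1. Here $E$ is shorthand, introduced only for this sketch, for the event $(A=T \vee B=T)$:

$$
\begin{aligned}
P(E) &= 0.625 \\
P(B=T \mid E) &= \frac{0.25}{0.625} = 0.4 \\
P(C=T \mid E) &= \frac{0.075 + 0.225 + 0.025}{0.625} = 0.52 \\
P(B=T, C=T \mid E) &= \frac{0.075 + 0.025}{0.625} = 0.16 \\
P(B=T \mid E)\,P(C=T \mid E) &= 0.4 \cdot 0.52 = 0.208 \neq 0.16
\end{aligned}
$$

Since the product differs from the joint conditional probability, the two events are not conditionally independent given $E$.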


## Exercise 3
***

Consider the following joint distribution.

In [4]:
!cat data/3.csv

A	B	C	P
True	True	True	0.27
True	True	False	0.18
True	False	True	0.03
True	False	False	0.02
False	True	True	0.02
False	True	False	0.03
False	False	True	0.18
False	False	False	0.27


For each pair of variables, state whether they are independent. State also whether they are independent given the third variable. Justify your answers.

In [5]:
import pandas as pd
import math as math
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('./data/3.csv', sep='\t')

# X and Y are independent if P(X=x, Y=y) = P(X=x)P(Y=y) for all x,y
def areIndependent(x, y, df):
    for i in [True, False]:
        for j in [True, False]:
            px = df.loc[df[x] == i]['P'].sum()
            py = df.loc[df[y] == j]['P'].sum()
            pjoint = df.loc[(df[x] == i) & (df[y] == j)]['P'].sum()
            # A single mismatching configuration is enough to rule out independence
            if not math.isclose(pjoint, px*py):
                return False
    return True

# A and B
print('A','B',areIndependent('A', 'B', df))
print('A','C',areIndependent('A', 'C', df))
print('B','C',areIndependent('B', 'C', df))

# X and Y are independent given Z if P(X=x, Y=y | Z=z) = P(X=x | Z=z)P(Y=y | Z=z) for all x,y,z
def areCondIndependent(x, y, z, df):
    for k in [True, False]:
        # Condition on Z=k: keep the matching rows and renormalize them
        updated = df.loc[df[z] == k].copy()
        updated['P'] = updated['P']/updated['P'].sum()
        if not areIndependent(x, y, updated):
            return False
    return True

# A and B, given C
print('A,B,C', areCondIndependent('A', 'B', 'C', df))

# A and C, given B
print('A,C,B', areCondIndependent('A', 'C', 'B', df))

# B and C, given A
print('B,C,A', areCondIndependent('B', 'C', 'A', df))

A B False
A C False
B C False
A,B,C False
A,C,B False
B,C,A True
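
As a sketch of the justification the exercise asks for, the relevant quantities can be read directly off the table. All three marginals equal $0.5$, so independence of a pair would require every pairwise joint probability to equal $0.25$:

$$P(A=T, B=T) = 0.45, \qquad P(A=T, C=T) = 0.30, \qquad P(B=T, C=T) = 0.29$$

None of the pairs is therefore independent. For the conditional statements, a single mismatching configuration is enough to rule independence out:

$$P(A=T, B=T \mid C=T) = 0.54 \neq 0.6 \cdot 0.58, \qquad P(A=T, C=T \mid B=T) = 0.54 \neq 0.9 \cdot 0.58$$

For $B$ and $C$ given $A$ the factorization holds for both values of $A$. For binary variables, checking one cell per value of the conditioning variable suffices, since the remaining cells follow from the marginals:

$$P(B=T, C=T \mid A=T) = 0.54 = 0.9 \cdot 0.6, \qquad P(B=T, C=T \mid A=F) = 0.04 = 0.1 \cdot 0.4$$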


## Exercise 4
***

We have three urns labeled 1, 2 and 3. The urns contain, respectively, three white and three black balls, four white and two black balls, and one white and two black balls. An experiment consists of selecting an urn at random then drawing a ball from it.

Define the joint probability distribution over $U$ and $C$, where $U$ is the chosen urn with values 1, 2 and 3; and $C$ is the color of the ball, with values black and white.

(a) What is the probability of drawing a black ball?

(b) What is the conditional probability that urn 2 was selected given that a black ball was drawn?

(c) What is the probability of selecting urn 1 or a white ball?

In [6]:
# Task a. Each urn is selected with probability 1/3; sum P(U=u)P(C=black|U=u) over the urns
prob_a = 1/3*(1/2+1/3+2/3)
print('Task a:', prob_a)

# Task b. P(U=2|C=black) = P(U=2,C=black)/P(C=black) by the definition of conditional probability
prob_b = (1/3*1/3)/prob_a
print('Task b:', prob_b)

# Task c. P(U=1 or C=white) = P(U=1) + P(C=white) - P(U=1,C=white) by inclusion-exclusion
prob_c = 1/3+1/3*(1/2+2/3+1/3)-1/3*1/2
print('Task c:', prob_c)

Task a: 0.5
Task b: 0.2222222222222222
Task c: 0.6666666666666666
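
The exercise also asks for the joint distribution over $U$ and $C$ itself. A minimal sketch that builds it explicitly with exact fractions and re-derives the three answers is shown below. It assumes, as the solution above does, that "at random" means each urn is chosen with probability 1/3; the variable names are illustrative.

```python
from fractions import Fraction

# Urn contents as (white, black) counts, taken from the exercise text
urns = {1: (3, 3), 2: (4, 2), 3: (1, 2)}

# Joint distribution P(U=u, C=c), assuming each urn is chosen with probability 1/3
joint = {}
for u, (white, black) in urns.items():
    total = white + black
    joint[(u, 'white')] = Fraction(1, 3) * Fraction(white, total)
    joint[(u, 'black')] = Fraction(1, 3) * Fraction(black, total)

# (a) P(C=black): sum the joint over all urns
p_black = sum(p for (u, c), p in joint.items() if c == 'black')

# (b) P(U=2 | C=black) = P(U=2, C=black) / P(C=black)
p_u2_given_black = joint[(2, 'black')] / p_black

# (c) P(U=1 or C=white): sum the entries where either condition holds
p_u1_or_white = sum(p for (u, c), p in joint.items() if u == 1 or c == 'white')

print(p_black, p_u2_given_black, p_u1_or_white)  # 1/2 2/9 2/3
```

Working with `Fraction` avoids rounding, so the results can be compared directly with the decimal values printed above.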


## Exercise 5
***

Suppose Ed keeps track of the forecasts of the Finnish Meteorological Institute (FIM) and believes they are correct with 80% probability, and Mary believes the forecasts of Foreca are correct with 70% probability. Then suppose FIM predicts rain and Foreca does not.

Consider four sets of bets:

> (1) Bookie offers to sell Ed a bet for 85 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 60 euros returning Mary 100 euros if it does not rain.
> 
> (2) Bookie offers to sell Ed a bet for 79 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 69 euros returning Mary 100 euros if it does not rain.
> 
> (3) Bookie offers to sell Ed a bet for 73 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 73 euros returning Mary 100 euros if it does not rain.
> 
> (4) Bookie offers to sell Ed a bet for 55 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 34 euros returning Mary 100 euros if it does not rain.

(a) Which set of bets is a Dutch book?

(b) How much money is the bookie guaranteed to make in the Dutch book scenario?

Provide some calculations justifying your answers in the notebook.


### Answer (in short: Set 2)

#### Set 1

This is not a Dutch book. Mary would accept her bet, since its price (60 euros) is below her expected return ($0.7 \cdot 100 = 70$ euros). Ed, however, would reject his, since its price (85 euros) exceeds his expected return ($0.8 \cdot 100 = 80$ euros).

#### Set 2

This is a Dutch book. Both agents accept their bets, since each price is below the expected return under the buyer's own belief ($79 \le 80$ for Ed, $69 \le 70$ for Mary). Exactly one of the two bets pays out, so whether it rains or not the bookie collects $79+69=148$ euros and pays out $100$ euros, for a guaranteed profit of $48$ euros.

#### Set 3

This is not a Dutch book, for the same reason as Set 1 with the roles swapped: Ed would accept his bet ($73 \le 80$), but Mary would reject hers ($73 > 70$).

#### Set 4

This is not a Dutch book. Both agents would accept their bets, but the bookie would lose money: he or she collects $55+34=89$ euros and has to pay $100$ euros to the winner, a loss of $11$ euros.
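
The reasoning above can be sketched as a short calculation. The acceptance rule used here (an agent buys a bet only if its price does not exceed the expected return under his or her own belief) is an assumption of this sketch, and the variable names are illustrative.

```python
# Expected returns in euros under each agent's own belief
fair_price_ed = 80    # 0.8 * 100: Ed believes it rains with probability 0.8
fair_price_mary = 70  # 0.7 * 100: Mary believes it stays dry with probability 0.7
payout = 100          # euros returned to the winner

# (price for Ed, price for Mary) in each set of bets
bet_sets = {1: (85, 60), 2: (79, 69), 3: (73, 73), 4: (55, 34)}

results = {}
for label, (price_ed, price_mary) in bet_sets.items():
    # Assumption: a bet is bought only if its price does not exceed the expected return
    both_accept = price_ed <= fair_price_ed and price_mary <= fair_price_mary
    # Exactly one of the two bets pays out, so the profit is the same in both outcomes
    profit = price_ed + price_mary - payout
    results[label] = (both_accept and profit > 0, profit)

for label, (is_dutch_book, profit) in results.items():
    print(label, is_dutch_book, profit)
```

Under this rule only Set 2 qualifies: both bets are accepted and the bookie's profit of 48 euros does not depend on the weather.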

## Exercise 6
***



The file `data/6.csv` contains 200 data points sampled from the distribution defined in exercise 3, with `True` mapped to 1 and `False` to 0.

For each pair of variables, conduct the G²-test for statistical independence. Also conduct the test for each pair of variables given the third variable. That is, repeat the task specified in exercise 3, but this time based on data sampled from the distribution instead of direct access to the distribution. For each conducted test report the p-value obtained when the null hypothesis is that the independence holds.

You can also try sampling data from the distribution yourself to see how the obtained p-values behave, but for the Moodle return use the given data set.
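
One possible way to draw such a sample yourself, using only the standard library, is sketched below. The seed and the variable names are arbitrary choices made for this illustration; the probabilities are those of the table in Exercise 3.

```python
import random
from collections import Counter

# Joint distribution from Exercise 3, with True mapped to 1 and False to 0
outcomes = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 0, 0),
            (0, 1, 1), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
probs = [0.27, 0.18, 0.03, 0.02, 0.02, 0.03, 0.18, 0.27]

rng = random.Random(0)  # arbitrary seed, fixed only for reproducibility
samples = rng.choices(outcomes, weights=probs, k=200)

# Count how often each (A, B, C) configuration was drawn
counts = Counter(samples)
for outcome in outcomes:
    print(outcome, counts[outcome])
```

Repeating the draw with different seeds gives a feel for how much the counts, and therefore the p-values, vary between samples of this size.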

### G²-test

Under the null hypothesis $H_0: X \mathrel{⫫} Y \mid C$ we have that

$$\#_{e}(X=x \wedge Y=y \wedge C=c) = \frac{\# (X=x \wedge C=c) \cdot \# (Y=y \wedge C=c)}{\# (C=c)}$$

where $\#$ denotes the number of samples satisfying the condition that follows it, and $\#_{e}$ is the expected number of samples under $H_{0}$.

Then examine the following quantity:

$$G^{2} = 2 \sum \# \log \frac{\#}{\#_{e}} $$

where the summation is over the different configurations of the variables (i.e., different values the variables can assume).

Under $H_0$ the quantity $G^2$ is distributed as [$\chi^2$](https://en.wikipedia.org/wiki/Chi-square_distribution) with $(m_X - 1)(m_Y - 1)m_C$ degrees of freedom, where $m_X,m_Y,m_C$ are the number of possible configurations for $X$, $Y$ and $C$, respectively.
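
To make the formulas concrete, here is a minimal sketch of the unconditional test on a hypothetical 2×2 table of counts (the numbers are invented for illustration and are not taken from `data/6.csv`). It relies on the fact that for one degree of freedom the $\chi^2$ survival function has the closed form $\operatorname{erfc}(\sqrt{x/2})$; for other degrees of freedom a library routine such as the SciPy one would be the more natural choice.

```python
import math

# Hypothetical counts #(X=x, Y=y) for two binary variables (not taken from data/6.csv)
counts = {(0, 0): 30, (0, 1): 20, (1, 0): 10, (1, 1): 40}
n = sum(counts.values())

# Marginal counts #(X=x) and #(Y=y)
count_x = {x: sum(c for (xi, _), c in counts.items() if xi == x) for x in (0, 1)}
count_y = {y: sum(c for (_, yi), c in counts.items() if yi == y) for y in (0, 1)}

g2 = 0.0
for (x, y), observed in counts.items():
    expected = count_x[x] * count_y[y] / n  # expected count under H0
    if observed > 0:  # an empty cell contributes 0 to the sum
        g2 += 2 * observed * math.log(observed / expected)

# With (2-1)*(2-1) = 1 degree of freedom, the chi-square survival function
# equals erfc(sqrt(x/2)), so no external library is needed in this special case
p_value = math.erfc(math.sqrt(g2 / 2))
print(g2, p_value)
```

A small p-value speaks against the null hypothesis of independence. The conditional version repeats the same sum within each value of the conditioning variable and uses the degrees of freedom given above.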

### Instructions

You can use any libraries you find for the task, but it probably makes sense to implement the $G^2$ computation yourself, and then compute the p-value for example using [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html) (if you're using Python) or the built-in [chisquare functions](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html) in R.

In [7]:
import pandas as pd
from scipy import stats as st
import numpy as np

df = pd.read_csv('./data/6.csv')

def calculate_chi(x, y, z=None):
    # Returns the p-value of the G² test of H0: x is independent of y (given z, if provided)
    g2 = 0
    if z is None: # two variables: marginal independence
        for i in [0,1]:
            for j in [0,1]:
                sat = df.loc[(df[x]==i) & (df[y]==j)].shape[0]
                h0 = df.loc[(df[x]==i)].shape[0]*df.loc[(df[y]==j)].shape[0]/df.shape[0]
                if sat > 0: # an empty cell contributes 0 to the sum
                    g2 += 2*sat*np.log(sat/h0)
        dof = 1 # (2-1)*(2-1)
    else: # three variables: conditional independence given z
        for i in [0,1]:
            for j in [0,1]:
                for k in [0,1]:
                    sat = df.loc[(df[x]==i) & (df[y]==j) & (df[z]==k)].shape[0]
                    h0 = df.loc[(df[x]==i) & (df[z]==k)].shape[0]*df.loc[(df[y]==j) & (df[z]==k)].shape[0]/df.loc[(df[z]==k)].shape[0]
                    if sat > 0: # an empty cell contributes 0 to the sum
                        g2 += 2*sat*np.log(sat/h0)
        dof = 2 # (2-1)*(2-1)*2
    # p-value = P(chi2 >= G²) under H0
    return 1-st.chi2.cdf(g2, dof)

print('A,B', calculate_chi('A','B'))
print('A,C', calculate_chi('A','C'))
print('B,C', calculate_chi('B','C'))
print('A,B,C', calculate_chi('A','B','C'))
print('A,C,B', calculate_chi('A','C','B'))
print('B,C,A', calculate_chi('B','C','A'))

A,B 0.0
A,C 2.487762227554313e-06
B,C 0.0010939032025529816
A,B,C 0.0
A,C,B 0.002097928936509952
B,C,A 0.6639242354843888
