# Probabilistic Models – Spring 2021
## Exercise Session 1
Jan 27th 16.15.

Carmen Díez.

### Instructions
Make sure the notebook produces correct results when run sequentially starting from the first cell. You can ensure this by clearing all outputs (`Edit > Clear All Outputs`), running all cells (`Run > Run All Cells`), and finally correcting any errors.

To get points:
1. Submit your answers to the automatically checked Moodle test. 
 - You have 5 tries on the test: the highest obtained score will be taken into account.
 - For numerical questions the tolerance is +/- 0.01.
2. Submit this notebook containing your derivations to Moodle.

## Exercise 1
***

Consider the following joint distribution $P$:

In [1]:
!cat data/1.csv

A	B	C	P
True	True	True	0.075
True	True	False	0.05
True	False	True	0.225
True	False	False	0.15
False	True	True	0.025
False	True	False	0.1
False	False	True	0.075
False	False	False	0.3


(a) What is $P(A=T, C=T)$?

Update the distribution by conditioning on the event $C=T$, that is, construct the conditional distribution $P( \cdot |C=T)$.

(b) What is $P(A=T|C=T)$? $P(B=T|C= T)$?

(c) Is the event $A=T$ independent of the event $C=T$? Is $B=T$ independent of $C=T$?

### Instructions

If you're using Python you can start by reading the provided file into a [Pandas](https://pandas.pydata.org/) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), or similarly into a [data.frame](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame) in R. To check for equality between two real numbers do not use `x == y`, as it gives false negatives on limited-precision floats. Rather, use for example [`math.isclose(x, y)`](https://docs.python.org/3/library/math.html#math.isclose) in Python or [`near(x, y)`](https://dplyr.tidyverse.org/reference/near.html) in R.
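
As a small illustration of why exact comparison is unreliable, the sketch below compares `0.1 + 0.2` with `0.3`. The variable names are only illustrative.

```python
import math

x = 0.1 + 0.2   # stored as 0.30000000000000004 in binary floating point
y = 0.3

exact_equal = (x == y)               # exact comparison fails
close_enough = math.isclose(x, y)    # tolerance-based comparison succeeds

print(exact_equal, close_enough)
```

`math.isclose` uses a relative tolerance by default; an absolute tolerance can be passed via `abs_tol` when comparing against values near zero.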


In [2]:
import pandas as pd
import numpy as np
import math

In [3]:
data = "data/1.csv"
df = pd.read_csv(data, sep='\t')
df

Unnamed: 0,A,B,C,P
0,True,True,True,0.075
1,True,True,False,0.05
2,True,False,True,0.225
3,True,False,False,0.15
4,False,True,True,0.025
5,False,True,False,0.1
6,False,False,True,0.075
7,False,False,False,0.3


In [4]:
def prob(var, df):
    return df.loc[df[var] == True]['P'].sum()

def intersection(var1, var2, df):
    return df.loc[(df[var1] == True) & (df[var2] == True)]['P'].sum()

def union(var1, var2, df):
    return df.loc[(df[var1] == True) | (df[var2] == True)]['P'].sum()

def independence2(var1, var2, df):
    return math.isclose(prob(var1, df)*prob(var2, df), intersection(var1, var2, df))

In [5]:
#a P(A=T, C=T)
pAIC = intersection('A', 'C', df)
pAIC

0.3

Conditional probability formula: $P( \cdot |C=T) = \frac{P(\cdot ,C=T)}{P(C=T)}$ 

In [6]:
#b P(A=T|C=T) and P(B=T|C=T)
pBIC = intersection('B', 'C', df)
pC = prob('C', df)
pACondC = pAIC/pC
pBCondC = pBIC/pC

In [7]:
pACondC

0.7499999999999999

In [8]:
pBCondC

0.25
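
The exercise also asks for the whole conditional distribution $P(\cdot | C=T)$, not only single values. A minimal sketch of one way to build it: keep the rows where the event holds and renormalise. The table is re-entered inline so the block runs on its own, and `condition` is a hypothetical helper name, not part of the notebook above.

```python
import math
import pandas as pd

# Joint distribution from Exercise 1, re-entered so the sketch is self-contained.
rows = [
    (True, True, True, 0.075),
    (True, True, False, 0.05),
    (True, False, True, 0.225),
    (True, False, False, 0.15),
    (False, True, True, 0.025),
    (False, True, False, 0.1),
    (False, False, True, 0.075),
    (False, False, False, 0.3),
]
joint = pd.DataFrame(rows, columns=['A', 'B', 'C', 'P'])

def condition(df, mask):
    """Keep the rows where the event (boolean mask) holds and renormalise P."""
    kept = df.loc[mask].copy()
    kept['P'] = kept['P'] / kept['P'].sum()
    return kept

given_c = condition(joint, joint['C'])               # P(. | C=T)
p_a_given_c = given_c.loc[given_c['A'], 'P'].sum()   # P(A=T | C=T)
p_b_given_c = given_c.loc[given_c['B'], 'P'].sum()   # P(B=T | C=T)
print(given_c)
```

The same helper can condition on a compound event, for example `condition(joint, joint['A'] | joint['B'])` for the event $(A=T \vee B=T)$.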

$\cdot$ and $C$ are independent if $P(\cdot | C) = P(\cdot)$ (which requires $P(C) > 0$) or, equivalently, if $P(\cdot, C) = P(\cdot)P(C)$.

In [9]:
#c A=T independent C=T? B=T and C=T?
pA = prob('A', df)
pB = prob('B', df)

In [10]:
math.isclose(pACondC, pA) #dependent

False

In [11]:
independence2('A','C', df)

False

In [12]:
math.isclose(pBCondC, pB) #independent

True

In [13]:
independence2('B','C', df)

True

## Exercise 2
***

Consider again the joint distribution $P$ from Exercise 1.

(a) What is $P(A=T \vee B=T)$?


Update the distribution by conditioning on the event $(A=T \vee B=T)$, that is, construct the conditional distribution $P( \cdot |A=T \vee B=T)$.

(b) What is $P(A=T|A=T \vee B=T)$? $P(B=T|A=T \vee B=T)$?

(c) Is the event $B=T$ conditionally independent of $C=T$ given the event $(A=T \vee B=T)$?

In [14]:
#a P(A=T U B=T)
pAUBFrame = df.loc[(df['A'] == True) | (df['B'] == True)]
pAUB = union('A', 'B', df)
pAUB

0.625

Conditional probability formula: $P( \cdot | A=T \vee B=T) = \frac{P(\cdot , A=T \vee B=T)}{P(A=T \vee B=T)}$ 

In [15]:
pAIAUB = prob('A', pAUBFrame)
pBIAUB = prob('B', pAUBFrame)
pACondAUB = pAIAUB/pAUB
pBCondAUB = pBIAUB/pAUB

In [16]:
pACondAUB

0.8

In [17]:
pBCondAUB

0.4

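$B=T$ and $C=T$ are conditionally independent given $(A=T \vee B=T)$ if $P(B=T, C=T | A=T \vee B=T) = P(B=T | A=T \vee B=T)P(C=T | A=T \vee B=T)$ (this requires $P(A=T \vee B=T) > 0$).

In [18]:
# B=T independent of C=T given (A=T U B=T)?
pCCondAUB = prob('C', pAUBFrame)/pAUB
pBICCondAUB = intersection('B', 'C', pAUBFrame)/pAUB
math.isclose(pBICCondAUB, pBCondAUB*pCCondAUB) #dependent: 0.16 vs 0.4*0.52 = 0.208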

False

## Exercise 3
***

Consider the following joint distribution.

In [19]:
!cat data/3.csv

A	B	C	P
True	True	True	0.27
True	True	False	0.18
True	False	True	0.03
True	False	False	0.02
False	True	True	0.02
False	True	False	0.03
False	False	True	0.18
False	False	False	0.27


For each pair of variables, state whether they are independent. State also whether they are independent given the third variable. Justify your answers.

In [20]:
data = "data/3.csv"
df = pd.read_csv(data, sep='\t')
df

Unnamed: 0,A,B,C,P
0,True,True,True,0.27
1,True,True,False,0.18
2,True,False,True,0.03
3,True,False,False,0.02
4,False,True,True,0.02
5,False,True,False,0.03
6,False,False,True,0.18
7,False,False,False,0.27


$A$ and $B$ are independent if $P(A | B) = P(A)$, provided $P(B) > 0$.

A and B, A and C, B and C independent?

In [21]:
pAIB = intersection('A', 'B', df)
pAIC = intersection('A', 'C', df)
pBIC = intersection('B', 'C', df)
pB = prob('B', df)
pA = prob('A', df)
pC = prob('C', df)
pACondB = pAIB/pB
pCCondB = pBIC/pB
pACondC = pAIC/pC
pBCondC = pBIC/pC
pBCondA = pAIB/pA
pCCondA = pAIC/pA

In [22]:
math.isclose(pACondB, pA) #dependent

False

In [23]:
independence2('A', 'B', df)

False

In [24]:
math.isclose(pACondC, pA) #dependent

False

In [25]:
independence2('A', 'C', df)

False

In [26]:
math.isclose(pBCondC, pB) #dependent

False

In [27]:
independence2('B', 'C', df)

False

A and B given C,
A and C given B,
B and C given A independent?

A and B are conditionally independent given event C where $P(C) > 0$, if:
$P(A, B | C) = P(A | C)P(B | C)$,
equivalently, $P(A | B, C) = P(A | C)$.

$P(A, B | C) = \frac{P(A, B, C)}{P(C)}$

In [28]:
pAIBIC = df.loc[(df['A'] == True) & (df['B'] == True) & (df['C'] == True)]['P'].sum()
pBICCondA = pAIBIC/pA
pAIBCondC = pAIBIC/pC
pAICCondB = pAIBIC/pB

In [29]:
math.isclose(pAIBCondC, pACondC*pBCondC) #dependent

False

In [30]:
math.isclose(pAICCondB, pACondB*pCCondB) #dependent

False

In [31]:
math.isclose(pBICCondA, pBCondA*pCCondA) #independent

True
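
The checks above compare probabilities for the all-`True` configuration only. Independence of variables (rather than of single events) has to hold for every combination of values, and in particular for both values of the conditioning variable. The sketch below loops over all configurations; the table is re-entered inline so the block runs on its own, and `mass` and `cond_independent` are hypothetical helper names.

```python
import itertools
import math
import pandas as pd

# Joint distribution from Exercise 3, re-entered so the sketch is self-contained.
rows = [
    (True, True, True, 0.27),
    (True, True, False, 0.18),
    (True, False, True, 0.03),
    (True, False, False, 0.02),
    (False, True, True, 0.02),
    (False, True, False, 0.03),
    (False, False, True, 0.18),
    (False, False, False, 0.27),
]
joint = pd.DataFrame(rows, columns=['A', 'B', 'C', 'P'])

def mass(df, **fixed):
    """Total probability of the rows matching the given variable=value pairs."""
    mask = pd.Series(True, index=df.index)
    for var, val in fixed.items():
        mask &= df[var] == val
    return df.loc[mask, 'P'].sum()

def cond_independent(df, x, y, z):
    """Check P(x, y | z) == P(x | z) * P(y | z) for every configuration."""
    for vx, vy, vz in itertools.product([True, False], repeat=3):
        pz = mass(df, **{z: vz})
        if pz == 0:
            continue  # conditioning on a zero-probability event is undefined
        lhs = mass(df, **{x: vx, y: vy, z: vz}) / pz
        rhs = (mass(df, **{x: vx, z: vz}) / pz) * (mass(df, **{y: vy, z: vz}) / pz)
        if not math.isclose(lhs, rhs):
            return False
    return True

print(cond_independent(joint, 'A', 'B', 'C'))
print(cond_independent(joint, 'A', 'C', 'B'))
print(cond_independent(joint, 'B', 'C', 'A'))
```

For this distribution the full check agrees with the single-configuration results: only $B$ and $C$ given $A$ pass for every configuration.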

## Exercise 4
***

We have three urns labeled 1, 2 and 3. The urns contain, respectively, three white and three black balls, four white and two black balls, and one white and two black balls. An experiment consists of selecting an urn at random then drawing a ball from it.

Define the joint probability distribution over $U$ and $C$, where $U$ is the chosen urn with values 1, 2 and 3; and $C$ is the color of the ball, with values black and white.

(a) What is the probability of drawing a black ball?

(b) What is the conditional probability that urn 2 was selected given that a black ball was drawn?

(c) What is the probability of selecting urn 1 or a white ball?

In [32]:
#a
pBlack = 1/3*1/2+1/3*2/3+1/3*1/3
pBlack

0.49999999999999994

$P(U=2 | C=black) = P(U=2, C=black)/P(C=black)$

In [33]:
#b
(1/3*1/3)/pBlack

0.22222222222222224

$P(U=1 \vee C=white)= P(U=1)+P(C=white)-P(U=1, C=white)$

$P(C=white)=1-P(C=black)$

In [34]:
#c
1/3+(1-pBlack)-1/3*1/2

0.6666666666666666
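
The exercise asks for the joint distribution over $U$ and $C$ to be defined explicitly. A minimal sketch of one way to do this uses exact fractions, so the answers come out without floating-point rounding. It assumes each urn is chosen with probability $1/3$ (as in the calculations above); the names `urns`, `p_urn` and `joint` are illustrative.

```python
from fractions import Fraction

# (white, black) ball counts per urn, taken from the exercise text.
urns = {1: (3, 3), 2: (4, 2), 3: (1, 2)}
p_urn = Fraction(1, len(urns))  # assumption: every urn is equally likely

# Joint distribution P(U=u, C=c) = P(U=u) * P(C=c | U=u).
joint = {}
for u, (white, black) in urns.items():
    total = white + black
    joint[(u, 'white')] = p_urn * Fraction(white, total)
    joint[(u, 'black')] = p_urn * Fraction(black, total)

# (a) marginal probability of a black ball
p_black = sum(p for (u, c), p in joint.items() if c == 'black')
# (b) P(U=2 | C=black)
p_u2_given_black = joint[(2, 'black')] / p_black
# (c) P(U=1 or C=white): sum the joint over all cells where either holds
p_u1_or_white = sum(p for (u, c), p in joint.items() if u == 1 or c == 'white')

print(p_black, p_u2_given_black, p_u1_or_white)
```

Summing the joint table directly for (c) avoids the inclusion-exclusion step and gives the same value.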

## Exercise 5
***

Suppose Ed keeps track of forecasts of Finnish Meteorological Institute (FIM) and believes they are correct with 80% probability and Mary believes the forecasts of Foreca are correct with 70% probability. Then suppose FIM predicts rain and Foreca does not.

Consider four sets of bets:

> (1) Bookie offers to sell Ed a bet for 85 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 60 euros returning Mary 100 euros if it does not rain.
> 
> (2) Bookie offers to sell Ed a bet for 79 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 69 euros returning Mary 100 euros if it does not rain.
> 
> (3) Bookie offers to sell Ed a bet for 73 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 73 euros returning Mary 100 euros if it does not rain.
> 
> (4) Bookie offers to sell Ed a bet for 55 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 34 euros returning Mary 100 euros if it does not rain.

(a) Which set of bets is a Dutch book?

(b) How much money is the bookie guaranteed to make in the Dutch book scenario?

Provide some calculations justifying your answers in the notebook.


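Assume a rational agent buys a bet only if its price does not exceed the expected payout under their own beliefs: at most $0.8 \cdot 100 = 80$ euros for Ed and at most $0.7 \cdot 100 = 70$ euros for Mary. Sets (1) and (3) are then not possible: in (1) Ed would refuse his bet ($85 > 80$), and in (3) Mary would refuse hers ($73 > 70$). In (4) both accept, but the bookie makes no profit: he pays 100 euros to exactly one of them and collects only $55+34=89<100$. (2) is a Dutch book: both accept ($79 \le 80$, $69 \le 70$), and whatever happens the bookie receives $79+69-100=48$ euros.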
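
The calculation can also be written out as code. This is a sketch under one stated assumption: an agent buys a bet only if its price is at most their expected payout (80 euros for Ed, 70 euros for Mary). The names `bet_sets` and `guaranteed_profit` are illustrative.

```python
PAYOUT = 100   # euros paid to the winner of each bet
ed_max = 80    # assumed acceptance limit: 0.8 * 100, Ed's expected payout for rain
mary_max = 70  # assumed acceptance limit: 0.7 * 100, Mary's expected payout for no rain

# (price offered to Ed, price offered to Mary) for each set of bets
bet_sets = {1: (85, 60), 2: (79, 69), 3: (73, 73), 4: (55, 34)}

def guaranteed_profit(ed_price, mary_price):
    """Bookie's profit if both agents accept, else None.

    Exactly one of 'rain' / 'no rain' occurs, so the bookie always
    pays out exactly one bet.
    """
    if ed_price > ed_max or mary_price > mary_max:
        return None  # at least one agent refuses the bet
    return ed_price + mary_price - PAYOUT

results = {k: guaranteed_profit(*prices) for k, prices in bet_sets.items()}
dutch_books = [k for k, profit in results.items() if profit is not None and profit > 0]

print(results)
print(dutch_books)
```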

## Exercise 6
***



The file `data/6.csv` contains 200 data points sampled from the distribution defined in exercise 3, with `True` mapped to 1 and `False` to 0.

For each pair of variables, conduct the G²-test for statistical independence. Also conduct the test for each pair of variables given the third variable. That is, repeat the task specified in exercise 3, but this time based on data sampled from the distribution instead of direct access to the distribution. For each conducted test report the p-value obtained when the null hypothesis is that the independence holds.

You can also try sampling data from the distribution yourself to see how the obtained p-values behave, but for the Moodle return use the given data set.

### G²-test

Under the null hypothesis $H_0: X \mathrel{\unicode{x2AEB}} Y \mid C$ we have that

$$\#_{e}(X=x \wedge Y=y \wedge C=c) = \frac{\#(X=x \wedge C=c) \cdot \#(Y=y \wedge C=c)}{\#(C=c)}$$

where $\#$ denotes the number of samples satisfying the condition that follows it, and $\#_{e}$ is the expected number of samples under $H_{0}$.

Then examine the following quantity:

$$G^{2} = 2 \sum \# \log \frac{\#}{\#_{e}} = 2 \sum \#(X=x \wedge Y=y \wedge C=c) \log \frac{\#(X=x \wedge Y=y \wedge C=c)}{\#_{e}(X=x \wedge Y=y \wedge C=c)} $$

where the summation is over the different configurations of the variables (i.e., different values the variables can assume).

Under $H_0$ the quantity $G^2$ is distributed as [$\chi^2$](https://en.wikipedia.org/wiki/Chi-square_distribution) with $(m_X - 1)(m_Y - 1)m_C$ degrees of freedom, where $m_X,m_Y,m_C$ are the number of possible configurations for $X$, $Y$ and $C$, respectively.
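
As a quick check of the degrees-of-freedom formula for this exercise, where all three variables are binary ($m_X = m_Y = m_C = 2$), a conditional test has

$$(m_X - 1)(m_Y - 1)m_C = (2-1)(2-1)\cdot 2 = 2$$

degrees of freedom. For an unconditional test there is no conditioning variable, which can be treated as the case $m_C = 1$:

$$(m_X - 1)(m_Y - 1)\cdot 1 = (2-1)(2-1) = 1$$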

### Instructions

You can use any libraries you find for the task, but it probably makes sense to implement the $G^2$ computation yourself, and then compute the p-value for example using [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html) (if you're using Python) or the built-in [chisquare functions](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html) in R.

In [35]:
from scipy.stats import chi2
import numpy as np

In [36]:
data = "data/6.csv"
df = pd.read_csv(data)
df

Unnamed: 0,A,B,C
0,0,0,1
1,0,0,0
2,1,1,1
3,1,1,0
4,0,0,0
...,...,...,...
195,1,1,0
196,0,0,0
197,1,1,0
198,1,1,0


In [37]:
def one(var, val):
    inter = df.loc[(df[var]==val)]
    return len(inter.index)   

def inter2(var1, val1, var2, val2):
    inter = df.loc[(df[var1]==val1) & (df[var2]==val2)]
    return len(inter.index)

def inter3(var1, val1, var2, val2, var3, val3):
    inter = df.loc[(df[var1]==val1) & (df[var2]==val2) & (df[var3]==val3)]
    return len(inter.index)

def expected3(var1, val1, var2, val2, varCond, valCond):
    a = inter2(var1, val1, varCond, valCond)
    b = inter2(var2, val2, varCond, valCond)
    c = one(varCond,valCond)
    return (a*b)/c

def expected2(var1, val1, var2, val2):
    return one(var1,val1)*one(var2,val2)/len(df.index)

In [38]:
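def G2test3vars(var1, var2, varCond):
    suma = 0

    # Sum over all configurations of the two tested variables and the conditioning variable.
    for a in range(2):
        for b in range(2):
            for c in range(2):
                n = inter3(var1, a, var2, b, varCond, c)
                if n == 0:
                    continue  # an empty cell contributes 0 (limit of n*log(n) as n -> 0)
                e = expected3(var1, a, var2, b, varCond, c)
                suma += 2 * n * np.log(n/e)

    dof = (2-1)*(2-1)*2
    p_value = 1-chi2.cdf(suma, dof)
    return p_value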

In [39]:
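def G2test2vars(var1, var2):
    suma = 0

    # Sum over all configurations of the two tested variables.
    for a in range(2):
        for b in range(2):
            n = inter2(var1, a, var2, b)
            if n == 0:
                continue  # an empty cell contributes 0 (limit of n*log(n) as n -> 0)
            e = expected2(var1, a, var2, b)
            suma += 2 * n * np.log(n/e)

    dof = (2-1)*(2-1)
    p_value = 1-chi2.cdf(suma, dof)
    return p_value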

In [40]:
G2test2vars('A', 'B')

0.0

In [41]:
G2test3vars('A', 'B', 'C')

0.0

In [42]:
G2test2vars('A', 'C')

2.487762227554313e-06

In [43]:
G2test3vars('A', 'C', 'B')

0.002097928936509952

In [44]:
G2test2vars('B', 'C')

0.0010939032025529816

In [45]:
G2test3vars('B', 'C', 'A')

0.6639242354843888
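
The exercise text also suggests sampling data from the distribution yourself. The sketch below draws 200 points from the Exercise 3 distribution and runs a G²-test on them. It differs from the cells above in several ways, all of which are choices made for this sketch: the function takes the data as an argument, counts come from `pd.crosstab`, empty cells are skipped, and the p-value uses `chi2.sf` (the survival function), which keeps small p-values from rounding to exactly `0.0` the way `1 - chi2.cdf(...)` can. The seed and the name `g2_pvalue` are arbitrary, and the degrees of freedom are hard-coded for binary variables.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

# Exercise 3 distribution with True -> 1 and False -> 0, in table order.
configs = np.array([(1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 0, 0),
                    (0, 1, 1), (0, 1, 0), (0, 0, 1), (0, 0, 0)])
probs = [0.27, 0.18, 0.03, 0.02, 0.02, 0.03, 0.18, 0.27]

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
idx = rng.choice(len(configs), size=200, p=probs)
sample = pd.DataFrame(configs[idx], columns=['A', 'B', 'C'])

def g2_pvalue(data, x, y, cond=None):
    """p-value of the G2 test for x independent of y (optionally given cond)."""
    groups = [data] if cond is None else [g for _, g in data.groupby(cond)]
    g2 = 0.0
    for g in groups:
        observed = pd.crosstab(g[x], g[y]).to_numpy(dtype=float)
        # Expected counts under H0: (row total * column total) / group size.
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
        nonzero = observed > 0  # empty cells contribute 0 to the sum
        g2 += 2.0 * np.sum(observed[nonzero] * np.log(observed[nonzero] / expected[nonzero]))
    dof = (2 - 1) * (2 - 1) * len(groups)  # assumes binary x and y
    return chi2.sf(g2, dof)

p_ab = g2_pvalue(sample, 'A', 'B')
p_bc_given_a = g2_pvalue(sample, 'B', 'C', cond='A')
print(p_ab, p_bc_given_a)
```

Re-running with different seeds shows how the p-values vary between samples; for the Moodle return, use the given data set as instructed.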