# Probabilistic Models – Spring 2022
## Exercise Session 1
Return by Jan 25th 9.00 through Moodle. Session on Jan 25th 14.15.

### Instructions
Make sure the notebook produces correct results when run sequentially starting from the first cell. You can ensure this by clearing all outputs (`Edit > Clear All Outputs`), running all cells (`Run > Run All Cells`), and finally correcting any errors.

To get points:
1. Submit your answers to the automatically checked Moodle test. 
 - You have 5 tries on the test: the highest obtained score will be taken into account.
 - For numerical questions the tolerance is +/- 0.01.
2. Submit this notebook containing your derivations to Moodle.

## Exercise 1
***

Consider the following joint distribution $P$:

In [1]:
!cat data/1.csv

A	B	C	P
True	True	True	0.075
True	True	False	0.05
True	False	True	0.225
True	False	False	0.15
False	True	True	0.025
False	True	False	0.1
False	False	True	0.075
False	False	False	0.3


(a) What is $P(A=T, C=T)$?

Update the distribution by conditioning on the event $C=T$, that is, construct the conditional distribution $P( \cdot |C=T)$.

(b) What is $P(A=T|C=T)$? $P(B=T|C= T)$?

(c) Is the event $A=T$ independent of the event $C=T$? Is $B=T$ independent of $C=T$?

### Instructions

If you're using Python, you can start by reading the provided file into a [Pandas](https://pandas.pydata.org/) [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) or, similarly, into a [data.frame](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame) in R. To check for equality between two real numbers, do not use `x == y`, as it gives false negatives on limited-precision floats. Rather, use for example [`math.isclose(x, y)`](https://docs.python.org/3/library/math.html#math.isclose) in Python or [`near(x, y)`](https://dplyr.tidyverse.org/reference/near.html) in R.
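The pitfall described above can be seen in a minimal sketch. The variable names `x`, `y`, `exact_equal` and `close_equal` are only illustrative.

```python
import math

x = 0.1 + 0.2   # stored as 0.30000000000000004 in binary floating point
y = 0.3

exact_equal = (x == y)             # False, although the values are equal on paper
close_equal = math.isclose(x, y)   # True, using the default relative tolerance 1e-09
print(exact_equal, close_equal)
```

Note that `math.isclose` uses `abs_tol=0.0` by default, so a comparison against exactly zero needs an explicit `abs_tol`.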


In [3]:
import math

import pandas as pd

# The file is tab-separated; pandas reads the True/False columns as booleans.
df = pd.read_csv('./data/1.csv', sep='\t')

# Task a. P(A=T,C=T) = sum of P over the rows where both A and C are True
prob_a = df.loc[df['A'] & df['C'], 'P'].sum()
print('P(A=T,C=T) =', prob_a)

# Task b. P(.|C=T) = P(., C=T)/P(C=T)
pctrue = df.loc[df['C'], 'P'].sum()
prob_ba = prob_a / pctrue
prob_bb = df.loc[df['B'] & df['C'], 'P'].sum() / pctrue
print('P(A=T|C=T) =', prob_ba, '\nP(B=T|C=T) =', prob_bb)

# Task c. Events X and Y are independent if P(X|Y) = P(X),
# so compare each conditional probability with the marginal one.
patrue = df.loc[df['A'], 'P'].sum()
pbtrue = df.loc[df['B'], 'P'].sum()
ind_c1 = math.isclose(prob_ba, patrue)
ind_c2 = math.isclose(prob_bb, pbtrue)
print('Is A=T independent of C=T?', ind_c1)
print('Is B=T independent of C=T?', ind_c2)

P(A=T,C=T) = 0.3
P(A=T|C=T) = 0.7499999999999999 
P(B=T|C=T) = 0.25
Is A=T independent of C=T? False
Is B=T independent of C=T? True


## Exercise 2
***

Consider again the joint distribution $P$ from Exercise 1.

(a) What is $P(A=T \vee B=T)$?


Update the distribution by conditioning on the event $(A=T \vee B=T)$, that is, construct the conditional distribution $P( \cdot |A=T \vee B=T)$.

(b) What is $P(A=T|A=T \vee B=T)$? $P(B=T|A=T \vee B=T)$?

(c) Is the event $B=T$ conditionally independent of $C=T$ given the event $(A=T \vee B=T)$?

In [3]:
# Provide your answer in cells here

## Exercise 3
***

Consider the following joint distribution.

In [4]:
!cat data/3.csv

A	B	C	P
True	True	True	0.27
True	True	False	0.18
True	False	True	0.03
True	False	False	0.02
False	True	True	0.02
False	True	False	0.03
False	False	True	0.18
False	False	False	0.27


For each pair of variables, state whether they are independent. State also whether they are independent given the third variable. Justify your answers.

In [5]:
# Provide your answer in cells here

## Exercise 4
***

We have three urns labeled 1, 2 and 3. The urns contain, respectively, three white and three black balls, four white and two black balls, and one white and two black balls. An experiment consists of selecting an urn at random and then drawing a ball from it.

Define the joint probability distribution over $U$ and $C$, where $U$ is the chosen urn with values 1, 2 and 3; and $C$ is the color of the ball, with values black and white.

(a) What is the probability of drawing a black ball?

(b) What is the conditional probability that urn 2 was selected given that a black ball was drawn?

(c) What is the probability of selecting urn 1 or a white ball?

In [6]:
# Provide your answer in cells here

## Exercise 5
***

Suppose Ed keeps track of the forecasts of the Finnish Meteorological Institute (FMI) and believes they are correct with 80% probability, and Mary believes the forecasts of Foreca are correct with 70% probability. Then suppose FMI predicts rain and Foreca does not.

Consider four sets of bets:

> (1) Bookie offers to sell Ed a bet for 85 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 60 euros returning Mary 100 euros if it does not rain.
> 
> (2) Bookie offers to sell Ed a bet for 79 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 69 euros returning Mary 100 euros if it does not rain.
> 
> (3) Bookie offers to sell Ed a bet for 73 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 73 euros returning Mary 100 euros if it does not rain.
> 
> (4) Bookie offers to sell Ed a bet for 55 euros returning Ed 100 euros if it rains. Bookie offers to sell Mary a bet for 34 euros returning Mary 100 euros if it does not rain.

(a) Which set of bets is a Dutch book?

(b) How much money is the bookie guaranteed to make in the Dutch book scenario?

Provide some calculations justifying your answers in the notebook.
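One possible way to organize such calculations is sketched below. It assumes that a bettor buys a bet only when its price does not exceed the expected payout under their own belief, and that a set of bets counts as a Dutch book when both bets are bought and the bookie profits whichever way the weather turns out. The function name `check_bets` and the numbers in the example are hypothetical and are not taken from the four sets above.

```python
def check_bets(price_rain, price_dry, belief_rain, belief_dry, payout=100):
    """Return (is_dutch_book, bookie_profit) for a pair of complementary bets.

    Beliefs are probabilities in [0, 1]; prices and payout are in euros.
    The profit is only realized if both bettors actually buy their bets.
    """
    # Assumption: a bettor accepts a bet only if price <= belief * payout.
    both_accept = (price_rain <= belief_rain * payout
                   and price_dry <= belief_dry * payout)
    # Exactly one of the two bets pays out, so the bookie's profit is the
    # same in both outcomes: total income minus a single payout.
    profit = price_rain + price_dry - payout
    return both_accept and profit > 0, profit


# Hypothetical example: each bettor holds a 60% belief in their own outcome.
example = check_bets(price_rain=58, price_dry=55, belief_rain=0.6, belief_dry=0.6)
print(example)
```

Separating the acceptance check from the profit computation makes it easy to see which of the two conditions fails for a given set of bets.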


In [7]:
# Provide your answer in cells here

## Exercise 6
***



The file `data/6.csv` contains 200 data points sampled from the distribution defined in Exercise 3, with `True` mapped to 1 and `False` to 0.

For each pair of variables, conduct the G²-test for statistical independence. Also conduct the test for each pair of variables given the third variable. That is, repeat the task specified in Exercise 3, but this time based on data sampled from the distribution instead of direct access to the distribution. For each conducted test, report the p-value obtained when the null hypothesis is that independence holds.

You can also try sampling data from the distribution yourself to see how the obtained p-values behave, but for the Moodle return use the given data set.

### G²-test

Under the null hypothesis $H_0: X \mathrel{⫫} Y \mid C$ we have that

$$\#_{e}(X=x \wedge Y=y \wedge C=c) = \frac{\# (X=x \wedge C=c) \cdot \# (Y=y \wedge C=c)}{\# (C=c)}$$

where $\#$ denotes the number of samples satisfying the condition that follows it, and $\#_{e}$ is the expected number of such samples under $H_{0}$.
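As a worked instance of this formula with made-up counts (not taken from `data/6.csv`): suppose 100 samples have $C=1$, of which 30 also have $X=1$ and 40 also have $Y=1$. Then

$$\#_{e}(X=1 \wedge Y=1 \wedge C=1) = \frac{30 \cdot 40}{100} = 12$$

so under $H_0$ about 12 of those 100 samples would be expected to have both $X=1$ and $Y=1$.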

Then examine the following quantity:

$$G^{2} = 2 \sum \# \log \frac{\#}{\#_{e}} $$

where the summation is over the different configurations of the variables (i.e., different values the variables can assume).

Under $H_0$ the quantity $G^2$ is distributed as [$\chi^2$](https://en.wikipedia.org/wiki/Chi-square_distribution) with $(m_X - 1)(m_Y - 1)m_C$ degrees of freedom, where $m_X,m_Y,m_C$ are the number of possible configurations for $X$, $Y$ and $C$, respectively.
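The definitions above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not a reference solution: the function name `g2_statistic` and the toy samples are hypothetical, configurations with zero observed count are taken to contribute nothing to the sum, and $m_X$, $m_Y$, $m_C$ are counted from the values that actually occur in the data.

```python
import math
from collections import Counter


def g2_statistic(samples):
    """Return (G2, degrees of freedom) for H0: X independent of Y given C.

    `samples` is an iterable of (x, y, c) tuples. For an unconditional test,
    pass the same constant (e.g. 0) as c in every tuple.
    """
    samples = list(samples)
    n_xyc = Counter(samples)                        # #(X=x, Y=y, C=c)
    n_xc = Counter((x, c) for x, _, c in samples)   # #(X=x, C=c)
    n_yc = Counter((y, c) for _, y, c in samples)   # #(Y=y, C=c)
    n_c = Counter(c for _, _, c in samples)         # #(C=c)

    g2 = 0.0
    # Only configurations that were observed appear in n_xyc, so terms with
    # zero observed count are skipped (assumed to contribute 0).
    for (x, y, c), observed in n_xyc.items():
        expected = n_xc[(x, c)] * n_yc[(y, c)] / n_c[c]
        g2 += 2.0 * observed * math.log(observed / expected)

    # Assumption: the number of configurations is taken from the data itself.
    m_x = len({x for x, _, _ in samples})
    m_y = len({y for _, y, _ in samples})
    m_c = len(n_c)
    return g2, (m_x - 1) * (m_y - 1) * m_c


# Hypothetical toy data with a single value of C (an unconditional test).
independent = [(0, 0, 0)] * 10 + [(0, 1, 0)] * 10 + [(1, 0, 0)] * 10 + [(1, 1, 0)] * 10
dependent = [(0, 0, 0)] * 15 + [(0, 1, 0)] * 5 + [(1, 0, 0)] * 5 + [(1, 1, 0)] * 15
g2_ind, dof_ind = g2_statistic(independent)
g2_dep, dof_dep = g2_statistic(dependent)
print(g2_ind, dof_ind, g2_dep, dof_dep)
```

From the returned pair, a p-value could then be obtained with `scipy.stats.chi2.sf(g2, dof)`, which gives the upper-tail probability of the $\chi^2$ distribution.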

### Instructions

You can use any libraries you find for the task, but it probably makes sense to implement the $G^2$ computation yourself and then compute the p-value using, for example, [SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2.html) (if you're using Python) or the built-in chi-squared distribution functions (such as `pchisq`) in R.

In [8]:
# Provide your answer in cells here