# Types and Tools

## Learning Objectives
* Joint Probability Distributions
* Marginal Probability Distributions
* Conditional Probability Distributions
* Bayes Theorem
* Probability Chain Rule

# Joint Probability Distributions
After we learned about probability distributions and measures to describe them in the last chapter, we will now focus on different types of distributions.<br>
Joint probability is the probability of two events occurring at the same time. In general, the notation for two events $A$ and $B$ happening together is $P(A\cap B)$, which means it is the intersection of these two events as illustrated below:<br>
<img src="images/joint.png" align="center"/> <br>
([Source](https://www.researchgate.net/figure/Venn-diagrams-of-event-intersection-AB-and-union-AUB-The-ellipse-sizes-are-not-drawn_fig1_308880835))<br>
In general, we can think of the joint probability as the probability of A __and__ B or as their intersection. <br>

Rather than only considering the probabilities associated with one random variable, we can also look at the probabilities associated with a pair or set of random variables at specific values. The probability of multiple random variables taking specific values is known as a __joint probability__. Joint probabilities are useful for understanding the relationship between these variables. If we consider two random variables, X and Y, their joint probability is, in general, denoted as follows:

$$p(x,y) = P(X=x \text{ and } Y=y)$$


Consider the example data shown below, based on the random variables "AI Core Graduate" and "Jobs In Data Science":

|                               | AI Core Graduate | Non-AI Core Graduate |
|-------------------------------|------------------|----------------------|
| Jobs In Data Science    | 120              | 30                   | 
| No Jobs In Data Science | 0                | 150                  | 

We have seen above that to calculate a probability, we simply divide the number of wanted outcomes by the total number of outcomes. Therefore, to calculate the probability of each individual combination of values of our random variables, we divide by the total number of outcomes, which in this case is 300. This leads to the table below:

|                               | AI Core Graduate | Non-AI Core Graduate |
|-------------------------------|------------------|----------------------|
| Jobs In Data Science    | 0.4              | 0.1                  |
| No Jobs In Data Science | 0.0              | 0.5                  | 


This is now a table of __joint probabilities__. Let us rewrite these results in their general form:

$$P(\text{AI Core Graduate, Jobs In Data Science}) = 0.4$$

$$P(\text{AI Core Graduate, No Jobs In Data Science}) = 0.0$$

$$P(\text{Non-AI Core Graduate, Jobs In Data Science}) = 0.1$$

$$P(\text{Non-AI Core Graduate, No Jobs In Data Science}) = 0.5$$

This example helps us understand the importance of joint probabilities. In this case, the joint probabilities tell us that having a job in data science is far more common among AI Core graduates than among non-graduates!
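The step from counts to joint probabilities can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the names `counts` and `joint` are assumptions made here, not part of the original material.

```python
# Counts from the table above, keyed by (graduate status, job status).
counts = {
    ("AI Core Graduate", "Jobs In Data Science"): 120,
    ("Non-AI Core Graduate", "Jobs In Data Science"): 30,
    ("AI Core Graduate", "No Jobs In Data Science"): 0,
    ("Non-AI Core Graduate", "No Jobs In Data Science"): 150,
}

# Total number of outcomes (300 people in this example).
total = sum(counts.values())

# Each joint probability is the count of that combination divided by the total.
joint = {outcome: n / total for outcome, n in counts.items()}

for outcome, p in joint.items():
    print(outcome, p)
```

A quick sanity check is that the joint probabilities of all combinations add up to 1.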


# Marginal Probability Distributions
We might not always know the individual probability distributions of $X$ and $Y$. Instead, we may only be given their joint probability distribution. We can then calculate the individual probability distributions of $X$ and $Y$ as the __marginal distributions__ of the joint distribution. This will make more sense if we visualise the joint probability distribution in a table. Let's say we have two random variables $X$ and $Y$ that can both take two values, 0 and 1, for simplicity:<br>

| x/y | $Y=0$ | $Y=1$         
| :- |----: | :-:
|$X=0$| 0.2 | 0.3
| $X=1$ | 0.4 | 0.1

<br> where each cell indicates the joint probability of two events occurring together.<br>
We can now add the marginal probabilities by adding row and column sums:<br>

| x/y | $Y=0$ | $Y=1$ | $P(X=x)$ |
| :- | ----: | :-: | :-: |
| $X=0$ | 0.2 | 0.3 | 0.5 |
| $X=1$ | 0.4 | 0.1 | 0.5 |
| $P(Y=y)$ | 0.6 | 0.4 | 1 |

<br> We now see that the row sums and the column sums each add up to 1. The row sums display the probability distribution of $X$ and the column sums display the probability distribution of $Y$.<br>
More formally, following the sum rule, we obtain each marginal by summing the joint probability over all values of the other variable:
$$ P(X=x) = \sum_y P(x,y) \text{ and } P(Y=y) = \sum_x P(x,y) $$ 
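As a minimal sketch of the sum rule, the marginals of the example table can be computed by accumulating the joint probabilities over the other variable. The names `joint`, `p_x` and `p_y` are illustrative assumptions, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

# Joint distribution from the table above, keyed by (x, y).
joint = {
    (0, 0): Fraction(2, 10),
    (0, 1): Fraction(3, 10),
    (1, 0): Fraction(4, 10),
    (1, 1): Fraction(1, 10),
}

p_x = {}  # marginal of X: sum over y (row sums)
p_y = {}  # marginal of Y: sum over x (column sums)
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

print(p_x)  # P(X=0) = 1/2, P(X=1) = 1/2
print(p_y)  # P(Y=0) = 3/5, P(Y=1) = 2/5
```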

## Exercise 1:
Write the following joint distribution as a table and add the marginal probabilities:<br>
A company wants to analyse the number of smokers by gender among their employees. Therefore, they collected the following data from 100 employees:<br>
35 employees are female and smokers <br>
20 employees are female and non-smokers <br>
25 employees are male and smokers <br>
20 employees are male and non-smokers


# Conditional Probability Distributions
Beyond the joint occurrence of two events, we are also interested in the occurrence of an event __given__ that another event occurred. Sometimes, the probability of an event can change depending on whether another event occurred. Formally, it is denoted as $P(X|Y)$ and, for $P(Y) > 0$, calculated as follows based on the product rule:
$$ P(X|Y) = \frac{P(X \cap Y)}{P(Y)}$$
On this note, we should also distinguish between __dependent and independent events__. If an event is independent of another event, its outcome is not influenced by the outcome of the other event. An example is throwing a die twice: the outcome of the first throw does not affect the outcome of the second throw. __Dependent events__, on the other hand, are affected by other events. Take drawing marbles from a bag as an example. If we have a bag of 5 marbles, 3 of them red and 2 of them blue, the probability of drawing a red marble changes once we have drawn a marble without replacing it. This behaviour also affects the conditional probability. If two events are independent of one another, we can calculate the conditional probability as follows:
$$ P(X|Y) = P(X)$$ and vice versa. We can also use this formula to check for the independence of two random variables. If this formula holds, they are independent. Otherwise, they are dependent.<br>
Furthermore, we can also distinguish events by whether they are __mutually exclusive__. In this case, once event $Y$ has occurred, event $X$ cannot occur anymore. Formally, they are also denoted as __disjoint events__ s.t.
$$ P(X|Y) = 0 $$
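The independence check can be sketched in code. The example below reuses the two-variable table from the marginal distributions section, computes $P(X=x|Y=y)$ as the joint probability divided by $P(Y=y)$, and compares it with $P(X=x)$. All variable names are illustrative assumptions.

```python
from fractions import Fraction

# Joint distribution of X and Y, keyed by (x, y).
joint = {
    (0, 0): Fraction(2, 10),
    (0, 1): Fraction(3, 10),
    (1, 0): Fraction(4, 10),
    (1, 1): Fraction(1, 10),
}

# Marginals via the sum rule.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# Conditional probability P(X=x | Y=y) = P(x, y) / P(Y=y).
p_x_given_y = {(x, y): p / p_y[y] for (x, y), p in joint.items()}

# X and Y are independent only if P(X=x | Y=y) == P(X=x) for every pair.
independent = all(p_x_given_y[(x, y)] == p_x[x] for (x, y) in joint)

print(p_x_given_y[(0, 0)])  # 1/3, which differs from P(X=0) = 1/2
print(independent)          # False
```

For this table the conditional probabilities differ from the marginals, so $X$ and $Y$ are dependent.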


We will now look at a quick example to understand these concepts in practice, shown in the __tree diagram__ below, with the properties:
- Imagine a type of bolt that can be produced either in factory A or factory B. They sometimes end up defective.
- 60% of bolts are produced in A and 40% of bolts are produced in B
- 2% of bolts produced in A are defective and 4% of bolts produced in B are defective

We can model this situation with two random variables:
- X: can either take the value 'A' or 'B', corresponding to which factory the bolt was produced in
- Y: can either take the value 'D' or 'D'' corresponding to whether it is defective or not

<img src="images/tree.png" alt="tree-diagram"
	title="Tree diagram of the bolt production process" width="750px" height="500px" />
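The leaves of such a tree can be computed by multiplying the probabilities along each branch. The following is a minimal sketch using the percentages stated above. The dictionary names are hypothetical, and `"D'"` stands for "not defective".

```python
from fractions import Fraction

# P(X): which factory produced the bolt.
p_factory = {"A": Fraction(60, 100), "B": Fraction(40, 100)}

# P(Y = D | X): probability of a defect given the factory.
p_defect_given_factory = {"A": Fraction(2, 100), "B": Fraction(4, 100)}

# Joint probability of each leaf: P(X, Y) = P(Y | X) * P(X).
joint = {}
for factory, p_f in p_factory.items():
    p_d = p_defect_given_factory[factory]
    joint[(factory, "D")] = p_d * p_f
    joint[(factory, "D'")] = (1 - p_d) * p_f

# Overall probability of a defective bolt (marginal of Y = D).
p_defective = joint[("A", "D")] + joint[("B", "D")]

for leaf, p in joint.items():
    print(leaf, float(p))
print("P(D) =", float(p_defective))
```

Under these numbers, 1.2% of all bolts are defective bolts from A, 1.6% are defective bolts from B, and 2.8% of all bolts are defective.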

We can also use the __conditional probability__ to calculate the __joint probability__ of two random variables.
For example, in a deck of cards we can define event A as drawing a red card and event B as drawing a 10. In a deck of 52 cards, there are 26 red cards s.t. $P(A) =  \frac{26}{52}$ and 4 cards with the number 10 s.t. $P(B) = \frac{4}{52}$.
The joint probability of drawing a red 10 can be calculated using the conditional probability:
$$ P(A \cap B) = P(A|B) P(B) = P(B|A) P(A)$$
The two formulas are equal because joint probabilities are symmetric s.t. $P(A \cap B) = P(B \cap A)$.
Since 2 of the 26 red cards are 10s, $P(B|A) = \frac{2}{26}$, and we can calculate the joint probability as follows based on the product rule:
$$ P(A \cap B) = \frac{2}{26} \cdot \frac{26}{52} = \frac{2}{52} = \frac{1}{26}$$
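The card example can be checked by enumerating a standard 52-card deck and counting. This is a sketch under the assumption of four suits (two red, two black) with 13 ranks each. The helper names `is_red` and `is_ten` are made up for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suit_colour = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}

# Every (rank, suit) combination: 13 * 4 = 52 cards.
deck = list(product(ranks, suit_colour))

def is_red(card):
    return suit_colour[card[1]] == "red"

def is_ten(card):
    return card[0] == "10"

n = len(deck)
p_a = Fraction(sum(is_red(c) for c in deck), n)                    # P(A) = 26/52
p_b = Fraction(sum(is_ten(c) for c in deck), n)                    # P(B) = 4/52
p_a_and_b = Fraction(sum(is_red(c) and is_ten(c) for c in deck), n)

# Conditional probabilities from the definition.
p_b_given_a = p_a_and_b / p_a
p_a_given_b = p_a_and_b / p_b

print(p_a_and_b)                 # 1/26
print(p_b_given_a * p_a)         # 1/26
print(p_a_given_b * p_b)         # 1/26
```

Both factorisations give the same joint probability, as the symmetry argument predicts.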

## Exercise 2:
Compute the conditional probability that a female employee smokes and that a male employee smokes based on the data from Exercise 1. Draw a probability tree to visualise these values. Are the two random variables independent or dependent?

# Bayes Theorem
One of the most important theorems in probability theory is __Bayes' theorem__. It says that the __conditional probability__ of two events $X$ and $Y$, $P(X|Y)$, is related to the inverse form $P(Y|X)$. Therefore, the conditional probability $P(X|Y)$ can be calculated if we know $P(Y|X)$, $P(X)$ and $P(Y)$:
$$ P(X|Y) = \frac{P(Y|X) P(X)}{P(Y)} $$
This theorem is useful because we can use it to calculate $P(X|Y)$ without knowing $P(X \cap Y)$.
We can easily derive this theorem from the definition of conditional probability and the product rule $P(X \cap Y) = P(Y|X) P(X)$:
$$ P(X|Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(Y|X) P(X)}{P(Y)} $$
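To see the theorem at work, we can return to the bolt example from the previous section and ask the inverse question: given that a bolt is defective, how likely is it that it came from factory A? The sketch below assumes the same percentages as before. The variable names are illustrative.

```python
from fractions import Fraction

# Prior: which factory produced the bolt.
p_a = Fraction(60, 100)
p_b = Fraction(40, 100)

# Likelihood: probability of a defect given the factory.
p_d_given_a = Fraction(2, 100)
p_d_given_b = Fraction(4, 100)

# Evidence: overall probability of a defect, summing over both factories.
p_d = p_d_given_a * p_a + p_d_given_b * p_b

# Bayes' theorem: P(A | D) = P(D | A) P(A) / P(D).
p_a_given_d = p_d_given_a * p_a / p_d
p_b_given_d = p_d_given_b * p_b / p_d

print(p_a_given_d)  # 3/7
print(p_b_given_d)  # 4/7
```

Although factory A produces most of the bolts, a defective bolt is slightly more likely to come from factory B because of its higher defect rate.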

## Exercise 3:
Suppose we are interested in a test to detect a disease which affects one in 100000 people on average.
A lab has developed a test which works but is not perfect. If a person has the disease, it will give a
positive result with probability 0.97; if they do not, the test will be positive with probability 0.007.
You took the test, and it gave a positive result. What is the probability that you actually have the
disease?

# Probability Chain Rule
The examples we have given only refer to the relationship between two events. However, in reality more than two events can interact with each other. If we want to calculate the joint probability of multiple random variables, we can extend the formula we have used above. We already used the product rule to calculate joint probabilities between two events $A$ and $B$. We can generalise this rule further for any set of events $E_1$, $E_2$,...,$E_n$ as follows:
$$ P(\cap_{i=1,...,n} E_i) = P(E_n | \cap_{i=1,...,n-1} E_i) P(\cap_{i=1,...,n-1} E_i) $$

The __probability chain rule__ is a crucial concept in probability, and is based on the expansion of the concepts we covered in conditional probability so far. It is used for computing the joint probability of any number of random variables based on only their conditional probabilities. We have already looked at the case of two random variables, so for now, let us cover what happens for three random variables X, Y and Z. The probability of $X=x$, $Y=y$ and $Z=z$ taking place can be computed as the probability of $X=x$ _given that_ $Y=y$ and $Z=z$ have taken place times the probability of $Y=y$ and $Z=z$ taking place. This is shown below:

$$P(X=x, Y=y, Z=z) = p(x,y,z) = p(x|y,z)p(y,z)$$

We can do this by treating the values of "Y and Z" as one outcome, and then applying the relationship between conditional and joint probabilities we have previously seen, once to $p(x,y,z)$ and once more to $p(y,z)$:

$$p(x,y,z) = p(x|y,z)p(y,z) = p(x|y,z)p(y|z)p_{Z}(z)$$
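The three-variable factorisation $p(x,y,z) = p(x|y,z)\,p(y|z)\,p(z)$ can be checked numerically. The sketch below builds an arbitrary joint distribution from random integer weights (an assumption made purely for illustration), computes the required marginals and conditionals, and multiplies them back together.

```python
import random
from fractions import Fraction
from itertools import product

random.seed(0)

# Arbitrary joint distribution over X in {0,1}, Y in {0,1,2}, Z in {0,1}.
outcomes = list(product(range(2), range(3), range(2)))
weights = {o: random.randint(1, 9) for o in outcomes}
total = sum(weights.values())
joint = {o: Fraction(w, total) for o, w in weights.items()}

# Marginals p(z) and p(y, z) via the sum rule.
p_z, p_yz = {}, {}
for (x, y, z), p in joint.items():
    p_z[z] = p_z.get(z, 0) + p
    p_yz[(y, z)] = p_yz.get((y, z), 0) + p

# Rebuild every joint probability as p(x|y,z) * p(y|z) * p(z).
reconstructed = {}
for (x, y, z), p in joint.items():
    p_x_given_yz = p / p_yz[(y, z)]
    p_y_given_z = p_yz[(y, z)] / p_z[z]
    reconstructed[(x, y, z)] = p_x_given_yz * p_y_given_z * p_z[z]

print(reconstructed == joint)  # True
```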

To nail down this concept we will also cover this process for the case of the joint probability of four random variables, X, Y, Z, T. We start by initially treating the values of (Y, Z, T) as one outcome:

$$p(x,y,z,t) = p(x|y,z,t)p(y,z,t)$$

We now have another joint probability of three random variables, which we can substitute by what we got above:

$$p(x,y,z,t) = p(x|y,z,t)p(y,z,t) = p(x|y,z,t)p(y|z,t)p(z|t)p_{T}(t)$$

For cases with more than four random variables, we can apply the same logic recursively:
- Treat all the values of all but one random variable as one outcome
- Re-write the joint probability as a conditional probability times the joint probability of one less variable
- Do the same process for the new joint probability until we get to our base case of two random variables

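We can generalize the above results. For any N random variables $X_{1},X_{2},...,X_{N}$, we can use the chain rule of probability to re-write their joint probability (given the reasoning above) in the following manner. Because the ordering of the variables is arbitrary, we can condition each variable on the ones before it just as well as on the ones after it, as we did in the examples above: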

$$p(x_{1},x_{2},...,x_{N}) = \prod_{i=1}^{N}p(x_{i}|x_{1},...,x_{i-1})$$
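The general formula can also be sketched for any number of variables. The helper functions `marginal` and `chain_rule_product` below are hypothetical names introduced for illustration. They recompute marginals on every call, which is inefficient but keeps the correspondence with the formula easy to follow.

```python
import random
from fractions import Fraction
from itertools import product

def marginal(joint, k):
    """Marginal distribution of the first k variables (sum out the rest)."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[:k]] = out.get(outcome[:k], 0) + p
    return out

def chain_rule_product(joint, outcome):
    """Multiply p(x_i | x_1, ..., x_{i-1}) for i = 1..N for one outcome."""
    result = Fraction(1)
    for i in range(1, len(outcome) + 1):
        # p(x_i | x_1..x_{i-1}) = p(x_1..x_i) / p(x_1..x_{i-1});
        # for i = 1 the denominator is the empty marginal, which equals 1.
        numerator = marginal(joint, i)[outcome[:i]]
        denominator = marginal(joint, i - 1)[outcome[:i - 1]]
        result *= numerator / denominator
    return result

# Arbitrary joint distribution over four binary variables.
random.seed(1)
outcomes = list(product(range(2), repeat=4))
weights = {o: random.randint(1, 9) for o in outcomes}
total = sum(weights.values())
joint = {o: Fraction(w, total) for o, w in weights.items()}

matches = all(chain_rule_product(joint, o) == joint[o] for o in joint)
print(matches)  # True
```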

And that is the gist of it! This is an important tool in data science and machine learning, and is crucial when dealing with Bayesian Networks!