## **Conditional Probability & Total Probability Theorem**

### **1. Conditional Probability**
Conditional probability answers the question:

> _"What is the probability of event A occurring, given that event B has already occurred?"_

#### **Formula**
$$
P(A | B) = \frac{P(A \cap B)}{P(B)}
$$
where:
- $ P(A | B) $ = Probability of $ A $ given $ B $
- $ P(A \cap B) $ = Probability of both $ A $ and $ B $ occurring
- $ P(B) $ = Probability of $ B $ occurring

#### **Example: Weather and Traffic**
- Suppose:
  - $ P(T) = 0.3 $ (probability of **traffic jam**)
  - $ P(R) = 0.4 $ (probability of **rain**)
  - $ P(T \cap R) = 0.2 $ (probability of **traffic and rain** happening together)

Now, if we know that **it’s raining**, what is the probability of traffic?

$$
P(T | R) = \frac{P(T \cap R)}{P(R)} = \frac{0.2}{0.4} = 0.5
$$

So, given that it’s raining, the probability of a traffic jam rises from the unconditional 30% to **50%**.
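
The same calculation can be sketched in a few lines of Python. The function name `conditional_probability` and its argument names are illustrative choices, not part of any library; the guard on $ P(B) $ reflects that the formula is undefined when the conditioning event has probability zero.

```python
def conditional_probability(p_joint: float, p_given: float) -> float:
    """Return P(A | B) = P(A and B) / P(B)."""
    if p_given <= 0:
        raise ValueError("P(B) must be positive for P(A | B) to be defined")
    if p_joint > p_given:
        raise ValueError("P(A and B) cannot exceed P(B)")
    return p_joint / p_given


# Numbers from the weather and traffic example.
p_traffic_and_rain = 0.2
p_rain = 0.4
p_traffic_given_rain = conditional_probability(p_traffic_and_rain, p_rain)
print(f"P(T | R) = {p_traffic_given_rain:.2f}")
```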

### **2. Total Probability Theorem**
The **Total Probability Theorem** helps us compute the probability of an event by considering all possible ways that event can happen.

#### **Formula**
If $ B_1, B_2, ..., B_n $ are **mutually exclusive** events that cover the entire sample space (that is, they form a partition), each with $ P(B_i) > 0 $, then:

$$
P(A) = P(A | B_1) P(B_1) + P(A | B_2) P(B_2) + \dots + P(A | B_n) P(B_n)
$$

#### **Example: Disease Prevalence by Age Group**
Suppose a **disease** affects people differently based on age:
- **30%** of the population is **young** ($ B_1 $) and **70%** is **old** ($ B_2 $).
- Probability of having the disease:
  - **Young people**: $ P(D | B_1) = 0.01 $
  - **Old people**: $ P(D | B_2) = 0.05 $

What’s the probability that a **random person has the disease**?

$$
P(D) = P(D | B_1) P(B_1) + P(D | B_2) P(B_2)
$$

$$
= (0.01 \times 0.3) + (0.05 \times 0.7)
$$

$$
= 0.003 + 0.035 = 0.038
$$

So, the overall probability of disease in the population is **3.8%**.
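
As a minimal sketch, the weighted sum can be written as a small helper. The name `total_probability` and the list-based interface are assumptions made for illustration; the check that the partition probabilities sum to 1 mirrors the requirement that the $ B_i $ cover the whole sample space.

```python
import math


def total_probability(partition_probs, conditional_probs):
    """Return P(A) = sum_i P(A | B_i) * P(B_i) over a partition B_1..B_n."""
    if len(partition_probs) != len(conditional_probs):
        raise ValueError("need one conditional probability per partition event")
    if not math.isclose(sum(partition_probs), 1.0):
        raise ValueError("partition probabilities must sum to 1")
    return sum(c * p for c, p in zip(conditional_probs, partition_probs))


# Numbers from the example: young (B_1) and old (B_2).
p_disease = total_probability(
    partition_probs=[0.3, 0.7],
    conditional_probs=[0.01, 0.05],
)
print(f"P(D) = {p_disease:.3f}")
```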

### **Connection Between Conditional Probability & Total Probability**
- **Conditional probability** updates the probability of an event once some information is known.
- **Total probability** recovers the unconditional probability of an event as a weighted average of its conditional probabilities over every case in a partition, with weights $ P(B_i) $.
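
One way to see the two ideas working together is to run the disease example in reverse: dividing a single term of the total-probability sum by the whole sum gives the probability that a person known to have the disease is old. This is Bayes' rule applied to the numbers from the example, with the result rounded to three decimals.

$$
P(B_2 | D) = \frac{P(D | B_2) P(B_2)}{P(D)} = \frac{0.05 \times 0.7}{0.038} = \frac{0.035}{0.038} \approx 0.921
$$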

---

## **What is Joint Probability?**  
Joint probability is the probability that two (or more) random variables take particular values **simultaneously**.  

For two random variables $ X $ and $ Y $, the **joint probability** is denoted as:  

$$
P(X = x, Y = y)
$$

It describes the likelihood of $ X $ taking a specific value $ x $ and $ Y $ taking a specific value $ y $ **at the same time**.  

## **Joint PMF (Probability Mass Function) – Discrete Case**  
When both $ X $ and $ Y $ are **discrete random variables**, the **Joint Probability Mass Function (Joint PMF)** is defined as:

$$
P(X = x, Y = y) = p(x, y)
$$

**Example: Joint PMF of Two Dice Rolls**  
Consider rolling two fair dice:
- $ X $ = outcome of the first die
- $ Y $ = outcome of the second die  

Since both dice are fair and independent:

$$
P(X = x, Y = y) = \frac{1}{36}, \quad x, y \in \{1, 2, 3, 4, 5, 6\}
$$

**Key Property** (Sum over all possible values equals 1):

$$
\sum_{x} \sum_{y} P(X = x, Y = y) = 1
$$
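
A small sketch of this joint PMF, stored as a dictionary keyed by `(x, y)` pairs. The dictionary layout and the "sum equals 7" event are illustrative choices; `Fraction` is used so the normalization check is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Joint PMF of two fair, independent dice: every (x, y) pair has mass 1/36.
joint_pmf = {(x, y): Fraction(1, 36) for x, y in product(faces, repeat=2)}

# Key property: the masses sum to 1 over all pairs.
total_mass = sum(joint_pmf.values())

# Any event is a set of (x, y) pairs; its probability is the sum of their masses.
p_sum_is_7 = sum(p for (x, y), p in joint_pmf.items() if x + y == 7)

print(total_mass, p_sum_is_7)
```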


## **Joint PDF (Probability Density Function) – Continuous Case**  
When $ X $ and $ Y $ are **continuous random variables**, their **Joint Probability Density Function (Joint PDF)** is a function $ f(x, y) $ that is non-negative and integrates to 1:

$$
f(x, y) \geq 0, \quad \text{and} \quad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) dx dy = 1
$$

To compute the **probability over a region**, integrate the joint PDF:

$$
P(a \leq X \leq b, c \leq Y \leq d) = \int_{a}^{b} \int_{c}^{d} f(x, y) \, dy \, dx
$$
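
When the integral has no convenient closed form, it can be approximated numerically. The sketch below uses a plain midpoint rule with $ x $ ranging over $ [a, b] $ and $ y $ over $ [c, d] $. The density `exp_density` (two independent unit-rate exponentials) is a hypothetical example chosen because its exact answer is known, and `region_probability` is an illustrative helper, not a library function.

```python
import math


def exp_density(x: float, y: float) -> float:
    """Hypothetical joint PDF: f(x, y) = exp(-(x + y)) for x, y >= 0, else 0."""
    return math.exp(-(x + y)) if x >= 0 and y >= 0 else 0.0


def region_probability(pdf, a, b, c, d, n=200):
    """Approximate P(a <= X <= b, c <= Y <= d) with an n-by-n midpoint rule."""
    hx = (b - a) / n
    hy = (d - c) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * hx          # midpoint of the i-th x cell
        for j in range(n):
            y = c + (j + 0.5) * hy      # midpoint of the j-th y cell
            total += pdf(x, y)
    return total * hx * hy


approx = region_probability(exp_density, 0.0, 1.0, 0.0, 2.0)
exact = (1 - math.exp(-1)) * (1 - math.exp(-2))  # closed form for this density
print(f"approx = {approx:.6f}, exact = {exact:.6f}")
```

For real work a dedicated quadrature routine is usually a better choice; the double loop is kept here only to make the connection to the double integral explicit.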

**Example: Joint PDF of Two Normal Variables**  
If $ X $ and $ Y $ follow a **bivariate normal distribution**, their joint PDF is:

$$
f(x, y) = \frac{1}{2\pi\sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \left( \frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} - 2\rho \frac{(x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} \right) \right)
$$

where:
- $ \mu_X, \mu_Y $ = means of $ X $ and $ Y $,
- $ \sigma_X, \sigma_Y $ = standard deviations,
- $ \rho $ = correlation coefficient.
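
The density can be transcribed directly into code. The function below is a minimal sketch using only the standard library; its name and argument order are illustrative. A useful sanity check is that with $ \rho = 0 $ the joint PDF factorizes into the product of two univariate normal PDFs.

```python
import math


def bivariate_normal_pdf(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Evaluate the bivariate normal joint PDF at the point (x, y)."""
    if sigma_x <= 0 or sigma_y <= 0:
        raise ValueError("standard deviations must be positive")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    zx = (x - mu_x) / sigma_x            # standardized x
    zy = (y - mu_y) / sigma_y            # standardized y
    one_minus_rho2 = 1.0 - rho * rho
    norm_const = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_minus_rho2)
    quad_form = (zx * zx + zy * zy - 2.0 * rho * zx * zy) / one_minus_rho2
    return math.exp(-0.5 * quad_form) / norm_const


def normal_pdf(x, mu, sigma):
    """Univariate normal PDF, used only for the rho = 0 sanity check."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


joint_value = bivariate_normal_pdf(0.5, -1.0, 0.0, 1.0, 2.0, 0.5, rho=0.0)
product_value = normal_pdf(0.5, 0.0, 2.0) * normal_pdf(-1.0, 1.0, 0.5)
print(joint_value, product_value)
```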

## **Marginal PMF and Marginal PDF**  
Marginal probabilities describe the probability of **one** variable, regardless of the other.

### **(A) Marginal PMF (Discrete Case)**
For a discrete random variable $ X $, the **marginal PMF** is found by summing over all values of $ Y $:

$$
P(X = x) = \sum_{y} P(X = x, Y = y)
$$

Similarly, for $ Y $:

$$
P(Y = y) = \sum_{x} P(X = x, Y = y)
$$

**Example: Two Dice Rolls**  
$$
P(X = x) = \sum_{y=1}^{6} P(X = x, Y = y) = \sum_{y=1}^{6} \frac{1}{36} = \frac{6}{36} = \frac{1}{6}
$$
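
Marginalization is just a sum over the other variable, which the following sketch makes explicit. The helper name `marginals` and the dictionary representation of the joint PMF are assumptions for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def marginals(joint_pmf):
    """Return (P(X = x), P(Y = y)) as dicts, given a dict {(x, y): probability}."""
    p_x = defaultdict(Fraction)  # Fraction() is 0, so missing keys start at zero
    p_y = defaultdict(Fraction)
    for (x, y), p in joint_pmf.items():
        p_x[x] += p  # sum over y for a fixed x
        p_y[y] += p  # sum over x for a fixed y
    return dict(p_x), dict(p_y)


# Joint PMF of two fair, independent dice.
joint_pmf = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}
p_x, p_y = marginals(joint_pmf)
print(p_x[3], p_y[5])
```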

### **(B) Marginal PDF (Continuous Case)**
For a continuous random variable $ X $, the **marginal PDF** is found by integrating over all values of $ Y $:

$$
f_X(x) = \int_{-\infty}^{\infty} f(x, y) dy
$$

Similarly, for $ Y $:

$$
f_Y(y) = \int_{-\infty}^{\infty} f(x, y) dx
$$
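
To illustrate the integral numerically, the sketch below uses a hypothetical joint density $ f(x, y) = x + y $ on the unit square (zero elsewhere), whose marginal is known in closed form: $ f_X(x) = x + \tfrac{1}{2} $ for $ 0 \leq x \leq 1 $. The density, the helper name `marginal_pdf_x`, and the midpoint rule are all illustrative assumptions.

```python
def unit_square_density(x: float, y: float) -> float:
    """Hypothetical joint PDF: f(x, y) = x + y on [0, 1] x [0, 1], else 0."""
    inside = 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
    return x + y if inside else 0.0


def marginal_pdf_x(joint_pdf, x, y_low, y_high, n=1000):
    """Approximate f_X(x) by integrating joint_pdf(x, y) over y (midpoint rule)."""
    h = (y_high - y_low) / n
    return h * sum(joint_pdf(x, y_low + (j + 0.5) * h) for j in range(n))


x0 = 0.25
approx_marginal = marginal_pdf_x(unit_square_density, x0, 0.0, 1.0)
exact_marginal = x0 + 0.5  # closed form for this particular density
print(approx_marginal, exact_marginal)
```

The integration limits are the support of $ Y $ rather than $ \pm\infty $, since the density is zero outside the unit square.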

## **Practical Applications in Machine Learning** 

### **(A) Feature Selection & Dimensionality Reduction**  
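- **Marginal distributions** describe each feature on its own (range, spread, skew), which is a common first step when screening individual features.  
- **PCA (Principal Component Analysis)** works from the covariance matrix, a summary of the features' **joint distribution**, and finds directions that are **uncorrelated**. Uncorrelated does not imply independent in general; recovering statistically independent components is the goal of ICA (Independent Component Analysis).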

### **(B) Naive Bayes Classifier**  
- Assumes that features $ X_1, X_2, ..., X_n $ are **conditionally independent** given a class $ Y $.
- Factorizes the class-conditional **joint probability** $ P(X_1, ..., X_n | Y) $ into a product of per-feature **conditional probabilities** $ P(X_i | Y) $, so that:

$$
P(Y | X_1, X_2, ..., X_n) \propto P(Y) P(X_1 | Y) P(X_2 | Y) ... P(X_n | Y)
$$
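
A toy sketch of this product rule for binary features follows. All numbers, class labels, and word names are made up for illustration, and `naive_bayes_posterior` is a hypothetical helper rather than any library's API. Normalizing the scores turns the proportionality into actual posterior probabilities.

```python
def naive_bayes_posterior(priors, feature_probs, observed):
    """
    priors:        {label: P(Y = label)}
    feature_probs: {label: {feature: P(feature present | Y = label)}}
    observed:      {feature: True if present, False if absent}
    Returns {label: P(Y = label | observed features)}.
    """
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature, present in observed.items():
            p = feature_probs[label][feature]
            # Conditional independence: multiply one factor per feature.
            score *= p if present else (1.0 - p)
        scores[label] = score
    evidence = sum(scores.values())  # normalizing constant
    return {label: s / evidence for label, s in scores.items()}


# Hypothetical spam filter with two word-presence features.
priors = {"spam": 0.4, "ham": 0.6}
feature_probs = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham": {"offer": 0.1, "meeting": 0.5},
}
posterior = naive_bayes_posterior(
    priors, feature_probs, observed={"offer": True, "meeting": False}
)
print(posterior)
```

In practice the product is usually computed as a sum of log-probabilities to avoid underflow when there are many features.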

### **(C) Generative Models (GANs, VAEs)**  
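- **VAEs (Variational Autoencoders)** model the **joint probability** $ p(x, z) = p(x | z) p(z) $ of observed data $ x $ and latent variables $ z $. **Marginalizing** over $ z $ gives the data likelihood $ p(x) $, which is intractable and is therefore approximated with a variational lower bound (the ELBO).
- **GANs (Generative Adversarial Networks)** do not write down an explicit PDF. The generator learns to produce samples whose distribution matches the **data distribution**, guided by a discriminator that tries to tell real samples from generated ones.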

### **(D) Correlation and Dependency in Feature Engineering**  
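- **Joint PMFs/PDFs** capture **dependencies** between features that the marginals alone cannot show.  
- **Mutual Information** is computed from the **joint PMF/PDF** and its marginals to detect dependencies, including **non-linear** ones that the linear correlation coefficient can miss.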
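
For the discrete case, mutual information is $ I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)} $. The sketch below computes it from a joint PMF stored as a dictionary; the helper name and the toy distributions are illustrative. The last example has $ Y = X^2 $ with $ X $ uniform on $ \{-1, 0, 1\} $: the covariance is zero, yet the mutual information is clearly positive.

```python
import math
from collections import defaultdict


def mutual_information(joint_pmf):
    """Return I(X; Y) in bits for a joint PMF given as {(x, y): probability}."""
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for (x, y), p in joint_pmf.items():
        p_x[x] += p
        p_y[y] += p
    mi = 0.0
    for (x, y), p in joint_pmf.items():
        if p > 0:  # terms with zero mass contribute nothing
            mi += p * math.log2(p / (p_x[x] * p_y[y]))
    return mi


independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
identical = {(0, 0): 0.5, (1, 1): 0.5}
squared = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}  # Y = X**2

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
mi_squared = mutual_information(squared)
print(mi_independent, mi_identical, mi_squared)
```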

### **(E) Bayesian Inference**  
- Used in **Bayesian Neural Networks** and **Hidden Markov Models (HMMs)**.
- Involves computing **posterior distributions** using joint probabilities:

$$
P(\theta | X) = \frac{P(X | \theta) P(\theta)}{P(X)}
$$

where:
- $ P(X | \theta) $ is the **likelihood** (joint probability of data given parameters),
- $ P(\theta) $ is the **prior** (initial belief),
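- $ P(X) $ is the **marginal likelihood** (evidence), obtained by marginalizing the joint $ P(X, \theta) = P(X | \theta) P(\theta) $ over all values of $ \theta $.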
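
A minimal sketch of this update on a discrete grid of parameter values, where the marginal likelihood is a plain sum. The coin-flip setup (7 heads, 3 tails), the uniform prior, and the helper name `grid_posterior` are assumptions chosen for illustration. The binomial coefficient is omitted from the likelihood because it is the same for every $ \theta $ and cancels in the posterior, so the returned `evidence` is the marginal likelihood only up to that constant.

```python
def grid_posterior(thetas, prior, heads, tails):
    """
    Posterior over a discrete grid of coin biases.
    thetas: candidate values of P(heads); prior: P(theta) for each candidate.
    Returns (posterior list, unnormalized evidence).
    """
    # Joint of data and parameter, up to a constant: P(X | theta) * P(theta).
    joint = [p * t**heads * (1.0 - t) ** tails for t, p in zip(thetas, prior)]
    evidence = sum(joint)  # marginalize the joint over theta
    posterior = [j / evidence for j in joint]
    return posterior, evidence


thetas = [i / 10 for i in range(1, 10)]       # 0.1, 0.2, ..., 0.9
prior = [1.0 / len(thetas)] * len(thetas)     # uniform prior (an assumption)
posterior, evidence = grid_posterior(thetas, prior, heads=7, tails=3)

map_theta = thetas[posterior.index(max(posterior))]
print(f"most probable theta on the grid: {map_theta}")
```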
