## 1 Random Variable

### 1.1 Random Event

A random event is an occurrence whose outcome cannot be predicted with certainty before it happens, even though all possible outcomes are known.

Examples of random events:

1. Tomorrow's weather
2. Stock price movement
3. Rolling a die

### 1.2 Random Variable

A random variable is a variable that assigns a numerical value to each outcome of a random event.

Types of Random Variables:

1. Discrete Random Variable
2. Continuous Random Variable

#### 1 Discrete Random Variable

Example: Rolling a die

Outcomes: {1, 2, 3, 4, 5, 6}

#### 2 Continuous Random Variable

Example: Height of human beings

Outcomes: any real number within a range, for example 172.4 cm (stored as a floating-point number)
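A minimal sketch of the difference between the two types, using NumPy's random generator. The seed, the sample size, and the height model (mean 170 cm, standard deviation 10 cm) are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # Seed chosen arbitrarily for reproducibility.

# Discrete: a die roll can only be one of six values.
die_rolls = rng.integers(low=1, high=7, size=10)  # `high` is exclusive.

# Continuous: a height can be any real number in a range.
# Normal(170, 10) is an assumed model, used for illustration only.
heights_cm = rng.normal(loc=170, scale=10, size=10)

print(die_rolls)
print(heights_cm.round(1))
```

The die rolls come back as integers from a fixed set, while the heights are floats that almost never repeat exactly.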

### 1.3 Distribution

A distribution describes the spread, pattern, and frequency of outcomes within a dataset or for a random variable.

A statistical data distribution is a function that shows the possible values of a variable and how frequently they occur.  
It provides a mathematical description of the behavior of the data, indicating where most data points are concentrated and how they are spread out.  
Distributions can be represented in various forms such as probability density functions for continuous data or probability mass functions for discrete data.

> **Note**:
>
> A distribution indicates where the data points are concentrated.

### 1.4 Expected Value

#### Definition

* The expectation of a random variable X is the weighted average of the values that X takes, with the weights being the probabilities.
* Equivalently, the expected value of X is the sum, over all values X can take, of each value multiplied by its probability: $E[X] = \sum_x x \cdot P(X = x)$.
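As a worked example of the definition, the sketch below computes the expected value of a single fair die roll. The fair-die assumption (each face has probability 1/6) is ours; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6 (assumed).
values = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6

# E[X] = sum of (value * probability) over all values.
expected_value = sum(x * p for x, p in zip(values, probabilities))

print(expected_value)         # 7/2
print(float(expected_value))  # 3.5
```

Note that 3.5 is not a value the die can actually show; the expected value is a long-run average, not a likely outcome.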

## 2 Distribution Functions

### 2.1 Probability Mass Function

#### Definition

* PMF is a distribution function that describes the probability of a **discrete random variable** taking on a **specific value**.
* Plotting each possible value against its probability gives the graphical representation of a PMF.
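A small sketch of a PMF for a fair die, assuming it can be modelled with `scipy.stats.randint` (a discrete uniform distribution on a half-open integer interval).

```python
import numpy as np
from scipy import stats

# A fair die modelled as a discrete uniform variable on {1, ..., 6}.
# stats.randint uses the half-open interval [low, high).
die = stats.randint(low=1, high=7)

faces = np.arange(1, 7)
pmf = die.pmf(faces)

for face, prob in zip(faces, pmf):
    print(face, round(float(prob), 4))

# The probabilities of all possible values sum to 1.
print(round(float(pmf.sum()), 4))
```

Each face gets probability 1/6, and any value outside the six faces gets probability zero.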

### 2.2 Probability Density Function

#### Definition

* PDF is a distribution function that describes the probability density of a **continuous random variable** over its **range**.
* The term "density" describes how tightly the probability is packed around a specific point.

#### Explanation

PDF is used for **continuous random variables**, as opposed to PMF, which is for discrete variables.  
PDF does not provide the probability of a specific value but gives the **probability of the random variable falling within a certain interval**.  
For a continuous random variable, the probability of any single specific value is zero.  
To calculate the probability of a range, compute the area under the curve over that range using integration.

> **Note**:
>
> 1. Use a KDE plot to estimate the probability density at each value in the distribution (a density, not a probability).
> 2. Use the area under the curve to get the probability that a random variable lies within a certain range.

1. Integrate the PDF (area under the curve) to get the CDF.
2. Differentiate the CDF to get the PDF.
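The relationship between PDF and CDF can be checked numerically. The sketch below assumes, purely for illustration, that heights follow a normal distribution with mean 170 cm and standard deviation 10 cm; neither number comes from this notebook.

```python
from scipy import integrate, stats

# Assumed model for illustration: heights ~ Normal(mean=170 cm, sd=10 cm).
height = stats.norm(loc=170, scale=10)

# Probability of an interval = area under the PDF = difference of CDF values.
area, _ = integrate.quad(height.pdf, 160, 180)
cdf_diff = height.cdf(180) - height.cdf(160)
print(round(area, 4), round(float(cdf_diff), 4))

# The slope of the CDF approximates the PDF (central finite difference).
x, h = 175.0, 1e-4
slope = (height.cdf(x + h) - height.cdf(x - h)) / (2 * h)
print(round(float(slope), 5), round(float(height.pdf(x)), 5))
```

Both pairs of printed numbers should agree: integrating the PDF reproduces the CDF difference, and differentiating the CDF reproduces the PDF.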

### 2.3 Cumulative Distribution Function

#### Definition

CDF is a distribution function that gives the probability that a random variable (discrete or continuous) is less than or equal to a specified value.
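For a discrete variable, the CDF is simply the running total of the PMF. A short sketch, again assuming a fair die modelled with `scipy.stats.randint`:

```python
import numpy as np
from scipy import stats

die = stats.randint(low=1, high=7)  # Fair die on {1, ..., 6} (assumed).
faces = np.arange(1, 7)

# For a discrete variable, the CDF is the cumulative sum of the PMF.
cdf_from_pmf = np.cumsum(die.pmf(faces))
cdf_direct = die.cdf(faces)

print(cdf_from_pmf.round(4))
print(cdf_direct.round(4))
```

For instance, the fourth entry is P(X <= 4) = 4/6, the probability of rolling at most a 4.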

## 3 Empirical vs Theoretical Probability

There are two ways to find probability:

1. Empirical approach
2. Theoretical approach

### 3.1 Empirical approach

In the empirical approach, the experiment is repeated N times, say 10,000, and the probabilities are estimated from the observed frequencies.

In [1]:
import numpy as np
from scipy import stats
import pandas as pd

In [2]:
def empirical_approach(bucket, simulation: int = 10_000) -> list[int]:
    """
    Function to perform experiment using empirical approach.
    """
    trial = 4
    red_counts: list[int] = []

    for _ in range(simulation):
        # Define event for experiment and save outcome.
        outcomes = np.random.choice(bucket, size=trial)

        # Count total number of red balls in outcomes.
        red_count: int = int(np.sum(outcomes == "R"))

        # Collect experiment result.
        red_counts.append(red_count)

    return red_counts

#### Probabilities

In [3]:
bucket = ["R", "R", "R", "B", "B"]
output = empirical_approach(bucket)

df = pd.Series(output).value_counts(normalize=True).sort_index().to_frame(name="Probability")
df

| Red balls | Probability |
|---|---|
| 0 | 0.0252 |
| 1 | 0.1508 |
| 2 | 0.3438 |
| 3 | 0.3446 |
| 4 | 0.1356 |


#### Expected Value

In [4]:
np.sum(df.index * df["Probability"]).item()

2.4146

In [5]:
np.mean(output).item()

2.4146

##### Interpretation of Expected value

An expected value of 2.4 means that, if we repeat this experiment a large number of times, we draw 2.4 red balls per experiment on average. Any single experiment still yields a whole number of red balls, most often 2 or 3.
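The "large number of times" part matters: the average only settles near 2.4 as the number of experiments grows. The sketch below illustrates this. As a shortcut it uses `Generator.binomial` to count red balls per experiment instead of the explicit draw loop, and the seed and sample sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # Arbitrary seed for reproducibility.

# Each experiment: 4 draws with replacement, P(red) = 3/5.
# rng.binomial returns the number of successes (red balls) per experiment.
red_counts = rng.binomial(n=4, p=3 / 5, size=100_000)

# Average over the first few experiments versus over many.
for n_experiments in (10, 100, 1_000, 100_000):
    print(n_experiments, red_counts[:n_experiments].mean())
```

With only 10 experiments the average can land noticeably away from 2.4; with 100,000 it should be very close.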

### 3.2 Theoretical approach

In the theoretical approach, the probabilities are derived purely using mathematics, without running the experiment.

In [6]:
import math
from fractions import Fraction as F

In [7]:
def C(n, k):
    """
    Function to compute combinations as fractions.
    """
    return F(math.comb(n, k))


def P(n, d):
    """
    Function to represent probability as fractions.
    """
    return F(n, d)

In [8]:
pr = P(3, 5)
pb = P(2, 5)

print(pr, pb)

3/5 2/5


In [9]:
print(pr**4)

81/625


In [10]:
float(pr**4)

0.1296
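The same reasoning extends from four red balls to every possible count. The sketch below is one way to finish the calculation with exact fractions; the variable names are ours and it is self-contained rather than reusing the helpers above.

```python
import math
from fractions import Fraction

p_red = Fraction(3, 5)
p_blue = Fraction(2, 5)
n_draws = 4

# P(X = k) = C(n, k) * p_red^k * p_blue^(n - k), kept as exact fractions.
pmf = {
    k: math.comb(n_draws, k) * p_red**k * p_blue ** (n_draws - k)
    for k in range(n_draws + 1)
}

for k, prob in pmf.items():
    print(k, prob, float(prob))

# Expected number of red balls: sum of k * P(X = k).
expected_value = sum(k * prob for k, prob in pmf.items())
print(expected_value)  # 12/5
```

The exact values (16/625, 96/625, 216/625, 216/625, 81/625) sit close to the simulated frequencies from the empirical approach, and the expected value 12/5 = 2.4 is close to the simulated mean.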

## 4 Binomial Distribution

### 4.1 What is Binomial Distribution?

#### Definition

Binomial distribution is a **discrete probability distribution** of the number of successes in a sequence of $n$ **independent experiments**.

A Binomial trial will always have two possible outcomes:

* Success / Win
* Failure / Loss

#### Conditions for Binomial Distributions

1. A fixed number of independent trials
2. Only two possible outcomes per trial
3. The probability of success (or, equivalently, of failure) is known and stays the same for every trial.

#### Formula

Given the probability of success $p$, the binomial distribution gives the probability of exactly $k$ successes in $n$ trials.

$
\large
\begin{align}
P(X = k) = {}^nC_k \cdot p^k \cdot (1 - p)^{n - k}
\end{align}
$
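To make the formula concrete, here is a minimal sketch that evaluates it directly and compares the result with `scipy.stats.binom.pmf`. The helper name `binomial_pmf` is hypothetical, and the values $n = 4$, $p = 3/5$ are chosen only as an example.

```python
import math

from scipy import stats


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed directly from the formula."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 4, 3 / 5  # Example values.
for k in range(n + 1):
    manual = binomial_pmf(k, n, p)
    library = float(stats.binom.pmf(k=k, n=n, p=p))
    print(k, round(manual, 4), round(library, 4))
```

The two columns should match for every $k$, which confirms that the library call implements the same formula.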

#### Expected Value

$
\large
\begin{align}
E[X] = n \cdot p
\end{align}
$
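The shortcut $n \cdot p$ agrees with the general definition of expected value (sum of each value times its probability). A quick numerical check, with $n = 4$ and $p = 3/5$ as example values:

```python
import numpy as np
from scipy import stats

n, p = 4, 3 / 5  # Example values.
k = np.arange(n + 1)

# E[X] from the definition: sum of k * P(X = k).
expected_from_pmf = float(np.sum(k * stats.binom.pmf(k, n, p)))

print(round(expected_from_pmf, 4), n * p)
```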

### 4.2 Examples

#### Quiz #1

A bag has 3 red balls and 2 blue balls.  
A ball is chosen at random 4 times, with replacement.  
What is the probability of getting:

1. Zero red balls
2. One red ball
3. Two red balls
4. Three red balls
5. Four red balls

##### Solution

In [11]:
p = 3 / 5  # Probability of success.
n = 4  # Number of trials.
# k = 0, 1, 2, 3, 4  # Random variable x taking k values.

In [12]:
p_k0 = stats.binom.pmf(p=p, k=0, n=n).round(4).item()
p_k1 = stats.binom.pmf(p=p, k=1, n=n).round(4).item()
p_k2 = stats.binom.pmf(p=p, k=2, n=n).round(4).item()
p_k3 = stats.binom.pmf(p=p, k=3, n=n).round(4).item()
p_k4 = stats.binom.pmf(p=p, k=4, n=n).round(4).item()

p_k0, p_k1, p_k2, p_k3, p_k4

(0.0256, 0.1536, 0.3456, 0.3456, 0.1296)

In [13]:
stats.binom.expect(args=(n, p)).round(2).item()

2.4

#### Quiz #2

A factory produces LED bulbs, and each bulb has a 5% chance of being defective.  
A quality inspector randomly selects 20 bulbs from the production line.  

What is the probability that exactly 2 bulbs are defective?

##### Solution

In [22]:
p = 0.05
n = 20
x = 2

In [25]:
p_x2 = stats.binom.pmf(p=p, n=n, k=x)
p_x2.round(4).item()

0.1887

#### Quiz #3

Suppose we attempt 10 quiz questions with 4 options each.  
Only 1 option is correct, and we are guessing the answers.  
What is the probability that we will get at least 4 answers correct?

##### Solution

In [14]:
p = 1 / 4
n = 10
# k = 4, 5, 6, 7, 8, 9, 10
# Find: P(X >= 4)
# P(X >= 4) = 1 - P(X <= 3)

###### Using PMF

In [15]:
p_k0 = stats.binom.pmf(p=p, k=0, n=n)
p_k1 = stats.binom.pmf(p=p, k=1, n=n)
p_k2 = stats.binom.pmf(p=p, k=2, n=n)
p_k3 = stats.binom.pmf(p=p, k=3, n=n)

p_x_ge_4 = 1 - (p_k0 + p_k1 + p_k2 + p_k3)

p_x_ge_4.round(4).item()

0.2241

or

In [16]:
p_k04 = stats.binom.pmf(p=p, k=4, n=n)
p_k05 = stats.binom.pmf(p=p, k=5, n=n)
p_k06 = stats.binom.pmf(p=p, k=6, n=n)
p_k07 = stats.binom.pmf(p=p, k=7, n=n)
p_k08 = stats.binom.pmf(p=p, k=8, n=n)
p_k09 = stats.binom.pmf(p=p, k=9, n=n)
p_k10 = stats.binom.pmf(p=p, k=10, n=n)

p_tot = p_k04 + p_k05 + p_k06 + p_k07 + p_k08 + p_k09 + p_k10
p_tot.round(4).item()

0.2241

###### Using CDF

In [17]:
# Find: P(X >= 4)
# 1 - P(X <= 3)

In [18]:
p_x_ge_4 = 1 - stats.binom.cdf(k=3, p=p, n=n)
p_x_ge_4.round(4).item()

0.2241
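SciPy also exposes the survival function `sf`, defined as `1 - cdf`, so the same tail probability can be read off directly. A minimal sketch with the quiz parameters:

```python
from scipy import stats

n, p = 10, 1 / 4

# P(X >= 4) = P(X > 3) = sf(3), which equals 1 - cdf(3).
via_cdf = 1 - stats.binom.cdf(k=3, n=n, p=p)
via_sf = stats.binom.sf(k=3, n=n, p=p)

print(round(float(via_cdf), 4), round(float(via_sf), 4))
```

Note the argument is 3, not 4: both `cdf` and `sf` split the outcomes at "less than or equal to k" versus "greater than k".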

#### Quiz #4

Suppose we attempt 10 quiz questions with 4 options each. Only 1 option is correct, and we are guessing the answers.  
What is the probability that we will get exactly 2 answers correct?

##### Solution

In [19]:
p = 1 / 4
n = 10
x = 2

In [20]:
p_k2 = stats.binom.pmf(k=x, p=p, n=n)
p_k2.round(4).item()

0.2816

### 4.3 Variance of Binomial Distribution

$
\large
\begin{align}
\mathrm{Var}(X) = \sigma^2 = n \cdot p \cdot (1 - p)
\end{align}
$
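A short check of the variance formula against the definition $\mathrm{Var}(X) = E[(X - E[X])^2]$ and against `scipy.stats.binom.var`. The values $n = 4$ and $p = 3/5$ are example values.

```python
import numpy as np
from scipy import stats

n, p = 4, 3 / 5  # Example values.
k = np.arange(n + 1)
pmf = stats.binom.pmf(k, n, p)

# Var(X) = E[(X - E[X])^2], computed from the PMF.
mean = np.sum(k * pmf)
variance_from_pmf = float(np.sum((k - mean) ** 2 * pmf))

print(round(variance_from_pmf, 4), round(n * p * (1 - p), 4))
print(round(float(stats.binom.var(n, p)), 4))
```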

### 4.4 Bernoulli Trial

#### Definition

A Bernoulli trial is the special case of the binomial distribution in which the number of trials is fixed at one ($n = 1$).

#### Formula

Given the probability of success $p$, the Bernoulli distribution gives the probability of $k$ successes ($k = 0$ or $k = 1$) in a single trial.

$
\large
\begin{align}
P(X = k) = p^k \cdot (1 - p)^{1 - k}
\end{align}
$

#### Expected Value

$
\large
\begin{align}
E[X] = p
\end{align}
$

> **Note**:
>
> A binomial random variable is the sum of $n$ independent Bernoulli trials that share the same $p$.
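The link between the two distributions can be illustrated by simulation: add up several Bernoulli trials and compare the resulting counts with the binomial PMF. This is a sketch under assumed values ($n = 4$, $p = 3/5$, 50,000 repetitions, arbitrary seed), not a proof.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)  # Arbitrary seed for reproducibility.
n, p = 4, 3 / 5  # Example values.

# Bernoulli PMF: P(X = 1) = p and P(X = 0) = 1 - p.
print(stats.bernoulli.pmf(k=1, p=p), stats.bernoulli.pmf(k=0, p=p))

# 50,000 experiments, each the sum of n independent Bernoulli(p) trials.
trials = rng.random(size=(50_000, n)) < p  # True marks a success.
successes = trials.sum(axis=1)

# Observed frequency of each success count next to the binomial PMF.
observed = np.bincount(successes, minlength=n + 1) / len(successes)
expected = stats.binom.pmf(np.arange(n + 1), n, p)
print(observed.round(3))
print(expected.round(3))
```

The observed frequencies should track the binomial probabilities closely, up to sampling noise.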

Example: you toss 2 dice.

1. If both dice show 6, you get Rs 2.
2. Otherwise, if exactly one die shows 6, you get Rs 1.

Otherwise, you do not get anything.

The probability of the Rs 2 payout (both dice show 6) is:

In [21]:
# Two independent dice, each showing 6 with probability 1/6; k=2 means both show 6.
k2 = stats.binom.pmf(n=2, p=1 / 6, k=2)
k2.round(4).item()

0.0278
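The remaining cases of the dice game follow the same pattern. The sketch below reads "one die is 6" as "exactly one die shows 6" (our interpretation) and computes every payout probability plus the expected payout with exact fractions.

```python
from fractions import Fraction

from scipy import stats

# Number of sixes in 2 dice ~ Binomial(n=2, p=1/6).
# Under the assumed reading, the payout in Rs equals that number.
p_six = Fraction(1, 6)
p_two_sixes = p_six**2               # Pays Rs 2.
p_one_six = 2 * p_six * (1 - p_six)  # Pays Rs 1 (exactly one six).
p_no_six = (1 - p_six) ** 2          # Pays nothing.

expected_payout = 2 * p_two_sixes + 1 * p_one_six
print(p_two_sixes, p_one_six, p_no_six, expected_payout)

# Cross-check against scipy's binomial PMF.
print(stats.binom.pmf(k=[0, 1, 2], n=2, p=1 / 6).round(4))
```

Because the payout equals the number of sixes, the expected payout is just $n \cdot p = 2 \cdot \tfrac{1}{6} = \tfrac{1}{3}$ rupees per game.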