# Statistics & Probability



---

### **Basic Descriptive Statistics**
1. Write a function to compute the mean of a dataset.  
2. Compute the median of a dataset and verify its robustness to outliers.  
3. Calculate the mode of a dataset.  
4. Implement a function to compute the variance and standard deviation of a dataset.  
5. Write a function to calculate the interquartile range (IQR) of a dataset.  
6. Detect outliers in a dataset using the IQR method.  
7. Compute the correlation coefficient between two variables.  
8. Visualize a dataset using a histogram and calculate skewness.  
9. Write a function to compute the covariance matrix of a multivariate dataset.  
10. Standardize a dataset to have a mean of 0 and a variance of 1.

---

### **Probability Basics**
11. Simulate rolling a fair six-sided die and compute probabilities of outcomes.  
12. Implement a function to calculate the probability of an event using relative frequency.  
13. Simulate flipping a biased coin and estimate its probability of heads.  
14. Write a function to calculate the complement of an event.  
15. Use Python to compute conditional probability $ P(A|B) $ given sample data.  
16. Verify the Law of Total Probability using simulated data.  
17. Simulate and calculate the probability of drawing specific cards from a deck.  
18. Compute joint probabilities for two events using Python.  
19. Verify Bayes’ Theorem with a real-world example.  
20. Visualize a probability distribution (e.g., uniform, normal).

---

### **Discrete and Continuous Random Variables**
21. Generate a binomial random variable and compute its mean and variance.  
22. Simulate a Poisson process and calculate probabilities of specific events.  
23. Implement and visualize the probability mass function (PMF) of a discrete random variable.  
24. Generate a normal random variable and compute probabilities for specific intervals.  
25. Compute the cumulative distribution function (CDF) of a normal distribution.  
26. Use the Central Limit Theorem to approximate the sum of random variables.  
27. Implement and visualize the probability density function (PDF) of a normal distribution.  
28. Simulate and compute probabilities for an exponential distribution.  
29. Fit a normal distribution to a given dataset and estimate its parameters.  
30. Compare the behavior of discrete vs. continuous random variables.

---

### **Sampling and Estimation**
31. Implement simple random sampling on a dataset.  
32. Write a function to compute the sample mean and sample variance.  
33. Simulate and analyze sampling distributions of the mean.  
34. Perform stratified sampling on a dataset.  
35. Estimate population parameters using maximum likelihood estimation (MLE).  
36. Simulate bootstrap sampling to compute confidence intervals.  
37. Compute the bias and variance of an estimator using simulation.  
38. Use Python to verify the Law of Large Numbers.  
39. Simulate and compute the impact of sample size on estimation accuracy.  
40. Write a function to calculate standard error for a sample mean.

---

### **Hypothesis Testing**
41. Perform a one-sample $ t $-test to check if a sample mean differs from a known value.  
42. Perform a two-sample $ t $-test to compare the means of two datasets.  
43. Implement and interpret a chi-square test for independence.  
44. Conduct an ANOVA test to compare means of multiple groups.  
45. Perform a permutation test for a given hypothesis.  
46. Implement and interpret a Mann-Whitney U test for non-parametric data.  
47. Simulate Type I and Type II errors for hypothesis tests.  
48. Write a function to compute p-values from test statistics.  
49. Visualize the rejection region of a hypothesis test.  
50. Perform a hypothesis test to determine if a dataset follows a normal distribution.

---


In [1]:
import statistics as stats
import scipy.stats as sciStats
import numpy as np
import random
from collections import Counter

## -- Basic Descriptive --

In [None]:
# Compute the mean of dataset

data = np.array([12,4,5,6,7,8,2323])
print(np.mean(data))
print(stats.mean(data))

In [None]:
# Calculate the median of the dataset and verify its robustness to outliers

# The median is robust to outliers because it depends only on the middle value(s) of the sorted data, not on the magnitude of every data point. An extreme value such as 100 below therefore has little to no effect on it.

data = [1, 2, 3, 4, 100]

print(np.median(data))
print(stats.median(data))
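
The cell above prints only the median. A minimal sketch of the robustness check compares how far the mean and the median move when the outlier 100 replaces an ordinary value. The `clean` list is an assumption made for illustration; it is not part of the original exercise data.

```python
import statistics

# Hypothetical baseline: the same data with 5 in place of the outlier 100
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

for label, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(label, "| mean:", statistics.mean(values), "| median:", statistics.median(values))

# How much each statistic moves when the outlier is introduced
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print("Shift in mean:", mean_shift)
print("Shift in median:", median_shift)
```

The mean moves from 3 to 22 while the median stays at 3, which is the behaviour the comment in the cell describes.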

In [None]:
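# Mode of the dataset

data = np.array([12,4,54,5,6,67,8,341])

# Every value here occurs exactly once, so there is no unique mode.
# On Python 3.8+ statistics.mode returns the first value encountered (12);
# older versions raise StatisticsError for such data.
print(stats.mode(data))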

In [None]:
# compute the variance and standard deviation

data = [2, 4, 6, 8, 10]

# Population variance and standard deviation
populationVariance = stats.pvariance(data)
populationStandardDeviation = stats.pstdev(data)

# Sample variance and standard deviation
sampleVariance = stats.variance(data)
sampleStandardDeviation = stats.stdev(data)

print("Population Variance:", populationVariance)
print("Population Standard Deviation:", populationStandardDeviation)
print("Sample Variance:", sampleVariance)
print("Sample Standard Deviation:", sampleStandardDeviation)

The **Interquartile Range (IQR)** is a measure of statistical dispersion that represents the range between the first quartile (Q1) and the third quartile (Q3). It is used to describe the middle 50% of the data and is robust to outliers.

---

### Steps to Calculate the Inter-quartile Range (IQR):
1. **Sort the dataset** in ascending order.
2. **Find the median** of the dataset. This is the **second quartile (Q2)**.
3. **Find the first quartile (Q1)**:
   - This is the median of the lower half of the data (values below Q2).
4. **Find the third quartile (Q3)**:
   - This is the median of the upper half of the data (values above Q2).
5. **Calculate the IQR**:
   $
   \text{IQR} = Q3 - Q1
   $

---

### Example:
Dataset: $\{3, 7, 8, 5, 12, 14, 21, 13, 18\}$

1. **Sort the data**:
   $
   \{3, 5, 7, 8, 12, 13, 14, 18, 21\}
   $

2. **Find the median (Q2)**:
   - There are 9 values, so the median is the 5th value:  
   $
   Q2 = 12
   $

3. **Find the first quartile (Q1)**:
   - Lower half of the data: $\{3, 5, 7, 8\}$  
   - Median of the lower half:  
     $
     Q1 = \frac{5 + 7}{2} = 6
     $

4. **Find the third quartile (Q3)**:
   - Upper half of the data: $\{13, 14, 18, 21\}$  
   - Median of the upper half:  
     $
     Q3 = \frac{14 + 18}{2} = 16
     $

5. **Calculate the IQR**:
   $
   \text{IQR} = Q3 - Q1 = 16 - 6 = 10
   $

So, the inter-quartile range is **10**.
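
The steps above can be sketched as a small function. This is one possible implementation of the median-of-halves convention used in the worked example, not the only definition of quartiles; the name `iqr_median_of_halves` is hypothetical.

```python
import statistics


def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method.

    For an odd number of values the middle value (Q2) is left out of
    both halves, matching the worked example.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median
    upper = ordered[-half:]   # values above the median
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1


q1, q3, iqr = iqr_median_of_halves([3, 7, 8, 5, 12, 14, 21, 13, 18])
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Quartile conventions differ between tools. For this dataset, `np.percentile` with its default linear interpolation returns Q1 = 7 and Q3 = 14, so its IQR is 7 rather than 10.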



In [None]:
# Calculate the inter-quartile range of data set

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

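# Calculate Q1 (25th percentile) and Q3 (75th percentile).
# np.percentile interpolates linearly between sorted values by default, so for
# this data it returns Q1 = 7 and Q3 = 14 (IQR = 7) rather than the
# median-of-halves values Q1 = 6 and Q3 = 16 (IQR = 10) from the worked example.
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)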

# Calculate IQR
IQR = Q3 - Q1

print("Q1:", Q1)
print("Q3:", Q3)
print("IQR:", IQR)

In [None]:
# detect Outliers using IQR
# The IQR measures the spread of the middle 50% of the data, and outliers are defined as data points that fall significantly below or above the "fences" calculated using the IQR.



data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 50]

# Calculate Q1, Q3, and IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Define fences
lowerFence = Q1 - 1.5 * IQR
upperFence = Q3 + 1.5 * IQR

# Detect outliers
outliers = [x for x in data if x < lowerFence or x > upperFence]

print("Q1:", Q1)
print("Q3:", Q3)
print("IQR:", IQR)
print("Lower Fence:", lowerFence)
print("Upper Fence:", upperFence)
print("Outliers:", outliers)

# Q1 and Q3 are barely affected by extreme values, so the fences are not distorted by the outliers they are meant to detect.

The **correlation coefficient** measures the strength and direction of the linear relationship between two variables. The most common measure is the **Pearson correlation coefficient**, which ranges from **-1 to 1**:
- **1**: Perfect positive linear relationship,
- **-1**: Perfect negative linear relationship,
- **0**: No linear relationship.

---

### Formula for Pearson Correlation Coefficient:
The Pearson correlation coefficient ($ r $) between two variables $ X $ and $ Y $ is calculated as:

$
r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2} \sum{(y_i - \bar{y})^2}}}
$

Where:
- $ x_i $ and $ y_i $ are individual data points,
- $ \bar{x} $ and $ \bar{y} $ are the means of $ X $ and $ Y $, respectively.

---

### Steps to Compute the Correlation Coefficient:
1. **Calculate the mean** of $ X $ ($ \bar{x} $) and the mean of $ Y $ ($ \bar{y} $).
2. **Compute the deviations** from the mean for each data point:
   - $ (x_i - \bar{x}) $ and $ (y_i - \bar{y}) $.
3. **Multiply the deviations** for each pair of data points:
   - $ (x_i - \bar{x})(y_i - \bar{y}) $.
4. **Sum the products** of deviations:
   - $ \sum{(x_i - \bar{x})(y_i - \bar{y})} $.
5. **Square the deviations** for $ X $ and $ Y $, then sum them:
   - $ \sum{(x_i - \bar{x})^2} $ and $ \sum{(y_i - \bar{y})^2} $.
6. **Divide the sum of products** by the square root of the product of the sums of squared deviations:
   - $ r = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2} \sum{(y_i - \bar{y})^2}}} $.

---

### Example:
Let’s calculate the correlation coefficient between $ X $ and $ Y $:

| $ X $ | $ Y $ |
|--------|--------|
| 1      | 2      |
| 2      | 4      |
| 3      | 5      |
| 4      | 4      |
| 5      | 5      |

1. **Calculate the means**:
   - $ \bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = 3 $,
   - $ \bar{y} = \frac{2 + 4 + 5 + 4 + 5}{5} = 4 $.

2. **Compute deviations and their products**:

   | $ X $ | $ Y $ | $ x_i - \bar{x} $ | $ y_i - \bar{y} $ | $ (x_i - \bar{x})(y_i - \bar{y}) $ | $ (x_i - \bar{x})^2 $ | $ (y_i - \bar{y})^2 $ |
   |--------|--------|---------------------|---------------------|--------------------------------------|-------------------------|-------------------------|
   | 1      | 2      | -2                  | -2                  | 4                                    | 4                       | 4                       |
   | 2      | 4      | -1                  | 0                   | 0                                    | 1                       | 0                       |
   | 3      | 5      | 0                   | 1                   | 0                                    | 0                       | 1                       |
   | 4      | 4      | 1                   | 0                   | 0                                    | 1                       | 0                       |
   | 5      | 5      | 2                   | 1                   | 2                                    | 4                       | 1                       |

3. **Sum the columns**:
   - $ \sum{(x_i - \bar{x})(y_i - \bar{y})} = 4 + 0 + 0 + 0 + 2 = 6 $,
   - $ \sum{(x_i - \bar{x})^2} = 4 + 1 + 0 + 1 + 4 = 10 $,
   - $ \sum{(y_i - \bar{y})^2} = 4 + 0 + 1 + 0 + 1 = 6 $.

4. **Calculate the correlation coefficient**:
   $
   r = \frac{6}{\sqrt{10 \times 6}} = \frac{6}{\sqrt{60}} = \frac{6}{7.746} \approx 0.775
   $

So, the correlation coefficient is approximately **0.775**, indicating a **strong positive linear relationship**.
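
The six steps can be sketched directly from the formula, without any library helper. The function name `pearson_r` is hypothetical, and the sketch assumes both inputs have the same length and neither is constant.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from the definition."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations from the mean for each data point
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of products and sums of squared deviations
    sum_products = sum(a * b for a, b in zip(dx, dy))
    sum_sq_x = sum(a * a for a in dx)
    sum_sq_y = sum(b * b for b in dy)
    return sum_products / math.sqrt(sum_sq_x * sum_sq_y)


r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print("r =", round(r, 3))
```

The intermediate sums are 6, 10, and 6, the same values as in the table, so the result agrees with the hand calculation.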

---


### Key Takeaways:
- The correlation coefficient ($ r $) measures the **linear relationship** between two variables.
- It ranges from **-1 to 1**, where:
  - $ r = 1 $: Perfect positive correlation,
  - $ r = -1 $: Perfect negative correlation,
  - $ r = 0 $: No linear correlation (a nonlinear relationship may still exist).
- Use Python libraries like `numpy`, `pandas`, or `scipy` for quick calculations.


In [None]:
# Correlation coefficient between two variables 
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

r = np.corrcoef(X, Y)[0, 1]
print("Correlation Coefficient (r):", r)

The **covariance matrix** is a square matrix that summarizes the variances and covariances of a **multivariate dataset**. It is a key concept in statistics and machine learning, especially in dimensionality reduction techniques like Principal Component Analysis (PCA).

---

### What is a Covariance Matrix?
For a dataset with $ p $ variables (features), the covariance matrix is a $ p \times p $ matrix where:
- The **diagonal elements** represent the **variances** of each variable.
- The **off-diagonal elements** represent the **covariances** between pairs of variables.

---

### Formula for Covariance Matrix:
Given a dataset with $ p $ variables and $ n $ observations, the covariance matrix $ \Sigma $ is calculated as:

$
\Sigma = \frac{1}{n-1} \cdot (X - \bar{X})^T (X - \bar{X})
$

Where:
- $ X $ is the $ n \times p $ data matrix (each row is an observation, each column is a variable),
- $ \bar{X} $ is the $ 1 \times p $ vector of means for each variable,
- $ (X - \bar{X}) $ is the mean-centered data matrix,
- $ ^T $ denotes the transpose of a matrix.

---

### Steps to Compute the Covariance Matrix:
1. **Center the data** by subtracting the mean of each variable.
2. **Compute the product** of the transpose of the centered data matrix and the centered data matrix, $ (X - \bar{X})^T (X - \bar{X}) $.
3. **Divide by $ n-1 $** (for sample covariance) or $ n $ (for population covariance).

---

### Example:
Consider a dataset with 3 variables ($ X_1, X_2, X_3 $) and 4 observations:

| $ X_1 $ | $ X_2 $ | $ X_3 $ |
|----------|----------|----------|
| 1        | 2        | 3        |
| 4        | 5        | 6        |
| 7        | 8        | 9        |
| 10       | 11       | 12       |

1. **Compute the mean of each variable**:
   - $ \bar{X_1} = \frac{1 + 4 + 7 + 10}{4} = 5.5 $,
   - $ \bar{X_2} = \frac{2 + 5 + 8 + 11}{4} = 6.5 $,
   - $ \bar{X_3} = \frac{3 + 6 + 9 + 12}{4} = 7.5 $.

2. **Center the data** by subtracting the means:

   | $ X_1 - \bar{X_1} $ | $ X_2 - \bar{X_2} $ | $ X_3 - \bar{X_3} $ |
   |-----------------------|-----------------------|-----------------------|
   | -4.5                  | -4.5                  | -4.5                  |
   | -1.5                  | -1.5                  | -1.5                  |
   | 1.5                   | 1.5                   | 1.5                   |
   | 4.5                   | 4.5                   | 4.5                   |

3. **Compute the product of the transposed centered data matrix and the centered data matrix**:
   - Let $ A = X - \bar{X} $. Then:
     $
     A^T A = \begin{bmatrix}
     -4.5 & -1.5 & 1.5 & 4.5 \\
     -4.5 & -1.5 & 1.5 & 4.5 \\
     -4.5 & -1.5 & 1.5 & 4.5
     \end{bmatrix}
     \begin{bmatrix}
     -4.5 & -4.5 & -4.5 \\
     -1.5 & -1.5 & -1.5 \\
     1.5 & 1.5 & 1.5 \\
     4.5 & 4.5 & 4.5
     \end{bmatrix}
     $
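   - The result is a $ 3 \times 3 $ matrix in which every entry equals $ (-4.5)^2 + (-1.5)^2 + 1.5^2 + 4.5^2 = 45 $, because the three centered columns are identical. After dividing by $ n - 1 = 3 $ in the next step, every entry of $ \Sigma $ is 15.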

4. **Divide by $ n-1 $** (for sample covariance):
   $
   \Sigma = \frac{1}{4-1} \cdot A^T A
   $
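
The same computation can be sketched in plain Python to make each step explicit. The helper name `covariance_matrix` is hypothetical, and the sketch assumes the data is a list of equal-length rows with at least two observations.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of a dataset given as a list of observations."""
    n = len(rows)       # number of observations
    p = len(rows[0])    # number of variables
    # Step 1: column means, then center every observation
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Steps 2 and 3: entry (i, j) of A^T A, divided by n - 1
    return [
        [sum(obs[i] * obs[j] for obs in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


sigma = covariance_matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
for row in sigma:
    print(row)
```

Every entry comes out as 15.0 because the three columns differ only by a constant shift, so all variances and covariances coincide.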

---

### Key Takeaways:
- The **covariance matrix** summarizes the relationships between variables in a multivariate dataset.
- Diagonal elements represent **variances**, and off-diagonal elements represent **covariances**.
- Use Python libraries like `numpy` or `pandas` for efficient computation.


In [None]:
# Compute the covariance matrix of a multivariate dataset

# Define the dataset
X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

# Compute the covariance matrix
covarianceMatrix = np.cov(X, rowvar=False)  # rowvar=False means columns are variables
print("Covariance Matrix:")
print(covarianceMatrix)

Standardizing a dataset to have a **mean of 0** and a **variance of 1** is a common preprocessing step in data analysis and machine learning. This process is also known as **z-score normalization**. Standardization ensures that all features are on the same scale, which is particularly important for algorithms that are sensitive to the magnitude of features (e.g., PCA, k-means, SVM).

---

### Steps to Standardize a Dataset:
1. **Compute the mean** of each feature (column) in the dataset.
2. **Compute the standard deviation** of each feature.
3. **Standardize each value** using the formula:
   $
   z = \frac{x - \mu}{\sigma}
   $
   Where:
   - $ x $ is the original value,
   - $ \mu $ is the mean of the feature,
   - $ \sigma $ is the standard deviation of the feature.

---

### Example:
Consider the following dataset with 2 features ($ X_1 $ and $ X_2 $):

| $ X_1 $ | $ X_2 $ |
|----------|----------|
| 1        | 2        |
| 2        | 3        |
| 3        | 4        |
| 4        | 5        |

1. **Compute the mean** of each feature:
   - $ \mu_{X_1} = \frac{1 + 2 + 3 + 4}{4} = 2.5 $,
   - $ \mu_{X_2} = \frac{2 + 3 + 4 + 5}{4} = 3.5 $.

2. **Compute the standard deviation** of each feature:
   - For $ X_1 $:
     $
     \sigma_{X_1} = \sqrt{\frac{(1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2}{4}} = \sqrt{\frac{2.25 + 0.25 + 0.25 + 2.25}{4}} = \sqrt{1.25} \approx 1.118
     $
   - For $ X_2 $:
     $
     \sigma_{X_2} = \sqrt{\frac{(2-3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2}{4}} = \sqrt{\frac{2.25 + 0.25 + 0.25 + 2.25}{4}} = \sqrt{1.25} \approx 1.118
     $

3. **Standardize each value**:
   - For $ X_1 $:
     $
     z_{X_1} = \frac{x - \mu_{X_1}}{\sigma_{X_1}}
     $
   - For $ X_2 $:
     $
     z_{X_2} = \frac{x - \mu_{X_2}}{\sigma_{X_2}}
     $

   Applying this to each value:

   | $ X_1 $ | $ X_2 $ | $ z_{X_1} $ | $ z_{X_2} $ |
   |----------|----------|---------------|---------------|
   | 1        | 2        | $\frac{1-2.5}{1.118} \approx -1.34$ | $\frac{2-3.5}{1.118} \approx -1.34$ |
   | 2        | 3        | $\frac{2-2.5}{1.118} \approx -0.45$ | $\frac{3-3.5}{1.118} \approx -0.45$ |
   | 3        | 4        | $\frac{3-2.5}{1.118} \approx 0.45$ | $\frac{4-3.5}{1.118} \approx 0.45$ |
   | 4        | 5        | $\frac{4-2.5}{1.118} \approx 1.34$ | $\frac{5-3.5}{1.118} \approx 1.34$ |

   The standardized dataset is:

   | $ z_{X_1} $ | $ z_{X_2} $ |
   |---------------|---------------|
   | -1.34         | -1.34         |
   | -0.45         | -0.45         |
   | 0.45          | 0.45          |
   | 1.34          | 1.34          |
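
The worked example can be checked with a short sketch. It assumes the population standard deviation (dividing by $ n $), as the example does; the function name `standardize` is hypothetical.

```python
import statistics


def standardize(column):
    """Return the z-scores of a column using the population standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - mu) / sigma for x in column]


z_x1 = standardize([1, 2, 3, 4])
print("z-scores:", [round(z, 2) for z in z_x1])
print("mean:", statistics.mean(z_x1))
print("population std:", statistics.pstdev(z_x1))
```

Using the sample standard deviation (dividing by $ n - 1 $) instead would give slightly smaller z-scores in absolute value, so it is worth stating which convention a given tool applies.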

---


### Key Takeaways:
- Standardization transforms the data to have a **mean of 0** and a **standard deviation of 1**.
- It is essential for algorithms that are sensitive to feature scales.
- Use Python libraries like `scikit-learn` or `numpy` for efficient standardization.


In [None]:
# standardize a dataset to have a mean of 0 and variance of 1

# Define the dataset
X = np.array([
    [1, 2],
    [2, 3],
    [3, 4],
    [4, 5]
])

# Compute mean and standard deviation
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)

# Standardize the data
xStandardized = (X - mean) / std

print("Standardized Dataset:")
print(xStandardized)

## -- Probability Basics --

In [None]:
# Simulate rolling a fair six-sided die and compute the probabilities of the outcomes


def rollDice() -> int:
    return random.randint(1, 6)


def simulateRolls(rolls: int = 10) -> list:
    return [rollDice() for _ in range(rolls)]


total: list = simulateRolls()
print("TotalRolls:", total)


def largeRollSimulation(rolls: int = 1000) -> None:
    rolling: list = simulateRolls(rolls=rolls)
    outcomeCounts: Counter = Counter(rolling)
    # Relative frequency of each face; each should approach 1/6 as rolls grows
    probabilities: dict = {face: count / rolls for face, count in sorted(outcomeCounts.items())}
    print("Outcome Counts:", outcomeCounts)
    print("Probabilities:", probabilities)


largeRollSimulation()




### Formula for Relative Frequency:
The probability of an event $ A $ using relative frequency is given by:

$
P(A) = \frac{\text{Number of times event } A \text{ occurs}}{\text{Total number of trials}}
$

---

### Steps to Calculate Relative Frequency:
1. **Perform the experiment** or simulation multiple times (e.g., roll a die, flip a coin).
2. **Count the number of times** the event of interest occurs.
3. **Divide by the total number of trials** to get the relative frequency.


---

### Key Takeaways:
- Relative frequency is an empirical way to estimate probabilities based on observed data.
- It is particularly useful when theoretical probabilities are unknown or difficult to compute.
- Use Python to simulate experiments and calculate relative frequencies efficiently.
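
The three steps can be wrapped in a reusable function. This is a sketch under stated assumptions: the names `relative_frequency`, `experiment`, and `event` are hypothetical, and a seeded `random.Random` instance is used only to make the run reproducible.

```python
import random


def relative_frequency(event, experiment, trials=10_000, seed=None):
    """Estimate P(event) as (number of trials where event holds) / trials.

    experiment: function taking a random.Random and returning one outcome
    event: function taking an outcome and returning True or False
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if event(experiment(rng)))
    return hits / trials


def roll_die(rng):
    return rng.randint(1, 6)


estimate = relative_frequency(lambda outcome: outcome == 4, roll_die, trials=100_000, seed=0)
print("Estimated P(roll = 4):", estimate)
print("Theoretical value:", 1 / 6)
```

Passing the event as a function keeps the estimator independent of the experiment, so the same helper works for coin flips or card draws.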


In [None]:

# Simulate rolling a die
def rollDie():
    return random.randint(1, 6)

# Simulate multiple die rolls
def simulateRolls(num_rolls):
    return [rollDie() for _ in range(num_rolls)]

# Number of trials
numRolls = 1000
rolls = simulateRolls(numRolls)

# Count the number of times a 4 appears
eventCount = rolls.count(4)

# Calculate relative frequency
relativeFrequency = eventCount / numRolls

print("Number of times 4 appears:", eventCount)
print("Relative Frequency of rolling a 4:", relativeFrequency)

In [None]:
# Simulate flipping a biased coin and estimate its probability of heads

# Define the bias (probability of heads)
p = 0.7  # Example: 70% chance of heads

# Simulate a single biased coin flip
def biasedCoinFlip(p):
    return "Heads" if random.random() < p else "Tails"

# Simulate multiple biased coin flips
def simulateFlips(nFlips, p):
    return [biasedCoinFlip(p) for _ in range(nFlips)]

def probabilityHead(nFlips: int = 1000, pHeads: float = p) -> float:
    # Simulate the flips
    flips: list[str] = simulateFlips(nFlips, pHeads)

    # Count the number of heads
    headsCount: int = flips.count("Heads")

    # The relative frequency of heads is the estimate of the bias
    relativeFrequencyHeads: float = headsCount / nFlips

    print("Number of heads:", headsCount)
    print("Relative Frequency of heads:", relativeFrequencyHeads)
    return relativeFrequencyHeads

probabilityHead()

The **complement of an event** is a fundamental concept in probability. The complement of an event $ A $, denoted as $ A^c $ or $ \overline{A} $, represents all outcomes that are **not** in $ A $. The probability of the complement of an event is given by:

$
P(A^c) = 1 - P(A)
$

---

### Key Properties of Complements:
1. **Mutually Exclusive**:
   - An event $ A $ and its complement $ A^c $ cannot occur simultaneously.
   - $ A \cap A^c = \emptyset $ (they are disjoint).

2. **Exhaustive**:
   - Either $ A $ or $ A^c $ must occur.
   - $ A \cup A^c = S $, where $ S $ is the sample space.

3. **Probability**:
   - The sum of the probabilities of an event and its complement is always 1:
     $
     P(A) + P(A^c) = 1
     $

---

### Steps to Calculate the Complement of an Event:
1. **Identify the event $ A $** and its probability $ P(A) $.
2. **Use the complement formula**:
   $
   P(A^c) = 1 - P(A)
   $

---

### Example 1: Simple Event
Suppose you roll a fair six-sided die. Let $ A $ be the event of rolling a **6**. The probability of $ A $ is:
$
P(A) = \frac{1}{6}
$

The complement $ A^c $ is the event of **not rolling a 6**. The probability of $ A^c $ is:
$
P(A^c) = 1 - P(A) = 1 - \frac{1}{6} = \frac{5}{6}
$

---

### Example 2: Compound Event
Suppose you draw a card from a standard deck of 52 cards. Let $ A $ be the event of drawing a **heart**. The probability of $ A $ is:
$
P(A) = \frac{13}{52} = \frac{1}{4}
$

The complement $ A^c $ is the event of **not drawing a heart**. The probability of $ A^c $ is:
$
P(A^c) = 1 - P(A) = 1 - \frac{1}{4} = \frac{3}{4}
$

---

### Example 3: Real-World Scenario
Suppose the probability of rain tomorrow is $ 0.3 $. Let $ A $ be the event of **rain tomorrow**. Then:
$
P(A) = 0.3
$

The complement $ A^c $ is the event of **no rain tomorrow**. The probability of $ A^c $ is:
$
P(A^c) = 1 - P(A) = 1 - 0.3 = 0.7
$

---

### Key Takeaways:
- The complement of an event $ A $ represents all outcomes **not** in $ A $.
- The probability of the complement is $ P(A^c) = 1 - P(A) $.
- Complements are useful for simplifying probability calculations, especially when it’s easier to calculate $ P(A^c) $ than $ P(A) $.
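
A standard case where the complement is easier to compute is "at least one 6 in four rolls of a fair die": the event "no 6 at all" has probability $ (5/6)^4 $. The sketch below computes the exact value through the complement and checks it with a seeded simulation. The scenario and variable names are illustrative assumptions, not part of the exercises above.

```python
import random

rolls_per_trial = 4

# Exact value via the complement: P(at least one 6) = 1 - P(no 6 in any roll)
p_no_six = (5 / 6) ** rolls_per_trial
p_at_least_one_six = 1 - p_no_six

# Simulation check with a fixed seed for reproducibility
rng = random.Random(0)
trials = 100_000
hits = sum(
    any(rng.randint(1, 6) == 6 for _ in range(rolls_per_trial))
    for _ in range(trials)
)
estimate = hits / trials

print("Exact P(at least one 6):", p_at_least_one_six)
print("Simulated estimate:", estimate)
```

Computing the same probability directly would require adding the cases of exactly one, two, three, and four sixes.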


In [None]:
# Calculate the complement of an event

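def complementEvent(P: float = 0.6) -> float:
    # P is the probability of event A (e.g., rain) and must lie in [0, 1]
    if not 0 <= P <= 1:
        raise ValueError("A probability must be between 0 and 1")
    pComplement = 1 - P
    print("Probability of event A:", P)
    print("Probability of complement of event A:", pComplement)
    return pComplement

complementEvent()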


**Conditional probability** is the probability of an event $ A $ occurring given that another event $ B $ has already occurred. It is denoted as $ P(A|B) $ and is calculated using the formula:

$
P(A|B) = \frac{P(A \cap B)}{P(B)}
$

Where:
- $ P(A \cap B) $ is the probability of both $ A $ and $ B $ occurring,
- $ P(B) $ is the probability of event $ B $.

If you have **sample data**, you can estimate $ P(A|B) $ using relative frequencies.

---

### Steps to Compute Conditional Probability from Sample Data:
1. **Identify the relevant events**:
   - Let $ A $ and $ B $ be two events of interest.

2. **Count the occurrences**:
   - $ N $: Total number of observations in the sample.
   - $ N_B $: Number of observations where event $ B $ occurs.
   - $ N_{A \cap B} $: Number of observations where both $ A $ and $ B $ occur.

3. **Compute the probabilities**:
   - $ P(B) = \frac{N_B}{N} $,
   - $ P(A \cap B) = \frac{N_{A \cap B}}{N} $.

4. **Calculate the conditional probability**:
   $
   P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{N_{A \cap B}}{N_B}
   $

---

### Example:
Suppose you have the following sample data for a survey of 100 people:

| Event          | Number of People |
|----------------|------------------|
| $ B $: Smoker | 30               |
| $ A \cap B $: Smoker and has lung disease | 10 |

Here:
- $ N = 100 $ (total number of people),
- $ N_B = 30 $ (number of smokers),
- $ N_{A \cap B} = 10 $ (number of smokers with lung disease).

1. **Compute $ P(B) $**:
   $
   P(B) = \frac{N_B}{N} = \frac{30}{100} = 0.3
   $

2. **Compute $ P(A \cap B) $**:
   $
   P(A \cap B) = \frac{N_{A \cap B}}{N} = \frac{10}{100} = 0.1
   $

3. **Compute $ P(A|B) $**:
   $
   P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{0.1}{0.3} \approx 0.333
   $

So, the probability of having lung disease given that a person is a smoker is approximately **0.333** (or 33.3%).

---

### Key Takeaways:
- Conditional probability $ P(A|B) $ measures the likelihood of event $ A $ occurring given that event $ B $ has occurred.
- It is calculated as $ P(A|B) = \frac{P(A \cap B)}{P(B)} $.
- When working with sample data, you can estimate $ P(A|B) $ using relative frequencies.
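
When the sample is available as raw observations rather than counts, $ N_B $ and $ N_{A \cap B} $ can be counted directly. The sketch below uses simulated die rolls as stand-in sample data, with $ A $ = "roll is even" and $ B $ = "roll is greater than 3"; these events and the function name `conditional_probability` are assumptions chosen for illustration.

```python
import random


def conditional_probability(samples, event_a, event_b):
    """Estimate P(A|B) as N_{A and B} / N_B from a list of observations."""
    in_b = [s for s in samples if event_b(s)]
    if not in_b:
        raise ValueError("event B never occurs in the sample")
    n_a_and_b = sum(1 for s in in_b if event_a(s))
    return n_a_and_b / len(in_b)


rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]

estimate = conditional_probability(
    rolls,
    event_a=lambda r: r % 2 == 0,  # A: roll is even
    event_b=lambda r: r > 3,       # B: roll is 4, 5, or 6
)
print("Estimated P(A|B):", estimate)
print("Theoretical P(A|B):", 2 / 3)
```

Given $ B $, the possible rolls are 4, 5, and 6, of which two are even, so the estimate should be close to $ 2/3 $.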


In [None]:
# Compute Conditional Probability P(A|B) given sample data

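def conditionalProbability(observations=1000, observationB=20, observationAB=10):
    # P(A|B) is undefined when B never occurs
    if observationB == 0:
        raise ValueError("P(A|B) is undefined when event B has no observations")
    # Compute P(B)
    PB = observationB / observations
    # Compute P(A ∩ B)
    PAB = observationAB / observations
    # Compute P(A|B) = P(A ∩ B) / P(B)
    PGivenAB = PAB / PB
    print("P(A|B):", PGivenAB)
    return PGivenAB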

conditionalProbability()


The **Law of Total Probability** is a fundamental rule in probability theory that allows you to calculate the total probability of an event by considering all possible scenarios or partitions of the sample space. It states:

If $ B_1, B_2, \dots, B_n $ are mutually exclusive and exhaustive events (i.e., they partition the sample space), then for any event $ A $:

$
P(A) = \sum_{i=1}^n P(A|B_i) \cdot P(B_i)
$

To **verify the Law of Total Probability using simulated data**, you can:
1. Simulate data that follows a known distribution or process.
2. Partition the data into mutually exclusive and exhaustive events $ B_1, B_2, \dots, B_n $.
3. Compute $ P(A|B_i) $ and $ P(B_i) $ for each partition.
4. Verify that $ P(A) = \sum_{i=1}^n P(A|B_i) \cdot P(B_i) $.

---

### Example: Verifying the Law of Total Probability
Let’s say we have a biased coin and two dice:
- The coin has a probability $ P(H) = 0.6 $ of landing heads and $ P(T) = 0.4 $ of landing tails.
- If the coin lands heads, we roll a fair 6-sided die.
- If the coin lands tails, we roll a biased 6-sided die where the probability of rolling a 6 is $ 0.5 $, and the other numbers are equally likely.

We want to verify the Law of Total Probability for the event $ A $: **Rolling a 6**.

---

### Theoretical Calculation
Using the Law of Total Probability:
$
P(A) = P(A|H) \cdot P(H) + P(A|T) \cdot P(T)
$

- $ P(H) = 0.6 $, $ P(T) = 0.4 $.
- If the coin lands heads, we roll a fair die: $ P(A|H) = \frac{1}{6} $.
- If the coin lands tails, we roll a biased die: $ P(A|T) = 0.5 $.

So:
$
P(A) = \left(\frac{1}{6}\right) \cdot 0.6 + 0.5 \cdot 0.4 = 0.1 + 0.2 = 0.3
$

The theoretical probability of rolling a 6 is **0.3**.

---


### Verify the Law of Total Probability
From the simulation:
- The simulated probability $ P(A) \approx 0.2998 $.
- The theoretical probability $ P(A) = 0.3 $.

The results are very close, verifying the Law of Total Probability.

---

### Key Takeaways:
1. The Law of Total Probability allows you to compute $ P(A) $ by considering all possible scenarios ($ B_i $).
2. You can verify the law using simulated data by:
   - Partitioning the sample space into mutually exclusive and exhaustive events.
   - Computing $ P(A|B_i) $ and $ P(B_i) $ for each partition.
   - Confirming that $ P(A) = \sum_{i=1}^n P(A|B_i) \cdot P(B_i) $.
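
A sketch of that verification for the coin-and-dice example estimates $ P(H) $, $ P(T) $, $ P(A|H) $, and $ P(A|T) $ separately and recombines them. The variable names and the fixed seed are assumptions made for reproducibility.

```python
import random

rng = random.Random(0)
trials = 200_000
counts = {"H": 0, "T": 0}  # how often each partition B_i occurs
sixes = {"H": 0, "T": 0}   # how often A (rolling a 6) occurs within each partition

for _ in range(trials):
    coin = "H" if rng.random() < 0.6 else "T"
    if coin == "H":
        roll = rng.randint(1, 6)  # fair die
    else:
        # Biased die: 6 with probability 0.5, otherwise 1-5 equally likely
        roll = 6 if rng.random() < 0.5 else rng.randint(1, 5)
    counts[coin] += 1
    sixes[coin] += roll == 6

p_b = {c: counts[c] / trials for c in counts}           # estimates of P(B_i)
p_a_given = {c: sixes[c] / counts[c] for c in counts}   # estimates of P(A|B_i)

p_a_total = sum(p_a_given[c] * p_b[c] for c in counts)  # right-hand side of the law
p_a_direct = (sixes["H"] + sixes["T"]) / trials         # left-hand side

print("P(B_i):", p_b)
print("P(A|B_i):", p_a_given)
print("Sum of P(A|B_i) * P(B_i):", p_a_total)
print("Direct estimate of P(A):", p_a_direct)
```

With empirical frequencies the two sides agree up to floating-point rounding by construction, so the informative comparison is between each estimate and its theoretical value: 0.6, 0.4, 1/6, 0.5, and 0.3.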


In [None]:
# Verify Law of Total Probability using simulated Data

# Parameters
headsProbability = 0.6  # Probability of heads, P(H)
tailsProbability = 0.4  # Probability of tails, P(T) = 1 - P(H)
numOfTrials = 100000  # Number of simulations

# Simulate the process
countA = 0  # Count of event A (rolling a 6)

for _ in range(numOfTrials):
    # Flip the coin
    if random.random() < headsProbability:
        # Roll a fair die
        roll = random.randint(1, 6)
    else:
        # Roll a biased die
        if random.random() < 0.5:
            roll = 6
        else:
            roll = random.randint(1, 5)
    
    # Check if event A occurs
    if roll == 6:
        countA += 1

# Compute P(A) from simulation
PASimulated = countA / numOfTrials

print("Simulated P(A):", PASimulated)
print("Theoretical P(A):", 0.3)

In [None]:
# Simulate and calculate the probability of drawing a specific card from a deck

# specificCard is the card to look for (e.g., Ace of Spades)
def randomDeckOfCards(specificCard="Ace of Spades", simulations=100000):

    # Define the deck of cards
    suits: list[str] = ["Hearts", "Diamonds", "Clubs", "Spades"]
    ranks: list[str] = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "Jack", "Queen", "King", "Ace"]
    deck = [f"{rank} of {suit}" for suit in suits for rank in ranks]
    countSpecificCard = 0

    # Simulate drawing a card
    for _ in range(simulations):
        # Shuffle the deck
        random.shuffle(deck)
        # Draw the top card
        drawnCard = deck[0]
        # Check if it's the specific card
        if drawnCard == specificCard:
            countSpecificCard += 1

    # Calculate the empirical probability
    empiricalProbability = countSpecificCard / simulations

    print("Specific Card:", specificCard)
    print("Theoretical Probability:", 1/52)
    print("Empirical Probability:", empiricalProbability)

randomDeckOfCards(specificCard="Queen of Hearts")

**Joint probability** refers to the probability of two events occurring together. It is denoted as $ P(A \cap B) $ or $ P(A, B) $, and it represents the likelihood of both events $ A $ and $ B $ happening simultaneously.

---

### Formula for Joint Probability:
The joint probability of two events $ A $ and $ B $ is given by:

$
P(A \cap B) = P(A) \cdot P(B|A)
$

Or equivalently:

$
P(A \cap B) = P(B) \cdot P(A|B)
$

Where:
- $ P(A) $ is the probability of event $ A $,
- $ P(B|A) $ is the probability of event $ B $ given that $ A $ has occurred,
- $ P(B) $ is the probability of event $ B $,
- $ P(A|B) $ is the probability of event $ A $ given that $ B $ has occurred.

If events $ A $ and $ B $ are **independent**, then:
$
P(A \cap B) = P(A) \cdot P(B)
$

---

### Steps to Compute Joint Probability:
1. **Identify the events**:
   - Define events $ A $ and $ B $.

2. **Determine if the events are independent**:
   - If $ A $ and $ B $ are independent, use $ P(A \cap B) = P(A) \cdot P(B) $.
   - If $ A $ and $ B $ are dependent, use $ P(A \cap B) = P(A) \cdot P(B|A) $ or $ P(A \cap B) = P(B) \cdot P(A|B) $.

3. **Compute the probabilities**:
   - Calculate $ P(A) $, $ P(B) $, and (if necessary) $ P(B|A) $ or $ P(A|B) $.

4. **Calculate the joint probability**:
   - Use the appropriate formula to compute $ P(A \cap B) $.

---

### Example 1: Independent Events
Suppose you roll a fair six-sided die and flip a fair coin. Let:
- $ A $: Rolling a **3** on the die.
- $ B $: Flipping **heads** on the coin.

Since the die roll and coin flip are independent:
$
P(A) = \frac{1}{6}, \quad P(B) = \frac{1}{2}
$

The joint probability is:
$
P(A \cap B) = P(A) \cdot P(B) = \frac{1}{6} \cdot \frac{1}{2} = \frac{1}{12} \approx 0.0833
$

---

### Example 2: Dependent Events
Suppose you draw two cards from a standard deck of 52 cards without replacement. Let:
- $ A $: First card is an **Ace**.
- $ B $: Second card is also an **Ace**.

Here, $ A $ and $ B $ are dependent events.

1. Compute $ P(A) $:
   $
   P(A) = \frac{4}{52} = \frac{1}{13}
   $

2. Compute $ P(B|A) $:
   - If the first card is an Ace, there are now 3 Aces left in the remaining 51 cards.
   $
   P(B|A) = \frac{3}{51} = \frac{1}{17}
   $

3. Compute the joint probability:
   $
   P(A \cap B) = P(A) \cdot P(B|A) = \frac{1}{13} \cdot \frac{1}{17} = \frac{1}{221} \approx 0.0045
   $

---


### Key Takeaways:
- Joint probability measures the likelihood of two events occurring together.
- For **independent events**, $ P(A \cap B) = P(A) \cdot P(B) $.
- For **dependent events**, $ P(A \cap B) = P(A) \cdot P(B|A) $ or $ P(A \cap B) = P(B) \cdot P(A|B) $.
- Use Python to compute joint probabilities efficiently.


In [None]:
# Compute Joint Probabilities of Two Events


def independentEvent() -> float:
    # Probabilities
    pA: float = 1 / 6  # Probability of rolling a 3
    pB: float = 1 / 2  # Probability of flipping heads

    # Joint probability for independent events: P(A) * P(B)
    pApB: float = pA * pB

    print("Joint Probability P(A ∩ B), independent events:", pApB)
    return pApB


def dependentEvent() -> float:
    # Probabilities
    pA: float = 4 / 52  # Probability of first card being an Ace
    pBgivenA: float = 3 / 51  # Probability of second card being an Ace given the first was an Ace

    # Joint probability for dependent events: P(A) * P(B|A)
    pApB: float = pA * pBgivenA

    print("Joint Probability P(A ∩ B), dependent events:", pApB)
    return pApB


independentEvent()
dependentEvent()
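
The two worked examples can also be checked by simulation. The sketch below is only an illustration: the function name `simulate_joint_probabilities`, the rank-only deck encoding, the trial count, and the seed are assumptions made here, not part of the original notebook.

```python
import random


def simulate_joint_probabilities(trials=100_000, seed=0):
    """Estimate both joint probabilities from the examples by simulation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Deck encoded by rank only: 4 aces ("A") and 48 other cards ("x").
    deck = ["A"] * 4 + ["x"] * 48

    three_and_heads = 0
    two_aces = 0
    for _ in range(trials):
        # Independent events: die shows 3 AND coin shows heads.
        if rng.randint(1, 6) == 3 and rng.random() < 0.5:
            three_and_heads += 1
        # Dependent events: two cards drawn without replacement are both aces.
        first, second = rng.sample(deck, 2)
        if first == "A" and second == "A":
            two_aces += 1

    return three_and_heads / trials, two_aces / trials


independent_estimate, dependent_estimate = simulate_joint_probabilities()
print("Estimated P(3 and heads):", independent_estimate, "theory:", 1 / 12)
print("Estimated P(two aces):   ", dependent_estimate, "theory:", 1 / 221)
```

With enough trials both estimates should settle near $ \frac{1}{12} $ and $ \frac{1}{221} $. The second converges more slowly in relative terms because the event is rare.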

**Bayes' Theorem** is a fundamental concept in probability that allows us to update the probability of an event based on new information. It is stated as:

$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$

Where:
- $ P(A|B) $: Posterior probability of event $ A $ given event $ B $,
- $ P(B|A) $: Likelihood of event $ B $ given event $ A $,
- $ P(A) $: Prior probability of event $ A $,
- $ P(B) $: Total probability of event $ B $.

To **verify Bayes' Theorem with a real-world example**, let’s use a medical testing scenario.

---

### Real-World Example: Medical Testing
Suppose:
- A disease affects **1%** of the population ($ P(A) = 0.01 $).
- A test for the disease is **99% accurate**:
  - If a person has the disease, the test is positive **99%** of the time ($ P(B|A) = 0.99 $).
  - If a person does not have the disease, the test is negative **99%** of the time ($ P(B^c|A^c) = 0.99 $).

We want to find:
- The probability that a person has the disease given that they tested positive ($ P(A|B) $).

---

### Step 1: Theoretical Calculation Using Bayes' Theorem
1. **Compute $ P(B|A) $**:
   - $ P(B|A) = 0.99 $.

2. **Compute $ P(A) $**:
   - $ P(A) = 0.01 $.

3. **Compute $ P(B|A^c) $**:
   - The probability of a false positive is $ 1 - 0.99 = 0.01 $.

4. **Compute $ P(B) $**:
   - $ P(B) = P(B|A) \cdot P(A) + P(B|A^c) \cdot P(A^c) $,
   - $ P(B) = (0.99 \cdot 0.01) + (0.01 \cdot 0.99) = 0.0099 + 0.0099 = 0.0198 $.

5. **Apply Bayes' Theorem**:
   $
   P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} = \frac{0.99 \cdot 0.01}{0.0198} = \frac{0.0099}{0.0198} = 0.5
   $

So, the probability that a person has the disease given that they tested positive is **50%**.


---

### Key Takeaways:
1. **Bayes' Theorem** allows us to update probabilities based on new evidence.
2. In this example:
   - Even with a **99% accurate test**, the probability of having the disease after testing positive is only **50%** due to the low prevalence of the disease.
3. Simulations can be used to verify theoretical results and build intuition.


In [None]:
# Verify Bayes' Theorem with a Real-World Example

def realWorldBayesTheorem():

    # Parameters
    pA = 0.01  # Prevalence of the disease
    pBgivenA = 0.99  # Probability of testing positive given the disease
    pBgivenNotA = 0.01  # Probability of testing positive given no disease

    # Number of simulations
    trials = 100000

    # Counters
    countAandB = 0  # Number of people with the disease and testing positive
    countB = 0  # Number of people testing positive

    # Simulate
    for _ in range(trials):
        # Determine if the person has the disease
        hasDisease = random.random() < pA
        # Determine if the test is positive
        if hasDisease:
            testPositive = random.random() < pBgivenA
        else:
            testPositive = random.random() < pBgivenNotA
        # Update counters
        if testPositive:
            countB += 1
            if hasDisease:
                countAandB += 1

    # Compute P(A|B) from simulation
    pAgivenBSimulated = countAandB / countB

    # Theoretical P(A|B)
    pB = pBgivenA * pA + pBgivenNotA * (1 - pA)
    pAgivenBTheoretical = (pBgivenA * pA) / pB

    print("Simulated P(A|B):", pAgivenBSimulated)
    print("Theoretical P(A|B):", pAgivenBTheoretical)


realWorldBayesTheorem()
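
To see why low prevalence drives the 50% result, it can help to evaluate Bayes' Theorem for several prevalence values. This is a minimal sketch: the function name `posterior_given_positive` and the extra prevalence values are illustrative assumptions, while the 99% sensitivity and specificity come from the example above.

```python
def posterior_given_positive(prevalence, sensitivity=0.99, specificity=0.99):
    """P(disease | positive test) via Bayes' Theorem."""
    false_positive_rate = 1 - specificity
    # Law of Total Probability for P(B), the overall positive rate.
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_positive


for prevalence in (0.001, 0.01, 0.1, 0.5):
    posterior = posterior_given_positive(prevalence)
    print(f"prevalence={prevalence:<6} P(A|B)={posterior:.4f}")
```

The posterior rises with prevalence: the same test is far more informative when the condition is common than when it is rare.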

## -- Discrete and Continuous Random Variables --


A **binomial random variable** models the number of successes in a fixed number of independent trials, each with the same probability of success. It is characterized by two parameters:
- $ n $: Number of trials,
- $ p $: Probability of success in each trial.

The **mean** and **variance** of a binomial random variable are given by:
- Mean: $ \mu = n \cdot p $,
- Variance: $ \sigma^2 = n \cdot p \cdot (1 - p) $.

---

### Steps to Generate a Binomial Random Variable and Compute Its Mean and Variance:
1. **Generate binomial random variables**:
   - Use a random number generator to simulate $ n $ independent trials, each with probability $ p $ of success.
   - Count the number of successes.

2. **Compute the mean and variance**:
   - Use the formulas $ \mu = n \cdot p $ and $ \sigma^2 = n \cdot p \cdot (1 - p) $.

---

### Example:
Suppose:
- $ n = 10 $ (number of trials),
- $ p = 0.5 $ (probability of success in each trial).

We want to:
1. Generate a binomial random variable.
2. Compute its mean and variance.

---

### Key Takeaways:
- A binomial random variable represents the number of successes in $ n $ independent trials, each with probability $ p $ of success.
- The mean is $ \mu = n \cdot p $, and the variance is $ \sigma^2 = n \cdot p \cdot (1 - p) $.
- Use Python libraries like `numpy` or `scipy.stats` to generate binomial random variables and compute their properties.


In [None]:
# Generate a binomial random variable and compute its mean and variance

def binomialMeanVariance(trials=10, successProbability=0.5, iterations=10000):
    randomBinomialVariable = np.random.binomial(trials, successProbability, iterations)
    empiricalMean = np.mean(randomBinomialVariable)
    empiricalVariance = np.var(randomBinomialVariable)
    theoreticalMean = trials * successProbability
    theoreticalVariance = trials * successProbability * (1 - successProbability)
    print("Empirical Mean:", empiricalMean)
    print("Theoretical Mean:", theoreticalMean)
    print("Empirical Variance:", empiricalVariance)
    print("Theoretical Variance:", theoreticalVariance)

binomialMeanVariance()
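
The first step listed above (simulate $ n $ independent trials and count the successes) can also be written out by hand instead of calling a library sampler. The sketch below uses only the standard library; the helper name `binomial_draw`, the seed, and the number of draws are assumptions for illustration.

```python
import random
from statistics import fmean, pvariance


def binomial_draw(n, p, rng):
    """One binomial outcome: count successes in n independent Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


rng = random.Random(42)  # fixed seed for reproducibility
n, p = 10, 0.5
draws = [binomial_draw(n, p, rng) for _ in range(20_000)]

print("Empirical mean:", fmean(draws), "theory:", n * p)
print("Empirical variance:", pvariance(draws), "theory:", n * p * (1 - p))
```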

The **Poisson process** is a stochastic process that models events occurring randomly over time (or space) at a constant average rate. It is characterized by:
- $ \lambda $: The average rate of events per unit time (or space),
- $ k $: The number of events in a given interval.

The probability of observing $ k $ events in a time interval $ t $ is given by the **Poisson distribution**:

$
P(X = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}
$

---

### Steps to Simulate a Poisson Process and Calculate Probabilities:
1. **Simulate the Poisson process**:
   - Generate the times at which events occur using the exponential distribution (since inter-arrival times in a Poisson process are exponentially distributed).

2. **Calculate probabilities of specific events**:
   - Use the Poisson distribution formula to compute the probability of observing $ k $ events in a given interval.

---

### Example:
Suppose:
- $ \lambda = 2 $ (average rate of 2 events per unit time),
- $ t = 3 $ (time interval of interest).

We want to:
1. Simulate a Poisson process over the interval $ [0, 3] $.
2. Calculate the probability of observing $ k = 5 $ events in this interval.

---

In [None]:
# Simulate the Poisson process and calculate the probabilities of specific events


def poissonProcess() -> int:

    # Parameters
    lambda_: int = 2  # Average rate of events per unit time
    t: int = 3  # Time interval

    # Simulate inter-arrival times (exponentially distributed)
    interArrivalTimes = np.random.exponential(scale=1/lambda_, size=1000)

    # Simulate event times
    eventTimes = np.cumsum(interArrivalTimes)

    # Filter events within the interval [0, t]
    eventsInInterval = eventTimes[eventTimes <= t]

    # Number of events in the interval
    numEvents: int = len(eventsInInterval)

    print(f"Number of events in [0, {t}]:", numEvents)
    return numEvents

poissonProcess()
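
The example also asks for the probability of exactly $ k = 5 $ events in $ [0, 3] $. A minimal sketch of that calculation follows, using the Poisson formula with $ \lambda t = 6 $ and a repeated simulation as a cross-check. The helper names `poisson_pmf` and `count_events`, the seed, and the number of runs are assumptions for illustration.

```python
import math
import random


def poisson_pmf(k, rate, t):
    """P(X = k) for the number of events in an interval of length t."""
    mean = rate * t
    return mean ** k * math.exp(-mean) / math.factorial(k)


def count_events(rate, t, rng):
    """Count arrivals in [0, t] by accumulating exponential inter-arrival times."""
    elapsed, events = 0.0, 0
    while True:
        elapsed += rng.expovariate(rate)  # expovariate takes the rate, not the mean
        if elapsed > t:
            return events
        events += 1


rng = random.Random(7)
rate, t, k = 2, 3, 5
runs = 50_000
hits = sum(1 for _ in range(runs) if count_events(rate, t, rng) == k)

print(f"Theoretical P(X = {k}): {poisson_pmf(k, rate, t):.4f}")
print(f"Simulated   P(X = {k}): {hits / runs:.4f}")
```

The theoretical value is $ \frac{6^5 e^{-6}}{5!} \approx 0.1606 $, and the simulated frequency should land close to it.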

The **probability mass function (PMF)** of a discrete random variable gives the probability that the variable takes on a specific value. For a discrete random variable $ X $, the PMF is defined as:

$
P(X = x) = f(x)
$

Where:
- $ f(x) $ is the probability that $ X $ takes the value $ x $,
- $ \sum_{x} f(x) = 1 $ (the probabilities sum to 1).

---

### Steps to Implement the PMF:
1. **Define the random variable**:
   - Identify the possible values $ x $ that the random variable $ X $ can take.

2. **Define the probabilities**:
   - Assign probabilities $ f(x) $ to each value $ x $, ensuring that $ \sum_{x} f(x) = 1 $.

3. **Implement the PMF**:
   - Create a function that takes a value $ x $ as input and returns $ P(X = x) $.

---

### Example: PMF of a Fair Six-Sided Die
For a fair six-sided die:
- The random variable $ X $ can take values $ \{1, 2, 3, 4, 5, 6\} $.
- The PMF is $ P(X = x) = \frac{1}{6} $ for $ x \in \{1, 2, 3, 4, 5, 6\} $.

---

### Key Takeaways:
- The **PMF** of a discrete random variable gives the probability that the variable takes on a specific value.
- You can implement the PMF using conditional statements or a dictionary for more flexibility.
- Ensure that the probabilities sum to 1 for all possible values of the random variable.


In [None]:
# Implement the probability mass function (PMF) of a discrete random variable

def probabilityMassFunction(x, pmfDictionary):
    for value in x:
        if value in pmfDictionary:
            print(f"P(X = {value}) = {pmfDictionary[value]}")
        else:
            print(f"P(X = {value}) is not defined in the PMF.")
    return pmfDictionary

# Define the PMF as a dictionary (the probabilities must sum to 1)
pmf_dict = {
    0: 0.1,
    1: 0.2,
    2: 0.3,
    3: 0.4
}

# Example usage
x_values = [0, 1, 2, 3]

probabilityMassFunction(x_values, pmf_dict)
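
Because a PMF is only valid when its probabilities sum to 1, it is worth checking that before using a dictionary as a PMF. The sketch below shows one possible check; the helper names `validate_pmf` and `pmf_lookup` are assumptions, and the final example is a deliberately invalid dictionary whose values sum to 1.4.

```python
import math


def validate_pmf(pmf):
    """Return True if every probability is in [0, 1] and they sum to 1."""
    in_range = all(0 <= p <= 1 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0, abs_tol=1e-9)
    return in_range and sums_to_one


def pmf_lookup(pmf, x):
    """P(X = x); values outside the support have probability 0."""
    return pmf.get(x, 0.0)


fair_die = {face: 1 / 6 for face in range(1, 7)}
print("Fair die is a valid PMF:", validate_pmf(fair_die))
print("P(X = 3):", pmf_lookup(fair_die, 3))
print("P(X = 7):", pmf_lookup(fair_die, 7))

invalid_pmf = {0: 0.2, 1: 0.5, 2: 0.3, 3: 0.4}  # sums to 1.4
print("Invalid example passes validation:", validate_pmf(invalid_pmf))
```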

### **Normal Random Variable: Definition**
A **Normal random variable** follows the **Normal (Gaussian) Distribution**, defined by two parameters:
- **Mean (μ):** The center of the distribution.
- **Standard deviation (σ):** Controls the spread of the distribution.

The probability density function (PDF) is given by:

$
f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
$

For a continuous variable $ X $, $ f(x) $ is a probability *density*, not a probability: the probability that $ X $ equals any single value $ x $ is zero. The area under the PDF curve over an interval gives the probability that $ X $ falls in that interval.

---


### **Summary**
1. Use `numpy.random.normal` to generate random variables.
2. Use `scipy.stats.norm.pdf` for PDF values and `norm.cdf` for probabilities.
3. Calculate probabilities over an interval with $ F(b) - F(a) $.
4. Visualize the distribution using `matplotlib`.


In [None]:
# Generate a normal random variable and compute probabilities for specific intervals
def probabilitiesAtSpecificIntervals(mean=0, standardDeviation=1, sampleSize=1000):
    # Normal random variables
    data = np.random.normal(mean, standardDeviation, sampleSize)
    # PDF at a specific point
    point = 1
    pdfValue = sciStats.norm.pdf(point, loc=mean, scale=standardDeviation)
    print(f"PDF at X={point}: {pdfValue}")
    # Probability for an interval [a, b], computed as F(b) - F(a)
    intervalA, intervalB = -1, 1
    probability = (sciStats.norm.cdf(intervalB, loc=mean, scale=standardDeviation)
                   - sciStats.norm.cdf(intervalA, loc=mean, scale=standardDeviation))
    print(f"P({intervalA} <= X <= {intervalB}): {probability}")
    # Proportion of samples in the interval [a, b]
    proportion = np.mean((data >= intervalA) & (data <= intervalB))
    print(f"Proportion of samples in [{intervalA}, {intervalB}]: {proportion}")


probabilitiesAtSpecificIntervals()



### **What Is the Cumulative Distribution Function (CDF)?**

The **cumulative distribution function (CDF)** of a random variable $ X $ represents the probability that $ X $ takes on a value less than or equal to a specific value $ x $. 

Mathematically, it is defined as:

$
F(x) = P(X \leq x)
$

For a **normal distribution**, this means $ F(x) $ is the probability that a randomly selected value from the distribution is less than or equal to $ x $.

---

### **Key Properties of the CDF:**

1. **Range:** The CDF value $ F(x) $ is always between 0 and 1:
   $
   0 \leq F(x) \leq 1
   $
   - $ F(x) = 0 $: No probability mass lies at or below $ x $, so the event $ X \leq x $ has probability zero.
   - $ F(x) = 1 $: All probability mass lies at or below $ x $, so the event $ X \leq x $ is certain.

2. **Monotonicity:** The CDF is a non-decreasing function, meaning $ F(x_1) \leq F(x_2) $ for $ x_1 < x_2 $.

3. **Asymptotes for Normal Distribution:**
   - As $ x \to -\infty $, $ F(x) \to 0 $.
   - As $ x \to +\infty $, $ F(x) \to 1 $.

---

### **How It Relates to a Normal Distribution**

For a **normal distribution** with mean $ \mu $ and standard deviation $ \sigma $, the cumulative distribution function $ F(x) $ has no closed form in terms of elementary functions. It is defined as the integral of the **probability density function (PDF)** and is evaluated numerically, or equivalently through the error function:

$
F(x) = \int_{-\infty}^{x} \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(t-\mu)^2}{2\sigma^2}} dt
$

The area under the curve of the PDF from $ -\infty $ to $ x $ gives the value of $ F(x) $.

---

### **Why Is CDF Useful?**
1. **Probability over an Interval:**
   - You can compute the probability that $ X $ lies between two values, $ a $ and $ b $, using:
     $
     P(a \leq X \leq b) = F(b) - F(a)
     $

2. **Comparison of Values:**
   - The CDF locates a value within the distribution. For example, if $ F(x) = 0.75 $, then 75% of the probability mass lies at or below $ x $, i.e., $ x $ is the 75th percentile.

---


In [None]:
# Compute the cumulative distribution function of a normal distribution

def cumulativeDistributionFunction(mean=0, standardDeviation=1, value=1.5) -> float:
    cdfValue: float = sciStats.norm.cdf(value, loc=mean, scale=standardDeviation)
    print(f"CDF at X={value} for a normal distribution with mean={mean}, std={standardDeviation}: {cdfValue}")
    return cdfValue

cumulativeDistributionFunction()
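
The integral defining the normal CDF can be written in terms of the error function as $ F(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right] $. The sketch below evaluates that identity with the standard library and compares it with `statistics.NormalDist`. The function name `normal_cdf` is an assumption for illustration.

```python
import math
from statistics import NormalDist


def normal_cdf(x, mean=0.0, std=1.0):
    """F(x) for a normal distribution, written in terms of the error function."""
    z = (x - mean) / (std * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))


value = 1.5
print("erf-based CDF:  ", normal_cdf(value))
print("NormalDist.cdf: ", NormalDist(mu=0, sigma=1).cdf(value))
# Interval probability as a difference of CDF values, F(b) - F(a)
print("P(-1 <= X <= 1):", normal_cdf(1) - normal_cdf(-1))
```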

### **What is the Central Limit Theorem (CLT)?**

The **Central Limit Theorem (CLT)** is a fundamental theorem in probability and statistics. It states that:

> The suitably standardized sum (or average) of a large number of independent and identically distributed (i.i.d.) random variables with finite mean and variance tends toward a **normal distribution** as the number of variables grows, regardless of the shape of the original distribution.

#### **Key Points:**
1. The mean of the resulting normal distribution is the sum of the means of the individual random variables.
   $
   \mu_{sum} = n \cdot \mu
   $
2. The variance of the resulting normal distribution is the sum of the variances of the individual random variables.
   $
   \sigma^2_{sum} = n \cdot \sigma^2
   $
3. This approximation becomes more accurate as $ n $, the number of random variables, increases.

---

### **Why Is CLT Important?**

The CLT allows us to:
- Approximate the sum (or average) of random variables using a normal distribution.
- Make statistical inferences about sample means even when the population distribution is not normal.

---

### **How to Use CLT to Approximate the Sum of Random Variables**

#### **1. Identify the Random Variables**
   - Ensure the random variables are:
     - Independent
     - Identically distributed (same mean $ \mu $ and variance $ \sigma^2 $).

#### **2. Compute the Parameters for the Sum**
   - If you are summing $ n $ random variables with:
     - Mean $ \mu $,
     - Standard deviation $ \sigma $,
   - Then:
     $
     \mu_{sum} = n \cdot \mu, \quad \sigma_{sum} = \sqrt{n} \cdot \sigma
     $

#### **3. Approximate Using the Normal Distribution**
   - Treat the sum as a random variable following:
     $
     S \sim N(\mu_{sum}, \sigma_{sum}^2)
     $

#### **4. Calculate Probabilities**
   - Use the normal distribution's **CDF** or **PDF** to compute probabilities about the sum.

---

### **Step-by-Step Example**

#### Problem: 
Suppose we roll a fair 6-sided die 50 times. Each roll is a random variable $ X $, with:
- $ \mu = 3.5 $ (average of die outcomes),
- $ \sigma^2 = \frac{35}{12} \approx 2.92 $ (variance of die outcomes).

We want to approximate the probability that the sum of the 50 rolls is less than 200.

---

#### **Step-by-Step Solution**

1. **Identify the Parameters of the Individual Random Variables**:
   - $ \mu = 3.5 $
   - $ \sigma = \sqrt{\frac{35}{12}} \approx 1.71 $

2. **Compute the Parameters for the Sum**:
   - $ \mu_{sum} = n \cdot \mu = 50 \cdot 3.5 = 175 $
   - $ \sigma_{sum} = \sqrt{n} \cdot \sigma = \sqrt{50} \cdot 1.708 \approx 12.08 $

3. **Approximate Using the Normal Distribution**:
   - The sum $ S $ follows approximately:
     $
     S \sim N(175, 12.08^2)
     $

4. **Standardize the Value $ x = 200 $ Using the Z-Score**:
   - The Z-score is given by:
     $
     Z = \frac{x - \mu_{sum}}{\sigma_{sum}}
     $
   - For $ x = 200 $:
     $
     Z = \frac{200 - 175}{12.08} \approx 2.07
     $

5. **Find the Probability Using the CDF**:
   - Using the standard normal distribution table or Python:
     $
     P(S \leq 200) = P(Z \leq 2.07)
     $
   - From the CDF of the standard normal distribution:
     $
     P(Z \leq 2.07) \approx 0.9808
     $

So, the probability that the sum of the dice rolls is less than 200 is approximately **98.08%**.

---

### **Key Takeaways:**
- The Central Limit Theorem approximates the sum of random variables as a normal distribution.
- Use the CLT when:
  - Random variables are independent and identically distributed.
  - The sample size $ n $ is large (a common rule of thumb is $ n \geq 30 $, though heavily skewed distributions may need more).
- Approximate probabilities using the mean and standard deviation of the sum.


In [None]:
# Central Limit Theorem to approximate the sum of random variables


def approximateSumViaCLT(mean=3.5, standardDeviation=np.sqrt(35 / 12), trials=50) -> float:
    # Parameters of the sum of `trials` i.i.d. random variables
    meanSum: float = trials * mean
    standardDeviationSum: float = np.sqrt(trials) * standardDeviation
    # Compute the probability for the sum being less than 200
    x: int = 200
    z = (x - meanSum) / standardDeviationSum
    probability: float = sciStats.norm.cdf(z)
    print(f"P(Sum ≤ {x}): {probability}")
    return probability

approximateSumViaCLT()
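
One way to judge the quality of the CLT approximation is to simulate the 50 rolls many times and compare the observed frequency with the normal estimate. The sketch below is an illustration only; the function name `simulate_dice_sum_probability`, the number of runs, and the seed are assumptions.

```python
import math
import random
from statistics import NormalDist


def simulate_dice_sum_probability(rolls=50, threshold=200, runs=20_000, seed=1):
    """Fraction of simulated experiments whose sum of `rolls` dice is <= threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        total = sum(rng.randint(1, 6) for _ in range(rolls))
        if total <= threshold:
            hits += 1
    return hits / runs


mean_sum = 50 * 3.5
std_sum = math.sqrt(50 * 35 / 12)
clt_estimate = NormalDist(mean_sum, std_sum).cdf(200)
simulated = simulate_dice_sum_probability()

print(f"CLT approximation of P(S <= 200): {clt_estimate:.4f}")
print(f"Simulated P(S <= 200):            {simulated:.4f}")
```

The two numbers should agree to about two decimal places. A small gap is expected because the sum is discrete while the normal approximation is continuous.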


### **What is an Exponential Distribution?**

The **exponential distribution** is a continuous probability distribution used to model the time until an event occurs (e.g., time between arrivals of customers at a store). It is widely used in reliability analysis, queuing theory, and survival analysis.

---

### **Mathematical Definition**

1. **Probability Density Function (PDF):**
   $
   f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \geq 0
   $
   - $ \lambda > 0 $ is the rate parameter (mean time between events is $ \frac{1}{\lambda} $).

2. **Cumulative Distribution Function (CDF):**
   $
   F(x; \lambda) = P(X \leq x) = 1 - e^{-\lambda x}, \quad x \geq 0
   $

3. **Key Properties:**
   - The mean of the distribution is $ \mu = \frac{1}{\lambda} $.
   - The variance is $ \sigma^2 = \frac{1}{\lambda^2} $.

---

### **How to Simulate and Compute Probabilities**

Here’s how you can simulate an exponential distribution and compute probabilities:

---

### **Step-by-Step Guide**

1. **Define the Rate Parameter ($ \lambda $):**
   - Choose a positive value for $ \lambda $, which represents the event rate per unit time.

2. **Simulate Random Variables:**
   - Use `numpy.random.exponential(scale, size)` where:
     - $ \text{scale} = \frac{1}{\lambda} $,
     - `size` is the number of random variables to generate.

3. **Compute the PDF at a Specific Point:**
   - Use the formula $ f(x; \lambda) = \lambda e^{-\lambda x} $.

4. **Compute the CDF at a Specific Point:**
   - Use the formula $ F(x; \lambda) = 1 - e^{-\lambda x} $.

5. **Calculate Probabilities for Intervals:**
   - For $ P(a \leq X \leq b) $, use:
     $
     P(a \leq X \leq b) = F(b; \lambda) - F(a; \lambda)
     $

---


### **Key Takeaways**
- The exponential distribution is memoryless: $ P(X > a + b | X > a) = P(X > b) $.
- Use $ \lambda $ to define the event rate and compute probabilities or simulate values.
- CDF helps calculate probabilities over intervals easily.


In [None]:
# Simulate and compute the probabilities of an exponential distribution

def probabilityOfExponentialDistribution(rate=2, sample=1000) -> float:
    # Simulate draws; NumPy parameterizes the distribution by scale = 1 / rate
    samples = np.random.exponential(scale=1/rate, size=sample)
    pointX: int = 1
    pdfValue: float = rate * np.exp(-rate * pointX)
    print(f"PDF at X={pointX}: {pdfValue}")
    cdfValue: float = 1 - np.exp(-rate * pointX)
    print(f"CDF at X={pointX}: {cdfValue}")
    intervalA, intervalB = 0.5, 2
    cdfA: float = 1 - np.exp(-rate * intervalA)
    cdfB: float = 1 - np.exp(-rate * intervalB)
    probabilityInterval: float = cdfB - cdfA
    print(f"P({intervalA} <= X <= {intervalB}): {probabilityInterval}")
    # Compare with the proportion of simulated draws that fall in the interval
    proportion = np.mean((samples >= intervalA) & (samples <= intervalB))
    print(f"Proportion of samples in [{intervalA}, {intervalB}]: {proportion}")
    return probabilityInterval

probabilityOfExponentialDistribution()
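
The memoryless property from the takeaways can be checked empirically: among draws that exceed $ a $, the fraction that also exceed $ a + b $ should match the overall fraction that exceed $ b $. The sketch below is a minimal illustration; the function name `memoryless_check` and the values of `a`, `b`, the sample count, and the seed are assumptions.

```python
import math
import random


def memoryless_check(rate=2.0, a=0.5, b=1.0, samples=200_000, seed=3):
    """Estimate P(X > a + b | X > a) and P(X > b) from simulated Exp(rate) draws."""
    rng = random.Random(seed)
    draws = [rng.expovariate(rate) for _ in range(samples)]
    survived_a = [x for x in draws if x > a]
    conditional = sum(1 for x in survived_a if x > a + b) / len(survived_a)
    unconditional = sum(1 for x in draws if x > b) / samples
    return conditional, unconditional


conditional, unconditional = memoryless_check()
print(f"P(X > a + b | X > a): {conditional:.4f}")
print(f"P(X > b):             {unconditional:.4f}")
print(f"Exact e^(-rate * b):  {math.exp(-2.0 * 1.0):.4f}")
```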

In [None]:
# Fit a normal distribution to a given dataset and estimate its parameters

def fitDatasetToNormalDistribution():
    # Step 1: Sample dataset (replace with your actual data)
    data = np.random.normal(loc=5, scale=2, size=1000)  # Example data (mean=5, std=2)

    # Step 2: Estimate the parameters (mean and std)
    estimatedMean = np.mean(data)
    estimatedStandardDeviation = np.std(data, ddof=1)  # Sample standard deviation (use ddof=1)

    # Step 3: Print the estimated parameters
    print(f"Estimated Mean (μ): {estimatedMean:.2f}")
    print(f"Estimated Standard Deviation (σ): {estimatedStandardDeviation:.2f}")

fitDatasetToNormalDistribution()

### **Comparison of Discrete and Continuous Random Variables**

Random variables are fundamental in probability theory and statistics. They represent the outcomes of random processes or experiments. The distinction between **discrete** and **continuous** random variables is based on the nature of their possible values.

---

### **1. Definition:**

- **Discrete Random Variable:**
  - A discrete random variable can take on only a **finite or countably infinite** number of distinct values.
  - The values are typically integers or whole numbers, but they could also be any countable set.
  - Examples: Number of heads in 10 coin tosses, number of cars passing a traffic light in one minute.

- **Continuous Random Variable:**
  - A continuous random variable can take on **uncountably many values** within a range or interval.
  - The values cannot be listed one by one; any real number in the interval is possible.
  - Examples: Height of individuals, time taken to run a race, temperature.

---

### **2. Probability Distribution:**

- **Discrete Random Variable:**
  - The probability distribution is given by the **probability mass function (PMF)**.
  - The sum of the probabilities for all possible outcomes is always 1:
    $
    P(X = x_i) = p(x_i), \quad \sum p(x_i) = 1
    $
  - For discrete random variables, the probabilities of specific outcomes can be computed directly.

- **Continuous Random Variable:**
  - The probability distribution is given by the **probability density function (PDF)**.
  - The probability that the variable takes any specific value is **zero**. Instead, we compute the probability that the variable falls within an interval:
    $
    P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx
    $
  - For continuous random variables, the probability density is a function, and the total area under the PDF curve is 1.

---

### **3. Example Distributions:**

- **Discrete Random Variable:**
  - **Binomial Distribution**: Models the number of successes in a fixed number of independent Bernoulli trials.
  - **Poisson Distribution**: Models the number of events occurring in a fixed interval of time or space.
  
- **Continuous Random Variable:**
  - **Normal Distribution**: Describes data that clusters around a mean value and is symmetric.
  - **Exponential Distribution**: Models the time between events in a Poisson process.

---

### **4. Probability Calculation:**

- **Discrete Random Variable:**
  - We can directly calculate the probability of specific outcomes (e.g., $ P(X = 3) $) using the PMF.

- **Continuous Random Variable:**
  - We calculate probabilities over intervals (e.g., $ P(3 \leq X \leq 5) $) using the area under the PDF curve between the bounds of the interval.

---

### **5. Notation:**

- **Discrete Random Variable:**
  - $ X $ takes values from a finite or countably infinite set: $ X \in \{x_1, x_2, x_3, \dots\} $.

- **Continuous Random Variable:**
  - $ X $ can take any value from a continuous interval: $ X \in [a, b] $, or $ X \in (-\infty, \infty) $.

---

### **6. Examples:**

- **Discrete Random Variable Example:**
  - **Rolling a Die**: When rolling a fair six-sided die, the possible outcomes are $ 1, 2, 3, 4, 5, 6 $. The probability of rolling each specific number is $ \frac{1}{6} $.
  
- **Continuous Random Variable Example:**
  - **Height of a Person**: The height can take any value within a range (e.g., between 140 cm and 200 cm), and the probability of a person being exactly 180 cm is technically zero. We instead calculate the probability of being within a range (e.g., 170 cm to 180 cm).

---

### **7. Mathematical Expectation:**

- **Discrete Random Variable:**
  - The expected value (mean) is computed as a weighted average of the possible outcomes:
    $
    E(X) = \sum_{i} x_i \cdot P(X = x_i)
    $

- **Continuous Random Variable:**
  - The expected value (mean) is computed as an integral:
    $
    E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx
    $

---

### **8. Variance and Standard Deviation:**

- **Discrete Random Variable:**
  - Variance is calculated as:
    $
    Var(X) = \sum_{i} (x_i - E(X))^2 \cdot P(X = x_i)
    $
  
- **Continuous Random Variable:**
  - Variance is calculated as:
    $
    Var(X) = \int_{-\infty}^{\infty} (x - E(X))^2 \cdot f(x) \, dx
    $
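
The sum and integral formulas for expectation and variance can be evaluated directly. The sketch below applies the discrete formulas exactly to a fair die and approximates the continuous integrals with a midpoint Riemann sum for an exponential density with $ \lambda = 2 $. The function names, the truncation of the integral at 40, and the step count are assumptions chosen for illustration.

```python
import math
from fractions import Fraction


def discrete_mean_and_variance(pmf):
    """E(X) and Var(X) from a dict mapping value -> probability."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance


def continuous_mean_and_variance(pdf, lower, upper, steps=100_000):
    """Approximate E(X) and Var(X) with a midpoint Riemann sum over [lower, upper]."""
    width = (upper - lower) / steps
    midpoints = [lower + (i + 0.5) * width for i in range(steps)]
    mean = sum(x * pdf(x) for x in midpoints) * width
    variance = sum((x - mean) ** 2 * pdf(x) for x in midpoints) * width
    return mean, variance


die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean, die_variance = discrete_mean_and_variance(die_pmf)
print("Die:", die_mean, die_variance)  # exact fractions 7/2 and 35/12


def exponential_pdf(x, rate=2.0):
    return rate * math.exp(-rate * x)


# The upper limit 40 stands in for infinity; the tail beyond it is negligible.
exp_mean, exp_variance = continuous_mean_and_variance(exponential_pdf, 0.0, 40.0)
print("Exponential(rate=2):", exp_mean, exp_variance)
```

The exponential results should be close to $ \frac{1}{\lambda} = 0.5 $ and $ \frac{1}{\lambda^2} = 0.25 $.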

---

### **Key Differences Summary**

| Feature                        | Discrete Random Variable          | Continuous Random Variable        |
|---------------------------------|-----------------------------------|-----------------------------------|
| **Possible Values**             | Countable (finite or countably infinite) | Uncountable (in a range/interval) |
| **Probability Function**        | Probability Mass Function (PMF)   | Probability Density Function (PDF)|
| **Specific Value Probability**  | Can calculate $ P(X = x) $      | $ P(X = x) = 0 $, calculate over intervals |
| **Example**                     | Binomial, Poisson                 | Normal, Exponential               |
| **Probability Calculation**     | Directly for specific values      | Over an interval (area under curve) |
| **Mathematical Expectation**    | $ E(X) = \sum x_i \cdot P(X = x_i) $ | $ E(X) = \int x \cdot f(x) \, dx $ |

---

### **Conclusion:**

- **Discrete random variables** are more straightforward in terms of calculating probabilities and expectations, as the possible outcomes are countable and distinct.
- **Continuous random variables** deal with probabilities over intervals and require the use of integrals for calculations, making them more complex but useful for modeling real-world phenomena like time, distance, or measurements.


In [None]:
# Compare the behavior of Discrete and Continuous random variables


# Discrete Random Variable Example: Rolling a Die (Uniform Distribution)
# Simulating 1000 rolls of a fair six-sided die
discreteData = np.random.choice([1, 2, 3, 4, 5, 6], size=1000, p=[1/6]*6)

# Calculate probability of rolling a 3
probRoll3 = np.sum(discreteData == 3) / len(discreteData)
print(f"Probability of rolling a 3: {probRoll3:.2f}")

# Calculate the expected value (mean) and variance for the discrete data
expectedValueDiscrete = np.mean(discreteData)
discreteVariance = np.var(discreteData)
print(f"Discrete Expected Value (Mean): {expectedValueDiscrete:.2f}")
print(f"Discrete Variance: {discreteVariance:.2f}")

# Continuous Random Variable Example: Generating data from a Normal Distribution
# Simulating 1000 random variables from a normal distribution (mean=5, std=2)
mu = 5
sigma = 2
n = 1000
continuousData = np.random.normal(mu, sigma, n)

# Calculate probability of being between 4 and 6 (CDF approach)
cdf4 = sciStats.norm.cdf(4, mu, sigma)
cdf6 = sciStats.norm.cdf(6, mu, sigma)
probBetween4And6 = cdf6 - cdf4
print(f"Probability of being between 4 and 6: {probBetween4And6:.2f}")

# Calculate the expected value (mean) and variance for the continuous data
expectedValueContinuous = np.mean(continuousData)
continuousVariance = np.var(continuousData)
print(f"Continuous Expected Value (Mean): {expectedValueContinuous:.2f}")
print(f"Continuous Variance: {continuousVariance:.2f}")


## -- Sampling & Estimation --

In [None]:
# Random Sampling on dataset

def randomSamplingOnDataSet(dataset, sampleSize:int):
    randomSampleWithOutReplace = np.random.choice(dataset, size=sampleSize, replace=False)
    print(f"Random Sample (without replacement): {randomSampleWithOutReplace}")
    randomSampleWithReplace = np.random.choice(dataset, size=sampleSize, replace=True)
    print(f"Random Sample (with replacement): {randomSampleWithReplace}")
    return randomSampleWithReplace, randomSampleWithOutReplace

data = np.array([1,2,3,4,5,6,7,8,9])
randomSamplingOnDataSet(data, 3)

In [None]:
# Compute Sample mean and Variance

def sampleMeanAndVariance(data):
    sampleMean = np.mean(data)
    sampleVariance= np.var(data, ddof=1)
    print(f"Sample Mean: {sampleMean}")
    print(f"Sample Variance: {sampleVariance}")
    return sampleMean, sampleVariance

sampleMeanAndVariance([12,4,5,6,57,])

In [None]:
# Simulate and analyze the sampling distribution of the mean

def analyzeSampleDistributionMean():

    # Parameters for the population
    populationMean = 50
    populationStandard = 10
    populationSize = 10000

    # Create the population (e.g., normal distribution)
    populationData = np.random.normal(populationMean, populationStandard, populationSize)

    # Parameters for sampling
    sampleSize = 30  # Size of each sample
    totalSamples = 1000  # Number of samples to draw

    # List to store the sample means
    sampleMeans = []

    # Draw samples and calculate their means
    for _ in range(totalSamples):
        sample = np.random.choice(populationData, size=sampleSize, replace=False)
        sampleMeans.append(np.mean(sample))

    # Convert sample means to a numpy array for easier analysis
    sampleMeans = np.array(sampleMeans)

    # Compute the mean and standard deviation of the sample means
    meanSampleMeans = np.mean(sampleMeans)
    standardSampleMeans = np.std(sampleMeans)

    print(f"Mean of sample means: {meanSampleMeans}")
    print(f"Standard deviation of sample means: {standardSampleMeans}")
    # For comparison: the theoretical standard error is sigma / sqrt(n)
    print(f"Theoretical standard error: {populationStandard / np.sqrt(sampleSize)}")


analyzeSampleDistributionMean()

### **Stratified Sampling**

**Stratified sampling** is a method of sampling that involves dividing the population into distinct subgroups or **strata** that share a specific characteristic. Then, random samples are taken from each of these subgroups. This method ensures that each subgroup is represented in the sample, making it particularly useful when the population contains subgroups that may vary significantly from each other.

### **Steps for Stratified Sampling**:
1. **Divide the population into strata** based on a characteristic (e.g., age, income, gender).
2. **Randomly sample** from each stratum.
3. Combine the samples from each stratum to form the final sample.

The main goal of stratified sampling is to ensure that important subgroups are adequately represented in the sample.

---

### **Mathematical Concept**:
In stratified sampling, the population is divided into $ k $ strata, and a sample is drawn from each stratum.

#### Formula for Stratified Sampling:
- **Proportional Stratified Sampling**: The number of samples selected from each stratum is proportional to the size of the stratum in the population.
  
  $
  n_i = \frac{N_i}{N} \times n
  $
  Where:
  - $ n_i $ = number of samples to be drawn from stratum $ i $
  - $ N_i $ = size of stratum $ i $
  - $ N $ = total population size
  - $ n $ = total sample size
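
Because $ n_i $ from this formula is rarely a whole number, an implementation has to round. The sketch below is one possible way to do that, assuming a largest-remainder rule so that the allocations always sum to $ n $. The function name `proportionalAllocation` and the stratum sizes are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def proportionalAllocation(stratumSizes, totalSampleSize):
    """Allocate totalSampleSize across strata in proportion to their sizes."""
    sizes = np.asarray(stratumSizes, dtype=float)
    exact = sizes * totalSampleSize / sizes.sum()   # n_i = N_i / N * n (may be fractional)
    allocation = np.floor(exact).astype(int)        # round every stratum down first
    shortfall = totalSampleSize - allocation.sum()  # samples still unassigned
    # Give the leftover samples to the strata with the largest fractional parts
    order = np.argsort(-(exact - allocation), kind="stable")
    allocation[order[:shortfall]] += 1
    return allocation

# Hypothetical population: three strata of sizes 500, 300 and 200
allocation = proportionalAllocation([500, 300, 200], 25)
print(allocation)
```

Rounding each stratum independently (for example with `np.rint`) can leave the total one sample above or below $ n $, which is why the leftover samples are handed out explicitly here.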

---


### **Key Notes**:
1. **Proportional Stratified Sampling** ensures that each subgroup is represented according to its size in the population.
2. **Equal Allocation** (an alternative approach) can be used where the same number of samples are drawn from each stratum, regardless of the strata's size in the population.

---

### **Applications of Stratified Sampling**:
- **Ensuring Representation**: When certain subgroups are smaller but important, stratified sampling ensures that they are not overlooked in the sample.
- **Improved Precision**: When the variability within strata is lower than the variability in the overall population, stratified sampling can result in more precise estimates of population parameters.
- **Market Research**: It can be used in market research to ensure all customer segments (e.g., by age or income) are represented in the sample.
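
The precision claim can be illustrated with a small simulation. This is a sketch under assumed conditions: a hypothetical population made of two equally sized strata whose means differ widely, so that within-stratum variability is much lower than overall variability. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: two equally sized strata with very different means
stratumA = rng.normal(20, 5, 5000)
stratumB = rng.normal(80, 5, 5000)
population = np.concatenate((stratumA, stratumB))

sampleSize = 20      # total sample size per replication
replications = 2000  # number of repeated samples

simpleMeans = np.empty(replications)
stratifiedMeans = np.empty(replications)
for r in range(replications):
    # Simple random sample from the whole population
    simpleMeans[r] = rng.choice(population, sampleSize, replace=False).mean()
    # Proportional stratified sample: half of the sample from each stratum
    partA = rng.choice(stratumA, sampleSize // 2, replace=False)
    partB = rng.choice(stratumB, sampleSize // 2, replace=False)
    stratifiedMeans[r] = np.concatenate((partA, partB)).mean()

print(f"Variance of simple random sample means: {simpleMeans.var():.3f}")
print(f"Variance of stratified sample means:    {stratifiedMeans.var():.3f}")
```

With strata this different, the stratified estimate of the mean should vary far less from sample to sample, because the between-stratum variation no longer contributes to the sampling error.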


In [None]:
# Stratified Sampling on dataset

def stratifiedSampling(data):
    # Step 1: Define the strata (we will stratify by 'Gender')
    maleData = [data['Age'][i] for i in range(len(data['Gender'])) if data['Gender'][i] == 'Male']
    femaleData = [data['Age'][i] for i in range(len(data['Gender'])) if data['Gender'][i] == 'Female']

    # Step 2: Define the total sample size
    totalSampleSize = 6

    # Step 3: Calculate the number of samples for each stratum (proportional allocation)
    maleSampleSize = int(np.rint(len(maleData) / len(data['Gender']) * totalSampleSize))
    femaleSampleSize = totalSampleSize - maleSampleSize  # Remaining samples for females

    # Step 4: Randomly sample from each stratum
    maleSample = np.random.choice(maleData, size=maleSampleSize, replace=False)
    femaleSample = np.random.choice(femaleData, size=femaleSampleSize, replace=False)

    # Step 5: Combine the samples from each stratum
    stratifiedSample = np.concatenate((maleSample, femaleSample))

    print("Stratified Sample:")
    print(stratifiedSample)
    return stratifiedSample

# Sample dataset (replace with your actual dataset)
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
}
stratifiedSampling(data)

### **Maximum Likelihood Estimation (MLE)**

**Maximum Likelihood Estimation (MLE)** is a method for estimating the parameters of a statistical model. The idea is to find the parameter values that maximize the likelihood of observing the given sample data.

In MLE, we define the **likelihood function** for the data given the parameters of the model. Then, we seek the parameter values that maximize this likelihood.

### **Steps for Maximum Likelihood Estimation**:
1. **Define the Likelihood Function**: Given a statistical model, the likelihood function represents the probability of observing the sample data as a function of the model parameters.
2. **Maximize the Likelihood**: We estimate the parameters by finding the values that maximize the likelihood function. Often, we work with the **log-likelihood** because it simplifies the maximization process.
3. **Estimate the Parameters**: The values of the parameters that maximize the likelihood are the MLE estimates.

---

### **Mathematical Formulation of MLE**:

Suppose we have a dataset $ X = \{x_1, x_2, ..., x_n\} $, and we are trying to estimate the parameters $ \theta $ of a probability distribution $ f(x|\theta) $.

1. **Likelihood Function**:
   The likelihood function $ L(\theta) $ is the joint probability of observing the data:
   $
   L(\theta) = \prod_{i=1}^{n} f(x_i|\theta)
   $

2. **Log-Likelihood Function**:
   The log-likelihood is the logarithm of the likelihood function, which is easier to work with:
   $
   \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i|\theta)
   $

3. **Maximizing the Likelihood**:
   To find the maximum likelihood estimate, we differentiate the log-likelihood with respect to $ \theta $ and set it equal to zero:
   $
   \frac{d}{d\theta} \ell(\theta) = 0
   $
   Solving this equation gives us the MLE for the parameter $ \theta $.
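
One practical reason for preferring the log-likelihood is numerical: the likelihood is a product of many small numbers and underflows to zero in floating point, while the sum of logs stays well within range. The following sketch shows this on simulated normal data; the parameters and sample size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 50.0, 10.0            # assumed known parameters for the demonstration
x = rng.normal(mu, sigma, 2000)   # simulated i.i.d. observations

# Normal density evaluated at every observation
density = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

likelihood = np.prod(density)            # product of 2000 values below 0.04 underflows
logLikelihood = np.sum(np.log(density))  # the sum of logs stays representable

print(f"Likelihood:     {likelihood}")
print(f"Log-likelihood: {logLikelihood}")
```

Because the logarithm is strictly increasing, the parameter values that maximize the log-likelihood are the same ones that maximize the likelihood.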

---

### **MLE Example: Estimating the Parameters of a Normal Distribution**

For simplicity, let's consider the case of estimating the parameters of a **Normal Distribution** with unknown mean $ \mu $ and standard deviation $ \sigma $ using MLE.

- **Normal Distribution**: The probability density function (PDF) for a normal distribution is:
  $
  f(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
  $

- **Likelihood Function**:
  For a dataset $ X = \{x_1, x_2, ..., x_n\} $, the likelihood function is:
  $
  L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
  $

- **Log-Likelihood**:
  The log-likelihood function is:
  $
  \ell(\mu, \sigma) = -n\log(\sigma) - \frac{n}{2}\log(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2
  $

- **Maximizing the Log-Likelihood**:
  To estimate the parameters $ \mu $ and $ \sigma $, we maximize the log-likelihood. The solutions are:
  - $ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i $ (Sample mean)
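  - $ \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2} $ (Standard deviation with divisor $ n $, which is slightly smaller than the usual sample standard deviation that divides by $ n - 1 $)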
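
The closed-form solutions can be sanity-checked numerically. The sketch below evaluates the log-likelihood given above on a grid around $ (\hat{\mu}, \hat{\sigma}) $ and confirms that no grid point does better. The helper name `normalLogLikelihood` and the grid ranges are illustrative choices; the data is the small example dataset used in this notebook's MLE cell.

```python
import numpy as np

def normalLogLikelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. normal data for the given mu and sigma."""
    n = len(data)
    return (-n * np.log(sigma) - n / 2 * np.log(2 * np.pi)
            - np.sum((data - mu) ** 2) / (2 * sigma ** 2))

data = np.array([50, 52, 53, 55, 48, 49, 50, 51, 55, 54], dtype=float)

muHat = data.mean()          # closed-form MLE of mu
sigmaHat = data.std(ddof=0)  # closed-form MLE of sigma (divisor n)
best = normalLogLikelihood(data, muHat, sigmaHat)

# Evaluate the log-likelihood on a grid around the closed-form solution
muGrid = np.linspace(muHat - 2, muHat + 2, 81)
sigmaGrid = np.linspace(0.5 * sigmaHat, 2 * sigmaHat, 81)
gridMax = max(normalLogLikelihood(data, m, s) for m in muGrid for s in sigmaGrid)

print(f"muHat = {muHat:.3f}, sigmaHat = {sigmaHat:.3f}")
print(f"Log-likelihood at the MLE: {best:.4f}")
print(f"Largest value on the grid: {gridMax:.4f}")
```

A grid search is only a check, not a proof, but it is a quick way to catch a sign error or a wrong divisor in a hand-derived estimator.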


---

### **Conclusion**:

Using MLE, we can estimate the parameters of a statistical model, such as the **mean** and **standard deviation** of a normal distribution, by finding the values that maximize the likelihood of the observed data.

In this example, we applied MLE to estimate the parameters of a normal distribution, but the same concept can be applied to other distributions by defining the likelihood function for those distributions.


In [None]:
# Estimate population parameters via maximum likelihood estimation

def populationMaximumLikelihoodEstimation(data):
    MLEHat = np.mean(data)
    MLEStandardDeviation = np.std(data, ddof=0)  # ddof=0 divides by n, matching the MLE
    print(f"Estimated Mean (mu): {MLEHat}")
    print(f"Estimated Standard Deviation (sigma): {MLEStandardDeviation}")

data = np.array([50, 52, 53, 55, 48, 49, 50, 51, 55, 54])
populationMaximumLikelihoodEstimation(data)

### **Bootstrap Sampling for Confidence Interval**

**Bootstrap sampling** is a powerful statistical technique that involves resampling with replacement from a given dataset to estimate the sampling distribution of a statistic. This method can be used to estimate confidence intervals for any statistic (mean, median, variance, etc.).

### **Steps for Bootstrap Sampling to Compute Confidence Interval**:

1. **Resample**: Randomly draw samples from the original dataset with replacement. The number of samples drawn is the same as the size of the original dataset.
2. **Calculate Statistic**: For each resample, compute the statistic of interest (e.g., mean, median, etc.).
3. **Repeat**: Repeat the resampling process many times (typically 1,000 or more) to build a distribution of the statistic.
4. **Compute Confidence Interval**: From the distribution of the statistic, compute the confidence interval by finding the appropriate percentiles.

The simplest bootstrap confidence intervals are **percentile-based** intervals. For example, to compute a 95% confidence interval, we would take the 2.5th and 97.5th percentiles of the resampled statistic values.

---

### **Mathematical Formulation**:

Given a sample $ X = \{x_1, x_2, ..., x_n\} $, we want to estimate the confidence interval for a statistic $ \theta $ (e.g., the mean).

1. **Resample** the data with replacement to create $ B $ bootstrap samples $ X^* = \{X_1^*, X_2^*, ..., X_B^*\} $.
2. **Compute the statistic** $ \hat{\theta}^* $ for each resample.
3. **Determine the percentiles** of the distribution of $ \hat{\theta}^* $.

For a 95% confidence interval:
$
\left[\text{percentile}(2.5\%), \text{percentile}(97.5\%)\right]
$


---

### **Key Points**:

1. **Resampling with Replacement**: In each iteration, we sample with replacement from the dataset to create a bootstrap sample.
2. **Distribution of the Statistic**: By repeating the sampling process many times (typically 1,000 or more), we create a distribution of the statistic (mean in this case).
3. **Percentile Confidence Interval**: The confidence interval is obtained by taking the appropriate percentiles of the bootstrap distribution.

---

### **Applications of Bootstrap Sampling**:
- **Non-parametric Estimation**: It doesn't assume a particular distribution for the data and can be used for almost any statistic.
- **Confidence Intervals**: It can be used to estimate confidence intervals for complex statistics that are difficult to handle analytically.
- **Model Evaluation**: It's commonly used in machine learning to assess the performance of models, especially when sample sizes are small.
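
Because the procedure never uses the formula of the statistic, the same resampling idea works for statistics other than the mean. The following is a minimal sketch for the median, assuming the statistic is a NumPy function that accepts an `axis` argument. The name `bootstrapPercentileInterval` and its parameters are illustrative, not part of the original notebook.

```python
import numpy as np

def bootstrapPercentileInterval(data, statistic, nResamples=2000, confidenceLevel=0.95, seed=0):
    """Percentile bootstrap interval for a statistic that accepts an `axis` argument."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Draw all resamples at once: one row of indices per bootstrap sample
    indices = rng.integers(0, len(data), size=(nResamples, len(data)))
    estimates = statistic(data[indices], axis=1)
    tail = (1 - confidenceLevel) / 2 * 100
    return np.percentile(estimates, [tail, 100 - tail])

data = np.array([50, 52, 53, 55, 48, 49, 50, 51, 55, 54])
lower, upper = bootstrapPercentileInterval(data, np.median)
print(f"95% bootstrap interval for the median: ({lower}, {upper})")
```

Drawing an index matrix instead of looping keeps the resampling vectorized, at the cost of holding all resamples in memory at once.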


In [None]:
# bootstrap sampling to compute confidence interval


# Function to perform bootstrap sampling and compute confidence interval
def bootstrapConfidenceInterval(data, samples, confidenceLevel):
    # Array to store the mean of each bootstrap sample
    bootstrapMeans = []

    # Perform bootstrap sampling
    for _ in range(samples):
        sample = np.random.choice(data, size=len(data), replace=True)  # Resampling with replacement
        bootstrapMeans.append(np.mean(sample))  # Compute the statistic (mean in this case)

    # Calculate the confidence interval
    lowerPercentile = (1 - confidenceLevel) / 2 * 100
    upperPercentile = (1 + confidenceLevel) / 2 * 100

    # Compute the percentiles for the confidence interval
    lowerBound = np.percentile(bootstrapMeans, lowerPercentile)
    upperBound = np.percentile(bootstrapMeans, upperPercentile)
    print(f"{int(confidenceLevel * 100)}% Confidence Interval for the Mean: ({lowerBound}, {upperBound})")
    return lowerBound, upperBound




# Sample dataset (replace with your actual dataset)
data = np.array([50, 52, 53, 55, 48, 49, 50, 51, 55, 54])

# Number of bootstrap samples
samples = 1000

# Confidence level
confidenceLevel = 0.95

# Compute the 95% confidence interval for the mean
bootstrapConfidenceInterval(data, samples, confidenceLevel)



### **Computing Bias and Variance of an Estimator Using Simulation**

In statistical estimation, the **bias** and **variance** of an estimator provide important information about how well the estimator performs in approximating the true population parameter.

- **Bias** of an estimator $ \hat{\theta} $ is the difference between the expected value of the estimator and the true value of the parameter:
  $
  \text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta
  $
  If the bias is zero, the estimator is said to be **unbiased**.

- **Variance** of an estimator measures how much the estimator varies from its expected value:
  $
  \text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]
  $

We can estimate both bias and variance using **simulation**. In a simulation, we repeat the process of estimating the parameter from multiple random samples and then compute the average bias and variance of the estimators.

### **Steps to Compute Bias and Variance Using Simulation**:

1. **Define the True Parameter**: Specify the true population parameter $ \theta $ that you want to estimate (e.g., mean, variance).
2. **Simulate Samples**: Generate multiple random samples from a population with the true parameter.
3. **Estimate the Parameter**: For each sample, compute the estimator (e.g., sample mean, sample variance).
4. **Compute Bias**: Calculate the difference between the average of the estimators and the true parameter.
5. **Compute Variance**: Calculate the variance of the estimators.


### **Explanation of Results**:

- **Bias**: 
  - If the bias is close to 0, the estimator is approximately unbiased. For the simulation below, the bias comes out very small (close to zero), which is consistent with the sample mean being an **unbiased estimator** of the population mean.
  
- **Variance**: 
  - The variance indicates how much the sample mean varies around its expected value across different simulations. For samples of size $ n $ it should be close to $ \sigma^2 / n $. A smaller variance means that the estimator is more stable and reliable.

---

### **General Steps to Apply This Method**:

1. **Choose an Estimator**: You can apply this approach to any estimator (e.g., sample median, sample variance) by defining the true parameter and estimating it from the sample.
2. **Repeat the Process**: Run the simulation multiple times (e.g., 1000 simulations) to get a reliable estimate of the bias and variance.
3. **Calculate the Metrics**: Use the average of the estimator values to compute the bias, and the variance formula to compute the estimator's variability.
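
As an illustration of these steps with an estimator that is actually biased, the sketch below compares the two common variance estimators: dividing by $ n $ (`ddof=0`) and dividing by $ n - 1 $ (`ddof=1`). The population parameters and simulation sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
trueMean, trueStd = 50, 10
sampleSize, nSimulations = 30, 20000
trueVariance = trueStd ** 2

# One row per simulated sample
samples = rng.normal(trueMean, trueStd, size=(nSimulations, sampleSize))

for ddof in (0, 1):
    estimates = samples.var(axis=1, ddof=ddof)  # variance estimate for every sample
    bias = estimates.mean() - trueVariance
    variance = estimates.var()
    print(f"ddof={ddof}: bias = {bias:.3f}, variance of estimator = {variance:.1f}")

print(f"Theoretical bias for ddof=0: {-trueVariance / sampleSize:.3f}")
```

The divisor-$ n $ estimator has expected value $ \frac{n-1}{n}\sigma^2 $, so its bias is $ -\sigma^2 / n $, while the divisor-$ (n-1) $ estimator is unbiased; the simulated values should land near those figures up to Monte Carlo noise.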

---

### **Applications of Bias and Variance Estimation**:

- **Model Evaluation**: Understanding the bias and variance of estimators helps in evaluating the performance of machine learning models, especially when considering trade-offs like **bias-variance tradeoff**.
- **Simulation Studies**: Used in research to simulate and evaluate the properties of statistical estimators before applying them to real-world data.
- **Algorithm Comparison**: By simulating different estimators, you can choose the one with the smallest bias and variance for more accurate and stable parameter estimation.


In [None]:
# compute the bias and variance of estimator using simulation

import numpy as np



# Function to calculate bias and variance of the sample mean estimator
def biasVarianceSimulation(trueMean, trueStandardDeviation, sampleSize, nSimulations):
    # Store the sample means
    sampleMeans = []

    for _ in range(nSimulations):
        # Simulate a random sample from the population
        sample = np.random.normal(trueMean, trueStandardDeviation, sampleSize)

        # Estimate the parameter (mean of the sample)
        sampleMean = np.mean(sample)
        sampleMeans.append(sampleMean)

    # Compute the bias of the estimator
    estimatedMean = np.mean(sampleMeans)
    bias = estimatedMean - trueMean

    # Compute the variance of the estimator
    variance = np.var(sampleMeans)

    return bias, variance


# Set the true population mean (parameter)
trueMean = 50
trueStd = 10
sampleSize = 30
numSimulations = 1000
# Run the simulation
bias, variance = biasVarianceSimulation(trueMean, trueStd, sampleSize, numSimulations)

# Print the results
print(f"Bias of the sample mean estimator: {bias}")
print(f"Variance of the sample mean estimator: {variance}")


### **Verifying the Law of Large Numbers (LLN)**

The **Law of Large Numbers (LLN)** is a fundamental theorem in probability theory and statistics. It states that as the size of a sample increases, the sample mean (or average) of the observed values gets closer to the true population mean.

There are two main versions of LLN:
1. **Weak Law of Large Numbers (WLLN)**: This states that for a sequence of independent and identically distributed (i.i.d.) random variables, the sample mean converges in probability to the expected value (population mean) as the sample size increases.
2. **Strong Law of Large Numbers (SLLN)**: This states that the sample mean almost surely converges to the population mean as the sample size increases.

The idea behind verifying LLN is to simulate how the sample mean approaches the true population mean as the sample size increases.

### **Steps to Verify the Law of Large Numbers**:

1. **Define the True Population Mean**: Choose a probability distribution with a known mean (e.g., normal distribution with known mean).
2. **Generate Random Samples**: Simulate random samples from the chosen distribution.
3. **Compute Sample Mean**: For increasing sample sizes, compute the sample mean at each step.
4. **Observe Convergence**: Plot or track how the sample mean converges to the true population mean as the sample size increases.


---

### **Mathematical Formulation of LLN**:

- Let $ X_1, X_2, ..., X_n $ be a sequence of i.i.d. random variables with a population mean $ \mu = E[X_i] $.
- The sample mean $ \hat{\mu}_n $ for a sample of size $ n $ is:
  $
  \hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} X_i
  $
- The Law of Large Numbers states that as $ n \to \infty $:
  $
  \hat{\mu}_n \xrightarrow{P} \mu
  $
  This means that as the sample size increases, the sample mean $ \hat{\mu}_n $ converges to the true mean $ \mu $ in probability.

---

### **Step-by-Step Guide to Verify LLN Using Simulation**:

1. **Choose a Distribution**: Pick a distribution, such as the **normal distribution**, with known parameters (e.g., mean = 50, standard deviation = 10).

2. **Simulate Random Samples**: Generate random samples from this distribution using a fixed random seed for reproducibility.

3. **Calculate Sample Mean**: As you simulate increasing sample sizes, compute the sample mean for each sample size.

4. **Observe Convergence**: Track how the sample mean approaches the true mean.
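
One compact way to follow these steps is to draw a single long sequence and track its running mean, so that each additional observation extends the previous sample. The sketch below assumes that design and a seed of 42, both illustrative choices; it differs from drawing an independent sample for every sample size.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is reproducible
trueMean, trueStd, maxSampleSize = 50, 10, 10000

draws = rng.normal(trueMean, trueStd, maxSampleSize)
# runningMeans[n - 1] is the mean of the first n draws
runningMeans = np.cumsum(draws) / np.arange(1, maxSampleSize + 1)

for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: sample mean = {runningMeans[n - 1]:.3f}")
```

Using one growing sequence mirrors the statement of the law, which concerns the mean of the first $ n $ terms of a single i.i.d. sequence, and it needs only `maxSampleSize` random draws in total.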

---


### **How It Verifies LLN**:

- **Convergence of Sample Mean**: The sample mean fluctuates for small sample sizes, but as the sample size grows, it gets closer to the true mean. This demonstrates the **convergence** predicted by the Law of Large Numbers.
  
- **Accuracy of Estimator**: With larger sample sizes, the sample mean becomes a better estimate of the population mean.

---

### **Key Takeaways**:

1. **LLN Convergence**: As the sample size increases, the sample mean converges to the true population mean. This confirms the Law of Large Numbers.
2. **Simulation**: The simulation method allows us to empirically observe the behavior of the sample mean.
3. **Practical Implication**: In practice, we can rely on large samples to accurately estimate population parameters.


In [None]:
# Verify the Law of Large Numbers

# Function to simulate LLN
def verifyLawOfLargeNumber(trueMean, trueStd, maxSampleSize):
    sampleMeans = []

    for n in range(1, maxSampleSize + 1):
        # Generate a fresh random sample of size n
        sample = np.random.normal(trueMean, trueStd, n)

        # Compute the sample mean
        sampleMean = np.mean(sample)
        sampleMeans.append(sampleMean)

    return sampleMeans

trueMean = 50
trueStd = 10
maxSampleSize = 10000

# Fix the random seed so the run is reproducible
np.random.seed(42)

# Run the simulation
sampleMeans = verifyLawOfLargeNumber(trueMean, trueStd, maxSampleSize)

# Print the true mean and the sample mean at the largest sample size
print(f"True mean: {trueMean}")
print(f"Sample mean for large sample size: {sampleMeans[-1]}")




### **Simulating and Computing the Impact of Sample Size on Estimation Accuracy**

In statistics, the accuracy of an estimator improves as the sample size increases. This is a fundamental concept, as larger samples tend to produce more reliable estimates of population parameters.

To simulate and compute the impact of sample size on estimation accuracy, we typically look at how the **sample mean** (or any other estimator) approaches the **true population mean** as the sample size increases.

### **Steps to Simulate and Compute Impact of Sample Size on Estimation Accuracy**:

1. **Define the True Population Parameter**: Choose a distribution with known parameters (e.g., population mean and standard deviation).
2. **Generate Random Samples**: Simulate random samples of increasing sizes.
3. **Estimate the Parameter**: Compute the estimator (e.g., sample mean) for each sample.
4. **Compute Estimation Accuracy**: For each sample size, compute the **bias** and **mean squared error (MSE)** to measure how well the estimator approximates the true parameter.

---

### **Mathematical Formulation**:

1. **Bias** of an estimator $ \hat{\theta}_n $ for sample size $ n $:
   $
   \text{Bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta
   $
   Where $ \theta $ is the true population parameter.

2. **Mean Squared Error (MSE)** of an estimator $ \hat{\theta}_n $ is a measure of the average of the squared differences between the estimator and the true parameter:
   $
   \text{MSE}(\hat{\theta}_n) = E[(\hat{\theta}_n - \theta)^2]
   $
   MSE combines both **variance** and **bias**:
   $
   \text{MSE}(\hat{\theta}_n) = \text{Var}(\hat{\theta}_n) + \text{Bias}^2(\hat{\theta}_n)
   $

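3. **Accuracy of Estimation**: As the sample size $ n $ increases, the **MSE** of a consistent estimator decreases, demonstrating that larger samples yield more accurate estimates. For the sample mean the **bias** is zero at every $ n $, so its MSE equals its variance $ \sigma^2 / n $.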
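
The decomposition of MSE into variance plus squared bias can be checked numerically. In the sketch below the expectations are replaced by averages over simulated samples, for which the identity holds exactly when the variance uses divisor `nSimulations`. The parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
trueMean, trueStd = 50, 10
sampleSize, nSimulations = 30, 5000

# Sample mean of each simulated sample
estimates = rng.normal(trueMean, trueStd, size=(nSimulations, sampleSize)).mean(axis=1)

bias = estimates.mean() - trueMean
variance = estimates.var()                    # divisor nSimulations (ddof=0)
mse = np.mean((estimates - trueMean) ** 2)

print(f"Bias^2 + Variance = {bias ** 2 + variance:.4f}")
print(f"MSE               = {mse:.4f}")
print(f"sigma^2 / n       = {trueStd ** 2 / sampleSize:.4f}")
```

The first two printed values agree up to floating-point rounding, and both should sit close to $ \sigma^2 / n $ because the sample mean is unbiased.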

---

### **Step-by-Step Guide for Simulation**:

1. **Choose a Distribution**: Select a distribution, e.g., a **normal distribution** with known mean and standard deviation.
2. **Vary Sample Sizes**: Choose different sample sizes, e.g., 10, 100, 1000, etc.
3. **Simulate Samples**: For each sample size, simulate random samples and compute the sample mean.
4. **Compute Accuracy Metrics**:
   - **Bias**: Calculate the difference between the average of the simulated sample means and the true mean.
   - **Mean Squared Error (MSE)**: Calculate the MSE for each sample size.

---


### **Interpretation of Results**:

1. **Bias**: The sample mean is unbiased at every sample size, so the simulated bias only fluctuates around zero. Those fluctuations shrink as the sample size increases because the sample means themselves vary less.
   
2. **MSE**: The **Mean Squared Error** also decreases as the sample size increases, showing that larger samples yield more precise estimates of the true population mean.

   - For smaller sample sizes, the sample mean fluctuates more, leading to higher MSE.
   - As the sample size increases, the estimator stabilizes, and MSE decreases.

### **Key Takeaways**:

- **Larger Sample Sizes = More Accurate Estimation**: As the sample size increases, the sample mean (or any consistent estimator) becomes a more accurate estimate of the true population parameter.
- **Bias and MSE**: The MSE decreases with increasing sample size while the simulated bias stays close to zero, confirming that larger samples are more reliable.
- **Statistical Confidence**: This shows that with larger sample sizes, we can have greater confidence in our estimates.

---


In [None]:
# Simulate and compute the impact of the sample size on estimation accuracy


# Function to simulate and compute bias and MSE for different sample sizes
def sampleSizeImpactSimulation(trueMean, trueStandardDeviation, sampleSizes, nSimulations=1000):
    biasValues = []
    mseValues = []

    # Run simulations for each sample size
    for n in sampleSizes:
        sampleMeans = []

        # Perform multiple simulations for the current sample size
        for _ in range(nSimulations):
            sample = np.random.normal(trueMean, trueStandardDeviation, n)  # Generate random sample
            sampleMean = np.mean(sample)  # Calculate sample mean
            sampleMeans.append(sampleMean)

        # Compute the bias (difference between average sample mean and true mean)
        estimatedMean = np.mean(sampleMeans)
        bias = estimatedMean - trueMean

        # Compute the MSE (mean squared error)
        mse = np.mean((np.array(sampleMeans) - trueMean) ** 2)

        # Store results
        biasValues.append(bias)
        mseValues.append(mse)

    return biasValues, mseValues


# Define the true population parameters
trueMean = 50
trueStd = 10

# Define the sample sizes to test
sampleSizes = [10, 50, 100, 500, 1000, 5000]
# Run the simulation for different sample sizes
biasValues, mse_values = sampleSizeImpactSimulation(trueMean, trueStd, sampleSizes)

# Print the results
for i, n in enumerate(sampleSizes):
    print(f"Sample Size: {n}")
    print(f"  Bias: {biasValues[i]}")
    print(f"  MSE: {mse_values[i]}")
    print("-" * 30)


### **Standard Error of the Sample Mean**

The **Standard Error of the Sample Mean (SE)** measures the variability of the sample mean as an estimate of the population mean. It quantifies the uncertainty in the sample mean due to random sampling.

---

### **Formula for Standard Error**:

The Standard Error is calculated using the formula:

$
SE = \frac{\sigma}{\sqrt{n}}
$

Where:
- $ \sigma $: Population standard deviation (if known) or sample standard deviation (if population standard deviation is unknown).
- $ n $: Sample size.

---

### **Steps to Compute Standard Error**:

1. **Determine the Standard Deviation**:
   - If the population standard deviation $ \sigma $ is known, use it directly.
   - If not, estimate it using the sample standard deviation $ s $, calculated as:
     $
     s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}
     $
     where $ \bar{x} $ is the sample mean, and $ x_i $ are the data points.

2. **Determine the Sample Size $ n $**:
   - Count the number of observations in the sample.

3. **Calculate the Standard Error**:
   - Plug the values of $ s $ (or $ \sigma $) and $ n $ into the formula:
     $
     SE = \frac{s}{\sqrt{n}}
     $

---

### **Key Insights**:

1. **Smaller Standard Error**:
   - A smaller $ SE $ indicates that the sample mean is a more precise estimate of the population mean.
   - $ SE $ decreases as the sample size $ n $ increases.

2. **Variability**:
   - If the sample standard deviation $ s $ is large, $ SE $ will also be larger, indicating greater variability in the data.

3. **Use in Confidence Intervals**:
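   - The $ SE $ is often used to construct confidence intervals for the population mean, where $ z $ is the standard normal critical value (about 1.96 for 95%). For small samples with an estimated $ s $, a t critical value with $ n - 1 $ degrees of freedom is the usual substitute: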
     $
     \text{Confidence Interval} = \bar{x} \pm z \cdot SE
     $
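
As a short illustration of this formula, the sketch below builds a 95% interval for a small example dataset. It assumes the normal critical value from the standard library's `statistics.NormalDist`; the variable names are illustrative.

```python
import numpy as np
from statistics import NormalDist

data = np.array([12, 15, 14, 10, 13, 15, 11, 14, 13, 12])

sampleMean = data.mean()
standardErrorOfMean = data.std(ddof=1) / np.sqrt(len(data))

confidenceLevel = 0.95
z = NormalDist().inv_cdf((1 + confidenceLevel) / 2)  # about 1.96 for 95%

lower = sampleMean - z * standardErrorOfMean
upper = sampleMean + z * standardErrorOfMean
print(f"Mean = {sampleMean:.2f}, SE = {standardErrorOfMean:.4f}")
print(f"{confidenceLevel:.0%} confidence interval: ({lower:.3f}, {upper:.3f})")
```

With only ten observations and an estimated standard deviation, an interval based on the t-distribution would be somewhat wider; the $ z $ version is shown here because it matches the formula as written.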


In [None]:
# Calculate Standard error for sample mean

# Function to calculate standard error
def standardError(sample):
    n = len(sample)  # Sample size
    sampleStandardDeviation = np.std(sample, ddof=1)  # Sample standard deviation (ddof=1 for the unbiased variance estimate)
    standardErrorValue = sampleStandardDeviation / np.sqrt(n)  # Standard Error formula
    return standardErrorValue

# Example usage
data = [12, 15, 14, 10, 13, 15, 11, 14, 13, 12]  # Example dataset
stdError = standardError(data)

print(f"Sample Data: {data}")
print(f"Standard Error: {stdError}")


## -- Hypothesis Testing --

### **One-Sample t-Test**

A **one-sample t-test** is used to determine whether the mean of a sample differs significantly from a known or hypothesized value.

---

### **Mathematical Formulation**

1. **Null Hypothesis ($ H_0 $)**: The population mean ($ \mu $) is equal to the hypothesized value ($ \mu_0 $).
   $
   H_0: \mu = \mu_0
   $

2. **Alternative Hypothesis ($ H_1 $)**: The population mean ($ \mu $) is not equal to $ \mu_0 $ (two-tailed), or is greater/less than $ \mu_0 $ (one-tailed).

3. **Test Statistic** ($ t $):
   $
   t = \frac{\bar{x} - \mu_0}{SE}
   $
   Where:
   - $ \bar{x} $: Sample mean
   - $ \mu_0 $: Known or hypothesized value
   - $ SE $: Standard error of the sample mean:
     $
     SE = \frac{s}{\sqrt{n}}
     $
     - $ s $: Sample standard deviation
     - $ n $: Sample size

4. **Degrees of Freedom (df)**:
   $
   df = n - 1
   $

5. **p-value**: Compare the test statistic to the t-distribution to compute the p-value.

---

### **Steps for Performing a One-Sample t-Test**

1. **Formulate the Hypotheses**:
   - Null hypothesis ($ H_0 $): The population mean equals the hypothesized value.
   - Alternative hypothesis ($ H_1 $): The population mean differs from the hypothesized value.

2. **Compute the t-statistic**:
   - Calculate the sample mean, sample standard deviation, and standard error.
   - Plug these into the formula for the t-statistic.

3. **Determine the p-value**:
   - Use the t-distribution with $ n-1 $ degrees of freedom to compute the p-value.

4. **Compare the p-value to the significance level ($ \alpha $)**:
   - If $ p \leq \alpha $, reject the null hypothesis.
   - If $ p > \alpha $, fail to reject the null hypothesis.

---

### **Interpretation of Results**

1. **Sample Mean**: The mean of the given sample.
2. **t-Statistic**: The test statistic calculated based on the sample data.
3. **p-Value**:
   - If $ p \leq \alpha $: Reject $ H_0 $. The sample mean significantly differs from the hypothesized mean.
   - If $ p > \alpha $: Fail to reject $ H_0 $. The sample mean does not significantly differ from the hypothesized mean.

4. **Result**: For the example, since $ p > 0.05 $, we fail to reject the null hypothesis, meaning the sample mean does not significantly differ from 13.
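
As a check on this conclusion, the arithmetic for the example dataset used in the code cell below (`[12, 15, 14, 10, 13, 15, 11, 14, 13, 12]`, with $ \mu_0 = 13 $) works out as follows, with values rounded:

$
\bar{x} = \frac{129}{10} = 12.9, \qquad s = \sqrt{\frac{24.9}{9}} \approx 1.663, \qquad SE = \frac{1.663}{\sqrt{10}} \approx 0.526
$

$
t = \frac{12.9 - 13}{0.526} \approx -0.19, \qquad df = 10 - 1 = 9
$

The two-tailed critical value at $ \alpha = 0.05 $ with 9 degrees of freedom is about 2.262. Since $ |t| \approx 0.19 $ is far below it, the p-value is well above 0.05, which is why the test fails to reject $ H_0 $.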

---


In [None]:
# Perform a one-sample t-test to check whether a sample mean differs from a known value


def oneSampleTTest(sample, hypothesizedMean, alpha=0.05)->dict:
    # Step 1: Calculate sample statistics
    sampleMean = np.mean(sample)
    sampleStd = np.std(sample, ddof=1)  # Unbiased standard deviation
    n = len(sample)
    standardError = sampleStd / np.sqrt(n)

    # Step 2: Compute the t-statistic
    tStatistic = (sampleMean - hypothesizedMean) / standardError

    # Step 3: Compute degrees of freedom
    degreesOfFreedom = n - 1

    # Step 4: Compute the p-value (two-tailed test)
    pValue = 2 * sciStats.t.sf(abs(tStatistic), df=degreesOfFreedom)

    # Step 5: Compare the p-value with the significance level
    if pValue <= alpha:
        result = "Reject the null hypothesis (significant difference)"
    else:
        result = "Fail to reject the null hypothesis (no significant difference)"
    
    # Return results
    return {
        "Sample Mean": sampleMean,
        "T-Statistic": tStatistic,
        "P-Value": pValue,
        "Result": result
    }

# Example usage
data = [12, 15, 14, 10, 13, 15, 11, 14, 13, 12]
hypothesizedMean = 13
alpha = 0.05

testResult = oneSampleTTest(data, hypothesizedMean, alpha)

# Print the results
for key, value in testResult.items():
    print(f"{key}: {value}")


### **Two-Sample t-Test**

The **two-sample t-test** is used to compare the means of two independent datasets to determine whether there is a significant difference between them.

---

### **Mathematical Formulation**

1. **Hypotheses**:
   - **Null Hypothesis ($H_0$)**: The means of the two datasets are equal.
     $
     H_0: \mu_1 = \mu_2
     $
   - **Alternative Hypothesis ($H_1$)**: The means of the two datasets are not equal (two-tailed test) or one is greater/less than the other (one-tailed test).
     $
     H_1: \mu_1 \neq \mu_2 \quad \text{(two-tailed)}
     $

2. **Test Statistic ($ t $)**:
   $
   t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
   $
   Where:
   - $ \bar{x}_1, \bar{x}_2 $: Means of the two samples
   - $ s_1^2, s_2^2 $: Variances of the two samples
   - $ n_1, n_2 $: Sizes of the two samples

3. **Degrees of Freedom (df)**:
   If sample variances are not assumed to be equal (Welch's t-test):
   $
   df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}}
   $

4. **p-value**: Compare the test statistic to the t-distribution to compute the p-value.
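
To make the Welch formulas concrete, the sketch below computes the test statistic and the degrees of freedom for two small samples. The second sample is hypothetical and was chosen to have a visibly larger spread; all variable names are illustrative.

```python
import numpy as np

# Hypothetical samples with visibly different spreads
sample1 = np.array([12, 15, 14, 10, 13, 15, 11, 14, 13, 12], dtype=float)
sample2 = np.array([18, 9, 22, 11, 16, 7, 20, 13], dtype=float)

n1, n2 = len(sample1), len(sample2)
v1 = sample1.var(ddof=1) / n1  # s_1^2 / n_1
v2 = sample2.var(ddof=1) / n2  # s_2^2 / n_2

tStatistic = (sample1.mean() - sample2.mean()) / np.sqrt(v1 + v2)
welchDf = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(f"t = {tStatistic:.3f}")
print(f"Welch df = {welchDf:.2f} (pooled df would be {n1 + n2 - 2})")
```

The Welch degrees of freedom are generally not an integer and always fall between $ \min(n_1, n_2) - 1 $ and $ n_1 + n_2 - 2 $, moving toward the lower end as the two variances become more unequal.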

---

### **Steps to Perform Two-Sample t-Test**

1. **Formulate Hypotheses**:
   - $ H_0 $: The two means are equal.
   - $ H_1 $: The two means are not equal (or greater/less than).

2. **Compute Sample Statistics**:
   - Calculate the means and variances of the two samples.
   - Compute the standard error of the difference in means.

3. **Compute the t-statistic**:
   - Plug the values into the t-statistic formula.

4. **Determine Degrees of Freedom**:
   - If variances are unequal, use Welch's formula for degrees of freedom.

5. **Determine the p-value**:
   - Compute the p-value based on the t-statistic and degrees of freedom.

6. **Make a Decision**:
   - Compare the p-value with the significance level $ \alpha $ (typically 0.05).

---

### **Key Insights**

1. **Reject $ H_0 $**:
   - If $ p < \alpha $, the means are significantly different.

2. **Fail to Reject $ H_0 $**:
   - If $ p \geq \alpha $, the data do not provide sufficient evidence of a difference between the means.

3. **Equal vs. Unequal Variances**:
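   - Use Welch’s t-test (the default in the code below, `equalVariance=False`) unless there is good reason to assume equal variances. With equal variances, the pooled test uses $ df = n_1 + n_2 - 2 $.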


In [None]:
# Perform two sample t-test to compare the mean of two datasets 



def twoSampleTTest(sample1, sample2, alpha=0.05, equalVariance=False):
    # Step 1: Calculate sample statistics
    mean1, mean2 = np.mean(sample1), np.mean(sample2)
    var1, var2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
    n1, n2 = len(sample1), len(sample2)

    # Step 2: Compute the pooled or un-pooled standard error
    if equalVariance:  # Assume equal variances (pooled variance)
        pooledVariance = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
        standardError = np.sqrt(pooledVariance * (1/n1 + 1/n2))
        degreesOfFreedom = n1 + n2 - 2
    else:  # Unequal variances (Welch's t-test)
        standardError = np.sqrt(var1/n1 + var2/n2)
        degreesOfFreedom = ((var1/n1 + var2/n2) ** 2) / (
            (var1/n1) ** 2 / (n1 - 1) + (var2/n2) ** 2 / (n2 - 1)
        )

    # Step 3: Compute the t-statistic
    tStatistic = (mean1 - mean2) / standardError

    # Step 4: Compute the p-value (two-tailed test)
    pValue = 2 * sciStats.t.sf(abs(tStatistic), df=degreesOfFreedom)

    # Step 5: Compare the p-value with the significance level
    if pValue < alpha:
        result = "Reject the null hypothesis (means are significantly different)"
    else:
        result = "Fail to reject the null hypothesis (no significant difference)"
    
    # Return results
    return {
        "Sample 1 Mean": mean1,
        "Sample 2 Mean": mean2,
        "T-Statistic": tStatistic,
        "P-Value": pValue,
        "Degrees of Freedom": degreesOfFreedom,
        "Result": result
    }

# Example usage
sample1 = [12, 15, 14, 10, 13, 15, 11, 14, 13, 12]
sample2 = [22, 25, 20, 19, 24, 23, 21, 22, 20, 23]

testResult = twoSampleTTest(sample1, sample2, alpha=0.05, equalVariance=False)

# Print the results
for key, value in testResult.items():
    print(f"{key}: {value}")


### **Chi-Square Test for Independence**

The **chi-square test for independence** determines whether two categorical variables are independent or have an association. It is based on comparing the observed frequencies in a contingency table to the expected frequencies.

---

### **Mathematical Formulation**

1. **Hypotheses**:
   - **Null Hypothesis ($H_0$)**: The two variables are independent.
   - **Alternative Hypothesis ($H_1$)**: The two variables are dependent (associated).

2. **Chi-Square Test Statistic**:
   $
   \chi^2 = \sum \frac{(O - E)^2}{E}
   $
   Where:
   - $O$: Observed frequency in each cell
   - $E$: Expected frequency in each cell, calculated as:
     $
     E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}
     $

3. **Degrees of Freedom (df)**:
   $
   df = (r - 1) \times (c - 1)
   $
   Where $r$ is the number of rows and $c$ is the number of columns in the contingency table.

4. **p-value**: Compare the calculated $ \chi^2 $ value to the chi-square distribution with the given degrees of freedom to obtain the p-value.
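
As a small worked example of these formulas, the sketch below computes the expected frequencies with an outer product of the row and column totals, then compares the result with `scipy.stats.chi2_contingency`. The variable names are illustrative, and `correction=False` is passed so that SciPy applies no continuity correction.

```python
import numpy as np
from scipy import stats

observed = np.array([
    [10, 20, 30],
    [15, 25, 20],
    [35, 30, 25],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# E_ij = (row total i * column total j) / grand total
expected = np.outer(row_totals, col_totals) / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
df_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, df_manual)  # upper-tail probability

# Reference computation from SciPy
chi2_ref, p_ref, df_ref, expected_ref = stats.chi2_contingency(observed, correction=False)

print(f"manual: chi2={chi2_manual:.4f}, df={df_manual}, p={p_manual:.4f}")
print(f"scipy : chi2={chi2_ref:.4f}, df={df_ref}, p={p_ref:.4f}")
```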

---

### **Steps to Perform Chi-Square Test**

1. **Construct a Contingency Table**:
   - A table summarizing the frequencies of two categorical variables.

2. **Calculate Expected Frequencies**:
   - Use the formula $ E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} $ for each cell.

3. **Compute the Chi-Square Statistic**:
   - Use the formula $ \chi^2 = \sum \frac{(O - E)^2}{E} $ for all cells.

4. **Determine the Degrees of Freedom**:
   - Use $ df = (r - 1) \times (c - 1) $.

5. **Compute the p-value**:
   - Compare the test statistic $ \chi^2 $ to the chi-square distribution.

6. **Make a Decision**:
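   - If $ p \leq \alpha $ (e.g., 0.05), reject the null hypothesis ($H_0$): the data provide evidence that the variables are associated.
   - If $ p > \alpha $, fail to reject $H_0$: there is insufficient evidence of an association. This does not prove that the variables are independent.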

---

### **Interpretation**

1. **Chi-Square Statistic**:
   - Represents how much the observed frequencies deviate from the expected frequencies.

2. **Degrees of Freedom**:
   - The number of cell counts that are free to vary once the row and column totals are fixed, $ (r - 1) \times (c - 1) $.

3. **p-Value**:
   - If $ p \leq \alpha $: There is evidence to suggest the variables are dependent (associated).
   - If $ p > \alpha $: Insufficient evidence of an association; this does not prove that the variables are independent.

---


In [None]:
# Implement and interpret a chi-square test for independence



def chiSquareTest(contingency_table, alpha=0.05):
    # Step 1: Compute observed frequencies
    observed = np.array(contingency_table)

    # Step 2: Compute expected frequencies
    rowTotals = observed.sum(axis=1).reshape(-1, 1)
    columnTotals = observed.sum(axis=0)
    grandTotal = observed.sum()
    expected = (rowTotals @ columnTotals.reshape(1, -1)) / grandTotal

    # Step 3: Compute chi-square statistic
    chiSquareStatistic = np.sum((observed - expected) ** 2 / expected)

    # Step 4: Degrees of freedom
    mRows, nColumns = observed.shape
    degreesOfFreedom = (mRows - 1) * (nColumns - 1)

    # Step 5: Compute p-value (upper tail; sf is more accurate than 1 - cdf for large statistics)
    pValue = sciStats.chi2.sf(chiSquareStatistic, degreesOfFreedom)

    # Step 6: Hypothesis testing
    if pValue <= alpha:
        result = "Reject the null hypothesis (evidence that the variables are associated)"
    else:
        result = "Fail to reject the null hypothesis (insufficient evidence of an association)"

    # Return results
    return {
        "Chi-Square Statistic": chiSquareStatistic,
        "Degrees of Freedom": degreesOfFreedom,
        "P-Value": pValue,
        "Result": result
    }

# Example usage
contingencyTable = [
    [10, 20, 30],  # Row 1: Category 1 of Variable A
    [15, 25, 20],  # Row 2: Category 2 of Variable A
    [35, 30, 25]   # Row 3: Category 3 of Variable A
]

testResult = chiSquareTest(contingencyTable, alpha=0.05)

# Print the results
for key, value in testResult.items():
    print(f"{key}: {value}")


### **ANOVA (Analysis of Variance)**

ANOVA (Analysis of Variance) is used to compare the means of **three or more groups** to determine if at least one group mean is significantly different from the others.

---

### **Mathematical Formulation**

1. **Hypotheses**:
   - **Null Hypothesis ($H_0$)**: All group means are equal.
     $
     H_0: \mu_1 = \mu_2 = \mu_3 = \ldots = \mu_k
     $
   - **Alternative Hypothesis ($H_1$)**: At least one group mean is different.

2. **Test Statistic**:
   ANOVA is based on the **F-statistic**, which is the ratio of two variances:
   $
   F = \frac{\text{Between-Group Variance}}{\text{Within-Group Variance}}
   $
   - **Between-Group Variance**: Variance due to differences between group means.
   - **Within-Group Variance**: Variance due to differences within groups.

3. **Degrees of Freedom**:
   - Between-group: $df_b = k - 1$, where $k$ is the number of groups.
   - Within-group: $df_w = N - k$, where $N$ is the total number of observations.

4. **p-value**:
   - Use the F-statistic and degrees of freedom to compute the p-value.
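
To make the variance ratio concrete, here is a minimal sketch that builds the F-statistic from sums of squares and compares it with `scipy.stats.f_oneway`. The function name `anova_f` is hypothetical. The between-group and within-group variances are taken to be the usual mean squares, that is, each sum of squares divided by its degrees of freedom.

```python
import numpy as np
from scipy import stats


def anova_f(*groups):
    """One-way ANOVA from sums of squares; returns (F, p, df_between, df_within)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n_total = len(groups), all_values.size

    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    p_value = stats.f.sf(f_stat, df_between, df_within)
    return f_stat, p_value, df_between, df_within


g1 = [15, 20, 25, 30, 35]
g2 = [22, 25, 30, 35, 40]
g3 = [28, 32, 35, 40, 45]

f_manual, p_manual, df_b, df_w = anova_f(g1, g2, g3)
f_ref, p_ref = stats.f_oneway(g1, g2, g3)

print(f"manual: F={f_manual:.4f}, p={p_manual:.4f}, df=({df_b}, {df_w})")
print(f"scipy : F={f_ref:.4f}, p={p_ref:.4f}")
```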

---

### **Steps to Perform ANOVA**

1. **Formulate Hypotheses**:
   - $H_0$: All group means are equal.
   - $H_1$: At least one group mean is different.

2. **Calculate Group Means and Variances**:
   - Compute the mean for each group and the overall mean.

3. **Calculate Between-Group Variance**:
   - Measure how much the group means differ from the overall mean.

4. **Calculate Within-Group Variance**:
   - Measure the variability within each group.

5. **Compute the F-statistic**:
   - $ F = \frac{\text{Between-Group Variance}}{\text{Within-Group Variance}} $.

6. **Determine the p-value**:
   - Compare the F-statistic to the F-distribution.

7. **Make a Decision**:
   - If $p \leq \alpha$, reject $H_0$: At least one group mean is different.

---

### **Interpreting Results**

1. **Reject $H_0$**:
   - If $p \leq \alpha$, there is evidence that at least one group mean is different.

2. **Fail to Reject $H_0$**:
   - If $p > \alpha$, there is no evidence to suggest significant differences between the group means.

---

### **Limitations of ANOVA**

- ANOVA only tells if there is a significant difference but does not specify which groups differ.
- To identify specific group differences, use **post-hoc tests** (e.g., Tukey’s test).
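
One simple way to follow up a significant ANOVA is sketched below. Note that this is not Tukey's test: it substitutes pairwise Welch t-tests with a Bonferroni correction, which is easier to write by hand but usually more conservative. The function name `pairwise_bonferroni` and the group labels are hypothetical.

```python
from itertools import combinations

from scipy import stats


def pairwise_bonferroni(groups, alpha=0.05):
    """Pairwise Welch t-tests with Bonferroni-adjusted p-values.

    groups: dict mapping a label to a sequence of observations.
    Returns a list of (label_a, label_b, raw_p, adjusted_p, reject) tuples.
    """
    pairs = list(combinations(groups, 2))
    results = []
    for name_a, name_b in pairs:
        _, p_raw = stats.ttest_ind(groups[name_a], groups[name_b], equal_var=False)
        # Bonferroni: multiply by the number of comparisons, cap at 1
        p_adj = min(1.0, p_raw * len(pairs))
        results.append((name_a, name_b, p_raw, p_adj, p_adj < alpha))
    return results


groups = {
    "group1": [15, 20, 25, 30, 35],
    "group2": [22, 25, 30, 35, 40],
    "group3": [28, 32, 35, 40, 45],
}

results = pairwise_bonferroni(groups)
for name_a, name_b, p_raw, p_adj, reject in results:
    print(f"{name_a} vs {name_b}: raw p={p_raw:.4f}, adjusted p={p_adj:.4f}, reject={reject}")
```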

---


In [None]:
# Conduct ANOVA test to compare mean of multiple groups


def oneWayANOVA(*groups, alpha=0.05):
    # Step 1: Perform ANOVA
    fStatistic, pValue = sciStats.f_oneway(*groups)
    
    # Step 2: Interpret Results
    if pValue <= alpha:
        result = "Reject the null hypothesis (at least one group mean is different)"
    else:
        result = "Fail to reject the null hypothesis (no significant difference between group means)"
    
    return {
        "F-Statistic": fStatistic,
        "P-Value": pValue,
        "Result": result
    }

# Example usage
group1 = [15, 20, 25, 30, 35]
group2 = [22, 25, 30, 35, 40]
group3 = [28, 32, 35, 40, 45]

testResult = oneWayANOVA(group1, group2, group3, alpha=0.05)

# Print the results
for key, value in testResult.items():
    print(f"{key}: {value}")


### **Permutation Test for Hypothesis Testing**

The **permutation test** is a non-parametric statistical method used to test the null hypothesis by comparing the observed test statistic to a distribution of test statistics generated by randomly permuting the data. It is especially useful when the assumptions of parametric tests (e.g., t-test) are not satisfied.

---

### **Steps of Permutation Test**

1. **Define Hypotheses**:
   - **Null Hypothesis ($H_0$)**: The two groups have the same distribution (no significant difference between means).
   - **Alternative Hypothesis ($H_1$)**: The two groups have different distributions (there is a significant difference between means).

2. **Calculate Observed Test Statistic**:
   - Compute a metric like the difference in means or medians between the two groups.

3. **Generate Permutation Distribution**:
   - Combine the two datasets.
   - Randomly shuffle (permute) the combined dataset and split it back into two groups.
   - Recalculate the test statistic for each permutation.

4. **Compute p-value**:
   - The p-value is the proportion of permuted test statistics that are as extreme as or more extreme than the observed test statistic.

5. **Decision**:
   - If $p \leq \alpha$ (e.g., 0.05), reject the null hypothesis $H_0$.
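
When the samples are very small, the permutation distribution in step 3 can be enumerated exactly rather than sampled at random. The sketch below does this for the difference in means; the function name `exact_permutation_test` is hypothetical, and the approach is only practical while the number of possible splits stays small.

```python
from itertools import combinations

import numpy as np


def exact_permutation_test(group1, group2):
    """Two-sided exact permutation test on the difference in means.

    Returns (p_value, number_of_splits).
    """
    combined = np.concatenate([group1, group2]).astype(float)
    n1, n = len(group1), len(combined)
    total = combined.sum()
    observed = abs(np.mean(group1) - np.mean(group2))

    count = 0
    n_splits = 0
    # Enumerate every way of choosing which n1 observations form "group 1"
    for idx in combinations(range(n), n1):
        sum1 = combined[list(idx)].sum()
        diff = sum1 / n1 - (total - sum1) / (n - n1)
        # Small tolerance guards against floating-point round-off
        if abs(diff) >= observed - 1e-12:
            count += 1
        n_splits += 1
    return count / n_splits, n_splits


g1 = [5.2, 5.8, 6.1, 5.9, 6.3]
g2 = [6.5, 6.7, 7.0, 6.8, 7.2]

p_exact, n_splits = exact_permutation_test(g1, g2)
print(f"Exact p-value: {p_exact:.4f} over {n_splits} splits")
```

With five observations per group there are $\binom{10}{5} = 252$ splits. Every value in `g1` lies below every value in `g2`, so only the observed split and its mirror image are as extreme as the data, giving $p = 2/252 \approx 0.0079$.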

---

### **Key Points to Note**

1. **Permutation Test Strengths**:
   - Makes no assumption about the shape of the data's distribution; it only requires that the observations be exchangeable under $H_0$.
   - Useful for small sample sizes or when normality is doubtful.

2. **Permutation Test Limitations**:
   - Computationally expensive for large datasets with many permutations.
   - Only applicable when data can be shuffled meaningfully under $H_0$.

3. **Interpretation**:
   - A small p-value indicates evidence against the null hypothesis ($H_0$).
   - If $p > \alpha$, there is insufficient evidence to conclude that the groups are different.

---


In [None]:
# Permutation test for Hypothesis


def permutationTestHypothesis(group1, group2, nPermutations=1000, alpha=0.05):
    # Step 1: Calculate observed test statistic (difference in means)
    observedStatistic = np.mean(group1) - np.mean(group2)
    
    # Step 2: Combine both groups into one dataset
    combined = np.concatenate([group1, group2])
    
    # Step 3: Generate permutation distribution
    permutedStatistics = []
    for _ in range(nPermutations):
        np.random.shuffle(combined)  # Shuffle the combined dataset
        permutedGroup1 = combined[:len(group1)]
        permutedGroup2 = combined[len(group1):]
        permutedStatistic = np.mean(permutedGroup1) - np.mean(permutedGroup2)
        permutedStatistics.append(permutedStatistic)
    
    # Step 4: Calculate p-value
    permutedStatistics = np.array(permutedStatistics)
    pValue = np.sum(np.abs(permutedStatistics) >= np.abs(observedStatistic)) / nPermutations
    
    # Step 5: Decision
    if pValue <= alpha:
        result = "Reject the null hypothesis (the groups are significantly different)"
    else:
        result = "Fail to reject the null hypothesis (no significant difference between the groups)"
    
    return {
        "Observed Statistic": observedStatistic,
        "P-Value": pValue,
        "Result": result
    }

# Example usage
group1 = [5.2, 5.8, 6.1, 5.9, 6.3]
group2 = [6.5, 6.7, 7.0, 6.8, 7.2]

testResult = permutationTestHypothesis(group1, group2, nPermutations=1000, alpha=0.05)

# Print the results
for key, value in testResult.items():
    print(f"{key}: {value}")



### **Mann-Whitney U Test**

The **Mann-Whitney U Test** is a non-parametric test used to determine whether there is a significant difference between two independent groups when the data does not meet the assumptions of a parametric test like the t-test (e.g., non-normality or ordinal data).

---

### **When to Use**
- The data are ordinal, or continuous but not plausibly normally distributed.
- The two groups are **independent**.
- The goal is to compare the **medians** or overall distributions of the two groups.

---

### **Mathematical Formulation**

1. **Hypotheses**:
   - **Null Hypothesis ($H_0$)**: The two groups come from the same population (no difference in medians).
   - **Alternative Hypothesis ($H_1$)**: The two groups come from different populations (difference in medians).

2. **Test Statistic**:
   - Based on the ranks of the combined data:
     $
     U = R_1 - \frac{n_1(n_1 + 1)}{2}
     $
     Where:
     - $R_1$: Sum of ranks for the first group.
     - $n_1$: Number of observations in the first group.

3. **p-value**:
   - The p-value is derived from the U statistic based on the Mann-Whitney distribution.
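
The rank-sum formula can be verified directly. The following sketch uses `scipy.stats.rankdata` to rank the pooled data (ties receive average ranks) and compares the result with `scipy.stats.mannwhitneyu`. The helper name `mann_whitney_u_from_ranks` is hypothetical.

```python
import numpy as np
from scipy import stats


def mann_whitney_u_from_ranks(group1, group2):
    """Compute U for both groups from the ranks of the pooled data."""
    n1, n2 = len(group1), len(group2)
    ranks = stats.rankdata(np.concatenate([group1, group2]))
    r1 = ranks[:n1].sum()            # rank sum of the first group
    u1 = r1 - n1 * (n1 + 1) / 2      # U = R1 - n1(n1 + 1)/2
    u2 = n1 * n2 - u1                # the two U values always sum to n1 * n2
    return u1, u2


g1 = [55, 60, 65, 70, 75]
g2 = [80, 85, 90, 95, 100]

u1, u2 = mann_whitney_u_from_ranks(g1, g2)
reference = stats.mannwhitneyu(g1, g2, alternative="two-sided")

print(f"U1 = {u1}, U2 = {u2}")
print(f"SciPy U statistic = {reference.statistic}, p-value = {reference.pvalue:.4f}")
```

Here every value in `g1` is smaller than every value in `g2`, so the first group holds ranks 1 to 5, $R_1 = 15$, and $U_1 = 15 - 15 = 0$.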

---

### **Steps to Perform the Mann-Whitney U Test**

1. **Define Hypotheses**:
   - $H_0$: The two groups come from the same distribution.
   - $H_1$: The two groups come from different distributions.

2. **Rank Data**:
   - Combine the two groups, assign ranks to all data points, and compute the rank sums for each group.

3. **Calculate the U Statistic**:
   - Use the formula for $U$ to compute the test statistic.

4. **Compute p-value**:
   - Use the U statistic to find the p-value.

5. **Decision**:
   - If $p \leq \alpha$ (e.g., 0.05), reject $H_0$.

---

### **Interpretation**

1. **Reject $H_0$**:
   - A small p-value ($p \leq \alpha$) indicates that the distributions of the two groups are significantly different.

2. **Fail to Reject $H_0$**:
   - A large p-value ($p > \alpha$) suggests there is no evidence to conclude the groups are different.

---

### **Advantages of Mann-Whitney U Test**

1. **Non-parametric**: Does not assume normality.
2. **Robust**: Works well for ordinal data and is resistant to outliers, because it uses ranks rather than raw values.

---

### **Limitations**

1. **Independent Groups**: Assumes the two groups are independent.
2. **Equal Shape Distributions**: Interpreting a significant result as a difference in medians requires the two distributions to have the same shape and differ only in location.

---


In [None]:
# Implement and interpret a Mann-Whitney U test for data that may not be normally distributed


def mannWhitneyUTest(group1, group2, alpha=0.05, alternative='two-sided')->dict:
    # Step 1: Perform the Mann-Whitney U Test
    uStatistic, pValue = sciStats.mannwhitneyu(group1, group2, alternative=alternative)
    
    # Step 2: Interpret Results
    if pValue <= alpha:
        result = "Reject the null hypothesis (the groups have different distributions)"
    else:
        result = "Fail to reject the null hypothesis (no significant difference between the groups)"
    
    return {
        "U-Statistic": uStatistic,
        "P-Value": pValue,
        "Result": result
    }

# Example usage
group1 = [55, 60, 65, 70, 75]  # Example data (no normality assumed)
group2 = [80, 85, 90, 95, 100]

testResult = mannWhitneyUTest(group1, group2, alpha=0.05, alternative='two-sided')

# Print the results
for key, value in testResult.items():
    print(f"{key}: {value}")


### **Simulating Type-1 and Type-2 Errors for Hypothesis Tests**

In hypothesis testing, **Type-1 and Type-2 errors** are crucial concepts that describe the accuracy of the test. Here’s how to understand and simulate them:

---

### **Key Concepts**

1. **Type-1 Error ($\alpha$)**:
   - Occurs when the null hypothesis ($H_0$) is rejected even though it is true.
   - Probability of Type-1 error is equal to the significance level ($\alpha$) of the test, usually 0.05.

2. **Type-2 Error ($\beta$)**:
   - Occurs when the null hypothesis ($H_0$) is not rejected even though the alternative hypothesis ($H_1$) is true.
   - $1 - \beta$ is the **power of the test**, indicating its ability to detect a true effect.

---

### **Steps to Simulate Type-1 and Type-2 Errors**

1. **Generate Data**:
   - Simulate data from a distribution under the null hypothesis ($H_0$).
   - Simulate data under the alternative hypothesis ($H_1$).

2. **Conduct Hypothesis Testing**:
   - Use a statistical test (e.g., t-test).
   - Compare the p-value with the significance level ($\alpha$) to make decisions.

3. **Record Outcomes**:
   - Record whether a Type-1 error occurs for $H_0$-data.
   - Record whether a Type-2 error occurs for $H_1$-data.

4. **Calculate Error Rates**:
   - Type-1 error rate = (Type-1 errors) / (Total $H_0$-tests).
   - Type-2 error rate = (Type-2 errors) / (Total $H_1$-tests).

---

### **Key Points to Interpret**

1. **Type-1 Error Rate**:
   - Should be close to the predefined significance level ($\alpha$).

2. **Type-2 Error Rate**:
   - Higher for smaller sample sizes or smaller effect sizes.
   - Decreases as the sample size or effect size increases.

3. **Power**:
   - Power ($1 - \beta$) increases with larger sample sizes and effect sizes.
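
For a two-sample t-test, power can also be computed analytically and compared with a simulated estimate. The sketch below is one way to do it under stated assumptions: equal group sizes, equal variances, a two-sided test, and the effect size expressed as a standardized mean difference. It uses the noncentral t distribution (`scipy.stats.nct`); the function name `two_sample_t_power` is hypothetical.

```python
import numpy as np
from scipy import stats


def two_sample_t_power(effect_size, n_per_group, alpha=0.05):
    """Analytic power of a two-sided, equal-variance two-sample t-test."""
    df = 2 * n_per_group - 2
    # Noncentrality parameter for equal group sizes
    ncp = effect_size * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that the statistic falls in either rejection tail under H1
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)


power_30 = two_sample_t_power(effect_size=0.5, n_per_group=30)
power_100 = two_sample_t_power(effect_size=0.5, n_per_group=100)

print(f"Power with n=30 per group : {power_30:.4f}")
print(f"Power with n=100 per group: {power_100:.4f}")
```

A simulated power estimate obtained with the same sample size, effect size, and $\alpha$ should land close to the analytic value, up to simulation noise.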

---

### **Practical Applications**

- **Type-1 Error**: Controlling false positives, e.g., in medical trials where rejecting a true null can lead to unnecessary treatments.
- **Type-2 Error**: Minimizing false negatives, e.g., ensuring a test detects real effects when they exist.
- **Power Analysis**: Used to design studies with adequate sample size for detecting true effects.


In [None]:
# Simulate Type-1 and Type-2 error for hypothesis tests


def simulateType1Type2Errors(
    sampleSize=30, 
    nSimulations=1000, 
    alpha=0.05, 
    effectSize=0.5
):
    # Step 1: Initialize counters for errors
    type1Errors = 0
    type2Errors = 0

    # Step 2: Simulate for null hypothesis (H0 is true)
    for _ in range(nSimulations):
        # Generate two groups with the same mean under H0
        group1 = np.random.normal(loc=0, scale=1, size=sampleSize)
        group2 = np.random.normal(loc=0, scale=1, size=sampleSize)
        _, pValue = sciStats.ttest_ind(group1, group2)

        # Record Type-1 error if we reject H0
        if pValue < alpha:
            type1Errors += 1

    # Step 3: Simulate for alternative hypothesis (H1 is true)
    for _ in range(nSimulations):
        # Generate two groups with different means under H1
        group1 = np.random.normal(loc=0, scale=1, size=sampleSize)
        group2 = np.random.normal(loc=effectSize, scale=1, size=sampleSize)
        _, pValue = sciStats.ttest_ind(group1, group2)

        # Record Type-2 error if we fail to reject H0
        if pValue >= alpha:
            type2Errors += 1

    # Step 4: Calculate error rates
    type1ErrorRate = type1Errors / nSimulations
    type2ErrorRate = type2Errors / nSimulations

    return {
        "Type-1 Error Rate (α)": type1ErrorRate,
        "Type-2 Error Rate (β)": type2ErrorRate,
        "Power (1 - β)": 1 - type2ErrorRate
    }

# Example usage
results = simulateType1Type2Errors(
    sampleSize=30, 
    nSimulations=1000, 
    alpha=0.05, 
    effectSize=0.5
)

# Print the results
for key, value in results.items():
    print(f"{key}: {value:.4f}")


### **Computing p-values from Test Statistics**

In hypothesis testing, the **p-value** is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis ($H_0$) is true. You can compute p-values from test statistics using the corresponding probability distribution (e.g., Normal, t, Chi-Square).

---

### **Steps to Compute p-values**

1. **Determine the Type of Test**:
   - Identify the appropriate distribution (e.g., normal, t-distribution, chi-square).

2. **Obtain the Test Statistic**:
   - Compute the test statistic using the formula for the specific hypothesis test.

3. **Choose the Alternative Hypothesis**:
   - **Two-tailed**: Tests for differences in both directions.
   - **One-tailed**: Tests for differences in one direction (greater or less).

4. **Compute the p-value**:
   - Use the cumulative distribution function (CDF) of the relevant distribution to calculate the p-value.

---

### **Interpreting the p-value**

1. **Compare p-value to Significance Level ($\alpha$)**:
   - If $p \leq \alpha$, reject the null hypothesis ($H_0$).
   - If $p > \alpha$, fail to reject $H_0$.

2. **Common $\alpha$ Levels**:
   - $0.05$ (5% significance): Standard in many fields.
   - $0.01$ (1% significance): Used for stricter confidence.

---

### **Examples of Use**

1. **z-test**: Compare the mean of a sample to a population mean.
2. **t-test**: Compare sample means when the sample size is small or the population standard deviation is unknown.
3. **Chi-Square Test**: Test for independence or goodness of fit.


In [None]:
# compute pValues from test statistics


def pValueZTest(zStat, alternative="two-sided"):
    if alternative == "two-sided":
        p_value = 2 * sciStats.norm.sf(abs(zStat))  # sf = 1 - cdf, more accurate in the tails
    elif alternative == "greater":
        p_value = sciStats.norm.sf(zStat)
    elif alternative == "less":
        p_value = sciStats.norm.cdf(zStat)
    else:
        raise ValueError("Invalid alternative hypothesis. Choose 'two-sided', 'greater', or 'less'.")
    return p_value

# Example usage
zStat = 2.5  # Example test statistic
pValue = pValueZTest(zStat, alternative="two-sided")
print(f"p-value: {pValue:.4f}")


# ---


def pValueTTest(tStat, df, alternative="two-sided"):
    if alternative == "two-sided":
        p_value = 2 * sciStats.t.sf(abs(tStat), df)  # sf = 1 - cdf, more accurate in the tails
    elif alternative == "greater":
        p_value = sciStats.t.sf(tStat, df)
    elif alternative == "less":
        p_value = sciStats.t.cdf(tStat, df)
    else:
        raise ValueError("Invalid alternative hypothesis. Choose 'two-sided', 'greater', or 'less'.")
    return p_value

# Example usage
tStat = 2.0  # Example test statistic
df = 10       # Degrees of freedom
pValue = pValueTTest(tStat, df, alternative="two-sided")
print(f"p-value: {pValue:.4f}")


# ----


def pValueChi2Test(chi2Stat, df):
    p_value = sciStats.chi2.sf(chi2Stat, df)  # upper-tail probability
    return p_value

# Example usage
chi2Stat = 5.0  # Example test statistic
df = 3           # Degrees of freedom
pValue = pValueChi2Test(chi2Stat, df)
print(f"p-value: {pValue:.4f}")



In [None]:
# Implement the rejection region of hypothesis test

def rejectionRegionZTest(alpha=0.05, testType="two-tailed"):
    if testType == "two-tailed":
        # Two-tailed critical values
        lowerCritical = sciStats.norm.ppf(alpha / 2)
        upperCritical = sciStats.norm.ppf(1 - alpha / 2)
        return lowerCritical, upperCritical
    elif testType == "right-tailed":
        # One-tailed critical value (right)
        criticalValue = sciStats.norm.ppf(1 - alpha)
        return criticalValue
    elif testType == "left-tailed":
        # One-tailed critical value (left)
        criticalValue = sciStats.norm.ppf(alpha)
        return criticalValue
    else:
        raise ValueError("Invalid test type. Choose 'two-tailed', 'right-tailed', or 'left-tailed'.")

# Example usage
alpha = 0.05
testType = "two-tailed"
criticalValues = rejectionRegionZTest(alpha, testType)

print(f"Rejection region (z-test, {testType}): {criticalValues}")



# ---


def rejectionRegionTTest(alpha=0.05, df=10, testType="two-tailed"):
    if testType == "two-tailed":
        # Two-tailed critical values
        lowerCritical = sciStats.t.ppf(alpha / 2, df)
        upperCritical = sciStats.t.ppf(1 - alpha / 2, df)
        return lowerCritical, upperCritical
    elif testType == "right-tailed":
        # One-tailed critical value (right)
        criticalValue = sciStats.t.ppf(1 - alpha, df)
        return criticalValue
    elif testType == "left-tailed":
        # One-tailed critical value (left)
        criticalValue = sciStats.t.ppf(alpha, df)
        return criticalValue
    else:
        raise ValueError("Invalid test type. Choose 'two-tailed', 'right-tailed', or 'left-tailed'.")

# Example usage
alpha = 0.05
df = 10  # Degrees of freedom
testType = "two-tailed"
criticalValues = rejectionRegionTTest(alpha, df, testType)

print(f"Rejection region (t-test, {testType}, df={df}): {criticalValues}")

#---    


def rejectionRegionChi2Test(alpha=0.05, df=10, testType="right-tailed"):
    if testType == "right-tailed":
        # One-tailed critical value (right)
        criticalValue = sciStats.chi2.ppf(1 - alpha, df)
        return criticalValue
    elif testType == "left-tailed":
        # One-tailed critical value (left)
        criticalValue = sciStats.chi2.ppf(alpha, df)
        return criticalValue
    else:
        raise ValueError("Invalid test type. Choose 'right-tailed' or 'left-tailed'.")

# Example usage
alpha = 0.05
df = 10
criticalValue = rejectionRegionChi2Test(alpha, df, testType="right-tailed")

print(f"Rejection region (chi-square test, right-tailed, df={df}): Critical value = {criticalValue:.4f}")
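
The rejection-region approach and the p-value approach always lead to the same decision. As a minimal illustration for a two-tailed z-test, the sketch below makes the decision both ways for a few test statistics. The function name `z_test_decisions` and the chosen statistics are hypothetical.

```python
from scipy import stats


def z_test_decisions(z_stat, alpha=0.05):
    """Return (reject_by_region, reject_by_p, critical_value, p_value) for a two-tailed z-test."""
    critical = stats.norm.ppf(1 - alpha / 2)
    reject_by_region = bool(abs(z_stat) > critical)   # statistic falls in the rejection region
    p_value = 2 * stats.norm.sf(abs(z_stat))
    reject_by_p = bool(p_value < alpha)               # p-value below the significance level
    return reject_by_region, reject_by_p, critical, p_value


decisions = {z: z_test_decisions(z) for z in (0.5, 1.9, 2.5, -3.0)}
for z, (by_region, by_p, critical, p_value) in decisions.items():
    print(f"z={z:+.1f}: |z| > {critical:.3f}? {by_region}; p={p_value:.4f} < 0.05? {by_p}")
```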




### **Hypothesis Test to Determine if a Dataset Follows a Normal Distribution**

To determine if a dataset follows a normal distribution, you can perform a **Goodness-of-Fit test** using one of the following methods:

1. **Shapiro-Wilk Test**:
   - A statistical test that checks if a sample comes from a normally distributed population.
   
2. **Anderson-Darling Test**:
   - A goodness-of-fit test based on the empirical distribution function that gives extra weight to the tails of the distribution.
   
3. **Kolmogorov-Smirnov Test (KS Test)**:
   - Compares the sample's cumulative distribution function (CDF) with the CDF of a normal distribution.

4. **Chi-Square Goodness-of-Fit Test**:
   - Compares the observed frequency distribution of data against the expected frequencies under the normal distribution.

---

### **Steps to Perform a Hypothesis Test for Normality**

1. **State Hypotheses**:
   - Null Hypothesis ($H_0$): The data follows a normal distribution.
   - Alternative Hypothesis ($H_1$): The data does not follow a normal distribution.

2. **Select Significance Level ($\alpha$)**:
   - Common choice is $\alpha = 0.05$.

3. **Select a Normality Test**:
   - Choose one of the tests (Shapiro-Wilk, Anderson-Darling, KS test, etc.).

4. **Perform the Test**:
   - Compute the test statistic and corresponding p-value.

5. **Interpret the p-value**:
   - If the p-value $\leq \alpha$, reject the null hypothesis ($H_0$); conclude that the data does not follow a normal distribution.
   - If the p-value $> \alpha$, fail to reject $H_0$; conclude that there is insufficient evidence to say the data is not normal.

---

### **Interpreting Results**

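- **Shapiro-Wilk, KS test**:
  - **Null Hypothesis ($H_0$)**: The data follows a normal distribution.
  - **Alternative Hypothesis ($H_1$)**: The data does not follow a normal distribution.
  - If the **p-value** $> 0.05$, we fail to reject $H_0$ (the data is consistent with a normal distribution; this does not prove normality).
  - If the **p-value** $ \leq 0.05$, we reject $H_0$ (there is evidence that the data is not normal).

- **Anderson-Darling**:
  - The hypotheses are the same, but `scipy.stats.anderson`, as used in this section, returns critical values at several significance levels rather than a p-value. Reject $H_0$ at a given level when the test statistic exceeds the corresponding critical value.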

- **Chi-Square Test**:
  - **Null Hypothesis ($H_0$)**: The observed frequencies match the expected frequencies from a normal distribution.
  - If the **p-value** $> 0.05$, we fail to reject $H_0$.
  - If the **p-value** $ \leq 0.05$, we reject $H_0$.

---

### **Summary**

- You can use these tests to assess whether your dataset follows a normal distribution.
- **Shapiro-Wilk** and **Anderson-Darling** are powerful tests specifically for normality.
- **Chi-Square** and **Kolmogorov-Smirnov** can also be used but are less commonly applied for normality testing.
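
To see how a normality test behaves on data whose distribution is known, the sketch below applies the Shapiro-Wilk test to one simulated normal sample and one simulated exponential (strongly skewed) sample. The seed, sample size, and distribution parameters are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

samples = {
    "normal": rng.normal(loc=10, scale=2, size=200),
    "exponential": rng.exponential(scale=2, size=200),
}

shapiro_results = {}
for label, sample in samples.items():
    stat, p_value = stats.shapiro(sample)
    shapiro_results[label] = (stat, p_value)
    verdict = "reject H0" if p_value <= 0.05 else "fail to reject H0"
    print(f"{label:12s} W={stat:.4f}, p={p_value:.4g} -> {verdict}")
```

With 200 observations the test has high power against a distribution as skewed as the exponential, so that sample should be rejected. A normal sample will still be rejected about 5% of the time at $\alpha = 0.05$.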


In [None]:
# Perform a hypothesis test to determine whether a dataset follows a normal distribution


def ShapiroWilk(data):
    stat, pValue = sciStats.shapiro(data)
    return stat, pValue

# Example usage
data = [4.5, 6.7, 8.8, 5.6, 7.9, 8.4, 6.1, 7.3]  # Example data
stat, pValue = ShapiroWilk(data)

print(f"Shapiro-Wilk Test Statistic: {stat:.4f}, p-value: {pValue:.4f}")
if pValue > 0.05:
    print("Fail to reject the null hypothesis: no evidence against normality")
else:
    print("Reject the null hypothesis: Data does not follow a normal distribution")


# -----



def AndersonDarling(data):
    result = sciStats.anderson(data)
    return result.statistic, result.critical_values, result.significance_level

# Example usage
data = [4.5, 6.7, 8.8, 5.6, 7.9, 8.4, 6.1, 7.3]  # Example data
stat, criticalValues, significanceLevel = AndersonDarling(data)

print(f"Anderson-Darling Test Statistic: {stat:.4f}")
for i in range(len(criticalValues)):
    print(f"Critical value for {significanceLevel[i]}% significance: {criticalValues[i]}")
    if stat > criticalValues[i]:
        print(f"Reject the null hypothesis at {significanceLevel[i]}% significance: Data does not follow a normal distribution")



# ----- 


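def KolmogorovSmirnov(data):
    # Convert to an array so that plain lists work (lists have no .mean()/.std())
    data = np.asarray(data, dtype=float)
    # Parameters are estimated from the same sample, which makes the KS p-value conservative
    stat, pValue = sciStats.kstest(data, sciStats.norm.cdf, args=(data.mean(), data.std(ddof=1)))
    return stat, pValue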

# Example usage
data = [4.5, 6.7, 8.8, 5.6, 7.9, 8.4, 6.1, 7.3]  # Example data
stat, pValue = KolmogorovSmirnov(data)

print(f"Kolmogorov-Smirnov Test Statistic: {stat:.4f}, p-value: {pValue:.4f}")
if pValue > 0.05:
    print("Fail to reject the null hypothesis: no evidence against normality")
else:
    print("Reject the null hypothesis: Data does not follow a normal distribution")
