# 📊 Statistics & Probability for Data Science - A Complete Guide
## **Fundamentals and Applications**


# Introduction to Statistics 
## 🔍 What is Statistics?
Statistics is the **science of collecting, organizing, analyzing, and interpreting data**.  
It plays a crucial role in **machine learning, business analytics, finance, and healthcare**.

### 📌 Two Main Branches:
1. **Descriptive Statistics** → Summarizing and visualizing data  
2. **Inferential Statistics** → Drawing conclusions using probability & hypothesis testing  


## Code Example: Descriptive Statistics

In [1]:
import numpy as np
import pandas as pd

# Sample Data
data = [23, 45, 56, 78, 89, 90, 100, 120, 150]

# Compute Metrics
mean = np.mean(data)
median = np.median(data)
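variance = np.var(data)  # population variance (ddof=0); pass ddof=1 for the sample variance
std_dev = np.std(data)   # population standard deviation (ddof=0)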

print(f"Mean: {mean}, Median: {median}, Variance: {variance}, Std Dev: {std_dev}")


Mean: 83.44444444444444, Median: 89.0, Variance: 1336.4691358024693, Std Dev: 36.55775069397007
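
`np.var` and `np.std` default to the population formulas, which divide by n. When the data are a sample from a larger population, the usual unbiased estimate divides by n - 1 instead. A minimal sketch of the difference, reusing the data above and NumPy's `ddof` parameter:

```python
import numpy as np

data = [23, 45, 56, 78, 89, 90, 100, 120, 150]
n = len(data)

# Population variance: divide by n (NumPy's default, ddof=0)
pop_var = np.var(data)

# Sample variance and standard deviation: divide by n - 1 (ddof=1)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

# The two variances differ exactly by the factor n / (n - 1)
print(f"Population variance: {pop_var:.2f}")
print(f"Sample variance:     {sample_var:.2f}")
print(f"Sample std dev:      {sample_std:.2f}")
```

The gap between the two shrinks as n grows, so the choice matters most for small datasets like this one.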


# Probability Theory & Bayes’ Theorem
## 🎲 What is Probability?
Probability is **a measure of how likely an event is to occur**, expressed as a number between 0 (impossible) and 1 (certain).

### 📌 Probability Formula:
**P(A) = Favorable Outcomes / Total Outcomes** (valid when all outcomes are equally likely)

Example: The probability of rolling a **4** on a fair six-sided die is:
P(4) = 1/6 ≈ 0.1667 (16.67%)
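
The counting formula can be checked by enumerating the sample space directly. The sketch below is illustrative only: the `probability` helper is a hypothetical name, and it assumes a fair die so that every face is equally likely.

```python
from fractions import Fraction

# Sample space of a fair six-sided die: every face is equally likely
outcomes = [1, 2, 3, 4, 5, 6]

def probability(event, sample_space):
    """Classical probability: favorable outcomes / total outcomes."""
    favorable = [o for o in sample_space if event(o)]
    return Fraction(len(favorable), len(sample_space))

p_four = probability(lambda o: o == 4, outcomes)
p_even = probability(lambda o: o % 2 == 0, outcomes)

print(f"P(4)    = {p_four} = {float(p_four):.4f}")
print(f"P(even) = {p_even} = {float(p_even):.4f}")
```

Using `Fraction` keeps the result exact, and the conversion to `float` is only for display.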


## 📌 Bayes' Theorem

Bayes' Theorem describes the probability of an event occurring based on prior knowledge of related conditions.


\[ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \]


Where:
- \( P(A|B) \) = Posterior probability (Probability of A given B)
- \( P(B|A) \) = Likelihood (Probability of B given A)
- \( P(A) \) = Prior probability (Initial probability of A)
- \( P(B) \) = Marginal probability (Total probability of B)
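
In practice \( P(B) \) is rarely known directly. For a two-outcome setting such as a diagnostic test, it is usually expanded with the law of total probability over A and its complement \( \neg A \), which gives the form of the denominator used in the disease example:

\[
P(B) = P(B|A)\,P(A) + P(B|\neg A)\,P(\neg A)
\]

\[
P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|\neg A)\,\bigl(1 - P(A)\bigr)}
\]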


In [2]:
# Bayes Theorem Example: Disease Diagnosis
P_Disease = 0.01  # Prior Probability
P_Positive_Given_Disease = 0.95  # Sensitivity
P_Positive_Given_NoDisease = 0.05  # False Positive Rate

# Bayes Theorem Calculation
P_Disease_Given_Positive = (P_Positive_Given_Disease * P_Disease) / (
    (P_Positive_Given_Disease * P_Disease) + (P_Positive_Given_NoDisease * (1 - P_Disease))
)

print(f"Probability of having disease given a positive test: {P_Disease_Given_Positive:.4f}")


Probability of having disease given a positive test: 0.1610
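
The low posterior is driven mostly by the small prior. A short sketch of how the result changes with prevalence, keeping the same sensitivity and false positive rate as above; the `posterior` helper is a hypothetical name and the prior values other than 0.01 are illustrative, not real data:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Same test characteristics as above; only the prior (prevalence) changes
for prior in [0.001, 0.01, 0.1, 0.5]:
    p = posterior(prior, sensitivity=0.95, false_positive_rate=0.05)
    print(f"prior = {prior:<5} -> posterior = {p:.4f}")
```

The same positive result is weak evidence when the condition is rare and strong evidence when it is common, which is why the prior cannot be ignored.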


# Hypothesis Testing & Confidence Intervals
## 🔍 Hypothesis Testing
Hypothesis testing is used to **decide whether sample data provide enough evidence against a default assumption about a population**.

### 📌 Key Terms:
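✔ **Null Hypothesis (H₀)** - The default assumption of no effect or no difference  
✔ **Alternative Hypothesis (H₁)** - The claim that there is an effect or a difference  
✔ **p-value** - Probability of getting results at least as extreme as those observed, assuming H₀ is true  
✔ **Confidence Interval (CI)** - A range computed from the sample by a procedure that captures the true parameter at a stated rate (e.g., in 95% of repeated samples)  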


## Code Example: Hypothesis Testing

In [4]:
from scipy import stats

# Sample Data
group1 = [120, 130, 150, 165, 190, 200]
group2 = [110, 115, 140, 160, 175, 180]

# Perform an independent two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(group1, group2)

# Check significance
if p_value < 0.05:
    print("Reject Null Hypothesis: Significant difference found")
else:
    print("Fail to Reject Null Hypothesis: No significant difference")


Fail to Reject Null Hypothesis: No significant difference
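
A confidence interval for a single mean can be built from the same ingredients. The sketch below computes a 95% interval for the mean of `group1` using the t distribution; it assumes the observations are independent and roughly normally distributed, which six data points cannot really confirm.

```python
import numpy as np
from scipy import stats

group1 = [120, 130, 150, 165, 190, 200]

n = len(group1)
mean = np.mean(group1)
sem = stats.sem(group1)  # standard error of the mean (uses ddof=1)

# Two-sided 95% interval from the t distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

print(f"Mean: {mean:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

With so few observations the interval is wide, which is consistent with the t-test above failing to detect a difference between the groups.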


# Decision Theory & Risk Analysis
## 🔍 What is Decision Theory?
Decision Theory helps in **choosing the best action under uncertainty**.

### 📌 Key Concepts:
✔ **Bayes Risk** - Expected loss averaged over the prior; the optimal decision minimizes it  
✔ **Action Space** - Set of all possible actions  
✔ **Loss Function** - Quantifies the cost of each action under each possible outcome  



In [6]:
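# theta: probability that the loss event occurs, on a grid over [0, 1]
theta = np.linspace(0, 1, 100)
# Expected loss without insurance: a loss of 500 that occurs with probability theta
loss_no_insurance = 500 * theta
# Expected loss with insurance. Assumption: a fixed premium of 50 fully covers the loss,
# so the cost does not depend on theta
loss_with_insurance = np.full_like(theta, 50.0)

# Compare expected losses averaged over theta (equivalent to a uniform prior on theta)
optimal_decision = "Buy Insurance" if loss_with_insurance.mean() < loss_no_insurance.mean() else "No Insurance"
print(f"Optimal Decision: {optimal_decision}")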


Optimal Decision: Buy Insurance
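
Averaging over every value of theta hides the point at which the decision flips. A complementary sketch finds the break-even probability where both actions have the same expected loss. It assumes, as an illustration rather than something the text states, that 500 is the size of the uninsured loss and 50 is a premium that fully covers it; `expected_loss` and `best_action` are hypothetical helper names.

```python
LOSS = 500      # assumed size of the uninsured loss
PREMIUM = 50    # assumed insurance premium (loss fully covered)

def expected_loss(action, theta):
    """Expected loss of an action when the loss event has probability theta."""
    if action == "No Insurance":
        return LOSS * theta
    return PREMIUM  # the premium is paid whether or not the event occurs

def best_action(theta):
    """Pick the action with the smaller expected loss."""
    actions = ["Buy Insurance", "No Insurance"]
    return min(actions, key=lambda a: expected_loss(a, theta))

# Break-even probability: LOSS * theta == PREMIUM
threshold = PREMIUM / LOSS

for theta in [0.05, 0.2]:
    print(f"theta = {theta}: {best_action(theta)}")
print(f"Insurance has the lower expected loss when theta > {threshold}")
```

Under these assumptions the rule reduces to comparing theta with the ratio of premium to loss.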


# 📌 Key Statistical Concepts and Applications  

- **Type I and Type II Errors**:  
  - **Type I Error (False Positive)**: Rejecting a true null hypothesis.  
  - **Type II Error (False Negative)**: Failing to reject a false null hypothesis.  

- **Central Limit Theorem (CLT)**:  
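  - For independent samples from a population with finite variance, the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, whatever the shape of the population.  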
  - Essential for hypothesis testing and confidence intervals.  

- **p-value Interpretation**:  
  - If **p < 0.05** (the conventional significance level α), reject the null hypothesis (**statistically significant**).  
  - If **p ≥ 0.05**, fail to reject the null hypothesis (**not enough evidence to support the alternative hypothesis**).  

- **Law of Large Numbers**:  
  - As the sample size increases, the sample mean gets closer to the true population mean.  
  - Helps justify why larger samples yield more accurate estimates.  

- **Confidence Intervals**:  
  - A **95% confidence interval** means that if we draw 100 random samples and build an interval from each, about **95 of those intervals will contain the true population parameter**.  
  - Used to quantify uncertainty in estimates.  

- **t-test vs. z-test**:  
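  - **t-test**: Used when the **population variance is unknown** and must be estimated from the sample, which matters most for small samples (**n < 30** is a common rule of thumb).  
  - **z-test**: Used when the **population variance is known**, or as an approximation when the **sample size is large (n ≥ 30)**.  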

- **Likelihood Function**:  
  - Gives the probability (or density) of the observed data under a given set of parameters, viewed as a function of those parameters.  
  - Used in **Maximum Likelihood Estimation (MLE)** to determine model parameters.  

- **Correlation vs. Causation**:  
  - **Correlation**: Two variables move together, but one does not necessarily cause the other.  
  - **Causation**: A change in one variable directly influences another.  

- **Parametric vs. Non-Parametric Tests**:  
  - **Parametric Tests**: Assume the data follow a specific distribution, typically normal (e.g., **t-test, ANOVA**).  
  - **Non-Parametric Tests**: Do not assume a specific distributional form (e.g., **Wilcoxon test, Kruskal-Wallis test**).  

- **Bayes’ Theorem & Applications**:  
  - Formula:  
   
    \[ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} \]
   
  - **Spam Detection**: Given an email contains certain words (e.g., "discount"), Bayes’ Theorem helps determine the probability that the email is spam.  
  - **Medical Diagnosis**: Used to calculate the probability that a patient has a disease given a positive test result.  
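
The Central Limit Theorem and the repeated-sampling reading of confidence intervals listed above can both be checked by simulation. The sketch below is illustrative only: the exponential population, the sample size of 50, the number of repetitions, and the seed are all arbitrary choices, and the intervals use the normal approximation (1.96 standard errors).

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed population: exponential with mean 1 (chosen only for illustration)
true_mean = 1.0
n, n_samples = 50, 10_000

# Draw many samples of size n and record each sample mean
samples = rng.exponential(scale=true_mean, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# CLT: sample means cluster around true_mean with spread close to sigma / sqrt(n)
# (for an exponential distribution, sigma equals the mean)
print(f"Mean of sample means: {sample_means.mean():.3f}")
print(f"Std of sample means:  {sample_means.std(ddof=1):.3f} "
      f"(theory: {true_mean / np.sqrt(n):.3f})")

# Coverage: fraction of approximate 95% intervals that contain true_mean
sem = samples.std(axis=1, ddof=1) / np.sqrt(n)
lower = sample_means - 1.96 * sem
upper = sample_means + 1.96 * sem
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Coverage of 95% intervals: {coverage:.3f}")
```

Because the population is skewed and n is moderate, the observed coverage can fall slightly short of 95%; it moves closer to the nominal level as n increases.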

These statistical concepts are fundamental for understanding data patterns, making predictions, and validating results in **data science, AI, finance, and healthcare applications**.
