## Z-Test

1. When to Use a Z-test
    - Sample Size: n > 30
    - The population's SD ($\sigma$) and mean ($\mu$) are known
    - The sample's mean ($\bar{x}$) is known
    - The data points are independent and follow a normal distribution
    
2. The Formula

    For a One-Sample Z-test (comparing a sample to a population):

    $$Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$$

    Where:
    - $\bar{x}$ = Sample Mean
    - $\mu$ = Population Mean
    - $\sigma$ = Population Standard Deviation
    - $n$ = Sample Size

> ## Reasons for Difference

In [7]:
import numpy as np
from scipy.stats import norm
from statsmodels.stats.weightstats import ztest

# --- 1. THE SETUP ---
data = [172, 174, 168, 169, 171, 173, 175, 170, 169, 172]
pop_mean = 170
known_sigma = 3  # The value given in your textbook problem

# --- 2. OPTION A: The "Textbook" Method (Use this for your homework) ---
# We use a custom function because we MUST use the given sigma (3)
def textbook_z_test(sample, value, sigma):
    n = len(sample)
    sample_mean = np.mean(sample)
    
    # Standard Error using the KNOWN sigma
    se = sigma / np.sqrt(n)
    
    z = (sample_mean - value) / se
    p = 2 * (1 - norm.cdf(abs(z))) # Two-tailed
    
    return z, p

z_textbook, p_textbook = textbook_z_test(data, value=pop_mean, sigma=known_sigma)

print(f"--- Textbook Result (Sigma={known_sigma}) ---")
print(f"Z-score: {z_textbook:.4f}")
print(f"P-value: {p_textbook:.4f}")
# Expected: Z ≈ 1.37

# --- 3. OPTION B: The "Real World" Method (Statsmodels) ---
# This calculates sigma from the data automatically (approx 2.31)
z_real, p_real = ztest(data, value=pop_mean)

print("\n--- Statsmodels Result (Calculated Sigma) ---")
print(f"Z-score: {z_real:.4f}")
print(f"P-value: {p_real:.4f}")
# Result will be different (Z approx 1.78) because the estimated sigma is smaller than 3.

# --- 4. EXPLANATION ---
print("\n--- WHY ARE THEY DIFFERENT? ---")
real_sigma = np.std(data, ddof=1)
print(f"You told the manual code to use Sigma = {known_sigma}")
print(f"But the data's actual Sigma is = {real_sigma:.4f}")
print(f"Because the actual sigma ({real_sigma:.2f}) is smaller than {known_sigma}, the Z-score in Statsmodels is higher.")

--- Textbook Result (Sigma=3) ---
Z-score: 1.3703
P-value: 0.1706

--- Statsmodels Result (Calculated Sigma) ---
Z-score: 1.7782
P-value: 0.0754

--- WHY ARE THEY DIFFERENT? ---
You told the manual code to use Sigma = 3
But the data's actual Sigma is = 2.3118
Because the actual sigma (2.31) is smaller than 3, the Z-score in Statsmodels is higher.
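
To make the gap concrete, the same arithmetic can be worked by hand. The sample mean of the ten values is 171.3 and $n = 10$; the only thing that changes between the two results is the standard deviation in the denominator (the given 3 versus the sample estimate 2.3118):

$$Z_{\text{textbook}} = \frac{171.3 - 170}{3 / \sqrt{10}} \approx \frac{1.3}{0.9487} \approx 1.37$$

$$Z_{\text{statsmodels}} = \frac{171.3 - 170}{2.3118 / \sqrt{10}} \approx \frac{1.3}{0.7311} \approx 1.78$$

A smaller standard deviation gives a smaller standard error, so the same 1.3-unit difference counts as more standard errors away from the population mean.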


#### `ztest` from statsmodels estimates sigma from the sample data with n-1
- NOTE: `ztest()` has no argument for a known population standard deviation, so statsmodels always estimates it from the sample using n-1 (ddof=1).
- It effectively calculates a t-statistic but tests it against the Normal distribution.
```python
z_score, p_value = ztest(data, value=170)
```
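
As a rough check of the note above, the sketch below recomputes the statistic by hand with the sample standard deviation (n-1 divisor) and a Normal reference distribution. It uses only the standard library and does not call statsmodels, so it illustrates the arithmetic rather than the library's internals. The names `z_manual` and `p_manual` are made up for this example.

```python
import math
from statistics import NormalDist, mean, stdev

# Same height sample as in the first example
data = [172, 174, 168, 169, 171, 173, 175, 170, 169, 172]
pop_mean = 170

n = len(data)
s = stdev(data)                 # sample SD: divides by n - 1 (ddof=1)
se = s / math.sqrt(n)           # estimated standard error
z_manual = (mean(data) - pop_mean) / se

# Two-tailed p-value against the standard Normal distribution
p_manual = 2 * (1 - NormalDist().cdf(abs(z_manual)))

print(f"s = {s:.4f}")
print(f"z = {z_manual:.4f}")
print(f"p = {p_manual:.4f}")
```

If the note is right, these figures should agree with the statsmodels result printed earlier (Z ≈ 1.7782, p ≈ 0.0754).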

In [11]:
a = np.std(data)
a.round(2)

np.float64(2.19)

### Pitfall: NumPy and Pandas Use Different `ddof` Defaults
Summary Rule:

- If you are describing data you have: Use ddof=0.
    - NumPy's `np.std` defaults to Population (ddof=0).
- If you are estimating data you don't have (using a sample): Use ddof=1.
    - Pandas' `.std()` defaults to Sample (ddof=1).


1. Rule for Using `ddof=0` (Population scenario: "I have the full data"):
 
    - `Use Case:` You are calculating the average height of just the students in your specific classroom.
    - You have every single measurement. You are not guessing about the whole school.



2. Rule for Using `ddof=1` (Sample scenario: "I am estimating"):

    - `Use Case:` You surveyed 5 students to guess the average height of all students in the college.
    - You are using this small group to estimate the larger group.

In [12]:
import numpy as np
import pandas as pd

data = [10, 12, 23, 23, 16, 23, 21, 16]

# Numpy uses N (Population)
print(f"Numpy:  {np.std(data):.4f}")        # Output: 4.8990

# Pandas uses N-1 (Sample)
print(f"Pandas: {pd.Series(data).std():.4f}") # Output: 5.2372

# To make Numpy match Pandas, you must set ddof=1 manually:
print(f"Fixed:  {np.std(data, ddof=1):.4f}")  # Output: 5.2372

Numpy:  4.8990
Pandas: 5.2372
Fixed:  5.2372


Summary Checklist
1. Textbook Problem (Sigma given)? $\rightarrow$ Do the math manually (or use Case 1 of `perform_z_test` below).
2. Real Data (N $\geq$ 30)? $\rightarrow$ `statsmodels.stats.weightstats.ztest`.
3. Real Data (N < 30)? $\rightarrow$ `scipy.stats.ttest_1samp`.
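
For the third item, here is a minimal sketch of the small-sample route, reusing the ten height values from the first example (n = 10, so N < 30). It assumes SciPy is installed. The variable names `p_t` and `p_z` are chosen for this example only.

```python
from scipy import stats

# Ten height values from the first example; n = 10, so a t-test is the safer choice
data = [172, 174, 168, 169, 171, 173, 175, 170, 169, 172]
pop_mean = 170

# One-sample, two-sided t-test: sigma is estimated from the sample (n - 1 divisor)
t_stat, p_t = stats.ttest_1samp(data, popmean=pop_mean)

# The same statistic referred to a Normal distribution, as a z-test would do
p_z = 2 * stats.norm.sf(abs(t_stat))

print(f"t = {t_stat:.4f}")
print(f"p (t, df={len(data) - 1}) = {p_t:.4f}")
print(f"p (Normal) = {p_z:.4f}")
```

The statistic is the same number a z-test on estimated sigma would report. Only the reference distribution changes, and because the t distribution has heavier tails, its p-value comes out larger for a sample this small.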

In [13]:
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

def perform_z_test(data, pop_mean, known_sigma=None):
    """
    One-sample, two-tailed Z-test.

    Uses known_sigma when it is given; otherwise estimates sigma from the sample.
    Returns (z, p).
    """
    # CASE 1: Textbook / Exam Mode (You have a known Sigma)
    if known_sigma is not None:
        print(f"Running TEXTBOOK Z-Test (Sigma={known_sigma})")
        n = len(data)
        se = known_sigma / np.sqrt(n)
        z = (np.mean(data) - pop_mean) / se
        p = 2 * (1 - stats.norm.cdf(abs(z)))
        return z, p

    # CASE 2: Real World / ML Mode (You don't know Sigma)
    else:
        # Check for small sample size warning
        if len(data) < 30:
            print("WARNING: N < 30. You should probably use a T-test instead!")
        
        print("Running REAL-WORLD Z-Test (Estimated Sigma)")
        # Statsmodels estimates sigma with the sample STD (ddof=1)
        z, p = ztest(data, value=pop_mean)
        return z, p