# Lecture - 05

- Z-test
- T-test
- Chi-Square test

## Real-World Scenarios for Z-Test and T-Test

### Scenario 1: One Sample Z-Test

**Scenario:** A marketing team wants to know if the average amount spent by customers during a sale is significantly different from the average amount spent normally. The population standard deviation is known.

**Step-by-Step Process:**

1. **Define the Hypotheses:**
   - **Null Hypothesis (H₀):** The average amount spent during the sale is equal to the normal average amount (µ = µ₀).
   - **Alternative Hypothesis (H₁):** The average amount spent during the sale is different from the normal average amount (µ ≠ µ₀).

2. **Collect Data:**
   - Normal average spending (µ₀): $50
   - Population standard deviation (σ): $10
   - Sample size (n): 100
   - Sample mean spending during the sale (x̄): $53


In [1]:
import numpy as np
import scipy.stats as stats

# Given data
population_mean = 50
population_std = 10
sample_size = 100
sample_mean = 53

# Z-Test
z = (sample_mean - population_mean) / (population_std / np.sqrt(sample_size))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed

print("Z-score:", z)
print("P-value:", p_value)

# Interpretation: a small p-value means we reject H0 (the means differ)
if p_value < 0.05:
    print("The average amount spent during the sale is significantly different from the normal average amount.")
else:
    print("The average amount spent during the sale is not significantly different from the normal average amount.")

Z-score: 3.0
P-value: 0.002699796063260207
The average amount spent during the sale is significantly different from the normal average amount.


### Scenario 2: Two Sample Z-Test

Let's consider a real-world example where you want to compare the average test scores of two different teaching methods to determine if there is a significant difference between them.

**Scenario:**

- **Teaching Method A:**
  - Sample size (n₁): 50
  - Sample mean (x̄₁): 78
  - Population standard deviation (σ₁): 10

- **Teaching Method B:**
  - Sample size (n₂): 60
  - Sample mean (x̄₂): 82
  - Population standard deviation (σ₂): 12
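
The scenario does not spell out its hypotheses. Assuming the usual two-sided setup (H₀: µ₁ = µ₂ against H₁: µ₁ ≠ µ₂), the two-sample Z-statistic can be worked out by hand from the values above:

\[
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
  = \frac{78 - 82}{\sqrt{\dfrac{10^2}{50} + \dfrac{12^2}{60}}}
  = \frac{-4}{\sqrt{2 + 2.4}}
  = \frac{-4}{\sqrt{4.4}} \approx -1.907
\]

Because \(|z| \approx 1.907\) is just below the two-tailed critical value of about 1.96, H₀ is not rejected at the 0.05 level.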

In [2]:
import numpy as np
import scipy.stats as stats

# Given data
n1 = 50
mean1 = 78
std_dev1 = 10

n2 = 60
mean2 = 82
std_dev2 = 12

# Calculate the Z-score
z = (mean1 - mean2) / np.sqrt((std_dev1**2 / n1) + (std_dev2**2 / n2))

# Calculate the p-value
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed test

print("Z-score:", z)
print("P-value:", p_value)

# Interpretation
if p_value < 0.05:
    print("There is a significant difference between the two teaching methods.")
else:
    print("There is no significant difference between the two teaching methods.")

Z-score: -1.9069251784911845
P-value: 0.05653027716740433
There is no significant difference between the two teaching methods.


### Scenario 3: One-Sample T-Test Example

**Scenario:** A teacher wants to know if the average score of her students in a recent exam is different from the national average score of 70.

**Step-by-Step Process:**

1. **Define the Hypotheses:**
   - **Null Hypothesis (H₀):** The average score of the students is equal to the national average (µ = 70).
   - **Alternative Hypothesis (H₁):** The average score of the students is different from the national average (µ ≠ 70).

2. **Collect Data:**
   - Sample scores: [68, 75, 80, 85, 70, 78, 82, 74, 79, 77]

3. **Perform the One-Sample T-Test**

In [3]:
import numpy as np
import scipy.stats as stats

# Sample data
scores = np.array([68, 75, 80, 85, 70, 78, 82, 74, 79, 77])
population_mean = 70

# One-Sample T-Test
t, p_value = stats.ttest_1samp(scores, population_mean)

print("T-score:", t)
print("P-value:", p_value)

# Interpretation
if p_value < 0.05:
    print("The average score is significantly different from the national average.")
else:
    print("The average score is not significantly different from the national average.")

T-score: 4.116384992583433
P-value: 0.002611994476746316
The average score is significantly different from the national average.
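
To make clear what `stats.ttest_1samp` computes, the sketch below rebuilds the same T-statistic and P-value by hand from the sample mean, the sample standard deviation (with `ddof=1`), and the T-distribution with n - 1 degrees of freedom. The variable names `t_manual` and `p_manual` are introduced here for illustration only.

```python
import numpy as np
import scipy.stats as stats

scores = np.array([68, 75, 80, 85, 70, 78, 82, 74, 79, 77])
population_mean = 70

# Manual computation: t = (sample mean - mu0) / (s / sqrt(n))
n = len(scores)
sample_mean = scores.mean()
sample_std = scores.std(ddof=1)          # sample standard deviation
standard_error = sample_std / np.sqrt(n)
t_manual = (sample_mean - population_mean) / standard_error

# Two-tailed p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Library result for comparison
t_scipy, p_scipy = stats.ttest_1samp(scores, population_mean)

print("Manual :", t_manual, p_manual)
print("SciPy  :", t_scipy, p_scipy)
```

Both lines should agree up to floating-point rounding.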


### Scenario 4: Two-Sample T-Test Example

**Scenario:** A company wants to compare the average sales of two different regions to determine if there is a significant difference.

**Step-by-Step Process:**

1. **Define the Hypotheses:**
   - **Null Hypothesis (H₀):** The average sales of the two regions are equal (µ1 = µ2).
   - **Alternative Hypothesis (H₁):** The average sales of the two regions are different (µ1 ≠ µ2).

2. **Collect Data:**
   - Sales data for Region A: [305, 356, 408, 379, 420, 401, 355, 388]
   - Sales data for Region B: [389, 390, 395, 410, 392, 402, 415, 397]

3. **Perform the Two-Sample T-Test**

In [4]:
import numpy as np
import scipy.stats as stats

# Sales data
region_A = np.array([305, 356, 408, 379, 420, 401, 355, 388])
region_B = np.array([389, 390, 395, 410, 392, 402, 415, 397])

# Two-Sample T-Test
t, p_value = stats.ttest_ind(region_A, region_B)

print("T-score:", t)
print("P-value:", p_value)

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in sales between the two regions.")
else:
    print("There is no significant difference in sales between the two regions.")

T-score: -1.6443261402925184
P-value: 0.12236803487610394
There is no significant difference in sales between the two regions.
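
By default, `stats.ttest_ind` pools the two sample variances, which assumes both regions have the same population variance. Region A's figures are visibly more spread out than Region B's, so one reasonable alternative is Welch's T-test, requested with `equal_var=False`. Using Welch's test here is a suggestion, not part of the original scenario; the sketch below compares the two variants.

```python
import numpy as np
import scipy.stats as stats

region_A = np.array([305, 356, 408, 379, 420, 401, 355, 388])
region_B = np.array([389, 390, 395, 410, 392, 402, 415, 397])

# Sample variances (ddof=1) show how different the spreads are
var_A = region_A.var(ddof=1)
var_B = region_B.var(ddof=1)
print("Sample variance A:", var_A)
print("Sample variance B:", var_B)

# Pooled-variance (Student) t-test: the default, assumes equal variances
t_pooled, p_pooled = stats.ttest_ind(region_A, region_B)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(region_A, region_B, equal_var=False)

print("Pooled:", t_pooled, p_pooled)
print("Welch :", t_welch, p_welch)
```

Because the two samples have the same size, the two T-statistics coincide; only the degrees of freedom, and therefore the P-value, change.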


## Paired Sample T-Test
The paired sample T-test, also known as the dependent sample T-test or matched pairs T-test, is used to determine if there is a significant difference between the means of two related groups. It is commonly used when the same subjects are measured under two different conditions, such as before and after a treatment.

### When to Use a Paired Sample T-Test

- **Two measurements taken on the same group**: For example, measuring blood pressure before and after a treatment on the same individuals.
- **Each subject or unit provides a pair of data points**: This means the data are not independent; instead, they are paired or matched in some way.

### Steps to Perform a Paired Sample T-Test

1. **Define Hypotheses:**
   - **Null Hypothesis (H₀):** The mean difference between the paired observations is zero (µd = 0).
   - **Alternative Hypothesis (H₁):** The mean difference between the paired observations is not zero (µd ≠ 0).

2. **Collect Data:**
   - Record the differences between paired observations.
   - Calculate the mean and standard deviation of these differences.

3. **Calculate the T-Statistic:**

   The formula for the T-statistic is:
   \[
   t = \frac{\bar{d}}{s_d / \sqrt{n}}
   \]
   where:
   - \(\bar{d}\) is the mean of the differences between paired observations.
   - \(s_d\) is the standard deviation of the differences.
   - \(n\) is the number of pairs.

4. **Calculate the P-Value:**

   Use the T-statistic to find the P-value from the T-distribution with \(n - 1\) degrees of freedom.

5. **Make a Decision:**

   Compare the P-value with the significance level (e.g., 0.05) to decide whether to reject the null hypothesis.

### Example in Python

Let's consider a scenario where a nutritionist wants to test the effectiveness of a new diet plan. They measure the weight of the same group of individuals before and after the diet.

**Scenario:**

- **Weight Measurements Before Diet:**
  - [70, 82, 75, 90, 85]

- **Weight Measurements After Diet:**
  - [68, 80, 74, 88, 84]

**Python Code:**

```python
import numpy as np
import scipy.stats as stats

# Data
weight_before = np.array([70, 82, 75, 90, 85])
weight_after = np.array([68, 80, 74, 88, 84])

# Calculate the differences
differences = weight_before - weight_after

# Mean and standard deviation of differences
mean_diff = np.mean(differences)
std_diff = np.std(differences, ddof=1)  # ddof=1 for sample standard deviation
n = len(differences)

# T-Statistic
t_statistic = mean_diff / (std_diff / np.sqrt(n))

# P-Value
p_value = 2 * (1 - stats.t.cdf(np.abs(t_statistic), df=n-1))  # two-tailed test

print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in weight before and after the diet.")
else:
    print("There is no significant difference in weight before and after the diet.")
```

### Explanation of the Code:

1. **Data Setup:**
   - Define the weights before and after the diet.

2. **Calculate Differences:**
   - Compute the differences between paired observations.

3. **Calculate Mean and Standard Deviation:**
   - Compute the mean and standard deviation of these differences.

4. **Calculate the T-Statistic:**
   - Use the formula to compute the T-statistic for the paired differences.

5. **Calculate the P-Value:**
   - Find the P-value using the T-distribution.

6. **Interpret Results:**
   - Compare the P-value with a significance level (e.g., 0.05) to determine if the diet had a significant effect.
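
The manual calculation above can be cross-checked against SciPy's built-in paired test, `stats.ttest_rel`. The sketch below runs both on the same weights and prints them side by side; the names `t_manual` and `p_manual` are introduced here for illustration.

```python
import numpy as np
import scipy.stats as stats

weight_before = np.array([70, 82, 75, 90, 85])
weight_after = np.array([68, 80, 74, 88, 84])

# Built-in paired t-test (argument order sets the sign of t)
t_scipy, p_scipy = stats.ttest_rel(weight_before, weight_after)

# Same test computed by hand from the paired differences
differences = weight_before - weight_after
n = len(differences)
t_manual = differences.mean() / (differences.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print("SciPy  :", t_scipy, p_scipy)
print("Manual :", t_manual, p_manual)
```

A paired T-test is equivalent to a one-sample T-test of the differences against zero, which is why the two results agree.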

### Summary

The paired sample T-test is used to compare two related groups and determine if there is a significant difference between their means. It is particularly useful when dealing with repeated measures or matched pairs of data. The test relies on calculating the differences between pairs, summarizing these differences, and using statistical methods to test the significance of the observed differences.

## Paired Z-Test
The paired Z-test is similar in concept to the paired T-test but is used under different conditions. The paired Z-test is typically applied when the sample size is large (n > 30) and the population standard deviation of the differences is known. In contrast, the paired T-test is used when the sample size is small or the population standard deviation is unknown.

### Paired Z-Test Overview

#### When to Use a Paired Z-Test

- **Large Sample Size:** Generally, when the number of pairs is greater than 30.
- **Known Population Standard Deviation:** When the standard deviation of the differences between paired observations is known.

#### Steps to Perform a Paired Z-Test

1. **Define Hypotheses:**
   - **Null Hypothesis (H₀):** The mean difference between the paired observations is zero (\( \mu_d = 0 \)).
   - **Alternative Hypothesis (H₁):** The mean difference between the paired observations is not zero (\( \mu_d \neq 0 \)).

2. **Collect Data:**
   - Record the differences between paired observations.
   - Calculate the mean of these differences. The standard deviation of the differences (σ_d) is taken as known rather than estimated from the sample.

3. **Calculate the Z-Statistic:**

   The formula for the Z-statistic is:
   \[
   Z = \frac{\bar{d}}{\frac{\sigma_d}{\sqrt{n}}}
   \]
   where:
   - \(\bar{d}\) is the mean of the differences between paired observations.
   - \(\sigma_d\) is the population standard deviation of the differences.
   - \(n\) is the number of pairs.

4. **Calculate the P-Value:**

   Use the Z-statistic to find the P-value from the standard normal distribution.

5. **Make a Decision:**

   Compare the P-value with the significance level (e.g., 0.05) to decide whether to reject the null hypothesis.

### Example in Python

Suppose a researcher is evaluating the effect of a new teaching method on students' test scores. The test scores are recorded before and after the implementation of the new method.

**Scenario:**

- **Test Scores Before Method:**
  - [60, 65, 70, 75, 80]

- **Test Scores After Method:**
  - [62, 68, 72, 78, 85]

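Assume the population standard deviation of the differences is known to be 5. The five pairs used here are far fewer than the n > 30 guideline above and serve only to keep the example short; with so few pairs, the Z-test is valid only if the differences themselves are normally distributed.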

**Python Code:**

```python
import numpy as np
import scipy.stats as stats

# Data
scores_before = np.array([60, 65, 70, 75, 80])
scores_after = np.array([62, 68, 72, 78, 85])

# Calculate differences
differences = scores_after - scores_before

# Given population standard deviation of differences
sigma_d = 5
n = len(differences)

# Calculate the mean difference
mean_diff = np.mean(differences)

# Z-Statistic
z_statistic = mean_diff / (sigma_d / np.sqrt(n))

# Calculate the P-Value
p_value = 2 * (1 - stats.norm.cdf(np.abs(z_statistic)))  # two-tailed test

print("Z-Statistic:", z_statistic)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in test scores before and after the new teaching method.")
else:
    print("There is no significant difference in test scores before and after the new teaching method.")
```

### Explanation of the Code:

1. **Data Setup:**
   - Define the test scores before and after the new teaching method.

2. **Calculate Differences:**
   - Compute the differences between paired observations.

3. **Calculate Mean Difference:**
   - Find the mean of these differences.

4. **Calculate the Z-Statistic:**
   - Use the formula to compute the Z-statistic for the paired differences, using the known population standard deviation.

5. **Calculate the P-Value:**
   - Find the P-value from the standard normal distribution, doubling the upper-tail probability for a two-tailed test.

6. **Interpret Results:**
   - Compare the P-value with a significance level (e.g., 0.05) to determine if the difference is statistically significant.
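
For reference, the same calculation can be followed by hand. The differences (after minus before) are 2, 3, 2, 3, 5, so:

\[
\bar{d} = \frac{2 + 3 + 2 + 3 + 5}{5} = 3,
\qquad
Z = \frac{\bar{d}}{\sigma_d / \sqrt{n}} = \frac{3}{5 / \sqrt{5}} = \frac{3}{\sqrt{5}} \approx 1.342
\]

\[
p = 2\,\bigl(1 - \Phi(|Z|)\bigr) \approx 2\,(1 - 0.910) \approx 0.18
\]

Here \(\Phi\) denotes the standard normal cumulative distribution function. Since \(p \approx 0.18\) is greater than 0.05, the null hypothesis is not rejected for this data.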

### Summary

The paired Z-test is used to compare the means of two related groups when the sample size is large and the population standard deviation of the differences is known. It tests whether the mean of the paired differences is significantly different from zero. When the standard deviation has to be estimated from the sample, the paired T-test is the appropriate choice; for large samples the two tests give nearly identical results.

## A/B Testing
A/B testing compares a metric between two variants, and a difference in means is often analyzed with a T-test. The T-test suits A/B tests where the population variance is unknown and must be estimated from the data, which matters most at smaller sample sizes. It assumes the metric is approximately normally distributed, or that the samples are large enough for the sample means to be approximately normal.

### Types of T-Tests for A/B Testing

1. **Two-Sample T-Test (Independent T-Test):**
   - Used when comparing the means of two independent groups. This is appropriate for A/B testing when the two groups (A and B) are distinct and not paired.
   
2. **Paired T-Test:**
   - Used when comparing the means of two related groups, such as measurements taken from the same subjects before and after an intervention. This is less common in A/B testing but could be applicable in cases where the same subjects are tested under two different conditions.

### Example of A/B Testing with Two-Sample T-Test

Suppose you want to test whether two different marketing campaigns have different impacts on sales. 

**Data Collection:**

- **Group A (Control):** Sales data for the control group.
- **Group B (Treatment):** Sales data for the treatment group.

**Objective:**
- Determine whether the average sales of Group A and Group B differ significantly (a two-tailed test).

**Python Code for Two-Sample T-Test:**

```python
import numpy as np
import scipy.stats as stats

# Data
sales_A = np.array([200, 220, 210, 230, 240, 250, 210, 220, 230, 240])
sales_B = np.array([220, 240, 230, 250, 270, 260, 240, 250, 260, 270])

# Perform Two-Sample T-Test
t_statistic, p_value = stats.ttest_ind(sales_A, sales_B)

print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in sales between the two marketing campaigns.")
else:
    print("There is no significant difference in sales between the two marketing campaigns.")
```

### Explanation of the Code:

1. **Define Data:**
   - Provide sales data for both groups.

2. **Perform the T-Test:**
   - Use the `ttest_ind` function from `scipy.stats` to perform the independent T-test.

3. **Interpret Results:**
   - Compare the P-value with the significance level (e.g., 0.05) to determine if the difference between the two groups is statistically significant.
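
When the question is directional, for example whether campaign B's average sales are higher than campaign A's, a one-sided test may be a better match than the two-sided default. The sketch below uses the `alternative` parameter of `stats.ttest_ind`, which assumes SciPy 1.6 or newer; `alternative="less"` tests whether the mean of the first sample is less than the mean of the second.

```python
import numpy as np
import scipy.stats as stats

sales_A = np.array([200, 220, 210, 230, 240, 250, 210, 220, 230, 240])
sales_B = np.array([220, 240, 230, 250, 270, 260, 240, 250, 260, 270])

# Two-sided test: H1 is mean(A) != mean(B)
t_two, p_two = stats.ttest_ind(sales_A, sales_B)

# One-sided test: H1 is mean(A) < mean(B), i.e. campaign B sells more
t_one, p_one = stats.ttest_ind(sales_A, sales_B, alternative="less")

print("Two-sided:", t_two, p_two)
print("One-sided:", t_one, p_one)
```

The T-statistic is the same in both cases. Because it is negative here (A's mean is below B's), the one-sided P-value is half the two-sided one. The direction of the test should be chosen before looking at the data.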

### Example of A/B Testing with Paired T-Test

If you were testing the effect of a new feature on user engagement using the same users before and after the feature was introduced, you might use a paired T-test.

**Data Collection:**

- **Before Feature:** User engagement scores before the feature was introduced.
- **After Feature:** User engagement scores after the feature was introduced.

**Python Code for Paired T-Test:**

```python
import numpy as np
import scipy.stats as stats

# Data
engagement_before = np.array([5, 6, 7, 5, 6, 7, 6, 7, 5, 6])
engagement_after = np.array([7, 8, 9, 8, 9, 10, 8, 9, 7, 8])

# Perform Paired T-Test
t_statistic, p_value = stats.ttest_rel(engagement_before, engagement_after)

print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("There is a significant difference in user engagement before and after the feature introduction.")
else:
    print("There is no significant difference in user engagement before and after the feature introduction.")
```

### Explanation of the Code:

1. **Define Data:**
   - Provide engagement scores before and after the feature.

2. **Perform the Paired T-Test:**
   - Use the `ttest_rel` function from `scipy.stats` to perform the paired T-test.

3. **Interpret Results:**
   - Compare the P-value with the significance level (e.g., 0.05) to determine if the change is statistically significant.

### Summary

- **Two-Sample T-Test:** Used for comparing the means of two independent groups in A/B testing.
- **Paired T-Test:** Used for comparing the means of two related groups.

Both types of T-tests are useful for A/B testing depending on whether your groups are independent or related.

#### Note: A/B tests can also be analyzed with a Z-test or a Chi-Square test
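
As an illustration of that note, the sketch below analyzes an A/B test on conversion rates in two ways: a two-proportion Z-test computed by hand, and a Chi-Square test of independence on the same counts. The conversion counts are hypothetical and chosen only for the example; they do not come from the lecture data.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical counts (illustrative only): conversions and visitors per variant
conversions = np.array([120, 150])   # variant A, variant B
visitors = np.array([1000, 1000])

# Two-proportion Z-test using the pooled conversion rate under H0
rates = conversions / visitors
pooled_rate = conversions.sum() / visitors.sum()
standard_error = np.sqrt(
    pooled_rate * (1 - pooled_rate) * (1 / visitors[0] + 1 / visitors[1])
)
z_statistic = (rates[0] - rates[1]) / standard_error
p_value_z = 2 * stats.norm.sf(abs(z_statistic))

# Chi-Square test on the 2x2 table: rows = variants, columns = converted / not
table = np.array([conversions, visitors - conversions]).T
chi2_stat, p_value_chi2, dof, expected = stats.chi2_contingency(
    table, correction=False  # no continuity correction, to match the Z-test
)

print("Z-statistic:", z_statistic, "P-value:", p_value_z)
print("Chi-Square :", chi2_stat, "P-value:", p_value_chi2)
```

For a 2×2 table without continuity correction, the Chi-Square statistic equals the square of the Z-statistic, so the two tests give the same two-sided P-value.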

## Chi-Square Tests
The Chi-Square test is a statistical method used to determine if there is a significant association between categorical variables. It is often used in hypothesis testing for categorical data to assess whether observed frequencies differ from expected frequencies.

### Types of Chi-Square Tests

1. **Chi-Square Test of Independence:**
   - Determines if two categorical variables are independent.
   
2. **Chi-Square Test of Goodness of Fit:**
   - Tests whether the observed frequencies of a single categorical variable match a specified (expected) distribution.

### Example of Chi-Square Test of Independence

**Scenario:**
Imagine you want to investigate whether there is an association between gender and preference for a new product. You collect survey data from 200 respondents about their gender and their preference for the product.

**Data:**

- **Preference for Product A:**
  - **Male:** 40
  - **Female:** 60

- **Preference for Product B:**
  - **Male:** 30
  - **Female:** 70

**Objective:**
- Determine if gender is independent of product preference.

**Python Code:**

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Create a contingency table
data = pd.DataFrame({
    'Product A': [40, 60],
    'Product B': [30, 70]
}, index=['Male', 'Female'])

# Perform Chi-Square Test of Independence
chi2_stat, p_value, dof, expected = stats.chi2_contingency(data)

print("Chi-Square Statistic:", chi2_stat)
print("P-Value:", p_value)
print("Degrees of Freedom:", dof)
print("Expected Frequencies Table:\n", expected)

# Interpretation
if p_value < 0.05:
    print("There is a significant association between gender and product preference.")
else:
    print("There is no significant association between gender and product preference.")
```

### Explanation of the Code:

1. **Define Data:**
   - Create a contingency table representing the counts of each category.

2. **Perform the Chi-Square Test:**
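   - Use `stats.chi2_contingency()` to perform the test. This function returns the Chi-Square statistic, P-value, degrees of freedom, and expected frequencies. For a 2×2 table such as this one, it applies Yates' continuity correction by default (`correction=True`).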

3. **Interpret Results:**
   - Compare the P-value with the significance level (e.g., 0.05) to determine if there is a significant association between the variables.
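
To show where the expected frequencies come from, the sketch below computes them by hand as (row total × column total) / grand total, then compares the resulting statistic with `stats.chi2_contingency` with and without Yates' continuity correction. The variable names ending in `_manual`, `_plain`, and `_yates` are introduced here for illustration.

```python
import numpy as np
import scipy.stats as stats

# Rows: Male, Female; columns: Product A, Product B
observed = np.array([[40, 30],
                     [60, 70]])

# Expected count per cell under independence
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson Chi-Square statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# SciPy without and with Yates' continuity correction
chi2_plain, p_plain, dof, expected = stats.chi2_contingency(observed, correction=False)
chi2_yates, p_yates, _, _ = stats.chi2_contingency(observed)

print("Expected (manual):\n", expected_manual)
print("Chi-Square manual:", chi2_manual)
print("SciPy, no correction:", chi2_plain, p_plain)
print("SciPy, Yates (default):", chi2_yates, p_yates)
```

The manual statistic matches the uncorrected SciPy result. The default corrected statistic is smaller, which gives a more conservative P-value.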

### Example of Chi-Square Test of Goodness of Fit

**Scenario:**
Assume you want to test if the distribution of preferred ice cream flavors in a sample matches the expected distribution. You expect the following proportions based on market research:

- **Vanilla:** 50%
- **Chocolate:** 30%
- **Strawberry:** 20%

You collect data from 100 respondents:

- **Vanilla:** 45
- **Chocolate:** 35
- **Strawberry:** 20

**Objective:**
- Determine if the observed distribution of ice cream flavors matches the expected distribution.

**Python Code:**

```python
import numpy as np
import scipy.stats as stats

# Observed data
observed = np.array([45, 35, 20])

# Expected proportions
expected_proportions = np.array([0.50, 0.30, 0.20])
expected = expected_proportions * np.sum(observed)

# Perform Chi-Square Test of Goodness of Fit
chi2_stat, p_value = stats.chisquare(observed, expected)

print("Chi-Square Statistic:", chi2_stat)
print("P-Value:", p_value)

# Interpretation
if p_value < 0.05:
    print("The observed distribution significantly differs from the expected distribution.")
else:
    print("The observed distribution does not significantly differ from the expected distribution.")
```

### Explanation of the Code:

1. **Define Data:**
   - Set the observed frequencies and calculate the expected frequencies based on the given proportions.

2. **Perform the Chi-Square Test:**
   - Use `stats.chisquare()` to compare the observed and expected frequencies.

3. **Interpret Results:**
   - Compare the P-value with the significance level (e.g., 0.05) to determine if the observed distribution significantly differs from the expected distribution.
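
The statistic for this example is small enough to verify by hand. With 100 respondents, the expected counts are 50, 30, and 20, so:

\[
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
       = \frac{(45 - 50)^2}{50} + \frac{(35 - 30)^2}{30} + \frac{(20 - 20)^2}{20}
       = 0.5 + 0.833 + 0 \approx 1.333
\]

With \(k = 3\) categories there are \(k - 1 = 2\) degrees of freedom. The critical value at the 0.05 level for 2 degrees of freedom is about 5.991, so a statistic of 1.333 does not lead to rejecting the null hypothesis.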

### Summary

- **Chi-Square Test of Independence:** Determines if two categorical variables are independent of each other.
- **Chi-Square Test of Goodness of Fit:** Tests if a sample fits a known distribution.

Both tests are useful for analyzing categorical data and determining if there are significant differences between observed and expected frequencies or if there is an association between variables.

#### Prepared By,
Ahamed Basith