
In [None]:
# Hypothesis Testing Session 1: Introduction

# Import necessary libraries
import scipy.stats as stats
import numpy as np

In [None]:


"""
## Warm-Up: Relatable Scenario
Scenario: Testing if a new cafe recipe improves customer satisfaction.
Imagine you own a cafe and recently introduced a new cappuccino recipe. You believe it improves customer satisfaction, but how can you back that claim with data?
Hypothesis testing will help us determine if the observed improvement is statistically significant.
"""

print("Warm-Up Scenario:")
print("You own a cafe and want to test if a new recipe improves customer satisfaction.")



In [None]:
# Key Concepts
"""
## What is Hypothesis Testing?
Hypothesis testing is a statistical method used to decide whether there is enough evidence to reject a claim about a population (null hypothesis) based on sample data.
It allows us to make decisions using data rather than guesses.
"""

print("\nKey Concepts:")
print("1. Hypothesis testing helps make decisions using data rather than guesses.")


In [None]:

# Null and Alternative Hypotheses
"""
### Null and Alternative Hypotheses:
- Null Hypothesis (H0): Represents the current state or no effect. Example: "The new recipe does not improve satisfaction."
- Alternative Hypothesis (Ha): Represents the claim we want to test. Example: "The new recipe improves satisfaction."
"""

H0 = "The new recipe does not improve satisfaction."
Ha = "The new recipe improves satisfaction."
print("\nHypotheses:")
print(f"Null Hypothesis (H0): {H0}")
print(f"Alternative Hypothesis (Ha): {Ha}")

In [None]:
# p-Value and Significance Level
"""
### p-Value:
- The p-value represents the probability of observing the data (or something more extreme) if the null hypothesis (H0) is true.
- A smaller p-value indicates stronger evidence against H0.

### Significance Level (alpha):
- The threshold for rejecting H0, often set at 0.05.
- If p-value <= alpha: Reject H0.
- If p-value > alpha: Fail to reject H0.
"""

alpha = 0.05
print(f"\nSignificance Level (alpha): {alpha}")
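
To make the p-value definition above concrete, here is a minimal sketch that turns a test statistic into one-sided and two-sided p-values with the t distribution's survival function. The statistic `t_obs` and the degrees of freedom `df` are made-up values chosen for illustration; they are not results from this notebook.

```python
import scipy.stats as stats

alpha = 0.05   # significance level, as in the cell above
t_obs = 2.1    # hypothetical observed t-statistic (illustrative only)
df = 9         # hypothetical degrees of freedom (e.g. a sample of 10)

# One-sided p-value: P(T >= t_obs) when H0 is true
p_one_sided = stats.t.sf(t_obs, df)

# Two-sided p-value: P(|T| >= |t_obs|) when H0 is true
p_two_sided = 2 * stats.t.sf(abs(t_obs), df)

for label, p in [("one-sided", p_one_sided), ("two-sided", p_two_sided)]:
    decision = "Reject H0" if p <= alpha else "Fail to reject H0"
    print(f"{label}: p = {p:.4f} -> {decision}")
```

With these illustrative numbers the one-sided test rejects H0 while the two-sided test does not, which is why the direction of Ha should be fixed before looking at the data.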


In [None]:

# Step-by-Step Process
"""
## Steps in Hypothesis Testing:
1. Formulate Hypotheses: Define H0 and Ha.
2. Set Significance Level: Choose an appropriate alpha value (e.g., 0.05).
3. Collect Data: Gather sample data.
4. Choose a Statistical Test: Select based on data type and study design.
5. Perform the Test and Calculate p-Value.
6. Make a Decision: Compare p-value to alpha.
"""

print("\nSteps in Hypothesis Testing:")
steps = [
    "1. Formulate Hypotheses",
    "2. Set Significance Level",
    "3. Collect Data",
    "4. Choose a Statistical Test",
    "5. Perform the Test and Calculate p-Value",
    "6. Make a Decision"
]
for step in steps:
    print(step)

In [None]:

# Interactive Example
"""
### Interactive Example:
You claim that the average satisfaction score for the new recipe is higher than 8. The scores from a sample of 10 customers are:
[7.8, 8.1, 8.4, 7.9, 8.5, 8.2, 8.3, 8.7, 7.6, 8.0]

We will test:
- Null Hypothesis (H0): The average satisfaction score is 8 (mu = 8).
- Alternative Hypothesis (Ha): The average satisfaction score is greater than 8 (mu > 8).
"""

# Sample data: satisfaction scores for the new recipe
data = [7.8, 8.1, 8.4, 7.9, 8.5, 8.2, 8.3, 8.7, 7.6, 8.0]
popmean = 8

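# Performing a one-sample t-test. Ha is mu > 8, so the test is one-sided
# (the `alternative` argument requires SciPy 1.6 or later).
t_stat, p_value = stats.ttest_1samp(data, popmean=popmean, alternative='greater')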

print("\nInteractive Example:")
print(f"Sample Data: {data}")
print(f"Population Mean (H0): {popmean}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")


In [None]:


# Decision based on p-value
if p_value <= alpha:
    print("Decision: Reject H0 (evidence that the mean satisfaction score exceeds 8)")
else:
    print("Decision: Fail to reject H0 (insufficient evidence that the mean score exceeds 8)")

# Explanation
"""
### Explanation of Results:
- The t-statistic measures how far the sample mean lies from the hypothesized population mean, in units of the standard error.
- The p-value helps us decide if the observed difference is statistically significant.
- If p-value <= alpha (0.05), we reject the null hypothesis and support the alternative hypothesis.
"""

print("\nExplanation: The p-value guides our decision to reject or fail to reject the null hypothesis.")

# Summary
"""
## Summary:
- Hypothesis testing helps make data-driven decisions.
- Null and alternative hypotheses form the foundation.
- Comparing the p-value with alpha determines whether to reject or fail to reject H0.
"""
summary = [
    "Hypothesis testing helps make data-driven decisions.",
    "Null and alternative hypotheses are the foundation.",
    "Comparing the p-value with alpha determines whether to reject or fail to reject H0."
]
print("\nSummary:")
for point in summary:
    print(f"- {point}")

# End of Session
print("\nQ&A Session: Ask your questions!")
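
The explanation above says the t-statistic expresses the gap between the sample mean and the hypothesized mean in standard errors. The sketch below recomputes it by hand for the satisfaction scores and cross-checks against `stats.ttest_1samp`. The variable names (`t_manual`, `p_manual`, and so on) are introduced here for illustration only.

```python
import numpy as np
import scipy.stats as stats

data = np.array([7.8, 8.1, 8.4, 7.9, 8.5, 8.2, 8.3, 8.7, 7.6, 8.0])
popmean = 8

n = data.size
sample_mean = data.mean()
sample_std = data.std(ddof=1)              # sample standard deviation (n - 1 denominator)
standard_error = sample_std / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - popmean) / standard_error

# One-sided p-value for Ha: mu > 8, using n - 1 degrees of freedom
p_manual = stats.t.sf(t_manual, df=n - 1)

# Cross-check against SciPy's default two-sided test
t_scipy, p_two_sided = stats.ttest_1samp(data, popmean=popmean)

print(f"Manual t: {t_manual:.4f}, SciPy t: {t_scipy:.4f}")
print(f"One-sided p: {p_manual:.4f}, two-sided p: {p_two_sided:.4f}")
```

Because the t distribution is symmetric and the statistic is positive, the one-sided p-value is half of the two-sided one.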


In [None]:
# Session 2: Introduction to t-Tests

# Goal: Learners will understand the types of t-tests (one-sample, two-sample, and paired),
# when to use them, and how to perform them using Python.

# 1. Warm-Up: Recap and Motivation
# Quick Recap:
print("Quick Recap:")
print("1. A null hypothesis (H0) is a statement of no effect or difference.")
print("2. A p-value is the probability of observing data at least as extreme as the sample, assuming H0 is true.")

# Motivation Example:
print("Motivation Example:")
print("1. One-sample t-test: Is the average weight loss in your program different from 5 kg?")
print("2. Two-sample t-test: Is the average weight loss in your program better than the competitor's?")
print("3. Paired t-test: Did participants lose weight after completing the program?")

# 2. Key Concepts
print("\nKey Concepts:")
print("- A t-test compares means and assesses whether an observed difference is larger than chance variation alone would plausibly produce.")
print("- Types of t-tests: One-sample, Two-sample, and Paired.")

# Assumptions of t-tests
print("\nAssumptions:")
print("1. Data are approximately normally distributed.")
print("2. Groups have similar variances (for the standard two-sample t-test; Welch's test with equal_var=False relaxes this).")
print("3. Observations are independent (in paired t-tests, the pairs are independent of one another).\n")

# Walkthrough Examples

# Example A: One-Sample t-Test
import scipy.stats as stats

print("\n--- Example A: One-Sample t-Test ---")
data = [4.8, 5.2, 5.1, 4.9, 5.3, 4.7, 5.0, 5.1]
null_hypothesis_mean = 5  # benchmark value
t_stat, p_value = stats.ttest_1samp(data, popmean=null_hypothesis_mean)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("Conclusion: Reject the null hypothesis. The mean is significantly different from 5.")
else:
    print("Conclusion: Fail to reject the null hypothesis. No significant difference.")

# Example B: Two-Sample t-Test
print("\n--- Example B: Two-Sample t-Test ---")
your_program = [5.2, 5.3, 5.1, 5.4, 5.0]
competitor_program = [4.8, 4.9, 4.7, 5.0, 4.6]
t_stat, p_value = stats.ttest_ind(your_program, competitor_program, alternative='greater')
print(f"T-statistic: {t_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("Conclusion: Reject the null hypothesis. Your program performs better.")
else:
    print("Conclusion: Fail to reject the null hypothesis. Insufficient evidence that your program performs better.")

# Example C: Paired t-Test
print("\n--- Example C: Paired t-Test ---")
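# Note: every before/after difference in this toy data is exactly 2, so the
# differences have zero variance and SciPy reports an infinite t-statistic with
# p = 0 (it may also warn about precision loss). Real measurements will vary.
before = [70, 72, 68, 74, 71]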
after = [68, 70, 66, 72, 69]
t_stat, p_value = stats.ttest_rel(before, after, alternative='greater')
print(f"T-statistic: {t_stat}, P-value: {p_value}")
if p_value < 0.05:
    print("Conclusion: Reject the null hypothesis. Participants lost weight after the program.")
else:
    print("Conclusion: Fail to reject the null hypothesis. No significant weight loss.")

# Group Activity
print("\n--- Group Activity ---")
print("Task: Perform a t-test using a dataset provided. Steps:")
print("1. Formulate hypotheses.")
print("2. Perform the test in Python.")
print("3. Present findings and decisions.")

# Summary
print("\nSummary:")
print("1. t-tests compare means to identify significant differences.")
print("2. Choose the appropriate type (one-sample, two-sample, paired) based on the question.")
print("3. Python's scipy.stats library makes t-tests straightforward to perform.")
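
The assumptions listed in this session can be checked roughly before running a two-sample test. The sketch below reuses the Example B data and applies Shapiro-Wilk (normality) and Levene (equal variances), then compares Student's and Welch's t-tests. These checks are an addition for illustration, not part of the original walkthrough, and with only five observations per group they have very little power.

```python
import scipy.stats as stats

your_program = [5.2, 5.3, 5.1, 5.4, 5.0]
competitor_program = [4.8, 4.9, 4.7, 5.0, 4.6]

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution
for name, sample in [("your_program", your_program),
                     ("competitor_program", competitor_program)]:
    w_stat, p_norm = stats.shapiro(sample)
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_norm:.3f}")

# Levene: H0 is that the groups have equal variances
lev_stat, p_var = stats.levene(your_program, competitor_program)
print(f"Levene statistic = {lev_stat:.3f}, p = {p_var:.3f}")

# Student's t-test assumes equal variances; Welch's (equal_var=False) does not
t_student, p_student = stats.ttest_ind(
    your_program, competitor_program, alternative='greater')
t_welch, p_welch = stats.ttest_ind(
    your_program, competitor_program, equal_var=False, alternative='greater')
print(f"Student: t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4f}")
```

In this data the two groups have the same spread and the same size, so the two t-statistics coincide. When variances or sample sizes differ, Welch's version is usually the safer default.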


In [None]:
# Session 3: Introduction to Chi-Square Tests

# Goal: Understand the Chi-square test for independence and goodness of fit,
# when to use them, and how to perform these tests in Python.

### 1. Warm-Up: Relatable Scenario
"""
Scenario: Imagine you're studying whether there is a relationship between gender (male/female) and whether people prefer coffee or tea.
Question: Does gender influence drink preference (coffee or tea)?
Challenge: How do we test if the two variables are related or independent?

This introduces the Chi-square test for independence.
"""

# 2. Key Concepts
## A. What is the Chi-Square Test?
"""
The Chi-square test is a statistical method to examine if there’s a significant association between categorical variables.
It compares the expected frequency of observations with the actual observed frequency.

Types:
1. Goodness of Fit: Compares observed vs. expected distribution of a single variable.
2. Independence: Checks the relationship between two variables.
"""

## Assumptions:
"""
1. Data consists of counts/frequencies.
2. Observations are independent.
3. Expected frequency for each category is at least 5.
"""
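
The two Chi-square tests described in this session can be sketched with `scipy.stats.chi2_contingency` (independence) and `scipy.stats.chisquare` (goodness of fit). All counts below are invented for illustration: the notebook provides no data for this session, so both the gender/drink table and the die-roll example are assumptions.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical contingency table (counts invented for illustration)
# Rows: gender (male, female); columns: preference (coffee, tea)
observed = np.array([[30, 20],
                     [20, 30]])

# correction=False turns off Yates' continuity correction, so the statistic
# matches the textbook formula sum((O - E)^2 / E)
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"Independence: chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
print("Expected counts under independence:")
print(expected)

# Goodness of fit: hypothetical counts from 120 die rolls, H0: the die is fair.
# With no f_exp given, chisquare assumes equal expected counts per category.
rolls = [18, 22, 20, 25, 15, 20]
gof_stat, gof_p = stats.chisquare(rolls)
print(f"Goodness of fit: chi2 = {gof_stat:.3f}, p = {gof_p:.4f}")
```

Inspecting `expected` is also a quick way to check the rule of thumb that every expected count should be at least 5.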

In [None]:
### Introduction to Z-Tests
# Goal: Understand Z-tests, when to use them, and how to perform them using Python.

# Import necessary libraries
import numpy as np
import scipy.stats as stats

# --- Section 1: Warm-Up Scenario ---
"""
Imagine you're a quality control officer at a factory that produces light bulbs.
The company claims their light bulbs last an average of 1,000 hours.
You want to test if the actual average lifespan is different from this claimed value based on a sample of 50 bulbs.

Question: Is the average lifespan of the light bulbs actually 1,000 hours, or is it significantly different?
"""

# --- Section 2: Key Concepts ---
"""
What is a Z-Test?
- A statistical test used to determine if there is a significant difference between:
  1. The sample mean and the population mean (One-Sample Z-Test).
  2. The means of two independent samples (Two-Sample Z-Test).

Assumptions of Z-Tests:
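1. The sample mean is approximately normally distributed (the data are normal, or the sample is large enough for the central limit theorem to apply).
2. The population standard deviation is known.
3. The sample size is large enough (a common rule of thumb is n > 30).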
"""

# --- Section 3: Performing One-Sample Z-Test ---
"""
Scenario: Test if the average lifespan of light bulbs is 1,000 hours.
Given:
- Sample size (n) = 50
- Sample mean (x̄) = 1020
- Population mean (μ) = 1000
- Population standard deviation (σ) = 100
"""

# Given data for One-Sample Z-Test
sample_mean = 1020
population_mean = 1000
population_std = 100
sample_size = 50

# Calculate Z-statistic
z_stat = (sample_mean - population_mean) / (population_std / np.sqrt(sample_size))

# Calculate p-value (two-tailed test)
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

# Display results
print(f"One-Sample Z-Test Results:")
print(f"Z-Statistic: {z_stat}")
print(f"P-value: {p_value}")

# --- Section 4: Performing Two-Sample Z-Test ---
"""
Scenario: Compare the average lifespan of light bulbs from two factories.
Factory 1:
- Mean (x̄1) = 1020
- Standard Deviation (σ1) = 100
- Sample Size (n1) = 50

Factory 2:
- Mean (x̄2) = 1030
- Standard Deviation (σ2) = 95
- Sample Size (n2) = 60
"""

# Given data for Two-Sample Z-Test
mean1 = 1020
std1 = 100
n1 = 50

mean2 = 1030
std2 = 95
n2 = 60

# Calculate Z-statistic
z_stat_two_sample = (mean1 - mean2) / np.sqrt((std1**2 / n1) + (std2**2 / n2))

# Calculate p-value (two-tailed test)
p_value_two_sample = 2 * (1 - stats.norm.cdf(abs(z_stat_two_sample)))

# Display results
print(f"\nTwo-Sample Z-Test Results:")
print(f"Z-Statistic: {z_stat_two_sample}")
print(f"P-value: {p_value_two_sample}")

# --- Section 5: Group Activity ---
"""
Activity:
1. Test if the average weight of apples is 150g based on a sample of 100 apples.
2. Compare the average heights of two groups of students from different schools.
Instructions:
- Formulate hypotheses.
- Perform the Z-test in Python.
- Interpret and present findings.
"""

# Placeholder for group activity (students can fill with their own data)
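def one_sample_z_test(sample_mean, population_mean, population_std, sample_size):
    """Return the z-statistic and two-tailed p-value for a one-sample Z-test."""
    z_stat = (sample_mean - population_mean) / (population_std / np.sqrt(sample_size))
    # sf(x) equals 1 - cdf(x) but keeps precision for very small tail probabilities
    p_value = 2 * stats.norm.sf(abs(z_stat))
    return z_stat, p_value

def two_sample_z_test(mean1, mean2, std1, std2, n1, n2):
    """Return the z-statistic and two-tailed p-value for a two-sample Z-test."""
    z_stat = (mean1 - mean2) / np.sqrt((std1**2 / n1) + (std2**2 / n2))
    p_value = 2 * stats.norm.sf(abs(z_stat))
    return z_stat, p_value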

# Example usage
# Uncomment the lines below to test with your own data
# z_stat, p_value = one_sample_z_test(sample_mean=155, population_mean=150, population_std=10, sample_size=100)
# print(f"Z-Statistic: {z_stat}, P-value: {p_value}")

# --- Section 6: Summary ---
"""
Key Takeaways:
- Z-tests are useful for comparing a sample mean to a known population mean, or the means of two independent samples.
- Reject the null hypothesis (H0) if the p-value <= alpha (commonly 0.05).
"""
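
The p-value rule in the takeaways has an equivalent critical-value form. As a sketch, the light bulb example can be redone by comparing |z| with the two-tailed critical value from `stats.norm.ppf` and by building the matching confidence interval. The choice `alpha = 0.05` is an assumption here, since the Z-test walkthrough does not set it explicitly.

```python
import numpy as np
import scipy.stats as stats

sample_mean = 1020
population_mean = 1000
population_std = 100
sample_size = 50
alpha = 0.05  # assumed significance level

standard_error = population_std / np.sqrt(sample_size)
z_stat = (sample_mean - population_mean) / standard_error

# Two-tailed critical value: reject H0 when |z| exceeds it
z_crit = stats.norm.ppf(1 - alpha / 2)
reject = abs(z_stat) > z_crit

# Matching (1 - alpha) confidence interval for the population mean
ci_low = sample_mean - z_crit * standard_error
ci_high = sample_mean + z_crit * standard_error

print(f"z = {z_stat:.4f}, critical value = {z_crit:.4f}, reject H0: {reject}")
print(f"{100 * (1 - alpha):.0f}% confidence interval: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval contains the claimed 1,000 hours exactly when the two-tailed test fails to reject H0, so both views lead to the same decision.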


In [1]:
import scipy.stats as stats
import numpy as np

data = [7.8, 8.1, 8.4, 7.9, 8.5, 8.2, 8.3, 8.7, 7.6, 8.0]
t_stat, p_value = stats.ttest_1samp(data, popmean=8)
print(f"T-statistic: {t_stat}, P-value: {p_value}")


T-statistic: 1.4055638569974578, P-value: 0.19342205960333114


In [4]:
import scipy.stats as stats

# Sample data
before = [310, 320, 305, 315, 300, 325, 330, 340]
after = [300, 315, 295, 310, 290, 315, 320, 330]

# Set the significance level (alpha)
alpha = 0.05

# Perform the paired t-test
t_stat, p_value = stats.ttest_rel(before, after)

# Print the results
print("t-statistic:", t_stat)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis. Performance differs significantly before and after the diet.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in performance before and after the diet.")

t-statistic: 10.692676621563628
p-value: 1.3735119173347963e-05
Reject the null hypothesis. Performance differs significantly before and after the diet.
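
A paired t-test is equivalent to a one-sample t-test on the within-pair differences, tested against a mean of zero. The sketch below checks this on the same before/after data. The name `differences` is introduced here for illustration.

```python
import numpy as np
import scipy.stats as stats

before = np.array([310, 320, 305, 315, 300, 325, 330, 340])
after = np.array([300, 315, 295, 310, 290, 315, 320, 330])

# Within-pair differences; H0 says their mean is 0
differences = before - after

t_rel, p_rel = stats.ttest_rel(before, after)
t_one, p_one = stats.ttest_1samp(differences, popmean=0)

print(f"Mean difference (before - after): {differences.mean():.2f}")
print(f"Paired test:        t = {t_rel:.4f}, p = {p_rel:.6f}")
print(f"One-sample on diff: t = {t_one:.4f}, p = {p_one:.6f}")
```

The sign of the mean difference shows the direction of the change, which the two-sided p-value alone does not.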


In [5]:
import numpy as np
from scipy import stats

# Given data
scores = np.array([70, 74, 78, 65, 80, 72, 68, 77, 73, 69])

# Hypothesis test
# Null hypothesis: The average score is 75
# Alternative hypothesis: The average score is different from 75

t_statistic, p_value = stats.ttest_1samp(scores, 75)

# Display results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Decision at 0.05 significance level
if p_value < 0.05:
    print("Reject the null hypothesis: The average score differs from 75.")
else:
    print("Fail to reject the null hypothesis: insufficient evidence that the average score differs from 75.")


T-statistic: -1.5925462387337155
P-value: 0.14572598114293872
Fail to reject the null hypothesis: insufficient evidence that the average score differs from 75.


In [3]:
p_value <= 0.05

False

In [6]:
import scipy.stats as stats

# Summary statistics (sample size, mean, standard deviation) for Office A and Office B
n1 = 8
mean1 = 82
sd1 = 5

n2 = 8
mean2 = 78
sd2 = 6

# Set the significance level (alpha)
alpha = 0.05

# Perform Welch's two-sample t-test from summary statistics
# (equal_var=False does not assume equal variances)
t_stat, p_value = stats.ttest_ind_from_stats(mean1=mean1, std1=sd1, nobs1=n1,
                                            mean2=mean2, std2=sd2, nobs2=n2,
                                            equal_var=False)

# Print the results
print("t-statistic:", t_stat)
print("p-value:", p_value)

# Compare p-value with alpha to make a decision
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in productivity between the two offices.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in productivity between the two offices.")

t-statistic: 1.4485719366802965
p-value: 0.17018632819073165
Fail to reject the null hypothesis. There is no significant difference in productivity between the two offices.
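
For reference, the Welch statistic used with `equal_var=False` can be recomputed by hand from the same summary statistics, with the degrees of freedom given by the Welch-Satterthwaite approximation. This is a sketch for checking the numbers; the names `v1`, `v2`, `t_manual`, `df_welch`, and `p_manual` are introduced here for illustration.

```python
import numpy as np
import scipy.stats as stats

n1, mean1, sd1 = 8, 82, 5   # Office A
n2, mean2, sd2 = 8, 78, 6   # Office B

# Squared standard error contributed by each group
v1 = sd1**2 / n1
v2 = sd2**2 / n2

# Welch t-statistic: difference in means over the combined standard error
t_manual = (mean1 - mean2) / np.sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

print(f"t = {t_manual:.4f}, df = {df_welch:.2f}, p = {p_manual:.4f}")
```

The degrees of freedom come out fractional and somewhat below n1 + n2 - 2 = 14, which is how Welch's test accounts for the unequal standard deviations.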
