# Module 1. Introduction to Statistics  
### Question 1. It is assumed that there is no significant difference between boys and girls with respect to intelligence. Tests were conducted on two groups, with the following observations:

| Group | Mean | Standard deviation | Size |
|-------|------|--------------------|------|
| Girls | 89   | 4                  | 50   |
| Boys  | 82   | 9                  | 120  |

### Validate the claim at the 5% level of significance (LoS).
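
As a sketch of the calculation the code below carries out, and assuming both samples are large enough for the normal approximation to hold, the hypotheses and test statistic can be written out as follows. The subscripts 1 and 2 (girls and boys) and the symbols are notation introduced here for illustration, not taken from the question.

$$
H_0: \mu_1 = \mu_2 \qquad H_a: \mu_1 \neq \mu_2
$$

$$
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
  = \frac{89 - 82}{\sqrt{\dfrac{4^2}{50} + \dfrac{9^2}{120}}}
  = \frac{7}{\sqrt{0.995}} \approx 7.0176
$$

Because $|z|$ exceeds the two-sided critical value of about 1.96 at the 5% level, the null hypothesis of equal means is rejected.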

In [1]:
import numpy as np
from scipy import stats

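def two_sample_z_test(mean1, std1, n1, mean2, std2, n2, los):
    """Two-sided two-sample z-test for a difference in means, from summary statistics."""
    # Standard error of the difference between the two sample means
    standard_error = np.sqrt((std1**2 / n1) + (std2**2 / n2))

    # Test statistic under H0: the two population means are equal
    z_statistic = (mean1 - mean2) / standard_error

    # Two-sided p-value from the standard normal distribution
    p_value = 2 * stats.norm.cdf(-np.abs(z_statistic))

    # Reject H0 when the p-value falls below the level of significance
    if p_value < los:
        conclusion = "Reject the null hypothesis. There is a significant difference."
    else:
        conclusion = "Fail to reject the null hypothesis. There is no significant difference."

    return z_statistic, p_value, conclusion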

# Data for Girls (Group 1)
mean_girls = 89
std_girls = 4
n_girls = 50

# Data for Boys (Group 2)
mean_boys = 82
std_boys = 9
n_boys = 120

# Level of Significance
level_of_significance = 0.05

# --- Run the Test and Print Results ---
z_stat, p_val, conclusion = two_sample_z_test(
    mean_girls, std_girls, n_girls,
    mean_boys, std_boys, n_boys,
    level_of_significance
)

print("--- Statistical Test Results ---")
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_val:.15f}")
print(f"\nConclusion ({level_of_significance*100}% Level of Significance):")
print(conclusion)


--- Statistical Test Results ---
Z-statistic: 7.0176
P-value: 0.000000000002258

Conclusion (5.0% Level of Significance):
Reject the null hypothesis. There is a significant difference.
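
The rejection above can be cross-checked in two ways. The sketch below is a hedged illustration, not part of the original solution: it builds a 95% confidence interval for the difference in means and, assuming the reported standard deviations are sample estimates, runs Welch's t-test from the same summary statistics via `scipy.stats.ttest_ind_from_stats`. The variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# Summary statistics from the question (girls = group 1, boys = group 2)
mean_girls, std_girls, n_girls = 89, 4, 50
mean_boys, std_boys, n_boys = 82, 9, 120
alpha = 0.05

# Standard error of the difference in means (same formula as the z-test)
standard_error = np.sqrt(std_girls**2 / n_girls + std_boys**2 / n_boys)
mean_diff = mean_girls - mean_boys

# Two-sided critical value and confidence interval for the difference
z_critical = stats.norm.ppf(1 - alpha / 2)
ci_low = mean_diff - z_critical * standard_error
ci_high = mean_diff + z_critical * standard_error

# Welch's t-test from summary statistics (assumption: the standard
# deviations are sample estimates rather than known population values)
welch = stats.ttest_ind_from_stats(
    mean_girls, std_girls, n_girls,
    mean_boys, std_boys, n_boys,
    equal_var=False,
)

print(f"Critical z: {z_critical:.4f}")
print(f"95% CI for the difference: ({ci_low:.3f}, {ci_high:.3f})")
print(f"Welch t-statistic: {welch.statistic:.4f}")
print(f"Welch p-value: {welch.pvalue:.3e}")
```

The interval excludes zero, which agrees with rejecting the null hypothesis at the 5% level. Welch's statistic uses the same standard error, so it equals the z-statistic; only the reference distribution for the p-value differs.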


### Question 2. Analyze the data below and determine whether you can conclude that smoking causes cancer.

In [12]:
import numpy as np
from scipy.stats import chi2_contingency

def chi_square_test(data, los=0.05):
    """
    Performs a Chi-Square test for independence on a contingency table.

    Args:
        data (list of lists): The observed frequency data in a contingency table.
        los (float): The level of significance (e.g., 0.05 for 5%).

    Returns:
        str: A conclusion based on the test results.
    """
    # The null hypothesis (H0) is that smoking and cancer are independent (no association).
    # The alternative hypothesis (Ha) is that smoking and cancer are dependent (there is an association).

    # Convert the list of lists to a numpy array for chi2_contingency
    observed_table = np.array(data)

    # Perform the Chi-Square test
    # (for a 2x2 table, chi2_contingency applies Yates' continuity correction by default)
    chi2_stat, p_value, degrees_of_freedom, expected_table = chi2_contingency(observed_table)

    print("--- Chi-Square Test Results ---")
    print(f"Chi-Square Statistic: {chi2_stat:.4f}")
    print(f"P-value: {p_value:.15f}")
    print(f"Degrees of Freedom: {degrees_of_freedom}")
    print("\nObserved Frequencies:")
    print(observed_table)
    print("\nExpected Frequencies:")
    # Round the expected values for easier viewing
    print(np.round(expected_table, 2))

    # Compare the p-value to the level of significance to make a conclusion.
    print(f"\nComparing p-value with Level of Significance ({los}):")
    if p_value < los:
        conclusion = "Since the p-value is less than the level of significance, we reject the null hypothesis."
        print(conclusion)
        print("Conclusion: There is a statistically significant association between smoking and cancer.")
    else:
        conclusion = "Since the p-value is greater than the level of significance, we fail to reject the null hypothesis."
        print(conclusion)
        print("Conclusion: There is no statistically significant association between smoking and cancer.")

    return conclusion

# --- Provided Data ---
# The contingency table with observed frequencies
# [[Smokers with Cancer, Smokers without Cancer],
#  [Non-Smokers with Cancer, Non-Smokers without Cancer]]
observed_data = [
    [220, 230], # Smokers
    [350, 640]  # Non-Smokers
]

# Note: The marginal totals stated in the question are inconsistent with the cell counts.
# Totals implied by the 2x2 table:
# Smokers: 220 + 230 = 450 (not 550)
# Non-Smokers: 350 + 640 = 990
# Cancer: 220 + 350 = 570 (not 680)
# Without Cancer: 230 + 640 = 870 (not 910)
# Total: 450 + 990 = 1440
# The chi-square test uses the 2x2 cell counts as the observed data and ignores the stated totals.

# Level of Significance
level_of_significance = 0.05

# --- Run the Test and Print Results ---
chi_square_test(observed_data, level_of_significance)


--- Chi-Square Test Results ---
Chi-Square Statistic: 23.1378
P-value: 0.000001507985465
Degrees of Freedom: 1

Observed Frequencies:
[[220 230]
 [350 640]]

Expected Frequencies:
[[178.12 271.88]
 [391.88 598.12]]

Comparing p-value with Level of Significance (0.05):
Since the p-value is less than the level of significance, we reject the null hypothesis.
Conclusion: There is a statistically significant association between smoking and cancer.


'Since the p-value is less than the level of significance, we reject the null hypothesis.'
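
To make the reported statistic easier to trace, the sketch below recomputes the expected frequencies and the chi-square statistic by hand from the same 2x2 table. It assumes SciPy's default behaviour of applying Yates' continuity correction to 2x2 tables, which is why the reported value is slightly lower than the uncorrected Pearson statistic. Variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[220, 230],   # Smokers: with cancer, without cancer
                     [350, 640]])  # Non-smokers: with cancer, without cancer

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square without and with Yates' continuity correction
chi2_plain = ((observed - expected) ** 2 / expected).sum()
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
p_yates = chi2.sf(chi2_yates, df=1)

# Reference values from SciPy for comparison
scipy_yates = chi2_contingency(observed)[0]
scipy_plain = chi2_contingency(observed, correction=False)[0]

print(f"By hand (uncorrected): {chi2_plain:.4f}  SciPy: {scipy_plain:.4f}")
print(f"By hand (Yates):       {chi2_yates:.4f}  SciPy: {scipy_yates:.4f}")
print(f"P-value (Yates):       {p_yates:.3e}")
```

Either version of the statistic leads to rejecting independence at the 5% level. The test supports an association only; a chi-square test on observational counts cannot by itself establish that smoking causes cancer.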