# Lab 8: Chi-Square Test
### Author: Aimal Khan (aimalexe)
### Objective
Perform Chi-Square goodness-of-fit and independence tests on problems taken from the lecture slides.

In [10]:
# import required libraries
import numpy as np
from scipy.stats import chi2

## **Function: chi_square_goodness_of_fit**

### **Description**
Performs a Chi-Square Goodness of Fit test to determine if the observed frequencies match the expected frequencies across categories. This function calculates the Chi-Square test statistic, degrees of freedom, p-value, and critical value to assess the null hypothesis.

---

### **Parameters**
- **`observed_frequencies` (list)**: A list of observed frequencies for each category.
- **`expected_frequencies` (list)**: A list of expected frequencies for each category.
- **`alpha` (float, optional)**: Significance level for hypothesis testing (default: `0.05`).

---

### **Returns**
A dictionary containing:
- **`Observed Frequencies`**: List of observed frequencies.
- **`Expected Frequencies`**: List of expected frequencies.
- **`Chi-Square Statistic`**: The computed Chi-Square test statistic.
- **`Degrees of Freedom`**: The number of degrees of freedom for the test.
- **`P-value`**: The probability, under the null hypothesis, of observing a test statistic at least as large as the one computed.
- **`Chi-Square Critical Value`**: The critical value of the Chi-Square statistic at the given significance level.
- **`Conclusion`**: Outcome of the hypothesis test.

---

### **Steps and Formulas**

1. **Validate Input**:
   - Ensure the observed and expected frequencies are of the same length.

2. **Compute Chi-Square Test Statistic** ($\chi^2$):
   $$
   \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
   $$
   where $O_i$ and $E_i$ are the observed and expected frequencies for category $i$.

3. **Degrees of Freedom**:
   $$
   \text{Degrees of Freedom} = \text{Number of Categories} - 1
   $$

4. **Compute p-value**:
   Using the Chi-Square cumulative distribution function (CDF):
   $$
   p = 1 - F_{\chi^2}(\chi^2, \text{Degrees of Freedom})
   $$

5. **Determine Chi-Square Critical Value**:
   Using the inverse of the Chi-Square distribution:
   $$
   \chi^2_{\text{Critical}} = F_{\chi^2}^{-1}(1 - \alpha, \text{Degrees of Freedom})
   $$

6. **Hypothesis Test Conclusion**:
   - Reject $H_0$ if $\chi^2 > \chi^2_{\text{Critical}}$ (equivalently, if $p < \alpha$).
   - Fail to reject $H_0$ otherwise.
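Steps 2 to 6 can be traced by hand on a small example. The sketch below uses made-up counts (a die rolled 60 times), not data from the lecture slides. It also uses `chi2.sf`, SciPy's survival function, which equals `1 - chi2.cdf` but tends to be more accurate when the p-value is very small.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical data: 60 rolls of a die, so a fair die predicts 10 per face
observed = np.array([8, 12, 9, 11, 10, 10])
expected = np.full(6, 10.0)

# Step 2: test statistic, sum of (O - E)^2 / E
statistic = np.sum((observed - expected) ** 2 / expected)

# Step 3: degrees of freedom = number of categories - 1
dof = len(observed) - 1

# Step 4: p-value from the upper tail of the Chi-Square distribution
p_value = chi2.sf(statistic, dof)

# Step 5: critical value at alpha = 0.05
critical = chi2.ppf(1 - 0.05, dof)

# Step 6: decision rule
reject = statistic > critical
print(statistic, dof, p_value, critical, reject)
```

Here the statistic is (4 + 4 + 1 + 1 + 0 + 0) / 10 = 1.0 with 5 degrees of freedom, well below the critical value, so $H_0$ is not rejected.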


### **Notes**
- **Null Hypothesis ($H_0$)**: The observed frequencies match the expected frequencies.
- **Alternative Hypothesis ($H_1$)**: The observed frequencies do not match the expected frequencies.
- This test is commonly used for categorical data analysis. Ensure the expected frequencies are not too small (each category’s expected value should ideally be ≥ 5).
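The input conditions mentioned above can be checked before running the test. The helper below is a hypothetical sketch (its name and the `min_expected` parameter are not part of this lab). It also checks that the observed and expected totals agree, which the goodness-of-fit formula implicitly assumes.

```python
import numpy as np

def check_expected_frequencies(observed, expected, min_expected=5):
    """Hypothetical helper: return a list of problems found in the inputs."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    problems = []
    if np.any(expected <= 0):
        # A zero expected count would cause a division by zero in the statistic
        problems.append("expected frequencies must be positive")
    if np.any(expected < min_expected):
        problems.append(f"some expected frequencies are below {min_expected}")
    if not np.isclose(observed.sum(), expected.sum()):
        problems.append("observed and expected totals differ")
    return problems

ok = check_expected_frequencies([50, 30, 30, 10], [30, 30, 30, 30])
bad = check_expected_frequencies([3, 7], [2, 9])
print(ok)
print(bad)
```

An empty list means none of the checks failed; the second call reports both a small expected count and mismatched totals.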

In [11]:
def chi_square_goodness_of_fit(observed_frequencies, expected_frequencies, alpha=0.05):
    """
    Perform Chi-Square Goodness of Fit test step-by-step.

    Parameters:
    observed_frequencies (list): A list of observed frequencies for each category.
    expected_frequencies (list): A list of expected frequencies for each category.
    alpha (float): Significance level for hypothesis testing (default: 0.05).

    Returns:
    dict: The observed and expected frequencies, test statistic, degrees of freedom,
          p-value, critical value, and hypothesis test conclusion.
    """
    # Step 1: Validate Input
    if len(observed_frequencies) != len(expected_frequencies):
        raise ValueError("Observed and expected frequencies must have the same length.")

    # Step 2: Compute the Chi-Square Test Statistic
    chi_square_statistic = sum((o - e) ** 2 / e for o, e in zip(observed_frequencies, expected_frequencies))

    # Step 3: Degrees of Freedom
    degrees_of_freedom = len(observed_frequencies) - 1

    # Step 4: Compute the p-value
    p_value = 1 - chi2.cdf(chi_square_statistic, degrees_of_freedom)

    # Step 5: Determine the Chi-Square Critical Value
    chi_square_critical = chi2.ppf(1 - alpha, degrees_of_freedom)

    # Step 6: Hypothesis Test Conclusion
    if chi_square_statistic > chi_square_critical:
        conclusion = "Reject H0: The observed frequencies do not match the expected frequencies."
    else:
        conclusion = "Fail to Reject H0: The observed frequencies match the expected frequencies."

    # Step 7: Prepare Detailed Results
    results = {
        "Observed Frequencies": observed_frequencies,
        "Expected Frequencies": expected_frequencies,
        "Chi-Square Statistic": chi_square_statistic,
        "Degrees of Freedom": degrees_of_freedom,
        "P-value": p_value,
        "Chi-Square Critical Value": chi_square_critical,
        "Conclusion": conclusion
    }

    return results

## **Function: chi_square_independence_test**

### **Description**
Performs a Chi-Square Test for Independence to determine if two categorical variables are independent. This function calculates the Chi-Square test statistic, expected frequencies, degrees of freedom, p-value, and critical value to evaluate the null hypothesis.

---

### **Parameters**
- **`observed_table` (2D list or numpy array)**: A contingency table of observed frequencies, where rows represent one categorical variable and columns represent the other.
- **`alpha` (float, optional)**: Significance level for hypothesis testing (default: `0.05`).

---

### **Returns**
A dictionary containing:
- **`Observed Table`**: The input table of observed frequencies.
- **`Expected Table`**: The table of expected frequencies computed under the null hypothesis.
- **`Chi-Square Statistic`**: The calculated Chi-Square test statistic.
- **`Degrees of Freedom`**: The degrees of freedom for the test.
- **`P-value`**: The probability, under the null hypothesis, of observing a test statistic at least as large as the one computed.
- **`Chi-Square Critical Value`**: The critical value of the Chi-Square statistic at the given significance level.
- **`Conclusion`**: Outcome of the hypothesis test.

---

### **Steps and Formulas**

1. **Convert to NumPy Array**:
   Convert the input contingency table to a NumPy array for efficient computation.

2. **Calculate Row and Column Totals**:
   $$
   \text{Row Totals: } R_i = \sum_j O_{ij}
   $$
   $$
   \text{Column Totals: } C_j = \sum_i O_{ij}
   $$
   $$
   \text{Grand Total: } G = \sum_i \sum_j O_{ij}
   $$

3. **Calculate Expected Frequencies**:
   Under the null hypothesis, the expected frequency for each cell is:
   $$
   E_{ij} = \frac{R_i \times C_j}{G}
   $$

4. **Compute Chi-Square Test Statistic** ($\chi^2$):
   $$
   \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
   $$

5. **Degrees of Freedom**:
   $$
   \text{Degrees of Freedom} = (\text{Number of Rows} - 1) \times (\text{Number of Columns} - 1)
   $$

6. **Compute p-value**:
   Using the Chi-Square cumulative distribution function (CDF):
   $$
   p = 1 - F_{\chi^2}(\chi^2, \text{Degrees of Freedom})
   $$

7. **Determine Chi-Square Critical Value**:
   Using the inverse of the Chi-Square distribution:
   $$
   \chi^2_{\text{Critical}} = F_{\chi^2}^{-1}(1 - \alpha, \text{Degrees of Freedom})
   $$

8. **Hypothesis Test Conclusion**:
   - Reject $H_0$ if $\chi^2 > \chi^2_{\text{Critical}}$ (equivalently, if $p < \alpha$).
   - Fail to reject $H_0$ otherwise.
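To make steps 2 to 5 concrete, here is a minimal sketch on a hypothetical 2x2 table (the counts are made up for illustration). `np.outer` builds every product $R_i \times C_j$ at once, so dividing by the grand total gives the full expected table.

```python
import numpy as np

# Hypothetical 2x2 contingency table
observed = np.array([[10, 20],
                     [30, 40]])

row_totals = observed.sum(axis=1)      # R_i -> [30, 70]
column_totals = observed.sum(axis=0)   # C_j -> [40, 60]
grand_total = observed.sum()           # G   -> 100

# E_ij = R_i * C_j / G
expected = np.outer(row_totals, column_totals) / grand_total

# Chi-Square statistic and degrees of freedom
statistic = np.sum((observed - expected) ** 2 / expected)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(statistic, dof)
```

The expected table is `[[12, 18], [28, 42]]`, every cell differs from its expectation by 2, and the statistic is 4/12 + 4/18 + 4/28 + 4/42, roughly 0.794, with 1 degree of freedom.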

### **Notes**
- **Null Hypothesis ($H_0$)**: The two categorical variables are independent.
- **Alternative Hypothesis ($H_1$)**: The two categorical variables are not independent.
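- The Chi-Square approximation is reliable only when the expected frequencies are not too small (each cell's expected value should ideally be ≥ 5). If this condition is not met, consider using Fisher’s exact test, which is most commonly applied to 2×2 tables.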
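For a 2x2 table with small counts, SciPy provides `scipy.stats.fisher_exact`. The sketch below uses a hypothetical table (not from the lecture slides) purely to show the call; it is not part of the functions defined in this lab.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table whose expected counts are all below 5
small_table = [[3, 1],
               [1, 3]]

# Returns the sample odds ratio and the exact two-sided p-value
odds_ratio, p_value = fisher_exact(small_table, alternative="two-sided")
print(odds_ratio, p_value)
```

The decision rule is the same as before: reject $H_0$ if the p-value is below $\alpha$.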

In [12]:
def chi_square_independence_test(observed_table, alpha=0.05):
    """
    Perform Chi-Square Test for Independence step-by-step.

    Parameters:
    observed_table (2D list or numpy array): A contingency table of observed frequencies.
    alpha (float): Significance level for hypothesis testing (default: 0.05).

    Returns:
    dict: The observed and expected tables, test statistic, degrees of freedom,
          p-value, critical value, and hypothesis test conclusion.
    """
    # Step 1: Convert to NumPy array for easier computation
    observed_table = np.array(observed_table)

    # Step 2: Calculate row and column sums and the grand total
    row_totals = np.sum(observed_table, axis=1)
    column_totals = np.sum(observed_table, axis=0)
    grand_total = np.sum(observed_table)

    # Step 3: Calculate Expected Frequencies
    expected_table = np.outer(row_totals, column_totals) / grand_total

    # Step 4: Compute the Chi-Square Test Statistic
    chi_square_statistic = np.sum((observed_table - expected_table) ** 2 / expected_table)

    # Step 5: Degrees of Freedom
    rows, columns = observed_table.shape
    degrees_of_freedom = (rows - 1) * (columns - 1)

    # Step 6: Compute the p-value
    p_value = 1 - chi2.cdf(chi_square_statistic, degrees_of_freedom)

    # Step 7: Determine the Chi-Square Critical Value
    chi_square_critical = chi2.ppf(1 - alpha, degrees_of_freedom)

    # Step 8: Hypothesis Test Conclusion
    if chi_square_statistic > chi_square_critical:
        conclusion = "Reject H0: The variables are not independent."
    else:
        conclusion = "Fail to Reject H0: The variables are independent."

    # Step 9: Prepare Detailed Results
    results = {
        "Observed Table": observed_table,
        "Expected Table": expected_table,
        "Chi-Square Statistic": chi_square_statistic,
        "Degrees of Freedom": degrees_of_freedom,
        "P-value": p_value,
        "Chi-Square Critical Value": chi_square_critical,
        "Conclusion": conclusion
    }

    return results


## **Function: print_results**

### **Description**
Prints a results dictionary in a structured, readable format. NumPy arrays are printed on their own lines and nested dictionaries are indented, which suits test results or hierarchical data with mixed value types.

In [13]:
def print_results(results):
    """
    Prints a nested dictionary or results with NumPy arrays in a structured format.

    Parameters:
    results (dict): A dictionary containing key-value pairs where values may include nested dictionaries or NumPy arrays.

    Returns:
    None
    """
    for key, value in results.items():
        if isinstance(value, np.ndarray):
            print(f"{key}:\n{value}")
        elif isinstance(value, dict):
            print(f"{key}:")
            for sub_key, sub_value in value.items():
                print(f"  {sub_key}: {sub_value}")
        else:
            print(f"{key}: {value}")


## Task 1
Lecture 10, Slide 10

Is the frequency of balls with different colors equal in our bag?

In [14]:
# Observed and expected frequencies for the four ball colors
colorObserved = np.array([50,30,30,10])
colorExpected = np.array([30,30,30,30])

colorResult = chi_square_goodness_of_fit(colorObserved, colorExpected)
print_results(colorResult)

Observed Frequencies:
[50 30 30 10]
Expected Frequencies:
[30 30 30 30]
Chi-Square Statistic: 26.666666666666668
Degrees of Freedom: 3
P-value: 6.914913279310042e-06
Chi-Square Critical Value: 7.814727903251179
Conclusion: Reject H0: The observed frequencies do not match the expected frequencies.
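As a sanity check, the same test can be run with SciPy's built-in `scipy.stats.chisquare`. This is only a cross-check of the hand-written function, not a replacement for it.

```python
from scipy.stats import chisquare

# Same observed and expected counts as in Task 1
result = chisquare(f_obs=[50, 30, 30, 10], f_exp=[30, 30, 30, 30])

# The statistic and p-value should agree with the output above
print(result.statistic, result.pvalue)
```

Note that `chisquare` returns only the statistic and the p-value; the critical value and the conclusion string are specific to the function written in this lab.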


## Task 2
Lecture 10, Slide 26

The test for independence applies when we have two or more sets of categorical data (the independent and dependent variables are both categorical). Is there a significant effect of gender on vote preference?

In [15]:
votePreferenceObserved = [
    [10, 50, 35],  # Male
    [15, 60, 40]   # Female
]

votePreferenceObservedResult = chi_square_independence_test(votePreferenceObserved)
print_results(votePreferenceObservedResult)

Observed Table:
[[10 50 35]
 [15 60 40]]
Expected Table:
[[11.30952381 49.76190476 33.92857143]
 [13.69047619 60.23809524 41.07142857]]
Chi-Square Statistic: 0.3407530684418557
Degrees of Freedom: 2
P-value: 0.843347207721
Chi-Square Critical Value: 5.991464547107979
Conclusion: Fail to Reject H0: The variables are independent.
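The result can be cross-checked against `scipy.stats.chi2_contingency`, which computes the expected table, statistic, degrees of freedom, and p-value in one call. A minimal sketch, using the same vote-preference counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

vote_table = np.array([[10, 50, 35],   # Male
                       [15, 60, 40]])  # Female

# correction=False disables Yates' continuity correction; it only
# matters for 2x2 tables but is set here to mirror the manual formula
statistic, p_value, dof, expected = chi2_contingency(vote_table, correction=False)
print(statistic, p_value, dof)
print(expected)
```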


## Task 3
Lecture 10, Slide 34

Does hand preference (left or right) depend on gender (male or female)? Of 120 females, 12 were left-handed; of 180 males, 24 were left-handed.

$H_0$: The variables are independent.

In [16]:
# Observed data from the contingency table
handPreferenceObserved = [
    [12, 120-12],  # Female
    [24, 180-24]   # Male
]

# Perform Chi-Square Independence Test
handPreferenceObservedResult = chi_square_independence_test(handPreferenceObserved)

print_results(handPreferenceObservedResult)


Observed Table:
[[ 12 108]
 [ 24 156]]
Expected Table:
[[ 14.4 105.6]
 [ 21.6 158.4]]
Chi-Square Statistic: 0.7575757575757577
Degrees of Freedom: 1
P-value: 0.3840882494738519
Chi-Square Critical Value: 3.841458820694124
Conclusion: Fail to Reject H0: The variables are independent.
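One caveat when comparing this 2x2 result with SciPy: `scipy.stats.chi2_contingency` applies Yates' continuity correction by default when there is 1 degree of freedom, so its default output will not match the uncorrected formula used in this lab. The sketch below shows both variants on the same table.

```python
from scipy.stats import chi2_contingency

hand_table = [[12, 108],   # Female
              [24, 156]]   # Male

# Uncorrected: matches the formula implemented in this lab
plain_stat, plain_p, plain_dof, _ = chi2_contingency(hand_table, correction=False)

# SciPy's default for 2x2 tables: Yates' continuity correction
yates_stat, yates_p, yates_dof, _ = chi2_contingency(hand_table, correction=True)

print(plain_stat, plain_p)
print(yates_stat, yates_p)
```

The corrected statistic is smaller and its p-value larger, so the conclusion here (fail to reject $H_0$) is the same either way.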


## Task 4
Lecture 10, Slide 47.

Does class standing (Freshman, Sophomore, Junior, Senior) influence the selection of meal plans (20/week, 10/week, none)?

$H_0$: The variables are independent.

In [17]:
# Observed data from the contingency table
mealPlanObserved = [
    [24, 32, 14],  # Freshman
    [22, 26, 12],  # Sophomore
    [10, 14, 6],   # Junior
    [14, 16, 10]   # Senior
]

# Perform Chi-Square Independence Test
mealPlanResult = chi_square_independence_test(mealPlanObserved, alpha=0.05)

print_results(mealPlanResult)


Observed Table:
[[24 32 14]
 [22 26 12]
 [10 14  6]
 [14 16 10]]
Expected Table:
[[24.5 30.8 14.7]
 [21.  26.4 12.6]
 [10.5 13.2  6.3]
 [14.  17.6  8.4]]
Chi-Square Statistic: 0.7093382807668522
Degrees of Freedom: 6
P-value: 0.9942873142017763
Chi-Square Critical Value: 12.591587243743977
Conclusion: Fail to Reject H0: The variables are independent.
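Because some cells in this table are small, it is worth confirming that the expected-frequency condition from the Notes holds. A short check, recomputing the expected table with the same formula as the function above:

```python
import numpy as np

meal_plan = np.array([[24, 32, 14],   # Freshman
                      [22, 26, 12],   # Sophomore
                      [10, 14, 6],    # Junior
                      [14, 16, 10]])  # Senior

# E_ij = R_i * C_j / G
expected = np.outer(meal_plan.sum(axis=1), meal_plan.sum(axis=0)) / meal_plan.sum()

smallest = expected.min()
all_large_enough = bool((expected >= 5).all())
print(smallest, all_large_enough)
```

The smallest expected count is 6.3 (Junior, no meal plan), so every cell meets the ≥ 5 guideline.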


## Task 5
Lecture 10, Slide 53.

Do award preferences (Academy, Nobel, or Olympic) depend on whether the higher SAT score is Math or Verbal?

$H_0$: Award preference is independent of which SAT score is higher.

In [18]:
# Observed data from the contingency table
awardPreferenceObserved = [
    [21, 68, 116],  # Math
    [10, 79, 61]    # Verbal
]

# Perform Chi-Square Independence Test
awardPreferenceResult = chi_square_independence_test(awardPreferenceObserved, alpha=0.05)

print_results(awardPreferenceResult)


Observed Table:
[[ 21  68 116]
 [ 10  79  61]]
Expected Table:
[[ 17.90140845  84.88732394 102.21126761]
 [ 13.09859155  62.11267606  74.78873239]]
Chi-Square Statistic: 13.622609647147335
Degrees of Freedom: 2
P-value: 0.001101255018562064
Chi-Square Critical Value: 5.991464547107979
Conclusion: Reject H0: The variables are not independent.
