
---
### Introduction
The chi-square test is a statistical method used to determine if there is a significant association between two categorical variables. This test is widely used in various fields, including social sciences, marketing, and healthcare, to analyze survey data, experimental results, and observational studies.

---
### Concept
The chi-square test of independence is a non-parametric method: it makes no assumption about the distribution of the underlying variables and works directly on counts. It compares the observed frequency in each cell of a contingency table with the frequency that would be expected if the two variables were independent. Under the null hypothesis, the resulting statistic approximately follows a chi-square distribution, which makes it possible to judge whether the observed deviations could plausibly have arisen by random chance.

---
### Null Hypothesis and Alternative Hypothesis
The chi-square test involves formulating two hypotheses:
* **Null Hypothesis ($H_0$):** There is no association between the categorical variables; any observed differences are due to random chance.
* **Alternative Hypothesis ($H_1$):** There is an association between the variables; the observed differences are not due to chance alone.

---
### Formula
The chi-square statistic is calculated using the formula:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where:
* $O_i$ is the observed frequency in cell $i$ of the contingency table.
* $E_i$ is the expected frequency in cell $i$ under independence, calculated from the totals of that cell's row and column:

$$E_i = \frac{(\text{row total} \times \text{column total})}{\text{grand total}}$$

The sum is taken over all cells in the contingency table.
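
As a rough sketch of the formula, the statistic can be computed directly with NumPy. The function name `chi_square_statistic` and the 2x3 table are hypothetical, chosen only for illustration.

```python
import numpy as np

def chi_square_statistic(observed):
    """Return (statistic, expected) for a table of observed counts."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)  # shape (r, 1)
    col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, c)
    grand_total = observed.sum()
    # E = (row total * column total) / grand total, for every cell at once
    expected = row_totals * col_totals / grand_total
    # Sum of (O - E)^2 / E over all cells
    statistic = ((observed - expected) ** 2 / expected).sum()
    return statistic, expected

# Hypothetical 2x3 table used only for illustration
example = [[10, 20, 30],
           [20, 20, 20]]
statistic, expected = chi_square_statistic(example)
print(expected)
print(round(statistic, 4))
```

Broadcasting the column of row totals against the row of column totals yields the whole table of expected frequencies in one step, which avoids looping over cells.
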
The calculated chi-square statistic is then compared to a critical value from the chi-square distribution table. This table provides critical values for different degrees of freedom (df) and significance levels ($\alpha$).
The degrees of freedom for the test are calculated as:
$$df = (r-1) \times (c-1)$$
where $r$ is the number of rows and $c$ is the number of columns in the table.
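
Instead of a printed table, the critical value can be looked up numerically. The sketch below assumes SciPy is available and uses `scipy.stats.chi2.ppf` (the inverse CDF); the helper name `critical_value` is hypothetical.

```python
from scipy.stats import chi2

def critical_value(n_rows, n_cols, alpha=0.05):
    """Return (df, critical value) for an r x c contingency table."""
    dof = (n_rows - 1) * (n_cols - 1)
    # ppf(1 - alpha) is the value that leaves probability alpha in the upper tail
    return dof, chi2.ppf(1 - alpha, dof)

df_2x2, crit_2x2 = critical_value(2, 2)
df_3x3, crit_3x3 = critical_value(3, 3)
print(df_2x2, round(crit_2x2, 3))
print(df_3x3, round(crit_3x3, 3))
```
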

---
### Applications
* **Market Research:** Analyzing the association between customer demographics and product preferences.
* **Healthcare:** Studying the relationship between patient characteristics and disease incidence.
* **Social Sciences:** Investigating the link between social factors (e.g., education level) and behavioral outcomes (e.g., voting patterns).
* **Education:** Examining the connection between teaching methods and student performance.
* **Quality Control:** Assessing the association between manufacturing conditions and product defects.

---
### Practical Example - Weak Association
Suppose a researcher wants to determine if there is an association between gender (male, female) and preference for a new product (like, dislike). The researcher surveys 100 people and records the following data:

| Category | Like | Dislike | Total |
| :------- | :--- | :------ | :---- |
| Male     | 20   | 30      | 50    |
| Female   | 25   | 25      | 50    |
| Total    | 45   | 55      | 100   |

#### Step 1: Calculate Expected Frequencies
Using the formula for expected frequencies:
$$E_{\text{Male,Like}} = \frac{(50 \times 45)}{100} = 22.5$$
$$E_{\text{Male,Dislike}} = \frac{(50 \times 55)}{100} = 27.5$$
$$E_{\text{Female,Like}} = \frac{(50 \times 45)}{100} = 22.5$$
$$E_{\text{Female,Dislike}} = \frac{(50 \times 55)}{100} = 27.5$$

#### Step 2: Compute Chi-Square Statistic
$$\chi^2 = \frac{(20-22.5)^2}{22.5} + \frac{(30-27.5)^2}{27.5} + \frac{(25-22.5)^2}{22.5} + \frac{(25-27.5)^2}{27.5}$$
$$\chi^2 = \frac{(-2.5)^2}{22.5} + \frac{(2.5)^2}{27.5} + \frac{(2.5)^2}{22.5} + \frac{(-2.5)^2}{27.5}$$
$$\chi^2 = \frac{6.25}{22.5} + \frac{6.25}{27.5} + \frac{6.25}{22.5} + \frac{6.25}{27.5}$$
$$\chi^2 \approx 0.278 + 0.227 + 0.278 + 0.227$$
$$\chi^2 \approx 1.010$$

#### Step 3: Determine Degrees of Freedom
$$df = (2-1) \times (2-1) = 1 \times 1 = 1$$

#### Step 4: Interpret the Result
Using a chi-square distribution table, we compare the calculated chi-square value (1.010) with the critical value for one degree of freedom at a chosen significance level (e.g., $\alpha = 0.05$), which is approximately 3.841.
Since $1.010 < 3.841$, we fail to reject the null hypothesis. This sample provides no statistically significant evidence of an association between gender and product preference.
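
One caveat when checking this hand calculation in SciPy: `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default (`correction=True`), which moves each observed count 0.5 closer to its expected value before squaring. The following minimal sketch compares the two settings on the table above; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 30],   # Male: Like, Dislike
                     [25, 25]])  # Female: Like, Dislike

# Pearson statistic, as computed by hand in Step 2
stat_plain, p_plain, dof, _ = chi2_contingency(observed, correction=False)
# SciPy's default when dof == 1: Yates' continuity correction
stat_yates, p_yates, _, _ = chi2_contingency(observed, correction=True)

print(f"Uncorrected: chi2 = {stat_plain:.4f}, p = {p_plain:.4f}")
print(f"Yates:       chi2 = {stat_yates:.4f}, p = {p_yates:.4f}")
```

The corrected statistic is smaller than the uncorrected one, so the two settings can disagree near the significance threshold. Here both lead to the same decision: fail to reject $H_0$.
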

**Python Implementation (Weak Association Example)**

In [None]:
# CELL 1: Import necessary libraries
import numpy as np
from scipy.stats import chi2_contingency

In [None]:
# CELL 2: Define the observed contingency table for gender and product preference
# Rows: Gender (Male, Female)
# Columns: Preference (Like, Dislike)
observed_gender_product = np.array([[20, 30],  # Male: 20 Like, 30 Dislike
                                      [25, 25]]) # Female: 25 Like, 25 Dislike

print("Observed Frequencies (Gender vs. Product Preference):")
print(observed_gender_product)

In [None]:
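# CELL 3: Perform the Chi-Square Test and print results
# The chi2_contingency function returns:
# 1. chi2_stat: The calculated chi-square statistic
# 2. p_value: The p-value of the test
# 3. dof: Degrees of freedom
# 4. expected_freq: The array of expected frequencies under the null hypothesis
# Note: for 2x2 tables (dof == 1) SciPy applies Yates' continuity correction by default,
# which would give a statistic of about 0.65 here. Passing correction=False reproduces
# the hand calculation above (about 1.01).
chi2_stat, p_value, dof, expected_freq = chi2_contingency(observed_gender_product, correction=False)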

print(f"\nChi-Square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected_freq)

# Interpretation
alpha = 0.05
print(f"\nSignificance level (alpha): {alpha}")
if p_value < alpha:
    print("Result: Reject the null hypothesis (H0).")
    print("Conclusion: There is a statistically significant association between gender and product preference.")
else:
    print("Result: Fail to reject the null hypothesis (H0).")
    print("Conclusion: There is NO statistically significant association between gender and product preference.")

---
### Practical Example - Strong Association
Consider a study investigating the relationship between smoking status (smoker, non-smoker) and the incidence of lung disease (disease, no disease). The researcher collects data from 200 individuals and records the following information:

| Category   | Disease | No Disease | Total |
| :--------- | :------ | :--------- | :---- |
| Smoker     | 50      | 30         | 80    |
| Non-Smoker | 20      | 100        | 120   |
| Total      | 70      | 130        | 200   |

#### Step 1: Calculate Expected Frequencies
Using the formula for expected frequencies:
$$E_{\text{Smoker,Disease}} = \frac{(80 \times 70)}{200} = 28$$
$$E_{\text{Smoker,No Disease}} = \frac{(80 \times 130)}{200} = 52$$
$$E_{\text{Non-Smoker,Disease}} = \frac{(120 \times 70)}{200} = 42$$
$$E_{\text{Non-Smoker,No Disease}} = \frac{(120 \times 130)}{200} = 78$$

#### Step 2: Compute Chi-Square Statistic
$$\chi^2 = \frac{(50-28)^2}{28} + \frac{(30-52)^2}{52} + \frac{(20-42)^2}{42} + \frac{(100-78)^2}{78}$$
$$\chi^2 = \frac{(22)^2}{28} + \frac{(-22)^2}{52} + \frac{(-22)^2}{42} + \frac{(22)^2}{78}$$
$$\chi^2 = \frac{484}{28} + \frac{484}{52} + \frac{484}{42} + \frac{484}{78}$$
$$\chi^2 \approx 17.286 + 9.308 + 11.524 + 6.205$$
$$\chi^2 \approx 44.32$$

#### Step 3: Determine Degrees of Freedom
$$df = (2-1) \times (2-1) = 1$$

#### Step 4: Interpret the Result
Using a chi-square distribution table, we compare the calculated chi-square value (44.32) with the critical value for one degree of freedom at a chosen significance level (e.g., $\alpha = 0.05$), which is approximately 3.841. Since $44.32 > 3.841$, we reject the null hypothesis. This indicates a statistically significant association between smoking status and the incidence of lung disease in this sample.

**Python Implementation (Strong Association Example)**

In [None]:
# CELL 4: Define the observed contingency table for smoking and lung disease
# Rows: Status (Smoker, Non-Smoker)
# Columns: Lung Disease (Disease, No Disease)
observed_smoking_disease = np.array([[50, 30],  # Smoker: 50 Disease, 30 No Disease
                                       [20, 100]]) # Non-Smoker: 20 Disease, 100 No Disease

print("Observed Frequencies (Smoking vs. Lung Disease):")
print(observed_smoking_disease)

In [None]:
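# CELL 5: Perform the Chi-Square Test and print results
# correction=False disables Yates' continuity correction (SciPy's default for 2x2 tables)
# so that the statistic matches the hand calculation above.
chi2_stat_sd, p_value_sd, dof_sd, expected_freq_sd = chi2_contingency(observed_smoking_disease, correction=False)

print(f"\nChi-Square Statistic: {chi2_stat_sd:.4f}")
print(f"P-value: {p_value_sd:.2e}")  # Very small, so scientific notation is used; fixed decimals would show 0.0000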
print(f"Degrees of Freedom: {dof_sd}")
print("Expected Frequencies:")
print(expected_freq_sd)

# Interpretation
alpha = 0.05
print(f"\nSignificance level (alpha): {alpha}")
if p_value_sd < alpha:
    print("Result: Reject the null hypothesis (H0).")
    print("Conclusion: There is a statistically significant association between smoking status and lung disease.")
else:
    print("Result: Fail to reject the null hypothesis (H0).")
    print("Conclusion: There is NO statistically significant association between smoking status and lung disease.")

---
### Additional Python Example: Larger Contingency Table (e.g., 3x3)

Let's say we are examining the association between `Education Level` (High School, Bachelor's, Master's/PhD) and `Preferred News Source` (Online, TV, Print).

In [None]:
# CELL 6: Define a larger contingency table (3x3)
# Rows: Education Level (HS, Bachelors, Masters/PhD)
# Columns: News Source (Online, TV, Print)
observed_education_news = np.array([
    [150, 60, 30], # High School
    [200, 40, 20], # Bachelor's
    [100, 20, 15]  # Master's/PhD
])

print("Observed Frequencies (Education vs. News Source):")
print(observed_education_news)

In [None]:
# CELL 7: Perform Chi-Square test on the 3x3 table
chi2_stat_en, p_value_en, dof_en, expected_freq_en = chi2_contingency(observed_education_news)

print(f"\nChi-Square Statistic: {chi2_stat_en:.4f}")
print(f"P-value: {p_value_en:.4f}")
print(f"Degrees of Freedom: {dof_en}") # Should be (3-1)*(3-1) = 4
print("Expected Frequencies:")
print(expected_freq_en)

# Interpretation
alpha = 0.05
print(f"\nSignificance level (alpha): {alpha}")
if p_value_en < alpha:
    print("Result: Reject the null hypothesis (H0).")
    print("Conclusion: There is a statistically significant association between education level and preferred news source.")
else:
    print("Result: Fail to reject the null hypothesis (H0).")
    print("Conclusion: There is NO statistically significant association between education level and preferred news source.")

---
### Python Example: Creating Contingency Table from Raw Data using Pandas

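Often, you have raw data rather than a pre-summarized contingency table. The pandas `crosstab` function builds the table directly from two columns. Note that the toy dataset below has only 18 rows spread over a 4x3 table, so every expected count falls below 5; it illustrates the mechanics, and the test result itself should not be trusted at this sample size.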

In [None]:
# CELL 8: Import pandas
import pandas as pd

In [None]:
# CELL 9: Create sample raw data
# Imagine this data comes from a survey
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'North', 'South', 'West'],
    'Product_Choice': ['A', 'B', 'A', 'C', 'C', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'A', 'C', 'C']
}
df_survey = pd.DataFrame(data)

print("Sample Raw Survey Data:")
print(df_survey.head())

In [None]:
# CELL 10: Create a contingency table (cross-tabulation) from the raw data
contingency_table_survey = pd.crosstab(df_survey['Region'], df_survey['Product_Choice'])

print("\nContingency Table (Region vs. Product Choice):")
print(contingency_table_survey)

In [None]:
# CELL 11: Perform Chi-Square test on the created contingency table
chi2_stat_survey, p_value_survey, dof_survey, expected_freq_survey = chi2_contingency(contingency_table_survey)

print(f"\nChi-Square Statistic: {chi2_stat_survey:.4f}")
print(f"P-value: {p_value_survey:.4f}")
print(f"Degrees of Freedom: {dof_survey}")
print("Expected Frequencies:")
print(expected_freq_survey)

# Interpretation
alpha = 0.05
print(f"\nSignificance level (alpha): {alpha}")
if p_value_survey < alpha:
    print("Result: Reject the null hypothesis (H0).")
    print("Conclusion: There is a statistically significant association between region and product choice.")
else:
    print("Result: Fail to reject the null hypothesis (H0).")
    print("Conclusion: There is NO statistically significant association between region and product choice.")

---
### Assumptions of the Chi-Square Test of Independence
1.  **Two Categorical Variables:** The data should consist of counts for two categorical variables.
2.  **Independence of Observations:** Each observation should be independent of all other observations (e.g., one participant's response does not affect another's).
3.  **Sufficiently Large Expected Frequencies:** For the chi-square approximation to be valid, the expected frequencies in each cell of the contingency table should not be too small. A common rule of thumb is:
    * No cell should have an expected frequency less than 1.
    * At least 80% of the cells should have an expected frequency of 5 or more.
    If these conditions are not met, especially for 2x2 tables, Fisher's Exact Test might be a better alternative. For larger tables, you might consider combining categories if it's theoretically sound.
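
The rule of thumb above can be checked before running the test. This is a minimal sketch, assuming SciPy's `scipy.stats.contingency.expected_freq` and `scipy.stats.fisher_exact`; the helper name `check_expected_counts` and the small 2x2 table are hypothetical.

```python
import numpy as np
from scipy.stats import fisher_exact
from scipy.stats.contingency import expected_freq

def check_expected_counts(observed):
    """Summarize the rule of thumb for expected cell counts."""
    expected = expected_freq(np.asarray(observed))
    share_at_least_5 = float((expected >= 5).mean())
    return {
        "min_expected": float(expected.min()),
        "share_at_least_5": share_at_least_5,
        # No cell below 1, and at least 80% of cells at 5 or more
        "rule_ok": bool(expected.min() >= 1 and share_at_least_5 >= 0.8),
    }

# Hypothetical small 2x2 table, for illustration only
small_table = np.array([[3, 1],
                        [1, 3]])
summary = check_expected_counts(small_table)
print(summary)

if not summary["rule_ok"]:
    # Fisher's exact test does not rely on the large-sample approximation
    odds_ratio, p_exact = fisher_exact(small_table)
    print(f"Fisher's exact test p-value: {p_exact:.4f}")
```
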

---
### Interpreting the p-value and Chi-Square Statistic
* The **Chi-Square Statistic ($\chi^2$)** measures the overall difference between observed and expected frequencies. A larger value generally indicates a greater discrepancy.
* The **p-value** is the probability of obtaining a chi-square statistic at least as large as the one observed if the null hypothesis (no association) were true.
    * If **p-value < $\alpha$** (your chosen significance level, e.g., 0.05): Reject $H_0$. There is a statistically significant association between the variables.
    * If **p-value $\ge \alpha$**: Fail to reject $H_0$. There is not enough evidence to conclude a significant association between the variables.
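
The p-value rule and the critical-value rule used in the worked examples are two views of the same decision. The sketch below assumes SciPy and uses `chi2.sf` (upper-tail probability) and `chi2.ppf` (inverse CDF); the helper `decide` is hypothetical, and the two statistics are the Pearson values of the worked examples recomputed to four decimals.

```python
from scipy.stats import chi2

def decide(statistic, df, alpha=0.05):
    """Apply both decision rules and return (p_value, critical_value, reject_h0)."""
    p_value = chi2.sf(statistic, df)    # P(X >= statistic) under H0
    critical = chi2.ppf(1 - alpha, df)  # value cutting off the upper alpha tail
    reject_by_p = bool(p_value < alpha)
    reject_by_critical = bool(statistic > critical)
    # Both rules agree because the upper-tail probability decreases as the statistic grows
    assert reject_by_p == reject_by_critical
    return p_value, critical, reject_by_p

# Pearson statistics of the two worked examples (1 degree of freedom each)
for label, stat in [("gender vs. preference", 1.0101), ("smoking vs. disease", 44.3223)]:
    p_value, critical, reject = decide(stat, df=1)
    print(f"{label}: chi2 = {stat:.4f}, p = {p_value:.3g}, reject H0: {reject}")
```
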

---
### Strength of Association (Brief Mention)
While the chi-square test tells you if an association is statistically significant, it doesn't tell you how strong that association is. For 2x2 tables, the **Phi coefficient ($\phi$)** can be used. For larger tables, **Cramer's V** is a common measure of association strength. Both typically range from 0 (no association) to 1 (perfect association).

In [None]:
# CELL 12: Optional - Calculate Phi and Cramer's V for the smoking and lung disease example (2x2 table)
# Cramer's V formula: sqrt(chi2 / (n * (min(r, c) - 1)))
# where n is the total sample size, r is the number of rows, and c is the number of columns.
# Phi and Cramer's V are conventionally computed from the uncorrected statistic,
# so recompute it here without Yates' continuity correction.
chi2_uncorrected_sd, _, _, _ = chi2_contingency(observed_smoking_disease, correction=False)

n_sd = observed_smoking_disease.sum()
phi_sd = np.sqrt(chi2_uncorrected_sd / n_sd)  # For a 2x2 table, Cramer's V reduces to Phi = sqrt(chi2 / n)
# More general Cramer's V
min_dim_sd = min(observed_smoking_disease.shape) - 1
if min_dim_sd == 0:  # A table with a single row or column has no association to measure
    cramers_v_sd = np.nan
else:
    cramers_v_sd = np.sqrt(chi2_uncorrected_sd / (n_sd * min_dim_sd))

print("\nFor Smoking vs. Lung Disease example:")
print(f"Total observations (n): {n_sd}")
print(f"Phi coefficient: {phi_sd:.4f}")
print(f"Cramer's V: {cramers_v_sd:.4f}")  # Identical to phi_sd for a 2x2 table
# Interpretation of Cramer's V (general guidelines, context-dependent):
# ~0.1: weak association
# ~0.3: moderate association
# ~0.5 or higher: strong association
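
For tables larger than 2x2, the same calculation can be wrapped in a reusable helper. This is a sketch under the assumption that the uncorrected statistic is wanted; the function name `cramers_v` is hypothetical, and the table repeats the education and news source counts from the 3x3 example.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramer's V computed from the uncorrected chi-square statistic."""
    observed = np.asarray(observed)
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        # A single row or column leaves no association to measure
        return float("nan")
    chi2_stat, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    return float(np.sqrt(chi2_stat / (n * min_dim)))

# Education level vs. preferred news source (the 3x3 table from the earlier example)
education_news = np.array([
    [150, 60, 30],  # High School
    [200, 40, 20],  # Bachelor's
    [100, 20, 15],  # Master's/PhD
])
print(f"Cramer's V (education vs. news source): {cramers_v(education_news):.4f}")
```

Recent SciPy releases also appear to offer `scipy.stats.contingency.association` for this purpose; the manual version is shown here to keep the formula visible.
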

---
### Conclusion
The chi-square test is a powerful tool for analyzing the relationship between categorical variables. By comparing observed and expected frequencies, researchers can determine if there is a statistically significant association, providing valuable insights in various fields of study.

---
### Practice Questions (50)

**Conceptual Understanding & Definitions**

1.  What is the primary purpose of a Chi-Square test of independence?
2.  Is the Chi-Square test a parametric or non-parametric test? Explain what that means.
3.  What type of variables is the Chi-Square test designed for?
4.  Define "observed frequency" in the context of a Chi-Square test.
5.  Define "expected frequency" in the context of a Chi-Square test.
6.  What does the Chi-Square distribution describe?
7.  What does it mean if two categorical variables are "independent"?
8.  What does it mean if two categorical variables are "associated"?

**Hypotheses**

9.  State the typical null hypothesis ($H_0$) for a Chi-Square test of independence.
10. State the typical alternative hypothesis ($H_1$) for a Chi-Square test of independence.
11. If a Chi-Square test yields a p-value of 0.03, and $\alpha = 0.05$, what decision do you make regarding the null hypothesis?
12. A researcher wants to test if there's an association between car color preference and buyer's gender. Formulate the $H_0$ and $H_1$.

**Formula & Calculations**

13. Write down the formula for the Chi-Square statistic ($\chi^2$).
14. Explain each component of the Chi-Square statistic formula ($O_i$, $E_i$).
15. Write down the formula for calculating the expected frequency ($E_i$) for a cell in a contingency table.
16. Write down the formula for calculating the degrees of freedom (df) for a Chi-Square test of independence.
17. If a contingency table has 3 rows and 4 columns, what are the degrees of freedom for the Chi-Square test?
18. If a contingency table has 2 rows and 2 columns, what are the degrees of freedom?
19. Why is $(O_i - E_i)$ squared in the Chi-Square formula?
20. What does a larger Chi-Square statistic generally indicate?

**Interpreting Results (p-value, $\chi^2$ statistic)**

21. What is a p-value in the context of a Chi-Square test?
22. If the p-value is 0.25 and $\alpha = 0.05$, what is your conclusion about the association between the variables?
23. If the calculated $\chi^2$ statistic is 15.6 and the critical $\chi^2$ value (for a given df and $\alpha$) is 7.81, what is your conclusion?
24. Does a statistically significant Chi-Square test result tell you about the strength of the association? Why or why not?
25. What does "failing to reject the null hypothesis" mean in practical terms?
26. What does "rejecting the null hypothesis" mean in practical terms?

**Assumptions**

27. List three key assumptions for the Chi-Square test of independence.
28. What is the common rule of thumb regarding expected cell frequencies for a Chi-Square test?
29. What might be an alternative test if the expected cell frequency assumption is violated, especially in a 2x2 table?
30. Why is the "independence of observations" assumption important?

**Python Implementation (SciPy & Pandas)**

31. Which function from the `scipy.stats` module is commonly used to perform a Chi-Square test of independence?
32. What are the four main values returned by `scipy.stats.chi2_contingency`?
33. How can you create a contingency table from raw categorical data in two Pandas Series (`series1`, `series2`)?
34. If `chi2_contingency` returns a p-value of `0.001`, what does this suggest about the association between the variables?
35. What data structure is typically passed as input to `chi2_contingency` (e.g., a NumPy array or Pandas DataFrame representing what)?

**Application & Scenarios**

36. A survey asks 200 people about their preferred vacation type (Beach, Mountains, City) and their income level (Low, Medium, High). What statistical test could be used to see if these are related?
37. For the scenario in Q36, what would be the null hypothesis?
38. If the Chi-Square test for Q36 results in a p-value of 0.02 (with $\alpha = 0.05$), what is the conclusion?
39. A company tests two different website designs (Design A, Design B) and records how many users click a "Sign Up" button for each design. How would you set up the contingency table?
40. If the Chi-Square test for Q39 is significant, what does it imply about the website designs?
41. A teacher wants to know if the final grade (Pass/Fail) is associated with attendance (Regular/Irregular). What are the variables and their categories?
42. If a Chi-Square test shows no significant association between study method and exam scores, does this prove that study method has no effect at all? Explain.

**True/False & Fill-in-the-Blanks**

43. True or False: A p-value of 0.60 means there is a strong association between the variables.
44. True or False: The Chi-Square test can be used to compare the means of two groups.
45. The Chi-Square statistic cannot be \_\_\_\_\_\_\_\_\_\_ (negative/positive/zero).
46. Degrees of freedom for a contingency table with $r$ rows and $c$ columns is \_\_\_\_\_\_\_\_\_\_.
47. If observed frequencies are very close to expected frequencies, the Chi-Square statistic will be \_\_\_\_\_\_\_\_\_\_ (small/large).
48. True or False: Cramer's V is a measure of the statistical significance of an association.
49. The significance level, $\alpha$, represents the probability of making a \_\_\_\_\_\_\_\_\_\_ error.
50. If the expected frequency in a cell is 0.5, this violates an assumption of the \_\_\_\_\_\_\_\_\_\_ test.