# The Chi-Square Test for Data Science

The **Chi-Square ($\chi^2$) test** is a statistical method used to determine whether there is a significant association between two categorical variables, or whether observed category counts fit a hypothesized distribution.

---

## 1. Core Logic
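The test compares **Observed (O)** values (the actual counts in the data) against **Expected (E)** values (the counts we would expect if the null hypothesis were true, e.g., no relationship between the variables). The larger the gap between O and E, the larger the $\chi^2$ statistic.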

### The Formula
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
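
To make the formula concrete, here is a minimal sketch that evaluates it by hand with NumPy. The die-roll counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical counts: 60 rolls of a die, one entry per face.
observed = np.array([8, 12, 9, 11, 10, 10])

# Under H0 (a fair die) every face is expected 60 / 6 = 10 times.
expected = np.full(6, observed.sum() / 6)

# Squared deviation of each cell, scaled by its expected count, then summed.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(f"chi2 = {chi2_stat:.2f}")  # chi2 = 1.00
```

The squared deviations are 4, 4, 1, 1, 0, 0; each is divided by 10, so the statistic is 1.0.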

---

## 2. Main Types in Data Science

| Test Type | Purpose | Example |
| :--- | :--- | :--- |
| **Goodness of Fit** | Checks if sample data matches a known distribution. | Is our website traffic distributed equally across all days of the week? |
| **Test of Independence** | Checks if two categorical variables are related. | Does a user's "City" affect their "Subscription Plan"? |
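
As a rough illustration of the Goodness of Fit row, the sketch below uses `scipy.stats.chisquare` on website traffic per weekday. The visit counts are invented for the example; when no expected frequencies are passed, `chisquare` assumes all categories are equally likely.

```python
from scipy.stats import chisquare

# Hypothetical visit counts, Monday through Sunday (illustrative numbers only).
visits = [120, 135, 128, 140, 150, 95, 72]

# With no f_exp argument, the test compares against equal expected counts
# (here 840 / 7 = 120 visits per day).
stat, p_value = chisquare(visits)

print(f"Chi-Square Statistic: {stat:.4f}")
print(f"P-value: {p_value:.4g}")

if p_value < 0.05:
    print("Reject Null Hypothesis: traffic is not evenly spread across days.")
else:
    print("Fail to Reject Null Hypothesis: no significant evidence of uneven traffic.")
```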

---

## 3. Assumptions (The "Must-Haves")
1. **Categorical Data:** Variables must be categories (e.g., Male/Female, High/Low, Yes/No), and the table must hold raw counts, not percentages or averages.
2. **Independence:** Each observation must be independent and fall into exactly one cell (one person cannot be in two groups).
3. **Large Sample Size:** As a rule of thumb, each cell in your frequency table should have an expected count $\ge$ 5.
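
The sample-size assumption can be checked before running the test. The following is a sketch, assuming a hypothetical 2x3 table with one sparse column; it uses `expected_freq` from `scipy.stats.contingency` to compute the expected counts under independence.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical 2x3 contingency table; the last column is sparse.
observed = np.array([[12, 30, 3],
                     [18, 25, 2]])

# Expected count per cell = row total * column total / grand total.
expected = expected_freq(observed)
small_cells = int((expected < 5).sum())

print(expected.round(2))
print(f"Cells with expected count < 5: {small_cells} of {expected.size}")
```

When cells fall below the threshold, common remedies are merging sparse categories or collecting more data.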

---

## 4. Why it Matters for ML (Feature Selection)
In Data Science, we use the **Test of Independence** to screen categorical features against a categorical target:
* **High $\chi^2$ Score + Low p-value:** There is evidence that the feature and target are dependent. **Keep this feature.**
* **Low $\chi^2$ Score + High p-value:** There is no significant evidence of a relationship, so the feature is likely "noise" for this target. **Consider dropping this feature.**
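
One way to apply this filter is to cross-tabulate each candidate feature against the target and collect the p-values. The sketch below assumes a hypothetical toy DataFrame (the column names `plan`, `color`, and `churned` are made up), built so that `plan` tracks the target and `color` does not.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: "plan" is related to the target, "color" is not.
df = pd.DataFrame({
    "plan":    ["pro", "pro", "free", "free"] * 25,
    "color":   ["red", "blue"] * 50,
    "churned": [0, 0, 1, 1] * 20 + [1, 1, 0, 0] * 5,
})

results = []
for feature in ["plan", "color"]:
    # Cross-tab of feature categories against the target classes.
    table = pd.crosstab(df[feature], df["churned"])
    stat, p_value, dof, _ = chi2_contingency(table)
    results.append({"feature": feature, "chi2": stat, "p_value": p_value, "dof": dof})

# Smallest p-value first; keep features significant at the 5% level.
scores = pd.DataFrame(results).sort_values("p_value")
selected = scores.loc[scores["p_value"] < 0.05, "feature"].tolist()

print(scores)
print("Selected features:", selected)
```

Ranking by p-value rather than by raw $\chi^2$ score keeps features with different numbers of categories comparable, since the score grows with the degrees of freedom.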



---

## 5. Python Implementation (SciPy)
Use SciPy's `chi2_contingency` to analyze a contingency table (cross-tab), such as one built with `pandas.crosstab`:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# 1. Create a contingency table (Cross-tab)
# Rows: Gender | Columns: Product Preference
data = [[10, 20, 30],  # Male
        [20, 15, 5]]   # Female

# 2. Run the test
stat, p, dof, expected = chi2_contingency(data)

print(f"Chi-Square Statistic: {stat:.4f}")
print(f"P-value: {p:.4f}")

# 3. Interpret (significance level 0.05)
if p < 0.05:
    print("Reject Null Hypothesis: There is a significant relationship.")
else:
    print("Fail to Reject Null Hypothesis: No significant evidence of a relationship.")
```

### Example 1: Titanic (Sex vs. Survival)

```python
import pandas as pd
import seaborn as sns
from scipy.stats import chi2_contingency
import numpy as np

df = sns.load_dataset('titanic')
df.shape
```

```
(891, 15)
```

```python
df.head()
```

| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |

```python
contingency_table = pd.crosstab(df['sex'], df['survived'])
contingency_table
```

| sex | survived = 0 | survived = 1 |
| :--- | :--- | :--- |
| female | 81 | 233 |
| male | 468 | 109 |

### Hypothesis Logic
1. **Null Hypothesis ($H_0$):** There is no relationship; the two variables are independent.
2. **Alternate Hypothesis ($H_a$):** There is a relationship (association) between the variables.
3. **Decision rule:** If the p-value is less than 0.05, reject the null hypothesis. Such a result is statistically significant at the 5% level: counts this far from the expected values would be unlikely if the variables were truly independent.

> P is low, Null must go.

```python
# H0: The "Nothing is happening" hypothesis
NULL_HYPOTHESIS = "Gender and Survival are INDEPENDENT (No connection)"

# Ha: The "Something is happening" hypothesis
ALTERNATE_HYPOTHESIS = "Gender and Survival are DEPENDENT (Significant relationship)"

def chi_square_test(data, alpha=0.05):
    """Run a chi-square test of independence on a contingency table and print the verdict."""
    # Requires: from scipy.stats import chi2_contingency
    chi2, pvalue, dof, expected = chi2_contingency(data)

    print(f"Chi2 Score : {chi2:.4f}")
    # A very small p-value rounds to 0.0000 here; use {pvalue:.2e} to see its magnitude
    print(f"*P-value    : {pvalue:.4f}")
    print(f"DOF        : {dof}")

    if pvalue < alpha:
        # p < alpha: REJECT the Null in favor of the Alternate
        print(f"Result: {ALTERNATE_HYPOTHESIS}")
    else:
        # p >= alpha: FAIL to reject the Null (this does not prove independence)
        print(f"Result: {NULL_HYPOTHESIS}")

# Usage
chi_square_test(contingency_table)
```

```
Chi2 Score : 260.7170
*P-value    : 0.0000
DOF        : 1
Result: Gender and Survival are DEPENDENT (Significant relationship)
```
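
One detail behind the score above: for a 2x2 table (DOF = 1), `chi2_contingency` applies Yates' continuity correction by default, so the statistic is slightly smaller than the plain formula would give. The sketch below re-enters the counts from the sex-by-survived cross-tab by hand so it runs without seaborn, and compares both settings; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the sex-by-survived cross-tab above.
observed = np.array([[81, 233],    # female: survived = 0, survived = 1
                     [468, 109]])  # male:   survived = 0, survived = 1

# Default call: correction=True, applied only when DOF is 1.
chi2_yates, p_yates, dof, expected = chi2_contingency(observed)

# Same test without the continuity correction (the plain formula).
chi2_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

print(f"With Yates' correction:    {chi2_yates:.4f}")
print(f"Without Yates' correction: {chi2_plain:.4f}")
print(f"P-value (scientific):      {p_yates:.2e}")
print(f"Smallest expected count:   {expected.min():.2f}")
```

Printing the p-value in scientific notation shows its magnitude instead of a rounded `0.0000`, and the smallest expected count confirms that the sample-size assumption holds for this table.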
