# **Chi-Squared Test in Data Analytics**

The **Chi-Squared Test** is a statistical test used to determine whether there is a significant association between two **categorical** variables. In data analytics, it helps identify dependencies between variables, which is useful for feature selection and for understanding relationships in the data.

## **Hypotheses**
- **Null Hypothesis (H₀):** The two variables are independent (no association).
- **Alternative Hypothesis (H₁):** The two variables are dependent (association exists).

## **Formula:**  
$$
\chi^2 = \sum \frac{(O - E)^2}{E}
$$
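where the sum runs over every cell of the contingency table, and:  
- \(O\) = Observed frequency in a cell  
- \(E\) = Expected frequency in that cell under independence, i.e. (row total × column total) / grand total  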
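
To make the formula concrete, here is a minimal sketch that computes the statistic by hand for a small 2×2 table and compares it with `scipy.stats.chi2_contingency`. The counts in `observed` are made up for illustration and do not come from the stroke dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (illustrative counts only).
# Rows: risk factor absent / present. Columns: outcome absent / present.
observed = np.array([[90, 10],
                     [60, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected = row_totals * col_totals / grand_total

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False turns off Yates' continuity correction so that
# SciPy's result matches the plain formula above.
chi2_scipy, p_value, dof, _ = chi2_contingency(observed, correction=False)

print(f"manual: {chi2_manual:.2f}, scipy: {chi2_scipy:.2f}, dof: {dof}, p: {p_value:.2e}")
```

For this table every expected count is either 75 or 25, so the statistic works out to 3 + 9 + 3 + 9 = 24 with one degree of freedom.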

---

# **Feature Selection Using Chi-Squared Test**
1. Start with a healthcare dataset: the Stroke Prediction Dataset on Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset).
2. Apply the **Chi-Squared Test** for feature selection.  
3. Rank the features by their Chi-Squared scores.  
4. Select the **top 5 features** with the highest Chi-Squared scores.  
5. Use these features for further analysis or model building.

# **Understanding Data Types in Your Dataset**
The dataset mixes identifier, categorical, and continuous columns. Only the categorical ones can be used in the Chi-Squared test directly:

| Column Name         | Data Type       | Include in Chi-Squared? | Reason |
|--------------------|----------------|----------------|--------|
| **id**             | Identifier (Unique) | ❌ No | Not useful for analysis. |
| **gender**         | Categorical | ✅ Yes | Chi-Squared can test if gender is associated with stroke. |
| **age**            | Numerical (Continuous) | ❌ No (Unless Binned) | Chi-Squared works with categorical data, so age should be grouped into categories. |
| **hypertension**   | Categorical (0 = No, 1 = Yes) | ✅ Yes | Can check if hypertension is related to stroke. |
| **heart_disease**  | Categorical (0 = No, 1 = Yes) | ✅ Yes | Can check if heart disease affects stroke occurrence. |
| **ever_married**   | Categorical (Yes/No) | ✅ Yes | Can check if marital status is related to stroke. |
| **work_type**      | Categorical | ✅ Yes | Can test if work type affects stroke risk. |
| **Residence_type** | Categorical (Urban/Rural) | ✅ Yes | Can check if residence type is linked to stroke. |
| **avg_glucose_level** | Numerical (Continuous) | ❌ No (Unless Binned) | Needs binning into ranges (e.g., Low, Normal, High). |
| **bmi**           | Numerical (Continuous, Has Missing Values) | ❌ No (Unless Binned) | Needs binning and handling of missing values. |
| **smoking_status** | Categorical | ✅ Yes | Can test if smoking is linked to stroke. |
| **stroke**        | Categorical (0 = No, 1 = Yes) | ✅ Yes (Target Variable) | This is the outcome variable. |

---

# **Steps to Perform Chi-Squared Test on Your Dataset**

## **1. Data Preparation**
- Convert numerical variables (**age, avg_glucose_level, bmi**) into categories if needed.
- Handle missing values (**bmi has NaN values**).
- Select only categorical variables for the test.
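
The preparation steps above can be sketched as follows. This is an illustration on a tiny made-up frame, not the notebook's own preprocessing: the bin edges, the group labels, and the choice of median imputation for `bmi` are all assumptions that should be adapted to the analysis at hand.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with the same column names as the stroke dataset.
toy = pd.DataFrame({
    "age": [5.0, 25.0, 47.0, 67.0, 81.0],
    "bmi": [18.0, np.nan, 27.5, 36.6, 24.0],
})

# Handle missing values: here bmi is filled with the median
# (an assumption; dropping the rows is another option).
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].median())

# Bin continuous variables into categories. The edges below are
# hypothetical; pd.cut uses right-closed intervals by default.
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 18, 40, 60, 120],
    labels=["child", "young_adult", "middle_aged", "senior"],
)
toy["bmi_group"] = pd.cut(
    toy["bmi"],
    bins=[0, 18.5, 25, 30, 100],
    labels=["underweight", "normal", "overweight", "obese"],
)

print(toy)
```

After binning, `age_group` and `bmi_group` are categorical columns that can be cross-tabulated against `stroke` like any other categorical feature.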

## **2. Perform Chi-Squared Test**
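- Use Python (`scipy.stats.chi2_contingency`) to test the association between a single categorical variable and `stroke`.
- To score all features at once for feature selection, the cells below use scikit-learn's `SelectKBest` with `chi2` instead.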
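
As a minimal sketch of the `scipy.stats.chi2_contingency` route, the example below builds a contingency table with `pd.crosstab` and tests one feature against the target. The data are synthetic (100 made-up rows), and the 0.05 significance level is a conventional choice rather than something fixed by the test.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic data for illustration: 60 people without hypertension
# (10 with stroke) and 40 people with hypertension (20 with stroke).
toy = pd.DataFrame({
    "hypertension": [0] * 60 + [1] * 40,
    "stroke": [0] * 50 + [1] * 10 + [0] * 20 + [1] * 20,
})

# Contingency table of observed counts (rows: hypertension, columns: stroke)
table = pd.crosstab(toy["hypertension"], toy["stroke"])

# For 2x2 tables SciPy applies Yates' continuity correction by default.
chi2_stat, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05  # conventional significance level (an assumption)
reject_h0 = p_value < alpha

print(table)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.4f}")
print("Reject H0 (variables look dependent)" if reject_h0
      else "Fail to reject H0 (no evidence of association)")
```

On the real dataset the same pattern would be repeated for each categorical column, with `toy` replaced by the prepared DataFrame.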


In [None]:
# Download the dataset.
import kagglehub

# Download latest version
path = kagglehub.dataset_download("fedesoriano/stroke-prediction-dataset")

print("Path to dataset files:", path)

### Load the necessary libraries

In [31]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

In [42]:
# Loading the dataset
df = pd.read_csv('healthcare_dataset.csv')
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [43]:
# Drop the identifier column and remove rows with missing values (bmi contains NaN)
df = df.drop(columns=['id'])
df = df.dropna()

# Select the categorical (non-numeric) columns.
df_categorical = df.select_dtypes(exclude=['number'])
print(df_categorical.head())

   gender ever_married      work_type Residence_type   smoking_status
0    Male          Yes        Private          Urban  formerly smoked
2    Male          Yes        Private          Rural     never smoked
3  Female          Yes        Private          Urban           smokes
4  Female          Yes  Self-employed          Rural     never smoked
5    Male          Yes        Private          Urban  formerly smoked


In [46]:
# Convert categorical columns to integer codes using label encoding.
# The codes follow the sorted order of the labels and carry no real ranking.
encoder = LabelEncoder()
for col in df_categorical.columns:
    df[col] = encoder.fit_transform(df[col])

df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,1,67.0,0,1,1,2,1,228.69,36.6,1,1
2,1,80.0,0,1,1,2,0,105.92,32.5,2,1
3,0,49.0,0,0,1,2,1,171.23,34.4,3,1
4,0,79.0,1,0,1,3,0,174.12,24.0,2,1
5,1,81.0,0,0,1,2,1,186.21,29.0,1,1


In [50]:
# Defining features and target variable
x = df.drop(columns=['stroke'])
y = df['stroke']
print("Features", x)
print("Target", y)

Features       gender   age  hypertension  heart_disease  ever_married  work_type  \
0          1  67.0             0              1             1          2   
2          1  80.0             0              1             1          2   
3          0  49.0             0              0             1          2   
4          0  79.0             1              0             1          3   
5          1  81.0             0              0             1          2   
...      ...   ...           ...            ...           ...        ...   
5104       0  13.0             0              0             0          4   
5106       0  81.0             0              0             1          3   
5107       0  35.0             0              0             1          3   
5108       1  51.0             0              0             1          2   
5109       0  44.0             0              0             1          0   

      Residence_type  avg_glucose_level   bmi  smoking_status  
0             

In [57]:
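# Score every feature against the target with the Chi-Squared statistic and keep the top 5.
# Note: age, avg_glucose_level and bmi are passed here as raw, unbinned values.
# sklearn's chi2 only requires non-negative inputs, but its scores then depend on
# each feature's scale, so binning these columns first gives a fairer comparison.
chi2_selector = SelectKBest(chi2, k=5)
chi2_selector.fit(x, y)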

top_features = x.columns[chi2_selector.get_support()]
print("The top 5 features are:", list(top_features))

The top 5 features are: ['age', 'hypertension', 'heart_disease', 'ever_married', 'avg_glucose_level']
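
To rank features rather than only list the selected ones, the fitted selector exposes `scores_` and `pvalues_`. The sketch below shows this on a small made-up frame so that it runs on its own. The column names mirror the stroke dataset, but the values are invented and `k=1` is chosen only because the toy frame has two features.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up binary features and target, for illustration only.
X_toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1, 1, 1],
    "ever_married": [0, 1, 0, 1, 0, 1, 0, 1],
})
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 0], name="stroke")

# Fit the selector; chi2 expects non-negative feature values.
selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

# Collect the score and p-value per feature and sort by score, highest first.
ranking = pd.DataFrame({
    "feature": X_toy.columns,
    "chi2_score": selector.scores_,
    "p_value": selector.pvalues_,
}).sort_values("chi2_score", ascending=False)

print(ranking)
print("Selected:", list(X_toy.columns[selector.get_support()]))
```

With the notebook's own objects, the same table could be built from `x.columns`, `chi2_selector.scores_`, and `chi2_selector.pvalues_`. This makes it easier to see how far apart the scores are before committing to a cutoff of five.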
