# Chi-Squared Test and Pearson Correlation
## Group 3 - By: Ali

---

## Course: PROG8245 Data Integration Workshop
## Topic: Chi-Squared Contingency Test and Pearson Correlation
## Date: February 2026

---

## Introduction

In this notebook, we will explore two important statistical concepts used in data science and feature engineering:

1. **Chi-Squared Test (Contingency Test)** - Used to determine if there is a significant association between two categorical variables
2. **Pearson Correlation** - Used to measure the linear relationship between numerical variables and identify/remove redundant features

These techniques are essential for:
- Understanding relationships between variables
- Feature selection and dimensionality reduction
- Removing redundant attributes from datasets

---

## 1. Chi-Squared Contingency Test

### What is the Chi-Squared Test?

The Chi-Squared test (χ² test) is a statistical hypothesis test whose test statistic follows, at least approximately, a chi-squared distribution when the null hypothesis is true. It is used to:

- Determine if there is a significant association between two categorical variables
- Test whether the observed frequencies differ from expected frequencies
- Analyze contingency tables (cross-tabulations)
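
As a reference for the comparison of observed and expected frequencies listed above, the test statistic for a contingency table with $r$ rows and $c$ columns is usually written as follows. The symbols are notation introduced here for illustration: $O_{ij}$ is the observed count in cell $(i, j)$, $E_{ij}$ is the count expected under independence, and $N$ is the total number of observations.

$$
E_{ij} = \frac{(\text{row}_i \text{ total}) \times (\text{column}_j \text{ total})}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{dof} = (r - 1)(c - 1)
$$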

### Key Concepts:

- **Null Hypothesis (H₀)**: The two variables are independent (no association)
- **Alternative Hypothesis (H₁)**: The two variables are dependent (there is an association)
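- **p-value**: The probability, assuming H₀ is true, of observing a test statistic at least as extreme as the one computed. If p-value < α, we reject the null hypothesis (there is evidence of an association)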
- **Significance Level (α)**: Usually set to 0.05
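
To make these concepts concrete, here is a minimal sketch that computes the statistic, degrees of freedom, and p-value by hand and cross-checks them against SciPy. The counts in `observed` are hypothetical and chosen only for illustration; they are not taken from any dataset in this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of observed counts (rows: group, columns: outcome).
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts under H0 (independence): row total * column total / N.
expected = row_totals * col_totals / n

# Chi-squared statistic: sum over all cells of (O - E)^2 / E.
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper-tail probability of the chi-squared distribution.
p_manual = stats.chi2.sf(stat, dof)

# Cross-check against SciPy. correction=False disables Yates' correction,
# which would only apply to a table with dof == 1 anyway.
res_stat, res_p, res_dof, res_expected = stats.chi2_contingency(observed, correction=False)

print(f"manual: chi2={stat:.4f}, dof={dof}, p={p_manual:.4e}")
print(f"scipy : chi2={res_stat:.4f}, dof={res_dof}, p={res_p:.4e}")
```

Comparing `p_manual` with α = 0.05 reproduces the decision rule listed above.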

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Example Dataset: Titanic

Let's use the Titanic dataset to demonstrate the Chi-Squared test. We want to determine if there is a significant association between:
- **Gender** and **Survival** (Did gender affect survival rates?)
- **Passenger Class** and **Survival** (Did class affect survival rates?)

In [None]:
df_titanic = pd.read_csv('reference/DataIntegrationWorkshop/data/titanic.csv')
df_titanic.head()

### Chi-Squared Test: Gender vs Survival

In [None]:
contingency_table_gender = pd.crosstab(df_titanic['Sex'], df_titanic['Survived'])
print("Contingency Table: Gender vs Survival")
print(contingency_table_gender)

In [None]:
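# chi2_contingency returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under independence.
# Note: for a 2x2 table (dof = 1) SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(contingency_table_gender)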

print("=" * 50)
print("Chi-Squared Test Results: Gender vs Survival")
print("=" * 50)
print(f"Chi-Squared Statistic: {chi2:.4f}")
print(f"P-Value: {p_value:.4e}")
print(f"Degrees of Freedom: {dof}")
print("\nExpected Frequencies:")
print(pd.DataFrame(expected, 
                   index=contingency_table_gender.index, 
                   columns=contingency_table_gender.columns).round(2))
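
If `Sex` and `Survived` each take two values, the table above is 2x2, and `chi2_contingency` then applies Yates' continuity correction through its default `correction=True`. The sketch below shows the effect of that parameter. The counts in `table_2x2` are hypothetical, not the Titanic counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts, for illustration only (not the Titanic data).
table_2x2 = np.array([[12, 8],
                      [5, 15]])

# correction=True is the default and only has an effect when dof == 1.
stat_yates, p_yates, dof_yates, _ = chi2_contingency(table_2x2)
stat_plain, p_plain, dof_plain, _ = chi2_contingency(table_2x2, correction=False)

print(f"With Yates' correction:    chi2={stat_yates:.4f}, p={p_yates:.4f}")
print(f"Without Yates' correction: chi2={stat_plain:.4f}, p={p_plain:.4f}")
```

The corrected statistic is smaller and the p-value larger, so the default is the more conservative choice. With large counts the two versions usually lead to the same conclusion.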

### Interpretation: Gender vs Survival

In [None]:
alpha = 0.05
print("=" * 50)
print("INTERPRETATION")
print("=" * 50)
print(f"Significance Level (α): {alpha}")
print(f"P-Value: {p_value:.4e}")

if p_value < alpha:
    print(f"\nResult: REJECT the null hypothesis (p-value < {alpha})")
    print("Conclusion: There IS a significant association between Gender and Survival.")
    # The test is non-directional, so read the direction from the row proportions.
    print("\nProportion of each Survived value within each gender:")
    print(pd.crosstab(df_titanic['Sex'], df_titanic['Survived'], normalize='index').round(3))
else:
    print(f"\nResult: FAIL TO REJECT the null hypothesis (p-value >= {alpha})")
    print("Conclusion: No significant association between Gender and Survival was detected.")

### Chi-Squared Test: Passenger Class vs Survival

In [None]:
contingency_table_class = pd.crosstab(df_titanic['Pclass'], df_titanic['Survived'])
print("Contingency Table: Passenger Class vs Survival")
print(contingency_table_class)

In [None]:
chi2_class, p_value_class, dof_class, expected_class = chi2_contingency(contingency_table_class)

print("=" * 50)
print("Chi-Squared Test Results: Passenger Class vs Survival")
print("=" * 50)
print(f"Chi-Squared Statistic: {chi2_class:.4f}")
print(f"P-Value: {p_value_class:.4e}")
print(f"Degrees of Freedom: {dof_class}")
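
The chi-squared p-value relies on an approximation that is generally considered less reliable when expected counts are small. A common rule of thumb asks for every expected count to be at least 5. The sketch below checks that rule. The helper name `smallest_expected_count` and both tables are hypothetical and serve only as illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

def smallest_expected_count(table):
    """Return the smallest expected cell count under independence."""
    _, _, _, expected = chi2_contingency(table)
    return expected.min()

# Hypothetical counts: the second table contains one sparse row.
well_populated = np.array([[40, 60], [55, 45], [30, 70]])
sparse = np.array([[40, 60], [2, 3], [30, 70]])

for name, table in [("well_populated", well_populated), ("sparse", sparse)]:
    smallest = smallest_expected_count(table)
    verdict = "ok" if smallest >= 5 else "below 5: interpret the p-value with caution"
    print(f"{name}: smallest expected count = {smallest:.2f} ({verdict})")
```

In the notebook, the same check could be applied to `expected` and `expected_class`, which `chi2_contingency` already returns.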

### Interpretation: Passenger Class vs Survival

In [None]:
print("=" * 50)
print("INTERPRETATION")
print("=" * 50)
print(f"Significance Level (α): {alpha}")
print(f"P-Value: {p_value_class:.4e}")

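if p_value_class < alpha:
    print(f"\nResult: REJECT the null hypothesis (p-value < {alpha})")
    print("Conclusion: There IS a significant association between Passenger Class and Survival.")
    # The test is non-directional, so read the direction from the row proportions.
    print("\nProportion of each Survived value within each passenger class:")
    print(pd.crosstab(df_titanic['Pclass'], df_titanic['Survived'], normalize='index').round(3))
else:
    print(f"\nResult: FAIL TO REJECT the null hypothesis (p-value >= {alpha})")
    print("Conclusion: No significant association between Passenger Class and Survival was detected.")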

### Visualizing the Chi-Squared Results

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.heatmap(contingency_table_gender, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title(f'Gender vs Survival\nChi-Squared Test: p = {p_value:.2e}')
axes[0].set_xlabel('Survived')
axes[0].set_ylabel('Gender')

sns.heatmap(contingency_table_class, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title(f'Passenger Class vs Survival\nChi-Squared Test: p = {p_value_class:.2e}')
axes[1].set_xlabel('Survived')
axes[1].set_ylabel('Pclass')

plt.tight_layout()
plt.show()

---

## 2. Pearson Correlation (Removing Attribute Redundancies)

### What is Pearson Correlation?

The Pearson correlation coefficient (r) measures the linear relationship between two continuous variables:

- **Range**: -1 to +1
- **+1**: Perfect positive linear relationship
- **-1**: Perfect negative linear relationship
- **0**: No linear relationship
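
As a sketch of what r measures, the coefficient can be computed by hand as the covariance of the two variables divided by the product of their standard deviations, and then compared with NumPy and pandas. All values below are hypothetical. The last lines illustrate why "0" means no *linear* relationship only: a perfectly quadratic relationship on symmetric inputs has r = 0.

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# r = cov(x, y) / (std(x) * std(y)), written with centered sums.
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_numpy = np.corrcoef(x, y)[0, 1]
r_pandas = pd.Series(x).corr(pd.Series(y), method="pearson")

print(f"manual={r_manual:.6f}, numpy={r_numpy:.6f}, pandas={r_pandas:.6f}")

# A perfect but non-linear relationship can still give r = 0.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_quadratic = np.corrcoef(x_sym, x_sym ** 2)[0, 1]
print(f"r for y = x^2 on symmetric x: {r_quadratic:.6f}")
```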

### Why Remove Redundant Features?

In feature engineering, when two features are highly correlated:
- They provide similar information
- They can cause multicollinearity in models
- They increase complexity without adding value
- Removing one can simplify the model without losing much information

### Decision Rule:
- If |r| exceeds a chosen threshold (commonly somewhere between 0.7 and 0.9), consider removing one of the two correlated features

### Example: Insurance Dataset

In [None]:
df_insurance = pd.read_csv('reference/DataIntegrationWorkshop/data/insurance.csv')
print("Insurance Dataset:")
print(f"Shape: {df_insurance.shape}")
print("\nColumns:")
print(df_insurance.dtypes)
df_insurance.head()

### Encode Categorical Variables

In [None]:
from sklearn.preprocessing import LabelEncoder

df_encoded = df_insurance.copy()

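# LabelEncoder assigns integer codes in sorted order of the labels.
# For a two-category column the code acts as a 0/1 indicator, so r is interpretable.
# For a nominal column with more than two categories (likely 'region') the codes
# impose an arbitrary order, so its Pearson correlations should be read with caution.
le = LabelEncoder()
df_encoded['sex'] = le.fit_transform(df_encoded['sex'])
df_encoded['smoker'] = le.fit_transform(df_encoded['smoker'])
df_encoded['region'] = le.fit_transform(df_encoded['region'])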

print("Encoded Dataset:")
df_encoded.head()
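
An alternative for nominal columns with more than two categories is one-hot encoding, which avoids giving the categories an artificial order. The sketch below uses `pd.get_dummies` on a small hypothetical frame. The region labels and charge values are placeholders, not rows from `insurance.csv`.

```python
import pandas as pd

# Hypothetical rows; labels and amounts are placeholders for illustration.
df_demo = pd.DataFrame({
    "region": ["north", "south", "east", "north", "south", "east"],
    "charges": [1200.0, 900.0, 1500.0, 1100.0, 950.0, 1300.0],
})

# One indicator column per region; dtype=int gives 0/1 values instead of booleans.
region_dummies = pd.get_dummies(df_demo["region"], prefix="region", dtype=int)
df_onehot = pd.concat([df_demo.drop(columns="region"), region_dummies], axis=1)

print(df_onehot.columns.tolist())
print(df_onehot.corr(method="pearson").round(3))
```

Each indicator column can then be correlated with the numeric columns on its own. The indicators are negatively correlated with one another by construction, since exactly one of them is 1 in every row.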

### Calculate Pearson Correlation Matrix

In [None]:
correlation_matrix = df_encoded.corr(method='pearson')

print("Pearson Correlation Matrix:")
print(correlation_matrix.round(3))

### Visualizing Correlation Matrix

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f',
            linewidths=0.5)
plt.title('Pearson Correlation Matrix - Insurance Dataset', fontsize=14)
plt.tight_layout()
plt.show()

### Identifying Highly Correlated Features

In [None]:
threshold = 0.7

print(f"Finding highly correlated features (|r| > {threshold}):")
print("=" * 60)

highly_correlated = []

for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            col1 = correlation_matrix.columns[i]
            col2 = correlation_matrix.columns[j]
            corr_value = correlation_matrix.iloc[i, j]
            highly_correlated.append((col1, col2, corr_value))
            print(f"  {col1} ↔ {col2}: {corr_value:.3f}")

if not highly_correlated:
    print("  No highly correlated feature pairs found in this dataset.")
    print("\nNote: In this dataset, the highest correlations are:")
    corr_pairs = []
    for i in range(len(correlation_matrix.columns)):
        for j in range(i+1, len(correlation_matrix.columns)):
            corr_pairs.append((correlation_matrix.columns[i], 
                            correlation_matrix.columns[j], 
                            correlation_matrix.iloc[i, j]))
    corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
    for pair in corr_pairs[:3]:
        print(f"  {pair[0]} ↔ {pair[1]}: {pair[2]:.3f}")
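
The nested loops above can also be written without explicit iteration. The sketch below is one possible vectorized variant, using `np.triu_indices` to pick each unordered pair once. The helper name `ranked_correlation_pairs` and the data in `df_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

def ranked_correlation_pairs(df):
    """Return each unordered column pair once, sorted by |r| in descending order."""
    corr = df.corr(method="pearson")
    # Index positions of the entries strictly above the diagonal.
    rows, cols = np.triu_indices(len(corr.columns), k=1)
    names = corr.columns.to_numpy()
    pairs = pd.DataFrame({
        "feature_a": names[rows],
        "feature_b": names[cols],
        "r": corr.to_numpy()[rows, cols],
    })
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)

# Hypothetical data: 'b' closely tracks 'a', while 'c' does not.
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
})

pairs = ranked_correlation_pairs(df_demo)
print(pairs.round(3))                    # all pairs, strongest first
print(pairs[pairs["r"].abs() > 0.7])     # pairs above the threshold
```

The first rows of the result correspond to the "highest correlations" fallback, and filtering on `|r|` gives the thresholded list.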

### Example with Another Dataset: Auto Dataset

In [None]:
df_auto = pd.read_csv('reference/DataIntegrationWorkshop/data/Auto.csv')
print("Auto Dataset:")
print(f"Shape: {df_auto.shape}")
print("\nFirst few rows:")
df_auto.head()

In [None]:
numeric_cols = df_auto.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns: {numeric_cols}")

df_auto_numeric = df_auto[numeric_cols].dropna()
auto_corr = df_auto_numeric.corr(method='pearson')

plt.figure(figsize=(10, 8))
sns.heatmap(auto_corr, annot=True, cmap='coolwarm', center=0, fmt='.2f',
            linewidths=0.5)
plt.title('Pearson Correlation Matrix - Auto Dataset', fontsize=14)
plt.tight_layout()
plt.show()

### Identifying Redundant Features in Auto Dataset

In [None]:
threshold = 0.7

print(f"Identifying highly correlated features (|r| > {threshold}):")
print("=" * 60)

auto_highly_correlated = []

for i in range(len(auto_corr.columns)):
    for j in range(i+1, len(auto_corr.columns)):
        if abs(auto_corr.iloc[i, j]) > threshold:
            col1 = auto_corr.columns[i]
            col2 = auto_corr.columns[j]
            corr_value = auto_corr.iloc[i, j]
            auto_highly_correlated.append((col1, col2, corr_value))
            print(f"  {col1} ↔ {col2}: {corr_value:.3f}")

print(f"\nTotal highly correlated pairs found: {len(auto_highly_correlated)}")

### Removing Redundant Features

In [None]:
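def remove_correlated_features(df, threshold=0.7):
    """
    Identify features that are highly correlated with an earlier column.

    Returns the list of column names to drop; df itself is not modified.
    For each pair with |r| > threshold, the column that appears later in
    df is flagged, so the earlier one is kept.
    """
    corr_matrix = df.corr(method='pearson').abs()

    # Keep only the entries strictly above the diagonal so each pair is checked once.
    upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    to_drop = [column for column in upper_triangle.columns if (upper_triangle[column] > threshold).any()]

    return to_drop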

features_to_drop = remove_correlated_features(df_auto_numeric, threshold=0.7)

print(f"Features to remove (correlation > 0.7): {features_to_drop}")
print(f"\nOriginal number of features: {len(df_auto_numeric.columns)}")
print(f"Features to remove: {len(features_to_drop)}")
print(f"Remaining features: {len(df_auto_numeric.columns) - len(features_to_drop)}")

In [None]:
df_reduced = df_auto_numeric.drop(columns=features_to_drop)
print("\nReduced Dataset (after removing redundant features):")
print(f"Columns: {df_reduced.columns.tolist()}")
print(f"Shape: {df_reduced.shape}")
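
Because the upper-triangle rule flags the later column of each correlated pair, the result depends on column order. The sketch below illustrates this on hypothetical data. `columns_to_drop` is a stand-in name that mirrors the logic of the function above so that the example runs on its own.

```python
import numpy as np
import pandas as pd

def columns_to_drop(df, threshold=0.7):
    """Flag every column whose |r| with any earlier column exceeds the threshold."""
    corr = df.corr(method="pearson").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical data for illustration.
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],   # nearly a multiple of 'a'
    "c": [5, 1, 4, 2, 6, 3],     # weakly related to both
})

drop_original = columns_to_drop(df_demo)
drop_reordered = columns_to_drop(df_demo[["b", "a", "c"]])

print(drop_original)    # ['b']
print(drop_reordered)   # ['a']
```

Which column of a pair to keep is therefore a modeling decision rather than something the correlation itself determines. If the DataFrame contains the prediction target, it is usually set aside before this step so that it cannot be flagged for removal.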

---

## Summary

### Key Takeaways:

#### Chi-Squared Test:
- Used to test independence between categorical variables
- Creates contingency tables from categorical data
- Compares observed vs expected frequencies
- If p-value < 0.05: reject the null hypothesis (there is evidence that the variables are associated)
- If p-value >= 0.05: fail to reject the null hypothesis (no evidence of an association was found, which is not proof of independence)

#### Pearson Correlation:
- Measures linear relationship between numerical variables
- Range: -1 (perfect negative) to +1 (perfect positive); 0 means no linear relationship
- Used to identify and remove redundant features
- High correlation (|r| > 0.7) suggests redundancy
- Can simplify models by removing correlated features

#### Practical Applications:
- Feature selection for machine learning models
- Data preprocessing and cleaning
- Understanding variable relationships in exploratory data analysis (EDA)
- Reducing multicollinearity in regression models

---

## References:

- SciPy Documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
- Pandas Documentation: https://pandas.pydata.org/docs/
- Seaborn Documentation: https://seaborn.pydata.org/
