# Week 6 - CHAID

CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits.

https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=modeling-chaid-node

A chi-squared test (also chi-square or χ² test) is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.

https://en.wikipedia.org/wiki/Chi-squared_test 

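* Finds the best split at each node by testing every candidate feature against the target
* Segments the data into groups and, in doing so, shows which features matter most
* Picks the feature whose categories differ most in their outcome distribution, i.e. the one with the most significant chi-square result
* Uses the chi-square test to check whether the observed outcome counts differ from the counts expected if the feature and the target were independent
* Best suited to categorical features; continuous features have to be binned first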
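
The split search described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CHAID library's implementation: the data, the feature names `feature_a` and `feature_b`, and the helper `chi_square` are hypothetical, every feature is binary (so each test has one degree of freedom), and the category merging and Bonferroni adjustment that full CHAID performs are left out.

```python
from collections import Counter
from math import erfc, sqrt


def chi_square(xs, ys):
    """Pearson chi-squared statistic for two equal-length categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))   # joint counts, one per table cell
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n   # count expected under independence
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: two binary features and a binary target.
features = {
    "feature_a": [0, 0, 0, 0, 1, 1, 1, 1],
    "feature_b": [0, 1, 0, 1, 0, 1, 0, 1],
}
target = [0, 0, 0, 1, 1, 1, 1, 0]

results = {}
for name, values in features.items():
    stat = chi_square(values, target)
    # With one degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(stat / 2)).
    results[name] = (stat, erfc(sqrt(stat / 2)))

# The candidate with the smallest p-value becomes the split.
best = min(results, key=lambda name: results[name][1])
print(best, results[best])
```

In this toy data `feature_b` is perfectly balanced against the target, so its statistic is zero and `feature_a` is selected.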

In [1]:
# !pip install chaid

In [2]:
# titanic_chaid was processed in another project 
import pandas as pd

titanic = pd.read_csv('titanic_chaid.csv')
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 668 entries, 0 to 667
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  668 non-null    int64  
 1   sex         668 non-null    int64  
 2   age         668 non-null    float64
 3   sibsp       668 non-null    int64  
 4   parch       668 non-null    int64  
 5   fare        668 non-null    float64
 6   adult_male  668 non-null    int64  
 7   alone       668 non-null    int64  
 8   pclass_2    668 non-null    float64
 9   pclass_3    668 non-null    float64
 10  embarked_Q  668 non-null    float64
 11  embarked_S  668 non-null    float64
 12  who_man     668 non-null    float64
 13  who_woman   668 non-null    float64
 14  survived    668 non-null    int64  
dtypes: float64(8), int64(7)
memory usage: 78.4 KB


In [3]:
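# convert continuous features to ordinal categories (quartile bins)
# caution: pd.factorize numbers values by order of first appearance, so the
# resulting codes do not necessarily follow the quartile order;
# pd.qcut(..., labels=False) would return the ordered bin index directly
titanic.age = pd.qcut(titanic.age, q=4, labels = [0,1,2,3])
titanic.age = pd.factorize(titanic.age)[0]
titanic.fare = pd.qcut(titanic.fare, q=4, labels = [0,1,2,3])
titanic.fare = pd.factorize(titanic.fare)[0]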
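
To see why the order of the codes matters for ordinal features, the sketch below bins a small hypothetical `ages` series (made-up values, not the Titanic data) in two ways: with `pd.factorize` applied to the binned result, and with `labels=False`, which makes `pd.qcut` return the bin index directly.

```python
import pandas as pd

# Hypothetical ages, deliberately not sorted.
ages = pd.Series([40.0, 5.0, 22.0, 63.0, 30.0, 18.0, 51.0, 9.0])

# Quartile bins followed by factorize: codes follow order of first appearance.
binned = pd.qcut(ages, q=4, labels=[0, 1, 2, 3])
by_appearance = pd.factorize(binned)[0]

# Quartile bins with labels=False: codes are the bin index, lowest quartile = 0.
by_quartile = pd.qcut(ages, q=4, labels=False)

print(list(by_appearance))
print(list(by_quartile))
```

In this toy series the first value, 40.0, falls in the third quartile but receives code 0 from `pd.factorize`, while `labels=False` assigns it 2.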

In [4]:
# https://github.com/Rambatino/CHAID
from CHAID import Tree

ordinal_features = ['pclass_2', 'pclass_3', 'age', 'fare']
nominal_features = ['sex', 'sibsp', 'parch', 'adult_male', 'alone', 'embarked_Q', 'embarked_S', 'who_man', 'who_woman']

X_names = ordinal_features + nominal_features
y_name = titanic.survived.name

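# map every feature name to its variable type
feature_types = dict(zip(X_names,
                         ['ordinal']*len(ordinal_features) + ['nominal']*len(nominal_features)))

tree = Tree.from_pandas_df(titanic, feature_types, y_name)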

tree.print_tree()

([], {0: 415.0, 1: 253.0}, (adult_male, p=3.3261520722634765e-47, score=208.23960281976576, groups=[[0.0], [1.0]]), dof=1))
|-- ([0.0], {0: 75.0, 1: 188.0}, (pclass_3, p=1.6518991428375545e-19, score=81.61696652433046, groups=[[0.0], [1.0]]), dof=1))
|   |-- ([0.0], {0: 4.0, 1: 126.0}, <Invalid Chaid Split> - the max depth has been reached)
|   +-- ([1.0], {0: 71.0, 1: 62.0}, <Invalid Chaid Split> - the max depth has been reached)
+-- ([1.0], {0: 340.0, 1: 65.0}, (fare, p=0.012013100711646758, score=8.843514997445991, groups=[[0.0, 1.0], [2.0], [3.0]]), dof=2))
    |-- ([0.0, 1.0], {0: 196.0, 1: 35.0}, <Invalid Chaid Split> - the max depth has been reached)
    |-- ([2.0], {0: 51.0, 1: 19.0}, <Invalid Chaid Split> - the max depth has been reached)
    +-- ([3.0], {0: 93.0, 1: 11.0}, <Invalid Chaid Split> - the max depth has been reached)
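
Each `score` in the printed tree is a Pearson chi-squared statistic for the table formed by a node's children. As a check, the sketch below recomputes the three scores from the child counts shown in the output above, using only the standard library. The helper name `pearson_chi2` is an illustrative choice, not part of the CHAID package.

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Child counts copied from the tree output: [survived = 0, survived = 1] per group.
splits = {
    "adult_male": [[75, 188], [340, 65]],
    "pclass_3": [[4, 126], [71, 62]],
    "fare": [[196, 35], [51, 19], [93, 11]],
}

scores = {name: pearson_chi2(table) for name, table in splits.items()}
for name, score in scores.items():
    # degrees of freedom = (rows - 1) * (columns - 1)
    dof = (len(splits[name]) - 1) * (len(splits[name][0]) - 1)
    print(f"{name}: score={score:.4f}, dof={dof}")
```

The results agree with the scores of roughly 208.24, 81.62 and 8.84 printed in the tree, with 1, 1 and 2 degrees of freedom.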



In [5]:
titanic.groupby(['adult_male', 'survived'])['survived'].count()

adult_male  survived
0           0            75
            1           188
1           0           340
            1            65
Name: survived, dtype: int64

In [6]:
# contingency table of adult_male vs. survived, with row and column totals
ct = pd.crosstab(titanic.adult_male, titanic.survived, margins=True)
ct

survived      0    1  All
adult_male               
0            75  188  263
1           340   65  405
All         415  253  668


https://towardsdatascience.com/chi-square-test-for-independence-in-python-with-examples-from-the-ibm-hr-analytics-dataset-97b9ec9bb80a

$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$
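
Each expected count in this formula comes from the table margins: under independence, a cell's expected count is its row total times its column total, divided by the grand total $N$. Worked through with the counts from the `adult_male` by `survived` crosstab above (values rounded to two decimals):

$$\text{expected}_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{N}, \qquad \text{expected}_{00} = \frac{263 \times 415}{668} \approx 163.39$$

$$\chi^2 \approx \frac{(75 - 163.39)^2}{163.39} + \frac{(188 - 99.61)^2}{99.61} + \frac{(340 - 251.61)^2}{251.61} + \frac{(65 - 153.39)^2}{153.39} \approx 47.82 + 78.44 + 31.05 + 50.93 \approx 208.24$$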

In [7]:
# observed counts: the 2x2 body of the table (without the 'All' margins), flattened row by row
obs = ct.iloc[0:2, 0:2].values.ravel()
obs

array([ 75, 188, 340,  65], dtype=int64)

In [8]:
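# expected count for each cell under independence:
# row total * column total / grand total
rsum = ct.iloc[0:2,2].values            # row totals (adult_male = 0, 1)
exp = []
for j in rsum:
    for val in ct.iloc[2,0:2].values:   # column totals (survived = 0, 1)
        exp.append(val * j / rsum.sum())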

print(exp)

[163.39071856287424, 99.60928143712574, 251.60928143712576, 153.39071856287424]


In [9]:
test_statistic = ((obs - exp)**2/exp).sum()
test_statistic

208.23960281976576

In [10]:
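from scipy import stats

# the statistic follows a chi-squared (chi2) distribution, not a chi distribution;
# a 2x2 table has (2-1)*(2-1) = 1 degree of freedom
pvalue = stats.chi2.sf(test_statistic, 1)
pvalue

3.3261520722634765e-47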
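
As a cross-check on the p-values reported in the tree, the chi-squared survival function has simple closed forms for one and two degrees of freedom, so it can be evaluated with the standard library alone. The sketch below plugs in the scores from the tree output above. The helper name `chi2_sf` is hypothetical, and for other degrees of freedom a general routine such as `scipy.stats.chi2.sf` would be needed.

```python
from math import erfc, exp, sqrt


def chi2_sf(stat, dof):
    """Upper-tail probability of a chi-squared variable, for dof 1 or 2 only."""
    if dof == 1:
        return erfc(sqrt(stat / 2))
    if dof == 2:
        return exp(-stat / 2)
    raise ValueError("closed form implemented only for dof 1 and 2")


# (score, dof) pairs copied from the tree output.
p_adult_male = chi2_sf(208.23960281976576, 1)
p_pclass_3 = chi2_sf(81.61696652433046, 1)
p_fare = chi2_sf(8.843514997445991, 2)

print(p_adult_male, p_pclass_3, p_fare)
```

These come out at about 3.3e-47, 1.7e-19 and 0.012, in line with the `p` values shown in the tree.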