
# Chi square goodness of fit test

**A one-sample t-test checks whether a sample mean differs from an expected (population) mean. The chi-squared goodness-of-fit test is an analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution. For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match those of the entire U.S. population, or whether the web browser preferences of your friends match those of Internet users as a whole.**

**When working with categorical data, the values themselves aren't of much use for statistical testing because categories like "male", "female," and "other" have no mathematical meaning. Tests dealing with categorical variables are based on variable counts instead of the actual value of the variables themselves.**
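
As a small illustration of working with counts rather than raw values, the sketch below tallies a list of category labels with pandas. The `responses` data is hypothetical and made up only for this example.

```python
import pandas as pd

# Hypothetical category labels, made up for illustration
responses = pd.Series(["male", "female", "other", "female", "male", "female"])

# The labels themselves have no numeric meaning, but their counts do
counts = responses.value_counts()
print(counts)
```

The resulting counts are the kind of input a chi-squared test works with.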

**Let's generate some fake demographic data for the U.S. and Minnesota and walk through the chi-squared goodness-of-fit test to check whether their distributions differ:**

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats



In [2]:
national = pd.DataFrame(['white']*100000 + ['hispanic']*60000 + ['asian']*15000 + ['black']*50000 + ['other']*35000)

In [3]:
national.head()

Unnamed: 0,0
0,white
1,white
2,white
3,white
4,white


In [4]:
minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 + \
                         ["black"]*250 +["asian"]*75 + ["other"]*150)

In [5]:
minnesota.head()

Unnamed: 0,0
0,white
1,white
2,white
3,white
4,white


In [6]:
national_table = pd.crosstab(index=national[0],columns = "count")
minnesota_table = pd.crosstab(index = minnesota[0],columns = "count")

In [7]:
national_table

col_0,count
0,Unnamed: 1_level_1
asian,15000
black,50000
hispanic,60000
other,35000
white,100000


In [8]:
minnesota_table

col_0,count
0,Unnamed: 1_level_1
asian,75
black,250
hispanic,300
other,150
white,600


Chi-squared tests are based on the so-called chi-squared statistic. You calculate the chi-squared statistic with the following formula:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
 
In the formula, observed is the actual observed count for each category and expected is the expected count based on the distribution of the population for the corresponding category. Let's calculate the chi-squared statistic for our data to illustrate:

In [9]:
observed = minnesota_table
observed

col_0,count
0,Unnamed: 1_level_1
asian,75
black,250
hispanic,300
other,150
white,600


In [10]:
# To get population ratios

national_ratios = national_table/len(national)
national_ratios

col_0,count
0,Unnamed: 1_level_1
asian,0.057692
black,0.192308
hispanic,0.230769
other,0.134615
white,0.384615


In [11]:
# Get expected count 

expected = national_ratios * len(minnesota)
expected

col_0,count
0,Unnamed: 1_level_1
asian,79.326923
black,264.423077
hispanic,317.307692
other,185.096154
white,528.846154


In [12]:
chi_squared_stat = (( (observed - expected)**2)/expected).sum()
chi_squared_stat

col_0
count    18.194805
dtype: float64

Note: The chi-squared test assumes none of the expected counts are less than 5.
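
One way to check this assumption before running the test is to look at the smallest expected count. The helper name `expected_counts_ok` and the second example list are assumptions made for this sketch; the first list is copied from the expected counts shown above.

```python
import numpy as np

def expected_counts_ok(expected, minimum=5):
    """Return True if every expected count is at least `minimum`."""
    return bool(np.all(np.asarray(expected, dtype=float) >= minimum))

# Expected counts for the Minnesota sample (copied from the output above)
expected_counts = [79.326923, 264.423077, 317.307692, 185.096154, 528.846154]

print(expected_counts_ok(expected_counts))

# A made-up set of expected counts where one category is too sparse
print(expected_counts_ok([3.2, 40.0, 56.8]))
```

When the check fails, a common remedy is to merge sparse categories before testing.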

Similar to the t-test where we compared the t-test statistic to a critical value based on the t-distribution to determine whether the result is significant, in the chi-square test we compare the chi-square test statistic to a critical value based on the chi-square distribution. The scipy library shorthand for the chi-square distribution is chi2. Let's use this knowledge to find the critical value for 95% confidence level and check the p-value of our result:

## Critical Value

In [13]:
# Find the critical value for a 95% confidence level
# Degrees of freedom = number of categories - 1 = 5 - 1 = 4

crit = stats.chi2.ppf(q = 0.95,df = 4)

print('Critical value',crit)

Critical value 9.487729036781154


## P value

In [14]:
p_val = 1 - stats.chi2.cdf(x = chi_squared_stat,df=4)

print('P value',p_val)

P value [0.00113047]
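
The same p-value can also be obtained from the survival function, `stats.chi2.sf`, which evaluates 1 - cdf directly and tends to be more accurate when the p-value is very small. This is a sketch that hard-codes the chi-squared statistic from the earlier output as a plain float.

```python
import scipy.stats as stats

chi_sq = 18.194805  # chi-squared statistic from the goodness-of-fit example
dof = 4             # 5 categories - 1

p_from_cdf = 1 - stats.chi2.cdf(x=chi_sq, df=dof)
p_from_sf = stats.chi2.sf(chi_sq, dof)

print(p_from_cdf, p_from_sf)
```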


**Since the chi-squared statistic (about 18.19) exceeds the critical value (about 9.49), and the p-value (about 0.0011) is below the 0.05 significance level, we reject the null hypothesis that the two distributions are the same.**

**We can also run the chi-squared goodness-of-fit test directly with the scipy function stats.chisquare():**

In [15]:
stats.chisquare(f_obs = observed,f_exp = expected)

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))
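
Passing one-dimensional arrays instead of one-column DataFrames makes `stats.chisquare` return plain scalars rather than length-one arrays. A minimal sketch, re-entering the counts from the tables above; the variable names are chosen for this example.

```python
import numpy as np
import scipy.stats as stats

# Category order: asian, black, hispanic, other, white
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
minnesota_counts = np.array([75, 250, 300, 150, 600])

# Scale the national proportions to the Minnesota sample size
expected_counts = national_counts / national_counts.sum() * minnesota_counts.sum()

statistic, p_value = stats.chisquare(f_obs=minnesota_counts, f_exp=expected_counts)
print(statistic, p_value)
```

Note that `stats.chisquare` expects the observed and expected counts to have the same total, which is why the national proportions are scaled by the sample size.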

# Chi square test of independence

Independence is a key concept in probability that describes a situation where knowing the value of one variable tells you nothing about the value of another. For instance, the month you were born probably doesn't tell you anything about which web browser you use, so we'd expect birth month and browser preference to be independent. On the other hand, your month of birth might be related to whether you excelled at sports in school, so month of birth and sports performance might not be independent.

The chi-squared test of independence tests whether two categorical variables are independent. The test of independence is commonly used to determine whether variables like education, political views and other preferences vary based on demographic factors like gender, race and religion. Let's generate some fake voter polling data and perform a test of independence:

In [18]:
np.random.seed(10)

# Sample data randomly at fixed probabilities 
voter_race = np.random.choice(a = ["asian","black","hispanic","other","white"],
                              p=[0.05, 0.15 ,0.25, 0.05, 0.5],size = 1000)

voter_race[0:15]

array(['white', 'asian', 'white', 'white', 'other', 'hispanic', 'black',
       'white', 'black', 'black', 'white', 'white', 'asian', 'white',
       'white'], dtype='<U8')

In [19]:
voter_party = np.random.choice(a=["democrat","independent","republican"],p=[0.4, 0.2, 0.4],size=1000)
voter_party[0:15]

array(['democrat', 'republican', 'independent', 'republican', 'democrat',
       'democrat', 'republican', 'democrat', 'independent', 'democrat',
       'republican', 'independent', 'independent', 'republican',
       'democrat'], dtype='<U11')

In [20]:
voters = pd.DataFrame({'race':voter_race,'party':voter_party})
voters.head()

Unnamed: 0,race,party
0,white,democrat
1,asian,republican
2,white,independent
3,white,republican
4,other,democrat


In [23]:
voter_tab = pd.crosstab(voters.race,voters.party,margins=True)
voter_tab

party,democrat,independent,republican,All
race,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
asian,21,7,32,60
black,65,25,64,154
hispanic,107,50,94,251
other,15,8,15,38
white,189,96,212,497
All,397,186,417,1000


In [25]:
# Drop the "All" margin row and column to keep only the observed counts

observed = voter_tab.iloc[0:5,0:3]
observed

party,democrat,independent,republican
race,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
asian,21,7,32
black,65,25,64
hispanic,107,50,94
other,15,8,15
white,189,96,212


For a test of independence, we use the same chi-squared formula that we used for the goodness-of-fit test. The main difference is we have to calculate the expected counts of each cell in a 2-dimensional table instead of a 1-dimensional table. To get the expected count for a cell, multiply the row total for that cell by the column total for that cell and then divide by the total number of observations. We can quickly get the expected counts for all cells in the table by taking the row totals and column totals of the table, performing an outer product on them with the np.outer() function and dividing by the number of observations:

In [26]:
expected = np.outer(voter_tab['All'][0:5],voter_tab.loc['All'][0:3])/1000
expected

array([[ 23.82 ,  11.16 ,  25.02 ],
       [ 61.138,  28.644,  64.218],
       [ 99.647,  46.686, 104.667],
       [ 15.086,   7.068,  15.846],
       [197.309,  92.442, 207.249]])

In [27]:
expected = pd.DataFrame(expected)

expected.columns = ["democrat","independent","republican"]
expected.index = ["asian","black","hispanic","other","white"]

expected

Unnamed: 0,democrat,independent,republican
asian,23.82,11.16,25.02
black,61.138,28.644,64.218
hispanic,99.647,46.686,104.667
other,15.086,7.068,15.846
white,197.309,92.442,207.249
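
As a cross-check, the expected counts can also be computed with `scipy.stats.contingency.expected_freq`, which applies the same row-total times column-total rule. The sketch below re-enters the observed counts from the table above as a plain array; the variable names are chosen for this example.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts (rows: asian, black, hispanic, other, white;
# columns: democrat, independent, republican)
observed_counts = np.array([
    [21, 7, 32],
    [65, 25, 64],
    [107, 50, 94],
    [15, 8, 15],
    [189, 96, 212],
])

row_totals = observed_counts.sum(axis=1)
col_totals = observed_counts.sum(axis=0)

# Expected count per cell = row total * column total / grand total
manual = np.outer(row_totals, col_totals) / observed_counts.sum()
from_scipy = expected_freq(observed_counts)

print(np.allclose(manual, from_scipy))
```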


In [30]:
chi_squared_stat = (((observed - expected)**2)/expected).sum()
chi_squared_stat

party
democrat       1.470788
independent    2.509342
republican     3.189191
dtype: float64

In [32]:
chi_squared_stat = chi_squared_stat.sum()
chi_squared_stat

7.169321280162059

In [34]:
# Critical value for 95% confidence & df = (5-1)*(3-1) = 8 
crit = stats.chi2.ppf(q=0.95,df=8)
print('Critical value',crit)

Critical value 15.50731305586545


In [36]:
p_val = 1 - stats.chi2.cdf(x = chi_squared_stat,df=8)
print('P value',p_val)

P value 0.518479392948842


**Because the p-value is higher than the 0.05 significance level, we fail to reject the null hypothesis: the data give no significant evidence of a relationship between the two variables.**

The stats.chi2_contingency() function can also conduct a test of independence automatically, given a frequency table of observed counts:

In [37]:
stats.chi2_contingency(observed)

(7.169321280162059,
 0.518479392948842,
 8,
 array([[ 23.82 ,  11.16 ,  25.02 ],
        [ 61.138,  28.644,  64.218],
        [ 99.647,  46.686, 104.667],
        [ 15.086,   7.068,  15.846],
        [197.309,  92.442, 207.249]]))
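
The result can be unpacked into four named values, which is easier to read than indexing into the tuple. A minimal sketch with the observed counts re-entered as a plain array; the variable names are chosen for this example.

```python
import numpy as np
import scipy.stats as stats

# Observed counts (rows: asian, black, hispanic, other, white;
# columns: democrat, independent, republican)
observed_counts = np.array([
    [21, 7, 32],
    [65, 25, 64],
    [107, 50, 94],
    [15, 8, 15],
    [189, 96, 212],
])

# Returns the statistic, the p-value, the degrees of freedom,
# and the table of expected counts
chi2_stat, p_value, dof, expected_counts = stats.chi2_contingency(observed_counts)

alpha = 0.05  # significance level used throughout this notebook
print(chi2_stat, p_value, dof)
print("Reject null hypothesis:", p_value < alpha)
```

The function's `correction` argument (Yates' continuity correction) only takes effect when there is one degree of freedom, so it does not change the result for this 5x3 table.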

**Chi-squared tests provide a way to investigate differences in the distributions of categorical variables with the same categories and the dependence between categorical variables.**