
In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [None]:
national = pd.DataFrame(["white"]*100000 + ["hispanic"]*60000 +\
                        ["black"]*50000 + ["asian"]*15000 + ["other"]*35000)

minnesota = pd.DataFrame(["white"]*600 + ["hispanic"]*300 +\
                        ["black"]*250 + ["asian"]*75 + ["other"]*150)


In [None]:
national_table = pd.crosstab(index = national[0], columns = "count")
minnesota_table = pd.crosstab(index = minnesota[0], columns = "count")

In [None]:
print("National")
print(national_table)
print(" ")
print( "Minnesota" )
print(minnesota_table)

In [None]:
observed = minnesota_table
national_ratios = national_table/len(national) # Get population ratios
print(national_ratios)
expected = national_ratios * len(minnesota) # Get expected counts
chi_squared_stat = (((observed-expected)**2)/expected).sum()
print(chi_squared_stat)

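The len() function returns the number of items in an object.

For a pandas DataFrame such as national or minnesota, that is the number of rows, so len(national) is the population size and len(minnesota) is the sample size. (For a string, len() returns the number of characters.)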
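
For reference, the statistic computed in the cell above can be written as the formula below. The symbols are introduced here for illustration and do not appear in the notebook: $O_i$ is the observed count in category $i$ (observed), $E_i$ is the expected count (expected), $p_i$ is the national proportion (national_ratios), $n$ is the sample size (len(minnesota)), and $k = 5$ is the number of categories.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n \, p_i
$$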

In [None]:
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence
                      df = 4)   # Df = number of variable categories - 1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x = chi_squared_stat, # Find the p value
                             df = 4)

print("P value")
print(p_value)

Interpretation:
The critical chi-squared value is about 9.49, and the statistic we computed is about 18, which is well above it, so the result is significant at the 5% level. The p-value is about 0.001, so the result is also significant at the 1% level. We can therefore be fairly confident that the distribution of counts across the categories differs between the Minnesota sample and the national population. Because the chi-squared statistic exceeds the critical value and the p-value is low, we reject the null hypothesis that the two distributions are the same, in favor of the alternative hypothesis that they differ.

In [None]:
stats.chisquare(f_obs = observed, # Array of observed counts 
                f_exp = expected) # Array of expected counts 

The test results agree with the values we calculated above. 
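
The agreement can also be checked programmatically. The following is a minimal sketch that assumes the counts from the cells above are entered by hand as NumPy arrays; the names national_counts, observed_counts, expected_counts, manual_stat, and alpha are illustrative and are not part of the notebook.

```python
import numpy as np
from scipy import stats

# Counts copied from the cells above (order: asian, black, hispanic, other, white)
national_counts = np.array([15000, 50000, 60000, 35000, 100000])
observed_counts = np.array([75, 250, 300, 150, 600])

# Scale the national proportions to the Minnesota sample size
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

# Manual statistic: sum of (observed - expected)^2 / expected
manual_stat = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()

# scipy returns the statistic and the p-value together
statistic, p_value = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)

alpha = 0.05  # significance level (assumed here to match the 95% critical value)
reject_null = p_value < alpha

print(manual_stat, statistic, p_value, reject_null)
```

Comparing the p-value with alpha is equivalent to comparing the statistic with the critical value at the same level, so either check leads to the same decision.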

**Chi-Squared Test of Independence**

The chi-squared test of independence is used to determine if there is a significant relationship between two nominal (categorical) variables. The frequency of each category for one nominal variable is compared across the categories of the second nominal variable. The data can be displayed in a contingency table where each row represents a category for one variable and each column represents a category for the other variable. For example, say a researcher wants to examine the relationship between gender (male vs. female) and empathy (high vs. low). The chi-squared test of independence can be used to examine this relationship. The null hypothesis for this test is that there is no relationship between gender and empathy. The alternative hypothesis is that there is a relationship between gender and empathy (e.g. there are more high-empathy females than high-empathy males).

https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/chi-square/
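
As a rough sketch of the gender and empathy example, the cell below builds a 2x2 contingency table and runs the test. The counts are hypothetical, invented purely for illustration, and the name empathy_table is an assumption of this sketch. Note that for a 2x2 table stats.chi2_contingency() applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts, invented for illustration only
empathy_table = pd.DataFrame(
    {"high": [30, 45], "low": [70, 55]},
    index=["male", "female"],
)

# Returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under the null hypothesis of independence
chi2, p_value, dof, expected = stats.chi2_contingency(empathy_table)

print(empathy_table)
print("Degrees of freedom:", dof)
print("Expected counts under independence:")
print(expected)
print("P value:", p_value)
```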

In [None]:
np.random.seed(10)
# Sample data randomly at fixed probabilities
voter_race = np.random.choice(a = ["asian", "black", "hispanic", "other", "white"],
                              p = [0.05, 0.15, 0.25, 0.05, 0.5],
                              size = 1000)

In [None]:
# Sample data randomly at fixed probabilities
voter_party = np.random.choice(a = ["democrat", "independent", "republican"],
                               p = [0.4, 0.2, 0.4],
                               size = 1000)

In [None]:
voters = pd.DataFrame({"race" : voter_race,
                       "party" : voter_party })

voter_tab = pd.crosstab(voters.race, voters.party, margins = True)

voter_tab.columns = ["democrat", "independent", "republican", "row_totals"]

voter_tab.index = ["asian", "black", "hispanic", "other", "white", "col_totals"]

observed = voter_tab.iloc[0:5,0:3] # Get table without totals for later use

voter_tab

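The np.outer() function computes the outer product of two vectors: entry (i, j) of the result is the i-th element of the first vector times the j-th element of the second. Here it multiplies each row total by each column total; dividing by the sample size (1000) gives the expected count for each cell under independence. You can find detailed information about np.outer() at this link.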

https://www.pythonpool.com/numpy-outer/
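
To see what np.outer() does on a small case, here is a minimal sketch with hypothetical row and column totals for a 2x2 table. The numbers and the names row_totals, col_totals, and grand_total are made up for illustration.

```python
import numpy as np

row_totals = np.array([30, 70])   # hypothetical row margins
col_totals = np.array([40, 60])   # hypothetical column margins
grand_total = row_totals.sum()    # both margins sum to the same total

# Entry (i, j) is row_totals[i] * col_totals[j] / grand_total,
# the expected count for that cell if the two variables are independent
expected = np.outer(row_totals, col_totals) / grand_total

print(expected)
```

Each row of the result sums to the corresponding row total and each column to the corresponding column total, which is what independence implies for the expected counts.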

In [None]:
# Expected count for each cell = row total * column total / sample size
expected = np.outer(voter_tab["row_totals"].iloc[0:5],
                    voter_tab.loc["col_totals"].iloc[0:3]) / 1000

expected = pd.DataFrame(expected)

expected.columns = ["democrat", "independent", "republican"]
expected.index = ["asian", "black", "hispanic", "other", "white"]

expected

In [None]:
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print(chi_squared_stat)

In [None]:
crit = stats.chi2.ppf(q= 0.95, # Find the critical value for 95% confidence
                      df = 8)  # Df = (rows - 1) * (columns - 1)

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, # Find the p value 
                             df=8)

print("P value")
print(p_value)

Note: The degrees of freedom for a test of independence equal (number of row categories - 1) multiplied by (number of column categories - 1). In this case we have a *5x3* table so df *= 4x2 = 8*.
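
The degrees-of-freedom rule can be written out directly. This is a small sketch; n_rows, n_cols, and dof are illustrative names that do not appear in the notebook.

```python
from scipy import stats

n_rows, n_cols = 5, 3                 # 5 race categories, 3 party categories
dof = (n_rows - 1) * (n_cols - 1)     # degrees of freedom for the test

# Critical value at the 5% significance level for that many degrees of freedom
crit = stats.chi2.ppf(q=0.95, df=dof)

print(dof)
print(crit)
```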

As with the goodness-of-fit test, we can use scipy to conduct a test of independence quickly. Given a frequency table of observed counts, the stats.chi2_contingency() function returns the chi-squared statistic, the p-value, the degrees of freedom, and the table of expected counts:

In [None]:
stats.chi2_contingency(observed = observed)
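
The returned values can be unpacked and compared with the manual calculation. The sketch below is self-contained, so it regenerates the simulated voter data with the same seed and settings as above; the names observed_counts, manual_expected, and manual_stat are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Recreate the simulated voter data from the cells above
np.random.seed(10)
voter_race = np.random.choice(a=["asian", "black", "hispanic", "other", "white"],
                              p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000)
voter_party = np.random.choice(a=["democrat", "independent", "republican"],
                               p=[0.4, 0.2, 0.4], size=1000)
observed_counts = pd.crosstab(voter_race, voter_party)  # table without totals

# Unpack the four returned values
stat, p_value, dof, expected_counts = stats.chi2_contingency(observed_counts)

# Manual expected counts: row total * column total / grand total
manual_expected = np.outer(observed_counts.sum(axis=1),
                           observed_counts.sum(axis=0)) / observed_counts.values.sum()
manual_stat = ((observed_counts.values - manual_expected) ** 2 / manual_expected).sum()

print(stat, manual_stat)
print(p_value, dof)
```

The continuity correction only applies when there is one degree of freedom, so for this 5x3 table the scipy statistic and the manual one should match.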