### Find the relationship between two categorical variables

In [1]:
import pandas as pd
data=pd.read_csv('data/Dataset_Of_2CategoricalVariables.csv')
data

| | Gender | Like Shopping? |
|---|---|---|
| 0 | Male | No |
| 1 | Female | Yes |
| 2 | Male | Yes |
| 3 | Female | Yes |
| 4 | Female | Yes |
| 5 | Male | Yes |
| 6 | Male | No |
| 7 | Female | No |
| 8 | Female | No |


In [2]:
data.columns

Index(['Gender', 'Like Shopping?'], dtype='object')

### Contingency table
Contingency tables (also called crosstabs or two-way tables) are used in statistics to summarize the relationship between several categorical variables. A contingency table is a special type of frequency distribution table, where two variables are shown simultaneously.

In [3]:
#Contingency Table
contingency_table = pd.crosstab(data["Gender"],data["Like Shopping?"])
print('contingency_table :-\n',contingency_table)

contingency_table :-
 Like Shopping?  No  Yes
Gender                 
Female           2    3
Male             2    2


In [4]:
#Observed Values
Observed_Values = contingency_table.values 
print("Observed Values :-\n",Observed_Values)

Observed Values :-
 [[2 3]
 [2 2]]


#### What is a Chi Square Test?
There are two types of chi-square tests. Both use the chi-square statistic and distribution, but for different purposes:

1. A chi-square goodness of fit test determines whether sample data match a hypothesized population distribution.
2. A chi-square test for independence compares two variables in a contingency table to see whether they are related. More generally, it tests whether the distributions of categorical variables differ from one another.

In the test for independence, the expected counts are computed under the assumption that the two variables are unrelated. The size of the statistic is therefore read as follows:

- A very small chi-square test statistic means that the observed data fit the expected data extremely well. In other words, there is no evidence of a relationship.
- A very large chi-square test statistic means that the observed data do not fit the expected data well. In other words, the variables are probably related.
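As a rough illustration of the two test types, the sketch below runs each one with SciPy. The counts are invented for the example and are not taken from the dataset above; `scipy.stats.chisquare` and `scipy.stats.chi2_contingency` are the SciPy functions assumed here.

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Hypothetical counts, made up for illustration only.
# Goodness of fit: do 60 observations over 3 categories match an even split?
observed_counts = np.array([18, 22, 20])
expected_counts = np.array([20, 20, 20])
gof_stat, gof_p = chisquare(f_obs=observed_counts, f_exp=expected_counts)
print("goodness of fit:", gof_stat, gof_p)

# Test for independence: are the row and column variables of a 2x2 table related?
# correction=False turns off the Yates continuity correction that SciPy
# applies to 2x2 tables by default, so the plain formula is used.
table = np.array([[30, 10],
                  [15, 25]])
ind_stat, ind_p, ind_dof, ind_expected = chi2_contingency(table, correction=False)
print("independence:", ind_stat, ind_p, ind_dof)
```

The goodness of fit call compares one set of counts against a stated expectation, while the independence call derives its expected counts from the table's own row and column totals.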

#### What is a Chi-Square Statistic?
The formula for the chi-square statistic used in the chi square test is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Here $O$ is an observed count and $E$ is the corresponding expected count. The sum runs over every cell of the table.

#### Tip:
The chi-square statistic can only be computed from counts (frequencies). It can't be used on percentages, proportions, means, or similar statistics. For example, if you have 10 percent of 200 people, you need to convert that to a count (20) before you can compute the test statistic.
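The conversion described in the tip can be sketched in a couple of lines. The proportions and sample size below are hypothetical and simply mirror the "10 percent of 200 people" example.

```python
import numpy as np

# Hypothetical survey: proportions reported for two groups out of 200 people.
sample_size = 200
proportions = np.array([0.10, 0.90])

# Convert proportions back to whole-number counts before any chi-square work.
counts = np.rint(proportions * sample_size).astype(int)
print(counts)
```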

### Chi Square P-Values.
A chi-square test gives you a p-value, which tells you whether your test results are statistically significant. To perform a chi-square test and interpret the p-value, you need two pieces of information:

1. Degrees of freedom. For a goodness of fit test, that is the number of categories minus 1. For a test of independence on a contingency table, it is (number of rows - 1) × (number of columns - 1).
2. The alpha level (α). This is chosen by you, or the researcher. The usual alpha level is 0.05 (5%), but you could also use other levels such as 0.01 or 0.10.

In elementary statistics or AP statistics, both the degrees of freedom (df) and the alpha level are usually given to you in a question. If you do have to work out the df yourself, it's simple: apply the rule above for the type of test you are running.

Degrees of freedom are written as a subscript after the chi-square (χ²) symbol.

In [12]:
# One-way case: number of categories in a single variable, minus 1
degrees_of_freedom = data['Like Shopping?'].nunique() - 1
print(degrees_of_freedom)

1
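The degrees of freedom depend on the type of test: categories minus 1 for a one-way table, and (rows - 1) × (columns - 1) for a contingency table. A minimal sketch of both rules follows; the function names are hypothetical helpers, not part of any library.

```python
def dof_goodness_of_fit(n_categories):
    """Degrees of freedom for a one-way (goodness of fit) test."""
    return n_categories - 1


def dof_independence(n_rows, n_columns):
    """Degrees of freedom for a test of independence on an r x c table."""
    return (n_rows - 1) * (n_columns - 1)


print(dof_goodness_of_fit(2))   # two categories
print(dof_independence(2, 2))   # 2x2 table such as Gender by Like Shopping?
print(dof_independence(3, 4))   # a larger, hypothetical 3x4 table
```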


#### Uses
The chi-squared distribution has many uses in statistics, including:

- Confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation.
- Independence of two criteria of classification of qualitative variables.
- Relationships between categorical variables (contingency tables).
- Sample variance study when the underlying distribution is normal.
- Tests of deviations of differences between expected and observed frequencies (one-way tables).
- The chi-square test (a goodness of fit test).

In [7]:
Observed_Values

array([[2, 3],
       [2, 2]], dtype=int64)

#### expected_value = row_total*column_total/sample_size

Refer to the contingency table above for the row and column totals.
Each expected value is the count that cell would have if the two variables were independent, that is, if the null hypothesis were true.

https://www.youtube.com/watch?v=ZUGKFoHUHQI
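The formula `expected_value = row_total*column_total/sample_size` can be applied by hand with NumPy. This is a minimal sketch that assumes the observed counts from the contingency table above, re-typed here so the snippet runs on its own.

```python
import numpy as np

# Observed counts copied from the contingency table above
# (rows: Female, Male; columns: No, Yes).
observed = np.array([[2, 3],
                     [2, 2]])

row_totals = observed.sum(axis=1)      # one total per row
column_totals = observed.sum(axis=0)   # one total per column
sample_size = observed.sum()           # grand total

# expected_value = row_total * column_total / sample_size, for every cell at once
expected = np.outer(row_totals, column_totals) / sample_size
print(expected)
```

`np.outer` builds the full grid of row total × column total products, so no explicit loop over cells is needed.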

In [13]:
#Expected Values
import scipy.stats
b = scipy.stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :-\n",Expected_Values)

Expected Values :-
 [[2.22222222 2.77777778]
 [1.77777778 2.22222222]]


#### Degrees of Freedom
For a test of independence, the degrees of freedom are (number of rows - 1) × (number of columns - 1) of the contingency table.

In [14]:
#Degree of Freedom
# Read the table dimensions directly instead of hard-coding slice bounds
no_of_rows,no_of_columns=contingency_table.shape
df=(no_of_rows-1)*(no_of_columns-1)
print("Degree of Freedom:",df)

#or
#df=b[2]
#print("Degree of Freedom:-",df)

Degree of Freedom: 1


In [15]:
#Significance Level 5%
alpha=0.05

In [18]:
#chi-square statistic - χ2
from scipy.stats import chi2
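# Per-column contributions: sum (O - E)^2 / E down each column
chi_square=((Observed_Values-Expected_Values)**2/Expected_Values).sum(axis=0)
# Total statistic: add up the contributions from every column
chi_square_statistic=chi_square.sum()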
print("chi-square statistic:",chi_square_statistic)
chi_square

chi-square statistic: 0.09000000000000008


array([0.05, 0.04])
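As a cross-check, SciPy can compute the same statistic directly. This is a minimal sketch that assumes the observed counts from the contingency table above. Note that `chi2_contingency` applies the Yates continuity correction to 2x2 tables by default, so `correction=False` is needed to reproduce the hand calculation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts re-typed from the contingency table above
observed = np.array([[2, 3],
                     [2, 2]])

# correction=False disables the Yates continuity correction, which SciPy
# otherwise applies to tables with one degree of freedom.
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print("chi-square statistic:", stat)
print("degrees of freedom:", dof)
```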

#### Critical Value

In hypothesis testing, a critical value is a point on the test distribution that is compared to the test statistic to determine whether to reject the null hypothesis. If the absolute value of your test statistic is greater than the critical value, you can declare statistical significance and reject the null hypothesis.

In [19]:
#critical_value
critical_value=chi2.ppf(q=1-alpha,df=df)
print('critical_value:',critical_value)

critical_value: 3.841458820694124
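The critical value depends on the chosen alpha level. The sketch below tabulates it for the alpha levels mentioned earlier (0.10, 0.05, and 0.01), assuming one degree of freedom as in this 2x2 example; the dictionary name `critical_values` is introduced here for illustration.

```python
from scipy.stats import chi2

df = 1  # degrees of freedom for a 2x2 table

# ppf is the inverse CDF: the value with probability 1 - alpha below it
critical_values = {alpha: chi2.ppf(1 - alpha, df) for alpha in (0.10, 0.05, 0.01)}

for alpha, critical_value in critical_values.items():
    print(f"alpha={alpha:.2f}  critical value={critical_value:.4f}")
```

A smaller alpha pushes the critical value further into the tail, so the test statistic has to be larger before the null hypothesis is rejected.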


#### p-value
The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one actually observed. It is used as an alternative to rejection points: it gives the smallest level of significance at which the null hypothesis would be rejected.

In [20]:
#p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=df)
print('p-value:',p_value)

p-value: 0.7641771556220945
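SciPy also exposes the upper-tail probability directly as the survival function `chi2.sf`, which is equivalent to `1 - chi2.cdf` and tends to be more accurate when the p-value is very small. A short sketch, assuming the statistic (0.09) and degrees of freedom (1) computed above:

```python
from scipy.stats import chi2

chi_square_statistic = 0.09  # value computed above
df = 1

p_from_cdf = 1 - chi2.cdf(chi_square_statistic, df)
p_from_sf = chi2.sf(chi_square_statistic, df)  # survival function = 1 - cdf
print(p_from_cdf, p_from_sf)
```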


In [21]:
print('Significance level: ',alpha)
print('Degree of Freedom: ',df)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)

Significance level:  0.05
Degree of Freedom:  1
chi-square statistic: 0.09000000000000008
critical_value: 3.841458820694124
p-value: 0.7641771556220945


In [22]:
#Compare chi_square_statistic with critical_value, and p_value with alpha.
#p_value is the probability of getting a chi-square value of at least 0.09 (chi_square_statistic) if H0 is true.
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
    
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

Retain H0,There is no relationship between 2 categorical variables
Retain H0,There is no relationship between 2 categorical variables
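The steps above can be collected into one helper. This is a sketch under stated assumptions, not a definitive implementation: the function name `chi_square_independence` and its return keys are hypothetical, and the data frame is re-typed from the dataset shown at the top so the snippet runs without the CSV file.

```python
import pandas as pd
from scipy.stats import chi2, chi2_contingency


def chi_square_independence(frame, row_var, col_var, alpha=0.05):
    """Run a chi-square test of independence on two categorical columns."""
    table = pd.crosstab(frame[row_var], frame[col_var])
    # correction=False keeps the plain sum((O - E)^2 / E) statistic
    stat, p_value, dof, expected = chi2_contingency(table, correction=False)
    critical_value = chi2.ppf(1 - alpha, dof)
    return {
        "statistic": stat,
        "p_value": p_value,
        "dof": dof,
        "critical_value": critical_value,
        "reject_h0": bool(p_value <= alpha),
    }


# Rows re-typed from the dataset shown at the top of the notebook
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Female",
               "Male", "Male", "Female", "Female"],
    "Like Shopping?": ["No", "Yes", "Yes", "Yes", "Yes",
                       "Yes", "No", "No", "No"],
})
result = chi_square_independence(data, "Gender", "Like Shopping?")
print(result)
```

Returning a dictionary keeps the statistic, critical value, and p-value together, so either decision rule can be checked from the same result.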


Explanation: https://medium.com/@kuldeepnpatel/chi-square-test-of-independence-bafd14028250