**Import packages**

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

In [2]:
# Load the data set; each row holds one City value and one Brand value
df2 = pd.read_csv('chiSquare.csv')

In [3]:
df2.head()

,City,Brand
0,Mumbai,A
1,Chennai,C
2,Mumbai,A
3,Mumbai,C
4,Chennai,C


**Null hypothesis:** The categorical variables City and Brand are independent.

**Alternative hypothesis:** The categorical variables City and Brand are not independent.

The calculations below determine whether to reject the null hypothesis.

**Observed frequency (contingency table)**

In [4]:
contingency_table = pd.crosstab(df2.City, df2.Brand, margins=True)

In [5]:
contingency_table

City/Brand,A,B,C,All
Chennai,165,47,191,403
Mumbai,279,73,225,577
All,444,120,416,980


In [6]:
contingency_table['A']

City
Chennai    165
Mumbai     279
All        444
Name: A, dtype: int64

In [7]:
contingency_table['A']['Chennai']

165

In [8]:
contingency_table['All']['All']

980

In [9]:
contingency_table.transpose()

Brand/City,Chennai,Mumbai,All
A,165,279,444
B,47,73,120
C,191,225,416
All,403,577,980


**Calculate the expected frequency**

In [10]:
cities = list(df2['City'].unique())
brands = list(df2['Brand'].unique())

exp1 = {}

for i in cities:
  exp2 = {}
  for j in brands:
    # expected count = (total for city i) * (total for brand j) / grand total
    exp2[j] = contingency_table.loc[i, 'All'] * contingency_table.loc['All', j] / contingency_table.loc['All', 'All']

  exp1[i] = exp2


In [11]:
exp1

{'Mumbai': {'A': 261.41632653061225,
  'C': 244.93061224489796,
  'B': 70.65306122448979},
 'Chennai': {'A': 182.58367346938775,
  'C': 171.06938775510204,
  'B': 49.3469387755102}}
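
The nested loops above can also be written as a single outer product of the row and column totals. The following is a minimal sketch, assuming the observed counts are copied by hand from the contingency table (rows: Chennai, Mumbai; columns: A, B, C); the names `observed`, `row_totals`, `col_totals`, `grand_total` and `expected` are illustrative and not part of the original notebook.

```python
import numpy as np

# Observed counts copied from the contingency table (margins excluded)
# rows: Chennai, Mumbai; columns: A, B, C
observed = np.array([[165, 47, 191],
                     [279, 73, 225]])

row_totals = observed.sum(axis=1)   # one total per city
col_totals = observed.sum(axis=0)   # one total per brand
grand_total = observed.sum()        # total number of observations

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

The result should agree with the values stored in `exp1`, only arranged as a 2 x 3 array instead of a nested dictionary.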

**Chi-square calculation**

In [12]:
chiSquareCal = 0
for i in cities:
  for j in brands:
    # contribution of one cell: (observed - expected)^2 / expected
    val = (contingency_table.loc[i, j] - exp1[i][j])**2/exp1[i][j]
    chiSquareCal = chiSquareCal + val

In [13]:
chiSquareCal #calculated chi-square

7.009543616823935
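
For reference, the loop above evaluates the usual Pearson chi-square statistic. The symbols below are notation introduced here for illustration: $O_{ij}$ is the observed count for city $i$ and brand $j$, $E_{ij}$ is the corresponding expected count, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

For the Chennai, brand A cell this gives $E = 403 \times 444 / 980 \approx 182.58$, and that cell contributes $(165 - 182.58)^2 / 182.58 \approx 1.69$ to the total.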

**Degrees of freedom**

In [14]:
dof = (len(cities)-1) * (len(brands)-1)

In [15]:
dof

2

In [16]:
stats.chi2.ppf(1-0.05, df=dof) # tabulated chi-square

5.991464547107979

Since the calculated chi-square value (7.01) exceeds the tabulated critical value (5.99) at the 5% significance level, we reject the null hypothesis.
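
The comparison can also be made explicit in code. This is a small sketch, assuming a significance level of 0.05 and reusing the statistic and degrees of freedom printed above as literal values; `alpha`, `critical_value` and `reject_null` are names chosen here for illustration.

```python
from scipy import stats

alpha = 0.05                          # significance level (assumed, as above)
dof = 2                               # degrees of freedom computed above
chi_square_stat = 7.009543616823935   # calculated chi-square from above

# Critical value: the point that the chi-square distribution exceeds
# with probability alpha
critical_value = stats.chi2.ppf(1 - alpha, df=dof)

# Reject the null hypothesis when the statistic exceeds the critical value
reject_null = chi_square_stat > critical_value
print(critical_value, reject_null)
```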

**Shortcut to the chi-squared test**

In [17]:
contab = np.array([contingency_table.transpose()['Chennai'][0:3].values,
                  contingency_table.transpose()['Mumbai'][0:3].values])

stats.chi2_contingency(contab)

(7.009543616823934,
 0.03005363054744611,
 2,
 array([[182.58367347,  49.34693878, 171.06938776],
        [261.41632653,  70.65306122, 244.93061224]]))

Since the p-value calculated above (0.03) is less than the significance level of 0.05, we reject the null hypothesis.
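
The slicing with `[0:3]` is only needed because the contingency table was built with `margins=True`. A table built without margins, for example with `pd.crosstab(df2.City, df2.Brand)`, can be passed to `stats.chi2_contingency` directly. The sketch below assumes the same observed counts, entered by hand into a hypothetical DataFrame named `observed_table` so that it runs without the CSV file.

```python
import pandas as pd
from scipy import stats

# Observed counts without the 'All' margins (hand-copied from the table above)
observed_table = pd.DataFrame(
    [[165, 47, 191],
     [279, 73, 225]],
    index=['Chennai', 'Mumbai'],
    columns=['A', 'B', 'C'],
)

# The result unpacks into statistic, p-value, degrees of freedom, expected counts
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed_table)
print(chi2_stat, p_value, dof)
```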

In [18]:
# Another way to calculate the p-value from the chi-square calculation
1 - stats.chi2.cdf(chiSquareCal, dof)

0.030053630547446142
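
SciPy also exposes the upper-tail probability directly as the survival function `stats.chi2.sf`, which avoids the subtraction from 1 and is generally more accurate for very small p-values. A minimal sketch, reusing the statistic and degrees of freedom from above as literal values:

```python
from scipy import stats

chi_square_stat = 7.009543616823935   # calculated chi-square from above
dof = 2                               # degrees of freedom from above

# Survival function: P(X > chi_square_stat), i.e. 1 - cdf
p_value = stats.chi2.sf(chi_square_stat, dof)
print(p_value)
```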