# A Chi-Squared Test on US/Hawaii Age Distribution Data

I will run a chi-squared goodness-of-fit test on United States and Hawaii age distribution data. The test checks whether the age distribution of Hawaii's population matches the distribution expected from the United States as a whole.

Data: https://www.census.gov/data/tables/2023/demo/age-and-sex/2023-age-sex-composition.html \
https://data.census.gov/profile/Hawaii?g=040XX00US15

In [65]:
import pandas as pd
import numpy as np
import math
import scipy.stats as stats

### Data Cleaning/Preparing

In [66]:
df = pd.read_excel('pop_dist.xlsx')

df.head()

Unnamed: 0,Age,US_Count,HI_Count
0,.Under 5 years,18473,84552
1,.5 to 9 years,19929,84983
2,.10 to 14 years,20698,87743
3,.15 to 19 years,21577,78904
4,.20 to 24 years,22281,91878


In [67]:
# '.' is a regex wildcard, so request a literal match explicitly
df['Age'] = df['Age'].str.replace('.', '', regex=False)

df.set_index('Age', inplace = True)
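
The `'.'` pattern deserves care: read as a regular expression it matches any character, so whether the call above strips only the leading dot depends on how the pattern is interpreted. Here is a minimal sketch of the difference, using Python's built-in `str.replace` and `re.sub` rather than pandas, with one label copied from the table above. pandas' `Series.str.replace` exposes the same choice through its `regex` parameter.

```python
import re

label = '.Under 5 years'  # one raw label from the table above

# Literal replacement: only the dot character is removed.
literal = label.replace('.', '')

# Regular-expression replacement: '.' matches every character,
# so the whole label is wiped out.
as_regex = re.sub('.', '', label)

# Escaping the dot restores the literal meaning inside a regex.
escaped = re.sub(r'\.', '', label)

print(repr(literal))
print(repr(as_regex))
print(repr(escaped))
```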

In [68]:
df_pop = df.copy()

In [69]:
df_pop = df_pop[['US_Count']]
df_pop

Age,US_Count
Under 5 years,18473
5 to 9 years,19929
10 to 14 years,20698
15 to 19 years,21577
20 to 24 years,22281
25 to 29 years,21788
30 to 34 years,23126
35 to 39 years,22178
40 to 44 years,21309
45 to 49 years,19537


In [70]:
# US counts are reported in thousands, so scale them to actual counts
df_pop['Count'] = df_pop['US_Count']*1000

df_pop.drop(columns = 'US_Count', inplace = True)

In [71]:
df_pop

Age,Count
Under 5 years,18473000
5 to 9 years,19929000
10 to 14 years,20698000
15 to 19 years,21577000
20 to 24 years,22281000
25 to 29 years,21788000
30 to 34 years,23126000
35 to 39 years,22178000
40 to 44 years,21309000
45 to 49 years,19537000


In [72]:
df_pop['Count'].sum()

330632000

In [73]:
# Preparing sample data (Hawaii)

df_samp = df.copy()

df_samp.head()

Age,US_Count,HI_Count
Under 5 years,18473,84552
5 to 9 years,19929,84983
10 to 14 years,20698,87743
15 to 19 years,21577,78904
20 to 24 years,22281,91878


In [74]:
df_samp = df[['HI_Count']]

df_samp.head()

Age,HI_Count
Under 5 years,84552
5 to 9 years,84983
10 to 14 years,87743
15 to 19 years,78904
20 to 24 years,91878


In [75]:
# Rename without inplace=True so that no SettingWithCopyWarning is raised on the slice
df_samp = df_samp.rename(columns = {'HI_Count': 'Count'})


In [76]:
df_samp.head()

Age,Count
Under 5 years,84552
5 to 9 years,84983
10 to 14 years,87743
15 to 19 years,78904
20 to 24 years,91878


In [77]:
df_samp['Count'].sum()

1450589

### Start Testing

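I will use a significance level of 0.05 (a 95% confidence level). The null hypothesis is that Hawaii's age distribution matches the US distribution; the alternative is that it does not.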
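
For reference, a goodness-of-fit test of this kind uses the standard Pearson chi-squared statistic. In the notation below (introduced here, not taken from the notebook), $O_i$ is the observed Hawaii count in age group $i$, $p_i$ is the US proportion for that group, $n$ is the Hawaii total, and $k$ is the number of age groups:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n \, p_i
```

If Hawaii follows the US proportions, this statistic is approximately chi-squared distributed with $k - 1$ degrees of freedom.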

In [78]:
# Defining what is observed

observed = df_samp

In [79]:
# Getting the ratios of the population counts

pop_ratios = df_pop/df_pop['Count'].sum()

In [80]:
pop_ratios

Age,Count
Under 5 years,0.055872
5 to 9 years,0.060275
10 to 14 years,0.062601
15 to 19 years,0.06526
20 to 24 years,0.067389
25 to 29 years,0.065898
30 to 34 years,0.069945
35 to 39 years,0.067078
40 to 44 years,0.064449
45 to 49 years,0.05909


In [81]:
pop_ratios.sum()

Count    1.0
dtype: float64

In [82]:
# Expected Hawaii counts: US proportions scaled to Hawaii's total population

expected = pop_ratios * (df_samp['Count'].sum())

In [83]:
expected

Age,Count
Under 5 years,81046.996652
5 to 9 years,87434.937275
10 to 14 years,90808.787782
15 to 19 years,94665.243694
20 to 24 years,97753.918281
25 to 29 years,95590.968606
30 to 34 years,101461.205249
35 to 39 years,97302.02413
40 to 44 years,93489.441436
45 to 49 years,85715.107107
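
As a sanity check on one row, the expected count is the US share of an age group multiplied by Hawaii's total. A minimal sketch in plain Python, using figures from the outputs above (the variable names are illustrative):

```python
us_under5 = 18_473_000   # US 'Under 5 years' count
us_total = 330_632_000   # sum of US counts
hi_total = 1_450_589     # sum of Hawaii counts

# Share of the US population in this age group.
us_share = us_under5 / us_total

# Count Hawaii would have in this group if it followed the US share.
expected_under5 = us_share * hi_total

print(round(us_share, 6))
print(round(expected_under5, 2))
```

The result should line up with the first row of the `expected` table.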


In [84]:
# Calculating the chi-square statistic and p-value

results = stats.chisquare(f_obs = observed, # observed counts
                f_exp = expected) # expected counts

results

Power_divergenceResult(statistic=array([12476.21156936]), pvalue=array([0.]))
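
To make the computation behind `stats.chisquare` concrete, here is a minimal plain-Python sketch of the Pearson statistic. The function name `chi_square_statistic` and the three-category counts are made up for illustration; they are not the notebook's data.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError('observed and expected must have the same length')
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Toy counts (hypothetical): both lists sum to 60.
toy_observed = [18, 22, 20]
toy_expected = [20.0, 20.0, 20.0]

toy_stat = chi_square_statistic(toy_observed, toy_expected)
print(toy_stat)
```

Applied to the full observed and expected columns, the same sum should agree with the statistic reported by `stats.chisquare`.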

In [85]:
print('Chi Square Statistic')
print(results.statistic)

Chi Square Statistic
[12476.21156936]


In [86]:
print('P-value')
print(results.pvalue)

P-value
[0.]


In [87]:
# Getting critical value

crit = stats.chi2.ppf(q = 0.95, # critical value for 95% confidence level
                      df = len(df.index) - 1)

print('Critical Value')
print(crit)

Critical Value
27.58711163827534
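
Comparing the statistic with the critical value and comparing the p-value with the significance level are two views of the same decision. A short sketch, assuming the statistic reported earlier and 17 degrees of freedom (inferred from the critical value above, which corresponds to 18 age groups); `alpha`, `dof`, and the `reject_*` names are introduced here for illustration:

```python
from scipy import stats

alpha = 0.05                 # significance level
dof = 17                     # assumed: 18 age groups minus 1
statistic = 12476.21156936   # value reported by stats.chisquare above

crit = stats.chi2.ppf(1 - alpha, df=dof)     # upper-tail critical value
p_value = stats.chi2.sf(statistic, df=dof)   # upper-tail probability

reject_by_critical_value = statistic > crit
reject_by_p_value = p_value < alpha

print(crit, p_value)
print(reject_by_critical_value, reject_by_p_value)
```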


In [88]:
print('Chi Square Statistic:', results.statistic)
print('Critical Value:', crit)
print('P-Value:', results.pvalue)

Chi Square Statistic: [12476.21156936]
Critical Value: 27.58711163827534
P-Value: [0.]


With a p-value of practically 0 and a chi-square statistic of 12,476.21 that far exceeds the critical value of 27.59, we reject the null hypothesis at the 0.05 significance level.

We conclude that the age distribution in Hawaii differs from that of the United States as a whole.
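
The test says the distributions differ but not where. One way to look further, sketched here as an optional follow-up rather than part of the analysis above, is the Pearson residual $(O - E)/\sqrt{E}$ for each age group. The example uses only the five rows shown in the outputs above; the dictionary layout and names are illustrative.

```python
import math

# (observed Hawaii count, expected count) copied from the outputs above
rows = {
    'Under 5 years':  (84552, 81046.996652),
    '5 to 9 years':   (84983, 87434.937275),
    '10 to 14 years': (87743, 90808.787782),
    '15 to 19 years': (78904, 94665.243694),
    '20 to 24 years': (91878, 97753.918281),
}

# Pearson residual: positive means more people than expected, negative means fewer.
residuals = {age: (obs - exp) / math.sqrt(exp) for age, (obs, exp) in rows.items()}

for age, r in residuals.items():
    print(f'{age:<15} {r:8.2f}')
```

Each squared residual is that group's contribution to the chi-squared statistic, so the groups with the largest absolute residuals account for most of the difference.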