The following demonstrates how to conduct a chi-squared goodness-of-fit test to determine whether a data set follows a given probability distribution, in this example a binomial distribution.

A researcher hypothesized sex determination in human births could be considered a Bernoulli process but suspected that larger families tend to have more male children than females. She collected data from 320 families, each with exactly 5 children, and generated the results shown below:

number of male children per family: 0    1    2    3    4    5
occurrence frequency:               12   42   92   108  46   20

Confirm that the total number of male children for the 320 families is 834, with an average number of 2.61 male children per family.
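The totals quoted above can be checked with a few lines of plain Python. This is only a sketch; the variable names are chosen for illustration and are not part of the original notebook.

```python
# Number of male children per family (0..5) and how many families had that count.
boys_per_family = [0, 1, 2, 3, 4, 5]
families = [12, 42, 92, 108, 46, 20]

total_families = sum(families)
# Weight each count of boys by the number of families that reported it.
total_boys = sum(k * f for k, f in zip(boys_per_family, families))
mean_boys = total_boys / total_families

print(total_families, total_boys, round(mean_boys, 2))  # 320 834 2.61
```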

Part A - If a success is defined as a male child, the probability of success is usually taken to be 0.5. Determine the expected frequency of occurrence of 0, 1, 2, 3, 4 and 5 male children for 320 families, each with exactly five children, assuming p=0.5. Perform a Chi-Squared Goodness of Fit test to determine whether the number of male children in families with five children can be considered binomially distributed with p=0.5. What is the critical value? What is the value of the test statistic based on the sample data? Determine the p-value for the computed test statistic.

In [1]:
import numpy as np
import scipy.stats as sp

In [43]:
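# Calculate the expected frequencies of 0-5 male children under a binomial model with p = 0.5
x = range(0, 6)  # possible numbers of male children per family
n = 320          # number of families
C = 5            # children per family (binomial trials)
p = 0.5          # hypothesized probability of a male child
q = 1 - p        # probability of a female child (not used below)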

Ei = sp.binom.pmf(x, C, p)*n #pmf = probability mass function of the binomial distribution
Oi = np.array([12.0,42,92,108,46,20])

print('The expected values Ei for 0-5 male children in 320 families with p=0.5 are:', Ei)
print('The observed values Oi for 0-5 male children in 320 families are:', Oi)

The expected values Ei for 0-5 male children in 320 families with p=0.5 are: [  10.   50.  100.  100.   50.   10.]
The observed values Oi for 0-5 male children in 320 families are: [  12.   42.   92.  108.   46.   20.]
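For readers who want to see where the expected frequencies come from, the binomial probability mass function can be written out directly: E_k = n * C(5, k) * p^k * (1 - p)^(5 - k). The sketch below uses only the standard library; the names `families` and `children` are illustrative stand-ins for `n` and `C` above.

```python
from math import comb

families = 320   # number of families sampled
children = 5     # children per family (binomial trials)
p = 0.5          # hypothesized probability of a male child

# Expected number of families with k boys: families * P(X = k)
expected = [
    families * comb(children, k) * p**k * (1 - p) ** (children - k)
    for k in range(children + 1)
]
print(expected)  # [10.0, 50.0, 100.0, 100.0, 50.0, 10.0]
```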


In [49]:
print('The chi-squared test statistic and p-value are', sp.chisquare(Oi, f_exp=Ei))  # sp is already scipy.stats

The chi-squared test statistic and p-value are Power_divergenceResult(statistic=13.279999999999994, pvalue=0.020891491981912837)
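The statistic reported above is the sum of (Oi - Ei)^2 / Ei over the six bins. As a rough illustration, it can be reproduced by hand, which also shows how much each bin contributes; the variable names here are illustrative only.

```python
observed = [12, 42, 92, 108, 46, 20]
expected = [10, 50, 100, 100, 50, 10]

# Per-bin contribution (O - E)^2 / E to the chi-squared statistic.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2_stat = sum(terms)

for k, t in enumerate(terms):
    print(f"{k} boys: contribution {t:.2f}")
print(f"chi-squared statistic: {chi2_stat:.2f}")
```

The bin for five male children contributes 10 of the total 13.28, so most of the misfit under p=0.5 comes from the excess of all-boy families.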


In [62]:
crit = sp.chi2.isf(.05, 5)  # critical value for α=.05; df = 6 bins - 1 - 0 estimated parameters = 5
print("The critical value is:", crit)

The critical value is: 11.0704976935


13.28 > 11.07

Because the value of the test statistic is greater than the critical value, we reject the null hypothesis that the data can be considered binomially distributed with p=0.5. The p-value for the computed test statistic is 0.02, which is below α=0.05 and therefore statistically significant, consistent with the rejection.
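The critical-value comparison and the p-value comparison are two views of the same decision. A minimal sketch, assuming the rounded statistic 13.28 from above and α=0.05 (the variable names are illustrative):

```python
from scipy import stats

alpha = 0.05
df = 6 - 1      # six bins, no parameters estimated from the data
stat = 13.28    # test statistic from Part A

crit = stats.chi2.isf(alpha, df)    # upper-tail critical value
p_value = stats.chi2.sf(stat, df)   # upper-tail probability of the statistic

reject_by_critical_value = stat > crit
reject_by_p_value = p_value < alpha
print(reject_by_critical_value, reject_by_p_value)
```

Both flags agree, because the statistic exceeds the critical value exactly when its p-value falls below α.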

Part B - Determine the observed probability of success, p', based upon the researcher's data. As in Part A, perform a Chi-Squared Goodness of Fit test to determine whether the number of male children in families with five children can be considered binomially distributed with p = p'. How many sample statistics are you using to estimate population parameters in this part of the problem? What is the critical value? What is the value of the test statistic based on the sample data? Determine the p-value for the computed test statistic.

In [57]:
pprime = 834/1600  # 834 male children out of 320 families * 5 children = 1600 children
print('The observed probability of success is:',pprime)

The observed probability of success is: 0.52125


In [59]:
Ei_prime = sp.binom.pmf(x, C, pprime)*n  # expected frequencies calculated with p'=.52125

print('The expected values Ei for 0-5 male children in 320 families with p=pprime are:', Ei_prime)
print('The observed values are the same:', Oi)

The expected values Ei for 0-5 male children in 320 families with p=pprime are: [   8.04811018   43.81281911   95.40441551  103.87373699   56.54745212
   12.31346608]
The observed values are the same: [  12.   42.   92.  108.   46.   20.]


In [60]:
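# Note: by default chisquare computes the p-value with k - 1 = 5 degrees of freedom; pass ddof=1 to use df = 4
print('The chi-squared test statistic and p-value are', sp.chisquare(Oi, f_exp=Ei_prime))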

The chi-squared test statistic and p-value are Power_divergenceResult(statistic=9.0664906197147275, pvalue=0.10644128017345877)


In [63]:
crit_prime = sp.chi2.isf(.05, 4)  # critical value for α=.05; df = 6 bins - 1 - 1 estimated parameter = 4
print('The critical value is:', crit_prime)

The critical value is: 9.48772903678


9.07 < 9.49

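The test statistic falls below the critical value, so at the α=0.05 level we fail to reject the null hypothesis that the data are binomially distributed with p = .52125. This does not prove that the data are binomial; it only means the sample does not provide sufficient evidence against that model. Note that the nu (degrees of freedom) value here is 4, not 5, because the sample probability of success of .52125 was used to estimate the population probability of success. The p-value of 0.106 printed above was computed by chisquare with its default of 5 degrees of freedom; with 4 degrees of freedom the p-value is about 0.059, which is still above 0.05, so the conclusion is unchanged, though the margin is narrow.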
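The effect of the degrees of freedom on the p-value can be seen with the `ddof` argument of `scipy.stats.chisquare`, which reduces the degrees of freedom used for the p-value to k - 1 - ddof. The following is a sketch that recomputes the Part B quantities from scratch; the variable names are illustrative and differ from the cells above.

```python
import numpy as np
from scipy import stats

observed = np.array([12, 42, 92, 108, 46, 20], dtype=float)
n_families, n_children = 320, 5
p_prime = 834 / 1600  # sample estimate of the probability of a male child

expected = stats.binom.pmf(np.arange(n_children + 1), n_children, p_prime) * n_families

# Default: p-value uses k - 1 = 5 degrees of freedom.
stat, p_default = stats.chisquare(observed, f_exp=expected)
# ddof=1 accounts for the one estimated parameter: df = 6 - 1 - 1 = 4.
_, p_adjusted = stats.chisquare(observed, f_exp=expected, ddof=1)

print(f"statistic = {stat:.4f}")
print(f"p-value with df=5: {p_default:.4f}")
print(f"p-value with df=4: {p_adjusted:.4f}")
```

The statistic itself does not depend on `ddof`; only the reference distribution, and hence the p-value, changes.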

In summary, with p=.50, the data fail the Chi-Squared Goodness of Fit test for a binomial distribution. However, when the sample p of .52125 is used, the null hypothesis of a binomial distribution is not rejected at the 5% level. The conclusions of the researcher seem reasonable.