# $\chi^2$ Goodness of Fit

While I won't vouch for the grammar involved, this statistical procedure has a name which describes exactly what it does:

<br><center><b> <span style="color: purple" >Tests Whether Observed Data Matches a Given Probability Model</span></b></center>

## Example 1: Doberman Breeding
Dobermans can be bred in 4 colors: black, red, blue and **fawn**, all of which have rust-colored highlights. <b> <span style="color: purple" >Fawn is the rarest and most prized color.</span></b> If a black male Doberman (hetero dominant allele) and a fawn female Doberman (homo recessive allele) were bred, on average half their pups would be black, a quarter blue and the remaining quarter fawn. Over the course of several years, a certain Doberman breeder had a pair of dogs from which 28 pups were born: 11 black, 11 blue and 6 fawn. Test the hypothesis that these dogs have the predicted genetics at the $\alpha=0.1$ level of significance.

### Hypotheses

Our hypothesis just tests theoretical probabilities, so the list of probabilities will be our null hypothesis:

$$\begin{align}H_0 : p_\text{black}&=\frac{1}{2} \\ p_\text{blue}&=\frac{1}{4} \\ p_\text{fawn}&=\frac{1}{4}\end{align}$$

The alternative hypothesis is that at least one of the probabilities is not correct. The Doberman breeder is hoping to fail to reject the null, since that outcome is consistent with the parent dogs having the hypothesized genetics which, if true, will be valuable.

### Data and Expected Cell Counts

There are 28 total dogs in the observed data, so the expected cell counts can be calculated by multiplying the 3 probabilities in the null hypothesis by 28. The results are organized below:

<table style="width:30%">
<tr>
  <th></th>
  <th style="text-align: center;">Black</th>
  <th style="text-align: center;">Blue</th> 
  <th style="text-align: center;">Fawn</th>
</tr>
<tr>
  <td>Observed</td>
  <td style="text-align: center;">11</td>
  <td style="text-align: center;">11</td>
  <td style="text-align: center;">6</td>
</tr>
<tr>
  <td>Expected</td>
  <td style="text-align: center;">14</td>
  <td style="text-align: center;">7</td>
  <td style="text-align: center;">7</td>
</tr>
</table>

Let's go ahead and create vectors that correspond to the rows in the table above, understanding that R can calculate the expected cell counts from the probabilities involved:

In [39]:
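observed = c(11, 11, 6)          # observed counts: black, blue, fawn
expected = c(0.5, 0.25, 0.25)    # null-hypothesis probabilities; chisq.test turns these into expected counts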

**Checking the Assumptions.** As with other proportion tests, as long as we ensure the sample size is adequate, our categorical data will be appropriate for the test. For all $\chi^2$ procedures, we require that no more than $20\%$ of the Expected cell counts be less than 5. Since the smallest Expected cell count is 7, which is greater than 5, none of the Expected cell counts are low (0%), which is clearly less than 20%.
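
The same check can be done in R. The following is a minimal sketch; the names `obs_counts`, `null_probs`, `exp_counts` and `low_share` are illustrative and are not used elsewhere in this notebook.

```r
# Sketch: verify the expected-count condition for the Doberman data.
obs_counts <- c(black = 11, blue = 11, fawn = 6)          # observed pups
null_probs <- c(black = 0.5, blue = 0.25, fawn = 0.25)    # probabilities under H0

# Expected counts are the total sample size times each null probability.
exp_counts <- sum(obs_counts) * null_probs
exp_counts

# Share of cells whose expected count is below 5 (should be at most 0.20).
low_share <- mean(exp_counts < 5)
low_share
```

A `low_share` of 0 means no cell has a low expected count, which matches the reasoning above.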

### Conducting the Test

We will create an output variable **res** (abbreviation of "results"). Having a variable name for the results makes inspecting the values in the expected cells quite easy to accomplish:

In [40]:
res <- chisq.test(observed, p = expected)
res
res$expected


	Chi-squared test for given probabilities

data:  observed
X-squared = 3.0714, df = 2, p-value = 0.2153
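
To see where these numbers come from, the statistic can be recomputed by hand as $\chi^2=\sum \frac{(O-E)^2}{E}$ with (number of categories $-$ 1) degrees of freedom. This is a sketch with illustrative variable names (`obs_counts`, `null_probs`, `chi_sq`, `deg_free`, `p_val`); it should reproduce the `chisq.test` output above.

```r
# Sketch: recompute the goodness-of-fit statistic and p-value by hand.
obs_counts <- c(11, 11, 6)
null_probs <- c(0.5, 0.25, 0.25)
exp_counts <- sum(obs_counts) * null_probs     # expected counts under H0

# Sum of (observed - expected)^2 / expected over the three cells.
chi_sq <- sum((obs_counts - exp_counts)^2 / exp_counts)
deg_free <- length(obs_counts) - 1

# Upper-tail probability of the chi-squared distribution.
p_val <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)
round(c(statistic = chi_sq, df = deg_free, p.value = p_val), 4)

# Decision at the stated significance level.
alpha <- 0.1
p_val < alpha
```

Since the p-value of 0.2153 exceeds $\alpha=0.1$, we fail to reject the null hypothesis: the litter counts are consistent with the predicted genetics.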


## Example 2: Mouse Genetics

Suppose researchers cross a pure breeding white mouse with a pure breeding brown mouse. All F1 (first filial generation) progeny are brown. The researchers then construct an F2 (second filial generation) cross by breeding pairs from the F1 group. If the researchers’ genetics model is correct, the brown-to-white ratio in the F2 group should be 3:1.

In total, researchers raise 200 of the F2 offspring and observe 164 brown and the rest white. Test the hypothesis that the genetics model is correct at the $\alpha=0.1$ level of significance.

### Hypotheses

The required list of probabilities in our null hypothesis:

$$\begin{align}H_0 : p_\text{brown}&=\frac{3}{4} \\ p_\text{white}&=\frac{1}{4}\end{align}$$
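
One way the test for this example might be run, following the same pattern as Example 1. This is only a sketch: the variable names `mice_observed`, `mice_probs` and `mice_res` are illustrative, and the count of white mice is simply 200 minus the 164 brown mice.

```r
# Sketch: goodness-of-fit test for the 3:1 brown-to-white model.
mice_observed <- c(brown = 164, white = 200 - 164)   # 164 brown, 36 white
mice_probs <- c(brown = 3/4, white = 1/4)            # probabilities under H0

mice_res <- chisq.test(mice_observed, p = mice_probs)
mice_res

# Expected counts (200 times each probability) for the assumption check.
mice_res$expected

# Decision at the alpha = 0.1 level.
mice_res$p.value < 0.1
```

The expected counts are 150 brown and 50 white, both well above 5. The statistic works out to $\frac{14^2}{150}+\frac{14^2}{50}\approx 5.23$ on 1 degree of freedom, which exceeds the $\alpha=0.1$ critical value of about 2.71, so the 3:1 model would be rejected at this level.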

## Example 3: Births on Week Days vs. Weekends

Are fewer human babies born on weekend days (proportionally) than week days?

Note that, naturally, 2 out of 7 babies would be born on weekend days while 5 out of 7 would be born on weekdays. Modern medicine has produced a large increase in scheduled births (planned inductions or planned C-sections) in recent years. If parents and doctors work together to schedule a birth, it sure ain’t gonna be at midnight on a Saturday! Consider the following observed data for births in one county in Georgia for 2023.

<table style="width:35%">
<tr>
  <th>Su</th>
  <th>M</th>
  <th>Tu</th> 
  <th>W</th>
  <th>Th</th> 
  <th>F</th>
  <th>Sa</th>
</tr>
<tr>
  <td>11</td>
  <td>29</td>
  <td>16</td>
  <td>14</td>
  <td>17</td>
  <td>23</td>
  <td>9</td>
</tr>
</table>

Do we have evidence at the $\alpha=0.05$ level that fewer babies (proportionally) are born on weekends?

### Hypotheses

The probabilities for each day will be $\frac{1}{7}$:

$$\begin{align}H_0 &: p_\text{Su}=p_\text{M}=p_\text{Tu}=p_\text{W}=p_\text{Th}=p_\text{F}=p_\text{Sa}=\frac{1}{7} \\ H_a &: \text{The observed data do not match expectations based on the probabilities.}\end{align}$$

In [41]:
observed = ...
expected = ...

ERROR: Error in eval(expr, envir, enclos): '...' used in an incorrect context


In [42]:
chisq.test(...)

ERROR: Error in eval(expr, envir, enclos): '...' used in an incorrect context
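
The placeholder cells above are left for the reader to fill in. One possible completion, offered as a sketch, takes the counts straight from the table; the names `births_observed`, `births_probs` and `births_res` are illustrative.

```r
# Sketch: goodness-of-fit test for equal birth probabilities across the week.
births_observed <- c(Su = 11, M = 29, Tu = 16, W = 14, Th = 17, F = 23, Sa = 9)
births_probs <- rep(1/7, 7)      # each day equally likely under H0

births_res <- chisq.test(births_observed, p = births_probs)
births_res

# Expected counts: 119 total births spread evenly over 7 days.
births_res$expected

# Pearson residuals: negative means fewer births than expected on that day.
round(births_res$residuals, 2)
```

With 119 births, every expected count is 17, so the assumption check passes. The statistic is $\frac{290}{17}\approx 17.06$ on 6 degrees of freedom, above the $\alpha=0.05$ critical value of about 12.59. The test itself is not directional: it only says the counts do not conform to the model. The residuals for `Su` and `Sa` are both negative, which is the direction the question asks about.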


## Example 4: Births on Week Days vs. Weekends (1978 Data)

Are fewer human babies born on weekend days (proportionally) than week days?

Note that, naturally, 1 out of 7 babies would be born on each day of the week regardless of whether it is a weekend day or a weekday. Modern medicine has produced a recent increase in scheduled births due to planned inductions or planned C-sections. If parents and doctors work together to schedule a birth, it sure ain’t gonna be at midnight on a Saturday! Consider the following observed data for births in a famous data set from 1978.

In [27]:
births = read.csv('https://faculty.ung.edu/rsinn/data/births78.csv')
head(births, 3)

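<table>
<tr><th></th><th>rownames</th><th>date</th><th>births</th><th>wday</th><th>year</th><th>month</th><th>day_of_year</th><th>day_of_month</th><th>day_of_week</th></tr>
<tr><th></th><th>&lt;int&gt;</th><th>&lt;chr&gt;</th><th>&lt;int&gt;</th><th>&lt;chr&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th><th>&lt;int&gt;</th></tr>
<tr><td>1</td><td>1</td><td>1978-01-01</td><td>7701</td><td>Sun</td><td>1978</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
<tr><td>2</td><td>2</td><td>1978-01-02</td><td>7527</td><td>Mon</td><td>1978</td><td>1</td><td>2</td><td>2</td><td>2</td></tr>
<tr><td>3</td><td>3</td><td>1978-01-03</td><td>8825</td><td>Tue</td><td>1978</td><td>1</td><td>3</td><td>3</td><td>3</td></tr>
</table>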


Is there evidence at the $\alpha=0.05$ level that fewer babies (proportionally) are born on weekends?

Our hypotheses:

$$\begin{align}H_0 &: p_{Su} = p_{M} = p_{Tu} = p_{W} = p_{Th} = p_{F} = p_{Sa} = \frac{1}{7}\\H_a &: \text{Observed Data do not Conform to this Probability Model}\end{align}$$

We can subset our births table into days of the week as follows (example with Sundays):

In [38]:
sunday <- subset(births, wday == 'Sun')
sum(sunday$births)
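
Repeating that subset for each day would work, but the seven totals can also be computed in one step. The following is a sketch, assuming a data frame with the `wday` and `births` columns shown in the `head(births, 3)` output; the helper name `weekday_gof` is hypothetical.

```r
# Hypothetical helper: total the births for each weekday label, then run the
# goodness-of-fit test against equal probabilities for every day.
weekday_gof <- function(df) {
  # One total per distinct value of wday (ordered alphabetically by label).
  totals <- tapply(df$births, df$wday, sum)
  # Equal probability for each day that appears in the data.
  chisq.test(totals, p = rep(1 / length(totals), length(totals)))
}
```

With the data loaded as above, `res4 <- weekday_gof(births)` would give the test result, `res4$observed` the weekday totals (in alphabetical order of the day labels), and `res4$residuals` the direction in which each day departs from the model.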