# Problem Set 2, Part One: Due Thursday, January 30 by 8am Eastern Standard Time

## Name: David Millard

**Show your work on all problems!** Be sure to give credit to any
collaborators, or outside sources used in solving the problems. Note
that if using an outside source to do a calculation, you should use it
as a reference for the method, and actually carry out the calculation
yourself; it’s not sufficient to quote the results of a calculation
contained in an outside source.

Fill in your solutions in the notebook below, inserting markdown and/or code cells as needed.  Try to do reasonably well with the typesetting, but don't feel compelled to replicate my formatting exactly.  **You do NOT need to make random variables blue!**

In [1]:
%matplotlib inline

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0,5.0)
plt.rcParams['font.size'] = 14

### Conover Problems on Hypothesis Testing

#### Exercise 2.3.6:

A coin is tossed 4 times, and the critical region is “one head or less.”
Let $p=P(\text{Head})$ for each toss. The hypotheses are $H_0$: $p=0.5$
and $H_1$: $p=0.1$. Find the power of the test.

$H_0: p = 0.5$ (null hypothesis)

$H_1: p = 0.1$ (alternative hypothesis)

The power is $P(X \leq 1 \mid H_1)$, where $X$ is the number of heads and $X \mid H_1 \sim \text{Binomial}(n=4,\ p=0.1)$.

In [3]:
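n = 4        # number of tosses
p_H1 = 0.1   # P(Head) under the alternative

# Power = P(reject H0 | H1) = P(X <= 1 | p = 0.1)
power = stats.binom.cdf(1, n, p_H1)

print(power)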

0.9477


What additional
assumption are you making that was not stated in the problem?

I assume the four tosses are independent (with the same $p$ on each toss), so that the number of heads follows a binomial distribution.
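
As a cross-check on the power computed above, here is a minimal sketch that sums the binomial pmf directly with the standard library instead of `stats.binom.cdf`. The helper name `binom_cdf` is hypothetical, and the sketch relies on the same independence assumption. It also reports the size of the test under $H_0$, which the problem does not ask for.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed directly from the pmf."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

n = 4
size = binom_cdf(1, n, 0.5)    # P(reject H0 | H0 true)
power = binom_cdf(1, n, 0.1)   # P(reject H0 | H1 true)
print(f"size = {size:.4f}, power = {power:.4f}")
```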

#### Problem 2.3.1:

There are 12 plastic chips in a jar, and the chips are numbered
consecutively from 1 to 12. An experiment consists of drawing 2 chips
with replacement. The outcome of the experiment consists of the 2
numbers on the chips, in the order they are drawn. Let the test
statistic ${\color{royalblue}{X}}$ be the sum of the numbers on the drawn
chips and let the critical region correspond to values of
${\color{royalblue}{X}}$ that are less than 5. Suppose that if $H_0$ is true
the drawing of the chips is random. Also suppose that if $H_1$ is true
chips 1, 2, and 3 are each twice as likely to be drawn as each of the
other chips.

**(a)**  List the points in the critical region.

$H_0$: each draw is uniform on $\{1, 2, \ldots, 12\}$, so $X$ is the sum of two independent uniform draws.

$H_1$: each draw comes from a weighted distribution in which chips 1, 2, and 3 are each twice as likely as each of the other chips.

The critical region $X < 5$ (that is, $X = 2$, $3$, or $4$) consists of the six ordered outcomes $(1,1)$, $(1,2)$, $(2,1)$, $(2,2)$, $(1,3)$, $(3,1)$.
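
The points in the critical region can also be found by brute force. This is only a sketch: it enumerates all $12 \times 12$ ordered outcomes and keeps those whose sum is less than 5. The variable names are illustrative.

```python
from itertools import product

chips = range(1, 13)

# Ordered outcomes (first draw, second draw) with sum less than 5
critical_region = [(a, b) for a, b in product(chips, repeat=2) if a + b < 5]
print(critical_region)
print(len(critical_region))
```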

**(b)**  Find $\alpha$.

In [4]:
6 * (1/12)**2

0.041666666666666664

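Each of the six outcomes $(a, b)$ in the critical region has probability $P(a) \cdot P(b) = \frac{1}{12} \cdot \frac{1}{12}$ under $H_0$, so $\alpha = P(X < 5 \mid H_0) = 6 \cdot \frac{1}{12} \cdot \frac{1}{12} = \frac{1}{24} \approx 0.0417$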

**(c)** What is the power?

In [5]:
6 * 2/15 * 2/15

0.10666666666666667

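Let $p$ be the probability of drawing any one of chips 4 through 12 under $H_1$; chips 1, 2, and 3 then each have probability $2p$. Since the probabilities sum to one:

$1 = 3(2p) + 9p$

$p = \frac{1}{15}$

Every point in the critical region involves only chips 1, 2, and 3, each of which is drawn with probability $\frac{2}{15}$:

$P(X < 5 | H_1) = P((1,1) | H_1) + P((1,2) | H_1) + P((2,1) | H_1) + P((2,2) | H_1) + P((1,3) | H_1) + P((3,1) | H_1)$

$P(X < 5 | H_1) = 6 \cdot \frac{2}{15} \cdot \frac{2}{15} = \frac{8}{75} \approx 0.107$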
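
Both $\alpha$ and the power can be checked with exact arithmetic. The sketch below builds the two single-draw distributions as dictionaries of `Fraction` values and sums the joint probability over the critical region, assuming independent draws. The helper `reject_prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

chips = range(1, 13)

# Single-draw probabilities under each hypothesis
p_H0 = {c: Fraction(1, 12) for c in chips}
p_H1 = {c: Fraction(2, 15) if c <= 3 else Fraction(1, 15) for c in chips}

def reject_prob(pmf):
    """P(X < 5) for two independent draws from the given single-draw pmf."""
    return sum(pmf[a] * pmf[b] for a, b in product(chips, repeat=2) if a + b < 5)

alpha = reject_prob(p_H0)
power = reject_prob(p_H1)
print(alpha, float(alpha))
print(power, float(power))
```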

**(d)** Are $H_0$ and $H_1$ simple or composite?

Both are simple: each hypothesis completely specifies the distribution of the draws.

**(e)** Is the test one tailed or two tailed?

One tailed: the critical region lies entirely in the lower tail of $X$.

#### Review Problem 2.6.12:

Under one theory of genetics each offspring of two particular dogs
should have a 25% chance of being spotted in color. Let this be the null
hypothesis. Under another theory each puppy should have a 75% chance of
being spotted. Let this be the alternative hypothesis. A litter of
puppies is born. There are five spotted puppies out of the eight puppies
in the litter.

**(a)** Using the target level of significance of 0.05, find the critical
    region for a conservative test.

$H_0: p = 0.25$

$H_1: p = 0.75$

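$\alpha = 0.05$, $n = 8$

$X \sim \text{Binomial}(n=8, p)$, where $X$ is the number of spotted puppies.

Since $H_1$ favors large values of $X$, the critical region has the form $X \geq c$. A conservative test uses the smallest $c$ such that

$P(X \geq c \mid H_0) \leq \alpha$

This gives $c = 5$, since $P(X \geq 5 \mid H_0) \approx 0.027 \leq 0.05$ while $P(X \geq 4 \mid H_0) \approx 0.114 > 0.05$. The critical region is $\{5, 6, 7, 8\}$.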

In [6]:
n = 8
p_H0 = 0.25
alpha = 0.05

for k in range(0, n+1):
    # sf(k) = P(X > k) = P(X >= k+1), so the critical region starts at k+1
    tail_prob = stats.binom.sf(k, n, p_H0)
    if tail_prob <= alpha:
        print(f'Region {k+1}, p-val {tail_prob}')
        break

Region 5, p-val 0.0272979736328125
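
To see why the cutoff lands where it does, it can help to tabulate the upper-tail probability for every candidate cutoff. The following is a sketch using only the standard library; `upper_tail` and `c_star` are illustrative names, not part of the assignment.

```python
from math import comb

def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(c, n + 1))

n, p_H0, alpha = 8, 0.25, 0.05

# Tail probability under H0 for each candidate critical region {c, ..., n}
tail = {c: upper_tail(c, n, p_H0) for c in range(n + 1)}
for c, prob in tail.items():
    print(f"P(X >= {c} | H0) = {prob:.4f}")

# Smallest cutoff whose size does not exceed alpha (conservative test)
c_star = min(c for c, prob in tail.items() if prob <= alpha)
print("critical region starts at", c_star)
```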


**(b)** What is the exact level of significance for your test? (Use the
    exact formula, and then use the normal approximation, to get two
    slightly different results.)

The exact level of significance is $P(X \geq 5 \mid H_0) \approx 0.0273$. The normal approximation (with continuity correction) gives approximately 0.0206.

In [7]:
n = 8
p = 0.25

mu = n * p
sigma = np.sqrt(n * p * (1-p))

# Continuity correction: P(X >= 5) is approximated by P(Z > 4.5)
stats.norm.sf(4.5, mu, sigma)

0.020613416668581838
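
As a worked check of the normal approximation above: under $H_0$, $\mu = np = 2$ and $\sigma = \sqrt{np(1-p)} = \sqrt{1.5} \approx 1.225$, and the continuity-corrected cutoff is 4.5, so

$$P(X \geq 5 \mid H_0) \approx P\left(Z > \frac{4.5 - 2}{\sqrt{1.5}}\right) \approx P(Z > 2.04) \approx 0.0206$$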

**(c)** What is the exact power of your test? (Use the tables \[or
    `stats.binom.cdf()`\] to get your answer.)

$P(X \geq 5| H_1) = 0.886$

In [8]:
n = 8
p_H1 = 0.75

stats.binom.sf(4, n, p_H1)

0.8861846923828125

**(d)** What is the $p$-value in this case?

The observed value is $X = 5$, so the $p$-value is $P(X \geq 5 \mid H_0) \approx 0.027$.

**(e)** Is the test unbiased? Explain.

Yes. The probability of rejecting the null when the alternative is true is greater than the probability of rejecting the null when the null is true, i.e. the power is greater than alpha.
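
The comparison between power and size can be made exact. This sketch assumes the critical region $X \geq 5$ and uses `Fraction` arithmetic so that no rounding is involved. The helper `upper_tail` is a hypothetical name.

```python
from fractions import Fraction
from math import comb

def upper_tail(c, n, p):
    """Exact P(X >= c) for X ~ Binomial(n, p), with p given as a Fraction."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(c, n + 1))

n, c = 8, 5
size = upper_tail(c, n, Fraction(1, 4))    # P(reject H0 | H0 true)
power = upper_tail(c, n, Fraction(3, 4))   # P(reject H0 | H1 true)

# For a simple null against a simple alternative, unbiased means power >= size
print(float(size), float(power), power >= size)
```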

**(f)** Is the alternative hypothesis simple or composite?

Simple: $H_1$ specifies the single value $p = 0.75$.

### Conover Problems on Binomial Proportion

In each of the following exercises clearly state $H_0$, $H_1$,
${\color{royalblue}{T}}$, the decision rule, $\alpha$, the decision, the
$p$-value and the name of the test used, where such information is
appropriate.

#### Exercise 3.1.6:

A civic group reported to the town council that at least 60% of the town
residents were in favor of a particular bond issue. The town council
then asked a random sample of 100 residents if they were in favor of the
bond issue. Forty-eight said yes. Is the report of the civic group
reasonable?

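$H_0: p \geq 0.6$

$H_1: p < 0.6$

I use the normal approximation to the (lower-tailed) binomial test with $\alpha = 0.05$, evaluating the test statistic at the boundary value $p = 0.6$:

$T = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}= -2.45$

Decision rule: reject $H_0$ if $T < z_\alpha = -1.645$. Since $T = -2.45 < -1.645$, reject $H_0$. The $p$-value is approximately $0.007$, so the civic group's report does not appear reasonable.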

In [9]:
p_hat = 0.48
p = 0.60
n = 100
a = 0.05

t = (p_hat - p) / np.sqrt((p * (1 - p)) / n)
print(t)

z = stats.norm.ppf(a)
print(z)

p_val = stats.norm.cdf(t)
print(p_val)

-2.4494897427831783
-1.6448536269514729
0.007152939217714809
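
The normal approximation above can be compared against the exact binomial tail. This sketch assumes the count of "yes" answers is used as the test statistic and evaluates $P(X \leq 48)$ at the boundary value $p = 0.6$. It uses only the standard library, and the variable names are illustrative.

```python
from math import comb

n, x, p0 = 100, 48, 0.6

# Exact lower-tail p-value: P(X <= 48) for X ~ Binomial(100, 0.6)
p_exact = sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(x + 1))
print(p_exact)

# Same decision rule as before: reject H0 when the p-value is below alpha
alpha = 0.05
print("reject H0" if p_exact < alpha else "do not reject H0")
```

The exact tail probability need not match the normal-approximation value, since the calculation above applies no continuity correction, but the decision at $\alpha = 0.05$ is the same.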


#### Exercise 3.1.12:

Seventy chemical detection kits of one type are placed in a gas chamber
together, for a fixed period of time, and a measured amount of lethal
gas is introduced into the chamber. Fifty-six kits register positive for
the lethal gas, while the other 14 fail to register positive. Find a 90%
confidence interval for the probability of registering positive under
these conditions.

$\hat{p} = \frac{56}{70} = 0.8$

$SE = \sqrt{\frac{\hat{p} (1-\hat{p})}{n}} = 0.0478$

$z_{\frac{\alpha}{2}} = 1.645$ (with $\alpha = 0.10$)

$CI = \hat{p} \pm z_{\frac{\alpha}{2}} \cdot SE = [0.721, 0.879]$

In [10]:
n = 70
p_hat = 56 / 70

# 90% two-sided interval: alpha = 0.10, so use the 1 - alpha/2 = 0.95 quantile
z = stats.norm.ppf(1 - 0.05)
print(z)

se = np.sqrt(p_hat * (1 - p_hat) / n)
print(se)

lower = p_hat - z * se
upper = p_hat + z * se

print(f'CI [{lower}, {upper}]')

1.6448536269514722
0.047809144373375745
CI [0.7213609554760064, 0.8786390445239937]
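
For comparison only, here is a sketch of the Wilson score interval for the same data. This is a different technique from the normal-approximation (Wald) interval computed above and is not necessarily the method the textbook intends. It uses `statistics.NormalDist` from the standard library for the normal quantile.

```python
from math import sqrt
from statistics import NormalDist

n, x, conf = 70, 56, 0.90
p_hat = x / n

# Two-sided normal quantile for the requested confidence level
z = NormalDist().inv_cdf(1 - (1 - conf) / 2)

# Wilson score interval: recentered and rescaled relative to the Wald interval
denom = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denom
half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
wilson = (center - half, center + half)
print(wilson)
```

The Wilson interval is pulled slightly toward 0.5 relative to the Wald interval, which matters more when $\hat{p}$ is near 0 or 1 or when $n$ is small.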


#### Problem 3.1.1:

*The continuity correction.* It is obvious that if ${\color{royalblue}{Y}}$
has a binomial distribution, then
$$P({\color{royalblue}{Y}}{\mathbin{\le}}4) = P({\color{royalblue}{Y}}{\mathbin{\le}}4.1) = \cdots = P({\color{royalblue}{Y}}{\mathbin{\le}}4.999)$$
because ${\color{royalblue}{Y}}$ takes on only integer values, such as 4 or
5, but no values between integers. Therefore, which number should be
used in the normal approximation to the binomial distribution: 4, or
4.1, or what? The *continuity correction* (because we are trying to use
a continuous distribution such as the normal to approximate a discrete
distribution such as the binomial) says to use the number midway between
two adjacent values in the discrete distribution. That is, for the binomial
distribution, estimate $P({\color{royalblue}{Y}}{\mathbin{\le}}4)$ with
$$P({\color{royalblue}{Y}}{\mathbin{\le}}4) \cong
P\left({\color{royalblue}{Z}}{\mathbin{\le}}\frac{4+0.5-np}{\sqrt{nqp}}\right)$$
where ${\color{royalblue}{Z}}$ has a normal distribution, because $4.5$ is
halfway between $4$ and $5$.

Usually the continuity correction works well when using the normal
distribution to approximate binomial probabilities.

**(a)** For $n=20$, $p=0.1$, find the exact value of
    $P({\color{royalblue}{Y}}{\mathbin{\le}}1)$ from Table A3 \[or
    `stats.binom.cdf()`\]. Use the normal approximation to estimate
    $P({\color{royalblue}{Y}}{\mathbin{\le}}1)$, first without the continuity
    correction and then with the continuity correction. Which estimate
    is closer?

In [11]:
n = 20
p = 0.1

g = stats.binom.cdf(1, n, p)
print(g)

0.3917469981251679


In [12]:
mu = n * p
sigma = np.sqrt(n * p * (1 - p))

# Without the continuity correction
no_c = stats.norm.cdf(1, mu, sigma)
print(no_c)

# With the continuity correction (1 + 0.5)
with_c = stats.norm.cdf(1.5, mu, sigma)
print(with_c)

0.228028270125128
0.35469405750711314


The continuity-corrected estimate (0.355) is much closer to the exact value (0.392) than the uncorrected estimate (0.228).

**(b)** Repeat part a, but change from $p=0.1$ to $p=0.3$. Now which
    estimate is closer?

In [13]:
p = 0.3

g = stats.binom.cdf(1, n, p)
print(g)

mu = n * p
sigma = np.sqrt(n * p * (1 - p))

# Without the continuity correction
no_c = stats.norm.cdf(1, mu, sigma)
print(no_c)

# With the continuity correction (1 + 0.5)
with_c = stats.norm.cdf(1.5, mu, sigma)
print(with_c)

0.007637259774199995
0.007348710885011736
0.014054020073575896


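Now the estimate without the continuity correction is much closer: it gives 0.0073 against an exact value of 0.0076, while the continuity-corrected estimate (0.0141) is nearly twice the exact value.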
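
The comparison in parts (a) and (b) can be wrapped in a small helper. This is a sketch using only the standard library; the function name `compare` and its return format are hypothetical. It returns the exact binomial probability, both normal estimates, and a label saying which estimate has the smaller absolute error.

```python
from math import comb, sqrt
from statistics import NormalDist

def compare(k, n, p):
    """Exact P(Y <= k) versus normal estimates with and without correction."""
    exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    plain = approx.cdf(k)            # no continuity correction
    corrected = approx.cdf(k + 0.5)  # continuity correction
    if abs(corrected - exact) < abs(plain - exact):
        closer = "corrected"
    else:
        closer = "uncorrected"
    return exact, plain, corrected, closer

for p in (0.1, 0.3):
    print(p, compare(1, 20, p))
```

For $p = 0.1$ the corrected estimate is closer, while for $p = 0.3$ the uncorrected estimate is closer. In the second case $Y \le 1$ is far out in the lower tail, about 2.4 standard deviations below the mean.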