
#  Test of two proportions

##  Assignment 

You may wonder if the amount paid for freight affects whether or not the shipment was late. Recall that in the `late_shipments` dataset, whether or not the shipment was late is stored in the `late` column. Freight costs are stored in the `freight_cost_group` column, and the categories are `"expensive"` and `"reasonable"`.

The hypotheses to test, where \(late\) denotes the proportion of late shipments in each group, are

\(H_{0}\): \(late_{\text{expensive}} -  late_{\text{reasonable}} = 0\) 

\(H_{A}\): \(late_{\text{expensive}} -  late_{\text{reasonable}} > 0\)

`p_hats` contains the estimates of population proportions (sample proportions) for each `freight_cost_group`:

```
freight_cost_group  late
expensive           Yes     0.082569
reasonable          Yes     0.035165
Name: late, dtype: float64

```

`ns` contains the sample sizes for these groups:

```
freight_cost_group
expensive     545
reasonable    455
Name: late, dtype: int64

```

`pandas` and `numpy` have been imported under their usual aliases, and `norm` is available from `scipy.stats`.

##  Pre-exercise code 

```
import pandas as pd
import numpy as np
from scipy.stats import norm
late_shipments = pd.read_feather(
  path = "/usr/local/share/datasets/late_shipments.feather"
)
late_shipments['freight_cost_group'] = np.where(late_shipments['freight_cost_usd'] <= 5000, "reasonable", "expensive")

p_hats = late_shipments.groupby("freight_cost_group")['late'].value_counts(normalize=True)
p_hats = p_hats[p_hats<0.5]
ns = late_shipments.groupby("freight_cost_group")['late'].count()
```


In [15]:
import pandas as pd
import numpy as np
from scipy.stats import norm
late_shipments = pd.read_feather(
  path = "late_shipments.feather"
)
late_shipments['freight_cost_group'] = np.where(late_shipments['freight_cost_usd'] <= 5000, "reasonable", "expensive")

p_hats = late_shipments.groupby("freight_cost_group")['late'].value_counts(normalize=True)
p_hats = p_hats[p_hats<0.5]
ns = late_shipments.groupby("freight_cost_group")['late'].count()

In [16]:
p_hats

freight_cost_group  late
expensive           Yes     0.082569
reasonable          Yes     0.035165
Name: proportion, dtype: float64

In [17]:
ns

freight_cost_group
expensive     545
reasonable    455
Name: late, dtype: int64

##  Instructions 

- Calculate the pooled sample proportion, \(\hat{p}\), from `p_hats` and `ns`.

$$
\hat{p} = \frac{n_{\text{expensive}} \times \hat{p}_{\text{expensive}} + n_{\text{reasonable}} \times \hat{p}_{\text{reasonable}}}{n_{\text{expensive}} + n_{\text{reasonable}}}
$$
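
As a quick sanity check, the values of `p_hats` and `ns` shown above can be plugged into this formula. Each product \(n \times \hat{p}\) recovers (up to rounding) the number of late shipments in that group, so the pooled proportion is simply all late shipments divided by all shipments:

$$
\hat{p} = \frac{545 \times 0.082569 + 455 \times 0.035165}{545 + 455} \approx \frac{45 + 16}{1000} = 0.061
$$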



In [23]:
p_hat = (ns['expensive'] * p_hats['expensive'] + ns['reasonable'] * p_hats['reasonable']) / (ns['expensive'] + ns['reasonable'])
p_hat

late
Yes    0.061
Name: proportion, dtype: float64

- Calculate the standard error of the difference in sample proportions **using this equation.**

$$
\text{SE}(\hat{p}_{\text{expensive}} - \hat{p}_{\text{reasonable}}) = \sqrt{\dfrac{\hat{p} \times (1 - \hat{p})}{n_{\text{expensive}}} + \dfrac{\hat{p} \times (1 - \hat{p})}{n_{\text{reasonable}}}}
$$

- Calculate `p_hat` multiplied by `(1 - p_hat)`.
- Divide `p_hat_times_not_p_hat` by the number of `"reasonable"` rows and by the number of `"expensive"` rows, and sum those two values.
- Calculate `std_error` by taking the square root of `p_hat_times_not_p_hat_over_ns`.
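
The three steps above can be sketched with the named intermediates. This is a minimal illustration only: plain floats copied from the outputs shown earlier stand in for the `p_hat` and `ns` pandas objects, and `n_expensive` and `n_reasonable` are names assumed here for readability.

```python
import math

# Values copied from the outputs shown earlier in this exercise
# (stand-ins for the pandas objects p_hat and ns).
p_hat = 0.061
n_expensive = 545    # assumed name for ns['expensive']
n_reasonable = 455   # assumed name for ns['reasonable']

# Step 1: p_hat multiplied by (1 - p_hat).
p_hat_times_not_p_hat = p_hat * (1 - p_hat)

# Step 2: divide by each group's sample size and sum the two terms.
p_hat_times_not_p_hat_over_ns = (
    p_hat_times_not_p_hat / n_expensive
    + p_hat_times_not_p_hat / n_reasonable
)

# Step 3: the standard error is the square root of that sum.
std_error = math.sqrt(p_hat_times_not_p_hat_over_ns)
print(std_error)
```

Keeping the intermediates separate makes each term of the equation easy to inspect; collapsing them into a single expression gives the same result.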



In [25]:
std_error = np.sqrt(p_hat * (1 - p_hat) / ns['expensive'] + p_hat * (1 - p_hat) / ns['reasonable'])
std_error

late
Yes    0.015198
Name: proportion, dtype: float64


- Calculate the z-score **using the following equation.**

$$
z = \frac{(\hat{p}_{\text{expensive}} - \hat{p}_{\text{reasonable}})}{\text{SE}(\hat{p}_{\text{expensive}} - \hat{p}_{\text{reasonable}})}
$$



In [26]:
z_score = (p_hats['expensive'] - p_hats['reasonable']) / std_error
z_score

late
Yes    3.11904
Name: proportion, dtype: float64


- Calculate the p-value from the z-score. Because the alternative hypothesis is "greater than", this is a right-tailed test.


In [28]:
p_value = 1 - norm.cdf(z_score)
p_value = p_value[0]
p_value

0.0009072060637050905
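
The whole calculation can be cross-checked from raw counts using only the Python standard library. This is a sketch under one assumption: the late-shipment counts (45 and 16) are not printed in the exercise and are reconstructed here as sample size times sample proportion. `NormalDist` from the `statistics` module plays the role of `scipy.stats.norm`.

```python
import math
from statistics import NormalDist

# Counts of late shipments per group. These are not printed in the exercise;
# they are reconstructed as n * p_hat (545 * 0.082569 and 455 * 0.035165).
late_expensive, n_expensive = 45, 545
late_reasonable, n_reasonable = 16, 455

p_hat_expensive = late_expensive / n_expensive
p_hat_reasonable = late_reasonable / n_reasonable

# Pooled proportion: all late shipments over all shipments.
p_hat = (late_expensive + late_reasonable) / (n_expensive + n_reasonable)

# Standard error of the difference, using the pooled proportion.
std_error = math.sqrt(p_hat * (1 - p_hat) * (1 / n_expensive + 1 / n_reasonable))

z_score = (p_hat_expensive - p_hat_reasonable) / std_error

# Right-tailed p-value, matching the "greater than" alternative hypothesis.
p_value = 1 - NormalDist().cdf(z_score)

print(f"z = {z_score:.4f}, p-value = {p_value:.6f}")
```

The printed z-score and p-value should land close to the values computed above; small differences in the last digits can arise from floating-point rounding.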