
#  proportions_ztest() for two samples

##  Assignment 

Calculating the p-value by hand took a lot of effort. Seeing how the calculations work is useful, but it isn't practical for real-world analyses. For everyday use, it's better to rely on the `statsmodels` package.

Recall the hypotheses.

\(H_{0}\): \(late_{\text{expensive}} -  late_{\text{reasonable}} = 0\) 

\(H_{A}\): \(late_{\text{expensive}} - late_{\text{reasonable}} > 0\)
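
For reference, the test statistic behind these hypotheses can be written out as below. This is a sketch that assumes the pooled standard error, which is what `proportions_ztest` uses by default. The symbols \(\hat{p}\) (sample proportion of late shipments), \(x\) (number of late shipments) and \(n\) (group size) are notation introduced here, not taken from the exercise text.

\[
\hat{p} = \frac{x_{\text{expensive}} + x_{\text{reasonable}}}{n_{\text{expensive}} + n_{\text{reasonable}}},
\qquad
z = \frac{\hat{p}_{\text{expensive}} - \hat{p}_{\text{reasonable}}}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_{\text{expensive}}} + \dfrac{1}{n_{\text{reasonable}}}\right)}}
\]

Because \(H_{A}\) is right-tailed, the p-value is \(P(Z > z)\) for a standard normal \(Z\).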

`late_shipments` is available, containing the `freight_cost_group` column. `numpy` and `pandas` have been loaded under their standard aliases, and `proportions_ztest` has been loaded from `statsmodels.stats.proportion`.

##  Pre exercise code 

```
import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
late_shipments = pd.read_feather(
  path = "/usr/local/share/datasets/late_shipments.feather"
)
late_shipments['freight_cost_group'] = np.where(late_shipments['freight_cost_usd'] <= 5000, "reasonable", "expensive")
```


In [29]:
import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
late_shipments = pd.read_feather(
  path = "late_shipments.feather"
)
late_shipments['freight_cost_group'] = np.where(late_shipments['freight_cost_usd'] <= 5000, "reasonable", "expensive")

In [30]:
late_shipments.head(1)

Unnamed: 0,id,country,managed_by,fulfill_via,vendor_inco_term,shipment_mode,late_delivery,late,product_group,sub_classification,...,line_item_value,pack_price,unit_price,manufacturing_site,first_line_designation,weight_kilograms,freight_cost_usd,freight_cost_groups,line_item_insurance_usd,freight_cost_group
0,36203.0,Nigeria,PMO - US,Direct Drop,EXW,Air,1.0,Yes,HRDT,HIV test,...,266644.0,89.0,0.89,"Alere Medical Co., Ltd.",Yes,1426.0,33279.83,expensive,373.83,expensive


##  Instructions 

- Get the counts of the `late` column grouped by `freight_cost_group`.



In [31]:
counts = late_shipments.groupby('freight_cost_group')['late'].count()
counts

freight_cost_group
expensive     545
reasonable    455
Name: late, dtype: int64

- Extract the number of `"Yes"` values for the two `freight_cost_group` categories into a `numpy` array, specifying the `'expensive'` count and then `'reasonable'`.
- Determine the overall number of rows in each `freight_cost_group` as a `numpy` array, specifying the `'expensive'` count and then `'reasonable'`.
- Run a z-test using `proportions_ztest()`, specifying `alternative` as `"larger"`.


In [35]:
success_counts = np.array([len(late_shipments.query('(freight_cost_group == "expensive") & (late == "Yes")')),
                          len(late_shipments.query('(freight_cost_group == "reasonable") & (late == "Yes")'))])
success_counts

array([45, 16])

In [37]:
n = np.array([counts['expensive'], counts['reasonable']])
n

array([545, 455], dtype=int64)
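
As an alternative to the `query()` and `count()` calls above, both arrays could be read off a single cross-tabulation. The sketch below runs on a small made-up DataFrame (`toy` is hypothetical, not the real `late_shipments` data) so that it is self-contained. It assumes the same column names and the same `"Yes"`/`"No"` coding as the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for late_shipments (not the real dataset)
toy = pd.DataFrame({
    "freight_cost_group": ["expensive"] * 3 + ["reasonable"] * 3,
    "late": ["Yes", "Yes", "No", "Yes", "No", "No"],
})

# One row per freight_cost_group, one column per value of late
table = pd.crosstab(toy["freight_cost_group"], toy["late"])

# Fix the group order explicitly: expensive first, then reasonable
groups = ["expensive", "reasonable"]
success_counts = table.loc[groups, "Yes"].to_numpy()
n = table.loc[groups].sum(axis=1).to_numpy()

print(success_counts)  # [2 1]
print(n)               # [3 3]
```

Selecting rows with an explicit `groups` list keeps the order of the two arrays aligned, which matters because `proportions_ztest()` pairs `count` and `nobs` element by element.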

In [38]:
z_score, p_value = proportions_ztest(count=success_counts, nobs=n, alternative='larger')
p_value

0.0009072060637051224
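
As a cross-check, the same result can be reproduced by hand from the counts printed above. This is a minimal sketch, assuming the pooled-proportion standard error that `proportions_ztest` uses by default. The variable names are illustrative, and only the standard library is used.

```python
import math

# Counts taken from the outputs above, ordered as expensive, then reasonable
late_counts = [45, 16]
group_sizes = [545, 455]

# Sample proportion of late shipments in each group
p_hats = [c / n for c, n in zip(late_counts, group_sizes)]

# Pooled proportion and the standard error of the difference under H0
p_pooled = sum(late_counts) / sum(group_sizes)
std_error = math.sqrt(p_pooled * (1 - p_pooled) * sum(1 / n for n in group_sizes))

# z-score for expensive minus reasonable
z_manual = (p_hats[0] - p_hats[1]) / std_error

# Right-tailed p-value: P(Z > z) = 0.5 * erfc(z / sqrt(2))
p_manual = 0.5 * math.erfc(z_manual / math.sqrt(2))

print(z_manual, p_manual)
```

The printed p-value should agree with `p_value` above (about 0.0009), apart from floating-point rounding. At a conventional significance level such as 0.05 (an assumption here, since the exercise text does not state one), this would lead to rejecting \(H_{0}\) in favor of a higher late proportion for expensive freight.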